cortex eval

An evaluation framework for benchmarking agent performance. It runs suites of tasks across eight task categories and scores them with pattern matching, file verification, and tool sequence validation. It also supports regression detection against baseline runs.

Usage

cortex eval [options]

Options

Option            Description
--suite, -s       Evaluation suite to run
--baseline, -b    Baseline run for regression comparison
--model, -m       Model to use for evaluation
--save-baseline   Save current results as baseline for future comparison
--help            Show help for this command

Task Categories

Category            Description
code_generation     Generate code from natural language prompts
bug_fix             Fix bugs in provided code
refactoring         Refactor code for improved structure
code_review         Review code and provide feedback
shell_command       Generate and execute correct shell commands
file_operation      Perform file system operations
search_retrieval    Search and retrieve information
tool_use_sequence   Execute multi-step tool sequences

Scoring Methods

Method                      Description
Regex patterns              Match response against expected patterns
Contains/not_contains       Check for presence or absence of strings
Fuzzy matching              Approximate string matching for tolerance
File content verification   Verify file contents after execution
Exit code checking          Validate command exit codes
Tool sequence validation    Verify correct tool call ordering
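
To make the scoring methods more concrete, here is a minimal Python sketch of how four of them might work. It is an illustration only, not the cortex implementation: the function names, the 0.8 similarity threshold, and the choice to treat tool ordering as a subsequence check are all assumptions.

```python
import re
from difflib import SequenceMatcher


def matches_regex(response: str, pattern: str) -> bool:
    """Regex patterns: pass if the pattern occurs anywhere in the response."""
    return re.search(pattern, response) is not None


def contains_check(response: str, contains=(), not_contains=()) -> bool:
    """Contains/not_contains: required strings present, forbidden ones absent."""
    has_required = all(s in response for s in contains)
    has_forbidden = any(s in response for s in not_contains)
    return has_required and not has_forbidden


def fuzzy_match(response: str, expected: str, threshold: float = 0.8) -> bool:
    """Fuzzy matching: similarity ratio at or above a threshold (assumed value)."""
    return SequenceMatcher(None, response, expected).ratio() >= threshold


def tool_sequence_ok(actual_calls, expected_calls) -> bool:
    """Tool sequence validation: expected calls must appear in order.

    Assumption: extra calls in between are allowed (subsequence match).
    """
    remaining = iter(actual_calls)
    # `tool in remaining` advances the iterator, so order is enforced.
    return all(tool in remaining for tool in expected_calls)
```

A stricter validator could require an exact match of the call list instead of a subsequence. Which behavior the real tool uses is not stated here.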

Regression Detection

Compare current evaluation runs against previous baselines:

  • RegressionCheck.previousScore — Score from baseline run
  • RegressionCheck.degraded — Flag if performance dropped
  • Per-category pass/fail and average score breakdowns
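
The comparison described above can be sketched in Python as follows. Only previousScore and degraded come from this page. The currentScore field, the tolerance parameter, the function names, and the shape of the result tuples are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RegressionCheck:
    previousScore: float  # score from the baseline run (named on this page)
    currentScore: float   # hypothetical field: score from the current run
    degraded: bool        # True if performance dropped (named on this page)


def check_regression(previous: float, current: float,
                     tolerance: float = 0.0) -> RegressionCheck:
    """Flag a regression when the current score falls below the baseline.

    `tolerance` is an assumed knob for ignoring small fluctuations.
    """
    return RegressionCheck(previous, current, current < previous - tolerance)


def category_breakdown(results):
    """Summarize (category, passed, score) tuples per category."""
    grouped = defaultdict(list)
    for category, passed, score in results:
        grouped[category].append((passed, score))
    return {
        category: {
            "passed": sum(1 for p, _ in rows if p),
            "failed": sum(1 for p, _ in rows if not p),
            "average_score": sum(s for _, s in rows) / len(rows),
        }
        for category, rows in grouped.items()
    }
```

For example, check_regression(0.9, 0.8).degraded evaluates to True, while an unchanged score is not flagged.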

Examples

# Run an evaluation suite
cortex eval -s code-generation

# Run with a specific model
cortex eval -s bug-fix -m claude-sonnet-4-5

# Compare against a baseline
cortex eval -s refactoring -b baseline-run-001

# Save current results as new baseline
cortex eval -s code-review --save-baseline