cortex eval
An evaluation framework for benchmarking agent performance. It runs suites of tasks across eight task categories, scores results with pattern matching, file verification, and tool sequence validation, and supports regression detection against baseline runs.
Usage
cortex eval [options]
Options
| Option | Description |
|---|---|
| --suite, -s | Evaluation suite to run |
| --baseline, -b | Baseline run for regression comparison |
| --model, -m | Model to use for evaluation |
| --save-baseline | Save current results as baseline for future comparison |
| --help | Show help for this command |
Task Categories
| Category | Description |
|---|---|
| code_generation | Generate code from natural language prompts |
| bug_fix | Fix bugs in provided code |
| refactoring | Refactor code for improved structure |
| code_review | Review code and provide feedback |
| shell_command | Generate and execute correct shell commands |
| file_operation | Perform file system operations |
| search_retrieval | Search and retrieve information |
| tool_use_sequence | Execute multi-step tool sequences |
Scoring Methods
| Method | Description |
|---|---|
| Regex patterns | Match response against expected patterns |
| Contains/not_contains | Check for presence or absence of strings |
| Fuzzy matching | Approximate string matching for tolerance |
| File content verification | Verify file contents after execution |
| Exit code checking | Validate command exit codes |
| Tool sequence validation | Verify correct tool call ordering |
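To make the scoring methods above more concrete, here is a minimal Python sketch of how four of them could be implemented. This is an illustration only, not cortex's actual scoring code: the function names (`matches_patterns`, `check_contains`, `fuzzy_score`, `tools_in_order`) and their signatures are hypothetical, and the fuzzy matcher uses Python's `difflib` as a stand-in for whatever approximate matching the framework really uses.

```python
import re
from difflib import SequenceMatcher


def matches_patterns(response: str, patterns: list[str]) -> bool:
    """Regex patterns: every expected pattern must match somewhere in the response."""
    return all(re.search(p, response) is not None for p in patterns)


def check_contains(response: str, contains: list[str], not_contains: list[str]) -> bool:
    """Contains/not_contains: required strings present, forbidden strings absent."""
    has_required = all(s in response for s in contains)
    has_forbidden = any(s in response for s in not_contains)
    return has_required and not has_forbidden


def fuzzy_score(response: str, expected: str) -> float:
    """Fuzzy matching: similarity ratio between 0.0 (disjoint) and 1.0 (identical)."""
    return SequenceMatcher(None, response, expected).ratio()


def tools_in_order(actual_calls: list[str], expected_order: list[str]) -> bool:
    """Tool sequence validation: expected tools appear in order.

    Assumption: extra tool calls between the expected ones are allowed.
    """
    remaining = iter(actual_calls)
    # `tool in remaining` advances the iterator, so order is enforced.
    return all(tool in remaining for tool in expected_order)
```

A task could then combine these checks, for example by requiring `check_contains` to pass and `fuzzy_score` to exceed a chosen threshold. Whether cortex treats extra tool calls as acceptable is not stated in this page, so the lenient ordering check is a design assumption.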
Regression Detection
Compare current evaluation runs against previous baselines:
- RegressionCheck.previousScore: Score from baseline run
- RegressionCheck.degraded: Flag if performance dropped
- Per-category pass/fail and average score breakdowns
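The comparison described above can be sketched in Python as follows. This is a hedged illustration, not the framework's implementation: `previous_score` and `degraded` mirror the documented `RegressionCheck.previousScore` and `RegressionCheck.degraded` fields, while `category`, `current_score`, the `check_regressions` function, and the `tolerance` parameter are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class RegressionCheck:
    category: str          # assumed field: task category being compared
    previous_score: float  # mirrors RegressionCheck.previousScore
    current_score: float   # assumed field: score from the current run
    degraded: bool         # mirrors RegressionCheck.degraded


def check_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> list[RegressionCheck]:
    """Compare per-category average scores against a baseline run.

    A category counts as degraded when its current score falls below the
    baseline score by more than `tolerance` (a hypothetical parameter).
    Categories missing from the current run are skipped.
    """
    checks = []
    for category, previous in baseline.items():
        if category not in current:
            continue
        score = current[category]
        checks.append(
            RegressionCheck(
                category=category,
                previous_score=previous,
                current_score=score,
                degraded=score < previous - tolerance,
            )
        )
    return checks
```

For example, with a baseline of `{"bug_fix": 0.9}` and a current run of `{"bug_fix": 0.8}`, the `bug_fix` check is flagged as degraded at the default tolerance of zero. A nonzero tolerance would let small score fluctuations pass without being reported as regressions.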
Examples
# Run an evaluation suite
cortex eval -s code-generation
# Run with a specific model
cortex eval -s bug-fix -m claude-sonnet-4-5
# Compare against a baseline
cortex eval -s refactoring -b baseline-run-001
# Save current results as new baseline
cortex eval -s code-review --save-baseline