Model Quartermaster Guide
The Model Quartermaster (MQM) is an adaptive prediction engine that learns which model your agent should use for each task. It observes every LLM call across your sessions, computes six prediction signals, and fuses them into one of three actions: enforce, suggest, or defer.
Over time, the Quartermaster's predictions improve through reinforcement learning, making your agent more cost-efficient and performant.
What the Quartermaster Does
Every time your agent needs to decide which model to use, the Quartermaster runs six independent prediction signals against the current context. It fuses their outputs into a weighted score and makes one of three decisions:
| Confidence | Action | Behavior |
|---|---|---|
| ≥ 85% | Enforce | Override model selection (safe operations only) |
| 65–84% | Suggest | Recommend the model to the agent |
| < 65% | Defer | Let the agent decide entirely |
Safety note: Unsafe tools (shell execution, file writes, network access) are never auto-executed, regardless of confidence. The highest action for unsafe tools is "Suggest."
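The banding and the safety cap can be sketched in a few lines. This is an illustrative sketch only, not the Quartermaster's actual code: the function name `decide` and the `tool_is_safe` parameter are assumptions, and the gap between 84% and 85% in the table is treated here as part of the Suggest band.

```python
def decide(confidence: float, tool_is_safe: bool = True) -> str:
    """Map a fused confidence in [0, 1] to one of the three actions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.85:
        # Unsafe operations are capped at "suggest", whatever the confidence.
        return "enforce" if tool_is_safe else "suggest"
    if confidence >= 0.65:
        return "suggest"
    return "defer"


print(decide(0.93))                      # enforce
print(decide(0.93, tool_is_safe=False))  # suggest
print(decide(0.45))                      # defer
```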
Before Predictions Begin
The Quartermaster requires 50 observations before entering active prediction mode. Before that threshold, it operates in learning-only mode — collecting data, computing signal baselines, and building context fingerprints — but not making predictions.
You can check your observation count at any time:
cortex qm stats
Look for the observations field. If it's under 50, keep using Cortex normally and the Quartermaster will activate automatically once enough data is collected.
Viewing Predictions
The Dashboard
The richest view of Quartermaster activity is the dashboard:
# Session-level dashboard (specific session)
cortex qm dashboard -s sess_abc123
# Model-level dashboard (aggregated across all sessions)
cortex mqm dashboard
The dashboard displays:
- Accuracy bars per signal: How often each signal's prediction was correct
- Current signal weights: Visual representation of how much each signal contributes
- Top models by prediction accuracy: Which models the MQM predicts best
- Confidence distribution histogram: How often the MQM hits each confidence band
- Session and model-level trends: Accuracy over time
Prediction Decisions
See the raw history of what the QM predicted and whether it was right:
# Session-level decisions
cortex qm decisions -s sess_abc123
# Limit output
cortex qm decisions -s sess_abc123 --limit 20
# Model-level decisions
cortex mqm decisions --limit 50
Each decision entry shows:
- The model that was predicted
- The model actually used
- Whether the prediction was correct
- The confidence score
- Which signal contributed most to the prediction
The Six Prediction Signals
Each signal looks at model call patterns from a different angle. The Quartermaster weights them dynamically based on historical accuracy.
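This guide does not spell out the fusion formula, so the following is only one plausible reading of "fuses their outputs into a weighted score": each signal votes for a model with its own score, votes are scaled by the signal's weight, and the winning model's share of the total weight becomes the confidence. The function name `fuse` and the shape of `votes` are assumptions made for illustration.

```python
from collections import defaultdict


def fuse(votes: dict[str, tuple[str, float]],
         weights: dict[str, float]) -> tuple[str, float]:
    """Combine per-signal votes into a single (model, confidence) pair.

    votes:   signal name -> (predicted model, that signal's score in [0, 1])
    weights: signal name -> current signal weight
    """
    weight_sum = sum(weights[signal] for signal in votes)
    if weight_sum == 0:
        raise ValueError("no weighted votes to fuse")
    totals: dict[str, float] = defaultdict(float)
    for signal, (model, score) in votes.items():
        totals[model] += weights[signal] * score
    best = max(totals, key=totals.get)
    # Confidence is the winner's weighted score relative to the total weight.
    return best, totals[best] / weight_sum


votes = {
    "trajectory": ("claude-sonnet-4-5", 1.0),
    "cost": ("gpt-4o", 0.5),
}
print(fuse(votes, {"trajectory": 0.5, "cost": 0.5}))
```

Under this scheme a signal with a low weight can still tip a close decision, but it cannot outvote several stronger signals that agree with each other.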
1. Trajectory Signal
What it tracks: Recent model usage patterns and sequences. If you typically use claude-sonnet-4-5 for code review tasks followed by gpt-4o for documentation, the trajectory signal learns this pattern.
Weight: Dynamic (EMA), adjusted up or down based on correctness.
When it's strongest: Repetitive workflows with predictable model sequences — coding sessions that alternate between reasoning-heavy and creative tasks.
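One simple way to learn "model A is usually followed by model B" is a first-order transition table. The sketch below is an assumption about how such a signal could work, not a description of the Quartermaster's internals; `build_transitions` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_transitions(history: list[str]) -> dict[str, Counter]:
    """Count how often each model is immediately followed by each other model."""
    transitions: dict[str, Counter] = defaultdict(Counter)
    for previous, following in zip(history, history[1:]):
        transitions[previous][following] += 1
    return transitions


def predict_next(transitions: dict[str, Counter], current: str):
    """Return (most likely next model, observed frequency), or None if unseen."""
    followers = transitions.get(current)
    if not followers:
        return None
    model, count = followers.most_common(1)[0]
    return model, count / sum(followers.values())


history = [
    "claude-sonnet-4-5", "gpt-4o",
    "claude-sonnet-4-5", "gpt-4o",
    "claude-sonnet-4-5", "gemini-2.5-pro",
]
table = build_transitions(history)
print(predict_next(table, "claude-sonnet-4-5"))  # gpt-4o, seen in 2 of 3 cases
```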
2. Episodic Signal
What it tracks: Similarity between the current conversation context and past conversations. It generates 12-feature fingerprints from the conversation (model distribution, message length, token composition, task category indicators) and performs cosine similarity matching.
Weight: Dynamic (EMA).
When it's strongest: Tasks similar to ones you've done before — deploying the same project, debugging similar error types, following established patterns.
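Cosine similarity between two fingerprints is straightforward to compute. The sketch below uses 3-feature vectors for brevity where the Quartermaster uses 12; the helper `closest_episode` and the pairing of each stored fingerprint with the model that was used are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same number of features")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero fingerprint matches nothing
    return dot / (norm_a * norm_b)


def closest_episode(current: list[float],
                    past: list[tuple[list[float], str]]) -> tuple[str, float]:
    """Return (model, similarity) for the most similar past episode."""
    fingerprint, model = max(past, key=lambda ep: cosine_similarity(current, ep[0]))
    return model, cosine_similarity(current, fingerprint)


past = [
    ([0.0, 1.0, 0.0], "gpt-4o"),
    ([2.0, 0.0, 0.0], "claude-sonnet-4-5"),
]
print(closest_episode([1.0, 0.0, 0.0], past))
```

Because cosine similarity ignores vector length, an episode with the same feature proportions matches even when its raw counts are larger.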
3. Historical Signal
What it tracks: Past performance for specific task categories. Queries historical success rates of models for similar task types and uses that to influence the current prediction.
Weight: Dynamic (EMA).
When it's strongest: When you've accumulated significant history across well-defined task categories — code review, refactoring, documentation, testing.
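A per-category success-rate table is enough to illustrate the idea. The class below is a hypothetical sketch: the name `SuccessTracker`, its methods, and the notion of a boolean "success" per call are assumptions, since the guide does not define how success is recorded.

```python
from collections import defaultdict


class SuccessTracker:
    """Track how often each model succeeded for each task category."""

    def __init__(self) -> None:
        # (category, model) -> [successes, attempts]
        self._counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

    def record(self, category: str, model: str, success: bool) -> None:
        entry = self._counts[(category, model)]
        entry[0] += int(success)
        entry[1] += 1

    def best_model(self, category: str):
        """Return (model, success rate) for the category, or None without history."""
        rates = {
            model: successes / attempts
            for (cat, model), (successes, attempts) in self._counts.items()
            if cat == category
        }
        if not rates:
            return None
        model = max(rates, key=rates.get)
        return model, rates[model]


tracker = SuccessTracker()
for outcome in (True, True, False, True):
    tracker.record("code review", "claude-sonnet-4-5", outcome)
for outcome in (True, False):
    tracker.record("code review", "gpt-4o", outcome)
print(tracker.best_model("code review"))
```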
4. Cost Signal
What it tracks: Cost efficiency optimization. Prefers models based on their cost-per-token profiles, balancing capability against expense.
Weight: Dynamic (EMA).
When it's strongest: In cost-sensitive environments or when task complexity doesn't justify expensive frontier models.
5. Quality Signal
What it tracks: Expected output quality based on each model's known capabilities. Evaluates whether a model is better suited for reasoning, creative work, factual accuracy, or code generation.
Weight: Dynamic (EMA).
When it's strongest: When output quality is the primary concern — production code, security audits, customer-facing content.
6. Reflection Signal
What it tracks: Per-turn reflection feedback. After each agent action, the agent reflects on whether the right model was used. This signal incorporates that meta-cognitive feedback.
Weight: Dynamic (EMA).
When it's strongest: When the agent is actively reflecting on its decisions, for example during complex multi-step tasks where it evaluates its own model choices.
Viewing and Understanding Weights
# Current signal weights for a session
cortex qm weights
# Model-level weights (aggregated)
cortex mqm weights
Example output:
Signal Weights (session sess_abc123):
Trajectory: 0.22 ███████████░░░░░░░░░░░
Episodic: 0.18 █████████░░░░░░░░░░░░░
Historical: 0.20 ██████████░░░░░░░░░░░░
Cost: 0.15 ███████░░░░░░░░░░░░░░░
Quality: 0.17 ████████░░░░░░░░░░░░░░
Reflection: 0.08 ████░░░░░░░░░░░░░░░░░░
Higher weights mean the signal has been more accurate historically. In this example, the trajectory signal is the strongest predictor — this session involves repetitive, pattern-driven work. The reflection signal is weakest — the agent hasn't been generating much reflective feedback.
Monitoring Accuracy
# Session accuracy
cortex qm accuracy -s sess_abc123
# Model-level accuracy over the last 24 hours
cortex mqm accuracy -h 24
# Model-level accuracy over the last week
cortex mqm accuracy -h 168
Accuracy is the percentage of predictions that matched the model the agent actually used. A healthy Quartermaster typically reaches 65–85% accuracy once its weights have converged (roughly 200–500 observations).
Interpreting Accuracy Numbers
| Accuracy | Interpretation | Action |
|---|---|---|
| > 85% | Excellent — QM is well-trained | No action needed |
| 65–85% | Good — typical for a trained QM | Continue collecting observations |
| 40–65% | Moderate — still learning or noisy signals | Check for varied task types; let it collect more data |
| < 40% | Low — insufficient data or reset needed | Make sure you have > 50 observations; consider resetting |
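The accuracy figure and the bands in the table can be reproduced from a list of decisions. This is a minimal sketch; the `(predicted, actual)` tuple layout and the function names are assumptions, not the format the CLI emits.

```python
def accuracy(decisions: list[tuple[str, str]]) -> float:
    """Percentage of (predicted, actual) pairs where the prediction matched."""
    if not decisions:
        raise ValueError("no decisions to score")
    correct = sum(1 for predicted, actual in decisions if predicted == actual)
    return 100.0 * correct / len(decisions)


def interpret(pct: float) -> str:
    """Map an accuracy percentage to the bands in the table above."""
    if pct > 85:
        return "excellent"
    if pct >= 65:
        return "good"
    if pct >= 40:
        return "moderate"
    return "low"


decisions = [
    ("claude-sonnet-4-5", "claude-sonnet-4-5"),
    ("gpt-4o", "gpt-4o"),
    ("gpt-4o", "gemini-2.5-pro"),  # a miss
    ("claude-sonnet-4-5", "claude-sonnet-4-5"),
    ("gpt-4o", "gpt-4o"),
]
pct = accuracy(decisions)
print(pct, interpret(pct))  # 80.0 good
```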
When to Reset
Reset clears the Quartermaster's learned state and starts fresh. Use it when:
- You've changed workflows significantly: The patterns the QM learned for frontend development won't apply to infrastructure work.
- Accuracy is stuck below 40%: Sometimes the learning gets stuck in a local minimum.
- You're setting up a new project: A clean slate for a new codebase is often better than carrying over patterns from an unrelated project.
# Reset for a specific session
cortex qm reset -s sess_abc123
# Reset all session-level data (keeps model-level data)
cortex qm reset-all
# Reset model-level data (aggregate across all sessions)
cortex mqm reset
# Nuclear option: reset everything
cortex mqm reset-all
Warning: cortex mqm reset-all wipes all learned patterns and returns to learning-only mode. You'll need another 50 observations before predictions resume.
How Reinforcement Learning Improves Predictions
After every prediction, the Quartermaster evaluates whether it was correct:
- Correct prediction: The contributing signal weights increase (EMA α = 0.15).
- Incorrect prediction: The contributing signal weights decrease faster (EMA α = 0.25).
The asymmetry means the Quartermaster is pessimistic by design — it's quicker to lose confidence in a signal than to gain it. This prevents over-confidence on sparse data.
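The asymmetric update can be sketched as an exponential moving average that moves toward 1 after a correct prediction and toward 0 after an incorrect one. Only the two α values come from this guide; the 0/1 targets, the renormalization step, and the function names are assumptions made for illustration.

```python
ALPHA_CORRECT = 0.15    # slower gain after a correct prediction
ALPHA_INCORRECT = 0.25  # faster loss after an incorrect prediction


def update_weight(weight: float, correct: bool) -> float:
    """One EMA step: w <- w + alpha * (target - w)."""
    alpha = ALPHA_CORRECT if correct else ALPHA_INCORRECT
    target = 1.0 if correct else 0.0
    return weight + alpha * (target - weight)


def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Rescale weights so they sum to 1 (assumed; not stated in this guide)."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


w = 0.5
print(update_weight(w, correct=True))   # moves up by 0.15 * 0.5
print(update_weight(w, correct=False))  # moves down by 0.25 * 0.5
```

Starting from 0.5, a miss costs more than a hit gains, so a signal needs a run of correct predictions to recover from a single error.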
Convergence
Weights typically stabilize somewhere between 200 and 500 observations. During the first 50 observations (learning-only mode), the MQM builds baselines but doesn't predict. Observations 50–200 are the active exploration phase, where predictions start but weights fluctuate significantly. From roughly 200 observations onward, weights begin to converge and predictions become more stable.
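The phases can be summarized as a simple lookup on the observation count. This is an illustrative sketch: the function name and phase labels are assumptions, and the 200 boundary is approximate in practice.

```python
def phase(observations: int) -> str:
    """Classify the Quartermaster's training phase from its observation count."""
    if observations < 0:
        raise ValueError("observation count cannot be negative")
    if observations < 50:
        return "learning"     # collecting data, no predictions yet
    if observations < 200:
        return "exploration"  # predicting, weights still fluctuate
    return "stable"           # weights converging or converged


for count in (23, 67, 350):
    print(count, phase(count))
```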
The 85% Confidence Threshold
The 85% threshold for model override is deliberately high. The Quartermaster must be nearly certain before it enforces a model selection, and even then it only overrides safe operations.
CLI Quick Reference
| Command | What It Shows |
|---|---|
| cortex qm dashboard -s <id> | Session dashboard with accuracy, weights, top models |
| cortex mqm dashboard | Model-level dashboard across all sessions |
| cortex qm decisions -s <id> | History of predictions and outcomes |
| cortex qm weights | Current signal weights for a session |
| cortex mqm weights | Aggregated signal weights across all sessions |
| cortex qm accuracy -s <id> | Prediction accuracy for a session |
| cortex mqm accuracy -h <N> | Model-level accuracy over the last N hours |
| cortex qm patterns --limit 20 | Learned tool-call sequence patterns |
| cortex qm stats | Tool usage statistics and observation count |
| cortex qm trace <turn> | Step-by-step prediction chain for a turn |
| cortex qm reset -s <id> | Reset session-level state |
| cortex qm reset-all | Reset all session-level data |
| cortex mqm reset | Reset model-level state |
| cortex mqm reset-all | Reset everything (back to learning mode) |
Practical Example
Here's what the Quartermaster looks like in practice during a typical session:
# 1. Start working — QM is in learning mode (< 50 observations)
cortex qm stats
# Observations: 23 | Mode: learning
# 2. After a while, check again
cortex qm stats
# Observations: 67 | Mode: active | Accuracy: 72%
# 3. View the dashboard to understand prediction quality
cortex qm dashboard -s sess_abc123
# Shows: Trajectory signal is strongest (0.22 weight)
# Episodic signal has improved 12% this session
# Top predicted models: claude-sonnet-4-5 (89%), gpt-4o (82%), gemini-2.5-pro (76%)
# 4. Check what decisions are being made
cortex qm decisions -s sess_abc123 --limit 5
# Model: claude-sonnet-4-5 | Predicted: claude-sonnet-4-5 | Correct: yes | Confidence: 93%
# Model: gpt-4o | Predicted: gpt-4o | Correct: yes | Confidence: 88%
# Model: gemini-2.5-pro | Predicted: gpt-4o | Correct: no | Confidence: 45%
# Model: claude-sonnet-4-5 | Predicted: claude-sonnet-4-5 | Correct: yes | Confidence: 91%
# Model: gpt-4o | Predicted: gpt-4o | Correct: yes | Confidence: 95%
# 5. After switching to a very different project, reset
cortex qm reset -s sess_abc123
cortex qm stats
# Observations: 0 | Mode: learning
Next Steps
- Learn about the Quartermaster architecture for the system design and pipeline integration.
- Review the full cortex qm CLI reference for all subcommands and options.
- Explore the Model Router for how provider selection complements tool prediction.