Model Quartermaster Guide

The Model Quartermaster (MQM) is an adaptive prediction engine that learns which model your agent should use for each task. It observes every LLM call across your sessions, computes six prediction signals, and fuses them into one of three actions: enforce, suggest, or defer.

Over time, the Quartermaster's predictions improve through reinforcement learning, making your agent more cost-efficient and performant.

What the Quartermaster Does

Every time your agent needs to decide which model to use, the Quartermaster runs six independent prediction signals against the current context. It fuses their outputs into a weighted score and makes one of three decisions:

Confidence	Action	Behavior
≥ 85%	Enforce	Override model selection (safe operations only)
65–84%	Suggest	Recommend the model to the agent
< 65%	Defer	Let the agent decide entirely

Safety note: Unsafe tools (shell execution, file writes, network access) are never auto-executed, regardless of confidence. The highest action for unsafe tools is "Suggest."
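
The decision table and safety note can be read as a small function. This is a minimal sketch, not the Quartermaster's actual code: the names Action and decide, and the boolean safe flag, are assumptions made for illustration. Only the 85% and 65% thresholds and the "Suggest" cap for unsafe operations come from this guide.

```python
from enum import Enum


class Action(Enum):
    ENFORCE = "enforce"
    SUGGEST = "suggest"
    DEFER = "defer"


# Thresholds taken from the decision table in this guide.
ENFORCE_THRESHOLD = 0.85
SUGGEST_THRESHOLD = 0.65


def decide(confidence: float, safe: bool) -> Action:
    """Map a fused confidence score (0.0 to 1.0) to an action.

    `safe` is a hypothetical flag marking whether the operation is safe
    to automate; unsafe operations are capped at SUGGEST.
    """
    if confidence >= ENFORCE_THRESHOLD:
        return Action.ENFORCE if safe else Action.SUGGEST
    if confidence >= SUGGEST_THRESHOLD:
        return Action.SUGGEST
    return Action.DEFER
```

With this sketch, a 93% prediction on an unsafe operation yields a suggestion rather than an override, which is the behavior the safety note describes.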

Before Predictions Begin

The Quartermaster requires 50 observations before entering active prediction mode. Before that threshold, it operates in learning-only mode — collecting data, computing signal baselines, and building context fingerprints — but not making predictions.

You can check your observation count at any time:

cortex qm stats

Look for the observations field. If it's under 50, keep using Cortex normally and the Quartermaster will activate automatically once enough data is collected.

Viewing Predictions

The Dashboard

The richest view of Quartermaster activity is the dashboard:

# Session-level dashboard (specific session)
cortex qm dashboard -s sess_abc123

# Model-level dashboard (aggregated across all sessions)
cortex mqm dashboard

The dashboard displays:

Accuracy bars per signal: How often each signal's prediction was correct
Current signal weights: Visual representation of how much each signal contributes
Top models by prediction accuracy: Which models the MQM predicts best
Confidence distribution histogram: How often the MQM hits each confidence band
Session and model-level trends: Accuracy over time

Prediction Decisions

See the raw history of what the QM predicted and whether it was right:

# Session-level decisions
cortex qm decisions -s sess_abc123

# Limit output
cortex qm decisions -s sess_abc123 --limit 20

# Model-level decisions
cortex mqm decisions --limit 50

Each decision entry shows:

The model that was predicted
The model that was actually used
Whether the prediction was correct
The confidence score
Which signal contributed most to the prediction

The Six Prediction Signals

Each signal looks at model call patterns from a different angle. The Quartermaster weights them dynamically based on historical accuracy.

1. Trajectory Signal

What it tracks: Recent model usage patterns and sequences. If you typically use claude-sonnet-4-5 for code review tasks followed by gpt-4o for documentation, the trajectory signal learns this pattern.

Weight: Dynamic (EMA), adjusted up or down based on correctness.

When it's strongest: Repetitive workflows with predictable model sequences — coding sessions that alternate between reasoning-heavy and creative tasks.
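
One simple way to picture sequence learning of this kind is a table of "which model tends to follow which." The sketch below is an assumption about how such a signal could work, not a description of the Quartermaster's internals; build_transitions and predict_next are hypothetical names.

```python
from collections import Counter, defaultdict


def build_transitions(history):
    """Count how often each model is followed by each other model."""
    transitions = defaultdict(Counter)
    for previous, following in zip(history, history[1:]):
        transitions[previous][following] += 1
    return transitions


def predict_next(transitions, current):
    """Return (most likely next model, observed share), or None if unseen."""
    followers = transitions.get(current)
    if not followers:
        return None
    model, count = followers.most_common(1)[0]
    return model, count / sum(followers.values())


# Illustrative history only; model names are the ones used in this guide.
history = [
    "claude-sonnet-4-5", "gpt-4o",
    "claude-sonnet-4-5", "gpt-4o",
    "claude-sonnet-4-5", "claude-sonnet-4-5",
]
transitions = build_transitions(history)
prediction = predict_next(transitions, "gpt-4o")
```

The more regular the alternation in the history, the closer the observed share gets to 1, which matches the note above that this signal is strongest in repetitive workflows.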

2. Episodic Signal

What it tracks: Similarity between the current conversation context and past conversations. It generates 12-feature fingerprints from the conversation (model distribution, message length, token composition, task category indicators) and performs cosine similarity matching.

Weight: Dynamic (EMA).

When it's strongest: Tasks similar to ones you've done before — deploying the same project, debugging similar error types, following established patterns.
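
Cosine similarity matching over fixed-length fingerprints can be sketched as follows. The guide does not specify how the 12 features are encoded, so the vectors here are placeholders, and most_similar is a hypothetical helper rather than part of Cortex.

```python
import math

FINGERPRINT_SIZE = 12  # this guide describes 12-feature fingerprints


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero fingerprint matches nothing
    return dot / (norm_a * norm_b)


def most_similar(current, past):
    """Find the past (fingerprint, model) pair closest to `current`.

    Returns (model, similarity), or None when there is no history.
    """
    best = None
    for fingerprint, model in past:
        score = cosine_similarity(current, fingerprint)
        if best is None or score > best[1]:
            best = (model, score)
    return best
```

A similarity near 1 means the current conversation looks like one seen before, so the model used then is a reasonable candidate now.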

3. Historical Signal

What it tracks: Past performance for specific task categories. Queries historical success rates of models for similar task types and uses that to influence the current prediction.

Weight: Dynamic (EMA).

When it's strongest: When you've accumulated significant history across well-defined task categories — code review, refactoring, documentation, testing.
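
A per-category success rate is straightforward to track. The sketch below assumes outcomes are recorded as simple success or failure per (task category, model) pair; SuccessTracker and its min_attempts parameter are illustrative names, not Cortex APIs.

```python
from typing import Optional


class SuccessTracker:
    """Track success rates per (task category, model) pair."""

    def __init__(self) -> None:
        # (category, model) -> [successes, attempts]
        self._counts = {}

    def record(self, category: str, model: str, success: bool) -> None:
        entry = self._counts.setdefault((category, model), [0, 0])
        entry[0] += int(success)
        entry[1] += 1

    def success_rate(self, category: str, model: str) -> Optional[float]:
        entry = self._counts.get((category, model))
        if entry is None:
            return None
        return entry[0] / entry[1]

    def best_model(self, category: str, min_attempts: int = 3) -> Optional[str]:
        """Pick the model with the highest rate, ignoring sparse entries."""
        candidates = [
            (successes / attempts, model)
            for (cat, model), (successes, attempts) in self._counts.items()
            if cat == category and attempts >= min_attempts
        ]
        if not candidates:
            return None
        return max(candidates)[1]
```

The min_attempts guard reflects the point above: the signal is only meaningful once a category has accumulated enough history.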

4. Cost Signal

What it tracks: Cost efficiency optimization. Prefers models based on their cost-per-token profiles, balancing capability against expense.

Weight: Dynamic (EMA).

When it's strongest: In cost-sensitive environments or when task complexity doesn't justify expensive frontier models.

5. Quality Signal

What it tracks: Expected output quality based on each model's known capabilities. Evaluates whether a model is better suited for reasoning, creative work, factual accuracy, or code generation.

Weight: Dynamic (EMA).

When it's strongest: When output quality is the primary concern — production code, security audits, customer-facing content.

6. Reflection Signal

What it tracks: Per-turn reflection feedback. After each agent action, the agent reflects on whether the right model was used. This signal incorporates that meta-cognitive feedback.

Weight: Dynamic (EMA).

When it's strongest: When the agent is actively reflecting on its decisions, such as during complex multi-step tasks where it evaluates its own model choices.

Viewing and Understanding Weights

# Current signal weights for a session
cortex qm weights

# Model-level weights (aggregated)
cortex mqm weights

Example output:

Signal Weights (session sess_abc123):
  Trajectory:    0.22  ███████████░░░░░░░░░░░
  Episodic:      0.18  █████████░░░░░░░░░░░░░
  Historical:    0.20  ██████████░░░░░░░░░░░░
  Cost:          0.15  ███████░░░░░░░░░░░░░░░
  Quality:       0.17  ████████░░░░░░░░░░░░░░
  Reflection:    0.08  ████░░░░░░░░░░░░░░░░░░

Higher weights mean the signal has been more accurate historically. In this example, the trajectory signal is the strongest predictor — this session involves repetitive, pattern-driven work. The reflection signal is weakest — the agent hasn't been generating much reflective feedback.
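
To make the role of the weights concrete, here is one possible way to fuse per-signal votes into a single prediction. The guide says only that signal outputs are fused into a weighted score, so the voting scheme, the fuse function, and the example votes are all assumptions; the weights are the ones from the example output above.

```python
# Weights copied from the example output in this guide.
WEIGHTS = {
    "trajectory": 0.22,
    "episodic": 0.18,
    "historical": 0.20,
    "cost": 0.15,
    "quality": 0.17,
    "reflection": 0.08,
}


def fuse(votes, weights):
    """Combine per-signal (model, confidence) votes into one prediction.

    Each signal adds weight * confidence to its chosen model's score.
    Returns (best model, score normalized by total weight), or None.
    """
    total_weight = sum(weights[signal] for signal in votes)
    if total_weight == 0:
        return None
    scores = {}
    for signal, (model, confidence) in votes.items():
        scores[model] = scores.get(model, 0.0) + weights[signal] * confidence
    best = max(scores, key=scores.get)
    return best, scores[best] / total_weight


# Hypothetical votes for illustration only.
votes = {
    "trajectory": ("claude-sonnet-4-5", 0.90),
    "episodic": ("claude-sonnet-4-5", 0.80),
    "historical": ("claude-sonnet-4-5", 0.90),
    "cost": ("gpt-4o", 0.70),
    "quality": ("claude-sonnet-4-5", 0.95),
    "reflection": ("gpt-4o", 0.50),
}
result = fuse(votes, WEIGHTS)
```

Under this scheme a low-weight signal such as reflection can disagree with the others without moving the outcome much, which is the practical meaning of a small weight.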

Monitoring Accuracy

# Session accuracy
cortex qm accuracy -s sess_abc123

# Model-level accuracy over the last 24 hours
cortex mqm accuracy -h 24

# Model-level accuracy over the last week
cortex mqm accuracy -h 168

Accuracy is measured as the percentage of predictions that matched the model the agent actually used. A healthy Quartermaster typically reaches 65–85% accuracy after sufficient training.

Interpreting Accuracy Numbers

Accuracy	Interpretation	Action
> 85%	Excellent — QM is well-trained	No action needed
65–85%	Good — typical for a trained QM	Continue collecting observations
40–65%	Moderate — still learning or noisy signals	Check for varied task types; let it collect more data
< 40%	Low — insufficient data or reset needed	Make sure you have at least 50 observations; consider resetting
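
The accuracy measure and the bands in the table can be expressed as a short sketch. The function names are hypothetical, and the handling of the exact boundary values (65% and 85% fall into "good", 40% into "moderate") is an assumption, since the table's ranges touch at their edges.

```python
def accuracy(decisions):
    """Share of (predicted, actual) pairs that match, or None if empty."""
    if not decisions:
        return None
    correct = sum(1 for predicted, actual in decisions if predicted == actual)
    return correct / len(decisions)


def interpret(acc):
    """Map an accuracy value (0.0 to 1.0) to the bands in the table above."""
    if acc > 0.85:
        return "excellent"
    if acc >= 0.65:
        return "good"
    if acc >= 0.40:
        return "moderate"
    return "low"


# Illustrative decisions: three of four predictions match.
sample = [
    ("claude-sonnet-4-5", "claude-sonnet-4-5"),
    ("gpt-4o", "gpt-4o"),
    ("gpt-4o", "gemini-2.5-pro"),
    ("claude-sonnet-4-5", "claude-sonnet-4-5"),
]
band = interpret(accuracy(sample))
```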

When to Reset

Reset clears the Quartermaster's learned state and starts fresh. Use it when:

You've changed workflows significantly: The patterns the QM learned for frontend development won't apply to infrastructure work.
Accuracy is stuck below 40%: Sometimes the learning gets stuck in a local minimum.
You're setting up a new project: A clean slate for a new codebase is often better than carrying over patterns from an unrelated project.

# Reset for a specific session
cortex qm reset -s sess_abc123

# Reset all session-level data (keeps model-level data)
cortex qm reset-all

# Reset model-level data (aggregate across all sessions)
cortex mqm reset

# Nuclear option: reset everything
cortex mqm reset-all

Warning: cortex mqm reset-all wipes all learned patterns and returns to learning-only mode. You'll need another 50 observations before predictions resume.

How Reinforcement Learning Improves Predictions

After every prediction, the Quartermaster evaluates whether it was correct:

Correct prediction: The contributing signal weights increase (EMA α = 0.15).
Incorrect prediction: The contributing signal weights decrease faster (EMA α = 0.25).

The asymmetry means the Quartermaster is pessimistic by design — it's quicker to lose confidence in a signal than to gain it. This prevents over-confidence on sparse data.
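
The effect of the two learning rates is easiest to see with numbers. The sketch below assumes each weight is an exponential moving average pulled toward 1 after a correct prediction and toward 0 after an incorrect one; the guide gives the α values but not the exact update rule, so update_weight and normalize are illustrative only.

```python
ALPHA_CORRECT = 0.15    # learning rate after a correct prediction
ALPHA_INCORRECT = 0.25  # learning rate after an incorrect prediction


def update_weight(weight: float, correct: bool) -> float:
    """One EMA step toward 1.0 (correct) or 0.0 (incorrect)."""
    alpha = ALPHA_CORRECT if correct else ALPHA_INCORRECT
    target = 1.0 if correct else 0.0
    return weight + alpha * (target - weight)


def normalize(weights: dict) -> dict:
    """Rescale weights to sum to 1 (an assumed post-processing step)."""
    total = sum(weights.values())
    if total == 0:
        return dict(weights)
    return {name: value / total for name, value in weights.items()}


# One hit followed by one miss leaves the weight below where it started.
start = 0.5
after_hit = update_weight(start, correct=True)
after_miss = update_weight(after_hit, correct=False)
```

Because a miss moves the weight further than a hit, a signal that is right only half the time drifts downward under this rule, which is the pessimism described above.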

Convergence

Weights typically stabilize after 200–500 observations. During the first 50 observations (learning-only mode), the MQM builds baselines but doesn't predict. Observations 50–200 are the active exploration phase, where predictions start but weights fluctuate significantly. From roughly 200 observations onward, weights begin to converge and predictions become steadier, typically settling somewhere between 200 and 500 observations.
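
The phases above depend only on the observation count, which you can read from cortex qm stats. As a rough sketch, with phase names and the phase function chosen here for illustration, and the 200 boundary treated as approximate:

```python
LEARNING_THRESHOLD = 50      # predictions begin at 50 observations
CONVERGENCE_THRESHOLD = 200  # approximate; the guide says "~200"


def phase(observations: int) -> str:
    """Classify an observation count into the phases described above."""
    if observations < LEARNING_THRESHOLD:
        return "learning-only"
    if observations < CONVERGENCE_THRESHOLD:
        return "active exploration"
    return "converging"
```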

The 85% Confidence Threshold

The 85% threshold for model override is deliberately high: the Quartermaster must be nearly certain before it enforces a model selection. Even above this threshold, unsafe operations are capped at "Suggest."

CLI Quick Reference

Command	What It Shows
`cortex qm dashboard -s <id>`	Session dashboard with accuracy, weights, top models
`cortex mqm dashboard`	Model-level dashboard across all sessions
`cortex qm decisions -s <id>`	History of predictions and outcomes
`cortex qm weights`	Current signal weights for a session
`cortex mqm weights`	Aggregated signal weights across all sessions
`cortex qm accuracy -s <id>`	Prediction accuracy for a session
`cortex mqm accuracy -h 24`	Model-level accuracy over the last 24 hours (`-h` takes a number of hours)
`cortex qm patterns --limit 20`	Learned tool-call sequence patterns
`cortex qm stats`	Tool usage statistics and observation count
`cortex qm trace <turn>`	Step-by-step prediction chain for a turn
`cortex qm reset -s <id>`	Reset session-level state
`cortex qm reset-all`	Reset all session-level data
`cortex mqm reset`	Reset model-level state
`cortex mqm reset-all`	Reset everything (back to learning mode)

Practical Example

Here's what the Quartermaster looks like in practice during a typical session:

# 1. Start working — QM is in learning mode (< 50 observations)
cortex qm stats
# Observations: 23 | Mode: learning

# 2. After a while, check again
cortex qm stats
# Observations: 67 | Mode: active | Accuracy: 72%

# 3. View the dashboard to understand prediction quality
cortex qm dashboard -s sess_abc123
# Shows: Trajectory signal is strongest (0.22 weight)
#        Episodic signal has improved 12% this session
#        Top predicted models: claude-sonnet-4-5 (89%), gpt-4o (82%), gemini-2.5-pro (76%)

# 4. Check what decisions are being made
cortex qm decisions -s sess_abc123 --limit 5
# Model: claude-sonnet-4-5 | Predicted: claude-sonnet-4-5 | Correct: yes | Confidence: 93%
# Model: gpt-4o          | Predicted: gpt-4o          | Correct: yes | Confidence: 88%
# Model: gemini-2.5-pro  | Predicted: gpt-4o          | Correct: no  | Confidence: 45%
# Model: claude-sonnet-4-5 | Predicted: claude-sonnet-4-5 | Correct: yes | Confidence: 91%
# Model: gpt-4o          | Predicted: gpt-4o          | Correct: yes | Confidence: 95%

# 5. After switching to a very different project, reset
cortex qm reset -s sess_abc123
cortex qm stats
# Observations: 0 | Mode: learning

Next Steps

Learn about the Quartermaster architecture for the system design and pipeline integration.
Review the full cortex qm CLI reference for all subcommands and options.
Explore the Model Router for how provider selection complements model prediction.
