June 23, 2026

Model Quartermaster Guide

The Model Quartermaster (MQM) is an adaptive prediction engine that learns which model your agent should use for each task. It observes every LLM call across your sessions, computes six prediction signals, and fuses them into one of three actions: enforce, suggest, or defer.

Over time, the Quartermaster's predictions improve through reinforcement learning, making your agent more cost-efficient and performant.

What the Quartermaster Does

Every time your agent needs to decide which model to use, the Quartermaster runs six independent prediction signals against the current context. It fuses their outputs into a weighted score and makes one of three decisions:

Confidence   Action    Behavior
≥ 85%        Enforce   Override model selection (safe operations only)
65–84%       Suggest   Recommend the model to the agent
< 65%        Defer     Let the agent decide entirely

Safety note: Unsafe tools (shell execution, file writes, network access) are never auto-executed, regardless of confidence. The highest action for unsafe tools is "Suggest."
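
The confidence bands and the safety cap can be sketched in a few lines of Python. This is an illustration of the rules stated above, not the Quartermaster's actual code; the names `Action`, `decide`, and `involves_unsafe_tool` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ENFORCE = "enforce"
    SUGGEST = "suggest"
    DEFER = "defer"


# Thresholds taken from the confidence table in this guide.
ENFORCE_THRESHOLD = 0.85
SUGGEST_THRESHOLD = 0.65


def decide(confidence: float, involves_unsafe_tool: bool = False) -> Action:
    """Map a fused confidence in [0, 1] to one of the three actions."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= ENFORCE_THRESHOLD:
        # Unsafe tools are capped at Suggest, regardless of confidence.
        return Action.SUGGEST if involves_unsafe_tool else Action.ENFORCE
    if confidence >= SUGGEST_THRESHOLD:
        return Action.SUGGEST
    return Action.DEFER
```

With these rules, a confidence of 0.90 enforces for a safe operation but only suggests when an unsafe tool is involved.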

Before Predictions Begin

The Quartermaster requires 50 observations before entering active prediction mode. Before that threshold, it operates in learning-only mode — collecting data, computing signal baselines, and building context fingerprints — but not making predictions.

You can check your observation count at any time:

cortex qm stats

Look for the observations field. If it's under 50, keep using Cortex normally and the Quartermaster will activate automatically once enough data is collected.

Viewing Predictions

The Dashboard

The richest view of Quartermaster activity is the dashboard:

# Session-level dashboard (specific session)
cortex qm dashboard -s sess_abc123

# Model-level dashboard (aggregated across all sessions)
cortex mqm dashboard

The dashboard displays:

  • Accuracy bars per signal: How often each signal's prediction was correct
  • Current signal weights: Visual representation of how much each signal contributes
  • Top models by prediction accuracy: Which models the MQM predicts best
  • Confidence distribution histogram: How often the MQM hits each confidence band
  • Session and model-level trends: Accuracy over time

Prediction Decisions

See the raw history of what the QM predicted and whether it was right:

# Session-level decisions
cortex qm decisions -s sess_abc123

# Limit output
cortex qm decisions -s sess_abc123 --limit 20

# Model-level decisions
cortex mqm decisions --limit 50

Each decision entry shows:

  • The model that was predicted
  • The model that was actually used
  • Whether the prediction was correct
  • The confidence score
  • Which signal contributed most to the prediction

The Six Prediction Signals

Each signal looks at model call patterns from a different angle. The Quartermaster weights them dynamically based on historical accuracy.

1. Trajectory Signal

What it tracks: Recent model usage patterns and sequences. If you typically use claude-sonnet-4-5 for code review tasks followed by gpt-4o for documentation, the trajectory signal learns this pattern.

Weight: Dynamic (EMA), adjusted up or down based on correctness.

When it's strongest: Repetitive workflows with predictable model sequences — coding sessions that alternate between reasoning-heavy and creative tasks.
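
One simple way to picture a trajectory-style signal is a first-order transition count: for each model, count which model tended to follow it. The guide does not say how the real signal is built, so treat this as an assumed stand-in; `build_transitions` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_transitions(history):
    """Count how often each model is followed by each other model."""
    transitions = defaultdict(Counter)
    for previous, following in zip(history, history[1:]):
        transitions[previous][following] += 1
    return transitions


def predict_next(transitions, current):
    """Return (most likely next model, share of observed transitions)."""
    followers = transitions.get(current)
    if not followers:
        return None, 0.0  # no history for this model yet
    model, count = followers.most_common(1)[0]
    return model, count / sum(followers.values())


# Illustrative call history, not real Quartermaster data.
history = [
    "claude-sonnet-4-5", "gpt-4o",
    "claude-sonnet-4-5", "gpt-4o",
    "claude-sonnet-4-5", "gemini-2.5-pro",
]
transitions = build_transitions(history)
next_model, share = predict_next(transitions, "claude-sonnet-4-5")
```

In this made-up history, claude-sonnet-4-5 was followed by gpt-4o in two of three cases, so `next_model` is gpt-4o with a share of about 0.67.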

2. Episodic Signal

What it tracks: Similarity between the current conversation context and past conversations. It generates 12-feature fingerprints from the conversation (model distribution, message length, token composition, task category indicators) and performs cosine similarity matching.

Weight: Dynamic (EMA).

When it's strongest: Tasks similar to ones you've done before — deploying the same project, debugging similar error types, following established patterns.
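
Cosine similarity between two fingerprints can be sketched as follows. The 12-value vectors here are placeholders; the actual feature definitions and how the Quartermaster stores past fingerprints are not specified in this guide, and `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero fingerprint matches nothing
    return dot / (norm_a * norm_b)


def most_similar(current, past):
    """past is a list of (fingerprint, model) pairs.

    Returns (model, similarity) for the closest past fingerprint.
    """
    if not past:
        return None, 0.0
    fingerprint, model = max(
        past, key=lambda item: cosine_similarity(current, item[0])
    )
    return model, cosine_similarity(current, fingerprint)
```

A similarity near 1.0 means the current conversation looks much like a past one, so the model used then becomes a strong candidate.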

3. Historical Signal

What it tracks: Past performance for specific task categories. Queries historical success rates of models for similar task types and uses that to influence the current prediction.

Weight: Dynamic (EMA).

When it's strongest: When you've accumulated significant history across well-defined task categories — code review, refactoring, documentation, testing.
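
A per-category success rate is one plausible shape for this kind of lookup. The sketch below assumes each past call is recorded with a task category, a model, and a success flag; the class `SuccessTracker` and its methods are hypothetical and not part of Cortex.

```python
from collections import defaultdict


class SuccessTracker:
    """Tracks success rates per (task category, model) pair."""

    def __init__(self):
        # (category, model) -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, category, model, success):
        entry = self._stats[(category, model)]
        entry[0] += int(success)
        entry[1] += 1

    def success_rate(self, category, model):
        successes, attempts = self._stats.get((category, model), (0, 0))
        return successes / attempts if attempts else None

    def best_model(self, category):
        """Model with the highest success rate for this category, if any."""
        rates = {
            model: successes / attempts
            for (cat, model), (successes, attempts) in self._stats.items()
            if cat == category and attempts
        }
        return max(rates, key=rates.get) if rates else None
```

The more history a category has, the more its success rates are worth trusting, which matches the note above about well-defined task categories.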

4. Cost Signal

What it tracks: Cost efficiency optimization. Prefers models based on their cost-per-token profiles, balancing capability against expense.

Weight: Dynamic (EMA).

When it's strongest: In cost-sensitive environments or when task complexity doesn't justify expensive frontier models.
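
Balancing capability against expense can be illustrated as "pick the cheapest model that is still capable enough." Everything in this sketch is assumed for illustration: the model names, prices, and capability scores are invented, and the real cost signal may weigh these factors differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_1k_tokens: float  # invented price, for illustration only
    capability: float          # invented score between 0 and 1


def cheapest_capable(profiles, required_capability):
    """Cheapest profile whose capability meets the requirement, or None."""
    capable = [p for p in profiles if p.capability >= required_capability]
    if not capable:
        return None
    return min(capable, key=lambda p: p.cost_per_1k_tokens)


# Hypothetical profiles; not real models or real prices.
profiles = [
    ModelProfile("model-small", 0.10, 0.55),
    ModelProfile("model-medium", 0.50, 0.75),
    ModelProfile("model-large", 2.00, 0.95),
]
```

A simple task with a low capability requirement resolves to the small model, while a demanding one falls through to the large model.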

5. Quality Signal

What it tracks: Expected output quality based on each model's known capabilities. Evaluates whether a model is better suited for reasoning, creative work, factual accuracy, or code generation.

Weight: Dynamic (EMA).

When it's strongest: When output quality is the primary concern — production code, security audits, customer-facing content.

6. Reflection Signal

What it tracks: Per-turn reflection feedback. After each agent action, the agent reflects on whether the right model was used. This signal incorporates that meta-cognitive feedback.

Weight: Dynamic (EMA).

When it's strongest: When the agent is actively reflecting on its decisions, for example during complex multi-step tasks where the agent evaluates its own model choices.

Viewing and Understanding Weights

# Current signal weights for a session
cortex qm weights

# Model-level weights (aggregated)
cortex mqm weights

Example output:

Signal Weights (session sess_abc123):
  Trajectory:    0.22  ███████████░░░░░░░░░░░
  Episodic:      0.18  █████████░░░░░░░░░░░░░
  Historical:    0.20  ██████████░░░░░░░░░░░░
  Cost:          0.15  ███████░░░░░░░░░░░░░░░
  Quality:       0.17  ████████░░░░░░░░░░░░░░
  Reflection:    0.08  ████░░░░░░░░░░░░░░░░░░

Higher weights mean the signal has been more accurate historically. In this example, the trajectory signal is the strongest predictor — this session involves repetitive, pattern-driven work. The reflection signal is weakest — the agent hasn't been generating much reflective feedback.
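
To see how weights like these could turn six separate opinions into one score, here is a weighted-vote sketch. The guide does not give the actual fusion formula, so this is an assumption: each signal votes for a model with its own confidence, and the fused confidence is the winning model's weighted share. The function name `fuse` is hypothetical.

```python
def fuse(votes, weights):
    """Combine per-signal votes into one (model, confidence) pair.

    votes:   signal name -> (model, signal confidence in [0, 1])
    weights: signal name -> weight
    """
    scores = {}
    total_weight = 0.0
    for signal, (model, confidence) in votes.items():
        weight = weights.get(signal, 0.0)
        scores[model] = scores.get(model, 0.0) + weight * confidence
        total_weight += weight
    if not scores or total_weight == 0.0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total_weight


# Weights copied from the example output above.
weights = {
    "trajectory": 0.22, "episodic": 0.18, "historical": 0.20,
    "cost": 0.15, "quality": 0.17, "reflection": 0.08,
}
# Illustrative votes: two signals prefer one model, four prefer another.
votes = {
    "trajectory": ("claude-sonnet-4-5", 1.0),
    "historical": ("claude-sonnet-4-5", 1.0),
    "episodic": ("gpt-4o", 1.0),
    "cost": ("gpt-4o", 1.0),
    "quality": ("gpt-4o", 1.0),
    "reflection": ("gpt-4o", 1.0),
}
model, confidence = fuse(votes, weights)
```

Here gpt-4o wins with a fused confidence of about 0.58, which would fall in the Defer band even though four of six signals agree.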

Monitoring Accuracy

# Session accuracy
cortex qm accuracy -s sess_abc123

# Model-level accuracy over the last 24 hours
cortex mqm accuracy -h 24

# Model-level accuracy over the last week
cortex mqm accuracy -h 168

Accuracy is measured as the percentage of predictions that matched the model the agent actually used. A healthy Quartermaster typically reaches 65–85% accuracy after sufficient training.

Interpreting Accuracy Numbers

Accuracy   Interpretation                              Action
> 85%      Excellent: QM is well-trained               No action needed
65–85%     Good: typical for a trained QM              Continue collecting observations
40–65%     Moderate: still learning or noisy signals   Check for varied task types; let it collect more data
< 40%      Low: insufficient data or reset needed      Make sure you have > 50 observations; consider resetting
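
The accuracy figure and its interpretation can be reproduced by hand from a list of decisions. This sketch assumes each decision is a (predicted, actual) pair and that a value sitting exactly on a shared boundary (65 or 40) belongs to the higher band; the function names are hypothetical.

```python
def accuracy(decisions):
    """Percentage of (predicted, actual) pairs that match, or None if empty."""
    if not decisions:
        return None
    correct = sum(1 for predicted, actual in decisions if predicted == actual)
    return 100.0 * correct / len(decisions)


def interpret(accuracy_percent):
    """Map an accuracy percentage to the bands in the table above."""
    if accuracy_percent > 85:
        return "excellent"
    if accuracy_percent >= 65:
        return "good"
    if accuracy_percent >= 40:
        return "moderate"
    return "low"


# Illustrative decisions: four of five predictions matched.
decisions = [
    ("claude-sonnet-4-5", "claude-sonnet-4-5"),
    ("gpt-4o", "gpt-4o"),
    ("gpt-4o", "gemini-2.5-pro"),
    ("claude-sonnet-4-5", "claude-sonnet-4-5"),
    ("gpt-4o", "gpt-4o"),
]
```

Four matches out of five gives 80%, which lands in the "good" band.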

When to Reset

Reset clears the Quartermaster's learned state and starts fresh. Use it when:

  • You've changed workflows significantly: The patterns the QM learned for frontend development won't apply to infrastructure work.
  • Accuracy is stuck below 40%: Sometimes the learning gets stuck in a local minimum.
  • You're setting up a new project: A clean slate for a new codebase is often better than carrying over patterns from an unrelated project.

# Reset for a specific session
cortex qm reset -s sess_abc123

# Reset all session-level data (keeps model-level data)
cortex qm reset-all

# Reset model-level data (aggregate across all sessions)
cortex mqm reset

# Nuclear option: reset everything
cortex mqm reset-all

Warning: cortex mqm reset-all wipes all learned patterns and returns to learning-only mode. You'll need another 50 observations before predictions resume.

How Reinforcement Learning Improves Predictions

After every prediction, the Quartermaster evaluates whether it was correct:

  • Correct prediction: The contributing signal weights increase (EMA α = 0.15).
  • Incorrect prediction: The contributing signal weights decrease faster (EMA α = 0.25).

The asymmetry means the Quartermaster is pessimistic by design — it's quicker to lose confidence in a signal than to gain it. This prevents over-confidence on sparse data.
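
The asymmetric update can be sketched as an exponential moving average toward 1 on a correct prediction and toward 0 on an incorrect one. The two α values come from the list above; the target values, the per-signal update, and the normalization step are assumptions about how the pieces fit together, and the function names are hypothetical.

```python
ALPHA_CORRECT = 0.15    # slower gain after a correct prediction
ALPHA_INCORRECT = 0.25  # faster loss after an incorrect prediction


def update_weight(weight, correct):
    """EMA step: move toward 1.0 if correct, toward 0.0 if not."""
    alpha = ALPHA_CORRECT if correct else ALPHA_INCORRECT
    target = 1.0 if correct else 0.0
    return (1.0 - alpha) * weight + alpha * target


def normalize(weights):
    """Rescale weights so they sum to 1 (assumed; left as-is if all zero)."""
    total = sum(weights.values())
    if total == 0.0:
        return dict(weights)
    return {name: value / total for name, value in weights.items()}
```

Starting from a weight of 0.5, one correct prediction raises it to 0.575, while one incorrect prediction drops it to 0.375. The loss is larger than the gain, which is the pessimism described above.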

Convergence

Weights typically stabilize after 200–500 observations. During the first 50 observations (learning-only mode), the MQM builds baselines but doesn't predict. Observations 50–200 are the active exploration phase, where predictions start but weights fluctuate significantly. From roughly 200 observations onward, weights begin to converge and predictions become more stable.
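
As a rough rule of thumb, the observation count tells you which phase to expect. The cut-offs of 50 and 200 come from the paragraph above; the phase names are labels chosen for this sketch, and the 200 boundary is approximate in practice.

```python
LEARNING_LIMIT = 50      # below this, no predictions are made
EXPLORATION_LIMIT = 200  # approximate point where weights start to settle


def phase(observations):
    """Return a rough phase label for a given observation count."""
    if observations < 0:
        raise ValueError("observations cannot be negative")
    if observations < LEARNING_LIMIT:
        return "learning"
    if observations < EXPLORATION_LIMIT:
        return "exploration"
    return "converging"
```

A session with 23 observations is still learning, one with 67 is predicting but exploring, and one with 300 should be settling down.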

The 85% Confidence Threshold

The 85% threshold for model override is deliberately high. The Quartermaster must be nearly certain before it enforces a model selection, and even then it does so only for safe operations.

CLI Quick Reference

Command                          What It Shows
cortex qm dashboard -s <id>      Session dashboard with accuracy, weights, top models
cortex mqm dashboard             Model-level dashboard across all sessions
cortex qm decisions -s <id>      History of predictions and outcomes
cortex qm weights                Current signal weights for a session
cortex mqm weights               Aggregated signal weights across all sessions
cortex qm accuracy -s <id>       Prediction accuracy for a session
cortex mqm accuracy -h 24        Model-level accuracy over the last N hours (24 here)
cortex qm patterns --limit 20    Learned tool-call sequence patterns
cortex qm stats                  Tool usage statistics and observation count
cortex qm trace <turn>           Step-by-step prediction chain for a turn
cortex qm reset -s <id>          Reset session-level state
cortex qm reset-all              Reset all session-level data
cortex mqm reset                 Reset model-level state
cortex mqm reset-all             Reset everything (back to learning mode)

Practical Example

Here's what the Quartermaster looks like in practice during a typical session:

# 1. Start working — QM is in learning mode (< 50 observations)
cortex qm stats
# Observations: 23 | Mode: learning

# 2. After a while, check again
cortex qm stats
# Observations: 67 | Mode: active | Accuracy: 72%

# 3. View the dashboard to understand prediction quality
cortex qm dashboard -s sess_abc123
# Shows: Trajectory signal is strongest (0.22 weight)
#        Episodic signal has improved 12% this session
#        Top predicted models: claude-sonnet-4-5 (89%), gpt-4o (82%), gemini-2.5-pro (76%)

# 4. Check what decisions are being made
cortex qm decisions -s sess_abc123 --limit 5
# Model: claude-sonnet-4-5 | Predicted: claude-sonnet-4-5 | Correct: yes | Confidence: 93%
# Model: gpt-4o          | Predicted: gpt-4o          | Correct: yes | Confidence: 88%
# Model: gemini-2.5-pro  | Predicted: gpt-4o          | Correct: no  | Confidence: 45%
# Model: claude-sonnet-4-5 | Predicted: claude-sonnet-4-5 | Correct: yes | Confidence: 91%
# Model: gpt-4o          | Predicted: gpt-4o          | Correct: yes | Confidence: 95%

# 5. After switching to a very different project, reset
cortex qm reset -s sess_abc123
cortex qm stats
# Observations: 0 | Mode: learning
