CortexPrism v0.50: The Stabilization Release

When we shipped the modular architecture in v0.49.0, CortexPrism crossed a structural threshold — 6 packages, 41 contract interfaces, three monoliths decomposed into 151 focused modules. But structure isn't the same as reliability. v0.50.0 is the "make it actually work" release. The one where we trace every wire, plug every hole, and ensure nothing was left disconnected during the rapid development of the past six months.

The numbers tell the story: 4 CLI commands fully implemented and tested but never registered. A 476-line tool invisible to agents. An entire A2A remote agent bridge that was a no-op. 18 security issues across all 6 layers. 18 silent prediction bugs in the dual quartermaster intelligence systems. 30+ dead files and 4 orphaned database tables. This release is about finding and fixing everything that fell through the cracks.

Wiring What Was Built But Never Connected

The most humbling discovery: significant features existed in the codebase — fully implemented, tested, and passing CI — but were never accessible to users because they weren't registered in the right place.

Four CLI commands totaling 1,036 lines fell into this category. Each had full implementations with flags, validation, error handling, and test coverage. None were in the CLI command registry:

cortex import — the data migration system. Converts OpenClaw configs (25 providers, agents, model pools, MCP servers) to Cortex format. Imports session transcripts from JSONL with tool_calls and tool_result extraction. Reads Hermes' SQLite state.db directly — no export step required — extracting 24+ session columns and 18+ message columns. Supports --config-only, --sessions-only, --memory-only, --dry-run. The config mapper framework is extensible with a ConfigMapper type and PROVIDER_NAME_MAP covering 25 providers.
cortex qm — the Quartermaster tool orchestration system (293 lines). Learns which tools work best for each task type across 5 signals. Shows patterns, signal weights, prediction stats, decision history, and accuracy metrics.
cortex mqm — the Model Quartermaster selection system (243 lines). Learns which models work best for each task across 6 signals. Shows decisions, confidence scores, accuracy trends, and signal contributions.
cortex service — full micro-service CRUD (186 lines): list, show, create, update, delete, start, stop.

The file_diff tool (476 lines) was similarly stuck in the tool source directory with no registry entry. Agents that needed to diff files — a common operation — couldn't. The implementation was there: unified diffs, side-by-side, syntax hints, context lines. Just never wired to the agent loop.

Most critically, the A2A remote agent bridge was a complete no-op. createA2AToolWrapper() existed as an exported function but nothing ever called it. The a2a.remoteAgents section in config.json was parsed and validated, but the configured agents were never registered as tools. Every instance of CortexPrism with remote A2A agents configured was silently ignoring them. Registration now happens at the end of registerAllBuiltins(), looping through config.a2a.remoteAgents and registering each as an a2a_<name> tool.
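
The registration loop can be sketched roughly as follows. This is a minimal illustration, not the actual CortexPrism code: the RemoteAgentConfig and Tool shapes, the registry Map, and the body of the wrapper are all assumptions made for the example. Only the createA2AToolWrapper() name and the a2a_<name> naming scheme come from the description above.

```typescript
// Hypothetical shapes; the real CortexPrism types are not shown in this post.
interface RemoteAgentConfig { name: string; url: string }
interface Tool { name: string; run(input: string): Promise<string> }

// Stand-in for createA2AToolWrapper(): wraps one remote agent as a tool.
function createA2AToolWrapper(agent: RemoteAgentConfig): Tool {
  return {
    name: `a2a_${agent.name}`,
    // A real wrapper would speak the A2A protocol to agent.url.
    run: async (input) => `[${agent.name}] would receive: ${input}`,
  };
}

// Register every configured remote agent and return the tool names added.
function registerRemoteAgents(
  registry: Map<string, Tool>,
  remoteAgents: RemoteAgentConfig[] = [],
): string[] {
  const registered: string[] = [];
  for (const agent of remoteAgents) {
    const tool = createA2AToolWrapper(agent);
    registry.set(tool.name, tool);
    registered.push(tool.name);
  }
  return registered;
}

const registry = new Map<string, Tool>();
const names = registerRemoteAgents(registry, [
  { name: "research", url: "https://agents.example.com/research" },
]);
```

The important property is that the loop runs unconditionally during startup, so a parsed but unused config section can no longer go unnoticed.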

Security Hardening: 18 Issues Across All 6 Layers

A comprehensive audit of the Parallax security model identified 18 issues spanning critical, high, medium, and low priority tiers. The theme was consistent: security mechanisms existed in the codebase but weren't actually invoked at runtime.

SSRF protection existed but was never called. resolveAndCheck() blocked requests to private IPs (RFC 1918), loopback addresses, and cloud metadata endpoints (169.254.169.254, metadata.google.internal). The shell command validator never called it. An agent executing curl http://169.254.169.254/latest/meta-data/ would succeed: the DNS-resolution guard could check the resolved IP, but nothing on the shell execution path invoked it. Now wired: every shell command containing a URL passes through SSRF validation before execution.
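
To illustrate the kind of check involved, here is a simplified sketch. It is not resolveAndCheck() itself: it skips DNS resolution and only handles IPv4 literals plus two hostnames, and the function names isBlockedIPv4, extractUrls, and commandPassesSsrfCheck are made up for this example.

```typescript
// True when an IPv4 literal falls in a range SSRF guards commonly block:
// loopback, RFC 1918 private ranges, and link-local (which contains the
// 169.254.169.254 metadata endpoint).
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return true; // fail closed on anything that is not a clean IPv4 literal
  }
  const [a, b] = parts;
  return (
    a === 127 ||                          // 127.0.0.0/8 loopback
    a === 10 ||                           // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 192 && b === 168) ||           // 192.168.0.0/16
    (a === 169 && b === 254)              // 169.254.0.0/16 link-local
  );
}

// Pull http(s) URLs out of a shell command string.
function extractUrls(command: string): string[] {
  return command.match(/https?:\/\/[^\s'"]+/g) ?? [];
}

// Hypothetical pre-execution check for a shell command.
function commandPassesSsrfCheck(command: string): boolean {
  for (const raw of extractUrls(command)) {
    let host: string;
    try {
      host = new URL(raw).hostname;
    } catch {
      return false; // unparseable URL: fail closed
    }
    if (host === "metadata.google.internal" || host === "localhost") return false;
    if (/^\d+(\.\d+){3}$/.test(host) && isBlockedIPv4(host)) return false;
  }
  return true;
}
```

A production guard also has to resolve hostnames and check every returned address, since a public-looking name can point at a private IP.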

Session isolation was registered but never enforced. isPathAllowed() checked file paths against registered session boundaries. The file tool validator was aware of it but never consulted it during validation. File operations from one agent session could reach into another session's workspace. Now enforced at the tool-call boundary: path arguments in file read/write/edit/delete tools are checked against the calling session's registered boundaries.
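
A boundary check of this kind might look like the sketch below. It assumes absolute POSIX-style paths and a list of root directories per session; the real isPathAllowed() signature is not shown in this post, and normalizePath is a helper invented for the example. Symlink resolution is deliberately left out.

```typescript
// Collapse "." and ".." segments in an absolute POSIX-style path so that
// "/ws/a/../b/file" cannot slip past a prefix comparison.
function normalizePath(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

// Assumed signature: a path is allowed if it sits at or below one of the
// calling session's registered roots.
function isPathAllowed(path: string, sessionRoots: string[]): boolean {
  const target = normalizePath(path);
  return sessionRoots.some((root) => {
    const base = normalizePath(root);
    if (base === "/") return true;
    // Compare with a trailing slash so "/ws/session-ab" does not match
    // the root "/ws/session-a".
    return target === base || target.startsWith(base + "/");
  });
}
```

Normalizing before comparing matters: a raw string prefix test would accept /ws/session-a/../session-b/secret as belonging to session-a.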

The policy table had a blocking CHECK constraint. Migration 009 defined valid policy kinds as ('tool', 'shell', 'domain', 'capability'). But the validator also checks 'path' and 'computer' kinds. Any attempt to insert a path-based or computer-action policy would fail. Migration 042 recreates the table with the full set of 6 kinds.

16 new default deny rules were seeded using INSERT OR IGNORE to avoid overwriting user customizations. Five shell rules: filesystem creation (mkfs), kernel parameter writes (/proc/sys/), firewall manipulation (iptables, ufw), cron modification (crontab -), and force pushes (git push). Seven path rules: password hashes (/etc/shadow), SSH keys (/root/.ssh/, id_rsa), GPG config (.gnupg/), environment files (.env), SSH server config (sshd_config), and sudoers. Three domain rules: AWS metadata (169.254.169.254), GCP metadata (metadata.google.internal), and loopback (127.0.0.1). One computer action rule: blocking raw type actions.

Four existing shell regex patterns were hardened to close bypass vectors. The rm -rf pattern now catches -r -f, --recursive --force, and -fr variants. The fork bomb pattern matches the actual :(){ :|: & };: syntax. The dd pattern catches bare device names like /dev/sda. The chmod 777 pattern catches -R 777 and non-root paths.
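
As an illustration of what closing these bypasses involves, here is one possible pattern for the rm case. It is a sketch written for this post, not the pattern shipped in the release: it uses two lookaheads so the recursive and force flags can appear in any order, combined or separate, short or long.

```typescript
// Matches "rm" followed anywhere on the line by a recursive flag
// (-r, -R, a combined short flag containing r, or --recursive) AND a
// force flag (-f, a combined short flag containing f, or --force).
const RM_RECURSIVE_FORCE =
  /\brm\b(?=.*(?:\s-[a-zA-Z]*[rR]|\s--recursive\b))(?=.*(?:\s-[a-zA-Z]*f|\s--force\b))/;

function isRecursiveForceRm(command: string): boolean {
  return RM_RECURSIVE_FORCE.test(command);
}

const blocked = [
  "rm -rf /tmp/build",
  "rm -fr /tmp/build",
  "rm -r -f /tmp/build",
  "rm --recursive --force /tmp/build",
].map(isRecursiveForceRm);

const allowed = ["rm notes.txt", "rm -f notes.txt"].map(isRecursiveForceRm);
```

Because the lookaheads scan the rest of the line, a pattern like this errs on the side of blocking: a flag belonging to a later command on the same line can trigger it. For a deny rule that trade-off is usually acceptable.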

12 chrome_* and codegraph tools received individual risk profiles. Previously, all fell through to a blanket 'medium' classification with no specific guardrails. Now profiled individually: chrome_execute_js, chrome_http_auth, and chrome_network_rules at 'high' with confirmation required; chrome_navigate, chrome_create_tab, chrome_upload_file, chrome_save_page, chrome_manage_downloads, chrome_fill_form, and chrome_type_text at 'medium' with appropriate guardrails; code_index and code_pilot profiled with their specific risk surfaces.

The vault encryption key was in the safe environment variable set. CORTEX_VAULT_KEY was listed alongside harmless vars like HOME and PATH. Any agent could read it. Removed entirely from the safe-var set.

Data classification defaulted to 'sensitive' for all non-empty content. This triggered excessive supervisor LLM calls for routine conversations. The defense-in-depth review concluded that explicit pattern matching is sufficient: only content matching SENSITIVE_PATTERNS (PII, credentials, financial data) or SECRET_PATTERNS (API keys, tokens, passwords) is now elevated.

Guardrail shell injection patterns were too aggressive. The backtick and $() patterns matched empty content and blocked legitimate code examples in conversation. Changed from * to {1,200} quantifier, requiring 1+ characters bounded to typical inline code length.
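
The effect of the quantifier change is easy to see in isolation. The patterns below are simplified stand-ins written for this post (the actual guardrail expressions are not reproduced here); only the switch from * to {1,200} is taken from the fix.

```typescript
// Old style: zero or more characters, so the empty forms "$()" and "``" match.
const oldSubshell = /\$\([^)]*\)/;
const oldBacktick = /`[^`]*`/;

// New style: at least 1 and at most 200 characters between the delimiters.
const newSubshell = /\$\([^)]{1,200}\)/;
const newBacktick = /`[^`]{1,200}`/;

const emptyOld = oldSubshell.test("$()");      // matched by the old pattern
const emptyNew = newSubshell.test("$()");      // no longer matches
const realNew = newSubshell.test("$(whoami)"); // still matches
const longNew = newSubshell.test("$(" + "a".repeat(201) + ")"); // beyond the bound
```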

Web UI: Top Navigation, Experience Levels, and Theming

The sidebar-based layout that served CortexPrism since early versions was replaced with a horizontal top navigation bar. Five category tabs (Chat, Development, Knowledge, Infrastructure, System) sit alongside the logo, command palette trigger, experience level toggle, theme toggle, and WebSocket badge. Clicking a category populates the sidebar with contextual sub-navigation: 40 pages across 5 categories, each with icon, label, tooltip, and experience level assignment.

A [B] [I] [A] experience level segmented control filters visible navigation. Beginner mode shows 10 core pages (chat, dashboard, memory, sessions, files, code graph, skills, settings, agents, plugins). Intermediate exposes 29. Advanced shows all 40 including developer tools, observability, and infrastructure pages. The choice is persisted in localStorage as cortex_experience_level. Navigating to a hidden page via URL hash shows a level gate overlay with an upgrade button. The command palette also respects the experience filter.

The CSS-only [data-tip]::after pseudo-element tooltip hack was replaced with a proper JavaScript tooltip system. It uses event delegation on [data-tooltip] attributes, creates a single reusable #global-tooltip element with role="tooltip" and aria-describedby, supports both mouse (250ms delay, instant hide) and keyboard (focusin/focusout), smart positioning (flip above/below, clamp horizontal to viewport), and Escape to dismiss. Deployed on all nav items, mode toggle buttons, and theme toggle.

A dark/light theme system was built with CSS custom properties. Dark theme is the default (:root). Light theme is applied via [data-theme="light"] selector overrides. The toggle respects prefers-color-scheme media query on first load and persists the choice in localStorage as cortex_theme. The CSS was rewritten (599→1000+ lines) with CortexPrism brand colors (cyan #06b6d4, indigo #6366f1), a spacing scale (--space-1 through --space-8), updated typography (Inter 14px/1.6 for body, JetBrains Mono 13px for code), and 80+ new component classes.

Quartermaster Intelligence: 18 Critical Fixes

CortexPrism has two self-learning intelligence systems that run silently in the background: the Model Quartermaster for model selection and the Quartermaster for tool prediction. The audit revealed that both had significant prediction bugs that accumulated as the systems evolved.

Model Quartermaster (8 fixes):

The most impactful bug was signal atrophy through normalization. The learning system had 6 signals: historical performance, episodic memory, cost, quality, trajectory, and reflection. But learn.ts only reinforced 3 of them (historical, quality, reflection) on good choices. The other 3 received no positive reinforcement. After each normalization pass, unreinforced signals drifted toward zero. Over time, the system effectively operated on 3 signals instead of 6. The fix sends proportional reinforcement to all 6 signals on every good choice and proportional penalties on bad ones.

A missing coverage penalty let models matching only one signal achieve the same confidence as models matching all six. fusion.ts computed confidence = weightedSum / activeWeightSum, so a model matching only the reflection signal (weight 0.05, score 0.9) got confidence 0.045 / 0.05 = 0.9, identical to a model scoring 0.9 on all six signals. Added confidence *= 0.7 + 0.3 * (signalCount / 6). A model with 1/6 signals now gets a 0.75× multiplier (0.7 + 0.3 × 1/6), so that reflection-only match drops from 0.9 to 0.675.
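
The arithmetic can be checked with a few lines. This sketch uses only the formula quoted above; the SignalHit type and the fuseConfidence name are ours, not taken from fusion.ts.

```typescript
interface SignalHit { weight: number; score: number }

const TOTAL_SIGNALS = 6;

// Normalized weighted average of the signals that fired, scaled by a
// coverage factor of 0.7 + 0.3 * (signalCount / 6).
function fuseConfidence(hits: SignalHit[]): number {
  if (hits.length === 0) return 0;
  const weightedSum = hits.reduce((sum, h) => sum + h.weight * h.score, 0);
  const activeWeightSum = hits.reduce((sum, h) => sum + h.weight, 0);
  if (activeWeightSum === 0) return 0;
  const base = weightedSum / activeWeightSum;
  const coverage = 0.7 + 0.3 * (hits.length / TOTAL_SIGNALS);
  return base * coverage;
}

// One signal: base 0.9, coverage 0.7 + 0.3 * (1/6) = 0.75.
const reflectionOnly = fuseConfidence([{ weight: 0.05, score: 0.9 }]);

// Six signals at the same score: base 0.9, coverage 1.0.
const allSix = fuseConfidence(
  [0.3, 0.2, 0.15, 0.15, 0.15, 0.05].map((weight) => ({ weight, score: 0.9 })),
);
```

With the penalty in place, the single-signal model lands near 0.675 while the six-signal model stays near 0.9, so the two are no longer indistinguishable.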

A race condition in the observe→active mode transition let both incrementSessionObservations() (store) and observeModel() (mod.ts) independently set mode = 'active' at the 50-observation threshold, potentially creating duplicate mode-change events. The mode-setting was removed from incrementSessionObservations(); only observeModel() handles the transition.

Cost was compared per-call, not per-task. Raw avg_cost_usd was compared across models without normalizing for task complexity, penalizing models used for complex tasks. Cost is now divided by taskComplexity to produce a cost-per-complexity-unit metric.

Recency decay was missing entirely. mqm_model_stats accumulated forever with equal weight, meaning a model that performed well six months ago had the same influence as one that performed well today. Added 2%/day decay (floor 40%) on historical, quality, and cost signal scores based on last_used timestamp.
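
One way to express such a decay is sketched below. The post does not say whether the 2% compounds, so this example assumes a simple linear decay of 0.02 per day clamped at the 0.4 floor; an exponential variant (0.98 raised to the number of days) would be an equally plausible reading. The recencyMultiplier name is ours.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const DECAY_PER_DAY = 0.02; // 2% per day
const DECAY_FLOOR = 0.4;    // never drop below 40% of the original score

// Assumed linear decay: multiplier = max(0.4, 1 - 0.02 * daysSinceLastUse).
function recencyMultiplier(lastUsed: Date, now: Date): number {
  const days = Math.max(0, (now.getTime() - lastUsed.getTime()) / MS_PER_DAY);
  return Math.max(DECAY_FLOOR, 1 - DECAY_PER_DAY * days);
}

const now = new Date("2025-06-30T00:00:00Z");
const tenDaysAgo = recencyMultiplier(new Date("2025-06-20T00:00:00Z"), now);
const sixMonthsAgo = recencyMultiplier(new Date("2025-01-01T00:00:00Z"), now);
```

Under this reading a model last used ten days ago keeps about 80% of its signal score, and anything older than 30 days sits at the 40% floor.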

The heuristic model tier detection only knew 3 tiers (opus|gpt-4|o1, sonnet|gpt-3.5, haiku|flash|mini) dating to early 2025. Modern models like gpt-4o, gemini-2.5-pro, claude-3.5-sonnet, and llama-3.3 all fell into the "unknown" tier. Expanded to 6 tiers covering 30+ models with per-tier cost and quality baselines.

The episodic signal used a fragile regex, (?:model|using|with)\s+([\w-]+), to extract model names from memory hit text. It failed whenever a memory entry didn't match this exact pattern. Replaced with direct substring search: each memory hit is scanned for all candidate model name and provider strings.

Accuracy was tracked inconsistently. The accuracy trend query used was_correct >= 0.7 while session state accuracy used raw correctCount (updated by a separate reflection path). Added CORRECTNESS_THRESHOLD = 0.7 constant; observeModel() now updates both was_correct on decisions and correctCount in session state from a single unified source.

Quartermaster (10 fixes):

The trajectory signal was completely dead. findPatterns(last3) searched for JSON.stringify(last3) in stored patterns, but stored patterns contained JSON.stringify(allToolsInTurn). A full-turn sequence like ["read","edit","write","shell"] never matched a prefix search for ["edit","write","shell"]. Added prefix mode via SQL LIKE ? || '%'; computeTrajectorySignal() extracts the next tool from sequences matching the prefix; learn.ts stores prefix_3_tools + actualTool instead of the full turn.
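
The mismatch and the fix are easier to see with concrete strings. The sketch below is an in-memory stand-in for the SQL LIKE query, with prefixKey and predictNextTools as names invented for this example. One detail worth noting: JSON.stringify of a three-element array ends in "]", so the closing bracket has to be dropped before the string can act as a prefix of a longer stored array.

```typescript
// '["a","b","c"]' becomes '["a","b","c",' so it can prefix a longer array.
function prefixKey(tools: string[]): string {
  return JSON.stringify(tools).slice(0, -1) + ",";
}

// Count which tool followed the given prefix in stored sequences.
// startsWith() plays the role of SQL "LIKE ? || '%'" here.
function predictNextTools(stored: string[], last3: string[]): Map<string, number> {
  const key = prefixKey(last3);
  const counts = new Map<string, number>();
  for (const pattern of stored) {
    if (!pattern.startsWith(key)) continue;
    const sequence = JSON.parse(pattern) as string[];
    const next = sequence[last3.length];
    if (next !== undefined) counts.set(next, (counts.get(next) ?? 0) + 1);
  }
  return counts;
}

const stored = [
  JSON.stringify(["read", "edit", "write", "shell"]),       // old full-turn row
  JSON.stringify(["edit", "write", "shell", "read"]),       // prefix + next tool
  JSON.stringify(["edit", "write", "shell", "read"]),
  JSON.stringify(["edit", "write", "shell", "git_commit"]),
];
const counts = predictNextTools(stored, ["edit", "write", "shell"]);
```

In a real SQL query, tool names containing "_" or "%" would also need escaping, since both are LIKE wildcards.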

Fusion never reached the suggest threshold. Unlike the Model QM, the Quartermaster just summed weight * score without dividing by activeWeightSum. A tool matching only taskContext (weight 0.15, score 0.8) got 0.12, one fifth of the 0.6 suggest threshold, no matter how strong the match was. Added rawTotal / activeWeightSum normalization plus coveragePenalty = 0.7 + 0.3 * (signalCount / 5).

The avg_confidence incremental average formula was broken. upsertPattern() computed (confidence + success_count) / (hit_count + 1), mixing a 0-1 score with an integer count. Now uses (avg_confidence * hit_count + confidence) / (hit_count + 1) after fetching the current avg_confidence.
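
A quick comparison shows how far off the old formula was. The function names here are ours; the two expressions are the ones quoted above.

```typescript
// Old formula: mixes a 0-1 confidence with an integer success count.
function brokenAverage(confidence: number, successCount: number, hitCount: number): number {
  return (confidence + successCount) / (hitCount + 1);
}

// Corrected incremental mean: previous total plus the new sample,
// divided by the new sample count.
function updateAvgConfidence(avgConfidence: number, hitCount: number, confidence: number): number {
  return (avgConfidence * hitCount + confidence) / (hitCount + 1);
}

// Four prior hits averaging 0.8, then a new observation of 0.3.
const fixed = updateAvgConfidence(0.8, 4, 0.3); // (3.2 + 0.3) / 5
const broken = brokenAverage(0.3, 4, 4);        // (0.3 + 4) / 5
```

The corrected version gives about 0.70, the true mean of the five samples. The old version gives about 0.86 in this case, a number driven by the success count rather than by any confidence value.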

A race condition in observe() let two concurrent observations read the same observationCount, increment independently, and both trigger mode transitions. Replaced with atomic incrementSessionObservations() at the store level; mode transition checked against the returned newCount.

learn() corrupted predictionCount by overwriting it with sessionState.predictionCount + decisions.length, but predict() had already incremented it for each call. Removed the overwrite; learn() now only writes correctCount.

Only 3 of 5 signals were penalized on bad predictions. updateWeightsFromDecision() penalized trajectory, episodic, and taskContext. toolStats and reflection were immune. Now all 5 signals receive proportional penalties using confidenceFloor enforcement.

confidenceFloor was stored but never enforced. It existed in the schema (qm_signal_weights.confidence_floor) and migration but updateWeightsFromDecision() ignored it entirely. Now enforces Math.max(floor, newWeight) on every update.
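
Taken together, the penalty and floor fixes amount to something like the sketch below. The Math.max(floor, newWeight) clamp is from the fix itself; the penalty rule (shrinking each weight in proportion to that signal's share of the bad prediction), the rate parameter, and the type and function names are assumptions made for illustration.

```typescript
interface SignalWeight { weight: number; confidenceFloor: number }

// Hypothetical penalty rule: every signal is penalized in proportion to
// its contribution (0..1), then clamped so it never falls below its floor.
function penalizeWeights(
  weights: Record<string, SignalWeight>,
  contributions: Record<string, number>,
  rate = 0.2,
): Record<string, SignalWeight> {
  const updated: Record<string, SignalWeight> = {};
  for (const [name, signal] of Object.entries(weights)) {
    const share = contributions[name] ?? 0;
    const newWeight = signal.weight * (1 - rate * share);
    updated[name] = {
      ...signal,
      weight: Math.max(signal.confidenceFloor, newWeight),
    };
  }
  return updated;
}

const after = penalizeWeights(
  {
    toolStats: { weight: 0.3, confidenceFloor: 0.05 },
    reflection: { weight: 0.06, confidenceFloor: 0.05 },
  },
  { toolStats: 0.5, reflection: 1 },
  0.5,
);
```

In this run toolStats drops to about 0.225, while reflection would fall to 0.03 and is held at its 0.05 floor instead.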

Reflection confidence was hardcoded to 0.5. predict() always passed 0.5 to gatherSignalScores() regardless of actual reflection quality. Now accepts a reflectionConfidence parameter (default 0.5) so callers can pass real confidence.

The candidate tool list was hardcoded to 10 tools and was missing 14 newer ones, including web_search, web_fetch, brave_search, computer, sandbox_exec, task, a2a, mcp, semantic_search, codebase_search, git_commit, git_stash, and web_scrape. Expanded to 24.

Episodic signal regex had the same fragility as the Model QM. Replaced with direct text.includes(toolName) search across candidate tools.

Both quartermaster systems now provide full visibility via their newly accessible CLI commands (cortex qm and cortex mqm) with subcommands for pattern inspection, weight viewing, decision history, accuracy trends, and system reset.

Prompt Lab: A/B Testing & Structured Generation

The Prompt Lab grew from 2 endpoints to 14, transforming from a simple template list into a full prompt engineering workspace with three tabs: Templates, A/B Tests, and Generator.

The A/B Testing system supports creating tests with two variants, recording runs with scores, latency, and token counts, and comparing results with confidence-based winner detection that highlights the statistically superior variant. Each variant can be iterated and retested. The test lifecycle supports pause, resume, and completion states.

The Generator produces prompts from structured parameters: role, tone, style, length, constraints, and examples. It can also generate automatic variations of existing prompts using 5 strategies: restructure (reorganize without changing meaning), clarity (simplify language), specificity (add concrete details), format (change output structure), and persona (recast for a different role). Templates support {{variable}} interpolation with automatic extraction of variable names, making it straightforward to build parameterized prompt families.
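
Variable extraction and interpolation for {{variable}} templates can be done with a single regular expression. The sketch below is our own illustration rather than the Prompt Lab code; in particular, allowing whitespace inside the braces and leaving unknown variables untouched are assumptions.

```typescript
// Return the distinct variable names in a template, in order of first use.
function extractVariables(template: string): string[] {
  const pattern = /\{\{\s*(\w+)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    if (!names.includes(match[1])) names.push(match[1]);
  }
  return names;
}

// Replace each {{name}} with its value; unknown names are left as-is so
// a partially filled template is still visibly incomplete.
function interpolate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (whole, name: string) =>
    values[name] ?? whole);
}

const template = "Act as {{role}} in a {{tone}} tone. {{role}} must cite sources.";
const variables = extractVariables(template);
const partial = interpolate(template, { role: "a reviewer" });
```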

The UI shows variant scores, per-run latency and token counts, and confidence-based winner indicators. The run buffer increased from 100 to 500 to support thorough A/B comparisons.

Sessions: Tree View with Token Metrics

The sessions list was redesigned as a hierarchical tree. Parent sessions display with an accent left border; child sub-agent sessions are indented underneath with an amber border and a tree connector line. Each row shows: a status indicator dot, the session name, a truncated ID, agent/channel/sub-agent type badges (including cortex_import, hermes_, and restored badges for imported sessions), a child count chip, turn count, total tokens, cost, tool calls, and average LLM duration.

A new GET /api/sessions/enriched endpoint joins lens_events token data per session, plus GET /api/sessions/:id/stats for single-session detail. Children link back to their parent with a ← parent navigation link. Archival opacity transitions smoothly on hover. Child navigation uses DOM-based addEventListener to avoid the inline-onclick escaping problems that plagued template literal exports.

Agent Loop: Spiral Prevention & Hardening

Three defenses were added against a class of failure where the LLM chases tangents instead of producing results:

DuckDuckGo sidebar confusion. The web_search tool returned DuckDuckGo's RelatedTopics API field labeled simply as **Related:**. The LLM interpreted these algorithmically-suggested Wikipedia sidebar snippets as conversation context, triggering recursive tool-call feedback loops through 12 rounds of search before delivering a confused error. Now labeled **DuckDuckGo Sidebar (algorithmically suggested — may be unrelated to your query):** with an explicit instruction to ignore.

Self-referential search detection. If any search or fetch tool query matches a >30-character substring of recent assistant output, a [SYSTEM WARNING] is injected telling the LLM to reread the user's original message rather than recycling its own responses as search fodder.

Confusion spiral guard. A counter tracks consecutive rounds where all tool calls are search/fetch tools with zero user-facing output. At 3+ rounds, a [SYSTEM WARNING] interrupts the loop telling the LLM it's chasing tangents and to produce results from already-collected data.
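
The last two defenses reduce to a substring check and a counter. The sketch below shows one way to write them; the function and class names, the set of search tool names, and the sliding-window approach are assumptions for illustration, while the 30-character and 3-round thresholds come from the descriptions above.

```typescript
const MIN_OVERLAP = 31; // "more than 30 characters"

// True when any 31-character window of the query also appears in the
// assistant's recent output.
function isSelfReferential(query: string, recentOutput: string): boolean {
  for (let i = 0; i + MIN_OVERLAP <= query.length; i++) {
    if (recentOutput.includes(query.slice(i, i + MIN_OVERLAP))) return true;
  }
  return false;
}

// Hypothetical set of tools that count as search/fetch.
const SEARCH_TOOLS = new Set(["web_search", "web_fetch"]);

class SpiralGuard {
  private searchOnlyRounds = 0;

  // Record one agent round; returns true when a warning should be injected.
  recordRound(toolNames: string[], producedUserOutput: boolean): boolean {
    const searchOnly =
      toolNames.length > 0 && toolNames.every((t) => SEARCH_TOOLS.has(t));
    this.searchOnlyRounds =
      searchOnly && !producedUserOutput ? this.searchOnlyRounds + 1 : 0;
    return this.searchOnlyRounds >= 3;
  }
}

const guard = new SpiralGuard();
const rounds = [
  guard.recordRound(["web_search"], false),
  guard.recordRound(["web_search", "web_fetch"], false),
  guard.recordRound(["web_fetch"], false),
];
const afterEdit = guard.recordRound(["file_edit"], false);
```

Any round that uses a non-search tool or produces user-facing output resets the counter, so ordinary research followed by an answer never trips the guard.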

Computer Use now uses a singleton executor. Previously, every executeComputerAction() call created a new ComputerUseExecutor — starting Xvfb, initializing the display, and then destroying it. Every mouse click, keypress, and screenshot incurred ~1s of Xvfb startup overhead. Now a module-level singleton persists across tool calls and auto-shuts down after 5 minutes of inactivity.
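
The pattern is a lazily created module-level instance with an idle timer that is reset on every use. The sketch below uses a FakeExecutor in place of ComputerUseExecutor, and the getExecutor and shutdownExecutor names are ours; only the singleton idea and the 5-minute timeout come from the change described above.

```typescript
// Stand-in for ComputerUseExecutor; the constructor represents Xvfb startup.
class FakeExecutor {
  static starts = 0;
  constructor() { FakeExecutor.starts++; }
  stop(): void { /* a real executor would tear down the display here */ }
}

const IDLE_MS = 5 * 60 * 1000;
let executor: FakeExecutor | null = null;
let idleTimer: ReturnType<typeof setTimeout> | undefined;

// Stop the executor and clear any pending idle timer.
function shutdownExecutor(): void {
  if (idleTimer !== undefined) clearTimeout(idleTimer);
  idleTimer = undefined;
  if (executor !== null) executor.stop();
  executor = null;
}

// Return the shared executor, creating it on first use, and push the
// idle shutdown back by another five minutes.
function getExecutor(): FakeExecutor {
  const current = executor ?? (executor = new FakeExecutor());
  if (idleTimer !== undefined) clearTimeout(idleTimer);
  idleTimer = setTimeout(shutdownExecutor, IDLE_MS);
  return current;
}
```

Consecutive calls to getExecutor() return the same instance, so the startup cost is paid once per burst of activity rather than once per action.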

Dead Code Purge & Infrastructure Fixes

30+ dead files were removed. Nine duplicate files under packages/server/src/ and packages/ai/src/ were migration scaffolding that was never imported (all imports resolve to src/). Six dead computer-use duplicate files. The 409-line packages/core/src/db/migrate.ts duplicate. RemoteAgentManager (47 lines of pure Map wrappers, never imported by any file). Seven dead remote agent type exports (RemoteAgentStatus, RemoteAgentInfo, etc.).

Four orphaned database tables were dropped in cleanup migration 041, among them working_memory (zero runtime references) and channel_sessions and channel_messages (incomplete channel message persistence).

The daemon restart endpoint was a no-op — POST /api/daemons/*/restart returned {ok: true} without doing anything. Now reads the daemon's PID file, sends SIGTERM, and waits up to 15s for the supervisor to auto-restart the process.

Workflow approvals always returned empty — GET /api/workflows/approvals hardcoded json([]). Now queries all registered workflows for actual pending approval state. Added POST /api/workflows/approvals/:name for approve/reject.

Memori preview returned a stub — GET /api/memori/preview always returned {checkpoints: []}. Now queries the actual checkpoint store.

Computer use screenshots returned 5MB of base64 blobs per request. Split into metadata-list plus per-file endpoint with lazy-loaded thumbnails.

Node rekey and config_update handlers were no-ops that only logged events. rekey now stores the rotated token and closes the WebSocket to trigger auto-reconnect. config_update now stores toolsAllowList and blockedTools in mutable config overrides.

What's Next

With the stabilization work complete, CortexPrism is positioned for three areas of development:

Distributed agent swarms. The kernel's process tree and resource accounting extend naturally across machines. The A2A remote agent bridge — now actually functional — provides the wire protocol. Cross-instance coordination becomes an implementation detail of @cortex/infra.
WebAssembly tool plugins. With the Parallax security model hardened and session isolation enforced, WASM plugins get stronger isolation guarantees. A plugin that only needs stdout shouldn't have filesystem access at all. The capability tier system maps cleanly to WASM sandbox permissions.
Multi-user collaboration. Shared workspaces with per-user agent configs. The @cortex/server contracts define channels and sessions in a way that naturally extends to multi-tenant routing. The experience level system already demonstrates per-user UI customization.

Get Started

CortexPrism runs on macOS, Linux, and Windows as a single Deno binary. No Docker required, no Python, no node_modules.

# Install
curl -fsSL https://cortexprism.io/install.sh | bash

# Setup and start
cortex setup
cortex serve

# Open http://localhost:3000

Already running? Upgrade in place:

cortex self update

The project is Apache 2.0 licensed, fully open source, and has zero telemetry. Everything runs on your hardware.

GitHub: github.com/CortexPrism/cortex
Changelog: CHANGELOG.md