Claude Opus 4.8: What Changed, How It Compares, and Whether to Upgrade

Foculoom Engineering

Opus 4.8 shipped today. Here's what changed, how it benchmarks against GPT-5.5 and Gemini 3.1 Pro, and what it means for Foculoom's model routing.

The model routing problem

If you build with AI, you make routing decisions constantly: which model for planning, which for code review, which for fast cheap tasks, which for long autonomous runs. Get it right and costs stay predictable. Get it wrong in a multi-step agentic loop and a poor routing choice compounds across every subsequent action.

The pressure to revisit those decisions has been intense in 2026. Anthropic shipped Opus 4.5, 4.6, 4.7, and now 4.8 in roughly four months. Each release moves the tradeoff curves. This post is about what actually changed in 4.8, how it benchmarks against the other models Foculoom evaluates regularly, and what it means for the routing decisions we make every day.

What's new in Claude Opus 4.8

Anthropic announced Opus 4.8 today (official announcement). The headline numbers:

1M token context window is now the default across the API and most platforms. For agentic workflows, this removes the need to chunk large codebases across tool calls. You can load a full project, its test suite, and prior conversation history in a single context and reason over all of it at once.

Fast mode runs at 2.5× the output speed, at one-third the price of fast mode for prior Opus models. This changes the cost tier at which Opus is viable. Tasks where you were routing to Sonnet for cost reasons should now be re-evaluated — the gap narrowed significantly.

Honesty and reliability improved meaningfully. Opus 4.8 is four times less likely to allow flaws in its own code to pass unremarked than prior versions, and it flags uncertainty rather than filling gaps with plausible-sounding wrong answers. For autonomous agents where errors compound across steps, a model that says "I'm not sure" is far more valuable than one that confidently produces a plausible mistake.

Dynamic workflows (parallel subagent execution and large-scale planning) are in research preview via Claude Code. Not production-ready, but the direction is clear.

Mid-conversation system messages let developers update instructions mid-task without breaking prompt continuity. For long-running automation, this removes a painful workaround that required terminating and restarting runs to adjust behavior.

Regular pricing is unchanged from Opus 4.7. Fast mode pricing is one-third of what it was for prior Opus versions.

Benchmark comparison

Claude Opus 4.8 benchmarks against GPT-5.5 and Gemini 3.1 Pro across the tasks most relevant to developers building autonomous workflows:

| Benchmark | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|
| Agentic Coding (SWE-Bench Pro) | 69.2% | 58.6% | 54.2% |
| Bug-fix accuracy (SWE-Bench Verified) | 88.6% | — † | — † |
| Terminal task execution (Terminal-Bench 2.1) | 74.6% | 78.2% | 70.3% |
| Multidisciplinary Reasoning (HLE) | 57.9% | ~52.2% | ~51.4% |
| Knowledge Work (GDPval-AA) | 1890 | 1769 | 1314 |
| Agentic Computer Use (OSWorld) | 83.4% | 78.7% | 76.2% |
| Graduate-level Reasoning (GPQA Diamond) | 93.6% | 93.6% | ~92% |

† SWE-Bench Verified scores for GPT-5.5 and Gemini 3.1 Pro were not available in third-party coverage at time of writing. This is the benchmark most directly relevant to autonomous bug-fixing and code-quality tasks.

Benchmark guide (what each measures):

  • SWE-Bench Pro / SWE-Bench Verified — how well a model resolves real GitHub issues autonomously; Verified is a human-curated higher-quality subset
  • Terminal-Bench 2.1 — multi-step agentic shell and terminal task completion; measures CLI and file-system workflows, not code review
  • HLE (Humanity's Last Exam) — expert-level multidisciplinary question set requiring deep reasoning across many domains
  • GDPval-AA — simulated professional knowledge-work tasks (analysis, writing, research)
  • OSWorld — ability to complete desktop OS tasks autonomously (computer use)
  • GPQA Diamond — graduate-level science reasoning

Sources: Anthropic (primary), OfficeChai, Worthview, 9to5Mac.

The headline: for autonomous, long-horizon agentic work, which is the most expensive and highest-stakes category, Opus 4.8 leads clearly. It is 10.6 points ahead of GPT-5.5 on SWE-Bench Pro and 4.7 points ahead on OSWorld.

Two exceptions are worth noting. GPT-5.5 wins Terminal-Bench 2.1, which measures shell and CLI task execution (78.2% vs 74.6%). It also retains the lead in visual reasoning: on ARC-AGI 2 it scores 85%, against 68.8% for Opus 4.6, and Opus 4.8 has not been formally benchmarked on that test yet. For terminal-heavy automation or visual-reasoning workloads, GPT-5.5 is still the better tool.

Gemini 3.1 Pro generally trails on the tasks above, but it maintains advantages in multimodal understanding, its largest context window variants, and cost-efficiency at scale through its Flash tier. If your workload involves high-volume, lower-stakes inference, Gemini's pricing structure can make sense.

The broader pattern is that the gaps between frontier models are narrowing everywhere except the tasks Claude was already winning. On graduate-level reasoning (GPQA Diamond), all three models are within two percentage points of each other.
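The margins discussed above can be recomputed directly from the table. The sketch below is illustrative only: the function name is made up for this post, GDPval-AA is left out because it is a rating rather than a percentage, SWE-Bench Verified is left out because only one score is available, and values the table marks with "~" are treated as exact.

```python
# Scores copied from the benchmark table above. Values the table marks with "~"
# are treated as exact here, so margins involving them are approximate.
SCORES = {
    "SWE-Bench Pro": {"Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
    "HLE": {"Opus 4.8": 57.9, "GPT-5.5": 52.2, "Gemini 3.1 Pro": 51.4},
    "OSWorld": {"Opus 4.8": 83.4, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2},
    "GPQA Diamond": {"Opus 4.8": 93.6, "GPT-5.5": 93.6, "Gemini 3.1 Pro": 92.0},
}


def leaders_and_margin(scores: dict[str, float]) -> tuple[list[str], float]:
    """Return the top-scoring model(s) and their lead over the best of the rest."""
    top = max(scores.values())
    leaders = [name for name, value in scores.items() if value == top]
    rest = [value for value in scores.values() if value != top]
    # If every model is tied there is no runner-up, so the margin is zero.
    margin = round(top - max(rest), 1) if rest else 0.0
    return leaders, margin


if __name__ == "__main__":
    for benchmark, scores in SCORES.items():
        leaders, margin = leaders_and_margin(scores)
        print(f"{benchmark}: {' / '.join(leaders)} by {margin} points")
```

Ties are reported as joint leaders, which is how the GPQA Diamond row comes out.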

How Foculoom routes models

Here is the actual routing matrix Foculoom uses in production, encoded in the copilot-instructions.md of our workflow orchestration layer:

| Use case | Model | Reason |
|---|---|---|
| Day-to-day coding | claude-sonnet-4.6 | Cost/quality sweet spot for routine tasks |
| Complex architecture / deep debugging | claude-opus-4.7 | Reasoning depth justifies the cost |
| Code review and generation | gpt-5.4 / gpt-5.3-codex | Observed performance advantage; comparative benchmark data incomplete (see §3) |
| iOS / Swift debugging | claude-sonnet-4.6 (Opus 4.7 for deep Swift 6) | Tier matched to task depth |
| PLANNER and REVIEWER agents | claude-opus-4.7 (pinned) | Role agents need sustained reasoning |
| CONDUCTOR orchestration | claude-sonnet-4.6 | Throughput over depth for routing decisions |
| Fallback / cost step-down | claude-haiku-4.5 | When budget constraints require it |

The organizing principle is not "use the best model for everything" — it is "match model tier to task consequence." A task that compounds errors across an autonomous run gets the flagship. A task that routes a decision to another agent does not.

This routing matrix has been a living document. It gets updated every time a new release changes the cost-performance tradeoff on a tier. Today's release triggers an update.
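One way to keep a matrix like this cheap to update is to hold it as data with a single lookup function. The sketch below is not how Foculoom stores it (the production version lives in copilot-instructions.md). The model identifiers and reasons are copied from the table, while the `Route` class, the use-case keys, `route_for`, and the rule that budget pressure always steps down to the fallback are assumptions made for illustration. Only a subset of rows is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Model identifiers and reasons are copied from the routing table above.
# The dictionary keys are hypothetical labels invented for this sketch.
ROUTES: dict[str, Route] = {
    "daily_coding": Route("claude-sonnet-4.6", "Cost/quality sweet spot for routine tasks"),
    "deep_debugging": Route("claude-opus-4.7", "Reasoning depth justifies the cost"),
    "planner": Route("claude-opus-4.7", "Role agents need sustained reasoning"),
    "reviewer": Route("claude-opus-4.7", "Role agents need sustained reasoning"),
    "conductor": Route("claude-sonnet-4.6", "Throughput over depth for routing decisions"),
}

FALLBACK = Route("claude-haiku-4.5", "When budget constraints require it")


def route_for(use_case: str, budget_constrained: bool = False) -> Route:
    """Look up the route for a use case, stepping down to the fallback tier on budget pressure."""
    # Unknown use cases fail loudly instead of silently getting a default model.
    if use_case not in ROUTES:
        raise KeyError(f"no route defined for use case {use_case!r}")
    return FALLBACK if budget_constrained else ROUTES[use_case]
```

With the matrix in one place, moving the PLANNER and REVIEWER rows to a new model is a two-line diff that can be reviewed and tested like any other change.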

What Opus 4.8 changes for us

We plan to upgrade PLANNER and REVIEWER from 4.7 to 4.8. Same price. The honesty improvements matter directly here — a PLANNER that flags uncertainty in a spec is better than one that fills gaps with plausible-but-wrong details that BUILDER then implements. Upgrading the reasoning agents first is the obvious call.

Reconsidering the Sonnet tier boundary. With fast mode pricing now one-third of what it was for prior Opus versions, the cost gap between Sonnet and Opus fast mode has compressed significantly. The key question for any given task is still whether it compounds: if an error in step 3 affects steps 4 through 12, the cost savings on the cheaper model disappear quickly. We will run Opus 4.8 fast mode on a sample of current Sonnet-tier tasks and publish the cost comparison once we have data.
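The compounding argument can be put into rough numbers. The sketch below assumes a deliberately simple model: every step succeeds independently with the same probability, a failed run is only detected at the end, and the whole run is retried until one succeeds. The function name, the cost units, and the success rates are hypothetical placeholders, not measured prices or error rates for any model.

```python
def expected_cost_per_success(cost_per_step: float, per_step_success: float, steps: int) -> float:
    """Expected spend to obtain one fully successful run, retrying whole runs until one succeeds."""
    if cost_per_step < 0:
        raise ValueError("cost_per_step must be non-negative")
    if steps <= 0:
        raise ValueError("steps must be positive")
    if not 0.0 < per_step_success <= 1.0:
        raise ValueError("per_step_success must be in (0, 1]")
    # A run succeeds only if every step does; the number of attempts is geometric.
    run_success = per_step_success ** steps
    return (cost_per_step * steps) / run_success


if __name__ == "__main__":
    # Placeholder numbers: the cheaper tier costs half as much per step
    # but is assumed to fail more often on each step of a 12-step run.
    cheaper = expected_cost_per_success(cost_per_step=1.0, per_step_success=0.90, steps=12)
    pricier = expected_cost_per_success(cost_per_step=2.0, per_step_success=0.99, steps=12)
    print(f"cheaper tier: {cheaper:.1f} units, pricier tier: {pricier:.1f} units")
```

Under these made-up inputs the tier with the higher per-step price ends up cheaper per successful run. With real prices and measured failure rates the result can go either way, which is why the sample run matters.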

The 1M context window changes deep debugging. We previously chunked large codebases across tool calls for deep architecture analysis. A single 1M-token context window removes that chunking step and should reduce the number of false inferences that come from incomplete context.
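Before dropping the chunking step, it helps to check whether a project is likely to fit. The sketch below is a rough estimate only: it assumes about four characters per token, which is a common rule of thumb and not a real tokenizer, and the function names, the file suffixes, and the reserve for conversation history are all hypothetical.

```python
from pathlib import Path

CONTEXT_WINDOW_TOKENS = 1_000_000  # window size quoted in this post
CHARS_PER_TOKEN = 4  # rough rule of thumb, NOT a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_context(
    root: Path,
    suffixes: tuple[str, ...] = (".py", ".md"),
    reserve_tokens: int = 100_000,
) -> tuple[bool, int]:
    """Return (fits, estimated_tokens) for the matching files under root.

    reserve_tokens is headroom kept for instructions, history, and the reply.
    """
    total = 0
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            total += estimate_tokens(path.read_text(encoding="utf-8", errors="ignore"))
    return total + reserve_tokens <= CONTEXT_WINDOW_TOKENS, total


if __name__ == "__main__":
    fits, tokens = fits_in_context(Path("."))
    print(f"estimated tokens: {tokens}, fits in one context: {fits}")
```

If the estimate lands near the limit, count with the provider's own tokenizer before relying on it.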

Keeping GPT-5.4 for code review. We currently route code review to GPT-5.4 based on observed performance. The honest benchmark picture here is incomplete: SWE-Bench Verified — the most directly relevant test for autonomous code quality work — only has data for Opus 4.8 in this comparison. Terminal-Bench (where GPT-5.5 leads) measures shell task execution, not code review; that win doesn't transfer directly to this use case. We're keeping this routing for now on observed quality grounds and will revisit once SWE-Bench Verified data is available for GPT-5.5.

Should you upgrade?

For devtools builders thinking about their own routing matrices:

Upgrade your highest-stakes reasoning tasks from Opus 4.7 to 4.8. No price increase. The honesty and code reliability improvements are the most practically valuable changes in this release, and they matter most in exactly the tasks you are paying flagship prices for.

Recalculate your cost tier boundaries. If you avoided Opus for cost reasons, run the math again against fast mode pricing. Fast mode now costs one-third of what it did for prior Opus versions, a two-thirds reduction, and that is meaningful.

Wait on dynamic workflows. The parallel subagent execution capability is research preview only. Do not build production automation on top of it yet.

For terminal-heavy and visual reasoning tasks, the GPT family retains its edge. GPT-5.5 wins Terminal-Bench 2.1 and leads on visual reasoning. The data supports keeping terminal-heavy tasks on the GPT family. For code review the benchmark data is incomplete, so that routing rests on observed quality rather than published scores. Which specific version to use depends on your budget and workload.

Watch "Mythos." Anthropic has signaled a next model class in limited preview that exceeds the current flagship. When it ships broadly, the routing matrix will need another update. Build your infrastructure so those updates are cheap to make.

The pace is the point

Four major Opus releases in four months is not an anomaly — it is the operating environment. A model routing decision made in January 2026 is already stale. The teams that will compound advantages in this environment are the ones that treat their routing matrices as code: versioned, tested, and actively maintained, not set once and forgotten.

The benchmarks will keep changing. The principle stays the same: match model to consequence, not to familiarity.


We publish Foculoom's engineering decisions as we make them. If you have a different routing setup or a reason to disagree with anything here, reach out — we read every reply.