2026 · 05 · 224 min read

Alibaba's Qwen3.7-Max ships with native Anthropic API support, 1M context, 35-hour autonomous runs

Alibaba released Qwen3.7-Max as an agent-first model with a 1M-token context, native support for the Anthropic API protocol (so it drops into a Claude Code harness), and benchmark wins including 92.4 on GPQA Diamond, 41.4 on HLE, and $2.08M of simulated revenue in YC-Bench. It is

# The interesting part of Qwen3.7-Max is not the leaderboard

Alibaba shipped Qwen3.7-Max this week. The benchmark numbers are strong — 92.4 on GPQA Diamond, 41.4 on HLE, 97.1 on HMMT Feb 2026, $2.08 million of simulated revenue in YC-Bench, 1M-token context, autonomous runs reported out to 35 hours. The coverage is leading with those.

I would lead with one line buried in the model card.

> Qwen3.7-Max natively supports the Anthropic API protocol.

That sentence is what changes the calculus for everyone shipping agents.

## What protocol compatibility actually means

If you have built an agent product in the last twelve months, you have probably written code against either the Anthropic API or the OpenAI API. You have a tool-use loop that expects a specific message shape, a streaming format, an error vocabulary, and a particular set of stop conditions. Most of the code you have written is not model code. It is protocol code.

Swapping models has historically meant rewriting protocol code. Sometimes a thin adapter is enough. More often, the adapter accumulates edge cases — a different way of representing tool calls, a different streaming chunking pattern, a different idea of system messages — and you discover three months in that your model-agnostic abstraction quietly hardcoded the original provider.

Qwen3.7-Max removes that work. It speaks the Anthropic API natively. Existing Claude Code harnesses, existing Claude agent loops, existing tool-use orchestration, and existing eval scripts run against Qwen with no protocol translation. You change the base URL. You change the auth header. The rest of your code does not know it switched.

## Why this matters more than the benchmarks

For two years I have told portfolio CTOs that the durable asset in an AI product is not the model. It is the layer around the model. The routing logic that decides which call goes to which tier. The eval harness that catches regressions. The policy gates that enforce spending limits, content rules, and human-in-the-loop checkpoints.

That argument has always come with a caveat: in practice, switching models was painful enough that teams treated their initial provider choice as load-bearing. The your-model-is-interchangeable claim was true in theory and aspirational in operation.

A native-protocol drop-in changes the operation. It makes swap-the-backbone a Tuesday-afternoon experiment rather than a Q3 project. Once that is true, the routing layer stops being a nice-to-have. It is the asset.

## What to do this quarter

Three concrete moves for any team shipping agents.

First, audit your routing layer. If your code has a hardcoded provider string anywhere outside a single config file, fix that this week. The fix is small now and impossible later.

Second, add Qwen3.7-Max to your existing eval set. Not as a research project. As a row in your standard table, scored on the same task suite you use to grade Claude and GPT-5. The eval should run against the existing harness. If it does not, the harness is the bug.

Third, identify the slice of your traffic that does not need frontier reasoning. Most teams I audit have 30 to 60 percent of calls running on a frontier model for work that a medium model would handle. Route a canary of that slice to Qwen3.7-Max and let it run for a week. Measure task completion, error rates, and tail latency. The dollars are usually meaningful.

## The honest caveats

A few things to keep in mind before this becomes a swap-everything memo.

Protocol compatibility is not capability compatibility. Qwen3.7-Max may handle your Claude prompts syntactically and still produce different outputs. Evals catch this. Vibes do not.

The reported autonomous-run numbers — up to 35 hours — are aspirational. Long-horizon stability is the hardest thing to evaluate honestly, and ran-for-35-hours and produced-useful-output-for-35-hours are different claims. Treat the upper bound as a marketing number until you reproduce it on a real task.

Compliance posture matters. If your product runs in a regulated jurisdiction, the residency and audit posture of an Alibaba-hosted endpoint is a different conversation than a Claude or GPT endpoint. That conversation is worth having before the eval, not after.

## What I am watching next

Two things.

First, whether OpenAI ships a similar gesture toward protocol portability or whether they double down on differentiating their tool-use semantics. The market case for portability is now stronger than the lock-in case.

Second, how many other open-weights labs follow. If Mistral, Cohere, and one or two of the new Asian labs ship native Anthropic-protocol endpoints in the next six months, the protocol becomes a standard the way OpenAI's chat format became one in 2023. Good for builders. Uncomfortable for anyone selling a moat.

If your team is running this experiment, send me the numbers. I am collecting before-and-after data for a longer piece on routing economics and will publish the comparison.

If this was useful, the weekly Brief covers shorter ideas like this every Wednesday.

Read the Briefs →