2026 · 04 · 02 · 8 min read

The desktop just got automated

GPT-5.4 scored above the human baseline on OSWorld-V this quarter. The 12-week response for SaaS founders looks the same as the playbook from 2009.

OpenAI shipped GPT-5.4 last month with one number that should restructure your roadmap. The model scored 75 percent on OSWorld-V, a benchmark of real desktop productivity tasks. The human baseline on the same set is 72.4 percent. For the first time a frontier model is above the human baseline at running someone else's desktop on someone else's behalf.

I want to be careful with that sentence because the benchmark is a benchmark. The set is a sample. Humans on the test were not the best humans. The 75 percent is for individual tasks, not for an eight hour day of unstructured judgment. None of this should let you skip the conclusion. The model can drive a desktop now. The question is what you do about it.

If you are a SaaS founder, the honest reading is that the moat you have been protecting is now visible from the model's screen. You spent five years building the UI. The model can use it. You spent three years building the integrations. The model can call them. You spent two years building the workflow product. The model can recreate it on top of your competitor's UI in an afternoon. The shape of this is exactly the same as the web-to-mobile transition fifteen years ago, when companies that thought they had a product realised they had a delivery mechanism for a feature.

What I would do, if I were a founder of a category-leading SaaS company this quarter, is the same thing the best companies did in 2009 when they realised the iPhone was not optional.

The first thing I would do is ship the agent before someone else ships an agent on top of me. Not a chat bot. An agent. The product surface where a customer says do my Tuesday morning work and the system goes and does it. The mistake people make in this category is starting with a model. Start with a customer use case where the work currently takes a human forty minutes and the deliverable is an artifact your product already produces. Build the agent that finishes the artifact and writes a one paragraph note about what it did. The model is incidental. The product is the bottom of the screen where the human reads the paragraph and decides.

The second thing I would do is publish my own benchmark. The reason your customers cannot tell whether an external agent works on your product is that there is no neutral way to measure. Solve that. Pick the twenty tasks your power users do most often. Write them down. Time them. Publish a benchmark that anyone can run: with your product, against your product, or against a competitor's. The first SaaS company in each category to publish this benchmark gets to set the goalposts. Every later entrant runs the test you wrote. This is the cheapest defensive move in the playbook and almost nobody runs it.
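
The first version of that benchmark can be very small. Here is a minimal sketch in Python, assuming a hypothetical agent that takes a task prompt and returns the finished artifact as text. The Task fields, the run_benchmark name, and the pass/fail check functions are my own illustration, not an existing harness.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """One power-user task. All field names here are hypothetical."""
    name: str
    prompt: str                    # what the agent is asked to do
    human_minutes: float           # how long a power user takes, measured by you
    check: Callable[[str], bool]   # True if the returned artifact is acceptable


def run_benchmark(tasks: list[Task], agent: Callable[[str], str]) -> dict:
    """Run every task through the agent and report pass rate and timings."""
    rows = []
    for task in tasks:
        start = time.perf_counter()
        try:
            passed = bool(task.check(agent(task.prompt)))
        except Exception:
            # A crash counts as a failed task, not a failed benchmark run.
            passed = False
        rows.append({
            "task": task.name,
            "passed": passed,
            "agent_seconds": time.perf_counter() - start,
            "human_minutes": task.human_minutes,
        })
    pass_rate = sum(r["passed"] for r in rows) / len(rows) if rows else 0.0
    replaced = sum(r["human_minutes"] for r in rows if r["passed"])
    return {"pass_rate": pass_rate, "human_minutes_replaced": replaced, "rows": rows}
```

The design choice that matters is that the check functions ship with the benchmark. Anyone who runs it, including a competitor, is graded by your definition of done.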

The third thing I would do is rewrite the integration story. Every SaaS company has a webhook page and an API page. By the end of this year, every SaaS company that wants to be agent native will also have a tool spec page in the MCP format, with worked examples, latency targets, idempotency notes, and a sandbox key that an agent can use without legal review. The MCP standard crossed 97 million installs last month. The first version of agent integrations was a thousand bespoke API contracts. The second version is one tool spec. The companies that publish the spec in the next six months are going to be the platforms. The ones that wait will be the integrations.
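
To make the tool spec page concrete, here is a rough sketch of one entry. The name, description, and inputSchema keys follow the shape of an MCP tool definition as I understand it. Everything else is an assumption for illustration: the create_invoice tool, the DOCS block with a latency target, and the in-memory handler are hypothetical and not part of any standard.

```python
# Hypothetical tool entry. Only the three top-level keys mirror the MCP tool shape.
CREATE_INVOICE_TOOL = {
    "name": "create_invoice",
    "description": "Create a draft invoice for a customer.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "amount_cents": {"type": "integer", "minimum": 1},
            "idempotency_key": {"type": "string"},
        },
        "required": ["customer_id", "amount_cents", "idempotency_key"],
    },
}

# Assumed documentation fields for the spec page, not part of the tool definition.
DOCS = {
    "latency_target_ms": 800,
    "idempotency": "Repeating a call with the same idempotency_key returns the first result.",
}

_results_by_key: dict[str, dict] = {}  # stand-in for a real datastore


def call_create_invoice(args: dict) -> dict:
    """Sandbox handler: reject incomplete calls, and make retries safe."""
    required = CREATE_INVOICE_TOOL["inputSchema"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    key = args["idempotency_key"]
    if key in _results_by_key:
        # An agent retrying after a timeout must not create a second invoice.
        return _results_by_key[key]
    result = {"invoice_id": f"inv_{len(_results_by_key) + 1}", "status": "draft"}
    _results_by_key[key] = result
    return result
```

The idempotency note earns its place on the page because agents retry. A spec that does not say what happens on the second identical call leaves every integrator to find out in production.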

The fourth thing I would do is hire a human who is great at the work the agent is supposed to do. Not a prompt engineer. Not an ML engineer. A great practitioner of the work. The person who is in the top one percent of doing the task your customers pay you to do. The reason is that the eval for an agent is, eventually, did the work get done correctly. The only person on your team who can write that eval is the practitioner. The model is fast. The practitioner is the judge. Most teams I see in this space have ten engineers and zero practitioners. Their evals look like engineering tests. Their products work for engineers.

The fifth thing I would do is stop optimising for the model that exists this quarter. The model that ships in Q4 is going to be better. The model that ships in Q4 of next year is going to be better than that. The architecture choices you make this quarter should assume the model will improve by ten percent every six months on the benchmarks you care about, and the pricing per token will fall by half every twelve months. If your unit economics break the moment that happens, you have built the wrong product. If they get better, you are aligned with the wind.
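
You can run that check on a napkin. The sketch below assumes the ten percent is a relative improvement compounded every six months and capped at a perfect score, and that price halves every twelve months. Both readings are my assumptions about the rule of thumb above, and the project function and its inputs are hypothetical.

```python
def project(score_now: float, cost_per_task_now: float, months: float) -> dict:
    """Project benchmark score and model cost per task `months` from now.

    Assumes score_now is a fraction in (0, 1], a relative 10 percent score gain
    per 6 months (capped at 1.0), and token price halving per 12 months.
    """
    score = min(1.0, score_now * 1.10 ** (months / 6))
    cost = cost_per_task_now * 0.5 ** (months / 12)
    return {
        "score": score,
        "cost_per_task": cost,
        # What you really pay for: the cost of one task done correctly.
        "cost_per_success": cost / score,
    }


if __name__ == "__main__":
    for months in (0, 6, 12, 24):
        print(months, project(score_now=0.75, cost_per_task_now=0.40, months=months))
```

If the price you charge depends on cost_per_success staying where it is today, the falling curve is working against you. If your margin widens as it falls, it is working for you.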

The cheapest action on this list is the benchmark. The most important is the practitioner hire. The one with the largest competitive ceiling is the MCP spec. Do all five, in any order, before September. The desktop is automated now. Your product is on it.

I will write a longer piece soon on what this means for the consumer applications layer, which is, if anything, in a stranger position than B2B SaaS. If you have a take, send it. The pieces that have made me think most this quarter all came from operators in the middle of this transition who had the patience to write down what they were seeing.
