Nate’s Substack

Run this 4-question test before you let any AI into your files, your Slack, or your phone.

Nate — Mon, 29 Jun 2026 13:00:57 GMT

We should probably stop talking about AI model releases like normal software launches.

The old rhythm was simple: a lab ships a model, people try it, benchmarks land, everyone argues for two days, and the tool either earns a place in your workflow or it doesn’t.

GPT-5.6 broke that rhythm. OpenAI shipped it, but only to a small group of government-approved…

Executive Briefing: Cheap Intelligence Won’t Matter If Your Context Is Trapped

Nate — Sun, 28 Jun 2026 15:01:51 GMT

A few days ago I thought this week’s briefing would be a pricing story.

GLM-5.2 had dropped, coding was getting cheaper if you wanted it to, and the obvious enterprise question seemed to be: why keep sending every job to the most expensive frontier model if a good open model can do a growing slice of the work?

And then Anthropic dropped Claude Tag, and it complicated the whole cheap-intelligence story. Anthropic was not getting cheaper, and its customers were bracing for higher bills and paying them anyway. Back in May, The Information had reported as much: enterprise buyers expected to pay more for Claude, not because Claude was useless, but because Claude was useful enough, and increasingly woven into the work enough, that they did not want to turn it off.

And I got stuck on the contradiction. How can intelligence be getting cheaper and more expensive at the same time?

The lazy version of the answer is that frontier labs still have market power. Claude is good. Companies want the best tools. Engineers and analysts are expensive. If a $250,000 employee becomes meaningfully more productive, a company will tolerate a surprisingly ugly AI bill for a while.

What it doesn’t explain is why the pricing power might survive as open models get better. It also doesn’t explain why Claude Tag feels more strategically important to me than another model launch.

The more I sat with it, the less this looked like a token pricing story. It looked like a story about where intelligence is allowed to work.

Cheaper intelligence only helps you if you can put it to work. If you can’t, the discount stays on paper, and the expensive tool stays the one your team keeps reaching for. So the real question is whether you can actually capture that discount, or keep paying the premium because nothing else fits the work yet.

This briefing covers:

Why the cheap option is real now. GLM-5.2 means a growing slice of work no longer needs frontier prices, with a security tradeoff attached to running it yourself.
What you are actually paying for. Most companies can buy the model cheaply and still can’t use it, because the context and permissions around it are the part they haven’t built.
The Claude Tag move. How a Slack integration becomes context that gets harder to leave the longer your team uses it.
Why the obvious fix is hard. Owning your own context is the right answer on paper and a brutal hiring problem in practice.
The seven questions to settle now. Settle them, or have them answered for you.

Grab the Open Engine guide: the copy-paste task record that makes one AI's work the next AI's job, with receipts

Nate — Fri, 26 Jun 2026 13:04:03 GMT

You finish a client call and the work starts its commute. The transcript goes to Claude to find the argument. Codex changes the file. ChatGPT reads the draft again. A browser agent checks the page actually rendered. Slack has the conversation, Linear has the task, and the calendar decides whether any of it survives the afternoon.

That’s not seven jobs. It’s one job crossing seven systems. And every time it crosses, you carry the state: what was decided, what source mattered, what changed, what the next tool is allowed to touch.

I don’t want one of those tools to swallow the others. Most serious AI users I know don’t either. They have preferences. They know Claude is better at one thing, Codex at another, a local agent at a third, and they don’t want to crown a favorite and pretend the rest of the world disappeared. The trouble isn’t that we own too many good tools. It’s the boring middle between them, the place where one tool’s result becomes the next tool’s task with its source and limits still attached, and for most of us that middle is still a person. Right now, the integration layer is you. It doesn’t have to stay that way.

If you can code, you can build your way around this. You can wire the tools together with APIs, a custom harness, a few cron jobs, and for engineers that’s getting easier every month. But that’s not an answer for everyone else, and it’s barely an answer for a team. The need underneath isn’t exotic. You want one AI’s result to become another AI’s task, with the sources attached, the limits visible, and enough of a trail that the next person or agent doesn’t have to read a giant chat to catch up.
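For readers who do code, the need described above can be sketched as a small data structure. This is a minimal illustration, not the Open Engine task record: the `Handoff` class, its field names, and the `hand_off` function are all hypothetical, chosen only to mirror the pieces the paragraph names (a result, its sources, its limits, and a trail).

```python
from dataclasses import dataclass, field

# Hypothetical shape, not the Open Engine task record: only the pieces the
# paragraph above names (a task, its sources, its limits, and a trail).
@dataclass
class Handoff:
    task: str
    sources: list[str]
    limits: list[str]
    trail: list[str] = field(default_factory=list)


def hand_off(previous: Handoff, result: str, next_task: str, done_by: str) -> Handoff:
    """Turn one tool's result into the next tool's task without dropping context."""
    entry = f"{done_by}: {previous.task} -> {result}"
    return Handoff(
        task=next_task,
        sources=list(previous.sources),   # sources travel with the work
        limits=list(previous.limits),     # limits stay visible to the next tool
        trail=previous.trail + [entry],   # short trail instead of a giant chat
    )


# Example values are invented for illustration.
first = Handoff(
    task="find the argument in the call transcript",
    sources=["client-call-transcript.txt"],
    limits=["do not email the client"],
)
second = hand_off(first, "client wants a smaller scope", "update the scoping doc", "tool-a")
print(second.trail)
```

The design choice worth noticing is that the second task is built from the first rather than typed fresh, so the sources and limits cannot be silently dropped along the way.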

I know a product lead with a newborn and an agency. She runs Claude Code, she has loops and automations, she’s looked hard at OpenClaw. She’s not new to any of this. Her problem is smaller and more maddening than that. A client call collides with a baby appointment. The product scoping still has to move, the team still needs to know what changed, and she’s the one copy-pasting the state of her life between five tools while holding a baby. That’s the load Open Engine takes off her. Not the judgment. Not the taste. The handoffs around them.

I’ve been running a working version to get content out, organize my life, move houses, and coordinate with my team. I’m releasing it now because the next real AI problem isn’t “which model is smartest.” It’s whether the work can move between models at all.

Here’s what’s inside:

Open Engine itself. The build I actually run, packaged as copy-paste templates you hand to the AI you already use, so you have a loop running today.
The handoff, not the model. Why the thing that breaks is never “can the agent do it” but “can the work survive the trip to the next tool.”
The smallest useful version. A shared task list and a seven-part task record that carry the job across tools, so a good answer stops dying in a private chat.
The one-loop audit, and the 30-minute build. Nine questions that turn one annoying handoff into a task an agent can claim, pause, resume, and finish with evidence.
The receipt. The short vocabulary that keeps an agent accountable after the run ends, so “done” stops meaning “now go audit it yourself.”
Teams, and the rest of the field. How one person’s agent hands another person’s agent real work, and where this sits next to OpenClaw, Hermes, and Symphony.

Start where everyone feels it first: the moment the work has to leave one tool and land in the next.

The Five Questions That Turn a Messy Task Into an AI Loop (+ the prompts to map yours)

Nate — Wed, 24 Jun 2026 13:01:22 GMT

We have forgotten how much work we do in the age of apps.

Not the work that shows up as one clean task. The work between tasks: remembering, checking, following up, noticing that one small change moves three other things you forgot were connected. Apps made each piece easy to reach and left the wiring between them to you. That wiring is the job. Nobody logs it as work. The school trip is not hard because packing a lunch is hard. It is hard because the email, the weather, the pickup time, the calendar conflict, and the kid who outgrew the raincoat all live in different places, and the job of connecting them lives in your head. The customer update is heavy for the same reason: what happened, what changed, who owns the next step, and what cannot be promised yet are scattered across email, Slack, and three different calls.

AI should help with that. People flinch when I say it, and they are right to. The fantasy version has been bad for years: a cartoon assistant, a nanny in a box, a cheerful agent gliding around your calendar somehow knowing what you want. I do not trust it. I do not want a system that pretends my life is simpler than it is, or one that turns a bad assumption into an action because autonomy looks good in a demo.

So when people say they want an AI assistant, I do not hear a request for another app with a chat box. I hear something simpler: stop making me be the person who remembers how all the pieces fit together. Not one giant agent running your life. Something smaller. I built it for my own work and call it a loop of loops: a set of narrow recurring jobs, each with memory, sources, safe actions, and boundaries, allowed to notice when one affects another.

A prompt helps when you know what to ask right now. A loop helps when the same obligation keeps coming back. A loop of loops helps when those obligations stop being independent, because a change in one place should wake up work somewhere else.

Here’s what’s inside:

The hidden loop around every prompt. Typing the email takes a minute. Knowing why it exists, what it cannot promise, and who to chase next is the part nobody sees.
How apps turned you into the integration layer. The job the home screen never learned to do, and why more AI inside more apps will not fix it.
When your recurring jobs notice each other. Rain changes packing, a late pickup changes the calendar, and those dependencies finally live somewhere other than your head.
The beginner-safe stack to build first. A few narrow household loops with hard edges, the kind that draft the message and stop before sending it.
Find your first loop. Two prompts do the interviewing: one walks you to a single loop you could build this week, the other finds where those loops should notice each other and stops you before you point any of it at something you can’t undo.

You do not need to automate your life or trust a system that pretends it is simple. You need to start seeing the work the app era trained you not to count.

Grab the 5 prompts that get you ready for Fable 5 before it's back

Nate — Tue, 23 Jun 2026 13:03:23 GMT

Author’s Note: The best model in the world is offline, and I’m publishing my review of it anyway.

Fable 5 arrived as the most capable model I’d ever tested, and within days the US government pulled it from production and Anthropic switched it off worldwide. As I write this, nobody can tell you when it returns. I recorded this review before any of that, pulled it when the news hit, and then changed my mind, because going dark about what this model can do felt like the wrong response to losing access to it.

Why it’s worth your time even though you can’t touch it: what Fable showed me isn’t staying with Fable. That big-model feeling is the leading edge of what lands in the next ChatGPT, and in the open-weight models a few months behind it. The capability is coming to tools you can run. What stays scarce is knowing what to ask of it, and you can build that today. One thing I’ll save you first, since my inbox is full of it: no, you can’t reconstruct Fable from a system prompt or a stack of smaller models. I tested the recipes. They don’t hold.

So in the meantime, enjoy the piece, and I can't wait to hear what you get up to once Fable's back online.

A few days into Fable 5, I noticed I’d stopped doing the thing I always do with a new model.

I stopped hovering.

I’d handed it a properly ugly job (a database I keep around that I’ve poisoned on purpose, full of ghost records, corrupt files, and planted traps, the kind of mess that lives in every company’s back office), and instead of sitting there watching it think, I went and did something else. When I came back, the work was done. Not answered. Done. Real files. A clean database with the garbage quarantined instead of “fixed” behind my back. And a review queue it had built for me, unprompted, holding every call it wasn’t sure about — as if it expected to be checked.

That was the moment this stopped being a launch for me and started being a question.

I want to be careful here, because “the new model changes everything” is the most exhausted sentence on the internet, and if you’re done with it, you’ve earned that. The honest part first: if you use Fable 5 the way most of us actually use AI (summaries, rewrites, a quick draft, a code snippet), you will not feel what I felt. You’ll feel a slow, expensive model doing fine work that a cheaper model does faster. The early reviews calling it overkill aren’t wrong. They’re accurately describing the size of the ask.

My through line is this: Fable 5 is the first model that’s bigger than our habits.

For three years, the model was the limit. You learned where it broke, and you asked underneath that line. These last few days, for the first time, I kept running out of line before the model did. The wall I hit wasn’t the model running out of ability. It was me running out of work I knew how to hand over.

That’s not a capability story. That’s a skill story. The skill, which I’ve been calling detailed task imagination, is concrete, learnable, and almost nobody is teaching it, because we’ve spent three years teaching prompts instead.

And the part that actually gives me energy is the response. When I talk about this (in my community, in my inbox, in conversations with operators), what comes back isn’t fatigue but hunger. People aren’t tired of AI. They’re tired of being told it’s amazing while their actual experience of it stays small. Asking bigger is the first instruction in three years that matches the size of the technology, and people can feel it.

Here’s what’s inside:

Why AI has felt smaller than the headlines. The ask-size problem, and why your fatigue is evidence, not cynicism.
What “bigger” means in practice. What changes when a model can carry a whole job, and where Fable 5 still falls down.
Task imagination, defined and demonstrated. The first-week question, six functions of Fable-grade work, and one job walked all the way through.
The concrete shift. How to restructure one week of work around jobs instead of prompts, starting this week, including what the first run will get wrong.
The Whole-Job Spec. The nine fields I fill out before I hand over anything real.

By the end of this you’ll have spotted one such job on your own desk. Almost everyone I walk through it does. First, why it hasn’t felt true until now.

Executive Briefing: Your team is running agents nobody owns. The one-page card and two prompts that fix it.

Nate — Sun, 21 Jun 2026 15:01:12 GMT

The fastest way to make an AI agent dangerous is to let everybody use it and nobody own it.

The moment an agent starts reading real files, drafting real messages, and changing things other people rely on, it stops being a tool you use and becomes work you are responsible for. When nobody owns that work, it does not fail loudly. A support agent keeps answering from a policy that changed last quarter. A planning agent keeps turning noisy tickets into clean-looking priorities. The agent keeps running, the output keeps arriving, and the value drains out while everyone stays busy.

That is the haunted house version of a company: rooms full of automated systems still moving long after the reason for them disappeared, leaving drafts, tickets, and recommendations that look like progress and change nothing. No more haunted houses.

Every useful agent eventually becomes part of the work, and the work needs one accountable owner. Not a committee, not the AI team by default. One person close enough to know whether the agent is helping, drifting, or just adding polished noise. Knowing exactly when an agent crosses from convenient to consequential is the difference between owning your tools and being run by them without noticing.

This briefing covers:

The one-sentence ownership test. The rule that tells you exactly when an agent becomes work someone has to own, and who that someone is.
How ownership scales. Why the job changes as you move from a personal agent to a shared team agent to a multi-agent pipeline, and where each one tends to break.
How ownerless agents fail. The ordinary failure modes that turn a useful agent into confident noise: stale diets, rotted instructions, dead review loops.
The Agent Owner’s Card. A one-page artifact that makes an agent visible, the human-readable counterpart to the machine cards agents already hand each other, plus a prompt you point an existing agent at so it drafts the fields it can see and hands the ownership questions back to you to answer and review.
Why this is not IT governance. A committee can govern the road. One person still has to own the agent doing the work.

Getting this right is less about how many agents you run and more about knowing who owns each one.

Your skills are leaving your hands. Don't let a rent-a-brain keep them.

Nate — Fri, 19 Jun 2026 13:02:17 GMT

You finally got an agent working exactly how you wanted, and then you switched tools and it broke. That small, maddening moment is becoming one of the most expensive problems in knowledge work, and it is the second act of a problem that began with memory.

Open Brain was about memory leaving our heads. Notes, transcripts, Slack threads, calendars, meeting summaries, project histories. Huge parts of what we know had already moved into software. The trouble was that AI could not reliably use that memory, and when it could, the memory was trapped inside one company’s product. Open Brain made one claim: your memory should be yours. Whoever owns the memory owns the starting point, and the architecture itself was never the interesting part. If your context lives in one app, every new tool makes you start over and every better model arrives with a switching cost.

Open Skills is the same fight, one level up, and more urgent. Our skills are leaving our hands now too, and that should feel uncomfortable. For most of human history a skill lived in a person. You knew how to research, write, code, test, review work, and recover when something broke. You carried the standards, the shortcuts, the taste, and the checks that kept bad work from becoming real damage. AI is turning that into software. Your way of working can now become prompts, SKILL.md files, runbooks, scripts, MCP configs, permission boundaries, and agent workflows. The thing that used to live in your hands can now live in a harness.

That is one of the biggest opportunities in AI. It is also one of the biggest ownership fights. Because if your skills are going to live outside your hands, they should not belong to a rent-a-brain. Not to Claude because it shipped the best skill format this quarter, not to Codex because it gave you the best workbench, not to Cursor because the project started there, not to ChatGPT because the subscription was convenient. They should be yours. Visible, movable, inspectable, testable, and available wherever you work.

Today almost none of that is true. The prompt copies over. The intention copies over. The skill does not. So you rebuild it from memory, a teammate improves a different copy and never tells you, and your best workflow ends up stranded in a chat history nobody can find again. You are not short on intelligence or context. You are short on a way to carry the procedure itself.

That gap has a cost, and it compounds. Every tool switch becomes a rebuild. Every new hire starts from scratch. Every improvement risks dying as one person’s private habit. This is not a hobby problem. If agents are part of how you do your job, the skills you build with them are career capital, and right now most of that capital is rented back to you by whichever vendor’s tool you happened to build it in. Putting it back in your hands is one of the real practical projects of 2026, and it is bigger than any single file format.

Here’s what’s inside:

Open Skills, the library. A public set of agent skills and runbooks you can install today, built on one rule: the way you work should be yours, not rented back to you.
The debt you didn’t know you were carrying. The four ways this breakage shows up at work, and the name for what you keep paying down.
Prompt, memory, skill. Three things people collapse into one, and why only one of them survives a model change.
The work package. A checklist that separates a skill you actually own from a lucky setup you happened to land in one app.
One messy workflow, every tool. A real support-billing process rebuilt so it travels across Codex, Claude Code, and Cursor instead of dying on contact.
The one-question test. The single thing to ask about any workflow to find out whether you own it or rent it.

The pain is already here, but so is the fix, and it is closer than it looks. You do not need a new platform or a new subscription. You need to own the way you already work and make it move with you. The rest of this is how: start with the workflow that keeps breaking, end with skills that travel anywhere you do.

Vercel deleted 80% of its agent's tools and the agent got better + what to delete from yours (guide inside!)

Nate — Wed, 17 Jun 2026 13:01:05 GMT

I learned maintenance first from boats.

Not as a theory. As a job. I maintained boats in Indonesia, where saltwater finds every shortcut and the difference between “probably fine” and fine gets very real once you are away from shore. You look at lines, fittings, pumps, batteries, corrosion, and weather differently when the thing you are maintaining is also the thing that has to bring you back.

I also watched planes get maintained there, then climbed into them hoping the work had been done well. Mostly it had. I mean that literally: one engine memorably failed over the jungle once, and it was, in the end, fine. Not because failure is harmless. Fine because the plane stayed a plane, the people knew what to do, and there was enough care and margin in the system that the failure stayed local.

That is what “mostly” means in maintenance. Things still break. The point is that the thing has been cared for well enough that when something breaks, the failure stays small.

The agents you have already built will keep producing work long after they stop being right. Keeping them honest is about to be one of the most valuable AI skills there is.

Maintenance is one of those words that sounds dull until you depend on it. Then it becomes intimate. You notice the sound that was not there yesterday, the frayed edge, the mechanic’s face. You learn that care is not a feeling. It is inspection, memory, habit, replacement, skepticism, and respect for the ways things fail.

There is a Barry Lopez line I have carried around for years, from the end of “The Orrery”:

If one is patient, if you are careful, I think there is probably nothing that cannot be retrieved.

The word I keep is retrieved: not fixed, not replaced, but stayed with long enough to bring back.

That is the spirit this AI conversation is missing.

We talk about agents as if the hard part is getting them to exist. Build the agent. Launch the agent. Connect the tools. Give it memory. Let it work.

But anything useful enough to depend on becomes something you have to maintain. That is true of boats. It is true of planes. It is true of buildings, institutions, data pipelines, customer-support systems, editorial standards, and software. It will be true of AI agents too.

Vercel’s sales agent story is easy to read the wrong way.

The obvious version is the labor story. Business Insider reported that Vercel trained an AI agent on one of its best sales development reps, used it to handle much of the inbound sales workflow, and moved from a ten-person inbound team to one person overseeing the agent while the rest shifted into more complex outbound work.

But often the biggest story isn’t the most useful one.

The more useful story is what had to be true around the agent for the work to become trustworthy. Vercel did not just tell a model to “do sales.” Engineers watched a strong rep. They documented the workflow. The agent filtered inbound messages, qualified leads, researched companies, drafted responses, routed support questions away from sales, and had a human reviewing its work in Slack.

In other words, the agent had a workbench. It had sources. It had tools. It had a defined job. It had handoffs. It had a review path. It had feedback. It had a human who could see what was happening. The agent was not a free-floating brain. It was a system around delegated work.

That is the part most people still miss.

The obvious question is, “Can I build an agent?”

The better question is, “What workbench does this agent need?”

The mature question is, “How do I keep that workbench healthy after the agent starts working?”

That third question is agent maintenance. And it is about to matter more than the building, because delegated intelligence creates a maintenance surface. Once a system reads context, calls tools, remembers preferences, drafts work, or touches a workflow other people depend on, someone has to keep the setup around it fit for the job.

Here’s what’s inside:

The two ways agents break. One when the world around them drifts, and one stranger failure: the model underneath them gets better, and the harness built for its old weaknesses turns into dead weight.
Why “more” is the wrong instinct. More context, more tools, more memory feels like care. Usually it is the thing rotting your agent from the inside.
The seven parts that go stale. Job, diet, memory, tools, reach, proof, and value — the harness around the model, and the specific way each one fails before you notice.
Five agents, maintained in the open. A writing agent, a product-backlog agent, a Codex workflow, a support and revenue-risk agent, and a content pipeline, each shown drifting and each pulled back.
The loop I run before I trust one again. The short, plain maintenance pass I walk before letting any agent stay close to real work.
The audit, ready to run. The loop turned into a guide you can point at a live agent today: the last ten runs, the seven surfaces, and a keep, change, pause, or retire call before you trust it again.

Below, the seven parts of the harness, what breaks where, and the loop I would run before trusting any agent that is part of the work.
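One way to picture the audit pass is as a small record that refuses a verdict until every surface has been looked at. This is a rough sketch under stated assumptions: the seven surface names, the ten-run window, and the four verdicts come from the list above, while the `AgentAudit` class, its fields, and its methods are hypothetical and are not the guide's own format.

```python
from dataclasses import dataclass, field
from enum import Enum

# The seven surfaces, the ten-run window, and the four verdicts are named in
# the text above. Class names, fields, and methods are hypothetical.
SURFACES = ("job", "diet", "memory", "tools", "reach", "proof", "value")
RUNS_TO_REVIEW = 10


class Verdict(Enum):
    KEEP = "keep"
    CHANGE = "change"
    PAUSE = "pause"
    RETIRE = "retire"


@dataclass
class AgentAudit:
    agent_name: str
    recent_runs: list[str]  # one note per run, newest last
    findings: dict[str, str] = field(default_factory=dict)  # surface -> what you saw

    def runs_in_scope(self) -> list[str]:
        """Only the most recent runs are reviewed."""
        return self.recent_runs[-RUNS_TO_REVIEW:]

    def note(self, surface: str, finding: str) -> None:
        if surface not in SURFACES:
            raise ValueError(f"unknown surface: {surface}")
        self.findings[surface] = finding

    def unreviewed(self) -> list[str]:
        return [s for s in SURFACES if s not in self.findings]

    def decide(self, verdict: Verdict) -> Verdict:
        # The call stays with the human; the code only blocks an incomplete pass.
        if self.unreviewed():
            raise ValueError(f"surfaces not yet reviewed: {self.unreviewed()}")
        return verdict


# Example values are invented for illustration.
audit = AgentAudit("support-agent", recent_runs=[f"run {i}" for i in range(1, 13)])
for surface in SURFACES:
    audit.note(surface, "checked against current policy")
print(audit.decide(Verdict.KEEP).value)
```

Nothing here decides for you. The sketch only makes it awkward to call an agent healthy after checking two surfaces out of seven.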

Executive Briefing: Your company is about to get cheap intelligence. That is not the same as being able to use it.

Nate — Sun, 14 Jun 2026 15:00:46 GMT

Last week I got a call from a founder I’ve known for years. He runs an agentic startup, meaning his software does work for customers instead of waiting for customers to work inside the software. He had just moved his product off a frontier model and onto an open-weight model.

His model bill dropped 97 percent in one month.

He kept refreshing the billing dashboard because he did not believe the number.

Two days later, OpenAI filed to go public.

Those are not the same kind of fact. One is a private operating detail from a founder who sees model costs directly in his margins. The other is one of the biggest capital-markets events in technology. But putting them next to each other clarifies the problem leaders now have to think through.

The market story is that intelligence is scarce. OpenAI, Anthropic, and xAI are preparing public investors to value them as if the companies that own the best intelligence will own a huge share of the next decade of enterprise value.

The company story is different. For a growing amount of everyday work, intelligence is getting cheaper very quickly. The hard part is not always getting a smarter model. It is making the company ready to use the intelligence it can already buy.

These IPOs are priced on intelligence. But the work that will matter inside most companies is not just choosing the smartest model. It is deciding how the company needs to change when intelligence becomes cheap enough to put everywhere.

An AI pilot is too small for that question. So is one automated workflow. The harder questions are structural, and they sit in a layer most companies have never had to name.

That layer around the model is the harness.

The model supplies intelligence. The harness supplies the company: the context, documents, permissions, review standards, memory, budgets, decision rights, and accountability that make intelligence useful in a real organization.

The labs can sell you models, tools, even engineers to install them. What they cannot sell you is your own operating context, or the judgment that decides when an output is good enough to ship. Those are the decisions leaders cannot outsource.

A caveat: the S-1s are confidential. Nobody outside the companies, the SEC, and their advisers has read the full documents. The numbers circulating right now are reporting and estimates, and I will treat them that way. What we can read is the public behavior around the filings: the enterprise revenue story, the compute commitments, the pressure from cheaper models, the claims about self-improving AI, and the very human deployment businesses these companies are building around their models.

The question is not whether your company will use these models. It will.

The question is whether the structure around them becomes something you understand and own, or something that happens to you.

This briefing covers:

The two worlds you now have to operate in. Public markets are pricing intelligence as scarce, while inside most companies the cost of useful intelligence is collapsing. Both are true, and you have to plan for both.
The harness, and why it may be the real scarce asset. The model supplies intelligence. The company layer around it, meaning context, permissions, review standards, memory, and decision rights, is what turns intelligence into trustworthy work, and it is the part the labs cannot fully sell you.
Why the labs are hiring humans to install AI. The same companies telling investors that machines may soon improve machines are also staffing up to go workflow by workflow inside their customers, which tells you where the hard part actually lives.
Five numbers to read when the S-1 opens. A short scorecard for telling, from the filing itself, whether intelligence stays the scarce asset or whether the value moves to whoever puts it to use.

Wall Street is pricing the bet that intelligence stays scarce. What follows is the part about what your company has to become.

Grab my Ultimate Guide to Codex and catch up to the 1 in 1,600 people using it every week (mostly no code!)

Nate — Fri, 12 Jun 2026 13:03:03 GMT

I’m going to skip the wind-up on this one.

Codex just passed five million weekly active users. That’s OpenAI’s own count, from its June 2 report on knowledge work. Run that against eight billion people and you get six hundredths of one percent, about one in every 1,600 humans alive. Be generous and count only the world’s billion or so knowledge workers, the audience this tool is quietly growing into, and adoption is still about half a percent. Most of those five million are still developers, which only widens the gap I actually care about. And every week I watch what that sliver gets done (pages shipped, pipelines run, whole jobs handed to a computer and handed back finished with receipts), and then I read my inbox, full of smart, capable people describing work that feels heavier every month, and the distance between those two experiences gets to me.
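If you want to check that arithmetic yourself, it fits in a few lines. This is a minimal sketch using the rounded figures from the paragraph above. The variable names are mine, and the one-billion knowledge-worker count is the same rough estimate used in the text.

```python
# Rounded figures from the paragraph above; variable names are illustrative.
weekly_users = 5_000_000
world_population = 8_000_000_000
knowledge_workers = 1_000_000_000  # "a billion or so", a rough estimate

share_of_world = weekly_users / world_population
one_in_n = world_population // weekly_users
share_of_knowledge_workers = weekly_users / knowledge_workers

print(f"{share_of_world:.4%} of all people")
print(f"about 1 in {one_in_n:,}")
print(f"{share_of_knowledge_workers:.1%} of knowledge workers")
```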

So here’s the closest I will ever come to yelling in this newsletter: use Codex.

Not “consider an agentic workflow.” Not “explore the space when things calm down.” Things are not going to calm down. Use Codex. It is the best daily driver in AI right now, it is absurdly underused, and the gap between the people running it and the people who haven’t touched it is the most fixable gap in your entire working life. It is not a talent gap. It is not a technical gap (most of what’s in this piece requires zero code). It is a setup gap, and a setup gap closes in a weekend.

That’s what this piece is: the catch-up. I just published the complete operating guide to Codex (every habit, every prompt, every skill I actually run), and below I’m going to walk you through all of it, because five million is an embarrassing number and I intend to move it.

Here’s what’s inside:

The Ultimate Guide to Codex. Go from “I should really learn this” to a working setup in a weekend, with the model wired into your actual files and pages instead of sitting in a chat tab. Every page is a copy-paste prompt, most needing zero code.
The shift worth making. The unit of work moved from the prompt to the run, and there’s a one-paragraph test for which side you’re working on.
Why you’re not actually behind. The gap isn’t facts, it’s setup: the difference between a problem that compounds and one you can close on purpose.
Where to actually start. A first day, first week, and first month, built on one real folder, one small job, and proof you can check.

The gap is one honest weekend wide. Below is the whole thing, start to finish. But first, let me make the case for why Codex specifically.

I heard you

Thu, 11 Jun 2026 22:05:25 GMT

A lot of you asked for the applied layer. Less ‘here’s what happened in AI this week,’ and more ‘here’s what to do about it.’ That’s Zero to AI.

I soft-launched it in April, and starting today it’s part of your membership.

It assumes you’re smart and busy. It does not assume you’re technical. Inside:

How to fix answers that come back generic
A reset prompt …

Claude vs. Codex isn't about code. It's about whether you steer or dispatch.

Nate — Wed, 10 Jun 2026 13:01:57 GMT

The strange moment is not when an AI answers you.

We have gotten used to that. You type something into a box. A model writes back. Sometimes it is useful. Sometimes it is nonsense. Either way, it still feels familiar. You asked. It answered. You are sitting there, judging the response in real time.

The strange moment is when an AI comes back with work.

It read the folder. It edited the file. It ran the command. It compared the sources. It says it is done.

And now you have a new problem. You did not do the work. You may not have watched every step, or know which assumptions it made, which branch of the task it abandoned, or which shortcut it took because the shortcut made the answer look cleaner. But the work is sitting in front of you.

Is it real?

That question is the real Claude Code versus Codex story, and almost nobody frames it that way. Everyone wants to know which tool is better, which model is smarter, which writes cleaner code, which wins the benchmark. Fair questions. They are not the main event.

These tools are training us to manage AI labor, and they train us to do it differently. Claude teaches you to steer agents. Codex teaches you to dispatch them. That sounds like a workflow note. It is deeper than that. Use one long enough and it changes what you reach for when a problem lands: another conversation, or a better assignment.

I run into this every working day. Getting the machine to do work is the easy part. The hard part is deciding when the work is good enough to leave the machine. That decision is going to define a lot of white-collar jobs, and not because everyone will learn to code. More people are simply going to start receiving work from machines they did not supervise. The first time it happens, it feels like magic. The tenth time, it feels like management.

That is why these tools matter even if you never write code. They showed up in software first because code has clean feedback loops, but the habit is already spreading into research, sales notes, spreadsheets, legal summaries, support triage, and every kind of knowledge work that lives in files and messages. Neither one is really just a coding tool anymore. The useful question is what kind of AI worker each tool is training you to become, and those habits will outlast this month’s leaderboard.

Here’s what’s inside:

The two ways agents fail you. Understanding theater, where a good conversation convinces you the work was understood, and completion theater, where a finished run feels far more done than it is.
The jargon, decoded. Context, permissions, worktrees, hooks, and proof stop reading like programmer-speak and start reading like the moving parts of any assignment you hand a machine.
Why the real test comes after the output. A head-to-head where both agents reached the same result in completely different ways, and what that says about trusting work you did not watch happen.
The standard I would teach everyone. The five shapes every agent run takes, the six questions to answer before you launch one, and the cost almost nobody budgets for.
Four prompts you can paste today. A Run Spec that turns a fuzzy task into a bounded assignment, a steer-or-dispatch diagnostic for when you cannot tell which the work wants, an “is it real?” audit for work an agent hands back, and a cross-check that makes one agent grade another.

Let me show you how each tool trains you, where each one fails, and the standard I use to keep the work honest.

Executive Briefing: Uber Burned Its Entire AI Budget Early. The Bill Was Trying to Tell Them Something.

Nate — Sun, 07 Jun 2026 15:01:35 GMT

The next AI budget fight will not start because employees refuse to use AI.

It will start because they finally do.

This is why the date matters. This is a June 2026 problem, not a December 2025 problem.

In May 2026, Uber became one of the first big companies to make the new problem concrete: 95% of its engineers now use AI tools every month, most of them in agent-style workflows, and an internal coding agent writes roughly 1,800 code changes a week. Uber was not playing with chatbots. It was doing exactly what every board has been demanding: get serious about AI, put the tools into real workflows, find the leverage.

Then the cost story broke. Uber’s CTO, Praveen Neppalli Naga, reportedly told people the company had blown through its entire 2026 AI budget months early. The easy read was that the tools cost too much and employees need reining in.

I think that read is incomplete. The sharper signal came from Uber’s president and COO, Andrew Macdonald, who said the company can see the usage, the commits, and the token spend, and still cannot cleanly connect any of it to better features for customers.

That is the real story, and it is bigger than Uber. The bill is the first hard evidence that AI has crossed from a tool you buy into labor you have to manage, and almost no company has built a system to manage labor it cannot see. Read correctly, token burn is not waste but information about a kind of work the company has not learned to run yet.

Where you sit decides what the bill threatens. If you own the budget, it becomes the line item that justifies a layoff you did not want to make. If you run engineering, it becomes the cap that kills the experiments that were working. If you do the work, it turns “used too much AI” into a performance problem instead of a signal that you found a job worth automating. Same invoice, three warnings, one missing system.

The companies that win this will not be the ones that spent the least or the most. Spending freely and capping hard are both easy, and both are wrong. The harder answer is the one in between, and the rest of this briefing is how you get there.

This briefing covers:

The real shape of the AI cost curve. Why the work you actually want from frontier models keeps getting more expensive even as the price per call falls, and what that does to next year’s budget.
A routing rule for every AI dollar. One principle, minimum effective intelligence, for deciding when a job needs a frontier model, an open model, or no model at all.
Why your 2025 budget model is the thing breaking. Seats and licenses cannot price work that plans, retries, and runs for hours, and a better dashboard will not save it.
The operating model that replaces the token cap. What an agent-first company actually changes: work objects, gates, permissions, and the training that turns usage into compounding advantage.
How to read your own token bill. A way to tell production from tuition from waste from the signal that you just found a workflow worth turning into infrastructure.
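To make the routing idea concrete, here is a rough sketch of what "minimum effective intelligence" could look like as code. This is my reading of the phrase, not the briefing's actual rule: the tier names, the `route` function, and the idea of a per-job pass/fail check are all assumptions for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical tiers, ordered cheapest first.
TIERS_CHEAPEST_FIRST: List[str] = ["no_model", "open_model", "frontier_model"]


def route(passes_quality_bar: Dict[str, bool]) -> Optional[str]:
    """Return the cheapest tier that clears the job's quality bar, or None if none does.

    `passes_quality_bar` maps a tier name to whether that tier was good enough
    for this job (for example, from a spot check you ran yourself).
    """
    for tier in TIERS_CHEAPEST_FIRST:
        if passes_quality_bar.get(tier, False):
            return tier
    return None


# A job that a plain script cannot do but an open model handles fine.
job = {"no_model": False, "open_model": True, "frontier_model": True}
print(route(job))  # open_model
```

The point of the shape is the ordering: the frontier model is the last thing tried, not the default.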

The argument runs in seven parts, and it ends somewhere you can use: an operating model and a routing rule you can take into your next budget conversation. The setup is free. The system is below.

You can't trust one token number across your tools. Here's the guide to a dashboard that keeps Codex, Claude, and ChatGPT honest.

Nate — Fri, 05 Jun 2026 13:02:43 GMT

As of this writing, the biggest single day I have ever run through Codex is north of 860 million tokens, counted exactly. By the time you read this the number will be higher, because I keep giving the computer more to do.

You could read that as a brag. It is the least useful thing the number can tell you.

A token count is not a scoreboard. It is a trace. It shows where you handed work to AI, how much delegated intelligence you spent doing it, and whether your behavior is actually changing. Tie that trace to outcomes and it stops being a cost chart or a status flex and turns into something better: a feedback loop for what your computer should do next. That is the whole reason to build one.

That record day was not a day of asking for more paragraphs. It was a day when more of my work surface moved through agents: files, browser sessions, drafts, local tools, source notes, checks, revisions, automations, and several threads each carrying part of something real.

The stake here is bigger than my token count. The models keep getting better and the tools keep getting broader, but a lot of capable people cannot feel it, because their own usage settled into a groove a year ago. If your picture of AI is still “ask a question, get an answer,” you will keep leaving the most valuable work on the table without noticing.

They ask for a paragraph when they could ask for a full draft. They ask for a summary when they could ask for decisions, owners, and the follow-up note. Then they look at what comes back and call AI useful but not transformative. Of course it feels that way. They gave it assistant work. They never gave it computer work.

A dashboard is how I catch that gap in myself. It does not make me better at using AI any more than a fitness tracker makes you healthy. What it gives me is the loop: a way to see whether AI is expanding what I can do or just making the same old work a little faster. That is why I built it.

You can poke at the live version here: the beta dashboard. This is the one I originally built last week, running on my own usage. The version in the guide is an improved build of it, but I am leaving this one up for reference.

Here is what is inside:

Build your own token dashboard from scratch, with a step-by-step walkthrough, the prompt I used, and a full build video over in the guide.
Start from a ready-made kit for your stack instead of a blank page, whether you live in Codex, Claude, ChatGPT, or all of them at once.
The line between assistant work and computer work, and why being stuck on the wrong side has nothing to do with the model.
Five rules for reading your chart, including why a quiet stretch can be a worse sign than your biggest spike.
A fifteen-minute weekly review that turns your best one-off runs into workflows you stop rebuilding.
Why ranking a team by token volume backfires, and the record that actually shows who can lead an AI rollout.

Opus 4.8 scored 81 in my benchmark. I still wouldn't default to it. (The full breakdown + Nate's Community Slack)

Nate — Wed, 03 Jun 2026 13:01:07 GMT

Claude Opus 4.8 is excellent. The harder question is where it should replace your current workflow, where it should be a specialist, and where turning the reasoning dial up can make the work worse.

After I read the runs, I wanted the recommendation to be simple: use Opus 4.8.

The score almost lets you say that. In my current benchmark suite, Opus 4.8 is the leader. It scored 81 on the strict average. GPT-5.5 scored 71. The rest of the field was well behind: Gemini 3.5 Flash High Fast at 56, Opus 4.7 at 54, Sonnet 4.6 at 52, GPT-5.4 at 51, and Gemini 3.1 Pro at 38.

If all you want is a leaderboard, the article can end there.

But that would be a bad article, and it would make you worse at choosing models.

The result gets more useful when you stop at the individual runs. Opus 4.8 won because it was much better than Opus 4.7 at the parts of work that usually break professional AI output: source discipline, operational judgment, canary handling, provenance, self-correction, and knowing when a messy data problem should be reviewed instead of quietly “fixed.”

I care about that more than I care about a slightly prettier answer.

It also did not win every task. GPT-5.5 beat it on the Artemis visualization. Opus 4.8 still had visual and front-end weaknesses in multiple runs. And outside our suite, Andon Labs found a long-horizon business benchmark where Opus 4.8 on max effort did worse than Opus 4.8 on high effort, and both did worse than Opus 4.7.

That last point is the one I keep coming back to, because it breaks the lazy way people talk about model launches.

We are used to asking, “Which model is smartest?”

I still want to know the answer. But if you are actually building, managing a team, buying enterprise licenses, or choosing your own daily tool, the question has more parts:

What work are you doing?
How long does the task run?
What source material does it need?
What tools can the model use?
Can it inspect the artifact it just made?
Does it preserve state when the work gets long?
How much does the human have to babysit it?
What happens when it is uncertain?
What does a failed run cost you?

Those questions decide whether the model saves you time or creates another review queue.

So I am not treating Opus 4.8 as a “switch everything” release. It is one of the best models available right now. It is the best model in my current strict suite. I would use it aggressively for some work. I would not blindly make it my default for every long-running workflow.

Here’s what I’m covering:

Every test, scored and picked apart. Where Opus 4.8 won, where GPT-5.5 beat it, and where the score hides real caveats.
The effort-level trap. The Vending-Bench data on why max can make long-running work worse, and how I configure each mode for real work.
How I actually choose my daily tools. Why I still reach for Codex/5.5 despite the score, plus a routing guide for when to use Opus 4.8, Codex/5.5, and GPT-5.5.
What builders, leaders, and executives should each do differently. Role-specific guidance and four prompts you can paste and use today.

The reasoning is below, along with the tools to make the same decision for your own stack.

Why I’m moving this Substack from daily coverage to deeper weekly work

Nate — Mon, 01 Jun 2026 13:04:03 GMT

I need to make a change to how I show up here.

For the last year or so, I’ve been in your inbox almost every day covering AI. That was not a gimmick. It made sense for the moment we were in. A new model would ship in the morning. A new capability would show up by lunch. By the afternoon, the practical answer to “what can I build with this?” had changed.

During that period, keeping up really mattered.

I think that period is changing. The models are here. The agents are here. The harnesses are here. Execution is getting cheaper by the month. A prototype, a draft, an analysis, a small internal tool, all of that is easier to create than it was a year ago.

But that has made one thing more obvious to me, not less: the hard part is understanding what to build, why it matters, and how to use these tools well enough that they change your work.

I don’t think daily AI news is the best way to build that kind of fluency anymore. It can help you know what launched. It cannot, by itself, help you get good.

And I want this Substack to help you get good.

So I’m changing the cadence.

Three anchor pieces a week

Starting now, I’m going to build this around three serious pieces each week.

The first is the deep dive. I’ll take the most important AI story or development of the week and go underneath the headline: what happened, what changed, what people are missing, and what it means for the work you’re actually doing.

The second is the build. This is the one I’m most excited about. Each week I want to give you something practical enough to take to your own machine. Not a “hello world” demo. A real build. The kind of guide where you understand the tool better because you have actually made something with it.

The third is the executive briefing. For paid executive subscribers, I’m going to keep sharpening the weekly read on where to invest, what leaders need to understand, what is noise, and what decisions are worth making now.

That is the new default: fewer pieces, more depth, more work that is worth keeping.

What this looks like first

I want to be concrete about the first few weeks, because otherwise this can sound like a nice editorial promise.

I’m working on a full Codex guide that goes well past the feature list. I want to show where it works, where it breaks, and what it feels like to build with it in real workflows day after day.

I’m also building a token burn dashboard you can run yourself. If you are using APIs and can’t see where the money is going, this will put a working answer on your machine.

And I’m writing a deep dive on Claude Opus 4.8: what actually changed, what the benchmarks don’t capture, and what it means for the things you are shipping right now.

That is the standard I want to hold myself to.

Two other changes

I’m also standing up a Slack workspace for paid subscribers.

The reason is simple: a lot of the best conversations around this work happen in replies and DMs, and then they disappear. I want a place where serious builders can find each other, where I can surface useful things faster than a weekly article allows, and where you can ask for help on a Tuesday afternoon instead of waiting for the next post.

If you’re a paid subscriber, you’ll get an invite this week.

And I am not disappearing from news.

If something drops that changes how you should work, I’ll cover it. A major model release, a new capability, a shift that changes the practical answer to “what should I do this week.” Those will still get their own coverage.

The bar is just higher. If it matters, you’ll hear from me. If it’s interesting but not urgent, I’ll save it for the deeper work. That also means fewer emails from me, which I know some of you have asked for. I’ve heard that.

What I’m trying to build here

The gap I keep coming back to is the gap between “I’ve heard of that tool” and “I can build with that tool.”

I think that gap is going to matter a lot in 2026. For careers. For companies. For anyone trying to lead, learn, build, or make good decisions while the ground keeps moving.

Most people are going to keep skimming. They’ll read the launch post, save the thread, watch the demo, and feel like they’re keeping up.

But keeping up is not the same as building fluency. Sometimes it is just watching other people build.

I don’t want that for this audience. I want this to be the place where you come to understand what matters, practice it, and leave with something you can actually use.

I can’t do the practice for you. But I can make the work clearer. I can give you better guides, better judgment, better builds, and a stronger place to talk through the hard parts with other people doing the same thing.

That is what I’m committing to now.

Fewer emails. Higher bar. More depth. More work you can carry into your own projects, teams, and career.

I’m grateful you’re here. I’m especially grateful to everyone who has stuck with the daily pace, replied, challenged me, asked for more depth, and pushed this into something better.

Let’s build.

Nate

Executive Briefing: Your career evidence is thinner than you think + 3 prompts that rebuild it

Nate — Sun, 31 May 2026 15:02:24 GMT

A reader wrote in with a job search problem that has been sitting with me. They had been laid off, and the work they were proudest of was the work they could not show.

They had made decisions with limited information and kept a team moving through a quarter that could have gone sideways. They had understood a messy system better than anyone else in the room. In interviews, they had to tell those stories over and over to people who were skeptical by default.

I’ve been thinking about why that problem keeps getting worse.

AI did not break human judgment. It broke the signal that judgment used to leave behind. A polished memo no longer proves you understood the business. A clean prototype no longer proves you understood the user. Everyone can look productive now. The hard part is seeing who actually understood the work.

And this runs in both directions. If you run an organization, your evaluation systems are losing signal. If you are the talent — especially if you sit far from the execution layer — the evidence problem is worse. You did not write the code. You did not design the screen. You made the call that kept the company from spending millions on the wrong bet, and there is no artifact for that.

The thinking layer has to travel with the work.

The evidence problem I am describing hits hardest at the executive level, because the work is almost entirely judgment — portfolio bets, org design, market timing, decisions that shaped the company for years. None of that ships as a work sample. So this briefing talks to you as someone who evaluates others, but also as someone who may be facing the same problem from the other side.

This briefing covers:

Build portable judgment evidence. Four questions (situation, decision, risk, change) applied from IC to division lead, with a sanitized case showing how to share reasoning without leaking confidential work.
Change how you evaluate and get evaluated. What to ask in interviews, what to look for in promotion packets, and why AI makes the old evidence unreliable on both sides of the table.
Use the prompts that do the hard extraction. A diagnostic that flags where your career evidence is thin, a builder that interviews you through one real decision, and a question set for when you’re on the hiring side.
Put the evidence somewhere it travels. Why judgment artifacts need a home you own — a TalentBoard profile, a personal site, a packet you bring to interviews — before your badge stops working.

Your prototype graveyard is leaking secrets. The Prototype Classifier + Demotion Audit decide what stays

Nate — Fri, 29 May 2026 13:03:25 GMT

Product management has always been a rationing job. Most ideas would not get built. Engineering time was scarce. Coordination was slow. A roadmap was partly a strategy document and partly a rationing system, and product managers helped decide which customer problems, executive priorities, technical constraints, and market bets deserved the company’s limited ability to make software.

That role is changing, because the cost of a first version has collapsed. The thing entering the product conversation is no longer a request. It is a working artifact. A dashboard. A lightweight app. An agent that already touches a system of record.

The scale this reaches is already documented. Inside Microsoft, employees have built more than 1 million Power Platform citizen-development assets: 18,000-plus environments, 170,000 apps, 50,000 automated flows, 1,200 chatbots. Most companies are nowhere near that, but the shape of the problem is arriving everywhere, and the product function is the part of the org that has to absorb it.

The old model asked, “Should we build this?” The new model starts one step later: somebody already built something. Now the company has to decide whether it should matter. The PM is no longer mainly a coordination role around scarce engineering. It becomes the discipline that classifies software abundance into market value, internal reliance, or deletion. That is a more strategic job, and a more technical one. Get it wrong and the failure is not loud. You do not get an outage on launch day. You get a pile of half-real tools nobody owns, spreading into systems of record before anyone decided they were allowed to.

Here’s what’s inside:

Why the old roadmap filter broke. When a first version costs almost nothing, rationing engineering time stops being the job. You get a clear read on what replaces it, and why the shift is more strategic than the prototyping conversation suggests.
A four-state ladder for classifying what your team builds. Personal tool, team beta, supported internal product, customer-facing product, with the specific user-count and risk thresholds that move a tool from one rung to the next.
The demotion triggers almost everyone skips. The exact signals that tell you a tool you still support has stopped earning it, so you stop paying to keep dead software alive.
Two prompts you can run this week. One classifies any employee-built tool into its real production class and names what promotion would take. The other audits a tool you already support and tests whether it should be demoted.

The cost of making software fell. The cost of being wrong about what you depend on did not. Below is how the product job changes when production stops being the scarce input, and the two prompts that turn it into something you can run on Monday.

Your agent dashboard is green. The run underneath it is where the work actually broke.

Nate — Thu, 28 May 2026 13:03:01 GMT

A Cursor agent deleted a software company’s production database and its volume-level backups in nine seconds.

This was late April 2026. The founder, Jer Crane of PocketOS, watched it happen. It is the kind of story that gets passed around because it reads like a warning about how dangerous agents have become, or how badly one vendor failed. That reading misses the more interesting thing, which is that nothing on a normal product dashboard would have seen it coming. An active user, a long session, a healthy pile of chat messages, a feature getting used. All green, right up until the moment the database was gone.

Everything that actually mattered happened inside the run, and that is precisely the part most analytics cannot see. When the user is an agent, the unit of product behavior is becoming the agent run: the work a user handed over, the steps the agent took, the tools it touched, the boundaries it hit, the corrections it got back, and whether anyone accepted the result.
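One way to picture the agent run as a record is the sketch below. It is only an illustration of the six parts listed above; the class name, field names, and helper method are hypothetical and are not the event schema from the full piece.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentRun:
    """One unit of agent work. All names here are illustrative assumptions."""
    task: str                                                  # the work a user handed over
    steps: List[str] = field(default_factory=list)             # the steps the agent took
    tools_touched: List[str] = field(default_factory=list)     # the tools it touched
    boundaries_hit: List[str] = field(default_factory=list)    # the boundaries it hit
    corrections: List[str] = field(default_factory=list)       # the corrections it got back
    accepted: Optional[bool] = None                            # None until someone reviews the result

    def has_unreviewed_work(self) -> bool:
        """True when the agent did something and nobody has accepted or rejected it yet."""
        return bool(self.steps) and self.accepted is None


run = AgentRun(task="summarize this week's support tickets")
run.steps.append("read ticket export")
run.tools_touched.append("file reader")
print(run.has_unreviewed_work())  # True
```

A click or a session count carries none of these fields, which is why a dashboard built on them can stay green while the run goes wrong.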

For the first time in the history of software, we can watch the consequences of our decisions land in real time. You used to make a call, ship it, and wait weeks to learn whether it worked. An agent collapses that loop to minutes, and if you get good signal back while it runs, you can shape and steer it mid-flight. Speed is the engine. Analytics is the rudder. A database that vanishes in nine seconds is what happens when you have a powerful engine and no way to steer.

Here’s what’s inside:

The events that are the new clicks. What to actually count when the user is an agent and the click, the page view, and the session have stopped telling you anything useful.
Why your traces aren’t your answer. Engineering already has the execution data. Why that’s necessary, not sufficient, and what product still has to build on top of it.
The difference between a task that finished and a task the user trusted. Reading that one gap is how you tell which workflows have earned more autonomy.
The starter setup. The three events to ship this week, the full event schema underneath them, and the prompts that turn that schema into instrumentation in your own stack, your corrections into eval cases, and your numbers into a roadmap.

Most teams have filed all of this under engineering telemetry instead of product, and that is exactly why the runs keep going fast in the wrong direction. This is how you get the rudder.

The deck got forwarded with a wrong number inside. The Trust Layer's two-model review is built to catch exactly that.

Nate — Wed, 27 May 2026 13:00:35 GMT

AI builds your board deck now. Drop a folder of messy files into ChatGPT or Claude or Copilot, ask for the deck or the budget or the QBR, and you get back something that looks like finished work. The capability is real and it is not the interesting part anymore. The interesting part is that the file looks done long before it is true.

Last quarter I opened a workbook that looked like a financial model. Assumption inputs at the top, revenue projections, valuation rolling up cleanly, and a written guide attached saying the model had been validated. Then I opened the revenue growth row. The formula copied the same two cells across every future year instead of rolling forward: =C5/B5-1, again and again. Excel did not flag it. There was no #REF! error. The valuation still looked clean. A busy person signs that deck and forwards it, and the mistake travels with it.
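A check for this kind of mistake does not need to be clever. Here is a minimal sketch, assuming you have already pulled the formula text for one row out of the workbook into a dictionary keyed by column letter. The function name, the sample row, and the "should be" comments are my assumptions about how the row was meant to roll forward, not a copy of the actual workbook.

```python
import re
from typing import Dict, List, Optional

# Matches A1-style references such as C5 or $B$5 (a simplification: no sheet names or ranges).
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+")


def columns_that_did_not_roll_forward(row_formulas: Dict[str, str]) -> List[str]:
    """Flag columns whose formula text is identical to the column on their left."""
    flagged: List[str] = []
    previous: Optional[str] = None
    for column, formula in row_formulas.items():
        normalized = formula.replace(" ", "").upper()
        # Identical text that still contains a cell reference means the
        # references did not shift when the formula was filled across.
        if previous is not None and normalized == previous and CELL_REF.search(normalized):
            flagged.append(column)
        previous = normalized
    return flagged


growth_row = {
    "C": "=C5/B5-1",
    "D": "=C5/B5-1",  # assumed intent: =D5/C5-1
    "E": "=C5/B5-1",  # assumed intent: =E5/D5-1
}
print(columns_that_did_not_roll_forward(growth_row))  # ['D', 'E']
```

It will also flag formulas that repeat on purpose, such as a row that points every year at one fixed assumption cell, so treat a hit as a prompt to look, not as proof of an error.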

That is the new Office risk, and it is specific. A deck mixes current numbers with old ones. A spreadsheet carries a formula that points to the wrong cell. A model gets saved as an Excel file with almost no live formulas inside it. A chart looks executive-ready while nobody can say which source the data came from. None of these look wrong. They are wrong in the one way polish cannot show you, because polish is exactly what we are trained to read as trust.

So stop treating the generated file as the first thing you make. Make the truth layer first: an inventory of your sources, a map of which claim rests on which source, a log of every assumption, and a verification pass that tries to break the result before anyone else can. Build that, and the model gets far more useful, because now it is working on top of something real instead of guessing inside a costume that looks like work. Skip it, and you are shipping confidence you have not earned.

Here’s what’s inside:

The four-stage workflow that turns a messy folder into a file you can defend: source prep, structure, creation, and verification, in that order and not skippable.
Source prep and structure, the two stages nobody does, and the exact inventory and specification to demand from the model before it writes a single slide or formula.
The PowerPoint rules: how to turn slide headlines into traceable claims and speaker notes into the evidence layer that survives the forward.
The Excel rules: a raw-data tab, an assumptions tab, and a checks tab that works like a smoke alarm, so a broken formula trips an alarm instead of riding into a board meeting.
The Trust Layer, a guide and prompt kit for Office files that survive the forward: the guide that maps the six ways Office work breaks, plus a five-prompt runbook you paste in order, from the source-packet setup that catches conflicting numbers before a slide exists, to the two-model hostile review that hunts the formula a busy reader would have signed.

The truth layer is the whole game. The rest of this piece is how to build one.