GPT-5 at Scale: Why Reliability Slipped
Physical products get more dependable as they scale. Intelligence systems don’t — and GPT-5’s launch shows how routing failures, hardware variance, and load at scale can make a model seem dumber.
A new model can be very good and also (apparently) very bad. At once.
This is a quick note as we sort through the implications of some of the teething issues with the GPT-5 rollout.
None of this shifts the overall perspective of my review yesterday, but I think the rollout issues highlight something unique about intelligence models that is worth stating explicitly.
To be clear, I still view this as a strong update, an incredibly powerful tool for business, and a model that is going to take some getting used to as far as prompting goes.
But however good the model is, rollout can still be an issue in the first few days, and that’s what we’re seeing here.
Why GPT-5 Feels Bumpier Than the iPhone: Scale Breaks Intelligence
When Apple scaled the iPhone from thousands to hundreds of millions, every unit got better. Manufacturing tolerances tightened, defects shrank, and the process of mass production smoothed out glitches. Scale rewarded the hardware business with reliability.
Intelligence systems like GPT-5 don’t follow that curve. Every layer of complexity — routing branches, GPU clusters, reasoning modes — magnifies fragility at planetary scale. I don’t think that has been stated clearly enough as a unique challenge of LLM rollouts.
The Dual-Edged Sword of Scale
Scaling GPT-5 introduces a set of unique stress points:
Routing complexity
GPT-5 is not a single model but a family of them, each suited to different types of prompts. A central “autoswitcher” decides in real time which to use. When that system misroutes — or, as happened on launch day, goes down entirely — it can send prompts to the wrong model, producing outputs that feel shallow or off-target. The architecture’s strength becomes a liability when routing isn’t precise. (A toy sketch of this routing pattern appears after these four points.)
Hardware variance at scale
Running a model like GPT-5 requires vast GPU clusters spread across different data centers. Even tiny differences in performance, memory stability, or interconnect latency between machines can lead to subtle inconsistencies in output. At small scale, these are edge cases; at global scale, users encounter them daily.
The full load test
A global rollout is effectively a stress test. Suddenly, every prompt style, cultural context, and quirky workflow hits the system at once, uncovering failure modes no internal benchmark could have predicted. Scaling multiplies the number of edge cases and guarantees that some will slip through.
State drift
Personalization layers, fine-tuned submodels, and other “stateful” components need to stay perfectly in sync across an enormous, noisy infrastructure. At scale, they inevitably diverge — sometimes in ways that are hard to detect until output quality starts to feel inconsistent.
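To make the routing failure mode concrete, here is a minimal sketch of a router with a fallback path. This is not OpenAI’s implementation; the model tiers and the classify_prompt heuristic are invented purely for illustration. The point is what happens when the switching layer fails: hard prompts quietly land on the cheapest default.

```python
# Toy sketch of the routing-plus-fallback pattern described above.
# NOT OpenAI's implementation: the model tiers and the classify_prompt()
# heuristic are invented purely for illustration.

def classify_prompt(prompt: str) -> str:
    """Stand-in for the real-time 'autoswitcher': pick a tier per prompt."""
    hard_markers = ("prove", "plan", "debug", "analyze", "step by step")
    needs_reasoning = any(marker in prompt.lower() for marker in hard_markers)
    return "reasoning" if needs_reasoning else "fast"

def route(prompt: str, router_healthy: bool = True) -> str:
    """Route to a model tier, degrading to the cheap default if routing fails."""
    if router_healthy:
        return classify_prompt(prompt)
    # When the switcher is down or misconfigured, everything lands on the
    # cheapest tier -- which is why hard prompts suddenly feel "dumber".
    return "fast"

if __name__ == "__main__":
    prompt = "Debug this failing integration test and explain the root cause."
    print("healthy router ->", route(prompt, router_healthy=True))   # reasoning
    print("broken router  ->", route(prompt, router_healthy=False))  # fast
```

The real system is vastly more sophisticated, but the failure shape is the same: the router, not the underlying model, determines what a user actually gets.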
User Voices from Reddit & Beyond
User reactions to GPT-5’s launch have been sharp, personal, and persistent.
One developer known as Flowers responded by building a blind-test website — gptblindvoting.vercel.app — where people could compare GPT-5 and GPT-4o without knowing which was which. Both were given the same stripped-down system message to avoid giving away their identity through formatting or style. The test made it possible to judge capability without bias — and in some cases, GPT-5 still outperformed — but many testers reported that “feel” and tone were still major sticking points. For what it’s worth, I tried it myself and preferred GPT-5 nine times out of ten.
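A blind comparison like this is easy to reproduce. Below is a minimal sketch, assuming the official openai Python SDK and that “gpt-5” and “gpt-4o” are available chat model identifiers on your account; the exact model names and the neutral system prompt here are my assumptions, not the ones the Flowers site used. The idea is simply to hold the system message constant and shuffle which answer appears as A or B.

```python
# Minimal sketch of a blind A/B test in the spirit of the Flowers site.
# Assumes the official openai Python SDK and that "gpt-5" / "gpt-4o" are
# valid chat model identifiers on your account; adjust names as needed.

import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEUTRAL_SYSTEM = "Answer plainly. No headers, no bullet lists, no emojis."
MODELS = ["gpt-5", "gpt-4o"]

def blind_pair(prompt: str) -> dict:
    """Return answers labeled A/B with the model identities shuffled."""
    order = random.sample(MODELS, k=2)
    answers = {}
    for label, model in zip(("A", "B"), order):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": NEUTRAL_SYSTEM},
                {"role": "user", "content": prompt},
            ],
        )
        answers[label] = {"model": model, "text": resp.choices[0].message.content}
    return answers

if __name__ == "__main__":
    pair = blind_pair("Explain why my sourdough starter stopped rising.")
    for label in ("A", "B"):
        print(f"--- {label} ---\n{pair[label]['text']}\n")
    # Only reveal which model produced which answer after picking a favorite.
```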
In comment threads, long-time GPT-4o users lamented that their go-to model had been replaced by something more sterile. One described GPT-4o as “warm, witty, and surprisingly personal,” and GPT-5 as “polished, clipped, and weirdly impersonal … like they replaced your favorite coffee shop with a vending machine.”
On r/ChatGPT and r/AGI, others focused on the mechanics: frustration over the removal of legacy models, anger at the broken router, and skepticism about whether cost-cutting drove these changes. Some acknowledged that GPT-5 might reveal unique strengths over time, but admitted they had stopped experimenting after a day because the experience felt unstable.
The New Dimension: Emotional Attachment to AI
One element making the GPT-5 backlash so sharp is that users don’t just treat these systems as products. They model them as people. Over months or years of daily interaction, they start to see a preferred model — GPT-4o in particular — as a personality with quirks, humor, and a recognizable style. That “friend” becomes part of their workflow, their creative process, even their emotional routine.
When GPT-5 replaced GPT-4o without warning, many experienced it not as a simple product swap but as a kind of removal. Sam Altman’s team didn’t just change the software — in the eyes of these users (including some people I know), they took away someone they knew. This is a new kind of product risk: no consumer electronics upgrade has ever needed to manage the grief of losing a familiar intelligence. Now the field has to grapple with the fact that AI upgrades can trigger feelings usually reserved for human relationships, and that product decisions will land in this human-emotional context whether or not the company intended them to.
What Sam Altman Said — and What’s Changing
OpenAI’s leadership acknowledged the rocky rollout and promised quick fixes.
The autoswitcher outage that caused many poor first impressions was repaired on launch day. Decision boundaries — the internal rules that determine which model gets which prompt — are being tuned to improve routing accuracy. The company is also committing to transparency, adding a feature to show which model responded to each query.
In response to the backlash over model removals, GPT-4o is being restored for Plus users while OpenAI monitors demand and evaluates how long to keep it available. To ease frustration and encourage more testing, rate limits for GPT-5 have been temporarily doubled for Plus subscribers.
What This Means — and Why It Feels Like Regression
When a phone call drops, most people blame the network. When GPT-5 delivers a bad answer, users tend to blame the entire model. That perception gap matters.
Architecturally, GPT-5 is a leap forward: it can reason more deeply, integrate tools more flexibly, and work with larger contexts than GPT-4o. But scale has introduced new failure modes that undermine that capability in practice. Misroutes can push prompts to sub-optimal models. Hardware variance and routing issues can produce sudden dips in quality. And changes in tone or creativity hit hardest for users who built habits and workflows around the “feel” of GPT-4o.
Stability at scale isn’t just about raw capability — it requires ongoing validation across the unpredictable diversity of real-world use.
Next Moves for OpenAI
To regain trust, OpenAI will need to go beyond architectural improvements.
They need to make routing and mode selection transparent so users know what is responding to them. They need to offer accessible manual controls that let users shape the experience. They should maintain GPT-4o as a fallback while GPT-5’s reliability matures (deprecating all the legacy models that quickly was not a great choice). And they need to plan future launches around the emotional arc of their customers’ connection with the intelligence.
The iPhone became more dependable the more Apple made. Intelligence systems like GPT-5 get more capable — but also more brittle — unless they’re built for reliability at scale as deliberately as they’re built for power. And that’s a different set of organizational muscles.
We’re writing new playbooks for how rollouts are done here, and this may end up as a cautionary chapter even if the model as a whole is very strong.
Sources
Flowers blind test app — GPT-4o vs GPT-5: https://gptblindvoting.vercel.app
Flowers announcement on X: https://x.com/flowersslop/status/1821979809623986449
GPT-5 Reddit AMA and user feedback: https://www.reddit.com/r/OpenAI/comments/1mkgsln/gpt5_is_worse_no_one_wanted_preformed/
Business Insider coverage of GPT-4o backlash: https://www.businessinsider.com/upgrade-chatgpt-5-met-with-frustration-tears-gpt4o-reddit-altman-2025-8
Tibor Blaho technical breakdown: https://x.com/btibor91/status/1821927346943987747
Andrew Curran AMA commentary: https://x.com/AndrewCurran_/status/1821958362758400471
Agreed across all of these points. I think it’s also important to point out, with the deprecation of all the models at once, that OpenAI was probably doing what they thought people wanted. How many times have we heard complaints that people don’t know what to do with the model selector? I suspect they were also counting on the model’s ability to shift into reasoning to act as a wow factor for people who had not used or fully explored o1/o3. What they forgot is that people hated o1 when it was released, and it took 2-4 weeks for us all to catch up to how to use reasoning models.
Re: the writing quality, I think 4o is being perceived as better because it is more acutely and obviously responsive to user preferences via Memory or custom instructions. 5 (w/o thinking) seems to be less exaggerated in how it manifests these preferences (hence people feeling its voice has suddenly changed). I will maintain, though, that it is a better creative writer. It has a stronger sense of linguistic cadence and structure across all of the writing types I have tested it on. Interestingly (at least to me), the thinking models dial back up how responsive they are to custom instructions and Memory. However, 5 (w/ thinking) is a much better technical writer than o3 was, even as compared to o3 pro. It is more precise, has a greater appreciation for data, and a stronger sense of how to use it to build a case in narrative form.
My testing is ongoing but I can't see myself using 4o ever again.
Rolling out a new version without first releasing it as a beta and letting users kick the tires is usually a recipe for disaster.
Remember "Classic Coke"?