Vibe check: does GPT-5 live up to the hype?

Short answer: mostly yes. GPT-5 is a clear step up in accuracy and usefulness, especially when it is allowed to “think” for a few seconds. It is not magic, but it does feel more reliable and more capable on everyday work than previous models.

Below we cut through the noise and focus on three things that matter for teams: what the real‑world numbers say, what people who push these systems hard are seeing, and why those changes actually help you get work done.

The real‑world numbers that matter

Fewer made‑up facts, especially the big ones
Hallucinates almost half as much. OpenAI reports two related views of accuracy. First, at the response level, GPT‑5 cuts major factual errors sharply: the standard GPT‑5 model produces 44% fewer responses with at least one major factual mistake than GPT‑4o, and GPT‑5 Thinking produces 78% fewer than o3. That is what “almost half” refers to, because these big, trust‑breaking errors are the ones that derail real work. Second, at the claim level, which counts minor slips as well as big ones, GPT‑5’s hallucination rate is 26% lower than GPT‑4o’s for the standard model, and GPT‑5 Thinking’s is 65% lower than o3’s. In separate open‑ended factuality tests, GPT‑5 Thinking makes fewer than one‑fifth as many factual errors as o3. Net effect: long answers go wrong less often and you spend less time fixing confident mistakes.

Fixes more real code, not just toy puzzles
On a benchmark that asks the model to patch real issues in real open source projects and pass the tests, GPT‑5 scores 74.9%, up from 69.1% for o3. It also reaches that score with fewer tool calls and shorter outputs, so you wait less and spend less to get a working fix. In practice that means fewer blocked tickets and more small changes that ship without human babysitting.

Much stronger on hard maths
On a tough set of competition‑style questions, GPT‑5 scores 94.6% without external tools. Earlier top models were in the mid to high 80s, so you are looking at roughly an eight‑point gain at the sharp end. You feel this in planning, analysis and spreadsheet work where careful, multi‑step logic used to wobble.

Works with much longer documents
GPT‑5 can keep far more of your material in context at once. Think the size of a 700‑page book in one go. The context window is simply how much of the conversation or documents it can remember at any one time. That means you can drop in a full policy binder, a complete RFP with appendices, or months of meeting notes, then ask cross‑document questions without chopping text into pieces. For example, include the whole procurement pack and previous vendor emails, then ask it to highlight risk clauses, compare them with last year’s terms, and draft a one‑page exec summary with page references in a single thread.
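Before pasting a whole procurement pack into one thread, it helps to sanity‑check that it will fit. The sketch below is illustrative only: the window size, the ~4‑characters‑per‑token heuristic, and the 1,800‑characters‑per‑page figure are rough assumptions, not numbers from this article; a real tokenizer will give different counts.

```python
# Rough check of whether a set of documents fits in a model's context window.
# ASSUMPTIONS (not from the article): a 400k-token window, ~4 characters per
# token for English prose, and ~1,800 characters per printed page.

CONTEXT_WINDOW_TOKENS = 400_000   # assumed window size; adjust to your model
CHARS_PER_TOKEN = 4               # crude heuristic for English text

def estimate_tokens(text: str) -> int:
    """Very rough token estimate; use a real tokenizer for precision."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(docs: list[str], reserve_for_reply: int = 8_000) -> bool:
    """True if the combined documents leave room for the model's answer."""
    budget = CONTEXT_WINDOW_TOKENS - reserve_for_reply
    return sum(estimate_tokens(d) for d in docs) <= budget

# A 700-page book at ~1,800 characters per page:
book = "x" * (700 * 1_800)
print(fits_in_context([book]))  # → True: roughly 315k tokens, inside budget
```

If the check fails, you know up front to split the material rather than discovering mid‑conversation that the model has silently lost the first half of your binder.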

More honest about its limits
When images were deliberately removed from a vision test, o3 still answered as if it could see them about 86.7% of the time. GPT‑5 did so about 9% of the time. In live traffic studies, deceptive replies fell from 4.8% with o3 to 2.1% with GPT‑5 Thinking. Translation: it is more likely to say “I do not have enough information to answer” than to bluff.

Routing that usually “just works”
GPT‑5 in ChatGPT is not a single model. It routes your request to a fast model for simple jobs and a deeper reasoning model when it looks hard. Most of the time that means you do not need to choose. Power users have noticed it sometimes underestimates difficulty, so adding “think hard” still helps on tricky briefs.

What people are saying

Ethan Mollick calls GPT‑5 “the model that just does stuff”. In his widely shared demo, he asked for a dramatic paragraph and GPT‑5 produced a complex word puzzle of the kind previous models often failed: the first letters of each sentence spelled a message and each sentence grew by one word, while staying coherent. More importantly for work, he shows it keeping going on multi‑step tasks without hand‑holding.

In public, blind head‑to‑head tests run by the community, GPT‑5 has landed at or near the top soon after launch. That suggests the quality gains show up for regular users, not just on lab tests, although gaps over top rivals are slim.

Researchers like Nathan Lambert and Gary Marcus frame it this way. The leap is meaningful for reliability and coding, but it is not the “everything changes overnight” moment some expected. The big story is steadier output and less nonsense, which is exactly what most teams need.

Why this matters for your work

Marketing and comms
With fewer fabrications and better long‑document handling, you can ask GPT‑5 to read three months of support tickets and produce a 400‑word customer insights note with inline quotes and a short source appendix. It is less likely to invent a quote or mislabel a trend, and more likely to flag what it is unsure about.

Ops and finance
Small automation tasks stall less often. Ask it to normalise invoice fields across exports or reconcile two CSVs and it will either complete the job or tell you the file or permission it needs, rather than faking progress. That reduces the “looks right but is quietly wrong” failure mode.
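The reconciliation task above is the kind of script such a model would draft. Here is a minimal, hypothetical sketch: the column names `invoice_id` and `amount`, the currency symbol, and the tolerance are all assumptions for illustration, not details from the article. It normalises a couple of messy fields, then reports rows that disagree between two exports.

```python
import csv
import io

def load_invoices(csv_text: str) -> dict[str, float]:
    """Read invoices into {invoice_id: amount}, normalising messy fields.
    Assumes columns named 'invoice_id' and 'amount' (hypothetical schema)."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        invoice_id = row["invoice_id"].strip().upper()      # "inv-001 " -> "INV-001"
        amount = float(row["amount"].replace("£", "").replace(",", ""))
        rows[invoice_id] = amount
    return rows

def reconcile(a_text: str, b_text: str) -> list[str]:
    """Return human-readable discrepancies between two invoice exports."""
    a, b = load_invoices(a_text), load_invoices(b_text)
    issues = []
    for invoice_id in sorted(a.keys() | b.keys()):
        if invoice_id not in b:
            issues.append(f"{invoice_id}: missing from export B")
        elif invoice_id not in a:
            issues.append(f"{invoice_id}: missing from export A")
        elif abs(a[invoice_id] - b[invoice_id]) > 0.005:  # tolerate rounding
            issues.append(f"{invoice_id}: amounts differ "
                          f"({a[invoice_id]} vs {b[invoice_id]})")
    return issues

export_a = 'invoice_id,amount\ninv-001,"£1,200.00"\nINV-002,350.00\n'
export_b = 'invoice_id,amount\nINV-001,1200.00\nINV-003,99.00\n'
print(reconcile(export_a, export_b))
# → ['INV-002: missing from export B', 'INV-003: missing from export A']
```

The point of the benchmark gains is less the code itself than the failure mode: a more reliable model will ask for the missing export or flag the schema mismatch instead of silently inventing matching rows.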

Product and data
Long discovery docs, user interviews and analytics summaries can sit in one chat. Ask GPT‑5 to extract ten recurring user problems, map them to segments and write two experiment briefs with success metrics, then iterate without losing earlier decisions.

Engineering
The coding gains are not just leaderboard points. A model that fixes more real regressions with fewer steps will close more small issues unattended. Humans avoid context switching and pull requests move faster.

So, a vibe check

For organisations, the headline is simple. GPT‑5 gets things wrong less often and gets more of the work done by itself, especially on messy, open‑ended tasks. It is the compounding effect of better planning, longer working memory, fewer false facts, and clearer refusals when the answer is unknowable.

And yes, this article was researched, written and posted using GPT‑5‑powered Deep Research, Image Generation and ChatGPT Agent, with just a few minutes of human help for a light edit and formatting polish. Proof that at least some of the hype may be justified.
