No, GPT-4.5 does NOT hallucinate (i.e. make things up) in 37.1% of its responses. And GPT-4o definitely doesn't make things up 61.8% of the time!

These widely circulating figures, which come from the GPT-4.5 release info, are causing unnecessary panic about the level of hallucination risk in AI use. I get that the people sharing the numbers think they are reliable since they come straight from OpenAI. They are real. The problem is that people are misunderstanding what the numbers are actually measuring.

These numbers come from a benchmark called SimpleQA, which is designed to measure hallucinations using deliberately challenging questions with exact, fact-based answers - precisely the kinds of questions AI can easily get wrong. The goal of the test is to measure the propensity to hallucinate so it can be compared across different models. OpenAI tested their models on this curated set of difficult factual questions, and those percentages (37.1% for GPT-4.5, 61.8% for GPT-4o, etc.) reflect the percent of responses where the models gave incorrect or made-up answers to those particular challenging questions.

They absolutely do NOT represent the general accuracy or reliability of these models in everyday scenarios, because the numbers are NOT from the typical, everyday interactions most users have with these models.

In real-world use, hallucination rates vary dramatically based on several factors, like...

📌The type of question you're asking.
• Common-knowledge questions will likely get accurate responses in most situations.
• Obscure or tricky questions might have higher rates of incorrect answers.
• Creative or open-ended prompts may not really involve hallucinations at all.

📌Your prompting approach.
• While there's no way to entirely eliminate hallucinations, your overall prompting strategy can dramatically impact how often they happen.
• Strategies like explicitly instructing the AI not to hallucinate or make things up can help reduce inaccuracies, as can saying you'd rather it say "I don't know" than make something up.
• Including directions like "take a deep breath and think step-by-step" or giving it a reasoning process to follow can also enhance reliability (this is what the reasoning models do on their own, but you can prompt this behavior into models like GPT-4o and 4.5, too).

📌The context you provide.
• Supplying specific reference documents and explicitly instructing the AI to rely solely on that information can greatly reduce hallucinations (see the sketch at the end of this post).

Hallucinations are very real, and verifying AI outputs remains absolutely vital. But these widely shared statistics aren't general indicators - they're context-specific results from a controlled test. Your actual experience with AI accuracy heavily depends on exactly what you're asking, how you're asking it, the context you're giving, and your prompting strategies.
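To make the prompting and context points concrete, here's a minimal sketch of what this can look like with the OpenAI Python SDK. The model name, reference text, and example question below are just illustrative assumptions, not a definitive recipe:

```python
# A rough sketch of hallucination-reducing prompting, assuming the OpenAI Python SDK.
# The model name, instructions, reference text, and question are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Trusted reference material the model should ground its answer in.
reference_doc = """(paste your trusted reference material here)"""

response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical choice; swap in whichever model you use
    messages=[
        {
            "role": "system",
            "content": (
                "Answer ONLY using the reference material provided. "
                "Think step-by-step before answering. "
                "If the answer is not in the reference material, say 'I don't know' "
                "instead of guessing or making anything up."
            ),
        },
        {
            "role": "user",
            # Hypothetical question - replace with your own.
            "content": f"Reference material:\n{reference_doc}\n\nQuestion: What does the Q3 report say about churn?",
        },
    ],
)

print(response.choices[0].message.content)
```

The point of this pattern is simply combining the strategies above: grounding the model in your own documents, asking for step-by-step reasoning, and explicitly giving it permission to say "I don't know" rather than invent an answer.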
Insightful and informative. Thanks for posting.
Very helpful, Nicole Leffer. I will add this link to my post so people can see this detail.
Great info!
This is exciting news. It makes sense that you need to clearly communicate your request to get the results you're looking for.
Fuuuully agree with this statement. It’s crucial to clarify that those numbers (37.1% for GPT-4.5 and 61.8% for GPT-4o) come from a specific test, SimpleQA, designed to challenge models with tough, targeted factual questions. They absolutely do not reflect their overall accuracy or reliability in the everyday interactions most users have with these systems. ➡️ I like GPT-4.5, it feels light and smooth compared to other models, though I get that the test numbers don’t tell the full story.... Ciao
Yes! You can’t go far wrong with reliable reference documents and solid examples.
Great breakdown! Context matters, and these stats are misleading without it. AI reliability depends on usage, prompting, and verification.
Well said!