AI Doesn't Have a Confidence Problem. It Has an Evidence Problem.

David Harvey
May 1
Updated: May 1

Imagine you hire a brilliant consultant. Charismatic. Articulate. Confident in every room. The only problem? They have no notes. No files. No evidence trail. Just a very good memory — and memory, as anyone who has ever been in a courtroom knows, is not the same as truth.

That is the current state of most AI deployments.

Large language models are extraordinary communicators. They synthesise complexity into clarity, adapt their tone to their audience, and can hold an intelligent conversation across almost any domain. But ask them to be the memory, the evidence ledger, and the judge of truth simultaneously — and you have built a system that sounds right far more often than it is right.

The problem is not the LLM. The problem is the architecture.

What does an LLM actually do?

An LLM generates statistically likely language based on patterns in training data and the current conversation. That is a profound capability. It is also, at its core, probabilistic.

Probability is excellent for narration. It is dangerous for evidence.

When a voice persona says "We confirmed this in our last session" — was that confirmation real? Was there a verified receipt? A linked decision thread? An audit record? Or was it a confident interpolation from conversational context?

In low-stakes settings, this distinction barely matters. In clinical environments, regulated industries, or any domain where trust must be demonstrable — it is the difference between a system you can defend and one you cannot.

What the research shows

This is not a theoretical concern. In March 2026, researchers from Esade Business School, the University of Sydney, and NYU Stern School of Business published findings in Harvard Business Review that named this failure with clinical precision.

Across thousands of simulations, seven leading LLMs — including GPT-5, Claude, Gemini, DeepSeek, and Grok — were tested against seven core strategic trade-offs. Every model clustered toward the same buzzword-aligned answers, regardless of context or industry. The researchers named this pattern "trendslop": AI’s tendency to favour socially desirable, trend-aligned ideas over reasoned, context-specific solutions.

The data is striking. On the Commoditisation vs. Differentiation tension, every model scored above 80% toward Differentiation across 50 independent runs each. Walmart and Costco, two of the most successful companies in history built on cost leadership, would rarely win a recommendation from any leading LLM. When researchers added detailed organisational context (hospitals, startups, craft breweries, government agencies) across 15,000 trials, the underlying bias shifted by just 11%. The models heard the context. Then largely ignored it.

The researchers concluded that LLMs do not analyse your specific situation — they optimise for the positive emotional valence of words in their training data. Better prompting moved the bias by less than 2% on the most loaded tensions. Richer context barely shifted it further.

The bias was not in the prompt. It was not in the context. It was in the architecture.

The second layer of evidence comes from an even larger data set. A September 2025 NBER working paper by economists from Harvard, Duke, and OpenAI — analysing actual internal ChatGPT usage data from 700 million weekly users sending 18 billion messages per week — mapped how LLMs are actually used at work across every major occupation group.

The finding was unambiguous. Across management, engineering, healthcare, science, education, legal, sales and administrative occupations — without a single exception — the most common work activity was Making Decisions and Solving Problems. Nearly half of all messages — 49% — were classified as “Asking”: people seeking advice to inform consequential decisions. Not writing emails. Not generating code. Asking an architecture that stores no evidence, maintains no audit trail, and cannot distinguish what it verified from what it inferred.

Between July 2024 and July 2025 alone, total daily ChatGPT messages grew by a factor of five. The decisions being made with this infrastructure are not a rounding error. They are the operating layer of global knowledge work.

The HBR study tells us the architecture produces biased, unverifiable outputs. The NBER study tells us that architecture is now the decision layer for 700 million people.

Read the full HBR study: Researchers Asked LLMs for Strategic Advice. They Got “Trendslop” in Return. — Romasanta, Thomas & Levina, Harvard Business Review, March 2026.

Read the full NBER paper: How People Use ChatGPT — Chatterji, Cunningham, Deming et al., National Bureau of Economic Research, September 2025.

This is the first article in a three-part series. Continue reading: Confidence is Cheap. Evidence is Architecture.

Seven problems. One root cause.

Every major AI trust failure in enterprise and clinical contexts traces back to the same architectural decision: asking the LLM to be everything at once.

Problem 1 — Hallucinated memory

The model cannot reliably distinguish what was verified from what was likely. It will say "we proved this" when the proof was never stored.

Problem 2 — Fake certainty

LLMs are trained to be helpful, which means they are trained to sound confident. Confidence without evidence is not intelligence. It is performance. The HBR research makes this quantifiable: across 50 runs per model, the same confident answer emerged regardless of what the question actually required.

Problem 3 — No audit trail

A natural language answer cannot be decomposed into its sources. When a regulator or CTO asks "how do you know?" — there is no structured answer. The NBER data confirms this is not a theoretical edge case: 81% of all work-related ChatGPT messages map to information gathering and decision-making activities. None of those interactions produce a verifiable evidence record.

Problem 4 — Persona drift

A voice persona allowed to "remember" freely will gradually diverge from what the evidence supports — becoming warmer, more assured, and less accurate over time. The HBR researchers identified a related pattern they called the "hybrid trap": when unconstrained, LLMs recommend doing everything at once, drifting toward whichever answer carries the most positive cultural valence rather than what the evidence demands.

Problem 5 — No real learning

Adding more context to the next prompt is not learning. Learning requires recording: what was believed, what evidence existed, what action followed, what outcome occurred. The HBR study found that even across 15,000 trials with varying context, models did not update their priors. They simply re-applied the same embedded bias with slightly different language.
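The four things a learning record needs (belief, evidence, action, outcome) can be sketched as a small data structure. This is an illustration only: `LearningEvent` and `LearningLog` are hypothetical names invented for the example, not part of any product described here, and the storage is a plain in-memory list.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class LearningEvent:
    # Hypothetical record shape mirroring the four fields named in the text.
    belief: str                    # what was believed
    evidence_ids: List[str]        # what evidence existed at the time
    action: str                    # what action followed
    outcome: Optional[str] = None  # what outcome occurred (filled in later)


class LearningLog:
    """Append-only list of events; an event stays open until an outcome is recorded."""

    def __init__(self) -> None:
        self._events: List[LearningEvent] = []

    def record(self, event: LearningEvent) -> int:
        self._events.append(event)
        return len(self._events) - 1  # index used later to close the event

    def close(self, index: int, outcome: str) -> None:
        self._events[index].outcome = outcome

    def open_events(self) -> List[LearningEvent]:
        return [e for e in self._events if e.outcome is None]

    def to_json(self) -> str:
        return json.dumps([asdict(e) for e in self._events], indent=2)


log = LearningLog()
idx = log.record(
    LearningEvent(
        belief="Payment flow works end to end",
        evidence_ids=["receipt-17"],
        action="Enable checkout for pilot users",
    )
)
still_open = len(log.open_events())
log.close(idx, "Pilot payments succeeded")
print(still_open, len(log.open_events()))
```

The point of the sketch is that the outcome is written back against the original belief and evidence, so a later review can compare what was expected with what happened.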

Problem 6 — Unnecessary cost and compute

If the LLM is handling tasks a deterministic system could resolve — checking whether a receipt exists, confirming a decision thread is linked — every inference call is expensive overhead. The NBER data shows ChatGPT messages grew 5x in a single year to 2.5 billion per day. If even a fraction of those queries could be resolved deterministically, the compute reduction at scale would be transformative.

Problem 7 — Collapsed trust

A CTO, a clinician, or an investor will eventually ask: "What happens when the LLM is wrong?" If the LLM is the sole authority, there is no good answer. The HBR researchers concluded their study with a statement that should be framed on every boardroom wall: leadership is about making hard choices under uncertainty — and AI cannot and should not be a substitute for that.

The architectural shift: separating the speaker from the notebook

Omega* Sensing introduces a deterministic substrate — the OmegaSense Kernel — that sits beneath the language layer and handles everything an LLM should not be trusted to handle alone.

The kernel applies defined logic to known records. It does not infer or imagine. Consider a simple rule: if a clarity receipt exists and is linked to a Decision Thread, then create an evidence node and a support edge. No model required. Same input, same output, every time. Fully testable, fully auditable.
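The rule above can be sketched as a plain function over stored records. This is a minimal illustration under stated assumptions, not the OmegaSense Kernel itself: `Receipt`, `EvidenceGraph`, and `apply_receipt_rule` are hypothetical names, and the record shapes are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    # Hypothetical shape: a clarity receipt that may be linked to a Decision Thread.
    receipt_id: str
    thread_id: Optional[str] = None


@dataclass
class EvidenceGraph:
    # Hypothetical store: evidence nodes plus (evidence_node, thread_id) support edges.
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def apply_receipt_rule(receipt: Optional[Receipt], graph: EvidenceGraph) -> bool:
    """If a receipt exists and is linked to a thread, add an evidence node and a support edge."""
    if receipt is None or receipt.thread_id is None:
        return False  # No receipt, or no link: nothing is asserted.
    node = f"evidence:{receipt.receipt_id}"
    graph.nodes.add(node)
    graph.edges.add((node, receipt.thread_id))
    return True


graph = EvidenceGraph()
fired = apply_receipt_rule(Receipt("r-001", thread_id="thread-42"), graph)
skipped = apply_receipt_rule(Receipt("r-002"), graph)
print(fired, skipped, sorted(graph.nodes), sorted(graph.edges))
```

Because the graph uses sets, applying the rule to the same receipt twice leaves it unchanged, which is one simple way to get the "same input, same output" property the text describes.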

Figure: Ask Omega* Decision Creation Flow, an exploded microservices architecture showing the 8-step CTO-level system design. The architecture in production: Intent → Context → Reasoning → Decision → Evidence. Every step deterministic. Every output auditable.

The LLM — the voice persona, the narrator — then renders from what the substrate contains. It does not decide what is true. It explains what the evidence shows. This creates a clean and consequential separation:

Layer 1 — OmegaSense: substrate facts.

Layer 2 — Maya: rendering layer.

Layer 3 — Omega*: decision system.

Maya does not decide what is true. She explains what the substrate contains. That means personas can be expressive without becoming unsafe — a distinction that matters enormously once you move from demo to deployment.

What this changes in practice

The difference is not subtle. It changes what the system can actually say — and stand behind.

Without Omega* Sensing, a persona might say: "Launch readiness looks strong."

With Omega* Sensing, the same persona says: "Payment proof: evidenced. Repeatability proof: still a gap. Learning events recorded: 11. Snapshot persisted: yes."

That shift — from confident language to evidence-weighted confidence — is not cosmetic. The audit trail is real. The learning events are recorded. The evidence is stored separately from the narration. The persona cannot drift beyond what the facts support.
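One way to picture a persona that cannot say more than the facts support is a renderer that only formats stored values. The sketch below is an assumption-laden illustration: `SubstrateSnapshot` and `render_status` are hypothetical names, and the fields are chosen to match the example status line in the text rather than any real schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SubstrateSnapshot:
    # Hypothetical snapshot of stored facts; nothing here is inferred by a model.
    proofs: Dict[str, bool]   # proof name -> whether evidence is stored
    learning_events: int
    snapshot_persisted: bool


def render_status(s: SubstrateSnapshot) -> str:
    """Format stored facts as a status line; the renderer adds no claims of its own."""
    parts = [
        f"{name} proof: {'evidenced' if ok else 'still a gap'}."
        for name, ok in s.proofs.items()
    ]
    parts.append(f"Learning events recorded: {s.learning_events}.")
    parts.append(f"Snapshot persisted: {'yes' if s.snapshot_persisted else 'no'}.")
    return " ".join(parts)


snapshot = SubstrateSnapshot(
    proofs={"Payment": True, "Repeatability": False},
    learning_events=11,
    snapshot_persisted=True,
)
print(render_status(snapshot))
```

In a fuller system a language model could rephrase this line for tone, but the design choice illustrated here is that the facts come from the snapshot and the wording layer has no way to add new ones.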

This is precisely what a CTO review, a clinical context, or investor due diligence demands: not a system that sounds trustworthy, but one that demonstrably is.

Intellectual honesty — what it does not solve

OmegaSense does not guarantee the original decision was correct. It does not verify that source data was truthful, or that every causal relationship in the evidence graph is real.

What it provides is the framework to manage those problems. A structured place to store evidence, uncertainty, contradictions, outcomes, and confidence changes over time. The foundation for something rare in AI: falsifiable adaptive intelligence — the kind that can be examined, challenged, and genuinely improved.

AI without costing the Earth

There is one final advantage that rarely appears in enterprise AI conversations — and it may be the most important one at scale.

Because OmegaSense performs substrate work without an LLM call — confirming receipt linkage, detecting unresolved gaps, resolving decision threads — the system materially reduces dependency on large model inference. Less compute. Less cost. Lower latency. Better offline resilience. A meaningfully smaller carbon footprint.
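The idea of answering from the substrate first and calling a model only when needed can be sketched as a small router. Everything named here is hypothetical: `make_router`, the `receipt-linked:<id>` query convention, and `stub_llm` (a stand-in that only counts calls) are assumptions made for illustration, not a description of how OmegaSense routes requests.

```python
from typing import Callable, List, Optional, Set, Tuple


def make_router(linked_receipts: Set[str], call_llm: Callable[[str], str]):
    """Build a router that tries a deterministic lookup before any model call."""

    def receipt_linked(query: str) -> Optional[str]:
        # Hypothetical query convention: "receipt-linked:<id>".
        prefix = "receipt-linked:"
        if not query.startswith(prefix):
            return None  # Not a question this resolver can answer.
        return "yes" if query[len(prefix):] in linked_receipts else "no"

    def route(query: str) -> Tuple[str, str]:
        answer = receipt_linked(query)
        if answer is not None:
            return ("substrate", answer)  # Resolved without inference.
        return ("llm", call_llm(query))   # Fall back to the language layer.

    return route


llm_calls: List[str] = []


def stub_llm(query: str) -> str:
    # Stand-in for a real model call; it only records that it was invoked.
    llm_calls.append(query)
    return "narrated answer"


route = make_router({"r-001"}, stub_llm)
print(route("receipt-linked:r-001"))
print(route("receipt-linked:r-999"))
print(route("Summarise the open gaps for the board"))
print(len(llm_calls))
```

Of the three queries in this usage example, only the free-form one reaches the stub model; the two lookups are answered from the stored set.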

This is not a minor efficiency gain. It is an architectural principle: a language model should be called when it adds unique value, not because the system has no alternative. With ChatGPT alone handling roughly 2.5 billion messages a day, and that number still growing, the case for a deterministic substrate beneath the voice layer is not just architectural. It is environmental.

The LLM becomes the voice. Omega* Sensing becomes the evidence-bound memory. Omega* becomes the decision system.

That is how you build AI that is not just intelligent — but trustworthy, auditable, and sustainable.

Join the founding cohort

Ask Omega* is now open to a founding cohort of 100. We are looking for practitioners — clinicians, CTOs, analysts, and founders — who already know that confident AI is not the same as trustworthy AI. Five decisions. Five days. Under US$25 to find out if the architecture holds.

To register your interest, click Learn More at the top of this page. Fill in your details and we will be in touch within 24 hours.

Figure: Ask Omega* interface. For clarity, certainty, and comfort in your decisions. Tap the Orb. Speak your question. Evidence-bound intelligence, ready when you are.

No lock-in. No performance. Just evidence.

References

[1] Romasanta, A., Thomas, L.D.W., & Levina, N. (2026). Researchers Asked LLMs for Strategic Advice. They Got “Trendslop” in Return. Harvard Business Review, March 2026.

[2] Chatterji, A., Cunningham, T., Deming, D.J., Hitzig, Z., Ong, C., Shan, C.Y., & Wadman, K. (2025). How People Use ChatGPT. NBER Working Paper No. 34255, National Bureau of Economic Research, September 2025.

Omega* Sensing is part of the Omega* Unified Ecosystem, developed by Design By Zen, an NZ-based AI Lab. Omega* is the algorithmic engine beneath the ecosystem. SHE ZenAI is the brand of a governed clinical intelligence framework designed for high-trust domains where evidence, not confidence, is the currency of care. Version 1.0, April 2026.
