Jirka Koutny · 8 min read

Beyond the Prediction Engine: From Fluency to Defensible AI Reasoning

Engineering · Mar 16, 2026


Jirka Koutny, Backend Engineering Manager


TL;DR

We’re moving from models that mainly predict (pattern fluency) to AI reasoning systems that must decide (structured, defensible outcomes). A defensible AI decision is one that can show what evidence it used, what rules it followed and why it rejected alternatives. It shifts the standard from "Did the AI sound right?" to "Can the AI's reasoning survive scrutiny?"

“If a machine is expected to be infallible, it cannot also be intelligent.”

— Alan Turing


The emerging design isn’t magic. It’s governance by architecture: Give the system a scratchpad where thinking can be messy. Enforce a boundary where outputs become valid, auditable and safe.

The hard part isn’t making the model talk. The hard part is this: How do we let an AI do real thinking in the open without letting that openness become an uncheckable decision? Because it always comes down to the same trade-off: freedom produces insight; rules produce responsibility.

Start With One Meeting

It’s Tuesday. There’s a meeting. There’s always a meeting. The CEO wants a clean outcome. A security officer wants guarantees. Engineering wants something implementable. The COO wants to know who owns the risk. Someone says a sentence that sounds simple: “Let’s use AI to speed up decisions.”

In that room, speed means three different things: shipping velocity? Lower operational cost? Fewer incidents and fewer surprises? Same word. Different games. And AI is unusually good at winning the “looks fast” game while quietly losing the “is safe” game, because it can produce fluent conclusions long before it can produce defensible ones.

Fluency vs. Defensibility

Executives want an answer engine: type a question, get a clean decision.

Engineers see a fluency engine: impressive output, uncertain grounding and a black-box feel when things go wrong.

Both are right inside their own language game. The failure begins when we pretend the games are identical. That’s how you get the modern artifact: a confident answer that cannot be defended.

This is why AI reasoning systems show up now. Not because logic was rediscovered, but because the cost of being wrong has become visible (financially, legally, operationally). Fluency survives demos. Decisions must survive scrutiny.

And there’s a deeper trap: fluency doesn’t just look like understanding. It invites us to treat it as understanding. The system can speak with the tone of certainty while doing nothing more than a highly trained continuation of patterns (often labeled as “stochastic parrots” or “semantic zombies”). It’s the difference between sounding right and having grounds.

So the meeting becomes a design question, not a model-selection question: what would it mean for an AI decision to be defensible?

When Wittgenstein Met Turing: The Philosophical Roots of AI Reasoning Limits

Here’s a small historical scene that turns out to be a perfect mirror.

Cambridge, 1939. Ludwig Wittgenstein is teaching his famously combative seminars on the foundations of mathematics — less lecture and more live disassembly of assumptions. Alan Turing is there too, attending. 

They’re orbiting the same puzzle from opposite directions. Turing is building the cleanest possible picture of procedure: what can be computed and what can be decided by a systematic method. Wittgenstein is suspicious of pictures that pretend meaning lives inside symbols. For him, rules aren’t timeless abstractions hovering “out there.” Rules are what humans do, a practice embedded in forms of life.

One wants a boundary so sharp it can be mechanized. The other keeps pointing out that boundaries only make sense inside a game.

That tension didn’t end in the seminar room. It became the blueprint of our confusion.

Because a few years later, Turing gave the world the Imitation Game: a test not of inner essence, but of whether a machine can pass as a conversational participant. The genius is that it avoids metaphysics. The danger is that it tempts us back into metaphysics anyway: if it sounds human, we start to treat it as if it is human: competent, grounded, responsible.

And that is exactly where modern LLMs hurt us: they pass “Turing-style” tests constantly (polished text, confident tone, social plausibility) while failing “Wittgenstein-style” expectations of rule-following grounded in shared practice, evidence and accountability.

So the shift we’re watching today is not “models got smarter.” It’s that we’re finally admitting that The Imitation Game is not enough for the games that matter. A chat that feels convincing is not the same thing as a decision that survives an audit.

Alan Turing and Ludwig Wittgenstein

A Useful Lens: Two Modes, One System

Think of this shift as two modes working together, each with different goals.

  • Scratchpad mode: exploration, hypothesis, backtracking, “mess.”
  • Boundary mode: constraints that make outputs valid, retrievable, secure and accountable.

With that, if you remember only one sentence, let it be this: Reasoning happens in the scratchpad. Responsibility happens at the boundary.

Now return to the meeting. The CEO wants the AI to generate a “yes/no” decision on a customer request. Engineering asks what it should output. Legal asks what it should cite. Security asks what it should never do.

And suddenly the project isn’t “build an AI.” It’s “define a contract.” Under this lens, terms stop being fuzzy:

  • Prompting and retrieval stop being “asking nicely” and become rule design: what counts as evidence, what counts as success, what is forbidden.
  • Hallucinations become less mystical and more mechanical: often a grounding failure. No reliable anchor to truth conditions, policy or reality.
  • Evaluation stops being about vibe and becomes about contracts: did it follow the rules, and did it use allowed sources in allowed ways?

Most production failures are not language failures. They are contract failures.
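To make “contract” concrete, it can be as small as a validation function run over every decision artifact before it leaves the system. A minimal sketch in Python; the field names, outcomes and source list (`outcome`, `evidence`, `policy_refs`, `ALLOWED_SOURCES`) are invented for illustration, not taken from any real system:

```python
ALLOWED_SOURCES = {"credit_report", "income_statement", "policy_db"}   # illustrative
REQUIRED_FIELDS = {"outcome", "evidence", "policy_refs"}               # illustrative
VALID_OUTCOMES = {"approve", "deny", "escalate"}

def check_contract(decision: dict) -> list[str]:
    """Return contract violations; an empty list means the artifact passes."""
    violations = []
    missing = REQUIRED_FIELDS - decision.keys()
    if missing:
        violations.append(f"missing fields: {sorted(missing)}")
    if decision.get("outcome") not in VALID_OUTCOMES:
        violations.append(f"outcome must be one of {sorted(VALID_OUTCOMES)}")
    for ev in decision.get("evidence", []):
        if ev.get("source") not in ALLOWED_SOURCES:
            violations.append(f"disallowed source: {ev.get('source')}")
    return violations
```

The point isn’t the specific checks. It’s that “did it follow the rules?” becomes a machine-answerable question instead of a vibe.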

The Story: A Decision That Must Survive an Audit

Let’s make the running story specific. A bank wants an AI copilot to support loan decisions. Not to replace humans (yet) but to summarize risk, propose an outcome and draft the decision memo. The appeal is obvious:

  • faster processing,
  • fewer manual steps,
  • consistent formatting,
  • cleaner records.

The first prototype is dazzling. The model writes a memo that sounds like it came from a senior underwriter: crisp, confident, complete.

Then compliance asks a simple question: “Show me why this applicant was denied.”

Not “tell me.”

“Show me!”

And the entire design pivots. The bank doesn’t need a model that can write. It needs a system that can justify. This is the transition in one scene: from fluent output to defensible AI reasoning.

The Paradox of Constraints: Why Early Rules Kill AI Reasoning

The default instinct is: “If it’s risky, constrain it harder.” Sometimes that’s wisdom. Often it’s a trap.

Hard problems require an intermediate workspace. Humans don’t solve difficult equations by writing the final answer directly into a tiny box. We scribble. We try routes. We keep scaffolding we never publish. If you enforce strict formatting too early, you don’t just restrict output. You remove the workspace where reasoning occurs.

Imagine telling a senior engineer: “Write perfect mission-critical code, but no scratch files, no temporary notes, no intermediate reasoning. Only final output.” That isn’t rigor. That’s sabotage.

In our bank story, this plays out immediately. Someone proposes:

  • “Force the model to output only JSON.”
  • “Only allow approved fields.”
  • “Only allow canned phrasing.”

The result looks safe. Until it isn’t.

The Free Thinker and the Bureaucrat

Two failure archetypes appear:

  • The Free Thinker: it reasons well, but the final structured output has a tiny format error, mixes fields or produces an unparseable object. Great thinking. Broken delivery.
  • The Bureaucrat: it outputs perfect JSON every time, but the reasoning collapses into shallow compliance. Perfect form. Weak mind.

This is why timing matters more than force. Rules aren’t inherently harmful. But rules applied too early collapse expressivity. You get a system that can comply but cannot plan.

So the pattern that’s emerging is not “no constraints.” It’s constraints at the end: allow exploration first, then force the result into a strict, validated format.

Rules aren’t the enemy. Premature rules are.

In practice, the winning design looks boring and obvious in hindsight: let the system explore in a scratchpad. Then switch modes and produce a final decision artifact that is machine-checkable.

That switch (freedom first, bureaucracy last) is not a hack. It’s how thinking works.
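In code, the shape of that switch is a rough two-call sketch: `generate` stands in for any text-generation call, and the JSON keys and prompts are illustrative. Phase one gets no format constraints at all; phase two demands a strict artifact and refuses to pass anything that fails validation:

```python
import json
import re

def two_phase_decide(generate, prompt: str) -> dict:
    """Phase 1: unconstrained scratchpad. Phase 2: strict, validated artifact."""
    # Phase 1: scratchpad -- exploration is allowed to be messy.
    scratchpad = generate(prompt + "\n\nThink step by step. Notes may be messy.")

    # Phase 2: boundary -- ask for JSON only, conditioned on the scratchpad.
    raw = generate(
        "Based only on the notes below, emit the final decision as JSON with "
        "keys 'outcome' and 'reasons'. No other text.\n\n" + scratchpad
    )
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate stray prose around the JSON
    if match is None:
        raise ValueError("boundary violation: no JSON object found")
    artifact = json.loads(match.group(0))
    if artifact.get("outcome") not in {"approve", "deny", "escalate"}:
        raise ValueError("boundary violation: invalid outcome")
    return artifact
```

Notice where the rigor lives: nothing in phase one is policed, and nothing from phase one is ever shown to a caller. Only the validated artifact crosses the boundary.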

More Data Is Not Always More Truth

Now the team fixes the format and gets to the next failure. The model makes claims. “Debt-to-income is high.” “Past delinquencies indicate elevated risk.” “Policy X applies.”

So the team adds retrieval. “We’ll ground it. We’ll feed it more context.” The second instinct appears: “If the model is uncertain, feed it more context.”

But retrieval isn’t a shopping cart. Past a point, adding more reduces quality. Not because the system is lazy, but because the prompt becomes noisy, redundant and internally competing.

The system wastes tokens, yes. More importantly, it wastes attention. And attention is a budget.

This shows up in a painfully simple way. The bank’s knowledge base has dozens of policy fragments. A naïve top-k retriever returns the “most similar” passages. They often repeat the same clause in different words. That redundancy crowds out the one missing exception that matters.

It’s the Eiffel Tower problem, just wearing enterprise badges:

Someone asks a three-part question: who, when, how tall.

The retriever proudly returns the “best matches” and hands the model two paragraphs that both say: built 1887–1889, height 324 m, landmark in Paris. The model reads them, nods and answers the parts it was given.

But the missing fact (the designer’s name) never arrives. Not because it doesn’t exist, but because redundancy quietly spent the token budget before the crucial chunk could get in.

That’s why retrieval isn’t just ranking. It’s composition. You’re not shopping for the single most similar paragraph. You’re assembling a packet that completes the question.


In our bank story, that stops being a theoretical point and becomes an operational constraint. You’re not “fetching context.” You’re assembling a briefing packet that a decision can stand on.

That packet has a job to do:

  • it should be small enough to fit the budget,
  • diverse enough to cover every required factor,
  • clean enough to avoid repeating itself,
  • sequenced well enough that the model doesn’t glide past the one sentence that matters.

Because in production, “extra context” is not free. Redundancy isn’t harmless. Every duplicated paragraph is a tax on attention, and attention is the only currency the model can spend.

Philosophically, more words do not automatically create more meaning. Sometimes they dilute it until even the right answer is present but effectively invisible.
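The difference between ranking and composition fits in a few lines. The greedy rule below scores each candidate chunk by relevance minus its overlap with what’s already in the packet, and stops at the token budget. This is a sketch, not a prescription: `relevance` and `overlap` are assumed to come from elsewhere (an embedding model, say), and counting words as “tokens” is deliberately crude:

```python
def assemble_packet(chunks, relevance, overlap, budget):
    """Greedy packet assembly: reward relevance, penalize redundancy, respect the budget."""
    packet, used = [], 0          # packet holds indices of chosen chunks
    candidates = list(range(len(chunks)))
    while candidates:
        def score(i):
            # Marginal utility: similarity to the question minus overlap
            # with material we already have (an MMR-style rule).
            redundancy = max((overlap(chunks[i], chunks[j]) for j in packet), default=0.0)
            return relevance[i] - redundancy
        best = max(candidates, key=score)
        cost = len(chunks[best].split())      # crude token count
        if used + cost > budget or score(best) <= 0:
            break                             # budget spent, or nothing useful left
        packet.append(best)
        used += cost
        candidates.remove(best)
    return [chunks[i] for i in packet]
```

Run it on the Eiffel Tower example and the behavior flips: the near-duplicate “best match” gets penalized into irrelevance, and the designer’s name makes it into the packet instead.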

From Automation Tool to Reasoning Partner: The Neuro-Symbolic Split

Now we can see what neuro-symbolic AI really means in practice. Not as a research slogan, but as a governance split:

  • Neural layer (perception): messy inputs, fuzzy signals, pattern recognition, generation.
  • Symbolic layer (cognition): rules, constraints, structured knowledge, policy, verifiable logic.

In the bank story:

  • The neural part reads messy applications, emails, documents and human notes.
  • The symbolic part enforces what counts as a valid decision: required fields, allowed sources, policy constraints and AI audit trails.

Let the neural part interpret the world. Let the symbolic part enforce how decisions are made and presented.
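As a sketch, the split can be this literal: the neural layer hands over a proposal, and a deterministic symbolic layer runs it through named rules and records every check it made. The rule names, thresholds and fields below are invented for illustration, not real lending policy:

```python
def symbolic_boundary(proposal: dict, applicant: dict) -> dict:
    """Symbolic layer: deterministic policy checks over a neural proposal,
    with an audit trail of every rule consulted. Thresholds are illustrative."""
    rules = [
        ("DTI_LIMIT", applicant["dti"] <= 0.45, "debt-to-income must be <= 0.45"),
        ("MIN_SCORE", applicant["credit_score"] >= 620, "credit score must be >= 620"),
        ("EVIDENCE", bool(proposal.get("evidence")), "decision must cite evidence"),
    ]
    trail = [{"rule": name, "passed": passed, "text": text}
             for name, passed, text in rules]
    if all(r["passed"] for r in trail):
        outcome = proposal["outcome"]   # the neural proposal survives the boundary
    else:
        outcome = "escalate"            # fail closed: a human decides
    return {"outcome": outcome, "audit_trail": trail}
```

The audit trail is the product here, not a by-product: “show me why” is answered by replaying the trail, not by asking the model to explain itself after the fact.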

This matters because businesses are moving toward defensible AI reasoning. “It answered quickly” stops being impressive when the question is expensive. What you need is explainable, auditable and refinable. Not speed at all costs. A reasoning partner isn’t just a model that speaks. It’s a system that can be held to rules. And it must know the difference between an exploration, a proposal and a commitment.

In our bank story, the commitment is the memo that goes into the official record. That memo must survive a cross-examination: here is the evidence I used, the structure of the decision, why alternatives were rejected and the boundary I complied with.

That’s not an add-on feature. It’s a different kind of system.

A good mental image is the clinician + protocol pattern: A predictive system suggests a diagnosis. Fine. But treatment cannot be “whatever sounds plausible.” It must pass through clinical rules: contraindications, interactions and guideline compliance.

The output becomes a proposal that survived a boundary. That’s defensibility.

Is Your AI Reasoning Defensible? A Short Checklist for Leaders

  • Scratchpad space: does the system have enough intermediate room to reason before it must commit to a constrained output?
  • Retrieval utility, not similarity: is retrieval selecting chunks because they’re individually similar, or because they’re collectively useful, non-redundant and ordered for comprehension?
  • The “why” factor: can the system justify why it used one piece of evidence and rejected another, especially under policy or compliance constraints?

If not, you don’t have a reasoning partner. You have a fluent guesser.

Three Warnings

  • More isn’t better. Context is non-monotonic: extra information can reduce correctness by adding noise.
  • Constraints can backfire. Strict output rules without a reasoning space produce clean-looking fragility.
  • Governance isn’t a bolt-on. If you can’t produce a defensible trace for high-stakes decisions, fluency won’t save you when scrutiny arrives. 

Closing Thought

Our bank story ends in a slightly uncomfortable place. Once the system can generate answers cheaply, the scarce resource is no longer answers. It’s legitimacy. Authority. The right to decide what counts.

We’re digitizing two layers of the organization at once: the sensory layer (what we perceive) and the cognitive layer (how we decide).

The shift from “predict the next word” to “justify the next decision” changes what matters.

So the question isn’t whether the machine can reason. It's more uncomfortable: when defensible AI reasoning becomes cheap, who controls the rules that justification must satisfy? And the boundary — what counts as evidence, what counts as compliance, what counts as “acceptable” — is where power quietly lives.
