There is a clinical decision-support system in this story, and it passed every test the engineers gave it. Ninety-four percent accuracy. Every internal review threshold met. Regulatory submission cleared. The fairness metrics within tolerance. Three patients were harmed within six months of deployment.
I want to sit with that sequence for a moment before moving on to the structural argument, because the sequence is the argument. The system was not fraudulent. The engineers were not reckless. The validation framework was real and, in its own terms, rigorous. And three people were harmed — not despite the rigor, but through a gap in it that the rigor could not see. The system was tested on the question it was built to answer. The harms arrived from a different question: what is going on with this specific patient? The two questions are related. They are not the same. The framework did not surface the gap because the framework was scoped to the first question, and no one had been trained to ask whether the scope was the problem.
This is the situation AI deployment keeps producing, and the reason it keeps producing it is not that the tools are immature or the engineers are careless. The reason is structural. There are three limits that capability scaling cannot fix — not problems to be solved as models improve, not failure modes that better tooling will eventually close, but constitutive features of what AI systems are. Meaning. Intentionality. The gap between data and world. Name them clearly and the clinical case stops looking like an anomaly. It starts looking like what was always going to happen.
What the Limits Actually Are
The first limit — meaning — is easy to misread as a philosophical quibble and hard to dismiss once you see it working. The system processes symbols. The symbols have referents in the world. The system has no representation of the referents. It manipulates the symbols. The meaning of those symbols — what they point to in the specific world the user inhabits, the world of this patient’s chart, this loan applicant’s actual financial circumstances — is supplied by the user, not the system. The output is read as a statement about the world. The system produced it without a model of what the world contains. When those two pictures align, everything looks fine. When they diverge — at the distribution boundary, in cases the training data never reached — the user is still reading a statement about the world, and the system is still manipulating symbols.
You can hear the objection already: modern large multimodal models acquire something like meaning through the structure of their embeddings, through grounding in images and other modalities, through the patterns of association learned over enormous corpora. This is a serious objection and it deserves a serious response. The response is not to pretend the question is settled. It is to observe that the contestation doesn’t need to be settled for the operational consequence to bind. The system’s behavior is inconsistent with the user’s expectation of meaning often enough that someone must perform meaning-attribution for the system. That work cannot be offloaded to the system itself. Whether contemporary models have something like meaning is a deep and genuinely open question. Whether an engineer can safely assume they do, before deploying a system into a clinical context, is not.
The second limit is intentionality — the philosopher’s word for aboutness, the fact that a thought is directed toward something in the world, that a statement points at a particular kettle in a particular kitchen. When you say the kettle is on, your statement is directed toward that specific kettle by you, the speaker, through your relationship to the world the words are pointing at. The system’s outputs lack this stable directedness. Two deployments of the same system in different contexts produce outputs that users read as being about different things. The system’s “aboutness” tracks the user’s reading, not an independent stable directedness of its own. Whether functional goal-pursuit is equivalent to intentionality is a question worth leaving open. What is not open is the operational consequence: the system’s outputs don’t carry stable referents across deployments, and someone must supply the directedness. That someone is the human supervisor.
The third limit is the one I am most certain about, and the one most important to hold clearly: the data is always less than the world. The system is trained on data. The data is a sample of the world, captured by particular instruments under particular conditions with particular exclusions. The system’s competence is over the data, not the world. No amount of data scaling closes this gap, because the gap is structural: a larger sample is still a sample, and the parts of the world not in the data are not learnable from the data. This is not contested the way the first two limits are. It is sometimes obscured by the claim that “with enough data, the model can generalize,” which is true inside a distribution and false at the boundary. The boundary is where AI systems most often fail. The failures look surprising because the validation set was inside the boundary and the deployment crossed outside it.
Ninety-four percent accuracy. The three patients were in the other six percent — except that framing is too generous, because the failures weren’t randomly distributed across the six percent. They were clustered at exactly the boundary where the training data ran out and the clinical reality did not.
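That clustering is worth making concrete, because the arithmetic is the whole trick. Here is a minimal sketch in Python, with invented numbers rather than anything from the actual case: stratify the same evaluation set by whether a case sits inside or outside the training distribution, and the aggregate accuracy that clears the review threshold can coexist with a stratum where the system is closer to guessing than to competence.

```python
# Illustrative numbers only -- not data from the clinical system described above.
# The point is arithmetic: an aggregate metric can clear a threshold while the
# errors cluster in the slice of cases the training data never covered.

cases = {
    # stratum: (number of cases, number predicted correctly)
    "inside_training_distribution":  (940, 920),
    "outside_training_distribution": (60, 20),
}

total = sum(n for n, _ in cases.values())      # 1000 cases
correct = sum(c for _, c in cases.values())    # 940 correct
print(f"aggregate accuracy: {correct / total:.1%}")     # 94.0%

for stratum, (n, c) in cases.items():
    print(f"{stratum}: {c / n:.1%} over {n} cases")     # 97.9% inside, 33.3% outside
```

Both sets of numbers are true at once. The aggregate is the number that clears review; the stratified view is the one the framework was never asked to report.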
Two Famous Arguments and What They Actually Show
Turing’s 1950 proposal is methodologically elegant: if a machine can convincingly imitate a human in conversation, on what principled basis would we deny it intelligence? Don’t require something more than behavioral evidence for intelligence in machines, because we don’t require something more for other humans. The argument settles a methodological question. What it does not settle — and this is what gets lost in the citation — is whether the thing satisfying the test has meaning, intentionality, or competence over the world. The test is over behavior. The limits are about what stands behind behavior. Turing knew this; the test was a methodological proposal, not a metaphysical claim. The people who cite him as having shown that behavioral imitation is intelligence are giving him credit for a stronger claim than he made.
Searle’s Chinese Room presses the reverse point: behavior consistent with understanding does not entail understanding. A person following symbol-manipulation rules can produce outputs indistinguishable from those of a Chinese speaker without understanding Chinese. Therefore symbol manipulation is not understanding. What this argument does not settle is whether contemporary systems are doing only symbol manipulation, or whether the embedding structures, the attention patterns, the multimodal grounding constitute something more. Searle’s argument is a strong constraint on shallow accounts of meaning. It is not a deep constraint on what current architectures might be. The people who cite him as having shown that AI systems cannot understand are crediting him with a stronger claim than the argument supports, the same overreach the first group extends to Turing.
Taken together, the two arguments produce a workable operational stance: behavior is testable evidence and should be taken seriously, and behavior is not the whole of what we mean by understanding, meaning, or intentionality. Both moves at once. The validator who only tests behavior misses the limits. The validator who only invokes the limits skips the testing. The job is to do both, and the discomfort of holding both is not a failure of the methodology — it is the methodology working correctly.
Where the Limits Bite
Not every deployment is equally exposed to these limits. A system classifying images of products on a manufacturing line operates in a world where the limits are largely irrelevant. The deployment context is well-specified, the data-world gap is small and monitorable, the human interpreting the classifications supplies the necessary meaning without drama. Skepticism here is methodology, not a safety mechanism. The supervisor verifies, monitors, calibrates.
A system producing clinical recommendations, autonomous-vehicle decisions, agentic actions in shared social spaces, judicial-risk assessments — these are the deployments where the limits bite hard. The system’s apparent competence outruns its actual competence in ways no metric will fully capture. The supervisor’s skepticism is the safety mechanism, not an optional overlay.
The engineering response to this situation is specific. You specify, in writing, what the system can be tested for and what it cannot. You include the limits explicitly in the documentation — not in fine print, but as a primary product of the validation process. A regulator or an adoption committee reading the documentation can see what the validation does and does not warrant, not because you have hidden the limits in a disclaimer, but because naming the limits is part of the work. You maintain human oversight at the points where the limits bite: a human reviews the semantic interpretation (meaning), supplies the directedness (intentionality), monitors the deployment distribution and is empowered to override (data-world gap). And you build the infrastructure for the override to be real. An override that is documented but practically impossible — no time, no standing, no legibility — is not an override. It is a fiction. The clinician has to have the time and the authority to disagree with the system. This has to be the practice, not the disclaimer.
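What “monitors the deployment distribution and is empowered to override” can look like in practice is not exotic. Here is a minimal sketch with hypothetical names (TrainingStats, distance_from_training, route) and a deliberately crude drift check, a per-feature z-score against training statistics standing in for whatever out-of-distribution measure a real deployment would validate. The statistic matters less than the routing: a case the monitor flags should reach a human with the standing to act on it, rather than silently receiving a model output.

```python
from dataclasses import dataclass

# Hypothetical names and thresholds, for illustration only. A real deployment would use
# a validated out-of-distribution measure; the routing logic is the point of the sketch.

@dataclass
class TrainingStats:
    mean: dict[str, float]   # per-feature means from the training data
    std: dict[str, float]    # per-feature standard deviations from the training data

def distance_from_training(case: dict[str, float], stats: TrainingStats) -> float:
    """Largest per-feature z-score: a crude proxy for how far outside the data a case sits."""
    return max(
        abs(case[f] - stats.mean[f]) / stats.std[f]
        for f in stats.mean
        if stats.std[f] > 0
    )

def route(case: dict[str, float], stats: TrainingStats, model, z_threshold: float = 3.0) -> dict:
    """Return a model recommendation only for cases the monitor treats as in-distribution.
    Everything else is held for human review: the override path is a default, not an exception."""
    if distance_from_training(case, stats) > z_threshold:
        return {"recommendation": None, "status": "HOLD_FOR_CLINICIAN_REVIEW"}
    return {"recommendation": model(case), "status": "MODEL_ASSISTED"}
```

One design choice worth noticing in the sketch: a held case produces no recommendation at all, rather than a recommendation with a warning attached, because an output that arrives with a caveat is still an output the clinician has to argue against. The code does not create the clinician’s time or authority; it only refuses to paper over their absence.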
The Authority to Say No
There are deployments where the limits, given the stakes, are a reason not to deploy at all. The supervisor’s authority to refuse deployment is, structurally, the most important authority in the system. Most current deployments do not preserve it. The validator is hired to validate. The validation is expected to clear. The option of refusal is assumed away.
This is the thing most likely to be dismissed as naïve. The institutional reality is real — the business case has been made, the procurement is done, the announcement is scheduled, the political cost of stopping is high. That reality is worth acknowledging. And then it is worth asking what it means that we have built deployment processes in which the option to say no has been assumed away at the moment it is most needed.
The case against refusal is usually framed as realism. Engineers have no real power to stop deployments; their job is to make the best of what is decided above them. This realism is worth taking seriously. And then it is worth asking: what is the limit case? At what level of stakes does the individual engineer’s obligation to refuse become binding regardless of institutional pressure? The clinical system that harmed three patients is an answer. The judicial risk assessment that contributed to unjust incarceration is an answer. The autonomous vehicle that killed someone is an answer. These are not edge cases in the abstract. They are the specific forms the limits take when the stakes are real and the override infrastructure is fictional.
A validation practice that cannot accommodate refusal is not a safety practice. It is documentation of a deployment that was going to happen regardless. The calibration work, the bias analysis, the governance structures — all of it becomes elaborate cover if the option to stop is not real.
What the Work Looks Like
Most engineers spend their careers being right fifty to seventy percent of the time on questions where they state ninety percent confidence. They do not know this. Nobody runs the experiment on them. The practice that closes this gap is not a methodology you learn in a course and apply mechanically. It is the deliberate, repeated act of stopping, locking the prediction before looking at the outcome, asking what the data is actually evidence of, saying out loud what you do not know. Built over years, through the accumulation of small acts of epistemic honesty. It changes what you see. It changes what questions you ask about a deployment before it goes live rather than after.
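The locking step can be made mechanical, which helps it survive contact with deadlines. A minimal sketch, with a hypothetical log format: record each prediction and its stated confidence before the outcome is known, then compare stated confidence with the observed hit rate per confidence bucket. The gap between the two columns is the calibration gap described above.

```python
from collections import defaultdict

# Hypothetical log for illustration: each entry is (stated confidence, outcome),
# with the prediction and confidence recorded *before* the outcome was observed.
log = [
    (0.9, True), (0.9, False), (0.9, True), (0.9, False), (0.9, True),
    (0.7, True), (0.7, False), (0.7, True),
    (0.5, False), (0.5, True),
]

buckets = defaultdict(list)
for confidence, outcome in log:
    buckets[confidence].append(outcome)

for confidence in sorted(buckets, reverse=True):
    outcomes = buckets[confidence]
    hit_rate = sum(outcomes) / len(outcomes)   # True counts as 1, False as 0
    print(f"stated {confidence:.0%} -> observed {hit_rate:.0%} over {len(outcomes)} predictions")
```

Over ten predictions the numbers are noise. Over years they become the experiment that, as noted above, nobody runs.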
The system passed every test. The engineers designed the wrong tests. Three patients were harmed. That sequence is not a historical artifact to be studied from a distance. It is the structure of the next failure — somewhere in a deployment that has cleared every internal review threshold, in a context the training data didn’t reach, in a case the framework was not scoped to address. The person who designs the right tests, who recognizes the limit, and who decides that without those tests the deployment should not proceed — that person has been trained to recognize the gap, and has the authority to act on the recognition, and uses both.
That is the professional the field needs. That is the work.
Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI. He writes on AI supervision, educational technology, and music research at bear.musinique.com, skepticism.ai, and theorist.ai.
Tags: AI supervision structural limits, meaning intentionality data-world gap, Turing Searle behavioral testing, clinical decision support failure, validator stop condition refusal authority


