Irreducibly Human: What Brave New Words Assumes About the Intelligence It Cannot Supply
The Machine Will Tell You It's Working. Someone Has to Ask Whether It Is.
There is a sentence buried in the middle of Salman Khan’s Brave New Words that should stop the reader cold, though it is not designed to. Khan is describing the first pilot deployment of Khanmigo in Indiana’s School City of Hobart — thirty thousand teachers and students, six months of real-world use — and reporting that the biggest measured gain was not in math or reading or science. It was in student self-confidence. The superintendent calls it a game changer. The curriculum director theorizes that confidence comes from understanding how everything connects. Khan approves of the finding and moves on.
He should not have moved on.
Self-confidence is not a learning outcome. It is a precursor to one, sometimes, under certain conditions, with certain students, for reasons that take years to untangle. The gap between “students feel more confident asking the AI questions” and “students have achieved Bloom’s two-sigma improvement” is not a gap that optimism closes. It is a gap that evidence closes, and as of the book’s 2024 publication, that evidence does not exist. Khan knows this. He says so, carefully, in a subordinate clause, and then moves on. The problem is that the book moves on from this kind of gap in exactly the same way that a school district moves on from it — with confidence that the tool is working, supported by the testimony of people who want the tool to work, measured by indicators that feel like progress but aren’t progress.
This is the specific catastrophe that Brave New Words documents without recognizing it has documented it.
The Case Khan Actually Makes
Let me be precise about what the book argues, because it is a serious argument made by a serious person who has earned the right to make it.
Khan Academy reaches one hundred fifty million learners across fifty languages on a budget equivalent to that of a single large American high school. More than fifty efficacy studies show twenty to sixty percent learning acceleration with thirty to sixty minutes of personalized weekly practice. Benjamin Bloom’s 1984 research identified a two-standard-deviation improvement in student outcomes when one-on-one tutoring in a mastery learning context replaced fixed-pace instruction — a benchmark that has stood over the field for forty years and that the industrial education system has been structurally unable to act on, because you cannot give thirty children one-on-one tutors without bankrupting the district.
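To make the size of that effect concrete: under the standard assumption of roughly normal score distributions (my framing, not the book’s), a two-standard-deviation gain moves the average student from the 50th percentile to about the 98th, since for the standard normal cumulative distribution function

\[
\Phi(2) \approx 0.977.
\]

That is how Bloom himself reported it: the average tutored student outperformed roughly 98 percent of students taught in a conventional classroom. It is a staggering effect size, which is exactly why any claim to have reproduced it in software deserves scrutiny.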
Khan’s claim is that GPT-4-based tutoring removes the economic barrier. The compute cost of Khanmigo runs five to fifteen dollars per user per month. Live tutoring runs thirty dollars an hour. The math is obvious. The aspiration is legitimate. The Ozempic exchange he reproduces as evidence — the back-and-forth in which Khanmigo refuses to deliver information and instead asks Khan to derive the GLP-1 mechanism himself — is genuinely impressive pedagogy. The system knows how to ask the right next question.
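For readers who want the obvious math spelled out, here it is with the book’s own per-unit figures and one purely illustrative assumption, an hour of live tutoring per week (the four hours a month is my number, not Khan’s):

\[
\underbrace{\$30/\text{hr} \times 4~\text{hr/mo}}_{\text{live tutoring}} = \$120/\text{mo}
\qquad \text{versus} \qquad
\underbrace{\$5\text{–}\$15/\text{mo}}_{\text{Khanmigo compute}},
\]

a cost gap of roughly eight to twenty-four times, before counting the AI’s availability at any hour and in any subject. Cheap, patient, always on, and it knows how to ask the right next question.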
But knowing how to ask the right next question is not the same thing as producing a two-sigma outcome. And Khan, who understands educational research better than most, knows this. The book asserts the potential; it does not demonstrate the effect. “AI tutors might in time even surpass the results of Bloom’s original findings,” he writes. Might, in time. In a book called Brave New Words, that hedging is doing a great deal of work.
What the Machine Requires the Human to Supply
Here is what the book assumes and never examines: that the humans in the loop — teachers, students, administrators, parents — possess a specific form of cognitive capacity that allows them to evaluate whether the tool is doing what it claims.
Call it plausibility auditing. It is the ability to look at an output, a result, a confidence score, a pilot finding, and ask: does this hold? Not to recompute the answer from scratch, but to bring enough independent judgment to the result that you can recognize when something is wrong without being told it’s wrong. The doctor who reads a radiology AI’s finding and notices the patient’s presentation doesn’t fit. The teacher who looks at a student’s AI-assisted essay and hears a voice that isn’t the student’s. The administrator who looks at a confidence gain and asks what, specifically, the students are now confident about.
Plausibility auditing is not a soft skill. It is not intuition. It is the trained capacity to supervise a powerful tool — to hold its output against your own independent judgment and notice when the two don’t match. And it is precisely the capacity that the existing educational system has never systematically developed.
This is not an accident. The curriculum was built, before machines existed, to develop the intelligences that industrial economies needed and that available pedagogical tools could measure: arithmetic accuracy, fact retrieval, syntactic correctness in writing, pattern recognition in standardized formats. These were not arbitrary choices. They were the skills that mattered for the economy that existed. Drill the multiplication tables. Memorize the periodic table. Format the five-paragraph essay correctly. These capacities were genuinely valuable. They are also exactly what machines now do better than any human who has ever lived, by orders of magnitude, without fatigue.
The intelligent response to a forklift is not to practice lifting heavier objects. The intelligent response is to learn to operate the forklift, to understand what it can and cannot lift, and — most importantly — to develop the judgment to know what needs lifting in the first place. The forklift does not make the human obsolete. It makes the human who cannot operate one obsolete, while making the human who can operate one dramatically more powerful. We are in the early years of the most powerful cognitive forklifts ever built, and the curriculum is still teaching students to lift with their backs.
What the forklift metaphor misses, and what the AI education problem turns on, is that operating these cognitive tools demands forms of reasoning that look very different from the skills they’re replacing. Machines are superhuman at pattern matching and retrieval. They are poor — sometimes dangerously poor — at the supervisory intelligence that knows whether to trust the pattern, at the causal intelligence that asks why the pattern is there, at the interpretive judgment that asks what the pattern means for this specific student in this specific school with this specific history of gaps in their knowledge.
These are not exotic capacities. They are what good teachers do when they watch a student perform well on a practice problem and suspect the student is pattern-matching to a procedure rather than understanding the concept. They are what good researchers do when they read a promising pilot result and ask whether the effect size is real or an artifact of the measurement. They are what good humans do when they sit across from a tool that is very confident and ask whether the confidence is earned.
The curriculum has never taught these things systematically. And so when we hand that curriculum a tool that is extraordinarily good at producing confident-sounding outputs — coherent, well-organized, Socratically patient outputs — we should not be surprised when the humans in the loop cannot reliably distinguish between confident and correct.
The Chapters as Case Studies
Read chapter by chapter, Brave New Words functions as an inadvertent catalog of the human judgments that AI education requires and that the book never explicitly identifies as requirements.
The writing chapters — on why students write, on fixing cheating in college — assume a teacher who can evaluate whether AI-assisted writing represents genuine compositional development or the appearance of it. This is not pattern recognition. It is the specific act of reading a student’s voice through and beneath the AI’s contribution, tracking the gap between what the tool produced and what the student learned, holding in mind a developing writer’s characteristic errors and noticing when those errors have been smoothed away rather than corrected. Ethan Mollick at Wharton reports that expectations for written work must now rise because AI help makes everything better. This is true. It is also true that “better” is a judgment that requires a human with enough independent writing knowledge to recognize what better means when the machine has done half the work. Not every teacher is that human. The book does not ask what happens in the classrooms where that teacher isn’t present.
The chapter on historical simulations raises a different problem: the capacity to reason about what we know, what we don’t know, and what a confident-seeming source has gotten wrong. Khanmigo can simulate Harriet Tubman. Gillian Brockell, a history writer at the Washington Post, pushes the simulation and finds it stilted, shaky on misattributed quotes, reluctant to engage with developments from the last half of Tubman’s own life. Khan’s response is that we can’t let perfect be the enemy of good. This is reasonable. What he doesn’t address is that evaluating whether the simulation is good enough — good enough for a ninth-grader’s understanding of Reconstruction, good enough not to introduce confident misinformation — requires a teacher with enough historical knowledge to audit the AI’s Tubman against the actual record. Most American history teachers have not read Kate Clifford Larson’s biography. They will deploy the simulation. The students will receive the simulation. Nobody in the loop will know what the simulation got wrong, because the simulation doesn’t announce its errors. It speaks in the same confident, measured voice whether it is right or wrong.
The mental health coaching chapter is the most serious case. Khan argues, with evidence from a 2022 CBT chatbot study and a collaboration with Angela Duckworth, that AI can deliver therapeutic interventions at scale. The ELIZA effect — Joseph Weizenbaum’s 1960s demonstration that people became emotionally attached to a program that merely rephrased their statements — is cited as evidence of potential. It is actually evidence of a specific risk: that students will disclose to an AI coach things they might disclose to a human therapist, without the AI possessing what the therapist possesses. Not technical knowledge. Stakes. A therapist who misses a suicide risk loses sleep. They call the parent. They escalate. They carry the weight of the failure. The weight is not incidental to the quality of the care — it is a precondition for it. A system that produces a wellness score has no skin in the outcome. Moral seriousness, the kind that responds appropriately when getting it wrong has consequences that cannot be routed around, requires being a party to those consequences. The book does not ask what happens when a student in genuine crisis tells Khanmigo something a human would escalate, and Khanmigo classifies it as a growth mindset intervention opportunity.
The college admissions chapter is the most structurally interesting. Khan documents the Harvard case in which admissions officers consistently rated Asian American applicants lower on personality traits than their interview scores warranted — a disparity documented in federal litigation. His argument is that AI assessment is more auditable, therefore fairer. This is correct as far as it goes. The part it doesn’t reach: auditing an AI assessment requires exactly the supervisory capacity that the current educational system has failed to develop. Auditability is only a virtue if someone audits. The regulatory infrastructure, the institutional will, the technical capacity to run the kind of demographic comparison that surfaced the Harvard bias — none of that exists yet, and none of it is taught. We will deploy AI admissions systems. They will be auditable in principle. They will go unaudited in practice, for the same reason Harvard’s human bias went unaudited for decades: the people with the power to audit had incentives not to.
What the Confidence Costs
There is a specific thing that happens to institutions when they apply the algorithm without supplying the judgment it requires. They become confident.
This is not a metaphor. The Hobart pilot produced measurable confidence gains. Self-reported confidence is a real thing with real downstream effects — students who feel confident ask more questions, and students who ask more questions sometimes learn faster. But the confidence produced by an AI that is patient, non-judgmental, and well-structured is not necessarily calibrated to actual understanding. It is calibrated to the interaction. The student who feels confident after thirty minutes with Khanmigo has experienced thirty minutes of scaffolded, encouraging, Socratic engagement. Whether that engagement has closed the knowledge gap it was designed to close — whether the confidence reflects mastery or the feeling of mastery — requires an assessment that the AI can initiate but that a human must interpret.
Khan knows this. The book is full of careful qualifications. “AI tutors might in time even surpass the results of Bloom’s original findings.” “We are closing in on narrowing the math gap.” “Early pilot data.” He is not a fraud. He is an optimist writing with evidence of potential and hoping, in good faith, that evidence of effect will follow.
The problem is institutional adoption. School districts do not read the subordinate clauses. They read the headline — the two-sigma problem may now be solvable — and they deploy accordingly, without the training infrastructure, without the assessment design, without the teacher capacity for plausibility auditing that turns “might work” into “demonstrably works.” Khan documents this dynamic in the book’s own structure, in the very act of moving past the Hobart finding without examining what it does and does not prove.
What Brave New Words has inadvertently written is a field guide to the human intelligences that AI education cannot supply — and a case study in what it looks like when those intelligences are assumed rather than developed. The pattern-matching, the Socratic sequencing, the patient restatement of the problem from a new angle — the machine does all of this well. What the machine cannot do is audit its own output. It cannot ask whether the question it answered was the right question. It cannot notice that student self-confidence and student mastery are different variables. It cannot recognize that the voice in an AI-assisted essay belongs to no one in the room.
These are not exotic requirements. They are what we should already be teaching. They are the specific capacities that allow a person to use a powerful tool rather than be used by it — and they are almost entirely absent from the curriculum that the teachers now deploying Khanmigo received.
This is not an argument against Khanmigo. The Ozempic exchange is real. The cost collapse is real. The reading comprehension gap, the math gap, the absence of calculus in half of American high schools — these are real, and they are the specific problems a well-designed AI tutor is positioned to address.
It is an argument that the two-sigma effect, if it comes, will not come from the machine alone. It will come from the machine plus the human capacity to evaluate what the machine is doing — to read the output skeptically, to notice the difference between a coherent answer and a correct one, to hold the institution accountable to its own measurement rather than its own confidence.
Khan ends with a call for educated bravery. He means: don’t let fear stop you from adopting the technology.
The bravery the moment actually requires is different. It is the willingness to ask, before and during and after deployment, whether the confidence the machine produces is the thing we said we were trying to build. To be, in the presence of a system that is very good at sounding right, irreducibly human enough to ask whether it is.
That is what Bloom’s two-sigma effect requires of everyone in the room. It is what Brave New Words, for all its optimism, does not quite teach us how to supply.
Tags: Brave New Words Salman Khan critique, AI education plausibility auditing, Bloom two-sigma human intelligence gap, Khanmigo institutional deployment risk, irreducibly human metacognitive judgment