The Comparison That Was Never Fair
What Intelligent Tutoring Systems Actually Measured, and What They Were Compared Against
In 2014, RAND published one of the most carefully designed evaluations of an educational technology system in the history of the field. John Pane, Beth Ann Griffin, Daniel McCaffrey, and Rita Karam ran a cluster-randomized controlled trial across 147 schools in seven states, assigning schools that together served roughly 25,000 students either to adopt Cognitive Tutor Algebra I or to continue with whatever algebra instruction they had previously offered. The outcome measure was a standardized algebra proficiency exam. The design was, by the standards of a field that routinely tolerates thin evidence and motivated reporting, unusually rigorous.
The finding was specific. In the first year of implementation, Cognitive Tutor produced no statistically significant effect on algebra proficiency. In the second year, a significant positive effect emerged at high schools — approximately 0.20 standard deviations, sufficient to move a median student from the 50th to roughly the 58th percentile. At middle schools, the second-year effect was similar in magnitude but did not reach statistical significance.
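The percentile translation is just the normal curve; a two-line check (assuming roughly normally distributed scores, as the percentile framing itself does) reproduces the figure:

    from scipy.stats import norm
    # A 0.20 SD gain moves a student at the median of a normal score distribution
    # from the 50th percentile to about the 58th.
    print(round(norm.cdf(0.20) * 100, 1))  # 57.9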
Pane and colleagues called this an “implementation learning curve.” They were careful to note that the learning did not seem to happen at the level of individual teachers — students of teachers new to the system in year two performed similarly to students of experienced teachers. The learning happened at the level of schools: scheduling, infrastructure, coordination, institutional adjustment to a new instructional logic. The schools took a year to figure out how to implement Cognitive Tutor, and then the system worked.
This is what a rigorous evaluation of an intelligent tutoring system looks like. The findings are real. The effects are modest. The implementation costs were substantial — approximately $97 per student per year for Cognitive Tutor against approximately $28 for the traditional textbook instruction it replaced. And in the field’s characteristic framing, this result was narrated as disappointment. Intelligent tutoring systems were supposed to approach human tutoring effectiveness. They had not.
I want to examine that disappointment. Not to redeem ITS, and not to dismiss the evaluation record. I want to examine what was being compared to what, and whether the comparison — the one that has driven ITS research, ITS funding, and now AI-tutor rhetoric for forty years — was ever structurally sound.
What the Tutor Actually Measured
Cognitive Tutor was built to embody a specific theory of cognition. John Anderson’s ACT-R framework posits that skill acquisition is the conversion of declarative knowledge — facts, concepts — into procedural knowledge: production rules, condition-action pairs. To become skilled at algebra is to acquire a set of increasingly sophisticated rules for algebraic manipulation. If the goal is to isolate the variable and its coefficient is 4, then divide both sides by 4. The rule fires. The step is taken correctly.
The instructional design that follows from this is specific. If you can specify the production rules that constitute algebraic competence, you can build a system that monitors whether each rule has been acquired. Cognitive Tutor did exactly this. As a student worked through a problem, the tutor compared each step against its internal model of valid solution paths. A correct step let the student proceed. A step matching a stored buggy production — a common misconception encoded in the system — triggered immediate corrective feedback. A request for help delivered a graduated hint sequence targeting the specific production the student was struggling to fire.
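In schematic form, that decision logic looks roughly like the sketch below. The names and data structures are my own illustration, not Cognitive Tutor’s internals; the actual production-rule machinery was far richer than a pair of lookup tables.

    def trace_step(step, valid_next_steps, buggy_productions, hints, hint_requested=False):
        """Classify one student step the way a model-tracing tutor would.
        All names here are illustrative, not Cognitive Tutor's internals."""
        if hint_requested:
            return ("hint", hints[0])                # deliver the next hint in the graduated sequence
        if step in valid_next_steps:
            return ("correct", None)                 # step lies on a valid solution path: proceed
        if step in buggy_productions:
            return ("bug", buggy_productions[step])  # matches a stored misconception: targeted feedback
        return ("error", "Step does not match any known solution path.")  # unrecognized: flag it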
Across many problems, the tutor maintained running Bayesian estimates of whether each production rule had been mastered. Students could not advance to new material until the estimates crossed a mastery threshold. This is model tracing and knowledge tracing: two technical operations that together constitute the system’s measurement apparatus. What the apparatus measures is step-level correctness, time per step, hint requests, error patterns, and estimated mastery of each production rule. These are not arbitrary choices. They are what ACT-R theory specifies as relevant to procedural skill acquisition. The design is internally consistent with the theory it was built on.
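The “running Bayesian estimates” follow the standard knowledge-tracing update described by Corbett and Anderson: revise the mastery estimate in light of the observed step, then allow for the chance that the rule was just acquired. A minimal sketch, with illustrative parameter values rather than the ones any deployed tutor actually used:

    def bkt_update(p_mastered, correct, p_slip=0.10, p_guess=0.20, p_learn=0.15):
        """One knowledge-tracing update for a single production rule."""
        # Bayes step: condition the mastery estimate on the observed step.
        if correct:
            posterior = (p_mastered * (1 - p_slip)) / (
                p_mastered * (1 - p_slip) + (1 - p_mastered) * p_guess)
        else:
            posterior = (p_mastered * p_slip) / (
                p_mastered * p_slip + (1 - p_mastered) * (1 - p_guess))
        # Learning step: the rule may have been acquired on this opportunity.
        return posterior + (1 - posterior) * p_learn

    # A student advances only once the estimate crosses a mastery threshold (0.95 is conventional).
    p = 0.25
    for outcome in [True, False, True, True, True]:
        p = bkt_update(p, outcome)
    print(round(p, 3), p > 0.95)  # ~0.987, True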
The 1995 paper in which Anderson, Corbett, Koedinger, and Pelletier published their decade of findings was titled Cognitive Tutors: Lessons Learned. The plural in that title is deliberate. The paper names what the system does not measure with the same specificity as what it does. Cognitive Tutor does not model affective state. It cannot detect whether a student is frustrated, bored, or emotionally disengaged from the material. It cannot identify conceptual confusion that lives above the production-rule grain — a student may fire productions correctly while failing to understand the domain they are operating in, and the tutor will not notice. It does not measure transfer, durability, or motivation. These are not oversights. They are structural features of a system designed for a specific theoretical purpose.
The researchers knew exactly what they had built. The disappointment that followed was, for the most part, not theirs.
What Human Tutors Actually Do
The comparison that generated the disappointment is this: ITS produces effect sizes of roughly 0.20 to 0.40 sigma relative to classroom instruction. Expert human tutors produce effect sizes of roughly 0.40 to 0.80 sigma. Therefore ITS has failed to approach human effectiveness.
This comparison requires that both numbers measure the same construct at different magnitudes. They do not.
The research literature on what expert human tutors actually do is not sparse, and much of it was produced by the same researchers who built ITS. Art Graesser — who built AutoTutor, one of the more sophisticated ITS systems in the research tradition — spent years analyzing videotaped sessions between expert tutors and students, specifically to understand what tutors were doing that his system might learn to do. What Graesser’s analyses documented was a specific set of interactional moves.
Tutors approach a topic with what Graesser called expectations and misconceptions: a mental model of the components of correct understanding and a map of how students typically go wrong. As students respond, the tutor evaluates the response against this map — not syntactically, as an ITS matches a step against a production rule, but semantically, tracking which elements of the expected understanding are present and which are missing. The next move is determined by this evaluation. The response is therefore flexible in a way that production-rule matching is not.
Tutors continuously check comprehension. “Can you say that in your own words?” “What would happen if this were different?” These are not assessment items; they are real questions that tutors use to calibrate what to do next. The comprehension check is an instrument for reading the student’s understanding, not recording it in a database.
Tutors manage affect. Graesser’s research documented that expert tutors are often deliberately imprecise about negative feedback — indirect, softened, delivered in ways designed to protect the student’s willingness to continue engaging. This is not sloppiness. It is the management of an ongoing relationship whose continuation matters to the learning. A student who has been made to feel consistently stupid by their tutor stops engaging, and a tutor who cannot detect or respond to that risk is a different kind of instrument.
Tutors follow student questions. When a student asks something the tutor had not planned to address, expert tutors engage. Graesser, describing AutoTutor’s limitations with characteristic directness, noted that his system had to use “diversionary tactics” when students asked questions outside its agenda. Human tutors do not divert. They follow.
Michelene Chi, working from a different angle, documented that what makes human tutoring effective is not primarily the information the tutor delivers. It is the interactivity — the tutor’s prompts that elicit the student’s own elaboration, the student’s attempts at articulation that reveal gaps, the tutor’s calibration of the next move to what the student’s specific response has revealed. Self-explanation is a primary driver of conceptual change, and expert tutors are specifically skilled at eliciting the right kind of self-explanation through well-calibrated prompts. An ITS can prompt for self-explanation. What it cannot do is read the specific partial answer the student just produced and respond to that answer’s specific weaknesses.
And from an even earlier lineage: Wood, Bruner, and Ross, in a foundational 1976 paper, identified six functions tutors perform when scaffolding learners through tasks. Recruitment of interest. Reduction of degrees of freedom. Direction maintenance. Marking critical features. Frustration control. Demonstration. Of these six, Cognitive Tutor was specifically engineered to perform one: reduction of degrees of freedom, the step-by-step scaffolding that makes a complex problem tractable by breaking it into smaller operations. The tutor is structurally blind to recruitment, structurally unable to perform frustration control, and limited in demonstration to displaying the system’s own solution paths rather than modeling the expert’s move for the novice in ways the novice can watch and internalize.
The Axis Problem
Here is what this produces.
The ITS measurement apparatus was built to measure one specific dimension of what expert human tutors do: the reduction-of-degrees-of-freedom move. Cognitive Tutor performs this move with remarkable precision. Its model tracing, its knowledge tracing, its mastery-learning constraints — these are all optimized for ensuring students acquire the production rules that constitute procedural competence in a specific domain. When evaluated on measures aligned with this construct, the system produces real effects. Pane’s 0.20 sigma is not noise. It reflects what the system actually does.
Human tutoring, as documented in Graesser’s and Chi’s and Wood, Bruner, and Ross’s research, involves that same move alongside several others: expectation-and-misconception dialogue, comprehension checks, affective management, student-question handling, recruitment, frustration control, demonstration. The effect sizes produced by expert human tutors in the research literature reflect this fuller set of moves acting in concert, against whatever outcome measures the studies used.
When these two numbers — the ITS effect and the human-tutoring effect — are placed on a single sigma axis for comparison, the implicit claim is that they measure the same construct at different magnitudes. They do not. ITS measures what a procedural-scaffolding technology produces on assessments that test procedural skills. Human tutoring measures what a full interactional relationship produces on assessments that, depending on the study, test some combination of procedural skills and broader constructs. The numbers can be placed on the same axis only if the underlying outcome measures are the same — which they frequently are not — and only if the interactional moves the two interventions involve are comparable — which the research literature establishes they are not.
This is the construct mismatch. It is not a peripheral observation. It is the structural feature of a comparison that has been doing field-level work for forty years, driving research agendas, guiding institutional adoption decisions, and anchoring the contemporary rhetoric that AI can approach human instructional effectiveness. What the comparison has consistently obscured is that the two things it is comparing were never fully on the same axis.
Cognitive Tutor did something real, with discipline and theoretical grounding, and produced genuine effects when evaluated appropriately. The disappointment in its failure to match human-tutor effect sizes is partly the disappointment of a comparison that was ill-posed from the start. Asking whether Cognitive Tutor matched human tutors is like asking whether a skilled surgeon matches a general practitioner across all dimensions of medical care. The surgeon is extraordinarily good at the specific thing the surgeon does. The general practitioner does that thing and many others. The sigma gap between them does not mean the surgeon failed.
The Inheritance
The current AI-tutor moment has been presented, in much public discourse, as an advance that finally addresses what ITS lacked. Large language models can engage in natural-language dialogue. They can handle questions they were not specifically designed to handle. They can, in principle, perform some of the interactional moves Graesser documented as characteristic of expert human tutoring — the expectation-and-misconception dialogue, the comprehension check, the flexible response to what a student actually said. The rhetoric suggests the construct mismatch has been resolved.
Read through the ITS apparatus, the claim is more complicated than the rhetoric suggests.
The current AI-tutor evaluation studies still measure what ITS evaluations measured: item-level mastery, step-level performance, post-test scores on aligned assessments, immediate outcomes rather than durable learning. The measurement apparatus has been inherited. What has changed is the interaction layer. Whether the interaction-layer changes produce meaningfully different learning outcomes — or produce the appearance of more-human interaction without producing the underlying effects — is an empirical question the current literature has not cleanly answered. The Kestin Harvard physics study, with its 0.73 to 1.3 sigma effects on researcher-designed tests of the specific content a two-hour AI session had just covered, is measured on a Skinnerian axis. The measurement does not index whether the AI performed the interactional moves that make human tutoring what it is. It indexes whether students correctly answered questions about surface tension and fluid flow immediately after being tutored about surface tension and fluid flow.
The construct mismatch is not solved by better interaction capabilities. It is solved by better measurement. A system that performs rich tutoring interaction and is evaluated on aligned immediate assessments remains, from the evaluation’s perspective, on the same axis as Cognitive Tutor. The measurement apparatus determines what the sigma numbers mean, and the measurement apparatus has not substantially changed across the transition from production-rule ITS to generative AI tutoring.
This matters because the comparison that has driven forty years of ITS disappointment is being recycled to drive the current AI-tutor moment. The benchmarks invoked — Bloom’s 2-sigma, the expert-human-tutor effect-size range, the framing that AI can now “approach” human instruction — are the same benchmarks. The construct mismatch they depend on is the same mismatch. Whether a system that generates flexible natural-language responses has actually closed the distance that matters, or has closed the part of the distance that is easier to perform while leaving the harder parts unaddressed, is the question the measurement apparatus is not yet equipped to answer.
Three Questions to Ask
When you next encounter a claim that an educational technology has approached the effectiveness of human tutoring, three questions will orient you.
What did the technology actually measure? If the evaluation used item-level or step-level assessments aligned with the technology’s instructional content, then the system has been measured against the construct it was built to serve. This is not a criticism; it is a description of what the evaluation supports.
What does the human-tutoring construct actually involve? The research literature on expert human tutors documents a specific set of interactional moves — expectation-and-misconception dialogue, comprehension checks, affective management, student-question handling, recruitment, frustration control, demonstration. These are not peripheral features. They are the substance of what expert tutors do.
Was the comparison conducted on an axis that indexes both? If the outcome measure favors procedural scaffolding — as most ITS and AI-tutor evaluations do — the axis is not measuring what human tutoring does beyond procedural scaffolding. The comparison is limited by the measurement choice. A finding that the technology approaches human tutoring on such a measure is a finding about procedural scaffolding, not about the interactional richness that the human-tutoring construct actually comprises.
These questions do not answer whether AI can replace human tutors. They answer the prior question: what are we measuring when we make the comparison? The field has been skipping the prior question since 1984, when Benjamin Bloom placed his two-sigma number on the same axis as his classroom-instruction comparison and the discourse collapsed the distance between them into a single rhetorical invitation. Cognitive Tutor responded to the invitation seriously, with theoretical rigor and methodological discipline, and produced 0.20 sigma at high schools after a year of implementation and $97 per student per year of cost. That result is not a failure. It is what the move Cognitive Tutor was designed to perform looks like when measured honestly, at scale, in actual schools.
The number that system was compared against was never on the same axis. The comparison is the problem. It was the problem in 1990, when ITS researchers were trying to build what Bloom’s comparison named. It is still the problem now, when generative AI is being asked to close a gap the measurement apparatus cannot fully see.
Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). | skepticism.ai | theorist.ai
Tags: intelligent tutoring systems construct validity, Cognitive Tutor RAND evaluation, human tutoring comparison mismatch, ACT-R model tracing procedural scaffolding, AI tutor measurement apparatus critique


