The Inheritance We Never Examined
How Skinner’s Teaching Machine Still Grades Your Children’s Software
There is a machine in every classroom now, and it measures what it has always measured. The name on the box changes — Duolingo, Khanmigo, i-Ready, DreamBox — but what the box counts has remained, across seventy years of silicon and software and venture capital and neuroscience, almost perfectly stable. Accuracy per item. Time per response. Progression through atomized units. Performance on the test the system was built to prepare you for.
B.F. Skinner named these measurements in 1958. He had a good reason.
He had observed his daughter’s fourth-grade arithmetic class and been, in his own word, shocked. Students completed problems and waited. The papers were collected. Perhaps two days later, perhaps a week, the marked papers returned. By the time the feedback arrived, the behavior it was meant to reinforce had already moved on, taken up residence in some adjacent habit of mind that was no longer the one in need of correction. Skinner believed he understood the mechanism of learning better than anyone alive — the contingencies of reinforcement, the precise timing of feedback, the accumulation of correctly shaped behavior into competence — and what he had watched in that classroom was the systematic breaking of every mechanism he understood. A technology that could restore the contingencies, he reasoned, would be a technology that could teach.
His teaching machine presented material one frame at a time. The student responded. The machine verified, immediately, whether the response was correct. The contingencies were repaired.
What I am asking you to notice is not that this was wrong. I am asking you to notice what the machine measured — accuracy per frame, time per response, progression, error patterns — and to hold those measurements in mind as we trace them forward through sixty-six years of educational technology that kept the apparatus while abandoning almost everything else about Skinner’s framework.
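To make the through-line concrete, here is a schematic of the record the apparatus keeps. The field names below are mine, not Skinner's, but the four quantities are the ones every system in this history logs in some form.

```python
from dataclasses import dataclass, field

@dataclass
class ItemRecord:
    """One learner-item interaction, as the inherited apparatus sees it."""
    item_id: str                   # the frame, problem, or exercise presented
    correct: bool                  # accuracy per item
    response_seconds: float        # time per response
    error_code: str | None = None  # which wrong answer was given, if any

@dataclass
class LearnerLog:
    """What the apparatus accumulates: progression through atomized units."""
    records: list[ItemRecord] = field(default_factory=list)

    def accuracy(self) -> float:
        if not self.records:
            return 0.0
        return sum(r.correct for r in self.records) / len(self.records)

    def units_completed(self) -> int:
        return len({r.item_id for r in self.records if r.correct})
```

Nothing in this record can answer what the learner will retain in six months. That absence is the subject of the next section.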
What the Machine Could Not See
The teaching machine could not look up from the immediate interaction to ask what the student would remember in six months.
This is not a glancing criticism of Skinner. His behavioral framework did not require him to ask the question; the question was not yet a question the field had organized itself to ask in the precise way that Bjork and Bjork’s subsequent research would demand. Skinner’s science was about the shaping of behavior through reinforcement, and a behavior that could be elicited at the moment of measurement had been shaped. That the behavior might dissolve in the absence of the reinforcing conditions was not, within behaviorism, a separate problem requiring separate measurement. Generalization was expected to follow naturally.
But this is where the inheritance turns costly. The assumption that immediate performance predicts durable learning was embedded in the measurement apparatus before it was tested empirically. By the time Robert and Elizabeth Bjork’s work made the distinction between retrieval strength and storage strength unavoidable — by the time it was clear, empirically, that the conditions maximizing immediate performance (massed practice, aligned testing, minimal difficulty) could actively impair long-term retention — the measurement apparatus had already been handed down through Patrick Suppes’s 1960s computer-assisted instruction and was settling into the bones of the field.
Suppes’s system at Stanford presented arithmetic problems to elementary students and recorded what Skinner’s machine had recorded: accuracy rates, response times, error patterns, progression. The technology shifted from mechanical device to mainframe computer. The measurements did not shift. Accuracy rose from 53 percent to over 90 percent. Response times fell from 630 seconds to 279. Suppes reported these numbers as evidence the system worked, and within the apparatus he had inherited, they were. He was not wrong to report them. He was working inside a set of choices about what evidence looked like that the apparatus had bequeathed him without flagging as choices.
The question of what those 90-percent-accurate students could do two years later was not asked.
The Apparatus Becomes Theory
Here is what makes the inheritance pattern strange rather than simply historical: the apparatus persisted past the abandonment of the theoretical framework that had justified it.
John Anderson’s Cognitive Tutor, developed in the 1980s and 1990s at Carnegie Mellon, was built on ACT-R theory — a cognitive-psychological architecture that treated learning as the acquisition of production rules rather than the shaping of behavior. Theoretically, this was a departure from Skinner significant enough to constitute a revolution. The language of reinforcement was replaced by the language of cognition. The unit of analysis shifted from the frame to the production rule.
The measurement apparatus did not shift.
The Cognitive Tutor recorded step-level correctness — whether each student action matched one of the production rules the cognitive model identified as correct. It recorded time per step. It recorded hint requests, error patterns, estimated mastery of each production rule through Bayesian knowledge tracing. When Anderson and colleagues published their foundational 1995 paper in the Journal of the Learning Sciences, the evidence they offered that the system worked was: step-level accuracy, progression, and post-test performance on assessments aligned with the content the tutor had taught.
Skinner’s apparatus, operating at higher resolution, within a more sophisticated theoretical framework, carrying new vocabulary.
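That mastery estimate is worth seeing concretely, because it shows how much resolution the apparatus gained without changing what it observes. Here is a minimal sketch of the standard Bayesian knowledge tracing update, in the general form used in the Cognitive Tutor tradition; the parameter values are illustrative, not Anderson's.

```python
def bkt_update(p_know: float, correct: bool,
               p_slip: float = 0.1, p_guess: float = 0.2,
               p_learn: float = 0.15) -> float:
    """One Bayesian knowledge tracing step for a single production rule.

    p_know  : prior probability the rule is already mastered
    correct : whether the observed step matched a correct rule application
    Returns the posterior probability of mastery after this opportunity.
    """
    if correct:
        evidence = p_know * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_know) * p_guess)
    else:
        evidence = p_know * p_slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - p_guess))
    # Learning transition: the student may acquire the rule on this opportunity.
    return posterior + (1 - posterior) * p_learn

# A run of mostly correct steps drives the estimate upward.
p = 0.25
for observed in [True, True, False, True, True]:
    p = bkt_update(p, observed)
print(round(p, 3))  # mastery is conventionally declared when p exceeds ~0.95
```

Notice what the update consumes: step-level correctness and nothing else. The estimate is a sophisticated transformation of exactly the quantity Skinner's machine recorded.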
Anderson and colleagues were, I want to say this plainly, more honest about the limits of their measurements than most of the researchers who cited them. The 1995 paper notes explicitly that students “display transfer to the degree that they can map the tutor environment into the test environment” — an acknowledgment that the evidence of learning the system could produce depended on the degree to which the post-test resembled the tutor’s own format. This is the measurement-alignment problem stated with precision by the researchers who built the system it applied to. The acknowledgment was there. What happened subsequently was that the effect sizes from aligned post-tests entered the literature as if Anderson’s own caveat had not been published alongside them.
The apparatus inherits even what its originators flagged as provisional.
The Industrial Turn
The 2010s commercial adaptive-learning era — Knewton, DreamBox, i-Ready, ALEKS — represents the point at which the inherited apparatus became an industry standard.
Knewton’s José Ferreira, during the 2012-2015 period of the platform’s public prominence, positioned his technology as capable of personalization so granular that it would transform education at scale. The claim invoked the Suppes promise in the language of twenty-first-century data science. What the platform actually measured was behavioral engagement data: which problems students attempted, which hints they took, how their patterns of interaction with the system correlated with eventual performance on the system’s own assessments. Independent efficacy research on Knewton was, during the period of its most expansive claims, notably absent. The apparatus was present in the measurement choices; the evidence was not.
DreamBox Learning, which earned more research attention than most adaptive platforms, became the subject of a 2016 Harvard Center for Education Policy Research study that found students at the median gained 1.4 to 3.9 percentile points on the NWEA MAP for approximately 7 to 8 hours of DreamBox usage. The researchers were transparent about a critical limitation: DreamBox usage might “partially reflect students’ motivation levels,” meaning the correlation between usage and achievement might reflect that motivated students both use DreamBox more and learn more, independent of DreamBox’s instructional contribution. The acknowledgment, honest and specific, appeared in the paper. It rarely appeared in the citations that followed.
i-Ready produced a particularly clarifying version of the apparatus’s internal logic. The platform’s efficacy research typically demonstrated that students who achieved “usage fidelity” — meeting the system’s recommended weekly engagement minutes — showed higher scores on the i-Ready Diagnostic. The Diagnostic was itself calibrated to predict state test performance. A system measuring how well students learn to do well on the assessment the system provides, where the assessment was engineered to predict the external standard — this is the apparatus become recursive. The alignment between instruction and measurement, which Skinner had simply taken as a natural feature of teaching a student the specific behavior you then measured, had been engineered into the product design itself. The inheritance was now embedded in the commercial structure.
ALEKS routed the apparatus through Knowledge Space Theory, a mathematical framework for mapping curricular competencies that provided sophisticated theoretical grounding for the same fundamental measurement choices. Efficacy claims rested on performance within the system’s own knowledge mapping and on aligned post-tests that measured progression through the curricular content the system taught. The theoretical vocabulary was different from Skinner’s. The measurement choices were the same.
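For readers who have not met Knowledge Space Theory, a minimal sketch of its core object follows. The toy domain and states are my invention, not ALEKS's: a knowledge space is a family of feasible knowledge states over a set of items, required to contain the empty state and the full domain and to be closed under union.

```python
from itertools import combinations

# A toy domain of four math items and the states a learner could plausibly be in.
# (Illustrative only; real structures of this kind cover hundreds of items.)
DOMAIN = frozenset({"a", "b", "c", "d"})
STATES = {
    frozenset(),
    frozenset({"a"}),
    frozenset({"b"}),
    frozenset({"a", "b"}),
    frozenset({"a", "b", "c"}),
    DOMAIN,
}

def is_knowledge_space(states: set[frozenset]) -> bool:
    """Check the defining conditions: empty state, full domain, closure under union."""
    if frozenset() not in states or DOMAIN not in states:
        return False
    return all(s | t in states for s, t in combinations(states, 2))

print(is_knowledge_space(STATES))  # True for this toy structure
```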
Duolingo, 2021
I want to read a specific study carefully, because careful reading is the point.
“Evaluating the reading and listening outcomes of beginning-level Duolingo courses,” by Xiangying Jiang, Joseph Rollinson, Luke Plonsky, Erin Gustafson, and Bozena Pajak, published in Foreign Language Annals in 2021. Plonsky is an academic researcher at Northern Arizona University who specializes in applied linguistics; the other four authors were employed by Duolingo at the time of publication. The paper is peer-reviewed. It is cited in Duolingo’s own marketing materials. It is, within the conventions of the field, a careful study.
Two hundred and twenty-five adults in the United States — 135 studying Spanish, 90 studying French. Participants were required to have little to no prior proficiency in their target language, to be using Duolingo as their only learning tool, and — the consequential criterion — to have completed the beginning-level course content through Unit 4. The sample, the paper reports, skewed toward highly educated Caucasian Americans with bachelor’s or master’s degrees.
The outcome measure was the STAMP 4S test from Avant Assessment, covering reading and listening. Thirty multiple-choice items in each modality. The assessment was administered immediately after learners completed the beginning-level content.
The finding: Duolingo learners reached ACTFL Intermediate Low in reading and Novice High in listening — levels the paper characterizes as “comparable with those of university students at the end of the fourth semester” of college-level language study.
Now apply the apparatus.
The outcome measure is external — not designed by Duolingo, which is a genuine methodological improvement over purely internal assessment. But reading and listening are the specific modalities that Duolingo’s interface is engineered around. Multiple-choice comprehension items, translation tasks, listening exercises with multiple-choice responses: these are what Duolingo builds, and these are what the STAMP 4S measures. Speaking and writing — modalities that Duolingo’s app-based format supports weakly — are explicitly excluded from the study. The assessment is external. The choice of which aspects of language proficiency to measure is not.
The timescale: the post-test was administered immediately after course completion. There is no delayed assessment. Bjork’s distinction between retrieval strength and storage strength is directly relevant — the STAMP 4S scores reflect what Duolingo users can do at the moment they finish the course, not what they can do when they have been away from the app for six months. This question is not asked.
The population: only learners who completed the beginning-level content. Most Duolingo users do not. The platform’s attrition is substantial; most people who download the app never reach the end of the beginning-level material. The study measures the performance of survivors. What 100 people who finished the course achieved is a different finding from what 100 people who started it achieved. The paper is transparent about this selection. The subsequent framing of the findings — in the paper’s own conclusion and, more aggressively, in Duolingo’s marketing — as “Duolingo users reach Intermediate Low” does not preserve the completion-threshold restriction.
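The difference between those two findings can be made concrete with a toy simulation. The numbers below are hypothetical, chosen only to show the mechanism; they are not estimates of Duolingo's actual attrition or scores.

```python
import random

random.seed(0)

# Hypothetical cohort: ability varies, and higher-ability (or more motivated)
# learners are more likely to persist to the end of the course.
cohort = [random.gauss(50, 10) for _ in range(10_000)]
completers = [ability for ability in cohort if random.random() < ability / 100]

def mean(xs):
    return sum(xs) / len(xs)

print(f"intent-to-treat mean score: {mean(cohort):.1f}")
print(f"completers-only mean score: {mean(completers):.1f}")
# The completer average is higher even though the 'course' did nothing here:
# selection alone produces the gap that a survivor-only analysis reports.
```

In the real study the course presumably did something. The point is only that a completers-only average cannot, by itself, separate the course's contribution from the selection.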
The baseline: a historical comparison. University students at the end of the fourth semester. There is no contemporaneous control group of comparable adults who spent equivalent time on a different learning approach. The two populations were measured in different conditions, at different times, possibly with different motivations and starting points. The “comparable to four semesters” claim treats them as if they had been measured equivalently.
The cost: not reported. Duolingo is free at its base tier, which is rhetorically powerful — free app comparable to paid college course — but the comparison elides the substantial time investment Duolingo users make. The paper does not ask what equivalent time investment in human-tutored instruction, structured self-study, or an immersive experience would produce. The cost denominator, which is constitutive of what a comparative claim actually supports, is absent.
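A back-of-the-envelope sketch shows why the denominator matters. Every figure below is invented for illustration; the study reports none of them.

```python
# Hypothetical comparison of two routes to the same assessed outcome.
# All numbers are invented; the point is the shape of the calculation,
# not a claim about Duolingo or about university courses.
routes = {
    "free app, self-paced": {"hours": 120, "dollars": 0},
    "tutored instruction":  {"hours": 60,  "dollars": 1800},
}
outcome_points = 30  # same measured gain for both routes, by assumption

for name, cost in routes.items():
    per_hour = outcome_points / cost["hours"]
    print(f"{name}: {per_hour:.2f} points per study hour, ${cost['dollars']} out of pocket")
```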
I am not saying the study is dishonest. I am saying that each of these specific measurement choices — aligned-modality outcome, immediate timescale, survivor population, historical baseline, absent cost denominator — is traceable, in structure, to the apparatus Skinner initiated in 1958. The study is careful within conventions it has inherited. The conventions themselves are what require examination.
The Alternatives Have Always Existed
This is what I want you to sit with: the apparatus did not persist in the absence of alternatives. It persisted alongside them.
Edward Thorndike established in 1906 and 1924 that improvement in one mental function rarely produces general improvement in others unless the two share identical elements. The methodological implication — that learning gains must be tested outside the conditions of the intervention, in contexts structurally different from training, to establish what the training actually produced — was available to the field for the entire history of educational technology. It has been occasionally adopted, routinely praised, and treated as aspirational rather than as the baseline standard that Thorndike’s own work suggested it should be.
The Bjorks’ work on storage strength versus retrieval strength, canonical since the early 1990s, established empirically that the conditions maximizing immediate performance can impair durable retention. The specific implication — that a delayed post-test is required to distinguish performance from learning — has been in the learning sciences literature for over thirty years. Its adoption in educational technology efficacy research as standard practice has not happened.
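The methodological point can be made with a toy model. This is not the Bjorks' formal framework, just an invented forgetting curve, but it shows why an immediate post-test cannot distinguish two conditions that a delayed post-test separates cleanly.

```python
import math

def retained_score(immediate: float, decay_rate: float, days: float) -> float:
    """Toy retention model: exponential forgetting from the immediate-test score."""
    return immediate * math.exp(-decay_rate * days)

# Invented conditions: massed practice looks slightly better immediately,
# but in this toy model it decays faster than spaced practice.
conditions = {
    "massed practice": {"immediate": 92.0, "decay_rate": 0.012},
    "spaced practice": {"immediate": 88.0, "decay_rate": 0.003},
}

for delay_days in (0, 180):  # immediate post-test vs. six months later
    print(f"day {delay_days}:")
    for name, c in conditions.items():
        print(f"  {name}: {retained_score(c['immediate'], c['decay_rate'], delay_days):.0f}")
```

An efficacy study that stops at day zero records the higher massed-practice score and never sees the crossover.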
Bransford, Brown, and Cocking’s How People Learn, the 1999 National Academies synthesis, argued explicitly that assessment should tap understanding rather than the ability to repeat facts. The argument was widely read, widely cited, and narrowly operationalized.
Samuel Messick’s theory of validity, developed across decades and codified in the 1989 Educational Measurement volume, specified that a test score’s interpretation requires examination of construct-relevant versus construct-irrelevant variance, construct underrepresentation, and the consequences of the test’s use. Applied rigorously, Messick’s framework would require educational technology efficacy research to examine what its outcome measures actually index rather than assuming that performance-on-aligned-items equals evidence-of-learning. The framework has been the theoretical standard in measurement theory for over thirty years.
These alternatives were not hidden. They were taught in graduate programs, cited in methods sections, present in the same journals that published the aligned-outcome studies. What did not happen, across six decades of technology change, was their adoption as the field’s measurement standard. The inherited apparatus — aligned outcomes at immediate timescale, survivor population, weak baseline, absent cost denominator — remained dominant. The alternatives remained alternative.
This is not a story about intellectual failure. It is a story about what happens when a theoretical commitment gets flattened into a methodological convention. Skinner had reasons for his measurement choices that were grounded in a coherent behavioral science. When the field moved past behavioral science — when Suppes and Anderson and everyone who followed adopted different theoretical frameworks — the measurement choices did not travel with the theory that had justified them. They traveled alone, as conventions, as what evidence looked like, as the unexamined default.
The apparatus became invisible by becoming obvious. And invisible apparatus is the most durable kind.
The Current Wave
The contemporary AI-tutor literature — Khanmigo, Kestin and colleagues’ 2024 Harvard physics study, Eedi with Google Research, Rori in Ghana — inherits the apparatus in its turn, with variation worth noting.
Khanmigo’s evaluation evidence has rested primarily on engagement metrics and performance within Khan Academy’s own internal assessment structures. What has been measured at scale is usage patterns; what has been claimed is educational transformation; what has not been established at the level of rigorous efficacy research is learning gains on independent standardized measures at delayed timescales with cost-inclusive reporting. The characteristic gaps of the apparatus are present.
The Kestin et al. 2024 Harvard physics study — AI-tutored instruction versus a single session of active-learning classroom instruction — reported effect sizes of 0.73 to 1.3 sigma on researcher-designed post-tests covering surface tension and fluid flow, the specific content the two-hour intervention taught, assessed shortly after the intervention. The measurement choices are the apparatus’s measurement choices. The effect sizes are real within those choices. What they establish about learning is bounded by what those choices can establish.
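For readers who want the sigma unpacked: an effect size of this kind is the difference between group means expressed in pooled standard deviation units. A minimal version of the standard calculation, with invented scores rather than the Harvard data:

```python
import statistics

def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Standardized mean difference: (mean_t - mean_c) / pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    var_t, var_c = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = (((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Invented post-test scores, not the study's data.
ai_tutor = [78, 82, 90, 85, 74, 88, 91, 80]
classroom = [72, 80, 68, 85, 74, 79, 65, 77]
print(f"effect size: {cohens_d(ai_tutor, classroom):.2f} sigma")
```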
Eedi with Google Research 2025 introduced transfer testing — measuring performance on novel problems from subsequent topics rather than problems aligned with what the intervention taught. This is a genuine departure from the inherited convention. The N of 165 is small and the single-term duration is short relative to what durability research would require, but the outcome measure itself represents the kind of revision the apparatus needs rather than another inheritance of it. This is a credit to the researchers who chose to build the study that way.
Rori in Ghana used an external curriculum-aligned assessment over eight months and reported cost at $5 per student per year. The longer timescale, the external measure, the explicit cost denominator — these are partial revisions of the apparatus in the direction the field has needed for six decades. The pattern is: when researchers choose to work against the inherited conventions, the field moves. The field moves rarely, because the inherited conventions are the default, and because departures from them require additional effort, often produce smaller effect sizes, and sometimes produce no significant effect at all, which is a kind of finding that is harder to publish than 0.73 sigma.
The apparatus has not been reformed. It has been revised in specific instances by specific researchers. The instances are the exceptions that make the pattern visible.
Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). His research on educational AI efficacy appears at hypotheticalai.substack.com. | skepticism.ai | theorist.ai
Tags: educational technology measurement apparatus, Skinner teaching machine inheritance, Duolingo efficacy research critique, aligned outcome EdTech validity, learning science transfer testing history


