The Gap Between What We Measure and What We Name
On the Structural Problem That Forty Years of EdTech Efficacy Research Has Not Solved
Consider two findings, forty years apart.
In 1984, Benjamin Bloom published a seventeen-page paper reporting that students tutored one-on-one under mastery-learning conditions performed approximately two standard deviations above students taught in conventional classrooms. The finding has been cited tens of thousands of times. It has become, across four decades, the single most-invoked benchmark in educational technology. Whenever a new system claims to approach the effectiveness of human one-on-one instruction, it is Bloom’s 2-sigma it is claiming to approach.
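Before turning to the second finding, it is worth registering what that number means in concrete terms. Under a normal model for the score distributions, a two-standard-deviation shift puts the average tutored student at roughly the 98th percentile of the conventionally taught class, which is how Bloom himself glossed it. A minimal sketch of the arithmetic, with names and framing that are mine rather than the paper’s:

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Bloom's reported effect: the average tutored student scored roughly
# two standard deviations above the mean of the conventional classroom.
effect_size_sigma = 2.0

# Under a normal model, that places the average tutored student at about
# the 98th percentile of the conventionally taught distribution.
percentile = normal_cdf(effect_size_sigma) * 100
print(f"Average tutored student at roughly the {percentile:.0f}th percentile of the control class")
```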
In 2024, a research team at Harvard led by Gregory Kestin reported that an AI tutor, deployed in introductory physics, produced learning gains larger than active-learning classroom instruction. The effect size exceeded what prior literature had typically reported for any tutoring intervention, including Bloom’s. The study was methodologically careful. The finding circulated quickly. Within weeks it was being cited as evidence that current-generation AI tutors meaningfully exceed what good conventional instruction can deliver.
Forty years apart. Different technologies. Different research traditions. And yet, read carefully, the two findings share a structure.
In each, a specific measurement — performance on items aligned to the intervention’s content, assessed at a short timescale, against a conventional-instruction baseline — is offered as evidence for a construct of which the measurement is not, strictly, a measurement. Bloom’s 2-sigma is evidence about performance on aligned items under particular tutoring conditions in the mid-1980s. It is cited as evidence about the effectiveness of tutoring as an instructional mode. Kestin’s physics finding is evidence about short-timescale aligned-item performance in a selective undergraduate population. It is cited as evidence that AI tutoring outperforms human instruction in some general sense the measurement does not index.
The measurements are not false. The findings are not inflated. In each case, the researchers reported carefully what they measured. The question is what happens between the measurement and its citation — the small, structural, and repeated gap between what the apparatus indexes and what the vocabulary surrounding the apparatus claims.
The Structure of the Problem
Name the structure directly.
An efficacy claim in this field consists of three things: a measurement, a construct, and an asserted relationship between them. The measurement is what researchers actually did — items administered, scores computed, conditions compared. The construct is what the measurement is meant to be evidence for — learning, mastery, effectiveness, personalization, engagement. The asserted relationship is the claim that the measurement indexes the construct adequately to license the uses the finding is put to.
This structure appears in every empirical field. Biology works this way, and so does nutrition research, and so does clinical psychology. The gap between measurement and construct is not a problem specific to educational technology. It is a feature of empirical inquiry. Measurements never exhaustively capture their constructs. The question for any field is how seriously it takes the gap, how much work it does to establish the measurement-construct relationship, and how much it assumes versus demonstrates.
The observation this book has been building toward, essai by essai, is that the learning-systems field has, across six decades, taken the gap less seriously than its claims require. The measurement-construct relationships it invokes are almost universally assumed rather than demonstrated. The field’s vocabulary outruns what its evidence apparatus can support, and the gap persists not because it has gone unnoticed — it has been noticed, repeatedly, by careful researchers across multiple traditions — but because the existing apparatus serves specific production conditions, and a more adequate apparatus would serve them less well.
The diagnosis is not that the field is wrong about what works. The diagnosis is that the field makes claims about effectiveness its measurements are not positioned to support, and does so systematically. These are importantly different diagnoses. The first is about facts. The second is about apparatus — about the specific set of measurement practices, citation habits, and research conventions that together produce what the field calls its evidence base.
The distinction matters because the remedy differs. If the field were making factual errors, the remedy would be better studies of the same interventions. If the apparatus is producing a systematic gap between measurement and claim, the remedy is different apparatus. This book has not argued for either remedy. It has argued, by the accumulated force of twelve close readings, that the second diagnosis is correct.
What the Vocabulary Actually Invokes
Open a textbook in educational psychology. Open a learning-sciences journal. Open the marketing copy for any major adaptive-learning platform. Open the abstract of any recent AI-tutor efficacy study. The vocabulary is remarkably consistent. The field claims to be producing evidence about learning. About understanding. About mastery. About effectiveness. About personalization and engagement. Each of these words points toward a construct. Each construct has, in serious research traditions, substantial theoretical and empirical articulation.
Consider learning. In Robert Bjork’s decades of experimental work, learning is not a single construct but a distinction between two separable things: storage strength and retrieval strength. Storage strength refers to how well a representation is encoded. Retrieval strength refers to how accessible it is at the moment of test. A student can have high retrieval strength at the end of a unit — they perform well on the post-test — without high storage strength. Weeks later, the retrieval strength decays, and the post-test performance turns out to have been measuring the wrong thing. Conditions that maximize immediate performance — massed practice, aligned testing, minimal difficulty — often actively impair long-term storage. This is the central insight of what Bjork calls desirable difficulties.
A learning claim grounded in Bjork’s construct requires evidence of storage strength, not just retrieval strength — which requires measuring performance after a delay, in new contexts, on items not identical to training. The methodology exists. It has existed since the early 1990s. It is the basis of essentially every recommendation in Make It Stick and in the broader spaced-practice and retrieval-practice literature that has accumulated since.
Now consider how learning is typically operationalized in EdTech efficacy research. The outcome measure is a post-test administered at the end of the instructional unit. The items are aligned with the instructional content. The interval between instruction and test is hours to days. The retrieval context is the same or similar to the learning context. What this operationalization measures is retrieval strength at short delay. What Bjork’s construct requires is storage strength at longer delay under different retrieval conditions. These are not the same thing.
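A toy sketch makes the gap concrete. The numbers below are invented for illustration, and exponential forgetting with a single decay rate per condition is a cartoon rather than anything from the memory literature; the only point is that an immediate aligned post-test and a delayed test can rank the same two instructional conditions in opposite orders.

```python
from math import exp

# Toy illustration only. The parameters below are invented, not estimates
# from any study. The point is that the two test designs can rank the same
# two conditions in opposite orders.
conditions = {
    # condition: (immediate score, daily decay rate of that score)
    "massed, aligned practice": (0.90, 0.060),  # strong immediate performance, faster forgetting
    "spaced, varied practice":  (0.75, 0.015),  # weaker immediate performance, slower forgetting
}

def expected_score(immediate: float, decay: float, delay_days: float) -> float:
    """Expected test score after a delay, under simple exponential forgetting."""
    return immediate * exp(-decay * delay_days)

for delay in (1, 42):  # a next-day post-test versus a test six weeks later
    scores = {name: round(expected_score(*params, delay), 2)
              for name, params in conditions.items()}
    winner = max(scores, key=scores.get)
    print(f"delay = {delay:>2} days   {scores}   higher-scoring condition: {winner}")
```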
The gap between the two is not subtle. It is structural. And it is present in nearly every efficacy claim this book has examined.
Consider understanding. Jean Lave, Etienne Wenger, John Dewey, and the situated-cognition tradition spent decades articulating understanding as something different from performance on items. Understanding involves the capacity to apply knowledge in contexts that differ from the contexts of acquisition. It involves participation in practices — knowing how to use what one knows in the world where it applies. Transfer testing — the capacity to apply learning to problems that differ meaningfully from training — is the minimum methodological requirement for a claim about understanding. Transfer testing has been advocated for in educational research since Thorndike’s early twentieth-century work. It remains exceptional in EdTech efficacy research.
Consider mastery. Bloom’s own construct, as articulated in his mastery-learning work, involves structural reorganization of knowledge — the kind of reorganization that allows a learner to solve problems the instruction did not specifically address. Bloom’s 2-sigma finding emerged from studies that implemented criterion-referenced assessment and formative assessment with corrective feedback, and that required demonstrated performance across multiple item types. The 2-sigma number is cited routinely as a benchmark for tutoring effectiveness. Bloom’s construct of mastery, including its methodological requirements, is cited far less often.
Consider personalization, as examined in the eighth essai. The term invokes a construct rooted in Vygotskian zone-of-proximal-development work and the aptitude-treatment interaction literature — instruction responsive to who the individual learner actually is. What adaptive-learning systems operationalize is item sequencing and pacing based on item-level response patterns. These are not the same construct.
Consider engagement. The construct, as articulated in the psychological literature, involves attention, motivation, affect, persistence in the face of difficulty, meaningful cognitive investment. What AI-tutor efficacy research typically measures is time on task, session counts, and completion rates. Kristen DiCerbo of Khan Academy observed in April 2026 that when students engaged with Khanmigo, they were typing “IDK IDK” — I don’t know, I don’t know — and moving on. The platform counted them as engaged. They were not engaged in any cognitively meaningful sense.
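The gap is mechanical enough to caricature in a few lines. The sessions and the threshold below are invented, and looks_engaged is my label rather than any platform’s metric; the sketch shows only that a time-on-task and message-count proxy cannot tell the two students apart.

```python
# Invented sessions and thresholds, purely to caricature the proxy.
sessions = [
    {"student": "A", "minutes": 24,
     "messages": ["Can you explain torque again?", "So the lever arm matters?", "Got it, thanks"]},
    {"student": "B", "minutes": 26,
     "messages": ["IDK", "IDK", "IDK"]},
]

def looks_engaged(session, min_minutes=20, min_messages=3):
    """Time-on-task and message-count proxy of the kind platforms typically log."""
    return session["minutes"] >= min_minutes and len(session["messages"]) >= min_messages

for s in sessions:
    print(f"student {s['student']}: counted as engaged = {looks_engaged(s)}")
# Both students clear the threshold; only one is cognitively invested.
```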
Each of these constructs has serious theoretical articulation in one or more research traditions. Each is routinely invoked by the field’s claim-making vocabulary. Each is routinely operationalized as aligned-item performance at short timescale. The gap between the construct and the operationalization is what the apparatus produces. And taken across the field, it is the difference between the learning the vocabulary claims and the performance the measurements index.
What the Field Has Tried
It would be inaccurate to say the field has not tried to close this gap. It has tried, across multiple traditions, for decades. That these attempts have not produced a different default apparatus is itself instructive.
How People Learn, the 1999 National Academies synthesis by Bransford, Brown, and Cocking, made transfer testing a central methodological theme. The implication was straightforward: efficacy research should include transfer measures if it wants to make claims about learning rather than claims about trained performance. Two and a half decades later, transfer testing remains exceptional.
Samuel Messick’s theory of validity, codified in his 1989 chapter in Educational Measurement, specified that a test score’s interpretation requires examination of construct-relevant versus construct-irrelevant variance, construct underrepresentation, and the consequences of the test’s use. Applied rigorously, Messick’s framework would require EdTech efficacy research to examine what its outcome measures actually index rather than assuming that performance-on-aligned-items equals evidence-of-learning. The framework has been the theoretical standard in measurement theory for over thirty years. Its rigorous application in educational technology efficacy has been partial at best.
Jean Lave’s situated-cognition tradition articulated a conception of assessment that requires observation of practice rather than the administration of tests. It has had essentially no impact on deployed-product efficacy research.
Each of these traditions has existed for decades. Each has produced methodology that could be adopted. Each remains exceptional rather than routine. The alternatives have not been hidden. They have been taught in graduate programs, cited in methods sections, present in the same journals that published the aligned-outcome studies.
The question is why they have not taken hold.
Why the Apparatus Persists
The apparatus persists because it serves the specific production conditions of the field in which it operates.
Consider what a researcher needs in order to do research in this field. Funding, on grant cycles of two to five years. Publications, through peer-reviewed journals with specific conventions. Access to populations — schools, classrooms, platforms — through institutional partnerships with their own timelines and constraints. Findings that other researchers can cite.
Now consider what a more adequate apparatus would require. Transfer testing adds design complexity and reduces effect sizes. Durability testing extends the study timeline past the typical grant cycle. Multi-paradigm convergence requires methodological range that most research programs do not possess. Pre-registration of analytic plans constrains the exploratory moves that often produce publishable findings.
Each of these, if adopted as a default, would reduce the rate at which researchers produce citable positive findings. Not because the interventions do not work — some of them do — but because the findings that survive the more demanding methodology would be smaller, noisier, and less rhetorically useful. A researcher who adopts the more demanding methodology competes with researchers who do not. The findings produced under the less demanding methodology will be larger, cleaner, and more citable. Grant agencies, tenure committees, and publication venues all reward the latter.
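The career cost of a shrinking effect size is ordinary power arithmetic, not rhetoric. In the sketch below, the approximation is the standard two-sample formula for a two-sided test at alpha = .05 with 80 percent power, and the two effect sizes are illustrative round numbers rather than estimates from any particular study.

```python
from math import ceil

# Standard approximation for a two-sample comparison:
#   n per group ~ 2 * (z_alpha + z_beta)^2 / d^2
# with z_alpha = 1.96 (two-sided test at alpha = .05) and z_beta = 0.8416 (80% power).
Z_ALPHA, Z_BETA = 1.96, 0.8416

def n_per_group(d: float) -> int:
    """Approximate participants needed per condition to detect an effect of size d."""
    return ceil(2 * (Z_ALPHA + Z_BETA) ** 2 / d ** 2)

# Illustrative round numbers, not estimates from any study.
for label, d in [("immediate, aligned post-test", 0.8),
                 ("delayed or transfer measure ", 0.3)]:
    print(f"{label}:  d = {d}  ->  roughly {n_per_group(d)} students per condition")
```

A study powered for the first number fits inside a single course section and a single semester; a study powered for the second needs a multi-section or multi-site partnership, which is exactly the kind of cost the grant cycle and the tenure clock punish.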
The same pressures operate on the institutions that surround the research. Product vendors have commercial reasons to prefer methodologies that produce larger numbers. Policy bodies have political reasons to prefer evidence that looks clean. Philanthropists want defensible findings, and clean findings are easier to defend than nuanced ones. Journal editors respond to what their referees will accept, and what referees will accept is shaped by the conventions the field has institutionalized.
No individual in this system is behaving cynically. Researchers are doing their best work under the constraints of their funding. The apparatus is not what anyone chose. It is what the incentives produce when rational actors operate within them.
This is why advocacy for better methodology has not produced better methodology. The problem is not that researchers do not know better methodology exists — they do. The problem is that operating under the existing apparatus produces careers; operating against it produces, for most researchers, shorter and more difficult careers.
The apparatus persists because it is an equilibrium. Equilibria are stable not because the actors inside them are irrational but because they are responding rationally to incentives that no single actor created and no single actor can change. Changing an equilibrium of this kind requires changing the incentives across grant agencies, tenure systems, journal conventions, institutional practices, and funder expectations simultaneously. Such coordination is rare.
This is a structural observation, not a moral one. Researchers in this field are not broken. The evidence base is what the apparatus produces when careful, rigorous, well-meaning researchers operate under the conventions the apparatus enforces. Improving any individual researcher’s methods would not change what the field’s evidence base looks like, because the evidence base is the aggregate output of many careful researchers responding to shared incentives.
That is what the apparatus was always supposed to produce.
Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). This essay appears as part of the Computational Skepticism series at skepticism.ai. | theorist.ai | hypotheticalai.substack.com
Tags: measurement construct validity EdTech efficacy, Bjork storage retrieval strength learning systems, transfer testing durability educational technology, apparatus equilibrium research incentives, Bloom Kestin aligned outcome measure gap


