The Measurement That Wasn't There
On the quiet fraud at the center of AI education research — and why it's harder to catch than the kind that gets retracted
There is a paper circulating in AI education circles as a counterpoint to the skeptics. Wang and Zhang, published in February 2026 in the International Journal of Educational Technology in Higher Education, a Springer Nature journal. It passed peer review. It has four studies. It has 912 participants across three continents. It deploys PLS-SEM and fsQCA and IPMA, and it has a methodology flowchart with seven stages, and it uses the word “paradoxical” in its title and delivers on the promise — two hypotheses come back significant in the wrong direction, which the authors then claim as the actual discovery.
I want to be honest about what I am about to argue. The Wang and Fan retraction that prompted this conversation is a case of bad causal evidence overclaimed. That is one problem. Wang and Zhang is a different problem. It is methodologically elaborate work that is not actually measuring what it claims to measure. In some ways it is harder to catch, because the machinery is impressive and the numbers are clean and the peer reviewers, like the rest of us, have been trained to evaluate internal consistency rather than construct validity.
Strip away the machinery. Here is what Wang and Zhang actually did.
Nine hundred and twelve business students filled out a questionnaire. The questionnaire asked them to rate their agreement with statements like: “My interaction with the generative AI has led me to question my long-held assumptions.” And: “Using generative AI has fundamentally changed the way I understand certain subjects.” And: “My use of generative AI has prompted a deep re-evaluation of my ways of thinking.”
Those five items, averaged together, are the outcome variable. The paper calls this outcome “transformative learning experience.”
It is not transformative learning experience. It is self-reported perception of transformative learning experience. The difference is not semantic. It is the entire study.
Jack Mezirow’s transformative learning theory — the anchor the paper correctly treats as its theoretical foundation — describes a slow, disorienting, often unconscious process of perspective reconstruction. Mezirow was not describing a feeling students could report after two weeks. He was describing something that happens to people over months or years, something they often cannot name while it is occurring, something that shows up in changed behavior and revised assumptions and different relationships to knowledge — not in survey responses. The theory Mezirow actually wrote is about the kind of learning that happens when a person discovers that the framework they have been using to understand the world is inadequate. That does not feel like an insight. It feels like vertigo.
Measuring this with five Likert items is not a methodological shortcut. It is a category error. You might as well measure altitude with a thermometer and then report, with SRMR = 0.031, that higher temperatures correlate with being closer to the sky.
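If you want the analogy with numbers attached, here is a minimal sketch in Python, every figure invented: simulate altitudes, derive temperatures from the standard lapse rate plus some weather noise, and admire how cleanly the proxy tracks the thing it is not measuring.

```python
# A toy version of the analogy, with invented numbers. Temperature tracks
# altitude through the lapse rate plus weather noise; the fit is clean; the
# thermometer still measures temperature.
import numpy as np

rng = np.random.default_rng(0)
n = 912
altitude_m = rng.uniform(0, 3000, n)                                 # the construct we care about
temperature_c = 15.0 - 0.0065 * altitude_m + rng.normal(0, 3.0, n)   # ~6.5 C drop per 1000 m

r = np.corrcoef(temperature_c, altitude_m)[0, 1]
print(f"correlation(temperature, altitude): {r:.2f}")                # strongly negative, very clean
# The proxy correlates tightly with the construct, and no fit statistic will
# ever report that you measured the wrong thing.
```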
The paper knows this, in the way that papers of this type always know what they are doing, which is to say: it is in the limitations section. “Generalizability is bounded by exclusive reliance on self-reported perceptions,” the authors write, and then proceed to spend eight thousand words drawing inferences about transformative learning from self-reported perceptions. The limitation is disclosed and then ignored. This is the standard operation.
Now add the demand characteristics.
I said “convenience sampling from business schools,” and that is the phrase papers in this area use. What it usually means in practice is that the 912 participants are the researchers’ own students, or the students of colleagues at institutions where the researchers have relationships. The paper does not specify. It describes “multistage purposive sampling” and is conspicuously silent about how institutions were contacted and how students were recruited. But here is what we know: the qualitative component — the 45 interviews providing “rich process-oriented insights” — was drawn “exclusively from the Chinese sample,” and one of the authors is at a Chinese university. We know the students knew they were participating in an academic study. We know, from a century of social psychology, that students who are aware of being studied by people who may have access to their grades tend to report what they believe is the expected or approved answer.
The paper deploys a temporal separation of two weeks between waves to “minimize common method bias.” Two weeks between surveys does not eliminate the problem of students reporting what they believe the study wants to hear. It separates the questions. It does not change who is answering them or why.
I want to name the third problem, which is the one I raised in the group and which I think is the most structurally interesting.
Almost every learning environment is a massive violation of SUTVA — the Stable Unit Treatment Value Assumption. SUTVA says that the treatment received by one unit doesn’t affect the outcomes of any other unit. In a classroom, this is almost never true. Students talk to each other. They share AI tools. They discuss assignments. They copy strategies. One student’s approach to using ChatGPT influences other students’ approaches, which influences their outcomes, and all of it enters the data as observations the model treats as independent when they are nothing of the kind.
In a networked environment where 912 business students across three continents are all using the same publicly available AI tools, the assumption that each student’s “transformative learning experience” is a function solely of their individual “pedagogical partnership orientation” and “cognitive vigilance” and “efficiency orientation” is not a simplifying assumption. It is an assumption that, if violated — and it is almost certainly violated — means the causal model is wrong in ways the statistical machinery cannot detect. PLS-SEM with excellent fit statistics can sit on top of fundamentally confounded data and produce clean-looking path coefficients. The cleanliness of the output is not evidence of the validity of the model. It is evidence that the model fits the data it was given.
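To see how invisible this is to the fit statistics, a toy simulation is enough. What follows is not the paper's model, and every name and effect size in it is invented. It gives each classroom a shared norm for how the AI gets used, lets that norm leak into both the students' orientations and their outcomes, and then fits the naive per-student regression the survey design implies.

```python
# A toy sketch of interference, not the paper's model. All names and effect
# sizes are invented. Students in the same class absorb a shared way of using
# the AI (social transmission), and that shared norm also shifts the outcome.
import numpy as np

rng = np.random.default_rng(1)
n_classes, class_size = 30, 30        # ~900 students, grouped into classrooms
own_effect = 0.2                      # true effect of a student's own orientation
shared_effect = 0.6                   # effect of the classroom's shared norm

x_all, y_all = [], []
for _ in range(n_classes):
    norm = rng.normal()                                   # shared classroom norm
    orientation = norm + rng.normal(0, 1, class_size)     # own orientation, partly absorbed
    outcome = (own_effect * orientation
               + shared_effect * norm                     # spillover the model ignores
               + rng.normal(0, 1, class_size))
    x_all.append(orientation)
    y_all.append(outcome)

x, y = np.concatenate(x_all), np.concatenate(y_all)
slope = np.polyfit(x, y, 1)[0]        # the naive per-student coefficient
print(f"naive estimate: {slope:.2f}   true individual effect: {own_effect}")
# The naive coefficient lands well above 0.2, looks tight at n = 900, and no
# internal fit statistic will report that the independence assumption is false.
```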
True causal inference in learning environments would require experimental variation, not survey waves. It would require controlling for the social transmission of strategies and norms. It would require outcome measures that are behavioral, not perceptual. Absent these, what you have is a very sophisticated correlation study that has dressed itself in the language of mechanism.
The paper is not a fraud in the sense of fabricated data. The numbers are probably exactly what the authors say they are. The students probably filled out exactly the surveys the authors describe. The analysis was probably executed correctly in SmartPLS 4.1.
The problem is upstream of all of that. The problem is in the question “what did we measure?”
We measured whether students who reported viewing AI as a collaborative partner also reported having their assumptions challenged. We found that they did. We called this “transformative learning.” We built a four-study architecture around this finding, with fsQCA and IPMA and 45 interviews and cross-cultural multi-group analysis, and we used the word “revolutionizes” in the discussion section, and we were published in a Springer Nature journal.
This is the second problem the field has, and it is subtler than the one the retracted meta-analysis represents. The retracted Wang and Fan paper is the kind of failure that produces retractions: fabricated or manipulated data, statistical impossibilities, evidence that the numbers were not real. That is a catastrophic failure, but it is detectable. It triggers the mechanisms the field has built for self-correction.
The Wang and Zhang problem does not trigger those mechanisms. The numbers are real. The peer review process evaluated internal consistency and found it satisfactory. The methodology flowchart has seven stages. The HTMT ratios are all below 0.85. The paper did exactly what the field rewarded it for doing.
And what it measured was: how students feel about whether they learned something.
Here is what I think is actually going on in that data, if you want my honest read of it.
Students who frame AI as a collaborative partner rather than a tool are probably more engaged with the learning process in general. Engagement is positively correlated with self-reported learning. This is not a surprise. It is not a paradox. It is not evidence that “partnership orientation simultaneously activates cognitive vigilance and cognitive offloading through synergistic cognitive collaboration.” It is evidence that students who are paying attention think they learned more.
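Here is that read as a toy simulation rather than as a claim about their data, with every effect size invented: engagement drives both the partnership framing and the self-reported outcome, and the direct effect of framing on the outcome is set to exactly zero.

```python
# A sketch of the common-cause reading, with invented effect sizes. Engagement
# drives both the framing and the self-report; the framing itself does nothing.
import numpy as np

rng = np.random.default_rng(2)
n = 912
engagement = rng.normal(0, 1, n)                            # unmeasured common cause
partnership = 0.7 * engagement + rng.normal(0, 1, n)        # "AI as collaborative partner"
reported_learning = 0.7 * engagement + rng.normal(0, 1, n)  # the survey outcome

r = np.corrcoef(partnership, reported_learning)[0, 1]
print(f"observed correlation: {r:.2f}")    # ~0.3, trivially significant at n = 912
# A clean, significant association produced by a world in which the partnership
# orientation has no effect on learning at all. The survey cannot tell these apart.
```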
The finding that cognitive offloading is positively associated with self-reported transformative learning is interesting — the paper hypothesized the opposite and got a significant result in the other direction, and that is worth noting. But the post-hoc explanation (that offloading liberates cognitive resources for higher-order reflection) is plausible, not demonstrated. The paper discovered an unexpected correlation, generated a theory to explain it, and presented the theory as established. The U-shaped analyses that appear to confirm the theory were conducted after the unexpected finding was observed, without correction for exploratory inflation. This is the standard operation, and it is why most published findings in social science do not replicate.
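The inflation is easy to put a number on. The sketch below uses pure noise and ten uncorrected post-hoc looks per study, the ten being an invented stand-in for the curve shapes and subgroups a motivated analysis can always generate after the fact.

```python
# A toy sketch of exploratory inflation on pure-noise data. Ten post-hoc looks
# per study is an invented stand-in for curve-shape and subgroup analyses run
# after the surprising result was observed.
import numpy as np

rng = np.random.default_rng(3)
n, n_tests, n_studies = 912, 10, 2000
crit = 1.96 / np.sqrt(n)      # approximate two-tailed 5% threshold for |r| at this sample size

hits = 0
for _ in range(n_studies):
    x = rng.normal(0, 1, n)
    # ten outcomes unrelated to x by construction, each given one uncorrected test
    best = max(abs(np.corrcoef(x, rng.normal(0, 1, n))[0, 1]) for _ in range(n_tests))
    hits += best > crit

print(f"studies with at least one 'significant' post-hoc finding: {hits / n_studies:.0%}")
# With ten uncorrected looks at noise, roughly two studies in five can report a
# significant confirmation of whatever story was drafted after the fact.
```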
The correct statement of the finding is: among 912 business students who self-report using AI, those who self-report viewing AI as a partner also self-report greater subjective sense of perspective change, and this association holds when we control for several other self-reported constructs. This is an interesting starting point for a research program. It is not a demonstration that pedagogical AI partnerships cause transformative learning.
I want to be fair to the authors and to the field. They are working in an area where longitudinal behavioral research is genuinely hard to conduct, where IRB constraints limit what can be measured, where publication timelines create pressure toward the kind of efficiency the paper’s own subjects were reporting, and where the methodological standards for what counts as evidence have been established over decades of work that made the same choices at every turn. They did what the field taught them to do. The peer reviewers evaluated the paper against the standards of the field and found it acceptable by those standards.
That is the problem. Not this paper. The standards.
What would adequate evidence look like? It would measure transformative learning through behavioral change over meaningful time periods — different academic choices, different engagement with contradictory evidence, different patterns of intellectual behavior — not through survey items administered two weeks after measuring the predictors. It would use experimental variation in AI access or framing. It would account for social transmission between students. It would treat the gap between self-reported perception and actual cognitive change as a research question, not a footnote.
This kind of research is harder to do. It takes longer. It is more expensive. It produces noisier results. It is less likely to yield the clean path coefficients and the R² of 0.475 and the SRMR of 0.031 that signal competence to reviewers. The incentive structure of academic publishing does not reward it.
The Wang and Fan retraction is the kind of failure that looks like a violation of the rules. Wang and Zhang is the kind of failure that looks like following them.
I am building AI tools for anyone who wants to ride the AI revolution. I am not the right person to tell education researchers how to fix their field. But I notice the same thing in AI music research that I see here: the willingness to dress up a survey with sophisticated analytical machinery and call the output evidence about what AI actually does to people. The infrastructure for appearing rigorous has outpaced the infrastructure for being rigorous.
And this matters beyond the journals. The Wang and Zhang paper is circulating as evidence about AI and learning. Institutions are making policy based on papers like this. Educators are redesigning curricula. Students are being told, by implication, that their sense of having learned something is the same as having learned something.
It is not. And the gap between those two things is exactly the gap that Mezirow was writing about — the gap between the story you tell yourself about your perspective and the actual reconstruction of the framework through which you understand the world. Transformative learning is what happens when you discover that the story you have been telling yourself is wrong.
It would be ironic if the research claiming to measure it turned out to be an example of the thing it failed to measure.
Nik Bear Brown teaches AI at Northeastern University and runs Musinique LLC, which builds tools for indie musicians. He is also the founder of Humanitarians AI, a 501(c)(3) nonprofit. More at bear.musinique.com · skepticism.ai · theorist.ai
Tags: measurement validity, AI education research, transformative learning, construct validity, self-report bias


