The Article That Claimed Too Much
On Rebecca Winthrop’s “What 370,000 College Essays Tell Us About A.I.’s Effects on Creativity” and what the underlying research actually supports
There is a particular kind of intellectual move that appears responsible but isn’t. It starts with real evidence, follows it accurately for a while, then extends it — just a little — into territory the evidence cannot actually reach. The extension doesn’t look like speculation because it arrives wrapped in citation and earnest concern. By the time the reader notices anything, the claim has already landed. This is the move Rebecca Winthrop makes in her May 2026 New York Times essay, and understanding exactly where she crosses the line tells us something important about how we should be thinking about AI and education — and more precisely, about who we are blaming and for what.
Winthrop’s central claim is alarming: AI tools, she writes, “constrict our full range of thoughts and our ability to generate original and useful ideas.” She cites Georgetown neuroscientist Adam Green’s study of more than 370,000 college admissions essays, which found that post-ChatGPT writing became linguistically polished but conceptually homogeneous, and that human raters judged this polished writing more creative despite its greater uniformity. She also mentions a separate study finding that human-written short stories contained up to eight times more novel ideas than AI-assisted ones. The concern feels urgent, specific, and scientifically grounded.
It is also, at its most important moment, a category error.
The research Winthrop describes is real. The preprint — Moon et al., “The Creative Link Between Words and Ideas is Weakening in the AI Era” — is a serious piece of work. Four natural experiments across multiple institutions, more than 370,000 essays, pre-registered directional hypotheses, and a within-subjects controlled experiment that adds genuine causal texture: this is not a thin study dressed up for press coverage. The core finding holds up: in post-ChatGPT admissions essays, word-level lexical diversity went up — essays used more varied, more colorful vocabulary — while sentence-level and document-level conceptual distinctness went down. Essays sounded more interesting while being more alike. The authors call this “disjunctive homogenization,” and it is a real, measurable phenomenon with meaningful implications for how educators and admissions officers evaluate writing.
So far, so good. The problem is what Winthrop makes of it.
“The bigger and more alarming impact,” she writes, “is to constrict our full range of thoughts and our ability to generate original and useful ideas.” The study measures properties of texts — specifically, embedding-based distances between words, sentences, and documents. It shows that AI-era essays are less semantically distinct from one another at the conceptual level. What it does not show, and cannot show, is what was happening inside the minds of the students who wrote them. The claim that AI “constricts our full range of thoughts” is a cognitive claim about people. The evidence supports a textual claim about outputs. These are different things, and the difference matters precisely in proportion to how urgent we consider the problem.
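To make concrete what a textual claim about outputs looks like, here is a toy sketch of the two kinds of measurement at issue. It is not the Moon et al. pipeline: the study uses embedding-based distances, whereas this sketch substitutes a type-token ratio for lexical diversity and cosine distance between raw word-count vectors for document-level distinctness. The two miniature corpora and every function name are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def type_token_ratio(text: str) -> float:
    """Share of distinct words in one essay (a crude lexical-diversity proxy)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def mean_ttr(essays: list[str]) -> float:
    """Average within-essay lexical diversity across a corpus."""
    return sum(type_token_ratio(e) for e in essays) / len(essays)


def mean_pairwise_distance(essays: list[str]) -> float:
    """Average distance between every pair of essays in a corpus.

    Word counts stand in for the embeddings a real study would use;
    this is an assumption of the sketch, not the paper's method.
    """
    vectors = [Counter(e.lower().split()) for e in essays]
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Invented corpus 1: repetitive wording, but each essay is about something different.
plain_distinct = [
    "my dog ran and my dog ran far",
    "the boat sank and the boat sank fast",
    "our band played and our band played loud",
]

# Invented corpus 2: varied wording inside each essay, but the essays resemble each other.
polished_alike = [
    "resilience taught me profound lessons about vibrant community",
    "vibrant community showed me profound lessons about resilience",
    "profound lessons about resilience gave me vibrant community",
]

for name, corpus in [("plain_distinct", plain_distinct), ("polished_alike", polished_alike)]:
    print(name, round(mean_ttr(corpus), 3), round(mean_pairwise_distance(corpus), 3))
```

In this toy case the second corpus scores higher on within-essay diversity and lower on between-essay distance, which is the shape of the pattern the paper reports. Both numbers are computed from finished text alone. Nothing in the calculation observes how an essay was produced or what its writer was capable of, which is why evidence of this kind supports claims about outputs rather than about minds.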
Consider what would be required to establish the cognitive version of this claim. You would need to measure students’ ideational capacity before and after AI use, through some instrument independent of the writing they produce with AI assistance. You would need to distinguish between students who drafted with AI, students who revised with AI, and students who used AI only for surface editing. You would need to rule out the entirely plausible alternative explanation that students facing a high-stakes writing task, given access to a tool that reduces their anxiety, choose to produce safer content — not because their creative capacity has diminished, but because the incentive structure of the situation changed. The Moon et al. study, careful as it is, does none of this. Its evidence lives in the essays, not in the students.
This distinction matters for a reason that goes beyond methodological precision. If the problem is that AI tools produce homogenized outputs when used to draft or heavily revise, the corrective is a set of pedagogical practices and assessment structures that don’t mistake polish for thought. If the problem is that AI is actually narrowing human creativity — eroding a cognitive capacity — then the corrective is something more like a public health intervention. Winthrop’s framing calls for the second kind of response. The evidence warrants only the first.
There is also the question of what the study’s creativity ratings actually show. Winthrop writes that post-ChatGPT essays “were rated as more ‘creative’ by human judges.” This is accurate as far as it goes, but it glosses over how those ratings were produced. The large-scale creativity ratings in Moon et al. were produced not by human judges reading 370,000 essays, but by a GPT-4.1 mini model that was fine-tuned on ratings from a much smaller calibration sample of 370 essays. The human experts rated the calibration set; the model extended those ratings across the full corpus. The study’s own peer review flags this pipeline as introducing circularity risk: if the fine-tuned model has internalized the human raters’ preference for lexically polished prose, which the paper strongly suggests is happening, then using that model to show that lexically polished prose gets higher creativity ratings is not independent evidence. It is the same bias, measured twice.
None of this is fatal to the paper’s core finding. The disjunction between surface polish and conceptual distinctness is well-established in the data. But it does mean that “human judges rated post-ChatGPT essays as more creative” is a simplified rendering of a more complicated story — and that the simplified rendering, presented as straightforwardly as it is in the Times, lends more certainty to the creative-erosion hypothesis than the evidence actually carries.
The deepest misreading in Winthrop’s essay is the one that feels the most intuitive. She writes that “when teenagers write their own essays, the work reflects their thoughts and personalities, their attempts to make meaning of their experiences. When we search for words, we are sifting through the same brain networks that form connections between ideas.” This is genuinely lovely, and it draws on a real neurocognitive literature about the relationship between verbal fluency and creative ideation. The implication is that AI use interrupts this sifting process, short-circuiting the connection between language and thought.
But the study did not measure that process. It measured the distributional properties of finished texts. The students whose post-ChatGPT essays show lower document-level distinctness might have arrived at that sameness through any number of paths: by using AI to generate their essays wholesale, by asking AI for topic suggestions and anchoring on them, by revising a human-drafted essay with AI assistance, by attending college prep programs that coached them toward conventional “compelling” narrative structures, or by simply writing in a genre — the admissions personal statement — that has always exerted homogenizing pressure on its writers. The Moon et al. study acknowledges it cannot distinguish between these pathways. Winthrop does not.
Here is what the study actually establishes, stated as precisely as it should be: in the years following ChatGPT’s release, college admissions essays became more lexically varied and less conceptually distinct from one another. The polished surface fooled evaluators, including, the paper argues, the human experts who rated the calibration sample and the fine-tuned model trained to reproduce their judgments. This means that lexical sophistication can no longer be treated as reliable evidence of conceptual originality. Admissions officers, educators, and writing instructors need to rethink the proxies they use to detect original thought.
That is a serious finding. It deserves serious treatment. It does not require the additional claim that human creative capacity itself is diminishing.
I want to be precise about what kind of mistake Winthrop is making, because it is not dishonesty and it is not carelessness. It is something more like motivated extrapolation — the researcher who has spent years thinking about AI’s effects on education, who is genuinely alarmed by what the data suggests, who reaches, in the last mile of the argument, for the version of the claim that feels most urgent. This happens in science communication constantly, and it usually goes unchallenged because the extrapolation is directionally plausible. It probably is true that heavy AI use in the drafting process tends to reduce the idiosyncratic qualities of student writing. It is probably true that students who outsource brainstorming lose some of the generative friction that produces unexpected ideas. The cognitive version of the claim may even turn out to be correct, once someone runs the study that would actually establish it.
But “probably true” and “supported by this evidence” are different things, and the willingness to collapse them is precisely what makes AI discourse so difficult to navigate. The overclaimed version of the finding — AI is eroding creative thinking — positions the remedy as a kind of cognitive public health campaign, with AI tools as the pathogen. The warranted version — AI produces writing that looks creative but isn’t, and our evaluative instruments can’t tell the difference — positions the remedy as better assessment design, better pedagogical practices, and better understanding of what AI use actually consists of when students do it.
The most practically useful finding in the Moon et al. study is the one Winthrop underplays: in the controlled within-subjects experiment, AI-revised essays retained significantly more document-level distinctness than AI-generated essays. This result has immediate implications. Using AI to refine a human-drafted text is not the same as using AI to produce a text. The cognitive and compositional labor involved is different. The outcome, measurably, is different. If we are serious about thinking clearly about AI and writing, this nuance, not the headline number, is where the work is.
The essay Winthrop should have written is also the more interesting essay. It is about the failure of our evaluative instruments. It is about what happens when the surface-level signals we have trained ourselves to read as evidence of quality — elegant sentences, sophisticated vocabulary, structural coherence — become decoupled from the properties they were supposed to index. It is about the ways in which AI does not introduce a new problem so much as expose an old one: that we have always been measuring proxies, and the proxies were always fragile, and we chose not to notice because they worked well enough in the pre-AI world.
That essay would acknowledge that the homogenization finding is real and consequential, without requiring us to believe that something is happening to students’ minds that the evidence does not establish. It would treat the finding as a measurement problem — which is urgent, which is tractable, which points toward specific interventions — rather than a cognitive crisis, which is frightening, which is vague, and which cannot be fixed by anything short of removing the tools.
The article says: AI is eroding our ability to think originally. What the evidence says: AI can make writing sound original while making it less so, and our tools for distinguishing the two have failed. The first claim requires a different world. The second requires better teachers, better rubrics, and a clearer understanding of what we are actually evaluating when we evaluate writing.
We should take the second claim seriously. We should stop pretending it says the first.