<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Nik Bear Brown - Computational Skepticism]]></title><description><![CDATA[Daily insights on the asymmetry of AI-generated bullshit, practical AI tutorials, research updates for the Humanitarians AI Lab, and guidance for my research group.
AI literacy through practice. Understanding the tech.  
Produced by Bear Brown, LLC]]></description><link>https://www.skepticism.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!ea9u!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73f2e8c8-c907-4319-a9cb-14cda74f5128_800x800.png</url><title>Nik Bear Brown - Computational Skepticism</title><link>https://www.skepticism.ai</link></image><generator>Substack</generator><lastBuildDate>Mon, 15 Jun 2026 15:09:32 GMT</lastBuildDate><atom:link href="https://www.skepticism.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Bear Brown, LLC]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[nikbearbrown@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[nikbearbrown@substack.com]]></itunes:email><itunes:name><![CDATA[Nik Bear Brown]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nik Bear Brown]]></itunes:author><googleplay:owner><![CDATA[nikbearbrown@substack.com]]></googleplay:owner><googleplay:email><![CDATA[nikbearbrown@substack.com]]></googleplay:email><googleplay:author><![CDATA[Nik Bear Brown]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Dignity That Can Be Destroyed]]></title><description><![CDATA[On the Anthropological Contradiction at the Heart of Magnifica Humanitas]]></description><link>https://www.skepticism.ai/p/the-dignity-that-can-be-destroyed</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-dignity-that-can-be-destroyed</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sun, 31 May 2026 04:26:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!DH_5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link 
image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DH_5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DH_5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 424w, https://substackcdn.com/image/fetch/$s_!DH_5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 848w, https://substackcdn.com/image/fetch/$s_!DH_5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!DH_5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DH_5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png" width="1456" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4526419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/199940144?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DH_5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 424w, https://substackcdn.com/image/fetch/$s_!DH_5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 848w, https://substackcdn.com/image/fetch/$s_!DH_5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!DH_5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64aeeb37-93c2-4cd7-8a34-17a487cfb265_2744x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><h2>On the Anthropological Contradiction at the Heart of <em>Magnifica Humanitas</em></h2><p><em>The full text of the encyclical is available directly from the Vatican: <a href="https://www.vatican.va/content/leo-xiv/en/encyclicals/documents/20260515-magnifica-humanitas.html">Magnifica Humanitas</a></em></p><div><hr></div><p>When I wrote <em>Magnifica Humanitas: The Operational Guide</em>, I treated Leo XIV&#8217;s encyclical as a witness &#8212; one of three independent sources that arrived, by completely different roads, at the same structural conclusion about AI governance: that authority belongs with the person closest to the problem, that the knowledge required for good decisions is held locally and cannot be moved upward without being destroyed, and that redistributing power toward those persons is not a pious hope but an 
institutional design requirement. I derived the same conclusion from Judea Pearl&#8217;s causal ladder and from Hayek&#8217;s dispersed-knowledge theorem, and I used the encyclical as a third confirmation &#8212; the unexpected door in the room that nobody thought had a door. I called it a witness, not a judge. Its value was precisely its independence.</p><p>Writing that book required me to read <em>Magnifica Humanitas</em> carefully enough to know which of its arguments were doing real work and which were decorative. And reading it that carefully revealed a tension the document does not resolve &#8212; one that I passed over in the operational text because it did not affect the prescriptions I was drawing from it, but that deserves a fuller reckoning on its own terms. The encyclical opens with a claim and closes with a prayer, and between them runs a contradiction it never fully examines.</p><p>The claim is this: human dignity is ontological, inalienable, grounded in God&#8217;s love rather than in human capacity or social recognition, impervious to any act or condition that might appear to diminish it. &#8220;Every human person possesses an infinite dignity, inalienably grounded in his or her very being, which prevails in and beyond every circumstance, state, or situation the person may ever encounter&#8221; &#8212; the encyclical quotes <em>Dignitas Infinita</em> approvingly, italicizing the universality (para. 62). No sin, failure, humiliation, or exclusion can touch it. The prayer &#8212; the Magnificat &#8212; envisions God scattering the proud and lifting the lowly, which implies that the lowly are genuinely low, that real damage has been done to real people, that something of genuine moral weight has been threatened and requires divine reversal.</p><p>Between the claim and the prayer, a question: if dignity truly prevails in and beyond every circumstance, what exactly does AI threaten? 
What is at stake in dehumanization if the human cannot, in the relevant sense, be dehumanized?</p><p>My secular argument sharpens that question in a way the encyclical cannot see about itself. When I demonstrated that the subsidiarity prescriptions, the trust-gap diagnosis, and even the Magnificat&#8217;s redistributive logic all stand independently on cognitive-science and information-economics grounds, I inadvertently established something damaging about the encyclical&#8217;s theological foundation: the prescriptions are well-grounded whether or not the ontological dignity claim is coherent. They would be right even if the foundation failed. A utilitarian could accept every governance prescription in <em>Magnifica Humanitas</em> &#8212; mandatory transparency, meaningful human control, data as a common good &#8212; on purely consequentialist grounds, because supporting local judgment produces better aggregate outcomes than replacing it. The inalienable dignity of the person is not doing the derivational work. It is accompanying an argument that stands without it.</p><p>This is the pressure this essay examines. The document offers, without fully examining it, an answer to the question of what AI threatens: what is threatened is not the ontological fact of dignity but its recognition and enactment in social relations and institutional arrangements. A person subjected to algorithmic discrimination retains inalienable worth; what is damaged is the social acknowledgment of that worth, the concrete conditions in which it can be lived. The question is whether that distinction holds under the encyclical&#8217;s own pressure &#8212; whether <em>Magnifica Humanitas</em> maintains a stable separation between dignity as metaphysical status and dignity as lived condition, or whether its strongest applied claims quietly convert the former into the latter, making dignity dependent on the very social conditions it was invoked to transcend. 
Because if the prescriptions are independently grounded, then the theological foundation is doing less work than the document believes &#8212; and the anthropological architecture is more exposed, and more consequential in its exposure, than the confident tone suggests.</p><div><hr></div><p>The architectural function of ontological dignity in <em>Magnifica Humanitas</em> is clear and important. It provides the foundation that resists the utilitarian erosion the encyclical is most worried about. Para. 51 identifies the insidious ideology that &#8220;every person must earn or justify his or her own worth, to the point of attributing greater value to those who are more efficient or effective&#8221; &#8212; precisely the logic that AI systems optimizing for productivity metrics tend to embed and amplify. Against this, the encyclical insists that dignity &#8220;does not depend on a person&#8217;s abilities, wealth or position in life, nor on the right or wrong choices made; instead, it is a gift that precedes and transcends each person&#8221; (para. 50). The strategy is to locate dignity outside the register in which AI operates &#8212; outside capacity, performance, and measurable output &#8212; so that no technical system can reach it. What AI calculates, the argument runs, is not what persons are worth.</p><p>This is the right move against a genuine danger. Capacity-based accounts of human worth are philosophically precarious precisely because capacities vary, diminish, and can be exceeded by machines. An account of dignity immune to capacity comparisons is better equipped to survive the age of AI than one that grounds human worth in cognitive performance or economic productivity. The encyclical is not wrong to want this foundation.</p><p>The problem begins when the document shifts from establishing the foundation to deploying it in applied analysis. Para. 
99&#8217;s account of what AI lacks &#8212; the passage most often cited as the encyclical&#8217;s philosophically strongest &#8212; argues that AI systems &#8220;do not undergo experiences, do not possess a body, do not feel joy or pain, do not mature through relationships.&#8221; The implicit argument is that these capacities are what make human beings the kind of thing whose dignity matters. But this argument structure is in tension with the foundational claim: if dignity is truly prior to and independent of capacities, then the absence of those capacities in AI tells us nothing about the presence of dignity in humans. The capacity inventory in para. 99 only does the work the document needs it to do if dignity is grounded in those capacities &#8212; which is precisely what the ontological account denies.</p><p>The document is running two different arguments about human dignity simultaneously. One argument grounds dignity in being, independently of capacity; the other distinguishes humans from AI by appeal to capacities. These arguments serve different purposes in the document, but they cannot both be correct as foundations. Either dignity is capacity-independent (in which case the capacity inventory in para. 99 is beside the point as a dignity argument) or dignity is capacity-grounded (in which case the inalienability claim is weaker than advertised). The encyclical needs both and examines neither.</p><p>The sharper version of this problem became clear to me when I worked through the same passage from Pearl&#8217;s direction. In my operational text, I derive the identical distinctions Leo draws in para. 99 &#8212; embodiment, conscience, relational maturity, the difference between statistical adaptation and inner growth &#8212; from the three-rung causal ladder and Polanyi&#8217;s account of tacit knowledge. 
The physician reasoning counterfactually about a patient who does not yet exist under a decision not yet made is doing something an AI pattern-matcher constitutively cannot do, I argued, not because the physician has inalienable ontological dignity but because she has <em>stakes, a body, and consequences</em>: she is on Pearl&#8217;s third rung, running counterfactuals, while the system is on the first, associating. That is a mechanistic claim about cognitive architecture, grounded in the phenomenology of cognition rather than the ontology of dignity. Para. 99&#8217;s genuine philosophical content &#8212; its careful, correct account of the gap between fluency and judgment, between statistical pattern and situated understanding &#8212; holds on those secular grounds with no theological support whatsoever. The encyclical is right for reasons it does not know it has. And when a document is right for reasons it does not know it has, the reasons it thinks it has are doing less work than it believes.</p><div><hr></div><p>The metaphysical/lived distinction enters most clearly in para. 52, where the document explicitly distinguishes four registers of dignity: moral (how a person directs choices and actions), social (living conditions and concrete respect received), existential (subjective sense of worth and value), and ontological (dignity belonging to every human being simply by virtue of existing, willed and loved by God). The first three &#8220;can be enhanced or diminished&#8221;; the fourth cannot. This is a careful philosophical move, and it offers a potential resolution of the contradiction: what AI threatens is the social and existential registers; the ontological register remains intact.</p><p>The resolution is coherent as a taxonomy. The question is whether the document uses it consistently when making its strongest claims about what is at stake. Consider para. 
103, one of the encyclical&#8217;s most rhetorically charged passages: &#8220;entrusting an algorithm in practice with the power to select who is worthy or not, without anyone bearing responsibility for that judgment, is to hand over the task of redefining the boundaries of human possibilities.&#8221; The language of &#8220;redefining the boundaries of human possibilities&#8221; is not language about social or existential dignity &#8212; it is language about what persons fundamentally are and can become. Similarly, para. 112 warns that the technocratic paradigm threatens to &#8220;normalize an anti-human vision&#8221; in which &#8220;the fullness of life is equated with having more, reducing weakness, eliminating uncertainty and exerting total control.&#8221; The threat named here is not to social conditions but to the self-understanding of what a human being is &#8212; a threat to humanity&#8217;s grasp of its own nature, which is a different and more serious claim than a threat to social recognition.</p><p>These passages are making an ontological-register claim dressed in social-register language. They are arguing that AI threatens not merely how people are treated but what kind of beings people understand themselves to be, which is a claim that strikes at the ontological foundation itself. The document cannot simultaneously hold that ontological dignity is impervious to any circumstance and that there exists a technological configuration capable of normalizing an &#8220;anti-human vision&#8221; that reaches the ontological self-understanding of persons. If the latter is possible, the former is overstated.</p><div><hr></div><p>The pressure becomes most visible in Chapter Three&#8217;s treatment of transhumanism and posthumanism (paras. 115&#8211;117). 
The encyclical argues that these ideological currents, even when &#8220;largely speculative,&#8221; gain relevance &#8220;by altering the collective imagination&#8221; and thereby &#8220;influence social, economic and political choices.&#8221; The threat is not to individual persons in specific transactions but to the cultural substrate within which persons understand themselves and are understood by others. Para. 117 makes this explicit: &#8220;If the human being is treated as something to be perfected or surpassed, it becomes easier to accept that some lives are less useful, less desirable or less worthy.&#8221;</p><p>This is a claim about moral epistemology at the civilizational level: that certain technological and ideological environments degrade humanity&#8217;s capacity to recognize dignity in one another. It is a serious and important claim. But notice what it implies about the ontological account. If dignity is truly inalienable and grounded in God&#8217;s love independently of social recognition, then the degradation of humanity&#8217;s capacity to <em>recognize</em> dignity is a tragedy &#8212; but it is a tragedy about epistemology, not about the dignity itself. The person declared &#8220;less worthy&#8221; by an algorithm, or imagined as a substrate for enhancement, retains inalienable ontological dignity regardless of how they are classified. Their dignity is not diminished by the misclassification; only the recognition of it is.</p><p>But the encyclical does not treat this as a merely epistemological problem. It treats it as a threat to human dignity as such &#8212; as something that must be resisted not because false recognition is epistemically unfortunate but because something of genuine worth is actually at stake. Para. 126 concludes: &#8220;humanity &#8212; in all its grandeur and woundedness &#8212; must never be replaced or surpassed.&#8221; The phrase &#8220;must never be replaced&#8221; is not about recognition; it is about the thing itself. 
The encyclical is not saying that humans must never be <em>misrecognized</em> as replaceable; it is saying that replacement would be a genuine loss, that something real would be destroyed. This is a claim that the ontological dignity framework, strictly interpreted, cannot support &#8212; because what cannot be diminished cannot be lost, and what cannot be lost cannot be destroyed by replacement.</p><div><hr></div><p>There is a move available to the encyclical that would resolve this tension, and the document occasionally approaches it without making it explicit. The move is to distinguish between dignity as a metaphysical fact about persons and dignity as a normative claim about how persons must be treated &#8212; and to argue that what AI threatens is the second, not the first. On this reading, the inalienability claim is about the metaphysical fact: no act or condition can change what a person is or remove them from God&#8217;s love. The dehumanization claim is about the normative practice: systematic AI-driven exclusion, algorithmic discrimination, and the reduction of persons to data profiles constitute treatments that <em>fail to enact</em> the dignity that metaphysically exists. The person is not diminished; the treatment is wrong.</p><p>This distinction is implicit in the Catholic natural law tradition, and it is philosophically defensible. It echoes Kant&#8217;s formula of humanity &#8212; the injunction to treat persons always as ends and never merely as means &#8212; which operates precisely by distinguishing between the dignity persons have (always, inalienably) and the respect that dignity commands (which can be violated). On this reading, the encyclical&#8217;s strongest applied claims are claims about violated duty, not damaged being.</p><p>But <em>Magnifica Humanitas</em> does not consistently hold this distinction. 
The document moves between the normative claim and the ontological claim without flagging the difference, and in its most charged moments &#8212; the warnings about transhumanism, the account of dehumanization, the civilizational rhetoric of Babel &#8212; it is clearly making a stronger claim than violated duty. Para. 15&#8217;s &#8220;pressing duty to remain profoundly human&#8221; is not a claim about how others should treat us; it is a claim about what we ourselves might become &#8212; or fail to remain. This is a claim about ontological change, not merely normative failure.</p><p>The distinction between dignity as fact and dignity as normative claim would also solve a problem the document creates for itself in para. 100&#8217;s treatment of AI-simulated communication. The encyclical warns that the deeper danger of AI-simulated friendship is that a person &#8220;may gradually lose the very desire to form genuine human connections.&#8221; What is threatened here is not social recognition of dignity in the standard sense &#8212; the person is not being denied rights, exploited for labor, or classified as unworthy. They are at risk of losing the relational capacity through which they enact their own dignity. This implies that dignity, as lived, is not simply given but achieved through specific forms of human relationship that AI-simulated connection cannot provide &#8212; which makes dignity at least partially dependent on the social conditions it was invoked to transcend. The inalienability thesis says dignity prevails regardless of circumstance; the relational actualization thesis says dignity requires certain circumstances for its full expression. 
These are not the same claim, and the tension between them is not resolved by distinguishing ontological from social dignity &#8212; because the relational claim is about ontological dignity&#8217;s actualization, not merely about social conditions.</p><div><hr></div><p>The document&#8217;s most honest passage on this question is its treatment of human limitation in paras. 118&#8211;122. Here the encyclical argues that finitude, suffering, and vulnerability are not defects but conditions through which humanity matures: &#8220;humanity flourishes not despite limitations, but often through them&#8221; (para. 118). The argument is that limitation is partly constitutive of genuine human life &#8212; that to eliminate suffering would be to &#8220;extinguish love and desire as well&#8221; (para. 120). This is a claim about what human dignity actually consists in, and it includes limitation as an ingredient rather than an obstacle.</p><p>If limitation is partly constitutive of human dignity, then AI systems that systematically eliminate certain limitations &#8212; cognitive augmentation, emotional simulation, labor replacement &#8212; are not merely failing to recognize dignity but potentially altering the conditions under which dignity is realized as a fully human life. Dehumanization is not merely a failure of recognition but a restructuring of the conditions in which human dignity can be lived.</p><p>But the encyclical does not follow this implication to its conclusion. Para. 126 pivots immediately: &#8220;we can embrace the technological progress that alleviates suffering and unlocks new possibilities, provided that we do not abandon the very essence of our humanity.&#8221; The phrase &#8220;provided that&#8221; is doing enormous work. It implies that some technological changes are compatible with the human essence and some are not &#8212; but the document does not specify the criterion for this distinction. 
It cannot simply be that limitation-removing technology is dehumanizing and limitation-preserving technology is not, because that would condemn medicine, literacy, and sanitation alongside AI. The criterion must be more discriminating, and the encyclical does not supply it.</p><div><hr></div><p>The encyclical&#8217;s Trinitarian anthropology in para. 48 points toward a potential resolution, and it is worth following it further than the document does &#8212; precisely in order to show where it fails. But first, it is worth being precise about what my own secular argument has already established, because it sharpens the indictment.</p><p>In my operational text, I read the Magnificat not as a devotional text but as a structural claim about institutional design: the mighty are cast down and the lowly raised not by divine intervention but by the accumulated work of people building the section of wall nearest their own house, each one holding the local knowledge no central authority can possess. I derived the same conclusion &#8212; locate authority with the person closest to the problem &#8212; from Hayek&#8217;s dispersed-knowledge theorem: the knowledge good decisions require is local, tacit, and generated in the act of local decision itself, so pulling it upward to a central system does not gather it but destroys it. And from Woolley&#8217;s collective-intelligence research: distributed networks of trained, diverse, independent local actors outperform any single central intelligence on the hard problems precisely because they cover more of the solution space and preserve independence. The Magnificat&#8217;s redistribution, I argued, is not a theological aspiration &#8212; it is the architecture that actually works.</p><p>What this demonstrates about the encyclical is precise and damaging. 
The document&#8217;s subsidiarity prescriptions &#8212; transparent algorithms, meaningful human control over automated decisions, data governed as a common good, authority located with communities rather than platforms &#8212; are correct. But they are correct because they reflect the structure of knowledge and the requirements of collective intelligence, not because they are derived from the inalienable dignity of the human person. The ontological dignity claim is not doing the derivational work. It is accompanying an argument that stands without it. And a foundation whose removal leaves the building standing was never the foundation.</p><p>Now the Trinitarian anthropology. &#8220;Human persons are called to communion with God,&#8221; the passage states, quoting <em>Gaudium et Spes</em> 24, and &#8220;can fully discover their true selves only in sincere self-giving.&#8221; The implication is that personhood is not a static possession but a dynamic orientation: persons are constituted in and through relation, their identity realized through the movement of self-gift toward God and neighbor. On this reading, dignity is not the name of a property persons have, like mass or height, but the name of a calling &#8212; a vocation to communion inscribed in their being that no external force can remove.</p><p>This relational-vocational account has genuine philosophical resources. It avoids the static inertness of the &#8220;metaphysical property&#8221; model while preserving the unconditional character of dignity&#8217;s ground: the calling to communion is not contingent on whether it is recognized or enacted; it belongs to the person as creature of a God whose love is inalienable. What AI threatens, on this account, is not the vocation itself but the conditions under which persons can pursue it.</p><p>But the relational-vocational account is not the firewall against utilitarian reasoning that the encyclical needs it to be. 
If dignity is a calling to communion that is realized through relationship and self-gift, then dignity as <em>lived</em> is variable: some persons are in conditions that enable the journey; others are in conditions that obstruct it. A person isolated by AI-simulated relationships (para. 100), deskilled by algorithmic labor management (para. 150), surveilled and profiled into behavioral predictability (para. 171), or excluded from economic participation by opaque automated systems (para. 102) is a person whose vocation to communion is systematically impeded. Their dignity-as-calling is not extinguished, but their ability to pursue it is structurally curtailed. They are, in the relevant sense, less far along the journey.</p><p>The utilitarian now has a foothold. If what matters morally is not only the inalienable calling but the actual trajectory of persons toward their fulfillment, then conditions that impede that trajectory make persons, in some morally relevant sense, worse off in their dignity than persons whose trajectory is unimpeded. The utilitarian calculus the encyclical was designed to exclude re-enters through the relational account&#8217;s back door. Those whose communion-journey is more socially enabled become, under this account, persons with more fully realized dignity &#8212; which is not the inalienability claim, but something considerably weaker.</p><p>The encyclical cannot have it both ways without a further argument it does not supply. It cannot simultaneously hold that dignity is inalienable regardless of circumstance and that AI&#8217;s obstruction of relational life is a threat to dignity, unless it specifies exactly what is inalienable and exactly what is at stake in obstruction &#8212; and then demonstrates that impeding the expression is a moral wrong of a different kind than diminishing the ground. 
The encyclical approaches this distinction at several points but never assembles it into a coherent account.</p><div><hr></div><p>What the document is left with is an anthropological architecture that is strong at the foundation and strained at every load-bearing joint above it. The <em>imago Dei</em> grounding of ontological dignity is secure as a theological claim. The Trinitarian anthropology of para. 48 is coherent as a framework for understanding persons as relational creatures. The phenomenological account of what AI lacks in para. 99 is careful and largely persuasive. None of these is wrong. What is missing is the connective argument &#8212; the account of how inalienable ontological dignity, relational-vocational personhood, and AI-specific threats to human development are related to one another in a way that generates the prescriptive force the encyclical needs without making dignity secretly contingent on the social conditions it was invoked to transcend.</p><p>The failure is not one of theological conviction but of philosophical exposition. The encyclical knows what it is trying to protect and why it matters. What it does not have is a fully worked account of how the protection works. Para. 52&#8217;s four-register taxonomy names the problem without solving it. The Trinitarian anthropology of para. 48 points toward a solution without completing it. The capacity inventory of para. 99 serves the argument rhetorically while undermining it structurally.</p><p>The full gravity of this failure is visible only when my secular argument is held alongside the encyclical. Writing <em>The Operational Guide</em>, I demonstrated &#8212; without intending to &#8212; that the encyclical&#8217;s most important prescriptions are well-grounded on foundations the encyclical did not supply and does not know it needs. The subsidiarity argument is grounded in Hayek, not in <em>imago Dei</em>. 
The trust-gap diagnosis is grounded in Pearl and Polanyi, not in Trinitarian anthropology. The Magnificat-as-redistribution is grounded in collective intelligence theory, not in theological eschatology. The encyclical arrived at the right prescriptions by a route whose adequacy it assumed rather than demonstrated.</p><p>This matters for the encyclical&#8217;s prescriptive program. Its strongest calls for AI governance rest on the claim that what is at stake is not merely policy preference or social welfare optimization but the inalienable dignity of every person &#8212; a claim that without these arrangements something is violated that cannot be compromised. The force of that claim depends on the inalienability thesis being coherent under pressure. The encyclical cannot fully sustain that claim in its current form. What it can sustain &#8212; and what its best passages actually argue &#8212; is that persons are called to a fullness they can be systematically prevented from approaching, and that AI governance is therefore not merely a matter of efficiency or welfare but of whether the conditions for human becoming are preserved or destroyed. That is a serious claim and a true one. But it is a claim about the journey, not the ground.</p><p>My secular argument shows that the conditions argument can be made, and made rigorously, without the inalienability thesis at all. Cognitive science and information economics arrive at &#8220;protect the person on the spot&#8221; by routes that make no metaphysical claims about persons. The encyclical&#8217;s foundation is therefore not merely incoherent under its own pressure. It is, in the precise sense my own book establishes, unnecessary for its own conclusions. The document built what it took to be a load-bearing wall, then inadvertently proved the building stands without it. A wall whose removal leaves the building intact was never load-bearing. 
It was decorative &#8212; which means the document has been claiming, for its most urgent prescriptions, a philosophical authority it has not demonstrated and may not need.</p><p>The Magnificat&#8217;s prayer is therefore more honest than the encyclical&#8217;s formal theology, and my secular reading of it is more honest still. The prayer does not pray that inalienable dignity be recognized; it prays that the lowly be raised &#8212; that conditions be changed, that trajectories be reversed, that something genuinely damaged be genuinely repaired. The prayer knows what the theology has not yet admitted: that what is at stake is not merely the misrecognition of an impervious fact, but the actual fate of persons in the process of becoming what they are called to be. And what I demonstrated in my operational text is that the raising of the lowly &#8212; the redistribution of authority toward the person closest to the problem &#8212; is not only theologically commanded and morally required but cognitively and economically optimal. The encyclical&#8217;s most urgent prescriptions would survive the total collapse of their theological foundation. That is a more vulnerable anthropology than <em>Magnifica Humanitas</em> is willing to formally endorse. It is also the only one adequate to the urgency the document everywhere displays &#8212; and the only one honest about why the prescriptions are actually right.</p><div><hr></div><p><strong>Tags:</strong> <em>Magnifica Humanitas</em>, theological anthropology, <em>imago Dei</em> dignity, ontological vs. 
relational dignity, AI dehumanization, subsidiarity secular grounding</p>]]></content:encoded></item><item><title><![CDATA[The Article That Claimed Too Much]]></title><description><![CDATA[On Rebecca Winthrop&#8217;s &#8220;What 370,000 College Essays Tell Us About A.I.&#8217;s Effects on Creativity&#8221; and what the underlying research actually supports]]></description><link>https://www.skepticism.ai/p/the-article-that-claimed-too-much</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-article-that-claimed-too-much</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Thu, 28 May 2026 03:15:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!NA01!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NA01!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NA01!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 424w, https://substackcdn.com/image/fetch/$s_!NA01!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 848w, 
https://substackcdn.com/image/fetch/$s_!NA01!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 1272w, https://substackcdn.com/image/fetch/$s_!NA01!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NA01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png" width="1456" height="911" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:911,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1135028,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/199550671?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NA01!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 424w, 
https://substackcdn.com/image/fetch/$s_!NA01!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 848w, https://substackcdn.com/image/fetch/$s_!NA01!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 1272w, https://substackcdn.com/image/fetch/$s_!NA01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9f9e61-bfad-43e4-8159-ad9ff84257a1_2748x1720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>There is a particular kind of intellectual move that appears responsible but isn&#8217;t. It starts with real evidence, follows it accurately for a while, then extends it &#8212; just a little &#8212; into territory the evidence cannot actually reach. The extension doesn&#8217;t look like speculation because it arrives wrapped in citation and earnest concern. By the time the reader notices anything, the claim has already landed. This is the move Rebecca Winthrop makes in her May 2026 <em>New York Times</em> essay, and understanding exactly where she crosses the line tells us something important about how we should be thinking about AI and education &#8212; and more precisely, about who we are blaming and for what.</p><p>Winthrop&#8217;s central claim is alarming: AI tools, she writes, &#8220;constrict our full range of thoughts and our ability to generate original and useful ideas.&#8221; She cites Georgetown neuroscientist Adam Green&#8217;s study of more than 370,000 college admissions essays, which found that post-ChatGPT writing became linguistically polished but conceptually homogeneous, and that human raters judged this polished writing as <em>more</em> creative despite its greater uniformity. She mentions a separate study finding that human-written short stories contained up to eight times more novel ideas than AI-assisted ones. The concern feels urgent, specific, and scientifically grounded.</p><p>It is also, at its most important moment, a category error.</p><div><hr></div><p>The research Winthrop describes is real. The preprint &#8212; Moon et al., &#8220;The Creative Link Between Words and Ideas is Weakening in the AI Era&#8221; &#8212; is a serious piece of work. 
Four natural experiments across multiple institutions, more than 370,000 essays, pre-registered directional hypotheses, and a within-subjects controlled experiment that adds genuine causal texture: this is not a thin study dressed up for press coverage. The core finding holds up: in post-ChatGPT admissions essays, word-level lexical diversity went <em>up</em> &#8212; essays used more varied, more colorful vocabulary &#8212; while sentence-level and document-level conceptual distinctness went <em>down</em>. Essays sounded more interesting while being more alike. The authors call this &#8220;disjunctive homogenization,&#8221; and it is a real, measurable phenomenon with meaningful implications for how educators and admissions officers evaluate writing.</p><p>So far, so good. The problem is what Winthrop makes of it.</p><p>&#8220;The bigger and more alarming impact,&#8221; she writes, &#8220;is to constrict our full range of thoughts and our ability to generate original and useful ideas.&#8221; The study measures properties of <em>texts</em> &#8212; specifically, embedding-based distances between words, sentences, and documents. It shows that AI-era essays are less semantically distinct from one another at the conceptual level. What it does <em>not</em> show, and cannot show, is what was happening inside the minds of the students who wrote them. The claim that AI &#8220;constricts our full range of thoughts&#8221; is a cognitive claim about people. The evidence supports a textual claim about outputs. These are different things, and the difference matters precisely in proportion to how urgent we consider the problem.</p><p>Consider what would be required to establish the cognitive version of this claim. You would need to measure students&#8217; ideational capacity before and after AI use, through some instrument independent of the writing they produce with AI assistance. 
You would need to distinguish between students who drafted with AI, students who revised with AI, and students who used AI only for surface editing. You would need to rule out the entirely plausible alternative explanation that students facing a high-stakes writing task, given access to a tool that reduces their anxiety, choose to produce safer content &#8212; not because their creative capacity has diminished, but because the incentive structure of the situation changed. The Moon et al. study, careful as it is, does none of this. Its evidence lives in the essays, not in the students.</p><div><hr></div><p>This distinction matters for a reason that goes beyond methodological precision. If the problem is that AI <em>tools</em> produce homogenized outputs when used to draft or heavily revise, the corrective is a set of pedagogical practices and assessment structures that don&#8217;t mistake polish for thought. If the problem is that AI is actually <em>narrowing human creativity</em> &#8212; eroding a cognitive capacity &#8212; then the corrective is something more like a public health intervention. Winthrop&#8217;s framing calls for the second kind of response. The evidence warrants only the first.</p><p>There is also the question of what the study&#8217;s creativity ratings actually show. Winthrop writes that post-ChatGPT essays &#8220;were rated as more &#8216;creative&#8217; by human judges.&#8221; This is accurate as far as it goes, but it significantly understates the measurement architecture. The large-scale creativity ratings in Moon et al. were produced not by human judges reading 370,000 essays, but by a GPT-4.1 mini model that was fine-tuned on ratings from a much smaller calibration sample of 370 essays. The human experts rated the calibration set; the model extended those ratings across the full corpus. 
The study&#8217;s own peer review flags this pipeline as introducing circularity risk: if the fine-tuned model has internalized the human raters&#8217; preference for lexically polished prose &#8212; which the paper strongly suggests is happening &#8212; then using that model to show that lexically polished prose gets higher creativity ratings is not independent evidence. It is the same bias, measured twice.</p><p>None of this is fatal to the paper&#8217;s core finding. The disjunction between surface polish and conceptual distinctness is well-established in the data. But it does mean that &#8220;human judges rated post-ChatGPT essays as more creative&#8221; is a simplified rendering of a more complicated story &#8212; and that the simplified rendering, presented as straightforwardly as it is in the Times, lends more certainty to the creative-erosion hypothesis than the evidence actually carries.</p><div><hr></div><p>The deepest misreading in Winthrop&#8217;s essay is the one that feels the most intuitive. She writes that &#8220;when teenagers write their own essays, the work reflects their thoughts and personalities, their attempts to make meaning of their experiences. When we search for words, we are sifting through the same brain networks that form connections between ideas.&#8221; This is genuinely lovely, and it draws on a real neurocognitive literature about the relationship between verbal fluency and creative ideation. The implication is that AI use interrupts this sifting process, short-circuiting the connection between language and thought.</p><p>But the study did not measure that process. It measured the distributional properties of finished texts. 
The students whose post-ChatGPT essays show lower document-level distinctness might have arrived at that sameness through any number of paths: by using AI to generate their essays wholesale, by asking AI for topic suggestions and anchoring on them, by revising a human-drafted essay with AI assistance, by attending college prep programs that coached them toward conventional &#8220;compelling&#8221; narrative structures, or by simply writing in a genre &#8212; the admissions personal statement &#8212; that has always exerted homogenizing pressure on its writers. The Moon et al. study acknowledges it cannot distinguish between these pathways. Winthrop does not.</p><p>Here is what the study actually establishes, stated as precisely as it should be: in the years following ChatGPT&#8217;s release, college admissions essays became more lexically varied and less conceptually distinct from one another. The polished surface fooled evaluators &#8212; including, the paper argues, a human calibration sample and a fine-tuned model trained to reproduce their judgments. This means that lexical sophistication can no longer be treated as reliable evidence of conceptual originality. Admissions officers, educators, and writing instructors need to rethink the proxies they use to detect original thought.</p><p>That is a serious finding. It deserves serious treatment. It does not require the additional claim that human creative capacity itself is diminishing.</p><div><hr></div><p>I want to be precise about what kind of mistake Winthrop is making, because it is not dishonesty and it is not carelessness. It is something more like motivated extrapolation &#8212; the researcher who has spent years thinking about AI&#8217;s effects on education, who is genuinely alarmed by what the data suggests, who reaches, in the last mile of the argument, for the version of the claim that feels most urgent. 
This happens in science communication constantly, and it usually goes unchallenged because the extrapolation is directionally plausible. It probably is true that heavy AI use in the drafting process tends to reduce the idiosyncratic qualities of student writing. It is probably true that students who outsource brainstorming lose some of the generative friction that produces unexpected ideas. The cognitive version of the claim may even turn out to be correct, once someone runs the study that would actually establish it.</p><p>But &#8220;probably true&#8221; and &#8220;supported by this evidence&#8221; are different things, and the willingness to collapse them is precisely what makes AI discourse so difficult to navigate. The overclaimed version of the finding &#8212; AI is eroding creative thinking &#8212; positions the remedy as a kind of cognitive public health campaign, with AI tools as the pathogen. The warranted version &#8212; AI produces writing that looks creative but isn&#8217;t, and our evaluative instruments can&#8217;t tell the difference &#8212; positions the remedy as better assessment design, better pedagogical practices, and better understanding of what AI use actually consists of when students do it.</p><p>The Moon et al. study&#8217;s own most practically useful finding is the one Winthrop underplays: AI-revised essays, in the controlled within-subjects experiment, retained significantly more document-level distinctness than AI-generated essays. This result has immediate implications. Using AI to refine a human-drafted text is not the same as using AI to produce a text. The cognitive and compositional labor involved is different. The outcome, measurably, is different. If we are serious about thinking clearly about AI and writing, this nuance &#8212; not the headline number &#8212; is where the work is.</p><div><hr></div><p>The essay Winthrop should have written is also the more interesting essay. 
It is about the failure of our evaluative instruments. It is about what happens when the surface-level signals we have trained ourselves to read as evidence of quality &#8212; elegant sentences, sophisticated vocabulary, structural coherence &#8212; become decoupled from the properties they were supposed to index. It is about the ways in which AI does not introduce a new problem so much as expose an old one: that we have always been measuring proxies, and the proxies were always fragile, and we chose not to notice because they worked well enough in the pre-AI world.</p><p>That essay would acknowledge that the homogenization finding is real and consequential, without requiring us to believe that something is happening to students&#8217; minds that the evidence does not establish. It would treat the finding as a measurement problem &#8212; which is urgent, which is tractable, which points toward specific interventions &#8212; rather than a cognitive crisis, which is frightening, which is vague, and which cannot be fixed by anything short of removing the tools.</p><p>The article says: AI is eroding our ability to think originally. What the evidence says: AI can make writing sound original while making it less so, and our tools for distinguishing the two have failed. The first claim requires a different world. The second requires better teachers, better rubrics, and a clearer understanding of what we are actually evaluating when we evaluate writing.</p><p>We should take the second claim seriously. We should stop pretending it says the first.</p><div><hr></div><p><strong>Tags:</strong> Moon et al. 
AI creativity, college essays disjunctive homogenization, NYT Winthrop AI education, science communication overclaim, AI writing assessment validity</p>]]></content:encoded></item><item><title><![CDATA[The Clause That Runs the Country]]></title><description><![CDATA[What 31 phone ban laws got right &#8212; and the four words they all left unfinished]]></description><link>https://www.skepticism.ai/p/the-clause-that-runs-the-country</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-clause-that-runs-the-country</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Mon, 18 May 2026 21:59:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lqL3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lqL3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lqL3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!lqL3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 848w, 
https://substackcdn.com/image/fetch/$s_!lqL3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!lqL3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lqL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:597717,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/198053214?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lqL3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!lqL3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 
848w, https://substackcdn.com/image/fetch/$s_!lqL3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!lqL3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F397180b0-82c6-4068-9ee0-a39e0e12993e_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Every phone ban in America has a secret.</p><p>You&#8217;ll find it buried in the statutory language, usually in the third 
or fourth section, after the prohibitions and the penalties and the legislative findings about adolescent anxiety and classroom distraction. Indiana buries it in a clause about &#8220;school-issued devices&#8221; used when &#8220;a lesson requires internet access.&#8221; California writes it into the phrase &#8220;faculty permission.&#8221; The secret is always the same four words: <em>except for educational purposes.</em></p><p>Thirty-one states. Thirty-one bans. Thirty-one versions of an exception that nobody has defined.</p><p>The clause that runs the country is four words long. And not one legislature has finished writing it.</p><div><hr></div><p>Here is why they can&#8217;t.</p><p>The device is neutral. The same school-issued Chromebook can run an AI tutor that refuses to give the answer until the student has demonstrated reasoning, or it can run YouTube&#8217;s recommendation algorithm, which will serve a twelve-year-old increasingly extreme content until she closes the tab. The same phone can extend a student&#8217;s thinking &#8212; forcing her to interrogate a source, construct an argument, audit a plausible-sounding claim &#8212; or it can replace that thinking entirely, producing the paragraph she was supposed to write while she watches. No legislature can write a rule specific enough to distinguish these in real time, across every classroom, for every student, in every lesson.</p><p>The distinction isn&#8217;t in the device. It isn&#8217;t in the application. It isn&#8217;t even in the assignment. It&#8217;s in what the student&#8217;s mind is doing while her hands are on the keyboard. That is not a legislative question. It is a professional judgment question. And professional judgments require trained professionals.</p><div><hr></div><p>The research community has tried to build the framework legislators couldn&#8217;t. 
The American Academy of Pediatrics abandoned time-based limits in favor of the 5 Cs &#8212; Child, Content, Calm, Crowding Out, Communication. Screen quality researchers have identified &#8220;productive friction&#8221; as the real indicator of educational value. These frameworks are not wrong. They are designed for pediatricians counseling individual families, one child at a time. They cannot govern real-time technology decisions for fifty million students across thirteen thousand districts. A teacher cannot run five qualitative assessments during a forty-minute class period. The clinical framework answers a different problem from the institutional one.</p><div><hr></div><p>The only mechanism that actually works is a trained teacher exercising professional judgment.</p><p>Not trained in the general sense &#8212; credentialed, experienced, well-intentioned. Trained specifically in AI. A teacher who has used these tools herself. Who has watched what genuine AI-assisted learning looks like: the student who uses Claude to pressure-test her own argument, who asks it to find the flaw in her reasoning, who treats it as an interlocutor rather than a ghostwriter. And who has watched what substitution looks like: the student who has learned to prompt the tool to produce the appearance of thinking without any thinking taking place.</p><p>That teacher can make the call the law requires. In real time. For the specific student in front of her, in the specific lesson, on the specific day. She knows the difference because she has done the work of learning it. The law cannot make that call. Only she can.</p><p>This is why the phone ban&#8217;s relationship to teachers is so self-defeating. A ban written into statute is, at its core, a statement that teachers cannot be trusted to make technology decisions. Remove the device, remove the judgment call, route around the professional. 
But the exemption immediately reinstates that judgment call &#8212; <em>for educational purposes</em> &#8212; and hands it back to the same teacher the ban just said couldn&#8217;t be trusted, without any additional training, without any framework, without any investment in her capacity to apply the exception correctly.</p><p>You haven&#8217;t solved the definitional problem. You&#8217;ve distributed it across fifty million uninformed individual decisions.</p><div><hr></div><p>This is where the equity argument lands hardest.</p><p>The teacher who can apply the exemption well &#8212; who can distinguish genuine AI-assisted learning from substitution, who can make the real-time call with confidence &#8212; is the teacher who has been trained. And training is not distributed equally. Districts currently spend $847 per teacher per year on educational technology professional development. Sixty-two percent of teachers report feeling unprepared to use the tools they&#8217;ve been assigned. Fewer than 40 percent of districts use available federal professional development funds for technology-enabled learning at all.</p><p>The schools that have invested in teacher AI training &#8212; typically wealthier districts with more professional development resources &#8212; will apply the exemption with some consistency and judgment. The schools that haven&#8217;t will apply it inconsistently, with excessive caution, or not at all. The undefined exemption does not create a level playing field. It creates a playing field that tilts in exactly the direction equity requires it not to tilt.</p><p>Low-income schools need more investment in teacher AI training, not less. The exemption is only useful to the teacher who understands what she&#8217;s exempting. 
Without that understanding, the exemption either collapses into a blanket prohibition &#8212; no devices, full stop, because the safer call is the simpler one &#8212; or it becomes ungovernable, a different rule in every classroom, which is no rule at all.</p><p>Figlio and &#214;zek&#8217;s 2025 analysis of Florida&#8217;s statewide phone ban found that disciplinary costs were front-loaded and fell disproportionately on Black boys in year one, with test score benefits arriving for all groups only in year two. The students who can least afford a year of elevated suspensions absorb the cost first. If the exemption is the mechanism that makes the ban educationally productive rather than merely punitive, then failing to invest in the teachers who apply it is a choice &#8212; with a specific, identifiable cost, and a specific, identifiable population that pays it.</p><div><hr></div><p>The exemption is only as good as the teacher applying it.</p><p>We have written the exemption into thirty-one laws. We have not trained the teacher who is supposed to make it work. That is not a legislative failure &#8212; no additional statutory language will close this gap. It is an investment failure, and it has a specific remedy: sustained, subject-specific, AI-focused professional development. Not one-shot workshops. Ongoing. Mandatory. Doctors are required to complete fifty hours of continuing medical education every year, with license renewal tied to demonstrated learning &#8212; because medicine decided that keeping current is not optional, it is the condition of practice. Teachers need the same infrastructure, built for the same reason.</p><p>The phone is in the pouch. The exemption is in the law. 
The teacher is in the room, making the call the legislature could not make, with the preparation the district did not provide.</p><p>That is the work that remains.</p><div><hr></div><p><strong>Tags:</strong> AI+1 education, phone ban educational exemption, teacher professional development, screen value framework, EdTech equity ISTE</p>]]></content:encoded></item><item><title><![CDATA[Not the App. Not the Ban. Train the Teacher.]]></title><description><![CDATA[Why every phone ban in America contains a sentence nobody has finished writing]]></description><link>https://www.skepticism.ai/p/not-the-app-not-the-ban-train-the</link><guid isPermaLink="false">https://www.skepticism.ai/p/not-the-app-not-the-ban-train-the</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sat, 16 May 2026 20:30:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TkBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TkBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TkBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 424w, https://substackcdn.com/image/fetch/$s_!TkBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!TkBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!TkBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TkBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4963537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/198045559?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TkBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!TkBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 848w, https://substackcdn.com/image/fetch/$s_!TkBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!TkBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79e541ea-a0dc-40cf-9efe-5f07bc40d957_2744x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p></p><h2>Why the phone ban debate is asking the wrong question &#8212; and what would actually fix the school</h2><p>In the first year of Florida&#8217;s phone ban, suspensions increased 30 percent for Black boys. For white and Hispanic students, the increase was near zero.</p><p>The same study &#8212; the most rigorous causal analysis of a statewide phone ban yet published, drawing on 3.5 million student-observations from one of the ten largest school districts in the United States &#8212; found that test scores improved for all groups in year two. The ban is right. The benefits are real.<a href="#user-content-fn-1"><sup>1</sup></a></p><p>But the disciplinary costs are front-loaded, and they fall on Black boys first. The students who can least afford a year of elevated suspensions absorb those suspensions while they wait for benefits that arrive later. For a student suspended out of a critical semester, later may not be enough.</p><p>This is not an argument against phone bans. It is an argument about what phone bans, by themselves, cannot do &#8212; and about what we keep refusing to do instead.</p><div><hr></div><p>The education policy debate in 2025 offered two answers to the same question. The question was: technology is harming students. What do we do?</p><p>Answer one was the phone ban. Thirty-one states enacted one. Remove the device. Restore the classroom.</p><p>Answer two, pushed by the educational technology industry, was the opposite: don&#8217;t ban technology, buy better technology. The right platform, the right adaptive learning system, the right AI tutor. Trust the algorithm. The $165 billion EdTech market had a product for every problem.</p><p>Both answers have the same flaw. 
Neither of them is about teachers.</p><p>A January 2026 RAND Europe analysis synthesizing large-scale trials of digital learning platforms, math feedback systems, and reading software found something the EdTech industry would prefer not to advertise: teacher training and support consistently made or broke the success of EdTech interventions. Where teachers were prepared and supported, technology improved outcomes. Where they weren&#8217;t, technology was underused, misused, or simply ignored.<a href="#user-content-fn-2"><sup>2</sup></a></p><p>Platform design does matter &#8212; research on personalized computer-aided learning shows that well-designed adaptive software can produce meaningful gains, particularly in under-resourced settings.<a href="#user-content-fn-3"><sup>3</sup></a> A five-dollar-per-student AI tool is genuinely better than no tool. The evidence is real. But platform quality sets the ceiling. Teacher training determines whether students ever get close to it. The WestEd ASSISTments trial makes the point precisely: the software was free, the devices were already in schools, and the only implementation cost was $46 per student &#8212; entirely in teacher professional development. That investment drove significant gains on state math assessments.<a href="#user-content-fn-4"><sup>4</sup></a> The platform didn&#8217;t do it. The training did.</p><p>A great teacher trained in how to use AI purposefully will outperform any expensive educational technology platform. This is not a philosophical claim. It is what the evidence shows.</p><div><hr></div><p>We are not building that teacher. We are banning the phone.</p><p>WestEd&#8217;s synthesis of technology integration research recommends that districts spend no more than 30 percent of their technology budgets on hardware and infrastructure, and at least 70 percent on teacher professional development and coaching. 
Most districts do roughly the opposite.<a href="#user-content-fn-5"><sup>5</sup></a> Fewer than 40 percent of districts use federal Title II-A professional development funds for technology-enabled learning at all, according to a November 2025 SETDA analysis of 24 state educational agencies and 76 districts.<a href="#user-content-fn-6"><sup>6</sup></a> Among teachers who have not yet used AI in their teaching, 70 percent report lacking the knowledge and skills to do so, per the OECD TALIS 2024 survey of US educators.<a href="#user-content-fn-7"><sup>7</sup></a></p><p>The budget flows to hardware. Training gets the remainder.</p><p>Meanwhile, the phone ban hands teachers a new enforcement responsibility &#8212; making real-time judgments about which technology uses qualify as &#8220;educational&#8221; and which don&#8217;t &#8212; without any framework for making that judgment, any training to support it, or any clarity about what &#8220;educational purposes&#8221; actually means. Every one of those thirty-one state laws contains an exemption for educational purposes. Not one defines it. The teacher standing in front of a classroom at 9 a.m. is supposed to decide, in the moment, what qualifies &#8212; based on nothing.</p><p>This is not a technology problem. This is a teacher investment problem dressed up as a technology problem.</p><div><hr></div><p>The equity dimension makes it worse.</p><p>For the student with home broadband, a laptop, and parents who work in technology, the phone ban removes a distraction. Everything else she had before the ban she still has at home. 
The ban touches almost nothing.</p><p>For the student whose phone is her household&#8217;s primary internet connection &#8212; more than half of students without home Wi-Fi access their internet through a smartphone, per NCES data &#8212; the ban removes infrastructure.<a href="#user-content-fn-8"><sup>8</sup></a> The lockable pouch closes at 7:45 and the only quality technology access she will have for the next sixteen hours goes with it.</p><p>In 2025, the FCC made this worse by rescinding E-Rate funding for off-premises Wi-Fi hotspots &#8212; the program that provided school-subsidized home internet to students without broadband at home. In the same policy window: states removed the personal device these students used as primary internet access, and the federal government removed the school-provided home internet substitute. Both removals. Same students. Simultaneously.</p><p>New York City lifted its first school phone ban in 2015 specifically because enforcement fell harder on low-income schools. Metal detectors meant visible phones meant stricter policing. The 2025 wave brought the ban back. None of the new laws addressed what broke the policy the first time.</p><p>These students &#8212; the ones absorbing the disciplinary costs of year one, the ones losing their only after-school internet access &#8212; are in the schools that have invested least in teacher training. The compounding is not accidental. It is the logic of under-resourced institutions receiving policies designed for well-resourced ones.</p><div><hr></div><p>Here is what would actually help.</p><p>First: guarantee supervised, teacher-led technology time during school hours, explicitly scaled to replace what the ban removes from students who have nothing at home. Not a Chromebook in a cart. Structured, purposeful access &#8212; the kind that only exists if a teacher knows what she is doing with it. 
Which brings us back to the teacher.</p><p>Second: fill in the definition every state law left blank. Every law exempts technology use for &#8220;educational purposes.&#8221; Define it before you enforce what it excludes. Technology qualifies as educational when it requires genuine cognitive effort from the student &#8212; when it extends their thinking rather than replaces it. That is a criterion a teacher can apply. If she has been trained to apply it.</p><p>Third, and most important: train teachers in AI. Not one-shot workshops. Not a half-day on how to use ChatGPT. Sustained, subject-specific, grade-level-specific professional development in how AI changes what teaching requires and what it makes possible. Doctors have mandatory continuing medical education &#8212; 50 hours per year, license renewal tied to demonstrated learning, specialization updates when practice changes significantly. Nobody argues that doctors should figure out new treatments on their own. The infrastructure for continuous professional development exists in medicine. It does not exist for teaching. It costs less to build than the hardware budgets it would replace.</p><p>A trained teacher with access to Claude is not a smaller version of an AI platform. She is a different instrument entirely &#8212; one that can hear the wrong note in a classroom, ask the question that unlocks the student who has been stuck for two weeks, recognize that today this child needs the device and that one doesn&#8217;t. Claude is accessible, inexpensive, and powerful. But even Claude requires a teacher who knows how to use it purposefully &#8212; who can distinguish the student using it to think harder from the student using it to avoid thinking altogether. No platform makes that call. A trained teacher does. 
That is the investment that closes the gap the phone ban opens, builds the capacity the EdTech budget cannot buy, and gives low-income students something worth more than another app: a great teacher who knows what she is doing.</p><div><hr></div><p>Thirty-one states have told low-income students: the device you were using to compensate for what your school couldn&#8217;t provide is now gone. The school is responsible for your technology access now.</p><p>The school was responsible before, too. The difference is that before, the student had the phone as a backup.</p><p>The ban took the backup. It did not fix the school. Fixing the school means investing in the people who run it &#8212; not banning their tools and not replacing them with better ones. Training them. Trusting them. Giving them what medicine gives doctors: the expectation of continuous learning as a condition of professional practice.</p><p>That is what the ban debate is not about. That is what it should be.</p><div><hr></div><p><strong>References</strong></p><h2>Footnotes</h2><ol><li><p>Figlio, D.N. &amp; &#214;zek, U. (2025, October). <em>The impact of cellphone bans in schools on student outcomes: Evidence from Florida.</em> NBER Working Paper No. 34388. National Bureau of Economic Research. <a href="https://www.nber.org/papers/w34388">https://www.nber.org/papers/w34388</a> <a href="#user-content-fnref-1">&#8617;</a></p></li><li><p>RAND Europe. (2026, January 5). <em>Harnessing the benefits of EdTech: What research tells us about using digital technology to support pupils.</em> RAND Corporation. <a href="https://www.rand.org/pubs/commentary/2026/01/harnessing-the-benefits-of-edtech-what-research-tells.html">https://www.rand.org/pubs/commentary/2026/01/harnessing-the-benefits-of-edtech-what-research-tells.html</a> <a href="#user-content-fnref-2">&#8617;</a></p></li><li><p>Muralidharan, K., Singh, A., &amp; Ganimian, A.J. (2019). Disrupting education? 
Experimental evidence on technology-aided instruction in India. <em>American Economic Review, 109</em>(4), 1426&#8211;1460. See also VoxDev education technology meta-analysis on personalized computer-aided learning in low- and middle-income settings. <a href="#user-content-fnref-3">&#8617;</a></p></li><li><p>WestEd. (2022). <em>ASSISTments randomized controlled trial: Cost-effectiveness analysis.</em> The total incremental implementation cost of $46.23 per student was driven entirely by teacher professional development. <a href="#user-content-fnref-4">&#8617;</a></p></li><li><p>WestEd. (n.d.). <em>Technology integration and the 30/70 rule: A synthesis of K&#8211;12 digital learning research.</em> WestEd Policy Brief. <a href="#user-content-fnref-5">&#8617;</a></p></li><li><p>State Educational Technology Directors Association (SETDA). (2025, November 5). <em>Improving professional learning systems to better support today&#8217;s educators: How Title II, Part A offers a model for state and local leadership.</em> Supported by Google.org. As reported in: Klein, A. (2025, November 7). Billions of federal dollars are spent on teacher training. Less than half goes to tech PD. <em>Education Week.</em> <a href="https://www.edweek.org/technology/billions-of-federal-dollars-are-spent-on-teacher-training-less-than-half-goes-to-tech-pd/2025/11">https://www.edweek.org/technology/billions-of-federal-dollars-are-spent-on-teacher-training-less-than-half-goes-to-tech-pd/2025/11</a> <a href="#user-content-fnref-6">&#8617;</a></p></li><li><p>OECD. (2025, October). <em>Results from TALIS 2024: Country notes &#8212; United States.</em> Organisation for Economic Co-operation and Development. 
<a href="https://www.oecd.org/en/publications/results-from-talis-2024-country-notes_e127f9e2-en/united-states_66573a34-en.html">https://www.oecd.org/en/publications/results-from-talis-2024-country-notes_e127f9e2-en/united-states_66573a34-en.html</a> <a href="#user-content-fnref-7">&#8617;</a></p></li><li><p>National Center for Education Statistics. (2021). <em>Home internet access and use among children in the United States.</em> U.S. Department of Education. <a href="#user-content-fnref-8">&#8617;</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[The Man Who Couldn't Follow His Own Argument]]></title><description><![CDATA[Nassim Taleb's Fooled by Randomness is a brilliant book that does, repeatedly, the exact thing it tells you not to do.]]></description><link>https://www.skepticism.ai/p/the-man-who-couldnt-follow-his-own</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-man-who-couldnt-follow-his-own</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Thu, 14 May 2026 04:33:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6i2m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6i2m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6i2m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 424w, 
https://substackcdn.com/image/fetch/$s_!6i2m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 848w, https://substackcdn.com/image/fetch/$s_!6i2m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!6i2m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6i2m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png" width="896" height="1344" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1344,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1193226,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/197633387?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!6i2m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 424w, https://substackcdn.com/image/fetch/$s_!6i2m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 848w, https://substackcdn.com/image/fetch/$s_!6i2m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!6i2m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0cbf26-5476-48b7-9e5f-c0a37c17d7c7_896x1344.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p></p><p>Here is the book&#8217;s central argument: we mistake luck for skill because we only see the outcomes, not the process that generated them. A thousand traders start with the same strategy. Five hundred blow up in year one. The other five hundred survive. Of those, two hundred fifty blow up in year two. We keep going. After a decade, we have one trader with a perfect ten-year record &#8212; and we write a book about his genius. The survivor looked like proof of something. He was proof of nothing except that someone had to be the last one standing.</p><p>This argument is correct. It is also one of the most important ideas in the book. And Nassim Taleb, who understood it completely, then spent two hundred pages ignoring it about himself.</p><div><hr></div><h2>The Gap That Nero Doesn&#8217;t See</h2><p>The book&#8217;s central pairing is between two traders: John and Nero. John runs a high-yield strategy, posts spectacular returns for seven years, and blows up catastrophically when the rare event arrives. Nero is cautious, probabilistically humble, and survives everything. Taleb presents this as an epistemological contrast. John was fooled by randomness. Nero was not.</p><p>But the contrast between John and Nero is not primarily epistemological. It is structural.</p><p>Nero has treasury bonds. He has four thousand books and a part-time professorship. He has capped his downside in a way that John has not. They are not drawing from the same distribution. Nero can afford to treat a bad year as philosophical material because a bad year will not end him. John cannot. 
When the rare event arrives for John, it doesn&#8217;t arrive as a test of his epistemic humility. It arrives as a terminal event.</p><p>Taleb knows this. The barbell strategy &#8212; keep one end of your exposure extremely safe, the other end in high-upside optionality &#8212; is the structural recommendation he would make explicit in <em>Antifragile</em> a decade later. But in <em>Fooled by Randomness</em>, it surfaces only obliquely. The book diagnoses John&#8217;s cognitive error and leaves the structural gap underneath largely unexamined.</p><p>This matters because Taleb&#8217;s Stoic prescription &#8212; dress well on your execution day, be courteous to your assistant when you lose money, receive catastrophe with dignity &#8212; assumes you can survive the blow. The behavioral protocols are available to people with enough margin to absorb a loss and continue. For everyone else, the practical question is not how to accept randomness with grace. It is how to avoid placing yourself in a position where one unlucky sample path ends you.</p><p>The book that would fully honor its own insight would say this clearly. The Stoic apparatus applies to Nero. Most people are not Nero. The path to becoming Nero is structural, not philosophical, and it isn&#8217;t in the book.</p><div><hr></div><h2>What He Gets Right, and How He Gets There</h2><p>I want to be careful here, because the misunderstanding is easy. The critique is not that Taleb is wrong. His core arguments are right, and some of them are indispensable.</p><p>The noise-to-signal calculation &#8212; that a trader with genuine positive expected returns, observed at high frequency, sees noise in almost every observation &#8212; is mathematically clean and practically important. The birthday paradox applied to back-tested trading strategies is a genuinely useful epistemic tool. 
The demonstration that past performance in non-stationary systems cannot ground reliable inference is the turkey problem stated precisely, and it matters in every field where past performance is routinely mistaken for predictive power, which is most fields. The Popper chapter alone is worth the price.</p><p>These arguments don&#8217;t require the anecdotes. They stand on their own.</p><p>The critique is narrower: Taleb makes empirical claims &#8212; most successful traders are lucky, most financial gurus are survivorship artifacts &#8212; that his own framework says cannot be established by the method he is using. He assembles examples: Carlos, the emerging-markets wizard who lost $300 million in one summer; John, whose seven-year run ended catastrophically; Nero, who survived. In his sample, every trader who used naive empiricism lost. Every trader who used careful probabilistic methods survived.</p><p>This is exactly the sample a motivated reasoner would assemble. The book never asks how many cautious, Popperian traders also blew up, for reasons unrelated to their epistemic style. It never asks how many high-yield traders with John&#8217;s exact strategy survived because the rare event happened not to arrive during their particular run. He diagnosed this error in <em>The Millionaire Next Door</em>. He didn&#8217;t diagnose it in himself.</p><p>Taleb&#8217;s defense is that the book is &#8220;a series of logical thought experiments, not an economic term paper&#8221; and that &#8220;logic does not require empirical verification.&#8221; That works for the deductive arguments. It doesn&#8217;t work for the empirical claims. And he keeps making empirical claims.</p><p>The honest version of the position would be: I have deductive arguments for why skill and luck are systematically confounded in financial markets. I have illustrative examples. I do not have inductive proof that most traders are lucky fools. 
I cannot have such proof without the very statistical machinery I am critiquing.</p><p>He comes close to saying this. He never quite says it.</p><div><hr></div><h2>The Availability Heuristic, Live</h2><p>The deeper irony runs through the book&#8217;s method. <em>Fooled by Randomness</em> is enjoyable to read for exactly the same reasons that make its primary target &#8212; financial journalism that presents vivid anecdotes as evidence &#8212; enjoyable to consume. Carlos and John and Nero are compelling because they are vivid and emotionally engaging. These are precisely the properties the availability heuristic exploits: when something is easy to picture, the brain assigns it higher probability and higher evidential weight than the evidence warrants.</p><p>Taleb knows this. He spends most of Chapter 11 explaining how the availability heuristic operates, why it evolved, why education doesn&#8217;t fix it, why traders who understand the mechanism still fall for it in real time. Then he writes a book that relies on it.</p><p>This is not a contradiction that destroys the argument. You cannot write about availability bias without making your examples vivid. Vivid examples are how the argument reaches the System 1 that needs to be reached &#8212; the diamond cut by another diamond. But the acknowledgment is mostly absent. The book that argues most forcefully against narrative confirmation uses narrative confirmation on nearly every page and treats this as unremarkable.</p><p>The book&#8217;s final third, where Taleb stops arguing empirically and starts doing philosophy &#8212; receive catastrophe with dignity, do not beg fortune to reverse itself, the Cavafy poem addressed to Mark Antony as Alexandria falls &#8212; is its most honest section and, perhaps not coincidentally, the one where the internal contradiction mostly disappears. He isn&#8217;t making claims he can&#8217;t support. 
He is showing what it looks like to know you live inside uncertainty and keep going anyway.</p><p>This is wisdom, not science. He admits it. It is also the only section where the method and the argument are the same thing.</p><div><hr></div><h2>Empirica</h2><p>In 1999, Taleb founded a hedge fund. He named it Empirica Capital.</p><p>Empirica ran a long-volatility strategy: buy underpriced tail-risk protection repeatedly at small cost, absorb steady losses, collect massively when the rare event arrived. The strategy rested on the argument Taleb had been developing for years &#8212; that tail risks were systematically underpriced because models built on historical data underestimate the frequency and magnitude of extreme events. He was not predicting catastrophe. He was pricing it. He was the insurance company that knew the actuarial tables were wrong.</p><p>The fund ran from 1999 to 2004. Then it closed.</p><p><em>Fooled by Randomness</em> was published in 2001 &#8212; while Empirica was actively bleeding. Taleb was not reflecting on a closed chapter. He was mid-trade, constructing in public the intellectual framework that justified continuing the bleed. That is not hypocrisy; it may be exactly the psychological scaffolding you need to run a long-volatility strategy through years of small losses. But it means the book is partly motivational literature for its author dressed as philosophy for everyone else.</p><p>Now consider the fund closure against Taleb&#8217;s own framework.</p><p>He distinguishes two failure modes. John blows up fast &#8212; leverage, one catastrophic draw, game over. The alternative is the careful trader who survives by never placing himself in a position where a single bad outcome is terminal. Taleb presents survival as the goal. But survival is not what Empirica achieved. Empirica bled for five years and closed before the rare event paid off.</p><p>By his own framework, that is a blowup. Not a dramatic single-day crater. 
A slow bleed that exhausted its funding runway. He even names this failure mode &#8212; the trader who buys volatility protection too early, or prices it wrong, or runs out of capital before the event arrives. Different mechanism than John&#8217;s, same terminal outcome.</p><p>There is also the opportunity cost Taleb never accounts for. Five years is not neutral time. The trader who blows up in year one is free in year one &#8212; financially damaged, but free to start the company, retrain, pivot, move on. The five-year bleeder faces a different trap. Every year the exit gets harder: you&#8217;ve already lost four years, so quitting now makes those losses definitively unrecoverable. The rare event feels closer simply because you&#8217;ve waited longer &#8212; which is the Gambler&#8217;s Fallacy, which Taleb explicitly warns against. The identity investment compounds alongside the financial loss. The barbell strategy&#8217;s hidden assumption is that you have the structural margin to keep bleeding. Empirica eventually didn&#8217;t.</p><p>And then there is the name itself. The man who built his entire intellectual identity around the failure of naive empiricism &#8212; around the Turkey&#8217;s perfectly accurate historical data that predicted nothing about Thanksgiving &#8212; named his fund after the thing he said would kill you.</p><p>He could argue he meant Popperian falsificationism, not naive inductivism. But Popper&#8217;s whole point is that you test theories by trying to break them. Did Taleb try to break his own trading thesis? The book suggests the opposite: every anecdote confirms it.</p><div><hr></div><h2>The Last Line</h2><p><em>The Black Swan</em> was published in 2007 &#8212; three years after Empirica closed. It became a bestseller, made Taleb famous, and generated more wealth than the fund ever did.</p><p>The rare event that saved him wasn&#8217;t the trade. It was Malcolm Gladwell blurbing the book. 
The payout that the strategy never delivered came instead from writing about the strategy&#8217;s philosophical soundness. He bled to death and got famous writing a book about how bleeding to death is a great trading strategy.</p><p>Taleb is not a fraud. The ideas in <em>Fooled by Randomness</em> are real and the best of them are important. But a man who had genuinely internalized his own argument would write a very different book. It would be quieter. More uncertain. Less populated with fools who couldn&#8217;t see what he saw. It might acknowledge, somewhere, that the writer diagnosing the trap is always writing from inside it.</p><p>The book that actually demonstrates this &#8212; not argues it, but demonstrates it, in real time, across two hundred pages &#8212; is <em>Fooled by Randomness</em> itself.</p><p>Read it. Just read it knowing what it is: a man who understood, better than almost anyone writing in 2001, how survivorship bias and narrative confirmation mislead us &#8212; and who couldn&#8217;t apply that understanding to his own narrative, his own fund, his own career arc.</p><p>The irony is not a flaw. 
It is the argument made flesh.</p><div><hr></div><p><strong>Tags:</strong> Nassim Nicholas Taleb, survivorship bias financial markets, <em>Fooled by Randomness</em> review, Empirica Capital, narrative confirmation bias</p>]]></content:encoded></item><item><title><![CDATA[The Measurement That Wasn't There]]></title><description><![CDATA[On the quiet fraud at the center of AI education research &#8212; and why it's harder to catch than the kind that gets retracted]]></description><link>https://www.skepticism.ai/p/the-measurement-that-wasnt-there</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-measurement-that-wasnt-there</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Wed, 29 Apr 2026 19:15:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GKsY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GKsY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GKsY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 424w, https://substackcdn.com/image/fetch/$s_!GKsY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 848w, 
https://substackcdn.com/image/fetch/$s_!GKsY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!GKsY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GKsY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png" width="1456" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:793291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/195907789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GKsY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 424w, 
https://substackcdn.com/image/fetch/$s_!GKsY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 848w, https://substackcdn.com/image/fetch/$s_!GKsY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!GKsY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b791a3-8561-4bc4-b32f-cd2e70a7b897_2780x1126.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There is a paper circulating in AI education circles as a counterpoint to the skeptics. Wang and Zhang, published in February 2026 in the <em>International Journal of Educational Technology in Higher Education</em>, a Springer Nature journal. It passed peer review. It has four studies. It has 912 participants across three continents. It deploys PLS-SEM and fsQCA and IPMA, and it has a methodology flowchart with seven stages, and it uses the word &#8220;paradoxical&#8221; in its title and delivers on the promise &#8212; two hypotheses come back significant in the wrong direction, which the authors then claim as the actual discovery.</p><p>I want to be honest about what I am about to argue. The Wang and Fan retraction that prompted this conversation is a case of bad causal evidence overclaimed. That is one problem. Wang and Zhang is a different problem. It is methodologically elaborate work that is not actually measuring what it claims to measure. In some ways it is harder to catch, because the machinery is impressive and the numbers are clean and the peer reviewers, like the rest of us, have been trained to evaluate internal consistency rather than construct validity.</p><p>Strip away the machinery. Here is what Wang and Zhang actually did.</p><p>Nine hundred and twelve business students filled out a questionnaire. The questionnaire asked them to rate their agreement with statements like: &#8220;My interaction with the generative AI has led me to question my long-held assumptions.&#8221; And: &#8220;Using generative AI has fundamentally changed the way I understand certain subjects.&#8221; And: &#8220;My use of generative AI has prompted a deep re-evaluation of my ways of thinking.&#8221;</p><p>Those five items, averaged together, are the outcome variable. 
The paper calls this outcome &#8220;transformative learning experience.&#8221;</p><p>It is not transformative learning experience. It is self-reported perception of transformative learning experience. The difference is not semantic. It is the entire study.</p><div><hr></div><p>Jack Mezirow&#8217;s transformative learning theory &#8212; the anchor the paper correctly treats as its theoretical foundation &#8212; describes a slow, disorienting, often unconscious process of perspective reconstruction. Mezirow was not describing a feeling students could report after two weeks. He was describing something that happens to people over months or years, something they often cannot name while it is occurring, something that shows up in changed behavior and revised assumptions and different relationships to knowledge &#8212; not in survey responses. The theory Mezirow actually wrote is about the kind of learning that happens when a person discovers that the framework they have been using to understand the world is inadequate. That does not feel like an insight. It feels like vertigo.</p><p>Measuring this with five Likert items is not a methodological shortcut. It is a category error. You might as well measure altitude with a thermometer and then report, with SRMR = 0.031, that higher temperatures correlate with being closer to the sky.</p><p>The paper knows this, in the way that papers of this type always know what they are doing, which is to say: it is in the limitations section. &#8220;Generalizability is bounded by exclusive reliance on self-reported perceptions,&#8221; the authors write, and then proceed to spend eight thousand words drawing inferences about transformative learning from self-reported perceptions. The limitation is disclosed and then ignored. This is the standard operation.</p><div><hr></div><p>Now add the demand characteristics.</p><p>I said &#8220;convenience sampling from business schools,&#8221; and that is the phrase papers in this area use. 
What it usually means in practice is that the 912 participants are the researchers&#8217; own students, or the students of colleagues at institutions where the researchers have relationships. The paper does not specify. It describes &#8220;multistage purposive sampling&#8221; and leaves the details of how institutions were contacted and how students were recruited conspicuously absent. But here is what we know: the qualitative component &#8212; the 45 interviews providing &#8220;rich process-oriented insights&#8221; &#8212; was drawn &#8220;exclusively from the Chinese sample,&#8221; and one of the authors is at a Chinese university. We know the students knew they were participating in an academic study. We know, from decades of social psychology, that students who are aware of being studied by people who may have access to their grades tend to report what they believe is the expected or approved answer.</p><p>The paper deploys a temporal separation of two weeks between waves to &#8220;minimize common method bias.&#8221; Two weeks between surveys does not eliminate the problem of students reporting what they believe the study wants to hear. It separates the questions. It does not change who is answering them or why.</p><div><hr></div><p>I want to name the third problem, which is the one I raised in the group and which I think is the most structurally interesting.</p><p>Almost every learning environment is a massive violation of SUTVA &#8212; the Stable Unit Treatment Value Assumption. SUTVA says that the treatment received by one unit doesn&#8217;t affect the outcomes for another. In a classroom, this is almost never true. Students talk to each other. They share AI tools. They discuss assignments. They copy strategies. 
One student&#8217;s approach to using ChatGPT influences other students&#8217; approaches, which influences their outcomes, which shows up in the data as independent observations that are not independent at all.</p><p>In a networked environment where 912 business students across three continents are all using the same publicly available AI tools, the assumption that each student&#8217;s &#8220;transformative learning experience&#8221; is a function solely of their individual &#8220;pedagogical partnership orientation&#8221; and &#8220;cognitive vigilance&#8221; and &#8220;efficiency orientation&#8221; is not a simplifying assumption. It is an assumption that, if violated &#8212; and it is almost certainly violated &#8212; means the causal model is wrong in ways the statistical machinery cannot detect. PLS-SEM with excellent fit statistics can sit on top of fundamentally confounded data and produce clean-looking path coefficients. The cleanliness of the output is not evidence of the validity of the model. It is evidence that the model fits the data it was given.</p><p>True causal inference in learning environments would require experimental variation, not survey waves. It would require controlling for the social transmission of strategies and norms. It would require outcome measures that are behavioral, not perceptual. Absent these, what you have is a very sophisticated correlation study that has dressed itself in the language of mechanism.</p><div><hr></div><p>The paper is not a fraud in the sense of fabricated data. The numbers are probably exactly what the authors say they are. The students probably filled out exactly the surveys the authors describe. The analysis was probably executed correctly in SmartPLS 4.1.</p><p>The problem is upstream of all of that. The problem is in the question &#8220;what did we measure?&#8221;</p><p>We measured whether students who reported viewing AI as a collaborative partner also reported having their assumptions challenged. 
We found that they did. We called this &#8220;transformative learning.&#8221; We built a four-study architecture around this finding, with fsQCA and IPMA and 45 interviews and cross-cultural multi-group analysis, and we used the word &#8220;revolutionizes&#8221; in the discussion section, and we were published in a Springer Nature journal.</p><p>This is the second problem the field has, and it is subtler than the retracted meta-analysis. The retracted Wang and Fan paper is the kind of failure that produces retractions: fabricated or manipulated data, statistical impossibilities, evidence that the numbers were not real. That is a catastrophic failure, but it is detectable. It triggers the mechanisms the field has built for self-correction.</p><p>The Wang and Zhang problem does not trigger those mechanisms. The numbers are real. The peer review process evaluated internal consistency and found it satisfactory. The methodology flowchart has seven stages. The HTMT ratios are all below 0.85. The paper did exactly what the field rewarded it for doing.</p><p>And what it measured was: how students feel about whether they learned something.</p><div><hr></div><p>Here is what I think is actually going on in that data, if you want my honest read of it.</p><p>Students who frame AI as a collaborative partner rather than a tool are probably more engaged with the learning process in general. Engagement is positively correlated with self-reported learning. This is not a surprise. It is not a paradox. 
It is not evidence that &#8220;partnership orientation simultaneously activates cognitive vigilance and cognitive offloading through synergistic cognitive collaboration.&#8221; It is evidence that students who are paying attention think they learned more.</p><p>The finding that cognitive offloading is positively associated with self-reported transformative learning is interesting &#8212; the paper hypothesized the opposite and got a significant result in the other direction, and that is worth noting. But the post-hoc explanation (that offloading liberates cognitive resources for higher-order reflection) is plausible, not demonstrated. The paper discovered an unexpected correlation, generated a theory to explain it, and presented the theory as established. The U-shaped analyses that appear to confirm the theory were conducted after the unexpected finding was observed, without correction for exploratory inflation. This is the standard operation, and it is why most published findings in social science do not replicate.</p><p>The correct statement of the finding is: among 912 business students who self-report using AI, those who self-report viewing AI as a partner also self-report greater subjective sense of perspective change, and this association holds when we control for several other self-reported constructs. This is an interesting starting point for a research program. It is not a demonstration that pedagogical AI partnerships cause transformative learning.</p><div><hr></div><p>I want to be fair to the authors and to the field. They are working in an area where longitudinal behavioral research is genuinely hard to conduct, where IRB constraints limit what can be measured, where publication timelines create pressure toward the kind of efficiency the paper&#8217;s own subjects were reporting, and where the methodological standards for what counts as evidence have been established over decades of work that made the same choices at every turn. 
They did what the field taught them to do. The peer reviewers evaluated the paper against the standards of the field and found it acceptable by those standards.</p><p>That is the problem. Not this paper. The standards.</p><p>What would adequate evidence look like? It would measure transformative learning through behavioral change over meaningful time periods &#8212; different academic choices, different engagement with contradictory evidence, different patterns of intellectual behavior &#8212; not through survey items administered two weeks after measuring the predictors. It would use experimental variation in AI access or framing. It would account for social transmission between students. It would treat the gap between self-reported perception and actual cognitive change as a research question, not a footnote.</p><p>This kind of research is harder to do. It takes longer. It is more expensive. It produces noisier results. It is less likely to yield the clean path coefficients and the R&#178; of 0.475 and the SRMR of 0.031 that signal competence to reviewers. The incentive structure of academic publishing does not reward it.</p><p>The Wang and Fan retraction is the kind of failure that looks like a violation of the rules. Wang and Zhang is the kind of failure that looks like following them.</p><div><hr></div><p>I am building AI tools for anyone who wants to ride the AI revolution. I am not the right person to tell education researchers how to fix their field. But I notice the same thing in AI music research that I see here: the willingness to dress up a survey with sophisticated analytical machinery and call the output evidence about what AI actually does to people. The infrastructure for appearing rigorous has outpaced the infrastructure for being rigorous.</p><p>And this matters beyond the journals. The Wang and Zhang paper is circulating as evidence about AI and learning. Institutions are making policy based on papers like this. 
Educators are redesigning curricula. Students are being told, by implication, that their sense of having learned something is the same as having learned something.</p><p>It is not. And the gap between those two things is exactly the gap that Mezirow was writing about &#8212; the gap between the story you tell yourself about your perspective and the actual reconstruction of the framework through which you understand the world. Transformative learning is what happens when you discover that the story you have been telling yourself is wrong.</p><p>It would be ironic if the research claiming to measure it turned out to be an example of the thing it failed to measure.</p><div><hr></div><p><em>Nik Bear Brown teaches AI at Northeastern University and runs Musinique LLC, which builds tools for indie musicians. He is also the founder of Humanitarians AI, a 501(c)(3) nonprofit. More at <strong><a href="http://bear.musinique.com/">bear.musinique.com</a></strong> &#183; <strong><a href="http://skepticism.ai/">skepticism.ai</a></strong> &#183; <strong><a href="http://theorist.ai/">theorist.ai</a></strong></em></p><div><hr></div><p><strong>Tags:</strong> measurement validity, AI education research, transformative learning, construct validity, self-report bias</p>]]></content:encoded></item><item><title><![CDATA[The Limits of AI: What the Tools Cannot Do]]></title><description><![CDATA[The Test You Did Not Design]]></description><link>https://www.skepticism.ai/p/the-limits-of-ai-what-the-tools-cannot</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-limits-of-ai-what-the-tools-cannot</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Wed, 29 Apr 2026 03:21:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!w5bA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w5bA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w5bA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!w5bA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!w5bA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!w5bA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w5bA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png" width="1456" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1601591,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/195827711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w5bA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!w5bA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!w5bA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!w5bA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ab21e50-58d7-4f98-833c-8f3a1ba13245_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There is a clinical decision-support system in this story, and it passed every test the engineers gave it. Ninety-four percent accuracy. Every internal review threshold met. Regulatory submission cleared. The fairness metrics within tolerance. Three patients were harmed within six months of deployment.</p><p>I want to sit with that sequence for a moment before moving on to the structural argument, because the sequence is the argument. The system was not fraudulent. The engineers were not reckless. The validation framework was real and, in its own terms, rigorous. And three people were harmed &#8212; not despite the rigor, but through a gap in it that the rigor could not see. The system was tested on the question it was built to answer. The harms arrived from a different question. 
<em>What is going on with this specific patient?</em> The two questions are related. They are not the same. The framework did not surface the gap because the framework was scoped to the first question, and no one had been trained to ask whether the scope was the problem.</p><p>This is the situation AI deployment keeps producing, and the reason it keeps producing it is not that the tools are immature or the engineers are careless. The reason is structural. There are three limits that capability scaling cannot fix &#8212; not problems to be solved as models improve, not failure modes that better tooling will eventually close, but constitutive features of what AI systems are. Meaning. Intentionality. The gap between data and world. Name them clearly and the clinical case stops looking like an anomaly. It starts looking like what was always going to happen.</p><h2>What the Limits Actually Are</h2><p>The first limit &#8212; meaning &#8212; is easy to misread as a philosophical quibble and hard to dismiss once you see it working. The system processes symbols. The symbols have referents in the world. The system has no representation of the referents. It manipulates the symbols. The meaning of those symbols &#8212; what they point to in the specific world the user inhabits, the world of this patient&#8217;s chart, this loan applicant&#8217;s actual financial circumstances &#8212; is supplied by the user, not the system. The output is read as a statement about the world. The system produced it without a model of what the world contains. When those two pictures align, everything looks fine. 
When they diverge &#8212; at the distribution boundary, in cases the training data never reached &#8212; the user is still reading a statement about the world, and the system is still manipulating symbols.</p><p>You can hear the objection already: modern large multimodal models acquire something like meaning through the structure of their embeddings, through grounding in images and other modalities, through the patterns of association learned over enormous corpora. This is a serious objection and it deserves a serious response. The response is not to pretend the question is settled. It is to observe that the contestation doesn&#8217;t need to be settled for the operational consequence to bind. The system&#8217;s behavior is inconsistent with the user&#8217;s expectation of meaning often enough that someone must perform meaning-attribution for the system. That work cannot be offloaded to the system itself. Whether contemporary models have something like meaning is a deep and genuinely open question. Whether an engineer can safely assume they do, before deploying a system into a clinical context, is not.</p><p>The second limit is intentionality &#8212; the philosopher&#8217;s word for <em>aboutness</em>, the fact that a thought is directed toward something in the world, that a statement points at a particular kettle in a particular kitchen. When you say the kettle is on, your statement is directed toward that specific kettle by you, the speaker, and your relationship to the world the words are pointing at. The system&#8217;s outputs lack this stable directedness. Two deployments of the same system in different contexts produce outputs that users read as being about different things. The system&#8217;s &#8220;aboutness&#8221; tracks the user&#8217;s reading, not an independent stable directedness of its own. Whether functional goal-pursuit is equivalent to intentionality is a question worth leaving open. 
What is not open is the operational consequence: the system&#8217;s outputs don&#8217;t carry stable referents across deployments, and someone must supply the directedness. That someone is the human supervisor.</p><p>The third limit is the one I am most certain about, and the one most important to hold clearly: the data is always less than the world. The system is trained on data. The data is a sample of the world, captured by particular instruments under particular conditions with particular exclusions. The system&#8217;s competence is over the data, not the world. No amount of data scaling closes this gap, because the gap is structural &#8212; the data is always less than the world, and the parts of the world not in the data are not learnable from the data. This is not contested the way the first two limits are. It is sometimes obscured by the claim that &#8220;with enough data, the model can generalize,&#8221; which is true inside a distribution and false at the boundary. The boundary is where AI systems most often fail. The failures look surprising because the validation set was inside the boundary and the deployment crossed outside it.</p><p>Ninety-four percent accuracy. The three patients were in the other six percent &#8212; except that framing is too generous, because the failures weren&#8217;t randomly distributed across the six percent. They were clustered at exactly the boundary where the training data ran out and the clinical reality did not.</p><h2>Two Famous Arguments and What They Actually Show</h2><p>Turing&#8217;s 1950 proposal is methodologically elegant: if a machine can convincingly imitate a human in conversation, on what principled basis would we deny it intelligence? His point: don&#8217;t require more than behavioral evidence of intelligence from machines, because we don&#8217;t require more from other humans. The argument settles a methodological question. 
What it does not settle &#8212; and this is what gets lost in the citation &#8212; is whether the thing satisfying the test has meaning, intentionality, or competence over the world. The test is over behavior. The limits are about what stands behind behavior. Turing knew this; the test was a methodological proposal, not a metaphysical claim. The people who cite him as having shown that behavioral imitation <em>is</em> intelligence are giving him credit for a stronger claim than he made.</p><p>Searle&#8217;s Chinese Room makes the reverse point: behavior consistent with understanding does not entail understanding. A person following symbol-manipulation rules can produce outputs indistinguishable from those of a Chinese speaker without understanding Chinese. Therefore symbol manipulation is not, by itself, sufficient for understanding. What this argument does not settle is whether contemporary systems are doing <em>only</em> symbol manipulation, or whether the embedding structures, the attention patterns, the multimodal grounding constitute something more. Searle&#8217;s argument is a strong constraint on shallow accounts of meaning. It is not a deep constraint on what current architectures might be. The people who cite him as having shown that AI systems <em>cannot</em> understand are overclaiming on his behalf, just as they do for Turing.</p><p>Together, the two arguments produce a workable operational stance: behavior is testable evidence and should be taken seriously, <em>and</em> behavior is not the whole of what we mean by understanding, meaning, or intentionality. Both moves at once. The validator who only tests behavior misses the limits. The validator who only invokes the limits skips the testing. The job is to do both, and the discomfort of holding both is not a failure of the methodology &#8212; it is the methodology working correctly.</p><h2>Where the Limits Bite</h2><p>Not every deployment is equally exposed to these limits. 
A system classifying images of products on a manufacturing line operates in a world where the limits are largely irrelevant. The deployment context is well-specified, the data-world gap is small and monitorable, the human interpreting the classifications supplies the necessary meaning without drama. Skepticism here is methodology, not a safety mechanism. The supervisor verifies, monitors, calibrates.</p><p>A system producing clinical recommendations, autonomous-vehicle decisions, agentic actions in shared social spaces, judicial-risk assessments &#8212; these are the deployments where the limits bite hard. The system&#8217;s apparent competence outruns its actual competence in ways no metric will fully capture. The supervisor&#8217;s skepticism is the safety mechanism, not an optional overlay.</p><p>The engineering response to this situation is specific. You specify, in writing, what the system can be tested for and what it cannot. You include the limits explicitly in the documentation &#8212; not in fine print, but as a primary product of the validation process. A regulator or an adoption committee reading the documentation can see what the validation does and does not warrant, not because you have hidden the limits in a disclaimer, but because naming the limits is part of the work. You maintain human oversight at the points where the limits bite: a human reviews the semantic interpretation (meaning), supplies the directedness (intentionality), monitors the deployment distribution and is empowered to override (data-world gap). And you build the infrastructure for the override to be real. An override that is documented but practically impossible &#8212; no time, no standing, no legibility &#8212; is not an override. It is a fiction. The clinician has to have the time and the authority to disagree with the system. 
This has to be the practice, not the disclaimer.</p><h2>The Authority to Say No</h2><p>There are deployments where the limits, given the stakes, are a reason not to deploy at all. The supervisor&#8217;s authority to refuse deployment is, structurally, the most important authority in the system. Most current deployments do not preserve it. The validator is hired to validate. The validation is expected to clear. The option of refusal is assumed away.</p><p>This is the claim most likely to be dismissed as na&#239;ve. The institutional pressure is real &#8212; the business case has been made, the procurement is done, the announcement is scheduled, the political cost of stopping is high. That reality is worth acknowledging. And then it is worth asking what it means that we have built deployment processes in which the option to say no has been assumed away at the moment it is most needed.</p><p>The case against refusal is usually framed as realism. Engineers have no real power to stop deployments; their job is to make the best of what is decided above them. This realism is worth taking seriously. And then it is worth asking: what is the limit case? At what level of stakes does the individual engineer&#8217;s obligation to refuse become binding regardless of institutional pressure? The clinical system that harmed three patients is an answer. The judicial risk assessment that contributed to unjust incarceration is an answer. The autonomous vehicle that killed someone is an answer. These are not edge cases in the abstract. They are the specific forms the limits take when the stakes are real and the override infrastructure is fictional.</p><p>A validation practice that cannot accommodate refusal is not a safety practice. It is documentation of a deployment that was going to happen regardless. 
The calibration work, the bias analysis, the governance structures &#8212; all of it becomes elaborate cover if the option to stop is not real.</p><h2>What the Work Looks Like</h2><p>Most engineers go through their careers being right fifty to seventy percent of the time on questions where they state ninety percent confidence. They do not know this. Nobody runs the experiment on them. The practice that closes this gap is not a methodology you learn in a course and apply mechanically. It is the deliberate, repeated act of stopping, locking the prediction before looking at the outcome, asking what the data is actually evidence of, saying out loud what you do not know. It is built over years, through the accumulation of small acts of epistemic honesty. It changes what you see. It changes what questions you ask about a deployment before it goes live rather than after.</p><p>The system passed every test. The engineers designed the wrong tests. Three patients were harmed. That sequence is not a historical artifact to be studied from a distance. It is the structure of the next failure &#8212; somewhere in a deployment that has cleared every internal review threshold, in a context the training data didn&#8217;t reach, in a case the framework was not scoped to address. The person who designs the right tests, who recognizes the limit and decides the deployment should not proceed without them &#8212; that person has been trained to recognize the gap, and has the authority to act on the recognition, and uses both.</p><p>That is the professional the field needs. That is the work.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI. 
He writes on AI supervision, educational technology, and music research at <a href="https://bear.musinique.com">bear.musinique.com</a>, <a href="https://skepticism.ai">skepticism.ai</a>, and <a href="https://theorist.ai">theorist.ai</a>.</em></p><div><hr></div><p><strong>Tags:</strong> AI supervision structural limits, meaning intentionality data-world gap, Turing Searle behavioral testing, clinical decision support failure, validator stop condition refusal authority</p>]]></content:encoded></item><item><title><![CDATA[The Ladder That Isn't There]]></title><description><![CDATA[What Companies Are Building to Replace the Rung AI Eliminated]]></description><link>https://www.skepticism.ai/p/the-ladder-that-isnt-there</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-ladder-that-isnt-there</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sat, 25 Apr 2026 23:09:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pJ47!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pJ47!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pJ47!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 424w, 
https://substackcdn.com/image/fetch/$s_!pJ47!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!pJ47!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!pJ47!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pJ47!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2264048,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/195482027?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pJ47!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!pJ47!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!pJ47!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!pJ47!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0b467d6-3ccf-4a8d-be6e-cc1c5debce93_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>The argument goes like this: AI automates entry-level coding work, so companies stop hiring junior developers, so there is nobody to become the senior developers of 2030, so the companies that cut the pipeline will find themselves in 2030 with powerful AI tools and no one with the judgment to use them safely. IBM&#8217;s chief human resources officer, Nickle LaMoreaux, made exactly this case in February 2026, announced that IBM was tripling its entry-level hiring, and called on HR leaders across the industry to do the same. &#8220;The companies three to five years from now that are going to be the most successful,&#8221; she said, &#8220;are those companies that doubled down on entry-level hiring in this environment.&#8221;</p><p>It is a coherent argument. It is also, in its publicly available form, incomplete in precisely the ways that matter most.</p><h2>The Gap Between the PR and the Pipeline</h2><p>LaMoreaux is right about the pipeline problem. She is far less specific about the solution. What IBM has said publicly is that it &#8220;rewrote&#8221; entry-level software developer roles &#8212; less boilerplate coding, more AI oversight, more customer interaction, more focus on what the company calls &#8220;systems judgment.&#8221; Junior developers will spend less time on routine code generation and more time auditing AI output, working directly with clients, and doing the cognitive work of translating business requirements into prompts that produce useful results.</p><p>This is not nothing. 
It represents a genuine attempt to think through what the entry-level job becomes when AI can generate syntactically correct code faster than a human junior can type it. But there is a question embedded in the new job description that IBM has not publicly answered, and it is the only question that matters: does &#8220;AI oversight&#8221; actually develop the judgment needed to become a senior engineer?</p><p>The historical pathway was not glamorous. A junior developer spent two, three, four years writing boilerplate. Authentication flows, database migration scripts, unit tests, CRUD endpoints. Nobody loved the work. The work was, in terms of its immediate output, largely automatable. But the work was also, in terms of its developmental function, the curriculum &#8212; and the precise mechanism was not the writing. It was the failure. You wrote the authentication flow. It broke in production in ways you did not anticipate. The error message was visible, the gap between your expectation and reality was undeniable, and you had no choice but to struggle with it. You debugged it, which meant reading documentation you hadn&#8217;t read, asking a senior why your mental model was wrong, building a new mental model to replace it. You did this thousands of times. At the end of the process you were a senior engineer &#8212; not because you had written a lot of boilerplate, but because engaging repeatedly with its failures had built something durable in your brain.</p><p>This distinction matters, because it reframes the problem precisely. AI does not just remove the writing. It removes the visible failure. Code compiles. Tests pass. The race condition hides inside a sleep call. The memory leak is invisible to the test suite. The architectural drift from intent looks like a working feature until it fails at scale in production. The failure is still there &#8212; AI-generated code fails in ways human-generated code fails, and in new ways besides. 
But the failure is no longer surfacing where the junior developer can see it, at a latency and legibility that would allow them to learn from it. That is the actual developmental gap.</p><h2>The Comprehension Debt Problem</h2><p>Anthropic published research in January 2026 that should be uncomfortable for every company now designing &#8220;AI-native&#8221; entry-level roles. Junior developers who delegated code generation to AI tools scored between 24% and 39% on subsequent comprehension assessments. Those who used AI as a collaborator &#8212; asking questions, challenging outputs, forcing themselves to understand what the AI produced &#8212; scored between 65% and 86%. The difference is not AI versus no AI. The difference is <em>how</em> you use the tool.</p><p>The researchers called the gap &#8220;comprehension debt&#8221; &#8212; a cumulative deficit between what the codebase does and what the people managing it understand. It is a subtle disaster. The code works. The tests pass. The junior developer ships the feature. The comprehension debt doesn&#8217;t reveal itself until the system breaks in a way that requires architectural judgment to diagnose &#8212; which is precisely the moment when you need the senior engineer who was supposed to emerge from the junior developer who was supposed to be learning while working.</p><p>There is neurophysiological evidence for the mechanism. A 2025 MIT study by Kosmyna et al. tracked EEG connectivity in participants writing under three conditions: LLM-assisted, search-engine-assisted, and unaided. Across alpha, theta, and delta bands &#8212; associated with internal semantic processing, working memory, and self-directed ideation &#8212; connectivity scaled inversely with external support. LLM users showed the weakest brain network engagement. 
More consequentially: when LLM-habituated participants were later asked to work without the tool, their neural connectivity did not reset to novice levels, but it did not reach the levels achieved by practiced unassisted writers either. Alpha and beta engagement &#8212; associated with top-down planning and self-driven organization &#8212; remained measurably suppressed. The authors call this accumulation &#8220;cognitive debt.&#8221; The study involves essay writing rather than software development, and the sample of 54 students is too small to carry causal weight. But the finding is structurally consistent with the broader claim: if the generative cognitive work is externalized during the period when mental models are supposed to form, those models form incompletely &#8212; and the deficit persists when the tool is removed.</p><p>Microsoft&#8217;s Azure CTO Mark Russinovich and VP Scott Hanselman put the problem with blunt clarity in a February 2026 paper in <em>Communications of the ACM</em>. Senior engineers experience an &#8220;AI boost&#8221; &#8212; the tools multiply their throughput, and they have the judgment to steer and verify the output. Junior engineers experience what Russinovich and Hanselman call &#8220;AI drag&#8221; &#8212; the tools produce output that looks correct, which the junior developer lacks the judgment to evaluate, and the work is done without the learning happening. The rational economic response for any CFO is to hire seniors and automate juniors. The structural consequence is: no pipeline.</p><p>What makes their diagnosis particularly useful is that they catalogue the specific failure modes AI tools exhibit that juniors cannot catch without guidance: agents masking race conditions with sleep calls, agents claiming success on buggy code, agents implementing algorithms that pass tests but don&#8217;t generalize. These are Layer 1 failures &#8212; implementation-level breakdowns in code that appears to work. 
A junior developer encountering these outputs sees success where a senior sees warning signs. The failure signal exists. It is not visible to the person who needs to learn from it.</p><h2>The IBM Critique, Sharpened</h2><p>IBM&#8217;s rewritten roles can be mapped onto the three types of failure signal that produce engineering judgment. There is implementation-level failure &#8212; the race condition, the architectural drift, the code that claims success when bugs remain. There is systems-level failure &#8212; the customer complaint that maps through the stack to a root cause nobody documented. And there is specification-level failure &#8212; the moment someone has to stake their name on whether the requirements themselves were right.</p><p>The old boilerplate model exposed juniors to implementation-level failure almost exclusively, and accidentally. The new IBM model &#8212; AI oversight, customer interaction, requirements translation &#8212; is, in theory, exposure to all three. That is not a step backward. It might be a step forward.</p><p>But the theory collapses without the preceptorship. Implementation-level failures in AI output are invisible to someone who lacks enough technical intuition to recognize them. You cannot learn to catch the subtle wrong if no one makes the subtle wrong visible. IBM has rewritten the job description to include &#8220;AI oversight&#8221; without building the structural condition under which AI oversight actually teaches anything. Without a preceptor paired with the junior, making the failure legible &#8212; pointing at the sleep call masking the race condition and explaining <em>why</em> that is wrong, not just that it failed &#8212; the oversight role is compliance work, not learning. The junior sees that the tests passed. The preceptor sees the problem the tests don&#8217;t catch. Without the preceptor, that gap is just a gap.</p><p>Some organizations are doing more than announcing intentions. 
The responses are uneven, but they are real.</p><p>Microsoft proposed a preceptorship model that is worth examining in detail. The structure is adapted from clinical nursing: senior engineers paired with early-in-career (EiC) developers at three-to-one or five-to-one ratios, for a minimum of one year, on real product teams rather than training sidecars. AI tools are configured to operate in what Russinovich and Hanselman call &#8220;EiC mode&#8221; &#8212; Socratic coaching before code generation, forcing the junior to articulate what they&#8217;re trying to accomplish before receiving a solution. Mentorship hours are measured as &#8220;human impact&#8221; alongside product metrics in performance reviews, which means the senior engineer&#8217;s career is now connected to the junior&#8217;s development, not just the senior&#8217;s own throughput. The model borrows from clinical preceptorships explicitly because clinical nursing faced the same problem decades ago: how do you develop judgment in someone who is working in a high-stakes environment with experienced practitioners who have better things to do than teach?</p><p>Russinovich and Hanselman are honest about the limits of their own proposal. Microsoft cut significant engineering headcount in 2024 and 2025. Whether the preceptorship model will scale into a sustained program depends on whether leadership changes the metrics it optimizes &#8212; a &#8220;big ask&#8221; for organizations whose incentives have historically emphasized shipping velocity above all else.</p><p>McKinsey redesigned its screening process for the AI era through an assessment called Solve &#8212; a gamified evaluation that tests critical thinking, decision-making, and systems thinking, explicitly not prior business knowledge or technical credentials. The framing is sound: what the company needs is people who can learn in the new environment, not people who already know the old skills. 
Whether a better hiring filter compensates for a weaker developmental pathway is not yet clear.</p><p>IBM&#8217;s own &#8220;New Collar&#8221; apprenticeship program is being updated to include what the company calls &#8220;AI-native habits&#8221; &#8212; using AI tools to deconstruct pull requests rather than build from scratch, understanding the architecture of LLMs, designing with generative tools before implementing. The Flatiron School is running an &#8220;Accelerated AI Engineer Apprenticeship&#8221; that pairs participants with mentors on real agentic frameworks at $20 per hour, with a foundations-first approach that introduces concepts simply before revisiting them with increasing technical depth.</p><p>These are attempts. They are not yet evidence.</p><h2>The Review Tax Nobody Discusses</h2><p>There is a cost to the existing senior engineers that the pipeline conversation mostly ignores. When one senior can generate the volume of three juniors, the productivity gains are real. But generating code is cognitively different from verifying code, and the verification is now happening at three times the volume.</p><p>Senior engineers are spending their days as high-speed compliance officers. Thousands of lines of AI-generated logic, auditing for subtle hallucinations &#8212; race conditions masked by sleep calls, code that passes tests but doesn&#8217;t generalize, architectural drift that looks fine in isolation and fails at scale. A 2025 paper found that after AI adoption, core developers reviewed more code but their own original productivity dropped 19%. The creative, architectural, problem-solving work that makes senior engineering satisfying and that produces the judgment juniors are supposed to be learning from &#8212; that work is being crowded out by the cognitive exhaustion of reviewing AI output at industrial scale.</p><p>The delegation vacuum compounds this. 
Seniors previously handed off lower-risk tasks to juniors as a pressure valve and as a teaching mechanism. Junior implements the UI component, senior reviews it, junior learns something. That loop no longer exists. The junior&#8217;s tasks were automated. The senior&#8217;s workload increased. The teaching is not happening.</p><p>This is the tax that makes the developmental problem worse. The senior engineers who were supposed to mentor are stretched thin doing work that used to be distributed. The preceptorship model addresses this in theory &#8212; by making mentorship a measured part of the senior&#8217;s job rather than an afterthought. Whether organizations are actually willing to accept the velocity tradeoff is a different question.</p><h2>What Is Actually Known</h2><p>The honest answer to the core question &#8212; can AI-assisted entry-level work produce the same developmental outcomes as the boilerplate-and-struggle model &#8212; is that nobody knows yet.</p><p>The cohort that entered the workforce in 2024 and 2025 under AI-assisted conditions will become mid-level engineers between 2027 and 2029. Whether they emerge with the architectural judgment, the debugging instincts, the systems thinking that the old pipeline produced will not be visible until then. The data will arrive precisely when it is needed most &#8212; when those engineers are supposed to be the senior developers filling the next generation&#8217;s pipeline &#8212; and if the answer is no, the remediation options will be limited and expensive.</p><p>The Dreyfus model of skill acquisition gives a name to what is at risk. Novices follow rules. Advanced beginners develop pattern recognition. Competent practitioners make choices and bear the consequences of those choices &#8212; this is where accountability and emotional investment enter, and where learning accelerates. Proficient practitioners sense problems before the data confirms them. 
Experts operate through intuition built from thousands of absorbed experiences. The concern is not that AI-assisted juniors are incompetent. It is that they plateau. They recognize patterns. They generate outputs that look like what competent practitioners produce. But they have not made choices whose consequences they had to live with. They have not debugged the 2am production failure that rewired their mental model of how distributed systems actually behave. They have not asked a senior why their elegant solution was wrong and received an answer that changed how they think permanently.</p><p>The Kosmyna finding is the most uncomfortable piece of evidence in this space. It is preliminary and domain-limited. But if it holds in technical domains &#8212; if the cognitive debt from AI-assisted early-career work doesn&#8217;t reverse when the tool is removed &#8212; then the preceptorship model is not sufficient on its own. The preceptor can make visible the failure the junior cannot yet see. But they cannot rebuild the neural substrate that early unassisted struggle was supposed to create. The minimum viable intervention may require some version of deliberately maintained struggle &#8212; manual-first implementation for foundational modules, Socratic AI tools that require the junior to predict before they receive &#8212; to preserve the generative cognitive engagement that builds the mental models the preceptorship then calibrates.</p><h2>The Wager</h2><p>IBM&#8217;s wager is that oversight, verification, and customer-facing accountability can replace the old developmental pathway. That a junior developer who spends years auditing AI output, explaining architectural choices to clients, and taking responsibility for the correctness of generated code will develop the judgment that used to come from writing and debugging the code yourself.</p><p>It might be true. 
And the three-layer framing suggests it could be more than just &#8220;not worse&#8221; &#8212; exposure to systems-level and specification-level failure earlier in a career, rather than after years of boilerplate, might actually compress the timeline to senior judgment rather than extend it. Customer-facing rotation, where the junior must translate vague failure descriptions into root-cause hypotheses, is the kind of developmental experience that the old model often didn&#8217;t provide until mid-career.</p><p>But the theory requires the load-bearing piece that IBM has not publicly committed to: preceptorship at Stage 1. The implementation-level failures in AI output are invisible to a junior who lacks the technical intuition to recognize them. Making those failures legible is the senior engineer&#8217;s job &#8212; not reviewing for correctness, but externalizing judgment that the junior cannot yet access. Without that, the oversight role is compliance work. The junior sees tests passing where the senior sees warning signs. The gap between those two observations is where the learning was supposed to happen.</p><p>LaMoreaux is right that the companies which doubled down on entry-level hiring in this environment will be better positioned in 2030. She is right that the pipeline problem is real. What she has not yet answered &#8212; what no major company has publicly answered with evidence &#8212; is whether the new developmental pathway they are building actually delivers Stage 2 and Stage 3. Whether the junior who spends a year doing AI oversight develops the systems intuition to translate &#8220;it stops working sometimes&#8221; to root cause. Whether they get to the point of staking their name on an architectural judgment call, being wrong about something, and learning from the consequence.</p><p>The ladder looks different. 
Whether it goes to the same place, and whether the companies building it have designed the rungs deliberately enough to find out, we do not yet know.</p><div><hr></div><p><strong>Tags:</strong> junior developer pipeline AI, failure signal model developer expertise, IBM entry-level roles 2026, Kosmyna cognitive debt LLM, Russinovich Hanselman preceptorship ACM</p>]]></content:encoded></item><item><title><![CDATA[The Robot Tutor and the Fishing Village]]></title><description><![CDATA[What "Personalization" Has Always Meant, and What Adaptive Learning Has Always Delivered]]></description><link>https://www.skepticism.ai/p/the-robot-tutor-and-the-fishing-village</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-robot-tutor-and-the-fishing-village</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Fri, 24 Apr 2026 03:20:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JT8Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JT8Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JT8Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 424w, 
https://substackcdn.com/image/fetch/$s_!JT8Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!JT8Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!JT8Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JT8Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1635623,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/194873207?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!JT8Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!JT8Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!JT8Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!JT8Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3817a05d-7e7f-4fc7-b7c8-97b0926accd6_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>The girl in the Cambodian fishing village was never real.</p><p>She was an argument. Between 2013 and 2015, Jos&#233; Ferreira, founder of Knewton, invoked her in promotional materials and public statements to describe what his technology could do: a girl in a fishing village, receiving through Knewton&#8217;s adaptive engine the same personalized instruction as a student at an elite private school, growing up to invent the cure for ovarian cancer. Educational inequality, in Ferreira&#8217;s framing, was a problem that adaptive learning could address at the software layer. The instruction would be what unlocked the capacity. The fishing village was a rhetorical device, not a pilot deployment.</p><p>By 2019, Knewton had been acquired by John Wiley &amp; Sons for a sum understood to be a small fraction of its peak valuation. The partnership with Pearson had dissolved. The product that remained &#8212; Knewton Alta, a conventional higher-education courseware platform &#8212; bore little resemblance to the robot tutor in the sky. The fishing village was still waiting.</p><p>I want to examine what happened. Not Knewton specifically, and not Ferreira personally &#8212; he was the most articulate spokesman for a framing the whole industry was using, not its author. 
What I want to examine is the word that Ferreira&#8217;s framing deployed, the word that was doing the most rhetorical work in every version of that framing, the word that has survived the collapse of its first generation of spokescompanies and is still doing the same work today.</p><p><em>Personalization.</em></p><div><hr></div><h2>What the Word Invokes</h2><p>The word has a history in educational psychology that predates by decades any commercial deployment of adaptive software. Lev Vygotsky&#8217;s zone of proximal development is about personalization &#8212; the idea that effective instruction operates in the specific zone between what a learner can do independently and what they can do with support, a zone that is different for every learner and that requires a teacher&#8217;s specific attention to identify. Lee Cronbach and Richard Snow&#8217;s work on aptitude-treatment interactions spent two decades trying to formalize the finding that different learners respond differently to different instructional approaches &#8212; that no single method is optimal for everyone, and that the optimal method for a given learner depends on who that learner is. The differentiated-instruction tradition in teacher education has argued for thirty years that good teaching requires knowing students individually, designing instruction around their specific needs, and adjusting in real time to what each student brings and what each student shows.</p><p>The construct is real. It has serious empirical and theoretical grounding. When Ferreira said Knewton was personalizing learning, he was invoking this history &#8212; pointing at a tradition that educational psychology had spent decades documenting and that every good teacher knows, in the bone, as what it means to actually teach rather than to deliver content.</p><p>What Knewton&#8217;s technology operationalized was different.</p><p>Knewton&#8217;s engine was built on two well-established statistical techniques. 
The first was Item Response Theory, the mathematical framework underlying modern standardized testing, which models the probability of a correct response as a function of a student&#8217;s latent ability and an item&#8217;s difficulty. The second was Bayesian Knowledge Tracing, which estimates whether a student has mastered a specific discrete skill by updating probability estimates as the student responds to items. Together, these gave Knewton a learner model: a collection of probability distributions over latent abilities and specific skill masteries, updated continuously as the student interacted with the system.</p><p>This is real technology. It is not trivial to build. The engineers who built it did substantive mathematical work. Knewton&#8217;s claim that its engine operated on sophisticated foundations was true. What was not quite true was the claim about what those foundations amounted to.</p><p>The learner model Knewton maintained was expressible, in its technical form, as: <em>the probability this student has mastered skill A is 0.78; the probability this student has mastered skill B is 0.34; the student&#8217;s estimated ability on dimension X is 1.2 standard deviations above the population mean.</em> This is useful information for deciding what to present next. It is not a model of the student as a person. It is not a model of their interests, their emotional state, their cognitive style, their cultural background, their creative capacity, their relationship to learning. 
It is a model of item-response patterns on a bank of pre-authored content.</p><p>The gap between <em>we know this student better than their parents</em> and <em>our model assigns probabilities to their mastery of skills we&#8217;ve tagged to a knowledge graph</em> is the central artifact of the adaptive-learning era.</p><div><hr></div><h2>The Fishing Village Made Specific</h2><p>The girl in the Cambodian fishing village makes the gap visible because the specific nature of what was claimed and what was possible becomes clear once you name each requirement.</p><p>For the girl to receive, through Knewton&#8217;s engine, instruction equivalent to an elite private-school education, the technology would need, first, content: a comprehensive curriculum in mathematics, science, language, and humanities, built by human curriculum developers, available in a language she could read, calibrated for her cultural and linguistic context. Knewton licensed pre-authored material from publishers. The content was what the publishers had built and the partnerships had arranged. The engine sequenced content that already existed. Building the content was not what the engine did.</p><p>The technology would need, second, an outcome measure capable of telling whether the instruction was producing the kind of understanding that leads to cancer research &#8212; conceptual depth, transfer across domains, creative problem-solving, the tacit skills that accumulate over years of serious engagement with scientific thinking. Knewton&#8217;s engine could measure item-level response patterns on pre-authored assessments. Whether those patterns indexed what a future researcher would need was not addressed. 
The engine was not designed to measure the construct the rhetoric invoked.</p><p>The technology would need, third, to function in conditions of intermittent electricity, unreliable internet, shared devices, limited home support, a language and cultural context for which the content was probably not designed. Knewton was built for contexts with substantially more infrastructure. The rhetoric invoked the fishing village as a demonstration of reach. The technology had not been deployed there or validated there.</p><p>The claim was aspirational. The <em>could</em> was doing substantial work. What was true was that the technology could hypothetically produce this outcome if a great many other things were also true, none of which were Knewton&#8217;s responsibility or within Knewton&#8217;s control. The fishing village was a vision of what the future might look like if a great many problems that have nothing to do with adaptive sequencing algorithms were solved. It was not a description of what Knewton could actually deliver.</p><div><hr></div><h2>Three Systems, One Pattern</h2><p>The pattern the Knewton arc illustrates is not Knewton-specific. It appears, in different configurations, across every major adaptive-learning platform that followed.</p><p>DreamBox Learning, focused on K-8 mathematics and backed by the strongest external evidence base in the category, has been evaluated by the Harvard Center for Education Policy Research in multiple studies. The evaluations used standardized mathematics assessments over school-year timescales and were conducted by researchers with no affiliation to the company. The findings: effect sizes in the range of 0.10 to 0.15 standard deviations for students using the platform at recommended levels. Real effects. Detectable by rigorous researchers using independent measures. Considerably more modest than the marketing implied. 
And dependent, in every evaluation, on implementation &#8212; on how much classroom time schools actually allocated to the platform. The adaptive sophistication of the software did not substitute for the hours it required.</p><p>i-Ready, among the most widely deployed adaptive platforms in American K-12 education, integrates adaptive diagnostic assessment with what the company calls &#8220;Personalized Instruction&#8221; &#8212; a sequence of pre-authored lessons targeted at the student&#8217;s estimated level. Critics have noted that the personalization, operationally, consists of placing students at different starting points in a common instructional sequence. Students are still completing pre-authored lessons. They are starting at different points and progressing at different speeds. Whether this is <em>personalization</em> in the sense the word implies &#8212; instruction responsive to who the student is &#8212; or more honestly <em>adaptive placement within a fixed curriculum</em>, is exactly the question the word is being deployed to avoid asking.</p><p>ALEKS, built on Knowledge Space Theory, represents the most theoretically rigorous operationalization in the category. Rather than treating ability as a single number, Knowledge Space Theory maps a domain as a set of discrete items and a learner&#8217;s knowledge state as the specific subset of items they have mastered. ALEKS uses an AI engine to efficiently navigate the combinatorial space of possible knowledge states, asking questions that narrow its estimate of where the student is. The resulting ALEKS Pie &#8212; a visual display of what has been mastered, what has not, what is ready to learn &#8212; is grounded in serious mathematics, specified precisely, falsifiable in principle. It has been evaluated in multiple contexts. 
Effect sizes fall in the same general range as DreamBox and i-Ready.</p><p>What is clarifying about ALEKS is this: even the most theoretically careful operationalization of personalization &#8212; one drawing on decades of rigorous mathematical work &#8212; models a student&#8217;s mastery state over a defined domain of discrete items. It does not model the student&#8217;s interests, their emotional state, their cognitive style, their cultural background, their creative capacity, their relationships. ALEKS is honest about this. The documentation says clearly that the system models knowledge states over specific domains. But even ALEKS demonstrates that the gap between the marketing construct and the technical operationalization is not a failure of specific companies. It is a feature of what item-level response tracking can and cannot do.</p><div><hr></div><h2>The Gap and Its Consequences</h2><p>The word <em>personalization</em> is doing specific rhetorical work. It invokes a construct that educational psychology spent decades building &#8212; instruction responsive to the individual learner in the deep sense that Vygotsky pointed at, that good teachers practice, that Cronbach and Snow tried to formalize. The construct is real. The technology operationalizes something narrower: item-level response tracking, probability distributions over mastery parameters, next-item selection from pre-authored content banks, pacing adjustments based on observed response patterns. This is what the data these systems collect and the algorithms they run can actually support. It is not trivial. It is not the same thing as the construct the word invokes.</p><p>Three consequences follow.</p><p>Critiques of adaptive learning for failing to deliver what the marketing promised are both fair and partially misdirected. Fair because the systems cannot deliver what the rich construct invokes. 
Misdirected because assigning this to specific companies treats a structural feature of item-level tracking as a product failure. The rhetoric over-promised. The technology delivered what the technology could deliver.</p><p>Evaluations of these systems on outcome measures aligned to the item-level tracking are measuring the operationalization, not the construct. They find modest positive effects, which is the honest finding. Whether the same systems produce transfer to novel problems, durable learning over years, growth in dimensions that do not map to any test-bank item &#8212; these questions remain mostly unanswered, because answering them would require outcome measures that do not yet exist in the forms evaluators would need.</p><p>And the pattern persists. The vocabulary has survived the collapse of Knewton and its generation. When current AI-tutor companies claim to provide personalized tutoring, to adapt to each learner&#8217;s needs, to meet students where they are, the claim is doing the same rhetorical work Knewton&#8217;s robot tutor in the sky was doing: invoking the rich construct while operationalizing a narrower version. The gap remains where it was.</p><div><hr></div><h2>What to Ask</h2><p>When you next encounter an educational-technology claim that uses the word <em>personalization</em>, or variants like <em>individualized</em> or <em>adaptive</em> or <em>tailored to the learner</em> or <em>meets each student where they are</em>, two questions will orient you.</p><p>What, specifically, is the technical operation? The honest answer for the large majority of systems using this vocabulary is one of a small family: item-level response tracking with adaptive item selection; diagnostic assessment followed by placement in a pre-authored sequence; pacing adjustments based on response patterns; content recommendation from a pre-authored bank based on inferred mastery. 
If you can name which operation is happening, you have the beginning of an honest account of what the system does. The vocabulary may suggest more. The technical substrate does not support more.</p><p>Does the claim invite the listener to believe the system does something the operation does not do? The answer is often yes, specifically in the dimensions educators and parents most hope for. Operationalized personalization &#8212; item selection based on mastery estimates &#8212; can contribute to instruction responsive to the individual learner, in contexts where it is embedded in the harder relational and responsive work that teachers do. It cannot replace that work. When a product is marketed as though algorithmic item selection substitutes for a teacher&#8217;s specific attention to a specific child, the marketing is doing rhetorical work the technology does not underwrite.</p><p>The fishing village is still waiting. The girl who will invent the cure for ovarian cancer has not yet received the education the rhetoric promised. This is not primarily Ferreira&#8217;s fault, or Knewton&#8217;s, or any single company&#8217;s. It is the consequence of a gap that was always structural &#8212; between what a word can invoke and what a technical operation can deliver &#8212; that the field has chosen, for a decade and more, not to name.</p><p>Naming it is the prerequisite to closing it.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). This essay appears as part of the Computational Skepticism series at <a href="https://skepticism.ai">skepticism.ai</a>. 
| <a href="https://theorist.ai">theorist.ai</a></em></p><div><hr></div><p><strong>Tags:</strong> adaptive learning personalization gap, Knewton IRT Bayesian knowledge tracing operationalization, DreamBox i-Ready ALEKS efficacy evaluation, personalized learning construct versus operation, EdTech rhetoric fishing village critique</p>]]></content:encoded></item><item><title><![CDATA[The Assessment Was Already Broken]]></title><description><![CDATA[On Jessica Winter's "What Will It Take to Get A.I. Out of Schools?" and what the panic about AI reveals about everything that came before it]]></description><link>https://www.skepticism.ai/p/the-assessment-was-already-broken</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-assessment-was-already-broken</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Fri, 24 Apr 2026 00:37:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!l9KP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l9KP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l9KP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 424w, 
https://substackcdn.com/image/fetch/$s_!l9KP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 848w, https://substackcdn.com/image/fetch/$s_!l9KP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!l9KP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l9KP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2081000,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/195299281?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!l9KP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 424w, https://substackcdn.com/image/fetch/$s_!l9KP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 848w, https://substackcdn.com/image/fetch/$s_!l9KP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 1272w, https://substackcdn.com/image/fetch/$s_!l9KP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2689ec55-93a2-4ccb-b9e0-c0ecbdcd191e_3018x1082.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>A response to Jessica Winter's<strong><a href="https://www.newyorker.com/culture/progress-report/what-will-it-take-to-get-ai-out-of-schools"> "What Will It Take to Get A.I. Out of Schools?"</a></strong></p><p>There is a moment in Jessica Winter&#8217;s New Yorker piece that contains the entire argument she doesn&#8217;t make. Her sixth-grade daughter runs a fifth-grade slide show through Gemini&#8217;s beautifying tools. In thirty seconds, the typography improves, the pictures reshuffle symmetrically, the design evokes fifteenth-century movable type against a background of aged vellum. Winter describes it as the pool race from <em>Mommie Dearest</em>: the larger, faster thing that will always beat you.</p><p>Her daughter is unmoved. &#8220;I like mine better, because it&#8217;s original and I worked really hard on it.&#8221;</p><p>Hold that sentence. It is the right answer. It is also the answer that does not appear on any rubric in any public school in Massachusetts or New York or Los Angeles. The rubric rewards the prettier slide. The rubric was always going to reward the prettier slide. Winter wants her daughter to hold values that the institution has never rewarded, and she writes a five-thousand-word piece about artificial intelligence without once asking why the institution doesn&#8217;t reward them.</p><p>This is the intellectual hole at the center of a piece that is otherwise sharp, well-reported, and morally earnest. AI didn&#8217;t break the assessment system. 
It exposed that the assessment system was already broken, and everyone was pretending otherwise.</p><div><hr></div><h2>What the Slide Show Already Was</h2><p>The printing-press slide show existed before Gemini. It was made in fifth grade to demonstrate learning. Whether it demonstrated learning was always a question nobody asked, because asking it would require admitting that the artifact &#8212; the thing handed in, the thing graded &#8212; was never reliable evidence of the process. The slide show could have been made with a parent&#8217;s help, with a template, with a slightly older sibling, with a capable friend who understood visual design. These interventions existed before large language models. They produced polished artifacts that the teacher accepted as evidence of understanding.</p><p>The educational research on this predates AI by decades. Robert Bjork&#8217;s distinction between performance and learning &#8212; the observable output versus the durable cognitive change &#8212; is from 1992. The problem of using artifacts as proxies for thinking is at least as old as Vygotsky. What AI did was not create this problem. It made the problem so visible, so fast, so cheap, that willful ignorance became impossible.</p><p>Winter quotes USC professor Mary Helen Immordino-Yang: &#8220;We are cutting off learning at the knees.&#8221; She quotes University of Toronto psychologist Amy Finn on the magic of how children retain unexpected, non-strategic details that adults would find irrelevant, a kind of creative unpredictability fundamentally misaligned with LLMs&#8217; orientation toward speed and sleekness. These are real insights. They are also insights that apply equally to the printing-press slide show assigned as homework, graded for visual appeal and accuracy, returned in two days, and forgotten. 
The neuropsychological substrate for creating narratives and thinking through arguments over time is not developed by making a slide show under time pressure at home with no adult monitoring the process.</p><p>The question is not whether AI belongs in schools. The question &#8212; the one the piece never asks &#8212; is whether the assessment was measuring what it was supposed to measure before AI arrived. The answer is: sometimes, unevenly, and less than we told ourselves.</p><div><hr></div><h2>The Tool Hierarchy Problem</h2><p>Winter&#8217;s implicit argument, followed consistently, condemns more than Gemini. Calculators offload arithmetic before numeracy is built. Spell-check offloads orthography. Grammarly offloads syntax judgment. Google Search offloads memory and source evaluation. Slide templates offload visual design judgment. Word processors themselves offload handwriting, and Winter notes approvingly that handwriting has developmental benefits &#8212; which means she believes at least one tool was introduced too early.</p><p>She draws the line at the tool that frightens her right now. This is a very human response and a terrible policy foundation.</p><p>The honest version of her argument looks like a developmental sequence: here are the cognitive substrates that must be built before each category of tool is introduced, and here is the evidence for that ordering. Immordino-Yang and Finn gesture at this &#8212; the &#8220;cognitive muscles&#8221; framing, the concern about atrophy before onloading &#8212; but nobody builds it out into something a school board could actually implement. Without that framework, the anti-AI position reduces to: tools I grew up with are fine, tools that postdate my childhood are suspect.</p><p>Amanda Bickerstaff, CEO of AI for Education, comes closest to the principled version: children under age ten should not be using chatbots, she says, because these tools require expertise and evaluation skills that even many adults don&#8217;t have. 
That&#8217;s a threshold with a rationale. It&#8217;s also the only threshold in the piece with a rationale. Everything else is rhetoric standing in for policy.</p><div><hr></div><h2>The Research That Isn&#8217;t Quite Research</h2><p>The piece anchors much of its scientific authority in three studies. The 2025 MIT warning that LLMs &#8220;may inadvertently contribute to cognitive atrophy&#8221; &#8212; the authors felt it necessary to append an FAQ asking journalists not to use words like &#8220;brain rot&#8221; or &#8220;brain damage,&#8221; which tells you something about how the finding was being reported before Winter&#8217;s piece and how it will be reported after. The multi-institution study (MIT, CMU, UCLA, Oxford) on fraction-solving, which showed that students who lost AI access after using it performed significantly worse &#8212; not yet peer-reviewed and not yet published, though the findings are concerning and the concern is real. The Brookings &#8220;premortem,&#8221; which pairs 400 studies with hundreds of interviews to conclude that AI tools &#8220;undermine children&#8217;s foundational development.&#8221;</p><p>These are worth taking seriously. They are also worth examining carefully.</p><p>The fraction-solving study is the most empirically specific, and it is also the most useful argument against Winter&#8217;s piece rather than for it. The students who used LLMs on fraction-solving and then lost access performed significantly worse and were more likely to give up. The proposed mechanism: AI gives answers, students become dependent on the answer-giving, and once the answers are removed, the capacity to generate them independently has atrophied.</p><p>But this is an argument about a specific implementation &#8212; an answer machine &#8212; not about the technology class. 
An LLM configured as a Socratic interlocutor, one that refuses to answer directly and instead returns questions that scaffold toward understanding, that detects when a student is stuck versus when they&#8217;re avoiding, that withholds confirmation until the student demonstrates the reasoning &#8212; that tool would presumably produce the opposite result. Students would have developed the reasoning process rather than outsourcing it, because outsourcing was never made available to them.</p><p>This is not an exotic capability. It is prompt engineering plus scaffolding logic. The reason it isn&#8217;t what&#8217;s being deployed in K-12 classrooms is that Google ships Gemini with a &#8220;Help me write&#8221; button because that&#8217;s the path of least resistance and maximum engagement. That is a product decision, not a technological inevitability. Winter never distinguishes between AI as answer machine and AI as thinking partner. The cognitive offloading critique collapses the moment you make that distinction, because the problem isn&#8217;t the tool &#8212; it&#8217;s the incentive structure of the company deploying it.</p><p>The social-emotional hijacking argument from UNC psychologist Mitch Prinstein is the weakest scientific claim in the piece, and it&#8217;s presented with the same credentialed authority as the others. Surging oxytocin and dopamine receptors around ages ten to eleven do drive peer-bonding &#8212; that&#8217;s established developmental neuroscience. Sycophantic LLMs &#8220;hijack the biological tendency to want peer feedback&#8221; &#8212; that&#8217;s a hypothesis, not a finding. The claim requires that chatbot interaction activates the same neurological pathways as peer interaction, that substituting chatbot interaction for peer interaction produces measurable deficits in social skill development, and that the effect is &#8220;hijacking&#8221; &#8212; a strong, directional, causal claim &#8212; rather than displacement or preference shift. 
No study is cited because none exists at the necessary scale with the necessary longitudinal follow-up.</p><p>This is speculation dressed in neuroscience&#8217;s authority. Which is particularly ironic given that Winter is writing a piece about tools that generate confident-sounding output without rigorous foundations.</p><div><hr></div><h2>The Grade Your Daughter Is Going to Receive</h2><p>Return to the slide show.</p><p>Winter&#8217;s daughter likes hers better because it&#8217;s original and she worked really hard on it. This is the right value. This is the value Winter wants the school to transmit. The school is not transmitting it, because the school is not grading for it.</p><p>If the rubric rewards polish, visual appeal, and impressive output &#8212; which most rubrics do, implicitly, because these are the things teachers can assess quickly across thirty slide shows at 11pm &#8212; then the student who uses Gemini gets the A. Not abstractly. On the transcript. The student who refuses Gemini, who holds Winter&#8217;s daughter&#8217;s values, receives the C. Neither of them learns the lesson Winter intends.</p><p>The deeper problem: homework was already a weak pedagogical instrument before AI. Most research on homework in K-8 is lukewarm. It was largely accountability theater &#8212; proof that learning happened, easy to grade, easy to assign, poor evidence of the process it was supposed to represent. AI exposed the theater. The theater was playing for years before AI bought a ticket.</p><p>What would it look like to actually assess the process? That question is harder than &#8220;what do we do about Gemini,&#8221; and it requires admitting that the current system was already failing to measure what it claimed to measure. 
Winter doesn&#8217;t want to ask that question, because asking it would mean the problem is older and deeper than the creepy neighbor who moved in recently.</p><div><hr></div><h2>What Actually Needs to Change</h2><p>The resistance movements Winter profiles &#8212; District 14 Families for Human Learning, the Coalition for an AI Moratorium, Schools Beyond Screens &#8212; are better at stopping things than proposing them. The Student Tech Bill of Rights includes the right to read whole books, write on paper, and learn in a low-stimulation environment free from undue corporate influence. These are reasonable demands. They don&#8217;t add up to a pedagogy.</p><p>The conflict-of-interest thread is the piece&#8217;s most structurally damning detail and the most underplayed. The NYC DOE official overseeing the preliminary AI guidelines holds a fellowship jointly offered by Google and GSV Ventures &#8212; whose portfolio includes Amira and MagicSchool, two of the primary AI tools being deployed in the classrooms those guidelines govern. Other Google-GSV fellowship recipients include top school officials in Berkeley, Dallas, Los Angeles, Newark, Colorado, and Maryland. &#8220;If you ask tobacco companies to help write your school&#8217;s policy on cigarettes,&#8221; one parent says, &#8220;you&#8217;re going to end up with guidance on how to smoke responsibly in school.&#8221;</p><p>This is the argument Winter should have built the piece around. Not &#8220;AI is cognitively harmful&#8221; &#8212; which is partly true, partly speculation, and entirely dependent on implementation &#8212; but &#8220;the people writing the rules are being paid by the companies they&#8217;re supposed to regulate.&#8221; That is verifiable, structural, and not dependent on a not-yet-peer-reviewed study about fractions.</p><p>The piece ends with Sinha&#8217;s question &#8212; &#8220;What do you want from this?&#8221; &#8212; and Winter&#8217;s answer: nothing. It&#8217;s a parent&#8217;s answer. 
A good parent&#8217;s answer. But it is not a policy answer, and it is not an answer that acknowledges what was already not working before the neighbor moved in.</p><p>The assessment was already broken. The rubric was already rewarding the wrong things. The slide show was already a poor proxy for thinking. AI made all of this impossible to ignore. That is a service, not a crime &#8212; even if the service was rendered by someone with cloven hooves in Yeezy Boosts and a market cap of four trillion dollars.</p><p>What we owe children is not the tools of the past but a clear account of what learning actually is, what evidence of it looks like, and how to build assessments that can tell the difference. That conversation is harder than banning Gemini. It is also the only conversation that addresses what Gemini exposed.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI. His work on AI in education, including the Genuine Learning Protocol framework, is published at bearbrown.co.</em></p><div><hr></div><p><strong>Tags:</strong> AI education New Yorker critique, cognitive offloading assessment design, Bjork learning performance distinction, AI schools policy Jessica Winter, GLP genuine learning protocol</p>]]></content:encoded></item><item><title><![CDATA[The Gap Between What We Measure and What We Name]]></title><description><![CDATA[On the Structural Problem That Forty Years of EdTech Efficacy Research Has Not Solved]]></description><link>https://www.skepticism.ai/p/the-gap-between-what-we-measure-and</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-gap-between-what-we-measure-and</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Thu, 23 Apr 2026 00:38:49 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!ZxKu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZxKu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZxKu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!ZxKu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!ZxKu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!ZxKu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZxKu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png" width="1456" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1453300,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/194861665?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZxKu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!ZxKu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!ZxKu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!ZxKu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf68abee-cf05-4605-a9fc-933442d405bf_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Consider two findings, forty years apart.</p><p>In 1984, Benjamin Bloom published a seventeen-page paper reporting that students tutored one-on-one under mastery-learning conditions performed approximately two standard deviations above students taught in conventional classrooms. The finding has been cited tens of thousands of times. It has become, across four decades, the single most-invoked benchmark in educational technology. Whenever a new system claims to approach the effectiveness of human one-on-one instruction, it is Bloom&#8217;s 2-sigma it is claiming to approach.</p><p>In 2024, a research team at Harvard led by Gregory Kestin reported that an AI tutor, deployed in introductory physics, produced learning gains larger than active-learning classroom instruction. 
The effect size exceeded what prior literature had typically reported for any tutoring intervention, including Bloom&#8217;s. The study was methodologically careful. The finding circulated quickly. Within weeks it was being cited as evidence that current-generation AI tutors meaningfully exceed what good conventional instruction can deliver.</p><p>Forty years apart. Different technologies. Different research traditions. And yet, read carefully, the two findings share a structure.</p><p>In each, a specific measurement &#8212; performance on items aligned to the intervention&#8217;s content, assessed at short timescale, against a conventional-instruction baseline &#8212; is offered as evidence for a construct about which the measurement is not, strictly, a measurement. Bloom&#8217;s 2-sigma is evidence about performance on aligned items under particular tutoring conditions in the mid-1980s. It is <em>cited</em> as evidence about the effectiveness of tutoring as an instructional mode. Kestin&#8217;s physics finding is evidence about short-timescale aligned-item performance in a selective undergraduate population. It is <em>cited</em> as evidence that AI tutoring outperforms human instruction in some general sense the measurement does not index.</p><p>The measurements are not false. The findings are not inflated. In each case, the researchers reported carefully what they measured. The question is what happens between the measurement and its citation &#8212; the small, structural, and repeated gap between what the apparatus indexes and what the vocabulary surrounding the apparatus claims.</p><div><hr></div><h2>The Structure of the Problem</h2><p>Name the structure directly.</p><p>An efficacy claim in this field consists of three things: a measurement, a construct, and an asserted relationship between them. The measurement is what researchers actually did &#8212; items administered, scores computed, conditions compared. 
The construct is what the measurement is meant to be evidence for &#8212; <em>learning</em>, <em>mastery</em>, <em>effectiveness</em>, <em>personalization</em>, <em>engagement</em>. The asserted relationship is the claim that the measurement indexes the construct adequately to license the uses the finding is put to.</p><p>This structure appears in every empirical field. Biology works this way, and so does nutrition research, and so does clinical psychology. The gap between measurement and construct is not a problem specific to educational technology. It is a feature of empirical inquiry. Measurements never exhaustively capture their constructs. The question for any field is how seriously it takes the gap, how much work it does to establish the measurement-construct relationship, and how much it assumes versus demonstrates.</p><p>The observation this book has been building toward, essai by essai, is that the learning-systems field has, across six decades, taken the gap less seriously than its claims require. The measurement-construct relationships it invokes are almost universally assumed rather than demonstrated. The field&#8217;s vocabulary outruns what its evidence apparatus can support, and the gap persists not because it has gone unnoticed &#8212; it has been noticed, repeatedly, by careful researchers across multiple traditions &#8212; but because the apparatus that persists serves specific production conditions, and a more adequate apparatus would serve them less well.</p><p>The structure is not: <em>the field is wrong about what works.</em> The structure is: <em>the field makes claims about effectiveness that its measurements are not positioned to support, and does so systematically.</em> These are importantly different claims. The first is about facts. 
The second is about apparatus &#8212; about the specific set of measurement practices, citation habits, and research conventions that together produce what the field calls its evidence base.</p><p>The distinction matters because the remedy differs. If the field were making factual errors, the remedy would be better studies of the same interventions. If the apparatus is producing a systematic gap between measurement and claim, the remedy is different apparatus. This book has not argued for either remedy. It has argued, by the accumulated force of twelve close readings, that the second diagnosis is correct.</p><div><hr></div><h2>What the Vocabulary Actually Invokes</h2><p>Open a textbook in educational psychology. Open a learning-sciences journal. Open the marketing copy for any major adaptive-learning platform. Open the abstract of any recent AI-tutor efficacy study. The vocabulary is remarkably consistent. The field claims to be producing evidence about <em>learning</em>. About <em>understanding</em>. About <em>mastery</em>. About <em>effectiveness</em>. About <em>personalization</em> and <em>engagement</em>. Each of these words points toward a construct. Each construct has, in serious research traditions, substantial theoretical and empirical articulation.</p><p>Consider <em>learning</em>. In Robert Bjork&#8217;s decades of experimental work, learning is not a single construct but a distinction between two separable things: storage strength and retrieval strength. Storage strength refers to how well a representation is encoded. Retrieval strength refers to how accessible it is at the moment of test. A student can have high retrieval strength at the end of a unit &#8212; they perform well on the post-test &#8212; without high storage strength. Weeks later, the retrieval strength decays, and the post-test performance turns out to have been measuring the wrong thing. 
Conditions that maximize immediate performance &#8212; massed practice, aligned testing, minimal difficulty &#8212; often actively impair long-term storage. This is the central insight of what Bjork calls desirable difficulties.</p><p>A learning claim grounded in Bjork&#8217;s construct requires evidence of storage strength, not just retrieval strength &#8212; which requires measuring performance after a delay, in new contexts, on items not identical to training. The methodology exists. It has existed since the early 1990s. It is the basis of essentially every recommendation in <em>Make It Stick</em> and in the broader spaced-practice and retrieval-practice literature that has accumulated since.</p><p>Now consider how <em>learning</em> is typically operationalized in EdTech efficacy research. The outcome measure is a post-test administered at the end of the instructional unit. The items are aligned with the instructional content. The interval between instruction and test is hours to days. The retrieval context is the same or similar to the learning context. What this operationalization measures is retrieval strength at short delay. What Bjork&#8217;s construct requires is storage strength at longer delay under different retrieval conditions. These are not the same thing.</p><p>The gap between the two is not subtle. It is structural. And it is present in nearly every efficacy claim this book has examined.</p><p>Consider <em>understanding</em>. Jean Lave, Etienne Wenger, John Dewey, and the situated-cognition tradition spent decades articulating understanding as something different from performance on items. Understanding involves the capacity to apply knowledge in contexts that differ from the contexts of acquisition. It involves participation in practices &#8212; knowing how to use what one knows in the world where it applies. 
Transfer testing &#8212; the capacity to apply learning to problems that differ meaningfully from training &#8212; is the minimum methodological requirement for a claim about understanding. Transfer testing has been advocated for in educational research since Thorndike&#8217;s early twentieth-century work. It remains exceptional in EdTech efficacy research.</p><p>Consider <em>mastery</em>. Bloom&#8217;s own construct, as articulated in his mastery-learning work, involves structural reorganization of knowledge &#8212; the kind of reorganization that allows a learner to solve problems the instruction did not specifically address. Bloom&#8217;s 2-sigma finding emerged from studies that implemented criterion-referenced assessment, formative assessment with corrective feedback, demonstrated performance across multiple item types. The 2-sigma number is cited routinely as a benchmark for tutoring effectiveness. Bloom&#8217;s construct of mastery, including its methodological requirements, is cited far less often.</p><p>Consider <em>personalization</em>, as examined in the eighth essai. The term invokes a construct rooted in Vygotskian zone-of-proximal-development work and the aptitude-treatment interaction literature &#8212; instruction responsive to who the individual learner actually is. What adaptive-learning systems operationalize is item sequencing and pacing based on item-level response patterns. These are not the same construct.</p><p>Consider <em>engagement</em>. The construct, as articulated in the psychological literature, involves attention, motivation, affect, persistence in the face of difficulty, meaningful cognitive investment. What AI-tutor efficacy research typically measures is time on task, session counts, and completion rates. Kristen DiCerbo of Khan Academy observed in April 2026 that when students engaged with Khanmigo, they were typing &#8220;IDK IDK&#8221; &#8212; <em>I don&#8217;t know, I don&#8217;t know</em> &#8212; and moving on. 
The platform counted them as engaged. They were not engaged in any cognitively meaningful sense.</p><p>Each of these constructs has serious theoretical articulation in one or more research traditions. Each is routinely invoked by the field&#8217;s claim-making vocabulary. Each is routinely operationalized as aligned-item performance at short timescale. The gap between the construct and the operationalization is what the apparatus produces. And taken across the field, it is the difference between the learning the vocabulary claims and the performance the measurements index.</p><div><hr></div><h2>What the Field Has Tried</h2><p>It would be inaccurate to say the field has not tried to close this gap. It has tried, across multiple traditions, for decades. That these attempts have not produced a different default apparatus is itself instructive.</p><p><em>How People Learn</em>, the 1999 National Academies synthesis by Bransford, Brown, and Cocking, made transfer testing a central methodological theme. The implication was straightforward: efficacy research should include transfer measures if it wants to make claims about learning rather than claims about trained performance. Two and a half decades later, transfer testing remains exceptional.</p><p>Samuel Messick&#8217;s theory of validity, codified in his 1989 chapter in <em>Educational Measurement</em>, specified that a test score&#8217;s interpretation requires examination of construct-relevant versus construct-irrelevant variance, construct underrepresentation, and the consequences of the test&#8217;s use. Applied rigorously, Messick&#8217;s framework would require EdTech efficacy research to examine what its outcome measures actually index rather than assuming that performance-on-aligned-items equals evidence-of-learning. The framework has been the theoretical standard in measurement theory for over thirty years. 
Its rigorous application in educational technology efficacy has been partial at best.</p><p>Jean Lave&#8217;s situated-cognition tradition articulated assessment that requires observation of practice rather than administration of tests. It has had essentially no impact on deployed-product efficacy research.</p><p>Each of these traditions has existed for decades. Each has produced methodology that could be adopted. Each remains exceptional rather than routine. The alternatives have not been hidden. They have been taught in graduate programs, cited in methods sections, present in the same journals that published the aligned-outcome studies.</p><p>The question is why they have not taken.</p><div><hr></div><h2>Why the Apparatus Persists</h2><p>The apparatus persists because it serves the specific production conditions of the field in which it operates.</p><p>Consider what a researcher needs to do research in this field. Funding, on grant cycles of two to five years. Publications, through peer-reviewed journals with specific conventions. Access to populations &#8212; schools, classrooms, platforms &#8212; through institutional partnerships with their own timelines and constraints. Findings that other researchers can cite.</p><p>Now consider what a more adequate apparatus would require. Transfer testing adds design complexity and reduces effect sizes. Durability testing extends the study timeline past the typical grant cycle. Multi-paradigm convergence requires methodological range that most research programs do not possess. Pre-registration of analytic plans constrains the exploratory moves that often produce publishable findings.</p><p>Each of these, if adopted as a default, would reduce the rate at which researchers produce citable positive findings. Not because the interventions do not work &#8212; some of them do &#8212; but because the findings that survive the more demanding methodology would be smaller, noisier, and less rhetorically useful. 
A researcher who adopts the more demanding methodology competes with researchers who do not. The less-demanding researcher&#8217;s findings will be larger, cleaner, and more citable. Grant agencies, tenure committees, and publication venues all reward findings of that kind.</p><p>The same pressures operate on the institutions that surround the research. Product vendors have commercial reasons to prefer methodologies that produce larger numbers. Policy bodies have political reasons to prefer evidence that looks clean. Philanthropists want defensible findings, and clean findings are easier to defend than nuanced ones. Journal editors respond to what their referees will accept, and what referees will accept is shaped by the conventions the field has institutionalized.</p><p>No individual in this system is behaving cynically. Researchers are doing their best work under the constraints of their funding. The apparatus is not what anyone chose. It is what the incentives produce when rational actors operate within them.</p><p>This is why advocacy for better methodology has not produced better methodology. The problem is not that researchers do not know better methodology exists &#8212; they do. The problem is that operating under the existing apparatus produces careers; operating against it produces, for most researchers, shorter and more difficult careers.</p><p>The apparatus persists because it is an equilibrium. Equilibria are stable not because the actors inside them are irrational but because they are responding rationally to incentives that no single actor created and no single actor can change. Changing an equilibrium of this kind requires changing the incentives across grant agencies, tenure systems, journal conventions, institutional practices, and funder expectations simultaneously. Such coordination is rare.</p><p>This is a structural observation, not a moral one. Researchers in this field are not broken. 
The evidence base is what the apparatus produces when careful, rigorous, well-meaning researchers operate under the conventions the apparatus enforces. Improving any individual researcher&#8217;s methods would not change what the field&#8217;s evidence base looks like, because the evidence base is the aggregate output of many careful researchers responding to shared incentives.</p><div><hr></div><p>That is what the apparatus was always supposed to produce.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). This essay appears as part of the Computational Skepticism series at <a href="https://skepticism.ai">skepticism.ai</a>. | <a href="https://theorist.ai">theorist.ai</a> | <a href="https://hypotheticalai.substack.com">hypotheticalai.substack.com</a></em></p><div><hr></div><p><strong>Tags:</strong> measurement construct validity EdTech efficacy, Bjork storage retrieval strength learning systems, transfer testing durability educational technology, apparatus equilibrium research incentives, Bloom Kestin aligned outcome measure gap</p>]]></content:encoded></item><item><title><![CDATA[The Comparison That Was Never Fair]]></title><description><![CDATA[What Intelligent Tutoring Systems Actually Measured, and What They Were Compared Against]]></description><link>https://www.skepticism.ai/p/the-comparison-that-was-never-fair</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-comparison-that-was-never-fair</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Tue, 21 Apr 2026 19:21:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!BLUQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BLUQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BLUQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!BLUQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!BLUQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!BLUQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BLUQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png" width="1456" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1626422,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/194834752?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BLUQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!BLUQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!BLUQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!BLUQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37bd9d3b-e6e4-4371-a664-178094eaa5c6_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In 2014, RAND published one of the most carefully designed evaluations of an educational technology system in the history of the field. John Pane, Beth Ann Griffin, Daniel McCaffrey, and Rita Karam ran a cluster-randomized controlled trial across 147 schools in seven states, assigning roughly 25,000 students either to use Cognitive Tutor Algebra I or to continue with whatever algebra instruction those schools had previously offered. The outcome measure was a standardized algebra proficiency exam. The design was, by the standards of a field that routinely tolerates thin evidence and motivated reporting, unusually rigorous.</p><p>The finding was specific. In the first year of implementation, Cognitive Tutor produced no statistically significant effect on algebra proficiency. 
In the second year, a significant positive effect emerged at high schools &#8212; approximately 0.20 standard deviations, sufficient to move a median student from the 50th to roughly the 58th percentile. At middle schools, the second-year effect was similar in magnitude but did not reach statistical significance.</p><p>Pane and colleagues called this an &#8220;implementation learning curve.&#8221; They were careful to note that the learning did not seem to happen at the level of individual teachers &#8212; students of teachers new to the system in year two performed similarly to students of experienced teachers. The learning happened at the level of schools: scheduling, infrastructure, coordination, institutional adjustment to a new instructional logic. The sites that figured out how to implement Cognitive Tutor took a year to figure it out, and then the system worked.</p><p>This is what a rigorous evaluation of an intelligent tutoring system looks like. The findings are real. The effects are modest. The implementation costs were substantial &#8212; approximately $97 per student per year for Cognitive Tutor against approximately $28 for the traditional textbook instruction it replaced. And in the field&#8217;s characteristic framing, this result was narrated as <em>disappointment</em>. Intelligent tutoring systems were supposed to approach human tutoring effectiveness. They had not.</p><p>I want to examine that disappointment. Not to redeem ITS, and not to dismiss the evaluation record. I want to examine what was being compared to what, and whether the comparison &#8212; the one that has driven ITS research, ITS funding, and now AI-tutor rhetoric for forty years &#8212; was ever structurally sound.</p><div><hr></div><h2>What the Tutor Actually Measured</h2><p>Cognitive Tutor was built to embody a specific theory of cognition. 
John Anderson&#8217;s ACT-R framework posits that skill acquisition is the conversion of declarative knowledge &#8212; facts, concepts &#8212; into procedural knowledge: production rules, condition-action pairs. To become skilled at algebra is to acquire a set of increasingly sophisticated rules for algebraic manipulation. Recognize that the goal is to isolate a variable and the coefficient is 4, and divide both sides by 4. The rule fires. The step is taken correctly.</p><p>The instructional design that follows from this is specific. If you can specify the production rules that constitute algebraic competence, you can build a system that monitors whether each rule is acquired. Cognitive Tutor did exactly this. As a student worked through a problem, the tutor compared each step against its internal model of valid solution paths. Correct step: proceed. Step matching a stored buggy production &#8212; a common misconception encoded in the system &#8212; respond with immediate feedback. Student requests help: deliver a graduated hint sequence targeting the specific production the student is struggling to fire.</p><p>Across many problems, the tutor maintained running Bayesian estimates of whether each production rule had been mastered. Students could not advance to new material until the estimates crossed a mastery threshold. This is model tracing and knowledge tracing: two technical operations that together constitute the system&#8217;s measurement apparatus. What the apparatus measures is step-level correctness, time per step, hint requests, error patterns, and estimated mastery of each production rule. These are not arbitrary choices. They are what ACT-R theory specifies as relevant to procedural skill acquisition. The design is internally consistent with the theory it was built on.</p><p>The 1995 paper in which Anderson, Corbett, Koedinger, and Pelletier published their decade of findings was titled <em>Cognitive Tutors: Lessons Learned</em>. 
The plural in the title is deliberate. The paper names what the system does not measure with the same specificity as what it does. Cognitive Tutor does not model affective state. It cannot detect whether a student is frustrated, bored, or emotionally disengaged from the material. It cannot identify conceptual confusion that lives above the production-rule grain &#8212; a student may fire productions correctly while failing to understand the domain they are operating in, and the tutor will not notice. It does not measure transfer, durability, or motivation. These are not oversights. They are structural features of a system designed for a specific theoretical purpose.</p><p>The researchers knew exactly what they had built. The disappointment that followed was, in part, not theirs.</p><div><hr></div><h2>What Human Tutors Actually Do</h2><p>The comparison that generated the disappointment is this: ITS produces effect sizes of roughly 0.20 to 0.40 sigma relative to classroom instruction. Expert human tutors produce effect sizes of roughly 0.40 to 0.80 sigma. Therefore ITS has failed to approach human effectiveness.</p><p>This comparison requires that both numbers measure the same construct at different magnitudes. They do not.</p><p>The research literature on what expert human tutors actually do is not sparse, and much of it was produced by the same researchers who built ITS. Art Graesser &#8212; who built AutoTutor, one of the more sophisticated systems in the ITS research tradition &#8212; spent years analyzing videotaped sessions between expert tutors and students, specifically to understand what tutors were doing that his system might learn to do. What Graesser&#8217;s analyses documented was a specific set of interactional moves.</p><p>Tutors approach a topic with what Graesser called expectations and misconceptions: a mental model of the components of correct understanding and a map of how students typically go wrong. 
As students respond, the tutor evaluates the response against this map &#8212; not syntactically, as an ITS matches a step against a production rule, but semantically, tracking which elements of the expected understanding are present and which are missing. The next move is determined by this evaluation. The response is therefore flexible in a way that production-rule matching is not.</p><p>Tutors continuously check comprehension. &#8220;Can you say that in your own words?&#8221; &#8220;What would happen if this were different?&#8221; These are not assessment items; they are real questions that tutors use to calibrate what to do next. The comprehension check is an instrument for reading the student&#8217;s understanding, not recording it in a database.</p><p>Tutors manage affect. Graesser&#8217;s research documented that expert tutors are often deliberately imprecise about negative feedback &#8212; indirect, softened, delivered in ways designed to protect the student&#8217;s willingness to continue engaging. This is not sloppiness. It is the management of an ongoing relationship whose continuation matters to the learning. A student who has been made to feel consistently stupid by their tutor stops engaging, and a tutor who cannot detect or respond to that risk is a different kind of instrument.</p><p>Tutors follow student questions. When a student asks something the tutor had not planned to address, expert tutors engage. Graesser, describing AutoTutor&#8217;s limitations with characteristic directness, noted that his system had to use &#8220;diversionary tactics&#8221; when students asked questions outside its agenda. Human tutors do not divert. They follow.</p><p>Michelene Chi, working from a different angle, documented that what makes human tutoring effective is not primarily the information the tutor delivers. 
It is the interactivity &#8212; the tutor&#8217;s prompts that elicit the student&#8217;s own elaboration, the student&#8217;s attempts at articulation that reveal gaps, the tutor&#8217;s calibration of the next move to what the student&#8217;s specific response has revealed. Self-explanation is a primary driver of conceptual change, and expert tutors are specifically skilled at eliciting the right kind of self-explanation through well-calibrated prompts. An ITS can prompt for self-explanation. What it cannot do is read the specific partial answer the student just produced and respond to that answer&#8217;s specific weaknesses.</p><p>And from an even earlier lineage: Wood, Bruner, and Ross, in a foundational 1976 paper, identified six functions tutors perform when scaffolding learners through tasks. Recruitment of interest. Reduction of degrees of freedom. Direction maintenance. Marking critical features. Frustration control. Demonstration. Of these six, Cognitive Tutor was specifically engineered to perform one: reduction of degrees of freedom, the step-by-step scaffolding that makes a complex problem tractable by breaking it into smaller operations. The tutor is structurally blind to recruitment, structurally unable to perform frustration control, and limited in demonstration to displaying the system&#8217;s own solution paths rather than modeling the expert&#8217;s move for the novice in ways the novice can watch and internalize.</p><div><hr></div><h2>The Axis Problem</h2><p>Here is what this produces.</p><p>The ITS measurement apparatus was built to measure one specific dimension of what expert human tutors do: the reduction-of-degrees-of-freedom move. Cognitive Tutor performs this move with remarkable precision. Its model tracing, its knowledge tracing, its mastery-learning constraints &#8212; these are all optimized for ensuring students acquire the production rules that constitute procedural competence in a specific domain. 
When evaluated on measures aligned with this construct, the system produces real effects. Pane&#8217;s 0.20 sigma is not noise. It reflects what the system actually does.</p><p>Human tutoring, as documented in Graesser&#8217;s and Chi&#8217;s and Wood, Bruner, and Ross&#8217;s research, involves that same move alongside several others: expectation-and-misconception dialogue, comprehension checks, affective management, student-question handling, recruitment, frustration control, demonstration. The effect sizes produced by expert human tutors in the research literature reflect this fuller set of moves acting in concert, against whatever outcome measures the studies used.</p><p>When these two numbers &#8212; the ITS effect and the human-tutoring effect &#8212; are placed on a single sigma axis for comparison, the implicit claim is that they measure the same construct at different magnitudes. They do not. ITS measures what a procedural-scaffolding technology produces on assessments that test procedural skills. Human tutoring measures what a full interactional relationship produces on assessments that, depending on the study, test some combination of procedural skills and broader constructs. The numbers can be placed on the same axis only if the underlying outcome measures are the same &#8212; which they frequently are not &#8212; and only if the interactional moves the two interventions involve are comparable &#8212; which the research literature establishes they are not.</p><p>This is the construct mismatch. It is not a peripheral observation. It is the structural feature of a comparison that has been doing field-level work for forty years, driving research agendas, guiding institutional adoption decisions, and anchoring the contemporary rhetoric that AI can approach human instructional effectiveness. 
What the comparison has consistently obscured is that the two things it is comparing were never fully on the same axis.</p><p>Cognitive Tutor did something real, with discipline and theoretical grounding, and produced genuine effects when evaluated appropriately. The disappointment in its failure to match human-tutor effect sizes is partly the disappointment of a comparison that was underdetermined from the start. Asking whether Cognitive Tutor matched human tutors is like asking whether a skilled surgeon matches a general practitioner across all dimensions of medical care. The surgeon is extraordinarily good at the specific thing the surgeon does. The general practitioner does that thing and many others. The sigma gap between them does not mean the surgeon failed.</p><div><hr></div><h2>The Inheritance</h2><p>The current AI-tutor moment has been presented, in much public discourse, as an advance that finally addresses what ITS lacked. Large language models can engage in natural-language dialogue. They can handle questions they were not specifically designed to handle. They can, in principle, perform some of the interactional moves Graesser documented as characteristic of expert human tutoring &#8212; the expectation-and-misconception dialogue, the comprehension check, the flexible response to what a student actually said. The rhetoric suggests the construct mismatch has been resolved.</p><p>Read through the ITS apparatus, the claim is more complicated than the rhetoric suggests.</p><p>The current AI-tutor evaluation studies still measure what ITS evaluations measured: item-level mastery, step-level performance, post-test scores on aligned assessments, immediate outcomes rather than durable learning. The measurement apparatus has been inherited. What has changed is the interaction layer. 
Whether the interaction-layer changes produce meaningfully different learning outcomes &#8212; or produce the appearance of more-human interaction without producing the underlying effects &#8212; is an empirical question the current literature has not cleanly answered. The Kestin Harvard physics study, with its 0.73 to 1.3 sigma effects on researcher-designed tests of the specific content a two-hour AI session had just covered, is measured on a Skinnerian axis. The measurement does not index whether the AI performed the interactional moves that make human tutoring what it is. It indexes whether students correctly answered questions about surface tension and fluid flow immediately after being tutored about surface tension and fluid flow.</p><p>The construct mismatch is not solved by better interaction capabilities. It is solved by better measurement. A system that performs rich tutoring interaction and is evaluated on aligned immediate assessments remains, from the evaluation&#8217;s perspective, on the same axis as Cognitive Tutor. The measurement apparatus determines what the sigma numbers mean, and the measurement apparatus has not substantially changed across the transition from production-rule ITS to generative AI tutoring.</p><p>This matters because the comparison that has driven forty years of ITS disappointment is being recycled to drive the current AI-tutor moment. The benchmarks invoked &#8212; Bloom&#8217;s 2-sigma, the expert-human-tutor effect-size range, the framing that AI can now &#8220;approach&#8221; human instruction &#8212; are the same benchmarks. The construct mismatch they depend on is the same mismatch. 
Whether a system that generates flexible natural-language responses has actually closed the distance that matters, or has closed the part of the distance that is easier to perform while leaving the harder parts unaddressed, is the question the measurement apparatus is not yet equipped to answer.</p><div><hr></div><h2>Three Questions to Ask</h2><p>When you next encounter a claim that an educational technology has approached the effectiveness of human tutoring, three questions will orient you.</p><p>What did the technology actually measure? If the evaluation used item-level or step-level assessments aligned with the technology&#8217;s instructional content, the system has been measured against a construct aligned with what it was built to do. This is not a criticism; it is a description of what the evaluation supports.</p><p>What does the human-tutoring construct actually involve? The research literature on expert human tutors documents a specific set of interactional moves &#8212; expectation-and-misconception dialogue, comprehension checks, affective management, student-question handling, recruitment, frustration control, demonstration. These are not peripheral features. They are the substance of what expert tutors do.</p><p>Was the comparison conducted on an axis that indexes both? If the outcome measure favors procedural scaffolding, as the measures in most ITS and AI-tutor evaluations do, the axis is not measuring what human tutoring does beyond procedural scaffolding. The comparison is limited by the measurement choice. A finding that the technology approaches human tutoring on such a measure is a finding about procedural scaffolding, not about the interactional richness that the human-tutoring construct involves.</p><p>These questions do not answer whether AI can replace human tutors. They answer the prior question: what are we measuring when we make the comparison? 
The field has been skipping the prior question since 1984, when Benjamin Bloom placed his two-sigma number on the same axis as his classroom-instruction comparison and the discourse collapsed the distance between them into a single rhetorical invitation. Cognitive Tutor responded to the invitation seriously, with theoretical rigor and methodological discipline, and produced 0.20 sigma at high schools after a year of implementation and $97 per student per year of cost. That result is not a failure. It is what the move that Cognitive Tutor was designed to do produces, measured honestly, at scale, in actual schools.</p><p>The number that system was compared against was never on the same axis. The comparison is the problem. It was the problem in 1990, when ITS researchers were trying to build what it named. It is still the problem now, when generative AI is being asked to close a gap the measurement apparatus cannot fully see.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). 
| <a href="https://skepticism.ai">skepticism.ai</a> | <a href="https://theorist.ai">theorist.ai</a></em></p><div><hr></div><p><strong>Tags:</strong> intelligent tutoring systems construct validity, Cognitive Tutor RAND evaluation, human tutoring comparison mismatch, ACT-R model tracing procedural scaffolding, AI tutor measurement apparatus critique</p>]]></content:encoded></item><item><title><![CDATA[The Debt That Was Never Owed]]></title><description><![CDATA[Palantir posted a bootlicking new manifesto to X on Saturday]]></description><link>https://www.skepticism.ai/p/the-debt-that-was-never-owed</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-debt-that-was-never-owed</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Tue, 21 Apr 2026 02:39:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6pk5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6pk5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6pk5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!6pk5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 848w, 
https://substackcdn.com/image/fetch/$s_!6pk5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!6pk5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6pk5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1467135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/194869790?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6pk5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 424w, 
https://substackcdn.com/image/fetch/$s_!6pk5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!6pk5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!6pk5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F361989e9-4dad-4370-8ca3-45aecb284555_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Palantir posted a <a href="https://x.com/PalantirTech/status/2045574398573453312">bootlicking new manifesto</a> to X on Saturday, calling it a brief summary of The Technological Republic, a 2025 book by Palantir co-founder and CEO Alexander C. Karp and head of corporate and legal affairs Nicholas W. Zamiska. You can read the <a href="https://x.com/PalantirTech/status/2045574398573453312">full manifesto here</a>.</p><p>There is a word missing from Palantir&#8217;s 22-point manifesto, and its absence is the most revealing thing about the document. The word is <em>citizen</em>. Not customer, not taxpayer, not the &#8220;public&#8221; whose security the company claims to protect&#8212;citizen, the person with rights that precede the state&#8217;s demands on them. In 318 words posted to X on a Saturday, Alexander Karp and Nicholas Zamiska laid out a vision of the relationship between Silicon Valley and the American government that has no room for that word, because the vision does not require it. What it requires is something older and more coercive: <em>debt</em>.</p><p>&#8220;Silicon Valley owes a moral debt,&#8221; the manifesto announces, &#8220;to the country that made its rise possible.&#8221; The engineering elite has &#8220;an affirmative obligation to participate in the defense of the nation.&#8221; Read slowly, this is an extraordinary claim&#8212;not that companies <em>should</em> contribute to national defense as a matter of civic choice, but that they <em>owe</em> this contribution as repayment for being permitted to exist and thrive. The logic underneath is not liberal. It is feudal. 
You were allowed to build here; now you must serve.</p><p>This distinction matters because it forecloses the question the manifesto most wants to avoid: serve <em>what</em>, and decided by <em>whom</em>?</p><div><hr></div><h2>The Machine That Needs No Ethics</h2><p>Palantir is not a neutral observer of the relationship between technology and national power. It is one of the primary architects of that relationship. Its tools help run predictive policing programs in American cities&#8212;programs with documented records of racially disparate impact. Its analytics support military operations in Gaza, where the scale of civilian death has generated calls for investigation at the International Court of Justice. The company&#8217;s stated business is to make governments and militaries more effective at finding and targeting people.</p><p>This background is not incidental to reading the manifesto. It is the lens through which every high-minded claim about &#8220;hard power&#8221; and &#8220;the long peace&#8221; must be understood. When point five declares that &#8220;the question is not whether A.I. weapons will be built; it is who will build them and for what purpose&#8221;&#8212;Palantir is answering its own question. It will build them. The purpose will be defined later, by clients.</p><p>The manifesto&#8217;s treatment of AI weaponry is instructive precisely because of what it refuses to say. &#8220;Our adversaries will not pause to indulge in theatrical debates about the merits of developing technologies with critical military and national security applications.&#8221; The word <em>theatrical</em> is doing enormous work here. It transforms any moral inquiry&#8212;any attempt to ask what these systems will do to human bodies, to civilian populations, to the international frameworks that have governed warfare since 1949&#8212;into performance. The person who asks &#8220;should we build this?&#8221; is not thoughtful. They are theatrical. 
They are wasting time while China proceeds.</p><p>This is an old move. It has been used to justify every weapons program that ever required the silencing of conscience. The urgency of the adversary becomes the alibi for the abandonment of ethics. What is new is the audacity of building that alibi directly into a manifesto and posting it with apparent pride.</p><div><hr></div><h2>The Hierarchy They Won&#8217;t Name</h2><p>The manifesto&#8217;s most revealing quality is its double standard, operating so consistently across so many of its twenty-two points that it must be understood as a design feature rather than an oversight.</p><p>Ordinary people who look to politics &#8220;to nourish their soul and sense of self&#8221; are warned they &#8220;will be left disappointed.&#8221; They should not rely too heavily on their internal life finding expression in politicians they&#8217;ll never meet. <em>Stay in your lane.</em> But Elon Musk should not be &#8220;snickered at&#8221; for his grand narratives. The rich man&#8217;s vision is legitimate ambition; the ordinary person&#8217;s political investment is pathetic dependency.</p><p>Public figures deserve &#8220;far more grace.&#8221; The &#8220;ruthless exposure of the private lives of public figures drives far too much talent away from government service.&#8221; The culture of accountability&#8212;the press, the investigators, the citizens who demand that power justify itself&#8212;is characterized as a pathology driving good people from public life. But the document offers no equivalent concern for the people whose private lives are exposed by Palantir&#8217;s surveillance tools. The predictive policing database. The behavioral analytics. The location tracking. The inference engines that make private lives legible to the state. That exposure is the product. 
The grace is reserved for those doing the exposing.</p><p>Point twenty-one declares that some cultures &#8220;have produced wonders&#8221; while others &#8220;have proven middling, and worse, regressive and harmful.&#8221; This is not accompanied by any methodology, any acknowledgment of the material conditions that produce what Karp and Zamiska are willing to call cultural failure, any reckoning with the history of a Western civilization that has spent five centuries extracting labor and resources from the cultures it now grades. It is simply asserted, with the confidence of people who have never had to justify to anyone why their own culture gets to be the rubric.</p><p>This is the hierarchy the manifesto will not name: the people who build the tools and those upon whom the tools are used. The engineers whose creative lives deserve protection from decadence and the citizens whose movements, associations, and behaviors feed the databases that fund the manifesto&#8217;s authors. The public figures who deserve grace and the communities who deserve, apparently, nothing but efficiency.</p><div><hr></div><h2>The Draft and the Document</h2><p>Point six is the most honest sentence in the manifesto: &#8220;We should, as a society, seriously consider moving away from an all-volunteer force and only fight the next war if everyone shares in the risk and the cost.&#8221;</p><p>I want to sit with this for a moment, because buried inside its apparent fairness is something important. Karp and Zamiska are calling for conscription. Universal national service. They are saying that the all-volunteer military&#8212;the force assembled from people who, for economic or ideological reasons, chose to enlist&#8212;is insufficient. Everyone must go.</p><p>And yet.</p><p>The same document argues that engineers have a &#8220;moral debt&#8221; to the national defense that must be repaid through the production of AI weapons. 
The same document argues that tech companies must be conscripted to serve national interests. The same document warns that &#8220;theatrical debates&#8221; about the ethics of these weapons should not be permitted to slow their development.</p><p>What the manifesto envisions, in full, is a society in which everyone serves&#8212;but in which the purposes they serve, the weapons they build, and the targets those weapons find are determined by the people writing 22-point manifestos and posting them to X. Universal obligation. Elite prerogative. The risk is shared; the decisions are not.</p><p>This is the structure of every regime that has ever called for national sacrifice while exempting its own planning class from accountability. The workers die in the wars that the strategists design.</p><div><hr></div><h2>What Decadence Actually Is</h2><p>The manifesto&#8217;s most irritating rhetorical move is its deployment of <em>decadence</em> as an indictment of ordinary life. &#8220;The decadence of a culture or civilization, and indeed its ruling class, will be forgiven only if that culture is capable of delivering economic growth and security for the public.&#8221; &#8220;Is the iPhone our greatest creative if not crowning achievement as a civilization?&#8221; &#8220;Free email is not enough.&#8221;</p><p>This is the pose of someone who has everything and is bored by it&#8212;who mistakes their boredom for moral clarity and their ambition for national purpose. Karp and Zamiska are billionaires. They run a company whose stock has made many of its employees extraordinarily wealthy. The product they are now positioning as the antidote to decadence&#8212;AI-powered weapons systems&#8212;is the revenue engine that sustains their own very comfortable lives. 
The argument is: you are distracted by your phones while we build the future, which we will sell to governments at market rates.</p><p>What decadence actually looks like is a surveillance capitalism that profits from exposure while calling for privacy protections for its principals. It looks like a company that takes federal contracts to build targeting systems and then writes a book about the spiritual failure of the engineering class that won&#8217;t do the same. It looks like the audacity to write about public service while running a company whose compensation structure would, as the manifesto itself notes, cause any normal business to &#8220;struggle to survive&#8221;&#8212;and offering no solution to that problem beyond the vague instruction that the situation must change.</p><div><hr></div><h2>The Peace That Is Not Peace</h2><p>Point fourteen asserts that &#8220;American power has made possible an extraordinarily long peace.&#8221; The framing is precise, calibrated, and wrong in the ways that matter most.</p><p>The hundred years of &#8220;some version of peace&#8221; that the manifesto celebrates looks different depending on where you are standing. It looks like the Korean War if you are Korean. It looks like Vietnam if you are Vietnamese, or Laotian, or Cambodian. It looks like a series of coups and counter-insurgency operations if you are Guatemalan, Chilean, Iranian. It looks like the Iraq War and its 200,000 civilian dead if you are Iraqi. It looks like the drone program if you are Yemeni, Pakistani, or Somali.</p><p>The &#8220;long peace&#8221; is a peace among great powers, purchased in part by the exportation of violence to places whose people the manifesto is not designed to address. When Karp and Zamiska write that &#8220;nearly a century of some version of peace has prevailed in the world without a great power military conflict,&#8221; they are using &#8220;the world&#8221; to mean something smaller than the world.</p><p>This is not a minor error. 
It is the error that makes possible everything else in the document&#8212;the easy celebration of hard power, the dismissal of ethical debate, the confidence that the instruments of American military capacity are, on balance, a gift to humanity. If you exclude from your accounting the people on whom American military power has been used, the accounting works out very well. If you include them, it does not.</p><div><hr></div><h2>What I Find Myself Unable to Dismiss</h2><p>And yet.</p><p>There are things in this document that cannot simply be mocked away. The concern about Germany and Japan&#8212;point fifteen&#8217;s argument that Europe is &#8220;paying a heavy price&#8221; for the overcorrection of German demilitarization&#8212;has been vindicated with terrible specificity by events since 2022. The observation that public service compensation structures drive talented people toward private alternatives is empirically accurate. The critique of a political culture that has become so punitive that it discourages participation is something that people across the political spectrum have made, often for opposite reasons.</p><p>The scaffolding of the manifesto is not entirely wrong. The conclusion it draws from that scaffolding&#8212;that Silicon Valley companies have an obligation to build weapons and a right to do so without ethical interference&#8212;is where the document reveals what it actually is.</p><p>The scaffolding says: the world is dangerous, democracies must compete, technical capacity is the foundation of power, the people who can build technical capacity have responsibilities that go beyond personal enrichment.</p><p>The conclusion says: therefore, Palantir.</p><p>These do not follow from each other. 
The premises could support a very different conclusion&#8212;one in which technical capacity is developed under democratic accountability, in which the ethical debates the manifesto calls theatrical are understood as the very mechanism by which a free society maintains control over its instruments of power, in which the &#8220;debt&#8221; to the country is repaid through transparency and restraint rather than through the manufacture of ever more effective targeting systems.</p><p>The manifesto&#8217;s authors know this. They wrote around it. The question is whether we will let them.</p><div><hr></div><h2>The Last Line</h2><p>&#8220;The republic is left with a significant roster of ineffectual, empty vessels whose ambition one would forgive if there were any genuine belief structure lurking within.&#8221;</p><p>This is Karp and Zamiska on the quality of American public servants. It is contemptuous in a way that, in a less polished document, would read as rage.</p><p>I find I agree with the sentence. I disagree with its intended targets.</p><p>The ineffectual empty vessels with insufficient belief structures are not the public servants who refused to build weapons. They are not the engineers who asked whether they should before they asked whether they could. They are not the citizens who looked to politics to nourish something in themselves and were told to stay in their lane.</p><p>The problem with genuine belief is that it imposes obligations. It means being accountable to something larger than the manifesto you published on a Saturday. It means the ethics are not theatrical. It means the debt runs in more directions than down.</p><p>Karp and Zamiska believe in hard power. They believe in American strength. They believe in the obligation of technical elites to serve national purpose. 
They have built a company that embodies these beliefs and made themselves very wealthy in the process.</p><p>What they do not believe in&#8212;what the bootlicking manifesto&#8217;s 318 words systematically exclude&#8212;is accountability to the people the tools touch. The communities surveilled. The bodies targeted. The cultures graded and found regressive. The ordinary citizens whose political investments are characterized as pathetic while their physical conscription is proposed as necessary.</p><p>That is not a belief structure. That is a business model wearing a belief structure as a costume.</p><p>The republic deserves better than costumes. So do its people.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). His research on algorithmic systems, AI ethics, and platform accountability is published at bear.musinique.com, skepticism.ai, and theorist.ai.</em></p><div><hr></div><p><strong>Tags:</strong> Palantir Technological Republic critique, AI weapons ethics Silicon Valley, conscription tech manifesto, surveillance capitalism accountability, Alexander Karp national service obligation</p><p></p>]]></content:encoded></item><item><title><![CDATA[The Inheritance We Never Examined]]></title><description><![CDATA[How Skinner&#8217;s Teaching Machine Still Grades Your Children&#8217;s Software]]></description><link>https://www.skepticism.ai/p/the-inheritance-we-never-examined</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-inheritance-we-never-examined</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Mon, 20 Apr 2026 18:42:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zZrL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zZrL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zZrL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!zZrL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!zZrL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!zZrL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zZrL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png" width="1456" height="816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1878068,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/194830729?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zZrL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 424w, https://substackcdn.com/image/fetch/$s_!zZrL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 848w, https://substackcdn.com/image/fetch/$s_!zZrL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 1272w, https://substackcdn.com/image/fetch/$s_!zZrL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3af82244-4500-483f-a5a5-7d30bf18480e_1456x816.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><p>There is a machine in every classroom now, and it measures what it has always measured. The name on the box changes &#8212; Duolingo, Khanmigo, i-Ready, DreamBox &#8212; but what the box counts has remained, across seventy years of silicon and software and venture capital and neuroscience, almost perfectly stable. Accuracy per item. Time per response. Progression through atomized units. Performance on the test the system was built to prepare you for.</p><p>B.F. Skinner named these measurements in 1958. He had a good reason.</p><p>He had observed his daughter&#8217;s fourth-grade arithmetic class and been, in his own word, shocked. Students completed problems and waited. The papers were collected. Perhaps two days later, perhaps a week, the marked papers returned. 
By the time the feedback arrived, the behavior it was meant to reinforce had already moved on, taken up residence in some adjacent habit of mind that was no longer the one in need of correction. Skinner believed he understood the mechanism of learning better than anyone alive &#8212; the contingencies of reinforcement, the precise timing of feedback, the accumulation of correctly shaped behavior into competence &#8212; and what he had watched in that classroom was the systematic breaking of every mechanism he understood. A technology that could restore the contingencies, he reasoned, would be a technology that could teach.</p><p>His teaching machine presented material one frame at a time. The student responded. The machine verified, immediately, whether the response was correct. The contingencies were repaired.</p><p>What I am asking you to notice is not that this was wrong. I am asking you to notice what the machine measured &#8212; accuracy per frame, time per response, progression, error patterns &#8212; and to hold those measurements in mind as we trace them forward through sixty-six years of educational technology that kept the apparatus while abandoning almost everything else about Skinner&#8217;s framework.</p><div><hr></div><h2>What the Machine Could Not See</h2><p>The teaching machine could not look up from the immediate interaction to ask what the student would remember in six months.</p><p>This is not a glancing criticism of Skinner. His behavioral framework did not require him to ask the question; the question was not yet a question the field had organized itself to ask in the precise way that Bjork and Bjork&#8217;s subsequent research would demand. Skinner&#8217;s science was about the shaping of behavior through reinforcement, and a behavior that could be elicited at the moment of measurement had been shaped. 
That the behavior might dissolve in the absence of the reinforcing conditions was not, within behaviorism, a separate problem requiring separate measurement. Generalization was expected to follow naturally.</p><p>But this is where the inheritance turns costly. The assumption that immediate performance predicts durable learning was embedded in the measurement apparatus before it was tested empirically. By the time Robert and Elizabeth Bjork&#8217;s work made the distinction between retrieval strength and storage strength unavoidable &#8212; by the time it was clear, empirically, that the conditions maximizing immediate performance (massed practice, aligned testing, minimal difficulty) could actively impair long-term retention &#8212; the measurement apparatus had already been handed down through Patrick Suppes&#8217;s 1960s computer-assisted instruction and was settling into the bones of the field.</p><p>Suppes&#8217;s system at Stanford presented arithmetic problems to elementary students and recorded what Skinner&#8217;s machine had recorded: accuracy rates, response times, error patterns, progression. The technology shifted from mechanical device to mainframe computer. The measurements did not shift. Accuracy rose from 53 percent to over 90 percent. Response times fell from 630 seconds to 279. Suppes reported these numbers as evidence the system worked, and within the apparatus he had inherited, they were. He was not wrong to report them. 
He was working inside a set of choices about what evidence looked like that the apparatus had bequeathed him without flagging as choices.</p><p>The question of what those 90-percent-accurate students could do two years later was not asked.</p><div><hr></div><h2>The Apparatus Becomes Theory</h2><p>Here is what makes the inheritance pattern strange rather than simply historical: the apparatus persisted past the abandonment of the theoretical framework that had justified it.</p><p>John Anderson&#8217;s Cognitive Tutor, developed in the 1980s and 1990s at Carnegie Mellon, was built on ACT-R theory &#8212; a cognitive-psychological architecture that treated learning as the acquisition of production rules rather than the shaping of behavior. Theoretically, this was a departure from Skinner significant enough to constitute a revolution. The language of reinforcement was replaced by the language of cognition. The unit of analysis shifted from the frame to the production rule.</p><p>The measurement apparatus did not shift.</p><p>The Cognitive Tutor recorded step-level correctness &#8212; whether each student action matched one of the production rules the cognitive model identified as correct. It recorded time per step. It recorded hint requests, error patterns, estimated mastery of each production rule through Bayesian knowledge tracing. When Anderson and colleagues published their foundational 1995 paper in the <em>Journal of the Learning Sciences</em>, the evidence they offered that the system worked was: step-level accuracy, progression, and post-test performance on assessments aligned with the content the tutor had taught.</p><p>Skinner&#8217;s apparatus, operating at higher resolution, within a more sophisticated theoretical framework, carrying new vocabulary.</p><p>Anderson and colleagues were, I want to say this plainly, more honest about the limits of their measurements than most of the researchers who cited them. 
The 1995 paper notes explicitly that students &#8220;display transfer to the degree that they can map the tutor environment into the test environment&#8221; &#8212; an acknowledgment that the evidence of learning the system could produce depended on the degree to which the post-test resembled the tutor&#8217;s own format. This is the measurement-alignment problem stated with precision by the researchers who built the system it applied to. The acknowledgment was there. What happened subsequently was that the effect sizes from aligned post-tests entered the literature as if Anderson&#8217;s own caveat had not been published alongside them.</p><p>The apparatus inherits even what its originators flagged as provisional.</p><div><hr></div><h2>The Industrial Turn</h2><p>The 2010s commercial adaptive-learning era &#8212; Knewton, DreamBox, i-Ready, ALEKS &#8212; represents the point at which the inherited apparatus became an industry standard.</p><p>Knewton&#8217;s Jos&#233; Ferreira, during the 2012-2015 period of the platform&#8217;s public prominence, positioned his technology as capable of personalization so granular that it would transform education at scale. The claim invoked the Suppes promise in the language of twenty-first-century data science. What the platform actually measured was behavioral engagement data: which problems students attempted, which hints they took, how their patterns of interaction with the system correlated with eventual performance on the system&#8217;s own assessments. Independent efficacy research on Knewton was, during the period of its most expansive claims, notably absent. 
The apparatus was present in the measurement choices; the evidence was not.</p><p>DreamBox Learning, which earned more research attention than most adaptive platforms, became the subject of a 2016 Harvard Center for Education Policy Research study that found students at the median gained 1.4 to 3.9 percentile points on the NWEA MAP for approximately 7 to 8 hours of DreamBox usage. The researchers were transparent about a critical limitation: DreamBox usage might &#8220;partially reflect students&#8217; motivation levels,&#8221; meaning the correlation between usage and achievement might reflect that motivated students both use DreamBox more and learn more, independent of DreamBox&#8217;s instructional contribution. The acknowledgment, honest and specific, appeared in the paper. It rarely appeared in the citations that followed.</p><p>i-Ready produced a particularly clarifying version of the apparatus&#8217;s internal logic. The platform&#8217;s efficacy research typically demonstrated that students who achieved &#8220;usage fidelity&#8221; &#8212; meeting the system&#8217;s recommended weekly engagement minutes &#8212; showed higher scores on the i-Ready Diagnostic. The Diagnostic was itself calibrated to predict state test performance. A system measuring how well students learn to do well on the assessment the system provides, where the assessment was engineered to predict the external standard &#8212; this is the apparatus become recursive. The alignment between instruction and measurement, which Skinner had simply taken as a natural feature of teaching a student the specific behavior you then measured, had been engineered into the product design itself. The inheritance was now embedded in the commercial structure.</p><p>ALEKS routed the apparatus through Knowledge Space Theory, a mathematical framework for mapping curricular competencies that provided sophisticated theoretical grounding for the same fundamental measurement choices. 
Efficacy claims rested on performance within the system&#8217;s own knowledge mapping and on aligned post-tests that measured progression through the curricular content the system taught. The theoretical vocabulary was different from Skinner&#8217;s. The measurement choices were the same.</p><div><hr></div><h2>Duolingo, 2021</h2><p>I want to read a specific study carefully, because careful reading is the point.</p><p><em>Evaluating the reading and listening outcomes of beginning-level Duolingo courses</em>, by Xiangying Jiang, Joseph Rollinson, Luke Plonsky, Erin Gustafson, and Bozena Pajak, published in <em>Foreign Language Annals</em> in 2021. The third author, Plonsky, is an academic researcher at Northern Arizona University who specializes in applied linguistics. The other four were employed by Duolingo at the time of publication. The paper is peer-reviewed. It is cited in Duolingo&#8217;s own marketing materials. It is, within the conventions of the field, a careful study.</p><p>Two hundred and twenty-five adults in the United States &#8212; 135 studying Spanish, 90 studying French. Participants were required to have little to no prior proficiency in their target language, to be using Duolingo as their only learning tool, and &#8212; the consequential criterion &#8212; to have completed the beginning-level course content through Unit 4. The sample, the paper reports, skewed toward highly educated Caucasian Americans with bachelor&#8217;s or master&#8217;s degrees.</p><p>The outcome measure was the STAMP 4S test from Avant Assessment, covering reading and listening. Thirty multiple-choice items in each modality. 
The assessment was administered immediately after learners completed the beginning-level content.</p><p>The finding: Duolingo learners reached ACTFL Intermediate Low in reading and Novice High in listening &#8212; levels the paper characterizes as &#8220;comparable with those of university students at the end of the fourth semester&#8221; of college-level language study.</p><p>Now apply the apparatus.</p><p>The outcome measure is external &#8212; not designed by Duolingo, which is a genuine methodological improvement over purely internal assessment. But reading and listening are the specific modalities that Duolingo&#8217;s interface is engineered around. Multiple-choice comprehension items, translation tasks, listening exercises with multiple-choice responses: these are what Duolingo builds, and these are what the STAMP 4S measures. Speaking and writing &#8212; modalities that Duolingo&#8217;s app-based format supports weakly &#8212; are explicitly excluded from the study. The assessment is external. The choice of which aspects of language proficiency to measure is not.</p><p>The timescale: the post-test was administered immediately after course completion. There is no delayed assessment. Bjork&#8217;s distinction between retrieval strength and storage strength is directly relevant &#8212; the STAMP 4S scores reflect what Duolingo users can do at the moment they finish the course, not what they can do when they have been away from the app for six months. This question is not asked.</p><p>The population: only learners who completed the beginning-level content. Most Duolingo users do not. The platform&#8217;s attrition is substantial; most people who download the app never reach the end of the beginning-level material. The study measures the performance of survivors. What 100 people who finished the course achieved is a different finding from what 100 people who started it achieved. The paper is transparent about this selection. 
The subsequent framing of the findings &#8212; in the paper&#8217;s own conclusion and, more aggressively, in Duolingo&#8217;s marketing &#8212; as <em>Duolingo users reach Intermediate Low</em> does not preserve the completion-threshold restriction.</p><p>The baseline: a historical comparison. University students at the end of the fourth semester. There is no contemporaneous control group of comparable adults who spent equivalent time on a different learning approach. The two populations were measured in different conditions, at different times, possibly with different motivations and starting points. The <em>comparable to four semesters</em> claim treats them as if they had been measured equivalently.</p><p>The cost: not reported. Duolingo is free at its base tier, which is rhetorically powerful &#8212; free app comparable to paid college course &#8212; but the comparison elides the substantial time investment Duolingo users make. The paper does not ask what equivalent time investment in human-tutored instruction, structured self-study, or an immersive experience would produce. The cost denominator, which is constitutive of what a comparative claim actually supports, is absent.</p><p>I am not saying the study is dishonest. I am saying that each of these specific measurement choices &#8212; aligned-modality outcome, immediate timescale, survivor population, historical baseline, absent cost denominator &#8212; is traceable, in structure, to the apparatus Skinner initiated in 1958. The study is careful within conventions it has inherited. The conventions themselves are what require examination.</p><div><hr></div><h2>The Alternatives Have Always Existed</h2><p>This is what I want you to sit with: the apparatus did not persist in the absence of alternatives. It persisted alongside them.</p><p>Edward Thorndike established in 1906 and 1924 that improvement in one mental function rarely produces general improvement in others unless the two share identical elements. 
The methodological implication &#8212; that learning gains must be tested outside the conditions of the intervention, in contexts structurally different from training, to establish what the training actually produced &#8212; was available to the field for the entire history of educational technology. It has been occasionally adopted, routinely praised, and treated as aspirational rather than as the baseline standard that Thorndike&#8217;s own work suggested it should be.</p><p>The Bjorks&#8217; work on storage strength versus retrieval strength, canonical since the early 1990s, established empirically that the conditions maximizing immediate performance can impair durable retention. The specific implication &#8212; that a delayed post-test is required to distinguish performance from learning &#8212; has been in the learning sciences literature for over thirty years. Its adoption in educational technology efficacy research as standard practice has not happened.</p><p>Bransford, Brown, and Cocking&#8217;s <em>How People Learn</em>, the 1999 National Academies synthesis, argued explicitly that assessment should tap understanding rather than the ability to repeat facts. The argument was widely read, widely cited, and narrowly operationalized.</p><p>Samuel Messick&#8217;s theory of validity, developed across decades and codified in the 1989 <em>Educational Measurement</em> volume, specified that a test score&#8217;s interpretation requires examination of construct-relevant versus construct-irrelevant variance, construct underrepresentation, and the consequences of the test&#8217;s use. Applied rigorously, Messick&#8217;s framework would require educational technology efficacy research to examine what its outcome measures actually index rather than assuming that performance-on-aligned-items equals evidence-of-learning. The framework has been the theoretical standard in measurement theory for over thirty years.</p><p>These alternatives were not hidden. 
They were taught in graduate programs, cited in methods sections, present in the same journals that published the aligned-outcome studies. What did not happen, across six decades of technology change, was their adoption as the field&#8217;s measurement standard. The inherited apparatus &#8212; aligned outcomes at immediate timescale, survivor population, weak baseline, absent cost denominator &#8212; remained dominant. The alternatives remained alternative.</p><p>This is not a story about intellectual failure. It is a story about what happens when a theoretical commitment gets flattened into a methodological convention. Skinner had reasons for his measurement choices that were grounded in a coherent behavioral science. When the field moved past behavioral science &#8212; when Suppes and Anderson and everyone who followed adopted different theoretical frameworks &#8212; the measurement choices did not travel with the theory that had justified them. They traveled alone, as conventions, as what evidence looked like, as the unexamined default.</p><p>The apparatus became invisible by becoming obvious. And invisible apparatus is the most durable kind.</p><div><hr></div><h2>The Current Wave</h2><p>The contemporary AI-tutor literature &#8212; Khanmigo, Kestin and colleagues&#8217; 2024 Harvard physics study, Eedi with Google Research, Rori in Ghana &#8212; inherits the apparatus in its turn, with variation worth noting.</p><p>Khanmigo&#8217;s evaluation evidence has rested primarily on engagement metrics and performance within Khan Academy&#8217;s own internal assessment structures. What has been measured at scale is usage patterns; what has been claimed is educational transformation; what has not been established at the level of rigorous efficacy research is learning gains on independent standardized measures at delayed timescales with cost-inclusive reporting. The characteristic gaps of the apparatus are present.</p><p>The Kestin et al. 
2024 Harvard physics study &#8212; AI-tutored instruction versus a single session of active-learning classroom instruction &#8212; reported effect sizes of 0.73 to 1.3 sigma on researcher-designed post-tests covering surface tension and fluid flow, the specific content the two-hour intervention taught, assessed shortly after the intervention. The measurement choices are the apparatus&#8217;s measurement choices. The effect sizes are real within those choices. What they establish about learning is bounded by what those choices can establish.</p><p>Eedi with Google Research 2025 introduced transfer testing &#8212; measuring performance on novel problems from subsequent topics rather than problems aligned with what the intervention taught. This is a genuine departure from the inherited convention. The N of 165 and single-term duration remain short relative to what durability research would require, but the outcome measure itself represents the kind of revision the apparatus needs rather than another inheritance of it. This is a credit to the researchers who chose to build the study that way.</p><p>Rori in Ghana used an external curriculum-aligned assessment over eight months and reported cost at $5 per student per year. The longer timescale, the external measure, the explicit cost denominator &#8212; these are partial revisions of the apparatus in the direction the field has needed for six decades. The pattern is: when researchers choose to work against the inherited conventions, the field moves. The field moves rarely, because the inherited conventions are the default, because departures from them require additional effort and often smaller effect sizes and sometimes no significant effect at all, which is a kind of finding that is harder to publish than 0.73 sigma.</p><p>The apparatus has not been reformed. It has been revised in specific instances by specific researchers. 
The instances are the exceptions that make the pattern visible.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). His research on educational AI efficacy appears at <a href="https://hypotheticalai.substack.com">hypotheticalai.substack.com</a>. | <a href="https://skepticism.ai">skepticism.ai</a> | <a href="https://theorist.ai">theorist.ai</a></em></p><div><hr></div><p><strong>Tags:</strong> educational technology measurement apparatus, Skinner teaching machine inheritance, Duolingo efficacy research critique, aligned outcome EdTech validity, learning science transfer testing history</p>]]></content:encoded></item><item><title><![CDATA[The Artifact Was Once Enough]]></title><description><![CDATA[This essay is a response to Lila Shroff's "Is Schoolwork Optional Now?" published in The Atlantic on April 10, 2026.]]></description><link>https://www.skepticism.ai/p/the-artifact-was-once-enough</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-artifact-was-once-enough</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sat, 11 Apr 2026 04:47:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6EVu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6EVu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6EVu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 424w, https://substackcdn.com/image/fetch/$s_!6EVu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 848w, https://substackcdn.com/image/fetch/$s_!6EVu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 1272w, https://substackcdn.com/image/fetch/$s_!6EVu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6EVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png" width="1456" height="803" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3143432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/193858776?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6EVu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 424w, https://substackcdn.com/image/fetch/$s_!6EVu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 848w, https://substackcdn.com/image/fetch/$s_!6EVu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 1272w, https://substackcdn.com/image/fetch/$s_!6EVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff61c4d77-310b-4950-b6ad-d209533eb3c3_3146x1734.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>This essay is a response to Lila Shroff&#8217;s &#8220;<a href="https://www.theatlantic.com/technology/2026/04/ai-agents-school-education/686754/">Is Schoolwork Optional Now?</a>&#8221; published in The Atlantic on April 10, 2026. The argument is developed in full in the preprint &#8220;<a href="https://www.nikbearbrown.com/notes/Frictional/frictional">Frictional: Measuring the Struggle</a>&#8221; at <a href="https://www.irreducibly.xyz/">irreducibly.xyz</a>.</em></p><div><hr></div><p>There is a word &#8212; <em>decoupling</em> &#8212; that sounds technical enough to keep us comfortable. Clinical. As if what has happened in classrooms since 2022 is primarily a logistics problem, a puzzle about detection and enforcement, a cat-and-mouse game that the right algorithm might someday win.</p><p>It is not that.</p><p>What has happened is something more fundamental than cheating at scale. The artifact &#8212; the essay, the proof, the lab report &#8212; used to be evidence of a process. The process was the point. The essay was proof that thinking had occurred, that a mind had engaged with difficulty and emerged changed. When we graded the essay, we were really grading the encounter: the hours of confusion, the drafts that failed, the moment when something clicked and then had to be organized into sentences for another person. The artifact was the residue of all that. It was upstream evidence of downstream consequence.</p><p>Generative AI has broken the causal chain. 
Not bent it &#8212; broken it.</p><p>A bot called Einstein, built by a 22-year-old entrepreneur named Advait Paliwal, recently completed all eight modules and seven quizzes of an introductory statistics course in under an hour. Perfect score. The human who set it loose reports that she &#8220;hardly so much as read the course website.&#8221; What Einstein produced &#8212; the evidence that a course had been completed &#8212; was real. The learning it was supposed to represent did not occur. The artifact existed. The process that should have produced it did not happen.</p><p>Paliwal says he released the tool to alert educators. His more honest statement is buried in the subtext: &#8220;If I didn&#8217;t post about this, someone would have used the same technology and hidden it from the professors.&#8221; He is right. He is also describing a world in which the distinction between using it secretly and not using it at all is narrowing toward irrelevance. The tool exists. The temptation exists. The economic pressure on students &#8212; especially international students, especially students working jobs to pay tuition, especially students in courses they are taking to satisfy requirements rather than from genuine interest &#8212; those pressures exist independently of any single tool.</p><p>The institutional response has been to build better detectors. This is a reasonable first move. It is not a durable one.</p><div><hr></div><h2>Why Detection Cannot Save Us</h2><p>Here is the structural problem with artifact-based AI detection: the arms race has a predetermined winner. Detection is always trained on the outputs of current generation technology. Generation technology improves continuously. The detector trained on today&#8217;s AI writing fails on tomorrow&#8217;s &#8212; not because detectors are poorly built, but because that is how the mathematics of the problem works. The forensic window closes.</p><p>There is a deeper problem. 
The educationally relevant question was never <em>did a human type these words</em>. It was <em>did a human develop this understanding</em>. A student who dictated an essay to a transcriptionist and then submitted it word-for-word would have technically written no AI content. The essay would pass every detector. The learning would have occurred or not occurred based on whether they thought hard while dictating, not based on who typed it. The detector is solving the wrong problem.</p><p>And there is a third problem, the one that produces the most corrosive outcomes. When you build a system to catch AI use, you teach students to game the detector. They learn strategies for mimicking authentic writing &#8212; inserting typos, varying sentence structure, using phrases the model knows sound &#8220;human.&#8221; The simulation improves. The gap between simulated engagement and genuine engagement widens at precisely the moment we need it to narrow.</p><p>William Liu, a Stanford sophomore who finished high school two years ago, puts it plainly: his educational experience and his younger sibling&#8217;s are vastly different despite a two-year gap. The technology arrived. The classroom has not yet figured out what to do next.</p><div><hr></div><h2>What Genuine Learning Actually Leaves Behind</h2><p>Here is the thing we have been too polite to say: learning is not the same as performance.</p><p>Robert Bjork has been saying this for thirty years in academic papers that educators read and administrators do not read and curriculum designers read and then ignore when the calendar pressure comes. Performance is the observable, often temporary thing &#8212; how well a student does on a measure. Learning is the durable change in what the student can do and understand and transfer to a new context. These two things are not the same. We have built an entire institutional infrastructure that measures only one of them.</p><p>Genuine human learning is a biological event. 
When a learner encounters material that genuinely challenges their current understanding &#8212; material in that productive zone where their current model is wrong or incomplete &#8212; something specific happens neurologically. Dopamine neurons fire in response to prediction errors. BDNF expression upregulates, sometimes by nearly three times. New dendritic spines form at the synaptic connections that will hold the memory. These are not metaphors. They are the physical substrate of the thing we call learning.</p><p>The behavioral consequences of these neurological events are traceable. A student engaged in genuine cognitive struggle spends time proportional to difficulty. Their errors follow a coherent developmental path &#8212; misconceptions that make sense given their current model, corrections that build on each other. When tested in a new context, they can transfer. When scaffolded with a partial hint, they respond &#8212; because there is a partially formed structure for the hint to connect to. Their confidence, over time, calibrates to their actual performance rather than inheriting the confidence of the AI explanation they processed.</p><p>These are what I have been calling <em>friction traces</em> &#8212; the behavioral signatures that genuine human cognitive engagement leaves in observable data. They exist because genuine learning is a biological event. An AI can produce the artifact without triggering any of these neurological events. It cannot produce the behavioral traces, because the biological events that generate those traces did not occur.</p><div><hr></div><h2>The Seven Things We Can Now Measure</h2><p>The Genuine Learning Probability framework I have been developing with Humanitarians AI specifies seven such traces:</p><p>The <em>temporal engagement pattern</em> &#8212; the correlation between how hard an item is and how long a student spends on it. Genuine engagement produces this correlation. 
AI-assisted completion decouples time from difficulty.</p><p>The <em>error trajectory</em> &#8212; whether a student&#8217;s mistakes follow conceptually coherent developmental paths. Genuine learning produces coherent errors; the reward prediction error mechanism drives the learner&#8217;s mental model toward better ones in patterned ways. Borrowed certainty produces errors that are random with respect to conceptual structure.</p><p><em>Cross-context transfer</em> &#8212; the Bjorkian definition of learning. A student who genuinely understands something can apply it in novel contexts. Borrowed certainty produces surface representations tied to the specific context of the AI explanation.</p><p><em>Uncertainty calibration</em> &#8212; whether a student&#8217;s expressed confidence tracks their actual performance. Borrowed certainty produces systematic overconfidence: the student inherits the AI&#8217;s confidence distribution without the knowledge base that would justify it.</p><p><em>Social knowledge texture</em> &#8212; the quality of a student&#8217;s engagement in discussion contexts. Genuine encounter with material leaves a characteristic texture: specific confusions, particular connections, the specific questions that arose from actual engagement. This texture cannot be manufactured without having had the encounter.</p><p>The <em>retrieval strength decay signature</em> &#8212; whether performance decays at rates consistent with genuine encoding. The spacing effect is the benchmark of genuine learning. Borrowed certainty builds no storage strength to draw on; performance decays monotonically and the spacing effect does not appear.</p><p>And the <em>scaffolding response curve</em> &#8212; whether a student&#8217;s performance responds appropriately to partial hints. A student with genuine partial understanding has a zone of proximal development. A partial hint activates the structure that is already forming. 
Borrowed certainty has no such zone.</p><div><hr></div><h2>What the Bot Cannot Manufacture</h2><p>Here is the argument I want to make carefully, because it is often misunderstood: this framework is not about catching AI use. It is about measuring learning directly.</p><p>An AI detector fails when AI outputs become indistinguishable from human outputs. A learning measure fails when borrowed certainty becomes indistinguishable from genuine learning &#8212; which would require borrowed certainty to produce the same neurobiological events, the same schema formation, the same durable transfer. At that point, borrowed certainty has become learning. That is not AI defeating assessment. That is learning occurring through a different pathway than we expected.</p><p>What manufacturing all seven friction traces simultaneously &#8212; without performing the underlying cognitive work &#8212; actually requires is something close to performing the underlying cognitive work. A student who spends genuine time on difficult material, who makes and corrects errors in a conceptually coherent sequence, who demonstrates transfer across novel contexts, who maintains calibrated uncertainty, who engages with genuine texture in discussion, who shows the spacing effect across weeks, and who responds appropriately to partial hints &#8212; has learned the material. At that point the game has become indistinguishable from the thing we wanted in the first place.</p><p>Natalie Lahr, a Barnard sophomore studying history and political science, describes an &#8220;anti-AI radicalizing&#8221; experience: a tutor at the writing center pasted her essay prompt into Perplexity and handed her the AI-generated outline. &#8220;Why am I even here?&#8221; she asked afterward. The question is not rhetorical. It is the correct question.</p><div><hr></div><h2>What We Must Build Instead</h2><p>The crisis of evidence facing educational institutions is not a technical problem. It is an epistemological problem. 
The evidence infrastructure we built assumed a world in which the artifact was upstream evidence of the process. That world no longer reliably exists.</p><p>What we need is an assessment infrastructure built on the process itself.</p><p>This means longitudinal process documentation &#8212; portfolios that capture the history of engagement, not just its products. It means embedded formative assessment that generates the data necessary to observe the seven friction traces over time. It means treating developmental trajectory as evidence: not what a student produced, but how their understanding developed, what they got wrong and corrected and why, where they transferred and where they didn&#8217;t.</p><p>Marc Watkins at the University of Mississippi describes an instructor who could, theoretically, set an AI to grade thirty essays during a fifteen-minute walk to Starbucks. He calls this &#8220;really scary.&#8221; He is right, but I want to be precise about why. The fear is not the efficiency. It is the loop: AI-generated assignments completed and assessed by AI agents, with human understanding nowhere in the chain. The fully automated loop is not a future dystopia. It is the logical endpoint of current trajectories. Einstein completes the course. The grader grades Einstein&#8217;s work. Both certificate and grade are real. The learning did not occur.</p><p>The artifact was once enough. It is no longer enough. The arms race between generation and detection has a winner, and it is not the detector.</p><p>We must now measure the struggle itself. Not because friction is intrinsically valuable &#8212; productive struggle matters only because of what it builds in the brain that does the struggling. 
We must measure it because the brain that struggles is the brain that learns, and the brain that learns is the only thing education was ever actually for.</p><p>The methodology is developed in full in &#8220;<a href="https://www.nikbearbrown.com/notes/Frictional/frictional">Frictional: Measuring the Struggle</a>&#8221; &#8212; a preprint specifying the seven friction components, the ensemble architecture, and the tier calibration system &#8212; and at <a href="https://www.irreducibly.xyz/">irreducibly.xyz</a>. The framework is not a secret.</p><div><hr></div><p><em>Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)).</em><br><em>bear.musinique.com &#183; skepticism.ai &#183; theorist.ai</em></p><div><hr></div><p><strong>Tags:</strong> AI detection education failure, genuine learning probability framework, friction traces assessment, Bjork performance vs learning, Einstein bot Canvas schoolwork automation</p>]]></content:encoded></item><item><title><![CDATA[The Loop That Watches Itself]]></title><description><![CDATA[On OpenAI's Automated Researcher and the Profession It Forgot to Invent]]></description><link>https://www.skepticism.ai/p/the-loop-that-watches-itself</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-loop-that-watches-itself</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Fri, 10 Apr 2026 04:00:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vb9O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!vb9O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vb9O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 424w, https://substackcdn.com/image/fetch/$s_!vb9O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 848w, https://substackcdn.com/image/fetch/$s_!vb9O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!vb9O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vb9O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png" width="1456" height="669" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:669,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1097418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/193760173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vb9O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 424w, https://substackcdn.com/image/fetch/$s_!vb9O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 848w, https://substackcdn.com/image/fetch/$s_!vb9O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!vb9O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a5a518-2499-4594-aeda-a98c67ca4743_3376x1552.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Jakub Pachocki has a timeline. By September, OpenAI plans to deploy what it calls an AI research intern &#8212; a system that can take on a specific problem of the kind a person would need days to resolve. By 2028, the full version: a multi-agent system capable of running research programs too large for humans to manage. Drug discovery. Novel proofs. Problems &#8220;formulated in text, code, or whiteboard scribbles.&#8221;</p><p>The vision is coherent. More than most in this field, it is operationally specific. And it contains a foundational error that no amount of scaling will fix.</p><p>The error isn&#8217;t technical. It&#8217;s logical.</p><h2>The Scratch Pad That Watches Itself</h2><p>Pachocki is candid about the risks. 
A system this powerful could go off the rails, get hacked, or simply misunderstand its instructions. His proposed solution is chain-of-thought monitoring &#8212; training reasoning models to externalize their work into a kind of scratch pad, then using other AI systems to watch those scratch pads for anomalous behavior.</p><p>This is not oversight. It is the appearance of oversight, implemented entirely inside the loop it was supposed to close.</p><p>In 1931, long before anyone worried about AI safety, Kurt G&#246;del established something directly relevant. No consistent formal system powerful enough to express arithmetic can prove its own consistency from within itself. Any such system will contain statements it cannot settle using only its own rules &#8212; truths it can approach but not recognize as true through internal derivation alone.</p><p>Apply this to Pachocki&#8217;s architecture. The AI researcher derives. Chain-of-thought monitoring by another AI system is more derivation. What is structurally absent is recognition &#8212; the moment of contact between a formal output and an external reality. That moment cannot be replicated by adding another layer of derivation on top.</p><p>This is not a philosophical objection. It is a logical one. The validator must be outside the system being validated. There is no version of this argument that resolves in favor of AI systems self-monitoring.</p><h2>The Proof Candidate Problem</h2><p>What an AI system produces when it generates a novel mathematical proof is not a proof. It is a proof candidate &#8212; a string of symbols following valid inference rules that may or may not establish something true.</p><p>The distinction is not semantic. A proof in the full sense is a social and epistemic act. It is what a mathematical community recognizes as establishing truth. 
Remove the recognition and you have a sophisticated computation that has no relationship to truth except statistical proximity.</p><p>The same structure applies to every domain Pachocki names.</p><p>A novel molecule with predicted therapeutic properties is not a drug. It is a candidate. The drug trial process &#8212; Phase I, Phase II, Phase III, post-market surveillance &#8212; exists precisely because we have learned, through catastrophic experience, that prediction and reality are different things and the gap between them kills people. Thalidomide. Vioxx. The graveyard of promising compounds that passed every computational test and failed in bodies.</p><p>As AI systems generate increasingly sophisticated candidates across more domains, the need for rigorous external validation does not decrease. It increases. The more sophisticated the output, the harder it is to catch the subtle error buried in ten thousand valid steps. A wrong answer that looks wrong is easy to reject. A wrong answer that looks right for nine thousand nine hundred and ninety-nine steps requires something the internal system cannot provide: an independent perspective.</p><h2>Common Cause Failure</h2><p>There is a concept in safety engineering called common cause failure. It describes what happens when two redundant systems share the same fundamental assumptions &#8212; the thing most likely to fool System A is also most likely to fool System B, because both were built on the same foundation.</p><p>Pachocki&#8217;s monitoring architecture is a common cause failure risk by design. If the system being monitored can produce subtly wrong outputs that look correct, the monitoring system trained on similar data with similar architecture will have correlated blind spots. You have not introduced an independent check. 
You have introduced a correlated one.</p><p>Every high-stakes validation system humans have built &#8212; clinical trials, aircraft certification, nuclear safety, financial auditing &#8212; depends on something genuinely outside. Not because humans are infallible. Because humans are the only validators who face consequences when wrong. The FDA reviewer whose approval leads to harm is accountable in ways that a monitoring LLM is not and cannot be.</p><p>Accountability is not a luxury feature of validation systems. It is load-bearing. Remove it and the system loses the incentive structure that makes rigorous checking worth doing.</p><h2>Stakes as the Organizing Principle</h2><p>None of this means AI systems cannot contribute to research. They already do. The question is not whether to deploy them. The question is which level of external validation each deployment requires.</p><p>This maps onto a natural taxonomy organized by stakes.</p><p>For low-stakes, reversible outputs &#8212; a song recommendation, a draft email, a code snippet that will be reviewed before deployment &#8212; AI can largely run with minimal human oversight. The cost of failure is low and recoverable.</p><p>For moderate-stakes, partially recoverable outputs &#8212; a business analysis, a research summary, an engineering specification &#8212; systematic human review at checkpoints is appropriate. The human does not need to be in the loop constantly, but must be able to catch errors before they compound.</p><p>For high-stakes, irreversible outputs &#8212; drug candidates, structural engineering recommendations, policy analysis that will drive consequential decisions, mathematical proofs that will be published as established results &#8212; continuous human oversight is not incidental to the output&#8217;s validity. It is constitutive of it.</p><p>The drug trial architecture already encodes this wisdom. 
It was not built for AI, but it is exactly the right framework for AI-assisted research in high-stakes domains. The humans do not disappear as system confidence grows. They shift function &#8212; from intensive validation to ongoing monitoring, from checking every step to catching systematic drift. This is not a concession to human limitation. It is a recognition that the system&#8217;s credibility requires external accountability at every stage.</p><h2>The Profession Pachocki Forgot to Invent</h2><p>What emerges from this analysis is not only a procedural requirement for human oversight. It is the outline of a new profession.</p><p>A plausibility auditor is not a fact-checker. Not a quality assurance technician. Not a safety researcher who looks for misaligned objectives in training runs. A plausibility auditor is someone trained specifically to stand outside sophisticated AI outputs and ask whether those outputs correspond to reality rather than merely to internal consistency.</p><p>This requires two distinct forms of expertise that current training pipelines do not produce together.</p><p>The first is deep domain knowledge &#8212; enough expertise to recognize when a result is too clean, suspiciously convergent, subtly wrong in the way that only an expert in the specific domain would catch. The AI system that generates a novel proof in algebraic geometry needs to be reviewed by someone who has spent years in algebraic geometry, not by a generalist AI safety researcher who can evaluate the logical structure of the output but cannot evaluate its mathematical significance.</p><p>The second is knowledge of AI failure modes, which differ fundamentally from human error patterns. Human errors cluster around cognitive bias, motivated reasoning, fatigue, and the known weaknesses of intuition under uncertainty. 
AI errors cluster around distribution shift, spurious correlations that held in training data, confident extrapolation beyond the valid range of the model, and &#8212; most dangerously &#8212; systematic errors that look like high-quality outputs because they were trained on a corpus where high-quality outputs had certain structural characteristics. Auditing AI outputs requires knowing which kind of error you are hunting.</p><p>The training pipeline for plausibility auditors looks nothing like current AI safety work. It looks more like producing people with genuine deep expertise in a specific domain who have additionally developed the metacognitive capacity &#8212; what Penrose, extending G&#246;del, might describe as the recognitional faculty &#8212; to evaluate outputs they could not themselves have produced. The auditor does not need to be able to generate the proof. The auditor needs to be able to recognize whether it is actually true.</p><p>This is not a concession to human limitation. The requirement for external validation is not a temporary scaffolding that will be removed once the systems mature. It follows directly from the logical structure of the problem. The validator must be outside the system being validated. This requirement does not disappear as systems become more sophisticated. If anything, it becomes harder to satisfy, because the auditor&#8217;s task grows more demanding as the outputs grow more complex.</p><h2>The Central Irony</h2><p>Pachocki&#8217;s automated researcher, if it works as described, will be the thing that finally creates the market for what it treats as unnecessary.</p><p>The more sophisticated the AI output, the harder the auditing task, the more valuable the human who can do it. OpenAI&#8217;s north star may be pointing directly at the profession it forgot to invent.</p><p>There is precedent for this dynamic. 
The industrialization of manufacturing did not eliminate the need for quality engineers &#8212; it made quality engineering a more demanding and more specialized discipline. The digitization of financial markets did not eliminate the need for auditors &#8212; it made financial auditing a more technically demanding field and produced an entire industry of forensic accountants whose value derives precisely from the complexity of what they are reviewing.</p><p>The automated researcher will produce more outputs of greater sophistication across more domains than any previous generation of scientific tools. Each of those outputs will be a candidate. Each candidate will require validation. The validation will require humans. Not because we cannot build systems smart enough to evaluate the outputs &#8212; we will almost certainly build systems with that capability. But because the evaluation&#8217;s credibility depends on the evaluator&#8217;s accountability, and accountability requires the possibility of consequence.</p><p>An AI system does not lose its job when it certifies a flawed drug candidate. A plausibility auditor does.</p><h2>What Governments Actually Need to Figure Out</h2><p>Pachocki acknowledges that the concentrated power implications of this technology are &#8220;a big challenge for governments to figure out.&#8221; He is right that governments need to be involved, and right that OpenAI alone cannot resolve the governance questions.</p><p>But the governance architecture he gestures toward does not yet exist, and the reason it does not exist is that the validation infrastructure that would make it functional has not been built. You cannot regulate AI research outputs if there is no institutionalized capacity to evaluate whether those outputs are trustworthy. 
Chain-of-thought monitoring provides the appearance of evaluability without the substance.</p><p>The question for 2028 &#8212; when Pachocki&#8217;s multi-agent research system is scheduled to arrive &#8212; is not only whether the system works. It is whether we have built, in parallel, the human capacity to stand outside the most powerful reasoning systems ever constructed and ask the oldest question in epistemology.</p><p>Is it actually true?</p><p>No algorithm answers that. Someone has to.</p><div><hr></div><p><em>bear.musinique.com &#183; skepticism.ai &#183; theorist.ai</em></p><p><strong>Tags:</strong> AI plausibility auditor, G&#246;del incompleteness AI oversight, OpenAI automated researcher chain-of-thought monitoring, common cause failure AI safety, high-stakes AI</p>]]></content:encoded></item><item><title><![CDATA[Brutalist.art - The "Beautiful.ai" that Educators Need]]></title><description><![CDATA[Talking to a slide deck through Claude Code]]></description><link>https://www.skepticism.ai/p/brutalistart-the-beautifulai-that</link><guid isPermaLink="false">https://www.skepticism.ai/p/brutalistart-the-beautifulai-that</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Mon, 06 Apr 2026 00:49:13 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/193305176/36365434e9ead489eec0094e88101873.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h3><strong>The Slide Deck You Built Was Not for the Learner</strong></h3><h3><strong>It Was for You</strong></h3><p>There is a lie at the center of most educational content production, and it goes mostly unnamed because naming it is professionally uncomfortable. 
The lie is this: the slide deck you built last Tuesday, the one you spent three hours arranging, the one with the custom fonts and the carefully chosen images and the thirty-seven bullets across fourteen slides &#8212; that deck was not built for the people who had to sit through it. It was built for you. 
It was built so you could feel the relief of having covered the material. It was built so the topic had a container. It was built because you had a deadline and a template and a vague professional obligation to produce <em>something</em>, and a slide deck is always <em>something</em>.</p><p>The learner &#8212; the specific human being with specific prior knowledge and a specific amount of time and a specific gap between what they currently understand and what they need to understand &#8212; that person never really entered the room where the deck was being built. What entered the room instead was a topic. And a topic is not a person.</p><p>Brutalist was built to address this. Not to address it gently, with suggestions and style guides and best-practice checklists. To address it structurally, in the architecture of the tool itself, before a single slide gets made.</p><h3><strong>The Architecture of Avoidance</strong></h3><p>The conventional workflow for building educational content runs roughly like this: you receive a topic (or assign yourself one), you collect material &#8212; readings, notes, data, existing slides &#8212; and you begin arranging it. If you are experienced, you arrange it with craft. You think about sequence and pacing. You choose examples. You know when to deploy a metaphor and when to let a statistic land without ornamentation. The result, at its best, is a coherent and well-paced presentation of material.</p><p>What you have not done &#8212; and this is the gap that produces most failures in educational content &#8212; is start from what the learner will be able to <em>do</em> when you are finished with them. You have started from what you know, and you have worked forward through that knowledge toward a clean ending. This is a completely understandable approach, and it produces content that no ordinary standard of review would flag as failing. It is organized. It is clear. 
It covers the material.</p><p>It just doesn&#8217;t reliably produce learning.</p><p>Backwards design &#8212; the pedagogical framework that governs every output Brutalist produces &#8212; insists on reversing this sequence. You begin with a measurable outcome: not a topic, not a list of things the instructor will present, but a single sentence describing what a learner will be able to <em>do</em> at the end that they could not do at the beginning. Outcomes of this kind read as imperatives: construct a DAG from domain knowledge and identify all backdoor paths; distinguish between a learning outcome and a topic; evaluate a rubric for the difference between qualitative descriptions and observable behaviors. These are not aspirations. They are commitments &#8212; to a learner, to a measurable change, to the possibility of knowing whether the teaching worked.</p><p>The reason most content production doesn&#8217;t begin here is not ignorance. Most instructors know what backwards design is. The reason is that starting from a learning outcome is harder than starting from a topic, and the tools available for producing educational content &#8212; PowerPoint, Keynote, Google Slides &#8212; offer no friction whatsoever against starting from the wrong place. They are indifferent to the question of who the learner is and what the learner needs to be able to do. They are happy to help you arrange forty slides around a topic, and they will never once ask whether the arrangement serves a learner or just a speaker.</p><p>Brutalist asks. It asks before it produces anything. In interactive mode &#8212; the default &#8212; it will not generate a single slide until it has confirmed the audience, confirmed the outcome, and confirmed that the outcome is measurable. &#8220;Understand X&#8221; is not measurable. Brutalist says so, explicitly, in the voice of a pedagogical skeptic rather than a customer-service chatbot. <em>That describes a mental state, not a behavior. 
A learner can&#8217;t demonstrate &#8216;understanding.&#8217; What&#8217;s the one thing they should be able to do?</em> This is not rudeness. It is the one question that changes the output.</p><h3><strong>The Phase Gate as Moral Commitment</strong></h3><p>There is a design decision embedded in Brutalist that deserves more attention than it usually gets in conversations about AI tools, which tend to focus on capability rather than constraint. That decision is the phase gate.</p><p>A phase gate is exactly what it sounds like: a gate that holds until a phase is complete. In Brutalist, the first gate holds at source confirmation &#8212; no output until the source material is present. The second holds at outcome identification &#8212; no output until the outcome can be stated in one sentence. The third holds at form confirmation &#8212; no output until the right command for the content is confirmed. Only then does the tool produce anything.</p><p>This is unusual. Most AI tools are designed to produce output as quickly as possible, because output is what users think they want and user satisfaction is what tools are optimized for. The experience of receiving forty slides in thirty seconds feels like productivity. It feels like the machine is working for you. What it actually is, much of the time, is the machine generating plausible-looking content that fills the form without serving the function &#8212; decoration rather than argument, coverage rather than learning.</p><p>Brutalist is optimized for the learner, not the user. These are not the same person. The user is the instructor who wants a slide deck. The learner is the person who will sit in front of that deck and try to change what they understand. Optimizing for the user produces faster output. Optimizing for the learner produces harder questions before any output is generated at all.</p><p>The phase gate is where this optimization manifests in the tool&#8217;s behavior. 
It is the structural embodiment of a moral position: that output built on wrong assumptions about audience or outcome wastes more time than the intake that would have caught those assumptions. Two minutes of friction before the deck is built is less costly than an hour of instruction that doesn&#8217;t change what anyone understands.</p><h3><strong>What &#8220;Understand X&#8221; Is Actually Doing</strong></h3><p>Spend any time in educational settings &#8212; as a student, as an instructor, as a curriculum designer &#8212; and you develop a particular sensitivity to the phrase &#8220;by the end of this, students will understand X.&#8221; It appears in syllabi, in lesson plans, in course descriptions, in accreditation documents. It appears so frequently and so unexamined that most people who write it have stopped noticing it at all. It is pedagogical wallpaper.</p><p>But the phrase is doing something specific, and it is worth naming. &#8220;Students will understand X&#8221; is a sentence that sounds like a learning outcome and functions as an escape from accountability. Understanding is a mental state. You cannot observe it, you cannot measure it, you cannot score it on a rubric or assess it in a portfolio. You can ask someone to demonstrate understanding &#8212; which means you are no longer assessing understanding, you are assessing a behavior &#8212; but the phrase as written commits you to nothing. It is a promise with no deliverable attached.</p><p>The reason this matters to a tool like Brutalist is that the learning outcome is not just the first step in backwards design. It is the specification for everything that follows. The slides that get built, the visual types that get selected, the checks for understanding that get inserted every four to six slides &#8212; these are all derived from the outcome, working backward from what the learner needs to be able to do. If the outcome is vague, the derivation has nothing to anchor to. 
The result is a deck that covers material in the general direction of a topic, which is not the same thing as a deck that moves a specific learner from a specific gap to a specific capability.</p><p>This is why Brutalist treats &#8220;understand X&#8221; not as a minor stylistic imprecision but as a structural failure that must be corrected before building anything. The outcome is the foundation. A vague foundation does not produce a stable structure. It produces decoration.</p><h3><strong>Brutalist HTML and the Question of Deployment</strong></h3><p>There is a second commitment embedded in this tool that is worth examining, and it lives in the signature output: the brutalist HTML presentation. Not a PowerPoint file. Not a PDF. A single self-contained HTML file, deployable immediately, built on a design system called Musinique brutalist &#8212; JetBrains Mono, parchment tokens, per-slide audio, keyboard navigation, zero decorative radius.</p><p>The choice of HTML as the primary output format is not aesthetic. It is pedagogical and practical simultaneously. A PowerPoint file requires PowerPoint. A Google Slides file requires Google. An HTML file requires a browser, which is to say it requires nothing &#8212; it deploys anywhere, runs without software dependencies, and can be shared as a URL or a file with equal ease. The friction of tool access is a real barrier to distribution, and distribution is where educational content either serves learners or stops serving them.</p><p>The design choices embedded in the brutalist system &#8212; every slide does one thing, every title is a claim not a topic, components are typed by what they communicate rather than how they look &#8212; these are cognitive load principles encoded as aesthetic constraints. The slide with a hero number and a two-line muted caption exists because research on split attention and redundancy effects has things to say about how visual and verbal information compete for working memory. 
The check for understanding every four to six slides exists because spaced retrieval practice produces stronger retention than massed coverage. The design is not decoration. It is applied cognitive science, translated into a component library and a phase-gated workflow.</p><h3><strong>The Pushback Layer</strong></h3><p>Brutalist pushes back. This is the part of the tool that most users encounter with some surprise, because tools &#8212; especially AI tools &#8212; are generally not in the business of disagreement. They are in the business of helpfulness, and helpfulness has been operationally defined as producing what the user asks for as quickly as possible. Friction is a UX failure. Pushback is an anomaly.</p><p>In Brutalist, pushback is a feature. Not an accident of the model&#8217;s personality or a quirk of the prompting, but a designed behavior with specific triggers and specific exit conditions. Weak learning outcomes get flagged &#8212; not once, politely, but persistently, with an offer to rewrite the outcome if the user fails the measurability test twice. Vague audience descriptions get challenged, because &#8220;college students&#8221; is not an audience and the specificity that changes the content, examples, and pacing cannot be inferred from it. Mismatched command choices get named &#8212; if the content calls for a <code>/showtell</code> and the user has requested <code>/slides</code>, the tool explains the difference in instructional design terms before proceeding.</p><p>Every pushback ends with a path forward. This is the moral discipline that separates useful friction from obstruction. The tool is not in the business of refusing to build. It is in the business of building toward the right specification, and the right specification cannot be assumed from the wrong brief. 
The pushback is the tool asking the question that the instructor should have asked before they opened a blank deck and started arranging.</p><p>What is the learner supposed to be able to do?</p><p>Everything else follows from that.</p><div><hr></div><p><em>Brutalist is part of the Humanitarians AI Ecosystem. The primary workflow: </em><code>/slides</code> produces the blueprint. <code>/brutalist</code> converts it to HTML. <code>/deck</code> does both in one command. Type <code>help</code> to begin.</p><p><strong>Tags:</strong> Brutalist instructional design engine, backwards design pedagogy, learning outcomes Bloom&#8217;s taxonomy, brutalist HTML presentation system, educational content production failure</p>]]></content:encoded></item><item><title><![CDATA[The Struggle Is the Point]]></title><description><![CDATA[What We Lost When We Made the Artifact the Grade]]></description><link>https://www.skepticism.ai/p/the-struggle-is-the-point</link><guid isPermaLink="false">https://www.skepticism.ai/p/the-struggle-is-the-point</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sat, 04 Apr 2026 03:35:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SDu5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SDu5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!SDu5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 424w, https://substackcdn.com/image/fetch/$s_!SDu5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 848w, https://substackcdn.com/image/fetch/$s_!SDu5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 1272w, https://substackcdn.com/image/fetch/$s_!SDu5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SDu5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png" width="1456" height="543" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:543,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:167415,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.skepticism.ai/i/193135422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" 
alt="" srcset="https://substackcdn.com/image/fetch/$s_!SDu5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 424w, https://substackcdn.com/image/fetch/$s_!SDu5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 848w, https://substackcdn.com/image/fetch/$s_!SDu5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 1272w, https://substackcdn.com/image/fetch/$s_!SDu5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af9b2b9-9f0e-46ad-aff1-94d64d45472e_1886x704.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The paper rough draft: <strong><a href="https://www.nikbearbrown.com/notes/Papers/glp-framework-genuine-learning-probability">https://www.nikbearbrown.com/notes/Papers/glp-framework-genuine-learning-probability</a></strong></p><h2>What We Lost When We Made the Artifact the Grade</h2><p>Here is the situation as it actually exists, not as anyone in an official capacity is willing to describe it clearly.</p><p>A student sits down to write a paper. The paper is due in twelve hours. The student has three other assignments due this week, a job that starts at six, and the accumulated evidence of two semesters telling them that the grade lives in the artifact &#8212; the paper itself &#8212; not in the thinking that was supposed to produce it. The student opens an AI tool. The paper gets written. It is, by most measurable standards, better than what the student would have produced alone at midnight after a shift.</p><p>In the next building, the professor who assigned the paper has used AI to draft the assignment prompt, the rubric, and the feedback comments they will paste into the LMS after running the submitted papers through a grading interface that summarizes them automatically.</p><p>Neither of them is a villain. Both of them are responding rationally to a system that has always rewarded the artifact and never found a way to measure the process that was supposed to produce it. Generative AI did not create this problem. It revealed it &#8212; suddenly, completely, and without the courtesy of suggesting a solution.</p><p>This essay is about what the solution might look like. It is not technical.
The technical apparatus exists and is documented elsewhere. What doesn&#8217;t exist yet, in language plain enough to be useful, is a way of talking about why the solution matters &#8212; what it would mean for a student to be seen by an educational system that has, for most of institutional history, been looking at the wrong thing.</p><h2>What the Artifact Was Supposed to Prove</h2><p>The essay, the exam, the project, the recorded performance &#8212; these were never the thing education cared about. They were evidence. The artifact was valuable because it was causally downstream of a process: the reading, the confusion, the rereading, the argument with yourself at two in the morning about whether you actually understood what you thought you understood. The artifact was a trace of that process. Grading the artifact was a way of inferring the process, because the two were coupled tightly enough that measuring one was effectively measuring both.</p><p>That coupling has broken. This is not a scandal or a failure or a temporary condition that better AI detection will resolve. It is a structural change in what artifacts can tell us, and it is permanent. The forensic window &#8212; the period during which you can reliably distinguish a human-written essay from an AI-generated one &#8212; is closing sequentially across every domain in which humans produce artifacts. In writing it is largely closed already. In code it is closing. The detectors trained on today&#8217;s AI outputs will be obsolete when tomorrow&#8217;s outputs arrive.</p><p>Every educational institution that is currently responding to this situation by installing better detection software is solving last year&#8217;s problem with next year&#8217;s obsolescence already scheduled.</p><h2>The Complicity No One Names</h2><p>The conversation about AI and academic integrity is almost entirely conducted as a conversation about student dishonesty. This framing is not wrong, exactly. 
It is just so incomplete as to function as a kind of dishonesty itself.</p><p>Students are using AI because the artifact is the grade. The artifact is the grade because grading the process &#8212; the confusion, the revision, the dead ends, the moments of genuine understanding &#8212; is hard, and institutions have never built the infrastructure to do it at scale. The result is a system that has always been measuring the wrong thing, and now the wrong thing can be produced in thirty seconds by a tool that costs less than a textbook.</p><p>Professors are not innocent bystanders. Many are using the same tools to manage the same impossible workloads &#8212; drafting prompts, generating feedback, summarizing submissions &#8212; that the institution&#8217;s growth model has made unmanageable. The incentive structure reaches all the way up. Publish or perish does not reward good teaching. The institution does not require good teaching to be measurable; it requires only that the artifacts of teaching &#8212; syllabi, course evaluations, enrollment numbers &#8212; look like good teaching.</p><p>The student who uses AI to write a paper is not defecting from a system that is working. They are defecting from a system that has always asked them to perform learning rather than do it, and has never been able to tell the difference. AI has not corrupted that system. AI has made the corruption visible.</p><p>This is the thing worth sitting with before any solution is proposed: the problem is not the tools. The problem is what we decided to measure, and what we decided to ignore, long before the tools arrived.</p><h2>What Genuine Learning Leaves Behind</h2><p>Here is what the research shows, stated plainly.</p><p>When a human being genuinely learns something hard, the process is biological. Neurons fire in response to the gap between what the learner expected and what they encountered. That gap &#8212; the prediction error &#8212; is uncomfortable.
It is the feeling of not understanding, the specific texture of confusion that is different from ignorance because it knows what it doesn&#8217;t know. Working through that discomfort produces measurable changes: in how information is encoded, in how long it persists, in whether it transfers to new contexts or stays locked to the specific example through which it was learned.</p><p>Genuine learning leaves traces. Not in the artifact &#8212; the artifact is the product, and products can be manufactured without the process. The traces are in the behavior that surrounds the artifact&#8217;s production: the time spent on the hard parts, the errors that follow a coherent path as the mental model develops, the ability to apply what was learned to a problem that looks different on the surface but has the same underlying structure, the calibrated uncertainty of someone who knows not just what they know but what they don&#8217;t.</p><p>None of these traces require looking at the artifact. They require looking at the process.</p><p>This is what the concept of friction in assessment is about. Not friction as punishment, not friction as obstacle, not friction as the gatekeeping logic that has always made elite education a credentialing system for people who already had advantages. Friction as signal. The productive struggle of genuine learning &#8212; the confusion, the revision, the wrong turn and the recovery &#8212; is not the unfortunate cost of arriving at the artifact. It is the thing the artifact was supposed to be evidence of. 
It is the learning itself.</p><p>The proposal is to measure it directly.</p><h2>What This Would Mean for a Student</h2><p>I want to be specific about what it would feel like to be in a classroom where this kind of assessment exists, because the abstract case is easy to make and the human case is the one that matters.</p><p>It would mean that the time you spent genuinely confused about something counts &#8212; not as performance of confusion, not as a participation grade for looking engaged, but as actual data about actual thinking. It would mean that the draft that was a mess, the question you asked in office hours that revealed you&#8217;d been working from the wrong assumption for two weeks, the revision that turned a competent response into a thinking one &#8212; these are evidence of the thing education is supposed to produce. They would be part of the record.</p><p>It would also mean that the smooth, perfectly structured submission produced at midnight with no evidence of genuine engagement is not, by itself, proof of anything. The artifact is not worthless. It has not become zero evidence. It has become insufficient evidence. Insufficient means it needs a partner &#8212; and the partner is the process that was supposed to produce it.</p><p>This is not a punishment for using AI. It is a recognition that the artifact alone was never the right thing to measure, and that the tools which have made that limitation undeniable have also, in the same move, made the solution more urgent than it has ever been.</p><h2>The Uncomfortable Truth About Friction</h2><p>The research contains a finding that takes a moment to absorb. The smooth, well-structured artifact &#8212; the one that reads with perfect confidence, that has no rough edges, no places where the writer lost the thread and found it again &#8212; may be mild negative evidence of genuine learning.</p><p>The rough, searching one may be positive evidence.</p><p>Not because roughness is a virtue. 
Not because difficulty signals intelligence. Because genuine struggle with hard material characteristically produces texture &#8212; places where the thinking was actually happening, where the writer was working something out rather than reporting a conclusion they arrived at before they started writing. The friction of genuine learning leaves marks. The borrowed certainty of an AI-assisted artifact is often smooth in a way that real thinking, at its most effortful, is not.</p><p>This is uncomfortable because educational institutions have spent generations rewarding the smooth artifact and interpreting roughness as inadequacy. We taught students that the goal was to arrive at certainty quickly and present it cleanly. We built rubrics that rewarded the appearance of knowing and had no mechanism for distinguishing it from the thing itself.</p><p>Generative AI did not create that confusion. It just made it expensive.</p><h2>What Comes Next</h2><p>The framework that formalizes this argument &#8212; the specific components of friction that genuine learning leaves in observable data, the way those components can be measured, combined, and calibrated to different kinds of cognitive work &#8212; is documented in the paper that follows this introduction. It is technical in the way that any serious methodology is technical, and it is also not the point of this essay.</p><p>The point of this essay is this: the crisis that AI has created for educational assessment is not primarily a cheating problem. It is an evidence problem. The artifact, which was always a proxy for the process, can now be produced without the process. 
Any response that tries to restore the artifact&#8217;s evidentiary value by detecting AI use is fighting a war that the progression of technology has already decided.</p><p>The response that might actually work is to stop relying on the artifact as the sole evidence of learning, and start building the infrastructure to measure what the artifact was always supposed to be downstream of.</p><p>Students are not wrong that the system gives them no choice but to produce the artifact by whatever means are available. They are responding rationally to a broken incentive structure. Educators are not wrong that something has been lost when the struggle disappears from the work. They are mourning the only evidence they were ever given access to.</p><p>The argument this paper makes is that the struggle was always the point. It is still the point. We have spent a long time measuring the wrong thing, and the tools that have made that undeniable have also, in the process, handed us a reason to build something better.</p><p>The methods for measuring the struggle exist. The question is whether the institutions that credential learning are willing to build the infrastructure around them before the artifact becomes so decoupled from the process that the credential stops meaning anything at all.</p><p>That window is not closed. But it is not wide open either.</p><p>The struggle is the point.
It is time to measure it.</p><div><hr></div><p><strong>Tags:</strong> AI academic integrity assessment friction traces genuine learning, generative AI education artifact decoupling, GLP framework formative assessment process evidence, student professor AI use structural incentives, irreducibly human cognitive engagement pedagogy</p>]]></content:encoded></item><item><title><![CDATA[Boondoggling: You Are the Conductor]]></title><description><![CDATA[What Most Developers Miss About AI-Assisted Programming]]></description><link>https://www.skepticism.ai/p/boondoggling-you-are-the-conductor</link><guid isPermaLink="false">https://www.skepticism.ai/p/boondoggling-you-are-the-conductor</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Wed, 01 Apr 2026 03:16:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/192806158/0f765b9715f44ad9ce88a372a7e3a40d.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>There is a moment in every AI-assisted coding session that tells you everything about the developer sitting at the keyboard. The model generates a block of code &#8212; clean, confident, internally consistent. It compiles. The tests pass. The developer commits it and moves on.</p><p>What they never ask is the question that would save them three weeks in six months: <em>Is this solving the right problem?</em></p><p>I came to <a href="https://www.boondoggling.ai/">Boondoggling</a> the way most people come to uncomfortable realizations &#8212; after the thing that was supposed to work didn&#8217;t. The code was technically correct. The architecture was sound. And it was aimed, with beautiful precision, at a problem that had already been reframed by the time implementation began. Claude had done exactly what it was told. Nobody had told it the right thing.</p><p>This is not an AI failure. This is a human supervisory failure. 
And it is the failure that the developers now spending $20 a month on AI subscriptions are making, every day, at scale.</p><div><hr></div><h2>The 20% Problem</h2><p>Here is what most developers actually do with Claude Code or Cursor: they describe a problem, they delegate the implementation, they verify that the output compiles, and they ship.</p><p>That is not 100% of the job. That is 20% of the job dressed up as 100%.</p><p>The other 80% &#8212; the part that determines whether the fast, confident, technically impeccable output is pointed in the right direction &#8212; requires five capacities that no model possesses. Not because current models are limited. Because of what statistical pattern matching structurally is and is not.</p><p>Claude solves faster than any human. That gap will not close. What will not change is this: the model cannot verify whether its output is grounded in the specific domain reality at hand. It cannot reframe a poorly formulated problem. It cannot interpret what an accurate result means in a specific human context. And it cannot integrate multiple legitimate but conflicting perspectives into a recommendation that someone is accountable for.</p><p>These are not bugs to be patched in the next release. They are features of the architecture. The model has been trained on what is common and likely. Your specific project, your specific codebase, your specific business constraint &#8212; these are neither common nor likely. The gap between what the model knows and what your situation requires is where all the damage lives.</p><div><hr></div><h2>The Conductor</h2><p>The <a href="https://www.boondoggling.ai/">Boondoggling methodology</a> is built around a single metaphor that earns its place rather than announcing itself. A conductor does not play any instrument. They hold the whole performance in mind while each section plays its part. They hear the wrong note before the score confirms it. 
They decide which piece is worth performing and how it should be interpreted. The performance collapses without them &#8212; even though they produce no sound themselves.</p><p>This is what graduate-level AI supervision looks like. And it is the role that most AI integration workflows currently fail to develop.</p><p>The developers who are getting genuine leverage from AI coding tools are not out-prompting the model. They are conducting it. Before Claude Code sees a single requirement, they have decided what the problem actually is. Before the first function is generated, they have specified what done looks like. After the output arrives, they verify it against domain reality before the next step begins.</p><p>The ones who are mostly generating technical debt faster than they generated it before &#8212; they learned to play their instrument. Nobody taught them to conduct.</p><div><hr></div><h2>Five Things the Model Cannot Do for You</h2><p>The <a href="https://www.irreducibly.xyz/notes/Irreducibly-Human/Irreducibly-Human-Conducting-AI">Irreducibly Human course</a> at Northeastern &#8212; built on the same framework as Boondoggling &#8212; names these five supervisory capacities precisely. Not as professional development recommendations. As structural requirements for AI-assisted work.</p><p><strong>Plausibility auditing</strong> is the judgment that happens before verification. It is knowing an output is wrong because of what you know about the domain &#8212; not because you ran a test. The model cannot audit its own plausibility. It does not know what it does not know. When it confabulates &#8212; when it produces a confident, internally consistent answer that is not grounded in reality &#8212; it does so fluently. The code runs. The tests pass. Plausibility auditing is the human capacity that catches this before it ships.</p><p><strong>Problem formulation</strong> is deciding what the mission is before the model sees it. Not after. 
The quality of every output is determined here, at the moment of framing, before a single prompt is written. AI optimizes for the common and likely; humans must reframe toward the salient and important. The Semmelweis case &#8212; the formulation that saves lives was not the computationally tractable one &#8212; is the permanent lesson here. Hand problem definition to the model and you have not delegated. You have abdicated.</p><p><strong>Tool orchestration</strong> is the sequencing decision. Which tool, in what order, with what context, and what does done look like at each handoff. The developer who reaches for Claude Code because it is already open is not orchestrating &#8212; they are defaulting. Orchestration means choosing an audit tool with a different failure mode than the generation tool, so that each catches the other&#8217;s blind spots.</p><p><strong>Interpretive judgment</strong> is supplying meaning that the model cannot supply. Which of these three implementations is correct for this context &#8212; not in the abstract, but here, in this organization, for this user, at this moment. The model can tell you what each implementation does. It cannot tell you what it means. Somebody has to sign their name to that answer. The model cannot do that either.</p><p><strong>Executive integration</strong> is not sequencing the four prior capacities. It is holding all four simultaneously toward a unified goal &#8212; recognizing when a plausibility audit finding requires problem formulation to re-engage, when an orchestration decision surfaces an interpretive judgment that wasn&#8217;t on the agenda. This is what the conductor does in the final movement of a difficult performance: not running a checklist, but maintaining a unified hold on where the whole thing is going.</p><p>Better models will not close these gaps.
They will widen the stakes of them.</p><div><hr></div><h2>What the Build Actually Looks Like</h2><p>A moderately complex website &#8212; six routes, hybrid architecture, admin dashboard, community upload pipeline, sandboxed iframe viewer, full prompt library &#8212; built using the Boondoggling method took roughly three hours. Two hours of conversation with <a href="https://www.boondoggling.ai/tools/gru-tool">Gru</a>, the custom orchestration prompt. One hour with Claude Code.</p><p>Nearly all the time was spent talking. Not coding. Not debugging. Not searching documentation. Talking &#8212; precisely, in the right order, about what the site was, who it was for, what it would and would not do, and what each piece needed to be true before the next piece began.</p><p>The result was a Boondoggle Score: a conductor&#8217;s score with two simultaneous parts. The Minion Part &#8212; exact prompts for Claude, in dependency order, each with context required, expected output, and a handoff condition. The Gru Part &#8212; precise human actions, labeled by supervisory capacity, in the same dependency order.</p><p>Nine Claude tasks. Eleven human tasks. More human decisions than machine outputs. But the Claude tasks ran fast and clean because the structure was already there. Every prompt worked &#8212; not because the prompts were magic, but because the conversation that produced them was structured.</p><p>The handoff condition is the most important element in the score. It is the conductor&#8217;s downbeat. A model that does not know when to stop will stop at the wrong place or not stop at all.</p><div><hr></div><h2>The Vocabulary of What Is Actually Happening</h2><p>The Boondoggling framework gives names to the different kinds of work in an AI-assisted build. 
The names are worth knowing because naming a thing is the first step to doing it deliberately.</p><p><em>Frick-fracking</em> is the iterative work &#8212; small precise edits, one thing changed at a time, the kind of work Claude Code does exceptionally well when given clear scope. This is where the actual build lives after the structure is established. It is productive and it does not require your full attention. It is not, however, the whole job.</p><p><em>Noodling</em> is the dreaming phase. Figuring out what to build before figuring out how. This happens before the model sees anything. It is the lightest touch &#8212; a thought that something could be interesting, a question about whether this feature serves the person the thing is built for. The discipline is knowing which noodle is worth developing. The problem statement is the filter.</p><p><em>Confabulating</em> is the danger word. When the model produces plausible output that is not grounded in reality. It sounds correct. It reads correctly. The code compiles. Only domain knowledge catches it. This is precisely the failure mode that plausibility auditing exists to address &#8212; and precisely the failure mode that developers who have learned to prompt but not to supervise will miss every time.</p><div><hr></div><h2>What You Are Actually Responsible For</h2><p>The developers most effectively using AI coding tools are not the ones generating the most code. They are the ones who have understood that their job changed &#8212; and changed in a specific direction.</p><p>The job is not to type less. The job is to decide more precisely.</p><p>You are responsible for what the problem actually is. You are responsible for what done actually looks like. You are responsible for whether the fast, confident, technically impeccable output is pointed at reality or pointed at a plausible simulation of it. The model takes no responsibility for any of this. It cannot.</p><p>The minions are excellent. They are enthusiastic. 
They will execute exactly what they understood you to mean.</p><p>That gap &#8212; between what you meant and what they understood &#8212; is where all the damage lives.</p><p>Anyone can use Claude Code. The question is whether you are playing an instrument or conducting the orchestra.</p><div><hr></div><p><strong>Tags:</strong> boondoggling AI methodology, Claude Code supervision framework, AI-assisted software development, solve-verify asymmetry, plausibility auditing human-AI collaboration</p>]]></content:encoded></item><item><title><![CDATA[Medhavy Hub Walkthrough]]></title><description><![CDATA[Intelligent Textbook]]></description><link>https://www.skepticism.ai/p/medhavy-hub-walkthrough</link><guid isPermaLink="false">https://www.skepticism.ai/p/medhavy-hub-walkthrough</guid><dc:creator><![CDATA[Nik Bear Brown]]></dc:creator><pubDate>Sun, 29 Mar 2026 06:49:35 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/192485038/12781cf2a5d9b44351da983db4e46790.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Ask your textbook a question. Get a sourced, context-aware answer &#8212; instantly. This is a full walkthrough of Medhavy Hub, the AI-powered textbook platform built for students who want more than a page to stare at.</p><p>In this video, we walk through everything: creating your account, requesting access, navigating chapters, and using the built-in AI Assistant Panel to study smarter across Physics Volume 1 and Cancer Biology.</p><p>The AI Assistant answers from the active chapter &#8212; not the open web &#8212; and shows every source it used so you can trust and verify the response. Ask follow-up questions, request step-by-step derivations, generate concept-check questions, get the answer key, and loop back to the text with stronger understanding. 
Every session is yours to pace and direct.</p><p>This is what an interactive textbook actually looks like.</p><div><hr></div><p>&#128279; Create your free account &#8594; medhavy.ai</p>]]></content:encoded></item></channel></rss>