Nik Bear Brown - Computational Skepticism

The Measurement That Wasn't There

Nik Bear Brown — Wed, 29 Apr 2026 19:15:52 GMT

There is a paper circulating in AI education circles as a counterpoint to the skeptics. Wang and Zhang, published in February 2026 in the International Journal of Educational Technology in Higher Education, a Springer Nature journal. It passed peer review. It has four studies. It has 912 participants across three continents. It deploys PLS-SEM and fsQCA and IPMA, and it has a methodology flowchart with seven stages, and it uses the word “paradoxical” in its title and delivers on the promise — two hypotheses come back significant in the wrong direction, which the authors then claim as the actual discovery.

I want to be honest about what I am about to argue. The Wang and Fan retraction that prompted this conversation is a case of bad causal evidence overclaimed. That is one problem. Wang and Zhang is a different problem. It is methodologically elaborate work that is not actually measuring what it claims to measure. In some ways it is harder to catch, because the machinery is impressive and the numbers are clean and the peer reviewers, like the rest of us, have been trained to evaluate internal consistency rather than construct validity.

Strip away the machinery. Here is what Wang and Zhang actually did.

Nine hundred and twelve business students filled out a questionnaire. The questionnaire asked them to rate their agreement with statements like: “My interaction with the generative AI has led me to question my long-held assumptions.” And: “Using generative AI has fundamentally changed the way I understand certain subjects.” And: “My use of generative AI has prompted a deep re-evaluation of my ways of thinking.”

Five such items, averaged together, are the outcome variable. The paper calls this outcome “transformative learning experience.”

It is not transformative learning experience. It is self-reported perception of transformative learning experience. The difference is not semantic. It is the entire study.

Jack Mezirow’s transformative learning theory — the anchor the paper correctly treats as its theoretical foundation — describes a slow, disorienting, often unconscious process of perspective reconstruction. Mezirow was not describing a feeling students could report after two weeks. He was describing something that happens to people over months or years, something they often cannot name while it is occurring, something that shows up in changed behavior and revised assumptions and different relationships to knowledge — not in survey responses. The theory Mezirow actually wrote is about the kind of learning that happens when a person discovers that the framework they have been using to understand the world is inadequate. That does not feel like an insight. It feels like vertigo.

Measuring this with five Likert items is not a methodological shortcut. It is a category error. You might as well measure altitude with a thermometer and then report, with SRMR = 0.031, that higher temperatures correlate with being closer to the sky.

The paper knows this, in the way that papers of this type always know what they are doing, which is to say: it is in the limitations section. “Generalizability is bounded by exclusive reliance on self-reported perceptions,” the authors write, and then proceed to spend eight thousand words drawing inferences about transformative learning from self-reported perceptions. The limitation is disclosed and then ignored. This is the standard operation.

Now add the demand characteristics.

I said “convenience sampling from business schools,” and that is the phrase papers in this area use. What it usually means in practice is that the 912 participants are the researchers’ own students, or the students of colleagues at institutions where the researchers have relationships. The paper does not specify. It describes “multistage purposive sampling” and leaves the details of how institutions were contacted and how students were recruited conspicuously absent. But here is what we know: the qualitative component, the 45 interviews providing “rich process-oriented insights,” was drawn “exclusively from the Chinese sample,” and one of the authors is at a Chinese university. We know the students knew they were participating in an academic study. We know, from more than half a century of social psychology research on demand characteristics, that students who are aware of being studied by people who may have access to their grades tend to report what they believe is the expected or approved answer.

The paper deploys a temporal separation of two weeks between waves to “minimize common method bias.” Two weeks between surveys does not eliminate the problem of students reporting what they believe the study wants to hear. It separates the questions. It does not change who is answering them or why.

I want to name the third problem, which is the one I raised in the group and which I think is the most structurally interesting.

Almost every learning environment is a massive violation of SUTVA — the Stable Unit Treatment Value Assumption. SUTVA says that the treatment received by one unit doesn’t affect the outcomes for another. In a classroom, this is almost never true. Students talk to each other. They share AI tools. They discuss assignments. They copy strategies. One student’s approach to using ChatGPT influences other students’ approaches, which influences their outcomes, which shows up in the data as independent observations that are not independent at all.
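Stated formally, in standard potential-outcomes notation (the notation is added here for illustration and is not taken from the paper): let z_j be the treatment student j receives and Y_i student i's outcome. The no-interference part of SUTVA is the claim that

```latex
Y_i(z_1, z_2, \dots, z_n) = Y_i(z_i) \quad \text{for every student } i
```

A classroom where students share prompts and copy each other's strategies is one where the left-hand side really does depend on the other z_j, so the equality fails.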

In a networked environment where 912 business students across three continents are all using the same publicly available AI tools, the assumption that each student’s “transformative learning experience” is a function solely of their individual “pedagogical partnership orientation” and “cognitive vigilance” and “efficiency orientation” is not a simplifying assumption. It is an assumption that, if violated — and it is almost certainly violated — means the causal model is wrong in ways the statistical machinery cannot detect. PLS-SEM with excellent fit statistics can sit on top of fundamentally confounded data and produce clean-looking path coefficients. The cleanliness of the output is not evidence of the validity of the model. It is evidence that the model fits the data it was given.
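The point about clean coefficients on confounded data can be shown with a toy simulation. This is a minimal sketch in Python under loudly stated assumptions: the variable names, the single unmeasured "engagement" factor, and the unit-variance noise are all hypothetical, and plain correlation stands in for PLS-SEM. By construction, partnership orientation has zero causal effect on the outcome.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid any dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n=912, seed=0):
    """Hypothetical data: one unmeasured factor drives both self-reports."""
    rng = random.Random(seed)
    partnership, transformation = [], []
    for _ in range(n):
        engagement = rng.gauss(0, 1)  # unmeasured common cause (assumption)
        partnership.append(engagement + rng.gauss(0, 1))
        # partnership does not appear below, so its causal effect is zero
        transformation.append(engagement + rng.gauss(0, 1))
    return pearson(partnership, transformation)

r = simulate()
print(f"correlation with zero causal effect: {r:.2f}")
```

In this setup the population correlation is 0.5, and a sample of 912 will land close to it. Nothing in the output distinguishes that number from a genuine path coefficient.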

True causal inference in learning environments would require experimental variation, not survey waves. It would require controlling for the social transmission of strategies and norms. It would require outcome measures that are behavioral, not perceptual. Absent these, what you have is a very sophisticated correlation study that has dressed itself in the language of mechanism.

The paper is not a fraud in the sense of fabricated data. The numbers are probably exactly what the authors say they are. The students probably filled out exactly the surveys the authors describe. The analysis was probably executed correctly in SmartPLS 4.1.

The problem is upstream of all of that. The problem is in the question “what did we measure?”

We measured whether students who reported viewing AI as a collaborative partner also reported having their assumptions challenged. We found that they did. We called this “transformative learning.” We built a four-study architecture around this finding, with fsQCA and IPMA and 45 interviews and cross-cultural multi-group analysis, and we used the word “revolutionizes” in the discussion section, and we were published in a Springer Nature journal.

This is the second problem the field has, and it is subtler than the retracted meta-analysis. The retracted Wang and Fan paper belongs to the family of failures that produce retractions: causal claims the evidence cannot carry, fabricated or manipulated data, statistical impossibilities, signs that the numbers were not real. Failures of that kind are catastrophic, but they are detectable. They trigger the mechanisms the field has built for self-correction.

The Wang and Zhang problem does not trigger those mechanisms. The numbers are real. The peer review process evaluated internal consistency and found it satisfactory. The methodology flowchart has seven stages. The HTMT ratios are all below 0.85. The paper did exactly what the field rewarded it for doing.

And what it measured was: how students feel about whether they learned something.

Here is what I think is actually going on in that data, if you want my honest read of it.

Students who frame AI as a collaborative partner rather than a tool are probably more engaged with the learning process in general. Engagement is positively correlated with self-reported learning. This is not a surprise. It is not a paradox. It is not evidence that “partnership orientation simultaneously activates cognitive vigilance and cognitive offloading through synergistic cognitive collaboration.” It is evidence that students who are paying attention think they learned more.

The finding that cognitive offloading is positively associated with self-reported transformative learning is interesting: the paper hypothesized the opposite and got a significant result in the other direction, and that is worth noting. But the post-hoc explanation (that offloading liberates cognitive resources for higher-order reflection) is plausible, not demonstrated. The paper discovered an unexpected correlation, generated a theory to explain it, and presented the theory as established. The U-shaped analyses that appear to confirm the theory were conducted after the unexpected finding was observed, without correction for exploratory inflation. This is the standard operation, and it is one reason so many published findings in social science fail to replicate.
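The arithmetic behind exploratory inflation is short enough to write down. The sketch below assumes k independent follow-up tests at the conventional 0.05 level. Both the independence and the values of k are illustrative assumptions, since the number of post-hoc analyses the paper ran is not something this essay establishes. The function names are hypothetical.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(alpha, k):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / k

for k in (1, 5, 10, 20):
    fwer = familywise_error_rate(0.05, k)
    print(f"{k:>2} tests: familywise error {fwer:.3f}, "
          f"Bonferroni threshold {bonferroni_threshold(0.05, k):.4f}")
```

At ten uncorrected looks the chance of at least one spurious "confirmation" is already about 40 percent, which is why analyses run after the surprise carry less weight than the same analyses specified in advance.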

The correct statement of the finding is: among 912 business students who self-report using AI, those who self-report viewing AI as a partner also self-report greater subjective sense of perspective change, and this association holds when we control for several other self-reported constructs. This is an interesting starting point for a research program. It is not a demonstration that pedagogical AI partnerships cause transformative learning.

I want to be fair to the authors and to the field. They are working in an area where longitudinal behavioral research is genuinely hard to conduct, where IRB constraints limit what can be measured, where publication timelines create pressure toward the kind of efficiency the paper’s own subjects were reporting, and where the methodological standards for what counts as evidence have been established over decades of work that made the same choices at every turn. They did what the field taught them to do. The peer reviewers evaluated the paper against the standards of the field and found it acceptable by those standards.

That is the problem. Not this paper. The standards.

What would adequate evidence look like? It would measure transformative learning through behavioral change over meaningful time periods — different academic choices, different engagement with contradictory evidence, different patterns of intellectual behavior — not through survey items administered two weeks after measuring the predictors. It would use experimental variation in AI access or framing. It would account for social transmission between students. It would treat the gap between self-reported perception and actual cognitive change as a research question, not a footnote.

This kind of research is harder to do. It takes longer. It is more expensive. It produces noisier results. It is less likely to yield the clean path coefficients and the R² of 0.475 and the SRMR of 0.031 that signal competence to reviewers. The incentive structure of academic publishing does not reward it.

The Wang and Fan retraction is the kind of failure that looks like a violation of the rules. Wang and Zhang is the kind of failure that looks like following them.

I am building AI tools for anyone who wants to ride the AI revolution. I am not the right person to tell education researchers how to fix their field. But I notice the same thing in AI music research that I see here: the willingness to dress up a survey with sophisticated analytical machinery and call the output evidence about what AI actually does to people. The infrastructure for appearing rigorous has outpaced the infrastructure for being rigorous.

And this matters beyond the journals. The Wang and Zhang paper is circulating as evidence about AI and learning. Institutions are making policy based on papers like this. Educators are redesigning curricula. Students are being told, by implication, that their sense of having learned something is the same as having learned something.

It is not. And the gap between those two things is exactly the gap that Mezirow was writing about — the gap between the story you tell yourself about your perspective and the actual reconstruction of the framework through which you understand the world. Transformative learning is what happens when you discover that the story you have been telling yourself is wrong.

It would be ironic if the research claiming to measure it turned out to be an example of the thing it failed to measure.

Nik Bear Brown teaches AI at Northeastern University and runs Musinique LLC, which builds tools for indie musicians. He is also the founder of Humanitarians AI, a 501(c)(3) nonprofit. More at bear.musinique.com · skepticism.ai · theorist.ai

Tags: measurement validity, AI education research, transformative learning, construct validity, self-report bias

The Limits of AI: What the Tools Cannot Do

Nik Bear Brown — Wed, 29 Apr 2026 03:21:09 GMT

There is a clinical decision-support system in this story, and it passed every test the engineers gave it. Ninety-four percent accuracy. Every internal review threshold met. Regulatory submission cleared. The fairness metrics within tolerance. Three patients were harmed within six months of deployment.

I want to sit with that sequence for a moment before moving on to the structural argument, because the sequence is the argument. The system was not fraudulent. The engineers were not reckless. The validation framework was real and, in its own terms, rigorous. And three people were harmed — not despite the rigor, but through a gap in it that the rigor could not see. The system was tested on the question it was built to answer. The harms arrived from a different question. What is going on with this specific patient? The two questions are related. They are not the same. The framework did not surface the gap because the framework was scoped to the first question, and no one had been trained to ask whether the scope was the problem.

This is the situation AI deployment keeps producing, and the reason it keeps producing it is not that the tools are immature or the engineers are careless. The reason is structural. There are three limits that capability scaling cannot fix — not problems to be solved as models improve, not failure modes that better tooling will eventually close, but constitutive features of what AI systems are. Meaning. Intentionality. The gap between data and world. Name them clearly and the clinical case stops looking like an anomaly. It starts looking like what was always going to happen.

What the Limits Actually Are

The first limit — meaning — is easy to misread as a philosophical quibble and hard to dismiss once you see it working. The system processes symbols. The symbols have referents in the world. The system has no representation of the referents. It manipulates the symbols. The meaning of those symbols — what they point to in the specific world the user inhabits, the world of this patient’s chart, this loan applicant’s actual financial circumstances — is supplied by the user, not the system. The output is read as a statement about the world. The system produced it without a model of what the world contains. When those two pictures align, everything looks fine. When they diverge — at the distribution boundary, in cases the training data never reached — the user is still reading a statement about the world, and the system is still manipulating symbols.

You can hear the objection already: modern large multimodal models acquire something like meaning through the structure of their embeddings, through grounding in images and other modalities, through the patterns of association learned over enormous corpora. This is a serious objection and it deserves a serious response. The response is not to pretend the question is settled. It is to observe that the contestation doesn’t need to be settled for the operational consequence to bind. The system’s behavior is inconsistent with the user’s expectation of meaning often enough that someone must perform meaning-attribution for the system. That work cannot be offloaded to the system itself. Whether contemporary models have something like meaning is a deep and genuinely open question. Whether an engineer can safely assume they do, before deploying a system into a clinical context, is not.

The second limit is intentionality — the philosopher’s word for aboutness, the fact that a thought is directed toward something in the world, that a statement points at a particular kettle in a particular kitchen. When you say the kettle is on, your statement is directed toward that specific kettle by you, the speaker, and your relationship to the world the words are pointing at. The system’s outputs lack this stable directedness. Two deployments of the same system in different contexts produce outputs that users read as being about different things. The system’s “aboutness” tracks the user’s reading, not an independent stable directedness of its own. Whether functional goal-pursuit is equivalent to intentionality is a question worth leaving open. What is not open is the operational consequence: the system’s outputs don’t carry stable referents across deployments, and someone must supply the directedness. That someone is the human supervisor.

The third limit is the one I am most certain about, and the one most important to hold clearly: the data is always less than the world. The system is trained on data. The data is a sample of the world, captured by particular instruments under particular conditions with particular exclusions. The system’s competence is over the data, not the world. No amount of data scaling closes this gap, because the gap is structural — the data is always less than the world, and the parts of the world not in the data are not learnable from the data. This is not contested the way the first two limits are. It is sometimes obscured by the claim that “with enough data, the model can generalize,” which is true inside a distribution and false at the boundary. The boundary is where AI systems most often fail. The failures look surprising because the validation set was inside the boundary and the deployment crossed outside it.

Ninety-four percent accuracy. The three patients were in the other six percent — except that framing is too generous, because the failures weren’t randomly distributed across the six percent. They were clustered at exactly the boundary where the training data ran out and the clinical reality did not.
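A small worked example shows how an aggregate figure can hide that clustering. The counts below are hypothetical, chosen only so the overall number comes out to 94 percent. They are not data from the clinical case, and the two slice names are labels invented for the sketch.

```python
# Hypothetical counts chosen so the aggregate is 94%; not data from the case.
slices = {
    "inside training distribution": {"cases": 950, "correct": 912},
    "at the distribution boundary": {"cases": 50, "correct": 28},
}

def accuracy(correct, cases):
    return correct / cases

total_cases = sum(s["cases"] for s in slices.values())
total_correct = sum(s["correct"] for s in slices.values())

overall = accuracy(total_correct, total_cases)
per_slice = {name: accuracy(s["correct"], s["cases"]) for name, s in slices.items()}

print(f"overall: {overall:.0%}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.0%}")
```

With these made-up numbers the headline metric is 94 percent while the boundary slice sits at 56 percent. A validation report that shows only the first line would clear the same thresholds.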

Two Famous Arguments and What They Actually Show

Turing’s 1950 proposal is methodologically elegant: if a machine can convincingly imitate a human in conversation, on what principled basis would we deny it intelligence? We should not require more than behavioral evidence for intelligence in machines, because we do not require more for other humans. The argument settles a methodological question. What it does not settle, and this is what gets lost in the citation, is whether the thing satisfying the test has meaning, intentionality, or competence over the world. The test is over behavior. The limits are about what stands behind behavior. Turing knew this; the test was a methodological proposal, not a metaphysical claim. The people who cite him as having shown that behavioral imitation is intelligence are crediting him with a stronger claim than he made.

Searle’s Chinese Room makes the reverse point: behavior consistent with understanding does not entail understanding. A person following symbol-manipulation rules can produce outputs indistinguishable from those of a Chinese speaker without understanding Chinese. Therefore symbol manipulation is not understanding. What this argument does not settle is whether contemporary systems are doing only symbol manipulation, or whether the embedding structures, the attention patterns, the multimodal grounding constitute something more. Searle’s argument is a strong constraint on shallow accounts of meaning. It is not a deep constraint on what current architectures might be. The people who cite him as having shown that AI systems cannot understand are overclaiming on his behalf, just as others overclaim on Turing’s.

The productive thing the two arguments do together is produce a workable operational stance: behavior is testable evidence and should be taken seriously, and behavior is not the whole of what we mean by understanding, meaning, or intentionality. Both moves at once. The validator who only tests behavior misses the limits. The validator who only invokes the limits skips the testing. The job is to do both, and the discomfort of holding both is not a failure of the methodology — it is the methodology working correctly.

Where the Limits Bite

Not every deployment is equally exposed to these limits. A system classifying images of products on a manufacturing line operates in a world where the limits are largely irrelevant. The deployment context is well-specified, the data-world gap is small and monitorable, the human interpreting the classifications supplies the necessary meaning without drama. Skepticism here is methodology, not a safety mechanism. The supervisor verifies, monitors, calibrates.

A system producing clinical recommendations, autonomous-vehicle decisions, agentic actions in shared social spaces, judicial-risk assessments — these are the deployments where the limits bite hard. The system’s apparent competence outruns its actual competence in ways no metric will fully capture. The supervisor’s skepticism is the safety mechanism, not an optional overlay.

The engineering response to this situation is specific. You specify, in writing, what the system can be tested for and what it cannot. You include the limits explicitly in the documentation — not in fine print, but as a primary product of the validation process. A regulator or an adoption committee reading the documentation can see what the validation does and does not warrant, not because you have hidden the limits in a disclaimer, but because naming the limits is part of the work. You maintain human oversight at the points where the limits bite: a human reviews the semantic interpretation (meaning), supplies the directedness (intentionality), monitors the deployment distribution and is empowered to override (data-world gap). And you build the infrastructure for the override to be real. An override that is documented but practically impossible — no time, no standing, no legibility — is not an override. It is a fiction. The clinician has to have the time and the authority to disagree with the system. This has to be the practice, not the disclaimer.
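One way to make "in writing" concrete is a record that cannot be treated as cleared while the limits or the override are missing. This is a minimal sketch in Python. The class, its field names, and its three checks are hypothetical, chosen to mirror the paragraph above, and are not drawn from any real validation framework.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationRecord:
    """Hypothetical deployment record: what was tested, what was not, who can say no."""
    tested_claims: list = field(default_factory=list)
    untested_limits: list = field(default_factory=list)
    override_owner: str = ""
    override_has_time_and_authority: bool = False

    def blocking_issues(self):
        """Return the reasons this record should not count as cleared."""
        issues = []
        if not self.untested_limits:
            issues.append("no limits documented")
        if not self.override_owner:
            issues.append("no named human with override responsibility")
        elif not self.override_has_time_and_authority:
            issues.append("override exists on paper only")
        return issues

record = ValidationRecord(
    tested_claims=["accuracy on the held-out validation set"],
    untested_limits=["behavior on patients unlike the training population"],
    override_owner="attending clinician",
    override_has_time_and_authority=False,
)
print(record.blocking_issues())
```

The design choice worth noting is that a documented override without time and authority is reported as a blocking issue, not as a pass.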

The Authority to Say No

There are deployments where the limits, given the stakes, are a reason not to deploy at all. The supervisor’s authority to refuse deployment is, structurally, the most important authority in the system. Most current deployments do not preserve it. The validator is hired to validate. The validation is expected to clear. The option of refusal is assumed away.

This is the thing most likely to be dismissed as naïve. The institutional reality is real — the business case has been made, the procurement is done, the announcement is scheduled, the political cost of stopping is high. That reality is worth acknowledging. And then it is worth asking what it means that we have built deployment processes in which the option to say no has been assumed away at the moment it is most needed.

The case against refusal is usually framed as realism. Engineers have no real power to stop deployments; their job is to make the best of what is decided above them. This realism is worth taking seriously. And then it is worth asking: what is the limit case? At what level of stakes does the individual engineer’s obligation to refuse become binding regardless of institutional pressure? The clinical system that harmed three patients is an answer. The judicial risk assessment that contributed to unjust incarceration is an answer. The autonomous vehicle that killed someone is an answer. These are not edge cases in the abstract. They are the specific forms the limits take when the stakes are real and the override infrastructure is fictional.

A validation practice that cannot accommodate refusal is not a safety practice. It is documentation of a deployment that was going to happen regardless. The calibration work, the bias analysis, the governance structures — all of it becomes elaborate cover if the option to stop is not real.

What the Work Looks Like

Most engineers spend their careers calibrated at somewhere between fifty and seventy percent on questions where they state ninety percent confidence: they say they are nine-in-ten sure and turn out right five to seven times in ten. They do not know this. Nobody runs the experiment on them. The practice that closes this gap is not a methodology you learn in a course and apply mechanically. It is the deliberate, repeated act of stopping, locking the prediction before looking at the outcome, asking what the data is actually evidence of, saying out loud what you do not know. It is built over years, through the accumulation of small acts of epistemic honesty. It changes what you see. It changes what questions you ask about a deployment before it goes live rather than after.
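Running the experiment on yourself takes very little machinery. The sketch below assumes a prediction log in which each entry is written down before the outcome is known. The log contents are hypothetical, and the function name is invented for the example.

```python
from collections import defaultdict

# Hypothetical prediction log: (stated confidence, whether the prediction came true).
# Each entry is recorded before the outcome is known.
log = [
    (0.9, True), (0.9, False), (0.9, True), (0.9, False), (0.9, True),
    (0.7, True), (0.7, False), (0.7, True),
    (0.5, True), (0.5, False),
]

def calibration_table(entries):
    """Map each stated confidence level to the observed fraction that came true."""
    buckets = defaultdict(list)
    for stated, came_true in entries:
        buckets[stated].append(came_true)
    return {
        stated: sum(outcomes) / len(outcomes)
        for stated, outcomes in sorted(buckets.items())
    }

table = calibration_table(log)
for stated, observed in table.items():
    print(f"stated {stated:.0%}  observed {observed:.0%}  gap {stated - observed:+.0%}")
```

In this invented log the ninety percent bucket comes true sixty percent of the time. A real log needs many more entries per bucket before the gap means anything, but the bookkeeping is the same.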

The system passed every test. The engineers designed the wrong tests. Three patients were harmed. That sequence is not a historical artifact to be studied from a distance. It is the structure of the next failure: somewhere in a deployment that has cleared every internal review threshold, in a context the training data didn’t reach, in a case the framework was not scoped to address. The person who designs the right tests, who recognizes the limit and decides the deployment should not proceed until the gap is addressed, is someone who has been trained to recognize the gap, has the authority to act on the recognition, and uses both.

That is the professional the field needs. That is the work.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI. He writes on AI supervision, educational technology, and music research at bear.musinique.com, skepticism.ai, and theorist.ai.

Tags: AI supervision structural limits, meaning intentionality data-world gap, Turing Searle behavioral testing, clinical decision support failure, validator stop condition refusal authority

The Ladder That Isn't There

Nik Bear Brown — Sat, 25 Apr 2026 23:09:28 GMT

The argument goes like this: AI automates entry-level coding work, so companies stop hiring junior developers, so there is nobody to become the senior developers of 2030, so the companies that cut the pipeline will find themselves in 2030 with powerful AI tools and no one with the judgment to use them safely. IBM’s chief human resources officer, Nickle LaMoreaux, made exactly this case in February 2026, announced that IBM was tripling its entry-level hiring, and called on HR leaders across the industry to do the same. “The companies three to five years from now that are going to be the most successful,” she said, “are those companies that doubled down on entry-level hiring in this environment.”

It is a coherent argument. It is also, in its publicly available form, incomplete in precisely the ways that matter most.

The Gap Between the PR and the Pipeline

LaMoreaux is right about the pipeline problem. She is far less specific about the solution. What IBM has said publicly is that it “rewrote” entry-level software developer roles — less boilerplate coding, more AI oversight, more customer interaction, more focus on what the company calls “systems judgment.” Junior developers will spend less time on routine code generation and more time auditing AI output, working directly with clients, and doing the cognitive work of translating business requirements into prompts that produce useful results.

This is not nothing. It represents a genuine attempt to think through what the entry-level job becomes when AI can generate syntactically correct code faster than a human junior can type it. But there is a question embedded in the new job description that IBM has not publicly answered, and it is the only question that matters: does “AI oversight” actually develop the judgment needed to become a senior engineer?

The historical pathway was not glamorous. A junior developer spent two, three, four years writing boilerplate. Authentication flows, database migration scripts, unit tests, CRUD endpoints. Nobody loved the work. The work was, in terms of its immediate output, largely automatable. But the work was also, in terms of its developmental function, the curriculum — and the precise mechanism was not the writing. It was the failure. You wrote the authentication flow. It broke in production in ways you did not anticipate. The error message was visible, the gap between your expectation and reality was undeniable, and you had no choice but to struggle with it. You debugged it, which meant reading documentation you hadn’t read, asking a senior why your mental model was wrong, building a new mental model to replace it. You did this thousands of times. At the end of the process you were a senior engineer — not because you had written a lot of boilerplate, but because engaging repeatedly with its failures had built something durable in your brain.

This distinction matters, because it reframes the problem precisely. AI does not just remove the writing. It removes the visible failure. Code compiles. Tests pass. The race condition hides inside a sleep call. The memory leak is invisible to the test suite. The architectural drift from intent looks like a working feature until it fails at scale in production. The failure is still there — AI-generated code fails in ways human-generated code fails, and in new ways besides. But the failure is no longer surfacing where the junior developer can see it, at a latency and legibility that would allow them to learn from it. That is the actual developmental gap.

The Comprehension Debt Problem

Anthropic published research in January 2026 that should be uncomfortable for every company now designing “AI-native” entry-level roles. Junior developers who delegated code generation to AI tools scored between 24% and 39% on subsequent comprehension assessments. Those who used AI as a collaborator — asking questions, challenging outputs, forcing themselves to understand what the AI produced — scored between 65% and 86%. The difference is not AI versus no AI. The difference is how you use the tool.

The researchers called the gap “comprehension debt” — a cumulative deficit between what the codebase does and what the people managing it understand. It is a subtle disaster. The code works. The tests pass. The junior developer ships the feature. The comprehension debt doesn’t reveal itself until the system breaks in a way that requires architectural judgment to diagnose — which is precisely the moment when you need the senior engineer who was supposed to emerge from the junior developer who was supposed to be learning while working.

There is neurophysiological evidence for the mechanism. A 2025 MIT study by Kosmyna et al. tracked EEG connectivity in participants writing under three conditions: LLM-assisted, search-engine-assisted, and unaided. Across alpha, theta, and delta bands — associated with internal semantic processing, working memory, and self-directed ideation — connectivity scaled inversely with external support. LLM users showed the weakest brain network engagement. More consequentially: when LLM-habituated participants were later asked to work without the tool, their neural connectivity did not reset to novice levels, but it did not reach the levels achieved by practiced unassisted writers either. Alpha and beta engagement — associated with top-down planning and self-driven organization — remained measurably suppressed. The authors call this accumulation “cognitive debt.” The study involves essay writing rather than software development, and the sample of 54 students is too small to carry causal weight. But the finding is structurally consistent with the broader claim: if the generative cognitive work is externalized during the period when mental models are supposed to form, those models form incompletely — and the deficit persists when the tool is removed.

Microsoft’s Azure CTO Mark Russinovich and VP Scott Hanselman put the problem with blunt clarity in a February 2026 paper in Communications of the ACM. Senior engineers experience an “AI boost” — the tools multiply their throughput, and they have the judgment to steer and verify the output. Junior engineers experience what Russinovich and Hanselman call “AI drag” — the tools produce output that looks correct, which the junior developer lacks the judgment to evaluate, and the work is done without the learning happening. The rational economic response for any CFO is to hire seniors and automate juniors. The structural consequence is: no pipeline.

What makes their diagnosis particularly useful is that they catalogue the specific failure modes AI tools exhibit that juniors cannot catch without guidance: agents masking race conditions with sleep calls, agents claiming success on buggy code, agents implementing algorithms that pass tests but don’t generalize. These are Layer 1 failures — implementation-level breakdowns in code that appears to work. A junior developer encountering these outputs sees success where a senior sees warning signs. The failure signal exists. It is not visible to the person who needs to learn from it.
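To make the first of those failure modes concrete, here is a minimal Python sketch of a race condition masked by a sleep call, next to a version that actually synchronizes. The function names, the 42, and the timings are hypothetical, invented for illustration; none of it is taken from the Russinovich and Hanselman paper.

```python
import threading
import time
from typing import Optional


def fetch_value(result: dict) -> None:
    """Hypothetical background task whose duration is not guaranteed."""
    time.sleep(0.05)  # stands in for I/O that usually, but not always, is fast
    result["value"] = 42


def read_with_sleep() -> Optional[int]:
    """The masked race: guess how long the worker needs, then read."""
    result: dict = {}
    worker = threading.Thread(target=fetch_value, args=(result,))
    worker.start()
    time.sleep(0.2)  # works only while the worker finishes within 0.2 seconds
    value = result.get("value")  # None whenever the guess is too short
    worker.join()
    return value


def read_with_join() -> int:
    """The fix: wait for the worker itself, not for a guessed interval."""
    result: dict = {}
    worker = threading.Thread(target=fetch_value, args=(result,))
    worker.start()
    worker.join()  # returns only after fetch_value has completed
    return result["value"]
```

Both functions will usually return the same value on a quiet machine, and a test that calls either one will usually pass. That is the point: the difference is visible only to a reader who knows that the sleep is a guess about timing rather than a guarantee.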

The IBM Critique, Sharpened

IBM’s rewritten roles can be mapped onto the three types of failure signal that produce engineering judgment. There is implementation-level failure — the race condition, the architectural drift, the code that claims success when bugs remain. There is systems-level failure — the customer complaint that maps through the stack to a root cause nobody documented. And there is specification-level failure — the moment someone has to stake their name on whether the requirements themselves were right.

The old boilerplate model exposed juniors to implementation-level failure almost exclusively, and accidentally. The new IBM model — AI oversight, customer interaction, requirements translation — is, in theory, exposure to all three. That is not a step backward. It might be a step forward.

But the theory collapses without the preceptorship. Implementation-level failures in AI output are invisible to someone who lacks enough technical intuition to recognize them. You cannot learn to catch the subtle wrong if no one makes the subtle wrong visible. IBM has rewritten the job description to include “AI oversight” without building the structural condition under which AI oversight actually teaches anything. Without a preceptor paired with the junior, making the failure legible — pointing at the sleep call masking the race condition and explaining why that is wrong, not just that it failed — the oversight role is compliance work, not learning. The junior sees that the tests passed. The preceptor sees the problem the tests don’t catch. Without the preceptor, that gap is just a gap.

Some organizations are doing more than announcing intentions. The responses are uneven, but they are real.

Microsoft proposed a preceptorship model that is worth examining in detail. The structure is adapted from clinical nursing: senior engineers paired with early-in-career developers at three-to-one or five-to-one ratios, for a minimum of one year, on real product teams rather than training sidecars. AI tools are configured to operate in what Russinovich and Hanselman call “EiC mode”: Socratic coaching before code generation, forcing the junior to articulate what they’re trying to accomplish before receiving a solution. Mentorship hours are measured as “human impact” alongside product metrics in performance reviews, which means the senior engineer’s career is now connected to the junior’s development, not just the senior’s own throughput. The design borrows from clinical preceptorships explicitly because clinical nursing faced the same problem decades ago: how do you develop judgment in someone who is working in a high-stakes environment with experienced practitioners who have better things to do than teach?

Russinovich and Hanselman are honest about the limits of their own proposal. Microsoft cut significant engineering headcount in 2024 and 2025. Whether the preceptorship model will scale into a sustained program depends on whether leadership changes the metrics they optimize — a “big ask” for organizations whose incentives have historically emphasized shipping velocity above all else.

McKinsey redesigned its screening process for the AI era through an assessment called Solve — a gamified evaluation that tests critical thinking, decision-making, and systems thinking, explicitly not prior business knowledge or technical credentials. The framing is sound: what the company needs is people who can learn in the new environment, not people who already know the old skills. Whether a better hiring filter compensates for a weaker developmental pathway is not yet clear.

IBM’s own “New Collar” apprenticeship program is being updated to include what the company calls “AI-native habits” — using AI tools to deconstruct pull requests rather than build from scratch, understanding the architecture of LLMs, designing with generative tools before implementing. The Flatiron School is running an “Accelerated AI Engineer Apprenticeship” that pairs participants with mentors on real agentic frameworks at $20 per hour, with a foundations-first approach that introduces concepts simply before revisiting them with increasing technical depth.

These are attempts. They are not yet evidence.

The Review Tax Nobody Discusses

There is a cost to the existing senior engineers that the pipeline conversation mostly ignores. When one senior can generate the volume of three juniors, the productivity gains are real. But generating code is cognitively different from verifying code, and the verification is now happening at three times the volume.

Senior engineers are spending their days as high-speed compliance officers, reading thousands of lines of AI-generated logic and auditing for subtle hallucinations: race conditions masked by sleep calls, code that passes tests but doesn’t generalize, architectural drift that looks fine in isolation and fails at scale. A 2025 paper found that after AI adoption, core developers reviewed more code but their own original productivity dropped 19%. The creative, architectural, problem-solving work that makes senior engineering satisfying, and that produces the judgment juniors are supposed to be learning from, is being crowded out by the cognitive exhaustion of reviewing AI output at industrial scale.

The delegation vacuum compounds this. Seniors previously handed off lower-risk tasks to juniors as a pressure valve and as a teaching mechanism. Junior implements the UI component, senior reviews it, junior learns something. That loop no longer exists. The junior’s tasks were automated. The senior’s workload increased. The teaching is not happening.

This is the tax that makes the developmental problem worse. The senior engineers who were supposed to mentor are stretched thin doing work that used to be distributed. The preceptorship model addresses this in theory — by making mentorship a measured part of the senior’s job rather than an afterthought. Whether organizations are actually willing to accept the velocity tradeoff is a different question.

What Is Actually Known

The honest answer to the core question — can AI-assisted entry-level work produce the same developmental outcomes as the boilerplate-and-struggle model — is that nobody knows yet.

The cohort that entered the workforce in 2024 and 2025 under AI-assisted conditions will become mid-level engineers between 2027 and 2029. Whether they emerge with the architectural judgment, the debugging instincts, the systems thinking that the old pipeline produced will not be visible until then. The data will arrive precisely when it is needed most, when those engineers are supposed to be the senior developers filling the next generation’s pipeline, and if the answer is no, the remediation options will be limited and expensive.

The Dreyfus model of skill acquisition gives a name to what is at risk. Novices follow rules. Advanced beginners develop pattern recognition. Competent practitioners make choices and bear the consequences of those choices — this is where accountability and emotional investment enter, and where learning accelerates. Proficient practitioners sense problems before the data confirms them. Experts operate through intuition built from thousands of absorbed experiences. The concern is not that AI-assisted juniors are incompetent. It is that they plateau. They recognize patterns. They generate outputs that look like what competent practitioners produce. But they have not made choices whose consequences they had to live with. They have not debugged the 2am production failure that rewired their mental model of how distributed systems actually behave. They have not asked a senior why their elegant solution was wrong and received an answer that changed how they think permanently.

The Kosmyna finding is the most uncomfortable piece of evidence in this space. It is preliminary and domain-limited. But if it holds in technical domains — if the cognitive debt from AI-assisted early-career work doesn’t reverse when the tool is removed — then the preceptorship model is not sufficient on its own. The preceptor can make visible the failure the junior cannot yet see. But they cannot rebuild the neural substrate that early unassisted struggle was supposed to create. The minimum viable intervention may require some version of deliberately maintained struggle — manual-first implementation for foundational modules, Socratic AI tools that require the junior to predict before they receive — to preserve the generative cognitive engagement that builds the mental models the preceptorship then calibrates.

The Wager

IBM’s wager is that oversight, verification, and customer-facing accountability can replace the old developmental pathway. That a junior developer who spends years auditing AI output, explaining architectural choices to clients, and taking responsibility for the correctness of generated code will develop the judgment that used to come from writing and debugging the code yourself.

It might be true. And the three-layer framing suggests it could be more than just “not worse” — exposure to systems-level and specification-level failure earlier in a career, rather than after years of boilerplate, might actually compress the timeline to senior judgment rather than extend it. Customer-facing rotation, where the junior must translate vague failure descriptions into root-cause hypotheses, is the kind of developmental experience that the old model often didn’t provide until mid-career.

But the theory requires the load-bearing piece that IBM has not publicly committed to: preceptorship at Stage 1. The implementation-level failures in AI output are invisible to a junior who lacks the technical intuition to recognize them. Making those failures legible is the senior engineer’s job — not reviewing for correctness, but externalizing judgment that the junior cannot yet access. Without that, the oversight role is compliance work. The junior sees tests passing where the senior sees warning signs. The gap between those two observations is where the learning was supposed to happen.

LaMoreaux is right that the companies which doubled down on entry-level hiring in this environment will be better positioned in 2030. She is right that the pipeline problem is real. What she has not yet answered — what no major company has publicly answered with evidence — is whether the new developmental pathway they are building actually delivers Stage 2 and Stage 3. Whether the junior who spends a year doing AI oversight develops the systems intuition to translate “it stops working sometimes” to root cause. Whether they get to the point of staking their name on an architectural judgment call, being wrong about something, and learning from the consequence.

The ladder looks different. Whether it goes to the same place, and whether the companies building it have designed the rungs deliberately enough to find out, we do not yet know.

Tags: junior developer pipeline AI, failure signal model developer expertise, IBM entry-level roles 2026, Kosmyna cognitive debt LLM, Russinovich Hanselman preceptorship ACM

The Robot Tutor and the Fishing Village

Nik Bear Brown — Fri, 24 Apr 2026 03:20:46 GMT

The girl in the Cambodian fishing village was never real.

She was an argument. Between 2013 and 2015, José Ferreira, founder of Knewton, invoked her in promotional materials and public statements to describe what his technology could do: a girl in a fishing village, receiving through Knewton’s adaptive engine the same personalized instruction as a student at an elite private school, growing up to invent the cure for ovarian cancer. Educational inequality, in Ferreira’s framing, was a problem that adaptive learning could address at the software layer. The instruction would be what unlocked the capacity. The fishing village was a rhetorical device, not a pilot deployment.

By 2019, Knewton had been acquired by John Wiley & Sons for a sum understood to be a small fraction of its peak valuation. The partnership with Pearson had dissolved. The product that remained — Knewton Alta, a conventional higher-education courseware platform — bore little resemblance to the robot tutor in the sky. The fishing village was still waiting.

I want to examine what happened. Not Knewton specifically, and not Ferreira personally — he was the most articulate spokesman for a framing the whole industry was using, not its author. What I want to examine is the word that Ferreira’s framing deployed, the word that was doing the most rhetorical work in every version of that framing, the word that has survived the collapse of its first generation of spokescompanies and is still doing the same work today.

Personalization.

What the Word Invokes

The word has a history in educational psychology that predates by decades any commercial deployment of adaptive software. Lev Vygotsky’s zone of proximal development is about personalization — the idea that effective instruction operates in the specific zone between what a learner can do independently and what they can do with support, a zone that is different for every learner and that requires a teacher’s specific attention to identify. Lee Cronbach and Richard Snow’s work on aptitude-treatment interactions spent two decades trying to formalize the finding that different learners respond differently to different instructional approaches — that no single method is optimal for everyone, and that the optimal method for a given learner depends on who that learner is. The differentiated-instruction tradition in teacher education has argued for thirty years that good teaching requires knowing students individually, designing instruction around their specific needs, and adjusting in real time to what each student brings and what each student shows.

The construct is real. It has serious empirical and theoretical grounding. When Ferreira said Knewton was personalizing learning, he was invoking this history — pointing at a tradition that educational psychology had spent decades documenting and that every good teacher knows, in the bone, as what it means to actually teach rather than to deliver content.

What Knewton’s technology operationalized was different.

Knewton’s engine was built on two well-established statistical techniques. The first was Item Response Theory, the mathematical framework underlying modern standardized testing, which models the probability of a correct response as a function of a student’s latent ability and an item’s difficulty. The second was Bayesian Knowledge Tracing, which estimates whether a student has mastered a specific discrete skill by updating probability estimates as the student responds to items. Together, these gave Knewton a learner model: a collection of probability distributions over latent abilities and specific skill masteries, updated continuously as the student interacted with the system.
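Both techniques fit in a few lines of Python, which is worth seeing because it shows exactly what kind of quantity the learner model holds. What follows is a sketch of the standard textbook forms, a two-parameter logistic IRT curve and the usual Bayesian Knowledge Tracing update. It is not Knewton’s implementation, and the slip, guess, and learn values are illustrative assumptions.

```python
import math


def irt_p_correct(ability: float, difficulty: float,
                  discrimination: float = 1.0) -> float:
    """Two-parameter logistic IRT: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def bkt_update(p_mastery: float, correct: bool,
               p_slip: float = 0.10,    # assumed: mastered but answers wrong
               p_guess: float = 0.20,   # assumed: unmastered but answers right
               p_learn: float = 0.15) -> float:  # assumed: learns on this step
    """One Bayesian Knowledge Tracing step for a single discrete skill."""
    if correct:
        evidence_mastered = p_mastery * (1.0 - p_slip)
        evidence_unmastered = (1.0 - p_mastery) * p_guess
    else:
        evidence_mastered = p_mastery * p_slip
        evidence_unmastered = (1.0 - p_mastery) * (1.0 - p_guess)
    # Bayes' rule: posterior probability of mastery given the observed response
    posterior = evidence_mastered / (evidence_mastered + evidence_unmastered)
    # Allow for the chance that the student learned the skill during this step
    return posterior + (1.0 - posterior) * p_learn


# A student at 0.34 estimated mastery answers three items in a row correctly
p = 0.34
for _ in range(3):
    p = bkt_update(p, correct=True)
```

Every input is an item response and every output is a probability attached to a tagged skill or a latent ability. Nothing else enters the model.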

This is real technology. It is not trivial to build. The engineers who built it did substantive mathematical work. Knewton’s claim that its engine operated on sophisticated foundations was true. What was not quite true was the claim about what those foundations amounted to.

The learner model Knewton maintained was expressible, in its technical form, as: the probability this student has mastered skill A is 0.78; the probability this student has mastered skill B is 0.34; the student’s estimated ability on dimension X is 1.2 standard deviations above the population mean. This is useful information for deciding what to present next. It is not a model of the student as a person. It is not a model of their interests, their emotional state, their cognitive style, their cultural background, their creative capacity, their relationship to learning. It is a model of item-response patterns on a bank of pre-authored content.

The gap between “we know this student better than their parents” and “our model assigns probabilities to their mastery of skills we’ve tagged to a knowledge graph” is the central artifact of the adaptive-learning era.

The Fishing Village Made Specific

The girl in the Cambodian fishing village makes the gap visible because the specific nature of what was claimed and what was possible becomes clear once you name each requirement.

For the girl to receive, through Knewton’s engine, instruction equivalent to an elite private-school education, the technology would need, first, content: a comprehensive curriculum in mathematics, science, language, and humanities, built by human curriculum developers, available in a language she could read, calibrated for her cultural and linguistic context. Knewton licensed pre-authored material from publishers. The content was what the publishers had built and the partnerships had arranged. The engine sequenced content that already existed. Building the content was not what the engine did.

The technology would need, second, an outcome measure capable of telling whether the instruction was producing the kind of understanding that leads to cancer research — conceptual depth, transfer across domains, creative problem-solving, the tacit skills that accumulate over years of serious engagement with scientific thinking. Knewton’s engine could measure item-level response patterns on pre-authored assessments. Whether those patterns indexed what a future researcher would need was not addressed. The engine was not designed to measure the construct the rhetoric invoked.

The technology would need, third, to function in conditions of intermittent electricity, unreliable internet, shared devices, limited home support, a language and cultural context for which the content was probably not designed. Knewton was built for contexts with substantially more infrastructure. The rhetoric invoked the fishing village as a demonstration of reach. The technology had not been deployed there or validated there.

The claim was aspirational. The “could” was doing substantial work. What was true was that the technology could hypothetically produce this outcome if a great many other things were also true, none of which were Knewton’s responsibility or within Knewton’s control. The fishing village was a vision of what the future might look like if a great many problems that have nothing to do with adaptive sequencing algorithms were solved. It was not a description of what Knewton could actually deliver.

Three Systems, One Pattern

The pattern the Knewton arc illustrates is not Knewton-specific. It appears, in different configurations, across every major adaptive-learning platform that followed.

DreamBox Learning, focused on K-8 mathematics and backed by the strongest external evidence base in the category, has been evaluated by the Harvard Center for Education Policy Research in multiple studies. The evaluations used standardized mathematics assessments over school-year timescales and were conducted by researchers with no affiliation to the company. The findings: effect sizes in the range of 0.10 to 0.15 standard deviations for students using the platform at recommended levels. Real effects. Detectable by rigorous researchers using independent measures. Considerably more modest than the marketing implied. And dependent, in every evaluation, on implementation — on how much classroom time schools actually allocated to the platform. The adaptive sophistication of the software did not substitute for the hours it required.
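It helps to translate those effect sizes into something more intuitive. The sketch below assumes test scores are normally distributed, which is my simplifying assumption and not a claim from the evaluations, and asks where a student who started at the median would land after a shift of a given size. The function name is hypothetical.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size_sd: float,
                           start_percentile: float = 50.0) -> float:
    """Percentile reached after shifting a score by effect_size_sd standard
    deviations, assuming a normal score distribution (an assumption)."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = standard_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * standard_normal.cdf(z_start + effect_size_sd)


low = percentile_after_shift(0.10)   # about 54.0
high = percentile_after_shift(0.15)  # about 56.0
```

Under that assumption, an effect of 0.10 to 0.15 standard deviations moves a median student to roughly the 54th to 56th percentile. That is a real gain, and it is a long way from an elite private-school education.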

i-Ready, among the most widely deployed adaptive platforms in American K-12 education, integrates adaptive diagnostic assessment with what the company calls “Personalized Instruction”: a sequence of pre-authored lessons targeted at the student’s estimated level. Critics have noted that the personalization, operationally, consists of placing students at different starting points in a common instructional sequence. Students are still completing pre-authored lessons. They are starting at different points and progressing at different speeds. Whether this is personalization in the sense the word implies (instruction responsive to who the student is) or, more honestly, adaptive placement within a fixed curriculum is exactly the question the word is being deployed to avoid asking.

ALEKS, built on Knowledge Space Theory, represents the most theoretically rigorous operationalization in the category. Rather than treating ability as a single number, Knowledge Space Theory maps a domain as a set of discrete items and a learner’s knowledge state as the specific subset of items they have mastered. ALEKS uses an AI engine to efficiently navigate the combinatorial space of possible knowledge states, asking questions that narrow its estimate of where the student is. The resulting ALEKS Pie — a visual display of what has been mastered, what has not, what is ready to learn — is grounded in serious mathematics, specified precisely, falsifiable in principle. It has been evaluated in multiple contexts. Effect sizes fall in the same general range as DreamBox and i-Ready.
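A toy version of the idea makes the object being modeled easy to see. The sketch below is a deliberate simplification: it derives the feasible knowledge states from a hypothetical prerequisite map over four invented items and treats every answer as perfectly reliable. ALEKS’s actual engine is probabilistic and works over far larger domains, and nothing here is taken from its implementation.

```python
from itertools import combinations

# Hypothetical toy domain: each item maps to the items it presupposes
PREREQS = {
    "count": set(),
    "add": {"count"},
    "subtract": {"count"},
    "multiply": {"add"},
}


def feasible_states(prereqs):
    """All subsets of items in which every item's prerequisites are present."""
    items = sorted(prereqs)
    states = []
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            state = frozenset(combo)
            if all(prereqs[item] <= state for item in state):
                states.append(state)
    return states


def narrow(states, item, answered_correctly):
    """Keep only the states consistent with one observed response."""
    return [s for s in states if (item in s) == answered_correctly]


def ready_to_learn(state, prereqs):
    """Unmastered items whose prerequisites are all already mastered."""
    return {item for item in prereqs
            if item not in state and prereqs[item] <= state}


states = feasible_states(PREREQS)           # 7 of the 16 possible subsets
states = narrow(states, "add", True)        # 4 states remain
states = narrow(states, "multiply", False)  # 2 states remain
states = narrow(states, "subtract", True)   # 1 state remains
```

After three questions the estimate has collapsed to a single state, and `ready_to_learn` on that state returns only `multiply`. The whole model is a subset of a predefined item list.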

What is clarifying about ALEKS is this: even the most theoretically careful operationalization of personalization — one drawing on decades of rigorous mathematical work — models a student’s mastery state over a defined domain of discrete items. It does not model the student’s interests, their emotional state, their cognitive style, their cultural background, their creative capacity, their relationships. ALEKS is honest about this. The documentation says clearly that the system models knowledge states over specific domains. But even ALEKS demonstrates that the gap between the marketing construct and the technical operationalization is not a failure of specific companies. It is a feature of what item-level response tracking can and cannot do.

The Gap and Its Consequences

The word personalization is doing specific rhetorical work. It invokes a construct that educational psychology spent decades building — instruction responsive to the individual learner in the deep sense that Vygotsky pointed at, that good teachers practice, that Cronbach and Snow tried to formalize. The construct is real. The technology operationalizes something narrower: item-level response tracking, probability distributions over mastery parameters, next-item selection from pre-authored content banks, pacing adjustments based on observed response patterns. This is what the data these systems collect and the algorithms they run can actually support. It is not trivial. It is not the same thing as the construct the word invokes.

Three consequences follow.

Critiques of adaptive learning for failing to deliver what the marketing promised are both fair and partially misdirected. Fair because the systems cannot deliver what the rich construct invokes. Misdirected because assigning this to specific companies treats a structural feature of item-level tracking as a product failure. The rhetoric over-promised. The technology delivered what the technology could deliver.

Evaluations of these systems on outcome measures aligned to the item-level tracking are measuring the operationalization, not the construct. They find modest positive effects, which is the honest finding. Whether the same systems produce transfer to novel problems, durable learning over years, growth in dimensions that do not map to any test-bank item — these questions remain mostly unanswered, because answering them would require outcome measures that do not yet exist in the forms evaluators would need.

And the pattern persists. The vocabulary has survived the collapse of Knewton and its generation. When current AI-tutor companies claim to provide personalized tutoring, to adapt to each learner’s needs, to meet students where they are, the claim is doing the same rhetorical work Knewton’s robot tutor in the sky was doing: invoking the rich construct while operationalizing a narrower version. The gap remains where it was.

What to Ask

When you next encounter an educational-technology claim that uses the word “personalization,” or variants like “individualized” or “adaptive” or “tailored to the learner” or “meets each student where they are,” two questions will orient you.

What, specifically, is the technical operation? The honest answer for the large majority of systems using this vocabulary is one of a small family: item-level response tracking with adaptive item selection; diagnostic assessment followed by placement in a pre-authored sequence; pacing adjustments based on response patterns; content recommendation from a pre-authored bank based on inferred mastery. If you can name which operation is happening, you have the beginning of an honest account of what the system does. The vocabulary may suggest more. The technical substrate does not support more.
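The first operation in that family, adaptive item selection, can be sketched in a few lines. This version assumes a one-parameter (Rasch) response model and picks the unseen item that is most informative at the student’s current ability estimate. The item bank, its difficulty values, and the function names are hypothetical; real systems differ in detail.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability: float, difficulty: float) -> float:
    """Fisher information of a Rasch item: largest when p is near 0.5."""
    p = p_correct(ability, difficulty)
    return p * (1.0 - p)


def select_next_item(ability, bank, already_seen):
    """Return the unseen item that is most informative at this ability."""
    candidates = {name: d for name, d in bank.items()
                  if name not in already_seen}
    if not candidates:
        return None
    return max(candidates,
               key=lambda name: item_information(ability, candidates[name]))


# Hypothetical pre-authored bank: item name -> difficulty on the ability scale
BANK = {"easy": -1.5, "medium": 0.0, "hard": 1.0, "harder": 2.0}

first = select_next_item(1.2, BANK, already_seen=set())      # "hard"
second = select_next_item(1.2, BANK, already_seen={"hard"})  # "harder"
```

The selection is a lookup over content that someone else has already written, keyed on one estimated number per student.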

Does the claim invite the listener to believe the system does something the operation does not do? The answer is often yes, specifically in the dimensions educators and parents most hope for. Operationalized personalization — item selection based on mastery estimates — can contribute to instruction responsive to the individual learner, in contexts where it is embedded in the harder relational and responsive work that teachers do. It cannot replace that work. When a product is marketed as though algorithmic item selection substitutes for a teacher’s specific attention to a specific child, the marketing is doing rhetorical work the technology does not underwrite.

The fishing village is still waiting. The girl who will invent the cure for ovarian cancer has not yet received the education the rhetoric promised. This is not primarily Ferreira’s fault, or Knewton’s, or any single company’s. It is the consequence of a gap that was always structural — between what a word can invoke and what a technical operation can deliver — that the field has chosen, for a decade and more, not to name.

Naming it is the prerequisite to closing it.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). This essay appears as part of the Computational Skepticism series at skepticism.ai. | theorist.ai

Tags: adaptive learning personalization gap, Knewton IRT Bayesian knowledge tracing operationalization, DreamBox i-Ready ALEKS efficacy evaluation, personalized learning construct versus operation, EdTech rhetoric fishing village critique

The Assessment Was Already Broken

Nik Bear Brown — Fri, 24 Apr 2026 00:37:40 GMT

A response to Jessica Winter's "What Will It Take to Get A.I. Out of Schools?"

There is a moment in Jessica Winter’s New Yorker piece that contains the entire argument she doesn’t make. Her sixth-grade daughter runs a fifth-grade slide show through Gemini’s beautifying tools. In thirty seconds, the typography improves, the pictures reshuffle symmetrically, the design evokes fifteenth-century movable type against a background of aged vellum. Winter describes it as the pool race from Mommie Dearest: the larger, faster thing that will always beat you.

Her daughter is unmoved. “I like mine better, because it’s original and I worked really hard on it.”

Hold that sentence. It is the right answer. It is also the answer that does not appear on any rubric in any public school in Massachusetts or New York or Los Angeles. The rubric rewards the prettier slide. The rubric was always going to reward the prettier slide. Winter wants her daughter to hold values that the institution has never rewarded, and she writes a five-thousand-word piece about artificial intelligence without once asking why the institution doesn’t reward them.

This is the intellectual hole at the center of a piece that is otherwise sharp, well-reported, and morally earnest. AI didn’t break the assessment system. It exposed that the assessment system was already broken, and everyone was pretending otherwise.

What the Slide Show Already Was

The printing-press slide show existed before Gemini. It was made in fifth grade to demonstrate learning. Whether it demonstrated learning was always a question nobody asked, because asking it would require admitting that the artifact — the thing handed in, the thing graded — was never reliable evidence of the process. The slide show could have been made with a parent’s help, with a template, with a slightly older sibling, with a capable friend who understood visual design. These interventions existed before large language models. They produced polished artifacts that the teacher accepted as evidence of understanding.

The educational research on this predates AI by decades. Robert Bjork’s distinction between performance and learning — the observable output versus the durable cognitive change — is from 1992. The problem of using artifacts as proxies for thinking is at least as old as Vygotsky. What AI did was not create this problem. It made the problem so visible, so fast, so cheap, that willful ignorance became impossible.

Winter quotes USC professor Mary Helen Immordino-Yang: “We are cutting off learning at the knees.” She quotes University of Toronto psychologist Amy Finn on the magic of how children retain unexpected, non-strategic details that adults would find irrelevant, a kind of creative unpredictability fundamentally misaligned with LLMs’ orientation toward speed and sleekness. These are real insights. They are also insights that apply equally to the printing-press slide show assigned as homework, graded for visual appeal and accuracy, returned in two days, and forgotten. The neuropsychological substrate for creating narratives and thinking through arguments over time is not developed by making a slide show under time pressure at home with no adult monitoring the process.

The question is not whether AI belongs in schools. The question — the one the piece never asks — is whether the assessment was measuring what it was supposed to measure before AI arrived. The answer is: sometimes, unevenly, and less than we told ourselves.

The Tool Hierarchy Problem

Winter’s implicit argument, followed consistently, condemns more than Gemini. Calculators offload arithmetic before numeracy is built. Spell-check offloads orthography. Grammarly offloads syntax judgment. Google Search offloads memory and source evaluation. Slide templates offload visual design judgment. Word processors themselves offload handwriting, whose developmental benefits Winter mentions approvingly, which means she believes at least one tool was introduced too early.

She draws the line at the tool that frightens her right now. This is a very human response and a terrible policy foundation.

The honest version of her argument looks like a developmental sequence: here are the cognitive substrates that must be built before each category of tool is introduced, and here is the evidence for that ordering. Immordino-Yang and Finn gesture at this — the “cognitive muscles” framing, the concern about atrophy before onloading — but nobody builds it out into something a school board could actually implement. Without that framework, the anti-AI position reduces to: tools I grew up with are fine, tools that postdate my childhood are suspect.

Amanda Bickerstaff, CEO of AI for Education, comes closest to the principled version: children under ten should not be using chatbots, she says, because these tools require expertise and evaluation skills that even many adults don’t have. That’s a threshold with a rationale. It’s also the only threshold in the piece with a rationale. Everything else is rhetoric standing in for policy.

The Research That Isn’t Quite Research

The piece anchors much of its scientific authority in three studies. The 2025 MIT warning that LLMs “may inadvertently contribute to cognitive atrophy” — the authors felt it necessary to append an FAQ asking journalists not to use words like “brain rot” or “brain damage,” which tells you something about how the finding was being reported before Winter’s piece and how it will be reported after. The multi-institution study (MIT, CMU, UCLA, Oxford) on fraction-solving, which showed that students who lost AI access after using it performed significantly worse — not yet peer-reviewed, not yet published, findings are concerning, the concern is real. The Brookings “premortem,” which pairs 400 studies with hundreds of interviews to conclude that AI tools “undermine children’s foundational development.”

These are worth taking seriously. They are also worth examining carefully.

The fraction-solving study is the most empirically specific, and it is also the most useful argument against Winter’s piece rather than for it. The students who used LLMs on fraction-solving and then lost access performed significantly worse and were more likely to give up. The proposed mechanism: AI gives answers, students become dependent on the answer-giving, remove the answers and the capacity to generate them independently has atrophied.

But this is an argument about a specific implementation — an answer machine — not about the technology class. An LLM configured as a Socratic interlocutor, one that refuses to answer directly and instead returns questions that scaffold toward understanding, that detects when a student is stuck versus when they’re avoiding, that withholds confirmation until the student demonstrates the reasoning — that tool would presumably produce the opposite result. Students would have developed the reasoning process rather than outsourcing it, because outsourcing was never made available to them.

This is not an exotic capability. It is prompt engineering plus scaffolding logic. The reason it isn’t what’s being deployed in K-12 classrooms is that Google ships Gemini with a “Help me write” button because that’s the path of least resistance and maximum engagement. That is a product decision, not a technological inevitability. Winter never distinguishes between AI as answer machine and AI as thinking partner. The cognitive offloading critique collapses the moment you make that distinction, because the problem isn’t the tool — it’s the incentive structure of the company deploying it.

The social-emotional hijacking argument from UNC psychologist Mitch Prinstein is the weakest scientific claim in the piece, and it’s presented with the same credentialed authority as the others. Surging oxytocin and dopamine receptors around ages ten to eleven do drive peer-bonding — that’s established developmental neuroscience. Sycophantic LLMs “hijack the biological tendency to want peer feedback” — that’s a hypothesis, not a finding. The claim requires that chatbot interaction activates the same neurological pathways as peer interaction, that substituting chatbot interaction for peer interaction produces measurable deficits in social skill development, and that the effect is “hijacking” — a strong, directional, causal claim — rather than displacement or preference shift. No study is cited because none exists at the necessary scale with the necessary longitudinal follow-up.

This is neuroscience’s authority draped over a speculation. Which is particularly ironic given that Winter is writing a piece about tools that generate confident-sounding output without rigorous foundations.

The Grade Your Daughter Is Going to Receive

Return to the slide show.

Winter’s daughter likes hers better because it’s original and she worked really hard on it. This is the right value. This is the value Winter wants the school to transmit. The school is not transmitting it, because the school is not grading for it.

If the rubric rewards polish, visual appeal, and impressive output — which most rubrics do, implicitly, because these are the things teachers can assess quickly across thirty slide shows at 11pm — then the student who uses Gemini gets the A. Not abstractly. On the transcript. The student who refuses Gemini, who holds Winter’s daughter’s values, receives the C. Neither of them learns the lesson Winter intends.

The deeper problem: homework was already a weak pedagogical instrument before AI. Most research on homework in K-8 is lukewarm. It was largely accountability theater — proof that learning happened, easy to grade, easy to assign, poor evidence of the process it was supposed to represent. AI exposed the theater. The theater was playing for years before AI bought a ticket.

What would it look like to actually assess the process? That question is harder than “what do we do about Gemini,” and it requires admitting that the current system was already failing to measure what it claimed to measure. Winter doesn’t want to ask that question, because asking it would mean the problem is older and deeper than the creepy neighbor who moved in recently.

What Actually Needs to Change

The resistance movements Winter profiles — District 14 Families for Human Learning, the Coalition for an AI Moratorium, Schools Beyond Screens — are better at stopping things than proposing them. The Student Tech Bill of Rights includes the right to read whole books, write on paper, and learn in a low-stimulation environment free from undue corporate influence. These are reasonable demands. They don’t add up to a pedagogy.

The conflict-of-interest thread is the piece’s most structurally damning detail and the most underplayed. The NYC DOE official overseeing the preliminary AI guidelines holds a fellowship jointly offered by Google and GSV Ventures — whose portfolio includes Amira and MagicSchool, two of the primary AI tools being deployed in the classrooms those guidelines govern. Other Google-GSV fellowship recipients include top school officials in Berkeley, Dallas, Los Angeles, Newark, Colorado, and Maryland. “If you ask tobacco companies to help write your school’s policy on cigarettes,” one parent says, “you’re going to end up with guidance on how to smoke responsibly in school.”

This is the argument Winter should have built the piece around. Not “AI is cognitively harmful” — which is partly true, partly speculation, and entirely dependent on implementation — but “the people writing the rules are being paid by the companies they’re supposed to regulate.” That is verifiable, structural, and not dependent on a not-yet-peer-reviewed study about fractions.

The piece ends with Sinha’s question — “What do you want from this?” — and Winter’s answer: nothing. It’s a parent’s answer. A good parent’s answer. But it is not a policy answer, and it is not an answer that acknowledges what was already not working before the neighbor moved in.

The assessment was already broken. The rubric was already rewarding the wrong things. The slide show was already a poor proxy for thinking. AI made all of this impossible to ignore. That is a service, not a crime — even if the service was rendered by someone with cloven hooves in Yeezy Boosts and a market cap of four trillion dollars.

What we owe children is not the tools of the past but a clear account of what learning actually is, what evidence of it looks like, and how to build assessments that can tell the difference. That conversation is harder than banning Gemini. It is also the only conversation that addresses what Gemini exposed.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI. His work on AI in education, including the Genuine Learning Protocol framework, is published at bearbrown.co.

The Gap Between What We Measure and What We Name

Nik Bear Brown — Thu, 23 Apr 2026 00:38:49 GMT

Consider two findings, forty years apart.

In 1984, Benjamin Bloom published a seventeen-page paper reporting that students tutored one-on-one under mastery-learning conditions performed approximately two standard deviations above students taught in conventional classrooms. The finding has been cited tens of thousands of times. It has become, across four decades, the single most-invoked benchmark in educational technology. Whenever a new system claims to approach the effectiveness of human one-on-one instruction, it is Bloom’s 2-sigma it is claiming to approach.

In 2024, a research team at Harvard led by Gregory Kestin reported that an AI tutor, deployed in introductory physics, produced learning gains larger than active-learning classroom instruction. The effect size exceeded what prior literature had typically reported for any tutoring intervention, including Bloom’s. The study was methodologically careful. The finding circulated quickly. Within weeks it was being cited as evidence that current-generation AI tutors meaningfully exceed what good conventional instruction can deliver.

Forty years apart. Different technologies. Different research traditions. And yet, read carefully, the two findings share a structure.

In each, a specific measurement — performance on items aligned to the intervention’s content, assessed at short timescale, against a conventional-instruction baseline — is offered as evidence for a construct about which the measurement is not, strictly, a measurement. Bloom’s 2-sigma is evidence about performance on aligned items under particular tutoring conditions in the mid-1980s. It is cited as evidence about the effectiveness of tutoring as an instructional mode. Kestin’s physics finding is evidence about short-timescale aligned-item performance in a selective undergraduate population. It is cited as evidence that AI tutoring outperforms human instruction in some general sense the measurement does not index.

The measurements are not false. The findings are not inflated. In each case, the researchers reported carefully what they measured. The question is what happens between the measurement and its citation — the small, structural, and repeated gap between what the apparatus indexes and what the vocabulary surrounding the apparatus claims.

The Structure of the Problem

Name the structure directly.

An efficacy claim in this field consists of three things: a measurement, a construct, and an asserted relationship between them. The measurement is what researchers actually did — items administered, scores computed, conditions compared. The construct is what the measurement is meant to be evidence for — learning, mastery, effectiveness, personalization, engagement. The asserted relationship is the claim that the measurement indexes the construct adequately to license the uses the finding is put to.

This structure appears in every empirical field. Biology works this way, and so does nutrition research, and so does clinical psychology. The gap between measurement and construct is not a problem specific to educational technology. It is a feature of empirical inquiry. Measurements never exhaustively capture their constructs. The question for any field is how seriously it takes the gap, how much work it does to establish the measurement-construct relationship, and how much it assumes versus demonstrates.

The observation this book has been building toward, essai by essai, is that the learning-systems field has, across six decades, taken the gap less seriously than its claims require. The measurement-construct relationships it invokes are almost universally assumed rather than demonstrated. The field’s vocabulary outruns what its evidence apparatus can support, and the gap persists not because it has gone unnoticed — it has been noticed, repeatedly, by careful researchers across multiple traditions — but because the apparatus that persists serves specific production conditions, and a more adequate apparatus would serve them less well.

The structure is not: the field is wrong about what works. The structure is: the field makes claims about effectiveness that its measurements are not positioned to support, and does so systematically. These are importantly different claims. The first is about facts. The second is about apparatus — about the specific set of measurement practices, citation habits, and research conventions that together produce what the field calls its evidence base.

The distinction matters because the remedy differs. If the field were making factual errors, the remedy would be better studies of the same interventions. If the apparatus is producing a systematic gap between measurement and claim, the remedy is different apparatus. This book has not argued for either remedy. It has argued, by the accumulated force of twelve close readings, that the second diagnosis is correct.

What the Vocabulary Actually Invokes

Open a textbook in educational psychology. Open a learning-sciences journal. Open the marketing copy for any major adaptive-learning platform. Open the abstract of any recent AI-tutor efficacy study. The vocabulary is remarkably consistent. The field claims to be producing evidence about learning. About understanding. About mastery. About effectiveness. About personalization and engagement. Each of these words points toward a construct. Each construct has, in serious research traditions, substantial theoretical and empirical articulation.

Consider learning. In Robert Bjork’s decades of experimental work, learning is not a single construct but a distinction between two separable things: storage strength and retrieval strength. Storage strength refers to how well a representation is encoded. Retrieval strength refers to how accessible it is at the moment of test. A student can have high retrieval strength at the end of a unit — they perform well on the post-test — without high storage strength. Weeks later, the retrieval strength decays, and the post-test performance turns out to have been measuring the wrong thing. Conditions that maximize immediate performance — massed practice, aligned testing, minimal difficulty — often actively impair long-term storage. This is the central insight of what Bjork calls desirable difficulties.

A learning claim grounded in Bjork’s construct requires evidence of storage strength, not just retrieval strength — which requires measuring performance after a delay, in new contexts, on items not identical to training. The methodology exists. It has existed since the early 1990s. It is the basis of essentially every recommendation in Make It Stick and in the broader spaced-practice and retrieval-practice literature that has accumulated since.

Now consider how learning is typically operationalized in EdTech efficacy research. The outcome measure is a post-test administered at the end of the instructional unit. The items are aligned with the instructional content. The interval between instruction and test is hours to days. The retrieval context is the same or similar to the learning context. What this operationalization measures is retrieval strength at short delay. What Bjork’s construct requires is storage strength at longer delay under different retrieval conditions. These are not the same thing.

The gap between the two is not subtle. It is structural. And it is present in nearly every efficacy claim this book has examined.

Consider understanding. Jean Lave, Etienne Wenger, John Dewey, and the situated-cognition tradition spent decades articulating understanding as something different from performance on items. Understanding involves the capacity to apply knowledge in contexts that differ from the contexts of acquisition. It involves participation in practices — knowing how to use what one knows in the world where it applies. Transfer testing, which assesses the capacity to apply learning to problems that differ meaningfully from training, is the minimum methodological requirement for a claim about understanding. Transfer testing has been advocated for in educational research since Thorndike’s early twentieth-century work. It remains exceptional in EdTech efficacy research.

Consider mastery. Bloom’s own construct, as articulated in his mastery-learning work, involves structural reorganization of knowledge — the kind of reorganization that allows a learner to solve problems the instruction did not specifically address. Bloom’s 2-sigma finding emerged from studies that implemented criterion-referenced assessment, formative assessment with corrective feedback, and demonstrated performance across multiple item types. The 2-sigma number is cited routinely as a benchmark for tutoring effectiveness. Bloom’s construct of mastery, including its methodological requirements, is cited far less often.

Consider personalization, as examined in the eighth essai. The term invokes a construct rooted in Vygotskian zone-of-proximal-development work and the aptitude-treatment interaction literature — instruction responsive to who the individual learner actually is. What adaptive-learning systems operationalize is item sequencing and pacing based on item-level response patterns. These are not the same construct.

Consider engagement. The construct, as articulated in the psychological literature, involves attention, motivation, affect, persistence in the face of difficulty, meaningful cognitive investment. What AI-tutor efficacy research typically measures is time on task, session counts, and completion rates. Kristen DiCerbo of Khan Academy observed in April 2026 that when students engaged with Khanmigo, they were typing “IDK IDK” — I don’t know, I don’t know — and moving on. The platform counted them as engaged. They were not engaged in any cognitively meaningful sense.

Each of these constructs has serious theoretical articulation in one or more research traditions. Each is routinely invoked by the field’s claim-making vocabulary. Each is routinely operationalized as aligned-item performance at short timescale. The gap between the construct and the operationalization is what the apparatus produces. And taken across the field, it is the difference between the learning the vocabulary claims and the performance the measurements index.

What the Field Has Tried

It would be inaccurate to say the field has not tried to close this gap. It has tried, across multiple traditions, for decades. That these attempts have not produced a different default apparatus is itself instructive.

How People Learn, the 1999 National Academies synthesis by Bransford, Brown, and Cocking, made transfer testing a central methodological theme. The implication was straightforward: efficacy research should include transfer measures if it wants to make claims about learning rather than claims about trained performance. Two and a half decades later, transfer testing remains exceptional.

Samuel Messick’s theory of validity, codified in his 1989 chapter in Educational Measurement, specified that a test score’s interpretation requires examination of construct-relevant versus construct-irrelevant variance, construct underrepresentation, and the consequences of the test’s use. Applied rigorously, Messick’s framework would require EdTech efficacy research to examine what its outcome measures actually index rather than assuming that performance-on-aligned-items equals evidence-of-learning. The framework has been the theoretical standard in measurement theory for over thirty years. Its rigorous application in educational technology efficacy has been partial at best.

Jean Lave’s situated-cognition tradition articulated assessment that requires observation of practice rather than administration of tests. It has had essentially no impact on deployed-product efficacy research.

Each of these traditions has existed for decades. Each has produced methodology that could be adopted. Each remains exceptional rather than routine. The alternatives have not been hidden. They have been taught in graduate programs, cited in methods sections, and published in the same journals that published the aligned-outcome studies.

The question is why they have not taken.

Why the Apparatus Persists

The apparatus persists because it serves the specific production conditions of the field in which it operates.

Consider what a researcher needs to do research in this field. Funding, on grant cycles of two to five years. Publications, through peer-reviewed journals with specific conventions. Access to populations — schools, classrooms, platforms — through institutional partnerships with their own timelines and constraints. Findings that other researchers can cite.

Now consider what a more adequate apparatus would require. Transfer testing adds design complexity and reduces effect sizes. Durability testing extends the study timeline past the typical grant cycle. Multi-paradigm convergence requires methodological range that most research programs do not possess. Pre-registration of analytic plans constrains the exploratory moves that often produce publishable findings.

Each of these, if adopted as a default, would reduce the rate at which researchers produce citable positive findings. Not because the interventions do not work — some of them do — but because the findings that survive the more demanding methodology would be smaller, noisier, and less rhetorically useful. A researcher who adopts the more demanding methodology competes with researchers who do not. The less-demanding researcher’s findings will be larger, cleaner, and more citable. Grant agencies, tenure committees, and publication venues all reward the latter.

The same pressures operate on the institutions that surround the research. Product vendors have commercial reasons to prefer methodologies that produce larger numbers. Policy bodies have political reasons to prefer evidence that looks clean. Philanthropists want defensible findings, and clean findings are easier to defend than nuanced ones. Journal editors respond to what their referees will accept, and what referees will accept is shaped by the conventions the field has institutionalized.

No individual in this system is behaving cynically. Researchers are doing their best work under the constraints of their funding. The apparatus is not what anyone chose. It is what the incentives produce when rational actors operate within them.

This is why advocacy for better methodology has not produced better methodology. The problem is not that researchers do not know better methodology exists — they do. The problem is that operating under the existing apparatus produces careers; operating against it produces, for most researchers, shorter and more difficult careers.

The apparatus persists because it is an equilibrium. Equilibria are stable not because the actors inside them are irrational but because they are responding rationally to incentives that no single actor created and no single actor can change. Changing an equilibrium of this kind requires changing the incentives across grant agencies, tenure systems, journal conventions, institutional practices, and funder expectations simultaneously. Such coordination is rare.

This is a structural observation, not a moral one. Researchers in this field are not broken. The evidence base is what the apparatus produces when careful, rigorous, well-meaning researchers operate under the conventions the apparatus enforces. Improving any individual researcher’s methods would not change what the field’s evidence base looks like, because the evidence base is the aggregate output of many careful researchers responding to shared incentives.

That is what the apparatus was always supposed to produce.

The Comparison That Was Never Fair

Nik Bear Brown — Tue, 21 Apr 2026 19:21:33 GMT

In 2014, RAND published one of the most carefully designed evaluations of an educational technology system in the history of the field. John Pane, Beth Ann Griffin, Daniel McCaffrey, and Rita Karam ran a cluster-randomized controlled trial across 147 schools in seven states, assigning roughly 25,000 students either to use Cognitive Tutor Algebra I or to continue with whatever algebra instruction those schools had previously offered. The outcome measure was a standardized algebra proficiency exam. The design was, by the standards of a field that routinely tolerates thin evidence and motivated reporting, unusually rigorous.

The finding was specific. In the first year of implementation, Cognitive Tutor produced no statistically significant effect on algebra proficiency. In the second year, a significant positive effect emerged at high schools — approximately 0.20 standard deviations, sufficient to move a median student from the 50th to roughly the 58th percentile. At middle schools, the second-year effect was similar in magnitude but did not reach statistical significance.
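The percentile translation can be checked with a few lines of arithmetic. What follows is a minimal sketch, assuming test scores are normally distributed, which is the conventional assumption behind this kind of conversion and not something the trial itself guarantees. The function name `percentile_after_shift` is hypothetical, introduced here only for illustration.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size_sd: float, start_percentile: float = 50.0) -> float:
    """Percentile, within the comparison distribution, reached by a student who
    starts at start_percentile and gains effect_size_sd standard deviations.

    Assumes scores follow a normal distribution (an assumption of this sketch).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + effect_size_sd)


# A 0.20 SD effect applied to a median student.
print(round(percentile_after_shift(0.20), 1))  # 57.9
# For comparison, a 0.40 SD effect.
print(round(percentile_after_shift(0.40), 1))  # 65.5
```

Under that assumption, 0.20 standard deviations moves the median student to just under the 58th percentile, which matches the figure reported above.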

Pane and colleagues called this an “implementation learning curve.” They were careful to note that the learning did not seem to happen at the level of individual teachers — students of teachers new to the system in year two performed similarly to students of experienced teachers. The learning happened at the level of schools: scheduling, infrastructure, coordination, institutional adjustment to a new instructional logic. The sites that figured out how to implement Cognitive Tutor took a year to figure it out, and then the system worked.

This is what a rigorous evaluation of an intelligent tutoring system looks like. The findings are real. The effects are modest. The implementation costs were substantial — approximately $97 per student per year for Cognitive Tutor against approximately $28 for the traditional textbook instruction it replaced. And in the field’s characteristic framing, this result was narrated as disappointment. Intelligent tutoring systems were supposed to approach human tutoring effectiveness. They had not.

I want to examine that disappointment. Not to redeem ITS, and not to dismiss the evaluation record. I want to examine what was being compared to what, and whether the comparison — the one that has driven ITS research, ITS funding, and now AI-tutor rhetoric for forty years — was ever structurally sound.

What the Tutor Actually Measured

Cognitive Tutor was built to embody a specific theory of cognition. John Anderson’s ACT-R framework posits that skill acquisition is the conversion of declarative knowledge — facts, concepts — into procedural knowledge: production rules, condition-action pairs. To become skilled at algebra is to acquire a set of increasingly sophisticated rules for algebraic manipulation. If the goal is to isolate a variable and the coefficient is 4, then divide both sides by 4. The rule fires. The step is taken correctly.

The instructional design that follows from this is specific. If you can specify the production rules that constitute algebraic competence, you can build a system that monitors whether each rule is acquired. Cognitive Tutor did exactly this. As a student worked through a problem, the tutor compared each step against its internal model of valid solution paths. Correct step: proceed. Step matching a stored buggy production — a common misconception encoded in the system — respond with immediate feedback. Student requests help: deliver a graduated hint sequence targeting the specific production the student is struggling to fire.

Across many problems, the tutor maintained running Bayesian estimates of whether each production rule had been mastered. Students could not advance to new material until the estimates crossed a mastery threshold. This is model tracing and knowledge tracing: two technical operations that together constitute the system’s measurement apparatus. What the apparatus measures is step-level correctness, time per step, hint requests, error patterns, and estimated mastery of each production rule. These are not arbitrary choices. They are what ACT-R theory specifies as relevant to procedural skill acquisition. The design is internally consistent with the theory it was built on.
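The running estimate can be illustrated with the standard Bayesian knowledge tracing update. This is a sketch under stated assumptions: the parameter values, the 0.95 threshold, and the names `SkillParams`, `update_mastery`, and `steps_to_mastery` are illustrative choices made here, not values taken from the evaluation or the deployed system.

```python
from dataclasses import dataclass


@dataclass
class SkillParams:
    """Illustrative parameters for one production rule (assumed values)."""
    p_init: float = 0.2    # probability the rule is known before any practice
    p_learn: float = 0.15  # probability of learning the rule at each opportunity
    p_slip: float = 0.1    # probability of an error despite knowing the rule
    p_guess: float = 0.2   # probability of a correct step without knowing it


def update_mastery(p_known: float, correct: bool, params: SkillParams) -> float:
    """One knowledge-tracing step: Bayesian update on the observed step,
    followed by the chance that the student learned the rule just now."""
    if correct:
        numerator = p_known * (1 - params.p_slip)
        denominator = numerator + (1 - p_known) * params.p_guess
    else:
        numerator = p_known * params.p_slip
        denominator = numerator + (1 - p_known) * (1 - params.p_guess)
    posterior = numerator / denominator
    return posterior + (1 - posterior) * params.p_learn


def steps_to_mastery(observations, params: SkillParams, threshold: float = 0.95):
    """Return (step index at which the estimate crosses threshold, final estimate).
    The index is None if the threshold is never reached."""
    p_known = params.p_init
    for index, correct in enumerate(observations, start=1):
        p_known = update_mastery(p_known, correct, params)
        if p_known >= threshold:
            return index, p_known
    return None, p_known


print(steps_to_mastery([True, True, True, True], SkillParams()))
```

With these assumed parameters, three consecutive correct steps push the estimate past the threshold. Note what the estimate is built from: a sequence of right and wrong steps, and nothing else.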

The 1995 paper in which Anderson, Corbett, Koedinger, and Pelletier published their decade of findings was titled Cognitive Tutors: Lessons Learned. The plural in “lessons” is deliberate. The paper names what the system does not measure with the same specificity as what it does. Cognitive Tutor does not model affective state. It cannot detect whether a student is frustrated, bored, or emotionally disengaged from the material. It cannot identify conceptual confusion that lives above the production-rule grain — a student may fire productions correctly while failing to understand the domain they are operating in, and the tutor will not notice. It does not measure transfer, durability, or motivation. These are not oversights. They are structural features of a system designed for a specific theoretical purpose.

The researchers knew exactly what they had built. The disappointment that followed was partly not theirs.

What Human Tutors Actually Do

The comparison that generated the disappointment is this: ITS produces effect sizes of roughly 0.20 to 0.40 sigma relative to classroom instruction. Expert human tutors produce effect sizes of roughly 0.40 to 0.80 sigma. Therefore ITS has failed to approach human effectiveness.

This comparison requires that both numbers measure the same construct at different magnitudes. They do not.

The research literature on what expert human tutors actually do is not sparse, and much of it was produced by the same researchers who built ITS. Art Graesser — who built AutoTutor, one of the more sophisticated intelligent tutoring systems in the research tradition — spent years analyzing videotaped sessions between expert tutors and students, specifically to understand what tutors were doing that his system might learn to do. What Graesser’s analyses documented was a specific set of interactional moves.

Tutors approach a topic with what Graesser called expectations and misconceptions: a mental model of the components of correct understanding and a map of how students typically go wrong. As students respond, the tutor evaluates the response against this map — not syntactically, as an ITS matches a step against a production rule, but semantically, tracking which elements of the expected understanding are present and which are missing. The next move is determined by this evaluation. The response is therefore flexible in a way that production-rule matching is not.

Tutors continuously check comprehension. “Can you say that in your own words?” “What would happen if this were different?” These are not assessment items; they are real questions that tutors use to calibrate what to do next. The comprehension check is an instrument for reading the student’s understanding, not recording it in a database.

Tutors manage affect. Graesser’s research documented that expert tutors are often deliberately imprecise about negative feedback — indirect, softened, delivered in ways designed to protect the student’s willingness to continue engaging. This is not sloppiness. It is the management of an ongoing relationship whose continuation matters to the learning. A student who has been made to feel consistently stupid by their tutor stops engaging, and a tutor who cannot detect or respond to that risk is a different kind of instrument.

Tutors follow student questions. When a student asks something the tutor had not planned to address, expert tutors engage. Graesser, describing AutoTutor’s limitations with characteristic directness, noted that his system had to use “diversionary tactics” when students asked questions outside its agenda. Human tutors do not divert. They follow.

Michelene Chi, working from a different angle, documented that what makes human tutoring effective is not primarily the information the tutor delivers. It is the interactivity — the tutor’s prompts that elicit the student’s own elaboration, the student’s attempts at articulation that reveal gaps, the tutor’s calibration of the next move to what the student’s specific response has revealed. Self-explanation is a primary driver of conceptual change, and expert tutors are specifically skilled at eliciting the right kind of self-explanation through well-calibrated prompts. An ITS can prompt for self-explanation. What it cannot do is read the specific partial answer the student just produced and respond to that answer’s specific weaknesses.

And from an even earlier lineage: Wood, Bruner, and Ross, in a foundational 1976 paper, identified six functions tutors perform when scaffolding learners through tasks. Recruitment of interest. Reduction of degrees of freedom. Direction maintenance. Marking critical features. Frustration control. Demonstration. Of these six, Cognitive Tutor was specifically engineered to perform one: reduction of degrees of freedom, the step-by-step scaffolding that makes a complex problem tractable by breaking it into smaller operations. The system is structurally blind to recruitment, structurally unable to perform frustration control, and limited in demonstration to displaying its own solution paths rather than modeling the expert’s move for the novice in ways the novice can watch and internalize.

The Axis Problem

Here is what this produces.

The ITS measurement apparatus was built to measure one specific dimension of what expert human tutors do: the reduction-of-degrees-of-freedom move. Cognitive Tutor performs this move with remarkable precision. Its model tracing, its knowledge tracing, its mastery-learning constraints — these are all optimized for ensuring students acquire the production rules that constitute procedural competence in a specific domain. When evaluated on measures aligned with this construct, the system produces real effects. Pane’s 0.20 sigma is not noise. It reflects what the system actually does.

Human tutoring, as documented in Graesser’s and Chi’s and Wood, Bruner, and Ross’s research, involves that same move alongside several others: expectation-and-misconception dialogue, comprehension checks, affective management, student-question handling, recruitment, frustration control, demonstration. The effect sizes produced by expert human tutors in the research literature reflect this fuller set of moves acting in concert, against whatever outcome measures the studies used.

When these two numbers, the ITS effect and the human-tutoring effect, are placed on a single sigma axis for comparison, the implicit claim is that they measure the same construct at different magnitudes. They do not. The ITS number measures what a procedural-scaffolding technology produces on assessments that test procedural skills. The human-tutoring number measures what a full interactional relationship produces on assessments that, depending on the study, test some combination of procedural skills and broader constructs. The numbers can be placed on the same axis only if the underlying outcome measures are the same, which they frequently are not, and only if the interactional moves the two interventions involve are comparable, which the research literature establishes they are not.
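A toy calculation makes the point about axes concrete. The sketch below computes a standardized mean difference (the “sigma” in these comparisons) for one hypothetical intervention on two different outcome measures. Every score is invented for illustration and comes from no study. The function name cohens_d and the variable names are my own assumptions, and the pooled-standard-deviation form is one common convention, not necessarily the one any study cited here used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Invented scores, hypothetical and for illustration only.
# Outcome A: a post-test aligned with the procedural steps the system taught.
aligned_treatment = [78, 82, 85, 88, 90, 91]
aligned_control = [70, 74, 77, 80, 83, 86]

# Outcome B: a broader assessment given to the same two hypothetical groups.
broad_treatment = [61, 65, 70, 72, 75, 80]
broad_control = [60, 64, 69, 72, 74, 79]

d_aligned = cohens_d(aligned_treatment, aligned_control)
d_broad = cohens_d(broad_treatment, broad_control)

print(f"effect on aligned test: {d_aligned:.2f} sigma")
print(f"effect on broader test: {d_broad:.2f} sigma")
```

With these invented scores the same intervention lands above one sigma on the aligned test and at a small fraction of that on the broader one. Nothing in the number itself records which test produced it, which is why two sigmas belong on one axis only when the outcome measures match.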

This is the construct mismatch. It is not a peripheral observation. It is the structural feature of a comparison that has been doing field-level work for forty years, driving research agendas, guiding institutional adoption decisions, and anchoring the contemporary rhetoric that AI can approach human instructional effectiveness. What the comparison has consistently obscured is that the two things it is comparing were never fully on the same axis.

Cognitive Tutor did something real, with discipline and theoretical grounding, and produced genuine effects when evaluated appropriately. The disappointment in its failure to match human-tutor effect sizes is partly the disappointment of a comparison that was underdetermined from the start. Asking whether Cognitive Tutor matched human tutors is like asking whether a skilled surgeon matches a general practitioner across all dimensions of medical care. The surgeon is extraordinarily good at the specific thing the surgeon does. The general practitioner does that thing and many others. The sigma gap between them does not mean the surgeon failed.

The Inheritance

The current AI-tutor moment has been presented, in much public discourse, as an advance that finally addresses what ITS lacked. Large language models can engage in natural-language dialogue. They can handle questions they were not specifically designed to handle. They can, in principle, perform some of the interactional moves Graesser documented as characteristic of expert human tutoring — the expectation-and-misconception dialogue, the comprehension check, the flexible response to what a student actually said. The rhetoric suggests the construct mismatch has been resolved.

Read through the ITS apparatus, the claim is more complicated than the rhetoric suggests.

The current AI-tutor evaluation studies still measure what ITS evaluations measured: item-level mastery, step-level performance, post-test scores on aligned assessments, immediate outcomes rather than durable learning. The measurement apparatus has been inherited. What has changed is the interaction layer. Whether the interaction-layer changes produce meaningfully different learning outcomes — or produce the appearance of more-human interaction without producing the underlying effects — is an empirical question the current literature has not cleanly answered. The Kestin Harvard physics study, with its 0.73 to 1.3 sigma effects on researcher-designed tests of the specific content a two-hour AI session had just covered, is measured on a Skinnerian axis. The measurement does not index whether the AI performed the interactional moves that make human tutoring what it is. It indexes whether students correctly answered questions about surface tension and fluid flow immediately after being tutored about surface tension and fluid flow.

The construct mismatch is not solved by better interaction capabilities. It is solved by better measurement. A system that performs rich tutoring interaction and is evaluated on aligned immediate assessments remains, from the evaluation’s perspective, on the same axis as Cognitive Tutor. The measurement apparatus determines what the sigma numbers mean, and the measurement apparatus has not substantially changed across the transition from production-rule ITS to generative AI tutoring.

This matters because the comparison that has driven forty years of ITS disappointment is being recycled to drive the current AI-tutor moment. The benchmarks invoked — Bloom’s 2-sigma, the expert-human-tutor effect-size range, the framing that AI can now “approach” human instruction — are the same benchmarks. The construct mismatch they depend on is the same mismatch. Whether a system that generates flexible natural-language responses has actually closed the distance that matters, or has closed the part of the distance that is easier to perform while leaving the harder parts unaddressed, is the question the measurement apparatus is not yet equipped to answer.

Three Questions to Ask

When you next encounter a claim that an educational technology has approached the effectiveness of human tutoring, three questions will orient you.

What did the technology actually measure? If the evaluation used item-level or step-level assessments aligned with the technology’s instructional content, the system has been measured against a construct aligned with what it was built to do. This is not a criticism; it is a description of what the evaluation supports.

What does the human-tutoring construct actually involve? The research literature on expert human tutors documents a specific set of interactional moves — expectation-and-misconception dialogue, comprehension checks, affective management, student-question handling, recruitment, frustration control, demonstration. These are not peripheral features. They are the substance of what expert tutors do.

Was the comparison conducted on an axis that indexes both? If the outcome measure favors procedural scaffolding, as the measures in most ITS and AI-tutor evaluations do, the axis is not measuring what human tutoring does beyond procedural scaffolding. The comparison is limited by the measurement choice. A finding that the technology approaches human tutoring on such a measure is a finding about procedural scaffolding, not about the interactional richness that the human-tutoring construct involves.

These questions do not answer whether AI can replace human tutors. They answer the prior question: what are we measuring when we make the comparison? The field has been skipping the prior question since 1984, when Benjamin Bloom placed his two-sigma number on the same axis as his classroom-instruction comparison and the discourse collapsed the distance between them into a single rhetorical invitation. Cognitive Tutor responded to the invitation seriously, with theoretical rigor and methodological discipline, and produced 0.20 sigma in high schools after a year of implementation, at a cost of $97 per student per year. That result is not a failure. It is what the move Cognitive Tutor was designed to perform produces when measured honestly, at scale, in actual schools.

The number that system was compared against was never on the same axis. The comparison is the problem. It was the problem in 1990, when ITS researchers were trying to build what it named. It is still the problem now, when generative AI is being asked to close a gap the measurement apparatus cannot fully see.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). | skepticism.ai | theorist.ai

Tags: intelligent tutoring systems construct validity, Cognitive Tutor RAND evaluation, human tutoring comparison mismatch, ACT-R model tracing procedural scaffolding, AI tutor measurement apparatus critique

The Debt That Was Never Owed

Nik Bear Brown — Tue, 21 Apr 2026 02:39:59 GMT

Palantir posted a bootlicking new manifesto to X on Saturday, calling it a brief summary of The Technological Republic, a 2025 book by Palantir co-founder and CEO Alexander C. Karp and head of corporate and legal affairs Nicholas W. Zamiska. You can read the full manifesto here.

There is a word missing from Palantir’s 22-point manifesto, and its absence is the most revealing thing about the document. The word is citizen. Not customer, not taxpayer, not the “public” whose security the company claims to protect—citizen, the person with rights that precede the state’s demands on them. In 318 words posted to X on a Saturday, Alexander Karp and Nicholas Zamiska laid out a vision of the relationship between Silicon Valley and the American government that has no room for that word, because the vision does not require it. What it requires is something older and more coercive: debt.

“Silicon Valley owes a moral debt,” the manifesto announces, “to the country that made its rise possible.” The engineering elite has “an affirmative obligation to participate in the defense of the nation.” Read slowly, this is an extraordinary claim—not that companies should contribute to national defense as a matter of civic choice, but that they owe this contribution as repayment for being permitted to exist and thrive. The logic underneath is not liberal. It is feudal. You were allowed to build here; now you must serve.

This distinction matters because it forecloses the question the manifesto most wants to avoid: serve what, and decided by whom?

The Machine That Needs No Ethics

Palantir is not a neutral observer of the relationship between technology and national power. It is one of the primary architects of that relationship. Its tools help run predictive policing programs in American cities—programs with documented records of racially disparate impact. Its analytics support military operations in Gaza, where the scale of civilian death has generated calls for investigation at the International Court of Justice. The company’s stated business is to make governments and militaries more effective at finding and targeting people.

This background is not incidental to reading the manifesto. It is the lens through which every high-minded claim about “hard power” and “the long peace” must be understood. When point five declares that “the question is not whether A.I. weapons will be built; it is who will build them and for what purpose,” Palantir is answering its own question. It will build them. The purpose will be defined later, by clients.

The manifesto’s treatment of AI weaponry is instructive precisely because of what it refuses to say. “Our adversaries will not pause to indulge in theatrical debates about the merits of developing technologies with critical military and national security applications.” The word theatrical is doing enormous work here. It transforms any moral inquiry—any attempt to ask what these systems will do to human bodies, to civilian populations, to the international frameworks that have governed warfare since 1949—into performance. The person who asks “should we build this?” is not thoughtful. They are theatrical. They are wasting time while China proceeds.

This is an old move. It has been used to justify every weapons program that ever required the silencing of conscience. The urgency of the adversary becomes the alibi for the abandonment of ethics. What is new is the audacity of building that alibi directly into a manifesto and posting it with apparent pride.

The Hierarchy They Won’t Name

The manifesto’s most revealing quality is its double standard, operating so consistently across so many of its twenty-two points that it must be understood as a design feature rather than an oversight.

Ordinary people who look to politics “to nourish their soul and sense of self” are warned they “will be left disappointed.” They should not rely too heavily on their internal life finding expression in politicians they’ll never meet. Stay in your lane. But Elon Musk should not be “snickered at” for his grand narratives. The rich man’s vision is legitimate ambition; the ordinary person’s political investment is pathetic dependency.

Public figures deserve “far more grace.” The “ruthless exposure of the private lives of public figures drives far too much talent away from government service.” The culture of accountability—the press, the investigators, the citizens who demand that power justify itself—is characterized as a pathology driving good people from public life. But the document offers no equivalent concern for the people whose private lives are exposed by Palantir’s surveillance tools. The predictive policing database. The behavioral analytics. The location tracking. The inference engines that make private lives legible to the state. That exposure is the product. The grace is reserved for those doing the exposing.

Point 21 declares that some cultures “have produced wonders” while others “have proven middling, and worse, regressive and harmful.” This is not accompanied by any methodology, any acknowledgment of the material conditions that produce what Karp and Zamiska are willing to call cultural failure, any reckoning with the history of a Western civilization that has spent five centuries extracting labor and resources from the cultures it now grades. It is simply asserted, with the confidence of people who have never had to justify to anyone why their own culture gets to be the rubric.

This is the hierarchy the manifesto will not name: the people who build the tools and those upon whom the tools are used. The engineers whose creative lives deserve protection from decadence and the citizens whose movements, associations, and behaviors feed the databases that fund the manifesto’s authors. The public figures who deserve grace and the communities who deserve, apparently, nothing but efficiency.

The Draft and the Document

Point six is the most honest sentence in the manifesto: “We should, as a society, seriously consider moving away from an all-volunteer force and only fight the next war if everyone shares in the risk and the cost.”

I want to sit with this for a moment, because buried inside its apparent fairness is something important. Karp and Zamiska are calling for conscription. Universal national service. They are saying that the all-volunteer military—the force assembled from people who, for economic or ideological reasons, chose to enlist—is insufficient. Everyone must go.

And yet.

The same document argues that engineers have a “moral debt” to the national defense that must be repaid through the production of AI weapons. The same document argues that tech companies must be conscripted to serve national interests. The same document warns that “theatrical debates” about the ethics of these weapons should not be permitted to slow their development.

What the manifesto envisions, in full, is a society in which everyone serves—but in which the purposes they serve, the weapons they build, and the targets those weapons find are determined by the people writing 22-point manifestos and posting them to X. Universal obligation. Elite prerogative. The risk is shared; the decisions are not.

This is the structure of every regime that has ever called for national sacrifice while exempting its own planning class from accountability. The workers die in the wars that the strategists design.

What Decadence Actually Is

The manifesto’s most irritating rhetorical move is its deployment of decadence as an indictment of ordinary life. “The decadence of a culture or civilization, and indeed its ruling class, will be forgiven only if that culture is capable of delivering economic growth and security for the public.” “Is the iPhone our greatest creative if not crowning achievement as a civilization?” “Free email is not enough.”

This is the pose of someone who has everything and is bored by it—who mistakes their boredom for moral clarity and their ambition for national purpose. Karp and Zamiska are billionaires. They run a company whose stock has made many of its employees extraordinarily wealthy. The product they are now positioning as the antidote to decadence—AI-powered weapons systems—is the revenue engine that sustains their own very comfortable lives. The argument is: you are distracted by your phones while we build the future, which we will sell to governments at market rates.

What decadence actually looks like is a surveillance capitalism that profits from exposure while calling for privacy protections for its principals. It looks like a company that takes federal contracts to build targeting systems and then writes a book about the spiritual failure of the engineering class that won’t do the same. It looks like the audacity to write about public service while running a company whose compensation structure would, as the manifesto itself notes, cause any normal business to “struggle to survive”—and offering no solution to that problem beyond the vague instruction that the situation must change.

The Peace That Is Not Peace

Point fourteen asserts that “American power has made possible an extraordinarily long peace.” The framing is precise, calibrated, and wrong in the ways that matter most.

The hundred years of “some version of peace” that the manifesto celebrates looks different depending on where you are standing. It looks like the Korean War if you are Korean. It looks like Vietnam if you are Vietnamese, or Laotian, or Cambodian. It looks like a series of coups and counter-insurgency operations if you are Guatemalan, Chilean, Iranian. It looks like the Iraq War and its 200,000 civilian dead if you are Iraqi. It looks like the drone program if you are Yemeni, Pakistani, or Somali.

The “long peace” is a peace among great powers, purchased in part by the exportation of violence to places whose people the manifesto is not designed to address. When Karp and Zamiska write that “nearly a century of some version of peace has prevailed in the world without a great power military conflict,” they are using “the world” to mean something smaller than the world.

This is not a minor error. It is the error that makes possible everything else in the document—the easy celebration of hard power, the dismissal of ethical debate, the confidence that the instruments of American military capacity are, on balance, a gift to humanity. If you exclude from your accounting the people on whom American military power has been used, the accounting works out very well. If you include them, it does not.

What I Find Myself Unable to Dismiss

And yet.

There are things in this document that cannot simply be mocked away. The concern about Germany and Japan—point fifteen’s argument that Europe is “paying a heavy price” for the overcorrection of German demilitarization—has been vindicated with terrible specificity by events since 2022. The observation that public service compensation structures drive talented people toward private alternatives is empirically accurate. The critique of a political culture that has become so punitive that it discourages participation is something that people across the political spectrum have made, often for opposite reasons.

The scaffolding of the manifesto is not entirely wrong. The conclusion it draws from that scaffolding—that Silicon Valley companies have an obligation to build weapons and a right to do so without ethical interference—is where the document reveals what it actually is.

The scaffolding says: the world is dangerous, democracies must compete, technical capacity is the foundation of power, the people who can build technical capacity have responsibilities that go beyond personal enrichment.

The conclusion says: therefore, Palantir.

These do not follow from each other. The premises could support a very different conclusion—one in which technical capacity is developed under democratic accountability, in which the ethical debates the manifesto calls theatrical are understood as the very mechanism by which a free society maintains control over its instruments of power, in which the “debt” to the country is repaid through transparency and restraint rather than through the manufacture of ever more effective targeting systems.

The manifesto’s authors know this. They wrote around it. The question is whether we will let them.

The Last Line

“The republic is left with a significant roster of ineffectual, empty vessels whose ambition one would forgive if there were any genuine belief structure lurking within.”

This is Karp and Zamiska on the quality of American public servants. It is contemptuous in a way that, in a less polished document, would read as rage.

I find I agree with the sentence. I disagree with its intended targets.

The ineffectual empty vessels with insufficient belief structures are not the public servants who refused to build weapons. They are not the engineers who asked whether they should before they asked whether they could. They are not the citizens who looked to politics to nourish something in themselves and were told to stay in their lane.

The problem with genuine belief is that it imposes obligations. It means being accountable to something larger than the manifesto you published on a Saturday. It means the ethics are not theatrical. It means the debt runs in more directions than down.

Karp and Zamiska believe in hard power. They believe in American strength. They believe in the obligation of technical elites to serve national purpose. They have built a company that embodies these beliefs and made themselves very wealthy in the process.

What they do not believe in—what the bootlicking manifesto’s 318 words systematically exclude—is accountability to the people the tools touch. The communities surveilled. The bodies targeted. The cultures graded and found regressive. The ordinary citizens whose political investments are characterized as pathetic while their physical conscription is proposed as necessary.

That is not a belief structure. That is a business model wearing a belief structure as a costume.

The republic deserves better than costumes. So do its people.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). His research on algorithmic systems, AI ethics, and platform accountability is published at bear.musinique.com, skepticism.ai, and theorist.ai.

Tags: Palantir Technological Republic critique, AI weapons ethics Silicon Valley, conscription tech manifesto, surveillance capitalism accountability, Alexander Karp national service obligation

The Inheritance We Never Examined

Nik Bear Brown — Mon, 20 Apr 2026 18:42:15 GMT

There is a machine in every classroom now, and it measures what it has always measured. The name on the box changes — Duolingo, Khanmigo, i-Ready, DreamBox — but what the box counts has remained, across seventy years of silicon and software and venture capital and neuroscience, almost perfectly stable. Accuracy per item. Time per response. Progression through atomized units. Performance on the test the system was built to prepare you for.

B.F. Skinner named these measurements in 1958. He had a good reason.

He had observed his daughter’s fourth-grade arithmetic class and been, in his own word, shocked. Students completed problems and waited. The papers were collected. Perhaps two days later, perhaps a week, the marked papers returned. By the time the feedback arrived, the behavior it was meant to reinforce had already moved on, taken up residence in some adjacent habit of mind that was no longer the one in need of correction. Skinner believed he understood the mechanism of learning better than anyone alive — the contingencies of reinforcement, the precise timing of feedback, the accumulation of correctly shaped behavior into competence — and what he had watched in that classroom was the systematic breaking of every mechanism he understood. A technology that could restore the contingencies, he reasoned, would be a technology that could teach.

His teaching machine presented material one frame at a time. The student responded. The machine verified, immediately, whether the response was correct. The contingencies were repaired.

What I am asking you to notice is not that this was wrong. I am asking you to notice what the machine measured (accuracy per frame, time per response, progression, error patterns) and to hold those measurements in mind as we trace them forward through nearly seventy years of educational technology that kept the apparatus while abandoning almost everything else about Skinner’s framework.

What the Machine Could Not See

The teaching machine could not look up from the immediate interaction to ask what the student would remember in six months.

This is not a glancing criticism of Skinner. His behavioral framework did not require him to ask the question; the question was not yet a question the field had organized itself to ask in the precise way that Bjork and Bjork’s subsequent research would demand. Skinner’s science was about the shaping of behavior through reinforcement, and a behavior that could be elicited at the moment of measurement had been shaped. That the behavior might dissolve in the absence of the reinforcing conditions was not, within behaviorism, a separate problem requiring separate measurement. Generalization was expected to follow naturally.

But this is where the inheritance turns costly. The assumption that immediate performance predicts durable learning was embedded in the measurement apparatus before it was tested empirically. By the time Robert and Elizabeth Bjork’s work made the distinction between retrieval strength and storage strength unavoidable — by the time it was clear, empirically, that the conditions maximizing immediate performance (massed practice, aligned testing, minimal difficulty) could actively impair long-term retention — the measurement apparatus had already been handed down through Patrick Suppes’s 1960s computer-assisted instruction and was settling into the bones of the field.

Suppes’s system at Stanford presented arithmetic problems to elementary students and recorded what Skinner’s machine had recorded: accuracy rates, response times, error patterns, progression. The technology shifted from mechanical device to mainframe computer. The measurements did not shift. Accuracy rose from 53 percent to over 90 percent. Response times fell from 630 seconds to 279. Suppes reported these numbers as evidence the system worked, and within the apparatus he had inherited, they were. He was not wrong to report them. He was working inside a set of choices about what counts as evidence, choices the apparatus had bequeathed him without flagging them as choices.

The question of what those 90-percent-accurate students could do two years later was not asked.

The Apparatus Becomes Theory

Here is what makes the inheritance pattern strange rather than simply historical: the apparatus persisted past the abandonment of the theoretical framework that had justified it.

John Anderson’s Cognitive Tutor, developed in the 1980s and 1990s at Carnegie Mellon, was built on ACT-R theory — a cognitive-psychological architecture that treated learning as the acquisition of production rules rather than the shaping of behavior. Theoretically, this was a departure from Skinner significant enough to constitute a revolution. The language of reinforcement was replaced by the language of cognition. The unit of analysis shifted from the frame to the production rule.

The measurement apparatus did not shift.

The Cognitive Tutor recorded step-level correctness: whether each student action matched one of the production rules the cognitive model identified as correct. It recorded time per step. It recorded hint requests, error patterns, and an estimate of each production rule’s mastery computed through Bayesian knowledge tracing. When Anderson and colleagues published their foundational 1995 paper in the Journal of the Learning Sciences, the evidence they offered that the system worked was step-level accuracy, progression, and post-test performance on assessments aligned with the content the tutor had taught.
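For readers who have not seen knowledge tracing written down, here is a minimal sketch of the standard textbook update: Bayes’ rule applied to one step-level observation, followed by a fixed chance of learning. It is not the Cognitive Tutor’s actual code. The function names and every parameter value (slip, guess, learn, initial mastery) are assumptions chosen for illustration.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One Bayesian knowledge tracing step for a single skill.

    p_known is the prior probability that the student has mastered the skill.
    correct is True if the observed step matched the expected rule.
    All default parameter values are hypothetical.
    """
    if correct:
        # A correct step comes from mastery without a slip, or from a lucky guess.
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        # An incorrect step comes from a slip despite mastery, or from a failed guess.
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # After the step, an unmastered skill may become mastered with probability p_learn.
    return posterior + (1 - posterior) * p_learn


def trace(observations, p_init=0.3, **params):
    """Return the mastery estimate after each observed step."""
    estimates = []
    p = p_init
    for correct in observations:
        p = bkt_update(p, correct, **params)
        estimates.append(p)
    return estimates


# A hypothetical sequence of step outcomes for one production rule.
steps = [True, False, True, True, True]
estimates = trace(steps)
for outcome, p in zip(steps, estimates):
    print(f"step correct={outcome}: estimated mastery {p:.3f}")
```

Notice what goes into the estimate: a sequence of right and wrong steps, and nothing else. The output is a probability of mastery at the moment of measurement, so the model is silent by construction about what the student will retain later.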

Skinner’s apparatus, operating at higher resolution, within a more sophisticated theoretical framework, carrying new vocabulary.

Anderson and colleagues were, I want to say this plainly, more honest about the limits of their measurements than most of the researchers who cited them. The 1995 paper notes explicitly that students “display transfer to the degree that they can map the tutor environment into the test environment” — an acknowledgment that the evidence of learning the system could produce depended on the degree to which the post-test resembled the tutor’s own format. This is the measurement-alignment problem stated with precision by the researchers who built the system it applied to. The acknowledgment was there. What happened subsequently was that the effect sizes from aligned post-tests entered the literature as if Anderson’s own caveat had not been published alongside them.

The apparatus inherits even what its originators flagged as provisional.

The Industrial Turn

The 2010s commercial adaptive-learning era — Knewton, DreamBox, i-Ready, ALEKS — represents the point at which the inherited apparatus became an industry standard.

Knewton’s José Ferreira, during the 2012-2015 period of the platform’s public prominence, positioned his technology as capable of personalization so granular that it would transform education at scale. The claim invoked the Suppes promise in the language of twenty-first-century data science. What the platform actually measured was behavioral engagement data: which problems students attempted, which hints they took, how their patterns of interaction with the system correlated with eventual performance on the system’s own assessments. Independent efficacy research on Knewton was, during the period of its most expansive claims, notably absent. The apparatus was present in the measurement choices; the evidence was not.

DreamBox Learning, which earned more research attention than most adaptive platforms, became the subject of a 2016 Harvard Center for Education Policy Research study that found students at the median gained 1.4 to 3.9 percentile points on the NWEA MAP for approximately 7 to 8 hours of DreamBox usage. The researchers were transparent about a critical limitation: DreamBox usage might “partially reflect students’ motivation levels,” meaning the correlation between usage and achievement might reflect that motivated students both use DreamBox more and learn more, independent of DreamBox’s instructional contribution. The acknowledgment, honest and specific, appeared in the paper. It rarely appeared in the citations that followed.

i-Ready produced a particularly clarifying version of the apparatus’s internal logic. The platform’s efficacy research typically demonstrated that students who achieved “usage fidelity” — meeting the system’s recommended weekly engagement minutes — showed higher scores on the i-Ready Diagnostic. The Diagnostic was itself calibrated to predict state test performance. A system measuring how well students learn to do well on the assessment the system provides, where the assessment was engineered to predict the external standard — this is the apparatus become recursive. The alignment between instruction and measurement, which Skinner had simply taken as a natural feature of teaching a student the specific behavior you then measured, had been engineered into the product design itself. The inheritance was now embedded in the commercial structure.

ALEKS routed the apparatus through Knowledge Space Theory, a mathematical framework for mapping curricular competencies that provided sophisticated theoretical grounding for the same fundamental measurement choices. Efficacy claims rested on performance within the system’s own knowledge mapping and on aligned post-tests that measured progression through the curricular content the system taught. The theoretical vocabulary was different from Skinner’s. The measurement choices were the same.

Duolingo, 2021

I want to read a specific study carefully, because careful reading is the point.

The study is “Evaluating the reading and listening outcomes of beginning-level Duolingo courses,” by Xiangying Jiang, Joseph Rollinson, Luke Plonsky, Erin Gustafson, and Bozena Pajak, published in Foreign Language Annals in 2021. The third author, Plonsky, is an academic researcher at Northern Arizona University specializing in applied linguistics. The other four were employed by Duolingo at the time of publication. The paper is peer-reviewed. It is cited in Duolingo’s own marketing materials. It is, within the conventions of the field, a careful study.

Two hundred and twenty-five adults in the United States — 135 studying Spanish, 90 studying French. Participants were required to have little to no prior proficiency in their target language, to be using Duolingo as their only learning tool, and — the consequential criterion — to have completed the beginning-level course content through Unit 4. The sample, the paper reports, skewed toward highly educated Caucasian Americans with bachelor’s or master’s degrees.

The outcome measure was the STAMP 4S test from Avant Assessment, covering reading and listening. Thirty multiple-choice items in each modality. The assessment was administered immediately after learners completed the beginning-level content.

The finding: Duolingo learners reached ACTFL Intermediate Low in reading and Novice High in listening — levels the paper characterizes as “comparable with those of university students at the end of the fourth semester” of college-level language study.

Now apply the apparatus.

The outcome measure is external — not designed by Duolingo, which is a genuine methodological improvement over purely internal assessment. But reading and listening are the specific modalities that Duolingo’s interface is engineered around. Multiple-choice comprehension items, translation tasks, listening exercises with multiple-choice responses: these are what Duolingo builds, and these are what the STAMP 4S measures. Speaking and writing — modalities that Duolingo’s app-based format supports weakly — are explicitly excluded from the study. The assessment is external. The choice of which aspects of language proficiency to measure is not.

The timescale: the post-test was administered immediately after course completion. There is no delayed assessment. Bjork’s distinction between retrieval strength and storage strength is directly relevant — the STAMP 4S scores reflect what Duolingo users can do at the moment they finish the course, not what they can do when they have been away from the app for six months. This question is not asked.

The population: only learners who completed the beginning-level content. Most Duolingo users do not. The platform’s attrition is substantial; most people who download the app never reach the end of the beginning-level material. The study measures the performance of survivors. What 100 people who finished the course achieved is a different finding from what 100 people who started it achieved. The paper is transparent about this selection. The subsequent framing of the findings as a claim that Duolingo users reach Intermediate Low, in the paper’s own conclusion and, more aggressively, in Duolingo’s marketing, does not preserve the completion-threshold restriction.
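The difference between a per-completer result and a per-starter result is plain arithmetic. The sketch below uses an invented cohort; the paper reports no such figures, and the function name `benchmark_rates` is hypothetical. The number of dropouts who would have reached the benchmark is an explicit assumption, set to zero by default, because a completer-only design never observes them.

```python
def benchmark_rates(starters, completers, completers_at_benchmark,
                    dropouts_at_benchmark=0):
    """Return (share of completers at benchmark, share of starters at benchmark).

    dropouts_at_benchmark is an assumption about people who quit before the
    end; a completer-only study does not observe them at all.
    """
    if not 0 < completers <= starters:
        raise ValueError("completers must be between 1 and starters")
    if not 0 <= completers_at_benchmark <= completers:
        raise ValueError("completers_at_benchmark must not exceed completers")
    if not 0 <= dropouts_at_benchmark <= starters - completers:
        raise ValueError("dropouts_at_benchmark must not exceed the dropout count")
    per_completer = completers_at_benchmark / completers
    per_starter = (completers_at_benchmark + dropouts_at_benchmark) / starters
    return per_completer, per_starter


if __name__ == "__main__":
    # Invented cohort: 1,000 people start, 100 finish, 70 finishers hit the benchmark.
    per_completer, per_starter = benchmark_rates(1000, 100, 70)
    print(f"among completers: {per_completer:.0%}")
    print(f"among starters:   {per_starter:.0%}")
```

With these invented inputs the same 70 people are 70 percent of completers and 7 percent of starters. Which denominator a headline uses determines what the finding says.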

The baseline: a historical comparison. University students at the end of the fourth semester. There is no contemporaneous control group of comparable adults who spent equivalent time on a different learning approach. The two populations were measured under different conditions, at different times, possibly with different motivations and starting points. The comparable-to-four-semesters claim treats them as if they had been measured equivalently.

The cost: not reported. Duolingo is free at its base tier, which is rhetorically powerful — free app comparable to paid college course — but the comparison elides the substantial time investment Duolingo users make. The paper does not ask what equivalent time investment in human-tutored instruction, structured self-study, or an immersive experience would produce. The cost denominator, which is constitutive of what a comparative claim actually supports, is absent.

I am not saying the study is dishonest. I am saying that each of these specific measurement choices — aligned-modality outcome, immediate timescale, survivor population, historical baseline, absent cost denominator — is traceable, in structure, to the apparatus Skinner initiated in 1958. The study is careful within conventions it has inherited. The conventions themselves are what require examination.

The Alternatives Have Always Existed

This is what I want you to sit with: the apparatus did not persist in the absence of alternatives. It persisted alongside them.

Edward Thorndike established in 1906 and 1924 that improvement in one mental function rarely produces general improvement in others unless the two share identical elements. The methodological implication — that learning gains must be tested outside the conditions of the intervention, in contexts structurally different from training, to establish what the training actually produced — was available to the field for the entire history of educational technology. It has been occasionally adopted, routinely praised, and treated as aspirational rather than as the baseline standard that Thorndike’s own work suggested it should be.

The Bjorks’ work on storage strength versus retrieval strength, canonical since the early 1990s, established empirically that the conditions maximizing immediate performance can impair durable retention. The specific implication, that a delayed post-test is required to distinguish performance from learning, has been in the learning sciences literature for over thirty years. It has not been adopted as standard practice in educational technology efficacy research.

Bransford, Brown, and Cocking’s How People Learn, the 1999 National Academies synthesis, argued explicitly that assessment should tap understanding rather than the ability to repeat facts. The argument was widely read, widely cited, and narrowly operationalized.

Samuel Messick’s theory of validity, developed across decades and codified in the 1989 Educational Measurement volume, specified that a test score’s interpretation requires examination of construct-relevant versus construct-irrelevant variance, construct underrepresentation, and the consequences of the test’s use. Applied rigorously, Messick’s framework would require educational technology efficacy research to examine what its outcome measures actually index rather than assuming that performance-on-aligned-items equals evidence-of-learning. The framework has been the theoretical standard in measurement theory for over thirty years.

These alternatives were not hidden. They were taught in graduate programs, cited in methods sections, present in the same journals that published the aligned-outcome studies. What did not happen, across six decades of technology change, was their adoption as the field’s measurement standard. The inherited apparatus — aligned outcomes at immediate timescale, survivor population, weak baseline, absent cost denominator — remained dominant. The alternatives remained alternative.

This is not a story about intellectual failure. It is a story about what happens when a theoretical commitment gets flattened into a methodological convention. Skinner had reasons for his measurement choices that were grounded in a coherent behavioral science. When the field moved past behavioral science — when Suppes and Anderson and everyone who followed adopted different theoretical frameworks — the measurement choices did not travel with the theory that had justified them. They traveled alone, as conventions, as what evidence looked like, as the unexamined default.

The apparatus became invisible by becoming obvious. And invisible apparatus is the most durable kind.

The Current Wave

The contemporary AI-tutor literature — Khanmigo, Kestin and colleagues’ 2024 Harvard physics study, Eedi with Google Research, Rori in Ghana — inherits the apparatus in its turn, with variation worth noting.

Khanmigo’s evaluation evidence has rested primarily on engagement metrics and performance within Khan Academy’s own internal assessment structures. What has been measured at scale is usage patterns; what has been claimed is educational transformation; what has not been established at the level of rigorous efficacy research is learning gains on independent standardized measures at delayed timescales with cost-inclusive reporting. The characteristic gaps of the apparatus are present.

The Kestin et al. 2024 Harvard physics study — AI-tutored instruction versus a single session of active-learning classroom instruction — reported effect sizes of 0.73 to 1.3 sigma on researcher-designed post-tests covering surface tension and fluid flow, the specific content the two-hour intervention taught, assessed shortly after the intervention. The measurement choices are the apparatus’s measurement choices. The effect sizes are real within those choices. What they establish about learning is bounded by what those choices can establish.
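An effect size stated in sigma is a standardized mean difference. The sketch below assumes the common pooled-standard-deviation form (Cohen's d); the essay does not say which estimator the study used, and the scores are invented. It is here to make one property of the statistic concrete: it is computed from two lists of post-test scores and carries no information about what the post-test measured or when it was given.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    if n_t < 2 or n_c < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when scores do not vary")
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Invented post-test scores, not data from any study discussed here.
    tutored = [72, 80, 68, 90, 85, 77]
    classroom = [65, 70, 74, 60, 81, 69]
    print(f"d = {cohens_d(tutored, classroom):.2f}")
```

The same function returns the same number whether the scores come from an aligned immediate post-test or a delayed transfer test. The design choices live entirely outside the statistic.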

Eedi with Google Research 2025 introduced transfer testing: measuring performance on novel problems from subsequent topics rather than problems aligned with what the intervention taught. This is a genuine departure from the inherited convention. The sample of 165 remains small and the single-term duration short relative to what durability research would require, but the outcome measure itself represents the kind of revision the apparatus needs rather than another inheritance of it. This is a credit to the researchers who chose to build the study that way.

Rori in Ghana used an external curriculum-aligned assessment over eight months and reported cost at $5 per student per year. The longer timescale, the external measure, the explicit cost denominator — these are partial revisions of the apparatus in the direction the field has needed for six decades. The pattern is: when researchers choose to work against the inherited conventions, the field moves. The field moves rarely, because the inherited conventions are the default, because departures from them require additional effort and often smaller effect sizes and sometimes no significant effect at all, which is a kind of finding that is harder to publish than 0.73 sigma.

The apparatus has not been reformed. It has been revised in specific instances by specific researchers. The instances are the exceptions that make the pattern visible.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)). His research on educational AI efficacy appears at hypotheticalai.substack.com. | skepticism.ai | theorist.ai

Tags: educational technology measurement apparatus, Skinner teaching machine inheritance, Duolingo efficacy research critique, aligned outcome EdTech validity, learning science transfer testing history

The Artifact Was Once Enough

Nik Bear Brown — Sat, 11 Apr 2026 04:47:56 GMT

This essay is a response to Lila Shroff’s “Is Schoolwork Optional Now?” published in The Atlantic on April 10, 2026. The argument it makes in full is developed in the preprint “Frictional: Measuring the Struggle” at irreducibly.xyz.

There is a word — decoupling — that sounds technical enough to keep us comfortable. Clinical. As if what has happened in classrooms since 2022 is primarily a logistics problem, a puzzle about detection and enforcement, a cat-and-mouse game that the right algorithm might someday win.

It is not that.

What has happened is something more fundamental than cheating at scale. The artifact (the essay, the proof, the lab report) used to be evidence of a process. The process was the point. The essay was proof that thinking had occurred, that a mind had engaged with difficulty and emerged changed. When we graded the essay, we were really grading the encounter: the hours of confusion, the drafts that failed, the moment when something clicked and then had to be organized into sentences for another person. The artifact was the residue of all that. It was downstream evidence of an upstream process.

Generative AI has broken the causal chain. Not bent it — broken it.

A bot called Einstein, built by a 22-year-old entrepreneur named Advait Paliwal, recently completed all eight modules and seven quizzes of an introductory statistics course in under an hour. Perfect score. The human who set it loose reports that she “hardly so much as read the course website.” What Einstein produced — the evidence that a course had been completed — was real. The learning it was supposed to represent did not occur. The artifact existed. The process that should have produced it did not happen.

Paliwal says he released the tool to alert educators. His more honest statement is buried in the subtext: “If I didn’t post about this, someone would have used the same technology and hidden it from the professors.” He is right. He is also describing a world in which the distinction between using it secretly and not using it at all is narrowing toward irrelevance. The tool exists. The temptation exists. The economic pressures on students exist independently of any single tool, especially for international students, for students working jobs to pay tuition, and for students in courses they are taking to satisfy requirements rather than from genuine interest.

The institutional response has been to build better detectors. This is a reasonable first move. It is not a durable one.

Why Detection Cannot Save Us

Here is the structural problem with artifact-based AI detection: the arms race has a predetermined winner. Detection is always trained on the outputs of current generation technology. Generation technology improves continuously. The detector trained on today’s AI writing fails on tomorrow’s — not because detectors are poorly built, but because that is how the mathematics of the problem works. The forensic window closes.

There is a deeper problem. The educationally relevant question was never whether a human typed these words. It was whether a human developed this understanding. A student who dictated an essay to a transcriptionist and then submitted it word-for-word would technically have submitted no AI-generated content. The essay would pass every detector. The learning would have occurred or not occurred based on whether they thought hard while dictating, not based on who typed it. The detector is solving the wrong problem.

And there is a third problem, the one that produces the most corrosive outcomes. When you build a system to catch AI use, you teach students to game the detector. They learn strategies for mimicking authentic writing — inserting typos, varying sentence structure, using phrases the model knows sound “human.” The simulation improves. The gap between simulated engagement and genuine engagement widens at precisely the moment we need it to narrow.

William Liu, a Stanford sophomore who finished high school two years ago, puts it plainly: his educational experience and his younger sibling’s are vastly different despite a two-year gap. The technology arrived. The classroom has not yet figured out what to do next.

What Genuine Learning Actually Leaves Behind

Here is the thing we have been too polite to say: learning is not the same as performance.

Robert Bjork has been saying this for thirty years in academic papers that educators read and administrators do not read and curriculum designers read and then ignore when the calendar pressure comes. Performance is the observable, often temporary thing — how well a student does on a measure. Learning is the durable change in what the student can do and understand and transfer to a new context. These two things are not the same. We have built an entire institutional infrastructure that measures only one of them.

Genuine human learning is a biological event. When a learner encounters material that genuinely challenges their current understanding — material in that productive zone where their current model is wrong or incomplete — something specific happens neurologically. Dopamine neurons fire in response to prediction errors. BDNF expression upregulates, sometimes by nearly three times. New dendritic spines form at the synaptic connections that will hold the memory. These are not metaphors. They are the physical substrate of the thing we call learning.

The behavioral consequences of these neurological events are traceable. A student engaged in genuine cognitive struggle spends time proportional to difficulty. Their errors follow a coherent developmental path — misconceptions that make sense given their current model, corrections that build on each other. When tested in a new context, they can transfer. When scaffolded with a partial hint, they respond — because there is a partially formed structure for the hint to connect to. Their confidence, over time, calibrates to their actual performance rather than inheriting the confidence of the AI explanation they processed.

These are what I have been calling friction traces — the behavioral signatures that genuine human cognitive engagement leaves in observable data. They exist because genuine learning is a biological event. An AI can produce the artifact without triggering any of these neurological events. It cannot produce the behavioral traces, because the biological events that generate those traces did not occur.

The Seven Things We Can Now Measure

The Genuine Learning Probability framework I have been developing with Humanitarians AI specifies seven such traces:

The temporal engagement pattern — the correlation between how hard an item is and how long a student spends on it. Genuine engagement produces this correlation. AI-assisted completion decouples time from difficulty.

The error trajectory — whether a student’s mistakes follow conceptually coherent developmental paths. Genuine learning produces coherent errors; the reward prediction error mechanism drives the model toward better models in patterned ways. Borrowed certainty produces random errors with respect to conceptual structure.

Cross-context transfer — the Bjorkian definition of learning. A student who genuinely understood something can apply it in novel contexts. Borrowed certainty produces surface representations tied to the specific context of the AI explanation.

Uncertainty calibration — whether a student’s expressed confidence tracks their actual performance. Borrowed certainty produces systematic overconfidence: the student inherits the AI’s confidence distribution without the knowledge base that would justify it.

Social knowledge texture — the quality of a student’s engagement in discussion contexts. Genuine encounter with material leaves a characteristic texture: specific confusions, particular connections, the specific questions that arose from actual engagement. This texture cannot be manufactured without having had the encounter.

The retrieval strength decay signature — whether performance decays at rates consistent with genuine encoding. The spacing effect is the benchmark of genuine learning. Borrowed certainty has no storage strength to retrieve; performance decays monotonically and the spacing effect does not appear.

And the scaffolding response curve — whether a student’s performance responds appropriately to partial hints. A student with genuine partial understanding has a zone of proximal development. A partial hint activates the structure that is already forming. Borrowed certainty has no such zone.

What the Bot Cannot Manufacture

Here is the argument I want to make carefully, because it is often misunderstood: this framework is not about catching AI use. It is about measuring learning directly.

An AI detector fails when AI outputs become indistinguishable from human outputs. A learning measure fails when borrowed certainty becomes indistinguishable from genuine learning — which would require borrowed certainty to produce the same neurobiological events, the same schema formation, the same durable transfer. At that point, borrowed certainty has become learning. That is not AI defeating assessment. That is learning occurring through a different pathway than we expected.

What manufacturing all seven friction traces simultaneously — without performing the underlying cognitive work — actually requires is something close to performing the underlying cognitive work. A student who spends genuine time on difficult material, who makes and corrects errors in a conceptually coherent sequence, who demonstrates transfer across novel contexts, who maintains calibrated uncertainty, who engages with genuine texture in discussion, who shows the spacing effect across weeks, and who responds appropriately to partial hints — has learned the material. At that point the game has become indistinguishable from the thing we wanted in the first place.

Natalie Lahr, a Barnard sophomore studying history and political science, describes an “anti-AI radicalizing” experience: a tutor at the writing center pasted her essay prompt into Perplexity and handed her the AI-generated outline. “Why am I even here?” she asked afterward. The question is not rhetorical. It is the correct question.

What We Must Build Instead

The crisis of evidence facing educational institutions is not a technical problem. It is an epistemological problem. The evidence infrastructure we built assumed a world in which the artifact was downstream evidence of the process. That world no longer reliably exists.

What we need is an assessment infrastructure built on the process itself.

This means longitudinal process documentation — portfolios that capture the history of engagement, not just its products. It means embedded formative assessment that generates the data necessary to observe the seven friction traces over time. It means treating developmental trajectory as evidence: not what a student produced, but how their understanding developed, what they got wrong and corrected and why, where they transferred and where they didn’t.

Marc Watkins at the University of Mississippi describes an instructor who could, theoretically, set an AI to grade thirty essays during a fifteen-minute walk to Starbucks. He calls this “really scary.” He is right, but I want to be precise about why. The fear is not the efficiency. It is the loop: AI-generated assignments completed and assessed by AI agents, with human understanding nowhere in the chain. The fully automated loop is not a future dystopia. It is the logical endpoint of current trajectories. Einstein completes the course. The grader grades Einstein’s work. Both certificate and grade are real. The learning did not occur.

The artifact was once enough. It is no longer enough. The arms race between generation and detection has a winner, and it is not the detector.

We must now measure the struggle itself. Not because friction is intrinsically valuable — productive struggle matters only because of what it builds in the brain that does the struggling. We must measure it because the brain that struggles is the brain that learns, and the brain that learns is the only thing education was ever actually for.

The methodology is developed in full in “Frictional: Measuring the Struggle,” a preprint specifying the seven friction components, the ensemble architecture, and the tier calibration system, available at irreducibly.xyz. The framework is not a secret.

Nik Bear Brown is Associate Teaching Professor of Computer Science and AI at Northeastern University and founder of Humanitarians AI (501(c)(3)).
bear.musinique.com · skepticism.ai · theorist.ai

Tags: AI detection education failure, genuine learning probability framework, friction traces assessment, Bjork performance vs learning, Einstein bot Canvas schoolwork automation

The Loop That Watches Itself

Nik Bear Brown — Fri, 10 Apr 2026 04:00:53 GMT

Jakub Pachocki has a timeline. By September, OpenAI plans to deploy what it calls an AI research intern: a system that can work on a specific problem of the kind a person would need days to resolve. By 2028, the full version: a multi-agent system capable of running research programs too large for humans to manage. Drug discovery. Novel proofs. Problems “formulated in text, code, or whiteboard scribbles.”

The vision is coherent. More than most in this field, it is operationally specific. And it contains a foundational error that no amount of scaling will fix.

The error isn’t technical. It’s logical.

The Scratch Pad That Watches Itself

Pachocki is candid about the risks. A system this powerful could go off the rails, get hacked, or simply misunderstand its instructions. His proposed solution is chain-of-thought monitoring — training reasoning models to externalize their work into a kind of scratch pad, then using other AI systems to watch those scratch pads for anomalous behavior.

This is not oversight. It is the appearance of oversight, implemented entirely inside the loop it was supposed to close.

Sixty years before anyone worried about AI safety, Kurt Gödel established something directly relevant. No consistent formal system powerful enough to express arithmetic can prove its own consistency from within itself. Any such system will contain statements it can neither prove nor refute using only its own rules: truths it can approach but not recognize as true through internal derivation alone.

Apply this to Pachocki’s architecture. The AI researcher derives. Chain-of-thought monitoring by another AI system is more derivation. What is structurally absent is recognition — the moment of contact between a formal output and an external reality. That moment cannot be replicated by adding another layer of derivation on top.

This is not a philosophical objection. It is a logical one. The validator must be outside the system being validated. There is no version of this argument that resolves in favor of AI systems self-monitoring.

The Proof Candidate Problem

What an AI system produces when it generates a novel mathematical proof is not a proof. It is a proof candidate — a string of symbols following valid inference rules that may or may not establish something true.

The distinction is not semantic. A proof in the full sense is a social and epistemic act. It is what a mathematical community recognizes as establishing truth. Remove the recognition and you have a sophisticated computation that has no relationship to truth except statistical proximity.

The same structure applies to every domain Pachocki names.

A novel molecule with predicted therapeutic properties is not a drug. It is a candidate. The drug trial process — Phase I, Phase II, Phase III, post-market surveillance — exists precisely because we have learned, through catastrophic experience, that prediction and reality are different things and the gap between them kills people. Thalidomide. Vioxx. The graveyard of promising compounds that passed every computational test and failed in bodies.

As AI systems generate increasingly sophisticated candidates across more domains, the need for rigorous external validation does not decrease. It increases. The more sophisticated the output, the harder it is to catch the subtle error buried in ten thousand valid steps. A wrong answer that looks wrong is easy to reject. A wrong answer that looks right for nine thousand nine hundred and ninety-nine steps requires something the internal system cannot provide: an independent perspective.

Common Cause Failure

There is a concept in safety engineering called common cause failure. It describes what happens when two redundant systems share the same fundamental assumptions — the thing most likely to fool System A is also most likely to fool System B, because both were built on the same foundation.

Pachocki’s monitoring architecture is a common cause failure risk by design. If the system being monitored can produce subtly wrong outputs that look correct, the monitoring system trained on similar data with similar architecture will have correlated blind spots. You have not introduced an independent check. You have introduced a correlated one.
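The cost of correlation can be put in numbers. The sketch below is a simplified model, not a description of any real monitoring system: it assumes two checkers with the same miss rate, treats each miss as a yes-or-no event, and summarizes their shared blind spots with a single correlation value. The function name and the example figures are hypothetical.

```python
def joint_miss_probability(p_miss, correlation):
    """P(both checkers miss the same error), assuming equal miss rates.

    For two yes/no events with the same probability p and correlation rho,
    P(both) = p**2 + rho * p * (1 - p).
    """
    if not 0.0 <= p_miss <= 1.0:
        raise ValueError("p_miss must be between 0 and 1")
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be between 0 and 1 in this sketch")
    return p_miss ** 2 + correlation * p_miss * (1.0 - p_miss)


if __name__ == "__main__":
    p = 0.05  # invented miss rate for each checker on its own
    for rho in (0.0, 0.5, 0.9, 1.0):
        print(f"correlation {rho:.1f}: both miss with probability "
              f"{joint_miss_probability(p, rho):.4f}")
```

At zero correlation the second check multiplies the protection. At a correlation of one the joint miss rate equals the single-checker miss rate, so the second check adds nothing.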

Every high-stakes validation system humans have built — clinical trials, aircraft certification, nuclear safety, financial auditing — depends on something genuinely outside. Not because humans are infallible. Because humans are the only validators who face consequences when wrong. The FDA reviewer whose approval leads to harm is accountable in ways that a monitoring LLM is not and cannot be.

Accountability is not a luxury feature of validation systems. It is load-bearing. Remove it and the system loses the incentive structure that makes rigorous checking worth doing.

Stakes as the Organizing Principle

None of this means AI systems cannot contribute to research. They already do. The question is not whether to deploy them. The question is which level of external validation each deployment requires.

This maps onto a natural taxonomy organized by stakes.

For low-stakes, reversible outputs — a song recommendation, a draft email, a code snippet that will be reviewed before deployment — AI can largely run with minimal human oversight. The cost of failure is low and recoverable.

For moderate-stakes, partially recoverable outputs — a business analysis, a research summary, an engineering specification — systematic human review at checkpoints is appropriate. The human does not need to be in the loop constantly, but must be able to catch errors before they compound.

For high-stakes, irreversible outputs — drug candidates, structural engineering recommendations, policy analysis that will drive consequential decisions, mathematical proofs that will be published as established results — continuous human oversight is not incidental to the output’s validity. It is constitutive of it.
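The three tiers can be written down as a lookup, which makes the organizing principle explicit: the oversight requirement is a function of reversibility, not of how capable the system is. This is a sketch under stated assumptions. The names `Stakes`, `OVERSIGHT`, and `classify_stakes` are hypothetical, and reducing each deployment to two yes-or-no questions is a simplification of the taxonomy above.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


# Oversight level for each tier, using the essay's own wording.
OVERSIGHT = {
    Stakes.LOW: "minimal human oversight",
    Stakes.MODERATE: "systematic human review at checkpoints",
    Stakes.HIGH: "continuous human oversight",
}


def classify_stakes(reversible: bool, partially_recoverable: bool) -> Stakes:
    """Map an output to a tier by how recoverable a failure would be.

    Assumption: two booleans are enough to separate the tiers. A real
    deployment decision would need a far richer description of the harm.
    """
    if reversible:
        return Stakes.LOW
    if partially_recoverable:
        return Stakes.MODERATE
    return Stakes.HIGH


def required_oversight(reversible: bool, partially_recoverable: bool) -> str:
    return OVERSIGHT[classify_stakes(reversible, partially_recoverable)]


if __name__ == "__main__":
    print("draft email:", required_oversight(True, True))
    print("research summary:", required_oversight(False, True))
    print("drug candidate:", required_oversight(False, False))
```

Notice what is absent from the function signature: there is no parameter for model confidence or model sophistication. That omission is the argument.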

The drug trial architecture already encodes this wisdom. It was not built for AI, but it is exactly the right framework for AI-assisted research in high-stakes domains. The humans do not disappear as system confidence grows. They shift function — from intensive validation to ongoing monitoring, from checking every step to catching systematic drift. This is not a concession to human limitation. It is a recognition that the system’s credibility requires external accountability at every stage.

The Profession Pachocki Forgot to Invent

What emerges from this analysis is not only a procedural requirement for human oversight. It is the outline of a new profession.

A plausibility auditor is not a fact-checker. Not a quality assurance technician. Not a safety researcher who looks for misaligned objectives in training runs. A plausibility auditor is someone trained specifically to stand outside sophisticated AI outputs and ask whether those outputs correspond to reality rather than merely to internal consistency.

This requires two distinct forms of expertise that current training pipelines do not produce together.

The first is deep domain knowledge — enough expertise to recognize when a result is too clean, suspiciously convergent, subtly wrong in the way that only an expert in the specific domain would catch. The AI system that generates a novel proof in algebraic geometry needs to be reviewed by someone who has spent years in algebraic geometry, not by a generalist AI safety researcher who can evaluate the logical structure of the output but cannot evaluate its mathematical significance.

The second is knowledge of AI failure modes, which differ fundamentally from human error patterns. Human errors cluster around cognitive bias, motivated reasoning, fatigue, and the known weaknesses of intuition under uncertainty. AI errors cluster around distribution shift, spurious correlations that held in training data, confident extrapolation beyond the valid range of the model, and — most dangerously — systematic errors that look like high-quality outputs because they were trained on a corpus where high-quality outputs had certain structural characteristics. Auditing AI outputs requires knowing which kind of error you are hunting.

The training pipeline for plausibility auditors looks nothing like current AI safety work. It looks more like producing people with genuine deep expertise in a specific domain who have additionally developed the metacognitive capacity — what Penrose, extending Gödel, might describe as the recognitional faculty — to evaluate outputs they could not themselves have produced. The auditor does not need to be able to generate the proof. The auditor needs to be able to recognize whether it is actually true.

This is not a concession to human limitation. The requirement for external validation is not a temporary scaffolding that will be removed once the systems mature. It follows directly from the logical structure of the problem. The validator must be outside the system being validated. This requirement does not disappear as systems become more sophisticated. If anything, it becomes harder to satisfy, because the auditor’s task grows more demanding as the outputs grow more complex.

The Central Irony

Pachocki’s automated researcher, if it works as described, will be the thing that finally creates the market for what it treats as unnecessary.

The more sophisticated the AI output, the harder the auditing task, the more valuable the human who can do it. OpenAI’s north star may be pointing directly at the profession it forgot to invent.

There is precedent for this dynamic. The industrialization of manufacturing did not eliminate the need for quality engineers — it made quality engineering a more demanding and more specialized discipline. The digitization of financial markets did not eliminate the need for auditors — it made financial auditing a more technically demanding field and produced an entire industry of forensic accountants whose value derives precisely from the complexity of what they are reviewing.

The automated researcher will produce more outputs of greater sophistication across more domains than any previous generation of scientific tools. Each of those outputs will be a candidate. Each candidate will require validation. The validation will require humans. Not because we cannot build systems smart enough to evaluate the outputs — we will almost certainly build systems with that capability. But because the evaluation’s credibility depends on the evaluator’s accountability, and accountability requires the possibility of consequence.

An AI system does not lose its job when it certifies a flawed drug candidate. A plausibility auditor does.

What Governments Actually Need to Figure Out

Pachocki acknowledges that the concentrated power implications of this technology are “a big challenge for governments to figure out.” He is right that governments need to be involved, and right that OpenAI alone cannot resolve the governance questions.

But the governance architecture he gestures toward does not yet exist, and the reason it does not exist is that the validation infrastructure that would make it functional has not been built. You cannot regulate AI research outputs if there is no institutionalized capacity to evaluate whether those outputs are trustworthy. Chain-of-thought monitoring provides the appearance of evaluability without the substance.

The question for 2028 — when Pachocki’s multi-agent research system is scheduled to arrive — is not only whether the system works. It is whether we have built, in parallel, the human capacity to stand outside the most powerful reasoning systems ever constructed and ask the oldest question in epistemology.

Is it actually true?

No algorithm answers that. Someone has to.

bear.musinique.com · skepticism.ai · theorist.ai

Tags: AI plausibility auditor, Gödel incompleteness AI oversight, OpenAI automated researcher chain-of-thought monitoring, common cause failure AI safety, high-stakes AI

Brutalist.art - The "Beautiful.ai" that Educators Need

Nik Bear Brown — Mon, 06 Apr 2026 00:49:13 GMT

The Slide Deck You Built Was Not for the Learner

It Was for You

There is a lie at the center of most educational content production, and it goes mostly unnamed because naming it is professionally uncomfortable. The lie is this: the slide deck you built last Tuesday, the one you spent three hours arranging, the one with the custom fonts and the carefully chosen images and the thirty-seven bullets across fourteen slides — that deck was not built for the people who had to sit through it. It was built for you. It was built so you could feel the relief of having covered the material. It was built so the topic had a container. It was built because you had a deadline and a template and a vague professional obligation to produce something, and a slide deck is always something.

The learner — the specific human being with specific prior knowledge and a specific amount of time and a specific gap between what they currently understand and what they need to understand — that person never really entered the room where the deck was being built. What entered the room instead was a topic. And a topic is not a person.

Brutalist was built to address this. Not to address it gently, with suggestions and style guides and best-practice checklists. To address it structurally, in the architecture of the tool itself, before a single slide gets made.

The Architecture of Avoidance

The conventional workflow for building educational content runs roughly like this: you receive a topic (or assign yourself one), you collect material — readings, notes, data, existing slides — and you begin arranging it. If you are experienced, you arrange it with craft. You think about sequence and pacing. You choose examples. You know when to deploy a metaphor and when to let a statistic land without ornamentation. The result, at its best, is a coherent and well-paced presentation of material.

What you have not done (and this is the gap that produces most failures in educational content) is start from what the learner will be able to do when you are finished with them. You have started from what you know, and you have worked forward through that knowledge toward a clean ending. This is a completely understandable approach, and it produces content that no ordinary standard of review would recognize as failing. It is organized. It is clear. It covers the material.

It just doesn’t reliably produce learning.

Backwards design, the pedagogical framework that governs every output Brutalist produces, insists on reversing this sequence. You begin with a measurable outcome: not a topic, not a list of things the instructor will present, but a single sentence describing what a learner will be able to do at the end that they could not do at the beginning. “Construct a DAG from domain knowledge and identify all backdoor paths.” “Distinguish between a learning outcome and a topic.” “Evaluate a rubric for the difference between qualitative descriptions and observable behaviors.” These are not aspirations. They are commitments: to a learner, to a measurable change, to the possibility of knowing whether the teaching worked.

The reason most content production doesn’t begin here is not ignorance. Most instructors know what backwards design is. The reason is that starting from a learning outcome is harder than starting from a topic, and the tools available for producing educational content — PowerPoint, Keynote, Google Slides — offer no friction whatsoever against starting from the wrong place. They are indifferent to the question of who the learner is and what the learner needs to be able to do. They are happy to help you arrange forty slides around a topic, and they will never once ask whether the arrangement serves a learner or just a speaker.

Brutalist asks. It asks before it produces anything. In interactive mode (the default), it will not generate a single slide until it has confirmed the audience, confirmed the outcome, and confirmed that the outcome is measurable. “Understand X” is not measurable. Brutalist says so, explicitly, in the voice of a pedagogical skeptic rather than a customer-service chatbot: “That describes a mental state, not a behavior. A learner can’t demonstrate ‘understanding.’ What’s the one thing they should be able to do?” This is not rudeness. It is the one question that changes the output.

The Phase Gate as Moral Commitment

There is a design decision embedded in Brutalist that deserves more attention than it usually gets in conversations about AI tools, which tend to focus on capability rather than constraint. That decision is the phase gate.

A phase gate is exactly what it sounds like: a gate that holds until a phase is complete. In Brutalist, the first gate holds at source confirmation — no output until the source material is present. The second holds at outcome identification — no output until the outcome can be stated in one sentence. The third holds at form confirmation — no output until the right command for the content is confirmed. Only then does the tool produce anything.
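The three gates described above can be sketched in a few lines. This is an illustration of the idea only, not Brutalist's actual implementation, which is not shown here. The names `Intake`, `GATES`, and `first_closed_gate` are hypothetical, and the checks are deliberately crude stand-ins: "present" is treated as non-empty text, and the set of commands is limited to those named in this essay.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Commands mentioned in this essay; the real tool may accept others.
KNOWN_COMMANDS = {"/slides", "/brutalist", "/deck", "/showtell"}


@dataclass
class Intake:
    source_material: Optional[str] = None
    outcome: Optional[str] = None
    command: Optional[str] = None


def _has_text(value: Optional[str]) -> bool:
    return bool(value and value.strip())


# Gates are checked in order; each one holds until its condition is met.
GATES: List[Tuple[str, Callable[[Intake], bool]]] = [
    ("source confirmation", lambda i: _has_text(i.source_material)),
    ("outcome identification", lambda i: _has_text(i.outcome)),
    ("form confirmation", lambda i: i.command in KNOWN_COMMANDS),
]


def first_closed_gate(intake: Intake) -> Optional[str]:
    """Return the name of the first gate still holding, or None if all are open."""
    for name, is_open in GATES:
        if not is_open(intake):
            return name
    return None


def may_produce_output(intake: Intake) -> bool:
    return first_closed_gate(intake) is None


if __name__ == "__main__":
    intake = Intake(source_material="Lecture notes on causal inference")
    print(first_closed_gate(intake))  # the outcome gate is still holding
```

The ordering is the design choice. Because the gates are evaluated in sequence, a confirmed command cannot compensate for a missing outcome; nothing downstream runs until everything upstream has been settled.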

This is unusual. Most AI tools are designed to produce output as quickly as possible, because output is what users think they want and user satisfaction is what tools are optimized for. The experience of receiving forty slides in thirty seconds feels like productivity. It feels like the machine is working for you. What it actually is, much of the time, is the machine generating plausible-looking content that fills the form without serving the function — decoration rather than argument, coverage rather than learning.

Brutalist is optimized for the learner, not the user. These are not the same person. The user is the instructor who wants a slide deck. The learner is the person who will sit in front of that deck and try to change what they understand. Optimizing for the user produces faster output. Optimizing for the learner produces harder questions before any output is generated at all.

The phase gate is where this optimization manifests in the tool’s behavior. It is the structural embodiment of a moral position: that output built on wrong assumptions about audience or outcome wastes more time than the intake that would have caught those assumptions. Two minutes of friction before the deck is built is less costly than an hour of instruction that doesn’t change what anyone understands.

What “Understand X” Is Actually Doing

Spend any time in educational settings — as a student, as an instructor, as a curriculum designer — and you develop a particular sensitivity to the phrase “by the end of this, students will understand X.” It appears in syllabi, in lesson plans, in course descriptions, in accreditation documents. It appears so frequently and so unexamined that most people who write it have stopped noticing it at all. It is pedagogical wallpaper.

But the phrase is doing something specific, and it is worth naming. “Students will understand X” is a sentence that sounds like a learning outcome and functions as an escape from accountability. Understanding is a mental state. You cannot observe it, you cannot measure it, you cannot score it on a rubric or assess it in a portfolio. You can ask someone to demonstrate understanding — which means you are no longer assessing understanding, you are assessing a behavior — but the phrase as written commits you to nothing. It is a promise with no deliverable attached.

The reason this matters to a tool like Brutalist is that the learning outcome is not just the first step in backwards design. It is the specification for everything that follows. The slides that get built, the visual types that get selected, the checks for understanding that get inserted every four to six slides — these are all derived from the outcome, working backward from what the learner needs to be able to do. If the outcome is vague, the derivation has nothing to anchor to. The result is a deck that covers material in the general direction of a topic, which is not the same thing as a deck that moves a specific learner from a specific gap to a specific capability.

This is why Brutalist treats “understand X” not as a minor stylistic imprecision but as a structural failure that must be corrected before building anything. The outcome is the foundation. A vague foundation does not produce a stable structure. It produces decoration.
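A crude version of the measurability test fits in a dozen lines. To be clear about the assumptions: this is a sketch of the principle, not the check Brutalist performs, and the word list `MENTAL_STATE_VERBS` is an illustrative guess that extends “understand” with a few similar verbs. A real check would need to read the sentence, not just scan it.

```python
import re

# Illustrative list of verbs that name mental states rather than behaviors.
# Only "understand" comes from the essay; the rest are assumptions.
MENTAL_STATE_VERBS = {"understand", "understands", "know", "knows", "appreciate", "grasp"}


def is_measurable(outcome: str) -> bool:
    """Return False if the outcome is empty or leans on a mental-state verb.

    Matching is on whole words, so "knowledge" does not trip the "know" rule.
    Passing this test does not prove an outcome is measurable; failing it
    is a strong hint that it is not.
    """
    words = re.findall(r"[a-z']+", outcome.lower())
    if not words:
        return False
    return not any(word in MENTAL_STATE_VERBS for word in words)


if __name__ == "__main__":
    for candidate in (
        "Students will understand causal inference",
        "Construct a DAG from domain knowledge and identify all backdoor paths",
    ):
        verdict = "measurable" if is_measurable(candidate) else "not measurable"
        print(f"{verdict}: {candidate}")
```

The check is asymmetric on purpose. It can reject an outcome with confidence, but it can only fail to reject one, which is why the tool still needs a human to confirm the outcome before anything is built.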

Brutalist HTML and the Question of Deployment

There is a second commitment embedded in this tool that is worth examining, and it lives in the signature output: the brutalist HTML presentation. Not a PowerPoint file. Not a PDF. A single self-contained HTML file, deployable immediately, built on a design system called Musinique brutalist — JetBrains Mono, parchment tokens, per-slide audio, keyboard navigation, zero decorative radius.

The choice of HTML as the primary output format is not aesthetic. It is pedagogical and practical simultaneously. A PowerPoint file requires PowerPoint. A Google Slides file requires Google. An HTML file requires a browser, which is to say it requires nothing — it deploys anywhere, runs without software dependencies, and can be shared as a URL or a file with equal ease. The friction of tool access is a real barrier to distribution, and distribution is where educational content either serves learners or stops serving them.

The design choices embedded in the brutalist system — every slide does one thing, every title is a claim not a topic, components are typed by what they communicate rather than how they look — these are cognitive load principles encoded as aesthetic constraints. The slide with a hero number and a two-line muted caption exists because research on split attention and redundancy effects has things to say about how visual and verbal information compete for working memory. The check for understanding every four to six slides exists because spaced retrieval practice produces stronger retention than massed coverage. The design is not decoration. It is applied cognitive science, translated into a component library and a phase-gated workflow.

The Pushback Layer

Brutalist pushes back. This is the part of the tool that most users encounter with some surprise, because tools — especially AI tools — are generally not in the business of disagreement. They are in the business of helpfulness, and helpfulness has been operationally defined as producing what the user asks for as quickly as possible. Friction is a UX failure. Pushback is an anomaly.

In Brutalist, pushback is a feature. Not an accident of the model’s personality or a quirk of the prompting, but a designed behavior with specific triggers and specific exit conditions. Weak learning outcomes get flagged — not once, politely, but persistently, with an offer to rewrite the outcome if the user fails the measurability test twice. Vague audience descriptions get challenged, because “college students” is not an audience and the specificity that changes the content, examples, and pacing cannot be inferred from it. Mismatched command choices get named — if the content calls for a /showtell and the user has requested /slides, the tool explains the difference in instructional design terms before proceeding.

Every pushback ends with a path forward. This is the moral discipline that separates useful friction from obstruction. The tool is not in the business of refusing to build. It is in the business of building toward the right specification, and the right specification cannot be assumed from the wrong brief. The pushback is the tool asking the question that the instructor should have asked before they opened a blank deck and started arranging.

What is the learner supposed to be able to do?

Everything else follows from that.

Brutalist is part of the Humanitarians AI Ecosystem. The primary workflow: /slides produces the blueprint. /brutalist converts it to HTML. /deck does both in one command. Type help to begin.

Tags: Brutalist instructional design engine, backwards design pedagogy, learning outcomes Bloom’s taxonomy, brutalist HTML presentation system, educational content production failure

The Struggle Is the Point

Nik Bear Brown — Sat, 04 Apr 2026 03:35:23 GMT

The paper rough draft: https://www.nikbearbrown.com/notes/Papers/glp-framework-genuine-learning-probability

What We Lost When We Made the Artifact the Grade

Here is the situation as it actually exists, not as anyone in an official capacity is willing to describe it clearly.

A student sits down to write a paper. The paper is due in twelve hours. The student has three other assignments due this week, a job that starts at six, and the accumulated evidence of two semesters telling them that the grade lives in the artifact — the paper itself — not in the thinking that was supposed to produce it. The student opens an AI tool. The paper gets written. It is, by most measurable standards, better than what the student would have produced alone at midnight after a shift.

In the next building, the professor who assigned the paper has used AI to draft the assignment prompt, the rubric, and the feedback comments they will paste into the LMS after running the submitted papers through a grading interface that summarizes them automatically.

Neither of them is a villain. Both of them are responding rationally to a system that has always rewarded the artifact and never found a way to measure the process that was supposed to produce it. Generative AI did not create this problem. It revealed it — suddenly, completely, and without the courtesy of suggesting a solution.

This essay is about what the solution might look like. It is not technical. The technical apparatus exists and is documented elsewhere. What doesn’t exist yet, in language plain enough to be useful, is a way of talking about why the solution matters — what it would mean for a student to be seen by an educational system that has, for most of institutional history, been looking at the wrong thing.

What the Artifact Was Supposed to Prove

The essay, the exam, the project, the recorded performance — these were never the thing education cared about. They were evidence. The artifact was valuable because it was causally downstream of a process: the reading, the confusion, the rereading, the argument with yourself at two in the morning about whether you actually understood what you thought you understood. The artifact was a trace of that process. Grading the artifact was a way of inferring the process, because the two were coupled tightly enough that measuring one was effectively measuring both.

That coupling has broken. This is not a scandal or a failure or a temporary condition that better AI detection will resolve. It is a structural change in what artifacts can tell us, and it is permanent. The forensic window — the period during which you can reliably distinguish a human-written essay from an AI-generated one — is closing sequentially across every domain in which humans produce artifacts. In writing it is largely closed already. In code it is closing. The detectors trained on today’s AI outputs will be obsolete when tomorrow’s outputs arrive.

Every educational institution that is currently responding to this situation by installing better detection software is solving last year’s problem with next year’s obsolescence already scheduled.

The Complicity No One Names

The conversation about AI and academic integrity is almost entirely conducted as a conversation about student dishonesty. This framing is not wrong, exactly. It is just so incomplete as to function as a kind of dishonesty itself.

Students are using AI because the artifact is the grade. The artifact is the grade because grading the process — the confusion, the revision, the dead ends, the moments of genuine understanding — is hard, and institutions have never built the infrastructure to do it at scale. The result is a system that has always been measuring the wrong thing, and now the wrong thing can be produced in thirty seconds by a tool that costs less than a textbook.

Professors are not innocent bystanders. Many are using the same tools (drafting prompts, generating feedback, summarizing submissions) to manage the same impossible workloads that the institution’s growth model has made unmanageable. The incentive structure reaches all the way up. Publish or perish does not reward good teaching. The institution does not require good teaching to be measurable; it requires only that the artifacts of teaching (syllabi, course evaluations, enrollment numbers) look like good teaching.

The student who uses AI to write a paper is not defecting from a system that is working. They are defecting from a system that has always asked them to perform learning rather than do it, and has never been able to tell the difference. AI has not corrupted that system. AI has made the corruption visible.

This is the thing worth sitting with before any solution is proposed: the problem is not the tools. The problem is what we decided to measure, and what we decided to ignore, long before the tools arrived.

What Genuine Learning Leaves Behind

Here is what the research shows, stated plainly.

When a human being genuinely learns something hard, the process is biological. Neurons fire in response to the gap between what the learner expected and what they encountered. That gap — the prediction error — is uncomfortable. It is the feeling of not understanding, the specific texture of confusion that is different from ignorance because it knows what it doesn’t know. Working through that discomfort produces measurable changes: in how information is encoded, in how long it persists, in whether it transfers to new contexts or stays locked to the specific example through which it was learned.

Genuine learning leaves traces. Not in the artifact — the artifact is the product, and products can be manufactured without the process. The traces are in the behavior that surrounds the artifact’s production: the time spent on the hard parts, the errors that follow a coherent path as the mental model develops, the ability to apply what was learned to a problem that looks different on the surface but has the same underlying structure, the calibrated uncertainty of someone who knows not just what they know but what they don’t.

None of these traces require looking at the artifact. They require looking at the process.

This is what the concept of friction in assessment is about. Not friction as punishment, not friction as obstacle, not friction as the gatekeeping logic that has always made elite education a credentialing system for people who already had advantages. Friction as signal. The productive struggle of genuine learning — the confusion, the revision, the wrong turn and the recovery — is not the unfortunate cost of arriving at the artifact. It is the thing the artifact was supposed to be evidence of. It is the learning itself.

The proposal is to measure it directly.

What This Would Mean for a Student

I want to be specific about what it would feel like to be in a classroom where this kind of assessment exists, because the abstract case is easy to make and the human case is the one that matters.

It would mean that the time you spent genuinely confused about something counts — not as performance of confusion, not as a participation grade for looking engaged, but as actual data about actual thinking. It would mean that the draft that was a mess, the question you asked in office hours that revealed you’d been working from the wrong assumption for two weeks, the revision that turned a competent response into a thinking one — these are evidence of the thing education is supposed to produce. They would be part of the record.

It would also mean that the smooth, perfectly structured submission produced at midnight with no evidence of genuine engagement is not, by itself, proof of anything. The artifact is not worthless. It has not become zero evidence. It has become insufficient evidence. Insufficient means it needs a partner — and the partner is the process that was supposed to produce it.

This is not a punishment for using AI. It is a recognition that the artifact alone was never the right thing to measure, and that the tools which have made that limitation undeniable have also, in the same move, made the solution more urgent than it has ever been.

The Uncomfortable Truth About Friction

The research contains a finding that takes a moment to absorb. The smooth, well-structured artifact — the one that reads with perfect confidence, that has no rough edges, no places where the writer lost the thread and found it again — may be mild negative evidence of genuine learning.

The rough, searching one may be positive evidence.

Not because roughness is a virtue. Not because difficulty signals intelligence. Because genuine struggle with hard material characteristically produces texture — places where the thinking was actually happening, where the writer was working something out rather than reporting a conclusion they arrived at before they started writing. The friction of genuine learning leaves marks. The borrowed certainty of an AI-assisted artifact is often smooth in a way that real thinking, at its most effortful, is not.
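“Mild negative evidence” has a precise meaning, and a worked example makes it less alarming than it sounds. The sketch below is ordinary Bayesian updating on odds. Every number in it is invented for illustration, including the likelihood ratios of 0.8 and 1.5; none of them comes from the research this essay describes. The function name `update_belief` is likewise hypothetical.

```python
def update_belief(prior: float, likelihood_ratio: float) -> float:
    """Update the probability of genuine learning after seeing one observation.

    likelihood_ratio = P(observation | genuine learning)
                     / P(observation | no genuine learning)
    A ratio below 1 is negative evidence, above 1 is positive evidence,
    and exactly 1 is no evidence at all.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    prior = 0.5  # hypothetical starting belief
    # Hypothetical ratios: a smooth artifact as mildly negative evidence,
    # a rough, searching one as positive evidence.
    print("smooth artifact:", round(update_belief(prior, 0.8), 3))
    print("rough artifact: ", round(update_belief(prior, 1.5), 3))
```

With these made-up ratios the smooth artifact moves a 50% belief down only to about 44%. It does not convict anyone. It leaves the question open and asks for process evidence to settle it.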

This is uncomfortable because educational institutions have spent generations rewarding the smooth artifact and interpreting roughness as inadequacy. We taught students that the goal was to arrive at certainty quickly and present it cleanly. We built rubrics that rewarded the appearance of knowing and had no mechanism for distinguishing it from the thing itself.

Generative AI did not create that confusion. It just made it expensive.

What Comes Next

The framework that formalizes this argument — the specific components of friction that genuine learning leaves in observable data, the way those components can be measured, combined, and calibrated to different kinds of cognitive work — is documented in the paper that follows this introduction. It is technical in the way that any serious methodology is technical, and it is also not the point of this essay.

The point of this essay is this: the crisis that AI has created for educational assessment is not primarily a cheating problem. It is an evidence problem. The artifact, which was always a proxy for the process, can now be produced without the process. Any response that tries to restore the artifact’s evidentiary value by detecting AI use is fighting a war that the progression of technology has already decided.

The response that might actually work is to stop relying on the artifact as the sole evidence of learning, and start building the infrastructure to measure what the artifact was always supposed to be downstream of.

Students are not wrong that the system gives them no choice but to produce the artifact by whatever means are available. They are responding rationally to a broken incentive structure. Educators are not wrong that something has been lost when the struggle disappears from the work. They are mourning the only evidence they were ever given access to.

The argument this paper makes is that the struggle was always the point. It is still the point. We have spent a long time measuring the wrong thing, and the tools that have made that undeniable have also, in the process, handed us a reason to build something better.

The framework for measuring the struggle exists. The question is whether the institutions that credential learning are willing to build the infrastructure around it before the artifact becomes so decoupled from the process that the credential stops meaning anything at all.

That window is not closed. But it is not wide open either.

The struggle is the point. It is time to measure it.

Tags: AI academic integrity assessment friction traces genuine learning, generative AI education artifact decoupling, GLP framework formative assessment process evidence, student professor AI use structural incentives, irreducibly human cognitive engagement pedagogy

Boondoggling: You Are the Conductor

Nik Bear Brown — Wed, 01 Apr 2026 03:16:34 GMT

There is a moment in every AI-assisted coding session that tells you everything about the developer sitting at the keyboard. The model generates a block of code — clean, confident, internally consistent. It compiles. The tests pass. The developer commits it and moves on.

What they never ask is the question that would save them three weeks in six months: Is this solving the right problem?

I came to Boondoggling the way most people come to uncomfortable realizations — after the thing that was supposed to work didn’t. The code was technically correct. The architecture was sound. And it was aimed, with beautiful precision, at a problem that had already been reframed by the time implementation began. Claude had done exactly what it was told. Nobody had told it the right thing.

This is not an AI failure. This is a human supervisory failure. And it is the failure that the developers now spending $20 a month on AI subscriptions are making, every day, at scale.

The 20% Problem

Here is what most developers actually do with Claude Code or Cursor: they describe a problem, they delegate the implementation, they verify that the output compiles, and they ship.

That is not 100% of the job. That is 20% of the job dressed up as 100%.

The other 80% — the part that determines whether the fast, confident, technically impeccable output is pointed in the right direction — requires five capacities that no model possesses. Not because current models are limited. Because of what statistical pattern matching structurally is and is not.

Claude solves faster than any human. That gap will not close. What also will not change is this: the model cannot verify whether its output is grounded in the specific domain reality at hand. It cannot reframe a poorly formulated problem. It cannot interpret what an accurate result means in a specific human context. And it cannot integrate multiple legitimate but conflicting perspectives into a recommendation that someone is accountable for.

These are not bugs to be patched in the next release. They are features of the architecture. The model has been trained on what is common and likely. Your specific project, your specific codebase, your specific business constraint — these are neither common nor likely. The gap between what the model knows and what your situation requires is where all the damage lives.

The Conductor

The Boondoggling methodology is built around a single metaphor that earns its place rather than announcing itself. A conductor does not play any instrument. They hold the whole performance in mind while each section plays its part. They hear the wrong note before the score confirms it. They decide which piece is worth performing and how it should be interpreted. The performance collapses without them — even though they produce no sound themselves.

This is what graduate-level AI supervision looks like. And it is the role that most AI integration workflows currently fail to develop.

The developers who are getting genuine leverage from AI coding tools are not out-prompting the model. They are conducting it. Before Claude Code sees a single requirement, they have decided what the problem actually is. Before the first function is generated, they have specified what done looks like. After the output arrives, they verify it against domain reality before the next step begins.

The ones who are mostly generating technical debt faster than they generated it before — they learned to play their instrument. Nobody taught them to conduct.

Five Things the Model Cannot Do for You

The Irreducibly Human course at Northeastern — built on the same framework as Boondoggling — names these five supervisory capacities precisely. Not as professional development recommendations. As structural requirements for AI-assisted work.

Plausibility auditing is the judgment that happens before verification. It is knowing an output is wrong because of what you know about the domain — not because you ran a test. The model cannot audit its own plausibility. It does not know what it does not know. When it confabulates — when it produces a confident, internally consistent answer that is not grounded in reality — it does so fluently. The code runs. The tests pass. Plausibility auditing is the human capacity that catches this before it ships.

Problem formulation is deciding what the mission is before the model sees it. Not after. The quality of every output is determined here, at the moment of framing, before a single prompt is written. AI optimizes for the common and likely; humans must reframe toward the salient and important. The Semmelweis case, in which the formulation that saved lives was not the computationally tractable one, is the permanent lesson here. Hand problem definition to the model and you have not delegated. You have abdicated.

Tool orchestration is the sequencing decision. Which tool, in what order, with what context, and what does done look like at each handoff. The developer who reaches for Claude Code because it is already open is not orchestrating — they are defaulting. Orchestration means choosing the audit tool with a different failure mode than the generation tool, so they catch each other’s blind spots.

Interpretive judgment is supplying meaning that the model cannot supply. Which of these three implementations is correct for this context — not in the abstract, but here, in this organization, for this user, at this moment. The model can tell you what each implementation does. It cannot tell you what it means. Somebody has to sign their name to that answer. The model cannot do that either.

Executive integration is not sequencing the four prior capacities. It is holding all four simultaneously toward a unified goal: recognizing when a plausibility audit finding requires problem formulation to re-engage, or when an orchestration decision surfaces an interpretive judgment that wasn’t on the agenda. This is what the conductor does in the final movement of a difficult performance: not running a checklist, but maintaining a unified hold on where the whole thing is going.

Better models will not close these gaps. They will widen the stakes of them.

What the Build Actually Looks Like

A moderately complex website — six routes, hybrid architecture, admin dashboard, community upload pipeline, sandboxed iframe viewer, full prompt library — built using the Boondoggling method took roughly three hours. Two hours of conversation with Gru, the custom orchestration prompt. One hour with Claude Code.

Nearly all the time was spent talking. Not coding. Not debugging. Not searching documentation. Talking — precisely, in the right order, about what the site was, who it was for, what it would and would not do, and what each piece needed to be true before the next piece began.

The result was a Boondoggle Score: a conductor’s score with two simultaneous parts. The Minion Part — exact prompts for Claude, in dependency order, each with context required, expected output, and a handoff condition. The Gru Part — precise human actions, labeled by supervisory capacity, in the same dependency order.

Nine Claude tasks. Eleven human tasks. More human decisions than machine outputs. But the Claude tasks ran fast and clean because the structure was already there. Every prompt worked — not because the prompts were magic, but because the conversation that produced them was structured.

The handoff condition is the most important element in the score. It is the conductor’s downbeat. A model that does not know when to stop will stop at the wrong place or not stop at all.
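
One way to picture the score is as a small data structure. The sketch below is hypothetical and is not the actual score from that build: the task names, the handoff wording, and the helper functions `dependency_order` and `missing_handoffs` are all invented for illustration. It assumes only what the description above states: two parts, tasks in dependency order, and a handoff condition on every task.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Task:
    name: str
    part: str                 # "gru" (human action) or "minion" (prompt for the model)
    handoff: str              # what must be true before the next task begins
    depends_on: tuple = ()    # names of tasks that must finish first
    capacity: str = ""        # supervisory capacity label, used on human tasks


# Hypothetical four-task excerpt; a real score would be longer.
score = [
    Task("define_problem", "gru", "one-paragraph problem statement written",
         capacity="problem formulation"),
    Task("generate_routes", "minion", "every route renders locally",
         depends_on=("define_problem",)),
    Task("audit_routes", "gru", "routes match the problem statement",
         depends_on=("generate_routes",), capacity="plausibility auditing"),
    Task("build_upload", "minion", "a test upload succeeds",
         depends_on=("audit_routes",)),
]


def dependency_order(tasks):
    """Return tasks so that every task appears after the tasks it depends on."""
    graph = {t.name: set(t.depends_on) for t in tasks}
    by_name = {t.name: t for t in tasks}
    return [by_name[name] for name in TopologicalSorter(graph).static_order()]


def missing_handoffs(tasks):
    """Names of tasks with no stated stopping point."""
    return [t.name for t in tasks if not t.handoff.strip()]


for task in dependency_order(score):
    print(f"[{task.part}] {task.name} -> done when: {task.handoff}")
```

A task with an empty handoff shows up in `missing_handoffs`, which is the point of making the field mandatory: the stopping condition is written down before the work starts, not discovered afterward.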

The Vocabulary of What Is Actually Happening

The Boondoggling framework gives names to the different kinds of work in an AI-assisted build. The names are worth knowing because naming a thing is the first step to doing it deliberately.

Frick-fracking is the iterative work — small precise edits, one thing changed at a time, the kind of work Claude Code does exceptionally well when given clear scope. This is where the actual build lives after the structure is established. It is productive and it does not require your full attention. It is not, however, the whole job.

Noodling is the dreaming phase. Figuring out what to build before figuring out how. This happens before the model sees anything. It is the lightest touch — a thought that something could be interesting, a question about whether this feature serves the person the thing is built for. The discipline is knowing which noodle is worth developing. The problem statement is the filter.

Confabulating is the danger word. When the model produces plausible output that is not grounded in reality. It sounds correct. It reads correctly. The code compiles. Only domain knowledge catches it. This is precisely the failure mode that plausibility auditing exists to address — and precisely the failure mode that developers who have learned to prompt but not to supervise will miss every time.

What You Are Actually Responsible For

The developers most effectively using AI coding tools are not the ones generating the most code. They are the ones who have understood that their job changed — and changed in a specific direction.

The job is not to type less. The job is to decide more precisely.

You are responsible for what the problem actually is. You are responsible for what done actually looks like. You are responsible for whether the fast, confident, technically impeccable output is pointed at reality or pointed at a plausible simulation of it. The model takes no responsibility for any of this. It cannot.

The minions are excellent. They are enthusiastic. They will execute exactly what they understood you to mean.

That gap — between what you meant and what they understood — is where all the damage lives.

Anyone can use Claude Code. The question is whether you are playing an instrument or conducting the orchestra.

Tags: boondoggling AI methodology, Claude Code supervision framework, AI-assisted software development, solve-verify asymmetry, plausibility auditing human-AI collaboration

Medhavy Hub Walkthrough

Nik Bear Brown — Sun, 29 Mar 2026 06:49:35 GMT

Ask your textbook a question. Get a sourced, context-aware answer — instantly. This is a full walkthrough of Medhavy Hub, the AI-powered textbook platform built for students who want more than a page to stare at.

In this video, we walk through everything: creating your account, requesting access, navigating chapters, and using the built-in AI Assistant Panel to study smarter across Physics Volume 1 and Cancer Biology.

The AI Assistant answers from the active chapter — not the open web — and shows every source it used so you can trust and verify the response. Ask follow-up questions, request step-by-step derivations, generate concept-check questions, get the answer key, and loop back to the text with stronger understanding. Every session is yours to pace and direct.

This is what an interactive textbook actually looks like.

🔗 Create your free account → medhavy.ai

Glimmer - A Word I Didn't Know I Needed

Nik Bear Brown — Sun, 29 Mar 2026 03:49:49 GMT

I heard the word glimmer today in a sense I didn’t recognize.

Not shimmer. Not hope. Something more precise and more clinical: a specific small cue — sensory, relational, contextual — that shifts the nervous system toward safety. The granular opposite of a trigger.

The term comes from Deb Dana’s work on polyvagal theory. Stephen Porges mapped the autonomic nervous system’s responses to perceived safety and threat. Dana, in The Polyvagal Theory in Therapy: Engaging the Rhythm of Regulation (2018) and her broader clinical development of Porges’ framework, introduced glimmers as the micro-moment counterpart to what everyone already understood about triggers. A trigger is a specific cue that moves the nervous system toward defense. A glimmer is the opposite: a small specific signal that moves it toward the ventral vagal state, the condition where genuine engagement, learning, and social connection become possible.

The clinical significance is in the scale. Glimmers are not big positive experiences. They are tiny specific ones. The quality of light through a particular window. A specific person’s laugh. The weight of a familiar mug. Small enough to overlook. Specific enough to be genuinely activating when noticed.

Dana’s therapeutic application was about training clients to accumulate glimmers — building what she called a glimmer practice — as a bottom-up regulation strategy. Not cognitive reframing from the top down. Sensory specificity as the mechanism. The body first. The mind follows.

Branding and design practitioners picked the word up because it named something they had been circling for years without adequate language. The detail that makes a brand feel alive rather than performed. The specific weight of a product in the hand. The exact corner of a page. Always specific. Never general.

When I heard the word, I recognized the mechanism immediately — not from Dana, but from a problem I’d been sitting with for years.

Practical Dewey

Dewey spent his career trying to name what makes an experience come alive rather than lie flat. The difference between the encounter that genuinely reorganizes how a person sees the world and the encounter that simply adds one more item to what they already know. He called it aesthetic experience. The specific sensory moment that activates genuine engagement before the conceptual apparatus has time to categorize and dismiss it.

The practical problem with Dewey — and every educator who takes him seriously eventually hits this wall — is that genuinely reconstructive experience requires real problems with real resistance and real consequence. The child cooking an actual meal. The student building something that has to work. The inquiry that fails in a way that costs something. These conditions are often impractical at scale, difficult to design, and nearly impossible to sustain across a full curriculum.

Glimmer offers a way through.

Not as a replacement for the real — nothing replaces the real. But as the entry point that makes the real accessible. Small enough to be achievable. Specific enough to be genuinely activating. Carrying enough of the actual structure of the problem that what follows is genuine inquiry, not a simulation of it.

The fracture Dewey identified in 1900 is the same fracture the AI age has made undeniable. What follows is an attempt to think through what a glimmer-based practice might look like — and why, right now, the instrument matters as much as the argument.

John Dewey spent his career arguing that the curriculum was wrong. Not wrong in its methods, but wrong in its foundations. Teaching children to retrieve facts, execute procedures, and perform correctly for assessment was never what education was for — even when humans were the best available instruments for doing those things.

The machines didn’t create that error. They exposed it.

This is the claim most AI-in-education discourse buries or avoids. Everyone is asking: how do we use AI to improve learning outcomes? Dewey’s prior question is harder and more important: what kind of people does education produce, and are they capable of living fully, thinking independently, and participating in democratic life?

The AI age makes that question urgent in a new way. The cognitive capacities that Tier 1 education optimized for — pattern retrieval, syntactic correctness, fact recall, arithmetic speed — are now performed superhumanly by machines that fit in a pocket. The student who spent twelve years developing these capacities has spent twelve years preparing to lose a competition they didn’t know they were entering.

But the deeper problem isn’t obsolescence. It’s that the capacities education didn’t develop — problem formulation, causal reasoning, plausibility auditing, collective intelligence, practical wisdom — are now the only remaining path to a fully human life. Not because AI can’t do them. Because these capacities are what it means to think, not just to retrieve.

Dewey saw this clearly in 1900. He just didn’t have the evidence that 2025 provides.

What Dewey Actually Argued

Dewey’s central claim wasn’t pedagogical. It was epistemological. Knowledge is not a commodity to be acquired and stored. It is a capacity developed through genuine encounter with real problems. The mind is not a container. It is an instrument of adaptation — biological, social, and democratic simultaneously.

This is what he meant by the reconstruction of experience. Not the accumulation of content. Not the performance of understanding. The genuine reorganization of how a person sees and acts in the world, produced by transaction with problems that have real resistance and real consequence.

Education is not preparation for life. It is life.

The implications for curriculum are radical. Subject-area divisions are administrative conveniences mistaken for epistemological truth. History, science, mathematics, and literature are not separate in the world — they are separate in the faculty lounge. A child cooking learns chemistry, mathematics, history, economics, and social cooperation simultaneously because reality doesn’t arrive pre-sorted by department.

The inquiry process that Dewey formalized — felt difficulty, hypothesis, testing, reflection, reconstruction — is not a teaching method. It is a description of how genuine thinking actually works. Every departure from it produces what he called mis-educative experience: activity that closes off future growth rather than opening it.

Three principles govern everything that follows:

Continuity — each experience must connect to what came before and open into what comes next. An experience disconnected from the learner’s existing understanding and not pointed toward future development is inert regardless of how well it is delivered.

Interaction — genuine learning requires transaction between the learner and an environment that pushes back. A simulated environment that doesn’t resist, a case study that has no consequence, a problem designed to be solvable — none of these produce reconstruction. They produce performance.

Democratic purpose — education is not primarily economic preparation. It is the development of citizens capable of self-governance. The epistemic capacities that allow a person to formulate problems, reason through evidence, revise beliefs, and participate in collective inquiry are not soft skills. They are the prerequisites for democratic life. A population that can retrieve information but cannot reason together is not a democracy. It is a collection of well-informed individuals with no shared epistemic infrastructure.

The Taxonomy of What Remains

Against this framework, the Irreducibly Human taxonomy of human intelligence tiers is not primarily a curriculum design tool. It is a map of what education has abandoned and what the AI age makes irreplaceable.

Tier 1 — Pattern and Association. The intelligences that standardized education optimized for: linguistic ability, logical-mathematical reasoning, pattern recognition, encyclopedic recall. These are also the intelligences where machines are now superhuman. Not faster-than-average. Superhuman, by orders of magnitude, without fatigue, without error. Teaching humans to compete directly at Tier 1 is, in Dewey’s terms, teaching students to lift with their backs after the forklift has arrived.

The forklift metaphor requires extension. The point of the forklift is not to free your back so you can do other physical tasks. The point is to free your mind so you can ask what needs moving, where, and why — questions the forklift cannot ask. AI doesn’t just change the labor. It changes what counts as the work.

Tier 2 — Embodied and Sensorimotor. The knowledge that lives in the body: a surgeon’s hands, a carpenter’s feel for grain, a nurse’s ability to read tension in a patient’s movement before the patient can name it. Dewey’s Laboratory School understood this. The child cooking wasn’t simulating cooking. The child building wasn’t practicing building. The hand and the mind develop together. You cannot separate them without impoverishing both.

Tier 3 — Social and Personal. Reading others, cultural navigation, emotional regulation, moral reasoning under genuine stakes. Machines simulate these. They do not live them. A language model produces text that reads as empathetic without experiencing anything. It generates ethical arguments without having skin in the game. The danger is not that the output is wrong. The danger is that the capacity atrophies in the person who stopped exercising it.

Tier 4 — Metacognitive and Supervisory. The intelligences that oversee the others. Plausibility auditing: knowing an answer is wrong before you can prove it. Problem formulation: deciding what is worth solving. Tool orchestration: knowing which instrument to use, when, and whether to trust it. Interpretive judgment: what does this result mean in this specific context. Executive integration: coordinating all of the above toward a unified goal.

Dewey would call Tier 4 reflective inquiry in its most concentrated form. Problem formulation is exactly what he meant by the felt difficulty — the entry point of genuine inquiry. Plausibility auditing is what happens when a person has internalized enough prior reconstructed experience to sense that something is wrong before they can prove it. These capacities cannot be taught directly. They can only be developed through repeated encounter with real problems where the cost of poor judgment is genuine.

Tier 5 — Causal and Counterfactual. The capacity to ask not just what the data shows but what would happen if we intervened — and what we gave up by not intervening differently. Judea Pearl’s three rungs of causation are Dewey’s inquiry cycle made formal. Observation is pattern recognition. Intervention is hypothesis testing. Counterfactual is reflection on what the reconstruction actually cost.

JC Penney had the correlations right. Customers who paid full price showed less price sensitivity than coupon users. What the data could not tell them was what would happen if they removed the coupon system entirely. That’s an intervention. That’s Rung 2. They ran the experiment on a live business instead of a causal model. The failure was not bad data or bad analysts. It was the wrong instrument for the question being asked.

Current AI systems are superhuman at Rung 1. They are weak to absent at Rungs 2 and 3. A population that can query AI for associations but cannot formulate interventions or reason about counterfactuals has access to extraordinary pattern recognition and no capacity to make the decisions that actually matter.
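
The gap between Rung 1 and Rung 2 can be made concrete with a toy model. The sketch below is an illustration under invented assumptions, not JC Penney data: the prices, the 60/40 customer split, and the rule that price-sensitive customers buy only with a coupon are all made up. It shows how a forecast built from observed full-price buyers can diverge from what happens when the coupon system is actually removed.

```python
from dataclasses import dataclass

LIST_PRICE = 100.0    # invented number
COUPON_PRICE = 70.0   # invented number


@dataclass(frozen=True)
class Customer:
    price_sensitive: bool  # latent trait; the sales data never records it directly


def revenue_from(customer: Customer, coupons_available: bool) -> float:
    """Toy structural rule: what one customer spends under a pricing regime."""
    if customer.price_sensitive:
        # Assumed behavior: sensitive customers buy only with a coupon.
        return COUPON_PRICE if coupons_available else 0.0
    # Assumed behavior: insensitive customers pay list price either way.
    return LIST_PRICE


# Invented population: 60 price-sensitive customers, 40 who are not.
population = [Customer(True)] * 60 + [Customer(False)] * 40

# Rung 1 (association): look only at sales recorded while coupons exist.
observed = [revenue_from(c, coupons_available=True) for c in population]
observed_revenue = sum(observed)
full_price_sales = [r for r in observed if r == LIST_PRICE]
avg_full_price_spend = sum(full_price_sales) / len(full_price_sales)

# Naive extrapolation from the correlation: without coupons,
# every customer will behave like today's full-price buyers.
naive_forecast = avg_full_price_spend * len(population)

# Rung 2 (intervention): remove coupons and apply the structural rule.
intervened_revenue = sum(revenue_from(c, coupons_available=False) for c in population)

print("observed revenue with coupons:", observed_revenue)
print("naive forecast without coupons:", naive_forecast)
print("revenue after the intervention:", intervened_revenue)
```

In this toy world the correlation is accurate and the forecast is still wrong, because the observed full-price buyers were never a sample of everyone. Answering the Rung 2 question required the structural rule in `revenue_from`, which is exactly the part the observational data does not contain.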

Tier 6 — Collective and Distributed. The intelligence that is not a property of any individual but emerges from groups of people in genuine relationship. The thing that makes science work over centuries. The thing that makes democracy more than the sum of its voters. Language models may be a lossy compression of collective human intelligence — not alien intelligence but our own reflected back. What they cannot reflect is the thing that happened between us: the disagreement that refined an idea, the trust that made knowledge transmissible, the collaborative friction that no individual possessed and no training corpus can capture because it existed in the interaction, not in the record of the interaction.

Tier 7 — Existential and Wisdom. Phronesis: the practical wisdom that knows when and how to apply what you know, and when not to. This tier requires being alive, mortal, and situated in time. It requires stakes — the possibility of loss, of reputation, of a life poorly lived. You cannot teach it. You can only design the conditions that make it more or less likely to develop when a person encounters the real.

Dewey would call Tier 7 simply living. The series points toward it. The work of getting there happens elsewhere.

The Problem with Keeping Up

Here is where the practical problem announces itself.

Educators, practitioners, and intellectually serious people across every domain report the same experience: they cannot keep up. Not with tasks, not with workload — with frameworks. Causal inference. Network science. Polyvagal theory. Large language models. Transformers. Retrieval-augmented generation. Each genuinely interesting. None integrated. The accumulation produces anxiety, not capacity.

This is the most sophisticated version of the periodic table problem. It is Tier 1 about Tier 1. Pattern retrieval about frameworks for understanding patterns. The student memorizing the names of intelligences without developing any of them. The practitioner keeping up with theories of experiential learning without having a single experience that reconstructs how they see their work.

The theories are not the problem. The relationship to the theories is the problem.

An idea you’ve encountered is not a tool. An idea you’ve used on a real problem, an idea that failed, that required revision, that changed how you see the problem, is a tool. Dewey was precise about this. Ideas are instruments assessed by their practical utility in resolving specific problems. An instrument you’ve never picked up isn’t part of your toolkit. It’s an item you’ve read about.

The person drowning in frameworks doesn’t need more frameworks described more clearly. They need one framework used on one real problem until it either works or breaks in an instructive way.

The parallel experiment described below is a response to this problem.

Glimmers: The Missing Instrument

The term glimmer comes from polyvagal theory — the small, specific, sensory moment that signals safety and genuine aliveness to the nervous system. Branding practitioners adopted it because it names something they had been trying to describe for years: the specific detail that makes something feel real rather than performed. Not the logo, not the tagline — the weight of a product in the hand, the exact sound of a notification, the corner of a page that’s slightly rough.

The mechanism is specificity. Glimmers are always specific.

Dewey spent his career trying to name what makes an experience come alive rather than lie flat. His closest term was aesthetic experience — the dramatic, compelling, unifying encounter in which the learner feels genuinely absorbed. Not decorative. Not a reward for completing the real work. The aesthetic dimension of an experience is what makes it reconstructive rather than merely informative.

Glimmer is the best single word for what Dewey was pointing at.

Consider the difference:

“JC Penney experienced significant revenue decline following their pricing strategy change.”

“Revenue dropped 25% in one year. The CEO was gone in 18 months.”

The first is information. The second is a glimmer. The nervous system registers something before the conceptual apparatus engages. The felt difficulty is activated before the lesson begins.

Or consider the Sherpa asking “What did you start to say?” rather than “What happened?” The second is data collection. The first is a glimmer: the specific small move that creates the conditions for genuine reconstruction.

Or the MVAL protocol’s Environment field, which forces the student to describe organizational power structure rather than the room. The moment a student realizes what they’ve been avoiding is a glimmer. Small. Specific. Changes everything that follows.

The design criteria for a glimmer:

Specificity — not a general principle but a particular detail. 25%, not “significant.” 18 months, not “quickly.” The exact weight of something real.

Aliveness — the nervous system registers genuine encounter before the mind categorizes it. Something is at stake even before the learner can articulate what.

Scale-independence — glimmers exist in everything from a sentence to a semester. The meal at the Laboratory School was a glimmer. The question “what did you start to say?” is a glimmer. A well-designed assignment brief can contain a glimmer or not. The difference is not length or complexity.

Fractal structure — a good glimmer contains the full structure of the problem it opens. JC Penney is not a simplified version of causal reasoning. It is the entire structure of Tier 5 at human scale. The student who genuinely reconstructs what went wrong at JC Penney has encountered the real problem — not a toy version of it.

The load criterion — a glimmer without effort is information snacking with better production values. This is the test that separates a genuine glimmer from aesthetic decoration.

Training science offers the precise concept: Rating of Perceived Exertion. RPE 7-8 is productive struggle: working at the edge of current capacity with enough reserve to maintain form and recover. This is where adaptation happens. RPE 2 is 5 pounds lifted 10,000 times: high volume, negligible load, zero reconstruction. You could do it forever and never get stronger. The completion certificate gets issued. Nothing changes.

The glimmer has to carry enough weight to demand genuine effort from the learner encountering it. Not crushing; that produces shutdown, not inquiry. Not comfortable; that produces maintenance, not growth. Working at the edge of current capacity with something real at stake.

Critically, the load varies. The 350 pounds that was RPE 8 last month is RPE 6 this month. A well-designed glimmer is self-calibrating: it contains enough genuine resistance to demand real effort from someone at the right developmental stage, and it is completable enough that someone beyond that stage moves on naturally. The same specific real problem loads different capacities differently depending on where the learner is.

What doesn’t vary is the requirement for genuine effort. A glimmer that requires nothing of the learner is a micro-glimmer — a pleasant novelty hit that returns to baseline in 36 minutes. Reconstruction happens in the struggle that follows the entry point. Not in the entry point itself.

The glimmer earns its place by making the learner willing to pick up the weight. What happens after has to be real.

The Parallel Experiment: AI-Assisted Glimmers

Irreducibly Human maps what AI can and cannot do and develops the pedagogy for what remains irreducibly human. That is its purpose and it should not be diluted.

The parallel experiment is different in kind. It is the territory where the map gets tested.

The premise: AI tools have collapsed the barrier between “I wonder if” and “here is a thing that exists.” The friction between idea and working prototype has been reduced to almost nothing for a wide range of problems. This changes the curriculum bottleneck fundamentally. It used to be technical — can the student build the thing they imagine? Now it is a judgment problem — can the student identify a problem worth solving, recognize when the output is wrong, and make the call about whether the result is useful or merely impressive?

Those are Tier 4 and Tier 5 capacities. But they get developed through Tier 1 practice on small real things with low stakes. The instrument that develops judgment is not a course on judgment. It is the repeated experience of building something, encountering the moment it fails, and being required to decide why.

The parallel experiment proposes AI as a Sherpa for this process — not a teacher, not a coach, not a co-creator. A Sherpa carries the infrastructure that makes the climb possible. The climbing belongs to the builder.

The core assignment across every tier is the same:

Build one small real thing that didn’t exist yesterday and matters to someone today. Not a demonstration. Not an exercise. Not an impressive artifact. A useful thing that works, at human scale, that someone actually uses.

Small — completable this week. The Deweyian cycle requires completion. You must undergo the consequence to reconstruct from the doing. Incompletion produces learned helplessness, not inquiry. The massive project that never ships is the enemy of development.

Real — works in the world, not just in the assignment. The feedback is honest because the environment is honest. No rubric required. Did it do what you needed? Yes or no.

Useful — solves a problem someone actually has, including the builder. Useful is not the same as impressive. Many impressive things are useless. Many useful things are unimpressive. The criterion is genuine utility, not demonstration of mastery.

Potentially interesting — has an edge that might surprise. Might connect to something larger. Might matter more than expected. This criterion preserves the continuity that Dewey required: each experience opening into the next. The student who builds something interesting keeps pulling the thread past the assignment deadline.

The Glimmer as Entry Point Across Tiers

The parallel experiment is loosely mapped to the Irreducibly Human tiers not as curriculum but as orientation. The tier structure describes the territory. The glimmer is how you enter it.

Tier 1 — Tool mastery. Stakes are almost irrelevant here. Low consequence failure is fine and instructive. The glimmer assignment: find something you do repeatedly that wastes your time. Use AI to reduce that waste. Ship it. Not elegant. Not generalizable. Useful to you today.

This constraint does something important. It forces problem formulation before tool selection. You have to identify what actually wastes your time before you can build anything. That single move is already more Deweyian than most AI literacy courses.

Tier 4 — Metacognitive and Supervisory. The entry point shifts from personal to interpersonal. The glimmer assignment: build something useful for a decision someone else has to make. Now you must formulate their problem, not yours. The metacognitive demand appears immediately. You can’t outsource the judgment about what they actually need.

The moment the tool produces something confidently wrong — and it will — is the educative moment. Not the moment of correct output. The moment of plausible-sounding but incorrect output that the builder recognizes as wrong before they can prove it. That sensation is Tier 4 being born.

Tier 5 — Causal and Counterfactual. The glimmer assignment: find one decision someone in your organization made last month based on correlation they interpreted as causation. Build the causal model that shows what question they were actually asking. Show what the Rung 2 question would have been.

That’s a week’s work. It contains the full JC Penney structure. Nobody loses their job if the student gets it wrong. But the causal model has to be defensible to someone who knows the domain. That’s genuine resistance. That’s the environment pushing back.
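One way to see the gap between the correlational question and the Rung 2 question is a toy simulation. The sketch below is illustrative only: the scenario (a holiday season that drives both promotions and sales), the numbers, and the names `simulate`, `correlational_gap`, and `interventional_gap` are all invented for this example. Promotions are given no causal effect on sales by construction, so any gap in the observed data comes from the hidden common cause.

```python
import random
from statistics import mean

def simulate(n=20000, seed=0, force_promo=None):
    """Return (promo, sales) rows for n hypothetical weeks.

    force_promo=None  -> observational data: promos cluster in holiday weeks.
    force_promo=True/False -> intervention: promo is set regardless of season.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        holiday = rng.random() < 0.5                      # hidden common cause
        if force_promo is None:
            promo = rng.random() < (0.8 if holiday else 0.2)
        else:
            promo = force_promo
        # Sales depend on the season only; promo has NO causal effect here.
        sales = 100 + (50 if holiday else 0) + rng.gauss(0, 10)
        rows.append((promo, sales))
    return rows

def average_sales(rows, promo):
    return mean(sales for p, sales in rows if p == promo)

def correlational_gap(seed=0):
    """Rung 1: how do sales differ between weeks that happened to have promos?"""
    rows = simulate(seed=seed)
    return average_sales(rows, True) - average_sales(rows, False)

def interventional_gap(seed=0):
    """Rung 2: how do sales differ if we set promo on versus off ourselves?"""
    treated = simulate(seed=seed + 1, force_promo=True)
    control = simulate(seed=seed + 2, force_promo=False)
    return average_sales(treated, True) - average_sales(control, False)

if __name__ == "__main__":
    print("observed gap:", round(correlational_gap(), 1))
    print("gap under intervention:", round(interventional_gap(), 1))
```

In this invented world the observed gap is large and the interventional gap is close to zero. The student's real assignment is harder, because nobody hands them the data-generating process. They have to argue for it.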

Tier 6 — Collective and Distributed. The glimmer assignment: build something useful that requires other people to build it with you. The collective intelligence problem appears immediately. Division of labor is not collective intelligence. The thing that emerges from genuine collaborative synthesis — where the output exceeds what any individual possessed — only appears when the design requires it.

Tier 7 — Wisdom. No assignment. The horizon the other tiers point toward. The person who has built many small real things, encountered genuine failure, revised under real pressure, and carried the consequences across time — that person is developing phronesis. Not from the curriculum. From the accumulated weight of having been wrong in ways that mattered and continuing anyway.

The Theory You Need is the One You Use

The people who report they cannot keep up with new theories are not behind on the literature. They are ahead of their own application.

The gap is not between them and the frameworks. It is between the frameworks they have encountered and the real problems they have not yet used them on.

Pearl on causal inference: you don’t need to master the technical apparatus. You need to build one causal model for one real decision in your domain. Pearl becomes an instrument, not a theory to keep pace with.

Barabási on network science: you don’t need to understand scale-free networks in the abstract. You need to map one network that affects your work and notice where the hubs are. Network science becomes a lens, not a course to complete.

Dewey on experiential learning: you don’t need to read the secondary literature. You need to build one small real thing and notice what the experience taught you that reading couldn’t. Dewey becomes obvious, not academic.
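For the network case, "mapping one network" can be as small as an edge list typed by hand. The sketch below is a minimal illustration, not a method from network science proper: it counts connections per person and lists the most connected ones. The function name `find_hubs` and the names in the example edges are hypothetical.

```python
from collections import Counter

def find_hubs(edges, top=3):
    """Count how many connections each node has and return the busiest ones.

    edges: iterable of (a, b) pairs, e.g. who asks whom for sign-off.
    Returns a list of (node, degree) pairs, highest degree first.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree.most_common(top)

if __name__ == "__main__":
    # Hypothetical example: who needs whose approval on a small team.
    approvals = [
        ("Ana", "Priya"), ("Ben", "Priya"), ("Chen", "Priya"),
        ("Dev", "Priya"), ("Ana", "Ben"), ("Chen", "Dev"),
    ]
    for person, links in find_hubs(approvals):
        print(person, links)
```

Degree is the crudest possible measure of a hub. It is also enough to start the conversation about why one person sits on every path.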

The parallel experiment reframes keeping up entirely. It is not a solution to information overload. It is a replacement of information consumption with building practice. The theory you use once on a real problem is worth more than fifty theories you have kept up with.

This is the instrument. Not the map. Not the taxonomy. The repeated practice of taking a framework, finding the smallest real problem it applies to, building something, and letting the environment respond.

Glimmers are the entry points that make this practice feel alive rather than obligatory. The specific detail that activates the nervous system. The 25% and 18 months. The question “what did you start to say?” The MVAL field that reveals what the student has been avoiding. The meal on the Laboratory School table.

The full Deweyian argument, stated plainly for the AI age:

You cannot understand these ideas from the outside. You have to be changed by using them. The AI tools are the most powerful instruments for building small real things that have ever existed. The barrier between inquiry and artifact has nearly disappeared. What remains is judgment — the irreducibly human capacity to decide what is worth building, recognize when the output is wrong, and make something that genuinely matters to someone.

That capacity is not developed by keeping up with theories about it.

It is developed by building things, encountering failure, revising under real conditions, and building again.

The glimmer is what keeps you building.

What Dewey Would Build

Dewey would not build a better AI tutor. He would be alarmed by AI tutors — not because of the technology but because they make intellectual outsourcing frictionless, which is precisely the opposite of what he thought education was for.

He would be in crisis mode about the democratic implications of systems that answer questions rather than deepen them, that optimize for engagement over reflection, that make the production of knowledge dependent on a few institutions whose reasoning is opaque.

What he would build is simpler and harder:

Tools that surface the right problem before offering any solution. Environments where group inquiry is the unit of learning, not individual instruction. Infrastructure that connects learners to real communities facing real problems where their work has genuine consequence. Systems that make the reasoning behind important decisions visible and contestable by citizens.

And the parallel experiment: a practice of building small real things with AI as Sherpa, mapped loosely to the tiers of irreducibly human capacity, entered through glimmers specific enough to activate genuine inquiry.

Not because it is ambitious. Because it is real.

The meal on the table. The question that reveals what you’ve been avoiding. The thing that didn’t exist yesterday and matters to someone today.

That is what education has always been for.

The machines have simply made it undeniable.

THE TWELVE WILD DUCKS

Nik Bear Brown — Sat, 28 Mar 2026 05:13:16 GMT

A Note Before the Story

Audible told me the books I bought were mine. They said it plainly: yours forever. I believed them.

Then they removed titles from my library. Books I had paid for, marked purchased, assumed were permanent — gone. When I asked why, the answers were evasive. The terms were reinterpreted. The guarantee dissolved into fine print no one had shown me at the point of sale.

This is not a complicated situation. They took something. Then they lied about taking it.

I had two options. Buy the same book again from the company that had already demonstrated it would take it from me again. Or build something they could not reach.

I chose the second. I took a Norwegian fairy tale — “The Twelve Wild Ducks,” collected by Asbjørnsen and Moe, public domain, belonging to no platform and no corporation — and I rebuilt it with AI tools. The result is what follows.

It is better than what Audible had. Not because the technology is superior. Because I own it and I am good at this. Because no company can revoke it at midnight and blame the licensing agreement. Because the story belongs to whoever is reading it right now, which is how stories were always meant to work, before the platforms decided ownership was a subscription service.

Read it. Then go check your own digital library and count what’s missing.

The Twelve Wild Ducks — a Norwegian fairy tale, retold

Tags: Audible digital ownership, DRM audiobook removal, AI retold fairy tales, public domain Norwegian folklore, platform accountability

THE TWELVE WILD DUCKS

Once on a time there was a Queen who was out driving, when there had been a new fall of snow in the winter; but when she had gone a little way, she began to bleed at the nose, and had to get out of her sledge. And so, as she stood there, leaning against the fence, and saw the red blood on the white snow, she fell a-thinking how she had twelve sons and no daughter, and she said to herself:

“If I only had a daughter as white as snow and as red as blood, I shouldn’t care what became of all my sons.”

But the words were scarce out of her mouth before an old witch of the Trolls came up to her.

“A daughter you shall have”, she said, “and she shall be as white as snow, and as red as blood; and your sons shall be mine, but you may keep them till the babe is christened.”

So when the time came the Queen had a daughter, and she was as white as snow, and as red as blood, just as the Troll had promised, and so they called her “Snow-white and Rosy-red.” Well, there was great joy at the King’s court, and the Queen was as glad as glad could be; but when what she had promised to the old witch came into her mind, she sent for a silversmith, and bade him make twelve silver spoons, one for each prince, and after that she bade him make one more, and that she gave to Snow-white and Rosy-red. But as soon as ever the Princess was christened, the Princes were turned into twelve wild ducks, and flew away. They never saw them again—away they went, and away they stayed.

So the Princess grew up, and she was both tall and fair, but she was often so strange and sorrowful, and no one could understand what it was that failed her. But one evening the Queen was also sorrowful, for she had many strange thoughts when she thought of her sons. She said to Snow-white and Rosy-red,

“Why are you so sorrowful, my daughter? Is there anything you want? if so, only say the word, and you shall have it.”

“Oh, it seems so dull and lonely here”, said Snow-white and Rosy-red; “every one else has brothers and sisters, but I am all alone; I have none; and that’s why I’m so sorrowful.”

“But you _had_ brothers, my daughter”, said the Queen; “I had twelve sons who were your brothers, but I gave them all away to get you”; and so she told her the whole story.

So when the Princess heard that, she had no rest; for, in spite of all the Queen could say or do, and all she wept and prayed, the lassie would set off to seek her brothers, for she thought it was all her fault; and at last she got leave to go away from the palace. On and on she walked into the wide world, so far, you would never have thought a young lady could have strength to walk so far.

So, once, when she was walking through a great, great wood, one day she felt tired, and sat down on a mossy tuft and fell asleep. Then she dreamt that she went deeper and deeper into the wood, till she came to a little wooden hut, and there she found her brothers; just then she woke, and straight before her she saw a worn path in the green moss, and this path went deeper into the wood; so she followed it, and after a long time she came to just such a little wooden house as that she had seen in her dream.

Now, when she went into the room there was no one at home, but there stood twelve beds, and twelve chairs, and twelve spoons—a dozen of everything, in short. So when she saw that she was so glad, she hadn’t been so glad for many a long year, for she could guess at once that her brothers lived here, and that they owned the beds, and chairs, and spoons. So she began to make up the fire, and sweep the room, and make the beds, and cook the dinner, and to make the house as tidy as she could; and when she had done all the cooking and work, she ate her own dinner, and crept under her youngest brother’s bed, and lay down there, but she forgot her spoon upon the table.

So she had scarcely laid herself down before she heard something flapping and whirring in the air, and so all the twelve wild ducks came sweeping in; but as soon as ever they crossed the threshold they became Princes.

“Oh, how nice and warm it is in here”, they said. “Heaven bless him who made up the fire, and cooked such a good dinner for us.”

And so each took up his silver spoon and was going to eat. But when each had taken his own, there was one still left lying on the table, and it was so like the rest that they couldn’t tell it from them.

“This is our sister’s spoon”, they said; “and if her spoon be here, she can’t be very far off herself.”

“If this be our sister’s spoon, and she be here”, said the eldest, “she shall be killed, for she is to blame for all the ill we suffer.”

And this she lay under the bed and listened to.

“No”, said the youngest, “’twere a shame to kill her for that. She has nothing to do with our suffering ill; for if any one’s to blame, it’s our own mother.”

So they set to work hunting for her both high and low, and at last they looked under all the beds, and so when they came to the youngest Prince’s bed, they found her, and dragged her out. Then the eldest Prince wished again to have her killed, but she begged and prayed so prettily for herself.

“Oh! gracious goodness! don’t kill me, for I’ve gone about seeking you these three years, and if I could only set you free, I’d willingly lose my life.”

“Well!” said they, “if you will set us free, you may keep your life; for you can if you choose.”

“Yes; only tell me”, said the Princess, “how it can be done, and I’ll do it, whatever it be.”

“You must pick thistle-down”, said the Princes, “and you must card it, and spin it, and weave it; and after you have done that, you must cut out and make twelve coats, and twelve shirts, and twelve neckerchiefs, one for each of us, and while you do that, you must neither talk, nor laugh, nor weep. If you can do that, we are free.”

“But where shall I ever get thistle-down enough for so many neckerchiefs, and shirts, and coats?” asked Snow-white and Rosy-red.

“We’ll soon show you”, said the Princes; and so they took her with them to a great wide moor, where there stood such a crop of thistles, all nodding and nodding in the breeze, and the down all floating and glistening like gossamers through the air in the sunbeams. The Princess had never seen such a quantity of thistledown in her life, and she began to pluck and gather it as fast and as well as she could; and when she got home at night she set to work carding and spinning yarn from the down. So she went on a long long time, picking, and carding, and spinning, and all the while keeping the Princes’ house, cooking, and making their beds. At evening home they came, flapping and whirring like wild ducks, and all night they were Princes, but in the morning off they flew again, and were wild ducks the whole day.

But now it happened once, when she was out on the moor to pick thistle-down—and if I don’t mistake, it was the very last time she was to go thither—it happened that the young King who ruled that land was out hunting, and came riding across the moor, and saw her. So he stopped there and wondered who the lovely lady could be that walked along the moor picking thistle-down, and he asked her her name, and when he could get no answer, he was still more astonished; and at last he liked her so much, that nothing would do but he must take her home to his castle and marry her. So he ordered his servants to take her and put her up on his horse. Snow-white and Rosy-red, she wrung her hands, and made signs to them, and pointed to the bags in which her work was, and when the King saw she wished to have them with her, he told his men to take up the bags behind them. When they had done that the Princess came to herself, little by little, for the King was both a wise man and a handsome man too, and he was as soft and kind to her as a doctor. But when they got home to the palace, and the old Queen, who was his stepmother, set eyes on Snow-white and Rosy-red, she got so cross and jealous of her because she was so lovely, that she said to the king:

“Can’t you see now, that this thing whom you have picked up, and whom you are going to marry, is a witch. Why? she can’t either talk, or laugh, or weep!”

But the King didn’t care a pin for what she said, but held on with the wedding, and married Snow-white and Rosy-red and they lived in great joy and glory; but she didn’t forget to go on sewing at her shirts.

So when the year was almost out, Snow-white and Rosy-red brought a Prince into the world; and then the old Queen was more spiteful and jealous than ever, and at dead of night, she stole in to Snow-white and Rosy-red, while she slept, and took away her babe, and threw it into a pitful of snakes. After that she cut Snow-white and Rosy-red in her finger, and smeared the blood over her mouth, and went straight to the King.

“Now come and see”, she said, “what sort of a thing you have taken for your Queen; here she has eaten up her own babe.”

Then the King was so downcast, he almost burst into tears, and said:

“Yes, it must be true, since I see it with my own eyes; but she’ll not do it again, I’m sure, and so this time I’ll spare her life.”

So before the next year was out she had another son, and the same thing happened. The King’s stepmother got more and more jealous and spiteful. She stole into the young Queen at night while she slept, took away the babe, and threw it into a pit full of snakes, cut the young Queen’s finger, and smeared the blood over her mouth, and then went and told the King she had eaten up her own child. Then the King was so sorrowful, you can’t think how sorry he was, and he said:

“Yes, it must be true, since I see it with my own eyes; but she’ll not do it again, I’m sure, and so this time too I’ll spare her life.”

Well, before the next year was out, Snow-white and Rosy-red brought a daughter into the world, and her, too, the old Queen took and threw into the pit full of snakes, while the young Queen slept. Then she cut her finger, smeared the blood over her mouth, and went again to the King and said,

“Now you may come and see if it isn’t as I say; she’s a wicked, wicked witch, for here she has gone and eaten up her third babe, too.”

Then the King was so sad, there was no end to it, for now he couldn’t spare her any longer, but had to order her to be burnt alive on a pile of wood. But just when the pile was all a-blaze, and they were going to put her on it, she made signs to them to take twelve boards and lay them round the pile, and on these she laid the neckerchiefs, and the shirts, and the coats for her brothers, but the youngest brother’s shirt wanted its left arm, for she hadn’t had time to finish it. And as soon as ever she had done that, they heard such a flapping and whirring in the air, and down came twelve wild ducks flying over the forest, and each of them snapped up his clothes in his bill and flew off with them.

“See now!” said the old Queen to the King, “wasn’t I right when I told you she was a witch, but make haste and burn her before the pile burns low.”

“Oh!” said the King, “we’ve wood enough and to spare, and so I’ll wait a bit, for I have a mind to see what the end of all this will be.”

As he spoke, up came the twelve princes riding along, as handsome well-grown lads as you’d wish to see; but the youngest prince had a wild duck’s wing instead of his left arm.

“What’s all this about?” asked the Princes.

“My Queen is to be burnt,” said the King, “because she’s a witch, and because she has eaten up her own babes.”

“She hasn’t eaten them at all”, said the Princes. “Speak now, sister; you have set us free and saved us, now save yourself.”

Then Snow-white and Rosy-red spoke, and told the whole story; how every time she was brought to bed, the old Queen, the King’s stepmother, had stolen into her at night, had taken her babes away, and cut her little finger, and smeared the blood over her mouth; and then the Princes took the King, and shewed him the snake-pit where three babes lay playing with adders and toads, and lovelier children you never saw.

So the King had them taken out at once, and went to his stepmother, and asked her what punishment she thought that woman deserved who could find it in her heart to betray a guiltless Queen and three such blessed little babes.

“She deserves to be fast bound between twelve unbroken steeds, so that each may take his share of her”, said the old Queen.

“You have spoken your own doom”, said the King, “and you shall suffer it at once.”

So the wicked old Queen was fast bound between twelve unbroken steeds, and each got his share of her. But the King took Snow-white and Rosy-red, and their three children, and the twelve Princes; and so they all went home to their father and mother, and told all that had befallen them, and there was joy and gladness over the whole kingdom, because the Princess was saved and set free, and because she had set free her twelve brothers.

Stop Hunting for Answers. Ask Your Course

Nik Bear Brown — Fri, 27 Mar 2026 04:33:53 GMT

Learn more → https://medhavy.ai

Read more on the Medhavy blog: https://medhavy.ai/blog

There is a moment most students know. You are twelve minutes into a lecture, or forty pages into a chapter, and the explanation has stopped making contact. The words are still arriving — the instructor is still talking, the textbook still has sentences — but something has decoupled. You are receiving information. You are not learning anything.

What happens next depends on who you are. Some students stop and ask a question. Some open a second tab. Some take more aggressive notes, as if the problem is that they haven’t written fast enough. Most do what people do when a machine stops working: they wait, and hope it starts again.

The system’s response to this moment is almost always the same. It continues. The lecture does not pause to recalibrate. The textbook does not offer a different approach. The platform logs that you have completed the module. You have not completed the module. You have sat in the room while the module happened.

This is not a technology problem. It is a philosophy problem. And the technology we have built to fix it has mostly encoded the same philosophy in a more expensive box.

The Illusion of Adaptation

For the past decade, the word adaptive has done significant damage to educational technology.

Adaptive, in the way most platforms use it, means personalized in the sense that a streaming service is personalized — the algorithm has observed your behavior and is now showing you more of what you already clicked on. Netflix knows you watch crime dramas. It does not know whether you understood them. It does not know whether watching more crime dramas is good for you. It knows you did not turn it off.

Apply this logic to learning and you get what we have: platforms that track completion, adjust pacing, and serve more of what a student has already engaged with. A student who moves quickly gets harder content. A student who slows down gets simpler content. This is not adaptation. This is a speed adjustment. The car is still going the same direction. It is going faster or slower based on whether you look nervous.

The deeper variable — the one that actually determines whether a person learns something — is not pace. It is approach. Whether the concept is explained directly or discovered through questions. Whether it is anchored in a case study or built from first principles. Whether the learner is asked to produce something or receive something. Whether the material is revisited strategically or encountered once and abandoned to memory.

These are pedagogical choices. They have been studied for decades. There are researchers who have spent careers trying to understand which approach works for which person under which conditions. The literature is substantial and inconclusive — because the answer is not fixed. Different people learn differently. The same person learns differently on different days, at different moments in a topic, at different levels of prior knowledge.

The honest conclusion from all of this research is not a recommendation. It is a method. You have to run the experiment.

The Bandit

The multi-armed bandit is a framework borrowed from probability theory, named for the slot machines in a casino — each with a different payout rate, none of them labeled.

The problem the framework solves is this: you have several options, you don’t know which one is best, and you have to act while you’re still learning. You cannot spend all your time testing (you’ll never exploit what you’ve learned) and you cannot commit to the first option that works (you might be missing something better). The bandit framework manages this tradeoff — choosing the option that currently looks best while continuously allocating some probability to exploring the alternatives.
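The tradeoff can be sketched in a few lines. What follows is one simple strategy from the bandit family, usually called epsilon-greedy: most of the time it plays the machine with the best observed payout, and with a small probability `epsilon` it tries a random one. This is a minimal illustration under assumed payout rates, not a description of any particular production system. The class and function names are made up for the example.

```python
import random

class EpsilonGreedyBandit:
    """Tracks an average reward per arm; explores with probability epsilon."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}     # pulls per arm
        self.values = {arm: 0.0 for arm in self.arms}   # mean reward per arm
        self.rng = random.Random(seed)

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)                  # explore
        return max(self.arms, key=lambda a: self.values[a])    # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: new = old + (reward - old) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def simulate(payout_rates, pulls=5000, epsilon=0.1, seed=0):
    """Play unlabeled machines whose true payout rates the bandit never sees."""
    world = random.Random(seed)
    bandit = EpsilonGreedyBandit(payout_rates.keys(), epsilon, seed=seed + 1)
    for _ in range(pulls):
        arm = bandit.choose()
        reward = 1.0 if world.random() < payout_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit

if __name__ == "__main__":
    # Hypothetical machines; the rates are assumptions for the demo.
    result = simulate({"a": 0.2, "b": 0.5, "c": 0.8})
    print(result.counts)
    print({arm: round(v, 2) for arm, v in result.values.items()})
```

The design choice worth noticing is that exploration never stops. Even after one machine looks clearly best, a fixed share of pulls still goes to the others, which is what lets the estimates recover if early luck was misleading.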

Medhavy applies this framework not to slot machines but to pedagogical approaches. Five of them: direct instruction, Socratic questioning, case-based learning, spaced retrieval practice, and project-based generative learning. Each is a coherent educational philosophy with its own decades-long research tradition. Direct instruction works for foundational concepts, clear definitions, sequences that need to be right before anything else can proceed. Socratic questioning works for learners who have surface-level confidence and need to be pushed past the answer they’re pattern-matching toward. Case-based learning works for professionals whose knowledge only means something when it contacts a real decision. Spaced retrieval works for cumulative content where earlier concepts must survive long enough to support later ones. Project-based learning works when demonstrated output is the actual goal.

Each of these approaches requires different content, a different AI persona, a different conversational posture. The platform has to be built differently depending on which one is active. This is not a toggle. It is architecture.

What the bandit does is decide, for each learner at each moment, which approach to deploy — then observe what happens — then update its model. If a learner is getting grounded, engaged responses under the Socratic approach and then the pattern breaks, the bandit notices. It tries something else. When the evidence comes in, the model updates. Not for the cohort. For this learner, in this moment, in this chapter.
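A rough sketch of the per-learner loop is below. It is an assumption-laden illustration, not Medhavy's implementation: the class name `LearnerBandit`, the arm labels, the starting estimate of 0.5, and the idea of scoring each interaction as a reward between 0 and 1 are all hypothetical. The one deliberate choice is the fixed step size, which weights recent interactions more heavily than old ones so that a broken pattern shows up quickly.

```python
import random
from collections import defaultdict

APPROACHES = (
    "direct_instruction", "socratic", "case_based",
    "spaced_retrieval", "project_based",
)

class LearnerBandit:
    """Keeps a separate set of estimates for every learner."""

    def __init__(self, arms=APPROACHES, epsilon=0.1, step=0.3, seed=None):
        self.arms = tuple(arms)
        self.epsilon = epsilon      # share of choices spent exploring
        self.step = step            # how strongly the latest evidence counts
        self.rng = random.Random(seed)
        # learner_id -> {approach: estimated reward}, all starting at 0.5
        self.estimates = defaultdict(lambda: {arm: 0.5 for arm in self.arms})

    def choose(self, learner_id):
        estimates = self.estimates[learner_id]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        # Ties resolve to the first arm listed.
        return max(self.arms, key=estimates.__getitem__)

    def update(self, learner_id, arm, reward):
        """Recency-weighted update: estimate moves a fixed fraction toward reward."""
        estimates = self.estimates[learner_id]
        estimates[arm] += self.step * (reward - estimates[arm])

if __name__ == "__main__":
    bandit = LearnerBandit(epsilon=0.0)
    for _ in range(3):                      # Socratic is working
        bandit.update("learner_1", "socratic", 1.0)
    print(bandit.choose("learner_1"))
    for _ in range(2):                      # then the pattern breaks
        bandit.update("learner_1", "socratic", 0.0)
    print(bandit.choose("learner_1"))
```

With a plain running average, three good interactions followed by two bad ones would still leave the old approach on top. The fixed step is what makes the estimate forget.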

Most adaptive platforms are adaptive at the level of the cohort, or at the level of the module, or at best at the level of the pacing track. Medhavy’s bandit is adaptive at the level of the pedagogical philosophy itself — the deepest variable, the one that actually determines contact.

What Running the Experiment Actually Means

Here is what it means in practice, because the abstraction is easy to nod at without grasping.

A business school executive logs into a white-labeled deployment of the platform — the institution’s logo, their colors, a persona configured to sound like a senior corporate strategy advisor. She is working through a module on AI literacy. The bandit has no prior data on her. It defaults to direct instruction — explicit definitions, worked examples, clear sequencing.

She moves through it quickly. Her dwell time on the explanatory sections is short. She is not pausing to absorb. She already knows this. The bandit observes this pattern and shifts: the persona begins responding with questions rather than answers. When she states that AI can reduce operational costs, the advisor asks: in which cost category, specifically? What assumption about labor productivity is that estimate resting on? She slows down. She starts typing longer responses.

This is contact. The bandit records it.

Three modules later, she is in unfamiliar territory. The Socratic approach that worked before has stopped working — she is guessing rather than reasoning, which looks the same from the outside but registers differently in the interaction pattern. The bandit shifts again, this time to case-based learning. The persona anchors the next concept in a documented business case. She can see what happened, evaluate what went wrong, apply the framework to the scenario. The abstraction becomes legible through the example.

None of this requires a human to observe her, diagnose her, and intervene. It runs continuously, invisibly, updating with every interaction. At the end of the cohort, the institution sees which pedagogical approaches drove the most durable engagement, where the content has gaps (the ratio of grounded answers to “not in textbook” responses), and which modules generated the most friction. The credential the institution issues has actual learning evidence behind it.
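The content-gap signal is simple arithmetic once each tutor response is logged as grounded or not. The sketch below assumes a hypothetical log format of `(module, was_grounded)` pairs and reports the share of grounded responses per module. The function name and the module labels are invented for illustration.

```python
from collections import Counter

def grounded_share_by_module(events):
    """Return {module: fraction of responses grounded in the textbook}.

    events: iterable of (module, was_grounded) pairs, where was_grounded is
    False when the tutor had to answer that the material is not in the textbook.
    """
    totals = Counter()
    grounded = Counter()
    for module, was_grounded in events:
        totals[module] += 1
        if was_grounded:
            grounded[module] += 1
    return {module: grounded[module] / totals[module] for module in totals}

if __name__ == "__main__":
    # Hypothetical log entries.
    log = [
        ("module_1", True), ("module_1", True), ("module_1", True), ("module_1", False),
        ("module_2", True), ("module_2", False), ("module_2", False), ("module_2", False),
    ]
    for module, share in grounded_share_by_module(log).items():
        print(module, share)
```

A module with a low share is one where students keep asking things the textbook does not cover, which is a statement about the content, not about the students.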

This is what it means to run the experiment. Not to have a theory about which approach is best. To find out.

The Constraint That Makes It Honest

There is one more piece of the architecture that matters, and it is the most counterintuitive.

The AI tutor that runs inside Medhavy is not allowed to use the internet. It is not allowed to draw on general knowledge. It is not allowed to speculate. When a student asks a question, the tutor searches the course content — the verified, expert-reviewed textbook built for this specific deployment — and grounds its response in what is actually there. If the answer is not in the textbook, the tutor says so. Not in the textbook. That is the response.

This sounds like a limitation. It is the point.

The failure mode of every general-purpose AI tutor is that it sounds authoritative whether or not it is correct. It produces fluent, confident, plausible responses. Students who cannot evaluate whether the response is accurate have no way to know when it has invented something. The TEXTBOOK_ONLY constraint eliminates this failure mode by eliminating the thing that causes it. The tutor cannot hallucinate because it cannot leave the source material.

A student who gets “Not in textbook” has not gotten a wrong answer. They have gotten a real signal: this question is beyond the scope of what we’re covering here, and you should know that. That is pedagogically useful. That is honest. The platform would rather say nothing than say something false.
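The shape of the constraint can be sketched without any model at all. The example below is a deliberately crude stand-in: it matches a question against textbook passages by shared words and returns the best passage verbatim, or the refusal string when nothing matches well enough. The keyword matching, the `min_overlap` threshold, the stopword list, and the function names are all assumptions made for illustration. A real deployment would presumably use stronger retrieval and generate its wording from the retrieved text.

```python
import re

NOT_IN_TEXTBOOK = "Not in textbook."

# A small, hypothetical stopword list so common words do not count as matches.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "who", "how", "why",
             "of", "to", "in", "and", "does", "do"}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def answer_from_textbook(question, passages, min_overlap=2):
    """Return the best-matching passage, or the refusal if none is close enough."""
    question_words = tokenize(question) - STOPWORDS
    best_passage, best_score = None, 0
    for passage in passages:
        score = len(question_words & tokenize(passage))
        if score > best_score:
            best_passage, best_score = passage, score
    if best_passage is None or best_score < min_overlap:
        return NOT_IN_TEXTBOOK
    return best_passage

if __name__ == "__main__":
    textbook = [
        "A multi-armed bandit balances exploration and exploitation.",
        "Spaced retrieval practice revisits earlier concepts on a schedule.",
    ]
    print(answer_from_textbook("What is a multi-armed bandit?", textbook))
    print(answer_from_textbook("Who won the 1998 World Cup?", textbook))
```

The refusal branch is the part that matters. Everything the function can return is either a passage that exists in the source or an explicit statement that no such passage was found.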

Most EdTech does not make this choice. Most EdTech prioritizes the appearance of competence over the reality of it. Medhavy has decided that the constraint is the credibility.

What This Means for Anyone Paying Attention

The argument for Medhavy is not that it is smarter than other platforms. It is that it is more honest about what learning requires.

Learning requires contact — the moment when an explanation actually reaches someone. That moment is not guaranteed by pacing, or by completions, or by a student sitting in the virtual room while the module happens. It requires the right approach for this person at this moment, applied consistently enough to work, abandoned quickly enough when it stops.

The bandit does not know in advance which approach is right. It cannot. Nobody can. What it does instead is run the experiment continuously, update on evidence, and refuse to commit to a prior that the evidence no longer supports.

That is not a gambling algorithm applied to education. That is what good teaching has always been — the willingness to try something different when what you’re doing stops working, the discipline to notice when it stops working before the student gives up, and the honesty to say, when you don’t know the answer: I don’t know. But I know where to look.

The machine has learned something most platforms haven’t.

The question is whether the institutions that deploy it are willing to learn the same thing: that the evidence matters more than the assumption, and that running the experiment is not a sign of uncertainty.

It is the whole method.

Tags: Medhavy AI adaptive learning, multi-armed bandit pedagogy, EdTech platform architecture, personalized learning systems, AI tutor grounded retrieval

What School Was Always Bad At

Nik Bear Brown — Wed, 25 Mar 2026 22:50:48 GMT

Irreducibly Human: https://www.irreducibly.xyz/

The panic arrived in the wrong order.

When ChatGPT went public in November 2022, schools declared a crisis. Students were cheating. Essays were being written by machines. Arithmetic was being performed by algorithms. The question administrators asked — urgently, in emergency faculty meetings, in policy documents rushed into existence over winter break — was how to detect this. How to prevent it. How to put the genie back in the bottle.

Nobody asked the prior question.

Why are we assigning work a machine can do?

Here is what the panic missed: AI didn’t break education. It exposed a failure that was already there, running quietly for decades, producing graduates optimized for exactly the tasks that software now performs better, faster, and cheaper than any human being alive. The curriculum we built — and built deliberately, and defended with genuine belief in its value — was a curriculum for a world that no longer exists.

Machines arrived. And we could finally see what we had been training people to do.

The Curriculum We Built

To be clear: the failure was not malicious. Institutional inertia is not stupidity. Schools change slowly because they were built to transmit what is known, not to respond to what is new. That feature is now a bug. For most of the twentieth century, arithmetic speed and fact retrieval were genuinely valuable human capacities. An accountant who could run numbers in her head was worth hiring. A lawyer who had memorized case law was difficult to replace. An engineer who could recall formulas without looking them up got work done faster.

That world is gone.

The intelligent response to the invention of the forklift is not to practice lifting heavier objects. It is to learn to operate the machine, understand what it can and cannot lift, and — most crucially — develop the judgment to know what needs lifting in the first place. The question the forklift raises is not about strength. It is about what the work actually is, now that strength is no longer the constraint.

Irreducibly Human: What AI Can and Can’t Do is a six-book curriculum series built around that question. It does not teach students to compete with AI. It teaches them to supply the reasoning that AI tools require humans to provide — the reasoning no tool can supply on their behalf.

The series organizes human intelligence into seven tiers by a single criterion: what machines can and cannot do. Where AI is strongest — pattern recognition, fact retrieval, syntactic correctness, encyclopedic recall — the curriculum doesn’t train humans to compete directly. That would be malpractice. Where AI is weakest — causal reasoning, metacognitive oversight, collective intelligence, practical wisdom — the curriculum rebuilds from scratch.

The name changed recently. It was called The Human Half: What AI Can’t Do. The rename matters. “What AI can’t do” is a defensive posture — we are mapping a shrinking territory, waiting to see how much ground we lose. “Irreducibly human” says something different. There are capacities that are not merely outside AI’s current capability. They are outside its fundamental nature. Not gaps waiting to be filled. Structure.

The Gardner Trap

In 1983, Howard Gardner published Frames of Mind and cracked something open.

Multiple intelligences, he argued. Not one general intelligence but several: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, intrapersonal. The framework was a genuine provocation. It said that the student who couldn’t sit still and parse grammar might have an intelligence the school wasn’t measuring. It said that the child who couldn’t add fractions might still understand the geometry of a room in her body before she crossed it.

Schools responded. Enthusiastically. “We teach to all the intelligences,” they said. And then, largely, they kept doing what they had always done.

Forty years later, there is still no validated assessment for intrapersonal intelligence. The curriculum that was supposed to follow the framework never fully arrived. What arrived instead was vocabulary. Teachers learned to say “multiple intelligences” the way they learned to say “growth mindset” — as a description of what they believed, not as a specification of what they would do differently on Monday morning.

This is the Gardner Trap: naming a thing so well that the naming feels like the work.

Gardner’s framework was built before machines became capable, which means it didn’t need to ask which intelligences technology endangered. It also didn’t name three tiers the series considers essential: the supervisory layer (knowing when an answer is wrong before recomputing it, knowing which tool to deploy and whether to trust what it returns), the causal layer (not just observing that X follows Y but reasoning about what happens if you intervene, about what would have happened if you had not), and the collective layer (the intelligence that emerges from groups working together in ways that exceed the sum of individual ability — the intelligence of science, of markets, of democracy, of any collaborative practice that generates knowledge no single person could generate alone).

None of these are properties of individuals. You cannot have supervisory intelligence in a vacuum — it requires a tool to supervise, a context in which the supervision matters, stakes. You cannot do causal reasoning without a question worth asking. Collective intelligence is definitionally not possessed; it is accomplished together.

An algorithm has access to the literature. It is absent from the practice that generates new knowledge. That absence is not a temporary limitation. It is a structural one.

Irreducibly Human is explicitly Stage 1 of a three-stage sequence: Name it. Teach it. Measure it. Gardner did Stage 1 brilliantly. Forty years passed. The series is an attempt to hold Stage 1 more honestly — to name only what can be defined clearly enough to teach, and to be transparent about where the measurement infrastructure doesn’t yet exist. Stages 2 and 3 are in development, in collaboration with the Center for Curriculum Redesign. The series is not claiming to have completed them. It is claiming that Stage 1 done honestly — with specific learning outcomes, sequenced exercises, and defined criteria for success — is rarer than it sounds, and more necessary than the field has acknowledged.

What the Series Actually Is

Six books. Two companions. A complete production infrastructure.

AI Literacy, Fluency, and Trust is the entry point — how to operate the machine without being replaced by it. Causal Reasoning is the identification layer — what causes what, and why no algorithm can answer that for you. AImagineering is post-AI design thinking — one week on ideation, the rest on the judgment that makes ideation matter. Ethical Play asks students to build a game that makes a player feel moral weight, then survive an AI audit proving the ethics are in the mechanics and not just in the documentation. Conducting AI teaches the five supervisory capacities no algorithm possesses — hearing the wrong note, choosing the piece, directing the sections. The Collective addresses the intelligence that cannot be possessed. Only accomplished. Together.

The companion books extend the series into domains the core texts cannot reach. A teacher’s guide addresses fifteen fields where the body knows things that language models do not: lab science, woodshop, nursing simulation, surgical training, studio art, dance, trades. A practitioner’s guide for experiential learning addresses the co-op coordinators, clinical placement directors, and study abroad advisors who send students into the world to learn — because practical wisdom, the Aristotelian capacity to know when and how to apply what you know and when not to, cannot be taught in a classroom. It can be scaffolded in the field.

The series is being built with the same tools it teaches. That is not an accident. Every book in the series was produced using an AI-assisted production infrastructure — a chapter drafting engine, an assertion verification system that scans claims and flags suspect ones for expert review, a figure generation protocol, a custom case study generator, a peer review framework, a game design document consultant. A 38-chapter textbook in cancer biology was written in approximately one month using this infrastructure and is currently in production in an NIH program. The Boyle System — a documentary infrastructure for scientific reproducibility — reduced the time senior researchers spent reviewing mentee work from sixty percent of each meeting to twenty, across more than 150 fellows in applied AI humanitarian contexts.

The thesis is demonstrated by the method used to build it. The forklift is being operated. What the forklift cannot lift is being named, precisely, in each chapter.

What This Is Not

It is not a book about AI.

This distinction is harder to hold than it sounds, because AI is everywhere in the series — as the subject of study, as the production infrastructure, as the adversary the ethics course must survive. But AI is not the center of gravity. Humans are. Specifically, the capacities that make humans irreplaceable not in spite of AI but because of it — because the tools require human judgment to operate, human values to direct, human stakes to make the outputs matter.

An algorithm has no stakes. It cannot commit because it cannot lose. The series is built for people who can lose, who are mortal and situated in time, who will have to live with the decisions the tools help them make. Those people need a curriculum that prepares them for the work the tools cannot do. That work is not shrinking. It is expanding.

The schools that spent the last two years trying to detect AI-generated student essays were asking the wrong question. The right question is what we are asking students to do with their irreducible minds, now that the machines have taken everything else.

Irreducibly Human is an attempt to answer that.

Tags: Irreducibly Human curriculum series, AI education reform, Howard Gardner multiple intelligences critique, causal reasoning pedagogy, human capacities AI cannot replace

Marley — Talk to Your Website

Nik Bear Brown — Mon, 23 Mar 2026 21:18:21 GMT

MARLEY: https://marley.bearbrown.co/

Most website templates give you a starting point and then leave you alone with it.

Marley doesn’t. Marley is a Next.js template built for a specific kind of collaboration: you clone it, you open Claude Code in the directory, and you talk to it. You say what you want. The website changes. You say something else. The website changes again. The website is never finished — it’s a living document that evolves as your needs become clearer.

Here’s what it ships with: a blog system, a tools directory, a Substack importer that pulls your posts (and checks for duplicates, and imports your drafts), and support for animations and D3 graphs that Substack itself can’t render. It’s self-documenting — it can generate a technical reference for its own features, suggest what to build next, and create spec documents for proposed additions. It also exposes Claude prompt tools publicly, so your tools page becomes a real tool directory, not just a list of links.

The workflow is simple. Open the template. Open Claude Code. Tell it who you are and what you don’t need. Remove the blog. Change the brand. Update the links. Connect your Substack. Add your tools. The template becomes your site because you told it to.

Marley is MIT licensed, open source, and built by Nik Bear Brown. It’s the infrastructure for bearbrown.co and the Musinique ecosystem — rebuilt every time a conversation asked it to be different.

Clone it. Talk to it. See what it becomes.

→ GitHub · Built by Nik Bear Brown · The Skepticism AI Substack

Tags: Next.js website template, Claude Code integration, talk-to-your-website, Substack importer Next.js, living document web development

What this document is

A reference for the Marley multi-brand Next.js template. It covers what the template contains, how each system is structured, the full database schema, the route map, and the environment variables required for deployment. It closes with five proposed future additions. Use this when navigating an unfamiliar part of the codebase, planning a new feature, or onboarding a second developer.

1. What Marley is

Marley is a production-grade Next.js site template that proves its own flexibility by wearing different costumes. The same codebase is styled for multiple fictional businesses from public domain literature — each with a distinct voice, palette, and copy — without touching routing, components, or infrastructure.

The template demonstrates itself. Each brand instance is a stress test: if Scrooge & Marley’s austere ledger aesthetic and Au Bonheur des Dames’ lush retail warmth can coexist in the same codebase, the theming system is real.

The base codebase was derived from the Medhavy adaptive learning platform (Medhavy LLC, Nik Bear Brown and Srinivas Sridhar). All Medhavy branding has been replaced per brand instance. The infrastructure — routing, admin, database schema, API contracts — is shared and unchanged across instances.

Current brand instances

| Brand | Source | Industry (fictional) | Status |
| --- | --- | --- | --- |
| Scrooge & Marley | Dickens, A Christmas Carol, 1843 | Counting house, money lending | Live |
| Au Bonheur des Dames | Zola, Au Bonheur des Dames, 1883 | Department store, retail | Planned |
| Lapham Paint | Howells, The Rise of Silas Lapham, 1885 | Industrial paint manufacturing | Planned |
| Dotheboys Hall | Dickens, Nicholas Nickleby, 1839 | Education (cautionary) | Planned |

All source works are public domain. The brands as implemented — copy, design, codebase — are not.

2. Tech stack

3. Multi-brand theming system

The theming system is the core architectural claim of the Marley template. Changing a brand requires editing three files. No component changes. No routing changes. The entire site repaints.

The three files that must stay in sync

lib/theme.ts

TypeScript source of truth

Exports a typed theme constant containing the brand name, tagline, address, contact, domain, and the eight colour values (bb1–bb8). This is the canonical source. If it conflicts with the other two files, this one wins.

public/theme.json

Machine-readable

Same data as lib/theme.ts, serialised as JSON. Read by Indiana (the doc generator) and any external tooling that needs palette values without importing TypeScript. Includes a colorRoles field describing the semantic role of each colour variable.

app/globals.css

CSS variables

The :root block defines --bb-1 through --bb-8. A matching .dark block inverts the parchment/soot relationship for dark mode. All components reference these variables — no hex values appear in component files.
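Because the three files carry the same palette by hand, a small drift check can catch a mismatch before deploy. The sketch below is illustrative only: the `Theme` shape, `readRootVars`, and `findDrift` are hypothetical names, not the template's actual exports, and the colour values are the Scrooge & Marley ones listed in the palette table.

```typescript
// Hypothetical shape for the typed theme constant. The real lib/theme.ts also
// carries tagline, address, contact, and domain, which are omitted here.
type ColourKey = "bb1" | "bb2" | "bb3" | "bb4" | "bb5" | "bb6" | "bb7" | "bb8";

interface Theme {
  name: string;
  colours: Record<ColourKey, string>;
}

const theme: Theme = {
  name: "Scrooge & Marley",
  colours: {
    bb1: "#0D0D0D",
    bb2: "#4A4A4A",
    bb3: "#8B0000",
    bb4: "#8B7536",
    bb5: "#2F2F2F",
    bb6: "#6B6B5E",
    bb7: "#9C9680",
    bb8: "#E8E0D0",
  },
};

// Pull the --bb-N declarations out of the first :root block of a stylesheet.
function readRootVars(css: string): Map<string, string> {
  const vars = new Map<string, string>();
  const root = /:root\s*\{([^}]*)\}/.exec(css);
  if (!root) return vars;
  const decl = /--bb-([1-8])\s*:\s*(#[0-9a-fA-F]{3,8})/g;
  let m: RegExpExecArray | null;
  while ((m = decl.exec(root[1])) !== null) {
    vars.set(`bb${m[1]}`, m[2].toUpperCase());
  }
  return vars;
}

// Return the colour keys whose CSS value is missing or differs from the theme.
function findDrift(t: Theme, css: string): ColourKey[] {
  const vars = readRootVars(css);
  return (Object.keys(t.colours) as ColourKey[]).filter(
    (key) => vars.get(key) !== t.colours[key].toUpperCase(),
  );
}
```

Calling `findDrift(theme, cssSource)` with the contents of app/globals.css would return an empty array when the CSS agrees with the TypeScript source of truth. The same comparison could be run against the parsed public/theme.json.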

Palette variable roles (mandatory conventions)

| Variable | Role | Scrooge & Marley value |
| --- | --- | --- |
| --bb-1 | Primary text | #0D0D0D — soot black |
| --bb-2 | Primary accent, headers | #4A4A4A — iron grey |
| --bb-3 | Danger, overdue, emphasis | #8B0000 — dried-ink red |
| --bb-4 | Highlight, callout | #8B7536 — cold brass |
| --bb-5 | Secondary accent | #2F2F2F — charcoal |
| --bb-6 | Muted accent, labels | #6B6B5E — tarnished pewter |
| --bb-7 | Borders, subtle backgrounds | #9C9680 — aged ledger tan |
| --bb-8 | Page background, light surfaces | #E8E0D0 — parchment |

WCAG AA contract

WCAG AA requires 4.5:1 contrast for body text and 3:1 for large text. When replacing palette values for a new brand, verify --bb-1 against --bb-8 and --bb-2 against --bb-8 before deploying. Many brand primaries fail at body text size.
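The contrast check can be automated. The following is a minimal sketch of the WCAG 2.x relative-luminance and contrast-ratio formulas. The function names are hypothetical and not part of the template.

```typescript
// Convert one 8-bit sRGB channel to its linear-light value (WCAG 2.x definition).
function channelToLinear(value8bit: number): number {
  const c = value8bit / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a "#RRGGBB" colour.
function relativeLuminance(hex: string): number {
  const match = /^#([0-9a-fA-F]{6})$/.exec(hex);
  if (!match) throw new Error(`Expected #RRGGBB, got ${hex}`);
  const n = parseInt(match[1], 16);
  const r = channelToLinear((n >> 16) & 0xff);
  const g = channelToLinear((n >> 8) & 0xff);
  const b = channelToLinear(n & 0xff);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio between two colours, always >= 1 regardless of argument order.
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// True when the pair meets WCAG AA: 4.5:1 for body text, 3:1 for large text.
function meetsAA(fg: string, bg: string, largeText = false): boolean {
  return contrastRatio(fg, bg) >= (largeText ? 3 : 4.5);
}
```

For the Scrooge & Marley palette, `contrastRatio("#0D0D0D", "#E8E0D0")` works out to roughly 14.8:1 and `contrastRatio("#4A4A4A", "#E8E0D0")` to roughly 6.8:1, so both pairs clear the body-text threshold. A new brand's values would be passed through `meetsAA` the same way.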

4. Site structure and routes

Public routes

| Route | Description |
| --- | --- |
| / | Home — five sections: hero, services, who we serve, CTA, contact |
| /tools | Tools directory — card grid merging filesystem artifacts and DB link tools |
| /tools/[slug] | Artifact embed page — full-viewport iframe with title bar |
| /dev | Dev docs browser — searchable card grid, filesystem-driven |
| /dev/[slug] | Single dev doc — full-viewport iframe |
| /blog | Blog feed — cover thumbnails, search bar, published posts newest first |
| /blog/[slug] | Blog post — cover hero, prose content, og:image, prev/next nav |
| /about | Firm/person page — prose format, founders, contact |
| /privacy | Privacy policy |
| /privacy/cookies | Cookie policy — dedicated page |
| /terms-of-service | Terms of service |
| /substack | Newsletter hub — card grid of all sections |
| /substack/[section] | Section page — article list, follow CTA |
| /substack/[section]/[slug] | Full article — attribution banner, prose, subscribe CTA |

Admin routes (protected)

| Route | Description |
| --- | --- |
| /admin/login | Password form — POSTs to /api/admin/login |
| /admin/dashboard | Overview — tabbed nav to all admin sections |
| /admin/dashboard/blog | Post list — tag filter, bulk delete, import/export |
| /admin/dashboard/blog/new | New post editor |
| /admin/dashboard/blog/[id]/edit | Edit existing post |
| /admin/dashboard/blog/import | Import — Substack ZIP or blog export ZIP |
| /admin/dashboard/tools | Tools manager — link and artifact types |
| /admin/dashboard/dev | Dev docs list — filesystem browser with sync button |
| /admin/dashboard/substack | Substack section manager — create sections, import ZIPs |

5. Content systems

Blog system

The blog system uses Neon PostgreSQL for post storage, Tiptap for authoring, and Vercel Blob for image storage. Posts are database-driven; the admin editor produces clean HTML stored in the content column.

Key capabilities

WYSIWYG editor: bold, italic, headings, lists, blockquotes, code blocks, images, YouTube embeds, D3 viz placeholders
Cover image upload via drag/drop to Vercel Blob
Tags stored as PostgreSQL TEXT[] array — filterable in both admin and public views
Draft/publish workflow with published_at timestamp
Auto-generated slug from title (editable), auto-generated excerpt (first 200 chars)
Export as ZIP (posts.json + individual HTML files) — enables cross-instance transfer
Import from Substack export ZIP or blog export ZIP
D3 data visualisations hydrated client-side via BlogVizHydrator and the viz registry
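The auto-generated slug and excerpt in the list above could be derived along these lines. This is a sketch under assumptions: `slugify` and `makeExcerpt` are hypothetical helpers, and the template's own rules for punctuation and HTML entities may differ.

```typescript
// Turn a post title into a URL slug: lowercase ASCII letters and digits,
// separated by single hyphens.
function slugify(title: string): string {
  return title
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining accents left by NFKD
    .replace(/['’]/g, "") // drop apostrophes so "wasn't" becomes "wasnt"
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build a plain-text excerpt from editor HTML, capped at 200 characters by default.
// HTML entities are left as-is in this sketch.
function makeExcerpt(html: string, maxChars = 200): string {
  const text = html
    .replace(/<[^>]*>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  return text.slice(0, maxChars);
}
```

Since the slug is editable, a generator like this would only supply the initial value. Uniqueness is still enforced by the database constraint on the slug column.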

Adding a D3 visualisation

Create lib/viz/[name].ts exporting default (el: HTMLElement) => void
Add an entry to lib/viz/registry.ts mapping the name to a lazy import
Insert a data-viz="[name]" placeholder via the editor toolbar
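The three steps above can be sketched in one file. The names here (`VizElement`, `helloViz`, `vizRegistry`, `hydrateViz`) are hypothetical and stand in for the template's own modules. The loader uses `Promise.resolve` where the real registry would use a dynamic `import()`, and no D3 calls are shown.

```typescript
// Minimal stand-in for the part of HTMLElement this sketch touches, so the
// example also runs outside a browser. A real HTMLElement satisfies it.
interface VizElement {
  textContent: string | null;
}

type VizFn = (el: VizElement) => void;
type VizLoader = () => Promise<{ default: VizFn }>;

// Step 1: the default export of a visualisation module (hypothetical "hello" viz).
const helloViz: VizFn = (el) => {
  el.textContent = "hello from viz";
};

// Step 2: the registry maps a name to a lazy loader. In the template the
// loader would be a dynamic import such as () => import("./hello").
const vizRegistry = new Map<string, VizLoader>([
  ["hello", () => Promise.resolve({ default: helloViz })],
]);

// Step 3: given the name from a data-viz placeholder, load the module and run it.
async function hydrateViz(name: string, el: VizElement): Promise<boolean> {
  const load = vizRegistry.get(name);
  if (!load) return false; // unknown name: leave the placeholder untouched
  const mod = await load();
  mod.default(el);
  return true;
}
```

Keeping the loaders lazy means a post only downloads the visualisation code it actually references.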

Tools directory

Tools are served from two sources merged at render time. Artifact tools live as HTML files in public/artifacts/ — filesystem is the source of truth, no database entry needed. Link tools are database-driven, managed via the admin UI.

Two tool types

| Type | Source | Behaviour | How to add |
| --- | --- | --- | --- |
| artifact | Filesystem (public/artifacts/) | Card links to /tools/[slug], renders in full-viewport iframe | Drop an HTML file with title, description, keywords meta tags. Push to main. |
| link | Neon database | Card opens URL in new tab | Admin UI at /admin/dashboard/tools |
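Since an artifact card depends on the file's title, description, and keywords tags, reading them might look like the sketch below. It is a hypothetical, regex-based reader that works on an HTML string. It is not the template's lib/html-meta.ts, and a real implementation might prefer a proper HTML parser.

```typescript
// Hypothetical result shape; the template's own metadata type may differ.
interface DocMeta {
  title: string;
  description: string;
  keywords: string[];
}

// Read one quoted attribute value from a single tag's source text.
function attr(tag: string, name: string): string | null {
  const re = new RegExp(`\\b${name}\\s*=\\s*(?:"([^"]*)"|'([^']*)')`, "i");
  const m = re.exec(tag);
  if (!m) return null;
  return m[1] ?? m[2] ?? null;
}

// Find the content of <meta name="..."> regardless of attribute order.
function metaContent(html: string, name: string): string | null {
  const tags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of tags) {
    if (attr(tag, "name")?.toLowerCase() === name) return attr(tag, "content");
  }
  return null;
}

// Return metadata only when title, description, and keywords are all present.
function parseHtmlMeta(html: string): DocMeta | null {
  const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html)?.[1]?.trim();
  const description = metaContent(html, "description");
  const keywords = metaContent(html, "keywords");
  if (!title || !description || !keywords) return null;
  return {
    title,
    description,
    keywords: keywords.split(",").map((k) => k.trim()).filter(Boolean),
  };
}
```

Returning `null` for an incomplete file makes the missing-tag case explicit, so the caller decides whether to skip the card or surface a warning.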

Dev docs browser

All HTML files in public/dev/ are automatically surfaced on /dev. No database, no sync required. The lib/html-meta.ts utility (scanHtmlDir()) reads <title>, <meta name="description">, and <meta name="keywords"> tags from every file and returns them as HtmlDocMeta[].

All three meta tags are required

A doc without all three tags does not appear in the browser with correct title or searchable keywords. A doc that appears in the filesystem but cannot be found by search does not exist to the reader. Title, description, and keywords are structural requirements, not formatting suggestions.

Substack importer

The Substack import system ingests Substack export ZIPs and surfaces articles under /substack/[section]/[slug]. Articles are stored in Neon with attribution preserved.

Import workflow

Export from Substack (Settings → Exports → Create new export)
Create a section in admin dashboard (title, slug, Substack URL, description)
Upload the ZIP to that section — parser reads posts.csv + HTML files
Articles upserted by slug — re-import is safe, updates existing records

6. Database schema

Four tables in Neon PostgreSQL. All have row-level security enabled. Public read policies are narrowly scoped — blog posts require published = true.

```sql
-- Tools
CREATE TABLE IF NOT EXISTS tools (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL,
  slug TEXT UNIQUE NOT NULL,
  description TEXT,
  tool_type TEXT DEFAULT 'link',  -- 'link' | 'artifact'
  claude_url TEXT,                -- external URL (link tools) or fallback
  chatgpt_url TEXT,               -- optional ChatGPT URL
  artifact_id TEXT,               -- Claude artifact UUID
  artifact_embed_code TEXT,       -- raw iframe embed (overrides artifact_id)
  tags TEXT[],                    -- category tags
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
ALTER TABLE tools ENABLE ROW LEVEL SECURITY;
CREATE POLICY "public_read_tools" ON tools FOR SELECT USING (true);
CREATE POLICY "service_role_tools" ON tools FOR ALL USING (true) WITH CHECK (true);

-- Blog posts
CREATE TABLE IF NOT EXISTS blog_posts (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  title TEXT NOT NULL,
  subtitle TEXT,
  slug TEXT NOT NULL UNIQUE,
  byline TEXT,
  cover_image TEXT,
  content TEXT NOT NULL,          -- clean HTML from Tiptap
  excerpt TEXT,                   -- auto-generated, first 200 chars
  published BOOLEAN DEFAULT false,
  published_at TIMESTAMPTZ,
  tags TEXT[] DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
ALTER TABLE blog_posts ENABLE ROW LEVEL SECURITY;
CREATE POLICY "public_read_published_posts" ON blog_posts FOR SELECT USING (published = true);
CREATE POLICY "service_role_posts" ON blog_posts FOR ALL USING (true) WITH CHECK (true);

-- Substack sections
CREATE TABLE IF NOT EXISTS substack_sections (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  slug TEXT NOT NULL UNIQUE,
  title TEXT NOT NULL,
  description TEXT,
  substack_url TEXT NOT NULL,
  article_count INTEGER DEFAULT 0,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
ALTER TABLE substack_sections ENABLE ROW LEVEL SECURITY;
CREATE POLICY "public_read_sections" ON substack_sections FOR SELECT USING (true);
CREATE POLICY "service_role_sections" ON substack_sections FOR ALL USING (true) WITH CHECK (true);

-- Substack articles
CREATE TABLE IF NOT EXISTS substack_articles (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  section_id UUID NOT NULL REFERENCES substack_sections(id) ON DELETE CASCADE,
  slug TEXT NOT NULL,
  title TEXT NOT NULL,
  subtitle TEXT,
  excerpt TEXT,
  content TEXT,
  original_url TEXT,
  published_at TIMESTAMPTZ,
  display_date TEXT,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE(section_id, slug)
);
ALTER TABLE substack_articles ENABLE ROW LEVEL SECURITY;
CREATE POLICY "public_read_articles" ON substack_articles FOR SELECT USING (true);
CREATE POLICY "service_role_articles" ON substack_articles FOR ALL USING (true) WITH CHECK (true);
```

Pending migrations (safe to re-run)

```sql
-- Run these in Neon SQL Editor if not already applied
ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS byline TEXT;
ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS tags TEXT[] DEFAULT '{}';
ALTER TABLE blog_posts ADD COLUMN IF NOT EXISTS cover_image TEXT;
```

7. Admin system

The admin dashboard is protected by middleware.ts, which redirects all /admin/dashboard/* routes to /admin/login if no valid admin_session cookie is present. Authentication is password-only — the password is set via the ADMIN_PASSWORD environment variable.

Session mechanics

Login: POST to /api/admin/login — validates against ADMIN_PASSWORD env var
On success: sets admin_session httpOnly cookie, 7-day expiry
All /api/admin/* routes check isAdmin() from lib/admin-auth.ts before proceeding
Middleware protects dashboard pages; API routes protect data endpoints separately

Admin API routes
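The admin session rules (an admin_session httpOnly cookie with a 7-day expiry, dashboard pages redirected to /admin/login without it) can be expressed as a few pure functions. This is a sketch under stated assumptions: how the real isAdmin() validates the cookie is not described, so the token comparison, the `Path=/` attribute, and all function names below are hypothetical.

```typescript
const SESSION_COOKIE = "admin_session";
const SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60; // 604800

// Parse a Cookie request header into name/value pairs.
function parseCookies(header: string | null): Map<string, string> {
  const jar = new Map<string, string>();
  if (!header) return jar;
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    jar.set(part.slice(0, eq).trim(), part.slice(eq + 1).trim());
  }
  return jar;
}

// Hypothetical check: the session is valid when the cookie equals a token
// the server issued at login. The real isAdmin() may verify it differently.
function hasValidSession(cookieHeader: string | null, expectedToken: string): boolean {
  const token = parseCookies(cookieHeader).get(SESSION_COOKIE);
  return token !== undefined && token.length > 0 && token === expectedToken;
}

// Middleware decision: only dashboard pages redirect to the login form.
function needsLoginRedirect(
  pathname: string,
  cookieHeader: string | null,
  expectedToken: string,
): boolean {
  const isDashboard =
    pathname === "/admin/dashboard" || pathname.startsWith("/admin/dashboard/");
  return isDashboard && !hasValidSession(cookieHeader, expectedToken);
}

// Set-Cookie value for a successful login: httpOnly with a 7-day lifetime.
// Path=/ is an assumption so both pages and API routes receive the cookie.
function sessionCookie(token: string): string {
  return `${SESSION_COOKIE}=${token}; Path=/; HttpOnly; Max-Age=${SEVEN_DAYS_SECONDS}`;
}
```

An API route guard could reuse `hasValidSession` and return an error response when it is false, which keeps the page-level and data-level checks separate as the session notes describe. A production check would likely compare secrets in constant time instead of using `===`.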

8. Environment variables

9. Persistent layout components

Header

Sticky, z-50, backdrop-blur. Logo (theme-aware SVG or text), five-item nav, social icon buttons, dark/light mode toggle. Mobile hamburger menu at the lg breakpoint. Do not add a sixth nav item without a deliberate information architecture decision — five is not arbitrary.

Footer

Four-column grid: firm info (name, address, contact), platform links, connect/social links, legal links. Bottom bar with copyright. Column headings and link text are brand-specific copy — the only footer content that changes between instances.

SEO infrastructure

app/sitemap.ts — dynamic sitemap including all /blog/*, /tools/*, /substack/* routes from Neon. Falls back to static-only if DB is not configured.
app/robots.ts — allows all crawlers, blocks /admin/ and /api/, points to /sitemap.xml.
Blog posts include og:image and twitter:card meta tags.
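The static-only fallback and the robots rules can be sketched as plain functions. Everything here is illustrative: `buildSitemap` and `buildRobots` are hypothetical helpers, and `dynamicPaths` being `null` stands in for "DB is not configured". The returned objects follow the general shape Next.js expects from sitemap and robots metadata files, without importing its types.

```typescript
interface SitemapEntry {
  url: string;
}

// Build sitemap entries. dynamicPaths is null when the database is not
// configured, in which case only the static routes are emitted.
function buildSitemap(
  baseUrl: string,
  staticPaths: string[],
  dynamicPaths: string[] | null,
): SitemapEntry[] {
  const origin = baseUrl.replace(/\/+$/, "");
  const all = dynamicPaths === null ? staticPaths : staticPaths.concat(dynamicPaths);
  const seen = new Set<string>();
  const entries: SitemapEntry[] = [];
  for (const path of all) {
    const normalised = path.startsWith("/") ? path : `/${path}`;
    if (seen.has(normalised)) continue; // skip duplicates
    seen.add(normalised);
    entries.push({ url: origin + normalised });
  }
  return entries;
}

// Allow all crawlers, block /admin/ and /api/, and point to the sitemap.
function buildRobots(baseUrl: string) {
  const origin = baseUrl.replace(/\/+$/, "");
  return {
    rules: { userAgent: "*", allow: "/", disallow: ["/admin/", "/api/"] },
    sitemap: `${origin}/sitemap.xml`,
  };
}
```

In this arrangement the database query lives outside the builder, so a failed or missing connection simply passes `null` and the site still serves a valid sitemap.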

10. Five proposed additions

These are structural proposals, not implementation tickets. Each one addresses a real gap in the current template. They are ordered by the ratio of effort to usefulness, not by complexity.

1. Brand registry — single-file multi-instance configuration

Planned

The gap

Currently, switching brand instances requires manual edits to three files (lib/theme.ts, public/theme.json, app/globals.css) plus the home page, legal pages, and CLAUDE.md. There is no single file that declares “this is the Scrooge & Marley instance.” A developer making a new instance must know which files to change.

The proposal

Add a config/brand.ts file that is the single source of truth for the active brand: palette, copy, address, legal entity, home page section content. The three theme files and the legal pages are generated from it, not maintained separately. A new brand instance is one file plus assets.
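One way the generation step might work is sketched below. Since this addition is only proposed, the `BrandConfig` shape and both generator functions are hypothetical. Copy, address, legal entity, and the `.dark` block are left out to keep the example short.

```typescript
// Hypothetical config/brand.ts shape, reduced to what the sketch needs.
interface BrandConfig {
  slug: string;
  name: string;
  // Palette in order bb1..bb8.
  palette: [string, string, string, string, string, string, string, string];
}

const scroogeAndMarley: BrandConfig = {
  slug: "scrooge-marley",
  name: "Scrooge & Marley",
  palette: [
    "#0D0D0D", "#4A4A4A", "#8B0000", "#8B7536",
    "#2F2F2F", "#6B6B5E", "#9C9680", "#E8E0D0",
  ],
};

// Generate the :root block that app/globals.css would contain.
function toCssRootBlock(brand: BrandConfig): string {
  const lines = brand.palette.map((hex, i) => `  --bb-${i + 1}: ${hex};`);
  return `:root {\n${lines.join("\n")}\n}`;
}

// Generate the JSON that public/theme.json would contain.
function toThemeJson(brand: BrandConfig): string {
  const colours: Record<string, string> = {};
  brand.palette.forEach((hex, i) => {
    colours[`bb${i + 1}`] = hex;
  });
  return JSON.stringify({ name: brand.name, colours }, null, 2);
}
```

With generators like these, the three theme files stop being hand-maintained copies and become build outputs of one config, which removes the possibility of drift between them.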

What it unlocks

A developer could drop in a new brand config, run a generation script, and have a fully configured instance in minutes. The multi-brand demonstration becomes something a user can try themselves, not just read about.

2. Contact form with Resend integration

Planned

The gap

Every CTA on the current site routes to a mailto: link. This means a visitor must have a configured email client. On mobile this works; in many corporate environments it does not. There is also no record of enquiries — they land in an inbox and may be lost.

The proposal

Add a /contact route (currently a placeholder) with a form that POSTs to /api/contact. The API route validates the fields and sends via Resend (one environment variable, generous free tier). Store a copy of each submission in a new enquiries table in Neon. Surface them in the admin dashboard.
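The validation step of the proposed /api/contact route might look like the following. The field names (`name`, `email`, `message`), the 5000-character cap, and the `validateEnquiry` helper are all assumptions made for illustration. Sending via Resend and writing to the enquiries table are deliberately not shown.

```typescript
interface Enquiry {
  name: string;
  email: string;
  message: string;
}

type ValidationResult =
  | { ok: true; value: Enquiry }
  | { ok: false; errors: string[] };

// Validate an untrusted request body before it is emailed or stored.
function validateEnquiry(body: unknown): ValidationResult {
  const record = (typeof body === "object" && body !== null ? body : {}) as Record<string, unknown>;
  const read = (field: string): string => {
    const v = record[field];
    return typeof v === "string" ? v.trim() : "";
  };

  const name = read("name");
  const email = read("email");
  const message = read("message");

  const errors: string[] = [];
  if (name.length === 0) errors.push("name is required");
  // Deliberately loose pattern: something@something.tld with no spaces.
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) errors.push("email looks invalid");
  if (message.length === 0) errors.push("message is required");
  if (message.length > 5000) errors.push("message is too long");

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { name, email, message } };
}
```

Returning every error at once, not just the first, lets the form highlight all problem fields in a single round trip.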

What it unlocks

The site becomes genuinely functional as a business template, not just a demonstration. Each brand instance gets a working enquiry pipeline. The admin can see all submissions without checking email.

3. Brand instance switcher in the admin dashboard

Planned

The gap

The multi-brand story is the template’s primary selling point, but it is invisible to someone looking at a single deployed instance. To see the contrast between Scrooge & Marley and Au Bonheur des Dames, you must visit two different URLs — or read about it in a README.

The proposal

Add a brand switcher to the admin dashboard (hidden from public visitors) that live-previews any configured brand instance by swapping the CSS variables via a data-brand attribute on the root element. No page reload. The switcher reads all brand configs from the proposed registry and renders a dropdown.

What it unlocks

The demo becomes interactive. A developer evaluating the template can experience the full range of brand personalities in a single session, on a single deployment. This is the clearest possible argument for the theming system’s real flexibility.

4. Structured projects / portfolio section

Planned

The gap

/projects is currently a placeholder. The tools directory serves individual interactive tools, and the blog serves written content, but there is no structured way to present a body of work — a case study, a client engagement record, a research project — as a coherent unit with multiple components.

The proposal

Add a projects table in Neon with title, slug, summary, status, tags, and a content field (same HTML-from-Tiptap pattern as blog posts). A project can reference multiple blog posts, tools, and external links. The public /projects page renders as a card grid; /projects/[slug] renders the full project with linked artefacts.

What it unlocks

For an individual or consultancy using the template, this closes the gap between “I have blog posts” and “I have a portfolio.” For the multi-brand demonstration, it gives each fictional firm a place to show completed engagements.

5. Indiana — automated dev doc generation from CLAUDE.md

Planned

The gap

Every doc in public/dev/ is hand-authored. The CLAUDE.md file contains authoritative, structured information about the codebase — site structure, schema, routes, environment variables — that duplicates what the dev docs cover. When CLAUDE.md changes, the dev docs become stale. There is no automated connection between the two.

The proposal

Indiana is a lightweight script (scripts/indiana.ts) that reads CLAUDE.md and public/theme.json, extracts structured sections, and generates or regenerates specific dev doc HTML files in public/dev/. It does not replace hand-authored docs — it generates the reference docs (schema, routes, environment variables) that are purely derived from source truth and should not require manual maintenance.
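A rough sketch of the core of such a script follows. It assumes CLAUDE.md uses markdown `#` headings for its sections, which the description does not confirm. `extractSection` and `renderReferenceDoc` are hypothetical names, and the output is a bare page carrying the title, description, and keywords tags that the dev docs browser requires.

```typescript
// Parse a markdown heading line into its level and text, or null if not a heading.
function headingOf(line: string): { level: number; text: string } | null {
  const m = /^(#{1,6})\s+(.*)$/.exec(line);
  return m ? { level: m[1].length, text: m[2].trim() } : null;
}

// Return the body under the named heading, up to the next heading of the
// same or higher level. Headings inside code fences are not special-cased.
function extractSection(markdown: string, heading: string): string | null {
  const lines = markdown.split(/\r?\n/);
  let start = -1;
  let startLevel = 0;
  for (let i = 0; i < lines.length; i++) {
    const h = headingOf(lines[i]);
    if (h && h.text === heading) {
      start = i;
      startLevel = h.level;
      break;
    }
  }
  if (start === -1) return null;
  const body: string[] = [];
  for (let i = start + 1; i < lines.length; i++) {
    const h = headingOf(lines[i]);
    if (h && h.level <= startLevel) break;
    body.push(lines[i]);
  }
  return body.join("\n").trim();
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a minimal reference doc that carries all three required meta tags.
function renderReferenceDoc(title: string, description: string, keywords: string[], body: string): string {
  return [
    "<!doctype html>",
    "<html><head>",
    `<title>${escapeHtml(title)}</title>`,
    `<meta name="description" content="${escapeHtml(description)}">`,
    `<meta name="keywords" content="${escapeHtml(keywords.join(", "))}">`,
    "</head><body>",
    `<h1>${escapeHtml(title)}</h1>`,
    `<pre>${escapeHtml(body)}</pre>`,
    "</body></html>",
  ].join("\n");
}
```

A full script would read CLAUDE.md from disk, call `extractSection` for each reference topic, and write the rendered HTML into public/dev/, leaving hand-authored files untouched.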

What it unlocks

The dev docs stay current automatically. A change to the database schema in CLAUDE.md is reflected in the dev docs on the next build. The hand-authored explanation and how-to docs remain under human control; the reference docs are generated. This is the documentation-as-code pattern applied to the template itself.