The Loop That Watches Itself
On OpenAI's Automated Researcher and the Profession It Forgot to Invent
Jakub Pachocki has a timeline. By September, OpenAI plans to deploy what it calls an AI research intern — a system that can work autonomously on a specific problem that would take a person days to resolve. By 2028, the full version: a multi-agent system capable of running research programs too large for humans to manage. Drug discovery. Novel proofs. Problems “formulated in text, code, or whiteboard scribbles.”
The vision is coherent. More than most in this field, it is operationally specific. And it contains a foundational error that no amount of scaling will fix.
The error isn’t technical. It’s logical.
The Scratch Pad That Watches Itself
Pachocki is candid about the risks. A system this powerful could go off the rails, get hacked, or simply misunderstand its instructions. His proposed solution is chain-of-thought monitoring — training reasoning models to externalize their work into a kind of scratch pad, then using other AI systems to watch those scratch pads for anomalous behavior.
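To make the shape of that proposal concrete, here is a minimal sketch of what such a monitoring loop might look like. Everything in it is hypothetical: the `researcher` and `monitor` callables stand in for two language models, and the verdict structure is invented for illustration. The point is only the topology: one model produces a reasoning trace, and another model, built much the same way, scores it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    flagged: bool   # did the monitor consider the trace anomalous?
    reason: str

def run_with_monitoring(
    task: str,
    researcher: Callable[[str], str],    # produces a reasoning trace (the "scratch pad")
    monitor: Callable[[str], Verdict],   # another model that watches that trace
) -> Optional[str]:
    trace = researcher(task)             # derivation
    verdict = monitor(trace)             # more derivation, pointed at the first
    if verdict.flagged:
        return None                      # escalate or discard
    return trace                         # accepted, by a judge inside the same loop
```

Notice where the verdict comes from: it is itself a model output, which is the structural point the next paragraphs press on.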
This is not oversight. It is the appearance of oversight, implemented entirely inside the loop it was supposed to close.
In 1931, decades before anyone worried about AI safety, Kurt Gödel established something directly relevant. No consistent formal system powerful enough to express arithmetic can prove its own consistency from within. And any such system contains true statements it cannot settle using only its own rules: truths it can approach but never certify through internal derivation alone.
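Stated formally, in the standard formulation of the second incompleteness theorem (quoted here for reference, nothing specific to AI):

```latex
% Second incompleteness theorem, standard formulation.
% F: a consistent, effectively axiomatized formal system extending basic arithmetic.
% Con(F): the arithmetic sentence expressing "F is consistent".
\text{If } F \text{ is consistent, then } F \nvdash \operatorname{Con}(F).
```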
Apply this to Pachocki’s architecture. The AI researcher derives. Chain-of-thought monitoring by another AI system is more derivation. What is structurally absent is recognition — the moment of contact between a formal output and an external reality. That moment cannot be replicated by adding another layer of derivation on top.
This is not a philosophical objection. It is a logical one. The validator must be outside the system being validated. There is no version of this argument that resolves in favor of AI systems self-monitoring.
The Proof Candidate Problem
What an AI system produces when it generates a novel mathematical proof is not a proof. It is a proof candidate — a string of symbols that appears to follow valid inference rules and may or may not establish anything true.
The distinction is not semantic. A proof in the full sense is a social and epistemic act. It is what a mathematical community recognizes as establishing truth. Remove the recognition and you have a sophisticated computation that has no relationship to truth except statistical proximity.
The same structure applies to every domain Pachocki names.
A novel molecule with predicted therapeutic properties is not a drug. It is a candidate. The drug trial process — Phase I, Phase II, Phase III, post-market surveillance — exists precisely because we have learned, through catastrophic experience, that prediction and reality are different things and the gap between them kills people. Thalidomide. Vioxx. The graveyard of promising compounds that passed every computational test and failed in bodies.
As AI systems generate increasingly sophisticated candidates across more domains, the need for rigorous external validation does not decrease. It increases. The more sophisticated the output, the harder it is to catch the subtle error buried in ten thousand valid steps. A wrong answer that looks wrong is easy to reject. A wrong answer that looks right for nine thousand nine hundred and ninety-nine steps requires something the internal system cannot provide: an independent perspective.
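A rough illustration of why length alone makes this harder (the numbers below are assumptions chosen for the example, not measurements of any real system): even a tiny per-step chance of a subtle error compounds over a long derivation.

```python
# Illustrative arithmetic only; both numbers are assumptions, not measurements.
p_step_error = 1e-4      # assumed chance of a subtle, valid-looking error at any one step
n_steps = 10_000         # length of the derivation

p_clean = (1 - p_step_error) ** n_steps
print(f"P(entire chain is error-free): {p_clean:.3f}")      # ~0.368
print(f"P(at least one subtle error):  {1 - p_clean:.3f}")  # ~0.632
```

Under those assumptions, most long chains contain a flaw somewhere, and by construction no single step looks wrong.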
Common Cause Failure
There is a concept in safety engineering called common cause failure. It describes what happens when two redundant systems share the same fundamental assumptions — the thing most likely to fool System A is also most likely to fool System B, because both were built on the same foundation.
Pachocki’s monitoring architecture is a common cause failure risk by design. If the system being monitored can produce subtly wrong outputs that look correct, the monitoring system trained on similar data with similar architecture will have correlated blind spots. You have not introduced an independent check. You have introduced a correlated one.
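A toy simulation illustrates the difference between independent and correlated checkers. The probabilities and the correlation model below are invented for the example; the only claim is structural: when the monitor shares the generator's blind spots, the second check buys far less than the redundancy suggests.

```python
import random

# Toy model of common cause failure in AI-on-AI monitoring.
# All numbers are invented for illustration.
random.seed(0)
TRIALS = 100_000
P_MISS = 0.10   # assumed chance that a single checker misses a given subtle flaw

def joint_miss_rate(correlation: float) -> float:
    """Fraction of flawed outputs that slip past both checkers.
    correlation = 0.0 -> independent blind spots; 1.0 -> fully shared blind spots."""
    slipped = 0
    for _ in range(TRIALS):
        shared = random.random() < P_MISS   # a blind spot both checkers might share
        a_miss = shared if random.random() < correlation else random.random() < P_MISS
        b_miss = shared if random.random() < correlation else random.random() < P_MISS
        if a_miss and b_miss:
            slipped += 1
    return slipped / TRIALS

print(f"independent checkers:     {joint_miss_rate(0.0):.3f}")  # ~0.01 = 0.10 * 0.10
print(f"fully correlated checkers: {joint_miss_rate(1.0):.3f}")  # ~0.10, no better than one
```

At full correlation the second checker adds nothing; systems trained on similar data with similar architectures sit closer to that end than the redundancy diagram implies.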
Every high-stakes validation system humans have built — clinical trials, aircraft certification, nuclear safety, financial auditing — depends on something genuinely outside. Not because humans are infallible. Because humans are the only validators who face consequences when wrong. The FDA reviewer whose approval leads to harm is accountable in ways that a monitoring LLM is not and cannot be.
Accountability is not a luxury feature of validation systems. It is load-bearing. Remove it and the system loses the incentive structure that makes rigorous checking worth doing.
Stakes as the Organizing Principle
None of this means AI systems cannot contribute to research. They already do. The question is not whether to deploy them. The question is which level of external validation each deployment requires.
This maps onto a natural taxonomy organized by stakes.
For low-stakes, reversible outputs — a song recommendation, a draft email, a code snippet that will be reviewed before deployment — AI can largely run with minimal human oversight. The cost of failure is low and recoverable.
For moderate-stakes, partially recoverable outputs — a business analysis, a research summary, an engineering specification — systematic human review at checkpoints is appropriate. The human does not need to be in the loop constantly, but must be able to catch errors before they compound.
For high-stakes, irreversible outputs — drug candidates, structural engineering recommendations, policy analysis that will drive consequential decisions, mathematical proofs that will be published as established results — continuous human oversight is not incidental to the output’s validity. It is constitutive of it.
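Written as a routing rule rather than prose (the tier names and wording here simply restate the three cases above; they are not an established standard), the taxonomy is small:

```python
from enum import Enum, auto

class Stakes(Enum):
    LOW = auto()        # reversible: song recommendation, draft email, reviewed code snippet
    MODERATE = auto()   # partially recoverable: business analysis, research summary, spec
    HIGH = auto()       # irreversible: drug candidates, structural designs, published proofs

# Required external validation per tier; a restatement of the prose above.
OVERSIGHT = {
    Stakes.LOW:      "minimal human oversight; failures are cheap and recoverable",
    Stakes.MODERATE: "systematic human review at checkpoints, before errors compound",
    Stakes.HIGH:     "continuous human oversight; validation is constitutive of the result",
}

def required_oversight(stakes: Stakes) -> str:
    return OVERSIGHT[stakes]
```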
The drug trial architecture already encodes this wisdom. It was not built for AI, but it is exactly the right framework for AI-assisted research in high-stakes domains. The humans do not disappear as system confidence grows. They shift function — from intensive validation to ongoing monitoring, from checking every step to catching systematic drift. This is not a concession to human limitation. It is a recognition that the system’s credibility requires external accountability at every stage.
The Profession Pachocki Forgot to Invent
What emerges from this analysis is not only a procedural requirement for human oversight. It is the outline of a new profession.
A plausibility auditor is not a fact-checker. Not a quality assurance technician. Not a safety researcher who looks for misaligned objectives in training runs. A plausibility auditor is someone trained specifically to stand outside sophisticated AI outputs and ask whether those outputs correspond to reality rather than merely to internal consistency.
This requires two distinct forms of expertise that current training pipelines do not produce together.
The first is deep domain knowledge — enough expertise to recognize when a result is too clean, suspiciously convergent, subtly wrong in the way that only an expert in the specific domain would catch. The AI system that generates a novel proof in algebraic geometry needs to be reviewed by someone who has spent years in algebraic geometry, not by a generalist AI safety researcher who can evaluate the logical structure of the output but cannot evaluate its mathematical significance.
The second is knowledge of AI failure modes, which differ fundamentally from human error patterns. Human errors cluster around cognitive bias, motivated reasoning, fatigue, and the known weaknesses of intuition under uncertainty. AI errors cluster around distribution shift, spurious correlations that held in training data, confident extrapolation beyond the valid range of the model, and — most dangerously — systematic errors that look like high-quality outputs because they were trained on a corpus where high-quality outputs had certain structural characteristics. Auditing AI outputs requires knowing which kind of error you are hunting.
The training pipeline for plausibility auditors looks nothing like current AI safety work. It looks more like producing people with genuine deep expertise in a specific domain who have additionally developed the metacognitive capacity — what Penrose, extending Gödel, might describe as the recognitional faculty — to evaluate outputs they could not themselves have produced. The auditor does not need to be able to generate the proof. The auditor needs to be able to recognize whether it is actually true.
None of this is a stopgap. The requirement for external validation is not temporary scaffolding to be removed once the systems mature; it follows directly from the logical structure of the problem. The validator must be outside the system being validated. That requirement does not disappear as systems become more sophisticated. If anything, it becomes harder to satisfy, because the auditor’s task grows more demanding as the outputs grow more complex.
The Central Irony
Pachocki’s automated researcher, if it works as described, will be the thing that finally creates the market for what it treats as unnecessary.
The more sophisticated the AI output, the harder the auditing task, the more valuable the human who can do it. OpenAI’s north star may be pointing directly at the profession it forgot to invent.
There is precedent for this dynamic. The industrialization of manufacturing did not eliminate the need for quality engineers — it made quality engineering a more demanding and more specialized discipline. The digitization of financial markets did not eliminate the need for auditors — it made financial auditing a more technically demanding field and produced an entire industry of forensic accountants whose value derives precisely from the complexity of what they are reviewing.
The automated researcher will produce more outputs of greater sophistication across more domains than any previous generation of scientific tools. Each of those outputs will be a candidate. Each candidate will require validation. The validation will require humans. Not because we cannot build systems smart enough to evaluate the outputs — we will almost certainly build systems with that capability. But because the evaluation’s credibility depends on the evaluator’s accountability, and accountability requires the possibility of consequence.
An AI system does not lose its job when it certifies a flawed drug candidate. A plausibility auditor does.
What Governments Actually Need to Figure Out
Pachocki acknowledges that the concentrated power implications of this technology are “a big challenge for governments to figure out.” He is right that governments need to be involved, and right that OpenAI alone cannot resolve the governance questions.
But the governance architecture he gestures toward does not yet exist, and the reason it does not exist is that the validation infrastructure that would make it functional has not been built. You cannot regulate AI research outputs if there is no institutionalized capacity to evaluate whether those outputs are trustworthy. Chain-of-thought monitoring provides the appearance of evaluability without the substance.
The question for 2028 — when Pachocki’s multi-agent research system is scheduled to arrive — is not only whether the system works. It is whether we have built, in parallel, the human capacity to stand outside the most powerful reasoning systems ever constructed and ask the oldest question in epistemology.
Is it actually true?
No algorithm answers that. Someone has to.
Tags: AI plausibility auditor, Gödel incompleteness AI oversight, OpenAI automated researcher chain-of-thought monitoring, common cause failure AI safety, high-stakes AI


