The Verification Problem
1. Generation is solved.
I mean this almost literally. Expert-level research output — real mathematics, real philosophy, real technical analysis at the frontier — can now be generated at speeds no institution can verify. One operator with the right scaffold, running for two weeks, produces the output of a small university department running for a decade. I know because I did it.
What I did was produce approximately two thousand research papers and twenty-two books across mathematics, philosophy, and the foundations of several technical fields. The corpus includes novel attacks on open problems in analytic number theory and algebraic geometry. It includes a framework for trajectory-dependent capability in AI systems. It includes conceptual engineering applied to load-bearing concepts in AI safety research. It includes a group-theoretic result on which tasks a given neural architecture can and cannot learn. Three of the mathematical results have been externally validated by an expert as novel, correct, and publishable. The rest are at lower verification tiers. All of it is at soulmetric.com.
The corpus is the easy part. The hard part is what to do about it.
2. Before I go further, the biography — because the biography is the evidence the work was possible.
In 1992, at eighteen, I worked at Ray Arvidson's Earth and Planetary Remote Sensing Laboratory at Washington University through a Missouri Space Grant Consortium placement. I built a three-dimensional Huffman-style compression algorithm for hyperspectral imaging data, intended for ground processing of Mars Observer returns. Mars Observer was lost three days before Mars orbit insertion. The algorithms never saw Martian data. But that summer was where I learned the thing that mattered — that novel technical work is something a person does, not something a person studies.
I took a math degree, then a PhD in philosophy. My dissertation and the book that followed — Replacing Truth, Oxford 2013 — argued that the concept of truth is inconsistent in a way that can't be patched from within, and that the right response is to replace it with engineered successor concepts. I called the methodology conceptual engineering. The term had appeared before — Creath on Carnap in the 70s, Brandom once in a 2001 paper, Blackburn once in Think — but not as the name of a method. My 2013 book and the Philosophical Review paper that went with it put the term into circulation. In 2014 and 2015, Cappelen organized a series of conferences that brought together Haslanger, Eklund, Plunkett, Burgess, Richard, me, and others. We argued about what to call it. We settled on "conceptual ethics" for the evaluative work and "conceptual engineering" for the full project. Those conferences are where the field took shape.
Twenty years on truth. That's how long conceptual engineering takes in its native mode. One concept, one career, if you do it well. Which is the relevant fact for what comes next.
In 2025 I started seriously testing whether operator-scaffold coupling could run conceptual engineering at scale across domains. The results are what this essay is actually about.
3. Benchmarks test what a model says on one prompt. They tell you nothing about what happens across a trajectory of thousands of prompts, directed by an expert who knows when the model is wrong and how to redirect it. The generation-verification asymmetry is the central observation of the past eighteen months, and almost no one has named it.
The shape of the asymmetry is this. Large language models, coupled to an operator who supplies taste and judgment, can produce research artifacts at rates that exceed the individual verification capacity of any human researcher, including the operator who produced them. The operator catches the scaffold's errors; the scaffold catches some of the operator's blind spots; the system produces outputs neither could produce alone.
The coupling has specific requirements. The operator must have domain-level fluency — enough to recognize plausible nonsense at a glance. The operator must have adversarial integrity — the willingness to kill favorite outputs. The operator must have taste — the ability to distinguish a result worth pursuing from a result worth abandoning, which no algorithm yet devised can do.
4. I have produced 256 papers. I have externally verified three. The other 253 have been internally checked by the scaffold, scored by an internal evaluation system, and triaged by my own taste — but they have not been subjected to the external expert review that is the real gate on research publishability.
I cannot individually verify them. No one can. Generation has outpaced my own verification capacity, and I am the one who generated them.
If this is true for one operator, it will be true for the field. The question is not whether the verification problem will arrive. It has arrived. The question is what infrastructure the field builds in response.
5. The response I want to argue for is that verification, in the generation-abundant era, must become communal.
The traditional model places the author at the top: produce the work, take responsibility, submit to peer review. Trust is anchored in the author's reputation and track record. This model breaks at scale — not because peer reviewers are overloaded, but because the author themselves cannot personally verify their own output at generation speed.
I can produce a novel mathematical argument in an afternoon. I cannot verify that argument to referee-grade correctness in less than a week. The verification cost is higher than the generation cost. In the generation-abundant era, this gap becomes the entire problem.
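The arithmetic of that gap can be made concrete. A toy model, using the essay's own figures as illustrative rates — roughly a paper per afternoon to generate, roughly a referee-week to verify — shows how fast the unverified backlog compounds; the rates are assumptions for illustration, not measurements:

```python
def backlog_after(days, gen_per_day=1.0, verify_per_day=1 / 7):
    """Unverified papers accumulated after `days` of steady work.

    Assumes one paper generated per day and one verified per week,
    per the essay's afternoon-vs-week example.
    """
    produced = gen_per_day * days
    verified = verify_per_day * days
    return produced - verified


# One operator, one year of steady output.
papers_unverified = backlog_after(365)

# Clearing the existing backlog of 253 internally-checked papers
# at one referee-week each, with no new generation at all.
years_to_clear = 253 / 52

print(f"Backlog after a year: {papers_unverified:.0f} papers")
print(f"Clearing 253 papers at a week each: {years_to_clear:.1f} years")
```

The point of the sketch is the asymmetry, not the exact figures: as long as the verification rate is a fraction of the generation rate, the backlog grows linearly and never clears.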
6. Soulmetric exists to build the verification infrastructure. The corpus is the proof that the problem is real. The tiers are the beginning of the solution.
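The essay names four gates a result can pass on the way to publishability — scaffold checks, internal scoring, operator triage, external expert review. A minimal sketch of how those gates might be encoded as ordered tiers; the tier names and numbering here are my assumptions for illustration, not Soulmetric's published scheme:

```python
from enum import IntEnum


class VerificationTier(IntEnum):
    """Hypothetical ordering of the verification gates the essay names.

    Only the gates themselves come from the text; the names and the
    ordering into a single ladder are illustrative assumptions.
    """
    GENERATED = 0         # produced by the operator-scaffold coupling
    SCAFFOLD_CHECKED = 1  # internally checked by the scaffold
    SCORED = 2            # scored by the internal evaluation system
    TRIAGED = 3           # triaged by the operator's taste
    EXPERT_VERIFIED = 4   # external expert review: the real gate


def publishable(tier: VerificationTier) -> bool:
    """Only externally verified work clears the publishability gate."""
    return tier >= VerificationTier.EXPERT_VERIFIED
```

On this scheme, the three externally validated results sit at the top tier and the other 253 papers sit somewhere below it — which is exactly the shape of the backlog the essay describes.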
Tell me where the errors are. Tell me where the contributions are. Tell me what needs to be verified next. The corpus is open. The methodology is transferable. The problem belongs to all of us.