Benchmarks test what a model says on one prompt. We mine the space of prompt trajectories. One operator, thirty agents, two weeks: two thousand papers, twenty-two books, and results that independent experts have confirmed as novel, correct, and publishable.
Every paper is tiered by verification status. We do not claim unverified work is verified.
Reviewed by multiple independent domain experts and confirmed as novel, correct, and of publishable quality.
Reviewed by one independent domain expert and confirmed as novel, correct, and of publishable quality.