
We begin with Stage 1 – Seed Collection, harvesting manuscripts flagged for critical errors from two primary sources: WithdrarXiv (self-retractions and errata comments) and PubPeer (post-publication peer reviews). We extract entries annotated as factual, methodological, or other critical mistakes, scrape each paper’s metadata and full comments, then prune low-yield repositories (medRxiv/bioRxiv) to focus on high-signal samples.
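The selection and pruning logic of Stage 1 can be illustrated with a minimal sketch. The record fields, category labels, and the `collect_seeds` helper below are hypothetical names chosen for illustration; the actual schema of the scraped entries is not specified here.

```python
from dataclasses import dataclass

# Hypothetical labels and repository names, used only for illustration.
CRITICAL_CATEGORIES = {"factual", "methodological", "other_critical"}
LOW_YIELD_REPOSITORIES = {"medrxiv", "biorxiv"}


@dataclass
class SeedEntry:
    paper_id: str
    source: str      # "WithdrarXiv" or "PubPeer"
    repository: str  # where the manuscript is hosted
    category: str    # annotated error category
    comment: str     # full retraction or review comment


def collect_seeds(entries):
    """Keep critical-error entries and drop those from low-yield repositories."""
    return [
        e for e in entries
        if e.category in CRITICAL_CATEGORIES
        and e.repository.lower() not in LOW_YIELD_REPOSITORIES
    ]
```

Keeping the category and repository sets as module-level constants makes the pruning rule easy to audit and adjust as new sources are added.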
Next, in Stage 2, we run two GPT-4o filtering passes to isolate comments that point to a specific section, figure, equation, or table, and to drop those that require external artifacts. Stage 3 is author validation, which keeps only errors explicitly confirmed by the original authors. Stage 4 adds human sanity checks, in which annotators verify self-containment, identifiability, and author acknowledgement. Finally, in Stage 5 – Normalization, we convert each PDF into text and images using Llama-Parse followed by GPT-4.1 refinement, ensuring high-fidelity OCR and visual captures of every figure, table, and equation.
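As a rough sketch, Stages 2 through 4 can be read as a chain of predicates applied to each candidate comment. The regular expressions below are a simple keyword heuristic standing in for the two GPT-4o passes, not the prompts actually used, and the `Candidate` fields and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# Heuristic stand-ins for the two LLM filtering passes (assumption, not the real prompts).
LOCATION_PATTERN = re.compile(
    r"\b(section|sec\.|figure|fig\.|equation|eq\.|table)\s*\(?\d+",
    re.IGNORECASE,
)
EXTERNAL_PATTERN = re.compile(
    r"\b(code|dataset|repository|supplementary|raw data)\b",
    re.IGNORECASE,
)


@dataclass
class Candidate:
    comment: str
    author_confirmed: bool  # Stage 3: the original authors acknowledged the error
    human_verified: bool    # Stage 4: annotators passed the sanity checks


def is_localized(c):
    """Stage 2, pass 1: the comment points to a specific section, figure, equation, or table."""
    return bool(LOCATION_PATTERN.search(c.comment))


def is_self_contained(c):
    """Stage 2, pass 2: the comment does not depend on external artifacts."""
    return not EXTERNAL_PATTERN.search(c.comment)


def filter_candidates(candidates):
    """Apply the Stage 2 to Stage 4 checks in order and keep only survivors."""
    return [
        c for c in candidates
        if is_localized(c)
        and is_self_contained(c)
        and c.author_confirmed
        and c.human_verified
    ]
```

Ordering the cheap automated checks before the author and human flags mirrors the pipeline, where manual effort is spent only on comments that already passed the automated passes.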