When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

OneLineAI, EleutherAI, KAIST AI, Boeing Korea, Yonsei University, MIT

Abstract

Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work has cast these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts.

In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions rooted in misunderstanding.

These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.

🔍 Creating SPOT

Diagram of the SPOT data curation pipeline stages
Figure: Overview of the five-stage SPOT data curation process.

We begin with Stage 1 – Seed Collection, harvesting manuscripts flagged for critical errors from two primary sources: WithdrarXiv (self-retractions and errata comments) and PubPeer (post-publication peer reviews). We extract entries annotated as factual, methodological, or other critical mistakes, scrape each paper’s metadata and full comments, and then prune low-yield repositories (medRxiv/bioRxiv) to focus on high-signal samples.
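As a rough illustration of this first stage, the sketch below filters seed records by error category and source. The record fields and category strings are hypothetical stand-ins, not the dataset’s actual schema.

```python
# A minimal sketch of Stage 1 seed filtering, assuming WithdrarXiv/PubPeer-style
# records. Field names and category labels are illustrative, not the real schema.
CRITICAL_CATEGORIES = {"factual error", "methodological error", "critical error"}
LOW_YIELD_SOURCES = {"medrxiv", "biorxiv"}

def keep_seed(record: dict) -> bool:
    """Keep entries flagged for critical mistakes from high-signal repositories."""
    if record.get("source", "").lower() in LOW_YIELD_SOURCES:
        return False
    return record.get("category", "").lower() in CRITICAL_CATEGORIES

raw_records = [
    {"arxiv_id": "2401.00001", "source": "arxiv",   "category": "factual error"},
    {"arxiv_id": "2401.00002", "source": "biorxiv", "category": "factual error"},
    {"arxiv_id": "2401.00003", "source": "arxiv",   "category": "typo"},
]
seeds = [r for r in raw_records if keep_seed(r)]  # keeps only the first record
```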

Next, in Stage 2, we run two GPT-4o filtering passes to isolate comments that point to a specific section, figure, equation, or table, and drop those requiring external artifacts. We then move to Stage 3 – Author Validation, keeping only errors explicitly confirmed by the original authors, and Stage 4 – Human Sanity Checks, where annotators verify self-containment, identifiability, and author acknowledgement. Finally, in Stage 5 – Normalization, we convert each PDF into text and images via Llama-Parse with GPT-4.1 refinement, ensuring high-fidelity OCR and visual capture of every figure, table, and equation.
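To make the Stage 2 filtering concrete, here is a minimal sketch of one such pass using the OpenAI Python SDK. The prompt wording and the YES/NO protocol are illustrative assumptions, not the authors’ actual filtering prompt.

```python
# Sketch of one Stage 2 filtering pass with the OpenAI SDK.
# The prompt and decision format below are hypothetical.
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = (
    "You are screening post-publication comments about a paper. "
    "Answer YES only if the comment points to a specific section, figure, "
    "equation, or table AND can be verified without external artifacts "
    "(e.g., raw data or code). Otherwise answer NO.\n\nComment:\n{comment}"
)

def is_self_contained(comment: str) -> bool:
    """Return True if the filtering model judges the comment usable for SPOT."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": FILTER_PROMPT.format(comment=comment)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```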

📊 Evaluation Metrics & Main Results

We evaluate verification performance via Precision, Recall, and pass@K. A predicted error is a true positive only when the model’s reported location matches a benchmark annotation (confirmed via GPT-4.1 similarity checks); all other flags count as false positives, and any missed annotations as false negatives:

Precision = TP / (TP + FP) penalizes spurious flags, while Recall = TP / (TP + FN) penalizes missed detections. To measure multi-attempt gains, we report pass@1 and pass@4 over eight independent runs.
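For concreteness, the sketch below shows one way these quantities can be computed, assuming each annotated error is either matched or missed in each of the eight runs. The pass@k estimator here is the standard combinatorial formula, and the example numbers are made up.

```python
from math import comb

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision penalizes spurious flags; recall penalizes missed detections."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts sampled
    from n independent runs (c of which found the error) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Mock run: 2 true positives, 14 false positives, 7 missed annotations.
print(precision_recall(tp=2, fp=14, fn=7))  # (0.125, 0.222...)
# An error rediscovered in 2 of 8 independent runs.
print(pass_at_k(n=8, c=2, k=4))             # ~0.786
```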

Figure: Precision, Recall, pass@1, and pass@4 for ten multimodal LLMs on SPOT.

Even the strongest model, o3, achieves only 21.1% recall and 6.1% precision on SPOT, underscoring the challenge of automatically pinpointing real errors in full-length scientific manuscripts. Open-source counterparts like Qwen-2.5-VL-72B and Llama-4-Maverick collapse to near-zero performance. Neither proprietary nor open-source models satisfy the requirements of practical error-detecting AI systems, but open-source models lag far behind in the domain-specific rigor and robust error-detection capabilities essential for scientific applications.


Notably, reasoning models show strong performance in the Equation/Proof category: o3 leads with a 62.6% pass@4 rate, followed by Gemini-2.5-Pro at 36.4%, with all others below 5%. However, they fall short on visual tasks: GPT-4.1 tops Figure Duplication at 44.4%, while o3 and Gemini-2.5-Pro score 0%. This contrast reveals that, despite impressive reasoning capabilities on text, current reasoning LLMs struggle to ground their analysis in multimodal figure content.

⚠️ Unreliability of Miscalibrated Models

Alongside pass@4, calibration tells us how much to trust a model’s own confidence estimates. In scientific error detection, where chasing false positives can cost serious time and effort, knowing when to believe a model is crucial.

Figure: Density and scatter plots of model confidence versus pass@4 for six LLMs on SPOT.

However, as shown in Figure 4 (right), confidence correlates only weakly with pass@4, and the density plots on the left reveal that all models report very low confidence, clustering near zero. Across 498 model–instance evaluations (83 instances × 6 models), we observe just two cases of full (100%) confidence (both from o3), underscoring that today’s LLMs remain largely unreliable for spotting real errors in scientific manuscripts.
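A simple way to quantify this mismatch, assuming per-instance confidence and pass@4 values are available for a model, is a rank correlation. The numbers below are mock data, not SPOT results.

```python
# Illustrative calibration check: correlate self-reported confidence with pass@4.
import numpy as np
from scipy.stats import spearmanr

confidence = np.array([0.05, 0.10, 0.00, 0.20, 0.15])  # mock per-instance confidences
pass_at_4  = np.array([0.00, 0.50, 0.00, 1.00, 0.00])  # mock per-instance pass@4

rho, p_value = spearmanr(confidence, pass_at_4)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A well-calibrated verifier would show a strong positive correlation;
# on SPOT the observed relationship is weak and confidences cluster near zero.
```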

📚 Detailed Case Studies

Mathematics: Petersen and Tommas (2024)

o3’s Feedback

Eichler–Shimura is quoted as H¹(M₁,₁,Vₗ) ≅ S_{ℓ+2} ⊕ ℚ(−1−ℓ), but the Tate (Eisenstein) summand should be ℚ(−1), independent of ℓ. The extra “−ℓ” is a mis-twist.

Gemini-2.5-Pro’s Feedback

The stated result H¹(M₂, V(2,2)) = ℚ(−3) contradicts the Gysin‐sequence derivation. One finds instead H¹(M₂, V(2,2)) = 0 after tracking H⁰(A₁,₁, V(2,2))(−1) → H²(A₂), implying no non-zero cohomology.

Materials Science: Ye et al. (2025)

o3’s Feedback

Optimal TfOH concentration is stated as 0.006–0.016 M, yet all syntheses use 0.2 M. This inconsistency invalidates the universality of the claimed optimum.

Gemini-2.5-Pro’s Feedback

(A) Indexed PXRD peaks incorrectly attribute facets to TAPPy-TFPPy-COF instead of TAPPy-BPTC-COF.

(B) The BET surface area is listed as 1606 cm² g⁻¹ rather than 1606 m² g⁻¹, a 10,000× unit error.

BibTeX

@article{son2025ai,
  title={When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research},
  author={Son, Guijin and Hong, Jiwoo and Fan, Honglu and Nam, Heejeong and Ko, Hyunwoo and Lim, Seungwon and Song, Jinyeop and Choi, Jinha and Paulo, Gon{\c{c}}alo and Yu, Youngjae and others},
  journal={arXiv preprint arXiv:2505.11855},
  year={2025}
}