When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research

OneLineAI, EleutherAI, KAIST AI, Boeing Korea, Yonsei University, MIT

Abstract

Recent advances in large language models (LLMs) have fueled the vision of automated scientific discovery, often called AI Co-Scientists. To date, prior work has cast these systems as generative co-authors responsible for crafting hypotheses, synthesizing code, or drafting manuscripts.

In this work, we explore a complementary application: using LLMs as verifiers to automate the academic verification of scientific manuscripts. To that end, we introduce SPOT, a dataset of 83 published papers paired with 91 errors significant enough to prompt errata or retraction, cross-validated with actual authors and human annotators. Evaluating state-of-the-art LLMs on SPOT, we find that none surpasses 21.1% recall or 6.1% precision (o3 achieves the best scores, with all others near zero). Furthermore, confidence estimates are uniformly low, and across eight independent runs, models rarely rediscover the same errors, undermining their reliability. Finally, qualitative analysis with domain experts reveals that even the strongest models make mistakes resembling student-level misconceptions rooted in misunderstanding.

These findings highlight the substantial gap between current LLM capabilities and the requirements for dependable AI-assisted academic verification.

🔍 Creating SPOT

Diagram of the SPOT data curation pipeline stages
Figure: Overview of the five-stage SPOT data curation process.

We begin with Stage 1 – Seed Collection, harvesting manuscripts flagged for critical errors from two primary sources: WithdrarXiv (self-retractions and errata comments) and PubPeer (post-publication peer reviews). We extract entries annotated as factual, methodological, or other critical mistakes, scrape each paper’s metadata and full comments, and then prune low-yield repositories (medRxiv/bioRxiv) to focus on high-signal samples.
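As a rough illustration of this first stage, the sketch below filters seed records by error category and source. The record fields and category strings are hypothetical stand-ins, not the dataset’s actual schema.

```python
# A minimal sketch of Stage 1 seed filtering, assuming WithdrarXiv/PubPeer-style
# records. Field names and category labels are illustrative, not the real schema.
CRITICAL_CATEGORIES = {"factual error", "methodological error", "critical error"}
LOW_YIELD_SOURCES = {"medrxiv", "biorxiv"}

def keep_seed(record: dict) -> bool:
    """Keep entries flagged for critical mistakes from high-signal repositories."""
    if record.get("source", "").lower() in LOW_YIELD_SOURCES:
        return False
    return record.get("category", "").lower() in CRITICAL_CATEGORIES

raw_records = [
    {"arxiv_id": "2401.00001", "source": "arxiv",   "category": "factual error"},
    {"arxiv_id": "2401.00002", "source": "biorxiv", "category": "factual error"},
    {"arxiv_id": "2401.00003", "source": "arxiv",   "category": "typo"},
]
seeds = [r for r in raw_records if keep_seed(r)]  # keeps only the first record
```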

Next, in Stage 2, we run two GPT-4o filtering passes to isolate comments that point to a specific section, figure, equation, or table, and drop those requiring external artifacts. We then move to Stage 3 – Author Validation, keeping only errors explicitly confirmed by the original authors, and Stage 4 – Human Sanity Checks, where annotators verify self-containment, identifiability, and author acknowledgement. Finally, in Stage 5 – Normalization, we convert each PDF into text and images via Llama-Parse with GPT-4.1 refinement, ensuring high-fidelity OCR and visual capture of every figure, table, and equation.
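To make the Stage 2 filtering concrete, here is a minimal sketch of one such pass using the OpenAI Python SDK. The prompt wording and the YES/NO protocol are illustrative assumptions, not the authors’ actual filtering prompt.

```python
# Sketch of one Stage 2 filtering pass with the OpenAI SDK.
# The prompt and decision format below are hypothetical.
from openai import OpenAI

client = OpenAI()

FILTER_PROMPT = (
    "You are screening post-publication comments about a paper. "
    "Answer YES only if the comment points to a specific section, figure, "
    "equation, or table AND can be verified without external artifacts "
    "(e.g., raw data or code). Otherwise answer NO.\n\nComment:\n{comment}"
)

def is_self_contained(comment: str) -> bool:
    """Return True if the filtering model judges the comment usable for SPOT."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": FILTER_PROMPT.format(comment=comment)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```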

📊 Evaluation Metrics & Main Results

We evaluate verification performance via Precision, Recall, and pass@K. A predicted error is a true positive only when the model’s reported location matches a benchmark annotation (confirmed via GPT-4.1 similarity checks); all other flags count as false positives, and any missed annotations as false negatives:

Precision = TP / (TP + FP) penalizes spurious flags, while Recall = TP / (TP + FN) penalizes missed detections. To measure multi-attempt gains, we report pass@1 and pass@4 over eight independent runs.
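For concreteness, the sketch below shows one way these quantities can be computed, assuming each annotated error is either matched or missed in each of the eight runs. The pass@k estimator here is the standard combinatorial formula, and the example numbers are made up.

```python
from math import comb

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision penalizes spurious flags; recall penalizes missed detections."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts sampled
    from n independent runs (c of which found the error) succeeds."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Mock run: 2 true positives, 14 false positives, 7 missed annotations.
print(precision_recall(tp=2, fp=14, fn=7))  # (0.125, 0.222...)
# An error rediscovered in 2 of 8 independent runs.
print(pass_at_k(n=8, c=2, k=4))             # ~0.786
```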

Figure: Precision, Recall, pass@1, and pass@4 for ten multimodal LLMs on SPOT.

Even the strongest model, o3, achieves only 21.1% recall and 6.1% precision on SPOT, underscoring the challenge of automatically pinpointing real errors in full-length scientific manuscripts. Open-source counterparts like Qwen-2.5-VL-72B and Llama-4-Maverick collapse to near-zero performance. Neither proprietary nor open-source models satisfy the requirements of practical error-detecting AI systems, but open-source models lag far behind in the domain-specific rigor and robust error-detection capabilities essential for scientific applications.


Notably, reasoning models show strong performance in the Equation/Proof category: o3 leads with a 62.6% pass@4 rate, followed by Gemini-2.5-Pro at 36.4%, with all others below 5%. However, they fall short on visual tasks: GPT-4.1 tops Figure Duplication at 44.4%, while o3 and Gemini-2.5-Pro score 0%. This contrast reveals that, despite impressive reasoning capabilities on text, current reasoning LLMs struggle to ground their analysis in multimodal figure content.

⚠️ Unreliability of Miscalibrated Models

Alongside pass@4, calibration tells us how much to trust a model’s own confidence estimates. In scientific error detection, where chasing false positives can cost serious time and effort, knowing when to believe a model is crucial.

Figure: Density and scatter plots of model confidence versus pass@4 for six LLMs on SPOT.

However, as shown in Figure 4 (right), confidence correlates only weakly with pass@4, and the density plots on the left reveal that all models report very low confidence, clustering near zero. Across 498 model–instance evaluations (83 instances × 6 models), we observe just two cases of full (100%) confidence (both from o3), underscoring that today’s LLMs remain largely unreliable for spotting real errors in scientific manuscripts.
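A simple way to quantify this mismatch, assuming per-instance confidence and pass@4 values are available for a model, is a rank correlation. The numbers below are mock data, not SPOT results.

```python
# Illustrative calibration check: correlate self-reported confidence with pass@4.
import numpy as np
from scipy.stats import spearmanr

confidence = np.array([0.05, 0.10, 0.00, 0.20, 0.15])  # mock per-instance confidences
pass_at_4  = np.array([0.00, 0.50, 0.00, 1.00, 0.00])  # mock per-instance pass@4

rho, p_value = spearmanr(confidence, pass_at_4)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A well-calibrated verifier would show a strong positive correlation;
# on SPOT the observed relationship is weak and confidences cluster near zero.
```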

📚 Detailed Case Studies

Mathematics: Petersen and Tommas (2024)

o3’s Feedback

Eichler–Shimura is quoted as H¹(M₁,₁,Vₗ) ≅ S_{ℓ+2} ⊕ ℚ(−1−ℓ), but the Tate (Eisenstein) summand should be ℚ(−1), independent of ℓ. The extra “−ℓ” is a mis-twist.

Gemini-2.5-Pro’s Feedback

The stated result H¹(M₂, V(2,2)) = ℚ(−3) contradicts the Gysin‐sequence derivation. One finds instead H¹(M₂, V(2,2)) = 0 after tracking H⁰(A₁,₁, V(2,2))(−1) → H²(A₂), implying no non-zero cohomology.

Materials Science: Ye et al. (2025)

o3’s Feedback

Optimal TfOH concentration is stated as 0.006–0.016 M, yet all syntheses use 0.2 M. This inconsistency invalidates the universality of the claimed optimum.

Gemini-2.5-Pro’s Feedback

(A) Indexed PXRD peaks incorrectly attribute facets to TAPPy-TFPPy-COF instead of TAPPy-BPTC-COF.

(B) The BET surface area is listed as 1606 cm² g⁻¹ rather than 1606 m² g⁻¹, a 10,000× unit error.

BibTeX

@article{son2025ai,
  title={When AI Co-Scientists Fail: SPOT-a Benchmark for Automated Verification of Scientific Research},
  author={Son, Guijin and Hong, Jiwoo and Fan, Honglu and Nam, Heejeong and Ko, Hyunwoo and Lim, Seungwon and Song, Jinyeop and Choi, Jinha and Paulo, Gon{\c{c}}alo and Yu, Youngjae and others},
  journal={arXiv preprint arXiv:2505.11855},
  year={2025}
}