    LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses

    April 21, 2025

    As LLMs become more prominent in healthcare settings, ensuring that credible sources back their outputs is increasingly important. Although no LLMs are yet FDA-approved for clinical decision-making, top models such as GPT-4o, Claude, and Med-PaLM have outperformed clinicians on standardized exams like the USMLE. These models are already being used in real-world scenarios, including mental health support and the diagnosis of rare diseases. However, their tendency to hallucinate, generating unverified or inaccurate statements, poses a serious risk, especially in medical contexts where misinformation can lead to harm. This issue has become a major concern for clinicians, many of whom cite a lack of trust and the inability to verify LLM responses as key barriers to adoption. Regulators such as the FDA have also emphasized the importance of transparency and accountability, underscoring the need for reliable source attribution in medical AI tools.

    Recent improvements, such as instruction fine-tuning and retrieval-augmented generation (RAG), have enabled LLMs to cite sources when prompted. Yet even when the references point to legitimate websites, it is often unclear whether those sources truly support the model’s claims. Prior research has introduced datasets such as WebGPT, ExpertQA, and HAGRID to assess LLM source attribution; however, these rely heavily on manual evaluation, which is time-consuming and difficult to scale. Newer approaches use LLMs themselves to assess attribution quality, as demonstrated in works such as ALCE, AttributedQA, and FactScore. While tools like ChatGPT can assist in evaluating citation accuracy, studies show that such models still struggle to ensure reliable attribution in their outputs, highlighting the need for continued development in this area.

    Researchers from Stanford University and other institutions have developed SourceCheckup, an automated tool that evaluates how well LLMs support their medical responses with relevant sources. Analyzing 800 questions and over 58,000 source-statement pairs, they found that 50%–90% of LLM-generated answers were not fully supported by the cited sources, with GPT-4 producing unsupported claims in about 30% of cases. Even LLMs with web access struggled to provide consistently source-backed responses. Validated by medical experts, SourceCheckup revealed significant gaps in the reliability of LLM-generated references, raising critical concerns about their readiness for use in clinical decision-making.

    The study evaluated the source attribution performance of several top-performing and open-source LLMs using a custom pipeline called SourceCheckup. The process involved generating 800 medical questions, half drawn from Reddit’s r/AskDocs and half created by GPT-4o from Mayo Clinic texts, then assessing each LLM’s responses for factual accuracy and citation quality. Responses were broken down into verifiable statements, matched against their cited sources, and scored for support using GPT-4. The framework reported metrics, including URL validity and support, at both the statement and response levels. Medical experts validated all components, and the results were cross-verified using Claude 3.5 Sonnet to assess potential bias from GPT-4.
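
    To make the pipeline concrete, here is a minimal sketch of how a SourceCheckup-style verification loop could be structured. The LLM client, prompts, and helper names are illustrative assumptions, not the authors’ implementation: an answer is split into verifiable statements, each cited URL is checked for validity, and an LLM judge decides whether the fetched source text supports each statement.

```python
# Illustrative sketch only: the LLM client, prompts, and helper names below are
# assumptions for demonstration, not the authors' SourceCheckup implementation.
import requests


def parse_statements(llm, response_text: str) -> list[str]:
    """Ask an LLM to split a medical answer into short, verifiable statements."""
    prompt = (
        "Split the following answer into short, self-contained factual statements, "
        "one per line:\n\n" + response_text
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def url_is_valid(url: str) -> bool:
    """Statement-level 'URL validity': does the cited link actually resolve?"""
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False


def statement_supported(llm, statement: str, source_text: str) -> bool:
    """LLM-as-judge check: does the cited source text support the statement?"""
    prompt = (
        f"Source:\n{source_text}\n\nStatement:\n{statement}\n\n"
        "Does the source fully support the statement? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")


def evaluate_response(llm, response_text: str, citations: dict[str, str]) -> dict:
    """Score one answer; `citations` maps each cited URL to its fetched text."""
    statements = parse_statements(llm, response_text)
    supported = [
        any(statement_supported(llm, s, text) for text in citations.values())
        for s in statements
    ]
    return {
        "url_validity": sum(url_is_valid(u) for u in citations) / max(len(citations), 1),
        "statement_support": sum(supported) / max(len(statements), 1),
        # A response only counts as supported if every one of its statements is backed.
        "response_supported": all(supported),
    }
```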

    The study presents a comprehensive evaluation of how well LLMs verify and cite medical sources using SourceCheckup. Human experts confirmed that the model-generated questions were relevant and answerable, and that the parsed statements closely matched the original responses. In source verification, the pipeline’s accuracy nearly matched that of expert physicians, with no statistically significant difference between model and expert judgments. Claude 3.5 Sonnet and GPT-4o showed comparable agreement with expert annotations, whereas open-source models such as Llama 2 and Meditron significantly underperformed, often failing to produce valid citation URLs. Even GPT-4o with RAG, though better than the others thanks to its internet access, supported only 55% of its responses with reliable sources, and similar limitations were observed across all models.
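
    Agreement between an automated judge and expert annotators of this kind can be quantified with standard measures such as raw agreement or Cohen’s kappa. The sketch below uses hypothetical "supported / not supported" labels rather than the study’s data and assumes scikit-learn is available:

```python
# Hypothetical labels for illustration only; the study's annotations are not reproduced here.
from sklearn.metrics import cohen_kappa_score

expert_labels = [1, 1, 0, 1, 0, 0, 1, 1]  # expert judgments: statement supported (1) or not (0)
judge_labels  = [1, 1, 0, 1, 1, 0, 1, 1]  # LLM-judge labels for the same statements

raw_agreement = sum(e == j for e, j in zip(expert_labels, judge_labels)) / len(expert_labels)
kappa = cohen_kappa_score(expert_labels, judge_labels)
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```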

    The findings underscore persistent challenges in ensuring factual accuracy in LLM responses to open-ended medical queries. Many models, even those enhanced with retrieval, failed to consistently link claims to credible evidence, particularly for questions from community platforms like Reddit, which tend to be more ambiguous. Human evaluations and SourceCheckup assessments consistently revealed low response-level support rates, highlighting a gap between current model capabilities and the standards needed in clinical contexts. To improve trustworthiness, the study suggests models should be trained or fine-tuned explicitly for accurate citation and verification. Additionally, automated tools like SourceCleanup demonstrated promise in editing unsupported statements to improve factual grounding, offering a scalable path to enhance citation reliability in LLM outputs.
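
    As a rough illustration of the SourceCleanup idea, an unsupported statement can be handed back to an LLM together with its cited source and rewritten so that it only asserts what the source actually says. The prompt and client interface here are assumptions for illustration, not the authors’ implementation:

```python
def clean_statement(llm, statement: str, source_text: str) -> str:
    """Rewrite an unsupported statement so every claim is grounded in the cited source."""
    prompt = (
        f"Source:\n{source_text}\n\n"
        f"Statement (not fully supported by the source):\n{statement}\n\n"
        "Rewrite the statement so that everything it claims is directly supported by "
        "the source, removing or softening anything the source does not say."
    )
    return llm(prompt).strip()
```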


    Check out the Paper.


    The post LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses appeared first on MarkTechPost.
