    LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses

    April 21, 2025

    As LLMs become more prominent in healthcare settings, ensuring that credible sources back their outputs is increasingly important. Although no LLMs are yet FDA-approved for clinical decision-making, top models such as GPT-4o, Claude, and Med-PaLM have outperformed clinicians on standardized exams like the USMLE. These models are already being used in real-world scenarios, including mental health support and the diagnosis of rare diseases. However, their tendency to hallucinate, generating unverified or inaccurate statements, poses a serious risk, especially in medical contexts where misinformation can lead to harm. This issue has become a major concern for clinicians, many of whom cite a lack of trust and the inability to verify LLM responses as key barriers to adoption. Regulators such as the FDA have also emphasized the importance of transparency and accountability, underscoring the need for reliable source attribution in medical AI tools.

    Recent improvements, such as instruction fine-tuning and retrieval-augmented generation (RAG), have enabled LLMs to cite sources when prompted. Yet even when the references point to legitimate websites, it is often unclear whether those sources truly support the model’s claims. Prior research has introduced datasets such as WebGPT, ExpertQA, and HAGRID to assess LLM source attribution; however, these rely heavily on manual evaluation, which is time-consuming and difficult to scale. Newer approaches use LLMs themselves to assess attribution quality, as demonstrated in works such as ALCE, AttributedQA, and FactScore. While tools like ChatGPT can assist in evaluating citation accuracy, studies show that such models still struggle to ensure reliable attribution in their outputs, highlighting the need for continued development in this area.

    Researchers from Stanford University and other institutions have developed SourceCheckup, an automated tool that evaluates how well LLMs support their medical responses with relevant sources. Analyzing 800 questions and over 58,000 source-statement pairs, they found that 50%–90% of LLM-generated answers were not fully supported by the cited sources, with GPT-4 producing unsupported claims in about 30% of cases. Even LLMs with web access struggled to provide consistently source-backed responses. Validated by medical experts, SourceCheckup revealed significant gaps in the reliability of LLM-generated references, raising critical concerns about their readiness for use in clinical decision-making.

    The study evaluated the source attribution performance of several top-performing and open-source LLMs using a custom pipeline called SourceCheckup. The process involved generating 800 medical questions, half drawn from Reddit’s r/AskDocs and half created by GPT-4o from Mayo Clinic texts, then assessing each LLM’s responses for factual accuracy and citation quality. Responses were broken down into verifiable statements, matched against their cited sources, and scored for support using GPT-4. The framework reported metrics, including URL validity and support, at both the statement and response levels. Medical experts validated all components, and the results were cross-verified using Claude 3.5 Sonnet to assess potential bias from GPT-4.
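
    To make the pipeline concrete, here is a minimal sketch of how a SourceCheckup-style verification loop could be structured. The LLM client, prompts, and helper names are illustrative assumptions, not the authors’ implementation: an answer is split into verifiable statements, each cited URL is checked for validity, and an LLM judge decides whether the fetched source text supports each statement.

```python
# Illustrative sketch only: the LLM client, prompts, and helper names below are
# assumptions for demonstration, not the authors' SourceCheckup implementation.
import requests


def parse_statements(llm, response_text: str) -> list[str]:
    """Ask an LLM to split a medical answer into short, verifiable statements."""
    prompt = (
        "Split the following answer into short, self-contained factual statements, "
        "one per line:\n\n" + response_text
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def url_is_valid(url: str) -> bool:
    """Statement-level 'URL validity': does the cited link actually resolve?"""
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False


def statement_supported(llm, statement: str, source_text: str) -> bool:
    """LLM-as-judge check: does the cited source text support the statement?"""
    prompt = (
        f"Source:\n{source_text}\n\nStatement:\n{statement}\n\n"
        "Does the source fully support the statement? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")


def evaluate_response(llm, response_text: str, citations: dict[str, str]) -> dict:
    """Score one answer; `citations` maps each cited URL to its fetched text."""
    statements = parse_statements(llm, response_text)
    supported = [
        any(statement_supported(llm, s, text) for text in citations.values())
        for s in statements
    ]
    return {
        "url_validity": sum(url_is_valid(u) for u in citations) / max(len(citations), 1),
        "statement_support": sum(supported) / max(len(statements), 1),
        # A response only counts as supported if every one of its statements is backed.
        "response_supported": all(supported),
    }
```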

    The study presents a comprehensive evaluation of how well LLMs verify and cite medical sources using SourceCheckup. Human experts confirmed that the model-generated questions were relevant and answerable, and that the parsed statements closely matched the original responses. In source verification, the pipeline’s accuracy nearly matched that of expert physicians, with no statistically significant difference between model and expert judgments. Claude 3.5 Sonnet and GPT-4o showed comparable agreement with expert annotations, whereas open-source models such as Llama 2 and Meditron significantly underperformed, often failing to produce valid citation URLs. Even GPT-4o with RAG, though better than the others thanks to its internet access, supported only 55% of its responses with reliable sources, and similar limitations were observed across all models.
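
    Agreement between an automated judge and expert annotators of this kind can be quantified with standard measures such as raw agreement or Cohen’s kappa. The sketch below uses hypothetical "supported / not supported" labels rather than the study’s data and assumes scikit-learn is available:

```python
# Hypothetical labels for illustration only; the study's annotations are not reproduced here.
from sklearn.metrics import cohen_kappa_score

expert_labels = [1, 1, 0, 1, 0, 0, 1, 1]  # expert judgments: statement supported (1) or not (0)
judge_labels  = [1, 1, 0, 1, 1, 0, 1, 1]  # LLM-judge labels for the same statements

raw_agreement = sum(e == j for e, j in zip(expert_labels, judge_labels)) / len(expert_labels)
kappa = cohen_kappa_score(expert_labels, judge_labels)
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```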

    The findings underscore persistent challenges in ensuring factual accuracy in LLM responses to open-ended medical queries. Many models, even those enhanced with retrieval, failed to consistently link claims to credible evidence, particularly for questions from community platforms like Reddit, which tend to be more ambiguous. Human evaluations and SourceCheckup assessments consistently revealed low response-level support rates, highlighting a gap between current model capabilities and the standards needed in clinical contexts. To improve trustworthiness, the study suggests models should be trained or fine-tuned explicitly for accurate citation and verification. Additionally, automated tools like SourceCleanup demonstrated promise in editing unsupported statements to improve factual grounding, offering a scalable path to enhance citation reliability in LLM outputs.
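
    As a rough illustration of the SourceCleanup idea, an unsupported statement can be handed back to an LLM together with its cited source and rewritten so that it only asserts what the source actually says. The prompt and client interface here are assumptions for illustration, not the authors’ implementation:

```python
def clean_statement(llm, statement: str, source_text: str) -> str:
    """Rewrite an unsupported statement so every claim is grounded in the cited source."""
    prompt = (
        f"Source:\n{source_text}\n\n"
        f"Statement (not fully supported by the source):\n{statement}\n\n"
        "Rewrite the statement so that everything it claims is directly supported by "
        "the source, removing or softening anything the source does not say."
    )
    return llm(prompt).strip()
```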


    Check out the Paper.


    The post LLMs Still Struggle to Cite Medical Sources Reliably: Stanford Researchers Introduce SourceCheckup to Audit Factual Support in AI-Generated Responses appeared first on MarkTechPost.
