From PubMed to TrialGPT: Exploring the Impact of AI & Large Language Models in Medicine

Introduction to the Webinar

Welcome and Overview

  • The session begins with host Erica Lake introducing the topic, "From PubMed to TrialGPT," focusing on AI's impact in medicine.
  • Erica is a health sciences librarian outreach specialist for Region 6 of the Network of the National Library of Medicine (NNLM), based at the University of Iowa.
  • Participants are informed about live closed captioning and encouraged to use the chat for comments and the Q&A for questions during the presentation.
  • The session will be recorded, and a link will be sent to all registrants after the event. It is also eligible for one Medical Library Association continuing education credit.
  • Erica explains NNLM's mission: providing equal access to biomedical information for US health professionals and improving public access.

About NNLM

Mission and Offerings

  • NNLM aims to enhance access to biomedical information, enabling informed health decisions among the public.
  • They provide various trainings, webinars, educational guides, toolkits, funding opportunities, and engagement activities for librarians and health professionals.
  • The speaker spotlight webinar series features expert speakers discussing topics relevant to librarians, educators, clinicians, etc.

Speaker Introduction

Dr. Zhiyong Lu's Background

  • Dr. Zhiyong Lu is introduced as a senior investigator at the NIH specializing in AI and machine learning research on biomedical text and image processing.
  • He has over 400 publications; his work supports resources like PubMed that serve millions of users daily.
  • Dr. Lu has received multiple awards, including the NIH Director's Challenge Award, and is recognized by prestigious organizations in medical informatics.

Presentation Focus

Research Areas

  • Dr. Lu discusses his research group's focus on AI applications for analyzing the vast amounts of biomedical literature available from sources like PubMed.
  • He also mentions working with patient data such as electronic medical records (EMRs), clinical notes, and radiology images, made possible by collaborations within the NIH campus.

Talk Structure

Key Topics Covered

  • The talk covers three aspects of text analysis: literature search techniques, information extraction methods relevant to medical librarianship, and recent advances using large language models (LLMs).
  • Dr. Lu plans to address both the challenges of LLM use in healthcare contexts and the opportunities they present.

Overview of PubMed and Its Impact

Introduction to PubMed

  • PubMed is created and maintained by the National Library of Medicine and currently hosts over 36 million abstracts and titles.
  • On average, PubMed serves about 2.5 million users daily, handling approximately three million searches and nine million page views.

Traffic Comparison

  • While its traffic volume is lower than that of general search engines like Google, PubMed ranks among the top three to five most accessed U.S. federal government websites.
  • PubMed operates continuously, even during government shutdowns.

AI and Machine Learning in Literature Search

Historical Context

  • The integration of AI and machine learning into improving user search experiences has been a focus for the National Library of Medicine for nearly 18 years.

Transparency in Development

  • Unlike commercial literature search systems, NLM emphasizes transparency by providing scientific publications detailing their AI implementations.
  • Blog posts are also available to explain how features were developed and user responses to them.

User Experience Challenges

Information Overload

  • Researchers often face an overwhelming number of papers returned from searches using just a few keywords.

Need for Enhanced Search Capabilities

  • The challenge lies not in finding relevant papers but managing the sheer volume of results that can reach thousands or tens of thousands.

Advancements in Search Systems

Teaching Computers to Understand Literature

  • Efforts have been made to teach computers to identify important topics within articles similarly to human scientists.

Development of Specialized Tools

  • Various systems have been developed that enhance traditional literature searches by automatically annotating entities such as genes, diseases, drugs, and chemicals.
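The annotation idea can be sketched with a toy dictionary-based tagger. Real systems such as PubTator rely on trained machine-learning models and curated vocabularies, so everything below (the lexicon contents, the `annotate` function, the entity types) is illustrative only:

```python
import re

# Toy lexicons standing in for the curated gene/disease/chemical
# vocabularies a production annotator would use; purely illustrative.
LEXICON = {
    "gene": {"BRCA1", "TP53", "EGFR"},
    "disease": {"breast cancer", "melanoma"},
    "chemical": {"tamoxifen", "cisplatin"},
}

def annotate(text):
    """Return (matched span, entity type, start offset) for lexicon hits."""
    hits = []
    for etype, terms in LEXICON.items():
        for term in terms:
            # Word-boundary, case-insensitive matching.
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                hits.append((m.group(0), etype, m.start()))
    return sorted(hits, key=lambda h: h[2])

print(annotate("BRCA1 mutations raise breast cancer risk; tamoxifen is used."))
```

A dictionary lookup like this misses synonyms and ambiguous names (the very problems the talk notes even expert biologists disagree on), which is why the production tools use learned models instead.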

Specific Tools Developed by NLM

Overview of Tools

  • Tools include LitVar (for genotype-phenotype relations), LitCovid (for COVID-related literature), and PubTator (a more general tool aggregating abstracts and full texts).

Integration of Data Sources

  • PubTator combines data from both abstract repositories (PubMed, with over 36 million abstracts) and full-length articles (around 10 million).

Challenges with Supplementary Materials

User Feedback on Variants Research

  • Users requested access to supplementary materials when searching for mutations due to their high relevance in this context.

Processing Difficulties

  • The heterogeneous nature of supplementary materials poses challenges for accurate processing; thus, they are generally avoided except where necessary for variant research.

AI Annotation Tools and Their Effectiveness

Challenges in Text Mining and AI Annotation

  • The speaker emphasizes that AI annotation tools are not perfect, highlighting the inherent challenges in text mining.
  • Even expert biologists may disagree on gene name annotations, indicating the complexity of biological data interpretation.
  • Recent benchmarks show that current AI tools perform comparably to human scientists in annotating articles.

Accessibility and Usage of AI Tools

  • Users can access the tool via a web interface or download data through FTP for downstream tasks.
  • The service has a global user base, including researchers from diverse locations, even Antarctica.
  • User feedback is collected through presentations and emails, which helps improve the service.

Impact and Reach of the Service

  • Over a billion accesses have been recorded from users utilizing the service for various applications like drug discovery and clinical research.
  • Continuous maintenance and improvement of these tools are emphasized as essential for their ongoing relevance.

The Rise of Large Language Models

Introduction to ChatGPT's Capabilities

  • The speaker expresses amazement at the capabilities of large language models (LLMs), particularly ChatGPT, in language understanding and generation.
  • ChatGPT gained one million users within five days of launch, showcasing its rapid adoption compared to other platforms.

Notable Case Studies

  • A notable case involved a boy who received an accurate diagnosis after his mother input symptoms into ChatGPT, illustrating its potential impact on healthcare.

Performance Metrics Against Human Standards

  • When first tested on medical licensing exams, ChatGPT scored around the 60% passing threshold; within six months, newer versions reached scores in the 80% range.
  • Performance is now believed to be closer to 90%, a significant advance over previous years, when scores were much lower.

Limitations of Current AI Diagnostic Tools

Concerns Regarding Understanding vs. Answering

  • The speaker raises concerns about using multiple-choice questions for real-world medical diagnostics due to their limitations in capturing complex scenarios.

Reasoning Process Behind Answers

  • There is skepticism about whether LLM outputs reflect true understanding or if they merely produce correct answers without proper reasoning processes.

Exploring Multimodal Large Language Models

Research Questions on Multimodal Inputs

  • With advancements in multimodal models capable of processing both text and images, new studies were initiated to evaluate their performance against human physicians' responses.

Understanding AI's Decision-Making in Medical Imaging

Motivation of the Study

  • The primary focus is to investigate whether GPT-4 truly understands its decision-making process when answering medical questions.
  • Three key aspects are examined: image comprehension, recall of relevant medical knowledge, and step-by-step reasoning for answers.

Methodology

  • Utilized GPT-4V (released October 2023) to analyze over 200 questions from the New England Journal of Medicine's online Image Challenge, covering various medical specialties such as dermatology and ophthalmology.
  • Engaged 10 domain experts (one medical student and nine attending physicians) to answer specialty-specific questions and evaluate the AI's reasoning.

Results Comparison

  • Human physicians achieved 78% accuracy compared to the medical student's 60%, while GPT-4V reached 81%; however, the difference between GPT-4V and the physicians was not statistically significant.
  • Physicians' assessments were conducted in a closed-book format; their performance could exceed 80% in open-book scenarios, indicating they still hold an edge over AI models.

Insights on AI Explanations

  • Of the 207 questions, the AI answered 169 correctly; however, in about one-third of those correct cases, the accompanying explanations contained errors, primarily in image comprehension.
  • An example highlighted that AI misinterpreted differences in MRI images, showcasing limitations in its rationale even when it arrived at the right conclusion.

Trusting AI in Clinical Applications

  • The study emphasizes the importance of understanding both the strengths and limitations of AI systems before applying them clinically. It sets the stage for exploring practical applications such as trial matching using large language models.

Enhancing Clinical Trial Matching with AI

Importance of Clinical Trials

  • Acknowledges that clinical trials are critical for drug development and evidence-based medicine, addressing real-world needs within institutions like the National Institutes of Health (NIH).

Project Overview

  • A project was initiated in mid-2023 to improve patient-trial matching at NCI, motivated by inefficiencies in manual matching by healthcare professionals, which can lead to long wait times for patients seeking trial participation.

Understanding Patient-Trial Matching

Overview of Patient-Trial Matching

  • Patient-trial matching involves two main perspectives: trial investigators seeking to enroll participants quickly, and patients or caregivers looking for trials relevant to their specific conditions.

Dual Direction Approach

  • Previous methods focused on either directing trials to patients or vice versa, but the current design aims to facilitate both directions effectively.

Input Components

  • The input consists of a trial (e.g., from clinicaltrials.gov) and a patient profile, with the goal of generating a score indicating the relevance and eligibility of the patient for the trial.

Criteria for Trial Eligibility

  • Trials have inclusion criteria that patients must meet and exclusion criteria that disqualify them. Understanding these is crucial for matching.
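The inclusion/exclusion logic reduces to a simple rule: every inclusion criterion must hold and no exclusion criterion may apply. A minimal sketch, with criteria represented as plain strings purely for illustration (a real matcher must interpret free-text criteria against clinical notes):

```python
def eligible(patient_facts, inclusion, exclusion):
    """A patient qualifies only if all inclusion criteria are met
    and no exclusion criterion applies. Criteria are checked by set
    membership here purely for illustration."""
    meets_all = all(c in patient_facts for c in inclusion)
    excluded = any(c in patient_facts for c in exclusion)
    return meets_all and not excluded

patient = {"age>=18", "stage II melanoma", "prior immunotherapy"}
inclusion = ["age>=18", "stage II melanoma"]
exclusion = ["prior immunotherapy"]
print(eligible(patient, inclusion, exclusion))  # → False (an exclusion applies)
```

The asymmetry matters: meeting every inclusion criterion is not enough if even one exclusion criterion fires, which is why both lists must be evaluated per patient.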

Importance of Efficient Enrollment

  • A significant number (25%-33%) of cancer trials in the U.S. fail due to insufficient enrollment, highlighting the need for improved matching processes.

Limitations of Existing AI Methods

Black Box Nature

  • Many existing AI systems operate as black boxes, providing yes/no answers without explanations, which undermines trust among healthcare professionals who may use these tools.

Need for Transparency

  • For AI tools to be effective assistants rather than replacements for human physicians, they must offer transparent reasoning behind their decisions.

Data Annotation Challenges

  • Current machine learning methods require extensive human-labeled data, which is time-consuming and requires domain expertise that is hard to source realistically.

Development of TrialGPT

Ground Truth Datasets Utilized

  • The development benefits from three gold standard datasets containing 200 patients and 20,000 trials that have been manually judged as relevant or irrelevant pairs.

TrialGPT Components Explained

  • TrialGPT comprises three components: retrieval based on patient summaries, criterion-by-criterion matching against the narrowed-down trials, and aggregation of scores to rank trial relevance.
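The three-stage flow can be sketched end to end. The stage implementations below are deliberately trivial stand-ins (keyword overlap, substring checks) for the hybrid retriever and LLM-based criterion matcher the talk describes; only the pipeline shape reflects the design:

```python
def retrieve(patient_summary, all_trials, k=1000):
    """Stage 1: narrow the trial pool with a cheap keyword-overlap score
    (stand-in for the real hybrid lexical/semantic retriever)."""
    words = set(patient_summary.lower().split())
    scored = [(len(words & set(t["text"].lower().split())), t) for t in all_trials]
    return [t for s, t in sorted(scored, key=lambda x: -x[0])[:k] if s > 0]

def match_criteria(patient_summary, trial):
    """Stage 2: judge each criterion (in TrialGPT this is an LLM call
    returning a label plus an explanation; here a trivial substring check)."""
    return {c: (c.lower() in patient_summary.lower()) for c in trial["criteria"]}

def aggregate(results):
    """Stage 3: fraction of criteria met, used to rank trials."""
    return sum(results.values()) / len(results) if results else 0.0

trials = [
    {"id": "T1", "text": "melanoma immunotherapy trial", "criteria": ["melanoma"]},
    {"id": "T2", "text": "diabetes prevention trial", "criteria": ["diabetes"]},
]
patient = "adult patient with stage II melanoma"
ranked = sorted(
    ((t["id"], aggregate(match_criteria(patient, t))) for t in retrieve(patient, trials)),
    key=lambda x: -x[1],
)
print(ranked)  # T1 should rank first
```

The staging is the point: cheap retrieval first, expensive per-criterion matching only on survivors, then a lightweight aggregation to produce the final ranking.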

Initial Search Process

  • The first step involves searching through approximately 20,000 active trials in the U.S. based on basic patient information before conducting detailed eligibility evaluations.

Understanding the Scalability of Trial Identification

Overview of Active Trials

  • The number of active trials at institutions like NIH is relatively small, with only hundreds currently ongoing. However, this number can increase significantly when considering hospitals or regions globally.
  • Generalizing to worldwide active trials suggests a much larger figure, emphasizing the scalability and robustness of the developed approach.

Methodology for Identifying Relevant Trials

  • The initial goal is to identify potentially relevant trials without overly narrowing down options. This involves generating keywords from patient notes to facilitate trial retrieval.
  • A retriever combines traditional term matching (e.g., the BM25 algorithm) with an in-house semantic matching tool called MedCPT, which accounts for synonyms automatically.
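A hybrid retriever of this kind can be sketched by blending a BM25 score with a similarity score. Here a bag-of-words cosine stands in for MedCPT's dense-embedding similarity (unlike MedCPT, it cannot handle synonyms), and the blend weight `alpha` is an assumption:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Classic BM25 lexical scoring over tokenized documents."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def cosine(a, b):
    """Bag-of-words cosine, a crude stand-in for dense-embedding similarity."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query, docs, alpha=0.5):
    """Blend lexical and 'semantic' scores; alpha is an assumed weight."""
    q = query.lower().split()
    toks = [d.lower().split() for d in docs]
    lex = bm25_scores(q, toks)
    sem = [cosine(q, d) for d in toks]
    blended = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    return sorted(range(len(docs)), key=lambda i: -blended[i])

docs = ["melanoma immunotherapy phase II trial",
        "type 2 diabetes prevention study",
        "advanced melanoma targeted therapy trial"]
print(hybrid_rank("stage II melanoma trial", docs))
```

Blending the two signals is the design choice: lexical matching rewards exact terms like drug names, while the embedding side (in the real system) recovers synonymous phrasings that exact matching misses.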

Comparison of Keyword Generation Techniques

  • Various methods for keyword generation were compared: using GPT-4, GPT-3.5, inputting all words from patient notes (least effective), and clinician-generated keywords.
  • Performance varied among human-selected keywords (recall scores ranged from 0.68 to 0.82). GPT-4 achieved a recall score of 0.934, indicating high relevance in top results.

Efficiency in Reducing Candidate Trials

  • By analyzing the top 1,000 results instead of 20,000 original candidates, over 95% coverage was maintained while minimizing loss of relevant trials.
  • This efficiency is crucial as the subsequent matching process is resource-intensive; thus, optimizing candidate selection is essential.
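The coverage claim above corresponds to measuring recall at a cutoff k; a minimal sketch:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of truly relevant trials that survive a top-k cutoff —
    the coverage statistic behind trimming 20,000 candidates to 1,000."""
    kept = set(ranked_ids[:k])
    return len(kept & set(relevant_ids)) / len(relevant_ids)

# Toy ranking: 3 of the 4 relevant trials appear in the top 5.
ranked = ["t9", "t2", "t7", "t1", "t4", "t3", "t8"]
relevant = ["t2", "t1", "t4", "t3"]
print(recall_at_k(ranked, relevant, 5))  # → 0.75
```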

AI's Role in Inclusion/Exclusion Criteria Assessment

  • ChatGPT evaluates each candidate trial against individual inclusion/exclusion criteria based on patient notes and provides explanations for its decisions.
  • Two major benefits arise: transparency through AI-generated explanations allows users to verify AI decisions and reduces hallucination risks by prompting self-verification processes.
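A criterion-level prompt that requests a label, an explanation, and quoted evidence might look like the sketch below. The wording and JSON schema are assumptions, not the published TrialGPT prompt, and no actual API call is made here:

```python
import json

def criterion_prompt(patient_note, criterion):
    """Build a per-criterion prompt that asks for a label AND an
    explanation with quoted evidence, so decisions are auditable.
    The exact wording/schema is an assumption."""
    return (
        "Patient note:\n" + patient_note + "\n\n"
        "Criterion: " + criterion + "\n\n"
        'Respond in JSON: {"label": "met|not met|uncertain", '
        '"explanation": "...", "evidence": "quoted sentence from the note"}'
    )

def parse_response(raw):
    """Validate the model's JSON reply before trusting the label."""
    out = json.loads(raw)
    assert out["label"] in {"met", "not met", "uncertain"}
    return out

# Simulated model reply (no API call made).
reply = ('{"label": "met", "explanation": "Note documents stage II disease.", '
         '"evidence": "Stage II melanoma confirmed by biopsy."}')
print(parse_response(reply)["label"])  # → met
```

Requiring quoted evidence is what enables the verification step the talk mentions: a reviewer can check the cited sentence against the note instead of taking the label on faith.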

Addressing AI Limitations

  • Despite their power, AI tools can make mistakes due to hallucinations—fabricating information confidently. Encouraging explanation generation helps mitigate these errors.

AI Performance in Medical Trial Matching

Overview of AI Evaluation

  • A team of MDs manually evaluated 1,000 patient-criterion pairs, finding that the AI system, TrialGPT, achieved accuracy in the high-80% range.
  • Human physicians' performance was comparable, with both reaching around 88% accuracy and exhibiting similar error rates.

Error Analysis Insights

  • The analysis identified two main sources of errors: difficulties with abbreviations and misspellings in Electronic Health Records (EHR).
  • Another limitation noted was AI's reasoning capabilities; for instance, it failed to infer exclusion criteria based on incomplete information about a patient's condition.

Ranking Trials for Patients

  • The process involves generating a final rank list of trials relevant to patients by aggregating scores from individual criterion evaluations.
  • Two key scores are used: relevance score (how pertinent a trial is to a specific condition) and eligibility score (whether a patient meets trial criteria).
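Aggregating criterion-level labels into the two trial-level scores might look like the sketch below; the formulas are illustrative, not TrialGPT's published aggregation:

```python
def trial_scores(criterion_labels):
    """Aggregate per-criterion labels into the two trial-level numbers the
    talk describes: relevance (how much of the trial the note speaks to)
    and eligibility (inclusion criteria met, exclusion criteria avoided).
    These exact formulas are illustrative only."""
    inc = [l for c, l in criterion_labels if c == "inclusion"]
    exc = [l for c, l in criterion_labels if c == "exclusion"]
    labels = inc + exc
    relevance = sum(l != "uncertain" for l in labels) / len(labels)
    ok = sum(l == "met" for l in inc) - sum(l == "met" for l in exc)
    eligibility = ok / len(labels)  # a met exclusion criterion counts against the trial
    return relevance, eligibility

labels = [("inclusion", "met"), ("inclusion", "met"),
          ("inclusion", "uncertain"), ("exclusion", "not met")]
rel, elig = trial_scores(labels)
print(round(rel, 2), round(elig, 2))  # → 0.75 0.5
```

Keeping relevance and eligibility separate mirrors the talk's distinction: a trial can be highly relevant to a condition while the patient still fails its eligibility criteria.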

Comparison with Previous Models

  • The evaluation used three benchmark datasets from prior research; previous models achieved around 50% accuracy, while TrialGPT surpassed 73%.
  • Unlike traditional methods that rely on training data, TrialGPT uses zero-shot learning, making it adaptable across disease areas without extensive training datasets.

Pilot User Study Findings

  • In a pilot study involving two physicians assessing six patient cases across different trials, results showed no difference in answers with or without AI assistance.
  • However, using AI reduced the time taken by physicians to answer questions by approximately 40%, highlighting its efficiency.

Summary of Key Components

  • TrialGPT consists of three components: retrieval (narrowing down the search space), matching (assessing patient-trial fit against inclusion/exclusion criteria), and ranking (aggregating scores to order trials).

AI in Healthcare: Enhancing Clinical Decision-Making

Achievements in AI Accuracy

  • The AI system achieved an accuracy of 88.87% in matching patient-criterion pairs, comparable to human physicians' performance.
  • By integrating inclusion and exclusion criteria at the trial level, the system can rank trials based on relevance for patients, suggesting the most pertinent options while filtering out those that do not meet specific criteria.

Impact on Physician Efficiency

  • A pilot user study indicated that this AI approach helps physicians save over 40% of their time by streamlining trial selection processes.
  • Continuous research is being conducted to enhance these systems further, supported by a recent NIH Director's Challenge Award aimed at real-world deployment testing.

Collaboration and Future Directions

  • The project has attracted interest from various NIH institutes and external organizations, including patient advocacy groups and academic medical centers, indicating a broad collaborative effort to refine the AI system.
  • Collaborators include principal investigators from diverse disease specialties, emphasizing a multidisciplinary approach to improving healthcare outcomes through AI technology.

Philosophy of Research Application

  • The speaker emphasizes that their focus is not on competing with private sector models but rather on applying existing large language models (LLMs) within specific domains like healthcare for practical improvements.
  • There is an ongoing commitment to enhancing LLMs with domain-specific knowledge as new models are released every few months, allowing for continuous improvement in task performance.

Addressing Limitations of AI Models

  • As AI models become more integrated into real-life applications, it is crucial to investigate their limitations and challenges thoroughly; this aspect forms part of ongoing research efforts.
  • The presentation acknowledges the contributions of a dedicated team at the National Library of Medicine and collaborators both within NIH and externally in advancing this research agenda.

Questions about Future Implementations

  • An attendee asked whether PubMed would implement AI features similar to OpenEvidence for entering clinical question prompts; the speaker confirmed awareness of such systems and of ongoing research related to them.

Key Challenges in Health Science Research

  • A significant question concerned ensuring the safety and trustworthiness of large language models in healthcare applications, given the high stakes of medical decisions. This area requires more focused research moving forward.
  • Emphasis was placed on utilizing high-quality data from biomedical literature when training these models since output quality heavily relies on input data integrity; incorrect information online could adversely affect model behavior.

Understanding the Limitations of Large Language Models in Biomedical Research

Importance of Traceability in Research Outputs

  • Dr. Lu emphasizes that large language models must provide traceable outputs linked to trustworthy evidence in biomedical research, a critical aspect of scientific integrity.
  • He points out that while these models can generate answers, they often do not allow users to trace back to the original studies or research projects, which is a significant limitation.

Audience Engagement and Future Communication

  • The session concludes with an acknowledgment of audience questions, indicating a high level of interest and engagement from participants.
  • Dr. Lu encourages attendees to reach out via email with further inquiries, fostering ongoing dialogue about the topics discussed.
  • The moderator notes that all questions have been captured for sharing with Dr. Lu, ensuring participant concerns are addressed even after the session ends.

Video Description

The Network of the National Library of Medicine is funded by the National Library of Medicine, National Institutes of Health, Department of Health and Human Services. Learn more at https://nnlm.gov