Using artificial intelligence on electronic health records to speed up patient identification for clinical trials

Author: Wout Brusselaers

Background/purpose: Every new drug or medical device must pass clinical trials before it can go to market. Yet more than 80% of trials fail to meet enrollment timelines, and nearly 50% of sites enroll one or no patients. Finding and matching patients to trials is a slow, manual process that requires extensive outreach and in-depth review of medical records by trained clinical staff. Because 90% of the data in medical records is unstructured or free-form text (e.g., doctors’ notes, pathology reports, operating notes), current tools cannot easily automate such reviews. We apply artificial intelligence (AI), natural language processing (NLP) and data analytics to clinical data to match patients against clinical trial eligibility criteria. We will compare this approach with conventional patient identification methods in terms of the speed of patient recruitment, the quality of patients as judged against the eligibility criteria, and the number of patients identified. We will also examine the benefits of various machine learning (ML) and NLP techniques in the patient identification process.

Methods: We will conduct a retrospective review of a study to validate a biomarker for non-small cell lung cancer (NSCLC), for which patients were recruited according to a major academic medical center’s standard practices. The study identified patients older than 18 years diagnosed with non-metastatic (stage I-III) NSCLC. Exclusion criteria, applied to both lung cancer patients and healthy controls, were (i) a history of active tuberculosis or other mycobacterial infection, (ii) inability to participate in sputum induction, and (iii) severe asthma. The dataset includes de-identified electronic medical records (EMRs) from the academic medical center that conducted the original study. Medical ontologies such as SNOMED and ICD-10 will be used in the NLP algorithm and in machine annotation of the EMRs.
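
The eligibility criteria stated above can be expressed as a rule check over extracted patient attributes. The following is a minimal sketch for the lung cancer arm only, not the software evaluated in this study: the `PatientRecord` class, its field names and the `is_eligible` function are hypothetical, and the sketch assumes the relevant attributes have already been extracted from the EMRs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientRecord:
    # Hypothetical fields, assumed to be extracted upstream from structured
    # codes and free-text notes; none of these names come from the study.
    age_years: int
    nsclc_stage: Optional[str]        # "I", "II", "III", "IV", or None if no NSCLC diagnosis
    mycobacterial_history: bool       # history of active tuberculosis or other mycobacterial infection
    can_do_sputum_induction: bool
    severe_asthma: bool


def is_eligible(p: PatientRecord) -> bool:
    """Apply the inclusion and exclusion criteria for the lung cancer arm."""
    included = p.age_years > 18 and p.nsclc_stage in {"I", "II", "III"}
    excluded = (
        p.mycobacterial_history
        or not p.can_do_sputum_induction
        or p.severe_asthma
    )
    return included and not excluded


candidate = PatientRecord(64, "II", False, True, False)
metastatic = PatientRecord(64, "IV", False, True, False)
asthmatic = PatientRecord(58, "I", False, True, True)
```

In this sketch only `candidate` passes the check; `metastatic` fails the stage criterion and `asthmatic` is excluded for severe asthma.
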
We will evaluate the relative benefits of applying NLP to both structured and unstructured data from EMRs, laboratory information systems (LIS), pathology reports and other clinical data sources, compared with querying structured data only. We will also assess the impact of ML techniques that enable the software to learn from its users and suggest better results, including their potential to further accelerate and automate the record review needed to validate trial matches presented by the AI. In addition, we will examine false positives and false negatives, and the ability of ML to support fuzzy matching, i.e., identifying patient attributes that are not explicitly mentioned in the data but can be inferred from modeling the overall database. To give a true representation of whether this approach is valid and beneficial, every step of the original study was thoroughly documented to serve as a benchmark. The AI software will be run against patient medical records, and the following results will be measured: number of eligible patients identified, time to identification, time to validate suggested matches, precision, recall, and overall time and cost savings.
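
For reference, precision and recall can be computed by comparing the set of patients suggested by the software with the set confirmed as eligible by manual review. The sketch below uses the standard definitions; the function name `precision_recall` and the patient identifiers are illustrative assumptions and are not study data.

```python
def precision_recall(suggested: set, confirmed: set) -> tuple:
    """Return (precision, recall) for suggested matches against confirmed eligible patients."""
    true_pos = len(suggested & confirmed)    # suggested and truly eligible
    false_pos = len(suggested - confirmed)   # suggested but not eligible
    false_neg = len(confirmed - suggested)   # eligible but missed
    precision = true_pos / (true_pos + false_pos) if suggested else 0.0
    recall = true_pos / (true_pos + false_neg) if confirmed else 0.0
    return precision, recall


# Illustrative identifiers only.
suggested_ids = {"p1", "p2", "p3", "p4"}
confirmed_ids = {"p2", "p3", "p5"}
precision, recall = precision_recall(suggested_ids, confirmed_ids)
```

With these illustrative sets, two of the four suggestions are correct (precision 0.5) and two of the three eligible patients are found (recall 2/3).
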