Structured categorization of medical images using 3 million free-text radiology reports: building the next ImageNet for radiology

Author: Jae Ho Sohn

Background: UCSF generates more than 200,000 radiology reports every year but has no efficient way of harnessing the data because the reports are in free-text format. These reports are linked to patient images and contain valuable medical information in the form of the radiologist's visual description as well as the overall diagnostic impression. ImageNet is a hierarchically organized database of 15 million images that fueled the deep learning revolution in 2012. If a sufficient number of radiology images can be organized in a similar way, it could revolutionize the way radiologists create teaching files (short-term goal), conduct clinical research (mid-term goal), and train computer vision algorithms (long-term goal). The aim of this multi-phase project is to categorize all radiology reports from UCSF's clinical database (and hence the linked medical images) using a standardized, hierarchical radiological lexicon that can be easily indexed, annotated, and scaled to other institutions. We name this project R-Net.

Method: The first step of the project is to create a natural language processing and search algorithm that clusters similar reports together, which can be followed by semiautomatic or manual assignment of a specific hierarchical label from a radiology lexicon. Before extracting all 3 million reports, a preliminary set of 249,120 anonymized radiology reports from January 2015 to October 2016 at the Zuckerberg San Francisco General Hospital (one of six affiliated UCSF hospitals) was extracted for initial data analysis and proof of concept. The reports were first categorized by imaging modality (MR, CT, ultrasound, plain film, and others) and body part (head, neck, chest, abdomen/pelvis, and others). Similarity was then determined with the term frequency-inverse document frequency (TF-IDF) method, in which reports that share the most words tend to cluster together, with greater weight assigned to uncommon words and phrases. The resulting similarity scores were recorded.
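The TF-IDF similarity step can be sketched as follows. This is a minimal illustration, not the project's implementation: the abstract does not specify the tokenization, the IDF formula, or the similarity measure, so the regex tokenizer, the smoothed IDF, cosine similarity, the `threshold` value, and the four sample reports are all assumptions made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(reports):
    """Return one sparse TF-IDF vector (dict of term -> weight) per report."""
    tokenized = [tokenize(r) for r in reports]
    n_docs = len(tokenized)
    # Document frequency: in how many reports does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant): rarer terms receive larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_reports(query_idx, vectors, threshold=0.3):
    """Return (score, index) pairs at or above the threshold, best match first."""
    scored = [
        (cosine(vectors[query_idx], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_idx
    ]
    return sorted([(s, i) for s, i in scored if s >= threshold], reverse=True)


# Hypothetical report snippets, written only for this illustration.
reports = [
    "CT abdomen pelvis: dilated appendix with periappendiceal fat stranding, consistent with acute appendicitis.",
    "CT abdomen pelvis: enlarged appendix with wall thickening and fat stranding, findings of acute appendicitis.",
    "CT chest: filling defect in the right lower lobe pulmonary artery, consistent with pulmonary embolism.",
    "CT head: crescentic extra-axial collection along the left convexity, consistent with subdural hemorrhage.",
]

vectors = tfidf_vectors(reports)
results = similar_reports(0, vectors)
for score, idx in results:
    print(f"{score:.2f}  {reports[idx]}")
```

With these sample reports, the two appendicitis reports share many weighted terms and pass the threshold, while the pulmonary embolism and subdural hemorrhage reports share only common words such as "CT" and "with" and are filtered out.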
A simple front-end user interface was created using Django. It lets a user input a radiology report and then returns a list of all sufficiently similar radiology reports in a given database.

Preliminary Result / Future Vision: Here, we took the first step towards R-Net, the next ImageNet of radiology. We demonstrated a proof-of-concept model that clusters similar radiology reports from a preliminary sub-dataset of 249,120 anonymized reports. Currently, the algorithm readily clusters reports that share a diagnosis, such as "appendicitis," "pulmonary embolism," and "subdural hemorrhage." However, it struggles with negating statements such as "no evidence of appendicitis" and with reports that give multiple differential diagnoses. The next step is to incorporate radiological lexicons to create better clusters, and then to assign each cluster the most appropriate label from a standardized system such as RadLex or ICD coding. Afterwards, under careful guidance from the patient privacy (HIPAA) and research ethics (IRB) departments, we plan to outsource the remaining categorization work. The database can then be used by radiology residents for teaching cases, by clinical researchers for big data projects, and eventually by deep learning startups for computer vision products.
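The negation problem noted above can be made concrete with a short sketch. It is an illustration of the failure case only, not part of the project: the cue list, the `is_negated` helper, and both sample reports are hypothetical, and the rule (a negation cue appearing before the diagnosis in the same sentence) is a deliberately simplified assumption.

```python
import re

# Hypothetical negation cues; a real system would need a far fuller list.
NEGATION_CUES = ("no evidence of", "no", "without", "negative for")


def tokens(text):
    """Set of lowercase word tokens in a report."""
    return set(re.findall(r"[a-z]+", text.lower()))


def is_negated(report, diagnosis):
    """True if a negation cue precedes the diagnosis within the same sentence."""
    for sentence in re.split(r"[.;]", report.lower()):
        pos = sentence.find(diagnosis.lower())
        if pos == -1:
            continue
        prefix = sentence[:pos]
        if any(re.search(r"\b" + re.escape(cue) + r"\b", prefix) for cue in NEGATION_CUES):
            return True
    return False


# Hypothetical report snippets, written only for this illustration.
positive = "CT abdomen: dilated appendix with fat stranding. Findings consistent with acute appendicitis."
negative = "CT abdomen: normal appendix. No evidence of acute appendicitis."

# A word-overlap method sees the same diagnosis term in both reports...
shared = tokens(positive) & tokens(negative)
print(sorted(shared))

# ...while a sentence-level rule can tell the two apart.
print(is_negated(positive, "appendicitis"), is_negated(negative, "appendicitis"))
```

Because both reports contain "appendicitis" and several other shared terms, a word-based similarity score tends to place them close together even though they state opposite findings.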

Co-Author/Co-Investigator Names/Professional Title:
1. Lukasz Kidzinski (Postdoc Researcher, Computer Science, Stanford University)
2. Hyojung Paik (Postdoc Researcher, Institute for Computational Health Sciences, UCSF Medical Center)
3. Doron Reuven (Researcher, Computer Science, UC Berkeley)
4. Joseph Mesterhazy (Software Engineer, Radiology & Biomedical Imaging, UCSF Medical Center)