Cloud Computing & Big Data

Author: Nathaniel Bischoff

Coauthor(s): Louis Ehwerhemuepha PhD, Spyro Mousses PhD, William Feaster MD, Neil Garde, Duncan Yeung, Haikal Pribadi, Christian Jakenfelds, Tomas Sabat, Thanh Pham, Andy Liang, Helen Zhu, James Wimberley, Sharon Kim, Madison Pahl, Thomas Madden, Peter, Mousses, Anthony Chang

Status: Work in Progress

Biomedical Learning of an Accelerated Data Environment (BLADE)

The promise of artificial intelligence in medicine is to act as a rocket that amplifies human intelligence. For that rocket to carry us to the next era of medical intelligence, healthcare data must be its fuel. Traditionally, structured healthcare data is kept in tables in a relational database management system (RDBMS), where it can take hundreds of lines of code to retrieve little or no data and metadata. Nor does this cover unstructured data, such as medical images and physician notes, which remains relatively untapped in the medical domain apart from a small number of medical imaging applications. Given the current state of healthcare data, a new paradigm must be proposed and executed in order to propel AI in medicine.
We are constructing a hypergraph, or hyper-relational, database known as the Biomedical Learning of an Accelerated Data Environment (BLADE) as a way to store data in a form conducive to next-generation data science and machine learning. A hyper-relational database is a generalization of a graph database, and it allows more flexibility and functionality than a traditional SQL-based RDBMS such as MySQL. It will let us model data using a knowledge representation and create metadata that provides more context for each individual data point. Recent advances in hypergraph database architecture allow programmers to place machine learning algorithms inside the database to create new, nonobvious connections. The data and metadata will help data scientists and analysts find cohorts of patients to study for research, and gain further insights for clinical use through machine learning and AI. To complete this project, we are working with GRAKN.AI, a distributed hyper-relational database for knowledge-oriented systems. GRAKN.AI was recently awarded Product of the Year 2017 by the University of Cambridge Computer Lab. As the data source for the project, we will use the Cerner HealtheFacts Database. This is a deidentified database that includes over 100 million patients, but we are selecting a subset of this population to complete this study. We have selected five cardiology diseases that have over 100,000 encounters for this pilot study.
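To make the hyper-relational idea concrete, here is a minimal in-memory sketch in Python. It is not the GRAKN.AI API or its schema language, and it does not reflect the BLADE schema or the Cerner data layout. Every class, relation name, attribute, and record below (`Entity`, `Relation`, `diagnosed-during`, `find_cohort`, the patient IDs, the diagnoses) is a hypothetical stand-in on synthetic data. The point it illustrates is that a relation can link any number of role players, can itself take part in another relation, and can carry its own metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Entity:
    """A node such as a patient or a diagnosis, with free-form attributes."""
    kind: str
    attributes: Dict[str, object] = field(default_factory=dict)


@dataclass
class Relation:
    """A hyperedge: links named role players, which may themselves be relations."""
    kind: str
    role_players: Dict[str, "Thing"]
    attributes: Dict[str, object] = field(default_factory=dict)  # metadata


Thing = Union[Entity, Relation]


def find_cohort(relations: List[Relation], diagnosis_name: str) -> List[str]:
    """Return sorted IDs of patients with an encounter tied to the named diagnosis."""
    cohort = set()
    for rel in relations:
        if rel.kind != "diagnosed-during":
            continue
        diagnosis = rel.role_players["diagnosis"]
        encounter = rel.role_players["encounter"]  # a Relation used as a role player
        if diagnosis.attributes.get("name") == diagnosis_name:
            patient = encounter.role_players["patient"]
            cohort.add(patient.attributes["patient_id"])
    return sorted(cohort)


# Synthetic, made-up records (hypothetical; not drawn from any real database).
alice = Entity("patient", {"patient_id": "P001"})
bob = Entity("patient", {"patient_id": "P002"})
hospital = Entity("facility", {"name": "Example Hospital"})
heart_failure = Entity("diagnosis", {"name": "heart failure"})
afib = Entity("diagnosis", {"name": "atrial fibrillation"})

# An encounter is a relation between a patient and a facility, with metadata.
enc1 = Relation("encounter", {"patient": alice, "facility": hospital},
                {"source": "synthetic"})
enc2 = Relation("encounter", {"patient": bob, "facility": hospital},
                {"source": "synthetic"})

relations = [
    enc1,
    enc2,
    # A relation whose role player is another relation (the encounter).
    Relation("diagnosed-during", {"diagnosis": heart_failure, "encounter": enc1}),
    Relation("diagnosed-during", {"diagnosis": afib, "encounter": enc2}),
]

print(find_cohort(relations, "heart failure"))  # ['P001']
```

In a relational schema the same lookup would typically require joining patient, encounter, and diagnosis tables. Here the cohort search walks from a diagnosis through the encounter relation to the patient, and the metadata stays attached to the relation it describes.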
The overall goal of the project is to put timely, smart-filtered data at the fingertips of caregivers and patients, and to let data scientists work with the data to find insights.