Cloud Computing & Big Data

Author: Pavel Smirnov

Coauthor(s): Nikolay Ryzhikov, PhD

Status: Completed Work

Funding Acknowledgment: The work was done by Health Samurai, which is involved in several commercial projects with other organizations.

Building a FHIR back-end suitable for data analytics/machine learning applications


Introduction/Background

Most healthcare data is not big data. When an organization needs to aggregate structured medical data into a centralized repository to run data analytics or machine learning algorithms, the choice of data model is an important step. It can determine many things, from the ease of integration with EHR systems to the time to market and cost of a health IT project. Leveraging standards and terminologies, in particular FHIR (Fast Healthcare Interoperability Resources), which has gained wide industry traction, makes it easier to receive data from EHRs and feed it to various applications. While FHIR is the right choice for storing medical data and integrating with EHR systems, its tree-like structure with many recursive references poses a significant technical challenge that prevents companies from building efficient FHIR storage for data analytics and business intelligence use cases.

Methods

How do we approach the development of a comprehensive clinical data repository (CDR) for FHIR that supports flexible search queries and thus enables data analytics use cases?

A purely relational approach proved too complicated: the number of tables needed to express the complete FHIR data model exceeds a thousand, which is very hard to manage. This significantly affects usability and the implementation of basic operations such as create, read, update, delete (CRUD), and search.

A pure document database approach is easy to implement but has too many limitations in its search capabilities, which prevents its use in many applications and data analytics use cases.

Health Samurai’s solution to this problem is to combine relational and document data. Some databases on the market have started to support JSON (JavaScript Object Notation) data, which is the most popular format for representing FHIR resources. We store FHIR data in JSON columns while keeping all resource IDs and timestamps as regular columns in a relational database.
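
The hybrid layout described above can be illustrated with a minimal sketch. The table layout, the column names (`id`, `last_updated`, `resource`) and the helper functions `create_table_sql` and `to_row` are assumptions made for illustration only; they are not taken from the actual Aidbox schema. The sketch only builds SQL text and row values and does not connect to a database.

```python
import json
from datetime import datetime
from typing import Tuple


def create_table_sql(resource_type: str) -> str:
    """Build PostgreSQL DDL for one table per FHIR resource type (hypothetical layout)."""
    table = resource_type.lower()
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {resource_type!r}")
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    id text PRIMARY KEY,                -- resource ID as a regular column\n"
        "    last_updated timestamptz NOT NULL,  -- timestamp as a regular column\n"
        "    resource jsonb NOT NULL             -- remaining FHIR content as a JSON document\n"
        ")"
    )


def to_row(resource: dict) -> Tuple[str, datetime, str]:
    """Split a FHIR resource into (id, last_updated, JSON body) for insertion."""
    body = dict(resource)
    resource_id = body.pop("id")
    meta = dict(body.pop("meta", {}))
    # FHIR instants may end in "Z"; normalize so datetime.fromisoformat accepts them.
    last_updated = datetime.fromisoformat(meta.pop("lastUpdated").replace("Z", "+00:00"))
    if meta:
        body["meta"] = meta  # keep any other metadata inside the document
    return resource_id, last_updated, json.dumps(body)


patient = {
    "resourceType": "Patient",
    "id": "p1",
    "meta": {"lastUpdated": "2019-05-01T10:00:00Z"},
    "gender": "female",
}
row = to_row(patient)
print(create_table_sql("Patient"))
print(row)
```

Keeping the ID and timestamp outside the document lets ordinary relational machinery (primary keys, range scans on time) work as usual, while the nested FHIR structure stays intact in the JSON column.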

Results

Health Samurai’s FHIR server implementation, aidbox.io, is built on the PostgreSQL database. PostgreSQL is a mature traditional database that can handle terabytes of structured data at transaction rates above 10,000 per second. Its ‘jsonb’ data type makes JSON data queryable from SQL, and every FHIR command executed against Aidbox is translated internally into a SQL query. For significant volumes of data, GIN (Generalized Inverted Index) indexes can be added on jsonb data to boost the performance of search queries.
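
As a rough illustration of how a FHIR search might be translated into SQL over jsonb, the sketch below maps simple equality parameters onto the PostgreSQL jsonb containment operator `@>`, which the default GIN operator class for jsonb can accelerate. The functions `translate_search` and `gin_index_sql`, the table and column names, and the `%s` placeholder style (as used by psycopg) are assumptions for illustration; the actual translation performed by Aidbox is not described here and is likely more involved, since real FHIR search parameters often point at nested elements.

```python
import json
from typing import Dict, List, Tuple


def _table_name(resource_type: str) -> str:
    """Derive a table name from a resource type, rejecting unsafe identifiers."""
    table = resource_type.lower()
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {resource_type!r}")
    return table


def gin_index_sql(resource_type: str) -> str:
    """DDL for a GIN index over the whole jsonb document (hypothetical index name)."""
    table = _table_name(resource_type)
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_resource_gin "
        f"ON {table} USING GIN (resource)"
    )


def translate_search(resource_type: str, params: Dict[str, str]) -> Tuple[str, List[str]]:
    """Translate e.g. Patient?gender=female into a parameterized SQL query.

    Only top-level equality matches are handled in this sketch.
    """
    table = _table_name(resource_type)
    clauses: List[str] = []
    args: List[str] = []
    for name, value in sorted(params.items()):
        # resource @> '{"gender": "female"}' matches documents containing that pair.
        clauses.append("resource @> %s::jsonb")
        args.append(json.dumps({name: value}))
    where = " AND ".join(clauses) or "TRUE"
    return f"SELECT id, resource FROM {table} WHERE {where}", args


sql, args = translate_search("Patient", {"gender": "female"})
print(gin_index_sql("Patient"))
print(sql, args)
```

Passing the JSON fragment as a query parameter rather than interpolating it into the SQL string keeps user-supplied search values out of the query text.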

Aidbox performance test results are available here:
https://gist.github.com/Stas-Ghost/f647cdeada02a54f3c46e667a9b5ec0a#file-benchmark-results

Discussion

The approach described above has been validated by many of Health Samurai’s Aidbox clients, who have built comprehensive healthcare solutions leveraging the FHIR specification. Health Samurai’s FHIR server implementation is built on PostgreSQL, but JSON support is coming to other popular databases such as MS SQL Server and Oracle. In fact, JSON is part of the recent SQL:2016 standard.