Solving Storage: A System for Cost-Effective Permanent Storage and High-Speed Processing of Physiological Data Signals

Author: Robert Greer

Physiological data streams are viewed as one of the fundamental data sources driving change in health care and health research. Despite this, uptake of physiological data collection has been slow, in large part due to the sheer quantity of data collected. Existing research projects have mitigated this problem by collecting only the subset of available physiological signals relevant to their particular study, and only during the study period. Though this approach has yielded quality research, it fails to provide a useful dataset for future study. In the Department of Critical Care Medicine at the Hospital for Sick Children in Toronto, a different approach has been taken. Across the 41 bed spaces of the Pediatric and Cardiac Critical Care units, approximately 150 GB of data per day is collected using T3 (Etiometry, Boston, MA) and ViNES (True Process, Madison, WI). This includes physiological metrics and waveforms collected from bedside monitors, ventilators, and cerebral oxygenation monitors. Data has been collected continuously for over 42 months, with signal sampling rates ranging from 1 sample every 5 seconds to 500 samples per second. The goal of this approach was to store all of these data permanently, enabling both real-time analytics and retrospective analysis. To store this quantity of data, an evaluation of many industry-leading database systems was conducted; none could provide the storage performance required for real-time analytics in a cost-effective manner. This led the Critical Care Data Science team to create a novel database system tailored specifically for physiological time-series data. The Time-Series Compression (TSC) Framework is a novel read-optimized physiological time-series data framework that integrates proprietary compression technology to store collected data at 1/100th of its original size without any loss of information.
This means that an entire year of collected physiological data can be stored in approximately 550 GB. TSC also employs an adaptive data sharding algorithm that efficiently splits data into optimal shards distributed across multiple systems. The distribution model pairs a centralized data and metadata store with one or more distributed compute nodes that handle decompression and computation. This model enables rapid deployment and scaling of TSC from a single server to clusters containing many hundreds of servers. Initial tests of the TSC framework have demonstrated the ability to retrieve, decompress, and perform signal processing (R-R peak interval times) at a rate of over 100 million samples per second, or approximately 7 patient-days of data per second, on a single server (28 cores @ 2.0 GHz, 128 GB memory, 800 GB SSD). TSC provides an important and scalable solution for structuring stored physiologic data, one that enables the development of a physiologic databank and greatly enhances access to the data for research.
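
The storage and throughput figures above are mutually consistent; a quick back-of-envelope check (a sketch in Python using only values quoted in this abstract; the variable names are our own) shows how the ~550 GB/year and per-patient sampling estimates follow from the stated rates:

```python
# Back-of-envelope checks of the figures quoted above.
# All inputs are taken from the abstract; derived values are approximate.

DAILY_INGEST_GB = 150          # raw data collected per day across 41 beds
COMPRESSION_RATIO = 100        # TSC stores data at 1/100th of original size
SAMPLES_PER_SEC = 100_000_000  # single-server processing rate
PATIENT_DAYS_PER_SEC = 7       # equivalent patient-days processed per second

# One year of raw data, and its compressed footprint
raw_per_year_gb = DAILY_INGEST_GB * 365                       # 54,750 GB
compressed_per_year_gb = raw_per_year_gb / COMPRESSION_RATIO  # ~547.5 GB

# Implied average sampling density per patient
samples_per_patient_day = SAMPLES_PER_SEC / PATIENT_DAYS_PER_SEC  # ~14.3M
avg_samples_per_patient_sec = samples_per_patient_day / 86_400    # ~165

print(f"{compressed_per_year_gb:.0f} GB compressed per year")
print(f"~{avg_samples_per_patient_sec:.0f} samples/s average per patient")
```

The implied average of roughly 165 samples per second per patient sits comfortably within the stated signal range (1 sample per 5 seconds up to 500 samples per second), since only a subset of each patient's signals are high-frequency waveforms.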

Co-Author/Co-Investigator Names/Professional Title: Robert Greer, B.Eng., Andrew Goodwin, B.Eng., Anirudh Thommandram, B.Eng., M.A.Sc., Dr. Danny Eytan, MD, PhD, Dr. Peter Laussen, MBBS, FCICM.

Funding Acknowledgement (If Applicable): SickKids Foundation: David and Stacey Cynamon Chair in Pediatric Critical Care, The Hospital for Sick Children, and University of Toronto.