Generation and Management of Annotation Datasets for Automated Classification of Time Series Data Using Machine Learning

Author: Anirudh Thommandram

Physiological data streams are increasingly being collected and stored for research purposes. The number of high-sample-rate waveforms collected from bedside monitors, at rates ranging from 60 Hz to 500 Hz, is also increasing. A massive store of raw waveforms presents a great opportunity for machine learning techniques, but annotating such an immense dataset for analysis is a daunting challenge. Finding the relevant data, extracting it, and collecting and organising the annotations becomes very resource-intensive when there are thousands of hours of time series data to be viewed and annotations must be collected for time windows as small as ten seconds. Pulling a random sample of arbitrary duration out of a giant database is not a performant query in any existing time series storage solution. TSC is a medically focused time series data framework being developed at The Hospital for Sick Children that is specifically optimised for compressed storage, rapid retrieval, and distributed analysis of physiological data. With this framework as the backbone, we are able to design a complete environment that supports all the steps involved in applying machine learning techniques to classify physiological data. The TSC framework allows researchers with a research question to explore the data, see which signals exist and for how many patients, and easily create a cohort ready for analysis. The researcher can then start an annotation collection project by specifying parameters such as which signals need to be shown, the size of the time window, the kind of annotation to be collected (yes/no, numeric, etc.), how many distinct samples need to be annotated, how many unique individuals need to annotate the same sample, whether the annotators should have a specific role or specialisation, and various other optional specifications.
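
The project parameters listed above could be captured in a small, validated specification object. The following is a minimal sketch only: the class name `AnnotationProject`, its field names, and the allowed annotation types are hypothetical and are not part of the TSC framework's actual interface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed set of annotation kinds; the abstract names yes/no and numeric.
ALLOWED_TYPES = ("yes_no", "numeric")


@dataclass(frozen=True)
class AnnotationProject:
    """Hypothetical annotation project specification (illustrative names)."""
    signals: Tuple[str, ...]             # signals to display to annotators
    window_seconds: float                # length of each time window
    annotation_type: str                 # one of ALLOWED_TYPES
    n_samples: int                       # distinct samples to annotate
    annotators_per_sample: int           # unique annotators required per sample
    required_role: Optional[str] = None  # None means any role may annotate

    def __post_init__(self) -> None:
        # Reject specifications that could never produce a usable dataset.
        if not self.signals:
            raise ValueError("at least one signal is required")
        if self.window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        if self.annotation_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown annotation type: {self.annotation_type}")
        if self.n_samples < 1 or self.annotators_per_sample < 1:
            raise ValueError("sample and annotator counts must be at least 1")

    def total_annotations(self) -> int:
        """Number of individual annotations needed to complete the project."""
        return self.n_samples * self.annotators_per_sample


# Example: ten-second ECG windows, each labelled yes/no by three annotators.
example_project = AnnotationProject(
    signals=("ECG",),
    window_seconds=10.0,
    annotation_type="yes_no",
    n_samples=500,
    annotators_per_sample=3,
)
```

Validating the specification at creation time means a mis-specified project fails before any annotator effort is spent on it.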
Once the project has been created, a list of samples to be shown to users is generated that ensures complete coverage of the research question. By leveraging the speed of the TSC framework, we are able to load the data on demand and display it on a web page with sub-second latency. Users can log in from any device and continue their session. We are able to dynamically change the order in which samples are shown so that the required viable dataset is reached as soon as possible. With such a rich annotation dataset and a framework designed for distributed processing at scale, machine learning analyses run fast enough to test many techniques and iterations in a short period of time. While it is possible to generate a dataset for asking grand-scale questions, such as whether a disease is present, the environment also makes it possible to generate data quality measures or other abstractions from the data, such as pacing classification, deoxygenation, or heart rate variability, with high confidence. These foundational metrics are then stored alongside the raw data and made accessible to researchers, for whom they can be vital building blocks for achieving their research objectives.
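
One way to reorder samples dynamically is to always show a user the sample that is closest to having its required number of annotations. The abstract does not state which ordering rule is used, so the heuristic below is an assumption for illustration, and the function name `next_sample` and its data layout are hypothetical.

```python
from typing import Dict, Optional, Set


def next_sample(
    user: str,
    annotators: Dict[str, Set[str]],
    required: int,
) -> Optional[str]:
    """Pick the next sample id to show `user`, or None if nothing is left.

    `annotators` maps each sample id to the set of users who have already
    annotated it; `required` is the number of unique annotators per sample.
    """
    candidates = [
        (required - len(done), sample_id)
        for sample_id, done in annotators.items()
        # Skip completed samples and samples this user has already annotated.
        if len(done) < required and user not in done
    ]
    if not candidates:
        return None
    # Fewest remaining annotations first; ties are broken by sample id.
    return min(candidates)[1]
```

Prioritising nearly complete samples grows the set of fully annotated samples as quickly as possible, at the cost of delaying first coverage of samples that have not yet been seen.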

Co-Author/Co-Investigator Names/Professional Title: Anirudh Thommandram, B.Eng, M.A.Sc.; Andrew Goodwin, B.Eng; Robert Greer, B.Eng; Dr. Danny Eytan, MD, PhD; Dr. Peter Laussen, MB.BS., FCICM

Funding Acknowledgement: SickKids Foundation: David and Stacey Cynamon Chair in Pediatric Critical Care, The Hospital for Sick Children and University of Toronto.