Automated diagnosis of visually undetectable arrhythmias from EKG graphs using wavelets, principal component analysis, and multinomial logistic regression

Author: Alexander Barrett

Motivation: The electrocardiogram (EKG) is an efficient tool for assessing heart health and diagnosing heart disease. In recent years, machine learning and statistical classification models have proven useful for automatically diagnosing and classifying heart arrhythmias from EKGs. However, many hospitals do not store raw EKG data, which is often held only temporarily within proprietary EKG machines before conversion to graphical printouts. Even electronic records of EKGs often consist only of the graphical printout. This drastically limits the richness of the historical EKG data that can be used to train arrhythmia classification models. To address this, we scrape data directly from clinical EKG printouts for automated classification.
Data Acquisition: We first locate and isolate each lead’s graph, digitize it into coordinate form, and scrape relevant text data from the printouts. This procedure will be validated against public digital EKG datasets (obtained from physionet.org) to ensure that the digitization process introduces minimal error. Feature extraction should be only marginally affected, since the small amount of noise added by digitization will be filtered out when smoothing is applied during preprocessing. We will use this digitized clinical EKG time series data in conjunction with public EKG datasets to automatically classify and diagnose 40 different arrhythmias.
Feature Extraction: We will first normalize the beats within the EKGs, as this will help make feature extraction hardware independent. We will then use a level-6 Symlet5 wavelet as a high-pass filter to denoise the data, using the resulting wavelet coefficients as features. Wavelets are also used for P-wave, QRS complex, and T-wave detection to extract the following features: length and height of P-waves, length of PR segments, percentage of P-waves per QRS complex and vice versa, length and height of QRS complexes, length and (relative) height of ST segments, and length and height of T-waves.
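The beat normalization mentioned under Feature Extraction is not specified further in this abstract. As a minimal sketch, assuming that normalization means resampling each beat to a fixed number of samples (to remove dependence on sampling rate) and standardizing its amplitude (to remove dependence on gain), it might look like the following. The function names `resample_beat` and `normalize_beat` and the default of 200 points are hypothetical choices for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def resample_beat(beat, n_points=200):
    """Linearly interpolate one beat onto n_points evenly spaced samples.

    Assumption: `beat` is a list of amplitude values covering one heartbeat.
    """
    if len(beat) < 2 or n_points < 2:
        raise ValueError("need at least two samples in and out")
    step = (len(beat) - 1) / (n_points - 1)
    out = []
    for i in range(n_points):
        pos = i * step
        lo = min(int(pos), len(beat) - 2)  # left neighbour index
        frac = pos - lo                    # fractional distance to the right neighbour
        out.append(beat[lo] * (1 - frac) + beat[lo + 1] * frac)
    return out


def normalize_beat(beat, n_points=200):
    """Resample a beat to a fixed length, then scale to zero mean and unit variance."""
    resampled = resample_beat(beat, n_points)
    mu = mean(resampled)
    sigma = pstdev(resampled)
    if sigma == 0:
        # A flat segment carries no shape information; return zeros.
        return [0.0] * n_points
    return [(v - mu) / sigma for v in resampled]


# Two recordings of the same triangular beat shape: the second has twice the
# sampling rate and ten times the gain, as might come from different hardware.
beat_a = [0, 1, 2, 1, 0]
beat_b = [0, 5, 10, 15, 20, 15, 10, 5, 0]
norm_a = normalize_beat(beat_a, n_points=5)
norm_b = normalize_beat(beat_b, n_points=5)
```

Under these assumptions the two normalized beats coincide, which is the sense in which normalization would make later features independent of the recording hardware.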
Classification: We will combine the extracted characteristics with relevant outside electronic health record (EHR) data to form our features, and will use the previously made diagnosis as our outcome variable. Principal component analysis (PCA) is then used to reduce the feature space to a maximum of 10 features. These remaining features are fed into a multinomial logistic regression model, which classifies the different arrhythmias; the model is validated with 10-fold cross-validation.
Goals: With these methods, we aim to improve the AUC (area under the ROC curve, a metric commonly used to assess the overall performance of a classification model) beyond that of current methods for automated arrhythmia classification. Improving classification accuracy will help physicians better detect and diagnose arrhythmias that might otherwise elude manual diagnosis because of their subtle graphical presentation.
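The abstract does not say how the AUC is extended from two classes to many arrhythmia classes. One common convention, assumed here purely for illustration, is to compute a one-vs-rest AUC for each class and average the results (macro averaging). The sketch below uses the rank interpretation of the AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function names and the class labels in the example are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` are truthy for the positive class; ties in score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, prob_rows, classes):
    """Average the one-vs-rest AUC over all classes (an assumed convention).

    `prob_rows[i][k]` is the predicted probability that case i belongs to
    `classes[k]`, e.g. the output of a multinomial logistic regression.
    """
    aucs = []
    for k, c in enumerate(classes):
        labels = [y == c for y in y_true]
        scores = [row[k] for row in prob_rows]
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


# Hypothetical three-class example with made-up probabilities.
classes = ["A", "B", "C"]
y_true = ["A", "B", "C", "A"]
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
]
score = macro_ovr_auc(y_true, prob_rows, classes)
```

The pairwise loop is quadratic in the number of cases, which is adequate for a sketch; a rank-based computation would be preferable on large datasets.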

Co-Author/Co-Investigator Names/Professional Title: Alex Barrett (MS), Mohamed Allali (Ph.D.), Cyril Rakovski (Ph.D.)