Boosting the performance of symbolic pattern recognition algorithm by feature selection: A case study on detecting cardiac abnormalities

Author: Oguz Akbilgic

Background and Significance: Atrial Fibrillation (AF) and Congestive Heart Failure (CHF) are among the most common medical conditions associated with significant morbidity in the United States. The American Heart Association estimates that 2.7 to 6.1 million people are affected by AF (paroxysmal, persistent, or permanent) and that about 5.7 million adults have heart failure. Early diagnosis of these cardiac abnormalities, particularly through analysis of electrocardiograms (ECGs) recorded during routine exams, is key to reducing the risk of stroke, heart failure, dementia, and death.

Objective: In this paper, an improved Symbolic Pattern Recognition (SPR) algorithm is applied to ECG and R-R interval series to identify subjects with paroxysmal AF (PAF) and CHF, respectively. The key objective is to find the optimum length of pattern transitions in a series that distinguishes normal subjects from subjects with cardiac abnormalities.

Method: To predict PAF, the SPR algorithm discretizes all ECG recordings using five symbols {a,b,c,d,e}. The recordings comprise 50 from normal subjects (NN), 25 from PAF subjects taken at least 45 min away from any PAF episode (PAFN), and 25 from the same PAF subjects during a PAF episode (PAFE). Pattern transition probability (PTP) matrices are calculated for each discretized ECG for pattern transitions of up to seven symbols (PTPj, j=1, 2,…,7). Next, the seven PTPs of each of the 50 NN and 25 PAFN recordings are compared to the PTPs of the 25 PAFE recordings using a Euclidean distance metric. This yields a 75x7 feature matrix, where rows represent individuals (50 NN and 25 PAFN) and columns represent the pattern length. Similarly, to detect CHF, the R-R interval series from each group (72 NN and 44 CHF) are discretized using eight symbols {a,b,c,d,e,f,g,h}. The similarity between the eight observed PTPs for both groups is computed using a weighted distance metric, providing a 116x8 feature matrix.
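The discretization, PTP, and distance steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the abstract does not state how symbol boundaries are chosen or how the PAFE reference is formed, so equal-frequency binning and a single reference recording are assumptions, and the function names (discretize, ptp, euclidean, feature_row) are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
import math

SYMBOLS = "abcde"  # five-symbol alphabet, as in the PAF study


def discretize(series, symbols=SYMBOLS):
    """Map each sample to a symbol using equal-frequency bins (an assumption)."""
    ranked = sorted(series)
    k = len(symbols)
    # Cut points at the 1/k, 2/k, ... quantiles of this recording.
    cuts = [ranked[(i * len(ranked)) // k] for i in range(1, k)]
    return "".join(symbols[bisect_right(cuts, x)] for x in series)


def ptp(text, j, symbols=SYMBOLS):
    """Probability of each next symbol given the preceding j-symbol pattern."""
    counts = defaultdict(Counter)
    for i in range(len(text) - j):
        counts[text[i:i + j]][text[i + j]] += 1
    return {
        pattern: {s: nxt[s] / sum(nxt.values()) for s in symbols}
        for pattern, nxt in counts.items()
    }


def euclidean(p, q, symbols=SYMBOLS):
    """Euclidean distance between two PTP tables; unseen patterns count as zeros."""
    total = 0.0
    for pattern in set(p) | set(q):
        row_p, row_q = p.get(pattern, {}), q.get(pattern, {})
        total += sum((row_p.get(s, 0.0) - row_q.get(s, 0.0)) ** 2 for s in symbols)
    return math.sqrt(total)


def feature_row(series, reference, max_len=7):
    """One row of the feature matrix: a distance for each pattern length 1..max_len."""
    text, ref = discretize(series), discretize(reference)
    return [euclidean(ptp(text, j), ptp(ref, j)) for j in range(1, max_len + 1)]
```

Calling feature_row once per NN or PAFN recording would give the rows of a 75x7 matrix of the kind described; the CHF study would use eight symbols, max_len=8, and a weighted distance in place of euclidean.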
The mean and standard deviation of the R-R intervals are added as two further features for classifying the normal and CHF groups, resulting in a 116x10 feature matrix.

Results and Discussion: A simple logistic regression classifier is used to distinguish the PAF and CHF subject groups from normal individuals in two independent studies. In the improved SPR, a sequential forward feature selection approach is used to improve the generalization of the classification model and to identify the significant features to be used for classification. Results indicate that the seventh and eighth pattern transition similarity (PTS) features provide most of the information needed to differentiate the PAF and CHF groups, respectively, from the normal groups. Further, the Mann-Whitney U test rejects the null hypothesis of equal medians between the normal and PAF/CHF groups with a p-value <0.00001. The average classification statistics on the test data over five-fold cross-validation using the selected features are accuracy=87%, sensitivity=80%, and specificity=90% for the PAF study, and accuracy=89%, sensitivity=75%, and specificity=97% for the CHF study. Note that the classification accuracy of SPR without feature selection is 82% for the PAF study and 84% for the CHF study, suggesting that the SPR algorithm performs better when feature selection is applied.

Conclusion: The results suggest that metrics from the improved SPR can be used in predictive models to automatically detect cardiac abnormalities with high accuracy.
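The sequential forward feature selection step mentioned above can be illustrated with a generic greedy loop. This is a minimal sketch under stated assumptions rather than the procedure used in the study: forward_select and its min_gain stopping rule are hypothetical, and the score callback stands in for whatever criterion is chosen (for example, the cross-validated accuracy of a logistic regression classifier). The toy_score function uses made-up numbers purely to exercise the loop; they are not results from the paper.

```python
def forward_select(n_features, score, min_gain=0.0):
    """Greedy sequential forward selection.

    score(subset) takes a list of feature indices and returns a number,
    higher being better. Features are added one at a time until no
    remaining candidate improves the score by more than min_gain.
    """
    selected, best = [], float("-inf")
    remaining = list(range(n_features))
    while remaining:
        # Try each remaining feature alongside those already selected.
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score - best <= min_gain:
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


def toy_score(subset):
    """Hypothetical scorer with invented values, for illustration only."""
    gains = {6: 0.80, 2: 0.05, 0: 0.01}
    # Small penalty per feature, so uninformative features are not added.
    return sum(gains.get(f, 0.0) for f in subset) - 0.02 * len(subset)


chosen, final_score = forward_select(7, toy_score)
```

With this toy scorer the loop picks feature index 6 first, then index 2, and stops when a third feature would lower the score. In practice the scorer would be evaluated on held-out folds so that the selected subset reflects generalization rather than training fit.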

Co-Author/Co-Investigator Names/Professional Title:
1. Ruhi Mahajan, PhD, Postdoctoral Fellow at the UTHSC-ORNL Center for Biomedical Informatics and Department of Pediatrics, The University of Tennessee Health Science Center, Memphis, TN, USA
2. Teeradache Viangteeravat, PhD, Associate Professor in the Department of Pediatrics, The University of Tennessee Health Science Center, and Technical Director of Biomedical Informatics Core, Children’s Foundation Research Institute, Le Bonheur Children’s Hospital, Memphis, TN, USA
3. Oguz Akbilgic, PhD, Assistant Professor at the UTHSC-ORNL Center for Biomedical Informatics, Department of Pediatrics, and Department of Preventive Medicine, The University of Tennessee Health Science Center, Memphis, TN, USA