Abstract

Predictive Analytics Pipeline for Clinical Narrative Document Classification

Author: Wei-Hung Weng

Unstructured and semi-structured clinical documents, such as clinical notes and reports, contain a wealth of embedded medical knowledge and are recognized as potential resources for clinical prediction tasks. Despite the increasing need to use clinical narrative documents in prediction tasks, there are few intuitive tools that let researchers or clinicians identify meaningful features and construct their own classification models for specific medical questions. An automated pipeline that combines natural language processing (NLP) and machine learning is a potential solution to this problem. The aim of this study is to develop a generalizable NLP-supervised learning pipeline that facilitates the construction of prediction models for any clinical document classification task. We integrated NLP and machine learning tools to build the pipeline for the general clinical document classification problem. Any clinical corpus in free-text format that is annotated with document-level labels can serve as input to the pipeline. The pipeline outputs a model for future use, along with performance metrics obtained by repeated cross-validation on the input data. All tasks can be performed purely through a command-line interface. The pipeline has four main components: a natural language processor, a supervised learning model constructor, a model evaluator, and a prediction module. The natural language processor extracts meaningful medical concepts from unstructured clinical free text. We used the lexers, parsers, and dictionary lookup modules of the clinical NLP system cTAKES to acquire concepts for further ontology mapping. In addition to a bag-of-words approach on the raw text, we adopted concepts derived from SNOMED-CT, the UMLS Metathesaurus, and the Semantic Network to restrict the concepts to specific semantic types related to the clinical setting.
The output, either words or concepts, can be transformed into a frequency count table, and tf-idf weighting or the paragraph vector method can then be applied to obtain different vector representations. For model construction, users can select among supervised learning algorithms such as regularized logistic regression, support vector machines, and random forests. Repeated n-fold cross-validation is provided to guard against overfitting. Accuracy, precision, recall, and F1 score are used to evaluate model performance. Finally, users can apply the prediction module, together with the constructed model, to new inputs for future document classification and prediction. We have implemented the pipeline and used it for a medical domain classification problem to facilitate the clinical referral process. The advantages of the NLP-supervised learning pipeline include its generalizability, simplicity, and flexibility. The pipeline allows users to input any clinical document and complete the entire language processing and machine learning workflow with a single command in the terminal, with many optional parameters that users can adjust to their specific data and their preferences for concept extraction, feature selection, and learning algorithm. In the future, we plan to deploy the service online and open-source the pipeline for crowdsourced improvement.
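
As an illustration only, the feature weighting and evaluation steps described above can be sketched in plain Python. This is not the pipeline's actual code: the function names (tfidf, repeated_kfold, precision_recall_f1) are hypothetical, the documents are assumed to be already tokenized into words or concepts, and the idf formula log(N / df) is one common variant that may differ from the one the pipeline uses.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Weight term counts by idf = log(N / df); docs is a list of token lists."""
    n_docs = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log(n_docs / df[term]) for term in df}
    return [{term: c[term] * idf[term] for term in c} for c in counts]


def repeated_kfold(n_items, k, repeats, seed=0):
    """Yield (train, test) index lists for k-fold CV, reshuffled on each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy token lists standing in for extracted words or concepts.
    docs = [["chest", "pain"], ["chest", "xray"], ["knee", "pain"]]
    for vector in tfidf(docs):
        print(vector)
    for train, test in repeated_kfold(len(docs), k=3, repeats=2):
        print("train:", train, "test:", test)
```

In a fuller sketch, a classifier would be fit on each training split and scored on the matching test split, with the metrics averaged over all folds and repeats.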