Identifying Clinical Document Similarity with Class-Embedded Neural Concept Embedding Model

Author: Wei-Hung Weng

Abstract

Clinical documents can be categorized not only by their content but also by the medical domain of their author. A document's medical domain can be useful for characterizing the specific clinical problem and for constructing prediction models for clinical machine learning tasks, including diagnosis, prognosis, and clinical referral. Natural language processing (NLP) and machine learning techniques have been extensively applied to clinical problems; however, few studies have reported building NLP and machine learning models that incorporate medical domain tagging. We have developed a neural concept embedding-based method for identifying similarity between clinical documents that incorporates medical domain information, and we will evaluate its performance on a real clinical dataset. The method integrates a clinical NLP system, cTAKES, with a two-layer neural network architecture, the paragraph vector model. We extracted clinically relevant UMLS concepts from both the content of the documents and the corresponding medical domains of the documents' authors. The medical domain concepts were then embedded into each document's concept sequence. Finally, we used the distributed memory algorithm (PV-DM) to train on the document-level, medical domain class-embedded concept sequences. The constructed model can therefore be used for future prediction. We acquired 61,276 clinical notes from Massachusetts General Hospital (MGH) through the Partners HealthCare RPDR data warehouse. The dataset covers the eight medical domains with the highest document counts. Deidentification and clinical concept parsing were performed with the deid software and cTAKES, respectively. The neural concept embedding model was trained with the distributed memory algorithm using a context window of ten words, a 600-dimensional vector representation, and 25 training epochs.
We will evaluate the model using new clinical documents with manually annotated medical domains. Each new document will first be converted to a UMLS concept sequence and then to a neural concept-embedded vector. The vector of the new document will be compared against the training documents by cosine similarity, returning a ranking of the most similar documents. Two clinicians will evaluate the model and its ranking performance as a quality check. This novel unsupervised class-embedded neural concept embedding approach may be useful for medical domain and expert prediction tasks using free-text clinical documents. The next step will be to apply the method to real clinical problems, such as clinical referral or expert recommendation.