Clustering Large-scale Diverse Electronic Medical Records to Aid Annotation for Generic Named Entity Recognition

Nithin Haridas, Carnegie Mellon University, Pittsburgh, Pennsylvania, nithinh@cs.cmu.edu
Yubin Kim, UPMC Enterprises, Pittsburgh, Pennsylvania, kimy10@upmc.edu

ABSTRACT
The full extent of diversity in clinical documents and its effects on natural language processing (NLP) tasks in the medical domain have not been well studied. In supervised NLP tasks, it is vital to have training data that resembles the test data [27]. In the medical domain, this often translates to a uniform subject matter distribution [16]. We have access to a corpus of 157 million documents from 42 different electronic medical record (EMR) vendors, with over 40,000 distinct categories assigned to the documents. The sheer diversity of the documents is an obstacle to accurate sub-sampling of the data for annotation. We propose that clustering clinical text documents is an effective way to aid the annotation effort and ensure coverage. We demonstrate the effect of lack of coverage in training data on a supervised generic named entity recognition (GNER) task and the impact of clustering on the task. We also examine the characteristics of clusters generated from a diverse dataset.

KEYWORDS
clustering, generic named entity recognition, electronic medical records, diversity

This work was presented at the first Health Search and Data Mining Workshop (HSDM 2020) [7].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In 2009, the American Recovery and Reinvestment Act was passed into law, requiring all public and private US healthcare providers to adopt electronic medical records (EMR) by January 1, 2014, and empowering clinical information extraction and NLP research.

However, access to annotated data with comprehensive coverage of subject matter domains remains a major challenge in clinical NLP. Widely available clinical document datasets are often small or cover a narrow slice of the extant types of documents found in EMR systems. For example, the MIMIC dataset consists only of intensive care unit documents [14]. The i2b2/UTHealth 2014 dataset [24] is composed primarily of progress notes and discharge summaries, and consists of 3 categories of patients at various stages of coronary artery disease.

Real-world medical records are very diverse with respect to subject matter content and context [13]. There are lab procedures, consult notes, and x-ray and ultrasound reports in cardiology, pulmonology, and orthopedics, to name a few.
We studied the data repository of a large healthcare provider with data from 42 different electronic medical record (EMR) vendors, containing in excess of 157 million clinical text documents. EMR vendors are also referred to as source systems. The naming convention for the categories assigned to the documents by the source systems is not necessarily consistent. We refer to this assigned category as a document type. The repository contains over 40,000 unique document types.

If we were to sample a set of documents across all the document types, we would end up with 400,000 documents with just 10 documents from each document type. In addition, annotations and their verification are done by subject matter experts. Evidently, this exercise requires substantial resources, both financially and in time.

Generic named entity recognition is used to generate a structured representation of a clinical text document by identifying biomedical concepts in the text. GNER can be used to extract hidden information in a diagnosis [5]. Information processing systems that rely on structured data cannot access such hidden information in clinical texts. The distribution of the biomedical concepts is dictated by the subject matter content of the document [6].

Supervised machine learning based GNER systems require annotated clinical text documents with wide coverage [21] [4] [3]. Coverage in our context simply means that the training data contains a very diverse set of named entities. As we saw above, this is a very challenging job when there are 40,000 categories. We propose that clustering the documents can delineate them into a smaller number of groups, with each group aligned to similar subject matter content.

We examine the effects of inadequate coverage of training data on a generic named entity recognition (GNER) task and how clustering can mitigate some of these effects. Note that our objective is not to classify clinical text into a certain category, but to ensure coverage for the GNER task.

In this work, we want to answer the following research questions.
(1) How does lack of coverage in training data affect a supervised GNER system?
(2) Can we cluster documents such that sampling from every cluster improves coverage?
(3) How do we cluster documents and what are the characteristics of each cluster?

In the following sections, we describe the dataset in further detail, explain the clustering method that we used, present experiments demonstrating the impact of lack of coverage on the GNER task, and finally show the impact of clustering on coverage.
2 RELATED WORK
Classification and clustering of electronic medical documents are relevant in other contexts within the medical domain. They are primarily employed to address problems emanating from the diversity of documents with respect to subject matter content. BioASQ (http://bioasq.org/) organizes a large-scale biomedical semantic indexing task every year to classify PubMed (https://www.ncbi.nlm.nih.gov/pubmed) documents into classes from the MeSH (https://meshb.nlm.nih.gov/search) hierarchy. Weng et al. [28] use a neural network architecture and a linear SVM for document classification on the MGH [17] and iDASH [18] datasets respectively. The MGH dataset includes 3 subdomains (neurology, cardiology, endocrinology). iDASH is annotated with 6 subdomains (cardiology, endocrinology, nephrology, neurology, psychiatry and pulmonary disease).

Clustering has been used by Ling et al. [15] to extract medication and symptom names from the corpus of the i2b2/UTHealth 2014 dataset. Clustering has also been used by Hameed et al. [9] for a drug repositioning task on a composite dataset consisting of 417 drugs and their properties. The drug repositioning task identifies additional uses for certain drugs based on similarities with other drugs in the same cluster.

Features used in classification and clustering tasks in the medical domain frequently include relevant terms extracted with the help of a concept mapper. cTAKES [20], MetaMap [1] and NobleCoder [26] are examples of concept mappers. A concept mapper uses predefined rules to identify medically relevant terms and retrieve a standard representation such as in UMLS [2]. Our work uses NobleCoder for this purpose. Section headers in a clinical document, such as Complaint, Allergy and Summary, have also been used as features in [15] and [8]. We also use section headers. We will see more details of the possible set of features in Section 4.

Tang et al. [25] use clustering-based word representation (WR) features to improve a CRF based model in a biomedical named entity recognition (BNER) task on the BioCreAtIvE II GM [22] and JNLPBA [11] datasets. The BNER task is identical to the GNER task described in Section 5.2.

We employ clustering as a means to ensure diversity in annotated data for downstream tasks. As a result of the large number of document types from many source systems, it is not feasible to sample from every document type. The Document type dataset, as we will see in Section 3.1, has 168 document types. We aim to show that sampling from clusters is a feasible strategy to represent diverse clinical text documents in annotated data. We specifically choose GNER as the downstream task because biomedical concepts are correlated with the relevant subject matter domain [6].
3 DATA
We describe here the corpus of the large healthcare provider to illustrate the scale of the problem. The data corpus was collected with ethical approval. The data processing pipeline anonymizes patient data, and the processed data is stored in a HIPAA compliant environment with restricted access. We mentioned earlier that documents are assigned document types by source systems. Naming document types depends upon 5 axes:

Subject matter domain: e.g. cardiology
Type of service: e.g. consultation
Kind of document: e.g. note, consent
Setting: e.g. hospital, clinic
Role: e.g. attending, consultant

However, the naming conventions are not uniformly applied even within the same system, and source systems might not use all the axes when deciding to assign document types. Because of these incompatible naming conventions, our repository has a very large variety of document types.

The repository has 41,521 document types containing 718,337 documents. The data is diverse and the distribution across subject matter categories is not uniform. There are 32,468 types in IMAGECAST, for example, a source system primarily for radiology. This is because radiology documents are given a different document type based on the relevant body part and the type of the image (X-ray, MRI).

4,458 of the document types had at least 100 documents (common document types) and the remaining types had less than 100 documents per type (sparse document types). Among the 4,458 common document types, radiology notes were grouped based on certain conventions (such as the first 2 letters of their name) into 18 types. The resulting consolidated dataset has 1,296 common document types. We used this dataset to examine high-level characteristics.

The most frequent tokens in the documents within a document type can give a certain idea about the document type. This is illustrated in Table 1.

Table 1: Sample top terms in document types
id   | Term 1 | Term 2    | Term 3    | Category
1907 | foot   | ankle     | incision  | Orthopaedic Surgery
4906 | liver  | hepatitis | cirrhosis | Gastroenterology
106  | artery | femoral   | catheter  | Cardiology

We were able to identify certain patterns among document types, and on closer inspection we found that certain document types were duplicates of each other. The key properties were a small euclidean distance between them in the feature vector space (described in Section 4), a high overlap of top terms within the document types' vocabulary, and similar names for the document types. Some examples are shown in Table 2. Each duplicate pair here belongs to the same source system. However, we cannot rule out a scenario where there are duplicate types across source systems. Clustering the documents can be a way to ensure that these duplicate note types are always grouped together, which can significantly reduce annotation efforts.

Table 2: Duplicate note types in the dataset
Type 1           | Type 2          | Distance | Overlapping top terms
BIDEXASKELNOREM  | BIDEXASKEL      | 0.153    | bone, density, mass
Neonatal_History | CDIDNUM         | 0.09     | infant, birth, delivery
IM_Office_Visit  | FP_Office_Visit | 0.192    | continued, take, encounter
ED Note          | HP_ED_Note      | 0.21     | active, scope, coding

The "GNER dataset" and the "Document type dataset" are subsets of the documents from the 1,296 common document types.
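The duplicate detection just described combines two signals: euclidean distance between document type representations and overlap of their top terms. A minimal sketch of that idea follows, assuming each document type is summarized by the (dense) mean tf-idf vector of its documents as in Section 4; the function names and thresholds are illustrative assumptions, not values from the paper.

    # Sketch: flag candidate duplicate document types by (a) a small euclidean
    # distance between the types' mean tf-idf vectors and (b) a high overlap
    # of their top terms. Thresholds are illustrative only.
    import numpy as np

    def top_terms(centroid, vocab, n=10):
        # n highest-weighted vocabulary terms for one document type
        return {vocab[i] for i in np.argsort(centroid)[-n:]}

    def candidate_duplicates(centroids, names, vocab, max_dist=0.25, min_overlap=3):
        # centroids: (num_types, vocab_size) array of mean tf-idf vectors
        pairs = []
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                dist = np.linalg.norm(centroids[i] - centroids[j])
                shared = top_terms(centroids[i], vocab) & top_terms(centroids[j], vocab)
                if dist <= max_dist and len(shared) >= min_overlap:
                    pairs.append((names[i], names[j], round(float(dist), 3), sorted(shared)))
        return pairs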
3.1 Document type dataset
The Document type dataset is used to find the methodology for clustering and to understand the best features. The resultant parameters are then used to cluster documents for the GNER dataset.

The dataset contains 13,440 documents spanning 30 unique subject matter domains. The subject categories are listed in Appendix A. The Document type dataset's diversity is a very notable aspect. With 13,440 documents, it is smaller than the MIMIC dataset [14] (53,423 documents), but certainly more diverse in terms of subject matter content. To our knowledge, no other work utilizes clustering to tackle diversity in clinical text documents with broad coverage of subject matter content.

Subject matter content for documents in this dataset is informed by the document's document type. We can map many of the document types to a standard representation from the subject matter domain axis of the LOINC Ontology [13]. This mapping is first created by a data analyst and subsequently verified by a subject matter expert. The 13,440 documents belong to 168 document types across the 30 subject categories.

3.2 GNER dataset
Our experiments for the GNER task use a dataset of 2,059 documents. We refer to this as the "GNER dataset". The GNER dataset consists of documents with each token annotated as one of G-B, G-I or O. All generic mentions of medical concepts (e.g. conditions, procedures, labs-observation, medications) are annotated.

We do not have information about the subject categories for these documents. We do, however, know the source systems for the documents. The distribution of source systems in the dataset is shown in Table 3. Documents from different source systems have notable differences in content. COPATH notes are clinical observations and procedures, while IMAGECAST notes are radiology observations and procedures. CERNER and EPIC are typically systems used at the hospital level and contain a mixture of note types such as progress notes, consults and discharge summaries. PROVATION notes are gastroenterology procedures, while VASCUPRO notes are vascular lab reports. MUSE notes consist of cardiology procedures, CVIS notes are related to cardiac imaging, and APOLLO notes are for documenting physical therapy.

Table 3: Distribution of source systems in the clusters for the GNER dataset
(AP: APOLLO, CE: CERNER, CO: COPATH, CV: CVIS, EP: EPIC, IM: IMAGECAST, MU: MUSE, PR: PROVATION, VA: VASCUPRO)
id | AP | CE  | CO  | CV | EP | IM  | MU | PR  | VA
0  | 0  | 6   | 21  | 0  | 0  | 496 | 0  | 0   | 0
1  | 0  | 0   | 0   | 0  | 98 | 0   | 0  | 0   | 0
2  | 0  | 0   | 183 | 0  | 0  | 0   | 0  | 0   | 0
3  | 35 | 529 | 21  | 12 | 2  | 36  | 21 | 7   | 0
4  | 0  | 0   | 0   | 0  | 0  | 0   | 0  | 172 | 0
5  | 0  | 269 | 0   | 0  | 0  | 0   | 0  | 0   | 0
6  | 4  | 3   | 0   | 85 | 0  | 6   | 0  | 0   | 0
4 CLUSTER GENERATION PIPELINE
The motivation behind clustering is that it can create coherent groups of data. We show how closely the clusters are aligned with their subject matter domain. For this analysis, we use the "Document type dataset".

4.1 Purity
Clusters are evaluated based on how coherent they are with respect to a particular subject matter. This is measured using the purity metric. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N, the total number of documents:

    \mathrm{purity}(W, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|

where W = \{w_1, w_2, \ldots, w_K\} is the set of clusters and C = \{c_1, c_2, \ldots, c_J\} is the set of classes.
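As a sketch, the purity defined above can be computed directly from a contingency table of gold classes versus cluster assignments. This assumes scikit-learn is available; it is not code from the paper.

    # Sketch of the purity metric: each cluster votes for its most frequent
    # gold class; purity is the fraction of documents covered by those votes.
    from sklearn.metrics.cluster import contingency_matrix

    def purity(gold_labels, cluster_labels):
        m = contingency_matrix(gold_labels, cluster_labels)  # classes x clusters
        return m.max(axis=0).sum() / m.sum()

    # Example: 5 documents, 2 clusters -> purity 0.8
    print(purity(["cardio", "cardio", "neuro", "neuro", "cardio"],
                 [0, 0, 1, 1, 1]))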
Let us look at the specific steps involved in generating document clusters.

4.2 Preprocessing and tokenizing documents
All documents are preprocessed for stopword removal (stopwords from the nltk corpus, https://www.nltk.org/), filtering out the highest frequency words (top 0.1%) in the corpus, and filtering out words consisting only of numbers. The documents are then tokenized with the nltk word tokenizer. The number of unique tokens in the dataset after preprocessing and tokenizing is approximately 2.7 million.

4.3 Feature generation
4.3.1 N-gram word tokens. The word tokens generated in the previous step were used to generate unigram, bigram and trigram word tokens. Looking at the top terms for documents, they indicate the subject matter domain (e.g. cardiology, neurology) or the kind of document (e.g. note, letter) in most cases. Bigrams and trigrams are included as features because many medical concepts are 2 or 3 words or more; some examples are "intravenous solution" and "atrial situs solitus". We only considered the top 200,000 unigrams based on document frequency. As for bigrams and trigrams, the total counts were 8.8 million and 23.5 million respectively. To limit the size, we put additional restrictions based on corpus frequency (greater than 30).

4.3.2 N-gram word tokens filtered on medical vocabulary. We also restricted the word tokens to only those found in a medical dictionary (github.com/glutanimate/wordlist-medicalterms-en). The medical terms are unigram tokens from two corpora (OpenMedSpel, https://e-medtools.com/, and MTH-Med-Spel-Chek, https://rajn.co/). Bigrams and trigrams are selected only when all the constituent terms are part of the dictionary. This was shown to group documents of the same subject categories even closer: the clustering purity based on unfiltered unigrams on the "Document type dataset" was 0.56, while the purity based on filtered unigrams was 0.63. It also had the considerable advantage of reducing the dimensionality of the features. The unigram count reduced from 200,000 to 17,462; bigram and trigram counts were 100,000 and 70,000 respectively (Table 4).

Table 4: Feature count for multiple categories
Feature         | Count
Unigrams        | 17,462
Bigrams         | 100,000
Trigrams        | 70,000
Section headers | 8,432
Concept tokens  | 27,453

4.3.3 Section headers. Medical records typically have sections that distinguish one type from another. For example, a cytology report contains section headers such as "CLINICAL HISTORY" and "SOURCE OF SPECIMEN", whereas a test report of a lumbar puncture has "PROCEDURE" and "TECHNIQUE". These section headers act as a signature for documents from the same source and can be used to identify format-level features for the documents. We extract section headers from the documents with NobleCoder [26] and tokenize them with the nltk word tokenizer. Tokenizing the section headers helps create unigram, bigram and trigram features from the section headers that are normalized across the documents.

4.3.4 Concepts. The NobleCoder tool [26] is able to map words pertaining to medical concepts into their UMLS [2] representation. In addition, we can restrict the concepts to be only from certain semantic types. With input from knowledge engineers, we only extracted concepts from a particular list of semantic types; for details, refer to Appendix B. Similar to the use of the medical vocabulary, we can reduce our focus to medical concepts only when identifying features.
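A minimal sketch of the preprocessing and n-gram generation in Sections 4.2-4.3, assuming nltk is installed with its stopword and tokenizer data downloaded. The corpus-level frequency cutoffs (top 0.1% removal, document and corpus frequency limits) are omitted for brevity, and the medical dictionary of Section 4.3.2 is assumed to be loaded into a plain set of terms.

    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.util import ngrams

    STOP = set(stopwords.words("english"))

    def tokenize(doc, med_vocab=None):
        # stopword removal and number-only filtering (Section 4.2)
        tokens = [t.lower() for t in word_tokenize(doc)
                  if t.lower() not in STOP and not t.isdigit()]
        if med_vocab is not None:
            # dictionary filter (Section 4.3.2); filtering unigrams first also
            # guarantees every n-gram constituent is in the dictionary
            tokens = [t for t in tokens if t in med_vocab]
        return tokens

    def ngram_features(tokens):
        # unigrams plus bigram and trigram phrases (Section 4.3.1)
        feats = list(tokens)
        for n in (2, 3):
            feats += [" ".join(g) for g in ngrams(tokens, n)]
        return feats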
4.4 Tf-idf matrix for generated features
N-gram tokens, section headers and concepts are combined, and they are weighted by their tf-idf (term frequency - inverse document frequency) [23] values. 3 separate feature matrices are created, corresponding to n-gram features, section headers and concepts respectively.

The feature matrices are then horizontally stacked (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html) in a variety of combinations to examine their effects (Section 5.1). Horizontal stacking simply concatenates the individual feature matrices along the feature (column) axis, so each document keeps a single row. E.g. a 13,440 x 8,432 feature matrix for section headers is combined with a 13,440 x 27,453 concept feature matrix to get a 13,440 x 35,975 combined feature matrix.
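A sketch of this step using scikit-learn vectorizers and scipy.sparse.hstack. The three inputs are assumed to be parallel lists with one whitespace-joined token string per document; the paper does not state that these particular libraries were used for the tf-idf weighting itself.

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    def combined_matrix(ngram_docs, header_docs, concept_docs):
        # one tf-idf matrix per feature family
        blocks = [TfidfVectorizer().fit_transform(docs)
                  for docs in (ngram_docs, header_docs, concept_docs)]
        # concatenate along the feature (column) axis, e.g.
        # (13440 x 8432) + (13440 x 27453) -> (13440 x 35975)
        return hstack(blocks).tocsr()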
4.5 K-means clustering
In K-means clustering (Hartigan and Wong [10]), datapoints are divided into clusters of equal variance. K-means clustering is a fast algorithm, and the speed allows us to iterate quickly over different combinations of features. Document similarity is determined by the euclidean distance between the corresponding feature vectors.

K-means clustering initializes 'k' centroids when generating 'k' clusters. Each centroid will be a particular document in the dataset. In our setting, the centroids are chosen at random for each execution of the algorithm; predetermined centroids did not change the purity values for the generated clusters. Subsequent members for each cluster are chosen when a document is closer to that particular centroid than to the other centroids, and the centroids are then recalculated after adding each member. The algorithm is run multiple times, and we check whether the results are consistent across runs. We tried a wide range of k values from 10 to 140, but the clusters were found to be fairly stable and consistent across all values. We also tried other clustering techniques, and the resultant purity values were comparable to the values obtained from the k-means method.

Table 5 illustrates the content of some of these clusters. Cluster 0 is about sleep medicine and cluster 2 is about x-rays, but clusters 3 and 4 are about consults and office visits.

Table 5: Sample top terms in clusters
id | Term 1    | Term 2      | Term 3
0  | apnea     | events      | rem
1  | ms        | ear         | density
2  | xray-left | vendor      | x-ray right
3  | reference | discharging | client
4  | recorded  | routing     | best practice
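A minimal sketch of this clustering step with scikit-learn's KMeans, which likewise uses euclidean distance; init='random' and n_init mirror the random centroid choice and the repeated runs described above. The paper does not state which implementation was used.

    from sklearn.cluster import KMeans

    def cluster_documents(X, k=30, runs=10, seed=0):
        # X: stacked tf-idf feature matrix, one row per document
        km = KMeans(n_clusters=k, init="random", n_init=runs, random_state=seed)
        return km.fit_predict(X)  # cluster id per document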
5 EXPERIMENTS
5.1 Examining the best features for clustering for subject matter domain
Each document in the Document type dataset corresponds to a sample datapoint for the clustering algorithm. The Document type dataset has 13,440 samples. Each document has a subject matter category assignment, and there are 30 unique subject categories. The number of clusters is chosen as 30 based on the mean silhouette coefficient of all samples. The silhouette coefficient of a sample i in a cluster C_i is defined as

    s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},  if |C_i| > 1
    s(i) = 0,                                       if |C_i| = 1

where a(i) is the mean distance of i to the other points in C_i and b(i) is the smallest mean distance of i to all points in any other cluster, of which i is not a member.

The "closeness" of samples in a cluster can be measured by the cluster's mean silhouette coefficient. A high silhouette coefficient of a sample in a cluster [19] indicates that it is part of a cluster with similar samples and separated from dissimilar samples.

Thus, we use the following parameters for the k-means algorithm: number of runs: 10; number of clusters: 30. The "default" information we have about the document types is their source systems, so our baseline purity, based on clustering by the 24 source systems in the dataset, is 0.44 (Table 6). We use all the features we mentioned in the previous sections, namely unigrams, bigrams, trigrams, section headers, and concept tokens. Each of these features independently outperforms the baseline. On the basis of the experiments, we see that simply using unigrams and bigrams (117,462 features) resulted in the highest values. Using unigrams (17,462 features) on their own or the combination of section headers and concepts (35,975 features) gives comparable purity. Usage of the latter set of features has the added advantage of lower dimensionality, as the corresponding feature vector is much smaller. For the experiments on the GNER dataset, we only used unigram features.

Table 6: Purity comparison between clusters generated from particular features
Features used                | Purity
Random                       | 0.16
Source                       | 0.44
Unigrams                     | 0.63
Section headers              | 0.53
Concept tokens               | 0.57
Unigrams and bigrams         | 0.64
Unigrams, bigrams, trigrams  | 0.61
Unigrams and section headers | 0.60
Section headers and concepts | 0.62
All features                 | 0.59
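A sketch of selecting k by the mean silhouette coefficient, as described above; silhouette_score computes the mean s(i) over all samples, and the candidate range loosely mirrors the 10 to 140 sweep mentioned in Section 4.5. This is an illustration, not the paper's code.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k(X, candidates=range(10, 141, 10)):
        scores = {}
        for k in candidates:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            scores[k] = silhouette_score(X, labels)  # mean s(i) over all samples
        return max(scores, key=scores.get), scores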
5.2 Generic Named Entity Recognition Task
The GNER task is modeled as a sequence labeling problem. Given a word from the document, the task is to predict one of the labels. The labels used are G-B = beginning of an entity, G-I = inside an entity, and O = outside of an entity. For example, consider the input sentence "The patient is suffering from a cardio-vascular disease." The correct predictions would be (The, O), (patient, O), (is, O), (suffering, O), (from, O), (a, O), (cardio, G-B), (vascular, G-I), (disease, G-I), (., O).

GNER is an important step in the NLP pipeline to extract biomedical entities and concepts. The GNER model we use is a Bi-LSTM-CRF based model (Huang et al. [12]). The Bi-LSTM layer has access to past and future tokens at any particular time-step in the token stream; the CRF layer captures sentence-level information. We make no changes to the configuration and features of the Bi-LSTM-CRF model from the version presented in [12].

We use the GNER dataset for this task. Recall that we do not have access to subject matter annotations for documents in the GNER dataset. But we find that, in this dataset, a document's source system is a good indicator of subject matter content (Section 3.2). The lack of coverage in training data is first illustrated using a document's source system as a proxy for subject matter content. We will then show that if we use a document's cluster in place of the source system, GNER system performance suffers in a similar manner. We use the clustering methodology as explained in Section 4. Performance of the GNER model is measured as the F1 score on the test data.

There are 3 parts to the experiments in the GNER task.
(1) Leave one out cross validation based on a document's source system. This identifies the drop in performance for documents in test data that are from source systems unseen in training.
(2) Cluster the documents, then leave one out cross validation based on a document's cluster. We show that performance of GNER on documents in test data that are from clusters unseen in training similarly drops.
(3) Train and test the documents on a per cluster basis. Finally, we examine the GNER performance on documents trained and tested on a per cluster basis.

5.2.1 Effect of source systems on the GNER task. Documents from different source systems belong to different sub-domains, and their formats differ as well. We investigate the effect of source systems on GNER performance with leave one out cross validation on the dataset, based on the source system of the documents. Documents originating from one source system are left out and kept aside as a test set. The remaining documents are shuffled to make a train/dev split in the ratio 9:1. The trained models are tested on the corresponding dev set and the corresponding left-out test set.

Table 7 shows the effect of source systems on GNER performance. Performance of all 9 trained models on the dev set is much better than their performance on the left-out test set. Documents from source systems unseen in training cause a drop in performance. We will refer to this as "the unseen type problem". This indicates that when creating training data, we want to have as broad a coverage as possible. It is not feasible to annotate documents across the overwhelming number of document types that are available. We will see how clustering the documents ensures that the document types being selected are sufficiently diverse.

Table 7: Effects of source systems on the GNER task. F1 score in percentage. Size refers to the size of the test set (taken-out source).
Taken out | Dev F1 | Test F1 | Size
APOLLO    | 44.7   | 8.1     | 39
CERNERH1  | 52.1   | 13.6    | 807
COPATH    | 33.9   | 4.9     | 225
CVIS      | 43     | 27.4    | 97
EPIC      | 46.4   | 24      | 98
IMAGECAST | 36.9   | 5.5     | 538
MUSE      | 44.6   | 0       | 21
PROVATION | 37.3   | 21.2    | 179
VASCUPRO  | 48.7   | 0       | 51
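A minimal sketch of the leave one out protocol used here and again in Section 5.2.2, holding out every document of one group (a source system, or later a cluster id) and shuffling the remainder into the 9:1 train/dev split described above; the function and variable names are ours.

    import random

    def leave_one_out_splits(docs, groups, dev_ratio=0.1, seed=0):
        rng = random.Random(seed)
        for held_out in sorted(set(groups)):
            test = [d for d, g in zip(docs, groups) if g == held_out]
            rest = [d for d, g in zip(docs, groups) if g != held_out]
            rng.shuffle(rest)
            n_dev = int(len(rest) * dev_ratio)
            yield held_out, rest[n_dev:], rest[:n_dev], test  # train, dev, test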
5.2.2 Effect of clustering on the GNER task. For the next part of our experiment, we created 7 clusters from documents in the GNER dataset. Table 3 shows the distribution of the documents' source systems in these 7 clusters. Each of the seven clusters is dominated by documents from a particular system; with the exception of cluster 3 and the source system MUSE, we can almost map a source system to a particular cluster. The features and methodology for clustering are described in detail in Section 4.

We perform leave one out cross validation on the 7 clusters. One cluster is selected and kept aside as the test set. The remaining documents are split into train/dev sets in the ratio 9:1. Then we train the GNER model on the resulting training set and test the model on the dev and test sets. This is repeated for every cluster.

We see that there is a drop-off in test set performance with the exception of cluster 2 (Table 9). The test performance is low for clusters that are strongly dominated by one source with few or no documents of that source in other clusters; these are clusters 1, 4, and 6. For cluster 0, the source systems of the documents in this cluster (CERNER, COPATH and IMAGECAST) are also represented in other clusters. In the case of clusters 3 and 5, the performance is lower (but still higher than 4 other clusters) because they are dominated by documents from CERNER. Even though there are 278 documents from CERNER in the training set, there is a diverse range of documents from that system.

For cluster 2, which is dominated by COPATH documents, there are still 42 COPATH documents in the training data. Furthermore, the documents in COPATH are mostly clinical pathology or observation notes with a limited vocabulary. They also tend to be very dense, with many generic named entity examples to learn from. For instance, when training on a per cluster basis, cluster 2 has 3,857 named entities in the training set from 200 documents; there are only 761 entities from 460 documents in the training set for cluster 0.

Table 9: Effects of clusters on the GNER task
Taken out | Dev F1 (%) | Test F1 (%)
0         | 46.7       | 36
1         | 43.8       | 18.4
2         | 35.8       | 54.6
3         | 44.4       | 23.1
4         | 43.4       | 14.7
5         | 40.4       | 19
6         | 45.5       | 15.1

Going back to the 7 clusters in the leave one out cross validation, we generated a train/test split for each of the clusters. The GNER model was trained and tested on a per cluster basis. Table 10 shows the results for the GNER model when it is trained and tested on documents from each cluster separately. The F1 scores on the test set are better in 3 of the 7 cases, maintained for 2 cases, and drop for clusters 1 and 6.

Table 10: GNER model trained and tested on a per cluster basis
Cluster id | Train F1 (%) | Test F1 (%)
0          | 44.60        | 36.36
1          | 66.09        | 11.86
2          | 71.36        | 78.70
3          | 64.23        | 23.30
4          | 69.29        | 55.93
5          | 66.49        | 32.41
6          | 0            | 0

Cluster 2 has a high performance because of the nature of the clinical pathology documents that constitute the cluster. Cluster 4 is dominated by documents from PROVATION, which shares some of the properties of cluster 2: both clusters have a limited vocabulary of biomedical concepts with a dense distribution within the documents. Cluster 6 has a training and test F1 of 0 because there are only 14 named entities in the entire training set. This brings us to a potential pitfall when training on a per cluster basis: we also need to make sure that each cluster has enough training data. Cluster 1, on the other hand, has enough training data, but it is dominated by documents from the EPIC source system, which, similar to CERNER, is very diverse in terms of subject matter domain. This is the second possible pitfall: we need to choose the features so that clusters are aligned with the subject matter domain as much as possible. CERNER and EPIC documents are grouped together despite the diverse nature of their constituent documents because general hospital notes are closer to each other than, say, radiology notes.

We looked at 5 clusters generated from the GNER dataset and divided each cluster into train and test sets. We trained the Bi-LSTM-CRF model on the training set of each cluster separately and tested it on the corresponding test set from that cluster. We posit that the silhouette coefficient of a cluster is correlated with GNER performance. In Table 8, the average silhouette coefficients of samples in a cluster are compared against the GNER F1 score when trained and tested on a train/test split from the same cluster.

Table 8: Average silhouette score of samples in a cluster vs test F1 score. Pearson coefficient 0.63.
Size | Avg silhouette score | Test F1 (%)
181  | -0.0059              | 27.99
431  | -0.071               | 30
175  | 0.082                | 35.22
796  | 0.143                | 70.23
478  | 0.209                | 55.64

The high Pearson correlation coefficient of 0.63 indicates that well-formed clusters are linearly correlated with higher test GNER performance. We saw in Section 4 that the "closeness" of the samples in a cluster is influenced by their respective subject matter domains.

Despite the caveats mentioned above, when we choose documents for annotation, we can improve test performance by choosing them from as many clusters as possible. This avoids having to annotate the subject matter domain for each document type.

6 CONCLUSION
The diversity of unstructured clinical text documents has been an under-studied problem in clinical NLP. This paper presented initial explorations into a large real-life data repository of 157 million documents across 42 source systems and found that the source systems reported more than 40,000 document types.

Initial explorations of the document types showed that they vary widely in content and format, with significant ramifications for supervised NLP tasks. When a supervised generic named entity detection model was tested on document types that had not been present in the training data, the performance was much lower compared to a model trained on a more diverse training set ("the unseen type problem"). This indicates a need for careful selection of data when annotating to create a training set for a new NLP task in a real world setting; poorly chosen training data will hinder the creation of generalizable models. Ideally, an annotated training set would have coverage over all subject matter content. However, due to the large number of document types available, this is prohibitively expensive. Our study showed that many of the types reported by the systems are actually quite similar, leading us to explore clustering as a method to mitigate the diversity of note types.

We experimented with various features for clustering and found that, to generate clusters along subject matter domains, a combination of unigram and bigram features worked well, providing high purity scores on the "Document type dataset". Since we do not have subject matter annotations for the larger data repository of 40,000 document types, we posit that clusters are a reasonable stand-in to ensure representation.

We showed that clustering captures information that translates into training performance for the GNER task. By clustering the document types into a smaller set of clusters, it is possible to select training data for NLP tasks with good coverage without wasting annotation effort on similar types.

Future work may include utilizing semantic embedding representations of documents and training feature weights for better clustering.
REFERENCES
[1] Alan R. Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229-236.
[2] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267-D270.
[3] Andreea Bodnari, Louise Deléger, Thomas Lavergne, Aurélie Névéol, and Pierre Zweigenbaum. 2013. A supervised named-entity extraction system for medical text. In CLEF.
[4] K. P. Chodey and G. Hu. 2016. Clinical text analysis using machine learning methods. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pages 1-6.
[5] Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760-772.
[6] Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760-772.
[7] Carsten Eickhoff, Yubin Kim, and Ryen White. 2020. Overview of the Health Search and Data Mining (HSDM 2020) Workshop. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), New York, NY, USA. ACM.
[8] K. Ganesan and M. Subotin. 2014. A general supervised approach to segmentation of clinical texts. In 2014 IEEE International Conference on Big Data (Big Data), pages 33-40.
[9] Pathima Nusrath Hameed, Karin Verspoor, Snezana Kusljic, and Saman Halgamuge. 2018. A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinformatics, 19(1):129.
[10] J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100-108.
[11] Ming-Siang Huang, Po-Ting Lai, Richard Tzong-Han Tsai, and Wen-Lian Hsu. 2019. Revised JNLPBA corpus: A revised version of the biomedical NER corpus for the relation extraction task. CoRR, abs/1901.10219.
[12] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
[13] Stanley M. Huff, Roberto A. Rocha, Clement J. McDonald, Georges J. E. De Moor, Tom Fiers, W. Dean Bidgood, Jr., Arden W. Forrey, William G. Francis, Wayne R. Tracy, Dennis Leavelle, Frank Stalling, Brian Griffin, Pat Maloney, Diane Leland, Linda Charles, Kathy Hutchins, and John Baenziger. 1998. Development of the Logical Observation Identifier Names and Codes (LOINC) vocabulary. Journal of the American Medical Informatics Association, 5(3):276-292.
[14] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.
[15] Y. Ling, X. Pan, G. Li, and X. Hu. 2015. Clinical documents clustering based on medication/symptom names using multi-view nonnegative matrix factorization. IEEE Transactions on NanoBioscience, 14(5):500-504.
[16] Georgia McGaughey, W. Patrick Walters, and Brian Goldman. 2016. Understanding covariate shift in model performance. F1000Research, 5:597.
[17] Shawn N. Murphy and Henry C. Chueh. 2002. A security architecture for query tools used to access large biomedical databases. Proceedings, AMIA Symposium, pages 552-556.
[18] Lucila Ohno-Machado, Vineet Bafna, Aziz A. Boxwala, Brian E. Chapman, Wendy W. Chapman, Kamalika Chaudhuri, Michele E. Day, Claudiu Farcas, Nathaniel D. Heintzman, Xiaoqian Jiang, Hyeoneui Kim, Jihoon Kim, Michael E. Matheny, Frederic S. Resnic, Staal A. Vinterbo, and the iDASH team. 2011. iDASH: integrating data for analysis, anonymization, and sharing. Journal of the American Medical Informatics Association, 19(2):196-201.
[19] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65.
[20] Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin Kipper Schuler, and Christopher G. Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507-513.
[21] M. Shekhar, V. R. Chikka, L. Thomas, S. Mandhan, and K. Karlapalem. 2015. Identifying medical terms related to specific diseases. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pages 170-177.
[22] Larry Smith, Lorraine K. Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner, Jr., Lawrence Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña-López, Jacinto Mata, and W. John Wilbur. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2):S2.
[23] Karen Sparck Jones. 1988. A statistical interpretation of term specificity and its application in retrieval. In Peter Willett, editor, Document Retrieval Systems, pages 132-142. Taylor Graham Publishing, London, UK.
[24] Amber Stubbs, Christopher Kotfila, Hua Xu, and Özlem Uzuner. 2015. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2. Journal of Biomedical Informatics, 58(Suppl):S67-S77.
[25] Buzhou Tang, Hongxin Cao, Xiaolong Wang, Qingcai Chen, and Hua Xu. 2014. Evaluating word representation features in biomedical named entity recognition tasks. BioMed Research International, 2014:240403.
[26] Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, and Rebecca S. Jacobson. 2016. NOBLE - flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics, 17(1).
[27] Grace Wahba. 1990. Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
[28] Wei-Hung Weng, Kavishwar B. Wagholikar, Alexa T. McCray, Peter Szolovits, and Henry C. Chueh. 2017. Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach. BMC Medical Informatics and Decision Making, 17(1):155.

A GOLD ANNOTATED SUBJECT MATTER CATEGORIES
Gastroenterology
Surgery
Podiatry
Physical Medicine and Rehabilitation
Pulmonary Medicine
Orthopaedic Surgery
Surgical Oncology
Cardiology
Family Medicine
Allergy
Molecular Genetic Pathology
Anesthesiology
Diagnostic Radiology
Otolaryngology
Neonatal Perinatal Summary
Interventional Radiology
Geriatric Medicine
Nuclear Medicine
Emergency Medicine
Neurology
Endocrinology
Obstetrics and Gynecology
Clinical Pathology
Sleep Medicine
Radiation Oncology
Hematology
Mental Health
Urology
Rheumatology

B SEMANTIC TYPES FOR CONCEPT FILTERING
Health Care Related Organization
Gene or Genome
Congenital Abnormality
Acquired Abnormality
Clinical Drug
Body System
Cell Component
Body Location or Region
Injury or Poisoning
Body Space or Junction
Hazardous or Poisonous Substance
Finding
Laboratory or Test Result
Pathologic Function
Cell
Virus
Therapeutic or Preventive Procedure
Fungus
Mental or Behavioral Dysfunction
Anatomical Abnormality
Bacterium
Neoplastic Process
Body Part, Organ, or Organ Component
Biomedical or Dental Material
Anatomical Structure
Disease or Syndrome
Indicator, Reagent, or Diagnostic Aid
Organic Chemical
Sign or Symptom
Occupation or Discipline
Pharmacologic Substance
Biomedical Occupation or Discipline
Diagnostic Procedure
Social Behavior
Laboratory Procedure
Tissue