Clustering Large-scale Diverse Electronic Medical Records to Aid Annotation for Generic Named Entity Recognition

Nithin Haridas, Carnegie Mellon University, Pittsburgh, Pennsylvania, nithinh@cs.cmu.edu
Yubin Kim, UPMC Enterprises, Pittsburgh, Pennsylvania, kimy10@upmc.edu

ABSTRACT
The full extent of diversity in clinical documents and its effects on natural language processing (NLP) tasks in the medical domain have not been well studied. In supervised NLP tasks, it is vital to have training data that resembles the test data [27]. In the medical domain, this often translates to a uniform subject matter distribution [16]. We have access to a corpus of 157 million documents from 42 different electronic medical record (EMR) vendors, with over 40,000 distinct categories assigned to the documents. The sheer diversity of the documents is an obstacle to accurate sub-sampling of the data for annotation. We propose that clustering clinical text documents is an effective way to aid the annotation effort and ensure coverage. We demonstrate the effect of lack of coverage in training data on a supervised generic named entity recognition (GNER) task and the impact of clustering on the task. We also examine the characteristics of clusters generated from a diverse dataset.

KEYWORDS
clustering, generic named entity recognition, electronic medical records, diversity

This work was presented at the first Health Search and Data Mining Workshop (HSDM 2020) [7].

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
In 2009, the American Recovery and Reinvestment Act was passed into law, requiring all public and private US healthcare providers to adopt electronic medical records (EMR) by January 1, 2014, and empowering clinical information extraction and NLP research.

However, access to annotated data with comprehensive coverage of subject matter domains remains a major challenge in clinical NLP. Widely available clinical document datasets are often small or cover a narrow slice of the extant types of documents found in EMR systems. For example, the MIMIC dataset consists only of intensive care unit documents [14]. The i2b2/UTHealth 2014 dataset [24] is composed primarily of progress notes and discharge summaries, and consists of 3 categories of patients at various stages of coronary artery disease.

Real-world medical records are very diverse with respect to subject matter content and context [13]. There are lab procedures, consult notes, and x-ray and ultrasound reports in cardiology, pulmonology, and orthopedics, to name a few.
We studied the data repository of a large healthcare provider with data from 42 different electronic medical record (EMR) vendors, containing in excess of 157 million clinical text documents. EMR vendors are also referred to as source systems. The naming convention for the categories assigned to the documents by the source systems is not necessarily consistent. We refer to this assigned category as a document type. The repository contains over 40,000 unique document types.

If we were to sample a set of documents across all the document types, we would end up with 400,000 documents with just 10 documents from each document type. In addition, annotations and their verification are done by subject matter experts. Evidently, this exercise requires substantial resources, both financially and in time.

Generic named entity recognition is used to generate a structured representation of a clinical text document by identifying biomedical concepts in the text. GNER can be used to extract hidden information in a diagnosis [5]. Information processing systems that rely on structured data cannot access such hidden information in clinical texts. The distribution of the biomedical concepts is dictated by the subject matter content of the document [6].

Supervised machine learning based GNER systems require annotated clinical text documents with wide coverage [21] [4] [3]. Coverage in our context simply means that the training data contains a very diverse set of named entities. As we saw above, this is a very challenging job when there are 40,000 categories. We propose that clustering the documents can delineate them into a smaller number of groups, with each group aligned to similar subject matter content.

We examine the effects of inadequate coverage of training data on a generic named entity recognition (GNER) task and how clustering can mitigate some of these effects. Note that our objective is not to classify clinical text into a certain category, but to ensure coverage for the GNER task.

In this work, we want to answer the following research questions.
(1) How does lack of coverage in training data affect a supervised GNER system?
(2) Can we cluster documents such that sampling from every cluster improves coverage?
(3) How do we cluster documents and what are the characteristics of each cluster?

In the following sections, we describe the dataset in further detail, explain the clustering method that we used, present experiments demonstrating the impact of lack of coverage on the GNER task, and finally show the impact of clustering on coverage.
2 RELATED WORK
Classification and clustering of electronic medical documents are relevant in other contexts within the medical domain. They are primarily employed to address problems emanating from the diversity of documents with respect to subject matter content. BioASQ (http://bioasq.org/) organizes a large-scale biomedical semantic indexing task every year to classify PubMed (https://www.ncbi.nlm.nih.gov/pubmed) documents into classes from the MeSH (https://meshb.nlm.nih.gov/search) hierarchy. Weng et al. [28] use a neural network architecture and a linear SVM for document classification on the MGH [17] and iDASH [18] datasets respectively. The MGH dataset includes 3 subdomains (neurology, cardiology, endocrinology). iDASH is annotated with 6 subdomains (cardiology, endocrinology, nephrology, neurology, psychiatry and pulmonary disease).

Clustering has been used by Ling et al. [15] to extract medication and symptom names from the corpus of the i2b2/UTHealth 2014 dataset. Clustering has also been used by Hameed et al. [9] for a drug repositioning task on a composite dataset consisting of 417 drugs and their properties. The drug repositioning task identifies additional uses for certain drugs based on similarities with other drugs in the same cluster.

Features used in classification and clustering tasks in the medical domain frequently include relevant terms extracted with the help of a concept mapper. cTAKES [20], MetaMap [1] and NobleCoder [26] are examples of concept mappers. A concept mapper uses predefined rules to identify medically relevant terms and retrieve a standard representation such as in UMLS [2]. Our work uses NobleCoder for this purpose. Section headers in a clinical document, such as Complaint, Allergy and Summary, have also been used as features in [15] and [8]. We also use section headers. We will see more details of the possible set of features in Section 4.

Tang et al. [25] use clustering-based word representation (WR) features to improve a CRF based model in a biomedical named entity recognition (BNER) task on the BioCreAtIvE II GM [22] and JNLPBA [11] datasets. The BNER task is identical to the GNER task described in Section 5.2.

We employ clustering as a means to ensure diversity in annotated data for downstream tasks. As a result of the large number of document types from many source systems, it is not feasible to sample from every document type. The Document type dataset, as we will see in Section 3.1, has 168 document types. We aim to show that sampling from clusters is a feasible strategy to represent diverse clinical text documents in annotated data. We specifically choose GNER as the downstream task because biomedical concepts are correlated with the relevant subject matter domain [6].
3 DATA
We describe here the corpus of the large healthcare provider to illustrate the scale of the problem. The data corpus was collected with ethical approval. The data processing pipeline anonymizes patient data, and the processed data is stored in a HIPAA compliant environment with restricted access. We mentioned earlier that documents are assigned document types by source systems. Naming document types depends upon 5 axes:

Subject matter domain: e.g. cardiology
Type of service: e.g. consultation
Kind of document: e.g. note, consent
Setting: e.g. hospital, clinic
Role: e.g. attending, consultant

However, the naming conventions are not uniformly applied even within the same system, and source systems might not use all the axes when deciding to assign document types. Because of these incompatible naming conventions, our repository has a very large variety of document types.

The repository has 41,521 document types containing 718,337 documents. The data is diverse and the distribution across subject matter categories is not uniform. There are 32,468 types in IMAGECAST, for example, a source system primarily for radiology. This is because radiology documents are given a different document type based on the relevant body part and the type of the image (X-ray, MRI).

4,458 of the document types had at least 100 documents (common document types) and the remaining types had less than 100 documents per type (sparse document types). Among the 4,458 common document types, radiology notes were grouped based on certain conventions (such as the first 2 letters of their name) into 18 types. The resulting consolidated dataset has 1,296 common document types. We used this dataset to examine high-level characteristics.

The most frequent tokens in the documents within a document type can give a certain idea about the document type. This is illustrated in Table 1.

Table 1: Sample top terms in document types
id   | Term 1 | Term 2    | Term 3    | Category
1907 | foot   | ankle     | incision  | Orthopaedic Surgery
4906 | liver  | hepatitis | cirrhosis | Gastroenterology
106  | artery | femoral   | catheter  | Cardiology

We were able to identify certain patterns among document types, and on closer inspection we found that certain document types were duplicates of each other. The key properties were a small euclidean distance between them in the feature vector space (described in Section 4), a high overlap of top terms within the document types' vocabulary, and similar names for the document types. Some examples are shown in Table 2. Each duplicate pair here belongs to the same source system. However, we cannot rule out a scenario where there are duplicate types across source systems. Clustering the documents can be a way to ensure that these duplicate note types are always grouped together, which can significantly reduce annotation efforts.

Table 2: Duplicate note types in the dataset
Type 1           | Type 2          | Distance | Overlapping top terms
BIDEXASKELNOREM  | BIDEXASKEL      | 0.153    | bone, density, mass
Neonatal_History | CDIDNUM         | 0.09     | infant, birth, delivery
IM_Office_Visit  | FP_Office_Visit | 0.192    | continued, take, encounter
ED Note          | HP_ED_Note      | 0.21     | active, scope, coding

The "GNER dataset" and the "Document type dataset" are subsets of the documents from the 1,296 common document types.
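The duplicate detection just described combines two signals: euclidean distance between document type representations and overlap of their top terms. A minimal sketch of that idea follows, assuming each document type is summarized by the (dense) mean tf-idf vector of its documents as in Section 4; the function names and thresholds are illustrative assumptions, not values from the paper.

    # Sketch: flag candidate duplicate document types by (a) a small euclidean
    # distance between the types' mean tf-idf vectors and (b) a high overlap
    # of their top terms. Thresholds are illustrative only.
    import numpy as np

    def top_terms(centroid, vocab, n=10):
        # n highest-weighted vocabulary terms for one document type
        return {vocab[i] for i in np.argsort(centroid)[-n:]}

    def candidate_duplicates(centroids, names, vocab, max_dist=0.25, min_overlap=3):
        # centroids: (num_types, vocab_size) array of mean tf-idf vectors
        pairs = []
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                dist = np.linalg.norm(centroids[i] - centroids[j])
                shared = top_terms(centroids[i], vocab) & top_terms(centroids[j], vocab)
                if dist <= max_dist and len(shared) >= min_overlap:
                    pairs.append((names[i], names[j], round(float(dist), 3), sorted(shared)))
        return pairs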
3.1 Document type dataset
The Document type dataset is used to find the methodology for clustering and to understand the best features. The resultant parameters are then used to cluster documents for the GNER dataset.

The dataset contains 13,440 documents spanning 30 unique subject matter domains. The subject categories are listed in Appendix A. The Document type dataset's diversity is a very notable aspect. With 13,440 documents, it is smaller than the MIMIC dataset [14] (53,423 documents), but certainly more diverse in terms of subject matter content. To our knowledge, no other work utilizes clustering to tackle diversity in clinical text documents with broad coverage of subject matter content.

Subject matter content for documents in this dataset is informed by the document's document type. We can map many of the document types to a standard representation from the subject matter domain axis of the LOINC Ontology [13]. This mapping is first created by a data analyst and subsequently verified by a subject matter expert. The 13,440 documents belong to 168 document types across the 30 subject categories.

3.2 GNER dataset
Our experiments for the GNER task use a dataset of 2,059 documents. We refer to this as the "GNER dataset". The GNER dataset consists of documents with each token annotated as one of G-B, G-I or O. All generic mentions of medical concepts (e.g. conditions, procedures, labs-observation, medications) are annotated.

We do not have information about the subject categories for these documents. We do, however, know the source systems for the documents. The distribution of source systems in the dataset is shown in Table 3. Documents from different source systems have notable differences in content. COPATH notes are clinical observations and procedures, while IMAGECAST notes are radiology observations and procedures. CERNER and EPIC are typically systems used at the hospital level and contain a mixture of note types such as progress notes, consults and discharge summaries. PROVATION notes are gastroenterology procedures, while VASCUPRO notes are vascular lab reports. MUSE notes consist of cardiology procedures, CVIS notes are related to cardiac imaging, and APOLLO notes are for documenting physical therapy.

Table 3: Distribution of source systems in the clusters for the GNER dataset
(AP: APOLLO, CE: CERNER, CO: COPATH, CV: CVIS, EP: EPIC, IM: IMAGECAST, MU: MUSE, PR: PROVATION, VA: VASCUPRO)
id | AP | CE  | CO  | CV | EP | IM  | MU | PR  | VA
0  | 0  | 6   | 21  | 0  | 0  | 496 | 0  | 0   | 0
1  | 0  | 0   | 0   | 0  | 98 | 0   | 0  | 0   | 0
2  | 0  | 0   | 183 | 0  | 0  | 0   | 0  | 0   | 0
3  | 35 | 529 | 21  | 12 | 2  | 36  | 21 | 7   | 0
4  | 0  | 0   | 0   | 0  | 0  | 0   | 0  | 172 | 0
5  | 0  | 269 | 0   | 0  | 0  | 0   | 0  | 0   | 0
6  | 4  | 3   | 0   | 85 | 0  | 6   | 0  | 0   | 0
4 CLUSTER GENERATION PIPELINE
The motivation behind clustering is that it can create coherent groups of data. We show how closely the clusters are aligned with their subject matter domain. For this analysis, we use the "Document type dataset".

4.1 Purity
Clusters are evaluated based on how coherent they are with respect to a particular subject matter. This is measured using the purity metric. To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by N, the total number of documents:

    \mathrm{purity}(W, C) = \frac{1}{N} \sum_{k} \max_{j} |w_k \cap c_j|

where W = \{w_1, w_2, \ldots, w_K\} is the set of clusters and C = \{c_1, c_2, \ldots, c_J\} is the set of classes.
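As a sketch, the purity defined above can be computed directly from a contingency table of gold classes versus cluster assignments. This assumes scikit-learn is available; it is not code from the paper.

    # Sketch of the purity metric: each cluster votes for its most frequent
    # gold class; purity is the fraction of documents covered by those votes.
    from sklearn.metrics.cluster import contingency_matrix

    def purity(gold_labels, cluster_labels):
        m = contingency_matrix(gold_labels, cluster_labels)  # classes x clusters
        return m.max(axis=0).sum() / m.sum()

    # Example: 5 documents, 2 clusters -> purity 0.8
    print(purity(["cardio", "cardio", "neuro", "neuro", "cardio"],
                 [0, 0, 1, 1, 1]))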
Let us look at the specific steps involved in generating document clusters.

4.2 Preprocessing and tokenizing documents
All documents are preprocessed for stopword removal (stopwords from the nltk corpus, https://www.nltk.org/), filtering out the highest frequency words (top 0.1%) in the corpus, and filtering out words consisting only of numbers. The documents are then tokenized with the nltk word tokenizer. The number of unique tokens in the dataset after preprocessing and tokenizing is approximately 2.7 million.

4.3 Feature generation
4.3.1 N-gram word tokens. The word tokens generated in the previous step were used to generate unigram, bigram and trigram word tokens. Looking at the top terms for documents, they indicate the subject matter domain (e.g. cardiology, neurology) or the kind of document (e.g. note, letter) in most cases. Bigrams and trigrams are included as features because many medical concepts are 2 or 3 words or more; some examples are "intravenous solution" and "atrial situs solitus". We only considered the top 200,000 unigrams based on document frequency. As for bigrams and trigrams, the total counts were 8.8 million and 23.5 million respectively. To limit the size, we put additional restrictions based on corpus frequency (greater than 30).

4.3.2 N-gram word tokens filtered on medical vocabulary. We also restricted the word tokens to only those found in a medical dictionary (github.com/glutanimate/wordlist-medicalterms-en). The medical terms are unigram tokens from two corpora (OpenMedSpel, https://e-medtools.com/, and MTH-Med-Spel-Chek, https://rajn.co/). Bigrams and trigrams are selected only when all the constituent terms are part of the dictionary. This was shown to group documents of the same subject categories even closer: the clustering purity based on unfiltered unigrams on the "Document type dataset" was 0.56, while the purity based on filtered unigrams was 0.63. It also had the considerable advantage of reducing the dimensionality of the features. The unigram count reduced from 200,000 to 17,462; bigram and trigram counts were 100,000 and 70,000 respectively (Table 4).

Table 4: Feature count for multiple categories
Feature         | Count
Unigrams        | 17,462
Bigrams         | 100,000
Trigrams        | 70,000
Section headers | 8,432
Concept tokens  | 27,453

4.3.3 Section headers. Medical records typically have sections that distinguish one type from another. For example, a cytology report contains section headers such as "CLINICAL HISTORY" and "SOURCE OF SPECIMEN", whereas a test report of a lumbar puncture has "PROCEDURE" and "TECHNIQUE". These section headers act as a signature for documents from the same source and can be used to identify format-level features for the documents. We extract section headers from the documents with NobleCoder [26] and tokenize them with the nltk word tokenizer. Tokenizing the section headers helps create unigram, bigram and trigram features from the section headers that are normalized across the documents.

4.3.4 Concepts. The NobleCoder tool [26] is able to map words pertaining to medical concepts into their UMLS [2] representation. In addition, we can restrict the concepts to be only from certain semantic types. With input from knowledge engineers, we only extracted concepts from a particular list of semantic types; for details, refer to Appendix B. Similar to the use of the medical vocabulary, we can reduce our focus to medical concepts only when identifying features.
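A minimal sketch of the preprocessing and n-gram generation in Sections 4.2-4.3, assuming nltk is installed with its stopword and tokenizer data downloaded. The corpus-level frequency cutoffs (top 0.1% removal, document and corpus frequency limits) are omitted for brevity, and the medical dictionary of Section 4.3.2 is assumed to be loaded into a plain set of terms.

    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.util import ngrams

    STOP = set(stopwords.words("english"))

    def tokenize(doc, med_vocab=None):
        # stopword removal and number-only filtering (Section 4.2)
        tokens = [t.lower() for t in word_tokenize(doc)
                  if t.lower() not in STOP and not t.isdigit()]
        if med_vocab is not None:
            # dictionary filter (Section 4.3.2); filtering unigrams first also
            # guarantees every n-gram constituent is in the dictionary
            tokens = [t for t in tokens if t in med_vocab]
        return tokens

    def ngram_features(tokens):
        # unigrams plus bigram and trigram phrases (Section 4.3.1)
        feats = list(tokens)
        for n in (2, 3):
            feats += [" ".join(g) for g in ngrams(tokens, n)]
        return feats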
4.4 Tf-idf matrix for generated features
N-gram tokens, section headers and concepts are combined, and they are weighted by their tf-idf (term frequency - inverse document frequency) [23] values. 3 separate feature matrices are created, corresponding to n-gram features, section headers and concepts respectively.

The feature matrices are then horizontally stacked (https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html) in a variety of combinations to examine their effects (Section 5.1). Horizontal stacking simply concatenates the individual feature matrices along the feature (column) axis, so each document keeps a single row. E.g. a 13,440 x 8,432 feature matrix for section headers is combined with a 13,440 x 27,453 concept feature matrix to get a 13,440 x 35,975 combined feature matrix.
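A sketch of this step using scikit-learn vectorizers and scipy.sparse.hstack. The three inputs are assumed to be parallel lists with one whitespace-joined token string per document; the paper does not state that these particular libraries were used for the tf-idf weighting itself.

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    def combined_matrix(ngram_docs, header_docs, concept_docs):
        # one tf-idf matrix per feature family
        blocks = [TfidfVectorizer().fit_transform(docs)
                  for docs in (ngram_docs, header_docs, concept_docs)]
        # concatenate along the feature (column) axis, e.g.
        # (13440 x 8432) + (13440 x 27453) -> (13440 x 35975)
        return hstack(blocks).tocsr()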
4.5 K-means clustering
In K-means clustering (Hartigan and Wong [10]), datapoints are divided into clusters of equal variance. K-means clustering is a fast algorithm, and the speed allows us to iterate quickly over different combinations of features. Document similarity is determined by the euclidean distance between the corresponding feature vectors.

K-means clustering initializes 'k' centroids when generating 'k' clusters. Each centroid will be a particular document in the dataset. In our setting, the centroids are chosen at random for each execution of the algorithm; predetermined centroids did not change the purity values for the generated clusters. Subsequent members for each cluster are chosen when a document is closer to that particular centroid than to the other centroids, and the centroids are then recalculated after adding each member. The algorithm is run multiple times, and we check whether the results are consistent across runs. We tried a wide range of k values from 10 to 140, but the clusters were found to be fairly stable and consistent across all values. We also tried other clustering techniques, and the resultant purity values were comparable to the values obtained from the k-means method.

Table 5 illustrates the content of some of these clusters. Cluster 0 is about sleep medicine and cluster 2 is about x-rays, but clusters 3 and 4 are about consults and office visits.

Table 5: Sample top terms in clusters
id | Term 1    | Term 2      | Term 3
0  | apnea     | events      | rem
1  | ms        | ear         | density
2  | xray-left | vendor      | x-ray right
3  | reference | discharging | client
4  | recorded  | routing     | best practice
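A minimal sketch of this clustering step with scikit-learn's KMeans, which likewise uses euclidean distance; init='random' and n_init mirror the random centroid choice and the repeated runs described above. The paper does not state which implementation was used.

    from sklearn.cluster import KMeans

    def cluster_documents(X, k=30, runs=10, seed=0):
        # X: stacked tf-idf feature matrix, one row per document
        km = KMeans(n_clusters=k, init="random", n_init=runs, random_state=seed)
        return km.fit_predict(X)  # cluster id per document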
5 EXPERIMENTS
5.1 Examining the best features for clustering for subject matter domain
Each document in the Document type dataset corresponds to a sample datapoint for the clustering algorithm. The Document type dataset has 13,440 samples. Each document has a subject matter category assignment, and there are 30 unique subject categories. The number of clusters is chosen as 30 based on the mean silhouette coefficient of all samples. The silhouette coefficient of a sample i in a cluster C_i is defined as

    s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},  if |C_i| > 1
    s(i) = 0,                                       if |C_i| = 1

where a(i) is the mean distance of i to the other points in C_i and b(i) is the smallest mean distance of i to all points in any other cluster, of which i is not a member.

The "closeness" of samples in a cluster can be measured by the cluster's mean silhouette coefficient. A high silhouette coefficient of a sample in a cluster [19] indicates that it is part of a cluster with similar samples and separated from dissimilar samples.

Thus, we use the following parameters for the k-means algorithm: number of runs: 10; number of clusters: 30. The "default" information we have about the document types is their source systems, so our baseline purity, based on clustering by the 24 source systems in the dataset, is 0.44 (Table 6). We use all the features we mentioned in the previous sections, namely unigrams, bigrams, trigrams, section headers, and concept tokens. Each of these features independently outperforms the baseline. On the basis of the experiments, we see that simply using unigrams and bigrams (117,462 features) resulted in the highest values. Using unigrams (17,462 features) on their own or the combination of section headers and concepts (35,975 features) gives comparable purity. Usage of the latter set of features has the added advantage of lower dimensionality, as the corresponding feature vector is much smaller. For the experiments on the GNER dataset, we only used unigram features.

Table 6: Purity comparison between clusters generated from particular features
Features used                | Purity
Random                       | 0.16
Source                       | 0.44
Unigrams                     | 0.63
Section headers              | 0.53
Concept tokens               | 0.57
Unigrams and bigrams         | 0.64
Unigrams, bigrams, trigrams  | 0.61
Unigrams and section headers | 0.60
Section headers and concepts | 0.62
All features                 | 0.59
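A sketch of selecting k by the mean silhouette coefficient, as described above; silhouette_score computes the mean s(i) over all samples, and the candidate range loosely mirrors the 10 to 140 sweep mentioned in Section 4.5. This is an illustration, not the paper's code.

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k(X, candidates=range(10, 141, 10)):
        scores = {}
        for k in candidates:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            scores[k] = silhouette_score(X, labels)  # mean s(i) over all samples
        return max(scores, key=scores.get), scores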
5.2 Generic Named Entity Recognition Task
The GNER task is modeled as a sequence labeling problem. Given a word from the document, the task is to predict one of the labels. The labels used are G-B = beginning of an entity, G-I = inside an entity, and O = outside of an entity. For example, consider the input sentence "The patient is suffering from a cardio-vascular disease." The correct predictions would be (The, O), (patient, O), (is, O), (suffering, O), (from, O), (a, O), (cardio, G-B), (vascular, G-I), (disease, G-I), (., O).

GNER is an important step in the NLP pipeline to extract biomedical entities and concepts. The GNER model we use is a Bi-LSTM-CRF based model (Huang et al. [12]). The Bi-LSTM layer has access to past and future tokens at any particular time-step in the token stream; the CRF layer captures sentence-level information. We make no changes to the configuration and features of the Bi-LSTM-CRF model from the version presented in [12].

We use the GNER dataset for this task. Recall that we do not have access to subject matter annotations for documents in the GNER dataset. But we find that, in this dataset, a document's source system is a good indicator of subject matter content (Section 3.2). The lack of coverage in training data is first illustrated using a document's source system as a proxy for subject matter content. We will then show that if we use a document's cluster in place of the source system, GNER system performance suffers in a similar manner. We use the clustering methodology as explained in Section 4. Performance of the GNER model is measured as the F1 score on the test data.

There are 3 parts to the experiments in the GNER task.
(1) Leave one out cross validation based on a document's source system. This identifies the drop in performance for documents in test data that are from source systems unseen in training.
(2) Cluster the documents, then leave one out cross validation based on a document's cluster. We show that performance of GNER on documents in test data that are from clusters unseen in training similarly drops.
(3) Train and test the documents on a per cluster basis. Finally, we examine the GNER performance on documents trained and tested on a per cluster basis.

5.2.1 Effect of source systems on the GNER task. Documents from different source systems belong to different sub-domains, and their formats differ as well. We investigate the effect of source systems on GNER performance with leave one out cross validation on the dataset, based on the source system of the documents. Documents originating from one source system are left out and kept aside as a test set. The remaining documents are shuffled to make a train/dev split in the ratio 9:1. The trained models are tested on the corresponding dev set and the corresponding left-out test set.

Table 7 shows the effect of source systems on GNER performance. Performance of all 9 trained models on the dev set is much better than their performance on the left-out test set. Documents from source systems unseen in training cause a drop in performance. We will refer to this as "the unseen type problem". This indicates that when creating training data, we want to have as broad a coverage as possible. It is not feasible to annotate documents across the overwhelming number of document types that are available. We will see how clustering the documents ensures that the document types being selected are sufficiently diverse.

Table 7: Effects of source systems on the GNER task. F1 score in percentage. Size refers to the size of the test set (taken-out source).
Taken out | Dev F1 | Test F1 | Size
APOLLO    | 44.7   | 8.1     | 39
CERNERH1  | 52.1   | 13.6    | 807
COPATH    | 33.9   | 4.9     | 225
CVIS      | 43     | 27.4    | 97
EPIC      | 46.4   | 24      | 98
IMAGECAST | 36.9   | 5.5     | 538
MUSE      | 44.6   | 0       | 21
PROVATION | 37.3   | 21.2    | 179
VASCUPRO  | 48.7   | 0       | 51
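A minimal sketch of the leave one out protocol used here and again in Section 5.2.2, holding out every document of one group (a source system, or later a cluster id) and shuffling the remainder into the 9:1 train/dev split described above; the function and variable names are ours.

    import random

    def leave_one_out_splits(docs, groups, dev_ratio=0.1, seed=0):
        rng = random.Random(seed)
        for held_out in sorted(set(groups)):
            test = [d for d, g in zip(docs, groups) if g == held_out]
            rest = [d for d, g in zip(docs, groups) if g != held_out]
            rng.shuffle(rest)
            n_dev = int(len(rest) * dev_ratio)
            yield held_out, rest[n_dev:], rest[:n_dev], test  # train, dev, test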
5.2.2 Effect of clustering on the GNER task. For the next part of our experiment, we created 7 clusters from documents in the GNER dataset. Table 3 shows the distribution of the documents' source systems in these 7 clusters. Each of the seven clusters is dominated by documents from a particular system; with the exception of cluster 3 and the source system MUSE, we can almost map a source system to a particular cluster. The features and methodology for clustering are described in detail in Section 4.

We perform leave one out cross validation on the 7 clusters. One cluster is selected and kept aside as the test set. The remaining documents are split into train/dev sets in the ratio 9:1. Then we train the GNER model on the resulting training set and test the model on the dev and test sets. This is repeated for every cluster.

We see that there is a drop-off in test set performance with the exception of cluster 2 (Table 9). The test performance is low for clusters that are strongly dominated by one source with few or no documents of that source in other clusters; these are clusters 1, 4, and 6. For cluster 0, the source systems of the documents in this cluster (CERNER, COPATH and IMAGECAST) are also represented in other clusters. In the case of clusters 3 and 5, the performance is lower (but still higher than 4 other clusters) because they are dominated by documents from CERNER. Even though there are 278 documents from CERNER in the training set, there is a diverse range of documents from that system.

For cluster 2, which is dominated by COPATH documents, there are still 42 COPATH documents in the training data. Furthermore, the documents in COPATH are mostly clinical pathology or observation notes with a limited vocabulary. They also tend to be very dense, with many generic named entity examples to learn from. For instance, when training on a per cluster basis, cluster 2 has 3,857 named entities in the training set from 200 documents; there are only 761 entities from 460 documents in the training set for cluster 0.

Table 9: Effects of clusters on the GNER task
Taken out | Dev F1 (%) | Test F1 (%)
0         | 46.7       | 36
1         | 43.8       | 18.4
2         | 35.8       | 54.6
3         | 44.4       | 23.1
4         | 43.4       | 14.7
5         | 40.4       | 19
6         | 45.5       | 15.1

Going back to the 7 clusters in the leave one out cross validation, we generated a train/test split for each of the clusters. The GNER model was trained and tested on a per cluster basis. Table 10 shows the results for the GNER model when it is trained and tested on documents from each cluster separately. The F1 scores on the test set are better in 3 of the 7 cases, maintained for 2 cases, and drop for clusters 1 and 6.

Table 10: GNER model trained and tested on a per cluster basis
Cluster id | Train F1 (%) | Test F1 (%)
0          | 44.60        | 36.36
1          | 66.09        | 11.86
2          | 71.36        | 78.70
3          | 64.23        | 23.30
4          | 69.29        | 55.93
5          | 66.49        | 32.41
6          | 0            | 0

Cluster 2 has a high performance because of the nature of the clinical pathology documents that constitute the cluster. Cluster 4 is dominated by documents from PROVATION, which shares some of the properties of cluster 2: both clusters have a limited vocabulary of biomedical concepts with a dense distribution within the documents. Cluster 6 has a training and test F1 of 0 because there are only 14 named entities in the entire training set. This brings us to a potential pitfall when training on a per cluster basis: we also need to make sure that each cluster has enough training data. Cluster 1, on the other hand, has enough training data, but it is dominated by documents from the EPIC source system, which, similar to CERNER, is very diverse in terms of subject matter domain. This is the second possible pitfall: we need to choose the features so that clusters are aligned with the subject matter domain as much as possible. CERNER and EPIC documents are grouped together despite the diverse nature of their constituent documents because general hospital notes are closer to each other than, say, radiology notes.

We looked at 5 clusters generated from the GNER dataset and divided each cluster into train and test sets. We trained the Bi-LSTM-CRF model on the training set of each cluster separately and tested it on the corresponding test set from that cluster. We posit that the silhouette coefficient of a cluster is correlated with GNER performance. In Table 8, the average silhouette coefficients of samples in a cluster are compared against the GNER F1 score when trained and tested on a train/test split from the same cluster.

Table 8: Average silhouette score of samples in a cluster vs test F1 score. Pearson coefficient 0.63.
Size | Avg silhouette score | Test F1 (%)
181  | -0.0059              | 27.99
431  | -0.071               | 30
175  | 0.082                | 35.22
796  | 0.143                | 70.23
478  | 0.209                | 55.64

The high Pearson correlation coefficient of 0.63 indicates that well-formed clusters are linearly correlated with higher test GNER performance. We saw in Section 4 that the "closeness" of the samples in a cluster is influenced by their respective subject matter domains.

Despite the caveats mentioned above, when we choose documents for annotation, we can improve test performance by choosing them from as many clusters as possible. This avoids having to annotate the subject matter domain for each document type.

6 CONCLUSION
The diversity of unstructured clinical text documents has been an under-studied problem in clinical NLP. This paper presented initial explorations into a large real-life data repository of 157 million documents across 42 source systems and found that the source systems reported more than 40,000 document types.

Initial explorations of the document types showed that they vary widely in content and format, with significant ramifications for supervised NLP tasks. When a supervised generic named entity detection model was tested on document types that had not been present in the training data, the performance was much lower compared to a model trained on a more diverse training set ("the unseen type problem"). This indicates a need for careful selection of data when annotating to create a training set for a new NLP task in a real world setting; poorly chosen training data will hinder the creation of generalizable models. Ideally, an annotated training set would have coverage over all subject matter content. However, due to the large number of document types available, this is prohibitively expensive. Our study showed that many of the types reported by the systems are actually quite similar, leading us to explore clustering as a method to mitigate the diversity of note types.

We experimented with various features for clustering and found that, to generate clusters along subject matter domains, a combination of unigram and bigram features worked well, providing high purity scores on the "Document type dataset". Since we do not have subject matter annotations for the larger data repository of 40,000 document types, we posit that clusters are a reasonable stand-in to ensure representation.

We showed that clustering captures information that translates into training performance for the GNER task. By clustering the document types into a smaller set of clusters, it is possible to select training data for NLP tasks with good coverage without wasting annotation effort on similar types.

Future work may include utilizing semantic embedding representations of documents and training feature weights for better clustering.
REFERENCES
[1] Alan R. Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229-236.
[2] Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267-D270.
[3] Andreea Bodnari, Louise Deléger, Thomas Lavergne, Aurélie Névéol, and Pierre Zweigenbaum. 2013. A supervised named-entity extraction system for medical text. In CLEF.
[4] K. P. Chodey and G. Hu. 2016. Clinical text analysis using machine learning methods. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pages 1-6.
[5] Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760-772.
[6] Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760-772.
[7] Carsten Eickhoff, Yubin Kim, and Ryen White. 2020. Overview of the Health Search and Data Mining (HSDM 2020) Workshop. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), New York, NY, USA. ACM.
[8] K. Ganesan and M. Subotin. 2014. A general supervised approach to segmentation of clinical texts. In 2014 IEEE International Conference on Big Data (Big Data), pages 33-40.
[9] Pathima Nusrath Hameed, Karin Verspoor, Snezana Kusljic, and Saman Halgamuge. 2018. A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration. BMC Bioinformatics, 19(1):129.
[10] J. A. Hartigan and M. A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100-108.
[11] Ming-Siang Huang, Po-Ting Lai, Richard Tzong-Han Tsai, and Wen-Lian Hsu. 2019. Revised JNLPBA corpus: A revised version of the biomedical NER corpus for the relation extraction task. CoRR, abs/1901.10219.
[12] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
[13] Stanley M. Huff, Roberto A. Rocha, Clement J. McDonald, Georges J. E. De Moor, Tom Fiers, W. Dean Bidgood, Jr., Arden W. Forrey, William G. Francis, Wayne R. Tracy, Dennis Leavelle, Frank Stalling, Brian Griffin, Pat Maloney, Diane Leland, Linda Charles, Kathy Hutchins, and John Baenziger. 1998. Development of the Logical Observation Identifier Names and Codes (LOINC) vocabulary. Journal of the American Medical Informatics Association, 5(3):276-292.
[14] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.
[15] Y. Ling, X. Pan, G. Li, and X. Hu. 2015. Clinical documents clustering based on medication/symptom names using multi-view nonnegative matrix factorization. IEEE Transactions on NanoBioscience, 14(5):500-504.
[16] Georgia McGaughey, W. Patrick Walters, and Brian Goldman. 2016. Understanding covariate shift in model performance. F1000Research, 5:597.
[17] Shawn N. Murphy and Henry C. Chueh. 2002. A security architecture for query tools used to access large biomedical databases. Proceedings, AMIA Symposium, pages 552-556.
[18] Lucila Ohno-Machado, Vineet Bafna, Aziz A. Boxwala, Brian E. Chapman, Wendy W. Chapman, Kamalika Chaudhuri, Michele E. Day, Claudiu Farcas, Nathaniel D. Heintzman, Xiaoqian Jiang, Hyeoneui Kim, Jihoon Kim, Michael E. Matheny, Frederic S. Resnic, Staal A. Vinterbo, and the iDASH team. 2011. iDASH: integrating data for analysis, anonymization, and sharing. Journal of the American Medical Informatics Association, 19(2):196-201.
[19] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65.
[20] Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin Kipper Schuler, and Christopher G. Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507-513.
[21] M. Shekhar, V. R. Chikka, L. Thomas, S. Mandhan, and K. Karlapalem. 2015. Identifying medical terms related to specific diseases. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pages 170-177.
[22] Larry Smith, Lorraine K. Tanabe, Rie Johnson nee Ando, Cheng-Ju Kuo, I-Fang Chung, Chun-Nan Hsu, Yu-Shi Lin, Roman Klinger, Christoph M. Friedrich, Kuzman Ganchev, Manabu Torii, Hongfang Liu, Barry Haddow, Craig A. Struble, Richard J. Povinelli, Andreas Vlachos, William A. Baumgartner, Jr., Lawrence Hunter, Bob Carpenter, Richard Tzong-Han Tsai, Hong-Jie Dai, Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña-López, Jacinto Mata, and W. John Wilbur. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2):S2.
[23] Karen Sparck Jones. 1988. A statistical interpretation of term specificity and its application in retrieval. In Peter Willett, editor, Document Retrieval Systems, pages 132-142. Taylor Graham Publishing, London, UK.
[24] Amber Stubbs, Christopher Kotfila, Hua Xu, and Özlem Uzuner. 2015. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2. Journal of Biomedical Informatics, 58(Suppl):S67-S77.
[25] Buzhou Tang, Hongxin Cao, Xiaolong Wang, Qingcai Chen, and Hua Xu. 2014. Evaluating word representation features in biomedical named entity recognition tasks. BioMed Research International, 2014:240403.
[26] Eugene Tseytlin, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, and Rebecca S. Jacobson. 2016. NOBLE - flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics, 17(1).
[27] Grace Wahba. 1990. Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
[28] Wei-Hung Weng, Kavishwar B. Wagholikar, Alexa T. McCray, Peter Szolovits, and Henry C. Chueh. 2017. Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach. BMC Medical Informatics and Decision Making, 17(1):155.

A GOLD ANNOTATED SUBJECT MATTER CATEGORIES
Gastroenterology
Surgery
Podiatry
Physical Medicine and Rehabilitation
Pulmonary Medicine
Orthopaedic Surgery
Surgical Oncology
Cardiology
Family Medicine
Allergy
Molecular Genetic Pathology
Anesthesiology
Diagnostic Radiology
Otolaryngology
Neonatal Perinatal Summary
Interventional Radiology
Geriatric Medicine
Nuclear Medicine
Emergency Medicine
Neurology
Endocrinology
Obstetrics and Gynecology
Clinical Pathology
Sleep Medicine
Radiation Oncology
Hematology
Mental Health
Urology
Rheumatology

B SEMANTIC TYPES FOR CONCEPT FILTERING
Health Care Related Organization
Gene or Genome
Congenital Abnormality
Acquired Abnormality
Clinical Drug
Body System
Cell Component
Body Location or Region
Injury or Poisoning
Body Space or Junction
Hazardous or Poisonous Substance
Finding
Laboratory or Test Result
Pathologic Function
Cell
Virus
Therapeutic or Preventive Procedure
Fungus
Mental or Behavioral Dysfunction
Anatomical Abnormality
Bacterium
Neoplastic Process
Body Part, Organ, or Organ Component
Biomedical or Dental Material
Anatomical Structure
Disease or Syndrome
Indicator, Reagent, or Diagnostic Aid
Organic Chemical
Sign or Symptom
Occupation or Discipline
Pharmacologic Substance
Biomedical Occupation or Discipline
Diagnostic Procedure
Social Behavior
Laboratory Procedure
Tissue