<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clustering Large-scale Diverse Electronic Medical Records to Aid Annotation for Generic Named Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nithin Haridas</string-name>
          <email>nithinh@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yubin Kim</string-name>
          <email>kimy10@upmc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UPMC Enterprises</institution>
          ,
          <addr-line>Pittsburgh, Pennsylvania</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The full extent of diversity in clinical documents and its effects on natural language processing (NLP) tasks in the medical domain have not been well studied. In supervised NLP tasks, it is vital to have training data that resembles test data [27]. In the medical domain, this often translates to uniform subject matter distribution [16]. We have access to a corpus of 157 million documents from 42 different electronic medical record (EMR) vendors, with over 40,000 distinct categories assigned to the documents. The sheer diversity of the documents is an obstacle to an accurate sub-sampling of the data for annotation. We propose that clustering clinical text documents is an effective way to aid the annotation effort and ensure coverage. We demonstrate the effect of lack of coverage in training data for a supervised generic named entity recognition (GNER) task and the impact of clustering on the task. We also examine the characteristics of clusters generated from a diverse dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        This work was presented at the first Health Search and Data Mining
Workshop (HSDM 2020) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
      </p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>In 2009, the American Recovery and Reinvestment Act was passed
into law, requiring all public and private US healthcare providers
to adopt electronic medical records (EMR) by January 1, 2014,
empowering clinical information extraction and NLP research.</p>
      <p>
        However, access to annotated data with a comprehensive
coverage of subject matter domains remains a major challenge in clinical
NLP. Widely available clinical document datasets are often small or
are a narrow slice of the extant types of documents found in EMR
systems. For example, the MIMIC dataset consists only of intensive
care unit documents [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The i2b2/UTHealth 2014 dataset [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is
composed primarily of progress notes and discharge summaries.
The dataset consists of 3 categories of patients at various stages of
coronary artery disease.
      </p>
      <p>
        Real world medical records are very diverse with respect to
subject matter content and context [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. There are lab procedures,
consult notes, x-ray and ultrasound reports in cardiology, pulmonology,
orthopedics to name a few.
      </p>
      <p>We studied the data repository of a large healthcare provider
from 42 different electronic medical record (EMR) vendors
containing in excess of 157 million clinical text documents. EMR vendors
are also referred to as source systems. The naming convention
for categories assigned to the documents by the source systems
are not necessarily consistent. We refer to this assigned category
as a document type. The repository contains over 40,000 unique
document types.</p>
      <p>If we were to sample a set of documents across all the
document types, we would end up with 400,000 documents with just 10
documents from each document type. In addition, annotations and
their verification are done by subject matter experts. This
exercise is therefore costly both financially and in time.</p>
      <p>
        Generic named entity recognition is used to generate structured
representation of a clinical text document by identifying biomedical
concepts in the text. GNER can be used to extract hidden
information in a diagnosis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Information processing systems that rely on
structured data cannot access such hidden information in clinical
texts. The distribution of the biomedical concepts is dictated by the
subject matter content of the document [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Supervised machine learning based GNER systems require
annotated clinical text documents with wide coverage [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Coverage in our context simply means that the training data
contains a very diverse set of named entities. As we see above, this is a
very challenging job when there are 40,000 categories. We propose
that clustering the documents can delineate them into a smaller
number of groups with each group aligned to similar subject matter
content.
      </p>
      <p>We examine the effects of inadequate coverage of training data
for a generic named entity recognition (GNER) task and how
clustering can mitigate some of these effects. Note that our objective
is not to classify clinical text into a certain category, but to ensure
coverage for the GNER task.</p>
      <p>In this work, we want to answer the following research questions:
(1) How does lack of coverage in training data affect a supervised
GNER system?
(2) Can we cluster documents such that sampling from every
cluster improves coverage?
(3) How do we cluster documents and what are the
characteristics of each cluster?</p>
      <p>In the following sections, we will describe the dataset in further
detail, explain the clustering method that we used, present experiments
demonstrating the impact of lack of coverage in the GNER task and
finally the impact of clustering on coverage.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        Classification and clustering of electronic medical documents are
relevant in other contexts within the medical domain. They are
primarily employed to address problems emanating from the
diversity of documents based on subject matter content. BioASQ1
organizes a large scale biomedical semantic indexing task every
year to classify PubMed2 documents into classes from the MeSH3
hierarchy. Weng et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] use a neural network architecture and a
linear SVM for document classification in MGH [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and iDASH
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] datasets respectively. The MGH dataset includes 3 subdomains
(neurology, cardiology, endocrinology). iDASH is annotated with
6 subdomains (cardiology, endocrinology, nephrology, neurology,
psychiatry and pulmonary disease).
      </p>
      <p>
        Clustering has been used to extract medication and symptom
names by Ling et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] on the corpus from the i2b2/UTHealth
2014 dataset. Clustering has also been used for a drug repositioning
task on a composite dataset consisting of 417 drugs and their
properties by Hameed et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The drug repositioning task identifies
additional uses for certain drugs based on similarities with other
drugs in the same cluster.
      </p>
      <p>
        Features used in classification and clustering tasks in the medical
domain frequently includes relevant terms extracted with the help
of a concept mapper. cTakes, [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], MetaMap [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and NobleCoder
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] are examples of concept mappers. A concept mapper uses
predefined rules to identify medically relevant terms and retrieve
a standard representation such as in UMLS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our work uses
NobleCoder for this purpose. Section headers in a clinical document
such as Complaint, Allergy and Summary have also been used as
features in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We also use section headers. We describe the possible set of
features in more detail in Section 4.
      </p>
      <p>
        Tang et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] use clustering based word representation (WR)
as features to improve a CRF based model in a biomedical named
entity recognition (BNER) task on the BioCreAtIvE II GM [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and
JNLPBA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] datasets. The BNER task is identical to the GNER task
described in Section 5.2.
      </p>
      <p>
        We employ clustering as a means to ensure diversity in
annotated data in downstream tasks. As a result of the large number
of document types from many source systems, it is not feasible
to sample from every document type. The Document type dataset
as we will see in Section 3.1 has 168 document types. We aim to
show that sampling from clusters is a feasible strategy to represent
diverse clinical text documents in annotated data. We specifically
choose GNER as the downstream task because biomedical concepts
are correlated with the relevant subject matter domain [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>DATA</title>
      <p>We describe here the corpus of the large healthcare provider to
illustrate the scale of the problem. The data corpus is collected
with ethical approval. The data processing pipeline anonymizes
patient data. The processed data is stored in a HIPAA compliant
environment with restricted access. We mentioned earlier that
documents are assigned document types by source systems.</p>
      <p>Naming document types depends upon 5 axes:</p>
      <sec id="sec-4-1">
        <title>Subject matter domain: e.g. cardiology</title>
        <p>Type of service: e.g. consultation
Kind of document: e.g. note, consent
1http://bioasq.org/
2https://www.ncbi.nlm.nih.gov/pubmed
3https://meshb.nlm.nih.gov/search</p>
        <sec id="sec-4-1-1">
          <title>Setting: e.g. hospital, clinic</title>
          <p>Role: e.g. attending, consultant</p>
          <p>However, the naming conventions are not uniformly applied
even within the same system and source systems might not use all
the axes when deciding to assign document types. Because of these
incompatible naming conventions, our repository has a very large
variety of document types.</p>
          <p>The repository has 41,521 document types containing 718,337
documents. The data is diverse and the distribution across subject
matter categories is not uniform. There are 32,468 types in
IMAGECAST, for example, a source system primarily for radiology. This
is because radiology documents have a different document type
based on the relevant body part and the type of the image (X-ray,
MRI).</p>
          <p>4,458 of the document types had at least 100 documents
(common document types) and the remaining types had fewer than 100
documents per type (sparse document types).</p>
          <p>Among the 4,458 common document types, radiology notes were
grouped based on certain conventions (such as first 2 letters of their
name) into 18 types. The resulting consolidated dataset has 1296
common document types. We used this dataset to examine high
level characteristics.</p>
          <p>The most frequent tokens in the documents within a document
type can give a certain idea about the document type. This is
illustrated in Table 1.</p>
          <p>We were able to identify certain patterns among document types
and on closer inspection, we found that certain document types
were duplicates of each other. The key properties were a small
euclidean distance between them in the feature vector space (described
in Section 4), a high overlap of top terms within the document
types’ vocabulary and similar names for the document types. Some
examples are shown in Table 2. Each duplicate pair here belongs to
the same source system. However, we cannot rule out a scenario
where there are duplicate types across source systems. Clustering
the documents can be a way to ensure that these duplicate note
types are always grouped together, which can significantly reduce
annotation efforts.</p>
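Duplicate detection along these lines can be sketched as follows; the distance and overlap thresholds are illustrative assumptions, not values from the paper:

```python
import numpy as np

def likely_duplicates(type_names, vectors, top_terms,
                      dist_thresh=0.2, overlap_thresh=0.8):
    """Flag document-type pairs as likely duplicates when their feature
    vectors are close in euclidean distance AND their top terms overlap
    heavily. Thresholds here are illustrative only."""
    pairs = []
    for i in range(len(type_names)):
        for j in range(i + 1, len(type_names)):
            dist = np.linalg.norm(vectors[i] - vectors[j])
            a, b = set(top_terms[i]), set(top_terms[j])
            overlap = len(a & b) / max(len(a | b), 1)  # Jaccard overlap
            if dist < dist_thresh and overlap > overlap_thresh:
                pairs.append((type_names[i], type_names[j]))
    return pairs
```

A name-similarity check could be layered on top in the same way.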
          <p>The “GNER dataset" and the “Document type dataset" are subsets
of the documents from the 1296 common document types.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Document type dataset</title>
      <p>The Document type dataset is used to find the methodology for
clustering and understand the best features. The resultant parameters
are then used to cluster documents for the GNER dataset.</p>
      <p>
        The dataset contains 13,440 documents spanning across 30 unique
subject matter domains. The subject categories are listed in
Appendix A. The Document type dataset’s diversity is a very notable
aspect. With 13,440 documents, it is smaller than the MIMIC [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
dataset (53,423 documents), but certainly more diverse in terms of
subject matter content. To our knowledge, no other
work utilises clustering to tackle diversity in clinical
text documents with broad coverage of subject matter content.
      </p>
      <p>
        Subject matter content for documents in this dataset is informed
by the document’s document type. We can map many of the
document types to standard representation from the subject matter
domain axis of LOINC Ontology [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This mapping is first created
by a data analyst and subsequently verified by a subject matter
expert. The 13,440 documents belong to 168 document types across
the 30 subject categories.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Most frequent terms for sample document types and their subject categories.</p>
        </caption>
        <table>
          <thead>
            <tr><th>id</th><th>Term 1</th><th>Term 2</th><th>Term 3</th><th>Category</th></tr>
          </thead>
          <tbody>
            <tr><td>1907</td><td>foot</td><td>ankle</td><td>incision</td><td>Orthopaedic Surgery</td></tr>
            <tr><td>4906</td><td>liver</td><td>hepatitis</td><td>cirrhosis</td><td>Gastroenterology</td></tr>
            <tr><td>106</td><td>artery</td><td>femoral</td><td>catheter</td><td>Cardiology</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-6">
      <title>GNER dataset</title>
      <p>Our experiments for the GNER task use a dataset of 2059
documents. We refer to this as the “GNER dataset". The GNER dataset
consists of documents with each token annotated as one of G-B,
G-I or O. All generic mentions of medical concepts (e.g. conditions,
procedures, labs-observation, medications) are annotated.</p>
      <p>We do not have information about the subject categories for
these documents. We do, however, know the source systems for
the documents. The distribution of source systems in the dataset is
shown in Table 3. Documents from different source systems have
notable differences in content. COPATH notes are clinical
observations and procedures while IMAGECAST notes are radiology
observations and procedures. CERNER and EPIC are typically
systems used at the hospital level and contain a mixture of note types such
as progress notes, consults and discharge summaries. PROVATION
notes are gastroenterology procedures while VASCUPRO notes are
vascular lab reports. MUSE notes consist of cardiology procedures,
CVIS notes are related to cardiac imaging and APOLLO notes are
for documenting physical therapy.</p>
    </sec>
    <sec id="sec-7">
      <title>CLUSTER GENERATION PIPELINE</title>
      <p>The motivation behind clustering is that it can create coherent
groups of data. We show how closely the clusters are aligned
according to their subject matter domain. For this analysis, we use
the “Document type dataset".</p>
    </sec>
    <sec id="sec-8">
      <title>Purity</title>
      <p>Clusters are evaluated based on how coherent they are to a
particular subject matter. This is measured using the purity metric. To
compute purity, each cluster is assigned to the class which is most
frequent in the cluster, and then the accuracy of this assignment is
measured by counting the number of correctly assigned documents
and dividing by N, the total number of documents.</p>
      <p>purity(W, C) = (1/N) Σ_k max_j |w_k ∩ c_j|</p>
      <p>where W = {w_1, w_2, w_3, ..., w_K} is the set of clusters and C =
{c_1, c_2, c_3, ..., c_J} is the set of classes.</p>
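The purity computation can be sketched in a few lines; this is a minimal illustration of the metric, not code from the paper:

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Purity: assign each cluster its most frequent class, count the
    correctly assigned documents, and divide by N."""
    assert len(cluster_labels) == len(class_labels)
    n = len(cluster_labels)
    # Group the class labels by cluster.
    by_cluster = {}
    for w, c in zip(cluster_labels, class_labels):
        by_cluster.setdefault(w, []).append(c)
    # Sum the size of the majority class in each cluster.
    majority_total = sum(Counter(cs).most_common(1)[0][1]
                         for cs in by_cluster.values())
    return majority_total / n

# Example: two clusters over six documents.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["cardiology", "cardiology", "radiology",
           "radiology", "radiology", "cardiology"]
print(purity(clusters, classes))  # 4 of 6 majority-class documents -> 0.666...
```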
      <p>Let us look at the specific steps involved in generating document
clusters.</p>
    </sec>
    <sec id="sec-9">
      <title>Preprocessing and tokenizing documents</title>
      <p>All documents are preprocessed for stopword removal (stopwords
from the nltk4 corpus), filtering out the highest frequency words (top
0.1%) in the corpus and words containing only numbers. The
documents are then tokenized with the nltk word tokenizer. The number
of unique tokens in the dataset after preprocessing and tokenizing
is approximately 2.7 million.</p>
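A minimal sketch of this preprocessing pipeline; the paper uses the nltk stopword corpus and word tokenizer, which are stubbed here with a tiny stopword set and a regex tokenizer so the example stays self-contained:

```python
import re
from collections import Counter

# Stand-in for the nltk stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in"}

def preprocess(docs, top_fraction=0.001):
    # Tokenize and lowercase (stand-in for the nltk word tokenizer).
    tokenized = [re.findall(r"[a-z0-9']+", d.lower()) for d in docs]
    # Corpus frequency, used to drop the top 0.1% most frequent words.
    freq = Counter(t for doc in tokenized for t in doc)
    n_top = max(1, int(len(freq) * top_fraction))
    too_common = {w for w, _ in freq.most_common(n_top)}
    keep = lambda t: (t not in STOPWORDS
                      and t not in too_common
                      and not t.isdigit())  # drop number-only tokens
    return [[t for t in doc if keep(t)] for doc in tokenized]
```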
      <p>4.3 Feature generation.</p>
      <p>4.3.1 N-gram word tokens. The word tokens generated in the
previous step were used to generate unigram, bigram and trigram
word tokens. In most cases, the top terms of a document indicate
the subject matter domain (e.g. cardiology, neurology) or the kind
of document (e.g. note, letter). Bigrams and
trigrams are included as features because many medical concepts
span two or more words. Some examples are “intravenous solution"
and “atrial sinus solitus". We only considered the top 200,000
unigrams based on document frequency. As the full sets of bigrams and
trigrams numbered 8.8 million and 23.5 million respectively, we put
additional restrictions based on corpus frequency
(greater than 30) to limit their size.</p>
      <p>4.3.2 N-gram word tokens filtered on medical vocabulary. We also
restricted the word tokens to only those found in a medical
dictionary5. The medical terms are unigram tokens from two corpora
(OpenMedSpel6 and MTH-Med-Spel-Chek7). Bigrams and trigrams
are selected only when all of their constituent terms are part of the
dictionary. This was shown to group documents of the same subject
categories even closer: the clustering purity based on unfiltered
unigrams on the “Document type dataset" was 0.56, while
the purity based on filtered unigrams was 0.63. Filtering also had the
considerable advantage of reducing the dimensionality of the features.
The unigram count reduced from 200,000 to 17,462, and the bigram and
trigram counts were 100,000 and 70,000 respectively (Table 4).</p>
      <p>4.3.3 Section headers. Medical records typically have sections that
distinguish one type from another. For example, a cytology report
contains section headers such as “CLINICAL HISTORY" and “SOURCE
OF SPECIMEN", whereas a test report of a lumbar puncture has
“PROCEDURE" and “TECHNIQUE". These section headers act as a
signature for documents from the same source and can be used to
identify format level features for the documents. We extract section
headers from the documents with NobleCoder [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and tokenize
them with the nltk word tokenizer. Tokenizing the section headers
creates unigram, bigram and trigram features from the section
headers that are normalized across the documents.</p>
      <p>4.3.4 Concepts. The NobleCoder tool [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] is able to map words
pertaining to medical concepts into their UMLS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] representation.
In addition, we can restrict the concepts to be only from certain
semantic types. With input from knowledge engineers, we only
extracted concepts from a particular list of semantic types. For
details, refer to Appendix B. Similar to the use of the medical vocabulary,
we can narrow our focus to medical concepts only when identifying
features.
5github.com/glutanimate/wordlist-medicalterms-en
6https://e-medtools.com/
7https://rajn.co/</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>Dimensionality of each feature type (counts as reported in Sections 4.3 and 4.4).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Feature</th><th>Count</th></tr>
          </thead>
          <tbody>
            <tr><td>Unigrams</td><td>17,462</td></tr>
            <tr><td>Bigrams</td><td>100,000</td></tr>
            <tr><td>Trigrams</td><td>70,000</td></tr>
            <tr><td>Section headers</td><td>8,432</td></tr>
            <tr><td>Concept tokens</td><td>27,453</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-10">
      <title>Tf-idf matrix for generated features</title>
      <p>
        N-gram tokens, section headers and concepts are combined and
they are weighted by their tf-idf (Term frequency - Inverse
document frequency) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] values. 3 separate feature matrices are created
corresponding to n-gram features, section headers and concepts
respectively.
      </p>
      <p>The feature matrices are then horizontally stacked8 in a variety
of combinations to examine their effects (Section 5.1). Horizontal
stacking aligns the rows (documents) of the individual feature matrices
and concatenates their columns. E.g.
a 13,440 x 8,432 feature matrix for section headers is combined with
a 13,440 x 27,453 concept feature matrix to get a 13,440 x 35,975
combined feature matrix.</p>
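The per-view tf-idf matrices and their horizontal stacking can be sketched as follows; the documents, section headers and concept ids are toy stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Toy stand-ins for the three feature views of the same two documents.
docs = ["clinical history chest pain", "source of specimen liver biopsy"]
section_headers = ["clinical history", "source of specimen"]
concepts = ["C0008031", "C0005558"]  # hypothetical UMLS concept ids

# One tf-idf matrix per feature view.
ngram_m = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)
header_m = TfidfVectorizer().fit_transform(section_headers)
concept_m = TfidfVectorizer().fit_transform(concepts)

# Horizontal stacking: rows (documents) stay aligned, columns concatenate.
combined = hstack([ngram_m, header_m, concept_m]).tocsr()
print(combined.shape)  # (2, ngram_cols + header_cols + concept_cols)
```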
    </sec>
    <sec id="sec-11">
      <title>K-means clustering</title>
      <p>
        In K-means clustering (Hartigan and Wong [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), datapoints are
divided into clusters of equal variance. K-means clustering is a fast
algorithm and the speed allows us to iterate quickly based on
different combinations of features. Document similarity is determined
based on the euclidean distance between the corresponding feature
vectors.
      </p>
      <p>K-means clustering initializes ‘k’ centroids when generating ‘k’
clusters. Each initial centroid is a particular document in the dataset.
In our setting, the centroids are chosen at random for each
execution of the algorithm. Predetermined centroids did not change the
purity values for the generated clusters. Each remaining document is
then assigned to the cluster whose centroid is closest to it, and the
centroids are recalculated
after adding each member. The algorithm is run multiple times and
we check whether the results are consistent across each run. We
tried a wide range of k-values from 10 to 140, and the clusters were
found to be fairly stable and consistent across all values. We also
tried other clustering techniques and the resultant purity values
were comparable to values obtained from the k-means method.</p>
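A minimal sketch of this clustering step on toy documents, with feature construction reduced to plain tf-idf for brevity; scikit-learn's `n_init` re-runs the algorithm with fresh random centroids and keeps the lowest-inertia result, mirroring the repeated runs described above:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents: two "chest x-ray" notes and two "sleep study" notes.
docs = ["chest xray two views", "chest xray single view",
        "sleep study report", "sleep study summary"]
X = TfidfVectorizer().fit_transform(docs)

# Random centroid initialization with 10 restarts; document similarity
# is euclidean distance between the tf-idf feature vectors.
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # documents 0-1 and 2-3 should land in the same cluster
```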
      <p>Table 5 illustrates the content for some of these clusters. Cluster
0 is about sleep medicine and cluster 2 is about x-rays. But clusters
3 and 4 are about consults and office visits.</p>
    </sec>
    <sec id="sec-12">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-13">
      <title>Examining the best features for clustering for subject matter domain</title>
      <p>Each document in the Document type dataset corresponds to a
sample datapoint for the clustering algorithm. The Document type
dataset has 13,440 samples. Each document has a subject matter
category assignment and there are 30 unique subject categories. The
number of clusters is chosen as 30 based on the mean silhouette
coefficient of all samples. The silhouette coefficient of a sample i in a
cluster Ci is defined as</p>
      <p>s(i) = (b(i) − a(i)) / max(a(i), b(i)), if |Ci| &gt; 1
s(i) = 0, if |Ci| = 1</p>
      <p>where a(i) is the mean distance of i to the other points in Ci and b(i) is
the smallest mean distance of i to all points in any other cluster, of
which i is not a member.
8https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html</p>
      <p>
        The “closeness" of samples in a cluster can be measured by a
cluster’s mean silhouette coefficient. A high silhouette coefficient of
a sample in a cluster [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] indicates that it is a part of a cluster with
similar samples and separated from dissimilar samples.
      </p>
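Choosing the number of clusters by the mean silhouette coefficient can be sketched as follows; the documents are toy stand-ins for the 13,440-sample Document type dataset:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

docs = ["chest xray two views", "chest xray lateral view",
        "sleep study report", "sleep study summary",
        "colonoscopy procedure note", "colonoscopy biopsy note"]
X = TfidfVectorizer().fit_transform(docs)

# Pick k by the mean silhouette coefficient over all samples.
best_k, best_s = None, -1.0
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_score(X, labels)  # mean s(i) over all samples
    if s > best_s:
        best_k, best_s = k, s
print(best_k)
```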
      <p>Thus, we use the following parameters for the k-means
algorithm: number of runs: 10; number of clusters: 30. The “default"
information we have about the document types is their source
systems. So our baseline purity based on clustering by the 24 source
systems in the dataset is 0.44 (Table 6). We use all the features we
mentioned in previous sections, namely, unigrams, bigrams,
trigrams, section headers, and concept tokens. Each of these features
independently outperforms the baseline. On the basis of the
experiments, we see that simply using unigrams and bigrams (117,462
features) resulted in the highest values. Using unigrams (17,462) on
their own or the combination of section headers and concepts (35,975)
has comparable purity. The latter set of features has the
added advantage of lower dimensionality as the corresponding
feature vector is much smaller. For the experiments on the GNER
dataset, we only used unigram features.</p>
      <sec id="sec-13-1">
        <title>GNER task</title>
        <p>The GNER task is modeled as a sequence labeling problem. Given a
word from the document, the task is to predict one of the labels. The
labels used are G-B = beginning of an entity, G-I = inside an entity,
and O = outside of an entity. For example, consider the input sentence,
"The patient is suffering from a cardio-vascular disease." The
correct predictions would be (The, O), (patient, O), (is, O), (suffering, O),
(from, O), (a, O), (cardio, G-B), (vascular, G-I), (disease, G-I), (., O).</p>
        <p>Table 7 reports leave one out results with each of the source systems
APOLLO, CERNERH1, COPATH, CVIS, EPIC, IMAGECAST, MUSE,
PROVATION and VASCUPRO taken out in turn.</p>
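The labeling scheme above can be sketched as a small helper; the helper name and the span representation (token-index pairs, end exclusive) are illustrative:

```python
def bio_labels(tokens, entity_spans):
    """Label each token G-B/G-I/O given entity spans as
    (start, end) token-index pairs with exclusive end."""
    labels = ["O"] * len(tokens)
    for start, end in entity_spans:
        labels[start] = "G-B"              # beginning of an entity
        for i in range(start + 1, end):
            labels[i] = "G-I"              # inside an entity
    return labels

tokens = ["The", "patient", "is", "suffering", "from", "a",
          "cardio", "vascular", "disease", "."]
# One entity: "cardio vascular disease" (tokens 6-8).
print(list(zip(tokens, bio_labels(tokens, [(6, 9)]))))
```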
        <p>
          GNER is an important step in the NLP pipeline to extract
biomedical entities and concepts. The GNER model we use is a
Bi-LSTM-CRF based model (Huang et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]). The Bi-LSTM layer has access
to past and future tokens at any particular time-step in the
token stream. The CRF layer captures sentence level information.
We make no changes to the configuration and features for the
Bi-LSTM-CRF model from the version presented in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>We use the GNER dataset for this task. Recall that we do not
have access to subject matter annotations for documents in the
GNER dataset. But we find that, in this dataset, the document’s
source system is a good indicator of subject matter content (Section
3.2). The lack of coverage in training data is first illustrated using
document’s source system as a proxy for subject matter content.
We will then show that if we use a document’s cluster in place of
the source system, GNER system performance suffers in a similar
manner. We also use the clustering methodology as explained in
Section 4. Performance of the GNER model is measured as the F1
score on the test data.</p>
        <p>There are 3 parts to the experiments in the GNER task.
(1) Leave one out cross validation based on the document’s
source system. This identifies the drop in performance for
documents in test data that are from source systems unseen
in training.
(2) Cluster the documents, then leave one out cross validation
based on the document’s cluster. We show that performance
of GNER on documents in test data that are from unseen
clusters in training similarly drops.
(3) Train and test the documents on a per cluster basis.</p>
        <p>Finally, we examine the GNER performance on documents
trained and tested on a per cluster basis.
5.2.1 Effect of source systems on the GNER task. Documents in
different source systems belong to different sub-domains and their
format is different as well. We investigate the effect of source
systems on GNER performance with leave one out cross validation
on the dataset, based on the source system of the documents.
Documents originating from one source system are left out and kept
aside as a test set. The remaining documents are shuffled to make a
train/dev split in the ratio 9:1. The trained models are tested on the
corresponding dev set and the corresponding left out test set.</p>
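The leave-one-source-out protocol can be sketched as follows; the 9:1 train/dev ratio is from the paper, while the function name and document representation are illustrative:

```python
import random

def leave_one_source_out(docs, sources, held_out, seed=0):
    """Hold out all documents from one source system as the test set;
    shuffle the rest into a 9:1 train/dev split."""
    test = [d for d, s in zip(docs, sources) if s == held_out]
    rest = [d for d, s in zip(docs, sources) if s != held_out]
    rng = random.Random(seed)
    rng.shuffle(rest)
    cut = int(0.9 * len(rest))
    return rest[:cut], rest[cut:], test  # train, dev, test
```

Repeating this once per source system (or per cluster, in the second experiment) yields the cross-validation folds.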
        <p>Table 7 shows the effect of source systems on GNER performance.
Performance of all 8 trained models on the dev set is much better
than their performance on the left out test set. Documents from
source systems unseen in training cause a drop in performance. We
will refer to this as “the unseen type problem". This indicates that
when creating training data, we want to have as broad a coverage
as possible. It is not feasible to annotate documents across the
overwhelming number of document types that are available. We
will see how clustering the documents ensures that the document
types being selected are sufficiently diverse.
5.2.2 Effect of clustering on the GNER task. For the next part of our
experiment, we created 7 clusters from documents in the GNER
dataset. Table 3 shows the distribution of the documents’ source
systems in these 7 clusters. Each of the seven clusters is dominated
by documents from a particular system. With the exception of
cluster 3 and the source system MUSE, we can almost map a source
system to a particular cluster. The features and methodology for
clustering are described in detail in Section 4. We perform leave one
out cross validation on the 7 clusters. One cluster is selected and
kept aside as the test set. The remaining documents are split
into train/dev sets in the ratio 9:1. Then we train
the GNER model on the resulting training set and test the model
on the dev set and test sets. This is repeated for every cluster.</p>
        <p>We see that there is a drop in test set performance with the
exception of cluster 2 (Table 9). The test performance is low for
clusters that are strongly dominated by one source with few or no
documents from that source in other clusters. These are clusters
1, 4, and 6. For cluster 0, the source systems of documents in this
cluster, CERNER, COPATH and IMAGECAST, are also represented
in other clusters. In the case of clusters 3 and 5, the performance
is lower (but still higher than 4 other clusters) because they are
dominated by documents from CERNER. Even though there are 278
documents from CERNER in the training set, there is a diverse
range of documents from that system.</p>
        <p>For cluster 2, which is dominated by COPATH documents, there
are still 42 documents in training data. Furthermore, the documents
in COPATH are mostly clinical pathology or observation notes with
a limited vocabulary. They also tend to be very dense with many
generic named entity examples to learn from. For instance, when
training on a per cluster basis, cluster 2 has 3857 named entities in
the training set from 200 documents. There are only 761 entities
from 460 documents in the training set for cluster 0.</p>
        <p>Returning to the 7 clusters from the leave-one-out cross validation,
we generated a train/test split for each cluster. The GNER
model was trained and tested on a per-cluster basis. Table 10 shows
the results for the GNER model when it is trained and tested on
documents from each cluster separately. The F1 scores on the test
set improve in 3 of the 7 cases, are maintained in 2 cases, and drop
for clusters 1 and 6.</p>
        <p>Cluster 2 performs well because of the nature of the
clinical pathology documents that constitute the cluster. Cluster 4
is dominated by documents from PROVATION, which shares
some of the properties of cluster 2: both clusters have a limited
vocabulary of biomedical concepts with a dense distribution within
the documents. Cluster 6 has a training and test F1 of 0 because
there are only 14 named entities in its entire training set. This
points to a potential pitfall when training on a per-cluster basis:
we need to make sure that each cluster has enough training
data. Cluster 1, on the other hand, has enough training data, but it
is dominated by documents from the EPIC source system, which,
like CERNER, is very diverse in terms of subject matter domain.
This is the second possible pitfall: we need to choose the features
so that clusters are aligned on the subject matter domain as much
as possible. CERNER and EPIC documents are grouped together
despite the diverse nature of their constituent documents because general
hospital notes are closer to each other than, say, radiology notes.</p>
        <p>We looked at 5 clusters generated from the GNER dataset and divided
each cluster into train and test sets. We trained the Bi-LSTM-CRF
model on the training set of each cluster separately and tested it
on the corresponding test set from that cluster. We posit that the
silhouette coefficient of a cluster is correlated with the GNER
performance. In Table 8, the average silhouette coefficients of the samples
in a cluster are compared against the GNER F1 score when trained
and tested on a train/test split from the same cluster.</p>
        <p>The high Pearson correlation coefficient of 0.63 indicates that
well-formed clusters are linearly associated with higher
test GNER performance. We saw in Section 4 that the “closeness”
of the samples in a cluster is influenced by their respective subject
matter domains.</p>
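The two quantities compared in Table 8 can be made concrete with a small sketch. The functions below are pure-Python stand-ins for scikit-learn's `silhouette_samples` and scipy's `pearsonr`, shown only to illustrate the computation, not the paper's actual code.

```python
import math

def mean_silhouette(points, labels, dist):
    """Average silhouette coefficient per cluster [19].
    For each sample, a = mean distance to its own cluster, b = smallest
    mean distance to another cluster, s = (b - a) / max(a, b)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    per_cluster = {l: [] for l in clusters}
    for p, l in zip(points, labels):
        a = sum(dist(p, q) for q in clusters[l] if q is not p) / max(len(clusters[l]) - 1, 1)
        b = min(sum(dist(p, q) for q in ds) / len(ds)
                for m, ds in clusters.items() if m != l)
        per_cluster[l].append((b - a) / max(a, b))
    return {l: sum(s) / len(s) for l, s in per_cluster.items()}

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Feeding the per-cluster mean silhouettes and the per-cluster test F1 scores into `pearson` yields the correlation reported above.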
        <p>Despite the caveats mentioned above, when we choose
documents for annotation, we can improve test performance by choosing
them from as many clusters as possible. This avoids having to
annotate the subject matter domain for each document type.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION</title>
      <p>The diversity of unstructured clinical text documents has been an
under-studied problem in clinical NLP. This paper presented initial
explorations into a large real-life data repository of 157 million
documents across 42 source systems and found that the source
systems reported more than 40,000 document types.</p>
      <p>Initial explorations of the document types showed that they vary
widely in content and format, with significant ramifications on
supervised NLP tasks. When a supervised generic named entity
detection model was tested on document types that had not been
present in the training data, the performance is much lower
compared to a model trained on a more diverse training set (“the unseen
type problem"). This indicates a need for careful selection of data
when annotating to create a training set for a new NLP task in a real
world setting; poorly chosen training data will hinder the creation
of generalizable models. Ideally, an annotated training set would
have coverage over all subject matter content. However, due to the
large number of document types available, this is prohibitively
expensive. Our study showed that many of the types reported by the
systems are actually quite similar, leading us to explore clustering
as a method to mitigate diversity of note types.</p>
      <p>We experimented with various features for clustering and found
that to generate clusters along subject matter domains, a
combination of unigram and bigram features worked well, providing
high purity scores on the “Document types dataset”. Since we do
not have subject matter annotations for the larger data repository
of 40,000 document types, we posit that clusters are a reasonable
stand-in to ensure representation.</p>
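As a reminder of the metric used above, purity rewards clusters whose members share a gold subject-matter label; the sketch below assumes simple label lists and is an illustration, not the paper's evaluation code.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Cluster purity: each cluster votes for its majority gold class;
    purity is the fraction of all samples covered by those majority classes."""
    clusters = {}
    for c, g in zip(cluster_labels, class_labels):
        clusters.setdefault(c, []).append(g)
    majority_total = sum(Counter(gs).most_common(1)[0][1] for gs in clusters.values())
    return majority_total / len(class_labels)
```

A purity of 1.0 means every cluster is homogeneous with respect to the gold subject-matter categories.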
      <p>We showed that clustering captures information that helps
training performance translate to test performance for the GNER task. By clustering the
document types into a smaller set of clusters, it is possible to select
training data for NLP tasks with good coverage without wasting
annotation effort on similar types.</p>
      <p>Future work may include utilizing semantic embedding
representations of documents and training feature weights for better
clustering.</p>
    </sec>
    <sec id="sec-15">
      <title>GOLD ANNOTATED SUBJECT MATTER CATEGORIES</title>
      <sec id="sec-16-1">
        <title>Gastroenterology</title>
      </sec>
      <sec id="sec-16-2">
        <title>Surgery</title>
      </sec>
      <sec id="sec-16-3">
        <title>Podiatry</title>
      </sec>
      <sec id="sec-16-4">
        <title>Pulmonary Medicine</title>
      </sec>
      <sec id="sec-16-5">
        <title>Orthopaedic surgery</title>
      </sec>
      <sec id="sec-16-6">
        <title>Surgical Oncology</title>
      </sec>
      <sec id="sec-16-7">
        <title>Cardiology</title>
      </sec>
      <sec id="sec-16-8">
        <title>Family Medicine</title>
      </sec>
      <sec id="sec-16-9">
        <title>Allergy</title>
      </sec>
      <sec id="sec-16-10">
        <title>Physical Medicine and Rehabilitation</title>
      </sec>
      <sec id="sec-16-11">
        <title>Molecular Genetic Pathology</title>
      </sec>
      <sec id="sec-16-12">
        <title>Anesthesiology</title>
      </sec>
      <sec id="sec-16-13">
        <title>Diagnostic Radiology</title>
      </sec>
      <sec id="sec-16-14">
        <title>Otolaryngology</title>
      </sec>
      <sec id="sec-16-15">
        <title>Neonatal perinatal summary</title>
      </sec>
      <sec id="sec-16-16">
        <title>Interventional Radiology</title>
      </sec>
      <sec id="sec-16-17">
        <title>Geriatric Medicine</title>
      </sec>
      <sec id="sec-16-18">
        <title>Nuclear Medicine</title>
      </sec>
      <sec id="sec-16-19">
        <title>Emergency Medicine</title>
      </sec>
      <sec id="sec-16-20">
        <title>Neurology</title>
      </sec>
      <sec id="sec-16-21">
        <title>Endocrinology</title>
      </sec>
      <sec id="sec-16-22">
        <title>Obstetrics and Gynecology</title>
      </sec>
      <sec id="sec-16-23">
        <title>Clinical Pathology</title>
      </sec>
      <sec id="sec-16-24">
        <title>Sleep Medicine</title>
      </sec>
      <sec id="sec-16-25">
        <title>Radiation Oncology</title>
      </sec>
      <sec id="sec-16-26">
        <title>Hematology</title>
      </sec>
      <sec id="sec-16-27">
        <title>Mental Health</title>
      </sec>
      <sec id="sec-16-28">
        <title>Urology</title>
      </sec>
      <sec id="sec-16-29">
        <title>Rheumatology</title>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>SEMANTIC TYPES FOR CONCEPT FILTERING</title>
      <sec id="sec-18-1">
        <title>Injury or Poisoning</title>
      </sec>
      <sec id="sec-18-2">
        <title>Body Space or Junction</title>
      </sec>
      <sec id="sec-18-3">
        <title>Finding</title>
      </sec>
      <sec id="sec-18-4">
        <title>Cell</title>
      </sec>
      <sec id="sec-18-5">
        <title>Virus</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Alan R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          and
          <string-name>
            <given-names>François-Michel</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>An overview of MetaMap: historical perspective and recent advances</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ):
          <fpage>229</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>32</volume>
          :
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Andreea</given-names>
            <surname>Bodnari</surname>
          </string-name>
          , Louise Deléger, Thomas Lavergne, Aurélie Névéol, and
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A supervised named-entity extraction system for medical text</article-title>
          .
          <source>In CLEF.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Chodey</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Clinical text analysis using machine learning methods</article-title>
          .
          <source>In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Dina</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          , Wendy W. Chapman, and
          <string-name>
            <surname>Clement J. McDonald</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>What can natural language processing do for clinical decision support?</article-title>
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>42</volume>
          (
          <issue>5</issue>
          ):
          <fpage>760</fpage>
          -
          <lpage>772</lpage>
          . Biomedical Natural Language Processing.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dina</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          , Wendy W Chapman, and
          <string-name>
            <surname>Clement J McDonald</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>What can natural language processing do for clinical decision support? Journal of biomedical informatics</article-title>
          ,
          <volume>42</volume>
          (
          <issue>5</issue>
          ):
          <fpage>760</fpage>
          -
          <lpage>772</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          , Yubin Kim, and
          <string-name>
            <given-names>Ryen</given-names>
            <surname>White</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Overview of the health search and data mining (hsdm 2020) workshop</article-title>
          .
          <source>In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining</source>
          , WSDM '
          <fpage>20</fpage>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ganesan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Subotin</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A general supervised approach to segmentation of clinical texts</article-title>
          .
          <source>In 2014 IEEE International Conference on Big Data (Big Data)</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Pathima</given-names>
            <surname>Nusrath</surname>
          </string-name>
          <string-name>
            <surname>Hameed</surname>
          </string-name>
          , Karin Verspoor, Snezana Kusljic, and
          <string-name>
            <given-names>Saman</given-names>
            <surname>Halgamuge</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>129</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hartigan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Wong</surname>
          </string-name>
          .
          <year>1979</year>
          .
          <article-title>Algorithm as 136: A k-means clustering algorithm</article-title>
          .
          <source>Journal of the Royal Statistical Society</source>
          . Series C (Applied Statistics),
          <volume>28</volume>
          (
          <issue>1</issue>
          ):
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ming-Siang</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Po-Ting</given-names>
            <surname>Lai</surname>
          </string-name>
          , Richard Tzong-Han
          <string-name>
            <surname>Tsai</surname>
          </string-name>
          , and
          <string-name>
            <surname>Wen-Lian Hsu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Revised JNLPBA corpus: A revised version of biomedical NER corpus for relation extraction task</article-title>
          .
          <source>CoRR</source>
          , abs/1901.10219.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Zhiheng</given-names>
            <surname>Huang</surname>
          </string-name>
          , Wei Xu, and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>CoRR</source>
          , abs/1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Stanley M.</given-names>
            <surname>Huff</surname>
          </string-name>
          , Roberto A.
          <string-name>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <surname>Clement J. McDonald</surname>
          </string-name>
          ,
          <string-name>
            <surname>Georges J. E. De</surname>
            <given-names>Moor</given-names>
          </string-name>
          , Tom Fiers, Jr. Bidgood,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Arden W.</given-names>
            <surname>Forrey</surname>
          </string-name>
          , William G. Francis,
          <string-name>
            <surname>Wayne R. Tracy</surname>
            , Dennis Leavelle,
            <given-names>Frank</given-names>
          </string-name>
          <string-name>
            <surname>Stalling</surname>
            , Brian Griffin, Pat Maloney, Diane Leland, Linda Charles, Kathy Hutchins,
            <given-names>and John</given-names>
          </string-name>
          <string-name>
            <surname>Baenziger</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Development of the Logical Observation Identifier Names and Codes (LOINC) Vocabulary</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>276</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Alistair E. W.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , Tom J.
          <string-name>
            <surname>Pollard</surname>
            ,
            <given-names>Lu</given-names>
          </string-name>
          <string-name>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Li-wei</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lehman</surname>
          </string-name>
          , Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark.
          <year>2016</year>
          .
          <article-title>Mimic-iii, a freely accessible critical care database</article-title>
          .
          <source>Scientific Data</source>
          ,
          <volume>3</volume>
          :160035 EP -.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Clinical documents clustering based on medication/symptom names using multi-view nonnegative matrix factorization</article-title>
          .
          <source>IEEE Transactions on NanoBioscience</source>
          ,
          <volume>14</volume>
          (
          <issue>5</issue>
          ):
          <fpage>500</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Georgia</given-names>
            <surname>McGaughey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W Patrick</given-names>
            <surname>Walters</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Brian</given-names>
            <surname>Goldman</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Understanding covariate shift in model performance</article-title>
          .
          <source>F1000Research</source>
          ,
          <volume>5</volume>
          (Chem Inf Sci):
          <fpage>597</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Shawn N.</given-names>
            <surname>Murphy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Henry C.</given-names>
            <surname>Chueh</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>A security architecture for query tools used to access large biomedical databases</article-title>
          .
          <source>Proceedings. AMIA Symposium</source>
          , pages
          <fpage>552</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Lucila</given-names>
            <surname>Ohno-Machado</surname>
          </string-name>
          , Vineet Bafna, Aziz A Boxwala, Brian E Chapman, Wendy W Chapman, Kamalika Chaudhuri, Michele E Day, Claudiu Farcas,
          <string-name>
            <surname>Nathaniel D Heintzman</surname>
          </string-name>
          , Xiaoqian Jiang, Hyeoneui Kim, Jihoon Kim, Michael E Matheny,
          <article-title>Frederic S Resnic, Staal A Vinterbo, , and the iDASH team</article-title>
          .
          <year>2011</year>
          .
          <article-title>iDASH: integrating data for analysis, anonymization, and sharing</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>196</fpage>
          -
          <lpage>201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</article-title>
          .
          <source>Journal of Computational and Applied Mathematics</source>
          ,
          <volume>20</volume>
          :
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Guergana</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>James J.</given-names>
          </string-name>
          <string-name>
            <surname>Masanz</surname>
          </string-name>
          ,
          <string-name>
            <surname>Philip</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Ogren</surname>
          </string-name>
          , Jiaping Zheng, Sunghwan Sohn, Karin Kipper Schuler, and
          <string-name>
            <surname>Christopher</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Chute</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications</article-title>
          .
          <source>Journal of the American Medical Informatics Association : JAMIA, 17</source>
          <volume>5</volume>
          :
          <fpage>507</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Chikka</surname>
          </string-name>
          , L. Thomas,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mandhan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Karlapalem</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Identifying medical terms related to specific diseases</article-title>
          .
          <source>In 2015 IEEE International Conference on Data Mining Workshop (ICDMW)</source>
          , pages
          <fpage>170</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Larry</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lorraine K Tanabe</given-names>
            , Rie Johnson nee Ando, Cheng-Ju
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <surname>I-Fang</surname>
            <given-names>Chung</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chun-Nan</surname>
            <given-names>Hsu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu-Shi</surname>
            <given-names>Lin</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Roman</given-names>
            <surname>Klinger</surname>
          </string-name>
          , Christoph M Friedrich,
          <string-name>
            <given-names>Kuzman</given-names>
            <surname>Ganchev</surname>
          </string-name>
          , Manabu Torii, Hongfang Liu, Barry Haddow, Craig A Struble, Richard J Povinelli, Andreas Vlachos,
          <string-name>
            <surname>Jr Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <surname>William</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence</surname>
            <given-names>Hunter</given-names>
          </string-name>
          , Bob Carpenter, Richard Tzong-Han
          <string-name>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hong-Jie</surname>
            <given-names>Dai</given-names>
          </string-name>
          , Feng Liu, Yifei Chen, Chengjie Sun, Sophia Katrenko, Pieter Adriaans, Christian Blaschke, Rafael Torres, Mariana Neves, Preslav Nakov, Anna Divoli, Manuel Maña-López,
          <string-name>
            <given-names>Jacinto</given-names>
            <surname>Mata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W John</given-names>
            <surname>Wilbur</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Overview of biocreative ii gene mention recognition</article-title>
          .
          <source>Genome biology</source>
          ,
          <issue>9 Suppl 2</issue>
          (
          <issue>Suppl 2</issue>
          ):
          <fpage>S2</fpage>
          -
          <lpage>S2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          .
          <year>1988</year>
          .
          <article-title>Document retrieval systems</article-title>
          . In Peter Willett, editor,
          <source>Document Retrieval Systems</source>
          ,
          <article-title>chapter A Statistical Interpretation of Term Specificity and Its Application in Retrieval</article-title>
          , pages
          <fpage>132</fpage>
          -
          <lpage>142</lpage>
          . Taylor Graham Publishing, London, UK, UK.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Amber</given-names>
            <surname>Stubbs</surname>
          </string-name>
          , Christopher Kotfila, Hua Xu, and
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Uzuner</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Identifying risk factors for heart disease over time: Overview of 2014 i2b2/uthealth shared task track 2</article-title>
          .
          <source>Journal of biomedical informatics, 58 Suppl(Suppl</source>
          <volume>)</volume>
          :
          <fpage>S67</fpage>
          -
          <lpage>S77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Buzhou</given-names>
            <surname>Tang</surname>
          </string-name>
          , Hongxin Cao, Xiaolong Wang,
          <string-name>
            <surname>Qingcai Chen</surname>
            , and
            <given-names>Hua</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Evaluating word representation features in biomedical named entity recognition tasks</article-title>
          .
          <source>BioMed research international</source>
          ,
          <year>2014</year>
          :
          <fpage>240403</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
<string-name>
  <given-names>Eugene</given-names>
  <surname>Tseytlin</surname>
</string-name>
, Kevin Mitchell, Elizabeth Legowski, Julia Corrigan, Girish Chavan, and
<string-name>
  <given-names>Rebecca S.</given-names>
  <surname>Jacobson</surname>
</string-name>
          .
          <year>2016</year>
          .
          <article-title>NOBLE - flexible concept recognition for large-scale biomedical natural language processing</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
<string-name>
  <given-names>Grace</given-names>
  <surname>Wahba</surname>
</string-name>
.
          <year>1990</year>
          .
          <article-title>Spline Models for Observational Data</article-title>
          .
<source>Society for Industrial and Applied Mathematics</source>
.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
<string-name>
  <given-names>Wei-Hung</given-names>
  <surname>Weng</surname>
</string-name>
,
<string-name>
  <given-names>Kavishwar B.</given-names>
  <surname>Wagholikar</surname>
</string-name>
,
<string-name>
  <given-names>Alexa T.</given-names>
  <surname>McCray</surname>
</string-name>
,
<string-name>
  <given-names>Peter</given-names>
  <surname>Szolovits</surname>
</string-name>
, and
<string-name>
  <given-names>Henry C.</given-names>
  <surname>Chueh</surname>
</string-name>
          .
          <year>2017</year>
          .
          <article-title>Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach</article-title>
          .
          <source>BMC Medical Informatics and Decision Making</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>155</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>