<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recognizing and Encoding Disorder Concepts in Clinical Text using Machine Learning and Vector Space Model*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Buzhou Tang</string-name>
          <email>buzhou.tang@uth.tmc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yonghui Wu</string-name>
          <email>yonghui.wu@uth.tmc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Min Jiang</string-name>
          <email>min.jiang@uth.tmc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joshua C. Denny</string-name>
          <email>josh.denny@vanderbilt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hua Xu</string-name>
          <email>hua.xu@uth.tmc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biomedical Informatics, Vanderbilt University</institution>
          ,
          <addr-line>Nashville, TN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School</institution>
          ,
          <addr-line>Shenzhen, Guangdong</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Biomedical Informatics, The University of Texas Health Science Center at Houston</institution>
          ,
          <addr-line>Houston, Texas</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ShARe/CLEF eHealth Evaluation Lab (SHEL) organized a challenge on natural language processing (NLP) and information retrieval (IR) in the medical domain in 2013. The first task of the 2013 ShARe/CLEF challenge was to extract disorder mention spans and their associated UMLS (Unified Medical Language System) concept unique identifiers (CUIs). We participated in Task 1 and developed a clinical disorder recognition and encoding system. The proposed system consists of two components: a machine learning-based approach to recognize disorder entities and a vector space model-based method to encode disorders to UMLS CUIs. The challenge organizers manually annotated disorder entities and corresponding UMLS CUIs in 298 clinical notes, of which 199 notes were used for training and 99 for testing. Evaluation on the test data set showed that our system achieved an F-measure of 0.750 for entity recognition (ranked first) and an accuracy of 0.514 for UMLS CUI encoding (ranked third), indicating the promise of the proposed approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>medical language processing</kwd>
        <kwd>natural language processing</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>UMLS encoding</kwd>
        <kwd>clinical concept extraction</kwd>
        <kwd>conditional random fields</kwd>
        <kwd>structured support vector machines</kwd>
        <kwd>vector space model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Clinical natural language processing (NLP) has received great attention in recent
years because unlocking the information embedded in clinical documents is critical to
the secondary use of electronic health record (EHR) data for clinical and
translational research. Clinical concept extraction, which recognizes clinically relevant
entities (e.g., diseases, drugs, and labs) in text and maps them to identifiers in standard
vocabularies (e.g., Concept Unique Identifiers (CUIs) defined in the Unified Medical
Language System (UMLS) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), is one of the fundamental tasks in clinical NLP research.
Many systems have been developed to extract clinical concepts from various types of
clinical notes in the last two decades. Earlier studies mainly focused on building symbolic
NLP systems that are heavily based on domain knowledge (e.g., medical
vocabularies). The representative systems include MedLEE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], SymText/MPlus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
MetaMap [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], KnowledgeMap [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], cTAKES [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and HiTEX [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In the past few years,
as annotated clinical corpora have become increasingly available, researchers have started to
investigate the use of machine learning algorithms in clinical entity recognition. The Center
for Informatics for Integrating Biology &amp; the Bedside (i2b2) has organized several
clinical NLP challenges to promote research in this field. In 2009, the i2b2 NLP challenge
was to recognize medication-related concepts. Rule-based, machine learning-based,
and hybrid methods were developed by over twenty
participating teams [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In the 2010 i2b2 clinical NLP challenge, organizers expanded clinical
concepts from medication to problems, tests, and treatments. Most systems in this challenge were
primarily based on machine learning algorithms, likely due to the
availability of large annotated datasets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>In 2013, the ShARe/CLEF eHealth Evaluation Lab (SHEL) organized three shared
tasks on natural language processing (NLP) and information retrieval (IR): 1) clinical
disorder extraction and encoding to Systematized Nomenclature Of Medicine Clinical
Terms (SNOMED-CT), 2) acronym/abbreviation identification, and 3) retrieval of
web pages based on queries generated when reading the clinical reports. Task 1
on clinical disorder extraction is similar to the 2010 i2b2 challenge on clinical
problem extraction. However, there are two major differences between these two tasks: 1)
the ShARe/CLEF task allowed disjoint entities, while the 2010 i2b2 clinical problem
extraction task only dealt with entities of consecutive words; and 2) the ShARe/CLEF task required
mapping disorder entities to SNOMED-CT (using UMLS CUIs), which was not
required in the 2010 i2b2 challenge.</p>
      <p>In this paper, we describe our system for Task 1 of the 2013 ShARe/CLEF
challenge. Our system consists of a machine learning based approach for disorder entity
recognition and a Vector Space Model (VSM) based method for mapping extracted
entities to SNOMED-CT codes. Evaluation by the organizers showed that our system
ranked first for entity recognition and third for encoding among all participating teams.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>Fig. 1 shows an overview of the architecture of our system for the first task of the
ShARe/CLEF eHealth 2013 shared task. It is an end-to-end system with two
components: disorder entity recognition and encoding. The first component consists of five
modules. Because the clinical narratives supplied by the organizers were not well formatted,
we first developed rule-based modules to detect sentence boundaries and tokenize
each note, and finally aligned the preprocessed note back to the original one.
The other components are described in detail in the following sections.</p>
      <p>Fig. 1. Overview of the architecture of our disorder concept extraction system for the
first task of the ShARe/CLEF eHealth 2013 shared task.</p>
      <sec id="sec-2-1">
        <title>Dataset</title>
        <p>The organizers collected 298 notes from different clinical encounters including
radiology reports, discharge summaries, and ECG/ECHO reports. For each note, disorder
entities were annotated based on a pre-defined guideline and then mapped to
SNOMED-CT concepts represented by UMLS CUIs. If a disorder entity could not be
found in SNOMED-CT, it was marked as “CUI-less”. The data set was divided
into two parts: a training set of 199 notes that was used for system development, and
a test set of 99 notes for evaluating systems. In the training set, 5811 disorder entities
were annotated and mapped to 1007 unique CUIs or CUI-less. The test set contained
5340 disorder entities with 795 CUIs or CUI-less. Table 1 shows the counts of entities
and CUIs in the training and test datasets.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Disorder entity recognition</title>
        <p>In machine learning-based named entity recognition (NER) systems, annotated data
are typically converted into a BIO format, where each word is assigned one of
three labels: B means beginning of an entity, I means inside an entity, and O means
outside of an entity. Thus the NER problem is converted into a classification problem
that assigns one of the three labels to each word. As mentioned previously, one
challenge of this task is that some disorder mentions (&gt;10%) were disjoint. These could
not be handled directly by the traditional BIO approach, which only works on
entities with consecutive words. Therefore we developed different strategies for
consecutive entities and disjoint entities. For consecutive disorder entities, we labeled words
using traditional BIO tags. For disjoint entities, we created two additional sets of tags:
1) D{B, I} was used to label disjoint entity words that are not shared by multiple
concepts (called non-head entities); and 2) H{B, I} was used to label head words that
are shared by two or more disjoint concepts (called head entities). Figure 2 shows some
examples of labeling consecutive and disjoint disorder entities using our new tagging
sets. In this approach, we need to assign one of the seven labels {B, I, O, DB, DI, HB,
HI} to each word. When converting labeled words to entities, we defined a few
simple rules. For example, one rule for head words is “for each disjoint head entity,
combine it with all other non-head entities to form final disorder entities”.</p>
        <p>Sentence 1: “The left atrium is dilated .”
Encoding: “The/O left/DB atrium/DI is/O dilated/DB ./O”
Sentence 2: “The aortic root and ascending aorta are moderately dilated .”
Encoding: “The/O aortic/DB root/DI and/O ascending/DB aorta/DI are/O 
moderately/O dilated/HB ./O”</p>
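        <p>The conversion from the seven-label tagging back to entities can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the function names are hypothetical, the head rule is the one quoted above, and the fallback of merging all non-head chunks of a sentence into one entity when no head is present is an assumption inferred from the first example sentence.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

# A chunk is (kind, token indices); kind is "C" (consecutive), "D" (non-head) or "H" (head).
Chunk = Tuple[str, List[int]]


def chunk_tags(tags: List[str]) -> List[Chunk]:
    """Group per-token tags from {B, I, O, DB, DI, HB, HI} into chunks."""
    chunks: List[Chunk] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        kind = "C" if tag in ("B", "I") else tag[0]  # "DB" -> "D", "HI" -> "H"
        starts_new = (
            tag.endswith("B")
            or not chunks
            or chunks[-1][0] != kind
            or chunks[-1][1][-1] != i - 1
        )
        if starts_new:
            chunks.append((kind, [i]))
        else:
            chunks[-1][1].append(i)
    return chunks


def decode(tokens: List[str], tags: List[str]) -> List[str]:
    """Turn one tagged sentence into a list of disorder entity strings."""
    chunks = chunk_tags(tags)
    entities = [idx for kind, idx in chunks if kind == "C"]
    heads = [idx for kind, idx in chunks if kind == "H"]
    parts = [idx for kind, idx in chunks if kind == "D"]
    if heads:
        # Quoted rule: combine each head entity with every non-head entity.
        for head in heads:
            for part in parts:
                entities.append(sorted(part + head))
    elif parts:
        # Assumed rule: without a head, all non-head chunks form one entity.
        entities.append([i for part in parts for i in part])
    return [" ".join(tokens[i] for i in idx) for idx in entities]


s1 = "The left atrium is dilated .".split()
t1 = ["O", "DB", "DI", "O", "DB", "O"]
s2 = "The aortic root and ascending aorta are moderately dilated .".split()
t2 = ["O", "DB", "DI", "O", "DB", "DI", "O", "O", "HB", "O"]
```
]]></preformat>
        <p>With these inputs, decode(s1, t1) yields one disjoint entity and decode(s2, t2) yields two entities that share the head word “dilated”. Under the assumed fallback, a sentence that contains two unrelated disjoint entities without a head word would be merged into a single entity.</p>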
        <p>We investigated two machine learning algorithms for disorder entity recognition.
One is Conditional Random Fields (CRFs), a representative sequence
labeling algorithm that is well suited to the NER problem. The other is Structural Support
Vector Machines (SSVMs), which were proposed by Tsochantaridis et al. [23] in 2005
for structured data, such as trees and sequences. SSVMs are an SVM-based discriminative
algorithm for structured prediction; they therefore combine the advantages of
both CRFs and SVMs and are suitable for sequence labeling problems as well.
CRFsuite (http://www.chokkan.org/software/crfsuite/) and SVMhmm
(http://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html) were used as
implementations of CRFs and SSVMs, respectively.</p>
        <p>
          For features, we used bag-of-words, part-of-speech (POS) tags from the Stanford tagger
(http://www-nlp.stanford.edu/software/tagger.shtml), type of notes, section
information, word representations from Brown clustering [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and random indexing [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
semantic categories of words based on UMLS [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] lookup, MetaMap [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], or cTAKES
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] outputs. Most of the features were the same as those used in our previous systems for
medical concept recognition [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ][
          <xref ref-type="bibr" rid="ref14">14</xref>
          ][
          <xref ref-type="bibr" rid="ref15">15</xref>
          ][
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Disorder entity encoding</title>
        <p>We treated disorder entity encoding as a ranking problem, where each recognized
disorder entity was considered as a query and candidate terms in UMLS as
documents. The Vector Space Model (VSM) was used in this task. The process consists of
two steps: 1) generate candidate CUIs from UMLS; and 2) rank the candidate CUIs and
take the top-ranked CUI as the system’s output. We applied the following criterion to
select candidate CUIs from UMLS for a given disorder entity: the corresponding
terms of a candidate CUI should contain all words in the disorder entity (except stop
words). For each candidate CUI, we created a vector containing its words, weighted by term
frequency–inverse document frequency (tf-idf) derived from all
UMLS/SNOMED-CT terms. The cosine similarity between a disorder
entity vector and a candidate CUI vector was calculated and used to rank candidate
CUIs. The top-ranked CUI was then selected as the CUI of the entity. To
leverage the training data, we further built a limited VSM-based encoding system
using only the CUIs/terms and entities that occurred in the training set, instead of the entire
UMLS. When processing the test set, we first determined whether an entity occurred
in the training set. If it did, we used the limited VSM-based encoding system to
predict the corresponding CUI. Otherwise, we used the general VSM-based encoding
system that was built on the entire UMLS.</p>
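        <p>The two-step encoding procedure can be illustrated with the following Python sketch. It is a simplified illustration rather than the system's implementation. The stop-word list, the smoothed idf formula, whitespace tokenization, returning “CUI-less” when no candidate term is found, and all function names are assumptions made for this example. The identifiers CUI_A, CUI_B and CUI_C are placeholders, not real UMLS CUIs.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter

STOP_WORDS = {"of", "the", "and", "in"}  # illustrative stop list (assumption)


def content_words(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_idf(terms):
    """Smoothed inverse document frequency over a list of terminology terms."""
    df = Counter(w for term in terms for w in set(content_words(term)))
    return {w: math.log(len(terms) / n) + 1.0 for w, n in df.items()}


def tfidf_vector(text, idf):
    """Sparse tf-idf vector; words unseen in the terminology get weight 0."""
    counts = Counter(content_words(text))
    return {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def encode(entity, terminology, idf):
    """Return the best-ranked CUI among terms that contain every entity word."""
    entity_words = set(content_words(entity))
    query = tfidf_vector(entity, idf)
    best_cui, best_score = "CUI-less", 0.0
    for cui, term in terminology:
        if not entity_words.issubset(content_words(term)):
            continue  # step 1: a candidate term must contain all entity words
        score = cosine(query, tfidf_vector(term, idf))  # step 2: rank by cosine
        if score > best_score:
            best_cui, best_score = cui, score
    return best_cui


def encode_with_fallback(entity, train_entities, train_index, umls_index):
    """Use the training-set index for entities seen in training, else the full one."""
    seen = entity.lower() in train_entities
    terminology, idf = train_index if seen else umls_index
    return encode(entity, terminology, idf)


# Toy terminology with placeholder identifiers (not real UMLS CUIs).
UMLS = [("CUI_A", "atrial dilatation"),
        ("CUI_B", "left atrial dilatation"),
        ("CUI_C", "dilatation of aorta")]
umls_index = (UMLS, build_idf([term for _, term in UMLS]))
TRAIN = [("CUI_B", "atrial dilatation")]  # hypothetical mapping seen in training
train_index = (TRAIN, build_idf([term for _, term in TRAIN]))
```
]]></preformat>
        <p>In this toy setting, the general index maps “atrial dilatation” to CUI_A because the exact term has the highest cosine similarity, whereas the training-set index returns CUI_B for the same entity, which shows how the limited system lets annotations observed in training take precedence.</p>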
      </sec>
      <sec id="sec-2-4">
        <title>Experiments and Evaluation</title>
        <p>
          Our system was developed and trained using the training set (199 notes) and was
evaluated using the test set (99 notes). All parameters of CRF and SSVM were
optimized by 10-fold cross-validation on the training dataset. The performance of
disorder entity recognition was evaluated by precision, recall and F-measure in both
“strict” and “relaxed” modes. In “strict” mode, a disorder mention is correctly
recognized if and only if its starting and ending offsets are exactly the same as those of a disorder
mention in the gold standard; in “relaxed” mode, a disorder mention is correctly
recognized as long as it overlaps with any disorder mention in the gold standard. For
encoding of SNOMED-CT, all participating systems were evaluated using accuracy
only, in “strict” and “relaxed” modes, as defined in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ][
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
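        <p>The difference between the “strict” and “relaxed” matching modes for entity recognition can be made concrete with a short Python sketch. This is an illustration only and not the organizers' official evaluation script. It assumes that each mention is a single half-open character span (start, end), which ignores disjoint mentions, and the function names and example offsets are made up for the example.</p>
        <preformat><![CDATA[
```python
def overlaps(a, b):
    """True if half-open character spans a and b share at least one offset."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall_f(pred, gold, relaxed=False):
    """Score predicted spans against gold spans in strict or relaxed mode."""
    if relaxed:
        # A span counts as correct if it overlaps any span on the other side.
        hit_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        hit_gold = sum(any(overlaps(p, g) for p in pred) for g in gold)
    else:
        # Strict mode: start and end offsets must match exactly.
        hit_pred = hit_gold = len(set(pred) & set(gold))
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = [(0, 5), (10, 20), (30, 35)]   # made-up gold mention offsets
pred = [(0, 5), (12, 20), (40, 45)]   # made-up system output
strict = precision_recall_f(pred, gold)
relaxed = precision_recall_f(pred, gold, relaxed=True)
```
]]></preformat>
        <p>In this example only the first prediction matches exactly, so all three strict scores are 1/3. The second prediction overlaps a gold mention without matching its offsets, which raises the relaxed scores to 2/3.</p>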
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        For disorder entity recognition, our system achieved F-measures of 0.750 under the “strict” criterion and 0.873 under the “relaxed” criterion,
ranked first in the challenge. For SNOMED-CT encoding, the best accuracy of our system
was 0.514, ranked third in the challenge.
Although a number of existing clinical NLP systems such as MedLEE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], MetaMap
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], KnowledgeMap [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and cTAKES [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] can extract clinical concepts and map them
to UMLS CUIs, it is difficult to compare the performance of these systems because
there is a lack of publicly available corpora with annotations of UMLS CUIs. The
2013 ShARe/CLEF eHealth shared task 1 provides such a benchmark dataset for
clinical concept recognition and encoding, which is a significant contribution to
clinical NLP research. Furthermore, the best system in the challenge achieved an accuracy
of 0.589 on encoding SNOMED-CT concepts, indicating that it is still very challenging to
develop general clinical NLP systems that can accurately recognize and encode clinical
disorders to standard terminologies.
      </p>
      <p>
        In this study, we developed a clinical disorder recognition and encoding system
that combines a machine learning-based approach for entity recognition and a
VSM-based approach for UMLS concept mapping. Our system was top-ranked among all
participating teams, indicating the promise of the proposed approaches. However, there is
still much room for further improvement. First, our proposed method for disjoint
entity recognition has limitations. For example, if a sentence has multiple disjoint entities,
our current simple rule-based strategies cannot resolve the ambiguity
and produce wrong combinations of disorder entities, as shown in Fig. 3, where
there are two disorder entities in the given sentence: “blood … on his tongue” and
“pupils … pinpoint”. These are represented by “blood/DB … on/DB his/DI
tongue/DI” and “pupils/DB … pinpoint/DB” respectively, but are parsed into one
disorder entity “blood … on his tongue … pupils … pinpoint” by our strategies. Thus,
more sophisticated methods for disjoint concept recognition should be investigated in
the future. In addition, our VSM-based method to map entities to UMLS CUIs is not
optimal. When compared with the top-ranked team on UMLS CUI mapping, our
system achieved better performance on entity recognition, but lower accuracy on CUI
mapping, indicating the weakness of our encoding step. A few possible aspects for
further improvement are: 1) use other types of information as features for building
vectors, such as context, type of notes, and section information; 2) explore other
ranking algorithms such as Support Vector Machines [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and 3) implement word
sense disambiguation algorithms for ambiguous entities.
We developed a clinical disorder recognition and encoding system that consists of a
machine learning-based approach to recognize disorder entities and a vector space
model-based method to encode disorders to UMLS CUIs. Our entry based on this
system was top-ranked in the 2013 ShARe/CLEF eHealth shared task 1, indicating the
promise of our approaches. However, more investigations are needed in order to
achieve satisfactory performance on extracting and encoding medical concepts in
clinical text.
      </p>
      <sec id="sec-3-1">
        <title>Acknowledgments</title>
        <p>This study was supported in part by grants from NLM (R01LM010681), the Office of the
National Coordinator for Health Information Technology (10510592), NCI
(1R01CA141307), and NIGMS (1R01GM102282). We also thank the ShARe/CLEF
eHealth shared task 2013 organizers, who were funded by the United States National
Institutes of Health under grant R01GM090187.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] “Unified Medical Language System (UMLS) - Home.” [Online]. Available: http://www.nlm.nih.gov/research/umls/. [Accessed: 22-May-2013].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. O.</given-names>
            <surname>Alderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Cimino</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , “
          <article-title>A general natural-language text processor for clinical radiology</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>174</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Koehler</surname>
          </string-name>
          , “
          <article-title>SymText: a natural language understanding system for encoding free text medical data</article-title>
          ,” University of Utah,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Christensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Haug</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fiszman</surname>
          </string-name>
          , “
          <article-title>MPLUS: a probabilistic medical language understanding system</article-title>
          ,” in
          <source>Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain - Volume 3</source>
          , Stroudsburg, PA, USA,
          <year>2002</year>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.-M.</given-names>
            <surname>Lang</surname>
          </string-name>
          , “
          <article-title>An overview of MetaMap: historical perspective and recent advances</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>236</lpage>
          , May
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Denny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Irani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Wehbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Smithers</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Spickard</surname>
          </string-name>
          , “
          <article-title>The KnowledgeMap Project: Development of a Concept-Based Medical School Curriculum Database</article-title>
          ,”
          <source>AMIA Annu Symp Proc</source>
          , vol.
          <volume>2003</volume>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>199</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Masanz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Ogren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sohn</surname>
          </string-name>
          ,
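          <string-name>
            <given-names>K. C.</given-names>
            <surname>Kipper-Schuler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Chute</surname>
          </string-name>
          , “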
          <article-title>Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>507</fpage>
          -
          <lpage>513</lpage>
          , Sep.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q. T.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goryachev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sordo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Murphy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lazarus</surname>
          </string-name>
          , “
          <article-title>Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system</article-title>
          ,”
          <source>BMC Med Inform Decis Mak</source>
          , vol.
          <volume>6</volume>
          , p.
          <fpage>30</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
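          <string-name>
            <given-names>I.</given-names>
            <surname>Solti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Cadag</surname>
          </string-name>
          , “
          <article-title>Extracting medication information from clinical text</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>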
          , vol.
          <volume>17</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>514</fpage>
          -
          <lpage>518</lpage>
          , Oct.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>South</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. L.</given-names>
            <surname>DuVall</surname>
          </string-name>
          , “
          <article-title>2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>552</fpage>
          -
          <lpage>556</lpage>
          , Oct.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. V.</given-names>
            <surname>deSouza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J. D.</given-names>
            <surname>Pietra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Lai</surname>
          </string-name>
          , “
          <article-title>Class-Based n-gram Models of Natural Language</article-title>
          ,”
          <source>Computational Linguistics</source>
          , vol.
          <volume>18</volume>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>479</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lund</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Burgess</surname>
          </string-name>
          , “
          <article-title>Producing high-dimensional semantic spaces from lexical co-occurrence</article-title>
          ,”
          <source>Behavior Research Methods, Instruments, &amp; Computers</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>208</lpage>
          , Jun.
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Rosenbloom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Denny</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          , “
          <article-title>A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>606</lpage>
          , Oct.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          , “
          <article-title>Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features</article-title>
          ,”
          <source>BMC Med Inform Decis Mak</source>
          , vol.
          <volume>13</volume>
          ,
          <issue>Suppl 1</issue>
          , p.
          <fpage>S1</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          , “
          <article-title>Clinical entity recognition using structural support vector machines with rich features</article-title>
          ,” in
          <source>Proceedings of the ACM sixth international workshop on Data and text mining in biomedical informatics</source>
          , New York, NY, USA,
          <year>2012</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Denny</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          , “
          <article-title>A hybrid system for temporal information extraction from clinical text</article-title>
          ,”
          <source>J Am Med Inform Assoc</source>
          , Apr.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Suominen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Salantera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanna</surname>
          </string-name>
          , et al., “
          <article-title>Overview of the ShARe/CLEF eHealth Evaluation Lab 2013</article-title>
          ,” presented at the
          <source>Proceedings of CLEF 2013</source>
          ,
          <year>2013</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Savova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhadad</surname>
          </string-name>
          , “
          <article-title>ShARe/CLEF Shared Task 1 for boundary detection and normalization of SNOMED disorders</article-title>
          ,” presented at the
          <source>Proceedings of CLEF 2013</source>
          ,
          <year>2013</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          , “
          <article-title>Optimizing search engines using clickthrough data</article-title>
          ,” in
          <source>Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , New York, NY, USA,
          <year>2002</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>