<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HITSZ_CNER: A hybrid system for entity recognition from Chinese clinical text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jianglu Hu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xue Shi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zengjian Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaolong Wang</string-name>
          <email>wangxl@insun.hit.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qingcai Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Buzhou Tang</string-name>
          <email>tangbuzhou@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School</institution>
          ,
          <addr-line>Shenzhen, China, 518055</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapid development of electronic medical records, reusing these data for research and commercial purposes has attracted growing attention. As entity recognition is one of the most fundamental tasks in medical information extraction, the 2017 China Conference on Knowledge Graph and Semantic Computing (CCKS) challenge set up a track for clinical named entity recognition (CNER). The organizers provided 400 annotated Chinese medical records for this track, of which 300 are used as a training set and 100 as a test set. A further 2,605 raw medical records were released as an unlabeled set. In this study, we develop a hybrid system based on rule, CRF (conditional random field) and RNN (recurrent neural network) methods for the CNER task. Experiments on the official test set show that our system achieves F1-scores of 91.08% and 94.26% under the “strict” and “relaxed” criteria respectively, ranking first in the 2017 CCKS CNER challenge. By applying a self-training method with unlabeled data, the F1-scores of all machine learning-based methods are improved by about 1.0% under the “strict” criterion. Our future work will focus on more effective extraction of body, disease and treatment entities.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Recognition</kwd>
        <kwd>Chinese Clinical Text</kwd>
        <kwd>Recurrent Neural Network</kwd>
        <kwd>Conditional Random Fields</kwd>
        <kwd>Hybrid Method</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, medical information processing has become a popular research
focus, driven by the growing volume of electronic medical records and the
potential demand for medical information services and medical decision support.
Clinical entity recognition, one of the most fundamental clinical text processing tasks, has
been organized as a shared task in many challenges, such as i2b2 (Informatics for
Integrating Biology &amp; the Bedside) 2009[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], i2b2 2010[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], i2b2 2012[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
SHEL (ShARe/CLEF eHealth Evaluation Lab) 2013[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], SemEval (Semantic
Evaluation) 2014[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], SemEval 2015[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], SemEval 2016[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], etc. These challenges have not only
accelerated research on entity recognition, but also produced several valuable
annotated corpora for clinical entity recognition. However, none of them was organized on
Chinese clinical text. For this reason, the 2017 CCKS (China Conference on
Knowledge Graph and Semantic Computing) challenge set up a clinical named entity
recognition (CNER) track to identify entities in Chinese clinical text, in which five
categories of entity are defined: Body, Disease, Symptom, Test and Treatment.
      </p>
      <p>
        Early clinical entity recognition systems were mainly based on
dictionary matching and rules, such as MedLEE[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Machine learning-based clinical
entity recognition systems have been developed in recent years, especially
since the clinical entity recognition challenges mentioned above were organized. The
main machine learning algorithms used for entity recognition include the hidden Markov
model (HMM), conditional random field (CRF) and structured support vector
machine (SSVM). More recently, the recurrent neural network (RNN) has been
widely used for clinical entity recognition, and has achieved state-of-the-art
performance on the i2b2 2010, i2b2 2012 and i2b2 2014 corpora[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In this study, we participate in the 2017 CCKS CNER challenge and develop a
hybrid system for Chinese clinical entity recognition, which is based on four
individual methods (rule, CRF, RNN and RNN with features) and a vote-based approach.
In addition, we apply a self-training method with a large unlabeled dataset to
improve the performance of our system.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>Figure 1 shows the overall architecture of our system for entity recognition
from Chinese clinical text, which contains four individual methods: rule-based,
CRF-based, RNN-based and RNN with features. First, we apply these methods
to Chinese clinical text separately; the results of the rule-based method are used as
features in the other three methods. Then a vote-based method is used to combine the
entities predicted by them. A detailed description of our system is presented below.</p>
      <sec id="sec-2-1">
        <title>Rule-based Method</title>
        <p>Since prior knowledge plays an important role in entity recognition, especially
for clinical text, we construct several dictionaries for each type of entity, drawing
on the training set and some open websites (e.g. “Baidu baike”, “Xunyiwenyao”, etc.).
The dictionaries cover body location, disease, symptom, examination, surgery, medicine, etc.</p>
        <p>With the help of these dictionaries, we build a large number of rules to recognize common
patterns of entities. For example, in the phrase “右侧小脑” (“right cerebellum”),
“小脑” (“cerebellum”) can be identified as “Body” by dictionary matching, and the entity is then
extended by our rules to include “右侧” (“right”). Similarly, in “…有心脏病病史…” (“…has a
history of heart disease…”), we can extract “心脏病” (“heart disease”) as “Disease”
according to the pattern “…有…病史…” (“…has a history of …”).</p>
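        <p>The two rule types described above can be illustrated with a minimal sketch. The dictionary entries, the list of direction prefixes and the function names below are assumptions made for illustration only; they are not the dictionaries or rules actually used in our system.</p>
        <preformat>
```python
import re

# Hypothetical mini-dictionary and modifier list, for illustration only.
BODY_DICT = {"小脑", "肝脏"}
DIRECTION_PREFIXES = ("右侧", "左侧", "双侧")
# Pattern "有...病史" ("has a history of ..."); the captured group is the disease.
HISTORY_PATTERN = re.compile(r"有(.+?)病史")


def match_body(text):
    """Find dictionary terms, then extend each match left over a direction prefix."""
    entities = []
    for term in BODY_DICT:
        for m in re.finditer(re.escape(term), text):
            start, end = m.start(), m.end()
            for prefix in DIRECTION_PREFIXES:
                if text[:start].endswith(prefix):
                    start -= len(prefix)
                    break
            entities.append((start, end, "Body", text[start:end]))
    return sorted(entities)


def match_history(text):
    """Extract the span between "有" and "病史" as a Disease entity."""
    return [(m.start(1), m.end(1), "Disease", m.group(1))
            for m in HISTORY_PATTERN.finditer(text)]


print(match_body("右侧小脑"))
print(match_history("有心脏病病史"))
```
        </preformat>
        <p>Entities are represented here as (start, end, category, text) tuples with half-open character offsets, which is one convenient convention among several.</p>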
      </sec>
      <sec id="sec-2-2">
        <title>CRF-based Method</title>
        <p>As the CRF algorithm has been widely used for sequence labeling tasks, we also
develop a CRF-based method for clinical entity recognition, using CRF++ as the
CRF implementation. The features used in this method include: n-grams, radical
features, spelling features, word segmentation, part-of-speech, section heads, dictionary
features, relation features, distributed word representations, rule features, etc.</p>
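        <p>As an illustration of the simplest feature type in the list above, the following sketch builds character unigram and bigram features in a window around one position. The window size, the feature names and the padding symbol are assumptions; the sketch does not reproduce the CRF++ feature templates used in our system.</p>
        <preformat>
```python
def char_at(chars, k):
    """Return the character at position k, or a padding symbol outside the sentence."""
    return chars[k] if k >= 0 and len(chars) > k else "PAD"


def ngram_features(chars, i, window=2):
    """Unigram and bigram features in a +/- window around position i."""
    feats = {}
    # Unigrams: C[-2], C[-1], C[0], C[1], C[2] for window=2.
    for off in range(-window, window + 1):
        feats["C[%d]" % off] = char_at(chars, i + off)
    # Bigrams of adjacent characters inside the same window.
    for off in range(-window, window):
        key = "C[%d]C[%d]" % (off, off + 1)
        feats[key] = char_at(chars, i + off) + char_at(chars, i + off + 1)
    return feats


features = ngram_features(list("右侧小脑"), 2)
for name in sorted(features):
    print(name, features[name])
```
        </preformat>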
      </sec>
      <sec id="sec-2-3">
        <title>RNN-based Method</title>
        <p>We also employ a bidirectional LSTM (BI-LSTM) method for entity recognition from Chinese clinical
text, where LSTM (long short-term memory) is a variant of the RNN. The model consists of three main layers:
1) an input layer, which generates the representation
of each word in a sentence; 2) an LSTM layer, which takes the word representation sequence as
input and generates a new sequence that captures the context information of the words, and which
contains both forward and backward LSTM networks; 3) an output layer, which learns the
dependencies between successive labels through a transition matrix, and predicts the best label
sequence according to the output of the LSTM layer.</p>
        <p>To utilize the hand-crafted features constructed in the rule-based and CRF-based
models above, we extend the BI-LSTM network by adding a hidden layer (a fully
connected layer, FC) after the LSTM layer to concatenate the feature representations.
We refer to this model as BI-LSTM-FEA, as shown in Figure 2.</p>
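        <p>To make the role of the transition matrix in the output layer concrete, the sketch below decodes the best label sequence from per-position label scores and a transition matrix. It assumes standard Viterbi decoding, which is the usual choice for this kind of output layer; the function name and the toy scores are hypothetical.</p>
        <preformat>
```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[t][j]: score of label j at position t (e.g. from the LSTM layer).
    transitions[i][j]: score of moving from label i to label j.
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending in label j at this position.
            best_i = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    best_last = max(range(n_labels), key=lambda j: scores[j])
    # Follow the back-pointers from the last position to the first.
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]


# Toy example with two labels; switching labels is penalized by the transitions.
emissions = [[1.0, 0.0], [0.0, 0.5], [1.0, 0.0]]
transitions = [[0.0, -2.0], [-2.0, 0.0]]
print(viterbi_decode(emissions, transitions))
```
        </preformat>
        <p>In the toy example, choosing the best label at each position independently would give the path 0, 1, 0, whereas the transition scores make the consistent path 0, 0, 0 the overall best.</p>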
      </sec>
      <sec id="sec-2-4">
        <title>Voting and Self-training</title>
        <p>As described above, three machine learning-based methods are applied to
clinical entity recognition independently. To take advantage of the different methods,
we use a vote-based approach to combine the entities they predict: a candidate
entity is selected only when exactly the same entity has been predicted by at least two methods.</p>
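        <p>The voting rule can be sketched as follows, assuming that each entity is represented as a (start, end, category) tuple and that two predictions agree only when all three fields are identical. The function name and the toy predictions are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Keep entities predicted identically by at least min_votes methods.

    predictions: one list of (start, end, category) tuples per method.
    """
    counts = Counter()
    for entities in predictions:
        # Use a set so that a method cannot vote twice for the same entity.
        counts.update(set(entities))
    return sorted(e for e, c in counts.items() if c >= min_votes)


crf = [(0, 4, "Body"), (6, 9, "Disease")]
bilstm = [(0, 4, "Body"), (6, 8, "Disease")]
bilstm_fea = [(0, 4, "Body"), (6, 9, "Disease"), (10, 12, "Symptom")]
print(vote([crf, bilstm, bilstm_fea]))
```
        </preformat>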
        <p>In addition to the annotated data, the organizers of the 2017 CCKS challenge also provide a
set of unlabeled records. To explore the contribution of unlabeled data to clinical
entity recognition, we use a self-training approach, as follows: 1) train all individual
methods on the official training set; 2) tag the unlabeled records with each of these methods,
and combine the results by voting; 3) merge the tagged unlabeled data with
the official training data to form a new, larger training set; 4) finally, retrain all
individual methods on this new training set, and tag the official test data.</p>
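        <p>The four self-training steps can be summarized in a short sketch. The interface is an assumption made for illustration: each element of train_fns is a hypothetical callable that maps a dataset (a list of (text, entities) pairs) to a predictor, and a predictor maps a text to a list of (start, end, category) tuples.</p>
        <preformat>
```python
from collections import Counter


def majority_vote(predictions, min_votes=2):
    """Keep entities predicted identically by at least min_votes methods."""
    counts = Counter()
    for entities in predictions:
        counts.update(set(entities))
    return sorted(e for e, c in counts.items() if c >= min_votes)


def self_train(train_fns, labeled, unlabeled):
    """Run one round of self-training and return (retrained models, enlarged set)."""
    # 1) Train every individual method on the official training set.
    models = [train(labeled) for train in train_fns]
    # 2) Tag each unlabeled record with every method and combine by voting.
    pseudo_labeled = [(text, majority_vote([m(text) for m in models]))
                      for text in unlabeled]
    # 3) Merge the automatically tagged records with the training data.
    enlarged = list(labeled) + pseudo_labeled
    # 4) Retrain every method on the enlarged training set.
    return [train(enlarged) for train in train_fns], enlarged
```
        </preformat>
        <p>The retrained models returned by this function would then be applied to the test data and combined by voting again.</p>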
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>
          In the 2017 CCKS CNER challenge, the organizers provided 400 medical records
annotated with five categories of entity, of which 300 are used as a
training set and 100 as a test set. In addition, 2,605 unlabeled records were released as an
unlabeled set. The statistics of entities in the different categories are listed in Table 1.
        </p>
        <p>
All our evaluations are performed on the official test set using the evaluation tool of the
2017 CCKS CNER challenge, which outputs micro-averaged precision (Prec.), recall
(Rec.) and F1-score (F1) under two criteria: “strict”, which checks whether the boundary
and category of an entity exactly match those of a gold entity; and “relaxed”, which only
requires the boundary of an entity to overlap with that of a gold entity of the same category.
The “strict” criterion is the primary one.
        </p>
        <p>
In this study, we directly split sentences into Chinese characters, which
avoids entity boundary errors caused by word segmentation tools. The
“BIOES” (B-begin, I-inside, O-outside, E-end, S-single) tags are used to represent
entities. For the neural network models, we use the stochastic gradient descent (SGD)
algorithm to estimate parameters, and the pre-trained Chinese character embeddings are
learned from the training and unlabeled datasets with the word2vec tool. The feature
representations are randomly initialized from a uniform distribution over [-1, 1].
        </p>
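        <p>The character-level BIOES encoding and the two matching criteria can be illustrated with the sketch below. It is not the official evaluation tool: the function names are hypothetical, entities are assumed to be (start, end, category) tuples with half-open character offsets, and the way overlapping matches are counted under the “relaxed” criterion is an assumption.</p>
        <preformat>
```python
def spans_to_bioes(length, entities):
    """Encode entities as one BIOES tag per character."""
    tags = ["O"] * length
    for start, end, category in entities:
        if end - start == 1:
            tags[start] = "S-" + category
        else:
            tags[start] = "B-" + category
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + category
            tags[end - 1] = "E-" + category
    return tags


def overlaps(a, b):
    """True if two half-open spans share at least one character."""
    return min(a[1], b[1]) > max(a[0], b[0])


def matches(entity, others, relaxed):
    """Strict: identical boundary and category. Relaxed: same category, overlapping."""
    if relaxed:
        return any(entity[2] == o[2] and overlaps(entity, o) for o in others)
    return entity in others


def micro_prf(pred_docs, gold_docs, relaxed=False):
    """Micro-averaged precision, recall and F1 over a list of records."""
    hit_pred = hit_gold = n_pred = n_gold = 0
    for pred, gold in zip(pred_docs, gold_docs):
        hit_pred += sum(matches(p, gold, relaxed) for p in pred)
        hit_gold += sum(matches(g, pred, relaxed) for g in gold)
        n_pred += len(pred)
        n_gold += len(gold)
    precision = hit_pred / n_pred if n_pred else 0.0
    recall = hit_gold / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [[(0, 4, "Body"), (6, 9, "Disease")]]
pred = [[(2, 4, "Body"), (6, 9, "Disease")]]
print(spans_to_bioes(5, [(0, 4, "Body")]))
print(micro_prf(pred, gold))
print(micro_prf(pred, gold, relaxed=True))
```
        </preformat>
        <p>In this toy record, the prediction that misses the first two characters of the Body entity counts as an error under the “strict” criterion but as a match under the “relaxed” one.</p>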
        <p>Table 2 shows the performance of the various methods on the test set. We can see that all
the machine learning-based methods outperform the rule-based method, that the BI-LSTM
model achieves much better F1-scores than the CRF-based method, and that the voted results
(90.17% under the “strict” criterion) outperform all the individual methods. After
applying the self-training approach, the F1-scores of all individual methods are
improved by about 1.0% under the “strict” criterion, and the F1-score of the voted result is
improved by 0.06%. However, the BI-LSTM-FEA model, which performs best on our
validation set (split from the training set), performs much worse than the
other methods on the test set.
We suspect that the poor results of the rule-based method on the test set may cause the poor
performance of the BI-LSTM-FEA model.</p>
        <p>In this study, we proposed a hybrid system based on rule, CRF and RNN methods for
entity recognition from Chinese clinical text. Experiments on the 2017 CCKS corpus
show that our system achieves F1-scores of 91.08% and 94.26% under the “strict” and
“relaxed” criteria respectively, ranking first in this challenge. Among all individual
methods, BI-LSTM outperforms the rule-based and CRF methods. By applying a
self-training approach with unlabeled data, the F1-scores of all machine learning-based
methods are improved by about 1.0% under the “strict” criterion. Our future work
will focus on more effective extraction of body, disease and treatment entities.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is supported in part by the following grants: the National 863 Program of China
(2015AA015405), the National Natural Science Foundation of China (NSFC)
(61573118, 61402128, 61473101, and 61472428), Special Foundation for Technology
Research Program of Guangdong Province (2015B010131010), Strategic Emerging
Industry Development Special Funds of Shenzhen (JCYJ20140627163809422,
20151013161937, JSGG20151015161015297 and JCYJ20160531192358466),
Innovation Fund of Harbin Institute of Technology (HIT.NSRIF.2017052), Program from
the Key Laboratory of Symbolic Computation and Knowledge Engineering of
Ministry of Education (93K172016K12) and CCF-Tencent Open Research Fund
(RAGR20160102).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Uzuner</surname>
          </string-name>
          , Ö.,
          <string-name>
            <surname>Solti</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cadag</surname>
          </string-name>
          , E.:
          <article-title>Extracting medication information from clinical text</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>17</volume>
          ,
          <fpage>514</fpage>
          -
          <lpage>518</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Uzuner</surname>
          </string-name>
          , Ö.,
          <string-name>
            <surname>South</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , DuVall, S.L.:
          <article-title>2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>18</volume>
          ,
          <fpage>552</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Evaluating temporal relations in clinical text: 2012 i2b2 Challenge</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>20</volume>
          ,
          <fpage>806</fpage>
          -
          <lpage>813</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Suominen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salanterä</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velupillai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pradhan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>South</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mowery</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.J.:</given-names>
          </string-name>
          <article-title>Overview of the ShARe/CLEF eHealth evaluation lab 2013</article-title>
          . In:
          <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>231</lpage>
          . Springer, (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Pradhan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elhadad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manandhar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
          </string-name>
          , G.:
          <article-title>SemEval-2014 Task 7: Analysis of Clinical Text</article-title>
          . In: SemEval@ COLING, pp.
          <fpage>54</fpage>
          -
          <lpage>62</lpage>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derczynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pustejovsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhagen</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>SemEval-2015 Task 6: Clinical TempEval</article-title>
          . In: SemEval@ NAACL-HLT, pp.
          <fpage>806</fpage>
          -
          <lpage>814</lpage>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savova</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , W.-T.,
          <string-name>
            <surname>Derczynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pustejovsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>SemEval-2016 Task 12: Clinical TempEval</article-title>
          .
          <source>Proceedings of SemEval</source>
          ,
          <fpage>1052</fpage>
          -
          <lpage>1062</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alderson</surname>
            ,
            <given-names>P.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Austin</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimino</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          , Johnson, S.B.:
          <article-title>A general natural-language text processor for clinical radiology</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>1</volume>
          ,
          <fpage>161</fpage>
          -
          <lpage>174</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Entity recognition from clinical texts via recurrent neural network</article-title>
          .
          <source>BMC Medical Informatics and Decision Making</source>
          <volume>17</volume>
          ,
          <issue>67</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>