<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Conditional Random Fields Approach to Clinical Name Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaoran Yang</string-name>
          <email>xiaoyang.yxr@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenkang Huang</string-name>
          <email>wenkang.hwk@alibaba-inc.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alibaba Health Information Technology Limited</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Clinical named entity recognition (CNER) is an initial step in understanding and using the clinical free text of electronic medical records. The CCKS committee set up a CNER task for recognizing five types of entities: body part, independent symptom, symptom description, operation and drug. For this task, we developed a conditional random fields (CRF) model with character embedding, POS, radical, PinYin, dictionary and rule features. Our best model achieves a strict F1-measure of 0.8926 on the test dataset, which ranked first.</p>
      </abstract>
      <kwd-group>
        <kwd>Name Entity Recognition</kwd>
        <kwd>Electronic Medical Records</kwd>
        <kwd>NER</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the growing scale of electronic medical records, clinical named entity
recognition (CNER) has gradually become an important research topic. Research
progress on CNER in China has been slow due to the lack of uniform standards and public
datasets. To address this, the CCKS 2018 conference in July 2018 set up a CNER
task to identify entities in Chinese clinical text, providing a label specification and a
training dataset.</p>
      <p>Currently, the most effective approaches to named entity recognition are based on machine
learning algorithms, such as support vector machines (SVM) [1], conditional random
fields (CRF) [2], structured support vector machines (SSVM) [3], recurrent neural
networks (RNN) and their variants [4], and convolutional neural networks (CNN) and
their variants [5]. In this paper, we describe our participation in the CCKS 2018 CNER task, for which we
developed a method based on conditional random fields. By evaluating and selecting among a
large number of features, including features of the characters themselves and
features based on external data, we achieved an F1-measure of 0.8926 on the CCKS
2018 CNER task dataset.</p>
    </sec>
    <sec id="sec-2">
      <title>Task Formalism</title>
      <p>Clinical named entity recognition is usually treated as a sequence labeling task.
Given a sentence X = &lt;x1, …, xn&gt;, the goal is to label each character xi according to the
context of X using the BMESO (B-Begin, M-Middle, E-End, S-Single, O-Outside) notation
scheme. CCKS 2018 evaluation task 1 provides an annotated dataset and an
unlabeled dataset with five pre-defined categories (body part, independent symptom,
symptom description, operation and drug). An example of the tag sequence for
“患者2个月前因上腹部不适于我院就诊 (the patient came to our hospital two months ago because of
upper abdominal discomfort)” is shown in Figure 1.</p>
      <p>患 \O 者 \O 2 \O 个 \O 月 \O 前 \O 因 \O 上 \B-BOD 腹 \M-BOD 部 \E-BOD
不 \B-DES 适 \E-DES 于 \O 我 \O 院 \O 就 \O 诊 \O</p>
      <p>Figure 1. Example of a tag sequence in BMESO notation.</p>
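      <p>The BMESO scheme can be illustrated with a minimal Python sketch that converts entity spans into per-character tags. The function name <monospace>bmeso_tags</monospace> and the span format (start, end, label) with an exclusive end index are assumptions made for illustration; they are not part of the CCKS 2018 task definition.</p>
      <preformat><![CDATA[
```python
def bmeso_tags(text, entities):
    """Return one BMESO tag per character of `text`.

    `entities` is a list of (start, end, label) spans, `end` exclusive
    (a hypothetical format chosen for this sketch).
    """
    tags = ["O"] * len(text)
    for start, end, label in entities:
        if end - start == 1:
            # A one-character entity gets the Single tag.
            tags[start] = "S-" + label
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "M-" + label
            tags[end - 1] = "E-" + label
    return tags


sentence = "患者2个月前因上腹部不适于我院就诊"
# Spans follow the example sentence: 上腹部 is a body part (BOD),
# 不适 is a symptom description (DES).
spans = [(7, 10, "BOD"), (10, 12, "DES")]
tags = bmeso_tags(sentence, spans)
for char, tag in zip(sentence, tags):
    print(char, tag)
```
]]></preformat>
      <p>Under these assumptions, the characters 上, 腹 and 部 receive B-BOD, M-BOD and E-BOD, the characters 不 and 适 receive B-DES and E-DES, and every other character receives O.</p>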
      <sec id="sec-2-1">
        <title>Conditional Random Fields (CRF)</title>
        <p>A conditional random field (CRF) is a discriminative, undirected
probabilistic graphical model that has been widely used for sequence labeling problems. Let
x = {x1, …, xn} be a given character sequence, where xi is the input vector composed of the
character and the features of the i-th character, and let y = {y1, …, yn} be a label sequence for x.
Let γ(x) denote the set of all possible label sequences for x. The CRF model defines the
probability of the label sequence y given the character sequence x as:</p>
        <disp-formula><tex-math><![CDATA[p(y \mid x; \theta) = \frac{\exp\left(\sum_{i=1}^{n} \psi(y_i, x_i, \theta)\right)}{\sum_{y' \in \gamma(x)} \exp\left(\sum_{i=1}^{n} \psi(y'_i, x_i, \theta)\right)}]]></tex-math></disp-formula>
        <p>where ψ(yi, xi, θ) is the potential function and θ denotes the parameters of the CRF. In our
work, we use the character rather than the word as the unit for sequence labeling. The log-likelihood
function was used to compute the loss of the CRF layer. Finally, the
Viterbi algorithm was used for decoding.</p>
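        <p>Decoding with the Viterbi algorithm can be sketched as follows. This is a generic pure-Python illustration rather than the implementation used in this work; it assumes log-domain emission scores for each position and label, plus a label-to-label transition score matrix, and the toy scores are invented for the example.</p>
        <preformat><![CDATA[
```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path as a list of label indices.

    emissions[i][k]: score of label k at position i.
    transitions[j][k]: score of moving from label j to label k.
    Scores are in the log domain, so they add along a path.
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_labels):
            # Best previous label for ending in label k at this position.
            best_prev = max(range(n_labels),
                            key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B", "E"]
# Toy transition scores: E may only follow B, and B must be followed by E.
transitions = [
    [0.0, 0.0, -100.0],     # from O
    [-100.0, -100.0, 0.0],  # from B
    [0.0, 0.0, -100.0],     # from E
]
emissions = [
    [1.0, 0.5, 0.0],
    [0.2, 0.9, 1.0],  # a per-position argmax would pick E here
    [0.1, 0.0, 1.0],
]
path = viterbi(emissions, transitions)
print([labels[i] for i in path])  # ['O', 'B', 'E']
```
]]></preformat>
        <p>In this toy example, choosing the best label independently at each position would yield the invalid sequence O, E, E, whereas the transition scores steer the decoder to O, B, E.</p>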
      </sec>
      <sec id="sec-2-2">
        <title>Features</title>
      </sec>
      <sec id="sec-2-3">
        <title>3.2.1 Char Embedding</title>
        <p>Given a sequence X = &lt;x1, …, xn&gt;, a distributed embedding vector is used to
represent the information of each character. Formally, we look up the embedding vector of each
character xi in a character embedding matrix.</p>
        <p>A single English character carries no semantics, whereas Chinese characters often
carry strong semantic information. To exploit this semantic information, we use
cw2vec [7] instead of word2vec to construct the character embedding matrix. Unlike
word2vec, cw2vec introduces the concept of "n-gram strokes", which capture the
semantic structure of n consecutive strokes of Chinese words (or Chinese
characters). We trained a cw2vec model with 128-dimensional embeddings on the CCKS 2018 training and test
corpora.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.2.2 Part-of-Speech (POS)</title>
        <p>Part-of-speech (POS) features can help identify clinical named entities. For example,
body parts often consist of several nouns, such as “右上腹 (the right upper
quadrant)”, and a verb often precedes the name of an operation or a drug,
as in taking a drug or performing a surgery. In this paper, the Python library
Jieba was used as the POS tagger.</p>
      </sec>
      <sec id="sec-2-5">
        <title>3.2.3 Chinese PinYin</title>
        <p>Because clinical text is typed with the Pinyin input method, it contains many entities
with homophone typos, and such entities are not identified. For
example, "右附件 (the right adnexa)" can be identified when written correctly, but it
may be mistakenly typed as "有附件 (have adnexa)", and the
mistyped form cannot be identified. In
addition, some similar Chinese characters with the same pronunciation have the
same meaning. Therefore, we use the PinYin spelling of each character as a feature to help improve
clinical named entity recognition.</p>
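        <p>The effect of a PinYin feature on homophone typos can be shown with a small sketch. The lookup table <monospace>TONELESS_PINYIN</monospace> is a hand-written toy that covers only the characters of the example above; a real system would need a full character-to-PinYin resource, which is not specified here, and the choice of toneless spellings is likewise an assumption.</p>
        <preformat><![CDATA[
```python
# Toy toneless PinYin lookup covering only the example characters (assumption).
TONELESS_PINYIN = {"右": "you", "有": "you", "附": "fu", "件": "jian"}


def pinyin_features(text, table=TONELESS_PINYIN):
    """Map each character to its toneless PinYin, or "UNK" if it is not in the table."""
    return [table.get(char, "UNK") for char in text]


correct = pinyin_features("右附件")
typo = pinyin_features("有附件")
print(correct)          # ['you', 'fu', 'jian']
print(correct == typo)  # True: the typo shares the PinYin feature sequence
```
]]></preformat>
        <p>Because the correct and mistyped forms map to the same PinYin sequence, this feature gives the model evidence that survives the character-level typo.</p>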
      </sec>
      <sec id="sec-2-6">
        <title>3.2.4 Radical</title>
        <p>Chinese characters are composed of smaller units called radicals, much as English words are
composed of letters. These radicals often carry semantic information about the
character. For example, the characters “肠 (intestines)”, “肺 (lung)” and “肝 (liver)” share the
radical “月” and are all related to human body parts. We retrieved the radical
composition of each character from the online Xinhua dictionary (http://tool.httpcn.com/Zi).</p>
      </sec>
      <sec id="sec-2-7">
        <title>3.2.5 Dictionary</title>
        <p>An additional dictionary was constructed from the training set and from open websites and
databases such as DrugBank and “xunyiwenyao”. The bi-directional maximum matching
(BDMM) algorithm [8] was used to find dictionary words appearing in each sequence.
To improve the accuracy of entity boundary recognition, the matches were tagged with the BMESO notation
scheme, which gives more information about each character’s
position.</p>
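        <p>A dictionary feature based on bi-directional maximum matching can be sketched as below. This is a simplified illustration, not the exact algorithm of [8]: the tie-breaking heuristic (fewer tokens, then fewer single-character tokens, otherwise the backward result) is a common convention assumed here, and the function names and the three-entry vocabulary are hypothetical.</p>
        <preformat><![CDATA[
```python
def forward_mm(text, vocab, max_len):
    """Greedy longest match scanning left to right."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_mm(text, vocab, max_len):
    """Greedy longest match scanning right to left."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bdmm(text, vocab):
    """Pick the forward or backward segmentation (assumed heuristic)."""
    max_len = max(map(len, vocab))
    fwd = forward_mm(text, vocab, max_len)
    bwd = backward_mm(text, vocab, max_len)

    def cost(tokens):
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd


def dictionary_feature(text, vocab):
    """Tag each character by its position inside a dictionary match."""
    tags = []
    for token in bdmm(text, vocab):
        if token not in vocab:
            tags.append("O")  # unmatched tokens are always single characters
        elif len(token) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(token) - 2) + ["E"])
    return tags


vocab = {"上腹部", "腹部", "不适"}  # hypothetical dictionary entries
feature = dictionary_feature("因上腹部不适", vocab)
print(feature)  # ['O', 'B', 'M', 'E', 'B', 'E']
```
]]></preformat>
        <p>The resulting position tags can then be supplied to the CRF as one feature per character, alongside the character itself.</p>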
      </sec>
      <sec id="sec-2-8">
        <title>3.2.6 Rule</title>
        <p>Using the dictionaries above and frequent pattern mining [9], we can also find many
medical terms that do not appear in the dictionary. For instance, from the
sequence “行子宫切除术 (do hysterectomy)，” and the entry “子宫切除术 (hysterectomy)” in the
operation dictionary, we can extract the pattern “行 (do)&lt;Operation&gt;，”. Using this
pattern, we can then extract the operation entity “直肠癌切除术 (rectal cancer resection)” from
“行直肠癌切除术 (do rectal cancer resection)，”, even though “直肠癌切除术 (rectal cancer
resection)” is not in the operation dictionary. We also use body part prefixes to extend
body part entities, obtaining for example “左侧卵巢 (the left ovary)” when only “卵巢 (ovary)” is in the body
part dictionary. In this paper, words extracted by patterns were also tagged using the
BMESO notation scheme.</p>
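        <p>The pattern-based extraction described above can be sketched in Python as follows. This is a deliberate simplification of frequent pattern mining [9]: it assumes a pattern is just the single character on each side of a known dictionary term, kept when it occurs at least <monospace>min_count</monospace> times. The function names, the threshold and the example sentences are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter


def induce_patterns(sentences, dictionary, min_count=1):
    """Collect (left, right) context characters around known dictionary terms."""
    contexts = Counter()
    for sentence in sentences:
        for term in dictionary:
            for match in re.finditer(re.escape(term), sentence):
                start, end = match.span()
                # Only keep occurrences with a character on both sides.
                if start > 0 and end < len(sentence):
                    contexts[(sentence[start - 1], sentence[end])] += 1
    return {ctx for ctx, count in contexts.items() if count >= min_count}


def apply_patterns(sentence, patterns):
    """Extract the text found between each pattern's left and right context."""
    found = []
    for left, right in patterns:
        regex = re.escape(left) + "(.+?)" + re.escape(right)
        found.extend(m.group(1) for m in re.finditer(regex, sentence))
    return found


operation_dictionary = {"子宫切除术"}
patterns = induce_patterns(["行子宫切除术，术后恢复可。"], operation_dictionary)
print(patterns)    # {('行', '，')}
candidates = apply_patterns("患者行直肠癌切除术，术后恢复可。", patterns)
print(candidates)  # ['直肠癌切除术']
```
]]></preformat>
        <p>Such single-character contexts will over-generate on real text, so in practice the candidates would need filtering, for example by a higher frequency threshold.</p>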
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Datasets</title>
        <p>The CCKS 2018 CNER task provided 600 annotated records as the training dataset, covering
five types of entities (body part, independent symptom, symptom description, operation
and drug). A further 400 unlabeled records were provided as the test dataset for evaluating the
model. The statistics of the entity types in the training corpus are listed in Table
1. To select features and hyper-parameters, we split the 600 training records into 480
training records and 120 validation records.</p>
        <p>Table 1. Statistics of entity types in the training corpus (columns: Entity, Count).</p>
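        <p>In this section, we compare different combinations of the six types of features in the CRF
model. The comparative results are listed in Table 2.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Feature combinations compared in the CRF model.</p></caption>
          <table>
            <thead>
              <tr><th>Features</th></tr>
            </thead>
            <tbody>
              <tr><td>Char</td></tr>
              <tr><td>Char + Char embedding</td></tr>
              <tr><td>Char + Word segmentation</td></tr>
              <tr><td>Char + Radical</td></tr>
              <tr><td>Char + Radical + POS</td></tr>
              <tr><td>Char + Radical + POS + PinYin</td></tr>
              <tr><td>Char + Radical + POS + PinYin + Dictionary</td></tr>
              <tr><td>Char + Radical + POS + PinYin + Dictionary + Rule</td></tr>
            </tbody>
          </table>
        </table-wrap>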
        <p>The radical, POS and PinYin features improve the result of the CRF model only
slightly, whereas the dictionary and rule features improve it
notably. This suggests that radical, POS and PinYin features may have some
influence on clinical named entity recognition, while dictionary and rule features
bring a clear improvement.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Comparison with the state-of-the-art model</title>
        <p>In this section, we compare the best CRF model with a state-of-the-art model,
Bi-LSTM-CRF, on the test set. The comparative results are summarized in Table 3.</p>
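        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Models compared on the test set.</p></caption>
          <table>
            <thead>
              <tr><th>Model</th></tr>
            </thead>
            <tbody>
              <tr><td>Bi-LSTM-CRF</td></tr>
              <tr><td>Our CRF</td></tr>
            </tbody>
          </table>
        </table-wrap>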
        <p>Comparing the strict and relaxed results, we find that body parts and operations
do not have a high strict F-measure but do have a high relaxed F-measure. This means that the
positions of these entities were found, but not their exact boundaries. A search through
the full test corpus suggests that body part and operation entities lack a
uniform labeling specification.</p>
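        <p>The difference between the strict and relaxed measures can be made concrete with a short sketch. It assumes that a strict match requires the same entity type and identical boundaries, while a relaxed match requires the same type and overlapping spans; the official CCKS 2018 scorer may differ in detail. The function name and the span format (start, end, label) with an exclusive end are hypothetical.</p>
        <preformat><![CDATA[
```python
def f_measure(gold, pred, relaxed=False):
    """F-measure over entity spans given as (start, end, label), end exclusive."""
    def match(a, b):
        if a[2] != b[2]:
            return False
        if relaxed:
            return a[0] < b[1] and b[0] < a[1]  # the two spans overlap
        return a[:2] == b[:2]                   # identical boundaries

    hit_pred = sum(any(match(p, g) for g in gold) for p in pred)
    hit_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Gold: 上腹部 (7-10) as body part, 不适 (10-12) as symptom description.
gold = [(7, 10, "BOD"), (10, 12, "DES")]
# Prediction finds the body part but with the wrong left boundary (腹部).
pred = [(8, 10, "BOD"), (10, 12, "DES")]
print(f_measure(gold, pred))                # 0.5
print(f_measure(gold, pred, relaxed=True))  # 1.0
```
]]></preformat>
        <p>A prediction that locates an entity but misses its boundary is therefore penalized only by the strict measure, which is the pattern observed for body parts and operations.</p>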
        <p>Comparing the two models, the best CRF result may be better than
the Bi-LSTM-CRF result because the dataset is small
relative to the number of entities, so the Bi-LSTM-CRF model overfits
easily. Looking through the results, we also find that the Bi-LSTM-CRF model
identifies more entities, although some of them are wrong. We believe that with a larger
dataset, the Bi-LSTM-CRF results would improve.</p>
        <p>By building a number of features, including features of the characters themselves and features from external
data, we developed a clinical named entity recognition model based on the CRF algorithm.
Compared with the state-of-the-art Bi-LSTM-CRF algorithm, our CRF model achieved
better performance. The reason might be that the corpus is not large enough
and the label specification is not uniform. In the CCKS 2018 CNER task, we achieved
a strict F-measure of 0.8926, which ranked first. In the future, we will focus on extracting
the boundaries of body part and operation entities more effectively.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Asahara</surname>
            <given-names>Masayuki</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsumoto</surname>
            <given-names>Yuji</given-names>
          </string-name>
          :
          <article-title>Japanese named entity extraction with redundant morphological analysis</article-title>
          .
          <source>Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1</source>
          . Association for Computational Linguistics,
          <fpage>8</fpage>
          -
          <lpage>15</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>McCallum</surname>
            <given-names>Andrew</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>Wei</given-names>
          </string-name>
          :
          <article-title>Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons</article-title>
          .
          <source>Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4</source>
          . Association for Computational Linguistics,
          <fpage>188</fpage>
          -
          <lpage>191</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            <given-names>Yuh-Jye</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mangasarian</surname>
            <given-names>Olvi L.</given-names>
          </string-name>
          :
          <article-title>SSVM: A smooth support vector machine for classification</article-title>
          .
          <source>Computational Optimization and Applications</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            <given-names>Zhiheng</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            <given-names>Wei</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>Kai</given-names>
          </string-name>
          :
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>arXiv preprint arXiv:1508.01991</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Strubell</surname>
            <given-names>Emma</given-names>
          </string-name>
          , et al.:
          <article-title>Fast and accurate entity recognition with iterated dilated convolutions</article-title>
          .
          <source>arXiv preprint arXiv:1702.02098</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Xu</surname>
            <given-names>Yan</given-names>
          </string-name>
          , et al.:
          <article-title>Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>21</volume>
          (
          <issue>e1</issue>
          ),
          <fpage>e84</fpage>
          -
          <lpage>e92</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Cao</surname>
            <given-names>Shaosheng</given-names>
          </string-name>
          , et al.:
          <article-title>cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Gai</surname>
            <given-names>Rong Li</given-names>
          </string-name>
          , et al.:
          <article-title>Bidirectional maximal matching word segmentation algorithm with rules</article-title>
          .
          <source>Advanced Materials Research</source>
          , vol.
          <volume>926</volume>
          .
          <publisher-name>Trans Tech Publications</publisher-name>
          ,
          <fpage>3368</fpage>
          -
          <lpage>3372</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Xu</surname>
            <given-names>Dong</given-names>
          </string-name>
          , et al.:
          <article-title>Data-driven information extraction from Chinese electronic medical records</article-title>
          .
          <source>PLoS ONE</source>
          <volume>10</volume>
          (
          <issue>8</issue>
          ),
          <elocation-id>e0136270</elocation-id>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Gross</surname>
            <given-names>Samuel S.</given-names>
          </string-name>
          , et al.:
          <article-title>Training conditional random fields for maximum labelwise accuracy</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            <given-names>Diederik P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            <given-names>Jimmy</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>