<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-supervised learning for disabilities detection on English and Spanish biomedical text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvador Medina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Turmo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henry Loharja</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís Padró</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TALP Research Center</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>66</fpage>
      <lpage>73</lpage>
      <abstract>
        <p>This paper describes the disability detection approaches presented by UPC's TALP 3 team for the DIANN 2018 shared task. The best of those approaches was ranked in 3rd place for exact matching of disability detection. The systems combine a semi-supervised model, based on CRFs or LSTMs with word-embedding features, for the detection of disabilities with a supervised CRF model for the detection of negations.</p>
      </abstract>
      <kwd-group>
        <kwd>disabilities detection</kwd>
        <kwd>biomedical abstracts</kwd>
        <kwd>semi-supervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the approaches built by one of the UPC teams (TALP 3) to
participate in the DIANN 2018 task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The task consisted of automatically
recognizing disabilities mentioned in biomedical-domain text. Two different subtasks were
proposed: one dealing with Spanish text and one dealing with English text, both consisting of
abstracts from biomedical journals.
      </p>
      <p>The paper is organized as follows. Sections 2 and 3 describe the approaches
used to learn the disabilities detection model and the negation detection model,
respectively. The results achieved by our methods are presented and briefly
analyzed in Section 4. Finally, Section 5 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>Learning of the disabilities recognition model</title>
      <p>Our system tackles the disability recognition task as a sequence tagging problem,
mapping each word of a sentence to its corresponding BIO tag. We apply two
alternative sequence tagging models: Conditional Random Field
(CRF) probabilistic graphical models, and recurrent artificial neural networks using
Bidirectional Long Short-Term Memory (BiLSTM) units and
a final CRF layer.</p>
      <p>Due to the relatively small size of the provided training corpus, the two
proposed models are prone to severe overfitting in a completely supervised
learning scenario. To prevent this issue and to add new patterns not
included in the original training set, we applied self-learning to unlabeled abstracts.</p>
      <sec id="sec-2-1">
        <title>Medina S. et Al.</title>
        <sec id="sec-2-1-1">
          <title>Semi-supervised learning method</title>
          <p>Our system iteratively uses self-learning to add new examples to the training set
from an unlabeled corpus. The unlabeled corpus is built by scraping article
abstracts from ScienceDirect<sup>3</sup> and Tesis Doctorals en Xarxa<sup>4</sup>, two websites that
host articles from science journals and PhD theses, respectively. We use the disability
phrases found in the training set as search terms and limit the search to 2000 results,
removing duplicates. In this way, we retrieved 41049 abstracts for English and
38632 for Spanish. We then divide them into 7 batches of 5000 abstracts each; each batch
is used in the corresponding iteration of the self-learning algorithm. Our
implementation takes the training set, a minimum confidence threshold
and the batches as input, and proceeds as described in Algorithm 1.</p>
          <p>Algorithm 1. Pseudo-code of the implemented self-learning algorithm. C<sub>min</sub> is
the confidence threshold, It<sub>max</sub> is the maximum number of iterations and B<sub>i</sub> represents the
i-th unlabeled batch.</p>
          <preformat>i ← 0
m ← fit_model(X_train, Y_train, 0)
evaluations ← ∅
while i &lt; It_max ∧ ¬converges(evaluations) do
    Ŷ_i ← run(m, B_i)
    X_selected, Y_selected ← select_examples(B_i, Ŷ_i, k, C_min)
    X_train ← X_train ∪ X_selected
    Y_train ← Y_train ∪ Y_selected
    m ← fit_model(X_train, Y_train, i)
    evaluations ← evaluations ∪ evaluate(m, X_validation, Y_validation)
    i ← i + 1
end while
return m</preformat>
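          <p>The loop of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used for the submission: the callables fit_model, predict and evaluate, the patience-based convergence test, and the reading of k as "keep at most the k most confident sentences per batch" are all assumptions introduced for the example.</p>
          <preformat>
```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[Sequence[str], Sequence[str]]  # (tokens, BIO tags)


def self_train(
    train: List[Example],
    batches: Sequence[Sequence[Sequence[str]]],
    fit_model: Callable[[List[Example]], object],
    predict: Callable[[object, Sequence[str]], Tuple[Sequence[str], float]],
    evaluate: Callable[[object], float],
    c_min: float,
    k: Optional[int] = None,
    patience: int = 2,
):
    """Grow the training set with confidently tagged unlabeled sentences."""
    train = list(train)
    model = fit_model(train)
    scores: List[float] = []
    for batch in batches:  # one unlabeled batch per iteration
        candidates = []
        for tokens in batch:
            tags, confidence = predict(model, tokens)
            if confidence >= c_min:  # discard low-confidence predictions
                candidates.append((confidence, tokens, tags))
        # Assumption: k caps the number of examples added per iteration.
        candidates.sort(key=lambda c: c[0], reverse=True)
        for _, tokens, tags in candidates[:k]:
            train.append((tokens, tags))
        model = fit_model(train)
        scores.append(evaluate(model))
        # Hypothetical convergence test: stop when the validation score has
        # not improved during the last `patience` iterations.
        if len(scores) > patience and scores[-patience - 1] >= max(scores[-patience:]):
            break
    return model, train, scores
```
          </preformat>
          <p>Passing the model-specific parts as callables keeps the loop identical for the CRF and the BiLSTM-CRF taggers; only the confidence computation inside predict differs between them.</p>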
        </sec>
        <sec id="sec-2-1-2">
          <title>Word-Embedding models</title>
          <p>Word-embedding models are used in both the CRF and the BiLSTM-CRF
models. For the first model, we group all words into 1024 clusters using k-means and
apply the cluster memberships as binary input features. For the second, the full feature vector is
fed to the input layer. We considered the word-embedding models listed below.</p>
          <list list-type="bullet">
            <list-item><p>G-en: general-purpose English word-embedding model of 300 dimensions,
trained on Google News articles<sup>5</sup>.</p></list-item>
            <list-item><p>G-es: general-purpose Spanish word-embedding model of 300 dimensions,
trained from multiple sources<sup>6</sup>.</p></list-item>
            <list-item><p>S-en and S-es: English and Spanish context-specific word-embedding models
of 30 dimensions, trained on the unlabeled corpus fetched from
ScienceDirect using Pennington's GloVe algorithm [<xref ref-type="bibr" rid="ref8">8</xref>].</p></list-item>
          </list>
          <p><sup>3</sup> https://www.sciencedirect.com/</p>
          <p><sup>4</sup> https://www.tesisenred.net/</p>
          <p><sup>5</sup> Downloaded from https://code.google.com/archive/p/word2vec/</p>
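          <p>As an illustration of how embedding clusters can become binary CRF features, the sketch below assigns each token to its nearest centroid. It assumes that the k-means centroids have already been computed; the function names, the lowercasing of tokens and the "cluster=UNK" fallback for unknown words are hypothetical choices made for the example, not details taken from our system.</p>
          <preformat>
```python
import math
from typing import Dict, List, Sequence


def nearest_cluster(vector: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `vector` (Euclidean distance)."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))


def cluster_features(
    tokens: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[Dict[str, bool]]:
    """Build one binary feature per token, e.g. {"cluster=17": True}."""
    features = []
    for token in tokens:
        vector = embeddings.get(token.lower())
        if vector is None:
            # Assumption: tokens without an embedding share one special feature.
            features.append({"cluster=UNK": True})
        else:
            features.append({f"cluster={nearest_cluster(vector, centroids)}": True})
    return features
```
          </preformat>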
        </sec>
        <sec id="sec-2-1-3">
          <title>Conditional Random Field tagger</title>
          <p>Our first sequence tagger consists of a linear-chain CRF tagger with binary
input features, using the implementation provided by Python CRFSuite<sup>7</sup>. In
order to compute the confidence of tagged sequences, required for self-learning,
we compute the probability P(y|x; w) of the assigned labels y with respect to the
input x and the weights w of the feature functions f<sub>j</sub>, as defined in Equation 1.</p>
          <p>
            <disp-formula id="eq1">
              <label>(1)</label>
              <tex-math>P(y \mid x; w) = \frac{\exp\left(\sum_i \sum_j w_j f_j(y_{i-1}, y_i, x, i)\right)}{\sum_{y' \in Y} \exp\left(\sum_i \sum_j w_j f_j(y'_{i-1}, y'_i, x, i)\right)}</tex-math>
            </disp-formula>
          </p>
          <p><bold>Input Features</bold> For each token of the input sentences, we use a combination
of the features listed below, in a window of up to 7 tokens (3 before and 3 after).
If the window extends beyond the beginning or the end of the document, the
special features Begin of Sentence (BOS) and End of Sentence (EOS) are applied.
Sentences are tokenized and analyzed using FreeLing [<xref ref-type="bibr" rid="ref7">7</xref>], a multi-lingual natural
language processing tool.</p>
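          <list list-type="bullet">
            <list-item><p>Word capitalization: all lowercase, all uppercase, first letter uppercase or
combined.</p></list-item>
            <list-item><p>Whether or not the token contains numerical characters.</p></list-item>
            <list-item><p>Prefixes and suffixes of length 3 and 4, padded when necessary.</p></list-item>
            <list-item><p>Part of speech, determined by FreeLing's PoS tagger.</p></list-item>
            <list-item><p>Lemma, determined by FreeLing's lemmatizer.</p></list-item>
            <list-item><p>Word-embedding cluster.</p></list-item>
            <list-item><p>Token, used only in the first iteration of the self-learning algorithm.</p></list-item>
          </list>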
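          <p>The surface features of the list above can be sketched as plain Python dictionaries, a format commonly accepted by CRF toolkits. The sketch is illustrative only: the feature names, the "_" padding character and the way BOS and EOS are flagged are assumptions, and the FreeLing-based features (part of speech and lemma) and the embedding cluster are left out.</p>
          <preformat>
```python
from typing import Dict, Sequence


def shape(token: str) -> str:
    """Capitalization pattern of a token."""
    if token.islower():
        return "lower"
    if token.isupper():
        return "upper"
    if token[:1].isupper() and token[1:].islower():
        return "first_upper"
    return "mixed"


def token_features(token: str) -> Dict[str, object]:
    """Capitalization, digit flag, and padded prefixes/suffixes of length 3 and 4."""
    feats: Dict[str, object] = {
        "shape": shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
    }
    for n in (3, 4):
        feats[f"prefix{n}"] = token.ljust(n, "_")[:n]   # pad short tokens on the right
        feats[f"suffix{n}"] = token.rjust(n, "_")[-n:]  # pad short tokens on the left
    return feats


def window_features(tokens: Sequence[str], i: int, size: int = 3) -> Dict[str, object]:
    """Features of token i and of its neighbours at offsets -size..+size."""
    feats: Dict[str, object] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if j in range(len(tokens)):
            for name, value in token_features(tokens[j]).items():
                feats[f"{offset:+d}:{name}"] = value
        elif offset > 0:
            feats["EOS"] = True  # window runs past the last token
        else:
            feats["BOS"] = True  # window starts before the first token
    return feats
```
          </preformat>
          <p>With size=3 the window covers the 7 tokens mentioned above; each feature name carries the relative offset so that the CRF can weight the same property differently depending on its position.</p>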
        </sec>
        <sec id="sec-2-1-4">
          <title>Bidirectional Long Short-Term Memory model</title>
          <p>The BiLSTM-CRF model is implemented using Python's Keras library with the
TensorFlow backend<sup>8</sup>. The LSTM layers for both directions use the standard LSTM
layer provided by Keras, whereas for the output CRF layer we use the
implementation in the Keras-Contrib extension library<sup>9</sup>. A dropout factor of 0.5 is
applied to the output layer for regularization.</p>
          <p><sup>6</sup> Downloaded from http://crscardellino.me/SBWCE/ [<xref ref-type="bibr" rid="ref2">2</xref>]</p>
          <p><sup>7</sup> Python CRFSuite: Python bindings to CRFSuite [<xref ref-type="bibr" rid="ref6">6</xref>]</p>
          <p><sup>8</sup> Keras: The Python Deep Learning library [<xref ref-type="bibr" rid="ref3">3</xref>]</p>
          <p><sup>9</sup> keras-contrib: Keras community contributions (GitHub)</p>
          <p>We train the network to estimate the probability of each tag assignment
using the Adam optimizer with a categorical cross-entropy loss function. The
probability of the output sequence P(Y<sub>T</sub>) is usually computed as the product of the
conditional probabilities at each time step, P(y<sub>t</sub> | Y<sub>t-1</sub>). However, this is not
practical in our case, as the probability vanishes quickly for long sentences, which would
potentially prioritize shorter ones. To prevent this, we opted for defining the
confidence as the minimum output probability over all time steps, min<sub>t</sub> P(y<sub>t</sub> | Y<sub>t-1</sub>).</p>
          <p><bold>Input Features</bold> In this second model we only consider word embeddings as
input features. For out-of-dictionary words, the average vector of all words in
the training corpus is applied. When both the general-purpose word-embedding
model and the context-specific word-embedding model are used, both feature
vectors are concatenated.</p>
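          <p>The difference between the two confidence definitions can be illustrated with a small sketch. It assumes that step_probs holds, for each time step, the probability assigned to the predicted tag; the function names are hypothetical.</p>
          <preformat>
```python
import math
from typing import Sequence


def product_confidence(step_probs: Sequence[float]) -> float:
    """Product of the per-step probabilities; shrinks as sentences get longer."""
    return math.prod(step_probs)


def min_confidence(step_probs: Sequence[float]) -> float:
    """Minimum per-step probability; does not depend on sentence length."""
    return min(step_probs)


# Two sentences tagged with the same per-step certainty but different lengths.
short_sentence = [0.9] * 5
long_sentence = [0.9] * 50

print(product_confidence(short_sentence), product_confidence(long_sentence))
print(min_confidence(short_sentence), min_confidence(long_sentence))
```
          </preformat>
          <p>Under the product, the long sentence receives a far lower score than the short one although every tag is equally certain, so a fixed threshold would mostly select short sentences; the minimum gives both sentences the same confidence.</p>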
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Learning of the negation detection model</title>
      <p>The approach we use for negation detection is based on the work presented
by Agarwal and Yu [<xref ref-type="bibr" rid="ref1">1</xref>], a CRF-based negation detection system. That work uses
ABNER [<xref ref-type="bibr" rid="ref9">9</xref>], a software tool for molecular biology text analysis,
especially named entity recognition. At ABNER's core is a statistical machine
learning system using linear-chain CRFs with a variety of orthographic and
contextual features. The tool includes a Java API that allows users to incorporate
ABNER into their systems, as well as to train and use models for other data.
We mainly used this Java API in our approach, treating negation detection
as a named entity recognition problem.</p>
      <sec id="sec-3-1">
        <title>Data Preprocessing</title>
        <p>The main task in DIANN is detecting disabilities, whereas negation detection
focuses only on the negations that affect those disabilities. As a result,
very few negation occurrences are annotated in the training data, so this dataset alone
is insufficient for training a negation detection model with our approach. To
tackle this issue, we included additional datasets, BioScope [<xref ref-type="bibr" rid="ref10">10</xref>] for
English and the IULA corpus [<xref ref-type="bibr" rid="ref5">5</xref>] for Spanish, to enrich the DIANN training data
with further negation examples. In this way, we obtain two
datasets for training:</p>
        <list list-type="order">
          <list-item><p>English training dataset: sentences with annotated negation from both the English
DIANN training data and the BioScope abstracts.</p></list-item>
          <list-item><p>Spanish training dataset: sentences with annotated negation from both the Spanish
DIANN training data and the IULA corpus.</p></list-item>
        </list>
        <p>We pre-processed both corpora by enriching the raw data with BIO tags so that
they can be used as input for training. Each token is tagged with "|B-S" if
it is at the beginning of a negation scope, "|I-S" if it is inside a negation scope,
or "|O" if it is outside of any scope. In order to capture the information of the
negation cue, we append the suffix "C" to the tag if the token is part of a negation
cue. An example of a sentence in this format is:</p>
        <preformat>Five|O hundred|O twenty-five|O infants|O without|B-SC risk|I-S factors|I-S</preformat>
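        <p>The tagging scheme above can be reproduced with a few lines of Python. This is a sketch under assumptions: the scope is given as a contiguous range of token indices and the cue as a set of indices, which is a hypothetical input format rather than the one used by BioScope, IULA or DIANN, and a sentence is assumed to contain a single scope.</p>
        <preformat>
```python
from typing import List, Sequence, Set


def tag_negation(tokens: Sequence[str], scope: range, cue: Set[int]) -> str:
    """Render tokens as token|tag with B-S/I-S scope tags and a C suffix for cues."""
    out: List[str] = []
    for i, token in enumerate(tokens):
        if i not in scope:
            tag = "O"
        elif i == scope.start:
            tag = "B-S"  # first token of the negation scope
        else:
            tag = "I-S"  # any later token inside the scope
        if i in cue:
            tag += "C"   # token belongs to the negation cue
        out.append(f"{token}|{tag}")
    return " ".join(out)


tokens = "Five hundred twenty-five infants without risk factors".split()
print(tag_negation(tokens, scope=range(4, 7), cue={4}))
```
        </preformat>
        <p>For the example sentence, with the scope covering "without risk factors" and "without" as the cue, the function returns the tagged line shown above.</p>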
      </sec>
      <sec id="sec-3-2">
        <title>Training The Model for Negation Detection</title>
        <p>Our goal with negation detection is to classify whether each word inside a
sentence is part of a negation (scope or cue) or not. With this formulation, we
can use ABNER, a CRF-based NER tool, as a platform for negation detection
by adapting it to that goal. We assign one of three classes to each
word: Scope, Cue, or Out. A word classified as Out is not part
of either a negation scope or a cue. Figure 3.2 shows the flow of our negation
detection approach.</p>
        <p>We used the CRF-based system in ABNER to train a negation detection
model on the training data prepared as described in Section
3.1. The training process uses ABNER's default orthographic feature set. The simplest
and clearest feature set is the
vocabulary of the training data. Generalizations over how the words are written
(capitalization, affixes, etc.) are also relevant. The current approach includes the
training vocabulary, 17 orthographic features based on regular expressions (e.g.,
Alphanumeric, HasDash, HasDigit), and prefixes and suffixes of
three and four characters. As an example, the word 'without' has two
prefix features, Prefix3='wit' and Prefix4='with', and two suffix features,
Suffix3='out' and Suffix4='hout'. To model local context in a simple way,
neighboring words in the window [-1, 1] are also added as features. For example,
the middle token in the sequence "with no symptoms" has the features Word='no',
Neighbor='with', and Neighbor='symptoms'. Words are also assigned a
generalized word class in which capital letters are replaced by 'A', lowercase
letters by 'a', digits by '0', and all other characters by '_'. There is a similar
"brief word class" feature which collapses consecutive identical character types
into one. For example, the words "EX3" and "SHA1" are given the word classes
WC=AA0 and WC=AAA0, respectively, and both share BWC=A0, while "N-folds" and "T-cells" both
are assigned WC=A_aaaaa and BWC=A_a.</p>
        <p>After applying the resulting negation detection model to the test set, we
post-process the output to convert the BIO format into the format required by the task.</p>
        <table-wrap>
          <caption>
            <p>Precision (P), recall (R) and F1 for Spanish, with exact and partial matching.</p>
          </caption>
          <table>
            <thead>
              <tr><th rowspan="2">Task</th><th rowspan="2">Approach</th><th colspan="3">Exact</th><th colspan="3">Partial</th></tr>
              <tr><th>P</th><th>R</th><th>F1</th><th>P</th><th>R</th><th>F1</th></tr>
            </thead>
            <tbody>
              <tr><td>Spanish disability</td><td>CRF1</td><td>0.814</td><td>0.594</td><td>0.687</td><td>0.898</td><td>0.655</td><td>0.758</td></tr>
              <tr><td>Spanish disability</td><td>CRF2</td><td>0.807</td><td>0.603</td><td>0.69</td><td>0.889</td><td>0.664</td><td>0.76</td></tr>
              <tr><td>Spanish disability</td><td>LSTM</td><td>0.67</td><td>0.603</td><td>0.634</td><td>0.743</td><td>0.668</td><td>0.703</td></tr>
              <tr><td>Spanish neg disability</td><td>CRF1</td><td>0.647</td><td>0.5</td><td>0.564</td><td>0.941</td><td>0.727</td><td>0.821</td></tr>
              <tr><td>Spanish neg disability</td><td>CRF2</td><td>0.647</td><td>0.5</td><td>0.564</td><td>0.941</td><td>0.727</td><td>0.821</td></tr>
              <tr><td>Spanish neg disability</td><td>LSTM</td><td>0.688</td><td>0.5</td><td>0.579</td><td>1</td><td>0.727</td><td>0.842</td></tr>
              <tr><td>Spanish all</td><td>CRF1</td><td>0.779</td><td>0.555</td><td>0.648</td><td>0.89</td><td>0.633</td><td>0.74</td></tr>
              <tr><td>Spanish all</td><td>CRF2</td><td>0.772</td><td>0.563</td><td>0.652</td><td>0.88</td><td>0.642</td><td>0.742</td></tr>
              <tr><td>Spanish all</td><td>LSTM</td><td>0.64</td><td>0.559</td><td>0.597</td><td>0.735</td><td>0.642</td><td>0.685</td></tr>
              <tr><td>Spanish all</td><td>IXA</td><td>0.746</td><td>0.795</td><td>0.77</td><td>0.82</td><td>0.873</td><td>0.846</td></tr>
            </tbody>
          </table>
        </table-wrap>
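        <p>The word class, brief word class and affix features described in this section can be sketched as follows. The code only illustrates the definitions given in the text; it is not ABNER's Java implementation, and the function names and the feature-name format are chosen for the example.</p>
        <preformat>
```python
import re
from typing import Dict


def word_class(token: str) -> str:
    """Map capitals to 'A', lowercase letters to 'a', digits to '0', the rest to '_'."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append("_")
    return "".join(out)


def brief_word_class(token: str) -> str:
    """Collapse each run of identical class characters into a single character."""
    return re.sub(r"(.)\1+", r"\1", word_class(token))


def affix_features(token: str) -> Dict[str, str]:
    """Prefixes and suffixes of three and four characters."""
    feats = {}
    for n in (3, 4):
        if len(token) >= n:
            feats[f"Prefix{n}"] = token[:n]
            feats[f"Suffix{n}"] = token[-n:]
    return feats


for word in ("EX3", "SHA1", "N-folds", "T-cells"):
    print(word, word_class(word), brief_word_class(word))
print(affix_features("without"))
```
        </preformat>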
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We performed three executions for each language (Spanish and English). Each
execution combines the semi-supervised method described in Section 2.1, one tagger
model from those presented in Sections 2.3 and 2.4 (CRF and BiLSTM) and one
or two of the word embeddings described in Section 2.2 (G-es, S-es, G-en, S-en).
Concretely, the approaches were:</p>
      <list list-type="bullet">
        <list-item><p>CRF1: CRF combined with the specific word embeddings (S-es or S-en).</p></list-item>
        <list-item><p>CRF2: CRF combined with both the specific and the general word embeddings
(S-es+G-es or S-en+G-en).</p></list-item>
        <list-item><p>BiLSTM: BiLSTM combined with the specific and the general word embeddings.</p></list-item>
      </list>
      <p>As far as the [LANG] all and [LANG] disability results are concerned, our best
approach was CRF2 for Spanish and CRF1 for English, although for the latter
the difference with CRF2 does not seem statistically significant. Hence, the
use of both the specific and the general word embeddings seems to be the better
choice for detecting disabilities and, consequently, for the full task of detecting
disabilities together with their scopes and cues when negated.</p>
      <p>On the one hand, however, our results for the detection of disabilities are
around 10 points lower than those achieved by the best approach presented
in DIANN for Spanish, and from 13 (exact) to 14 (partial) points lower for English. This
leads to a similar behaviour of our approaches for the whole DIANN task. Our
hypothesis for these differences is that the resulting taggers are biased towards
better precision than recall, which penalizes the overall F1 score.</p>
      <p>On the other hand, the F1 scores for the whole DIANN task show interesting
results. For Spanish, CRF2 was ranked 3rd for both exact and partial matching,
although the differences with the 2nd place are not statistically significant. For
English, CRF1 was also ranked 3rd for exact matching, but it fell to 6th
place for partial matching.</p>
      <p>Regarding negated disabilities, the results are more difficult to analyze, mainly
because the number of negated disabilities is only around 30-40, both in the test
corpus and in the training corpus, which is too small to support a
statistically representative comparison.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper we have described our participation in the DIANN 2018 task of
disabilities detection for Spanish and English biomedical text. Our best approach
to find exact matches consisted of a semi-supervised approach combining CRF,
medical domain-specific word embeddings and context-independent word
embeddings for both Spanish and English. This approach was ranked in 3rd position
in the official results, although far from the 1st ranked system: from 11 to 13 points less
in F1 score.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This contribution has been partially funded by the Spanish Ministry of Economy
(MINECO) and the European Union (TIN2016-77820-C3-3-R and AEI/FEDER,UE).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Biomedical negation scope detection with conditional random fields</article-title>
          .
          <volume>17</volume>
          ,
          <fpage>696</fpage>
          –
          <lpage>701</lpage>
          (
          <month>November</month>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cardellino</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <source>Spanish Billion Words Corpus and Embeddings</source>
          (March
          <year>2016</year>
          ), http://crscardellino.me/SBWCE/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.: Keras. https://keras.io (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fabregat</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Romo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Araujo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Overview of the DIANN task: Disability annotation task at IberEval 2018</article-title>
          .
          <source>In: Proceedings of the Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval'18)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vivaldi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Annotation of negation in the IULA Spanish Clinical Record Corpus</article-title>
          .
          <source>In: Proceedings of the Workshop Computational Semantics Beyond Events and Roles</source>
          . pp.
          <fpage>43</fpage>
          –
          <lpage>52</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>CRFsuite: a fast implementation of conditional random fields (CRFs)</article-title>
          (
          <year>2007</year>
          ), http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanilovsky</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          .
          <source>In: Proceedings of the Language Resources and Evaluation Conference (LREC</source>
          <year>2012</year>
          ). ELRA, Istanbul, Turkey (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          ), http://www.aclweb.org/anthology/D14-1162
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text</article-title>
          .
          <source>Bioinformatics</source>
          <volume>21</volume>
          (
          <issue>14</issue>
          ),
          <fpage>3191</fpage>
          –
          <lpage>3192</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vincze</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szarvas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farkas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Móra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Csirik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>9</volume>
          (
          <issue>11</issue>
          ),
          <fpage>S9</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>