TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pages 65-70

A Hybrid Bi-LSTM-CRF Model for Knowledge Recognition from eHealth Documents

Un Modelo Híbrido Bi-LSTM-CRF para el Reconocimiento de Conocimiento a partir de documentos electrónicos de eSalud

Renzo M. Rivera Zavala, Paloma Martínez, Isabel Segura-Bedmar
Computer Science Department, University Carlos III of Madrid
100371920@alumnos.uc3m.es, pmf@inf.uc3m.es, isegura@inf.uc3m.es

Abstract: In this work, we describe a deep learning architecture for Named Entity Recognition (NER) in biomedical texts. The architecture has two bidirectional Long Short-Term Memory (LSTM) layers and a final layer based on Conditional Random Fields (CRF). Our system obtained first place in subtask A (identification) of TASS-2018-Task 3 eHealth Knowledge Discovery, with an F1 of 87.2%.
Keywords: NER, Bi-LSTM, CRF, Information Extraction

Resumen: En este trabajo, describimos una arquitectura Deep Learning para el reconocimiento de entidades nombradas (NER) en textos biomédicos. La arquitectura se compone de dos capas bidireccionales LSTM (Long Short-Term Memory) y una última capa basada en Conditional Random Fields (CRF). Nuestro sistema obtuvo el primer puesto en la subtarea A (identificación) de la competición TASS-2018-Task 3 eHealth Knowledge Discovery, con una F1 de 87.2%.
Palabras clave: NER, Bi-LSTM, CRF, Extracción de Información

ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

Currently, the amount of biomedical literature is growing at an exponential rate. The substantial number of research works makes it extremely difficult for researchers to keep up with new developments in their research areas. Therefore, the effective management of large amounts of information, and the accuracy of the extracted knowledge, is a vital task. Named Entity Recognition (NER) is one of the fundamental tasks of biomedical text mining; its aim is to identify pieces of text that refer to specific entities of interest.

There are different approaches to the NER problem. Among them, we find dictionary-based methods, which are limited by the size of the dictionary, spelling errors, the use of synonyms, and the constant growth of the vocabulary. Rule-based methods and classical machine learning methods usually require both syntactic and semantic features, as well as characteristics of the language of the specific domain. One of the most effective methods is Conditional Random Fields (CRF) (Lafferty, McCallum, and Pereira, 2001). Recently, deep learning-based methods have also demonstrated state-of-the-art performance by automatically learning relevant patterns from corpora, which makes them largely independent of a specific language or domain. However, until now, deep learning methods have not been able to provide better results than those obtained by classical machine learning methods (Limsopatham and Collier, 2016).

In this paper, we propose a hybrid model combining two bidirectional Long Short-Term Memory (Bi-LSTM) layers with a CRF layer. To do this, we adapt the NeuroNER model proposed in (Dernoncourt, Lee, and Szolovits, 2017) for subtask A (identification) of TASS-2018-Task 3 eHealth Knowledge Discovery (Martínez-Cámara et al., 2018). Specifically, we have extended NeuroNER by adding context information, Part-of-Speech (PoS) tags, and information about overlapping or nested entities. Moreover, in this work, we use two pre-trained embedding models: i) a word2vec model (Spanish Billion Word Embeddings (Cardellino, 2016)), which was trained on the 2014 dump of Wikipedia, and ii) a sense-disambiguation embedding model (Trask, Michalak, and Liu, 2015).
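To make the difference between the two embedding models concrete, a toy lookup can be sketched in a few lines: a plain word-embedding table assigns one vector per surface form, while a sense2vec-style table keys its vectors on the word together with its PoS tag. All tokens and vector values below are invented for the illustration; they are not taken from the pre-trained models cited above.

```python
# Toy illustration: one vector per word vs. one vector per (word, PoS) sense.
# All vectors here are tiny made-up examples, not real model weights.

word_vectors = {
    "cura": [0.1, 0.9],            # a single vector, with all senses conflated
}

sense_vectors = {
    ("cura", "NOUN"): [0.8, 0.1],  # "cure"/"priest" as a noun
    ("cura", "VERB"): [0.0, 0.7],  # "(he/she) cures" as a verb
}

def lookup(token, pos):
    """Prefer the sense-specific vector; fall back to the plain word vector."""
    return sense_vectors.get((token, pos), word_vectors.get(token))

print(lookup("cura", "VERB"))  # -> [0.0, 0.7], the verb-sense vector
print(lookup("cura", "ADJ"))   # -> [0.1, 0.9], fallback to the word vector
```

In the real system the sense-specific vectors come from the pre-trained Reddit sense2vec model described in Section 2.2.2, and both kinds of vectors are concatenated into the network input rather than used as alternatives.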
The rest of the paper is organized as follows. In Section 2, we describe the architecture of our system. Section 3 presents the results. In Section 4, we provide the conclusions.

2 System Description

2.1 Pre-processing

All texts were preprocessed in four steps. First, sentences were split using spaCy (Space.io, 2018), an open-source library for advanced natural language processing with support for 26 languages. Second, sentences and their annotated entities were transformed to the BRAT format (http://brat.nlplab.org/standoff.html), a standoff format similar to the BioNLP Shared Task standoff format. Then, sentences were tokenized. Finally, each token in a sentence was annotated using the BMEWO-V extended tag encoding, to capture information about the sequence of tokens in a given sentence.

The BIOES label scheme, introduced in the work of (Borthwick et al., 1998), arose to overcome the limitations of the BIO scheme for the representation of discontinuous entities: BIOES coding distinguishes the end of an entity through the E (End) tag and adds the S (Single) tag to denote entities composed of a single token. The BMEWO-V encoding used here distinguishes the B tag to indicate the start of an entity, the M tag to indicate the continuation of an entity, the E tag to indicate the end of an entity, the W tag to indicate a single-token entity, and the O tag for tokens that do not belong to any entity. The V tag represents overlapping entities. This encoding scheme therefore allows the representation of discontinuous entities as well as overlapping or nested entities.

2.2 Learning Transfer

In our work, we propose two different embeddings as input to our network: word embeddings and sense-disambiguation embeddings. Below we describe them in more detail.

2.2.1 Word Embeddings

Word embedding is an approach to represent words as vectors of real numbers. There are different methods to obtain these vectors, such as probabilistic models and neural networks. In recent years, neural networks for training word embedding models have gained a lot of popularity in the NLP community because they are able to capture syntactic and semantic information among words. The most popular methods are word2vec (Le and Mikolov, 2014), GloVe, the global model of aggregated word-word co-occurrence statistics (Pennington, Socher, and Manning, 2014), and the morphological representations of fastText (Bojanowski et al., 2017).

In this work, we used the Spanish Billion Words model (Cardellino, 2016), a pre-trained word embedding model trained on different text corpora written in Spanish (such as the AnCora corpus (Taulé, Martí, and Recasens, 2008) and Wikipedia). The details of the pre-trained model are the following:

• Corpus size: approximately 1.5 billion words
• Vocabulary size: 1,000,653
• Vector size: 300
• Algorithm: Skip-gram Bag of Words

2.2.2 Sense-Disambiguation Embedding

We also used the sense2vec model (Trask, Michalak, and Liu, 2015), which provides multiple embeddings for each word based on the sense of the word: it analyzes the context of a word and then assigns its most adequate vector. In this work, we used a pre-trained model generated with the sense2vec tool: the Reddit Vector model presented by (Trask, Michalak, and Liu, 2015), with 22 million words represented as 128-feature vectors, trained on a collection of comments published on Reddit during 2015. The pre-trained Reddit vectors support "senses" that are either partial or full PoS tags or entity tags. The details of the pre-trained model are the following:

• Corpus size: approximately 2 billion words
• Vocabulary size: 1 million
• Vector size: 128
• Algorithm: sense2vec

Figure 1: Overview architecture of our hybrid Bi-LSTM-CRF model.

2.3 The network

2.3.1 Character Embedding Bi-LSTM layer

Although word embeddings are able to capture syntactic and semantic information, other linguistic information such as morphology, orthographic transcription or PoS tags is not exploited. According to (Ling et al., 2015), the use of character embeddings improves learning for specific domains and is useful for morphologically rich languages. For this reason, we decided to include a character embedding representation in our system. We used a vector of 25 dimensions to represent each character. The character alphabet includes all 121 unique characters in the TASS-2018-Task 3 eHealth Knowledge Discovery training, development and test datasets, plus the PADDING token. In this way, tokens in sentences are represented by their corresponding character embeddings, which are the input to the first Bi-LSTM network.

2.3.2 Word and Sense embedding Bi-LSTM layer

The output of the first layer is concatenated with the word embeddings and the sense-disambiguation embeddings of the tokens in a given input sentence. This concatenation of features is the input to the second Bi-LSTM layer. The goal of this layer is to obtain a sequence of probabilities corresponding to each label of the BMEWO-V encoding format. In this way, for each input token, this layer returns six probabilities (one for each tag in BMEWO-V). The final tag should be the one with the highest probability.

The parameters and hyperparameters of the model are the following:

• Word embedding dimension: 300
• Character embedding dimension: 25
• Hidden layer dimension: 100 (for each LSTM, i.e., the forward and the backward layer)
• Learning method: SGD, learning rate: 0.005
• Dropout: 0.5
• Epochs: 100

2.3.3 Conditional Random Fields (CRF) layer

To improve the accuracy of the predictions, we also used a trained CRF model, which takes as input the output of the previous layer and obtains the most probable sequence of predicted labels.

2.4 Post-processing

Once tokens have been annotated with their corresponding labels in the BMEWO-V encoding format, the entity mentions must be transformed back to the BRAT format. V tags, which identify nested or overlapping entities, are generated as new annotations within the scope of other mentions.

3 Evaluation

3.1 Datasets

The evaluation of the proposed model was carried out using the annotated corpus of TASS-2018-Task 3 eHealth Knowledge Discovery (https://github.com/tass18-task3/data). The training set is made up of 5 documents with 3276 entity annotations. The development set consists of 1 document with 1958 entity annotations. The test set consists of 1 document (see Table 2). There are two types of entities: concepts and actions. For this reason, tokens can be annotated with different labels (see Table 1) following the BMEWO-V encoding format.

Entity    Tags
Concept   B/M/E/W/V-Concept
Action    B/M/E/W/V-Action
Others    O

Table 1: Token tags in a sentence

Datasets     Files  Concept  Action
Train        5      2427     849
Development  1      1525     434
Test         1      0        0

Table 2: Dataset statistics

In our experiments, we used precision, recall and F1 score to evaluate the performance of our system. TASS-2018-Task 3 considers two different criteria: partial matching (a tagged entity name is correct only if there is some overlap between it and a gold entity name) and exact matching (a tagged entity name is correct only if its boundaries exactly match those of a gold entity name). A detailed description of the evaluation is available on the web (http://www.sepln.org/workshops/tass/2018/task-3/evaluation.html). Moreover, we used the evaluation script (https://github.com/TASS18-Task3/data/blob/master/score_training.py) provided by the shared task organizers to evaluate our system.

3.2 Results

As described above, our system is based on a network with two Bi-LSTM layers and a final CRF layer. In the first Bi-LSTM layer, we consider the character embeddings. In the second layer, we concatenate the output of the first layer with the word embeddings and the sense-disambiguation embeddings. Finally, the last layer uses a CRF to obtain the most suitable label for each token.

Table 3 compares the results obtained by the original NeuroNER system with our extended version using the pre-trained embedding models and the BMEWO-V encoding format. Our extended version of NeuroNER achieves a significant improvement of 6.8 points of F1 (from 0.804 to 0.872).

System         P      R      F1
NeuroNER       0.824  0.785  0.804
ext. NeuroNER  0.862  0.882  0.872

Table 3: Comparison of NeuroNER and our extended version.

In subtask A (identification of key phrases), our system obtained the top micro F1 (0.872) (see Table 4), significantly outperforming the rest of the participating systems. We will review the proposed systems in greater depth in order to establish comparisons and possible improvements to our implementation.

System             P      R      F
Extended NeuroNER  0.862  0.882  0.872
plubeda            0.77   0.81   0.79
upf-upc            0.86   0.75   0.80
VSP                0.31   0.32   0.32
Marcelo            0.11   0.32   0.17

Table 4: Results of the participating systems in subtask A.

4 Conclusions

Named Entity Recognition (NER) is a crucial task in text mining. In this work, we propose a hybrid Bi-LSTM and CRF model that adds sense-disambiguation embeddings and an extended tag encoding format to detect discontinuous entities, as well as overlapping or nested entities. Our system achieves satisfactory performance without requiring specific domain knowledge or hand-crafted features. It is also important to highlight its language independence, which is key for multi-language tasks. Our results demonstrate that the extended BMEWO-V encoding improves the predictions. Moreover, the pre-trained models help to reduce training time and increase labeling accuracy, achieving the highest F1 for subtask A.

We plan to try other embedding models such as fastText, which contains morphological information. Moreover, we will extend the encoding format to capture distinct types of overlapping or nested entities.

Acknowledgement

This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain (project DeepEMR: Clinical information extraction using deep learning and big data techniques, TIN2017-87548-C2-1-R).

References

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Borthwick, A., J. Sterling, E. Agichtein, and R. Grishman. 1998. Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. Technical report.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings.

Dernoncourt, F., J. Y. Lee, and P. Szolovits. 2017. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 97–102. Association for Computational Linguistics.

Lafferty, J., A. McCallum, and F. C. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.

Le, Q. and T. Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Limsopatham, N. and N. Collier. 2016. Learning orthographic features in bi-directional LSTM for biomedical named entity recognition. In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pages 10–19.

Ling, W., T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo-Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Pennington, J., R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Space.io. 2018. spaCy · Industrial-strength Natural Language Processing in Python.

Taulé, M., M. A. Martí, and M. Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In LREC 2008, pages 96–101.

Trask, A., P. Michalak, and J. Liu. 2015.
sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, abs/1511.06388.
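As a supplementary illustration of the CRF layer described in Section 2.3.3, whose job is to pick a globally consistent tag sequence rather than the best tag for each token independently, the following self-contained sketch decodes a sequence with the Viterbi algorithm over a reduced BMEWO tag set. All scores and the transition constraints are invented for the example; they are not the weights of the trained model, and the V tag is omitted for brevity.

```python
# Minimal Viterbi decoder sketch: given per-token tag scores (as the second
# Bi-LSTM layer would emit) and tag-transition scores (as a CRF learns),
# recover the highest-scoring tag sequence. All numbers are illustrative.

TAGS = ["B", "M", "E", "W", "O"]  # reduced BMEWO set (V omitted for brevity)

def transition(prev, cur):
    """Heavily penalize structurally impossible moves such as O -> E."""
    impossible = {("O", "M"), ("O", "E"), ("B", "B"), ("B", "O")}
    return -100.0 if (prev, cur) in impossible else 0.0

def viterbi(emissions):
    """emissions: list of {tag: score} dicts, one per token."""
    # best[tag] = (score of best path ending in tag, that path)
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for em in emissions[1:]:
        new_best = {}
        for cur in TAGS:
            prev, (score, path) = max(
                ((p, best[p]) for p in TAGS),
                key=lambda x: x[1][0] + transition(x[0], cur),
            )
            new_best[cur] = (score + transition(prev, cur) + em[cur], path + [cur])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

# Made-up token scores for a three-token entity mention: choosing each tag
# greedily would give B, O, E, which is not a well-formed BMEWO sequence.
ems = [
    {"B": 0.9, "M": 0.1, "E": 0.0, "W": 0.3, "O": 0.1},
    {"B": 0.1, "M": 0.6, "E": 0.2, "W": 0.0, "O": 0.7},
    {"B": 0.0, "M": 0.1, "E": 0.8, "W": 0.1, "O": 0.2},
]
print(viterbi(ems))  # -> ['B', 'M', 'E'], a valid entity span
```

The trained CRF plays the same role with learned, real-valued transition weights instead of the hard -100 penalties used here.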