IXA-NER-RE at eHealth-KD Challenge 2020:
Cross-Lingual Transfer Learning for Medical
Relation Extraction
Edgar Andrésa , Oscar Sainza , Aitziber Atutxaa and Oier Lopez de Lacallea
a
    IXA NLP Group, University of the Basque Country (UPV/EHU)


                                         Abstract
The eHealth-KD 2020 challenge set out this year an automatic extraction task over a broad range of knowledge from health documents written in Spanish. Our group participated in all the proposed scenarios: the main one, the Named Entity Recognition (NER) subtask, the Relation Extraction (RE) subtask, and the alternative domain, obtaining very different results in each of them. The main task was conceived as a pipeline of the NER and RE subtasks, each developed independently of the other. The Named Entity Recognition subtask was approached with a basic sequence-labeling system that applies a general-purpose Language Model and static embeddings. Unlike the NER subtask, in the RE subtask several approaches were successfully explored: first, transfer learning methods as a way to measure how well pre-trained language models adapt to both the medical domain and the Spanish language; second, Matching the Blanks, to tackle the reduced size of the training corpus by producing relation representations directly from untagged text. As mentioned, the results in the different tasks were heterogeneous: while the NER result is average (F1 0.66), with ample room for improvement, the RE result was outstanding, obtaining first place in that subtask (F1 0.633), more than 3 points above the next-ranked system, which demonstrates the soundness of the proposed techniques.

Keywords
Language Models, Matching the Blanks, Named Entity Recognition, Relation Extraction




1. Introduction
In this paper we describe our participation in the eHealth-KD 2020 shared task [1], which consists of extracting structured semantic information from Spanish medical texts. The challenge is divided into two main tasks proposed as a pipeline. The first task is devoted to the identification and classification of medical entities. In the second task, participants need to detect the semantic relations between the entities, presumably those discovered in the first task.
   The organizers proposed different evaluation schemes in which 1) systems are evaluated on the whole pipeline at once (main evaluation), and 2) entity recognition and relation extraction are evaluated separately (task A and task B, respectively). Our system is built on top of two independent components and, thus, training and development of each component is carried out separately on its specific subtask.

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: eandres011@ehu.eus (E. Andrés); osainz006@ehu.eus (O. Sainz); aitziber.atutxa@ehu.eus (A. Atutxa);
oier.lopezdelacalle@ehu.eus (O.L.d. Lacalle)
                                       © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   We approached the Named Entity Recognition (NER) task1 with a character-based BiLSTM sequence labeler [2] trained on the training set provided by the organizers, employing both a pre-trained general-domain Language Model and static word embeddings. Regarding the relation extraction task2, we decided to solve it with a transfer learning strategy, fine-tuning existing multilingual pre-trained language models [3] on the annotated data of the task.
   Note that we propose a system with heterogeneous components, each part of the system having its own goal. Accordingly, the goals of our participation in the task are twofold.
    • Entity recognition: Our goal was to check the suitability of a character-based, pre-trained Language Model system in a heterogeneous NER setting. Pre-trained Language Model based systems have been successfully used in Medical Entity Recognition (MER) tasks. But unlike other similar challenges that involved MER (CLEF eHealth 2020, PharmacoNER 2019), the present task is especially challenging because the entities are not purely medical but very heterogeneous, not only semantically, with respect to the domain, but also syntactically.
    • Relation extraction: Our main goal in using large multilingual pre-trained language models is to measure their ability to adapt to the medical domain and the Spanish language through transfer learning. In addition, we experiment with adapting Matching the Blanks [4] (MTB) to the eHealth-KD 2020 setting.
  The system obtains very uneven results: while in relation extraction we outperform the rest of the participants by a wide margin (3.4 points better than the second-ranked system), the system leaves large room for improvement in entity recognition (we are still more than 10 points below the best systems). Overall, our system shows very competitive results, with an F1 of 0.557 in the main task.


2. Related Work
Entity recognition MER, as opposed to NER, shows certain specificities [5], such as the descriptive nature of the entities, their productivity, and the massive use of acronyms. These specificities, and the fact that static embeddings were systematically employed by NER systems, led researchers to use in-domain corpora, as opposed to general-domain corpora, to train both the MER systems and the static pre-trained embeddings, since controlling the domain gives better control over polysemy ([6], [7]). Recently, the performance of both NER and MER tasks has shown a significant breakthrough with the introduction of contextualized word embeddings (ELMo [8], ULMFiT [9], BERT [10] and FLAIR [2]).
   Although contextualized embeddings seem to reduce the gap between general and domain-specific corpora, several works on MER argue that domain-specific contextualized embeddings still yield superior performance over standard, general-domain word embeddings ([11], [12], [13], [14]). As mentioned in the introduction, the present MER task, due to its heterogeneity (concepts are more specific to the medical domain while actions or references
   1 Entities are classified into 4 types: concept, action, predicate, and reference.
   2 The 13 relation types are organized into 4 main categories: general relations, contextual relations, action roles, and predicate roles.




are less specific), represents a perfect playground for checking the performance of contextualized Language Model embeddings computed over general-domain corpora on the different entity types.

Transfer Learning Recently, transfer learning has been shown to be a successful alternative when (almost) no annotated data is available in the target domain and language [10, 4]. Recent Transformer sequence models [15] surpass the state of the art in many information extraction tasks such as relation extraction [4, 16, 17]. Some works try to integrate the information available in knowledge bases into Transformer sequence models [16]. Nevertheless, simpler approaches based on entity markers (further details in Section 4) show similarly competitive performance with a quicker setup [4]. In a similar manner, multilingual language models [3] have shown an impressive capacity to perform zero-shot learning in cross-lingual tasks. This kind of model seems very promising for relation extraction tasks where the target language has only a small annotated training set.

Data-augmentation A variety of data-augmentation methods have been proposed for information extraction tasks. One of the most significant paradigms is distant supervision [18, 19], in which existing relations in knowledge bases are aligned to unlabeled text relying on some heuristics, automatically labeling training data [20]. More recently, Soares et al. [4] introduced an augmentation method that does not require relation labels and adapts the model by learning to Match the Blanks (MTB). In this work we explore the idea of MTB to approach the relation extraction task in eHealth-KD 2020.


3. Entity Recognition system
We adopted a sequence labeling Deep Learning approach [2] to address the Named Entity Recognition (NER) task.

3.1. NER Architecture
The FLAIR system [2] employed for NER, shown in Figure 1, is composed of three main components: first, a character-based Language Model (LM) that generates powerful contextual word representations, which are afterwards concatenated with static embeddings; on top of this LM layer, a BiLSTM layer that captures the sequential dependencies among the words of the input sequence; and finally, a conditional random field (CRF) layer that handles the tagging inference.

   The LM layer thus concatenates static embeddings and contextual FLAIR embeddings. Contextual FLAIR embeddings are built from character-level partial computations of a character LM, with a BiLSTM used to take context into account. These computations are performed as shown at the bottom of Figure 1, and the results are concatenated with the static embeddings.
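   As an illustration, the following is a minimal sketch of how such a stacked-embedding BiLSTM-CRF tagger can be assembled with the flair library; the corpus path, column layout, and hidden size are our own illustrative assumptions, not the exact configuration of the submitted system.

```python
# Hedged sketch: FLAIR stacked embeddings feeding a BiLSTM-CRF tagger.
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

# IOB-formatted data: one token and one tag per line (assumed layout).
corpus = ColumnCorpus("data/ehealth-kd", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# Static FastText embeddings concatenated with contextual Flair embeddings.
embeddings = StackedEmbeddings([
    WordEmbeddings("es-crawl"),        # static, trained on Web crawls
    FlairEmbeddings("es-forward"),     # character LM, trained on Wikipedia
    FlairEmbeddings("es-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,                   # illustrative; cf. Table 1 for the actual sizes
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,                      # CRF layer handles the tagging inference
)
```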




Figure 1: General Architecture of the NER system.


   The input, provided by the organizers in BRAT standoff format, was tokenized using the NLTK word_tokenize general-purpose function and afterwards converted to the Inside-Outside-Beginning (IOB) format. This format does not capture overlapping or discontinuous entities. The development set was split in two: one part was used for development and the other for testing. The output of the system was converted back to the required BRAT standoff format (.ann files), in which each line consists of an entity type, its offsets, and the matched text.
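   For concreteness, the sketch below shows one way to produce such IOB tags from BRAT-style character spans using NLTK; the alignment-by-offset heuristic and the example spans are our own assumptions, not the exact preprocessing code.

```python
# Hedged sketch: align BRAT standoff entity spans with NLTK tokens to get IOB tags.
from nltk.tokenize import word_tokenize

def to_iob(text, entities):
    """entities: list of (start, end, label) character spans from a .ann file."""
    pairs, pos = [], 0
    for tok in word_tokenize(text, language="spanish"):
        start = text.find(tok, pos)              # recover the character offset
        end = pos = start + len(tok)
        tag = "O"
        for e_start, e_end, label in entities:
            if start >= e_start and end <= e_end:
                tag = ("B-" if start == e_start else "I-") + label
                break
        pairs.append((tok, tag))
    return pairs

# Hypothetical example: "fiebre alta" annotated as a Concept (offsets 21-32).
print(to_iob("El paciente presenta fiebre alta.", [(21, 32, "Concept")]))
# -> [('El','O'), ('paciente','O'), ('presenta','O'),
#     ('fiebre','B-Concept'), ('alta','I-Concept'), ('.','O')]
```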


3.2. Learning Setup
The submitted run uses the architecture described above: the language model layer is composed of contextual and static word embeddings, a dropout layer is placed between the LM and the BiLSTM layer, and the prediction layer is connected after it. The architecture used for training can be seen in Table 1. The training hyperparameters can be summarized as follows: learning rate 0.1, batch size 16, and patience 3 for early stopping, which monitors over-fitting on the development set. A maximum of one hundred training epochs was allowed, and training stopped at epoch 81. The process ran on an AMD Ryzen 7 1700 eight-core CPU and took 45 minutes.

   In the current experiment we used pre-trained FastText static embeddings (es-crawl) [21], trained over Web crawls (general domain), and contextual Flair embeddings (es-forward + es-backward) [2], trained on Wikipedia (general domain). All embedding layers were computed keeping the default parameters. We did not use the additional Medline sentences to train the LM; therefore no in-domain fine-tuning was pursued.
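   A hedged sketch of the corresponding training call with flair's ModelTrainer, reusing the hyperparameters listed above (learning rate 0.1, batch size 16, patience 3, at most 100 epochs); the output path is illustrative.

```python
# Hedged sketch: training run with the Section 3.2 hyperparameters.
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)   # tagger and corpus as sketched above
trainer.train(
    "models/ner",          # illustrative output directory
    learning_rate=0.1,
    mini_batch_size=16,
    max_epochs=100,        # in our run, training stopped at epoch 81
    patience=3,            # epochs without dev improvement before annealing
)
```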



     Layer                 Dropout             Specification
     LM-Forward            0.5                 Embedding (275, 100)
                           -                   LSTM (100, 2048)
                           -                   Linear (in=2048, out=275, bias=True)
     LM-Backward           0.5                 Embedding (275, 100)
                           -                   LSTM (100, 2048)
                           -                   Linear (in=2048, out=275, bias=True)
                           -                   Embedding (275, 100)
     Dependency Tracker    0.5 + 0.05 (word)   Linear (in=4396, out=4396, bias=True)
                           -                   BiLSTM (in=4396, out=600)
     Decision layer        -                   Linear (in=600, out=11, bias=True, crf=True)

Table 1
NER hyperparameter setting.




4. Relation Extraction system
In this section we describe our relation extraction (RE) component. In total we built three RE systems: XLMem, XLMem* and XLMem*+MTB. All the models are based on the same XLM with entity markers (XLMem) architecture, but they differ in training strategies and data. We first describe the base architecture of the XLMem models. In the following sections, we discuss the different training strategies and the hyperparameter values used in training.

4.1. XLMem Architecture
The core of our system is the relation encoder. The encoder consists of a Transformer-based [15] pre-trained language model with a relation extraction head on top. A particularity of this relation encoder is the need for entity markers [4] as additional tokens in the input sentence. These special tokens delimit the boundaries of each entity in the input sentence, as shown in Figure 2. The entity-aware input is then fed to the pre-trained language model. The relation extraction head concatenates the representations of the markers that indicate the starting position of the entities and combines them with a linear layer, encoding the final relation representation.
   Formally, given a relation statement $r = (x, e_1, e_2)$ formed by a sequence of tokens $x = [x_0, x_1, ..., x_n]$ and two entities $e_1$ and $e_2$, we first corrupt the input sentence by adding the entity markers ($[E1S]$ and $[E1E]$ delimit where the first entity starts and ends):

$$\tilde{x} = [x_0, ..., [E1S], e_1, [E1E], ..., [E2S], e_2, [E2E], ..., x_n]$$

We then obtain the hidden representations $h = \mathrm{Transformer}(\tilde{x})$ and finally define our relation encoder $f_\theta$ as follows:



Figure 2: The XLMem relation encoder architecture based on Entity Marker strategy for relation rep-
resentation.




$$f_\theta = W_{re}\,[h_{E1S}; h_{E2S}] + b_{re} \qquad (1)$$

where $W_{re} \in \mathbb{R}^{2H \times H}$ and $b_{re} \in \mathbb{R}^{H}$, with $H$ the hidden representation size. Finally, classification is performed by stacking a linear layer with a softmax activation function on top of the $f_\theta$ encoder:

$$output(r) = \mathrm{softmax}(W_{clf}\,f_\theta(r) + b_{clf}) \qquad (2)$$

where $W_{clf} \in \mathbb{R}^{K \times H}$ and $b_{clf} \in \mathbb{R}^{K}$, with $H$ the hidden representation size and $K$ the number of relations.
   Using XLM as the pre-trained language model gives us the opportunity to learn a cross-lingual relation encoder, which seems a good choice for this setting. Concretely, we use the xlm-mlm-17-1280 checkpoint provided by the Hugging Face team [22]. This particular checkpoint has been trained with the Masked Language Model (MLM) objective on 17 languages, including Spanish, which is our target language for the task.
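   The following PyTorch sketch reconstructs the XLMem encoder of Figure 2 and Equations 1 and 2 on top of this Hugging Face checkpoint; the marker-locating logic and the assumption of exactly one marker pair per sentence are our own illustrative simplifications, not the exact implementation.

```python
# Hedged sketch of the XLMem relation encoder (Eqs. 1 and 2).
import torch
import torch.nn as nn
from transformers import XLMModel, XLMTokenizer

MARKERS = ["[E1S]", "[E1E]", "[E2S]", "[E2E]"]
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-17-1280")
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})

class XLMem(nn.Module):
    def __init__(self, num_relations):
        super().__init__()
        self.xlm = XLMModel.from_pretrained("xlm-mlm-17-1280")
        self.xlm.resize_token_embeddings(len(tokenizer))  # room for the markers
        hidden = self.xlm.config.emb_dim                  # H = 1280 for this checkpoint
        self.w_re = nn.Linear(2 * hidden, hidden)         # Eq. (1)
        self.w_clf = nn.Linear(hidden, num_relations)     # Eq. (2)

    def forward(self, input_ids, attention_mask):
        h = self.xlm(input_ids, attention_mask=attention_mask)[0]
        # Assumes exactly one [E1S]/[E2S] marker per sequence.
        e1s = (input_ids == tokenizer.convert_tokens_to_ids("[E1S]")).nonzero()[:, 1]
        e2s = (input_ids == tokenizer.convert_tokens_to_ids("[E2S]")).nonzero()[:, 1]
        batch = torch.arange(h.size(0), device=h.device)
        f_theta = self.w_re(torch.cat([h[batch, e1s], h[batch, e2s]], dim=-1))
        return self.w_clf(f_theta)   # logits; the softmax is applied inside the loss
```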

4.2. Matching the Blanks
Matching the Blanks [4] (MTB) can be seen as a novel alternative to the well-known distant supervision [18]. The approach is based on the hypothesis that if two entities are related, sentences that contain those two entities are more likely to express the same relation. Figure 3 shows three different sentences from our MTB corpus. While the first two sentences encode the same relation between paciente and síntomas, the third sentence expresses a relation between paciente and tiempo.
   The training dataset is generated as follows. We generate positive pairs of sentences (e.g. examples 1 and 2 in Figure 3) if they share both blanked entities, strong negative pairs if they share one entity (e.g. examples 1 and 3), and weak negatives if no entity is shared. Once we have generated those examples, we train a model that learns whether a pair of sentences encodes the same relation or not, and we transfer the learned parameters to the actual relation extraction task. Note that the [blank] tokens are introduced to prevent the model from simply relearning a link between the entities and the Knowledge Base (KB) used to generate the MTB corpus.



      (1) Se observó actividad de CK en [blank] con dengue con presencia de [blank] como
      vómito, hematemesis y dolor abdominal.
      (2) Al parecer, existen mecanismos comunes a ambas patologías que pueden influir en la
      exacerbación de los [blank] del asma en [blank] con obesidad.
      (3) El [blank] promedio para el inicio de ENT fue de 30 (23,5) horas, y el 88,7% de los
      [blank] alcanzaron el objetivo nutricional en 48 horas.

Figure 3: Three different entries in the MTB dataset. The first two share the same entities, paciente and síntomas; the third, which contains the entities paciente and tiempo, shares only one of them.


                Hyperparameter                  MTB pre-training    fine-tuning
                Learning rate                   1e-4                3e-4
                Optimizer                       SGD                 SGD
                Batch size                      8                   16
                Gradient accumulation steps     8                   4
                Floating point precision        FP16 and FP32       FP16 and FP32
                Early stopping (patience 3)     ✔                   ✔
                Blanks masking probability      0.7                 -

Table 2
Hyperparameter settings of the Relation Extraction system training process.


   To build the MTB corpus we used Spanish Medline abstracts. We processed them with Freeling [23] to extract medical entities. In total we obtained 7,543 medical entities that form 278,956 entity pairs. With those entity pairs we generated 691,392 positive instances and 833,332 negative instances. We split the data into 80% for training and 20% for development. Finally, for technical reasons we discarded instances whose contexts were longer than 128 tokens.
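   A hedged sketch of this pair-generation procedure; the data structures, the right-to-left blanking, and the labels for the three pair types are our own illustrative reconstruction of the scheme described above, not the exact corpus-building code.

```python
# Hedged sketch: MTB pair generation with blank masking (Section 4.2, Table 2).
import random
from itertools import combinations

BLANK = "[blank]"

def blank_entities(tokens, spans, p=0.7):
    """Replace each entity span by [blank] with probability p (Table 2)."""
    out = list(tokens)
    for start, end in sorted(spans, reverse=True):  # right-to-left keeps offsets valid
        if random.random() < p:
            out[start:end] = [BLANK]
    return out

def pair_label(ents_a, ents_b):
    """ents_x: set of the two entity ids mentioned in a sentence."""
    shared = len(ents_a & ents_b)
    return {2: "positive", 1: "strong_negative", 0: "weak_negative"}[shared]

def make_pairs(sentences):
    """sentences: list of (tokens, spans, entity_id_set) triples."""
    for (tok_a, sp_a, ent_a), (tok_b, sp_b, ent_b) in combinations(sentences, 2):
        yield (blank_entities(tok_a, sp_a),
               blank_entities(tok_b, sp_b),
               pair_label(ent_a, ent_b))
```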

4.3. Learning Setup
In this section we report the set of hyperparameters that we used when fine-tuning the models (Figure 2). In our case, the hyperparameters that best fit the development set were the same for the three tested approaches. We also report the hyperparameters used during MTB pre-training.
   The reported configurations were run on a single NVIDIA Titan V GPU with 12 GB of memory. The fine-tuning process takes less than 10 hours; in the case of MTB pre-training, we stopped it manually due to time constraints.
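   As an illustration of the fine-tuning column of Table 2 (SGD, batch size 16, 4 gradient-accumulation steps), the following is a minimal sketch of the optimization loop; the model, data loader, and loss are assumed to be defined as in the previous sketch.

```python
# Hedged sketch: fine-tuning loop with gradient accumulation (Table 2 settings).
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()
ACCUM_STEPS = 4                            # effective batch size = 16 * 4

model.train()
for step, (input_ids, attention_mask, labels) in enumerate(loader):
    logits = model(input_ids, attention_mask)
    loss = criterion(logits, labels) / ACCUM_STEPS
    loss.backward()                        # gradients accumulate across mini-batches
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```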


5. Results
Table 3 shows the results for the main and alternative domain tasks. Our official run combines the NER system with the XLMem* RE system. In this case XLMem* makes use of the additional



                                       Main eval.                           Alternative
            Model             Prec.      Rec.             F1        Prec.      Rec.          F1
            Vicomtech         0.679      0.652          0.665       0.594      0.535       0.563
            Talp-UPC          0.626      0.626          0.626       0.604      0.563       0.583
            UH-MAJA-KD        0.634      0.615          0.625       0.608      0.498       0.547
            IXA-NER-RE        0.536      0.580          0.557       0.563      0.416        0.478

Table 3
Official results of the best systems in the main and alternative domain tasks.


                                                 Dev                            Test
               Model                    Prec.    Rec.          F1      Prec.    Rec.       F1
               SINAI                      -         -           -      0.845   0.807      0.825
               Vicomtech                  -         -           -      0.822   0.820      0.821
               Talp-UPC                   -         -           -      0.807   0.825      0.816
               UH-MAJA-KD                 -         -           -      0.820   0.808      0.814
               UH-MatCom                  -         -           -      0.795   0.825      0.767
               baseline                   -         -           -      0.542   0.504      0.586
               (Ours) BiLSTM + CRF      0.742    0.696     0.718       0.692   0.727      0.660

Table 4
Results for the Named Entity Recognition task. The BiLSTM + CRF system is the one that was sent to the competition.


3,000 automatically annotated sentences from Medline that were provided for further training. The overall results show that although our system is competitive (4th overall), it still leaves large room for improvement in the main task as well as in the alternative task. We would like to note that further in-house evaluation showed that our best system combination would have used XLMem without the extra automatically annotated data.

5.1. Entity Extraction Task (A)
Table 4 shows the test results of the NER task, with the best results for each metric highlighted in bold; we also provide our results on the development set as well as on the official test set.

   Although far from the result obtained by the first-ranked system, the presented system outperforms the baseline with no fine-tuning, using general-domain static embeddings and a pre-trained Language Model. A preliminary error analysis has led us to conclude that, contrary to what we initially thought, the domain might be relevant in this NER task. Although three of the four entity types (actions, references and predicates) are not specifically medical, the fact that predicting references and predicates is strongly conditioned on having previously predicted their antecedent concept correctly, and that the latter is most of the time domain-specific, might have an impact.




                                   Train                     Dev                       Test
 Model                     Prec.   Rec.      F1     Prec.    Rec.      F1     Prec.    Rec.     F1
 Vicomtech                   -       -        -         -      -       -      0.672   0.515    0.583
 UH-MAJA-KD                  -       -        -         -      -       -      0.629   0.571    0.599
 (Ours) XLMem             0.861    0.849   0.855    0.708   0.642    0.674    0.690   0.625    0.656
 (Ours) XLMem*            0.767    0.795   0.781    0.707   0.672    0.689    0.649   0.619    0.633
 (Ours) XLMem*+MTB        0.788    0.709   0.746    0.755   0.616    0.678    0.707   0.584    0.640

Table 5
Results obtained by the different systems in the Relation Extraction task. Best results on each metric are marked in bold and * indicates the use of extra automatically annotated data for training. The XLMem* system is the one that was sent to the competition.



5.2. Relation Extraction Task (B)
In this part we discuss the results obtained by our RE systems during development and testing. We compare our three systems with the other top competitors, and we evaluate our own systems more exhaustively by comparing precision-recall curves and analyzing the confusion in the predictions.
   Table 5 shows comparative results (precision, recall, and F1 score) between our systems and the other top competitors. As the organizers only reported results on the test set, we compare our systems with the rest on that specific partition. Results on the development set show the following: 1) the additional automatically annotated data has a positive effect on model regularization, and 2) MTB pre-training boosts precision at the cost of recall. The best model according to the development set is XLMem*, which was part of the official run. On the contrary, the test results show unexpected behaviour. We hypothesize this is due to the differences in the relation-type distribution between the development and test partitions (Figure 4). Nevertheless, each of the proposed relation extractors outperforms the rest of the systems by a large margin.
   Figure 5 reports precision-recall curves over relation categories for the three RE models. The curves show that the XLMem and XLMem* systems perform similarly, as their micro-averaged curves are very close, but not XLMem*+MTB, whose curve falls below the rest. On the other hand, the per-category curves show that the XLMem* system performs better on the action-role relations, and XLMem performs better on the general relations. The differences between development and test can also be explained by the distribution shown in Figure 4. Finally, analysis of the output reveals that the confusion actually lies between the negative class (no-relation) and the positive relations (i.e. false negatives), and not between the positive relation types.




Figure 4: Relation distribution of development and test datasets.




Figure 5: Precision/Recall curves of the different systems.


6. Conclusions
The purpose of this work was to evaluate the feasibility of different approaches to medical entity recognition and relation extraction for Spanish. Entity recognition was approached with a character-based sequence labeler, and for relation extraction we fine-tuned a large multilingual pre-trained language model. The proposed system shows promising results: we ranked 4th overall and obtained the best results in the relation extraction task. In the future, we plan to improve the entity recognition part by using a domain-specific LM, and to further investigate the use of the Matching the Blanks method as a data-augmentation technique.


References
 [1] A. Piad-Morffis, Y. Gutiérrez, H. Cañizares-Diaz, S. Estevez-Velarde, Y. Almeida-Cruz,
     R. Muñoz, A. Montoyo, Overview of the eHealth Knowledge Discovery Challenge at
     IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation Forum co-located with
     36th Conference of the Spanish Society for Natural Language Processing, IberLEF@SEPLN
     2020, Spain, September, 2020., 2020.
 [2] A. Akbik, D. Blythe, R. Vollgraf, Contextual string embeddings for sequence labeling, in:
     Proceedings of the 27th International Conference on Computational Linguistics, Associa-
     tion for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 1638–1649. URL:
     https://www.aclweb.org/anthology/C18-1139.
 [3] G. Lample, A. Conneau, Cross-lingual language model pretraining, Advances in Neural
     Information Processing Systems (NeurIPS) (2019).



 [4] L. Baldini Soares, N. FitzGerald, J. Ling, T. Kwiatkowski, Matching the blanks: Distributional
     similarity for relation learning, in: Proceedings of the 57th Annual Meeting of the
     Association for Computational Linguistics, Association for Computational Linguistics,
     Florence, Italy, 2019, pp. 2895–2905. URL: https://www.aclweb.org/anthology/P19-1279.
     doi:10.18653/v1/P19-1279.
 [5] G. Zhou, J. Zhang, J. Su, D. Shen, C. Tan, Recognizing names in biomedical texts: a machine
     learning approach, Bioinformatics 20 (2004) 1178–1190.
 [6] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, J. Armengol-Estapé, Medical
     word embeddings for Spanish: Development and evaluation, in: Proceedings of the
     2nd Clinical Natural Language Processing Workshop, Association for Computational
     Linguistics, Minneapolis, Minnesota, USA, 2019.
 [7] P. Stenetorp, H. Soyer, S. Pyysalo, S. Ananiadou, T. Chikayama, Size (and domain) matters:
     Evaluating semantic word space representations for biomedical text, in: Proceedings of
     the 5th International Symposium on Semantic Mining in Biomedicine, Zürich, Switzerland,
     2012.
 [8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
     contextualized word representations, in: Proc. of NAACL, 2018.
 [9] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, 2018.
     arXiv:1801.06146.
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[11] L. Akhtyamova, P. Martinez, K. Verspoor, J. Cardiff, Testing contextualized word embed-
     dings to improve NER in Spanish clinical case narratives, BMC Medical Informatics and
     Decision Making (2020) preprint. doi:10.21203/rs.2.22697/v1.
[12] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics (2019). URL:
     https://doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
[13] Y. Si, J. Wang, H. Xu, K. Roberts, Enhancing clinical concept extraction with contextual
     embeddings, Journal of the American Medical Informatics Association 26 (2019) 1297–1304.
     URL: http://dx.doi.org/10.1093/jamia/ocz096. doi:10.1093/jamia/ocz096.
[14] G. Sheikhshabbafghi, I. Birol, A. Sarkar, In-domain context-aware token embeddings
     improve biomedical named entity recognition, in: Proceedings of the Ninth International
     Workshop on Health Text Mining and Information Analysis, Association for Computational
     Linguistics, Brussels, Belgium, 2018, pp. 160–164. URL: https://www.aclweb.org/anthology/
     W18-5618. doi:10.18653/v1/W18-5618.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: Advances in neural information processing systems,
     2017, pp. 5998–6008.
[16] M. E. Peters, M. Neumann, R. L. Logan, R. Schwartz, V. Joshi, S. Singh, N. A. Smith,
     Knowledge enhanced contextual word representations, in: EMNLP, 2019.
[17] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, O. Levy, SpanBERT: Improving
     pre-training by representing and predicting spans, arXiv preprint arXiv:1907.10529 (2019).
[18] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without
     labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the



     ACL and the 4th International Joint Conference on Natural Language Processing of the
     AFNLP, Association for Computational Linguistics, Suntec, Singapore, 2009, pp. 1003–1011.
     URL: https://www.aclweb.org/anthology/P09-1113.
[19] O. Sainz, O. Lopez de Lacalle, I. Aldabe, M. Maritxalar, Domain adapted distant supervision
     for pedagogically motivated relation extraction, in: Proceedings of The 12th Language Re-
     sources and Evaluation Conference, European Language Resources Association, Marseille,
     France, 2020, pp. 2213–2222. URL: https://www.aclweb.org/anthology/2020.lrec-1.270.
[20] R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, D. S. Weld, Knowledge-based weak
     supervision for information extraction of overlapping relations, in: Proceedings of the
     49th Annual Meeting of the Association for Computational Linguistics: Human Language
     Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp.
     541–550. URL: https://www.aclweb.org/anthology/P11-1055.
[21] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157
     languages, in: Proceedings of the International Conference on Language Resources and
     Evaluation (LREC 2018), 2018.
[22] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Brew, Huggingface’s transformers: State-of-the-art natural language
     processing, ArXiv abs/1910.03771 (2019).
[23] L. Padró, E. Stanilovsky, Freeling 3.0: Towards wider multilinguality, in: Proceedings of
     the Language Resources and Evaluation Conference (LREC 2012), ELRA, Istanbul, Turkey,
     2012.



