<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saadullah Amin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunter Neumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katherine Dunfield</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Vechkaeva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kathryn Annette Chapman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Morgan Kelly Wixted *</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI GmbH</institution>
          ,
          <addr-line>Campus D3 2, 66123 Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the adoption of electronic health record (EHR) systems, hospitals and clinical institutes have access to large amounts of heterogeneous patient data. Such data consists of structured (insurance details, billing data, lab results, etc.) and unstructured (doctor notes, admission and discharge details, medication steps, etc.) documents, of which the latter is of great significance for applying natural language processing (NLP) techniques. In parallel, recent advancements in transfer learning for NLP have pushed the state of the art to new limits on many language understanding tasks. Therefore, in this paper, we present team DFKIMLT's participation in CLEF eHealth 2019 Task 1 on automatically assigning ICD-10 codes to non-technical summaries (NTSs) of animal experiments, where we use various architectures in a multi-label classification setting and demonstrate the effectiveness of transfer learning with the pre-trained language representation model BERT (Bidirectional Encoder Representations from Transformers) and its recent variant BioBERT. We first translate the task documents from German to English using an automatic translation system and then use BioBERT, which achieves an F1-micro of 73.02% on the submitted run as evaluated by the challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Indexing</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Multi-label Classification</kwd>
        <kwd>ICD-10 Codes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>EHR systems offer a rich source of data that can be utilized to improve health care systems by applying information extraction, representation learning and</title>
      <p>
        *On behalf of the PRECISE4Q consortium.
Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12
September 2019, Lugano, Switzerland.
predictive modeling [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] techniques. Among many other applications, one such
task is the automatic assignment of International Statistical Classification of
      </p>
    </sec>
    <sec id="sec-2">
      <title>Diseases (ICD) codes [27] to clinical notes, otherwise called semantic indexing of</title>
      <p>
        clinical documents [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The problem is to learn a mapping from natural language
free-texts to medical concepts such that, given a new document, the system can
assign one or more codes to it. Approximating the mapping in this setting can
be seen as multi-label classification and is one way to solve the problem, besides
hierarchical classification [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], learning to rank and unsupervised methods.
      </p>
    </sec>
    <sec id="sec-3">
      <title>In this study, we describe our work on CLEF eHealth 2019 [19] Task 1 [26],</title>
      <p>
        which is about multilingual information extraction from German non-technical
summaries (NTSs) of animal experiments collected from AnimalTestInfo database
to classify according to ICD-10 codes, German modification version 2016<sup>1</sup>. The
AnimalTestInfo database was developed in Germany to make the non-technical
summaries (NTSs) of animal research studies available in a searchable and easily
accessible web-based format. Each NTS was manually assigned an ICD-10 code
with the goal of advancing the integrity and reporting of responsible animal
research [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This task requires an automated approach to classify the NTSs,
whereby the data exhibits challenging attributes of multilingualism, domain
specificity and code skewness with hierarchical structure.
      </p>
      <p>
        We explore various models, starting with traditional bag-of-words support
vector machines (SVM) to standard deep learning architectures of convolutional
neural networks (CNN) and recurrent neural networks (RNN) with three types
of attention mechanisms; namely, hierarchical attention Gated Recurrent Unit
(GRU) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], self-attention Long Short-Term Memory (LSTM) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and codes
attentive LSTM. Finally, we show the effectiveness of fine-tuning state-of-the-art
pre-trained BERT models [
        <xref ref-type="bibr" rid="ref10 ref22">10, 22</xref>
        ], which requires minimal task-specific changes
and works well for small datasets. However, a significant performance boost
comes from translating the German NTSs to English and then applying the same
models, yielding an absolute gain of 6.22% F-score on the dev set from the best German
model to the best English model. This can be attributed to the fact that each language has
its own linguistic and cultural characteristics that may contain different signals
to effectively classify a specific class [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Given translated texts, we also find
that domain-specific embeddings have more effect for static word
embeddings [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], giving an average gain of 2.77%, than for contextual embeddings [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ],
where the gain is 0.86%.
      </p>
      <sec id="sec-3-1">
        <title>Related Work</title>
        <p>
          Automatic assignment of ICD codes [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to health related documents has been
well studied, both in previous CLEF shared tasks and in general. Traditional
approaches range from rule-based methods and dictionary look-ups [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to machine
learning models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. However, more recently the focus has been on applying deep
learning.
1 https://www.dimdi.de/static/de/klassifikationen/icd/icd-10-gm/kode-suche/htmlgm2016/
        </p>
        <p>
          Many techniques have been proposed using CNNs, RNNs and hybrid
systems. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] uses shallow CNN and improves its predictions for rare labels by
dictionary-based lexical matching. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] addresses the challenges of long documents
and high cardinality of label space in MIMIC-III [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] by modifying
Hierarchical Attention Network [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] with labels attention. More recent focus has been
on using sequence-to-sequence (seq2seq) [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] encoder-decoder
architectures. [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] first builds a multilingual death cause extraction model using
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>an LSTM encoder-decoder, with concatenated French, Hungarian and Italian fastText</title>
      <p>
        embeddings as inputs and causes extracted from ICD-10 dictionaries as
outputs. The output representations are then passed to an attention-based biLSTM
classifier which predicts the codes. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] uses character level CNN [41] encoders
for French and Italian, which are genealogically related languages and similar
on a character level, with a biRNN decoder. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] enriches word embeddings with
language-specific Wikipedia and creates an ensemble model from a CNN
classifier and a GRU encoder-decoder. A few other techniques that use the
sequence-to-sequence framework have also been proposed and obtained good results [
        <xref ref-type="bibr" rid="ref2 ref24">2, 24</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>While successful, these approaches make an auto-regressive assumption on</title>
      <p>
        output codes, which may hold true only when there is one distinct path from
parent to child code for a given document. However, in ICD code assignment,
a document can have multiple disjoint paths in a directed acyclic graph (DAG)
formed by the concept hierarchy [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Also, for a smaller dataset, the decoder may
suffer from low variance vocabulary and data sparsity issues. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a novel
      </p>
    </sec>
    <sec id="sec-6">
      <title>Hierarchical Multi-label Classification Network (HMCN) with feed-forward and</title>
      <p>recurrent variations is proposed that jointly optimizes local and global loss
functions for discovering local hierarchical class-relationships in addition to global
information from the entire class hierarchy while penalizing hierarchical
violations (a child node getting a higher score than parent). However, they only
consider tree based hierarchies where a node strictly has one parent.</p>
      <p>
        Contextualized word embeddings, such as ELMo [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] and BERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
derived from pre-trained bidirectional language models (biLMs) and trained on
large texts, have been shown to substantially improve performance on many NLP
tasks: question answering, entailment and sentiment classification, constituency
parsing, named entity recognition, and text classification. Such transfer learning
involves fine-tuning these pre-trained models on a downstream supervised
task to get good results with minimal effort. In this sense, they are simple,
efficient and performant. Motivated by this, and the recent work of [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], we use BERT
models for this task and achieve better results than CNN and RNN based
methods. We also show great improvements with translated English texts.
      </p>
      <sec id="sec-6-1">
        <title>Data</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>The dataset contains 8,385 training documents (including dev set) and 407 test documents, all in German. Each document has six text fields: document title, uses (goals) of the experiment, possible harms caused to animals and comments about replacement, reduction and refinement (in the scope of 3R principles).</title>
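      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Top-5 most frequent codes.</p>
        </caption>
        <table>
          <thead>
            <tr><th>ICD-10 Code</th><th>No. of documents (train + dev)</th></tr>
          </thead>
          <tbody>
            <tr><td>II</td><td>1515</td></tr>
            <tr><td>C00-C97</td><td>1479</td></tr>
            <tr><td>IX</td><td>930</td></tr>
            <tr><td>VI</td><td>799</td></tr>
            <tr><td>C00-C75</td><td>732</td></tr>
          </tbody>
        </table>
      </table-wrap>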
    </sec>
    <sec id="sec-8">
      <title>The documents are assigned one or more codes from ICD-10-GM (German</title>
    </sec>
    <sec id="sec-9">
      <title>Modification version 2016) which exhibits a hierarchy forming a DAG [33], where</title>
      <p>the highest-level nodes are called chapters and their direct child nodes are called
groups. The depth of most chapters is one, but in some cases it goes to the
second level (e.g. M00-M25, T20-T32) and, in one case, up to the third level (C00-C97).
Documents are assigned mixed codes such that a parent and child node can
coexist and a child node can have multiple parents. Moreover, 91 documents are
missing one or more of the six text fields and only 6,472 have labels (5,820 in the train
set and 652 in the dev set), while 52 of them have only chapter-level codes. Table</p>
    </sec>
    <sec id="sec-10">
      <title>1 shows the top-5 most frequent codes. These classes account for more than 90% of the dataset, leading to a high imbalance. Due to the shallow hierarchy, we treat the problem as multi-label classification instead of hierarchical classification.</title>
      <sec id="sec-10-1">
        <title>Methods</title>
        <p>
          Since the documents are domain-specific and in German, we argue that it might
be difficult for open-domain and multilingual pre-trained models to do effective
transfer learning. Furthermore, [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] suggests that each language has its own
linguistic and cultural characteristics that may contain different signals to
effectively classify a specific class. Based on this, and the fact that translations are
always available as domain-free parallel corpora, we use them in our system and
show improvements across all models. Since biomedical literature is more readily
accessible as free text in English, we use English translations for our
documents. To perform a thorough case study, we tested several models and
pre-trained embeddings. Below we describe each of them.
        </p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Baseline For baseline we use a TF-IDF weighted bag-of-words based linear SVM model.</title>
    </sec>
    <sec id="sec-12">
      <title>CNN Convolutional Neural Network (CNN) learns local features of the input representation</title>
      <p>
        through varying numbers and sizes of filters performing the convolution
operation. CNNs have been very successful in many text classification tasks [
        <xref ref-type="bibr" rid="ref18">18, 41</xref>
        ].
      </p>
    </sec>
    <sec id="sec-13">
      <title>While many advanced CNN architectures exist, we use a shallow model of [20].</title>
    </sec>
    <sec id="sec-14">
      <title>Attention Models Attention is a mechanism that was initially proposed in sequence-to-sequence based Neural Machine Translation (NMT) [3] to allow the decoder to attend to encoder states while making predictions. More generally, attention generates a probability distribution over features, allowing models to</title>
      <p>put more weight on relevant features. In our study, we used three attention based
models.</p>
    </sec>
    <sec id="sec-15">
      <title>HAN Hierarchical Attention Network (HAN) deals with the problem of long</title>
      <p>
        document classification by modeling attention at each hierarchical level of a
document, i.e. words and sentences [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. This allows the model to first attend to the word
encoder outputs in a sentence, followed by attending to the sentence encoder
outputs to classify a document. Like [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ], we also use bidirectional Gated Recurrent
      </p>
    </sec>
    <sec id="sec-16">
      <title>Units (GRUs) as word and sentence encoder.</title>
    </sec>
    <sec id="sec-17">
      <title>SLSTM Self-Attention Long Short-Term Memory (SLSTM) network is a simple single-layer network based on a bidirectional LSTM encoder. An input sequence is first passed through the encoder and the encoded representations are self-attended to produce outputs.</title>
      <p>CLSTM All ICD codes have a textual description (e.g. code A80-A89 covers
viral infections of the central nervous system) that can help a model during
classification. Fig. 1 shows a document containing words related to those found in the
descriptions of its labeled codes. Such words may or may not be present, but
the intuition is to use this additional meta-data to enrich the encoder
representation by attention. To the best of our knowledge, this is the first time that
the codes' descriptions are directly used to align with input text via attention.</p>
    </sec>
    <sec id="sec-18">
      <title>The closest work is from [4], where the authors use code attention but directly</title>
      <p>consider the code as a unit of representation, creating an embedding lookup. We also
create an embedding layer for codes but using their texts, where a code
representation is obtained by averaging the word embeddings of its tokens. We call this
network Codes Attentive LSTM (CLSTM) and describe it more formally.</p>
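      <p>Let <inline-formula><tex-math>X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{n \times d}</tex-math></inline-formula> be an n-length input document sequence,
where <inline-formula><tex-math>x_i</tex-math></inline-formula> is a d-dimensional embedding vector for input word <inline-formula><tex-math>w_i</tex-math></inline-formula> belonging to the
document vocabulary <inline-formula><tex-math>V_D</tex-math></inline-formula>. Let <inline-formula><tex-math>T = \{t_1, t_2, \ldots, t_m\} \in \mathbb{R}^{m \times l}</tex-math></inline-formula> be the m-codes by
l-length titles representation matrix, where each <inline-formula><tex-math>t_i = \{t_{i1}, t_{i2}, \ldots, t_{il}\} \in \mathbb{R}^{l \times d}</tex-math></inline-formula> and
<inline-formula><tex-math>t_{ij}</tex-math></inline-formula> is the d-dimensional embedding vector for code i's title word j, belonging to the
titles vocabulary <inline-formula><tex-math>V_T</tex-math></inline-formula>. The embedding matrices are different for documents and
code titles because the title words can be missing from the document vocabulary.</p>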
    </sec>
    <sec id="sec-19">
      <title>Similarly, we used different LSTM encoders for document and code words (a shared</title>
      <p>encoder underperformed on the dev set; not reported). The network then transforms
the input as <inline-formula><tex-math>X_{out} = \mathrm{CLSTM}(X, T)</tex-math></inline-formula>, with the following operations:</p>
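      <p>
        <disp-formula>
          <tex-math><![CDATA[
\begin{aligned}
X_{enc} &= [x_1^{enc}, x_2^{enc}, \ldots, x_n^{enc}], \qquad x_i^{enc} = \mathrm{LSTM}_W(x_i) \\
T_{enc} &= [t_1^{enc}, t_2^{enc}, \ldots, t_m^{enc}], \qquad t_i^{enc} = \frac{1}{l} \sum_{j=1}^{l} \mathrm{LSTM}_C(t_{ij}) \\
X_{out} &= [X_{enc}; T_{enc}] \in \mathbb{R}^{(n+m) \times h} \\
A &= \mathrm{softmax}(X_{out} X_{out}^{T}) \in \mathbb{R}^{(n+m) \times (n+m)} \\
X_{out} &= X_{out} + A^{T} X_{out} \\
X_{out} &= \frac{1}{n} \sum_{j=1}^{n} X_{out_j}
\end{aligned}
          ]]></tex-math>
        </disp-formula>
      </p>
      <p>where <inline-formula><tex-math>X_{enc}</tex-math></inline-formula> is the sequence of word encoder <inline-formula><tex-math>\mathrm{LSTM}_W</tex-math></inline-formula> outputs and <inline-formula><tex-math>T_{enc}</tex-math></inline-formula> is the
sequence of averaged title-word encodings from the code encoder <inline-formula><tex-math>\mathrm{LSTM}_C</tex-math></inline-formula>. We
concatenate the document word sequence with the title sequence and perform self-attention</p>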
    </sec>
    <sec id="sec-20">
      <title>A, followed by a residual connection and an average over the resulting sequence to get the</title>
      <p>final representation.</p>
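      <p>The attention and pooling steps of CLSTM can be illustrated with a minimal, dependency-free sketch. It assumes the LSTM encoder outputs are already available as plain lists of floats, applies the softmax row-wise, and averages over the first n (document) positions as the last equation is written; the function names (softmax_rows, matmul, transpose, clstm_pool) are hypothetical and are not taken from the released code.</p>
      <preformat>
```python
import math


def softmax_rows(matrix):
    """Row-wise softmax (assumed axis; the paper does not state it)."""
    out = []
    for row in matrix:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def clstm_pool(x_enc, t_enc):
    """x_enc: n x h encoded document words; t_enc: m x h encoded code titles."""
    n = len(x_enc)
    # X_out = [X_enc; T_enc], shape (n + m) x h
    x_out = [list(r) for r in x_enc] + [list(r) for r in t_enc]
    # A = softmax(X_out X_out^T), shape (n + m) x (n + m)
    attn = softmax_rows(matmul(x_out, transpose(x_out)))
    # Residual connection: X_out = X_out + A^T X_out
    context = matmul(transpose(attn), x_out)
    x_out = [[a + b for a, b in zip(r, c)] for r, c in zip(x_out, context)]
    # Average over the first n rows, following the 1/n sum in the equations
    h = len(x_out[0])
    return [sum(x_out[j][k] for j in range(n)) / n for k in range(h)]


if __name__ == "__main__":
    doc_words = [[1.0, 0.0], [0.0, 1.0]]   # toy encoder outputs, n = 2, h = 2
    code_titles = [[1.0, 1.0]]             # toy title encoding, m = 1
    print(clstm_pool(doc_words, code_titles))
```
      </preformat>
      <p>The result is a single h-dimensional vector per document, which would then feed a classification layer with one output per code.</p>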
      <p>
        BERT Pre-training large models on unsupervised corpora with a language
modeling objective and then fine-tuning the same model for a downstream
supervised task eliminates the need for heavily engineered task-specific architectures
[
        <xref ref-type="bibr" rid="ref10 ref29 ref30">10, 29, 30</xref>
        ]. Bidirectional Encoder Representations from Transformers (BERT)
is one such recently proposed model, following ELMo and OpenAI GPT. BERT is a
multi-layer bidirectional Transformer (feed-forward multi-headed self-attention)
[
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] encoder that is trained with two objectives, masked language modeling
(predicting a missing word in a sentence from the context) and next sentence
prediction (predicting whether two sentences are consecutive sentences). BERT has
improved the state of the art in many language understanding tasks, and recent
work shows that it sequentially models the NLP pipeline: POS tagging, parsing, NER,
semantic roles and coreference [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Similar works [
        <xref ref-type="bibr" rid="ref13">13, 40</xref>
        ] have been performed to
understand and interpret BERT's learning capacity. We therefore use BERT in
our task and show that it achieves best results compared to other models and is
nearly agnostic to domain-specific pre-training (BioBERT; [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]).
      </p>
      <sec id="sec-20-1">
        <title>Experiments</title>
        <sec id="sec-20-1-1">
          <title>Pre-processing</title>
          <p>We consider each document as one text field, i.e. all six fields are joined together
to form one input text. As mentioned in section 3, only 6,472 documents are
labeled, out of which 654 are in the dev set from a total of 840. Since there is no gold
standard for the unlabeled documents, we cannot evaluate on them, so we ignored them
during training. We also abstained from adding an extra "no" class (i.e. a proxy
for predicting nothing) for such documents because we assume that all NTSs
should be indexed (e.g. like MEDLINE auto-indexing of new PubMed articles)
and therefore inherently have one or more true ICD-10 codes assigned to them.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-21">
      <title>However, the official evaluation script penalizes model predictions for such documents</title>
      <p>by considering them all false positives. We will cover this in detail in
the results section.</p>
      <p>
        To translate German documents to English we used automatic translation
from Google Translate API v2<sup>2</sup>. For both German and English, we use language-specific
sentence and word tokenizers offered by NLTK [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and spaCy<sup>3</sup>,
respectively. Tokens with document frequencies outside 5 and 60% of the training corpus
were removed and only the top-10000 tokens were kept to limit the vocabulary. This
applies to all models other than BERT, which uses the WordPiece tokenizer [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] and
builds its own vocabulary. Lastly, we remove all the classes with frequency less
than 15. All the experiments were performed without any cross-validation, using the
dev set to find the best parameters.
      </p>
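      <p>The vocabulary pruning described above can be sketched as follows. This is only an illustration under stated assumptions: the lower bound of 5 is read as a minimum document count, the upper bound as 60% of the training documents, and the top-10000 cut is taken by document frequency; the function name build_vocab and its parameters are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def build_vocab(docs, min_df=5, max_df_ratio=0.6, max_size=10000):
    """docs: list of token lists. Returns the kept tokens, most frequent first."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each token once per document (document frequency)
        doc_freq.update(set(tokens))
    kept = []
    for token, count in doc_freq.items():
        frequent_enough = count >= min_df
        not_too_common = max_df_ratio >= count / n_docs
        if frequent_enough and not_too_common:
            kept.append((token, count))
    # Highest document frequency first; ties broken alphabetically
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [token for token, _ in kept[:max_size]]


if __name__ == "__main__":
    toy_docs = [["a", "b"], ["a", "c"], ["a", "b"], ["b"], ["d"]]
    print(build_vocab(toy_docs, min_df=2, max_df_ratio=0.7, max_size=10))
```
      </preformat>
      <p>Tokens outside the returned vocabulary would typically be mapped to a single unknown-word index before being passed to the non-BERT models.</p>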
      <sec id="sec-21-1">
        <title>Pre-trained Embeddings</title>
      </sec>
    </sec>
    <sec id="sec-22">
      <title>We use the following pre-trained models for German:</title>
      <p>FT<sub>de</sub>: fastText DE Common Crawl (300d)<sup>4</sup>
BERT<sub>de</sub>: BERT-Base, Multilingual Cased (768d)<sup>5</sup>
and the following for English:</p>
    </sec>
    <sec id="sec-23">
      <title>FT<sub>en</sub>: fastText EN Common Crawl (300d)</title>
      <p>PubMed<sub>en</sub>: PubMed word2vec (400d)<sup>6</sup>
BERT<sub>en</sub>: BERT-Base, Cased (768d)<sup>7</sup>
BioBERT<sub>en</sub>: BioBERT (768d)<sup>8</sup></p>
      <p>2 https://cloud.google.com/translate/docs/translating-text
3 https://spacy.io/usage/models
4 https://fasttext.cc/docs/en/crawl-vectors.html
5 https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
6 https://archive.org/details/pubmed2018_w2v_400D.tar
7 https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
8 https://github.com/naver/biobert-pretrained/releases/tag/v1.0-pubmed-pmc</p>
    </sec>
    <sec id="sec-24">
      <title>TF-IDF + Linear SVM As a baseline, we use the scikit-learn implementation of</title>
    </sec>
    <sec id="sec-25">
      <title>LinearSVC with one-vs-all training [28].</title>
    </sec>
    <sec id="sec-26">
      <title>For all the models, except BERT, we used a batch size of 64, max sequence</title>
      <p>
        length of 256, learning rate of 0.001 with Adam [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and 50 epochs with early
stopping. We used binary cross-entropy for each class as our objective function
and F1-micro score as the performance metric. All the experiments were performed
on a single 12 GB Nvidia TitanXp GPU. We implemented these models and our
code is publicly available.<sup>9</sup>
      </p>
    </sec>
    <sec id="sec-27">
      <title>CNN We configured the CNN with 64 channels and filter sizes of 3, 4 and 5.</title>
    </sec>
    <sec id="sec-28">
      <title>HAN Following [39], we also used biGRU encoders but with hidden size of 300.</title>
    </sec>
    <sec id="sec-29">
      <title>We set the maximum number of sentences in a document and the maximum number of words in a sentence to 40 and 10 respectively.</title>
    </sec>
    <sec id="sec-30">
      <title>SLSTM A biLSTM encoder with hidden size of 300.</title>
    </sec>
    <sec id="sec-31">
      <title>CLSTM Similar to SLSTM, but with an additional T matrix of size total number of titles (230, collected from ICD-10-GM) by max title sequence length of 10.</title>
      <p>BERT We used a PyTorch implementation of BERT<sup>10</sup> with default parameters.</p>
    </sec>
    <sec id="sec-32">
      <title>To avoid memory issues, we used maximum sequence length of 256 with batch size 6.</title>
    </sec>
    <sec id="sec-33">
      <title>Ensemble Based on dev set results, we also created an ensemble of top-2 models as weighted combination of their raw scores, where then the prediction for each example is given by:</title>
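      <p>
        <disp-formula>
          <tex-math><![CDATA[\hat{y} = \mathbb{1}\{\sigma(\alpha S_1 + (1 - \alpha) S_2) > 0.5\} \in \{0, 1\}^{|C|}]]></tex-math>
        </disp-formula>
      </p>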
    </sec>
    <sec id="sec-34">
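      <title>S<sub>1</sub>, S<sub>2</sub> are raw probability scores from the first and second best model respectively,</title>
      <p>while <inline-formula><tex-math>\sigma</tex-math></inline-formula> is the sigmoid function and <inline-formula><tex-math>|C|</tex-math></inline-formula> is the number of classes. We select the best value of
<inline-formula><tex-math>\alpha</tex-math></inline-formula> on the dev set such that the F1 score of the ensemble is higher than that of the individual models.
Fig. 2 shows the variation of <inline-formula><tex-math>\alpha</tex-math></inline-formula> with performance metrics.</p>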
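      <p>The ensemble rule and the selection of the mixing weight can be sketched in a few lines. The sketch assumes that the two score matrices hold raw (pre-sigmoid) per-class scores, that the weight is searched on a uniform grid over [0, 1], and that micro-averaged F1 on the dev set is the selection criterion; the names ensemble_predict, micro_f1 and select_alpha are hypothetical and not taken from the released code.</p>
      <preformat>
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_predict(s1, s2, alpha):
    """s1, s2: per-document lists of raw per-class scores from the two models."""
    return [
        [1 if sigmoid(alpha * a + (1.0 - alpha) * b) > 0.5 else 0
         for a, b in zip(row1, row2)]
        for row1, row2 in zip(s1, s2)
    ]


def micro_f1(gold, pred):
    """Micro-averaged F1 over all (document, class) decisions."""
    tp = fp = fn = 0
    for gold_row, pred_row in zip(gold, pred):
        for g, p in zip(gold_row, pred_row):
            if g == 1 and p == 1:
                tp += 1
            elif g == 0 and p == 1:
                fp += 1
            elif g == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_alpha(s1, s2, gold, steps=100):
    """Grid search over alpha in [0, 1]; keeps the first best value found."""
    best_alpha, best_f1 = 0.0, -1.0
    for k in range(steps + 1):
        alpha = k / steps
        score = micro_f1(gold, ensemble_predict(s1, s2, alpha))
        if score > best_f1:
            best_alpha, best_f1 = alpha, score
    return best_alpha, best_f1


if __name__ == "__main__":
    dev_s1 = [[2.0, -1.0]]    # toy scores from the best model
    dev_s2 = [[-2.0, -1.0]]   # toy scores from the second-best model
    dev_gold = [[1, 0]]
    print(select_alpha(dev_s1, dev_s2, dev_gold))
```
      </preformat>
      <p>With alpha close to 1 the ensemble follows the first model and with alpha close to 0 the second, so the grid search trades the precision of one model against the recall of the other.</p>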
      <sec id="sec-34-1">
        <title>Results</title>
        <p>
          9 https://github.com/suamin/multilabel-classification-bert-icd10
10 https://github.com/huggingface/pytorch-pretrained-BERT
(English) improved the score by an avg. of 4.07%. This can be attributed to the
fact that there is much more English text than text in other languages, but it can also
be argued that English may have stronger linguistic signals to classify the classes
where German models make mistakes [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          The baseline proved to be a strong one, with the highest precision of all
and outperforming HAN and CNN models, for both German and English, with
common crawl embeddings. HAN performs better when documents are relatively
long e.g. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] reports strong results with HAN based models on MIMIC dataset
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], where the average document size exceeds 1900 tokens. After pre-processing,
the average document length in our case was approximately 340. For CNN, we
believe advanced variants may perform better.
        </p>
      </sec>
    </sec>
    <sec id="sec-35">
      <title>SLSTM and CLSTM, both being just one layer, performed comparably and</title>
      <p>better than the baseline. SLSTM is much simpler and relies purely on self-attention,
which also complements the higher scores of BERT models, which are stacked
multi-headed self-attention networks. For CLSTM, since many documents are missing
the title words (in fact many title words never appeared in the corpus), the model
had weak alignment signals between documents and this additional meta-data.</p>
    </sec>
    <sec id="sec-36">
      <title>However, it still performed really well, getting second best score with PubMed</title>
      <p>embeddings.</p>
      <p>
        BERT performed better than other models, both in German and English,
with an avg. score 6% points higher. BioBERT<sub>en</sub> performed slightly (+0.86%)
better than BERT<sub>en</sub>; this was also noticeable in the Relation Extraction task in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ],
where domain-specific and general BERT performed comparably. This partly
shows BERT's ability to generalize and be robust to domain shifts (learning
from only 5k training docs); however, this contradicts the recent findings of
[40], where the authors reflect on such issues and on catastrophic forgetting in
BERT-like models. On the other hand, the effect of using in-domain pre-trained models
was more significant for static embeddings; using pre-trained PubMed<sub>en</sub> vectors
out-performed open-domain FT<sub>en</sub> by an avg. of 2.77%. Such analysis was not
performed for German due to the lack of medical-domain German vectors. BERT
models had the highest recall but relatively poor precision. This is preferable in
real-world medical applications, where recall is of much more importance.</p>
    </sec>
    <sec id="sec-37">
      <title>We also combined our top-2 models, BioBERT<sub>en</sub> and CLSTM-PubMed<sub>en</sub>,</title>
      <p>to get an ensemble which performed better than both and got the highest score of
84.67% on the dev set. The intuition was to improve on BERT's precision without
losing too much recall. At <inline-formula><tex-math>\alpha = 0.63</tex-math></inline-formula> we got the highest score. This increased</p>
    </sec>
    <sec id="sec-38">
      <title>BioBERT<sub>en</sub> precision by 7.24% at a loss of 2.5% recall. Since our focus was mainly</title>
      <p>on single-model systems, we used the best single model for submission.</p>
      <sec id="sec-38-1">
        <title>Submission and test scores</title>
        <p>The test set contains 407 documents, which we first translate to English and then
run predictions on with BioBERT<sub>en</sub> as our submitted model. We obtained a test
F1-micro of 73% with 86% recall and 64% precision, as posted in the official results. Our
system ranked second, but there was a significant difference between test and
dev set performance, especially the low precision. After the gold set was released, we
probed it and realized that the official script provided by the challenge considers
all predictions on test examples for which there is no gold label (93 of them) as
false positives. We think that it is intrinsically impossible to compare examples
with predictions where the gold standard is not available. To emphasize, we give an
example: take the test document with id=20486, where the gold labels are
{C00-C97, C76-C80, II} and our best model predicted {C00-C97, C76-C80, II}, i.e. a
perfect match with maximum score. Under the official evaluation, if this example did
not have a gold standard available, then our model's predictions would all have been
considered false positives, which severely degrades the precision of a model that
may have generalized well to predict on future examples. Table 3 shows this
comparison on the test set, where in the "Original" column we use the same evaluation as
provided by the task and in "Modified" we remove from evaluation all documents
for which gold labels are not available. As can be seen, the recall column stays the
same as in the original; only the precision column changes, which changes the F1-score as
well. With the modification, all the models have similar performance to that
on the dev set, since we also trained and evaluated on the dev set after removing
unlabeled examples. With the modification, the submitted system achieves a test
score of 80.82%, compared to 82.90% on the dev set. Finally, the ensemble
model gets the highest scores of 77.98% and 82.49% with the original and modified
evaluation respectively.</p>
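        <p>The difference between the two evaluation modes can be made concrete with a small sketch. It is not the official evaluation script: it only assumes that gold labels and predictions are given as sets of codes per document id and that a document without gold labels has an empty (or missing) gold set; the function name micro_prf, its skip_unlabeled flag and the document ids are hypothetical.</p>
        <preformat>
```python
def micro_prf(gold, pred, skip_unlabeled):
    """Micro-averaged precision, recall and F1 over documents.

    gold, pred: dicts mapping a document id to a set of ICD-10 codes.
    skip_unlabeled=False mimics the "Original" behaviour described above,
    where every prediction on a document without gold labels is a false positive.
    skip_unlabeled=True mimics the "Modified" evaluation.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        labels = gold.get(doc_id, set())
        predicted = pred.get(doc_id, set())
        if skip_unlabeled and not labels:
            continue
        tp += len(predicted.intersection(labels))
        fp += len(predicted - labels)
        fn += len(labels - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = {"doc-a": {"II", "C00-C97", "C76-C80"}, "doc-b": set()}
    pred = {"doc-a": {"II", "C00-C97", "C76-C80"}, "doc-b": {"IX"}}
    print("original:", micro_prf(gold, pred, skip_unlabeled=False))
    print("modified:", micro_prf(gold, pred, skip_unlabeled=True))
```
        </preformat>
        <p>In this toy case recall is identical under both modes, while precision and F1 drop in the original mode purely because of the prediction on the unlabeled document.</p>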
        <sec id="sec-38-1-1">
          <title>Discussion</title>
          <p>Biomedical text mining is generally a challenging field, but recent progress in
transfer learning for NLP can significantly reduce the engineering required to
come up with domain-sensitive models. Unsupervised data is cheap and can
be obtained in abundance to learn general language patterns [40]; however, such
data may not be readily available when dealing with in-domain and low-resource
languages (e.g. Estonian medical documents). Such deficiencies encourage
research into better cross-lingual and cross-domain embedding alignment methods
that can be transferred effectively.</p>
        </sec>
        <sec id="sec-38-1-2">
          <title>Acknowledgements</title>
          <p>We thank Stalin Varanasi for helpful discussions. This work was partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 777107.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-39">
      <title>References</title>
      <p>2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.</p>
      <p>40. Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky,
Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris
Dyer, et al. Learning and evaluating general linguistic intelligence. arXiv preprint
arXiv:1901.11373, 2019.</p>
      <p>41. Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional
networks for text classification. In Advances in neural information processing systems,
pages 649–657, 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Reinald</given-names>
            <surname>Kim</surname>
          </string-name>
          <string-name>
            <surname>Amplayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kyungjae</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jinyeong</given-names>
            <surname>Yeo</surname>
          </string-name>
          , and
          Seung-won Hwang.
          <article-title>Translations as additional contexts for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1806.05516</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A</given-names>
            <surname>Atutxa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Casillas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N</given-names>
            <surname>Ezeiza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I</given-names>
            <surname>Goenaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V</given-names>
            <surname>Fresno</surname>
          </string-name>
          ,
          <string-name>
            <surname>K Gojenola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Oronoz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O</given-names>
            <surname>Perez-de Vinaspre</surname>
          </string-name>
          .
          <article-title>Ixamed at clef ehealth 2018 task 1: Icd10 coding with a sequence-to-sequence approach</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <surname>Yoshua Bengio.</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Tal</given-names>
            <surname>Baumel</surname>
          </string-name>
          , Jumana Nassour-Kassis, Raphael Cohen,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Elhadad</surname>
          </string-name>
          , and
          Noémie Elhadad.
          <article-title>Multi-label classification of patient notes: case study on icd code assignment</article-title>
          .
          <source>In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Bettina</given-names>
            <surname>Bert</surname>
          </string-name>
          , Antje Dorendahl, Nora Leich, Julia Vietze, Matthias Steinfath, Justyna Chmielewska, Andreas Hensel, Barbara Grune, and
          <article-title>Gilbert Schonfelder. Rethinking 3r strategies: Digging deeper into animaltestinfo promotes transparency in in vivo biomedical research</article-title>
          .
          <source>PLoS biology</source>
          ,
          <volume>15</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e2003217</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Rabia</given-names>
            <surname>Bounaama</surname>
          </string-name>
          and M El Amine Abderrahim.
          <article-title>Tlemcen university at celf ehealth 2018 team techno: Multilingual information extraction-icd10 coding</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Cerri</surname>
          </string-name>
          , Rodrigo C Barros, and
          <string-name>
            <surname>Andre CPLF De Carvalho.</surname>
          </string-name>
          <article-title>Hierarchical multi-label classification using local neural networks</article-title>
          .
          <source>Journal of Computer and System Sciences</source>
          ,
          <volume>80</volume>
          (
          <issue>1</issue>
          ):
          <fpage>39</fpage>
          –
          <lpage>56</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Koby</given-names>
            <surname>Crammer</surname>
          </string-name>
          , Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar, and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Carroll</surname>
          </string-name>
          .
          <article-title>Automatic code assignment to medical text</article-title>
          .
          <source>In Proceedings of the workshop on bionlp 2007: Biological, translational, and clinical language processing</source>
          , pages
          <fpage>129</fpage>
          –
          <lpage>136</lpage>
          . Association for Computational Linguistics,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jacob</surname>
            <given-names>Devlin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>Bert: Pretraining of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Remi</given-names>
            <surname>Flicoteaux</surname>
          </string-name>
          .
          <article-title>Ecstra-aphp@ clef ehealth2018-task 1: Icd10 code extraction from death certificates</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Julien</given-names>
            <surname>Gobeill</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Ruch</surname>
          </string-name>
          .
          <article-title>Instance-based learning for icd10 categorization</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <article-title>Assessing bert's syntactic abilities</article-title>
          .
          <source>arXiv preprint arXiv:1901.05287</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          Sepp Hochreiter and Jürgen Schmidhuber.
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Julia</surname>
            <given-names>Ive</given-names>
          </string-name>
          , Natalia Viani, David Chandran,
          <string-name>
            <given-names>Andre</given-names>
            <surname>Bittar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sumithra</given-names>
            <surname>Velupillai</surname>
          </string-name>
          .
          <article-title>Kcl-health-nlp@ clef ehealth 2018 task 1: Icd-10 coding of french and italian death certificates with character-level convolutional neural networks</article-title>
          .
          <source>In 19th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2018</source>
          , Avignon, France, 10 September 2018 through 14 September 2018, volume
          <volume>2125</volume>
          . CEUR-WS,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Serena</surname>
            <given-names>Jeblee</given-names>
          </string-name>
          , Akshay Budhkar, Sasa Milic, Jeff Pinto, Chloe Pou-Prom, Krishnapriya Vishnubhotla, Graeme Hirst, and
          <string-name>
            <given-names>Frank</given-names>
            <surname>Rudzicz</surname>
          </string-name>
          .
          <article-title>Toronto cl at clef 2018 ehealth task 1: Multi-lingual icd-10 coding using an ensemble of recurrent and convolutional neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman
          , Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark.
          <article-title>Mimic-iii, a freely accessible critical care database</article-title>
          .
          <source>Scientific data</source>
          ,
          <volume>3</volume>
          :
          <fpage>160035</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Rie Johnson and Tong Zhang.
          <article-title>Deep pyramid convolutional neural networks for text categorization</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>562</fpage>
          –
          <lpage>570</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Liadh</surname>
            <given-names>Kelly</given-names>
          </string-name>
          , Hanna Suominen, Lorraine Goeuriot, Mariana Neves, Evangelos Kanoulas,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          , Rene Spijker, Guido Zuccon, Harrisen Scells, and
          João Palotti.
          <article-title>Overview of the CLEF eHealth evaluation lab 2019</article-title>
          . In Fabio Crestani, Martin Braschler, Jacques Savoy,
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Rauber</surname>
          </string-name>
          , et al., editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Lecture Notes in Computer Science, Berlin Heidelberg, Germany,
          <year>2019</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Diederik P Kingma and Jimmy Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Jinhyuk</surname>
            <given-names>Lee</given-names>
          </string-name>
          , Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and
          <string-name>
            <given-names>Jaewoo</given-names>
            <surname>Kang</surname>
          </string-name>
          .
          <article-title>Biobert: pre-trained biomedical language representation model for biomedical text mining</article-title>
          .
          <source>arXiv preprint arXiv:1901.08746</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <article-title>Nltk: the natural language toolkit</article-title>
          .
          <source>arXiv preprint cs/0205028</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>Zulfat</given-names>
            <surname>Miftahutdinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elena</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          .
          <article-title>Kfu at clef ehealth 2017 task 1: Icd10 coding of english death certificates with recurrent neural networks</article-title>
          .
          <source>In CLEF (Working Notes)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Kai Chen, Greg Corrado, and
          Jeffrey Dean.
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Mariana</surname>
            <given-names>Neves</given-names>
          </string-name>
          , Daniel Butzke, Antje Dorendahl, Nora Leich, Benedikt Hummel,
          <article-title>Gilbert Schonfelder, and Barbara Grune. Overview of the CLEF eHealth 2019 Multilingual Information Extraction</article-title>
          . In Fabio Crestani, Martin Braschler, Jacques Savoy,
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Rauber</surname>
          </string-name>
          , et al., editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Lecture Notes in Computer Science, Berlin Heidelberg, Germany,
          <year>2019</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27. World Health Organization.
          <article-title>International statistical classification of diseases and related health problems</article-title>
          , volume
          <volume>1</volume>
          . World Health Organization,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Fabian</surname>
            <given-names>Pedregosa</given-names>
          </string-name>
          , Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss,
          <string-name>
            <surname>Vincent Dubourg</surname>
          </string-name>
          , et al.
          <article-title>Scikit-learn: Machine learning in python</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>12</volume>
          (Oct):
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Matthew E Peters</surname>
            , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
            <given-names>Kenton</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>and Luke</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Alec</surname>
            <given-names>Radford</given-names>
          </string-name>
          , Karthik Narasimhan, Tim Salimans, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <article-title>Improving language understanding with unsupervised learning</article-title>
          .
          <source>Technical report, OpenAI</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Jurica</surname>
            <given-names>Seva</given-names>
          </string-name>
          , Mario Sanger, and Ulf Leser.
          <article-title>Wbi at clef ehealth 2018 task 1: Language-independent icd-10 coding using multi-lingual embeddings and recurrent neural networks</article-title>
          .
          <source>CLEF</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Benjamin</surname>
            <given-names>Shickel</given-names>
          </string-name>
          , Patrick James Tighe, Azra Bihorac, and
          <string-name>
            <given-names>Parisa</given-names>
            <surname>Rashidi</surname>
          </string-name>
          .
          <article-title>Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis</article-title>
          .
          <source>IEEE journal of biomedical and health informatics</source>
          ,
          <volume>22</volume>
          (
          <issue>5</issue>
          ):
          <fpage>1589</fpage>
          –
          <lpage>1604</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Carlos N Silla and Alex A Freitas</surname>
          </string-name>
          .
          <article-title>A survey of hierarchical classification across different application domains</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>22</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>31</fpage>
          –
          <lpage>72</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Noah</surname>
            <given-names>A</given-names>
          </string-name>
          <string-name>
            <surname>Smith.</surname>
          </string-name>
          <article-title>Contextual word representations: A contextual introduction</article-title>
          .
          <source>arXiv preprint arXiv:1902.06006</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Ilya</surname>
            <given-names>Sutskever</given-names>
          </string-name>
          , Oriol Vinyals, and Quoc V Le.
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3104</fpage>
          –
          <lpage>3112</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Ian</surname>
            <given-names>Tenney</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dipanjan Das</surname>
            , and
            <given-names>Ellie</given-names>
          </string-name>
          <string-name>
            <surname>Pavlick</surname>
          </string-name>
          .
          <article-title>Bert rediscovers the classical nlp pipeline</article-title>
          .
          <source>arXiv preprint arXiv:1905.05950</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          <string-name>
            <surname>Lukasz Kaiser</surname>
            , and
            <given-names>Illia</given-names>
          </string-name>
          <string-name>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>5998</fpage>
          –
          <lpage>6008</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Yonghui</surname>
            <given-names>Wu</given-names>
          </string-name>
          , Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao,
          <string-name>
            <given-names>Qin</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Macherey</surname>
          </string-name>
          , et al.
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1609.08144</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Zichao</surname>
            <given-names>Yang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Diyi</given-names>
            <surname>Yang</surname>
          </string-name>
          , Chris Dyer, Xiaodong He,
          <string-name>
            <surname>Alex Smola</surname>
            , and
            <given-names>Eduard</given-names>
          </string-name>
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          .
          <article-title>Hierarchical attention networks for document classification</article-title>
          .
          <source>In Proceedings of the</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>