<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual ICD-10 Code Assignment with Transformer Architectures using MIMIC-III Discharge Summaries</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <institution>Department of Computer Science</institution>
          ,
          <institution>University of Applied Sciences and Arts Dortmund (FHDO)</institution>
          ,
          <addr-line>Emil-Figge Str., Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Medical Informatics</institution>
          ,
          <addr-line>Biometry and Epidemiology</addr-line>
          ,
          <institution>University Hospital Essen</institution>
          ,
          <addr-line>Essen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this work, we present the participation of the FHDO Biomedical Computer Science Group (BCSG) in Task 1 of the CLEF eHealth challenge 2020 on the automatic assignment of ICD-10 codes (CIE-10 in the Spanish translation) to clinical case studies. Training data has been augmented with documents from the Medical Information Mart for Intensive Care (MIMIC-III), a critical care database. ICD-10 CM General Equivalence Mappings (GEMs) were subsequently used to convert the codification from ICD-9 to ICD-10. Recent state-of-the-art Transformer-based models, such as BioBERT and ClinicalBERT, are compared to the Generalized Autoregressive Pretraining for Language Understanding (XLNet) model. Finally, the apriori algorithm has been applied to build association rules by finding frequent item sets. An ensemble of BioBERT and XLNet achieved a mean Average Precision (MAP) score of 0.259 (0.306 for the subset of codes only present in the training and validation sets).</p>
      </abstract>
      <kwd-group>
        <kwd>BioBERT</kwd>
        <kwd>MIMIC-III</kwd>
        <kwd>Conversion</kwd>
        <kwd>Apriori</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>ICD codes serve as a billing mechanism in the Electronic Health Record (EHR) and can be used for automatic semantic indexing of clinical documents, but also to facilitate decision support systems that aim to help clinical coders by suggesting a relevant subset of potential codes for selection. The problem can be described as a mapping from natural language free-texts to medical concepts such that, given a new document, the system can assign multiple codes to it.</p>
      <p>
        In terms of application in the biomedical field, Bidirectional Encoder Representations from Transformers (BERT) has only recently been used for ICD code assignment tasks, such as classifying German animal experiments in CLEF eHealth 2019 [
        <xref ref-type="bibr" rid="ref14 ref3">3,27,25</xref>
        ]. While it has proven to work well on assigning a smaller subset of ICD codes, it is uncertain how Transformer architecture models perform on arbitrarily long clinical text and in solving extreme multi-label classification problems with a high average number of assigned codes per document.
      </p>
      <p>
        CLEF eHealth tracks have featured the classification of multilingual clinical documents using ICD codes since 2016 [22,23,24,25]. This work enriches training data with the Medical Information Mart for Intensive Care (MIMIC-III) database and compares BERT-based models with XLNet [
        <xref ref-type="bibr" rid="ref19">32</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        A hierarchy-based approach with Support Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], using the 'is-a' relationship between ICD-9 codes to model label dependencies, has been an early approach to ICD coding [
        <xref ref-type="bibr" rid="ref13">26</xref>
        ]. The hierarchy-based classifier surpassed the flat SVM, which did not consider code dependencies. Other approaches identified label density and label noise as useful features [
        <xref ref-type="bibr" rid="ref16">29</xref>
        ], while others empirically evaluated the simultaneous occurrence of labels [16].
      </p>
      <p>
        ML-NET [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] followed the hierarchy-based approach and extended the coding of documents. Its deep neural network contains an additional network for estimating the number of labels: instead of separating relevant from irrelevant labels by a threshold value, a network for predicting the number of labels was built using the document vector as input.
      </p>
      <p>
        Baumel et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] evaluated four different models for ICD code assignment using data from the MIMIC-II and MIMIC-III data sets. They presented a continuous bag-of-words model [19] (CBOW), a convolutional neural network, an SVM one-versus-all model and a bidirectional gated-recurrent unit model with hierarchical attention (HA-GRU).
      </p>
      <p>Another proposed model is a code-wise attention network [21], where attention mechanisms are used to extract n-grams from the text that are influential in predicting each code.</p>
      <p>
        Unified Medical Language System (UMLS) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] mapping and word embeddings have been shown to be effective for text classification in the biomedical domain and improved results in automatic ICD coding [
        <xref ref-type="bibr" rid="ref15">28</xref>
        ]. The embeddings were selected by sequentially mapping discharge summaries to UMLS biomedical concepts in an approach to enrich word representations and to eliminate variations caused by tense, abbreviations and/or spelling mistakes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>For training data, two different sources were used: the official CodiEsp dataset3 with manually generated ICD-10 codifications, and the MIMIC-III database, which uses the older ICD-9 classification system and maps codes to discharge summaries [15]. After exploring additional resources, such as the abstracts collected from Lilacs and Ibecs4, the MIMIC-III database was selected as the main data source for augmentation, because it seemed to be the most similar database compared to the CodiEsp corpus: its free-text narrative documents describe hospital courses and carry a high average number of manually assigned codes per document, coming from real-world EHRs. With the decision to use the MIMIC-III dataset for augmentation, it was also decided to focus on the English translations of the CodiEsp corpus. A key difference between the two data sources is that the codification for CodiEsp is a semantic mapping of concepts, where the assigned codes do not have to be based on medical outcome. For example, a negative serum test (as seen in Listing 1.1) still results in appropriately assigned ICD-10 codes for CodiEsp, whereas it would not appear in MIMIC-III.</p>
      <sec id="sec-3-1">
        <title>Listing 1.1. Excerpt of CodiESP Document with id S0211-69952009000500014-1,</title>
        <p>showing results of a blood serum test and its codification (Assigned Codes List: r80.9, r20.2, b19.20, b19.10, r23.8, r60.0, r10.9, r19.7, m25.50, l98.9, b20).
[...]
On physical examination: blood pressure 104/76 mmHg, BMI 27, minimal edema in lower limbs and papules in elbows and arms. Blood count and coagulation were normal, creatinine 0.9 mg/dl, total cholesterol 238 mg/dl, triglycerides 104 mg/dl, total protein 6.5 g/dl and albumin 3.6 g/dl. Anticardiolipin antibodies: Serology against HBV, HCV and HIV was negative.
[...]</p>
        <sec id="sec-3-1-1">
          <title>CodiEsp Corpus</title>
          <p>The CodiEsp corpus consists of 1,000 clinical case studies manually selected by a practicing physician and a clinical documentarian [20]. The training and development dataset comprises 750 documents with an average of 11.09 codes assigned per document. The test set contains 250 documents and was provided together with an additional collection of more than 2,000 documents (background set) to prevent manual corrections. The CodiEsp training and development dataset contains 26,696 unique tokens, with an average of 301 tokens and 19 sentences per document. It contains 2,557 distinct codes in total, of which 363 unseen codes appear in the test set, as seen in Figure 1 (a). 68.24 % of the codes are explainable with the CodiEsp training and development dataset.
3 https://doi.org/10.5281/zenodo.3625746, last accessed 2020-07-17
4 https://doi.org/10.5281/zenodo.3606625, last accessed 2020-07-17</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>MIMIC-III Corpus</title>
          <p>The MIMIC-III database comprises de-identified records from Beth Israel Deaconess Medical Center intensive care unit (ICU) stays, collected between 2001 and 2012. It contains 59,652 discharge summaries with an average of 11.48 codes assigned per document. It has 119,171 unique tokens with 1,947 tokens and 112 sentences on average. The dataset is in principle very well suited but has some characteristics that need to be adapted. The coding system is ICD-9, which has to be converted to ICD-10 to match the CodiEsp codification. In addition, the dataset only contains summaries of intensive care unit stays, which on average exceed the maximum input length of the Transformer architectures. After conversion, the dataset contains 5,447 distinct codes, as seen in Figure 1 (b).</p>
          <p>
            Segmentation For BERT [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] models, the maximum length of a sequence after tokenizing is 512, resulting in an effective limit of 510 tokens for the input layer after subtracting the [CLS] and [SEP] tokens. Because MIMIC-III discharge summaries have an average length of 1,947 tokens (see Table 1), with only 11.67 % of all documents not exceeding 510 tokens, the data has to be truncated in order to fit into the Transformer model.
          </p>
          <p>
            A simple approach, as proposed by Sun et al. [
            <xref ref-type="bibr" rid="ref17">30</xref>
            ], would be to only use the first 510 tokens (head-only) or the last 510 tokens (tail-only) of a document, but neither seems appropriate for truncating clinical text without losing relevant information.
          </p>
          <p>When inspecting the summaries, even though they are free-text narratives, a fixed structure was identified in most of the documents: they usually start with a Chief Complaint followed by a historical background section, which may include History of Present Illness, Past Medical History, Social History and Family History. Within Diagnostics and Pertinent Results, the structure is no longer as consistent and different sections appear, which depend more on the individual case. From the middle towards the end of the documents there is a section called Brief Hospital Course, which summarizes the ICU stay, followed by discharge condition instructions and/or follow-up instructions.</p>
          <p>In early experiments, the effect of using different segments was evaluated. It was found that using the first 510 tokens (head-only) of discharge summaries decreased performance compared to using the last 510 tokens (tail-only). It can be assumed that this is because the background history, which comes at the top of the documents, is not as relevant to the clinical coding as the narrative of the actual present hospital course. It was decided to remove content up to the Brief Hospital Course section and sequentially use the remaining document up to whatever fits into 510 tokens. 7,822 documents where this section was not present were omitted, resulting in a 13 % loss of data. Descriptive statistics of the segmented corpus can be seen in Table 1.</p>
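          <p>To make the segmentation step concrete, the following minimal sketch (function and variable names are illustrative, not from the original implementation; any compatible tokenizer can be substituted) removes content up to the Brief Hospital Course section and clips the remainder to 510 word pieces:</p>
          <preformat>
import re
from typing import Optional

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def segment_summary(text: str, max_tokens: int = 510) -> Optional[str]:
    """Keep the document from 'Brief Hospital Course' on, clipped to max_tokens."""
    match = re.search(r"Brief Hospital Course", text, flags=re.IGNORECASE)
    if match is None:
        return None  # such documents were omitted, as described above
    tail = text[match.start():]
    # 512 minus the [CLS] and [SEP] tokens leaves 510 word pieces
    tokens = tokenizer.tokenize(tail)
    return tokenizer.convert_tokens_to_string(tokens[:max_tokens])
          </preformat>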
        </sec>
      </sec>
      <table-wrap id="tab1">
        <label>Table 1.</label>
        <caption>
          <p>Descriptive statistics of the corpora. MIMIC-III* denotes the segmented MIMIC-III corpus.</p>
        </caption>
        <table>
          <thead>
            <tr><th/><th>MIMIC-III</th><th>MIMIC-III*</th><th>CodiESP Train Dev</th></tr>
          </thead>
          <tbody>
            <tr><td>Number of records with ICD code</td><td>59,652</td><td>51,830</td><td>750</td></tr>
            <tr><td>Number of unique tokens</td><td>1,091,025</td><td>276,500</td><td>26,696</td></tr>
            <tr><td>Number of bigrams</td><td>10,609,279</td><td>2,846,377</td><td>114,846</td></tr>
            <tr><td>Number of trigrams</td><td>27,814,651</td><td>7,873,155</td><td>180,650</td></tr>
            <tr><td>Avg. number of tokens / record</td><td>1,947</td><td>427</td><td>301</td></tr>
            <tr><td>Avg. number of sentences / record</td><td>112</td><td>39</td><td>19</td></tr>
            <tr><td>Avg. number of labels / record</td><td>11.48</td><td>11.45</td><td>11.09</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-2">
        <title>ICD-9 Code Conversion with General Equivalence Mappings</title>
          <p>
            ICD-9 codes of the MIMIC-III database have been converted to ICD-10 using the publicly available ICD-10 CM General Equivalence Mappings (GEMs) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Turer et al. assessed the reliability of conversion between ICD-9 and ICD-10 and found that manual coding from the forward GEMs and backward GEMs was reproducible in 85.2 % and 90.4 % of cases, respectively [
            <xref ref-type="bibr" rid="ref18">31</xref>
            ].
          </p>
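          <p>A minimal sketch of this conversion step, assuming the forward GEM file has been obtained as a whitespace-separated text file of ICD-9 source codes, ICD-10 target codes and mapping flags (the parsing details are assumptions for illustration):</p>
          <preformat>
from collections import defaultdict

def load_gems(path: str) -> dict:
    """Parse a forward GEM file into an ICD-9 -> set-of-ICD-10 mapping."""
    mapping = defaultdict(set)
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) >= 2:
                icd9, icd10 = parts[0], parts[1]
                mapping[icd9].add(icd10)
    return mapping

def convert_codes(icd9_codes, gems) -> list:
    """Convert one document's codes, dropping codes without a mapping."""
    converted = set()
    for code in icd9_codes:
        converted.update(gems.get(code.replace(".", ""), set()))
    return sorted(converted)
          </preformat>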
          <p>Data Selection Because of the different data sources and MIMIC-III being limited to ICU cases, both datasets have been compared in terms of their distinct code subsets. As seen in Figure 1 (b), the MIMIC-III data contains 4,156 unique ICD-10 codes that are not present in the CodiEsp train, development, and test set. These codes are less generic, apply to the ICU cases and are not covered by the smaller CodiEsp corpus. To make the data augmentation more practical, only documents where 50 % or more of the assigned codes are present in the Top 100, Top 250 or Top 500 frequent codes of the CodiEsp training and development set were used (the impact on training size can be seen in Table 3). Only discharge summaries containing the Brief Hospital Course section were selected by using a regular expression match, resulting in 51,830 out of 59,652 available documents.</p>
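          <p>The 50 % criterion can be expressed in a few lines; this sketch assumes each MIMIC-III document carries its converted ICD-10 code set and that top_codes holds the Top 100, 250 or 500 frequent CodiEsp codes (names are illustrative):</p>
          <preformat>
def select_documents(documents, top_codes, min_overlap=0.5):
    """Keep documents where at least half the assigned codes are frequent CodiEsp codes."""
    frequent = set(top_codes)
    selected = []
    for doc in documents:
        codes = set(doc["icd10_codes"])
        if codes and len(codes.intersection(frequent)) / len(codes) >= min_overlap:
            selected.append(doc)
    return selected
          </preformat>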
          <p>The amount of available augmentation data increases when the Top frequent code set is enlarged, because the matching rule requiring that a document has 50 % or more of its codes in that set becomes less strict, so more MIMIC-III documents end up in the training data. Increasing the augmentation data in this way increases recall but reduces precision (see Table 4). A good compromise was to create a model that is able to predict the Top 100 frequent codes in CodiEsp.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
      <sec id="sec-4-1">
        <title>Transformer architecture and BERT</title>
        <p>
          BERT and Transformer [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] have proven to be extremely effective in many downstream natural language processing (NLP) tasks. While this works well on assigning a smaller subset of ICD codes [
          <xref ref-type="bibr" rid="ref14 ref3">3,27</xref>
          ], it is uncertain how BERT models can cope with clinical texts of arbitrary length and with extreme multi-label classification problems with a high average number of assigned codes per document.
        </p>
        <p>Fig. 1. Venn diagrams showing the distribution of the number of distinct ICD-10 codes for different datasets and subsets: (a) CodiEsp Train Dev and Test distribution; (b) MIMIC-III and CodiEsp Train Dev and Test set; (c) MIMIC-III with 50 % in Top 100 CodiEsp and Test set; (d) MIMIC-III with 50 % in Top 100 CodiEsp and Train Dev and Test set.</p>
        <p>Though the MIMIC-III augmentation does not fit into the token limitation without clipping documents, the Transformer architecture offers innovations that are practical for the classification of clinical text. The WordPiece tokenizer allows words that are outside the vocabulary to be represented by word pieces instead of simply being assigned to an unknown token, which is why it was selected for the first tests. This feature is particularly useful for discharge summaries, as spelling mistakes and non-standard abbreviations are common.</p>
        <p>
          Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) [18] and ClinicalBERT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] share the same architecture but are pre-trained on large-scale biomedical corpora. BioBERT has been pre-trained on PubMed abstracts5 and PMC6 full-text articles. Bio ClinicalBERT7 is an extended model that was additionally pre-trained on all notes from MIMIC-III (880M words). The Bio ClinicalBERT model was selected because of this larger pre-training corpus.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>XLNet</title>
        <p>
          The recently proposed Generalized Autoregressive Pretraining for Language Understanding (XLNet) model [
          <xref ref-type="bibr" rid="ref19">32</xref>
          ] is an autoregressive language model (LM). Although BERT and XLNet have many similarities, there are some differences that need to be explained. Here, autoregressive means that XLNet makes use of the Transformer-XL [14] to capture information from previous sequences in order to process the current sequence, achieving the regressive effect at the sequence level. XLNet uses relative position encoding and a permutation LM, factorizing the output over all possible permutations.
        </p>
        <p>The permutation effect is limited to words which are "attended" to. This is done by changing the attention mask prior to the attention softmax while keeping track of the positional information in a sequence. For example, during pre-training, to predict a token t, the attention mask is set to minimum values for tokens that appear after position i &gt; t. Only the tokens before and including t in the current factorization are used to compute the attention. The advantage is that the tokens that come before t change with each permutation, but their positions within the sequence are kept constant, allowing XLNet to capture bidirectional context.</p>
        <p>
          XLNet implements Multi-head attention slightly differently from BERT, where it is known that the layer generates a query Q, a key K, and a value V projection of each word in the input sentence. For each query Q, the Multi-head attention layer uses K to compute an attention score for each value vector V and then sums the value vectors into a single representation using the attention weights [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
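        <p>For reference, the scaled dot-product attention underlying each head can be written in a few lines. This is the standard formulation from the Transformer literature, not code from the system described here:</p>
        <preformat>
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # masked positions get a very low score before the softmax,
        # analogous to the XLNet permutation masking described above
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v  # weighted sum of the value vectors
        </preformat>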
        <p>For XLNet, linear layers are used to map the input to the Multi-head attention layer directly. This maps the input into smaller subspaces whose dimensions add up to the original dimension, as known from BERT. This allows each word to attend more to other words and not only to itself, which results in a final, richer representation of each word.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Preprocessing</title>
        <p>Because the discharge summaries were de-identified free-text narratives, additional pre-processing steps were taken to convert them into a sequence of sentences, removing all numbers and name placeholders. Leading and trailing spaces, quotations and semicolons have also been removed. For the CodiESP corpus, no pre-processing was applied.
5 https://pubmed.ncbi.nlm.nih.gov/, last accessed 2020-07-17
6 https://www.ncbi.nlm.nih.gov/pmc/, last accessed 2020-07-17
7 https://huggingface.co/emilyalsentzer/Bio_Discharge_Summary_BERT, last accessed 2020-07-17</p>
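        <p>A sketch of these cleaning steps using regular expressions (the exact patterns are assumptions; MIMIC-III de-identification placeholders have the form [** ... **]):</p>
        <preformat>
import re

def clean_summary(text: str) -> str:
    """Remove numbers, name placeholders, quotations and semicolons."""
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)  # de-identification placeholders
    text = re.sub(r"\d+", " ", text)              # all numbers
    text = re.sub(r"[\"';]", " ", text)           # quotations and semicolons
    return re.sub(r"\s+", " ", text).strip()      # leading/trailing spaces
        </preformat>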
      </sec>
      <sec id="sec-4-4">
        <title>Training the models</title>
        <p>
          The experiments were done with the PyTorch-transformers implementations of BERT and XLNet8. The overall end-to-end training process can be seen in Figure 2. The models were fine-tuned on all layers without freezing. As proposed by the original papers [
          <xref ref-type="bibr" rid="ref19 ref9">9,32</xref>
          ], Adam [17] was used in early experiments as the optimizer, but was then replaced by the Layerwise Adaptive Large Batch (LAMB) optimizer [
          <xref ref-type="bibr" rid="ref20">33</xref>
          ], because it resulted in a slightly reduced training time. The hyperparameters have been selected and optimized based on the development set performance. Using a learning rate of 7e-4 or 6e-4 resulted in the best scores, though the Transformer models react very sensitively to the learning rate, as different settings often led to poor results.
        </p>
        <p>Different warmup schedules were tried, but had no impact on the results. Between the cased and uncased versions of BERT, it was found that overall the uncased version works slightly better, although the difference is very small. For XLNet, only a cased version is available. The base version of XLNet was preferred over the large version due to computational expense. The training batch size was 8 for XLNet and 16 for the BERT models. To produce the ranking of the codes, binary cross-entropy with logits was used to obtain a confidence value for each ICD-10 code during inference. The codes were then ordered by confidence and cut off with a threshold of t = 0.4. The prediction pipeline of the BERT model, including the association rules, is shown in Figure 3.</p>
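        <p>A condensed sketch of this objective and the threshold-based ranking (data loading and the training loop are omitted; the label count and model name are placeholders, and any LAMB implementation can be substituted for the optimizer):</p>
        <preformat>
import torch
from transformers import AutoModelForSequenceClassification

NUM_CODES = 100  # e.g. the Top 100 frequent CodiEsp codes
model = AutoModelForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=NUM_CODES)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy with logits

# inside the training loop (batch size 8 for XLNet, 16 for BERT):
#   logits = model(input_ids, attention_mask=attention_mask).logits
#   loss = criterion(logits, multi_hot_labels.float())

def rank_codes(logits, code_names, threshold=0.4):
    """Order the predicted codes by confidence and cut off at t = 0.4."""
    probs = torch.sigmoid(logits)
    ranked = sorted(zip(code_names, probs.tolist()), key=lambda pair: -pair[1])
    return [(code, p) for code, p in ranked if p >= threshold]
        </preformat>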
      </sec>
      <sec id="sec-4-5">
        <title>Apriori Association Rules</title>
        <p>
          The apriori algorithm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] has been used to find frequent itemsets in a list of transactions, but has recently also been used to find association rules and label co-occurrences in clinical text, such as in autopsy reports [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Association rules can be obtained with the support and confidence parameters, where the support of a set of items is the probability that this set occurs in a transaction. Confidence refers to the likelihood that an item B will also be purchased when item A is purchased. It can be calculated by dividing the number of transactions where A and B are bought together by the total number of transactions where A is bought. To identify and explore co-occurrences, a low min support value (0.02) has been used on the CodiEsp train and development set. The resulting apriori association rules, as seen in Figure 6, have been plotted with the arulesviz [13] R package. The graph shows 59 rules.
        </p>
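        <p>Support and confidence as defined above can also be computed with an off-the-shelf Python apriori implementation; the following sketch uses the mlxtend package (the visualization in this work was done with R/arulesviz, so this is an illustration, not the original tooling):</p>
        <preformat>
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# each transaction is the list of ICD-10 codes assigned to one document
transactions = [["b19.10", "b19.20", "r50.9"], ["r59.0", "r50.9", "r52"]]
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# a low min_support, as above, to identify and explore co-occurrences
frequent = apriori(onehot, min_support=0.02, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.3)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
        </preformat>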
        <p>One example of a relation is hepatitis B and C, as shown by the rule that connects b19.10 and b19.20. When exploring the data, it was found that this rule refers to serology tests, which often include test results for different viruses, such as hepatitis B and C. An example can be seen in Listing 1.1. Another confident rule is that localized enlarged lymph nodes (r59.0 and r59.9) link to unspecified fever (r50.9), which then links to unspecified pain (r52). As such rules should be covered by the trained model, not many different rulesets have been tested and added during inference.
8 https://github.com/huggingface/transformers, last accessed 2020-07-17</p>
        <p>Fig. 2. Overall end-to-end training process: pre-training data (BioBERT: PubMed abstracts; Bio ClinicalBERT: PubMed abstracts and MIMIC-III discharge summaries; XLNet: BooksCorpus and English Wikipedia), fine-tuning on the official CodiEsp data and matching MIMIC-III documents, and a classifier with Sigmoid and BCELoss. Fig. 3. Prediction pipeline producing label probabilities with threshold t = 0.4.</p>
        <p>However, the 11-ruleset seen in Figure 5 improved the mean Average Precision (MAP) results on the development set by between 0 % and 1.2 %, depending on the model, and was therefore added to the final submission where codes were missed. The submission guideline requires that the prediction is ordered by confidence. Because the predicted confidence cannot be compared with apriori support or confidence values, and because the confidence of the primary model was not high enough, the association rule codes were added at the end, ranked by highest level of support.</p>
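        <p>A small sketch of this post-processing step (the rule and prediction data structures are assumptions for illustration):</p>
        <preformat>
def append_rule_codes(predicted, rules):
    """Append association-rule consequents after the model's ranked predictions.

    predicted: codes already ordered by model confidence.
    rules: (antecedent_codes, consequent_code, support) tuples.
    """
    present = set(predicted)
    fired = [(support, consequent)
             for antecedents, consequent, support in rules
             if set(antecedents).issubset(present) and consequent not in present]
    # ranked by highest level of support, appended at the end
    return predicted + [code for _, code in sorted(fired, reverse=True)]
        </preformat>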
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>Figure 4 shows experimental runs on the development set for the tested models with different pre-trained embeddings and different frequent Top code subsets. This results in different enriched training data and also in a different number of labels a model is able to predict. A comparison of how many documents end up in the training data can be seen in Table 3. The final best results on the development set for each model can be seen in Table 2.</p>
      <p>While the F1-Score is superior for models that can only predict the Top 50 frequent codes, the MAP score penalises this behaviour on the full set, because not only the classification but also the positional ranking is taken into consideration. When matching the Top 50 most frequent codes with MIMIC-III, there is not enough data available for augmentation (363 additional documents). Starting with the Top 100 most frequent codes, improvements coming from the additional data can be seen. The augmentation improves the reported MAP score by 0.097 (0.128 F1) for the XLNet model. Increasing the training data further increases recall, but decreases precision.</p>
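      <p>For clarity, the per-document average precision whose mean over all documents gives the MAP score is sketched below; this is the standard formulation rather than the official evaluation script:</p>
      <preformat>
def average_precision(ranked_codes, gold_codes):
    """Reward gold codes that are placed early in the predicted ranking."""
    hits, precision_sum = 0, 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in gold_codes:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(gold_codes) if gold_codes else 0.0

# MAP is the mean of average_precision over all evaluated documents
      </preformat>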
      <p>The final test set results for evaluation were reported by the task organisers and can be seen in Table 4. On the test set, the Bio ClinicalBERT model achieved the overall best performance for a single model with a MAP score of 0.259. XLNet on Top 100 frequent codes achieved the best performance in precision.</p>
      <p>When the gold standard for the test set was released, it was evaluated how many of the unseen codes would have been explainable by keeping the remaining annotated codes of each MIMIC-III document within the training data (knowledge discovery). Figure 1 (d) shows that for the Top 100 most frequent codes training set, 56 distinct unseen codes would have been explainable. Here, a small performance improvement can be expected, but it is noteworthy that only a few of the codes were seen more than once in the test data (76 appearances in total). Because they were unseen before, it can be assumed that these are codes with rare appearances. It can be concluded that more resources are needed to be able to explain the full code set.</p>
      <p>Fig. 4. F1-Score over training steps on the development set for the tested models and frequent Top code subsets: BERT base mimic Top 50, BERT base Top 50, XLNet mimic Top 50, XLNet mimic Top 100, XLNet Top 50, XLNet Top 100 and Bio ClinicalBERT Top 100.</p>
      <table-wrap id="tab2">
        <label>Table 2.</label>
        <caption>
          <p>Final best results on the development set for each model.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>MAP</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>XLNet base cased + MIMIC-III - Top 50</td><td>0.232</td><td>0.608</td></tr>
            <tr><td>XLNet base cased - Top 50</td><td>0.216</td><td>0.602</td></tr>
            <tr><td>BERT base uncased + MIMIC-III - Top 50</td><td>0.143</td><td>0.47</td></tr>
            <tr><td>BERT base uncased - Top 50</td><td>0.165</td><td>0.372</td></tr>
            <tr><td>XLNet base cased + MIMIC-III - Top 100</td><td>0.247</td><td>0.432</td></tr>
            <tr><td>XLNet base - Top 100</td><td>0.15</td><td>0.304</td></tr>
            <tr><td>Bio Clinical BERT + MIMIC-III - Top 100</td><td>0.244</td><td>0.361</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap id="tab3">
        <label>Table 3.</label>
        <caption>
          <p>Training data and model sizes for the different Top frequent code subsets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Training Data Size</th><th>Model Size</th></tr>
          </thead>
          <tbody>
            <tr><td>XLNet mimic 500</td><td>19,484 documents</td><td>459.78M</td></tr>
            <tr><td>XLNet mimic 250</td><td>10,754 documents</td><td>459.03M</td></tr>
            <tr><td>XLNet mimic 100</td><td>3,286 documents</td><td>458.58M</td></tr>
            <tr><td>Bio Clinical BERT 100</td><td>3,286 documents</td><td>423.43M</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Fig. 5. Graph for 11 ICD-10 apriori association rules. Size: min support (0.03), min confidence (0.3); Color: lift (1.393-20.294).</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This work compared BERT-based models with XLNet and evaluated the effect of enriching training data with documents from MIMIC-III. It was found that the MIMIC-III augmentation with code conversion improved the results compared to using only the stock dataset. The apriori algorithm has been applied to build and explore association rules by finding frequent item sets. The 11-ruleset improved the mean Average Precision (MAP) results on the development set by between 0 % and 1.2 %.</p>
      <p>Among the submitted models, the ensemble of BioBERT and XLNet achieved the highest mean Average Precision (MAP) score of 0.259 (0.306 for the subset of codes only present in the train and validation sets). In terms of single model performance, the Bio ClinicalBERT model achieved the overall best performance. XLNet, even though pre-trained on generic text, has the highest precision value on the test set and the overall best performance on the development set.</p>
      <p>Though the models are still far from achieving good results on the full label set, the task has been very challenging, with many possible labels given only a relatively small dataset. It was found that even the large MIMIC-III database is not able to cover all unseen codes, so it can be concluded that more resources are needed to be able to explain the full code set.</p>
      <p>In future work, XLNet's attention should be further evaluated, because the sequence dependency on the hidden states of previous sequences can be adjusted by a memory length hyper-parameter. It will be interesting to tune this parameter and see its impact, but also to test how a domain-specific XLNet model performs when pre-trained on large biomedical data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). vol. 1215, pp. 487-499 (1994)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 72-78. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/W19-1909</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Amin, S., Neumann, G., Dunfield, K., Vechkaeva, A., Chapman, K.A., Wixted, M.K.: MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum (2019)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes: case study on ICD code assignment. In: Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence (2018)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267-D270 (2004)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Butler, R.R.: ICD-10 general equivalence mappings: Bridging the translation gap from ICD-9. Journal of AHIMA 78(9), 84-86 (2007)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Clark, K., Khandelwal, U., Levy, O., Manning, C.: What Does BERT Look at? An Analysis of BERT's Attention. In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 276-286 (2019). https://doi.org/10.18653/v1/W19-4828</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273-297 (1995)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Du, J., Chen, Q., Peng, Y., Xiang, Y., Tao, C., Lu, Z.: ML-Net: multilabel classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association 26(11), 1279-1285 (2019). https://doi.org/10.1093/jamia/ocz085</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Duarte, F., Martins, B., Pinto, C.S., Silva, M.J.: Deep neural models for ICD-10 coding of death certificates and autopsy reports in free-text. Journal of Biomedical Informatics 80, 64-77 (2018). https://doi.org/10.1016/j.jbi.2018.02.011</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Goeuriot, L., Suominen, H., Kelly, L., Miranda-Escalada, A., Krallinger, M., Liu, Z., Pasi, G., Saez Gonzales, G., Viviani, M., Xu, C.: Overview of the CLEF eHealth Evaluation Lab 2020. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Neveol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS Volume 12260 (2020)</mixed-citation>
      </ref>
      <ref id="ref-13">
        <mixed-citation>13. Hahsler, M., Chelluboina, S.: arulesViz: Visualizing association rules and frequent itemsets. R package version 0.1-5 (2012)</mixed-citation>
      </ref>
      <ref id="ref-14">
        <mixed-citation>14. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 328-339 (2018). https://doi.org/10.18653/v1/P18-1031</mixed-citation>
      </ref>
      <ref id="ref-15">
        <mixed-citation>15. Johnson, A.E., Pollard, T.J., Shen, L., Li-Wei, H.L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), 1-9 (2016)</mixed-citation>
      </ref>
      <ref id="ref-16">
        <mixed-citation>16. Kavuluru, R., Rios, A., Lu, Y.: An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records. Artificial Intelligence in Medicine 65 (2015). https://doi.org/10.1016/j.artmed.2015.04.007</mixed-citation>
      </ref>
      <ref id="ref-17">
        <mixed-citation>17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)</mixed-citation>
      </ref>
      <ref id="ref-18">
        <mixed-citation>18. Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (2019). https://doi.org/10.1093/bioinformatics/btz682</mixed-citation>
      </ref>
      <ref id="ref-19">
        <mixed-citation>19. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient Estimation of Word Representations in Vector Space. In: 1st International Conference on Learning Representations (ICLR). vol. abs/1301.3781. Scottsdale, Arizona, USA (2013)</mixed-citation>
      </ref>
      <ref id="ref-20">
        <mixed-citation>20. Miranda-Escalada, A., Gonzalez-Agirre, A., Armengol-Estape, J., Krallinger, M.: Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of CLEF eHealth 2020. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2020)</mixed-citation>
      </ref>
      <ref id="ref-21">
        <mixed-citation>21. Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable Prediction of Medical Codes from Clinical Text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1101-1111 (2018). https://doi.org/10.18653/v1/N18-1100</mixed-citation>
      </ref>
      <ref id="ref-22">
        <mixed-citation>22. Neveol, A., Cohen, K.B., Grouin, C., Hamon, T., Lavergne, T., Kelly, L., Goeuriot, L., Rey, G., Robert, A., Tannier, X., Zweigenbaum, P.: Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR Workshop Proceedings 1609, 28-42 (2016)</mixed-citation>
      </ref>
      <ref id="ref-23">
        <mixed-citation>23. Neveol, A., Robert, A., Anderson, R., Cohen, K.B., Grouin, C., Lavergne, T., Rey, G., Rondet, C., Zweigenbaum, P.: CLEF eHealth 2017 Multilingual Information Extraction task Overview: ICD10 Coding of Death Certificates in English and French. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2017)</mixed-citation>
      </ref>
      <ref id="ref-24">
        <mixed-citation>24. Neveol, A., Robert, A., Grippo, F., Morgand, C., Orsi, C., Pelikan, L., Ramadier, L., Rey, G., Zweigenbaum, P.: CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2018)</mixed-citation>
      </ref>
      <ref id="ref-25">
        <mixed-citation>25. Neves, M.L., Butzke, D., Dorendahl, A., Leich, N., Hummel, B., Schonfelder, G., Grune, B.: Overview of the CLEF eHealth 2019 multilingual information extraction. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum. CEUR Workshop Proceedings (2019)</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>26. Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association 21(2), 231-237 (2014)</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>27. Sänger, M., Weber, L., Kittner, M., Leser, U.: Classifying German Animal Experiment Summaries with Multi-lingual BERT at CLEF eHealth 2019 Task 1. In: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum (2019)</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>28. Schäfer, H., Friedrich, C.M.: UMLS mapping and Word embeddings for ICD code assignment using the MIMIC-III intensive care database. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp. 6089-6092. IEEE (2019)</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>29. Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science 292, 135-151 (2013)</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>30. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? In: China National Conference on Chinese Computational Linguistics. pp. 194-206. Springer (2019)</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>31. Turer, R.W., Zuckowsky, T.D., Causey, H.J., Rosenbloom, S.T.: ICD-10-CM Crosswalks in the primary care setting: assessing reliability of the GEMs and reimbursement mappings. Journal of the American Medical Informatics Association 22(2), 417-425 (2015)</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>32. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.: XLNet: Generalized Autoregressive Pretraining for Language Understanding. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: NeurIPS 2019. pp. 5754-5764. Vancouver, BC, Canada (2019)</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>33. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., Hsieh, C.J.: Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In: International Conference on Learning Representations (ICLR) (2020)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>