<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interaction Matching for Long-Tail Multi-Label Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sean MacAvaney</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franck Dernoncourt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walter Chang</string-name>
          <email>wachangg@adobe.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazli Goharian</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ophir Frieder</string-name>
          <email>ophirg@ir.cs.georgetown.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adobe Research</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IR Lab, Georgetown University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an elegant and effective approach for addressing limitations in existing multi-label classification models by incorporating interaction matching, a concept shown to be useful for ad-hoc search result ranking. By performing soft n-gram interaction matching, we match labels with natural language descriptions (which are available in most multi-label tasks). Our approach can be used to enhance existing multi-label classification approaches, which are biased toward frequently-occurring labels. We evaluate our approach on two challenging tasks: automatic medical coding of clinical notes and automatic labeling of entities from software tutorial text. Our results show that our method can yield up to an 11% relative improvement in macro performance, with most of the gains stemming from labels that appear infrequently in the training set (i.e., the long tail of labels).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Multi-label text classification (i.e., the task of assigning a
variable number of labels to a piece of text) is a classic task
with a variety of practical applications. For instance, a
clinical report could be tagged with medical codes, describing
a patient’s diagnoses (e.g., Lyme disease) and procedures
(e.g., clipping of aneurysm). Since both the note and the
medical codes are required in the clinical process, a system
that can automatically derive these medical codes from notes
would be a valuable time-saving measure. As another
example, software tutorial text can be semantically labeled with
the tools that are used to accomplish the task. These labels
could provide additional support to users trying to replicate
a tutorial, or be used to improve search engines by indexing
on these labels.</p>
      <p>
        Text labeling approaches rely on machine learning
techniques to rank a given set of labels for a piece of text.
For example, supervised FastText
        <xref ref-type="bibr" rid="ref8">(Joulin et al. 2017)</xref>
        learns
dense label and document term representations. At inference
time, it compares the label representations to the
representation of an unseen document to produce label scores.
Others have employed conceptually similar yet more
sophisticated approaches, such as using convolutional neural
networks and attention mechanisms in a similar fashion
        <xref ref-type="bibr" rid="ref11 ref13">(Mullenbach et al. 2018; Liu et al. 2017)</xref>
        . One limitation of
these approaches is that the models fail to effectively rank
infrequently-occurring labels due to inadequate variability
in the training data. These approaches are also unable to
handle labels that do not occur in training data (e.g., extremely
rare or new labels).
      </p>
      <p>
        We present a text labeling approach that uses soft
n-gram interaction matching, an approach inspired by recent
work in ad-hoc ranking
        <xref ref-type="bibr" rid="ref14 ref4">(Pang et al. 2016; Hui et al. 2018)</xref>
        .
This allows handling of labels that have meaningful
natural language names, while not necessarily occurring
frequently in training data. It is common to have label names in
multi-labeling tasks, as these are used by humans to
manually perform the labeling (e.g., medical codes have
descriptions). Our approach, which handles infrequent labels, can
be combined with existing labeling techniques that handle
frequently-occurring labels. We show that our approach is
effective at two tasks, each with a large number of labels:
automatic medical coding of clinical reports (1,159 labels),
and automatic labeling of tools in software tutorials (831
labels).
      </p>
      <p>In summary, our contributions are: (1) an approach for
extending multi-label classification models, based on soft
n-gram interaction matching; (2) an evaluation on two
datasets, showing that our approach can be effectively
combined with other leading classification approaches; and (3)
an analysis demonstrating our capacity to identify long tail
labels, even those without training samples.</p>
    </sec>
    <sec id="sec-2">
      <title>Background &amp; Related Work</title>
      <p>
        Multi-label text classification. This is a well-studied task
with a multitude of prior work. Among the most notable
recent efforts are supervised FastText
        <xref ref-type="bibr" rid="ref8">(Joulin et al. 2017)</xref>
        ,
which learns embeddings for labels that can be compared
to document representations. Earlier work by
        <xref ref-type="bibr" rid="ref9">Kim (2014)</xref>
        showed that a simple convolutional neural network (CNN)
with dynamic pooling can be effective for text
classification.
        <xref ref-type="bibr" rid="ref1">Berger (2015)</xref>
        shows that recurrent neural networks
(RNNs) can also be used for classification.
        <xref ref-type="bibr" rid="ref7">Johnson and
Zhang (2015)</xref>
        use n-gram indicator variables fed into a
deep neural network to make classification decisions.
        <xref ref-type="bibr" rid="ref17">Yen
et al. (2016)</xref>
        address training data sparsity by enforcing
heavy regularization penalties, but fall short of handling
extremely infrequent labels. <xref ref-type="bibr" rid="ref11">Liu et al. (2017)</xref> attempt to
address label sparsity by using shared intermediate
representations from a CNN.
        <xref ref-type="bibr" rid="ref3">Gehrmann et al. (2018)</xref>
        interpreted
CNNs’ classification by defining a salience score for each
token of the sentence input. Others have shown that using
attention can further improve performance, and improve
explainability of decisions in the medical domain
        <xref ref-type="bibr" rid="ref13 ref16">(Mullenbach
et al. 2018; Xie et al. 2018)</xref>
        .
        <xref ref-type="bibr" rid="ref5">Jain et al. (2019)</xref>
        show that
millions of labels can be practically handled using a pruning
strategy. These approaches are limited by the variability of
labels in the training data, or by exact term matching. None of
them supports soft matching against the label text, which is what
allows unseen labels to be matched.
      </p>
      <p>
        Soft n-gram interaction matching. Interaction-focused
ranking models formulate document ranking as a
learning-to-rank problem over a term similarity matrix between a
query and a document. One successful approach to learning
these patterns is to apply square convolutional kernels
over the term similarity matrix (e.g., Pang et al. (2016);
        <xref ref-type="bibr" rid="ref4">Hui
et al. (2018)</xref>
        ), a process called soft n-gram interaction
matching. We propose an approach inspired by these methods for
multi-label text classification; we use the approach to rank
labels for a given segment of text, rather than documents for
a query. Due to the large number of labels, we use a fixed
sequence, rather than allowing the model to learn out-of-order
n-grams; in our preliminary work, we found this to be a necessary
optimization for scalability that did not impact performance.
We also normalize the scores based on the label
length. This normalization is not necessary in document ranking
models because the same query is used for each document, and thus
has the same length.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
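      <p>Notation. Let <inline-formula><tex-math>$T$</tex-math></inline-formula> be a sequence of tokens in a document, and
let <inline-formula><tex-math>$L$</tex-math></inline-formula> be a collection of labels. For multi-label text
classification, a score <inline-formula><tex-math>$V^{L_i,T}$</tex-math></inline-formula> is generated for each label <inline-formula><tex-math>$L_i \in L$</tex-math></inline-formula>. The
labels are ranked by this score, optionally employing a score
threshold and/or a maximum count to select the labels.</p>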
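      <p>The selection step described above can be sketched as follows. This is a minimal illustration, not our implementation: the function name <monospace>select_labels</monospace>, its parameters, and the example scores are hypothetical.</p>
      <preformat><![CDATA[
```python
def select_labels(scores, threshold=None, max_count=None):
    """Rank labels by score, then apply an optional threshold and cap."""
    # Sort (label, score) pairs from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(label, s) for label, s in ranked if s >= threshold]
    if max_count is not None:
        ranked = ranked[:max_count]
    return [label for label, _ in ranked]


# Made-up scores for three labels of a single document.
scores = {"Asthma": 0.91, "Diabetes mellitus": 0.40, "Lyme disease": 0.05}
print(select_labels(scores, threshold=0.3))  # ['Asthma', 'Diabetes mellitus']
print(select_labels(scores, max_count=1))    # ['Asthma']
```
]]></preformat>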
      <p>For our task, we assume each label consists of a
sequence of tokens representing its name in natural language:
<inline-formula><tex-math>$L_i = \{L_{i,1}, L_{i,2}, \ldots, L_{i,|L_i|}\}$</tex-math></inline-formula>. This is a reasonable
assumption, given that the labels are usually produced for humans,
who will often need a name to reason about the label. For
our method to be most effective, the names should have
terms that may have approximate matches in the text. For
instance, given the procedure label of clipping of aneurysm,
an approximate match found in the text might be clip
the aneurysm. Our method uses the term similarity matrix
<inline-formula><tex-math>$S^{L_i,T} \in \mathbb{R}^{|L_i| \times |T|}$</tex-math></inline-formula> between label <inline-formula><tex-math>$L_i$</tex-math></inline-formula> and text <inline-formula><tex-math>$T$</tex-math></inline-formula> as input:</p>
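      <p><disp-formula><label>(1)</label><tex-math>$$S^{L_i,T}_{j,k} = \cos\left(\mathrm{embed}(L_{i,j}),\ \mathrm{embed}(T_k)\right)$$</tex-math></disp-formula>
where <inline-formula><tex-math>$\cos(\cdot)$</tex-math></inline-formula> is the cosine similarity score and <inline-formula><tex-math>$\mathrm{embed}(\cdot)$</tex-math></inline-formula>
returns the token’s word embedding. Note that each similarity
score here is a unigram match; our model operates over this
matrix to perform n-gram matching. An example similarity
matrix is shown in Figure 1.</p>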
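      <p>As a concrete illustration of Equation 1, the sketch below builds a term similarity matrix in plain Python. The two-dimensional embedding table <monospace>EMBED</monospace> holds made-up values chosen only for illustration, and the helper names are hypothetical; in practice the embeddings would come from a trained model.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(label_tokens, text_tokens, embed):
    """Return S with one row per label token and one column per text token."""
    return [[cosine(embed[l], embed[t]) for t in text_tokens] for l in label_tokens]


# Toy embeddings (assumed values, for illustration only).
EMBED = {
    "clipping": [1.0, 0.0],
    "clip": [0.9, 0.1],
    "of": [0.0, 1.0],
    "the": [0.1, 0.9],
    "aneurysm": [0.7, 0.7],
}

label = ["clipping", "of", "aneurysm"]
text = ["clip", "the", "aneurysm"]
S = similarity_matrix(label, text, EMBED)
for row in S:
    print([round(x, 2) for x in row])
```
]]></preformat>
      <p>In this toy example the diagonal of the matrix holds the largest values, because each label token is most similar to the text token at the same offset.</p>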
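      <p>Interaction matching model. Inspired by recent
interaction-focused ranking models in information retrieval,
we apply square convolutions over the similarity matrix
to produce soft n-gram matching scores. Unlike document
ranking models, however, we impose a single fixed
sequential convolution kernel over the labels. In other words,
we use the identity matrix <inline-formula><tex-math>$I_{|L_i|}$</tex-math></inline-formula> as a convolutional kernel.
We then take the maximum scores from each kernel and
normalize them by the length of the label <inline-formula><tex-math>$|L_i|$</tex-math></inline-formula>. This step is
not taken in the document ranking models but is necessary
in this context because multiple labels of different lengths
are being matched over the same document. Note that since
the convolutional kernel matches similarity scores, exact
matches are not required; this is what makes the n-grams
‘soft’. More formally, our method generates a label score
for each document position <inline-formula><tex-math>$P^{L_i,T} \in \mathbb{R}^{|T|}$</tex-math></inline-formula> for document <inline-formula><tex-math>$T$</tex-math></inline-formula>
and label <inline-formula><tex-math>$L_i$</tex-math></inline-formula>:
<disp-formula><label>(2)</label><tex-math>$$P^{L_i,T} = \sigma\left(\frac{I_{|L_i|} \star S^{L_i,T}}{|L_i|}\right)$$</tex-math></disp-formula>
where <inline-formula><tex-math>$\sigma(\cdot)$</tex-math></inline-formula> is the sigmoid activation function and <inline-formula><tex-math>$\star$</tex-math></inline-formula> performs
2-dimensional convolution. For simplicity, we assume <inline-formula><tex-math>$\star$</tex-math></inline-formula>
applies padding where necessary.</p>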
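      <p>The identity-kernel convolution can be sketched as a sum along diagonals of the similarity matrix. The following minimal Python example makes two assumptions that are not specified in the text: the label is aligned so that its first token sits at each candidate text position, and positions past the end of the text are zero-padded. The function name and the matrix values are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def position_scores(S):
    """Score every text position for one label.

    S is the |label| x |text| similarity matrix. For a start position p,
    the identity kernel sums S[j][p + j] over label tokens j, i.e. it
    reads one diagonal. The sum is divided by the label length and
    passed through a sigmoid.
    """
    n_label = len(S)
    n_text = len(S[0])
    scores = []
    for p in range(n_text):
        diag = 0.0
        for j in range(n_label):
            k = p + j
            if k < n_text:  # assumed zero padding past the end of the text
                diag += S[j][k]
        normalized = diag / n_label
        scores.append(1.0 / (1.0 + math.exp(-normalized)))
    return scores


# Made-up similarity matrix: a 2-token label against a 4-token text.
S = [
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.1, 0.8, 0.1],
]
P = position_scores(S)
print([round(x, 3) for x in P])
```
]]></preformat>
      <p>Here the highest score falls at the second text position, where both label tokens find a similar text token in sequence.</p>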
      <p>To generate a label score for the entire text, we perform
max pooling: <inline-formula><tex-math>$V^{L_i,T} = \max_{j=1}^{|T|} P^{L_i,T}_j$</tex-math></inline-formula>. The use of max
pooling allows for the soft n-gram to match anywhere in the
document and for the score not to be influenced by
document length (as opposed to average pooling, for instance).
Furthermore, the arg max yields an interpretable grounding
of the model’s decision within the text and can be used to
aid in the explanation of the model’s decision. At inference
time, all labels in <inline-formula><tex-math>$L$</tex-math></inline-formula> are ranked using this method. This
approach is trivially parallelizable and easily handles datasets
with thousands of labels. An example of interaction
matching scores is shown in Figure 1. In this example, the exact
match is given a perfect matching score of 1.</p>
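      <p>The pooling step and its arg max grounding can be illustrated with a short Python sketch. The position scores below are made up, the function name is hypothetical, and the sketch assumes that a position score refers to a label match starting at that text position.</p>
      <preformat><![CDATA[
```python
def pool_label_score(position_scores):
    """Max-pool position scores; return (label score, best start position)."""
    best = max(range(len(position_scores)), key=position_scores.__getitem__)
    return position_scores[best], best


text = ["we", "clip", "the", "aneurysm", "today"]
label = ["clipping", "of", "aneurysm"]
P = [0.52, 0.71, 0.55, 0.51, 0.50]  # made-up per-position scores

score, start = pool_label_score(P)
# The arg max points at the text span that grounds the decision.
span = text[start:start + len(label)]
print(score, span)  # 0.71 ['clip', 'the', 'aneurysm']
```
]]></preformat>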
      <p>Figure 2. Report: ...status post tracheostomy for paradoxical vocal cord motion with asthma
discharge medications fenofexadine mg po q day calcium carbonate grams po t i d
percocet one po q to hours prn pain... ICD-9 labels: Other diseases of upper respiratory tract; Asthma; Diabetes mellitus.
Tutorial: Create another new document (I chose 600
x 400 px for width and height), select the brush tool,
and open the brush preset panel. Tool labels: File &gt; New; Brush Tool.</p>
      <p>The model’s structure allows new labels introduced after
training to be incorporated simply by adding them to <inline-formula><tex-math>$L$</tex-math></inline-formula>.
In our experiments, we train
the model by back-propagating errors to the word
embeddings. We recognize that our model is ineffective for labels
that do not match the text. Thus, we suggest incorporating
our method into existing multi-label text classification
approaches, which can learn to effectively match labels that
frequently occur in the training data. We train the two
models jointly, combining them by taking the maximum score
for each label.</p>
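      <p>The per-label maximum used to combine the two models can be sketched as below. This shows only the combination of scores at inference time, not joint training. It assumes both models emit scores on a comparable scale and that a label unknown to one model contributes 0.0 for that model; the function name and the scores are hypothetical.</p>
      <preformat><![CDATA[
```python
def combine_scores(baseline_scores, interaction_scores):
    """Take the per-label maximum of two score dictionaries.

    Assumption: scores are comparable (here, in [0, 1]) and a label
    missing from one model counts as 0.0 for that model.
    """
    labels = sorted(set(baseline_scores) | set(interaction_scores))
    return {
        label: max(baseline_scores.get(label, 0.0),
                   interaction_scores.get(label, 0.0))
        for label in labels
    }


# "Lyme disease" stands in for a label added after training: the baseline
# has no score for it, but its name can still be matched against the text.
baseline = {"Asthma": 0.85, "Diabetes mellitus": 0.10}
interaction = {"Asthma": 0.60, "Diabetes mellitus": 0.20, "Lyme disease": 0.75}
combined = combine_scores(baseline, interaction)
print(combined)
```
]]></preformat>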
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We test our approach on two tasks: medical coding and
software tool extraction from online tutorials. While both are
multi-label classification tasks, the data characteristics are
different for each task, demonstrating that our approach is
generally applicable. Examples are given in Figure 2.</p>
      <sec id="sec-4-1">
        <title>Medical coding</title>
        <p>
          Dataset. We first evaluate on the MIMIC-III dataset
          <xref ref-type="bibr" rid="ref6">(Johnson et al. 2016)</xref>
          , a large, de-identified, and publicly available
collection of medical records. Each record in the dataset
includes ICD-9 codes, which identify diagnoses and
procedures performed.<sup>1</sup> Each code is partitioned into sub-codes,
which often include specific circumstantial details. We treat
the parent (top-level) codes as labels to be identified based
on the patient’s discharge note. The dataset consists of
112k clinical report records and 1,159 top-level ICD-9
codes (labels). See Table 1 for further dataset
characteristics and Figure 2 for an excerpt of a report with labeled
codes. We use the same train/dev/test split used by
          <xref ref-type="bibr" rid="ref13">Mullenbach et al. (2018)</xref>
          , with 1,632 development and 3,372
testing reports. We train word embeddings on MIMIC-III
using word2vec
          <xref ref-type="bibr" rid="ref12">(Mikolov et al. 2013)</xref>
          , matching the setting
of
          <xref ref-type="bibr" rid="ref13">Mullenbach et al. (2018)</xref>
          . Note that this does not
preclude the matching of terms unseen in training data;
trivially, a larger unlabeled corpus could be employed for
training embeddings, or binary matching could be used for
out-of-vocabulary terms.
        </p>
        <p><sup>1</sup> https://mimic.physionet.org/; https://www.cdc.gov/nchs/icd/icd9.htm</p>
        <p>Table 2. Rows: Bi-RNN, CNN, and CAML, each with and without + interaction (ours). Columns: MacroP, MacroR, MacroF1, MicroF1.</p>
        <p>Baselines &amp; training. We compare our approach to the
state-of-the-art attention-based CAML
          <xref ref-type="bibr" rid="ref13">(Mullenbach et al. 2018)</xref>
          network for medical coding, along with a
convolutional neural network text classifier
          <xref ref-type="bibr" rid="ref9">(Kim 2014)</xref>
          and
a bi-directional GRU network as baselines, using the
implementations provided by
          <xref ref-type="bibr" rid="ref13">Mullenbach et al. (2018)</xref>
          . These
represent strong baselines for this task. We combine our
approach with each baseline, training them jointly as described
in the Methodology section. We train all neural models optimizing cross
entropy loss with the Adam optimizer
          <xref ref-type="bibr" rid="ref10">(Kingma and Ba 2015)</xref>
          (learning rate of <inline-formula><tex-math>$10^{-4}$</tex-math></inline-formula>). We select a threshold for all labels
using MacroF1 performance on the dev set.
        </p>
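        <p>Selecting a single threshold by MacroF1 on the dev set can be sketched as a simple search over candidate values. The example below is an illustration under assumed inputs: the candidate thresholds, the toy dev scores, and the function names are hypothetical, and a label with no true positives, false positives, or false negatives is assigned an F1 of 0.</p>
        <preformat><![CDATA[
```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over a list of documents."""
    total = 0.0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)


def pick_threshold(dev_scores, dev_gold, labels, candidates):
    """Return the candidate threshold with the best dev-set macro F1."""
    def predict(th):
        return [{l for l in labels if doc[l] >= th} for doc in dev_scores]
    return max(candidates, key=lambda th: macro_f1(dev_gold, predict(th), labels))


labels = ["A", "B"]
dev_scores = [
    {"A": 0.9, "B": 0.2},
    {"A": 0.6, "B": 0.7},
    {"A": 0.3, "B": 0.1},
]
dev_gold = [{"A"}, {"B"}, set()]
best = pick_threshold(dev_scores, dev_gold, labels, candidates=[0.1, 0.5, 0.8])
print(best)  # 0.5
```
]]></preformat>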
        <p>Results. Our results on MIMIC-III are presented in
Table 2. In line with prior work on the dataset, we measure
performance in terms of micro- and macro-averaged
F1 score. Since our focus is improving the long tail
of infrequently-occurring labels, we also include
macro-averaged precision and recall. Our approach outperforms the
state-of-the-art CAML method in terms of macro precision,
recall, and F1 by 3–5% (relative improvement, significant at
p &lt; 0.05). The performance improvement on the weaker
CNN baseline is even more pronounced, achieving an 8–11%
improvement on the macro metrics (also significant at
p &lt; 0.05). Interestingly, our approach improves the
precision for the bi-directional RNN, at the expense of recall. We
attribute this to the interaction matching technique being
inherently high-precision. The improvements in the micro
metrics are less pronounced, showing that our approach
primarily benefits the long tail.</p>
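        <p>To illustrate why long-tail gains appear mainly in the macro metrics, the sketch below computes macro- and micro-averaged F1 from per-label counts. The counts are made up for illustration and are not results from our experiments; the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def f1(tp, fp, fn):
    """F1 from true positive, false positive, and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts):
    """counts maps each label to a (tp, fp, fn) tuple."""
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Made-up counts: only the rare label changes between the two settings.
before = {"frequent": (90, 10, 10), "rare": (0, 0, 2)}
after = {"frequent": (90, 10, 10), "rare": (2, 0, 0)}

macro_b, micro_b = macro_micro_f1(before)
macro_a, micro_a = macro_micro_f1(after)
print(round(macro_a - macro_b, 3), round(micro_a - micro_b, 3))
```
]]></preformat>
        <p>Because macro averaging weights every label equally, fixing the two instances of the rare label moves the macro score far more than the micro score.</p>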
        <p>Error analysis. We often found that our method was able
to match infrequent labels where CAML had failed. For
instance, in one report, our method labeled all three codes
correctly (including one that occurs in only 0.5% of
training data), while the unmodified CAML method found two
of the three correctly, but also mistakenly included a third,
completely unrelated label (occurs in about 0.1% of
training data). We observed cases where general codes were not
matched effectively by either model. For instance, Other
diseases of lung is difficult to match by both models because it
involves more advanced reasoning (i.e., the condition affects
the lung, and there isn’t another label).</p>
        <p>Table 3. Example tutorial sentences (columns: +Inter., Sentence).</p>
        <p>Go to Filter &gt; Texture &gt; Craquelure. Change the Crack Spacing to 13, the
Crack Depth to 3, and the Crack Brightness to 8.</p>
        <p>Now in your new layer, using the Radial Gradient Tool, drag a red gradient
over the whole document.</p>
        <p>Set the duration of frame 2 to .05 seconds</p>
        <p>Now it is time to create a path P. If we have our path the right click of the
mouse and stroke path with brush set 10px hardness of 100%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Software tutorial labeling</title>
        <p>Dataset. We also evaluate our approach on a collection of
software tutorials, labeled with the tools used to complete
each step (by sentence). This dataset is collected from
online tutorials and manually labeled by sentence with a large
collection of software tools (831 in total). The dataset
consists of 40k sentences, with an average length of 21.7
tokens and an average number of 1.2 tools labeled per record.
An example labeled tutorial sentence is given in Figure 2.
Note that the tool mentions can be either explicit (brush
tool → Brush Tool) or implicit (Create a new
document → File &gt; New). The dataset will be made
available for validation of our results. We use a random 90/5/5%
train/dev/test split.</p>
        <p>
          Baselines. Since no specialized systems exist for this
dataset, we use supervised FastText
          <xref ref-type="bibr" rid="ref8">(Joulin et al. 2017)</xref>
          ,
XML-CNN
          <xref ref-type="bibr" rid="ref11">(Liu et al. 2017)</xref>
          , and BERT
          <xref ref-type="bibr" rid="ref2">(Devlin et al. 2020)</xref>
          as baseline approaches for the tutorial dataset. FastText is
trained for 100 epochs with 1–3 word n-grams, and
XML-CNN and BERT are trained using default settings. We
initialize the embeddings for interaction matching using
100-dimensional GloVe embeddings
          <xref ref-type="bibr" rid="ref15">(Pennington, Socher, and Manning
2014)</xref>
          , and fine-tune them during the training process.
        </p>
        <p>Results. We present our results on the tutorial dataset
in Table 4. We use Average Precision (AP, macro-averaged
by label). This evaluation emphasizes correctness along the
long tail (as opposed to a micro average). When applied to
FastText, our approach improves the test set performance by
6% (relative improvement, significant at p &lt; 0.01). The
FastText model achieved a perfect AP score of 1.0 for 55
of the labels found in the test set (meaning it was ranked
highest among all labels whenever it appeared), whereas the
interaction variant had a perfect score for 69 labels. This is
a 25% improvement, most of which came from the less
frequent half of the labels. The least frequent quarter saw an
even bigger change, from 11 perfect scores to 26 (136%),
four of which had no training samples. Our approach also
slightly improves the performance on XML-CNN, though
the results are not significant at p &lt; 0.01. Interestingly, the
XML-CNN model appears to hamper performance in the
development set. Finally, the BERT model underperforms
FastText and XML-CNN. This is likely in part due to the
small amount of training data available for many labels.</p>
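        <p>The relative improvements quoted above follow directly from the counts of labels with a perfect AP score. The short Python check below reproduces that arithmetic; the helper name <monospace>relative_change</monospace> is ours for illustration.</p>
        <preformat><![CDATA[
```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# Labels with a perfect AP score: all test labels, then the least frequent quarter.
print(round(relative_change(55, 69) * 100))  # 25
print(round(relative_change(11, 26) * 100))  # 136
```
]]></preformat>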
        <p>Qualitatively, we find that the interaction approach
helps in situations in which there are similar terms or phrases
in the long tail. Examples are given in Table 3. Specifically,
when the sentence contains text similar to a label,
the interaction approach is beneficial (examples a and b). We
acknowledge that the interaction mechanism can
occasionally produce false matches (example c), and it does not improve
performance when there is no similar text (example d).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We presented an approach to enhance existing multi-label
classification techniques that employs soft n-gram
interaction matching. We demonstrated that the approach is
effective at identifying labels in the long tail, which
current state-of-the-art classification
approaches handle poorly. We also showed that the approach can effectively
assign labels that do not appear at all in the training data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Large Scale Multi-label Text Classification with Semantic Word Vectors</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name><surname>Chang</surname>, <given-names>M.-W.</given-names></string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Gehrmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>E. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          ;
          <string-name><surname>Welt</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Foote</surname>, <given-names>J.</given-names> <suffix>Jr</suffix></string-name>
          ;
          <string-name>
            <surname>Moseley</surname>
            ,
            <given-names>E. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grant</surname>
            ,
            <given-names>D. W.</given-names>
          </string-name>
          ;
          <string-name><surname>Tyler</surname>, <given-names>P. D.</given-names></string-name>
          ; et al.
          <year>2018</year>
          .
          <article-title>Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives</article-title>
          .
          <source>PLoS ONE</source>
          <volume>13</volume>
          (<issue>2</issue>)
          :
          <fpage>e0192360</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Hui</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yates</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Berberich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name><surname>de Melo</surname>, <given-names>G.</given-names></string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Balasubramanian</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chunduri</surname>
            ,
            <given-names>B. R.</given-names>
          </string-name>
          ; and
          <string-name><surname>Varma</surname>, <given-names>M.</given-names></string-name>
          <year>2019</year>
          .
          <article-title>Slice: Scalable Linear Extreme Classifiers Trained on 100 Million Labels for Related Searches</article-title>
          .
          <source>In WSDM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><surname>Johnson</surname>, <given-names>A. E. W.</given-names></string-name>
          ;
          <string-name><surname>Pollard</surname>, <given-names>T. J.</given-names></string-name>
          ;
          <string-name><surname>Shen</surname>, <given-names>L.</given-names></string-name>
          ;
          <string-name><surname>Lehman</surname>, <given-names>L.-W. H.</given-names></string-name>
          ;
          <string-name><surname>Feng</surname>, <given-names>M.</given-names></string-name>
          ;
          <string-name><surname>Ghassemi</surname>, <given-names>M. M.</given-names></string-name>
          ;
          <string-name><surname>Moody</surname>, <given-names>B.</given-names></string-name>
          ;
          <string-name><surname>Szolovits</surname>, <given-names>P.</given-names></string-name>
          ;
          <string-name><surname>Celi</surname>, <given-names>L. A.</given-names></string-name>
          ; and
          <string-name><surname>Mark</surname>, <given-names>R. G.</given-names></string-name>
          <year>2016</year>
          .
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>
          .
          <source>In Scientific data.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><surname>Johnson</surname>, <given-names>R.</given-names></string-name>
          ; and
          <string-name><surname>Zhang</surname>, <given-names>T.</given-names></string-name>
          <year>2015</year>
          .
          <article-title>Effective Use of Word Order for Text Categorization with Convolutional Neural Networks</article-title>
          .
          <source>In HLT-NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          .
          <source>In EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><surname>Liu</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Chang</surname>, <given-names>W.-C.</given-names></string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Deep Learning for Extreme Multi-label Text Classification</article-title>
          .
          <source>In SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name><surname>Chen</surname>, <given-names>K.</given-names></string-name>
          ;
          <string-name><surname>Corrado</surname>, <given-names>G. S.</given-names></string-name>
          ; and
          <string-name><surname>Dean</surname>, <given-names>J.</given-names></string-name>
          <year>2013</year>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>In NIPS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Mullenbach</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name><surname>Wiegreffe</surname>, <given-names>S.</given-names></string-name>
          ;
          <string-name><surname>Duke</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Sun</surname>, <given-names>J.</given-names></string-name>
          ; and
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Explainable Prediction of Medical Codes from Clinical Text</article-title>
          .
          <source>In NAACL-HLT.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
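          <string-name><surname>Pang</surname>, <given-names>L.</given-names></string-name>
          ;
          <string-name><surname>Lan</surname>, <given-names>Y.</given-names></string-name>
          ;
          <string-name><surname>Guo</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Xu</surname>, <given-names>J.</given-names></string-name>
          ;
          <string-name><surname>Wan</surname>, <given-names>S.</given-names></string-name>
          ; and
          <string-name><surname>Cheng</surname>, <given-names>X.</given-names></string-name>
          <year>2016</year>
          .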
          <article-title>Text Matching as Image Recognition</article-title>
          .
          <source>In AAAI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name><surname>Socher</surname>, <given-names>R.</given-names></string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In EMNLP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name><surname>Zhang</surname>, <given-names>M.</given-names></string-name>
          ; and
          <string-name>
            <surname>Xing</surname>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A Neural Architecture for Automated ICD Coding</article-title>
          .
          <source>In ACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Yen</surname>
            ,
            <given-names>I. E.-H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ravikumar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I. S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification</article-title>
          .
          <source>In ICML.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>