<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Neural Multi-Field Document Representations for Patent Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Subhash Chandra Pujari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fryderyk Mantiuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Giereth</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jannik Strötgen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annemarie Friedrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Center for Artificial Intelligence</institution>
          ,
          <addr-line>Renningen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Duale Hochschule Baden-Württemberg</institution>
          ,
          <addr-line>Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Heidelberg University</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Robert Bosch GmbH</institution>
          ,
          <addr-line>Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>13</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Patent classification constitutes a long-tailed hierarchical learning problem. Prior work has demonstrated the efficacy of neural representations based on pre-trained transformers; however, due to the limited input size of these models, it has used only the title and abstract of patents as input. Patent documents consist of several textual fields, some of which are quite long. We show that a baseline using simple tf.idf-based methods can easily leverage this additional information. We propose a new architecture combining the neural transformer-based representations of the various fields into a meta-embedding, which we demonstrate to outperform the tf.idf-based counterparts, especially on less frequent classes. Using a relatively simple architecture, we outperform the previous state of the art on CPC classification by a margin of 1.2 macro-avg. F1 and 2.6 micro-avg. F1. We identify the textual field giving a “brief-summary” of the patent as most informative with regard to CPC classification, which points to interesting future directions of research on less computation-intensive models, e.g., by summarizing long documents before neural classification.</p>
      </abstract>
      <kwd-group>
        <kwd>patent classification</kwd>
        <kwd>long-tailed classification</kwd>
        <kwd>neural document representations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure 1: Excerpt of the CPC taxonomy with sections C and D, classes C01, C03, D02, and D03, and subclasses C01B, C01C, C03B, C03C, D02G, D02H, D03C, and D03D.]</p>
      <sec id="sec-1-1">
        <table-wrap>
          <table>
            <thead>
              <tr><th>Level</th><th>ID</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>Section</td><td>C</td><td>Chemistry; Metallurgy</td></tr>
              <tr><td>Class</td><td>C01</td><td>Inorganic Chemistry</td></tr>
              <tr><td>Subclass</td><td>C01C</td><td>Ammonia; Cyanogen;</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-1-2">
        <sec id="sec-1-2-6">
          <p>
            the first step, CPC classification is a good testbed for patent classification in general due to
the free availability of labeled data, and has been the target of research for decades already
[
            <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">1, 2, 3, 4, 5, 6, 7, 8, 9</xref>
            ]. The CPC taxonomy arranges labels into up to nine hierarchical levels.
At the fifth level, the taxonomy has about 240k labels with a very skewed distribution with a
very long tail. Following previous works [
            <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
            ], we restrict our evaluation to the first three
levels of the hierarchy, as illustrated in Figure 1, which still results in a very challenging
long-tailed hierarchical classification task with 767 labels.
          </p>
          <p>
            Patents are structured text documents with multiple fields, e.g., title, abstract, description,
and claims. Among these, only the claims, which have to be interpreted in context of the
description, are legally binding. The remaining fields are often drafted less carefully, or even
intentionally conceal a patent’s content. In earlier research on non-neural text classification
[
            <xref ref-type="bibr" rid="ref10 ref2 ref3 ref4 ref5">2, 3, 4, 5, 10</xref>
            ], document representations for long documents were obtained for instance using
tf.idf-based methods. Despite being robust, they ignore sequence information and do not, in
contrast to more recent neural language models, profit from self-supervised pre-training. Due
to the limited input size of transformers such as BERT [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which can process inputs of at most
512 word-piece tokens, corresponding to only a few sentences, prior research on neural patent
classification [
            <xref ref-type="bibr" rid="ref12 ref8 ref9">8, 9, 12</xref>
            ] has mainly relied on title and abstract only. This is
problematic as these fields are often rather broad and hence often do not precisely describe the
patent’s content. We are also not aware of a systematic study comparing the various possible
input fields for patent classification to exploit potentially complementary information across
fields.<sup>2</sup> In sum, prior research either did not leverage the strength of transformers or used only
title and abstract to compute a document representation. In this work, we hence propose a
neural system architecture with a pre-trained transformer-based neural language model [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] as
backbone, but incorporating embeddings derived from various textual fields.
          </p>
          <p><sup>2</sup>Ingwersen [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] introduced the concept of polyrepresentation in the field of cognitive IR theory, which states
that the uncertainty of an information retrieval system decreases by incorporating multiple representations of
full-text semantic entities, e.g., sentences, paragraphs, and sections, into a document representation.</p>
          <p>
            As a first step, we enrich the USPTO-70k dataset, which originally only contained title and
abstract, with the four additional patent fields claims, detailed description (detail-desc),
brief-summary, and figure description (fig-desc). The latter three fields are sub-fields
of description within USPTO patents. Based on this enriched dataset, we perform various
experiments systematically comparing the performance of a non-neural hierarchical classifier
[
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], whose representations are based on tf.idf and that uses SVMs internally, with an extension
of a state-of-the-art neural model [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. We evaluate the various text field embeddings in terms
of their informativeness with respect to the task of classifying patents according to the CPC
taxonomy, finding that models using the brief-summary are very effective. We also find that
information from the fields is complementary, i.e., the models using all input fields outperform
their counterparts using subsets thereof. This finding holds for both non-neural
and neural systems. Finally, when combining several embeddings into a meta-embedding, we
find vector summation to outperform concatenation. Our further analysis shows that using
additional textual information helps especially for the difficult infrequent labels, i.e., in
few-shot scenarios.
          </p>
          <p>
            In summary, we propose a novel neural system for patent classification and demonstrate
that the various textual sub-fields of patents are an effective source of information. In
particular, our contributions are as follows:
• We enrich the USPTO-70k dataset of [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], which contains only titles and abstracts, with
four additional patent fields, and make both the dataset and our code available to foster future research.<sup>3</sup>
• We propose an approach to efficiently generate an effective document representation
using a transformer-based model incorporating complementary information from
multiple textual patent fields. We demonstrate that the additional information increases
classification performance by a considerable margin.
• Unlike previous works, e.g., [
            <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
            ], we also evaluate the model performance in a few-shot
setting, where our proposed approach fares particularly well in the least frequent label
group (showing 44% better F1-macro).
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Approaches to patent classification differ across several dimensions: there are neural and
non-neural methods, approaches exploiting the full document text vs. ones relying on title
and abstract only, and techniques performing hierarchical classification vs. those tackling the
coarsest (CPC) class granularity only.</p>
      <p>
        Full-text-based approaches. With the underlying bag-of-words representation, a tf.idf
feature vector incorporates the complete document text into a document representation. Fall
et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] experiment with title, abstract, claims, description, and the meta-data fields and find
that using the first 300 words of the title, inventors, applicant, abstracts, and description works
best. Several works [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] show that the description is more informative than other sections
of a patent, in particular, the initial part of the description. Benites et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] rank first in
the ALTA 2018 Shared Task on patent classification [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and use the full text of the patent
documents. However, their method predicts only nine labels at the Section level, while we
address hierarchical patent classification with 767 labels across three granularities.
      </p>
      <p><sup>3</sup>https://github.com/boschresearch/multifield_patent_classification_bir2022</p>
      <p>
        Neural Models. Analyzing the submissions of the TREC deep learning 2019 track,
Craswell et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] conclude that BERT-based methods substantially outperform previously used text
representation techniques. Based on this finding, Lin et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] divide the timeline of deep
learning models into pre-BERT and BERT eras. The two primary methods for feature vector
generation in the pre-BERT era are Convolutional Neural Networks (CNNs) and Recurrent
Neural Networks (RNNs), including their successors, i.e., GRUs and LSTMs. DeepPatent by Li et
al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] applies a CNN on top of word embeddings for the first 100 words from title and abstract.
They compare the results with a non-neural baseline, which uses a tf.idf vector generated with
the complete document text. However, the insights of this comparison are limited as only the
micro-precision value outperformed the non-neural model. Grawe et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] apply an LSTM
on top of word embeddings and compare using the first 150 words of the description against
using the first 400 words. Risch and Krestel [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] pre-train fastText word embeddings on a large
corpus of patent documents and combine the word embeddings using Gated Recurrent Units
(GRUs). They also use only title and abstract, but they include the first 300 words. Lee and
Hsiang [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are the first to apply BERT to the patent classification task. They report that
using only the first claim gives results on par with title and abstract. Unlike [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Pujari et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
consider patent classification as a hierarchical multi-label classification problem and propose a
hierarchical transformer-based multi-task model that trains an intermediate SciBERT layer with
title and abstract as input text. Comparing BERT and SciBERT on a patent classification task,
Althammer et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] found that the SciBERT model performs better. Transformer-based
language models are computationally expensive and therefore subject to a maximum sequence length
constraint, accommodating only 512 word-piece tokens.
      </p>
      <p>
        Transformers with longer inputs. Long-text transformer models reduce the computational
overhead with the sparse attention mechanism, increasing the sequence length to 4096 tokens.
Zaheer et al. report state-of-the-art patent classification results using Big Bird [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with title,
abstract, and claims as input. Although effective, such transformer-based models are still incapable of
accommodating the complete document text in a document representation. Therefore, in this
work, we look into text pruning techniques for a more informative document representation.
      </p>
      <p>
        Few-shot/long-tailed text classification. Besides an effective document representation for
a long multi-field document, a major challenge with CPC classification is the prediction of less
frequent labels. Mullenbach et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] proposed a method for few-shot learning with
attention-based weights for combining the label-based embedding and incorporating it into a document
representation. Further, Rios et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] extended this method [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] by incorporating information
from a hierarchical taxonomy using a GCNN [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Surveying the literature proposing novel
datasets [
        <xref ref-type="bibr" rid="ref25 ref26 ref27">25, 26, 27</xref>
        ] for a few-shot learning setting, we do not find a clear definition of the
criterion used for categorizing a label into a few-shot category. Therefore, we create
frequency-based label groups for evaluating a few-shot setting.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Model</title>
      <p>
        Our classification model architecture is similar to that proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, instead of
using only the concatenated text of title and abstract as input to the transformer, we compute
embeddings of various patent fields, aggregating them into a meta-embedding using vector
summation or concatenation.
      </p>
      <p>Notation. We write “+” for text concatenation, “;” for vector concatenation, and “⊕” for vector summation. An embedding function is a method which tokenizes an input text and calculates the sequence embedding using a transformer-based LM; in its fine-tuned variant, the LM is fine-tuned during the task.</p>
      <sec id="sec-3-1">
        <p>
          Field Embeddings. We use PatSciBERT, i.e., the SciBERT model [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] fine-tuned on CPC
classification by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], to generate embeddings for the first 510 word-piece tokens of each textual
field. This is motivated by the findings of previous works that the initial part of a field’s
text is often more informative than the remainder [
          <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
          ]. In most experiments, we do not
fine-tune PatSciBERT. For word-piece tokenization, we use a BertTokenizer<sup>4</sup> initialized with
SciBERT’s vocabulary. The 768-dimensional last hidden state of the [CLS] token is used as the
text field’s embedding. We distinguish two variants of this embedding: one computed with a frozen
language model and one computed while fine-tuning the underlying language model.
        </p>
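        <p>The truncation step can be illustrated with a minimal sketch. It assumes that the two positions left over from the 512-token input are taken by the [CLS] and [SEP] special tokens, which would explain the limit of 510 word pieces; the function name build_model_input and the plain-list token representation are hypothetical and chosen for illustration only.</p>
        <preformat>
```python
CLS, SEP = "[CLS]", "[SEP]"


def build_model_input(word_pieces, max_length=512):
    """Keep the leading word pieces and add the two special tokens.

    Assumption: [CLS] and [SEP] each occupy one position of the input,
    so max_length - 2 positions remain for the text itself.
    """
    budget = max_length - 2
    return [CLS] + list(word_pieces[:budget]) + [SEP]


# Toy field with 2000 word pieces, standing in for a long patent field.
word_pieces = ["tok%d" % i for i in range(2000)]
model_input = build_model_input(word_pieces)
print(len(model_input))  # 512
```
        </preformat>
        <p>With a smaller max_length, e.g., 64, the same function keeps only the first 62 word pieces of a field.</p>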
        <p>Aggregation. For aggregating several field embeddings into a document representation
for each instance, we experiment with two simple aggregation methods, i.e., vector
summation (⊕) and concatenation (;).</p>
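        <p>The difference between the two aggregation methods can be sketched as follows, using plain Python lists in place of the 768-dimensional field embeddings. The helper names aggregate_sum and aggregate_concat are hypothetical, and the toy vectors are made up for illustration.</p>
        <preformat>
```python
def aggregate_sum(field_embeddings):
    """Element-wise sum of equally sized field embeddings."""
    sizes = {len(embedding) for embedding in field_embeddings}
    if len(sizes) != 1:
        raise ValueError("all field embeddings must have the same dimension")
    return [sum(values) for values in zip(*field_embeddings)]


def aggregate_concat(field_embeddings):
    """Concatenation of the field embeddings in the given order."""
    return [value for embedding in field_embeddings for value in embedding]


# Toy 3-dimensional embeddings standing in for title, abstract, and claims.
title = [1.0, 0.0, 2.0]
abstract = [0.5, 1.0, 0.0]
claims = [0.0, 2.0, 1.0]

print(aggregate_sum([title, abstract, claims]))          # [1.5, 3.0, 3.0]
print(len(aggregate_concat([title, abstract, claims])))  # 9
```
        </preformat>
        <p>Summation keeps the dimensionality of the document representation fixed at that of a single field embedding, whereas concatenation grows it linearly with the number of fields (six 768-dimensional embeddings yield 4608 dimensions).</p>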
        <p>
          Classification Model. We use a Transformer-based Hierarchical Multi-task Model (THMM)
architecture similar to [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], with the document representation as input, and one classification head per label. The heads
consist of three-layer perceptrons with a binary softmax output, predicting whether the label
applies to the document or not. The hierarchical taxonomy links define the input to the
classification heads as the concatenation of the document representation and the output of
the second hidden layer of the respective parent heads.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>
        In this section, we describe our experimental setup. First, in Section 4.1, we describe the
steps taken for enriching the USPTO-70k dataset by incorporating additional fields taken from
the USPTO bulk download.<sup>5</sup> Section 4.2 introduces the evaluation metrics for hierarchical multi-label
classification, which are used for the analysis of results in Section 5, and defines the
label groups based on taxonomy level, label frequency, and section information. We compare
our results with the neural and non-neural baselines described in Section 4.3, and our
experimental setup is defined in Section 4.4.
      </p>
      <p><sup>4</sup>https://huggingface.co/transformers/v2.4.0/model_doc/bert.html#berttokenizer</p>
      <p><sup>5</sup>https://patentsview.org/download/data-download-tables</p>
      <p>4.1. Dataset</p>
      <p>
We use the USPTO-70k dataset released by Pujari et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], containing 50k train, 10k test and
10k dev instances. The instances are labeled with 9, 128, and 630 unique labels from the
first three CPC<sup>6</sup> levels, i.e., section, class, and subclass, respectively. The original USPTO-70k dataset provides
only title and abstract. We enrich the dataset with additional patent fields and make
the enriched dataset publicly available to ease future work. A patent document contains the
four main text fields title, abstract, claims, and description, with the latter being the
longest and most detailed. The USPTO dump aggregates the subfields within the description
into three groups: brief-summary, fig-desc, and detail-desc. With approximately 1.8k
word-piece tokens, brief-summary is very concise in contrast to the elaborate detail-desc
field with approximately 9.5k tokens (see Figure 2). As shown in Table 2, brief-summary often
contains several subfields; however, as there is no strict structure or completeness required, we
simply use the concatenation of these texts.
      </p>
      <p>4.2. Evaluation</p>
      <p>
Metrics. In line with [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we evaluate the models using hierarchical precision, recall and
F1-score as proposed by [
        <xref ref-type="bibr" rid="ref28">28</xref>
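        ], defining precision as <inline-formula><tex-math>hP = \frac{\sum_i |P_i \cap T_i|}{\sum_i |P_i|}</tex-math></inline-formula> and recall as <inline-formula><tex-math>hR = \frac{\sum_i |P_i \cap T_i|}{\sum_i |T_i|}</tex-math></inline-formula>.
For each test instance <inline-formula><tex-math>i</tex-math></inline-formula>, the predicted label set <inline-formula><tex-math>P_i</tex-math></inline-formula> consists of all predicted labels with their
ancestors. Similarly, the true label set <inline-formula><tex-math>T_i</tex-math></inline-formula> contains the true labels, including their ancestors. Since within
the CPC taxonomy, each child label has a single parent, we add the missing ancestors for each
predicted label. Most previous works [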
        <xref ref-type="bibr" rid="ref12 ref7 ref8">7, 8, 12</xref>
        ] evaluate the model performance using
micro-average scores, which do not adequately reveal the performance of a model on the less
frequent labels. Because of the skewed label distribution, we therefore compute macro-scores for each of the three measures as the
average of the scores obtained for the individual labels. For example, the
macro-F1 score is computed as the average of the per-label F1-scores. We also divide the labels
into groups according to various criteria for a more fine-grained analysis. The macro-F1 for a
group is computed as the average of the per-label F1 scores of the labels in the respective group.
The grouping strategy is defined with three criteria: frequency, level, and section information.
      </p>
      <p><sup>6</sup>CPC taken at 2020.06.</p>
      <p>[Figure 2: Histograms of the number of tokens per instance for each text field, including panel (a) title; x-axis: number of tokens, y-axis: instances.]
      </p>
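      <p>The hierarchical measures can be illustrated with a minimal sketch. It assumes the usual set-based definition: the overlap between the ancestor-augmented predicted and true label sets is summed over all instances and divided by the total size of the predicted sets (precision) or of the true sets (recall). The helper names and the toy labels are hypothetical; ancestors are derived from the CPC identifier format (one character for a section, three for a class, four for a subclass).</p>
      <preformat>
```python
def with_ancestors(labels):
    """Expand CPC labels with their ancestors, e.g. C01B -> {C, C01, C01B}."""
    expanded = set()
    for label in labels:
        # Section = 1 character, class = 3 characters, subclass = 4 characters.
        for length in (1, 3, 4):
            if len(label) >= length:
                expanded.add(label[:length])
    return expanded


def hierarchical_scores(predicted, gold):
    """Return hierarchical precision, recall, and F1 over all instances."""
    overlap = n_predicted = n_gold = 0
    for predicted_labels, gold_labels in zip(predicted, gold):
        p_set = with_ancestors(predicted_labels)
        t_set = with_ancestors(gold_labels)
        overlap += len(p_set.intersection(t_set))
        n_predicted += len(p_set)
        n_gold += len(t_set)
    precision = overlap / n_predicted if n_predicted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two toy instances: the first prediction has the wrong subclass only.
predicted = [{"C01B"}, {"D02G"}]
gold = [{"C01C"}, {"D02G"}]
p, r, f1 = hierarchical_scores(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.833 0.833 0.833
```
      </preformat>
      <p>In the toy example, the wrong subclass C01B still receives credit for the correct section C and class C01, which is the behavior that distinguishes the hierarchical measures from flat ones.</p>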
      <p>Label Grouping. For our analyses, we perform several groupings of labels in order to compute
(macro-average) results by label.</p>
      <p>Grouping by Label Frequency. The less frequent labels capture fine-grained information
and thus are often more informative than the more frequent ones. Previous works evaluate the
performance of a model for these less frequent labels under a few-shot setting. However, there
is no standard threshold for what constitutes a minority class in few-shot text classification.<sup>7</sup>
Therefore, instead of committing to a single value as the few-shot criterion, we
define four frequency-based label groups using the label frequency thresholds 10, 50, and 100.
Table 3 shows the number of labels within each label group.</p>
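      <p>The frequency-based grouping and the per-group macro-F1 can be sketched as follows. The exact group boundaries are an assumption (counts below 10, from 10 to 49, from 50 to 99, and 100 or more), as are all names and the toy data; label counts are taken over the label sets of the training instances.</p>
      <preformat>
```python
from bisect import bisect_right
from collections import Counter

THRESHOLDS = (10, 50, 100)
# Assumed boundaries: a count equal to a threshold falls into the higher group.
GROUP_NAMES = ("below 10", "10 to 49", "50 to 99", "100 and above")


def frequency_group(count):
    """Map a label count to the name of its frequency group."""
    return GROUP_NAMES[bisect_right(THRESHOLDS, count)]


def group_labels(training_label_sets):
    """Assign every label seen in training to one of the four groups."""
    counts = Counter(
        label for labels in training_label_sets for label in labels
    )
    groups = {name: [] for name in GROUP_NAMES}
    for label, count in sorted(counts.items()):
        groups[frequency_group(count)].append(label)
    return groups


def group_macro_f1(per_label_f1, labels):
    """Average the per-label F1 scores of the labels in one group."""
    if not labels:
        return 0.0
    return sum(per_label_f1[label] for label in labels) / len(labels)


# Toy training data: C and C01 occur 12 times, D occurs 3 times.
training = [{"C", "C01"}] * 12 + [{"D"}] * 3
groups = group_labels(training)
print(groups["10 to 49"])  # ['C', 'C01']
print(groups["below 10"])  # ['D']
print(group_macro_f1({"C": 0.5, "C01": 1.0, "D": 0.2}, groups["10 to 49"]))  # 0.75
```
      </preformat>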
      <p>Grouping by Level. We use the hierarchical taxonomy information for dividing the labels
into groups based on the hierarchical level. Since the USPTO dataset consists of labels from the
first three levels of the CPC taxonomy, we create three label groups with 9, 128, and 630 labels, respectively.</p>
      <p>Grouping by Section (Domain). In this setting, we treat a section label and all its child
nodes as a single group, creating nine groups in total, essentially performing a topical grouping
as shown in Table 3. Group B contains the largest number of labels, followed by F. Group
Y consists of only 15 labels, which are used as general tags for new technologies or
for cross-sectional technologies spanning several sections.</p>
      <p>
        <sup>7</sup>For example, MIMIC-III [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], EURLEX57K [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], AMAZON13K [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] have few-shot categorization thresholds of
5, 50 and 100 respectively.
      </p>
      <p>[Table 3: Number of labels per label group, with groupings by label frequency, level, and section.]</p>
      <sec id="sec-4-3">
        <title>4.3. Baselines</title>
        <p>
TwistBytes (TB). As a non-neural baseline, we use TwistBytes [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], a Local Classifier per Node
(LCN) approach which trains a Support Vector Machine (SVM) model [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] as a base classifier for
each label in the class hierarchy. TwistBytes uses the sibling strategy when training a classifier,
i.e., using the subset of the training data consisting of the instances with the label addressed
by the classifier or those of the respective siblings. The tree is traversed from the root node to
leaf nodes during prediction, predicting a label at each hop using the label-specific classifier. A
child label classifier is traversed if the probability of the parent label is more than a user-defined
threshold. We use the same parameter values as defined by [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], i.e., a decision function
threshold of -0.25 and tf.idf vectors of dimension 70k.
        </p>
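        <p>The top-down prediction procedure of a local-classifier-per-node approach can be illustrated with a minimal sketch. This is not the TwistBytes implementation: the function predict_top_down, the children mapping, and the toy scores are hypothetical, the score function stands in for the decision function of the label-specific classifier, and the strict comparison against the threshold is an assumption.</p>
        <preformat>
```python
def predict_top_down(children, score, threshold=-0.25, root="ROOT"):
    """Traverse the label tree from the root and collect accepted labels.

    A label is accepted if its classifier score exceeds the threshold;
    the children of a label are only visited if the label was accepted.
    """
    predicted = []
    frontier = list(children.get(root, []))
    while frontier:
        label = frontier.pop()
        if score(label) > threshold:
            predicted.append(label)
            frontier.extend(children.get(label, []))
    return sorted(predicted)


# Toy excerpt of the taxonomy and made-up decision values for one document.
children = {
    "ROOT": ["C", "D"],
    "C": ["C01", "C03"],
    "C01": ["C01B", "C01C"],
    "D": ["D02"],
}
scores = {
    "C": 1.2, "D": -0.9, "C01": 0.1, "C03": -0.3,
    "C01B": -0.2, "C01C": -0.5, "D02": 2.0,
}

print(predict_top_down(children, scores.get))  # ['C', 'C01', 'C01B']
```
        </preformat>
        <p>In the toy example, D02 is never evaluated despite its high score because its parent D is rejected, which shows how errors at upper levels propagate downwards in top-down prediction.</p>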
        <p>
          THMM with fine-tuned title and abstract embedding. As a neural baseline, we chose the THMM setting of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (see Section 3).
They generate a document representation by concatenating the title and abstract and passing the result
through the SciBERT model. The SciBERT model weights are fine-tuned during training.
        </p>
        <p>4.4. Experimental Setup and Hyperparameters</p>
        <p>
          To make the results comparable, all models use the same hyperparameters, which are based on
the best values found by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The hidden layer size of all dense layers in the classification heads
is 256, dropout is set to 0.25 across layers, and we use a batch size of 64. Contrary to Pujari et
al., we use a learning rate of 10<sup>-3</sup>, because we are not fine-tuning the language model. We
train all models for a maximum of 50 epochs with early stopping if the F1-macro on the dev
dataset does not increase for 7 epochs. As indicated in Figure 2, the majority of the fields
exceed the 512-token limit imposed by SciBERT and many other Transformer-based
models. For all fields where this is the case, we set the maximum input length to 512
tokens. For fields that fit into shorter sequences, we reduce the maximum input
length of the tokenizer for efficiency reasons. For title we use a maximum length of 64, and
for abstract, a maximum length of 256. For all implementations, we use Python with the
libraries TensorFlow<sup>8</sup> and Keras<sup>9</sup> [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. For integrating SciBERT, we make use of the transformers
[
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] library from Hugging Face.<sup>10</sup>
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
      <p>Our experiments aim to identify the most informative text field embeddings and the best
aggregation method for combining them into a document representation. Section 5.1 compares our
best performing model to the neural and non-neural baselines, including an analysis by label
frequency. In Section 5.2, we compare the performance when using various combinations of
fields as input, broken down by label frequency, domain, and hierarchy level.</p>
      <p>5.1. Performance of Models Using Various Document Representations</p>
      <p>Informativeness of Individual Field Embeddings. In order to assess the contribution of
the different fields to the document classification task, we use one
field at a time, generate a document representation, and evaluate it on the CPC classification
task. As we can see in Table 4, the brief-summary is the most informative field in terms of
overall performance, showing a high score across metrics and clearly outperforming models
leveraging abstract and claims. Unlike detail-desc, a brief-summary often includes
the invention summary and precise details on the technical field of an invention, which might be
a reason for its higher informativeness compared to other patent fields. As the title
field contains a few terms describing an invention in absolute brevity, a document representation
based on title can identify some labels with high precision, but only has a low recall. Similar
conclusions might be drawn for the fig-desc, as it contains particular domain-related terms,
and for the claims, which, due to their legal implications, are very specific to an invention. In contrast, no such
legal obligation holds for the abstract; thus, it is often drafted in a broad and imprecise manner,
partially explaining its limitations for use in classification tasks such as ours. Accordingly, it shows a
high recall score but lower precision.</p>
      <p><sup>8</sup>https://www.tensorflow.org/</p>
      <p><sup>9</sup>https://keras.io/</p>
      <p><sup>10</sup>https://huggingface.co/transformers/</p>
      <p>Aggregating Information from Several Fields. Next, we combine information from several
field embeddings (see Table 4). First, comparing vector summation and concatenation, we see a
clear advantage when using summation (⊕) over concatenation (indicated by ;) as the aggregation
method. Second, we observe that in both cases, adding more information helps. With the F1-macro
score being the primary metric of our analysis, we see a significant gain in the score when adding
the brief-summary to the document representation. Here, we observe that the performance
across the neural and non-neural approaches is consistent, which suggests that the informativeness of a
field is independent of the text representation method.</p>
      <p>
        Comparison of Best Configuration to Baseline Systems. First, we note that TwistBytes
using tf.idf-based representations of the complete document text has a higher micro-F1 score
than the THMM using the fine-tuned embedding of the concatenated title and abstract, as proposed by Pujari et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This clearly demonstrates
that there is relevant information in the additional text, and motivates us to combine
multi-field embeddings into a single neural document representation using an effective aggregation
technique. Table 4 shows the evaluation results of our proposed approach (THMM with
sum-based aggregation of six content fields, last row) compared to the baselines (TwistBytes and
THMM with the concatenated title and abstract text as input). Our proposed approach
outperforms both baselines in terms of macro-F1 and micro-F1 scores. When comparing our
approach to the version of THMM proposed by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we see a substantial improvement in precision with
a slight dip in recall. Our proposed approach performs better across all three micro-average
scores, i.e., precision, recall, and F1.
      </p>
      <p>Performance by Label Frequency. Table 5 shows the results of our best-performing
approach (THMM with sum-based aggregation of six content fields) and the baselines for different
frequency-based groups. We find that our approach works better overall. In particular, for the
most infrequent label set, i.e., labels with a count of less than 10, the macro-F1
is 44% better than that of THMM with the concatenated title and abstract text as input. The performance of the non-neural TwistBytes
model is strong for more frequent labels, but poor for less frequent labels.</p>
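      <p>A frequency-based breakdown of this kind can be sketched as follows. The snippet assigns each label to a group according to its count in the training data and then averages per-label F1 scores within each group. Only the threshold of 10 comes from the text above; the group names, the second threshold, the function names, and all numbers are assumptions made for illustration.</p>
      <preformat>
```python
from collections import Counter


def frequency_groups(train_label_sets, boundaries, default="frequent"):
    """Assign each label to the first group whose upper bound exceeds its count."""
    counts = Counter(label for labels in train_label_sets for label in labels)
    groups = {}
    for label, count in counts.items():
        name = next((n for n, upper in boundaries if upper > count), default)
        groups.setdefault(name, []).append(label)
    return groups


def macro_f1_by_group(per_label_f1, groups):
    """Macro-average the per-label F1 scores separately for every group."""
    return {
        name: sum(per_label_f1[label] for label in labels) / len(labels)
        for name, labels in groups.items()
    }


# Invented training label sets: label A occurs 60 times, B 12, C 3, D 5.
train = [{"A"}] * 60 + [{"B"}] * 12 + [{"C"}] * 3 + [{"D"}] * 5
# Hypothetical group boundaries (exclusive upper bounds on the label count).
boundaries = [("rare", 10), ("medium", 50)]
# Invented per-label F1 scores of some classifier.
per_label_f1 = {"A": 0.9, "B": 0.6, "C": 0.2, "D": 0.4}

groups = frequency_groups(train, boundaries)
scores = macro_f1_by_group(per_label_f1, groups)
for name, score in scores.items():
    print(f"{name}: macro-F1 = {score:.2f}")
```
      </preformat>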
      <p>Summary. In our experiments, we identify brief-summary as the most informative patent
field and vector summation as the most effective aggregation technique. Neural models
outperform their non-neural counterparts, especially in the case of infrequent labels.</p>
      <p>5.2. Analysis of Field Embeddings by CPC Level, Label Frequency, and Domain</p>
      <p>We here report a fine-grained analysis of model performance for three label groupings, focusing
on macro-F1. Overall, models using the brief-summary field perform better than models using any of the
other fields across all label-grouping settings (see Figure 3), and excel in few-shot scenarios.</p>
      <p>Level Hierarchy. On analyzing the level group results in Figure 3, we observe that at level 1,
the performance is quite similar across fields, including title and fig-desc. However, for
more fine-grained labels, especially at level 3, the brief-summary embedding is more informative than
any other field embedding.
</p>
      <p>Frequency-based Groups. For the high-frequency label group, we see a similar performance
for abstract, claims, and brief-summary. However, for labels with fewer instances, the
performance for brief-summary is noticeably better.</p>
      <p>Section/Domain. Using brief-summary consistently leads to better results across different
sections, followed by the abstract (see Figure 3). However, for sections D, H, and Y,
the relative gain in performance with brief-summary is marginal compared to abstract.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Outlook</title>
      <p>In this paper, we have addressed the challenge of classifying patents, which are long multi-field
text documents. We have shown that the performance of both non-neural and neural models
benefits from leveraging a larger document context by combining text snippets from the various
fields. As a first step, we have enriched the USPTO-70k patent dataset with four additional
textual patent fields. Among these, we identify brief-summary as the most informative patent
field in terms of overall performance and as being highly effective for classifying infrequent
cases. Our experiments show that vector summation performs better than concatenation.</p>
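      <p>The two aggregation strategies compared here can be illustrated with a small sketch. It assumes that each field has already been encoded into a fixed-size vector; the three-dimensional vectors, the choice of three fields, and the function names are invented for illustration and do not reproduce the actual encoder or dimensionality used in our experiments.</p>
      <preformat>
```python
def aggregate_sum(field_embeddings):
    """Element-wise sum: the result keeps the size of a single field vector."""
    return [sum(values) for values in zip(*field_embeddings.values())]


def aggregate_concat(field_embeddings):
    """Concatenation: the result grows linearly with the number of fields."""
    return [value for vector in field_embeddings.values() for value in vector]


# Invented field embeddings; real ones would come from a text encoder.
fields = {
    "title": [1.0, 0.0, -1.0],
    "abstract": [0.5, 0.5, 0.5],
    "brief-summary": [2.0, -1.0, 0.0],
}

summed = aggregate_sum(fields)
concatenated = aggregate_concat(fields)
print("sum:", summed, "dimensions:", len(summed))
print("concat dimensions:", len(concatenated))
```
      </preformat>
      <p>With summation, the input size of the classification layer is independent of how many fields are combined, whereas with concatenation it grows with every added field.</p>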
      <p>
        While our model is conceptually simple and clearly outperforms previous work, it also points
towards interesting directions for future work. A first step is clearly to try more advanced
methods for creating meta-embeddings, such as incorporating adversarial techniques as in
[
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. Second, a promising direction for patent classification is to employ variants of
transformer-based neural language models (such as Longformer [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] or Big Bird [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) that
can process longer text documents. In particular, as we have found the brief-summary field
to be a very effective source of information, a potential future system could first apply a
summarization method to the entire patent and then compute an embedding using such an
extended language model. Finally, patents do not just consist of textual fields, but also include
images or diagrams as well as further metadata that will likely contain relevant information.
The integration of these various types of information, e.g., in multi-modal approaches, is
certainly another fruitful direction for research on patent classification.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their insightful comments. We also thank Tim Tarsi for
fruitful discussions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Smith</surname>
          </string-name>
          , Automation of Patent Classification,
          <source>World Patent Information</source>
          <volume>24</volume>
          (
          <year>2002</year>
          )
          <fpage>269</fpage>
          -
          <lpage>271</lpage>
          . doi:10.1016/S0172-2190(02)00067-4.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Fall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Törcsvári</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Benzineb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karetka</surname>
          </string-name>
          ,
          <article-title>Automated Categorization in the International Patent Classification</article-title>
          ,
          <source>SIGIR Forum</source>
          <volume>37</volume>
          (
          <year>2003</year>
          )
          <fpage>10</fpage>
          -
          <lpage>25</lpage>
          . doi:10.1145/945546.945547.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guyot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Benzineb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Falquet</surname>
          </string-name>
          ,
          <article-title>myClass: A Mature Tool for Patent Classification</article-title>
          ,
          <source>in: Proceedings of the International Conference of the Cross-Language Evaluation Forum (CLEF'10)</source>
          , CEUR-WS.org, Padua, Italy,
          <year>2010</year>
          . URL: http://ceur-ws.org/Vol-1176/CLEF2010wn-CLEF-IP-GuyotEt2010.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Patent Classification System Using a New Hybrid Genetic Algorithm Support Vector Machine</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>10</volume>
          (
          <year>2010</year>
          )
          <fpage>1164</fpage>
          -
          <lpage>1177</lpage>
          . doi:10.1016/j.asoc.2009.11.033.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>D'hondt</surname>
          </string-name>
          ,
          <article-title>Patent Classification Experiments with the Linguistic Classification System LCS in CLEF-IP 2011</article-title>
          ,
          <source>in: Proceedings of the International Conference of the Cross-Language Evaluation Forum (CLEF'11)</source>
          , CEUR-WS.org, Amsterdam, The Netherlands,
          <year>2011</year>
          . URL: http://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-VerberneEt2011.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Benzineb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guyot</surname>
          </string-name>
          ,
          <article-title>Automated Patent Classification</article-title>
          ,
          <source>Current Challenges in Patent Information Retrieval</source>
          (
          <year>2011</year>
          )
          <fpage>239</fpage>
          -
          <lpage>261</lpage>
          . doi:10.1007/978-3-642-19231-9_12.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>DeepPatent: Patent Classification with Convolutional Neural Networks and Word Embedding</article-title>
          ,
          <source>Scientometrics</source>
          <volume>117</volume>
          (
          <year>2018</year>
          )
          <fpage>721</fpage>
          -
          <lpage>744</lpage>
          . doi:10.1007/s11192-018-2905-5.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsiang</surname>
          </string-name>
          ,
          <article-title>PatentBERT: Patent Classification with Fine-Tuning a pre-trained BERT Model</article-title>
          ,
          <year>2019</year>
          . arXiv:1906.02124.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Pujari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Strötgen</surname>
          </string-name>
          ,
          <article-title>A Multi-Task Approach to Neural Multi-Label Hierarchical Patent Classification using Transformers</article-title>
          ,
          <source>in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR'21)</source>
          , Online,
          <year>2021</year>
          . doi:10.1007/978-3-030-72113-8_34.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Benites</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>Classifying Patent Applications with Ensemble Methods</article-title>
          ,
          <source>in: Proceedings of the 16th Annual Workshop of The Australasian Language Technology Association (ALTA'18)</source>
          , Dunedin, New Zealand,
          <year>2018</year>
          . URL: https://aclanthology.org/U18-1012.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL'19)</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guruganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ainslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ontanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Big Bird: Transformers for Longer Sequences</article-title>
          ,
          <source>in: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS'20)</source>
          , Online,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ingwersen</surname>
          </string-name>
          ,
          <article-title>Polyrepresentation of information needs and semantic entities: Elements of a cognitive theory for information retrieval interaction</article-title>
          ,
          <source>in: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94)</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>1994</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          . doi:10.18653/v1/D19-1371.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Benites</surname>
          </string-name>
          ,
          <article-title>TwistBytes - Hierarchical Classification at GermEval 2019: Walking the Fine Line (of Recall and Precision)</article-title>
          ,
          <source>in: Proceedings of KONVENS'19</source>
          , German Society for Computational Linguistics &amp; Language Technology, Erlangen-Nürnberg, Germany,
          <year>2019</year>
          , pp.
          <fpage>326</fpage>
          -
          <lpage>335</lpage>
          . URL: https://konvens.org/proceedings/2019/papers/germeval/Germeval_Task1_paper_6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mollá</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <article-title>Overview of the 2018 ALTA Shared Task: Classifying Patent Applications</article-title>
          ,
          <source>in: Proceedings of the 16th Annual Workshop of The Australasian Language Technology Association (ALTA'18)</source>
          , Dunedin, New Zealand,
          <year>2018</year>
          , pp.
          <fpage>84</fpage>
          -
          <lpage>88</lpage>
          . URL: https://aclanthology.org/U18-1011.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>Overview of the TREC 2019 Deep Learning Track</article-title>
          ,
          <year>2020</year>
          . arXiv:2003.07820.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>Pretrained Transformers for Text Ranking: BERT and Beyond</article-title>
          , Morgan &amp; Claypool Publishers,
          <year>2021</year>
          . doi:10.2200/S01123ED1V01Y202108HLT053.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Grawe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Bonfante</surname>
          </string-name>
          ,
          <article-title>Automated Patent Classification Using Word Embedding</article-title>
          ,
          <source>in: Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA'17)</source>
          , IEEE, Cancun, Mexico,
          <year>2017</year>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>411</lpage>
          . doi:10.1109/ICMLA.2017.0-127.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Risch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krestel</surname>
          </string-name>
          ,
          <article-title>Domain-specific Word Embeddings for Patent Classification</article-title>
          ,
          <source>Data Technologies and Applications</source>
          <volume>53</volume>
          (
          <year>2019</year>
          )
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          . doi:10.1108/DTA-01-2019-0002.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Althammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Linguistically informed masking for representation learning in the patent domain</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2021 co-located with the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'21)</source>
          , Online,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mullenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiegreffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <article-title>Explainable Prediction of Medical Codes from Clinical Text</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL'18)</source>
          , Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>1101</fpage>
          -
          <lpage>1111</lpage>
          . doi:10.18653/v1/N18-1100.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kavuluru</surname>
          </string-name>
          ,
          <article-title>Few-Shot and Zero-Shot Multi-Label Learning for Structured Label Spaces</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP'18)</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>3132</fpage>
          -
          <lpage>3142</lpage>
          . doi:10.18653/v1/D18-1352.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Semi-Supervised Classification with Graph Convolutional Networks</article-title>
          ,
          <source>in: Proceedings of the 5th International Conference on Learning Representations (ICLR'17)</source>
          , Toulon, France,
          <year>2017</year>
          . URL: https://openreview.net/forum?id=SJU4ayYgl.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          , E. Fergadiotis,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>Large-Scale Multi-Label Text Classification on EU Legislation</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19)</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>6314</fpage>
          -
          <lpage>6322</lpage>
          . doi:10.18653/v1/P19-1636.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , T. Pollard,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , L.-w. Lehman,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          , B. Moody, P. Szolovits,
          <string-name>
            <given-names>L.</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <article-title>MIMIC-III, a Freely Accessible Critical Care Database</article-title>
          ,
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          )
          <fpage>160035</fpage>
          . doi:10.1038/sdata.2016.35.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>RCV1: A New Benchmark Collection for Text Categorization Research</article-title>
          ,
          <source>J. Mach. Learn. Res.</source>
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          . URL: http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kiritchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Famili</surname>
          </string-name>
          ,
          <article-title>Functional Annotation of Genes Using Hierarchical Text Categorization</article-title>
          ,
          <source>in: Proceedings of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology</source>
          ,
          <year>2005</year>
          . URL: https://www.site.uottawa.ca/~stan/papers/2004/p15.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Support-Vector Networks</article-title>
          ,
          <source>Machine Learning</source>
          <volume>20</volume>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          . doi:10.1007/BF00994018.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          , et al.,
          <source>Keras</source>
          , https://github.com/fchollet/keras,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
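          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>von Platen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,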
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>HuggingFace's Transformers: State-of-the-art Natural Language Processing</article-title>
          ,
          <year>2019</year>
          . arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Strötgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          ,
          <article-title>FAME: Feature-Based Adversarial MetaEmbeddings for Robust Input Representations</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP'21)</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>8382</fpage>
          -
          <lpage>8395</lpage>
          . doi:10.18653/v1/2021.emnlp-main.660.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Longformer: The Long-Document Transformer</article-title>
          ,
          <year>2020</year>
          . arXiv:2004.05150.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>