Evaluating Neural Multi-Field Document Representations for Patent Classification

Subhash Chandra Pujari (1,2), Fryderyk Mantiuk (1,3), Mark Giereth (4), Jannik Strötgen (1) and Annemarie Friedrich (1)

(1) Bosch Center for Artificial Intelligence, Renningen, Germany
(2) Institute of Computer Science, Heidelberg University, Heidelberg, Germany
(3) Duale Hochschule Baden-Württemberg, Stuttgart, Germany
(4) Robert Bosch GmbH, Stuttgart, Germany

Abstract
Patent classification constitutes a long-tailed hierarchical learning problem. Prior work has demonstrated the efficacy of neural representations based on pre-trained transformers; however, due to the limited input size of these models, it has used only the title and abstract of patents as input. Patent documents consist of several textual fields, some of which are quite long. We show that a baseline using simple tf.idf-based methods can easily leverage this additional information. We propose a new architecture that combines the neural transformer-based representations of the various fields into a meta-embedding, which we demonstrate to outperform its tf.idf-based counterparts especially on less frequent classes. Using a relatively simple architecture, we outperform the previous state of the art on CPC classification by a margin of 1.2 macro-avg. F1 and 2.6 micro-avg. F1. We identify the textual field giving a "brief-summary" of the patent as the most informative with regard to CPC classification, which points to interesting future directions of research on less computation-intensive models, e.g., by summarizing long documents before neural classification.

Keywords
patent classification, long-tailed classification, neural document representations

BIR 2022: 12th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2022, April 10, 2022, hybrid.
subhashchandra.pujari@de.bosch.com (S. C. Pujari); fryderyk.mantiuk@de.bosch.com (F. Mantiuk); mark.giereth@de.bosch.com (M. Giereth); jannik.stroetgen@de.bosch.com (J. Strötgen); annemarie.friedrich@de.bosch.com (A. Friedrich)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

1. Introduction

Organizations must be aware of competitors' Intellectual Property (IP) artifacts, as IP litigation might involve a considerable loss in revenue or delay a product launch. Carving out a competitor's IP portfolio or the relevant IP landscape often relies on classifying patents by their content. For example, the Cooperative Patent Classification (CPC; https://www.cooperativepatentclassification.org) is a large taxonomy used by major patent offices for their internal organization. Once submitted to a patent office, each patent application is manually labeled with a set of labels taken from this taxonomy. Within companies, a second step in managing or understanding patent portfolios usually consists in organizing patents into internal or use-case-specific classification schemes.

Figure 1: An excerpt from the Cooperative Patent Classification (CPC) taxonomy, e.g., Section C "Chemistry; Metallurgy", Class C01 "Inorganic Chemistry", Subclass C01C "Ammonia; Cyanogen".
Besides the obvious use case of automating the first step, CPC classification is a good testbed for patent classification in general due to the free availability of labeled data, and it has been the target of research for decades already [1, 2, 3, 4, 5, 6, 7, 8, 9]. The CPC taxonomy arranges labels into up to nine hierarchical levels. At the fifth level, the taxonomy has about 240k labels with a heavily skewed, long-tailed distribution. Following previous works [7, 8, 9], we restrict our evaluation to the first three levels of the hierarchy – as exemplarily shown in Figure 1 – which still results in a very challenging long-tailed hierarchical classification task with 767 labels.

Patents are structured text documents with multiple fields, e.g., title, abstract, description, and claims. Among these, only the claims, which have to be interpreted in the context of the description, are legally binding. The remaining fields are often drafted less carefully, or even intentionally conceal a patent's content. In earlier research on non-neural text classification [2, 3, 4, 5, 10], document representations for long documents were obtained, for instance, using tf.idf-based methods. Despite being robust, they ignore sequence information and, in contrast to more recent neural language models, do not profit from self-supervised pre-training. Due to the limited input size of transformers such as BERT [11], which can process input of only up to 512 word-piece tokens, corresponding to only a few sentences, prior research on neural patent classification [8, 9, 12] has mainly relied on title and abstract only. The latter is problematic as these fields are often rather broad and hence frequently do not precisely describe the patent's content. We are also not aware of a systematic study comparing the various possible input fields for patent classification to exploit potentially complementary information across fields. (Relatedly, Ingwersen [13] introduced the concept of polyrepresentation in cognitive IR theory, which states that the uncertainty of an information retrieval system decreases by incorporating multiple representations of full-text semantic entities, e.g., sentences, paragraphs, or sections, into a document representation.)

In sum, prior research either did not leverage the strength of transformers, or uses only title and abstract to compute a document representation. In this work, we hence propose a neural system architecture with a pre-trained transformer-based neural language model [14] as backbone, but incorporating embeddings derived from various textual fields. As a first step, we enrich the USPTO-70k dataset, which originally contained only title and abstract, with the four additional patent fields claims, detailed description (detail-desc), brief-summary, and figure description (fig-desc); the latter three fields are sub-fields of the description within USPTO patents. Based on this enriched dataset, we perform various experiments systematically comparing the performance of a non-neural hierarchical classifier [15], whose representations are based on tf.idf and which uses SVMs internally, with an extension of a state-of-the-art neural model [9]. We evaluate the various text field embeddings in terms of their informativeness with respect to the task of classifying patents according to the CPC taxonomy, finding that models using the brief-summary are very effective. We also find that the information from the fields is complementary, i.e., models using all input fields outperform their counterparts using subsets thereof.
This finding holds both for non-neural and neural systems. Finally, when combining several embeddings into a meta-embedding, we find vector summation to outperform concatenation. Our further analysis shows that using additional textual information helps most for the difficult infrequent labels, i.e., in few-shot scenarios.

In summary, we propose a novel neural system for patent classification, demonstrating how and that the various textual sub-fields of patents can be an effective source of information. In particular, our contributions are as follows:

• We enrich the USPTO-70k dataset of [9], which contains only titles and abstracts, with four additional patent fields, making both the data and our code available to foster future research (https://github.com/boschresearch/multifield_patent_classification_bir2022).
• We propose an approach to efficiently generate an effective document representation using a transformer-based model incorporating complementary information from multiple textual patent fields. We demonstrate that the additional information increases classification performance by a considerable margin.
• Unlike previous works, e.g., [8, 9], we also evaluate model performance in a few-shot setting, where our proposed approach fares particularly well on the least frequent label group (showing 44% better macro-F1).

2. Related Work

Approaches to patent classification differ across several dimensions: there are neural and non-neural methods, approaches exploiting the full document text vs. ones relying on title and abstract only, and techniques performing hierarchical classification vs. those tackling only the coarsest (CPC) class granularity.

Full-text-based approaches. With its underlying bag-of-words representation, a tf.idf feature vector can incorporate the complete document text into a document representation. Fall et al. [2] experiment with title, abstract, claims, description, and the meta-data fields, and find that using the first 300 words of the title, inventors, applicant, abstract, and description works best. Several works [3, 4, 5] show that the description is more informative than other sections of a patent, in particular its initial part. Benites et al. [10] rank first in the ALTA 2018 Shared Task on patent classification [16] and use the full text of the patent documents. However, their method predicts only nine labels at the Section level, while we address hierarchical patent classification with 767 labels across three granularities.

Neural Models. Analyzing the submissions of the TREC 2019 deep learning track, Craswell et al. [17] conclude that BERT-based methods substantially outperform earlier text representation techniques. Based on this finding, Lin et al. [18] divide the timeline of deep learning models into pre-BERT and BERT eras. The two primary methods for feature vector generation in the pre-BERT era are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), including successors of the latter such as GRUs and LSTMs. DeepPatent by Li et al. [7] applies a CNN on top of word embeddings for the first 100 words from title and abstract. They compare the results with a non-neural baseline, which uses a tf.idf vector generated from the complete document text. However, the insights from this comparison are limited, as the CNN outperforms the non-neural model only in terms of micro-precision.
Grawe et al. [19] apply an LSTM on top of word embeddings and compare using the first 150 words of the description against using the first 400 words. Risch and Krestel [20] pre-train fastText word embeddings on a large corpus of patent documents and combine the word embeddings using Gated Recurrent Units (GRUs). They also use only title and abstract, but include the first 300 words. Lee and Hsiang [8] are the first to apply BERT to the patent classification task. They report that using only the first claim gives results on par with using title and abstract. Unlike [8], Pujari et al. [9] consider patent classification as a hierarchical multi-label classification problem and propose a hierarchical transformer-based multi-task model that trains an intermediate SciBERT layer with title and abstract as input text. Comparing BERT and SciBERT on a patent classification task, Althammer et al. [21] find that the SciBERT model performs better. Transformer-based language models are computationally expensive and therefore subject to a maximum sequence length constraint, typically accommodating only 512 word-piece tokens.

Transformers with longer inputs. Long-text transformer models reduce the computational overhead with a sparse attention mechanism, increasing the sequence length to 4096 tokens. Zaheer et al. [12] report state-of-the-art patent classification results using Big Bird with title, abstract, and claims as input. Although effective, such transformer-based models are still incapable of accommodating the complete document text in a document representation. Therefore, in this work, we look into text pruning techniques for a more informative document representation.

Few-shot/long-tailed text classification. Besides an effective document representation for a long multi-field document, a major challenge in CPC classification is the prediction of less frequent labels. Mullenbach et al. [22] propose a method for few-shot learning that uses attention-based weights to combine label-based embeddings and incorporate them into a document representation. Rios et al. [23] extend this method by incorporating information from a hierarchical taxonomy using a GCNN [24]. Surveying the literature proposing novel datasets [25, 26, 27] for few-shot learning settings, we do not find a clear definition of the criterion used for categorizing a label into a few-shot category. Therefore, we create frequency-based label groups for evaluating the few-shot setting.

3. Model

Our classification model architecture is similar to that proposed in [9]. However, instead of using only the concatenated text of title and abstract as input to the transformer, we compute embeddings of various patent fields and aggregate them into a meta-embedding using vector summation or concatenation.

Table 1: Abbreviations.

  Symbol   Description
  t        title
  a        abstract
  c        claims
  d        detail-desc
  bs       brief-summary
  fd       fig-desc
  +        text concatenation
  e(·)     a method which tokenizes an input text and calculates the sequence embedding using a transformer-based LM
  e_f(·)   same as e(·), but the LM is fine-tuned during the task
  ;        vector concatenation
  ⊕        vector summation

Field Embeddings. We use PatSciBERT, i.e., the SciBERT model [14] fine-tuned on CPC classification by [9], to generate embeddings for the first 510 word-piece tokens of each textual field. This is motivated by the findings of previous works that the initial part of a field's text is often more informative than the remainder [3, 4, 5]. In most experiments, we do not fine-tune PatSciBERT. For word-piece tokenization, we use a BertTokenizer (https://huggingface.co/transformers/v2.4.0/model_doc/bert.html#berttokenizer) initialized with SciBERT's vocabulary. The 768-dimensional last hidden state of the [CLS] token is used as the text field's embedding. We denote this embedding as e(·) when the language model is not fine-tuned; e_f(·) is used when fine-tuning the underlying language model.
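To make the embedding step concrete, the following minimal sketch shows one way to realize e(·) with the Huggingface transformers library and TensorFlow, the stack named in Section 4.4. The checkpoint path "patscibert" is a hypothetical placeholder for the fine-tuned weights, and the helper name is ours; the vocabulary identifier refers to the public SciBERT model.

```python
# Hedged sketch of e(.): a frozen PatSciBERT encoder returning the [CLS]
# embedding of the first word-piece tokens of one textual field.
# "patscibert" is a hypothetical local path to the fine-tuned checkpoint.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = TFBertModel.from_pretrained("patscibert")
encoder.trainable = False  # e(.): the language model is not fine-tuned

# Field-specific maximum input lengths (cf. Section 4.4).
MAX_LEN = {"title": 64, "abstract": 256, "claims": 512,
           "detail_desc": 512, "brief_summary": 512, "fig_desc": 512}

def field_embedding(text, field):
    """768-dim last hidden state of the [CLS] token for one field's text."""
    enc = tokenizer(text, truncation=True, max_length=MAX_LEN[field],
                    return_tensors="tf")
    return encoder(**enc).last_hidden_state[:, 0, :]  # shape: (1, 768)
```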
Aggregation. For aggregating several field embeddings into a document representation x_i (for the i-th instance), we experiment with two simple aggregation methods, i.e., vector summation (⊕) and concatenation (;).

Classification Model. We use a Transformer-based Hierarchical Multi-task Model (THMM) architecture similar to [9], with x_i as input and one classification head per label. The heads consist of three-layer perceptrons with a binary softmax output, predicting whether the label applies to the document or not. The links of the hierarchical taxonomy define the input to the classification heads as the concatenation of the document representation x_i and the output of the second hidden layer of the respective parent heads.
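A corresponding sketch of the aggregation step and a single classification head is shown below; `field_embedding` is the helper from the previous listing, and the head structure is our reading of the THMM description (layer sizes and dropout from Section 4.4), not the reference implementation.

```python
# Hedged sketch of the meta-embedding aggregation and one THMM head.
import tensorflow as tf

def document_representation(field_texts, mode="sum"):
    """Aggregate per-field embeddings into x_i by summation (+) or concatenation (;)."""
    embs = [field_embedding(text, field) for field, text in field_texts.items()]
    if mode == "sum":
        return tf.add_n(embs)        # meta-embedding stays 768-dim
    return tf.concat(embs, axis=-1)  # grows to 768 * number_of_fields dims

def classification_head(x_i, parent_h2=None, label=None):
    """One THMM head: three dense layers ending in a binary softmax.
    For non-root labels, the parent head's second hidden state is
    concatenated to the document representation x_i."""
    h = x_i if parent_h2 is None else tf.keras.layers.Concatenate()([x_i, parent_h2])
    h1 = tf.keras.layers.Dropout(0.25)(tf.keras.layers.Dense(256, activation="relu")(h))
    h2 = tf.keras.layers.Dropout(0.25)(tf.keras.layers.Dense(256, activation="relu")(h1))
    out = tf.keras.layers.Dense(2, activation="softmax", name=label)(h2)
    return out, h2  # h2 is passed on to the heads of the child labels
```

One consequence of this design: with summation, the meta-embedding keeps the 768 dimensions of a single field embedding no matter how many fields are combined, keeping the downstream heads small, whereas concatenation grows linearly with the number of fields.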
4. Experimental Setup

In this section, we describe our experimental setup. First, in Section 4.1, we describe the steps taken to enrich the USPTO-70k dataset with additional fields taken from the USPTO bulk download (https://patentsview.org/download/data-download-tables). Section 4.2 introduces the evaluation metrics for hierarchical multi-label classification, which are used for the analysis of results in Section 5; here, we also define the label groups based on taxonomy level, label frequency, and section information. We compare our results with the neural and non-neural baselines described in Section 4.3, and our experimental setup is defined in Section 4.4.

4.1. Dataset

We use the USPTO-70k dataset released by Pujari et al. [9], containing 50k train, 10k test, and 10k dev instances. The instances are labeled with 9, 128, and 630 unique labels at the first three levels of the CPC taxonomy (version 2020.06), i.e., section, class, and subclass. The original USPTO-70k dataset provides only title and abstract. We enrich the dataset with additional patent fields and make the enriched dataset publicly available to ease future work.

A patent document contains the four main text fields title, abstract, claims, and description, with the latter being the longest and most detailed. The USPTO dump aggregates the subfields within the description into three groups: brief-summary, fig-desc, and detail-desc. With approximately 1.9k word-piece tokens on average, brief-summary is very concise in contrast to the elaborate detail-desc field with approximately 9.4k tokens (see Figure 2). As shown in Table 2, brief-summary often contains several subfields; however, as there is no strict structure or completeness required, we simply use the concatenation of these texts.

Table 2: In practice, not all patents contain all textual fields. We here show, for the case of the USPTO-70k dataset, the number of instances which do not contain particular subfields of brief-summary.

  field-name                       train-set  dev-set  test-set
  total number of instances        50,625     10,000   10,000
  invention summary missing        5,836      1,428    1,413
  background of invention missing  3,888      845      855
  technical-field missing          28,304     5,111    5,008

Figure 2: Word-piece token distributions for the different patent fields; mean token counts: (a) title 11.7, (b) abstract 132.6, (c) claims 1,295.1, (d) detail-desc 9,421.0, (e) brief-summary 1,883.0, (f) fig-desc 387.2.

4.2. Evaluation Metrics

In line with [9], we evaluate the models using hierarchical precision, recall, and F1-score as proposed by [28], defining precision as $hP = \frac{\sum_i |P_i \cap T_i|}{\sum_i |P_i|}$ and recall as $hR = \frac{\sum_i |P_i \cap T_i|}{\sum_i |T_i|}$. For each test instance $i$, the predicted label set $P_i$ consists of all predicted labels and their ancestors. Similarly, the true label set $T_i$ contains the true labels, including their ancestors. Since within the CPC taxonomy each child label has a single parent, we add the missing ancestors for each predicted label. Most previous works [7, 8, 12] evaluate model performance using micro-average scores, which do not adequately reveal the performance of a model on the less frequent labels. Due to the skewed label distribution, we therefore also compute macro-scores for each of the three measures as averages of the per-label scores; for example, the macro-F1 score is computed as the average of the per-label F1 scores. Additionally, we segregate the labels into groups according to various criteria for a more fine-grained analysis. The macro-F1 for a group is computed as the average of the per-label F1 scores of the labels in the respective group. The grouping strategy is defined by three criteria: frequency, level, and section information. Table 3 shows the details of the groups within each of the three settings, which are further explained below.

Label Grouping. For our analyses, we perform several groupings of labels in order to compute (macro-average) results by label group.

Grouping by Label Frequency. The less frequent labels capture fine-grained information and are thus often more informative than the more frequent ones. Previous works evaluate the performance of a model on these less frequent labels in a few-shot setting. However, there is no standard threshold for what constitutes a minority class in few-shot text classification; for example, MIMIC-III [26], EURLEX57K [25], and AMAZON13K [27] use few-shot categorization thresholds of 5, 50, and 100, respectively. Therefore, instead of sticking to a single value as a measure of the few-shot category, we define four frequency-based label groups using label frequency thresholds of 10, 50, and 100. Table 3 shows the number of labels within each label group.

Grouping by Level. We use the hierarchical taxonomy information to divide the labels into groups based on their hierarchical level. Since the USPTO dataset consists of labels from the first three levels of the CPC taxonomy, we create three label groups with 9, 128, and 630 labels, respectively.

Grouping by Section (Domain). In this setting, we align a section label and all its child nodes into a single group, creating nine groups in total and essentially performing a topical grouping as shown in Table 3. Group B contains the largest number of labels, followed by F. Group Y consists of only 15 labels, which are used as general tagging labels for new technologies or for cross-sectional technologies spanning several sections.

Table 3: Distribution of labels within the three label groupings. Frequency groups are defined over the 630 subclass labels ($f_i$ denotes the training frequency of label $i$); the level and section groups cover all 767 labels.

  Criterion        Group   Definition                          # Labels  # Instances
  Label Frequency  1-10    f_i <= 10                           114       1,085
  Label Frequency  11-50   10 < f_i <= 50                      238       5,079
  Label Frequency  51-100  50 < f_i <= 100                     91        5,897
  Label Frequency  100+    100 < f_i                           187       47,486
  Level            1       Level 1 of the CPC taxonomy         9         50,625
  Level            2       Level 2 of the CPC taxonomy         128       50,625
  Level            3       Level 3 of the CPC taxonomy         630       50,625
  Section          A       Human Necessities                   100       8,157
  Section          B       Performing Operations; Transporting 199       18,167
  Section          C       Chemistry; Metallurgy               103       10,474
  Section          D       Textiles; Paper                     44        6,791
  Section          E       Fixed Constructions                 38        2,390
  Section          F       Mechanical Engineering; Lighting; Heating; Weapons; Blasting  119  15,591
  Section          G       Physics                             93        20,458
  Section          H       Electricity                         56        20,206
  Section          Y       General Tagging                     15        9,360
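To make the metric from Section 4.2 concrete, the following minimal sketch (our own illustration, not the authors' evaluation code) computes hP, hR, and hF1 from predicted and gold label sets, including the ancestor closure described above; the `parent` mapping from each label to its unique parent is an assumed input derived from the CPC taxonomy.

```python
# Minimal sketch of hierarchical precision/recall/F1 [28]; `parent` maps each
# CPC label to its single parent (None at the section level).
def with_ancestors(labels, parent):
    """Close a label set under the ancestor relation (P_i resp. T_i)."""
    closed = set()
    for label in labels:
        while label is not None:
            closed.add(label)
            label = parent.get(label)
    return closed

def hierarchical_prf1(pred_sets, gold_sets, parent):
    """hP, hR, and hF1, micro-averaged over all test instances."""
    overlap = n_pred = n_gold = 0
    for pred, gold in zip(pred_sets, gold_sets):
        p_i = with_ancestors(pred, parent)
        t_i = with_ancestors(gold, parent)
        overlap += len(p_i & t_i)
        n_pred += len(p_i)
        n_gold += len(t_i)
    h_p, h_r = overlap / n_pred, overlap / n_gold
    return h_p, h_r, 2 * h_p * h_r / (h_p + h_r)

# Toy example: predicting subclass C01C for a patent whose gold label is C01
# yields hP = 2/3, hR = 1.0, hF1 = 0.8.
parent = {"C01C": "C01", "C01": "C", "C": None}
print(hierarchical_prf1([{"C01C"}], [{"C01"}], parent))
```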
4.3. Baselines

TwistBytes (TB). As a non-neural baseline, we use TwistBytes [15], a Local Classifier per Node (LCN) approach which trains a Support Vector Machine (SVM) model [29] as a base classifier for each label in the class hierarchy. TwistBytes uses the sibling strategy when training a classifier, i.e., it uses the subset of the training data consisting of the instances carrying the label addressed by the classifier or the labels of the respective siblings. During prediction, the tree is traversed from the root node to the leaf nodes, predicting a label at each hop using the label-specific classifier. A child label's classifier is visited only if the probability of the parent label exceeds a user-defined threshold. We use the same parameter values as defined by [15], i.e., a decision function threshold of -0.25 and a tf.idf vector of dimension 70k.

THMM with e_f(t+a). As a neural baseline, we choose the THMM setting of [9] (see Section 3). It generates a document representation by concatenating title and abstract and passing the text through the SciBERT model. The SciBERT model weights are fine-tuned during training.

4.4. Experimental Setup and Hyperparameters

To make the results comparable, all models use the same hyperparameters, which are based on the best values found by [9]. The hidden layer size of all dense layers in the classification heads is 256, dropout is set to 0.25 across layers, and we use a batch size of 64. Contrary to Pujari et al., we use a learning rate of $10^{-3}$, because we are not fine-tuning the language model. We train all models for a maximum of 50 epochs with early stopping if the macro-F1 on the dev dataset does not increase for 7 epochs. As indicated in Figure 2, the majority of the fields exceed the 512-token limit imposed by SciBERT and many of the other transformer-based models. For all fields where this is the case, we set the maximum input length to 512 tokens. For fields that fit into shorter sequences, we reduce the maximum input length of the tokenizer for efficiency reasons: for title we use a maximum length of 64, and for abstract a maximum length of 256. For all implementations, we use Python with the libraries TensorFlow (https://www.tensorflow.org/) and Keras (https://keras.io/) [30]. For integrating SciBERT, we make use of the transformers library [31] from Huggingface (https://huggingface.co/transformers/).
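A minimal sketch of this training configuration in Keras follows. The optimizer choice and the name of the monitored metric are our assumptions, as the text fixes only the learning rate, patience, and epoch budget; `model`, `train_ds`, and `dev_ds` are placeholders.

```python
# Hedged sketch of the training setup described in Section 4.4.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # assumption: Adam

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_macro_f1",   # assumption: macro-F1 logged as a custom metric
    mode="max",
    patience=7,               # stop if dev macro-F1 stalls for 7 epochs
    restore_best_weights=True,
)

# One binary softmax head per label, trained jointly:
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(train_ds, validation_data=dev_ds, epochs=50, batch_size=64,
#           callbacks=[early_stopping])
```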
5. Experimental Results

Our experiments aim to identify the most informative text field embeddings and the best aggregation method for combining them into a document representation. Section 5.1 compares our best performing model to the neural and non-neural baselines, including an analysis by label frequency. In Section 5.2, we compare the performance when using various combinations of fields as input, broken down by label frequency, domain, and hierarchy level.

5.1. Performance of Models Using Various Document Representations

Informativeness of Individual Field Embeddings. In order to assess the contribution of the different fields to informativeness w.r.t. the document classification task, we use one field at a time to generate a document representation and evaluate it on the CPC classification task. As we can see in Table 4, the brief-summary is the most informative field in terms of overall performance, showing high scores across metrics and clearly outperforming models leveraging abstract and claims. Unlike detail-desc, a brief-summary often includes the invention summary and precise details on the technical field of an invention, which might explain its higher informativeness compared to the other patent fields. As the title field contains only a few terms describing an invention in absolute brevity, a document representation based on title can identify some labels with high precision, but only has a low recall. Similar conclusions can be drawn for fig-desc, as it contains particular domain-related terms, and, due to their legal implications, for the claims, which are drafted very specifically to an invention. In contrast, no such legal constraint holds for the abstract; it is thus often drafted in a broad and imprecise manner, partially explaining its limitations for use in classification tasks such as ours: it shows a high recall but lower precision.

Table 4: Model performance for different document representation inputs.

                                                 macro-avg.          micro-avg.
  Model  doc-rep                                 P     R     F1      P     R     F1
  TB     tf.idf(t+a)                             39.8  19.1  24.2    65.1  53.4  58.7
  TB     tf.idf(t+a+c)                           43.4  20.6  25.9    66.9  56.1  61.0
  TB     tf.idf(t+a+c+d)                         43.4  21.3  26.5    68.8  59.3  63.7
  TB     tf.idf(t+a+c+d+bs)                      45.4  23.4  28.8    69.9  60.8  65.0
  TB     tf.idf(t+a+c+d+bs+fd)                   45.5  23.6  28.9    69.9  60.9  65.1
  THMM   e_f(t+a)                                42.6  36.7  37.7    66.6  63.3  64.9
  THMM   e(t)                                    35.6  19.5  23.6    62.8  49.0  55.0
  THMM   e(a)                                    39.6  30.0  32.6    66.1  59.8  62.8
  THMM   e(c)                                    41.1  27.5  31.0    67.5  58.0  62.4
  THMM   e(d)                                    42.5  25.1  29.9    66.9  54.1  59.8
  THMM   e(bs)                                   47.8  32.3  36.6    70.4  59.9  64.7
  THMM   e(fd)                                   38.9  20.1  25.0    65.7  49.3  56.3
  THMM   e(t);e(a)                               40.9  30.7  33.4    67.6  60.4  63.8
  THMM   e(t);e(a);e(c)                          42.4  30.6  33.8    69.4  60.3  64.5
  THMM   e(t);e(a);e(c);e(d)                     43.9  31.0  34.6    68.7  61.4  64.8
  THMM   e(t);e(a);e(c);e(d);e(bs)               45.8  31.4  35.5    70.0  62.1  65.8
  THMM   e(t);e(a);e(c);e(d);e(bs);e(fd)         46.6  32.0  36.1    69.9  62.3  65.9
  THMM   e(t)⊕e(a)                               40.9  30.2  32.6    66.6  60.1  63.2
  THMM   e(t)⊕e(a)⊕e(c)                          42.0  31.9  34.6    67.4  62.3  64.8
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)                     43.9  33.3  35.8    68.4  63.5  65.9
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)               47.7  34.6  38.3    70.4  64.0  67.1
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)⊕e(fd)         48.6  35.0  38.9    70.2  65.0  67.5

Aggregating Information from Several Fields. Next, we combine information from several field embeddings (see Table 4). First, comparing vector summation and concatenation, we see a clear advantage of summation (⊕) over concatenation (;) as the aggregation method. Second, we observe that in both cases adding more information helps. With macro-F1 being the primary metric of our analysis, we see a notable gain in this score when adding the brief-summary to the document representation. Here, we observe that the relative performance is consistent across the neural and non-neural approaches, suggesting that the informativeness of a field is largely independent of the text representation method.
Comparison of the Best Configuration to the Baseline Systems. First, we note that TwistBytes using tf.idf-based representations of the complete document text achieves a higher micro-F1 score than the THMM with e_f(t+a) as proposed by Pujari et al. [9]. This clearly demonstrates that there is relevant information in the additional text, and it motivates us to combine the multi-field embeddings into a single neural document representation using an effective aggregation technique. Table 4 shows the evaluation results of our proposed approach (THMM with sum-based aggregation of six content fields, last row) compared to the baselines (TwistBytes and THMM with the concatenated title and abstract text as input). Our proposed approach outperforms both baselines in terms of macro-F1 and micro-F1. Compared to the version of the THMM proposed by [9], we see a large improvement in precision with a slight dip in recall; our approach performs better across all three micro-average scores, i.e., precision, recall, and F1.

Table 5: Macro scores of the best performing model and the baselines over the frequency-based groups.

                                                         macro-avg.
  Model  doc-rep                              group    P     R     F1
  TB     tf.idf(t+a+c+d+bs+fd)                1-10     1.8   0.7   0.9
  TB     tf.idf(t+a+c+d+bs+fd)                11-50    31.7  11.7  16.0
  TB     tf.idf(t+a+c+d+bs+fd)                51-100   63.6  23.4  32.0
  TB     tf.idf(t+a+c+d+bs+fd)                100+     64.6  40.8  47.8
  THMM   e_f(t+a)                             1-10     11.1  7.1   7.5
  THMM   e_f(t+a)                             11-50    31.4  25.9  26.2
  THMM   e_f(t+a)                             51-100   45.9  39.7  40.9
  THMM   e_f(t+a)                             100+     56.8  48.5  51.4
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)⊕e(fd)      1-10     13.1  10.7  10.8
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)⊕e(fd)      11-50    43.5  27.1  31.0
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)⊕e(fd)      51-100   57.4  37.0  43.2
  THMM   e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)⊕e(fd)      100+     64.2  46.3  52.5

Performance by Label Frequency. Table 5 shows the results of our best performing approach (THMM with e(t)⊕e(a)⊕e(c)⊕e(d)⊕e(bs)⊕e(fd)) and the baselines for the different frequency-based groups. We find that our approach works better overall; in particular, for the most infrequent label group, i.e., the labels occurring fewer than 10 times, the macro-F1 is 44% higher than that of the THMM with e_f(t+a). The performance of the non-neural TwistBytes model is strong for the more frequent labels, but poor for the less frequent ones.

Summary. In our experiments, we identify brief-summary as the most informative patent field and vector summation as the most effective aggregation technique. Neural models outperform their non-neural counterparts especially in the case of infrequent labels.

5.2. Analysis of Field Embeddings by CPC Level, Label Frequency, and Domain

We here report a fine-grained analysis of model performance for the three label groupings, focusing on macro-F1. Overall, models using the brief-summary field perform better than those using any other field across all label-grouping settings (see Figure 3), and excel in few-shot scenarios.

Level Hierarchy. Analyzing the level-group results in Figure 3, we observe that at level 1, the performance is quite similar across fields, including title and fig-desc. However, for the more fine-grained labels, especially at level 3, the brief-summary is more informative than any other field embedding.

Figure 3: Macro-F1 of the per-field models (title, abstract, claims, detail-desc, brief-summary, fig-desc) across the three label groupings: level (1-3), label frequency (1-10, 11-50, 51-100, 100+), and section (A-H, Y). The details of each label group are shown in Table 3.
Frequency-based Groups. For the high-frequency label group, we see a similar performance for abstract, claims, and brief-summary. However, for labels with fewer instances, the performance of brief-summary is noticeably better.

Section/Domain. Using brief-summary consistently leads to better results across the different sections, followed by the abstract (see Figure 3). However, for the label groups D, H, and Y, the relative gain in performance of brief-summary over abstract is marginal.

6. Conclusion and Outlook

In this paper, we have addressed the challenge of classifying patents, which are long multi-field text documents. We have shown that the performance of both non-neural and neural models benefits from leveraging a larger document context by combining text snippets from the various fields. As a first step, we have enriched the USPTO-70k patent dataset with four additional textual patent fields. Among these, we identify brief-summary as the most informative patent field in terms of overall performance and as being highly effective for classifying infrequent cases. Our experiments show vector summation to perform better than concatenation. While our model is conceptually simple and clearly outperforms previous work, it also points towards interesting directions for future work. A first step is clearly to try more advanced methods for creating meta-embeddings, such as incorporating adversarial techniques as in [32]. Second, another promising direction for patent classification is to employ variants of transformer-based neural language models (such as Longformer [33] or Big Bird [12]) that can process longer text documents. In particular, as we have found the brief-summary field to be a very effective source of information, a potential future system could first apply a summarization method to the entire patent and then compute an embedding using such an extended language model. Finally, patents do not just consist of textual fields, but also include images or diagrams as well as further meta-data that will likely contain relevant information. The integration of these various types of information, e.g., in multi-modal approaches, is certainly another fruitful direction for research on patent classification.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. We also thank Tim Tarsi for fruitful discussions.

References

[1] H. Smith, Automation of Patent Classification, World Patent Information 24 (2002) 269–271. doi:10.1016/S0172-2190(02)00067-4.
[2] C. J. Fall, A. Törcsvári, K. Benzineb, G. Karetka, Automated Categorization in the International Patent Classification, SIGIR Forum 37 (2003) 10–25. doi:10.1145/945546.945547.
[3] J. Guyot, K. Benzineb, G. Falquet, myClass: A Mature Tool for Patent Classification, in: Proceedings of the International Conference of the Cross-Language Evaluation Forum (CLEF'10), CEUR-WS.org, Padua, Italy, 2010. URL: http://ceur-ws.org/Vol-1176/CLEF2010wn-CLEF-IP-GuyotEt2010.pdf.
[4] C.-H. Wu, Y. Ken, T. Huang, Patent Classification System Using a New Hybrid Genetic Algorithm Support Vector Machine, Applied Soft Computing 10 (2010) 1164–1177. doi:10.1016/j.asoc.2009.11.033.
[5] S. Verberne, E. D'hondt, Patent Classification Experiments with the Linguistic Classification System LCS in CLEF-IP 2011, in: Proceedings of the International Conference of the Cross-Language Evaluation Forum (CLEF'11), CEUR-WS.org, Amsterdam, The Netherlands, 2011.
URL: http://ceur-ws.org/Vol-1177/CLEF2011wn-CLEF-IP-VerberneEt2011.pdf.
[6] K. Benzineb, J. Guyot, Automated Patent Classification, Current Challenges in Patent Information Retrieval (2011) 239–261. doi:10.1007/978-3-642-19231-9_12.
[7] S. Li, J. Hu, Y. Cui, J. Hu, DeepPatent: Patent Classification with Convolutional Neural Networks and Word Embedding, Scientometrics 117 (2018) 721–744. doi:10.1007/s11192-018-2905-5.
[8] J.-S. Lee, J. Hsiang, PatentBERT: Patent Classification with Fine-Tuning a pre-trained BERT Model, 2019. arXiv:1906.02124.
[9] S. C. Pujari, A. Friedrich, J. Strötgen, A Multi-Task Approach to Neural Multi-Label Hierarchical Patent Classification using Transformers, in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR'21), Online, 2021. doi:10.1007/978-3-030-72113-8_34.
[10] F. Benites, S. Malmasi, M. Zampieri, Classifying Patent Applications with Ensemble Methods, in: Proceedings of the 16th Annual Workshop of The Australasian Language Technology Association (ALTA'18), Dunedin, New Zealand, 2018. URL: https://aclanthology.org/U18-1012.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL'19), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[12] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for Longer Sequences, in: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS'20), Online, 2020. URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
[13] P. Ingwersen, Polyrepresentation of Information Needs and Semantic Entities: Elements of a Cognitive Theory for Information Retrieval Interaction, in: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), Springer-Verlag, Berlin, Heidelberg, 1994, pp. 101–110.
[14] I. Beltagy, K. Lo, A. Cohan, SciBERT: A Pretrained Language Model for Scientific Text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–3620. doi:10.18653/v1/D19-1371.
[15] F. Benites, TwistBytes – Hierarchical Classification at GermEval 2019: Walking the Fine Line (of Recall and Precision), in: Proceedings of KONVENS'19, German Society for Computational Linguistics & Language Technology, Erlangen-Nürnberg, Germany, 2019, pp. 326–335. URL: https://konvens.org/proceedings/2019/papers/germeval/Germeval_Task1_paper_6.pdf.
[16] D. Mollá, D. Seneviratne, Overview of the 2018 ALTA Shared Task: Classifying Patent Applications, in: Proceedings of the 16th Annual Workshop of The Australasian Language Technology Association (ALTA'18), Dunedin, New Zealand, 2018, pp. 84–88. URL: https://aclanthology.org/U18-1011.
[17] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 Deep Learning Track, 2020. arXiv:2003.07820.
[18] J. Lin, R. Nogueira, A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond, Morgan & Claypool Publishers, 2021.
doi:10.2200/S01123ED1V01Y202108HLT053.
[19] M. F. Grawe, C. A. Martins, A. G. Bonfante, Automated Patent Classification Using Word Embedding, in: Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA'17), IEEE, Cancun, Mexico, 2017, pp. 408–411. doi:10.1109/ICMLA.2017.0-127.
[20] J. Risch, R. Krestel, Domain-specific Word Embeddings for Patent Classification, Data Technologies and Applications 53 (2019) 108–122. doi:10.1108/DTA-01-2019-0002.
[21] S. Althammer, M. Buckley, S. Hofstätter, A. Hanbury, Linguistically Informed Masking for Representation Learning in the Patent Domain, in: Proceedings of the 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech'21), co-located with the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'21), Online, 2021.
[22] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable Prediction of Medical Codes from Clinical Text, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL'18), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1101–1111. doi:10.18653/v1/N18-1100.
[23] A. Rios, R. Kavuluru, Few-Shot and Zero-Shot Multi-Label Learning for Structured Label Spaces, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP'18), Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 3132–3142. doi:10.18653/v1/D18-1352.
[24] T. N. Kipf, M. Welling, Semi-Supervised Classification with Graph Convolutional Networks, in: Proceedings of the 5th International Conference on Learning Representations (ICLR'17), Toulon, France, 2017. URL: https://openreview.net/forum?id=SJU4ayYgl.
[25] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-Scale Multi-Label Text Classification on EU Legislation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19), Association for Computational Linguistics, Florence, Italy, 2019, pp. 6314–6322. doi:10.18653/v1/P19-1636.
[26] A. Johnson, T. Pollard, L. Shen, L.-w. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, R. Mark, MIMIC-III, a Freely Accessible Critical Care Database, Scientific Data 3 (2016) 160035. doi:10.1038/sdata.2016.35.
[27] D. D. Lewis, Y. Yang, T. G. Rose, F. Li, RCV1: A New Benchmark Collection for Text Categorization Research, Journal of Machine Learning Research 5 (2004) 361–397. URL: http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
[28] S. Kiritchenko, S. Matwin, A. F. Famili, Functional Annotation of Genes Using Hierarchical Text Categorization, in: Proceedings of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology, 2005. URL: https://www.site.uottawa.ca/~stan/papers/2004/p15.pdf.
[29] C. Cortes, V. Vapnik, Support-Vector Networks, Machine Learning 20 (1995) 273–297. doi:10.1007/BF00994018.
[30] F. Chollet, et al., Keras, https://github.com/fchollet/keras, 2015.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019. arXiv:1910.03771.
[32] L. Lange, H. Adel, J. Strötgen, D. Klakow, FAME: Feature-Based Adversarial Meta-Embeddings for Robust Input Representations, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP'21), Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 8382–8395. doi:10.18653/v1/2021.emnlp-main.660.
[33] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The Long-Document Transformer, 2020. arXiv:2004.05150.