=Paper=
{{Paper
|id=Vol-3604/paper6
|storemode=property
|title=Benchmarking Natural Language Processing Algorithms for Patent Summarization
|pdfUrl=https://ceur-ws.org/Vol-3604/paper6.pdf
|volume=Vol-3604
|authors=Silvia Casola,Alberto Lavelli
|dblpUrl=https://dblp.org/rec/conf/patentsemtech/CasolaL23
}}
==Benchmarking Natural Language Processing Algorithms for Patent Summarization==
Benchmarking Natural Language Processing Algorithms for Patent Summarization

Silvia Casola*, Alberto Lavelli — University of Padua, Fondazione Bruno Kessler

Abstract: The number of patent applications is enormous, and patent documents are long and complex. Methods for automatically obtaining the most salient information in a short text would thus be useful for patent professionals and other practitioners. However, patent summarization is currently under-researched; moreover, the proposed methods are difficult to compare directly, as they are generally tested on different datasets. In this paper, we benchmark several extractive, abstractive, and hybrid summarization methods on the BigPatent dataset, compare automatic metrics, and show qualitative insights.

Keywords: Summarization, Patents, Natural language processing, Natural language generation

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27th, 2023, Taipei, Taiwan. * Corresponding author: scasola@fbk.eu (S. Casola); lavelli@fbk.eu (A. Lavelli). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Patents protect inventions that their holders consider important enough to take legal action to obtain a monopoly on using, making, and selling them — and thus to profit from their ingenuity. Thus, they help in valuing intellectual work. At the same time, inventors must disclose the invention and its characteristics in detail to file a patent application: patents are thus intended to benefit society and help new knowledge spread — correcting the tendency to keep valuable technical details secret. Patents, however, are difficult to process: the number of patent applications is enormous, and patent documents are long and hard to read, rich in technical and legal language.

To this end, tools that automatically extract or generate summaries from patent documents can be particularly valuable in helping patent agents, R&D groups, and other professionals; using summaries instead of the whole document can also improve the performance of automatic processes, as shown in other domains [1, 2].

In the general domain, summarization tools and methodologies have shown promising results; applications to the patent domain are, however, still relatively limited. Moreover, while previous work has explored methods for automatically generating patent summaries, these methods are hard to compare, as no generally accepted benchmarks exist; thus, conclusions on the pros and cons of each approach are hard to draw. Even the most recent abstractive dataset presents important limitations and issues that might make direct comparisons meaningless [3].

To partially fill this gap, we benchmark existing approaches in the patent domain, specifically on the BigPatent [4] dataset. The dataset is popular in the NLP community, as patents present several challenges in terms of abstractivity, length, and language, among others; moreover, while not exempt from design issues, it is also one of the few patent benchmarks that allow for a direct comparison between approaches. We evaluate extractive, abstractive, and hybrid methods; we also explore transferring summarization methods from the scientific paper domain [5], with limited success. For each method, we discuss strengths and limitations, and provide standard summarization metrics and qualitative insights.

2. Previous work

2.1. Automatic text summarization

Methods for text summarization are generally classified into extractive, abstractive, and hybrid ones.

In extractive text summarization, a subset of sentences from the source document is chosen as the most representative, and the final summary is a simple concatenation of such sentences. Methods can be graph-based [6, 7, 8], or rely on token frequency [9] or on learned intrinsic features [10, 11, 12].

In contrast, abstractive text summarization aims at generating a new piece of text based on the source, similar to what a person would do; the summary can contain novel vocabulary or expressions. Sequence-to-sequence models [13, 14, 15, 16, 17] are popular for this task, with transformer-based ones being particularly performative [18, 19, 20, 21]. Finally, hybrid methods try to fuse both approaches, for example, by extracting and then rewriting sentences [22].

Extractive models are generally simpler than abstractive ones and require fewer computational resources and less data; however, their summaries have to contain complete sentences from the source, which often carry both central and peripheral information. Moreover, the final summary is a simple concatenation of sentences, with pending references and no discourse structure. Abstractive summaries are more similar to those written by humans: information can easily be condensed, and the generated text is much more natural and easier to read. However, abstractive models might produce non-factual information, i.e., include statements that are not in the source or that directly contradict it. See, e.g., [23] for a comprehensive survey of summarization techniques.
2.2. Patent summarization

Many traditional approaches for patent summarization have been extractive. The document is often segmented into sentences or fragments [24] and preprocessed (e.g., to keep specific parts of speech only [25, 26]); features can then be extracted. General-domain features include keywords [27], title words, cue words, and position. An ontology of technical terms might also be used [26, 28]. Domain-specific approaches [24] are often linguistically motivated. Once extracted, features are used to score the sentence relevance for the summary, either heuristically [27] or in a data-driven way [24, 29, 30, 31]. Alternative approaches use the patent discourse structure, which they prune [32].

Recently, [4] introduced the BigPatent dataset, whose associated task is that of summarizing the patent's Detailed Description into its Abstract. As the authors show, patents' Abstracts are highly abstractive — with relevant content spread throughout the input — and have many novel n-grams. The dataset has been used as a testbed for general-purpose systems [19, 33, 34, 21, 35], given the high abstractivity of its targets and the length of its inputs.

For an overview of patent summarization approaches, see [36].

3. Dataset

We use the G (Physics) subsection of the BigPatent dataset [4]. The dataset is associated with the task of generating the patent Abstract from its Description. We are aware of the practical limitations of this setting, as the Abstract contains superficial and general information, but we still consider experimenting on the dataset useful, given its popularity in the Natural Language Processing community.

The dataset exists in two versions [3]. The original version is uncased and tokenized, and its input typically contains the Detailed Description only (i.e., a subsection of the Description section). The alternative version contains the full Description with all its subsections, in the original casing. We use this version in this paper. However, we notice two main limitations. First, patents lack section headers due to the performed preprocessing; thus, any structural information is lost. Second, the input often contains the author's Summary of the Invention, which significantly simplifies the task.

To solve both issues, we download the raw data and (i) apply all original preprocessing steps, except removing the subsection headers and newlines, and (ii) remove the Summary of the Invention section by heuristically matching headers. Table 1 reports some statistics on our version of the dataset.

Table 1: Length statistics on the BigPatent/G dataset. The number of tokens, sentences, and tokens per sentence, and the compression ratio are computed per document and then averaged. The compression ratio is the ratio between the number of tokens in the source and the number of tokens in the Abstract.

            # docs              258,935
  Summary   # tokens (avg)      121.0
            # sents (avg)       3.6
            sent len (avg)      43.4
  Source    # tokens (avg)      4,893.6
            # sents (avg)       161.2
            sent len (avg)      31.3
            compression ratio   45.8
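The header matching in step (ii) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' actual code: it assumes, as the preprocessing above does, that subsection headers are fully upper-cased stand-alone lines, and the header variants matched here are our own guesses.

```python
import re

# Illustrative header variants; the paper's actual keyword list is not given.
SUMMARY_HEADER = re.compile(
    r"^\s*(BRIEF\s+)?SUMMARY(\s+OF\s+THE\s+(INVENTION|DISCLOSURE))?\s*$"
)

def drop_summary_section(description: str) -> str:
    """Remove the Summary of the Invention subsection from a Description.

    Assumes subsection headers sit on their own, fully upper-cased lines;
    everything from a matching header up to the next header is dropped.
    """
    kept, skipping = [], False
    for line in description.splitlines():
        if line.strip() and line.isupper():          # a subsection header line
            skipping = bool(SUMMARY_HEADER.match(line))
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```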
4. Evaluation Protocol

Evaluating patent summarization results is challenging. On the one hand, automatic summarization outputs (and Natural Language Generation outputs in general) are difficult to evaluate automatically, and the problem is considered open [37, 38]. While automatic metrics such as ROUGE [39] exist, they have known limitations. In the patent domain in particular, some previous work [40, 31] has anecdotally questioned the metric's validity (and its correlation with experts' opinions and practical utility), even if, to the best of our knowledge, no quantitative studies in the patent domain have been performed. More complex metrics, e.g., model-based methods [41, 42], should be fine-tuned with domain-specific data.

On the other hand, human evaluation is not easier. In fact, it is particularly hard in the patent domain for two main reasons: a) the best way to evaluate a summarization output is to read the whole source document; however, patents are extremely long and hard to read; b) patent documents and Abstracts are extremely complex and should be evaluated by legal and technical experts, but hiring such experts is very expensive and impractical in most scenarios.

Aware of these limitations, we will use two main evaluation methods:

• Automatic evaluation: we select hyperparameters and automatically evaluate outputs using ROUGE [39]. We also experimented with factuality-related metrics, e.g., QAEval [41]; however, they do not seem to adapt well to the patent domain and should be fine-tuned.

• Qualitative evaluation: we report a preliminary qualitative evaluation of a subset of candidate summaries. We consider the outputs' fluency, consistency, and similarity to the Abstract.
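For reference, the ROUGE variants used throughout this paper can be computed with Google's rouge-score package; a minimal sketch follows (the two strings are toy placeholders, not dataset examples).

```python
from rouge_score import rouge_scorer

# ROUGE-1/2/L with stemming, as is standard for summarization benchmarks.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "A display device with a simplified backplane is disclosed."      # gold Abstract
candidate = "The invention discloses a display having a simplified backplane."  # system output

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```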
5. Extractive methods

5.1. Graph-based systems

The core idea of graph-based methods is to represent the original document as a graph having sentences as nodes and their similarity as edges, and then to extract the most central sentences only.

5.1.1. TextRank [7]

TextRank uses the number of words shared by two sentences, normalized by the sentences' lengths, as its similarity metric. Edges in the complete graph are then pruned using a threshold, and the most central sentences according to PageRank [43] are extracted. We used the summa implementation (https://summanlp.github.io/textrank/). In this implementation, the user chooses the target summary length in terms of tokens, and the number of sentences that best approximates that length is extracted. We cross-validated the number of tokens and left all other parameters at their default values. Some sample outputs are in Table ??.

5.1.2. LexRank [6]

LexRank is similar in nature to TextRank, but it uses the cosine similarity of the sentences' Term Frequency–Inverse Document Frequency (TF-IDF) representations as its similarity metric. We used the sumy implementation (https://github.com/miso-belica/sumy). We validated the number of extracted sentences per patent and left all other parameters at their default values.

Both algorithms are unsupervised and can easily be used even for very long documents with no modifications. We also tried to perform experiments with PacSum [8] but found the algorithm extremely computationally demanding in our use case.

Table 2: Results using TextRank. We selected the number of extracted tokens (#T) on the validation set and ran the most promising model on the test set.

  Set   #T     ROUGE-1   ROUGE-2   ROUGE-L
  Val   50      28.20      8.52     18.08
  Val   100     37.06     11.40     21.99
  Val   150     38.60     12.33     22.33
  Val   250     35.39     12.27     20.69
  Val   500     25.74     10.37     16.11
  Val   1000    16.22      7.65     11.00
  Test  150     38.59     12.30     22.33

Table 3: Results using LexRank. We selected the number of extracted sentences (#S) on the validation set and ran the most promising model on the test set.

  Set   #S   ROUGE-1   ROUGE-2   ROUGE-L
  Val   1     26.03      8.12     17.40
  Val   2     34.72     10.93     21.14
  Val   3     37.48     12.02     21.89
  Val   4     37.76     12.40     21.71
  Val   5     36.92     12.46     21.16
  Val   6     35.62     12.36     20.48
  Test  4     37.76     12.46     21.76

Automatic evaluation. ROUGE scores are shown in Tables 2 and 3. As expected, performance is similar for the two systems, with TextRank being marginally superior. Unsurprisingly, the best-performing configurations are those that select a number of tokens or sentences similar to that of the gold standard.

Qualitative assessment. The outputs obtained using the two algorithms are relatively similar. We notice that the sentence tokenization is not always perfect: for example, the extracted summary of patent US-2005152022-A1 contains the sentence "The mixed color display [...] by the type of processes described in the aforementioned U.S. Pat. No.", where the patent number has been incorrectly treated as a stand-alone sentence. This is in accordance with previous work [44, 45], which showed that general-domain Natural Language Processing resources tend to have suboptimal performance in the patent domain and should be adapted.

Moreover, sentences naturally contain references to other parts of the original text (numerical references to the figures, for instance, are lost), e.g., "as described below" in US-2005152011-A1 or "according to claim 1" in US-9478115-B2. We also notice that the extracted sentences tend to be extremely long and to mix core and peripheral information (e.g., in parentheses). These are known limitations of naive extractive models and are very common problems in our extracted summaries. Extracted sentences do not seem too similar to each other, which is sometimes described as a limitation of graph-based systems.

Even with their limitations, the algorithms seem to perform reasonable content selection (with TextRank being superior to LexRank also from a qualitative perspective); when compared to their references, the extracted summaries often contain most of the references' core elements and, in many cases, are very similar to the reference in terms of content. This is evident in some specific cases (e.g., patents US-9478115-B2 and US-2003016244-A1) and is interesting, considering the algorithms are unsupervised.

If we assume the final target of the extracted summaries is human readers, the lack of discourse structure and the length of the extracted sentences might make the outputs too hard to understand. It might, however, be possible to use the outputs in an ad hoc interface, e.g., one where core sentences are highlighted.
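For concreteness, the two implementations can be invoked as follows, using the best validation settings from Tables 2 and 3 (150 tokens for TextRank, 4 sentences for LexRank). The input file name is a placeholder; sumy's English tokenizer relies on NLTK's punkt data being installed.

```python
from summa.summarizer import summarize as textrank_summarize
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

description = open("patent_description.txt").read()  # placeholder input

# TextRank (summa): the target length is given in tokens.
textrank_summary = textrank_summarize(description, words=150)

# LexRank (sumy): the target length is given in sentences.
parser = PlaintextParser.from_string(description, Tokenizer("english"))
lexrank_summary = " ".join(
    str(sentence) for sentence in LexRankSummarizer()(parser.document, 4)
)
```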
5.2. Latent Semantic Analysis

Latent Semantic Analysis [46] aims at exploiting the latent semantic structure of the document and extracts the sentences that best represent the most important latent topics. The algorithm decomposes the term-sentence matrix constructed from the source document using Singular Value Decomposition (SVD) [47]. The t × s terms-by-sentences matrix A is decomposed as A = UΣV^T: the original matrix is thus factored into a matrix of term distributions over latent topics (U), a diagonal matrix of topic importance (the singular values, Σ), and a matrix of topic distributions across sentences (V^T). For each of the K most salient latent topics (i.e., those corresponding to the largest singular values), the sentence with the largest index value is included in the summary [48, 10]. We use the sumy implementation, validate the number of sentences, and leave all other parameters at their default values. Some sample outputs are in Table ??.

Table 4: Results using LSA. We selected the number of extracted sentences (#S) on the validation set and ran the most promising model on the test set.

  Set   #S   ROUGE-1   ROUGE-2   ROUGE-L
  Val   1     20.09      4.38     13.54
  Val   2     28.51      6.48     17.15
  Val   3     32.37      7.70     18.43
  Val   4     33.93      8.38     18.80
  Val   5     34.28      8.78     18.70
  Val   6     34.00      9.02     18.43
  Val   7     33.30      9.14     18.05
  Val   8     32.44      9.20     17.63
  Test  5     34.26      8.72     18.66

Automatic evaluation. Table 4 shows the ROUGE scores. LSA tends to perform worse than the graph-based algorithms. In contrast to the graph-based methods, it tends to work best when extracting several short sentences.

Qualitative assessment. Even with the known limitations of extractive systems (references, structure, sentences needing compression, etc.), some reasonable content selection is performed. For example, the system often extracts the sentence that describes the invention's nature, as in "The present invention is based on the object to provide an operator system for a machine, which is ergonomic with regard to the handling thereof and offers sufficient work protection." for US-9478115-B2, or "The present invention relates to computer security and, more particularly, to an efficient method of screening untrusted digital files." for US-9208317-B2. Sentences are generally shorter than those extracted by the graph-based systems.

[31] noticed that LSA showed better quality than TextRank in the generation of patent titles. Our results do not confirm this finding for Abstract generation from the Description, as measured automatically; qualitatively, the outputs are relatively different and might be used for different purposes.
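The selection rule above can be made concrete with a small numpy sketch of the Gong & Liu criterion [48]; the toy matrix is fabricated for illustration, and sumy's LsaSummarizer implements a refined variant of this idea.

```python
import numpy as np

def lsa_select(A: np.ndarray, k: int) -> list:
    """Pick one sentence per latent topic, following Gong & Liu [48].

    A is the t x s terms-by-sentences matrix; for each of the k largest
    singular values, the sentence with the largest value in the
    corresponding row of V^T is selected.
    """
    U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(S) V^T
    return [int(np.argmax(np.abs(Vt[topic]))) for topic in range(k)]

# Toy 5-term x 4-sentence count matrix (fabricated).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 3, 1],
              [1, 0, 0, 2]], dtype=float)
print(lsa_select(A, k=2))  # indices of the two sentences to extract
```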
6. Abstractive methods

We use BART [18], a sequence-to-sequence system, as a baseline for abstractive summarization. We fine-tune a BART-base model (~140 million parameters) on the BigPatent/G dataset. We train using the Hugging Face library with early stopping on the evaluation loss (patience: 5) and the following hyperparameters: max target length: 250; number of beams: 5; evaluation steps: 10k; max steps: 500M. We leave all other parameters at their default values. Some sample outputs are in Table ??.
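A condensed sketch of this kind of fine-tuning with the Hugging Face library follows. It is an approximation under stated assumptions, not the authors' training script: the dataset identifier, field names, and truncation choices are our assumptions, and only the hyperparameters quoted above are taken from the paper.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, EarlyStoppingCallback,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

# Assumed: the Hugging Face copy of BigPatent, CPC section G.
dataset = load_dataset("big_patent", "g")

def preprocess(batch):
    inputs = tokenizer(batch["description"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["abstract"], truncation=True, max_length=250)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=["description", "abstract"])

args = Seq2SeqTrainingArguments(
    output_dir="bart-bigpatent-g",
    evaluation_strategy="steps",
    eval_steps=10_000,            # evaluation steps: 10k (from the paper)
    save_strategy="steps",
    save_steps=10_000,
    load_best_model_at_end=True,  # needed for early stopping on eval loss
    metric_for_best_model="eval_loss",
    predict_with_generate=True,
    generation_max_length=250,    # max target length (from the paper)
    generation_num_beams=5,       # number of beams (from the paper)
)

trainer = Seq2SeqTrainer(
    model=model, args=args,
    train_dataset=tokenized["train"], eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```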
Table 5: BART results on the validation and test sets.

  Set          ROUGE-1   ROUGE-2   ROUGE-L
  Validation    41.70     17.52     28.38
  Test          41.53     17.25     28.18

Automatic evaluation. Table 5 shows the results in terms of ROUGE. As expected, the results improve over all extractive systems, with an increase of almost 5 ROUGE-2 points over the best extractive system.

Qualitative assessment. Qualitatively, we notice that summaries are generally grammatical, with very rare local problems. The text is coherent and much easier to read and understand than summaries composed of extracted sentences. In all cases, summaries seem adequate and convey the main points of their gold-standard counterparts.

However, we noticed that the generated summaries are largely extractive, with few or no modifications to the sentences of the source. In the following example, the summary generated for patent US-2005152022-A1 is composed almost entirely of fragments copied verbatim from its source (the Background of the Invention subsection):

"More specifically, in one aspect this invention relates to electro-optic displays with simplified backplanes, and methods for driving such displays. In another aspect, this invention relates to electro-optic displays in which multiple types of electro-optic units are used to improve the colors available from the displays. The present invention is especially, though not exclusively, intended for use in electrophoretic displays."

While some deletion is performed, most text is directly extracted from the source. To quantify how extractive the generated summaries are with respect to the source, we compute the coverage and the density of the generated summaries, following [49]; we report them in Table 6. The extractive fragment coverage measures the proportion of tokens in the summary that are part of an extractive fragment; it roughly measures how much of a summary's vocabulary is derivative of the text. The density also takes into account the length of the extractive fragments: the higher the density, the better a summary can be described as a series of extractions. We notice that the generated summaries tend to have much longer extractive fragments than the gold standard.

Table 6: Extractivity metrics on the summaries generated by the fine-tuned BART and the select-and-rephrase models. We also report the corresponding metrics on the gold-standard summaries for comparison. The metrics are computed per document and then averaged.

                   BART    Hybrid   Gold standard
  Coverage (avg)   95.75    96.12       90.68
  Density (avg)    11.84     8.83        3.82
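The metrics in Table 6 can be computed with a compact reimplementation of the greedy fragment matching of [49]; the sketch below is simplified (plain whitespace tokenization) but follows the published definitions of coverage and density.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedy longest-match fragment extraction, after Grusky et al. [49]."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best, j = [], 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                k = 0  # length of the shared run starting at (i, j)
                while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                if k > len(best):
                    best = summary_tokens[i:i + k]
                j += max(k, 1)
            else:
                j += 1
        if best:
            fragments.append(best)
        i += max(len(best), 1)
    return fragments

def coverage_and_density(article: str, summary: str):
    a, s = article.split(), summary.split()
    frags = extractive_fragments(a, s)
    coverage = sum(len(f) for f in frags) / len(s)       # fraction of copied tokens
    density = sum(len(f) ** 2 for f in frags) / len(s)   # weighs longer copies more
    return coverage, density
```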
7. Hybrid methods

7.1. Extractive to abstractive: select and rephrase

Results in the previous sections show that graph-based extractive methods tend to select central content well but lack any discourse structure. Using BART solved some of these issues, but the model can only summarize the first part of the patent document, as its input length is limited to 1024 subtokens.

Thus, in this section, we explore a hybrid approach. We first select important sentences using an unsupervised graph-based algorithm and then rewrite the content using an abstractive system. Specifically, we use TextRank, as it performed best among the considered extractive models. We considered three extraction lengths: 1000, 500, and 250 tokens. Then, we train a BART system to rephrase the selected sentences into the target summary: we use the selected sentences as the input and the original gold standard as the target, and fine-tune the model. Some sample outputs are in Table ??.

Table 7: Results using the hybrid select-and-rephrase approach. We selected the number of extracted tokens (#T) on the validation set and ran the most promising model on the test set.

  Set   #T     ROUGE-1   ROUGE-2   ROUGE-L
  Val   1000    42.79     17.92     28.79
  Val   500     41.54     16.74     27.88
  Val   250     40.33     15.60     27.01
  Test  1000    42.47     17.74     28.59

Automatic evaluation. Table 7 reports the ROUGE scores. Extracting 1000 tokens through TextRank and then rephrasing with BART results in the highest ROUGE, surpassing the vanilla BART approach on all metrics. The obtained scores are the highest among all the extractive and abstractive models we considered.

Note that relatively good performance is obtained even when a smaller number of tokens is extracted. Extracting 500 tokens results in scores only marginally worse than those obtained by a BART model fed with the first 1024 subtokens. While the results obtained by extracting only 250 tokens score worse in terms of ROUGE, the rewriting component is crucial: an improvement of 5 ROUGE-1, 3.3 ROUGE-2, and 5.3 ROUGE-L points is observed over the results obtained using TextRank alone.

Qualitative assessment. The outputs obtained with this approach are fluent and relatively similar to those obtained with vanilla BART. The coverage and density (Table 6) also show a marginally lower extractivity of the generated summaries.
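At inference time, the pipeline reduces to two calls; the sketch below assumes a BART checkpoint already fine-tuned on (TextRank extraction, gold Abstract) pairs as described above, and the model path is a placeholder.

```python
from summa.summarizer import summarize as textrank_summarize
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder path: a BART model fine-tuned as described in Section 7.1.
tokenizer = AutoTokenizer.from_pretrained("bart-select-and-rephrase")
model = AutoModelForSeq2SeqLM.from_pretrained("bart-select-and-rephrase")

def select_and_rephrase(description: str, budget: int = 1000) -> str:
    # Stage 1: unsupervised selection of ~budget tokens with TextRank.
    extracted = textrank_summarize(description, words=budget)
    # Stage 2: abstractive rewriting of the selected sentences.
    inputs = tokenizer(extracted, truncation=True, max_length=1024,
                       return_tensors="pt")
    ids = model.generate(**inputs, max_length=250, num_beams=5)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```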
# of Abstracts, ] symbolic x coords=BACKGROUND, DESCRIPTION, +[ybar interval, mark=no, draw=black, fill = white] plot DRAWINGS, EMBODIMENTS, FIELD, OBJECTIVE, coordinates (1, 130664) (2, 113373) (3, 13537) (4, 1196) (5, 138) REFERENCES, RELATED ART, xtick=data, ] [ybar] (6, 24) (7, 3) ; coordinates (BACKGROUND, 48.66) (DESCRIPTION, 87.02) Figure 3: Number of unique subsections types to (DRAWINGS, 1.76) (EMBODIMENTS, 83.49) (FIELD, 22.51) which the Abstract aligns. (OBJECTIVE, 2.14) (REFERENCES, 2.02) (RELATED ART, 33.65) ; [ at=(plotLeft.right of south east), anchor=left of south west, scale=0.45, tick label style=font=, ylabel near ticks, xlabel style=yshift=2.2ex, xticklabel style=rotate=90, subsections are very similar and describe what the in- title=Val aligned sections, title style=yshift=-1.5ex,, symbolic vention is and its goal. While the second abstractive step x coords=BACKGROUND, DESCRIPTION, DRAWINGS, helps limit repetition, the resulting output is often short EMBODIMENTS, FIELD, OBJECTIVE, REFERENCES, and contains too little information compared to the gold RELATED ART, xtick=data, ] [ybar] coordinates (BACKGROUND, 49.31) (DESCRIPTION, 86.67) (DRAWINGS, standard. We noticed a number of issues that could make 1.86) (EMBODIMENTS, 83.59) (FIELD, 23.04) (OBJECTIVE, the transfer from the scientific publications to the patent 29.21) (REFERENCES, 1.79) (RELATED ART, 34.74) ; domain unsuccessful: Figure 1: Percentage of subsections that, when present, are • Less predictable structure and session headers: aligned to at least one sentence in the Abstract in the train Scientific papers have a very coherent structure (left) and validation (right) sets. as they tend to roughly follow a fixed schema (e.g., Introduction, Previous Work, Method, Con- Model R1 R2 RL clusions), with each section having a clear fixed BART 35.00 15.74 26.63 role. While, on a superficial level, patent docu- BART(+ subs. type) 33.28 14.81 25.66 ments have a similar structure with sections and Table 9 subsections, they are less coherent. As Table 8 Model trained on generating the Abstract sentence(s) given shows, the subsections of the Description tend to the subsection. We also experimented with prepending the vary. Moreover, the role of each subsection is less subsection text with its type. determined. • Less compositional Abstracts: An analysis of the Model R1 R2 RL Abstracts’ compositionality shows that many of DANCER (preselection) 38.73 16.03 25.63 the sentences in the Abstract align with the same DANCER (best aligned, M=1) 27.39 10.64 19.83 patent subsections. Figure 3 represents the num- DANCER (best M, M=3) 40.70 16.45 25.08 ber of unique sentences to which each Abstract DANCER (all) 40.68 16.38 25.90 aligns. Note most patent Abstracts only align to DANCER + abstractive 38.88 15.89 26.99 one or two different subsections. Moreover, a Table 10 qualitative analysis of the Abstract shows that Results on the validation set. while paper abstracts tend to follow a fixed struc- ture (first describing the background, then the [ xlabel= Number of sections, ylabel= ROUGE-L, goal and methods, then the results and conclu- width=7.7cm,height=7cm] [color=black,mark=x] coordinates sions), patent Abstracts seem to lack the compo- (1, 17.95) (2, 22.46) (3, 24.42) (4, 24.40) (5, 24.39) (6, 24.39) (7, sitional nature of scientific papers. 
Table 9: Model trained on generating the Abstract sentence(s) given the subsection. We also experimented with prepending the subsection type to the subsection text.

  Model                  R1      R2      RL
  BART                  35.00   15.74   26.63
  BART (+ subs. type)   33.28   14.81   25.66

Table 10: DANCER results on the validation set.

  Model                         R1      R2      RL
  DANCER (preselection)        38.73   16.03   25.63
  DANCER (best aligned, M=1)   27.39   10.64   19.83
  DANCER (best M, M=3)         40.70   16.45   25.08
  DANCER (all)                 40.68   16.38   25.90
  DANCER + abstractive         38.88   15.89   26.99

[Figure 1: Percentage of subsections that, when present, are aligned to at least one sentence in the Abstract, in the train (left) and validation (right) sets. In both splits, DESCRIPTION (~87%) and EMBODIMENTS (~83%) subsections align most often, while DRAWINGS and REFERENCES align in under 2% of cases.]

[Figure 2: ROUGE-L results as a function of the number of subsections used for the generation. ROUGE-L grows from ~18 with one subsection to ~24.4 with three, then plateaus.]

[Figure 3: Number of unique subsection types to which the Abstract aligns. Most Abstracts align with only one (130,664) or two (113,373) distinct subsections; three or more are rare.]

Automatic evaluation. Table 10 reports the results on the validation set. We report the results obtained by generating from the pre-selection, from the best-aligned section only (as a baseline), with the best-performing number of sections (Figure 2 shows ROUGE-L as a function of the number of summarized subsections), and by summarizing all sections. We also report the results after the second abstractive step. Note that none of the configurations surpasses the simple BART baseline.

Qualitative analysis. Inspecting the outputs, we noticed that many of the sentences generated from different subsections are very similar, describing what the invention is and its goal. While the second abstractive step helps limit repetition, the resulting output is often short and contains too little information compared to the gold standard. We noticed a number of issues that could make the transfer from scientific publications to the patent domain unsuccessful:

• Less predictable structure and section headers: Scientific papers have a very coherent structure, as they tend to roughly follow a fixed schema (e.g., Introduction, Previous Work, Method, Conclusions), with each section having a clear, fixed role. While, on a superficial level, patent documents have a similar structure with sections and subsections, they are less coherent. As Table 8 shows, the subsections of the Description tend to vary. Moreover, the role of each subsection is less determined.

• Less compositional Abstracts: An analysis of the Abstracts' compositionality shows that many of the sentences in an Abstract align with the same patent subsection. Figure 3 represents the number of unique subsections to which each Abstract aligns: note that most patent Abstracts align with only one or two different subsections. Moreover, a qualitative analysis of the Abstracts shows that, while paper abstracts tend to follow a fixed structure (first describing the background, then the goal and methods, then the results and conclusions), patent Abstracts seem to lack the compositional nature of scientific papers. The lack of a fixed flow in the Abstract might also explain the relatively low results obtained by the abstractive model when generating the Abstract sentence(s) from the original subsections: as the alignment is more random, finding a pattern and correctly generating the aligned sentences is more challenging.

8. Conclusions

In this paper, we have benchmarked several extractive, abstractive, and hybrid methods on the BigPatent/G dataset.

Among extractive systems, we found that graph-based ones seem appropriate for content selection and perform relatively well in both metrics and outputs. However, the extracted outputs are subject to all the limitations of extractive summarization, with dangling references being particularly common. The length of the sentences, the dangling references, and the lack of discourse structure make the outputs challenging to process for humans and possibly machines.

Among the abstractive approaches, we have analyzed BART and found that it performs best in automatic metrics compared to extractive algorithms. We have also found that the produced outputs are, in fact, not very abstractive with respect to the input, with long chunks of text identical to input passages; the model seems, however, very good at removing non-central content from single sentences, which extractive systems are natively unable to do. In future work, we plan to explore more powerful abstractive models, including those in the GPT family [50, 51, 52].

We have considered a simple select-and-rewrite approach, which obtained the best automatic metrics. We have also tried to adapt DANCER, initially designed for scientific articles, to the patent domain. However, we have found that patents are more variable in the sections they contain and in the sections' content itself, and their Abstracts tend to be less compositional than those of papers. Thus, the approach was not particularly successful when transferred to the patent domain.

Our setting, however, has several limitations. First, the BigPatent dataset has known issues, and the Abstract is not regarded as the best target for summarization in the patent community, as it contains superficial information rather than the core invention nature. Second, we did not have the opportunity to collaborate with legal and technical experts to evaluate our outputs.

We believe that future work on patent summarization should tackle a number of open problems. First, we hope that this work will motivate the creation of better benchmarks, which can be shared among researchers and practitioners interested in patent summarization. Second, we hope that the design of such a benchmark can be made in conjunction with patent experts and industrial practitioners to ensure that it can be practically useful; while it is likely not practical to ask experts to write gold-standard summaries, there is room for improvement over the current setting. Third, the validity of the standard evaluation metrics in the patent domain should be measured based on experts' evaluation of the outputs. Finally, the factual accuracy of abstractive methods — which is particularly important in a legal and technical domain — should be better investigated.

Acknowledgments

We acknowledge the support of the PNRR project FAIR — Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU.
References

[1] I. Mani, D. House, G. Klein, L. Hirschman, T. Firmin, B. Sundheim, The TIPSTER SUMMAC text summarization evaluation, in: Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway, 1999, pp. 77–85. https://aclanthology.org/E99-1011
[2] T. Sakai, K. Sparck-Jones, Generic summaries for indexing in information retrieval, in: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01), ACM, New York, NY, USA, 2001, pp. 190–198. doi:10.1145/383952.383987
[3] S. Casola, A. Lavelli, H. Saggion, What's in a (dataset's) name? The case of BigPatent, in: Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 399–404. https://aclanthology.org/2022.gem-1.34
[4] E. Sharma, C. Li, L. Wang, BIGPATENT: A large-scale dataset for abstractive and coherent summarization, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 2204–2213. doi:10.18653/v1/P19-1212
[5] A. Gidiotis, G. Tsoumakas, A divide-and-conquer approach to the summarization of long documents, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 3029–3040. doi:10.1109/TASLP.2020.3037401
[6] G. Erkan, D. R. Radev, LexRank: Graph-based lexical centrality as salience in text summarization, Journal of Artificial Intelligence Research 22 (2004) 457–479.
[7] R. Mihalcea, P. Tarau, TextRank: Bringing order into text, in: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004, pp. 404–411. https://aclanthology.org/W04-3252
[8] H. Zheng, M. Lapata, Sentence centrality revisited for unsupervised summarization, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 6236–6247. doi:10.18653/v1/P19-1628
[9] A. Nenkova, L. Vanderwende, The impact of frequency on summarization, Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101 (2005).
[10] J. Steinberger, K. Jezek, et al., Using latent semantic analysis in text summarization and summary evaluation, Proc. ISIM 4 (2004) 8.
[11] Y. Liu, M. Lapata, Text summarization with pretrained encoders, ArXiv abs/1908.08345 (2019).
[12] M. Zhong, P. Liu, Y. Chen, D. Wang, X. Qiu, X. Huang, Extractive summarization as text matching, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 6197–6208. doi:10.18653/v1/2020.acl-main.552
[13] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), MIT Press, Cambridge, MA, USA, 2014, pp. 3104–3112.
[14] A. M. Rush, S. Chopra, J. Weston, A neural attention model for abstractive sentence summarization, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 379–389. doi:10.18653/v1/D15-1044
[15] S. Chopra, M. Auli, A. M. Rush, Abstractive sentence summarization with attentive recurrent neural networks, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, 2016, pp. 93–98. doi:10.18653/v1/N16-1012
[16] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, B. Xiang, Abstractive text summarization using sequence-to-sequence RNNs and beyond, in: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, 2016, pp. 280–290. doi:10.18653/v1/K16-1028
[17] A. See, P. J. Liu, C. D. Manning, Get to the point: Summarization with pointer-generator networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 1073–1083. doi:10.18653/v1/P17-1099
[18] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 7871–7880. doi:10.18653/v1/2020.acl-main.703
[19] J. Zhang, Y. Zhao, M. Saleh, P. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, in: Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 11328–11339. http://proceedings.mlr.press/v119/zhang20ae.html
[20] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, ArXiv abs/2004.05150 (2020).
[21] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297.
[22] S. Huang, R. Wang, Q. Xie, L. Li, Y. Liu, An extraction-abstraction hybrid approach for long document summarization, in: 2019 6th International Conference on Behavioral, Economic and Socio-Cultural Computing (BESC), 2019, pp. 1–6. doi:10.1109/BESC48373.2019.8962979
[23] W. S. El-Kassas, C. R. Salama, A. A. Rafea, H. K. Mohamed, Automatic text summarization: A comprehensive survey, Expert Systems with Applications 165 (2021) 113679. doi:10.1016/j.eswa.2020.113679
[24] J. Codina-Filbà, N. Bouayad-Agha, A. Burga, G. Casamayor, S. Mille, A. Müller, H. Saggion, L. Wanner, Using genre-specific features for patent summaries, Information Processing & Management 53 (2017) 151–174. doi:10.1016/j.ipm.2016.07.002
[25] A. Trappey, C. Trappey, B. H. Kao, Automated patent document summarization for R&D intellectual property management, in: 2006 10th International Conference on Computer Supported Cooperative Work in Design, 2006, pp. 1–6.
[26] A. J. C. Trappey, C. V. Trappey, C.-Y. Wu, A semantic based approach for automatic patent document summarization, in: R. Curran, S.-Y. Chou, A. Trappey (Eds.), Collaborative Product and Service Life Cycle Management for a Sustainable World, Springer, London, 2008, pp. 485–494.
[27] Y.-H. Tseng, C.-J. Lin, Y.-I. Lin, Text mining techniques for patent analysis, Information Processing & Management 43 (2007) 1216–1247. doi:10.1016/j.ipm.2006.11.011
[28] A. Trappey, C. Trappey, C.-Y. Wu, Automatic patent document summarization for collaborative knowledge systems and services, Journal of Systems Science and Systems Engineering 18 (2009) 71–94. doi:10.1007/s11518-009-5100-7
[29] K. Girthana, S. Swamynathan, Query oriented extractive-abstractive summarization system (QEASS), in: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (CoDS-COMAD '19), ACM, New York, NY, USA, 2019, pp. 301–305. doi:10.1145/3297001.3297046
[30] K. Girthana, S. Swamynathan, Query-oriented patent document summarization system (QPSS), in: M. Pant, T. K. Sharma, O. P. Verma, R. Singla, A. Sikander (Eds.), Soft Computing: Theories and Applications, Springer, Singapore, 2020, pp. 237–246.
[31] C. M. de Souza, M. E. Santos, M. R. G. Meireles, P. E. M. Almeida, Using summarization techniques on patent database through computational intelligence, in: P. Moura Oliveira, P. Novais, L. P. Reis (Eds.), Progress in Artificial Intelligence, Springer International Publishing, 2019, pp. 508–519.
[32] N. Bouayad-Agha, G. Casamayor, G. Ferraro, S. Mille, V. Vidal, L. Wanner, Improving the comprehension of legal documentation: The case of patent claims, in: Proceedings of the International Conference on Artificial Intelligence and Law, 2009, pp. 78–87. doi:10.1145/1568234.1568244
[33] J. Pilault, R. Li, S. Subramanian, C. Pal, On extractive and abstractive neural document summarization with transformer language models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 2020, pp. 9308–9319. doi:10.18653/v1/2020.emnlp-main.748
[34] J. He, W. Kryściński, B. McCann, N. Rajani, C. Xiong, CTRLsum: Towards generic controllable text summarization, arXiv preprint arXiv:2012.04281 (2020).
[35] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, LongT5: Efficient text-to-text transformer for long sequences, in: Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, United States, 2022, pp. 724–736. https://aclanthology.org/2022.findings-naacl.55
[36] S. Casola, A. Lavelli, Summarization, simplification, and generation: The case of patents, Expert Systems with Applications 205 (2022) 117627. doi:10.1016/j.eswa.2022.117627
[37] A. Celikyilmaz, E. Clark, J. Gao, Evaluation of text generation: A survey, 2020. arXiv:2006.14799
[38] E. Lloret, L. Plaza, A. Aker, The challenging task of summary evaluation: An overview, Language Resources and Evaluation 52 (2018) 101–148. doi:10.1007/s10579-017-9399-2
[39] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out (post-conference workshop of ACL 2004), Barcelona, Spain, 2004, pp. 74–81.
[40] J.-S. Lee, Controlling patent text generation by structural metadata, ACM, New York, NY, USA, 2020, pp. 3241–3244. doi:10.1145/3340531.3418503
[41] A. Wang, K. Cho, M. Lewis, Asking and answering questions to evaluate the factual consistency of summaries, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 2020, pp. 5008–5020. doi:10.18653/v1/2020.acl-main.450
[42] A. Pu, H. W. Chung, A. P. Parikh, S. Gehrmann, T. Sellam, Learning compact metrics for MT, in: Proceedings of EMNLP, 2021.
[43] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web, in: WWW 1999, 1999.
[44] A. Burga, J. Codina, G. Ferraro, H. Saggion, L. Wanner, The challenge of syntactic dependency parsing adaptation for the patent domain, in: ESSLLI-13 Workshop on Extrinsic Parse Improvement, 2013.
[45] L. Andersson, M. Lupu, A. Hanbury, Domain adaptation of general natural language processing tools for a patent claim visualization system, in: M. Lupu, E. Kanoulas, F. Loizides (Eds.), Multidisciplinary Information Retrieval, Springer, Berlin, Heidelberg, 2013, pp. 70–82.
[46] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. A. Harshman, Indexing by latent semantic analysis, Journal of the Association for Information Science and Technology 41 (1990) 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[47] V. Klema, A. Laub, The singular value decomposition: Its computation and some applications, IEEE Transactions on Automatic Control 25 (1980) 164–176. doi:10.1109/TAC.1980.1102314
[48] Y. Gong, X. Liu, Generic text summarization using relevance measure and latent semantic analysis, in: Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
[49] M. Grusky, M. Naaman, Y. Artzi, Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 708–719. doi:10.18653/v1/N18-1065
[50] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, Technical Report, 2018.
[51] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, Technical Report, 2019.
[52] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing Systems 33 (NeurIPS 2020), virtual, 2020. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html