<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">PatentMatch: A Dataset for Matching Patent Claims &amp; Prior Art</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Julian</forename><surname>Risch</surname></persName>
							<email>julian.risch@hpi.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Hasso Plattner Institute University of Potsdam</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicolas</forename><surname>Alder</surname></persName>
							<email>nicolas.alder@student.hpi.de</email>
							<affiliation key="aff1">
								<orgName type="institution">Hasso Plattner Institute University of Potsdam</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christoph</forename><surname>Hewel</surname></persName>
							<email>c.hewel@bettenpat.com</email>
							<affiliation key="aff2">
								<orgName type="department">BETTEN &amp; RESCH Patent</orgName>
								<orgName type="institution">Rechtsanwälte PartGmbB Munich</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ralf</forename><surname>Krestel</surname></persName>
							<email>ralf.krestel@hpi.de</email>
							<affiliation key="aff3">
								<orgName type="institution">Hasso Plattner Institute University of Potsdam</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">PatentMatch: A Dataset for Matching Patent Claims &amp; Prior Art</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CC1376495E2C4B95C6C4A2B572BD00D0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>patent documents</term>
					<term>document classification</term>
					<term>dataset</term>
					<term>prior art search</term>
					<term>dense passage retrieval</term>
					<term>deep learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Patent examiners need to solve a complex information retrieval task when they assess the novelty and inventive step of claims made in a patent application. Given a claim, they search for prior art, which comprises all relevant publicly available information. This time-consuming task requires a deep understanding of the respective technical domain and the patent-domain-specific language. For these reasons, we address the computer-assisted search for prior art by creating a training dataset for supervised machine learning called PatentMatch. It contains pairs of claims from patent applications and semantically corresponding text passages of different degrees from cited patent documents. Each pair has been labeled by technically-skilled patent examiners from the European Patent Office. Accordingly, the label indicates the degree of semantic correspondence (matching), i.e., whether the text passage is prejudicial to the novelty of the claimed invention or not. Preliminary experiments using a baseline system show that PatentMatch can indeed be used for training a binary text pair classifier and a dense passage retriever on this challenging information retrieval task. The dataset is available online: https://hpi.de/naumann/s/patentmatch.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>experts, but also illustrates how experts solve this very complex IR-problem.</p><p>In general, a patent entitles the patent owner to exclude others from making, using, or selling an invention. For this purpose, the patent comprises so-called patent claims (usually at the end of a technical description of the invention). These claims legally specify the scope of protection of the invention. To be even more precise, the legally relevant definition can be found in the independent claims, i.e., usually in claim No. 1. Said claim 1 may be only a few lines long and may comprise only rather generalized terms, in order to keep the scope of protection as broad as possible. There may be more than one independent claim, e.g., an independent system claim 1 and an independent method claim 15. The further claims are so-called dependent claims, i.e., they depend on an independent claim. This dependency is explicitly defined in the preamble of the dependent claim, e.g., by starting with: "2. The system according to claim 1, wherein ...". The function of dependent claims is to define optional features of the invention, which are preferable but not mandatory for the invention (e.g., "... wherein the light source is an OLED").</p><p>In order to obtain a patent, it is required that the invention as defined in the claims is new and inventive over prior art <ref type="bibr" target="#b18">[19]</ref>. A patent application therefore has to be filed at a patent office where it is examined for novelty and inventive step by a technically skilled examiner. In case a patent is granted, said patent is published again as a separate patent document. For this reason, there exists a huge corpus of publicly available patent documents, i.e., published patent applications and patents.</p><p>As a further consequence of this huge patent literature corpus, the examiners usually focus their prior art search on relevant patent documents. 
Accordingly, they try to retrieve at least one older patent document that discloses the complete invention as defined in the claims, in particular in independent claim 1. In other words, such a novelty-destroying document must comprise passages that semantically match the definition of claim 1 of the examined patent application. Said novelty-destroying document is manually marked by an expert as "X" document in the search report issued by the patent office <ref type="bibr" target="#b16">[17]</ref>. Any retrieved document that does not disclose the complete invention defined in claim 1 but at least renders it obvious is marked as "Y" document in the search report. Further found documents that form technological background but are not relevant to the novelty or inventive step of claim 1 are marked as "A" documents. As a consequence, only one retrieved "X" document or "Y" document is enough to refuse claim 1 and hence the patent application. Due to this circumstance, the search task is focused on precision rather than on recall. Usually, a search report issued for an examined patent application only comprises a few (e.g., 5) cited patent documents, wherein (as far as possible) at least one document is novelty-destroying (marked as "X" document).</p><p>Advantageously, a search report issued by the European Patent Office (EPO) not only cites patent documents deemed relevant by an expert but also indicates for each cited document which paragraphs within the document are found to be relevant for the examined claims. Figure <ref type="figure" target="#fig_0">1</ref> exemplifies such a search report. The EPO search report annotates each claim of the examined patent application with specific text passages (i.e., paragraphs) of a cited document. The EPO calls this rich-format citation. Given the application with the filing number EP18214053, a patent officer cited prior art with the publication number EP1351172A1. 
For example, paragraphs 27-28, 60 and 70-74 are relevant passages for assessing the novelty of claims 1 and 3 to 9 (marked by an "X"). Furthermore, said paragraphs are also relevant for the inventive step of claim 2 (marked by a "Y"). The search report also lists which search terms were used. In this case, it is the IPC subclass G06K.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Finding relevant prior art is a hard and cumbersome task even for well-trained experts <ref type="bibr" target="#b9">[10]</ref>. Due to the large volume of literature to be considered as well as the required domain knowledge, patent officers rely on modern information systems to support them with their task <ref type="bibr" target="#b17">[18]</ref>. Nevertheless, the outcome of a prior art search, either to check for patentability or validity of a patent, remains imperfect and biased by the patent examiner and her search strategy <ref type="bibr" target="#b14">[15]</ref>. In addition, different patent offices can reach different conclusions for the same search <ref type="bibr" target="#b18">[19]</ref>. With this paper we hope to open the door to qualitatively and systematically analyse the search practice, particularly at the European Patent Office.</p><p>Traditionally, related work at the intersection of information retrieval and patent analysis aims to support the experts by automatically identifying technical terms in patent documents <ref type="bibr" target="#b10">[11]</ref> or keywords that relate to the novelty of claims in applications <ref type="bibr" target="#b23">[24]</ref>. A challenge that all natural language processing applications in the patent domain face is to cope with the legal jargon and specialized terminology, which led to the use of patent-domain-specific word embeddings in deep learning approaches <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b21">22]</ref>. Further, patent classification is the most prominent task for the application of natural language processing in this domain, with supervised deep learning approaches outperforming all other methods <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b21">22]</ref>. 
Large amounts of labeled training data are available for this task because every published patent document and application is classified according to standardized, hierarchical classification schemes.</p><p>Prior art search is a document retrieval task where the goal is to find related work for a given patent document or application. Formulating the corresponding search query is a research challenge typically addressed with keyword extraction <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b26">27]</ref>. Further, there is research on tools to support expert users in defining search queries <ref type="bibr" target="#b22">[23]</ref> or non-expert users in exploring the search space step by step <ref type="bibr" target="#b13">[14]</ref>. The task that we focus on in this paper is patent passage retrieval. Given a query passage, e.g., a claim, the task is to find relevant passages in a corpus of text documents to, e.g., decide on the novelty of the claim. In the CLEF-IP series of shared tasks, there was a claims to passage task in 2012 <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b20">21]</ref>. The shared task dataset contains 2.3 million documents and 2700 relevance judgements of passages for training, which were manually extracted from search reports. The passages are contained in "X" documents and "Y" documents referenced by patent examiners in the search reports. Similar passage retrieval tasks can be found in other domains as well, e.g., passage retrieval for question answering within Wikipedia <ref type="bibr" target="#b2">[3]</ref>. To the best of our knowledge, the dense passage retrieval (DPR) model for open-domain question answering by Karpukhin et al. <ref type="bibr" target="#b11">[12]</ref> has not been used in the patent domain so far and we are the first to train a DPR model on patent data, which we describe in one of our preliminary experiments. 
Research in the patent domain is limited for three reasons: patent-domain-specific knowledge is necessary to understand (1) different types of documents (patent applications, granted patents, search reports), (2) different classification schemes (IPC, CPC, USPC) and (3) the steps of the patenting process (filing, examination, publication, granting, opposition).</p><p>In this paper, we present PatentMatch, a dataset of claims from patent applications matched with paragraphs from prior art, e.g., published patent documents. Professional patent examiners labeled the claims with references to paragraphs that are prejudicial to the novelty of the claim ("X" documents, positive samples) or that are not prejudicial but represent merely technical background ("A" documents, negative samples). We collected these labels from search reports created by patent examiners, resolved the claims and paragraphs referenced therein, and extracted the corresponding text passages from the patent documents. This procedure resulted in a dataset of six million examined claims and semantically corresponding (matching) text passages that are prejudicial or not prejudicial to the novelty of the claims. The remainder of this paper is structured as follows: Section 3 describes the data collection and processing steps in detail and provides dataset examples and statistics. Section 4 outlines research tasks that could benefit from the dataset and presents two preliminary experiments for two of these tasks. Finally, Section 5 concludes with a discussion of the potential impact of the presented dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PATENTMATCH DATASET</head><p>The basis of our dataset is the EP full-text data for text analytics by the EPO. <ref type="foot" target="#foot_1">1</ref> It contains the XML-formatted full-texts and publication meta-data of all filed patent applications and published patent documents processed by the EPO since 1978. From 2012 onwards, the search reports for all patent applications are also included. In these reports, patent examiners cite paragraphs from prior art documents if these paragraphs are relevant for judging the novelty and inventive step of an application claim. Although there are no search reports available for applications filed before 2012, we do not discard these older applications because their corresponding published patent documents are frequently referenced as prior art. We use all available search reports to create a dataset of claims of patent applications matched with prior art, more precisely, paragraphs of cited "X" documents and "A" documents. Accordingly, "X" citations represent positive samples and "A" citations represent negative samples. These two categories "X" and "A" differ significantly regarding the level of semantic relevance of a given citation for a given claim. "Y" citations are not used in this work, as they seem too close to "X" citations with regard to their level of semantic relevance to generate a good training signal.</p><p>Our data processing pipeline uses Elasticsearch for storing and searching through this large corpus of about 210GB of text data. As a first data preparation step, an XML parser extracts the full text and meta-data from the raw, multi-nested XML files. 
Further, for each citation within a search report, it extracts claim number, patent application ID, date, paragraph number, and the type of the references, i.e., "X" document or "A" document.</p><p>Since the search reports were written in a rather systematic, but still unstructured and non-consistent way, a second parsing step standardizes the data format of paragraph references. References like "[paragraph 23]-[paragraph 28]" or "0023 -28" are converted to complete enumerations of paragraph numbers "[23, 24, 25, 26, 27, 28]". Furthermore, references by patent examiners comprise not only text paragraphs but also figures, figure captions, or the whole document. In our standardization process, all references that do not resolve to text paragraphs are discarded.</p><p>In the final step, we use the index of our Elasticsearch document database to resolve the referenced paragraph numbers (together with the corresponding document identifiers) to the paragraph texts. Similarly, we resolve the claim texts corresponding to the claim numbers. Thereby, we obtain a dataset that consists of a total of 6,259,703 samples, where each sample contains a claim text, a referenced paragraph text, and a label indicating one of the two types of reference: "X" document (positive sample) or "A" document (negative sample). Table <ref type="table" target="#tab_0">1</ref> lists statistics of the full dataset and Figure <ref type="figure">2</ref> exemplifies a claim text and cited paragraph texts of positive and negative samples.</p><p>We also provide two variations of the data for simplified usage in machine learning scenarios. The first variation balances the label distributions by downsampling the majority class. 
For each sample with a claim text and a referenced paragraph labeled "X", there is also a sample with the same claim text with a different referenced paragraph labeled "A" and vice versa. This balanced training set consists of 347,880 samples. In this version of the dataset, different claim texts can have different numbers of references. The number of "X" and "A" labels is only balanced for each claim text itself. The second variation balances not only the label distribution but also the distribution of claim texts. Further downsampling ensures that there is exactly one sample with label "X" and one sample with label "A" for each claim text. As a result, every claim in the dataset occurs in exactly two samples. This restriction reduces the dataset to 25,340 samples.</p><p>The PatentMatch dataset is published online with example code that shows how to use it for supervised machine learning, and a description of the data collection and preparation process. <ref type="foot" target="#foot_2">2</ref> As the underlying raw data has been released by the EPO under Creative Commons Attribution 4.0 International Public License, we also release our dataset under the same license. <ref type="foot" target="#foot_3">3</ref> To foster comparable evaluation settings in future work, we separated it into a training set (80%) and a test set (20%) with a time-wise split based on the application filing date: All applications contained in the training set have an earlier filing date than all applications contained in the test set (March 29th, 2017).</p><figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: An excerpt from a search report showing a claim and cited paragraphs. The "X" document (positive sample) is novelty-destroying for the claim while the "A" document (negative sample) is not novelty-destroying and merely constitutes technical background. Claim 1 of application EP17862550: An engine for a ship, comprising: …an air supply apparatus supplying the air to the cylinder wherein the air supply apparatus includes an auxiliary air supply member … Paragraphs 35-37 of "X" document US5271358A: …the engine system 10 includes a second gaseous injector 57 in fluid communication with the cylinder bore 16 through fuel injection port 27 in addition to the gaseous fuel injector 56… Paragraphs 31-32 of "A" document US2016298554A1: …gaseous fuel may be injected from gaseous fuel injector 38 while the air intake ports 32 are open…</figDesc></figure></div>
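<div xmlns="http://www.tei-c.org/ns/1.0"><p>The standardization of paragraph references described in Section 3 can be illustrated with a minimal Python sketch. It is not the code used to build PatentMatch: the function name expand_paragraph_reference, the regular expression, the rule that an abbreviated range end such as "0123 -28" completes to 128, and the crude rule that drops any reference mentioning a figure are all assumptions made for illustration.</p>
<p><code lang="python"><![CDATA[
```python
import re
from typing import List

# One paragraph number, optionally followed by a dash and a second number.
# Non-digit filler such as "]" or "[paragraph " may surround the dash.
_RANGE = re.compile(r"(\d+)(?:\D*?[-\u2013]\D*?(\d+))?")


def expand_paragraph_reference(reference: str) -> List[int]:
    """Expand a raw paragraph reference into a sorted list of paragraph numbers.

    Hypothetical helper: references that mention a figure, or that contain
    no digits at all (e.g. "whole document"), yield an empty list, mirroring
    the idea that references not resolving to text paragraphs are discarded.
    """
    if re.search(r"fig", reference, flags=re.IGNORECASE):
        return []
    numbers = set()
    for match in _RANGE.finditer(reference):
        start = int(match.group(1))
        end = start if match.group(2) is None else int(match.group(2))
        if end < start:
            # Assumption: a shorter range end abbreviates the trailing
            # digits of the start, e.g. "123-28" means 123 to 128.
            start_digits, end_digits = str(start), str(end)
            if len(end_digits) < len(start_digits):
                end = int(start_digits[: -len(end_digits)] + end_digits)
            if end < start:
                end = start  # give up on the range, keep the start only
        numbers.update(range(start, end + 1))
    return sorted(numbers)
```
]]></code></p>
<p>Under these assumptions, both "[paragraph 23]-[paragraph 28]" and "0023 -28" expand to the paragraph numbers 23 through 28, while a reference such as "figure 3" is discarded. A production parser would likely need to keep the paragraph part of mixed references instead of dropping them entirely.</p></div>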
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">PRELIMINARY EXPERIMENTS</head><p>Modern information retrieval systems do not solely rely on matching keywords from queries with documents. Especially for complex information needs, semantic knowledge needs to be incorporated <ref type="bibr" target="#b4">[5]</ref>. With the rise of deep learning models, as well as word and document embeddings, improvements in grasping the semantic meaning of queries and documents have been made <ref type="bibr" target="#b1">[2]</ref>. A number of related tasks aim at finding semantically related information, making use of advanced semantic representations <ref type="bibr" target="#b5">[6]</ref> and intelligent retrieval models <ref type="bibr" target="#b19">[20]</ref>. Passage retrieval <ref type="bibr" target="#b12">[13]</ref>, document clustering <ref type="bibr" target="#b8">[9]</ref>, and question answering <ref type="bibr" target="#b27">[28]</ref> all rely on identifying semantically related information.</p><p>Addressing a first exemplary task, we conducted preliminary experiments on text pair classification with Bidirectional Encoder Representations from Transformers (BERT) <ref type="bibr" target="#b3">[4]</ref> as a baseline system. The text pair classification uses the same neural network architecture as the next sentence prediction task: Given a pair of sentences, the next sentence prediction task is to predict if the second sentence is a likely continuation of the first sentence. In our text pair classification scenario, given a claim text and a cited paragraph text, the task is to decide whether the paragraph corresponds to an "X" document (positive sample) or an "A" document (negative sample). To make this decision, the model needs to assess the novelty of the claim in comparison to the paragraph. To this end, it transforms the input text to sub-word tokens and transforms them to their embedding representations. 
These representations pass through 12 layers of bidirectional Transformers <ref type="bibr" target="#b25">[26]</ref> and the final hidden state of the special token [CLS] encodes the output class label. Our implementation uses the FARM framework and the pre-trained bert-base-uncased model. <ref type="foot" target="#foot_4">4</ref> The test set accuracy on the balanced variation of the data is 54%. On the second variation of the data, which contains exactly one "X" document citation and one "A" document citation per claim, the accuracy on the test set is 52%. For both variations, the accuracy improvements per training epoch are small and the validation loss stops decreasing after training for 6 epochs. It is not surprising that the task poses a difficult challenge and that a fine-tuned BERT model is only slightly better than random guessing. The complex linguistic patterns, the legal jargon, and the patent-domain-specific language make it nearly impossible for laymen to solve this task manually and therefore make it an interesting research challenge for future work.</p><p>A second exemplary task is dense passage retrieval (DPR). Inspired by the work by Karpukhin et al. <ref type="bibr" target="#b11">[12]</ref>, we transform the PatentMatch dataset into the DPR format used for open-domain question answering. Dense passage retrieval is the first step of open-domain question answering and the DPR format contains lists of questions, where each question is accompanied by the correct answer, a passage that contains the answer (positive context), and a passage that does not contain the answer but is still semantically similar to the question (hard negative context). We apply this format to our scenario of matching patent claims with passages from prior art, such that the claim represents the question and the paragraph text from the referenced "X" document is the positive context and the paragraph text from the referenced "A" document is the hard negative context. 
This version of the PatentMatch dataset contains exactly one sample with label "X" and one sample with label "A" for each claim text, which results in about 12500 triples (claim, positive, hard negative) in DPR format.</p><p>Using the dataset in DPR format, we train a DPR model, which comprises two BERT models (bert-base-uncased) <ref type="bibr" target="#b3">[4]</ref>. One model encodes patent claims while the other encodes paragraph texts from "X" and "A" documents. As in the original DPR paper <ref type="bibr" target="#b11">[12]</ref>, we leverage in-batch negatives for training, which means that given a batch with claims and paragraph texts from corresponding "X" and "A" documents as positive and hard negative contexts, we use the positive context of each claim as an additional negative context for all other claims in the same batch. Using a batch size of 8, there are 8 claims in each batch, 8 positive contexts, 8 hard negative contexts, and implicitly also 7 in-batch (non-hard) negative contexts for each claim. The learning rate is set to 10⁻⁵ using Adam, linear scheduling with warm-up, and a dropout rate of 0.1. Due to memory constraints on the GPU, we limit the claim texts to 200 tokens and the paragraph texts to 256 tokens. In our preliminary experiment, the model achieves an average in-batch rank of 1.42 after training for 5 epochs, which means that the positive context is ranked between second and third position out of eight on average (rank 0 corresponds to first position). Although the method does not return perfect results, it is very useful as a tool for experts who now need to only look at a handful of candidates instead of thousands to find the right paragraph.</p></div>
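<div xmlns="http://www.tei-c.org/ns/1.0"><p>The in-batch rank reported in Section 4 can be made concrete with a small, dependency-free Python sketch. It assumes that relevance is scored by the dot product of claim and passage embeddings, as in the original DPR model, and that the candidates for each claim are the positive contexts of its batch, which matches the "out of eight" reading above. Whether the actual evaluation also ranks hard negatives is not stated, so the function names in_batch_ranks and average_rank and the toy vectors are illustrative assumptions only.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product used as the similarity score between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_ranks(
    claim_vecs: Sequence[Sequence[float]],
    positive_vecs: Sequence[Sequence[float]],
) -> List[int]:
    """Return the 0-based rank of each claim's own positive context.

    positive_vecs[i] is the positive context of claim_vecs[i]; the other
    positives in the batch act as in-batch negatives for claim i.
    Rank 0 means the true positive received the highest score.
    """
    ranks = []
    for i, claim in enumerate(claim_vecs):
        scores = [dot(claim, passage) for passage in positive_vecs]
        # Count the candidates that score strictly higher than the positive.
        ranks.append(sum(1 for s in scores if s > scores[i]))
    return ranks


def average_rank(ranks: Sequence[int]) -> float:
    """Mean in-batch rank over all claims of a batch."""
    return sum(ranks) / len(ranks)


# Toy batch with three claims and their three positive contexts.
claims = [[1, 0], [0, 1], [1, 1]]
positives = [[9, 1], [8, 3], [5, 5]]
ranks = in_batch_ranks(claims, positives)
```
]]></code></p>
<p>For this toy batch the ranks are 0, 1 and 1, so the average in-batch rank is 2/3. With a batch size of 8, a randomly ordered candidate list would give an expected rank of 3.5 under the same convention.</p></div>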
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">IMPACT &amp; CONCLUSIONS</head><p>With this paper, we not only introduce an extensive dataset that can be used to train and test systems for the aforementioned tasks, but also provide training data for patent passage retrieval <ref type="bibr" target="#b20">[21]</ref>: a very challenging search task mostly conducted by highly-trained patent-domain experts. The need to at least partially automate this task arises from the growing number of patent applications worldwide.</p><p>With deep learning methods requiring large training sets, we hope to foster research in the patent analysis domain by providing such a dataset. We presented a novel dataset that comprises pairs of semantically similar texts in the patent domain. More precisely, the dataset contains claims from patent applications and paragraphs from prior art. It was created based on search reports by patent officers at the EPO. The simple structure of the dataset reduces the amount of patent-domain knowledge required for analyzing the data or using it for supervised machine learning. With the release of the dataset, we thus hope to foster research on the (semi-)automation of passage retrieval tasks and on user interfaces that support experts in searching through prior art and creating search reports.</p><p>Further, we hope to spark research in analysing how patent experts search for relevant patents and, maybe more interestingly, which relevant patents they miss and for what reason. By providing the matched claims and paragraphs, the search process of patent officers can be analyzed and search results compared. 
For future work, our learned model could be used to adapt the experts' keyword queries for higher recall and to understand the relationship between results from manually curated queries and (relevant) results from deep learning models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: In this excerpt from a search report, a patent examiner cites paragraph numbers of the published patent document EP1351172A1 for assessing the novelty of claim 1 and 3-9 of application EP18214053.</figDesc><graphic coords="3,67.53,84.68,479.19,169.46" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Dataset statistics: Each sample is a pair of an application's claim and paragraph cited from either an "X" document (positive sample) or "A" document (negative sample).</figDesc><table><row><cell>Samples</cell><cell>6,259,703</cell></row><row><cell>"X" document citations</cell><cell>3,492,987</cell></row><row><cell>"A" document citations</cell><cell>2,766,716</cell></row><row><cell>Distinct patent applications</cell><cell>31,238</cell></row><row><cell>Distinct cited documents</cell><cell>33,195</cell></row><row><cell>Distinct claim texts</cell><cell>297,147</cell></row><row><cell>Distinct cited paragraphs</cell><cell>520,376</cell></row><row><cell>Median claim length (chars)</cell><cell>274</cell></row><row><cell>Median paragraph length (chars)</cell><cell>476</cell></row></table><note>also a sample with the same claim text with a different referenced paragraph labeled "A" and vice versa. This balanced training set consists of 347,880 samples. In this version of the dataset, different claim texts can have different numbers of references. The number of "X" and "A" labels is only balanced for each claim text itself.</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">PatentMatch: A Dataset for Matching Patent Claims &amp; Prior Art PatentSemTech, July 15th, 2021, online</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1">https://www.epo.org/searching-for-patents/data/bulk-data-sets/text-analytics</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">https://hpi.de/naumann/s/patentmatch</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3">https://creativecommons.org/licenses/by/4.0/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4">https://github.com/deepset-ai/FARM, https://huggingface.co/bert-base-uncased</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>We would like to thank Sonia Kaufmann and Martin Kracker from the European Patent Office (EPO) for their support and advice.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Optimizing neural networks for patent classification</title>
		<author>
			<persName><forename type="first">L</forename><surname>Abdelgawad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kluegl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Genc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Falkner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hutter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="688" to="703" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A hybrid embedding approach to noisy answer passage retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="127" to="140" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">WikiPassageQA: A benchmark collection for research on non-factoid answer passage retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR)</title>
				<meeting>the International Conference on Research and Development in Information Retrieval (SIGIR)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1165" to="1168" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="16" />
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Semantically enhanced information retrieval: An ontology-based approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cantador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vallet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Castells</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Motta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="434" to="452" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Word embedding based generalized language model for information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR)</title>
				<meeting>the International Conference on Research and Development in Information Retrieval (SIGIR)</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="795" to="798" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">BiTeM site report for the claims to passage task in CLEF-IP 2012</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gobeill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ruch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the CLEF-IP Workshop</title>
				<meeting>the CLEF-IP Workshop</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">On term selection techniques for patent prior art search</title>
		<author>
			<persName><forename type="first">M</forename><surname>Golestan Far</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sanner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Bouadjenek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ferraro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hawking</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR)</title>
				<meeting>the International Conference on Research and Development in Information Retrieval (SIGIR)</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="803" to="806" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Similarity measures for text document clustering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC)</title>
				<meeting>the New Zealand Computer Science Research Student Conference (NZCSRSC)</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="9" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Preserving the presumption of patent validity: An alternative to outsourcing the US patent examiner&apos;s prior art search</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Jeffery</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cath. U. L. Rev.</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page">761</biblScope>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Unsupervised training set generation for automatic acquisition of technical terminology in patents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Judea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brügmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Computational Linguistics (COLING)</title>
				<meeting>the International Conference on Computational Linguistics (COLING)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="290" to="300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Dense passage retrieval for open-domain question answering</title>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oguz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<meeting>the Conference on Empirical Methods in Natural Language Processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6769" to="6781" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Passage retrieval revisited</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kaszkiel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR)</title>
				<meeting>the International Conference on Research and Development in Information Retrieval (SIGIR)</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="178" to="185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Incorporating task analysis in the design of a tool for a complex and exploratory search task</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kulahcioglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fradkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Palanivelu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Human Information Interaction and Retrieval (CHIIR)</title>
				<meeting>the Conference on Human Information Interaction and Retrieval (CHIIR)</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="373" to="376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Why weak patents? Testing the examiner ignorance hypothesis</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Wright</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jpubeco.2017.02.004</idno>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0047272717300178" />
	</analytic>
	<monogr>
		<title level="j">Journal of Public Economics</title>
		<idno type="ISSN">0047-2727</idno>
		<imprint>
			<biblScope unit="volume">148</biblScope>
			<biblScope unit="page" from="43" to="56" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">DeepPatent: patent classification with convolutional neural networks and word embedding</title>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientometrics</title>
		<imprint>
			<biblScope unit="volume">117</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="721" to="744" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">How to interpret EPO search reports</title>
		<author>
			<persName><forename type="first">K</forename><surname>Loveniers</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">World Patent Information</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="23" to="28" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">How to apply examiner search strategies in Espacenet. A case study</title>
		<author>
			<persName><forename type="first">E</forename><surname>Marttin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-C</forename><surname>Derrien</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.wpi.2017.06.001</idno>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0172219016301089" />
	</analytic>
	<monogr>
		<title level="j">World Patent Information</title>
		<idno type="ISSN">0172-2190</idno>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="S33" to="S43" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Patent citation analysis. A closer look at the basic input data from patent search reports</title>
		<author>
			<persName><forename type="first">J</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bettels</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientometrics</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="185" to="201" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval</title>
		<author>
			<persName><forename type="first">H</forename><surname>Palangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ward</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="694" to="707" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">CLEF-IP 2012: Retrieval experiments in the intellectual property domain</title>
		<author>
			<persName><forename type="first">F</forename><surname>Piroi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Sexton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Magdy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">V</forename><surname>Filippov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the CLEF-IP Workshop</title>
				<meeting>the CLEF-IP Workshop</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1" to="16" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Domain-specific word embeddings for patent classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Risch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krestel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Technologies and Applications</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="108" to="122" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A visual approach to query formulation for systematic search</title>
		<author>
			<persName><forename type="first">T</forename><surname>Russell-Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chamberlain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shokraneh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Human Information Interaction and Retrieval (CHIIR)</title>
				<meeting>the Conference on Human Information Interaction and Retrieval (CHIIR)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="379" to="383" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Extraction of keywords of novelties from patent claims</title>
		<author>
			<persName><forename type="first">S</forename><surname>Suzuki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Takatsuka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Computational Linguistics (COLING)</title>
				<meeting>the International Conference on Computational Linguistics (COLING)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1192" to="1200" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A study of search tactics for patentability search: A case study on patent engineers</title>
		<author>
			<persName><forename type="first">Y.-H</forename><surname>Tseng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-J</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Patent Information Retrieval (PaIR@CIKM)</title>
				<meeting>the Workshop on Patent Information Retrieval (PaIR@CIKM)</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="33" to="36" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems (NeurIPS)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Transforming patents into prior-art queries</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR)</title>
				<meeting>the International Conference on Research and Development in Information Retrieval (SIGIR)</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="808" to="809" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Efficiently answering technical questions - a knowledge graph approach</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Artificial Intelligence (AAAI)</title>
				<meeting>the Conference on Artificial Intelligence (AAAI)</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="3111" to="3118" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
