<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">CLEF-IP 2010: Prior Art Retrieval using the different sections in patent documents</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Eva</forename><surname>D'hondt</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Radboud University Nijmegen</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Suzan</forename><surname>Verberne</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Radboud University Nijmegen</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">CLEF-IP 2010: Prior Art Retrieval using the different sections in patent documents</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0A69F993A2F031EA888F9B3875BE8509</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval Retrieval</term>
					<term>Reranking Prior Art Search</term>
					<term>Patent retrieval</term>
					<term>CLEF-IP track</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe our participation in the 2010 CLEF-IP Prior Art Retrieval task where we examined the impact of information in different sections of patent documents, namely the title, abstract, claims, description and IPC-R sections, on the retrieval and re-ranking of patent documents. Using a standard bag-of-words approach in Lemur we found that the IPC-R sections are the most informative for patent retrieval. We then performed a re-ranking of the retrieved documents using a Logistic Regression Model, trained on the retrieved documents in the training set. We found indications that the information contained in the text sections of the patent document can contribute to a better ranking of the retrieved documents. The official results have shown that among the nine groups that participated in the Prior Art Retrieval task we achieved the eigth rank in terms of both Mean Average Precision (MAP) and Recall.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In the literature on patent retrieval there is some disagreement on which part of the patent document would be the most informative for (text-based) document retrieval. Graf and Azzopardi conclude that the claims section is the most useful <ref type="bibr" target="#b1">[2]</ref>, while patent searchers themselves hold that the description is more useful <ref type="bibr" target="#b0">[1]</ref>. Interestingly, the results of last year's CLEF-IP track <ref type="bibr">(2009)</ref> showed that the use of the metadata such as IPC-R codes or name of inventor leads to substantial improvements in patent retrieval over approaches that focussed only on the text sections. <ref type="bibr" target="#b2">[3]</ref>.</p><p>For our participation to the CLEF-IP 2010 track<ref type="foot" target="#foot_0">1</ref> , our goal was to compare the impact of the different patent sections on retrieval performance in a fixed benchmark data set. In this paper we describe our contribution to the track in which we examine the influence of both the IPC-R metadata and the information contained in the different text sections of the patent document on retrieval performance and re-ranking.</p><p>The CLEF-IP 2010 test collection provided by the organisation committee contains a corpus of 2.6 million patent documents pertaining to 1.3 million patents<ref type="foot" target="#foot_1">2</ref> , a set of 300 patent documents that serve as training topics together with their relevance assessments and a set of 500 test topic patent documents for testing. The patents can contain text in three different languages: English, French and German. They are labelled with XML tags to help identify the different sections as well as the different metadata such as IPC-R code, the name of the inventor or the date of the application. The different patent documents correspond to the different stages in the evolution of a patent and will therefore contain different amounts of information, for example, a patent application (A1 document) will not contain as much information as a fully granted patent (B1 document). The information in the older version of the patent is often subsumed by the newer document, but older versions may contain unique information as well. This year we have decided to retrieve patent documents rather than whole patents<ref type="foot" target="#foot_2">3</ref> .</p><p>3 Experimental Set-up</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Patent Section Extraction</head><p>Using a perl script we extracted the English title, abstract, claims and description sections and the IPC-R codes<ref type="foot" target="#foot_3">4</ref> from the original XML files and saved them as plain text in respective text files. If a document did not contain a section or if the section was not in English, no corresponding text file was created. The most important characteristics of the five subcorpora that were created in this manner are shown in 1. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Retrieval Step</head><p>In the retrieval step, we wanted to determine which section of the patent document is the most informative for patent retrieval. To this end, we performed six retrievals on the corpus using the training queries. The retrieved documents of the best-scoring system were used to train the re-ranking models as will be described in section 2.4. For the retrieval step, all the text files in the subcorpora were saved in the Lemur format: Using a bash script, the text in the text files was lowercased, punctuation was removed and the appropriate XML tags for indexing by Lemur were added . Then the texts were indexed using the BuildIndex function of Lemur with the indri IndexType and a stop list for general English.</p><p>In total we built 6 indices: Titles only, Abstracts only, Claims only, Description only, IPC-R codes, and full-text. By full-text we mean that we concatenated title, abstract, claims and description; sections that were not available in the patent document were added as an empty string. If none of these sections were available in English, the patent document was not indexed.</p><p>The topics in the training set were preprocessed in the same manner in order to be used as queries in Lemur. If the original query XML document did not contain a section, it was not added to the lemur query file.</p><p>For each query in the query file we retrieved 100 documents and ranked them according to the TF-IDF ranking model as implemented in Lemur. Table <ref type="table" target="#tab_1">2</ref> shows the results of the retrievals on the 6 indices with their respective training queries. The results are given for Precision (P) and Recall (R) respectively at position 5, 10, 50 and 100 in the result list as well as the Mean Average Precision (MAP) score. The index with the IPC-R codes proved to be the most informative for patent retrieval in terms of Recall and Precision, although results are quite low for all six retrievals. Based on these results, we decided to proceed with only the retrieval results from the IPC-R subcorpus to the second step of our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Re-ranking Step</head><p>It seems that in a retrieval task, conceptual information (as encoded in the IPC codes) works better than 'surface' textual information. However, we wanted to examine the influence of the different text sections on the positions of the retrieved results in the set.</p><p>We aimed to improve the ranking of the retrieved documents on the basis of the textual information present in the different sections of the patent document. As a predictor of relevance for the sections, we used the cosine similarity between corresponding sections of the topic and each of the retrieved documents.</p><p>We extracted this information as follows: For each topic-document pair from the training result set, we extracted the title, abstract, claims and description sections (if present) from both the topic and the retrieved document. We then calculated the cosine similarity between the sections of the respective documents using a python script which was based on the script by Dennis Muhlestein<ref type="foot" target="#foot_5">5</ref> .</p><p>For each query-document pair we obtained a vector with 4 features: cosine similarity titles, cosine similarity abstracts, cosine similarity claims, and cosine similarity descriptions. In order to determine the importance of each of these features (and thereby each of the sections), we trained a Logistic Regression Model (LRM). The criterium variable was the relevance score of the retrieved document in the training relevance assessments. <ref type="foot" target="#foot_6">6</ref>We used the lrm function from the Design package in R to train this model. We then used the LRM (trained on the training data) to predict an alternative ranking for the retrieved documents. We created two variants of the model: one with only these four features, and one in which the TF-IDF score for the retrieval with IPC-R codes was added as a fifth feature. We did not perform any step-wise model selection but rather combined all predictors at once.</p><p>In this section we present the results of our models in terms of MAP, Precision and Recall for both the training data (table <ref type="table" target="#tab_2">3</ref>) and the test data (table 4). The evaluation of the two re-ranking models on the training data was performed using 5-fold cross-validation. The P, R and MAP results are the averages over the five folds. Between the brackets is the standard deviation. The re-ranking model that incorporates the TF-IDF score of the retrieval set performs slightly better than the other model in both the training and the test results. In terms of Recall and Precision we performed slightly better than during our participation in the CLEF-IP 2009 track but compared to the other teams in this year's track we achieved low scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>In this section we will discuss (a) the retrieval results on the training set and (b) analyse the re-ranking models used.</p><p>One of our goals was to determine which section of the patent document is the most informative for patent retrieval in terms of recall and precision. The results in table <ref type="table" target="#tab_1">2</ref> showed that for a bagof-words approach the IPC-R codes in the patents were the most informative of all the patent sections. During our post-evaluation analysis we discovered that the low scores for the individual text sections are more likely an artefact of our data selection process rather than an adequate reflection of their performance in a retrieval task. Table <ref type="table" target="#tab_0">1</ref> showed that there are considerable differences in size between the different text section corpora and thus in the number of patent documents that could be retrieved for a specific query. Moreover, we found evidence that some relevant patent documents were impossible to retrieve for certain queries. For example, if a relevant document for a query consisting of a claims section did not have a claims section itself, it did not feature in the claims subcorpus and could therefore not be retrieved. Consequently, we cannot draw a definite conclusion about the relative importance of the separate text sections for patent retrieval. The full-text corpus and the IPC-R corpus, however, did not suffer from these drawbacks. We found it interesting that the IPC-R outperformed the full-text retrieval, though the difference between the results is small. The major advantage of the IPC-R section is -predictably-the fact that it is language-independent, conceptual and has a limited 'vocabulary' of terms that can be used. For future work it would be interesting to examine the differences in retrieval results by using more general and more specific IPC codes as retrieval terms.</p><p>Our second goal was to examine the impact of the text sections on the re-ranking of retrieved documents: When we look at the results in table 3 and 4, it seems that the use of the information in the respective text sections of the query and retrieved document can lead to an improvement in the ranking of the relevant results. However, the high standard deviation values for the five folds show that our training set of 300 queries is too small to make any definite conclusions about the improvements made by the models. This may be a consequence of the fact that the models were not trained on optimal data but on rather poor retrieval results. Though they seem to boost the ranking of the retrieved documents, they contain enough noise to diminish the accuracy.</p><p>In order to evaluate the importance of the different text sections in the re-ranking of the retrieval results, we rank them in table 5 according to the coefficient that was assigned to them in the Logistic Regression Model. We find that all texts sections except for the description have a significant influence on the re-ranking of the retrieval results. The correlation analysis reported in table 6 shows a high correlation between the cosine similarity of the claims and description sections. Consequently, the coefficient for the claims section should be interpreted as being caused by the combination of the cosine similarities for the claims and description sections. Of all the text sections the abstracts have the most impact in the re-ranking process. This was to be expected as the abstracts are most likely to contain the keywords that are specific to the field of the invention. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In our contribution to the CLEF-IP 2010 Prior Art Retrieval task we examined the impact of different sections of patent documents on the retrieval and re-ranking of patent documents. Using a standard bag-of-words approach in Lemur we found that the IPC-R sections are more informative for patent retrieval than a full-text representation of the patent document. We then performed a re-ranking of the retrieved documents using a Logistic Regression Model, trained on the retrieved documents in the training set. Looking at the improved MAP scores, we found indications that the information contained in the separate text sections of the patent document can contribute to a better ranking of the retrieved documents.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc></figDesc><table><row><cell></cell><cell cols="6">number of English files generated from the corpus and both query sets</cell></row><row><cell></cell><cell>title</cell><cell>abstract</cell><cell>claims</cell><cell>description</cell><cell>IPC-R</cell><cell># of original patent</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>documents</cell></row><row><cell>corpus</cell><cell cols="3">2,669,501 1,004,436 1,210,507</cell><cell>993,910</cell><cell>2,679,232</cell><cell>2,680,604</cell></row><row><cell>training</cell><cell>196</cell><cell>292</cell><cell>196</cell><cell>196</cell><cell>300</cell><cell>300</cell></row><row><cell>topic</cell><cell>339</cell><cell>490</cell><cell>339</cell><cell>339</cell><cell>490</cell><cell>500</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>BOW retrieval on training set.</figDesc><table><row><cell></cell><cell>MAP</cell><cell>P</cell><cell>P5</cell><cell>P10</cell><cell>P50</cell><cell>R</cell><cell>R5</cell><cell>R10</cell><cell>R50</cell></row><row><cell>title</cell><cell>0.0995</cell><cell>0.0102</cell><cell cols="3">0.0602 0.0378 0.0143</cell><cell>0.0931</cell><cell cols="3">0.0319 0.0408 0.0674</cell></row><row><cell>abstract</cell><cell>0.1139</cell><cell>0.0129</cell><cell cols="3">0.0555 0.0432 0.0204</cell><cell>0.1262</cell><cell cols="3">0.0277 0.0419 0.0995</cell></row><row><cell>claims</cell><cell>0.0866</cell><cell>0.0088</cell><cell cols="3">0.0515 0.0375 0.0156</cell><cell>0.0938</cell><cell cols="3">0.0273 0.0373 0.0754</cell></row><row><cell cols="2">description 0.0646</cell><cell>0.0140</cell><cell cols="3">0.0428 0.0314 0.0296</cell><cell>0.1479</cell><cell cols="3">0.0777 0.1074 0.1479</cell></row><row><cell>full-text</cell><cell>0.0615</cell><cell>0.0138</cell><cell cols="7">0.0232 0.0149 0.0138 0.1577 0.0148 0.0149 0.0667</cell></row><row><cell>IPC-R</cell><cell>0.0677</cell><cell cols="4">0.0156 0.0373 0.0260 0.0187</cell><cell>0.1546</cell><cell cols="3">0.0196 0.0269 0.0887</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc></figDesc><table><row><cell></cell><cell></cell><cell></cell><cell cols="3">Results on training set</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>MAP</cell><cell>P</cell><cell>P5</cell><cell>P10</cell><cell>P50</cell><cell>R</cell><cell></cell><cell>R5</cell><cell>R10</cell><cell>R50</cell></row><row><cell>Baseline</cell><cell>0.0677</cell><cell>0.0156</cell><cell>0.0373</cell><cell>0.0260</cell><cell>0.0187</cell><cell cols="2">0.1546</cell><cell>0.0196</cell><cell>0.0269</cell><cell>0.0887</cell></row><row><cell>using IPC-R</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Re-ranking</cell><cell>0.0858</cell><cell>0.0156</cell><cell>0.0589</cell><cell>0.0431</cell><cell>0.0215</cell><cell cols="2">0.1546</cell><cell>0.0334</cell><cell>0.0471</cell><cell>0.111</cell></row><row><cell>no TF-IDF</cell><cell cols="10">(0.973) (0.0216) (0.1379) (0.0920) (0.0328) (0.2204) (0.0859) (0.1051) (0.1782)</cell></row><row><cell>Re-ranking</cell><cell>0.0870</cell><cell>0.0156</cell><cell>0.0595</cell><cell>0.0455</cell><cell>0.0219</cell><cell cols="2">0.1546</cell><cell>0.0326</cell><cell>0.0494</cell><cell>0.114</cell></row><row><cell cols="11">with (0.519) (0.0216) (0.1361) (0.0943) (0.0331) (0.2204) (0.0788) (0.1058) (0.1814)</cell></row><row><cell></cell><cell></cell><cell cols="4">Table 4: Results on topic set</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>MAP</cell><cell>P</cell><cell>P5</cell><cell>P10</cell><cell>P50</cell><cell>R</cell><cell>R5</cell><cell>R10</cell><cell>R50</cell></row><row><cell>run-1-small</cell><cell cols="9">0.0274 0.0255 0.0786 0.0593 0.0314 0.1253 0.0228 0.0344 0.0857</cell></row><row><cell>(no TF-IDF)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>run-2-small</cell><cell cols="9">0.0291 0.0255 0.0826 0.0643 0.0331 0.1253 0.0240 0.0376 0.0902</cell></row><row><cell>(with TF-IDF)</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 5 :</head><label>5</label><figDesc>Impact of different text sections on re-ranking</figDesc><table><row><cell>Feature</cell><cell></cell><cell></cell><cell cols="3">Coefficient p-value</cell></row><row><cell cols="3">Cosine similarity between abstracts</cell><cell cols="2">3.43541</cell><cell>0.00</cell></row><row><cell cols="2">Cosine similarity between claims</cell><cell></cell><cell cols="2">2.09992</cell><cell>0.001</cell></row><row><cell cols="2">Cosine similarity between titles</cell><cell></cell><cell cols="2">1.22437</cell><cell>0.00</cell></row><row><cell cols="2">TF-IDF value from retrieval data</cell><cell></cell><cell cols="2">0.01143</cell><cell>0.00</cell></row><row><cell cols="5">Cosine similarity between descriptions -1.18692</cell><cell>0.022</cell></row><row><cell></cell><cell cols="5">Table 6: Correlation analysis of predictors</cell></row><row><cell>Pearon's r</cell><cell cols="5">Cos.sim. Cos.sim. Cos.sim. Cos.sim.</cell><cell>TF-IDF value</cell></row><row><cell></cell><cell>titles</cell><cell cols="3">abstracts claims</cell><cell>descriptions</cell></row><row><cell>Cos.sim. titles</cell><cell>1</cell><cell>0.151</cell><cell></cell><cell>0.368</cell><cell>0.336</cell><cell>0.066</cell></row><row><cell>Cos.sim. abstracts</cell><cell>0.151</cell><cell>1</cell><cell></cell><cell>0.202</cell><cell>0.250</cell><cell>-0.001</cell></row><row><cell>Cos.sim. claims</cell><cell>0.368</cell><cell>0.202</cell><cell></cell><cell>1</cell><cell>0.905</cell><cell>0.046</cell></row><row><cell cols="2">Cos.sim. descriptions 0.336</cell><cell>0.25</cell><cell></cell><cell>0.905</cell><cell>1</cell><cell>0.029</cell></row><row><cell>TF-IDF value</cell><cell>0.066</cell><cell cols="2">-0.002</cell><cell>0.046</cell><cell>0.029</cell><cell>1</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.ir-facility.org/research/evaluation/clef-ip-10/overview</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Please note the difference between a patent and a patent document: a patent is not a physical document itself but a name for a group of patent documents that have the same patent ID number.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">A whole patent can be constructed by concatenating different patent versions into one document or by constructing a document from the most recent version of every section in the patent documents</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">  4  We used the full IPC-R code up to the level of the subgroups, e.g.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">A01J 5/01.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5">http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6">We only considered documents to be either 'relevant' or 'non-relevant' and did not adhere to the subdivision ('relevant' or 'highly relevant') made by the CLEF-IP organisers.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Lexical issues of a syntactic approach to interactive patent retrieval</title>
		<author>
			<persName><forename type="first">D'</forename><surname>Eva</surname></persName>
		</author>
		<author>
			<persName><surname>Hondt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd BCSIRSG Symposium on Future Directions in Information Access</title>
				<meeting>the 3rd BCSIRSG Symposium on Future Directions in Information Access</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A methodology for building a patent test collection for prior art search</title>
		<author>
			<persName><forename type="first">Erik</forename><surname>Graf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leif</forename><surname>Azzopardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EVIA2008</title>
				<meeting>EVIA2008</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Multiple retrieval models and regression models for prior art search</title>
		<author>
			<persName><forename type="first">Patrice</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurent</forename><surname>Romary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF 2009</title>
				<meeting>CLEF 2009</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
