
CLEF-IP 2010: Prior Art Retrieval using the different sections in patent documents

Eva D'hondt and Suzan Verberne, Radboud University Nijmegen

In this paper we describe our participation in the 2010 CLEF-IP Prior Art Retrieval task, in which we examined the impact of information in different sections of patent documents, namely the title, abstract, claims, description and IPC-R sections, on the retrieval and re-ranking of patent documents. Using a standard bag-of-words approach in Lemur we found that the IPC-R sections are the most informative for patent retrieval. We then performed a re-ranking of the retrieved documents using a Logistic Regression Model, trained on the retrieved documents in the training set. We found indications that the information contained in the text sections of the patent document can contribute to a better ranking of the retrieved documents. The official results have shown that among the nine groups that participated in the Prior Art Retrieval task we achieved the eighth rank in terms of both Mean Average Precision (MAP) and Recall.

Keywords: Prior Art Search, Patent retrieval, CLEF-IP track

Introduction

In the literature on patent retrieval there is some disagreement on which part of the patent document is the most informative for (text-based) document retrieval. Graf and Azzopardi conclude that the claims section is the most useful [2], while patent searchers themselves hold that the description is more useful [1]. Interestingly, the results of last year's CLEF-IP track (2009) showed that the use of metadata such as IPC-R codes or the name of the inventor leads to substantial improvements in patent retrieval over approaches that focussed only on the text sections [3].

For our participation in the CLEF-IP 2010 track, our goal was to compare the impact of the different patent sections on retrieval performance in a fixed benchmark data set. In this paper we describe our contribution to the track, in which we examine the influence of both the IPC-R metadata and the information contained in the different text sections of the patent document on retrieval performance and re-ranking.

Data Description

The CLEF-IP 2010 test collection provided by the organisation committee contains a corpus of 2.6 million patent documents pertaining to 1.3 million patents², a set of 300 patent documents that serve as training topics together with their relevance assessments, and a set of 500 patent documents that serve as test topics. The patents can contain text in three different languages: English, French and German. They are labelled with XML tags that identify the different sections as well as metadata such as the IPC-R code, the name of the inventor or the date of the application. The different patent documents correspond to the different stages in the evolution of a patent and therefore contain different amounts of information. For example, a patent application (A1 document) will not contain as much information as a fully granted patent (B1 document). The information in the older version of the patent is often subsumed by the newer document, but older versions may contain unique information as well. This year we have decided to retrieve patent documents rather than whole patents³.

Experimental Set-up

Patent Section Extraction

Using a Perl script we extracted the English title, abstract, claims and description sections and the IPC-R codes⁴ from the original XML files and saved each of them as plain text in a separate text file. If a document did not contain a section, or if the section was not in English, no corresponding text file was created. The most important characteristics of the five subcorpora that were created in this manner are shown in Table 1.
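The extraction step described above can be sketched in Python as follows (the original work used a Perl script). This is only an illustration: the element names in `SECTION_TAGS`, the `lang` attribute and the sample document are assumptions, not the actual CLEF-IP schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real CLEF-IP XML schema may differ.
SECTION_TAGS = {
    "title": "invention-title",
    "abstract": "abstract",
    "claims": "claims",
    "description": "description",
}

def extract_english_sections(xml_text):
    """Return {section name: plain text} for the sections available in English."""
    root = ET.fromstring(xml_text)
    sections = {}
    for name, tag in SECTION_TAGS.items():
        for element in root.iter(tag):
            if element.get("lang", "").upper() != "EN":
                continue  # skip French and German versions of the section
            # Flatten nested markup and collapse whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                sections[name] = text
                break
    return sections

# Made-up sample document for illustration only.
sample = """<patent-document>
  <invention-title lang="DE">Melkmaschine</invention-title>
  <invention-title lang="EN">Milking machine</invention-title>
  <abstract lang="EN"><p>A machine for  milking cows.</p></abstract>
  <claims lang="FR"><claim>Machine a traire.</claim></claims>
</patent-document>"""

print(extract_english_sections(sample))
```

Sections that are missing or not in English are simply absent from the returned dictionary, which mirrors the rule that no text file is created for them.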

[Table 1. Characteristics of the five subcorpora (columns: corpus, training, topic).]

In the retrieval step, we wanted to determine which section of the patent document is the most informative for patent retrieval. To this end, we performed six retrievals on the corpus using the training queries. The retrieved documents of the best-scoring system were used to train the re-ranking models, as described in the Re-ranking Step section.

For the retrieval step, all the text files in the subcorpora were saved in the Lemur format: using a bash script, the text in the text files was lowercased, punctuation was removed and the appropriate XML tags for indexing by Lemur were added. Then the texts were indexed using the BuildIndex function of Lemur with the indri IndexType and a stop list for general English.
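The normalisation described above could look roughly like the following Python sketch (the original used a bash script). The TREC-style `DOC`/`DOCNO`/`TEXT` wrapper is an assumption about what "the appropriate XML tags" were, the document ID is made up, and punctuation is replaced by a space rather than deleted so that hyphenated words do not fuse.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalise(text):
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())

def to_trectext(doc_id, text):
    """Wrap one normalised section in TREC-style tags (assumed format)."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{normalise(text)}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

# Hypothetical document ID and text.
print(to_trectext("EP-0000001-A1", "A Milking-Machine, for cows."))
```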

In total we built six indices: titles only, abstracts only, claims only, descriptions only, IPC-R codes, and full-text. By full-text we mean that we concatenated title, abstract, claims and description; sections that were not available in the patent document were added as an empty string. If none of these sections were available in English, the patent document was not indexed.
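The rule for the full-text index can be written down as a small Python sketch. The function name and the dictionary layout are hypothetical; missing sections contribute nothing to the result, which is equivalent to concatenating an empty string.

```python
TEXT_SECTIONS = ("title", "abstract", "claims", "description")

def build_full_text(sections):
    """Concatenate the four text sections; return None if none is available."""
    parts = [sections.get(name, "") for name in TEXT_SECTIONS]
    if not any(parts):
        return None  # no English text at all: the document is not indexed
    return " ".join(part for part in parts if part)

print(build_full_text({"title": "milking machine", "claims": "a machine for milking"}))
print(build_full_text({}))
```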

² Please note the difference between a patent and a patent document: a patent is not a physical document itself but a name for a group of patent documents that have the same patent ID number.

³ A whole patent can be constructed by concatenating different patent versions into one document or by constructing a document from the most recent version of every section in the patent documents.

⁴ We used the full IPC-R code up to the level of the subgroups, e.g. A01J 5/01.

The topics in the training set were preprocessed in the same manner in order to be used as queries in Lemur. If the original query XML document did not contain a section, it was not added to the Lemur query file.

For each query in the query file we retrieved 100 documents and ranked them according to the TF-IDF ranking model as implemented in Lemur. Table 2 shows the results of the retrievals on the six indices with their respective training queries. The results are given for Precision (P) and Recall (R) at positions 5, 10, 50 and 100 in the result list, as well as the Mean Average Precision (MAP) score.

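[Table 2. Retrieval results on the training queries for the six indices (title, abstract, claims, description, full-text, IPC-R): P and R at positions 5, 10, 50 and 100, and MAP.]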
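For reference, the measures reported for the retrieval runs follow the usual definitions, sketched here in Python. The evaluation tool actually used is not named in the text, so the function names and the toy ranking below are purely illustrative.

```python
def precision_at(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant documents), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with made-up document IDs.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(precision_at(ranked, relevant, 2))   # 0.5
print(recall_at(ranked, relevant, 4))
print(average_precision(ranked, relevant))
```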

The index with the IPC-R codes proved to be the most informative for patent retrieval in terms of Recall and Precision, although results are quite low for all six retrievals. Based on these results, we decided to proceed with only the retrieval results from the IPC-R subcorpus to the second step of our approach.

Re-ranking Step

It seems that in a retrieval task, conceptual information (as encoded in the IPC codes) works better than 'surface' textual information. However, we wanted to examine the influence of the different text sections on the positions of the retrieved results in the set.

We aimed to improve the ranking of the retrieved documents on the basis of the textual information present in the different sections of the patent document. As a predictor of relevance for the sections, we used the cosine similarity between corresponding sections of the topic and each of the retrieved documents.

We extracted this information as follows: for each topic-document pair from the training result set, we extracted the title, abstract, claims and description sections (if present) from both the topic and the retrieved document. We then calculated the cosine similarity between the sections of the respective documents using a Python script which was based on the script by Dennis Muhlestein⁵.
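A minimal sketch of this feature extraction in Python is given below. It uses raw term counts as vector weights and scores a pair as 0.0 when a section is missing on either side; both are assumptions, since the weighting in the original script and its treatment of missing sections are not specified here.

```python
import math
from collections import Counter

SECTIONS = ("title", "abstract", "claims", "description")

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def section_features(topic, document):
    """One similarity per section; missing sections yield 0.0 (assumption)."""
    return [cosine_similarity(topic.get(s, ""), document.get(s, ""))
            for s in SECTIONS]

# Made-up example texts.
print(cosine_similarity("milking machine for cows", "machine for milking goats"))
print(section_features({"title": "milking machine"},
                       {"title": "milking machine", "claims": "a machine"}))
```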

For each query-document pair we obtained a vector with four features: the cosine similarity between the titles, between the abstracts, between the claims, and between the descriptions. In order to determine the importance of each of these features (and thereby each of the sections), we trained a Logistic Regression Model (LRM). The criterion variable was the relevance score of the retrieved document in the training relevance assessments.⁶

We used the lrm function from the Design package in R to train this model. We then used the LRM (trained on the training data) to predict an alternative ranking for the retrieved documents. We created two variants of the model: one with only these four features, and one in which the TF-IDF score for the retrieval with IPC-R codes was added as a fifth feature. We did not perform any step-wise model selection but rather combined all predictors at once.
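The re-ranking idea can be illustrated with a small stand-in written in Python. It is not the lrm function from R: it fits a logistic regression by plain batch gradient descent, and the function names, learning rate, epoch count and toy feature values are all assumptions made for the example.

```python
import math

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    """Predicted probability of relevance; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))

def train_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit intercept and coefficients by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            grads[0] += error
            for j, x in enumerate(features, start=1):
                grads[j] += error * x
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grads)]
    return weights

def rerank(weights, candidates):
    """candidates: list of (doc_id, features); highest predicted relevance first."""
    return sorted(candidates, key=lambda c: predict(weights, c[1]), reverse=True)

# Made-up training vectors: [title, abstract, claims, description] similarities.
rows = [[0.1, 0.8, 0.6, 0.5], [0.0, 0.7, 0.5, 0.6],
        [0.2, 0.1, 0.2, 0.3], [0.0, 0.2, 0.1, 0.2]]
labels = [1, 1, 0, 0]  # binary relevance, as in the paper
weights = train_logistic(rows, labels)
candidates = [("b", [0.1, 0.1, 0.1, 0.2]), ("a", [0.1, 0.9, 0.7, 0.6])]
print([doc_id for doc_id, _ in rerank(weights, candidates)])
```

The second model variant would simply append the TF-IDF retrieval score as a fifth entry in each feature vector.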

⁵ http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python

⁶ We only considered documents to be either 'relevant' or 'non-relevant' and did not adhere to the subdivision ('relevant' or 'highly relevant') made by the CLEF-IP organisers.

Results

In this section we present the results of our models in terms of MAP, Precision and Recall for both the training data (Table 3) and the test data (Table 4). The evaluation of the two re-ranking models on the training data was performed using 5-fold cross-validation. The P, R and MAP results are the averages over the five folds. The standard deviation is given in brackets.

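Table 3. MAP on the training data (standard deviation over the five folds in brackets).

Model                      MAP
Baseline using IPC-R       0.0677
Re-ranking no TF-IDF       0.0858 (0.973)
Re-ranking with TF-IDF     0.0870 (0.519)

Table 4. Test runs: run-1-small (no TF-IDF) and run-2-small (with TF-IDF).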
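The cross-validation bookkeeping behind such averages can be sketched as follows. The split into contiguous folds of queries, the use of the sample standard deviation and the per-fold MAP values are all assumptions for illustration; they are not the figures or the procedure from the paper.

```python
import statistics

def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def summarise(scores):
    """Mean and sample standard deviation over the folds."""
    return statistics.mean(scores), statistics.stdev(scores)

folds = k_fold_indices(300)  # 300 training topics
for held_out in folds:
    held = set(held_out)
    train_ids = [i for i in range(300) if i not in held]
    # A model would be trained on train_ids and evaluated on held_out here.

made_up_fold_map = [0.08, 0.09, 0.10, 0.07, 0.11]  # illustrative values only
print(summarise(made_up_fold_map))
```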

The re-ranking model that incorporates the TF-IDF score of the retrieval set performs slightly better than the other model in both the training and the test results. In terms of Recall and Precision we performed slightly better than during our participation in the CLEF-IP 2009 track, but compared to the other teams in this year's track we achieved low scores.

Discussion

In this section we (a) discuss the retrieval results on the training set and (b) analyse the re-ranking models used.

One of our goals was to determine which section of the patent document is the most informative for patent retrieval in terms of recall and precision. The results in Table 2 showed that for a bag-of-words approach the IPC-R codes in the patents were the most informative of all the patent sections. During our post-evaluation analysis we discovered that the low scores for the individual text sections are more likely an artefact of our data selection process than an adequate reflection of their performance in a retrieval task. Table 1 showed that there are considerable differences in size between the different text section corpora and thus in the number of patent documents that could be retrieved for a specific query. Moreover, we found evidence that some relevant patent documents were impossible to retrieve for certain queries. For example, if a relevant document for a query consisting of a claims section did not have a claims section itself, it did not feature in the claims subcorpus and could therefore not be retrieved. Consequently, we cannot draw a definite conclusion about the relative importance of the separate text sections for patent retrieval.

The full-text corpus and the IPC-R corpus, however, did not suffer from these drawbacks. We found it interesting that the IPC-R retrieval outperformed the full-text retrieval, though the difference between the results is small. The major advantage of the IPC-R section is, predictably, the fact that it is language-independent, conceptual and has a limited 'vocabulary' of terms that can be used. For future work it would be interesting to examine the differences in retrieval results by using more general and more specific IPC codes as retrieval terms.
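As a starting point for such a comparison, an IPC code like the example A01J 5/01 can be truncated at each level of the classification hierarchy (section, class, subclass, main group, subgroup). The parser below is a hypothetical sketch; its regular expression covers the common notation but has not been checked against every code in the collection.

```python
import re

# Section letter, two-digit class, subclass letter, main group "/" subgroup.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")

def ipc_levels(code):
    """Return the IPC code truncated at each hierarchy level."""
    match = _IPC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognised IPC code: {code!r}")
    section, cls, subclass, group, subgroup = match.groups()
    prefix = f"{section}{cls}{subclass}"
    return {
        "section": section,
        "class": f"{section}{cls}",
        "subclass": prefix,
        "main_group": f"{prefix} {group}/00",  # main groups are written with /00
        "subgroup": f"{prefix} {group}/{subgroup}",
    }

print(ipc_levels("A01J 5/01"))
```

Querying with the "subclass" or "main_group" value instead of the full subgroup would give the more general retrieval terms mentioned above.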

Our second goal was to examine the impact of the text sections on the re-ranking of retrieved documents. When we look at the results in Tables 3 and 4, it seems that the use of the information in the respective text sections of the query and retrieved document can lead to an improvement in the ranking of the relevant results. However, the high standard deviation values for the five folds show that our training set of 300 queries is too small to draw any definite conclusions about the improvements made by the models. This may be a consequence of the fact that the models were not trained on optimal data but on rather poor retrieval results. Though they seem to boost the ranking of the retrieved documents, they contain enough noise to diminish the accuracy.

In order to evaluate the importance of the different text sections in the re-ranking of the retrieval results, we rank them in Table 5 according to the coefficient that was assigned to them in the Logistic Regression Model. We find that all text sections except for the description have a significant influence on the re-ranking of the retrieval results. The correlation analysis reported in Table 6 shows a high correlation between the cosine similarity of the claims and description sections. Consequently, the coefficient for the claims section should be interpreted as being caused by the combination of the cosine similarities for the claims and description sections. Of all the text sections the abstracts have the most impact in the re-ranking process. This was to be expected, as the abstracts are most likely to contain the keywords that are specific to the field of the invention.
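A correlation check of this kind between two feature columns can be sketched as below. The type of correlation analysis used for Table 6 is not stated, so Pearson's coefficient is an assumption here, and the similarity values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # undefined for a constant column; reported as 0 here
    return covariance / (spread_x * spread_y)

# Made-up cosine similarities for four query-document pairs.
claims_similarity = [0.2, 0.4, 0.6, 0.8]
description_similarity = [0.1, 0.2, 0.3, 0.4]
print(pearson(claims_similarity, description_similarity))
```

When two predictors are this strongly correlated, a regression model cannot cleanly separate their contributions, which is why the claims coefficient is read as a joint effect.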

Table 5. Features ranked by their coefficient in the Logistic Regression Model.

Feature
Cosine similarity between abstracts
Cosine similarity between claims
Cosine similarity between titles
TF-IDF value from retrieval data
Cosine similarity between descriptions

Conclusion

In our contribution to the CLEF-IP 2010 Prior Art Retrieval task we examined the impact of different sections of patent documents on the retrieval and re-ranking of patent documents. Using a standard bag-of-words approach in Lemur we found that the IPC-R sections are more informative for patent retrieval than a full-text representation of the patent document. We then performed a re-ranking of the retrieved documents using a Logistic Regression Model, trained on the retrieved documents in the training set. Looking at the improved MAP scores, we found indications that the information contained in the separate text sections of the patent document can contribute to a better ranking of the retrieved documents.

[1] Eva D'hondt. Lexical issues of a syntactic approach to interactive patent retrieval. In Proceedings of the 3rd BCS-IRSG Symposium on Future Directions in Information Access, 2009.

[2] Erik Graf and Leif Azzopardi. A methodology for building a patent test collection for prior art search. In Proceedings of EVIA 2008, 2008.

[3] Patrice Lopez and Laurent Romary. Multiple retrieval models and regression models for prior art search. In Proceedings of CLEF 2009, 2009.