Prior art retrieval using the claims section as a bag of words

Suzan Verberne and Eva D'hondt
Information Foraging Lab, Radboud University Nijmegen
(s.verberne|e.dhondt)@let.ru.nl

Abstract

We describe our participation in the 2009 CLEF-IP task, which was targeted at prior-art search for topic patent documents. Our system retrieved patent documents based on a standard bag-of-words approach for both the Main Task and the English Task. In both runs, we extracted the claims sections from all English patents in the corpus and saved them in the Lemur index format with the patent IDs as DOCIDs. These claims were then indexed using Lemur's BuildIndex function. In the topic documents we also focussed exclusively on the claims sections. These were extracted and converted to queries by removing stopwords and punctuation. We did not perform any term selection. We retrieved 100 patents per topic using Lemur's RetEval function with the TF-IDF retrieval model. Compared to the other runs submitted for the track, we obtained good results in terms of nDCG (0.46) and moderate results in terms of MAP (0.054).

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms

Indexing, Bag-of-Words, Queries

Keywords

Prior Art Retrieval, CLEF-IP

1 Introduction

The CLEF-IP track was launched by the Information Retrieval Facility (IRF) in 2009 to investigate IR techniques for patent retrieval.¹ It is part of the CLEF 2009 evaluation campaign.² The task of the track is "to find patent documents that constitute prior art to a given patent". The given patent in this task description serves as a topic for the retrieval task. The Main Task is to retrieve prior art for topic patents in any of the three following languages: English, French and German. Three facultative subtasks use parallel monolingual topics in one of these three languages.

¹ See http://www.ir-facility.org/the irf/clef-ip09-track
² See http://www.clef-campaign.org/

2 Our methodology

2.1 Data selection

The CLEF-IP corpus consists of EPO documents with a publication date between 1985 and 2000, covering English, French, and German patents (1,958,955 patent documents pertaining to 1,022,388 patents, 75 GB) [3]. The XML documents in the corpus do not each correspond to one complete patent: one patent can consist of multiple XML files, representing documents that were produced at different stages of the patent realization process. We decided to focus on the claims sections of the patents, because we found that many of the English patent documents did not contain abstracts. Moreover, we expected the claims section to be the most informative part of a patent.

In the CLEF-IP 2009 track the participating teams were provided with four different sets of topics (S, M, L, XL). We opted to do runs on the smallest set (the S data set), containing 500 topics, for both the Main Task and the English Task. The information in these topics differed between the two tasks: the topics for the Main Task contained the abstract content as well as the full information of the granted patent except for citation information, while the topic patents for the English Task only contained the title and claims elements of the granted patent [3]. We therefore focussed only on the (English) claims sections of all topic patents.

2.2 Query formulation

There has been much research on the topic of query term extraction and query formulation [1]. However, we chose not to distil any query terms from the extracted claims section but took all words in the claims section as one long query (weighted in retrieval with TF-IDF). The reason for this was twofold. First, adding a term selection step makes the retrieval process more prone to errors, because it requires the development of a smart selection procedure. Second, by weighting the query and document terms with TF-IDF, a form of term selection is already carried out in the retrieval and ranking process.

2.3 Indexing using Lemur

We extracted the claims sections from all English patents in the corpus, after we had removed all XML markup from the texts in a preprocessing script. Since a patent may consist of multiple XML documents, which correspond to the different stages of the patent realization process, one patent can contain more than one claims section. In the index file, we therefore concatenated the claims sections pertaining to one patent ID into one document. We saved all patent claims in the Lemur index format with the patent IDs as DOCIDs. They were then indexed using the BuildIndex function of Lemur with the indri IndexType and a stop word list for general English.³

³ This stop word list can be provided by the authors upon request.
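A minimal sketch of this extraction, index preparation, and query formulation pipeline is given below. It is illustrative rather than the actual preprocessing script we used: the XML element name (claims), the lang attribute, the toy stop word list, and the helper functions are assumptions made for the example, and the output is written as TREC-style DOC/DOCNO/TEXT entries that Lemur's BuildIndex can read.

import string
from collections import defaultdict
from xml.etree import ElementTree as ET

# Toy stop word list for illustration only; the actual run used a stop word
# list for general English (available from the authors upon request).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "with",
             "by", "is", "are", "on", "at", "as", "that", "this"}


def extract_claims(xml_path):
    """Return the text of all English claims sections in one patent document.

    The element name 'claims' and the 'lang' attribute are assumptions about
    the corpus DTD, used here only for illustration.
    """
    tree = ET.parse(xml_path)
    sections = []
    for claims in tree.iter("claims"):
        if claims.get("lang", "EN").upper() == "EN":
            sections.append(" ".join(claims.itertext()))
    return " ".join(sections)


def to_trec_doc(patent_id, text):
    """Wrap one patent's concatenated claims in a TREC-style document entry,
    using the patent ID as the DOCID."""
    return f"<DOC>\n<DOCNO>{patent_id}</DOCNO>\n<TEXT>\n{text}\n</TEXT>\n</DOC>\n"


def build_corpus(corpus_files, out_path):
    """corpus_files: iterable of (patent_id, xml_path) pairs; one patent ID may
    occur several times because a patent consists of multiple XML documents."""
    claims_by_patent = defaultdict(list)
    for patent_id, xml_path in corpus_files:
        text = extract_claims(xml_path)
        if text.strip():
            claims_by_patent[patent_id].append(text)
    with open(out_path, "w", encoding="utf-8") as out:
        for patent_id, sections in claims_by_patent.items():
            # Concatenate all claims sections belonging to one patent ID.
            out.write(to_trec_doc(patent_id, " ".join(sections)))


def claims_to_query(claims_text):
    """Turn a topic's claims section into one long bag-of-words query:
    lowercase, strip punctuation, remove stop words; no further term selection."""
    table = str.maketrans("", "", string.punctuation)
    tokens = claims_text.lower().translate(table).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

Retrieval then amounts to running Lemur's RetEval with the TF-IDF retrieval model over the resulting index and the generated queries, keeping the top 100 patents per topic.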
3 Results

We performed runs for the Main Task and the English Task with the methodology described above. Since we used the same set-up for both runs, we obtained the same results; they are shown in Table 1. The first row shows the results obtained when all relevant citations are taken into consideration; the second row contains the results for the highly relevant citations only [2].

Table 1: Results for the CLEF-IP run 'ClaimsBOW' on the small topic set, using English claims sections for both the Main Task and the English Task. P and R denote precision and recall over the full result list of 100 documents; the suffixes 5, 10 and 100 denote cut-off ranks.

                   P       P5      P10     P100    R       R5      R10     R100    MAP     nDCG
  All              0.0129  0.0668  0.0494  0.0129  0.2201  0.0566  0.0815  0.2201  0.0540  0.4567
  Highly relevant  0.0080  0.0428  0.0314  0.0080  0.2479  0.0777  0.1074  0.2479  0.0646  0.4567

4 Discussion

Although the results that we obtained with our ClaimsBOW approach may seem poor at first sight, they are not bad compared to the results that were obtained in runs by other participants. In terms of nDCG, our run performs well (ranked 6th of 70 runs); in terms of MAP, our results are moderate (ranked around 35th of 70 runs). The low performance achieved by almost all runs (except for the one submitted by Humboldt University) shows that the task at hand is a difficult one.

References

[1] Kazuya Konishi. Query Terms Extraction from Patent Document for Invalidity Search. In Proceedings of the NTCIR-5 Workshop Meeting, pages 312–317, 2005.

[2] Florina Piroi, Giovanna Roda, and Veronika Zenz. CLEF-IP 2009 Evaluation Summary. Technical report, Information Retrieval Facility, 2009.

[3] Florina Piroi, Giovanna Roda, and Veronika Zenz. CLEF-IP 2009 Track Guidelines. Technical report, Information Retrieval Facility, 2009.