Passage Retrieval Starting from Patent Claims: A Clef-Ip 2013 Task Overview

Florina Piroi, Mihai Lupu, Allan Hanbury
Vienna University of Technology, Institute of Software Technology and Interactive Systems, Favoritenstrasse 9-11, 1040 Vienna, Austria

Abstract. Most of the searches a patent expert at a patent office does use boolean methods to query large databases of patent data. The Clef-Ip evaluation track is designed to experiment with information retrieval techniques in the patent domain. The data corpus in the Clef-Ip Lab consists of patent documents published by the European Patent Office. One of the main tasks in the Lab has been related to the Prior Art type of search performed by the patent experts at patent offices. The task went through various changes over the years, from using virtual patent documents as topics (in 2009) to actual patent application documents, and sets of claims from patent application documents (2012 and 2013). Relevance assessments for this task were based on Search Reports published by the European Patent Office. In this overview we report on the work we have done in organizing this retrieval task in 2013.

1 The Clef-Ip Passage Retrieval Task

The technological developments of our time are closely coupled with the patent system, which encourages inventors to make their ideas public in exchange for a monopoly on the invention for a limited period of time, up to 20 years. A patent can be seen as a contract between a government and the patent owner by which the latter can exclude other parties from manufacturing and exploiting the invention without permission. To obtain a patent, one of the main requirements is that the invention is new. To verify this, extensive searches, not only in the patent repositories, but also in specialized literature, conference publications, etc., must be done thoroughly.
The amount of data to be searched, as well as the fact that many publications are now digitized, means that search operations cannot be done without the help of computers. With the tasks organized in Clef-Ip over the years we investigate how current IR solutions may serve the needs of patent experts doing novelty searches. This task, in particular, is meant to explore the approaches that IR systems may offer when faced with finding specific pieces of text that are relevant to a given patent claim.

We briefly present here the process of obtaining a patent, with a focus on the European Patent Office (Epo [2]). To obtain a patent, a patent application must be registered with a patent office. A patent application contains an abstract, a title, a detailed description of the invention, drawings (if necessary) and a set of claims that define the extent of the protection aimed for. An applicant will also cite previously published patents that are considered relevant to the described invention. At the Epo, applications can be made in any language. Given that the official languages at the Epo are English, French, and German, whenever another language is used in an application, a translation into one of these three languages must be provided. Once the application is registered at the patent office, it is examined to verify that it is novel, that it involves an inventive step, and that it is realizable. During these examinations, at the Epo, a European search report is prepared which lists all the relevant documents found (called patent citations). The Epo publishes patent applications together with their search reports within 18 months of the filing date. If the patent applicant, based on the search report, decides to pursue the patent, a sequence of communications between the applicant and the patent office takes place. Usually, during this process, the claims are adjusted so as not to conflict with existing patents.
The European search report is mainly based on the application claims and, more often than not, specifies not only the documents relevant to the (various) claims, but also the passages of particular importance to them. Knowing this, the Passage Retrieval Task Starting from Claims was designed to investigate the effectiveness of Information Retrieval (IR) methods in finding relevant documents and marking passages particularly pertinent to a set of claims.

2 The Clef-Ip Corpus

The Clef-Ip corpus was distributed as a collection of over 3 million Xml documents pertaining to over 1.5 million patents published by the Epo and the World Intellectual Property Organization (Wipo) prior to 2002 [8]. The Clef-Ip corpus is an extract of the larger Marec collection1, which uses a common normalized Xml data format to represent patent documents published by the Epo, Wipo, US Patent and Trademark Office, and Japan Patent Office. We do not describe the collection content here, but direct the reader to the previous publications that detail it ([7,9]).

3 Task Topics

The Passage Retrieval from Claims Task closely models the novelty search done by patent examiners at the Epo. Topics in this task are sets of claims extracted from actual patent application documents published by the Epo after 2002. Participants had to return passages that are relevant to the topic claims. The passages must occur in the documents in the Clef-Ip collection. No other data was allowed to be used in preparing for this task. To select the topics for this task we first had to select the patent application documents out of which we could then select various sets of claims.

1 The MAtrixware REsearch Collection. http://ifs.tuwien.ac.at/imp/marec
We first selected a pool of candidate application documents from the Marec collection with a few restrictions:

- the document must be published after 2002 (that is, it is not part of the Clef-Ip corpus);
- the document must be published by the Epo (recall that Marec also contains patents published by the US office, by the Wipo, and by the Japanese office);
- the application should contain at least 3 citations and at most 10. This is because the number of patent documents with more than 10 citations in the search report is very small when compared to the number of patents with fewer than 10 citations. An additional reason for choosing the upper limit is a pragmatic one: patents with more than 10 citations proved to be more difficult and time consuming to process when extracting the relevance judgements;
- the application document does not miss content, that is, it indeed has a description, an abstract and a claims section. We mention here that, according to the Patent Cooperation Treaty [1], for patent applications that are filed first at the Wipo and then at the Epo, the Epo does not publish an additional application document, but only a bibliographic entry that points to the original Wipo application. In terms of Xml representation, this translates into an Xml document that has neither a description, nor an abstract, nor a claims section;
- the document does not count more than 300,000 words. Setting this limit allowed us to avoid selecting patent application documents that are more than 100 pages long. The rationale behind this decision is that, from past experience, task participants sometimes used full patent documents as queries2, and it has been shown that some retrieval algorithms do not cope well with large queries [5];
- the application document has at least one family member (a patent document published at another patent office) which was filed prior to the document in the pool. This last restriction is an addition to the task organized in 2012.
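For illustration, the restrictions above can be collected into a single filter predicate. This is a sketch only: the field names below are hypothetical stand-ins for whatever metadata is extracted from the Marec records, not an actual schema.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One application document in the pool (all field names hypothetical)."""
    pub_year: int          # publication year
    office: str            # publishing office: "EP", "WO", "US", "JP"
    n_citations: int       # citations in the search report
    has_content: bool      # description, abstract and claims all present
    n_words: int           # document length in words
    n_earlier_family: int  # family members filed before this document

def eligible(c: Candidate) -> bool:
    """True iff the document satisfies all pool restrictions."""
    return (
        c.pub_year >= 2002          # not part of the pre-2002 Clef-Ip corpus
        and c.office == "EP"        # published by the Epo
        and 3 <= c.n_citations <= 10
        and c.has_content
        and c.n_words <= 300_000
        and c.n_earlier_family >= 1
    )

kept = Candidate(2005, "EP", 5, True, 40_000, 1)
dropped = Candidate(2005, "EP", 12, True, 40_000, 1)  # too many citations
print(eligible(kept), eligible(dropped))  # → True False
```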
It is, however, an addition that models a widely used practice of the patent examiners, which consists in pulling out everything that was already done at other patent offices with regard to a patent application they have in front of them, before they start their own search.

After applying all these restrictions, we ended up with a pool of over 300,000 patent application documents. The next step was to sample documents from this pool and extract sets of claims to be topics. The sampling was done randomly, with one restriction, however. Some technological areas are overrepresented in the patent corpus. For example, patents in the pharmaceutical domain are more numerous than in other technical domains. Because we intended to have a relatively uniform distribution of the citation numbers of the topic documents, we first grouped the documents in the pool by the number of citations in the search report and in the Clef-Ip collection. We then randomly selected 20 patent application documents from each group with the restriction that each document belongs to a different Ipc class3. We did this three times: once extracting English application documents, once German, and once French application documents. This gave us a pool of over 460 patent application documents. Out of this smaller pool we randomly inspected over 200 documents, over 60 in each Epo language, to extract claim sets for our topics.

2 Although it may have benefits in an IR sense, no patent expert would actually use such a solution.

As mentioned in the previous section, a patent application document contains a claims section which defines the extent of the legal protection for the described invention. The claims section is a list of sentences (claims) which, for ease of reference, are numbered. Below is an example of the first 8 claims in the application document of patent WO-02058006.

What Is Claimed Is:

1.
In a paint roller having an inner resilient cylindrical core and an outer annular surface contact material, the outer annular surface contact material forming a paint roll medium that is fixedly attached to the resilient core, the resilient core and paint roll medium rotating about an axis of said cylindrical core; the improvement wherein the paint roll medium is a hydroentangled three-dimensional imaged nonwoven fabric.

2. An imaged nonwoven fabric of claim 1, wherein the fabric is formed from a precursor web comprised of staple length fibers.

3. An imaged nonwoven fabric of claim 2, wherein the staple length fibers include surface modification agents.

4. An imaged nonwoven fabric of claim 3, wherein the surface modification agents are selected from the group consisting of hydrophobic modifiers and hydrophilic modifiers.

5. An imaged nonwoven fabric of claim 2, wherein the staple length fibers include the incorporation of melt additives.

6. An imaged nonwoven fabric of claim 5, wherein the melt additives are selected from the group consisting of hydrophobic modifiers and hydrophilic modifiers.

7. An imaged nonwoven fabric of claim 2, wherein the staple length fibers are selected from the group consisting of thermoplastic polymers, thermoset polymers, natural fibers, and blends thereof.

3 Ipc (International Patent Classification System) is a classification system that groups patents by their technological area. Ipc is hierarchically organized in sections, classes, subclasses, groups and subgroups. There are 8 sections, 121 classes, and over 630 subclasses in this classification system.

A patent may belong to several technological subareas. Because the relevance judgements for this task are based on European search reports, when selecting the topics we had to inspect, for each application document in the pool, its search report (an example of a search report is shown in Figure 1). A European search report usually has 4 columns.
The second column lists the relevant documents (patent citations) together with relevant passages, images, etc. The first column marks the relevance category of the citation, with X and Y being citations that destroy the novelty in the patent application, and A being citations that offer background information on the invention but do not destroy its novelty or inventive step. The third column in a European search report lists the claim numbers to which the patent citations pertain.

Fig. 1. Extract from a search report.

For a patent application document we inspected each patent citation that occurred in our corpus4. We noted the claim numbers it referred to and the relevant passage information. When the relevant passage information was acceptable, that is, it referred to lines of text and not to figures or whole documents, we retained the set of claims to be a topic in our task. We also took care that the search reports were complete, in the sense that the patent examiner did his search for all the claims in the patent application. When this was not the case, the search reports contain a notice on this fact, and we could eliminate these cases from our pool.

Using this procedure, we could extract several topics from one patent application document. It was often the case that each topic extracted from one patent application document had its own set of relevant documents and passages, and that the sets of relevant documents didn't always overlap. From the over 200 patent application documents inspected we were able to extract 149 topics from 69 patent documents. From the 149 topics distributed to the participants, we later removed topics 78 and 101 for being erroneous.

4 Not all patent citations in a European search report occur in the Clef-Ip corpus.
The structure of a Clef-Ip topic is as follows:

tid tfile tfam-docs tclaims

where

- tid is the topic identifier;
- tfile is the Xml file which stores the patent application out of which the topic claims were extracted;
- tfam-docs contains the Xml files that are part of the source patent's family and published prior to the source patent document;
- tclaims is the list of XPaths to the claims selected as topic from the source patent document.

Below is an example of a topic in the Clef-Ip 2013 Passage Retrieval Task:

PSG-22 EP-1267498-A1.xml FI-111300-B1.xml,FI-20011095-D0.xml,FI-20011095-A.xml /patent-document/claims/claim[1] /patent-document/claims/claim[2] /patent-document/claims/claim[3] /patent-document/claims/claim[4] /patent-document/claims/claim[5] /patent-document/claims/claim[6] /patent-document/claims/claim[7] /patent-document/claims/claim[8] /patent-document/claims/claim[9] /patent-document/claims/claim[10] /patent-document/claims/claim[11]

In the topic set distributed to the participants, the patent application documents from which the claims were extracted, and the previously published family member documents, were also available, so that participants could use them to extend the original queries extracted from the claims.

4 Relevance Judgements

Using patent data in evaluation campaigns has one disadvantage when compared to other campaigns: to obtain relevance assessments as in real-life patent searches, experts in the various technological domains are needed. The budget of a research project cannot afford employing them to provide judgements, and voluntary participation in creating assessments is not an option for most patent experts. Despite this disadvantage, we are in the very happy situation that relevance judgements of a kind already exist, in the form of patent search reports5.
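To illustrate how such a topic is consumed, the sketch below parses one topic line and resolves a claim XPath against a toy patent document. Python's ElementTree supports the needed XPath subset once the absolute /patent-document prefix is rewritten; the toy Xml nesting here is our own assumption, not the Marec format.

```python
import xml.etree.ElementTree as ET

# One topic line: id, source file, comma-separated family files, claim XPaths.
topic_line = ("PSG-22 EP-1267498-A1.xml "
              "FI-111300-B1.xml,FI-20011095-D0.xml,FI-20011095-A.xml "
              "/patent-document/claims/claim[1] /patent-document/claims/claim[2]")

tid, tfile, fam, *tclaims = topic_line.split()
tfam_docs = fam.split(",")

# Toy stand-in for the source application document (element nesting assumed).
doc = ET.fromstring(
    "<patent-document><claims>"
    "<claim>A paint roller with a nonwoven cover.</claim>"
    "<claim>The roller of claim 1 with staple fibers.</claim>"
    "</claims></patent-document>"
)

def resolve(root, xpath):
    """Resolve an absolute topic XPath; ElementTree wants a relative path."""
    node = root.find(xpath.replace("/patent-document/", "./", 1))
    return "".join(node.itertext()).strip()

print(tid, len(tfam_docs))       # → PSG-22 3
print(resolve(doc, tclaims[1]))  # → The roller of claim 1 with staple fibers.
```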
All Clef-Ip campaigns used, in one form or another, the search reports to extract relevance assessments. We did the same this year. The difficulty in getting the qrels for our topics in 2013 (and in 2012) is that, although patent citations can easily be obtained in some machine-processable form, relevant passages cannot. Therefore, the relevant passage information extraction was done by manual inspection of the search reports and of the cited documents, and by matching them with the textual content of the relevant documents in the Clef-Ip collection. This proved to be a tedious process, so we developed a system to assist us in selecting the relevant pieces of text from the Xml documents in our collection. The system was also used in 2012 and is described in [9] and [8]. We briefly present here its main features. We see in Figure 2 that the qrel generating system has three main areas:

- a topic description area where, after typing in the patent application document identifier, we can assign the topic an identifier (unique in the system), define the set of claims in the topic, save it, and navigate among its relevant documents with the `Prev' and `Next' buttons;
- a qrel display area where we see the currently selected relevant passages and can save them. Also in this area we give a direct link to the application document on the Epo Patent Register server, which, in turn, gives us a quick link to the document's search report;
- a qrel definition area where individual passages (corresponding to XPaths in the Xml documents) are displayed. Clicking on them will select them to be part of the topic's qrels. For convenience, we provide three buttons by which we can select with one click all of the abstract's, description's or claims' passages.

5 Experiments using citation information to design retrieval experiments have also been done in areas other than the patent domain. See for example [11].
When clicking on the `Save QREL' button the selected passages are saved in the database as relevant passages for the topic in work. The relevance judgements created contained both relevant documents and relevant passages in them. Though the documents could be differentiated by degrees of relevance, due to their categories in the search reports (X, Y, A), the passages were all considered equally relevant. Below is an excerpt from the qrel files obtained with the help of our system:

PSG-5 EP-1078736-A1 /patent-document/description/p[20]
PSG-5 EP-1078736-A1 /patent-document/description/p[21]
PSG-5 EP-1078736-A1 /patent-document/description/p[18]
PSG-5 EP-1078736-A1 /patent-document/description/p[15]
PSG-5 EP-1078736-A1 /patent-document/claims/claim[1]
PSG-5 EP-1078736-A1 /patent-document/abstract/p
PSG-5 EP-1078736-A1 /patent-document/claims/claim[2]
...

Fig. 2. A system for finding and saving relevant passages.

5 Submissions and Evaluations

5.1 Submissions to the Task

The submission format for the passage retrieval task required participants to submit text files with retrieval results similar to the qrel format shown above. The number of documents considered relevant per topic had to be limited to 100; the number of relevant passages in a document was not limited. In addition to the qrel format, the participant submissions had two more columns, one to specify the order of the results, and another to specify the retrieval score of a passage/document.

Three participants submitted experiments to the Passage Retrieval task, two of them also including relevant passages in their submissions. In their experiments a two-step approach was used. In the first step, relevant documents were retrieved using various retrieval solutions including Okapi BM25, Language Models, TF-IDF, and Vector Space Models.
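The whitespace-separated qrel format shown above (topic, document, passage XPath) is straightforward to load into a nested lookup structure; a minimal sketch:

```python
from collections import defaultdict

# A few lines in the qrel format: topic, document, relevant passage XPath.
qrel_text = """\
PSG-5 EP-1078736-A1 /patent-document/description/p[20]
PSG-5 EP-1078736-A1 /patent-document/description/p[21]
PSG-5 EP-1078736-A1 /patent-document/claims/claim[1]
PSG-5 EP-1078736-A1 /patent-document/abstract/p
"""

# qrels[topic][document] -> set of relevant passage XPaths
qrels = defaultdict(lambda: defaultdict(set))
for line in qrel_text.splitlines():
    topic, doc, xpath = line.split()
    qrels[topic][doc].add(xpath)

print(len(qrels["PSG-5"]["EP-1078736-A1"]))  # → 4
```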
The participant from Georgetown University (USA) experimented with various sources for query terms by extracting words from claims and titles, using hyphenated phrases, Part-of-Speech tagging and weighted filtering [4]. The team from Innovandio S.A. (Chile) also experimented with CL-ESA, a Wikipedia-based multilingual retrieval model ([10], [8], section 3). The third participant in the task, a team of researchers from Vienna University of Technology and the University of Macedonia, Thessaloniki, used a distributed IR system that queried a split Clef-Ip collection. The split is done by exploiting the hierarchical structure of the International Patent Classification System (Ipc). By dividing the collection into several sub-collections (by Ipc class, subclass, and subgroup) the patents are organized according to their technological topic. Then the Lemur indexer was used to index the title, abstract, description, claims, inventor, applicant and Ipc class information [3]. CORI and a multilayer method were used for selecting the sources (sub-collections) on which the retrieval should be performed, as well as for joining the results.

In the figures below, the submission files prefixed by `In' belong to the participant from Chile, the submission files prefixed by `GU' belong to the participant from Georgetown University, and the ones prefixed by `TM' were sent in by the team from Vienna and Thessaloniki.

5.2 Evaluating the Retrieval Results

Three participants submitted a total of 19 runs. Out of these, 8 runs did not provide retrieved passages. We did evaluations at two levels: one at the passage level and one at the patent document level. The evaluation at patent document level was done, as in the previous years, by computing the Recall, Map, and PRES ([6]) at cutoff 100. At the passage level we first computed, for each relevant document retrieved, the precision and average precision w.r.t. the passages retrieved, then averaged these over the number of relevant documents per topic.
Finally, averaging these scores over all topics, we obtain the precision and mean average precision scores at the passage level. The evaluation script is available for download on the Clef-Ip project website6.

Several simple file clean-up operations had to be done in order to ensure that the document encodings matched the format expected by the evaluation script. These operations included duplicate removal, re-grouping the retrieval results such that results belonging to one topic were in a contiguous portion of the files, and removing the XPaths referring to headings in the patent document Xml files. This last operation was done because headings are not consistently marked as such in the Clef-Ip collection's documents, and were left out of the relevance judgements as well.

6 http://www.ifs.tuwien.ac.at/~clef-ip

Fig. 3. Evaluation results, ordered by Recall.
Fig. 4. Evaluation results, document level Recall per language.
Fig. 5. Second evaluation round, results ordered by Recall.

We then ran several evaluations depending on the degree of relevance assigned to the citation documents in the search reports. In each round we computed all of the measures mentioned above; we will not, however, present all of them. The first evaluation round considered all documents in the relevance judgements as equally relevant and did evaluations on four sets of topics: the set of all 147 topics, the subset of 50 English topics (1-50), the subset of 49 German topics (51-100), and the subset of 48 French topics (102-149). The results of these evaluations are shown in Figures 3 and 4. The zero values in the figures belong to the runs that did not contain relevant passages.

Next we were interested in the metric scores when only the highly relevant citation documents were considered, ignoring the applicant citations. From the 147 topics only 116 have highly relevant citations in the Clef-Ip corpus, so the new evaluation round was done for this smaller set.
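One plausible reading of the passage-level scoring described in Section 5.2 is sketched below: average precision over a document's ranked passages, averaged over a topic's relevant documents, then over all topics. This is a sketch under stated assumptions; the official evaluation script may differ in details such as the treatment of unretrieved relevant documents.

```python
def passage_ap(ranked, relevant):
    """Average precision of a ranked passage list against a relevant set."""
    hits, total = 0, 0.0
    for rank, passage in enumerate(ranked, start=1):
        if passage in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def passage_map(run, qrels):
    """run[topic][doc]: ranked passages; qrels[topic][doc]: relevant set.
    Unretrieved relevant documents contribute 0 (our assumption)."""
    per_topic = []
    for topic, rel_docs in qrels.items():
        aps = [passage_ap(run.get(topic, {}).get(doc, []), rel)
               for doc, rel in rel_docs.items()]
        per_topic.append(sum(aps) / len(aps))
    return sum(per_topic) / len(per_topic)

# Toy data: one topic, one relevant document, two relevant passages.
run = {"PSG-1": {"EP-1": ["p1", "p9", "p2"]}}
qrels = {"PSG-1": {"EP-1": {"p1", "p2"}}}
print(round(passage_map(run, qrels), 4))  # → 0.8333
```

Here the hits at ranks 1 and 3 give an AP of (1/1 + 2/3)/2 = 5/6 for the single relevant document, which is also the topic score and the final mean.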
Figures 5 and 6 show plots of the metrics for this smaller topic set, for the 38 English topics, the 42 German topics, and the 22 French topics in it.

To compare how the different retrieval strategies perform with respect to the different relevant documents required (highly relevant only, or both highly relevant and relevant) we computed a third round of evaluations, where we restricted the set of qrels used in the first round of evaluation to the 116 topics evaluated in the second round. Although we computed all the mentioned metrics for all three languages, we present only the overall Recall and Map(D) results for the 116 topics, in Figure 7.

6 Final Words

This paper presented the activities we carried out to organize the Passage Retrieval Starting from Patent Claims Task in Clef-Ip 2013. We started with selecting patent application documents and sets of claims in these documents that were our final topics. The most time consuming part of these activities has been extracting the XPaths to the relevant passages identified by patent experts in their search reports. Participants were not given any specific queries, but were allowed to build them out of the information provided in the topics: claims, patent application document, and previously published family member documents. Over 20 teams registered to submit retrieval experiments to this task, a number similar to the number of registrations in previous years. We received submissions from three groups, two of them with relevant passage information as well.

Fig. 6. Second evaluation round, document level Recall per language.
Fig. 7. Third evaluation round, document level Recall and Map(D).

Acknowledgements. This work was partly supported by the EU Network of Excellence PROMISE (FP7-258191) and the Austrian Research Promotion Agency (FFG) FIT-IT project IMPEx (No. 825846).

References

1. ***. Patent Cooperation Treaty. 1970. Last retrieved: March 2013.
2. ***.
Guidelines for Examination in the European Patent Office, 2012. www.epo.org/law-practice/legal-texts/guidelines.html, last retrieved June 2013.
3. Anastasia Giachanou, Michail Salampasis, Maya Satratzemi, and Nikolaos Samaras. Report on the CLEF-IP 2013 Experiments: Multilayer Collection Selection on Topically Organized Patents. In CLEF (Notebook Papers/LABs/Workshops), 2013.
4. Jiyun Luo and Hui Yang. Query Formulation for Prior Art Search - Georgetown University at CLEF-IP 2013. In CLEF (Notebook Papers/LABs/Workshops), 2013.
5. Yuanhua Lv and ChengXiang Zhai. When documents are very long, BM25 fails! In Wei-Ying Ma, Jian-Yun Nie, Ricardo A. Baeza-Yates, Tat-Seng Chua, and W. Bruce Croft, editors, Proceedings of SIGIR, pages 1103-1104. ACM, 2011.
6. W. Magdy and G. J. F. Jones. PRES: A score metric for evaluating recall-oriented information retrieval applications. In SIGIR 2010, 2010.
7. F. Piroi, M. Lupu, A. Hanbury, and V. Zenz. CLEF-IP 2011: Retrieval in the intellectual property domain, September 2011.
8. Florina Piroi, Mihai Lupu, and Allan Hanbury. Overview of CLEF-IP 2013 Lab: Information Retrieval in the Patent Domain. In Proceedings of CLEF 2013, Lecture Notes in Computer Science, 2013. To appear.
9. Florina Piroi, Mihai Lupu, Allan Hanbury, Alan P. Sexton, Walid Magdy, and Igor V. Filippov. CLEF-IP 2012: Retrieval experiments in the intellectual property domain. In CLEF (Online Working Notes/Labs/Workshop), 2012.
10. Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-based multilingual retrieval model. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, Proceedings of ECIR, volume 4956 of Lecture Notes in Computer Science, pages 522-530. Springer, 2008.
11. A. Ritchie, S. Teufel, and S. Robertson. Creating a Test Collection for Citation-based IR Experiments.
In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2006.