Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent Search Task

Walid Magdy, Gareth J.F. Jones
Centre for Next Generation Localisation
School of Computing, Dublin City University, Dublin 9, Ireland
{wmagdy, gjones}@computing.dcu.ie

Abstract. We present our experiments and results for the DCU CNGL participation in the CLEF-IP 2010 Candidate Patent Search Task. Our work applied standard information retrieval (IR) techniques to patent search. In addition, a very simple citation extraction method was applied to improve the results. This was our second consecutive participation in the CLEF-IP tasks. Our experiments in 2009 showed that many sophisticated approaches to IR do not improve retrieval effectiveness for this task. For this reason we decided to apply only simple methods in 2010. These were demonstrated to be highly competitive with other participants. DCU submitted three runs for the Prior Art Candidate Search Task; two of these runs achieved the second and third ranks among the 25 runs submitted by nine different participants. Our best run achieved a MAP of 0.203, recall of 0.618, and PRES of 0.523.

Keywords: Patent Retrieval; Query Formulation; CLEF-IP track

1 Introduction

The Centre for Next Generation Localisation (CNGL) at Dublin City University (DCU) participated in the CLEF-IP 2010 Candidate Patent Search Task using the KISS principle. KISS stands for "keep it simple and straightforward", which describes the methods adopted in our submissions for CLEF-IP 2010. Our participation used standard IR approaches together with a very simple information extraction technique. The aim of the task is to automatically retrieve all citations for a given patent (which is considered as the topic) [3], [7]. We submitted three runs for the task: one is based on citation extraction from patent application descriptions, another uses simple information retrieval techniques to search the patent document collection using the patent topic, and the final one is a combination of the first two methods. Nine participants submitted 25 runs in total for this task. These were evaluated using three metrics: mean average precision (MAP), recall, and the patent retrieval evaluation score (PRES). Our best run (the third one) was ranked second on all three scores among all runs submitted by participants.

The paper is organized as follows: Section 2 gives an overview of the patent data collection provided by the track organizers, Section 3 gives full details of the experimental setup for our participation, Section 4 reports the results with some analysis, and finally Section 5 concludes the paper and suggests possible future directions.

2 Data Collection

Fig. 1. Distribution of the patent collection by publishing year (number of patents per year, 1978-2009)

For this task the organizers provided more than 2.68M XML documents representing different versions of 1.35M patents filed between 1978 and 2009 (see Figure 1). For our experiments, all the different document versions of a single patent were merged into a single document, with each field updated from its latest version.
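As an illustration of this merging step, the following minimal sketch (our own reconstruction, not the code actually used) groups parsed version documents by patent number and lets later versions overwrite earlier field values without blanking out fields that a later version lacks; the field names and the 'version' marker are assumptions.

```python
from collections import defaultdict

FIELDS = ("title", "abstract", "description", "claims", "classifications")

def merge_versions(version_docs):
    # version_docs: list of dicts, each with 'patent_id', a sortable 'version'
    # marker (e.g. publication date or kind code) and some of the FIELDS above.
    merged = defaultdict(dict)
    for doc in sorted(version_docs, key=lambda d: d["version"]):  # oldest first
        target = merged[doc["patent_id"]]
        for field in FIELDS:
            if doc.get(field):              # later versions overwrite earlier ones,
                target[field] = doc[field]  # but never erase an existing field
    return dict(merged)
```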
The patent structure is very rich; some fields, such as the 'title' and 'claims', are present in three languages (English "EN", German "DE", and French "FR"). In addition, non-English patents include an English translation of the abstract. Only the patent 'title', 'abstract', 'description', 'claims', and 'classifications' fields are extracted from the patents. Some patents lack some of these fields. The only fields that are present in all patents are the 'title' and the 'classifications'. The 'description' field is related to the 'claims' field: if the 'claims' field is missing, then the 'description' is missing too. However, the opposite is not true, as some documents contain the 'claims' field while the 'description' field is missing. The 'abstract' field is optional and only present in some patents.

68% of the patents in the collection are English, 24% are German, and 8% are French. In keeping with our KISS approach of avoiding complications with language processing, only the English fields were used for indexing. Hence, English patents were fully indexed, but for German and French patents the title, abstract and claims were only indexed where an English translation was available. This simplification meant that for 32% of the collection (the non-English portion) the description field was not indexed. Additionally, as mentioned above, since some sections were already missing, some patents had no English field content to be indexed except the title, or the title and abstract. Figure 2 shows the content that has been indexed for the patent collection based on the fields present. It can be seen that 22% of the patents have only the title indexed, which makes them effectively very short indexed documents since they consist of the title alone. Additionally, 10% of the patents have only the title and abstract indexed, and 16% have only the title and claims indexed, with some of them having the abstract section as well. Only 52% of the patents have nearly the full document indexed, where the title, description, and claims sections are present. This 52% of the collection corresponds to the English patents in which all the main sections are found.

Fig. 2. Proportions of patents in the collection indexed by field combination (title only: 22%; title+abstract: 10%; title+claims[+abstract]: 16%; title+description+claims[+abstract]: 52%)

3 Experimental Setup

In this section we describe the details of our experimental setup for our three submitted runs. The first run uses a very standard information retrieval method in which the patent topics were used to search the indexed collection after translating the non-English topics into English. The second run involves extracting the patent citations disclosed within the description of the patent topic, without the use of any IR techniques. Finally, the third run is generated by merging the first two runs.

3.1 IR Experiment

Text pre-processing. Patent text contains many formulas, numeric references, chemical symbols, and patent-specific words (such as method, system, or device) that can have a negative effect on the information retrieval process. In order to minimise these problems the text was filtered to remove predefined stop words (using the stop word list from http://members.unine.ch/jacques.savoy/clef/index.html), digits, and field-specific stop words. To obtain the stop words for each field, the field frequency for each term in each field was calculated separately. The field frequency of a term "T" in field "X" is the number of fields of type "X" across all documents containing the term "T". For each field, all terms with a field frequency higher than 5% of the highest term field frequency for that field were considered to be stop words [4]. For example, for the 'title' field, the following words were identified as stop words: method, device, apparatus, process, etc.; for another field such as 'claims', the following words were identified as stop words: claim, according, wherein, said, etc.
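The following short sketch illustrates this field-specific stop word selection with the 5% threshold; the data layout (document ID mapped to tokenised fields) is an assumption, and real tokenisation and stemming are omitted.

```python
from collections import Counter

def field_stop_words(docs, field, threshold=0.05):
    # docs: dict mapping a document ID to a dict of field name -> token list.
    field_freq = Counter()
    for doc in docs.values():
        tokens = doc.get(field)
        if tokens:
            # count each term once per field instance (document frequency per field)
            field_freq.update(set(tokens))
    if not field_freq:
        return set()
    max_freq = max(field_freq.values())
    # terms occurring in more than 5% of the most frequent term's count are stop words
    return {term for term, freq in field_freq.items() if freq > threshold * max_freq}
```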
In addition to stop word removal, Porter stemming was applied to the text in order to normalize different surface forms of a given word [6].

Indexing. The Indri search toolkit [8] was used to index the extracted English parts of the patent collection. In the indexing process, the text of the following sections, where they existed in English, was included in the index:
1. Title
2. Abstract
3. Description
4. Claims
In addition, the patent IPC classification [9] was included in the index to be used later for filtering the retrieved results in the search process. Only the top three classification levels were retained for the filtering process, with the deeper levels being discarded (for example: B01J, C01G, C22B). Further fields in the patent were not used; these included the fields which carry logistical information such as the patent filing date, institute, inventor's name and address, etc. Nevertheless, the 'inventors' field was tested using the training data to assess its effectiveness in retrieving relevant documents. Results of these investigations showed it to have a weak effect on the quality of search. Hence, it was discarded from the index.

Translating non-English topics. 2,005 patent topics were provided: 1,351 English, 520 German, and 134 French topics. Five of these topics were later excluded by the track organizers. Since the index was built only in English, German and French topics were translated into English using Google translate (http://translate.google.com/). The 'title', 'description', and 'claims' sections were translated into English, while the 'abstract' field already had an English translation.

Query formulation. One of the major challenges in patent retrieval is query formulation [2], [7]. As a full patent is taken to be the topic, extracting the best representative text with the proper weights is key to achieving good retrieval results. Earlier experiments from our participation in CLEF-IP 2009 showed that using the full patent text to search the collection achieves the best results, especially after filtering out all results which have no overlap with the topic in the 'classification' section (only the first three levels of classification are used) [4]. The same setup was used this year by forming the query as follows:
- Unigram tokens were extracted from the 'description' field after stemming and stop word removal.
- Bigram tokens of frequency higher than 3 were extracted from all the patent fields combined and added to the query after stemming and stop word removal.
The extracted queries can be rather long; however, this formulation proved to be the best in our experiments on the training set. This first run is called the "IR" run in our submitted runs.
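The sketch below illustrates this query formulation and the classification-based filtering of the retrieved results. It is a simplified reconstruction under stated assumptions: preprocess is a crude stand-in for the real stop word removal and Porter stemming pipeline, and the Indri query syntax itself is not shown.

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "for", "said", "wherein", "claim"}

def preprocess(text):
    # crude stand-in for the real pipeline (stop word removal + Porter stemming)
    return [t for t in text.lower().split() if t.isalpha() and t not in STOP]

def formulate_query(topic):
    # unigrams from the description only
    terms = preprocess(topic["description"])
    # bigrams with frequency > 3 from all fields combined
    all_tokens = preprocess(" ".join(topic.get(f, "") for f in
                                     ("title", "abstract", "description", "claims")))
    bigrams = Counter(zip(all_tokens, all_tokens[1:]))
    terms += ["%s %s" % bg for bg, freq in bigrams.items() if freq > 3]
    return " ".join(terms)

def ipc_filter(results, topic_ipc):
    # keep only results sharing an IPC code truncated to the first three levels
    # (section + class + subclass, e.g. "B01J")
    topic_codes = {code[:4] for code in topic_ipc}
    return [r for r in results if topic_codes & {code[:4] for code in r["ipc"]}]
```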
3.2 Citation Extraction

One of the features of patents is the presence of some of the cited patent numbers within the text of the description. These patent numbers were not filtered out of the text of the patent topics, which can be seen as the presence of part of the answer within the question. Despite this fact, we have not focused on building extra experiments based on this information, since in a real-life patent search situation this information is not always present in the patent application, and hence building results on it could lead to misleading conclusions in the area of patent retrieval. However, the effect of adding this information to the tested methods is reported in the experimental results to demonstrate its impact. The results show that a misleadingly high MAP can be achieved, but with a very low recall. This observation is significant since recall is usually the main objective for patent retrieval tasks.

For the large topic collection containing 2,005 patent topics, 2,307 citations were extracted from 771 patent topics and found to be IDs of patents in the indexed collection. Other extracted citations that do not exist in the collection were discarded. The extracted citations were put into the TREC format to form the second run submitted to the CLEF-IP track, with the ID "Cit", standing for "citation". The results list of the first run was appended to the citation list, after removing duplicates, to form our third run submitted to the track; its run ID is "IR+Cit".
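As an illustration only, the sketch below extracts candidate patent numbers from a topic description with a simple regular expression, keeps only those present in the indexed collection, and then appends the IR ranking after the citations with duplicates removed. The pattern and the ID normalisation are assumptions; the actual matching rules are not detailed above.

```python
import re

# Illustrative pattern only; real citations appear in many formats
# (e.g. "EP 0 123 456 A1", "US 4,123,456"), so the actual extractor differs.
PATENT_NO = re.compile(r"\b(EP|US|WO|DE|FR|GB)[\s-]?([\d][\d ,]{5,10}\d)", re.I)

def extract_citations(description, collection_ids):
    cited = []
    for country, number in PATENT_NO.findall(description):
        pid = country.upper() + re.sub(r"[\s,]", "", number)
        if pid in collection_ids and pid not in cited:
            cited.append(pid)           # keep only IDs that exist in the index
    return cited

def merge_runs(citations, ir_results):
    # "IR+Cit": citation hits first, then the IR ranking with duplicates removed
    return list(citations) + [doc for doc in ir_results if doc not in citations]
```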
4 Results

Several evaluation scores have been used for evaluating the submitted runs. Here, we focus on five scores: mean average precision (MAP), recall, recall@100, the patent retrieval evaluation score (PRES) [5], and PRES@100. MAP and recall have so far been the two main metrics used for evaluating this task. However, our recently introduced PRES metric is designed to be a dedicated score for recall-oriented IR applications, such as patent search. PRES reflects the quality of the system in retrieving a large portion of the relevant documents at relatively high ranks, based on a user-specified cut-off (Nmax) [5]. This is the reason behind using the cut-off of 100, as it has been shown in [1] that the average number of documents to be checked by a patent examiner is 100. In addition, PRES and recall are also calculated at the cut-off specified by the track organizers (1000).

Table 1 shows the results for our three submitted runs on the large topic collection (2,000 topics). The table shows two extreme runs, namely the "IR" run and the "Cit" run. The "IR" run achieved high recall and moderate precision. On the other hand, the "Cit" run achieved very high precision but very low recall. Although the MAP of the "IR" run is higher than that of the "Cit" run, the "Cit" run is in fact very precise: as mentioned in Section 3.2, only 771 topics out of the 2,000 had citations extracted, which means that the MAP over these topics alone is about 0.3. The last run, "IR+Cit", achieves the highest recall and precision since it comes from a simple merging of the two previous runs. The PRES and PRES@100 scores reflect both the recall and the quality of the ranking of the system. The "IR+Cit" and "IR" runs achieved the second and third best results among the 25 runs submitted to the track according to PRES and PRES@100.

Table 1. MAP, recall, recall@100, PRES, and PRES@100 for the three submitted runs in CLEF-IP 2010.

Run      MAP     R       R@100   PRES    PRES@100
IR       0.1216  0.57    0.3036  0.4614  0.228
Cit      0.112   0.1187  0.1187  0.1186  0.1176
IR+Cit   0.2029  0.618   0.3846  0.5229  0.3162

5 Conclusion and Future Work

In this paper, we have described our participation in the CLEF-IP 2010 Prior Art Patent Search Task. Three runs were submitted to the track, with one of them being the second best among the 25 runs submitted by 9 participants in this task. The three runs represent very simple and straightforward approaches for achieving high effectiveness in this task. Our run using standard IR techniques achieved the third highest performance among all submitted runs according to recall and PRES. Our second run, using straightforward citation extraction from the patent topics, achieved very high precision. The third run, which is a very simple merging of the first two runs, achieved both high recall and precision (both reflected in PRES), making it the second best run among the 25 runs submitted by the participants.

For future work, utilizing the information in the automatically extracted citations is an interesting avenue for investigation. Semi-pseudo relevance feedback can be applied by extracting additional terms from these citations to help improve the results. In addition, different approaches for translating the non-English patents can be tested, since further investigation showed that retrieval performance for the non-English topics was relatively lower than that for the English ones.

6 Acknowledgment

This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (CNGL) at Dublin City University.

7 References

[1] Azzopardi, L., H. Joho, and W. Vanderbauwhede. A Survey on Patent Users Search Behavior, Search Functionality and System Requirements. IRF Report 2010-00001 (2010)
[2] Fujii, A., M. Iwayama, and N. Kando. Overview of patent retrieval task at NTCIR-4. In Proceedings of the Fourth NTCIR Workshop on Evaluation of Information Retrieval, Automatic Text Summarization and Question Answering, June 2–4, Tokyo, Japan (2004)
[3] Graf, E. and L. Azzopardi. A methodology for building a patent test collection for prior art search. In Proceedings of the Second International Workshop on Evaluating Information Access (EVIA) (2008)
[4] Magdy, W., J. Leveling, and G. J. F. Jones. Exploring Structured Documents and Query Formulation Techniques for Patent Retrieval. In CLEF Working Notes 2009, Corfu, Greece (2009)
[5] Magdy, W. and G. J. F. Jones. PRES: A Score Metric for Evaluating Recall-Oriented Information Retrieval Applications. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, pp. 611–618 (2010)
[6] Porter, M.F. An Algorithm for Suffix Stripping. Program 14(3), pp. 130–137 (1980)
[7] Roda, G., J. Tait, F. Piroi, and V. Zenz. CLEF-IP 2009: retrieval experiments in the Intellectual Property domain. In CLEF Working Notes 2009, Corfu, Greece (2009)
[8] Strohman, T., D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligence Analysis (2004)
[9] IPC (International Patent Classification): http://www.epo.org/patents/patent-information/ipc-reform.html