
Combining document representations for prior-art retrieval

Eva D'hondt, Suzan Verberne (Information Foraging Lab, Radboud University Nijmegen)
Wouter Alink, Roberto Cornacchia (Spinque)

In this paper we report on our participation in the CLEF-IP 2011 prior art retrieval task. We investigated whether adding syntactic information in the form of dependency triples to a bag-of-words representation could lead to improvements in patent retrieval. In our experiments, we investigated this effect on the title, abstract and first 400 words of the description section. The experiments were conducted in the Spinque framework, with which we tried to optimize the combinations of text representations and document sections. We found that adding triples did not improve overall MAP scores compared to the baseline bag-of-words approach, but did result in slightly higher set recall scores. In future work we will extend our experiments to use all the text sections of the patent documents and fine-tune the mixture weights.

of words in prior-art retrieval; (b) Optimizing the combination of the different document sections and text representations.

2 Data Description

The CLEF-IP 2011 corpus, a part of the MAREC collection, was provided by the IRF [1] and contains approximately 3 million documents pertaining to more than 1 million patents¹. Most documents (2.6 million) came from the European Patent Office (EPO) and a smaller subset (around 400,000) consisted of patent documents from the World Intellectual Property Organization (WIPO). The patent documents were stored in the IRF XML format [5]. A patent document contains metadata such as the name of the inventor, the IPC-R code and the date of application, as well as (a mixture of) English, German or French text sections for the title, abstract, claims and/or description sections of the patent. In our experiments we only used the English text sections and IPC-R codes. The organizers distributed a training set of 300 patents and, unlike in previous years, only one topic set containing 3973 documents.

3 Experimental Set-up

3.1 Patent Section Extraction

Using a Perl script we extracted the English title, abstract, claims and description sections from the original XML files. We also saved the first 400 words of the description sections and the IPC-R codes² separately. All the sections were saved as plain text in temporary text files. If a document did not contain a section or if (according to the XML tags) the section was not in English, no corresponding text file was created. The XML documents contain many text-internal XML tags that indicate figures, references, formulae, etc. in the original patent document. All such tags and the texts that they enclose were filtered from the text using a Perl script.
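The extraction step can be illustrated with a minimal Python sketch. This is not the original Perl script: the element names (`invention-title`, `abstract`, `claims`, `description`), the `lang` attribute and the set of dropped inline tags are assumptions made for illustration and may not match the IRF XML format exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline tags whose content is dropped (figures, formulae, citations).
DROP_TAGS = {"figref", "maths", "patcit"}

def text_without(elem, drop=DROP_TAGS):
    """Return the text of elem, skipping dropped tags and the text they enclose."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in drop:
            parts.append(text_without(child, drop))
        parts.append(child.tail or "")  # text after the child is always kept
    return "".join(parts)

def english_sections(xml_string, max_desc_words=400):
    """Extract English sections; absent or non-English sections are omitted."""
    root = ET.fromstring(xml_string)
    sections = {}
    for name in ("invention-title", "abstract", "claims", "description"):
        for elem in root.iter(name):
            if elem.get("lang", "").upper() == "EN":
                sections[name] = " ".join(text_without(elem).split())
    if "description" in sections:
        words = sections["description"].split()[:max_desc_words]
        sections["description-400"] = " ".join(words)
    return sections

SAMPLE = """<patent-document>
  <abstract lang="EN"><p>A milking <b>device</b> with a sensor.</p></abstract>
  <abstract lang="DE"><p>Eine Melkvorrichtung.</p></abstract>
  <description lang="EN"><p>The invention relates to milking <figref>FIG. 1</figref> devices.</p></description>
</patent-document>"""

sections = english_sections(SAMPLE)
```

In this toy document only the English abstract and description are returned, and the figure reference is removed together with its enclosed text.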

3.2 Patent Parsing

In a preprocessing step the image references and claims headers in the text were removed using the regular expressions described by [9] in order to facilitate syntactic parsing of the claims and description sentences. We then split the remaining text into sentences using a Perl script and knowledge of the most common abbreviations in patent texts. The sentences in the resulting text files were parsed using the AEGIR dependency parser [8, 10], version 1.8.2. One of AEGIR's output formats is a dependency representation which is comparable to the Stanford typed dependencies [4], in the sense that it generates a set of binary relations between words for an input sentence, thereby converting some function words (such as prepositions) to relations. In addition, AEGIR performs a number of normalizing syntactic transformations, such as passive-to-active transformation.
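Abbreviation-aware sentence splitting of the kind described above can be sketched as follows. The abbreviation list is a small hypothetical sample, not the list used in the experiments, and the whitespace tokenisation is a simplification.

```python
# Hypothetical sample of abbreviations that are common in patent texts.
ABBREVIATIONS = {"e.g.", "i.e.", "fig.", "figs.", "no.", "approx."}

def split_sentences(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("The valve is shown in Fig. 2. It opens, e.g. by pressure.")
```

A known limitation of this simple rule is that an abbreviation at the very end of a sentence suppresses the split.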

Because of the large amount of data we used a time constraint of maximum 1 second per sentence. This resulted in a loss of parsing output that differed somewhat between the separate sections.

Due to the sheer size of the corpus we were not able to completely parse the description and claims sections of the entire corpus within the given time. We therefore had to limit our experiments on the impact of triples to the title, abstract and first 400 words of the description. The keywords used for the bag-of-words component in the experiments were extracted from the title, abstract and the full description.

¹ Please note the difference between a patent and a patent document: a patent is not a physical document itself but a name for a group of patent documents that have the same patent ID number.

² We used the full IPC-R code up to the level of the subgroups, e.g. A01J 5/01.

³ Parsing output from these sections was incomplete for the whole corpus and not used for the subsequent experiments.

We modeled and executed our runs as search strategies within the Spinque framework [2]. This is a prototype interactive retrieval environment where search processes are divided into two phases: the search strategy definition and the actual search.

The framework has a GUI-based drag-and-drop strategy editor which allows the user to construct the search strategies as graph structures, where edges represent data-flows consisting of terms, documents (e.g. patent-documents), document-sections (e.g. invention-title, abstract, description) and named entities (e.g. patents, IPC-R codes, companies). The nodes connected by such edges are pre-defined, general-purpose operational blocks that either provide source data (the patent corpus and the topics corpus) or modify their input data-flow by applying operations such as selection based on IPC-R classes, extraction of specific sections from documents, or ranking of sections and documents, to name a few.

Search strategies defined in this framework are automatically translated into a probabilistic relational query language and executed on top of an SQL database engine. The ranking scores that are used as the basis for the probabilities were calculated with the Okapi BM25 ranking algorithm.
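For reference, Okapi BM25 scoring can be sketched in a few lines of Python. This is a generic textbook formulation, not the Spinque implementation: the parameter values `k1=1.2` and `b=0.75` and the non-negative IDF variant are assumptions, since the settings used in the experiments are not reported here.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [["milking", "device", "sensor"], ["engine", "valve"], ["milking", "robot"]]
scores = bm25_scores(["milking", "sensor"], docs)
```

The same function applies to words and to dependency triples, as long as each triple is treated as a single index term.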

3.4 Experiments

3.4.1 Query term selection

This year, we performed query term selection on the triples, based on their relevance for a specific IPC-R class. The LCS software [7] that we used for the classification track builds class profiles which contain the term distribution (words and dependency triples) per IPC-R class. We extracted the subset of dependency triples that were most informative for correct classification, namely the top 25% of the triples ranked on their Winnow scores, from last year's class profiles and used them to filter the topic triples. Some class profiles for smaller IPC-R classes did not contain many triples (< 1000). In these cases all triples that contributed to classification were extracted. The aim of this filtering step is to remove the noisy, less informative topic triples from the query, thus improving precision. Since a patent document is usually labeled with not just one IPC-R code but belongs to multiple categories (on average a patent document contains 3 different IPC-R codes at subclass level), the filtering is not so severe that it weeds out the individual differences between topic patent documents. In other words, the individual filtered topic documents are still very different from one another due to the relatively large subsets of terms from the class profiles that were used as filters and the different combinations of IPC-R classes per document. The filtering step reduced the average number of triples per topic document (over all sections) from 180 to 60.
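The filtering step can be sketched as follows. The data layout is hypothetical: a class profile is assumed to be a mapping from a triple to its Winnow score, and a triple is assumed to pass the filter if it is selected for at least one of the topic's IPC-R classes. The LCS profile format itself is not reproduced here.

```python
def select_profile_triples(class_profile, top_fraction=0.25, min_profile_size=1000):
    """Return the triples kept for one IPC-R class.

    class_profile maps a triple to its Winnow score (assumed layout).
    Profiles with fewer than min_profile_size triples are kept in full.
    """
    if len(class_profile) < min_profile_size:
        return set(class_profile)
    ranked = sorted(class_profile, key=class_profile.get, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:keep])

def filter_topic_triples(topic_triples, topic_classes, profiles, **kwargs):
    """Keep topic triples selected for at least one of the topic's classes."""
    allowed = set()
    for ipc in topic_classes:
        allowed |= select_profile_triples(profiles.get(ipc, {}), **kwargs)
    return [triple for triple in topic_triples if triple in allowed]

# Toy profile with made-up scores; min_profile_size is lowered for the demo.
profiles = {"A01J": {("device", "SUBJ", "milk"): 2.0, ("sensor", "ATTR", "optical"): 1.5,
                     ("a", "b", "c"): 0.2, ("d", "e", "f"): 0.1}}
topic = [("device", "SUBJ", "milk"), ("a", "b", "c")]
kept = filter_topic_triples(topic, ["A01J"], profiles, min_profile_size=2)
```

Taking the union over all classes of a topic is what keeps the filter from being too severe: a topic with three IPC-R codes draws on three different triple subsets.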

3.4.2 Strategy building

The search strategies were constructed and evaluated in Spinque's strategy builder interface. Our strategies consisted of two steps: (1) As in last year's approach, we first filtered the corpus on the IPC-R codes of the topic document to create a subcorpus per topic document that contains documents with at least one IPC-R class in common with the topic document; (2) Terms (words and/or triples) from the sections in the topic documents were then used to query the respective sections of documents in the subcorpus. We did not perform any term selection for the bag-of-words approach. The resulting document lists were then merged into a larger results list. The ranking in that list depended on the documents' scores (BM25 scores from their separate runs) multiplied by the weights given to each results list in the configuration. An example of a search strategy used in this track is shown in Figure 1.
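The two steps can be sketched outside the Spinque framework as plain Python. The function names and data layout are hypothetical. Summing the weighted scores of a document that occurs in several result lists is an assumption about how the merge behaves, and the cut-off of 1000 documents per topic is taken from the discussion of the results.

```python
from collections import defaultdict

def ipc_subcorpus(topic_ipc_codes, corpus_ipc):
    """Step 1: keep documents sharing at least one IPC-R code with the topic."""
    topic_codes = set(topic_ipc_codes)
    return {doc_id for doc_id, codes in corpus_ipc.items() if topic_codes & set(codes)}

def merge_result_lists(result_lists, weights, cutoff=1000):
    """Step 2 (merge): weight each list's scores and rank the combined list."""
    merged = defaultdict(float)
    for name, results in result_lists.items():
        for doc_id, score in results.items():
            merged[doc_id] += weights[name] * score
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:cutoff]

corpus_ipc = {"EP1": ["A01J 5/01"], "EP2": ["B60K 1/00"], "EP3": ["A01J 5/01", "G01N 33/04"]}
subcorpus = ipc_subcorpus(["A01J 5/01"], corpus_ipc)

# Made-up retrieval scores for two runs over the subcorpus.
lists = {"words": {"EP1": 10.0, "EP3": 5.0}, "triples": {"EP3": 30.0}}
ranking = merge_result_lists(lists, {"words": 0.8, "triples": 0.2})
```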

3.4.3 Determining the weighting configuration

The mixture weights in the Spinque framework allow for a reranking step while merging the result lists of the runs with individual sections. Finding the optimal mixture weights is a very time-consuming process because of the large parameter space. Due to time constraints we were not able to train on many coefficient combinations for the mixtures. We used two different approaches to determine the weight configurations used in the submitted runs: (a) normalisation over retrieval scores of individual sections; and (b) trial-and-error weighting.

3.4.3.1 Determining the relative importance of different sections

The mixture coefficients for the combinations of different text sections were found by running a subset of the training set topics on the respective text sections, that is, evaluating the title, abstract and description sections independently from one another. We then took the Mean Average Precision (MAP) scores of these runs, normalised them to sum to 1 and used the resulting ratios as coefficients for the mixtures.
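The normalisation amounts to dividing each section's MAP score by the sum over all sections. A minimal sketch, using made-up MAP values that are purely illustrative and not the scores obtained in the experiments:

```python
def mixture_weights(map_scores):
    """Normalise per-section MAP scores so that the weights sum to 1."""
    total = sum(map_scores.values())
    if total == 0:
        raise ValueError("all MAP scores are zero; weights are undefined")
    return {section: score / total for section, score in map_scores.items()}

# Illustrative values only.
weights = mixture_weights({"title": 0.02, "abstract": 0.03, "description": 0.05})
```

With these illustrative inputs the description section receives half of the total weight, the abstract 30% and the title 20%.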

3.4.3.2 Determining the relative importance of triples and words in the combined runs

The coefficients for mixing the words only and triples only runs were found using the 'trial and error' method on the training set. Starting from a 50/50 combination we used binary search to arrive at the optimal configuration: a words only (0.8) and triples only (0.2) combination.
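The search procedure is not specified in more detail than this. One possible reading is an interval-halving search over the words weight, sketched below. The `evaluate` callback (which would return, for example, MAP on the training set for a given words weight), the step schedule and the number of iterations are all assumptions.

```python
def search_mixture_weight(evaluate, start=0.5, step=0.25, iterations=6):
    """Interval-halving search for the words weight (triples weight = 1 - words)."""
    best = start
    best_score = evaluate(best)
    for _ in range(iterations):
        # Try one step below and one step above the current best weight.
        for candidate in (best - step, best + step):
            if 0.0 <= candidate <= 1.0:
                score = evaluate(candidate)
                if score > best_score:
                    best, best_score = candidate, score
        step /= 2  # halve the step after every round
    return best

# Toy objective with a single peak at 0.8, standing in for a real evaluation run.
words_weight = search_mixture_weight(lambda w: -(w - 0.8) ** 2)
```

Such a search only finds the optimum reliably if the evaluation measure has a single peak over the weight range, which is not guaranteed for MAP.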

3.4.4 Submitted runs

We chose to submit four separate runs:

1. Triples only: A baseline run to gauge the impact that such precise index terms as dependency triples can have on retrieval.
2. Words only: A standard bag-of-words baseline run. Keywords were stemmed using the Porter stemmer (version 1).
3. Combination 1: Combining the results lists of the words only (stemmed) and triples only (unstemmed) runs in an 80/20 configuration.
4. Combination 2: Even though triples are lemmatized by the parser, the patent domain consists of many highly specialized subdomains which deploy their own jargon. Consequently the patent documents usually contain a lot of words which may not feature in the parser lexicon [10]. The AEGIR parser recognises these words using robust rules which lead to good estimates of POS tags (important for correct syntactic analysis later on) but applies no lemmatisation beyond the basic singular-plural differences. We therefore submitted an extra run to examine the impact of stemming of the triples.

4 Results

In this section we present the results of our submitted runs in terms of MAP, precision and recall for the general (Table 3) and English language-specific test data (Table 4).

5 Discussion

5.1 Impact of dependency triples on retrieval

As expected, triples by themselves are too specific to be used for retrieval: the triples only run achieved a very high set precision but fairly low set recall. On average, only 250 documents were retrieved per topic document in this run. The MAP scores for the different sections on a subset of the training data in Table 2 show decided differences between the sections.

However, in the combination runs, merging the triples only and the bag-of-words result lists presented some interesting results: while dependency triples are usually seen as a way of improving ranking, we achieved the highest set recall scores (measured with the language-specific English relevance assessments) compared to the other participants. An analysis of the result list of the combination 1 run shows that around 5% of the relevant documents retrieved in this run (2.5% of all the relevant patents) were found using triples, but were not found in the words only approach. This may show that using dependency triples, i.e. information which abstracts away from the surface form of the sentence, can contribute to retrieval where a bag-of-words approach falls short. However, at this point, the contribution is very small. An alternative explanation is that the dependency triples have improved the ranking of documents in the results list that fell underneath the cut-off point of retrieving 1000 patents per query in the words only run. In that case, there is a complete overlap between the results from the triples only run and the documents found by the words only approach, and the improvement in set recall score for the combined run is an artefact of our choice of threshold.

Furthermore, another 36% of the relevant documents in the combination 1 run were found by both the words and triples approaches. We would expect these documents to feature high in the combined results list, thus improving the MAP score (compared to the words only run). However, we did not find much difference in the rankings and found a slight decrease in MAP score. We expect that fine-tuning the 80/20 words-triples mixture coefficients on a held-out set of the test corpus may improve the rankings.
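The overlap analysis described above can be expressed with simple set operations. The sketch below is an illustration with toy document identifiers, not the analysis script used for the reported figures. The function name and its inputs (sets of document IDs per run, truncated at the retrieval cut-off) are assumptions.

```python
def contribution_breakdown(relevant, words_run, triples_run, combined_run):
    """Set recall of the combined run and the share of hits per source."""
    found = relevant & combined_run
    if not relevant or not found:
        return {"set_recall": 0.0, "triples_only_share": 0.0, "both_share": 0.0}
    only_triples = found & (triples_run - words_run)
    both = found & triples_run & words_run
    return {
        "set_recall": len(found) / len(relevant),
        "triples_only_share": len(only_triples) / len(found),
        "both_share": len(both) / len(found),
    }

relevant = {"a", "b", "c", "d"}
words_run = {"a", "b", "x"}
triples_run = {"b", "c"}
stats = contribution_breakdown(relevant, words_run, triples_run, words_run | triples_run)
```

Because the words only result set is truncated at the cut-off, a document counted as found by triples only may still have been matched by the words only run at a lower rank, which is the alternative explanation raised above.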

In the combination 2 run we tried to raise recall by using stemming in the triples as well as in the keywords, but we found that precision suffers considerably in that trade-off: while we did find more relevant documents, they were all pooled at the bottom of the results list. Moreover, the MAP score was significantly lower than for the combination 1 run. It is clear that the mixture weights should be tuned separately for combinations with stemmed triples.

5.2 Impact of the different sections

We did not have the opportunity to examine the impact of the different sections in much detail. Rather, we focussed on optimising the impact of those sections where dependency triples were the most successful in their own right (see Section 3.4.3). However, this independence assumption is problematic: while it was a good starting point (in the mixtures the most weight was given to those sections that were most likely to have relevant documents high in the list), this strategy cannot properly account for interaction between sections and suffers from the uneven distribution of (English) text data in the corpus. In future work we will use further tuning via a trial-and-error method to try to find a local, if not global, optimum.

6 Conclusion

In our participation in the CLEF-IP 2011 prior art retrieval track we examined the impact of adding dependency triples obtained with the AEGIR parser to a bag-of-words approach. Triples by themselves are very specific terms, as reflected by the high precision score achieved in the triples only run. Interestingly, we found that adding triples led to a slight improvement in recall, rather than in precision as we had expected. It is not quite clear if this is due to the normalisation features of triples or an indirect effect of their higher precision. We also experimented with stemming of the triples, but this led to a severe loss of precision. In future work we will extend our experiments by adding data of all the description sections and the claims section, both for the words and triples approach. We will also keep working on tuning the mixture coefficients by a 'trial-and-error' method, rather than basing the coefficients on individual retrieval performance of the sections.

[Table 3: MAP, precision and recall of the submitted runs on the general test data]

[Table 4: MAP, precision and recall of the submitted runs on the English language-specific test data]

[1] Home - IRF. http://www.ir-facility.org/.

[2] Wouter Alink, Roberto Cornacchia, and Arjen de Vries. Searching CLEF-IP by strategy. In Carol Peters, Giorgio Di Nunzio, Mikko Kurimo, Thomas Mandl, Djamel Mostefa, Anselmo Peñas, and Giovanna Roda, editors, Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 468-475. Springer Berlin / Heidelberg, 2010. doi:10.1007/978-3-642-15754-7_56.

[3] Daniela Becks, Thomas Mandl, and Christa Womser-Hacker. Phrases or Terms? The Impact of Different Query Types. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), CLEF-IP workshop, page 99, 2010.

[4] Marie-Catherine de Marneffe and Christopher D. Manning. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation - CrossParser '08, pages 1-8, Morristown, NJ, USA, 2008. Association for Computational Linguistics.

[5] IRF. CLEF-IP 2011 Track Guidelines. Technical report, IRF, 2011.

[6] Cornelis H.A. Koster, Jean G. Beney, Suzan Verberne, and Merijn Vogel. Phrase-Based Document Categorization. In W. Bruce Croft, Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Kluwer International Series on Information Retrieval, pages 263-286. Springer Berlin Heidelberg, 2011.

[7] Cornelis H.A. Koster, Marc Seutter, and Jean Beney. Multi-classification of patent applications with Winnow. In Perspectives of Systems Informatics, 5th International Andrei Ershov Memorial Conference, pages 546-555, 2003.

[8] Nelleke Oostdijk, Suzan Verberne, and Cornelis Koster. Constructing a broad-coverage lexicon for text mining in the patent domain. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), 2010.

[9] Peter Parapatics and Michael Dittenbach. Patent claim decomposition for improved information extraction. In W. Bruce Croft, Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors, Current Challenges in Patent Information Retrieval, The Kluwer International Series on Information Retrieval. Springer Berlin Heidelberg.

[10] Suzan Verberne, Eva D'hondt, Nelleke Oostdijk, and Cornelis Koster. Quantifying the challenges in parsing patent claims. Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), pages 14-21, 2010.

[11] Suzan Verberne, Merijn Vogel, and Eva D'hondt. Patent classification experiments with the Linguistic Classification System LCS. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), CLEF-IP workshop, number Section 2, page 49, 2010.