
Patent classification experiments with the Linguistic Classification System LCS in CLEF-IP 2011

Suzan Verberne

s.verberne@cs.ru.nl

Eva D'hondt

Information Foraging Lab, Radboud University Nijmegen

We report the results of a series of classification experiments with the Linguistic Classification System LCS in the context of CLEF-IP 2011. We participated in the main classification task: classifying documents on the subclass level. We investigated (1) the use of different sections (abstract, description, metadata) from the patent documents; (2) adding dependency triples to the bag-of-words representation; (3) adding the WIPO corpus to the EPO training data; (4) the use of patent citations in the test data for reranking the classes; and (5) the threshold on the class scores for class selection. We found that adding full descriptions to abstracts gives a clear improvement; the first 400 words of the description also improve classification, but to a lesser degree. Adding metadata (applicants, inventors and address) did not improve classification. Adding dependency triples to words gives a much higher recall at the cost of a lower precision, but this effect is largely due to the class selection threshold. We did not find an effect from adding the WIPO corpus, nor from reranking with patent citations. In future work, we plan to investigate whether there are other methods for reranking with patent citations that do give an improvement, because we feel that the citations may still give valuable information. Our most important finding, however, is the importance of the threshold on the class selection. For the current work, we only compared two values for the threshold, and the results are much better for 1.0 than for 0.5. The 0.5 threshold gives higher recall in all runs, which was the original motivation for submitting runs with a lower threshold. However, because of the much lower precision, the F-scores are lower. We think that there is still some improvement to be gained from proper tuning of the class selection threshold, and from the use of a flexible threshold (also taking into account the different text representations). This is part of our future work.

1 Introduction

In this paper, we describe the classification experiments that we conducted in the context of the Intellectual Property (IP) track at CLEF 2011 (CLEF-IP¹). In 2009, the track was organized for the first time with a prior art retrieval task. In 2010, a classification task was added to the track. In 2011, this task was continued and extended with a new optional sub-task, which is to classify a given patent document up to the subgroup level when the subclass is given. We only participated in the main classification task: classifying documents on the subclass level.

The goal of the classification task at CLEF-IP is to classify a given patent document according to the International Patent Classification system (IPC). For the purpose of the track, the organization released a collection of 2.6 million patent documents from the European Patent Office (EPO), extended with 400,000 documents from the World Intellectual Property Organization (WIPO). These 3 million documents, with content in English, German and French, pertain to over 1 million patents.² From the collection, 1,000 documents (the 'topics') per language were held out as test set. The remainder of the corpus constitutes the target data, on which participants could develop their methods.

¹ http://www.ir-facility.org/clef-ip

² A patent is the name for a group of patent documents that relate to the same invention; they have the same patent ID number.

In this notebook paper, we describe our classification experiments with the Linguistic Classification System LCS. We only performed mono-lingual classification, training and evaluating our models on English texts only. We evaluate a number of classification variables: (1) the use of different patent sections, (2) adding dependency triples to the bag-of-words representation, (3) expanding the EPO training corpus with WIPO documents, (4) using patent citations to rerank the selected classes, and (5) tuning the threshold on class selection.

In Section 2, we describe the data selection, data preparation and the classification settings used. The results from the classification experiments are presented in Section 3, followed by our conclusions in Section 4.

2 Classification experiments with LCS

For our classification experiments, we used the Linguistic Classification System (LCS)³ [2, 3]. The LCS can perform both mono-classification (each document is assigned exactly one class label) and multi-classification. In the training phase, the LCS takes as input a file which lists the paths to the classification files followed by their classes. After this training phase, the LCS can be used for testing the trained classifier on a test collection of documents with known classes (usually held-out training data), or for producing a classification of new documents without known classes.

Three classifiers have been implemented in the LCS: Naive Bayes, Winnow and SVMlight. Last year [6], we experimented with both Winnow and SVMlight and found that their classification accuracy scores are comparable but that SVMlight is much slower. Therefore, we decided to use Winnow for this year's CLEF-IP experiments. Winnow has a number of parameters that can be tuned: α, β and maxiters (the number of training iterations). Based on the tuning we did last year, we decided to use α = 1.02, β = 0.98 and maxiters = 10.
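To make the role of the promotion factor, the demotion factor and maxiters concrete, the following is a minimal sketch of a Winnow-style, mistake-driven multiplicative update for a single class. It is not the LCS implementation: the function name `train_winnow`, the mean-normalised score, the initial weight of 1.0 and the decision threshold `theta` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def train_winnow(
    examples: List[Tuple[List[str], bool]],
    alpha: float = 1.02,   # promotion factor (> 1)
    beta: float = 0.98,    # demotion factor (< 1)
    maxiters: int = 10,    # maximum number of passes over the training data
    theta: float = 1.0,    # decision threshold (assumed for this sketch)
) -> Dict[str, float]:
    """Train term weights for one class with multiplicative updates on mistakes."""
    weights: Dict[str, float] = {}
    for _ in range(maxiters):
        mistakes = 0
        for terms, is_member in examples:
            active = set(terms)
            # Unseen terms count with weight 1.0; the score is the mean active weight.
            score = sum(weights.get(t, 1.0) for t in active) / max(len(active), 1)
            predicted = score >= theta
            if predicted == is_member:
                continue
            mistakes += 1
            # Promote active terms after a missed member, demote after a false alarm.
            factor = alpha if is_member else beta
            for t in active:
                weights[t] = weights.get(t, 1.0) * factor
        if mistakes == 0:
            break  # no errors in a full pass: stop early
    return weights


# Toy training data (hypothetical): one member and one non-member of the class.
toy_examples = [(["engine", "exhaust"], True), (["polymer", "resin"], False)]
toy_weights = train_winnow(toy_examples)
```

With factors this close to 1, each mistake changes the weights only slightly, which is why the number of training iterations matters.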

In our classification experiments, we compared the following experimental settings:

1. The use of different sections (abstract, description, metadata) from the patent documents;
2. The use of different document representations for classification, adding dependency triples to the bag-of-words representation;
3. The training corpus selection: EPO only, or EPO and WIPO together;
4. The use of patent citations in the test patents for reranking the assigned classes;
5. The threshold on the class scores for class selection.

We will explain how we prepared the experiments for each of these comparisons in the following subsections.

Corpus preparation: extracting IPC classes and sections

From all patents in the target data, we extracted the information needed for classification: the IPCR classes; the textual content from the English abstract and description; and applicants, inventors and address as additional metadata. For each patent, we selected the most recent version which contains all the information needed.⁴ Table 1 shows the size of the training corpus when particular patent sections are included. We allowed the abstract to be empty if either the description or the metadata section contains content. As a result, the subcorpus 'abstract and metadata' is the largest: 1.3M documents, some of which only contain metadata.

We separately extracted the first 400 words of the description because the experiences of other participants in last year's workshop [5] taught us that the head of the description is a good alternative to the complete description, which may be too heavy to classify due to its length. We conducted experiments to validate this assumption.

³ A demo of the application can be found at http://ir-facility.net/news/linguistic-classification-systemprototype/ for registered IRF members.

⁴ E.g. in the corpus directory EP/000000/00/59/01/, EP-0005901-A3.xml is newer than EP-0005901-A2.xml and both are newer than EP-0005901-B1.xml.
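The section combinations compared in the experiments (abstract only, abstract plus full description, abstract plus the first 400 words of the description, with or without metadata) can be sketched as follows. This is an illustration under assumptions, not the preprocessing code used for the runs: the function names, the whitespace tokenisation and the newline-joined output are hypothetical.

```python
from typing import Optional


def first_words(text: str, limit: int = 400) -> str:
    """Return the first `limit` whitespace-separated words of `text`."""
    return " ".join(text.split()[:limit])


def build_document(
    abstract: str,
    description: str = "",
    metadata: str = "",
    desc_limit: Optional[int] = None,
) -> str:
    """Concatenate the selected sections, skipping sections that are empty."""
    if desc_limit is not None:
        description = first_words(description, desc_limit)
    parts = [abstract, description, metadata]
    return "\n".join(part for part in parts if part.strip())


# Hypothetical patent text: a short abstract and a 1,000-word description.
long_description = " ".join(f"w{i}" for i in range(1000))
doc_abs_desc400 = build_document("An abstract.", long_description, desc_limit=400)
# A document with an empty abstract is still kept if it has metadata.
doc_meta_only = build_document("", "", "ACME Corp")
```

Skipping empty sections mirrors the choice to allow an empty abstract when another section has content.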

Different document representations: adding triples to words

In CLEF-IP 2010, we experimented with the addition of dependency triples to the bag-of-words representation, which is generally used in text classification. The results on the 2010 test set were mixed [6], but in follow-up experiments [3] we consistently found a significant improvement in F-score when we added dependency triples to the word-based representation of patent abstracts.

This year, we again investigated the improvement that can be gained from adding dependency triples to the bag of words, but we did not limit ourselves to classification of abstracts. We parsed the abstracts and the first 400 words of the descriptions with the AEGIR dependency parser [4, 7], version 1.8.2. AEGIR's output representation is comparable to the Stanford typed dependencies representation [1], in the sense that it generates a set of binary relations between words for an input sentence, thereby converting some function words (such as prepositions) to relations. In addition to that, AEGIR performs a number of normalizing transformations, such as passive-to-active transformation. For example, the clause "an inflammatory reaction, caused by the bowel tissue" leads to the same analysis as "the bowel tissue causes an inflammatory reaction". An example of the triple representation can be found in Figure 1 below [6].

Figure 1 (original text, words, triples):

Original text: Heat is stored at a steady temperature using calcium chloride hexahydrate and up to 20 percent strontium chloride hexahydrate to assist crystallisation.

Words: heat is stored at a steady temperature using calcium chloride hexahydrate and up to percent strontium chloride hexahydrate to assist crystallisation

Triples: [IT,SUBJ,store] [store,OBJ,heat] [store,PREPat,temperature] [temperature,ATTR,steady] [temperature,DET,a] [chloride,ATTR,calcium] [hexahydrate,ATTR,chloride] [hexahydrate,ATTR,using] [up,PREPto,20 percent] [assist,OBJ,crystallization] [chloride,ATTR,strontium] [hexahydrate,ATTR,chloride] [hexahydrate,SUBJ,assist]

In text classification, system performance usually goes up when the size of the training set increases. Although the CLEF-IP test set only consisted of documents from the EPO corpus, we investigated whether adding documents from another corpus, namely the WIPO corpus, to the EPO training set led to improvements in classification accuracy. We added the WIPO corpus to two of our section subcorpora: 'abstracts and description' and 'abstracts, description and metadata'. Table 2 shows the resulting document counts for the training corpora. From the table it is clear that in the WIPO corpus, there are fewer documents with the metadata fields applicants, inventors and address than in the EPO corpus.

The use of patent citations for reranking the classes

Some of the patent files (topics) in the test set contain citations to other EPO patents. We used these citations to rerank the LCS output using the following procedure:

1. For each topic, we extracted the patents that are cited by the topic (labelled as patcit in the XML file);
2. We looked up each of the citations in the training corpus and extracted their IPC-R classes. We found that 562 of the 1,000 topics contain at least one cited patent with one or more IPC-R classes.
3. These 'citation classes' get a vote each time they occur in a cited patent. A vote is worth 1.0 in addition to the LCS score.

For example, in one of the experiments, LCS selected five classes for the topic EP-1223323-A2 and assigned them scores that give the following ranking:

EP-1223323-A2 F01N
EP-1223323-A2 F02D
EP-1223323-A2 B60W
EP-1223323-A2 B60K
EP-1223323-A2 F02N

Of these, F01N (1x), B60K (2x) and B60W (2x) occur in the citations of EP-1223323-A2. Their classification score is increased by the number of times they occur in the citations, and the list of classes is re-ranked:

EP-1223323-A2 F01N
EP-1223323-A2 B60W
EP-1223323-A2 B60K
EP-1223323-A2 F02D
EP-1223323-A2 F02N
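The voting procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submitted runs: the function name is hypothetical, the LCS scores below are invented values chosen only to reproduce the ranking of the example, and the sketch assumes that only classes already selected by LCS receive votes.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rerank_with_citations(
    lcs_scores: Dict[str, float],
    citation_classes: Iterable[str],
    vote: float = 1.0,
) -> List[Tuple[str, float]]:
    """Add one vote per citation occurrence to each LCS-selected class, then re-sort."""
    votes = Counter(citation_classes)
    # Counter returns 0 for classes that never occur in the cited patents.
    boosted = {cls: score + vote * votes[cls] for cls, score in lcs_scores.items()}
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)


# Hypothetical LCS scores for topic EP-1223323-A2 (illustrative values only).
lcs_scores = {"F01N": 3.0, "F02D": 1.6, "B60W": 1.4, "B60K": 1.2, "F02N": 1.0}
# One entry per occurrence of a class in the cited patents: F01N 1x, B60K 2x, B60W 2x.
citation_classes = ["F01N", "B60K", "B60K", "B60W", "B60W"]

reranked = rerank_with_citations(lcs_scores, citation_classes)
print([cls for cls, _ in reranked])
```

With these assumed scores, B60W and B60K each gain 2.0 and move ahead of F02D, which gives the re-ranked order of the example.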

The threshold on the class scores for class selection

In the case of multi-classification, LCS is flexible with respect to the number of classes that are returned per document. Internally, it produces a full ranking of classes for each document in the test set. The user can regulate the selection of classes with three parameters: (1) a threshold that puts a lower bound on the classification score for a class to be selected, (2) the maximum number of classes selected per document ('maxranks') and (3) the minimum number of classes selected per document ('minranks'). In the experiments on the target data, we kept the selection threshold at 1.0 (which is the default). Based on the average number of classes per document in the target data (2.7 according to [6]), we decided to set maxranks = 4. Setting minranks = 1 ensures that each document is assigned at least one class, even if all classes have a score below the threshold.
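The interaction of the three selection parameters can be sketched as follows. This is a minimal illustration of the selection logic as described above, not LCS code: the function name is hypothetical, the class labels and scores are invented, and treating the threshold as inclusive (score >= threshold) is an assumption.

```python
from typing import Dict, List


def select_classes(
    scores: Dict[str, float],
    threshold: float = 1.0,
    minranks: int = 1,
    maxranks: int = 4,
) -> List[str]:
    """Select classes from a full ranking using a score threshold and rank bounds."""
    ranked = sorted(scores, key=lambda cls: scores[cls], reverse=True)
    n_above = sum(1 for cls in ranked if scores[cls] >= threshold)
    # At least minranks classes, at most maxranks classes.
    n_selected = min(max(n_above, minranks), maxranks)
    return ranked[:n_selected]


# Hypothetical class scores for one document.
scores = {"A": 1.6, "B": 1.1, "C": 0.7, "D": 0.6, "E": 0.55, "F": 0.2}

strict = select_classes(scores, threshold=1.0, maxranks=4)
lenient = select_classes(scores, threshold=0.5, maxranks=5)
fallback = select_classes({"A": 0.3, "B": 0.1}, threshold=1.0)
```

In this toy case the stricter threshold selects two classes and the lower threshold fills all five ranks, while `fallback` shows minranks = 1 returning the top class even though no score reaches the threshold.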

In the submitted runs on the test data, we decided to lower the class selection threshold to 0.5 because the value of 1.0 gives an average of 1.8 classes per test document; setting it at 0.5 gives an average of 3.2 classes. The latter seemed wiser for a recall-oriented task. Also, we increased maxranks to 5. In additional experiments, we evaluated the results for a threshold of 1.0 against the results for the threshold of 0.5.

Table 3 (run names, sections used and text representation):

Run          Sections         Representation
ad400WT      abs, desc400     words+triples
ad400WTcit   abs, desc400     words+triples
ad400WT1     abs, desc400     words+triples
aWT          abs              words+triples
amWTcit      abs, meta        words+triples
amWT         abs, meta        words+triples
aW           abs              words
aWcit        abs              words
amW          abs, meta        words
ad400W       abs, desc400     words
admWOW       abs, desc, meta  words
admWOWcit    abs, desc, meta  words
admWcit      abs, desc, meta  words
admW         abs, desc, meta  words
adW          abs, desc        words
aW1          abs              words
adWcit       abs, desc        words
adWOW        abs, desc        words
adWOWcit     abs, desc        words
amW1         abs, meta        words
aWT1         abs              words+triples
amWT1        abs, meta        words+triples
admWOW1      abs, desc, meta  words
admWOWcit1   abs, desc, meta  words
ad400W1      abs, desc400     words
admW1        abs, desc, meta  words
adW1         abs, desc        words
adWcit1      abs, desc        words
adWOW1       abs, desc        words
adWOWcit1    abs, desc        words

3 Results

For training the classification models, we used the target data with the exception of the 2,000 most recent documents in the training corpus, which we used as test set in the development stage. A complete overview of the results on the real test data (the 1,000 topics provided by the track organization) is shown in Table 3. As opposed to last year, when we measured standard deviations over multiple runs of the same experiment, we only performed each experiment once this year. Our results on the 2010 data showed that standard deviations are small and even small differences in the results tend to be significant because of the large data set [3].

Figures 2-6 at the end of the paper show the effects of different sections, text representation, corpus selection, patent citations and class selection threshold, respectively (the five experimental variables that we compare).

Figure 2 shows that adding the description to the abstract gives a clear improvement in classification accuracy: from 0.54 to 0.62 in F-score. The effect of adding the first 400 words of the description instead of the complete description is smaller, giving an F-score of 0.60. Surprisingly, adding metadata (applicants, inventors and address) to the abstracts and descriptions does not give any improvement. This is in contrast with last year's results, when some participants reported significant improvement from adding applicants, inventors and address as metadata [5].

Figure 3 shows that adding dependency triples to the bag-of-words representation has an effect, but whether this is a positive effect depends highly on the evaluation measure used. Recall is higher for the words+triples representation, but this comes at the cost of a much lower precision. The experimental setting with the lowest F-score of all, ad400WT, has the highest recall of all runs (0.87). We had a look at the full ranking of the classes and found that for the runs with triples, the class scores are generally higher. This means that more classes get a score above the fixed threshold of 0.5 (in fact, the average number of classes selected per patent for ad400WT is 5.0, which is the maximum number of selected classes). As a result, recall is higher and precision is lower.
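The trade-off described here follows directly from how precision, recall and F1 are computed over sets of assigned classes. The sketch below illustrates this with a toy document; micro-averaging over documents is an assumption of this sketch (the averaging used for the official scores is not specified here), and the function name and class labels are hypothetical.

```python
from typing import List, Set, Tuple


def micro_prf(
    predicted: List[Set[str]], gold: List[Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-document class sets."""
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one document with two correct classes.
gold = [{"A", "B"}]
few_classes = [{"A"}]                        # conservative selection
many_classes = [{"A", "B", "C", "D", "E"}]   # five classes selected

p_few, r_few, f_few = micro_prf(few_classes, gold)
p_many, r_many, f_many = micro_prf(many_classes, gold)
```

In this toy case, selecting five classes raises recall from 0.5 to 1.0 but lowers precision from 1.0 to 0.4, and the F1 score drops as a result.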

Figure 4 shows that there is no effect of adding the WIPO documents to the EPO training corpus. More data generally gives better classification results, but in this task and using this data, increasing the number of documents from 650K to 905K did not have any effect.

Figure 5 shows that the use of patent citations in the test data for reranking the classes has no visible effect either. We plan to investigate whether there are other methods for reranking with patent citations that do give an improvement, because we feel that the citations may still give valuable information.

Figure 6 shows that the threshold on the class scores for class selection is highly important for the evaluation scores. For the current work, we only compared two values for the threshold, 0.5 and 1.0, and it is clearly visible that the results are much better for 1.0 than for 0.5. The 0.5 threshold gives higher recall in all runs, which was the original motivation for submitting runs with a lower threshold. However, because of the much lower precision, the F-scores are lower. The default LCS threshold of 1.0 clearly is the better choice here. We think that there is still some improvement to be gained from proper tuning of the class selection threshold, and from the use of a flexible threshold (also taking into account the different text representations). This is part of our future work.

4 Conclusion

We reported the results of a series of classification experiments in the context of CLEF-IP 2011. We investigated (1) the use of different sections (abstract, description, metadata) from the patent documents; (2) adding dependency triples to the bag-of-words representation; (3) adding the WIPO corpus to the EPO training data; (4) the use of patent citations in the test data for reranking the classes; and (5) the threshold on the class scores for class selection.

We found that adding full descriptions to abstracts gives a clear improvement; the first 400 words of the description also improve classification, but to a lesser degree. Adding metadata (applicants, inventors and address) did not improve classification. Adding dependency triples to words gives a much higher recall at the cost of a lower precision, but this effect is largely due to the class selection threshold. We did not find an effect from adding the WIPO corpus, nor from reranking with patent citations. Our most important finding is the importance of the threshold on the class selection. Our future work will be directed at tuning this threshold.

[Figures 2-6: plots of precision (P), recall (R) and F1 per experimental setting. Recoverable panel titles: "Sections and reranking with citations (yes/no)" and "Sections and threshold for class selection (1.0 or 0.5)".]

1. M.C. De Marneffe and C.D. Manning. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 1-8. Association for Computational Linguistics, 2008.

2. C.H.A. Koster, Seutter, and Beney. Multi-classification of patent applications with Winnow. Lecture Notes in Computer Science, pages 545-554, 2003.

3. Cornelis H.A. Koster, Jean G. Beney, Suzan Verberne, and Merijn Vogel. Phrase-Based Document Categorization. In W. Bruce Croft, Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Kluwer International Series on Information Retrieval, pages 263-286. Springer Berlin Heidelberg, 2011.

4. Nelleke Oostdijk, Suzan Verberne, and Cornelis H.A. Koster. Constructing a broad coverage lexicon for text mining in the patent domain. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), 2010.

5. Florina Piroi and John Tait. CLEF-IP 2010: Retrieval Experiments in the Intellectual Property Domain. In CLEF 2010 LABs and Workshops Notebook Papers, 2010.

6. Verberne, Vogel, and D'hondt. Patent classification experiments with the Linguistic Classification System LCS. In Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), CLEF-IP workshop, 2010.

7. Suzan Verberne, Eva D'hondt, Nelleke Oostdijk, and Cornelis H.A. Koster. Quantifying the Challenges in Parsing Patent Claims. In Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), pages 14-21, 2010.