Introduction

CLEF-IP 2011 Working Notes: Utilizing Prior Art Candidate Search Results for Refined IPC Classification

Hyung-Kook Seo

hkseo@wisenut.co.kr 0

Kyouyeol Han

Jaean Lee

jalee@wisenut.co.kr 0 0 2F Ottogi Center , 1009-1 Daechi-Dong, Kangnam-Gu, Seoul , Korea

For the refined IPC classification in the CLEF-IP 2011 task, we constructed classification system with KNN classification which uses PAC (Prior Art Candidate) search results as neighbors. We also slightly modified the neighborhood evaluation. We also furnished a simple PAC search system. We produced some running results both in PAC search and classification, and evaluated our system. Our test showed an improved result in the refined IPC classification.

Prior Art Candidate Search IPC Categorization KNN Classification

Introduction

1.1

Test Collection and Topics Like other participants in CLEF-IP 2011, we used only test the data collection which comprises extracts of the MAREC dataset by IRF.

We only used this collection in the entire process for producing running result of the tasks we participated. 1.2

CLEF-IP Tasks Participated We participated in following three tasks in CLEF-IP 2011 [ 1 ]:

1. Prior Art Candidate (PAC) Search

2. IPC Classification: up to subclass level 3. Refined IPC Classification: up to subgroup level, with given subclass value As we stated before, our ultimate interests lie in refined IPC classification, and there were small efforts to improve PAC search results. This is described in the next chapter. 2

Approaches to Refined Classification Classification of a patent up to sub-class degree is quite difficult task for model-based classification, because of sparseness of training samples in that level. So, we implemented indirect (and simple) method, that implements KNN-like classification using PAC search results. 2.1

Existing Classification Approach In patent classification, Kostar et al. [ 2 ] proposed a method using the winnow algorithm. Winnow is a mistake driven learning algorithm that computes for each category a vector of weights for separating between relevant and irrelevant patent [ 3 ]. In this study, they obtained F-measure of around 68% (multi-categorization). This result was measured with a customized success criterion and relatively few documents.

Fall et al. [ 4 ] applied various machine learning algorithms to patent classification. The machine learning algorithms were Naive Bayes(NB), SNoW, support vector machines(SVM) and k-nearest neighbor(KNN) algorithms. Here SNoW is a variation of the winnow algorithm. They investigated useful patent document fields to index, and defined three measures of categorization success. As a result, they presented the best precision of 41%(SVM), 39%(NB), 33%(NB) and 36%(SNoW) when the first 300 words are indexed at subclass level. In first three guesses KNN achieved the best precision of 62% and all categories SVM achieved the best precision of 48%. They [ 5 ] also presented a customized language-independent text classification system for categorization.

When the amount of training data increses, a model-based system has increased feature scale and time complexity. In order to reduce the feature scale, some of researchers limited the number of documents, term selection, and length of the documents. To reduce time complexity, other have attempted instance-based learning such as KNN. It first selects K samples when the similarity values are sorted in descending order, and then determines the categories of test sample with class mapping method. It makes a trade-off between effectiveness and time complexity.

W. Wang et al.[ 6 ] reported their experience in the NTCIR-7 Patent Mining Task(MT) to classify patent documents according to the IPC taxonomy. Their approach is based on the KNN algorithm using cosine and Euclid distance similarity. And T. Xiao et al.[ 7 ] described their methodology that are used KNN and re-ranking models. They achieved a mean average precision (MAP) of 48.86% when classifying according to the subgroup level. Also [ 8 ] reported result of their experiments on the automatic assignment of patent classification to research paper abstracts. The results showed the best precision of 50.62% (MAP) when using formal run data and particular query group.

Lately, Y. Cai et al. [ 9 ] presented a KNN text categorization method based on shared nearest neighbor, effectively combining the BM25 similarity calculation method and the Neighborhood Information of samples in the NTCIR-8 workshop. BM25 is a bag-of-words retrieval function, combines the word frequency and document frequency, balances the length of the document, and is a highly efficient similarity calculation method. They conducted a comparison experiment on Japanese corpus and English corpus provided by the National Institute of Informatics from 1993 to 2002, using the basic KNN and KNN based on shared nearest neighbors. Compared to KNN method, KNN+SNN method showed 72.12% precision (about 0.03) higher at subclass levels and 36.93% precision at subgroup levels on English corpus. 2.2

Our IPC Classification Approach As we know about KNN Classification, K nearest neighbors are K documents most similar with the given query document to be classified [ 10 ].

So it can be easily connected with search results. That is, top K search results with query document can be directly adopted in KNN classification, and one system used this simple method, though it was not so competitive with model-based classification algorithms [ 11 ].

In fact, at subgroup level we need about 70,000 categories to be trained, and most classification models suffer from sparseness of training documents and problems in system memory (for loading models) and processing time (training or classification itself). But according to our experience in Korean patent domain, KNN classification with PAC search showed quite good quality in classification of subgroup level.

So we tried to construct refined IPC classification system utilizing PAC results. (And IPC classification up to subclass level as well. In fact, we paid not so much attention in the optimization or improvement in subclass level, because of limited time) 2.3

PAC Search Approach We implemented a PAC search system using only selected weighted keywords which are extracted from major content fields (title of invention, abstract, description, claims).

We added two additional efforts in PAC search to improve our results. They are the following:

Removing Non-Content Words

A Document will be represented by a set of words which consist of content-word and functional word. After POS tagging, we try to remove functional words and stop word. While stop words are controlled by human input and cannot be automated, we can algorithmically find words which don’t describe a particular document. (non-content words).

The standard probabilistic model for the distribution of a certain type of event over units of a fixed size is the Poisson distribution.

The most common model of the Poisson distribution in IR, the parameter > 0 is the average number of occurrences of per document: that is, where is the collection frequency and N is the total number of documents in the collection. And we can get an approximation of DF by . As this model assumes independence between term occurrences, its estimations are good for non-content words. [12]

Algorithm:

1. Calculate collection ) 2. Calculate expected document frequency by Poisson distribution: : collection frequency of i / N ( N is total number of document in the 3. Get the overestimation value: expected document frequency(i)/df(i)

Parameter:

1. Document Frequency Rate: Percentage of the number of document in the collection

2. Lower overestimation criteria and upper overestimation criteria

Result:

According to parameter tests, our system (in Korean) showed best result under observed document frequency of 0.05%, lower overestimation criteria of 0.9 and upper overestimation criteria of 1.0.

In the CLEF-IP 2011, we’ll fix lower overestimation criteria as 0.9, and change upper overestimation criteria from 1.0 to 1.3. (1)

Extracting Co-Occurrence Terms

A null hypothesis is often stated by saying the parameter Θ is in a specified subset 0 of the parameter space .

Likelihood ratios are an approach to hypothesis testing. The likelihood function is is a function of the parameter θ with x held fixed at the value that was actually observed, i.e., the data. The likelihood ratio test statistic is [14]

In applying the likelihood ratio test to collocation discovery, we examine the following two alternative explanations for the occurrence frequency of a bigram [13]

We used the usual maximum likelihood estimates for p, p1 and p2 and write c1, c2, and c12 for the number of occurrences of w1, w2 and w1w2 in the corpus:

Now we can get likelihood ratio by assuming a binomial distribution and then following is asymptotically distributed (3) (4) (5) (6)

In the CLEF-IP 2011, we used confidence level of α=99.9. We tried both with cooccurrence terms and without them in the runs. 3

System Setup

We implemented our system according to procedures to be explained in this chapter.

We first extracted weighted keywords from each patent xml file provided in CLEFIP 2011 corpora, combined them in several bulk files, and indexed them. For indexer and searcher, we used Lucene. We implemented a simple searcher program implemented in Java, and a final classifier program applying search results. 3.1

Overall Architecture Our system is illustrated in the following architecture diagram. It’s very similar with traditional document search system.

Patent XML Files Online Translation

XML

Parser Translator Queries (Translated) Bulk Files

POS

Tagger (English only) Classifier Application Searcher Application

Feature Extractor Co-occurrence

Terms Extractor Searcher (Lucene)

Bulk Files (Features)

Indexer Application

Indexer (Lucene)

Index CLEF-IP System

Indexer/Searcher

We used our in-house English POS tagger for base English analysis.

For translating other languages like French or German into English, we used open online translation service, MyMemory [15].

We used Lucene 3.1.0 for the base search engine, and for accessing this engine, we wrote simple java applications for indexing and searching. We also wrote an application for classification which calls the searcher application. 3.2

Preprocessing and Indexing Basically, we used a nearly identical preprocessing system except this time we used English POS tagger instead of Korean one.

We selected only one XML document among various versions of a same patent to guarantee uniqueness of the patent, so that there’s no patent document with same application number in the entire index system. After this process, we got 1,331,182 unique patents in the EP(European Patents), and 437,987 in the WO(WIPO Patents)

During this procedure we also translated content fields with the online translation service. After some sample runs, we discovered that translating full text would consume a lot of time (and may lead to missing the CLEF-IP 2011 deadline), so we only translated the abstract and select sentences(about 2048 characters) in other content fields.

For feature extraction, we used these content fields: Title of Invention, Abstract, Description, and Claims. We also extracted some co-occurrence terms and select up to 5 terms with extracted features.

We finally produced bulk files with features with and without co-occurrence terms (we call them co-terms for simplification) for indexing. And we produced two separated indices, to analyze the effect of co-terms used in the search.

We also preprocessed patents used as topics (queries). In this case, translation with full content was conducted. 3.3

Prior Art Candidate Search We produced a total of 8 runs for search results. First 4 runs target index without coterms, and other 4 runs target index with co-terms. In each group, we changed upper overestimation threshold for non-content words removal from 1.0 to 1.3 (1.0, 1.1, 1.2, 1.3) resulting in 4 runs for each group.

We produced 1,000 results for every patent query. The results are produced without lower threshold in the weight of search results, so in most cases, our search results per one document were almost 1,000 documents. 3.4

Classification Because we used search results of PAC search, we also produced 8 runs for each classification task.

For KNN classification, we set K as 1,000, because we produced 1,000 search results per a query.

In fact, we have observed that combining reciprocal of search results than just counting the number of patents per category shows much better results. It’s similar to adoption of weighting in the average precision [16]. We basically adopted this improved weighting scheme in the KNN classification results we got.

To verify this intuition, we also ran the base condition and compared this result with improved weighting scheme applied in the CLEF-IP 2011 runs. 4

Results and Analysis Following is a simple report on our results.

4.1

PAC Search Results We simply show the result of our runs along with best runs of CLEF-IP 2011. CHEMNITZ.CUT_UHI_CLEFIP_BOW HYDERABAD.

TEXTRANK_IDFCITATIONALL WISENUT_R1_BASE_PAC WISENUT_R2_BASE_10_PAC WISENUT_R3_BASE_30_PAC WISENUT_R4_BASE_30_PAC WISENUT_R5_CO_PAC WISENUT_R6_CO_10_PAC WISENUT_R7_CO_20_PAC WISENUT_R8_CO_30_PAC

Entire Language

MAP SET_P SET_recal recal_5 recal_10 recal_20 0.0914 0.0037 0.4318 0.0896 0.1251 0.1635

English Only

MAP SET_P SET_recal recal_5 recal_10 recal_20 0.1009 0.0049 0.5233 0.0956 0.1401 0.1921

We got slightly improved result when co-terms are applied. And the differences in upper threshold in the non-content words extraction made no special differences.

And due to multilingual issues, our result showed quite low quality. (It’s partially displayed in English results that show quite narrower gaps with the top runners) We’ll try to find alternatives to overcome this issue. 4.2

IPC Classification Results We also got the results of our classification runs and compared them with ones from the other participant. RUN_NAME NIJMEGEN.RUN_ADMWCIT_CLS1 NIJMEGEN.RUN_ADMW_CLS1 WISENUT.WISENUT_R1_BASE_CLS1 WISENUT.WISENUT_R2_BASE_10_CLS1 WISENUT.WISENUT_R3_BASE_20_CLS1 WISENUT.WISENUT_R4_BASE_30_CLS1 WISENUT.WISENUT_R5_CO_CLS1 WISENUT.WISENUT_R6_CO_10_CLS1 WISENUT.WISENUT_R7_CO_20_CLS1 WISENUT.WISENUT_R8_CO_30_CLS1 set_P 0.5379 0.5436 0.2867 0.2871 0.2869 0.2871 0.2882 0.2883 0.2885 0.2884 set_recall

set_F_1.0 0.8563 0.8506 0.838 0.8389 0.8384 0.8387 0.8366 0.8371 0.8376 0.8376 0.6168 0.6186 0.4021 0.4027 0.4024 0.4027 0.4027 0.4029 0.4032 0.4031

Because we do not use model-based methods, our result showed lower result in the precision. We also didn’t limit the score of classification result; if we tune the score thresholds, it’s expected that we may produce much better results. 4.3

Refined IPC Classification Results Finally, our refined IPC classification results are displayed in the table below. If more labs participated in this track, we may get a better perspective on our quality. RUN_NAME NIJMEGEN.RUN_WINNOW_WORDS_CLS2 WISENUT.WISENUT_R1_BASE_CLS2 WISENUT.WISENUT_R2_BASE_10_CLS2 WISENUT.WISENUT_R3_BASE_20_CLS2 WISENUT.WISENUT_R4_BASE_30_CLS2 WISENUT.WISENUT_R5_CO_CLS2 WISENUT.WISENUT_R6_CO_10_CLS2 WISENUT.WISENUT_R7_CO_20_CLS2 WISENUT.WISENUT_R8_CO_30_CLS2 set_P

Whie simple comparison is quite dangerous, our system showed quite improved results in this track.

And as we stated before, we compared the new, refined weighting scheme (which is applied in CLEF-IP 2011) with the base one. Following table shows that result.

Refined scheme showed far better results than the base scheme, especially in precision. MAPs were also dramatically improved. (Note that precision and recall were micro-averaged, so they’re quite different from our reported values)

Considering the result of [ 9 ], our result is very promising, because precision in suggesting one IPC classification result showed almost the same or improved quality.

Conclusion and Future Work

We implemented a simple refined IPC classification system utilizing search results provided from PACS system. Though our PACS system showed rather lower performance than those of other labs, our refined classification results based on the search results of our system showed quite good performance, especially when the criteria for category selection is changed.

We left some challenges as future work.

First, we can improve PACS search results. For example, we didn’t set threshold in the score, only the maximum number of results. And we had a major problem in search results due to multilingual defects of our system. We may improve these problems at the next workshop.

Second, we can adapt model-based classification up to subclass level. In fact, it’s true that model-based classification method works well up to subclass level, so our IPC classification system should use classification model like SVM does.

Finally, we may optimize weighting factor of ranked documents in the refined IPC classification. As just using reciprocal of ranks in the search results improved the refined classification, it’s expected that adopting more sophisticated weighting factor in the KNN can produce improved classification results. 12. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing, MIT

Press, Cambridge (1999) 13. Dunning, T.: Accurate Methods for the Statistics of Surprise and Coincidence. In:

Computational Linguistics 19. pp. 61--74. (1993) 14. Casella, G., Berger, R.L.: Statistical Inference, 2nd edition, p. 375. Duxbury Press (2001) 15. MyMemory, http://mymemory.translated.net 16. Average Precision, http://en.wikipedia.org/wiki/Information_retrieval#Average_precision

1. Piroi , F. : CLEF- IP 2011 : Track Guidelines . IRF, Vienna ( 2011 )

2. Koster , C.H.A. , Seutter , M. , Beney , J.: Classifying Patent Applications with Winnow . In: Proceedings Benelearn Conference , Antwerpen ( 2001 )

3. Littlestone , N.: Learning Quickly when Irrelevant Attributes Abound: A new LinearThreshold Algorithm . In: Machine Learning , Vol. 2 , pp. 285 -- 318 . Springer, Nethelands ( 1988 )

4. Fall , C. J. , Torcsvari , A. , Benzineb , K. , & Karetka , G. : Automated Categorization in the International Patent Classification . In: ACM SIGIR Forum , Vol. 37 , Issue 1 . ACM, New York ( 2003 )

5. Fall

C. J.

, Benzineb

, Guyot

, Torcsvari

, Fievet

: Computer-Assisted Categorization of Patent Documents in the International Patent Classification . In: Proceedings of the International Chemical Information Conference (ICIC'03) , Nimes, France ( 2003 )

6. Wang , W. , Li , S. , Wang . C. : ICL at NTCIR-7: A Improved KNN Algorithm for Text Categorization . In: Proceedings of NTCIR-7 Workshop Meeting ( 2008 )

7. Xiao , T. , Cao , F. , Li , T. , Song , G. , Zhou , K. , Zhu , J. , Wang , H.: KNN and Re-ranking Models for English Patent Mining at NTCIR-7 . In: Proceedings of NTCIR-7 Workshop Meeting ( 2008 )

8. Mase , H. , Iwayama , M. : Hitachi Ltd.: NTCIR-7

Patent

Mining Experiments at Hitachi . In: Proceedings of NTCIR-7 Workshop Meeting ( 2008 )

9. Cai , Y. , Ji , D. , Cai , D. :

A KNN

Research Paper Classification Method Based on Shared Nearest Neighbor . In: Proceedings of the 8th NTCIR Workshop Meeting . pp. 336 -- 340 . ( 2008 )

10. k-nearest neighbor algorithm , http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

11. Lopez , P. , Romary , L. : Experiments with Citation Mining and Key-Term Extraction for Prior Art Search . In: CLEF-IP 2010 , Padua ( 2010 )