Patent classification experiments with the Linguistic Classification System LCS

Suzan Verberne, Merijn Vogel and Eva D'hondt
Information Foraging Lab, Department of Computer Science, Radboud University Nijmegen
s.verberne@let.ru.nl

Abstract. In the context of the CLEF-IP 2010 classification task, we conducted a series of experiments with the Linguistic Classification System (LCS). We compared two document representations for patent abstracts: a bag-of-words representation and a syntactic/semantic representation containing both words and dependency triples. We evaluated two types of output: using a fixed cut-off on the ranking of the classes and using a flexible cut-off based on a threshold on the classification scores. Using the Winnow classifier, we obtained an improvement in classification scores when triples are added to the bag of words. However, our results are remarkably better on a held-out subset of the target data than on the 2 000-topic test set. The main findings of this paper are: (1) adding dependency triples to words has a positive effect on classification accuracy and (2) selecting classes by using a threshold on the classification scores instead of returning a fixed number of classes per document improves classification scores while at the same time lowering the number of classes that need to be judged manually by the professionals at the patent office.

1 Introduction

In this paper, we describe the classification experiments that we conducted in the context of the Intellectual Property (IP) track at CLEF (CLEF-IP1). In 2009, the track was organized for the first time with a prior art retrieval task. In 2010, a classification task was added to the track. The goal of the classification task at CLEF-IP is to "classify a given patent document according to the International Patent Classification system (IPC)". For the purpose of the track, the organization released a collection of 2.6 million patent documents pertaining to 1.3 million patents from the European Patent Office (EPO), with content in English, German and French. From the collection, 2 000 documents (the 'topics') were held out as test set. The remainder of the corpus constitutes the target data, on which participants could develop their methods. The target data comprise EPO documents with application dates older than 2002. Multiple documents pertaining to the same patent were not merged; the task was to classify them as individual documents. In the IPC system, documents are ordered hierarchically into sections, classes, subclasses, main groups and subgroups. In CLEF-IP, the classification task was to classify documents on the subclass level.

In our experiments, we focused on the use of different document representations in text classification. We compared the widely used bag-of-words model to a document representation based on syntactic/semantic terms, namely dependency triples (DTs). Dependency triples are normalized syntactic units consisting of two terms (a head and a modifier) and the syntactic relation between them (e.g. subject, object or attribute). For all experiments, we used the Linguistic Classification System (LCS).

In this notebook paper, we first explain which parts of the corpus we used and how we prepared the data (Section 2). In Section 3, we describe our experiments and the classification settings used. We conclude with a discussion of the results (Section 4) and a plan for follow-up experiments (Section 5).

1 http://www.ir-facility.org/research/evaluation/clef-ip-10
2 Data preparation

The data selection in our experiments was motivated by practical concerns: since we wanted to compare classification experiments using bag-of-words and syntactic/semantic terms, the choice of data was limited to abstracts, as these are the easiest and consequently the fastest to parse. We parsed all (over 500,000) English abstracts of the corpus in a couple of days. Parsing all the claims and/or description sections would have taken considerably longer because of the extremely long and complex sentences used in these sections [4].

We extracted from the corpus all files that contain both an abstract in English and at least one IPC class in the classification field.2 We extracted the IPC classes on the document level, not the invention level. This means that we did not include the inventions for which the IPC class is in a different file than the English abstract. We saved the abstract texts in plain text and administrated the IPC classes in a separate file.

For the bag-of-words representation, we ran a simple normalization script that removed punctuation, capitalization and numbers from all abstract files. For the syntactic/semantic representation, we parsed the abstract texts with the AEGIR dependency parser [3]. AEGIR allows us to set a maximum parse time per sentence, which is useful since parsing speed goes down for longer (and hence more ambiguous) sentences. The output of the parser is, for each abstract, a list of dependency triples that have undergone a number of normalizing transformations on the morphological and syntactic level, such as the transformation from passive to active voice (hence the term 'syntactic/semantic').

Figure 1 gives an example of a small original text, the normalized text in the bag-of-words representation and the triples in the syntactic/semantic representation. For the experiments with the syntactic/semantic representation, the triples are concatenated to the words. Table 1 gives general statistics on the target data and the test data.

Original text: Heat is stored at a steady temperature using calcium chloride hexahydrate and up to 20 percent strontium chloride hexahydrate to assist crystallisation.

Words: heat is stored at a steady temperature using calcium chloride hexahydrate and up to percent strontium chloride hexahydrate to assist crystallisation

Triples: [IT,SUBJ,store] [store,OBJ,heat] [store,PREPat,temperature] [temperature,ATTR,steady] [temperature,DET,a] [chloride,ATTR,calcium] [hexahydrate,ATTR,chloride] [hexahydrate,ATTR,using] [up,PREPto,20 percent] [assist,OBJ,crystallization] [chloride,ATTR,strontium] [hexahydrate,ATTR,chloride] [hexahydrate,SUBJ,assist]

Fig. 1. Part of the original text from the abstract of document EP-0011358-A1.txt and the two document representations that we created: normalized text (words) and syntactic/semantic terms (triples)

Table 1. Statistics on the target data and test data (topic set)

                                                                         target data   test data
# of files                                                                 2 680 604       2 000
# of files with an English abstract and IPC-R class                         532 274       2 000
% of abstract files with empty parser output (max. parse time 10 secs)           3.5         4.6
# of different IPC-R subclasses                                                   629         476
Average number of classes per file                                                2.7         2.3

2 IPC-R is the IPC Reform classification, sometimes also called IPC8. See http://www.intellogist.com/wiki/IPC Classification System.
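The construction of the two document representations can be illustrated with a short sketch. The following is our own minimal reconstruction in Python of the normalization and of concatenating triples to words as described above; the function names and the bracketed triple format are assumptions for illustration, not the actual scripts used in the experiments.

```python
import re

def normalize_abstract(text):
    # Lower-case and keep only alphabetic tokens: this removes punctuation,
    # capitalization and numbers, as in the bag-of-words representation.
    return " ".join(re.findall(r"[a-z]+", text.lower()))

def triples_to_terms(triples):
    # Turn parser output such as ("store", "OBJ", "heat") into single
    # bracketed terms so they can be mixed with ordinary words.
    return ["[%s,%s,%s]" % t for t in triples]

def build_document(text, triples=None):
    # Words-only representation, optionally with the triples concatenated
    # (the words+triples representation).
    terms = normalize_abstract(text)
    if triples:
        terms += " " + " ".join(triples_to_terms(triples))
    return terms

# Abbreviated example in the spirit of Figure 1:
abstract = "Heat is stored at a steady temperature."
parsed = [("IT", "SUBJ", "store"), ("store", "OBJ", "heat"),
          ("store", "PREPat", "temperature"), ("temperature", "ATTR", "steady")]
print(build_document(abstract, parsed))
# heat is stored at a steady temperature [IT,SUBJ,store] [store,OBJ,heat] ...
```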
3 Classification experiments with LCS

For our classification experiments, we use the Linguistic Classification System (LCS)3 [2, 1]. The LCS can perform both mono-classification (each document belongs to precisely one class) and multi-classification. In the training phase, the LCS takes as input a file that lists the paths to the training documents, each followed by its classes. After this training phase, the LCS can be used for testing the obtained classifier on a collection of documents with known classes (usually held-out training data), or for producing a classification of new documents without known classes.

3 A demo of the application can be found at http://ir-facility.net/news/linguistic-classification-system-prototype/ for registered IRF members.

3.1 Experimental set-up

Three classifiers have been implemented in the LCS: Naive Bayes, Winnow and SVMlight. We experimented with both Winnow and SVMlight and found that their classification accuracy scores are comparable but that SVMlight is much slower. For example, in order to train a model based on 425 819 abstracts that belong to 629 different subclasses, Winnow needed around two hours (independent of the document representation used) while SVMlight spent six and a half hours on the same task. Therefore, we decided to use Winnow for the CLEF-IP experiments.

Winnow has a number of parameters that can be tuned: α, β and maxiters (the number of training iterations). After some tuning around the default values, we decided to use α = 1.02 and β = 0.98. For maxiters, we experimented with three and ten iterations, and found that the classification accuracy still improved somewhat after the third iteration. Therefore, we decided to use ten iterations.

In the case of multi-classification, the LCS is flexible with respect to the number of classes that is returned per document. Internally, it produces a full ranking of classes for each document in the test set. The user can regulate the selection of classes with three parameters: (1) a threshold that puts a lower bound on the classification score for a class to be selected, (2) the maximum number of classes selected per document ('maxranks') and (3) the minimum number of classes selected per document ('minranks'). We kept the selection threshold at 1.0 (the default). Based on the average number of classes per document in the target data, we decided to set maxranks = 4. Setting minranks = 1 ensures that each document is assigned at least one class, even if all classes have a score below the threshold.

We present the results of four experiments with the LCS:

1. Classifying abstracts from the target data in the bag-of-words (words-only) representation into IPC-R subclasses
2. Classifying abstracts from the target data in the syntactic/semantic (words+triples) representation into IPC-R subclasses
3. Classifying abstracts from the test data in the bag-of-words (words-only) representation into IPC-R subclasses
4. Classifying abstracts from the test data in the syntactic/semantic (words+triples) representation into IPC-R subclasses

For experiments 1 and 2, we randomly split the target data: we used 80% of the data for training the classifier and 20% for testing. We repeated this four times with different random splits and calculated the mean and standard deviation over the four outcomes in order to get a measure for the reliability of the results. For experiments 3 and 4, we applied classification models that were previously trained on a random 80% of the target data to the 2 000 abstracts from the test data, after the relevance assessments for the topics had been released by the organization.
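The behaviour of the three selection parameters can be summarized with a small sketch. The following is our own illustration of the selection rule described above (with threshold = 1.0, maxranks = 4 and minranks = 1), not LCS code; in particular, treating a score exactly equal to the threshold as selected is an assumption.

```python
def select_classes(ranked, threshold=1.0, maxranks=4, minranks=1):
    """Select classes from a score-ranked list [(class, score), ...].

    A class is selected if its score reaches the threshold; at most
    `maxranks` classes are returned, and at least `minranks` classes are
    returned even if all scores fall below the threshold.
    """
    ranked = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    selected = [c for c, score in ranked[:maxranks] if score >= threshold]
    if len(selected) < minranks:
        selected = [c for c, _ in ranked[:minranks]]
    return selected

# Example with made-up scores: only one class passes the threshold.
ranking = [("F28D", 1.4), ("C09K", 0.9), ("F24H", 0.4)]
print(select_classes(ranking))  # -> ['F28D']
```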
3.2 Results

We present the results in terms of precision (P), recall (R) and their harmonic mean (F1) for two types of output: (a) the classes that were selected using the threshold on the classification scores in the LCS and (b) the classes that were returned using a fixed cut-off point in the class ranking. For the threshold-based cut-off, precision and recall are calculated using:

P = |relevant classes ∩ selected classes| / |selected classes|    (1)

R = |relevant classes ∩ selected classes| / |relevant classes|    (2)

For the fixed cut-off, precision and recall are calculated using:

P@n = |relevant classes ∩ classes returned@n| / |classes returned@n|    (3)

R@n = |relevant classes ∩ classes returned@n| / |relevant classes|    (4)

We chose n = 4 as the cut-off point for evaluating the ranking because it best compares to our parameters for the threshold-based cut-off in the LCS (maxranks = 4). In addition, we give the results in terms of P@1 and R@50 because precision is especially relevant in the high ranks and recall in the longer tail. We also give Mean Average Precision (MAP) for each of the experiments. The results for the target data and the test data are in Tables 2 and 3 respectively.

Table 2. Classification results using Winnow on abstracts from a held-out subset of the target data. P, R and F1 are averages over four random 80–20 splits of the data; the standard deviation is given between brackets. All numbers except MAP are percentages.

                   Threshold-based cut-off                    Fixed cut-off
                   P             R             F1             P@1     P@4     R@4     R@50    F1@4    MAP
1. words-only      67.63 (0.17)  61.28 (0.15)  64.30 (0.08)   80.91%  47.90%  70.41%  90.06%  57.01%  0.717
2. words+triples   73.64 (0.08)  61.74 (0.13)  67.16 (0.07)   83.11%  50.21%  73.70%  93.73%  59.73%  0.755

Table 3. Classification results using Winnow on abstracts from the test data (2 000 topics). All numbers except MAP are percentages.

                   Threshold-based cut-off     Fixed cut-off
                   P       R       F1          P@1     P@4     R@4     R@50    F1@4    MAP
3. words-only      60.06   52.06   55.77       69.95%  37.46%  64.60%  87.61%  47.42%  0.665
4. words+triples   61.52   52.08   56.41       71.85%  38.36%  66.16%  89.59%  48.56%  0.685
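For clarity, Equations 1–4 can be read as the following per-document computation. This is a minimal sketch of how we interpret the measures, not the evaluation script that produced the numbers in Tables 2 and 3; the example classes are made up.

```python
def precision_recall(relevant, selected):
    # Threshold-based cut-off: Equations (1) and (2).
    relevant, selected = set(relevant), set(selected)
    hits = len(relevant & selected)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_recall_at_n(relevant, ranking, n=4):
    # Fixed cut-off: Equations (3) and (4), using the top-n ranked classes.
    return precision_recall(relevant, ranking[:n])

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example document with two relevant subclasses (hypothetical data):
relevant = ["F28D", "C09K"]
ranking = ["F28D", "F24H", "C09K", "A61K", "B01J"]
p, r = precision_recall_at_n(relevant, ranking, n=4)
print(p, r, f1(p, r))  # 0.5 1.0 0.666...
```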
4 Discussion

We compare the classification results from three different points of view: (1) the two document representations (words-only vs. words and triples), (2) the target data vs. the test data and (3) the threshold-based cut-off vs. the fixed cut-off for the class ranking.

With respect to the first point, we observe a significant improvement in classification performance on the target data when we add triples to the bag of words: F1 increases from 64.30 (with a standard deviation of 0.08) to 67.16 (with a standard deviation of 0.07). However, on the test data, this difference is much smaller and probably not significant.4

4 We cannot measure standard deviations for the test data because the topic set is too small to split up and compare the results on random subsets of it.

That brings us to the second point: the results for the target data and the test data are very different from each other. Overall classification scores are lower for the topic test set than they are for a held-out set from the target data (F1 for words-only is 55.77 compared to 64.30). Inspection of the files in both sets shows that all files included in the test data are newer than the ones in the target data. This was done by the CLEF-IP organization to reflect the realistic task of classifying incoming patent applications using a model trained on existing patents. The fact that models trained on older abstracts fit held-out abstracts from the same period better than more recent abstracts suggests that the content of the patents belonging to a specific subclass has changed over time.

It is more difficult to explain why the improvement gained from adding triples to words is smaller for the test data than it is for the target data. Table 1 shows that more abstracts had empty parser output in the test data than in the target data, but this difference is small (4.6% and 3.5% respectively). We checked the output of the parser for the topic abstracts, but we have no reason to believe that the topic abstracts were so much more difficult to parse as to result in less reliable triples. This leaves us with the option that the smaller improvement is (at least partly) due to coincidence. There are only 2 000 topic abstracts, classified into 476 different IPC-R subclasses. A different selection of 2 000 abstracts could easily lead to a few percent change in the classification accuracy.

Finally, we compared the results on the ranking with a fixed cut-off to the results for the threshold-based cut-off. We see that class selection using a threshold on the classification score has a positive effect on both the precision and the recall, and hence on the F1 score (64.30% compared to 57.01% at rank 4 for words-only on the target data). Selecting classes by using a threshold on the classification scores instead of returning a fixed number of classes per document leads to better classification while a lower number of classes needs to be judged manually.

5 Follow-up experiments

For the proceedings of CLEF-IP 2010, we plan to conduct follow-up experiments in two directions.

First, we will investigate why the improvement gained from adding triples to words is smaller for the test data than it is for the target data. We plan to look into (1) the distribution of IPC classes in the test data compared to the target data, (2) the subset of IPC classes that are covered by the target data but not by the test data and (3) the impact of triples compared to words in the class profiles of these classes. In order to find out whether the differences between the results for the test data and the target data are due to coincidence, we plan to create at least five test sets of 2 000 abstracts extracted from the same time slice of the MAREC corpus as the supplied topic test set. Then we will classify these sets using the same models trained on the target data in order to obtain the variation of the classification accuracy on test sets of 2 000 abstracts.

Second, we plan to set up a series of tuning experiments for the threshold parameter in the LCS on a held-out development set, to see if we can gain additional improvement from optimizing the class selection.
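The planned variance check could take the following shape. This is a hypothetical outline only: classify_and_score is an assumed helper that applies a trained model to a sample of abstracts and returns its F1 score, and the sampling details are our own choices.

```python
import random
import statistics

def accuracy_variation(candidate_ids, classify_and_score,
                       n_sets=5, set_size=2000, seed=1):
    # Draw several samples of 2 000 abstracts from the same time slice as the
    # official topic set and score each one with the trained model, in order
    # to estimate how much the classification accuracy varies between samples.
    rng = random.Random(seed)
    scores = [classify_and_score(rng.sample(candidate_ids, set_size))
              for _ in range(n_sets)]
    return statistics.mean(scores), statistics.stdev(scores)
```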
References

1. C.H.A. Koster and J.G. Beney. Phrase-based document categorization revisited. In Proceedings of the 2nd International Workshop on Patent Information Retrieval, pages 49–56. ACM, 2009.
2. C.H.A. Koster, M. Seutter, and J. Beney. Multi-classification of patent applications with Winnow. Lecture Notes in Computer Science, pages 545–554, 2003.
3. Nelleke Oostdijk, Suzan Verberne, and Cornelis H.A. Koster. Constructing a broad coverage lexicon for text mining in the patent domain. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), 2010.
4. Suzan Verberne, Eva D'hondt, Nelleke Oostdijk, and Cornelis H.A. Koster. Quantifying the challenges in parsing patent claims. In Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010), pages 14–21, 2010.