         Using BM25F and KLD for Patent Retrieval

                Joaquín Pérez-Iglesias, Álvaro Rodrigo, Víctor Fresno

                         NLP & IR Group, UNED, Madrid
              {joaquin.perez,alvarory,vfresno}@lsi.uned.es


       Abstract. In this paper we describe our system for the Prior Art task of CLEF-IP
       2010 (a task focused on retrieving the patents relevant to a given one) and its
       results. We have developed a system in which patents are indexed by fields in
       order to allow the selection of the most discriminative terms of each field,
       applying the Kullback-Leibler divergence as a feature selection method, and the
       use of a different boost factor for each field, applying BM25F as the ranking
       function. Although CLEF-IP is proposed in a multilingual scenario, we have
       approached it from a monolingual perspective. Our results are around the average
       of last year's results, which encourages us to continue developing this system by
       adding some kind of multilingual processing.


1   Introduction
The automatic retrieval of patents is an important application in the process of
publishing new patents. Before a patent is published, the novelty of the invention must
be established. However, searching manually for similar patents has a high cost.
Furthermore, patents may have been published in any country and in any language;
this is therefore a multilingual task, where the patents to be retrieved may be written in
a language different from that of the candidate patent. These are some of the reasons
why there is a growing interest in developing automatic systems able to retrieve the
patents relevant to a given one.
     The CLEF-IP task at the Cross Language Evaluation Forum aims at evaluating
patent retrieval systems in a multilingual scenario. CLEF-IP proposes two different
tasks: the Prior Art task and the Classification task. We have participated in the Prior
Art task, where a set of patents (which we call topic patents) is given to participants,
and each participant has to return, for each topic patent, a ranking of the patents
considered relevant to it.
     This search is performed over a collection of 2.6 million patent documents, and
the set of topic patents contains 2000 patents (it is also possible to submit runs over a
subset of 500 patents). The documents in the collection represent different versions of
a patent; a version may contain additional information or some changes with respect to
the other versions of the same patent.
     In short, our approach to the Prior Art task is an Information Retrieval (IR) system
that uses BM25F as the ranking function and applies the Kullback-Leibler divergence
to select the most discriminative terms of each field of a given topic patent. We then
build a query with these terms and assign different boost factors to the terms of the
different fields in order to improve the ranking of candidate relevant patents.
     The structure of this paper is as follows: the main components of our system are
described in Section 2. The submitted runs are described in Section 3, and their results
are shown in Section 4. Finally, some conclusions and future work are given in
Section 5.

2   System Overview
This section describes the different modules of our system. The architecture of the
system is given in Figures 1 and 2.




                          Fig. 1. Architecture of indexing phase.

                          Fig. 2. Architecture of retrieval phase.

    The system has three major parts: the first one preprocesses the patent documents
and builds one index per language; the second one builds the queries; and the third one
retrieves the patents relevant to a query patent. In addition, we added the possibility of
applying predictors to the output of the IR engine. The components of our system are
described in detail in the following subsections.

2.1 Indexing
Indexing the whole Patent The different versions of a patent represent its different
stages before the final version. Each document has a kind code indicating the stage of
the document. The different versions of a patent do not always contain all the data of
the patent. For this reason, participants are given the option of merging the different
versions of a patent into a single document for processing purposes.
     Since the objective of the task is to retrieve relevant patents (no matter the version),
we considered it more appropriate to work with patents obtained by combining all their
versions. The main reason for this decision was that the different stages of a patent
represent different subsets of the whole patent, so combining them makes the
terminology of the whole patent available when selecting its most discriminative
terms.

Indexing by Fields Once we have a merged patent, only the terms of the fields we
consider relevant are extracted. We selected as relevant fields the “description”,
“title”, “claims” and “abstract”, based on the observations already made in
CLEF-IP 2009 [1]. The terms of each patent field are then indexed in a separate field
of the index, resulting in four fields per index.
     Our decision to use four different fields in the index arose from the fact that each
patent has different sections, and the terms of a certain section can be more
representative for that section than for the others. Moreover, highly discriminative
terms of a section may not be as relevant when used for searching over whole patents.
Furthermore, by using fields we can assign a different boost factor to the terms of each
field. Thus, if we think that a field is more relevant than another one, it is easy for us
to give more importance to that field.
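
As an illustration of this design, the following minimal sketch shows how such a per-field
index could be kept in memory. It is not our actual implementation (the IR engine we use
is not detailed here); the FieldedIndex class and its field names are merely illustrative.

from collections import defaultdict

FIELDS = ("title", "abstract", "description", "claims")

class FieldedIndex:
    """Minimal in-memory inverted index with one posting structure per field."""

    def __init__(self):
        # postings[field][term][doc_id] -> frequency of term in that field of the document
        self.postings = {f: defaultdict(dict) for f in FIELDS}
        # field_length[field][doc_id] -> number of tokens of that field
        self.field_length = {f: {} for f in FIELDS}
        self.doc_ids = set()

    def add_patent(self, doc_id, patent):
        """patent: dict mapping a field name to its list of preprocessed tokens."""
        self.doc_ids.add(doc_id)
        for field in FIELDS:
            tokens = patent.get(field, [])
            self.field_length[field][doc_id] = len(tokens)
            for term in tokens:
                tf = self.postings[field][term].get(doc_id, 0)
                self.postings[field][term][doc_id] = tf + 1

    def avg_field_length(self, field):
        lengths = self.field_length[field].values()
        return sum(lengths) / len(lengths) if lengths else 0.0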

Indexing by Language The collection of patents used in CLEF-IP contains documents
in English, German and French. At indexing time we have two options regarding how
to deal with languages: to create a single index containing all the patents regardless of
their language, or to create one index per language.
    It must be taken into account that in this edition we used a monolingual approach.
If we used a single index, the terms of one language might affect the frequency
statistics of terms in another language. For example, a term that is highly relevant in
one language could be a stopword in another language and, after stopword removal,
would not appear in any document of the second language. Hence, the term frequency
statistics used for retrieving documents might not be appropriate if we merged
documents in different languages. This is why we decided to create three different
indexes, one per language, each containing only the patents in that language. Fields
written in a language different from that of the document (for example, the title can
appear in several languages) are ignored.
    The main drawback of this decision is that a patent relevant to a topic expressed in
a different language will not be retrieved. Finally, the preprocessing step used to feed
the indexes included language-specific stemming and stopword removal.
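
The following sketch illustrates this per-language routing and preprocessing under the
same assumptions as the previous listing. It reuses the hypothetical FieldedIndex class
and relies on NLTK's Snowball stemmers and stopword lists (the stopword corpus must be
available locally), which stand in for whatever language-specific resources are actually
used.

from nltk.corpus import stopwords            # stopword corpus assumed to be downloaded
from nltk.stem.snowball import SnowballStemmer

LANGS = {"EN": "english", "DE": "german", "FR": "french"}
STEMMERS = {code: SnowballStemmer(name) for code, name in LANGS.items()}
STOPWORDS = {code: set(stopwords.words(name)) for code, name in LANGS.items()}

# one FieldedIndex (see the previous sketch) per language
indexes = {code: FieldedIndex() for code in LANGS}

def preprocess(text, lang):
    """Lowercase, keep alphabetic tokens, drop stopwords and stem, per language."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [STEMMERS[lang].stem(t) for t in tokens if t not in STOPWORDS[lang]]

def index_patent(doc_id, raw_fields, lang):
    """Index a merged patent only in the index of its own language; fields written
    in other languages are assumed to have been discarded beforehand."""
    patent = {field: preprocess(text, lang) for field, text in raw_fields.items()}
    indexes[lang].add_patent(doc_id, patent)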

2.2 Building the Query
Once all the patents are indexed, we have to decide which query will be used for
retrieving the patents most relevant to a given one. The first possibility would be to use
the whole topic patent as a query. However, this differs from classical IR, where
queries are much smaller, and a large query increases the time needed to obtain the
output from the IR engine. This observation led us to select, for each topic patent, the
most discriminative terms for building the query. Since we considered it important to
separate the different fields of a patent in the index, we obtained the most relevant
terms field by field.
    We made use of language models for selecting the terms of a query. In particular,
given a topic document we built two different language models: one for the whole
indexed collection and another one for the topic document. Each of these language
models is built separately for each of the considered fields.
    We computed the Kullback-Leibler divergence (KLD) [2] between the language
model of the topic and that of the collection (at the level of each field). A term is
considered highly discriminative if it appears frequently in the topic patent q and, at
the same time, its frequency in the collection is not significant. KLD measures the
divergence between two probability distributions: here, the probability of a term
within the topic and within the whole collection C. It can be expressed as follows:

                  KLD_{p_Q, p_C}(t) = p_Q(t) \cdot \log \frac{p_Q(t)}{p_C(t)}                (1)

     where p_Q is the probability of each term t within the patent topic q, that is, the
frequency of t in q divided by the length (number of terms) of q; and p_C is the
probability of the same term t within the whole collection, that is, the frequency of t in
the collection divided by the total number of terms in C.
     Applying the previous equation we can rank all the terms of the topic patent
according to their importance within the query. After ranking the terms by their
divergence, a threshold is established and only the terms whose divergence is above the
threshold are selected.
     Thus, by using KLD we build queries that contain the most discriminative terms of
each field, which we hope will allow us to retrieve the patents most relevant to a given
one.
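
A small sketch of this selection step is given below, under the same assumptions as the
previous listings. For simplicity it keeps a fixed number of top-ranked terms per field
rather than applying an explicit divergence threshold, which matches how the submitted
runs are configured (see Section 3).

import math
from collections import Counter

def kld_scores(topic_tokens, collection_tf, collection_len):
    """Score each term of one topic field with Eq. (1): p_Q(t) * log(p_Q(t) / p_C(t)).
    Terms unseen in the collection are skipped here; a smoothed collection model
    could be used instead."""
    topic_tf = Counter(topic_tokens)
    topic_len = len(topic_tokens)
    scores = {}
    for term, tf in topic_tf.items():
        ctf = collection_tf.get(term, 0)
        if ctf == 0:
            continue
        p_q = tf / topic_len
        p_c = ctf / collection_len
        scores[term] = p_q * math.log(p_q / p_c)
    return scores

def most_discriminative(topic_tokens, collection_tf, collection_len, k):
    """Keep the k terms with the highest divergence (called once per field)."""
    scores = kld_scores(topic_tokens, collection_tf, collection_len)
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]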


2.3   Patent Retrieval

With the query already built, we feed the IR engine with it in order to obtain a ranking
of patents considered relevant to the input query. The choice of IR model must take into
account that we are working with an index with several fields. The combination of
fields can be achieved by using a ranking function able to exploit this type of document
structure. This is why we chose BM25F [4].
    In order to use BM25F, we first obtain the accumulated weight of a term over all
fields as follows:

                  weight(t, d) = \sum_{c \in d} \frac{occurs^{d}_{t,c} \cdot boost_c}{(1 - b_c) + b_c \cdot l_c / avl_c}
    where l_c is the length of field c; avl_c is the average length of field c; b_c is a
constant related to the field length, similar to b in BM25; and boost_c is the boost
factor applied to field c.
    Next, a non-linear saturation, weight / (k_1 + weight), is applied in order to reduce
the effect of term frequency on the final score:

                  R(q, d) = \sum_{t \in q} \frac{weight(t, d)}{k_1 + weight(t, d)} \cdot idf(t)                (2)

      idf(t) is computed as in BM25:

                  idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}                (3)

where N is the number of documents in the collection and df(t) is the number of
documents in which the term t appears. We decided to include different boost factors
because they allow us to change the importance given to the terms of a certain field.
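
Putting the field weight, the saturation and the idf formulas together, a sketch of BM25F
scoring over the hypothetical FieldedIndex of the earlier listings could look as follows.
The value of k_1 and the per-field b_c and boost_c parameters are illustrative, and df(t)
is approximated as the number of documents containing t in any field.

import math

def bm25f_score(query_terms, doc_id, index, boost, b, k1=1.2):
    """BM25F score of one document for a bag of query terms, following the weight,
    saturation and idf formulas above. `boost` and `b` map each field to its
    boost_c and b_c; k1 = 1.2 is just a common default."""
    n_docs = len(index.doc_ids)
    score = 0.0
    for term in query_terms:
        weight = 0.0
        docs_with_term = set()
        for field in boost:
            postings = index.postings[field].get(term, {})
            docs_with_term.update(postings)
            tf = postings.get(doc_id, 0)
            if tf == 0:
                continue
            l_c = index.field_length[field][doc_id]
            avl_c = index.avg_field_length(field)
            norm = (1.0 - b[field]) + b[field] * (l_c / avl_c)
            weight += (tf * boost[field]) / norm
        if weight == 0.0:
            continue
        # df(t) approximated as the number of documents matching t in any field
        df = len(docs_with_term)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        score += (weight / (k1 + weight)) * idf
    return score

For instance, the boosts of run 5 in Section 3 would correspond to something like
boost = {"title": 5.0, "description": 2.0, "claims": 1.0, "abstract": 1.0}.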

2.4    Predictors
A novelty of our system is the use of predictors for selecting the most promising
ranking for a given patent topic when several queries are issued for that topic. That is,
the predictor allows us to create and launch different queries for a given topic and to
select the ranking that is expected to perform best.
    We have applied predictors based on score dispersion: we assume that if a ranking
has a high standard deviation of its document scores, the ranking function has probably
been able to discriminate between relevant and non-relevant documents [3]. On the
other hand, if the dispersion is low because the ranking function has assigned similar
weights, we interpret that it was not able to distinguish relevant documents from
non-relevant ones.
    Therefore, two different prediction methods are tested. Given the list of ranking
scores RL and its mean \mu(RL), we define two different predictors:

    – Max SD:
                  \sigma_{max} = \max \{ \sigma(RL_{[1,d]}) : d \in RL \}
    – SD at best cut point, where the ranking list is cut at size 60, a value inferred
      from the training collection:
                  \sigma(RL) = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( score(d_i) - \mu(RL) )^2 }
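
Both predictors are simple enough to sketch directly; the following is illustrative rather
than our exact implementation, with rankings represented as lists of (doc_id, score) pairs
sorted by decreasing score.

import statistics

def sd_at_cut(scores, cut=60):
    """Standard deviation of the top-`cut` scores of a ranking; 60 is the cut
    point inferred from the training collection."""
    top = scores[:cut]
    return statistics.pstdev(top) if len(top) > 1 else 0.0

def max_sd(scores):
    """Max SD predictor: maximum standard deviation over all ranking prefixes."""
    if len(scores) < 2:
        return 0.0
    return max(sd_at_cut(scores, cut=d) for d in range(2, len(scores) + 1))

def select_ranking(rankings, predictor=sd_at_cut):
    """Given several rankings (lists of (doc_id, score) pairs) obtained with
    different queries for the same topic, keep the one the predictor prefers."""
    return max(rankings, key=lambda r: predictor([score for _, score in r]))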


3     Submitted Runs
The different runs submitted to CLEF-IP were obtained by applying different decisions
in some of the modules of the system, according to the results obtained during the
development period.
    The first decision for creating a run was the number of terms to be used in a query.
We looked at the average length of each field in the collection and performed several
experiments that helped us decide the number of terms to use. These terms are selected
by taking the most discriminative ones of each field according to KLD.
    The first intuition is that the more terms in the query, the better the results.
However, it is not clear that adding more terms to a query improves results and, in
addition, a larger query increases the time needed to run it. This is why we performed
experiments considering fewer than 1000 terms per query. We experimented using terms
from all the fields, as well as terms from a single field, and found that the best results
were obtained with terms from the description field alone.
    Secondly, we also gave a different boost value to each field in order to increase the
importance of the terms of a certain field in a query. In this case, we relied on last
year's observations about the importance of the title in this task. The other field that
received a higher boost factor was the description, which proved to be important in our
experiments: the performance of the system improved when its boost factor was
increased.
    The next step is to decide which index is going to be used. Given a topic, its
language (information available in the topic itself) determines the index to be searched,
and only that index is used. Thus, English topics are searched only in the index of
English patents, and so on. This decision leads to a decrease in performance when there
are relevant patents in a language different from that of the topic; in this case we would
need some kind of cross-lingual retrieval, which is not available in the current version
of our system.
    Finally, it must be considered that our system returns a ranking of patents, but
participant runs at CLEF-IP must return specific versions of a patent. Since all the
versions of a patent were combined into the whole patent, given a patent in the ranking
returned by the IR engine, we replaced its id with the ids of all its versions.
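
A minimal sketch of this expansion step is shown below; the versions_of mapping is
hypothetical and would be collected while merging the versions of each patent at
indexing time.

def expand_to_versions(ranking, versions_of):
    """Replace each patent id in the ranking by the ids of all its versions,
    preserving the ranking order; `versions_of` maps a patent id to the list of
    its version ids."""
    expanded = []
    for patent_id, score in ranking:
        for version_id in versions_of.get(patent_id, [patent_id]):
            expanded.append((version_id, score))
    return expanded
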
    Taking into consideration all the aspects mentioned above, we submitted the
following eight runs:

 1. Run 1 (Predictor maxsd): given the rankings returned by the IR engine in the
    other runs (excluding the other predictor run), this run selected for each topic the
    most promising ranking according to the Max SD predictor (described in Section 2.4).
 2. Run 2: the queries built in this run contained the whole title, 30 terms from the
    abstract, 100 terms from the description and 100 terms from the claims section of
    the topic patent. The terms are selected by taking the most discriminative ones
    according to KLD. The boost factors of all the fields are the same.
 3. Run 3: the queries used in this run have the same terms as the ones in run 2. The
    only difference was giving the terms of the description field a boost factor twice
    that of the other fields. With this run, we wanted to test our intuition of giving
    more importance to the description field.
 4. Run 4: the queries used in this run have the same terms as the ones in run 2. The
    only difference was giving the title a boost factor five times that of the other
    fields. We created this run based on the observations made by last year's
    participants about the importance of the title.
 5. Run 5: the queries used in this run have the same terms as the ones in run 2. The
    only difference was giving the title a boost factor five times that of the other
    fields, except the description field, which received twice the boost factor of
    claims and abstract. This run was created to test the effect of giving more
    importance to both the title and the description. The reason for giving much more
    importance to the title was that this field has fewer terms than the other fields,
    so it needs a greater boost factor.
 6. Run 6: the queries built in this run contained only the 120 most discriminative
    terms from the description field of the topic patent, selected according to KLD.
    With this run, we wanted to test the importance of the description field by using
    only terms from it. The number of terms was chosen based on the results of the
    development period, where good results were obtained considering only this field.
 7. Run 7: the queries used in this run used only terms from the description field, as
    in run 6. However, we included 240 terms instead of 120 in order to check whether
    more terms from the description were helpful.
 8. Run 8 (Predictor sd 60): given the rankings returned by the IR engine in the
    other runs (excluding the other predictor run), this run selected for each topic the
    most promising ranking according to the predictor that uses the standard deviation
    at cut 60 (described in Section 2.4).


4     Analysis of Results
The results obtained by the runs described in Section 3 are shown in Table 1. In
addition, the results of the best participant system at CLEF-IP 2010 are given for
comparison purposes. The results are reported according to different evaluation
measures, the main one being Mean Average Precision (MAP).


                           Table 1. Results of the submitted runs.


      run     MAP     set P   P@5     P@10    P@50    P@100   set R   R@5     R@10    R@50    R@100
      best    0.2645  0.0273  0.4209  0.3482  0.1603  0.1027  0.6945  0.1279  0.197   0.3917  0.4809
      run 1   0.0927  0.0095  0.179   0.1446  0.0752  0.0505  0.4375  0.0489  0.078   0.1874  0.2443
      run 2   0.1051  0.0099  0.2027  0.1631  0.0812  0.0543  0.4565  0.0575  0.0907  0.2057  0.2663
      run 3   0.0963  0.0095  0.1883  0.1535  0.0756  0.0507  0.4367  0.0526  0.0843  0.1903  0.2475
      run 4   0.1054  0.0099  0.2026  0.1633  0.0813  0.0543  0.4569  0.0575  0.0911  0.2054  0.2664
      run 5   0.0966  0.0095  0.1891  0.1531  0.0755  0.0507  0.4374  0.0529  0.084   0.1897  0.2476
      run 6   0.0901  0.0095  0.1751  0.1408  0.0743  0.0512  0.4353  0.047   0.0743  0.1838  0.2477
      run 7   0.0848  0.0089  0.1718  0.1336  0.0687  0.0465  0.4048  0.0476  0.0724  0.1677  0.221
      run 8   0.0935  0.0094  0.181   0.1446  0.0741  0.0503  0.4341  0.0498  0.0792  0.186   0.2447




  A first look at our results shows that our performance is far from that of the best
system. However, our system achieved on average a MAP of about 0.1, which was the
average result in last year's edition. Therefore, we consider it an adequate result for our
first participation at CLEF-IP. Our best results were obtained in terms of recall, showing
that our system was able to retrieve about 40% of the relevant patents. However, our
ranking needs to be improved if we want to obtain better results. We want to perform a
deeper study of our results in order to detect errors. Nevertheless, we think that the
inclusion of some multilingual processing could lead to better results.
    Regarding the comparison among our runs, the best results were obtained when all
the fields (title, abstract, claims and description) were taken into account. In these runs,
increasing the boost factor of the title improved results, showing that different boost
factors can be important in this task. However, boost factors must be selected carefully,
since a higher boost factor on the description gave us worse results.

5   Conclusions and Future Work
We have described in this paper an approach for retrieving the patents relevant to an
input one. This has been our first participation in this task, and we have obtained results
similar to last year's average, which encourages us to continue our work. Our approach
is based on considering the different fields of a patent when creating the index and when
applying the retrieval function, in combination with the use of the Kullback-Leibler
divergence for detecting the most discriminative terms that will form part of the query.
    Future work will focus on performing a deeper analysis of the results in order to
detect errors in our approach and improve our system. Moreover, since CLEF-IP is a
multilingual task, we are considering how to include cross-lingual search in our system.

Acknowledgments
This work has been partially supported by the Spanish Ministry of Science and
Innovation within the project QEAVis-Catiex (TIN2007-67581-C02-01), the Regional
Government of Madrid under the Research Network MA2VICMR (S-2009/TIC-1542),
the Education Council of the Regional Government of Madrid and the European Social
Fund. We are very grateful to Juan Martínez-Romo for providing us with his
implementation of language models.

References
1. G. Roda, F. Piroi, and V. Zenz. CLEF-IP 2009 Evaluation Summary. In Proceedings of CLEF
   2009. LNCS, 2009.
2. S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical
   Statistics, 22(1):79–86, 1951.
3. J. Pérez-Iglesias and L. Araujo. Ranking list dispersion as a query performance predictor,
   pages 371–374, 2009.
4. S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted
   fields. In CIKM '04: Proceedings of the thirteenth ACM international conference on
   Information and knowledge management, pages 42–49, New York, NY, USA, 2004. ACM.