<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Santiago de Compostela at CLEF-IP09</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Carlos Toucedo</string-name>
          <email>josecarlos.toucedo@usc.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Santiago de Compostela</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our participation in the prior-art search task of CLEF-IP 2009. This was the first year of the task, and we focused on how to build an effective prior-art query from a patent. We implemented simple strategies to extract terms from some textual fields of the patent documents and gave preference to title terms. We ran experiments with standard BM25 configurations and paid little attention to language-dependent issues.</p>
      </abstract>
      <kwd-group kwd-group-type="categories">
        <title>Categories and Subject Descriptors</title>
        <kwd>H.3.3 [Information storage and retrieval]: Information Search and Retrieval (query formulation)</kwd>
        <kwd>H.3.7 [Information storage and retrieval]: Digital Libraries</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="general-terms">
        <title>General Terms</title>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
        <kwd>Measurement</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="author">
        <title>Keywords</title>
        <kwd>Patent Retrieval</kwd>
        <kwd>Prior Art Search</kwd>
        <kwd>Cross Language</kwd>
        <kwd>Query Formulation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The main task of the CLEF-IP09 track is to investigate Information Retrieval (IR) techniques
for patent retrieval, specifically for prior art search. Prior art consists of any kind of information
(publication, product, process, etc.) within the patent and non-patent literature that was
made available to the public, possibly even orally, before the filing date of a patent application.
A prior art search therefore tries to retrieve any prior record whose contents are identical or similar
to those of a given patent application.</p>
      <p>This track provides the participants with a large collection of more than one million patents
from the European Patent Office (EPO). The collection comprises all the patents published between
1986 and 2000. Every patent, identified by a unique number, consists of several XML documents
generated at different stages of the patent’s life-cycle. Each patent document is therefore identified
by the patent number plus the stage (represented as a kind code plus a version number). For
instance, the patent document with the id 0981201-A3 denotes the third version of the application
of the patent 0981201. The information within a patent document is structured. Some fields
appear in all the stages, such as the title, the bibliographic data, the description and the claims.
Other fields, like the abstract, only occur in a particular stage. At the EPO,
patents can be written in English, French or German. The title and the claims of every patent
are translated into these three languages. The rest of the patent is written in a single language.</p>
      <p>In an information retrieval setting, the patent to be evaluated can be regarded as the information
need and all the patents granted to date as the document collection. Since a patent is made up of
several documents, all of them have to be taken into account in order to produce a query
patent. In the prior art search task, the query patent provided is built from the non-common
fields of the patent documents and from the common fields of the document with the highest
stage. A query patent constructed in this way from the EPO data is about 3500 terms long on average.
The query patent is therefore a long and verbose document, and many of its terms are redundant
or unrepresentative.</p>
      <p>Throughout these notes we explain the approach we have taken to address the prior art
search task. This year our main objective has been to formulate a concise query that effectively
represents the underlying information need.</p>
      <p>Since this is our first participation in CLEF, we have focused only on query formulation.
We recognize that many other issues, such as link analysis, entity extraction, cross-language
retrieval and field boosting, might play a key role in prior art search, but these will be
considered in future editions of this track.</p>
      <p>The rest of the paper is organized as follows. Section 2 describes the approach we have taken,
specifically how the query is built and what experiments we designed; the runs we submitted are
explained in section 3 and our conclusions are presented in section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach taken</title>
      <p>Although every patent is composed of several documents, this track requires that retrieval is
performed at patent level. This problem can be addressed following two different approaches: a)
building an index of patents or b) building an index of patent documents. The former requires
defining an effective strategy to combine several patent documents into a single patent representation.
The latter is simpler because it only requires post-processing the retrieved documents
in order to obtain one result per patent. We chose to assign each patent the score of its highest
ranked document. This follows the intuition that the patent document that is most similar to
the query patent reflects well the connection between the query and the underlying patent.</p>
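      <p>The post-processing step described above can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function name and the input format are hypothetical, and we assume that a document id consists of the patent number, a hyphen and the stage code, as in 0981201-A3.</p>
      <preformat>
```python
from typing import Dict, Iterable, List, Tuple


def collapse_to_patents(doc_ranking: Iterable[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Turn a ranking of patent documents into a ranking of patents.

    Each patent receives the score of its best-scoring document.
    Assumption: document ids look like '0981201-A3' (patent number, hyphen, stage).
    """
    best: Dict[str, float] = {}
    for doc_id, score in doc_ranking:
        patent_id = doc_id.split("-", 1)[0]
        # Keep the maximum score seen so far for this patent.
        best[patent_id] = max(score, best.get(patent_id, float("-inf")))
    # Sort patents by decreasing score; ties are broken by patent id.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Toy ranking (invented scores): two documents belong to patent 0981201.
ranking = [("0981201-A3", 7.2), ("0555555-B1", 6.9), ("0981201-B1", 5.4)]
print(collapse_to_patents(ranking))
```
      </preformat>
      <p>In this toy ranking, patent 0981201 keeps the score 7.2 of its A3 document and is ranked first, and each patent appears only once in the result.</p>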
      <p>Although the documents contain terms from three different languages, no language-oriented
distinction was made during the index construction<sup>1</sup>. The index contains all terms in any language
for each patent document. Note that some fields of a patent are translated into the three languages
(e.g. the title) and these translations appear in the index associated with the same document.
Furthermore, stemming was not applied, and an English stopword list (with 733 stopwords) was
used to remove common words. Using an English list is reasonable because almost 70% of the data is
written in English.</p>
      <sec id="sec-2-1">
        <title>Query formulation</title>
        <p>The query patent is too long to be processed in a reasonable time and contains noisy terms that
might harm performance. We believe that good query preprocessing is key to
achieving good effectiveness.</p>
        <p>Our experiments focused on extracting the most significant terms from the query patent, i.e.
those terms that are discriminative. To this end, we used inverse document frequency (idf). In
our evaluation with the training set, we concentrated on deciding the number of terms that should
be included in the query. We ran this process in both a language-independent and a
language-dependent way (i.e. a single ranking of terms vs. three rankings of terms, one for each language).</p>
        <p><sup>1</sup>We are deeply grateful for the support of Erik Graf and Leif Azzopardi, from the University of Glasgow, who granted us
access to their indexes.</p>
        <p>The number of query terms is difficult to set. With few terms the query is processed
quickly but the information need might be misrepresented; with many
terms the query contains many noisy terms and, furthermore, the query processing
time might be prohibitive. We have studied two methods to choose a suitable number of terms:
(i) establishing a fixed number of terms for all queries and (ii) establishing a fixed percentage of
the query patent length (i.e. the query size varies from one query patent to another).</p>
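        <p>The two ways of fixing the query size, (i) and (ii), can be sketched as follows. The function names are hypothetical, and the details are assumptions made for illustration: the query patent length is taken to be its number of terms, the percentage is rounded down, and at least one term is always kept.</p>
        <preformat>
```python
def fixed_query_size(patent_length: int, n_terms: int) -> int:
    """Method (i): the same number of terms for every query patent.

    A query cannot contain more terms than the patent offers.
    """
    return min(n_terms, patent_length)


def proportional_query_size(patent_length: int, percentage: float) -> int:
    """Method (ii): a fixed percentage of the query patent length.

    Assumption: round down and keep at least one term.
    """
    return max(1, int(patent_length * percentage / 100))


# A query patent of average length (about 3500 terms).
print(fixed_query_size(3500, 500))
print(proportional_query_size(3500, 10))
```
        </preformat>
        <p>With method (ii), a patent of 3500 terms and a setting of 10% yields a query of 350 terms, whereas a shorter patent yields a proportionally shorter query.</p>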
        <p>Once the number of query terms has been selected, we must determine how they are extracted.
We explored two strategies: language-independent and language-dependent. Suppose that we
select n terms from the original query patent regardless of the language. This means that all
query patent terms (English, French and German) are ranked together and we simply select
the n terms with the highest idf from this list. Because of the nature of the languages, it is likely
that the three languages present different idf patterns. Besides, there are fewer German and French
documents than English documents, which introduces a bias in terms of idf. We
therefore felt that we needed to test other alternatives for selecting terms. We tried out an
extraction of terms where each language contributes the same number of terms. In this
second strategy we first grouped the terms of a query patent by language (no
classification was needed since every field in the XML is tagged with language information). Next,
the n′ = ⌊n/3⌋ terms with the highest idf are extracted from each group. The query is finally obtained by
merging the terms from the three groups.</p>
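        <p>A minimal sketch of the two selection strategies is given below. It is an illustration under stated assumptions rather than our actual code: the function names are hypothetical, idf is computed as log(N/df) (the exact idf variant is not fixed by the description above), and the document frequencies are invented toy values.</p>
        <preformat>
```python
import math
from typing import Dict, List


def idf(term: str, doc_freq: Dict[str, int], n_docs: int) -> float:
    """Plain idf, log(N / df); terms missing from doc_freq are treated as df = 1."""
    return math.log(n_docs / max(1, doc_freq.get(term, 0)))


def top_terms(terms: List[str], doc_freq: Dict[str, int], n_docs: int, n: int) -> List[str]:
    """Return the n distinct terms with the highest idf."""
    distinct = sorted(set(terms))
    ranked = sorted(distinct, key=lambda t: idf(t, doc_freq, n_docs), reverse=True)
    return ranked[:n]


def language_independent(terms_by_lang: Dict[str, List[str]], doc_freq: Dict[str, int],
                         n_docs: int, n: int) -> List[str]:
    """Rank the terms of all languages together and keep the best n."""
    pooled = [t for terms in terms_by_lang.values() for t in terms]
    return top_terms(pooled, doc_freq, n_docs, n)


def language_dependent(terms_by_lang: Dict[str, List[str]], doc_freq: Dict[str, int],
                       n_docs: int, n: int) -> List[str]:
    """Rank each language separately and keep floor(n / 3) terms from each."""
    per_lang = n // 3
    query: List[str] = []
    for lang in sorted(terms_by_lang):
        query.extend(top_terms(terms_by_lang[lang], doc_freq, n_docs, per_lang))
    return query


# Toy data (invented): German and French terms are rarer in the collection.
doc_freq = {"rotor": 10, "blade": 100, "rotorblatt": 2, "welle": 5, "pale": 4, "arbre": 50}
terms_by_lang = {"en": ["rotor", "blade"], "de": ["rotorblatt", "welle"], "fr": ["pale", "arbre"]}
print(language_independent(terms_by_lang, doc_freq, 1000, 3))
print(language_dependent(terms_by_lang, doc_freq, 1000, 3))
```
        </preformat>
        <p>With these invented frequencies the pooled ranking selects rotorblatt, pale and welle, so no English term survives, whereas the per-language ranking keeps one term from each language. This is the idf bias that motivated the second strategy.</p>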
        <p>So far, we have only described the options we considered for query formulation. Section
2.3 explains how some combinations of these strategies perform.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Retrieval model</title>
        <p>
          We used the well-known BM25 retrieval model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] with the usual parameters (b = 0.75, k1 = 1.2,
k3 = 1000), but we also tried several variations for b and k1 in the submitted runs.
        </p>
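        <p>For reference, the role of b, k1 and k3 can be seen in the following sketch of the BM25 scoring function. It follows the textbook Okapi formulation; the exact variant implemented in the retrieval platform may differ, and the function name and argument layout are hypothetical.</p>
        <preformat>
```python
import math
from typing import Dict


def bm25_score(query_tf: Dict[str, int], doc_tf: Dict[str, int], doc_len: float,
               avg_doc_len: float, doc_freq: Dict[str, int], n_docs: int,
               k1: float = 1.2, b: float = 0.75, k3: float = 1000.0) -> float:
    """Textbook Okapi BM25 score of one document for one query.

    query_tf / doc_tf map terms to their frequency in the query / document.
    b controls document length normalisation, k1 the saturation of the
    document term frequency and k3 the saturation of the query term frequency.
    """
    length_norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq[term]
        weight = math.log((n_docs - df + 0.5) / (df + 0.5))
        doc_part = (k1 + 1.0) * tf / (length_norm + tf)
        query_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += weight * doc_part * query_part
    return score


# Toy example (invented numbers): one matching term in an average-length document.
print(bm25_score({"rotor": 1}, {"rotor": 1}, 100, 100, {"rotor": 10}, 1000))
```
        </preformat>
        <p>With k3 = 1000 the query part grows almost linearly with the query term frequency, so a term that occurs twice in the query weighs nearly twice as much as a term that occurs once.</p>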
        <p>
          The platform under which we executed our experiments was the Lemur Toolkit [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>All experiments were executed in the LDC, a system provided by the Information Retrieval
Facility (IRF)<sup>2</sup>.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Training</title>
        <p>With the training data provided by the track, we studied two dimensions: query length and
language. Query length refers to the way in which the query size is set. As argued above, this can
be done in a query-dependent way (i.e. a given percentage of the query patent terms is selected) or in a
query-independent way (i.e. a fixed number of terms is selected for all queries). The language
dimension reflects the way in which terms are ranked (language-independent, i.e. a single ranking
for all terms; language-dependent, i.e. one ranking of terms for every language). Hence, our training
consisted of studying how the four combinations of these dimensions perform in terms of two
well-known measures, MAP and Bpref.</p>
        <p>The following results were obtained with the large training set (500 queries) of the main task,
which contains queries in the three languages. In this case, we only tried the usual parameters for
the BM25 retrieval model.</p>
        <p>Figures 1(a) and 2(a) consider the case where the number of terms is fixed for all queries. Here we
clearly get better performance when the language is not taken into account (language-independent selection).
However, figures 1(b) and 2(b), where terms are selected using a percentage of the query length,
show a different trend. Figure 1(b) shows that, for values below 50%<sup>3</sup>, no significant difference
can be established in terms of MAP. In contrast, figure 2(b) shows that the language-dependent
choice is slightly more consistent than the language-independent one in terms of Bpref.</p>
        <p>We have to choose between two opposite models: one combines the
query-dependent and language-dependent strategies, and the other combines
the query-independent and language-independent strategies. A careful look at
the plots shows that these two models do not differ in MAP but, in terms of Bpref,
the model that is language-dependent and query-dependent presents the best performance.</p>
        <p><sup>2</sup>We are grateful to the IRF for the support given to us.</p>
        <p><sup>3</sup>We do not take greater values into account because the resulting queries are too long.</p>
        <p>[Figure 1: MAP for (a) query-independent experiments and (b) query-dependent experiments (x-axis in (b): % of query length); curves: language-dependent and language-independent.]</p>
        <p>[Figure 2: Bpref for (a) query-independent experiments and (b) query-dependent experiments (x-axis in (b): % of query length); curves: language-dependent and language-independent.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Submitted runs</title>
      <p>We participated in the Main task of this track with eight runs for the Small set of topics, which
contains 500 queries in different languages.</p>
      <p>First, we submitted four runs using the scenario that worked best in our training
experiments. These four runs differ in the retrieval model parameters:</p>
      <list list-type="bullet">
        <list-item><p>uscom BM25a: b = 0.2, k1 = 0.1, k3 = 1000</p></list-item>
        <list-item><p>uscom BM25b: b = 0.75, k1 = 1.2, k3 = 1000</p></list-item>
        <list-item><p>uscom BM25c: b = 0.75, k1 = 1.6, k3 = 1000</p></list-item>
        <list-item><p>uscom BM25d: b = 1, k1 = 1.2, k3 = 1000</p></list-item>
      </list>
      <p>Furthermore, we submitted four additional runs where the final queries were expanded with the
title terms of the query patent. In this way, the query term frequency of these terms is increased
and the presence of the title terms in the final queries is guaranteed. These new runs are labeled
like the previous ones with an extra “t” appended (e.g. uscom BM25at).</p>
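      <p>The effect of this expansion on the query can be sketched as follows. The helper name and the toy terms are hypothetical, and we assume that expansion simply appends the title terms to the terms already selected.</p>
      <preformat>
```python
from collections import Counter
from typing import List


def expand_with_title(query_terms: List[str], title_terms: List[str]) -> Counter:
    """Append the title terms to the selected query terms and count frequencies.

    A title term that was already selected gets a higher query term frequency;
    a title term that was not selected is now guaranteed to be in the query.
    """
    return Counter(query_terms + title_terms)


# Toy example (invented terms).
selected = ["rotorblatt", "rotor", "pale"]
title = ["rotor", "blade"]
expanded = expand_with_title(selected, title)
print(expanded["rotor"], expanded["blade"])
```
      </preformat>
      <p>Here the term rotor ends up with a query term frequency of 2, and blade, which was not among the selected terms, enters the query with frequency 1.</p>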
    </sec>
    <sec id="sec-4">
      <title>Results and conclusions</title>
      <p>The official evaluation results of our submitted runs are summarized in Table 1.</p>
      <p>[Table 1: official evaluation results for the runs uscom BM25a, uscom BM25b, uscom BM25c, uscom BM25d, uscom BM25at, uscom BM25bt, uscom BM25ct and uscom BM25dt.]</p>
      <p>The first conclusion we can draw from the evaluation is that our decision to force the presence
of title terms worked well. Regardless of the configuration of the BM25 parameters, the run with
the title terms always obtains better performance than its counterpart. We can therefore state
that the title terms are an important factor in prior art search.</p>
      <p>Furthermore, among the configurations with the title terms, the best run is the one labeled
uscom BM25bt. This run corresponds to the usual parameters of the BM25 retrieval model, i.e.
b = 0.75, k1 = 1.2, k3 = 1000.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>The Lemur Toolkit</article-title>
          . http://www.lemurproject.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <source>Okapi at TREC-3</source>
          . pages
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>