<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BiTeM site report for the Claims to Passage task in CLEF-IP 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julien Gobeill</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Ruch</string-name>
          <email>patrick.ruch@hesge.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BiTeM group, University of Applied Sciences, Information Studies</institution>
          ,
          <addr-line>Geneva</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>In CLEF-IP 2012, we participated in the Claims to Passage task, whose goal was to return relevant passages for sets of claims, for patentability or novelty search purposes. The collection contained 2.3M documents, corresponding to an estimated volume of 250M passages. To cope with the problems induced by this large dataset, we designed a two-step retrieval system. In the first step, the 2.3M patent application documents were indexed; for each topic, we then retrieved the k most similar documents with a classical Prior Art Search. Document representations and the tuning of the IR engine were set relying on training data and on the expertise we acquired in similar past tasks. In particular, topics were represented not only by their claims, but also by the full description of the application document and the applicant/inventor details; moreover, we discarded retrieved documents that did not share at least one IPC code with the topic. The k parameter ranged from 5 to 1000 depending on the run. In the second step, for each topic (i.e. "on the fly"), we indexed the passages contained in these k most similar documents and queried them with the topic claims in order to obtain the final runs. We thus dealt with approximately 11M passages instead of 250M. The best k on the training data was 10; hence, we decided to submit four runs with k set to 10, 20, 50, and 100. Finally, we analyzed the training data and observed that the position of a passage in the document played a role, as passages at the end of the description were more likely to be relevant. We therefore re-ranked each run according to passage positions in the document, yielding four supplementary runs.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Intellectual Property</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction and Methods</title>
      <p>
        BiTeM (Bibliomics and Text Mining) is a research group located in Geneva,
with strong expertise in text mining on large corpora, especially in biomedicine. We
have already taken part in several evaluation campaigns on Information Retrieval (IR) in the
Intellectual Property domain, such as previous CLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], TREC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or NTCIR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] editions. In
CLEF-IP 2012, we participated in the Claims to Passage task, whose goal was to
return relevant passages from patents contained in the collection, according to sets of
claims, for patentability or novelty search purposes.
      </p>
      <p>This task is known in computer science as Passage Retrieval, a subtask of
Information Retrieval. We identified early on two different strategies for retrieving the
relevant passages: either a one-step retrieval or a two-step retrieval. The one-step
strategy consists of building a single search engine by indexing all passages. The
two-step strategy consists of first building a search engine by indexing all
documents, then, for each topic (i.e. "on the fly"), building a second search engine by
indexing only the passages belonging to the retrieved documents. The CLEF-IP 2012
collection contained 2.3M documents, corresponding to an estimated volume of 250M
passages. To cope with the problems induced by this large dataset, we chose the
two-step retrieval strategy, as described in Fig. 1.</p>
      <p>In the first step, the 2.3M patent application documents were indexed; for each
topic, we then retrieved the most similar documents with a classical Prior Art Search.
Document representation and the tuning of the IR engine were set relying on training data
and on the expertise we acquired in similar past tasks. For document representation,
we used titles, abstracts, claims, applicant and inventor details, and IPC codes
(in complete format, e.g. G09G 3/28). For topic representation, we exploited the
provided patents and likewise used titles, abstracts, claims, applicant and inventor details, and
IPC codes, along with the full description sections. We thus obtained, for each query,
a set of retrieved patents. Then, we applied a supplementary post-processing strategy
investigated in previous CLEF-IP editions, discarding retrieved patents that did not share at
least one IPC code with the topic patent.</p>
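<p>The IPC post-filter can be sketched as follows; the data layout (ids, dictionaries) is hypothetical, chosen only for illustration:</p>

```python
# Post-processing sketch: keep only retrieved patents that share at least
# one full IPC code (e.g. 'G09G 3/28') with the topic patent.

def ipc_filter(ranked_patents, ipc_codes, topic_ipc):
    """ranked_patents: list of patent ids, best first.
    ipc_codes: dict mapping patent id -> set of IPC codes.
    topic_ipc: set of IPC codes of the topic patent."""
    return [p for p in ranked_patents
            if ipc_codes.get(p, set()) & topic_ipc]

ranked = ["EP-1", "EP-2", "EP-3"]
codes = {"EP-1": {"G09G 3/28"},
         "EP-2": {"H01L 21/00"},
         "EP-3": {"G09G 3/28", "G02F 1/13"}}
print(ipc_filter(ranked, codes, {"G09G 3/28"}))  # ['EP-1', 'EP-3']
```

<p>The filter preserves the original ranking order and simply drops non-matching documents, which is how the strategy is described above.</p>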
      <p>In the second step, for each topic, we extracted the passages contained in the first k
retrieved patents. Then, for each topic, we indexed these passages and queried with
the claims provided in the topic in order to obtain our runs. Relying on training data,
we evaluated different values of k, ranging from 5 to 1000.</p>
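<p>The two-step strategy can be sketched as follows; a toy bag-of-words scorer stands in for the actual IR engine (Terrier with PL2/Bo1), and the data structures are hypothetical:</p>

```python
from collections import Counter

def score(query_terms, text):
    # Toy scorer: sum of query-term frequencies in the text.
    tf = Counter(text.lower().split())
    return sum(tf[t] for t in query_terms)

def two_step_retrieval(topic_claims, documents, passages_of, k):
    """documents: dict doc_id -> full text; passages_of: dict doc_id -> passages.
    Step 1: rank whole documents, keep the first k.
    Step 2: index only the passages of those k documents ("on the fly")
    and rank them with the topic claims."""
    q = topic_claims.lower().split()
    top_k = sorted(documents, key=lambda d: score(q, documents[d]),
                   reverse=True)[:k]
    pool = [(d, i, p) for d in top_k for i, p in enumerate(passages_of[d])]
    return sorted(pool, key=lambda t: score(q, t[2]), reverse=True)
```

<p>The point of the second step is that only the passage pool of the k retained documents is ever indexed, which is what reduces the roughly 250M passages to about 11M across all topics.</p>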
      <p>
        Indexing and retrieval were performed with the Terrier platform [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is
designed for large collections such as those of TREC or CLEF. We chose settings
that proved efficient in past competitions: PL2 as weighting scheme and
Bo1 as Query Expansion model, both with default parameters, along with Porter
stemming [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For multilingual purposes, we simply chose to index only the English sections,
and to use Google in order to translate the topics from French or German into English.
      </p>
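<p>For reference, the textbook form of the PL2 per-term score (DFR framework: Poisson randomness model, Laplace after-effect, term-frequency normalisation 2) can be sketched as below. This is the published formula, not Terrier's actual code, and the default c=1.0 is assumed:</p>

```python
import math

def pl2(tf, doc_len, avg_doc_len, term_coll_freq, num_docs, c=1.0, qtf=1.0):
    """Textbook PL2 score of one query term for one document.
    tf: term frequency in the document; term_coll_freq: frequency in the
    whole collection; num_docs: collection size; c: normalisation parameter."""
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)  # normalisation 2
    lam = term_coll_freq / num_docs                         # Poisson mean
    return qtf * (1.0 / (tfn + 1.0)) * (
        tfn * math.log2(tfn / lam)
        + (lam - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2.0 * math.pi * tfn))
```

<p>With an informative term (tfn much larger than the Poisson mean), the score grows with term frequency while the 1/(tfn+1) factor dampens the contribution of repeated occurrences.</p>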
      <p>Finally, we analyzed the qrels provided with the training data, focusing on the
position (within the description section) of the relevant passages. For each patent, we
divided the description section into ten equal parts. Then, we analyzed which
part the passages contained in the qrels came from, and likewise for the passages contained in our
run (for k=10). Pqrel(i) is the percentage of relevant passages in the qrels that belong to
the i-th part of the description, i ranging from 1 (the beginning) to 10 (the end). Prun(i)
is the percentage of relevant passages in our run that belong to the i-th part of the
description. Fig. 2 illustrates these distributions.</p>
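<p>Assuming each passage is located by its character offset within its description section (an illustrative representation), the part assignment and the two distributions can be computed as:</p>

```python
def decile(offset, section_len):
    """Return the part index i in 1..10 for a passage starting at `offset`
    in a description section of length `section_len`."""
    i = int(10 * offset / section_len) + 1
    return min(i, 10)  # offsets at the very end stay in part 10

def distribution(positions):
    """positions: list of (offset, section_len) pairs for one set of
    passages. Returns P(i) in percent for i = 1..10."""
    counts = [0] * 10
    for off, n in positions:
        counts[decile(off, n) - 1] += 1
    total = sum(counts) or 1
    return [100.0 * c / total for c in counts]
```

<p>Running this once over the qrel passages and once over the run passages yields Pqrel and Prun as plotted in Fig. 2.</p>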
      <p>The two distributions are opposed. In the qrels, the passages belonging to the end of the
description are more likely to be relevant. On the contrary, our search system tends to
favor passages belonging to the beginning of the description. Hence, we computed a
weight W(i) for each part according to the qrel distribution, by dividing Pqrel(i) by
Prun(i), and then re-ranked our runs by boosting scores according to these weights. This
re-ranking strategy is obviously applied only to passages belonging to the description
section.</p>
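<p>The re-ranking step can be sketched as follows; the distributions and scores used in the example are illustrative placeholders, not the observed values:</p>

```python
def rerank(scored_passages, p_qrel, p_run):
    """Boost each description passage by W(i) = Pqrel(i) / Prun(i).
    scored_passages: list of (passage_id, score, part), where part is the
    decile 1..10, or None for passages outside the description section."""
    w = [q / r if r > 0 else 1.0 for q, r in zip(p_qrel, p_run)]
    boosted = [(pid, s * w[part - 1] if part else s, part)
               for pid, s, part in scored_passages]
    return sorted(boosted, key=lambda t: t[1], reverse=True)

p_qrel = [5] * 9 + [55]   # toy percentages: last decile over-represented in qrels
p_run = [55] + [5] * 9    # toy percentages: first decile over-represented in the run
print(rerank([("a", 1.0, 1), ("b", 0.9, 10)], p_qrel, p_run)[0][0])
# prints b: the late-description passage now ranks first
```

<p>Passages outside the description section keep their original score, matching the restriction stated above.</p>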
      <p>
        For evaluation, we computed the Mean Reciprocal Rank (MRR), which averages over topics the
multiplicative inverse of the rank of the first correct returned answer [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
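<p>The measure is straightforward to compute from the rank of the first relevant passage per topic:</p>

```python
def mrr(first_relevant_ranks):
    """Mean Reciprocal Rank: mean of 1/rank of the first relevant answer
    per topic; None (no relevant answer returned) contributes 0."""
    rr = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

print(mrr([1, 4, None]))  # (1 + 0.25 + 0) / 3 = 0.4166...
```
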
    </sec>
    <sec id="sec-2">
      <title>Results on training data</title>
      <p>First, we evaluated different values of k with the training data. Results are
presented in Tab. 1. It appears that the best value for k is 10; k=10 means that, for each query,
we retrieved passages only within the first 10 retrieved patents.</p>
      <p>Tab. 1 lists the MRR obtained on the training data for the tested values of k,
in increasing order of k (from 5 to 1000): 0.014, 0.017, 0.010, 0.013, 0.013, 0.007,
0.004, 0.003.</p>
      <p>Finally, we evaluated the impact of our re-ranking strategy with the training data, and
observed a slight improvement in terms of MRR, ranging from +2% to +6%
depending on the value of k.</p>
    </sec>
    <sec id="sec-3">
<title>Conclusion</title>
      <p>Hence, we decided to submit four official runs with different values of k: 10, 20, 50
and 100. As participants were allowed to submit up to 8 runs, we also applied our
re-ranking strategy to these runs in order to obtain four supplementary
official runs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gobeill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasche</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teodoro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Simple pre and post processing strategies for patent searching in CLEF Intellectual Property Track 2009</article-title>
          .
          <source>Proceedings of CLEF</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gobeill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaudinat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasche</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teodoro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vishnyakova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>BiTeM site report for TREC Chemistry</source>
          <year>2010</year>
          :
          <article-title>Impact of Citations Feedback for Patent Prior Art Search and Chemical Compounds Expansion for Ad Hoc Retrieval</article-title>
          .
          <source>Proceedings of TREC</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Teodoro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gobeill</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pasche</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruch</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Report on the NTCIR 2010 Experiments: automatic IPC encoding and novelty detection for effective patent mining</article-title>
          .
          <source>Proceedings of NTCIR</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plachouras</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Research Directions in Terrier: a Search Engine for Advanced Retrieval on the Web. Novatica/UPGRADE Special Issue on Next Generation Web Search</article-title>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          , vol.
          <volume>14</volume>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Overview of the Question Answering Track</article-title>
          .
          <source>Proceedings TREC</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>