<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF-IP 2010: Prior Art Retrieval using the different sections in patent documents</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Eva D'hondt and Suzan Verberne Radboud University Nijmegen</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our participation in the 2010 CLEF-IP Prior Art Retrieval task, where we examined the impact of information in different sections of patent documents, namely the title, abstract, claims, description and IPC-R sections, on the retrieval and re-ranking of patent documents. Using a standard bag-of-words approach in Lemur we found that the IPC-R sections are the most informative for patent retrieval. We then performed a re-ranking of the retrieved documents using a Logistic Regression Model, trained on the retrieved documents in the training set. We found indications that the information contained in the text sections of the patent document can contribute to a better ranking of the retrieved documents. The official results show that, among the nine groups that participated in the Prior Art Retrieval task, we ranked eighth in terms of both Mean Average Precision (MAP) and Recall.</p>
      </abstract>
      <kwd-group>
        <kwd>Prior Art Search</kwd>
        <kwd>Patent retrieval</kwd>
        <kwd>CLEF-IP track</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the literature on patent retrieval there is some disagreement on which part of the patent
document would be the most informative for (text-based) document retrieval. Graf and Azzopardi
conclude that the claims section is the most useful [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], while patent searchers themselves hold that
the description is more useful [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Interestingly, the results of last year's CLEF-IP track (2009)
showed that the use of metadata such as IPC-R codes or the name of the inventor leads to substantial
improvements in patent retrieval over approaches that focussed only on the text sections [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>For our participation in the CLEF-IP 2010 track<sup>1</sup>, our goal was to compare the impact of the
different patent sections on retrieval performance in a fixed benchmark data set. In this paper
we describe our contribution to the track, in which we examine the influence of both the IPC-R
metadata and the information contained in the different text sections of the patent document on
retrieval performance and re-ranking.</p>
    </sec>
    <sec id="sec-2">
      <title>Data Description</title>
      <p>The CLEF-IP 2010 test collection provided by the organising committee contains a corpus of
2.6 million patent documents pertaining to 1.3 million patents<sup>2</sup>, a set of 300 patent documents
that serve as training topics together with their relevance assessments, and a set of 500 test topic
patent documents for testing. The patents can contain text in three different languages: English,
French and German. They are labelled with XML tags to help identify the different sections as
well as the different metadata such as the IPC-R code, the name of the inventor or the date of the
application. The different patent documents correspond to the different stages in the evolution
of a patent and will therefore contain different amounts of information: for example, a patent
application (A1 document) will not contain as much information as a fully granted patent (B1
document). The information in the older version of the patent is often subsumed by the newer
document, but older versions may contain unique information as well. This year we decided
to retrieve patent documents rather than whole patents<sup>3</sup>.</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Set-up</title>
      <sec id="sec-3-1">
        <title>Patent Section Extraction</title>
        <p>Using a Perl script we extracted the English title, abstract, claims and description sections and
the IPC-R codes<sup>4</sup> from the original XML files and saved them as plain text in separate text files.
If a document did not contain a section or if the section was not in English, no corresponding text
file was created. The most important characteristics of the five subcorpora that were created in
this manner are shown in Table 1.</p>
        <p>[Table 1. Characteristics of the five subcorpora (corpus, training, topic).]</p>
        <p>In the retrieval step, we wanted to determine which section of the patent document is the most
informative for patent retrieval. To this end, we performed six retrievals on the corpus using
the training queries. The retrieved documents of the best-scoring system were used to train the
re-ranking models, as described in the Re-ranking Step section.</p>
        <p>For the retrieval step, all the text files in the subcorpora were saved in the Lemur format:
using a bash script, the text in the text files was lowercased, punctuation was removed and the
appropriate XML tags for indexing by Lemur were added. Then the texts were indexed using the
BuildIndex function of Lemur with the indri IndexType and a stop list for general English.</p>
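        <p>As a rough illustration of the normalisation described above, the following Python sketch lowercases a text, strips punctuation and wraps the result in indexing tags. It is not the bash script used in the experiments: the helper names, the sample stop words and the TREC-style DOC/DOCNO/TEXT tags are assumptions made for illustration, and in the actual set-up the stop list was applied by Lemur at indexing time.</p>
        <preformat><![CDATA[
```python
import string

# Hypothetical sample; the contents of the general English stop list are not given.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Map every ASCII punctuation character to a space (an assumption: the
# original script may have deleted punctuation instead of replacing it).
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalise(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation and drop stop words; return the tokens."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return [t for t in tokens if t not in stop_words]


def to_indexable(doc_id, text):
    """Wrap normalised text in TREC-style tags (assumed tag names)."""
    body = " ".join(normalise(text))
    return "<DOC>\n<DOCNO>{}</DOCNO>\n<TEXT>\n{}\n</TEXT>\n</DOC>".format(doc_id, body)
```
]]></preformat>
        <p>Topic documents could be passed through the same normalise function, so that queries and indexed text share one vocabulary.</p>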
        <p>In total we built six indices: Titles only, Abstracts only, Claims only, Description only,
IPC-R codes, and full-text. By full-text we mean that we concatenated title, abstract, claims and
description; sections that were not available in the patent document were added as an empty
string. If none of these sections were available in English, the patent document was not indexed.</p>
        <p><sup>2</sup> Please note the difference between a patent and a patent document: a patent is not a physical document itself
but a name for a group of patent documents that have the same patent ID number.</p>
        <p><sup>3</sup> A whole patent can be constructed by concatenating different patent versions into one document or by
constructing a document from the most recent version of every section in the patent documents.</p>
        <p><sup>4</sup> We used the full IPC-R code up to the level of the subgroups, e.g. A01J 5/01.</p>
        <p>The topics in the training set were preprocessed in the same manner in order to be used as
queries in Lemur. If the original query XML document did not contain a section, it was not added
to the Lemur query file.</p>
        <p>For each query in the query file we retrieved 100 documents and ranked them according to the
TF-IDF ranking model as implemented in Lemur. Table 2 shows the results of the retrievals on
the six indices with their respective training queries. The results are given for Precision (P) and
Recall (R) at positions 5, 10, 50 and 100 in the result list, as well as the Mean Average
Precision (MAP) score.</p>
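        <p>For reference, the measures named above can be sketched in Python as follows. This is a minimal sketch using the textbook definitions; the official evaluation scripts of the track may differ in details such as tie handling, and the function names are ours.</p>
        <preformat><![CDATA[
```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked, relevant):
    """Sum of the precision at each relevant hit, divided by the number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```
]]></preformat>
        <p>Because average precision divides by the total number of relevant documents, a relevant document that is never retrieved lowers the score of that query.</p>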
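        <p>[Table 2. Retrieval results (P and R at positions 5, 10, 50 and 100, and MAP) for the six indices: title, abstract, claims, description, full-text and IPC-R.]</p>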
        <p>The index with the IPC-R codes proved to be the most informative for patent retrieval in
terms of Recall and Precision, although results are quite low for all six retrievals. Based on these
results, we decided to proceed with only the retrieval results from the IPC-R subcorpus to the
second step of our approach.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Re-ranking Step</title>
        <p>It seems that in a retrieval task, conceptual information (as encoded in the IPC codes) works
better than 'surface' textual information. However, we wanted to examine the influence of the
different text sections on the positions of the retrieved results in the set.</p>
        <p>We aimed to improve the ranking of the retrieved documents on the basis of the textual
information present in the different sections of the patent document. As a predictor of relevance
for the sections, we used the cosine similarity between corresponding sections of the topic and
each of the retrieved documents.</p>
        <p>We extracted this information as follows: for each topic-document pair from the training result
set, we extracted the title, abstract, claims and description sections (if present) from both the topic
and the retrieved document. We then calculated the cosine similarity between the sections of the
respective documents using a Python script which was based on the script by Dennis Muhlestein<sup>5</sup>.</p>
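        <p>The cosine similarity between two sections can be sketched as below. This is not the script used in the experiments but a minimal version over raw term counts; the whitespace tokenisation, the function names and the choice to score a missing section as 0.0 are assumptions, since the handling of absent sections is not specified above.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter

SECTIONS = ("title", "abstract", "claims", "description")


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def section_features(topic, document):
    """Return one similarity per section for a topic-document pair.

    Both arguments are dicts mapping section names to text. A section
    missing on either side yields 0.0 (an assumption of this sketch).
    """
    return [
        cosine_similarity(topic.get(name, ""), document.get(name, ""))
        for name in SECTIONS
    ]
```
]]></preformat>
        <p>Each topic-document pair then yields a list of four values, in the order title, abstract, claims, description.</p>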
        <p>For each query-document pair we obtained a vector with four features: cosine similarity of the titles,
cosine similarity of the abstracts, cosine similarity of the claims, and cosine similarity of the descriptions. In order to
determine the importance of each of these features (and thereby each of the sections), we trained a
Logistic Regression Model (LRM). The criterion variable was the relevance score of the retrieved
document in the training relevance assessments.<sup>6</sup></p>
        <p>We used the lrm function from the Design package in R to train this model. We then used the
LRM (trained on the training data) to predict an alternative ranking for the retrieved documents.
We created two variants of the model: one with only these four features, and one in which the
TF-IDF score for the retrieval with IPC-R codes was added as a fifth feature. We did not perform
any step-wise model selection but rather combined all predictors at once.</p>
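        <p>Once such a model has been fitted, re-ranking amounts to scoring each retrieved document with the logistic function and sorting by that score. The Python sketch below shows only this scoring step, not the fitting done with lrm in R; the function names are ours, and any coefficient values passed to it in an example are hypothetical rather than taken from the trained model.</p>
        <preformat><![CDATA[
```python
import math


def relevance_probability(features, coefficients, intercept):
    """Logistic model: P(relevant) = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def rerank(candidates, coefficients, intercept):
    """Sort (doc_id, feature_vector) pairs by predicted relevance, best first."""
    scored = [
        (relevance_probability(features, coefficients, intercept), doc_id)
        for doc_id, features in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```
]]></preformat>
        <p>For the second model variant, the TF-IDF retrieval score would simply be appended to each feature vector, with a fifth coefficient to match.</p>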
        <p><sup>5</sup> http://allmybrain.com/2007/10/19/similarity-of-texts-the-vector-space-model-with-python</p>
        <p><sup>6</sup> We only considered documents to be either 'relevant' or 'non-relevant' and did not adhere to the subdivision
('relevant' or 'highly relevant') made by the CLEF-IP organisers.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>In this section we present the results of our models in terms of MAP, Precision and Recall for
both the training data (Table 3) and the test data (Table 4). The evaluation of the two re-ranking
models on the training data was performed using 5-fold cross-validation. The P, R and MAP
results are the averages over the five folds. Standard deviations are given in brackets.</p>
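      <p>The cross-validation bookkeeping can be sketched as follows. How the folds were actually formed is not stated above, so this Python sketch assumes that folds are split by query and that the reported spread is the sample standard deviation; both the function names and these choices are illustrative.</p>
      <preformat><![CDATA[
```python
import statistics


def k_fold_splits(query_ids, k=5):
    """Yield (train, test) lists of query ids, one pair per fold."""
    folds = [query_ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test


def summarise(fold_scores):
    """Return the mean and sample standard deviation over the folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)
```
]]></preformat>
      <p>Splitting by query rather than by individual query-document pair keeps all retrieved documents of one topic in the same fold, so no topic contributes to both training and evaluation.</p>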
      <sec id="sec-4-1">
        <title>Baseline</title>
        <p>[Tables 3 and 4. Rows: baseline using IPC-R; re-ranking without TF-IDF; re-ranking with TF-IDF; run-1-small (no TF-IDF); run-2-small (with TF-IDF). MAP values as extracted: 0.0677, 0.0858 (0.973), 0.0870 (0.519).]</p>
        <p>The re-ranking model that incorporates the TF-IDF score of the retrieval set performs slightly
better than the other model in both the training and the test results. In terms of Recall and
Precision we performed slightly better than during our participation in the CLEF-IP 2009 track,
but compared to the other teams in this year's track we achieved low scores.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>In this section we (a) discuss the retrieval results on the training set and (b) analyse the
re-ranking models used.</p>
      <p>One of our goals was to determine which section of the patent document is the most informative
for patent retrieval in terms of recall and precision. The results in Table 2 showed that for a
bag-of-words approach the IPC-R codes in the patents were the most informative of all the patent
sections. During our post-evaluation analysis we discovered that the low scores for the individual
text sections are more likely an artefact of our data selection process than an adequate
reflection of their performance in a retrieval task. Table 1 showed that there are considerable
differences in size between the different text section corpora and thus in the number of patent
documents that could be retrieved for a specific query. Moreover, we found evidence that some
relevant patent documents were impossible to retrieve for certain queries. For example, if a relevant
document for a query consisting of a claims section did not have a claims section itself, it did not
feature in the claims subcorpus and could therefore not be retrieved. Consequently, we cannot
draw a definite conclusion about the relative importance of the separate text sections for patent
retrieval. The full-text corpus and the IPC-R corpus, however, did not suffer from these drawbacks.
We found it interesting that the IPC-R retrieval outperformed the full-text retrieval, though the difference
between the results is small. The major advantage of the IPC-R section is, predictably, that it is
language-independent and conceptual, and has a limited 'vocabulary' of terms that can be
used. For future work it would be interesting to examine the differences in retrieval results when
using more general and more specific IPC codes as retrieval terms.</p>
      <p>Our second goal was to examine the impact of the text sections on the re-ranking of retrieved
documents. When we look at the results in Tables 3 and 4, it seems that the use of the information
in the respective text sections of the query and retrieved document can lead to an improvement in
the ranking of the relevant results. However, the high standard deviation values for the five folds
show that our training set of 300 queries is too small to draw any definite conclusions about the
improvements made by the models. This may be a consequence of the fact that the models were
not trained on optimal data but on rather poor retrieval results. Though the models seem to boost the
ranking of the retrieved documents, they contain enough noise to diminish the accuracy.</p>
      <p>In order to evaluate the importance of the different text sections in the re-ranking of the
retrieval results, we rank them in Table 5 according to the coefficient that was assigned to them
in the Logistic Regression Model. We find that all text sections except for the description have
a significant influence on the re-ranking of the retrieval results. The correlation analysis reported
in Table 6 shows a high correlation between the cosine similarity of the claims and description
sections. Consequently, the coefficient for the claims section should be interpreted as being caused
by the combination of the cosine similarities for the claims and description sections. Of all the text
sections the abstracts have the most impact in the re-ranking process. This was to be expected as
the abstracts are most likely to contain the keywords that are specific to the field of the invention.</p>
      <sec id="sec-5-1">
        <title>Feature</title>
        <p>[Table 5. Features ranked by their coefficient in the Logistic Regression Model: cosine similarity between abstracts; cosine similarity between claims; cosine similarity between titles; TF-IDF value from retrieval data; cosine similarity between descriptions.]</p>
        <p>In our contribution to the CLEF-IP 2010 Prior Art Retrieval task we examined the impact of
different sections of patent documents on the retrieval and re-ranking of patent documents. Using
a standard bag-of-words approach in Lemur we found that the IPC-R sections are more informative
for patent retrieval than a full-text representation of the patent document. We then performed a
re-ranking of the retrieved documents using a Logistic Regression Model, trained on the retrieved
documents in the training set. Looking at the improved MAP scores, we found indications that
the information contained in the separate text sections of the patent document can contribute to
a better ranking of the retrieved documents.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Eva</given-names>
            <surname>D'hondt</surname>
          </string-name>
          .
          <article-title>Lexical issues of a syntactic approach to interactive patent retrieval</article-title>
          .
          <source>In Proceedings of the 3rd BCS-IRSG Symposium on Future Directions in Information Access</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Erik</given-names>
            <surname>Graf</surname>
          </string-name>
          and
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          .
          <article-title>A methodology for building a patent test collection for prior art search</article-title>
          .
          <source>In Proceedings of EVIA2008</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Patrice</given-names>
            <surname>Lopez</surname>
          </string-name>
          and
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Romary</surname>
          </string-name>
          .
          <article-title>Multiple retrieval models and regression models for prior art search</article-title>
          .
          <source>In Proceedings of CLEF 2009</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>