<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining document representations for prior-art retrieval</article-title>
      </title-group>
      <contrib-group>
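        <contrib contrib-type="author">
          <string-name>Eva D'hondt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suzan Verberne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wouter Alink</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Cornacchia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Foraging Lab, Radboud University Nijmegen</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Spinque</institution>
        </aff>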
      </contrib-group>
      <abstract>
        <p>In this paper we report on our participation in the CLEF-IP 2011 prior art retrieval task. We investigated whether adding syntactic information in the form of dependency triples to a bag-of-words representation could lead to improvements in patent retrieval. In our experiments, we investigated this effect on the title, abstract and first 400 words of the description section. The experiments were conducted in the Spinque framework, with which we tried to optimize the combinations of text representations and document sections. We found that adding triples did not improve overall MAP scores compared to the baseline bag-of-words approach, but did result in slightly higher set recall scores. In future work we will extend our experiments to use all the text sections of the patent documents and fine-tune the mixture weights.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>of words in prior-art retrieval; (b) Optimizing the combination of the different document sections
and text representations.</p>
    </sec>
    <sec id="sec-2">
      <title>Data Description</title>
      <p>
        The CLEF-IP 2011 corpus, a part of the MAREC collection, was provided by the IRF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
contains approximately 3 million documents, pertaining to more than 1 million patents<sup>1</sup>. Most
documents (2.6 million) came from the European Patent Office (EPO) and a smaller subset
(around 400,000) consisted of patent documents from the World Intellectual Property
Organization (WIPO). The patent documents were stored in the IRF XML format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A patent document
contains metadata such as the name of the inventor, IPC-R code and date of application, as well as (a
mixture of) English, German or French text sections for the title, abstract, claims and/or
description of the patent. In our experiments we only used the English text sections and the IPC-R
codes. The organizers distributed a training set of 300 patents and, unlike in previous years,
only one topic set, containing 3973 documents.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Set-up</title>
      <sec id="sec-3-1">
        <title>Patent Section Extraction</title>
        <p>Using a Perl script we extracted the English title, abstract, claims and description sections from
the original XML files. We also saved the first 400 words of the description sections and the IPC-R
codes<sup>2</sup> separately. All the sections were saved as plain text in temporary text files. If a document
did not contain a section or if, according to the XML tags, the section was not in English, no
corresponding text file was created. The XML documents contain many text-internal XML tags
that indicate figures, references, formulae, etc. in the original patent document. All such tags and
the text that they enclose were filtered out using a Perl script.</p>
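        <p>The extraction steps above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl script: the tag names in INLINE_TAGS are hypothetical stand-ins for the text-internal tags of the IRF XML format, and the regular-expression approach assumes that such elements are not nested inside elements of the same name.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical names of text-internal tags; the real IRF XML tag set may differ.
INLINE_TAGS = ("figref", "maths", "patcit")


def strip_inline_tags(text, tags=INLINE_TAGS):
    """Remove each listed element together with the text it encloses."""
    for tag in tags:
        # Non-greedy match from the opening tag to the nearest closing tag.
        pattern = r"<{0}\b[^>]*>.*?</{0}>".format(tag)
        text = re.sub(pattern, " ", text, flags=re.DOTALL)
    # Collapse the whitespace left behind by the removed elements.
    return re.sub(r"\s+", " ", text).strip()


def first_n_words(text, n=400):
    """Return the first n whitespace-separated tokens of a text."""
    return " ".join(text.split()[:n])


# Toy description text, invented for illustration.
description = 'The valve <figref idref="f1">FIG. 1</figref> opens <maths>x=1</maths> slowly.'
clean = strip_inline_tags(description)
print(clean)
print(first_n_words(clean, n=3))
```
]]></preformat>
        <p>Splitting on whitespace is only an approximation of a word count; the paper does not state how the 400-word boundary was determined.</p>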
      </sec>
      <sec id="sec-3-2">
        <title>Patent Parsing</title>
        <p>
          In a preprocessing step the image references and claims headers in the text were removed using
the regular expressions described by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] in order to facilitate syntactic parsing of the claims and
description sentences. We then split the remaining text into sentences using a Perl script and knowledge
of the most common abbreviations in patent texts. The sentences in the resulting text files were
parsed using the AEGIR dependency parser [
          <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
          ], version 1.8.2. One of AEGIR's output
formats is a dependency representation which is comparable to the Stanford typed dependencies
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], in the sense that it generates a set of binary relations between words for an input sentence,
thereby converting some function words (such as prepositions) to relations. In addition,
AEGIR performs a number of normalizing syntactic transformations, such as passive-to-active
transformation.
        </p>
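        <p>Abbreviation-aware sentence splitting of the kind described above could look roughly like the sketch below. It is written in Python rather than Perl, and the abbreviation list is a hypothetical example; the list actually used is not given in this paper.</p>
        <preformat><![CDATA[
```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"fig.", "figs.", "e.g.", "i.e.", "no.", "approx.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        if ends_sentence and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that lacks final punctuation.
        sentences.append(" ".join(current))
    return sentences


text = "The lever shown in Fig. 2 is rotatable. It engages the pin, e.g. a steel pin."
for sentence in split_sentences(text):
    print(sentence)
```
]]></preformat>
        <p>A token-based splitter like this one misses sentence boundaries that directly follow an abbreviation, which is one reason a domain-specific abbreviation list matters for patent text.</p>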
        <p>Because of the large amount of data we imposed a time limit of 1 second per
sentence on the parser. This resulted in a loss of parsing output that differed somewhat between the separate
sections.</p>
        <p>Due to the sheer size of the corpus we were not able to completely parse the description
and claims sections of the entire corpus within the given time. We therefore had to limit our
experiments on the impact of triples to the title, abstract and first 400 words of the description.
The keywords used for the bag-of-words component in the experiments were extracted from the
title, abstract and the full description.</p>
        <p><sup>1</sup> Please note the difference between a patent and a patent document: a patent is not a physical document itself
but a name for a group of patent documents that have the same patent ID number.</p>
        <p><sup>2</sup> We used the full IPC-R code up to the level of the subgroups, e.g. A01J 5/01.</p>
        <p><sup>3</sup> Parsing output from these sections was incomplete for the whole corpus and not used for the subsequent
experiments.</p>
        <p>
          We modeled and executed our runs as search strategies within the Spinque framework [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This is
a prototype interactive retrieval environment where search processes are divided into two phases:
the search strategy definition and the actual search.
        </p>
        <p>The framework has a GUI-based drag-and-drop strategy editor which allows the user to
construct the search strategies as graph structures, where edges represent data flows consisting of
terms, documents (e.g. patent documents), document sections (e.g. invention title, abstract,
description) and named entities (e.g. patents, IPC-R codes, companies). The nodes connected by
such edges are pre-defined, general-purpose operational blocks that either provide source data (the
patent corpus and the topics corpus) or modify their input data flow by applying operations such
as selection based on IPC-R classes, extraction of specific sections from documents, or ranking of
sections and documents, to name a few.</p>
        <p>Search strategies defined in this framework are automatically translated into a probabilistic
relational query language and executed on top of an SQL database engine. The ranking scores
that are used as the basis for the probabilities were calculated with the Okapi BM25 ranking
algorithm.</p>
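        <p>For readers unfamiliar with Okapi BM25, the sketch below computes a textbook BM25 score for one document. It is not the Spinque implementation: the parameter values k1 = 1.2 and b = 0.75 are common defaults rather than values reported here, the idf variant with 1 added inside the logarithm is one of several in use, and the step that turns scores into probabilities is not shown.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document for a bag of query terms.

    doc_freq maps a term to the number of documents that contain it.
    k1 and b are assumed defaults, not values taken from the paper.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        n_t = doc_freq.get(term, 0)
        # Smoothed inverse document frequency (always positive in this variant).
        idf = math.log(1.0 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        # Term-frequency saturation with document-length normalisation.
        norm = tf[term] + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score


# Toy statistics, invented for illustration.
doc_freq = {"valve": 2, "spring": 1}
print(bm25_score(["valve"], ["valve", "valve", "pin", "rod"], doc_freq, n_docs=10, avg_doc_len=4.0))
```
]]></preformat>
        <p>The same function applies unchanged when the terms are dependency triples instead of words, since BM25 only needs term counts and document frequencies.</p>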
      </sec>
      <sec id="sec-3-3">
        <title>Experiments</title>
        <sec id="sec-3-3-1">
          <title>Query term selection</title>
          <p>
            This year, we performed query term selection on the triples, based on their relevance for a specific
IPC-R class. The LCS software [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] that we used for the classification track builds class profiles
which contain the term distribution (words and dependency triples) per IPC-R class. We extracted
the subset of dependency triples that were most informative for correct classification, namely the
top 25% of the triples ranked on their Winnow scores, from last year's class profiles and used them
to filter the topic triples. Some class profiles for smaller IPC-R classes did not contain many triples
(&lt; 1000). In these cases all triples that contributed to classification were extracted. The aim of this
filtering step is to remove the noisy, less informative topic triples from the query, thus improving
precision. Since a patent document is usually labeled with not just one IPC-R code but
belongs to multiple categories (on average a patent document carries 3 different IPC-R
codes at subclass level), the filtering is not so severe that it weeds out the individual differences
between topic patent documents. In other words, the individual filtered topic documents are still
very different from one another due to the relatively large subsets of terms from the class profiles
that were used as filters and the different combinations of IPC-R classes per document. The
filtering step reduced the average number of triples per topic document (over all sections) from
180 to 60.
          </p>
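          <p>The filtering step described above can be sketched as follows. The data layout is an assumption made for illustration: each class profile is taken to be a mapping from a triple to its Winnow score and to contain only triples that contributed to classification. The function and parameter names are hypothetical and are not part of the LCS software.</p>
          <preformat><![CDATA[
```python
import math


def select_profile_triples(profile, top_fraction=0.25, small_profile_size=1000):
    """Pick the filter triples for one IPC-R class.

    profile maps triple -> Winnow score (assumed layout). Profiles with fewer
    than small_profile_size triples are kept whole.
    """
    if len(profile) < small_profile_size:
        return set(profile)
    ranked = sorted(profile, key=profile.get, reverse=True)
    keep = math.ceil(len(ranked) * top_fraction)
    return set(ranked[:keep])


def filter_topic_triples(topic_triples, topic_classes, class_profiles, **kwargs):
    """Keep the topic triples that occur in the filter set of any of the topic's classes."""
    allowed = set()
    for ipc in topic_classes:
        allowed |= select_profile_triples(class_profiles.get(ipc, {}), **kwargs)
    return [triple for triple in topic_triples if triple in allowed]


# Toy profiles with placeholder triple names; the threshold is lowered so that
# both branches are exercised on this small example.
class_profiles = {
    "A01J": {"t1": 0.9, "t2": 0.8, "t3": 0.1, "t4": 0.05,
             "t5": 0.3, "t6": 0.2, "t7": 0.15, "t8": 0.12},
    "B65D": {"t9": 0.5, "t3": 0.4},
}
kept = filter_topic_triples(["t1", "t3", "t4", "t9"], ["A01J", "B65D"],
                            class_profiles, small_profile_size=4)
print(kept)
```
]]></preformat>
          <p>Taking the union over all classes of a topic document mirrors the observation above that documents with several IPC-R codes retain a relatively large and varied set of triples after filtering.</p>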
        </sec>
        <sec id="sec-3-3-2">
          <title>Strategy building</title>
          <p>The search strategies were constructed and evaluated in Spinque's strategy builder interface. Our
strategies consisted of two steps: (1) As in last year's approach we first filtered the corpus on
the IPC-R codes of the topic document to create a subcorpus per topic document that contains
documents with at least one IPC-R class in common with the topic document; (2) Terms (words
and/or triples) from the sections in the topic documents were then used to query the respective
sections of documents in the subcorpus. We did not perform any term selection for the
bag-of-words approach. The resulting document lists were then merged into a larger results list. The
ranking in that list depended on the documents' scores (BM25 scores from their separate runs)
multiplied by the weights given to each results list in the configuration. An example of a search
strategy used in this track is shown in Figure 1.</p>
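          <p>The two steps above can be illustrated with the following Python sketch. It is not generated from a Spinque strategy: the function names are hypothetical, and summing the weighted scores of a document that appears in several result lists is an assumption, since the text only states that scores are multiplied by the list weights. The default cut-off of 1000 follows the number of patents retrieved per query.</p>
          <preformat><![CDATA[
```python
from collections import defaultdict


def ipc_subcorpus(corpus_ipc, topic_ipc):
    """Step 1: keep documents sharing at least one IPC-R class with the topic."""
    topic_ipc = set(topic_ipc)
    return {doc for doc, codes in corpus_ipc.items()
            if not topic_ipc.isdisjoint(codes)}


def merge_result_lists(result_lists, weights, cutoff=1000):
    """Step 2 (merge): combine per-run scores into one ranked list.

    result_lists maps run name -> {document: BM25 score}.
    Weighted scores of the same document are summed (an assumption).
    """
    combined = defaultdict(float)
    for name, scores in result_lists.items():
        for doc, score in scores.items():
            combined[doc] += weights[name] * score
    # Sort by descending score; break ties on the document identifier.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:cutoff]


# Toy identifiers and scores, invented for illustration.
corpus_ipc = {
    "EP1": ["A01J 5/01"],
    "EP2": ["B65D 1/02"],
    "EP3": ["A01J 5/01", "B65D 1/02"],
}
subcorpus = ipc_subcorpus(corpus_ipc, ["A01J 5/01"])
result_lists = {
    "words": {"EP1": 10.0, "EP3": 5.0},
    "triples": {"EP3": 30.0},
}
merged = merge_result_lists(result_lists, {"words": 0.8, "triples": 0.2})
print(sorted(subcorpus))
print([doc for doc, _ in merged])
```
]]></preformat>
          <p>In this toy example the document found by both runs overtakes the one found by words alone, which is the reranking effect that the mixture weights are meant to control.</p>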
        </sec>
        <sec id="sec-3-3-3">
          <title>Determining the weighting configuration</title>
          <p>The mixture weights in the Spinque framework allow for a reranking step while merging the result
lists of the runs with individual sections. Finding the optimal mixture weights is a very
time-consuming process because of the large parameter space. Due to time constraints we were not
able to train on many coefficient combinations for the mixtures. We used two different approaches
to determine the weight configurations used in the submitted runs: (a) normalisation over retrieval
scores of individual sections; and (b) trial-and-error weighting.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>Determining the relative importance of different sections</title>
          <p>The mixture coefficients for the combinations of different text sections were found by running
a subset of the training set topics on the respective text sections, that is, evaluating the title,
abstract and description sections independently from one another. We then took the Mean
Average Precision (MAP) scores of these runs, normalised them to sum up to 1 and used the
resulting ratios as coefficients for the mixtures.</p>
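          <p>As a small worked example of this normalisation, the sketch below divides each section's MAP score by the sum over all sections. The MAP values are invented placeholders, not the scores measured on the training subset.</p>
          <preformat><![CDATA[
```python
def normalise_to_weights(map_scores):
    """Scale per-section MAP scores so that the resulting weights sum to 1."""
    total = sum(map_scores.values())
    if total == 0:
        raise ValueError("all MAP scores are zero; weights are undefined")
    return {section: score / total for section, score in map_scores.items()}


# Placeholder MAP scores per section, invented for illustration.
map_scores = {"title": 0.02, "abstract": 0.03, "description": 0.05}
weights = normalise_to_weights(map_scores)
for section, weight in weights.items():
    print(section, round(weight, 2))
```
]]></preformat>
          <p>With these placeholder values the description section would receive half of the total weight, because its MAP equals the sum of the other two.</p>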
        </sec>
        <sec id="sec-3-3-5">
          <title>Determining the relative importance of triples and words in the combined runs</title>
          <p>The coefficients for mixing the words only and triples only runs were found using the 'trial and
error' method on the training set. Starting from a 50/50 combination we used binary search to
arrive at the optimal configuration: a words only (0.8) and triples only (0.2) combination.</p>
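          <p>The exact search procedure is not spelled out here, so the following is only one plausible reading of a binary search over the mixture weight, assuming that MAP has a single peak as a function of the words-only weight. The callable evaluate_map is hypothetical: it stands for running the training topics with weight w for the words-only list and 1 - w for the triples-only list and returning the MAP score.</p>
          <preformat><![CDATA[
```python
def search_mixture_weight(evaluate_map, low=0.0, high=1.0, steps=5):
    """Interval-halving search for the words-only weight w; triples get 1 - w."""
    best_w = (low + high) / 2.0  # start from the 50/50 combination
    best_map = evaluate_map(best_w)
    for _ in range(steps):
        left = (low + best_w) / 2.0
        right = (best_w + high) / 2.0
        left_map, right_map = evaluate_map(left), evaluate_map(right)
        if max(left_map, right_map) <= best_map:
            # Neither side improves: narrow the interval around the current best.
            low, high = left, right
        elif left_map > right_map:
            # The left probe wins: continue in the left half.
            high, best_w, best_map = best_w, left, left_map
        else:
            # The right probe wins: continue in the right half.
            low, best_w, best_map = best_w, right, right_map
    return best_w, best_map


def toy_map(w):
    """Toy stand-in for a real evaluation, with its peak placed at w = 0.8."""
    return 1.0 - (w - 0.8) ** 2


best_w, best_map = search_mixture_weight(toy_map)
print(round(best_w, 3))
```
]]></preformat>
          <p>Each step costs two full evaluation runs over the training topics, which illustrates why only a few coefficient combinations could be tried in the available time.</p>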
        </sec>
        <sec id="sec-3-3-6">
          <title>Submitted runs</title>
          <p>We chose to submit four separate runs:</p>
          <p>1. Triples only: A baseline run to gauge the impact that index terms as precise as dependency
triples can have on retrieval.</p>
          <p>2. Words only: A standard bag-of-words baseline run. Keywords were stemmed using the
Porter stemmer (version 1).</p>
          <p>3. Combination 1: Combining the result lists of the words only (stemmed) and triples only
(unstemmed) runs in an 80/20 configuration.</p>
          <p>
            4. Combination 2: Even though triples are lemmatized by the parser, the patent domain
consists of many highly specialized subdomains which deploy their own jargon. Consequently
the patent documents usually contain a lot of words which may not feature in the parser
lexicon [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. The AEGIR parser recognises these words using robust rules which lead to
good estimates of POS tags (important for correct syntactic analysis later on) but applies
no lemmatisation beyond the basic singular-plural differences. We therefore submitted an
extra run to examine the impact of stemming of the triples.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>In this section we present the results of our submitted runs in terms of MAP, Precision and Recall
for the general (Table 3) and English language-specific test data (Table 4).</p>
      <p><bold>Impact of dependency triples on retrieval</bold></p>
      <p>As expected, triples by themselves are too specific to be used for retrieval: the triples only run
achieved a very high set precision but fairly low set recall. On average, only 250 documents were
retrieved per topic document in this run. The MAP scores for the different sections on a subset
of the training data in Table 2 show marked differences between the sections.</p>
      <p>However, in the combination runs, merging the triples only and the bag-of-words result lists
produced some interesting results: while dependency triples are usually seen as a way of improving
ranking, we achieved the highest set recall scores (measured with the language-specific English
relevance assessments) of all participants. An analysis of the result list of the
combination 1 run shows that around 5% of the relevant documents retrieved in this run (2.5% of all
the relevant patents) were found using triples, but were not found in the words only approach.
This may show that using dependency triples, i.e. information which abstracts away from the
surface form of the sentence, can contribute to retrieval where a bag-of-words approach falls short.
However, at this point, the contribution is very small. An alternative explanation is that the
dependency triples have improved the ranking of documents that fell below
the cut-off point of 1000 retrieved patents per query in the words only run. In that case, there
is a complete overlap between the results from the triples only run and the documents found by
the words only approach, and the improvement in set recall for the combined run is an artefact
of our choice of threshold.</p>
      <p>Furthermore, another 36% of the relevant documents in the combination 1 run were found by both the
words and triples approaches. We would expect these documents to feature high in the combined
results list, thus improving the MAP score (compared to the words only run). However, we did
not find much difference in the rankings and observed a slight decrease in MAP score. We expect that
fine-tuning the 80/20 words-triples mixture coefficients on a held-out set of the test corpus may
improve the rankings.</p>
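      <p>An overlap analysis of this kind reduces to a few set operations, sketched below in Python. The function name and the document identifiers are invented for illustration; each run is represented simply as the set of documents it retrieved within the cut-off.</p>
      <preformat><![CDATA[
```python
def overlap_report(relevant, words_run, triples_run, combined_run):
    """Break down the relevant documents retrieved by the combined run."""
    relevant = set(relevant)
    found = relevant.intersection(combined_run)
    in_words = found.intersection(words_run)
    in_triples = found.intersection(triples_run)
    only_triples = in_triples - in_words
    both = in_triples.intersection(in_words)
    n_found = len(found) or 1  # avoid division by zero for empty runs
    return {
        "set_recall": len(found) / len(relevant) if relevant else 0.0,
        "share_only_triples": len(only_triples) / n_found,
        "share_both": len(both) / n_found,
    }


# Toy document identifiers, invented for illustration.
relevant = {"d1", "d2", "d3", "d4"}
words_run = {"d1", "d2", "d9"}
triples_run = {"d2", "d3"}
combined_run = words_run | triples_run
report = overlap_report(relevant, words_run, triples_run, combined_run)
print(report)
```
]]></preformat>
      <p>To separate the two explanations given above, the words only run would have to be evaluated without the cut-off: if every document counted under share_only_triples then turns up in the words only list, the recall gain is a threshold effect.</p>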
      <p>In the combination 2 run we tried to raise recall by using stemming in the
triples as well as in the keywords, but we found that precision suffers considerably in that trade-off: while we
did find more relevant documents, they were all pooled at the bottom of the results list. Moreover,
the MAP score was significantly lower than for the combination 1 run. It is clear that the mixture
weights should be tuned separately for combinations with stemmed triples.</p>
      <sec id="sec-5-1">
        <title>Impact of the different sections</title>
        <p>We did not have the opportunity to examine the impact of the different sections in much detail.
Rather, we focussed on optimising the impact of those sections where dependency triples were the
most successful in their own right (see section 3.4.3). However, this independence assumption is
problematic: while it was a good starting point, in that the mixtures gave the most weight
to those sections that were most likely to have relevant documents high in the list, this strategy
cannot properly account for interaction between sections and suffers from the uneven distribution
of (English) text data in the corpus. In future work we will tune the weights further by trial and error
to try to find a local, if not global, optimum.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In our participation in the CLEF-IP 2011 prior art retrieval track we examined the impact of adding
dependency triples obtained with the AEGIR parser to a bag-of-words approach. Triples by
themselves are very specific terms, as reflected by the high precision score achieved in the triples only
run. Interestingly, we found that adding triples led to a slight improvement in recall, rather than
in precision as we had expected. It is not quite clear if this is due to the normalisation features of
triples or an indirect effect of their higher precision. We also experimented with stemming of the
triples, but this led to a severe loss of precision. In future work we will extend our experiments
by adding data from the full description sections and the claims section, both for the words and the triples
approach. We will also keep working on tuning the mixture coefficients by a 'trial-and-error'
method, rather than basing the coefficients on the individual retrieval performance of the sections.</p>
      <p>[11] Suzan Verberne, Merijn Vogel, and Eva D'hondt. Patent classification experiments with
the Linguistic Classification System LCS. In Proceedings of the Conference on Multilingual
and Multimodal Information Access Evaluation (CLEF 2010), CLEF-IP workshop, number
Section 2, page 49, 2010.</p>
      <p>[Tables 2 to 4: MAP, precision (P) and recall (R) scores of the runs]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Home - IRF. http://www.ir-facility.org/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Alink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Cornacchia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arjen</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <article-title>Searching CLEF-IP by strategy</article-title>
          .
          In Carol Peters, Giorgio Di Nunzio, Mikko Kurimo, Thomas Mandl, Djamel Mostefa, Anselmo Peñas, and Giovanna Roda, editors,
          <source>Multilingual Information Access Evaluation I. Text Retrieval Experiments</source>
          , volume
          <volume>6241</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>468</fpage>
          -
          <lpage>475</lpage>
          . Springer Berlin / Heidelberg,
          <year>2010</year>
          .
          <pub-id pub-id-type="doi">10.1007/978-3-642-15754-7_56</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Daniela</given-names>
            <surname>Becks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Mandl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Womser-Hacker</surname>
          </string-name>
          .
          <article-title>Phrases or Terms? The Impact of Different Query Types</article-title>
          .
          In
          <source>Proceedings of the Conference on Multilingual and Multimodal Information Access Evaluation (CLEF 2010), CLEF-IP workshop</source>
          , page
          <fpage>99</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Marie-Catherine</given-names>
            <surname>de Marneffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>The Stanford typed dependencies representation</article-title>
          .
          In
          <source>Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation - CrossParser '08</source>
          , number ii, pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          , Morristown, NJ, USA,
          <year>2008</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          IRF.
          <article-title>CLEF-IP 2011 Track Guidelines</article-title>
          .
          <source>Technical report, IRF</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Cornelis H.A.</given-names>
            <surname>Koster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jean G.</given-names>
            <surname>Beney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Suzan</given-names>
            <surname>Verberne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Merijn</given-names>
            <surname>Vogel</surname>
          </string-name>
          .
          <article-title>Phrase-Based Document Categorization</article-title>
          . In W. Bruce Croft, Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors,
          <source>Current Challenges in Patent Information Retrieval</source>
          , volume
          <volume>29</volume>
          of The Kluwer International Series on Information Retrieval, pages
          <fpage>263</fpage>
          -
          <lpage>286</lpage>
          . Springer Berlin Heidelberg,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Cornelis H.A.</given-names>
            <surname>Koster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Seutter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jean G.</given-names>
            <surname>Beney</surname>
          </string-name>
          .
          <article-title>Multi-classification of patent applications with Winnow</article-title>
          .
          In
          <source>Perspectives of Systems Informatics, 5th International Andrei Ershov Memorial Conference</source>
          , pages
          <fpage>546</fpage>
          -
          <lpage>555</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Nelleke</given-names>
            <surname>Oostdijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Suzan</given-names>
            <surname>Verberne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cornelis</given-names>
            <surname>Koster</surname>
          </string-name>
          .
          <article-title>Constructing a broad-coverage lexicon for text mining in the patent domain</article-title>
          .
          In
          <source>Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 2010)</source>
          . European Language Resources Association (ELRA),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Parapatics</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Dittenbach</surname>
          </string-name>
          .
          <article-title>Patent claim decomposition for improved information extraction</article-title>
          . In W. Bruce Croft, Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors,
          <source>Current Challenges in Patent Information Retrieval</source>
          , The Kluwer International Series on Information Retrieval. Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Suzan</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eva</given-names>
            <surname>D'hondt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nelleke</given-names>
            <surname>Oostdijk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cornelis</given-names>
            <surname>Koster</surname>
          </string-name>
          .
          <article-title>Quantifying the challenges in parsing patent claims</article-title>
          .
          <source>Proceedings of the 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010)</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>