<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Does Patent IR profit from Linguistics or Maximum Query Length?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniela Becks</string-name>
          <email>daniela.becks@uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Eibl</string-name>
          <email>eibl@informatik.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Jürgens</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Kürsten</string-name>
          <email>jens.kuersten@informatik.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Wilhelm</string-name>
          <email>thomas.wilhelm@informatik.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christa Womser-Hacker</string-name>
          <email>womser@uni-hildesheim.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chemnitz University of Technology</institution>
          ,
          <addr-line>Straße der Nationen 62, 09111 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Hildesheim, Information Science</institution>
          ,
          <addr-line>Marienburger Platz 22, 31141 Hildesheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In 2011, the University of Hildesheim and Chemnitz University of Technology participated jointly in the CLEF Intellectual Property Track. We focused on the prior art candidate search task, which was offered for the third time. Our group submitted seven runs ranging from simple bag-of-words queries to linguistic phrases. The aim of our experiments was to examine the effectiveness of different query strategies. In particular, we wanted to evaluate the advantage of linguistic phrases over very long bag-of-words queries. Phrases were extracted using a dedicated extraction component developed by the University of Hildesheim.</p>
      </abstract>
      <kwd-group>
        <kwd>Intellectual Property</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Patent Retrieval System</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Prior art search is a well-known type of patent search that is performed to find out
whether prior art exists for a given patent application [1, 2]. In contrast, the goal
of the classification tasks is to classify patents according to the International Patent
Classification (IPC). Furthermore, two image-based tasks were introduced this year
[1].</p>
      <p>At CLEF-IP 2011, the University of Hildesheim and Chemnitz University of
Technology worked together. All of our experiments concentrated on the prior art
search, which has now been organized for three years.</p>
      <p>The test collection, which was provided by the IRF<sup>1</sup>, consisted of approximately 2.5
million documents stored as XML files. Each of the documents from the European
Patent Office (EPO) was published before 2002 [1, 3]. The collection differed from
those of the last two years because it contained about 400,000 additional documents
from the World Intellectual Property Organization (WIPO) [1, 3]. Besides the document
collection, the organizers provided one topic set (about 4,000 patents assigned the
code “A1” or “A2”), which consisted in equal parts of German, English and French topic
files [1].</p>
    </sec>
    <sec id="sec-2">
      <title>2 System Setup</title>
      <p>
        Our experiments were performed on the basis of the Xtrieval framework developed at
Chemnitz University of Technology. This framework consists of four different
components, three of which (1-3) form the system core [4]:
      </p>
      <sec id="sec-2-1">
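        <list list-type="order">
          <list-item><p>Indexing</p></list-item>
          <list-item><p>Retrieval</p></list-item>
          <list-item><p>Evaluation</p></list-item>
          <list-item><p>User interface</p></list-item>
        </list>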
        <p>The Xtrieval framework was designed to make use of common retrieval APIs, such
as Lemur<sup>2</sup>, Terrier<sup>3</sup> and Apache Lucene<sup>4</sup>, for evaluation purposes. For the present
experiments we used Apache Lucene in version 3.1 as the retrieval core in Xtrieval [4].
Thus, the underlying retrieval model is the traditional Vector Space Model. More
details on our approach are given in the following section.</p>
        <p><sup>1</sup> Information Retrieval Facility</p>
        <p><sup>2</sup> http://www.lemurproject.org/</p>
        <p><sup>3</sup> http://terrier.org/</p>
        <p><sup>4</sup> http://lucene.apache.org/</p>
        <sec id="sec-2-1-1">
          <title>2.1 Preprocessing and Indexing</title>
          <p>To index the document collection, the standard retrieval approach was followed. As a
consequence, the text extracted from the XML documents was first preprocessed.
This preprocessing included the following steps:</p>
          <list list-type="bullet">
            <list-item><p>stopword elimination</p></list-item>
            <list-item><p>tokenization</p></list-item>
            <list-item><p>stemming</p></list-item>
          </list>
          <p>In [5] it was mentioned that patent-specific terms that appear frequently are likely
to result in very long document lists. Following this, our group decided to use a
customized stopword list, i.e. a standard stopword list<sup>5</sup> that was extended with a
number of patent-specific terms. This approach was already used in 2009 and 2010 [5,
6]. In a next step, the preprocessed text was stored in the index. Because it
proved to be more effective during the experiments on the training set, we stored the
text as a bag of words. Each language was treated separately. Thus, the
resulting index consisted of three fields, one field per language.
Furthermore, the following patent parts were considered during the indexing process:</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
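        <list list-type="bullet">
          <list-item><p>Title</p></list-item>
          <list-item><p>Abstract</p></list-item>
          <list-item><p>Claim</p></list-item>
          <list-item><p>Description</p></list-item>
        </list>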
        <p>Besides this textual information, the language-independent IPC codes were included
in the index, because the experiments at CLEF 2009 showed that these are
particularly useful for increasing the recall of an information retrieval system [5].</p>
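        <p>As a rough illustration of the preprocessing and indexing steps described in this section, the following Python sketch tokenizes a text, removes stopwords, stems the remaining tokens, and builds one index field per language plus a language-independent IPC field. It is only a sketch under stated assumptions: the stopword entries, the toy suffix stripper (a stand-in for a real stemmer), the field names and all function names are hypothetical and are not taken from Xtrieval or Lucene.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical stopword list: a few standard English stopwords plus some
# patent-specific terms. The lists actually used in the experiments are
# not reproduced here.
STANDARD_STOPWORDS = {"a", "an", "the", "of", "for", "in", "and", "is", "to"}
PATENT_STOPWORDS = {"claim", "claims", "invention", "embodiment", "said"}
STOPWORDS = STANDARD_STOPWORDS | PATENT_STOPWORDS


def toy_stem(token):
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords and stem; returns a bag of words as a list."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def build_index_document(texts_by_language, ipc_codes):
    """Build one text field per language and one shared IPC field."""
    doc = {}
    for lang, text in texts_by_language.items():
        doc["text_" + lang] = preprocess(text)
    doc["ipc"] = list(ipc_codes)
    return doc
```
]]></preformat>
        <p>For simplicity, the sketch applies the same English stopwords and suffix rules to every language; a real setup would presumably use a language-specific stopword list and stemmer for each of the three fields.</p>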
        <sec id="sec-2-2-1">
          <title>2.2 Phrase Extraction</title>
          <p>A lot of research has concentrated on the effectiveness of different kinds of phrases.
At CLEF 2010, the University of Hildesheim investigated the effectiveness of terms
and phrases in the context of patent information retrieval. The experiments using
phrases significantly outperformed the term baseline, although a simple statistical
approach was used [6]. Furthermore, in [7] the author focused on the advantage of
statistical and syntactical noun phrases for interactive query expansion. In this case,
linguistic phrases proved to be effective for information retrieval [7]. Following this,
we decided to investigate the influence of linguistic (dependency) phrases on the
effectiveness of a patent retrieval system. In this context, different types of phrases
were considered, ranging from simple adjective-noun to complex noun-object
relations.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p><sup>5</sup> http://members.unine.ch/jacques.savoy/clef/index.html</p>
        <p>To run our experiments we extracted phrases using a special extraction component
that has been developed at the University of Hildesheim. The underlying approach is
called rule-based dependency parsing and combines the rule-based method described
in [8] with dependency parsing. In rule-based dependency parsing,
dependency phrases are identified with the help of defined term pairs. The following
example illustrates this approach:</p>
        <p>A METHOD FOR IMPLEMENTING ONLINE MAINTENANCE IN THE
COMMUNICATION NETWORK (Topic EP-1881641-A1)</p>
        <p>In the example title the tool would extract the phrase “method for implementing
online maintenance”. The determiner “a” and the preposition “in” serve as patterns to
identify the phrase. This approach was implemented using UIMA<sup>6</sup> and OpenNLP<sup>7</sup> as
the basis of the extraction tool. A detailed description of the system developed by the
University of Hildesheim can be found in [9]. So far, the extraction tool has been
tested only on English documents. As a consequence, the phrase experiments
concentrated on English patents only.</p>
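        <p>The idea of identifying a phrase through a defined term pair can be sketched in a few lines of Python. This is a minimal illustration only, not the Hildesheim extraction tool: the actual rule set and the dependency parsing step are not described here, the function name is hypothetical, and the single marker pair (“a”, “in”) is simply taken from the example title above.</p>
        <preformat><![CDATA[
```python
def extract_phrases(text, marker_pairs):
    """Return the token spans enclosed by a (start, end) marker pair.

    marker_pairs is a list of (start_word, end_word) tuples, for example
    a determiner and a preposition. Both markers are assumptions made for
    this sketch; they are excluded from the extracted phrase.
    """
    tokens = text.lower().split()
    phrases = []
    for start, end in marker_pairs:
        i = 0
        while i < len(tokens):
            if tokens[i] != start:
                i += 1
                continue
            try:
                # nearest end marker to the right of the start marker
                j = tokens.index(end, i + 1)
            except ValueError:
                break  # no closing marker left for this pair
            if j > i + 1:
                phrases.append(" ".join(tokens[i + 1:j]))
            i = j + 1
    return phrases


title = "A METHOD FOR IMPLEMENTING ONLINE MAINTENANCE IN THE COMMUNICATION NETWORK"
print(extract_phrases(title, [("a", "in")]))
```
]]></preformat>
        <p>With the marker pair from the example, the sketch returns the single phrase “method for implementing online maintenance”. Note that the preposition “for” stays inside the phrase because only the defined pair acts as a boundary.</p>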
        <sec id="sec-2-3-1">
          <title>2.3 Search Process</title>
          <p>Our group performed various experiments for the prior art candidate search task. Given
a topic file, the goal was to identify those patents that describe prior art. As already
mentioned, a topic file is a patent that is assigned the code “A1” or “A2” [1]. On the
basis of these documents, the query was constructed automatically from the content of
different patent parts. These included the following:</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <list list-type="bullet">
          <list-item><p>Title</p></list-item>
          <list-item><p>Abstract</p></list-item>
          <list-item><p>Claim</p></list-item>
          <list-item><p>Description</p></list-item>
          <list-item><p>IPC</p></list-item>
        </list>
        <p>At CLEF-IP 2011, we experimented with the following query modifications:</p>
        <list list-type="order">
          <list-item><p>Boolean queries consisting of terms (runs 1-5)</p></list-item>
          <list-item><p>Queries consisting of linguistic phrases (run 6)</p></list-item>
          <list-item><p>Combined queries consisting of terms and phrases (run 7)</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-5">
        <p><sup>6</sup> http://uima.apache.org/</p>
        <p><sup>7</sup> http://incubator.apache.org/opennlp/</p>
        <p>The University of Hildesheim and Chemnitz University of Technology submitted
seven different runs. Each experiment made use of the same index file, but differed
with respect to the query options. A detailed description of these runs as well as an
overview of the results can be found in the next two sections.</p>
        <sec id="sec-2-5-1">
          <title>3.1 Submitted Runs</title>
          <p>
            Our group performed four multilingual runs (1-4) as well as three monolingual English
runs (5-7) within the prior art candidate search task. An overview of the experimental
settings is given below.
          </p>
          <list list-type="order">
            <list-item><p>CUT_UHI_CLEFIP_BOW: query terms extracted from abstract, claims and title</p></list-item>
            <list-item><p>CUT_UHI_CLEFIP_BOW_DESC: query terms extracted from abstract, claims, title and description</p></list-item>
            <list-item><p>CUT_UHI_CLEFIP_BOW_DESC_IPCR: query terms extracted from abstract, claims, title and description, IPC</p></list-item>
            <list-item><p>CUT_UHI_CLEFIP_BOW_IPCR: query terms extracted from abstract, title and claims, IPC</p></list-item>
            <list-item><p>CUT_UHI_CLEFIP_BOW_EN_ABSTRACT: query terms extracted from abstract and title</p></list-item>
            <list-item><p>CUT_UHI_CLEFIP_BOW_EN_P: linguistic phrases extracted from abstract and terms of title</p></list-item>
            <list-item><p>CUT_UHI_CLEFIP_BOW_EN_P_ABSTRACT: linguistic phrases extracted from abstract and terms extracted from abstract and title</p></list-item>
          </list>
        </sec>
        <sec id="sec-2-5-4">
          <p>As can be seen, only the third experiment made use of all patent sections, while the other
runs were restricted to selected fields. For example, the first run was restricted to
abstract, claims and title. We did not use the whole IPC code, but included the first
four digits only. Although the complete classification information proved to be more
accurate, we decided to use only the first four digits, because this significantly
accelerated the search process.</p>
          <p>
            Our runs were divided into two major categories. The first group of experiments (runs 1-4)
concentrated on English, French and German and was performed to investigate the
effect of very long queries. In contrast, the second group of runs (5-7) made use of
English terms and phrases only and aimed at investigating the effect of short but
precise queries. Phrases as well as terms were combined into Boolean queries using
the operator OR. Independent of its type, each query was run against the
language-specific index field. Thus, an English query was searched within the index field
containing English content. These two approaches reflect two very distinct
perspectives on prior art search.
          </p>
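          <p>To make the query construction more concrete, the following Python sketch combines terms, phrases and truncated IPC codes into one Boolean OR query string. The output follows the general shape of the classic Lucene query parser syntax (field:term, quoted phrases, OR). The function name, the field names text_en and ipc, and the sample IPC code are assumptions made for this illustration; they are not taken from our actual Xtrieval configuration, and escaping of special characters is omitted.</p>
          <preformat><![CDATA[
```python
def build_or_query(terms, phrases=(), ipc_codes=(), field="text_en", ipc_field="ipc"):
    """Join terms, phrases and IPC prefixes into one Boolean OR query.

    Field names are hypothetical. IPC codes are cut to their first four
    characters, mirroring the truncation described in the text.
    """
    clauses = [field + ":" + term for term in terms]
    clauses += [field + ':"' + phrase + '"' for phrase in phrases]
    clauses += [ipc_field + ":" + code[:4] for code in ipc_codes]
    return " OR ".join(clauses)


query = build_or_query(
    terms=["method", "network"],
    phrases=["online maintenance"],
    ipc_codes=["H04L 12/24"],  # sample code, chosen for illustration only
)
print(query)
```
]]></preformat>
          <p>A long bag-of-words query of the first group would simply pass many more entries in terms, whereas a phrase run of the second group would rely mainly on the phrases argument.</p>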
          <p>The results revealed that the first query strategy, which was based on using as many
terms as possible, proved to be more effective: our best run achieved a MAP
of 0.0914 (run 3). In this case, the query was formulated using terms extracted from
all patent fields. Furthermore, the experiment that concentrated on claims, title and
abstract (run 1) proved to be effective (MAP 0.0824). Surprisingly, the recall of this run
was similar to that of the fourth experiment (0.4318), which used the IPC in addition. This
indicates that the classification codes do not have any advantage over the title,
abstract and claims. In contrast, the run that additionally concentrated on the
description (run 2) achieved a lower recall (0.3993). Following this, we could
summarize that the detailed description seems to be more advantageous with respect
to the precision of a patent retrieval system, while the remaining patent sections
have a positive effect on the completeness of the search results. Some statistics
on the obtained results are provided in Table 1.</p>
          <p>The results in Table 1 further indicate that using linguistic phrases did not increase the
effectiveness of the retrieval system. Instead, the results of this experiment (recall 0.1899;
MAP 0.0208) fell significantly below the recall and MAP values of the remaining runs.
This is quite surprising, because phrases, in general, are considered to be more
effective than terms [7]. One reason for the negative results of this experiment might
be the small number of phrases that were extracted from the abstract. For example,
only three linguistic phrases were extracted from topic EP-1226990 with the help of
the extraction tool. The enrichment of phrases with additional terms (run 6) led to
higher recall and MAP values. This could be a hint that our phrase queries were too
short.</p>
          <p>Furthermore, the abstract might not be the most suitable patent section for constructing
phrase queries because of the existence of noisy terms [10]. As can be seen in
Table 1, integrating the terms from the detailed description (runs 2 and 3) achieved the
best results. Therefore, this patent section might be a better basis for generating
phrase queries.</p>
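          <p>For readers less familiar with the two measures discussed in this section, the following Python sketch shows how recall and (mean) average precision are commonly computed from a ranked result list. It uses the standard textbook definitions and is not the evaluation code of the track; the function names and document identifiers are hypothetical.</p>
          <preformat><![CDATA[
```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank holding a relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall(ranked_ids, relevant_ids):
    """Fraction of the relevant documents that appear in the result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids)) / len(relevant)


def mean_average_precision(topics):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)
```
]]></preformat>
          <p>Because average precision rewards relevant documents at early ranks, a run can reach a recall similar to another run and still differ clearly in MAP, which is the pattern observed between some of the runs above.</p>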
        </sec>
        <sec id="sec-2-5-5">
          <title>3.2 Influence of query length</title>
          <p>Besides the retrieval effectiveness, the duration of each experiment and the query
length were measured. Some statistics on these aspects are displayed in the following
table.</p>
          <p>Although it seems to be quite effective, the use of very long queries has one
disadvantage: it slows down the search process. As can be seen, the
experiment which concentrated on the description (run 2) ran for about eleven hours. In
a realistic prior art search this might be problematic.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Outlook</title>
      <p>At CLEF 2011, the University of Hildesheim and Chemnitz University of Technology
worked together. Our experiments concentrated on the following two aspects:</p>
      <list list-type="order">
        <list-item><p>The effect of linguistic phrases extracted from patent documents</p></list-item>
        <list-item><p>The effect of very long queries with a maximum number of terms</p></list-item>
      </list>
      <p>The results reveal that very long queries seem to be more effective in the context of
patent information retrieval, but they significantly slow down the search process. In
contrast, short queries constructed from phrases extracted from the abstract
did not show any positive effect on either recall or mean average precision, but they
are advantageous with respect to the run time. This raises the question of how long a
patent searcher is willing to wait for accurate search results.</p>
      <p>In the future, we will have to think about a combined search strategy that takes into
account terms as well as phrases, because both query types seem to have some
advantages. Furthermore, we will have to improve the generation of phrase queries,
because phrases extracted from the abstract did not improve the retrieval
effectiveness. Using a different part of the patent, e.g. the detailed description, might
lead to improvements with respect to recall or MAP.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Piroi, F. (<year>2011</year>): <source>CLEF-IP 2011. Track Guidelines</source>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Graf, E.; Azzopardi, L. (<year>2008</year>): <article-title>A methodology for building a patent test collection for prior art search</article-title>. In: <source>Proceedings of the 2nd International Workshop on Evaluating Information Access (EVIA)</source>, pp. <fpage>60</fpage>-<lpage>71</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Information Retrieval Facility (<year>2011</year>): <article-title>Documents in the CLEF-IP Corpus</article-title>. http://www.ir-facility.org/collection (verified: 05.08.2011)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Kürsten, Jens; Wilhelm, Thomas; Eibl, Maximilian (<year>2008</year>): <article-title>Extensible Retrieval and Evaluation Framework: Xtrieval</article-title>. In: Baumeister, J.; Atzemüller, M. (2008): <source>Proceedings of the LWA - Workshop FGIR</source>, Würzburg, pp. <fpage>107</fpage>-<lpage>110</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Becks, D.; Womser-Hacker, C.; Mandl, T.; Kölle, R. (<year>2010</year>): <article-title>Patent Retrieval Experiments in the Context of the CLEF IP Track 2009</article-title>. In: Peters, C.; Di Nunzio, G.M.; Kurimo, M.; Mandl, T.; Mostefa, D.; Penas, A.; Roda, G. (Eds.): <source>Multilingual Information Access Evaluation I - Text Retrieval Experiments, Proceedings of CLEF 2009</source>, Corfu, Greece. Berlin et al.: Springer, pp. <fpage>491</fpage>-<lpage>496</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Becks, D.; Mandl, Th.; Womser-Hacker, Ch. (<year>2010</year>): <article-title>Phrases or Terms? The Impact of different Query Types</article-title>. In: <source>Working Notes of 11th Workshop of the Cross-Language Evaluation Forum, CLEF 2010</source>, Padua, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Vechtomova, O. (<year>2006</year>): <article-title>Noun Phrases in Interactive Query Expansion and Document Ranking</article-title>. In: <source>Information Retrieval</source>, <volume>9</volume>(<issue>4</issue>), pp. <fpage>399</fpage>-<lpage>420</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Jaene, H.; Seelbach, D. (<year>1975</year>): <source>Maschinelle Extraktion von zusammengesetzten Ausdrücken aus englischen Fachtexten</source>. Berlin, Köln, Frankfurt (Main): Beuth.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Becks, D.; Schulz, J. (<year>2011</year>): <article-title>Domänenübergreifende Phrasenextraktion mithilfe einer lexikonunabhängigen Analysekomponente</article-title>. In: Griesbaum, J.; Mandl, Th.; Womser-Hacker, Ch. (Eds.): <source>Information und Wissen: global, sozial und frei?</source>, Schriften zur Informationswissenschaft, Band <volume>58</volume>, Boizenburg: Werner Hülsbusch, pp. <fpage>388</fpage>-<lpage>392</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Jochim, C.; Lioma, C.; Schütze, H. (<year>2010</year>): <article-title>Preliminary Study into Query Translation for Patent Retrieval</article-title>. In: <source>Proceedings of the PaIR'10 Workshop</source>, Toronto, Ontario, Canada, pp. <fpage>57</fpage>-<lpage>66</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>