<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prior Art Search using International Patent Classification Codes and All-Claims-Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gyorgy Szarvas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Herbert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Gurevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>UKP Lab, Technische Universität Darmstadt</institution>
          ,
          <country country="DE">Germany</country>
          ,
          <ext-link ext-link-type="uri" xlink:href="http://www.ukp.tu-darmstadt.de">http://www.ukp.tu-darmstadt.de</ext-link>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this study, we describe our system at the Intellectual Property track of the 2009 Cross-Language Evaluation Forum campaign (CLEF-IP). The CLEF-IP track addressed prior art search for patent applications. We used the Apache Lucene IR library to conduct experiments with the traditional TF-IDF-based ranking approach, indexing both the textual content of each patent and the IPC codes assigned to each document. We formulated our queries by using all claims and the title of a patent application in order to measure the (weighted) lexical overlap between topics and prior art candidates. We also formulated a language-independent query using the IPC codes of a document to improve the coverage and to obtain a more accurate ranking of candidates. Additionally, we used the IPC taxonomy (the categories and their short descriptive texts) to create a Concept Based Query Expansion [14] model for measuring the semantic overlap between topics and prior art candidates, and tried to incorporate this information into our system's ranking process. Probably due to the insufficient length of the definition texts in the IPC taxonomy (used to define the concept mapping of our model), incorporating the concept-based similarity measure did not improve our performance and was thus excluded from the final submission. Using the extended boolean vector space model of Lucene, our system remained efficient and still yielded fair performance: it achieved the 6th best Mean Average Precision score out of 14 participating systems on 500 topics, and the 4th best score out of 9 participants on 10,000 topics.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
      </kwd-group>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The CLEF-IP 2009 track was organized by Matrixware and the Information Retrieval Facility.
The goal of this track was to investigate the application of IR methods to patent retrieval. The
task was to perform prior art search, a special type of search whose goal is to verify the
originality of a patent. If another patent or document is found that already covers a very similar
invention, so that the patent lacks sufficient originality, the patent is no longer valid. In the
case of a patent application, this prevents acceptance. If a patent has already been granted,
an opposing party can render it invalid by citing the prior art they found. Therefore, finding
even a single prior art document can be crucial in the process, as it can directly affect the
decision about patentability, or lead to withdrawal of the patent application.</p>
      <p>Prior art search is usually performed manually at patent offices by experts over millions of
documents. The process often takes several days and requires strict documentation and experienced
professionals. It would be beneficial if IR methods could ease this task or improve the efficacy of
search.</p>
      <p>Major challenges associated with finding prior art are the following:</p>
      <list list-type="bullet">
        <list-item><p>The usage of vocabulary and grammar is not enforced and depends on the authors.</p></list-item>
        <list-item><p>In order to cover a wide field of applications, very general formulations and vague language are often used.</p></list-item>
        <list-item><p>Some authors might try to disguise the information contained in a patent in order to take action later against those who infringe it.</p></list-item>
        <list-item><p>The description of inventions frequently uses new vocabulary, as probably no such thing existed before.</p></list-item>
        <list-item><p>Since patents can be submitted in three different languages even within the European Union, information constituting prior art might be described in a different language than the patent under investigation.</p></list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Dataset &amp; Task</title>
      <p>For the challenge, a collection of 1.9 million patent documents from the European Patent Office
(EPO) was used. The documents in this collection correspond to approximately 1 million
individual patents filed between 1985 and 2000 (thus one patent can have several files, with different
versions/types of information). The patents are in English, German, or French. The
language distribution is not equal: 69% of the patents are English, 23% are German, and 7% are
French. The patents are given in an XML format and supply detailed information, e.g. title,
description, abstract, claims, inventors, classification, etc. For more information about
the dataset see the track web page<sup>1</sup>.</p>
      <p>The main challenge of the track was to find prior art for the given topic documents. Several
tasks were defined: the Main task, where topics corresponded to full patent documents, and the
multilingual tasks, where only the title and claim fields were given in a single language (English,
German, or French) and prior art documents were expected to be retrieved in any of these three
languages.</p>
      <p>Relevance assessments were compiled automatically using the citations to prior art documents
found in the EPO files of the topic patent applications. The training data for the challenge
consisted of 500 topics and relevant prior art. The evaluation was carried out on unseen topic
sets of size 500 (Small), 1,000 (Medium) and 10,000 (XLarge), respectively.</p>
      <p>The main goals of patent retrieval systems are either high recall, as a single result can
invalidate a patent application, or high precision at top ranks, to provide results that require
less manual analysis by a patent expert.</p>
      <p>
        For a more detailed description of the task, participating groups, the dataset and overall results,
please see the challenge description paper: [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p><sup>1</sup> http://www.ir-facility.org/the_irf/clef-ip09-track/data-characteristics</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>
        Patent retrieval has been studied extensively by the IR research community, within the scope
of scientific workshops and patent retrieval shared tasks. In the early 2000s several workshops
were organized at major Computational Linguistics conferences [
        <xref ref-type="bibr" rid="ref11 ref8">11, 8</xref>
        ] devoted to analyzing and
discussing the applicability of IR techniques to patent document collections, and to assessing the
special characteristics of patent data compared to other genres.
      </p>
      <p>
        Since 2003, Patent Information Retrieval has been studied within the scope of the NTCIR
evaluation campaigns, mainly focusing on the retrieval of patents in Japanese and patent abstracts in
English. At NTCIR-3, a patent retrieval task was given. The goal was to build technical surveys
from two years of patent data [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Topics were given as newspaper articles and a memorandum
from a person who is interested in a technical survey about a topic mentioned in the articles.
      </p>
      <p>
        Invalidity Search was addressed at NTCIR-4 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] using a document collection of Japanese patents
published between 1993 and 2002. Using the claims in a topic patent, the task was to find a list of
patents that invalidate the claims in the topic. Additionally, the passages that are in conflict with
the topic had to be found. As a cross-lingual challenge, patents were also partially translated to
English.
      </p>
      <p>
        The same setup was given for the patent retrieval task at NTCIR-5, but with a larger number of
topics. For some topics, passages that invalidate claims for a patent document had to be returned.
More information about the task can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        At NTCIR-6, the first English patent retrieval task was introduced [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The objective of the
English task at NTCIR-6 was Invalidity Search using a collection of English patents from the
US Patent &amp; Trademark Office (USPTO). Topic documents were also patents from the USPTO.
Relevance assessments were compiled using the citations in the topic documents, similarly to the
CLEF-IP 2009 challenge.
      </p>
      <p>
        The 2009 Intellectual Property challenge [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] at the Cross-Language Evaluation Forum
campaign extended the scope of Invalidity Search to all the official languages of the European Patent
Office (i.e. English, German and French) using a collection of over 1 million patents in multiple
languages. Another major difference between the NTCIR-6 and the CLEF-IP task was that here
multiple manually assigned patent classification codes were available for each document, while at
NTCIR-6 all patents were assigned a single label.
      </p>
      <p>
        For the NTCIR-3 Workshop Iwayama et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] indexed nouns, verbs, adjectives and
out-of-dictionary words of patents and performed retrieval using newspaper articles as topics. They
stated that patent retrieval was not significantly different from retrieval on newspaper items for
the technical survey task. This made it promising to rely on the classical retrieval models also for
invalidity search.
      </p>
      <p>
        Fujii (2007) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] employed word-based indexing for invalidity search and used a citation-based
score to improve the ranking of retrieval results. The use of citation information was
not allowed at the CLEF-IP challenge as the same information was used to compile relevance
assessments for the challenge.
      </p>
      <p>
        Subtopic analysis was performed by Takagi et al. (2004) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for invalidity search tasks. By
analyzing and finding subtopics contained in target documents, multiple queries were built. For
each query a search was carried out on the document database, resulting in a list of relevant patent
documents. Unlike Takagi et al. we decided to use single queries for whole topic documents due
to time constraints.
      </p>
      <p>
        Several previous studies aimed to go beyond traditional keyword matching and apply a
semantic retrieval approach to patents. For example, Patent Cafe uses Latent Semantic Analysis
to implement a semantic search engine for patent texts. The recent EU project called PATExpert
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] uses manually crafted ontologies for concept-based retrieval of patents. On the other hand,
to the best of our knowledge, Concept Based Query Expansion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] has not yet been explored in Patent
Retrieval.
      </p>
      <p>
        The main findings of recent evaluation campaigns were that traditional IR methods work
reasonably for patent collections, but the special language used in patent texts and the use of
different terminology might pose problems to keyword-based retrieval. Many studies point out the
importance of exploiting the manually assigned topic labels (i.e. the patent classification codes
assigned to applications by patent experts) for more efficient retrieval. The task overview papers
of the above mentioned evaluation campaigns, the state of the art survey [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of PATExpert and a
recent survey on Patent informatics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] provide a good overview of work related to this study.
      </p>
      <sec id="sec-3-1">
        <title>Our Approach</title>
        <sec id="sec-3-1-1">
          <p>In this section, we discuss our system submitted to the challenge.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Data Characterization</title>
      <p>For most patents, several files were available, corresponding to different versions of the patent (an
application text is subject to change during the evaluation process).</p>
      <p>We decided not to use all the different versions available for a patent, but only the most
up-to-date version, which we considered to contain the most authoritative information.
If a specific field used by our system was missing from that version, we extracted the respective
information from the latest source that included the particular field. In our system, we used only the
information provided under the claims, abstract, description, title and IPC codes fields.</p>
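      <p>The version-selection rule described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code of our system: the function name select_fields, the dictionary representation of a patent version and the example field values are all hypothetical.</p>
      <preformat><![CDATA[
```python
# Hypothetical sketch: each patent version is a dict mapping field names to
# text, and the versions are ordered from oldest to newest.
FIELDS = ("title", "abstract", "description", "claims", "ipc")


def select_fields(versions, fields=FIELDS):
    """Take each field from the most recent version that contains it."""
    selected = {}
    for name in fields:
        # Walk from the newest version back to the oldest.
        for version in reversed(versions):
            value = version.get(name)
            if value:  # skip missing or empty fields
                selected[name] = value
                break
    return selected


versions = [
    {"title": "Lubricant", "abstract": "An oil.", "claims": "1. An oil."},
    {"title": "Lubricating composition", "claims": "1. A composition."},
]
patent = select_fields(versions)
```
]]></preformat>
      <p>In this toy example the title and claims come from the newest version, while the abstract, which is missing there, is taken from the older one.</p>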
      <p>Other, potentially useful sections of patent applications, such as authors or date, have not
been exploited so far.</p>
    </sec>
    <sec id="sec-5">
      <title>Preprocessing</title>
      <p>To create the indices, we employed Lucene and performed the following preprocessing steps:</p>
      <list list-type="bullet">
        <list-item><p>Sentence splitting based on the Java BreakIterator implementation.</p></list-item>
        <list-item><p>Tokenization based on the Java BreakIterator (for the French documents we also used
apostrophes as token boundaries: e.g. d'un was split into d and un).</p></list-item>
        <list-item><p>Stopword removal using manually crafted stopword lists. We started with general purpose
stopword lists containing determiners, pronouns, etc. for each language, and extended
them manually with highly frequent terms. We considered each frequent word (appearing in
several hundred thousand documents) a potential stopword and included it in the list
if we judged it to be a generic term or a domain-specific stopword that is not representative
of the patent content. For example, a large number of patent documents contain words like
figure (used in figure captions and also to refer to the pictures in the text), or invention
(it usually occurred in the first sentence of the documents). Since we lacked the necessary
domain expertise to evaluate each term properly, stopword lists compiled by experts could
easily improve our system to some extent.</p></list-item>
        <list-item><p>For the German language, we applied dictionary-based compound splitting [<xref ref-type="bibr" rid="ref12">12</xref>].</p></list-item>
        <list-item><p>Stemming using the Porter algorithm.</p></list-item>
      </list>
      <p>
        The preprocessing pipeline was set up using the Unstructured Information Management
Architecture (UIMA), a framework for the development of component-based Natural Language
Processing (NLP) applications. We used the DKPro Information Retrieval framework [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which provides
efficient and configurable UIMA components for common NLP and Information Retrieval tasks.
      </p>
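      <p>The tokenization and stopword removal steps can be illustrated with a short Python sketch. It only approximates the behaviour described above: the real pipeline used the Java BreakIterator inside UIMA components, whereas the regular expressions, the function name preprocess and the tiny stopword lists below are assumptions made for this example. Compound splitting and stemming are left out.</p>
      <preformat><![CDATA[
```python
import re

# Tiny illustrative stopword lists (assumed for this example). The lists
# described in the text were much larger and manually compiled.
STOPWORDS = {
    "en": {"the", "a", "of", "and", "figure", "invention"},
    "fr": {"le", "la", "l", "d", "un", "une", "et"},
}

# By default an apostrophe stays inside a token; for French it acts as a
# token boundary, so "d'un" becomes "d" and "un".
TOKEN_DEFAULT = re.compile(r"[^\W_]+(?:'[^\W_]+)*")
TOKEN_FRENCH = re.compile(r"[^\W_]+")


def preprocess(text, lang):
    """Tokenize, lowercase and remove stopwords for one language."""
    pattern = TOKEN_FRENCH if lang == "fr" else TOKEN_DEFAULT
    tokens = [token.lower() for token in pattern.findall(text)]
    stop = STOPWORDS.get(lang, set())
    return [token for token in tokens if token not in stop]


french = preprocess("Composition d'un lubrifiant", "fr")
english = preprocess("The invention relates to a lubricant", "en")
```
]]></preformat>
      <p>With these toy lists, the French example keeps only composition and lubrifiant, because both parts of d'un are stopwords once the apostrophe is treated as a boundary.</p>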
    </sec>
    <sec id="sec-6">
      <title>Retrieval</title>
      <p>The basis of our system is the extended boolean vector space model as implemented by Lucene.
We queried the indices described below and combined the results in a post-processing step in order
to incorporate information from both the text and the IPC codes.</p>
      <p>Indices</p>
      <p>In order to employ Lucene for patent retrieval, we created a separate index for each language using
only fields for the corresponding language. That is, for example, for the German index, only fields
with a language attribute set to DE were used.</p>
      <p>For each patent, we extracted the text of a selection of fields (e.g. title only, title &amp; claim,
claim &amp; abstract &amp; description - limited to n words). The concatenated fields were preprocessed
as described above. For each patent, a single document was added to the Lucene index, together with a
patentNumber field to identify the patent.</p>
      <p>Topic documents were indexed similarly in a separate topic index, in order to have the topic
texts preprocessed in the same manner as the document collection. We created topic indices using
the title and claim fields in each language. All the text in these fields was used to formulate the
queries, without any further filtering. This way our system ranked documents according
to their lexical overlap with the topic patent.</p>
      <p>To exploit the IPC codes assigned to the patents, a separate index was created containing only
the IPC categories of the documents. This information could provide a language-independent
ranking measure of the domain overlap between the query and documents.</p>
      <p>Queries</p>
      <sec id="sec-6-1">
        <p>In this section, we describe how the query is formulated for each topic.</p>
        <p>For the main task, the selected topic documents had their title and claim fields
available in all three languages. Moreover, since these documents were full patent applications,
they contained further fields, optionally in one or more languages, but we did not use any of these
additional fields.</p>
        <p>We created a separate query for each language and ran it against the document collection
index of the corresponding language. Each query contained the whole content of the two
fields mentioned above, with the query terms joined by the OR query operator.</p>
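        <p>A disjunctive query of this kind can be written as a Lucene-style query string. The following Python sketch only builds that string; the function name build_or_query and the field name text are assumptions for illustration, and the example terms stand in for the preprocessed title and claims of a topic.</p>
        <preformat><![CDATA[
```python
def build_or_query(terms, field="text"):
    """Join all topic terms into one disjunctive query string.

    Terms are assumed to be preprocessed already (alphanumeric tokens),
    so no escaping of query syntax characters is done here.
    """
    return " OR ".join(f"{field}:{term}" for term in terms)


# Hypothetical preprocessed title and claim terms of a topic patent.
query = build_or_query(["lubric", "composit", "oil"])
```
]]></preformat>
        <p>Because every term of the title and claims ends up in the query, such strings can become very long. Lucene limits the number of clauses in a boolean query (1024 by default, to our knowledge), so this limit may have to be raised for long claims.</p>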
        <p>For the language-specific tasks, only the title and claim fields of the corresponding language
were made available. We performed the same retrieval step as for the main task, but restricted
the search to the respective language index. For example, for the French subtask, we used only the French
title and claims fields to formulate our query and performed retrieval only on the French document
index.</p>
        <p>To measure the weighted overlap of the IPC codes, a separate query was formulated that
included all IPC codes assigned to the topic document (again, with the query terms OR-ed together).</p>
        <p>Result Fusion</p>
        <p>As a result, our system retrieved three ranked lists of patent documents, one result list for each
of the three language indices. Since the majority of the true positive documents for the training
topics shared at least one full IPC code<sup>5</sup> with the topic patent, we decided to filter our results to
contain only documents that shared an IPC code with the topic. Additionally, we acquired
one result list from the IPC code index. We normalized each list to have a maximum
relevance value of 1.0 for the top ranked document in order to make the scores comparable to each
other.</p>
        <p><sup>5</sup> For example A61K-6/027, corresponding to Preparations for dentistry: Use of non-metallic elements or
compounds thereof, e.g. carbon.</p>
        <p>To prepare our system output for the language-specific subtasks, we added the relevance scores
returned by the IPC and the textual query and ranked the results according to the resulting
relevance score. For the combination of results, we normalized the lists and then used the following
formula:</p>
        <disp-formula><tex-math><![CDATA[\mathrm{Score}(d) = \frac{\mathrm{Score}_{IPC}(d) + \mathrm{Score}_{text}(d)}{2}]]></tex-math></disp-formula>
        <p>For the Main task submission, the three language-specific lists had to be combined in
order to end up with a single ranked list of results. To do this, we took the highest language-specific
result from the three individual lists for each document. That is, each document was
ranked according to its highest relevance score in the Main task submission:</p>
        <disp-formula><tex-math><![CDATA[\mathrm{Score}_{main}(d) = \max\left(\mathrm{Score}_{EN}(d), \mathrm{Score}_{DE}(d), \mathrm{Score}_{FR}(d)\right)]]></tex-math></disp-formula>
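        <p>The fusion steps described above (normalization, averaging of the IPC and text scores, and taking the maximum over the languages) can be sketched in Python. The function names and the score dictionaries are hypothetical, and treating a document that is absent from the IPC list as having an IPC score of 0 is our assumption for this sketch.</p>
        <preformat><![CDATA[
```python
def normalize(scores):
    """Scale scores so that the top ranked document gets 1.0."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return dict(scores)
    return {doc: score / top for doc, score in scores.items()}


def fuse_language(text_scores, ipc_scores):
    """Average the normalized text and IPC scores for one language."""
    text_n, ipc_n = normalize(text_scores), normalize(ipc_scores)
    # Assumption: a document missing from the IPC list contributes 0.
    return {
        doc: (ipc_n.get(doc, 0.0) + score) / 2.0
        for doc, score in text_n.items()
    }


def fuse_main(per_language):
    """Keep, for every document, its highest language-specific score."""
    merged = {}
    for scores in per_language:
        for doc, score in scores.items():
            merged[doc] = max(score, merged.get(doc, 0.0))
    return merged


ipc = {"EP-1": 2.0, "EP-2": 1.0}
en = fuse_language({"EP-1": 4.0, "EP-2": 2.0}, ipc)
de = fuse_language({"EP-2": 3.0}, ipc)
main = fuse_main([en, de])
```
]]></preformat>
        <p>In this example EP-2 scores 0.5 in the English list and 0.75 in the German list, so it enters the combined list with 0.75.</p>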
        <p>Whenever our system retrieved fewer than 1000 individual documents using the procedure described
above, we extended the result list with documents retrieved by the same steps, but applying a
less restrictive IPC code filter. This means that at the end of the list, we included documents
that shared only a higher-level IPC category<sup>6</sup>, but not an exact code, with the topic.</p>
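        <p>The IPC filter and its fallback can be sketched as follows. This is an illustration only: the function names are hypothetical, the higher-level category is approximated by the first four characters of a code (e.g. A61K), and all example codes other than A61K-6/027 are made up.</p>
        <preformat><![CDATA[
```python
def subclass(ipc_code):
    """Return the three top IPC levels (section, class, subclass)."""
    return ipc_code[:4]  # "A61K-6/027" -> "A61K"


def filter_with_fallback(ranked, topic_codes, doc_codes, limit=1000):
    """Keep exact IPC matches first, then append subclass-only matches."""
    topic_codes = set(topic_codes)
    topic_subclasses = {subclass(code) for code in topic_codes}
    exact, relaxed = [], []
    for doc in ranked:  # ranked: document ids, best first
        codes = set(doc_codes.get(doc, ()))
        if codes & topic_codes:
            exact.append(doc)
        elif {subclass(code) for code in codes} & topic_subclasses:
            relaxed.append(doc)
    # Relaxed matches only fill the list up to the limit.
    return (exact + relaxed)[:limit]


doc_codes = {
    "EP-1": ["A61K-6/027"],
    "EP-2": ["A61K-6/02"],
    "EP-3": ["C10M-1/00"],
    "EP-4": ["A61K-6/027", "C08F-2/00"],
}
ranked = ["EP-3", "EP-2", "EP-1", "EP-4"]
result = filter_with_fallback(ranked, ["A61K-6/027"], doc_codes, limit=3)
```
]]></preformat>
        <p>Documents without even a shared subclass (EP-3 here) are dropped entirely, which is the source of the recall loss discussed later.</p>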
        <sec id="sec-6-1-1">
          <title>Experiments and Results</title>
          <p>In this section we present the performance statistics of the system submitted to the CLEF-IP
challenge and report on some additional experiments performed after the submission deadline.
We provide Mean Average Precision (MAP) as the main evaluation metric, in accordance with
the official CLEF-IP evaluation. Since precision at top rank positions is extremely important for
systems that are supposed to assist manual work, like prior art search, we always indicate Precision
at 1 and 10 retrieved documents (P@1 and P@10) for comparison<sup>7</sup>.</p>
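          <p>For reference, the metrics named above can be computed as in the following Python sketch. It uses the standard definitions of precision at k and (mean) average precision; the function names and the toy rankings are ours, and the official CLEF-IP evaluation scripts may differ in details such as the handling of short result lists.</p>
          <preformat><![CDATA[
```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved count as zero.
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: one (ranked list, set of relevant documents) pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "x"}
ap = average_precision(ranked, relevant)
p1 = precision_at_k(ranked, relevant, 1)
p2 = precision_at_k(ranked, relevant, 2)
map_score = mean_average_precision([(ranked, relevant), (["x"], {"x"})])
```
]]></preformat>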
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Challenge submission</title>
      <p>We used the processing pipeline discussed above to extract text from different fields of patent
applications. We experimented with indexing single fields, and some combinations thereof. In
particular, we used only titles, only claims, only description or a combination of title and claims
for indexing.</p>
      <p>As the claims field is the legally important field, we decided to include the whole claims field
in the indices for the submitted system. We used a threshold of 800 words
for the indexed document size. That is, for patents with a short claims field, we added some text
from their abstract or description, to have at least 800 words in the index for each
patent. When the claims field alone was longer than 800 words, we used the whole field. This
way, we tried to provide a more or less uniform-length representation of each document to make
the retrieval results less sensitive to document length. We did not have time during the challenge
timeline to find the text size threshold that gave optimal performance for our system, so this
800-word limit was chosen arbitrarily, motivated by the average size of claims sections.</p>
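      <p>The construction of the indexed text can be sketched as follows. The function name is hypothetical, and the order in which the abstract and the description are used for padding is an assumption of this sketch; the text above only states that some text from these fields was added.</p>
      <preformat><![CDATA[
```python
def build_document_text(claims, abstract, description, min_words=800):
    """Use all claims; pad short claims with abstract and description."""
    words = claims.split()
    for extra in (abstract, description):  # assumed padding order
        missing = min_words - len(words)
        if missing <= 0:
            break
        # Add only as many words as needed to reach the threshold.
        words.extend(extra.split()[:missing])
    return " ".join(words)


short = build_document_text("a b", "c d e", "f g h i", min_words=6)
long = build_document_text("a b c d e f g", "x", "y", min_words=6)
```
]]></preformat>
      <p>Claims longer than the threshold are kept whole, so the threshold is a lower bound on the document size rather than a cut-off.</p>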
      <p>Table 1 shows the MAP, P@1 and P@10 values of the system configurations we tested during
the CLEF-IP challenge development period, for the Main task, on the 500 training topics. These
were: 1) system using IPC-code index only; 2) system using text-based index only; 3) system
using text-based index only, result list filtered for matching IPC code; 4) combination of result
lists of 1) and 2); 5) combination of result lists of 1) and 3).</p>
      <p>The bold line in Table 1 represents our submitted system. The same configuration gave the
best scores on the training topic set for each individual language. Table 2 shows the scores of this
system configuration for each language and the Main task on the 500 training and on the 10,000
evaluation topics.</p>
      <p><sup>6</sup> Here we took into account only the 3 top levels of the IPC hierarchy. For example A61K, corresponding to
Preparations for dentistry.</p>
      <p>
        <sup>7</sup> During system development we always treated all citations as equally relevant documents, so we only present
such evaluation here. For more details and analysis of performance on highly relevant items (e.g. those provided
by the opposition) please see the task description paper [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
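      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Nr.</th><th>Method</th></tr>
          </thead>
          <tbody>
            <tr><td>(1)</td><td>IPC only</td></tr>
            <tr><td>(2)</td><td>Text only</td></tr>
            <tr><td>(3)</td><td>Text only - filtered</td></tr>
            <tr><td>(4)</td><td>IPC and text</td></tr>
            <tr><td>(5)</td><td>IPC and text - filtered</td></tr>
          </tbody>
        </table>
      </table-wrap>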
      <p>We also examined retrieval performance using different document length thresholds. That
is, we extracted the first 400, 800, 1600 or 3200 words of the concatenated claim, abstract and
description fields to see whether more text improves the performance. Table 4 shows the MAP
scores for these experiments. The results show that only a slight improvement could be reached
by using more text for indexing documents. Using 1600 words as the document size threshold, as
suggested by Table 4, would have given a MAP score of 0.1170 for English on the 10k evaluation set,
which is only a marginal improvement over the submitted configuration.</p>
      <p>The best performing configuration we obtained during the submission period included a filtering
step to discard any resulting document that did not share any IPC code with the topic. This
way, retrieval was actually constrained to the cluster of documents that had some overlapping IPC
labeling. A natural idea was to evaluate whether creating a separate index for these clusters (and
thus having in-cluster term weighting schemes and ranking) is beneficial to performance. Results
of this cluster-based retrieval approach are reported in Table 5.</p>
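      <p>To illustrate why in-cluster term weighting can differ from collection-wide weighting, the following Python sketch compares the inverse document frequency of a term over a whole toy collection and over the cluster of documents sharing an IPC code with the topic. The IDF variant, the function names and the toy documents are assumptions made for this example and do not reproduce the exact weighting used by our Lucene indices.</p>
      <preformat><![CDATA[
```python
import math


def idf(term, docs):
    """Inverse document frequency of a term within the given documents."""
    df = sum(1 for tokens in docs if term in tokens)
    return 1.0 + math.log(len(docs) / (df + 1))


# Toy collection: document id -> (IPC categories, preprocessed tokens).
collection = {
    "EP-1": ({"C10M"}, ["lubric", "oil", "additiv"]),
    "EP-2": ({"C10M"}, ["lubric", "greas"]),
    "EP-3": ({"C11D"}, ["deterg", "surfact"]),
    "EP-4": ({"B65D"}, ["contain", "lid"]),
}


def cluster_docs(topic_codes):
    """Token lists of all documents sharing an IPC code with the topic."""
    return [toks for codes, toks in collection.values() if codes & topic_codes]


all_docs = [toks for _, toks in collection.values()]
in_cluster = cluster_docs({"C10M"})
global_idf = idf("lubric", all_docs)     # fairly rare in the collection
cluster_idf = idf("lubric", in_cluster)  # present in every cluster document
```
]]></preformat>
      <p>A term that characterizes the whole cluster receives a lower weight inside it, so the ranking depends more on the terms that distinguish documents within the same IPC category.</p>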
      <p>The best parameter settings of our system (i.e. one using 1600 words as the document length
threshold, 0.6/0.4 weights for the text/IPC indices, respectively, and separate indices for the cluster of
documents with a matching IPC code for each topic; the bold line in Table 5) showed a MAP
score of 0.1243, P@1 of 0.2223 and P@10 of 0.0937 on the 10,000-topic evaluation set,
for English. This is a 0.8 percentage point improvement in MAP compared to our submitted system.</p>
      <table-wrap>
        <label>Table 5</label>
        <table>
          <thead>
            <tr><th>Method</th><th>MAP</th></tr>
          </thead>
          <tbody>
            <tr><td>800 words 0.5/0.5 weights (text/IPC)</td><td>0.1203</td></tr>
            <tr><td>800 words 0.6/0.4 weights (text/IPC)</td><td>0.1223</td></tr>
            <tr><td>1600 words 0.5/0.5 weights (text/IPC)</td><td>0.1202</td></tr>
            <tr><td><bold>1600 words 0.6/0.4 weights (text/IPC)</bold></td><td><bold>0.1252</bold></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In the previous section, we introduced the results we obtained during the challenge timeline,
together with some follow-up experiments. We think our relatively simple approach gave fair
results: our submission ranked 6th out of 14 participating systems on the Small evaluation set
of 500 topics<sup>8</sup> and 4th out of 9 systems on the larger evaluation set of 10,000 topics. Taking
into account that only one participating system achieved remarkably higher MAP scores, and the
simplicity of our system, we find these results promising.</p>
      <p>
        We attribute these promising results to the efficient use of IPC information to enhance
keyword search. Our experiments demonstrated that the several ways we employed IPC codes to
restrict/focus text search (i.e. filtering according to IPC, retrieval based on IPC codes, local search
in the cluster of patents with matching IPC) all improved retrieval performance. We also tried to
incorporate the whole IPC taxonomy to extend the traditional vector space model based retrieval
similarly to Qiu and Frei's [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] Concept Based Query Expansion technique. Unfortunately, this
approach did not improve the performance of our system, most probably due to the very short
descriptive information given in the IPC taxonomy for each category. We think that this
approach would be particularly promising if a version of the taxonomy with a reasonable amount of
descriptive text for its categories were available.
      </p>
      <p>
        We also discussed in detail that during the challenge development period, we made several
arbitrary choices regarding system parameter settings and that (even though we chose reasonably
well performing parameter values) tuning these parameters could have improved the accuracy of
the system to some extent. The limitations of our approach are obvious though:
      </p>
      <list list-type="bullet">
        <list-item><p>First, as our approach mainly measures lexical overlap between the topic patent and prior
art candidates, prior art items that use significantly different vocabulary to describe
their innovations are most probably missed by the system.</p></list-item>
        <list-item><p>Second, without any sophisticated keyword / terminology extraction from the topic claims,
our queries are long and probably contain irrelevant terms that put a burden on the system's
accuracy.</p></list-item>
        <list-item><p>Third, the patent documents provided by the organizers were quite comprehensive,
containing detailed information on inventors, assignees, priority dates etc. Out of these information
types we only used the IPC codes and some of the textual description of patents.</p></list-item>
        <list-item><p>Last, since we made the compromise to search among documents with a matching IPC code
(and only extend to documents with a matching main category when we had an insufficient
number of retrieved documents in the first step), we obviously lost the chance of retrieving
prior art items that have a different IPC classification from the patent being investigated.
We think these patents are possibly the most challenging and important items to find, since
they are more difficult to discover for humans as well.</p></list-item>
      </list>
      <p>
        <sup>8</sup> Since the larger evaluation set included the small one, we consistently reported results on the largest set possible.
For more details about performance statistics on the smaller sets, please see [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Error analysis</title>
      <p>We examined a few topics manually to assess the typical sources of errors produced by our system.
Our findings agree with the claims above. Restricting the search to the cluster with a
matching IPC code reduces the search space to a few thousand documents on average. On
the other hand, our system configuration results in losing 11% of the prior art items entirely (i.e.
those that do not share even a main IPC category with the topic patent) and in very low
chances of retrieving another 10% of the prior art (i.e. those that share only a main IPC category
but not an exact IPC code). Besides these issues, the system is able to retrieve the majority of
the remaining relevant documents within the top ranked 1000 results.</p>
      <p>Poor ranking (that is, having relevant items ranked low in the list) comes from the lack of
selection of meaningful search terms from the topic patents. Since many patents discuss very
similar topics, there are usually a number of documents with substantial overlap, but relevance
is defined more by the presence or absence of a few very specific terms or phrases (we only
consider unigrams for retrieval). This is the most obvious place for improvement regarding our
system.</p>
      <p>A typical example is the topic patent with id EP-1474501<sup>9</sup>. This patent had 16 true positive
(TP) documents in the collection. Out of these, we retrieved 12 and missed 4: 3 that do not share even
a main IPC category with the topic<sup>10</sup> and 1 that shares only the main category<sup>11</sup>. We also got a
TP document top ranked (EP-0767824), which is indeed very similar to the topic: IPC codes and
whole phrases and sentence parts match between them. The rest of the TPs, on the other hand,
were ranked low (below the 100th position). We see the main reason for this in many documents having a
large vocabulary overlap with the topic, but having different technical terms such as materials, names
of chemicals, etc., aspects that really make the difference in a prior art search setting. We think
that an intelligent selection of query terms would have ranked the relevant documents
higher here.</p>
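      <p>One simple form such a query term selection could take is to keep only the topic terms with the highest TF-IDF weight, so that rare technical terms dominate the query. The sketch below is an assumption for illustration only and was not part of our submitted system; the function name, the smoothed IDF formula, and the document frequencies in the usage lines are all hypothetical.</p>
      <preformatted><![CDATA[
```python
import math
from collections import Counter


def select_query_terms(topic_tokens, doc_freq, num_docs, k=10):
    """Return the k topic terms with the highest TF-IDF weight.

    topic_tokens: list of unigrams from the topic patent (e.g. title and claims)
    doc_freq:     mapping from term to the number of collection documents containing it
    num_docs:     total number of documents in the collection
    """
    term_freq = Counter(topic_tokens)

    def weight(term):
        # Smoothed IDF: terms that are rare in the collection get a larger weight.
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        return term_freq[term] * idf

    # Sort by descending weight; break ties alphabetically for a stable result.
    ranked = sorted(term_freq, key=lambda term: (-weight(term), term))
    return ranked[:k]


# Illustrative usage with made-up document frequencies.
tokens = ["lubricating", "composition", "composition", "zinc", "dithiophosphate"]
frequencies = {"lubricating": 500, "composition": 900, "zinc": 50, "dithiophosphate": 5}
print(select_query_terms(tokens, frequencies, num_docs=1000, k=2))
```
]]></preformatted>
      <p>In this toy setting the frequent word "composition" is outranked by the two chemical terms even though it occurs twice in the topic, which is the behaviour one would want for the example discussed above.</p>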
      <p><sup>9</sup> Lubricating compositions / IPC:C10M as main topic.</p>
      <p><sup>10</sup> Having detergent compositions / IPC:C11D; macromolecular compounds obtained by reactions only involving
carbon-to-carbon unsaturated bonds / IPC:C08F; shaping or joining of plastics / IPC:B29C; and containers for
storage or transport of articles or materials / IPC:B65D as their main topics.</p>
      <p><sup>11</sup> Both are on lubricating compositions, but the topic is categorized as a mixture, while the prior art document
is categorized according to its main component (different 4th and 5th level classification).</p>
      <sec id="sec-8-1">
        <title>Conclusion</title>
        <p>In this study, we demonstrated that even a simple Information Retrieval system measuring the
IPC-based and lexical overlap between a topic and prior art candidates works reasonably well:
our system ranks a True Positive (prior art) document first for a little more than 20% of the topics. We
believe that a simple visualization approach, e.g. displaying content in a parallel view highlighting
textual/IPC overlaps, could be an efficient tool for assisting manual prior art search (as performed
at Patent Offices).</p>
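        <p>To make the idea of highlighting textual overlaps concrete, the following sketch marks the words of a candidate text that also occur in the topic. It is a hypothetical illustration, not an existing tool: the function names, the asterisk marker, the minimum word length, and the two example sentences are assumptions.</p>
        <preformatted><![CDATA[
```python
import re


def content_words(text, min_length=4):
    """Return the set of lower-cased alphabetic words with at least min_length letters."""
    return {word for word in re.findall(r"[a-z]+", text.lower()) if len(word) >= min_length}


def shared_terms(topic_text, candidate_text, min_length=4):
    """Return the words that occur in both the topic and the candidate."""
    topic_words = content_words(topic_text, min_length)
    return topic_words.intersection(content_words(candidate_text, min_length))


def highlight(text, terms, marker="*"):
    """Wrap every word of text that is in terms with the given marker."""
    def mark(match):
        word = match.group(0)
        return marker + word + marker if word.lower() in terms else word

    return re.sub(r"[A-Za-z]+", mark, text)


# Illustrative usage with made-up sentences.
topic = "A lubricating composition comprising zinc"
candidate = "Detergent composition with zinc salts"
overlap = shared_terms(topic, candidate)
print(highlight(candidate, overlap))
```
]]></preformatted>
        <p>A parallel view would apply the same marking to both texts and could list shared IPC codes next to them.</p>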
        <p>On the other hand, our experiments conducted within the scope of the CLEF 2009 Intellectual
Property challenge might not provide good insight into the precision of such systems: to our
knowledge, only topics that were actually opposed by third parties were selected for evaluation
(in other words, only patents whose novelty was actually questionable were used for evaluation purposes).
This also emphasizes that our system would probably be usable only for
assisting manual search.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Future work</title>
      <p>As follow-up research, we plan to extend our system in several different ways. We showed that
local and global term weightings behave differently in retrieving prior art documents. A
straightforward extension would therefore be to incorporate both to improve our results further.
Similarly, experimenting with weighting schemes other than the one implemented in Lucene is another
straightforward way to extend our system.</p>
      <p>
        More importantly, we plan to further investigate the possibilities of incorporating semantic
similarity measures into the retrieval process, complementary to lexical overlap. For this, since we
do not have access to an IPC taxonomy with sufficient textual descriptions, we plan to experiment
with the concept based query expansion model using Wikipedia as a source of background
knowledge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for constructing the concept-based text representation.
      </p>
      <sec id="sec-9-1">
        <title>Acknowledgements</title>
        <p>This work was supported by the German Ministry of Education and Research (BMBF) under grant
'Semantics- and Emotion-Based Conversation Management in Customer Support (SIGMUND)',
No. 01ISO8042D, and by the Volkswagen Foundation as part of the Lichtenberg-Professorship
Program under the grant No. I/82806.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bonino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ciaramella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Corno</surname>
          </string-name>
          .
          <article-title>Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics (in press)</article-title>
          .
          <source>World Patent Information</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brügmann</surname>
          </string-name>
          .
          <article-title>Patexpert - state of the art in patent processing</article-title>
          .
          <source>Technical report, ISJB</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          .
          <article-title>Enhancing patent retrieval by citation analysis</article-title>
          .
          <source>In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>793</fpage>–<lpage>794</lpage>
          . ACM New York, NY, USA,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          .
          <article-title>The patent retrieval task in the fourth NTCIR workshop</article-title>
          .
          <source>In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>560</fpage>–<lpage>561</lpage>
          . ACM New York, NY, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          .
          <article-title>Overview of patent retrieval task at NTCIR-5</article-title>
          .
          <source>In Proceedings of the Fifth NTCIR Workshop Meeting</source>
          , pages
          <fpage>269</fpage>–<lpage>277</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          .
          <article-title>Overview of the patent retrieval task at the NTCIR-6 workshop</article-title>
          .
          <source>In Proceedings of NTCIR-6 Workshop Meeting</source>
          , pages
          <fpage>359</fpage>–<lpage>365</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis</article-title>
          .
          <source>In Proceedings of the 20th International Joint Conference on Artificial Intelligence</source>
          , pages
          <fpage>1606</fpage>–<lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          , editors.
          <source>Proceedings of the ACL-2003 Workshop on Patent Corpus Processing</source>
          , Morristown, NJ, USA,
          <year>2003</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Marukawa</surname>
          </string-name>
          .
          <article-title>An empirical study on retrieval models for different document genres: patents and newspaper articles</article-title>
          .
          <source>In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>251</fpage>–<lpage>258</lpage>
          , New York, NY, USA,
          <year>2003</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Iwayama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Takano</surname>
          </string-name>
          .
          <article-title>Overview of patent retrieval task at NTCIR3</article-title>
          .
          <source>In Proceedings of the NTCIR-3 Workshop Meeting</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          and
          <string-name>
            <given-names>MK.</given-names>
            <surname>Leong</surname>
          </string-name>
          .
          <source>Workshop on Patent Retrieval SIGIR 2000 - Workshop Report</source>
          . volume
          <volume>34</volume>
          , pages
          <fpage>28</fpage>–<lpage>30</lpage>
          , New York, NY, USA,
          <year>2000</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Langer</surname>
          </string-name>
          .
          <article-title>Zur Morphologie und Semantik von Nominalkomposita</article-title>
          .
          <source>In Tagungsband der 4. Konferenz zur Verarbeitung natürlicher Sprache</source>
          , pages
          <fpage>83</fpage>–<lpage>97</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>MC.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bernhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ignatova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Mühlhäuser</surname>
          </string-name>
          .
          <article-title>Flexible UIMA Components for Information Retrieval Research</article-title>
          .
          <source>In Proceedings of the LREC 2008 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP'</source>
          , pages
          <fpage>24</fpage>–<lpage>27</lpage>
          , Marrakech, Morocco, May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiu</surname>
          </string-name>
          and
          <string-name>
            <given-names>HP.</given-names>
            <surname>Frei</surname>
          </string-name>
          .
          <article-title>Concept based query expansion</article-title>
          .
          <source>In SIGIR '93: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>160</fpage>–<lpage>169</lpage>
          , New York, NY, USA,
          <year>1993</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Roda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zenz</surname>
          </string-name>
          .
          <article-title>CLEF-IP 2009: retrieval experiments in the Intellectual Property domain</article-title>
          .
          <source>In Working Notes of the 10th Workshop of the Cross Language Evaluation Forum (CLEF)</source>
          , Corfu, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Takaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fujii</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ishikawa</surname>
          </string-name>
          .
          <article-title>Associative Document Retrieval by Query Subtopic Analysis and its Application to Invalidity Patent Search</article-title>
          .
          <source>In Proceedings of the thirteenth ACM international conference on Information and knowledge management</source>
          , pages
          <fpage>399</fpage>–<lpage>405</lpage>
          . ACM New York, NY, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          , S. Brugmann, J.
          <string-name>
            <surname>Codina</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Diallo</surname>
            , E. Escorsa,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Giereth</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kompatsiaris</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Papadopoulos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Pianta</surname>
          </string-name>
          , et al.
          <source>Towards Content-oriented Patent Document Processing. World Patent Information</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <volume>21</volume>
          {
          <fpage>33</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>