<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multiple Retrieval Models and Regression Models for Prior Art Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrice Lopez</string-name>
          <email>lopez@hotmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Romary</string-name>
          <email>laurent.romary@loria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt Universität zu Berlin - Institut für Deutsche Sprache und Linguistik</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <abstract>
        <p>This paper presents the system called PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS) realized at the Humboldt University for the IP track of CLEF 2009. Our approach presents three main characteristics: 1. The usage of multiple retrieval models (KL, Okapi) and term index definitions (lemma, phrase, concept) for the three languages considered in the present track (English, French, German) producing ten different sets of ranked results.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent</kwd>
        <kwd>Prior Art Search</kwd>
        <kwd>Multilinguality</kwd>
        <kwd>Regression models</kwd>
        <kwd>Re-ranking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Our participation in the CLEF IP track was first motivated by our interest in infrastructures
for technical and scientific literature in general. A large collection of patent publications offers an
excellent opportunity for experimentation. With several million documents, such a collection
corresponds, first, to a realistic volume of documents comparable with the largest existing article
repositories. Patents cover multiple technical and scientific domains while providing rich
cross-disciplinary relations. This level of thematic complexity is similar, for instance,
to that of a large-scale repository such as HAL (Hyper Article en Ligne) [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ]. In addition, the European
Patent (EP) publications present a quite unique multilingual dimension, often combining three
languages (English, French and German) in the same publication (title and claims). Finally,
patents can be qualified as extreme examples of noisy, deliberately vague and misleading wording
in the title, abstract and claims, while maintaining relatively standard technical terminology
in the description bodies.
      </p>
      <p>Our second motivation was to experiment with a few fundamental approaches that we consider
central and constant for any technical and scientific collection: first, the exploitation of rich
terminological information and natural language processing techniques; second, the exploitation
of the relations among citations; and third, the exploitation of machine learning for improving
retrieval and classification results. Obviously none of these points is original, but we believe that
their appropriate combination can yield a framework that offers much more than the
sum of these individual parts.</p>
      <p>Last, we consider that the efficient exploitation and dissemination of patent information are
currently not satisfactory. While such services as Google Patent and SumoBrain have certainly
improved this aspect, patent information is still very difficult to access and exploit as technical
documentation. Applicants usually have no interest in disclosing their invention in a manner that
could facilitate its dissemination. The patent document itself is often poorly structured and receives
only a minimal review during its examination, focused on the claims and legal aspects. However,
a clear disclosure of an invention is what the public receives in return for patent monopoly rights.
A patent that is impossible to retrieve is in practice a failure of the patent “contract”. This is not
desirable because it works against the public and against the goal of a patent system, which
is first to motivate and encourage research and innovation. Economic and commercial incentives
are not the only factor for boosting invention and innovation; openly exchanged knowledge is the
source and the lifeblood of research and new ideas. For this purpose, better tools for searching and
discovering technical and scientific information are also desirable for patent information, beyond
pure economic aspects.</p>
    </sec>
    <sec id="sec-2">
      <title>CLEF IP 2009 and the Prior Art Task</title>
      <p>Before starting to describe our system, we introduce some basic definitions and examine in more
detail the task and how it differs from a standard prior art search for patent examination. We also
discuss the main differences between a patent document and traditional documents considered in
usual text retrieval tasks.</p>
      <p>In the following description, the “collection” refers to the data collection of approx. 1.9 million
documents corresponding to 1 million European Patents. This collection represents the prior art.
The “training set” refers to the 500 documents of “training topics” provided with judgements (the
relevant patents to be retrieved). The “validation set” corresponds to a subset of approx. 4000
patents selected by us from the collection, and a “patent topic” refers to the patent for which the
prior art search is done.</p>
      <sec id="sec-2-1">
        <title>Prior Art Searches</title>
        <p>The goal of the CLEF IP 2009 track was to identify in the collection the closest prior art to a
given patent. The evaluation was produced automatically using patent citations introduced during
the official prosecution of this patent application at the European Patent Office (EPO). The list
of patent citations, therefore, gathers the patent citations provided by the applicant himself, the
result of the prior art search performed by the patent examiner, and patent citations introduced
at a later stage of proceedings (examination and possibly opposition).</p>
        <p>The prior art search as implemented in the CLEF IP 2009 track can be considered globally
easier than a real one. The usual starting point of a patent examiner performing a prior art search
is a set of application documents in only one language, with one or more IPC<sup>1</sup> classes and with
a very broad set of claims. In the present task, the topic patents were entirely made of examined
granted patents providing reliable information that normally results from the search phase:
• The ECLA<sup>2</sup> classes. They correspond to a fine-grained classification used for the search, more
precise than the IPC.
• The claims of the granted patents. These final claims are drafted taking into account the
prior art identified in the search report and are often revised to remove clarity issues. In
addition, this final version of the claims is translated into the three official languages of the
EPO by a skilled human translator, making crosslingual IR techniques more reliable.
• A revised description. The first part of the description often acknowledges the most
important document of the prior art which has been identified during the search phase.</p>
      </sec>
      <sec id="sec-2-2">
        <title>The limits of the textual content</title>
        <p>
          The textual content of patent documents is known to be difficult to process with traditional text
processing and text retrieval techniques. As pointed out by [
          <xref ref-type="bibr" rid="ref5">6</xref>
          ], patents often make use of non-standard
terminology, vague terms and legalistic language. The claims are usually written in
a very different style than the description. The description also frequently contains digressions
and general presentations of the technical fields which do not provide any useful information
about the contribution of the patent. A patent also contains non-linguistic material that could
be important: tables, mathematical and chemical formulas, citations, technical drawings, etc. For
so-called drawing-oriented fields (such as mechanics), examiners focus their first attention only on
drawings and we can suppose that any automatic retrieval based only on text will fail.
        </p>
        <p><sup>1</sup> International Patent Classification: a hierarchical classification of approx. 60.000 subdivisions used by all
patent offices and maintained by the WIPO (World Intellectual Property Organization).</p>
        <p><sup>2</sup> European CLAssification: an extension of the IPC corresponding to a hierarchical classification of approx. 135
600 subdivisions, about 66 000 more than the IPC.</p>
        <p>One could therefore challenge the relevance of any standard technical vocabulary for searching patent
documents. However, the description, after a general presentation of the state of the art, illustrates
the claimed “invention” with preferred embodiments which very often use a well-accepted technical
terminology and exhibit a language much more similar to usual scientific and technical literature.</p>
      </sec>
      <sec id="sec-2-3">
        <title>The citation structures</title>
        <p>The patent collection is a very dense network of citations creating a set of interrelations particularly
interesting to exploit during a prior art search. Table 1 gives a quantitative overview of the citation
network we observed for the patent collection<sup>3</sup>. The large majority of patents are continuations
of previous works and previous patents. The citation relations make this development process
visible. Similarly, fundamental patents which open new technology subfields are exceptional but
tend to be cited very frequently in the whole subfield for years.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], for instance, exploits the citation graph of a patent collection to identify patent thickets,
i.e. the patent portfolios of several companies overlapping on a similar technical aspect. While the
author’s goal was to identify pro-competitive technical domains with respect to antitrust regimes,
this work shows that multiply related patents can be inferred from the overall citation network of
a patent collection. A high density of inter-citations between a group of applicants is evidence
of cumulative innovation and multiple blocking patents. If a new patent application belonging to
this patent thicket appears, it is very likely that the most relevant prior art documents are already
present in this patent thicket.
        </p>
        <p>In addition, as drawings are excluded from the present task, the exploitation of citation
relations appears to be the best source of evidence for retrieval in drawing-oriented fields.</p>
        <sec id="sec-2-3-1">
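          <table-wrap id="tab-1">
            <label>Table 1</label>
            <caption>
              <p>Quantitative overview of the citation network observed for the patent collection.</p>
            </caption>
            <table>
              <thead>
                <tr><th></th><th>Category<sup>a</sup></th><th>#</th></tr>
              </thead>
              <tbody>
                <tr><td>Citations</td><td>all</td><td>4 854 280</td></tr>
                <tr><td></td><td>X</td><td>581 853</td></tr>
                <tr><td></td><td>Y</td><td>413 981</td></tr>
                <tr><td></td><td>A</td><td>1 849 251</td></tr>
                <tr><td></td><td>D</td><td>2 019 733</td></tr>
                <tr><td></td><td>other</td><td>198 749</td></tr>
                <tr><td>total EP doc.</td><td>all</td><td>1 082 647</td></tr>
                <tr><td>EP doc. with citation text</td><td>all</td><td>363 494</td></tr>
              </tbody>
            </table>
            <table-wrap-foot>
              <p><sup>a</sup> X means that the cited document taken alone anticipates the claimed invention. Y indicates that the cited
document in combination with other prior art documents covers the claimed invention. A means that the document
is relevant but only partially discloses the claimed invention. Finally, D, which can appear together with the previous
categories, indicates that the document has been cited by the applicant in the original description.</p>
            </table-wrap-foot>
          </table-wrap>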
          <p>The patents cited in the description of a patent document are potentially highly relevant
documents. First, the examiner often acknowledges the applicant’s proposed prior art by adding
this document in the search report (usually as an A document since it is extremely rare that an
applicant himself discloses, by mistake, a “killing” X document in his application). Second, in case
the patent document corresponds to a granted patent, Rule 42(1)(b) EPC requires the applicant
to acknowledge the closest prior art. As a consequence, the closest prior art document, sometimes
an EP document, is frequently present in the description body of a B patent publication.</p>
          <p><sup>3</sup> For comparison, according to Thomson Reuters’ Journal Citation Reports, approximately 60 million citations
were made in more than 7000 journals during the period 1997-2005.</p>
          <p>In fact, we observed that, in the final XL evaluation set, the European Patents cited in the
descriptions represent 8.52% of the expected prior art documents of the final topic set. For the
10,000 topic patents, we found that 4407 cited in their descriptions at least one EP document
from the collection, for a total of 7960 cited EP documents, of which 5305 are relevant. This is
21.66% of the relevant documents of these 4407 topics. With a “run” made only of these cited
patents, a MAP (Mean Average Precision) of 0.2230 is obtained, with a precision at 5 of 0.2315.
This shows the potential contribution of this simple source of relevant patents, although it must be
noted that it is only a partial source since it leaves more than half of the patent topics without
any citations.</p>
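          <p>The two measures quoted above can be made concrete with a minimal sketch. It assumes the standard textbook definitions of average precision and precision at k; the function names and the toy document identifiers are illustrative and are not taken from the track's evaluation tooling.</p>
          <preformat><![CDATA[
```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0


def precision_at(ranked, relevant, k=5):
    """Fraction of the first k positions that hold a relevant document."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def mean_average_precision(topics):
    """topics: list of (ranked_list, relevant_set) pairs, one per topic."""
    if not topics:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Toy topic: three relevant patents, two of them retrieved.
ranked = ["a", "x", "b"]
relevant = {"a", "b", "c"}
print(average_precision(ranked, relevant))
print(precision_at(ranked, relevant, k=5))
```
]]></preformat>
          <p>Because unretrieved relevant documents still count in the denominator, a run built only from cited patents is penalized for every topic whose description cites nothing from the collection.</p>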
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Importance of metadata</title>
        <p>In addition to the application content (text and figures) and the citation information, all patent
publications contain a relatively rich set of well-defined metadata. Traditionally at the EPO, the
basic approach to cope with the volume of the data is, first, to exploit the European patent
classification (ECLA classes) to create restricted search sets and, second, to limit the broad searches
to the titles and abstracts. Once an examiner following this strategy has retrieved a set of
approx. 50 documents, he views each of these documents and performs a careful, non-automated
analysis of the whole content including the description text and the figures. Exploiting the ECLA
classes appears, therefore, as a solid basis for efficiently pruning the search space.</p>
        <p>In addition to the classification metadata, over 30 years of prior art search, the EPO patent
examiners have developed heuristics for finding interesting documents given the application in
hand. Since the means for searching text content are still relatively rudimentary (in 2009, only
a command-line boolean search engine is used at the EPO), these heuristics are often based on
metadata such as the applicant name, the inventor names, other patent office classes or priority
documents.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Multilinguality</title>
        <p>The European Patent documents are by definition highly multilingual. First, each patent is
associated with one of the three official languages of application. In the collection, the language
distribution was the following: 69% for English, 23% for German and 7% for French. This indicates
that all the textual content of a patent will be available at least in this language. Second, granted
patents also contain the title and the claims in the three official languages. These translations are
usually of very high quality because they are made by professional human translators. Crosslingual
retrieval techniques are therefore crucial for patents, not only because the target documents are in
different languages, but also because a patent document often itself provides reliable multilingual
information which makes it possible to create valid queries in each language.</p>
        <p>As it is important not to limit retrieval to the main language of the patent application,
our system needed to deal with different languages for each patent. We thus decided to build
different indexes and different specialized retrieval models for each language and, in a second step,
to merge the different results in order to exploit the benefit of each language.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Overall Description of the System</title>
      <sec id="sec-3-1">
        <title>System architecture</title>
        <p>As explained in the previous section, there is clear evidence that pure text retrieval techniques
are insufficient for coping correctly with patent documents. Our proposal is to combine useful
information from the citation structure and the patent metadata, in particular patent classification
information, as pre and post processing steps of text-based retrieval techniques. In order to
exploit multilinguality and different retrieval approaches, we merged the ranked results of multiple
retrieval models based on machine learning techniques. As illustrated by Figure 1, our system,
called PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS), relies on four main
processing steps:
1. the creation of an initial working set for each patent topic in order to limit the search space,
2. the application of multiple retrieval models (KL divergence, Okapi) using different index
models (English lemma, French lemma, German lemma, English phrases and concepts) for
producing several sets of ranked results,
3. the merging of the different ranked results based on multiple SVM regression models and a
linear combination of the normalized ranking scores,</p>
        <sec id="sec-3-1-1">
          <title>4. a post-ranking based on an SVM regression model.</title>
          <p>Steps 2 and 3 have been designed as generic processing steps that could be reused for any
technical and scientific content, patent or non-patent. Step 2 uses standard text retrieval
techniques. Step 3 uses domain information and standard metadata, such as the language,
that we can find or obtain automatically in any type of document collection. The patent-specific
information is exploited in steps 1 and 4. Before presenting these four steps in more detail,
we now briefly describe how all the patent documents have been parsed and pre-processed.</p>
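          <p>The linear combination of normalized ranking scores mentioned for the merging step can be illustrated with a minimal sketch. It makes two assumptions that are not stated in the text: scores are normalized per run with min-max scaling, and the per-run weights are fixed constants (in the system described here they are derived from SVM regression models). The run names and weight values are hypothetical.</p>
          <preformat><![CDATA[
```python
def normalize(run):
    """Min-max normalize the scores of one run (assumed normalization)."""
    if not run:
        return {}
    lo = min(run.values())
    hi = max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def merge_runs(runs, weights):
    """Weighted sum of normalized scores; returns doc ids, best first.

    runs:    dict run_name -> {doc_id: raw_score}
    weights: dict run_name -> weight (hypothetical constants here)
    """
    combined = {}
    for name, run in runs.items():
        for doc, score in normalize(run).items():
            combined[doc] = combined.get(doc, 0.0) + weights[name] * score
    # Sort by decreasing combined score, ties broken by document id.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


runs = {
    "kl_lemma_en": {"A": -1.0, "B": -2.0, "C": -3.0},
    "okapi_lemma_en": {"B": 10.0, "C": 30.0},
}
weights = {"kl_lemma_en": 0.6, "okapi_lemma_en": 0.4}
print(merge_runs(runs, weights))
```
]]></preformat>
          <p>Normalizing each run before combining matters because KL divergence and Okapi scores live on different scales; without it, one model would dominate the sum regardless of its weight.</p>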
          <p>[Figure 1: System architecture of PATATRAS. Components shown: patent collection; tokenization and POS tagging; phrase extraction; concept tagging; indexes (lemma EN, lemma FR, lemma DE, phrase EN, concepts); patent topic; init; initial working sets; queries (lemmas EN, lemmas FR, lemmas DE, phrases EN, concepts); LEMUR 4.9 (KL divergence, Okapi BM25); ranked results (10); merging; ranked shared results; post-ranking; final ranked results.]</p>
          <p>
2. Representation of all the metadata in a MySQL database, following a comprehensive relational
model - given the heavy processing required on the metadata, and although XML
databases can present some advantages, we needed a very fast and easy-to-optimize database.
4. For all the textual data associated with the patent: rule-based tokenization depending on the
language.
5. For all the textual data associated with the patent: part-of-speech tagging and lemmatization.</p>
          <p>We used our own HMM-based implementation for English and the TreeTagger [14] for
French and German.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Metadata Database</title>
        <p>We performed a basic normalization and cleaning of all inventor and applicant names: particles
and titles were removed from the inventor fields (for instance Professor Dr. Dr. h.c. mult.
Wolfgang Wahlster becomes simply - with all due respect - Wolfgang Wahlster), and business entity
marks (Inc., GmbH, Kabushiki Kaisha) and locations (country names) were removed from applicant
fields. We also stored the citation texts of the patents cited in a description. The database storing
all metadata of the collection and all corresponding indexes, but not the textual content, had a
final size of 2.48 GB.</p>
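        <p>A minimal sketch of this kind of name cleaning is given below. The title, entity-mark and country lists are short hypothetical samples chosen for illustration; the lists actually used are not given in the text, and the applicant names in the example are invented.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical sample lists; the real ones are not given in the paper.
TITLES = {"professor", "prof.", "dr.", "h.c.", "mult."}
ENTITY_MARKS = ["Kabushiki Kaisha", "GmbH", "Inc."]
COUNTRIES = ["Germany", "Japan", "France"]


def normalize_inventor(name):
    """Drop academic titles and particles listed in TITLES."""
    kept = [tok for tok in name.split() if tok.lower() not in TITLES]
    return " ".join(kept)


def normalize_applicant(name):
    """Drop business entity marks and country names, then tidy separators."""
    cleaned = name
    for mark in ENTITY_MARKS + COUNTRIES:
        # Match the mark only as a standalone token, case-insensitively.
        pattern = r"(?<![\w.])" + re.escape(mark) + r"(?!\w)"
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    cleaned = re.sub(r"[\s,()]+", " ", cleaned)
    return cleaned.strip()


print(normalize_inventor("Professor Dr. Dr. h.c. mult. Wolfgang Wahlster"))
print(normalize_applicant("Acme GmbH, Germany"))
print(normalize_applicant("Example Kabushiki Kaisha (Japan)"))
```
]]></preformat>
        <p>Normalized names make the later metadata heuristics (same applicant, common inventor) robust to spelling variants of the same entity.</p>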
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Indexing models</title>
      <sec id="sec-4-1">
        <title>Overview</title>
        <p>
          The following five indexes were built using the Lemur toolkit [
          <xref ref-type="bibr" rid="ref6">7</xref>
          ] (version 4.9):
• For each of the three languages (English, French, German), we built a full index at the lemma
level.
• For English, we created an additional phrase index using phrases as the term definition.
• A crosslingual concept index was finally built using the list of concepts identified in the
textual material for all three languages.
        </p>
        <p>We did not, therefore, index the collection document by document, but rather considered a
”meta-document” corresponding to all the publications related to a patent application. In each
case, the following textual data corresponding to a patent is indexed:
• the last version of the title,
• the first version of the description (Following Article 123(2) EPC, we are sure that any
further publication will not go beyond the scope of the initial version of the description),
• the last version of the abstract,
• the last version of the claims.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Lemma indexes</title>
        <p>Based on the result of the lemmatization, we considered the lemmas present in the textual data
of the publications corresponding to the patents of the collection: title, abstract and claims of the
last available publication, and description of the earliest available publication - so A1, A2 or B1,
in this order. The selection of the lemma as term unit can be viewed as a form of stemming that removes all
inflectional suffixes. Only the lemmas corresponding to open grammatical categories (i.e. noun,
verb, adverb, adjective and number) have been indexed, which can be viewed as applying a
stop-word list in traditional IR.</p>
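        <p>The restriction to open grammatical categories can be sketched as a simple filter over tagged tokens. The tag names below are hypothetical placeholders; the actual tagsets of the HMM tagger and of TreeTagger differ, and lowercasing the lemma is an additional assumption.</p>
        <preformat><![CDATA[
```python
# Hypothetical coarse tag names standing in for the taggers' real tagsets.
OPEN_CLASSES = {"NOUN", "VERB", "ADV", "ADJ", "NUM"}


def index_terms(tagged_tokens):
    """Keep only lemmas of open-class words.

    tagged_tokens: list of (lemma, pos) pairs produced by the tagger.
    """
    return [lemma.lower() for lemma, pos in tagged_tokens if pos in OPEN_CLASSES]


tagged = [
    ("the", "DET"), ("valve", "NOUN"), ("be", "AUX"),
    ("connect", "VERB"), ("to", "ADP"), ("two", "NUM"), ("pipe", "NOUN"),
]
print(index_terms(tagged))
```
]]></preformat>
        <p>Filtering by part of speech removes function words without maintaining a language-specific stop-word list, which is convenient when three languages are indexed.</p>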
      </sec>
      <sec id="sec-4-3">
        <title>Phrase indexes</title>
        <p>For the English content, phrase extraction was performed: based on the part-of-speech tagging
results, all the noun phrases were identified and the Dice coefficient was applied to select the
phrases [16].</p>
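        <p>As an illustration of Dice-based selection, the sketch below scores two-word candidate phrases with the usual Dice coefficient, 2 f(xy) / (f(x) + f(y)). The restriction to two-word candidates, the threshold value and the toy counts are assumptions made for the example; the text does not specify how longer noun phrases or the cut-off were handled.</p>
        <preformat><![CDATA[
```python
def dice(freq_pair, freq_first, freq_second):
    """Dice coefficient of a word pair from its joint and individual counts."""
    return 2.0 * freq_pair / (freq_first + freq_second)


def select_phrases(pair_freq, word_freq, threshold=0.5):
    """Keep candidate pairs whose Dice score reaches the (assumed) threshold."""
    selected = []
    for (first, second), freq in pair_freq.items():
        if dice(freq, word_freq[first], word_freq[second]) >= threshold:
            selected.append((first, second))
    return sorted(selected)


# Toy counts: "fuel cell" is a strong collocation, "new method" is not.
pair_freq = {("fuel", "cell"): 8, ("new", "method"): 1}
word_freq = {"fuel": 10, "cell": 10, "new": 50, "method": 40}
print(select_phrases(pair_freq, word_freq))
```
]]></preformat>
        <p>The coefficient favors pairs whose words mostly occur together, so frequent but loosely associated noun phrases are filtered out of the phrase index.</p>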
      </sec>
      <sec id="sec-4-4">
        <title>Multilingual Terminological database</title>
        <p>
          A multilingual terminological database covering multiple technical and scientific domains and
based on a conceptual model has been created from the following existing “free” resources:
MeSH, UMLS, the Gene Ontology, a subset of WordNet corresponding to the technical domains
as identified in WordNet Domains [
          <xref ref-type="bibr" rid="ref8">9</xref>
          ] with the corresponding entries of WOLF (a free French
WordNet), and a subset of the English, French and German Wikipedia corresponding to technical
and scientific categories [17].
        </p>
        <p>
          The Wikipedia ”dump” XML files were processed with a slightly modified version of Wikiprep
[18] able to extract multilingual relations in addition to usual structure and text information.
As in [
          <xref ref-type="bibr" rid="ref3">4</xref>
          ], we interpreted an article as a concept, the title of an article being the preferred
term and the disambiguation redirections to this article being alternative or variant terms realizing
this concept.
        </p>
        <p>A set of 76 general technical domains was first established, derived from the Dewey
Decimal Classification [3]. For each of the above-mentioned sources, a mapping has been written between
the main categories/domains of the source and the general domains. Two concepts coming from
two different sources were merged if they shared at least one term and one domain.</p>
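        <p>The merging rule can be sketched as follows. The dictionary representation of a concept, the function names and the toy entries are assumptions for illustration; the sketch also ignores the requirement that the two concepts come from different sources, and it applies the rule transitively, which the text does not state.</p>
        <preformat><![CDATA[
```python
def should_merge(a, b):
    """True if two concepts share at least one term and one domain."""
    return bool(a["terms"] & b["terms"]) and bool(a["domains"] & b["domains"])


def merge_concepts(concepts):
    """Merge concepts pairwise until no two remaining concepts qualify."""
    merged = []
    for concept in concepts:
        current = {"terms": set(concept["terms"]),
                   "domains": set(concept["domains"])}
        changed = True
        while changed:  # re-scan: absorbing one concept may enable another merge
            changed = False
            remaining = []
            for other in merged:
                if should_merge(current, other):
                    current["terms"] |= other["terms"]
                    current["domains"] |= other["domains"]
                    changed = True
                else:
                    remaining.append(other)
            merged = remaining
        merged.append(current)
    return merged


concepts = [
    {"terms": {"cell"}, "domains": {"biology"}},
    {"terms": {"cell", "battery cell"}, "domains": {"electronics"}},
    {"terms": {"cell", "cellule"}, "domains": {"biology"}},
]
for c in merge_concepts(concepts):
    print(sorted(c["terms"]), sorted(c["domains"]))
```
]]></preformat>
        <p>Requiring a shared domain in addition to a shared term keeps homonyms from different fields, such as the biological and the electrical "cell", as separate concepts.</p>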
        <p>The resulting database contains:
• approx. 3 million terms (2.6 million for English, 190,000 for German, 140,000 for French),
• 1.4 million concepts (71,000 realized in German and 65,000 realized in French),
• 1 million semantic relations,
• approx. 20,000 fully specified acronyms,
• 123,000 additional “source-specific” categories.</p>
        <p>
          The terminological database relies on a conceptual model [
          <xref ref-type="bibr" rid="ref11">12</xref>
          ] and is currently implemented in
a MySQL database. A web interface has been developed for browsing the terminological database
and for performing basic searches, see Figure 2. For the purpose of the CLEF IP tasks, we use
only the terms and the acronyms information of this terminological resource.
        </p>
        <p>Wikipedia offers a massive source of terminological data (more than 1.5 million terms
for the technical and scientific categories) and was the only multilingual source. However, the
quality of the data is clearly not comparable with a well-designed domain-specific terminology
such as MeSH. In particular, due to overall noisy data, the merging of concepts involving
Wikipedia entries was not satisfactory. In practice, it was not possible to consider as term variants
the disambiguation terms given by Wikipedia, and we merged concepts involving Wikipedia entries
only if the article-level term entry was in common with the other concepts for a common domain.
This work is still ongoing and we expect to improve it in the future.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Concept tagging</title>
        <p>The terms of the terminological database have been used for annotating the textual data of the
whole collection, training and topic sets. A term annotator able to deal with such a large volume
of data has been developed specifically for this track. This annotator can match term variants
following morphological variations. The concept disambiguation was performed on the basis of the
ECLA classes (or by default the IPC classes) of the processed patent. For this purpose, the
upper level of the IPC (corresponding to the IPC “classes” as defined in the IPC, i.e. the first
three characters of the classes/subclasses/groups/subgroups) has been mapped to the common
above-mentioned 76 domains. Given the design of the terminological database, a term used in a
given domain corresponds to a single concept. When an IPC class corresponds to several domains
(for instance the class G10 is used for musical instruments, which can correspond to the domains
acoustics and electronics), or when a term corresponds to several concepts in the same domain (for
instance engine in the computing field, which could be a layout engine or a search engine), a term
cannot always be disambiguated on the basis of the IPC class. In this case, we have decided in
the present implementation not to attempt any further concept selection and to skip this term.</p>
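        <p>The disambiguation rule, including the decision to skip ambiguous terms, can be sketched as below. The data structures, the function name and the toy mappings (other than the G10 example from the text) are assumptions for illustration only.</p>
        <preformat><![CDATA[
```python
def disambiguate(term, ipc_class, class_to_domains, term_concepts):
    """Return the single concept of `term` compatible with the IPC class.

    class_to_domains: dict IPC class (three characters) -> set of domains
    term_concepts:    dict term -> list of (concept, domain) pairs
    Returns None when zero or several concepts remain (the term is skipped).
    """
    domains = class_to_domains.get(ipc_class, set())
    candidates = {concept
                  for concept, domain in term_concepts.get(term, [])
                  if domain in domains}
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


# Hypothetical toy mappings.
class_to_domains = {
    "G10": {"acoustics", "electronics"},
    "G06": {"computing"},
    "F02": {"mechanics"},
}
term_concepts = {
    "engine": [("layout engine", "computing"),
               ("search engine", "computing"),
               ("combustion engine", "mechanics")],
    "amplifier": [("audio amplifier", "electronics")],
}
print(disambiguate("engine", "F02", class_to_domains, term_concepts))
print(disambiguate("engine", "G06", class_to_domains, term_concepts))
print(disambiguate("amplifier", "G10", class_to_domains, term_concepts))
```
]]></preformat>
        <p>Skipping unresolved terms trades recall of the concept index for precision: a wrongly selected concept would inject noise into every query and document representation that contains the term.</p>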
      </sec>
      <sec id="sec-4-6">
        <title>Retrieval models</title>
        <sec id="sec-4-6-1">
          <p>We used the following two well-known retrieval models:
• Unigram language model with KL-divergence and Jelinek-Mercer smoothing (λ = 0.4),
• Okapi weighting function BM25 (K1 = 1.5, b = 1.5, K3 = 3).</p>
          <p>The two models have been used with each of the previous five indexes, resulting in the production
of 10 lists of retrieval results for each topic patent.</p>
          <p>In each case the query was built from all the available textual data of a topic patent and
processed in the same way as the whole collection, in order to create one lemma-based query per
language, one query based on English phrases and one query based on concepts (independent by
definition from the language). The query for a given model is, therefore, a representation of the
whole textual content of the patent. The query is exactly the same representation as the one of
a document indexed in the collection. The scoring model used for retrieval corresponds, in this
context, to a distance between two “documents”. In the framework of language-model-based IR,
KL divergence is typically used for evaluating the distance between two documents, i.e. between
two unigram probabilistic distributions. While Okapi BM25 is usually used for ad hoc retrieval, it
is also known as a reliable scoring function for evaluating the distance between documents. The
drawback of using a whole document representation as query is the processing time, which is always
related to the size of the query. However, for the present work, we did not consider processing time
as an issue, as long as the whole set of patent topics could be processed in the track timeframe. For
both retrieval methods, we used neither query expansion nor pseudo-relevance feedback, because
these two techniques did not appear effective during our first experiments.</p>
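          <p>To make the document-to-document scoring concrete, the sketch below computes the negative KL divergence between a query model and a Jelinek-Mercer smoothed document model. It is not the Lemur implementation: it assumes that λ weights the collection model, it skips query terms absent from the collection, and the function name and toy data are illustrative.</p>
          <preformat><![CDATA[
```python
import math
from collections import Counter


def kl_score(query_tokens, doc_tokens, collection_freq, collection_size, lam=0.4):
    """Negative KL divergence between query and smoothed document models.

    p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)   (assumed convention)
    Higher scores mean the document is closer to the query.
    """
    query = Counter(query_tokens)
    doc = Counter(doc_tokens)
    query_len = sum(query.values())
    doc_len = sum(doc.values())
    score = 0.0
    for word, q_freq in query.items():
        p_coll = collection_freq.get(word, 0) / collection_size
        p_ml = doc[word] / doc_len if doc_len else 0.0
        p_doc = (1.0 - lam) * p_ml + lam * p_coll
        if p_doc == 0.0:
            continue  # term unseen in the whole collection: skipped
        p_query = q_freq / query_len
        score -= p_query * math.log(p_query / p_doc)
    return score


collection_freq = {"fuel": 2, "cell": 2, "membrane": 2,
                   "engine": 1, "valve": 1, "piston": 1}
query = ["fuel", "cell", "membrane"]
print(kl_score(query, ["fuel", "cell", "membrane"], collection_freq, 9))
print(kl_score(query, ["engine", "valve", "piston"], collection_freq, 9))
```
]]></preformat>
          <p>Since the loop runs over every distinct query term, the cost grows with the query size, which is why using a whole patent as the query is expensive.</p>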
          <p>
            The retrieval processes were based on the Lemur toolkit [
            <xref ref-type="bibr" rid="ref6">7</xref>
            ], version 4.9. The baseline results
of the different indexes and retrieval models are presented in Table 2, column (1). These results
correspond to the application of the retrieval models to the whole collection.
          </p>
        </sec>
        <sec id="sec-4-6-2">
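          <table-wrap id="tab-2">
            <label>Table 2</label>
            <table>
              <thead>
                <tr><th>Model</th><th>Index</th><th>Language</th></tr>
              </thead>
              <tbody>
                <tr><td>KL</td><td>lemma</td><td>EN</td></tr>
                <tr><td>KL</td><td>lemma</td><td>FR</td></tr>
                <tr><td>KL</td><td>lemma</td><td>DE</td></tr>
                <tr><td>KL</td><td>phrase</td><td>EN</td></tr>
                <tr><td>KL</td><td>concept</td><td>all</td></tr>
                <tr><td>Okapi</td><td>lemma</td><td>EN</td></tr>
                <tr><td>Okapi</td><td>lemma</td><td>FR</td></tr>
                <tr><td>Okapi</td><td>lemma</td><td>DE</td></tr>
                <tr><td>Okapi</td><td>phrase</td><td>EN</td></tr>
                <tr><td>Okapi</td><td>concept</td><td>all</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-4-6-3">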
          <p>We can observe that for the same query and the same index, KL divergence always provided
better results than Okapi. The best result is obtained with KL divergence with the English lemma
representation. The fact that concept-based and phrase-based retrieval perform worse than the
monolingual English lemma representation can appear disappointing given the effort needed to
implement them. However, it is consistent with previous works which have noted the information
loss implied by pure conceptual representations as compared to simple stem-based retrieval.</p>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <title>Citation texts</title>
        <p>
          The citation texts of a target patent are all the paragraphs in other patent documents that refer
to it. Extending the content of a patent with its citation texts aims at providing more textual
descriptions corresponding to the important contributions of the patent according to the other
patent applicants. Following a similar approach, [
          <xref ref-type="bibr" rid="ref9">10</xref>
          ], for instance, tried to exploit the citation
texts in order to improve the semantic interpretation and the retrieval of text for biomedical
articles. Moreover, in the case of patent documents, since part of the citation text can be written
in a language other than that of the target patent, it possibly increases the multilingual description
of the cited patents. While for technical and scientific articles the citation text is usually just
a sentence, citation texts for patents appear to be consistently a whole paragraph.
Therefore, for each patent document present and cited in the collection, the entire paragraph of
citation was appended to the textual material of the cited patent.
        </p>
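        <p>The extension of a cited patent with its citation paragraphs can be sketched as follows. The input structures, the function name and the patent identifiers are assumptions for illustration; in the system described here the citation texts are stored in the metadata database.</p>
        <preformat><![CDATA[
```python
def append_citation_texts(documents, citations):
    """Append each citing paragraph to the text of the cited patent.

    documents: dict patent_id -> textual material to be indexed
    citations: iterable of (citing_id, cited_id, paragraph) triples
    Citations of patents absent from the collection are ignored.
    """
    extended = dict(documents)
    for citing_id, cited_id, paragraph in citations:
        if cited_id in extended:
            extended[cited_id] = extended[cited_id] + "\n" + paragraph
    return extended


documents = {"EP-A": "A valve assembly.", "EP-B": "A fuel pump."}
citations = [
    ("EP-B", "EP-A", "EP-A discloses a valve with a spring-loaded seat."),
    ("EP-B", "EP-X", "EP-X is not part of the collection."),
]
extended = append_citation_texts(documents, citations)
print(extended["EP-A"])
```
]]></preformat>
        <p>Only the cited document grows; the citing document is left unchanged, so the added text reflects how other applicants describe the cited invention.</p>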
        <p>Table 2 presents the impact of adding the citation text, see column (2). The improvement
is low and statistically not significant. We think, however, that this result is encouraging because
it was obtained with a very limited number of citation texts. Only citations of an EP document
were considered here. With a complete collection or patent family information, it would be
possible to extract many more citation texts (most likely by a factor of five to ten) and we could
expect the improvement to be more significant.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Creation of initial working set</title>
      <p>For each topic patent, we created a prior working set in order to reduce the search space and the effect
of term polysemy. The goal here is, for a given topic patent, to select the smallest set of patents
which has the best chance of containing all the relevant documents. A set Sp of patents for a given
patent p is created by successively applying the following steps:
1. Put in Sp all the patents cited in the description of the topic patent (as identified in step 3
of the document parsing) and present in the collection.
2. Up the citation tree: All patents citing at least one patent of Sp are added to Sp.
3. Down the citation tree: All patents cited by at least one patent of Sp are added to Sp (steps
2 and 3 were performed a second time after step 5).
4. Priority dependencies: All patents having a priority document in common with p or with
a patent in Sp are added to Sp. All patents citing a priority document of a patent in Sp
are added to Sp. This step gathers non-unitary and divisional patent applications
(i.e. a single patent application possibly containing different independent inventions, which
results in several parallel applications) and exploits the partial information present in the
collection about patent families.
5. All patents having the same applicant as the topic patent and at least one common inventor
are added to Sp.
6. All patents belonging to one of the ECLA classes of p (if at least one ECLA class is available)
are added to Sp.
7. All patents belonging to the ECLA classes most frequently co-occurring in the collection
with the ECLA classes of p are added to Sp.
8. If the working set is below a given limit: All patents having the same applicant and belonging
to one of the IPC classes of p are added to Sp.
9. If the working set is still below a given limit: All patents belonging to one of the IPC classes
of p are added to Sp.</p>
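      <p>As an illustration only, the citation-based steps 1 to 3 above can be sketched as a set expansion over a citation graph. The function name, the dictionary-based graph and the patent identifiers are hypothetical and are not taken from the PATATRAS implementation; the remaining steps (priority documents, applicant, ECLA and IPC classes) would add to the same set in a similar way.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def expand_by_citations(seed, cites):
    """Steps 1 to 3: start from the patents cited in the topic description,
    then go once up and once down the citation tree.

    seed  -- patents cited in the description of the topic patent (step 1)
    cites -- dict mapping a patent id to the set of patent ids it cites
    """
    # Reverse index: patent -> set of patents citing it.
    cited_by = defaultdict(set)
    for source, targets in cites.items():
        for target in targets:
            cited_by[target].add(source)

    working = set(seed)                       # step 1

    up = set()
    for patent in working:                    # step 2: patents citing a member
        up |= cited_by.get(patent, set())
    working |= up

    down = set()
    for patent in working:                    # step 3: patents cited by a member
        down |= cites.get(patent, set())
    working |= down
    return working


# Toy citation graph with hypothetical identifiers.
cites = {
    "EP-A": {"EP-B"},
    "EP-C": {"EP-B", "EP-D"},
    "EP-E": {"EP-A"},
}
print(sorted(expand_by_citations({"EP-B"}, cites)))
```
]]></preformat>
      <p>In this toy graph EP-E is not reached, because it cites only a patent that was itself added in step 2; a second pass over steps 2 and 3 would add it.</p>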
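      <p>[Table 3: increase of the micro recall and total size of the working sets after each step. Rows:
1. cited patents; 2. up the citation tree; 3. down the citation tree; 4. common priority document;
5. same applicant/inventor; 2+3. second iteration; 6. same ECLA classes; 7. most frequently co-occurring ECLA classes;
8. same applicant and IPC; 9. same IPC; total relevant.]</p>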
      <p>At the end of this process, if the size of the working set is too small or too large, no working set
is used and the retrieval is performed on the whole collection. In the submitted runs, the lower
limit was 10 documents and the upper limit 10 000. Table 3 presents the performance of each
step in terms of the increase of the micro recall, i.e. the coverage of all the relevant documents of the whole
set of topics. Note that the recall reported in the official CLEF IP 2009 evaluation summary is
the macro recall, i.e. the average of the recall obtained for each topic patent. The last column gives
the sum of the number of documents for all working sets.</p>
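      <p>The difference between the two measures can be made concrete with a small sketch. The function names and the toy data are assumptions made for illustration, and every topic is assumed to have at least one relevant document.</p>
      <preformat><![CDATA[
```python
def micro_recall(relevant, retrieved):
    """Share of all relevant documents, pooled over all topics, that were found."""
    found = sum(len(relevant[t] & retrieved.get(t, set())) for t in relevant)
    total = sum(len(relevant[t]) for t in relevant)
    return found / total


def macro_recall(relevant, retrieved):
    """Average of the recall values computed separately for each topic."""
    per_topic = [
        len(relevant[t] & retrieved.get(t, set())) / len(relevant[t])
        for t in relevant
    ]
    return sum(per_topic) / len(per_topic)


# Toy example: one topic with many relevant documents, one with a single one.
relevant = {"t1": {"a", "b", "c", "d"}, "t2": {"e"}}
retrieved = {"t1": {"a", "x"}, "t2": {"e", "y"}}

print(micro_recall(relevant, retrieved))   # 2 found out of 5 relevant
print(macro_recall(relevant, retrieved))   # mean of 1/4 and 1/1
```
]]></preformat>
      <p>Topics with many relevant documents weigh more in the micro recall, whereas every topic counts equally in the macro recall, so the two figures are not directly comparable.</p>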
      <p>The micro recall of the final run was 0.6985, and thus lower than that of the final working sets.
This difference comes from the working sets that contain more than 1000 patents and
for which the final processing was not able to place all the relevant documents among the first 1000
results. These "missed" patents are the most difficult documents to process: the system
failed to rank them on the basis of both the textual content and the metadata, even with an initial
working set. Since many working sets contained fewer than 1000 documents, and sometimes fewer than
100 documents, the list of results of the final run was frequently shorter than 1000. Similarly, some
working sets were particularly large, sometimes more than 10 000 documents, but all final runs
were cut at 1000. The final number of results was, therefore, approx. 415 documents
per patent topic on average, as compared to approx. 2616 documents on average per initial working set.</p>
      <p>It is also clear that we are missing 26.97% of the expected relevant documents, which is a
relatively high number. Identifying how to capture these documents without expanding
the working sets too much will require further investigation.</p>
      <p>These different steps correspond to typical search strategies used by the patent examiners
themselves for building sources of interesting patents. They capture techniques that have emerged
in patent examination and which are considered to be effective. As the goal of the track is, to a
large extent, to recreate the patent examiner's search reports, recreating such restricted working
sets appears to be a valuable approach. Table 2, columns (3) and (4), shows the improvement obtained by
using the initial working sets instead of the whole collection. The retrieval with working sets was
realized using the "working list" functionality of Lemur in batch mode.</p>
    </sec>
    <sec id="sec-6">
      <title>Merging of results</title>
      <p>In the previous section, we have observed that conceptual and phrase-based retrievals alone were
less effective than the English lemma model. However, several works have shown that conceptual
and phrase representations can improve a word-based model, for instance through semantic smoothing
techniques in the language model framework. In our preliminary results, we observed that the
different retrieval models present a strong potential for complementarity, in particular between
the different languages and between lemma and concept indexes. Table 4 presents the distribution of the
best results over the different models. We can observe that concept-based models provide a high
number of results with a higher MAP than the English lemma model. We can also note that each
language-specific model consistently contributes a set of results that are the best over all other models. Intuitively,
for a topic patent essentially described in the main language of application (in particular the whole
description) and minimally in the two other languages (just the title and the claims), the index
corresponding to the main language should provide better results and should be prioritized.</p>
      <p>[Table 4: distribution of the best results over the different models. Columns: # better than
baseline; # equal; # best overall.]</p>
      <p>Merging multiple results has been used in the context of distributed information retrieval, in
particular with partially overlapping collections. Merging ranked results from different models for
the same collection appears well adapted to a patent collection because the different models exploit
different views, for instance different languages, for retrieving documents in the same collection.
In addition, we are here in an exceptional situation where we can exploit a very large amount
of training data, given the number of citations present in the collection. For training purposes, 500
complete examples were provided with judgements. Moreover, the collection itself could provide
a huge number of examples of prior art results. This uncommon aspect makes a fully
supervised learning method possible. Machine learning algorithms are, therefore, well suited to
weighting the different ranked lists so that, given a query, the most reliable models are prioritized.
The merging of ranked lists of results is here expressed as a regression problem.</p>
      <p>
        The usage of regression models for merging results was described for instance in [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ] and [15].
Merging based on regression usually surpasses other combination methods which do not involve
machine learning, such as the well-known CORI merging algorithm. In [15], the precision
following a merging of results from different search engines was improved by up to 98.9% as compared
to a merging based on the CORI algorithm. In [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ], even with a very limited number of
features used for learning, a merging of retrieval results from different languages based on a logistic
regression significantly surpasses all other score combination approaches. Regression models
appear particularly appropriate in the present case, because they make it possible to adapt the merging on
a query-by-query basis.
      </p>
      <p>For each model m, we trained a linear regression model using as input a set of features inferred
from the query. As a general framework, given a set of examples (x1, y1), ..., (xN, yN), where the
xi are vectors of features and the yi are values corresponding to the dependent variable, the goal
of a linear regression is to find the weight vector w that minimizes the sum of squared errors:
<disp-formula><label>(1)</label><tex-math><![CDATA[\Phi(\vec{w}) = \min_{\vec{w}} \sum_{i=1}^{N} (y_i - \vec{w} \cdot \vec{x_i})^2]]></tex-math></disp-formula></p>
      <p>In the present regression training, the observed MAP was used as the dependent variable
representing the pertinence of a set of results for a given retrieval model.</p>
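      <p>To make the least-squares criterion (1) concrete, the following sketch fits the weight vector by solving the normal equations. It is plain ordinary least squares written for illustration, not the regression implementation used for the runs, and the feature vectors and MAP values in the example are invented.</p>
      <preformat><![CDATA[
```python
def fit_least_squares(xs, ys):
    """Return w minimizing sum_i (y_i - w . x_i)^2.

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan
    elimination with partial pivoting.
    """
    n = len(xs[0])
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(x[r] * x[c] for x in xs) for c in range(n)]
        + [sum(x[r] * y for x, y in zip(xs, ys))]
        for r in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[r][n] / a[r][r] for r in range(n)]


def predict(w, x):
    """Estimated dependent variable (here: expected MAP) for feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Invented examples: [bias term, one query feature] -> observed MAP.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0.1, 0.2, 0.3]
w = fit_least_squares(xs, ys)
print(w, predict(w, [1.0, 3.0]))
```
]]></preformat>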
      <p>To perform the merging, scores were first normalized so that they all lie in the same range.
We used the standard min-max normalization: for each result list, the minimum score is shifted to 0 and the maximum is scaled to
1:
<disp-formula><label>(2)</label><tex-math><![CDATA[w_i = \frac{v_i - v_{min}}{v_{max} - v_{min}}]]></tex-math></disp-formula>
where vi is the raw score of result i, and vmin and vmax are the minimum and maximum raw scores of the result list.
The regression model gives a score cqm for the retrieval model m and for the query q, which is
interpreted as an estimate of the relevance of the results retrieved by m for the query q.</p>
      <p>The merged relevance score sdq for a patent d following a query (here a patent topic) q is
obtained as a linear combination of the normalized scores wmd obtained from each retrieval model
m:
<disp-formula><label>(3)</label><tex-math><![CDATA[s_{dq} = \sum_{m \in M} c_{qm} w_{md}]]></tex-math></disp-formula>
This score is computed for each patent appearing in at least one of the result sets produced by
the different models. The merged result set is built by ranking these patents according to this
new relevance score.</p>
      <p>The key to the supervised learning approach is to exploit as much training data as possible in
order to use a rich set of features while avoiding overfitting. To exploit more training data, we
created from the track collection a supplementary training set of 4 131 patents. We selected patents
citing at least 4 EP documents, with a language distribution and an IPC class distribution similar
to those of the whole collection. We assembled the initial working sets as explained in
section 5, but we filtered out patents whose publication dates were after the priority date of the
considered patent. During our tests for selecting the regression algorithm, we built the merging
models based on these 4 131 patents, which we called the validation set, and tested them on the
"normal" training set of 500 topics. For producing the final official runs, we trained the merging
models on the whole set of 4 631 training patents.</p>
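      <p>As a minimal sketch of the merging step, the min-max normalization (2) and the linear combination (3) can be written as follows. The per-model weights stand for the outputs cqm of the regression models; here they are simply given as constants, and the model names, documents and scores are invented for the example.</p>
      <preformat><![CDATA[
```python
def normalize(scores):
    """Min-max normalization (2): the minimum maps to 0, the maximum to 1."""
    v_min, v_max = min(scores.values()), max(scores.values())
    if v_max == v_min:
        # Assumption: a list with identical scores carries no ranking signal.
        return {doc: 0.0 for doc in scores}
    return {doc: (v - v_min) / (v_max - v_min) for doc, v in scores.items()}


def merge(result_lists, model_weights):
    """Merged score (3): s_dq = sum over models m of c_qm * w_md.

    result_lists  -- dict: model name -> {document id: raw retrieval score}
    model_weights -- dict: model name -> c_qm predicted for the current query
    """
    merged = {}
    for model, scores in result_lists.items():
        c_qm = model_weights[model]
        for doc, w_md in normalize(scores).items():
            merged[doc] = merged.get(doc, 0.0) + c_qm * w_md
    # Rank every document seen in at least one list by its merged score.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Invented result lists for one query and two retrieval models.
result_lists = {
    "en_lemma_kl": {"d1": -4.0, "d2": -6.0, "d3": -8.0},
    "concept_kl": {"d2": 2.0, "d4": 1.0},
}
model_weights = {"en_lemma_kl": 0.2, "concept_kl": 0.15}
ranking = merge(result_lists, model_weights)
print(ranking)
```
]]></preformat>
      <p>A document missing from a result list simply receives no contribution from that model, which matches the statement that the score is computed for every patent appearing in at least one result set.</p>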
      <p>The following set of features was used: (f1) the language of proceedings of the patent topic, (f2)
the size of the query, (f3) the size of the initial working set, (f4-5) the non-normalized minimum
and maximum retrieval scores of the set of results, (f6) the range of the non-normalized scores
of the result set, (f7-8) the main IPC trunk (first character of an IPC class) and the IPC class
(first three characters of an IPC class/group). The average number of words of the phrases was
also used for the results based on the phrase index (f9).</p>
      <p>
        Expressing the merging of ranked results as a regression problem makes it possible to use existing
machine learning software packages, which provide excellent evaluation utilities. We have
experimented with several regression models: least median squared linear regression, SVM regression (SMO
and ν-SVM methods) and multilayer perceptron, using LibSVM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for the ν-SVM regression
method and the WEKA toolkit [19] for the other methods. For the ν-SVM regression method,
we used the methodology described in [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] for setting the parameters, which includes scaling and
cross-validation. SVM in general is known to be sensitive to hyperparameter selection (λ). This
methodology led to a strong improvement as compared to our first random tests and to the SMO
regression.
      </p>
      <sec id="sec-6-1">
        <title>Post-ranking</title>
        <p>[Table header: Features; LeastMedSq; f1; f1-6; all.]</p>
        <p>Post-ranking, or re-ranking in general, is a particularly attractive technique in Information
Retrieval because, first, the usual result of a retrieval model is a weighted n-best list of outputs and,
second, it is much easier to experiment with and optimize a post-ranker for a particular set of features
than to integrate these features in a single model. In the present case, as we used retrieval models designed
for texts, it would be extremely difficult or impossible to modify these methods to include the
features we want, namely to prioritize certain patents based on citation relations and metadata.
The drawback of re-ranking is, of course, that it is limited by the initial model.</p>
        <p>While the previous section focused on learning to merge ranked results, this step aims at
learning to rank. Regression is used here to weight a patent result in a ranked list of patents given
a query. The creation of the initial sets dealt with recall; we now address precision. The
goal of the re-ranking of the final merged result is first of all to boost the score of certain patents:
• Patents initially cited in the description of the topic patent: a boolean feature was used to
indicate whether the retrieved patent was present in the citations.
• Patents having several ECLA and IPC classes in common with the topic patent: two features
were introduced to indicate the number of common ECLA classes and the number of common IPC
classes.
• Patents with a higher probability of citation as observed within the same IPC class and within
the set of results: we introduced two features corresponding to a number between 0 and 1
representing these two probabilities.
• Patents having the same applicant and at least one identical inventor: the following features
were added: a boolean indicating whether the applicant was common to the retrieved patent
and the patent topic, and a number between 0 and 1 indicating the proportion of common
inventors.</p>
        <p>As for the creation of the initial working sets, these features correspond to criteria often
considered by patent examiners when defining their search strategies. Re-ranking based on these
features yields a final result closer to a standard EPO search report and thus increases
the average precision.</p>
        <p>The dependent variable represents the weight adjustment to be applied to a particular patent
result. In the training data, the dependent variable sp for a patent result p was set as follows:
<disp-formula><label>(4)</label><tex-math><![CDATA[s_p = \begin{cases} w_{max} & \text{if } p \text{ is relevant,} \\ 0 & \text{otherwise,} \end{cases}]]></tex-math></disp-formula>
where wmax is the score of the top result in the current result set. For each retrieved patent p
in the result set, the regression model produces a score sp, which is used to re-evaluate the score wp of
p as wp′ = wp(sp + 1). The final run is obtained by re-sorting the list of results and
by applying a cutoff at rank 1000.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Training of the regression model</title>
        <p>The regression model was trained using the normal training set and the additional validation
set presented in the previous section. In order to limit the size of the training data and to avoid
too many useless negative examples, we considered only 20 negative results per patent. The final
run is also based on SVM regression, more precisely the WEKA implementation SMOreg using the
Pearson VII Universal Kernel (PUK) function. We did not evaluate other regression algorithms.</p>
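        <p>The labelling rule (4) and the rescoring wp′ = wp(sp + 1) can be sketched as follows. The regression model itself is left out: the adjustments are passed in as if they had been predicted. How the 20 negative results were chosen is not stated in this paper, so keeping the top-ranked ones is an assumption of this sketch, as are all names and example values.</p>
        <preformat><![CDATA[
```python
def training_targets(result_scores, relevant, max_negatives=20):
    """Dependent variable (4): w_max for a relevant result, 0 otherwise.

    Only max_negatives non-relevant results are kept per topic.
    Assumption: the kept negatives are the top-ranked ones.
    """
    w_max = max(result_scores.values())
    ranked = sorted(result_scores, key=result_scores.get, reverse=True)
    targets, negatives = {}, 0
    for doc in ranked:
        if doc in relevant:
            targets[doc] = w_max
        elif negatives < max_negatives:
            targets[doc] = 0.0
            negatives += 1
    return targets


def rerank(result_scores, adjustments, cutoff=1000):
    """Rescore each result as w' = w * (s + 1), re-sort and cut at `cutoff`."""
    rescored = {
        doc: w * (adjustments.get(doc, 0.0) + 1.0)
        for doc, w in result_scores.items()
    }
    return sorted(rescored.items(), key=lambda item: item[1], reverse=True)[:cutoff]


# Invented merged scores for one topic.
scores = {"d1": 0.9, "d2": 0.8, "d3": 0.5}

# Training side: d2 is relevant, keep a single negative example.
print(training_targets(scores, relevant={"d2"}, max_negatives=1))

# Application side: the model predicts a strong boost for d3.
print(rerank(scores, adjustments={"d3": 0.9}))
```
]]></preformat>
        <p>With sp = 0 a result keeps its merged score, so the post-ranker can only move forward the patents for which the metadata features predict a positive adjustment.</p>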
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Final Results</title>
      <sec id="sec-7-1">
        <title>Main task</title>
        <p>Table 6 summarizes the automatic evaluation obtained for our final runs. We processed the entire
list of queries (the XL set, corresponding to 10 000 patent topics) and, therefore, cover the smaller
bundles (S and M, respectively 500 and 1000 topics). The list of relevant documents distinguishes two
types of documents: relevant documents (patents cited by the applicant and with an A category)
and very relevant patents (all the other cited patents).</p>
        <p>The exploitation of patent metadata and citation information clearly provides a significant
improvement over a retrieval based only on textual data. By exploiting the same metadata as a
patent examiner and combining them with robust text retrieval models via prior working sets and
re-ranking, we created result sets closer to actual search reports. Overall, the exploitation of ECLA
classes and of the patents cited in the descriptions provided the best improvements. In addition,
the different regression models proved to be a very effective way of combining complementary
indexing models. While many models exhibit relatively low individual results, they appear to be
strongly specialized according to discriminating criteria such as the application language or the technical
domain. Similarly, a regression model appears to be an effective technique for re-ranking a list of ranked
results according to heterogeneous features.</p>
        <p>While the main task allows the participants to exploit all the available textual information, the
language tasks limit the usable material to a single language. For the language-specific
tasks we applied an approach similar to that of the main task, but with the following restrictions:
• The patents cited in the description (which are used during the creation of initial working
sets and in the final re-ranking) have been considered only for the main language of a patent
topic;
• The lemma index corresponding to a different language was not used;
• The phrase index was used only for English;
• The queries for concepts were limited to the concepts extracted from the text of a single
language.</p>
        <p>We think that we have ensured, therefore, that for building the query and for the exploitation of
patent metadata, we did not use any elements different from the task’s language.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Hardware and Processing Time</title>
      <p>We used 3 machines with a 2.0 GHz Core 2 Duo processor and 2 GB SDRAM (3 headless Mac Minis) and
a laptop with a similar configuration and 4 GB RAM. In addition, 2 TB of storage were used
for the collection files, the indexes and all the intermediary processed documents. The four machines
ran 64-bit Mac OS 10.5.6. Here are some indications about the processing time:
• The compilation of the terminological database took 28 hours on one machine after a
preprocessing of 135 hours for the Wikipedia XML files of the three languages.
• Tokenization and POS tagging took 55 machine-hours. Phrase extraction was the heaviest
task and took a total of approx. 28 machine-days.
• Controlled concept indexing took approx. 22 machine-hours.
• Training a regression model took between 10 minutes (least median squared linear regression)
and 150 minutes (multilayer perceptron) per model. Aggregation of results and post-ranking
took approx. 90 minutes.
• Producing the final runs required 5 complete days of processing using all 4 machines,
i.e. approx. 20 machine-days.</p>
      <p>The total processing time for a topic patent was, therefore, approx. 43 seconds (5 days of
wall-clock time for the 10 000 topics of the XL set). This is, of
course, quite long for online processing. However, given the challenge of processing the whole
collection of patent documents, we did not address the issue of processing time at all, and many
possibilities for runtime improvement and optimization exist.</p>
    </sec>
    <sec id="sec-9">
      <title>Future Work</title>
      <p>[3] http://www.oclc.org/dewey/
[16] Franck Smadja, Kathleen R. McKeown and Vasileios Hatzivassiloglou. Translating
collocations for bilingual lexicons: A statistical approach. Computational Linguistics 22(1):1-38
(1996).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chih-Chung Chang</surname>
          </string-name>
          and
          <string-name>
            <surname>Chih-Jen Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM : a library for support vector machines</article-title>
          ,
          <year>2001</year>
          . http://www.csie.ntu.edu.tw/~cjlin/libsvm
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Gavin</given-names>
            <surname>Clarkson</surname>
          </string-name>
          .
          <article-title>Objective Identification of Patent Thickets: A Network Analytic Approach</article-title>
          .
          <source>Harvard Business School Doctoral Thesis</source>
          .
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis</article-title>
          .
          <source>In Proceedings of IJCAI</source>
          , pages
          <fpage>1606</fpage>-<lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Chih-Wei</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chih-Chung Chang</surname>
          </string-name>
          , and
          <string-name>
            <surname>Chih-Jen Lin</surname>
          </string-name>
          .
          <article-title>A Practical Guide to Support Vector Classification</article-title>
          .
          <source>May 19</source>
          ,
          <year>2009</year>
          . http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Marc</given-names>
            <surname>Krier</surname>
          </string-name>
          and
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Zaccà</surname>
          </string-name>
          .
          <article-title>Automatic Categorisation Applications at the European Patent Office</article-title>
          .
          <source>World Patent Information</source>
          <volume>24</volume>
          :
          <fpage>187</fpage>
          -
          <lpage>196</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <collab>The Lemur Project</collab>
          .
          <year>2001</year>
          -2008
          . University of Massachusetts and Carnegie Mellon University. http://www.lemurproject.org/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Patrice</given-names>
            <surname>Lopez</surname>
          </string-name>
          .
          <article-title>GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications</article-title>
          .
          <source>Proceedings of ECDL 2009</source>
          , 13th European Conference on Digital Library, Corfu, Greece, Sept. 27 - Oct. 2,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gabriela</given-names>
            <surname>Cavaglià</surname>
          </string-name>
          .
          <article-title>Integrating Subject Field Codes into WordNet</article-title>
          . In Gavrilidou M.,
          <string-name>
            <surname>Crayannis</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markantonatu</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piperidis</surname>
            <given-names>S.</given-names>
          </string-name>
          and Stainhaouer G. (Eds.)
          <source>Proceedings of LREC-2000</source>
          , Second International Conference on Language Resources and Evaluation, Athens, Greece, May 31- June 2,
          <year>2000</year>
          , pp.
          <fpage>1413</fpage>
          -
          <lpage>1418</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Preslav</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ariel</given-names>
            <surname>Schwartz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marti</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <article-title>Citances: Citation Sentences for Semantic Analysis of Bioscience Text</article-title>
          . SIGIR'04 workshop on Search and Discovery in Bioinformatics, Sheffield, UK.
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Romary</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Armbruster</surname>
          </string-name>
          .
          <article-title>Beyond Institutional Repositories</article-title>
          . June 25
          ,
          <year>2009</year>
          . Available at SSRN: http://ssrn.com/abstract=1425692
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Romary</surname>
          </string-name>
          .
          <article-title>An abstract model for the representation of multilingual terminological data: TMF - Terminological Markup Framework</article-title>
          .
          <source>TAMA (Terminology in Advanced Microcomputer Applications)</source>
          ,
          <source>Antwerp (Belgium)</source>
          ,
          <source>February 1-2</source>
          ,
          <year>2001</year>
          . Available at http://hal.inria.fr/inria00100405
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Savoy</surname>
          </string-name>
          .
          <article-title>Combining Multiple Strategies for Effective Monolingual and Cross-Language Retrieval</article-title>
          .
          <source>Information Retrieval</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          -2) :
          <fpage>121</fpage>
          -
          <lpage>148</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>