<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLEF-IP 2010: Retrieval Experiments in the Intellectual Property Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Florina Piroi</string-name>
          <email>f.piroi@ir-facility.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Information Retrieval Facility (IRF) Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the recent decade, research in IR methods for the Intellectual Property domain has increased. The first efforts in observing how information retrieval is done in the patent domain were made within the series of NTCIR workshops (see for example [2]). Lately, more workshops and conferences have been dedicated to bringing together IR and IP specialists [3,7]. In 2008, the IRF obtained the agreement to coordinate two evaluation campaigns with emphasis on patent documents and prior art retrieval: CLEF-IP and TREC-CHEM. The CLEF-IP track was launched in 2009 to investigate IR techniques for patent retrieval, as part of the CLEF 2009 evaluation campaign. In 2010, the track continued as a benchmarking activity of the CLEF 2010 conference. The track utilizes a collection of more than 1.3 million patent documents derived from EPO (European Patent Office) sources. The collection covers English, French and German, with at least 150,000 documents in each language. There were two tasks in the 2010 track. The first one is to find patent documents that are candidates to constitute prior art for a given document. The second task is to classify a given document according to the International Patent Classification system (IPC). Relevance judgements are produced using the patent citations for the Prior Art Candidates search task and using the recorded classification codes for the Classification task. This notebook gives a report on the CLEF-IP activity in 2010. The paper is structured as follows: Section 2 describes the test collection used this year; Section 3 presents the participating teams and gives an overview of the methods the teams used. In the same section we also present the main measurements done in this track.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>The 2010 CLEF-IP</title>
    </sec>
    <sec id="sec-3">
      <title>Data Collection</title>
      <sec id="sec-3-1">
        <title>The Objects in the Collection</title>
        <p>The CLEF-IP collection contains patents, physically stored as a collection of patent documents. A patent document may be an application document, a search report, or a granted patent document. In the following we describe some of the key terms and steps in a patent’s lifecycle.</p>
        <p>A patent is a set of exclusive legal rights for the use and exploitation of an invention in exchange for its public disclosure. The exclusive rights are given by a governing authority and are limited in time. The requirements for granting patents vary widely among patent offices, but a common first step is to file a patent application request with a patent office. For this, the applicant must supply a written specification of the invention, also called an application document, which gives the background of the invention, a description of the invention, and a set of claims that define the scope of protection should the patent be granted. The application date, or filing date, of a patent is the date when the patent application was filed.</p>
        <p>In order to be granted, a patent application is examined by professionals who analyze whether it meets certain patentability criteria and whether the application complies with the relevant patent law. The most important patentability criteria are novelty, inventiveness, and practicality. Of relevance to the CLEF-IP benchmarking activity is the novelty criterion. A patent application satisfies the novelty requirement if no earlier patent or other kind of publication describing (parts of) the invention can be found in a reasonable amount of time. Such a search for novelty-relevant documents is called a prior art search. Results of a prior art search are recorded in a search report, and are a basis for further communication with the applicant, which may result in modifications of the patent specification before the patent is granted. The relevant documents listed in a search report of a patent are referred to as patent citations. Usually, the search report and the application document are published within 18 months from the application date.</p>
        <p>
          When a patent application is found to meet all the necessary legal and
patentability requirements, a decision to grant the patent is made and, after
further fees and procedural steps, the granted patent is published. An important
procedural step at the EPO is that a translation of the claims in all three official
EPO languages (English, German, French) is provided [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Patent documents generated at the different stages of the patent’s life-cycle are identified by a country code (denoting the patent office analyzing/granting the patent), a unique numeric identifier, and a kind code together with a version number<sup>1</sup>. In the case of the EPO, the A in the kind code denotes a patent document published in the application phase, while the B kind code marks a granted patent document.</p>
        <p>It is possible to file a patent application at more than one patent office. When the same invention is granted a patent by different patent offices, the resulting patents are said to belong to the same patent family.</p>
        <p><sup>1</sup> For EP patents, documents at different stages have the same numeric identifier. For other patent offices this is not always the case. For example, the patent document US-6689545-B2 represents a US granted patent whose application document has the publication number US-2003011722-A1.</p>
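        <p>The identifier scheme described above lends itself to a small parsing helper. The following Python sketch is illustrative only: it assumes identifiers are written as COUNTRY-NUMBER-KIND followed by an optional version digit (as in EP-0981201-A2), and the names PatentDocId, parse_doc_id and is_granted_ep are hypothetical, not part of any CLEF-IP tooling.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import NamedTuple


class PatentDocId(NamedTuple):
    country: str  # patent office code, e.g. "EP" or "US"
    number: str   # numeric identifier, kept as text to preserve leading zeros
    kind: str     # kind code letter, e.g. "A" or "B"
    version: str  # version digit after the kind code (empty if absent)


# Assumed textual form: COUNTRY-NUMBER-KIND[VERSION], e.g. "EP-0981201-A2".
_DOC_ID = re.compile(r"^([A-Z]{2})-(\d+)-([A-Z])(\d?)$")


def parse_doc_id(doc_id: str) -> PatentDocId:
    """Split a patent document identifier into its four components."""
    match = _DOC_ID.match(doc_id)
    if match is None:
        raise ValueError(f"unrecognised patent document identifier: {doc_id!r}")
    return PatentDocId(*match.groups())


def is_granted_ep(doc: PatentDocId) -> bool:
    """For EPO documents, the B kind code marks a granted patent document."""
    return doc.country == "EP" and doc.kind == "B"
```
]]></preformat>
        <p>Keeping the numeric part as a string is a deliberate choice here, since EP numbers such as 0981201 carry a leading zero.</p>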
        <p>An important tool in organizing the large amount of patent data which patent offices regulate is the classification system. A patent classification system ‘sorts’ the patents according to the technical area they belong to, and it is a basis for a quick investigation of the state of the art in a field<sup>2</sup>. There are several patent classification systems, the most used being the International Patent Classification system (IPC), the European Classification System (ECLA), and the US Classification System.</p>
      </sec>
      <sec>
        <title>Technical Elements</title>
        <p>Compared to the CLEF-IP 2009 data collection, this year the number of patent documents included in the CLEF-IP data collection has increased. The total number of patent documents is over 3.5 million, almost one million more documents than in 2009.</p>
        <p>The documents in the CLEF-IP 2010 collection are extracted from the MAREC<sup>3</sup> data corpus, and are patent documents published by the EPO.</p>
        <p>Following the same procedure as last year, we split the available data into two parts:</p>
        <list list-type="order">
          <list-item><p>The collection corpus (or target data set) contains documents with an application date prior to 2002. This set contains over 2.6 million documents, representing over 1.9 million patents.</p></list-item>
          <list-item><p>The topic pool contains documents with an application date between 2002 and 2009. This set contains over 0.8 million patent documents, representing over 0.6 million patents.</p></list-item>
        </list>
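        <p>The split by application date can be expressed as a short rule. The Python sketch below is a minimal illustration; the function and constant names are hypothetical, and treating "between 2002 and 2009" as inclusive of both years is an assumption, since the exact boundary days are not stated.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from datetime import date
from typing import Optional

CORPUS_CUTOFF = date(2002, 1, 1)     # application dates before this go to the corpus
TOPIC_POOL_END = date(2009, 12, 31)  # assumption: "between 2002 and 2009" is inclusive


def assign_partition(application_date: date) -> Optional[str]:
    """Return 'corpus', 'topic_pool', or None for dates outside both ranges."""
    if application_date < CORPUS_CUTOFF:
        return "corpus"
    if application_date <= TOPIC_POOL_END:
        return "topic_pool"
    return None
```
]]></preformat>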
        <p>As in 2009, the Test Collection Corpus was delivered to the participants as is, without merging the documents related to the same patent into one document. Each patent is identified by a unique patent number, a string starting with EP and followed by 7 digits. Corresponding to each patent is a directory containing the patent documents related to that patent. The layout is nnnnnn/nn/nn/nn/*.xml.</p>
        <p>For example, the directory corresponding to patent EP 0981201 contains the files EP-0981201-A2.xml, EP-0981201-A3.xml, and EP-0981201-B1.xml:</p>
        <preformat>&gt; pwd
/000000/98/12/01
&gt; ls
EP-0981201-A2.xml EP-0981201-A3.xml EP-0981201-B1.xml</preformat>
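        <p>The mapping from a patent number to its directory can be sketched as follows. This is a hedged illustration built from the single example above: the rule that the first path component is the leading digit zero-padded to six characters is an assumption inferred from that example, and patent_directory is a hypothetical helper name.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from pathlib import PurePosixPath


def patent_directory(patent_number: str) -> PurePosixPath:
    """Map e.g. 'EP 0981201' to /000000/98/12/01 (layout nnnnnn/nn/nn/nn)."""
    digits = patent_number.upper().replace("EP", "").replace("-", "").strip()
    if len(digits) != 7 or not digits.isdigit():
        raise ValueError(f"expected EP followed by 7 digits, got {patent_number!r}")
    # Assumption: the leading digit is zero-padded to six characters.
    top = digits[0].zfill(6)
    return PurePosixPath("/", top, digits[1:3], digits[3:5], digits[5:7])
```
]]></preformat>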
        <p>All documents in the CLEF-IP collection contain the following main XML fields: bibliographic data, abstract, description, and claims. Not all documents actually have content in these fields. This happens because certain EPO patent applications are internationally filed under the Patent Cooperation Treaty (PCT<sup>4</sup>), in which case the EPO does not republish the whole patent application, but only a bibliographic entry which refers to the original application.</p>
        <p><sup>2</sup> See http://www.wipo.int/classifications/ipc/en/</p>
        <p><sup>3</sup> The MAREC data corpus is a collection of over 19 million patent documents, in XML format, made available by the IRF for research purposes.</p>
        <p><sup>4</sup> http://www.wipo.int/pct/en/</p>
      </sec>
      <sec>
        <title>Tasks and Topics</title>
        <p>There were two tasks in CLEF-IP 2010: a Prior Art Candidates Search task (PAC) and a Classification task (CLS).</p>
        <p>The first task in this track (PAC) consisted in finding patent documents in the target collection that may invalidate a given patent application. The participants were provided with two sets of patents from the topic pool (a small set of 500 topics and a large set of 2000 topics). The task did not restrict the language used for retrieving the documents, but participants were encouraged to use the multilingual characteristic of the collection (namely, that claims in granted patent documents are provided in three languages).</p>
        <p>The second task in the CLEF-IP track (CLS) is newly introduced, and required participants to classify a given patent document according to the IPC system. The classification was to be given at the subclass level. The set of classification topics contained 2000 patent documents, a different set than the one used in the PAC task.</p>
        <p>Unlike last year’s topics, where a virtual patent document was composed with a description and claims in German, English and French, this year we used patent application documents as topics. This means that the topic documents contain claims in only one of the three languages, with about 67% of the documents having English content, 26% German content, and 7% French content. We placed only one constraint on the choice of topics: the application documents must have content in the abstract, description and claims sections of the XML document. The patent documents released as topics had the citation records (for the PAC task) and classification records (for the CLS task) removed.</p>
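        <p>The single constraint on topic selection (content in the abstract, description and claims sections) can be checked mechanically. The Python sketch below is an illustration under assumptions: the element names abstract, description and claims are taken from the field names used in the text and may not match the actual XML schema, and is_eligible_topic is a hypothetical helper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed element names, mirroring the field names mentioned in the text.
REQUIRED_SECTIONS = ("abstract", "description", "claims")


def has_text(element: Optional[ET.Element]) -> bool:
    """True if the element exists and contains non-whitespace text anywhere."""
    return element is not None and "".join(element.itertext()).strip() != ""


def is_eligible_topic(xml_text: str) -> bool:
    """True if all required sections of a patent document have content."""
    root = ET.fromstring(xml_text)
    return all(has_text(root.find(f".//{name}")) for name in REQUIRED_SECTIONS)
```
]]></preformat>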
      </sec>
      <sec>
        <title>Relevance Assessments</title>
        <p>
          The relevance assessments used to evaluate the PAC submissions were obtained automatically from the patent citations stored in the collection documents. Since the average number of citations per patent in the CLEF-IP collection is low (below 4), we looked for methods to extend the set of relevant documents per topic. For this we used an extended list of citations: to the patents listed in the patent’s search report (the direct citations) we added the patent citations listed in the family members of the topic patent, as well as the family members of the cited patents. For a detailed explanation of the citation extraction procedure, we point the reader to last year’s track overview article
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
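        <p>The construction of the extended citation list can be sketched as a small set computation. This is not the organizers' extraction procedure (which is described in the cited overview article); it is a minimal Python illustration that assumes citations and family membership are available as in-memory dictionaries, and that family members are added for every cited patent, including those cited by family members of the topic. All names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Set


def extended_citations(
    topic: str,
    direct_citations: Dict[str, Set[str]],
    family_of: Dict[str, Set[str]],
) -> Set[str]:
    """Direct citations, plus citations of the topic's family members,
    plus the family members of every cited patent."""
    # The topic patent together with its family members.
    sources = {topic} | family_of.get(topic, set())
    cited: Set[str] = set()
    for patent in sources:
        cited |= direct_citations.get(patent, set())
    # Add the family members of each cited patent.
    extended = set(cited)
    for patent in cited:
        extended |= family_of.get(patent, set())
    return extended
```
]]></preformat>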
        <p>The relevance assessments used to evaluate the CLS submissions were also obtained automatically from the documents that originated the CLS topics. We extracted the IPC codes, restricted to the subclass level, from the patent documents.</p>
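        <p>Restricting an IPC symbol to the subclass level amounts to keeping its first four characters: the section letter, the two-digit class, and the subclass letter. The Python sketch below illustrates this; the function names are hypothetical and the input format (symbols such as "H04L 12/56", with or without spaces) is an assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, Set


def ipc_subclass(ipc_code: str) -> str:
    """Truncate an IPC symbol such as 'H04L 12/56' to its subclass, 'H04L'."""
    compact = "".join(ipc_code.split()).upper()
    head = compact[:4]
    well_formed = (
        len(head) == 4
        and head[0].isalpha()    # section letter
        and head[1:3].isdigit()  # two-digit class
        and head[3].isalpha()    # subclass letter
    )
    if not well_formed:
        raise ValueError(f"not an IPC symbol: {ipc_code!r}")
    return head


def subclass_set(ipc_codes: Iterable[str]) -> Set[str]:
    """Distinct subclasses of a document, usable as its set of relevant labels."""
    return {ipc_subclass(code) for code in ipc_codes}
```
]]></preformat>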
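      </sec>
      <sec id="sec-3-2">
        <title>Institution</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>ID</th><th>Institution</th><th>Country</th></tr>
            </thead>
            <tbody>
              <tr><td>bitem</td><td>BiTeM, Service of Medical Informatics, Geneva University Hospitals</td><td>CH</td></tr>
              <tr><td>dcu</td><td>Dublin City Univ. - School of Computing</td><td>IE</td></tr>
              <tr><td>hild</td><td>Hildesheim Univ. - Information Science</td><td>DE</td></tr>
              <tr><td>humb</td><td>Humboldt Univ. - Dept. of German Language and Linguistics</td><td>DE</td></tr>
              <tr><td>insa</td><td>LCI Institut National des Sciences Appliquées de Lyon</td><td>FR</td></tr>
              <tr><td>jve</td><td>Industrial Property Documentation Department, JSI Jouve</td><td>FR</td></tr>
              <tr><td>run</td><td>Information Foraging Lab, Radboud University Nijmegen</td><td>NL</td></tr>
              <tr><td>spq</td><td>Spinque</td><td>NL</td></tr>
              <tr><td>ssft</td><td>Simple Shift</td><td>CH</td></tr>
              <tr><td>uaic</td><td>Al. I. Cuza University of Iași - Natural Language Processing</td><td>RO</td></tr>
              <tr><td>ui</td><td>Information Retrieval Group, Universitas Indonesia</td><td>ID</td></tr>
              <tr><td>uned</td><td>UNED - E.T.S.I. Informatica, Dpto. Lenguajes y Sistemas Informaticos</td><td>ES</td></tr>
            </tbody>
          </table>
        </table-wrap>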
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Submissions and Results</title>
      <p>This section is based on the descriptions provided by the participants. For each participant we present which XML fields were used in document processing, what kind of pre- and postprocessing was done, which retrieval and ranking system was used to obtain the results, and which cross-language techniques were involved.</p>
      <p>⋆ The bitem participant submitted runs to both the PAC and CLS tasks. For both tasks, the Porter stemmer was applied and stopwords were eliminated in a document preprocessing step.</p>
      <p>In the PAC task, the participant used the following fields both for index creation and query generation: title, abstract, claims, IPC codes, applicants and inventors information. Using the Terrier platform, only one English index was created, and retrieval results were ranked using Terrier’s PL scheme. Fields in a language other than English were translated into English with the Google translator before being added to the index. Topic documents in a language other than English were also translated into English with the Google translator. For the run that simulated the examiner search, a postprocessing step was applied that used the citations provided by the applicant in the text of the document. The participant also experimented with using the geographical location of the applicant in the postprocessing phase.</p>
      <p>The document fields used for indexing within the CLS task are title, abstract, claims, description, applicant and citations. First, a retrieval step is done, where the Google translator is used as in the PAC task. The retrieved documents are given as input to the kNN algorithm, which maps them to IPC codes, which are then reranked.
      </p>
      <p>⋆ The dcu group submitted runs to the PAC task. The English index used in the retrieval was created from the following fields: title, abstract, description, claims, and classification tags. The document preprocessing phase included stopword removal, stemming and number removal. The non-English topics were translated into English using the Google translator. The IR system used for retrieving results was Indri, which ranked the results using a language model and inference networks. The postprocessing step for one of the runs added the citations extracted from the topic document descriptions (i.e. applicant citations) to the list of results.
      </p>
      <p>⋆ The University of Hildesheim group (hild) participated in the PAC task and experimented with various types of queries in the frame of an Apache Lucene based system. One English language index was created based on the patent number, title, abstract and IPC XML fields. Stopwords were removed and a Porter stemmer was applied to both corpus and topic documents. Phrase queries were extracted from various fields in the topic document, such as phrases from the title only, or from the title, main claim, and first part of the description. The IPC codes were also used in filtering the results, by looking for results that share at least one IPC code with the topic document.
      </p>
      <p>⋆ The humb group took part in both tasks of the track with the same custom-made system (PATATRAS). The preprocessing step included citation identification in the patent’s text, cleaning the inventor and applicant names, language-based tokenization, POS-tagging, concept-tagging, key-term extraction and lemmatization. All patent document fields were used in the index creation. One lemma index per language and a concept index based on a self-developed terminological database, GRISP, were used in the retrieval experiments. The PATATRAS system combines the Lemur, Okapi BM25, and Indri retrieval engines, each acting on certain index files. Result ranking is done by BM25, Indri and SVM. The classification results were obtained with the same system, but using the KNN classifier and with conceptual tagging eliminated from the preprocessing phase.
      </p>
      <p>⋆ All classification results sent in by the insa group were obtained using the LCS2<sup>5</sup> classification system, using a balanced Winnow method. The text fed into the classifier was considered as a bag of words or as a bag of linguistic triples obtained by preprocessing selected fields with the AGFL<sup>6</sup> built-in linguistic parsers (EP4IR for English, and FR4IR for French). The various training experiments were done by choosing different document fields to be considered in the training process: a) abstracts and titles, b) abstracts, titles, names and addresses, c) description.
      </p>
      <p>⋆ jve participated in the CLS task with three runs. Both in the training and in the test phase the title, claims, description, and abstract (when available) were used. The first run was obtained with an SVM classifier, where documents were preprocessed by tokenization, POS-tagging, lemmatization, and a key-phrase tagging step, which in a patent-oriented context detects the terms that best explain the subject of the patent application. WordNet (a lexical resource for the English language) was also used in this step. The second run submitted by JSI Jouve made use of the Lemur system to index the data corpus, generating one index per language. Lemur was also used to retrieve documents relevant to the given topics. From the returned patent documents, only the classification codes were kept. The third run combined the two methods used for the first two runs.
      </p>
      <p>⋆ The Radboud University group (run) participated in both the PAC and CLS tasks. The retrieval system used for the PAC task was Lemur/Indri based. The index was created out of the title, abstract, claims, description and IPC code fields, in English only. Per topic, one hundred documents were retrieved, which were then reranked using regression models. The system used for the CLS task is LCS Winnow with a Lucene analyzer. Only the English abstracts were fed into the classifier. The abstracts were first processed to remove punctuation and numbers and to put all letters into lowercase, then a simple tokenizer was applied. Experiments with dependency triples in the abstract were done using the AEGIR hybrid parser.
      </p>
      <p>⋆ The retrieval system used by Spinque (spq) is an in-house retrieval system that contains a graphical search strategy builder and its own indexer based on MonetDB. The same system, with the same generic index, was used in both the PAC and CLS tasks. In both tasks, the topic documents were passed through a Snowball stemmer, with English stopwords removed. The first 26 terms given by the tf-idf algorithm were considered to be part of the query. The IPC codes were also used in the query creation. The retrieval step returned a list of ranked patent documents for the PAC task and, for the CLS task, the IPC codes attached to the ranked patent documents.
      </p>
      <p><sup>5</sup> http://www.phasar.cs.ru.nl/LCS/</p>
      <p><sup>6</sup> http://www.agfl.cs.ru.nl</p>
      <p>⋆ The classifier used by the Simple Shift participant (ssft), myClass, is a Winnow-like in-house implementation. The fields used in the classification process are: inventor, applicant, title, abstract, claims, and description reduced to a maximum size of 4k. During the training phase ssft made use of additional patent corpora to increase the classification precision. In other training experiments, oversampling (copying the documents in the respective category a certain number of times) was used. A collocation<sup>7</sup> extraction step was done during the indexing phase.
      </p>
      <p>⋆ The run submitted by the uaic participant to the PAC task was obtained with a Lucene based system. The English-only index was created using the invention title, claims, and abstract (or description, when the abstract was missing) document fields. The data corpus was split into 20 parts and the indexing was done in parallel on 20 machines, one split per machine. Lucene was used to extract a query from the topic document using the same document fields as for the index creation. Various boost factors were applied to the document fields used in the query.
      </p>
      <p>⋆ In the PAC task, the University of Indonesia (ui) participant used a simple Lemur Indri setting. A large number of XML fields were used in the English index creation, among which we list abstract, applicant, applicant’s address, claims, classification codes, and descriptions. The three submitted runs extracted the query from different document fields in the topic documents: invention title and description; claims; invention title, description, and claims. The query extraction is done by the tf-idf term weighting algorithm, keeping the first 10 terms.
      </p>
      <p>⋆ The uned group participated in the PAC task with a retrieval system based on BM25. Before indexing or retrieving, the patent documents in the collection were joined (at XML field level) into one patent document. Four XML fields, to which stemming and stopword removal were applied, were considered for the indexing process: title, abstract, description and claims. A separate index per language was created. The query terms are extracted from the topic documents by computing the Kullback-Leibler divergence (KLD) between the language model of the topic document and the language model of the patent collection. Experiments with field boost values were also made.</p>
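      <p>The KLD-based query term selection mentioned for the uned runs can be illustrated with a short Python sketch. It is not the participants' implementation: it assumes maximum-likelihood term probabilities for the topic document, a precomputed dictionary of collection probabilities, and a small probability floor for terms unseen in the collection (a smoothing choice made here for illustration). Each term is scored by its contribution P(t|D) log(P(t|D) / P(t|C)) to the divergence, and the highest-scoring terms form the query.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def kld_query_terms(
    topic_tokens: List[str],
    collection_prob: Dict[str, float],
    k: int = 10,
    floor: float = 1e-9,
) -> List[Tuple[str, float]]:
    """Rank the terms of a topic document by their KLD contribution
    against the collection language model and keep the top k."""
    counts = Counter(topic_tokens)
    total = sum(counts.values())
    scores = {}
    for term, count in counts.items():
        p_doc = count / total
        # Floor avoids division by zero for terms missing from the collection.
        p_coll = max(collection_prob.get(term, 0.0), floor)
        scores[term] = p_doc * math.log(p_doc / p_coll)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```
]]></preformat>
      <p>Terms that are frequent in the topic but rare in the collection score highest, while common words that are no more frequent in the topic than in the collection receive scores near or below zero.</p>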
    </sec>
    <sec>
      <title>Evaluation Results</title>
      <p>We evaluated the submitted experiments using the most common metrics in IR. Before running the evaluation software, some simple cleanup of the data was done. A further important data correction was done on the experiments submitted to the CLS task. Here, we noticed that several participants had made use of IPC versions that were not used in the CLEF-IP data corpus. (The data feeds that originated the CLEF-IP documents did not carry classification symbols that were eliminated when the IPC system was revised over time.) For this reason, we removed all entries in the result files where classification codes not occurring in the CLEF-IP 2010 corpus were listed.</p>
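      <p>The correction applied to the classification results amounts to a simple filter. The Python sketch below is illustrative only: the tuple layout of a result entry (topic identifier, classification code, score) is hypothetical and does not describe the actual submission file format.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, List, Set, Tuple

# Hypothetical entry layout: (topic id, classification code, score).
ResultEntry = Tuple[str, str, float]


def drop_unknown_codes(
    entries: Iterable[ResultEntry], corpus_codes: Set[str]
) -> List[ResultEntry]:
    """Keep only entries whose classification code occurs in the corpus."""
    return [entry for entry in entries if entry[1] in corpus_codes]
```
]]></preformat>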
      <p>For each submitted PAC experiment we computed the following measures:</p>
      <p>For each submitted CLS experiment we computed the following measures:</p>
      <p><sup>7</sup> Collocations are concepts expressed by more than one word.</p>
      <p>
        All computations were done using the trec_eval 9.0 software provided by
NIST, with the exception of the PRES measure, which we computed using a script
provided to us by the measure’s authors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Figures 1 through 5 show some of
the calculated measures. Detailed values for each of the mentioned measures are
given in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
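      <p>For orientation, precision and recall at a cutoff, and the per-topic average precision that is averaged over topics to give MAP, can be computed as in the following Python sketch. It follows the standard textbook definitions rather than the trec_eval source, assumes the ranked list contains no duplicates, and uses illustrative function names.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Set


def precision_at(ranked: List[str], relevant: Set[str], k: int = 100) -> float:
    """Fraction of the top-k positions occupied by relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at(ranked: List[str], relevant: Set[str], k: int = 100) -> float:
    """Fraction of the relevant documents found in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the ranks of retrieved relevant
    documents, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)
```
]]></preformat>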
    </sec>
    <sec id="sec-5">
      <title>Final Observations</title>
      <p>We have presented here an account of the benchmarking activities done within the CLEF-IP lab, organized in the frame of the CLEF 2010 conference. Unfortunately, time was too short for an in-depth analysis of the results; we leave this as future work. Lack of resources was also an impediment to following some of the lines drawn at the end of the 2009 track. One such line is intensifying contacts with IP professionals, a goal we were not able to pursue. We did not forget the conclusions drawn at the end of CLEF-IP 2009; we only postponed their realization.</p>
      <p>Fig. 1. MAP measures for the PAC runs</p>
      <p>Fig. 2. Precision and Recall at 100 measures for the PAC runs (plotted series: P at 100, R at 100)</p>
      <p>Fig. 3. Plotted series: nDCG, PRES</p>
      <p>Fig. 4. MAP and F_1 measures for the CLS runs</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>European Patent Convention (EPC)</article-title>
          . http://www.epo.org/patents/law/legal-texts.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Atsushi</given-names>
            <surname>Fujii</surname>
          </string-name>
          , Makoto Iwayama, and
          <string-name>
            <given-names>Noriko</given-names>
            <surname>Kando</surname>
          </string-name>
          .
          <article-title>Overview of the Patent Retrieval Task at the NTCIR-6 Workshop</article-title>
          . In Noriko Kando and David Kirk Evans, editors,
          <source>Proceedings of the Sixth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering, and Cross-Lingual Information Access</source>
          , pages
          <fpage>359</fpage>
          -
          <lpage>365</lpage>
          , 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan, May
          <year>2007</year>
          . National Institute of Informatics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zenz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Berger</surname>
          </string-name>
          .
          <source>1st international workshop on advances in patent information retrieval (AsPIRe'10)</source>
          . March
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Florina</given-names>
            <surname>Piroi</surname>
          </string-name>
          .
          <article-title>CLEF-IP 2010: Classification task evaluation summary</article-title>
          . August
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Florina</given-names>
            <surname>Piroi</surname>
          </string-name>
          .
          <article-title>CLEF-IP 2010: Prior art candidates search evaluation summary</article-title>
          . July
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Roda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zenz</surname>
          </string-name>
          .
          <article-title>CLEF-IP 2009: Retrieval Experiments in the Intellectual Property Domain</article-title>
          . To appear. In
          <source>Proc. of CLEF, Revised Selected Papers</source>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Tait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Harris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          .
          <source>The 3rd international workshop on patent information retrieval (PaIR 2010)</source>
          . October
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>W.</given-names>
            <surname>Magdy</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>PRES: A score metric for evaluating recall-oriented information retrieval applications</article-title>
          . In
          <source>SIGIR</source>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>