=Paper= {{Paper |id=Vol-1179/CLEF2013wn-CLEFIP-PiroiEt2013 |storemode=property |title=Passage Retrieval Starting from Patent Claims A Clef-Ip 2013 Task Overview |pdfUrl=https://ceur-ws.org/Vol-1179/CLEF2013wn-CLEFIP-PiroiEt2013.pdf |volume=Vol-1179 |dblpUrl=https://dblp.org/rec/conf/clef/PiroiLH13a }} ==Passage Retrieval Starting from Patent Claims A Clef-Ip 2013 Task Overview== https://ceur-ws.org/Vol-1179/CLEF2013wn-CLEFIP-PiroiEt2013.pdf
    Passage Retrieval Starting from Patent Claims
                   A Clef-Ip 2013 Task Overview

                   Florina Piroi, Mihai Lupu, Allan Hanbury
                          Vienna University of Technology,
             Institute of Software Technology and Interactive Systems,
                     Favoritenstrasse 9-11, 1040 Vienna, Austria



      Abstract. Most of the searches that a patent expert at a patent office
      performs use boolean methods to query large databases of patent data.
      The Clef-Ip evaluation track is designed to experiment with information
      retrieval techniques in the patent domain. The data corpus in the
      Clef-Ip Lab consists of patent documents published by the European
      Patent Office. One of the main tasks in the Lab has been related to the
      Prior Art type of search performed by the patent experts at patent
      offices. The task has gone through various changes over the years, from
      using virtual patent documents as topics (in 2009) to actual patent
      application documents, and sets of claims from patent application
      documents (2012 and 2013). Relevance assessments for this task were
      based on Search Reports published by the European Patent Office.
      In this overview we report on the work we have done in organizing
      this retrieval task in 2013.


1    The Clef-Ip Passage Retrieval Task
The technological developments in our time are closely coupled with the patent
system which encourages inventors to make their ideas public in exchange for
a monopoly on the invention for a limited period of time, up to 20 years. A
patent can be seen as a contract between a government and the patent owner
by which the latter can exclude other parties from manufacturing and exploiting
the invention without permission.
    To obtain a patent, one of the main requirements is that the invention is new.
To verify this, extensive searches must be carried out thoroughly, not only in
patent repositories, but also in specialized literature, conference publications,
etc. The amount of data to be searched, as well as the fact that many publications
are now digitized, means that search operations cannot be done without the help of
computers. With the tasks organized in Clef-Ip over the years we investigate
how current IR solutions may serve the needs of patent experts doing novelty
searches. This task, in particular, is meant to explore the approaches that IR
systems may offer when faced with finding specific pieces of text that are relevant
to any given patent claim.
    We briefly present here the process of obtaining a patent, with a focus on the
European Patent Office (Epo [2]).
    To obtain a patent, a patent application must be registered with a patent
office. A patent application contains an abstract, a title, a detailed description
of the invention, drawings (if necessary) and a set of claims that define the extent
of the protection aimed for. An applicant will also cite previously published
patents that are considered relevant to the described invention. At the Epo
applications can be made in any language. Given that the official languages at
the Epo are English, French, and German, whenever another language is used
in an application, a translation to one of these three languages must be made.
Once the application is registered at the patent office, it is examined to establish
that it is novel, that it has an inventive step, and that it is realizable. During these
examinations, at the Epo, a European search report is prepared which lists all
the relevant documents found (called patent citations).
    The Epo publishes patent applications together with their search reports
within 18 months of the filing date. If the patent applicant, based on
the search report, decides to pursue a patent, a sequence of communications
between the applicant and the patent office takes place. Usually, during this
process, the claims are adjusted so as not to conflict with existing patents.

The European search report is mainly based on the application claims, and,
more often than not, specifies not only the documents relevant to the (various)
claims, but also the passages of particular importance to them. Knowing this,
the Passage Retrieval Task Starting from Claims was designed to investigate the
effectiveness of Information Retrieval (IR) methods in finding relevant documents
and marking passages particularly pertinent to a set of claims.

2     The Clef-Ip Corpus
The Clef-Ip corpus was distributed as a collection of over 3 million Xml documents
pertaining to over 1.5 million patents published by the Epo and the World
Intellectual Property Organization (Wipo) prior to 2002 [8]. The Clef-Ip corpus
is an extract of the larger Marec collection¹ which uses a common normalized
Xml data format to represent patent documents published by the Epo, Wipo,
US Patent and Trademark Office, and Japan Patent Office. We do not describe
the collection content here, but we direct the reader to the previous publications
that detail it ([7,9]).

3     Task Topics
The Passage Retrieval from Claims Task closely models the novelty search done
by patent examiners at the Epo. Topics in this task are sets of claims extracted
from actual patent application documents published by the Epo after 2002.
Participants had to return passages that are relevant to the topic claims. The
passages must occur in the documents in the Clef-Ip collection. No other data
was allowed to be used in preparing for this task.
1
    The MAtrixware REsearch Collection. http://ifs.tuwien.ac.at/imp/marec
    To select the topics for this task we first had to select the patent application
documents from which we could then select various sets of claims. We first
selected a pool of candidate application documents from the Marec collection
with a few restrictions:
  - the document must be published after 2002 (that is, it is not part of the
    Clef-Ip corpus);
  - the document must be published by the Epo (recall that Marec also contains
    patents published by the US office, by the Wipo, and by the Japanese
    office);
  - the application should contain at least 3 citations and at most 10. This is
    because the number of patent documents with more than 10 citations in
    the search report is very small when compared to the number of patents
    with fewer than 10 citations. An additional reason for choosing the upper
    limit is a pragmatic one: patents with more than 10 citations proved to be
    more difficult and time consuming to process when extracting the relevance
    judgements;
  - the application document does not miss content, that is, it indeed has a
    description, an abstract and a claims section. We mention here that, according
    to the Patent Cooperation Treaty [1], for patent applications that are filed
    first at the Wipo and then at the Epo, the Epo does not publish an additional
    application document, but only a bibliographic entry that points to the
    original Wipo application. In terms of Xml representation, this translates into
    an Xml document that has neither a description, nor an abstract, nor a claims
    section;
  - the document does not contain more than 300,000 words. Setting this limit
    allowed us to avoid selecting patent application documents that are more than
    100 pages long. The rationale behind this decision is that, from past
    experience, task participants sometimes used full patent documents as queries²,
    and it has been shown that some retrieval algorithms do not cope well with
    large queries [5];
  - the application document has at least one family member (a patent document
    published at another patent office) which was filed prior to the document in
    the pool. This last restriction is an addition to the task organized in 2012.
    It is, however, an addition that models a widely used practice of the patent
    examiners, which consists in pulling out everything that was already done
    at other patent offices with regard to a patent application they have in front
    of them, before they start their own search.
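
The restrictions above amount to a single predicate over a candidate document. The following is a minimal sketch of such a filter, not the organizers' actual selection code: the `Candidate` record and all of its field names are hypothetical, and it assumes that the required metadata has already been extracted from the Marec Xml files.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one patent application document."""
    publication_year: int
    office: str                      # publishing office, e.g. "EP", "US", "WO", "JP"
    n_citations: int                 # citations listed in the search report
    has_description: bool
    has_abstract: bool
    has_claims: bool
    n_words: int
    has_earlier_family_member: bool  # family member filed before this document


def is_topic_candidate(doc: Candidate) -> bool:
    """Return True if the document satisfies all pool restrictions."""
    return (
        doc.publication_year > 2002            # not part of the Clef-Ip corpus
        and doc.office == "EP"                 # published by the Epo
        and 3 <= doc.n_citations <= 10         # citation bounds
        and doc.has_description
        and doc.has_abstract
        and doc.has_claims                     # no missing content
        and doc.n_words <= 300_000             # length limit
        and doc.has_earlier_family_member      # restriction added in 2013
    )
```

Keeping each restriction as one clause makes it easy to switch a single criterion off, for instance to reproduce the 2012 pool, which did not use the family member restriction.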

After applying all these restrictions, we ended up with a pool of over 300,000
patent application documents. The next step was to sample documents
from this pool and extract sets of claims to be topics. The sampling was done
randomly, with one restriction, however. Some technological areas are
overrepresented in the patent corpus. For example, patents in the pharmaceutical
domain are more numerous than in other technical domains. Because we intended to
have a relatively uniform distribution of the citation numbers the topic
documents have, we first grouped the documents in the pool by the number of
citations in the search report and in the Clef-Ip collection. We then randomly
selected 20 patent application documents from each group with the restriction
that each document belongs to a different Ipc class³. We did this three times:
once extracting English application documents, once German, and once French
application documents. This gave us a pool of over 460 patent application
documents. Out of this smaller pool we randomly inspected over 200 documents,
over 60 in each Epo language, to extract claim sets for our topics.
2
    Although it may have benefits in an IR sense, no patent expert would actually
    use such a solution.
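
The grouped sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name, the tuple layout of the pool, and the use of a single Ipc class per document are all hypothetical simplifications (a patent may carry several Ipc classes).

```python
import random
from collections import defaultdict


def sample_by_citation_count(pool, per_group=20, seed=0):
    """Sample documents per citation-count group, one per Ipc class.

    pool: iterable of (doc_id, n_citations, ipc_class) tuples.
    Returns {n_citations: [doc_id, ...]} with at most `per_group`
    documents per group, no two of which share an Ipc class.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc_id, n_citations, ipc_class in pool:
        groups[n_citations].append((doc_id, ipc_class))

    sample = {}
    for n_citations, docs in sorted(groups.items()):
        rng.shuffle(docs)                 # random order within the group
        seen_classes = set()
        chosen = []
        for doc_id, ipc_class in docs:
            if ipc_class in seen_classes:
                continue                  # enforce a different Ipc class per document
            seen_classes.add(ipc_class)
            chosen.append(doc_id)
            if len(chosen) == per_group:
                break
        sample[n_citations] = chosen
    return sample
```

Running the function once per language (English, German, French) on the corresponding part of the pool mirrors the three sampling passes mentioned in the text.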
    As mentioned in the previous section, a patent application document contains
a claims section which defines the extent of the legal protection for the
described invention. The claims section is a list of sentences (claims) which, for
ease of reference, are numbered. Below is an example of the first 8 claims in the
application document of patent WO-02058006.
    What Is Claimed Is:
    1. In a paint roller having an inner resilient cylindrical core and an outer
       annular surface contact material, the outer annular surface contact
       material forming a paint roll medium that is fixedly attached to the resilient
       core, the resilient core and paint roll medium rotating about an axis of
       said cylindrical core; the improvement wherein the paint roll medium is
       a hydroentangled three-dimensional imaged nonwoven fabric.
    2. An imaged nonwoven fabric of claim 1, wherein the fabric is formed from
       a precursor web comprised of staple length fibers.
    3. An imaged nonwoven fabric of claim 2, wherein the staple length fibers
       include surface modification agents.
    4. An imaged nonwoven fabric of claim 3, wherein the surface modification
       agents are selected from the group consisting of hydrophobic modifiers
       and hydrophilic modifiers.
    5. An imaged nonwoven fabric of claim 2, wherein the staple length fibers
       include the incorporation of melt additives.
    6. An imaged nonwoven fabric of claim 5, wherein the melt additives are
       selected from the group consisting of hydrophobic modifiers and
       hydrophilic modifiers.
    7. An imaged nonwoven fabric of claim 2, wherein the staple length fibers
       are selected from the group consisting of thermoplastic polymers,
       thermoset polymers, natural fibers, and blends thereof.
3
    Ipc (International Patent Classification System) is a classification system that groups
    patents by their technological area. Ipc is hierarchically organized in sections, classes,
    subclasses, groups and subgroups. There are 8 sections, 121 classes, and over 630
    subclasses in this classification system. A patent may belong to several technological
    subareas.
    Because the relevance judgements for this task are based on European search
reports, when selecting the topics we had to inspect, for each application
document in the pool, its search report (an example of a search report is shown in
Figure 1). A European search report usually has 4 columns. The second column
lists the relevant documents (patent citations) together with relevant passages,
images, etc. The first column marks the relevance category of the citation, with
X and Y being citations that destroy the novelty in the patent application, and A
being citations that offer background information on the invention but do not
destroy its novelty or inventive step. The third column in a European search
report gives the claim numbers to which the patent citations pertain.




                        Fig. 1. Extract from a search report.

     For a patent application document we inspected each patent citation that
occurred in our corpus⁴. We noted the claim numbers it referred to and the relevant
passage information. When the relevant passage information was acceptable, that
is, it referred to lines of text and not to figures or whole documents, we retained
the set of claims to be a topic in our task. We also took care that the search
reports were complete, in the sense that the patent examiner did the search for
all the claims in the patent application. When this was not the case, the search
reports contain a notice on this fact and we could eliminate these cases from our
pool.
     Using this procedure, we could extract several topics from one patent
application document. It was often the case that each topic extracted from one patent
application document had its own set of relevant documents and passages, and
that the sets of relevant documents didn't always overlap. From the over 200
patent application documents inspected we were able to extract 149 topics from
69 patent documents. From the 149 topics distributed to the participants, we
later removed topics 78 and 101 for being erroneous.
4
    Not all patent citations in a European search report occur in the Clef-Ip corpus.
     The structure of a Clef-Ip topic is as follows:
<tid>topic_id</tid>
<tfile>patent_ucid.xml</tfile>
<tfam-docs>patent_ucid.xml</tfam-docs>
<tclaims>xpaths_to_claims</tclaims>
where
  - tid is the topic identifier;
  - tfile is the Xml file which stores the patent application out of which the
    topic claims were extracted;
  - tclaims is the list of XPaths to the claims selected as topic from the source
    patent document;
  - tfam-docs contains the Xml files that are part of the source patent's family
    and published prior to the source patent document.

Below is an example of a topic in the Clef-Ip 2013 Passage Retrieval Task:

<tid>PSG-22</tid>
<tfile>EP-1267498-A1.xml</tfile>
<tfam-docs>FI-111300-B1.xml,FI-20011095-D0.xml,FI-20011095-A.xml</tfam-docs>
<tclaims>/patent-document/claims/claim[1] /patent-document/claims/claim[2]
/patent-document/claims/claim[3] /patent-document/claims/claim[4]
/patent-document/claims/claim[5] /patent-document/claims/claim[6]
/patent-document/claims/claim[7] /patent-document/claims/claim[8]
/patent-document/claims/claim[9] /patent-document/claims/claim[10]
/patent-document/claims/claim[11]</tclaims>
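
For participants processing topics programmatically, the four field values of a topic can be turned into a small record. The sketch below is only an illustration: the `Topic` class and `make_topic` helper are hypothetical names, and it assumes the raw field values have already been read from the topic file, with family documents separated by commas and claim XPaths separated by whitespace as in the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Topic:
    tid: str              # topic identifier
    tfile: str            # Xml file of the source patent application
    tfam_docs: List[str]  # earlier family member documents
    tclaims: List[str]    # XPaths of the claims that form the topic


def make_topic(tid: str, tfile: str, fam_docs: str, claims: str) -> Topic:
    """Build a Topic from the four raw field values of a topic entry."""
    return Topic(
        tid=tid.strip(),
        tfile=tfile.strip(),
        # family documents are comma-separated in the example topic
        tfam_docs=[d.strip() for d in fam_docs.split(",") if d.strip()],
        # claim XPaths are whitespace-separated (spaces or line breaks)
        tclaims=claims.split(),
    )


topic = make_topic(
    "PSG-22",
    "EP-1267498-A1.xml",
    "FI-111300-B1.xml,FI-20011095-D0.xml,FI-20011095-A.xml",
    "/patent-document/claims/claim[1] /patent-document/claims/claim[2]\n"
    "/patent-document/claims/claim[3]",
)
```

Here `topic.tclaims` holds one XPath per selected claim, which can then be resolved against the file named in `topic.tfile` to obtain the claim text used for query building.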

In the topic set distributed to the participants, the patent application documents
from which the claims were extracted and the previously published family
member documents were also available, so that participants could use them
to extend the original queries extracted from the claims.

4     Relevance Judgements
Using patent data in evaluation campaigns has one disadvantage when compared
to other campaigns: to obtain relevance assessments like those in real-life patent
searches, experts in the various technological domains are needed. The
budget of a research project cannot afford to employ them to provide judgements,
and voluntary participation in creating assessments is not an option for most
patent experts.
   Despite this disadvantage, we are in the fortunate situation that relevance
judgements of a kind already exist in the form of patent search reports⁵. All
Clef-Ip campaigns used, in one form or another, the search reports to extract
relevance assessments. We did the same this year. The difficulty in getting the
qrels for our topics in 2013 (and in 2012) is that, although patent citations can
be easily obtained in some machine-processable form, relevant passages cannot.
Therefore, the relevant passage information was extracted by manual inspection
of the search reports and of the cited documents, and by matching them
with the textual content of the relevant documents in the Clef-Ip collection.
5
    Experiments using citation information to design retrieval experiments have also
    been done in areas other than the patent domain. See for example [11].
    This proved to be a tedious process, so we developed a system to assist us with
selecting the relevant pieces of text from the Xml documents in our collection.
The system was also used in 2012 and is described in [9] and [8]. We briefly
present here the main features of the system. We see in Figure 2 that
the qrel generating system has three main areas:
  - a topic description area where, after typing in the patent application
    document identifier, we can assign the topic an identifier (unique in the system),
    define the set of claims in the topic, save it, and navigate among its relevant
    documents with the `Prev' and `Next' buttons;
  - a qrel display area where we see the currently selected relevant passages and
    can save them. Also in this area we give a direct link to the application
    document on the Epo Patent Register server, which, in turn, gives us a
    quick link to the document's search report;
  - a qrel definition area where individual passages (corresponding to XPaths in
    the Xml documents) are displayed. Clicking on them will select them to be
    part of the topic's qrels. For convenience, we provide three buttons by which
    we can select with one click all of the abstract's, description's or claims'
    passages. When clicking on the `Save QREL' button the selected passages
    are saved in the database as relevant passages for the topic being worked on.
    The relevance judgements created contained both relevant documents and
relevant passages in them. Though the documents could be differentiated by
degrees of relevance, due to their categories in the search reports (X, Y, A), the
passages were all considered equally relevant.
    Below is an excerpt from the qrel files obtained with the help of our system:
PSG-5 EP-1078736-A1 /patent-document/description/p[20]
PSG-5 EP-1078736-A1 /patent-document/description/p[21]
PSG-5 EP-1078736-A1 /patent-document/description/p[18]
PSG-5 EP-1078736-A1 /patent-document/description/p[15]
PSG-5 EP-1078736-A1 /patent-document/claims/claim[1]
PSG-5 EP-1078736-A1 /patent-document/abstract/p
PSG-5 EP-1078736-A1 /patent-document/claims/claim[2]
...
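
Each qrel line in the excerpt above carries a topic identifier, a document identifier and an XPath. A minimal sketch of reading such lines into a nested mapping is given below; the function name `parse_qrels` and the choice of data structure are illustrative assumptions, not part of the distributed tooling.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Return {topic_id: {doc_id: set(xpaths)}} from whitespace-separated qrel lines."""
    qrels = defaultdict(lambda: defaultdict(set))
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines such as "..."
        topic_id, doc_id, xpath = parts
        qrels[topic_id][doc_id].add(xpath)
    return qrels


sample = [
    "PSG-5 EP-1078736-A1 /patent-document/description/p[20]",
    "PSG-5 EP-1078736-A1 /patent-document/claims/claim[1]",
    "PSG-5 EP-1078736-A1 /patent-document/abstract/p",
    "...",
]
qrels = parse_qrels(sample)
```

With this structure, the keys of `qrels["PSG-5"]` give the relevant documents of a topic, which is what a document level evaluation needs, while the XPath sets support a passage level evaluation.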



5     Submissions and Evaluations
5.1   Submissions to the Task
Fig. 2. A system for finding and saving relevant passages.

The submission format for the passage retrieval task required participants to submit
text files with retrieval results similar to the qrel format shown above. The number
of documents considered relevant per topic had to be limited to 100; the number of
relevant passages in a document was not limited. In addition to the columns of the
qrel format, the participant submissions had two more columns, one to specify the
order of the results, and another one to specify the retrieval score of a
passage/document.
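
A submission file of this kind can be read and checked against the 100 document limit as sketched below. The text does not state the order of the two additional columns, so the layout used here (topic, document, XPath, rank, score) is an assumption, as are the function names.

```python
from collections import defaultdict

MAX_DOCS_PER_TOPIC = 100  # limit stated in the task description


def parse_run(lines):
    """Parse run lines into (topic_id, doc_id, xpath, rank, score) tuples.

    Assumed column order: the three qrel columns, then rank, then score.
    Lines that do not have exactly five columns are skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 5:
            continue
        topic_id, doc_id, xpath, rank, score = parts
        rows.append((topic_id, doc_id, xpath, int(rank), float(score)))
    return rows


def topics_over_limit(rows, limit=MAX_DOCS_PER_TOPIC):
    """Return the topics that list more than `limit` distinct documents."""
    docs = defaultdict(set)
    for topic_id, doc_id, _xpath, _rank, _score in rows:
        docs[topic_id].add(doc_id)
    return sorted(t for t, d in docs.items() if len(d) > limit)
```

Counting distinct documents rather than lines matters here, because a run may list many passages from the same document without exceeding the limit.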
    Three participants submitted experiments to the Passage Retrieval task; two of
them also included relevant passages in their submissions. Their experiments used
a two-step approach. In the first step, relevant documents were retrieved using various
retrieval solutions including Okapi BM25, Language Models and TF-IDF, and Vector
Space Models. The participant from Georgetown University (USA) experimented
with various sources for query terms by extracting words from claims and titles, using
hyphenating-phrases, Part of Speech tagging and weighted filtering [4]. The team from
Innovandio S.A. (Chile) also experimented with a CL-ESA Wikipedia-based multilingual
retrieval model ([10], [8], section 3).
    The third participant in the task, a team of researchers from Vienna University
of Technology and the University of Macedonia, Thessaloniki, used a distributed IR
system that queried a split Clef-Ip collection. The split is done by exploiting the
hierarchical structure of the International Patent Classification System (Ipc). By
dividing the collection into several sub-collections (by Ipc class, subclass, and subgroup)
the patents are organized according to their technological topic. Then the Lemur
indexer was used to index the title, abstract, description, claims, inventor, applicant and
Ipc class information [3]. CORI and a multilayer method were used for selecting
the sources (sub-collections) on which the retrieval should be performed as well as for
joining the results.
    In the figures below, the submission files prefixed by `In' belong to the
participant from Chile, the submission files prefixed by `GU' belong to the participant from
Georgetown University, and the ones prefixed by `TM' were sent in by the team from
Vienna and Thessaloniki.


5.2    Evaluating the Retrieval Results
Three participants submitted a total of 19 runs. Out of these, 8 runs did not provide
retrieved passages.
    We did evaluations at two levels: one at the passage level and one at the patent
document level. The evaluation at patent document level was done, as in the previous
years, by computing the Recall, Map, and PRES ([6]) at cutoff 100. At the passage
level we first computed, for each relevant document retrieved, the precision and average
precision w.r.t. the passages retrieved, then averaged them over the number of relevant
documents per topic. Finally, averaging these scores over all topics we obtain the
precision and mean average precision scores at the passage level. The evaluation script is
available for download on the Clef-Ip project website⁶.
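
The passage level computation can be illustrated with a short sketch. It follows one reading of the description above (relevant documents that were not retrieved contribute zero to the per-topic average) and is not the official evaluation script; the function names and the dictionary layout of runs and qrels are assumptions made for this example.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision(ranked, relevant):
    """Fraction of the retrieved items that are relevant."""
    if not ranked:
        return 0.0
    return sum(1 for item in ranked if item in relevant) / len(ranked)


def passage_scores(run, qrels):
    """Passage level precision and mean average precision.

    run:   {topic: {doc: [xpath, ...] in rank order}}
    qrels: {topic: {doc: set(xpaths)}}
    """
    p_topics, ap_topics = [], []
    for topic, rel_docs in qrels.items():
        if not rel_docs:
            continue  # a topic without relevant documents cannot be scored
        retrieved = run.get(topic, {})
        p_sum = ap_sum = 0.0
        for doc, rel_passages in rel_docs.items():
            if doc in retrieved:  # only relevant documents that were retrieved
                p_sum += precision(retrieved[doc], rel_passages)
                ap_sum += average_precision(retrieved[doc], rel_passages)
        # average over the number of relevant documents of the topic
        p_topics.append(p_sum / len(rel_docs))
        ap_topics.append(ap_sum / len(rel_docs))
    if not p_topics:
        return 0.0, 0.0
    # average over all scored topics
    return sum(p_topics) / len(p_topics), sum(ap_topics) / len(ap_topics)
```

For example, with one topic that has two relevant documents, of which only the first is retrieved with passages `["a", "x", "b"]` against relevant passages `{"a", "b"}`, the first document scores precision 2/3 and average precision 5/6, and the topic scores are half of these values.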
    Several simple file clean-up operations had to be done in order to ensure that the
document encodings matched the format expected by the evaluation script. These
operations included duplicate removal, re-grouping the retrieval results such that results
belonging to one topic were in a contiguous portion of the files, and removing the XPaths
referring to headings in the patent document Xml files. This last operation was done
because headings are not consistently marked as such in the Clef-Ip collection's
documents, and were left out of the relevance judgements as well.
6
    http://www.ifs.tuwien.ac.at/~clef-ip
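
The clean-up operations described in this section (duplicate removal, regrouping by topic, dropping heading XPaths) can be sketched as a single pass over the run rows. This is an illustration only: the row layout (topic, document, XPath, rank, score) and the assumption that heading passages can be recognized by a `/heading` step in their XPath are hypothetical and should be adapted to the actual files.

```python
def clean_run(rows, heading_marker="/heading"):
    """Clean a list of (topic_id, doc_id, xpath, rank, score) tuples.

    Drops XPaths that point at headings, removes duplicate
    (topic, document, XPath) entries keeping the first one, and regroups
    the rows so that each topic occupies a contiguous block.
    """
    seen = set()
    kept = []
    for row in rows:
        topic_id, doc_id, xpath = row[0], row[1], row[2]
        if heading_marker in xpath:
            continue                      # drop XPaths that refer to headings
        key = (topic_id, doc_id, xpath)
        if key in seen:
            continue                      # drop duplicates, keep first occurrence
        seen.add(key)
        kept.append(row)

    # Regroup: topics in order of first appearance, rows by rank within a topic.
    first_seen = {}
    for row in kept:
        first_seen.setdefault(row[0], len(first_seen))
    kept.sort(key=lambda r: (first_seen[r[0]], r[3]))
    return kept
```

Because Python's sort is stable, rows of the same topic that share a rank keep their original relative order.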
        Fig. 3. Evaluation results, ordered by Recall.




Fig. 4. Evaluation results, document level Recall per language.
           Fig. 5. Second evaluation round, results ordered by Recall.

    We then ran several evaluations depending on the degree of relevance assigned to
the citation documents in the search reports. In each round we computed all of the
measures mentioned above; we will not, however, present all of them.
    The first evaluation round considered all documents in the relevance judgements as
equally relevant and did evaluations on four sets of topics: the set of all 147 topics,
the subset of 50 English topics (1-50), the subset of 49 German topics (51-100), and
the subset of 48 French topics (102-149). The results of these evaluations are shown in
Figures 3 and 4. The zero values in the figures belong to the runs that did not contain
relevant passages.
    Next we were interested in the metric scores when only the highly relevant citation
documents were considered, ignoring the applicant citations. From the 147 topics only
116 have highly relevant citations in the Clef-Ip corpus, so the new evaluation round
is done for this smaller set. Figures 5 and 6 show plots for the metrics for this smaller
topic set, for the 38 English topics, for the 42 German topics, and for the 22 French
topics in it.
    To compare how the different retrieval strategies perform with respect to the
different sets of relevant documents required (highly relevant only, or both highly
relevant and relevant) we computed a third round of evaluations, where we restricted
the set of qrels used in the first round of evaluation to the 116 topics evaluated in
the second round. Although we computed all the mentioned metrics for all three
languages, we present only the Recall and Map(D) results for the whole set of
116 topics, in Figure 7.


6    Final Words
Fig. 6. Second evaluation round, document level Recall per language.

Fig. 7. Third evaluation round, document level Recall and Map(D).

This paper presented the activities we carried out to organize the Passage Retrieval
Starting from Patent Claims Task in Clef-Ip 2013. We started with selecting patent
application documents and sets of claims in these documents that were our final
topics. The most time consuming part of these activities has been extracting the XPaths
to relevant passages identified by patent experts in their search reports. Participants
were not given any specific queries, but were allowed to build them out of the
information provided in the topics: claims, patent application document, previously
published family member documents.
     Over 20 teams registered to submit retrieval experiments to this task, a number
similar to the number of registrations in the previous years. We received submissions
from three groups, two of them with relevant passage information as well.


Acknowledgements This work was partly supported by the EU Network of Excellence
PROMISE (FP7-258191) and the Austrian Research Promotion Agency (FFG)
FIT-IT project IMPEx (No. 825846).


References
 1. ***. Patent Cooperation Treaty. 1970. Last retrieved: March 2013.
 2. ***. Guidelines for Examination in the European Patent Office, 2012.
    www.epo.org/law-practice/legal-texts/guidelines.html, last retrieved: June 2013.
 3. Anastasia Giachanou, Michail Salampasis, Maya Satratzemi, and Nikolaos Samaras.
    Report on the CLEF-IP 2013 Experiments: Multilayer Collection Selection
    on Topically Organized Patents. In CLEF (Notebook Papers/LABs/Workshops),
    2013.
 4. Jiyun Luo and Hui Yang. Query formulation for prior art search - Georgetown
    University at CLEF-IP 2013. In CLEF (Notebook Papers/LABs/Workshops), 2013.
 5. Yuanhua Lv and ChengXiang Zhai. When documents are very long, BM25 fails!
    In Wei-Ying Ma, Jian-Yun Nie, Ricardo A. Baeza-Yates, Tat-Seng Chua, and
    W. Bruce Croft, editors, Proceedings of SIGIR, pages 1103-1104. ACM, 2011.
 6. W. Magdy and G. J. F. Jones. PRES: A score metric for evaluating recall-oriented
    information retrieval applications. In SIGIR 2010, 2010.
 7. F. Piroi, M. Lupu, A. Hanbury, and V. Zenz. CLEF-IP 2011: Retrieval in the
    intellectual property domain, September 2011.
 8. Florina Piroi, Mihai Lupu, and Allan Hanbury. Overview of CLEF-IP 2013 Lab:
    Information Retrieval in the Patent Domain. In Proceedings of CLEF 2013, Lecture
    Notes in Computer Science, 2013. To appear.
 9. Florina Piroi, Mihai Lupu, Allan Hanbury, Alan P. Sexton, Walid Magdy, and
    Igor V. Filippov. CLEF-IP 2012: Retrieval experiments in the intellectual property
    domain. In CLEF (Online Working Notes/Labs/Workshop), 2012.
10. Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-based multilingual
    retrieval model. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven,
    and Ryen W. White, editors, Proceedings of ECIR, volume 4956 of Lecture Notes
    in Computer Science, pages 522-530. Springer, 2008.
11. A. Ritchie, S. Teufel, and S. Robertson. Creating a Test Collection for Citation
    based IR Experiments. In Proceedings of the main conference on Human Language
    Technology Conference of the North American Chapter of the Association of
    Computational Linguistics, 2006.