CLEF-IP 2012: Retrieval Experiments in the
Intellectual Property Domain
Florina Piroi1, Mihai Lupu1, Allan Hanbury1,
Walid Magdy2, Alan P. Sexton3, Igor Filippov4
1 Vienna University of Technology, Institute of Software Technology and Interactive Systems, Favoritenstrasse 9-11, 1040 Vienna, Austria
2 Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
3 University of Birmingham, School of Computer Science, Edgbaston, Birmingham, B15 2TT, United Kingdom
4 Chemical Biology Laboratory, SAIC-Frederick, Inc., Frederick National Lab, Frederick, Maryland, 21702, USA
Abstract. The Clef-Ip test collection was first made available in 2009
to support research in IR methods in the intellectual property domain.
Since then, several kinds of tasks, reflecting various specific parts of a patent
expert's workflow, have been organized. We give here an overview of
the tasks, topics, assessments and evaluations of the Clef-Ip 2012 lab.
1 Introduction
The patent system encourages innovation by giving an advantage to people
and/or companies that disclose their inventions to the public. The advantage
consists of exclusive rights on the published invention for a limited period
of time, usually 20 years. A requirement for obtaining a patent, i.e. exclusive
implementation and utilization rights, for an invention is that no similar invention
was previously disclosed. Because of the high economic impact of a granted
patent, it is important that the specific searches during the examination of patent
applications are thorough.
The current rate of technological development has resulted in a large increase in
the number of patent applications filed with the patent offices around the world.
To keep up with the increase in the amount of data, appropriate information
retrieval (IR) methods have to be developed and tested. The Clef-Ip test collection
gives IR tool developers a test bed to evaluate the performance of
their tools in the intellectual property area (specifically patents).
Since 2009 the Clef-Ip evaluation campaign (2009) and benchmarking labs
(2010 and 2011) have posed their participants tasks that reflect specific parts of
a patent examiner's workflow: finding prior art of a patent application, classifying
the patent application according to the International Patent Classification
System, using images in patent searches, and classifying images occurring in patents.
This year, three tasks were organized, each concerning a different aspect
of the data that can be found in a patent collection:
Passage retrieval starting from Claims. The topics in this task were claims in
patent application documents. Given a claim or a set of claims, the participants
were asked to retrieve relevant documents in the collection and mark
out the relevant passages in these documents.
Flowchart Recognition Task. The topics in this task were patent images representing
flow-charts. Participants in this task were asked to extract the
information in these images and return it in a predefined textual format.
Chemical Structure Recognition Task. The topics in this task were patent
pages in TIFF format. Participants had to identify the location of the chemical
structures depicted on these pages and, for a specified subset of those
diagrams, return the corresponding structure in a MOL file (a chemical structure
file format).
The ideas behind the Clef-Ip task definitions and their organization are instances
of the use cases defined in the frame of the PROMISE1 project.
1 http://www.promise-noe.eu
The rest of the paper is organized as follows: Section 2 describes the Clef-Ip
collection and each of the organized tasks. We detail, for each of the tasks, the
idea behind proposing such a task, the sets of topics and their relevance judgements,
and the measures used in assessing the retrieval effectiveness. Section 3
presents the results of the evaluation activities for each of the three tasks and a
list of participants. We conclude with Section 4.
2 The 2012 Clef-Ip Collection
This year's collection corpus is the same as the one used in the Clef-Ip 2011 lab,
i.e. it contains patent documents derived from European Patent Office (Epo) and
World Intellectual Property Organization (Wipo) sources, stored as Xml files,
corresponding to over 1.5 million patents published until 2002. This year's collection
does not include the image data used in the Image Classification and Image
Retrieval task in the 2011 lab [6]. For a detailed description of the Clef-Ip
collection we refer the reader to the previous Clef-Ip overview notes [6,7,8].
For a description of key terms and steps in a patent's lifecycle see [5]. To make
this paper self-contained, we re-state some facts about the Clef-Ip collection.
In the process of patenting, several, and potentially many, documents are
published by a patent office. The most common ones are the patent application,
the search report, and the granted patent. In most cases the search report is
contained in the published patent application; otherwise it is published at a later
date. Each patent office has its own identification system to uniquely distinguish
the different patents. At the Epo and Wipo all patent documents belonging to
the same patent are assigned the same numerical identifier. The different types
of documents (application, granted patent, additional search reports, etc.) are
distinguished by kind codes appended to the numerical identifier. For example,
EP-0402531-A1 is the identifier of the patent application document (kind code A1)
of the European patent number 0402531, while EP-0402531-B1 identifies the
European patent specification (i.e. the text of the granted patent)2.
The EP and WO documents in the Clef-Ip collection are stored in an Xml
format and are extracted from the Marec data collection. Marec contains
over 19 million patent documents published by the Epo, Wipo, the United States
Patent and Trademark Office and the Japan Patent Office, storing them in a
unified Xml format under one document type definition.
All Xml documents in the Clef-Ip collection, according to the common
Dtd, contain the following main Xml fields: bibliographic data, abstract, description,
and claims. Not all Xml documents actually have content in these
fields. For example, an A4 kind document is a supplementary search report and
will therefore not usually contain the abstract, claims and description fields.
The documents have a document language assigned to them (English, German
or French). The main Xml fields can also have one of the three languages assigned
to them, which can be different from the document language. Some Xml
fields can occur more than once with different language attributes attached to
them. For example, EP patent specification documents (EP-nnnnnnn-B1.xml)
must contain claims in three languages (English, German and French).
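As an illustration, the overall layout of such a document can be sketched as follows; only the patent-document, claims and claim element names are confirmed by the XPaths used in the topics (see Section 2.1), the remaining element names and attributes are assumptions based on the field names listed above.

<!-- illustrative sketch only; element names other than patent-document,
     claims and claim are assumptions -->
<patent-document lang="EN" kind="A1">
  <bibliographic-data> ... </bibliographic-data>
  <abstract lang="EN"> ... </abstract>
  <description lang="EN"> ... </description>
  <claims lang="EN">
    <claim> ... </claim>
  </claims>
</patent-document>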
We continue this section with a more detailed description of the three tasks
organized in Clef-Ip 2012.
2.1 Passage Retrieval Starting From Claims
The importance of the claims in a patent is given by their role in defining the
extent of the rights protection an applicant has secured. The decisions taken by
patent examiners at patent offices, when examining a patent application, refer
to claims in the application document and provide a list of previously published
(patent) documents relevant for the application at hand. Furthermore, the examiners
often underline passages in these documents to support their decisions.
The idea behind organizing this task is to investigate the support an IR
system could give an IP expert in retrieving documents and passages relevant to
a set of claims. The topics in this task are sets of claims extracted from actual
patent application documents. Participants were asked to return documents in
the Clef-Ip 2012 corpus and mark the passages relevant to the topic claims.
The participants were provided with a set of 51 training topics. Splitting
by the document language of the application document where the topic claims
appear, 18 topics were in German, 21 in English and 12 in French. For the test
set we have created 105 topics, with 35 in each language. Both training and
test topics were created manually. We describe below the steps in obtaining the
topics and their relevance judgments.
2 A list of Epo kind codes is available at https://register.epo.org/espacenet/help?topic=kindcodes.
Kind codes used by the Wipo are listed at http://www.wipo.int/patentscope/en/wo_publication_information/kind_codes.html
Topic Selection. Before actually creating the topics, we first created a pool
of candidate documents out of which we extracted the sets of topic claims. The
documents in the pool were patent applications not occurring in the Clef-Ip
data corpus (i.e. published after 2001), with content in all main Xml fields, and
with two to twelve citations in the collection corpus listed in their search report.
We counted only the highly relevant citations (marked with `X' or `Y' on
the search reports), leaving out the citations about the technological background of
the invention (marked with `A' on the search reports). (In the Clef-Ip terminology,
we have thus used only highly relevant direct citations occurring in
the Clef-Ip corpus [8].) In the Clef-Ip collection, there are few patents with
a higher number of highly relevant citations; these are usually very large
documents with a large patent family which, on manual inspection for
candidate topics, proved difficult to handle.

Fig. 1. Extract from a search report.

The next step in creating the topics and their qrels consisted of examining the
search reports to extract sets of claims
and their relevant documents together with the passages in the documents. For
each highly relevant citation document in the search report and in our collection,
we extracted the claim numbers3 the citation referred to. These formed the sets
of claims for a candidate topic. We then looked at the mentions of relevant
passages in the cited document and decided whether the candidate topic could be kept
in the set of topics. A topic candidate was rejected when:
the relevant documents referred to figures only;
no relevant passage specifications were given for the listed citations, or it
was mentioned that the whole document is relevant;
the search report carried the mention `Incomplete search', which usually means
that the patent expert's search did not cover all claims.
From one patent application document it was possible to extract several sets of
claims as topics, often with completely different sets of relevance judgments.
We illustrate how this step was done by an example: Figure 1 shows a part of
the search report for the EP-1384446 patent application. The numbers on the
right hand side represent claim numbers. Two sets of claims can be identified:
{1,2,3,7} and {8}. Leaving the references to figures out, there is enough information
to identify relevant passages in the two relevant citations, WO-0126537
and EP-1101450, so we kept the two sets of claims as topics4.
3 Claims in patent documents are numbered for ease of reference.
Concretely, a topic in this task is formulated as:
tPSG-5
EP-1480263-A1.xml
/patent-document/claims/claim[1] /patent-document/claims/claim[2]
/patent-document/claims/claim[3] /patent-document/claims/claim[16]
/patent-document/claims/claim[17] /patent-document/claims/claim[18]
The text files with the retrieval results that the participants submitted to this
task had the following format:
topic_id Q0 doc_id rel_psg_xpath psg_rank psg_score
where:
topic_id is the identier of a topic
Q0 is a value maintained for historical reasons
doc_id is the identier of the patent document in which the relevant passages
occur
rel_psg_xpath is the XPath identifying the relevant passage in the doc_id
document
psg_rank is the rank of the passage in the overall list of relevant passages
psg_score is the score of the passage in the (complete) list of relevant pas-
sages
Only one XPath per line was allowed in the result files. If more passages are
considered relevant for a topic, these have to be placed on separate lines. The
number of lines in a result file is limited such that it contains, per topic, at most
100 distinct doc_ids when the XPaths are ignored.
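For illustration, the first lines of a submission could look as follows; the identifiers, XPaths and scores are purely hypothetical and only meant to show the field layout:

PSG-5 Q0 EP-1101450-A1.xml /patent-document/description/p[12] 1 14.2
PSG-5 Q0 EP-1101450-A1.xml /patent-document/claims/claim[3] 2 11.8
PSG-5 Q0 WO-0126537-A1.xml /patent-document/description/p[4] 3 9.5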
Creating the Relevance Judgments. Both in the topics and in the relevance
judgements, reference to the claims and relevant passages is encoded by means
of XPaths. Where the search report referred to specific lines rather than paragraphs,
we took as relevant the set of paragraphs fully covering those lines. Once
the topics were chosen, the last step was to actually create the files with the relevance
judgements. We did this manually, by matching the passage indications in the
search reports with the content of the patent documents and with the content
and XPaths stored in the Xml files. To ease this process we used a system
developed in-house, a screenshot of which can be seen in Figure 2. We see in
Figure 2 that the qrel generating system has three main areas:
a topic description area where, after typing in the patent application document
identifier (here, EP-1384446-A1), we can assign the topic an identifier
(unique in the system), define the set of claims in the topic, save it,
and navigate among its relevant documents with the `Prev' and `Next' buttons.
4 In the end, from the EP-1384446 application document we extracted five topics: PSG30 to PSG34.
Fig. 2. Creating the qrels.
a claim structure area where we display the claims and the claim tree. Also
in this area we give a direct link to the application document on the Epo
Patent Register server.
a qrel definition area where individual passages (corresponding to XPaths in
the Xml documents) are displayed. Clicking on them will select them to be
part of the topic's qrels. For convenience, we provide three buttons by which
we can select with one click all of the abstract's, description's or claims'
passages. When clicking on the 'Save QREL' button the selected passages
are saved in the database as relevant passages for the current topic.
Evaluation Measures. The evaluation of the passage retrieval task was carried
out on two levels: the document level and the passage level. The objective of
measuring a system's retrieval quality at the document level is to evaluate its
performance in retrieving whole relevant patent documents. The objective
of the passage-level evaluation is to measure how well a system ranks the
relevant passages in the relevant patent documents.
Regarding the document-level evaluation, we focused on recall-oriented retrieval
measures. The Patent Retrieval Evaluation Score (Pres) [3], Recall, and Map
were used for evaluating the system performance in retrieving the documents
relevant to a given topic. The cut-off used for the computations was 100 patent
documents (not passages!) in the results list.
At the passage level we measured system performance for ranking the passages
in a given relevant document according to their relevance to the topic.
The measures of choice are mean average precision (Map) and precision. In more
detail, the scores are calculated as follows:
For each document relevant to a given topic, we compute the average precision
(AP) and precision. To differentiate them from the usual average precision
and precision measures calculated at topic level, we call them AP at document
level, AP(D), and precision at document level, Precision(D). Then, for a certain
topic T and a relevant document D_i, the average precision at the D_i document
level is computed by the following equation:

AP(D_i, T) = \frac{1}{n_p(D_i, T)} \sum_{r} Precision(r) \cdot rel(r)    (1)

where n_p(D_i, T) is the number of relevant passages in the document D_i for the
topic T, Precision(r) is the precision at rank r in the ranked passage list, and
rel(r) is the relevance of the passage at rank r, a value in the set {0,1}. The precision at
the D_i document level for the topic T, Precision(D_i, T), is the percentage of
retrieved relevant passages in the list of all retrieved passages for the document
D_i.
We thus compute AP(D) and Precision(D) for all the relevant documents
of a given topic, and the average of these scores is calculated to get the score of
the given topic:

AP(D, T) = \frac{\sum_{i} AP(D_i, T)}{n(T)}, \quad Precision(D, T) = \frac{\sum_{i} Precision(D_i, T)}{n(T)}    (2)

where AP(D, T) is the average precision per document for topic T, Precision(D, T)
is the precision per document for topic T, and n(T) is the number of relevant documents
for topic T.
Finally, Map(D) and Precision(D) are computed as the mean of AP(D, T)
and Precision(D, T) across all topics.
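The computation can be summarized by the following Python sketch; it is illustrative only, not the official evaluation scripts, and the data structures and function names are ours.

# Minimal sketch of the passage-level measures defined above (illustrative,
# not the official evaluation code). For one topic, `qrels` maps each relevant
# document id to the set of its relevant passage XPaths, and `run` maps a
# document id to the ranked list of passage XPaths returned by a system.

def ap_at_document_level(ranked_passages, relevant_passages):
    # AP(D_i, T): Equation (1)
    hits, precision_sum = 0, 0.0
    for rank, passage in enumerate(ranked_passages, start=1):
        if passage in relevant_passages:      # rel(r) = 1
            hits += 1
            precision_sum += hits / rank      # Precision(r)
    return precision_sum / len(relevant_passages) if relevant_passages else 0.0

def precision_at_document_level(ranked_passages, relevant_passages):
    # Precision(D_i, T): fraction of retrieved passages that are relevant
    if not ranked_passages:
        return 0.0
    hits = sum(1 for p in ranked_passages if p in relevant_passages)
    return hits / len(ranked_passages)

def topic_scores(run, qrels):
    # AP(D, T) and Precision(D, T): Equation (2)
    ap_sum = prec_sum = 0.0
    for doc_id, relevant_passages in qrels.items():
        ranked = run.get(doc_id, [])
        ap_sum += ap_at_document_level(ranked, relevant_passages)
        prec_sum += precision_at_document_level(ranked, relevant_passages)
    n = len(qrels)
    return ap_sum / n, prec_sum / n

# Map(D) and Precision(D) are the means of these two scores over all topics.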
The Map(D) and Precision(D) measures are similar to the measures used in the
INEX evaluation track's `Relevant in Context' tasks [2], where, instead of
sequences of characters, we look at XPaths.
2.2 Flowchart Recognition
Patent images play an important role in the everyday work of IP specialists
by helping them make quick decisions on the relevance of particular documents
to a patent document. The design of the Flowchart Recognition task in the
Clef-Ip lab aims at making the content of patent images searchable and
comparable, a topic of high interest for IP specialists.
The topics in this task are patent images representing flow-charts. Participants
in this task were asked to extract the information in these images and
return it in a predefined textual format. The set of training topics contained 50
flow-charts together with their textual representation (the qrels). The set of test
topics contains 100 flow-charts. All images are black-and-white TIFF files. We were
not interested, at this point, in the connection between the patent images and the patent
documents they occur in.
Topic Selection Our job in selecting the topics for this task was much easier
than in the `Claims to Passage' task. This is due to the fact that in 2011
the Clef-Ip lab had organized an image classification task, where one of the
classification classes was flow-charts. Having already a set of images containing
flow-charts, we only had to browse through them and select images for the topic sets.
Since we chose to model the flow-charts as graphs, we left out from the topic
selection images with 'wrong' graph data, such as edges ending in white space.
Creating Relevance Judgments. Once the textual representation of the flow-charts
was fixed, we manually created the textual representations for each
of the topics in the training and test sets. In Figure 3 we can see an example of a
flow-chart image and its textual representation. In the textual representation of
a flow-chart, MT stands for meta information about the flow-chart (like number of
nodes, edges, title), lines starting with NO describe the nodes of the graph, lines
starting with DE describe directed edges, while lines starting with UE describe
undirected edges in the graph. Lines beginning with CO denote comments that
are not to be automatically processed.
Fig. 3. A ow-chart image and its textual representation.
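To give a flavour of the format, a small, purely hypothetical flow-chart could be encoded along the following lines; only the line prefixes (MT, NO, DE, UE, CO) are taken from the description above, the layout of the fields within each line is our own illustration:

CO  purely illustrative example, not the official field layout
MT  title=Example nodes=3 edges=2
NO  n1 start-end Start
NO  n2 operation Process input
NO  n3 start-end End
DE  n1 n2
DE  n2 n3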
Evaluation Measure Since flow-charts can be modeled as graphs, to assess
how well the image recognition is done in this specific task, the main evaluation
measure is the graph distance metric based on the mcs, the maximal common
subgraph (see [1,9]). The distance between the topic flowchart F_t and the submitted
flowchart F_s is computed as:

d(F_t, F_s) = 1 - \frac{|mcs(F_t, F_s)|}{|F_t| + |F_s| - |mcs(F_t, F_s)|}    (3)

where |.| denotes the size of the flowchart/graph. In our case, the size of the
flowchart is the number of edges plus the number of nodes.
The distance between the topic and the submitted flowcharts is to be computed
at three levels:
basic: only the flowchart structure is taken into consideration, i.e. nodes and
edges without type information and without text labels.
intermediate: use the flowchart structure together with the node types.
complete: flowchart structure, node types, text labels.
To actually compute the distance we use an in-house implementation of the
McGregor algorithm for computing maximal common subgraphs [4].
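As a small illustration, assuming the size of the maximal common subgraph has already been determined (e.g. by a McGregor-style backtracking search), the distance of Equation (3) amounts to the following Python sketch:

# Illustrative sketch of Equation (3); the mcs size is assumed to be given.
def graph_size(num_nodes, num_edges):
    # the size of a flowchart is its number of nodes plus its number of edges
    return num_nodes + num_edges

def mcs_distance(size_topic, size_submitted, size_mcs):
    # d(Ft, Fs) = 1 - |mcs| / (|Ft| + |Fs| - |mcs|)
    return 1.0 - size_mcs / (size_topic + size_submitted - size_mcs)

# Example: a topic flow-chart with 10 nodes and 9 edges, a submitted one with
# 8 nodes and 7 edges, and a maximal common subgraph of size 12 give
# mcs_distance(graph_size(10, 9), graph_size(8, 7), 12) = 1 - 12/22, about 0.45.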
In the process of developing these evaluations, we have found that the complete
evaluation is better served by a combination of node type matching and
edit-distance measuring of the text labels. This is because we cannot make a
hard match between the OCRed text provided by the participants and the text
in the gold standard. Therefore, we allow the McGregor algorithm to generate
all possible maximal common subgraphs, and compute the best, the average and
the distribution of the edit-distances between the nodes of each of these subgraphs and
the gold standard. This is unfortunately extremely time-consuming.
2.3 Chemical Structure Recognition
Given the importance of chemical molecular diagrams in patents, the ability
to extract such diagrams from documents and to recognize them to the extent
necessary to automatically compare them for similarity or identity to other di-
agrams is a potentially very powerful approach to identifying relevant claims.
This task was divided into two parts: segmentation and recognition.
Segmentation For this sub-task, 30 patents were selected and rendered to 300 dpi
monochrome multipage TIFF images, and all chemical molecular diagrams were
manually clipped from the images using a bespoke tool. This clipping tool
recorded the minimal bounding box size and coordinates of each diagram clipped,
and the results were recorded in a ground-truth comma-separated value (CSV) file.
The participants were asked to produce their own results CSV file containing
this bounding box clip information for each diagram that their systems could
identify.
Another bespoke tool was written to automatically compare the participants'
results file with the ground-truth file. This tool identified matches at various tolerance
levels, where a match is awarded if every side of a participant's bounding
box is within the tolerance number of pixels of the corresponding side
of a ground-truth bounding box. Evaluation results were calculated for each of a
range of tolerances, starting at 0 pixels and increasing to the maximum number
of pixels that still prevented any single participant bounding box from matching
more than one ground-truth bounding box. This maximum limit in practice
was 55 pixels, or just under 0.5 cm.
The numbers of true positive, false positive and false negative matches were
counted for each tolerance setting, and from these the precision, recall and F1-measure
were calculated.
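The scoring just described can be summarized by the following sketch; it is illustrative only, not the bespoke comparison tool, and it assumes bounding boxes are given as (left, top, right, bottom) pixel coordinates.

def sides_within(predicted_box, truth_box, tolerance):
    # every side of the predicted box must be within `tolerance` pixels
    return all(abs(p - t) <= tolerance for p, t in zip(predicted_box, truth_box))

def segmentation_scores(predicted_boxes, ground_truth_boxes, tolerance):
    matched_truth = set()
    true_pos = 0
    for predicted in predicted_boxes:
        for i, truth in enumerate(ground_truth_boxes):
            if i not in matched_truth and sides_within(predicted, truth, tolerance):
                matched_truth.add(i)   # a ground-truth box can be matched only once
                true_pos += 1
                break
    false_pos = len(predicted_boxes) - true_pos
    false_neg = len(ground_truth_boxes) - true_pos
    precision = true_pos / (true_pos + false_pos) if predicted_boxes else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1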
Recognition A diagram recognition task requires the participants to take a set
of diagrams, analyse them into some recognised format and submit their recognised
format files for evaluation. In order to evaluate the results, these submitted
format files must be compared to a ground-truth set of format files. Therein lies a
difficult problem with respect to the chemical diagram recognition task for patent
documents. The currently most complete standard format for chemical diagrams
is the MOL file format. This format captures fully specified molecule diagrams
quite well. However, it has become standard in patent documents to describe
whole families or classes of molecules using diagrams that extend standard
molecule diagrams with graphical representations of varying structures, called
Markush structures. Markush structures cannot be represented in MOL files.
Given the standard nature of MOL files, there have been a significant number
of research and commercial projects to recognise diagrams with all the features
that can be represented by MOL files. However, without standard extensions
to MOL files to cope with Markush structures, there has been relatively little
effort expended in recognising such extended diagrams. With the intention of
fostering such efforts, the chemical structure recognition task for Clef-Ip 2012
was designed to expose participants to a relatively small number of the simpler
of these extended structures, while also providing a large number of cases fully
covered by the current MOL file standard.
A total of 865 diagram images, called the automatic set, were selected. The
diagrams in this set were fully representable in standard MOL files. Evaluation
of this set was carried out by automatically comparing the participants' submitted
MOL files with the ground-truth MOL files using the open source chemistry
toolbox OpenBabel. The key element in this comparison is the InChI representation
(International Chemical Identifier). OpenBabel was chosen among other tools offering
similar functionality because it is free and available to everyone. The number
of correctly matched diagrams (and the percentage correctly matched) was reported
for each participant.
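A minimal sketch of such a comparison, using OpenBabel's Python bindings, could look as follows; this is illustrative only, not the evaluation script actually used, and the import path shown is the one of OpenBabel 3.x (older releases use `import pybel` instead).

# Illustrative sketch: convert two MOL files to InChI with OpenBabel and compare.
from openbabel import pybel

def mol_to_inchi(mol_path):
    # read the first structure in the MOL file and convert it to an InChI string
    molecule = next(pybel.readfile("mol", mol_path))
    return molecule.write("inchi").strip()

def same_structure(submitted_mol, ground_truth_mol):
    # two MOL files are counted as a match if their InChI strings are identical
    return mol_to_inchi(submitted_mol) == mol_to_inchi(ground_truth_mol)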
A manual set of 95 images was chosen, containing images with some amount of
variability in their structure which can only be represented in MOL files by
some abuse of the MOL file standard. These cannot be automatically evaluated,
as the OpenBabel system cannot deal with the resulting structures. However,
such MOL files can still be rendered to an image format using the MarvinView
tool from ChemAxon. Thus it was possible to carry out the evaluation of this
set by manual visual comparison of the original image, the MarvinView-generated
image of the ground-truth MOL file for the image, and the MarvinView-generated
image of the participant's submitted MOL file. To this end a bespoke
web application was written to enable the organisers and participants to verify
the manual visual evaluation.
It was less than satisfactory to have to carry out the evaluation of this latter
set manually, and even more so that we had to exclude from the set structures
that appear in patent files but which cannot be rendered from (abused) MOL files
using MarvinView. This points strongly to a need in the community to develop
either an extension or an alternative to MOL files that can fully support common
Markush structures, together with the necessary ancillary tools for manipulating,
comparing and rendering such structures to images.
3 Submissions and Results
Table 1 gives a list of institutions that submitted experiments to the Clef-Ip
lab in 2012. Research collaborations between institutions can be identified by the
RunID. Note that two different groups from the Vienna University of Technology
have each contributed to differently identified submissions (tuw and lut).
Table 1. List of participants and runs submitted
RunID Institution Clm Fc Cs
uob University of Birmingham, School of Computer Science, UK x
bit Univ. of Applied Sciences, Information Studies, Geneva, CH x
chm Chemnitz University of Technology, Department of Computer Science, DE x
cvc Computer Vision Center, Universitat Autònoma de Barcelona, ES x
hild Univ. Hildesheim, Information Science, DE x
humb-inr Humboldt Univ., Dept. of German Language and Linguistics, DE x
humb-inr INRIA, FR x
joann Joanneum Research, Institute for Information and Communication Technologies, AT x
lut University of Lugano, CH x
tuw Univ. of Macedonia, Department of Applied Informatics, Thessaloniki, GR x
saic Chemical Biology Laboratory, SAIC-Frederick Inc., US x
lut Vienna University of Technology, Inst. for Software Technology and Interactive Systems, AT x
tuw Vienna University of Technology, Inst. for Software Technology and Interactive Systems, AT x
tuw Univ. of Wolverhampton, School of Technology, UK x
Total: 31 13 7
3.1 Evaluation Results
Claims to Passage As stated in Section 2.1, we computed two sets of measures,
one at the document level, very similar to the measurements done in previous
years, and one at the passage level. To compute measures at the document
level we ignored the passage information in the participants' submissions
and kept only the topic and document identifiers. On these we
computed Pres, Recall and Map, both for the complete set of topics and
split by language. At the passage level, we computed Map- and Precision-like
measures, by computing passage AP, respectively passage Precision, for each
relevant document, then averaging over the topic's relevant documents. The final
scores are obtained by averaging over all queries.
The solutions chosen by the submitting participants range from two-step retrieval
approaches, namely a document-level retrieval in the first step and a
passage-level retrieval in the second step (bit and chm), to using Natural Language
Processing techniques (lut). The tuw team used a distributed IR system by
splitting the Clef-Ip collection by Ipc codes, while the hild team experimented
with search types such as trigram-based searches. All participants used translation
tools on the generated queries.
Figure 4 presents the Pres, Map and Recall at the document level, and Figure 5
shows the Precision and Map at the passage level for the complete set of topics.
The tuw participant was left out of the passage-level evaluations because they
submitted experiments referring to documents only, not to passages.
Fig. 4. Measures at relevant document level.
Precision(D) and Map(D) give an indication of the system performance in
ranking the passages in the relevant documents, regardless of the document-level
retrieval quality.
Flow Chart Recognition Unfortunately, at the time of writing these workshop
notes, the execution of the evaluation program had not yet finished. We will post
the results on the project's website, http://www.ifs.tuwien.ac.at/~clef-ip
Chemical Structure Recognition Only one participant, saic, submitted an
attempt at the chemical molecular diagram segmentation subtask. They submitted
two runs, both using as input the multi-page TIFF files provided by the
organisers. The difference was that in one run they used tiffsplit to separate
the pages into individual files, while in the other they used OSRA's native file
reading/rendering capability. They achieved significantly better performance with
the latter; the results are presented in Table 2 (note that a tolerance of 55 pixels is just
under 0.5 cm).
Both saic and uob submitted result sets (1 and 4 respectively) for
the diagram recognition sub-task (Table 3).
Fig. 5. Measures at relevant passage level.
Table 2. Chemical molecular diagram segmentation results.
Tolerance Precision Recall F1
0 0.70803 0.68622 0.69696
10 0.79311 0.76868 0.78070
20 0.82071 0.79543 0.80787
40 0.86696 0.84025 0.85340
55 0.88694 0.85962 0.87307
Table 3. Chemical diagram recognition results.
Automatic Set Manual Set Total
#Structures Recalled % #Structures Recalled % #Structures Recalled %
saic 865 761 88% 95 38 40% 960 799 83%
uob-1 865 832 96% 95 44 46% 960 876 91%
uob-2 865 821 95% 95 56 59% 960 877 91%
uob-3 865 821 95% 95 44 46% 960 865 90%
uob-4 865 832 96% 95 54 57% 960 886 92%
Unsurprisingly, both groups found the diagrams with varying elements
significantly more challenging than the more standard fixed diagrams.
4 Final Observations
We have described the benchmarking activities done in the frame of Clef-Ip
2012. One of the main challenges faced by the organizers was obtaining relevance
judgments and choosing topics for the `Passage Retrieval Starting from Claims'
task. The effort spent on this challenge prompted us to drop an additional pilot
task originally proposed for this year, in which we were interested in finding
description passages relevant to a claim in a patent application document.
Another challenge was finding proper measures to assess the effectiveness of the
passage retrieval task as formulated in the Clef-Ip lab, and of the `Flow-chart
Recognition' task. The proposed measures are intended as a starting point for further
discussions on the best way to assess these types of
information retrieval.
Acknowledgments This work was partly supported by the EU Network of Excellence
PROMISE (FP7-258191) and the Austrian Research Promotion Agency (FFG)
FIT-IT project IMPEx5 (No. 825846).
5 http://www.joanneum.at/?id=3922
References
1. Horst Bunke and Kim Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3-4):255–259, 1998.
2. Jaap Kamps, Jovan Pehcevski, Gabriella Kazai, Mounia Lalmas, and Stephen Robertson. INEX 2007 evaluation measures. In Focused Access to XML Documents, 6th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2007, Dagstuhl Castle, Germany, December 17-19, 2007. Selected Papers, volume 4862 of Lecture Notes in Computer Science, pages 24–33. Springer, 2008.
3. W. Magdy and G. J. F. Jones. PRES: A score metric for evaluating recall-oriented information retrieval applications. In SIGIR 2010, 2010.
4. James J. McGregor. Backtrack search algorithms and the maximal common subgraph problem. Softw., Pract. Exper., 12(1):23–34, 1982.
5. F. Piroi, M. Lupu, and A. Hanbury. Effects of Language and Topic Size in Patent IR: An Empirical Study. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, Third International Conference of the CLEF Initiative, CLEF 2012, Rome, Italy, September 17-20, 2012, Proceedings, volume 7488 of Lecture Notes in Computer Science. Springer, September 2012.
6. F. Piroi, M. Lupu, A. Hanbury, and V. Zenz. CLEF-IP 2011: Retrieval in the intellectual property domain, September 2011.
7. F. Piroi and J. Tait. CLEF-IP 2010: Retrieval experiments in the intellectual property domain. Technical Report IRF-TR-2010-00005, Information Retrieval Facility, Vienna, September 2010. Also available as a Notebook Paper of the CLEF 2010 Informal Proceedings.
8. G. Roda, J. Tait, F. Piroi, and V. Zenz. CLEF-IP 2009: Retrieval Experiments in the Intellectual Property Domain. In C. Peters, G.M. Di Nunzio, M. Kurimo, D. Mostefa, A. Penas, and G. Roda, editors, Multilingual Information Access Evaluation I. Text Retrieval Experiments, 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, volume 6241, pages 385–409. Springer, 2010.
9. Walter D. Wallis, Peter Shoubridge, Miro Kraetzl, and D. Ray. Graph distances using graph union. Pattern Recognition Letters, 22(6/7):701–704, 2001.