<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilayer Collection Selection and Search of Topically Organized Patents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michail Salampasis</string-name>
          <email>salampasis@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Giachanou</string-name>
          <email>agiahanou@uom.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Paltoglou</string-name>
          <email>g.paltoglou@wlv.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Applied Informatics, University of Macedonia</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inst. of Software Technology, and Interactive Systems, Vienna Univ. of Technology</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Technology, Univ. of Wolverhampton</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2002</year>
      </pub-date>
      <fpage>391</fpage>
      <lpage>397</lpage>
      <abstract>
        <p>We present a patent search system that explores three issues: (a) topical organization of patents based on their IPC codes, (b) collection selection over topically organized patent collections and (c) integration of collection selection tools into patent search systems. Patent documents produced worldwide have manually assigned classification codes, which in our work are used to cluster, distribute and index patents across hundreds or thousands of sub-collections. We propose a new collection selection method suitable for search systems whose documents are organized using hierarchical classification schemes such as IPC/CPC. The new method uses multiple evidence: for each collection, it also takes into account the ranking of its ancestor collections at higher levels of the classification hierarchy. We tested our method on CLEF-IP 2011 and compared its performance to state-of-the-art collection selection algorithms. We also integrated this method as a component suggesting patent collections in the iPerFedPat patent search system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Distributed Information Retrieval (DIR), also known as federated search [SC03a], offers users the capability
of simultaneously searching multiple online remote information sources through a single point of search. The
DIR process can be perceived as three separate but interleaved sub-processes: source representation, in which
surrogates of the available remote collections are created [CC01]; source selection, in which a subset of the
available information collections is chosen to process the query [CLC95, SC03b]; and results merging, in which
the separate results are combined into a single merged result list which is returned to the user [SC03a, PSS08].</p>
      <p>Although DIR has been explored for more than 15 years, it has not been widely used in search
applications, mainly because the centralized approach has proved to be scalable and more effective in web search.
In professional search, however, DIR could be more applicable and more suitable, because high-value
documents often exist that can be naturally organized into sub-collections. Patent search is a very good example
because patents have manually assigned classification codes, providing an environment where DIR techniques can
be applied effectively. In our study, the International Patent Classification (IPC) codes are used to topically
cluster, distribute and index patents across hundreds or thousands of sub-collections. Our system automatically
selects the best collections/IPCs for each query submitted to the system, which closely resembles the way
patent professionals conduct various types of patent search, especially invalidity search.</p>
      <p>The work presented in this paper is not a typical federated search study, since we focused on logically
clustering the patents rather than distributing them at a physical level. We created clusters of patents based
on their manually assigned IPC codes and tested different collection selection methods for the IPC/cluster
retrieval task. In that sense our work can equally be seen as a method for cluster-based document retrieval
using DIR selection methods. Accordingly, we make no claim that DIR improves the
efficiency of patent search. In fact, in patent search, where complete patent collections can be acquired
easily and indexed centrally, this is probably not the case at all. On the other
hand, because our method is based on clustering the patent documents and not on
physically distributing them, the DIR methods which we applied (for source selection and results merging) also
operate on the scale of milliseconds.</p>
      <p>We present a new collection selection method that follows a multilayer, multi-evidence process to suggest
collections taking advantage of the special hierarchical classification of patent documents. The proposed method
is compared to state-of-the-art collection selection algorithms (CORI, BordaFuse, Reciprocal Rank).</p>
      <p>Additionally, we explore the issues of integrating an IPC collection selection component into a prototype patent
search system called iPerFedPat (www.perfedpat.eu). This system is based on ezDL [Fuh11], a
framework for interactive search applications that integrate multiple information sources. The iPerFedPat system
has a pluggable architecture: it provides core services and operations and can integrate
multiple patent data sources and patent-related data streams, thus offering multiple patent search tools and
UIs. The main purpose of integrating the selection component in iPerFedPat is to assist the retrieval of relevant
IPCs during a long-session patent search. In many types of patent search (e.g. prior art search) the core relevant
IPCs have already been identified manually during pre-classification, before the search starts. However, there
are several situations (e.g. when the person conducting a patent search would like to validate the IPCs that have
already been assigned, or when a re-classification is needed) where tools assisting this task would be very useful.</p>
      <p>The rest of this paper is organized as follows. In Section 2 we present in detail how patents are topically
organized in our work using their IPC code. In Section 3 we describe the DIR techniques that were tested on
patent documents for our study and the new methodology for collection selection proposed in this paper. In
Section 4 we describe the details of our experimental setup and the results. We follow with a discussion of the
rationale of our approach in Section 5 and future work and conclusions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Topically Organised Patents for DIR</title>
      <p>In this paper, we extend our previous work on applying DIR methods to topically organized patents [SPG12]. We
propose a new collection selection method that surpasses previous source/IPC selection methods for topically
organised patents. Another collection selection study involving topically organized patents is reported in the
literature [LCC00]; however, that study was conducted many years ago with a different (USPTO) patent dataset.
Our approach to dividing patents is also different and closer to the way patent examiners actually conduct
patent searches, as we divide patents into a much larger number of sub-collections. Additionally, our approach
of applying CORI in multiple layers is new and, in our experiments, much more effective.</p>
      <p>All patents have manually assigned IPC codes [CC11]. The IPC is an internationally accepted standard taxonomy
for classifying, sorting, organizing, disseminating, and searching patents. It is officially administered by the World
Intellectual Property Organization (WIPO). The IPC provides a hierarchical system of language-independent
symbols for the classification of patents according to the different areas of technology to which they pertain. The IPC
currently has about 71,000 nodes, organized into a five-level hierarchy which is further extended
to finer levels of granularity. IPC codes are assigned to patent documents manually by technical specialists.</p>
      <p>Patents are classified using a number of different classification schemes. The European Classification (ECLA) and
the U.S. Patent Classification (USPC) are the best-known classification schemes, used by the EPO and the USPTO
respectively. Recently, the EPO and the USPTO signed a joint agreement to develop a common classification scheme
known as the Cooperative Patent Classification (CPC). The CPC, which has been developed as an extension of the
IPC, contains over 260,000 individual codes. For this study, patents were organized based on IPC codes because,
of the classification schemes available in CLEF-IP, the IPC is the most widely used and is also used by the EPO.</p>
      <p>Although IPC codes can be used to topically cluster patents into sub-collections, which is a
prerequisite for DIR, there are some important differences that motivated us to re-examine and adapt existing
DIR techniques for patent search. Firstly, IPC codes are assigned by humans in a very detailed and purposeful
assignment process, which is very different from the creation of sub-collections using automated clustering
algorithms or the naive division by chronological or source order, a division method which has been
used extensively in past DIR research. Also, patents are published electronically using a strict technical form
and structure [Ada10]. This characteristic is another reason to reassess existing DIR techniques, because these
have mainly been developed for unstructured and short documents such as newspaper articles or poorly structured
web documents. Another important difference is that patent search is recall-oriented: very high recall
is required in most searches [LMTT11]; for example, a single missed patent in a patentability search can invalidate a
newly granted patent. This contrasts with web search, where high precision of the initially returned results is the
requirement and for which DIR algorithms were mostly designed and evaluated [PSS08].</p>
      <p>Before we describe our study further, we should explain the IPC, which determines how we created the
sub-collections in our experiments. The top level of the IPC consists of eight sections: human necessities, performing
operations, chemistry, textiles, fixed constructions, mechanical engineering, physics, and electricity. A section is
divided into classes, which are subdivided into subclasses. A subclass is divided into main groups, which are further
subdivided into subgroups. In total, the current IPC has 8 sections, 129 classes, 632 subclasses, 7,530 main
groups and approximately 63,800 subgroups.</p>
      <p>Table 1 shows a part of the IPC. Section symbols use the uppercase letters A through H. A class symbol consists of a
section symbol followed by a two-digit number, such as F01 or F02. A subclass symbol is a class symbol followed by
an uppercase letter, such as F01B. A main group symbol consists of a subclass symbol followed by a one- to three-digit
number, a slash and 00, such as F01B7/00. A subgroup symbol replaces the final 00 of a main
group symbol with a two-digit number other than 00, such as F01B7/02. Each IPC node carries a noun-phrase
description which specifies the technical fields relevant to that IPC code. Note that a subgroup may
have more refined subgroups (i.e. defining a 6th level, a 7th level etc.). Hierarchies among subgroups are indicated not by
subgroup symbols but by the number of dot symbols preceding the node descriptions, as shown in Table 1.</p>
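      <p>The symbol structure described above can be illustrated with a minimal sketch. It assumes compact symbols of the form F01B7/02; the function name ipc_levels and the regular expression are illustrative choices and not part of the system described in this paper. Dot-level hierarchies among subgroups cannot be recovered from the symbol alone and are not handled.</p>
      <preformat preformat-type="code">
```python
import re

# Assumed input format: a compact IPC symbol such as "F01B7/02".
# Subgroup numbers are allowed to have two or more digits (an assumption).
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])(\d{1,3})/(\d{2,})$")


def ipc_levels(symbol: str) -> dict:
    """Return the section, class, subclass, main group and, if present, subgroup."""
    match = IPC_PATTERN.match(symbol.replace(" ", "").upper())
    if match is None:
        raise ValueError(f"not a recognised IPC symbol: {symbol!r}")
    section, cls, subclass, group, sub = match.groups()
    levels = {
        "section": section,                                  # level 1, e.g. F
        "class": section + cls,                              # level 2, e.g. F01
        "subclass": section + cls + subclass,                # level 3, e.g. F01B
        "main_group": f"{section}{cls}{subclass}{group}/00", # level 4, e.g. F01B7/00
    }
    # A symbol ending in /00 is itself a main group and has no subgroup part.
    if sub != "00":
        levels["subgroup"] = f"{section}{cls}{subclass}{group}/{sub}"
    return levels


if __name__ == "__main__":
    print(ipc_levels("F01B7/02"))
```
      </preformat>
      <p>In this sketch the subclass and main-group entries correspond to the levels at which the sub-collections of this study are formed.</p>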
    </sec>
    <sec id="sec-3">
      <title>Collection Selection</title>
      <sec id="sec-3-1">
        <title>Prior Work on Collection Selection</title>
        <p>There are a number of source selection approaches, including CORI [CLC95], gGlOSS [FPC+99], and others, that
characterize different collections using collection statistics such as term frequencies [SJCO02]. These statistics, which
are used to select or rank the available collections by their relevance to a query, are usually assumed to be available from
cooperative search providers. Alternatively, statistics can be approximated by sampling uncooperative providers
with a set of queries [CC01]. The main characteristic of CORI, which is probably the most widely used and
tested source selection method, is that it creates a hyper-document representing all the member documents of a
sub-collection.</p>
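        <p>To make the hyper-document idea concrete, the following sketch scores one collection with the CORI belief formula as it is commonly stated in the DIR literature. The constants 50, 150 and b = 0.4 are the defaults usually quoted there; the function names are illustrative, and the Lemur implementation used in this study may differ in detail.</p>
        <preformat preformat-type="code">
```python
import math


def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    """Belief that a collection satisfies one query term (CORI-style).

    df: documents in this collection containing the term
    cf: collections containing the term
    cw: number of words in this collection; avg_cw: average over collections
    """
    if df == 0 or cf == 0:
        return b  # no evidence for the term: only the default belief remains
    t = df / (df + 50.0 + 150.0 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1.0 - b) * t * i


def cori_score(term_stats, cw, avg_cw, num_collections):
    """Average the per-term beliefs over all query terms.

    term_stats: one (df, cf) pair per query term.
    """
    beliefs = [cori_belief(df, cf, cw, avg_cw, num_collections) for df, cf in term_stats]
    return sum(beliefs) / len(beliefs)


if __name__ == "__main__":
    # Two query terms, one collection of average size among ten collections.
    print(cori_score([(50, 2), (5, 8)], cw=100_000, avg_cw=100_000, num_collections=10))
```
        </preformat>
        <p>Because the statistics are computed over the whole sub-collection, the score depends on how well the hyper-document separates one sub-collection from the others.</p>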
        <p>The Decision-Theoretic Framework (DTF) presented by Fuhr [Fuh99] is one of the first attempts to approach
the problem of source selection from a theoretical point of view. DTF
produces a ranking of collections with the goal of minimizing the costs incurred, under the assumption that
retrieving irrelevant documents is more expensive than retrieving relevant ones.</p>
        <p>In more recent years, there has been a shift of focus in research on source selection, from estimating the
relevance of each remote collection to explicitly estimating the number of relevant documents in each. ReDDE
[SC03b] focuses on exactly that task. It uses a centralized sample index, comprising all the
documents that are sampled in the query-sampling phase, and ranks the collections based on the number of
their documents that appear in the top ranks of the centralized sample index. Its performance is similar to CORI
on testbeds with collections of similar size and better when the sizes vary significantly. Two similar approaches
named CRCS(l) and CRCS(e) were presented by Shokouhi [Sho07]; they assign different weights to the returned
documents depending on their rank, in a linear or exponential fashion. Other methods treat source selection as
a voting process in which the available collections are candidates and the documents retrieved from the set of
sampled documents are voters [PSS09]. Different voting mechanisms can be used (e.g. BordaFuse, Reciprocal
Rank, CombSUM), mainly inspired by data fusion techniques. In past
DIR experiments, the methods described in this paragraph attained improvements in precision over previous approaches, but their recall was usually lower.</p>
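        <p>The voting view of source selection can be sketched as follows. The weights used here (1/rank for Reciprocal Rank, and n - rank + 1 for a Borda-style count over n retrieved sampled documents) are one common reading of these data fusion schemes; the function name and the tie-breaking rule are assumptions made for illustration.</p>
        <preformat preformat-type="code">
```python
from collections import defaultdict


def vote_for_collections(ranked_docs, doc_to_collection, method="reciprocal_rank"):
    """Rank collections by the votes of sampled documents retrieved for a query.

    ranked_docs: sampled document ids, best first.
    doc_to_collection: maps each sampled document id to its source collection.
    """
    votes = defaultdict(float)
    n = len(ranked_docs)
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if method == "reciprocal_rank":
            weight = 1.0 / rank
        elif method == "borda":
            weight = float(n - rank + 1)
        else:
            raise ValueError(f"unknown method: {method}")
        votes[doc_to_collection[doc_id]] += weight
    # Highest vote total first; ties are broken by collection name.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4"]
    source = {"d1": "A", "d2": "B", "d3": "B", "d4": "C"}
    print(vote_for_collections(ranked, source, "reciprocal_rank"))
    print(vote_for_collections(ranked, source, "borda"))
```
        </preformat>
        <p>In the toy input, Reciprocal Rank favours collection A, which owns the top document, while the Borda-style count favours collection B, which owns two highly ranked documents. Both rely on individual retrieved documents rather than on a representation of the whole sub-collection.</p>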
      </sec>
      <sec id="sec-3-2">
        <title>Multilayer Collection Selection</title>
        <p>We exploit the IPC hierarchical classification scheme and topically organized patents to propose a new
multiple-evidence, multilayer collection selection method. The new method ranks collections/IPCs based not only on the
subdivision of patents at a specific IPC layer but also on the ranking that their ancestors would receive if the
same selection process (query) had been applied at a higher level. This method can effectively suggest
relevant collections in any professional search system where high-value documents exist that can be organized
hierarchically according to an appropriate classification scheme.</p>
        <p>The motivation behind the multilayer method is to select as many relevant collections as possible at low levels
(level 4, level 5 etc.). IPC code selection applied at low levels can help patent examiners to
identify quickly the subgroups they should focus on, which can be a real time saver. In a recent field survey,
patent examiners described the problem of spending time exploring IPC codes (subgroups) that they later discover
are not relevant. This happens more often in smaller patent offices, where patent examiners are usually asked
to examine patents in areas in which they are knowledgeable but not top experts. In such conditions, collection/IPC
selection methods and tools could be very useful for patent examiners searching for relevant patents.</p>
        <p>The proposed method is based on collections selected by CORI. Previous studies showed that CORI performs
better than other collection selection methods (BordaFuse, Reciprocal Rank) when applied to the patent
domain [SPG12]. The reason is that CORI is based on a content-based representation of sub-collections using a
hyperdocument approach, while the other methods use individual retrieved documents from a sub-collection to
estimate the relevance of that sub-collection. However, CORI tends to produce poorer results at low levels (level 4).
One reason is that the technological area of the patents belonging to a sub-collection is more accurately represented
at higher IPC levels, which consist of fewer sub-collections. At higher IPC levels, documents in one sub-collection are
relatively homogeneous and better distinguished from patents in other IPCs, something that is more difficult to
capture at lower levels. For example, the sub-collections of level 4, which contains about ten times more sub-collections
than level 3, are less easily differentiated from each other using a hyperdocument approach, resulting in
decreased CORI performance. To illustrate this, patents that describe methods for
dental hygiene can be more easily differentiated from radiation therapy patents (level 3), while patents that describe
dental machines for boring may not be so easily differentiated from those that describe dental tools (level 4).</p>
        <p>For the algorithm to use multiple evidence, the documents must be organized at at least
two different levels. In this paper, we focus on level 3 (subclass) and level 4 (main group). When a query is
submitted to the system, two lists of collections with their relevance scores are returned, one list from level 3 and
one from level 4. We used the CORI collection selection algorithm to retrieve the relevant collections, as it has been
shown to be more effective than other collection selection algorithms (e.g. BordaFuse, Reciprocal Rank) [SPG12].</p>
        <p>The lists returned from level<sub>i</sub> and level<sub>i+1</sub> can be represented as two lists of (collection, score) pairs:
<disp-formula><tex-math>\{(Coll_{A}, score_{A}), (Coll_{B}, score_{B}), (Coll_{C}, score_{C}), \ldots, (Coll_{N}, score_{N})\}</tex-math></disp-formula>
<disp-formula><tex-math>\{(Coll_{A.1}, score_{A.1}), (Coll_{A.2}, score_{A.2}), (Coll_{A.3}, score_{A.3}), \ldots, (Coll_{A.M_{A}}, score_{A.M_{A}}), (Coll_{B.1}, score_{B.1}), \ldots, (Coll_{N.1}, score_{N.1}), \ldots, (Coll_{N.M_{N}}, score_{N.M_{N}})\}</tex-math></disp-formula></p>
        <p>where N is the number of collections suggested at level<sub>i</sub>, M<sub>A</sub> is the number of collections at level<sub>i+1</sub> that are
children of Coll<sub>A</sub>, and M<sub>N</sub> is the number of collections at level<sub>i+1</sub> that are children of Coll<sub>N</sub>.</p>
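        <p>The new collection selection algorithm combines the information gathered from the two levels to produce a
new list of relevant collections. It computes a new score for each collection at level<sub>i+1</sub> according
to the following equation:
<disp-formula><label>(1)</label><tex-math>score'_{y.z} = a \cdot score_{y} + (1 - a) \cdot score_{y.z}</tex-math></disp-formula>
where y is a collection at level<sub>i</sub>, z is a collection at level<sub>i+1</sub> which is a child of Coll<sub>y</sub>, score<sub>y</sub> and score<sub>y.z</sub> are the scores returned at the two levels, and score'<sub>y.z</sub> is the combined score used for the final ranking.</p>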
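        <p>Equation (1) can be sketched in a few lines. The sketch assumes that the scores of the two levels are on comparable scales, that a child whose parent was not returned at the higher level receives a parent score of 0, and that level 4 symbols have the compact form A61C5/00 so that the first four characters give the level 3 parent. The function names, symbols and score values are illustrative only.</p>
        <preformat preformat-type="code">
```python
def parent_of(main_group: str) -> str:
    """Return the subclass (level 3) of a main-group symbol such as 'A61C5/00'."""
    return main_group[:4]


def multilayer_scores(parent_scores, child_scores, a):
    """Combine level-i and level-(i+1) scores as in Equation (1).

    parent_scores: {subclass: score} returned at the higher level.
    child_scores: {main group: score} returned at the lower level.
    a: weight of the parent evidence, between 0 and 1.
    """
    combined = {}
    for child, child_score in child_scores.items():
        # Assumption: a parent missing from the higher-level list contributes 0.
        parent_score = parent_scores.get(parent_of(child), 0.0)
        combined[child] = a * parent_score + (1.0 - a) * child_score
    # Highest combined score first; ties are broken by symbol.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    level3 = {"A61C": 0.60, "A61N": 0.40}
    level4 = {"A61C5/00": 0.40, "A61N5/00": 0.50, "A61C3/00": 0.45}
    for symbol, score in multilayer_scores(level3, level4, a=0.8):
        print(symbol, round(score, 3))
```
        </preformat>
        <p>In this toy input the level 4 scores alone would rank A61N5/00 first, whereas the combined score promotes the two children of the better-ranked parent A61C.</p>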
        <p>The value of parameter a represents the weight of the collections selected at level 3. For our experiments, the
value of the weight was decided after a training process. During the training process that preceded the actual
runs, we tested various values of a to determine which one optimizes the performance of the method.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>The data collection which was used in the study is CLEF-IP 2011 where patents are extracts of the MAREC
dataset, containing over 2.6 million patent documents pertaining to 1.3 million patents from the EPO with
content in English, German and French, and extended by documents from the WIPO. We indexed the collection
with the Lemur toolkit. The fields which have been indexed are: title, abstract, description (first 500 words),
claims, inventor, applicant and IPC class information. Patent documents have been pre-processed to produce
a single (virtual) document representing a patent. Our pre-processing also involves stop-word removal and
stemming using the Porter stemmer. In our study, we use the Inquery algorithm implementation of Lemur.</p>
        <p>We have divided the CLEF-IP collection at the subclass (split3), the main group (split4) and the
subgroup (split5) level. This decision is driven by the way patent examiners work when searching:
they try to focus incrementally on a narrower sub-collection of documents. In the present system, we
allocate a patent to every sub-collection specified by at least one of its IPC codes, so a sub-collection may
overlap with others in terms of the patents it contains. This is the reason why the column #patents presents a
number larger than the 1.3 million patents that constitute the CLEF-IP 2011 collection.</p>
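        <p>The allocation rule described above can be sketched as follows. The sketch assumes that each patent comes with a list of compact IPC codes such as F01B7/02; the function names and the toy patent identifiers are illustrative and do not reflect the actual indexing pipeline.</p>
        <preformat preformat-type="code">
```python
from collections import defaultdict


def split_key(ipc_code: str, level: int) -> str:
    """Truncate a compact IPC code such as 'F01B7/02' to level 3, 4 or 5."""
    if level == 3:
        return ipc_code[:4]                    # subclass, e.g. F01B
    if level == 4:
        return ipc_code.split("/")[0] + "/00"  # main group, e.g. F01B7/00
    if level == 5:
        return ipc_code                        # subgroup, e.g. F01B7/02
    raise ValueError("level must be 3, 4 or 5")


def build_split(patents, level):
    """Map each sub-collection to the set of patents carrying a matching IPC code.

    patents: {patent id: list of IPC codes}. A patent with codes in several
    sub-collections is allocated to all of them, so sub-collections overlap.
    """
    split = defaultdict(set)
    for patent_id, ipc_codes in patents.items():
        for code in ipc_codes:
            split[split_key(code, level)].add(patent_id)
    return dict(split)


if __name__ == "__main__":
    toy = {"EP1": ["F01B7/02", "F02B3/00"], "EP2": ["F01B7/04"]}
    split3 = build_split(toy, 3)
    print(split3)
    # The summed sub-collection sizes exceed the number of distinct patents.
    print(sum(len(members) for members in split3.values()), len(toy))
```
        </preformat>
        <p>Summing the sizes of the sub-collections counts a patent once per sub-collection it belongs to, which is why such a total exceeds the number of distinct patents.</p>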
        <p>To test our system, we used a subset of the official queries provided in the CLEF-IP 2011 dataset. We ran
50 random queries generated using the title, the abstract, the description and the claims. We tested different
combinations of source selection methods (CORI, BordaFuse, and Reciprocal Rank) at split3 and split4. For results
merging, we applied the CORI results merging algorithm [CLC95], which is based on a heuristic weighted scores
merging algorithm. We also performed runs with the centralized index and the optimal approach for each split.
For the optimal run, the system retrieved documents only from the collections containing the relevant documents.</p>
        <p>The multilayer method was tested at split4. To test the multilayer method, we used the collections selected
by CORI at split3 and split4. Additionally, we ran 100 randomly generated training queries to decide which
value of the parameter a (Equation 1) optimizes the performance of the method. For the experiments in this
study, parameter a was assigned the value of 0.8, which means that 80% of the evidence stems from split3.</p>
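        <p>A training step of this kind can be sketched as a simple search over candidate weights. The paper does not state which objective was optimized, so the sketch assumes one: the number of relevant level 4 collections found among the top n suggestions, summed over the training queries. The data layout, function names and candidate values are likewise assumptions.</p>
        <preformat preformat-type="code">
```python
def combine(parent_scores, child_scores, a):
    """Equation (1): weight a for the level-3 parent, 1 - a for the level-4 child."""
    return {
        child: a * parent_scores.get(child[:4], 0.0) + (1.0 - a) * score
        for child, score in child_scores.items()
    }


def hits_in_top_n(scores, relevant, n):
    """Count relevant collections among the n highest-scoring ones."""
    top = sorted(scores, key=lambda c: (-scores[c], c))[:n]
    return sum(1 for c in top if c in relevant)


def tune_a(training_queries, candidates, n=20):
    """Return the first candidate value of a with the most hits over all queries.

    training_queries: list of dicts with keys 'level3', 'level4' (score maps)
    and 'relevant' (set of relevant level-4 collections); an assumed layout.
    """
    best_a, best_hits = None, -1
    for a in candidates:
        hits = sum(
            hits_in_top_n(combine(q["level3"], q["level4"], a), q["relevant"], n)
            for q in training_queries
        )
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a


if __name__ == "__main__":
    queries = [{
        "level3": {"A61C": 0.9, "A61N": 0.1},
        "level4": {"A61C5/00": 0.2, "A61N5/00": 0.6},
        "relevant": {"A61C5/00"},
    }]
    print(tune_a(queries, [0.0, 0.2, 0.5, 0.8], n=1))
```
        </preformat>
        <p>With real training data, the candidate grid and the cut-off n would need to be chosen to match the evaluation measure used in the actual runs.</p>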
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>Source selection algorithms for source recommendation applications (high-recall goal) are compared using a recall
measure Rn [CLC95, NF03, Lar03] where the collection ranking provided by the source selection algorithm under
investigation is compared with the optimal ranking, under which collections are ranked by the number of relevant
documents they possess.</p>
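        <p>As a minimal sketch, the measure can be computed as the number of relevant documents held by the top n collections of the evaluated ranking, divided by the number held by the top n collections of the optimal ranking. This is the form in which the measure is usually given in the cited literature; the function name and the toy counts are illustrative.</p>
        <preformat preformat-type="code">
```python
def recall_rn(ranking, relevant_counts, n):
    """Rn: relevant documents covered by the top n of a ranking vs. the optimal top n.

    ranking: collections in the order suggested by the selection algorithm.
    relevant_counts: {collection: number of relevant documents it holds}.
    """
    # The optimal ranking orders collections by their number of relevant documents.
    optimal = sorted(relevant_counts.values(), reverse=True)[:n]
    found = sum(relevant_counts.get(c, 0) for c in ranking[:n])
    best = sum(optimal)
    return found / best if best else 0.0


if __name__ == "__main__":
    counts = {"A": 5, "B": 3, "C": 2}
    ranking = ["B", "D", "A"]
    print([recall_rn(ranking, counts, n) for n in (1, 2, 3)])
```
        </preformat>
        <p>A value of 1 at a given n means that the evaluated ranking covers as many relevant documents as the optimal ranking does at the same cut-off.</p>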
        <p>Figure 1 shows the results produced by the source selection algorithms as they gradually select more
sub-collections (X axis) at split3 (Figure 1a) and split4 (Figure 1b). The results produced by the multilayer method
are shown in Figure 1b, as this method was only tested at split4.</p>
        <p>The best performing algorithm at split3 (Figure 1a) is CORI which identifies about 90% of relevant documents
in the first 20 suggested collections while the other methods identify less than 40%. At split4 (Figure 1b) the
best performing algorithm is the multilayer method where the first 20 suggested collections contain about 67% of
all relevant documents while CORI managed to identify 46% of relevant documents. This is a very encouraging
result that strongly suggests that source selection algorithms can be effectively used to suggest sub-collections
as starting points for information seekers to search. The precision-oriented methods Reciprocal Rank and
BordaFuse produce poor results, since they manage to identify only 15% of relevant documents in the first 20
selected sub-collections.</p>
        <p>Table 3 shows the results from the runs performed on the centralized index and the top run for each DIR
method that was applied at split3 and split4. The multilayer method was run and evaluated at split4. The
performance of the methods can also be compared to the performance of the optimal run.</p>
        <p>Figure 1: Results of the source suggestion algorithm(s). (a) Level of subclass; (b) Level of main group.</p>
        <p>The results show that the best performing source selection algorithm at the level of subclass (split3) is CORI. The
superiority of CORI as a source selection method compared to BordaFuse and Reciprocal Rank is clear
and consistent with our previous study [SPG12]. However, we observe that as the number of sub-collections
increases (i.e. IPC level 4, 5 etc.), the performance of the collection selection algorithms deteriorates, in contrast
to the optimal method, which performs better at lower levels (level 4).</p>
        <p>The most interesting and important finding of this study is that the multilayer method performs better than
the other methods at lower levels. The multilayer method managed to select more relevant collections than CORI
at split4 by utilising information from the previous level. The performance of actual runs using our source selection
method at split4 is better than that of runs using CORI for source selection and also better than the centralized index
approach. Additionally, it is very interesting that some DIR approaches managed to perform better than the
centralized approach, as was also hypothesized in a previous study [SPG12]. This finding shows that DIR
approaches can be not only more efficient, and probably more appropriate given the dynamic nature of document
creation in the patent domain, but also more effective.</p>
        <p>It seems that DIR methods, at least in patent search, can be applied in a way that more closely resembles the
cluster-based approaches to information retrieval [Wil88, FLSG12] and could improve efficiency and effectiveness. As for
efficiency, searching and browsing sub-collections rather than the complete collection could significantly reduce
the retrieval time and, more importantly, the information seeking time of users. In relation to effectiveness, the
potential of DIR retrieval stems from the cluster hypothesis [Rij79], which states that related documents residing
in the same cluster (sub-collection) tend to satisfy the same information needs. The cluster hypothesis was given a
formal basis by Fuhr et al. [FLSG12], who developed the optimum clustering framework. The expectation in the context of source
selection, which is of primary importance for this study, is that if the correct sub-collections are selected, then
it will be easier to retrieve relevant documents from the smaller set of available documents, and more
effective searches can be performed.</p>
        <p>The field of DIR has been explored in the last decade mostly as a response to technical challenges such as
the prohibitive size and exploding rate of growth of the web, which make it impossible to index completely
[RGM01]. There is also a large number of online sources (web sites), known as the invisible web, which are not
reachable by search engines. As the main focus of this paper is patent search, we should mention that this is
especially true in the patent domain, as nearly all authoritative online patent sources (e.g. the EPO's Espacenet) are
not indexable and therefore not accessible through general-purpose search engines.</p>
        <p>In our one-to-one interviews with patent examiners in a small patent office, the majority said that
an IPC suggestion tool may be useful in their searches. We therefore integrated the IPC selection method as a
tool in the iPerFedPat system (www.perfedpat.eu) for suggesting relevant IPCs. Such a tool can be a time saver
for patent examiners, as they can focus their search on a narrow set of relevant collections. It is also
important that patent examiners will have the opportunity to use the tool in combination with other tools
of the system, such as searching specific datasets, resulting in a faster and more efficient patent search process.</p>
        <p>The integration of the IPC suggestion tool was implemented by sending HTTP requests to an external server
providing the IPC selection services. The server receives the requests and sends back a response with the suggested
IPCs. In iPerFedPat, further search tools are integrated in a similar way (tools for faceted search, entity
extraction, clustering search). From an information seeking process perspective, the integration of different search
tools with the main retrieval engine (producing ranked lists of patent documents in response to a query) allows
different search interfaces to coexist in an information seeker's patent search system. This client-server integration
provides the core services to the patent search system, but synchronization between the tools is required so that an
event or action in one tool (for example, selecting an IPC) can update the views produced by the main retrieval
engine or the other tools. We plan to extend the integration scheme of iPerFedPat with a communication and
coordination language to address this need.</p>
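        <p>The request and response exchange described above can be sketched from the client side. Everything specific in the sketch is hypothetical: the endpoint URL, the query parameter names (q, level, n) and the JSON response layout are invented for illustration and do not describe the actual iPerFedPat service. The sketch only builds the request URL and parses a response body; it does not contact any server.</p>
        <preformat preformat-type="code">
```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, query, level=4, top_n=20):
    """Build the GET URL for a hypothetical IPC suggestion endpoint.

    The parameter names q, level and n are assumptions, not a documented API.
    """
    params = urlencode({"q": query, "level": level, "n": top_n})
    return f"{base_url}?{params}"


def parse_suggestions(payload):
    """Parse a hypothetical JSON response into (ipc, score) pairs, best first.

    Assumed layout: {"suggestions": [{"ipc": "...", "score": 0.0}, ...]}.
    """
    items = json.loads(payload)["suggestions"]
    pairs = [(item["ipc"], float(item["score"])) for item in items]
    return sorted(pairs, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Placeholder address; a real deployment would use its own host and path.
    print(build_request_url("http://localhost:8080/suggest", "dental drill", 4, 5))
    body = '{"suggestions": [{"ipc": "A61C3/00", "score": 0.4}, {"ipc": "A61C5/00", "score": 0.7}]}'
    print(parse_suggestions(body))
```
        </preformat>
        <p>Keeping the selection service behind a plain request and response interface is what allows other tools to be attached in the same way, although it does not by itself provide the synchronization between tools discussed above.</p>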
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, a new collection selection algorithm was presented for sub-collections divided using a hierarchical
classification scheme. To test the new collection selection method, we divided the CLEF-IP collection into clusters
using the subclass (split3) and the main group (split4) level to experiment with different levels and depth of topical
organization. The new method was compared with state-of-the-art algorithms and the centralized approach.</p>
      <p>The results showed that the best performing source selection algorithm at the level
of main group was the multilayer method. By contrast, the precision-oriented methods Reciprocal Rank and BordaFuse consistently
produced worse results.</p>
      <p>We plan to continue this work. One issue which we wish to explore further is how the collection selection
methods would perform if an automatic or semi-automatic clustering method were applied. We would
also like to explore how features such as sub-collection sizes influence the performance of the multilayer
collection selection method. In addition, we would like to experiment with finer distribution levels based on the IPC
(subgroup level). We produced divisions of higher granularity at level 5 of the IPC, but we did not have the time and
the resources to report results for this division (split5). We plan to report the runs using split5 in a future paper.</p>
      <p>We also explored issues related to the integration of an IPC suggestion tool into a patent search system. However,
user-centered studies are needed and are more appropriate for deciding the usefulness of such tools,
and we plan to conduct them in the near future.</p>
      <p>In conclusion, we feel that the discussion and the experiment presented in this paper are useful to the designers
of patent search systems based on DIR methods, which in our experiment were more effective than the
centralized approach. Of course, more and larger experiments are required before we can reach a
more general conclusion. However, our experiment has produced some indications in favour of developing
patent search systems based on principles similar to those that inspired the adaptation
and use of DIR methods and their integration in patent search systems.</p>
      <sec id="sec-5-1">
        <title>Acknowledgements</title>
        <p>The second author is supported by a Marie Curie fellowship from the IEF project PerFedPat (www.perfedpat.eu).</p>
        <p>[Ada10] Stephen Adams. The text, the full text and nothing but the text: Part 1 standards for creating
textual information in patent documents and general search implications. World Patent Information,
32(1):22–29, 2010.</p>
        <p>Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Transactions
on Information Systems, 19(2):97–130, April 2001.</p>
        <p>Yen-Liang Chen and Yu-Ting Chiu. An IPC-based vector space model for patent retrieval. Information
Processing &amp; Management, 47(3):309–322, 2011.</p>
        <p>James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference
networks. In Proceedings of the 18th annual international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’95, pages 21–28, New York, NY, USA, 1995. ACM.</p>
        <p>[FPC+99] James C. French, Allison L. Powell, Jamie Callan, Charles L. Viles, Travis Emmitt, Kevin J. Prey,
and Yun Mou. Comparing the performance of database selection algorithms. In Proceedings of
the 22nd annual international ACM SIGIR conference on Research and development in information
retrieval, SIGIR ’99, pages 238–245, New York, NY, USA, 1999. ACM.</p>
        <p>Norbert Fuhr. A decision-theoretic approach to database selection in networked IR. ACM Trans. Inf.
Syst., 17(3):229–249, 1999.</p>
        <p>Norbert Fuhr. An infrastructure for supporting the evaluation of interactive information retrieval.
In Proceedings of the 2011 workshop on Data infrastructurEs for supporting information retrieval
evaluation, DESIRE ’11, pages 1–2, New York, NY, USA, 2011. ACM.</p>
        <p>Ray R. Larson. Distributed IR for digital libraries. In Research and Advanced Technology for
Digital Libraries (ECDL 2003), pages 487–498. Springer (LNCS), 2003.</p>
        <p>Leah S. Larkey, Margaret E. Connell, and Jamie Callan. Collection selection and results merging with
topically organized U.S. patents and TREC data. In Proceedings of the ninth international conference
on Information and knowledge management, CIKM ’00, pages 282–289, New York, NY, USA, 2000. ACM.</p>
        <p>[LMTT11] Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe. Current Challenges in Patent
Information Retrieval. The Information Retrieval Series. Springer, 2011.</p>
        <p>Henrik Nottelmann and Norbert Fuhr. Evaluating different methods of estimating retrieval quality
for resource selection. In Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval, SIGIR ’03, pages 290–297, New York, NY, USA,
2003. ACM.</p>
        <p>Georgios Paltoglou, Michail Salampasis, and Maria Satratzemi. A results merging algorithm for
distributed information retrieval environments that combines regression methodologies with a selective
download phase. Information Processing &amp; Management, 44(4):1580–1599, July 2008.</p>
        <p>Georgios Paltoglou, Michail Salampasis, and Maria Satratzemi. Simple adaptations of data fusion
algorithms for source selection. In Mohand Boughanem, Catherine Berrut, Josiane Mothe, and
Chantal Soule-Dupuy, editors, Advances in Information Retrieval, volume 5478 of Lecture Notes in
Computer Science, pages 497–508. Springer Berlin Heidelberg, 2009.</p>
        <p>[RGM01] Sriram Raghavan and Hector Garcia-Molina. Crawling the hidden web. In Proceedings of the 27th
International Conference on Very Large Data Bases, VLDB ’01, pages 129–138, San Francisco, CA,
USA, 2001. Morgan Kaufmann Publishers Inc.</p>
        <p>C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd
edition, 1979.</p>
        <p>[SC03a] Luo Si and Jamie Callan. A semisupervised learning method to merge search engine results. ACM
Transactions on Information Systems, 21(4):457–491, 2003.</p>
        <p>Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection.
In Proceedings of the 26th annual international ACM SIGIR conference on Research and development
in information retrieval, SIGIR ’03, page 298, 2003.</p>
        <p>[Sho07] Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information
retrieval. In Proceedings of the 29th European conference on IR research, ECIR’07, pages 160–172,
Berlin, Heidelberg, 2007. Springer-Verlag.</p>
        <p>[SPG12] Michail Salampasis, Georgios Paltoglou, and Anastasia Giahanou. Report on the CLEF-IP 2012
experiments: Search of topically organized patents. In Pamela Forner, Jussi Karlgren, and Christa
Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), 2012.</p>
        <p>[Wil88] Peter Willett. Recent trends in hierarchic document clustering: a critical review. Inf. Process.
Manage., 24(5):577–597, August 1988.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[SC03b]</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>