<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Distributed Indexing: To MapReduce or Not?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Richard M. C. McCreadie</string-name>
          <email>richardm@dcs.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <email>craigm@dcs.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <email>ounis@dcs.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing</institution>
          ,
          <addr-line>Science</addr-line>
          ,
          <institution>University of Glasgow</institution>
          ,
          <addr-line>Glasgow, G12 8QQ</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular under the auspices of the MapReduce programming framework. We describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The Web is the largest known document repository, and
poses a major challenge for Information Retrieval (IR)
systems, such as those used by Web search engines or Web
IR researchers. Indeed, while the index sizes of major Web
search engines are a closely guarded secret, these are
commonly accepted to be in the range of billions of documents.
For researchers, the recently released TREC ClueWeb09
corpus (see http://boston.lti.cs.cmu.edu/Data/clueweb09/) of 1.2 billion Web documents poses both indexing and
retrieval challenges. In both scenarios, the ability to
efficiently create appropriate index structures to allow effective
and efficient search is of much value. Moreover, at such
scale, the use of distributed architectures to achieve high
throughput is essential.</p>
      <p>
        In this work, we investigate the MapReduce
programming paradigm, which has been gaining popularity in
commercial settings, with implementations by Google [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
Yahoo! [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Microsoft also has a similar framework for
distributed operations [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In particular, MapReduce allows
the horizontal scaling of large-scale workloads using clusters
of machines. It applies the intuition that many common
large-scale tasks can be expressed as map and reduce
operations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], thereby providing an easily accessible framework
for parallelism over multiple machines.
      </p>
      <p>However, while MapReduce has been widely adopted
within Google, and is reportedly used for their main indexing
process, the MapReduce framework implementation and other
programs using it remain (understandably) internal only.
Moreover, there have been few empirical studies undertaken
into the scalability of MapReduce beyond that contained
within the original MapReduce paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which in
particular demonstrates the scalability of the simple operations
grep and sort. More recently, a MapReduce
implementation has been used to sort 1 terabyte of data in approx. 1
minute [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. However, while Dean &amp; Ghemawat [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] suggest
a simple formulation in MapReduce for document indexing,
no studies have empirically shown the benefits of applying
MapReduce to the important IR indexing problem.
      </p>
      <p>
        This paper contributes a first step towards understanding
the benefits of indexing large corpora using MapReduce, in
comparison to other indexing implementations. In
particular, we describe four different methods of performing
document indexing in MapReduce, from initial suggestions by
Dean &amp; Ghemawat, to more advanced strategies. We
deploy MapReduce indexing strategies in the Terrier IR
platform [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], using the freely available Hadoop
implementation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of MapReduce, and then perform experiments using
standard TREC data.
      </p>
      <p>The remainder of this paper is structured as follows:
Section 2 describes a state-of-the-art single-pass indexing
strategy; Section 3 introduces the MapReduce paradigm;
Section 4 describes strategies for document indexing in
MapReduce; Section 5 describes our experimental setup, research
questions, experiments, and analysis of results; concluding
remarks are provided in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>INDEXING</title>
      <p>
        In the following, we briefly describe the structures
involved in the indexing process (Section 2.1) and how the
modern single-pass indexing strategy is deployed in the open
source Terrier IR platform [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] on which this work is based
(Section 2.2). We then provide details of how an indexing
process can be distributed to make use of additional
machines (Section 2.3).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Index Structures</title>
      <p>
        To allow efficient retrieval of documents from a corpus,
suitable data structures must be created, collectively known
as an index. Usually, a corpus covers many documents, and
hence the index will be held on a large storage device -
commonly one or more hard disks. Typically, at the centre of
any IR system is the inverted index [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. For each term, the
inverted index contains a posting list, which lists the
documents - represented as integer document-IDs (doc-IDs) - containing the term. Each posting in the posting list also
stores sufficient statistical information to score each
document, such as the frequency of the term occurrences and,
possibly, positional information (the position of the term
within each document, which facilitates phrase or proximity
search) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] or field information (the occurrence of the term
in various semi-structured areas of the document, such as the
title, enabling these to be higher-weighted during retrieval).
The inverted index does not store the textual terms
themselves, but instead uses an additional structure known as a
lexicon to store these along with pointers to the
corresponding posting lists within the inverted index. A document
index may also be created which stores meta-information
about each document within the inverted index, such as an
external name for the document (e.g. URL), and the length
of the document [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The process of generating these
structures is known as indexing. An illustrative sketch of the shapes of these structures is given below.
      </p>
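      <p>To make these structures concrete, the following is a minimal illustrative sketch of their shapes in Java; the class and field names are our own illustration, not Terrier’s actual classes.</p>
      <preformat><![CDATA[
import java.util.Map;
import java.util.TreeMap;

// Illustrative shapes only; a real system stores these compressed on disk.
class Posting {
  int docId;          // integer document-ID
  int termFrequency;  // occurrences of the term in this document
  // optionally: positional and field information
}

class LexiconEntry {
  long postingListOffset;  // pointer to the term's posting list in the inverted index
  int documentFrequency;   // number of documents containing the term
}

class DocumentIndexEntry {
  String documentName;  // external name for the document, e.g. a URL
  int documentLength;   // length of the document in tokens
}

class IndexSketch {
  // lexicon: textual term -> pointer to its posting list (the inverted
  // index itself does not store the textual terms)
  Map<String, LexiconEntry> lexicon = new TreeMap<String, LexiconEntry>();
}
]]></preformat>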
    </sec>
    <sec id="sec-4">
      <title>Single-pass Indexing</title>
      <p>
        When indexing a corpus of documents, documents are
read from their storage location on disk, and then tokenised.
Tokens may then be removed (stop-words) or transformed
(e.g. stemming), before being collated into the final
index structures [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Current state-of-the-art indexing uses
a single-pass indexing method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where the (compressed)
posting lists for each term are built in memory as the
corpus is scanned. However, it is unlikely that the posting lists
for very many documents would fit wholly in the memory
of a single machine. Instead, when memory is exhausted,
the partial indices are ‘flushed’ to disk. Once all documents
have been scanned, the final index is built by merging the
flushed partial indices.
      </p>
      <p>
        In particular, the temporary posting lists held in memory
are of the form list(term, list(doc-ID, Term Frequency)).
Additional information such as positions or fields can also
be held within each posting. As per modern compression
schemes, only the first doc-ID in each posting list is absolute; for the rest, the differences between consecutive doc-IDs are instead
stored to save space, using Elias-Gamma compression [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A sketch of this gap encoding is given below.
      </p>
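      <p>As an illustration of this gap encoding, the following minimal Java sketch gamma-encodes a sorted posting list of doc-IDs, emitting bits as a string for readability; it is our own example (with the first doc-ID shifted by one so that it is a positive integer), not Terrier’s compression code.</p>
      <preformat><![CDATA[
/** Minimal illustrative delta-gap + Elias-Gamma encoder (not Terrier's classes). */
public class GammaPostingSketch {

  /** Elias-Gamma code of x >= 1: (binary length - 1) zeros, then x in binary. */
  static String gamma(int x) {
    String binary = Integer.toBinaryString(x);
    StringBuilder out = new StringBuilder();
    for (int i = 1; i < binary.length(); i++) out.append('0');
    return out.append(binary).toString();
  }

  /** Encode sorted doc-IDs: first ID absolute, the rest as gaps (values >= 1). */
  static String encode(int[] sortedDocIds) {
    StringBuilder bits = new StringBuilder();
    int previous = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      int value = (i == 0) ? sortedDocIds[0] + 1          // shift so the first ID is >= 1
                           : sortedDocIds[i] - previous;  // gap between consecutive doc-IDs
      bits.append(gamma(value));
      previous = sortedDocIds[i];
    }
    return bits.toString();
  }

  public static void main(String[] args) {
    // Doc-IDs 5, 8, 12 encode as gamma(6), gamma(3), gamma(4).
    System.out.println(encode(new int[] {5, 8, 12}));
  }
}
]]></preformat>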
    </sec>
    <sec id="sec-5">
      <title>Distributing Indexing</title>
      <p>
        The single-pass indexing strategy described above is
designed to run on a single machine architecture with finite
available memory. However, should we want to take
advantage of multiple machines, this can be achieved in an
intuitive manner by deploying an instance of this indexing
strategy on each machine [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. For machines with more than
one processor, one instance per processing core is possible,
assuming the local disk and memory are not saturated. As
described by Ribeiro-Neto &amp; Barbosa [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], each instance
would index a subset of the input corpus to create an index
for only those documents. It should be noted that if the
documents to be indexed are local to the machines doing
the work (shared-nothing), such as when each machine has
crawled the documents it is indexing, then this strategy will
always be optimal (will scale linearly with processing power).
However, in practical terms, fully machine-local data is
difficult to achieve when a large number of machines is involved.
This stems from the need to split and distribute the corpus
without overloading the network or risking unrecoverable
data loss from a single point of failure.
      </p>
      <p>
        Distributed indexing has seen some coverage in the
literature. Ribeiro-Neto &amp; Barbosa [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] compared three
distributed indexing algorithms for indexing 18 million
documents. Efficiency was measured with respect to local
throughput of each processor, not in terms of overall indexing time.
Unfortunately, they do not state the underlying hardware
that they employ, and as such their results are difficult to
compare to. Melnik et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] described a distributed
indexing regime designed for the Web, with considerations for
updatable indices. However, their experiments did not
consider efficiency as the number of nodes is increased.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Dean &amp; Ghemawat proposed the MapReduce
paradigm for distributing data-intensive processing across
multiple machines. Section 3 gives an overview of MapReduce.
Section 4 reviews prior work on MapReduce indexing, namely
that of Dean &amp; Ghemawat, who suggest how document
indexing can be implemented in MapReduce, and from the
Nutch IR system. Moreover, we propose a more advanced
method of MapReduce indexing, which, by the experiments
in Section 5, is shown to be more efficient.
      </p>
    </sec>
    <sec id="sec-6">
      <title>MAPREDUCE</title>
      <p>
        MapReduce is a programming paradigm for the
processing of large amounts of data by distributing work tasks over
multiple processing machines [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It was designed at Google
as a way to distribute computational tasks which are run
over large datasets. It is built on the idea that many tasks
which are computationally intensive involve doing a ‘map’
operation with a simple function over each ‘record’ in a large
dataset, emitting key/value pairs to comprise the results.
The map operation itself can be easily distributed by
running it on different machines processing different subsets of
the input data. The output from each of these is then
collected and merged into the desired results by ‘reduce’
operations.
      </p>
      <p>
        By using the MapReduce abstraction, the complex details
of parallel processing, such as fault tolerance and node
availability, are hidden, in a conceptually simple framework [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
allowing highly distributed tools to easily be built on top
of MapReduce. Indeed, various companies have developed
tools to perform data mining operations on large-scale datasets
on top of MapReduce implementations. Google’s Sawzall [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
and Yahoo’s Pig [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] are two such examples of data mining
languages. Microsoft uses a distributed framework similar
to MapReduce called Dryad, which the Nebula scripting
language uses to provide similar data mining capabilities [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
However, it is of note that MapReduce trades the ability to
perform code optimisation (by abstracting from the internal
workings) for easy implementation through its framework,
meaning that an implementation in MapReduce is likely not
the optimal solution, but will be cheaper to produce and
maintain [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>MapReduce is designed from a functional programming
perspective, where functions provide definitions of
operations over input data. A single MapReduce job is defined
by the user as two functions. The map function takes in a
key/value pair (of type &lt;key1, value1&gt;) and produces a set
of intermediate key/value pairs (&lt;key2, value2&gt;). The
outputs from the map function are then automatically grouped
by their key, and then passed to the reduce function. The
reduce task merges the values with the same key to form a
smaller final result. A typical job will have many map tasks
which each operate on a subset of the input data, and fewer
reduce tasks, which operate on the merged output of the
map tasks. Map or reduce tasks may run on different
machines, allowing parallelism to be achieved. In common with
functional programming design, each task is independent of
other tasks of the same type, and there is no global state,
or communication between maps or between reduces.</p>
      <p>
        Counting term occurrences in a large data-set is an
often-repeated example of how to use the MapReduce paradigm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; a worked example and associated source code are available at http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html.
For this, the map function takes the document file-name
(key1) and the contents of the document (value1) as input,
then for each term in the document emits the term (key2)
and the integer value ‘1’ (value2). The reduce then sums up
all of the values (many 1s) for each key2 (a term) to give
the total occurrences of that term.
      </p>
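      <p>As a concrete illustration, the following is a minimal sketch of this job against the Hadoop 0.18/0.19-era mapred API (mirroring the tutorial cited above). Note that with Hadoop’s default TextInputFormat the map input key is actually a byte offset (LongWritable) rather than the document file-name; the logic is otherwise as described.</p>
      <preformat><![CDATA[
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  /** Map: for each token in the input, emit (term, 1). */
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);  // one emit per token
      }
    }
  }

  /** Reduce: sum the 1s for each term to obtain its total occurrences. */
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }
}
]]></preformat>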
      <p>
        As mentioned above, MapReduce jobs are executed over
multiple machines. In a typical setup, data is not stored in
a central file store, but instead replicated in blocks (usually
of 64MB) across many machines [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This has the central
advantage that the map functions can operate on data that
may be ‘rack-local’ or ‘machine-local’ - i.e. data that does not have
to transit intra- and inter-data-centre backbone links, and
does not overload a central file storage service. Therefore,
high bandwidth can be achieved because data is always as
local as possible to the processing CPUs. Intermediate
results of map tasks are stored on the processing machines
themselves. To reduce the size of this output (and
therefore IO), it may be merged using a combiner, which acts as
a reducer local to each machine. A central master machine
provides job and task scheduling, and attempts to run
tasks as locally as possible to the input data. An illustrative sketch of a job configuration exercising these facilities is given below.
      </p>
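      <p>The following illustrative driver shows how such a job might be configured using the JobConf API of the Hadoop version employed later in this paper; the driver class is our own sketch, reusing the word-count classes above, and the map/reduce task counts match those used in our experiments in Section 5.6.</p>
      <preformat><![CDATA[
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCount.Map.class);
    // The combiner acts as a reducer local to each map task,
    // shrinking intermediate output before it crosses the network.
    conf.setCombinerClass(WordCount.Reduce.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // m and r: the number of map and reduce tasks (the map count is a hint).
    conf.setNumMapTasks(72);
    conf.setNumReduceTasks(24);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);  // the master schedules tasks close to their input blocks
  }
}
]]></preformat>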
      <p>
        While MapReduce is seeing increasing popularity, there
are only a few notable studies investigating the paradigm
beyond the original paper. In particular, for machine
learning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Chu et al. studied how various machine learning
algorithms could be parallelised using the MapReduce
paradigm; however, experiments were only carried out on single
systems, rather than a cluster of machines. In such a
situation, MapReduce provides an easy framework to distribute
non-cooperating tasks of work, but misses the central data
locality advantage facilitated by a MapReduce framework.
A similar study for natural language processing [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] used
several machines, but, with experimental datasets of only
88MB and 770MB, it would again fail to see benefit in the
data-local scheduling of tasks.
      </p>
      <p>
        In contrast, indexing is an IO-intensive operation, where
large amounts of raw data have to be read and transformed
into suitable index structures. In this work, we show how
indexing can be implemented in a MapReduce framework.
However, the MapReduce implementation described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
is not available outside of Google. Instead, we use the
Hadoop [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] framework, which is an open-source Java
implementation of MapReduce from the Apache Software
Foundation, with developers contributed by Yahoo! and Facebook,
among others. In the next section, we describe several
indexing strategies in MapReduce, starting from that proposed
in the original MapReduce paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], before developing a
more refined strategy inspired by the single-pass indexing
described in Section 2.2.
      </p>
    </sec>
    <sec id="sec-7">
      <title>INDEXING IN MAPREDUCE</title>
      <p>
        In this section, we show how indexing can be performed
in MapReduce. Firstly, we describe two possible
interpretations of indexing as envisaged by Dean &amp; Ghemawat in
their original seminal MapReduce paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (Section 4.1).
Then, we describe an alternative MapReduce indexing
strategy used by the Nutch IR platform, before finally showing
how a more refined single-pass indexing strategy can be
implemented in MapReduce (Section 4.3).
      </p>
      <p>It should be noted that in MapReduce each map task is
not aware of its context in the overall job. For indexing, this
means that the doc-IDs emitted from the map phases
cannot be globally correct. Instead, these doc-IDs start from
0 in each map. To allow the reduce tasks to calculate the
correct doc-IDs, each map task produces a “side-effect” file,
detailing the number of documents emitted per map. This
is true for all the indexing implementations described in this
section. We also note that, for all our indexing
implementations, the number of reducers specified determines the number
of final indices generated. A sketch of the doc-ID correction enabled by these side-effect files is given below.
</p>
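      <p>For illustration, a minimal sketch of the doc-ID correction that a reduce task can perform using the per-map document counts read from the “side-effect” files; the variable and method names are our own.</p>
      <preformat><![CDATA[
class DocIdCorrection {

  // docCounts[i] = number of documents emitted by map task i,
  // as recorded in that map's "side-effect" file.
  int[] computeMapOffsets(int[] docCounts) {
    int[] offsets = new int[docCounts.length];
    for (int i = 1; i < docCounts.length; i++) {
      offsets[i] = offsets[i - 1] + docCounts[i - 1]; // documents from earlier maps
    }
    return offsets;
  }

  // A map-local doc-ID becomes globally correct by adding its map's offset.
  int correctDocId(int[] offsets, int mapNumber, int localDocId) {
    return offsets[mapNumber] + localDocId;
  }
}
]]></preformat>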
    </sec>
    <sec id="sec-8">
      <title>Dean &amp; Ghemawat’s MapReduce</title>
    </sec>
    <sec id="sec-9">
      <title>Indexing Strategy</title>
      <p>
        The original MapReduce paper by Dean &amp; Ghemawat [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
presents a short description for performing indexing in
MapReduce, which is directly quoted below:
      </p>
      <p>“The map function parses each document, and emits a
sequence of &lt;word, document ID&gt; pairs. The reduce function
accepts all pairs for a given word, sorts the corresponding
document IDs and emits a &lt;word, list(document ID)&gt; pair.
The set of all output pairs forms a simple inverted index. It
is easy to augment this computation to keep track of word
positions.”</p>
      <p>
        The implicit claim being made in the original MapReduce
paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is that efficient indexing could be trivially
implemented in MapReduce. However, we argue that this
oversimplifies the details, and provides room for a useful study to
allow document indexing in MapReduce to be better
understood. For example, for an inverted index to be useful, the
term frequencies within each document need to be stored.
Though this is not accounted for in Dean &amp; Ghemawat’s
paper, there are two possible interpretations on how this
could be achieved within the bounds laid out in the
quotation above. We detail these interpretations below in
Sections 4.1.1 and 4.1.2, respectively.
      </p>
      <sec id="sec-9-0">
        <title>Emitting Term, Doc-ID Tuples</title>
      </sec>
      <p>The literal interpretation of the description above would
be to output a set of &lt;term, doc-ID&gt; pairs for each token
in a document. This means that if a single term appears n
times in a document then the &lt;term, doc-ID&gt; pair will be
emitted n times. This has the advantage of making the map
phase incredibly simple, as it emits on a per token basis.
However, this means that we will emit a &lt;term, doc-ID&gt;
pair for every token in the collection. In general, when a
map task emits lots of intermediate data, this will be saved
to the machine’s local disk, and then later transferred to
the appropriate reducer. However, with this indexing
interpretation, the intermediate map data would be extremely
large - indeed, similar to the size of the corpus, as each
token in the corpus is emitted along with a doc-ID. Having
large amounts of intermediate map data will increase map
to reducer network traffic, as well as lengthening the sort
phase. These are likely to have an effect on the job’s overall
execution time. The reducer will - for each unique term
sort the doc-IDs, then add up the instances on a per doc-ID
basis to retrieve the term frequencies. Finally, the reducer
will write the completed posting list for that term to disk.
Figure 1 provides a pseudo-code implementation of map and
reduce functions for this strategy.</p>
      <sec id="sec-9-1">
        <title>Dean &amp; Ghemawat MapReduce Indexing</title>
      </sec>
      <sec id="sec-9-2">
        <title>Map function pseudo-code</title>
      </sec>
      <sec id="sec-9-3">
        <title>1: Input</title>
        <p>Key: Document Identifier, Name</p>
        <p>Value: Contents of the Document, DocContents</p>
      </sec>
      <sec id="sec-9-4">
        <title>2: Output</title>
        <p>A list of (term,doc-ID) pairs, one for each token
in the document
3: for each Term in the DocContents loop
4 : Stem(Term)
5 : deleteIfStopword(Term)
6 : if (Term is not empty) then emit(Term, doc-ID)
7: end loop
8: Add document to the Document Index
9: if (lastMap()) write out information about the
10: documents this map processed (“side-effect” files)</p>
      </sec>
      <sec id="sec-9-5">
        <title>Dean &amp; Ghemawat MapReduce Indexing</title>
      </sec>
      <sec id="sec-9-6">
        <title>Reduce function pseudo-code</title>
      </sec>
      <sec id="sec-9-7">
        <title>1: Input</title>
        <p>Key: A Term</p>
        <p>Value: List of (doc-ID), doc-IDs</p>
      </sec>
      <sec id="sec-9-8">
        <title>2: Output</title>
        <p>Key: Term</p>
        <p>Value: Posting List
3 : List Posting-List = new PostingList()
4 : Sort doc-IDs
5 : for each doc-ID in doc-IDs loop
6 : increment tf for doc-ID
7 : correct doc-ID
8 : add doc-ID and tf to Posting-List
9 : end loop
10: emit(Posting-List)</p>
      <p>We claim that emitting once for every token extracted is wasteful of resources, causing excessive disk IO on the map side by writing intermediate map output to disk, and excessive IO in moving map output to the reduce tasks. To reduce IO, we could instead emit &lt;term, (doc-ID, tf)&gt; tuples, where tf is the term frequency of the term in the current document. In this way, the number of emit operations which have to be performed is significantly reduced, as we now only emit once per unique term per document. The reduce method for this interpretation is also much simpler than for the earlier interpretation, as it only has to sort instances by document to get the final posting list to write out. It should also be noted that the &lt;term, doc-ID&gt; strategy described earlier can be adapted to generate tf values instead through the use of a MapReduce combiner, which performs a localised merge on each map task’s output. A sketch of this interpretation’s map function is shown below.</p>
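      <p>A minimal sketch of this interpretation’s map function, written against the Hadoop mapred API; DocIdTf is a hypothetical Writable pairing a doc-ID with a term frequency, tokenise() stands in for the tokenisation, stemming and stop-word removal of Section 2.2, and we assume an input format supplying (document name, contents) pairs.</p>
      <preformat><![CDATA[
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Sketch: emit one <term, (doc-ID, tf)> tuple per unique term per document. */
public class TermDocIdTfMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, DocIdTf> {  // DocIdTf: hypothetical Writable

  private int localDocId = 0;  // map-local; corrected later via side-effect files

  public void map(Text docName, Text docContents,
      OutputCollector<Text, DocIdTf> output, Reporter reporter)
      throws IOException {
    // Count term frequencies for this document in memory.
    Map<String, Integer> tfs = new HashMap<String, Integer>();
    for (String term : tokenise(docContents.toString())) {
      Integer tf = tfs.get(term);
      tfs.put(term, tf == null ? 1 : tf + 1);
    }
    // One emit per unique term, instead of one per token.
    for (Map.Entry<String, Integer> e : tfs.entrySet()) {
      output.collect(new Text(e.getKey()), new DocIdTf(localDocId, e.getValue()));
    }
    localDocId++;
  }

  private Iterable<String> tokenise(String text) {
    return java.util.Arrays.asList(text.split("\\s+"));  // placeholder tokenisation
  }
}
]]></preformat>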
        <p>While the &lt;term, (doc-ID, tf)&gt; indexing strategy emits significantly less data than that described in Section 4.1.1, we argue that an implementation in this manner would still be inefficient, because a large amount of IO is still required to store, move and sort the temporary map output data.</p>
    </sec>
    <sec id="sec-10">
      <title>Nutch’s MapReduce Indexing Strategy</title>
      <p>
        The Apache Software Foundation’s open source Nutch
platform [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] also deploys a MapReduce indexing strategy,
using the Hadoop MapReduce implementation. By
inspection of the source of Nutch v0.9, we have determined that
its MapReduce indexing strategy differs from the general
outline described in Section 4.1 above. Instead of emitting
terms, Nutch only tokenises the document during the map
phase, hence emitting &lt;doc-ID, analysed-Document&gt;
tuples from the map function. Each analysed-Document
contains the textual forms of each term and their corresponding
frequencies. The reduce phase is then responsible for writing
all index structures. Compared to emitting &lt;term, (doc-ID, tf)&gt; tuples, the Nutch indexing method will emit less often, but the
value of each emit will contain substantially more data (i.e.
the textual form and frequency of each unique term in the
document). A sketch of this map function is given below. We believe this is a step forward towards
reducing intermediate map output. However, there may still be
scope for further reducing map task output to the benefit of
overall indexing efficiency. In the next section, we develop
our single-pass indexing strategy (described in Section 2.2)
for the MapReduce framework, to address this issue.
      </p>
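      <p>An illustrative sketch of a Nutch-style map function, under our reading of the Nutch v0.9 source; AnalysedDocument is a hypothetical Writable holding the textual form and frequency of each unique term in the document.</p>
      <preformat><![CDATA[
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Sketch: the map only tokenises; index structures are written by the reducers. */
public class NutchStyleMapper extends MapReduceBase
    implements Mapper<Text, Text, IntWritable, AnalysedDocument> {

  private int localDocId = 0;  // map-local doc-ID

  public void map(Text docName, Text docContents,
      OutputCollector<IntWritable, AnalysedDocument> output, Reporter reporter)
      throws IOException {
    // AnalysedDocument (hypothetical) maps each term's text to its frequency.
    AnalysedDocument doc = new AnalysedDocument();
    for (String term : docContents.toString().split("\\s+")) {
      doc.incrementFrequency(term);
    }
    output.collect(new IntWritable(localDocId++), doc);  // one emit per document
  }
}
]]></preformat>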
    </sec>
    <sec id="sec-11">
      <title>Single-pass MapReduce Indexing Strategy</title>
      <p>We now adapt the single-pass indexing strategy described
in Section 2.2, for use in a MapReduce framework. The
indexing process is split into m map tasks. Each map task
operates on its own subset of the data, and is similar to the
single-pass indexing corpus scanning phase. However, when
memory runs low or all documents for that map have been
processed, the partial index is flushed from the map task,
by emitting a set of &lt;term, posting list&gt; pairs. The
partial indices (flushes) are then sorted by term, map and flush
numbers before being passed to a reduce task. As the flushes
are collected at an appropriate reduce task, the posting lists
for each term are merged by map number and flush number,
to ensure that the posting lists for each term are in a
globally correct ordering. The reduce function takes each term
in turn and merges the posting lists for that term into the
full posting list, as a standard index. Elias-Gamma
compression is used as in non-distributed indexing to store only
the distance between doc-IDs. Figure 2 provides a
pseudocode implementation of map and reduce functions for our
proposed MapReduce indexing strategy.</p>
      <p>The fundamental difference between this strategy and that
of Dean &amp; Ghemawat described in Section 4.1, is what the
map tasks emit. Instead of emitting a batch of
&lt;term, doc-ID&gt; pairs immediately upon parsing each document, we
build up a posting list for each term in memory. Over
many documents, memory will eventually be exhausted, at
which time all currently stored posting lists will be flushed
as &lt;term,posting list&gt; tuples. This has the positive effect of
minimising both the size of the map task output, as well as
the number of emits. Compared to the Dean &amp; Ghemawat
indexing strategies, far fewer emits will be made, but each emit
will be much larger. Compared to the Nutch MapReduce
indexing strategy, there may be more emits; however, the reduce
task operates on term-sorted data, and does not require
a further sort and invert operation to generate an inverted
index. Moreover, the emit values will only contain doc-IDs
instead of textual terms, making them considerably smaller. A sketch of the map-side flushing logic is given below.</p>
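      <p>The following illustrative sketch outlines this map-side flushing logic (compare the pseudo-code in Figure 2); the in-memory structures are shown with plain Java maps, and PostingList is a hypothetical stand-in for the compressed in-memory posting list, not Terrier’s actual class.</p>
      <preformat><![CDATA[
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;

/** Sketch of the single-pass map task: build postings in memory, flush when memory is low. */
public class SinglePassMapSketch {

  // Hypothetical in-memory postings: term -> (partial) posting list.
  private final Map<String, PostingList> memoryPostings =
      new HashMap<String, PostingList>();
  private int flushCount = 0;

  /** Called once per document: add its terms to the in-memory posting lists. */
  void indexDocument(int localDocId, Iterable<String> terms,
      OutputCollector<Text, PostingList> output) throws IOException {
    for (String term : terms) {
      PostingList postings = memoryPostings.get(term);
      if (postings == null) memoryPostings.put(term, postings = new PostingList());
      postings.add(localDocId);  // doc-IDs here are flush-local
    }
    if (memoryLow()) flush(output);
  }

  /** Emit the partial index as <term, posting list> pairs, tagged with the flush number. */
  void flush(OutputCollector<Text, PostingList> output) throws IOException {
    for (Map.Entry<String, PostingList> e : memoryPostings.entrySet()) {
      e.getValue().setFlushNumber(flushCount);  // lets the reducer order partial lists
      output.collect(new Text(e.getKey()), e.getValue());
    }
    memoryPostings.clear();
    flushCount++;
  }

  private boolean memoryLow() {
    // Crude heuristic; a real implementation would track memory use more carefully.
    Runtime rt = Runtime.getRuntime();
    return rt.freeMemory() < rt.totalMemory() / 10;
  }
}
]]></preformat>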
      <p>Figure 3 presents an example of this MapReduce indexing strategy in a distributed setting, indexing 200 documents. The documents are indexed by m = 2 map tasks, before the posting lists for each term are grouped and sorted, and then reduced to a single index. The posting lists output from each map contain only local doc-IDs. In the reduce tasks, these are merged into a list of absolute doc-IDs, by adding to each entry the number of documents processed by previous map tasks. However, note that in our indexing implementation, the doc-IDs are flush-local as well as map-local. While this is not strictly necessary, it allows smaller doc-IDs to be emitted from each map, which can be better compressed.</p>
      <sec id="sec-11-1">
        <title>Single-Pass MapReduce Indexing</title>
      </sec>
      <sec id="sec-11-2">
        <title>Map function pseudo-code</title>
      </sec>
      <sec id="sec-11-3">
        <title>1: Input</title>
        <p>Key: Document Identifier, Name</p>
        <p>Value: Contents of the Document, DocContents</p>
      </sec>
      <sec id="sec-11-4">
        <title>2: Output</title>
        <p>Key: Term</p>
        <p>Value: Posting list
3: for each Term in the DocContents loop
4 : Stem(Term)
5 : deleteIfStopword(Term)
6 : if (Term is not empty) then add the current
document for that term to the in-memory</p>
        <p>Posting List
7: end loop
8: Add document to the Document Index
9: if (lastMap() or outOfMemory()) then</p>
        <p>emit(in-Memory Posting List)
10: if (lastMap()) write out information about the
11: documents this map processed (“side-effect” files)</p>
      </sec>
      <sec id="sec-11-5">
        <title>Single-Pass MapReduce Indexing</title>
      </sec>
      <sec id="sec-11-6">
        <title>Reduce function pseudo-code</title>
      </sec>
      <sec id="sec-11-7">
        <title>1: Input</title>
        <p>Key: A Term</p>
        <p>Value: List of (Posting List), PartialPostingLists</p>
      </sec>
      <sec id="sec-11-8">
        <title>2: Output</title>
        <p>Key: Term</p>
        <p>Value: Posting List
3 : List Posting-List = new PostingList()
4 : Sort PartialPostingLists by the map and flush they
were emitted from
5 : for each PostList in PartialPostingLists loop
6 : for each doc-ID in PostList loop
7 : correct doc-ID
8 : Merge PostList into Posting-List
9 : end loop
10: end loop
11: emit(Posting-List)
entry the number of documents processed by previous map
tasks. However, note that in our indexing implementation,
the doc-IDs are flush-local as well as map-local. While this
is not strictly necessary, it allows smaller doc-IDs to be
emitted from each map, which can be better compressed.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>EXPERIMENTS &amp; RESULTS</title>
      <p>In the following experiments, we aim to determine the
efficiency of multiple indexing implementations. Specifically,
we investigate whether distributed indexing as laid out in
the original MapReduce paper (Section 4.1) is fit for
purpose. We compare this to our single-pass indexing strategy
developed both for a single machine architecture (Section 2)
and for MapReduce (Section 4.3). Note that in this paper
we do not investigate Nutch’s MapReduce indexing
strategy; however, we expect it to be more efficient than Dean
&amp; Ghemawat’s indexing strategy, while being less efficient
than our single-pass indexing strategy. We leave this for
future work. Furthermore, we investigate these approaches in
terms of scalability as the number of machines designated for
work is increased, and experiment with various parameters
in MapReduce to determine how to most efficiently apply it
for indexing.
</p>
    </sec>
    <sec id="sec-13">
      <title>Research Questions</title>
      <p>To measure the efficiency of our indexing implementations
and therefore the suitability (or otherwise) of MapReduce
for indexing, we investigate three important research questions,
which we address by experimentation in the remainder of
this section:
1. Can a practical application of the distributed indexing strategy described in Section 2 be sufficient for large-scale collections when using many machines? (Section 5.4)
2. When indexing with MapReduce, what is the most efficient number of maps and reduces to use? (Section 5.5)
3. Is MapReduce performance close to optimal distributed indexing? (Section 5.6)</p>
    </sec>
    <sec id="sec-14">
      <title>Evaluation Metrics</title>
      <p>Research questions 1-3 require a metric for indexing
performance. For this, we measure the throughput of the
system, in terms of MB/s (megabytes per second). We
calculate throughput as collection size / time taken, where
collection size is the compressed size on disk of a single copy of
the collection in MB (megabytes), and time taken is the full
time taken by the job (including setup), measured in seconds.</p>
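      <p>As a worked example, the non-parallel baseline reported in Section 5.4 - indexing the 80GB compressed .GOV2 corpus in just over one day - corresponds to a throughput of roughly 81,920 MB / 86,400 s, i.e. approximately 0.95 MB/s, matching the quoted figure of about 1MB/s.</p>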
      <p>Research question 3 concerns suitability for indexing at a large scale. We measure suitability in terms of throughput (as above) and in terms of speedup. Speedup is defined as Sm = T1/Tm, where m is the number of machines, T1 is the execution time of the algorithm on a single machine, and Tm is the execution time in parallel, using m machines [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This
encompasses the idea that not only should speed improve
as more resources are added, but that such a speed increase
should reflect the quantity of those resources. For instance,
if we increase the available resources by a factor of 2, then it
would be desirable to get (close to) twice the speed. This is
known as linear speedup (where Sm = m), and is the ideal
scenario for parallel processing. However, linear speedup
can be hard to achieve in a parallel environment, because
of the growing influence of small sequential sections of code
as the number of processors increases (known as Amdahl’s
law [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), or due to overheads.
      </p>
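      <p>As an illustration of Amdahl’s law with assumed numbers: if a fraction p = 0.95 of a job is parallelisable, the achievable speedup on m machines is bounded by 1/((1 - p) + p/m), so for m = 8 the bound is 1/(0.05 + 0.95/8), approximately 5.9 - noticeably short of the linear speedup of 8.</p>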
    </sec>
    <sec id="sec-15">
      <title>Experimental Setup</title>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], which prescribes guidelines for presenting
indexing techniques, we now give details of our
experimental cluster setup, consisting of 19 identical machines. Each
machine has a single Intel Xeon 2.4GHz processor with 4
cores, 4GB of RAM, and contains three hard drives: One
160GB hard disk, spinning at 7200rpm with an 8MB buffer,
is used for the operating system and temporary job scratch
space; Two 400GB hard disks, each spinning at 7200rpm
with a 16MB buffer, are dedicated for distributed file
system storage. Each machine runs a copy of the open
source Linux operating system CentOS 5, and all are connected
together by gigabit Ethernet on a single rack.
The Hadoop (version 0.18.2) distributed file system (DFS)
is running on this cluster, replicating files to the distributed
file storage on each machine. Each file on the DFS is split
into 64MB blocks, each of which is replicated to 2 machines
(lower than the Hadoop default of 3, to conserve distributed
file system space). While each machine has four processor cores
available at any one time, only three of these are valid targets
for job execution; the last core is left free for the distributed
file system software running on each machine. As our cluster is shared
by several users, job allocation is done by Hadoop on
Demand (HOD) running with the Torque resource manager
(version 2.1.9) rather than using a dedicated Hadoop
cluster. Machines not allocated to a MapReduce job are
available to be scheduled by Torque for other jobs not associated
with MapReduce. However, on such nodes, the fourth
processor core is still kept free for distributed file system work.
Hence, as each machine always has one processing core free
to handle distributed file system traffic, and the network
traffic of other cluster jobs is assumed to be low, there
should be no impact on the validity of the experiments. We
also have in the same rack a RAID5 centralised file server
powered by 8 Intel Xeon 3GHz processor cores for use with
non-MapReduce jobs, providing network file system (NFS)
storage. For consistency, in the following experiments, we
employ the standard TREC Web collection .GOV2. This is
an 80GB (425GB uncompressed) crawl of the .gov Web domain,
comprising over 25 million documents. Before the advent of
ClueWeb09, .GOV2 was the largest available TREC corpus.
      </p>
    </sec>
    <sec id="sec-16">
      <title>Is Distributed Indexing Good Enough?</title>
      <p>First we determine if MapReduce is necessary for
largescale indexing. If a simple distribution of the non-parallel
indexing strategy described in Section 2 is sufficient to index
large collections then there is no need for MapReduce. To
evaluate this, we distribute the single-pass indexing
strategy across n machines in our cluster, where we vary n =
{1, 2, 4, 6, 8}. To provide a comparative baseline, the
non-parallel single-pass indexing implementation in Terrier can
index the .GOV2 corpus on a single processor core (not
machine) in just over 1 day using the same algorithm. This
translates into a throughput of approximately 1MB/sec. For
distributed indexing to be sufficient for indexing large
collections, throughput should increase in a (close-to) linear
fashion with the number of processing cores added.</p>
      <p>[Figure 4: Indexing time against the number of map tasks.]</p>
      <p>As mentioned in Section 2.3, when distributed indexing uses
machine-local data, indexing will achieve exactly linear
scaling. However, unless the document data is already present
on the machines (e.g. indexing takes place on the machines
which crawled the documents), there would be the need to
copy the required data to the indexing machines. In many
other scenarios, crawling or documents corpora storage may
not be on indexing machines. Moreover, local-only indexing
is not resilient to machine failure. Instead, we experiment
with the shared-corpus distributed indexing, where the
corpus is indexed over NFS from a central fileserver. Local data
(shared-nothing) indexing would require the corpus subset
to be copied prior to indexing.</p>
      <p>Table 1, row 1, shows how throughput increases as we
allocate more machines (recall that each machine adds three
processor cores for indexing work). Here we can see that
throughput indeed increases in a reasonable fashion.
However, once we allocate more than 4 machines, we observe
no further speed improvements. This is caused by our
central file store becoming a bottleneck as it is unable to serve
all the allocated machines simultaneously. We can
therefore conclude that this distribution method is unsuitable for
large-scale indexing using our hardware setup. Moreover,
we argue that even with better hardware this issue cannot
be overcome as the file server(s) will always be slower than
the combination of all worker machines.
</p>
    </sec>
    <sec id="sec-17">
      <title>Investigating MapReduce Parameters</title>
      <p>In Section 5.4, we showed that the distributed indexing
strategy described in Section 2 is unsuitable for the
scalable distributed shared-corpus indexing of large collections.
However, before we can evaluate MapReduce as an alternate
solution we need to investigate how to maximise its efficiency
in terms of its input parameters. The fundamental
parameters of a MapReduce job are m - the number of map tasks
that the input data is divided across - and r, the number of
reduce tasks. A higher number of map tasks means that the
input collection of documents is split into smaller chunks,
but also that there will be more overheads, as more tasks
have to be initialised and latterly cleared. To determine
what effect this has on performance, we vary m while
indexing the .GOV2 corpus, using a fixed set of 4 machines. The results
- in terms of indexing time - are shown in Figure 4. We see
that when the number of maps is small (i.e. less than the
12 processors available from the 4 machines), parallelism is
hindered, as not all processors have work to do. When the
number of map tasks is ≤ 14, we also note that indexing
time is still high. On examination of these jobs, we found
that the balance of work between map tasks was not even,
with one map task taking markedly longer than the others.
(Hadoop supports speculative execution, where two copies of
the last task, or the slowest tasks, will be started; only the
output from the first to complete successfully is used,
exploiting otherwise idle processing power to decrease average
job duration.) When the number of map tasks is increased
to 16, balance is restored.</p>
      <p>[Figure 5: Time taken to index (seconds) against the number of reduce tasks.]</p>
      <p>
        In previous work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we have shown that the time taken
by the reduce step is an important factor in determining
indexing performance. Therefore, it is important to know how
many reduce tasks it is optimal to create - subject to
external constraints on the number of reducers (e.g. having 8
query servers suggests using 8 reducers, so that 8 final
indices are created). To test the effect of the number of reduce
tasks on efficiency, we index .GOV2 while varying the
number of reduce tasks. Here we used 6 machines and 72 map
tasks. The indexing time results are shown in Figure 5. As
we would expect, while the number of reduces is below the
available processors (for the 6 machines allocated, 18
processors) the speed increases as we add more reducers, since
we are effectively providing more parallel processing power.
Once we are beyond the number of processors, however,
indexing time increases. This is intuitive, as there is more work
to be done than available processors. Therefore, we can
conclude that the number of reduce tasks should be a multiple
of the number of processors. Unlike map tasks, however,
there is an incentive to have fewer reduce tasks, resulting in
fewer indices, but this needs to be traded off against the
possibility of failures and the associated time wasted through
re-running.
      </p>
    </sec>
    <sec id="sec-18">
      <title>Is MapReduce Performance Close to</title>
    </sec>
    <sec id="sec-19">
      <title>Optimal Distributed Indexing?</title>
      <p>We now investigate whether MapReduce is an efficient
alternative to distributed indexing. Moreover, we evaluate
MapReduce against optimal distributed indexing in terms
of performance, i.e. the extent to which it scales close to
linearly with processing power. The core advantage of
MapReduce is the ability to apply the distributed file system
(DFS) to avoid centralised storage of data (creating a
single point of failure), and to take advantage of data locality
to avoid excess network IO. This, however, comes at the cost
of additional overheads in job setup, monitoring and
control, as well as the additional IO required to replicate the
data on a DFS. As the centralised file-system was
identified as the bottleneck for distributed indexing, we would
expect MapReduce to perform better since it uses a DFS.
For evaluation, we perform a direct comparison of
throughput between the indexing strategies. Note that while distributed
indexing creates n index shards, where n is the number of
processors allocated, MapReduce instead produces r index
shards where r is the number of reduce tasks created. For
these experiments we always allocate 72 map tasks and 24
reduce tasks. This means that for distributed indexing a
smaller number of index shards were created when indexing
on {1, 2, 4, 6} machines. However, we believe that this has
no significant impact on the overall throughput.</p>
      <p>First, we investigate whether the MapReduce indexing
strategy proposed by Dean &amp; Ghemawat is more efficient
than distributed indexing. Table 1 shows how the
throughput increases as we allocate more machines - in particular,
row 2 shows results for Dean &amp; Ghemawat’s strategy,
interpreted as emitting &lt;term, (doc-ID, tf)&gt; tuples (Section 4.1.2).
We also implemented the other interpretation, which emits
&lt;term, doc-ID&gt; tuples; however, it consumed excessive
temporary storage space during operation due to its large number
of emit operations. This made it impossible to determine
throughput, as the worker machines ran out of disk space
causing the job to fail. Our implementation of Dean &amp;
Ghemawat’s indexing method also creates the additional data
structures described in Section 2.1 - i.e. the lexicon and
document index - and uses the compressed Terrier inverted
index format. From Table 1, row 2, we can see that this
implementation performs very poorly in comparison to
distributed indexing. Indeed, with 8 machines it indexes only at
half the speed of distributed indexing with the same number
of machines. Upon further investigation, as expected, this
speed degradation can be attributed to the large volume
of map output which is generated by this approach.
However, it should be noted that unlike distributed indexing,
performance improvements do not stall after 4 machines.
This would indicate that while the indexing strategy is poor,
MapReduce in general will continue to garner performance
improvements as more machines are added. Therefore, we
believe this makes it more suitable for processing larger
corpora, where larger clusters of 100s-1000s of machines are
needed to index them in reasonable amounts of time.</p>
      <p>We now experiment with our proposed implementation of
single-pass indexing in MapReduce, as described in Section
4.3. Our expectation is that this strategy should prove to
be more efficient as it lowers disk and network IO by
building up posting lists in memory, thereby minimising map
output size. Table 1, row 3 shows the throughput of the
single-pass MapReduce indexing strategy. In comparison to
Dean &amp; Ghemawat’s indexing strategy, we find our approach
to be markedly faster. Indeed, when using 8 machines our
method is over 2.7 times faster. Moreover, Figure 6 shows
the speedup achieved by both approaches as the number of
machines is increased. We observe that our single-pass based
strategy scales close to linearly in terms of indexing time as
the number of machines allocated for work is increased. In
contrast, the scalability of Dean &amp; Ghemawat’s approach is
noticeably worse (5.5 times for 8 processors, versus 6.8 times
for single-pass based indexing). We believe that this makes
our proposed strategy suitable for scaling to large clusters of
machines, which is essential when indexing new large-scale
collections like ClueWeb09.</p>
      <p>[Figure 6: Speedup against the number of allocated machines.]</p>
    </sec>
    <sec id="sec-20">
      <title>CONCLUSION</title>
      <p>
        In this paper, we detailed four different strategies for
applying document indexing within the MapReduce paradigm,
with varying efficiency. In particular, we firstly showed that
indexing speed using a distributed indexing strategy was
limited by accessing a centralised file-store, and hence the
advantage of using MapReduce to allocate indexing tasks
close to input data is clear. Secondly, we showed that the
MapReduce indexing strategy suggested by Dean &amp;
Ghemawat in the original MapReduce paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] generates too
much intermediate map data, causing indexing to be slow
overall. In contrast, our proposed single-pass indexing
strategy is almost 3 times faster, and scales well as the
number of machines allocated is increased.
      </p>
      <p>Overall, we conclude that the single-pass based MapReduce
indexing algorithm should be suitable for efficiently
indexing larger corpora, including the recently released TREC
ClueWeb09 corpus. Moreover, as a framework for distributed
indexing, MapReduce conveniently provides both data
locality and resilience. Finally, it is of note that an
implementation of the MapReduce single-pass indexing strategy
described in this paper is freely available for use by the
community as part of the Terrier IR Platform6.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>Apache Software Foundation</collab>
          .
          <article-title>The Apache Hadoop project</article-title>
          . http://hadoop.apache.org/, as of 15/06/
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amdahl</surname>
          </string-name>
          .
          <article-title>Validity of the single processor approach to achieving large-scale computing capabilities</article-title>
          .
          <source>In Proc. of AFIPS</source>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>485</lpage>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Cutting</surname>
          </string-name>
          .
          <article-title>Building nutch: Open source search</article-title>
          .
          <source>ACM Queue</source>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-T.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-A.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Bradski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Olukotun</surname>
          </string-name>
          .
          <article-title>Map-reduce for machine learning on multicore</article-title>
          .
          <source>In Proc. of NIPS</source>
          <year>2006</year>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          .
          <article-title>MapReduce: Simplified data processing on large clusters</article-title>
          .
          <source>In Proc. of OSDI</source>
          <year>2004</year>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Universal codeword sets and representations of the integers</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>194</fpage>
          -
          <lpage>203</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gobioff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-T.</given-names>
            <surname>Leung</surname>
          </string-name>
          .
          <article-title>The Google file system</article-title>
          .
          <source>SIGOPS Oper. Syst. Rev.</source>
          ,
          <volume>37</volume>
          (
          <issue>5</issue>
          ):
          <fpage>29</fpage>
          -
          <lpage>43</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Heinz</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Efficient single-pass index construction for text databases</article-title>
          .
          <source>JASIST</source>
          ,
          <volume>54</volume>
          (
          <issue>8</issue>
          ):
          <fpage>713</fpage>
          -
          <lpage>729</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Hill</surname>
          </string-name>
          .
          <article-title>What is scalability?</article-title>
          <source>SIGARCH Comput. Archit. News</source>
          ,
          <volume>18</volume>
          (
          <issue>4</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Budiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birrell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Fetterly</surname>
          </string-name>
          .
          <article-title>Dryad: distributed data-parallel programs from sequential building blocks</article-title>
          .
          <source>In Proc. of EuroSys</source>
          <year>2007</year>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <article-title>Frameworks = (components + patterns)</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>40</volume>
          (
          <issue>10</issue>
          ):
          <fpage>39</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Laclavik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seleng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Hluchý</surname>
          </string-name>
          .
          <article-title>Towards large scale semantic annotation built on MapReduce architecture</article-title>
          .
          <source>In Proc. of ICCS (3)</source>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>338</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>McCreadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>On single-pass indexing with MapReduce</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2009</year>
          , in press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Building a distributed full-text index for the web</article-title>
          .
          <source>In Proc. of WWW</source>
          <year>2001</year>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          .
          <article-title>Pig Latin: A not-so-foreign language for data processing</article-title>
          .
          <source>In Proc. of SIGMOD</source>
          <year>2008</year>
          , pp.
          <fpage>1099</fpage>
          -
          <lpage>1110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>O'Malley</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Murthy</surname>
          </string-name>
          .
          <article-title>Winning a 60 second dash with a yellow elephant</article-title>
          . Technical report, Yahoo! Inc.,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          .
          <article-title>Terrier: A high performance and scalable information retrieval platform</article-title>
          .
          <source>In Proc. of OSIR workshop at SIGIR</source>
          <year>2006</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dorward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Griesemer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <article-title>Interpreting the data: Parallel analysis with Sawzall</article-title>
          .
          <source>Scientific Programming</source>
          ,
          <volume>13</volume>
          (
          <issue>4</issue>
          ):
          <fpage>277</fpage>
          -
          <lpage>298</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>de Moura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Neubert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziviani</surname>
          </string-name>
          .
          <article-title>Efficient distributed algorithms to build inverted files</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>1999</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>E.</given-names>
            <surname>Schonfeld</surname>
          </string-name>
          .
          <article-title>Yahoo! search wants to be more like Google, embraces Hadoop</article-title>
          ,
          <year>2008</year>
          . http://www.techcrunch.com/2008/02/20/yahoo-search-wants-to-be-more-like-google-embraces-hadoop/, as of 15/06/
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomasic</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <article-title>Performance of inverted indices in shared-nothing distributed text document information retrieval systems</article-title>
          .
          <source>In Proc. of PDIS</source>
          <year>1993</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <source>Managing Gigabytes: Compressing and Indexing Documents and Images</source>
          . Morgan Kaufmann,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamohanarao</surname>
          </string-name>
          .
          <article-title>Guidelines for presentation and comparison of indexing techniques</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>