=Paper= {{Paper |id=Vol-480/paper-5 |storemode=property |title=Comparing Distributed Indexing: To MapReduce or Not? |pdfUrl=https://ceur-ws.org/Vol-480/paper5.pdf |volume=Vol-480 |dblpUrl=https://dblp.org/rec/conf/sigir/McCreadieMO09a }} ==Comparing Distributed Indexing: To MapReduce or Not?== https://ceur-ws.org/Vol-480/paper5.pdf
     Comparing Distributed Indexing: To MapReduce or Not?

           Richard M. C. McCreadie, Craig Macdonald, Iadh Ounis
           Department of Computing Science, University of Glasgow
           Glasgow, G12 8QQ
           richardm@dcs.gla.ac.uk, craigm@dcs.gla.ac.uk, ounis@dcs.gla.ac.uk


ABSTRACT
Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the auspices of the MapReduce programming framework. In particular, we describe two indexing approaches based on the original MapReduce paper, and compare these with a standard distributed IR system, the MapReduce indexing strategy used by the Nutch IR platform, and a more advanced MapReduce indexing implementation that we propose. Experiments using the Hadoop MapReduce implementation and a large standard TREC corpus show our proposed MapReduce indexing implementation to be more efficient than those proposed in the original paper.

1. INTRODUCTION
   The Web is the largest known document repository, and poses a major challenge for Information Retrieval (IR) systems, such as those used by Web search engines or Web IR researchers. Indeed, while the index sizes of major Web search engines are a closely guarded secret, these are commonly accepted to be in the range of billions of documents. For researchers, the recently released TREC ClueWeb09 corpus¹ of 1.2 billion Web documents poses both indexing and retrieval challenges. In both scenarios, the ability to efficiently create appropriate index structures to allow effective and efficient search is of much value. Moreover, at such scale, the use of distributed architectures to achieve high throughput is essential.
   In this work, we investigate the MapReduce programming paradigm, which has been gaining popularity in commercial settings, with implementations by Google [5] and Yahoo! [21]. Microsoft also has a similar framework for distributed operations [10]. In particular, MapReduce allows the horizontal scaling of large-scale workloads using clusters of machines. It applies the intuition that many common large-scale tasks can be expressed as map and reduce operations [5], thereby providing an easily accessible framework for parallelism over multiple machines.
   However, while MapReduce has been widely adopted within Google, and is reportedly used for their main indexing process, the MapReduce framework implementation and other programs using it remain (understandably) internal only. Moreover, there have been few empirical studies undertaken into the scalability of MapReduce beyond that contained within the original MapReduce paper [5], which in particular demonstrates the scalability of the simple operations grep and sort. More recently, a MapReduce implementation has been used to sort 1 terabyte of data in approx. 1 minute [17]. However, while Dean & Ghemawat [5] suggest a simple formulation in MapReduce for document indexing, no studies have empirically shown the benefits of applying MapReduce to the important IR indexing problem.
   This paper contributes a first step towards understanding the benefits of indexing large corpora using MapReduce, in comparison to other indexing implementations. In particular, we describe four different methods of performing document indexing in MapReduce, from initial suggestions by Dean & Ghemawat, to more advanced strategies. We deploy MapReduce indexing strategies in the Terrier IR platform [18], using the freely available Hadoop implementation [1] of MapReduce, and then perform experiments using standard TREC data.
   The remainder of this paper is structured as follows: Section 2 describes a state-of-the-art single-pass indexing strategy; Section 3 introduces the MapReduce paradigm; Section 4 describes strategies for document indexing in MapReduce; Section 5 describes our experimental setup, research questions, experiments, and analysis of results; concluding remarks are provided in Section 6.

¹See http://boston.lti.cs.cmu.edu/Data/clueweb09/.

Copyright © 2009 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors.
LSDS-IR Workshop. July 2009. Boston, USA.

2. INDEXING
   In the following, we briefly describe the structures involved in the indexing process (Section 2.1) and how the modern single-pass indexing strategy is deployed in the open source Terrier IR platform [18] on which this work is based (Section 2.2). We then provide details of how an indexing process can be distributed to make use of additional machines (Section 2.3).

2.1 Index Structures
   To allow efficient retrieval of documents from a corpus, suitable data structures must be created, collectively known as an index. Usually, a corpus covers many documents, and hence the index will be held on a large storage device - commonly one or more hard disks. Typically, at the centre of any IR system is the inverted index [23]. For each term, the inverted index contains a posting list, which lists the documents - represented as integer document-IDs (doc-IDs) - containing the term. Each posting in the posting list also stores sufficient statistical information to score each document, such as the frequency of the term occurrences and, possibly, positional information (the position of the term within each document, which facilitates phrase or proximity search) [23] or field information (the occurrence of the term in various semi-structured areas of the document, such as the title, enabling these to be higher-weighted during retrieval). The inverted index does not store the textual terms themselves, but instead uses an additional structure known as a lexicon to store these along with pointers to the corresponding posting lists within the inverted index. A document index may also be created which stores meta-information about each document within the inverted index, such as an external name for the document (e.g. URL), and the length of the document [18]. The process of generating these structures is known as indexing.

2.2 Single-pass Indexing
   When indexing a corpus of documents, documents are read from their storage location on disk, and then tokenised. Tokens may then be removed (stop-words) or transformed (e.g. stemming), before being collated into the final index structures [23]. Current state-of-the-art indexing uses a single-pass indexing method [8], where the (compressed) posting lists for each term are built in memory as the corpus is scanned. However, it is unlikely that the posting lists for very many documents would fit wholly in the memory of a single machine. Instead, when memory is exhausted, the partial indices are 'flushed' to disk. Once all documents have been scanned, the final index is built by merging the flushed partial indices.
   In particular, the temporary posting lists held in memory are of the form list(term, list(doc-ID, Term Frequency)). Additional information such as positions or fields can also be held within each posting. As per modern compression schemes, only the first doc-ID in each posting list is absolute - for the rest, the differences between doc-IDs are instead stored to save space, using Elias-Gamma compression [6].

2.3 Distributing Indexing
   The single-pass indexing strategy described above is designed to run on a single machine architecture with finite available memory. However, should we want to take advantage of multiple machines, this can be achieved in an intuitive manner by deploying an instance of this indexing strategy on each machine [22]. For machines with more than one processor, one instance per processing core is possible, assuming the local disk and memory are not saturated. As described by Ribeiro-Neto & Barbosa [20], each instance would index a subset of the input corpus to create an index for only those documents. It should be noted that if the documents to be indexed are local to the machines doing the work (shared-nothing), such as when each machine has crawled the documents it is indexing, then this strategy will always be optimal (it will scale linearly with processing power). However, in practical terms, fully machine-local data is difficult to achieve when a large number of machines is involved. This stems from the need to split and distribute the corpus without overloading the network or risking unrecoverable data loss from a single point of failure.
   Distributed indexing has seen some coverage in the literature. Ribeiro-Neto & Barbosa [20] compared three distributed indexing algorithms for indexing 18 million documents. Efficiency was measured with respect to the local throughput of each processor, not in terms of overall indexing time. Unfortunately, they do not state the underlying hardware that they employ, and as such their results are difficult to compare against. Melnik et al. [15] described a distributed indexing regime designed for the Web, with considerations for updatable indices. However, their experiments did not consider efficiency as the number of nodes is increased.
   In [5], Dean & Ghemawat proposed the MapReduce paradigm for distributing data-intensive processing across multiple machines. Section 3 gives an overview of MapReduce. Section 4 reviews prior work on MapReduce indexing, namely that of Dean & Ghemawat, who suggest how document indexing can be implemented in MapReduce, and from the Nutch IR system. Moreover, we propose a more advanced method of MapReduce indexing, which, by the experiments in Section 5, is shown to be more efficient.

3. MAPREDUCE
   MapReduce is a programming paradigm for the processing of large amounts of data by distributing work tasks over multiple processing machines [5]. It was designed at Google as a way to distribute computational tasks which are run over large datasets. It is built on the idea that many computationally intensive tasks involve doing a 'map' operation with a simple function over each 'record' in a large dataset, emitting key/value pairs to comprise the results. The map operation itself can be easily distributed by running it on different machines processing different subsets of the input data. The output from each of these is then collected and merged into the desired results by 'reduce' operations.
   By using the MapReduce abstraction, the complex details of parallel processing, such as fault tolerance and node availability, are hidden in a conceptually simple framework [13], allowing highly distributed tools to easily be built on top of MapReduce. Indeed, various companies have developed tools to perform data mining operations on large-scale datasets on top of MapReduce implementations. Google's Sawzall [19] and Yahoo!'s Pig [16] are two such examples of data mining languages. Microsoft uses a distributed framework similar to MapReduce called Dryad, which the Nebula scripting language uses to provide similar data mining capabilities [10]. However, it is of note that MapReduce trades the ability to perform code optimisation (by abstracting from the internal workings) for easy implementation through its framework, meaning that an implementation in MapReduce is likely not the optimal solution, but will be cheaper to produce and maintain [11].
   MapReduce is designed from a functional programming perspective, where functions provide definitions of operations over input data. A single MapReduce job is defined by the user as two functions. The map function takes in a key/value pair (of type <key1, value1>) and produces a set of intermediate key/value pairs (<key2, value2>). The outputs from the map function are then automatically grouped by their key, and then passed to the reduce function. The reduce task merges the values with the same key to form a smaller final result. A typical job will have many map tasks which each operate on a subset of the input data, and fewer reduce tasks, which operate on the merged output of the map tasks. Map or reduce tasks may run on different machines, allowing parallelism to be achieved. In common with functional programming design, each task is independent of other tasks of the same type, and there is no global state, or communication between maps or between reduces.
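The map/group/reduce dataflow just described can be sketched in a few lines of plain Python. This is an illustrative single-process simulation of the paradigm, not the Hadoop API; the function and record names are invented for the example:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process sketch of the MapReduce dataflow: apply
    map_fn to each (key1, value1) record, group the emitted
    (key2, value2) pairs by key2 (the framework's sort phase),
    then apply reduce_fn to each group of values."""
    intermediate = []
    for key1, value1 in records:
        intermediate.extend(map_fn(key1, value1))
    intermediate.sort(key=itemgetter(0))  # group identical key2 values together
    return [reduce_fn(key2, [v for _, v in pairs])
            for key2, pairs in groupby(intermediate, key=itemgetter(0))]

# Toy usage: count term occurrences across two 'documents'.
records = [("d1", "to index or not"), ("d2", "index the index")]
result = run_mapreduce(
    records,
    map_fn=lambda name, text: [(term, 1) for term in text.split()],
    reduce_fn=lambda term, ones: (term, sum(ones)),
)
assert dict(result)["index"] == 3
```

In a real deployment, the map calls, the sort phase and the per-key reduce calls would be spread over many machines by the framework, as described above.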
   Counting term occurrences in a large data-set is an often-repeated example of how to use the MapReduce paradigm² [5]. For this, the map function takes the document file-name (key1) and the contents of the document (value1) as input, then for each term in the document emits the term (key2) and the integer value '1' (value2). The reduce then sums up all of the values (many 1s) for each key2 (a term) to give the total occurrences of that term.
   As mentioned above, MapReduce jobs are executed over multiple machines. In a typical setup, data is not stored in a central file store, but instead replicated in blocks (usually of 64MB) across many machines [7]. This has the central advantage that the map functions can operate on data that may be 'rack-local' or 'machine-local' - i.e. it does not have to transit intra- and inter-data centre backbone links, and does not overload a central file storage service. Therefore high bandwidth can be achieved because data is always as local as possible to the processing CPUs. Intermediate results of map tasks are stored on the processing machines themselves. To reduce the size of this output (and therefore IO), it may be merged using a combiner, which acts as a reducer local to each machine. A central master machine provides job and task scheduling, which attempts to perform tasks as locally as possible to the input data.
   While MapReduce is seeing increasing popularity, there are only a few notable studies investigating the paradigm beyond the original paper. In particular, for machine learning [4], Chu et al. studied how various machine learning algorithms could be parallelised using the MapReduce paradigm; however, experiments were only carried out on single systems, rather than a cluster of machines. In such a situation, MapReduce provides an easy framework to distribute non-cooperating tasks of work, but misses the central data locality advantage facilitated by a MapReduce framework. A similar study for natural language processing [12] used several machines, but with experimental datasets of only 88MB and 770MB, and would again fail to see benefit in the data-local scheduling of tasks.
   In contrast, indexing is an IO-intensive operation, where large amounts of raw data have to be read and transformed into suitable index structures. In this work, we show how indexing can be implemented in a MapReduce framework. However, the MapReduce implementation described in [5] is not available outside of Google. Instead, we use the Hadoop [1] framework, which is an open-source Java implementation of MapReduce from the Apache Software Foundation, with developer contributions from Yahoo! and Facebook, among others. In the next section, we describe several indexing strategies in MapReduce, starting from that proposed in the original MapReduce paper [5], before developing a more refined strategy inspired by the single-pass indexing described in Section 2.2.

4. INDEXING IN MAPREDUCE
   In this section, we show how indexing can be performed in MapReduce. Firstly, we describe two possible interpretations of indexing as envisaged by Dean & Ghemawat in their original seminal MapReduce paper [5] (Section 4.1). Then, we describe an alternative MapReduce indexing strategy used by the Nutch IR platform, before finally showing how a more refined single-pass indexing strategy can be implemented in MapReduce (Section 4.3).
   It should be noted that in MapReduce each map task is not aware of its context in the overall job. For indexing, this means that the doc-IDs emitted from the map phases cannot be globally correct. Instead, these doc-IDs start from 0 in each map. To allow the reduce tasks to calculate the correct doc-IDs, each map task produces a "side-effect" file, detailing the number of documents emitted per map. This is true for all the indexing implementations described in this section. We also note that for all our indexing implementations the number of reducers specified depicts the number of final indices generated.

4.1 Dean & Ghemawat's MapReduce Indexing Strategy
   The original MapReduce paper by Dean & Ghemawat [5] presents a short description for performing indexing in MapReduce, which is directly quoted below:
   "The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions."
   The implicit claim being made in the original MapReduce paper [5] is that efficient indexing could be trivially implemented in MapReduce. However, we argue that this oversimplifies the details, and provides room for a useful study to allow document indexing in MapReduce to be better understood. For example, for an inverted index to be useful, the term frequencies within each document need to be stored. Though this is not accounted for in Dean & Ghemawat's paper, there are two possible interpretations on how this could be achieved within the bounds laid out in the quotation above. We detail these interpretations below in Sections 4.1.1 and 4.1.2, respectively.

4.1.1 Emitting Term,Doc-ID Tuples
   The literal interpretation of the description above would be to output a set of <term, doc-ID> pairs for each token in a document. This means that if a single term appears n times in a document then the <term, doc-ID> pair will be emitted n times. This has the advantage of making the map phase incredibly simple, as it emits on a per-token basis. However, this means that we will emit a <term, doc-ID> pair for every token in the collection. In general, when a map task emits lots of intermediate data, this will be saved to the machine's local disk, and then later transferred to the appropriate reducer. However, with this indexing interpretation, the intermediate map data would be extremely large - indeed, similar to the size of the corpus, as each token in the corpus is emitted along with a doc-ID. Having large amounts of intermediate map data will increase map-to-reducer network traffic, as well as lengthening the sort phase. These are likely to have an effect on the job's overall execution time. The reducer will - for each unique term - sort the doc-IDs, then add up the instances on a per-doc-ID basis to retrieve the term frequencies. Finally, the reducer will write the completed posting list for that term to disk. Figure 1 provides a pseudo-code implementation of map and reduce functions for this strategy.

²A worked example and associated source code is available at http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html
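Both the word-count job described above and the literal per-token indexing interpretation of Section 4.1.1 can be simulated in plain Python. This is a single-process sketch for illustration only; the toy documents and helper names are invented for the example:

```python
from collections import Counter, defaultdict

def map_wordcount(doc_name, contents):
    # Word count: emit (term, 1) for every token in the document.
    return [(term, 1) for term in contents.split()]

def map_index(doc_id, contents):
    # Literal Section 4.1.1 indexing: emit a (term, doc-ID) pair for
    # every token, so a term occurring n times emits n pairs.
    return [(term, doc_id) for term in contents.split()]

def reduce_index(term, doc_ids):
    # Sort the doc-IDs and count repeats per doc-ID to recover the
    # term frequencies, giving the posting list for this term.
    return term, sorted(Counter(doc_ids).items())

docs = [(0, "to map or not to map"), (1, "map and reduce")]

# Word count: group the map output by term and sum the 1s.
counts = defaultdict(int)
for doc in docs:
    for term, one in map_wordcount(*doc):
        counts[term] += one
assert counts["map"] == 3 and counts["to"] == 2

# Indexing: group the (term, doc-ID) pairs by term, then reduce
# each group into a posting list of (doc-ID, tf) entries.
grouped = defaultdict(list)
for doc in docs:
    for term, doc_id in map_index(*doc):
        grouped[term].append(doc_id)
index = dict(reduce_index(t, ids) for t, ids in grouped.items())
assert index["map"] == [(0, 2), (1, 1)]
```

Note how the intermediate `grouped` data holds one entry per token of the collection - the source of the heavy intermediate IO argued above.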
     Dean & Ghemawat MapReduce Indexing -
                Map function pseudo-code
  1: Input
         Key: Document Identifier, Name
         Value: Contents of the Document, DocContents
  2: Output
         A list of (term, doc-ID) pairs, one for each token
         in the document
  3: for each Term in the DocContents loop
  4 : Stem(Term)
  5 : deleteIfStopword(Term)
  6 : if (Term is not empty) then emit(Term, doc-ID)
  7: end loop
  8: Add document to the Document Index
  9: if (lastMap()) write out information about the
  10: documents this map processed ("side-effect" files)

     Dean & Ghemawat MapReduce Indexing -
             Reduce function pseudo-code
  1: Input
        Key: A Term
        Value: List of (doc-ID), doc-IDs
  2: Output
        Key: Term
        Value: Posting List
  3 : List Posting-List = new PostingList()
  4 : Sort doc-IDs
  5 : for each doc-ID in doc-IDs loop
  6 : increment tf for doc-ID
  7 : correct doc-ID
  8 : add doc-ID and tf to Posting-List
  9 : end loop
  10: emit(Posting-List)

Figure 1: Pseudo-code interpretation of Dean & Ghemawat's MapReduce indexing strategy (map emitting <term, doc-ID>, Section 4.1.1).

4.1.2 Emitting Term,Doc-ID,TF Tuples
   We claim that emitting once for every token extracted is wasteful of resources, causing excessive disk IO on the map by writing intermediate map output to disk, and excessive disk IO in moving map output to the reduce tasks. To reduce this, we could instead emit <term, doc-ID, tf> tuples, where tf is the term frequency for the current document. In this way, the number of emit operations which have to be done is significantly reduced, as we now only emit once per unique term per document. The reduce method for this interpretation is also much simpler than the earlier interpretation, as it only has to sort instances by document to get the final posting list to write out. It should also be noted that the <term, doc-ID> strategy described earlier can be adapted to generate tf values instead through the use of a MapReduce combiner, which performs a localised merge on each map task's output.
   While the <term, doc-ID, tf> indexing strategy emits significantly less than that described in Section 4.1.1, we argue that an implementation in this manner would still be inefficient, because a large amount of IO is still required to store, move and sort the temporary map output data.

4.2 Nutch's MapReduce Indexing Strategy
   The Apache Software Foundation's open source Nutch platform [3] also deploys a MapReduce indexing strategy, using the Hadoop MapReduce implementation. By inspection of the source of Nutch v0.9, we have determined that the MapReduce indexing strategy differs from the general outline described in Section 4.1 above. Instead of emitting terms, Nutch only tokenises the document during the map phase, hence emitting <doc-ID, analysed-Document> tuples from the map function. Each analysed-Document contains the textual forms of each term and their corresponding frequencies. The reduce phase is then responsible for writing all index structures. Compared to emitting <term, doc-ID>, the Nutch indexing method will emit less, but the value of each emit will contain substantially more data (i.e. the textual form and frequency of each unique term in the document). We believe this is a step forward towards reducing intermediate map output. However, there may still be scope for further reducing map task output to the benefit of overall indexing efficiency. In the next section, we develop our single-pass indexing strategy (described in Section 2.2) for the MapReduce framework, to address this issue.

4.3 Single-pass MapReduce Indexing Strategy
   We now adapt the single-pass indexing strategy described in Section 2.2, for use in a MapReduce framework. The indexing process is split into m map tasks. Each map task operates on its own subset of the data, and is similar to the single-pass indexing corpus scanning phase. However, when memory runs low or all documents for that map have been processed, the partial index is flushed from the map task, by emitting a set of <term, posting list> pairs. The partial indices (flushes) are then sorted by term, map and flush numbers before being passed to a reduce task. As the flushes are collected at an appropriate reduce task, the posting lists for each term are merged by map number and flush number, to ensure that the posting lists for each term are in a globally correct ordering. The reduce function takes each term in turn and merges the posting lists for that term into the full posting list, as a standard index. Elias-Gamma compression is used as in non-distributed indexing to store only the distance between doc-IDs. Figure 2 provides a pseudo-code implementation of map and reduce functions for our proposed MapReduce indexing strategy.
   The fundamental difference between this strategy and that of Dean & Ghemawat described in Section 4.1 is what the map tasks emit. Instead of emitting a batch of <term, doc-ID> pairs immediately upon parsing each document, we instead build up a posting list for each term in memory. Over many documents, memory will eventually be exhausted, at which time all currently stored posting lists will be flushed as <term, posting list> tuples. This has the positive effect of minimising both the size of the map task output, as well as the number of emits. Compared to the Dean & Ghemawat indexing strategies, far fewer emits will be called, but emits will be much larger. Compared to the Nutch MapReduce indexing strategy, there may be more emits; however, the reduce task is operating on term-sorted data, and does not require a further sort and invert operation to generate an inverted index. Moreover, the emit values will only contain doc-IDs instead of textual terms, making them considerably smaller.
   Figure 3 presents an example of the MapReduce indexing paradigm in a distributed setting with 200 documents. The documents are indexed by m = 2 map tasks, before the posting lists for each term are grouped and sorted, and then reduced to a single index. The posting lists output from each map contain only local doc-IDs. In the reduce tasks, these are merged into a list of absolute doc-IDs, by adding to each doc-ID the number of documents processed by the preceding map tasks, as recorded in the "side-effect" files.
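The difference in map-side emit behaviour between the three strategies can be made concrete with a small simulation that only counts emits (ignoring compression, sorting, and the side-effect files). The toy documents and the flush-every-two-documents threshold are invented stand-ins for real data and memory exhaustion:

```python
from collections import Counter, defaultdict

docs = [(0, "a b a c"), (1, "b b d"), (2, "a d d d")]

# Section 4.1.1: one emit per token in the collection.
per_token = sum(len(text.split()) for _, text in docs)

# Section 4.1.2: one emit per unique term per document.
per_unique = sum(len(set(text.split())) for _, text in docs)

# Section 4.3: build posting lists in memory and emit one
# <term, posting list> pair per term held at each flush; here we
# flush after every 2 documents to stand in for low memory.
FLUSH_EVERY = 2
single_pass = 0
postings = defaultdict(list)
for local_id, (doc_id, text) in enumerate(docs):
    for term, tf in Counter(text.split()).items():
        postings[term].append((doc_id, tf))
    if (local_id + 1) % FLUSH_EVERY == 0 or local_id == len(docs) - 1:
        single_pass += len(postings)  # one emit per term in this flush
        postings.clear()

assert per_token == 11      # 4.1.1 emits most
assert per_unique == 7      # 4.1.2 emits once per unique term
assert single_pass == 6     # 4.3 batches terms across documents
```

The gap between the strategies widens with larger flushes, since a term appearing in many documents of a flush still costs only one emit under the single-pass strategy.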
entry the number of documents processed by previous map tasks. However, note that in our indexing implementation, the doc-IDs are flush-local as well as map-local. While this is not strictly necessary, it allows smaller doc-IDs to be emitted from each map, which can be better compressed.

             Single-Pass MapReduce Indexing -
                 Map function pseudo-code
     1: Input
           Key: Document Identifier, Name
           Value: Contents of the Document, DocContents
     2: Output
           Key: Term
           Value: Posting list
     3: for each Term in DocContents loop
     4:    Stem(Term)
     5:    deleteIfStopword(Term)
     6:    if (Term is not empty) then add the current
           document for that term to the in-memory
           Posting List
     7: end loop
     8: Add document to the Document Index
     9: if (lastMap() or outOfMemory()) then
           emit(in-memory Posting List)
     10: if (lastMap()) write out information about the
           documents this map processed (“side-effect” files)

             Single-Pass MapReduce Indexing -
                Reduce function pseudo-code
     1: Input
           Key: A Term
           Value: List of (Posting List), PartialPostingLists
     2: Output
           Key: Term
           Value: Posting list
     3: List Posting-List = new PostingList()
     4: Sort PartialPostingLists by the map and flush they
           were emitted from
     5: for each PostList in PartialPostingLists loop
     6:    for each doc-ID in PostList loop
     7:       correct doc-ID
     8:       Merge PostList into Posting-List
     9:    end loop
     10: end loop
     11: emit(Posting-List)

Figure 2: Pseudo-code for our proposed single-pass MapReduce indexing strategy (Section 4.3).

Figure 3: Correcting document IDs while merging.

5. EXPERIMENTS & RESULTS

   In the following experiments, we aim to determine the efficiency of multiple indexing implementations. Specifically, we investigate whether distributed indexing as laid out in the original MapReduce paper (Section 4.1) is fit for purpose. We compare this to our single-pass indexing strategy developed both for a single machine architecture (Section 2) and for MapReduce (Section 4.3). Note that in this paper we do not investigate Nutch's MapReduce indexing strategy; however, we expect it to be more efficient than Dean & Ghemawat's indexing strategy, while being less efficient than our single-pass indexing strategy. We leave this for future work. Furthermore, we investigate these approaches in terms of scalability as the number of machines designated for work is increased, and experiment with various parameters in MapReduce to determine how to most efficiently apply it for indexing.

5.1 Research Questions

   To measure the efficiency of our indexing implementations and therefore the suitability (or otherwise) of MapReduce for indexing, we investigate 3 important research questions, which we address by experimentation in the remainder of this section:
1. Can a practical application of the distributed indexing strategy described in Section 2 be sufficient for large-scale collections when using many machines? (Section 5.4)
2. When indexing with MapReduce, what is the most efficient number of maps and reduces to use? (Section 5.5)
3. Is MapReduce performance close to optimal distributed indexing? (Section 5.6)

5.2 Evaluation Metrics

   Research questions 1-3 require a metric for indexing performance. For this, we measure the throughput of the system, in terms of MB/s (megabytes per second). We calculate throughput as collection size / time taken, where collection size is the compressed size on disk for a single copy of the collection in MB (megabytes), and time taken is the full time taken by the job (including setup), measured in seconds.
   Research question 3 mandates suitability for indexing at a large scale. We measure suitability in terms of throughput (as above) and in terms of speedup. Speedup S_m is defined as S_m = T_1 / T_m, where m is the number of machines, T_1 is the execution time of the algorithm on a single machine, and T_m is the execution time in parallel, using m machines [9]. This encompasses the idea that not only should speed improve as more resources are added, but that such a speed increase should reflect the quantity of those resources. For instance, if we increase the available resources by a factor of 2, then it would be desirable to get (close to) twice the speed. This is known as linear speedup (where S_m = m), and is the ideal scenario for parallel processing. However, linear speedup can be hard to achieve in a parallel environment, because of the growing influence of small sequential sections of code as the number of processors increases (known as Amdahl's law [2]), or due to overheads.

5.3 Experimental Setup

   Following [24], which prescribes guidelines for presenting indexing techniques, we now give details of our experimental cluster setup, consisting of 19 identical machines. Each machine has a single Intel Xeon 2.4GHz processor with 4 cores, 4GB of RAM, and contains three hard drives: one 160GB hard disk, spinning at 7200rpm with an 8MB buffer, is used for the operating system and temporary job scratch space; two 400GB hard disks, each spinning at 7200rpm with a 16MB buffer, are dedicated to distributed file system storage. Each machine runs a copy of the open source Linux operating system Centos 5, and the machines are connected together by a gigabit Ethernet connection on a single rack. The Hadoop (version 0.18.2) distributed file system (DFS) is running on this cluster, replicating files to the distributed file storage on each machine. Each file on the DFS is split into 64MB blocks, which are each replicated to 2 machines³. While each machine has four processor cores available at any one time, only three of these are valid targets for job execution; the last core is left free for the distributed file system software running on each machine. As our cluster is shared by several users, job allocation is done by Hadoop on Demand (HOD) running with the Torque resource manager (version 2.1.9) rather than using a dedicated Hadoop cluster. Machines not allocated to a MapReduce job are available to be scheduled by Torque for other jobs not associated with MapReduce. However, on such nodes, the fourth processor core is still free for distributed file system work⁴. We also have in the same rack a RAID5 centralised file server powered by 8 Intel Xeon 3GHz processor cores for use with non-MapReduce jobs, providing network file system (NFS) storage. For consistency, in the following experiments we employ the standard TREC web collection .GOV2. This is an 80GB (425GB uncompressed) crawl of the .gov Web domain comprising over 25 million documents. Before the advent of ClueWeb09, .GOV2 was the largest available TREC corpus.

³ This is lower than the Hadoop default of 3, to conserve distributed file system space.
⁴ Hence, as each machine always has one processing core free to handle distributed file system traffic, and the network traffic of other cluster jobs is assumed to be low, there should be no impact on the validity of the experiments.

     Number of Machines (Cores)    1(3)   2(6)   4(12)  6(18)  8(24)
     Distributed Single-Pass       2.44   4.6    12.8   12.4   12.8
     Dean & Ghemawat MapReduce     1.15   1.59   4.01   4.71   6.38
     MapReduce Single-Pass         2.59   5.19   9.45   13.16  17.31

Table 1: Throughput as the number of machines allocated is increased, using a variety of indexing strategies, measured in MB/sec.

5.4 Is Distributed Indexing Good Enough?

   First we determine if MapReduce is necessary for large-scale indexing. If a simple distribution of the non-parallel indexing strategy described in Section 2 is sufficient to index large collections then there is no need for MapReduce. To evaluate this, we distribute the single-pass indexing strategy across n machines in our cluster, where we vary n = {1, 2, 4, 6, 8}. To provide a comparative baseline, the non-parallel single-pass indexing implementation in Terrier can index the .GOV2 corpus on a single processor core (not machine) in just over 1 day using the same algorithm. This translates into a throughput of approximately 1MB/sec. For distributed indexing to be sufficient for indexing large collections, throughput should increase in a (close-to) linear fashion with the number of processing cores added. As mentioned in Section 2.3, when distributed indexing uses machine-local data, indexing will achieve exactly linear scaling. However, unless the document data is already present on the machines (e.g. indexing takes place on the machines which crawled the documents), there would be the need to copy the required data to the indexing machines. In many other scenarios, crawling or document corpus storage may not be on the indexing machines. Moreover, local-only indexing is not resilient to machine failure. Instead, we experiment with shared-corpus distributed indexing, where the corpus is indexed over NFS from a central fileserver. Local data (shared-nothing) indexing would require the corpus subset to be copied prior to indexing.
   Table 1, row 1, shows how throughput increases as we allocate more machines (recall that each machine adds three processor cores for indexing work). Here we can see that throughput indeed increases in a reasonable fashion. However, once we allocate more than 4 machines we observe no further speed improvements. This is caused by our central file store becoming a bottleneck, as it is unable to serve all the allocated machines simultaneously. We can therefore conclude that this distribution method is unsuitable for large-scale indexing using our hardware setup. Moreover, we argue that even with better hardware this issue cannot be overcome, as the file server(s) will always be slower than the combination of all worker machines.

5.5 Investigating MapReduce Parameters

   In Section 5.4, we showed that the distributed indexing strategy described in Section 2 is unsuitable for the scalable distributed shared-corpus indexing of large collections. However, before we can evaluate MapReduce as an alternative solution we need to investigate how to maximise its efficiency in terms of its input parameters. The fundamental parameters of a MapReduce job are m - the number of map tasks that the input data is divided across - and r, the number of reduce tasks. A higher number of map tasks means that the input collection of documents is split into smaller chunks, but also that there will be more overheads, as more tasks have to be initialised and latterly cleared. To determine what effect this has on performance, we vary m while indexing the .GOV2 corpus, using a fixed set of 4 machines.

[Figure 4: The effect of varying the number of map tasks on indexing time (seconds) of the .GOV2 collection: 4 machines, 1 reduce task.]

   The results - in terms of indexing time - are shown in Figure 4. We see that when the number of maps is small (i.e. less than the 12 processors available from the 4 machines), parallelism is hindered, as not all processors have work to do. When the number of map tasks is ≤ 14, we also note that indexing time is still high.
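To make the single-pass strategy of Figure 2 concrete, the following is a minimal Python sketch of the map and reduce steps, under simplified assumptions; the function names (map_flush, reduce_term) and data layout are illustrative only, not Terrier's actual implementation. Each map task builds posting lists over map-local doc-IDs in memory, and the reduce corrects them to absolute doc-IDs by adding the number of documents processed by earlier flushes, as in Figure 3.

```python
def map_flush(docs):
    """One map task: build map-local posting lists in memory, then emit
    them together with the number of documents processed (standing in for
    the "side-effect" files of Figure 2)."""
    postings = {}
    for local_id, terms in enumerate(docs):
        for term in set(terms):                    # one entry per document
            postings.setdefault(term, []).append(local_id)
    return postings, len(docs)

def reduce_term(term, flushes):
    """Merge one term's partial posting lists, taken in map/flush order;
    offset each local doc-ID by the documents seen in earlier flushes."""
    merged, offset = [], 0
    for postings, n_docs in flushes:
        merged.extend(offset + doc_id for doc_id in postings.get(term, []))
        offset += n_docs
    return term, merged

flush_a = map_flush([["a", "b"], ["b"]])        # absolute docs 0-1
flush_b = map_flush([["b", "c"]])               # absolute doc 2 (local id 0)
print(reduce_term("b", [flush_a, flush_b]))     # ('b', [0, 1, 2])
```

Because every flush restarts its local doc-IDs at 0, the values emitted from each map stay small, which is what allows them to be better compressed.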
On examination of these jobs, we found that the balance of work between map tasks was not even, with one map task taking markedly longer than the others⁵. When the number of map tasks is increased to 16, balance is restored.
   In previous work [14], we have shown that the time taken by the reduce step is an important factor in determining indexing performance. Therefore, it is important to know how many reduce tasks it is optimal to create - subject to external constraints on the number of reducers (e.g. having 8 query servers suggests that 8 reducers are used, so that 8 final indices are created). To test the effect of the number of reduce tasks on efficiency, we index .GOV2 while varying the number of reduce tasks. Here we used 6 machines and 72 map tasks. The indexing time results are shown in Figure 5.

[Figure 5: The effect of varying the number of reduce tasks on indexing time (seconds) of the .GOV2 collection: 6 machines, 72 map tasks.]

As we would expect, while the number of reduces is below the number of available processors (for the 6 machines allocated, 18 processors), the speed increases as we add more reducers, since we are effectively providing more parallel processing power. Once we are beyond the number of processors, however, indexing time increases. This is intuitive, as there is more work to be done than available processors. Therefore, we can conclude that the number of reduce tasks should be a multiple of the number of processors. Unlike map tasks, however, there is an incentive to have fewer reduce tasks, resulting in fewer indices, but this needs to be traded off against the possibility of failures and the associated time wasted through re-running.

⁵ Hadoop actually supports speculative execution, where two copies of the last task, or the slowest tasks, will be started. Only output from the first successful task to complete will be used. This uses otherwise idle processing power to decrease average job duration.

5.6 Is MapReduce Performance Close to Optimal Distributed Indexing?

   We now investigate whether MapReduce is an efficient alternative to distributed indexing. Moreover, we evaluate MapReduce against optimal distributed indexing in terms of performance, i.e. the extent to which it scales close to linearly with processing power. The core advantage of MapReduce is the ability to apply the distributed file system (DFS) to avoid centralised storage of data (a single point of failure), and to take advantage of data locality to avoid excess network IO. This, meanwhile, comes at the cost of additional overheads in job setup, monitoring and control, as well as the additional IO required to replicate the data on a DFS. As the centralised file system was identified as the bottleneck for distributed indexing, we would expect MapReduce to perform better since it uses a DFS.
   For evaluation, we perform a direct comparison of throughput between indexing strategies. Note that while distributed indexing creates n index shards, where n is the number of processors allocated, MapReduce instead produces r index shards, where r is the number of reduce tasks created. For these experiments we always allocate 72 map tasks and 24 reduce tasks. This means that for distributed indexing a smaller number of index shards was created when indexing on {1, 2, 4, 6} machines. However, we believe that this has no significant impact on the overall throughput.
   First, we investigate whether the MapReduce indexing strategy proposed by Dean & Ghemawat is more efficient than distributed indexing. Table 1 shows how the throughput increases as we allocate more machines - in particular, row 2 shows results for Dean & Ghemawat's strategy, interpreted as emitting ⟨term, ⟨doc-ID, tf⟩⟩ tuples (Section 4.1.2). We also implemented the other interpretation, which emits ⟨term, doc-ID⟩ tuples; however, it consumed excessive temporary storage space during operation due to its large number of emit operations. This made it impossible to determine throughput, as the worker machines ran out of disk space, causing the job to fail. Our implementation of Dean & Ghemawat's indexing method also creates the additional data structures described in Section 2.1 - i.e. the lexicon and document index - and uses the compressed Terrier inverted index format. From Table 1, row 2, we can see that this implementation performs very poorly in comparison to distributed indexing. Indeed, with 8 machines it indexes at only half the speed of distributed indexing with the same number of machines. Upon further investigation, as expected, this speed degradation can be attributed to the large volume of map output which is generated by this approach. However, it should be noted that, unlike distributed indexing, performance improvements do not stall after 4 machines. This would indicate that while the indexing strategy is poor, MapReduce in general will continue to garner performance improvements as more machines are added. Therefore, we believe this makes it more suitable for processing larger corpora, where larger clusters of 100s-1000s of machines are needed to index them in reasonable amounts of time.
   We now experiment with our proposed implementation of single-pass indexing in MapReduce, as described in Section 4.3. Our expectation is that this strategy should prove to be more efficient, as it lowers disk and network IO by building up posting lists in memory, thereby minimising map output size. Table 1, row 3, shows the throughput of the single-pass MapReduce indexing strategy. In comparison to Dean & Ghemawat's indexing strategy, we find our approach to be markedly faster. Indeed, when using 8 machines our method is over 2.7 times faster. Moreover, Figure 6 shows the speedup achieved by both approaches as the number of machines is increased. We observe that our single-pass based strategy scales close to linearly in terms of indexing time as the number of machines allocated for work is increased. In contrast, the scalability of Dean & Ghemawat's approach is noticeably worse (5.5 times speedup for 8 machines, versus 6.8 times for single-pass based indexing). We believe that this makes our proposed strategy suitable for scaling to large clusters of machines, which is essential when indexing new large-scale collections like ClueWeb09.
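The metrics of Section 5.2 can be checked with a few lines of arithmetic. The sketch below is illustrative (the constant and helper names are our own): it recomputes the single-core baseline throughput and the relative speed figures quoted above from the Table 1 values, noting that since the collection size is fixed, the speedup S_m = T_1/T_m can equivalently be computed as a ratio of throughputs.

```python
GOV2_MB = 80 * 1024          # ~80GB compressed size of .GOV2 (Section 5.3)

def throughput(seconds):
    """Throughput in MB/s: collection size / time taken (Section 5.2)."""
    return GOV2_MB / seconds

def speedup(throughput_m, throughput_1):
    """S_m = T_1/T_m; with a fixed collection size this equals the
    ratio of the corresponding throughputs."""
    return throughput_m / throughput_1

# Single-core baseline: "just over 1 day" gives approximately 1 MB/s.
print(round(throughput(86400), 2))    # 0.95 MB/s for exactly one day

# Table 1, 8 machines: single-pass (17.31 MB/s) vs Dean & Ghemawat (6.38 MB/s)
print(round(17.31 / 6.38, 1))         # 2.7 -> "over 2.7 times faster"
print(round(speedup(6.38, 1.15), 1))  # 5.5x speedup for Dean & Ghemawat
```

This reproduces the approximately 1MB/sec baseline of Section 5.4 and the "over 2.7 times faster" and 5.5 times speedup figures of Section 5.6.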
[Figure 6: Speedup of .GOV2 indexing as more MapReduce machines are allocated. The plot shows speedup (times faster) against the number of allocated machines (1 to 8), for linear speedup, .GOV2 single-pass MapReduce indexing, and .GOV2 Dean & Ghemawat indexing.]

6. CONCLUSION

   In this paper, we detailed four different strategies for applying document indexing within the MapReduce paradigm, with varying efficiency. In particular, we firstly showed that indexing speed using a distributed indexing strategy was limited by accessing a centralised file-store, and hence the advantage of using MapReduce to allocate indexing tasks close to input data is clear. Secondly, we showed that the MapReduce indexing strategy suggested by Dean & Ghemawat in the original MapReduce paper [5] generates too much intermediate map data, causing an overall slowness of indexing. In contrast, our proposed single-pass indexing strategy is almost 3 times faster, and scales well as the number of machines allocated is increased.
   Overall, we conclude that the single-pass based MapReduce indexing algorithm should be suitable for efficiently indexing larger corpora, including the recently released TREC ClueWeb09 corpus. Moreover, as a framework for distributed indexing, MapReduce conveniently provides both data locality and resilience. Finally, it is of note that an implementation of the MapReduce single-pass indexing strategy described in this paper is freely available for use by the community as part of the Terrier IR Platform⁶.

⁶ http://terrier.org

7. REFERENCES

 [1] Apache Software Foundation. The Apache Hadoop project. http://hadoop.apache.org/, as of 15/06/2009.
 [2] G. Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In Proc. of AFIPS, pp. 483–485, 1967.
 [3] M. Cafarella and D. Cutting. Building Nutch: Open source search. ACM Queue, 2(2):54–61, 2004.
 [4] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In Proc. of NIPS 2006, pp. 281–288.
 [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of OSDI 2004, pp. 137–150.
 [6] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
 [7] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
 [8] S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. JASIST, 54(8):713–729, 2003.
 [9] M. D. Hill. What is scalability? SIGARCH Comput. Archit. News, 18(4):18–21, 1990.
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of EuroSys 2007, pp. 59–72.
[11] R. E. Johnson. Frameworks = (components + patterns). Commun. ACM, 40(10):39–42, 1997.
[12] M. Laclavik, M. Seleng, and L. Hluchý. Towards large scale semantic annotation built on MapReduce architecture. In Proc. of ICCS (3), pp. 331–338, 2008.
[13] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[14] R. McCreadie, C. Macdonald, and I. Ounis. On single-pass indexing with MapReduce. In Proc. of SIGIR 2009, in press.
[15] S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the Web. In Proc. of WWW 2001, pp. 396–406.
[16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. of SIGMOD 2008, pp. 1099–1110.
[17] O. O'Malley and A. C. Murthy. Winning a 60 second dash with a yellow elephant. Technical report, Yahoo! Inc., 2009.
[18] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A high performance and scalable information retrieval platform. In Proc. of OSIR Workshop, SIGIR 2006, pp. 18–25.
[19] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005.
[20] B. A. Ribeiro-Neto, E. S. de Moura, M. S. Neubert, and N. Ziviani. Efficient distributed algorithms to build inverted files. In Proc. of SIGIR 1999, pp. 105–112.
[21] E. Schonfeld. Yahoo! Search wants to be more like Google, embraces Hadoop, 2008. http://www.techcrunch.com/2008/02/20/yahoo-search-wants-to-be-more-like-google-embraces-hadoop/, as of 15/06/2009.
[22] A. Tomasic and H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proc. of PDIS 1993, pp. 8–17.
[23] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
[24] J. Zobel, A. Moffat, and K. Ramamohanarao. Guidelines for presentation and comparison of indexing techniques. SIGMOD Record, 25(3):10–15, 1996.