               The Curse of Zipf and Limits to Parallelization:
              A Look at the Stragglers Problem in MapReduce

                                                                      Jimmy Lin
                             The iSchool, College of Information Studies, University of Maryland
                       National Center for Biotechnology Information, U.S. National Library of Medicine
                                                              jimmylin@umd.edu


ABSTRACT

This paper explores the problem of “stragglers” in MapReduce: a common phenomenon where a small number of mappers or reducers takes significantly longer than the others to complete. The effects of these stragglers include unnecessarily long wall-clock running times and sub-optimal cluster utilization. In many cases, this problem cannot simply be attributed to hardware idiosyncrasies, but is rather caused by the Zipfian distribution of input or intermediate data. I present a simple theoretical model that shows how such distributions impose a fundamental limit on the amount of parallelism that can be extracted from a large class of algorithms where all occurrences of the same element must be processed together. A case study in parallel ad hoc query evaluation highlights the severity of the stragglers problem. Fortunately, a simple modification of the input data cuts end-to-end running time in half. This example illustrates some of the issues associated with designing efficient MapReduce algorithms for real-world datasets.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Performance

1.   INTRODUCTION

The only practical solution to large data problems today is to distribute computations across multiple machines in a cluster. With traditional parallel programming models (e.g., MPI, pthreads), the developer shoulders the burden of explicitly managing concurrency. As a result, a significant amount of the developer's attention must be devoted to managing system-level details (e.g., synchronization primitives, inter-process communication, data transfer, etc.). MapReduce [2] presents an attractive alternative: its functional abstraction provides an easy-to-understand model for designing scalable, distributed algorithms.

Taking inspiration from higher-order functions in functional programming, MapReduce provides an abstraction for programmer-defined “mappers” and “reducers”. Key-value pairs form the processing primitives in MapReduce. The mapper is applied to every input key-value pair to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs. This two-stage processing structure is illustrated in Figure 1.

Figure 1: Illustration of MapReduce: “mappers” are applied to input records, which generate intermediate results that are aggregated by “reducers”.

Under the MapReduce framework, a programmer needs only to provide implementations of the mapper and reducer. On top of a distributed file system [3], the runtime transparently handles all other aspects of execution on clusters ranging from a few to a few thousand processors. The runtime is responsible for scheduling, coordination, handling faults, and the potentially very large sorting problem between the map and reduce phases whereby intermediate key-value pairs must be grouped by key. The availability of Hadoop, an open-source implementation of the MapReduce programming model, coupled with the dropping cost of commodity hardware and the growing popularity of alternatives such as utility computing, has brought data-intensive distributed computing within the reach of many academic researchers [5].

This paper explores a performance issue frequently encountered with MapReduce algorithms on natural language text and other large datasets: the “stragglers problem”, where the distribution of running times for mappers and reducers is highly skewed. Often, a small number of mappers takes significantly longer to complete than the rest, thus blocking the progress of the entire job. Since in MapReduce the reducers cannot start until all the mappers have finished, a few stragglers can have a large impact on overall end-to-end running time. The same observation similarly applies to the reduce stage, where a small number of long-running reducers can significantly delay the completion of a MapReduce job. In addition to long running times, the stragglers phenomenon has implications for cluster utilization—while a submitted job waits for the last mapper or reducer to complete, most of the cluster is idle. Of course, interleaving the execution of multiple MapReduce jobs alleviates this problem, but presently, the scheduler in the open-source Hadoop implementation remains relatively rudimentary.

There are two main reasons for the stragglers problem. The first is idiosyncrasies of machines in a large cluster—this problem is nicely handled by “speculative execution” [2], which is implemented in Hadoop. To handle machine-level variations, multiple instances of the same mapper or reducer are redundantly executed in parallel (depending on the availability of cluster resources), and results from the first instance to finish are used (the remaining results are discarded).
This, however, does not address skewed running times caused by the distribution of input or intermediate data. To illustrate this, consider the simple example of word count using MapReduce (for example, as the first step to computing collection frequency in a language-modeling framework for information retrieval): the mappers emit key-value pairs with the term as the key and the local count as the value. Reducers sum up the local counts to arrive at the global counts. Due to the distribution of terms in natural language, some reducers will have more work than others (in the limit, imagine counting stopwords)—this yields potentially large differences in running times, independent of hardware idiosyncrasies.
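To make the skew concrete, here is a minimal single-machine sketch of the word count pattern just described (plain Python, not Hadoop code; the in-memory grouping step stands in for the framework's shuffle). The point is simply that every occurrence of a term is routed to the same reducer call, so the reducer responsible for a stopword receives vastly more values than the reducer for a rare term.

from collections import defaultdict

def mapper(doc_id, text):
    # Emit (term, 1) for every token in the document.
    for term in text.split():
        yield term, 1

def reducer(term, counts):
    # All local counts for this term arrive at a single reducer.
    return term, sum(counts)

def run(documents):
    groups = defaultdict(list)
    for doc_id, text in documents:
        for term, count in mapper(doc_id, text):
            groups[term].append(count)   # shuffle: group values by key
    return dict(reducer(term, counts) for term, counts in groups.items())

print(run([(1, "the cat sat on the mat"), (2, "the dog ate the bone")]))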
This paper explores the stragglers problem in MapReduce, starting first with a theoretical analysis in Section 2. A specific case study in parallel ad hoc query evaluation is discussed in Section 3; for that specific task, a simple manipulation of the input data yields a significant decrease in wall-clock running time. I conclude with some thoughts on the design of MapReduce algorithms in light of these findings.

2.   A THEORETICAL ANALYSIS

It is a well-known observation that word occurrences in natural language, to a first approximation, follow a Zipfian distribution. Many other naturally-occurring phenomena, ranging from website popularity to sizes of craters on the moon, can be similarly characterized [7]. Loosely formulated, Zipfian distributions are those where a few elements are exceedingly common, but there is a “long tail” of rare events. More precisely, for a population of N elements, ordered by frequency of occurrence, the frequency of the element with rank k can be characterized by:

    p(k; s, N) = (1/k^s) / (sum_{n=1}^{N} 1/n^s)    (1)

where s is the characteristic exponent. The denominator is known as the Nth generalized harmonic number, often written as H_{N,s}. For many types of algorithms, all occurrences of the same type of element must be processed together. That is, the element type represents the finest grain of parallelism that can be achieved. Let us assume that the amount of “work” associated with processing a type of element is O(n) with respect to the number of element instances. In other words, the amount of work necessary to process an element is proportional to its frequency. If we normalize the total amount of work to one, then the fraction of the total work that must be devoted to processing the most common element is:

    1 / (sum_{n=1}^{N} 1/n^s) = 1 / H_{N,s}    (2)

Even if we assume unlimited resources and the ability to process all element types in parallel, the running time of an algorithm would still be bound by the time required to process the most frequent element. In the same spirit as Amdahl's Law [1], Zipf's Law places a theoretical upper bound on the amount of parallelism that can be extracted for certain classes of algorithms. In short, an algorithm is only as fast as the slowest of the sub-tasks that can run in parallel.

As a concrete example, suppose N = 10^6 and s = 1. The most frequently-occurring element would be observed about 6.95% of the time, which means that the maximum theoretical speedup of any parallel algorithm is about a factor of 14 (assuming that all instances of the same element must be processed together). For this class of algorithms, the maximum theoretical speedup is simply H_{N,s}. Figure 2 plots H_{N,s} with varying values of s for N = 10^6. This simple theoretical model shows that Zipfian distributions present a serious impediment to the development of efficient parallel algorithms for a large class of real-world problems.

Figure 2: Plot of H_{N,s}, the Nth generalized harmonic number with N = 10^6, based on different values of s, the characteristic exponent of the Zipfian distribution. This plot shows the maximum theoretical speedup of a parallel algorithm where all instances of an element type must be processed together.
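As a quick check of these numbers, the harmonic number can be computed directly. The short Python sketch below (illustrative only, not part of the original experiments) reproduces the values quoted above: H_{N,s} is about 14.4 for N = 10^6 and s = 1, and the most frequent element accounts for about 6.95% of the total work.

def generalized_harmonic(N, s):
    # H_{N,s} = sum over n = 1..N of 1/n^s, by direct summation.
    return sum(1.0 / n ** s for n in range(1, N + 1))

N, s = 10**6, 1.0
H = generalized_harmonic(N, s)
print(round(H, 2))          # ~14.39: maximum theoretical speedup
print(round(1.0 / H, 4))    # ~0.0695: fraction of work on the most frequent element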
This analysis can be directly applied to MapReduce algorithms. In a common scenario, each key corresponds to a single type of element, and the fraction of total values associated with each key can be characterized by a Zipfian distribution. In other words, a few keys have a large number of associated values, while a very large number of keys have only a few values. Take the two following examples:

Inverted indexing. The well-known MapReduce algorithm for inverted indexing begins by mapping over input documents and emitting individual postings as intermediate key-value pairs. All postings associated with each term are gathered in the reducer, where the final postings lists are created and written to disk. The most straightforward implementation of the algorithm divides the term space into roughly equal-sized partitions and assigns each to a reducer (typically, through hashing). This, however, does not take into account the inherent distribution of document frequencies (which characterizes the length of each postings list, and hence the amount of work that needs to be done for each term). Due to the Zipfian distribution of term occurrences, the reducers assigned to very frequent terms will have significantly more work than the other reducers, and therefore take much longer to complete. These are the stragglers observed in the execution of the algorithm.
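The following Python sketch (a simplification for illustration; the names and toy partitioning are mine, not from the paper) captures the pattern described above: terms are assigned to reducers by hashing, which balances the number of keys per reducer but not the number of postings, so the reducer that draws a frequent term ends up with far more work.

from collections import Counter, defaultdict

NUM_REDUCERS = 4

def mapper(doc_id, text):
    # Emit one posting, (term, (doc_id, term frequency)), per unique term.
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)

def partition(term):
    # Hash partitioning: roughly equal numbers of terms per reducer.
    return hash(term) % NUM_REDUCERS

def reducer(term, postings):
    # Work here is proportional to the length of the postings list.
    return term, sorted(postings)

def postings_per_reducer(documents):
    load = defaultdict(int)
    for doc_id, text in documents:
        for term, posting in mapper(doc_id, text):
            load[partition(term)] += 1
    return dict(load)   # the straggler is the reducer with the largest count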
Computing PageRank over large graphs. Although Google recently revealed that it has developed infrastructure specifically designed for large-scale graph algorithms (called Pregel), the implementation of PageRank [8] in MapReduce is nevertheless instructive. The standard iterative MapReduce algorithm for computing PageRank maps over adjacency lists associated with each vertex (containing information about outlinks); PageRank contributions are passed on to the targets of those outlinks, keyed by the target vertex id. In the reduce stage, PageRank contributions for each vertex are totaled.¹ In the simplest implementation, vertices are distributed across the reducers such that each is responsible for roughly the same number of vertices (by hashing). However, the highly skewed distribution of incoming links to a page presents a problem: the vast majority of pages have few inbound links, while a few (e.g., the Google homepage) have many orders of magnitude more. In computing PageRank, the amount of work required to process a vertex is roughly proportional to the number of incoming links. The stragglers are exactly those assigned to process vertices in the graph with large in-degrees.

¹ This algorithm sketch ignores details such as handling of dangling links and the jump factor.
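A compact sketch of one such iteration, in the same illustrative Python style (the function names are mine, and dangling links and the jump factor are ignored, as in the footnote above), shows where the skew comes from: every inbound link to a vertex contributes one value under that vertex's key, so a page with millions of inbound links produces one enormous reducer call.

def mapper(vertex, state):
    rank, outlinks = state
    for target in outlinks:
        yield target, rank / len(outlinks)   # PageRank mass sent to each target
    yield vertex, ("GRAPH", outlinks)        # pass the graph structure along

def reducer(vertex, values):
    outlinks, total = [], 0.0
    for value in values:
        if isinstance(value, tuple) and value[0] == "GRAPH":
            outlinks = value[1]
        else:
            total += value                   # sum contributions from all inbound links
    return vertex, (total, outlinks)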
In practice, faster speed-ups than predicted are possible because in most cases some amount of local aggregation can be performed, thus making the skewed key-value distributions less pronounced. In MapReduce, this is accomplished with “combiners”, which can dramatically increase efficiency. Nevertheless, the stragglers problem remains both common and severe.
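As a sketch of the idea (again illustrative Python rather than the Hadoop combiner interface), local aggregation for word count sums counts within each map task before the shuffle, so a frequent term contributes one partial count per map task instead of one value per occurrence:

from collections import Counter

def map_task_with_combiner(documents):
    # 'documents' is the slice of the input assigned to this map task.
    local_counts = Counter()
    for doc_id, text in documents:
        local_counts.update(text.split())
    for term, count in local_counts.items():
        yield term, count   # far fewer intermediate pairs per frequent term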
In both of the examples presented above, the stragglers problem is most severe in the reduce stage of processing, since the distribution of the intermediate data can be characterized as Zipfian. However, depending on the distribution of input key-value pairs, it is also possible to have the stragglers problem in the map phase—in fact, the case study presented in the next section examines such a case.

3.   PARALLEL QUERIES: A CASE STUDY

This paper presents a case study of the stragglers problem that builds on the parallel queries algorithm described in SIGIR 2009 [6]. This piece is meant to serve as a companion to the paper in the main conference proceedings. The problem explored here nicely illustrates how the running time of a MapReduce algorithm is dominated by the slowest parallel sub-task. Fortunately, for this problem, a simple manipulation of the input data cuts running time in half.

1: procedure Map(Term t, Postings P)
2:    [Q_1, Q_2, ..., Q_n] ← LoadQueries()
3:    for all Q_i ∈ [Q_1, Q_2, ..., Q_n] do
4:        if t ∈ Q_i then
5:            Initialize.AssociativeArray(H)
6:            for all ⟨a, f⟩ ∈ P do
7:                H{a} ← w_{t,q} · w_{t,d}
8:            Emit(i, H)

1: procedure Reduce(Qid i, [H_1, H_2, H_3, ...])
2:    Initialize.AssociativeArray(H_f)
3:    for all H ∈ [H_1, H_2, H_3, ...] do
4:        Merge(H_f, H)
5:    Emit(i, H_f)

Figure 3: Pseudo-code of the parallel queries algorithm in MapReduce.

3.1   Background

Computing pairwise similarity on document collections is a task common to a variety of problems such as clustering, unsupervised learning, and text retrieval. One algorithm presented in [6] treats this task as a very large ad hoc retrieval problem (where query-document scores are computed via inner products of weighted feature vectors). This proceeds as follows: First, the entire collection is divided up into individual blocks of documents; these are treated as blocks of “queries”. For each document block, the parallel retrieval algorithm in Figure 3 is applied. The input to each mapper is a term t (the key) and its postings list P (the value). The mapper loads all the queries at once and processes each query in turn. If the query does not contain t, no action is performed. If the query contains t, then the corresponding postings must be traversed to compute the partial contributions to the query-document score. For each posting element, the partial contribution to the score (w_{t,q} · w_{t,d}) is computed and stored in an associative array H, indexed by the document id a—this structure holds the accumulators. The mapper emits an intermediate key-value pair with the query number i as the key and H as the value. The result of each mapper is all partial query-document scores associated with term t for all queries that contain the term.

In the reduce phase, all associative arrays belonging to the same query are brought together. The reducer performs an element-wise sum of all the associative arrays (denoted by Merge in the pseudo-code): this adds up the contributions for each query term across all documents. The final result is an associative array holding complete query-document scores. In effect, this algorithm replaces random access to the postings with a parallel scan of all postings. In processing a set of queries, each postings list is accessed only once—each mapper computes partial score contributions for all queries that contain the term. Pairwise similarity for the entire collection can be computed by running this algorithm over all blocks in the collection. For more details, please refer to additional discussions of this algorithm in the SIGIR 2009 main proceedings paper.
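For concreteness, here is a minimal single-machine rendering of Figure 3 in Python (illustrative only; the toy query set and the simple tf-times-query-weight scoring are assumptions of this sketch, not the weighting scheme used in the experiments):

from collections import defaultdict

QUERIES = {0: {"gene": 1.0, "protein": 0.5}, 1: {"patient": 1.0}}   # hypothetical query block

def mapper(term, postings):
    # postings: list of (doc_id, tf) pairs for this term.
    for qid, query in QUERIES.items():
        if term in query:
            H = {doc_id: query[term] * tf for doc_id, tf in postings}
            yield qid, H                     # partial accumulators for this query

def reducer(qid, partial_accumulators):
    final = defaultdict(float)
    for H in partial_accumulators:           # element-wise sum (Merge in Figure 3)
        for doc_id, score in H.items():
            final[doc_id] += score
    return qid, dict(final)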
3.2   Experimental Results

Experiments using the algorithm presented in Figure 3 were conducted on a collection of 4.59m MEDLINE abstracts (from journals in the life and health sciences domain), used in the TREC 2005 genomics track [4]. From the list of relevant documents for the evaluation, 472 (approximately one tenth) were selected to serve as the “queries”. The entire text of each abstract was taken verbatim as the query. On average, each query contained approximately 120 terms after stopword removal. All runs were performed on a large cluster running the Hadoop implementation of MapReduce (version 0.17.3) with Java 1.5; jobs were configured with 500 mappers and 200 reducers. The parallel queries algorithm was implemented in pure Java. Although the cluster contains a large number of machines, each individual machine is relatively slow, based on informal benchmarks.²

² The hardware configuration is the same as the setup in [6].

An inverted index was built from the entire collection.³ Figure 4 shows the document frequencies of all 1.3m unique terms in the collection: on the x-axis, the document frequency of a term, and on the y-axis, the number of terms that have that df. Approximately 753k terms appear only once, and approximately 168k terms appear only twice. The term with the largest df appeared in 1.62m documents (or about one in three documents). This distribution is fairly typical of information retrieval text collections.

³ The tokenizer from the open-source Lucene search engine was adapted for document processing. A list of stopwords from the Terrier search engine was used.

Figure 4: Distribution of document frequencies of terms in the MEDLINE collection.

The thick line in Figure 5 plots the progress of the MapReduce job on the unmodified index of the document collection: wall-clock time (in seconds) on the x-axis and number of active workers (either mappers or reducers) on the y-axis. At the beginning of the job, 500 mappers are started by the MapReduce runtime; the number of running mappers decreases as the job progresses. When all the mappers have finished, 200 reducers start up after a short lull (during this time the framework is copying intermediate key-value pairs from mappers to reducers). Finally, the job ends when all reducers complete and the results have been written to disk.

Figure 5: Progress of the MapReduce parallel queries algorithm (472 full-abstract “queries” on the MEDLINE collection), comparing original postings lists (thick black line) and postings lists split into 100k segments (thin red line).

The map phase of execution completes in 1399s; the reduce phase begins at the 1418s mark, and the entire job finishes in 1675s. The long tail of the plot in the map phase graphically illustrates the stragglers problem. Consider the following statistics, which can be interpreted as checkpoints corresponding to 98.0%, 99.0%, and 99.8% mapper progress:

• At the 467s mark, all except for 10 mappers (2% of the total) have completed.
• At the 532s mark, all except for 5 mappers (1% of the total) have completed.
• At the 1131s mark, all except for one mapper have finished—the last mapper would run for about another 4.5 minutes before finishing.

Since all mappers must complete before the reducers can begin processing intermediate key-value pairs, the running time of the map phase is dominated by the slowest mapper, which as shown here takes significantly longer than the rest of the mappers. Repeated trials of the same experiment produced essentially the same behavior (not shown).

The behavior of the parallel queries algorithm is explained by the empirical distribution of the length of the postings lists (see Figure 4). There are a few very long postings lists (corresponding to common terms such as “gene” or “patient”), while the vast majority of the postings lists are very short. Since for each query term the postings must be traversed to compute partial document score contributions, the amount of work involved in processing a query term is on the order of the length of the postings list. Therefore, the mappers assigned to process common query terms will run for a disproportionately long time—these are exactly the stragglers observed in Figure 5. These empirical results are consistent with the theoretical model presented in Section 2. However, the situation is not quite as grim in some cases—in a retrieval application, user queries are less likely to contain common terms since they are typically less helpful in specifying an information need.

A natural solution to the stragglers problem, in this case, is to break long postings lists into shorter ones. Exactly such a solution was implemented: postings lists were split into 100k segments, so that terms contained in more than 100k documents were associated with multiple postings lists. Furthermore, the ordering of all the postings segments was randomized (as opposed to alphabetically sorted by term, as in the original inverted index).⁴ Note that this modification to the inverted index did not require any changes to the parallel queries algorithm. Although partial score contributions for a single query term might now be computed in different mappers, all partial scores will still be collected together in the same reducer during the element-wise sum across associative arrays keyed by the same query id.

⁴ This algorithm was straightforwardly implemented in MapReduce, by mapping over the original inverted index and writing a new copy.
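The transformation itself is simple; a sketch in the same illustrative Python style is shown below (the function names and in-memory representation are mine; the 100k segment size and the shuffling step follow the description above). Any postings list longer than the segment size is rewritten as several shorter lists under the same term, and the resulting segments are written out in random order:

import random

SEGMENT_SIZE = 100_000

def split_postings(term, postings):
    # Rewrite one (term, postings) pair as one or more fixed-size segments.
    for start in range(0, len(postings), SEGMENT_SIZE):
        yield term, postings[start:start + SEGMENT_SIZE]

def rewrite_index(index_items):
    segments = [segment
                for term, postings in index_items
                for segment in split_postings(term, postings)]
    random.shuffle(segments)   # avoid clustering long lists in a single input split
    return segments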
The thin (red) line in Figure 5 plots the progress of the MapReduce job on the modified index. The map phase of execution completes in 636s; the reduce phase begins at the 665s mark, and the entire job finishes in 831s. With this simple modification to the inverted index, the same algorithm runs in about half the wall-clock time. Table 1 compares the 98.0%, 99.0%, 99.8%, and 100.0% progress of the mappers for both the original and segmented postings lists. Multiple trials of the same experiment gave rise to similar results.

    Postings     98.0%    99.0%    99.8%    100.0%
    original      467s     532s    1131s     1399s
    segmented     425s     498s     576s      636s

Table 1: Progress of mappers in the parallel queries algorithm, comparing original postings lists and postings lists split into 100k segments.

As a follow-up, additional experiments were conducted with even smaller postings segments, but finer splits did not further decrease running time. This is likely due to the distribution of query terms—some postings lists will never be traversed no matter how finely segmented they are, since the query documents contain only a subset of all terms in the collection (and by implication, some mappers will do little work regardless).

Why does this simple manipulation of the input data work so well? It is important to note that the technique does not actually reduce the number of computations required to produce query-document scores. The total amount of work (i.e., area under the curve) is the same—however, with the segmented index, work is more evenly distributed across the mappers. In essence, splitting long postings lists is a simple, yet effective, approach to “defeating” Zipfian distributions for this particular application.
4.   DISCUSSION

The limits on parallelization imposed by Zipfian distributions are based on one important assumption—that all instances of the same type of element must be processed together. The simple approach to overcoming the stragglers problem is to devise algorithms that do not depend on this assumption. In the parallel queries case, this required only manipulating input data and no modifications to the algorithm: query-document scores could be computed independently, since the associative arrays in which they are held would eventually be brought together and merged in the reduce stage.

Although the parallel queries case study dealt with stragglers in the map phase due to distributional characteristics of the input data, the same principles can be applied to stragglers in the reduce phase as well. However, there are additional complexities that need to be considered. Often, it is known in advance that intermediate data can be characterized as Zipfian—but devising algorithms to address the issue may require actually knowing which elements occupy the head of the distribution. This in turn requires executing the algorithm itself, which may be slow precisely because of the stragglers problem. This chicken-and-egg dependency can be broken by sampling strategies, but at the cost of greater complexity in algorithm design.

For solving reduce-phase stragglers, once the distribution of intermediate data has been characterized, a more intelligent partitioner can better distribute load across the reducers. Unfortunately, this may still be insufficient in some cases, for example, PageRank computations. Recall that in computing PageRank all contributions from inbound links must be summed, thereby creating the requirement that all values associated with the same key (i.e., vertex in the graph) be processed together. The only recourse here is a modification of the original algorithm (e.g., a multi-stage process for totaling the PageRank contributions of pages with large in-degrees).
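One way to realize such a multi-stage process is to spread a heavy key over several reducers in a first pass and combine the partial results in a second, much cheaper pass. The sketch below (illustrative Python; the “salted” key and the bucket count are my own choices, not a construction from the paper) applies this idea to the PageRank case:

import random

NUM_BUCKETS = 16

def stage1_map(target_vertex, contribution):
    # Spread contributions to one vertex across NUM_BUCKETS reducer keys.
    yield (target_vertex, random.randrange(NUM_BUCKETS)), contribution

def stage1_reduce(salted_key, contributions):
    vertex, _bucket = salted_key
    yield vertex, sum(contributions)   # partial sum for one bucket

def stage2_reduce(vertex, partial_sums):
    # Second pass: at most NUM_BUCKETS values per vertex remain.
    yield vertex, sum(partial_sums)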
The upshot of this discussion is that to overcome the efficiency bottleneck imposed by Zipfian distributions, developers must apply application-specific knowledge to parallelize the processing of common elements. This, in turn, depends on the ingenuity of the individual developer and requires insight into the problem being tackled.

5.   CONCLUSION

This paper explores the stragglers problem in MapReduce caused by Zipfian distributions common in many real-world datasets. What are the broader implications of these findings? I believe that two lessons are apparent:

• First, efficient MapReduce algorithms are not quite as easy to design as one might think. The allure of MapReduce is that it presents a simple programming model for designing scalable algorithms. While this remains an accurate statement, the deeper truth is that there is still quite a bit of “art” in designing efficient MapReduce algorithms for non-toy problems—witness the case study presented in this paper, where a simple tweak to the input data cut running time in half.⁵
• Second, MapReduce algorithms cannot be studied in isolation, divorced from real-world applications—the stragglers problem is caused by properties of datasets (critically, not from the processing architecture or inherent bottlenecks in the algorithm). Furthermore, since there does not appear to be a general-purpose, universally-applicable solution to the problem, it is of limited value to discuss algorithmic performance without reference to specific applications.

Hopefully, these two lessons will be useful to future designers of MapReduce algorithms.

⁵ Although one might argue how obvious this solution is, I think it is safe to say that the solution requires a degree of understanding of both information retrieval algorithms and MapReduce that a novice would be unlikely to have.

6.   ACKNOWLEDGMENTS

This work was supported by the following sources: the Intramural Research Program of the NIH, National Library of Medicine; NSF under awards IIS-0705832 and IIS-0836560 (under the CLuE program); Google and IBM, via the Academic Cloud Computing Initiative. I am grateful to Esther and Kiri for their love.
7.   REFERENCES
[1] G. Amdahl. Validity of the single processor approach to
    achieving large-scale computing capabilities. In
    Proceedings of the AFIPS Spring Joint Computer
    Conference, pages 483–485, 1967.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified
    data processing on large clusters. In Proceedings of the
    6th Symposium on Operating System Design and
    Implementation (OSDI 2004), pages 137–150, San
    Francisco, California, 2004.
[3] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google
    File System. In Proceedings of the 19th ACM
    Symposium on Operating Systems Principles (SOSP
    2003), pages 29–43, Bolton Landing, New York, 2003.
[4] W. R. Hersh, A. Cohen, J. Yang, R. Bhupatiraju,
    P. Roberts, and M. Hearst. TREC 2005 Genomics
    Track overview. In Proceedings of the Fourteenth Text
    REtrieval Conference (TREC 2005), Gaithersburg,
    Maryland, 2005.
[5] J. Lin. Scalable language processing algorithms for the
    masses: A case study in computing word co-occurrence
    matrices with MapReduce. In Proceedings of the 2008
    Conference on Empirical Methods in Natural Language
    Processing (EMNLP 2008), pages 419–428, Honolulu,
    Hawaii, 2008.
[6] J. Lin. Brute force and indexed approaches to pairwise
    document similarity comparisons with MapReduce. In
    Proceedings of the 32nd Annual International ACM
    SIGIR Conference on Research and Development in
    Information Retrieval (SIGIR 2009), Boston,
    Massachusetts, 2009.
[7] M. Newman. Power laws, Pareto distributions and
    Zipf’s law. Contemporary Physics, 46(5):323–351, 2005.
[8] L. Page, S. Brin, R. Motwani, and T. Winograd. The
    PageRank citation ranking: Bringing order to the Web.
    Stanford Digital Library Working Paper
    SIDL-WP-1999-0120, Stanford University, 1999.