A Scalability Metric for Parallel Computations on Large, Growing Datasets (like the Web)

Jesse Weaver

Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY, USA

Abstract. One of the greatest challenges facing computations on data crawled from the Web is the (in)ability to scale to such large quantities of data. While some computations are less challenged by this than others, inference on the Semantic Web is certainly limited in this regard. Parallelism has been employed to scale inference to larger datasets, but evaluations of recent works have fallen back on common parallel computing metrics that do not apply to this specific scalability challenge. In this position paper, the name data scaling is given to this scalability challenge, and the metric growth efficiency is defined.

1 Introduction

One of the greatest challenges facing computations on data crawled from the Web is the (in)ability to scale to such large quantities of data. Parallelism is often employed to scale computation to larger datasets while keeping execution time reasonable. However, traditional parallel computing metrics focus on one of two forms of scaling: strong scaling or weak scaling. The goal of strong scaling is to reduce execution time for a fixed problem by adding processors. The goal of weak scaling is to keep execution time constant by adding processors to accommodate additional workload (not necessarily proportional to the amount of data). In contrast to these notions of scaling, a new notion of scalability – data scaling – is introduced herein; it is concerned with keeping execution time constant by adding processors to accommodate more data.

The ideas presented in this position paper are preliminary in nature, still requiring formal development. The specific intent of this paper, though, is to instigate a paradigm shift in the way scalability evaluations of parallel inference (and possibly other problems) on large RDF datasets are performed.
In section 2, the need for data scaling is motivated. In section 3, common scalability metrics are shown to be unfit for measuring data scalability, and a new data scaling metric – growth efficiency – is defined. Discussion about, and a retroactive example of, evaluating a system with growth efficiency is given in section 4, and conclusions are given in section 5.

2 Motivation: data scaling

This work subscribes to the following statement by Hitzler and van Harmelen: “Concerning scalability, reasoning systems have made major leaps in the recent past . . . . However, it remains an open question when (and if¹) these approaches will scale to the size of the web, . . . ” [5]. From this statement, two assumptions are inferred which are used to motivate the work presented herein.

1. It is important for reasoning (including inference) systems to scale toward the size of the Web.
2. The Web is continuously growing.

Perhaps these assumptions are debatable, but for the intents and purposes of this paper, they are considered axiomatic. Additionally, although inference motivates this work, the definitions and their application are not specific to any particular class of problems.

To support definitions throughout this paper, a simplistic, intentionally nonspecific computational model is assumed. A (terminating) parallel computation is performed on some dataset, in some amount of time, utilizing some number of processors. This is sufficient for the following discussion.

Indeed, progress has been made for specific forms of reasoning on large datasets. At the very least, varying degrees of RDFS and OWL inference have been achieved on real-world datasets containing around a billion RDF triples [6, 8, 11, 14]. However, in 2010, linked RDF data on the Web consisted of over 24.7 billion RDF triples, and the amount of such data appears to be rapidly growing [1].
Therefore, even if effective inference could be demonstrated on the entire body of RDF data on the (current) Web, it would likely need to scale to even larger datasets in the future.

Evaluations of recently studied parallel inference approaches [2, 6–13] give no direct indication of how well the approaches will scale to dataset sizes beyond those used in the evaluations. This seems to be due to reliance on scalability metrics traditionally used in parallel computing that do not apply to the challenge of large and growing datasets. Therefore, there is a need to explicitly name and define this scalability issue, and to provide relevant metrics for it.

Data scaling is concerned with the change in execution time of a parallel computation as processors are added to accommodate larger datasets. Data scaling is arguably the central scalability issue for parallel computations on the Web, and it is distinct from strong scaling and weak scaling, as illustrated through metrics in the following section.

3 Common scalability metrics and growth efficiency

This section contains a brief review of fundamental metrics often used for measuring scalability (in some sense) of parallel systems. These are defined herein in an atypical way in order to relate the metrics to dataset size, thus highlighting their insufficiency for measuring data scalability. Then a new metric for data scaling is introduced. To aid this discussion, the following definition is needed.

¹ “Since the web keeps growing, they may never scale, even if they become much more efficient.” Footnote in quote from [5]. The footnote number appearing in the quote herein differs from the number used in [5].

Definition 1. A growing dataset is effectively a function D that maps positive integers to datasets such that for any positive integer n, |D(n)| = n and D(n) ⊂ D(n + 1).

Relative speedup and metrics based on it are the scalability metrics reported by nearly every recent work [2, 8–12].
Others report only (variants of) execution time [6, 7, 13].

Definition 2. Let D be a growing dataset, and fix k to some positive integer. Let T1 be the execution time for one processor with dataset D(k), and let TP be the execution time for P processors with dataset D(k). Then the relative speedup is defined as SP = T1/TP.

Clearly, relative speedup (the most common strong scaling metric) gives no direct indication of data scaling since the dataset is fixed. As an alternative, one may resort to (empirical) scaled speedup [4] (the most common weak scaling metric).

Definition 3. Let D be a growing dataset, and fix k to some positive integer. Let t1 be the execution time for one processor with dataset D(k). Let i > k be a positive integer such that the execution time for P processors with dataset D(i) is t1. Then let tP be the execution time for one processor with dataset D(i). Then the scaled speedup is defined as SP = tP/t1.

Unlike with relative speedup, in scaled speedup, the dataset size changes with the number of processors. However, processors are added not to accommodate more data but rather to keep execution time constant. Therefore, these metrics (and those metrics derived from them) are unsuited for measuring data scalability, and a new metric is needed.

Definition 4. Let D be a growing dataset, and fix k to some positive integer. Let T1 be the execution time for one processor with dataset D(k), and let TP be the execution time for P processors with dataset D(P · k). Then growth efficiency is defined as GP = T1/TP.

Growth efficiency directly aligns with the notion of data scaling. The idea is that the size of the input dataset should grow linearly with the number of processors, as captured in the definition. Thus, processors are added to accommodate more data. Growth efficiency is more comparable to efficiency EP = SP/P (with SP the relative speedup of definition 2) or scaled efficiency EP = SP/P (with SP the scaled speedup of definition 3) in that it is a value between zero and one² (inclusively).
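The three metrics above are computed mechanically from measured execution times. The following sketch contrasts them on hypothetical timings; all numbers here are illustrative inputs, not results from any cited evaluation:

```python
# Hypothetical timing records from a parallel evaluation. Each entry maps
# (number of processors, dataset size in triples) to execution time in seconds.
# All values are illustrative only.
times = {
    (1, 1_000_000): 100.0,  # T1 on D(k), with k = 1,000,000
    (8, 1_000_000): 14.0,   # TP on the fixed dataset D(k)    (strong scaling)
    (8, 8_000_000): 110.0,  # TP on the grown dataset D(P*k)  (data scaling)
}

def relative_speedup(t1: float, tp: float) -> float:
    """Definition 2: SP = T1 / TP, dataset size held fixed at k."""
    return t1 / tp

def efficiency(t1: float, tp: float, p: int) -> float:
    """EP = SP / P, efficiency in the strong scaling sense."""
    return relative_speedup(t1, tp) / p

def growth_efficiency(t1: float, tp_grown: float) -> float:
    """Definition 4: GP = T1 / TP, where P processors handle D(P*k)."""
    return t1 / tp_grown

k, p = 1_000_000, 8
t1 = times[(1, k)]
print(f"SP = {relative_speedup(t1, times[(p, k)]):.2f}")  # SP = 7.14
print(f"EP = {efficiency(t1, times[(p, k)], p):.2f}")     # EP = 0.89
print(f"GP = {growth_efficiency(t1, times[(p, p * k)]):.2f}")  # GP = 0.91
```

Note that the same symbol SP names both speedups in the definitions; in code the distinction is carried by which dataset the timing was measured on.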
4 Evaluations using growth efficiency

Performance evaluations using growth efficiency are fairly intuitive, but there are some important details, particularly when the evaluation is meant to compare systems.

² That is, in theory. Although undetermined at present, there may exist some conditions in practice that allow for growth efficiency to be greater than one. This would be akin to superlinear speedup, in which efficiency can be greater than one.

Points (x, y) in scatter plots should be such that x is the dataset size and y is the growth efficiency. Using notation from definition 4, the points are (P · k, GP) for some k and various P. This brings up the issue of what k should be. k is the amount of data for a single processor, and as such, it is referred to herein as the processor capacity. Processor capacity can be defined in numerous ways. One possibility is availability of space (e.g., RAM, disk); another possibility is the maximum amount of data a single processor can handle without exceeding a specific upper bound on execution time. Regardless, an evaluation should include discussion and justification of how processor capacity is determined.

It is often the case that evaluations of different systems are performed independently of each other, and some meaningful comparison is retroactively sought. Thus, consideration must be given to potential differences in choice of growing dataset and notion of processor capacity. That is, evaluations are comparable only in as much as the parameters of the evaluations are comparable.

Growing datasets should be similar, if not the same. Meeting this requirement is straightforward with synthetic datasets, but more difficult with real-world datasets. Unless it is obvious, an evaluation should make clear the method by which the dataset was linearly “grown” with the number of processors.
It is conceivable that the order of adding data can vary the change of execution time, for example, in the context of inference with negation as failure.

Notions of processor capacity should be similar, although it is not necessary that they be exactly the same. For example, two evaluations using different notions of space-bounded processor capacity are likely to still be comparable strictly in the context of data scaling.

Although more thorough discussion is warranted, an example retroactive evaluation illustrating the differences between strong scaling and data scaling would likely be a better use of the remaining space, given the limitation on paper length. Some of the results from parallel RDFS inference in [13] are reorganized in this section to address data scaling. The growing dataset is LUBM [3] generated using a seed of zero. A notion of space-bounded capacity is used, specifically RAM-bounded capacity. In this case, the processor capacity is 2,699,360 triples. This is not necessarily the maximum processor capacity, but since this evaluation is retroactive, it is sufficient for the purposes of demonstration.

This example is intended to illustrate how growth efficiency differs from efficiency EP = SP/P in the strong scaling sense. Therefore, the two are plotted below over the number of processors. (Recall that it was stated that the x-axis should be dataset size for growth efficiency, but such an x-axis is nonsensical for efficiency.) Table 1a shows metrics for the overall computation, which includes I/O from/to a parallel file system. Efficiency and growth efficiency are plotted against the number of processors in figure 1a. Growth efficiency for 256 processors is 0.66. That is, 256 times as much data was processed in about 1/0.66 ≈ 1.5 times as much time as with a single processor. Efficiency, which is 0.36 for 256 processors, does not make this evident at all.

Table 1b and figure 1b show the same metrics for only the inference portion of the computation.
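As a sanity check, the 256-processor figures quoted above for the overall computation can be recomputed directly from the execution times reported in table 1a; a minimal sketch:

```python
# Execution times for the overall computation at 256 processors, as reported
# in table 1a (T1 in seconds on D(k); strong-scaling and grown-dataset times).
t1 = 360.0          # one processor, dataset D(k)
t256_strong = 3.91  # 256 processors, same dataset D(k)
t256_grown = 546.0  # 256 processors, grown dataset D(256 * k)

e256 = (t1 / t256_strong) / 256  # efficiency, EP = SP / P
g256 = t1 / t256_grown           # growth efficiency, GP = T1 / TP (definition 4)

print(f"E_256 = {e256:.2f}")  # E_256 = 0.36
print(f"G_256 = {g256:.2f}")  # G_256 = 0.66
```

The two metrics diverge exactly as the prose describes: the strong-scaling efficiency drops to 0.36 while growth efficiency remains at 0.66.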
The inference portion of the computation is embarrassingly parallel, so there is no interprocess communication or contention. Clearly, the inference portion of the computation is very scalable – at least for LUBM data – in both the strong and data scaling senses. This indicates that the inference portion of the computation will likely scale to very large datasets without significantly impacting execution time, although the same cannot be said for the overall computation.

Table 1: Execution times, efficiency, and growth efficiency up to 256 processors

(a) Overall

  P    TP (strong)   EP     TP (growth)   GP
  1       360        1.00      360        1.00
  2       178        1.01      415        0.87
  4       98.0       0.92      415        0.87
  8       49.1       0.92      408        0.88
  16      26.3       0.85      409        0.88
  32      14.4       0.78      442        0.81
  64      8.39       0.67      466        0.77
  128     5.41       0.52      506        0.71
  256     3.91       0.36      546        0.66

(b) Inference Only

  P    TP (strong)   EP     TP (growth)   GP
  1       283        1.00      283        1.00
  2       137        1.03      285        0.99
  4       67.7       1.04      286        0.99
  8       33.2       1.06      289        0.98
  16      16.4       1.08      288        0.98
  32      8.24       1.07      289        0.98
  64      4.23       1.04      292        0.97
  128     2.11       1.04      291        0.97
  256     1.10       1.00      290        0.97

[Figure 1: Efficiency and growth efficiency (log/log) up to 256 processors. (a) Overall; (b) Inference Only.]

5 Conclusion

Traditional scalability metrics from parallel computing fail to address the specific scalability challenge faced by parallel computations on data crawled from the Web, that is, the ability to handle large, growing datasets. A notion of data scaling has been defined that is concerned with how execution time is affected as data grows, increasing the number of processors linearly with dataset size. A new metric, growth efficiency, has been introduced for evaluating the data scalability of parallel computations. Focus has been on inference over RDF data crawled from the Semantic Web.

Acknowledgements. Much thanks to Jacopo Urbani and David Mizell for their constructive feedback in revising this paper.

References

1.
Bizer, C.: Pay-as-you-go Data Integration on the public Web of Linked Data. Keynote presentation at the 3rd Future Internet Symposium (September 2010)
2. Goodman, E.L., Mizell, D.: Scalable In-memory RDFS Closure on Billions of Triples. In: Proceedings of the 6th International Workshop on Scalable Semantic Web Knowledge Base Systems (2010)
3. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3(2-3) (2005)
4. Gustafson, J.L.: Reevaluating Amdahl’s Law. Communications of the ACM 31(5), 532–533 (May 1988)
5. Hitzler, P., van Harmelen, F.: A Reasonable Semantic Web. Semantic Web Journal 1(1), 39–44 (2010)
6. Hogan, A., Pan, J.Z., Polleres, A., Decker, S.: SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. In: Proceedings of the 9th International Semantic Web Conference (2010)
7. Kaoudi, Z., Miliaraki, I., Koubarakis, M.: RDFS Reasoning and Query Answering on Top of DHTs. In: Proceedings of the 7th International Semantic Web Conference. pp. 499–516 (2008)
8. Kotoulas, S., Oren, E., van Harmelen, F.: Mind the Data Skew: Distributed Inferencing by Speeddating in Elastic Regions. In: Proceedings of the 19th International World Wide Web Conference (2010)
9. Oren, E., Kotoulas, S., Anadiotis, G., Siebes, R., ten Teije, A., van Harmelen, F.: Marvin: Distributed reasoning over large-scale Semantic Web data. Web Semantics: Science, Services and Agents on the World Wide Web 7(4), 305–316 (2009)
10. Soma, R., Prasanna, V.K.: Parallel Inferencing for OWL Knowledge Bases. In: Proceedings of the 37th International Conference on Parallel Processing. pp. 75–82 (2008)
11. Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F., Bal, H.: OWL reasoning with WebPIE: calculating the closure of 100 billion triples. In: Proceedings of the 7th Extended Semantic Web Conference (2010)
12.
Urbani, J., Kotoulas, S., Oren, E., van Harmelen, F.: Scalable Distributed Reasoning using MapReduce. In: Proceedings of the 8th International Semantic Web Conference (2009)
13. Weaver, J., Hendler, J.A.: Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples. In: Proceedings of the 8th International Semantic Web Conference. pp. 682–697 (2009)
14. Williams, G.T., Weaver, J., Atre, M., Hendler, J.A.: Scalable Reduction of Large Datasets to Interesting Subsets. Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010)