<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Scalability Metric for Parallel Computations on Large, Growing Datasets (like the Web)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jesse Weaver</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tetherless World Constellation, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>91</fpage>
      <lpage>96</lpage>
      <abstract>
        <p>One of the greatest challenges facing computations on data crawled from the Web is the (in)ability to scale to such large quantities of data. While some computations are less challenged by this than others, inference on the Semantic Web is certainly limited in this regard. Parallelism has been employed to scale inference to larger datasets, but evaluations of recent works have fallen back on common parallel computing metrics that do not apply to this specific scalability challenge. In this position paper, the name data scaling is given to this scalability challenge, and the metric growth efficiency is defined.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the greatest challenges facing computations on data crawled from the
Web is the (in)ability to scale to such large quantities of data. Parallelism is
often employed to scale computation to larger datasets while keeping execution
time reasonable. However, traditional parallel computing metrics focus on one of
two forms of scaling: strong scaling or weak scaling. The goal of strong scaling is
to reduce execution time for a fixed problem by adding processors. The goal of
weak scaling is to keep execution time constant by adding processors to
accommodate additional workload (not necessarily proportional to the amount of data).
In contrast to these notions of scaling, a new notion of scalability – data scaling
– is introduced herein, and it is concerned with keeping execution time constant
by adding processors to accommodate more data.</p>
      <p>The ideas presented in this position paper are preliminary in nature, still
requiring formal development. The specific intent of this paper, though, is to
instigate a paradigm shift in the way scalability evaluations of parallel inference
(and possibly other problems) on large, RDF datasets are performed.</p>
      <p>
        In section 2, the need for data scaling is motivated. In section 3, common
scalability metrics are shown to be unfit for measuring data scalability, and a
new data scaling metric – growth efficiency – is defined. Discussion about, and
a retroactive example of, evaluating a system with growth efficiency is given in
section 4, and the conclusion is given in section 5.
      </p>
      <p>
        This work subscribes to the following statement by Hitzler and van Harmelen:
“Concerning scalability, reasoning systems have made major leaps in the recent
past . . . . However, it remains an open question when (and if¹) these approaches
will scale to the size of the web, . . . ” [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. From this statement, two assumptions
are inferred which are used to motivate the work presented herein.
      </p>
      <list list-type="order">
        <list-item><p>It is important for reasoning (including inference) systems to scale toward
the size of the Web.</p></list-item>
        <list-item><p>The Web is continuously growing.</p></list-item>
      </list>
      <p>Perhaps these assumptions are debatable, but for the intents and purposes of this
paper, they are considered axiomatic. Additionally, although inference motivates
this work, the definitions and their application are not specific to any particular
class of problems.</p>
      <p>To support definitions throughout this paper, a simplistic, intentionally
nonspecific computational model is assumed. A (terminating) parallel computation
is performed on some dataset, in some amount of time, utilizing some number
of processors. This is sufficient for the following discussion.</p>
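<p>To make the model concrete, the following minimal sketch (in Python; the record type and all field names are hypothetical, not from the paper) captures the three quantities the model assumes: a dataset, a number of processors, and an execution time.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    """One terminating parallel computation, per the simplistic model above."""
    dataset_size: int   # size of the dataset, e.g., number of RDF triples
    processors: int     # number of processors utilized
    seconds: float      # measured execution time

# A hypothetical measurement:
r = Run(dataset_size=1_000_000, processors=4, seconds=42.5)
print(r.dataset_size, r.processors, r.seconds)
```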
      <p>
        Indeed, progress has been made for specific forms of reasoning on large
datasets. At the very least, varying degrees of RDFS and OWL inference have
been achieved on real-world datasets containing around a billion RDF triples [
        <xref ref-type="bibr" rid="ref11 ref14 ref6 ref8">6,
8, 11, 14</xref>
        ]. However, in 2010, linked RDF data on the Web consisted of over 24.7
billion RDF triples, and the amount of such data appears to be rapidly growing
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, even if effective inference could be demonstrated on the entire
body of RDF data on the (current) Web, it would likely need to scale to even
larger datasets in the future.
      </p>
      <p>
        Evaluations of recently studied, parallel inference approaches [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref2 ref6 ref7 ref8 ref9">2, 6–13</xref>
        ] give
no direct indication of how well the approaches will scale to dataset sizes beyond
those used in the evaluations. This seems to be due to reliance on scalability
metrics traditionally used in parallel computing that do not apply to the challenge
of large and growing datasets. Therefore, there is a need to explicitly name and
define this scalability issue, and to provide relevant metrics for it.
      </p>
      <p>Data scaling is concerned with the change in execution time of a
parallel computation as processors are added to accommodate larger datasets. Data
scaling is arguably the central scalability issue for parallel computations on the
Web, and it is distinct from strong scaling and weak scaling as illustrated through
metrics in the following section.
</p>
    </sec>
    <sec id="sec-2">
      <title>Common scalability metrics and growth efficiency</title>
      <p>
        This section contains a brief review of fundamental metrics often used for
measuring scalability (in some sense) of parallel systems. These are defined herein in
an atypical way in order to relate the metrics to dataset size, thus highlighting
their insufficiency for measuring data scalability. Then a new metric for data
scaling is introduced. To aid this discussion, the following definition is needed.
      </p>
      <p>
        (Footnote 1: “Since the web keeps growing, they may never scale, even if they become much more
efficient.” Footnote in the quote from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The footnote number appearing in the quote herein
differs from the number used in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].)
      </p>
      <p>
        Definition 1. A growing dataset is effectively a function D that maps positive
integers to datasets such that for any positive integer n, |D(n)| = n and D(n) ⊂
D(n + 1).
      </p>
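<p>Definition 1 can be mirrored in code. The sketch below (Python; the particular enumeration of items is an arbitrary assumption) builds D(n) as the first n items of a fixed ordering, so that |D(n)| = n and D(n) ⊂ D(n + 1) hold by construction.</p>

```python
def growing_dataset(n):
    """D(n): the first n items of a fixed enumeration of the data.

    A fixed ordering guarantees both properties of Definition 1:
    len(D(n)) == n, and D(n) is a proper subset of D(n + 1).
    """
    return frozenset(range(n))

# The two properties of Definition 1, checked for a small n:
assert len(growing_dataset(3)) == 3
assert growing_dataset(3).issubset(growing_dataset(4))
assert growing_dataset(3) != growing_dataset(4)
```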
      <p>
        Relative speedup and metrics based on it are the scalability metrics reported
by nearly every recent work [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref2 ref8 ref9">2, 8–12</xref>
        ]. Others report only (variants of) execution
time [
        <xref ref-type="bibr" rid="ref13 ref6 ref7">6, 7, 13</xref>
        ].
      </p>
      <p>Definition 2. Let D be a growing dataset, and fix k to some positive integer.
Let T1 be the execution time for one processor with dataset D(k), and let TP
be the execution time for P processors with dataset D(k). Then the relative
speedup is defined as SP = T1/TP.</p>
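<p>Under Definition 2, relative speedup is a ratio of two timings on the same dataset. A minimal sketch (Python; the timing values are hypothetical):</p>

```python
def relative_speedup(t_1, t_p):
    """S_P = T_1 / T_P, both measured on the SAME dataset D(k) (Definition 2)."""
    return t_1 / t_p

# Hypothetical: one processor takes 100 s on D(k); 4 processors take 30 s.
s_4 = relative_speedup(100.0, 30.0)
print(round(s_4, 3))   # speedup of roughly 3.3 on 4 processors
```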
      <p>
        Clearly, relative speedup (the most common strong scaling metric) gives no direct
indication of data scaling since the dataset is fixed. As an alternative, one may
resort to (empirical) scaled speedup [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (the most common weak scaling metric).
      </p>
      <p>
        Definition 3. Let D be a growing dataset, and fix k to some positive integer.
Let t1 be the execution time for one processor with dataset D(k). Let i &gt; k be a
positive integer such that the execution time for P processors with dataset D(i)
is t1. Then let tP be the execution time for one processor with dataset D(i).
Then the scaled speedup is defined as SP = tP/t1.
      </p>
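<p>The scaled speedup of Definition 3 can be computed once the matching dataset size i has been found empirically. A sketch under the definition's setup (Python; all timings hypothetical):</p>

```python
def scaled_speedup(t_p, t_1):
    """Scaled speedup per Definition 3: S_P = t_P / t_1.

    t_1: serial time on D(k), which by construction equals the time for
         P processors on the larger dataset D(i).
    t_p: serial time on the larger dataset D(i).
    """
    return t_p / t_1

# Hypothetical: serial time on D(k) is 100 s; a single processor needs
# 380 s on the larger D(i) that 4 processors finish in 100 s.
print(scaled_speedup(380.0, 100.0))   # 3.8
```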
      <p>Unlike with relative speedup, in scaled speedup, the dataset size changes with
the number of processors. However, processors are added not to accommodate
more data but rather to keep execution time constant.</p>
      <p>Therefore, these metrics (and those metrics derived from them) are unsuited
for measuring data scalability, and a new metric is needed.</p>
      <p>Definition 4. Let D be a growing dataset, and fix k to some positive integer. Let
T1 be the execution time for one processor with dataset D(k), and let TP be the
execution time for P processors with dataset D(P · k). Then growth efficiency
is defined as GP = T1/TP.</p>
      <p>Growth efficiency directly aligns with the notion of data scaling. The idea is that
the size of the input dataset should grow linearly with the number of processors,
as captured in the definition. Thus, processors are added to accommodate more
data. Growth efficiency is more comparable to efficiency EP = SP/P (using the
relative speedup of definition 2) or scaled efficiency (using the scaled speedup of
definition 3) in that it is a value between zero and one, inclusive (see footnote 2).</p>
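<p>Growth efficiency can likewise be computed from two timings. In the sketch below (Python; all timings hypothetical), the P-processor timing is taken on a dataset P times larger than the serial baseline's, as Definition 4 requires.</p>

```python
def growth_efficiency(t_1, t_p):
    """G_P = T_1 / T_P (Definition 4).

    t_1: time for one processor on D(k).
    t_p: time for P processors on D(P * k), i.e., the dataset has grown
         linearly with the number of processors.
    """
    return t_1 / t_p

# Hypothetical: one processor handles D(k) in 100 s; 8 processors handle
# the 8x larger D(8 * k) in 125 s.
print(growth_efficiency(100.0, 125.0))   # 0.8
```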
    </sec>
    <sec id="sec-3">
      <title>Evaluations using growth efficiency</title>
      <p>Performance evaluations using growth efficiency are fairly intuitive, but there are
some important details, particularly when the evaluation is meant to compare
systems.</p>
      <p>(Footnote 2: That is, in theory. Although undetermined at present, there may exist some
conditions in practice that allow for growth efficiency to be greater than one. This would
be akin to superlinear speedup, in which efficiency can be greater than one.)</p>
      <p>Points (x, y) in scatter plots should be such that x is the dataset size and y is
the growth efficiency. Using notation from definition 4, the points are (P · k, GP)
for some k and various P. This brings up the issue of what k should be. k is
the amount of data for a single processor, and as such, it is referred to herein
as the processor capacity. Processor capacity can be defined in numerous ways.
One possibility is availability of space (e.g., RAM, disk); another possibility is
the maximum amount of data a single processor can handle without exceeding a
specific upper bound on execution time. Regardless, an evaluation should include
discussion and justification of how processor capacity is determined.</p>
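<p>The plotting convention above can be sketched as follows (Python; the processor capacity and all timings are hypothetical stand-ins, not measurements from the paper):</p>

```python
# x = dataset size P * k, y = growth efficiency G_P = T_1 / T_P.
k = 2_000_000          # assumed processor capacity, in triples
t_1 = 100.0            # hypothetical serial time on D(k)
timings = {2: 102.0, 4: 108.0, 8: 121.0}   # hypothetical P -> T_P on D(P * k)

points = [(p * k, t_1 / t_p) for p, t_p in sorted(timings.items())]
for x, y in points:
    print(x, round(y, 3))
```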
      <p>It is often the case that evaluations of different systems are performed
independently of each other, and some meaningful comparison is retroactively sought.
Thus, consideration must be given to potential differences in choice of growing
dataset and notion of processor capacity. That is, evaluations are comparable
only in as much as the parameters of the evaluations are comparable.</p>
      <p>Growing datasets should be similar, if not the same. Meeting this
requirement is straightforward with synthetic datasets, but more difficult with
real-world datasets. Unless it is obvious, an evaluation should make clear the method
by which the dataset was linearly “grown” with number of processors. It is
conceivable that the order of adding data can vary the change of execution time,
for example, in the context of inference with negation as failure.</p>
      <p>Notions of processor capacity should be similar, although it is not necessary
that they be exactly the same. For example, two evaluations using different
notions of space-bounded processor capacity are likely to still be comparable
strictly in the context of data scaling.</p>
      <p>
        Although more thorough discussion is warranted, an example retroactive
evaluation illustrating the differences between strong scaling and data scaling would
likely be a better use of the remaining space, given the limitation on paper length.
Some of the results from parallel RDFS inference in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] are reorganized in this
section to address data scaling. The growing dataset is LUBM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] generated using
a seed of zero. A notion of space-bounded capacity is used, specifically
RAM-bounded capacity. In this case, the processor capacity is 2,699,360 triples. This
is not necessarily the maximum processor capacity, but since this evaluation is
retroactive, it is sufficient for the purposes of demonstration.
      </p>
      <p>This example is intended to illustrate how growth efficiency differs from
efficiency EP = SP/P in the strong scaling sense. Therefore, the two are plotted
below over the number of processors. (Recall that it was stated that the x-axis
should be dataset size for growth efficiency, but such an x-axis is nonsensical for
efficiency.)</p>
      <p>Table 1b and figure 1b show the same metrics for only the inference portion
of the computation. The inference portion of the computation is embarrassingly
parallel, so there is no interprocess communication or contention. Clearly, the
inference portion of the computation is very scalable – at least for LUBM data
– in both the strong and data scaling senses. This indicates that the inference
portion of the computation will likely scale to very large datasets without
significantly impacting execution time, although the same cannot be said for the
overall computation.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Traditional scalability metrics from parallel computing fail to address the specific
scalability challenge faced by parallel computations on data crawled from the
Web, that is, the ability to handle large, growing datasets. A notion of data
scaling has been defined that is concerned with how execution time is affected
as data grows, increasing the number of processors linearly with dataset size. A
new metric, growth efficiency, has been introduced for evaluating data scalability
of parallel computations. Focus has been on inference over RDF data crawled
from the Semantic Web.</p>
      <p>Acknowledgements. Much thanks to Jacopo Urbani and David Mizell for
their constructive feedback in revising this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Pay-as-you-go Data Integration on the public Web of Linked Data</article-title>
          .
          <source>Keynote Presentation at the 3rd Future Internet Symposium</source>
          (September
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mizell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Scalable In-memory RDFS Closure on Billions of Triples</article-title>
          .
          <source>In: Proceedings of the 6th International Workshop on Scalable Semantic Web Knowledge Base Systems</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          -3) (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gustafson</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Reevaluating Amdahl's Law</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>31</volume>
          (
          <issue>5</issue>
          ),
          <fpage>532</fpage>
          -
          <lpage>533</lpage>
          (May
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hitzler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A Reasonable Semantic Web</article-title>
          .
          <source>Semantic Web Journal</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples</article-title>
          .
          <source>In: Proceedings of the 9th International Semantic Web Conference</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kaoudi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miliaraki</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koubarakis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>RDFS Reasoning and Query Answering on Top of DHTs</article-title>
          .
          <source>In: Proceedings of the 7th International Semantic Web Conference</source>
          . pp.
          <fpage>499</fpage>
          -
          <lpage>516</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Mind the Data Skew: Distributed Inferencing by Speeddating in Elastic Regions</article-title>
          .
          <source>In: Proceedings of the 19th International World Wide Web Conference</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anadiotis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siebes</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ten Teije</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Marvin: Distributed reasoning over large-scale Semantic Web data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>7</volume>
          (
          <issue>4</issue>
          ),
          <fpage>305</fpage>
          -
          <lpage>316</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Soma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prasanna</surname>
            ,
            <given-names>V.K.</given-names>
          </string-name>
          :
          <article-title>Parallel Inferencing for OWL Knowledge Bases</article-title>
          .
          <source>In: Proceedings of the 37th International Conference on Parallel Processing</source>
          . pp.
          <fpage>75</fpage>
          -
          <lpage>82</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Urbani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maassen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bal</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>OWL reasoning with WebPIE: calculating the closure of 100 billion triples</article-title>
          .
          <source>In: Proceedings of the 7th Extended Semantic Web Conference</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Urbani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotoulas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Scalable Distributed Reasoning using MapReduce</article-title>
          .
          <source>In: Proceedings of the 8th International Semantic Web Conference</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Weaver</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Parallel Materialization of the Finite RDFS Closure for Hundreds of Millions of Triples</article-title>
          .
          <source>In: Proceedings of the 8th International Semantic Web Conference</source>
          . pp.
          <fpage>682</fpage>
          -
          <lpage>697</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>G.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weaver</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atre</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Scalable Reduction of Large Datasets to Interesting Subsets</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>8</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>