<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Modeling in the Age of Big Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robert Heinrich</string-name>
          <email>robert.heinrich@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Holger Eichelberger</institution>
          ,
          <addr-line>Klaus Schmid</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Design and Quality Karlsruhe Institute of Technology Karlsruhe</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Software Systems Engineering University of Hildesheim Hildesheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Big Data aims at the efficient processing of massive amounts of data. Performance modeling is often used to optimize the performance of systems under development. Based on experiences from modeling Big Data solutions, we describe some problems in applying performance modeling and discuss potential solution approaches.</p>
        <p>Index Terms: Performance, modeling, Big Data, Palladio</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. MOTIVATION</title>
      <p>Big Data solutions store, transfer, and process huge data
sets. Modeling the architecture of Big Data systems in a
component-based fashion and using architectural models for
conducting performance analysis is essential to compare design
alternatives. As Big Data solutions are typically realized as
distributed systems, performance analysis can help to
determine the required amount of resources and the distribution
of analysis tasks over a cluster prior to implementation.</p>
      <p>
        An established research area at the intersection of
model-driven and component-based software engineering is using
component models to conduct performance simulation and
prediction. There are numerous approaches with varying focus
in this area [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The key research that led to this paper was
driven by experience with Palladio [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, we believe
that most of our points hold for the broader field of
model-based performance prediction of component-based systems.
      </p>
      <p>
        In our research, we aim to apply model-based performance
prediction to Big Data systems. Such systems typically use
established infrastructures like Storm, Spark, etc. Our current
focus is on performance models for near real-time computation
for financial data streams [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. As a consequence, our discussion
of the approaches and their limitations is shaped by this
background. However, we believe that several of our points
apply to a larger range of situations. As part of our
discussion, we make explicit which problems are particular
to Big Data and which ones probably apply more widely.
      </p>
      <p>The remainder of the paper is structured as follows. In Section II, we describe the
identified problems. This is the core of our contribution, as, to our
knowledge, this is the first time these issues are made
explicit. In Section III, we discuss some initial ideas on how to
address these issues. Finally, in Section IV, we conclude.</p>
    </sec>
    <sec id="sec-2">
      <title>II. IDENTIFIED PROBLEMS</title>
      <p>
        Our analysis was mainly driven by attempts to model the
performance of Big Data systems for achieving adaptive behavior.
Our modeling studies were performed with Palladio.
Hence, in terms of experience, the points below must be
regarded as Palladio-specific. However, as we will discuss, the
points also apply to a broader range of modeling approaches
and (partially) even beyond Big Data systems. First limitations
of Palladio for modeling Big Data systems have been described
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our analysis goes beyond these limitations and is
structured into 3+1 points: the first three issues refer to
potential problems in modeling characteristics of Big Data
systems, while the last point is of a more fundamental nature.
      </p>
      <p>Flexible Component Distribution: In Big Data systems, it is
often necessary to modify the distribution of worker
implementations across different resources, e.g., switching
from two servers responsible for a certain kind of processing to
three servers hosting the same logical component. This can also
be combined with adding servers (e.g., automatic scale-out).
Such behavior is not specific to Big Data systems, as even web
shops like Amazon do this. These distribution changes are
triggered by changes in the workload of the system.</p>
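      <p>The workload-triggered redistribution described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a feature of any existing modeling approach: the function names, the target utilization of 0.7, and the workload trace are all hypothetical. The sketch derives how many servers should host one logical component for a given arrival rate and reports each step at which the distribution would change.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def replicas_needed(arrival_rate, capacity_per_server, target_utilization=0.7):
    """Smallest server count that keeps utilization at or below the target.

    All parameters are illustrative assumptions (items per second and a
    utilization threshold), not values from a real system.
    """
    if arrival_rate < 0 or capacity_per_server <= 0:
        raise ValueError("rate must be non-negative and capacity positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_capacity = capacity_per_server * target_utilization
    return max(1, math.ceil(arrival_rate / usable_capacity))


def plan_distribution(workload_trace, capacity_per_server, target_utilization=0.7):
    """Return (step, arrival_rate, replicas) for every step where the
    required number of servers differs from the previous step."""
    changes = []
    current = None
    for step, rate in enumerate(workload_trace):
        needed = replicas_needed(rate, capacity_per_server, target_utilization)
        if needed != current:
            changes.append((step, rate, needed))
            current = needed
    return changes


if __name__ == "__main__":
    # Hypothetical workload in items per second, one value per time step.
    trace = [900, 1200, 1500, 1900, 1300, 800]
    for step, rate, replicas in plan_distribution(trace, capacity_per_server=1000):
        print(f"step {step}: {rate} items/s -> {replicas} server(s)")
```
]]></preformat>
      <p>With an assumed capacity of 1000 items/s per server, the hypothetical trace moves from two servers to three and back to two. A performance model with a static component-resource binding would have to fix one of these deployments for the whole analysis.</p>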
      <p>
        Today, performance modeling and analysis approaches like Palladio
offer no capabilities for describing arbitrarily complex
dynamic component-resource bindings.
However, approaches like SimuLizar [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and iObserve [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are
able to support some specific adaptations. SimuLizar considers
deployment changes during analysis at design-time. iObserve
focuses on reflecting deployment changes observed in the running
system, such as migration and (de-)replication.
      </p>
      <p>Data-oriented Load Distribution: Big Data is centered on
data processing and the data itself is used to control the
applications, i.e., the processing component is selected based
on the type of data (data-flow processing). In contrast, existing
performance modeling approaches focus on the call-relation
among components and do not consider data as first class
entities. Thus, they do not provide capabilities for modeling
data streams. Based on our experience, this makes it very hard
to model Big Data applications. Beyond ease and
adequacy of modeling, it also leads to situations where specific,
performance-relevant aspects cannot be modeled. For example, if the
amount of data stays the same over time but the composition
of data types changes, this may lead to adaptations. It seems
these issues are particularly relevant to Big Data applications.</p>
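      <p>To illustrate why the composition of data types matters even when the total volume stays constant, consider the following minimal sketch. The data types (trade, quote), the component names, and the per-item costs are hypothetical and chosen only for illustration. Each item is routed to a processing component based on its type, and the sketch compares the resulting load for two mixes of equal size.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter

# Hypothetical routing table: data type -> (processing component, cost per item in ms).
# Names and costs are assumptions for illustration only.
ROUTING = {
    "trade": ("TradeAnalyzer", 4.0),
    "quote": ("QuoteFilter", 0.5),
}


def component_load(items_per_second):
    """Work demanded from each component, in ms of processing per second."""
    load = {}
    for data_type, rate in items_per_second.items():
        component, cost_ms = ROUTING[data_type]
        load[component] = load.get(component, 0.0) + rate * cost_ms
    return load


def overloaded(load, budget_ms=1000.0):
    """Components whose demand exceeds what one instance can process per second."""
    return sorted(name for name, demand in load.items() if demand > budget_ms)


if __name__ == "__main__":
    # Same total volume (1000 items/s), different composition of data types.
    mix_a = Counter(trade=100, quote=900)
    mix_b = Counter(trade=400, quote=600)
    for label, mix in (("mix A", mix_a), ("mix B", mix_b)):
        load = component_load(mix)
        print(label, sum(mix.values()), "items/s", load, "overloaded:", overloaded(load))
```
]]></preformat>
      <p>Both mixes carry 1000 items/s, yet only the second one exceeds the assumed one-second budget of a single TradeAnalyzer instance. A model that records only the call relation and the total volume cannot distinguish the two cases.</p>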
      <p>Explicit Queuing Components: Big Data applications utilize
queuing components for various purposes, but in particular, to
organize the distribution of data across the application and to
smooth peak loads. Thus, queuing has a very significant
performance impact. The precise impact depends on specific
aspects of the queuing. Performance models that omit such
aspects are inadequate, but current
component-based modeling approaches like Palladio do not have modeling
capabilities for queuing components. (Queuing exists, but is
restricted to modeling resource usage.) Thus, it is impossible to
create adequate models of this aspect of system behavior. As
internal queuing exists in other systems as well, we assume that
this will be a problem for modeling these types of systems, too.</p>
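      <p>The smoothing effect of a queuing component, and its dependence on specific aspects of the queue, can be made concrete with a small discrete-time simulation. This is a sketch under simplifying assumptions (fixed service rate per tick, FIFO order, optional bounded capacity with dropping); the function name, the arrival trace, and all parameter values are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def simulate_queue(arrivals, service_rate, capacity=None):
    """Discrete-time FIFO buffer in front of a processing component.

    arrivals: items arriving in each tick (hypothetical trace).
    service_rate: items the downstream component can process per tick.
    capacity: optional maximum backlog; excess items are dropped.
    Returns (backlog after each tick, total dropped items).
    """
    if service_rate < 0:
        raise ValueError("service_rate must be non-negative")
    backlog = 0
    dropped = 0
    history = []
    for arriving in arrivals:
        backlog += arriving
        # A bounded queue rejects whatever does not fit.
        if capacity is not None and backlog > capacity:
            dropped += backlog - capacity
            backlog = capacity
        # The downstream component drains at most service_rate items per tick.
        backlog -= min(backlog, service_rate)
        history.append(backlog)
    return history, dropped


if __name__ == "__main__":
    # Hypothetical burst: two ticks at twice the service rate.
    burst = [50, 50, 200, 200, 50, 50, 50, 50]
    print(simulate_queue(burst, service_rate=100))
    print(simulate_queue(burst, service_rate=100, capacity=250))
```
]]></preformat>
      <p>In the unbounded case, the queue absorbs the burst, at the price of a backlog that peaks at 200 items, which is two further ticks of work for the downstream component. With the assumed capacity of 250, the backlog stays lower but 50 items are lost. Which of these behaviors applies is exactly the kind of queue-specific aspect a performance model would need to capture.</p>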
      <p>The previous three points describe three dimensions of
modeling capabilities that are not – or not sufficiently –
supported by existing performance modeling approaches like
Palladio. However, there exists a broader and more
fundamental problem to which we will turn now:</p>
      <p>
        Blackbox Infrastructure: In Big Data applications,
technologies like Storm, Spark, Hadoop, etc. play a central
role. However, these are very large and complex infrastructures.
No models of their behavior exist, and because of
their size alone, constructing one would be a daunting task.
The situation that large, unknown infrastructures are part of the
models is not new. However, experience shows that for traditional
systems, it was possible to obtain adequate results
despite abstracting from classical infrastructures [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This is different
for Big Data, as the infrastructure operations may have a strong
impact on system performance. This leads to the fundamental
problem of how to derive models (at least of critical aspects) of
such large infrastructures. Manual model construction seems
out of the question, as the modeling alone would often be many times
more complex than modeling the core application. Moreover, it
would entail a significant reengineering project.
      </p>
    </sec>
    <sec id="sec-3">
      <title>III. SOLUTION IDEAS</title>
    </sec>
    <sec id="sec-4">
      <p>Based on the problems identified in Section II, we now discuss some potential solution approaches.</p>
      <p>
        Flexible Component Distribution: Performance analysis of
systems with flexible component distribution requires the
inclusion of adequate modeling primitives to support this. An
approach that already heads in this direction is SimuLizar [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
It supports the modeling of self-adaptation rules, e.g., for load
balancing. However, the expressiveness of its rules is not
sufficient to support all relevant distribution adaptations.
      </p>
      <p>To improve performance
modeling approaches, we assume it is necessary to significantly
enhance the capabilities for describing runtime adaptation rules
and to extend analysis approaches so that adaptation
effects are taken into account in the analysis.</p>
      <p>Data-oriented Load Distribution: As discussed earlier, the
key issue is that component-oriented modeling
approaches rely on call-relations, while in Big Data systems,
the key relation is the data flow. Hence, we assume that a natural
and necessary step will be to extend the modeling
approaches with explicit data-flow modeling primitives.</p>
      <p>
        However, we are not the first to propose this. Seifermann
proposed such an extension for the Palladio approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. His
motivation was completely different: it focused on
analyzing systems for privacy or SLA violations. We imagine
that an appropriate extension for data-flow modeling could
serve both purposes simultaneously.
      </p>
      <p>Explicit Queuing Components: To the best of our
knowledge, queuing components are not yet considered as
predefined model elements at the architecture level in existing
approaches to architecture analysis. They may be constructed
manually using formalisms like Layered Queuing Networks
(LQNs) and Queuing Petri Nets (QPNs). However, to support
an integrated performance analysis in a component-based
paradigm, it would be necessary to integrate these capabilities
into the underlying component models. We regard this as a
difficult but mandatory challenge for the performance analysis
of Big Data systems.</p>
      <p>Blackbox Infrastructure: Even if the above extensions
were sufficient to model the characteristics of complex Big
Data systems, the problem would remain that the underlying
infrastructures need to be modeled as well. Given
existing performance-oriented reengineering approaches, this
would require significant reengineering efforts that seem rather
unrealistic. Hence, the challenge here is: how can we (semi-)
automatically derive sufficient model information from such
infrastructures? Once these infrastructures have been
comprehensively modeled, these models could be reused as a
single component or as a parameterizable pattern.</p>
    </sec>
    <sec id="sec-5">
      <title>IV. CONCLUSION</title>
      <p>In this paper, we provided an initial discussion of current
shortcomings in model-driven performance engineering. While
it was based on modeling Big Data systems with Palladio, we
believe the experiences hold at least for the broader range of
performance modeling of Big Data systems, and some of them
may hold for a much wider range of cases where complex
off-the-shelf infrastructures are used in system development.</p>
      <p>It is the goal of our ongoing research to address the
identified shortcomings by augmenting modeling approaches
and providing novel methods for model construction.</p>
    </sec>
    <sec id="sec-6">
      <title>V. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luckey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Becker</surname>
          </string-name>
          .
          <article-title>Performance analysis of self-adaptive systems for requirements validation at design-time</article-title>
          .
          <source>Quality of Software Architectures</source>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          , ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          .
          <article-title>Architectural run-time models for performance and privacy analysis in dynamic cloud applications</article-title>
          .
          <source>Performance Evaluation Review</source>
          ,
          <volume>43</volume>
          (
          <issue>4</issue>
          ):
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          , ACM,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Koziolek</surname>
          </string-name>
          .
          <article-title>Performance evaluation of component-based software systems: A survey</article-title>
          .
          <source>Performance Evaluation</source>
          ,
          <volume>67</volume>
          (
          <issue>8</issue>
          ):
          <fpage>634</fpage>
          -
          <lpage>658</lpage>
          , Elsevier,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kroß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brunnert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Krcmar</surname>
          </string-name>
          .
          <article-title>Modeling Big Data Systems by Extending the Palladio Component Model</article-title>
          .
          <source>Softwaretechnik-Trends</source>
          <volume>35</volume>
          (
          <issue>3</issue>
          ). GI.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Reussner</surname>
          </string-name>
          et al., (Ed.).
          <article-title>Modeling and Simulating Software Architectures - The Palladio Approach</article-title>
          . MIT Press,
          <year>2016</year>
          . ISBN: 978-0-262-03476-0.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Seifermann</surname>
          </string-name>
          .
          <article-title>Architectural data flow analysis</article-title>
          .
          <source>IEEE/IFIP Working Conference on Software Architecture</source>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>271</lpage>
          , IEEE,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Qin</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Eichelberger</surname>
          </string-name>
          .
          <article-title>Impact-minimizing Runtime Switching of Distributed Stream Processing Algorithms</article-title>
          .
          <source>Big Data Processing - Reloaded</source>
          , Joint Conference, CEUR-WS.org,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>