=Paper= {{Paper |id=Vol-1723/6 |storemode=property |title=Performance Modeling in the Age of Big Data: Some reflections on current limitations |pdfUrl=https://ceur-ws.org/Vol-1723/6.pdf |volume=Vol-1723 |authors=Robert Heinrich,Holger Eichelberger,Klaus Schmid |dblpUrl=https://dblp.org/rec/conf/models/HeinrichES16 }} ==Performance Modeling in the Age of Big Data: Some reflections on current limitations== https://ceur-ws.org/Vol-1723/6.pdf
          Performance Modeling in the Age of Big Data
                                      Some Reflections on Current Limitations

                      Robert Heinrich                                             Holger Eichelberger, Klaus Schmid
                Software Design and Quality                                           Software Systems Engineering
              Karlsruhe Institute of Technology                                         University of Hildesheim
                     Karlsruhe, Germany                                                   Hildesheim, Germany
                   robert.heinrich@kit.edu                                    {eichelberger,schmid}@sse.uni-hildesheim.de

  Abstract—Big Data aims at the efficient processing of massive amounts of data. Performance modeling is often used to optimize the performance of systems under development. Based on experiences from modeling Big Data solutions, we describe some problems in applying performance modeling and discuss potential solution approaches.
  Index Terms—Performance, modeling, Big Data, Palladio

                          I. MOTIVATION

    Big Data solutions store, transfer, and process huge data sets. Modeling the architecture of Big Data systems in a component-based fashion and using architectural models for conducting performance analysis is essential to compare design alternatives. As Big Data solutions are typically realized as distributed systems, performance analysis can help to determine the required amount of resources and the distribution of analysis tasks over a cluster prior to implementation.
    An established research area at the intersection of model-driven and component-based software engineering is using component models to conduct performance simulation and prediction. There are numerous approaches with varying focus in this area [3]. The key research that led to this paper was driven by experience with Palladio [4]; however, we believe that most of our points hold for the broader field of model-based performance prediction of component-based systems.
    In our research, we aim to apply model-based performance prediction to Big Data systems. Such systems typically use established infrastructures like Storm, Spark, etc. Our current focus is on performance models for near real-time computation for financial data streams [7]. As a consequence, our discussion of the approaches and their limitations is driven by this background. However, we believe that several of our points actually apply to a larger range of situations. As part of our discussion, we will make explicit which problems are particular to Big Data and which ones probably apply to a wider range.
    The structure of the paper is as follows: In Section II we describe the identified problems. This is the core of this contribution, as this is to our knowledge the first time these issues are made explicit. In Section III we discuss some initial ideas on how to address these issues. Finally, in Section IV we conclude.

                     II. IDENTIFIED PROBLEMS

    Our analysis was mainly driven by attempts to model the performance of Big Data systems for achieving adaptive behavior. Our modeling studies were performed with Palladio. Hence, in terms of experiences, the points below must be regarded as Palladio-specific. However, as we will discuss, the points also apply to a broader range of modeling approaches and (partially) even beyond Big Data systems. First limitations of Palladio for modeling Big Data systems have been described in [7]. Our analysis goes beyond these limitations and is structured into 3+1 points: the first three issues refer to potential problems in modeling characteristics of Big Data systems, while the last point is of a more fundamental nature.
    Flexible Component Distribution: In Big Data systems, it is often necessary to modify the distribution of worker implementations across different resources, e.g., switching from two servers responsible for a certain kind of processing to three servers hosting the same logical component. This can also be combined with adding servers (e.g., automatic scale-out). Such behavior is not specific to Big Data systems, as even web shops like Amazon do this. These distribution changes are triggered by changes in the workload of the system.
    Today, capabilities for describing arbitrarily complex dynamic component-resource bindings do not exist in performance modeling and analysis approaches like Palladio. However, approaches like SimuLizar [1] and iObserve [2] are able to support some specific adaptations. SimuLizar considers deployment changes during analysis at design time. iObserve focuses on reflecting observed changes in the deployment of the running system, such as migration and (de-)replication.
    Data-oriented Load Distribution: Big Data is centered on data processing, and the data itself is used to control the applications, i.e., the processing component is selected based on the type of data (data-flow processing). In contrast, existing performance modeling approaches focus on the call relations among components and do not consider data as first-class entities. Thus, they do not provide capabilities for modeling data streams. Based on our experience, this makes it very hard to model Big Data applications. However, beyond ease and adequacy of modeling, it also leads to situations where specific, performance-relevant aspects cannot be modeled: e.g., if the amount of data stays the same over time but the composition of data types changes, this may lead to adaptations. It seems these issues are particularly relevant to Big Data applications.
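To make the contrast concrete, the following minimal Python sketch (with hypothetical component and type names, not Palladio notation) illustrates data-flow processing: the component a record reaches, and hence the load on each component, is determined by the data itself rather than by any fixed call relation.

```python
from collections import Counter

# Hypothetical processing components, keyed by the data type they handle.
ROUTES = {
    "trade": "TradeAnalyzer",
    "quote": "QuoteAggregator",
    "news": "SentimentScorer",
}

def route(record):
    """Select the processing component based on the record's type."""
    return ROUTES[record["type"]]

def component_load(stream):
    """Count how many records each component receives."""
    return Counter(route(r) for r in stream)

# Two streams with the same total volume but different type composition
# lead to different load distributions per component.
stream_a = [{"type": "trade"}] * 70 + [{"type": "quote"}] * 30
stream_b = [{"type": "trade"}] * 30 + [{"type": "quote"}] * 70

print(component_load(stream_a))  # TradeAnalyzer receives 70, QuoteAggregator 30
print(component_load(stream_b))  # TradeAnalyzer receives 30, QuoteAggregator 70
```

Here two streams of identical volume but different type composition load the components differently, which is exactly the performance-relevant effect a purely call-oriented model cannot express.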
    Explicit Queuing Components: Big Data applications utilize queuing components for various purposes, in particular to organize the distribution of data across the application and to smooth peak loads. Thus, queuing has a very significant performance impact. The precise impact depends on specific aspects of the queuing. Any performance models that omit such aspects are significantly insufficient, but current component-based modeling approaches like Palladio do not have modeling capabilities for queuing components. (Queuing exists, but it is restricted to modeling resource usage.) Thus, it is impossible to create adequate models of this aspect of system behavior. As internal queuing exists in other systems as well, we assume that this will be a problem for modeling these types of systems, too.
    The previous three points describe three dimensions of modeling capabilities that are not – or not sufficiently – supported by existing performance modeling approaches like Palladio. However, there exists a broader and more fundamental problem, to which we turn now:
    Blackbox Infrastructure: In Big Data applications, technologies like Storm, Spark, Hadoop, etc. play a central role; however, these are very large and complex infrastructures. There do not exist any models of their behavior, and because of their size alone it would be a daunting task to construct one. The situation that large, unknown infrastructures are part of the models is not new. But experience shows that for traditional systems, despite abstracting from classical infrastructures, it was possible to get very adequate results [4]. This is different for Big Data, as the infrastructure operations may have a strong impact on system performance. This leads to the fundamental problem of how to derive models (at least of critical aspects) of such large infrastructures. Manual model construction seems out of scope, as the modeling alone would often be many times more complex than modeling the core application. Moreover, it would entail a significant reengineering project.

                       III. SOLUTION IDEAS

    Based on the problems identified in Section II, we will now discuss some potential solution approaches.
    Flexible Component Distribution: Performance analysis of systems with flexible component distribution requires the inclusion of adequate modeling primitives to support this. An approach that already heads in this direction is SimuLizar [1]. It supports the modeling of self-adaptation rules, e.g., for load balancing. However, the expressiveness of its rules is not sufficient to support all relevant distribution adaptations.
    In order to improve the capabilities of performance modeling approaches, we assume it is necessary to significantly enhance the capabilities for describing runtime adaptation rules and to further enhance analysis approaches so that the adaptation effects are taken into account in the analysis.
    Data-oriented Load Distribution: As discussed earlier, the key issue is that the modeling in component-oriented approaches relies on call relations, while in Big Data systems the key relation is the data flow. Hence, we assume a natural but necessary extension will be to extend the modeling approaches with explicit data-flow modeling primitives.
    However, we are not the first to propose this. Seifermann proposed such an extension for the Palladio approach [6]. His motivation was completely different, i.e., it focused on analyzing systems for privacy or SLA violations. We imagine that an appropriate extension for data-flow modeling could actually serve both purposes simultaneously.
    Explicit Queuing Components: To the best of our knowledge, queuing components are not yet considered as predefined model elements at the architecture level in existing approaches to architecture analysis. They may be constructed manually using formalisms like Layered Queuing Networks (LQNs) and Queuing Petri Nets (QPNs). However, to support an integrated performance analysis in a component-based paradigm, it would be necessary to integrate these capabilities into the underlying component models. We regard this as a difficult but mandatory challenge for the performance analysis of Big Data systems.
    Blackbox Infrastructure: Even if the above extensions were sufficient to model the characteristics of complex Big Data systems, the problem would remain that the underlying infrastructures need to be modeled as well. Given existing performance-oriented reengineering approaches, this would require significant reengineering efforts that seem rather unrealistic. Hence, the challenge here is: how can we (semi-)automatically derive sufficient model information from such infrastructures? If these infrastructures were comprehensively modeled once, the resulting models could be reused as a single component or as a parameterizable pattern.

                         IV. CONCLUSION

    In this paper, we provided an initial discussion of current shortcomings in model-driven performance engineering. While it was based on modeling Big Data systems with Palladio, we believe the experiences hold at least for the broader range of performance modeling of Big Data systems, and some of them may hold for a much wider range of cases where complex off-the-shelf infrastructures are used in system development.
    It is the goal of our ongoing research to address the identified shortcomings by augmenting modeling approaches and providing novel methods for model construction.

                         V. REFERENCES

[1] M. Becker, M. Luckey, and S. Becker. Performance analysis of self-adaptive systems for requirements validation at design-time. Quality of Software Architectures, pp. 43-52, ACM, 2013.
[2] R. Heinrich. Architectural run-time models for performance and privacy analysis in dynamic cloud applications. Performance Evaluation Review, 43(4):13-22, ACM, 2016.
[3] H. Koziolek. Performance evaluation of component-based software systems: A survey. Performance Evaluation, 67(8):634-658, Elsevier, 2010.
[4] J. Kroß, A. Brunnert, and H. Krcmar. Modeling Big Data Systems by Extending the Palladio Component Model. Softwaretechnik-Trends, 35(3), GI, 2015.
[5] R. Reussner et al. (Eds.). Modeling and Simulating Software Architectures – The Palladio Approach. MIT Press, 2016. ISBN: 978-0-262-03476-0.
[6] S. Seifermann. Architectural data flow analysis. IEEE/IFIP Working Conference on Software Architecture, pp. 270-271, IEEE, 2016.
[7] C. Qin and H. Eichelberger. Impact-minimizing Runtime Switching of Distributed Stream Processing Algorithms. Big Data Processing - Reloaded, Joint Conference, CEUR-WS.org, 2016.