Performance Modeling in the Age of Big Data
Some Reflections on Current Limitations

Robert Heinrich
Software Design and Quality
Karlsruhe Institute of Technology
Karlsruhe, Germany
robert.heinrich@kit.edu

Holger Eichelberger, Klaus Schmid
Software Systems Engineering
University of Hildesheim
Hildesheim, Germany
{eichelberger,schmid}@sse.uni-hildesheim.de

Abstract—Big Data aims at the efficient processing of massive amounts of data. Performance modeling is often used to optimize the performance of systems under development. Based on experiences from modeling Big Data solutions, we describe some problems in applying performance modeling and discuss potential solution approaches.

Index Terms—Performance, modeling, Big Data, Palladio

I. MOTIVATION

Big Data solutions store, transfer, and process huge data sets. Modeling the architecture of Big Data systems in a component-based fashion and using architectural models for conducting performance analysis is essential to compare design alternatives. As Big Data solutions are typically realized as distributed systems, performance analysis can help to determine the required amount of resources and the distribution of analysis tasks over a cluster prior to implementation.

An established research area at the intersection of model-driven and component-based software engineering is using component models to conduct performance simulation and prediction. There are numerous approaches with varying focus in this area [3]. The key research that led to this paper was driven by experience with Palladio [4]; however, we believe that most of our points hold for the broader field of model-based performance prediction of component-based systems.

In our research, we aim to apply model-based performance prediction to Big Data systems. Such systems typically use established infrastructures like Storm, Spark, etc. Our current focus is on performance models for near real-time computation for financial data streams [7]. As a consequence, our discussion of the approaches and their limitations is driven by this background. However, we believe that several of our points actually apply to a larger range of situations. As part of our discussion, we will make explicit which problems are particular to Big Data and which ones probably apply to a wider range.

The paper is structured as follows: In Section II we describe the identified problems. This is the core of this contribution, as this is to our knowledge the first time these issues are made explicit. In Section III we discuss some initial ideas on how to address these issues. Finally, in Section IV we conclude.

II. IDENTIFIED PROBLEMS

Our analysis was mainly driven by attempts to model the performance of Big Data systems for achieving adaptive behavior. Our modeling studies were performed with Palladio. Hence, in terms of experiences the points below must be regarded as Palladio-specific. However, as we will discuss, the points also apply to a broader range of modeling approaches and (partially) even beyond Big Data systems. First limitations of Palladio for modeling Big Data systems have been described in [7]. Our analysis goes beyond these limitations and is structured into 3+1 points: the first three issues refer to potential problems in modeling characteristics of Big Data systems, while the last point is of a more fundamental nature.

Flexible Component Distribution: In Big Data systems, it is often necessary to modify the distribution of worker implementations across different resources, e.g., switching from two servers responsible for a certain kind of processing to three servers hosting the same logical component. This can also be combined with adding servers (e.g., automatic scale-out). Such behavior is not specific to Big Data systems, as even web shops like Amazon do this. These distribution changes are triggered by changes in the workload of the system.

Today, capabilities for describing arbitrarily complex dynamic component-resource binding do not exist in performance modeling and analysis approaches like Palladio. However, approaches like SimuLizar [1] and iObserve [2] are able to support some specific adaptations. SimuLizar considers deployment changes during analysis at design-time. iObserve focuses on reflecting observed changes in the running system in deployment like migration and (de-)replication.

Data-oriented Load Distribution: Big Data is centered on data processing and the data itself is used to control the applications, i.e., the processing component is selected based on the type of data (data-flow processing). In contrast, existing performance modeling approaches focus on the call-relation among components and do not consider data as first-class entities. Thus, they do not provide capabilities for modeling data streams. Based on our experience, this makes it very hard to model Big Data applications. However, beyond ease and adequacy of modeling, it also leads to situations where specific, performance-relevant aspects cannot be modeled, e.g., if the amount of data stays the same over time, but the composition of data types changes, this may lead to adaptations. It seems these issues are particularly relevant to Big Data applications.

Explicit Queuing Components: Big Data applications utilize queuing components for various purposes, but in particular, to organize the distribution of data across the application and to smooth peak loads. Thus, queuing has a very significant performance impact. The precise impact depends on specific aspects of the queuing. Any performance models that omit such aspects are significantly insufficient, but current component-based modeling approaches like Palladio do not have modeling capabilities for queuing components. (Queuing exists, but is restricted to modeling resource usage.) Thus, it is impossible to create adequate models of this aspect of system behavior. As internal queuing exists in other systems as well, we assume that this will be a problem for modeling these types of systems, too.

The previous three points describe three dimensions of modeling capabilities that are not – or not sufficiently – supported by existing performance modeling approaches like Palladio. However, there exists a broader and more fundamental problem to which we will turn now:

Blackbox Infrastructure: In Big Data applications, technologies like Storm, Spark, Hadoop, etc. play a central role; however, these are very large and complex infrastructures. There do not exist any models of their behavior and because of their size alone it would be a daunting task to construct one. The situation that large, unknown infrastructures are part of the models is not new. But experience shows that for traditional systems, despite abstracting from classical infrastructures, it was possible to get very adequate results [4]. This is different for Big Data, as the infrastructure operations may have a strong impact on system performance. This leads to the fundamental problem of how to derive models (at least of critical aspects) of such large infrastructures. Manual model construction seems out of scope, as the modeling alone would often be many times more complex than modeling the core application. Moreover, it would entail a significant reengineering project.

III. SOLUTION IDEAS

Based on the problems identified in Section II we will now discuss some potential solution approaches.

Flexible Component Distribution: Performance analysis of systems with flexible component distribution requires the inclusion of adequate modeling primitives to support this. An approach that already heads in this direction is SimuLizar [1]. It supports the modeling of self-adaptation rules, e.g., for load balancing. However, the expressiveness of its rules is not sufficient to support all relevant distribution adaptations.

In order to improve the capabilities of performance modeling approaches, we assume it is necessary to significantly enhance the capabilities for describing runtime adaptation rules and further enhance analysis approaches so the adaptation effects are taken into account in the analysis.

Data-oriented Load Distribution: As discussed earlier, the key issue is that the modeling of component-oriented approaches relies on call-relations, while in Big Data systems, the key relation is the data flow. Hence, we assume a natural, but necessary, extension will be to extend the modeling approaches with explicit data-flow modeling primitives. However, we are not the first to propose this. Seifermann proposed such an extension for the Palladio approach [6]. His motivation was completely different, i.e., it focused on analyzing systems for privacy or SLA violations. We imagine that an appropriate extension for data-flow modeling could actually serve both purposes simultaneously.

Explicit Queuing Components: To the best of our knowledge, queuing components are not yet considered as predefined model elements on architecture level in existing approaches to architecture analysis. They may be constructed manually using formalisms like Layered Queuing Networks (LQNs) and Queuing Petri Nets (QPNs). However, to support an integrated performance analysis in a component-based paradigm, it would be necessary to integrate these capabilities into the underlying component models. We regard this as a difficult, but mandatory challenge for the performance analysis of Big Data systems.

Blackbox Infrastructure: Even if the above extensions were sufficient to model the characteristics of complex Big Data systems, the problem would remain that the underlying infrastructures would need to be modeled as well. Given existing performance-oriented reengineering approaches, this would require significant reengineering efforts that seem rather unrealistic. Hence, the challenge here is: how can we (semi-)automatically derive sufficient model information from such infrastructures? Once these infrastructures had been comprehensively modeled, the models could be reused as a single component or as a parameterizable pattern.

IV. CONCLUSION

In this paper, we provided an initial discussion of current shortcomings in model-driven performance engineering. While it was based on modeling Big Data systems with Palladio, we believe the experiences hold at least for the broader range of performance modeling of Big Data systems and some of them may hold for a much wider range of cases where complex off-the-shelf infrastructures are used in system development.

It is the goal of our ongoing research to address the identified shortcomings by augmenting modeling approaches and providing novel methods for model construction.

V. REFERENCES

[1] M. Becker, M. Luckey, and S. Becker. Performance analysis of self-adaptive systems for requirements validation at design-time. Quality of Software Architectures, pp. 43-52, ACM, 2013.
[2] R. Heinrich. Architectural run-time models for performance and privacy analysis in dynamic cloud applications. Performance Evaluation Review, 43(4):13-22, ACM, 2016.
[3] H. Koziolek. Performance evaluation of component-based software systems: A survey. Performance Evaluation, 67(8):634-658, Elsevier, 2010.
[4] J. Kroß, A. Brunnert, and H. Krcmar. Modeling Big Data Systems by Extending the Palladio Component Model. Softwaretechnik-Trends 35(3), GI, 2015.
[5] R. Reussner et al. (Eds.). Modeling and Simulating Software Architectures – The Palladio Approach. MIT Press, 2016. ISBN: 978-0-262-03476-0.
[6] S. Seifermann. Architectural data flow analysis. IEEE/IFIP Working Conference on Software Architecture, pp. 270-271, IEEE, 2016.
[7] C. Qin and H. Eichelberger. Impact-minimizing Runtime Switching of Distributed Stream Processing Algorithms. Big Data Processing - Reloaded, Joint Conference, CEUR-WS.org, 2016.