Experimenting with Complex Event Processing for Large Scale Internet Services Monitoring

Research Paper

Stephan Grell, Olivier Nano
Microsoft, Ritter Strasse 23, Aachen, 52072, Germany
Tel: +49 241 99784 533, Fax: +49 241 99784 77
{stgrell, onano}@microsoft.com

Abstract: In this paper we discuss an experiment in monitoring large scale internet services. The monitoring system is implemented as a Complex Event Processing system to enable better scaling of the infrastructure. We describe the application of the monitoring system in two scenarios and give a brief overview of the benefits of the approach and the remaining challenges. The discussion focuses on language expressiveness, debuggability, root cause analysis, and maintaining a stable system.

Keywords: CEP, debugging, load shedding

1 Introduction

Large scale internet services are a class of applications which impose strong requirements on their management infrastructure and on their execution environment. They are deployed in data center environments and run on a large number of machines (from hundreds to thousands). Due to the scale of these services, the management and monitoring system needs to scale accordingly. Monitoring of large scale internet services can be performed in many different ways depending on the expected monitoring results. This experiment is based on a Complex Event Processing (CEP) system which processes the events in near real-time as long as resource consumption stays within limits. We explore two monitoring scenarios: synthetic transactions generated by a watchdog and business events generated by the services themselves.

In today's monitoring setups there are many tradeoffs between scalability and the speed at which results are computed. Our goal is to evaluate the benefits of implementing, deploying and managing a monitoring system through a CEP system in a very large scale setup. The motivation for using a CEP system is its ability to do fast in-memory processing of events (filtering, grouping and aggregating), which enables real-time analysis. The following sections discuss early results and share some of the challenges which we believe need to be addressed to ease the adoption of CEP as a scalable monitoring system.

In the next section we detail the two scenarios for monitoring large scale internet services. In the third section we present the CEP system we used for our experiments. In the fourth section we describe challenges around the expressiveness of query languages. In the fifth section we discuss the challenges around debuggability and root cause analysis. In the sixth section we describe the challenges around environment stability and load shedding. At the end we conclude and present potential future work.

2 Monitoring large scale internet services

In this paper we explore two different scenarios for monitoring Service Level Agreements (SLAs) of large scale internet services. In the first scenario we monitor the services' availability through synthetic transactions. In the second we monitor business events generated by user interactions with the services.

2.1 Synthetic transactions monitoring

In the first scenario services are monitored via synthetic transactions generated by a watchdog invoking the services. The results of the synthetic transactions are made available via a Pub/Sub system to different listeners. One of them is an SLA monitor which aggregates all the results according to different rules depending on the state of the services. The SLA monitor publishes the results of the aggregations, which are emailed to the IT operators.
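To make this data flow concrete, the sketch below (in Python, with hypothetical names; neither the probing call nor the publish interface reflects the actual watchdog or Pub/Sub implementation) shows a watchdog running one test case against a service and publishing the result event for listeners such as the SLA monitor.

    # Illustrative sketch only: 'publish' and the topic name are assumptions,
    # not the interfaces of the system described in this paper.
    import time
    import urllib.request
    from dataclasses import dataclass, asdict

    @dataclass
    class TestResult:
        test_case: str       # e.g. "login-service/ping"
        timestamp: float     # time the synthetic transaction was run
        success: bool
        latency_ms: float
        exception: str = ""  # last exception text, if any

    def run_test_case(name: str, url: str) -> TestResult:
        """Invoke the service once and capture the outcome."""
        start = time.time()
        try:
            urllib.request.urlopen(url, timeout=5)
            return TestResult(name, start, True, (time.time() - start) * 1000)
        except Exception as exc:
            return TestResult(name, start, False, (time.time() - start) * 1000, str(exc))

    def publish(topic: str, event: dict) -> None:
        """Placeholder for the Pub/Sub publish call."""
        print(topic, event)

    # The watchdog runs each test case at a configurable interval and publishes
    # every result so that listeners such as the SLA monitor can aggregate them.
    result = run_test_case("login-service/ping", "http://login.example.test/ping")
    publish("monitoring/test-results", asdict(result))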
The motivation for synthetic transactions is to have a high degree of control over the load on the monitoring system. Synthetic transactions are scheduled at regular intervals and the frequency can be adjusted. For this scenario, a single machine is sufficient to execute the SLA monitoring because the volume of synthetic transaction results is low enough to be transferred to the central SLA monitor. We used a CEP system for this scenario because of its flexibility in updating the standing queries based on new demands as well as the ease of adding new queries to monitor new services.

Fig. 1: The communication between the services involved

The SLA monitor uses the following set of rules to aggregate the results of the synthetic transactions. Every service has its own set of tests (test cases) which are combined into a test module. Every test module has one SLA associated with it. The state of the test module is determined by the worst state of its test cases. A test case can be in one of three states: up, low, or down. The state is computed from the success percentage of the test runs: above 95% the state is up, between 5% and 95% the state is low, and below 5% the state is down. The SLA monitor computes the state of a test module every 10 minutes. If the test module undergoes a state change, a state change event is published. Otherwise the test module stays in its current state and state stay events are published every 30 minutes for down, every 60 minutes for low, and every 6 hours for up. Every published event contains the entire list of the test case results for the test module at that time, with the aggregated values over the reporting period.

In addition to monitoring the test module states, the SLA monitor also monitors its own infrastructure. It raises an alarm if it has not received any test events for a period of 5 minutes. Receiving no events for 5 minutes indicates that either the Pub/Sub system or the watchdog is down.

To support the debugging of the services, the exceptions that the watchdog records in case of failures are reported with the state of the test case. Every test case report contains the latest exception. We decided against a list of all exceptions because that list might get long over the course of up to 6 hours of monitoring.

This scenario brings the requirement of maintaining a complex standing query in an environment with limited human access. The writing was complicated enough that multiple iterations were needed to get the state model right and to understand the issues. To get this scenario right, we need a way to understand what the CEP system is doing from log files and to be able to analyze them efficiently remotely.
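For illustration, the state rules above can be restated as plain code. The following sketch (in Python; the names and structure are ours, it is not the actual standing query) computes the test module state from the success percentages and decides whether a state change or state stay event is due.

    # Illustrative restatement of the SLA state rules described above;
    # this is not the standing query used in the monitoring system.

    STATE_STAY_INTERVAL_MIN = {"down": 30, "low": 60, "up": 6 * 60}

    def test_case_state(success_percentage: float) -> str:
        """Map a test case's success percentage onto up / low / down."""
        if success_percentage > 95.0:
            return "up"
        if success_percentage >= 5.0:
            return "low"
        return "down"

    def test_module_state(success_percentages) -> str:
        """A test module takes the worst state of its test cases."""
        order = {"up": 0, "low": 1, "down": 2}
        states = [test_case_state(p) for p in success_percentages]
        return max(states, key=lambda s: order[s])

    def event_to_publish(previous_state: str, new_state: str,
                         minutes_since_last_report: float):
        """Evaluated every 10 minutes: publish a state change event on a
        transition, otherwise a state stay event at the state's interval."""
        if new_state != previous_state:
            return "state-change"
        if minutes_since_last_report >= STATE_STAY_INTERVAL_MIN[new_state]:
            return "state-stay"
        return None

    # Example: two test cases at 99% and 80% success -> module state is "low".
    state = test_module_state([99.0, 80.0])
    print(state, event_to_publish("up", state, 10))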
2.2 Business events monitoring

In this second scenario the requirement is to monitor business events generated by the services. Business events are events which represent transactions happening in the service logic, as opposed to lower level events (like heartbeats, disk operations, etc.). Examples of the generated business events are login/logout of users, search operation triggers, etc. As these events are user driven, they are generated as users hit the services.

Because of the events' frequency and the number of machines running the services in the cluster, it is neither feasible nor performant to move all the generated events to a central machine for processing. For this scenario we have set up a distributed version of our SLA monitoring system. Every machine that is running an instance of a service is also running an instance of our SLA monitoring system, which receives and aggregates the business events generated by the local service over short periods of time (around 10 minutes). A central instance of the monitoring system then receives, through the Pub/Sub, all the aggregated events and performs a per-user aggregation over the events from all services. The final aggregation results are made available through the Pub/Sub for other services, such as billing or user satisfaction monitoring.

This scenario puts requirements on the resource usage of the SLA monitoring infrastructure. As the SLA monitors run on the same machines as the services themselves, we need to guarantee that they will execute within certain resource boundaries (such as CPU, memory, and network bandwidth).

This scenario also brings requirements on the quality of the final aggregated data. Depending on the technique used to maintain the service utilization, certain business events can be discarded to ensure the overall performance of the SLA monitoring system. However, because these business events and their aggregates are used by other systems such as a billing system, it is important to ensure that the effect of the discarded data on the overall results is controlled.

Fig. 2: The distributed cloud service scenario with the user interactions

3 The SLA monitoring system

The SLA monitoring system (1) we implemented for the experiments is generic and comparable in terms of operators and queries to other CEP systems (2) (3). It consists of the following parts (also shown in Fig. 3):
- SLA language for writing SLA documents (similar to standing queries)
- SLA editor to create the SLA documents as box diagrams
- SLA runtime to monitor an SLA document, single node or distributed
- SLA dashboard for monitoring the service performance against the SLA and for analyzing SLA runtime log files.

The language contains four main building blocks: Input Adapters, Probes, Computations, and Audits/Violations. Probes take the data from the Input Adapters and extract it into the internal data format. They also filter the incoming data, so that only the information needed is extracted. Computations contain a set of operators. The operators work on streams or on the output of other operators. Computations get their inputs from Probes or other Computations. Audits and Violations define data validations and trigger an output event when data is out of range. Data computed by Audits and Violations is published on the Pub/Sub system and made available for other systems to consume.

In addition, the language contains supporting concepts for time based triggering, logging, and assigning the elements to specific runtimes. The assignment of elements to the runtimes can be updated for some elements during runtime. The elements that reside in different runtimes are connected by streams which span machines.
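As an illustration of how these building blocks compose, the sketch below (in Python; the classes and method names are our own simplification, not the SLA language itself) wires a probe, a computation, and an audit into a small pipeline over data delivered by an input adapter.

    # Illustrative composition of the building blocks; the real SLA language
    # is a declarative document, not this Python API.
    from typing import Callable, Iterable

    class Probe:
        """Extracts and filters raw adapter data into the internal format."""
        def __init__(self, extract: Callable[[dict], dict], keep: Callable[[dict], bool]):
            self.extract, self.keep = extract, keep
        def process(self, raw_events: Iterable[dict]):
            return (self.extract(e) for e in raw_events if self.keep(e))

    class Computation:
        """Applies an operator (e.g. an aggregation) to a stream."""
        def __init__(self, operator: Callable[[Iterable[dict]], dict]):
            self.operator = operator
        def process(self, events: Iterable[dict]) -> dict:
            return self.operator(events)

    class Audit:
        """Validates computed data and emits a violation event if out of range."""
        def __init__(self, predicate: Callable[[dict], bool]):
            self.predicate = predicate
        def check(self, result: dict):
            return None if self.predicate(result) else {"violation": result}

    def ok_ratio(events: Iterable[dict]) -> dict:
        lst = list(events)
        return {"ok_ratio": sum(1 for e in lst if e["ok"]) / max(len(lst), 1)}

    # Wire the pipeline: adapter data -> probe -> computation -> audit.
    raw = [{"kind": "test", "service": "login", "ok": True},
           {"kind": "noise"},
           {"kind": "test", "service": "login", "ok": False}]
    probe = Probe(extract=lambda e: {"ok": e["ok"]}, keep=lambda e: e.get("kind") == "test")
    availability = Computation(ok_ratio)
    audit = Audit(lambda r: r["ok_ratio"] >= 0.95)
    print(audit.check(availability.process(probe.process(raw))))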
Fig. 3: Overview of the SLA monitoring system with the different tools and the main concepts inside the language

4 Language expressiveness

SLA monitoring brings interesting challenges when implemented as a CEP system. Certain SLA requirements are difficult to express as queries and would benefit, for example, from support for automata-style descriptions. Next, we list some examples of issues in expressing SLAs with standard query languages.

SLA monitoring usually works with fixed windows, as the state of a system is usually defined over a fixed period of time. This means that hopping windows need to be simulated on top of the sliding window semantics of the underlying CEP system.

An SLA monitoring system needs to determine the state of services and send out information about it. In our experiments state changes are of high importance, and the most expressive way to implement them was to build a state machine with standing queries. In our first scenario described in section 2, we built three sub-queries for the analysis of the current state and nine additional sub-queries for the state transitions. The nine sub-queries include a state stability transition.

Timing the publication of violations and audit reports is also challenging. Regular audit reports need to be published at precise times of the day, while violation reports depend on the state of the monitored service and their frequency depends on past violation reports.

For the use of CEP in an SLA monitoring system we believe that adding domain specific constructs that provide SLA patterns would ease the description of SLAs. These patterns should simplify the pattern recognition inside streams, the handling of time in the processing, as well as the state computation.
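To illustrate the fixed-window issue mentioned above, the following sketch (in Python; an assumed simplification, not the query language of the underlying CEP system) shows the kind of non-overlapping, hopping window aggregation SLA monitoring needs, grouping test results into fixed 10-minute buckets.

    # Illustrative hopping (fixed, non-overlapping) window aggregation; in the
    # CEP system this has to be simulated on top of sliding window semantics.
    from collections import defaultdict

    WINDOW_SECONDS = 600  # the 10-minute evaluation period used by the SLA monitor

    def hopping_window_success_rate(events):
        """events: iterable of (timestamp_seconds, success_bool).
        Returns {window_start: success_percentage} per 10-minute bucket."""
        totals = defaultdict(lambda: [0, 0])  # window_start -> [successes, runs]
        for timestamp, success in events:
            window_start = int(timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
            totals[window_start][0] += int(success)
            totals[window_start][1] += 1
        return {start: 100.0 * ok / runs for start, (ok, runs) in totals.items()}

    # Example: three test runs falling into two consecutive windows.
    print(hopping_window_success_rate([(0, True), (30, False), (700, True)]))
    # -> {0: 50.0, 600: 100.0}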
5 Offline / root cause analysis

The development of the different standing queries showed that it is easy to introduce errors in standing queries. The generation of traces and the analysis of past events help not only in debugging the standing query during the development phase but also later in production when an unexpected issue occurs.

The analysis of past events has been a research topic for some time now. It is labeled as "Time travel" or "Replay" (1). The requirements of our environment exceed those of the Borealis project. Our monitoring system runs on servers which cannot be directly accessed. All we can do is access log files to understand what the monitoring service has been doing in the past. Another challenge is the limit on the file size. The monitoring system cannot log every incoming event package that it has seen since the beginning of time. And lastly, the obvious challenge of limiting the performance impact on the processing is of high importance.

In addition to our internal motivation as the developers of the standing queries for the monitoring system, we have an additional request from our "users" to be able to drill down into interesting violation events from the standing query. The tools and requirements for this functionality are very similar to those for debugging. The drill down case is a bit more limited, as the system only needs to hold the events and the internal processing steps which resulted in a violation. All the other events are considered noise which distracts from the violation analysis.

To anticipate these requirements, the monitoring runtime includes a logging framework that allows logging every result of a "block" in the standing query. Besides the time, a unique identifier, and the data fields, these results also contain a list of the events/results which were consumed to produce them. The log messages are made available via a set of adapters, including a file writer as well as a network connector. The network connector allows a tool to connect directly to the log messages and display the processing progress. The file writer stores all messages in a rotating log file on disk. The file has a maximum size associated with it.

A UI tool allows us to display the log messages together with the standing query. The display enables the user to replay all log messages for the processing or only the ones for a certain sub-path inside the standing query. In addition, one can also look at a single processing block and see all of its log messages together with its definition.

The work so far allows us to debug a running standing query, and we can understand what happened inside the monitoring runtime in case of a crash. One could also use this data for recovery in case of a failure. However, this does not satisfy all the demands we started out with. The tool does not fully support the drill down requirements. The log framework does not allow selecting which data can be purged from the log file so that only the data of the violation stays. If the log file size is big enough and the drill down request is made quickly after the violation is issued, one has a good chance that the data that led to the violation is still available. The current approach also has a big impact on the overall performance when logging is enabled.

A second mode that supports the drill down but not the debugging has been considered but not yet experimented with. One could store the history, similar to the Borealis approach, in memory and only write the log file when a violation happens. This would result in less frequent interruptions, and the writing could be scheduled such that it does not interfere with the processing.

After a couple of sessions of debugging and analyzing the log files, the desire arose for a CEP debugging environment similar to that of programming languages. This environment should be capable of working with recorded traces from a test run as well as simulating the execution of the standing queries with a range of inputs. The simulation environment would be helpful as a test environment for the written standing query and would allow testing the corner cases and the different paths.
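As a sketch of the linkage recorded by the logging framework described above (the field names here are illustrative assumptions, not the runtime's actual log format), each block result carries the identifiers of the inputs it consumed, so a violation can be traced back to the events that produced it:

    # Illustrative shape of a lineage-carrying log record and a drill-down walk;
    # field names are assumptions, not the actual log format of the runtime.
    import time, uuid

    def block_result(block: str, data: dict, consumed_ids: list) -> dict:
        """Log record written for every result produced by a block."""
        return {
            "id": str(uuid.uuid4()),   # unique identifier of this result
            "time": time.time(),
            "block": block,            # which block of the standing query produced it
            "data": data,
            "consumed": consumed_ids,  # ids of events/results this result was built from
        }

    def drill_down(violation: dict, log_index: dict) -> list:
        """Follow the 'consumed' links from a violation back to the raw events."""
        trail, pending = [], list(violation["consumed"])
        while pending:
            record = log_index.get(pending.pop())
            if record is not None:
                trail.append(record)
                pending.extend(record["consumed"])
        return trail

    # Example: probe result -> aggregation result -> violation.
    probe = block_result("probe", {"ok": False}, consumed_ids=[])
    aggregate = block_result("availability", {"ok_ratio": 0.5}, [probe["id"]])
    violation = block_result("audit", {"violated": True}, [aggregate["id"]])
    index = {r["id"]: r for r in (probe, aggregate)}
    print([r["block"] for r in drill_down(violation, index)])  # ['availability', 'probe']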
6 Reliable Infrastructure

Large scale internet services require by definition a very stable execution environment, and as they scale the management infrastructure needs to scale as well. In the second scenario that we presented in section 2, the requirement is to monitor business events generated by the services. The number of events generated by the services is high and there are many service instances. Therefore a first aggregation of the business events needs to be done on the machines running the service instances. Because the number of business events generated depends on user actions on the services, a higher service load results in a higher event count, which in turn impacts the computation of the first level of aggregation on the services' machines.

This creates a potential problem for the stability of the services' execution environments. Given the very dynamic nature of the services and of the aggregations executing under heavy load, it is very difficult to understand under which exact conditions the system is at risk. We are experimenting with a combination of two techniques to limit the risk of system overload and system crash during execution. The first technique applies static capacity management to the standing queries to understand what is potentially executable without too much risk. The second technique is event shedding, which enables the aggregations to empty their pipelines in overload situations. Our goal with these techniques is to ensure during execution that the SLA monitoring elements deployed on the service machines never exceed 30% of CPU and memory consumption.

For the capacity management our main interest is to understand statically the memory and CPU consumption of queries based on event input rates. In the light of work from (1) (2), we have gathered statistics over time for our operators depending on their semantics (selection, aggregation, projection) and built a model of their composition. This enables the SLA runtime to evaluate the burst potential of a given query depending on input rates. This technique gives a rough estimate of the potential burst and does not take into account external factors from the environment.

Because capacity management only gives a rough estimate of the potential burst and does not prevent overload situations, the SLA monitoring system is complemented by an event shedding gate (7) (9) to stay within the 30% CPU and memory consumption range. Comparable to work from (3), (4) and (5), we introduce a drop operator as an entry gate (and only as the first operator). This operator uses random event shedding, which brings maximum efficiency.

The challenge when using static analysis for capacity management of the queries is to evaluate the roughness of the estimates. Because many external factors influence query execution at runtime, it would be interesting to refine the query analysis during execution.

One of the main challenges of applying event shedding is understanding the quality impact (8) of the event loss on the final aggregated results, as well as determining whether a delayed execution (6) could save quality without destabilizing the system. From a quality perspective, it is difficult to predict the impact of the removal of low level events. It is also a fair assumption that the quality impact very much depends on the usage of the output events of the system. Therefore, it would be interesting to investigate special purpose shedding techniques which depend on the SLA.

Another area in which we would need to invest more is understanding how delayed execution of operators can help in overload scenarios. We believe that a scheduler that takes the current load on the machine into account, together with knowledge of how much CPU time an operator will consume, can make a smarter decision between delayed execution and load shedding.
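To make the shedding gate concrete, here is a minimal sketch (in Python; the drop-probability policy and the budget check are simplified assumptions, not our runtime's implementation) of a random drop operator placed as the first operator of a query:

    # Illustrative random-shedding entry gate; the drop probability would in
    # practice be driven by the capacity model and measured CPU/memory usage.
    import random

    class SheddingGate:
        """First (and only first) operator of a query: randomly drops events
        when the monitoring runtime approaches its resource budget (e.g. 30%)."""
        def __init__(self, budget_fraction: float = 0.30):
            self.budget_fraction = budget_fraction
            self.drop_probability = 0.0

        def update(self, current_utilization: float) -> None:
            """Raise the drop probability as utilization approaches the budget."""
            overload = max(0.0, current_utilization / self.budget_fraction - 0.8)
            self.drop_probability = min(1.0, overload)

        def admit(self, event: dict) -> bool:
            """Return True if the event should enter the query pipeline."""
            return random.random() >= self.drop_probability

    gate = SheddingGate()
    gate.update(current_utilization=0.27)   # monitoring process at 27% CPU
    events = [{"user": i % 3, "op": "login"} for i in range(1000)]
    admitted = [e for e in events if gate.admit(e)]
    print(gate.drop_probability, len(admitted))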
7 Conclusions

In this paper we discussed the use of Complex Event Processing for the management of large scale internet services. We presented two scenarios focused on SLA monitoring, emphasizing the importance of language expressiveness, root cause analysis, and environment stability. These issues are from our perspective the most critical ones and can potentially prevent a safe deployment of such a monitoring system in large scale internet services. We also described our SLA monitoring solution, which allows distributing and scaling the processing of events inside the management infrastructure.

Our experiments showed that the usage of Complex Event Processing nicely separates the implementation details from the semantics of the SLA. CEP also includes good support for scalability inside the infrastructure to anticipate future growth. It aids in the removal of the network bottlenecks created by the Pub/Sub layer by moving computation closer to the events' sources.

However, as we have underlined, there are still many challenges in deploying such a system for large scale internet services. Better expressiveness of the languages for writing queries would help the developers and IT operators. Debugging queries and enabling root cause analysis is an important requirement for fine tuning service monitoring and understanding service failures. Capacity management of the queries is also an important requirement when deploying the queries to understand potential bottlenecks and availability problems. In addition, it acts as a safeguard to ensure that the monitoring services do not draw too many resources away from the services.

We have detailed a few exploration paths which are promising. For the debugging and root cause analysis we have experimented with a linkage mechanism inside our logging framework, supported by analysis tools. For the capacity management and scalability we are exploring flood gate mechanisms as well as mathematically formulating the query operators' memory and CPU requirements. Our work on simplifying the expressiveness of the standing queries has only been identified as a subject of promising future work. More work is still needed to fully answer these requirements. As future work we are investigating a less intrusive debugging approach and an improvement of the automatic scalability.

8 Acknowledgement

The work reported in this paper has been done under the umbrella of the project SeCSE funded by the European Commission under contract FP6-IST-511680.

9 Works Cited

1. O. Nano, M. Gilbert. SLA Monitoring: Shifting the Trust. eChallenges 2005, Ljubljana. IOS Press, Amsterdam, 2005.
2. Borealis. Brown University. http://www.cs.brown.edu/research/borealis/public/ [Online].
3. Stream. Stanford University. http://infolab.stanford.edu/stream/ [Online].
4. D. J. Abadi, et al. The Design of the Borealis Stream Processing Engine. Proceedings of the 2005 CIDR Conference, 2005.
5. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, R. Varma. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR, 2003.
6. B. Babcock, S. Babu, M. Datar, R. Motwani. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. SIGMOD, 2003.
7. B. Babcock, M. Datar, R. Motwani. Load Shedding for Aggregation Queries over Data Streams. ICDE, IEEE, 2004.
8. A. Das, J. Gehrke, M. Riedewald. Approximate Join Processing Over Data Streams. SIGMOD, ACM, 2003.
9. N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, M. Stonebraker. Load Shedding in a Data Stream Manager. VLDB, Berlin, 2003.