Using Component Interaction Model and Network Traces for Root-cause Analysis*

Atul Kumar
IBM Research
Manyata Embassy Business Park, Nagwara, Outer Ring Road, Bangalore, India
kumar.atul@in.ibm.com

Anil Nair
Toshiba Software India Pvt Ltd
Fortune Summit, 6th Sector, HSR Layout, Hosur Main Road, Bangalore, India
anil.nair@toshiba-tsip.com

* This work was done when the authors were with ABB Corporate Research.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
2nd Modelling Symposium (ModSym 2016) - colocated with ISEC 2016, Goa, India, Feb 18, 2016

ABSTRACT
Root-cause analysis after a system failure or error is an important activity for determining the exact reasons for the failure or error. Most of the time, these error conditions cannot be reproduced, or it is not feasible to run the system again using the exact same scenario. Therefore, an execution trace log of the various functions/components, recorded during the event, is essential for root-cause analysis and debugging in a complex system. Source code level instrumentation for dynamic analysis provides an accurate execution trace log, but it is difficult to use an instrumented system in production environments because of performance and system stability issues. In a distributed system, intercepted network messages can be analyzed to identify interactions between the various components of the system. However, messages captured on the network alone do not provide complete information, because messages between components on the same host do not appear on the network. We present a new idea to construct interaction information among components of a distributed application using messages captured on the network and an interaction model, which is a set of rules and heuristics about component interaction. The interaction model is pre-built offline using profile information and the static control flow graph of the system. Profiling is done with test data in a non-production environment, such as a test environment, using 'close-to-real' test scenarios. Messages corresponding to component interactions are captured on the network to create a partial execution trace log. The trace log is then completed using the pre-built interaction model.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.2.c [Software Engineering]: Distributed/Internet based software engineering tools and techniques

General Terms
Component Interactions, model based analysis

Keywords
Component Interactions Model, Dynamic Analysis, Root-cause analysis, Network Packet Filtering and Analysis

1. INTRODUCTION
Execution trace log data¹ is useful for debugging and root-cause analysis because it identifies the sequence of operations that led to a system failure or error. Component interaction and information flow are also used to identify performance bottlenecks in a complex system. These logs are also useful for finding a better deployment plan for distributed components on different hosts by looking at inter-host component interaction patterns. Additionally, they can help in optimizing and prioritizing test cases.

Such logs can be obtained using dynamic analysis [7]. The original source code is instrumented by inserting instrumentation code that logs some information at desired points (normally at the beginning and the end of a function). The application is then executed and the output of the instrumentation code is recorded. However, it may not be possible to use instrumented applications in a production environment. Instrumentation code causes overhead that may not be acceptable in a production environment for performance reasons. Controller hosts in automation systems have limited logging and tracing capabilities because embedded devices are normally resource constrained (memory and/or CPU time) and may not bear this overhead. Using instrumented code as created by coverage tools or debug tools slows the execution of programs to a degree that makes these tools unusable in production environments or even in complex system test environments. Moreover, it may not be reasonably safe to execute instrumented code in live production environments, because logging of system activities may cause exceptions that destabilize an otherwise stable system.

In a large system, the components of an application are often distributed over several hosts. Interactions between components of such applications take place via network communication. We propose an idea for a system where messages on the network are eavesdropped. Since not all interactions among the components of an application appear on the network (e.g., communication between components on the same host), the component interaction information captured on the network will be incomplete. To construct a full execution trace log from the partial component interaction information built using network messages, an interaction model is used. The interaction model is built beforehand using the control flow graph of the application, obtained from static analysis, and profile information collected in a test environment using the instrumented application and test scenarios.

¹ Temporal information about the start and the end of execution for functions/components in a program/system, and about which function/component execution follows or precedes which other function/component execution.

2. CONSTRUCTING EXECUTION LOG FROM NETWORK MESSAGES

2.1 Capturing Network Messages
Several tools exist to capture and analyze messages on networks. Normally, these tools require the host running them to be present on the same network on which messages need to be listened to. Intercepting network messages between different hosts in a distributed system using a separate computer has no impact on system performance. Tools such as Netmon [1], Microsoft Message Analyzer [2], Tcpdump/libpcap [4], WinDump/Winpcap [5], Wireshark [6], etc. make it easy to analyze network packets at various network layers, including application layers. Packets can be filtered for specific patterns. Applications developed using popular frameworks such as .NET, J2EE, CORBA, etc. have well-defined message formats for sending messages between components, and these messages are easy to identify automatically using packet filters.

2.2 Creating Execution log
The major steps in the process of constructing an execution log from network messages are as follows.

1. Instrument the source code using traditional methods and use comprehensive, near-real test scenarios to gather profile data. Use dynamic analysis to construct trace logs for all test scenarios.

2. Perform static analysis on the source code to generate control flow graphs of the system.

3. Use the above two to create an interaction model. This model is used to identify patterns and create heuristics about the various work-flows in the system. If some information is missing in an execution trace log, the model should be able to tell the most likely candidates for the missing places. When used with temporal information, this model should be enough to construct a complete trace log from the partial log.

4. In a production environment, capture network messages to identify component interactions. Create a partial execution trace log using this information.

5. Use the interaction model created in step 3 to generate the complete execution trace log.

[Figure 1: Process of Constructing Execution Trace Log. Test environment: the system with instrumented code is run on test scenario data, and dynamic analysis produces profile data. Static analysis of the source code produces a control flow graph. Interaction analysis combines both into the interaction model. Production environment: network trace analysis of the release system running on real data yields a partial component interaction log, from which the engine constructs the complete component interaction log.]

Figure 1 shows the process of constructing an execution log using network traces and previously built patterns and heuristics. The top left part of the diagram shows step 1 of the process described above. The output of dynamic analysis for each test scenario is used later to build the interaction model. The middle left part of the diagram shows the static analysis process that generates the control flow graph from source code. The top right part of the diagram shows the process of creating the interaction model. This model essentially holds a set of rules and heuristics for the various possible interactions among components of the system. The bottom left part of the diagram shows the process of creating a partial component execution trace log from network messages. This log is used by the engine shown in the bottom right part of the diagram to construct the complete execution trace log using rules and heuristics from the interaction model.

2.3 Interaction Model
Building a good interaction model is key to the success of this idea. Control flow graphs for the various modules/components of an application provide all possible interactions in the application. Interactions between modules/components can be obtained from the call graph, input data, and the system integration model. This is still not sufficient to capture the dynamic behavior of the application. For example, if there are n possible paths an execution sequence can take from a particular point, then temporal information can reduce that number to k (where k << n). The system may follow a different execution sequence at start-up, at user input time, and at time of I/O. With a good set of test data, a dynamic interaction model can be built that covers the most common usage scenarios. Combining the static and dynamic interaction models can reduce the total number of possible execution paths. Heuristics built around execution behavior and input values are then used to select the most likely path.

2.4 Assumptions
This approach makes the following assumptions.

• Network messages are not encrypted.

• Systems under consideration are distributed systems where a significant share of the interaction among components passes over the network.

• Sufficient test scenario data is available that is close to real usage scenarios.

• The interaction model is rebuilt after any change in the system.

3. RELATED WORK
Root-cause analysis in distributed systems is a well-studied subject. A recent work on run-time root-cause analysis in distributed systems is presented in [10]. This work addresses the problem of deriving relationships for fault correlation in adaptive distributed systems, where components are dynamically installed, updated, and removed, and presents a state chart-based solution that tries to identify the sequence of method execution.

An approach that combines model-driven techniques with runtime models to perform root-cause analysis of executing systems is presented in [11]. The approach combines the advantages of model-driven development with reusable software artifacts. Interactive visualizations enable efficient tracing of log file entries and corresponding model artifacts during runtime.

Some tips and tricks to help troubleshooters extract root-cause information from network traces are provided in [3]. The objective is to reduce what might be hundreds of gigabytes of data to the essential events that show the root cause of a problem. The focus is to use network traces and then narrow the fault down to a box.

An execution environment for Java programs is presented in [8] that improves execution performance by using both on-line and off-line profile information to guide dynamic optimization. A dynamic compilation system based on JikesRVM was developed that makes use of both.

A model-based diagnosis approach is discussed in [9] that discovers faults based on generic fault models and abstract event traces. These events may be associated with multiple system components. Fault models for individual components are not assumed to be available; generic fault models of classes of faults are used instead.

Our proposed idea differs from the above works because, in our approach, we construct a trace log very close to the one obtained from dynamic analysis of an instrumented application without actually instrumenting it. We rely on the maturity of a pre-built model, but this needs to be validated by building a prototype tool based on our idea and then comparing the trace log generated by the tool with complete data captured from the instrumented application.

4. CONCLUSIONS AND FUTURE WORK
We presented an idea to construct a component interaction trace log for the components of a distributed application in live production environments. An interaction model is first built offline by generating profile data in a test environment. Then, in a live production system, a partial component interaction trace log is created from network messages eavesdropped from a separate host on the same network. Finally, a complete execution trace log is constructed by an engine using the partial log and the interaction model.

We plan to start a short project on this idea. Its purpose is to validate the hypothesis presented in this paper that the execution trace log can be constructed by capturing only network messages in a live production system (the other information required is collected offline). In particular, we would focus on answering the following questions.

• Do enough component interactions take place over the network (between hosts) in a real distributed application? If yes, then how much is 'enough'? Can we use fewer messages than can be captured on the network, to reduce the size of the network log?

• Can we develop heuristics that help in recreating the complete profile from network messages and models that were built offline?

• Does the log data provide enough information to isolate heisenbugs in software that are otherwise non-repeatable?

5. REFERENCES
[1] How to use network monitor to capture network traffic. http://support.microsoft.com/kb/812953.
[2] Microsoft message analyzer operating guide. http://technet.microsoft.com/en-us/library/jj649776.aspx.
[3] Network trace analysis strategies. http://www.advance7.com/wp-content/uploads/2012/11/Network-Trace-Analysis-Strategies-Whitepaper.pdf.
[4] Tcpdump/libpcap. http://www.tcpdump.org/.
[5] Winpcap. http://www.winpcap.org/.
[6] Wireshark. http://www.wireshark.org/.
[7] T. Ball. The concept of dynamic analysis. In ACM SIGSOFT international symposium on Foundations of software engineering, pages 216–234. ACM SIGSOFT Software Engineering Notes, November 1999.
[8] C. Krintz. Coupling on-line and off-line profile information to improve program performance. In international symposium on Code generation and optimization, pages 69–78. IEEE Computer Society, March 2003.
[9] W. Mayer, X. Pucel, and M. Stumptner. Diagnosing component interaction errors from abstract event traces. In 23rd Australasian Joint Conference on Advances in Artificial Intelligence, pages 496–505. Lecture Notes in Computer Science Volume 6464, December 2010.
[10] A. Raj, S. Barrett, and S. Clarke. Run-time root cause analysis in adaptive distributed systems. In On the Move to Meaningful Internet Systems: OTM 2013 Workshops, pages 292–301. Lecture Notes in Computer Science Volume 8186, September 2013.
[11] M. Szvetits and U. Zdun. Enhancing root cause analysis with runtime models and interactive visualizations. In 8th International Workshop on Models at run.time, September 2013.
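
As an illustration of steps 3 to 5 of Section 2.2, the following minimal Python sketch builds a simple interaction model from complete traces profiled offline and uses it to fill the gaps in a partial log of network-observed messages. The paper does not prescribe a model representation or an engine, so everything concrete here is an assumption made for illustration: the function names, the component and host names, the first-order transition-probability model, and the use of a most-likely-path search restricted to same-host hops (hops a network capture could not have seen).

```python
import heapq
import math
from collections import Counter, defaultdict


def build_interaction_model(profiled_traces):
    """Estimate transition probabilities between components from complete
    traces recorded offline with an instrumented build (test environment)."""
    counts = defaultdict(Counter)
    for trace in profiled_traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    model = {}
    for src, successors in counts.items():
        total = sum(successors.values())
        model[src] = {dst: n / total for dst, n in successors.items()}
    return model


def most_likely_path(model, start, goal, host_of):
    """Highest-probability sequence from start to goal that uses only
    same-host hops (Dijkstra over -log(probability)); None if none exists."""
    if start == goal:
        return [start]
    queue = [(0.0, [start])]
    settled = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == goal:
            return path
        if node in settled:
            continue
        settled.add(node)
        for nxt, prob in model.get(node, {}).items():
            if host_of[nxt] == host_of[node]:
                heapq.heappush(queue, (cost - math.log(prob), path + [nxt]))
    return None


def complete_trace(model, messages, host_of):
    """Expand observed (sender, receiver) messages into a component
    sequence, filling gaps with the most likely hidden interactions."""
    full = []
    for sender, receiver in messages:
        if not full:
            full.append(sender)
        elif full[-1] != sender:
            gap = most_likely_path(model, full[-1], sender, host_of)
            # If the model knows no hidden path, just record the sender.
            full.extend(gap[1:] if gap else [sender])
        full.append(receiver)
    return full


# Hypothetical example: Auth, Logger, and Cache share a host, so their
# interactions never appear on the network.
hosts = {"UI": "hostA", "Auth": "hostB", "Logger": "hostB",
         "Cache": "hostB", "DB": "hostC"}
profiled = [
    ["UI", "Auth", "Cache", "DB"],
    ["UI", "Auth", "Cache", "DB"],
    ["UI", "Auth", "Logger", "Cache", "DB"],
]
observed = [("UI", "Auth"), ("Cache", "DB")]  # partial log from the network

model = build_interaction_model(profiled)
print(complete_trace(model, observed, hosts))
```

In this example the direct hop from Auth to Cache is the more frequent one in the profile, so the engine inserts no Logger entry between the two observed messages. A fuller treatment would also use the temporal information and control flow graph described in Section 2.3 to prune candidate paths, which this sketch omits.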