Using Component Interaction Model and Network Traces
                 for Root-cause Analysis*

Atul Kumar
IBM Research
Manyata Embassy Business Park, Nagwara,
Outer Ring Road, Bangalore, India
kumar.atul@in.ibm.com

Anil Nair
Toshiba Software India Pvt Ltd
Fortune Summit, 6th Sector, HSR Layout,
Hosur Main Road, Bangalore, India
anil.nair@toshiba-tsip.com

ABSTRACT
Root-cause analysis after a system failure or error is an important activity for determining the exact reasons for the failure. Most of the time, these error conditions cannot be reproduced, or it is not feasible to run the system again using the exact same scenario. Therefore, an execution trace log of the various functions/components, recorded during the event, is essential for root-cause analysis and debugging in a complex system. Source-code-level instrumentation for dynamic analysis provides an accurate execution trace log, but it is difficult to use an instrumented system in production environments because of performance and system stability issues. In a distributed system, intercepted network messages can be analyzed to identify interactions between the various components of the system. However, messages captured on the network alone do not provide complete information, because messages between components on the same host do not appear on the network. We present a new idea for constructing interaction information among the components of a distributed application using messages captured on the network and an interaction model, which is a set of rules and heuristics about component interaction. The interaction model is pre-built offline using profile information and the static control flow graph of the system. Profiling is done with test data in a non-production environment, such as a test environment, using 'close-to-real' test scenarios. Messages corresponding to component interactions are captured on the network to create a partial execution trace log. The trace log is then completed using the pre-built interaction model.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.2.c [Software Engineering]: Distributed/Internet based software engineering tools and techniques

General Terms
Component Interactions, model based analysis

Keywords
Component Interactions Model, Dynamic Analysis, Root-cause analysis, Network Packet Filtering and Analysis

*This work was done when authors were with ABB Corporate Research

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1.     INTRODUCTION
   Execution trace log data¹ is useful for debugging and root-cause analysis because it identifies the sequence of operations that led to a system failure or error. Component interaction and information flow are also used to identify performance bottlenecks in a complex system. These logs are also useful for finding a better deployment plan for distributed components on different hosts by looking at inter-host component interaction patterns. Additionally, they can help in optimizing and prioritizing test cases.

   ¹ Temporal information about the start of execution and the end of execution for functions/components in a program/system, and also which function/component execution followed or preceded which other function/component execution.

   Such logs can be obtained using dynamic analysis [7]. The original source code is instrumented by inserting code that logs information at desired points (normally at the beginning and the end of a function). The application is then executed, and the output of the instrumentation code is recorded. However, it may not be possible to use instrumented applications in a production environment. Instrumentation code causes overhead that may not be acceptable in a production environment for performance reasons. Controller hosts in automation systems have limited logging and tracing capabilities because embedded devices are normally resource constrained (memory and/or CPU time) and may not bear this overhead. Instrumented code, as created by coverage tools or debug tools, slows the execution of programs to a degree that makes these tools unusable in production environments or even in complex system test environments. Moreover, it may not be reasonably safe to execute instrumented code in live production environments, because logging of system activities may cause exceptions that may destabilize an otherwise stable system.
   In a large system, the components of an application are often distributed over several hosts. Interactions between components of such applications take place via network communication. We propose an idea for a system in which messages on the network are eavesdropped. Since not all interactions among the components of an application appear on the network (e.g., communication between components on the same


                            2nd Modelling Symposium (ModSym 2016) - colocated with ISEC 2016, Goa, India, Feb 18, 2016
[Figure 1 placeholder. Top left: in the test environment, the system with instrumented code is run on test scenario data (dynamic analysis, run test scenarios) to produce profile data, drawn as a timeline over Component1 to Component7. Middle left: static analysis of the source code produces a control flow graph. Top right: analysis of the profile data and the control flow graph yields the interaction model. Bottom left: in the production environment, network trace analysis of the release system running on real data produces a partial component interaction log. Bottom right: an engine combines the partial log with the interaction model to produce the constructed complete component interaction log.]

Figure 1: Process of Constructing Execution Trace Log
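
As a rough illustration of the flow in Figure 1, the following Python sketch builds a toy interaction model from complete profiling traces and then uses it to fill the gaps in a partial trace. Everything in it is an assumption made for illustration: the paper does not prescribe a model representation, the names `build_model` and `complete_trace` are hypothetical, and a trace is simplified to a list of component names together with a set of components whose activity is visible on the network. The only "heuristic" used is to pick the hidden sequence seen most often between two network-visible components during profiling.

```python
from collections import Counter, defaultdict


def build_model(full_traces, visible):
    """Learn, for each pair of consecutive network-visible components,
    the hidden (same-host) sequence most often seen between them.

    full_traces: list of complete traces from the instrumented test system,
                 each a list of component names in execution order.
    visible:     set of components whose activity shows up on the network
                 (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    for trace in full_traces:
        prev, hidden = None, []
        for comp in trace:
            if comp in visible:
                if prev is not None:
                    counts[(prev, comp)][tuple(hidden)] += 1
                prev, hidden = comp, []
            elif prev is not None:
                hidden.append(comp)
    # Keep only the most frequent hidden sequence per pair.
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


def complete_trace(partial, model):
    """Insert the most likely hidden components between consecutive
    entries of a partial (network-only) trace."""
    if not partial:
        return []
    result = [partial[0]]
    for prev, cur in zip(partial, partial[1:]):
        result.extend(model.get((prev, cur), ()))  # unknown pair: leave the gap
        result.append(cur)
    return result


# Toy profiling data: two of three runs go C1 -> C2 -> C3 -> C4.
profile = [
    ["C1", "C2", "C3", "C4"],
    ["C1", "C2", "C3", "C4"],
    ["C1", "C5", "C4"],
]
model = build_model(profile, visible={"C1", "C4"})
print(complete_trace(["C1", "C4"], model))
```

A real model would also have to use the control flow graph and temporal information; this sketch uses profile frequencies only.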


host), the component interaction information captured on the network will be incomplete. To construct a full execution trace log from the partial component interaction information built from network messages, an interaction model is used. The interaction model is built beforehand using the control flow graph of the application, obtained from static analysis, and profile information collected in a test environment using an instrumented application and test scenarios.

2.        CONSTRUCTING EXECUTION LOG FROM NETWORK MESSAGES

2.1         Capturing Network Messages
   Several tools exist to capture and analyze messages on data networks. Normally, these tools require the host running them to be present on the same network as the messages to be captured. Intercepting network messages between different hosts in a distributed system from a separate computer has no impact on system performance. Tools such as Netmon [1], Microsoft Message Analyzer [2], Tcpdump/libpcap [4], WinDump/WinPcap [5], and Wireshark [6] make it easy to analyze network packets at various network layers, including the application layer. Packets can be filtered for specific patterns. Applications developed using popular frameworks such as .NET, J2EE, and CORBA have well-defined message formats for sending messages between components, and these messages are easy to identify automatically using packet filters.

2.2      Creating Execution log
   The following are the major steps in the process of constructing an execution log from network messages.

     1. Instrument the source code using traditional methods and use comprehensive, near-real test scenarios to gather profile data. Use dynamic analysis to construct trace logs for all test scenarios.

     2. Perform static analysis on the source code to generate control flow graphs of the system.

     3. Use the above two to create an interaction model. This model is used to identify patterns and create heuristics about the various workflows in the system. If some information is missing in an execution trace log, the model should be able to tell the most likely candidates for the missing places. When used with temporal information, this model should be enough to construct a complete trace log from the partial log.

     4. In the production environment, capture network messages to identify component interactions. Create a partial execution trace log using this information.

     5. Use the interaction model created in step 3 to generate the complete execution trace log.

   Figure 1 shows the process of constructing an execution log using network traces and previously built patterns and heuristics.
   The top left part of the diagram shows step 1 of the process mentioned above. Output of dynamic analysis for each test


scenario is used later to build the interaction model. The middle left part of the diagram shows the static analysis process that generates the control flow graph from source code. The top right part of the diagram shows the process of creating the interaction model. This model essentially holds a set of rules and heuristics for the various possible interactions among components of the system. The bottom left part of the diagram shows the process of creating the partial component execution trace log from network messages. This log is used by an engine, shown in the bottom right part of the diagram, to construct the complete execution trace log using rules and heuristics from the interaction model.

2.3      Interaction Model
   Building a good interaction model is key to the success of this idea. Control flow graphs for the various modules/components of an application provide all possible interactions in the application. Interactions between modules/components can be obtained from the call graph, input data, and the system integration model. This is still not sufficient to capture the dynamic behavior of the application. For example, if there are n possible paths an execution sequence can take from a particular point, then temporal information can reduce that number to k (where k << n). The system may follow a different execution sequence at start-up, at user input time, and at the time of I/O. With a good set of test data, a dynamic interaction model can be built that covers the most common usage scenarios. Combining static and dynamic interaction models can reduce the total number of possible execution paths. Some heuristics built around execution behavior and input values are used to select the most likely path.

2.4      Assumptions
   This approach makes the following assumptions.

      • Network messages are not encrypted.

      • Systems under consideration are distributed systems where a significant share of the interaction among components passes over the network.

      • Sufficient test scenario data is available that is close to real usage scenarios.

      • The interaction model is rebuilt after any change in the system.

3.      RELATED WORK
   Performing root-cause analysis in distributed systems is a well-studied subject. A recent work on run-time root cause analysis in distributed systems is presented in [10]. This work addresses the problem of deriving relationships for fault correlation in adaptive distributed systems, where components are dynamically installed, updated, and removed, and presents a state chart-based solution that tries to identify the sequence of method execution.
   An approach that combines model-driven techniques with runtime models to perform root cause analysis of executing systems is presented in [11]. The approach combines the advantages of model-driven development with reusable software artifacts. Interactive visualizations enable efficient tracing of log file entries and corresponding model artifacts during runtime.
   Some tips and tricks to help troubleshooters extract root cause information from network traces are provided in [3]. The objective is to reduce what might be hundreds of gigabytes of data to the essential events that show the root cause of a problem. The focus is to use network traces and then narrow the fault down to a box.
   An execution environment for Java programs is presented in [8] that improves execution performance by using both on-line and off-line profile information to guide dynamic optimization. A dynamic compilation system based on JikesRVM was developed that makes use of both.
   A model-based diagnosis approach is discussed in [9] that discovers faults based on generic fault models and abstract event traces. These events may be associated with multiple system components. Availability of a fault model for each component is not assumed; generic fault models of classes of faults are used instead.
   Our proposed idea differs from the above works because, in our approach, we construct a trace log very close to the one obtained from dynamic analysis of an instrumented application without actually instrumenting it. We rely on the maturity of a pre-built model, but the idea needs to be validated by building a prototype tool based on it and then comparing the trace log generated by the tool with complete data captured from an instrumented application.

4.    CONCLUSIONS AND FUTURE WORK
   We presented an idea for constructing a component interaction trace log for the components of a distributed application in live production environments. An interaction model is first built offline by generating profile data in a test environment. Then, in a live production system, a partial component interaction trace log is created from network messages eavesdropped from a separate host on the same network. Finally, a complete execution trace log is constructed by an engine using the partial log and the interaction model.
   We plan to start a short project on this idea. Its purpose is to validate the hypothesis presented in this paper that the execution trace log can be constructed by capturing only network messages in a live production system (the other information required is collected offline). In particular, we will focus on finding answers to the following questions.

      • Do enough component interactions take place over the network (between hosts) in a real distributed application? If yes, then how much is 'enough'? Can we use fewer messages than can be captured on the network, to reduce the size of the network log?

      • Can we develop heuristics that help in recreating the complete profile from network messages and models that were built offline?

      • Does the log data provide enough information to isolate heisenbugs in software that are otherwise non-repeatable?

5.    REFERENCES

 [1] How to use network monitor to capture network
     traffic. http://support.microsoft.com/kb/812953.
 [2] Microsoft message analyzer operating guide.
     http://technet.microsoft.com/en-us/library/jj649776.aspx.


 [3] Network trace analysis strategies.
     http://www.advance7.com/wp-content/uploads/2012/11/Network-Trace-Analysis-Strategies-Whitepaper.pdf.
 [4] Tcpdump/libpcap. http://www.tcpdump.org/.
 [5] Winpcap. http://www.winpcap.org/.
 [6] Wireshark. http://www.wireshark.org/.
 [7] T. Bell. The concept of dynamic analysis. In ACM
     SIGSOFT international symposium on Foundations of
     software engineering, pages 216–234. ACM SIGSOFT
     Software Engineering Notes, November 1999.
 [8] C. Krintz. Coupling on-line and off-line profile
     information to improve program performance. In
     international symposium on Code generation and
     optimization, pages 69–78. IEEE Computer Society,
     March 2003.
 [9] W. Mayer, X. Pucel, and M. Stumptner. Diagnosing
     component interaction errors from abstract event
     traces. In 23rd Australasian Joint Conference on
     Advances in Artificial Intelligence, pages 496–505.
     Lecture Notes in Computer Science Volume 6464,
     December 2010.
[10] A. Raj, S. Barrett, and S. Clarke. Run-time root cause
     analysis in adaptive distributed systems. In On the
     Move to Meaningful Internet Systems: OTM 2013
     Workshops, pages 292–301. Lecture Notes in
     Computer Science Volume 8186, September 2013.
[11] M. Szvetits and U. Zdun. Enhancing root cause
     analysis with runtime models and interactive
     visualizations. In 8th International Workshop on
     Models at run.time, September 2013.
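
The path-pruning argument of Section 2.3 (n possible paths reduced to k by temporal information, followed by a heuristic choice) can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the authors' method: the `CandidatePath` record, the 25% relative time tolerance, and the rule "prefer the path seen most often in profiling" are all hypothetical choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CandidatePath:
    """One path permitted by the control flow graph (hypothetical record)."""
    components: tuple   # hidden components executed along the path
    typical_ms: float   # mean duration observed during profiling
    frequency: int      # how often profiling runs took this path


def prune_by_time(paths: Sequence[CandidatePath], observed_ms: float,
                  tolerance: float = 0.25) -> list:
    """Temporal pruning: keep paths whose profiled duration lies within
    a relative `tolerance` of the gap observed between network messages."""
    return [p for p in paths
            if abs(p.typical_ms - observed_ms) <= tolerance * p.typical_ms]


def most_likely(paths: Sequence[CandidatePath], observed_ms: float,
                tolerance: float = 0.25) -> Optional[CandidatePath]:
    """Heuristic choice among the surviving paths: highest profile frequency."""
    feasible = prune_by_time(paths, observed_ms, tolerance)
    return max(feasible, key=lambda p: p.frequency) if feasible else None


# n = 3 paths allowed by the (assumed) control flow graph.
candidates = [
    CandidatePath(("C2", "C3"), typical_ms=10.0, frequency=50),
    CandidatePath(("C5",), typical_ms=40.0, frequency=30),
    CandidatePath(("C2", "C6", "C3"), typical_ms=42.0, frequency=5),
]

# A 41 ms gap between two captured messages rules out the 10 ms path (k = 2).
print(len(prune_by_time(candidates, observed_ms=41.0)))
print(most_likely(candidates, observed_ms=41.0))
```

Returning `None` when no path fits the observed gap leaves the decision to the caller, for example to flag the gap as unexplained rather than guess.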

