Extracting Realistic User Behavior Models

Reiner Jung (rju@informatik.uni-kiel.de), Kiel University, Germany
Marc Adolf (mad@informatik.uni-kiel.de), Kiel University, Germany

Abstract

Workloads play a central role in assessing software qualities, like performance and privacy. They are characterized by intensity and user behavior patterns. Multiple intensities and behaviors are combined into workload profiles which, among others, serve to evaluate software designs and to predict system utilization. The central challenge for workload profiles is their fit to real workloads and, in particular, the match to specific behaviors. This is especially relevant for understanding and identifying specific user groups and for supporting workload composition by operators. In this paper, we address the identification of such realistic user behaviors by utilizing domain-specific attributes, report on our evaluation of the fitness of behavior clustering approaches, and discuss our setup to evaluate further clustering approaches.

1 Introduction

Service quality of software systems is influenced by workload intensity and user behavior. Both factors play a vital role in characterizing the system workload [7], which is relevant to understand past workloads and to construct workload profiles that estimate future system utilization and performance. For example, the resource consumption of browsing a catalog, searching the inventory, and purchasing items can be quite different. Therefore, it is necessary to distinguish specific kinds of user behavior to characterize the workload sufficiently.

State-of-the-art workload characterization approaches, such as WESSBAS [8], use a behavior mix, where different workload intensities are combined with specific user behavior models to construct a workload model. These approaches collect user sessions and aggregate them into behavior models. WESSBAS estimates behavior models utilizing X-means clustering [3]. Such behavior models have three key shortcomings: (I) They reflect the observed behavior of the past, but might not represent specific user groups correctly, which harms predictability. For example, a detergent shopper might reappear frequently, while a sunscreen shopper has a different seasonal profile. Unfortunately, current approaches cannot distinguish between them. (II) The behavior models use cyclic graphs with edge counts or probabilities to create compact representations. However, in this process we may lose behavioral information, as two visits of a page might differ in purpose, and a looping behavior might actually be dissimilar to another loop along the same pages. (III) X-means only yields acceptable results for small parameter vectors of at least ordinal values, but current behavior models are mapped to vectors. More model transitions imply more parameters, which harms clustering [3].

To mitigate these issues, we want (a) to advance behavior models to better capture the specific properties of different kinds of users based on domain knowledge of the observed software system and other parameters, e.g., time, and (b) to improve the clustering and classification of observed user behaviors. Therefore, we extend classic behavior models with domain knowledge and evaluate different aggregation approaches that can handle large parameter sets, or find ways to reduce the parameter sets so that aggregation approaches well suited for a limited number of parameters can be used.

In this paper, we report on our preliminary findings regarding the two clustering approaches X-means and Expectation-Maximization (EM) [2] in the context of realistic user behaviors, present additional approaches which we are currently investigating, and formulate key questions regarding the identification of behavior models and their quality.

The paper is structured as follows: Section 2 discusses user behavior models. Section 3 introduces clustering and aggregation approaches. Section 4 presents the concept of realistic user behaviors. Section 5 describes the evaluation and Section 6 discusses preliminary results for X-means and EM clustering. Finally, Section 7 summarizes our findings, discusses further research, and describes our key questions.

2 Behavior Models

Behavior models describe kinds of users and are aggregations of single user behaviors with similar behavior patterns. A single user behavior comprises all system invocations (entry-level events) of a user during a session. It can be modeled as a path over visited pages or transformed into a behavior graph or a Markov chain, which may contain loops for repetitive behavior. These paths or graphs are then grouped by similarity and merged into a behavior model.

Figure 1 depicts an annotated behavior graph of a user interacting with the JPetStore [12], an example application resembling a shop system for pets. In this graph, nodes represent page visits and edges express the transitions between pages. The numbers at the edges indicate the number of transitions between pages. In addition, we added domain-specific information, like the viewed category and product, which can be used to support the behavior clustering.

Figure 1: Behavior graph representing a shopper from our JPetStore [12] use case.
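To make the structure of such an annotated behavior graph more concrete, the following minimal Java sketch models pages as nodes, transitions with counts as edges, and domain-specific annotations as parameter-value counts per node. The class, page, and parameter names are illustrative and do not stem from the iObserve or WESSBAS implementations.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of an annotated behavior graph (illustrative, not the iObserve model). */
public class BehaviorGraph {

    /** A node represents a visited page, annotated with observed request parameters. */
    static class Node {
        final String page;
        // domain-specific annotations, e.g., "category" -> ("CATS" -> count)
        final Map<String, Map<String, Integer>> parameters = new HashMap<>();

        Node(String page) { this.page = page; }

        void annotate(String parameter, String value) {
            parameters.computeIfAbsent(parameter, p -> new HashMap<>())
                      .merge(value, 1, Integer::sum);
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();
    // transition counts: source page -> (target page -> number of observed transitions)
    private final Map<String, Map<String, Integer>> transitions = new HashMap<>();

    Node node(String page) {
        return nodes.computeIfAbsent(page, Node::new);
    }

    /** Record one observed transition between two pages. */
    void addTransition(String from, String to) {
        node(from);
        node(to);
        transitions.computeIfAbsent(from, f -> new HashMap<>())
                   .merge(to, 1, Integer::sum);
    }

    public static void main(String[] args) {
        // A fragment of a shopper session: category view, product view, add item to cart.
        BehaviorGraph shopper = new BehaviorGraph();
        shopper.addTransition("viewCategory", "viewProduct");
        shopper.addTransition("viewProduct", "addItemToCart");
        shopper.node("viewCategory").annotate("category", "CATS");
        shopper.node("viewProduct").annotate("product", "Persian");
        System.out.println(shopper.transitions);
    }
}
```

Keeping the domain-specific annotations separate from the transition counts is one possible way to let a clustering step use either or both kinds of information.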
3 Clustering Approaches

Clustering can be used to identify groups of data points which share similarities. In our context, we use clustering to identify user behavior models, like WESSBAS, which uses X-means. Clustering is affected by the density, distances, and distribution of data points. Depending on the clustering approach, the dimension of the data points can have a significant impact on the quality of the clusters. We employ clustering methods provided by Weka [4]:

X-Means. X-means builds on the K-means clustering algorithm [3], which consists of three steps [1]: (1) For every expected cluster (K), a center point, called centroid, is randomly chosen from the data points. (2) According to a chosen distance metric, each point x of the data set is assigned to the closest centroid. (3) The centroids are recomputed according to the center of mass of the points belonging to them. Steps (2) and (3) are repeated until a convergence criterion is met.

In contrast to K-means, X-means searches over a range (e.g., 2 to 10) for the set of clusters which provides the best fit. Therefore, X-means starts by computing K-means for the lower bound (e.g., K=2). Subsequently, each cluster is split into two using 2-means to try to improve the fit. Both steps are iterated while incrementing K until the upper bound is reached or an iteration is worse than the one before [3].

Expectation-Maximization. EM is an iterative method consisting of two phases (E- and M-step) that are repeated until the convergence criterion is met and a final set of clusters is identified. Initially, a random set of cluster-identifying data points is defined, which serves as the initial parameters for EM. The E-step uses these parameters to compute the expected values of each data point. The M-step uses the E-step results to compute a new maximum likelihood for the data points regarding the parameters. This way, the new parameters for the next iteration are computed. These two steps are repeated until the convergence criterion is reached [2].
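To illustrate how behavior models that have been flattened into numeric parameter vectors can be handed to the Weka clusterers named above, the following sketch builds a tiny data set and runs SimpleKMeans (Weka's K-means, the basis of X-means) with the Manhattan distance as well as EM. It is a minimal sketch assuming Weka 3.8 on the classpath; the transition attributes and counts are made up for illustration, and the X-means clusterer itself is distributed as a separate Weka package that is configured analogously with a range of expected clusters.

```java
import weka.clusterers.EM;
import weka.clusterers.SimpleKMeans;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.ManhattanDistance;

import java.util.ArrayList;

/** Minimal sketch: clustering flattened behavior vectors with Weka (illustrative data). */
public class BehaviorClustering {

    public static void main(String[] args) throws Exception {
        // Each attribute stands for one transition count of the behavior model.
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("viewCategory->viewProduct"));
        attributes.add(new Attribute("viewProduct->addItemToCart"));

        Instances behaviors = new Instances("behaviors", attributes, 0);
        behaviors.add(new DenseInstance(1.0, new double[] {8.0, 0.0})); // browsing-like session
        behaviors.add(new DenseInstance(1.0, new double[] {8.0, 1.0})); // product-lover-like session
        behaviors.add(new DenseInstance(1.0, new double[] {1.0, 1.0})); // single-buyer-like session

        // K-means with a fixed number of clusters and the Manhattan distance.
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2);
        kMeans.setDistanceFunction(new ManhattanDistance());
        kMeans.buildClusterer(behaviors);

        // EM; the number of clusters is fixed here only because of the tiny example data set.
        EM em = new EM();
        em.setNumClusters(2);
        em.buildClusterer(behaviors);

        for (int i = 0; i < behaviors.numInstances(); i++) {
            System.out.printf("behavior %d: k-means=%d, EM=%d%n", i,
                    kMeans.clusterInstance(behaviors.instance(i)),
                    em.clusterInstance(behaviors.instance(i)));
        }
    }
}
```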
Hierarchical Clustering. Hierarchical clustering is an approach which builds a hierarchy of clusters, where the root cluster contains all individuals, which are then further divided into smaller clusters. The hierarchy can either be built up from the root cluster to the leaves (divisive) or vice versa (agglomerative). With the agglomerative approach, we start with clusters containing only one individual each. Then we determine the distance between all pairs of clusters. The pair with the shortest distance is then merged into one cluster. This process is repeated until all individuals are merged into one single cluster [4, p. 95]. The shortest distance between two elements is determined by a distance function, like the Euclidean distance or the Manhattan distance.

Similarity Matching. In contrast to the other approaches, similarity matching uses two metrics to compare graphs, one based on structural similarity and one based on the distance of parameter values according to their semantic similarity. The algorithm initially computes the distance between each graph of a set of behavior graphs, creating a vector for each graph containing the distances to all other graphs. The distance between two vectors is the sum of differences sum_{i=1..n} |d_{a,i} − d_{b,i}|, where n is the number of graphs and d_{j,i} refers to the values in the vector of graph j. Graphs whose distance is lower than a defined threshold are then considered similar and grouped together.

These groups are further divided based on their semantic difference. For example, in JPetStore, categories, products, and items form a tree. The distance between two values in the tree determines their semantic difference, e.g., the two cats Fritz and Felix both belong to the product male cartoon cat and therefore have a distance of one, while the cat Amber belongs to the product female cat, and therefore the distance is two.
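A minimal sketch of the grouping step of similarity matching is shown below, assuming the pairwise structural graph distances are already available as a matrix. The vector distance follows the sum-of-differences formula above; the greedy grouping and the path-based tree distance are simplifying assumptions, as the text does not prescribe a concrete procedure, and the helper names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of the similarity-matching grouping step (illustrative). */
public class SimilarityMatching {

    /** Distance between graphs a and b, given the matrix d of pairwise graph distances:
     *  sum over i of |d[a][i] - d[b][i]|. */
    static double vectorDistance(double[][] d, int a, int b) {
        double sum = 0.0;
        for (int i = 0; i < d.length; i++) {
            sum += Math.abs(d[a][i] - d[b][i]);
        }
        return sum;
    }

    /** Greedily group graphs whose vector distance is below the threshold (assumed procedure). */
    static List<List<Integer>> group(double[][] d, double threshold) {
        List<List<Integer>> groups = new ArrayList<>();
        boolean[] assigned = new boolean[d.length];
        for (int a = 0; a < d.length; a++) {
            if (assigned[a]) continue;
            List<Integer> group = new ArrayList<>();
            group.add(a);
            assigned[a] = true;
            for (int b = a + 1; b < d.length; b++) {
                if (!assigned[b] && vectorDistance(d, a, b) < threshold) {
                    group.add(b);
                    assigned[b] = true;
                }
            }
            groups.add(group);
        }
        return groups;
    }

    /** Semantic difference as the number of tree levels up to the lowest common ancestor,
     *  so two items of the same product have distance 1, of the same category distance 2. */
    static int semanticDistance(String[] pathA, String[] pathB) {
        int common = 0;
        while (common < pathA.length && common < pathB.length
                && pathA[common].equals(pathB[common])) {
            common++;
        }
        return Math.max(pathA.length, pathB.length) - common;
    }

    public static void main(String[] args) {
        double[][] d = {{0, 1, 4}, {1, 0, 4}, {4, 4, 0}};
        System.out.println(group(d, 3.0)); // [[0, 1], [2]]
        // Fritz and Felix share the product, Amber belongs to another product of the same category.
        System.out.println(semanticDistance(
                new String[] {"CATS", "male cartoon cat", "Fritz"},
                new String[] {"CATS", "male cartoon cat", "Felix"})); // 1
        System.out.println(semanticDistance(
                new String[] {"CATS", "male cartoon cat", "Fritz"},
                new String[] {"CATS", "female cat", "Amber"}));       // 2
    }
}
```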
4 Realistic User Behaviors

We define realistic user behavior models as behavior models which reflect real groups of users, in contrast to approximated groups, i.e., groups solely defined by their transitions, neglecting domain-specific data. For example, our detergent shoppers should form a separate group from those buying sunscreen. This helps to better understand seasonal behavioral changes and allows creating and modifying workloads more realistically. This is relevant in scenarios where workload characterizations can be modified to provide the system with knowledge of upcoming events, like a sunscreen shopper just before the holiday season.

A key ingredient for realistic user behaviors is domain-specific data, like the products or categories. With this additional data, user behavior can be classified into different groups. To be able to use such values in a clustering approach, a suitable metric must be defined, e.g., products of a similar type should be closer together than products which are in another category.

5 Evaluation

Our evaluation uses the iObserve analysis service with different aggregation filters to examine clustering and aggregation approaches. Figure 2 depicts an excerpt of the pipe-and-filter setup for our analysis. The Session Collector collects EntryEvents caused by a user and creates sessions (collections of EntryEvents). The filter sends out a session event either when it receives a SessionEndEvent or when a timeout is reached and the filter is triggered by the time trigger filter. The AnnotatedGraphBuilder creates an annotated graph from the SessionEvent and sends it to an AggregationFilter. Depending on the aggregation and clustering approach, a specific filter is inserted there. Finally, the aggregated graphs are serialized with the GraphOutputFilter or alternatively transformed into a behavior model suitable for the Palladio Component Model.

Figure 2: Analysis pipe-and-filter setup for user behavior aggregation.

The input for this analysis is provided by two software systems. The first observed system is a JPetStore [12] instance instrumented with Kieker [5]. The second one is an instrumented instance of our research group's JIRA [11], which is used by students during a four-week practical course. We use the JPetStore to evaluate whether a specific, previously defined setup of realistic behaviors can be detected. The JIRA experiment applies the approaches to another domain, where we want to explore whether they can produce reasonable results in a realistic scenario.

5.1 JPetStore Experiment

For the JPetStore experiments, we modeled seven realistic (ideal) user behaviors [9], which utilize all functions of the JPetStore. We created workloads with Selenium [13] that represent these behavior models. We executed the JPetStore together with our workload and collected monitoring data. This data is then processed by the iObserve analysis [6] using different clustering algorithms provided by their respective filters. Then, we compare the detected behaviors with the mentioned set of seven ideal behavior models (IBM).

Workload. The behaviors are tailored to share common behavior, but also include significant differences regarding pages, transitions, and request parameters, e.g., whether the person shops cats or fishes:

Account manager (AM): Changes contact information after login. Inspects one of the prior orders.
Browsing user (BU): Searches products and only browses categories, products, and items.
Product lover (*L): Visits the CATS (CL) or FISH (FL) category and selects one product. Repeats 8 times and concludes shopping.
Single product buyer (S*): Goes to a category (REPTILES (SR) or CATS (SC)) and buys one item.
New customer (NC): Registers as a new customer, logs in, and buys a reptile.

Experiment Execution. At the end of each analysis run, we compare the detected clustered behavior models with the prepared IBMs. First, we identify which detected behavior model matches an IBM best. Second, we determine the distance of the matching model. In case a match can be identified, this counts as a hit (score=0). The match between two models is computed in three steps: (1) We remove nodes which are not connected to the behavior graph, as they are created by mapping graphs to matrices and back. (2) We identify missing and additional nodes and edges, and compute ratios between these differences and the IBM. The lower the ratios, the better the fit of the detected behavior model. (3) We compare the request parameters in the behavior models. For example, if the parameter CATS appears in the IBM but the detected behavior model includes REPTILES, then the behaviors do not match.
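The following sketch illustrates the three matching steps on node, edge, and parameter sets. The exact scoring formula is not spelled out in the paper, so relating the node and edge differences to the size of the IBM and averaging them is an assumption, as are the example page names.

```java
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of comparing a detected behavior model against an ideal behavior model (IBM). */
public class BehaviorModelMatch {

    /** Missing elements (in the IBM but not detected) plus additional elements (detected only). */
    static <T> long differences(Set<T> detected, Set<T> ideal) {
        Set<T> missing = new HashSet<>(ideal);
        missing.removeAll(detected);
        Set<T> additional = new HashSet<>(detected);
        additional.removeAll(ideal);
        return missing.size() + additional.size();
    }

    /** Score 0 means a perfect hit; higher values mean larger deviation from the IBM.
     *  Step 1 (removing unconnected nodes) is assumed to have happened already. */
    static double score(Set<String> detectedNodes, Set<String> idealNodes,
                        Set<String> detectedEdges, Set<String> idealEdges) {
        // Step 2: ratios between the node/edge differences and the size of the IBM (assumed formula).
        double nodeRatio = (double) differences(detectedNodes, idealNodes) / idealNodes.size();
        double edgeRatio = (double) differences(detectedEdges, idealEdges) / idealEdges.size();
        return (nodeRatio + edgeRatio) / 2.0;
    }

    /** Step 3: the request parameters must agree, e.g., CATS in the IBM vs. REPTILES detected. */
    static boolean parametersMatch(Set<String> detectedParams, Set<String> idealParams) {
        return detectedParams.equals(idealParams);
    }

    public static void main(String[] args) {
        Set<String> idealNodes = Set.of("viewCategory", "viewProduct", "addItemToCart");
        Set<String> idealEdges = Set.of("viewCategory->viewProduct", "viewProduct->addItemToCart");
        Set<String> detectedNodes = Set.of("viewCategory", "viewProduct");
        Set<String> detectedEdges = Set.of("viewCategory->viewProduct");

        System.out.println(score(detectedNodes, idealNodes, detectedEdges, idealEdges)); // ~0.42
        System.out.println(parametersMatch(Set.of("CATS"), Set.of("CATS")));     // true
        System.out.println(parametersMatch(Set.of("REPTILES"), Set.of("CATS"))); // false
    }
}
```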
5.2 JIRA Experiment

The JIRA experiment utilizes real-world monitoring data which we collect every semester during a four-week practical course where multiple groups of students plan and develop a small software system. Therefore, we do not have predefined IBMs for this experiment. However, we want to detect and isolate specific behaviors. To examine which computed behavior models match reality, we will discuss the aggregated behavior models with the students. Besides this qualitative evaluation, we will also gain insight into how students use JIRA and whether we have to introduce its functionality differently.

6 Preliminary Results

We already evaluated X-means and EM clustering. The X-means setup is based on preliminary work, where we tested different configuration parameters for the algorithm [9]. We chose a configuration for X-means which provided the best fit to the JPetStore scenario. We set the range for the number of expected clusters to [6..12] and use the Manhattan metric. For the EM clustering, we went with the standard settings in Weka and did not set any parameters, including the pre-estimated number of clusters. Since both algorithms start with randomly chosen values, the results may differ between executions. Therefore, we execute each clustering five times to avoid results that are solely based on arbitrary starting values.

In X-means, the resulting user behaviors are the computed centroids of each cluster. They do not necessarily correspond to a real behavior. In contrast, the EM clustering only groups measured behaviors. In this evaluation, we simply took one representative behavior of each group. This can be improved, e.g., by creating a mean vector of every cluster.
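As a possible improvement mentioned above, a representative could be computed as the mean vector of a cluster instead of picking a single behavior. A minimal sketch, assuming each behavior of a cluster has already been flattened into a numeric vector of equal length:

```java
import java.util.List;

/** Minimal sketch: mean vector of a cluster of behavior vectors (illustrative). */
public class ClusterRepresentative {

    static double[] meanVector(List<double[]> cluster) {
        double[] mean = new double[cluster.get(0).length];
        for (double[] behavior : cluster) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += behavior[i] / cluster.size();
            }
        }
        return mean;
    }

    public static void main(String[] args) {
        List<double[]> cluster = List.of(
                new double[] {8.0, 1.0},
                new double[] {6.0, 1.0});
        System.out.println(java.util.Arrays.toString(meanVector(cluster))); // [7.0, 1.0]
    }
}
```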
Neither approach could detect all 7 IBMs (EM produced 4 and X-means 5 clusters), but some of the detected models match an IBM. Table 1 depicts the scores and parameter matches of behaviors to IBMs.

Table 1: Comparison between the generated clusters and the IBMs (scores and content).

Scores    AM    BU  CL  FL  SC   SR   NC
EM        0.25  0   0   0   0.8  0.8  0.9
X-means   0.25  0   0   0   0.1  0.1  0.4

Content   AM    BU  CL  FL  SC   SR   NC
EM        -     +   +   +   (a)  (c)  (e)
X-means   -     +   +   +   (b)  (d)  (f)

EM and X-means both detected the account manager behavior, but there were minor discrepancies between the aggregations and the IBMs. As we did not record parameters for these pages, we could not compare the behaviors content-wise. The browsing user, cat lover, and fish lover were detected correctly by both algorithms. The single cat buyer, however, could not be detected by EM; the closest match was the single reptile buyer behavior (a). X-means created a cluster merging the cat buyer, the reptile buyer, and the new customer, identifiable by a probability of 1/3 for a cat and 2/3 for reptiles (b). Similarly, the single reptile buyer was identified by EM (c), and the closest X-means match was the same as for the single cat buyer (d). Finally, the new customer detection failed, as the returned cluster deviated significantly from the IBM. Also, the found graph better matches a single product buyer or product lover than the new customer.

7 Conclusion

We presented our efforts towards realistic user behavior clustering to improve the understanding of workloads and their composition, which can improve software quality assessment and support more precise workload alterations. All experiment data, notes, and artifacts can be found in our replication package [10].

We reported on the detection quality of the two clustering approaches EM and X-means for behavior models. Our current findings are that both clustering approaches are able to differentiate some behaviors based on parameter information, which is an improvement in comparison to clustering without this information. However, they are unable to detect all behaviors correctly. Key issues are, for X-means, larger vectors resulting in less precise clustering [3] and, for EM, the behavior model merge, which might need to be amended.

In the future, we will evaluate further aggregation algorithms, including hierarchical clustering and similarity matching. We want to include other factors, like seasonal factors, in the clustering to improve detection. In the context of EMLS'18, our key questions are: (a) Are these specific approaches useful to improve the quality of aggregated behaviors? (b) Are cyclic graphs the best possible way to describe realistic user behaviors? (c) How can we improve our evaluation, especially in scenarios where we cannot predefine ideal behavior models?

References

[1] J. B. MacQueen. "Some Methods for Classification and Analysis of MultiVariate Observations". In: 5th Berkeley Symp. on Math. Statistics and Probability. Vol. 1. UC Press, 1967.
[2] T. K. Moon. "The expectation-maximization algorithm". In: IEEE Signal Process. Mag. 13.6 (Nov. 1996).
[3] D. Pelleg et al. "X-means: Extending K-means with Efficient Estimation of the Number of Clusters". In: ICML. Vol. 1. 2000.
[4] M. Hall et al. "The WEKA Data Mining Software: An Update". In: SIGKDD Explor. Newsl. 11.1 (Nov. 2009).
[5] A. v. Hoorn et al. "Kieker: A Framework for Application Performance Monitoring and Dynamic Software Analysis". In: Proceedings of ICPE 2012. ACM, Apr. 2012.
[6] R. Heinrich et al. "Architectural Run-Time Models for Operator-in-the-Loop Adaptation of Cloud Applications". In: Proceedings of MESOCA. IEEE Computer Society, Sept. 2015.
[7] M. C. Calzarossa et al. "Workload Characterization: A Survey Revisited". In: CSUR 48.3 (Feb. 2016).
[8] C. Vögele et al. "WESSBAS: extraction of probabilistic workload specifications for load testing and performance prediction - a model-driven approach for session-based application systems". In: SoSyM (Oct. 2016).
[9] C. Dornieden. "Knowledge-Driven User Behavior Model Extraction for iObserve". Master thesis. Kiel University, June 2017.
[10] Replication package. doi.org/10.5281/zenodo.883095. 2017.
[11] JIRA – Issue Tracking System. https://de.atlassian.com/software/jira. 2017.
[12] MyBatis JPetStore application. http://www.mybatis.org/spring/sample.html. 2017.
[13] Selenium browser automation. http://www.seleniumhq.org. 2017.