Predicting Test Case Verdicts Using Textual Analysis of Committed Code Churns

Khaled Walid Al-Sabbagh1[0000-0003-2571-5099], Miroslaw Staron1[0000-0002-9052-0864], Regina Hebig1[0000-0002-1459-2081], and Wilhelm Meding2

1 Chalmers | University of Gothenburg, Computer Science and Engineering Department, Gothenburg, Sweden
2 Ericsson AB
{khaled.al-sabbagh, miroslaw.staron, regina.hebig}@gu.se
wilhelm.meding@ericsson.com

Abstract. Background: Continuous Integration (CI) is an agile software development practice that involves producing several clean builds of the software per day. The creation of these builds involves running excessive executions of automated tests, which is hampered by high hardware cost and reduced development velocity. Goal: The goal of our research is to develop a method that reduces the number of executed test cases at each CI cycle. Method: We adopt a design research approach with an infrastructure provider company to develop a method that exploits Machine Learning (ML) to predict test case verdicts for committed source code. We train five different ML models on two data sets and evaluate their performance using two simple retrieval measures: precision and recall. Results: While the results from training the ML models on the first data set of test executions revealed low performance, the curated data set used for training showed an improvement in performance with respect to precision and recall. Conclusion: Our results indicate that the method is applicable when training the ML model on churns of small sizes.

Keywords: Machine Learning, Verdicts, Code Churn, Test Case Selection

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

CI is a modern software development practice, which is based on frequent integration of code from developers and teams into a product's main branch [23]. One of the cornerstones of its popularity is the promise of higher quality delivered by frequent testing and the ability to quickly pinpoint the code that does not meet quality requirements. To achieve this, CI systems execute tests as part of the integration [5]. However, excessive execution of automated software tests is penalized with high hardware cost and reduced development velocity, which may consequently hinder agility and time to market.

In order to address this challenge, a CI system should be able to pinpoint exactly which test cases should be executed in order to maximize the probability of finding defects (i.e. to reduce the "empty" test executions). To achieve this, the CI system needs to be able to predict whether a given test case has a chance of finding a defect or not, or at least whether it will fail or pass – that is, predict the verdict of a test case execution.

We set off to address the problem of predicting test case verdicts by training five ML models on a large data set of historical test cases that were executed against changes made to a software product developed at company A. The term "code churn" is defined as a measure that quantifies these changes. Throughout the remaining sections of this paper, we use this term to refer to code committed during different CI cycles. Our research is inspired by a previous study conducted by Knauss et al. [13], where the authors explored the relationship between historical code churns and test case executions using a statistical model.
Their method used precision and recall metrics in predicting an optimal suite of functional regression tests that would trigger failure. In this paper we expand on that approach by going one step further – conducting a textual analysis of the code that is actually being integrated. For example, instead of using code location as the parameter, we use measures such as the number of 'if' statements or whether the code contains data definitions. Similarly, our choice of using code churns is inspired by the work of Nagappan and Ball [14], in which the authors presented a technique for early fault prediction using code churn measures. In their publication, the authors identified a positive correlation between the size of code churns and system defect density. Our method builds on this and uses the Bag of Words (BOW) approach to extract features from code churns. This enables the identification of statistical dependencies between keywords and test case verdicts. For example, a churn containing a frequent occurrence of keywords like new or delete might trigger specific tests to fail.

More precisely, we aim at investigating the following research question: How to reduce the number of executed test cases by selecting the most effective minimal test suite when integrating new code churns into the product's main branch?

Our study was conducted in collaboration with a large Swedish-based infrastructure providing company. We study a software product that has evolved over a span of a decade, developed by different cross-functional teams. As a result of our study, we present a method that uses ML to predict test case verdicts (MeBoTS). To address this research question, we conduct a design research study, where we develop a new method and evaluate it on the company's data set. Our method is based on the research by Ochodek et al. [17], which uses textual analyses to characterize source code. Our MeBoTS method builds on that by using historical test verdicts as predicted variables and uses Random Forest algorithms to make the predictions.

The remainder of this paper is organized as follows: Section 2 provides background information about two categories of ML. The sections that follow provide: an overview of the most related studies in this area, a description of the method that we developed in our study as well as the results, validity analysis, recommendations, and finally, the conclusion.

2 Background

2.1 Categories of Machine Learning

Machine learning is a class of Artificial Intelligence that provides systems the ability to automatically make inferences, given examples relevant to a task [8]. The main advantage of using machine learning over classical statistical analysis is its ability to deal with large and complex data sets [12]. These systems can be classified into four categories depending on the type of supervision involved in training: a) Supervised, b) Unsupervised, c) Semi-supervised, and d) Reinforcement Learning [8]. Since we view the problem of predicting test case verdicts as a classification problem, we briefly describe the supervised learning category. In supervised learning, the training data set fed into the ML model contains the desired solutions, called labels. The model tries to find a statistical structure between these examples and their desired solutions [12]. A typical task for this kind of learning is classification.

2.2 Tree-based and Deep Learning Models

In machine learning, a decision tree is an algorithm that belongs to the family of supervised learning algorithms.
The algorithm has an inherent tree-like structure and is commonly used for solving classification and regression problems [20]. Starting from the root node, the algorithm uses a binary recursive scheme to repeatedly split each node into two child nodes, where the root node holds the complete training sample [1]. The resulting child nodes correspond to features in the training data, whereas the leaf nodes correspond to class labels (binary or multivariate). Other algorithms, such as Random Forest and Adaptive Boosting, use decision trees as a primary component in their structure. These algorithms build a collection of decision trees, called an ensemble, to improve the overall performance on the classification or regression task at hand [12].

Deep Learning is a branch of ML that was founded on the premise of using successive learning "layers" to achieve more useful representations of the data [8]. The learning of these successive layers is achieved via models called Neural Networks (NN) [8]. A multilayer NN is one that consists of at least three layers: 1) one input layer, 2) at least one hidden layer, and 3) an output layer of artificial neurons [10]. Similarly, a Convolutional Neural Network (CNN) consists of a set of learning layers [8]. The main difference between the two networks is in the way they search for patterns in the input space [12]. More precisely, a CNN works by sweeping a matrix-like window, called a filter, over every location in every patch to extract patterns from the input data [12]. As opposed to decision trees, ANNs have a black-box nature, which means that no insight about how their predictions were made is easily accessible [12]. Nevertheless, the main advantage of using deep learning comes from its ability to handle large and complex sets of features.

2.3 Code Churns

The amount of changes made to software over time is referred to as code churn [14]. As new churns are added, new risks of introducing defects into the system emerge [21]. According to Y. Shin et al. [21], each check-in made into a version control system includes newly added or deleted code that increases the chances of triggering failures. At some point in time, an evolving system may be vulnerable, on average, to one extra fault for every new additional change [11]. For example, in C programming, the declaration of 'static' local and global variables is among the constructs most commonly confused by developers, as each static local and global declaration has a different effect on how the data will be retained in the program's memory [3].

3 Related Work

In the following we discuss related work on the specific use of machine learning for test case selection or prioritization.

3.1 ML-based Test-Case Selection

Around 2015/16, we find the first machine learning based approaches for test-case selection. With only 4 studies included in the systematic mapping study by Durelli et al. [9], the use of machine learning for test case prioritization seems to be new. Busjaeger and Xie [4] present an industrial case study in which a linear model is trained with the SVMmap algorithm using the features Java code coverage, text path similarity, text content similarity, failure history, and test age. The evaluation on the industrial case study, considering 2000 changes and over 45,000 test executions, shows an Average Percentage of Faults Detected (APFD) of around 85%. Chen et al.
[6] prioritize test programs for compiler testing; their approach "identifies a set of features of test programs, trains a capability model to predict the probability of a new test program for triggering compiler bugs and a time model to predict the execution time of a test program."

Spieker et al. [22] introduced Retecs, a reinforcement learning-based approach to test case selection and prioritization. Retecs considers the duration of a test case's execution, its previous execution, and its failure history. Online learning is used to improve test case selection between continuous integration cycles. The approach was evaluated on 3 industrial data sets, together including more than 1.2 million verdicts, and achieved a normalized Average Percentage of Faults Detected (APFD) of around 0.4 to 0.8, depending on the data set.

Most recently, Azizi and Do [2] perform test case prioritization by calculating a ranked list of components, considering the access frequency of a component as well as a fault risk. The fault risk for each component is thereby predicted using a linear model of change and bug histories. Test cases associated with highly ranked components are prioritized. The approach was evaluated on three web-based systems, where it could reduce the number of test cases by 20% while still finding over 80% of the errors. Palma et al. [18] replicate and extend the work of Noor and Hemmati [16] and [15] to predict test case failure using a machine-learned model based on test quality metrics as well as similarity-based metrics.

However, to the best of our knowledge, no other learning-based method works on code churns. The only exception is our previous collaboration with Knauss et al. [13]. The introduced code-churn based test selection method (CCTS) analyzes correlations between test-case failure and source code change. The approach was evaluated in several configurations, leading to results ranging from 26% precision up to 54% precision with a 97% recall. We deem these results promising and one of the main motivations for this study.

4 Method using Bag of Words for Test Selection (MeBoTS)

The following section is a description of the MeBoTS method used in this research, which comprises three sequential steps, as shown in Figure 1. The method utilizes two Python programs and an open source textual analyzer program, called CCFLEX [17].

Fig. 1. The MeBoTS method.

4.1 Code Churns Extraction (Step 1)

A Python-based code churn extraction program was created to collect and compile code churns committed in the source code repository. The program takes one input parameter: a time-ordered list of historical test case execution results extracted from a database, where each element in the list represents an instance of a previously run test case and holds information about: the name of the executed test case, the baseline code in Git against which the test case was executed, and the verdict value - as shown in Table 1.

Table 1. An Excerpt of the Historical Test Case Executions List

  Baseline            Test Case Name   Verdict
  ca82a6dff817ec66f   ST-case 22       FAILED
  ca82a6dff817ec66f   FT-case 42       PASSED
  34bb5e22134200896   FT-case-333      FAILED
  34bb5e22134200333   FT-case-3        PASSED

The program first loops through the extracted list of tests, looks at the change history log maintained by Git, and performs a file comparison (diff) on each pair of consecutive baselines in the tests list. Note that each baseline value is a hash representation of a revision (build), pointing to a specific location in Git's history log. The result is a fine-grained string that comprises the committed code churns, where each LOC in the churn is compiled with its: 1) filename, 2) physical file path, 3) test case verdict, and 4) baseline hash code. The resulting string is then arranged in a table-like format and written to a csv file, named 'Lines of Code' in Figure 1.
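Step 1 can be approximated with a short script that diffs each pair of consecutive baselines and records the added lines together with the verdict of the corresponding test execution. The sketch below is not the company's actual extraction program: the function names, the CSV layout, and the assumption that the execution list is given as dictionaries with 'baseline', 'test_case' and 'verdict' keys (mirroring Table 1) are ours, and a plain `git diff` stands in for whatever Git tooling is used in practice.

```python
import csv
import subprocess

def diff_lines(repo_path, baseline_a, baseline_b):
    """Return the added lines (the churn) between two baselines as
    (file path, line content) pairs, using a plain `git diff`."""
    out = subprocess.run(
        ["git", "-C", repo_path, "diff", "--unified=0", baseline_a, baseline_b],
        capture_output=True, text=True, check=True).stdout
    current_file, churn = None, []
    for line in out.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        elif line.startswith("+") and not line.startswith("+++"):
            churn.append((current_file, line[1:]))
    return churn

def extract_churns(repo_path, executions, out_csv="lines_of_code.csv"):
    """`executions` is the time-ordered list from Table 1, assumed here to be
    dictionaries with 'baseline', 'test_case' and 'verdict' keys."""
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "path", "content", "verdict", "baseline"])
        for previous, current in zip(executions, executions[1:]):
            for path, content in diff_lines(repo_path,
                                            previous["baseline"],
                                            current["baseline"]):
                writer.writerow([path.split("/")[-1], path, content,
                                 current["verdict"], current["baseline"]])
```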
4.2 Textual Analysis and Features Extraction (Step 2)

The result of the extract from Git is saved as an array (code churn, filename, physical file path, test case verdict and baseline). This file is the input to our textual analysis and feature extraction. The textual analysis and feature extraction use each line from the code churn and:

– create a vocabulary for all lines (using the bag of words technique, with a specific cut-off parameter),
– create a token for the words that are used seldom (i.e. fall outside of the frequency defined by the cut-off parameter of the bag of words),
– find a set of predefined keywords in each line,
– check, for each word in the line, whether it is part of the vocabulary, whether it should be tokenized, or whether it is a predefined feature.

An example input is presented in Table 2. The input contains example code in C.

Table 2. Input to the textual analysis and feature extraction

  Filename       Path        Content                                         Hash
  firstFile.c    c:/folder   if (condition == true) printf('Hello World');   aa00111
  firstFile.c    c:/folder   printf('\n');                                   aa00111
  secondFile.c   c:/folder   int i = 10;                                     aa00111

For the textual analyses, we can pre-define (arbitrarily for this example) two features: "if" and "int". The bag of words analysis also found the word "printf" to be frequent. It has also defined the following tokens:

– "a" – to denote the words (of any length) that contain only lowercase letters (e.g. "condition"),
– "Aa" – to denote the words that start with a capital letter and continue with lowercase letters (e.g. "Hello"),
– "0" – to denote the numbers, i.e. sequences of digits of any length (e.g. "10").

The manual features and the bag of words results are then used as features in the feature extraction, shown in Table 3, which corresponds to the input from Table 2.

Table 3. Output from the feature extraction algorithm

  Filename       Path        if   int   a   Aa   Content
  firstFile.c    c:/folder   1    0     3   2    if (condition == true) printf("Hello World");
  firstFile.c    c:/folder   0    0     2   0    printf("\n");
  secondFile.c   c:/folder   0    1     1   0    int i = 10;

Table 3 is a large array of numbers, each representing the number of times a specific feature is present in the line. This way of extracting information about the source code is new in our approach, compared to the most common approaches of analyzing code churns. Compared to the other approaches, MeBoTS recognizes what is written in the code, without understanding the syntax or semantics of the code. This means that we can analyze each line of code separately, without the need to compile or parse the code. It also means that we can take code churns from different files in the same baseline and analyze them together. MeBoTS also goes beyond approaches such as Nagappan and Ball [14], which characterize churns in terms of metrics like the number of churned lines or churn size.
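The feature extraction of Step 2 can be approximated with scikit-learn's CountVectorizer. The sketch below is a simplified stand-in for CCFLEX: the cut-off value, the predefined keyword list, the fallback token class "X", and the column names of the Step 1 CSV are assumptions made for illustration.

```python
import re
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

PREDEFINED = ["if", "int"]   # manually predefined keyword features (example values)
CUT_OFF = 5                  # assumed frequency cut-off for the vocabulary

def words(line):
    return re.findall(r"[A-Za-z_]\w*|\d+", str(line))

def shape_token(word):
    """Token classes for rare words, as in the bullet list above."""
    if word.isdigit():
        return "0"                       # numbers
    if word[0].isupper() and word[1:].islower():
        return "Aa"                      # Capitalized words
    if word.islower():
        return "a"                       # lowercase-only words
    return "X"                           # fallback class (our assumption)

lines = pd.read_csv("lines_of_code.csv")          # output of Step 1
counts = Counter(w for line in lines["content"] for w in words(line))
vocabulary = {w for w, c in counts.items() if c >= CUT_OFF} | set(PREDEFINED)

def encode(line):
    """Keep frequent and predefined words; replace the rest by shape tokens."""
    return " ".join(w if w in vocabulary else shape_token(w) for w in words(line))

vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")
matrix = vectorizer.fit_transform(lines["content"].map(encode))
features = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
features["verdict"] = lines["verdict"]            # dependent variable for Step 3
```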
4.3 Training and Applying the Classifier Algorithm (Step 3)

We exploit the set of extracted features provided by the textual analyzer in step 2 as the independent variables and the verdict of the executed test cases as the dependent variable, which is a binary representation of the execution result (passed or failed). The MeBoTS method uses a second Python program that utilizes and trains an ML model to classify test case verdicts. The program reads the BOW vector space file in a sequence of chunks, merging the extracted feature vectors and the verdicts vector into a single data frame that gets split into a training and a testing set before it is fed into the models for training. Table 4 shows an excerpt of the generated data frame.

Table 4. Input to the Classifier Model

  File Name    Line Number   F1   F2   F3   F4   F5   ..   F500   Verdict
  firstFile    1             0    0    6    1    0    ..   0      PASSED
  firstFile    2             0    0    5    3    2    ..   0      PASSED
  secondFile   1             0    0    6    1    0    ..   0      FAILED

5 Research Design

5.1 Collaborating Company

The study has been conducted at an organization belonging to a large infrastructure provider company. The organization develops a mature software-intensive telecommunication network product. The organization consists of several hundred software developers, organized in several empowered agile teams, spread over a number of continents. Given that they have been early adopters of lean and agile software development methodologies, they have become mature in these areas of work. They have also implemented CI and continuous deliveries. The organization is also mature with regard to measuring. For instance, every agile team, as well as leading functions/roles, uses one or more monitors to display status and progress in various development and devops areas. A well-established and efficient measurement infrastructure automatically collects and processes data, and then distributes the information needed by the organization.

5.2 Dataset

The data set provided by company A contained historical test case execution results for a mature software product that has evolved for almost a decade. The analyzed product consisted of over 10k test cases and several million lines of code written in the C language. We decided to test MeBoTS on a set of randomly selected tests that, presumably, reacted to changes in the source code during different CI cycles. Our selection of test cases was based on the granularity of test executions whose verdicts changed from one state to another (see Table 1). The extracted data set belonged to twelve test cases that were executed 82 times during different CI cycles. The size of the extracted churns was 1.4 million lines of code, among which 618k lines were labeled as passed and 776k as failed.

To better understand the shape of the data set, we visually inspected the size of code churns covered by each test execution. The scatter plot in Figure 2 shows the distribution of the 82 extracted test executions, belonging to the twelve test cases. Each mark on the plot represents one test case execution. The x-axis represents the number of lines in the code churn the test case was executed on. The y-axis represents the overall number of test executions for the executed test case. Test executions of the same test case are marked with the same symbol.

Fig. 2. Churn Size per Test Execution Plot.

The visual inspection of the scatter plot suggests that our data set comprised churns of varying sizes (large and small). We interpreted this distribution with caution, being uncertain whether large churns contain additional noise that would adversely affect the training of the ML models. As a result, we decided to curate the original data set by filtering out tests that were executed on churns whose total size exceeded 110k lines.
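The curation step boils down to a size filter over test executions. A possible sketch is shown below; it assumes that the Step 1 output carries a column identifying which test execution each churned line belongs to (here called execution_id, a hypothetical name not used in the paper).

```python
import pandas as pd

MAX_CHURN_SIZE = 110_000   # size threshold used for the curated data set

data = pd.read_csv("lines_of_code.csv")
# Total churn size per test execution, counted in lines of code.
churn_sizes = data.groupby("execution_id").size()
small = churn_sizes[churn_sizes <= MAX_CHURN_SIZE].index
curated = data[data["execution_id"].isin(small)]
curated.to_csv("curated_lines_of_code.csv", index=False)
```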
The curated data set, visualized in Figure 3, comprised 290k lines of code and contained a fairly balanced representation of the binary classes (passed and failed), with 110k lines belonging to the passed class and 180k lines belonging to the failed class.

Fig. 3. Churn Size per Test Execution After Curation Plot.

The two data sets described above were used for training the ML models. The first data set was used in the first phase, whereas the second, curated data set was used in the second phase of ML training, and ultimately became our focus due to its data size homogeneity.

5.3 Evaluating and Selecting a Classification Model

To select the most suitable model for classifying test verdicts, we selected five different ML models and trained them sequentially. The five models are: 1) Decision Tree, 2) Random Forest, 3) AdaBoost, 4) Multilayer NN, and 5) CNN. The choice of the three tree-based models was due to their low computational cost and white-box nature, whereas the ANN models were selected for their ability to abstract a large and complex number of features.

Each of the five classification models for test verdicts uses i) the historical test case verdicts and ii) the feature vectors in the bag of words table as the basis for prediction. The evaluation was done in two iterations. In the first iteration, the five models were trained on the original data set, which contained a mix of large and small churns, comprising a total of 1.4m lines of code for 500 feature vectors. The second iteration involved training the models on the curated data set, which contained almost 290k lines for the same 500 features as in the first iteration. Both data sets, curated and original, were split as follows: 70% for the training set and 30% for the test set.

To save the long run time required by automated hyper-parameter tuning tools such as grid and random search, the configuration of the models was done manually. Table 5 summarizes the hyper-parameters used for training the models. We used the implementations of the Decision Tree, Random Forest, and AdaBoost algorithms available in the Python scikit-learn library [19], and used the Keras library [7] for the implementation of the Multilayer NN and CNN. The hyper-parameters of the three tree-based models were kept in their default state as found in the scikit-learn library. The only alterations made were the 'random state' value for the Decision Tree and the number of trees (n_estimators) for both AdaBoost and Random Forest. With respect to the ANN models, the architecture of the multilayer ANN was represented with three sequential dense layers: one input layer, one hidden layer, and one output layer. For the CNN, the stack of layers comprised a Reshape layer, a Convolution layer, a MaxPooling layer, and four Dense layers. The learning in both models was induced over 100 epochs (iterations), as can be seen in Table 5.

Table 5. The Evaluated Models and Their Hyper-Parameters

  Classifier      Random State   Number of Trees   Number of Layers   Epochs
  Decision Tree   123            -                 -                  -
  Random Forest   -              50                -                  -
  AdaBoost        -              100               -                  -
  Multilayer NN   -              -                 3                  100
  CNN             -              -                 8                  100
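The tree-based part of this setup can be reproduced in a few lines of scikit-learn. The sketch below uses the hyper-parameters from Table 5 and the 70/30 split; it is only an outline of the training program, not the program itself — the Keras models are omitted, the input file name is hypothetical, and encoding a passing verdict as the positive class follows the error categories defined next.

```python
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("curated_features.csv")        # hypothetical Step 2 output
X = data.drop(columns=["verdict"])
y = (data["verdict"] == "PASSED").astype(int)     # positive class = passing test

# 70% training, 30% test, as in both evaluation iterations.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=123)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "Random Forest": RandomForestClassifier(n_estimators=50),
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print(f"{name}: precision={precision_score(y_test, predicted):.3f}, "
          f"recall={recall_score(y_test, predicted):.3f}")
```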
The performance of the classifiers was evaluated using two simple retrieval measures: precision and recall. These measures are based on the following four categories of errors:

– True positives: correct predictions of test executions that pass
– True negatives: correct predictions of test executions that fail
– False positives: incorrect predictions of test executions that pass
– False negatives: incorrect predictions of test executions that fail

Precision is the number of correctly predicted tests divided by the total number of predicted tests, calculated as follows:

precision = |TruePositive| / (|TruePositive| + |FalsePositive|)

Recall is the number of correctly predicted tests divided by the total number of tests that should have been positive:

recall = |TruePositive| / (|TruePositive| + |FalseNegative|)

While recall and precision relate to one another, precision is a measure of exactness, whereas recall is a measure of completeness, indicating in our case the percentage of all failed tests that were predicted. Since the goal of this study is to reduce the amount of test executions without altering the effectiveness of testing, we needed to minimize the risk of missing tests that will actually fail and, therefore, accept some false alarms in the prediction of failed tests. Thus, we believe that the measure of the model's precision is more important than its recall.

6 Results

To answer the research question, we present the results of using MeBoTS for predicting test case verdicts. We interpret the results of the analysis in light of the reported rates of precision and recall and the values of the four categories of errors: true negatives, false positives, false negatives, and true positives. Tables 6 and 7 list the results of training the five models using the original and curated data sets.

6.1 Training the Models on Churns of Varying Sizes

The evaluation of the five models in the first iteration reports a mean precision of 47% and a mean recall of 17%, suggesting that all models achieved low performance. The best result was obtained by the Multilayer NN model with a precision rate of 50.3% and a recall rate of 31%. The precision and recall rates for the five models can be seen in Figure 4. The interpretation of these values suggests that out of the 406,948 lines of code that actually triggered a test case failure, the model correctly predicted test failure for 226,994 lines, whereas it could correctly predict a passing test verdict for 5,758 out of 11,428 lines. Similarly, the results of the four categories of errors for the other models can be seen in Table 6 and interpreted in the same fashion.

Table 6. Models Evaluation before Data Curation

  Model            Precision   Recall   True Neg   False Pos   False Neg   True Pos
  Decision Trees   43.9%       17.2%    191,883    40,781      153,709     32,003
  Random Forest    44%         17.7%    190,864    41,800      152,794     32,918
  AdaBoost         51.9%       6.5%     221,472    11,192      173,590     12,122
  Multilayer NN    50.3%       31%      226,994    5,670       179,954     5,758
  CNN              50.7%       16.5%    202,745    29,919      154,890     30,822

Fig. 4. Precision and Recall of The Models Before Data Curation

6.2 Training the Models on Churns of Small Sizes

The second iteration of analysis involved training the same set of models on the curated data set, which excluded tests covering churns larger than 110k lines of code. The results, shown in Figure 5, indicate an improvement in precision and recall when compared to the results in the first iteration for the same types of models. Table 7 reports the results from the second round of training, showing a mean precision of 70% and a mean recall of 44.5%.
The Multilayer NN model performed best in prediction: it correctly predicted 48,755 of the 67,363 lines in the test set that actually triggered a test case failure.

Table 7. Models Evaluation after Data Curation

  Model            Precision   Recall   True Neg   False Pos   False Neg   True Pos
  Decision Trees   68%         48.4%    46,445     7,548       17,011      15,981
  Random Forest    67.9%       49.5%    46,252     7,741       16,637      16,355
  AdaBoost         69%         36%      48,656     5,337       21,075      11,917
  Multilayer NN    73.3%       43.6%    48,755     5,238       18,608      14,384
  CNN              71.75%      44.9%    48,162     5,831       18,179      14,813

6.3 Implication

The results show that we can predict the verdict of a test case with a precision of 73.3%. This means that we can use the results to reduce the test suite by excluding tests that are predicted to pass, but we need to be aware that there is a 26.7% probability that we miss a test case failure. This means that the reduction of the test suite comes with a cost. This cost can be reduced, for example, if we collect the test cases which were not executed and execute them with a lower frequency (e.g. during a nightly test suite instead of hourly builds).

Fig. 5. Precision and Recall of The Models After Data Curation

7 Validity analysis

When analyzing the validity of our study, we used the framework recommended by Wohlin et al. [24]. We discuss the threats to validity in two categories: internal and external.

Typically, a number of internal threats to validity emerge in studies that involve designing an ANN architecture, namely that the number of hyper-parameters to tune is so large that we cannot cover all combinations to decide on the best configuration for the network. To minimize this threat, we used two different multilayer neural networks and trained them during the two iterations of analysis. This provided us with a sanity check on whether the networks produce similar results. Another threat to internal validity is in the random selection of test cases. There is a chance that the extracted test executions contain one or more tests that failed due to factors that do not pertain to functional deficiencies, but due to, for instance, an environment upgrade or machinery failure at execution time. Similarly, there is a chance that the extracted test executions may have failed as a result of defects in the test script code and not the base code. In order to minimize this threat, we collected data for multiple test cases, thus minimizing the probability of identifying test cases which are not representative.

The major threat to external validity for this study comes from the number of extracted tests that were used for training the classifiers. We only studied one company and one product and a limited number of test cases. This was a design choice, as we wanted to understand the dynamics of test execution and be able to use statistical methods alongside the machine learning algorithms. However, we are aware that the generalization of the results to different types of systems requires further investigation using tests and churns from different systems.

8 Recommendations

In this section, we provide our recommendations for practitioners who would like to utilize MeBoTS for early prediction of test case verdicts.

– The choice between deep learning and tree-based models for solving this supervised ML problem does not lead to better prediction performance. For this reason, we suggest the use of Decision Trees, since they require less computational time and provide insight into how the results of classification were derived.
– We suggest only utilizing code churns that are homogeneous and small in size prior to applying feature extraction with BOW. Small code churns introduce less noise and therefore the quality of the predictions is higher. This can also save practitioners ample time on data curation.
– We recommend that practitioners only extract historical test executions that have failed due to reasons related to functional defects for training the ML model. This knowledge can be obtained from testers/developers who are familiar with the recurrent issues in the source code.

9 Conclusion and Future Work

This paper has presented a method (MeBoTS) for achieving early prediction of test case verdicts by training a machine learning model on historical test executions and code churns. We have evaluated the method using two data sets, one containing a mix of large and small churns, and a second containing only small churns. The results from training the models on small churns revealed a precision rate of 73% and a recall of 43.6%, suggesting that the application of the method is promising, yet more investigation is required to validate the findings. Moreover, contrary to other existing methods that use statistical correlations for predicting test verdicts, the main advantage of MeBoTS is the ability to predict verdicts of new code changes as they emerge during development and before they get integrated into the main branch.

We believe that the results of this study open new directions for studies to investigate the effectiveness of MeBoTS on different types of systems, using a larger set of small churns with more test case executions. Finally, studies that investigate the impact of using different feature extraction techniques, such as word embeddings, are encouraged, to identify any changes in the overall performance of MeBoTS.

Acknowledgment

This research has been partially carried out in the Software Centre, University of Gothenburg, and Ericsson AB.

References

1. Awad, M., Khanna, R.: Efficient learning machines: theories, concepts, and applications for engineers and system designers. Apress (2017)
2. Azizi, M., Do, H.: A collaborative filtering recommender system for test case prioritization in web applications. In: Proceedings of the 33rd Annual ACM Symposium on Applied Computing. pp. 1560–1567. SAC '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3167132.3167299
3. Beningo, J.: Using the static keyword in c. https://community.arm.com/developer/ip-products/system/b/embedded-blog/posts/using-the-static-keyword-in-c (2014)
4. Busjaeger, B., Xie, T.: Learning for test prioritization: an industrial case study. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. pp. 975–980. ACM (2016)
5. Çalikli, G., Staron, M., Meding, W.: Measure early and decide fast: transforming quality management and measurement to continuous deployment. In: Proceedings of the 2018 International Conference on Software and System Process. pp. 51–60. ACM (2018)
6. Chen, J., Bai, Y., Hao, D., Xiong, Y., Zhang, H., Xie, B.: Learning to prioritize test programs for compiler testing. In: Proceedings of the 39th International Conference on Software Engineering. pp. 700–711. IEEE Press (2017)
7. Chollet, F., et al.: Keras. https://keras.io (2015)
8. Chollet, F.: Deep Learning with Python. Manning (2017)
9. Durelli, V.H., Durelli, R.S., Borges, S.S., Endo, A.T., Eler, M.M., Dias, D.R., Guimarães, M.P.: Machine learning applied to software testing: A systematic mapping study. IEEE Transactions on Reliability (2019)
10. Gondra, I.: Applying machine learning to software fault-proneness prediction. Journal of Systems and Software 81(2), 186–195 (2008)
11. Graves, T.L., Karr, A.F., Marron, J.S., Siy, H.: Predicting fault incidence using software change history. IEEE Transactions on Software Engineering 26(7), 653–661 (July 2000). https://doi.org/10.1109/32.859533
12. Géron, A.: Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly (2015)
13. Knauss, E., Staron, M., Meding, W., Söder, O., Nilsson, A., Castell, M.: Supporting continuous integration by code-churn based test selection. In: Proceedings of the Second International Workshop on Rapid Continuous Software Engineering. pp. 19–25. IEEE Press (2015)
14. Nagappan, N., Ball, T.: Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th International Conference on Software Engineering. pp. 284–292. ACM (2005)
15. Noor, T.B., Hemmati, H.: A similarity-based approach for test case prioritization using historical failure data. In: 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE). pp. 58–68. IEEE (2015)
16. Noor, T.B., Hemmati, H.: Studying test case failure prediction for test case prioritization. In: Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering. pp. 2–11. ACM (2017)
17. Ochodek, M., Staron, M., Bargowski, D., Meding, W., Hebig, R.: Using machine learning to design a flexible LOC counter. In: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), IEEE Workshop on. pp. 14–20. IEEE (2017)
18. Palma, F., Abdou, T., Bener, A., Maidens, J., Liu, S.: An improvement to test case failure prediction in the context of test case prioritization. In: Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering. pp. 80–89. PROMISE '18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3273934.3273944
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
20. Saxena, R.: Introduction to decision tree algorithm (2017), https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
21. Shin, Y., Meneely, A., Williams, L., Osborne, J.A.: Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Transactions on Software Engineering 37(6), 772–787 (Nov 2011). https://doi.org/10.1109/TSE.2010.81
22. Spieker, H., Gotlieb, A., Marijan, D., Mossige, M.: Reinforcement learning for automatic test case prioritization and selection in continuous integration. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. pp. 12–22. ACM (2017)
23. Ståhl, D., Bosch, J.: Experienced benefits of continuous integration in industry software product development: A case study. In: The 12th IASTED International Conference on Software Engineering (Innsbruck, Austria, 2013). pp. 736–743 (2013)
24. Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A.: Experimentation in software engineering. Springer Science & Business Media (2012)