=Paper=
{{Paper
|id=Vol-2767/paper06
|storemode=property
|title=An Evaluation of Machine Learning Methods for Predicting Flaky Tests
|pdfUrl=https://ceur-ws.org/Vol-2767/05-QuASoQ-2020.pdf
|volume=Vol-2767
|authors=Azeem Ahmad,Ola Leifler, Kristian Sandahl
|dblpUrl=https://dblp.org/rec/conf/apsec/AhmadLS20
}}
==An Evaluation of Machine Learning Methods for Predicting Flaky Tests==
8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) An Evaluation of Machine Learning Methods for Predicting Flaky Tests Azeem Ahmada , Ola Lei�era and Kristian Sandahla a Linköping University, 581 83 Linköping, Sweden Abstract In this paper we have investigated as a means of prevention the feasibility of using machine learning (ML) classi�ers for �aky test prediction in project written with Python. This study compares the predictive accuracy of the three machine learning classi�ers (Naive Bayes, Support Vector Machines, and Random Forests) with each other. We compared our �ndings with the earlier investigation of similar ML classi�ers for projects written in Java. Authors in this study investigated if test smells are good predictors of test �akiness. As developers need to trust the predictions of ML classi�ers, they wish to know which types of input data or test smells cause more false negatives and false positives. We concluded that RF performed better when it comes to precision (> 90%) but provided very low recall (< 10%) as compared to NB (i.e., precision < 70% and recall >30%) and SVM (i.e., precision < 70% and recall >60%). Keywords Improve Software Quality, Flaky Test Detection, Machine Learning Classi�ers, Experimentation, Test Smells 1. Introduction dence that the main reasons for test �akiness are spe- ci�c test smells. Luo et al. suggested that "developers Developers need to ensure that their changes to the should avoid speci�c test smells that lead to test �ak- code base do not break existing functionality. If test iness". Authors in [2] investigated the question: "To cases fail, developers expect test failures to be con- what extent can �aky tests be explained by the presence nected to the changes. Unfortunately, some test fail- of test smells?" They concluded that the "cause of 54% ures have nothing to do with the code changes. Devel- of the �aky tests can be attributed to the characteristics opers spend time analyzing changes trying to identify of the co-occurring test smell". the source of the test failure, only to �nd out that the Mapping test smells to �aky test resemble the prob- cause of the failure is test �akiness (TF). Many stud- lem of mapping words to spam/ham email. Certain ies [1, 2, 3, 4] have been conducted to determine the words (i.e., sale, discount etc.) are more frequent in root causes of test �akiness. These studies concluded spam emails. Many studies [6, 7, 8, 9, 10, 11, 12, 13, that the main root cause of TF is the test smells. Test 14, 15] have been conducted to predict email class (i.e., smells are poorly written test cases and their presence spam or ham) based on email contents. We adopted a negatively a�ect the test suites and production code similar approach in this study to determine the �ak- or even the software functionality [5]. Another de�- iness of test cases based on the test case code. Ma- nition is "poor design or implementation choices applied chine Learning approaches have been widely studied by programmers or testers during the development of test and there are lots of algorithms that can be used in cases" [2] . Asynchronous wait, input/output calls, and e-mail classi�cation including Naive Bayes [16][17], test order dependency are some of the test smells that Support Vector Machines [18][19][15, 14], Neural Net- have been found to be the most common causes of TF works [20][21], K-nearest neighbor [22]. [1]. The results presented by Luo et al. [1] were par- Recently, Pinto et al. evaluated �ve machine learn- tially replicated by Palomba and Zaidman [2], leading ing classi�ers (Random Forest, Decision Tree, Naive to the conclusion that the most prominent causes of Bayes, Support Vector Machine, and Nearest Neigh- TF are test smells such as asynchronous wait, concur- bour) to generate �aky test vocabulary [23]. They con- rency, and input output issues. There is strong evi- cluded that Random Forest and Support Vector Ma- chine provided best prediction of �aky tests. The in- 8th International Workshop on Quantitative Approaches to Software vestigated test cases were written in Java and the au- Quality in conjunction with the 27th Asia-Paci�c Software thors concluded that: "future work will have to investi- Engineering Conference (APSEC 2020) Singapore, 1st December 2020 gate to what extent their �ndings generalize to software � azeem.ahmad@liu.se (A. Ahmad); ola.lei�er@liu.se (O. Lei�er); kristian.sandahl@liu.se (K. Sandahl) written in other programming languages [23]. � 0000-0003-3049-1261 (A. Ahmad) In this study, we implemented supervised ML classi- © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). �ers to detect if the test case is �aky or not based on the CEUR Wor Pr ks hop oceedi ngs ht I tp: // ceur - SSN1613- ws .or 0073 g CEUR Workshop Proceedings (CEUR-WS.org) 37 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) contents of a test cases written in Python. We com- categories of classi�cation and can �nd the best hyper- pared our �ndings with what was presented by Pinto plane to partition a sample space [15]. et al. [23]. We looked for evidence if machine learning RF is an ensemble classi�cation method (a technique classi�ers are applicable in predicting �aky tests and that combines several base models to produce an opti- the results can be generalized to test cases written in mal predictive model) suitable for handling problems other languages. In addition to this, our unique contri- that involve grouping data into di�erent classes. RF bution is to investigate if test smells are good predic- predicts by using decision trees. Trees are constructed tors of test �akiness. Through manual investigation of during training which can later be used for class pre- false positives and false negatives, we concluded a list diction. There is a vote associated with each tree and of test smells that are strong and weak predictors of once the class vote has been produced for all individ- test �akiness. We investigated the following research ual trees, the class with the highest vote is considered questions in this study. to be the output. RQ1: What are the predictive accuracy of Naive Bayes, 2.2. Performance Metrics and Support Vector Machine and Random Forest concerning Parameters Tuning �aky test detection and prediction? RQ2: To what extent the predicting power of machine To evaluate the predictive accuracy of classi�ers, accu- learning classi�ers vary when applied on software writ- racy as the only performance indices is not su�cient ten in other programming language? [16]. We must consider precision, recall, F1-score, ROC RQ3: What can we learn about the predictive power of curve, false positives and false negatives [16]. There is test smells using machine learning classi�ers mentioned always some cost associated with false positives and in RQ1? false negatives. When a non �aky test wrongly clas- si�ed as �aky, it gives rise to a some what insigni�- cant problem, because an experienced user can bypass 2. Data Set Description and the warning by looking at test case code. In contrast, Prepossessing when a �aky test is wrongly classi�ed as non �aky test, this is obnoxious, because it indicates the test suite still We wrote a script to extract the contents of all test have test cases whose outcome cannot be trusted. cases from open-source projects, mentioned in Table The experiment started with the implementation of 1. After the test case content’s extraction, we checked simple NB without Laplace smoothing. The results did which of the test cases, in our database, has been men- not provide good accuracy or precision, because with- tioned in [24] as �aky. After this mapping, we �nalized out Laplace smoothing, the probability of appearing a a database with the project name, test case name, test rare test smell (i.e., test smell that was not in the train- case content and a label. There are many keywords in ing set) in the test set is set to 0, given the formula = / the test case code that are irrelevant for the identi�- cation of test �akiness. We performed extensive data cleaning such as removing punctuation marks, digits where the is the probability that an individual test and speci�c keywords (i.e., int, string, array, assert*) smell is present in a �aky test, represents the num- as well as converting text to lower case. ber of times that particular test smell appeared in a test case and represents the number of times that 2.1. Classifiers: test smell appeared in any test case. Laplace smooth- ing refers to the modi�cation in the equation: = ( + )/( + ) An NBC, �rst proposed in 1998, is a probabilistic model which can determine the outcome (i.e., �aky or not tents of its features (i.e., test case code). In our case, where we set the = 1 so that classi�er adds 1 to the �aky) of an instance (i.e., test case) based on the con- the outcome of NBC is binary. NBC is widely applied probability of rare test smells that were not present in in classi�cation and known to obtain excellent results. the training set. Another step is to identify the thresh- [25]. old (i.e., 0.0 - 1.0) which will increase the predictive ac- The attractive feature of SVM is that it eliminates curacy of the outcome. As far as SVM was concerned, the need for feature selections, which makes spam clas- although the feature data set space was linear, we de- si�cation easy and faster [14]. SVM deals with the dual cided to use both kernels (i.e., linear and poly) for the sake of experiment. For random forest, we used ntree 38 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) Table 1 peared 1150 times in �aky tests and 52 times in non- Open-source project names provided by [24] with number �aky tests. of total test cases and flaky tests Figure 1 (A) represents the ROC curve [26] concern- ing NBC with Laplace smoothing denoted as NBL with Project Name Total Number of Flaky TCs Tests di�erent threshold (i.e., from 0.0 to 1.0). We conducted di�erent experiments with di�erent training and test apache-qpid-0.18 2357 284 hibernate 4 3231 273 data sets such as 50/50, 60/40, 70/30, 80/20 and 90/10. apache-wicket-1.4.20 1250 216 We found similar values for k-fold cross validation. apache-karaf-2.3 163 102 ROC curve provides a comparison between sensitivity apache-struts 2.5 2346 60 and speci�city helping in organizing classi�ers and vi- apache-derby-10.9 3832 40 apache-lucene-solr-3.6 764 7 sualizing their performance [26]. Sensitivity also known apache-cassandra-1.1 523 4 as the true positive rate represents a bene�t of predict- apache-nutch-1.4 7 4 ing �aky tests correctly and speci�city also known as apache-hbase-0.94 29 2 false positive rate represents the cost of predicting non apache-hive-10.9 23 2 jfreechart-1.0.18 2292 0 �aky tests as �aky tests. In the case of false positive, developers need to spend e�ort and time, just to �nd between 300 - 700 as well as restricting number of vari- out that this is a classi�er mistake and the test case is ables available for splitting at each tree node known as not �aky. The optimal target, in the ROC curve, is to mtry between 25 and 100. rise vertically from origin to the top left corner (higher true positive rate) as soon as possible because then the classi�er can achieve all true positives with the cost 3. Results of committing a few false positive. The diagonal line, in Figure 1 (A), represents the strategy of randomly This section discusses the performance of NBL, SVM guessing the outcome. Any classi�er that appears in and RF with di�erent parameters. We compared our the lower right triangle performs worse than a ran- results with the �ndings of Pinto et al. to discuss how dom guessing and we can see that NBL lies in the up- results vary between Java and Python projects. We per left triangle. Looking at 1 (A), NBL with 70/30 data also discussed why some classi�ers do not perform as partition is suitable to proceed further with 0.4 prob- expected and what can we learn about the predictive ability score. NBL, as shown in 1 (A), has stopped is- power of test smells for test �akiness detection and suing positive classi�cation (i.e., �aky test prediction) prediction. around 0.76 - 0.87 threshold. After 0.87, it commits more false positive rate. 3.1. RQ1: Performance of Naive Bayes We tuned di�erent parameters in NBL, SVM and RF Classifier, Support Vector Machine before conducting further experiments. We do not in- and Random Forest tend to provide the results of all experiments because those experiments were only conducted to �nd the op- Table 2 shows the 20 features with the highest infor- timal parameters. The rest (i.e., simple NB, SVM with mation gain together with their frequency with respect radial and sigmiod kernels) were not included in fur- to �aky and non-�aky tests. We assigned the features ther experiments and discarded. Figure 1 (A-E) pro- to the categories presented by Luo et al. in [1]. We vides comparisons of NBL, SVM-Linear and SVM-Poly manually traversed the code of �aky and non-�aky (i.e., di�erent kernels) for accuracy, precision, recall tests to understand the context and how features were and F1-score. All classi�ers have achieved good ac- used in the tests to assign categories. The top fea- curacies ranging from 93% - 96%. NBL outperformed ture "conn" appeared in 1361 �aky tests and only 15 SVM although the di�erence between them is not dra- non-�aky tests. This feature is associated with exter- matic. Looking only at the accuracy results of classi- nal connection to input/output devices and lies under �ers can be deceiving. The important factor for classi- the category of "IO", presented by Luo et. al in [1]. �er selection is to ask the right question and motivate The second top feature is "double" which appeared in the choice of using speci�c classi�er such as are we 1190 �aky tests and 12 non-�aky tests assigned to the interested in detecting �aky tests correctly (i.e., category of "IO" followed by "�oating points opera- precision) or marking a non �aky test as �aky tions". The top 3rd feature "tabl" was related to table is not cost e�ective (i.e, recall). It is important to creation during runtime for databases queries and ap- look at precision, recall and accuracy all together for classi�er selection. We can assume that practitioners 39 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) A 1.00 B C ● ● ● ● 60 name ● Precision Values (%) Accuracy Values (%) 0.75 95 NBL−50/50L Classifier 50 Classifier sensitivity NBL−60/40L ● NBLaplace=1 ● NBLaplace=1 0.50 40 NBL−70/30L SVM−Linear SVM−Linear 94 ● NBL−80/20L ● ● SVM−Poly 30 SVM−Poly 0.25 ● ● NBL−90/10L 20 0.00 93 0.00 0.25 0.50 0.75 1.00 50/5060/4070/3080/2090/10 50/5060/4070/3080/2090/10 1 − specificity Data Partitions Data Partitions D 60 E 60 50 50 Recall Values (%) Classifier Classifier F1 Score ● NBLaplace=1 ● NBLaplace=1 40 40 SVM−Linear ● SVM−Linear ● ● ● SVM−Poly ● SVM−Poly 30 ● 30 ● ● ● ● 20 20 50/5060/4070/3080/2090/10 50/5060/4070/3080/2090/10 Data Partitions Data Partitions Figure 1: Performance comparison among classifiers. (A) represents the ROC curve of NBL classifier with di�erent data partition and probability score. (B-E) represents the accuracy, precision, recall and F1-score of di�erent classifiers with a di�erent data partition, respectively. are more interested in precision than recall because Table 2 the test suite size, in many organizations, is very large Top 20 frequented features and assigned category and they cannot inspect all test cases. In this partic- Features Frequency Assigned category from ular case, any classi�er that correctly �ag �aky tests Luo et. al [1] will be encouraged. Precision can answer the question; "If the �lter says this test case is �aky, what’s the new 28083 IO probability that it’s �aky?”. Figure 1 (C,D) provides assertequ 7967 - null 4721 - precision and recall values for NBL and SVM. It can be from 4719 - noticed that NBL precision is increasing (in C) with the string 4126 - sclose 3315 IO gradual decrease in recall (in D). NBL precision of 65% true 3154 - dictates that 35% of what was marked as �aky was not select 2842 - �aky. Recall is also lower in NBL as compared to SVM- for 2809 Unordered Collection Linear. SVM-Poly performs worst in terms of precision fals 2604 - not 2294 - and recall as expected due the fact that the input data int 2111 - set is not polynomial and is well suited for image pro- asserttru 1910 - cessing whereas linear kernel performs better for text tabl* 1677 IO should 1596 - classi�cation. doubl 1588 Floating point operations valu 1429 - F1-score, as presented in Figure 1 (E), is the har- expr 1322 - monic mean of precision and recall. F1-score is use- tcommit 1313 IO expcolnam 1211 IO ful and informative because of prevalent phenomenon of class imbalance in text classi�cation [27]. NBL is a suitable candidate although it has a lower F1-score as it requires high computation and are very sensitive to compared to SVM-Linear because NBL performs bet- noisy data [29]. ter with short documents as in our case, the training test case consists of 6-15 lines of code [28]. NBL pro- RF provides lesser classi�cation error and better F1- vides higher precision and lower recall as compared scores as compared to decision trees, NBL and SVM. to SVM-linear. Another disadvantage of SVM is that 40 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) Accuracy F1−Score Precision Recall ntree 75 700 Values 600 50 500 400 25 300 25 50 75 100 25 50 75 100 25 50 75 100 25 50 75 100 mtry Figure 2: Performance of RF with di�erent parameters (i.e., number of trees and mtry). The precision, in which we are most interested, is usu- feature deletion 2) it calculates approximation of im- ally better than that of SVM and NBL. Authors in [16] portant features for classi�cation and 3) it is very ro- also concluded that RF performs better than NBL and bust to noise and outliers [30]. Caruana in [17] com- SVM. The class outcomes are based on "votes" which pared 10 di�erent ML classi�ers and concluded that are calculated by each tree in the forest. The outcome decision trees and random forest outperform all other (i.e., �aky or not �aky) is selected based on the higher classi�ers for spam classi�cation. votes. Figure 2 presents the performance of RF with re- spect to selected metrics. mtry represents the number of variables randomly sampled as candidates at each 3.2. RQ2: Predicting Power of ML split while ntree is the number of trees to grow. There Classifiers with Respect to Other is no way to �nd an optimal mtry and ntree, so we ex- perimented with di�erent settings, as shown in Figure Languages 2. The mtry has a direct e�ect on precision and recall In comparison of our �ndings with what was presented as shown in Figure 2. With an increase in mtry, the by Pinto et al. [23], we observed two di�erences. First, precision is decreasing and recall in increasing; an un- the top 20 frequented features are very di�erent in wanted situation. The optimal value of mtry is 5 where both studies. Only one feature such as "tabl" marked precision is higher and recall is lower regardless of the as star (*) in Table 2 were similar in both the �ndings. number of trees. The change in mtry did not a�ect the However, we observed more features were related to accuracy but as we discussed earlier, we are not only "IO" output category, as presented in Table 2, which interested in accuracy but precision too. complemented the �ndings of Pinto et al. stating "that We performed several experiments to �nd optimal all projects manifesting �akiness are IO-intensive" [23]. parameters within a classi�er before comparing it to Second, we have a very lower precision, recall and f1- other classi�ers. After these experiments, we identi- score as compared to Pinto et al. except at a instance �ed three unique classi�ers with unique and optimal where random forest provided 0.92 precision. Table parameters. Since, we are most interested in higher 3 provides detail statistics of precision, recall, and f1- precision, we can see that RF with mtry = 5 and ntree=250 score of three algorithms for comparison. The algo- outperforms all other classi�ers only for precision. RF rithms on Python language continuously performed has achieved more than 90% precision with less than worst contrary to what pinto et al. claimed: "Although 10% recall. We did not achieve high precision (i.e., the studied projects are mostly written in Java, we do not >90%) in all classi�ers. NBL provides unexpected re- expect major di�erences in the results if another object- sults although it holds a good reputation in terms of oriented programming language is used instead, since detecting spam emails [29]. As compared to NBL and some keywords maybe shared among them" [23]. SVM, RF have distinct qualities such as 1) it can work We speculate that there could be several reasons as- with thousands of di�erent input features without any sociated with these performance reduction such as (1) 41 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) Table 3 such as what operating system they are running on Comparison of Precision, Recall and F1-Score between our and whether or not speci�c con�gurations should be findings (A) and Pinto et al. (B) deployed. Test smells that are classi�ed as strong pre- dictors are very useful with machine learning classi- Precision Recall F1-Score Di� Algo. A B A B A B �ers because they only exist in test case function as Random Forest 0.92 0.99 0.4 0.91 0.09 0.95 one unit and do not require additional information. Naive Bayes 0.62 0.93 0.15 0.8 0.24 0.86 Support Vector 0.51 0.93 0.61 0.92 0.57 0.93 4. Lesson Learned We implemented the code ourselves using R libraries ML and AI algorithms in recent years have established for aforementioned classi�ers whereas pinto et al. used a good reputation for predicting diseases based on symp- Weka [31] which is an open source machine learn- toms, spam emails based on email contents and many ing software that can be accessed through a graphi- more. We believe that given a proper input data set cal user interface, standard terminal applications [32], which clearly distinguishes between �aky and non �aky (2) Number of features were very high in the training tests, ML and AI can provide high prediction capabil- samples and in these cases other models should be con- ities saving e�ort, time and resources. We strongly sidered (i.e., regularized linear regression) that might believe that practitioners, during training of data set, performed better, (3) the versatility o�ered by param- should not consider complete test cases as an input but eter tunning can become problematic and require spe- only the test codes (i.e., only few lines) that reveal test cial considerations that can impact the classi�ers, etc. �akiness. It is inconclusive that predicting power of machine 3.3. RQ3: Test Smells Analysis and their learning vary with respect to software written in an- Predictive Power for Test Flakiness other languages. Investigation on Java test cases [23] revealed good results while �ndings for Python test Detection and Prediction cases performed unexpected, thus requiring more in- We investigated manually di�erent cases of true pos- vestigations whether lexical information can be traced itives (i.e., correct �aky test prediction), false positive to �akiness. (i.e., �aky test cases marked as non �aky) and false Async wait, precision, randomness and IO test smells negative (i.e., non �aky test cases marked as �aky) and are string predictors can be predicted by machine learn- true negatives (correct non �aky test prediction) to an- ing classi�ers with 100% precision because they only swer RQ3. We observed that it is not only the fre- exist in test case code and do not require additional in- quency of test smell that makes a test case �aky but formation from test class or operating system. Whereas its co-existence with the class code or external factors all other test smells mentioned in Table 4 are weak pre- such as operating systems or speci�c product. For ex- dictors of test �akiness and require additional sources ample, The test smell ’Conditional Test Logic’ as men- of information. We are only aware of test smells that tioned in [3] refers to nested and complex ’if-else’ struc- are investigated in open-source repositories and liter- ture in the test case. Depending on which branch of ature on test smells in closed-source software is scarce. ’if-else’ is executed, the system under test may require speci�c environment settings. Failing to set the envi- ronment, during di�erent executions, will �ip the test 5. Discussion and Implication case outcome, thus making it �aky. Valuable Indicators for Testers These classi�ers can After manual investigation of all true/false positives increase the awareness about �aky test vocabulary among and true/false negatives, we come up with a list of testers. When a new test is added to a test suite, it test smells that are strong or weak predictors of test will be easy to identify whether this test case contains �akiness, as shown in Table 4. Strong predictors refer speci�c test smells that were known to increase test to those test smells that existed in true positives and �akiness during previous executions. Testers can take true negatives cases whereas weak predictors only ex- advantage of these types of information to reduce test isted in false negatives and false positives. Test smells �akiness. Testers can easily identify test smells that that are classi�ed as weak predictors in this study are are independent of their environment with the help of still useful and can help in identi�cation of test �ak- Table 4. iness, but they are not useful with machine learning Precision Depends on Data Set: In the literature classi�ers because they require additional information 42 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) Table 4 Test Smells as Strong and Weak Predictors Together with Source of their Existence Test Smell Category Prediction Cate- Test Test Operating External Hardware/Product gory Case Class System Libraries Async wait Strong [ ] - - - - Precision (float operations) Strong [ ] - - - - Randomness Strong [ ] - - - - IO Strong [ ] - - - - Unordered Collection Weak [ ] [ ] [ ] - - Time Weak [ ] [ ] [ ] - [ ] Platform Weak [ ] [ ] [ ] [ ] - Concurrency Weak [ ] [ ] [ ] - - Test order dependency Weak [ ] [ ] [ ] [ ] [ ] Resource Leak Weak [ ] [ ] [ ] - - of ML, particularly with spam detection, it is acknowl- crease precision at the expense of recall. When en- edged that precision is a function of the combination countering ’false negative’, an experienced developer, of the classi�er and the data set under investigation. having su�cient knowledge of the test smells, will by- Classi�er’s precision, in isolation of data set, does not pass the outcome, however, with ’false positive’, de- make sense. The right question is "how precise a clas- velopers are unaware of the fact that test suite still si�er is for a given data set". Unfortunately, there contains �aky tests. The motivation of employing ML is no data available that provides test case contents classi�ers (i.e., higher precision - low recall vs balances and an associated label thus, limiting the use of ad- precision and recall) should be made clear before pro- vanced ML and AI algorithms. In addition to lack of ceeding with implementation. �aky test data, all research has been conducted with Multi-Factor Input Criteria for Flaky Test Detec- open-source software and we know a little about what tion: We observed that the ML algorithm should in- test smells are present in closed-source software. Ah- clude di�erent sources of information to increase pre- mad et. al. concluded that there are speci�c test smells dictive accuracy. These sources may include 1) assign- that are associated with the nature of the product [33] ing speci�c weight (i.e., in numbers) to speci�c test known as ’company-speci�c’ test smells. The classi�er smells or test code, 2) developer’s experience (i.e., new which are trained on a speci�c data set or a domain developer, unaware of the test design guidelines are cannot be generalized to be used with another data set more likely to write �aky tests), 3) company-speci�c or domain. There is a long road ahead to explore the test smells. best classi�er given di�erent data sets. Beyond Static Analysis of Test Smells and their Frequency: ML is capable of incorporating di�erent 6. Related Work sources of information to increase predictive accuracy Luo et al., in [1], investigated 52 open-source projects as compared to the limited experiment in this study and 201 commits and categorized the causes of test where we only utilized the frequency of test smells in case. Asynchronous wait (45%), concurrency (20%), the test case. During the investigation of the cases of and test order dependency (12%) were found to be the ’false negative’ and ’false positive’, it has been observed most common causes of TF. Palomba and Zaidman in that the frequency of test smells in the test case will not [2] partially replicated the results presented by Luo et be su�cient for prediction. Some test case code (i.e., al. concluding that the most prominent causes of TF seeds()) will cancel the e�ect of test smell (i.e., ran- are asynchronous wait, concurrency, and input output dom()), no matter how frequent the random() function and network issues. Authors investigated, in [3], the appears in the test case. Some test smell, even with relationship between smells and TF. Another empiri- single appearance, will weight more than a test smell cal study of the root causes of TF in Android Apps was for higher frequency. conducted by Thorve et al. [4] by analyzing the com- Precision Vs Recall: When a test suite grows in size, mits of 51 Apache open-source projects. Thorve et al. developers would like any indications of tests that are [4] complement the results of Luo et al. and Palomba more likely to be �aky rather than adopting an ap- and Zaidman, but they also report two additional test proach of re-run which of-course is not cost e�ective smells (user interface and program logic) that are re- in terms of time and resources. Developers like to in- 43 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) lated to TF in Android Apps. Bell et al. in [34] and pro- ings of this study for other data set. posed a new technique called DeFlaker, which moni- tors the latest code coverage and marks the test case as �aky if the test case does not execute any of the 8. Conclusion changes. Another technique called PRADET [35] does At the moment of writing this paper, literature is scarce not detect �aky tests directly, rather it uses a system- on test �akiness (i.e., root causes, challenges, mitiga- atic process to detect problematic test order dependen- tion strategies, etc.) which requires signi�cant atten- cies. These test order dependencies can lead to �ak- tion from researchers and practitioners. We extracted iness. King et al. in [36] present an approach that �aky and non �aky test case contents from open source leverages Bayesian networks for �aky test classi�ca- repositories. We implemented three ML classi�ers such tion and prediction. This approach considers �akiness as Naive Bayes, Support Vector Machine and Random as a decease mitigated by analyzing the symptoms and Forest to see if the predictive accuracy can be increased. possible causes. Teams using this technique improved The authors concluded that only RF performs better CI pipeline stability by as much as 60%. To best of our when it comes to precision (i.e., > 90%) but the recall knowledge, no study has been conducted to evaluate is very low (< 10%) as compared to NBL (i.e., preci- the predictive accuracy of machine learning classi�ers sion < 70% and recall >30%) and SVM (i.e., precision that can help developers in �aky test case prediction < 70% and recall >60%). The authors concluded that and detection. predicting accuracy of ML classi�ers are strongly as- Dutta et al. [37] and Sjobom [38] investigated projects sociated with the lexical information of test cases (i.e., written in Python language to classify test smells that test cases written in Java or Python). The authors in- increase test �akiness. Their study is limited to list vestigated why other classi�ers failed to produce ex- the test smells and their e�ect on test �akiness. Our pected results and concluded that; 1) it is a combina- study worked with the test smells identi�ed in [38]. tion of the test smell and an external environment that Pinto et al. evaluated �ve machine learning classi�ers makes a test case �aky, and in this study, the exter- (Random Forest, Decision Tree, Naive Bayes, Support nal environment was not taken into consideration, 2) Vector Machine, and Nearest Neighbour) to generate ML classi�ers should not only consider the frequency �aky test vocabulary written in Java [23]. The con- of test smells in the test case but other important test cluded that Random Forest and SVM performed very codes that have an ability to cancel the e�ect of test well with high precision and recall. They concluded smells. that features such as "job", "action", and "services" were commonly associated with �aky tests. We replicated the similar experiment with di�erent programming lan- 9. Acknowledgment guage and extended the current knowledge by answer- ing RQ2 and RQ3. We appreciate Linköping University students to pro- vide their expertise to collect �aky test data from on- line repositories. 7. Validity Threats The authors in this study selected only those ML clas- References si�ers which have established a good reputation of high [1] Q. Luo, F. Hariri, L. Eloussi, D. Marinov, An Empir- accuracy in spam detection thus reducing the selection ical Analysis of Flaky Tests, in: Proceedings of the bias. 22Nd ACM SIGSOFT International Symposium on Founda- The authors in this study reduced the experimenter tions of Software Engineering, FSE 2014, ACM, New York, bias by performing several experiments with di�erent NY, USA, 2014, pp. 643–653. URL: http://doi.acm.org/10.1145/ 2635868.2635920. doi:10.1145/2635868.2635920, event- thresholds (i.e., probability scores, kernels, number of place: Hong Kong, China. trees, etc.) before selecting a champion. [2] F. Palomba, A. Zaidman, Does Refactoring of Test Smells In- External validity refers to the possibility of gener- duce Fixing Flaky Tests?, in: 2017 IEEE International Confer- ence on Software Maintenance and Evolution (ICSME), 2017, alizing the �ndings, as well as the extent to which the pp. 1–12. doi:10.1109/ICSME.2017.12. �ndings are of interest to other researchers and practi- [3] F. Palomba, A. Zaidman, The smell of fear: on the relation tioners beyond those associated with the speci�c case between test smells and �aky tests, Empirical Software En- being investigated. Since the precision strongly de- gineering 24 (2019) 2907–2946. URL: https://doi.org/10.1007/ s10664-019-09683-z. doi:10.1007/s10664-019-09683-z. pends on the data set under investigation, we have an external validity threat. We cannot generalize the �nd- 44 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) [4] S. Thorve, C. Sreshtha, N. Meng, An Empirical Study of Flaky USA, 2006, pp. 161–168. URL: https://doi.org/10.1145/1143844. Tests in Android Apps, in: 2018 IEEE International Confer- 1143865. doi:10.1145/1143844.1143865. ence on Software Maintenance and Evolution (ICSME), 2018, [18] C.-Y. Chiu, Y.-T. Huang, Integration of Support Vector Ma- pp. 534–538. doi:10.1109/ICSME.2018.00062. chine with Naïve Bayesian Classi�er for Spam Classi�cation, [5] V. Garousi, B. Küçük, Smells in software test code: A survey of in: Fourth International Conference on Fuzzy Systems and knowledge in industry and academia, Journal of Systems and Knowledge Discovery (FSKD 2007), volume 1, 2007, pp. 618– Software 138 (2018) 52–81. URL: http://www.sciencedirect. 622. doi:10.1109/FSKD.2007.366, iSSN: null. com/science/article/pii/S0164121217303060. doi:10.1016/j. [19] Z. Jia, W. Li, W. Gao, Y. Xia, Research on Web Spam Detec- jss.2017.12.013. tion Based on Support Vector Machine, in: 2012 International [6] R. Shams, R. E. Mercer, Classifying Spam Emails Using Text Conference on Communication Systems and Network Tech- and Readability Features, in: 2013 IEEE 13th International nologies, 2012, pp. 517–520. doi:10.1109/CSNT.2012.117, Conference on Data Mining, 2013, pp. 657–666. doi:10.1109/ iSSN: null. ICDM.2013.131, iSSN: 2374-8486. [20] A. S. Katasev, L. Y. Emaletdinova, D. V. Kataseva, Neural Net- [7] S. K. Tuteja, N. Bogiri, Email Spam �ltering using BPNN classi- work Spam Filtering Technology, in: 2018 International Con- �cation algorithm, in: 2016 International Conference on Auto- ference on Industrial Engineering, Applications and Manufac- matic Control and Dynamic Optimization Techniques (ICAC- turing (ICIEAM), 2018, pp. 1–5. doi:10.1109/ICIEAM.2018. DOT), 2016, pp. 915–919. doi:10.1109/ICACDOT.2016. 8728862, iSSN: null. 7877720, iSSN: null. [21] M. K., R. Kumar, Spam Mail Classi�cation Using Combined [8] E. Sahın, M. Aydos, F. Orhan, Spam/ham e-mail classi�cation Approach of Bayesian and Neural Network, in: 2010 Interna- using machine learning methods based on bag of words tech- tional Conference on Computational Intelligence and Commu- nique, in: 2018 26th Signal Processing and Communications nication Networks, 2010, pp. 145–149. doi:10.1109/CICN. Applications Conference (SIU), 2018, pp. 1–4. doi:10.1109/ 2010.39, iSSN: null. SIU.2018.8404347, iSSN: null. [22] L. Firte, C. Lemnaru, R. Potolea, Spam detection �lter us- [9] K. Mathew, B. Issac, Intelligent spam classi�cation for mo- ing KNN algorithm and resampling, in: Proceedings of bile text message, in: Proceedings of 2011 International the 2010 IEEE 6th International Conference on Intelligent Conference on Computer Science and Network Technology, Computer Communication and Processing, 2010, pp. 27–33. volume 1, 2011, pp. 101–105. doi:10.1109/ICCSNT.2011. doi:10.1109/ICCP.2010.5606466, iSSN: null. 6181918, iSSN: null. [23] G. Pinto, B. Miranda, S. Dissanayake, What is the Vocabulary [10] A. B. M. S. Ali, Y. Xiang, Spam Classi�cation Using Adaptive of Flaky Tests? (2020) 11. Boosting Algorithm, in: 6th IEEE/ACIS International Confer- [24] W. Lam, R. Oei, A. Shi, D. Marinov, T. Xie, iDFlakies: A ence on Computer and Information Science (ICIS 2007), 2007, Framework for Detecting and Partially Classifying Flaky Tests, pp. 972–976. doi:10.1109/ICIS.2007.170, iSSN: null. in: 2019 12th IEEE Conference on Software Testing, Valida- [11] R. K. Yin, Case study research design and methods, 4th ed ed., tion and Veri�cation (ICST), 2019, pp. 312–322. doi:10.1109/ Thousand Oaks, Calif Sage Publications, 2009. URL: https:// ICST.2019.00038, iSSN: 2159-4848. trove.nla.gov.au/work/11329910. [25] M. Sasaki, H. Shinnou, Spam detection using text clustering, [12] A. A. Alurkar, S. B. Ranade, S. V. Joshi, S. S. Ranade, P. A. in: 2005 International Conference on Cyberworlds (CW’05), Sonewar, P. N. Mahalle, A. V. Deshpande, A proposed data 2005, pp. 4 pp.–319. doi:10.1109/CW.2005.83, iSSN: null. science approach for email spam classi�cation using machine [26] T. Fawcett, An introduction to ROC analysis, Pattern Recogni- learning techniques, in: 2017 Internet of Things Business Mod- tion Letters 27 (2006) 861–874. URL: http://www.sciencedirect. els, Users, and Networks, 2017, pp. 1–5. doi:10.1109/CTTE. com/science/article/pii/S016786550500303X. doi:10.1016/j. 2017.8260935, iSSN: null. patrec.2005.10.010. [13] S. Vahora, M. Hasan, R. Lakhani, Novel approach: Naïve Bayes [27] D. Zhang, J. Wang, X. Zhao, Estimating the Uncertainty of with Vector space model for spam classi�cation, in: 2011 Average F1 Scores, in: Proceedings of the 2015 International Nirma University International Conference on Engineering, Conference on The Theory of Information Retrieval, ICTIR ’15, 2011, pp. 1–5. doi:10.1109/NUiConE.2011.6153245, iSSN: Association for Computing Machinery, Northampton, Mas- 2375-1282. sachusetts, USA, 2015, pp. 317–320. URL: https://doi.org/10. [14] M. R. Islam, W. Zhou, M. U. Choudhury, Dynamic Fea- 1145/2808194.2809488. doi:10.1145/2808194.2809488. ture Selection for Spam Filtering Using Support Vector Ma- [28] Wang, Baselines and bigrams | Proceedings of the 50th An- chine, in: 6th IEEE/ACIS International Conference on Com- nual Meeting of the Association for Computational Linguis- puter and Information Science (ICIS 2007), 2007, pp. 757–762. tics: Short Papers - Volume 2, ???? URL: https://dl-acm-org.e. doi:10.1109/ICIS.2007.92, iSSN: null. bibl.liu.se/doi/10.5555/2390665.2390688. [15] T.-Y. Yu, W.-C. Hsu, E-mail Spam Filtering Using Support Vec- [29] S. Abu-Nimeh, D. Nappa, X. Wang, S. Nair, A compari- tor Machines with Selection of Kernel Function Parameters, son of machine learning techniques for phishing detection, in: 2009 Fourth International Conference on Innovative Com- in: Proceedings of the anti-phishing working groups 2nd puting, Information and Control (ICICIC), 2009, pp. 764–767. annual eCrime researchers summit on - eCrime ’07, ACM doi:10.1109/ICICIC.2009.184, iSSN: null. Press, Pittsburgh, Pennsylvania, 2007, pp. 60–69. URL: http: [16] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. //portal.acm.org/citation.cfm?doid=1299015.1299021. doi:10. Adetunmbi, O. E. Ajibuwa, Machine learning for email 1145/1299015.1299021. spam �ltering: review, approaches and open research prob- [30] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32. lems, Heliyon 5 (2019) e01802. URL: http://www.sciencedirect. URL: https://doi.org/10.1023/A:1010933404324. doi:10.1023/ com/science/article/pii/S2405844018353404. doi:10.1016/j. A:1010933404324. heliyon.2019.e01802. [31] I. H. Witten, E. Frank, Data mining: practical machine learn- [17] R. Caruana, A. Niculescu-Mizil, An empirical comparison of ing tools and techniques with Java implementations, ACM supervised learning algorithms, in: Proceedings of the 23rd in- SIGMOD Record 31 (2002) 76–77. URL: https://doi.org/10.1145/ ternational conference on Machine learning, ICML ’06, Asso- 507338.507355. doi:10.1145/507338.507355. ciation for Computing Machinery, Pittsburgh, Pennsylvania, [32] Weka 3 - Data Mining with Open Source Machine Learning 45 8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020) Software in Java, ???? URL: https://www.cs.waikato.ac.nz/ml/ weka/index.html. [33] A. Ahmad, O. Lei�er, K. Sandahl, Empirical Analysis of Fac- tors and their E�ect on Test Flakiness - Practitioners’ Percep- tions, arXiv:1906.00673 [cs] (2019). URL: http://arxiv.org/abs/ 1906.00673, arXiv: 1906.00673. [34] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, D. Mari- nov, DeFlaker: Automatically Detecting Flaky Tests, in: 2018 IEEE/ACM 40th International Conference on Software Engi- neering (ICSE), 2018, pp. 433–444. doi:10.1145/3180155. 3180164. [35] A. Gambi, J. Bell, A. Zeller, Practical Test Dependency De- tection, in: 2018 IEEE 11th International Conference on Soft- ware Testing, Veri�cation and Validation (ICST), 2018, pp. 1– 11. doi:10.1109/ICST.2018.00011. [36] T. M. King, D. Santiago, J. Phillips, P. J. Clarke, Towards a Bayesian Network Model for Predicting Flaky Automated Tests, in: 2018 IEEE International Conference on Soft- ware Quality, Reliability and Security Companion (QRS-C), IEEE Comput. Soc, Lisbon, 2018, pp. 100–107. doi:10.1109/ QRS-C.2018.00031. [37] S. Dutta, A. Shi, R. Choudhary, Z. Zhang, A. Jain, S. Misailovic, Detecting �aky tests in probabilistic and machine learning ap- plications, in: Proceedings of the 29th ACM SIGSOFT Inter- national Symposium on Software Testing and Analysis, ISSTA 2020, Association for Computing Machinery, New York, NY, USA, 2020, pp. 211–224. URL: https://doi.org/10.1145/3395363. 3397366. doi:10.1145/3395363.3397366. [38] A. Sjöbom, Studying Test Flakiness in Python Projects : Orig- inal Findings for Machine Learning, 2019. URL: http://urn.kb. se/resolve?urn=urn:nbn:se:kth:diva-264459. 46