An Evaluation of Machine Learning Methods for Predicting Flaky Tests

Azeem Ahmad, Ola Leifler and Kristian Sandahl
Linköping University, 581 83 Linköping, Sweden
Abstract
In this paper we investigate the feasibility of using machine learning (ML) classifiers to predict flaky tests in projects written in Python. The study compares the predictive accuracy of three machine learning classifiers (Naive Bayes, Support Vector Machines, and Random Forests) with each other, and we compare our findings with an earlier investigation of similar ML classifiers on projects written in Java. We also investigate whether test smells are good predictors of test flakiness. As developers need to trust the predictions of ML classifiers, they wish to know which types of input data or test smells cause more false negatives and false positives. We conclude that RF performed better in terms of precision (> 90%) but provided very low recall (< 10%) compared to NB (precision < 70%, recall > 30%) and SVM (precision < 70%, recall > 60%).

Keywords
Improve Software Quality, Flaky Test Detection, Machine Learning Classifiers, Experimentation, Test Smells


1. Introduction

Developers need to ensure that their changes to the code base do not break existing functionality. If test cases fail, developers expect test failures to be connected to the changes. Unfortunately, some test failures have nothing to do with the code changes. Developers spend time analyzing changes trying to identify the source of a test failure, only to find out that the cause of the failure is test flakiness (TF). Many studies [1, 2, 3, 4] have been conducted to determine the root causes of test flakiness. These studies concluded that the main root cause of TF is test smells. Test smells are poorly written test cases whose presence negatively affects the test suites and production code, or even the software functionality [5]. Another definition is "poor design or implementation choices applied by programmers or testers during the development of test cases" [2]. Asynchronous wait, input/output calls, and test order dependency are some of the test smells that have been found to be the most common causes of TF [1]. The results presented by Luo et al. [1] were partially replicated by Palomba and Zaidman [2], leading to the conclusion that the most prominent causes of TF are test smells such as asynchronous wait, concurrency, and input/output issues. There is strong evidence that the main reasons for test flakiness are specific test smells. Luo et al. suggested that "developers should avoid specific test smells that lead to test flakiness". The authors of [2] investigated the question: "To what extent can flaky tests be explained by the presence of test smells?" They concluded that the "cause of 54% of the flaky tests can be attributed to the characteristics of the co-occurring test smell".

Mapping test smells to flaky tests resembles the problem of mapping words to spam/ham email. Certain words (e.g., sale, discount) are more frequent in spam emails. Many studies [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] have been conducted to predict email class (i.e., spam or ham) based on email contents. We adopted a similar approach in this study to determine the flakiness of test cases based on the test case code. Machine learning approaches have been widely studied and there are many algorithms that can be used in e-mail classification, including Naive Bayes [16][17], Support Vector Machines [18][19][15, 14], Neural Networks [20][21], and K-nearest neighbor [22].

Recently, Pinto et al. evaluated five machine learning classifiers (Random Forest, Decision Tree, Naive Bayes, Support Vector Machine, and Nearest Neighbour) to generate a flaky test vocabulary [23]. They concluded that Random Forest and Support Vector Machine provided the best prediction of flaky tests. The investigated test cases were written in Java and the authors concluded that "future work will have to investigate to what extent their findings generalize to software written in other programming languages" [23].

8th International Workshop on Quantitative Approaches to Software Quality, in conjunction with the 27th Asia-Pacific Software Engineering Conference (APSEC 2020), Singapore, 1st December 2020
azeem.ahmad@liu.se (A. Ahmad); ola.leifler@liu.se (O. Leifler); kristian.sandahl@liu.se (K. Sandahl)
ORCID: 0000-0003-3049-1261 (A. Ahmad)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
In this study, we implemented supervised ML classifiers to detect whether a test case is flaky or not based on the contents of test cases written in Python. We compared our findings with those presented by Pinto et al. [23]. We looked for evidence of whether machine learning classifiers are applicable to predicting flaky tests and whether the results can be generalized to test cases written in other languages. In addition, our unique contribution is to investigate whether test smells are good predictors of test flakiness. Through manual investigation of false positives and false negatives, we compiled a list of test smells that are strong and weak predictors of test flakiness. We investigated the following research questions in this study.

RQ1: What is the predictive accuracy of Naive Bayes, Support Vector Machine and Random Forest concerning flaky test detection and prediction?
RQ2: To what extent does the predictive power of machine learning classifiers vary when applied to software written in another programming language?
RQ3: What can we learn about the predictive power of test smells using the machine learning classifiers mentioned in RQ1?

2. Data Set Description and Preprocessing

We wrote a script to extract the contents of all test cases from the open-source projects listed in Table 1. After extracting the test case contents, we checked which of the test cases in our database had been mentioned in [24] as flaky. After this mapping, we finalized a database with the project name, test case name, test case content and a label. Many keywords in the test case code are irrelevant for the identification of test flakiness. We performed extensive data cleaning, such as removing punctuation marks, digits and specific keywords (i.e., int, string, array, assert*), as well as converting text to lower case.
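The cleaning step can be illustrated with a short sketch; this is illustrative only, and the keyword stop-list shown is an assumed subset of the one described above:

```python
import re

# Keywords filtered out before training (illustrative subset; the
# paper's full stop-list is an assumption here).
STOP_KEYWORDS = {"int", "string", "array"}

def clean_test_case(source: str) -> str:
    """Normalize raw test-case code as described in Section 2:
    lower-casing, stripping punctuation and digits, and dropping
    uninformative keywords such as assert* variants."""
    text = source.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation marks
    text = re.sub(r"\d+", " ", text)       # remove digits
    tokens = [
        t for t in text.split()
        if t not in STOP_KEYWORDS and not t.startswith("assert")
    ]
    return " ".join(tokens)

print(clean_test_case("self.assertEqual(conn.select('tabl_1'), 42)"))
# -> "self conn select tabl_"
```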
2.1. Classifiers

An NBC, first proposed in 1998, is a probabilistic model which can determine the outcome (i.e., flaky or not flaky) of an instance (i.e., a test case) based on the contents of its features (i.e., the test case code). In our case, the outcome of NBC is binary. NBC is widely applied in classification and is known to obtain excellent results [25].

The attractive feature of SVM is that it eliminates the need for feature selection, which makes spam classification easy and fast [14]. SVM deals with the dual categories of classification and can find the best hyperplane to partition a sample space [15].

RF is an ensemble classification method (a technique that combines several base models to produce an optimal predictive model) suitable for handling problems that involve grouping data into different classes. RF predicts by using decision trees. Trees are constructed during training and can later be used for class prediction. There is a vote associated with each tree, and once the class vote has been produced for all individual trees, the class with the highest vote is considered to be the output.

2.2. Performance Metrics and Parameter Tuning

To evaluate the predictive accuracy of classifiers, accuracy as the only performance index is not sufficient [16]. We must also consider precision, recall, F1-score, the ROC curve, false positives and false negatives [16]. There is always some cost associated with false positives and false negatives. When a non-flaky test is wrongly classified as flaky, it gives rise to a somewhat insignificant problem, because an experienced user can bypass the warning by looking at the test case code. In contrast, when a flaky test is wrongly classified as non-flaky, this is obnoxious, because it indicates that the test suite still has test cases whose outcome cannot be trusted.

The experiment started with the implementation of simple NB without Laplace smoothing. The results did not provide good accuracy or precision, because without Laplace smoothing the probability of a rare test smell (i.e., a test smell that was not in the training set) appearing in the test set is set to 0, given the formula

$P_i = N_i / N$

where $P_i$ is the probability that an individual test smell is present in a flaky test, $N_i$ represents the number of times that particular test smell appeared in a test case and $N$ represents the number of times that test smell appeared in any test case. Laplace smoothing refers to the modification of this equation:

$P_i = (N_i + \alpha) / (N + \alpha k)$

where we set $\alpha = 1$ so that the classifier adds 1 to the probability of rare test smells that were not present in the training set (here $k$ denotes the number of distinct features). Another step is to identify the threshold (i.e., 0.0 - 1.0) which will increase the predictive accuracy of the outcome.
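As a concrete illustration, the following sketch reproduces this setup with scikit-learn, whose MultinomialNB uses alpha as the Laplace smoothing term. Our actual experiments were implemented with R libraries (see Section 3.2), so this Python equivalent and its toy corpus are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for cleaned test-case bodies (label 1 = flaky).
docs = ["conn sclose tabl select", "assertequ valu expr",
        "conn doubl tcommit", "should fals valu"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)

# alpha=1 is the Laplace smoothing described above: every count N_i
# becomes N_i + 1, so an unseen smell gets a small non-zero probability
# instead of zeroing out the whole product.
clf = MultinomialNB(alpha=1.0).fit(X, labels)

# Thresholding the posterior instead of calling predict() mirrors the
# paper's search over probability scores from 0.0 to 1.0.
scores = clf.predict_proba(X)[:, 1]
threshold = 0.4                      # the value the paper settles on for NBL
print([int(s >= threshold) for s in scores])
```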
As far as SVM was concerned, although the feature data set space was linear, we decided to use both kernels (i.e., linear and poly) for the sake of experiment. For random forest, we used ntree between 300 and 700, as well as restricting the number of variables available for splitting at each tree node, known as mtry, to between 25 and 100.
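A hedged sketch of this parameter search, again in Python for illustration: ntree and mtry are arguments of R's randomForest, and scikit-learn's n_estimators and max_features are assumed here to play the corresponding roles:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Kernel sweep mirroring the linear-vs-poly comparison described above.
svm_search = GridSearchCV(SVC(probability=True),
                          {"kernel": ["linear", "poly"]},
                          scoring="precision", cv=5)

# n_estimators <-> ntree, max_features <-> mtry (assumed mapping).
# Integer max_features requires the document-term matrix to have at
# least 100 columns.
rf_search = GridSearchCV(RandomForestClassifier(),
                         {"n_estimators": [300, 400, 500, 600, 700],
                          "max_features": [25, 50, 75, 100]},
                         scoring="precision", cv=5)

# With a document-term matrix X and labels y (as in the earlier sketch):
#   svm_search.fit(X, y); rf_search.fit(X, y)
#   print(rf_search.best_params_, rf_search.best_score_)
```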
Table 1
Open-source project names provided by [24] with the number of total test cases and flaky tests

 Project Name                  Total Number of TCs   Flaky Tests
 apache-qpid-0.18              2357                  284
 hibernate 4                   3231                  273
 apache-wicket-1.4.20          1250                  216
 apache-karaf-2.3              163                   102
 apache-struts 2.5             2346                  60
 apache-derby-10.9             3832                  40
 apache-lucene-solr-3.6        764                   7
 apache-cassandra-1.1          523                   4
 apache-nutch-1.4              7                     4
 apache-hbase-0.94             29                    2
 apache-hive-10.9              23                    2
 jfreechart-1.0.18             2292                  0

3. Results

This section discusses the performance of NBL, SVM and RF with different parameters. We compared our results with the findings of Pinto et al. to discuss how the results vary between Java and Python projects. We also discuss why some classifiers do not perform as expected and what we can learn about the predictive power of test smells for test flakiness detection and prediction.

3.1. RQ1: Performance of Naive Bayes Classifier, Support Vector Machine and Random Forest

Table 2 shows the 20 features with the highest information gain together with their frequency with respect to flaky and non-flaky tests. We assigned the features to the categories presented by Luo et al. in [1]. We manually traversed the code of flaky and non-flaky tests to understand the context and how features were used in the tests in order to assign categories. The top feature, "conn", appeared in 1361 flaky tests and only 15 non-flaky tests. This feature is associated with external connections to input/output devices and lies under the category of "IO" presented by Luo et al. in [1]. The second top feature is "double", which appeared in 1190 flaky tests and 12 non-flaky tests and is assigned to the category of "IO" followed by "floating point operations". The third top feature, "tabl", was related to table creation during runtime for database queries and appeared 1150 times in flaky tests and 52 times in non-flaky tests.

Figure 1 (A) represents the ROC curve [26] for NBC with Laplace smoothing, denoted NBL, with different thresholds (i.e., from 0.0 to 1.0). We conducted different experiments with different training and test data splits, such as 50/50, 60/40, 70/30, 80/20 and 90/10. We found similar values for k-fold cross validation. A ROC curve provides a comparison between sensitivity and specificity, helping in organizing classifiers and visualizing their performance [26]. Sensitivity, also known as the true positive rate, represents the benefit of predicting flaky tests correctly, while 1 − specificity, the false positive rate, represents the cost of predicting non-flaky tests as flaky tests. In the case of a false positive, developers need to spend effort and time just to find out that this is a classifier mistake and the test case is not flaky. The optimal target, in the ROC curve, is to rise vertically from the origin to the top left corner (higher true positive rate) as soon as possible, because then the classifier can achieve all true positives at the cost of committing only a few false positives. The diagonal line in Figure 1 (A) represents the strategy of randomly guessing the outcome. Any classifier that appears in the lower right triangle performs worse than random guessing, and we can see that NBL lies in the upper left triangle. Looking at Figure 1 (A), NBL with the 70/30 data partition is suitable to proceed further with a 0.4 probability score. NBL, as shown in Figure 1 (A), stopped issuing positive classifications (i.e., flaky test predictions) around the 0.76 - 0.87 threshold. Beyond 0.87, it commits a higher false positive rate.
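The threshold selection described above can be sketched as follows. The scores are toy values, and the top-left-corner rule shown is one common way to pick an operating point, not necessarily the exact procedure we used:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Posterior scores from a fitted classifier (invented for illustration).
y_true   = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.35, 0.42, 0.8, 0.45, 0.55, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Pick the threshold closest to the top-left corner (high TPR, low FPR),
# i.e. the "rise vertically from the origin" target described above.
best = np.argmin(np.hypot(fpr, 1 - tpr))
print(f"threshold={thresholds[best]:.2f}  TPR={tpr[best]:.2f}  FPR={fpr[best]:.2f}")
```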
We tuned different parameters in NBL, SVM and RF before conducting further experiments. We do not intend to provide the results of all experiments, because those experiments were only conducted to find the optimal parameters. The rest (i.e., simple NB, and SVM with radial and sigmoid kernels) were not included in further experiments and were discarded. Figure 1 (B-E) provides comparisons of NBL, SVM-Linear and SVM-Poly (i.e., different kernels) for accuracy, precision, recall and F1-score. All classifiers achieved good accuracies, ranging from 93% - 96%. NBL outperformed SVM, although the difference between them is not dramatic. Looking only at the accuracy results of classifiers can be deceiving. The important factor for classifier selection is to ask the right question and motivate the choice of a specific classifier: are we interested in detecting flaky tests correctly (i.e., precision), or is marking a non-flaky test as flaky not cost-effective (i.e., recall)? It is important to look at precision, recall and accuracy all together for classifier selection.
Figure 1: Performance comparison among classifiers. (A) represents the ROC curve of the NBL classifier with different data partitions and probability scores. (B-E) represent the accuracy, precision, recall and F1-score of the different classifiers with different data partitions, respectively.



We can assume that practitioners are more interested in precision than recall, because the test suite size in many organizations is very large and they cannot inspect all test cases. In this particular case, any classifier that correctly flags flaky tests will be encouraged. Precision can answer the question: "If the filter says this test case is flaky, what's the probability that it's flaky?". Figure 1 (C,D) provides precision and recall values for NBL and SVM. It can be noticed that NBL precision increases (in C) with a gradual decrease in recall (in D). An NBL precision of 65% dictates that 35% of what was marked as flaky was not flaky. Recall is also lower in NBL compared to SVM-Linear. SVM-Poly performs worst in terms of precision and recall, as expected, due to the fact that the input data set is not polynomial; the poly kernel is well suited for image processing, whereas the linear kernel performs better for text classification.

Table 2
Top 20 frequent features and assigned category

 Features     Frequency   Assigned category from Luo et al. [1]
 new          28083       IO
 assertequ    7967        -
 null         4721        -
 from         4719        -
 string       4126        -
 sclose       3315        IO
 true         3154        -
 select       2842        -
 for          2809        Unordered Collection
 fals         2604        -
 not          2294        -
 int          2111        -
 asserttru    1910        -
 tabl*        1677        IO
 should       1596        -
 doubl        1588        Floating point operations
 valu         1429        -
 expr         1322        -
 tcommit      1313        IO
 expcolnam    1211        IO

The F1-score, as presented in Figure 1 (E), is the harmonic mean of precision and recall. The F1-score is useful and informative because of the prevalent phenomenon of class imbalance in text classification [27]. NBL is a suitable candidate although it has a lower F1-score compared to SVM-Linear, because NBL performs better with short documents and, in our case, the training test cases consist of 6-15 lines of code [28]. NBL provides higher precision and lower recall compared to SVM-Linear. Another disadvantage of SVM is that it requires high computation and is very sensitive to noisy data [29].
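A small worked example of why these metrics must be read together under class imbalance; the numbers are invented for illustration:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Imbalanced toy outcome: 2 flaky tests among 10 (flaky = 1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one FP, one FN, one TP

print("accuracy :", accuracy_score(y_true, y_pred))   # 0.8, looks fine
print("precision:", precision_score(y_true, y_pred))  # 0.5: half the flags are wrong
print("recall   :", recall_score(y_true, y_pred))     # 0.5: half the flaky tests missed
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean: 0.5
```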
Figure 2: Performance of RF with different parameters (i.e., number of trees and mtry), showing accuracy, F1-score, precision and recall for ntree from 300 to 700 and mtry from 25 to 100.



RF provides a lower classification error and better F1-scores compared to decision trees, NBL and SVM. The precision, in which we are most interested, is usually better than that of SVM and NBL. The authors of [16] also concluded that RF performs better than NBL and SVM. The class outcomes are based on "votes", which are cast by each tree in the forest. The outcome (i.e., flaky or not flaky) is selected based on the higher vote count. Figure 2 presents the performance of RF with respect to the selected metrics. mtry represents the number of variables randomly sampled as candidates at each split, while ntree is the number of trees to grow. There is no analytical way to find an optimal mtry and ntree, so we experimented with different settings, as shown in Figure 2. mtry has a direct effect on precision and recall, as shown in Figure 2. With an increase in mtry, precision decreases and recall increases; an unwanted situation. The optimal value of mtry is 5, where precision is higher and recall is lower regardless of the number of trees. The change in mtry did not affect the accuracy, but as we discussed earlier, we are not only interested in accuracy but in precision too.

We performed several experiments to find the optimal parameters within a classifier before comparing it to other classifiers. After these experiments, we identified three unique classifiers with unique and optimal parameters. Since we are most interested in higher precision, we can see that RF with mtry = 5 and ntree = 250 outperforms all other classifiers, but only for precision. RF achieved more than 90% precision with less than 10% recall. We did not achieve high precision (i.e., > 90%) with all classifiers. NBL provides unexpected results although it holds a good reputation in terms of detecting spam emails [29]. Compared to NBL and SVM, RF has distinct qualities: 1) it can work with thousands of different input features without any feature deletion, 2) it calculates an approximation of the important features for classification, and 3) it is very robust to noise and outliers [30]. Caruana in [17] compared 10 different ML classifiers and concluded that decision trees and random forest outperform all other classifiers for spam classification.

3.2. RQ2: Predictive Power of ML Classifiers with Respect to Other Languages

In comparing our findings with those presented by Pinto et al. [23], we observed two differences. First, the top 20 frequent features are very different in the two studies. Only one feature, "tabl", marked with a star (*) in Table 2, was similar in both findings. However, we observed that more features were related to the "IO" category, as presented in Table 2, which complements the finding of Pinto et al. "that all projects manifesting flakiness are IO-intensive" [23]. Second, we have much lower precision, recall and F1-score compared to Pinto et al., except in one instance where random forest provided 0.92 precision. Table 3 provides detailed statistics of the precision, recall, and F1-score of the three algorithms for comparison. The algorithms on the Python language consistently performed worse, contrary to what Pinto et al. claimed: "Although the studied projects are mostly written in Java, we do not expect major differences in the results if another object-oriented programming language is used instead, since some keywords maybe shared among them" [23].

We speculate that there could be several reasons associated with this performance reduction:
(1) we implemented the code ourselves using R libraries for the aforementioned classifiers, whereas Pinto et al. used Weka [31], an open-source machine learning tool that can be accessed through a graphical user interface or standard terminal applications [32]; (2) the number of features was very high in the training samples, and in such cases other models that might perform better should be considered (i.e., regularized linear regression); and (3) the versatility offered by parameter tuning can become problematic and require special considerations that can impact the classifiers.

Table 3
Comparison of Precision, Recall and F1-Score between our findings (A) and Pinto et al. (B)

                    Precision     Recall        F1-Score
 Algo.              A      B      A      B      A      B
 Random Forest      0.92   0.99   0.4    0.91   0.09   0.95
 Naive Bayes        0.62   0.93   0.15   0.8    0.24   0.86
 Support Vector     0.51   0.93   0.61   0.92   0.57   0.93
3.3. RQ3: Test Smells Analysis and their Predictive Power for Test Flakiness Detection and Prediction

To answer RQ3, we manually investigated different cases of true positives (i.e., correct flaky test predictions), false positives (i.e., non-flaky test cases marked as flaky), false negatives (i.e., flaky test cases marked as non-flaky) and true negatives (i.e., correct non-flaky test predictions). We observed that it is not only the frequency of a test smell that makes a test case flaky, but also its co-existence with the class code or with external factors such as operating systems or a specific product. For example, the test smell 'Conditional Test Logic', as mentioned in [3], refers to nested and complex 'if-else' structures in the test case. Depending on which branch of the 'if-else' is executed, the system under test may require specific environment settings. Failing to set the environment during different executions will flip the test case outcome, thus making it flaky.
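A hypothetical Python test illustrating this smell is sketched below; the service names and environment variable are invented for illustration:

```python
import os
import unittest

def charge_sandbox(amount):
    """Stub standing in for a call to a sandboxed payment gateway."""
    return "accepted"

def charge_live(amount):
    """Stub for the live gateway, which may behave differently per host."""
    return "accepted"

class PaymentServiceTest(unittest.TestCase):
    def test_charge(self):
        # Conditional Test Logic: which branch runs depends on an
        # environment variable, so the verdict can flip between
        # executions even though the code under test never changed.
        if os.environ.get("PAYMENT_GATEWAY") == "sandbox":
            self.assertEqual(charge_sandbox(10), "accepted")
        else:
            self.assertEqual(charge_live(10), "accepted")

if __name__ == "__main__":
    unittest.main()
```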
After manual investigation of all true/false positives and true/false negatives, we came up with a list of test smells that are strong or weak predictors of test flakiness, as shown in Table 4. Strong predictors refer to those test smells that existed in true positive and true negative cases, whereas weak predictors only existed in false negatives and false positives. Test smells that are classified as weak predictors in this study are still useful and can help in the identification of test flakiness, but they are not useful with machine learning classifiers because they require additional information, such as what operating system the tests are running on and whether or not specific configurations should be deployed. Test smells that are classified as strong predictors are very useful with machine learning classifiers because they exist only in the test case function as one unit and do not require additional information.

4. Lessons Learned

ML and AI algorithms have in recent years established a good reputation for predicting diseases based on symptoms, spam emails based on email contents, and much more. We believe that, given a proper input data set which clearly distinguishes between flaky and non-flaky tests, ML and AI can provide high prediction capability, saving effort, time and resources. We strongly believe that practitioners, when preparing the training data set, should not consider complete test cases as input but only the test code (i.e., only the few lines) that reveals test flakiness.

It is inconclusive whether the predictive power of machine learning varies with respect to software written in another language. Investigation of Java test cases [23] revealed good results, while our findings for Python test cases were unexpected, thus requiring more investigation into whether lexical information can be traced to flakiness.

Async wait, precision, randomness and IO test smells are strong predictors and can be predicted by machine learning classifiers with 100% precision, because they exist only in the test case code and do not require additional information from the test class or operating system. All other test smells mentioned in Table 4 are weak predictors of test flakiness and require additional sources of information. We are only aware of test smells that have been investigated in open-source repositories, and literature on test smells in closed-source software is scarce.

5. Discussion and Implication

Valuable Indicators for Testers: These classifiers can increase awareness of the flaky test vocabulary among testers. When a new test is added to a test suite, it will be easy to identify whether this test case contains specific test smells that were known to increase test flakiness during previous executions. Testers can take advantage of this type of information to reduce test flakiness. Testers can easily identify test smells that are independent of their environment with the help of Table 4.
Table 4
Test Smells as Strong and Weak Predictors Together with the Source of their Existence

 Test Smell Category            Prediction   Test   Test    Operating   External    Hardware/
                                Category     Case   Class   System      Libraries   Product
 Async wait                     Strong       ✓      -       -           -           -
 Precision (float operations)   Strong       ✓      -       -           -           -
 Randomness                     Strong       ✓      -       -           -           -
 IO                             Strong       ✓      -       -           -           -
 Unordered Collection           Weak         ✓      ✓       ✓           -           -
 Time                           Weak         ✓      ✓       ✓           -           ✓
 Platform                       Weak         ✓      ✓       ✓           ✓           -
 Concurrency                    Weak         ✓      ✓       ✓           -           -
 Test order dependency          Weak         ✓      ✓       ✓           ✓           ✓
 Resource Leak                  Weak         ✓      ✓       ✓           -           -



Precision Depends on the Data Set: In the ML literature, particularly on spam detection, it is acknowledged that precision is a function of the combination of the classifier and the data set under investigation. A classifier's precision, in isolation from a data set, does not make sense. The right question is "how precise is a classifier for a given data set". Unfortunately, there is no data available that provides test case contents and an associated label, thus limiting the use of advanced ML and AI algorithms. In addition to the lack of flaky test data, all research has been conducted on open-source software, and we know little about what test smells are present in closed-source software. Ahmad et al. concluded that there are specific test smells that are associated with the nature of the product [33], known as 'company-specific' test smells. A classifier that is trained on a specific data set or domain cannot be generalized to another data set or domain. There is a long road ahead to explore the best classifier given different data sets.

Beyond Static Analysis of Test Smells and their Frequency: ML is capable of incorporating different sources of information to increase predictive accuracy, compared to the limited experiment in this study where we only utilized the frequency of test smells in the test case. During the investigation of the 'false negative' and 'false positive' cases, it was observed that the frequency of test smells in the test case is not sufficient for prediction. Some test case code (e.g., seed()) will cancel the effect of a test smell (e.g., random()), no matter how frequently the random() function appears in the test case, as the sketch below illustrates. Some test smells, even with a single appearance, will weigh more than another test smell with a higher frequency.
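A minimal sketch of this cancellation effect; both hypothetical tests contain the 'randomness' vocabulary, but only the unseeded one is flaky:

```python
import random

def shuffled(seed=None):
    """Shuffle a small list, optionally with a fixed seed."""
    rng = random.Random(seed)
    items = list(range(5))
    rng.shuffle(items)
    return items

def test_order_flaky():
    # 'random' smell in effect: with no seed, the order differs across
    # runs, so this assertion passes only intermittently.
    assert shuffled()[0] == 0

def test_order_stable():
    # The same random-vocabulary tokens appear, but seeding cancels the
    # smell: two seeded runs always agree, so the test is deterministic.
    assert shuffled(seed=42) == shuffled(seed=42)
```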
Precision vs. Recall: When a test suite grows in size, developers would like any indication of tests that are more likely to be flaky, rather than adopting a re-run approach, which of course is not cost-effective in terms of time and resources. Developers like to increase precision at the expense of recall. When encountering a false positive, an experienced developer with sufficient knowledge of the test smells will bypass the outcome; with false negatives, however, developers are unaware of the fact that the test suite still contains flaky tests. The motivation for employing ML classifiers (i.e., higher precision and low recall vs. balanced precision and recall) should be made clear before proceeding with implementation.

Multi-Factor Input Criteria for Flaky Test Detection: We observed that the ML algorithm should include different sources of information to increase predictive accuracy. These sources may include 1) assigning specific weights (i.e., in numbers) to specific test smells or test code, 2) the developer's experience (i.e., new developers, unaware of the test design guidelines, are more likely to write flaky tests), and 3) company-specific test smells.

6. Related Work

Luo et al. [1] investigated 52 open-source projects and 201 commits and categorized the causes of test flakiness. Asynchronous wait (45%), concurrency (20%), and test order dependency (12%) were found to be the most common causes of TF. Palomba and Zaidman [2] partially replicated the results presented by Luo et al., concluding that the most prominent causes of TF are asynchronous wait, concurrency, and input/output and network issues. The authors of [3] investigated the relationship between smells and TF. Another empirical study of the root causes of TF in Android apps was conducted by Thorve et al. [4] by analyzing the commits of 51 Apache open-source projects. Thorve et al. [4] complement the results of Luo et al. and Palomba and Zaidman, but they also report two additional test smells (user interface and program logic) that are related to TF in Android apps.
               8th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2020)




lated to TF in Android Apps. Bell et al. in [34] and pro- ings of this study for other data set.
posed a new technique called DeFlaker, which moni-
tors the latest code coverage and marks the test case
as �aky if the test case does not execute any of the 8. Conclusion
changes. Another technique called PRADET [35] does
                                                           At the moment of writing this paper, literature is scarce
not detect �aky tests directly, rather it uses a system-
                                                           on test �akiness (i.e., root causes, challenges, mitiga-
atic process to detect problematic test order dependen-
                                                           tion strategies, etc.) which requires signi�cant atten-
cies. These test order dependencies can lead to �ak-
                                                           tion from researchers and practitioners. We extracted
iness. King et al. in [36] present an approach that
                                                           �aky and non �aky test case contents from open source
leverages Bayesian networks for �aky test classi�ca-
                                                           repositories. We implemented three ML classi�ers such
tion and prediction. This approach considers �akiness
                                                           as Naive Bayes, Support Vector Machine and Random
as a decease mitigated by analyzing the symptoms and
                                                           Forest to see if the predictive accuracy can be increased.
possible causes. Teams using this technique improved
                                                           The authors concluded that only RF performs better
CI pipeline stability by as much as 60%. To best of our
                                                           when it comes to precision (i.e., > 90%) but the recall
knowledge, no study has been conducted to evaluate
                                                           is very low (< 10%) as compared to NBL (i.e., preci-
the predictive accuracy of machine learning classi�ers
                                                           sion < 70% and recall >30%) and SVM (i.e., precision
that can help developers in �aky test case prediction
                                                           < 70% and recall >60%). The authors concluded that
and detection.
                                                           predicting accuracy of ML classi�ers are strongly as-
   Dutta et al. [37] and Sjobom [38] investigated projects
                                                           sociated with the lexical information of test cases (i.e.,
written in Python language to classify test smells that
                                                           test cases written in Java or Python). The authors in-
increase test �akiness. Their study is limited to list
                                                           vestigated why other classi�ers failed to produce ex-
the test smells and their e�ect on test �akiness. Our
                                                           pected results and concluded that; 1) it is a combina-
study worked with the test smells identi�ed in [38].
                                                           tion of the test smell and an external environment that
Pinto et al. evaluated �ve machine learning classi�ers
                                                           makes a test case �aky, and in this study, the exter-
(Random Forest, Decision Tree, Naive Bayes, Support
                                                           nal environment was not taken into consideration, 2)
Vector Machine, and Nearest Neighbour) to generate
                                                           ML classi�ers should not only consider the frequency
�aky test vocabulary written in Java [23]. The con-
                                                           of test smells in the test case but other important test
cluded that Random Forest and SVM performed very
                                                           codes that have an ability to cancel the e�ect of test
well with high precision and recall. They concluded
                                                           smells.
that features such as "job", "action", and "services" were
commonly associated with �aky tests. We replicated
the similar experiment with di�erent programming lan- 9. Acknowledgment
guage and extended the current knowledge by answer-
ing RQ2 and RQ3.                                           We appreciate Linköping University students to pro-
                                                           vide their expertise to collect �aky test data from on-
                                                           line repositories.
7. Validity Threats
The authors in this study selected only those ML clas-        References
si�ers which have established a good reputation of high
                                                              [1] Q. Luo, F. Hariri, L. Eloussi, D. Marinov,        An Empir-
accuracy in spam detection thus reducing the selection            ical Analysis of Flaky Tests,       in: Proceedings of the
bias.                                                             22Nd ACM SIGSOFT International Symposium on Founda-
   The authors in this study reduced the experimenter             tions of Software Engineering, FSE 2014, ACM, New York,
bias by performing several experiments with di�erent              NY, USA, 2014, pp. 643–653. URL: http://doi.acm.org/10.1145/
                                                                  2635868.2635920. doi:10.1145/2635868.2635920, event-
thresholds (i.e., probability scores, kernels, number of          place: Hong Kong, China.
trees, etc.) before selecting a champion.                     [2] F. Palomba, A. Zaidman, Does Refactoring of Test Smells In-
   External validity refers to the possibility of generalizing the findings, as well as the extent to which the findings are of interest to other researchers and practitioners beyond those associated with the specific case being investigated. Since the precision strongly depends on the data set under investigation, we have an external validity threat. We cannot generalize the findings to other data sets.
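To illustrate why precision does not transfer across data sets, the following toy computation (not drawn from the study) shows that, even with a fixed recall and false-positive rate, precision shifts with the proportion of flaky tests in the data set.

# Toy computation (not from the study): with recall and false-positive rate held
# fixed, precision still shifts with the share of flaky tests in the data set.
def precision(prevalence, recall=0.6, fpr=0.05, n=1000):
    flaky = prevalence * n
    tp = recall * flaky        # flaky tests correctly flagged
    fp = fpr * (n - flaky)     # stable tests wrongly flagged as flaky
    return tp / (tp + fp)

for p in (0.05, 0.20, 0.50):
    print(f"{p:.0%} flaky -> precision {precision(p):.2f}")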




9. Acknowledgment

We thank the Linköping University students who provided their expertise in collecting flaky test data from online repositories.

References

 [1] Q. Luo, F. Hariri, L. Eloussi, D. Marinov, An Empirical Analysis of Flaky Tests, in: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, ACM, New York, NY, USA, 2014, pp. 643–653. URL: http://doi.acm.org/10.1145/2635868.2635920. doi:10.1145/2635868.2635920.
 [2] F. Palomba, A. Zaidman, Does Refactoring of Test Smells Induce Fixing Flaky Tests?, in: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 1–12. doi:10.1109/ICSME.2017.12.
 [3] F. Palomba, A. Zaidman, The smell of fear: on the relation between test smells and flaky tests, Empirical Software Engineering 24 (2019) 2907–2946. URL: https://doi.org/10.1007/s10664-019-09683-z. doi:10.1007/s10664-019-09683-z.




 [4] S. Thorve, C. Sreshtha, N. Meng, An Empirical Study of Flaky Tests in Android Apps, in: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 534–538. doi:10.1109/ICSME.2018.00062.
 [5] V. Garousi, B. Küçük, Smells in software test code: A survey of knowledge in industry and academia, Journal of Systems and Software 138 (2018) 52–81. URL: http://www.sciencedirect.com/science/article/pii/S0164121217303060. doi:10.1016/j.jss.2017.12.013.
 [6] R. Shams, R. E. Mercer, Classifying Spam Emails Using Text and Readability Features, in: 2013 IEEE 13th International Conference on Data Mining, 2013, pp. 657–666. doi:10.1109/ICDM.2013.131.
 [7] S. K. Tuteja, N. Bogiri, Email Spam filtering using BPNN classification algorithm, in: 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT), 2016, pp. 915–919. doi:10.1109/ICACDOT.2016.7877720.
 [8] E. Sahın, M. Aydos, F. Orhan, Spam/ham e-mail classification using machine learning methods based on bag of words technique, in: 2018 26th Signal Processing and Communications Applications Conference (SIU), 2018, pp. 1–4. doi:10.1109/SIU.2018.8404347.
 [9] K. Mathew, B. Issac, Intelligent spam classification for mobile text message, in: Proceedings of 2011 International Conference on Computer Science and Network Technology, volume 1, 2011, pp. 101–105. doi:10.1109/ICCSNT.2011.6181918.
[10] A. B. M. S. Ali, Y. Xiang, Spam Classification Using Adaptive Boosting Algorithm, in: 6th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2007), 2007, pp. 972–976. doi:10.1109/ICIS.2007.170.
[11] R. K. Yin, Case study research design and methods, 4th ed., Sage Publications, Thousand Oaks, Calif., 2009. URL: https://trove.nla.gov.au/work/11329910.
[12] A. A. Alurkar, S. B. Ranade, S. V. Joshi, S. S. Ranade, P. A. Sonewar, P. N. Mahalle, A. V. Deshpande, A proposed data science approach for email spam classification using machine learning techniques, in: 2017 Internet of Things Business Models, Users, and Networks, 2017, pp. 1–5. doi:10.1109/CTTE.2017.8260935.
[13] S. Vahora, M. Hasan, R. Lakhani, Novel approach: Naïve Bayes with Vector space model for spam classification, in: 2011 Nirma University International Conference on Engineering, 2011, pp. 1–5. doi:10.1109/NUiConE.2011.6153245.
[14] M. R. Islam, W. Zhou, M. U. Choudhury, Dynamic Feature Selection for Spam Filtering Using Support Vector Machine, in: 6th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2007), 2007, pp. 757–762. doi:10.1109/ICIS.2007.92.
[15] T.-Y. Yu, W.-C. Hsu, E-mail Spam Filtering Using Support Vector Machines with Selection of Kernel Function Parameters, in: 2009 Fourth International Conference on Innovative Computing, Information and Control (ICICIC), 2009, pp. 764–767. doi:10.1109/ICICIC.2009.184.
[16] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, O. E. Ajibuwa, Machine learning for email spam filtering: review, approaches and open research problems, Heliyon 5 (2019) e01802. URL: http://www.sciencedirect.com/science/article/pii/S2405844018353404. doi:10.1016/j.heliyon.2019.e01802.
[17] R. Caruana, A. Niculescu-Mizil, An empirical comparison of supervised learning algorithms, in: Proceedings of the 23rd international conference on Machine learning, ICML ’06, Association for Computing Machinery, Pittsburgh, Pennsylvania, USA, 2006, pp. 161–168. URL: https://doi.org/10.1145/1143844.1143865. doi:10.1145/1143844.1143865.
[18] C.-Y. Chiu, Y.-T. Huang, Integration of Support Vector Machine with Naïve Bayesian Classifier for Spam Classification, in: Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), volume 1, 2007, pp. 618–622. doi:10.1109/FSKD.2007.366.
[19] Z. Jia, W. Li, W. Gao, Y. Xia, Research on Web Spam Detection Based on Support Vector Machine, in: 2012 International Conference on Communication Systems and Network Technologies, 2012, pp. 517–520. doi:10.1109/CSNT.2012.117.
[20] A. S. Katasev, L. Y. Emaletdinova, D. V. Kataseva, Neural Network Spam Filtering Technology, in: 2018 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), 2018, pp. 1–5. doi:10.1109/ICIEAM.2018.8728862.
[21] M. K., R. Kumar, Spam Mail Classification Using Combined Approach of Bayesian and Neural Network, in: 2010 International Conference on Computational Intelligence and Communication Networks, 2010, pp. 145–149. doi:10.1109/CICN.2010.39.
[22] L. Firte, C. Lemnaru, R. Potolea, Spam detection filter using KNN algorithm and resampling, in: Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing, 2010, pp. 27–33. doi:10.1109/ICCP.2010.5606466.
[23] G. Pinto, B. Miranda, S. Dissanayake, M. d’Amorim, C. Treude, A. Bertolino, What is the Vocabulary of Flaky Tests?, in: Proceedings of the 17th International Conference on Mining Software Repositories (MSR ’20), 2020.
[24] W. Lam, R. Oei, A. Shi, D. Marinov, T. Xie, iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests, in: 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), 2019, pp. 312–322. doi:10.1109/ICST.2019.00038.
[25] M. Sasaki, H. Shinnou, Spam detection using text clustering, in: 2005 International Conference on Cyberworlds (CW’05), 2005, pp. 316–319. doi:10.1109/CW.2005.83.
[26] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (2006) 861–874. URL: http://www.sciencedirect.com/science/article/pii/S016786550500303X. doi:10.1016/j.patrec.2005.10.010.
[27] D. Zhang, J. Wang, X. Zhao, Estimating the Uncertainty of Average F1 Scores, in: Proceedings of the 2015 International Conference on The Theory of Information Retrieval, ICTIR ’15, Association for Computing Machinery, Northampton, Massachusetts, USA, 2015, pp. 317–320. URL: https://doi.org/10.1145/2808194.2809488. doi:10.1145/2808194.2809488.
[28] S. Wang, C. D. Manning, Baselines and bigrams: simple, good sentiment and topic classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, 2012, pp. 90–94. URL: https://dl-acm-org.e.bibl.liu.se/doi/10.5555/2390665.2390688.
[29] S. Abu-Nimeh, D. Nappa, X. Wang, S. Nair, A comparison of machine learning techniques for phishing detection, in: Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit on - eCrime ’07, ACM Press, Pittsburgh, Pennsylvania, 2007, pp. 60–69. URL: http://portal.acm.org/citation.cfm?doid=1299015.1299021. doi:10.1145/1299015.1299021.
[30] L. Breiman, Random Forests, Machine Learning 45 (2001) 5–32. URL: https://doi.org/10.1023/A:1010933404324. doi:10.1023/A:1010933404324.
[31] I. H. Witten, E. Frank, Data mining: practical machine learning tools and techniques with Java implementations, ACM SIGMOD Record 31 (2002) 76–77. URL: https://doi.org/10.1145/507338.507355. doi:10.1145/507338.507355.
[32] Weka 3 - Data Mining with Open Source Machine Learning Software in Java. URL: https://www.cs.waikato.ac.nz/ml/weka/index.html.
[33] A. Ahmad, O. Leifler, K. Sandahl, Empirical Analysis of Factors and their Effect on Test Flakiness - Practitioners' Perceptions, arXiv:1906.00673 [cs] (2019). URL: http://arxiv.org/abs/1906.00673.
[34] J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, D. Marinov, DeFlaker: Automatically Detecting Flaky Tests, in: 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), 2018, pp. 433–444. doi:10.1145/3180155.3180164.
[35] A. Gambi, J. Bell, A. Zeller, Practical Test Dependency Detection, in: 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), 2018, pp. 1–11. doi:10.1109/ICST.2018.00011.
[36] T. M. King, D. Santiago, J. Phillips, P. J. Clarke, Towards a Bayesian Network Model for Predicting Flaky Automated Tests, in: 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), IEEE Comput. Soc, Lisbon, 2018, pp. 100–107. doi:10.1109/QRS-C.2018.00031.
[37] S. Dutta, A. Shi, R. Choudhary, Z. Zhang, A. Jain, S. Misailovic, Detecting flaky tests in probabilistic and machine learning applications, in: Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2020, Association for Computing Machinery, New York, NY, USA, 2020, pp. 211–224. URL: https://doi.org/10.1145/3395363.3397366. doi:10.1145/3395363.3397366.
[38] A. Sjöbom, Studying Test Flakiness in Python Projects: Original Findings for Machine Learning, 2019. URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-264459.



