Joint Analysis of Families of SE Experiments

Adrián Santos Parrilla
University of Oulu, Finland
Department of Information Processing Science
adrian.santos.parrilla@oulu.fi

ABSTRACT
Context: Replication is of paramount importance for building solid theories in experimental disciplines and is a cornerstone of the evolution of science. Over the last few years, the role of replication in software engineering (SE), families of experiments and the need to aggregate the results of groups of experiments have attracted special attention. Frameworks, taxonomies, processes, recommendations and guidelines for reporting replications have been proposed to support the replication of SE experiments. There has been much less debate about the issue of the joint analysis of replications whose raw data are available to experimenters.

Objectives: The aim of our research is to explore current trends in the joint analysis of SE experiments whose raw data are available to experimenters. Notice that the fact that experimenters have access to the raw data is what differentiates joint analysis from other methods for aggregating experimental results (e.g. systematic literature review (SLR), where the applicability of meta-analysis techniques is widely accepted). The objective of this three-year investigation is to shed light on the best joint analysis approach when the experimenters have access to raw data from several replications.

Method: Narrative comparison, standard frequentist methods, meta-analysis and Bayesian methods have been used in SE literature. We will apply and evaluate each approach to the experiments on Test-Driven Development (TDD) carried out within the Experimental Software Engineering Industrial Laboratory (ESEIL) project.
We will propose and rate a tentative framework for aggregating results within the ESEIL project. The proposed framework, as well as the different existing methods, will be evaluated on another set of replications of testing technique experiments.

Current status: The thesis proposal was elicited on 15 January 2015 and rounded out over the following six months. As a three-year thesis, its discussion and findings will be projected across the years 2015, 2016 and 2017. The first results are now being aggregated with the data from four different experiments on TDD (two in academia and two in industry), and preliminary results are expected to be available in October 2015.

Keywords
SE replication, joint analysis, family of experiments, raw data.

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION
SE experiments can be analyzed separately to acquire knowledge about the performance of different treatments under certain circumstances (working environment or specific population characteristics). The shortcomings of this approach include: (1) the number of subjects is a limiting factor across most SE experiments; (2) the results might be artifactual, that is, due to the impact of the experimental protocol and not to the treatments applied by the subjects; (3) the findings from one study cannot be interpreted outside the confines of the setting of that experiment.

The role and importance of replications in tackling the issue of the generalization of SE experimental findings have been recognized by many authors within the SE community [11, 19, 22, 24]. As stated in [24], "replications play a key role in Empirical Software Engineering by allowing the community to build knowledge about which results or observations hold under which conditions". The aim of replication is twofold [23]. First, "replication is needed not merely to validate one's findings, but, more importantly, to establish the increasing range of radically different conditions under which the findings hold, and the predictable exceptions". Second, as noted in [11], "if an experiment is not replicated, there is no way to distinguish whether results were produced by chance (the observed event occurred accidentally), results are artifactual (the event occurred because of the experimental configuration but does not exist in reality) or results conform to patterns existing in reality". Thus, replication provides experimenters and the community with a continuous knowledge building process by: (1) confirming previous experimental results; and (2) identifying the reasons why previous results do not hold under the new experimental conditions. By aggregating the results of experiments, we get to see the whole picture for different population characteristics, settings and conformance to the treatments used within the SE community.

The shortage of replication studies within the SE community was highlighted in [28]. Out of a total of 5453 articles published in different SE-related journals and conference proceedings between 1993 and 2002, only 20 of the 113 controlled experiments identified were described as replications [28]. In a mapping study on SE replications completed from 2010 to 2011 [9], based on bibliographic searches covering the period from 1994 to 2010, only 96 out of 16,000 papers included replications. These 96 papers reported a total of 133 replications. Furthermore, the results showed that nearly 70% of the replications were published after 2004 and that up to 70% of the studies were internal replications (i.e., carried out by the same experimenters) [9]. An update of the same study identified and analyzed replications published in 2011 and 2012, and noted that the trend in the number of replications in SE continued to be upward (56 papers in two years) [9]. However, the growth rate was slow, possibly indicating the need for patterns to improve the way in which replications are run in the field [9].

The organization of an International Workshop on Replication in Empirical Software Engineering Research (RESER) is illustrative of the growing interest in replication. At this venue, empirical software engineering researchers have the opportunity to present and discuss the theoretical foundations and methods of replication, as well as the results of replicated studies.

Traditionally, groups of experiments have been formed within the SE community by means of SLR, and their results analyzed jointly in order to build new pieces of knowledge. But, nowadays, researchers are replicating their own studies in order to increase the relevance and validity of their findings.
As some authors state [29], there is a need to further investigate the problem of generalizing conclusions from individual studies. This could be done by extending the research tools commonly used in engineering and computer science with those applied in sciences that study people, such as medicine or psychology. Brooks [5] suggested that research methods like statistical meta-analysis could benefit software engineering in generalizing the findings from individual studies. However, SE experiments generally suffer from several constraints that make meta-analysis difficult to apply [10]: (1) small sample sizes (generally fewer than 10 subjects per treatment); (2) the number of experiments per meta-analysis is also small in many cases; (3) some studies do not provide the statistical parameters required for meta-analysis when reporting their results.

Even though meta-analysis is a widely accepted method for aggregating results from studies identified by means of SLR (generally reporting statistics such as the mean, standard deviation or number of subjects), there appears to be no such agreement on the right way to analyze the replications of experiments whose raw data are available to experimenters. Researchers who are in possession of the raw data of the experiments are better able to compute, understand and assess the different variables considered in the experiments than if they only have access to findings reported in different publications. Furthermore, the issue is nowadays a subject of debate in other fields, such as medicine or the social sciences, where the communities are still discussing the advantages and disadvantages of conducting meta-analysis with individual participant data (IPD) gathered from the constituent studies versus aggregated data (AD), i.e., the group-level statistics (effect sizes) that appear in reports of a study's results [7].

It is as yet unclear within the SE community which is the most straightforward and valid procedure for aggregating results when the raw data of the experiments are available to the experimenters and they have first-hand knowledge of the protocol and conditions. Besides, different joint analysis techniques may be applicable depending on the different characteristics of the replications.

The concept of family of experiments was first reported in SE by Basili et al. in 1991 [3]. This concept is explained in [1] as follows: "a family is composed of multiple similar experiments that pursue the same goal to build the knowledge needed to extract significant conclusions". From this point of view, the concept of family of experiments is a "framework for organizing sets of related studies" [3], where "experiments can be viewed as part of common families of studies rather than being isolated events" [3]. As each of these experiments is viewed as belonging to a group of studies, their results could be analyzed as a whole instead of separately, and the findings integrated into one comprehensive result.

Notice that this definition of family of experiments also covers related experiments found by a SLR, even though the experimenters are completely unconnected. We think that the concept of family of experiments should be defined more precisely in order to make a distinction between the two situations below:

i. Set of related experiments, typically found in a SLR, that can be aggregated to generate evidence. The available information in these cases provides only a short description of the protocol and conditions as regards the setting, and no more than sample descriptive statistics as regards the data.

ii. Set of experiments conducted by related researchers that make the raw data available for further joint analysis. In this case, the available information covers everything that the experimenters know about their own studies.

We suggest that the term family of experiments should be used to refer to situation (ii) above. Thus, this research narrows down the meaning of family of experiments to a definition similar to the explanation given in [1]: "a set of similar experiments that pursue the same goal to build the knowledge needed to extract significant conclusions", where experimenters have the raw data of the experiments and first-hand knowledge of the setting.

In this article we report a PhD thesis that is being carried out to investigate how to conduct a joint analysis of a family of experiments. We first report the current methods that have been used in SE for the joint analysis of experiments. We also outline a tentative path for building a framework for aggregating the results of families of experiments.

This paper is organized as follows. Section 2 briefly discusses relevant prior work on the topic of results aggregation in SE. Section 3 outlines the main objectives of the proposal. Section 4 describes the proposed research approach. Finally, Section 5 summarizes the current status of the outlined proposal.
2. RELATED WORK
In the following sections, we briefly discuss the different approaches proposed and adopted within the SE community to conduct joint analyses of families of experiments whose raw data are available to experimenters, discuss their applicability, and state the conclusions concerning their use reported in the different publications. The different techniques are discussed in chronological order by date of publication of the respective paper applying or proposing the technique.

2.1 Narrative Comparison
The difference between close and differentiated replications was discussed in depth by Juristo and Vegas in 2011 [19]. They proposed an approach for analyzing groups of experiments: the results of each experiment are analyzed separately and are then grouped according to concordances and discordances between the results of the replications identified through narrative comparison. A differentiated replication (i.e., a replication that produces a different outcome than the main experiment) is considered as an opportunity to explore the different variables that might have had an impact on the outcome rather than being seen as a threat to the validity of the replication. There are several noteworthy points with regard to the study reported in [19]:

- There is a big imbalance between the total number of subjects participating in each replication (176 participants at the UPM; 31 subjects in the UPV replication and 76 in the ORT replication). This imbalance in the number of subjects could have biased the results due to natural random variability.

- The report states that "the results are considered equal if the estimated mean value for the replication results is within the confidence interval of the baseline experiment results" [19]. Some sources [8] state that roughly 83% of replications will fall within the 95% confidence interval for the mean of the original experiment. In other words, even if two samples (one per experiment) are drawn from the same population, there is an 83% chance that the mean of the second experiment will fall within the confidence interval of the first. Thus, due to the random variability of the sample, Juristo and Vegas might be considering the result of a replication as a different outcome merely because its mean did not fall within the confidence interval of the main experiment (although, in actual fact, it represents the same result, i.e., population, in a different random sample). This underestimation of sampling variability is a limitation of the applicability of the narrative comparison approach proposed in [19]: roughly 17% of replication results will, due to random variability alone, be considered different when they are in fact equal (i.e., represent the same population). A small simulation illustrating this point is sketched after this list.
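The 83% figure quoted from [8] can be checked with a short simulation: repeatedly draw an "original" and a "replication" sample from the same population, build the 95% confidence interval for the original mean, and count how often the replication mean falls inside it. The Python sketch below does exactly that; the population parameters and sample sizes are illustrative assumptions of ours, not values taken from [19] or [8].

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, mu, sigma = 30, 50.0, 10.0          # assumed sample size and population; illustrative only
runs, captures = 20000, 0
for _ in range(runs):
    original = rng.normal(mu, sigma, n)
    replication = rng.normal(mu, sigma, n)
    se = original.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lower = original.mean() - t_crit * se
    upper = original.mean() + t_crit * se
    captures += lower <= replication.mean() <= upper
print(f"replication mean inside original 95% CI: {captures / runs:.1%}")

The printed proportion comes out close to 83%, which is another way of saying that the criterion used in [19] would label roughly 17% of genuinely equal results as different.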
This method relies on narrative comparison to analyze a family of experiments, using observations such as the means of the different outcome variables obtained by the subjects in the experiments to discuss the findings. It is up to the expert analyst to observe and interpret the outcomes, and the method depends on their ability to identify extraneous factors that might have influenced the outcome in the different replications. One clear drawback is that the technique might underestimate the random sampling variability within a certain population and thus overestimate the effect of third variables. This technique can be applied to the raw data of the experiments or to the known descriptive statistics of the different experimental outcomes (although the variables should be interpreted with due caution if the raw data are not available).

In any case, the narrative comparison technique requires the experimenters of the replications to interact in order to identify extraneous variables that might have had an impact on the consistency of results, so as to gather knowledge for further investigation.

2.2 Meta-Analysis
Meta-analysis is a set of statistical techniques that has been used to combine the different effect sizes of a family of experiments [9, 10]. Effect sizes can be estimated to evaluate the average impact of an independent variable on a dependent variable across studies. Since measures may be taken from different settings and may be non-uniform, a standardized measure must be taken for each experiment. These measures must then be combined to estimate the global effect size of a factor [1].

Meta-analysis is the current standard for aggregating quantitative results across studies [22]. It can be used to combine data even if studies report contradictory results, provided that the overall variation is not too extreme [25].
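To make the mechanics concrete, the sketch below computes a bias-corrected standardized mean difference (Hedges' g) for each experiment from its raw data and pools the estimates with inverse-variance (fixed-effect) weights. The data, experiment names and the choice of a fixed-effect rather than random-effects model are illustrative assumptions of ours; none of this is taken from the studies cited above, and a real analysis would normally rely on a dedicated meta-analysis package and consider heterogeneity explicitly.

import numpy as np

def hedges_g(treatment, control):
    # Bias-corrected standardized mean difference and its approximate variance.
    n1, n2 = len(treatment), len(control)
    s_pooled = np.sqrt(((n1 - 1) * np.var(treatment, ddof=1) +
                        (n2 - 1) * np.var(control, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(treatment) - np.mean(control)) / s_pooled
    correction = 1 - 3 / (4 * (n1 + n2) - 9)      # small-sample correction factor
    g = correction * d
    var_g = correction ** 2 * ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return g, var_g

experiments = {                                    # invented raw scores per group
    "EXP1": ([12, 15, 11, 14, 16], [10, 9, 12, 11, 10]),
    "EXP2": ([13, 12, 15, 14], [11, 10, 12, 9]),
    "EXP3": ([14, 16, 13, 15, 12], [12, 11, 13, 10, 12]),
}
gs, variances = zip(*(hedges_g(np.array(t, float), np.array(c, float))
                      for t, c in experiments.values()))
weights = 1 / np.array(variances)                  # inverse-variance weights
pooled = np.sum(weights * np.array(gs)) / np.sum(weights)
se_pooled = np.sqrt(1 / np.sum(weights))
print(f"pooled g = {pooled:.2f}, 95% CI [{pooled - 1.96 * se_pooled:.2f}, "
      f"{pooled + 1.96 * se_pooled:.2f}]")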
Meta-analysis has been used within the SE community with multiple objectives, such as studying the effects of TDD on external quality and productivity [26], checking for correlations of metrics across software project corpora [30] or assessing the effect of different inspection techniques on defect detection rates [14], amongst others [6, 13]. Furthermore, guidelines for applying different meta-analytic techniques have been proposed for the SE community [10].

Traditional meta-analytic techniques rely on the assumption that effect size estimates from different experiments are independent and have sampling distributions with known conditional variances [16]. An experiment that examines multiple dependent variables, or a cluster of studies carried out by the same investigator or laboratory [16], poses a threat to the supposed independence of experiments. The hierarchical dependence model is applicable when the dependence structure between the experiments is due to the inherent condition of belonging to a cluster of experiments [12]. The circumstances, implications and impact of dependence across studies have been studied at length by different researchers [15, 21], and multiple techniques for dealing with this issue have been proposed [16]. However, their usage requires an in-depth knowledge of the different techniques available, and their applicability is by no means clear [16].

How meta-analysis should be applied to a family of experiments in SE is a matter of debate, and there are many opinions on procedure. As stated by Miller [24], "because the dependent replications rely on the same underlying protocols as the original study, their results cannot be considered as truly independent of the original study. Moreover, they may propagate any accidental biases from the original study into the results of the replication". Recall that independence, one of the main assumptions underlying traditional meta-analysis, could be violated in cases where replications are run by related researchers.

Again, if experimental material is reused (thus increasing the dependence between two experiments), "although from a simple replication point of view, this seems attractive; from a meta-analysis point of view this is undesirable, as it creates strong correlations between the two studies" [24]. Kitchenham shares this view [22], stating: "in particular, dependent replications violate the main assumption underlying meta-analysis which is the standard method of aggregating results from quantitative experiments. Recently, my colleagues and I were forced to omit three studies from a systematic literature review because the 'replications' were so close that they offered no additional information to the aggregation process".

Pickard et al. [25] state, in reference to the outcome of the primary studies, that "the greater the degree of similarity between the studies the more confidence you can have in the results of a meta-analysis". The best thing, then, would be very similar settings without either communication or information sharing among experimenters: a rare occurrence in SE.

Furthermore, it is up to researchers to settle several issues regarding meta-analysis (the sketch after this list illustrates how much these choices can matter), such as:

- The selection of the effect size metric used to perform the joint analysis, i.e., computation of the raw mean difference, a standardized mean difference, odds ratio, risk ratio or risk difference [4].

- The standardizer used to compute the effect size, i.e., pooling the standard deviations, weighting each group's standard deviation by sample size, or using the control group standard deviation [8].

- The computation of some effect sizes from others, or the use of unbiased versions of effect size metrics [8].
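As a toy illustration of the first two choices, the computation below derives a raw mean difference, Cohen's d with a pooled standardizer, Glass's delta with the control-group standardizer, and the bias-corrected Hedges' g from the same pair of invented group summaries; the resulting values differ noticeably even though the underlying data are identical.

import math

n_t, mean_t, sd_t = 12, 68.0, 14.0     # treatment group summary (invented)
n_c, mean_c, sd_c = 12, 60.0, 9.0      # control group summary (invented)
raw_diff = mean_t - mean_c                                    # raw mean difference
sd_pooled = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                      / (n_t + n_c - 2))
cohen_d = raw_diff / sd_pooled                                # pooled standardizer
glass_delta = raw_diff / sd_c                                 # control-group standardizer
hedges_g = (1 - 3 / (4 * (n_t + n_c) - 9)) * cohen_d          # bias-corrected version
print(f"raw difference = {raw_diff:.1f}, d = {cohen_d:.2f}, "
      f"delta = {glass_delta:.2f}, g = {hedges_g:.2f}")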
Besides, the impact of experimental designs on the resulting effect sizes (such as multiple-treatment and multiple-endpoint studies [16]) makes the applicability of meta-analysis a controversial topic in the SE community.

Meta-analysis has the potential to aggregate the results of different experiments even if the raw data are not available, although its stability in SE experiments has been questioned [24]. However, meta-analysis has also been applied when the experimenters are in possession of the raw data [1]. This raises the question of whether the best procedure for the joint analysis of experiments whose raw data are available to researchers is to apply meta-analysis techniques or whether it would be better to use other approaches. As noted in [4]: "losing sight of the fact that meta-analysis is a tool with multiple applications causes confusion and leads to pointless discussions about what is the right way to perform a research synthesis, when there is no single right way".

All of the above raises doubts within the community, which does not appear to be clear about the applicability of meta-analysis, its boundaries and its misuses, adding to the confusion surrounding the aggregation of results in families of SE experiments.

2.3 Standard Frequentist Methods
Another option for conducting the joint analysis of families of experiments is to analyze all the data together via standard frequentist methods such as analysis of variance (ANOVA) [27]. Basically, each experiment within a group of replicated controlled experiments is analyzed separately. After gathering knowledge about the results of the different replications and briefly discussing whether or not the results hold, the experimenters then hypothesize about which variables might have had an impact on the results. In the next step, the raw data from all the different studies belonging to the family of experiments are aggregated and analyzed as a whole, considering the hypothesized variables as factors.

The study presented by Runeson et al. [27] in 2014 is an example of such an approach. They report three experiments comparing code inspections with unit testing: the original experiment, an internal replication (a replication performed by the same researchers minimizing changes in the replication) and an external replication (a replication performed by a different group of researchers, varying several aspects of the experiment) [27]. The three experiments were cross-over designs, where the subjects applied one defect detection method (code inspection or structural unit testing) to one program and then the other method to the other program [27]. The dependent variables of the experiments are the time spent on the tasks, the number of defects detected and localized, and the rate, i.e., the number of defects detected and localized per time unit [27].

The separate analyses performed for each experiment in [27] appear to be clearly explained from the data analysis viewpoint: "the experiment has two factors, paired measurements, a sample size of less than 30 and data which is not normally distributed". When aggregating the data from the three experiments into one data set and carrying out the joint analysis, however, Runeson et al. report [27] "the overall two-factor ANOVA results for the three experiments". Notice that the authors no longer mention that the data are "repeated measures", and the "two-factor ANOVA" analysis carried out is interpreted without any reference to a within-subjects factor. Furthermore, the joint analysis of the three experiments is performed using a Kruskal-Wallis test, the equivalent of the one-way ANOVA test for non-normal distributions.

Because the data are dependent, a repeated measures general linear model could have been fitted to analyze the data. Also, the within-subjects and between-subjects factors considered should have been clearly specified in order to pave the way for data analysis, understandability and reproduction.
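The contrast between separate and joint frequentist analyses can be sketched as follows: run a non-parametric test per experiment (mirroring the kind of test used in [27]) and then pool the raw data and fit an ANOVA with technique, experiment and their interaction as factors. The data are invented, the column names are our own, and the pooled model is a plain between-subjects ANOVA; it deliberately ignores the within-subjects structure of a cross-over design, which, as argued above, a real joint analysis should model (e.g., with a repeated measures or mixed model).

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for experiment, shift in [("EXP1", 0.0), ("EXP2", 1.5), ("EXP3", -0.5)]:
    for technique, effect in [("inspection", 0.0), ("testing", 2.0)]:
        for value in rng.normal(10 + shift + effect, 3, size=15):
            rows.append({"experiment": experiment, "technique": technique,
                         "defect_rate": value})
data = pd.DataFrame(rows)
# Separate analyses: one Kruskal-Wallis test per experiment.
for experiment, subset in data.groupby("experiment"):
    groups = [g["defect_rate"].values for _, g in subset.groupby("technique")]
    h, p = stats.kruskal(*groups)
    print(f"{experiment}: Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")
# Joint analysis: pooled raw data with experiment and technique as crossed factors.
model = smf.ols("defect_rate ~ C(technique) * C(experiment)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))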
A framework applying this approach to aggregate results from a family of experiments should comply with three objectives: (1) provide a specific set of steps to be carried out to pre-process the data and to report the data pre-processing of the experiments; (2) provide a template with all the relevant information that should be stated about each of the experiments in order to carry out the individual analyses; (3) provide guidance for defining a joint analysis from separate experiments, accounting for any of the possible limitations of each experiment.

2.4 Bayesian Methods
Bayesian methods for data analysis have also been applied to the aggregation of the results of a family of experiments [17, 18]. They resolve the inconsistencies found between replications and the original experiment by investigating moderators, i.e., variables that cause an effect to differ across contexts [2]. An iterative approach is applied to try to identify moderators that might have an influence on the outcome of the experiment. The different variables, and the interactions among them, studied in the proposed models are measured based on the most relevant changes made to the different replications. As explained in [17], "By moderator, we mean any explanatory variable that interacts with another explanatory variable in predicting a response variable. For one variable to "moderate" another does not mean that it dampens the other's effect —rather, it means that an interaction exists, such that the latter's effect varies in response to the former".

Bayesian methods provide an alternative to traditional meta-analysis. First, using Bayesian methods, data can be accumulated over time (as prior knowledge) into the analysis of future replications. Second, Bayesian methods can be used to combine results such that all data are treated as current observations [18].
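A minimal way to see the "accumulate data over time" idea is a conjugate normal-normal update: the effect estimate from each successive replication updates a prior on the underlying treatment effect, and the resulting posterior serves as the prior for the next replication. The sketch below uses invented effect estimates with known variances; it is far simpler than the hierarchical models with moderator variables used in [17, 18] and is meant only to convey the updating mechanism.

import math

# (label, effect estimate, variance of the estimate) -- invented numbers
replications = [("original experiment", 0.60, 0.10),
                ("internal replication", 0.45, 0.08),
                ("external replication", 0.10, 0.12)]
prior_mean, prior_var = 0.0, 1.0        # weakly informative starting prior
for label, estimate, variance in replications:
    posterior_var = 1 / (1 / prior_var + 1 / variance)
    posterior_mean = posterior_var * (prior_mean / prior_var + estimate / variance)
    print(f"after {label}: effect ~ N({posterior_mean:.2f}, sd {math.sqrt(posterior_var):.2f})")
    prior_mean, prior_var = posterior_mean, posterior_var   # posterior becomes the next prior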
However, the application of this method to joint analysis requires thorough knowledge of Bayesian statistics, which have seldom been used in an SE community dominated by frequentist methods such as meta-analysis: the current standard for aggregating quantitative results [22, 25].

3. RESEARCH OBJECTIVES
We have carried out several experiments to assess the TDD agile development technique [31]. A lot of data are being collected from multiple replications, and their analysis and processing could provide insights into proper ways of handling and synthesizing the results of a joint analysis.

Our research is driven by several methodological questions:

1. What is the best way of analyzing families of experiments with raw data in SE? Do the existing approaches produce contradictory results? Under what circumstances are these different analysis approaches applicable?
2. Where are the limits to the feasibility of grouping different experiments, i.e., how similar does the design of the experiments need to be?
3. Is there any kind of knowledge about the different experiments that is of paramount importance for joint analysis? Is it always correctly reported?

Also, several TDD-specific questions will drive our research:

1. Does subject type (students or professionals) have any implications regarding the performance of different development techniques (TDD, ITL)?
2. Does any moderator variable, or interaction amongst moderator variables, across different organizational and/or academic setups have an impact on the performance of the development methodologies?
3. Does the context (academia, industry or even different industries) have any impact on the performance of the subjects applying the different development techniques?

Other research questions might arise in the course of the research, and their implications will be discussed thoroughly as part of the PhD thesis.

Our research will provide different contributions to academia and practice:

1. Different methods for the aggregation of results will be used jointly for analyzing families of experiments whose raw data are available to experimenters, and the limits of applicability of the different techniques will be discussed throughout the research process.
2. Multiple industry experiments on TDD will be aggregated and their results for different software metrics analyzed. Specifically, different treatments such as traditional test-last coding or ITL will be compared against TDD in industrial settings, which may lead to interesting findings about the effects of TDD in real software development contexts.
4. RESEARCH APPROACH
Within the ESEIL project, several replications (i.e., reporting both different and consistent results) are being run on the topic of TDD performance. These replications consistently alter different aspects of the primary study (design of the experiment, subject type, instrumentation, treatments, artifacts, location, training, researchers, session length, etc.). All these replications are being analyzed separately and their findings discussed.

In order to carry out the joint analysis of experiments, we will first run a search for current trends in the analysis of families of experiments in other areas, such as agriculture, psychology or medicine. Such bibliographic searches of online databases could turn up a variety of methods or prescriptions that might lend themselves to extrapolation to SE. The conditions under which these techniques can be used and their limitations will be studied within specific SE setups, and they will be assessed by means of direct application to the ESEIL project experiments on TDD.

The application of these different analysis techniques to the same family of experiments can lead to multiple, possibly even contradictory, results. In our studies we will try to explore the scope of application of the different aggregation approaches and define the limits of their applicability for conducting joint analysis.

After exploring these approaches and discussing their implications within the ESEIL project, we will propose a framework for the joint analysis of families of experiments. A second version of this framework will be refined and further expanded with the aim of applying it to a family of testing experiments [20].

We will then adopt the most promising methods in order to extrapolate their applicability to a different set of experiments within the SE community, such as experiments on software requirements.

Finally, the proposed updated framework could be assessed and reviewed by colleagues within the SE community from different viewpoints in order to lend the proposal higher external validity and consistency.

5. SUMMARY OF CURRENT STATUS
The thesis proposal was elicited on 15 January 2015 and rounded out over the following six months. As a three-year thesis, its findings and proposals will be projected across the years 2015, 2016 and 2017. The publishing strategy targets publication at the ICSE and ESE conferences and in the TSE, TOSEM, EMSE and IST journals over the three-year research period.

At the time of writing, a preliminary aggregation of results is being carried out using the data from four replications as part of the Experimental Software Engineering Industrial Laboratory (ESEIL) project. Two experiments were run in a professional setting, whereas another two were run in academia. The results of the experiments will be aggregated using different analysis approaches, and their implications, constraints and findings will be discussed and further explored in subsequent studies.
6. REFERENCES
[1] Abrahao, S., Gravino, C., Insfran, E., Scanniello, G., & Tortora, G. (2013). Assessing the effectiveness of sequence diagrams in the comprehension of functional requirements: Results from a family of five experiments. IEEE Transactions on Software Engineering, 39(3), 327-342.
[2] Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173.
[3] Basili, V. R., Shull, F., & Lanubile, F. (1999). Building knowledge through families of experiments. IEEE Transactions on Software Engineering, 25(4), 456-473.
[4] Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2011). Introduction to Meta-Analysis. John Wiley & Sons.
[5] Brooks, A. (1997). Meta analysis—a silver bullet—for meta-analysts. Empirical Software Engineering, 2(4), 333-338.
[6] Ciolkowski, M. (2009, October). What do we know about perspective-based reading? An approach for quantitative aggregation in software engineering. In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement (pp. 133-144). IEEE Computer Society.
[7] Cooper, H., & Patall, E. A. (2009). The relative benefits of meta-analysis conducted with individual participant data versus aggregated data. Psychological Methods, 14(2), 165.
[8] Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge.
[9] Da Silva, F. Q., Suassuna, M., França, A. C. C., Grubb, A. M., Gouveia, T. B., Monteiro, C. V., & dos Santos, I. E. (2014). Replication of empirical studies in software engineering research: A systematic mapping study. Empirical Software Engineering, 19(3), 501-557.
[10] Dieste, O., Fernández, E., Garcia Martinez, R., & Juristo, N. (2011, April). Comparative analysis of meta-analysis methods: When to use which? In Evaluation & Assessment in Software Engineering (EASE 2011), 15th Annual Conference on (pp. 36-45). IET.
[11] Gómez, O. S., Juristo, N., & Vegas, S. (2014). Understanding replication of experiments in software engineering: A classification. Information and Software Technology, 56(8), 1033-1048.
[12] Gurevitch, J., & Hedges, L. V. (1999). Statistical issues in ecological meta-analyses. Ecology, 80(4), 1142-1149.
[13] Hannay, J. E., Dybå, T., Arisholm, E., & Sjøberg, D. I. (2009). The effectiveness of pair programming: A meta-analysis. Information and Software Technology, 51(7), 1110-1122.
[14] Hayes, W. (1999). Research synthesis in software engineering: A case for meta-analysis. In Software Metrics Symposium, 1999. Proceedings. Sixth International (pp. 143-151). IEEE.
[15] Hedges, L. V., & Olkin, I. (2014). Statistical Methods for Meta-Analysis. Academic Press.
[16] Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39-65.
[17] Krein, J. L., Prechelt, L., Juristo, N., Nanthaamornphong, A., Carver, J. C., Vegas, S., Knutson, C. D., Seppi, K. D., & Egget, D. L. A multi-site joint replication of a design patterns experiment using moderator variables to generalize across contexts. IEEE Transactions on Software Engineering. Under review.
[18] Krein, J. L., Prechelt, L., Juristo, N., Seppi, K. D., Nanthaamornphong, A., Carver, J. C., Vegas, S., & Knutson, C. D. A method for generalizing across contexts in software engineering experiments. IEEE Transactions on Software Engineering. Submitted.
[19] Juristo, N., & Vegas, S. (2011). The role of non-exact replications in software engineering experiments. Empirical Software Engineering, 16(3), 295-324.
[20] Juristo, N., Vegas, S., Solari, M., Abrahao, S., & Ramos, I. (2012, April). Comparing the effectiveness of equivalence partitioning, branch testing and code reading by stepwise abstraction applied by subjects. In Software Testing, Verification and Validation (ICST), 2012 IEEE Fifth International Conference on (pp. 330-339). IEEE.
[21] Kalaian, H. A., & Raudenbush, S. W. (1996). A multivariate mixed linear model for meta-analysis. Psychological Methods, 1(3), 227.
[22] Kitchenham, B. (2008). The role of replications in empirical software engineering—a word of warning. Empirical Software Engineering, 13(2), 219-221.
[23] Lindsay, R. M., & Ehrenberg, A. S. (1993). The design of replicated studies. The American Statistician, 47(3), 217-228.
[24] Miller, J. (2000). Applying meta-analytical procedures to software engineering experiments. Journal of Systems and Software, 54(1), 29-39.
[25] Pickard, L. M., Kitchenham, B. A., & Jones, P. W. (1998). Combining empirical results in software engineering. Information and Software Technology, 40(14), 811-821.
[26] Rafique, Y., & Misic, V. (2013). The effects of test-driven development on external quality and productivity: A meta-analysis. IEEE Transactions on Software Engineering, 39(6), 835-856.
[27] Runeson, P., Stefik, A., & Andrews, A. (2014). Variation factors in the design and analysis of replicated controlled experiments. Empirical Software Engineering, 19(6), 1781-1808.
[28] Sjøberg, D. I., Hannay, J. E., Hansen, O., Kampenes, V. B., Karahasanovic, A., Liborg, N. K., & Rekdal, A. C. (2005). A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering, 31(9), 733-753.
[29] Succi, G., Spasojevic, R., Hayes, J. J., Smith, M. R., & Pedrycz, W. (2000). Application of statistical meta-analysis to software engineering metrics data. In Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics (Vol. 1, pp. 709-714).
[30] Succi, G., Spasojevic, R., Hayes, J. J., Smith, M. R., & Pedrycz, W. (2000). Application of statistical meta-analysis to software engineering metrics data. In Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics (Vol. 1, pp. 709-714).
[31] Vegas, S., Dieste, O., & Juristo, N. (2015, May). Difficulties in running experiments in the software industry: Experiences from the trenches. In Conducting Empirical Studies in Industry (CESI), 2015 IEEE/ACM 3rd International Workshop on (pp. 3-9). IEEE. doi:10.1109/CESI.2015.8.