Joint Analysis of Families of SE Experiments

Adrián Santos Parrilla
University of Oulu, Finland
Department of Information Processing Science
adrian.santos.parrilla@oulu.fi

ABSTRACT
Context: Replication is of paramount importance for building solid theories in experimental disciplines and is a cornerstone of the evolution of science. Over the last few years, the role of replication in software engineering (SE), families of experiments and the need to aggregate the results of groups of experiments have attracted special attention. Frameworks, taxonomies, processes, recommendations and guidelines for reporting replications have been proposed to support the replication of SE experiments. There has been much less debate about the issue of the joint analysis of replications whose raw data are available to experimenters.

Objectives: The aim of our research is to explore current trends in the joint analysis of SE experiments whose raw data are available to experimenters. Notice that the fact that experimenters have access to the raw data is what differentiates joint analysis from other methods for aggregating experimental results (e.g. systematic literature review (SLR), where the applicability of meta-analysis techniques is widely accepted). The objective of this three-year investigation is to shed light on the best joint analysis approach when the experimenters have access to raw data from several replications.

Method: Narrative comparison, standard frequentist methods, meta-analysis and Bayesian methods have been used in SE literature. We will apply and evaluate each approach to the experiments on Test-Driven Development (TDD) carried out within the Experimental Software Engineering Industrial Laboratory (ESEIL) project.
We will propose and rate a tentative framework for aggregating results within the ESEIL project. The proposed framework, as well as the different existing methods, will be evaluated on another set of replications of testing technique experiments.

Current status: The thesis proposal was elicited on 15 January 2015 and rounded out over the following six months. As a three-year thesis, its discussion and findings will be projected across the years 2015, 2016 and 2017. The first results are now being aggregated with the data from four different experiments on TDD (two in academia and two in industry), and preliminary results are expected to be available in October 2015.

Keywords
SE replication, joint analysis, family of experiments, raw data.

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION
SE experiments can be analyzed separately to acquire knowledge about the performance of different treatments under certain circumstances (working environment or specific population characteristics). The shortcomings of this approach include: (1) the number of subjects is a limiting factor across most SE experiments; (2) the results might be artifactual, that is, due to the impact of the experimental protocol and not to the treatments applied by the subjects; (3) the findings from one study cannot be interpreted outside the confines of the setting of that experiment.

The role and importance of replications in tackling the issue of the generalization of SE experimental findings have been recognized by many authors within the SE community [11, 19, 22, 24]. As stated in [24], "replications play a key role in Empirical Software Engineering by allowing the community to build knowledge about which results or observations hold under which conditions". The aim of replication is twofold [23]. First, "replication is needed not merely to validate one's findings, but, more importantly, to establish the increasing range of radically different conditions under which the findings hold, and the predictable exceptions". Second, as noted in [11], "if an experiment is not replicated, there is no way to distinguish whether results were produced by chance (the observed event occurred accidentally), results are artifactual (the event occurred because of the experimental configuration but does not exist in reality) or results conform to patterns existing in reality". Thus, replication provides experimenters and the community with a continuous knowledge building process by: (1) confirming previous experimental results; and (2) identifying the reasons why previous results do not hold under the new experimental conditions. By aggregating the results of experiments, we get to see the whole picture for different population characteristics, settings and conformance to the treatments used within the SE community.

The shortage of replication studies within the SE community was highlighted in [28]. Out of a total of 5453 articles published in different SE-related journals and conference proceedings between 1993 and 2002, only 20 of the 113 controlled experiments identified were described as replications [28]. In a mapping study on SE replications completed from 2010 to 2011 [9], based on bibliographic searches covering the period from 1994 to 2010, only 96 out of 16,000 papers included replications. These 96 papers reported a total of 133 replications. Furthermore, the results showed that nearly 70% of the replications were published after 2004 and that up to 70% of the studies were internal replications (i.e., carried out by the same experimenters) [9]. An update of the same study identified and analyzed replications published in 2011 and 2012, and noted that the trend in the number of replications in SE continued to be upward (56 papers in two years) [9]. However, the growth rate was slow, possibly indicating the need for patterns to improve the way in which replications are run in the field [9].

The organization of an International Workshop on Replication in Empirical Software Engineering Research (RESER) is illustrative of the growing interest in replication. At this venue, empirical software engineering researchers have the opportunity to present and discuss the theoretical foundations and methods of replication, as well as the results of replicated studies.

Traditionally, groups of experiments have been formed within the SE community by means of SLR, and their results analyzed jointly in order to build new pieces of knowledge. But, nowadays, researchers are replicating their own studies in order to increase the relevance and validity of their findings.
As some authors state [29], there is a need to further investigate the problem of generalizing conclusions from individual studies. This could be done by extending the research tools commonly used in engineering and computer science with those applied in sciences that study people, such as medicine or psychology. Brooks [5] suggested that research methods like statistical meta-analysis could benefit software engineering in generalizing the findings from individual studies. However, SE experiments generally suffer from several constraints that make meta-analysis difficult to apply [10]: (1) small sample sizes (generally fewer than 10 subjects per treatment); (2) the number of experiments per meta-analysis is also small in many cases; (3) some studies do not provide the statistical parameters required for meta-analysis when reporting their results.

Even though meta-analysis is a widely accepted method for aggregating results from studies identified by means of SLR (generally reporting statistics such as the mean, standard deviation or number of subjects), there appears to be no such agreement on the right way to analyze the replications of experiments whose raw data are available to experimenters. Researchers who are in possession of the raw data of the experiments are better able to compute, understand and assess the different variables considered in the experiments than if they only have access to findings reported in different publications. Furthermore, the issue is nowadays a subject of debate in other fields, such as medicine or the social sciences, where the communities are still discussing the advantages and disadvantages of conducting meta-analysis with individual participant data (IPD) gathered from the constituent studies versus aggregated data (AD), i.e., the group-level statistics (effect sizes) that appear in reports of a study's results [7].

It is as yet unclear within the SE community which is the most straightforward and valid procedure for aggregating results when the raw data of the experiments are available to the experimenters and they have first-hand knowledge of the protocol and conditions. Besides, different joint analysis techniques may be applicable depending on the different characteristics of the replications.

The concept of family of experiments was first reported in SE by Basili et al. in 1991 [3]. This concept is explained in [1] as follows: "a family is composed of multiple similar experiments that pursue the same goal to build the knowledge needed to extract significant conclusions". From this point of view, the concept of family of experiments is a "framework for organizing sets of related studies" [3], where "experiments can be viewed as part of common families of studies rather than being isolated events" [3]. As each of these experiments is viewed as belonging to a group of studies, their results could be analyzed as a whole instead of separately, and the findings integrated into one comprehensive result.

Notice that this definition of family of experiments also covers related experiments found by a SLR, even though the experimenters are completely unconnected. We think that the concept of family of experiments should be defined more precisely in order to make a distinction between the two situations below:

i. Set of related experiments, typically found in a SLR, that can be aggregated to generate evidence. The available information in these cases provides only a short description of the protocol and conditions as regards the setting, and no more than sample descriptive statistics as regards the data.

ii. Set of experiments conducted by related researchers that make the raw data available for further joint analysis. In this case, the available information covers everything that the experimenters know about their own studies.

We suggest that the term family of experiments should be used to refer to situation (ii) above. Thus, this research narrows down the meaning of family of experiments to a definition similar to the explanation given in [1]: "a set of similar experiments that pursue the same goal to build the knowledge needed to extract significant conclusions", where experimenters have the raw data of the experiments and first-hand knowledge of the setting.

In this article we report a PhD thesis that is being carried out to investigate how to conduct a joint analysis of a family of experiments. We first report the current methods that have been used in SE for the joint analysis of experiments. We also outline a tentative path for building a framework for aggregating the results of families of experiments.

This paper is organized as follows. Section 2 briefly discusses relevant prior work on the topic of results aggregation in SE. Section 3 outlines the main objectives of the proposal. Section 4 describes the proposed research approach. Finally, Section 5 summarizes the current status of the outlined proposal.
2. RELATED WORK
In the following sections, we briefly discuss the different approaches proposed and adopted within the SE community to conduct joint analyses of families of experiments whose raw data are available to experimenters, discuss their applicability, and state the conclusions concerning their use reported in the different publications. The different techniques are discussed in chronological order by date of publication of the respective paper applying or proposing the technique.

2.1 Narrative Comparison
The difference between close and differentiated replications was discussed in depth by Juristo and Vegas in 2011 [19]. They proposed an approach for analyzing groups of experiments: the results of each experiment are analyzed separately and are then grouped according to concordances and discordances between the results of the replications identified through narrative comparison. A differentiated replication (i.e., a replication that produces a different outcome than the main experiment) is considered as an opportunity to explore the different variables that might have had an impact on the outcome rather than being seen as a threat to the validity of the replication. There are several noteworthy points with regard to the study reported in [19]:

- There is a big imbalance between the total number of subjects participating in each replication (176 participants at the UPM; 31 subjects in the UPV replication and 76 in the ORT replication). This imbalance in the number of subjects could have biased the results due to natural random variability.

- The report states that "the results are considered equal if the estimated mean value for the replication results is within the confidence interval of the baseline experiment results" [19]. Some sources [8] state that roughly 83% of replications will fall within the 95% confidence interval for the mean of the original experiment. In other words, even if two samples (one per experiment) are drawn from the same population, there is an 83% chance that the mean of the second experiment will fall within the confidence interval of the first. Thus, due to the random variability of the sample, Juristo and Vegas might be considering the result of a replication as a different outcome merely because its mean did not fall within the confidence interval of the main experiment (although, in actual fact, it represents the same result, i.e., population, in a different random sample). This underestimation of sampling variability is a limitation of the applicability of the narrative comparison approach proposed in [19]: roughly 17% of replication results will, due to random variability alone, be considered different when they are in fact equal (i.e., represent the same population). A small simulation illustrating this point is sketched after this list.
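The 83% figure quoted from [8] can be checked with a short simulation: repeatedly draw an "original" and a "replication" sample from the same population, build the 95% confidence interval for the original mean, and count how often the replication mean falls inside it. The Python sketch below does exactly that; the population parameters and sample sizes are illustrative assumptions of ours, not values taken from [19] or [8].

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, mu, sigma = 30, 50.0, 10.0          # assumed sample size and population; illustrative only
runs, captures = 20000, 0
for _ in range(runs):
    original = rng.normal(mu, sigma, n)
    replication = rng.normal(mu, sigma, n)
    se = original.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lower = original.mean() - t_crit * se
    upper = original.mean() + t_crit * se
    captures += lower <= replication.mean() <= upper
print(f"replication mean inside original 95% CI: {captures / runs:.1%}")

The printed proportion comes out close to 83%, which is another way of saying that the criterion used in [19] would label roughly 17% of genuinely equal results as different.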
This method relies on narrative comparison to analyze a family of experiments, using observations such as the means of the different outcome variables obtained by the subjects in the experiments to discuss the findings. It is up to the expert analyst to observe and interpret the outcomes, and the method depends on their ability to identify extraneous factors that might have influenced the outcome in the different replications. One clear drawback is that the technique might underestimate the random sampling variability within a certain population and thus overestimate the effect of third variables. This technique can be applied to the raw data of the experiments or to the known descriptive statistics of the different experimental outcomes (although the variables should be interpreted with due caution if the raw data are not available).

In any case, the narrative comparison technique requires the experimenters of the replications to interact in order to identify extraneous variables that might have had an impact on the consistency of results, so as to gather knowledge for further investigation.

2.2 Meta-Analysis
Meta-analysis is a set of statistical techniques that has been used to combine the different effect sizes of a family of experiments [9, 10]. Effect sizes can be estimated to evaluate the average impact of an independent variable on a dependent variable across studies. Since measures may be taken from different settings and may be non-uniform, a standardized measure must be taken for each experiment. These measures must then be combined to estimate the global effect size of a factor [1].

Meta-analysis is the current standard for aggregating quantitative results across studies [22]. It can be used to combine data even if studies report contradictory results, provided that the overall variation is not too extreme [25].
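To make the mechanics concrete, the sketch below computes a bias-corrected standardized mean difference (Hedges' g) for each experiment from its raw data and pools the estimates with inverse-variance (fixed-effect) weights. The data, experiment names and the choice of a fixed-effect rather than random-effects model are illustrative assumptions of ours; none of this is taken from the studies cited above, and a real analysis would normally rely on a dedicated meta-analysis package and consider heterogeneity explicitly.

import numpy as np

def hedges_g(treatment, control):
    # Bias-corrected standardized mean difference and its approximate variance.
    n1, n2 = len(treatment), len(control)
    s_pooled = np.sqrt(((n1 - 1) * np.var(treatment, ddof=1) +
                        (n2 - 1) * np.var(control, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(treatment) - np.mean(control)) / s_pooled
    correction = 1 - 3 / (4 * (n1 + n2) - 9)      # small-sample correction factor
    g = correction * d
    var_g = correction ** 2 * ((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return g, var_g

experiments = {                                    # invented raw scores per group
    "EXP1": ([12, 15, 11, 14, 16], [10, 9, 12, 11, 10]),
    "EXP2": ([13, 12, 15, 14], [11, 10, 12, 9]),
    "EXP3": ([14, 16, 13, 15, 12], [12, 11, 13, 10, 12]),
}
gs, variances = zip(*(hedges_g(np.array(t, float), np.array(c, float))
                      for t, c in experiments.values()))
weights = 1 / np.array(variances)                  # inverse-variance weights
pooled = np.sum(weights * np.array(gs)) / np.sum(weights)
se_pooled = np.sqrt(1 / np.sum(weights))
print(f"pooled g = {pooled:.2f}, 95% CI [{pooled - 1.96 * se_pooled:.2f}, "
      f"{pooled + 1.96 * se_pooled:.2f}]")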
Meta-analysis has been used within the SE community with multiple objectives, such as studying the effects of TDD on external quality and productivity [26], checking for correlations of metrics across software project corpora [30] or assessing the effect of different inspection techniques on defect detection rates [14], amongst others [6, 13]. Furthermore, guidelines for applying different meta-analytic techniques have been proposed for the SE community [10].

Traditional meta-analytic techniques rely on the assumption that effect size estimates from different experiments are independent and have sampling distributions with known conditional variances [16]. An experiment that examines multiple dependent variables, or a cluster of studies carried out by the same investigator or laboratory [16], poses a threat to the supposed independence of experiments. The hierarchical dependence model is applicable when the dependence structure between the experiments is due to the inherent condition of belonging to a cluster of experiments [12]. The circumstances, implications and impact of dependence across studies have been studied at length by different researchers [15, 21], and multiple techniques for dealing with this issue have been proposed [16]. However, their usage requires an in-depth knowledge of the different techniques available, and their applicability is by no means clear [16].

How meta-analysis should be applied to a family of experiments in SE is a matter of debate, and there are many opinions on procedure. As stated by Miller [24], "because the dependent replications rely on the same underlying protocols as the original study, their results cannot be considered as truly independent of the original study. Moreover, they may propagate any accidental biases from the original study into the results of the replication". Recall that independence, one of the main assumptions underlying traditional meta-analysis, could be violated in cases where replications are run by related researchers.

Again, if experimental material is reused (thus increasing the dependence between two experiments), "although from a simple replication point of view, this seems attractive; from a meta-analysis point of view this is undesirable, as it creates strong correlations between the two studies" [24]. Kitchenham shares this view [22], stating: "in particular, dependent replications violate the main assumption underlying meta-analysis which is the standard method of aggregating results from quantitative experiments. Recently, my colleagues and I were forced to omit three studies from a systematic literature review because the 'replications' were so close that they offered no additional information to the aggregation process".

Pickard et al. [25] state, in reference to the outcome of the primary studies, that "the greater the degree of similarity between the studies the more confidence you can have in the results of a meta-analysis". The best thing, then, would be very similar settings without either communication or information sharing among experimenters: a rare occurrence in SE.

Furthermore, it is up to researchers to settle several issues regarding meta-analysis (the sketch after this list illustrates how much these choices can matter), such as:

- The selection of the effect size metric used to perform the joint analysis, i.e., computation of the raw mean difference, a standardized mean difference, odds ratio, risk ratio or risk difference [4].

- The standardizer used to compute the effect size, i.e., pooling the standard deviations, weighting each group's standard deviation by sample size, or using the control group standard deviation [8].

- The computation of some effect sizes from others, or the use of unbiased versions of effect size metrics [8].
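As a toy illustration of the first two choices, the computation below derives a raw mean difference, Cohen's d with a pooled standardizer, Glass's delta with the control-group standardizer, and the bias-corrected Hedges' g from the same pair of invented group summaries; the resulting values differ noticeably even though the underlying data are identical.

import math

n_t, mean_t, sd_t = 12, 68.0, 14.0     # treatment group summary (invented)
n_c, mean_c, sd_c = 12, 60.0, 9.0      # control group summary (invented)
raw_diff = mean_t - mean_c                                    # raw mean difference
sd_pooled = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                      / (n_t + n_c - 2))
cohen_d = raw_diff / sd_pooled                                # pooled standardizer
glass_delta = raw_diff / sd_c                                 # control-group standardizer
hedges_g = (1 - 3 / (4 * (n_t + n_c) - 9)) * cohen_d          # bias-corrected version
print(f"raw difference = {raw_diff:.1f}, d = {cohen_d:.2f}, "
      f"delta = {glass_delta:.2f}, g = {hedges_g:.2f}")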
Besides, the impact of experimental designs on the resulting effect sizes (such as multiple-treatment and multiple-endpoint studies [16]) makes the applicability of meta-analysis a controversial topic in the SE community.

Meta-analysis has the potential to aggregate the results of different experiments even if the raw data are not available, although its stability in SE experiments has been questioned [24]. However, meta-analysis has also been applied when the experimenters are in possession of the raw data [1]. This raises the question of whether the best procedure for the joint analysis of experiments whose raw data are available to researchers is to apply meta-analysis techniques or whether it would be better to use other approaches. As noted in [4]: "losing sight of the fact that meta-analysis is a tool with multiple applications causes confusion and leads to pointless discussions about what is the right way to perform a research synthesis, when there is no single right way".

All of the above raises doubts within the community, which does not appear to be clear about the applicability of meta-analysis, its boundaries and its misuses, adding to the confusion surrounding the aggregation of results in families of SE experiments.

2.3 Standard Frequentist Methods
Another option for conducting the joint analysis of families of experiments is to analyze all the data together via standard frequentist methods such as analysis of variance (ANOVA) [27]. Basically, each experiment within a group of replicated controlled experiments is analyzed separately. After gathering knowledge about the results of the different replications and briefly discussing whether or not the results hold, the experimenters then hypothesize about which variables might have had an impact on the results. In the next step, the raw data from all the different studies belonging to the family of experiments are aggregated and analyzed as a whole, considering the hypothesized variables as factors.

The study presented by Runeson et al. [27] in 2014 is an example of such an approach. They report three experiments comparing code inspections with unit testing: the original experiment, an internal replication (a replication performed by the same researchers minimizing changes in the replication) and an external replication (a replication performed by a different group of researchers, varying several aspects of the experiment) [27]. The three experiments were cross-over designs, where the subjects applied one defect detection method (code inspection or structural unit testing) to one program and then the other method to the other program [27]. The dependent variables of the experiments are the time spent on the tasks, the number of defects detected and localized, and the rate, i.e., the number of defects detected and localized per time unit [27].

The separate analyses performed for each experiment in [27] appear to be clearly explained from the data analysis viewpoint: "the experiment has two factors, paired measurements, a sample size of less than 30 and data which is not normally distributed". When aggregating the data from the three experiments into one data set and carrying out the joint analysis, however, Runeson et al. report [27] "the overall two-factor ANOVA results for the three experiments". Notice that the authors no longer mention that the data are "repeated measures", and the "two-factor ANOVA" analysis carried out is interpreted without any reference to a within-subjects factor. Furthermore, the joint analysis of the three experiments is performed using a Kruskal-Wallis test, the equivalent of the one-way ANOVA test for non-normal distributions.

Because the data are dependent, a repeated measures general linear model could have been fitted to analyze the data. Also, the within-subjects and between-subjects factors considered should have been clearly specified in order to pave the way for data analysis, understandability and reproduction.
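The contrast between separate and joint frequentist analyses can be sketched as follows: run a non-parametric test per experiment (mirroring the kind of test used in [27]) and then pool the raw data and fit an ANOVA with technique, experiment and their interaction as factors. The data are invented, the column names are our own, and the pooled model is a plain between-subjects ANOVA; it deliberately ignores the within-subjects structure of a cross-over design, which, as argued above, a real joint analysis should model (e.g., with a repeated measures or mixed model).

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for experiment, shift in [("EXP1", 0.0), ("EXP2", 1.5), ("EXP3", -0.5)]:
    for technique, effect in [("inspection", 0.0), ("testing", 2.0)]:
        for value in rng.normal(10 + shift + effect, 3, size=15):
            rows.append({"experiment": experiment, "technique": technique,
                         "defect_rate": value})
data = pd.DataFrame(rows)
# Separate analyses: one Kruskal-Wallis test per experiment.
for experiment, subset in data.groupby("experiment"):
    groups = [g["defect_rate"].values for _, g in subset.groupby("technique")]
    h, p = stats.kruskal(*groups)
    print(f"{experiment}: Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")
# Joint analysis: pooled raw data with experiment and technique as crossed factors.
model = smf.ols("defect_rate ~ C(technique) * C(experiment)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))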
A framework applying this approach to aggregate results from a family of experiments should comply with three objectives: (1) provide a specific set of steps to be carried out to pre-process the data and to report the data pre-processing of the experiments; (2) provide a template with all the relevant information that should be stated about each of the experiments in order to carry out the individual analyses; (3) provide guidance for defining a joint analysis from separate experiments, accounting for any of the possible limitations of each experiment.

2.4 Bayesian Methods
Bayesian methods for data analysis have also been applied to the aggregation of the results of a family of experiments [17, 18]. They resolve the inconsistencies found between replications and the original experiment by investigating moderators, i.e., variables that cause an effect to differ across contexts [2]. An iterative approach is applied to try to identify moderators that might have an influence on the outcome of the experiment. The different variables, and the interactions among them, studied in the proposed models are measured based on the most relevant changes made to the different replications. As explained in [17], "By moderator, we mean any explanatory variable that interacts with another explanatory variable in predicting a response variable. For one variable to "moderate" another does not mean that it dampens the other's effect —rather, it means that an interaction exists, such that the latter's effect varies in response to the former".

Bayesian methods provide an alternative to traditional meta-analysis. First, using Bayesian methods, data can be accumulated over time (as prior knowledge) into the analysis of future replications. Second, Bayesian methods can be used to combine results such that all data are treated as current observations [18].
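A minimal way to see the "accumulate data over time" idea is a conjugate normal-normal update: the effect estimate from each successive replication updates a prior on the underlying treatment effect, and the resulting posterior serves as the prior for the next replication. The sketch below uses invented effect estimates with known variances; it is far simpler than the hierarchical models with moderator variables used in [17, 18] and is meant only to convey the updating mechanism.

import math

# (label, effect estimate, variance of the estimate) -- invented numbers
replications = [("original experiment", 0.60, 0.10),
                ("internal replication", 0.45, 0.08),
                ("external replication", 0.10, 0.12)]
prior_mean, prior_var = 0.0, 1.0        # weakly informative starting prior
for label, estimate, variance in replications:
    posterior_var = 1 / (1 / prior_var + 1 / variance)
    posterior_mean = posterior_var * (prior_mean / prior_var + estimate / variance)
    print(f"after {label}: effect ~ N({posterior_mean:.2f}, sd {math.sqrt(posterior_var):.2f})")
    prior_mean, prior_var = posterior_mean, posterior_var   # posterior becomes the next prior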
However, the application of this method to joint analysis requires thorough knowledge of Bayesian statistics, which have seldom been used in an SE community dominated by frequentist methods such as meta-analysis: the current standard for aggregating quantitative results [22, 25].

3. RESEARCH OBJECTIVES
We have carried out several experiments to assess the TDD agile development technique [31]. A lot of data are being collected from multiple replications, and their analysis and processing could provide insights into proper ways of handling and synthesizing the results of a joint analysis.

Our research is driven by several methodological questions:

1. What is the best way of analyzing families of experiments with raw data in SE? Do the existing approaches produce contradictory results? Under what circumstances are these different analysis approaches applicable?
2. Where are the limits to the feasibility of grouping different experiments, i.e., how similar does the design of the experiments need to be?
3. Is there any kind of knowledge about the different experiments that is of paramount importance for joint analysis? Is it always correctly reported?

Also, several TDD-specific questions will drive our research:

1. Does subject type (students or professionals) have any implications regarding the performance of different development techniques (TDD, ITL)?
2. Does any moderator variable, or interaction amongst moderator variables, across different organizational and/or academic setups have an impact on the performance of the development methodologies?
3. Does the context (academia, industry or even different industries) have any impact on the performance of the subjects applying the different development techniques?

Other research questions might arise in the course of the research, and their implications will be discussed thoroughly as part of the PhD thesis.

Our research will provide different contributions to academia and practice:

1. Different methods for the aggregation of results will be used jointly for analyzing families of experiments whose raw data are available to experimenters, and the limits of applicability of the different techniques will be discussed throughout the research process.
2. Multiple industry experiments on TDD will be aggregated and their results for different software metrics analyzed. Specifically, different treatments such as traditional test-last coding or ITL will be compared against TDD in industrial settings, which may lead to interesting findings about the effects of TDD in real software development contexts.
4. RESEARCH APPROACH
Within the ESEIL project, several replications (i.e., reporting both different and consistent results) are being run on the topic of TDD performance. These replications consistently alter different aspects of the primary study (design of the experiment, subject type, instrumentation, treatments, artifacts, location, training, researchers, session length, etc.). All these replications are being analyzed separately and their findings discussed.

In order to carry out the joint analysis of experiments, we will first run a search for current trends in the analysis of families of experiments in other areas, such as agriculture, psychology or medicine. Such bibliographic searches of online databases could turn up a variety of methods or prescriptions that might lend themselves to extrapolation to SE. The conditions under which these techniques can be used and their limitations will be studied within specific SE setups, and they will be assessed by means of direct application to the ESEIL project experiments on TDD.

The application of these different analysis techniques to the same family of experiments can lead to multiple, possibly even contradictory, results. In our studies we will try to explore the scope of application of the different aggregation approaches and define the limits of their applicability for conducting joint analysis.

After exploring these approaches and discussing their implications within the ESEIL project, we will propose a framework for the joint analysis of families of experiments. A second version of this framework will be refined and further expanded with the aim of applying it to a family of testing experiments [20].

We will then adopt the most promising methods in order to extrapolate their applicability to a different set of experiments within the SE community, such as experiments on software requirements.

Finally, the proposed updated framework could be assessed and reviewed by colleagues within the SE community from different viewpoints in order to lend the proposal higher external validity and consistency.

5. SUMMARY OF CURRENT STATUS
The thesis proposal was elicited on 15 January 2015 and rounded out over the following six months. As a three-year thesis, its findings and proposals will be projected across the years 2015, 2016 and 2017. The publishing strategy targets publication at the ICSE and ESE conferences and in the TSE, TOSEM, EMSE and IST journals over the three-year research period.

At the time of writing, a preliminary aggregation of results is being carried out using the data from four replications as part of the Experimental Software Engineering Industrial Laboratory (ESEIL) project. Two experiments were run in a professional setting, whereas another two were run in academia. The results of the experiments will be aggregated using different analysis approaches, and their implications, constraints and findings will be discussed and further explored in subsequent studies.
6. REFERENCES
[1] Abrahao, S., Gravino, C., Insfran, E., Scanniello, G., & Tortora, G. (2013). Assessing the effectiveness of sequence diagrams in the comprehension of functional requirements: Results from a family of five experiments. IEEE Transactions on Software Engineering, 39(3), 327-342.
[2] Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173.
[3] Basili, V. R., Shull, F., & Lanubile, F. (1999). Building knowledge through families of experiments. IEEE Transactions on Software Engineering, 25(4), 456-473.
[4] Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2011). Introduction to Meta-Analysis. John Wiley & Sons.
[5] Brooks, A. (1997). Meta analysis—a silver bullet—for meta-analysts. Empirical Software Engineering, 2(4), 333-338.
[6] Ciolkowski, M. (2009, October). What do we know about perspective-based reading? An approach for quantitative aggregation in software engineering. In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement (pp. 133-144). IEEE Computer Society.
[7] Cooper, H., & Patall, E. A. (2009). The relative benefits of meta-analysis conducted with individual participant data versus aggregated data. Psychological Methods, 14(2), 165.
[8] Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge.
[9] Da Silva, F. Q., Suassuna, M., França, A. C. C., Grubb, A. M., Gouveia, T. B., Monteiro, C. V., & dos Santos, I. E. (2014). Replication of empirical studies in software engineering research: A systematic mapping study. Empirical Software Engineering, 19(3), 501-557.
[10] Dieste, O., Fernández, E., Garcia Martinez, R., & Juristo, N. (2011, April). Comparative analysis of meta-analysis methods: When to use which? In Evaluation & Assessment in Software Engineering (EASE 2011), 15th Annual Conference on (pp. 36-45). IET.
[11] Gómez, O. S., Juristo, N., & Vegas, S. (2014). Understanding replication of experiments in software engineering: A classification. Information and Software Technology, 56(8), 1033-1048.
[12] Gurevitch, J., & Hedges, L. V. (1999). Statistical issues in ecological meta-analyses. Ecology, 80(4), 1142-1149.
[13] Hannay, J. E., Dybå, T., Arisholm, E., & Sjøberg, D. I. (2009). The effectiveness of pair programming: A meta-analysis. Information and Software Technology, 51(7), 1110-1122.
[14] Hayes, W. (1999). Research synthesis in software engineering: A case for meta-analysis. In Software Metrics Symposium, 1999. Proceedings. Sixth International (pp. 143-151). IEEE.
[15] Hedges, L. V., & Olkin, I. (2014). Statistical Methods for Meta-Analysis. Academic Press.
[16] Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39-65.
[17] Krein, J. L., Prechelt, L., Juristo, N., Nanthaamornphong, A., Carver, J. C., Vegas, S., Knutson, C. D., Seppi, K. D., & Egget, D. L. A multi-site joint replication of a design patterns experiment using moderator variables to generalize across contexts. IEEE Transactions on Software Engineering. Under review.
[18] Krein, J. L., Prechelt, L., Juristo, N., Seppi, K. D., Nanthaamornphong, A., Carver, J. C., Vegas, S., & Knutson, C. D. A method for generalizing across contexts in software engineering experiments. IEEE Transactions on Software Engineering. Submitted.
[19] Juristo, N., & Vegas, S. (2011). The role of non-exact replications in software engineering experiments. Empirical Software Engineering, 16(3), 295-324.
[20] Juristo, N., Vegas, S., Solari, M., Abrahao, S., & Ramos, I. (2012, April). Comparing the effectiveness of equivalence partitioning, branch testing and code reading by stepwise abstraction applied by subjects. In Software Testing, Verification and Validation (ICST), 2012 IEEE Fifth International Conference on (pp. 330-339). IEEE.
[21] Kalaian, H. A., & Raudenbush, S. W. (1996). A multivariate mixed linear model for meta-analysis. Psychological Methods, 1(3), 227.
[22] Kitchenham, B. (2008). The role of replications in empirical software engineering—a word of warning. Empirical Software Engineering, 13(2), 219-221.
[23] Lindsay, R. M., & Ehrenberg, A. S. (1993). The design of replicated studies. The American Statistician, 47(3), 217-228.
[24] Miller, J. (2000). Applying meta-analytical procedures to software engineering experiments. Journal of Systems and Software, 54(1), 29-39.
[25] Pickard, L. M., Kitchenham, B. A., & Jones, P. W. (1998). Combining empirical results in software engineering. Information and Software Technology, 40(14), 811-821.
[26] Rafique, Y., & Misic, V. (2013). The effects of test-driven development on external quality and productivity: A meta-analysis. IEEE Transactions on Software Engineering, 39(6), 835-856.
[27] Runeson, P., Stefik, A., & Andrews, A. (2014). Variation factors in the design and analysis of replicated controlled experiments. Empirical Software Engineering, 19(6), 1781-1808.
[28] Sjøberg, D. I., Hannay, J. E., Hansen, O., Kampenes, V. B., Karahasanovic, A., Liborg, N. K., & Rekdal, A. C. (2005). A survey of controlled experiments in software engineering. IEEE Transactions on Software Engineering, 31(9), 733-753.
[29] Succi, G., Spasojevic, R., Hayes, J. J., Smith, M. R., & Pedrycz, W. (2000). Application of statistical meta-analysis to software engineering metrics data. In Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics (Vol. 1, pp. 709-714).
[30] Succi, G., Spasojevic, R., Hayes, J. J., Smith, M. R., & Pedrycz, W. (2000). Application of statistical meta-analysis to software engineering metrics data. In Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics (Vol. 1, pp. 709-714).
[31] Vegas, S., Dieste, O., & Juristo, N. (2015, May). Difficulties in running experiments in the software industry: Experiences from the trenches. In Conducting Empirical Studies in Industry (CESI), 2015 IEEE/ACM 3rd International Workshop on (pp. 3-9). IEEE. doi:10.1109/CESI.2015.8.