<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>As
a three-year thesis, its discussion and findings will be projected
across the years</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Joint Analysis of Families of SE Experiments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrián Santos Parrilla</string-name>
          <email>adrian.santos.parrilla@oulu.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Processing Science, University of Oulu, Finland</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>2016</volume>
      <fpage>4</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>Context: Replication is of paramount importance for building solid theories in experimental disciplines and is a cornerstone of the evolution of science. Over the last few years, the role of replication in software engineering (SE), families of experiments and the need to aggregate the results of groups of experiments have attracted special attention. Frameworks, taxonomies, processes, recommendations and guidelines for reporting replications have been proposed to support the replication of SE experiments. There has been much less debate about the issue of the joint analysis of replications whose raw data are available to experimenters. Objectives: The aim of our research is to explore current trends in the joint analysis of SE experiments whose raw data are available to experimenters. Notice that the fact that experimenters have access to the raw data is what differentiates joint analysis from other methods for aggregating experimental results (e.g. systematic literature review (SLR), where the applicability of meta-analysis techniques is widely accepted). The objective of this three-year investigation is to shed light on the best joint analysis approach when the experimenters have access to raw data from several replications. Method: Narrative comparison, standard frequentist methods, meta-analysis and Bayesian methods have been used in SE literature. We will apply and evaluate each approach to the experiments on Test-Driven Development (TDD) carried out within the Experimental Software Engineering Industrial Laboratory (ESEIL) project. We will propose and rate a tentative framework for aggregating results within the ESEIL project. The proposed framework, as well as the different existing methods, will be evaluated on another set of replications of testing technique experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>SE replication</kwd>
        <kwd>joint analysis</kwd>
        <kwd>family of experiments</kwd>
        <kwd>raw data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        SE experiments can be analyzed separately to acquire knowledge
about the performance of different treatments under certain
circumstances (working environment or specific population
characteristics). The shortcomings of this approach include: (1)
the number of subjects is a limiting factor across most SE
experiments; (2) the results might be artifactual, that is, due to the
impact of the experimental protocol and not to the treatments
applied by the subjects; (3) the findings from one study cannot be
interpreted outside the confines of the setting of that experiment.
The role and importance of replications in tackling the issue of the
generalization of SE experimental findings has been recognized
by many authors within the SE community [
        <xref ref-type="bibr" rid="ref11 ref19 ref22 ref24">11, 19, 22, 24</xref>
        ]. As
stated in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], “replications play a key role in Empirical Software
Engineering by allowing the community to build knowledge about
which results or observations hold under which conditions”. The
aim of replication is twofold [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. First, “replication is needed not
merely to validate one’s findings, but, more importantly, to
establish the increasing range of radically different conditions
under which the findings hold, and the predictable exceptions”.
Second, as noted in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], “if an experiment is not replicated, there
is no way to distinguish whether results were produced by chance
(the observed event occurred accidentally), results are artifactual
(the event occurred because of the experimental configuration but
does not exist in reality) or results conform to patterns existing in
reality”. Thus, replication provides experimenters and the
community with a continuous knowledge building process by: (1)
confirming previous experimental results; and (2) identifying the
reasons why previous results do not hold under the new
experimental conditions. By aggregating the results of
experiments, we get to see the whole picture for different
population characteristics, settings and conformance to the
treatments used within the SE community.
      </p>
      <p>
        The shortage of replication studies within the SE community was
highlighted in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Out of a total of 5453 articles published in
different SE-related journals and conference proceedings between
1993 and 2002, 20 out of 113 controlled experiments were
described as replications [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. In a mapping study on SE
replications completed from 2010 to 2011 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] based on
bibliographic searches covering the period from 1994 to 2010,
only 96 out of 16,000 papers included replications. These 96
papers reported a total of 133 replications. Furthermore, the
results showed that nearly 70% of the replications were published
after 2004 and that up to 70% of the studies were internal
replications (i.e., carried out by the same experimenters) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. An
update of the same study identified and analyzed replications
published in 2011 and 2012, and noted that the trend in the
number of replications in SE continued to be upward (56 papers
in two years) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, the growth rate was slow, possibly
indicating the need for patterns to improve the way in which
replications are run in the field [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The organization of an International Workshop on Replication in
Empirical Software Engineering Research (RESER) is illustrative
of the growing interest in replication. At this venue, empirical
software engineering researchers have the opportunity to present
and discuss the theoretical foundations and methods of
replication, as well as the results of replicated studies.
Traditionally, groups of experiments have been formed within the
SE community by means of SLR, and their results analyzed
jointly in order to build new pieces of knowledge. Nowadays, however,
researchers are replicating their own studies in order to increase
the relevance and validity of their findings.</p>
      <p>
        As some authors state [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], there is a need to further investigate
the problem of generalizing conclusions from individual studies.
This could be done by extending research tools commonly used in
engineering and computer science with those applied in sciences
that study people such as medicine or psychology. Brooks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
suggested that research methods like statistical meta-analysis
could benefit software engineering in generalizing the findings
from individual studies. However, SE experiments generally have
several constraints that make meta-analysis difficult to apply
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: (1) small sample sizes (generally fewer than 10
subjects per treatment); (2) the number of experiments per
meta-analysis is also small in many cases; (3) some studies do not
provide the statistical parameters required for meta-analysis when
reporting their results.
      </p>
      <p>
        Even though meta-analysis is a widely accepted method for
aggregating results from studies identified by means of SLR
(generally reporting statistics such as the mean, standard deviation
or number of subjects), there appears to be no such agreement on
the right way to analyze the replications of experiments whose
raw data are available to experimenters. Researchers who are in
possession of the raw data of the experiments are better able to
compute, understand and assess the different variables considered
in the experiments than if they only have access to findings
reported in different publications. Furthermore, the issue is
still debated in other fields, such as medicine and the social
sciences, whose communities continue to discuss the advantages and
disadvantages of conducting meta-analysis with individual participant
data (IPD) gathered from the constituent studies versus aggregated
data (AD), i.e., the group-level statistics (effect sizes) that appear
in reports of a study’s results
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>It is not yet clear within the SE community which procedure is the most
straightforward and valid for aggregating results when
the raw data of the experiments are available to the experimenters
and they have first-hand knowledge of the protocol and
conditions. Moreover, different joint analysis techniques may be
applicable depending on the characteristics of the
replications.</p>
      <p>
        The concept of family of experiments was first reported in SE by
Basili et al. in 1999 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This concept is explained in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as
follows: “a family is composed of multiple similar experiments
that pursue the same goal to build the knowledge needed to
extract significant conclusions”. From this point of view, the
concept of family of experiments is a “framework for organizing
sets of related studies” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where “experiments can be viewed as
part of common families of studies rather than being isolated
events” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As each of these experiments is viewed as belonging
to a group of studies, their results could be analyzed as a whole
instead of separately, and the findings integrated into one
comprehensive result.
      </p>
      <p>Notice that this definition of family of experiments also covers
related experiments found by an SLR, even though the
experimenters are completely unconnected. We think that the
concept of family of experiments should be defined more
precisely in order to distinguish between the two situations
below:</p>
      <list list-type="simple">
        <list-item>
          <label>(i)</label>
          <p>Set of related experiments, typically found in an SLR,
that can be aggregated to generate evidence. The
available information in these cases provides only a
short description of protocol and conditions as regards
the setting and no more than sample descriptive
statistics as regards the data.</p>
        </list-item>
        <list-item>
          <label>(ii)</label>
          <p>Set of experiments conducted by related researchers that
make the raw data available for further joint analysis. In
this case, the available information covers everything
that the experimenters know about their own studies.</p>
        </list-item>
      </list>
      <p>
        We suggest that the term family of experiments should be used to
refer to situation (ii) above. Thus, this research narrows down the
meaning of family of experiments to a definition similar to the
explanation given in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: “a set of similar experiments that pursue
the same goal to build the knowledge needed to extract significant
conclusions”, where experimenters have the raw data of the
experiments and first-hand knowledge of the setting.
      </p>
      <p>In this article we report on a PhD thesis that is being carried out to
investigate how to conduct a joint analysis of a family of
experiments. We first report the current methods that have been
used in SE for the joint analysis of experiments. We also outline a
tentative path for building a framework for aggregating the results
of families of experiments.</p>
      <p>This paper is organized as follows. Section 2 briefly discusses
relevant prior work on the topic of results aggregation in SE.
Section 3 outlines the main objectives of the proposal. Section 4
describes the proposed research approach. Finally, Section 5
summarizes the current status of the outlined proposal.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>In the following sections, we briefly discuss the different
approaches proposed and adopted within the SE community to
conduct joint analyses of families of experiments whose raw data
are available to experimenters, discuss their applicability and state
the conclusions concerning their use reported in the different
publications.</p>
      <p>The different techniques are discussed in chronological order by
date of publication of the respective paper applying or proposing
the technique.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Narrative Comparison</title>
      <p>
        The difference between close and differentiated replications was
discussed in depth by Juristo and Vegas in 2011 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. They
proposed an approach for analyzing groups of experiments: the
results of each experiment are analyzed separately and are then
grouped according to concordances and discordances between the
results of the replications identified through narrative comparison.
A differentiated replication (i.e., a replication that produces a
different outcome than the main experiment) is considered as an
opportunity to explore the different variables that might have had
an impact on the outcome rather than being seen as a threat to the
validity of the replication. There are several noteworthy points
with regard to the study reported in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>There is a large imbalance in the total number of
subjects participating in each replication (176
participants at the UPM, 31 subjects in the UPV
replication and 76 in the ORT replication). This
imbalance in the number of subjects could have biased
the results due to natural random variability.</p>
        </list-item>
        <list-item>
          <p>
            The report states that “the results are considered equal if
the estimated mean value for the replication results is
within the confidence interval of the baseline
experiment results” [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. Some sources [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] state that
roughly 83% of replication means will fall within the 95%
confidence interval for the mean of the original
experiment. In other words, even if two samples (one
per experiment) are drawn from the same population,
there is only an 83% chance that the mean of the second
experiment will fall within the confidence interval of the
first. Thus, due to the random variability of the sample,
Juristo and Vegas might be considering the result of the
replication as a different outcome merely because the
mean did not fall within the confidence interval of the
main experiment (although, in actual fact, it represents
the same result, i.e., population, in a different random
sample). This underestimation of sampling variability is
a limitation of the applicability of the narrative
comparison approach proposed in [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]: roughly 17% of
replication results will, due to random variability, be
considered different when they are in fact equal (i.e.,
represent the same population).
          </p>
        </list-item>
      </list>
      <p>This method relies on narrative comparison to analyze a family of
experiments, using observations such as the mean of the different
outcome variables obtained by the subjects in the experiments to
discuss the findings. It is up to the expert analyst to observe and
interpret the outcomes, and the method depends on their ability to
identify extraneous factors that might have influenced the
outcome in the different replications. One clear drawback is that
the technique might underestimate the random sampling
variability within a certain population and thus overestimate the
effect of third variables. This technique can be applied to the raw
data of the experiments or to the known descriptive statistics of
the different experimental outcomes (although the variables
should be interpreted with due caution if the raw data are not
available).</p>
      <p>In any case, the narrative comparison technique requires the
experimenters of the replications to interact in order to identify
extraneous variables that might have had an impact on the
consistency of results and thus gather knowledge for further
investigation.</p>
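      <p>The 83% figure discussed above for the confidence-interval criterion can be checked with a short calculation. The following is a minimal sketch, not part of the analysis in [<xref ref-type="bibr" rid="ref19">19</xref>] or [<xref ref-type="bibr" rid="ref8">8</xref>]: it assumes that both experiments sample the same normal population with a known, common variance, and the function name and the sample sizes passed to it are illustrative assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from statistics import NormalDist


def capture_probability(n_original, n_replication, confidence=0.95):
    """Probability that a replication mean lies inside the original CI.

    Assumes both samples come from the same normal population with a
    known, common variance (an idealization made for this sketch).
    """
    std_normal = NormalDist()
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = std_normal.inv_cdf(0.5 + confidence / 2.0)
    # The difference of the two means has variance sigma^2 * (1/n1 + 1/n2),
    # while the CI half-width is z * sigma / sqrt(n1); sigma cancels out.
    scale = (1.0 + n_original / n_replication) ** 0.5
    return 2.0 * std_normal.cdf(z / scale) - 1.0


if __name__ == "__main__":
    # Equal sample sizes: roughly 0.83, the figure quoted in the text.
    print(round(capture_probability(30, 30), 3))
    # Hypothetical unequal sizes: a smaller replication is captured less often.
    print(round(capture_probability(176, 31), 3))
```
]]></preformat>
      <p>Under these assumptions, the capture probability depends only on the ratio of the two sample sizes, so a strong imbalance between the baseline experiment and a replication would be expected to move it further away from 83%.</p>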
    </sec>
    <sec id="sec-4">
      <title>2.2 Meta-Analysis</title>
      <p>
        Meta-analysis is a set of statistical techniques that has been used
to combine the different effect sizes of a family of experiments [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Effect sizes can be estimated to evaluate the average impact
of an independent variable on a dependent variable across studies.
Since measures may be taken from different settings and may be
non-uniform, a standardized measure must be taken for each
experiment. These measures must be combined to estimate the
global effect size of a factor [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
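      <p>As an illustration of how per-experiment standardized measures may be combined into a global effect size, the sketch below applies inverse-variance (fixed-effect) pooling. It is one common choice rather than the procedure of any cited study, it assumes independent experiments sharing a single true effect, and the function name and the numbers are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def fixed_effect_pool(studies):
    """Inverse-variance (fixed-effect) pooling.

    `studies` is a list of (effect_size, variance) pairs, one per experiment.
    Returns the pooled effect size and its standard error.
    """
    weights = [1.0 / variance for _, variance in studies]
    weighted_sum = sum(w * effect for w, (effect, _) in zip(weights, studies))
    total_weight = sum(weights)
    return weighted_sum / total_weight, math.sqrt(1.0 / total_weight)


if __name__ == "__main__":
    # Hypothetical standardized effect sizes and variances for three experiments.
    family = [(0.20, 0.04), (0.60, 0.04), (0.40, 0.10)]
    pooled, se = fixed_effect_pool(family)
    print(f"pooled effect = {pooled:.3f}, 95% CI half-width = {1.96 * se:.3f}")
```
]]></preformat>
      <p>Experiments with smaller variances (typically the larger ones) receive more weight, which is why the variance of each effect size has to be reported or computable.</p>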
      <p>
        Meta-analysis is the current standard for aggregating quantitative
results across studies [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. It can be used to combine data even if
studies report contradictory results provided that the overall
variation is not too extreme [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        Meta-analysis has been used within the SE community with
multiple objectives such as studying the effects of TDD on
external quality and productivity [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], checking for correlations of
metrics across software projects corpora [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] or the effect on
defect detection rates of different inspection techniques [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
amongst others [
        <xref ref-type="bibr" rid="ref13 ref6">6, 13</xref>
        ]. Furthermore, guidelines for applying
different meta-analytic techniques have been proposed to the SE
community [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Traditional meta-analytic techniques rely on the assumption that
effect size estimates from different experiments are independent
and have sampling distributions with known conditional variances
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. An experiment that examines multiple dependent variables
or a cluster of studies carried out by the same investigator or
laboratory [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] poses a threat to the supposed independence of
experiments. The hierarchical dependence model is applicable
when the dependence structure between the experiments is due to
the inherent condition of belonging to a cluster of experiments
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The circumstances, implications and impact of dependence
across studies have been studied at length by different researchers
[
        <xref ref-type="bibr" rid="ref15 ref21">15, 21</xref>
        ], and multiple techniques for dealing with this issue have
been proposed [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. However, their usage requires an in-depth
knowledge of the different techniques available, and their
applicability is by no means clear [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        How meta-analysis should be applied to a family of experiments
in SE is a matter of debate, and there are many opinions on
procedure. As stated by Miller [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], “because the dependent
replications rely on the same underlying protocols as the original
study, their results cannot be considered as truly independent of
the original study. Moreover, they may propagate any accidental
biases from the original study into the results of the replication”.
Recall that independence is one of the main assumptions underlying
traditional meta-analysis; this assumption could be
violated in some cases where replications are run by related
researchers.
      </p>
      <p>
        Again, if experimental material is reused (thus increasing the
dependence between two experiments), “although from a simple
replication point of view, this seems attractive; from a
meta-analysis point of view this is undesirable, as it creates strong
correlations between the two studies” [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Kitchenham shares
this view [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], stating “in particular, dependent replications
violate the main assumption underlying meta-analysis which is the
standard method of aggregating results from quantitative
experiments. Recently, my colleagues and I were forced to omit
three studies from a systematic literature review because the
‘replications’ were so close that they offered no additional
information to the aggregation process”.
      </p>
      <p>
        Pickard et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] state, in reference to the outcome of the
primary studies, that “the greater the degree of similarity between
the studies the more confidence you can have in the results of a
meta-analysis”. The ideal situation would therefore be very similar settings
with neither communication nor information sharing among
experimenters, which is a rare occurrence in SE.
      </p>
      <p>Furthermore, it is up to researchers to settle several issues
regarding meta-analysis, such as:</p>
      <list list-type="bullet">
        <list-item>
          <p>
            The selection of the effect size metric used to perform
the joint analysis, i.e., the raw mean
difference, a standardized mean difference, odds ratio,
risk ratio or risk difference [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
        </list-item>
        <list-item>
          <p>
            The standardizer used to compute the effect size, i.e.,
pool standard deviations, weight each group’s standard
deviation by sample size or use the control group
standard deviation [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
        </list-item>
        <list-item>
          <p>
            The computation of some effect sizes from others or the
use of unbiased versions of effect size metrics [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
        </list-item>
      </list>
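      <p>To make the standardizer choices listed above concrete, the following minimal sketch computes a standardized mean difference under three standardizers and applies the usual small-sample correction. The function names, the option labels and the summary statistics are assumptions introduced for illustration only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def standardized_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c, standardizer):
    """Standardized mean difference under three common standardizers."""
    if standardizer == "unweighted":
        # Simple pooling: root mean of the two group variances.
        sd = math.sqrt((sd_t ** 2 + sd_c ** 2) / 2.0)
    elif standardizer == "weighted":
        # Each group's variance weighted by its degrees of freedom.
        sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                       / (n_t + n_c - 2))
    elif standardizer == "control":
        # Control-group standard deviation only.
        sd = sd_c
    else:
        raise ValueError(f"unknown standardizer: {standardizer}")
    return (mean_t - mean_c) / sd


def small_sample_correction(d, n_t, n_c):
    """Approximate bias correction for small samples (Hedges' g)."""
    return d * (1.0 - 3.0 / (4.0 * (n_t + n_c - 2) - 1.0))


if __name__ == "__main__":
    # Hypothetical summary statistics for one experiment.
    stats = dict(mean_t=12.0, mean_c=10.0, sd_t=4.0, sd_c=2.0, n_t=10, n_c=30)
    for name in ("unweighted", "weighted", "control"):
        d = standardized_difference(standardizer=name, **stats)
        g = small_sample_correction(d, stats["n_t"], stats["n_c"])
        print(f"{name:>10}: d = {d:.3f}, corrected = {g:.3f}")
```
]]></preformat>
      <p>With unequal group sizes and variances, as in this hypothetical example, the three standardizers yield noticeably different effect sizes from the same data, which is why the choice needs to be stated explicitly.</p>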
      <p>
        Besides, the impact of experimental designs on the resulting effect
sizes (such as multiple-treatment studies and multiple-endpoint
studies [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) makes the applicability of meta-analysis a controversial
topic in the SE community.
      </p>
      <p>
        Meta-analysis can aggregate the results of
different experiments when the raw data are not available, even
though its stability in SE experiments has been questioned [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
However, meta-analysis has also been applied when the
experimenters are in possession of the raw data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This issue
raises the question of whether the best procedure for the joint
analysis of experiments whose raw data are available to
researchers is to apply meta-analysis techniques or whether it
would be better to use other approaches. As noted in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: “losing
sight of the fact that meta-analysis is a tool with multiple
applications causes confusion and leads to pointless discussions
about what is the right way to perform a research synthesis, when
there is no single right way”.
      </p>
      <p>All of the above suggests that the community is not yet
clear about the applicability of meta-analysis, its
boundaries and its misuses, which adds to the confusion surrounding the
aggregation of results in families of SE experiments.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Standard Frequentist Methods</title>
      <p>
        Another option for conducting the joint analysis of families of
experiments is to analyze all data together via standard frequentist
methods such as analysis of variance (ANOVA) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Basically,
each experiment within a group of replicated controlled
experiments is analyzed separately. After gathering knowledge
about the results of the different replications and briefly
discussing whether or not the results hold, the experimenters
hypothesize about which variables might have had an impact on
the results. In the next step, the raw data from all the different
studies belonging to the family of experiments are aggregated and
analyzed as a whole considering the hypothesized variables as
factors.
      </p>
      <p>
        The study presented by Runeson et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] in 2014 is an example
of such an approach. They report three experiments comparing
code inspections with unit testing: the original experiment, an
internal replication (a replication performed by the same
researchers minimizing changes in the replication) and an external
replication (a replication performed by a different group of
researchers, varying several aspects of the experiment) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The
three experiments were cross-over designs, where the subjects
applied one defect detection method (code inspection or structural
unit testing) to one program and then the other method to the
other program [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The dependent variables of the experiments
are time spent on the tasks, number of defects detected and
localized and rate, i.e., number of defects detected and localized
per time unit [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        The separate analyses performed for each experiment in [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
appear to be clearly explained from the data analysis viewpoint:
“the experiment has two factors, paired measurements, a sample
size of less than 30 and data which is not normally distributed”.
When aggregating the data from the three experiments into one
data set and carrying out the joint analysis, however, Runeson et
al. report [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] “the overall two-factor ANOVA results for the
three experiments”. Notice that the authors no longer mention that
the data are “repeated measures”, and the “two-factor ANOVA”
analysis carried out is interpreted without any reference to a
within-subjects factor. Furthermore, the joint analysis of the three
experiments is performed using a Kruskal-Wallis test, the
equivalent of the one-way ANOVA test for non-normal
distributions.
      </p>
      <p>Because the data are dependent, a repeated measures general
linear model could have been fitted to analyze the data. Also, the
within-subjects and between-subjects factors considered should
have been clearly specified in order to pave the way for data
analysis, understandability and reproduction.</p>
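      <p>The practical consequence of ignoring a within-subjects factor can be seen in the simplest repeated measures setting, a paired comparison. The sketch below is illustrative only: the data are hypothetical (they are not taken from [<xref ref-type="bibr" rid="ref27">27</xref>]), the function names are assumptions, and a full analysis of a cross-over design would also model period and program effects.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from statistics import mean, stdev, variance


def paired_t(first, second):
    """t statistic that respects the pairing (one difference per subject)."""
    diffs = [b - a for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def independent_t(first, second):
    """Pooled-variance t statistic that (wrongly here) ignores the pairing."""
    n1, n2 = len(first), len(second)
    pooled_var = ((n1 - 1) * variance(first)
                  + (n2 - 1) * variance(second)) / (n1 + n2 - 2)
    return (mean(second) - mean(first)) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


if __name__ == "__main__":
    # Hypothetical defect counts for five subjects, each applying both methods.
    inspection = [2, 4, 6, 8, 10]
    testing = [3, 6, 7, 10, 11]
    print(f"paired t      = {paired_t(inspection, testing):.2f}")
    print(f"independent t = {independent_t(inspection, testing):.2f}")
```
]]></preformat>
      <p>In this hypothetical data set, the between-subject variability is large compared with the within-subject difference, so the analysis that ignores the pairing yields a much smaller test statistic than the paired analysis of the same numbers.</p>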
      <p>A framework applying this approach to aggregate results from a
family of experiments should comply with three objectives: (1)
provide a specific set of steps to be carried out to pre-process the
data and report the data pre-processing of the experiments; (2)
provide a template with all the relevant information that should be
stated about each of the experiments to carry out the individual
analysis; (3) provide guidance for defining a joint analysis from
separate experiments, accounting for any of the possible
limitations of each experiment.</p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Bayesian Methods</title>
      <p>
        Bayesian methods for data analysis have also been applied to the
aggregation of the results of a family of experiments [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ].
They resolve the inconsistencies found between replications and
the original experiment by investigating moderators, i.e., variables
that cause an effect to differ across contexts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. An iterative
approach is applied to try to identify moderators that might have
an influence on the outcome of the experiment. The different
variables and their interaction studied in the proposed models are
then measured based on the most relevant changes made to the
different replications. As explained in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], “By moderator, we
mean any explanatory variable that interacts with another
explanatory variable in predicting a response variable. For one
variable to “moderate” another does not mean that it dampens the
other’s effect —rather, it means that an interaction exists, such
that the latter’s effect varies in response to the former”.
Bayesian methods provide an alternative to traditional
meta-analysis. First, with Bayesian methods, data accumulated
over time can be incorporated as prior knowledge into the analysis of future
replications. Second, Bayesian methods can be used to combine
results such that all data are treated as current observations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
However, the application of this method to joint analysis requires
thorough knowledge of Bayesian statistics, which has seldom
been used in the SE community. The community is dominated by frequentist
methods such as meta-analysis, the current standard for
aggregating quantitative results [
        <xref ref-type="bibr" rid="ref22 ref25">22, 25</xref>
        ].
      </p>
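      <p>The two uses of Bayesian methods mentioned above (earlier data as prior knowledge versus all data as current observations) can be illustrated with a conjugate normal model. This is a minimal sketch under strong assumptions that are ours, not those of the cited studies: a single common effect, a known sampling variance and no moderators. All names and numbers are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def update_normal_mean(prior_mean, prior_var, sample_mean, n, sigma2):
    """Conjugate update for a normal mean with known sampling variance sigma2."""
    prior_precision = 1.0 / prior_var
    data_precision = n / sigma2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var


if __name__ == "__main__":
    sigma2 = 4.0          # assumed known variance (hypothetical)
    prior = (0.0, 100.0)  # vague prior on the treatment effect
    exp1 = (1.0, 20)      # (observed mean, number of subjects)
    exp2 = (2.0, 30)

    # Route 1: the posterior from the first experiment is the prior for the second.
    step1 = update_normal_mean(*prior, *exp1, sigma2)
    sequential = update_normal_mean(*step1, *exp2, sigma2)

    # Route 2: pool both experiments and treat all data as current observations.
    n_total = exp1[1] + exp2[1]
    pooled_mean = (exp1[0] * exp1[1] + exp2[0] * exp2[1]) / n_total
    batch = update_normal_mean(*prior, pooled_mean, n_total, sigma2)

    print(sequential, batch)
```
]]></preformat>
      <p>Under these simplifying assumptions, both routes lead to the same posterior. Once moderators or experiment-specific effects are introduced, as in the models discussed above, the two routes need not coincide.</p>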
    </sec>
    <sec id="sec-7">
      <title>3. RESEARCH OBJECTIVES</title>
      <p>
        We have carried out several experiments to assess the TDD agile
development technique [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. A large amount of data is being collected
from multiple replications, and its analysis and processing could
provide insights into proper ways of handling and synthesizing
the results of joint analysis.
      </p>
      <p>Our research is driven by several methodological questions:</p>
      <sec id="sec-7-1">
        <list list-type="bullet">
          <list-item>
            <p>What is the best way of analyzing families of
experiments with raw data in SE? Do the existing
approaches produce contradictory results? Under what
circumstances are these different analysis approaches
applicable?</p>
          </list-item>
          <list-item>
            <p>Where are the limits to the feasibility of grouping
different experiments, i.e., how similar does the design
of the experiment need to be?</p>
          </list-item>
          <list-item>
            <p>Is there any kind of knowledge on the different
experiments that is of paramount importance for joint
analysis? Is it always correctly reported?</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-7-2">
        <p>Also, several TDD-specific questions will drive our research:</p>
        <list list-type="bullet">
          <list-item>
            <p>Does subject type (students or professionals) have any
implications regarding the performance of different
development techniques (TDD, ITL)?</p>
          </list-item>
          <list-item>
            <p>Does any moderator variable or interaction amongst
moderator variables across different organizational
and/or academic setups have an impact on the
performance of the development methodologies?</p>
          </list-item>
          <list-item>
            <p>Does the context (academia, industry or even different
industries) have any impact on the performance of the
subjects applying different development techniques?</p>
          </list-item>
        </list>
        <p>Other research questions might arise in the course of the research,
and their implications will be discussed thoroughly as part of the
PhD thesis.</p>
        <p>Our research will contribute to both academia
and practice:</p>
      </sec>
      <sec id="sec-7-3">
        <list list-type="bullet">
          <list-item>
            <p>Different methods for aggregation of results will be
used jointly for analyzing families of experiments whose
raw data are available to experimenters, and the limits of
applicability of the different techniques will be
discussed throughout the research process.</p>
          </list-item>
          <list-item>
            <p>Multiple industry experiments on TDD will be
aggregated and their results for different software
metrics analyzed. Specifically, different treatments such
as traditional test-last coding or ITL will be compared
against TDD in industrial settings, which may lead to
interesting findings about the effects of TDD in real
software development contexts.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4. RESEARCH APPROACH</title>
      <p>Within the ESEIL project, several replications, reporting both
differing and consistent results, are being run on the topic of TDD
performance. These replications systematically alter different
aspects of the primary study (design of the experiment, subject
type, instrumentation, treatments, artifacts, location, training,
researchers, session length, etc.). All these replications are being
analyzed separately and their findings discussed.</p>
      <p>To carry out the joint analysis of experiments, we will first
survey current trends in the analysis of families of
experiments in other areas such as agriculture, psychology or
medicine. Such bibliographic searches of online databases could
turn up a variety of methods or prescriptions that might lend
themselves to extrapolation to SE. The conditions under which
these techniques can be used and their limitations will be studied
within specific SE setups, and they will be assessed by means of
direct application to the ESEIL project experiments on TDD.
Applying these different analysis techniques to the same
family of experiments can lead to multiple, possibly even
contradictory, results. In our studies we will explore the
scope of application of the different aggregation approaches and
define the limits of their applicability for conducting joint
analysis.</p>
      <p>
        After exploring these approaches and discussing their implications
within the ESEIL project, we will propose a framework for the
joint analysis of families of experiments. A second version of this
framework will be refined and further expanded with the aim of
applying it to a family of testing experiments [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>We will then apply the most promising methods to a different
set of SE experiments, such as experiments on software requirements,
in order to assess how far their applicability extends.
Finally, the proposed updated framework could be assessed and
reviewed by colleagues within the SE community from different
viewpoints in order to lend the proposal higher external validity
and consistency.</p>
    </sec>
    <sec id="sec-9">
      <title>5. SUMMARY OF CURRENT STATUS</title>
      <p>The thesis proposal was formulated on 15 January 2015 and rounded
out over the following six months. As a three-year thesis, its
findings and proposals will be spread across the years 2015,
2016 and 2017. The publishing strategy targets the
ICSE and ESE conferences and the TSE, TOSEM, EMSE and
IST journals over the three-year research period.</p>
      <p>At the time of writing, a preliminary aggregation of results is
being carried out using the data from four replications as part of
the Experimental Software Engineering Industrial Laboratory
(ESEIL) project. Two experiments were run in a professional
setting, whereas another two were run in academia. The results of
the experiments will be aggregated using different analysis
approaches, and their implications, constraints and findings will
be discussed and further explored in subsequent studies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abrahao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gravino</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Insfran</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scanniello</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tortora</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Assessing the effectiveness of sequence diagrams in the comprehension of functional requirements: Results from a family of five experiments</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions on,
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <fpage>327</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Baron</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kenny</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          (
          <year>1986</year>
          ).
          <article-title>The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations</article-title>
          .
          <source>Journal of personality and social psychology</source>
          ,
          <volume>51</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1173</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>V. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shull</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lanubile</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Building knowledge through families of experiments</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions on,
          <volume>25</volume>
          (
          <issue>4</issue>
          ),
          <fpage>456</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Borenstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hedges</surname>
            ,
            <given-names>L. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Higgins</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rothstein</surname>
            ,
            <given-names>H. R.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Introduction to Meta-Analysis</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Meta analysis-a silver bullet-for meta-analysts</article-title>
          .
          <source>Empirical Software Engineering</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <fpage>333</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ciolkowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2009</year>
          , October)
          .
          <article-title>What do we know about perspective-based reading? An approach for quantitative aggregation in software engineering</article-title>
          .
          <source>In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement</source>
          (pp.
          <fpage>133</fpage>
          -
          <lpage>144</lpage>
          ). IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Patall</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>The relative benefits of meta-analysis conducted with individual participant data versus aggregated data</article-title>
          .
          <source>Psychological methods</source>
          ,
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <fpage>165</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Cumming</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis</article-title>
          .
          <source>Routledge.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Da Silva</surname>
            ,
            <given-names>F. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suassuna</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>França</surname>
            ,
            <given-names>A. C. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grubb</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gouveia</surname>
            ,
            <given-names>T. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monteiro</surname>
            ,
            <given-names>C. V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>dos Santos</surname>
            ,
            <given-names>I. E.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Replication of empirical studies in software engineering research: a systematic mapping study</article-title>
          .
          <source>Empirical Software Engineering</source>
          ,
          <volume>19</volume>
          (
          <issue>3</issue>
          ),
          <fpage>501</fpage>
          -
          <lpage>557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Dieste</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia Martinez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Juristo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2011</year>
          , April).
          <article-title>Comparative analysis of meta-analysis methods: when to use which?</article-title>
          .
          <source>In Evaluation &amp; Assessment in Software Engineering (EASE</source>
          <year>2011</year>
          ), 15th Annual Conference on (pp.
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          ). IET.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Gómez</surname>
            ,
            <given-names>O. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juristo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vegas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Understanding replication of experiments in software engineering: A classification</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <volume>56</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1033</fpage>
          -
          <lpage>1048</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Gurevitch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hedges</surname>
            ,
            <given-names>L. V.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Statistical issues in ecological meta-analyses</article-title>
          .
          <source>Ecology</source>
          ,
          <volume>80</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1142</fpage>
          -
          <lpage>1149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hannay</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dybå</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arisholm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sjøberg</surname>
            ,
            <given-names>D. I.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>The effectiveness of pair programming: A meta-analysis</article-title>
          .
          <source>Information and Software Technology</source>
          ,
          <volume>51</volume>
          (
          <issue>7</issue>
          ),
          <fpage>1110</fpage>
          -
          <lpage>1122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Research synthesis in software engineering: a case for meta-analysis</article-title>
          .
          <source>In Software Metrics Symposium</source>
          ,
          <year>1999</year>
          . Proceedings. Sixth International (pp.
          <fpage>143</fpage>
          -
          <lpage>151</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Hedges</surname>
            ,
            <given-names>L. V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Olkin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Statistical methods for meta-analysis</article-title>
          . Academic Press.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Hedges</surname>
            ,
            <given-names>L. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tipton</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Robust variance estimation in meta‐regression with dependent effect size estimates</article-title>
          .
          <source>Research Synthesis Methods</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jonathan L.</given-names>
            <surname>Krein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lutz</given-names>
            <surname>Prechelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Natalia</given-names>
            <surname>Juristo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aziz</given-names>
            <surname>Nanthaamornphong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jeffrey C.</given-names>
            <surname>Carver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sira</given-names>
            <surname>Vegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles D.</given-names>
            <surname>Knutson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kevin D.</given-names>
            <surname>Seppi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dennis L.</given-names>
            <surname>Egget</surname>
          </string-name>
          ,
          <article-title>A Multisite Joint Replication of a Design Patterns Experiment using Moderator Variables to Generalize across Contexts</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions. Under review.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
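          <string-name>
            <given-names>Jonathan L.</given-names>
            <surname>Krein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lutz</given-names>
            <surname>Prechelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Natalia</given-names>
            <surname>Juristo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kevin D.</given-names>
            <surname>Seppi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aziz</given-names>
            <surname>Nanthaamornphong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jeffrey C.</given-names>
            <surname>Carver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sira</given-names>
            <surname>Vegas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Charles D.</given-names>
            <surname>Knutson</surname>
          </string-name>
          ,
          <article-title>A Method for Generalizing across Contexts in Software Engineering Experiments</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions. Submitted.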
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Juristo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vegas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>The role of non-exact replications in software engineering experiments</article-title>
          .
          <source>Empirical Software Engineering</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ),
          <fpage>295</fpage>
          -
          <lpage>324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Juristo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vegas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abrahao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2012</year>
          , April).
          <article-title>Comparing the effectiveness of equivalence partitioning, branch testing and code reading by stepwise abstraction applied by subjects</article-title>
          .
          <source>In Software Testing, Verification and Validation (ICST)</source>
          ,
          <year>2012</year>
          IEEE Fifth International Conference on (pp.
          <fpage>330</fpage>
          -
          <lpage>339</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Kalaian</surname>
            ,
            <given-names>H. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Raudenbush</surname>
            ,
            <given-names>S. W.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>A multivariate mixed linear model for meta-analysis</article-title>
          .
          <source>Psychological methods</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>227</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Kitchenham</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>The role of replications in empirical software engineering-a word of warning</article-title>
          .
          <source>Empirical Software Engineering</source>
          ,
          <volume>13</volume>
          (
          <issue>2</issue>
          ),
          <fpage>219</fpage>
          -
          <lpage>221</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Lindsay</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ehrenberg</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          (
          <year>1993</year>
          ).
          <article-title>The design of replicated studies</article-title>
          .
          <source>The American Statistician</source>
          ,
          <volume>47</volume>
          (
          <issue>3</issue>
          ),
          <fpage>217</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Applying meta-analytical procedures to software engineering experiments</article-title>
          .
          <source>Journal of Systems and Software</source>
          ,
          <volume>54</volume>
          (
          <issue>1</issue>
          ),
          <fpage>29</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Pickard</surname>
            ,
            <given-names>L. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kitchenham</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>P. W.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Combining empirical results in software engineering</article-title>
          .
          <source>Information and software technology</source>
          ,
          <volume>40</volume>
          (
          <issue>14</issue>
          ),
          <fpage>811</fpage>
          -
          <lpage>821</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Rafique</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Misic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>The effects of test-driven development on external quality and productivity: A meta-analysis</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions on,
          <volume>39</volume>
          (
          <issue>6</issue>
          ),
          <fpage>835</fpage>
          -
          <lpage>856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Runeson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Andrews</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Variation factors in the design and analysis of replicated controlled experiments</article-title>
          .
          <source>Empirical Software Engineering</source>
          ,
          <volume>19</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1781</fpage>
          -
          <lpage>1808</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Sjøberg</surname>
            ,
            <given-names>D. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hannay</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kampenes</surname>
            ,
            <given-names>V. B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karahasanovic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liborg</surname>
            ,
            <given-names>N. K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rekdal</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>A survey of controlled experiments in software engineering</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions on,
          <volume>31</volume>
          (
          <issue>9</issue>
          ),
          <fpage>733</fpage>
          -
          <lpage>753</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Succi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spasojevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pedrycz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Application of statistical meta-analysis to software engineering metrics data</article-title>
          .
          In
          <source>Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics</source>
          (Vol.
          <volume>1</volume>
          , pp.
          <fpage>709</fpage>
          -
          <lpage>714</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Succi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spasojevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pedrycz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Application of statistical meta-analysis to software engineering metrics data</article-title>
          .
          In
          <source>Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics</source>
          (Vol.
          <volume>1</volume>
          , pp.
          <fpage>709</fpage>
          -
          <lpage>714</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Vegas</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dieste</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Juristo</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2015</year>
          , May 18).
          <article-title>Difficulties in running experiments in the software industry: Experiences from the trenches</article-title>
          .
          In
          <source>2015 IEEE/ACM 3rd International Workshop on Conducting Empirical Studies in Industry (CESI)</source>
          (pp.
          <fpage>3</fpage>
          -
          <lpage>9</lpage>
          ). doi: 10.1109/CESI.2015.8
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>