<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generation of Mimic Software Project Data Sets for Software Engineering Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maohua Gan</string-name>
          <email>pa2i5772@s.okayama-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kentaro Sasaki</string-name>
          <email>ken.default.0828@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akito Monden</string-name>
          <email>monden@okayama-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zeynep Yucel</string-name>
          <email>zeynep@okayama-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Graduate School of Natural Science and Technology, Okayama University</institution>
          ,
          <addr-line>Okayama</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Previously at Faculty of Engineering, Okayama University</institution>
          ,
          <addr-line>Okayama</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>38</fpage>
      <lpage>43</lpage>
      <abstract>
<p>To conduct empirical research on industrial software development, it is necessary to obtain data of real software projects from industry. However, only a few such industry data sets are publicly available, and unfortunately, most of them are quite old. In addition, most of today's software companies cannot make their data open, because software development involves many stakeholders whose data confidentiality must be strongly preserved. This paper proposes a method to artificially generate a “mimic” software project data set whose characteristics (such as mean, standard deviation and correlation coefficients) are very similar to those of a given confidential data set. The proposed method uses the Box–Muller method to generate normally distributed random numbers; exponential transformation and value reordering are then applied for data mimicry. Instead of using the original (confidential) data set, researchers are expected to use the mimic data set to produce results similar to those obtained from the original data set. To evaluate the usefulness of the proposed method, effort estimation models were built from an industry data set and its mimic data set. We confirmed that the two models are very similar to each other, which suggests the usefulness of our proposal.</p>
      </abstract>
      <kwd-group>
        <kwd>empirical software engineering</kwd>
        <kwd>data confidentiality</kwd>
        <kwd>software effort estimation</kwd>
        <kwd>data mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        In the research field of empirical software engineering,
researchers demand data of real software development
projects from industry. However, only a few industry data sets are
publicly available. Moreover, these data sets are quite old, which
poses a serious problem for the validity and reliability
of the research. For example, the tera-PROMISE repository [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
provides several industry data sets such as Desharnais [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
COCOMO '81 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Kemerer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Albrecht [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but these data
were recorded in the 1980s; thus, the development
environments and processes may differ greatly from modern
software development. In addition, the sample size is often very
small, e.g. Kemerer has only 15 projects and Albrecht has only
24 projects. Surprisingly, these old and small data sets are still
actively used in recent research papers in top journals (e.g.
[2][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ][
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) due to the lack of new industry data sets.
      </p>
      <p>
        Meanwhile, although many companies measure and
accumulate data of recent software development projects, it
has become more and more difficult for university researchers to
use them for research, because legal compliance with
various data protection regulations has become extremely
important for today's companies. Moreover, since software
development involves many stakeholders, their data
confidentiality must be strongly preserved; thus, it has become more
difficult to take the data out of a company. In addition,
although there are some studies performed using the latest
software development data, only their analysis results are
disclosed, not the data itself. For example, the
white paper on software development data in 2016-2017 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
provides various analysis results of 4046 software development
projects held by 31 Japanese software development companies;
however, the data set itself is not disclosed.
      </p>
      <p>
        In this paper, to make it possible for academic researchers to
use the confidential software project data of a company, we
propose a method to artificially create a mimic data set whose
characteristics are very similar to those of a given confidential data set.
Instead of using the original (confidential) data set, researchers
are expected to use the mimic data set to produce results similar
to those from the original data set. For example, researchers can use the
mimic data set to evaluate software effort
estimation methods, because many industry data sets are
required to assess the stability of such
methods [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Moreover, such a mimic data set is also useful to
practitioners, because many companies want to compare their
software development performance (such as productivity and
defect density) with other companies.
      </p>
      <p>
        The basic idea of our proposal is as follows. We measure the statistics of each
variable as well as the correlation coefficients between all pairs of
variables in a confidential data set. Next, to produce a mimic
variable, we use the Box–Muller method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to generate
normally distributed random numbers; then, exponential
transformation is applied to the generated values to mimic the
value distribution of the original variable. After generating all
mimic variables, value reordering is applied to the generated
values to mimic the correlation coefficients between all pairs of
original variables.
      </p>
      <p>Notably, our method can freely determine the number
of data points to generate. For example, we could produce a data
set of sample size n = 1000, i.e. 1000 projects, from
an original data set with a much smaller sample size, e.g. n = 30. This
also means that there is no one-to-one mapping of projects
between the original data set and the mimic data set. Therefore,
data privacy and confidentiality are effectively protected even if
the mimic data set is made open.</p>
      <p>
        In contrast, conventional data anonymization methods for
software engineering data employ data mutation techniques to
gain data privacy [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ][
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Since data mutation keeps the
one-to-one mapping of data points between the anonymized data set
and the original data set, threats of breaking the anonymity
cannot be perfectly prevented. Moreover, since strong data
mutation changes the data characteristics, balancing privacy
and utility is a big challenge in this approach [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. On the other
hand, since our mimic data set is composed of randomly generated
data points without keeping a one-to-one mapping to the
original data set, data anonymization is achieved much more
effectively. We believe that companies will be more confident in
using our method than in using data mutation to comply with
various data protection regulations.
      </p>
      <p>
        To evaluate the utility of the proposed method, this paper
presents a case study of producing a mimic data set from the
Desharnais data set [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which is one of the most frequently used
data sets in software effort estimation studies [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In the case
study, we built effort estimation models from both the original
data set and the mimic data set to see whether we could obtain
similar results from both data sets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. RELATED WORK</title>
      <p>
        Peters et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed a data anonymization method
called MORPH to solve privacy issues in software development
organizations. They target defect prediction research and try to
anonymize defect data sets that consist of various software
metrics measured for each source file of a software product.
They use data mutation techniques, which add small
changes to each value to make it difficult to identify a specific
source file in a data set. They further propose a method called
CLIFF, which eliminates data points that are not
necessary for defect prediction. Combining CLIFF with
MORPH, they try to balance the privacy and utility of defect data
sets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>Since their approach is specifically designed for two-group
classification problems (i.e., distinguishing defect-prone files from
non-defect-prone files in a defect data set), it cannot be applied
to general-purpose data sets such as the software project data sets
that we target in this paper. In addition, since data mutation
keeps the one-to-one mapping of data points between the
anonymized data set and the original data set, except for
eliminated ones, threats of breaking the anonymity cannot be
perfectly prevented. In contrast, we try to produce a completely
artificial data set from given characteristics of a confidential data
set.</p>
    </sec>
    <sec id="sec-3">
      <title>III. THE PROPOSED METHOD</title>
      <sec id="sec-3-1">
        <title>A. Basic Idea and Procedure</title>
        <p>In this paper, a confidential data set that needs to be kept
secret is called a “source data set” or simply “source data.”
The artificially generated data that mimics the source data is
called a “mimic data set” or “mimic data.”</p>
        <p>
          As source data, we target software project data sets. Table I
shows a part of Desharnais data set [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which is one of the
commonly used software project data sets for effort estimation
studies. In Table I, “PM” stands for “project manager” and “FP”
stands for “function point.” Many software companies record
similar data sets that consist of various project features. In this
paper we assume that there is no missing value in a data set.
        </p>
        <p>
          Typically, software project data sets contain software size
metrics such as Function Point (FP) and Source Lines of Code
(SLOC), as well as the project length (often denoted as
“duration”) and the development effort. It is known that
the probability distribution of these variables roughly follows a
log-normal distribution [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Therefore, this paper approximates
the value distribution of quantitative variables by the log-normal
distribution.
        </p>
        <p>After setting the number of cases n to be generated in the
mimic data, the procedure to generate mimic data from source
data is as follows:</p>
        <p>Step 1: For each ratio scale or interval scale variable in
the source data, generate a set of artificial values whose
distribution is similar to the source data.</p>
        <p>Step 2: For each ordinal scale or nominal scale variable
in the source data, generate a set of artificial values
whose distribution is similar to the source data.</p>
        <p>Step 3: For all variables in the mimic data, repeat
swapping of values so that the correlation coefficient
matrix of the mimic data becomes similar to that of the
source data.</p>
        <p>In the next section, details of these steps are described.</p>
      </sec>
      <sec id="sec-3-2">
        <title>B. Step 1. Generation of Ratio/Interval Scale Variables</title>
        <p>
          This paper employs the Box–Muller method [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to generate
quantitative variables. The Box–Muller method, also called the
Box–Muller transform, is an algorithm for generating a pair of
normally distributed random numbers N(μ, σ²) from
uniformly distributed random numbers. Its mathematical
expression is as follows:
N₁ = √(−2 ln U₁) cos(2π U₂)
N₂ = √(−2 ln U₁) sin(2π U₂)
where U₁ and U₂ are independent samples from the
uniform distribution on the interval (0, 1). Such U₁ and U₂
are easily generated in many programming
languages (e.g. by the rand() function in C). N₁ and
N₂ are independent random variables following the standard
normal distribution; scaling by σ and adding μ yields N(μ, σ²).
In this paper we use N₁ only.</p>
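        <p>As an illustrative sketch (not tooling from the paper), the transform above can be written in a few lines of Python; the function name box_muller_pair is our own:</p>

```python
import math
import random

def box_muller_pair(mu, sigma):
    """Box-Muller transform: turn two uniform samples U1, U2 on (0, 1)
    into two independent N(mu, sigma^2) samples."""
    u1 = random.random() or 1e-12  # guard: U1 must not be 0 for log()
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    n1 = r * math.cos(2.0 * math.pi * u2)  # the paper uses N1 only
    n2 = r * math.sin(2.0 * math.pi * u2)
    # the raw deviates are standard normal; scale and shift to N(mu, sigma^2)
    return mu + sigma * n1, mu + sigma * n2
```

        <p>Averaging many such samples recovers the requested mean and variance, which is a quick sanity check for the transform.</p>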
        <p>As mentioned above, we assume that quantitative variables
follow a log-normal distribution. To generate log-normally
distributed random numbers, we apply the exponential
transformation, which is the inverse of the
logarithmic transformation, to the values obtained by the
Box–Muller method.</p>
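        <p>A minimal sketch of this generation step follows; for brevity it uses Python's random.gauss as the normal generator in place of a hand-rolled Box–Muller routine (the two are equivalent in distribution), and the function name is hypothetical:</p>

```python
import math
import random

def mimic_lognormal(mu_log, sigma_log, n):
    """Generate n log-normally distributed mimic values: draw
    N(mu_log, sigma_log^2) deviates and apply the exponential
    transformation.  random.gauss stands in for Box-Muller here."""
    return [math.exp(random.gauss(mu_log, sigma_log)) for _ in range(n)]
```

        <p>The mean of the log-transformed output converges to mu_log, matching the statistic measured from the source data.</p>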
        <p>As an example, Fig. 1 shows the value distribution of “effort”
in the Desharnais data set, which we consider as source data. Fig. 2
shows its log-transformed value distribution. We see in Fig. 2
that the log-transformed effort values roughly follow the normal
distribution. We can use the standard deviation σ and the mean
value μ of Fig. 2 to generate the mimic data by the Box–Muller
method. Fig. 3 shows the result of the Box–Muller method,
which is the mimic data of Fig. 2. Finally, Fig. 4 shows the result
of its exponential transformation, which is mimic data of Fig.
1. Although the values in Fig. 4 are all artificially generated,
we see that Fig. 4 closely resembles Fig. 1.</p>
        <p>In addition, by the following equations, we can directly obtain
the standard deviation σ and the mean value μ of the log-transformed
source data from the standard deviation σ′ and the mean value
μ′ of the original source data:
σ² = ln{1 + (σ′/μ′)²}
μ = ln(μ′) − σ²/2
This means that a company owning a (secret) source data set
only needs to provide σ′ and μ′ computed directly from the
source data.</p>
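        <p>These are the standard log-normal moment relations, and they can be checked with a short round-trip computation (function name ours):</p>

```python
import math

def lognormal_params(mean_orig, sd_orig):
    """Recover the mean mu and standard deviation sigma of the
    log-transformed data from the mean mu' and standard deviation
    sigma' of the raw (log-normal) source variable."""
    sigma2 = math.log(1.0 + (sd_orig / mean_orig) ** 2)
    mu = math.log(mean_orig) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)
```

        <p>Feeding in the theoretical mean and standard deviation of a log-normal variable with known (μ, σ) returns exactly those parameters.</p>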
      </sec>
      <sec id="sec-3-3">
        <title>C. Step 2. Generation of Ordinal/Nominal Scale Variables</title>
        <p>For each ordinal scale or nominal scale variable in the
source data, we generate a set of artificial values so that the
percentage of cases in each bin is the same as in the source data. For
example, assume that we have an ordinal scale variable
“requirement clarity,” which has four ranks or bins (“1. very
clear”, “2. clear”, “3. unclear”, “4. very unclear”). Also
assume that the percentages of values belonging to these bins are
20% for “1. very clear”, 25% for “2. clear”, 35% for “3. unclear”
and 20% for “4. very unclear”, respectively. Then, to generate
mimic data, we simply generate an artificial mimic sample
whose percentage of cases in each bin is the same as that of the
source data.</p>
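        <p>Step 2 amounts to reproducing the bin proportions; a sketch under our own naming:</p>

```python
import random

def mimic_categorical(source, n):
    """Generate n artificial values whose per-bin proportions match
    the source variable as closely as integer counts allow."""
    counts = {}
    for v in source:
        counts[v] = counts.get(v, 0) + 1
    out = []
    for value, c in counts.items():
        out.extend([value] * round(n * c / len(source)))
    # integer rounding can leave the sample slightly short or long
    while len(out) < n:
        out.append(random.choice(source))
    del out[n:]
    random.shuffle(out)
    return out
```

        <p>Note that, as in Step 1, the generated sample size n need not equal the size of the source data.</p>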
      </sec>
      <sec id="sec-3-4">
        <title>D. Step 3. Mimicking the Relationship among Variables</title>
        <p>For any pair of variables in the source data, there may exist
some sort of relationship. This paper captures such
relationships via the correlation coefficient matrix of the source
data; and the proposed method tries to make the correlation
coefficient matrix of the mimic data close to that of the source
data. This can be done by swapping values within a variable,
which does not alter the value distribution of that variable. In
this study, we assume there are some outliers in the source data;
therefore, we decided to use Spearman's rank correlation
coefficient instead of the Pearson correlation coefficient to
capture the relationships among variables.
We propose the following procedure to mimic the
relationships among variables in the source data.</p>
        <p>1. Compute the correlation coefficient matrix of the
source data.
2. Randomly select one variable in the mimic data. Then,
randomly select two values of this variable, and
swap them.
3. If the correlation coefficient matrix of the mimic data
becomes more similar to that of the source data, we
consider the value swap successful and go
back to step 2. Otherwise, we consider the
swap unsuccessful, cancel it, and go
back to step 2. To evaluate the similarity of the
correlation coefficient matrices, we use the sum of
squared differences ∑(r′ᵢⱼ − rᵢⱼ)² between the
rank correlation coefficients r′ᵢⱼ of the mimic data and
those rᵢⱼ of the source data. If the sum of squared
differences becomes smaller after the swap, then
we consider that the correlation coefficient matrices
have become more similar.
4. When the sum of squared differences converges,
swapping is completed (i.e. stop repeating step 2).</p>
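        <p>The swap-and-keep loop above can be sketched as a simple hill climb; the Spearman implementation uses the standard rank-then-Pearson definition, and the function names are ours:</p>

```python
import random

def _ranks(xs):
    """Ranks starting at 1, ties receiving the average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rank correlation = Pearson correlation of the ranks."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def fit_correlations(cols, target, iters=20000, seed=0):
    """Repeatedly swap two values inside one randomly chosen column and
    keep the swap only when the sum of squared differences between the
    mimic and target rank-correlation matrices shrinks."""
    rng = random.Random(seed)
    m = len(cols)
    def score():
        return sum((spearman(cols[i], cols[j]) - target[i][j]) ** 2
                   for i in range(m) for j in range(i))
    best = score()
    for _ in range(iters):
        col = cols[rng.randrange(m)]
        i, j = rng.randrange(len(col)), rng.randrange(len(col))
        col[i], col[j] = col[j], col[i]
        s = score()
        if s < best:
            best = s                         # successful swap: keep it
        else:
            col[i], col[j] = col[j], col[i]  # unsuccessful: undo
    return best
```

        <p>Because the swaps only permute values within a column, each variable's marginal distribution from Steps 1 and 2 is preserved exactly while the correlation structure converges toward the target.</p>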
      </sec>
      <sec id="sec-3-5">
        <title>E. Rounding Off Generated Values</title>
        <p>This is an additional step to make the mimic data visually
more similar to the source data. Since the values of quantitative
variables are generated from random numbers, their significant
figures are different from that of source data. For this reason,
each value should be rounded off to an appropriate precision
according to the significant figure of source data. For example,
Function Point is an integer in source data, so it should be
rounded off to integer.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>IV. CASE STUDY</title>
      <p>
        To evaluate the effectiveness of the proposed method, this
section presents a case study of generating mimic data from the
Desharnais data set [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the case study, we built effort
estimation models from both the source data and the mimic data
to investigate their similarity.
      </p>
      <sec id="sec-4-1">
        <title>A. Source data set</title>
        <p>
          The Desharnais data set is one of the most frequently used
data sets in software effort estimation research [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It contains
77 projects without missing values. This case study generated
mimic data of the same sample size (n = 77). The quantitative variables
used in this paper are Duration, Transactions, Entities,
PointsAdjust, and Effort. The qualitative variables used are
TeamExp, ManagerExp, and Lang. TeamExp and ManagerExp
are ordinal scale variables; TeamExp ranges from 0 to 4, and
ManagerExp ranges from 0 to 7. The variable Lang is
divided into two binary variables, Lang2 and Lang3.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>B. Characteristics of Generated Ratio/Interval Scale Variables</title>
        <p>The mean value, standard deviation, maximum value and
minimum value of the quantitative variables (Duration,
Transactions, Entities, PointsAdjust, and Effort) of the source
data and the mimic data are shown in Table II and Table III,
respectively. Their relative differences are shown in Table IV.
From these results, we see that the differences in mean value,
standard deviation and minimum value between the two data
sets are very small, which indicates the effectiveness of the
proposed method. On the other hand, the maximum values
turned out to be not very similar. This is because the source
data contain outliers. Mimicking the outliers is important
future work.</p>
        <p>For more details of the generated variables, the distributions
of the source data and the mimic data for the four quantitative variables
are shown in Figure 5.1 to Figure 5.8. From these figures we can
also visually confirm the similarity between the two data sets. (For the
variable “Effort”, we have already shown the histograms in Fig.
1 and Fig. 4.)</p>
      </sec>
      <sec id="sec-4-4">
        <title>C. Rank Correlation Coefficient Matrix</title>
        <p>We plotted the change in the sum of squared
differences of the rank correlation coefficients against the
number of updates (i.e. successful swaps) of variables. As
shown in the figure, the sum of squared differences becomes very
close to zero (0.000069) as the number of updates increases.</p>
        <p>A part of the rank correlation coefficient matrix of each data set
is shown in Table V and Table VI. From these tables, we
can see that the maximum difference is 0.008, which is
sufficiently small. Thus, we consider that the relationship
between any two variables is sufficiently reproduced.</p>
      </sec>
      <sec id="sec-4-5">
        <title>D. Comparison of Effort Prediction Models</title>
        <p>
          Assuming an effort estimation study that uses mimic data,
we conducted log-log regression modeling on both the source data
and the mimic data, and investigated their similarity.
The objective variable is “Effort” and the other variables are
predictor variables. A log-log regression model is a linear
regression model with logarithmic transformation applied to
both the predictor variables and the objective variable before model
construction. Kitchenham and Mendes [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] pointed out the
necessity of logarithmic transformation to improve the
prediction performance of effort estimation models.
        </p>
        <p>The results of the log-log regression for the source data and the mimic
data are shown in Table VII and Table VIII. From these tables,
we see that the constant (intercept) and the coefficients of the predictor
variables are similar. The R² values of these models are 0.882
for the source data and 0.820 for the mimic data, which are also similar.
Looking at the p-values, for some variables the p-value is not very
similar. One possible reason is that outliers might have
affected the p-values. We need further investigation in our future
study. Also, in future work we will evaluate the prediction
performance of the models.</p>
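        <p>For readers who want to reproduce this style of comparison, a minimal single-predictor log-log fit (e.g. Effort against one size metric; the paper's models use several predictors) can be written as follows. The function names are our own:</p>

```python
import math

def loglog_fit(x, y):
    """Ordinary least squares fit of log(y) = a + b * log(x), i.e. a
    single-predictor log-log regression model."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = float(len(lx))
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (w - my) for u, w in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = my - b * mx
    return a, b

def predict_effort(a, b, x):
    """Back-transform a prediction from log space to the raw scale."""
    return math.exp(a + b * math.log(x))
```

        <p>Fitting this model to both the source data and the mimic data and comparing (a, b) is exactly the similarity check performed in this section.</p>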
      </sec>
    </sec>
    <sec id="sec-5">
      <title>V. SUMMARY</title>
      <p>In this paper we proposed a method for artificially
generating a mimic data set from a given (confidential) source
data set. From a case study with a software project data set, our
main findings are as follows.</p>
      <p> The standard deviation and the mean value of
quantitative variables of mimic data are very similar to
that of source data.
 The rank correlation coefficient matrix of mimic data
is very similar to that of source data.
 Effort estimation models using log-log regression
modeling built from source data and mimic data are
similar in their coefficients.</p>
      <p>In future work, we will evaluate the prediction performance
of the built models. We will also apply various data
analysis techniques, such as clustering and association rule
mining, to mimic data to evaluate the utility of the proposed
method. In addition, we will try to improve our method by
mimicking more aspects of the source data, such as outliers,
skewness and kurtosis of variables.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Albrecht</surname>
          </string-name>
          , J. Gaffney, “
          <article-title>Software function, source lines of code, and development effort prediction</article-title>
          ,
          <source>” IEEE Transactions on Software Engineering</source>
          , vol.
          <volume>9</volume>
          , pp.
          <fpage>639</fpage>
          -
          <lpage>648</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Azzeh</surname>
          </string-name>
          , “
          <article-title>A replicated assessment and comparison of adaptation techniques for analogy-based effort estimation</article-title>
          ,
          <source>” Empirical Software Engineering</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>1-2</issue>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Baskeles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Turhan</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bener</surname>
          </string-name>
          , “
          <article-title>Software effort estimation using machine learning methods</article-title>
          ,
          <source>” Proc. 22nd International Symposium on Computer and Information Sciences (ISCIS2007)</source>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>131</lpage>
          , Dec.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boehm</surname>
          </string-name>
          , “Software engineering economics,” Prentice-Hall,
          <string-name>
            <surname>NY</surname>
          </string-name>
          ,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. E. P.</given-names>
            <surname>Box</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Muller</surname>
          </string-name>
          , “
          <article-title>A note on the generation of random normal deviates,”</article-title>
          <source>The Annals of Mathematical Statistics</source>
          , vol.
          <volume>29</volume>
          , no. 2 pp.
          <fpage>610</fpage>
          -
          <lpage>611</lpage>
          ,
          <year>1958</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Briand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Langley</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Wieczorrek</surname>
          </string-name>
          , “
          <article-title>A replicated assessment and comparison of common software cost modeling techniques</article-title>
          ,
          <source>” Proc. 22nd International Conference on Software Engineering (ICSE2000)</source>
          , pp.
          <fpage>377</fpage>
          -
          <lpage>386</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>J.-M. Desharnais</surname>
          </string-name>
          , “
          <article-title>Analyse statistique de la productivitie des projects informatique a partie de la technique des point des function,”</article-title>
          <source>Master's Thesis</source>
          , University of Montreal,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pewsey</surname>
          </string-name>
          , “
          <article-title>Sinh-arcsinh distributions</article-title>
          ,
          <source>” Biometrika</source>
          , vol.
          <volume>96</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>761</fpage>
          -
          <lpage>780</lpage>
          , Dec.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Kemerer</surname>
          </string-name>
          , “
          <article-title>An empirical validation of software cost estimation models,” Communications of the ACM</article-title>
          , vol.
          <volume>30</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>416</fpage>
          -
          <lpage>429</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kitchenham</surname>
          </string-name>
          , and E. Mendes, “
          <article-title>Why comparative effort prediction studies may be invalid</article-title>
          ,
          <source>” Proc. 5th International Conference on Predictor Models in Software Engineering</source>
          , Article no.
          <issue>4</issue>
          , May
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kocaguneli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Menzies</surname>
          </string-name>
          , J. Keung, “
          <article-title>On the value of ensemble effort estimation”</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>1403</fpage>
          -
          <lpage>1416</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Maxwell</surname>
          </string-name>
          ,
          <source>Applied statistics for software managers</source>
          , Englewood Cliffs, NJ: Prentice-Hall,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Menzies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pryor</surname>
          </string-name>
          , “
          <article-title>The promise repository of empirical software engineering data</article-title>
          ,” http://openscience.us/repo, North Carolina State University, Department of Computer Science,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Peters</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Menzies</surname>
          </string-name>
          , “
          <article-title>Privacy and utility for defect prediction: experiments with MORPH</article-title>
          ,”
          <source>Proc. International Conference on Software Engineering</source>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>199</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Menzies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , “
          <article-title>Balancing privacy and utility in cross-company defect prediction</article-title>
          ,”
          <source>IEEE Transactions on Software Engineering</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1054</fpage>
          -
          <lpage>1068</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Phannachitta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monden</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          , “
          <article-title>A stability assessment of solution adaptation techniques for analogy-based software effort estimation</article-title>
          ,”
          <source>Empirical Software Engineering</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>504</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <collab>Software Reliability Enhancement Center, Information-technology Promotion Agency</collab>
          , “
          <article-title>White paper on software development data in 2016-2017</article-title>
          ,” SEC Books,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>