<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weighted Multi-Factor Multi-Layer Identification of Potential Causes for Events of Interest in Software Repositories</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright © 2015 by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE 2015 Seminar on Advanced Techniques and Tools for Software Evolution, University of Mons, Belgium</institution>
          ,
          <addr-line>6-8</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Philip Makedonski, Jens Grabowski, Institute of Computer Science, University of Göttingen, Goldschmidtstr.</institution>
          <addr-line>7, 37077 Göttingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Change labelling is a fundamental challenge in software evolution. Certain kinds of changes can be labelled based on directly measurable characteristics. Labels for other kinds of changes, such as changes causing subsequent fixes, need to be estimated retrospectively. In this article we present a weight-based approach for identifying potential causes for events of interest based on a cause-fix graph supporting multiple factors, such as causing a fix or a refactoring, and multiple layers reflecting different levels of granularity, such as project, file, class, and method. We outline different strategies that can be employed to refine the weight distribution across the different layers in order to obtain more specific labelling at finer levels of granularity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The field of software mining explores different
approaches for extracting information from software
repositories, both in the form of basic facts and in the
form of derived knowledge. While software repositories
provide a wealth of information related to the
development and evolution of software projects, most of it
is of an empirical nature, that is, describing consequences
rather than causes. For example, developers typically
describe their development and maintenance activities
as fixing issues and problems, improving certain
properties, adding features and functionality, and
refactoring code. In contrast, during software assessment, we
are often more interested in the potential causes for
such activities, which are typically not explicitly
labelled as such because such knowledge is
usually not available at the time the
corresponding activity was performed.</p>
      <p>In this article, we are concerned with activities
which are associated with contributing to various
technical risks for undesirable phenomena, such as failures
or difficult-to-maintain code that needs refactoring.
We explore means for the retrospective identification
and quantification of such activities based on
empirical data and different factors contributing to labelling
activities as risky. The quantitative information in the
form of weights provides a more refined view on the
extent to which an activity can be considered a technical
risk. The presented approach can be generalised to
labelling activities as potential causes for events of
interest with respect to any particular assessment task,
regardless of whether it is concerned with a technical
risk or not.</p>
      <p>Existing approaches are typically based on some
form of origin analysis [GT02], involving
line tracking and annotation graphs [KZPW06], line
histories [CC06], and line mapping [MHC14], as well as
several refinements to these [WS08, CCDP09], in order
to map and track entities across revisions. Different
applications of such approaches have been discussed
in the literature, ranging from finding fix-inducing
changes [SZZ05] and the role of authorship on
implicated code [RD11] to defect-insertion circumstance
analysis [PP14]. While these are closely related to the
topic of this article, to our knowledge none of the
existing approaches has incorporated weighting of the
extent to which a change contributes to a subsequent fix.
The weighting information can be used to refine and
improve existing applications, such as better targeted
recommendations for artifacts that need additional
review or testing.</p>
      <p>This article is structured as follows: In Section 2
we outline the basic notions related to our approach.
In Section 3 we discuss the weighting approach and
its generalisation for arbitrary factors. In Section 4
we refine the approach to cover multiple levels of
abstraction across distinct layers. Then, in Section 5, we
discuss different strategies for distributing the weights
across the layers. Section 6 summarises related work.
Finally, we conclude with a short summary and
outlook in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Causes and Fixes</title>
      <p>In this article we are concerned with determining the
likely causes for events of interest. Before we proceed,
we need to establish what we consider under “events
of interest” and other related notions.</p>
      <p>Artifact: A generalised notion of a software-related
entity a at any level of granularity, such as project,
file, class, or method, on which developers perform
development and maintenance activities. An
artifact may contain other artifacts at finer levels of
granularity.</p>
      <p>State: A generalised notion of a revision a_t of artifact
a at a point in time t. The set of all states of an
artifact a is denoted as A.</p>
      <p>Event of interest: A state a_t of an artifact a at a
point in time t which can be described by some
quantitative or qualitative characteristic factor,
such as the content of a descriptive message
associated with the state.</p>
      <p>Fix: A modification to an existing part of an artifact
a in a given state a_t that was last modified or
created at an earlier point in time t − n, resulting
in a state a_{t−n}. The modification may, but does
not strictly need to, relate to fixing a problem.</p>
      <p>Cause: A modification of a part of an artifact a at a
given state a_t that was modified at a later point
in time t + n, resulting in a state a_{t+n}.</p>
      <sec id="sec-2-1">
        <title>Cause-Fix Relationship</title>
        <p>A relationship between
two states (a_{t−n}, a_t) of an artifact a, where a
part of a that was modified in a_{t−n} was
subsequently modified in a later state a_t; hence a_{t−n}
is considered a cause for a_t. It is denoted as
a_{t−n} causes a_t.</p>
        <p>Cause-Fix Graph: A hierarchical directed graph
G = (N, E), where the set of nodes N includes
representations for each state of each artifact. A
state may contain other states at finer levels of
granularity, i.e. c_t contains a_t, based on the
containment relationships between the corresponding
artifacts of the states (assuming that artifact c
contains artifact a). For
example, the state of a class may also contain states of
methods modified at the same time as the class.
The set of directed edges E includes
representations for each cause-fix relationship between two
states of an artifact.</p>
        <p>Based on the cause-fix relationships, for a given
state a_t identified as a fix, we define the set of states
fixed by a_t (i.e. the set of causes for a_t) as:
a_t.FIXES = { a_{t−n} ∈ A : a_{t−n} causes a_t }
(1)</p>
        <p>Conversely, for a given state a_{t−n} identified as a
cause, the set of known caused fixes for a_{t−n} is defined
as:
a_{t−n}.CAUSES = { a_t ∈ A : a_{t−n} causes a_t }
(2)</p>
        <p>A cause-fix graph can be constructed by utilising
information extracted from version control systems.
This can be accomplished automatically by applying
any of the approaches for tracking the location of
modified fragments across revisions already described
in the literature [WS08, CCDP09] and transforming
their output. The resulting graph at the project (or
global) level of abstraction represents the cause-fix
relationships between states of the whole project. An
example of such a graph for five states of a project p
(p1 to p5) is shown in Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Weights and Factors</title>
      <p>A simplified binary classification of nodes in the graph
as causes for events of interest presents some
limitations. The basic example from Figure 1 already raises
two questions related to the significance of the
classifications:
• Given that both p3 and p4 are identified as causes
for the fix in p5, are they both equally likely causes
and thus to be considered of equal importance?
• Given that p3 is identified as causing both p4 and
p5, is it then considered a less likely cause for p5,
and thus to be considered of less importance?
(Figure 1 annotates each state p1 to p5 with its removed (rw), total
(tw), and average (aw) weights for the “fixes” and “refactors” factors,
and each causes edge with its contributed weights (cw).)</p>
      <p>
In order to be able to reason about these questions,
we need means to quantify the relationships between
fixes and causes. We can establish that cause-fix
relationships are many-to-many, that is a revision may
be the cause for many subsequent revisions, and a
revision may fix multiple previous revisions.
Conceptually, we consider a fix as an activity that is “removing
a weight” from a state of an artifact. Consequently
the activities that contributed to the causes for the
fix “added weight” to the corresponding states of the
artifact. Our approach to quantifying the degree to
which a revision can be considered as the cause for
another revision is based on this conceptual premise.
In addition, there may be different types of “weights”
based on different characteristics of the fixing revision,
e.g. “fixing an issue”, “refactoring code”, etc.,
reflecting the different kinds of events of interest. In order
to accommodate this, we extend the notion to
“removing a weight related to a weight factor wf”, where
wf ∈ { fixes, refactors, . . . }. Thus, we speak of a fixing
revision a_t as having removed weight (rw) with respect
to weight factor wf, where:
rw(a_t, wf) = 1 if the wf property holds for a_t, 0 otherwise
(3)</p>
      <p>Each of the causes a_{t−n} can be regarded as
contributing to that weight. Thus, for each cause-fix
relationship a_{t−n} causes a_t and for each weight factor
wf, we define the notion of contributed weight (cw)
of a causing revision a_{t−n} to a fixing revision a_t with
regard to a weight factor wf as:
cw(a_{t−n}, a_t, wf) = rw(a_t, wf) · 1 / |a_t.FIXES|
(4)</p>
      <p>For each fix a_t caused by a causing state a_{t−n}, the
causing state a_{t−n} is then said to accumulate a total
weight (tw) with regard to weight factor wf, defined
as:
tw(a_{t−n}, wf) = Σ_{a_t ∈ a_{t−n}.CAUSES} cw(a_{t−n}, a_t, wf)
(5)</p>
      <p>For example, the fix in p5 is removing a weight
rw(p5, fixes) = 1 with respect to the weight factor
“fixes”. In the set P denoting all states of p, there are
two states p5.FIXES = { p_{5−n} ∈ P : p_{5−n} causes p5 } =
{p3, p4} identified as causes for this fix. They are
considered to be contributing equally to that weight. In
this case, each cause-fix relationship is contributing a
weight cw(p_{5−n}, p5, fixes) = 0.5. On the other hand,
p4 can be considered “neutral” with respect to the
“fixes” weight factor (i.e. rw(p4, fixes) = 0), as it is not
identified as an event of interest. Hence, p3 does not
contribute any weight to p4 (i.e. cw(p3, p4, fixes) = 0).
In this case, we speak of p3 and p4 as having
tw(p3, fixes) = tw(p4, fixes) = 0.5. Thus, at first
glance it may seem that p3 and p4 can be considered
equally important.</p>
      <p>In order to reason about the second question, we
need to contemplate the inverse relationship.
Considering p3 in the example, it causes both p4 and p5, i.e.
p3.CAUSES = { p_{3+n} ∈ P : p3 causes p_{3+n} } = {p4, p5},
whereas p4 only causes p5, i.e. p4.CAUSES = {p5}. To
take this into account, we define the notion of average
weight (aw) with regard to weight factor wf as:
aw(a_{t−n}, wf) = tw(a_{t−n}, wf) / |a_{t−n}.CAUSES|
(6)</p>
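      <p>The weight definitions of Equations (3) to (6) can be sketched in a few lines of Python. This is a minimal illustration; the edge list and the factor properties below are hypothetical inputs chosen to match the running example:</p>

```python
from collections import defaultdict

# Illustrative cause-fix edges and states for which a factor holds.
edges = [("p3", "p4"), ("p3", "p5"), ("p4", "p5")]
removed = {("p5", "fixes"): 1, ("p4", "refactors"): 1}  # rw inputs, Eq. (3)

fixes = defaultdict(set)   # fixing state  -> causing states
causes = defaultdict(set)  # causing state -> fixing states
for c, f in edges:
    fixes[f].add(c)
    causes[c].add(f)

def rw(state, wf):
    # Equation (3): 1 if the wf property holds for the state.
    return removed.get((state, wf), 0)

def cw(cause, fix, wf):
    # Equation (4): each identified cause contributes an equal share.
    return rw(fix, wf) / len(fixes[fix])

def tw(cause, wf):
    # Equation (5): total weight over all fixes caused by this state.
    return sum(cw(cause, f, wf) for f in causes[cause])

def aw(cause, wf):
    # Equation (6): total weight averaged over the caused fixes.
    return tw(cause, wf) / len(causes[cause])

print(tw("p3", "fixes"), aw("p3", "fixes"))  # 0.5 0.25
print(tw("p4", "fixes"), aw("p4", "fixes"))  # 0.5 0.5
print(tw("p3", "refactors"))                 # 1.0
```

The printed values reproduce the weights discussed for the example graph.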
      <p>In the example above, this yields aw(p3, fixes) =
0.25 and aw(p4, fixes) = 0.5, respectively. Thus, we
can state that while both p3 and p4 can be considered
important as causes for the fix in p5 with respect to the
weight factor “fixes”, since p3 is also a cause for p4, it
is less important than p4, as it also caused a “neutral”
change with respect to the weight factor “fixes” in
addition to the fixing change. If we consider the
“refactors” weight factor, we observe that the weights are
distributed differently, since it is p4 where the weight
related to that factor is removed (rw(p4, refactors) =
1) and hence p3 is the only identified cause,
contributing all the removed weight (cw(p3, p4, refactors) =
tw(p3, refactors) = aw(p3, refactors) = 1). The
corresponding weighting is also shown in Figure 1. The
weight-related values are calculated for each weight
factor for each node. Note that while information
about the causing revisions can be considered
definitive, information about the fixing revisions is only
partially known. Future revisions may still include fixes
for existing revisions, thus altering their weights.</p>
    </sec>
    <sec id="sec-4">
      <title>Layers and Granularities</title>
      <p>In the examples discussed so far, only the project level
of granularity was considered. In practice, a revision
at the project level of granularity can be decomposed
into revisions at the file and logical levels of
granularity, where multiple related artifacts at these levels are
changed together as part of a development activity. In
this case, the challenge of transferring weights between
the different levels arises. Furthermore, while a set of
related artifacts may be changed within a causing
revision, only a subset of these artifacts and possibly a set
of additional artifacts may be changed within a
corresponding fixing revision. Thus, the causes and fixes
for a revision of an artifact at a finer level of
granularity may be a subset of the causes and fixes for the
containing artifact. Consequently, the weight
distribution may vary across the different levels of granularity.
This raises two fundamental challenges:
• Given a revision that is the cause for a fix, where
the cause affects multiple artifacts at a finer level
of granularity, are all of these artifacts
contributing equally to the cause for the fix?
• Given a revision that is considered a fix, which
affects multiple artifacts at a finer level of
granularity, are all these artifacts equally important for
the fix?</p>
      <p>To illustrate the first challenge, consider a different
scenario, sketched in Figure 2. In this scenario, there
are three files, x, y, and z, two of which are modified
as part of p3, p4, and p5 at the project level of granularity.
There are two states at the file level for each state at
the project level of granularity. The naive approach
would be to simply copy the weights from the project
level to the file level. With regard to the first challenge,
the question arises whether the states y3 and y4 at the
file level are contributing at all to the cause for the fix
in p5, given that in p5 only x and z have been modified.
In other words, shall y3 and y4 be assigned any weights
at all? The same is also applicable at the logical level.</p>
      <p>Even from this simplified example, we can observe
that the naive copy approach can potentially result in
a lot of noise since the sets of states of artifacts at a
finer level of granularity may vary between the causing
and the fixing states at the coarser level of granularity.
A more adequate approach is to construct a distinct
cause-fix graph at each layer corresponding to a given
level of granularity based on the cause-fix relationships
among the states at that level. This enables weight
redistribution within the corresponding layers, yielding
more accurate weighting for each layer. Consider the
same scenario from Figure 2, where instead of copying
the weights from the project layer, we calculate the
weights at the file layer based only on the cause-fix
relationships at that layer, as illustrated in Figure 3.
This approach yields more accurate weight
distribution, taking into account that only x and z were
modified as part of the fix in p5. Hence, the corresponding
states x3 and z4 carry the full responsibility for
causing the fix in p5 and thus shall be assigned the
corresponding weights, whereas the states y3 and y4 can be
considered neutral in this case and shall be assigned
no weights at all.</p>
      <p>This brings us to the second challenge, which can
be exemplified in the given scenario as follows: given
that both states x5 and z5 at the file level are
considered as part of the fix in p5 at the project level,
are both x5 and z5 contributing equally to the fix
in p5? So far, states at finer levels of granularity
simply inherited the removed weights from the
containing state at a coarser level of granularity, that is
rw(x5, fixes) = rw(z5, fixes) = rw(p5, fixes).
Inheriting the removed weights from the containing state
does not take into account potential dilution of the
contribution of each individual state at the finer level
of granularity. If there is a single state at the finer level
of granularity, it can be considered solely responsible
for the fix, but if there are a large number of states at
the finer level of granularity, each one of them may be
contributing only a small part to the fix.</p>
      <p>Even in this simple artificial scenario, we need to
account for both the number of states at a finer level
of granularity involved in a fix and potentially also
other characteristics of each state in order to obtain a
more accurate picture. This raises some concerns that
need to be taken into account, such as the following:
• Does the number of states of artifacts at a finer
level of granularity involved in a fix dilute the
contribution of each individual state to the fix?
• Do states of certain types of artifacts contribute
more to a fix than others (e.g. states of code vs.
image artifacts)?
• Do states of larger artifacts contribute more to a
fix than states of smaller artifacts?
• Do states of artifacts containing larger changes
contribute more to a fix than states of artifacts
containing smaller changes?</p>
      <p>In order to take these concerns into account in the
weighting approach, we define different weight
distribution strategies, which distribute removed weights
in fixing states across artifact states at finer levels of
granularity depending on their contribution to a fix.
Consequently, the weights calculated for the causing
states are also updated according to the strategy
being used.</p>
    </sec>
    <sec id="sec-5">
      <title>Weight Distribution Strategies</title>
      <p>As noted in Section 4, when we consider the
contribution of each state of an artifact at a finer level of
granularity to a fix in a state of a containing artifact
at a coarser level of granularity, we need to take
different aspects into account, such as the number of states
at the finer level of granularity involved in the fix, the
type and size of the corresponding artifacts, as well as
the amount of change to each corresponding artifact.
To address these concerns, we exemplify four weight
distribution strategies. Additional strategies may be
added to emphasise the importance of other
characteristics of the corresponding artifacts, such as their
complexity, documentation availability, etc. The weight
distribution strategies refine the notion of removed weight
(rw) to distributed removed weight (drw). The
distributed removed weight according to a distribution
strategy ds for a state a_t of artifact a contained in a
state c_t of containing artifact c is defined as:
drw(a_t, wf, ds) = rw(c_t, wf) · df(a_t, ds)
(7)
where the distribution factor df(a_t, ds) for a distribution
strategy ds determines the proportion of the
removed weight from the containing state c_t allocated
to the contained state a_t according to the distribution
strategy of choice. As a baseline, the distribution
factor for the inherit strategy discussed in Section 4 and
shown in Figure 3 can be defined as:
df(a_t, inherit) = 1
(8)</p>
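      <p>A minimal Python sketch of Equations (7) and (8), with hypothetical names for the states and inputs, illustrates how a distribution strategy plugs into the distributed removed weight:</p>

```python
# Equations (7)-(8) as a sketch: drw scales the removed weight of the
# containing state c_t by a strategy-specific distribution factor df.

def df_inherit(state, contents):
    # Equation (8): every contained state inherits the full weight.
    return 1.0

def drw(state, contents, rw_container, df):
    # Equation (7): drw(a_t, wf, ds) = rw(c_t, wf) * df(a_t, ds)
    return rw_container * df(state, contents)

# p5 removes weight 1 for the "fixes" factor; under the inherit
# strategy both contained file states receive it in full.
contents = ["x5", "z5"]
print([drw(s, contents, 1, df_inherit) for s in contents])  # [1.0, 1.0]
```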
      <p>Substituting the removed weight with the
distributed removed weight in the calculation of the
contributed weights enables the support for distributed
removed weights according to a given strategy
throughout the approach.
The shared strategy takes into account the number of
states of artifacts at a finer level of granularity
involved in a fix, based on the assumption that a large
number of states dilutes the contribution of each
individual state to the fix. This strategy distributes the
removed weight equally, assuming that each state at a
finer granularity contributes equally to the fix. As a
consequence, the more states contribute to a fix, the
less impact each individual state has. Given the set
of states at a finer level of granularity contained in a
state c_t, defined as:
c_t.CONTENTS = { a_t : c_t contains a_t }
(9)
the distribution factor for the shared strategy is
defined as:
df(a_t, shared) = 1 / |c_t.CONTENTS|
(10)</p>
      <p>The application of the shared strategy to the
running example from Figures 2–3 and the resulting
weight redistribution is shown in Figure 4. Since
two states at the file level of granularity are
involved in the fix at the project level of granularity,
df(x5, shared) = df(z5, shared) = 0.5 and hence
drw(x5, fixes, shared) = drw(z5, fixes, shared) = 0.5.
Consequently, the total and average weights of the
corresponding causing states at the file level of granularity
are also adjusted. Thus, the dilution of the
contribution of each state at the finer level of granularity to the
fix is also extended to the total and average weights of
the corresponding causing states.</p>
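      <p>The shared strategy's distribution factor can be sketched as follows (the contained state names are illustrative, matching the scenario above):</p>

```python
def df_shared(state, contents):
    # Equation (10): the removed weight is shared equally among all
    # states contained in the fixing state.
    return 1.0 / len(contents)

# Two file states are contained in the fix p5, so each receives half
# of the removed weight.
contents = ["x5", "z5"]
print(df_shared("x5", contents))  # 0.5
```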
      <p>While we exemplify only the application of the
strategy to the project and file levels of granularity,
this strategy is also applicable at different logical levels
of granularity. Note, however, that it shall be applied
at each logical level of granularity (e.g. Class, Method,
Function, etc.) separately, which makes its application
at that level more similar to the type strategy.</p>
      <sec id="sec-5-1">
        <title>Type Strategy</title>
        <p>The type strategy takes into account how much states
of artifacts at a finer level of granularity contribute to a
fix based on the artifact type (at) of the corresponding
artifact. This strategy distributes the removed weight
equally among states of artifacts of a selected type
(indicated as a parameter), while states of artifacts of
other types do not get any removed weight assigned.
It can be used to emphasise the importance of states
of code artifacts and de-emphasise the importance of
image artifacts, for example. The distribution factor
for the type strategy for a given type T is defined as:
df(a_t, type:T) = 1 / |{ s_t ∈ c_t.CONTENTS : at(s_t) = T }| if at(a_t) = T, 0 otherwise
(11)
where { s_t ∈ c_t.CONTENTS : at(s_t) = T } denotes the set
of states of artifacts of type T contained in c_t.</p>
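        <p>The type strategy's distribution factor can be sketched in Python as follows; the type assignments are hypothetical, matching the scenario where x is a code file and z an image:</p>

```python
def df_type(state, contents, T, artifact_type):
    # The type strategy: share the removed weight equally among the
    # contained states whose artifact type is T; others get no weight.
    if artifact_type[state] != T:
        return 0.0
    typed = [s for s in contents if artifact_type[s] == T]
    return 1.0 / len(typed)

artifact_type = {"x5": "code", "z5": "image"}  # illustrative types
contents = ["x5", "z5"]
print(df_type("x5", contents, "code", artifact_type))  # 1.0
print(df_type("z5", contents, "code", artifact_type))  # 0.0
```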
        <p>The application of the type strategy for the type
code to the running example from Figures 2–4 and the
resulting weight redistribution is shown in Figure 5.
Of the two states at the file level of granularity
involved in the fix at the project level of granularity,
only x5 is of type code, hence df (x5, type:code) = 1,
whereas df (z5, type:code) = 0 since at(z5) = image.
Consequently, drw(x5, fixes, type:code) = 1, whereas
drw(z5, fixes, type:code) = 0. The total and average
weights of the corresponding causing states at the file
level of granularity are adjusted respectively. Thus,
the emphasis on the contribution of states of code
artifacts to the fix is also extended to the total and average
weights of the corresponding causing states.</p>
        <p>This strategy can be applied multiple times for
different types of artifacts, essentially resulting in a
distribution of removed weights “within type”, i.e. the
removed weight of a fixing state at the project level of
granularity is distributed once among all states of code
artifacts, then again independently among all states of
test artifacts, and so on. In a similar manner, it can
also be applied at the different logical levels of
granularity (e.g. Class, Method, Function, etc.) individually
in order to obtain the equivalent of the shared
strategy at the file level of granularity applied at the logical
levels of granularity.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Size Strategy</title>
        <p>The size strategy emphasises the impact of the size of
an artifact (as) in a given state that is considered as
a part of a fixing state at a coarser level of
granularity. The underlying assumption is that larger artifacts
require more time and effort to maintain [ABJ10] and
thus more emphasis shall be placed on such artifacts
and their contribution to the occurrence of an event
of interest, such as a fix. Hence, if there is weight
to be removed in a fix, the chunk of that weight to be
removed from a given artifact is assumed to be
proportional to the size of the artifact. The size of an artifact
is generally measured in terms of lines of code; however,
other measures may be used as well. The distribution
factor for the size strategy is defined as:
df(a_t, size) = as(a_t) / as(c_t)
(12)</p>
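        <p>Equation (12) can be sketched as follows; the sizes are illustrative, and the containing state's size as(c_t) is assumed here to be the sum of the contained sizes:</p>

```python
def df_size(state, container_size, size):
    # Equation (12): the share of removed weight is proportional to the
    # artifact's size relative to the size of the containing artifact.
    return size[state] / container_size

size = {"x5": 40, "z5": 60}          # illustrative sizes in lines
container_size = sum(size.values())  # as(c_t), assumed to be the total
print(df_size("x5", container_size, size))  # 0.4
print(df_size("z5", container_size, size))  # 0.6
```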
        <p>The application of the size strategy to the running
example from Figures 2–5 and the resulting weight
redistribution is shown in Figure 6. Given the
artifact sizes as(x5) = 40 and as(z5) = 60, the
corresponding distribution factors are df(x5, size) = 0.4
and df(z5, size) = 0.6, which are also identical to the
respective distributed removed weights for x5 and z5.
The total and average weights of the corresponding
causing states at the file level of granularity are also
adjusted respectively, emphasising the impact of the
size of the corresponding artifacts in the fixing state
on their contribution to the fix as indicated by the
removed weight assigned to them, and also on the
total and average weights of the corresponding causing
states.</p>
        <p>Similar to the shared strategy, the size strategy
shall be applied at each logical level of granularity
(e.g. Class, Method, Function, etc.) separately, which
effectively results in a refinement of the size strategy
that also integrates the type strategy. In that case, the
size strategy takes a parameter T denoting the type of
artifacts it shall be applied to. Given the typed artifact
size (tas) for a state of an artifact c_t and an artifact
type T, defined as the sum of the sizes of all artifacts
of type T in the states contained in c_t:
tas(c_t, T) = Σ_{s_t ∈ c_t.CONTENTS : at(s_t) = T} as(s_t)
(13)</p>
        <p>the size strategy is refined by integrating the tas in
the distribution factor, resulting in:
df(a_t, size:T) = as(a_t) / tas(c_t, T) if at(a_t) = T, 0 otherwise
(14)</p>
        <p>Apart from the application at the logical levels of
granularity, this refinement also combines the
emphasis on the type and the size of the artifact. When
applied at the file level of granularity, only the size of
artifacts of the given type is taken into consideration.
If a fixing state at the project level includes states of
artifacts of different types, e.g. code and test, and we
are interested primarily in artifacts of type code, the
typed size strategy distributes the removed weight
according to the size of code artifacts only. Thus, even
if the fixing state contains large test artifacts, they
will have no impact on the weight distribution among
the code artifacts. Similar to the type strategy, the
typed size strategy can be applied multiple times for
different types of artifacts, essentially resulting in a
distribution of removed weights “within type”.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Churn Strategy</title>
        <p>The churn strategy emphasises the impact of the
amount of change (churn) of an artifact (ac) in a given
state that is considered as a part of a fixing state at
a coarser level of granularity. The underlying
assumption is that larger changes in artifacts require more
time and effort to perform and potentially contribute
more to the occurrence of an event of interest, such as
a fix. Hence, if there is weight to be removed in a fix,
the chunk of that weight to be removed from a given
artifact is assumed to be proportional to the amount
of change that needed to be performed in the artifact.
The distribution factor for the churn strategy is
defined as:
df(a_t, churn) = ac(a_t) / ac(c_t)
(15)</p>
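        <p>Equation (15) mirrors the size strategy with churn in place of size; a sketch with illustrative churn values, assuming ac(c_t) equals the sum of the contained churns:</p>

```python
def df_churn(state, container_churn, churn):
    # Equation (15): the share of removed weight is proportional to the
    # amount of change in the artifact's state.
    return churn[state] / container_churn

churn = {"x5": 4, "z5": 1}             # churned lines, illustrative
container_churn = sum(churn.values())  # ac(c_t), assumed to be the total
print(df_churn("x5", container_churn, churn))  # 0.8
print(df_churn("z5", container_churn, churn))  # 0.2
```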
        <p>The application of the churn strategy to the
running example from Figures 2–6 and the resulting
weight redistribution is shown in Figure 7. Given
that ac(x5) = 4 and ac(z5) = 1, the
corresponding distribution factors are df(x5, churn) = 0.8 and
df(z5, churn) = 0.2, which are also identical to the
respective distributed removed weights for x5 and z5.
The total and average weights of the corresponding
causing states at the file level of granularity are also
adjusted respectively. This emphasises the impact of
the amount of change in the states of the
corresponding artifacts in the fixing state on their contribution
to the fix. Their contribution is indicated by the
removed weight assigned to them. By extension, this
also emphasises the impact of the amount of change
on the total and average weights of the corresponding
causing states.</p>
        <p>Contemplating the application of both the size and
the churn strategies, as illustrated in Figure 6 and
Figure 7, respectively, we may observe a contradiction
in the weight distributions. The size strategy indicates
that z5 is contributing more to the fix in p5 due to its
larger size and hence its causing state z4 is the more
likely cause for the fix in p5. On the other hand, the
churn strategy indicates that x5 is contributing more
to the fix in p5 due to the larger amount of change in x5
and hence its causing state x3 is the more likely cause
for the fix in p5. The different strategies ultimately
enable emphasising different characteristics of events
of interest. Which one is to be used depends on the
application context and the assessment task. If the
size of artifacts is perceived as resulting in more effort
involved in maintenance and development tasks, then
the size strategy will be more adequate for identifying
and emphasising the states of artifacts that contribute
both to events of interest and to their likely causes
based on the relative importance of these states with
respect to the effort involved in understanding them.
On the other hand, if the amount of change in states
of artifacts is considered more critical with respect to
the effort involved in maintenance and development
tasks, then the churn strategy will be more adequate.
The states of artifacts that contribute both to events of
interest and to their likely causes can be identified and
emphasised based on their relative importance with
respect to the effort involved in modifying them.</p>
        <p>There are different kinds of churn measures
described in the literature [KAG+96, KS94, MGP13,
NB05]. We consider a rather simple absolute measure
of churn defined as the sum of additions and removals
in terms of lines (churned lines of code in [NB05]),
where a modification is considered both a removal and
an addition of one or more lines that are part of the
modification. Other notions of churn can also be used
in the churn strategy, however if a relative churn
measure is used, such as the ones described in [NB05], the
distribution factor may need to be adjusted as well.</p>
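As a minimal sketch, the absolute churn measure described above can be computed as follows; the function and parameter names are illustrative assumptions:

```python
def absolute_churn(added, removed, modified):
    """Absolute churn in the sense described above (churned lines of code
    in [NB05]): each modified line counts both as a removal and as an
    addition, on top of the purely added and purely removed lines."""
    return added + removed + 2 * modified

# 10 added, 3 removed, and 5 modified lines churn 10 + 3 + 2*5 = 23 lines.
assert absolute_churn(10, 3, 5) == 23
```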
        <p>Similar to the shared and the size strategies, the
churn strategy shall be applied at each logical level of
granularity (e.g. Class, Method, Function, etc.) separately,
which effectively results in a refinement of the churn
strategy that also integrates the type strategy,
analogous to the size strategy. In that case, the churn
strategy takes a parameter T denoting the type of
artifacts it shall be applied to. Given the typed artifact
churn (tac) for a state of an artifact ct and an artifact
at, the churn strategy is refined by integrating the tac
in the distribution factor, resulting in:
df (at, churn:T ) = as(at) / tac(ct, T ) if ac(at) = T , and 0 otherwise. (17)</p>
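A sketch of the typed churn distribution factor, reading Eq. (17) as dividing the artifact's churn by the total churn of the artifacts of type T in the containing state, by analogy with the size strategy; the data shapes and names are assumptions, not the article's API:

```python
def df_churn_typed(artifact, state, T):
    """Distribution factor in the shape of Eq. (17): artifacts of type T
    receive a share proportional to their churn relative to the typed
    artifact churn tac(ct, T); artifacts of other types receive 0."""
    if artifact["type"] != T:
        return 0.0
    tac = sum(a["churn"] for a in state if a["type"] == T)  # tac(ct, T)
    return artifact["churn"] / tac if tac else 0.0

state = [
    {"name": "Foo.java", "type": "code", "churn": 30},
    {"name": "Bar.java", "type": "code", "churn": 10},
    {"name": "FooTest.java", "type": "test", "churn": 60},
]
# Large changes to the test artifact do not influence the split among
# code artifacts: the shares come out as 0.75, 0.25, and 0.0.
print([df_churn_typed(a, state, "code") for a in state])
```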
        <p>Apart from the application at the logical levels of
granularity, this refinement also combines the
emphasis on the type of the artifact with the amount of change
in the artifact. When applied at the file level of
granularity, only the churn of artifacts of the given type is
taken into consideration. If a fixing state at the project
level includes states of artifacts of different types, e.g.
code and test, and we are interested primarily in
artifacts of type code, the typed churn strategy distributes
the removed weight according to the churn of code
artifacts only. Thus, even if the fixing state contains
large changes to test artifacts, they will have no
impact on the weight distribution among the code
artifacts. Similar to the type and the typed size strategies,
the typed churn strategy can be applied multiple times
for different types of artifacts, essentially resulting in
a distribution of removed weights “within type”.
</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>Existing approaches are typically based on some form
of origin analysis [GT02], involving line tracking and
annotation graphs [KZPW06], line histories [CC06],
line mapping [MHC14], as well as refinements to
these [WS08, CCDP09] in order to map and track
entities across revisions. Historage [HMK11] is an
approach for tracing fine-grained artifact histories
including renaming changes. The approach presented
in this article builds on top of these approaches,
applying origin analysis to events of interest in order to
determine their potential causes and then quantifying
the cause-fix relationships by means of weights. Our
approach also considers different levels of granularity.
Any of the existing approaches can be used as a
foundation and generally the accuracy of the weighting
depends in part on the quality of the results from the
underlying origin analysis approach.</p>
      <p>Different applications for the existing approaches
have been discussed in the literature, ranging from
finding fix-inducing changes [SZZ05] and
understanding the role of authorship on implicated code [RD11] to
defect-insertion circumstance analysis [PP14]. While
in a sense such applications serve a similar purpose,
namely identifying potential causes for events of interest,
they are more focused on identifying such causes before
the event of interest has occurred. Such applications
generally require sufficient information about known
causes for events of interest, which serves as training
data for building pattern recognition models that
are then used to identify potential causes for events
of interest. Both the training and the validation of
such pattern recognition models require data
annotated with known causes for events of interest. The
approach discussed in this article can be applied to
produce such data, emphasising different
characteristics across multiple levels of granularity for different
kinds of events of interest.</p>
      <p>The challenge of “tangled changes” [HZ13] is
also related to the topic of this article: the
authors study the prevalence of changes that are
unrelated or loosely related to events of interest and
apply a multi-predictor approach to untangle them,
based on different confidence voters. The approach
discussed in this article relies on weighting and
different weight distribution strategies to emphasise certain
characteristics of changes related to events of
interest that are considered to be of importance in a given
context. It can further benefit from a more
sophisticated untangling approach, such as the one described
in [HZ13], which can be incorporated as an additional
weight distribution strategy to refine the distribution
of weights among fixing and causing states of artifacts
across the different levels of granularity.</p>
      <p>To the best of our knowledge none of the existing
approaches has incorporated quantification of the
extent to which a change in one state contributes to a
subsequent fix in a later state of an artifact. Also, none
of the approaches has explored how to apply cause-fix
analysis across multiple levels of granularity.
</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this article, we explored a weight-based approach
for finding potential causes for events of interest in
software repositories. An event of interest can be any
occurrence that may be of relevance for an
assessment task, such as fixing issues and problems,
improving properties, adding features and functionality, and
refactoring code. The approach adds quantitative
information on top of existing approaches for origin
analysis, such as ones based on line tracking. The
quantitative information is in the form of weights, where
an event of interest regarded as a fix is considered to
be removing a weight, and the potential causes for the
event of interest are considered to be contributing to
the presence of that weight. Distinct weights can be
calculated across different dimensions, based on the
kind of event of interest, such as a bug fix, refactoring,
etc., designated by a distinct factor for each kind of
interest. The approach accommodates weight
redistribution across multiple layers corresponding to
different levels of granularity in order to provide more
accurate information at these levels of granularity. We
outlined different strategies for weight redistribution
across the different levels of granularity, which enable
emphasising different characteristics of the states of
artifacts involved in an event of interest, such as their
type, size, or the amount of change they have
undergone. The emphasis on different characteristics allows
us to account for the importance of these
characteristics in the effort involved in performing an activity
that leads to an event of interest or its causes. Further
weight distribution strategies may be defined in order
to emphasise other characteristics or combinations of
characteristics of events of interest.</p>
      <p>There are different related approaches described in
the literature which seek to establish relationships
between fixes and their likely causes. However, none
of them has incorporated quantification of the
extent to which a likely cause contributes to a
subsequent fix, especially across multiple levels of
granularity. The presented approach builds on top of these
approaches and generally any of them can serve as a
foundation, providing the relationships between fixes
and their likely causes. Based on these relationships,
the proposed approach can be used to calculate the
corresponding weights and quantify the cause-fix
relationships. There are also different related applications
discussed in the literature which can be used for
similar purposes. However, their scope and focus are mostly
on identifying potential causes for events of interest
where the event of interest has not yet occurred. The
approach discussed in this article can be applied to
provide the necessary information for the configuration,
validation, and refinement of such applications.</p>
      <p>While the set of revisions identified as causes for a
given revision is definitive, meaning that no additional
causes may be added for that revision, the set of
revisions identified as fixes for a given revision reflects the
state of knowledge at a given point in time, meaning
that future revisions may also fix issues introduced in
that revision. This affects the reliability of the
calculated weights. In future work, a suitable cut-off point
in time needs to be defined, after which the calculated
weights for causing states can be considered unreliable.
Such a cut-off point may be based on release tags, or
on the distance between causing and fixing states with
respect to a particular factor, or on the distance
between causing and fixing states in general.</p>
      <p>Establishing the real causes for events of interest
is a hard task. The presented approach provides a
foundation for the quantification of potential causes.
The next step is to investigate the extent to which the
presented approach can be used to determine the real
causes for events of interest, and in particular the role
of different weight distribution strategies and
combinations of strategies towards that goal.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [ABJ10]
          <source>J. Syst. Softw.</source>
          ,
          <volume>83</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [CC06]
          <string-name>
            <given-names>G.</given-names>
            <surname>Canfora</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Cerulo</surname>
          </string-name>
          .
          <article-title>Fine grained indexing of software repositories to support impact analysis</article-title>
          .
          <source>In Proceedings of the 2006 international workshop on Mining software repositories</source>
          , pages
          <fpage>105</fpage>
          -
          <lpage>111</lpage>
          , Shanghai, China,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [CCDP09]
          <string-name>
            <given-names>G.</given-names>
            <surname>Canfora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cerulo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Di Penta</surname>
          </string-name>
          .
          <article-title>Tracking Your Changes: A Language-Independent Approach</article-title>
          .
          <source>Software, IEEE</source>
          ,
          <volume>26</volume>
          (
          <issue>1</issue>
          ):
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>February 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [GT02]
          <string-name>
            <given-names>M.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tu</surname>
          </string-name>
          .
          <article-title>Tracking structural evolution using origin analysis</article-title>
          .
          <source>In Proceedings of the international workshop on Principles of software evolution - IWPSE '02</source>
          , page
          <fpage>117</fpage>
          , Orlando, Florida,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [HMK11]
          <string-name>
            <given-names>Hideaki</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Osamu</given-names>
            <surname>Mizuno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tohru</given-names>
            <surname>Kikuno</surname>
          </string-name>
          .
          <article-title>Historage: Fine-grained Version Control System for Java</article-title>
          .
          <source>In Proceedings of the 12th International Workshop on Principles of Software Evolution and the 7th Annual ERCIM Workshop on Software Evolution, IWPSE-EVOL '11</source>
          , pages
          <fpage>96</fpage>
          -
          <lpage>100</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [HZ13]
          <string-name>
            <given-names>Kim</given-names>
            <surname>Herzig</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Zeller</surname>
          </string-name>
          .
          <article-title>The impact of tangled code changes</article-title>
          .
          <source>MSR '13</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>130</lpage>
          , Piscataway, NJ, USA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [KAG+96]
          <string-name>
            <given-names>T.M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.B.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nandi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>McMullan</surname>
          </string-name>
          .
          <article-title>Detection of software modules with high debug code churn in a very large legacy system</article-title>
          .
          <source>In Seventh International Symposium on Software Reliability Engineering</source>
          ,
          <year>1996</year>
          .
          <source>ISSRE</source>
          <year>1996</year>
          , pages
          <fpage>364</fpage>
          -
          <lpage>371</lpage>
          ,
          <year>October 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [KS94]
          <string-name>
            <given-names>T.M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.M.</given-names>
            <surname>Szabo</surname>
          </string-name>
          .
          <article-title>Improving code churn predictions during the system test and maintenance phases</article-title>
          .
          <source>In International Conference on Software Maintenance, ICSM 1994</source>
          , pages
          <fpage>58</fpage>
          -
          <lpage>67</lpage>
          ,
          <year>September 1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [KZPW06]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.J.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          .
          <article-title>Automatic Identification of Bug-Introducing Changes</article-title>
          .
          <source>In Automated Software Engineering, 2006. ASE '06. 21st IEEE/ACM International Conference on</source>
          , pages
          <fpage>81</fpage>
          -
          <lpage>90</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [MGP13]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Makedonski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jens</given-names>
            <surname>Grabowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Philipp</surname>
          </string-name>
          .
          <article-title>Quantifying the evolution of TTCN-3 as a language</article-title>
          .
          <source>International Journal on Software Tools for Technology Transfer</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>227</fpage>
          -
          <lpage>246</lpage>
          ,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [MHC14]
          <article-title>Covrig: A Framework for the Analysis of Code, Test, and Coverage Evolution in Real Software</article-title>
          .
          <source>In Proceedings of the 2014 International Symposium on Software Testing and Analysis</source>
          ,
          <source>ISSTA 2014</source>
          , pages
          <fpage>93</fpage>
          -
          <lpage>104</lpage>
          , New York, NY, USA,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [NB05]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nagappan</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Ball</surname>
          </string-name>
          .
          <article-title>Use of relative code churn measures to predict system defect density</article-title>
          .
          <source>In 27th International Conference on Software Engineering</source>
          ,
          <year>2005</year>
          .
          <source>ICSE</source>
          <year>2005</year>
          , pages
          <fpage>284</fpage>
          -
          <lpage>292</lpage>
          . IEEE, May
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [PP14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Prechelt</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pepper</surname>
          </string-name>
          .
          <article-title>Why Software Repositories Are Not Used for Defect-insertion Circumstance Analysis More Often: A Case Study</article-title>
          .
          <source>Inf. Softw. Technol.</source>
          ,
          <volume>56</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1377</fpage>
          -
          <lpage>1389</lpage>
          ,
          <year>October 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [RD11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rahman</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Devanbu</surname>
          </string-name>
          .
          <article-title>Ownership, experience and defects: a fine-grained study of authorship</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Software Engineering</source>
          , ICSE '
          <volume>11</volume>
          , pages
          <fpage>491</fpage>
          -
          <lpage>500</lpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [SZZ05]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sliwerski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeller</surname>
          </string-name>
          .
          <article-title>When do changes induce fixes?</article-title>
          <source>In Proceedings of the 2005 international workshop on Mining software repositories</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          , St. Louis, Missouri,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [WS08]
          <string-name>
            <given-names>C.</given-names>
            <surname>Williams</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Spacco</surname>
          </string-name>
          .
          <article-title>SZZ revisited: verifying when changes induce fixes</article-title>
          .
          <source>In Proceedings of the 2008 workshop on Defects in large software systems, DEFECTS '08</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>36</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>