<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Application of The Clustering In Software Development Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>T Afanasieva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I Sibirev</string-name>
          <email>ivan.sibirev@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems Department, Ulyanovsk State Technical University</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>445</fpage>
      <lpage>454</lpage>
      <abstract>
        <p>The paper addresses the task of identifying homogeneous groups of software projects whose development process metrics behave similarly. Such grouping is needed for analyzing, understanding and improving software development practice; for planning; for detecting numerical relationships between key metrics of the development process and the quality and cost of a project; for identifying where effort is spent and where defects occur; and for assessing the impact of technological, process and organizational aspects on the result. To solve this task, the FBC-approach (Fuzzy Behavior Clustering), which combines three types of time series clustering, is applied to group software projects by the similar behavior of their development processes. Two experiments on metrics from open project repositories showed that the proposed FBC-approach is able to identify homogeneous groups of software projects with similar behavior of key software metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The beginning of the 21st century is characterized by the entry of mankind into the "information" era:
information is becoming the main resource, tool and output of production. At the same time, software
engineering is reaching industrial scale, and software projects are becoming increasingly complex. According
to worldwide statistics on software project success, more than half of all projects are controversial
or disastrous [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In software development, version control systems and project and issue tracking
systems (Git, SVN, JIRA, TFS, Bugzilla, Trac, Mantis, Redmine) are used to store data about the processes and to
monitor them. These systems use different terminology for describing the software
development process, and their metric repositories do not always coincide with each other.
Recently, many works have appeared that insist on the need to analyze data
from software development repositories. According to the work of A. Mockus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the objectives of
such analysis are understanding and improving software development practice; identifying
numerical relationships between key metrics of the development process and the quality and cost
of the project; identifying where the main effort is spent and where defects occur; and
revealing the influence of technological, process and organizational aspects on the result [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Therefore, it
is necessary to create data mining tools that support modeling, analysis and prediction of
phenomena in software projects, as well as tools that improve the time, quality and cost of software
development based on data from the repositories.
      </p>
      <p>Among the tasks addressed when analyzing repository data by means of clustering, we
distinguish the following:</p>
      <p>
        Searching for the causes of anomalies and for ways to remove them;
Prediction of software defects (it was revealed in the work [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] that, on average, 20% of the code contains
80% of the errors; identifying the corresponding blocks will facilitate the work of
testers);
Preprocessing of data to obtain more stable results when applying machine learning and
data mining methods.
      </p>
      <p>
        A review of the literature on repository data analysis is given in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Data from a software
repository are also used to plan and coordinate work; to discover what makes changes difficult;
to study which process control tools work and why; to check release readiness criteria; to
implement a process approach to assessing software quality; to search for people (personnel,
customers, etc.); to detect defects; to predict risks from software changes; to find independently
maintainable pieces of code, etc. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Clustering data allows us to split the objects of the software development process into homogeneous
groups. In the work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the program code from the repository is subjected to agglomerative hierarchical
clustering in order to extract the component architecture of object-oriented systems. The coupling
between classes is one of the metrics of object-oriented systems. Inheritance, composition,
aggregation and method calls are distinguished among the possible dependencies between
object-oriented components. Applying agglomerative hierarchical clustering to these dependencies
(the single-linkage method, with a similarity measure as the metric) yields the
components of the object-oriented system.
      </p>
      <p>
        The work of Chintakindi Srinivas, Vangipuram Radhakrishna and C. V. Guru Rao [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
describes the clustering of software components from available repositories for efficient search and
for building libraries. In that paper, a new similarity function is defined to calculate the similarity
between any two software components; it is used to cluster software components by the
"maximum capture" method. After the clusters are formed, each cluster is identified by its word pattern,
calculated using fuzzy Gaussian membership functions. Unsupervised clustering with Markov process
templates, or supervised clustering algorithms with templates that can be used to classify components in
decision-making tasks, are also possible.
      </p>
      <p>
        Many works are devoted to predicting software errors based on repository data;
reviews on this topic are given in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Most of these works involve processing and
analyzing program code using a variety of metrics based on historical data from previous versions of
the software, without clustering.
      </p>
      <p>
        In the works of T. Zimmermann et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and N. Nagappan et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the chance of success of cross-project defect prediction was
investigated. The authors concluded that no single set of metrics is suitable for describing
all projects, and that models built from the history of one project do not
transfer to other projects. However, defect prediction models could be accurate when they were
obtained from similar projects (the similarity was not precisely defined). This raises the
problem of project "similarity". In the work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], an attempt was made to solve it by clustering
vectors whose elements are correlations between object-oriented programming metrics
(OO-metrics) and the number of defects.
      </p>
      <p>
        In the work of James W. Tunnell [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], data from the repository are used to construct time series of
problems that arise during software development and that are sub-problems of a certain parent
problem. The sliding window method, autoregressive models and their testing, and model selection with a
penalty for the residual error and the number of parameters are used. The selected models are used to
predict program code errors.
      </p>
      <p>The review of works on clustering in the analysis of software development
processes leads to the conclusion that static metrics alone do not allow a full analysis
of the state of the development process. Existing approaches to the analysis of software development projects
do not provide for clustering by dynamic process metrics. For a deeper analysis of software
development, it is necessary to investigate not only the (static) metrics of the program code but also the
dynamic metrics of the software development processes: grouping processes with similar dynamics,
revealing the degree of proximity of processes, and comparing groups of linguistic estimates that reflect
qualitative metrics and processes.</p>
      <p>Time series (TS) are often used to represent and study the dynamics of processes in complex
systems, so time series clustering is relevant and in demand; it is considered an important step
in process analysis because it allows homogeneous groups of data to be formed.</p>
      <p>
        A review of time series clustering is given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], where the following approaches are
distinguished:
      </p>
      <p>
        Clustering of raw data (point-based, time-frequency metrics), which requires time series of the
same length and scope with close ranges of values; a further problem is the significant
effect of noise on the clustering result;
Density-based and grid-based clustering, whose limitations are the loss of
individual features of objects and the loss of the view of a time series as a chronological
sequence, so that the nature of time series behavior (increase, decrease, etc.) is not taken into account;
Clustering based on models or machine learning (statistical models and artificial neural networks).
These methods depend on idealizations made when the model is constructed and
lose individual features of objects when these are replaced by generalized metrics; moreover,
they make no allowance for time series behavior, which is fuzzy in nature [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        The basic limitations of the approaches considered above, as applied to clustering software metrics, are the
requirement that the time series have the same length and scope [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the focus on static data only. If the
nature of time series behavior is not taken into account during clustering, the clustering results are
likely to inherit this flaw: it is impossible to automatically detect time series that are similar up to
compression, stretching and shifts of individual sections.
      </p>
      <p>
        Therefore in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the FBC-approach was proposed, which combines the advantages of the above-mentioned approaches and,
in addition, clusters time series by their fuzzy behavior.
      </p>
      <p>
        In this paper, the application of FBC-approach (Fuzzy Behavior Clustering) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to grouping
software project metrics represented as time series is presented. The FBC-approach allows us to obtain
new knowledge about the nature of software development processes from data extracted from the
software repository and, on this basis, to formulate decisions for their improvement. This knowledge is
presented at three levels of the hierarchy of the time series model (general trends, trends and
fluctuations). The proposed approach makes it possible to cluster time series of different dimensions,
time scales and spans, and to obtain clusters of time series with similar behavior. The use of a fuzzy
approach to clustering behavior improves the noise immunity of the algorithm and its
usability.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Statement</title>
      <p>Let us consider the information model of software projects in the form</p>
      <p><disp-formula><label>(1)</label><tex-math>D = \{d_{ijt}\},</tex-math></disp-formula>
where i is the number of the project (i = 1..m), j is the number of the key metric (j = 1..n), and t is the time
moment for the key metric (t = 1..T).</p>
      <p>The task is to analyze software projects by key metrics using the model <inline-formula><tex-math>D = \{d_{ijt}\}</tex-math></inline-formula> in order to
identify groups of similar projects <inline-formula><tex-math>Cl_k</tex-math></inline-formula>, k = 1..Q, that is, clusters of software projects.</p>
      <p>When i = const, <inline-formula><tex-math>d_{i=\mathrm{const},j,t}</tex-math></inline-formula> is a set of time series characterizing the temporal (dynamic)
changes of the metrics of one project. If j = const, then <inline-formula><tex-math>d_{i,j=\mathrm{const},t}</tex-math></inline-formula> is the set of time series describing the temporal
changes of one metric for different projects, such as the number of commits, the number of branches,
the number of developers, the average time between commits, the development time of the project,
etc. In the case t = const, <inline-formula><tex-math>d_{i,j,t=\mathrm{const}}</tex-math></inline-formula> are tabular data with text, numeric and binary values, that is, static
project metrics at a given time.</p>
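      <p>The three ways of slicing the model (1) can be illustrated by a minimal Python sketch. The nested-list layout, the metric names and all values below are assumptions made for illustration only; they are not data from the experiments.</p>
      <preformat>
```python
# Hypothetical toy data: D[i][j][t] is the value of metric j of project i at time t.
METRICS = ["commits", "developers"]   # assumed metric names
D = [
    [[5, 7, 9], [2, 2, 3]],   # project 0
    [[4, 3, 1], [3, 2, 1]],   # project 1
]

def series_of_project(D, i):
    """i = const: one time series per metric of project i."""
    return {METRICS[j]: list(D[i][j]) for j in range(len(D[i]))}

def series_of_metric(D, j):
    """j = const: one time series per project for metric j."""
    return [list(project[j]) for project in D]

def snapshot(D, t):
    """t = const: a table (projects x metrics) of static values at time t."""
    return [[metric[t] for metric in project] for project in D]

print(series_of_metric(D, 0))   # commit series of every project
print(snapshot(D, 2))           # static metrics at the last time moment
```
      </preformat>
      <p>The slice with j = const is the one that is clustered in the rest of the paper.</p>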
      <p>The problem is to group software projects by temporal metrics (1) that have similar structure and
similar changes.</p>
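      <p>The clustering of software project metrics is then considered as the task of grouping the time
series <inline-formula><tex-math>d_{i,j=\mathrm{const},t}</tex-math></inline-formula> obtained from the repository <inline-formula><tex-math>D = \{d_{ijt}\}</tex-math></inline-formula> and of deriving k clusters <inline-formula><tex-math>C_k^{\mathrm{Dynamic}}</tex-math></inline-formula> that are similar in
time series behavior and in density:</p>
      <p><disp-formula><label>(2)</label><tex-math>D_{j=\mathrm{const}} \to C_k^{\mathrm{Dynamic}}</tex-math></disp-formula></p>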
      <p>In this paper, similarity in time series behavior is assessed using fuzzy sets in the form
of general tendencies from the term set {"fall", "stability", "growth", "fluctuation"}. Similarity in
time series density is determined using the Euclidean distance. The parameter k should be determined
from the data. In the rest of the paper, the task in the form (2) is referred to as dynamic clustering.</p>
      <p>The task (2) is a machine learning task, and its results are needed for a better understanding of
software development processes. The obtained clusters make it possible to extract from repositories information
about groups of objects, such as tasks, projects and developers, in the temporal space. This knowledge is
useful in software management for making decisions on improving the development process in the future.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The software metrics clustering using time series</title>
      <p>
        To cluster software projects in the form (2), we propose to apply the Fuzzy Behavior Clustering
(FBC) approach described in the work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For clustering, it uses the representation of time series at three
levels of their hierarchy (general trends, trend component, fluctuation component) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; the fuzzy trend
concept; the apparatus of fuzzy linguistic terms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; and a technique for extracting the main trend and the
trend of a time series, the F-transform [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In addition, it combines three time series clustering
methods: point-based, feature-based and model-based.
      </p>
      <p>The scheme of the proposed dynamic clustering in the form (2) is shown in figure 1. The input data
are extracted from the software repository D. An input record may consist of the commit hash, the
branch name, the login of the commit author and the commit date. The data are then pre-processed to obtain a set of
statistical project metrics from the repository: the number of commits, the number of branches, the
number of developers, the average time between commits, the project development time, etc. Pre-processing
also yields time series of these metrics by developers, years, months,
days and hours. The FBC-approach is then applied to the time series of these metrics. As a result, k clusters
<inline-formula><tex-math>C_k^{\mathrm{Dynamic}}</tex-math></inline-formula> are obtained as the output of dynamic clustering of software projects by temporal metrics.
In the case of i = const or j = const, <inline-formula><tex-math>d_{ijt}</tex-math></inline-formula> are time series of different dimensions, range and scope, and with
different trends of behavior.</p>
      <p>
        Below we briefly recall the FBC algorithm for time series clustering proposed in the paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
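      <p>Let i = const or j = const for <inline-formula><tex-math>D = \{d_{ijt}\}</tex-math></inline-formula>. Then we obtain the input data in the form of a set of
time series, which we denote <inline-formula><tex-math>Y = \{X_s\}</tex-math></inline-formula>, where <inline-formula><tex-math>X_s = d_{i=s,j,t}</tex-math></inline-formula> (s = 1..m) or <inline-formula><tex-math>X_s = d_{i,j=s,t}</tex-math></inline-formula> (s = 1..n). These
time series may have different dimensions, range, scope and trends of behavior. The output
of the FBC-approach is the clusters <inline-formula><tex-math>C_k^{\mathrm{Dynamic}}</tex-math></inline-formula> of time series <inline-formula><tex-math>X_s</tex-math></inline-formula> corresponding to software
development process metrics.</p>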
      <p>
        Since FBC-approach uses the representation of the behavior of time series at three levels of the
hierarchy (general trends, trend component, fluctuation components) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we define the set of linguistic
terms of the main time series trend as GT = {"fall", "stability", "growth", "fluctuation"}.
      </p>
      <p>The FBC-approach, as adapted to software metrics with respect to the task (2), includes
the following steps.</p>
      <p>
        First step. A transformation that assigns to each time series <inline-formula><tex-math>X_s</tex-math></inline-formula> a linguistic term of the general
tendency from the set of linguistic terms of the general time series trend GT = {"fall", "stability",
"growth", "fluctuation"}, according to the algorithm given in the work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
      </p>
      <p><disp-formula><label>(3)</label><tex-math>X_s \to gt_s, \quad gt_s \in GT, \quad X_s \in Y.</tex-math></disp-formula></p>
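      <p>The mapping of a time series to a term of GT can be illustrated by the following Python sketch. It is a stand-in heuristic, not the algorithm of [1]: the least-squares fit, the function name general_trend and the thresholds slope_tol and noise_tol are assumptions made for illustration.</p>
      <preformat>
```python
def general_trend(series, slope_tol=0.05, noise_tol=0.25):
    """Return "fall", "stability", "growth" or "fluctuation" for a numeric series.

    Stand-in heuristic (an assumption, not the algorithm of [1]): scale the
    values by their mean magnitude, fit a least-squares line over normalized
    time, then decide by the slope and by the size of the residuals.
    """
    n = len(series)
    if not n > 1:
        return "stability"
    scale = sum(abs(v) for v in series) / n
    if scale == 0:
        return "stability"
    ys = [v / scale for v in series]
    xs = [k / (n - 1) for k in range(n)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    resid = (sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5
    if resid > noise_tol:          # large scatter around the fitted line
        return "fluctuation"
    if slope > slope_tol:
        return "growth"
    if -slope > slope_tol:
        return "fall"
    return "stability"

print(general_trend([1, 2, 3, 4, 5]))
print(general_trend([1, 9, 1, 9, 1, 9]))
```
      </preformat>
      <p>Any labelling rule with the same output set GT could be substituted here; the later steps only rely on each series receiving exactly one term.</p>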
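      <p>Second step. Clustering of the time series by their general trends <inline-formula><tex-math>gt_s</tex-math></inline-formula>, using the
linguistic term of the general trend from the set GT = {"fall", "stability", "growth", "fluctuation"}.
In this case, the set of corresponding time series <inline-formula><tex-math>Y = \{X_s\}</tex-math></inline-formula> is divided into subsets, or clusters of main
trends:</p>
      <sec id="sec-4-1">
        <title><inline-formula><tex-math>Y = Y_{fall} \cup Y_{growth} \cup Y_{stab} \cup Y_{fluct}</tex-math></inline-formula> (4)</title>
        <p>Third step. The numerical clustering of the time series <inline-formula><tex-math>X_s</tex-math></inline-formula> from the clusters <inline-formula><tex-math>Y_{fall}, Y_{growth}, Y_{stab}, Y_{fluct}</tex-math></inline-formula>,
based on the transformation of each time series <inline-formula><tex-math>X_s \in Y</tex-math></inline-formula> into the parameter vector <inline-formula><tex-math>Z_s</tex-math></inline-formula>.</p>
        <p>A) Obtaining the parameter vector <inline-formula><tex-math>Z_s</tex-math></inline-formula> for each <inline-formula><tex-math>X_s \in Y</tex-math></inline-formula>. The values of each initial time series
<inline-formula><tex-math>X_s</tex-math></inline-formula> are grouped into <inline-formula><tex-math>N_1</tex-math></inline-formula> clusters. Then the barycentres of the clusters, which are considered as parameters of the time series, are
calculated. The time series of parameters <inline-formula><tex-math>Z_s</tex-math></inline-formula> is then built from the barycentres sorted in chronological
order. The set <inline-formula><tex-math>Z = \{Z_s\}</tex-math></inline-formula> of vectors composed of the obtained parameters is thereby divided into
clusters of main trends</p>
      </sec>
      <sec id="sec-4-2">
        <title><inline-formula><tex-math>Z = Z_{fall} \cup Z_{growth} \cup Z_{stab} \cup Z_{fluct}</tex-math></inline-formula> (5)</title>
        <p>corresponding to the original time series from the sets <inline-formula><tex-math>Y_{fall}, Y_{growth}, Y_{stab}, Y_{fluct}</tex-math></inline-formula>, in accordance with
the partition of the set Y into the subsets from step 2.</p>
        <p>B) Clustering of the vectors <inline-formula><tex-math>Z_s</tex-math></inline-formula> from the sets <inline-formula><tex-math>Z_{fall}, Z_{growth}, Z_{stab}, Z_{fluct}</tex-math></inline-formula> into <inline-formula><tex-math>N_2</tex-math></inline-formula> clusters. The output is the
clusters <inline-formula><tex-math>C_k^{\mathrm{Dynamic}}</tex-math></inline-formula>, whose elements are the time series <inline-formula><tex-math>X_s</tex-math></inline-formula>, where <inline-formula><tex-math>N_2 \le k \le 4 N_2</tex-math></inline-formula>. Note that the values of <inline-formula><tex-math>N_1</tex-math></inline-formula>
and <inline-formula><tex-math>N_2</tex-math></inline-formula> are set by the user. The required intra-cluster similarity is tuned through the <inline-formula><tex-math>N_2</tex-math></inline-formula> parameter: values <inline-formula><tex-math>N_2 \ge 8</tex-math></inline-formula>
correspond to the requirement of very high intra-cluster similarity of time series, up to full visual
correspondence at the level of local trends, whereas values <inline-formula><tex-math>N_2 = 0, 1, 2, \ldots, 4</tex-math></inline-formula> meet the requirement of low
intra-cluster similarity of time series.</p>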
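        <p>The third step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the grouping of points into the N1 groups is taken to be contiguous chunks of the series (the text does not fix the grouping method), both axes are min-max normalized, and single linkage is used as the numeric method. The function names parameter_vector and single_linkage are hypothetical.</p>
        <preformat>
```python
def parameter_vector(series, n1):
    """Z_s: barycentres (mean time, mean value) of n1 groups in chronological order.

    Assumptions: groups are contiguous chunks, len(series) is at least n1, and
    time and values are min-max normalized so that series of different length
    and range become comparable.
    """
    n = len(series)
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    z = []
    for g in range(n1):
        start, stop = g * n // n1, (g + 1) * n // n1
        chunk = [(k / max(n - 1, 1), (series[k] - lo) / span) for k in range(start, stop)]
        z.append(sum(t for t, _ in chunk) / len(chunk))
        z.append(sum(v for _, v in chunk) / len(chunk))
    return z

def dist(a, b):
    """Euclidean distance between two parameter vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(vectors, n2):
    """Merge the two closest clusters (single linkage) until n2 clusters remain."""
    clusters = [[s] for s in range(len(vectors))]
    while len(clusters) > n2:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(vectors[p], vectors[q]) for p in clusters[a] for q in clusters[b])
                if best is None or best[0] > d:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)      # b is always greater than a, so index a stays valid
        clusters[a].extend(merged)
    return clusters

# Hypothetical series of different length and range from one trend group.
group = [[1, 2, 3, 4], [10, 20, 30, 40, 50, 60], [1, 1, 1, 9]]
Z = [parameter_vector(x, 2) for x in group]
print(single_linkage(Z, 2))
```
        </preformat>
        <p>In this toy run the two linearly growing series fall into one cluster although their lengths and ranges differ, which is the property the parameter vectors are meant to provide.</p>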
        <p>The FBC clustering algorithm is highly modular: any numeric clustering method, including fuzzy ones, can be
used at the third step. The proposed solution of the clustering problem
for software development processes based on the FBC-approach makes it possible to cluster
time series of different dimensions, scope and range; it also increases
the information content of time series clustering in comparison with numerical methods, as it extracts
knowledge about the types of time series behavior.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments on clustering the software development process metrics</title>
      <p>
        The main purpose of the experimental study is to show the usefulness of the FBC-approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for clustering software development metrics. In the experiments, the numeric clustering of
time series within the FBC-approach is performed by hierarchical clustering [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with the
Ward method, the single-linkage method and the centroid method. Euclidean distances between the centers of
the clusters were used.
      </p>
      <p>Experiment 1. The goal of Experiment 1 was to obtain clusters of software development
processes by applying the FBC-approach to time series of the number of
commits for various projects and their versions from Git repositories, in order to identify homogeneous groups
of projects with similar dynamics and to assign to the groups linguistic assessments that can be used to plan and manage
software development processes.</p>
      <p>The experiment uses 142 time series of daily commits for various projects and their
versions from the Git repositories of 3 projects: MongoDB (10 years of development,
[https://github.com/mongodb/mongo.git]); Libvideo (3 years of development,
[https://github.com/i3arnon/libvideo.git]); ProjectUD (10 days of development,
[https://github.com/Anton7393/ProjectUD.git]).</p>
      <p>Commit records of the form (commit hash, name of the branch, author of the commit, date
of the commit), for example “2d700d9 | 1.8 | Anton 7393 | 2016-07-08 13:36:01 +0400 | ^”, were
obtained from the Git repositories. From these data, a set of static metrics for the projects from the Git
repositories was obtained by pre-processing: the number of commits, the number of branches, the
number of developers, the average time between commits, the time of project development, etc.
Pre-processing also yielded time series by employees, years, months, days and
hours. The time series extracted from the specified sources have different lengths and different main
trends.</p>
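      <p>The pre-processing of commit records into a time series of daily commits could look like the following Python sketch. The field order is taken from the example record above; the function name daily_commit_series and the two additional records are hypothetical.</p>
      <preformat>
```python
from collections import Counter
from datetime import datetime

def daily_commit_series(lines):
    """Turn 'hash | branch | author | date | ...' records into per-day commit counts.

    Days without commits are omitted here; filling them with zeros is left out
    of this sketch.
    """
    counts = Counter()
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4:
            stamp = datetime.strptime(fields[3], "%Y-%m-%d %H:%M:%S %z")
            counts[stamp.date().isoformat()] += 1
    days = sorted(counts)
    return days, [counts[d] for d in days]

records = [
    "2d700d9 | 1.8 | Anton 7393 | 2016-07-08 13:36:01 +0400 | ^",   # record quoted in the text
    "0000001 | 1.8 | Anton 7393 | 2016-07-08 15:10:00 +0400 | ^",   # hypothetical record
    "0000002 | 1.8 | Anton 7393 | 2016-07-10 09:00:00 +0400 | ^",   # hypothetical record
]
days, commits = daily_commit_series(records)
print(days, commits)
```
      </preformat>
      <p>Grouping by the author field instead of the date would give the per-developer series in the same way.</p>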
      <p>The first step of the FBC-approach separated the 142 time series of commits to
different projects into 4 groups according to their main trends: "chaos" with 63 time
series, "fluctuation" with 32 time series, "fall" with 33 time series and "growth" with 14 time series. At the
second step of the FBC-approach, each group of time series was further decomposed into a number of
clusters equal to one-third of the number of time series in the group. Clustering
was carried out for normalized time series using the centroid method, the single-linkage method
and the Ward method. Figure 2 shows examples of clusters of time series obtained with
the FBC-approach. The OX axis shows the normalized number of days, and the OY axis shows the
normalized number of commits per day.</p>
      <p>With the Ward method, the FBC-approach gave, within each cluster of trends, 3 to 4
clusters of non-unit size, examples of which are shown in figure 2, and several singleton clusters.
Namely, for the general trend "fall", 4 singleton clusters and 4 clusters with 7, 4, 3 and 15
time series were formed (figure 2a). In the cluster of "growth" trends, 3 clusters with 3, 4 and 6 time series were
found (figure 2b). The cluster of "chaos" trends consists of 15 singleton clusters and 3 clusters with 7, 9
and 32 time series (figure 2c). The cluster of "fluctuation" trends contains 4 singleton clusters and 4
clusters with 7, 6, 4 and 15 time series (figure 2d).</p>
      <p>Thus, from the time series of daily commits, the following knowledge about the change of software
projects according to the main tendency was obtained using the proposed clustering with the Ward method:
44% of the projects tend to "chaos",
23% of the projects have the trend "fluctuation",
23% of the projects are characterized by the trend "fall",
only 10% of the projects develop in the direction of "growth".</p>
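      <p>These shares follow from the group sizes reported for the first step of the FBC-approach, as the short Python check below shows (the variable names are chosen for illustration).</p>
      <preformat>
```python
# Group sizes reported for Experiment 1 (142 time series in total).
group_sizes = {"chaos": 63, "fluctuation": 32, "fall": 33, "growth": 14}
total = sum(group_sizes.values())
shares = {name: round(100 * size / total) for name, size in group_sizes.items()}
print(total, shares)
```
      </preformat>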
      <p>When the FBC-approach was used with the centroid method or the single-linkage method, one cluster
collected most of the time series, many singleton clusters were also obtained, and one or two
clusters collected 2 to 3 time series. As the number of clusters increases, singleton clusters continue to
split off from the main mass. That is, these methods are good for detecting abnormal time series far
from the main group. Singleton clusters require management attention as being abnormal. They can be
used to search for software projects with an atypical development process.</p>
      <p>Another finding is that the FBC-approach with the centroid method, the single-linkage method and
the Ward method formed different, partly intersecting clusters of time series. In the practice of clustering,
the numeric clustering method is usually selected to suit the researcher's task.
Thus, the Ward method is used to obtain spherical nested clusters, and the centroid method is
good for finding anomalies. In our Experiment 1, the single-linkage method gave better results when the
task was to identify time series that are similar up to compression, stretching, shifts and
symmetry of sections.</p>
      <p>Experiment 2. The goal of Experiment 2 is to obtain clusters of software development processes
by FBC-clustering of time series of the number of commits of individual
developers in the software projects from the above-mentioned Git repositories. We aimed to identify
homogeneous groups of developers with similar work dynamics and to give project teams
linguistic estimates that can be used for planning and managing software development processes.</p>
      <p>The data of the experiment are 34 developers and 34 time series, grouped into 11 clusters; the time
series have different lengths, equal to the developers' terms of participation in the projects. Each value of a time series is the number
of commits per day (the OY axis shows the normalized number of commits, and the OX axis the normalized
number of days).</p>
      <p>Figure 3 shows some results of FBC-clustering with the single-linkage method. Applying
FBC-clustering, which divides time series by behavior and by density, allows us to extract
the following knowledge:</p>
      <p>Five software developers decrease their activity: two clusters, C1 and C2, with a falling trend
contain 2 and 3 time series of software project metrics (see figure 3a).</p>
      <p>Five software developers increase their activity: two clusters with the general trend "growth"
(with 4 and 1 time series) were obtained (cluster C3 with 4 time series is depicted in
figure 3b).</p>
      <p>The activity of fifteen software developers is characterized by irregular fluctuations: four
clusters of the "chaos" trend with 9, 4, 1 and 1 time series were obtained (cluster C4 with 9 time series and
cluster C5 with 4 time series are depicted in figure 3c).</p>
      <p>Nine software developers are characterized by regular fluctuations: three clusters with the trend
"fluctuation" include 7, 1 and 1 time series (figure 3d shows the cluster C6 with 7 time
series).</p>
      <p>By assigning to software developers tasks suitable for their cluster, it is possible to stimulate
professional growth if a developer's parameters are below the cluster average or are
decreasing. If the parameters are higher than the cluster average, a development vector
towards another cluster can be formed. Clustering methods are also used to search for anomalous objects (in
this experiment, the singleton clusters), which, in particular, is suitable for "searching for
talents".</p>
      <p>Table 1 shows the inter-cluster distances (Euclidean distances between the centers of the clusters
were used).</p>
      <p>Table 1. Inter-cluster distances (only part of the table is recoverable): C1, C2; 707, 1148; 0, 860, 698, 736, 860, 0.</p>
      <p>Table 1 characterizes the mutual distance and scattering of cluster centers by data sets.</p>
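      <p>A matrix of inter-cluster distances of this kind can be computed as in the Python sketch below. It assumes that every time series has already been brought to a vector of equal length (for example, a parameter vector); the function names and the toy clusters are hypothetical and unrelated to the values of Table 1.</p>
      <preformat>
```python
def centre(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def intercluster_distances(clusters):
    """Symmetric matrix of Euclidean distances between cluster centres."""
    centres = [centre(c) for c in clusters]
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in centres]
            for p in centres]

# Hypothetical clusters of two-dimensional vectors.
toy_clusters = [[[0, 0], [2, 0]], [[4, 4], [4, 4]]]
print(intercluster_distances(toy_clusters))
```
      </preformat>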
      <p>The inter-cluster distances are not homogeneous, which suggests that, for close clusters,
similar sets of control actions may be appropriate. The first step of the FBC-approach, with its division by general
trends, made it possible to obtain clusters that differ only slightly in their inter-cluster distances but
differ in the general trend of behavior.</p>
      <p>
        The effectiveness of the FBC-clustering results was evaluated using the
Ball-Hall index and the Calinski-Harabasz index [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
Ball-Hall index and the Calinski-Harabasz index turned out to be 3 and 2 times better, respectively, for the FBC-approach,
in particular with the centroid method, than the corresponding indexes of point-wise clustering with the
centroid method applied to time series linearly interpolated to equal length. This
supports the quality of clustering by the FBC-approach.
      </p>
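      <p>For reference, both indexes can be computed as in the following Python sketch, which uses their usual definitions: the Ball-Hall index is the mean, over clusters, of the mean squared distance of the members to the cluster centre, and the Calinski-Harabasz index is the ratio of between-group to within-group dispersion, each divided by its degrees of freedom. Vectors of equal length, at least two clusters and a non-zero within-group dispersion are assumed; the function names and the toy data are hypothetical.</p>
      <preformat>
```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centre(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def ball_hall(clusters):
    """Mean over clusters of the mean squared distance of members to their centre."""
    per_cluster = [sum(sq_dist(v, centre(c)) for v in c) / len(c) for c in clusters]
    return sum(per_cluster) / len(clusters)

def calinski_harabasz(clusters):
    """(between-group dispersion / (K - 1)) / (within-group dispersion / (N - K))."""
    points = [v for c in clusters for v in c]
    n, k, g = len(points), len(clusters), centre(points)
    between = sum(len(c) * sq_dist(centre(c), g) for c in clusters)
    within = sum(sq_dist(v, centre(c)) for c in clusters for v in c)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical clusters of two-dimensional vectors.
toy = [[[0, 0], [2, 0]], [[10, 0], [12, 0]]]
print(ball_hall(toy), calinski_harabasz(toy))
```
      </preformat>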
      <p>The application of the FBC-approach to the analysis of software projects has shown the applicability of the
proposed grouping tools, which allow us to identify, on the basis of metrics from the software
repository, homogeneous groups of software projects with similar behavior of key metrics. The
FBC-approach can be used by software project managers, lead programmers and system administrators to
improve the quality of project planning based on the extracted information.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, the machine learning task of software project clustering is considered, described and
solved using the FBC-approach. The scheme and the adapted technique of the FBC-approach are provided for
clustering temporal software metrics, which may differ in length, behavior and range.
The advantage of the FBC-approach is that it allows us to identify, on the basis of data from the
repository, homogeneous groups with similar dynamics of key software metrics. The proposed
approach showed its capability and efficiency in clustering software
development processes according to repository metrics represented by time series. Clustering
software development processes by repository metrics represented as time series that describe
the state and behavior of a system will help to solve the tasks of improving software
development practice; planning; revealing numerical relationships between key metrics of the
development process and the quality and cost of the project; identifying where effort is spent and where defects occur; and
revealing the influence of technological, process and organizational aspects on the result. Grouping
dynamically changing key software development metrics allows us to identify groups of related
processes, assign labels to classes, use the results to search for anomalies and defects, and build
forecasts that can be used in making managerial decisions.</p>
      <p>Future work will focus on using domain knowledge in the form of a
domain ontology to produce linguistic summaries of the proposed machine learning results.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors acknowledge that this work was partially supported by the Russian Foundation for Basic
Research, projects № 16-07-00535 and № 16-47-730715.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Afanasieva</surname>
            <given-names>T</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sapunkov</surname>
            <given-names>A</given-names>
          </string-name>
          <article-title>Selection of Time Series Forecasting Model Using a Combination of Linguistic and Numerical Criteria</article-title>
          .
          <source>In Proc. of 2016 IEEE 10th International Conference on Application of Information and Communication Technologies (AICT)</source>
          <year>2016</year>
          pp
          <fpage>341</fpage>
          -
          <lpage>345</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Afanasieva</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yarushkina</surname>
            <given-names>N</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sibirev</surname>
            <given-names>I</given-names>
          </string-name>
          <article-title>Time Series Clustering using Numerical and Fuzzy Representations</article-title>
          .
          <source>In Proc. of Joint 17th World Congress of International Fuzzy Systems Association and 9th International Conference on Soft Computing and Intelligent Systems (IFSA-SCIS 2017)</source>
          , Otsu, Shiga, Japan, June 27-30,
          <year>2017</year>
          . 978-1-5090-4917-2/17
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bagnall</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lines</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bostrom</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Large</surname>
            <given-names>J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Keogh</surname>
            <given-names>E</given-names>
          </string-name>
          <article-title>The Great Time Series Classification Bake Off: A Review and Experimental Evaluation of Recently Proposed Algorithms</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <year>2016</year>
          pp
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bernard</surname>
            <given-names>D</given-names>
          </string-name>
          <article-title>Clustering Indices</article-title>
          . University Paris Ouest, Lab Modal'X, April
          <year>2013</year>
          , 34 p.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Budhkar</surname>
            ,
            <given-names>Sh.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gopal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Component identification from existing object oriented system using Hierarchical clustering</article-title>
          .
          <source>IOSR Journal of Engineering</source>
          , Vol.
          <volume>2</volume>
          (
          <issue>5</issue>
          ), May,
          <year>2012</year>
          , pp.
          <fpage>1064</fpage>
          -
          <lpage>1068</lpage>
          . - URL: http://www.iosrjen.org/Papers/vol2_issue5/X02510641068.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Jureczko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madeyski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Towards identifying software project clusters with respect to defect prediction</article-title>
          .
          <source>PROMISE</source>
          . Wrocław University of Technology, Poland,
          <year>2010</year>
          . - URL: http://madeyski.e-informatyka.pl/download/JureczkoMadeyski10f.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>T.W.</given-names>
          </string-name>
          <article-title>Clustering of time series data - a survey</article-title>
          .
          <source>Pattern Recognition</source>
          , Elsevier,
          <year>2005</year>
          , pp.
          <fpage>1857</fpage>
          -
          <lpage>1874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Makridakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wheelwright</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hyndman</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          <article-title>Forecasting methods and applications</article-title>
          . John Wiley &amp; Sons, Inc.,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mockus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>How to run empirical studies using project repositories</article-title>
          .
          <source>Avaya Labs</source>
          ,
          <year>2006</year>
          . - URL: http://www.research.avayalabs.com/user/audris
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nagappan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Mining Metrics to Predict Component Failures</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Software Engineering</source>
          , Shanghai, China, May 20-28,
          <year>2006</year>
          , ICSE'06. ACM Press, New York, NY, 2006, pp.
          <fpage>452</fpage>
          -
          <lpage>461</lpage>
          . - URL: http://doi.acm.org/10.1145/1134285.1134349
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Novák</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perfilieva</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dvorak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Insight into Fuzzy Modeling</article-title>
          . Wiley,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Purao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaishnavi</surname>
            ,
            <given-names>V. K.</given-names>
          </string-name>
          <article-title>Product metrics for object-oriented systems</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>35</volume>
          ,
          <issue>2</issue>
          , June 2003, pp.
          <fpage>191</fpage>
          -
          <lpage>221</lpage>
          . - URL: http://doi.acm.org/10.1145/857076.857090
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>B. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seetharamaiah</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Organizational Strategies and Social Interaction Influence in Software Development Effort Estimation</article-title>
          .
          <source>IOSR Journal of Computer Engineering (IOSR-JCE)</source>
          , Volume
          <volume>16</volume>
          , Issue
          <issue>2</issue>
          , Ver. XII, pp.
          <fpage>29</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Srinivasa</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radhakrishnab</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guru Rao</surname>
            ,
            <given-names>C.V.</given-names>
          </string-name>
          <article-title>Clustering and Classification of Software Component for Efficient Component Retrieval and Building Component Reuse Libraries</article-title>
          .
          <source>In Proc. of the 2nd International Conference on Information Technology and Quantitative Management (ITQM)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1044</fpage>
          -
          <lpage>1050</lpage>
          . - URL: https://pdfs.semanticscholar.org/67ac/2fc8979e84b4e57bac4ccab3af8e71f821ec.pdf
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Tunnell</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <article-title>Using Time Series Models for Defect Prediction in Software Release Planning</article-title>
          . Central Washington University, Electronic Theses Student Scholarship and Creative Works,
          <year>2015</year>
          . - URL: http://digitalcommons.cwu.edu/etd
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Wahyudin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biffl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>A framework for Defect Prediction in Specific Software Project Contexts</article-title>
          .
          <source>In Proc. of the 3rd IFIP Central and East European Conference on Software Engineering Techniques (CEE-SET2008)</source>
          , Brno, Czech Republic, October 13-15,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Weyuker</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostrand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>Adapting a Fault Prediction Model to Allow Widespread Usage</article-title>
          .
          <source>In Proc. of the International Workshop on Predictive Models in Software Engineering (PROMISE'08)</source>
          , Leipzig, Germany. May 12-13,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagappan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gall</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giger</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Cross-project Defect Prediction</article-title>
          .
          <source>In Proc. of the 7th joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE)</source>
          . Amsterdam, The Netherlands, August 24-28,
          <year>2009</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Zolhavarieh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aghabozorgi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teh</surname>
            ,
            <given-names>Ying Wah</given-names>
          </string-name>
          <article-title>A Review of Subsequence Time Series Clustering</article-title>
          .
          <source>Scientific World Journal</source>
          , Vol.
          <volume>2014</volume>
          , Article ID 312521,
          <year>2014</year>
          , 19 pp. - URL: http://dx.doi.org/10.1155/2014/312521.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>