=Paper=
{{Paper
|id=Vol-1730/p06
|storemode=property
|title=GitHubAnalyser: a Tool Detecting Class Correlations on Git Repositories
|pdfUrl=https://ceur-ws.org/Vol-1730/p06.pdf
|volume=Vol-1730
|authors=Gaetano Cammariere,Massimiliano Portelli,Placido Russo
|dblpUrl=https://dblp.org/rec/conf/system/CammarierePR16
}}
==GitHubAnalyser: a Tool Detecting Class Correlations on Git Repositories==
<pdf width="1500px">https://ceur-ws.org/Vol-1730/p06.pdf</pdf>
<pre>
                GitHubAnalyser: a Tool Detecting Class
                    Correlations on Git Repositories
                                  Gaetano Cammariere, Massimiliano Portelli, Placido Russo
                                        Department of Mathematics and Informatics
                                                   University of Catania
                                                      Catania, Italy
              Email: gaetano.cammariere@outlook.it, massimiliano.portelli@gmail.com, russo.placido@gmail.com


   Abstract: We have realised a tool, dubbed GitHubAnalyser, that performs data mining and
analysis of GitHub repositories in order to gather several statistics on Java classes. The
statistics aim at highlighting the correlation between classes, detected from the
simultaneous occurrence of changes in a repository. The tool has been developed using
MetricMiner2, a Java mining library, and MrJob, which uses Python with the MapReduce model
to carry out data analysis in a distributed and parallel manner.

                                   I. INTRODUCTION
   The proposed GitHubAnalyser tool aims at helping developers to extract three statistics
from a Java code repository: (i) the strongly related classes that happen to be modified at
the same time, (ii) how many times most of the classes (as a percentage) were modified
together, and (iii) for each class, how often it has been modified together with any other
class. Such metrics provide developers with a representation of the software system and can
point them to further analysis aimed at improving the modularity of the system, e.g. by
means of refactoring metrics and tools [1]–[9].
   GitHub is a hosting service for source code based on Git, a version control system for
software projects. It simplifies code sharing and collaboration among projects. The
fundamental unit of a repository is the commit, a set of related changes from which it is
possible to derive a representation of the state of the code at a given moment in time. To
mine data from a repository we use the Java framework MetricMiner2 [10], which helps
developers with the mining of software repositories. Using this framework we are able to
extract information about each commit, such as date and time of push, author, modified
files and the differences among the states of each file.
   Since the computation of the statistics becomes expensive as the quantity of code to
analyse increases, the computation can be executed in a distributed manner using Hadoop, an
Apache framework inspired by MapReduce that supports data-intensive distributed
applications [11], [12]. To exploit the advantages of the Python language, we use the MrJob
toolkit, which helps to develop Hadoop programs and test them locally.
   Finally, with GnuPlot [13], the data obtained from the computation are visualised in
human readable graphs.

   Copyright © 2016 held by the authors.

                               II. GATHERED STATISTICS
   Through the analysis of a repository, the proposed tool produces as output three
statistics which help developers with their job. The parameters needed for each statistic
can be configured in a settings file, “settings.ini”, which contains all the parameters
necessary for the execution of the tool. The full list of parameters is specified in
Section II-A. The three statistics are explained below.
   The first statistic produces as output a file, “output1.tsv”, listing the classes
modified during a given time range, together with the date and time of their commit. The
time range is set through two parameters which correspond to the two instants that delimit
the range. The parameters in the file “settings.ini” are first statistic time inf and
first statistic time sup, in the format dd/mm/yyyy - hh:mm.
   The second statistic produces as output a file, “output2.tsv”, listing for each commit
the percentage of modified classes, subject to a minimum percentage threshold. The modified
classes are the classes that have been modified in the same commit, at the same time, and
the percentage is relative to the total number of classes present in the repository at that
instant. The threshold is used to show only the commits whose percentage is above the
threshold. The threshold must be set in the file “settings.ini” as parameter
second statistic perc.
   The third statistic produces as output a file, “output3.tsv”, containing the matrix of
classes with their frequency of changes, i.e. for each class in the repository, it shows
the number of times that it has been modified together with each other class. This
statistic has no input parameter; however, two parameters are later used for the
visualisation of the related graphs, because the repository could be very large, with many
classes.

A. Useful Settings
   The tool is configured through the file “settings.ini”, which allows the setting of
input parameters in the form “key=value”. The list of parameters is as follows.
  1) repository path: the location, a local folder or an http/https address, of the
     repository;
  2) branch name: the name of the branch of the repository to analyse. The default value
     is “master”, and it is also possible to choose more than one branch to analyse by
     separating the names with a comma;
  3) first statistic time inf: the lower limit used in the first statistic. The format is
     dd/mm/yyyy - hh:mm;
  4) first statistic time sup: the upper limit used in the first statistic. The format is
     dd/mm/yyyy - hh:mm;
  5) second statistic perc: the percentage threshold used in the second statistic. The
     default value is “0”, which shows all the modified classes in the commits;
  6) third statistic n classes: the number n of graphs to display for the third statistic.
     Accordingly, only the n most modified classes will be shown in the graphs. The
     default n value is “10”;
  7) third statistic threshold: the modifications threshold used to display the graphs for
     the third statistic. It allows us to discard the classes whose number of modifications
     is under the threshold. The default value is “5”.
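
   The settings listed above could be read as in the following minimal sketch. It is an
illustration only, not the tool's own code: the underscore spelling of the keys, the
example values, the hypothetical load_settings helper and the strptime pattern used for
the date format are all assumptions.

```python
import configparser
from datetime import datetime

# Assumed rendering of the "dd/mm/yyyy - hh:mm" format as a strptime pattern.
DATE_FORMAT = "%d/%m/%Y - %H:%M"

# Hypothetical settings.ini content; key spellings and values are assumptions.
EXAMPLE_SETTINGS = """\
repository_path=https://github.com/wavefrontHQ/java
branch_name=master
first_statistic_time_inf=01/01/2015 - 00:00
first_statistic_time_sup=31/12/2015 - 23:59
second_statistic_perc=0
third_statistic_n_classes=10
third_statistic_threshold=5
"""


def load_settings(text):
    """Parse plain key=value lines and apply the defaults given in the list above."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    # configparser needs a section header, so a dummy one is prepended.
    parser.read_string("[settings]\n" + text)
    section = parser["settings"]
    return {
        "repository_path": section["repository_path"],
        # Several branches may be given, separated by commas.
        "branches": [b.strip() for b in section.get("branch_name", "master").split(",")],
        "time_inf": datetime.strptime(section["first_statistic_time_inf"], DATE_FORMAT),
        "time_sup": datetime.strptime(section["first_statistic_time_sup"], DATE_FORMAT),
        "perc": section.getfloat("second_statistic_perc", fallback=0.0),
        "n_classes": section.getint("third_statistic_n_classes", fallback=10),
        "threshold": section.getint("third_statistic_threshold", fallback=5),
    }


settings = load_settings(EXAMPLE_SETTINGS)
print(settings["branches"], settings["time_inf"], settings["threshold"])
```

   Restricting the delimiter to “=” keeps the colons in URLs and times inside the value.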

                   III. BASIC CONCEPTS FOR PROCESSING REPOSITORIES
   The development of the proposed tool involved several programming languages combined
together in a pipeline by a bash script. Figure 1 shows the essential flow of execution.
For the mining of the repository we use a Java program, based on the Java MetricMiner
library, whereas for computing the statistics we use Python, with the MrJob toolkit, and a
Gnuplot script for the visualisation of the results.

                           Fig. 1. GitHubAnalyser workflow

A. Mining and pre-processing in Java
   The first phase consists of the download of the Java repository and the analysis of the
metadata provided by GitHub. Because MetricMiner2 needs a local copy of the repository, we
need to clone the online repository before the pre-processing. This is done using the
“JGit” library, which clones the source code using the Git API.
   Using GitHub’s metadata we can find the line and the name of the file that contains a
modification, and then all the modified classes in that file. MetricMiner2 analyses the
repository’s metadata, subdividing them by commit. For each commit we obtain the timestamp
and a list of “Modification” entries. Every “Modification” represents a single file of the
project with the updated source code and other information such as the added rows and the
removed rows. MetricMiner2 builds a tree where a single node represents a Java class with
its methods. The Java parser inside MetricMiner2 gives the line number of the beginning of
a class, and we also calculate the line number of its end. With these limits for each class
and the line numbers of the modified rows derived from a “Modification”, we can find the
classes that contain at least one modification.
   From a single commit, we can obtain only the information about the modifications with
respect to the previous commit. On account of this, we need to create two key/value maps:
one for the commit currently being analysed and one for the previous commit. With the two
maps we can get all the information about the state of the repository during the project
lifecycle. Each map contains the information related to the commit for all the Java files
in the repository, where the key is the path of the file and the value is the list of
classes in the file, each with a 0 or 1 flag indicating whether the class was modified in
the commit. The analysis starts from the first commit in the repository, in time order,
with the structure that will be filled with the list of modified classes.
   The final result of the pre-processing is the entire history of the repository’s
lifecycle subdivided by commit, where for every commit we have its timestamp, all the
classes present at that timestamp and, for each class, a 0 or 1 flag indicating whether it
was modified. These data are aggregated in a file called “commitsLog.tsv” and are processed
in the next phase.
   The pre-processing produces another output file, called “classesList.txt”, that contains
the list of all classes present in the repository from the initial creation to the last
commit. This file is represented in Figure 2.

                       Fig. 2. Graphic representation of a map

   During the pre-processing some borderline cases are encountered and addressed, as
follows.
   • If a commit has more than 200 modified files, MetricMiner2 cannot process it, and then
     throws an exception and discards the file from the analysis.
   • If the difference file for a single Java file is longer than a threshold (set by
     MetricMiner2 to 10000 characters), the file content is replaced by the string
     “TOO BIG” and the modification cannot be analysed. In this case we choose to consider
     all the classes in the file as modified.
   GitHub also imposes some limitations on difference files (diff):
   • a diff file cannot have more than 1500 lines or more than 100 KB of raw data;
   • the maximum number of files in a single diff is 300;
   • the total size of the diff files in all files of a view cannot exceed 10000 lines or
     1 MB.
   By means of the said Java-based pre-processing we obtain the data from the repository
that we will use in the Python distributed processing. The data obtained are aggregated in
two files:
   • commitsLog.tsv, which contains the history of the repository subdivided by commit,
     with the related timestamp, the related classes and, for each class, a zero or a one
     indicating whether it has been modified;
   • classesList.txt, which contains the list of all classes present in the repository
     starting from its creation.

B. Computing Statistics
   After the data have been extracted from a GitHub repository, they are processed in order
to obtain statistics about the changes made to the source code, specifically changes to
Java classes.
   Three Python scripts compute the three statistics. Each script consists of the
implementation of a class derived from MrJob, with two methods inside: mapper and reducer.
These methods are essential in order to define the behaviour according to the MapReduce
paradigm.
   Below is the implementation logic of the two methods for each statistic:
   1) mapper: it creates a map that has as key the “timestamp” of a commit and as value
      the “name” of a class that has been modified in the commit;
      reducer: it returns the pairs “date” of the commit, “list” of classes changed in the
      commit;
   2) mapper: it creates a map that has as key the “timestamp” of a commit and as value
      the “name” of a class present in the source code at the time of the commit, along
      with a “binary value” indicating whether the class changed in the commit;
      reducer: it returns the pairs “date” of the commit, “percentage” of classes modified
      in the commit, along with the “list” of changed classes;
   3) mapper: given a reference class (which will be one of the classes of the
      repository), it gives as key the “name” of a class modified in the same commit as
      the reference class and as value “1”;
      reducer: it returns the pairs “name” of a class, “number” of times that the class
      has been modified along with the reference class.
   The three scripts produce three output files, called “output1.tsv”, “output2.tsv” and
“output3.tsv”, one for each of the three statistics.
   Since the third statistic produces, for each class, an output file containing the rate
of change of the other classes with respect to the analysed class, another Python script
generates a square matrix having rows and columns for the names of all classes present in
the repository. Each cell (i, j) indicates the number of times that the i-th class has been
changed simultaneously with the j-th class.
   This matrix is further processed by another Python script to create two other output
files (“filtered classes matrix.tsv” and “most modified list.txt”), which respectively
contain the matrix of classes filtered according to a minimum frequency threshold and the
list of the most modified classes.
   The statistical processing is structured to take place in a parallel and distributed
manner by using MrJob. In fact, the whole calculation is designed to run on multiple
machines, after proper configuration of a Hadoop cluster.

C. Results visualisation
   The result of the calculation of the three statistics is a set of output files in tsv
(tab-separated values) format that summarise the results of the tool. Furthermore, some
graphs are produced to help the user understand the results. This is the list of files
produced:
   • output1.tsv: it refers to the first statistic and shows the time and date of each
     commit in the selected time range and the list of classes modified in these commits.
   • output2.tsv: it refers to the second statistic and shows the commits in which the
     percentage of modified classes is over the input threshold. For each of these commits,
     identified by time and date, it shows the percentage of modified classes and the list
     of these classes.
   • output3.tsv: it refers to the third statistic and shows the table of all the classes
     of the repository. Each position (x, y) represents how many times the classes x and y
     have been modified together.
   • filtered classes matrix.tsv: it is a table of filtered classes that shows only the
     significant values of the output3.tsv file, i.e. the values that are over the input
     threshold set in the settings file.
   • most modified list.txt: it is the list of the n most modified classes in the
     repository, with the value of n set in the settings file.
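
   The counting behind output3.tsv and its filtered table can be sketched in plain Python
as below. This is a stand-in under stated assumptions, not the tool's MrJob scripts: the
generator functions only mimic the mapper/reducer pattern of emitting key/value pairs, the
in-memory commit records are hypothetical (the layout of commitsLog.tsv is not reproduced),
a flag of 1 is assumed to mean “modified”, the diagonal is assumed to hold the overall
number of changes of a class, and the filter is assumed to keep values equal to or above
the threshold.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical commit records: (timestamp, {class name: 1 if modified, else 0}).
COMMITS = [
    ("2015-11-02 10:00", {"A": 1, "B": 1, "C": 0}),
    ("2015-11-03 09:30", {"A": 1, "B": 0, "C": 0}),
    ("2015-11-05 17:45", {"A": 1, "B": 1, "C": 1}),
]


def mapper(timestamp, classes):
    """Emit ((reference, other), 1) for every pair of classes changed in one commit."""
    modified = sorted(name for name, flag in classes.items() if flag == 1)
    for name in modified:
        yield (name, name), 1  # diagonal cell: the class itself was changed
    for reference, other in permutations(modified, 2):
        yield (reference, other), 1


def reducer(pair, counts):
    """Sum the ones emitted for a pair, giving its number of joint modifications."""
    yield pair, sum(counts)


def run_local(commits):
    """Group mapper output by key and reduce it, as a MapReduce runner would."""
    grouped = defaultdict(list)
    for timestamp, classes in commits:
        for key, value in mapper(timestamp, classes):
            grouped[key].append(value)
    matrix = {}
    for key, values in grouped.items():
        for pair, total in reducer(key, values):
            matrix[pair] = total
    return matrix


def filter_matrix(matrix, threshold):
    """Keep only the cells whose count reaches the threshold (assumed inclusive)."""
    return {pair: n for pair, n in matrix.items() if n >= threshold}


matrix = run_local(COMMITS)
print(matrix[("A", "B")], matrix[("A", "A")])
print(sorted(filter_matrix(matrix, 2)))
```

   Emitting both (x, y) and (y, x) keeps the matrix symmetric, so a row and the
corresponding column carry the same values.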
   Finally, the execution of the graphs.sh script produces these three sets of graphs:
   • A graph that shows, only for the commits that exceed the percentage of modification
     set in the settings file, the percentage of modified classes compared to the total
     number of classes present in the repository at the time of that commit.
   • A set of n graphs, with n set in the settings file, that show the frequencies of
     modifications associated with the first n classes by total number of modifications.
   • A heat map that shows the number of concurrent modifications for classes, filtered by
     a specific value in the settings file.
   Figure 3 shows the results of the analysis of the wavefrontHQ/java repository [14], a
repository chosen for the testing phase.

Fig. 3. Graphs produced by the analysis of wavefrontHQ/java repository. The left top graph
is the result of the second statistic. The left bottom graph is the result, the heat map,
of the third statistic. The other graphs are the 10 most modified classes of the
repository.

                                     IV. TESTING
   For the testing phase we used a machine with the following hardware configuration:
   • CPU: Intel Core i7-4510U @ 2.00GHz x 4
   • RAM: 8GB
   • OS: Ubuntu 16.04 64-bit
   The repositories of Java code chosen for testing are listed in Table I.

   Repository                  Number of commits    Last commit date    Number of classes    Execution time
   wavefrontHQ/java [14]                     181               13.65                   80           10.430s
   java-design-patterns [15]                1322               92.50                 1029          555.537s
   okhttp [16]                              2535               33.33                  583          462.374s
   RxJava [17]                              4715                8.99                 2059         7189.825s
                                              TABLE I
                                   ANALYSED REPOSITORIES STATS.

   The results obtained from the analysis of the repositories, namely the execution time of
the tool and the total number of classes of each repository, are summarised in the graph of
Figure 4.

Fig. 4. Graph with the execution times in relation to the total number of classes of each
repository analysed.

   For a better understanding of the results that the tool produces, we have prepared some
graphs that show the results produced during the testing phase on the repository
wavefrontHQ/java.

Fig. 5. Graph relating to the history of commits of the repository wavefrontHQ/java.

   Figure 5 shows the time history of the commits of the repository that exceed a
threshold percentage of classes modified in each commit; in this case the threshold is set
to 0%. The graph shows that after the first modification of 100% of the classes, which
corresponds to the first upload of classes to the repository, 40% of the classes were
changed on one occasion, in a commit dated November 2015, and from that moment forward no
commit has modified more than 20% of the classes.
   Figure 6 shows that the class “GraphiteDecoder”, edited nine times during the commits to
the repository, is strongly correlated to the class “OpenTSDBDecoder”, which has been
changed as often as the first. In addition, there is a slightly weaker correlation with
the classes “AbstractAgent” and “PushAgent”, and gradually weaker correlations with the
other classes. This type of chart is produced for the most frequently changed classes of
the repository.

Fig. 6. Graph of the class “GraphiteDecoder” in the repository wavefrontHQ/java.

   Figure 7 gives a heatmap of modified classes in relation to other classes. For a better
visualisation, only the subset of classes for which the number of changes exceeds a certain
threshold is shown. The values on the diagonal represent the overall number of changes of
each class, and the values in the corresponding row (or, equivalently, column) are the
numbers of changes of the other classes in the commits in which the first class has been
modified. E.g. the row, or column, for class “AbstractAgent” lets us see that it has been
changed 35 times (colour yellow) and that along with it the class “PushAgent” was changed
about 20 times (colour close to red), while all other classes have been changed fewer than
10 times (colours from violet to black).

Fig. 7. Heatmap of the number of classes modifications in the repository wavefrontHQ/java.

                                    V. CONCLUSION
   This paper has described the development of a tool that taps into GitHub repositories
and extracts from them some relevant statistics on the commits affecting the classes of a
system. The statistics allow us to gain an insight into the software system, such as which
classes are strongly related to each other, since they were modified at the same time most
of the time, and how many times significant modifications affected most of the code in the
repository.
   This work can be considered a preliminary, but essential, step towards developing
further useful metrics, describing the work of the developers, or suggesting to developers
which parts of the system could be improved, e.g. by means of refactoring techniques.

                                     REFERENCES
 [1] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts, Refactoring: Improving the
     Design of Existing Code. Addison-Wesley, 1999.
 [2] J. Kerievsky, Refactoring to patterns. Addison-Wesley, 2005.
 [3] R. Giunta, G. Pappalardo, and E. Tramontana, “Superimposing roles for design patterns
     into application classes by means of aspects,” in Proceedings of ACM Symposium on
     Applied Computing (SAC). Riva del Garda, Italy: ACM, March 2012, pp. 1866–1868.
 [4] C. Napoli, G. Pappalardo, and E. Tramontana, “Using modularity metrics to assist move
     method refactoring of large systems,” in Complex, Intelligent, and Software Intensive
     Systems (CISIS), 2013 Seventh International Conference on. IEEE, 2013, pp. 529–534.
 [5] R. Giunta, G. Pappalardo, and E. Tramontana, “Aspects and annotations for controlling
     the roles application classes play for design patterns,” in Proceedings of IEEE Asia
     Pacific Software Engineering Conference (APSEC), Ho Chi Minh, Vietnam, December 2011,
     pp. 306–314.
 [6] A. Calvagna and E. Tramontana, “Delivering dependable reusable components by
     expressing and enforcing design decisions,” in Proceedings of IEEE Computer Software
     and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013,
     pp. 493–498.
 [7] R. Giunta, G. Pappalardo, and E. Tramontana, “Using Aspects and Annotations to
     Separate Application Code from Design Patterns,” in Proceedings of Symposium on
     Applied Computing (SAC). ACM, 2010, pp. 2183–2189.
 [8] S. Cicciarella, C. Napoli, and E. Tramontana, “Searching design patterns fast by using
     tree traversals,” International Journal of Electronics and Telecommunications,
     vol. 61, no. 4, pp. 321–326, 2015.
 [9] E. Tramontana, “A design pattern for improving the performances of a distributed
     access control mechanism,” in Proceedings of AsianPlop, Taipei, Taiwan, February 2016.
[10] F. Sokol, M. Zigmund, F. Aniche, and M. Gerosa, “Metricminer: Supporting researchers
     in mining software repositories,” in IEEE International Working Conference on Source
     Code Analysis and Manipulation (SCAM), 2013.
[11] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,”
     Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[12] C. Napoli, E. Tramontana, and G. Verga, “Extracting location names from unstructured
     Italian texts using grammar rules and MapReduce,” in Proceedings of the International
     Conference on Information and Software Technologies (ICIST), 2016, pp. 593–601.
[13] T. Williams and L. Hecking, “Gnuplot,” 2003.
[14] “Wavefront,” 2016. [Online]. Available: https://github.com/wavefrontHQ/java
[15] I. Seppala, 2016. [Online]. Available: https://github.com/iluwatar/java-design-patterns
[16] “Square,” 2016. [Online]. Available: https://github.com/square/okhttp
[17] “Reactivex,” 2016. [Online]. Available: https://github.com/ReactiveX/RxJava

</pre>