=Paper=
{{Paper
|id=Vol-1730/p06
|storemode=property
|title=GitHubAnalyser: a Tool Detecting Class Correlations on Git Repositories
|pdfUrl=https://ceur-ws.org/Vol-1730/p06.pdf
|volume=Vol-1730
|authors=Gaetano Cammariere,Massimiliano Portelli,Placido Russo
|dblpUrl=https://dblp.org/rec/conf/system/CammarierePR16
}}
==GitHubAnalyser: a Tool Detecting Class Correlations on Git Repositories==
Gaetano Cammariere, Massimiliano Portelli, Placido Russo

Department of Mathematics and Informatics, University of Catania, Catania, Italy

Email: gaetano.cammariere@outlook.it, massimiliano.portelli@gmail.com, russo.placido@gmail.com

Abstract: We have realised a tool, dubbed GitHubAnalyser, performing data mining and analysis of GitHub repositories in order to gain several statistics on Java classes. The sought statistics aim at highlighting the correlation between classes, detected from the simultaneous occurrence of changes on a repository. The tool has been developed using MetricMiner2, a Java mining library, and MrJob, which uses Python with the MapReduce model to compute data analysis in a distributed and parallel manner.

Copyright © 2016 held by the authors.

===I. Introduction===
The proposed GitHubAnalyser tool aims at helping developers to extract three statistics from a Java code repository: (i) the strongly related classes that happen to be modified at the same time, (ii) how many times most of the classes (as a percentage) were modified together, and (iii) how often each class has been modified together with any other class. Such metrics provide the developers with a representation of the software system and can point them to further analysis aiming at improving the modularity of the system, e.g. by means of refactoring metrics and tools [1]–[9].

GitHub is a hosting service for source code based on Git, a version control system for software projects. It simplifies code sharing and collaboration among projects. The fundamental unit of a repository is the commit, a set of related changes in a repository from which it is possible to derive a representation of the code state at a given moment in time. To mine data from a repository we use the Java framework MetricMiner2 [10], which helps developers with the mining of software repositories. Using this framework we are able to extract information about each commit, such as date and time of push, author, modified files and the differences among the states of each file.

Since the computation of the statistics becomes expensive as the quantity of code to analyse increases, the computation can be executed in a distributed manner using Hadoop, an Apache framework inspired by MapReduce that supports distributed applications with high access to data [11], [12]. To use all the advantages of the Python language, we use the MrJob toolkit, which helps to develop Hadoop programs and test them locally.

Finally, with GnuPlot [13], the data obtained from the computation are visualised in human readable graphs.

===II. Gathered Statistics===
Through the analysis of a repository, the proposed tool is able to produce as output three statistics which help the developers with their job. The parameters needed for each statistic can be configured in a specific settings file, “settings.ini”, which contains all the necessary parameters for the execution of the tool. The full list of parameters is specified in Section II-A. The three statistics are explained below.

The first statistic produces as output a file, “output1.tsv”, having the list of the classes modified during a given time range, together with the date and time of their commit. The temporal range is set through two parameters which correspond to the two temporal instants that are the limits of the range. The parameters in the file “settings.ini” are first_statistic_time_inf and first_statistic_time_sup, in the format dd/mm/yyyy - hh:mm.

The second statistic produces as output a file, “output2.tsv”, having the percentage of modified classes for each commit, filtered by a minimum percentage threshold. The modified classes are the classes that have been modified in the same commit, at the same time, and the percentage is relative to the total number of classes present in the repository at that temporal instant. The threshold is used to show only the commits whose percentage is above the threshold. The threshold needs to be set in the file “settings.ini” as the parameter second_statistic_perc.

The third statistic produces as output a file, “output3.tsv”, having the matrix of classes with their frequency of changes, i.e. for each class in the repository, it shows the number of times that it has been modified together with another class. This statistic has no input setting parameter; however, two parameters are later used for the visualisation of the related graphs, because the repository could be very large, with many classes.

====A. Useful Settings====
The tool configuration is given by the file “settings.ini”, which allows the setting of input parameters in the form “key=value”. The list of parameters is as follows.

# repository_path: the location, a local folder or an http/https address, of the repository;
# branch_name: the branch name of the repository to analyse. The default value is “master”, and it is also possible to choose more than one branch to analyse by separating the names with a comma;
# first_statistic_time_inf: the lower limit used in the first statistic. The format is dd/mm/yyyy - hh:mm;
# first_statistic_time_sup: the upper limit used in the first statistic. The format is dd/mm/yyyy - hh:mm;
# second_statistic_perc: the percentage threshold used in the second statistic. The default value is “0”, which shows all the modified classes in the commits;
# third_statistic_n_classes: the number n of graphs to display for the third statistic. Accordingly, only the most expressive n classes will be shown in the graphs. The default value of n is “10”;
# third_statistic_threshold: the modifications threshold used to display the graphs for the third statistic. It allows us to discard the classes whose number of modifications is under the threshold. The default value is “5”.
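The settings and the three statistics described in this section can be illustrated with a small sketch in plain Python. It is not the tool's implementation and does not use MrJob or Hadoop. The function names, the in-memory commit representation (a timestamp paired with a dictionary of class flags), the underscored key spellings, the concrete date pattern, and the reading of flag 1 as "modified" are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime
from itertools import combinations

# Assumed concrete rendering of the "dd/mm/yyyy - hh:mm" format.
DATE_FORMAT = "%d/%m/%Y - %H:%M"

# Defaults named in the settings list; key spellings are assumptions.
DEFAULTS = {
    "branch_name": "master",
    "second_statistic_perc": "0",
    "third_statistic_n_classes": "10",
    "third_statistic_threshold": "5",
}


def parse_settings(text):
    """Read 'key=value' lines into a dict on top of the defaults."""
    settings = dict(DEFAULTS)
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def modified(classes):
    """Names of the classes flagged 1 (assumed to mean 'modified')."""
    return sorted(name for name, flag in classes.items() if flag == 1)


def first_statistic(commits, settings):
    """Modified classes of each commit inside the configured time range."""
    low = datetime.strptime(settings["first_statistic_time_inf"], DATE_FORMAT)
    high = datetime.strptime(settings["first_statistic_time_sup"], DATE_FORMAT)
    return [(ts, modified(cl)) for ts, cl in commits if low <= ts <= high]


def second_statistic(commits, settings):
    """Commits whose share of modified classes is strictly above the threshold."""
    threshold = float(settings["second_statistic_perc"])
    rows = []
    for ts, classes in commits:
        if not classes:
            continue  # no classes present, so no percentage to compute
        changed = modified(classes)
        percentage = 100.0 * len(changed) / len(classes)
        if percentage > threshold:
            rows.append((ts, percentage, changed))
    return rows


def third_statistic(commits):
    """Count, for every pair of classes, the commits that changed both."""
    counts = Counter()
    for _, classes in commits:
        changed = modified(classes)
        for name in changed:
            counts[(name, name)] += 1  # diagonal: total changes of a class
        for a, b in combinations(changed, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


if __name__ == "__main__":
    demo_settings = parse_settings(
        "first_statistic_time_inf=01/01/2016 - 00:00\n"
        "first_statistic_time_sup=28/02/2016 - 23:59\n"
        "second_statistic_perc=50\n"
    )
    demo_commits = [
        (datetime(2016, 1, 10, 9, 0), {"A": 1, "B": 1, "C": 0}),
        (datetime(2016, 2, 5, 14, 30), {"A": 1, "B": 0, "C": 0, "D": 0}),
    ]
    print(first_statistic(demo_commits, demo_settings))
    print(second_statistic(demo_commits, demo_settings))
    print(third_statistic(demo_commits))
```

The comparison in the second statistic is strict, following the wording "above the threshold"; with the default threshold of 0 this keeps every commit that modified at least one class.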
===III. Basic Concepts for Processing Repositories===
The development of the proposed tool involved several programming languages combined together in a pipeline with a bash script. Figure 1 shows the essential flow of execution. For the mining of the repository we use a Java program, based on the Java MetricMiner library, whereas for computing the statistics we use Python with the MrJob toolkit, and a Gnuplot script for the visualisation of the results.

Fig. 1. GitHubAnalyser workflow.

====A. Mining and pre-processing in Java====
The first phase consists of the download of the Java repository and the analysis of the metadata provided by GitHub. Because MetricMiner2 needs a local copy of the repository, before the pre-processing we need to clone the online repository. This is obtained using the “JGit” library, which clones the source code using the Git API.

Using GitHub's metadata we can find the line and the name of the file that contains a modification, and then all the modified classes in that file. MetricMiner2 analyses the repository's metadata, subdividing them by commit. For each commit we obtain the timestamp and a list of “Modification” objects. Every “Modification” represents a single file of the project, with the updated source code and other information such as the added rows and the removed rows. MetricMiner2 builds a tree where a single node represents a Java class with its methods. The Java parser inside MetricMiner2 gives the line number of the beginning of a class, and we also calculate the line number of its end. With these limits for each class and the line numbers of the modified rows derived from a “Modification”, we can find the classes that contain at least one modification.

With a single commit, we can obtain only the information about the modifications compared to the previous commit. On account of this, we need to create two key/value maps, one for the commit currently being analysed and one for the previous commit. With the two maps we can get all the information about the state of the repository during the project lifecycle. Each map contains the information related to a commit for all the Java files in the repository, where the key is the path of the file and the value is the list of classes in the file, each with a flag, 0 or 1, indicating whether the class was modified in the commit. The analysis starts from the first commit in the repository, in time order, with the structure that will be filled with the list of modified classes.

The final result of the pre-processing is the entire history of the repository's lifecycle subdivided by commit, where for every commit we have its timestamp, all the classes present at that timestamp and, for each class, a 0 or 1 indicating whether it was modified. These data are aggregated in a file called “commitsLog.tsv” and are processed in the next phase.

The pre-processing produces another output file, called “classesList.txt”, that contains the list of all classes present in the repository from the initial creation to the last commit. This file is represented in Figure 2.

Fig. 2. Graphic representation of a map.

During the pre-processing some borderline cases are encountered and addressed, as follows.

* If a commit has more than 200 modified files, MetricMiner2 cannot process it: it throws an exception and discards the file from the analysis.
* If a single difference file for a single Java file is longer than a threshold (set by MetricMiner2 as 10000 characters), the file content is replaced by the string “TOO BIG” and the modification cannot be analysed. For this case we choose to consider all the classes in the file as modified.

GitHub also presents some limitations on difference files (diff):

* a diff file cannot have more than 1500 lines or more than 100 KB of raw data;
* the maximum number of files in a single diff is 300;
* the total size of the diff files in all files of a view cannot exceed 10000 lines or 1 MB.

By means of the said Java based pre-processing we obtain data from the repository that we will use in the Python distributed processing. The data obtained are aggregated in two files:

* commitsLog.tsv, which contains the history of the repository subdivided by commit, with the relative timestamp, the relative classes and, for each class, a zero or a one indicating whether it has been modified;
* classesList.txt, which contains the list of all classes present in the repository starting from its creation.

====B. Computing Statistics====
After the data have been extracted from a GitHub repository, they are processed in order to have statistics about the changes made to the source code, specifically on changes to Java classes.

Three Python scripts obtain the three statistics. Each script is structured as the implementation of a class derived from MrJob, with two methods inside: mapper and reducer. These methods are essential in order to define the behaviour according to the MapReduce paradigm.

Below is the implementation logic of the two methods for each statistic:

# mapper: it creates a map that has as key the “timestamp” of a commit and as value the “name” of a class that has been modified in the commit; reducer: it returns the pairs “date” of the commit, “list” of classes changed in the commit;
# mapper: it creates a map that has as key the “timestamp” of a commit and as value the “name” of a class present in the source code at the time of the commit, along with a “binary value” indicating whether the class changed in the commit; reducer: it returns the pairs “date” of the commit, “percentage” of classes modified in the commit, along with the “list” of changed classes;
# mapper: given a reference class (which will be one of the classes of the repository), it gives as key the “name” of a class modified in the same commit as the reference class and as value “1”; reducer: it returns the pairs “name” of a class, “number” of times that the class has been modified along with the reference class.

The three scripts produce three output files, called “output1.tsv”, “output2.tsv” and “output3.tsv”, calculated for the three statistics.

Since the third statistic produces, for each class, an output file containing the rate of change of the other classes compared to the analysed class, another Python script generates a square matrix having rows and columns for the names of all classes present in the repository. Each cell (i, j) indicates the number of times that the i-th class has been changed simultaneously with the j-th class.

This matrix is further processed by another Python script to create two other output files (“filtered_classes_matrix.tsv” and “most_modified_list.txt”), which respectively contain the array of classes sorted in accordance with a minimum frequency threshold and the list of the most modified classes.

The statistical processing is structured to take place in a parallel and distributed manner by using MrJob. In fact the whole calculation was made to run on multiple machines, after proper configuration of a Hadoop cluster.

====C. Results visualisation====
The result of the calculation of the three statistics is a set of output files in tsv (tab-separated values) format that summarise the results of the tool. Furthermore, some graphs are produced to help the user understand the results. This is the list of files produced:

* output1.tsv: it refers to the first statistic and shows the time and date of each commit in the selected temporal range and the list of classes modified during these commits.
* output2.tsv: it refers to the second statistic and shows the commits for which the percentage of modified classes is over the input threshold. For each of these commits, identified by time and date, it shows the percentage of modified classes and the list of these classes.
* output3.tsv: it refers to the third statistic and shows the table of all the classes of the repository. Each position (x, y) represents how many times the classes x and y have been modified together.
* filtered_classes_matrix.tsv: it is a table of filtered classes that shows only the significant values of the output3.tsv file. The significant values are those over the input threshold set in the settings file.
* most_modified_list.txt: it is the list of the n most modified classes in the repository, with the value of n set in the settings file.

Finally, the execution of the graphs.sh script produces these three sets of graphs:

* A graph that shows, only for the commits that exceed the percentage of modification set in the settings file, the percentage of modified classes compared to the total number of classes present in that commit at that temporal instant.
* A set of n graphs, with n set in the settings file, that show the frequencies of modifications associated with the first n classes by total number of modifications.
* A heat map that shows the number of concurrent modifications for classes, filtered by a specific value in the settings file.

Figure 3 shows the results of the analysis of the wavefrontHQ/java repository [14], a repository chosen for the testing phase.

Fig. 3. Graphs produced by the analysis of the wavefrontHQ/java repository. The top left graph is the result of the second statistic. The bottom left graph is the result, the heat map, of the third statistic. The other graphs are the 10 most modified classes of the repository.

===IV. Testing===
For the testing phase we used a machine with the following configuration:

* CPU: Intel Core i7-4510U @ 2.00GHz x 4
* RAM: 8GB
* OS: Ubuntu 16.04 64-bit

The repositories of Java code chosen for testing are listed in Table I.

{| class="wikitable"
|+ Table I. Analysed repositories stats.
! Repository !! Number of commits !! Last commit date !! Number of classes !! Execution time
|-
| wavefrontHQ/java [14] || 181 || 13.65 || 80 || 10.430s
|-
| java-design-patterns [15] || 1322 || 92.50 || 1029 || 555.537s
|-
| okhttp [16] || 2535 || 33.33 || 583 || 462.374s
|-
| RxJava [17] || 4715 || 8.99 || 2059 || 7189.825s
|}

The results obtained from the analysis of the repositories, relating the execution time of the tool to the total number of classes of each repository, are summarised in the graph of Figure 4.

Fig. 4. Graph with the execution times in relation to the total number of classes of each repository analysed.

For a better understanding of the results that the tool produces, we have prepared some graphs that show the results produced during the testing phase on the repository wavefrontHQ/java.

Fig. 5. Graph relating to the history of commits of the repository wavefrontHQ/java.

Figure 5 shows the time history of the commits of the repository that exceed a threshold percentage of classes modified for each commit; in this case the threshold is set to 0%. The graph shows that after the first modification of 100% of the classes, which corresponds to the first upload of classes in the repository, the repository has been changed for 40% of its classes on one occasion, during a commit dated November 2015, and it has not been modified for more than 20% from that moment forward.

Fig. 6. Graph of the class “GraphiteDecoder” in the repository wavefrontHQ/java.

Figure 6 shows that the class “GraphiteDecoder”, edited nine times during the commits to the repository, is strongly correlated to the class “OpenTSDBDecoder”, which has been changed as often as the first was changed. In addition, there is a slightly weaker correlation with the classes “AbstractAgent” and “PushAgent”, and gradually weaker correlations with other classes. This type of chart is shown for the first classes of the repository, ranked by number of changes.

Fig. 7. Heatmap of the number of classes modifications in the repository wavefrontHQ/java.

Figure 7 gives a heatmap for modified classes in relation to other classes. For a better visualisation, only the subset of classes for which the number of changes exceeds a certain threshold has been shown. The values on the diagonal represent the number of overall changes of the class, and the values in the corresponding row (or, equivalently, column) are the number of changes of the other classes in the commits in which the first class has been modified. E.g. the row, or column, for class “AbstractAgent” shows that it has been changed 35 times (colour yellow) and along with it the class “PushAgent” was changed about 20 times (colour close to red), while all other classes have been changed fewer than 10 times (colours from violet to black).

===V. Conclusion===
This paper has described the development of a tool that taps into GitHub repositories and extracts from them some relevant statistics on the commits of a system's classes. The statistics allow us to gain an insight into the software system, such as the classes strongly related to each other, since they were modified at the same time most of the time, and how many times the repository has been modified by significant modifications affecting most of the code.

This work can be considered a preliminary, but essential, step towards developing further useful metrics, describing the work of the developers, or suggesting to developers which parts of the system could be improved by means, e.g., of refactoring techniques.

===References===
* [1] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts, Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.
* [2] J. Kerievsky, Refactoring to patterns. Addison-Wesley, 2005.
* [3] R. Giunta, G. Pappalardo, and E. Tramontana, “Superimposing roles for design patterns into application classes by means of aspects,” in Proceedings of ACM Symposium on Applied Computing (SAC). Riva del Garda, Italy: ACM, March 2012, pp. 1866–1868.
* [4] C. Napoli, G. Pappalardo, and E. Tramontana, “Using modularity metrics to assist move method refactoring of large systems,” in Complex, Intelligent, and Software Intensive Systems (CISIS), 2013 Seventh International Conference on. IEEE, 2013, pp. 529–534.
* [5] R. Giunta, G. Pappalardo, and E. Tramontana, “Aspects and annotations for controlling the roles application classes play for design patterns,” in Proceedings of IEEE Asia Pacific Software Engineering Conference (APSEC), Ho Chi Minh, Vietnam, December 2011, pp. 306–314.
* [6] A. Calvagna and E. Tramontana, “Delivering dependable reusable components by expressing and enforcing design decisions,” in Proceedings of IEEE Computer Software and Applications Conference (COMPSAC) Workshop QUORS, Kyoto, Japan, July 2013, pp. 493–498.
* [7] R. Giunta, G. Pappalardo, and E. Tramontana, “Using Aspects and Annotations to Separate Application Code from Design Patterns,” in Proceedings of Symposium on Applied Computing (SAC). ACM, 2010, pp. 2183–2189.
* [8] S. Cicciarella, C. Napoli, and E. Tramontana, “Searching design patterns fast by using tree traversals,” International Journal of Electronics and Telecommunications, vol. 61, no. 4, pp. 321–326, 2015.
* [9] E. Tramontana, “A design pattern for improving the performances of a distributed access control mechanism,” in Proceedings of AsianPlop, Taipei, Taiwan, February 2016.
* [10] F. Sokol, M. Zigmund, F. Aniche, and M. Gerosa, “Metricminer: Supporting researchers in mining software repositories,” in IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), 2013.
* [11] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
* [12] C. Napoli, E. Tramontana, and G. Verga, “Extracting location names from unstructured Italian texts using grammar rules and mapreduce,” in Proceedings of the International Conference on Information and Software Technologies (ICIST), 2016, pp. 593–601.
* [13] T. Williams and L. Hecking, “Gnuplot,” 2003.
* [14] “Wavefront,” 2016. [Online]. Available: https://github.com/wavefrontHQ/java
* [15] I. Seppala, 2016. [Online]. Available: https://github.com/iluwatar/java-design-patterns
* [16] “Square,” 2016. [Online]. Available: https://github.com/square/okhttp
* [17] “Reactivex,” 2016. [Online]. Available: https://github.com/ReactiveX/RxJava