=Paper=
{{Paper
|id=Vol-1852/p14
|storemode=property
|title=Extracting and comparing variables on Java source code
|pdfUrl=https://ceur-ws.org/Vol-1852/p14.pdf
|volume=Vol-1852
|authors=Mirko Raimondo Aiello,Andrea Caruso
}}
==Extracting and comparing variables on Java source code==
Mirko Raimondo Aiello, Andrea Caruso

Department of Mathematics and Informatics, University of Catania, Catania, Italy

Email: mremaiello@gmail.com, andrea299024@gmail.com

Copyright © 2017 held by the authors.

Abstract: We have realised a Java source code analyser that firstly extracts code from a repository and then analyses the code to gain some knowledge. In particular, for each method of each class the internal variables used are automatically determined. For Java classes, we find method bodies and variable declarations by using regular expressions. The tool has been developed by means of Java and MrJob. Python with the MapReduce model has been used to perform data analysis in a distributed manner.

Index Terms: Parallelization, Java, Python, MapReduce, algorithms, data dependence

===I. Introduction===

We have developed a system that analyses different versions of a software system, which is automatically extracted from a GitHub repository [1], [11]. Essentially, given two versions of a Java class, we gather: the names of the classes, the method names within the classes, the starting and ending line numbers of the methods, and the variables used by the methods.

The analysis is performed by two different tools. The first tool, developed in Java, takes as input the source code and gets classes, methods and variables. The second tool, developed in Python, analyses the output of the first tool and gets the variables within each method. To mine the data from a set of Java classes, we use regular expressions, i.e. sequences of characters that define a search pattern, mainly used in pattern matching with strings, or string matching, such as "find and replace"-like operations.

Since the said computation can become expensive with the increasing quantity of code to analyse, the second tool can be executed in a distributed manner using Hadoop, an Apache framework providing the MapReduce model, which supports distributed systems accessing big data [8], [14], [18]. Of course, distribution depends on the underlying infrastructure and configuration, for which other support may be needed [3], [4], [12], [13]. To use all the advantages of the Python language, we adopt the MrJob toolkit, which helps to develop Hadoop programs and test them locally [2].

Finally, the data obtained from the computation are saved to a .csv file in the output directory, and can be read using tools like CSV Viewer. Both the first and the second tool will be subject to evaluation tests, to calculate execution time and obtain some results from a chosen Java repository.

The proposed analysis can be valuable when examining existing code in view of refactoring opportunities [9], [10], [16]. Moreover, it is often necessary for developers to find essential characteristics and metrics which give a view on the internal characteristics, such as design patterns used [7], [15], or performance related and dependence issues [6], [17]. This is even valuable for the problem of validating software systems [5].

The remainder of this paper first describes the problem at hand, then describes the proposed solution in detail; results are then shown, and finally conclusions are drawn.

===II. Requirements analysis===

Requirements analysis in systems engineering and software engineering encompasses tasks that determine the needs or conditions to meet for the product. As said previously, the entire software product is divided into two tools; for each of them the following requirements have been identified.

* Analysis of Java source code: the first tool will be able to analyse Java source code, obtaining the desired data, i.e. the names of classes, the methods within the classes, the start and end line numbers of each method, and the variables used by such methods. The results are stored in appropriate data structures.
* Reading and writing information into files: the second tool will be able to read .txt files placed in a specific directory. These files contain the bodies of methods previously extracted. The extracted data will be placed in .csv and .txt files. In particular, the overall structure of the examined Java files will be placed in a .csv file, while the bodies of the identified methods will be included in a .txt file.
* Extract variables: the second tool will be able to extract the names and types of the variables used by each method, by using the MapReduce paradigm. The variables will be associated with the method that uses them.
* Comparison of files: the second tool should analyse the variables that are used within methods, with the purpose of tracking changes between two different versions of the same method. The difference will be performed in both directions, i.e. by identifying variables that are within the first version of the method and not in the other version, and vice versa.

===III. Proposed approach===

The development of the proposed tools involved different programming languages combined together in a pipeline with a bash script. For Tool 1, used for the pre-processing, we used Java. For Tool 2, which computes the differences, we used Python with the MrJob toolkit. Figure 1 shows a general overview of the developed components.

Fig. 1. Java Source Code Analyser workflow.

All the algorithms of the said tools described here are valid for an arbitrary number of Java classes. However, to make it easier to understand the algorithms used in this project, all the examples and the related images refer to the analysis of the Java files in Figure 2 and Figure 3.

Fig. 2. Recursive Factorial Java Code used as an example.

Fig. 3. Iterative Factorial Java Code used as an example.

===IV. The first tool===

The analysis of the class code has been performed by a ParseText class taking as parameters: (i) an input string holding the code of the class under analysis, and (ii) a string that represents the output folder. The source code of the class under analysis is read line by line, and regular expressions (regex) are used on each line to identify the signature of the methods and the respective row in which they occur. Since the lines of code holding method signatures can have different forms (e.g. the constructor does not have a return type), several regex are needed, and they are contained in a list (classRegexList). By knowing all the starting lines of code for methods, it is possible to extract the ending lines of the methods by simply calculating the differences between the start line of the i-th method and the beginning of the next method.

Fig. 4. Fragment of code used for the extraction of information.

Once these two pieces of information have been obtained, we can analyse the method body. Obviously, the code will check that the structure of the method is compliant with the Java specifications. The next step uses regular expressions again and, by manipulating strings (by means of replace, replaceAll, trim), extracts the class name, the signature and the method body.

Fig. 5. Regex used to extract, respectively, the constructor and the methods of a class.

Fig. 6. Regex used to extract variables.

The output generated by the first tool consists of a number of files in two categories.

* A group of txt files, one for each method found in the analysed code, containing the name of the reference class, the start and the end line of the method within the class, and the method body. A sample output is shown in Figure 7.
* A cmList.csv file containing the list of the identified methods, variables, starting and ending lines, as shown in Figure 8.

Fig. 7. Example of output files for the Tool 1.

Fig. 8. Example of generated cmList.csv file.

===V. The second tool===

The initial step of the second tool is to scan the directory in which the first tool created the .txt files containing the method bodies. When Tool 2 identifies the files, it stores the filenames in a list called "names". Furthermore, the number of files taken as input is stored in the variable "n_methods".

Fig. 9. Fragment of code used for reading file names.

Afterwards, the previously listed files are opened. The ClassFind job is launched on each of them using the run() command.

Fig. 10. Fragment of code used for parsing files.

The MapReduce job consists of two main steps. The first step consists of a map function and a reduce function, with the goal of detecting variables and inserting them into pairs "methodName, [variablesList]". The second step consists of a further reduce function whose goal is to calculate the differences in variables between the examined methods.

Fig. 11. Fragment of code for the first step.

The mapper detects the presence of variables by means of regular expressions. Each regular expression identifies a particular sequence of characters to search for in the text.

Fig. 12. Fragment of code of the mapper.

The first reducer receives pairs "methodName, variableName" detected by the mapper and groups them according to the "methodName" key, forming "methodName, [variablesList]" pairs.

Fig. 13. Fragment of code of the first reducer.

The second reducer is more complex than the previous one. It takes as input the identification key of the current method and the list of variables associated with it. Variables are read by using the "next" command and stored in the "variables" string, which is then used for printing the "variables.txt" file, which will contain pairs "methodName, [variablesList]". The dict type (key-value dictionary) structure "variables_list" is used to store "methodName, [variablesList]" pairs. The variable "count" is a counter incremented whenever the MapReduce process has been executed on a method. When it reaches the value "n_methods", all methods have been processed, and the used variables have been identified and stored in "variables_list". Now differences can be easily computed. Leveraging the "itertools" library, it is possible to calculate all combinations of pairs of methods, so as to compare each method with all the others, which is the task of the "combinations" function.

Fig. 14. Fragment of code of the second reducer.

The calculation of the differences is managed by Counter type structures, which make it possible to take into account the presence of several variables with the same name. Suppose, for example, that one method has two variables named i and a second method has only one; the difference between the first and the second method will then result in one i variable.

Fig. 15. Fragment of code used for calculation of the differences.

The difference is performed in the same way also in the opposite direction. Then, for each pair of methods, two differences are generated. Different prints for the various output files are managed along the entire process. Finally, the various strings that contain the results are written to ad-hoc files.

The variables-finding function of Tool 2 has been implemented as a variant of Tool 1, hence both tools detect method variables and associate them with the respective methods. Nevertheless, the two tools achieve this goal in different ways. Tool 1 uses a classical sequential approach without employing the MapReduce paradigm, whereas Tool 2 uses Python and MrJob as described in this section.

Tool 2 generates four output files.

* output.txt: created by redirecting the standard output, it is a file containing, for each pair of analysed methods, the name of the method, the set of variables of each method and the two differences between the sets of variables. The output is shown in Figure 16.
* variables.txt: contains the list of the identified methods and their variables, as shown in Figure 17.
* differences.txt: shows the differences between the sets of variables only, as shown in Figure 18.
* time.txt: contains a string giving the time in seconds spent to search all the method variables, as shown in Figure 19.

Fig. 16. Example of output.txt file of the Tool 2.

Fig. 17. Example of variables.txt file of the Tool 2.

Fig. 18. Example of differences.txt file of the Tool 2.

Fig. 19. Example of time.txt file of the Tool 2.

===VI. Results and timing analysis===

The configuration used for the test is the following.

* Laptop: ASUS A56C
* CPU: Intel(R) Core(TM) i7-3537U @ 2.00GHz
* RAM: 4.00 GB
* OS type: 64-bit
* OS: Linux Ubuntu 16.04 LTS

The two tools show similar results from a reliability standpoint. Both seek the variables by means of regular expressions using the same approach. All internal variables of methods of the code under analysis are detected successfully.

Extraction times for finding the internal variables of the methods (and inclusion in key-value lists, in the case of Tool 2) were compared. Execution times are listed in Table I.

{| class="wikitable"
|+ Table I. Analysed execution time stats.
! Number of files !! Tool 1 (Java) !! Tool 2 (Python)
|-
| 2 || 0.004s || 0.001s
|-
| 3 || 0.006s || 0.002s
|-
| 4 || 0.007s || 0.002s
|-
| 5 || 0.009s || 0.003s
|-
| 6 || 0.011s || 0.004s
|-
| 7 || 0.012s || 0.005s
|-
| 8 || 0.013s || 0.006s
|-
| 9 || 0.013s || 0.007s
|-
| 10 || 0.014s || 0.009s
|}

Both tools show linear progression for the measured timings, increasing with the number of methods for which variables are extracted. In general, Tool 2 is quicker than Tool 1. The performance results are summarised in the graph of Figure 20 (by using gnuplot [19]).

Fig. 20. Graph with the execution times in relation to the total number of methods analysed.

===VII. Conclusions===

This paper described a pair of tools that analyse Java source code and extract from it some relevant statistics, such as method names, signatures, starting and ending lines and variables within methods. The tools have been developed in Java and Python, and a comparison of the results in terms of execution time has been shown.

===References===

[1] GitHub. https://github.com.

[2] Mrjob v0.5.3. http://pythonhosted.org/mrjob.

[3] F. Bannò, D. Marletta, G. Pappalardo, and E. Tramontana. Tackling consistency issues for runtime updating distributed systems. In Proceedings of IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1–8, Atlanta, GA, April 2010.

[4] G. Borowik, M. Wozniak, A. Fornaia, R. Giunta, C. Napoli, G. Pappalardo, and E. Tramontana. A software architecture assisting workflow executions on cloud resources. International Journal of Electronics and Telecommunications, 61:17–23, 2015.

[5] A. Calvagna and E. Tramontana. Automated conformance testing of Java virtual machines. In Proceedings of Complex, Intelligent and Software Intensive Systems (CISIS). IEEE, July 2013.

[6] A. Calvagna and E. Tramontana. Delivering dependable reusable components by expressing and enforcing design decisions. In Proceedings of Compsac, pages 493–498, Kyoto, Japan, 2013. IEEE.

[7] S. Cicciarella, C. Napoli, and E. Tramontana. Searching design patterns fast by using tree traversals. International Journal of Electronics and Telecommunications, 61(4):321–326, 2015.

[8] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[9] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts. Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.

[10] R. Giunta, G. Pappalardo, and E. Tramontana. Superimposing roles for design patterns into application classes by means of aspects. In Proceedings of ACM Symposium on Applied Computing (SAC), pages 1866–1868, Riva Del Garda (Trento), Italy, March 26-30 2012.

[11] J. Loeliger and M. McCullough. Version Control with Git: Powerful tools and techniques for collaborative software development. O'Reilly Media, Inc., 2012.

[12] C. Napoli, F. Bonanno, and G. Capizzi. Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach. Proceedings of the International Astronomical Union, 6(S274):156–158, 2010.

[13] C. Napoli, F. Bonanno, and G. Capizzi. An hybrid neuro-wavelet approach for long-term prediction of solar wind. Proceedings of the International Astronomical Union, 6(S274):153–155, 2010.

[14] C. Napoli, E. Tramontana, and G. Verga. Extracting location names from unstructured italian texts using grammar rules and mapreduce. In Proceedings of the International Conference on Information and Software Technologies (ICIST), volume 639, pages 593–601, Druskininkai, Lithuania, October 13-15 2016. Springer.

[15] G. Pappalardo and E. Tramontana. Automatically discovering design patterns and assessing concern separations for applications. In Proceedings of ACM Symposium on Applied Computing (SAC), pages 1591–1596, Dijon, France, April 23-27 2006. ACM.

[16] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S. T. Halkidis. Design pattern detection using similarity scoring. Software Engineering, IEEE Transactions on, 32(11):896–909, 2006.

[17] H. Washizaki and Y. Fukazawa. Dynamic hierarchical undo facility in a fine-grained component environment. In Proceedings of International Conference on Tools Pacific: Objects for internet, mobile and embedded applications, pages 191–199. Australian Computer Society, Inc., 2002.

[18] T. White. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.

[19] T. Williams and L. Hecking. Gnuplot, 2003.