=Paper= {{Paper |id=Vol-1852/p14 |storemode=property |title=Extracting and comparing variables on Java source code |pdfUrl=https://ceur-ws.org/Vol-1852/p14.pdf |volume=Vol-1852 |authors=Mirko Raimondo Aiello,Andrea Caruso }} ==Extracting and comparing variables on Java source code== https://ceur-ws.org/Vol-1852/p14.pdf
                  Extracting and comparing variables on
                             Java source code
                                            Mirko Raimondo Aiello, Andrea Caruso
                                             Department of Mathematics and Informatics
                                                       University of Catania
                                                           Catania, Italy
                                      Email: mremaiello@gmail.com, andrea299024@gmail.com


   Abstract—We have developed a Java source code analyser that
first extracts code from a repository and then analyses the code
to gain some knowledge. In particular, for each method of each
class the internal variables used are automatically determined.
For Java classes, we find method bodies and variable declarations
by using regular expressions. The tool has been developed by
means of Java and MrJob. Python with the MapReduce model
has been used to perform data analysis in a distributed manner.

   Index Terms—Parallelization, Java, Python, MapReduce,
algorithms, data dependence

   Copyright © 2017 held by the authors.

                        I. INTRODUCTION
   We have developed a system that analyses different versions
of a software system, which is automatically extracted from
a GitHub repository [1], [11]. Essentially, given two versions
of a Java class, we gather: the names of the classes, the method
names within classes, the starting and ending line numbers
of methods, and the variables used by the methods.
   The analysis is performed by two different tools. The first
tool, developed in Java, takes the source code as input and gets
classes, methods and variables. The second tool, developed
in Python, analyses the output of the first tool and gets
the variables within each method. To mine the data from a
set of Java classes, we use regular expressions, i.e. sequences
of characters that define a search pattern, mainly used in
pattern matching with strings, or string matching, i.e. “find
and replace”-like operations.
   Since this computation can become expensive as the quantity
of code to analyse increases, the second tool can be executed
in a distributed manner using Hadoop, an Apache framework
that provides the MapReduce model to support distributed
systems accessing big data [8], [14], [18]. Of course,
distribution depends on the underlying infrastructure and
configuration, for which other support may be needed [3], [4],
[12], [13]. To use all the advantages of the Python language,
we adopt the MrJob toolkit, which helps to develop Hadoop
programs and test them locally [2].
   Finally, the data obtained from the computation are saved
to a .csv file in the output directory, and can be read using
tools like CSV Viewer. Both the first and the second tool are
subject to evaluation tests, to measure execution time and
obtain some results from a chosen Java repository.
   The proposed analysis can be valuable when examining
existing code in view of refactoring opportunities [9], [10],
[16]. Moreover, it is often necessary for developers to find
essential characteristics and metrics which give a view on
the internal characteristics, such as design patterns used [7],
[15], or performance-related and dependence issues [6], [17].
This is also valuable for the problem of validating software
systems [5].
   The rest of this paper first describes the problem at hand,
then describes the proposed solution in detail, then shows the
results, and finally draws conclusions.

                        II. REQUIREMENTS ANALYSIS
   Requirements analysis in systems engineering and software
engineering encompasses tasks that determine the needs or
conditions to meet for the product. As said previously, the
entire software product is divided into two tools; for each,
the following requirements have been identified.
   • Analysis of Java source code: the first tool will be able
     to analyse Java source code, obtaining the desired data,
     i.e. the names of classes, the methods within the classes,
     the start and end line numbers of each method, and the
     variables used by such methods. The results are stored in
     appropriate data structures.
   • Reading and writing information into files: the second
     tool will be able to read .txt files placed in a specific
     directory. These files contain the bodies of methods
     previously extracted. The extracted data will be placed
     in .csv and .txt files. In particular, the overall structure of
     the examined Java files will be placed in a .csv file, while
     the bodies of the identified methods will be included in
     a .txt file.
   • Extract variables: the second tool will be able to extract
     the names and types of the variables used by each method,
     by using the MapReduce paradigm. The variables will be
     associated with the method that uses them.
   • Comparison of files: the second tool should analyse the
     variables that are used within methods, with the purpose
     of tracking changes between two different versions of the
     same method. The difference is computed in both
     directions, i.e. by identifying the variables that appear in
     the first version of the method but not in the other
     version, and vice versa.
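
   The last two requirements (extracting typed variables per method and
comparing two versions in both directions) can be illustrated with a
minimal Python sketch. It is only an illustration under stated
assumptions: the DECLARATION pattern, the function names and the two
sample method bodies are hypothetical and are not taken from the tools
described in this paper.

```python
import re
from collections import Counter

# Hypothetical pattern for simple local declarations such as "int i = 0;"
# or "String name;". It is an assumption, not the regex used by the tools.
DECLARATION = re.compile(
    r"\b(?P<type>int|long|double|float|boolean|char|String)\s+"
    r"(?P<name>[A-Za-z_]\w*)\s*(?:=[^;]*)?;"
)


def extract_variables(method_body):
    """Return (type, name) pairs for the declarations found in a method body."""
    return [(m.group("type"), m.group("name"))
            for m in DECLARATION.finditer(method_body)]


def compare_versions(old_body, new_body):
    """Return the variables found only in the old body and only in the new one."""
    old_vars = Counter(extract_variables(old_body))
    new_vars = Counter(extract_variables(new_body))
    # Counter subtraction keeps only positive counts, so each direction
    # has to be computed separately.
    return old_vars - new_vars, new_vars - old_vars


# Two hypothetical versions of the same method body.
old_version = ("int result = 1; "
               "for (int i = 2; i <= n; i++) { result *= i; } return result;")
new_version = ("long result = 1; int i = 2; "
               "while (i <= n) { result *= i; i++; } return result;")

only_in_old, only_in_new = compare_versions(old_version, new_version)
print(sorted(only_in_old))
print(sorted(only_in_new))
```

   In this sketch a variable is identified by its type and name together,
so changing the type of result from int to long shows up in both
directions of the comparison.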




                        III. PROPOSED APPROACH
   The development of the proposed tools involved different
programming languages, combined in a pipeline by means of
a bash script. For Tool 1, used for the pre-processing, we
used Java. For Tool 2, which computes the differences, we
used Python with the MrJob toolkit. Figure 1 shows a general
overview of the developed components.




      Fig. 3. Iterative Factorial Java Code used as an example.

                        IV. THE FIRST TOOL
   The analysis of the class code is performed by a ParseText
class taking as parameters: (i) an input string holding the code
of the class under analysis, and (ii) a string that represents
the output folder.
   The source code of the class under analysis is read line by
line, and regular expressions (regex) are applied to each line to
identify the signatures of the methods and the lines at which
they occur. Since the lines of code holding method signatures
can have different forms (e.g. a constructor does not have a
return type), several regexes are needed; they are contained in
a list (classRegexList). Once the starting lines of all methods
are known, the ending line of each method can be derived by
simply calculating the difference between the start line of the
i-th method and the start line of the next method.
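
   The line-by-line scan described above can be sketched as follows.
This is a Python illustration only (Tool 1 itself is written in Java),
and it rests on several assumptions: the two patterns in
SIGNATURE_REGEXES are hypothetical stand-ins for classRegexList, every
signature is assumed to start with an access modifier and to fit on one
line, each method is assumed to end on the line before the next
signature, and the last method is assumed to run to the end of the file.

```python
import re

# Hypothetical signature patterns, assumed for this sketch only.
METHOD = re.compile(
    r"^\s*(?:public|protected|private)\s+(?:static\s+)?(?:final\s+)?"
    r"[\w<>\[\]]+\s+(?P<name>\w+)\s*\([^)]*\)"
)
# A constructor has no return type, so it needs its own pattern.
CONSTRUCTOR = re.compile(
    r"^\s*(?:public|protected|private)\s+(?P<name>\w+)\s*\([^)]*\)"
)
SIGNATURE_REGEXES = [METHOD, CONSTRUCTOR]


def method_ranges(source):
    """Return (name, start_line, end_line) for each detected signature."""
    lines = source.splitlines()
    starts = []
    for number, line in enumerate(lines, start=1):
        for regex in SIGNATURE_REGEXES:
            match = regex.match(line)
            if match:
                starts.append((match.group("name"), number))
                break
    ranges = []
    for index, (name, start) in enumerate(starts):
        if index + 1 < len(starts):
            # Assumption: a method ends on the line before the next signature.
            end = starts[index + 1][1] - 1
        else:
            # Assumption: the last method runs to the end of the file.
            end = len(lines)
        ranges.append((name, start, end))
    return ranges


# Hypothetical input class, written for this example.
SOURCE = """\
public class Factorial {
    public Factorial() {
    }
    public static int factorial(int n) {
        int result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }
}
"""

for name, start, end in method_ranges(SOURCE):
    print(name, start, end)
```

   Deriving the end line from the next start line keeps the scan to a
single pass, at the cost of attributing blank lines and closing braces
between two methods to the earlier one.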

      Fig. 1. Java Source Code Analyser workflow.

   All the algorithms of the tools described here are valid
for an arbitrary number of Java classes. However, to make the
algorithms used in this project easier to understand, all the
examples and the related images refer to the analysis of the
Java files in Figure 2 and Figure 3.




      Fig. 4. Fragment of code used for the extraction of information.

   Once these two pieces of information have been obtained,
we can analyse the method body. The code checks that the
structure of the method is compliant with the Java
specifications. The next step again uses regular expressions
and string manipulation (by means of replace, replaceAll, trim)
to extract the class name, the signature and the method body.
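
   A minimal Python sketch of this string-manipulation step is given
below. It is an illustration under assumptions, not the code of Tool 1:
the helper names are hypothetical, the brace-balance test is only a
simple stand-in for the structural check, and re.sub and str.strip play
the role that replaceAll and trim play in Java.

```python
import re


def braces_balanced(text):
    """Minimal structural check: braces must open and close consistently."""
    depth = 0
    for char in text:
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def class_name(class_line):
    """Extract the class name from a line such as 'public class A {'."""
    match = re.search(r"\bclass\s+(\w+)", class_line)
    return match.group(1) if match else None


def split_method(method_text):
    """Split the text of one method into (signature, body)."""
    open_brace = method_text.index("{")
    close_brace = method_text.rindex("}")
    signature = method_text[:open_brace].strip()
    body = method_text[open_brace + 1:close_brace]
    # Collapse runs of whitespace into single spaces, then trim both ends.
    body = re.sub(r"\s+", " ", body).strip()
    return signature, body


# Hypothetical method text, written for this example.
METHOD_TEXT = """\
    public static int factorial(int n) {
        if (n <= 1) {
            return 1;
        }
        return n * factorial(n - 1);
    }
"""

if braces_balanced(METHOD_TEXT):
    signature, body = split_method(METHOD_TEXT)
    print(class_name("public class Factorial {"))
    print(signature)
    print(body)
```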
      Fig. 2. Recursive Factorial Java Code used as an example.

   The output generated by the first tool consists of a number
of files in two categories.
   • A group of .txt files, one for each method found in
     the analysed code, containing the name of the reference
     class, the start and the end line of the method within the
     class, and the method body. A sample output is shown in
     Figure 7.
   • A cmList.csv file containing the list of the identified
     methods, variables, starting and ending lines, as shown
     in Figure 8.

      Fig. 5. Regex used to extract, respectively, the constructor and
      the methods of a class.

      Fig. 6. Regex used to extract variables.

      Fig. 7. Example of output files for the Tool 1.

      Fig. 8. Example of generated cmList.csv file.

                        V. THE SECOND TOOL
   The initial step of Tool 2 is to scan the directory
in which Tool 1 created the .txt files containing the
method bodies. When Tool 2 identifies the files, it stores the
filenames in a list called “names”. Furthermore, the number
of files taken as input is stored in the variable “n_methods”.

      Fig. 9. Fragment of code used for reading file names.

   Afterwards, the previously listed files are opened. The
ClassFind job is launched on each of them using the run()
command.

      Fig. 10. Fragment of code used for parsing files.

   The MapReduce job consists of two main steps. The first
step consists of a map function and a reduce function, whose
goal is to detect variables and insert them into pairs
“methodName, [variablesList]”. The second step consists of a
further reduce function, whose goal is to calculate the variable
differences between the examined methods.

      Fig. 11. Fragment of code for the first step.

   The mapper detects the presence of variables by means
of regular expressions. Each regular expression identifies a
particular sequence of characters to search for in the text.

      Fig. 12. Fragment of code of the mapper.

   The first reducer receives the pairs “methodName,
variableName” detected by the mapper and groups them according
to the “methodName” key, forming “methodName,
[variablesList]” pairs.

      Fig. 13. Fragment of code of the first reducer.

   The second reducer is more complex than the previous one. It
takes as input the identification key of the current method
and the list of variables associated with it. Variables are read
by using the “next” command and stored in the “variables”
string, which is then used for printing the “variables.txt” file,
which will contain pairs “methodName, [variablesList]”. The dict
(key-value dictionary) structure “variables_list” is used
to store “methodName, [variablesList]” pairs. The variable
“count” is a counter incremented whenever the MapReduce
process has been executed on a method. When it reaches
the value “n_metodi” (the number of methods), all methods have
been processed, and the variables used have been identified and
stored in “variables_list”. Now the differences can be easily
computed.
   Leveraging the “itertools” library, it is possible to calculate
all combinations of pairs of methods, so as to compare
each method with all the others; this is the task of the
“combinations” function.

      Fig. 14. Fragment of code of the second reducer.

   The calculation of the differences is managed by Counter
structures, which make it possible to take into account the
presence of several variables with the same name. Suppose, for
example, that one method has two variables named i and a second
method has only one: the difference between the first and the
second method will result in one i variable.

      Fig. 15. Fragment of code used for calculation of the differences.

   The difference is computed in the same way in the opposite
direction as well. Hence, for each pair of methods, two
differences are generated. The output for the various files is
produced along the entire process. Finally, the strings that
contain the results are written to ad-hoc files.
   The variable-finding function of Tool 2 has been
implemented as a variant of that of Tool 1, hence both tools
detect the variables of methods and associate them with the
respective methods. Nevertheless, the two tools achieve this goal
in different ways. Tool 1 uses a classical sequential approach
without employing the MapReduce paradigm, whereas Tool 2 uses
Python and MrJob as described in this section.
   Tool 2 generates four output files.
   • output.txt: created by redirecting the standard output, it is
     a file containing, for each pair of analysed methods, the
     name of the method, the set of variables of each method
     and the two differences between the sets of variables. The
     output is shown in Figure 16.
   • variables.txt: contains the list of the identified methods
     and their variables, as shown in Figure 17.
   • differences.txt: shows the differences between the sets of
     variables only, as shown in Figure 18.
   • time.txt: contains a string giving the time in seconds spent
     searching for all the variables of the methods, as shown in
     Figure 19.

      Fig. 16. Example of output.txt file of the Tool 2.

      Fig. 17. Example of variables.txt file of the Tool 2.

      Fig. 18. Example of differences.txt file of the Tool 2.

      Fig. 19. Example of time.txt file of the Tool 2.

                        VI. RESULTS AND TIMING ANALYSIS
   The configuration used for the test is the following.
   • Laptop: ASUS A56C
   • CPU: Intel(R) Core(TM) i7-3537U @ 2.00GHz
   • RAM: 4.00 GB
   • OS type: 64-bit
   • OS: Linux Ubuntu 16.04 LTS




   The two tools show similar results from a reliability
standpoint. Both seek the variables by means of regular expressions
using the same approach. All internal variables of the methods of
the code under analysis are detected successfully.
   Extraction times for finding the internal variables of the
methods (and inclusion in key-value lists, in the case of Tool
2) were compared. Execution times are listed in Table I.

            Number of files    Tool 1 (Java)    Tool 2 (Python)
                          2          0.004s              0.001s
                          3          0.006s              0.002s
                          4          0.007s              0.002s
                          5          0.009s              0.003s
                          6          0.011s              0.004s
                          7          0.012s              0.005s
                          8          0.013s              0.006s
                          9          0.013s              0.007s
                         10          0.014s              0.009s
                                 TABLE I
                   ANALYSED EXECUTION TIME STATS.

   Both tools show an approximately linear progression of the
measured timings, which increase with the number of methods for
which variables are extracted. In general, Tool 2 is quicker than
Tool 1. The performance results are summarised in the graph
of Figure 20 (produced by using gnuplot [19]).

      Fig. 20. Graph with the execution times in relation to the total
      number of methods analysed.

                        VII. CONCLUSIONS
   This paper described a pair of tools that analyse Java
source code and extract from it some relevant statistics,
such as method names, signatures, starting and ending lines,
and variables within methods. The tools have been developed
in Java and Python, and a comparison of the results in terms
of execution time has been shown.

                        REFERENCES
 [1] GitHub. https://github.com.
 [2] Mrjob v0.5.3. http://pythonhosted.org/mrjob.
 [3] F. Bannò, D. Marletta, G. Pappalardo, and E. Tramontana. Tackling
     consistency issues for runtime updating distributed systems. In
     Proceedings of IEEE International Symposium on Parallel & Distributed
     Processing, Workshops and Phd Forum (IPDPSW), pages 1–8, Atlanta,
     GA, April 2010.
 [4] G. Borowik, M. Wozniak, A. Fornaia, R. Giunta, C. Napoli,
     G. Pappalardo, and E. Tramontana. A software architecture assisting
     workflow executions on cloud resources. International Journal of
     Electronics and Telecommunications, 61:17–23, 2015.
 [5] A. Calvagna and E. Tramontana. Automated conformance testing of Java
     virtual machines. In Proceedings of Complex, Intelligent and Software
     Intensive Systems (CISIS). IEEE, July 2013.
 [6] A. Calvagna and E. Tramontana. Delivering dependable reusable
     components by expressing and enforcing design decisions. In
     Proceedings of Compsac, pages 493–498, Kyoto, Japan, 2013. IEEE.
 [7] S. Cicciarella, C. Napoli, and E. Tramontana. Searching design patterns
     fast by using tree traversals. International Journal of Electronics and
     Telecommunications, 61(4):321–326, 2015.
 [8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on
     large clusters. Communications of the ACM, 51(1):107–113, 2008.
 [9] M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts. Refactoring:
     Improving the Design of Existing Code. Addison-Wesley, 1999.
[10] R. Giunta, G. Pappalardo, and E. Tramontana. Superimposing roles
     for design patterns into application classes by means of aspects. In
     Proceedings of ACM Symposium on Applied Computing (SAC), pages
     1866–1868, Riva Del Garda (Trento), Italy, March 26-30 2012.
[11] J. Loeliger and M. McCullough. Version Control with Git: Powerful
     tools and techniques for collaborative software development. O’Reilly
     Media, Inc., 2012.
[12] C. Napoli, F. Bonanno, and G. Capizzi. Exploiting solar wind time series
     correlation with magnetospheric response by using an hybrid
     neuro-wavelet approach. Proceedings of the International Astronomical
     Union, 6(S274):156–158, 2010.
[13] C. Napoli, F. Bonanno, and G. Capizzi. An hybrid neuro-wavelet
     approach for long-term prediction of solar wind. Proceedings of the
     International Astronomical Union, 6(S274):153–155, 2010.
[14] C. Napoli, E. Tramontana, and G. Verga. Extracting location names
     from unstructured italian texts using grammar rules and mapreduce. In
     Proceedings of the International Conference on Information and
     Software Technologies (ICIST), volume 639, pages 593–601,
     Druskininkai, Lithuania, October 13-15 2016. Springer.
[15] G. Pappalardo and E. Tramontana. Automatically discovering design
     patterns and assessing concern separations for applications. In
     Proceedings of ACM Symposium on Applied Computing (SAC), pages
     1591–1596, Dijon, France, April 23-27 2006. ACM.
[16] N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S. T. Halkidis.
     Design pattern detection using similarity scoring. Software Engineering,
     IEEE Transactions on, 32(11):896–909, 2006.
[17] H. Washizaki and Y. Fukazawa. Dynamic hierarchical undo facility in
     a fine-grained component environment. In Proceedings of International
     Conference on Tools Pacific: Objects for internet, mobile and embedded
     applications, pages 191–199. Australian Computer Society, Inc., 2002.
[18] T. White. Hadoop: The definitive guide. O’Reilly Media, Inc., 2012.
[19] T. Williams and L. Hecking. Gnuplot, 2003.



