37 Hypermodelling Live OLAP for Code Clone Recommendation Tim FREY and Veit KÖPPEN Otto-von-Guericke-University Magdeburg, Germany Abstract. Code bases contain often millions lines of code. Code recommendation systems ease programming by proposing developers mined and extracted use cases of a code base. Currently, recommender systems are based on hardcoded sets what makes it complicate to adapt them. Another research area is adaptable live detection of code clones. We advance clone detection and code recommender systems by presenting utilization of our Hypermodelling approach to realize an alternative technique. This approach uses Data Warehousing technology that scales for big data and allows for flexible and adaptable queries of source code. We present the generic idea to advance recommendation and clone detection based on queries and to evaluate our application with industry source code. Consequently, recommender systems and clone detection can be customized with flexible queries via Hypermodelling. This enables further research about more complex clone detection and context sensitive code recommendation. Keywords. Data warehousing, software engineering, hypermodelling Introduction Code recommender systems advance integrated development environments. They are based on the idea to extract and mine information from code bases to generate recommendations. Recommendation data contains information, which method calls occur commonly together or which methods of a super class get overwritten [1]. This data is compared to current coding of a developer and proposals are offered. The extraction and mining process limits recommendation to be easily adjusted to specific requirements. For instance, it is desirable to have recommendation information available for diverse APIs and also for an own project. Furthermore, project requirements often differ. In one project, it is required that the recommendation code base just comes out of a specific project and in another setting all available code should be used for recommendation. However, this type of flexible recommendation is currently not available. One main reason for this may be the immense size of modern code bases, resulting in difficulties to adjust the extraction process: It is necessary to scan the code base for different recommendation information every time. Another challenge is the detection of code clones [2, 3]. Thereby, code bases are scanned for duplicates. One main challenge is the different type of code bases and clone detection methods. Sometimes, a certain package should be excluded because replicas are allowed. Furthermore, clones may be exact duplicates of a code fragment or similar pieces of code. Hence, clone detection faces the challenge to provide an easy adjustable infrastructure that allows detecting different kinds of clones and different 38 T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation code base configurations. 1 Altogether, recommender systems and clone detection face the challenge to provide an adaptable infrastructure for large code bases that allow by flexible means to detect code clones and recommendations. In order to overcome these current limitations, we propose to use our Hypermodelling approach [4, 5, 6] for flexible code recommendation and clone detection. Through this approach both techniques can be covered with Data Warehouse (DW) and Online Analytical processing (OLAP) technology [7] that scales well for large data sets and is easily adjustable for multi- dimensional queries. Therefore, our contribution is to describe, how the Hypermodelling approach can be used to advance recommender systems and clone detection at the same time. We also provide an evaluation, in which we demonstrate the approach on a real world source code excerpt. The remainder of the paper is structured as follows: First, we recap important facts about our Hypermodelling approach. Next, we describe the general approach, how Hypermodelling can be used to detect clones and recommend code fragments. Afterwards, we describe the approach with a concrete example. In Section 4, we refer to related work and point out differences. 1. Hypermodelling Hypermodelling is the idea to combine program analysis and DW. A more detailed description of the Hypermodelling approach is available in [4]. 2 In [5, 6] different reporting possibilities of this approach are presented. DW systems are an integrative component in business computing [7]. 3 They are used to assemble data of different sources together. The integrated data are arranged into multi-dimensional data structures, i.e., data cubes, which serve as base for queries [4]. Queries can be used to aggregate different measures and dimensions (within their hierarchies) that occur in the data. For instance, sales for an employee can be computed for a given time period. Thereby, this query aggregates the region, the products (sales) and the time in relation to financial indicators. Likewise, hierarchies can be abstracted. For instance, the region can be split into continents, countries, counties, as well as cities and the aggregates are associated with the distinct sales for those. This can be done for other hierarchies, e.g., customer group, year, or department. Generally, the idea is that different aggregations enable detailed investigations. With Hypermodelling, we introduce the idea that programming elements, like annotations or classes, are similar to data that are used within a DW. For instance, classes are defined within a package hierarchy. Annotations are associated with classes and their members. They are also defined in their own package. This is like the association of a salesman to a region, time period, and revenues. Hierarchies in code are similar to hierarchies of region or time. All together, we load source code into a DW and realize associations of classes, their inheritance, packages, and annotations as a DW cube and execute queries on it. For this paper, we combine cubes that are used in [4, 5, 6]. We present the results of exemplarily queries on our aggregated cube as reports in this paper, see Section 3. 1 Note, clone detection methods can be configured and live detection is possible [3]. However, it still is a challenge. Thus, we present alternative approach as complementary subsidiary to code recommendation. 2 http://hypermodelling.com 3 Note, there are also open source Data Warehouse solutions available. For instance: http://pentaho.com. T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation 39 2. Clone Recommendation with Hypermodelling First, we describe the overall idea to use OLAP queries for code analysis. Afterwards, we refer to recommendation and clone detection. 2.1. Overall Idea We propose a query based approach to advance recommendation and clone detection. We use OLAP queries and our data warehouse approach, out of the size of modern code bases. Additionally, DW technology has many best practices and tuning methods at hand. Queries allow flexibility, adaptability, and DW technology scales for big data. Thus, our live query approach should help to avoid clones and give recommendations at the same time. Figure 1. Query refinement throughout the coding process A developer encodes functionality (coding) and in the background, her written code is used to execute an OLAP query against the Code-DW that contains coded structure and elements. For instance, such code can be parameters of methods. Then, the query result gets presented to the developer. This may be in the form of recommendation or in a notification of a code clone. Hence, this query result influences ongoing coding process. From the result, the programmer goes on and encodes more functionality. With this additional functionality, again a query can be executed that contains more information than the first. Every time, as more and more code exists, the query is more and more detailed. If the programmer finishes her current coding, the query is quite detailed with all code fragments that are belonging to a method. If another method shares enough similarities, the corresponding code is presented to the developer and she has to decide if a code clone exists. Currently, recommendation is not done live and neither is based on a query approach. Even more, code recommenders are inflexible and do not support slicing and dicing data for specific needs. This is often likewise the same for clone detection. So, with our Hypermodelling approach to load code into a DW, every query could be customized to meet specific requirements or just be sliced by a specific viewpoint. 2.2. Recommendation Code recommendation systems mine facts that plenty of programmers have done out of a code base. Imagine, a developer overrides a method. The written method is compared with the recommendation data and she gets proposed which methods were called by others that did an override of this method. Exactly the same information can be revealed with an OLAP query. This information can be presented to a developer to recommend her what others did. What we describe is a typical application of a code recommender system. Therefore, we propose a query process for recommender systems in Figure 1. A 40 T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation developer encodes functionality and queries containing this encoded functionality are done. These queries reveal similar code artifacts and most things that others have done in those are proposed to a developer. The developer continues to encode functionality. This encoded functionality can be added as refinement to the query to get more detailed information on what most others did in a similar situation. With these queries, the recommendation data can be generated live and adapted easily. The Hypermodelling approach ensures that different code bases can be loaded into a DW. This allows us to reuse infrastructure that is designed for big data. 2.3. Clone Detection Similarly to recommendation, we also realize live clone detection. Imagine a developer encodes a method, sharing a high similarity or equality to another method. Today, she continues her work. However, in this moment, she has knowledge at hand what she has programmed and can easily compare it with potential clones. This step fits perfectly into workflow and potential clones can be avoided easily. We propose to use our Hypermodelling approach to detect code clones live though a query based approach as described in Section 2.2. Through queries the detection process can be easily adjusted to project specific requirements. For clone detection, imagine a developer is encoding functionality in a method. She creates the method declaration with all method parameters and then she encodes logic into the method. Methods of objects are called and other constructs are realized. Like for recommenders, regularly queries are executed to determine duplicates. If similarity is too high, code is presented to the developer so that she can decide if she has produced a clone. We show this process in Figure 1, that also describes a clone detection cycle. 3. Evaluation In order to evaluate the application ability of recommendation and clone detection based on queries, we select a method of a program that we describe in [6]. Data in our DW and queries on a real project demonstrate that our approach enables recommendation not only with prepared data. We depict a class (AVMShareMapper) that implements two interfaces that are implemented by other classes of the application. Thereby, we query if others also implement these interfaces to ensure a valid example that shares similarities with other classes. We select the afterpropertiesSet method for investigation and divide the method in four different parts. For every part, we execute queries to simulate how a developer would encode this method and queries would be executed in the background. Our scenario is mainly based on live clone detection. However, the same approach of queries can be used for recommendation. Figure 2 shows the extended coding process that can be supported by queries. We imagine the coding process there as follows: In the first step a developer encodes the class body. Then, she goes on and encodes step by step a method. We split the afterpropertiesSet method and arranged it above, corresponding to the process of a developer. We describe exemplary queries in natural language beneath the process. Those queries can be executed with the DW query languages from the development environment in the background, while the developer is coding. Behind queries is their // Build the list of available AVM share names // Add the shared filesystem name to the list of m_avmShareNames = new StringList(); AVM shares SharedDeviceList shrList = m_avmShareNames. 4 Figure 2. Query based recommendation m_filesysConfig.getShares(); 3 addString(curShare.getName()); Enumeration shrEnum = and clone detection process shrList.enumerateShares(); // Set the AVM filesystem driver to be used while ( shrEnum.hasMoreElements()) when creating dynamic shares { if ( m_driver == null) // Get the current shared device and check if m_driver = (AVMDiskDriver) public void afterPropertiesSet() it is an AVM filesystem device curShare.getInterface(); { SharedDevice curShare = } // Save the server configuration 2 shrEnum.nextElement(); } 1 m_filesysConfig = try{ catch public class AVMShareMapper (FilesystemsConfigSection) if ( curShare.getInterface() instanceof ( InvalidDeviceInterfaceException implements ShareMapper, m_config.getConfigSection( AVMDiskDriver) ex) InitializingBean { FilesystemsConfigSection.SectionName); { … Finish the method and get presented the Implement more logic and recat on the Start encoding a new class Implement the method logic most similar method to decide if a clone has reminding been produced Query for Query for common afterPropertiesSet Query for all method calls Extended query with the with the same method methods of other methods that call new method calls name and superinterface interface getConfigSection() and the following, we go through the process of Figure 2: implementors implement InitializingBean T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation Show the user which methods others Tell the user that 2 other Remind the user that there is a similar method Present the method with the highest similarity to implement implementors called the same method and he should a void the clone the user 41 First (1), a developer starts encoding the class and implements the ShareMapper and the InitializingBean interface. These two interfaces are used to create a query for result or at least an excerpt. At the bottom, possible ensuing actions are described. In 42 T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation the most common method names of the classes that implement one of the interfaces. 4 The result shows the amount of children types with the same method name. The result is sorted following the occurrence of method names of ShareMapper implementers. It would be possible to sort the result after the occurrence in InitializingBean or to merge most common method names in both interface implementers. Anyway, the result information can be used to show developers which methods other developers implemented by extending a certain interface. For our scenario, we imagine that developers see plenty of times the afterPropertiesSet method is implemented and start encoding this. Thereby, it can be recognized that a developer uses a method name based on the InitializingBean and the following queries can be specialized on this interface. To give a better impression about the technique, we present the query in Listing 1. The parent class or interface is named ParentType and the CodeStructure is the OLAP cube. The query is based on multi-dimensional expressions standard 5 and shows that the amount of methods for children is computed for extenders of the ShareMapper and InitializingBean. SELECT { [ParentType].[Name].&[ShareMapper], [Parent].[Name].&[InitializingBean] } ON COLUMNS , { [Method].[Name].[All].CHILDREN } ON ROWS FROM [Code-Structure] WHERE ( [Measures].[Method-Count] ) Listing 1. Query for method names of interfaces children In the second step (2), the developer starts encoding logic of the afterPropertiesSet method. She calls a method of what is used to refine the former query. It is enriched with information which methods are called to reveal which types also obtain same methods, implement the interface and have the same method name. Such kind of similarity can be an indicator that the developer produces a clone. Furthermore, maybe the developer is putting effort and thoughts into implementing a method that is already implemented. Therefore, a developer gets presented other methods that share a high similarity. If a user rejects the proposals and wants to encode further functionality, the information about similar methods can also be used to present methods that are called in the similars. We show the corresponding query in Listing 2. The result types of this query that share a similarity (MultiTenantShareMapper, HomeShareMapper) can also be used in another query to generate recommendation information. Exemplarily, such a recommender query is shown in Listing 3. The called methods of two similar types (MultiTenantShareMapper, HomeShareMapper) are computed based on a query. This can be used to propose a user which other methods are called by other developers in a similar situation. SELECT { [CalledMethod].[Name].&[getConfigSection]} ON COLUMNS , {[Type].[Name].[All].CHILDREN} ON ROWS FROM [Code-Structure] WHERE ( [Measures].[Method Calls],[ParentType].[Name].&[InitializingBean], [Method].[Name].&[afterPropertiesSet] ) Listing 2. Determining similar code SELECT NON EMPTY { [Measures].[Method Calls Anzahl] } ON COLUMNS , { [Called Method].[Name].[All].CHILDREN } ON ROWS FROM [Code-Structure] WHERE ( [Type].[Name].&[HomeShareMapper], [Parent - Type].[Name].&[InitializingBean], [Method].[Name].&[afterPropertiesSet] ) Listing 3. Determining what similar code did for recommendation 4 Note, a query can be sliced by a project or a specific package. Plenty of customizations are thinkable. 5 We show a few queries to give an impression of the query language. For further information, see the MDX language reference: http://msdn.microsoft.com/en-us/library/ms145595.aspx / http://xmlforanalysis.com T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation 43 Succeeding (3), a developer encodes more functionality. In the background the query is extended to compare method calls of the current method with the ones of other extenders. It is like a fail save detector that tries to uncover if there is a clone produced. With every new method call, the query is extended and executed again. After six shared method calls and the super interface, enough indicators are collected to remind the user with a traffic light or a pop up about similarity. 6 Finally, in the forth step (4), the user finishes the method. This is recognized in the background and a complete query for all method calls is executed. Out of the high similarity of the method calls, it is assumed that the method logic is maybe “externalizable” and customizable through parameters. However, the easiest way is to present methods with a high similarity to the user and let him decide if this is a clone. 4. Related Work Our prior works describes a general approach how code structure can be processed with a DW and queries [4, 5, 6]. In this paper, we focused on a concrete application. Others describe different clone detection methods [2]. One clone detection method [3] proposes live clone detection with a client-server architecture. In comparison to our approach, the described detection [3] is not customizable. Another one, CloneDetective [8], offers an advanced framework and tool chain for clone detection, which is especially geared towards flexibility of clone detection research. Hypermodelling, on the contrary, targets to utilize DW technology. In general, we see the use of DWs as an addition to known clone detection mechanisms and not as competitor. DWs often already exist in enterprises where they are used for business applications. Therefore, our approach makes a reuse of DW technology for clone detection possible. Hence, further research should determine in detail if and which of those other clone detection approaches can be realized with DW technology. Other related work can be found in the area of code recommender systems [1]. They provide methods to mine data out of code bases and generate recommendations out of it. The whole recommendation is hard wired and fixed. Further, no scaling technology for live recommendations with a client server infrastructure is proposed. With our approach, through DW query based recommendations a more flexible and adjustable infrastructure is at hand. Therefore, we see the emerging need to investigate DW technology further to advance the recommender systems. 5. Conclusion and Future Work We described the problem of large code bases and the inflexibility of current code recommendation and challenges of clone detection systems. We proposed to overcome these limitations through a query based approach that uses DW technology. Our approach is evaluated and its application ability is shown by queries to a real code base. So, the next generation recommendation and clone detection techniques can be based on DW technology. Related work reveals possible synergies with other research and indicates that DW technology is a promising area to advance software engineering. 6 Note, at this moment such kind of queries can also be used to propose method calls like in step 2. 44 T. Frey, V. Köppen / Hypermodelling Live OLAP for Code Clone Recommendation In general, we see the need to describe our method in more details. Additionally, the use of DW technology enables further research on DW based clone detection. We showed the capability to find “full method clones”. However, clones can also be copied code fragments within different methods. Therefore, further investigations can focus on identification of queries for these clones. Thereby, we see ways to compute similarity indicators based on queries and the clone granularity level (method or fragment based) as important questions. The precision of our technique is fixed to method calls, discarding method parameters. Hence, the same called methods that take different parameters and are considered as clones. For that reason, the possibility to enhance or adjust the precision of our queries through additional facts should be considered. Therefore, we see the need to work together with industry developers to evaluate which level of precision and granularity is desired in practice. Furthermore, we connect code recommendation with clone detection. Therefore, our work makes it possible to regard both areas together and investigate possible synergies in the future. This connection is an additional difference to previous research. Generally, our approach shows a rudimentary and first scenario with primitive queries. More complex queries, scenarios, and areas are of interest for further investigations. For instance, the area of refactoring is also near clone detection and relations to it can be investigated. We also see an advanced trail by integrating the context (e.g., the package, the prior studied code, or the task) wherein a developer encodes functionality to advance recommendation systems. Currently, code recommenders are dull and based on the same rule set. Our dynamic query based approach enables further research, how recommendations need to be altered and adjusted to different contexts of a developer to respect his current programming tasks within the recommendation. References [1] M. Bruch, M. Mezini, and M. Monperrus, Mining subclassing directives to improve framework reuse, In: Proceedings of the 7th IEEE Working Conference on Mining Software Repositories, IEEE, 2010, 141-150. [2] C. K. Roy, J. R. Cordy, and R. Koschke, Comparison and evaluation of code clone detection techniques and tools: a qualitative approach, Science of Computer Programming 74(7) (2009), 470-495. [3] T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, S. Kawaguchi, and H. Iada, Shinobi: a Real-Time Code Clone Detection Tool for Software Maintenance, Technical Report NAIST-IS- TR2007011. Graduate School of Information Science, Nara Institute of Science and Technology, 2008. [4] T. Frey, V. Köppen, and G. Saake, Hypermodelling – introducing multi-dimensional concern reverse engineering. In: 2nd International ACM/GI Workshop on Digital Engineering (IWDE), Germany, 2011, 58-66. [5] T. Frey, Hypermodelling for drag and drop concern queries. In: Proceedings of Software Engineering 2012 (SE2012), Gesellschaft für Informatik (GI), Berlin, Germany, 2012, 107-118. [6] T. Frey and V. Köppen, Exploring software variance with hypermodelling – an exemplary approach. In: S. Jähnichen, A. Küpper, S. Albayrak, editors, Software Engineering 2012: Fachtagung des GI- Fachbereichs Softwaretechnik, Berlin, Germany, 2012, 121-140. [7] W. H. Inmon, Building the Data Warehouse. 4th ed. J.Wiley & Sons, New York, USA, 2005. [8] E. Juergens, F. Deissenboeck, and B. Hummel, CloneDetective – a workbench for clone detection research. In: Proceedings of the 30th International Conference on Software Engineering, IEEE, 2009, 603-606.