Semantics-aware Software Project Repositories Jonas Tappolet Department of Informatics, University of Zurich, Switzerland tappolet@ifi.uzh.ch Abstract. This proposal explores a general framework to solve software analysis tasks using ontologies. Our aim is to build semantically anno- tated, flexible, and extensible software repositories to overcome data representation, intra- and inter-project integration difficulties as well as to make the tedious and error-prone extraction and preparation of meta-data obsolete. We also outline a number of practical evaluation approaches for our propositions. 1 Research Problem In Software Engineering, many tools have been used for years to support the collaboration of development teams. Among others, these tools are typically a version control system holding the project’s files in a versioned manner and a bug tracking system in which defects and enhancement requests are stored. Re- search has shown that these repositories contain a huge amount of additional information that can be exploited to enhance the quality of software systems (such as the detection of error-prone software patterns or the prediction of the number of defects).An in-depth analysis reveals, however, that there are three main obstacles to seamless software analysis: (1) data representation — the nat- ural structure of the contents of these repositories is a typed graph (e.g., source code syntax trees). However, most of current software analysis tools use rela- tional database management systems requiring a transformation of the data to the relational format. More importantly, they rely on propositional (i.e., one table) representations for analysis, as most data mining algorithms use this for- mat. Moreover, the data is often extracted in a form that is suitable for only a particular task. These transformations are tedious, error-prone, and, usually, lossful. (2) Intra-project repository integration — each of these repositories is designed as a stand-alone system covering only a specific part of the software development process. In order to generate uniform views on software projects, methods are needed to overcome the boundaries of each isolated repository. (3) Inter-project repository integration — fewest software projects use solely their own code. Developers make pervasive use of components and frameworks hosted in remote repositories, weaving a world-wide call graph. Hence, repositories of different projects need to be accessed in a integrated manner. In this paper we present our EvoOnt approach to address these three problem areas. By integrating different repositories (data sources) using Semantic Web technologies, we end-up in a graph-based approach that is capable of handling distributed and heterogeneous software project data. 2 Related Work In this section, we summarize a small selection of the most closely related studies addressing the identified problem areas. Graph based software analysis. Collberg et al. [4] present Gevol, a graph-based visualization tool for CVS and Java. The aim is to support develop- ers understanding the software by providing visual representations of the source code and its history. Sager et al. [13] present Coogle (Code Google) that imple- ments a set of tree similarity measures to detect similarities between Java classes of different releases of software projects. Coogle’s approach is to first transform the abstract syntax tree (AST) representations of Java classes into intermediary FAMIX tree representations [6], and second, to measure their similarity by ap- plying tree similarity algorithms. FAMIX is a software source code meta model designed to serve as an exchange format for object-oriented programming lan- guages using flat text streams. Dietrich [7] proposed an OWL ontology to model the domain of software design patterns to automatically generate documentation about the patterns used in a software system. With the help of this ontology, the presented pattern scanner inspects the ASTs of source code fragments to identify the patterns used in the code. baetle is an open-source project that aims to add semantics to software repos- itories with a strong emphasis on bug tracker data.1 We tightly work together with the baetle developers to combine our ontologies with theirs. Intra-project repository integration. Fischer et al. [8] present their Release History Database (RHDB) that integrates versioning and bug tracking systems. The data is extracted and stored in a traditional relational database. Antoniol et al. [1] combined the relational RHDB with FAMIX to integrate source code with bug tracking and versioning system information. Inter-project repository integration. Chang and Mockus [3] present a method to detect file copies among different versioning systems to build a unified version history. None of the approaches above combines the strengths of an integrated soft- ware ontology to address all three obstacles identified in the introduction. 3 Contributions Data representation. Most of the studies presented in Section 2 use flat repre- sentations of source code. This forces analyses to be on a textual level. Although, valuable information can be extracted using text mining techniques, it is gen- erally hard to detect different types of source code changes (e.g., structural vs. nominal changes). Consider a similarity measure which is able to find changes on the textual source code level (i.e., treating software code as a string of charac- ters). Although the measure can, for instance, say that two software artifacts are different if their copyright notes have changed, it can, however, not say anything about the impact of this change on the software’s functionality. Therefore, we use 1 http://code.google.com/p/baetle/ a graph-based approach to model the repositories using RDF/OWL ontologies, which allows both textual and structural analyses. Intra-project repository integration. There are two different relation types among repositories. Implicit connections are defined by the data itself or given by the nature of a repository. The relation between a file’s meta-data (e.g., create date, file name, version information) and its content implicitly connects different versions of the source code. Explicit connections need to be manually defined. The connection between a bug report (i.e., a bug fix typically reported in a bug reporting tool such as Bugzilla) and a specific file version (typically stored in a version control systems such as CVS) needs to be explicitly defined by a developer. This is usually done by mentioning a bug number in the comment of a file’s new version (commit message), which can be extracted using simple text mining techniques [8] to link the respective bug report with a version of a file. However, the extractability of such explicit connections relies on disciplined and uniform reporting practices of a development team. Another method of linking a bug with a version is to compare the closing date of a bug report with the creation date of a new version. Having matching or near-matching dates is a strong indicator for a connection. With the integration of repositories we can access the history of a file with all changes made during a file’s life cycle. We can differentiate between evolutionary changes (extension of functionality) and maintenance changes (fixes of bugs). Inter-project repository integration. In a next step we can integrate a soft- ware project’s model with the models of used external components. Whenever a program makes a call to an entity that is not located inside the same project, this can be considered an external function call. Our aim is to map these calls to their representing source code model in a remote repository. One convenient method is to relate external calls in the same way as internal ones differentiat- ing them only using their differing namespaces (which need to have a uniform transformation to the source-code-namespace/package-declaration). Having this integration, we can explore the dependencies between a software and its com- ponents analyzing, for example, the impact of a component’s replacement with another (e.g., How does the code need to get adapted?) or the relation between bugs and the usage of external components. 4 Research Plan 4.1 Current State of our Research In a first step, we implemented a set of tools to extract and interconnect data from software repositories (i.e., CVS, Bugzilla, and Java), and store it as in- stances according to EvoOnt’s model. So far, we conducted several experiments using query techniques, reasoning, similarity measures, and machine learning to evaluate EvoOnt’s ability to serve as a general software analysis framework. We briefly summarize our conducted experiments. In our previous work [11, 10], the experiments are described in detail with example data from the Eclipse project. Software metrics. We used plain SPARQL queries to compute object-oriented software metrics [12]. These metrics are, e.g., the number of bugs per file, the relative number of bugs or the fan-in fan-out of a class. Software pattern detection. Using ontological reasoning, we are able to detect software patterns as well as anti-patterns, and code smells [12]. We achieved this by defining a pattern (anti-pattern) using the concepts from our EvoOnt. We build up an own pattern ontology which can, when conducting pattern detection, be combined with the existing ontologies an a reasoner will then match the defined patterns in the data. Similarity measures / Software evolution. Having the data in a graph- based format, we are able to calculate structural similarities between two versions of a source code file using iSPARQL [9] running similar analyses as Coogle by executing a single iSPARQL query. Machine learning. Using SPARQL-ML, a SPARQL extension with machine learning algorithms, we were able to simply reproduce tedious bug prediction analyses [2]. 4.2 Next Steps Intra-project integration. So far, we used Bugzilla-, CVS-, and Java-repo- sitories as data sources to extract software information from. However, there are various other products, we plan to integrate: Jira (bug tracker), Subversion (verisoning system), and C# (programming language) are our next candidates to write RDF/OWL interfaces for. On the other hand, there are other repository types involved in the software development process such as mailing-lists and forums which we plan to integrate as well into our unified framework. These types reflect the social network structure around the development process. Inter-project integration. Inspired from Data-Warehousing, where hetero- geneous data is accessed through a sole interface, we plan to implement such behavior in EvoOnt as well. With the implementation of web-based, RDF/OWL enabled interfaces exposing SPARQL-endpoints (e.g., for versioning systems), it would be able to execute and answer SPARQL queries over the web. This enables a repository to be linked from any other software project using this component’s functionality. Integration with software project repositories. We intend to integrate the semantic capabilities of EvoOnt into commonly used software project repository tools (such as Subversion and Jira) making the tedious extraction and prepara- tion of meta-data obsolete. Evaluation. In our first set of experiments, we evaluated EvoOnt against a wide variety of software analysis tasks. In a next step, we plan to deepen certain experiments. A first selection is: The identification of the location of a bug. Usually, a developer links a version of a source code file with a bug report whenever she fixed a specific bug. Derived from this information, we can use graph algorithms to compute deltas between the fixed and the pre-fixed source code version resulting in a subgraph exactly representing the change made for fixing a bug. Having this portion of change, we can try to identify the point in the history of the software when this changed code fragment was inserted or modified. Our hypothesis would be that this is the point in the software history where the bug was introduced. Analyze the code co-evolution of projects and their components. This important task has been very difficult so far, as the inter-project connections has been largely missing. The inter-project integration allows to uncover the relation be- tween bugs of different projects. A bug may be misleadingly reported in a project due to a bug in the referenced project. Moreover, the impact of updating to a new version of a component that includes several bug fixes, and may have changed its behavior can be made visible. Analyze the connection between code elements and people with respect to their relationship to Conway’s Law [5] and perform other in-depth social network analyses. 5 Conclusions This proposal outlined the advantages of applying Semantic Web technologies to the field of Software Analysis. Specifically, we discussed how Semantic Web technologies allow to overcome the data representation, intra-, and inter-project repository integration problems. We, furthermore, succinctly outlined how we intend to evaluate our approach by showing its usefulness for a variety of software analysis tasks and publish the findings in the software engineering literature. We also indicated our plans to develop semantically annotated software repositories, which will make the extraction and preparation of meta-data obsolete. References 1. G. Antoniol, M. D. Penta, H. C. Gall, and M. Pinzger. Towards the integration of versioning systems, bug reports and source code meta-models. ENTCS, 2005. 2. A. Bernstein, J. Ekanayake, and M. Pinzger. Improving defect prediction using temporal features and non linear models. IWPSE, 2007. 3. H.-F. Chang and A. Mockus. Constructing universal version history. MSR, 2006. 4. C. Collberg, S. Kobourov, J. Nagra, J. Pitts, and K. Wampler. A system for graph-based visualization of the evolution of software. SoftVis, 2003. 5. M. Conway. How do committees invent? Datamation, 1968. 6. S. Demeyer, S. Tichelaar, and P. Steyaert. FAMIX 2.0 - the FAMOOS information exchange model. 1999. 7. J. Dietrich and C. Elgar. A formal description of design patterns using owl. ASWEC, 2005. 8. M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. ICSM, 2003. 9. C. Kiefer, A. Bernstein, and M. Stocker. The fundamentals of iSPARQL - a virtual triple approach for similarity-based semantic web tasks. ISWC, 2007. 10. C. Kiefer, A. Bernstein, and J. Tappolet. Analyzing software with iSPARQL. SWESE, 2007. 11. C. Kiefer, A. Bernstein, and J. Tappolet. Mining software repositories with iS- PARQL and a software evolution ontology. MSR, 2007. 12. M. Lanza and R. Marinescu. Object-Oriented Metrics in Practice. Springer, 2006. 13. T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer. Detecting similar java classes using tree algorithms. MSR, 2006.