Toward Reverse Engineering of VBA Based Excel Spreadsheet Applications Domenico Amalfitano, Nicola Amatucci, Vincenzo De Simone, Anna Rita Fasolino, Porfirio Tramontana Department of Electrical Engineering and Information Technologies University of Naples Federico II Via Claudio 21, Naples, Italy Email: domenico.amafitano@unina.it, nicola.amatucci@unina.it, vincenzo.desimone2@unina.it, anna.fasolino@unina.it, ptramont@unina.it Abstract—Modern spreadsheet systems can be used to imple- will present a technique for abstracting conceptual data models ment complex spreadsheet applications including data sheets, by analyzing spreadsheets data while in section III we will customized user forms and executable procedures written in present a tool supporting the abstraction of views representing a scripting language. These applications are often developed by practitioners that do not follow any software engineering the relationships between VBA procedures, user forms and practice and do not produce any design documentation. Thus, spreadsheet cells. Finally, in section IV some conclusions and spreadsheet applications may be very difficult to be maintained or future works will be presented. restructured. In this position paper we present in a nutshell two reverse engineering techniques and a tool that we are currently II. DATA M ODEL R EVERSE E NGINEERING realizing for the abstraction of conceptual data models and business logic models. The first technique presented in this paper regards the abstraction of conceptual data models from the analysis of the I. I NTRODUCTION structure and the information included in an Excel spreadsheet Spreadsheets are interactive software applications for ma- by means of heuristic rules. This technique is based on nipulation and storage of data. In a spreadsheet data are heuristic rules automatically applicable on a spreadsheet. By organized in worksheets, any of which is represented by a means of these rules, set of candidate classes with attributes, matrix of cells each containing either data or formulas that relationships between them and the corresponding cardinalities are automatically calculated at any variation of data. Modern are abstracted on the basis of the structure and of the properties spreadsheet systems (e.g. Microsoft Excel) are integrated with of spreadsheets and of their components, such as sheets, cells, scripting languages (e.g., Microsoft Visual Basic for Applica- cell headers, etc.. In particular, the heuristics consider cells tions). An end user can extend the presentation layer of the labels and data by looking for repeated data, synonyms and application with Visual Basic for Application (VBA in the group of cells organized as arrays or matrixes. following) by defining new User Forms and can develop new The considered rules have been extensively presented in [1] business functions by means of Procedures. The execution of and [2] where they have been used with success to abstract the these procedures is event-driven: they can be attached to events conceptual data model underlying some complex spreadsheet related to any element of the Excel Object Model. Whenever applications used in the automotive context as informative sys- one of these events occurs, the attached event handler code tems. The abstracted data model represented a useful starting is executed. VBA can also be used to programmatically point in the process of migration of an existing spreadsheet- access and modify the underlying Excel Object Model, i.e. based informative system towards a Web application. Some the hierarchy of objects contained in Excel representing all its of the rules were derived from works found in literature such accessible resources. as in [3] and [4]. These works are focused on spreadsheets Very often developers create complex spreadsheet appli- used as calculation sheets. The proposed techniques are based cations without any documentation at design level making on the analysis of formulas to abstract relationships between each task related to their maintenance very hard. In order to candidate classes. Instead, we focused on spreadsheets used as reconstruct design models of an existing spreadsheet applica- informative systems and proposed further rules based on the tion reverse engineering techniques and tools are needed. The analysis of cell properties and values. research community has devoted great attention to analysis of Figure 1 shows an example of a possible application of formulas in a spreadsheet, while they leave out the analysis of some of the proposed rules. First of all, the sheet of the the spreadsheets embedded code and the static and dynamic spreadsheet shown in Figure 1 may be abstracted as a UML relationships between the code and the spreadsheet’s cells. class. Moreover, the two distinct rectangular areas separated In this paper we will present in a nutshell two reverse by a blank column and composed of labels and data may be engineering techniques and a tool supporting the second abstracted as other two classes. Two composition relationships technique allowing the abstraction of design models of an between these two classes and the class representing the sheet existing spreadsheet application. In details, in section II we may be abstracted, too. The first rows of the two areas have cells with bold texts and different background colors: they is composed of a single workbook with 4 worksheets, a may be abstracted as attributes of the classes representing the VBProject including 6 code modules (that implement a total areas. The texts of these cells may be abstracted as names of of 10 procedures) and a user form with 12 controls. In the the attributes of the corresponding classes. right part of the figure there are some detailed information provided by the EXACT tool about a procedure selected by the user and a graph showing the dependencies between this procedure and the other application components. In particular, the considered procedure writes in 9 different groups of cells, reads the values of a group of cells and calls two procedures. Fig. 1. An example of application of heuristic rules to abstract a conceptual data model of a spreadsheet Fig. 2. Structural View of an Excel Spreadsheet III. B USINESS L OGIC M ODEL R EVERSE E NGINEERING IV. C ONCLUSIONS AND F UTURE W ORKS The second contribution presented in this position paper is In this position paper we have presented in a very concise related to a reverse engineering technique that we have realized way two reverse engineering techniques and a tool abstracting to abstract models of the business logic of a spreadsheet conceptual data models and business logic models of Excel application by statically analyzing sheets, user forms, VBA spreadsheets, taking into account both the data structure, procedures and their inter-relationships. The technique is com- the VBA code procedures, the User Forms and their inter- pletely supported by an interactive Excel add-in called EXACT relationships. In future work, we plan to extend the EXACT (EXcel Application Comprehension Tool) that we have real- tool by introducing the data model reverse engineering tech- ized and that is available at https://github.com/reverse-unina/ nique proposed in section II with the aim to make possible EXACT. It provides features for the extraction of information comprehension processes of Excel spreadsheet applications. and the abstraction of several views of an existing spreadsheet application. R EFERENCES The tool extracts information from a spreadsheet by ex- [1] D. Amalfitano, A. R. Fasolino, P. Tramontana, V. D. Simone, G. D. ploiting the features offered by the Office Primary Interop Mare, and S. Scala, “Information extraction from legacy spreadsheet- Assemblies for the Excel Application that exposed the Mi- based information system - an experience in the automotive context,” in DATA 2014 - Proceedings of 3rd International Conference on crosoft Excel Object Library and by statically analyzing the Data Management Technologies and Applications, Vienna, Austria, source VBA code. The tool is able to reconstruct the set 29-31 August, 2014, 2014, pp. 389–398. [Online]. Available: http: of elements composing the spreadsheet including sheets, user //dx.doi.org/10.5220/0005139603890398 [2] D. Amalfitano, A. Fasolino, V. Maggio, P. Tramontana, G. Di Mare, forms, event handlers, classes and procedures. Furthermore the F. Ferrara, and S. Scala, “Migrating legacy spreadsheets-based systems to tool recovers information related to the relationships between web mvc architecture: An industrial case study,” in Software Maintenance, these elements such as procedure calls, relationships between Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week - IEEE Conference on, Feb 2014, pp. 387–390. events and the corresponding handlers, and dependencies [3] R. Abraham and M. Erwig, “Inferring templates from spreadsheets,” between procedures and cells. in Proceedings of the 28th International Conference on Software The EXACT tool provides several features of software Engineering, ser. ICSE ’06. New York, NY, USA: ACM, 2006, pp. 182– 191. [Online]. Available: http://doi.acm.org/10.1145/1134285.1134312 visualization. It offers multiple views at different levels of [4] F. Hermans, M. Pinzger, and A. van Deursen, “Automatically extracting detail and provides cross-referencing functions for switching class diagrams from spreadsheets,” in Proceedings of the 24th European between these views. For example, the left part of Figure Conference on Object-oriented Programming, ser. ECOOP’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 52–75. [Online]. Available: 2 exemplifies the structural view proposed by EXACT for http://dl.acm.org/citation.cfm?id=1883978.1883984 an Excel spreadsheet. The figure shows that the spreadsheet