Business Rule Mining from Spreadsheets

Sohon Roy
Dept. of Software & Computer Technology
Delft University of Technology
Delft, Netherlands
S.Roy-1@tudelft.nl

Abstract: Business rules represent the knowledge that guides the operations of a business organization. They are implemented in software applications used by organizations, and the activity of extracting them from software is known as business rule mining. It has various purposes, amongst which migration and generating documentation are the most common. However, apart from conventional software, organizations also use spreadsheets for a large part of their operations and decision-making activities. Therefore we believe that spreadsheets are also rich in business rules. We thus propose to develop an automated system for extracting business rules from spreadsheets in a human-comprehensible natural language format. This position paper describes our motivation, the problem description, related work, and the challenges we foresee.

Index Terms: End-user computing, Business rule mining, Spreadsheets, Knowledge mining.

I. INTRODUCTION & MOTIVATION

In her book, B. von Halle writes that according to the Business Rules Group¹ a business rule is “a statement that defines or constrains some aspect of the business. It is intended to assert business structure or to control or influence the behavior of the business” [1]. Thus business rules are rules that unambiguously determine the actions or results necessary for the desirable operation of a business. Therefore, in the context of software applications, it can be stated that business rules are what hold the knowledge [1] that is implemented in the form of programming instructions, be it a conditional statement like IF-THEN-ELSE or an expression like AREA=3.14*(RADIUS)^2. Thus for most practical purposes, business rule mining from software applications is essentially the mining of knowledge.

¹ An independent organization, formerly part of the users group Guidance of Users of Integrated Data-Processing Equipment (GUIDE) of IBM corporation, acknowledged as pioneers of the business rule approach. www.businessrulesgroup.org

Apart from conventional software, all types of organizations also depend heavily on the use of spreadsheets [3, 4]. Due to their wide use at all levels of company operations, the domain knowledge that gets embedded in spreadsheets is too valuable a resource to be left untapped [5]. Therefore we want to facilitate the extraction of business knowledge from spreadsheets through a process of automated business rule mining. Business rule mining is an activity that is also invoked during migration of legacy software systems into systems that are considered modern, like SOA, modular software, or object oriented software [1, 2]; but we want to apply the technique to spreadsheets. The potential benefits of that are as follows.

1) High Level Analysis of Spreadsheets – Extracting business rules enables generation of documentation for spreadsheets at a higher abstraction level than the spreadsheets themselves. This facilitates the following:

a) Comprehension – It becomes easier for end-users, who are typically not programmers, to understand the structure and operation of large and complex spreadsheets, helping them work with or modify such spreadsheets efficiently and with fewer errors.

b) Comparison – Comparing spreadsheets becomes possible in order to estimate whether they implement the same or similar functionalities, or are even identical behavior-wise and differ only in data values. The latter cannot be done, for example, by an application that compares spreadsheets at the data and formula level.

c) Validation – Organizations using a set of well-formed and pre-laid business rules can validate whether the spreadsheets created by their employees accurately implement those rules or whether there are errors at the logical level.

2) Understanding of Organizational Business Rationale – Some organizations may not have their business strategies well laid out in business rule format; yet vital business knowledge of experts working in the company is hidden in spreadsheets. Extracting this knowledge would help to form a clear picture of how that organization works and of its structure.

3) Support for Migration – IT architects need to understand the business logic when migrating functionalities and computations implemented in spreadsheets into conventional software. Furthermore, business analysts need to ensure that the IT architects understood it correctly. This can be achieved through knowledge extraction, and an automated process would largely help in this regard.

4) Safe Re-use and Replication of Spreadsheets – Often spreadsheets are created on an ad-hoc basis by experts in an organization to implement their unique strategies for certain scenarios. Over time such spreadsheets grow in size and complexity and are used by several employees for similar scenarios but with different data sets. Invariably the users are forced to employ the method of copy-paste to replicate the original spreadsheet and customize it according to their needs by manipulating data and formulas. However, this process is extremely error-prone [6]. It is probably safer to re-generate spreadsheets from scratch using the blueprint or structure of the original spreadsheet instead of copy-pasting. Automated business rule extraction can facilitate such blueprint formation and thus make replications of spreadsheets safer.

II. GOAL AND APPROACH

Our goal is to devise an algorithm and subsequently an application that will automatically extract business rules from spreadsheets. Based on the successful implementation of such an application, our research questions will be as follows.

RQ1: How accurate will the automatically extracted business rules be compared to those extracted manually by domain experts and spreadsheet users?

RQ2: How efficient is the automatic extraction process compared to manually extracting business rules from spreadsheets?

Towards answering these research questions, we will employ user studies and controlled experiments, in which we will compare the results of automatic and manual extraction of business rules from spreadsheets.

III. PROBLEM ILLUSTRATION

Fig. 1. Spreadsheet for calculation of revenues

Typical spreadsheets implement business rules to calculate results. For example, in Fig. 1 the cell E19 contains the formula SUM(E13:E18). From this formula our algorithm has to infer the business rule “Total earned revenue = Admissions+…+Other earned revenue”. Mapping E13:E18 to Admissions…Other earned revenue is straightforward. However, there is more to determine, as the Total Earned Revenue is divided into columns for Last Year, Current Year, etc. Thus the mapping becomes two dimensional. Furthermore, a parser will reach three blank rows and an auxiliary header row (actuals, budget, etc.) before it reaches the “Year” column header row. Making things even more challenging, the whole structure is repeated in vertical blocks, viz. Earned Revenue, Private Sector Revenue. When mapping the rule “Total private sector revenue=…” the parser will encounter formulas in the 19th row instead of reaching the column headers! Thus the same formula, repeated both vertically (in blocks) and horizontally (in year columns) yet semantically distinct in each place, is a considerable challenge.

IV. RELATED WORK

Mittermeir et al. proposed an approach for finding high level structures in spreadsheets through logical and semantic classification of cells [7]. Abraham et al. worked on header and unit inference, where units imply values or cell contents and the headers are column headers or the labels [8]. Chatvichienchai proposed a method for meta-data extraction from spreadsheets [9], where meta-data are the various labels and also the data that are analogous to primary keys of databases. These works are generally oriented towards the purpose of error reduction in spreadsheets and are not motivated from the business rule standpoint. Hermans et al. developed a method for extracting class diagrams from spreadsheets [10]. Our business rule extraction algorithm will draw its foundation from the class diagram extraction algorithm and improve upon its limitations.

V. CONCLUDING REMARKS

To summarize, this paper proposes an application for business rule mining from spreadsheets and the research questions RQ1 and RQ2. Such an application will facilitate high level analysis of spreadsheets, understanding of organizational business strategies, support for migration, and better re-use of spreadsheets. However, due to their inherent flexibility, spreadsheets do not impose any fixed structural uniformity with regard to layout. This makes the mapping between data and labels difficult, and that will be a key challenge to overcome.

REFERENCES

[1] B. von Halle, Business Rules Applied: Building Better Systems Using the Business Rule Approach, Wiley Computer Publishing, 2002.
[2] T. Morgan, Business Rules and Information Systems: Aligning IT with Business Goals, Addison-Wesley, 2002.
[3] L. Bradley, K. McDaid, Using bayesian statistical methods to determine the level of error in large spreadsheets, Proc. of ICSE ’09, Companion Volume, 2009, pp. 351-354.
[4] C. Scaffidi, M. Shaw, B. A. Myers, Estimating the numbers of end users and end user programmers, Proc. of VL/HCC ’05, 2005, pp. 207-214.
[5] F. Hermans, Gathering domain knowledge from spreadsheets, Proc. of ESEC/FSE ’09 Doctoral Symposium, 2009, pp. 37-38.
[6] F. Hermans, B. Sedee, M. Pinzger, A. van Deursen, Data Clone Detection and Visualization in Spreadsheets, Proc. of ICSE ’13, 2013, pp. 292-301.
[7] R. Mittermeir, M. Clermont, Finding High-Level Structures in Spreadsheet, Proc. of WCRE ’02, 2002, pp. 221-232.
[8] R. Abraham, M. Erwig, Header and Unit Inference for Spreadsheets Through Spatial Analyses, Proc. of VL/HCC ’04, 2004, pp. 165-172.
[9] S. Chatvichienchai, Spreadsheet Metadata Extraction: A Layout-Based Approach, Database and Expert Systems Applications, Lecture Notes in Computer Science Volume 7446, 2012, pp. 147-160.
[10] F. Hermans, M. Pinzger, A. van Deursen, Automatically Extracting Class Diagrams from Spreadsheets, Proc. of ECOOP ’10, 2010, pp. 52-75.
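
As an illustration of the simplest step discussed in Section III, mapping the range in SUM(E13:E18) to row labels, the following minimal Python sketch may help. It is not the proposed algorithm. It assumes a toy sheet stored as a dictionary from cell address to content, and it assumes that row labels sit in column A; the paper does not state the label column. Only the labels "Admissions", "Other earned revenue" and "Total earned revenue" come from the example; the intermediate labels are placeholders, and the names SHEET, row_label and rule_from_sum are hypothetical.

```python
import re

# Hypothetical toy sheet: cell address -> content.
# Assumption: labels are in column A; rows 14-17 are placeholders because
# the paper elides the intermediate labels.
SHEET = {
    "A13": "Admissions",
    "A14": "Placeholder item 14",
    "A15": "Placeholder item 15",
    "A16": "Placeholder item 16",
    "A17": "Placeholder item 17",
    "A18": "Other earned revenue",
    "A19": "Total earned revenue",
    "E19": "=SUM(E13:E18)",
}

# Matches formulas of the form SUM(<col><row>:<col><row>), with optional "=".
SUM_RANGE = re.compile(r"^=?SUM\(([A-Z]+)(\d+):([A-Z]+)(\d+)\)$")


def row_label(sheet, row, label_col="A"):
    """Return the label found in the label column for a given row."""
    return sheet.get(f"{label_col}{row}", f"<unlabeled row {row}>")


def rule_from_sum(sheet, cell, label_col="A"):
    """Render a single-column SUM formula as a textual rule over row labels."""
    formula = sheet[cell].replace(" ", "").upper()
    match = SUM_RANGE.match(formula)
    if match is None:
        raise ValueError(f"{cell} does not hold a plain SUM over one range")
    first_col, first_row, last_col, last_row = match.groups()
    if first_col != last_col:
        raise ValueError("only single-column ranges are handled in this sketch")
    # Row number of the cell that holds the formula, e.g. 19 for E19.
    target_row = int(re.search(r"\d+$", cell).group())
    operands = [
        row_label(sheet, row, label_col)
        for row in range(int(first_row), int(last_row) + 1)
    ]
    return f"{row_label(sheet, target_row, label_col)} = " + " + ".join(operands)


if __name__ == "__main__":
    print(rule_from_sum(SHEET, "E19"))
```

The sketch covers only the one-dimensional mapping that the text calls straightforward. It does not attempt the harder parts described above: locating column headers past blank rows and auxiliary header rows, or telling apart the same formula repeated across year columns and vertical blocks.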