<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Sohon Roy Dept. of Software &amp; Computer Technology Delft University of Technology Delft</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>-Business rules represent the knowledge that guides the operations of a business organization. They are implemented in software applications used by organizations, and the activity of extracting them from software is known as business rule mining. It has various purposes amongst which migration and generating documentation are the most common. However, apart from conventional software, organizations also use spreadsheets for a large part of their operations and decision-making activities. Therefore we believe that spreadsheets are also rich in business rules. We thus propose to develop an automated system for extracting business rules from spreadsheets in a human comprehensible natural language format. This position paper describes our motivation, the problem description, related work, and challenges we foresee. Index Terms-End-user computing, Business rule mining, Spreadsheets, Knowledge mining.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        In her book author B. Halle writes that according to the
Business Rules Group1 a business rule is “a statement that
defines or constrains some aspect of the business. It is
intended to assert business structure or to control or influence
the behavior of the business” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Thus business rules are
rules that unambiguously determine the actions or results
necessary for desirable operation of a business. Therefore in
the context of software applications, it can be stated that
business rules are what that hold the knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that is
implemented in the form of programming instructions;
whether be it a conditional statement like IF-THEN-ELSE or
an expression like AREA=3.14*(RADIUS)^2. Thus for most
practical purposes, business rule mining from software
applications is essentially the mining of knowledge. Apart
from conventional software, all types of organizations also
depend heavily on the use of spreadsheets [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Due to their
wide use in all levels of company operations, the domain
knowledge that gets inculcated in spreadsheets is too valuable
a resource to be left untapped [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Therefore we want to
facilitate the extraction of business knowledge from
spreadsheets through a process of automated business rule
mining. Business rule mining is an activity that is also invoked
during migration of legacy software systems into systems that
are considered modern like SOA, modular software, or object
1 An independent organization formerly part of the users group Guidance of
Users of Integrated Data-Processing Equipment (GUIDE) of IBM corporation,
acknowledged as pioneers of the business rule approach
www.businessrulesgroup.org
oriented software [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]; but we want to apply the technique on
spreadsheets. The potential benefits of that are as follows.
      </p>
      <p>1) High Level Analysis of Spreadsheets – Extracting
business rules enables generation of documentation for
spreadsheets at a higher abstraction level than the spreadsheets
themselves. This facilitates the following:</p>
      <p>a) Comprehension – It becomes easier for end-users, who
are typically not programmers, to understand the structure and
operation of large and complex spreadsheets helping them
efficiently work with or modify such spreadsheets with
reduced errors and mistakes.</p>
      <p>b) Comparison – Comparing spreadsheets becomes
possible in order to estimate whether they implement same or
similar functionalities, or even are identical behavior-wise
only differing in data values. The latter cannot be done for
example by an application that compares spreadsheets in data
and formula level.</p>
      <p>c) Validation – Organizations using set of well-formed
and pre-laid business rules can validate whether the
spreadsheets created by their employees accurately implement
those rules or if there are errors in the logical level.</p>
      <p>2) Understanding of Organizational Business Rationale –
Some organization may not have their business strategies well
laid out in business rule format; yet vital business knowledge
of experts working in the company is hidden in spreadsheets.
Extracting this knowledge would help to form a clear picture
of how that organization works and its structure.</p>
      <p>3) Support for Migration – IT architects need to understand
the business logic when migrating functionalities and
computations implemented in spreadsheets into conventional
software. Furthermore business analysts need to ensure that
the IT architects understood it correctly. This can be achieved
through knowledge extraction and an automated process
would largely help in this regard.</p>
      <p>
        4) Safe Re-use and Replication of Spreadsheets – Often
spreadsheets are created on ad-hoc basis by experts in an
organization to implement their unique strategies for certain
scenarios. Over time such spreadsheets grow in size and
complexity and are used by several employees for similar
scenarios but with different data sets. Invariably the users are
forced to employ the method of copy-paste to replicate the
original spreadsheet and customize it according to their needs
by manipulating data and formula. However this process is
extremely error-prone [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is probably safer to re-generate
spreadsheets from scratch using the blueprint or structure of
the original spreadsheet instead of copy-pasting. Automated
business rule extraction can facilitate such blueprint formation
and thus make replications of spreadsheets safer.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. GOAL AND APPROACH</title>
      <p>Our goal is to devise an algorithm and subsequently an
application that will automatically extract business rules from
spreadsheets. Based on the successful implementation of such
an application our research questions will be as follows.</p>
      <p>RQ1: How accurate the automatically extracted
business rules will be as compared to those extracted
manually by domain experts and spreadsheet users?</p>
      <p>RQ2: How efficient is the automatic extraction process
compared to manually extracting business rules from
spreadsheets?</p>
      <p>Towards answering these research questions, we will
employ user-studies and controlled experiments, in which we
will compare the results of automatic and manual extraction of
business rules from spreadsheets.</p>
    </sec>
    <sec id="sec-3">
      <title>III. PROBLEM ILLUSTRATION</title>
      <p>Typical spreadsheets implement business rules to calculate
results. For example in Fig.1 the cell E19 contains the formula
SUM(E13:E18). From this formula our algorithm has to infer
the business rule “Total earned revenue =
Admissions+…+Other earned revenue”. Mapping E13:E18 to
Admissions…Other earned revenue is straightforward.
However there is more to determine as the Total Earned
Revenue is divided into columns for Last Year, Current Year,
etc. Thus the mapping becomes two dimensional. Furthermore
a parser will reach three blank rows and an auxiliary header
row (actuals, budget, etc.) before it reaches the “Year” column
header row. Making things even more challenging, the whole
structure is repeated into vertical blocks viz. Earned Revenue,
Private Sector Revenue. When mapping the rule “Total private
sector revenue=…” the parser will encounter formulas in the
19th row instead of reaching the column headers! Thus, same
formula repeated both vertically (in blocks) and horizontally
(in year columns), yet being distinct semantically, is a
considerable challenge.</p>
    </sec>
    <sec id="sec-4">
      <title>IV. RELATED WORK</title>
      <p>
        Mittermeir et al. proposed an approach for finding high
level structures in spreadsheets through logical and semantic
classification of cells [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Abraham et al. worked on header and
unit inference where units imply values or cell contents and the
headers are column headers or the labels [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Chatvichienchai
proposed a method for meta-data extraction from spreadsheets
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] where meta-data are the various labels and also the data that
are analogous to primary keys of databases. These works are
generally oriented towards the purpose of error reduction in
spreadsheets and are not motivated from the business rule
standpoint. Hermans et al. developed a method for extracting
class diagrams from spreadsheets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Our business rule
extraction algorithm will draw its foundation from the class
diagram extraction algorithm and improve upon its limitations.
      </p>
    </sec>
    <sec id="sec-5">
      <title>V. CONCLUDING REMARKS</title>
      <p>To summarize, this paper proposes an application for
business rule mining from spreadsheets and the research
questions RQ1 and RQ2. Such an application will facilitate
high level analysis of spreadsheets, understanding of
organizational business strategies, support for migration, and
better re-use of spreadsheets. However, due to their inherent
flexibility, spreadsheets do not impose any fixed structural
uniformity with regards to layout. This makes the mapping
between data and labels difficult and that will be a key
challenge to overcome.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>B. von Halle</surname>
          </string-name>
          ,
          <source>Business Rules Applied: Building Better Systems Using the Business Rule Approach</source>
          , Wiley Computer Publishing,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Morgan</surname>
          </string-name>
          , Business Rules and
          <article-title>Information Systems: Aligning IT with Business Goals, Addison-</article-title>
          <string-name>
            <surname>Wesley</surname>
          </string-name>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McDaid</surname>
          </string-name>
          ,
          <article-title>Using bayesian statistical methods to determine the level of error in large spreadsheets</article-title>
          ,
          <source>in Proc. of ICSE '09</source>
          ,
          <string-name>
            <surname>Companion</surname>
            <given-names>Volume</given-names>
          </string-name>
          ,
          <year>2009</year>
          , pp.
          <fpage>351</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Scaffidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <article-title>Estimating the numbers of end users and end user programmers</article-title>
          ,
          <source>Proc. of VL/HCC '05</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <article-title>Gathering domain knowledge from spreadsheets</article-title>
          ,
          <source>Proc. of ESEC/FSE '09 Doctoral Symposium</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sedee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinzger</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. van Deursen</surname>
          </string-name>
          ,
          <source>Data Clone Detection and Visualization in Spreadsheets, Proc. of ICSE '13</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>292</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mittermeir</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Clermont, Finding High-Level Structures in Spreadsheet</article-title>
          ,
          <source>Proc. of WCRE '02</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>221</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Abraham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Erwig</surname>
          </string-name>
          ,
          <article-title>Header and Unit Inference for Spreadsheets Through Spatial Analyses</article-title>
          ,
          <source>Proc. of VLHCC '04</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chatvichienchai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Spreadsheet Metadata Extraction: A Layout -Based Approach</surname>
          </string-name>
          ,
          <source>Database and Expert Systems Applications Lecture Notes in Computer Science</source>
          Volume
          <volume>7446</volume>
          ,
          <year>2012</year>
          , pp
          <fpage>147</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinzger</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. van Deursen</surname>
          </string-name>
          ,
          <source>Automatically Extracting Class Diagrams from Spreadsheets, Proc. of ECOOP '10</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>