<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards the Conceptual Specification of Statistical Functions with OCL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordi Cabot</string-name>
          <email>jcabot@cs.toronto.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose-Norberto Mazón</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Pardillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Trujillo</string-name>
          <email>jtrujillo@dlsi.ua.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Alicante</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Toronto</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Current proposals for designing information systems lack mechanisms to define statistical functions at the conceptual level. Therefore, queries containing this kind of function are defined once the rest of the system has already been implemented, which requires much effort and expertise. Accordingly, the goal of this paper is to show the benefits of extending the Object Constraint Language (OCL) with a predefined set of statistical functions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Queries containing statistical functions are highly important for users to satisfy
their information needs in a comprehensive manner [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, information
systems design gives little importance to the definition of these kinds of complex
queries [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Currently, queries are not expressed at the conceptual level, thus
requiring a lot of effort and expertise in the target implementation platform
and preventing designers from validating early on that the conceptual schema
satisfies the users’ requirements.
      </p>
      <p>
        The main restriction for defining queries at the conceptual level is the rather
limited support offered by current conceptual modeling languages. Surprisingly,
essential statistical functions are not predefined in these languages: OCL [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
only provides the sum, size and count operations. ConQuer-II [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
ConceptBase/TELOS [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] offer just a basic set of predefined statistical functions (such as
avg, min and so on). Finally, the ER language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] does not include a specific
language for expressing queries, and complementary query languages proposed
later on (such as [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">1, 2, 5</xref>
        ]) provide, at most, basic operations.
      </p>
      <p>
        It is then clear that better support for statistical functions is necessary to
easily express complex queries as part of the definition of a conceptual schema,
thus avoiding the error-prone and time-consuming task of defining them once
the system is implemented. To this end, we propose to extend the standard
OCL library with a new set of statistical functions that designers can use when
defining queries at the conceptual level. These new functions have been tested
on sample data by using one of the well-known case studies from Kimball’s
book [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]: an airline’s marketing department wants to analyze the flight activity
of each member of its frequent flyer program. The department is interested in
seeing what flights the company’s frequent flyers take, which planes they travel
with, what fare basis they pay, how often they upgrade, and how they earn their
frequent flyer miles<sup>3</sup>.
      </p>
      <p>
        The case study has been implemented in the USE tool [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in order to ensure
the well-formedness of OCL expressions and facilitate their validation by
providing an evaluation environment. Figure 1 shows the implementation of our case
study. In the background of the USE environment we can see the frequent flyers
class diagram (left-hand side) and the script that loads the data (objects and
links) into the corresponding classes and associations (right-hand side). Given
this class diagram, users can request a set of queries to retrieve useful
information from the system. For instance, they are probably interested in knowing
the miles earned by a frequent flyer in his/her trips from a given airport (e.g.
airports located in Colorado) in a given fare class. Many other queries can be
similarly defined by using other statistical functions in order to analyze data in
a richer manner.
      </p>
      <p>
        Therefore, we believe that it is highly important to provide all
kinds of statistical functions as predefined constructs of the modeling
language, so that complex queries can be defined and validated at the
conceptual level regardless of the final technology platform chosen to
implement the system. This paper is a starting point for this research,
since we show the feasibility of extending the OCL language with
statistical functions.
      </p>
      <p>
        <sup>3</sup> Note that, in this case study, the interest is in actual flight activity, but not in
reservation or ticketing activity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Extending OCL with Statistical Functions</title>
      <p>
        Conceptual modeling languages require the use of a general-purpose (textual)
sublanguage to express all kinds of queries, constraints and derivation rules since
most of them cannot be expressed using only the graphical constructs provided
by the conceptual modeling language [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For UML conceptual schemas, the
Object Constraint Language (OCL [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) is typically used for this purpose. The
goal of this section is to extend OCL with a new set of predefined statistical
functions to facilitate the definition of complex queries on UML schemas.
      </p>
      <sec id="sec-2-1">
        <title>Preliminary OCL Concepts</title>
        <p>OCL is a rich language that offers predefined mechanisms for retrieving the
values of the attributes of an object, for navigating through a set of related
objects, for iterating through collections of objects (e.g., by means of the forAll,
exists and select iterators) and so forth. As part of the language, a standard
library is also provided, including a predefined set of types and a list of predefined
operations that can be applied to those types. The types can be primitive
(Integer, Real, Boolean and String) or collection types (Set, Bag, OrderedSet and
Sequence). Some examples of operations provided for those types are: and, or,
not (Boolean), +, −, ∗, &gt;, &lt; (Real and Integer), union, size, includes, count
and sum (Set).</p>
        <p>All these constructs can be used in the definition of OCL constraints,
derivation rules, queries and pre/post-conditions. In particular, definition of queries
follows the template:</p>
        <preformat>context Class::Q(p1:T1, ..., pn:Tn): Tresult
body: Query-ocl-expression</preformat>
        <p>where the query Q returns the result of evaluating the Query-ocl-expression
by using the arguments passed as parameters in its invocation on an object of the
context type Class. Apart from the parameters p1 ... pn, in Query-ocl-expression
designers may use the implicit parameter self (of type Class) representing the
object on which the operation has been invoked.</p>
        <p>As an example, the query introduced earlier (total miles earned by a frequent flyer on
his/her trips from Colorado in a given fare class) can be defined as follows:</p>
        <preformat>context Customer::sumMiles(fc: FareClass): Real
body: self.frequentFlyerLegs-&gt;select(f | f.fareClass = fc and
        f.origin.city.name = 'Colorado').Miles-&gt;sum()</preformat>
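        <p>As a further illustration of the template, the following sketch counts the flight legs of a customer in a given fare class using only operations from the standard library. The query name numberOfLegs is hypothetical; the navigation frequentFlyerLegs and the attribute fareClass are taken from the sumMiles example.</p>
        <preformat>-- Hypothetical query: number of flight legs flown in fare class fc
context Customer::numberOfLegs(fc: FareClass): Integer
body: self.frequentFlyerLegs-&gt;select(f | f.fareClass = fc)-&gt;size()</preformat>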
        <p>
          Unfortunately, many other interesting queries cannot be similarly defined
since the operators required to define such queries are not part of the standard
library. Next, we present our extension to the OCL standard library to include
new statistical operators. The statistical functions included in our study are
among the most widely used in data analysis<sup>4</sup>. These functions can be classified
into three groups, following [
          <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
          ]: distributive, algebraic and holistic
functions.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Distributive functions</title>
        <p>Distributive functions can be defined by structural recursion, i.e., the input
collection can be partitioned into subcollections that can be individually aggregated
and combined. One example is the max function, which returns the element with the
highest value in a non-empty collection of objects of type T. T must
support the &gt;= operation. If several elements share the highest value, one of them
is selected arbitrarily.</p>
        <preformat>context Collection::max():T
pre: self-&gt;notEmpty()
post: result = self-&gt;any(e | self-&gt;forAll(e2 | e &gt;= e2))</preformat>
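        <p>By analogy with max, a min function could be sketched as shown below. This is only an illustration, assuming that T supports the &lt;= operation; it is not necessarily the definition used in the library. The trailing comment states the distributive property informally for two hypothetical non-empty bags of numbers, a and b.</p>
        <preformat>-- Sketch of min, symmetric to max (assumes T supports &lt;=)
context Collection::min():T
pre: self-&gt;notEmpty()
post: result = self-&gt;any(e | self-&gt;forAll(e2 | e &lt;= e2))

-- Distributive property (a and b are hypothetical non-empty bags of numbers):
--   a-&gt;union(b)-&gt;max() = Bag{a-&gt;max(), b-&gt;max()}-&gt;max()</preformat>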
      </sec>
      <sec id="sec-2-3">
        <title>Algebraic functions</title>
        <p>Algebraic functions are expressed as finite algebraic expressions over distributive
functions, e.g., average is computed by using the count and sum functions. The
average function returns the arithmetic mean of the elements in a
non-empty collection. The type of the elements in the collection must support
the + and / operations.</p>
        <p>Holistic functions are all other functions, i.e., those that are neither distributive nor algebraic.
One example is the mode function, which returns the most frequent value in a
collection.</p>
        <preformat>context Collection::mode(): T
pre: self-&gt;notEmpty()
post: result = self-&gt;any(e | self-&gt;forAll(e2 |
        self-&gt;count(e) &gt;= self-&gt;count(e2)))</preformat>
        <p><sup>4</sup> Due to space constraints, we only mention some examples in this paper, but all the
defined statistical functions are available at http://www.lucentia.es/research/ocllib.html</p>
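        <p>The following sketches illustrate one algebraic and one holistic function in the style of the max and mode definitions. They are assumptions made for illustration and not necessarily the definitions found in the library: avg is written directly in terms of the standard sum and size operations, and median (a typical holistic function, chosen here as an example) assumes numeric elements that can be ordered with sortedBy.</p>
        <preformat>-- Algebraic: arithmetic mean built from sum and size
context Collection::avg(): Real
pre: self-&gt;notEmpty()
post: result = self-&gt;sum() / self-&gt;size()

-- Holistic: median of a collection of numbers (illustrative sketch)
context Collection::median(): Real
pre: self-&gt;notEmpty()
post: let s : Sequence(T) = self-&gt;asSequence()-&gt;sortedBy(e | e) in
      let n : Integer = s-&gt;size() in
      result = if n.mod(2) = 1
               then s-&gt;at((n + 1).div(2))
               else (s-&gt;at(n.div(2)) + s-&gt;at(n.div(2) + 1)) / 2
               endif</preformat>
        <p>The median cannot be obtained by combining the medians of subcollections, which is why it falls outside the distributive and algebraic groups.</p>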
      </sec>
      <sec id="sec-2-4">
        <title>Applying the functions</title>
        <p>These statistical operations can be used in exactly the same way as any other
OCL function. As an example, we show the use of the avg function to compute
the average number of miles earned by a customer per flight leg.</p>
        <preformat>context Customer::avgMilesPerFlightLeg():Real
body: self.frequentFlyerLegs.Miles-&gt;avg()</preformat>
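        <p>Other functions of the library could be applied in the same manner. The two queries below are hypothetical (their names are not part of the case study) and assume that max and mode are available as defined in Section 2, that Miles is an Integer attribute, and that each flight leg has a fareClass, as in the sumMiles example.</p>
        <preformat>-- Hypothetical: highest number of miles earned in a single flight leg
context Customer::maxMilesInLeg(): Integer
body: self.frequentFlyerLegs.Miles-&gt;max()

-- Hypothetical: fare class most frequently used by the customer
context Customer::mostFrequentFareClass(): FareClass
body: self.frequentFlyerLegs.fareClass-&gt;mode()</preformat>
        <p>Both queries rely on the dot shorthand for collect, which yields a Bag, so repeated values are preserved; this matters for mode. Because of the notEmpty precondition, neither query is defined for a customer without flight legs.</p>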
        <p>In the foreground of Fig. 1 we show one of the queries we have used to test
our functions in the USE tool (in this case the query is used to check our avg
function) together with the resulting collection of data returned by the query.
Interested readers can download<sup>5</sup> the scripts and data of our running example
together with the definition of our library of statistical functions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Work</title>
      <p>Support for defining complex queries is very limited in conceptual modeling
languages, which hinders designers from directly specifying this kind of query
and prevents them from easily satisfying the user requirements. Specifically, queries
containing statistical functions cannot be easily defined in OCL since these functions are
not part of the standard library and must therefore be manually defined by
the designer, which is an error-prone and time-consuming activity (due to the
complexity of some statistical functions).</p>
      <p>To solve this problem we argue in this paper that the OCL Standard Library
should be extended by predefining a list of new statistical functions that can be
used by designers in the definition of their OCL expressions.</p>
      <p>Our short-term future work is to grow the number of predefined functions
in our library and align them with current Model-Driven Development (MDD)
and Model-Driven Architecture (MDA) approaches, where the implementation
of the system is supposed to be (semi)automatically generated from its
high-level models. The definition of all queries at the conceptual level permits a more
complete code-generation phase, including the automatic translation of these
queries from their initial platform-independent definition to the final
(platform-dependent) implementation.</p>
      <p><sup>5</sup> http://www.lucentia.es/research/ocllib.html</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>Work supported by the projects: TIN2008-00444, ESPIA (TIN2007-67078) from
the Spanish Ministry of Education and Science (MEC), QUASIMODO
(PAC080157-0668) from the Castilla-La Mancha Ministry of Education and Science
(Spain), and DEMETER (GVPRE/2008/063) from the Valencia Government
(Spain). Jose-Norberto Mazón and Jesús Pardillo are funded by MEC under FPU
grants AP2005-1360 and AP2006-00332, respectively. Jordi Cabot is funded by
the 2007 BP-A 00128 grant (Catalan Government).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Andries</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Engels</surname>
          </string-name>
          .
          <article-title>A hybrid query language for an extended entity-relationship model</article-title>
          .
          <source>J. Vis. Lang. Comput.</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>321</fpage>
          -
          <lpage>352</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Angelaccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Catarci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Santucci</surname>
          </string-name>
          .
          <article-title>QBD*: a graphical query language with recursion</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          ,
          <volume>16</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1150</fpage>
          -
          <lpage>1163</lpage>
          ,
          <year>Oct 1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bloesch</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Halpin</surname>
          </string-name>
          .
          <article-title>Conceptual queries using ConQuer-II</article-title>
          . In D. W. Embley and R. C. Goldstein, editors,
          <source>ER</source>
          , volume
          <volume>1331</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>113</fpage>
          -
          <lpage>126</lpage>
          . Springer,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The entity-relationship model - toward a unified view of data</article-title>
          .
          <source>ACM Trans. Database Syst.</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>9</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Czejdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rusinkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Embley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Reddy</surname>
          </string-name>
          .
          <article-title>A visual query language for an ER data model</article-title>
          .
          <source>IEEE Workshop on Visual Languages</source>
          , pages
          <fpage>165</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>Oct 1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Embley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Woodfield</surname>
          </string-name>
          .
          <source>Object-Oriented Systems Analysis. A Model-Driven Approach</source>
          . Yourdon Press Computing Series,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Gogolla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Büttner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Richters</surname>
          </string-name>
          .
          <article-title>USE: A UML-based specification environment for validating UML and OCL</article-title>
          .
          <source>Sci. Comput. Program.</source>
          ,
          <volume>69</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Layman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Venkatrao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pellow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          .
          <article-title>Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub totals</article-title>
          .
          <source>Data Min. Knowl. Discov.</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>29</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>R.</given-names>
            <surname>Kimball</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ross</surname>
          </string-name>
          .
          <article-title>The Data Warehouse Toolkit</article-title>
          . Wiley &amp; Sons,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Lenz</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Thalheim</surname>
          </string-name>
          .
          <article-title>OLAP schemata for correct applications</article-title>
          . In D. Draheim and G. Weber, editors,
          <source>TEAA</source>
          , volume
          <volume>3888</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>99</fpage>
          -
          <lpage>113</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>J.</given-names>
            <surname>Mylopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borgida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          .
          <article-title>Telos: Representing knowledge about information systems</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>8</volume>
          (
          <issue>4</issue>
          ):
          <fpage>325</fpage>
          -
          <lpage>362</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Object Management Group.
          <source>UML 2.0 OCL Specification</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Olivé</surname>
          </string-name>
          .
          <article-title>Conceptual schema-centric development: A grand challenge for information systems research</article-title>
          . In O. Pastor and J. F. e Cunha, editors,
          <source>CAiSE</source>
          , volume
          <volume>3520</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>R. B.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Subrahmanian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Grant</surname>
          </string-name>
          .
          <article-title>Aggregate operators in probabilistic databases</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>52</volume>
          (
          <issue>1</issue>
          ):
          <fpage>54</fpage>
          -
          <lpage>101</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>