<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Automated Transformation of Domain Models into Tabular Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alfonso de la Vega</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego García-Saiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Zorrilla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Sánchez</string-name>
          <email>p.sanchezg@unican.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. Ingeniería Informática y Electrónica, Universidad de Cantabria</institution>
          ,
          <addr-line>Santander</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We are surrounded by ubiquitous and interconnected software systems which gather valuable data. The analysis of these data, although highly relevant for decision making, cannot be performed directly by business users, as its execution requires very specific technical knowledge in areas such as statistics and data mining. One of the complexity problems faced when constructing an analysis of this kind resides in the fact that most data mining tools and techniques work exclusively over tabular-formatted data, preventing business users from analysing excerpts of a data bundle which have not been previously translated into this format by an expert. In response, this work presents a set of transformation patterns for automatically generating tabular data from domain models. The described patterns have been integrated into a language which allows business users to specify the elements of a domain model that should be considered for data analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Mining</kwd>
        <kwd>Model-Driven Engineering</kwd>
        <kwd>Domain-Driven Design</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Currently, we rely on computer systems for most of the actions we carry out in
our daily life. Consequently, these systems gather and store information that, if
appropriately processed, can help to improve different kinds of systems or
processes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Nevertheless, as pointed out by Cao [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], there is a gap between data
mining research and practice. According to Cao, the academic community has
focused on improving the algorithms for data mining, but much less attention
has been paid to how these algorithms can be deployed in real-life environments.
      </p>
      <p>As a very first consequence of this gap, data coming from a certain domain
need to be extensively processed, formatted and reshaped in order to fit the
requirements of each data mining algorithm. In general, most data mining
algorithms require their input data to be arranged in a tabular format, such as
a CSV (Comma-Separated Values) file.</p>
      <p>
        The transformation process is often carried out manually by data scientists,
who create a set of processing scripts that extract data from its original source
and reshape them into a tabular format. The creation of these scripts can be a
time-consuming and error-prone process. Moreover, it requires specialised skills
in data manipulation, which prevents average decision-makers from analysing
data by themselves and hinders data mining democratisation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>To overcome this problem, this work presents a set of patterns for automating
the transformation of data coming from an object-oriented domain model into
tabular format. These patterns have allowed us to develop a high-level language,
called Lavoisier, for specifying which elements of a domain model should be
provided as input for a data mining algorithm.</p>
      <p>This specification is then compiled and, using the transformation patterns,
data is automatically retrieved from its source, processed and reshaped,
providing a tabular representation of the requested data as output. Therefore, data
scientists no longer need to create data formatting and reshaping scripts
manually. Moreover, since Lavoisier is a high-level language, it might be used
by people without specialised data processing skills.</p>
      <p>
        The expressiveness and effectiveness of our approach have been evaluated using
different external case studies. In particular, two data mining open challenges [
        <xref ref-type="bibr" rid="ref13 ref3">13,
3</xref>
        ] have been used. The first challenge contains data collected by an online
business review system, whereas the second focuses on data extracted from
continuous integration tools. These challenges make their data publicly available and
raise questions to be answered through their analysis. For both challenges,
Lavoisier was used to construct tabular representations that feed data mining
algorithms, which try to provide an answer to the proposed questions.
      </p>
      <p>After this introduction, the work is structured as follows: Section 2 exposes
the motivation and formalises the contributions of this paper. Some
technical operations applied in the patterns are introduced in Section 3. The paper
continues with comments on related work in Section 4. The transformation
patterns are then described in Section 5, while Section 6 presents a usage example
of the Lavoisier language. The paper finishes with a recapitulation
and an enumeration of the future objectives of this work in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>Case Study and Motivation</title>
      <p>
        This section explains in detail the motivation behind our work. To illustrate it,
a case study based on the 9th edition of the Yelp Dataset Challenge [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is used
throughout the rest of this work. This case study is described below.
      </p>
      <p>Yelp is an American company which provides an online business review
service for customers to write their opinions and for owners to describe and
advertise their businesses. Moreover, additional features, such as event notification
or social interaction between reviewers, are also supported. Yelp, during its
regular operations, captures different kinds of data, which are made available
for academic usage through challenges. The objective of these challenges is to
discover information hidden in these data which might be of
interest to Yelp.</p>
      <p>[Figure 1. Excerpt of the Yelp domain model: Business (b_id, name, stars, isOpen) with its Location (address, city, state, postalCode); Review (r_id, stars, date, text) and Tip (date, text), each referencing one Business through the business association and one User (u_id, name) through the user association.]</p>
      <p>Yelp's challenge data is provided as a bundle of interconnected files in JSON
(JavaScript Object Notation) format. From these files, we have abstracted the
domain model depicted in Figure 1. According to it, detailed information for each
business is stored, such as its location, an indication of provided features (for
instance, whether WiFi is available, whether there is a smoking area, or the minimum required
age to enter) and the categories which best describe it (Cafes, Restaurant,
Italian, Gluten-Free, and so on). Users can write reviews of businesses, where
they star-rate the business and write a text describing their experience. Additionally, user
tips can also be registered. As Yelp provides some social network capabilities,
users can have friends or fans, and can receive votes on their reviews, emitted by
other users when they find a review funny, useful or cool.</p>
      <p>
        As part of the challenge, Yelp invites the participants to find factors that
might lead to a successful business, beyond location. This information can be
computed using different data mining techniques, such as classification.
Nevertheless, as explained before, most algorithms of these techniques only accept
input data in a specific sort of tabular format. This constraint can be
found in tools typically used for data mining, such as R [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or Weka [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These
tools only accept input data arranged in a particular tabular format, such as
CSV (Comma-Separated Values) files or relational database tables. The tabular
data problem is detailed below with the help of Figure 2.
      </p>
      <p>For the sake of simplicity, let us assume we want to find the reasons behind
successful businesses using just the information of Figure 2 (a), that is, the stars rating
and the available features per business. Figure 2 (b) shows two objects representing
two different businesses. As previously commented, our very first problem is
that we need to rearrange these objects' data into a tabular format. This tabular
representation must satisfy the following constraint: all data of each instance of
the main class under analysis (in this case, Business) must be placed in the
same row, as depicted in Figure 2 (d). Other alternative representations, such as
the one in Figure 2 (c), which might be more easily produced, would not work
properly, as the information of each business gets distributed over several rows.</p>
      <p>In the following, and inspired by the functional programming paradigm, we will
refer to the operation that places all the information of a domain entity in a
single row as a flattening.</p>
      <p>To implement this flattening operation, data scientists often create several
scripts by hand. These scripts collect data from the sources and process them
using several operations such as products, filters, aggregations and reshapes,
until the data is in the appropriate format. This is a time-consuming and
error-prone process. Moreover, producing these scripts requires deep knowledge of data
manipulation techniques and tools, which average business users often
lack. Thus, this situation makes it necessary to hire and rely on the aforementioned
data scientists.</p>
      <p>To overcome this problem, our work aims to provide an automated flattening
operator. The operator allows users to specify the domain entities to be flattened.
The necessary data transformation scripts are then automatically
generated from this specification.</p>
      <p>This operator has been implemented relying on different data manipulation
operations, more specifically, as a combination of joins and pivots. To make this
work self-contained, these operators are described in the following section.</p>
    </sec>
    <sec id="sec-3">
      <title>Background: Join and Pivot</title>
      <p>The join operator, which can typically be found in relational databases, takes as
input two domain entities A and B and a relationship r from A to B which
connects them. It first calculates the Cartesian product between the instances of
A and B, and then filters out those tuples whose instances are not connected
through the relationship r.
This operator is commonly used to rearrange into the same row data related
to the same entity that appears spread over a table, as in Figure 2 (c).
It can be found in some dialects of SQL (although it is not included in the
official standards), in data analysis tools, such as R or Pandas, and even in some
spreadsheets, like Excel.</p>
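      <p>As an illustrative sketch (ours, not part of the paper's tooling), the join operator can be reproduced with pandas; the table contents mimic the Yelp example of Figure 2, and the foreign-key column b_id is our assumption about how the relationship r is materialised:
```python
import pandas as pd

# Businesses (the "A" side) and their features (the "B" side); the
# relationship r is materialised here as the foreign-key column b_id.
business = pd.DataFrame({
    "b_id": ["B1", "B2"],
    "name": ["Pete's Pizza", "Sushi Go"],
    "stars": [4.5, 3.8],
})
feature = pd.DataFrame({
    "b_id": ["B1", "B1", "B2", "B2"],
    "featureName": ["WiFi", "Parking", "WiFi", "Parking"],
    "available": [True, False, True, True],
})

# A join: Cartesian product filtered down to the pairs connected through
# r, which pandas implements directly as an inner merge on the key.
joined = business.merge(feature, on="b_id")
```
The result is exactly the one-row-per-(business, feature) layout of Figure 2 (c), which, as discussed, is not yet a valid input for most mining algorithms.</p>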
      <p>A pivot accepts as input a table T, a set pivotingColumns of columns
through which the table will be pivoted, and a set pivotedColumns of columns
whose values will be pivoted. For instance, using the table shown in Figure 2 (c), we
might specify that we want to pivot the column available of that table using
featureName as the pivoting column, arriving at the structure of Figure 2 (d).</p>
      <p>Using this information, the pivot operation would work as follows:
1. A new table for holding the results is created. All columns from the original
table which are not included in the pivoting or pivoted columns are copied into
the new table. In our case, these will be the BusinessName and BusinessStars
columns (shortened in Figure 2 (d) to BName and BStars for space reasons).
2. Each instance with distinct values for the previous columns is added as a
row to the resulting table. In our case, two different instances are detected,
(Pete's Pizza, 4.5) and (Sushi &amp; Go, 3.8). It should be noticed that, as a
consequence of this step, several rows might be reduced to only one.
3. The set pivotingValues of distinct values that can be found in the
pivotingColumns is calculated. In our example, the resulting set would be {WiFi,
Parking}.
4. The Cartesian product between pivotingValues and pivotedColumns is
calculated. In our example, this would be {WiFi, Parking} × {available} =
{WiFi_available, Parking_available}.
5. Then, a new column for each pair in the pivotingValues × pivotedColumns
set is added to the resulting table.
6. Each new column is filled with the original values coming from the input
table. For instance, in the (Pete's Pizza, 4.5) row, the WiFi_available
column takes its value from the available column in the row with values
(Pete's Pizza, 4.5, WiFi). In our case, this value would be true. If
several such rows could be identified, the pivot operation would need an
aggregation function as input to reduce all the collected values to just one.</p>
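      <p>The six steps above can be sketched with the pandas pivot_table function (an illustration under our own assumptions; the table contents follow Figure 2 (c), and the column renaming reproduces the pivotingValue_pivotedColumn naming used in the example):
```python
import pandas as pd

# The table of Figure 2 (c): one row per (business, feature) pair.
table = pd.DataFrame({
    "BName": ["Pete's Pizza", "Pete's Pizza", "Sushi Go", "Sushi Go"],
    "BStars": [4.5, 4.5, 3.8, 3.8],
    "featureName": ["WiFi", "Parking", "WiFi", "Parking"],
    "available": [True, False, True, True],
})

# Steps 1-2: the remaining columns act as row identity; steps 3-6: pivot
# featureName (pivoting column) over available (pivoted column). The
# aggregation function ("first") plays the role described in step 6.
pivoted = table.pivot_table(index=["BName", "BStars"],
                            columns="featureName",
                            values="available",
                            aggfunc="first").reset_index()
# Rename the generated columns following the pivotingValue_pivotedColumn scheme.
pivoted = pivoted.rename(columns={"WiFi": "WiFi_available",
                                  "Parking": "Parking_available"})
```
The two businesses end up in two rows, matching Figure 2 (d).</p>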
      <p>As the reader can notice, we can go from Figure 2 (c) to Figure 2 (d) by
following the join operation with a pivot. That is, the flatten operator might be
implemented as a combination of joins and pivots. Indeed, this is what a data
scientist would write by hand in a script to tabulate these data. However, it
should be noticed that neither the join nor the pivot operator by itself
is able to produce a proper tabular format; they need to be used together.
Moreover, as the domain model complexity grows, the concatenation of these
operators also becomes more complex. Thus, our idea is that this concatenation
of operators gets automatically generated.</p>
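      <p>A minimal sketch of this join-then-pivot composition, assuming pandas and hypothetical table layouts (the function and parameter names are ours, not Lavoisier's), could read:
```python
import pandas as pd

def flatten(main, ref, key, pivoting, pivoted):
    """Flatten a one-to-many reference: join main with ref, then pivot so
    that every instance of the main entity ends up in a single row."""
    joined = main.merge(ref, on=key)
    index_cols = list(main.columns)
    wide = joined.pivot_table(index=index_cols, columns=pivoting,
                              values=pivoted, aggfunc="first")
    # Name generated columns after the pivoting value and pivoted column.
    wide.columns = [f"{v}_{pivoted}" for v in wide.columns]
    return wide.reset_index()

business = pd.DataFrame({"b_id": ["B1", "B2"], "stars": [4.5, 3.8]})
features = pd.DataFrame({
    "b_id": ["B1", "B1", "B2"],
    "featureName": ["WiFi", "Parking", "WiFi"],
    "available": [True, False, True],
})
flat = flatten(business, features, "b_id", "featureName", "available")
```
Note that B2 has no Parking feature, so the corresponding cell is left empty; handling such gaps is part of what a hand-written script would otherwise have to do.</p>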
      <p>The next section analyses whether these issues have already been addressed in
the literature.</p>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>To the best of our knowledge, there is no work that provides a suitable
implementation for our desired flatten operator. Nevertheless, technologies related to data
management provide different kinds of operators which are worth mentioning.</p>
      <p>It is important to note that we are not trying to determine whether, by using a
certain technology, we can produce an appropriate tabular representation by
hand. With enough effort, a data transformation script might be produced in
practically any language. Therefore, we focus on analysing whether these
technologies provide concrete mechanisms to automatically tabulate the data, without
having to produce large chains of concatenated operators.</p>
      <p>
        Firstly, we analysed whether our problem can be solved using just SQL or a
SQL-like language, such as JPQL (the Java Persistence Query Language). These languages
typically provide efficient implementations of the join operator. Moreover, some
database management systems, such as SQL Server, offer limited versions of
the pivot operator [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For instance, the pivot version of SQL Server requires the number and
names of the columns that will be added as a result of
the operation to be known in advance, instead of being discovered dynamically as
described in the previous section. Consequently, scripts combining standard and
proprietary SQL often need to be written by hand to convert data into a proper
tabular format. Writing these scripts is exactly what we want to avoid.
      </p>
      <p>
        Secondly, data warehouses store high volumes of data which can be queried
for reporting and analytical purposes [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. These data are manipulated through
multidimensional models based on facts and dimensions. Facts store quantitative
measures about a business concept, while dimensions offer different
perspectives, such as space and time, from which to obtain and analyse fact measures.
Languages for data warehouse management typically provide operators such as
drill-down, drill-up, slice or pivot, among others. Nevertheless, they do not offer
an operator that can execute a flattening process by itself. As before, a flattening
process can be achieved by concatenating different operators by hand,
but we want to make the manual production of these concatenated
operator chains unnecessary.
      </p>
      <p>
        Finally, the process of transforming an object-oriented domain model into
another representation (e.g., relational, XML) has been largely studied in the
literature [
        <xref ref-type="bibr" rid="ref2 ref8 ref9">2, 8, 9</xref>
        ]. These works typically specify a set of patterns that can be
used to transform, step by step, one model representation into another.
      </p>
      <p>These patterns are the basis, for instance, of current Object-Relational Mappers
(ORM). Nevertheless, to the best of our knowledge, there is no pattern that can
be used by itself to perform a flattening process. Again, a flattening process can
be implemented by carefully combining these patterns by hand.
However, as the reader might have noticed, this is not what we are looking for.</p>
      <p>The next section shows how we have been able to provide an implementation for
our desired flattening operator.</p>
    </sec>
    <sec id="sec-5">
      <title>Flattening Operator Implementation</title>
      <p>From an abstract point of view, the flattening operation can be viewed as the
problem of transforming a set of interconnected objects into a tabular
representation where, in addition, all information related to each instance of a specific
class must be placed in the same row. In the following, we will refer to this
specific class as the main class or the main entity.</p>
      <p>
        Our strategy for implementing the flattening operator is based on reducing
complex cases to a trivial case, whose implementation is straightforward.
Reductions are achieved by means of operations over the data, such as joins and
pivots, as well as by applying transformation patterns frequently used by
Object-Relational Mappers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The next subsections describe this reduction process with
the help of the Yelp case study. However, the described patterns work
with any domain model under analysis.
      
      <sec id="sec-5-1">
        <title>Trivial Case: Single Class</title>
        <p>In this case, the main class contains just simple attributes and it is not part
of any inheritance hierarchy. A simple attribute is an attribute whose type is a
basic type. Thus, the main class contains no references to other classes. Figure 3
(left) shows an example for this trivial case.</p>
        <p>In this situation, the flattening operation goes as follows: first, we create a
table with one column per attribute contained in the main class. Then,
each instance of the main class is placed in a single row, with its attribute
values placed in their corresponding columns. Figure 3 (right) depicts the result of
applying the operator to a set of Review instances.</p>
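        <p>Assuming pandas, the trivial case amounts to building a table directly from the instances (a sketch of ours; the Review values are invented for illustration):
```python
import pandas as pd

# Hypothetical Review instances, holding only simple attributes.
reviews = [
    {"r_id": "R1", "stars": 4.5, "text": "We were recommended this by ..."},
    {"r_id": "R2", "stars": 3.7, "text": "The first impression was not ..."},
]

# Trivial case: one column per attribute, one row per instance.
table = pd.DataFrame.from_records(reviews)
```
</p>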
        <p>More complex flattening cases are reduced to this trivial case as described in
the next subsections.</p>
        <p>One-Bounded Associations. This case happens when main class objects have a reference to a single object
of another class, as shown in Figure 4 (left). In addition to the Review class
data, we want to include information about the user who has written each review.</p>
        <p>As the upper bound of the reference is 1, to reduce this pattern to the trivial
case, the attributes of the referenced class can simply be copied into the main class,
as if they were initially included in it. This can easily be achieved by means of a
join operation between the main class and the referenced class. To avoid name
collisions, the association's name is added as a prefix to each copied attribute.</p>
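        <p>A pandas sketch of this reduction (our illustration; the column names, including the u_id foreign key, are assumptions about how the association is stored):
```python
import pandas as pd

reviews = pd.DataFrame({"r_id": ["R1", "R2"], "stars": [4.5, 3.7],
                        "u_id": ["U1", "U2"]})
users = pd.DataFrame({"u_id": ["U1", "U2"], "name": ["Ann", "Bob"]})

# Copy the referenced attributes into the main class via a join,
# prefixing them with the association name to avoid collisions.
copied = users.add_prefix("user_").rename(columns={"user_u_id": "u_id"})
reduced = reviews.merge(copied, on="u_id")
```
The result contains one row per Review, now carrying the prefixed User attributes, and can be tabulated as in the trivial case.</p>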
        <p>Figure 4 shows how the User attributes are included in the Review class, which
can later be tabulated as described in the single-class trivial case.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Unbounded Associations</title>
        <p>In this case, objects of the main class refer to collections of objects of another
class, instead of a single one as in the previous case. Figure 5 (left) shows an
example which illustrates this case. We would like to analyse the influence of
features on top-rated businesses, so we want to include information from the Feature class in
the data analysis. For the sake of simplicity, we have skipped the inheritance
hierarchy in which the Feature class is involved (see Figure 1).</p>
        <p>Because of the unbounded reference, the storage of multiple instances of the
referenced class must be allowed. Therefore, attributes of the referenced class
will be included several times in the main class. Moreover, we need
a mechanism to distinguish between instances of the referenced class to, for
instance, determine how many attribute copies must be included in the main
class. This distinction mechanism also allows relating referenced instances
from different instances of the main class, in order to place them into the same
attribute set or, from the tabular format perspective, under the same column.</p>
        <p>For this purpose, an attribute -or set of attributes- from the referenced class
must act as an identifier for its instances. This way, instances can be distinguished
according to the value of their identifier, and the information of referenced instances
which share the same identifier can be placed into the same attribute set.</p>
        <p>In order to make the reduction, we perform a pivot operation. Linking
with the terminology introduced in Section 3, the attributes selected as the identifier
of the referenced class act as the pivoting columns, and the remaining attributes
are the pivoted columns.</p>
        <p>As a result of the pivot operation, new sets of attributes are added
to the main class, one set per distinct identifier found in the referenced
objects. Each set contains the attributes present in the pivoted columns set.
This way, the information of the referenced instances gets condensed into each
main class instance, added as a new set of attributes.</p>
        <p>To avoid name collisions, the attributes of each newly created attribute set are
named according to the following pattern: &lt;referenceName&gt;_&lt;pivotingValue&gt;_&lt;pivotedColumnName&gt;,
where referenceName is the name of the reference,
pivotingValue is the value found in the pivoting columns through which
the pivot operation has been performed and, finally, pivotedColumnName is the
name of the pivoted column.</p>
        <p>For instance, in the case of Figure 5 (left), the attribute name from the class
ValuedFeature is selected as the identifier, which means that it will be used as the
pivoting column. Let us assume that {NoiseLevel, AgesAllowed, Smoking} is the
set of distinct values for this attribute. Under this assumption, three new sets
of attributes would be created, one per distinct value. Each attribute set
contains those attributes which get pivoted. In this case, there is just one
attribute to be pivoted, value, so only one attribute is included in each
set. The previously described pattern is used to name the new attributes. For
example, features_NoiseLevel_value represents the value of the value attribute
for the feature NoiseLevel contained in the set of features of a specific Business
instance.</p>
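        <p>This reduction can be sketched with pandas as follows (our illustration; the feature values are invented, and the column renaming applies the naming pattern just described):
```python
import pandas as pd

business = pd.DataFrame({"b_id": ["B1", "B2"], "stars": [4.5, 3.8]})
valued = pd.DataFrame({
    "b_id": ["B1", "B1", "B2"],
    "name": ["NoiseLevel", "Smoking", "NoiseLevel"],
    "value": ["loud", "no", "quiet"],
})

# Join the reference, then pivot: name is the pivoting column (the
# identifier), value is the pivoted column.
joined = business.merge(valued, on="b_id")
wide = joined.pivot_table(index=["b_id", "stars"], columns="name",
                          values="value", aggfunc="first")
# Apply the referenceName_pivotingValue_pivotedColumn naming scheme.
wide.columns = [f"features_{n}_value" for n in wide.columns]
wide = wide.reset_index()
```
</p>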
        <p>Finally, it is worth mentioning a special case of this pattern. It happens when
all the attributes of the referenced class are used as identifiers, so there are
no remaining attributes to be pivoted. One example of this situation is
the relationship between Business and Category (see Figure 6). In this case, the
referenced class Category contains only the attribute name, which would be used
as the identifier. Therefore, there is no attribute to be pivoted, that is, the
pivotedColumns set would be empty. In this particular case, a phantom attribute
holding a boolean value is pivoted. This attribute indicates whether a
particular instance of the referenced class appears or not in the collection of
referenced objects attached to each instance of the main class.</p>
        <p>For the case of Figure 6 (left), let us assume that {buffet, dj, kosher} is the
set of all distinct names for the Category class. Thus, three different groups of
attributes are created, one per category. As there are no attributes to be
pivoted, a boolean attribute is added to each group to indicate whether a specific
Business belongs to a certain category, as shown in Figure 6 (right).</p>
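        <p>A possible pandas rendering of this special case (our sketch; pd.crosstab stands in for a pivot over the phantom boolean attribute):
```python
import pandas as pd

# Links between businesses and their categories (invented data).
links = pd.DataFrame({
    "b_id": ["B1", "B1", "B2"],
    "category": ["buffet", "kosher", "dj"],
})

# All Category attributes act as identifier, so only membership needs to
# be recorded: crosstab counts the links and the comparison turns the
# counts into the phantom boolean columns.
membership = (pd.crosstab(links["b_id"], links["category"]) > 0)
membership.columns = [f"categories_{c}" for c in membership.columns]
membership = membership.reset_index()
```
Businesses with no link to a category simply get false in the corresponding column.</p>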
        <p>Multilevel Associations. In this case, a referenced class has a reference to another class. An example is
shown in Figure 7 (left)1. We want to find patterns behind positive
reviews, including information both from the reviewed Business and the Category
each business belongs to.</p>
        <p>It should be noted that the chain of references between classes can have an
unbounded length since, for instance, the Category class might reference
another class, and so on.</p>
        <p>To reduce this pattern to the trivial case, the chain of references is recursively
processed from tail to head. On each step, the deepest class is reduced using
either the one-bounded or the unbounded pattern. After each step, the resulting
chain of references is one level shorter than the previous one, so the process
eventually converges to the trivial case.</p>
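        <p>A small pandas sketch of this tail-to-head reduction for the chain Review to Business to Category (our illustration; data, column names and prefixes are hypothetical):
```python
import pandas as pd

reviews = pd.DataFrame({"r_id": ["R1"], "stars": [5.0], "b_id": ["B1"]})
business = pd.DataFrame({"b_id": ["B1"], "stars": [4.5]})
links = pd.DataFrame({"b_id": ["B1", "B1"], "category": ["buffet", "dj"]})

# Step 1 (tail): fold Category into Business using the unbounded
# pattern with phantom boolean columns.
cats = (pd.crosstab(links["b_id"], links["category"]) > 0).reset_index()
cats.columns = ["b_id"] + [f"cs_{c}" for c in cats.columns[1:]]
business_reduced = business.merge(cats, on="b_id")

# Step 2 (head): fold the reduced Business into Review using the
# one-bounded pattern, prefixing the copied attributes.
copied = business_reduced.add_prefix("b_").rename(columns={"b_b_id": "b_id"})
flat = reviews.merge(copied, on="b_id")
```
After the two steps, each Review instance occupies a single row carrying the prefixed Business and category columns, which is the trivial case.</p>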
        <p>This process is illustrated in Figure 7. First, Business and Category are
reduced to one class using the unbounded pattern (Figure 7 (middle)). Then, the
resulting chain of references is reduced using the one-bounded pattern, reaching
the trivial case (Figure 7 (right)).</p>
        <p>Multiple References. This case happens when a class has several references to other classes. This
pattern appears if we combine Figures 5 and 6, because we want to analyse
businesses considering category and feature information at the same time.</p>
        <p>To reduce this pattern to the trivial case, each reference is reduced using the
previous patterns. Since attribute ordering does not matter for the final tabular
format, references might be processed in any order. Moreover, as the references of a
class must have different names, the avoidance of name collisions is ensured.
1 Reference names have been abbreviated for space reasons.</p>
        <p>Inheritance. In this section we explain an inheritance reduction pattern which works for
most cases found in a domain model. There are two roles that inheritance can
play in our patterns: (1) the main class belongs to an inheritance hierarchy, and
(2) a referenced class resides in an inheritance hierarchy. These roles generate
situations from which, through more advanced patterns, we can benefit to
achieve a more optimised reduction. We are currently working on these patterns
but, for space and simplicity reasons, they have been left out of this paper.</p>
        <p>
          The pattern performs inheritance reduction by collapsing attributes. The
inheritance hierarchy gets compacted into a single class, which includes the attributes
of all the superclasses and subclasses of the main class. This transformation process is
inspired by the single-table pattern used by Object-Relational Mappers [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>The point of the hierarchy from which we want to perform the reduction
might be the root, a leaf, or simply lie in the middle. Figure 8
shows an example of this latter case, where class A is the main class. It has a
superclass called Z, and two subclasses B and C. The reduction proceeds as follows:
- First, superclass attributes are included in the main class. For instance,
attribute t from superclass Z is included in class A, conforming the
temporary class ZA (Figure 8 (middle)). That is, attributes descend from the
hierarchy root to the main class.
- Secondly, subclasses are folded towards the main class. Since we might
need to tabulate information coming from any subclass, the attributes of all
subclasses are raised up to the main class. In Figure 8, main class
instances can be of type A, B or C. Hence, it is necessary to include
every attribute present in B and C in the class ZA, resulting in the type
ZABC (Figure 8 (right)). A special problem can be encountered when
mixing attributes coming from different subclasses: their names may collide. To
overcome this, attributes are renamed according to the following pattern:
sub_&lt;className&gt;_&lt;attributeName&gt;, where className refers to the original
class from which the attribute has been brought. In the example, attribute
z, originally from class B, gets renamed to sub_B_z when placed in the final
ZABC class (Figure 8 (right)).</p>
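        <p>Assuming instances are stored per concrete class, the collapse could be sketched in pandas as follows (our illustration of the pattern, not the authors' implementation; attribute names follow Figure 8):
```python
import pandas as pd

# Instances of a hierarchy where A extends Z and B extends A
# (a hypothetical storage layout, one table per concrete class).
a_rows = pd.DataFrame({"t": [1], "x": [10]})             # plain A instances
b_rows = pd.DataFrame({"t": [2], "x": [20], "z": [99]})  # subclass B instances

# Collapse: the superclass attribute t has already descended into A;
# subclass attributes are raised, renamed sub_className_attributeName,
# and a type column records each instance's concrete class.
a_rows = a_rows.assign(type="A")
b_rows = b_rows.rename(columns={"z": "sub_B_z"}).assign(type="B")
zabc = pd.concat([a_rows, b_rows], ignore_index=True)
```
Instances of type A simply leave the sub_B_z column empty.</p>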
        <p>To be able to distinguish the type of each instance of the main class, a type
attribute is included in the resulting class (see Figure 8 (right)). This attribute
holds the concrete type of each instance as its value.</p>
        <p>The other two extreme cases can be deduced from this general case, assuming
that the main class has no superclasses (the first step is skipped) or that it
has no subclasses (the second step is omitted). If multiple levels of inheritance
are found, no matter their direction (up as superclasses or down as subclasses),
they are evaluated recursively in the same manner as multilevel associations are,
starting from the class furthest from the main class.</p>
        <p>The reduction of the Feature inheritance hierarchy from the domain model is shown
in Figure 9. As in previous examples, some names in the figure have been
shortened for drawing purposes. A Feature comes in three different flavours:
AvailableFeature (AF), ValuedFeature (VF), and GroupedFeature (GF); the latter
represents a special case of AvailableFeature whose instances are related by
conforming a group, and therefore holds a reference to the Group class.</p>
        <p>No superclasses have to be reduced in this case, as Feature sits at the
root of the hierarchy (Figure 9 (left)). The reduction of the subclasses is
depicted in two steps. First, attributes from GroupedFeature are merged into the
AvailableFeature class (Figure 9 (middle)), the name being obtained in this case
through the group reference. Then, both AvailableFeature and ValuedFeature
are merged into the Feature class, and an additional type attribute is included
(Figure 9 (right)). As the original Feature class was abstract, the set of values
the type attribute might take is {AF, VF, GF}.</p>
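        <p>Applied in this way, the reduction produces the final Feature columns of Figure 9. A minimal sketch of how an instance would be tabulated into that reduced class (attribute values are invented for illustration; this is not the authors' implementation):</p>

```python
# Sketch: tabulating Feature instances into the reduced class of
# Figure 9. Columns that do not apply to a given concrete type stay
# empty (None), and 'type' records the concrete subclass.
COLUMNS = ["name", "sub_AF_available", "sub_AF_sub_GF_group_name",
           "sub_VF_value", "type"]

def to_row(instance):
    row = dict.fromkeys(COLUMNS)
    row["name"] = instance["name"]
    row["type"] = instance["type"]
    if instance["type"] in ("AF", "GF"):  # GF is a special case of AF
        row["sub_AF_available"] = instance["available"]
    if instance["type"] == "GF":
        # the group name is reached through the Group reference
        row["sub_AF_sub_GF_group_name"] = instance["group"]["name"]
    if instance["type"] == "VF":
        row["sub_VF_value"] = instance["value"]
    return row

gf = to_row({"name": "WiFi", "type": "GF", "available": True,
             "group": {"name": "Connectivity"}})
print(gf["sub_AF_sub_GF_group_name"])  # -> Connectivity
```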
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Example: Usage in an Entities Extraction Language</title>
      <p>
        Based on these patterns, we have developed a high-level language, Lavoisier,
for automatically flattening fragments of a domain model. The language has
been developed using Xtext [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It offers a set of high-level primitives
for specifying which parts of a domain model should be considered during a
data analysis process. In Lavoisier, as is conventional in the data mining field,
the tabular structure generated as a result of the flattening process is called a
dataset. The current implementation is available at
https://github.com/alfonsodelavega/lavoisier.
      </p>
      <p>dataset BusinessWithFeatures {
  mainClass Business
    including [b_id, stars as Rating];
  refers_to Feature
    through features
    identified_by name;
}
Fig. 10. Left: fragment of the Yelp domain model. Right: a dataset specification
written in Lavoisier.</p>
      <p>Figure 10 shows how Lavoisier can be used to solve the problem described
in Section 2. The left of Figure 10 depicts the fragment of the domain model we
wanted to use for a data analysis task.</p>
      <p>A dataset definition starts by giving it a name, BusinessWithFeatures in
this case. Then, the main class for building the dataset must
be specified; here, the Business class is selected. Next, the set of attributes
of the main class that will be included in the dataset is specified using the
including keyword. This step is optional: if the clause is omitted, all attributes
of the main class are included. Moreover, aliases for attributes can
be specified, if desired, through the keyword as.</p>
      <p>Finally, classes referenced from the main class can also be included
in the dataset by means of the refers_to keyword. In our case, the Feature class
is added through the features reference. As this is an unbounded
reference, we need to define the attributes of the Feature class that will
be used as pivoting attributes. This is done with the keyword identified_by, which
here selects name as the pivoting attribute. The tool automatically
obtains the set of name values that will be used as ids in the unbounded pattern.</p>
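      <p>The effect of the identified_by clause can be sketched as follows: each distinct value of the pivoting attribute name becomes a column of the Business rows. This is a simplified presence/absence sketch with invented data, not the tool itself (which, among other things, also applies the inheritance reduction of Feature):</p>

```python
# Sketch of the pivoting behind BusinessWithFeatures: the set of Feature
# names is collected automatically and each name becomes a column.
def tabulate(businesses):
    names = sorted({f["name"] for b in businesses for f in b["features"]})
    rows = []
    for b in businesses:
        # attributes selected in the 'including' clause, with alias
        row = {"b_id": b["b_id"], "Rating": b["stars"]}
        present = {f["name"] for f in b["features"]}
        for name in names:  # one pivoted column per Feature name
            row[name] = name in present
        rows.append(row)
    return rows

biz = [{"b_id": 1, "stars": 4.5,
        "features": [{"name": "WiFi"}, {"name": "Parking"}]},
       {"b_id": 2, "stars": 3.0, "features": [{"name": "WiFi"}]}]
print(tabulate(biz))
# -> [{'b_id': 1, 'Rating': 4.5, 'Parking': True, 'WiFi': True},
#     {'b_id': 2, 'Rating': 3.0, 'Parking': False, 'WiFi': True}]
```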
      <p>It is worth pointing out that the Feature class belongs to an inheritance
hierarchy. Therefore, as previously described, advanced transformation patterns
are required to correctly tabulate each instance depending on its type.
However, the Lavoisier user does not need to know the transformation details,
as the language executes them transparently.</p>
      <p>Thanks to Xtext, we obtain a capable editor into which term proposals and
validation are easily incorporated. The user is assisted throughout
the dataset definition process, which is checked against the existing domain
model to ensure correctness and to provide useful suggestions.</p>
      <p>The next section recapitulates the contributions and concludes this paper.</p>
    </sec>
    <sec id="sec-7">
      <title>Summary and Future work</title>
      <p>This paper has presented, as its main contribution, a set of patterns for
automatically transforming a selection of interconnected objects, conforming to a domain
model, into a particular kind of tabular format. This transformation process,
known as a flattening operation, is mandatory when using these data as input
for a data mining algorithm. The patterns have been integrated into a
high-level language, called Lavoisier. With Lavoisier, the user just specifies,
using a set of high-level primitives, which part of a domain model should be
considered for a data mining task, and the language automatically rearranges the
corresponding data into an appropriate tabular format. This avoids having to
create large and complex scripts for this task by hand, saving
time and reducing errors.</p>
      <p>In future work, we will provide a comprehensive description of Lavoisier's
capabilities, as well as include further data selection mechanisms, such as
aggregation functions or row filters. Moreover, the patterns will be formalized,
which will allow us to develop further patterns and to more easily study how to
work with other representations, such as entity-relationship or RDF models.</p>
      <p>Acknowledgements. This work has been partially funded by the
Government of Cantabria (Spain) under the doctoral studentship program of the
University of Cantabria, and by the Spanish Government under grant
TIN2014-56158-C4-2-P (M2C2).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <source>The beckman report on database research. SIGMOD Rec</source>
          .
          <volume>43</volume>
          (
          <issue>3</issue>
          ),
          <fpage>61</fpage>
          –
          <lpage>70</lpage>
          (Dec
          <year>2014</year>
          ), http://doi.acm.org/10.1145/2694428.2694441
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Atzeni</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torlone</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gianforme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Modelindependent schema translation</article-title>
          .
          <source>VLDB Journal</source>
          <volume>17</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1347</fpage>
          –
          <lpage>1370</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Beller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gousios</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaidman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Travistorrent: Synthesizing travis ci and github for full-stack research on continuous integration</article-title>
          .
          <source>In: Proceedings of the 14th working conference on mining software repositories</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Domain-Driven Data Mining: Challenges and Prospects</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>22</volume>
          (
          <issue>6</issue>
          ),
          <fpage>755</fpage>
          –
          <lpage>769</lpage>
          (Jun
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS</article-title>
          .
          <source>Proceedings of the 30th International Conference on Very Large Data</source>
          Bases pp.
          <fpage>998</fpage>
          –
          <lpage>1009</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Eysholdt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Behrens</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Xtext: Implement Your Language Faster than the Quick and Dirty Way</article-title>
          . In:
          <article-title>Companion to the 25th Annual Conference on Object-Oriented Programming, Systems, Languages, and Applications (SPLASH/OOPSLA)</article-title>
          . pp.
          <fpage>307</fpage>
          –
          <lpage>309</lpage>
          . Reno/Tahoe (Nevada, USA) (
          <year>October 2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piatetsky-Shapiro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>From data mining to knowledge discovery in databases</article-title>
          .
          <source>AI Magazine</source>
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <volume>37</volume>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fowler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Patterns of Enterprise Application Architecture</article-title>
          .
          Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hainaut</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>The Transformational Approach to Database Engineering</article-title>
          . In: Generative and Transformational Techniques in Software Engineering: International Summer School, GTTSE
          <year>2005</year>
          , Braga, Portugal, pp.
          <fpage>95</fpage>
          –
          <lpage>143</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <source>The WEKA Data Mining Software: An Update. SIGKDD Explorations Newsletter</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          –
          <lpage>18</lpage>
          (
          <year>June 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          R Core Team:
          <article-title>R: A Language and Environment for Statistical Computing</article-title>
          . R Foundation for Statistical Computing, Vienna, Austria (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wrembel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koncilia</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Data Warehouses And Olap: Concepts, Architectures And Solutions</article-title>
          . IRM Press (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yelp</surname>
          </string-name>
          <article-title>: Yelp Dataset Challenge Round 9</article-title>
          . https://www.yelp.com/dataset_challenge [Online; accessed 30-March-2017]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>