=Paper=
{{Paper
|id=Vol-1089/paper2p
|storemode=property
|title=An Approach for Efficient Querying of Large Relational Datasets with OCL based
Languages
|pdfUrl=https://ceur-ws.org/Vol-1089/6.pdf
|volume=Vol-1089
|dblpUrl=https://dblp.org/rec/conf/models/KolovosWB13
}}
==An Approach for Efficient Querying of Large Relational Datasets with OCL based
Languages==
An Approach for Efficient Querying of Large
Relational Datasets with OCL-based Languages
Dimitrios S. Kolovos, Ran Wei, and Konstantinos Barmpis
Department of Computer Science, University of York,
Deramore Lane, York, YO10 5GH, UK
{dimitris.kolovos, rw542, kb634}@york.ac.uk
Abstract. Relational database management systems are used to store
and manage large sets of data, subsets of which can be of interest in the
context of Model Driven Engineering processes. To enable seamless inte-
gration of information stored in relational databases in an MDE process,
the technical and conceptual gap between the two technical spaces needs
to be bridged. In this paper we investigate the challenges involved in
querying large relational datasets using an imperative OCL-based trans-
formation language (EOL) through a running example, and we propose
solutions for some of these challenges.
1 Introduction
Information that can potentially be of interest in the context of a Model Driven
Engineering process is often located within non-model artefacts such as spread-
sheets, XML documents and relational databases. As such, model management
languages and tools would arguably benefit from extending their scope beyond
the narrow boundaries of 3-level metamodelling architectures such as EMF and
MOF for MDE.
In previous work, we have demonstrated how OCL-based model manage-
ment (e.g. model validation, model-to-text and model-to-model transformation)
languages of the Epsilon platform [1] can be used to interact with plain XML
documents [2] and spreadsheets [3]. In this work we investigate the challenges
involved in using such languages to query large relational datasets and extract
abstract models that can be then used (e.g. analysed, validated, transformed) in
the context of MDE processes. In particular, we identify the challenges imposed
by the size of such datasets and the conceptual gap between the organisation
of relational databases and the object-oriented syntax of OCL-based languages,
and we propose some solutions.
The rest of the paper is organised as follows. In Section 2 we present a run-
ning example that involves querying a real-world large relational dataset and
extracting an EMF model from it using an OCL-based imperative transforma-
tion language, we identify the performance challenges involved in doing so, and
propose a run-time query translation approach that addresses some of these chal-
lenges. In Section 3, we review previous work on using OCL to query relational
datasets and compare our approach to it, and in Section 4 we conclude the paper
and provide directions for further work.
2 D. Kolovos et. al.
2 Querying Large Relational Datasets: Challenges and
Solutions
The Epsilon Object Language [4] is an OCL-based imperative model query and
transformation language. EOL is the core language of the Epsilon platform and
underpins a number of task-specific languages for model management tasks in-
cluding model validation, model-to-model and model-to-text transformation. As
such, by adding support for querying relational datasets to EOL, this capability
is automatically propagated to all task-specific languages of the platform. While
the discussion in the rest of the paper focuses on EOL, in principle the discussion
and solutions proposed are also relevant to a wide range of OCL-based model
management languages such as QVTo, ATL and Kermeta.
To experiment with querying relational datasets with EOL, we selected a
large real-world publicly-available dataset from the US Bureau of Transportation
Statistics1 that records all domestic flights in the US in January 2013. The
dataset consists of one table (Flight) with 223 columns and 506,312 rows and is
221MB when persisted in MySQL. Each row of the table records the details of
a domestic flight during that month, including the short codes of its origin and
destination airports, the flight’s departure and arrival time etc. An excerpt of
the Flight table appears in Figure 1.
Our aim in this running example is to transform this dataset into an EMF
model that conforms to the metamodel of Figure 2 and which captures the
incoming and outgoing routes for each airport as well as the volume of traffic
on these routes, so that we can then further process the EMF model to discover
interesting facts about the structure of the US airport network.
origin dest depTime arrTime ...
ABE ATL 1557 1812 ...
ABQ BWI 0735 1252 ...
ANC ADQ 0804 0915 ...
AZA DEN 1556 1731 ...
... ... ... ... ...
Fig. 1. Excerpt of the Flight table Fig. 2. Simple ATM System Metamodel
1
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=
236&DB_Short_Name=On-Time
Querying Relational Datasets with OCL-based Languages 3
2.1 Finding the number of airports in the network
A reasonable OCL-like expression2 that can be used to retrieve the number of
all distinct airports3 in the Flight table would be:
Flight.allInstances.origin.asSet().size()
When such an expression is evaluated against an in-memory model (e.g. an
EMF model) the EOL execution engine performs the following steps:
1. It inspects the model and computes a collection of all model elements of type
Flight;
2. It iterates through the contents of the collection computed in step 1 and
collects the values of their origin properties in a new collection;
3. It removes all duplicates from the collection computed in step 2;
4. It computes the size of the collection computed in step 3.
To evaluate the same expression against the relational database discussed
above, we can assume that each table in the database is a type and each row
in the table is a model element that is an instance of that type. Under these
assumptions, the following issues emerge:
1. To compute the Flight.allInstances collection, the engine needs to execute the
following SQL query: select * from Flight. Due to the size of the Flight table,
the returned result-set cannot fit in a reasonable amount of heap space (we
experimented with up to 1GB), and as such it needs to be streamed from
the database to the engine instead. Streamed result-sets demonstrate the
following challenges:
– They support forward-only iteration;
– To calculate the size of a streamed result-set it needs to be exhaustively
iterated (in which case it becomes unusable as only forward iteration is
permitted);
– Each database connection cannot support more than one streamed result-
sets at a time.
2. Iterating through all rows of the Flight table through a streamed result set
and collecting the values of the origin column of each row is particularly
inefficient given that the same result can be achieved at a fraction of the
time using the following SQL statement: select origin from Flight;
3. Eliminating duplicates by iterating the collection computed in step 2 is also
inefficient as the same result can be achieved using the following – more
efficient – SQL statement: select distinct origin from Flight;
4. Finally, calculating the size of a streamed result-set is not trivial without
invalidating the result-set itself. By contrast, this could be computed in one
step using the following SQL statement: select count(distinct origin) from
Flight.
2
EOL does away with the ocl- prefixes (e.g. oclAsSet()) and the → OCL operator
and uses . instead for all property/method calls.
3
We assume that there are no airports with only incoming or outgoing flights and as
such, looking into one of origin, dest should suffice.
4 D. Kolovos et. al.
2.2 Finding adjunct airports
Assuming that we have computed a set containing the short codes of all airports
in the table, the next task is to find for each airport, which other airports are
directly connected to it, and then compute the volume of traffic between each
pair of adjunct airports. An imperative EOL program that can be used to achieve
this follows:
1 var origins = Flight.allInstances.origin.asSet();
2 for (origin in origins) {
3 var destinations = Flight.allInstances.dest.asSet();
4 for (destination in destinations) {
5 var numberOfFlights = Flight.allInstances.
6 select(f|f.origin = origin and f.dest = destination).
7 size();
8 }
9 }
The following observations can be made for the program above:
– Although the destinations result-set computed in line 3 does not change,
it needs to be re-computed for every nested iteration as the result of the
computation is streamed, and therefore only permits forward navigation;
– Unless care is taken to evaluate the right-hand side expressions in lines 1 and
3 using different database connections, the program will fail (as discussed
above, each MySQL connection only permits at most one streamed result-set
at a time);
– Iterating through all the rows of the Flight table in the select(. . . ) method in
lines 5-7 is inefficient, particularly as the same result can be computed using
the following SQL statement select count(*) from Flight where origin=? and
destination=? (where ? should be replaced every time with the appropriate
origin/destination values).
2.3 Runtime SQL Query Generation
In this section we argue that while the naive way of evaluating OCL-like queries
on relational datasets can dramatically degrade performance (as shown in the
previous section), there are certain runtime optimisations that the execution en-
gine can perform to significantly reduce the execution time and memory footprint
of some types of queries.
After applying such optimisations, the following EOL transformation, can
transform the complete dataset (DB) in question to an EMF-based model (ATMS)
that conforms to the metamodel of Figure 2 in less than 45 seconds on average
hardware4 . A visualisation of an excerpt of the extracted model appears in Figure
3.
The functionality of the transformation is outlined below:
– In line 1 it creates a new instance of the Model EClass in the ATMS EMF
(target) model;
4
CPU: 2.66 GHz Intel Core 2 Duo, RAM: 8 GB DDR3.
Querying Relational Datasets with OCL-based Languages 5
– In line 2 it computes a set of all origin airports in the Flight table;
– In line 4 it iterates through the set of strings computed in line 2;
– In line 5 it invokes the airportForName method defined in lines 21-30 which
returns an instance of the Airport EClass in the target model with a matching
name;
– In lines 6-7 it computes a set of adjunct airports to the origin airport;
– In lines 10-12 for each adjunct airport (destination), it computes the number
of flights between the two airports;
– In lines 13-16 it creates a new instance of the Route EClass in the target
model and populates its origin, destination and numberOfFlights properties.
– The airportForName() method in lines 21-30 is responsible for preventing
the creation of airports with duplicate names in the target model.
1 var m : new ATMS!Model;
2 var origins = DB!Flight.allInstances.origin.asSet();
3
4 for (origin in origins) {
5 var originAirport = airportForName(origin);
6 var destinations = DB!Flight.allInstances.
7 select(f|f.origin = origin).dest.asSet();
8
9 for (destination in destinations) {
10 var numberOfFlights = DB!Flight.allInstances.
11 select(f|f.origin = origin and f.dest = destination)
12 .size();
13 var route = new ATMS!Route;
14 route.origin = originAirport;
15 route.destination = airportForName(destination);
16 route.numberOfFlights = numberOfFlights.asInteger();
17 }
18
19 }
20
21 operation airportForName(name : String) {
22 var airport = ATMS!Airport.allInstances.
23 selectOne(a|a.name = name);
24
25 if (airport.isUndefined()) {
26 airport = new ATMS!Airport;
27 airport.name = name;
28 m.airports.add(airport);
29 }
30 return airport;
31 }
Listing 1.1. EOL transformation
To achieve an acceptable level of performance, we have extended the EOL
execution engine to use streamed lazy collections and a runtime OCL to SQL
query translation strategy for certain types of OCL expressions when the latter
are evaluated against relational datasets. Each lazy collection acts as a wrapper
for an SQL query generated at runtime and only starts streaming data from the
database if/when it needs to be iterated. This prevents unnecessary database
queries and enables multi-step query translation at runtime. An example of the
query translation process is illustrated in Figure 4 which calculates the average
delay of flights flying from JFK to LAX on Sundays. In particular, the following
OCL expressions are rewritten as SQL queries.
6 D. Kolovos et. al.
550
DTW ORD 161
518
620
589 126 AUS
SFO
Fig. 3. Visualisation of an excerpt of the model extracted using the transformation in
Listing 1.1
.allInstances Retrieving all the rows of a table in the database returns a
streamed lazy collection (ResultSetList) that is backed by a select * from