An Approach for Efficient Querying of Large Relational Datasets with OCL-based Languages Dimitrios S. Kolovos, Ran Wei, and Konstantinos Barmpis Department of Computer Science, University of York, Deramore Lane, York, YO10 5GH, UK {dimitris.kolovos, rw542, kb634}@york.ac.uk Abstract. Relational database management systems are used to store and manage large sets of data, subsets of which can be of interest in the context of Model Driven Engineering processes. To enable seamless inte- gration of information stored in relational databases in an MDE process, the technical and conceptual gap between the two technical spaces needs to be bridged. In this paper we investigate the challenges involved in querying large relational datasets using an imperative OCL-based trans- formation language (EOL) through a running example, and we propose solutions for some of these challenges. 1 Introduction Information that can potentially be of interest in the context of a Model Driven Engineering process is often located within non-model artefacts such as spread- sheets, XML documents and relational databases. As such, model management languages and tools would arguably benefit from extending their scope beyond the narrow boundaries of 3-level metamodelling architectures such as EMF and MOF for MDE. In previous work, we have demonstrated how OCL-based model manage- ment (e.g. model validation, model-to-text and model-to-model transformation) languages of the Epsilon platform [1] can be used to interact with plain XML documents [2] and spreadsheets [3]. In this work we investigate the challenges involved in using such languages to query large relational datasets and extract abstract models that can be then used (e.g. analysed, validated, transformed) in the context of MDE processes. In particular, we identify the challenges imposed by the size of such datasets and the conceptual gap between the organisation of relational databases and the object-oriented syntax of OCL-based languages, and we propose some solutions. The rest of the paper is organised as follows. In Section 2 we present a run- ning example that involves querying a real-world large relational dataset and extracting an EMF model from it using an OCL-based imperative transforma- tion language, we identify the performance challenges involved in doing so, and propose a run-time query translation approach that addresses some of these chal- lenges. In Section 3, we review previous work on using OCL to query relational datasets and compare our approach to it, and in Section 4 we conclude the paper and provide directions for further work. 2 D. Kolovos et. al. 2 Querying Large Relational Datasets: Challenges and Solutions The Epsilon Object Language [4] is an OCL-based imperative model query and transformation language. EOL is the core language of the Epsilon platform and underpins a number of task-specific languages for model management tasks in- cluding model validation, model-to-model and model-to-text transformation. As such, by adding support for querying relational datasets to EOL, this capability is automatically propagated to all task-specific languages of the platform. While the discussion in the rest of the paper focuses on EOL, in principle the discussion and solutions proposed are also relevant to a wide range of OCL-based model management languages such as QVTo, ATL and Kermeta. To experiment with querying relational datasets with EOL, we selected a large real-world publicly-available dataset from the US Bureau of Transportation Statistics1 that records all domestic flights in the US in January 2013. The dataset consists of one table (Flight) with 223 columns and 506,312 rows and is 221MB when persisted in MySQL. Each row of the table records the details of a domestic flight during that month, including the short codes of its origin and destination airports, the flight’s departure and arrival time etc. An excerpt of the Flight table appears in Figure 1. Our aim in this running example is to transform this dataset into an EMF model that conforms to the metamodel of Figure 2 and which captures the incoming and outgoing routes for each airport as well as the volume of traffic on these routes, so that we can then further process the EMF model to discover interesting facts about the structure of the US airport network. origin dest depTime arrTime ... ABE ATL 1557 1812 ... ABQ BWI 0735 1252 ... ANC ADQ 0804 0915 ... AZA DEN 1556 1731 ... ... ... ... ... ... Fig. 1. Excerpt of the Flight table Fig. 2. Simple ATM System Metamodel 1 http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID= 236&DB_Short_Name=On-Time Querying Relational Datasets with OCL-based Languages 3 2.1 Finding the number of airports in the network A reasonable OCL-like expression2 that can be used to retrieve the number of all distinct airports3 in the Flight table would be: Flight.allInstances.origin.asSet().size() When such an expression is evaluated against an in-memory model (e.g. an EMF model) the EOL execution engine performs the following steps: 1. It inspects the model and computes a collection of all model elements of type Flight; 2. It iterates through the contents of the collection computed in step 1 and collects the values of their origin properties in a new collection; 3. It removes all duplicates from the collection computed in step 2; 4. It computes the size of the collection computed in step 3. To evaluate the same expression against the relational database discussed above, we can assume that each table in the database is a type and each row in the table is a model element that is an instance of that type. Under these assumptions, the following issues emerge: 1. To compute the Flight.allInstances collection, the engine needs to execute the following SQL query: select * from Flight. Due to the size of the Flight table, the returned result-set cannot fit in a reasonable amount of heap space (we experimented with up to 1GB), and as such it needs to be streamed from the database to the engine instead. Streamed result-sets demonstrate the following challenges: – They support forward-only iteration; – To calculate the size of a streamed result-set it needs to be exhaustively iterated (in which case it becomes unusable as only forward iteration is permitted); – Each database connection cannot support more than one streamed result- sets at a time. 2. Iterating through all rows of the Flight table through a streamed result set and collecting the values of the origin column of each row is particularly inefficient given that the same result can be achieved at a fraction of the time using the following SQL statement: select origin from Flight; 3. Eliminating duplicates by iterating the collection computed in step 2 is also inefficient as the same result can be achieved using the following – more efficient – SQL statement: select distinct origin from Flight; 4. Finally, calculating the size of a streamed result-set is not trivial without invalidating the result-set itself. By contrast, this could be computed in one step using the following SQL statement: select count(distinct origin) from Flight. 2 EOL does away with the ocl- prefixes (e.g. oclAsSet()) and the → OCL operator and uses . instead for all property/method calls. 3 We assume that there are no airports with only incoming or outgoing flights and as such, looking into one of origin, dest should suffice. 4 D. Kolovos et. al. 2.2 Finding adjunct airports Assuming that we have computed a set containing the short codes of all airports in the table, the next task is to find for each airport, which other airports are directly connected to it, and then compute the volume of traffic between each pair of adjunct airports. An imperative EOL program that can be used to achieve this follows: 1 var origins = Flight.allInstances.origin.asSet(); 2 for (origin in origins) { 3 var destinations = Flight.allInstances.dest.asSet(); 4 for (destination in destinations) { 5 var numberOfFlights = Flight.allInstances. 6 select(f|f.origin = origin and f.dest = destination). 7 size(); 8 } 9 } The following observations can be made for the program above: – Although the destinations result-set computed in line 3 does not change, it needs to be re-computed for every nested iteration as the result of the computation is streamed, and therefore only permits forward navigation; – Unless care is taken to evaluate the right-hand side expressions in lines 1 and 3 using different database connections, the program will fail (as discussed above, each MySQL connection only permits at most one streamed result-set at a time); – Iterating through all the rows of the Flight table in the select(. . . ) method in lines 5-7 is inefficient, particularly as the same result can be computed using the following SQL statement select count(*) from Flight where origin=? and destination=? (where ? should be replaced every time with the appropriate origin/destination values). 2.3 Runtime SQL Query Generation In this section we argue that while the naive way of evaluating OCL-like queries on relational datasets can dramatically degrade performance (as shown in the previous section), there are certain runtime optimisations that the execution en- gine can perform to significantly reduce the execution time and memory footprint of some types of queries. After applying such optimisations, the following EOL transformation, can transform the complete dataset (DB) in question to an EMF-based model (ATMS) that conforms to the metamodel of Figure 2 in less than 45 seconds on average hardware4 . A visualisation of an excerpt of the extracted model appears in Figure 3. The functionality of the transformation is outlined below: – In line 1 it creates a new instance of the Model EClass in the ATMS EMF (target) model; 4 CPU: 2.66 GHz Intel Core 2 Duo, RAM: 8 GB DDR3. Querying Relational Datasets with OCL-based Languages 5 – In line 2 it computes a set of all origin airports in the Flight table; – In line 4 it iterates through the set of strings computed in line 2; – In line 5 it invokes the airportForName method defined in lines 21-30 which returns an instance of the Airport EClass in the target model with a matching name; – In lines 6-7 it computes a set of adjunct airports to the origin airport; – In lines 10-12 for each adjunct airport (destination), it computes the number of flights between the two airports; – In lines 13-16 it creates a new instance of the Route EClass in the target model and populates its origin, destination and numberOfFlights properties. – The airportForName() method in lines 21-30 is responsible for preventing the creation of airports with duplicate names in the target model. 1 var m : new ATMS!Model; 2 var origins = DB!Flight.allInstances.origin.asSet(); 3 4 for (origin in origins) { 5 var originAirport = airportForName(origin); 6 var destinations = DB!Flight.allInstances. 7 select(f|f.origin = origin).dest.asSet(); 8 9 for (destination in destinations) { 10 var numberOfFlights = DB!Flight.allInstances. 11 select(f|f.origin = origin and f.dest = destination) 12 .size(); 13 var route = new ATMS!Route; 14 route.origin = originAirport; 15 route.destination = airportForName(destination); 16 route.numberOfFlights = numberOfFlights.asInteger(); 17 } 18 19 } 20 21 operation airportForName(name : String) { 22 var airport = ATMS!Airport.allInstances. 23 selectOne(a|a.name = name); 24 25 if (airport.isUndefined()) { 26 airport = new ATMS!Airport; 27 airport.name = name; 28 m.airports.add(airport); 29 } 30 return airport; 31 } Listing 1.1. EOL transformation To achieve an acceptable level of performance, we have extended the EOL execution engine to use streamed lazy collections and a runtime OCL to SQL query translation strategy for certain types of OCL expressions when the latter are evaluated against relational datasets. Each lazy collection acts as a wrapper for an SQL query generated at runtime and only starts streaming data from the database if/when it needs to be iterated. This prevents unnecessary database queries and enables multi-step query translation at runtime. An example of the query translation process is illustrated in Figure 4 which calculates the average delay of flights flying from JFK to LAX on Sundays. In particular, the following OCL expressions are rewritten as SQL queries. 6 D. Kolovos et. al. 550 DTW ORD 161 518 620 589 126 AUS SFO Fig. 3. Visualisation of an excerpt of the model extracted using the transformation in Listing 1.1 .allInstances Retrieving all the rows of a table in the database returns a streamed lazy collection (ResultSetList) that is backed by a select * from