Efficient Cube Construction for Smart City Data∗ Michael Scriney & Mark Roantree Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, Ireland michael.scriney@insight-centre.org, mark.roantree@dcu.ie ABSTRACT To deliver powerful smart city environments, there is a re- quirement to analyse web produced data streams in close to real time so that city planners can employ up to date pre- dictive models in both short and long term planning. Data cubes, fused from multiple sources provide a popular input Figure 1: Sample DWARF input to predictive models. A key component in this infrastructure is an efficient mechanism for transforming web data (XML or JSON) into multi-dimensional cubes. In our research, we the same levels of efficiency. However, with the migration of have developed a framework for efficient transformation of many large data repositories to the NoSQL model, we inves- XML data from multiple smart city services into DWARF tigate if DWARF cubes based on the NoSQL model can de- cubes using a NoSQL storage engine. Our evaluation shows liver similar efficiencies and thus, allow a canonical approach a high level of performance when compared to other ap- to managing XML and JSON smart city data streams. proaches and thus, provides a platform for predictive models Contribution. Our overall research goal is the mainte- in a smart city environment. nance of data cubes, fused from the multiple sources listed above. For this paper, we limit ourselves to development Keywords and performance testing the DWARF cubes creation times Smart City, Data Streams, Cubes, XML Analytics and storage size as this provides the fundamentals for our system. Our contribution is the development of a DWARF to NoSQL (bi-directional) mapping mechanism, which facili- 1. INTRODUCTION tates the creation of a DWARF cube from XML data and its Many smart city applications are fed information from storage in a NoSQL database for future retrieval and query- their online services and repositories through XML and JSON ing. Our evaluation takes one of these datasets (the bicycle data objects. As part of our research, we are seeking to sharing scheme) and uses a variety of cube definitions to maintain up to date cubes of information taken from multi- robustly test and compare its performance. ple sources in a European capital city (Dublin) in order to Structure. This paper is organised as follows: Section make predictions about the usage of various services. The 2 outlines the structure and implementation of DWARF data streams in our research include car parks, bicycle shar- cubes; Section 3 describes DWARF cube storage; Section 4 ing schemes, online auction data, air quality sensor data, and details the transformation of the in-memory DWARF cube sales data. While some are not directly associated with the to a NoSQL database; Section 5 presents our evaluation; re- smart city project, they may influence decision making and lated research is presented in Section 6; and in Section 7, are thus, included in our data cubes. Our data cubes adopt conclusions are presented. the DWARF approach [12] which was shown to offer signif- icant savings in terms of both storage and computational efficiencies for relational data. Previously, we demonstrated 2. STRUCTURE AND IMPLEMENTATION how the DWARF model processes XML data [2], [3] with OF DWARF CUBES ∗This work is supported by Science Foundation Ireland un- The DWARF [12] compression algorithm produces data der grant number SFI/12/RC/2289 cubes of a compressed form. While clustering algorithms [10], use a form of suffix coalescing for storage efficiency, DWARF uses both suffix and prefix coalescing to detect du- plicate aggregates before they can be computed. The result of this process is a DWARF cube which takes input in the c 2016, Copyright is with the authors. Published in the Workshop Pro- form of a series of tuples which are used to create a DWARF. ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor- Each tuple takes the form: deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this (dimension_1,dimension_2,...,dimension_n,measure). paper is permitted under the terms of the Creative Commons license CC- For example the input in Fig. 1 would produce the DWARF by-nc-nd 4.0 Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference March cube in Fig. 2. 15, 2016, Bordeaux, France A DWARF cube is a tree-like structure which contains a particular DWARF Node such as its parent and child cells. • The DWARF_Cell column family contains information regarding a particular DWARF Cell, specifically the id of its parent and pointer node and the measure as- sociated with the cell. The role of the DWARF_Schema column family in Table 1-A is to record an individual DWARF Schema and its metadata. node_count is the amount of DWARF nodes in the schema, cell_count is the amount of cells and size_as_mb is an ap- proximation of the size taken up on disk by the DWARF Schema. The entry_node_id attribute contains the id of the top-level node in the DWARF schema and serves as the Figure 2: A sample DWARF cube created by Fig. 1 input entry point for all traversal functions. Finally, the is_cube attribute is a flag which indicates whether or not this par- ticular record is a full DWARF Schema or a DWARF cube constructed from querying a DWARF schema. The role of two main structures: a DWARF cell and a DWARF node. the DWARF_Cell column family outlined in Table 1-C is to A DWARF node is a container for groups of cells which store information about a DWARF Cell. The role of the share the same parent. At the top level of the tree in a DWARF_Node column family (Table 1-B) is to model all DWARF DWARF cube there is a root node. This contains the top Nodes in a DWARF cube. cells of the tree at the highest dimension level. A DWARF cell represents a single item in a DWARF cube. It has a reference key and points to a DWARF node 4. TRANSFORMATION APPROACH which contains all of its child cells. The cell itself is con- We now outline the process used to convert an in-memory tained in a DWARF node. The value for a particular cell DWARF Schema into a NoSQL model. This process involves can be retrieved by starting at the top of the tree and fol- transforming each Node and Cell in the DWARF structure lowing the cell paths until the target cell is found. The value into the relevant NoSQL query which will be used to insert of a DWARF cell is synonymous with its child’s aggregate it into the database. cell. If a DWARF cell does not point to a DWARF node The first step in this process is the creation of a DWARF_Schema further down the tree, it is at the bottom level of the tree item which is inserted into the DWARF_Schema column fam- and is called a leaf cell. A leaf cell is the smallest struc- ily seen in Table 1-A. The id field is obtained by query- ture in a DWARF cube. The value of a leaf cell is derived ing the DWARF_Schema column family in the NoSQL data from the measure item in the tuple list supplied to create warehouse to determine the next id to be used. The fields the DWARF cube. This is identical to the measure found in node_count and cell_count are obtained by scanning the a traditional Fact Table. DWARF structure in-memory. The field size_as_mb is pop- ulated separately by querying the NoSQL data warehouse to 3. A NOSQL DWARF MODEL determine the size of the DWARF structure when stored. The terminology and notation used in columnar NoSQL In order to map a DWARF Schema to a NoSQL model, databases provide the basis of our NoSQL DWARF model. each Node and Cell in the DWARF must be visited which Briefly, a columnar NoSQL database is composed of keyspaces involves a full traversal of the DWARF. Starting from the and column families. A keyspace can be roughly equated to root node of the DWARF, the tree is traversed in a breadth- a database in a relational schema and column families are first top-down fashion. Using the DWARF presented in Fig. similar to tables in a RDBMS. 2 the root node is processed, then the cell Ireland. From In order to correctly model a DWARF Schema, three col- here the descendant node of Ireland is processed. This con- umn families must be created: DWARF_Schema, DWARF_Node tinues until all leaf cells which are descendant from Ireland and DWARF_Cell. Together, these column families represent are processed. Afterwards, the root node is revisited and the the tree-like structure of a DWARF shown in Fig. 2. This cell France is visited and all of its descendants are evaluated. model necessitates the construction of a bi-directional model However, as a DWARF structure contains multiple-inheritance mapper where a full DWARF Schema can be rebuilt from (Nodes can have multiple parent cells) precautions must be the records stored in the NoSQL database. This is achieved taken in order to prevent Nodes and Cells from being pro- by reading the records of the DWARF cells and nodes from cessed multiple times. This is accomplished by a lookup the NoSQL database and joining them based on their unique table which records each Node and Cell visited by assign- ids. ing them a unique ID. Upon visiting a Cell or Node in the DWARF structure, the lookup table is first checked to en- • The DWARF_Schema column family stores information sure that is has not already been transformed. regarding an individual DWARF Schema. This is to During traversal, the structure of a DWARF Node or Cell ensure that multiple schemas can be stored in the same is evaluated and the relevant INSERT command for a NoSQL NoSQL database and is the entry point for query and database is created. Using Cassandra [6] as a sample NoSQL traversal functions. database and the Cassandra Query Language (CQL) as a sample query language, a transformation is presented in Fig. • The DWARF_Node column family stores information about 3 where the values of an in-memory DWARF cell are pre- DWARF Schema id node count cell count size as mb entry node id is cube int int int int int bool A: DWARF Schema column family DWARF Node id parentIds childrenIds root schema id int set< int > set< int > boolean int B: DWARF Node schema DWARF CELL id key measure parentNode pointerNode leaf schema id dimension table name int text int int int boolean int text C: DWARF Cell schema Table 1: NoSQL DWARF Schema Sample DWARF Cell Values Day Week Month TMonth SMonth parentNode: DWARF Node (id 3) Size (MB) 2.1 17.1 54.1 113 338 Number of tuples 7358 60102 118934 396756 1181344 pointerNode null key ”Fenian St” Table 2: The datasets used in the experiments measure 3 id 3 INSERT INTO DWARF CELL (id,key,measure,parentNode, pointerNode,leaf, schema_id, dimension_table_name) VALUES (3,"Fenian St", 3,3,null,true,1,"Station"); Figure 3: Sample DWARF Cell values and CQL query after transformation sented along with the corresponding NoSQL INSERT com- Figure 4: MySQL-DWARF Schema for a DWARF cube mand which is generated after visiting the cell. Additionally, if a dimension table is specified in the schema definition, the dimension_table_name is also updated to in- purpose of these schemas is to firstly show how a NoSQL- clude the name of the dimension table which contains addi- DWARF compares to its SQL counterpart and subsequently, tional information about the DWARF Cell. how an unnormalised DWARF model (for the purpose of As each appropriate NoSQL query is created, it is added speed in MySQL) performs overall. Five DWARF cubes a list of Insert expressions. These queries are subsequently were created each in increasing size. Each one representing executed in a bulk process to populate the column families a time period of bikes data. The five period chosen were shown in Table 1. Finally, when all column families have one day (Day), one week (Week), one month (Month), two been populated, the NoSQL store is queried to determine months (TMonth) and six months (SMonth). All periods the size of the DWARF structure and the size_as_mb field chosen use bikes data based on the CitiBikes dataset [7]. in the DWARF_Schema column family is updated. We now briefly describe the comparison schemas used in our evaluation. 5. EXPERIMENTS All experiments were run on Ubuntu 14.04 LTS 64-bit MySQL-DWARF. with an ASUS Z87-PRO V motherboard, an Intel Core i7 A diagram of the schema can be seen in Fig 4. This schema 4770K processor and 8GB 1600 MHz RAM. For both MySQL was chosen as it most accurately describes a dwarf structure and Cassandra the DWARF cubes were inserted in bulk. All in a relational database. The reason for the NODE CHILDREN DWARFs contain 8 dimensions however, they differ in the and CELL CHILDREN tables is that nodes can contain number of source tuples used in construction. multiple cells and multiple cells can point to the same node. As part of our evaluation of NoSQL-DWARF cubes and This multi-inheritance like structure is hard to represent ac- Cassandra as a storage mechanism for a DWARF cube, we curately in a traditional RDBMS. used 4 schema models (as shown in Tables 4 and 5) for the purpose of comparison. The first two schema models NoSQL-Min. MySQL and MySQL-Min, represent a DWARF schema in a This schema was created to show that the construct of relational model and a single table DWARF representation a dwarf node does not need to be stored since the dwarf respectively. The next two schema models NoSQL-DWARF cells contain the ids of their parent and pointer nodes and and NoSQL-Min represent the DWARF model described these nodes can be rebuilt at a later stage. An outline of here and a single NoSQL table format respectively. The the schema can be seen in Table 3. DWARF CUBE id node count cell count size as mb int int int int DWARF Cell id item name leaf root cubeid parentNodeId childNodeId int int text bool bool int int int Table 3: NoSQL-Min Schema Table 4: DWARF storage performance Table 5: DWARF storage time performance Size (MB) use to store a DWARF cube Time (Milliseconds) taken to insert a DWARF cube Day Week Month TMonth SMonth Day Week Month TMonth SMonth MySQL-DWARF 2 20 80 169 424 MySQL-DWARF 1768 12501 47247 100466 255098 MySQL-Min <1 8 33 70 178 MySQL-Min 1107 5955 22243 47936 121221 NoSQL-DWARF <1 9 35 73 182 NoSQL-DWARF 927 4368 15955 34203 89257 NoSQL-Min <1 11 45 96 243 NoSQL-Min 5699 57153 222044 484498 1219887 This schema contains only two tables. DWARF Cube and anticipate the absence of a DWARF Node construct will DWARF Cell. DWARF Cube contains information for a have a significant impact on query times as DWARF Node particular DWARF cube inside the database. the DWARF CELL reconstruction is required. column family contains information regarding the individual Storage Time. The time taken to insert a DWARF cube cells of the cube. into each schema is outlined in Table 5. The NoSQL-Min schema performed worst overall with an insertion time of MySQL-Min.. 1220 seconds for the largest dataset. This is due to the num- This schema was designed to test how well MySQL per- ber of indexes present on each schema. With the NoSQL- forms using a schema without joins. It is based on the ap- DWARF schema having one index per table containing the proach presented in the NoSQL-Min schema. id of the DWARF Cell, Node and Cube respectively, the ab- sence of a DWARF Node table in the NoSQL-Min schema 5.1 Results and Analysis necessitates the addition of two secondary indexes on the DWARF Cell table parentNodeId and childNodeId. The Storage Space. In terms of storage size our work out- presence of these indexes increase the resulting insertion performs that presented in [1] where the authors stored a time and size of the cube. The MySQL-DWARF schema had DWARF containing 400,000 tuples with 8 dimensions in the second largest insertion time for the largest dataset with 200MB using their standard DWARF implementation and 255 seconds to insert. This is due to the relational nature 260MB using their recursion clustering method. Conversely of the schema, where each relationship between a Node and using Cassandra and our DWARF implementation we were Cell must be recorded and thus, a large volume of inserts able to store a DWARF cube of 1,181,344 tuples across 8 di- is necessary to represent the DWARF cube. Conversely, mensions in 182MB. However, it is important to note that as with Cassandra, this construct can be described using a set different datasets were used the degree of compression pro- datatype which can complete in one insert operation. The vided by the DWARF algorithm differs which can affect the NoSQL-DWARF schema performed best with an insertion resulting size on disc. The MySQL-Min schema performed time of 89 seconds for the largest dataset. best for the small datasets with < 1 MB for the smallest dataset and 70MB for the TMonth dataset. However, for the largest dataset the NoSQL-DWARF has a smaller resulting 6. RELATED RESEARCH size (182MB for NoSQL-DWARF and 185MB for MySQL- In [5] and [8], the authors describe the creation of an Min). MySQL-DWARF (Fig. 4) performed worst overall OLAP cube from XML sources. Both methods involve com- with a size of 2MB for the smallest dataset and 424MB for bining the XML data sources with data obtained from a rela- the largest. This is due to its relational design. A DWARF tional database, where the data obtained from both sources Node can have many child cells. Each individual relation- is abstracted. In [5], they model an XML cube using an ship between a node and a cell must be individually recorded UML Snowflake diagram. A similar method of integrating in the Node_Children table. Cassandra does not encounter XML and UML for data warehousing was presented in [13]. such problems as these relations can be stored in a single The data is combined using a data integrator to determine set datatype. The MySQL-Min schema had the smallest which data source is to be queried. The end user of the sys- size with < 1MB for the smallest cube and a size of 178MB tem queries the OLAP server using standard OLAP query for the largest cube. This schema had small cube sizes due languages e.g. MDX queries. The results are presented in to the absence of the DWARF Node construct. This sig- relational form to the user. A similar method of abstract- nificantly reduces the size on disc as the relationships be- ing the data from its source format is used in our approach tween a Node and a Cell (which has greatest impact on the to create the DWARF cube and provide the functionality MySQL-DWARF schema) need not be stored. However, we for a DWARF cube to be constructed from multiple source formats. In [8], they propose an XML based representation of a resulting data cube but do not address the underlying problems associated with OLAP cubes (construction time, [1] Bao, Y.; Leng, F.; Wang, D. & Yu, G. A Clustered storage space). In addition, the source data must be kept Dwarf Structure to Speed up Queries on Data Cubes. intact in its original format which could lead to issues when JCSE, 2007, 1, 195-210 updating data. [2] Gui, H. & Roantree, M. Topological XML data cube In [4] and [9], they also store data cubes in native XML for- construction International Journal of Web Engineering mat similar to our NoSQL-DWARF model. However, their and Technology, Inderscience Publishers, 2013, 8, goals are primarily aimed towards interoperability between 347-368 data warehouses as opposed to querying or data mining. [3] Gui, H. & Roantree, M. Using a pipeline approach to In [1], the authors employ a DWARF clustering model for build data cube for large xml data streams Database storing DWARF cubes on a hard disk. They propose a Node Systems for Advanced Applications, Springer Berlin indexing approach where a DWARF Node or DWARF Cell Heidelberg, 2013, 59-73 does not contain a pointer to its sub Nodes/Cells but in- [4] HÃijmmer, W.; Bauer, A. & Harde, G. XCube: XML stead contains the unique ID of its children. This method is for data warehouses Proceedings of the 6th ACM adopted in our Cassandra schema. They propose two algo- international workshop on Data warehousing and rithms for storing a DWARF cube as a flat file; a hierarchical OLAP, 2003, 33-40 clustering model optimised for range queries on a DWARF [5] Jensen, M. R.; MÃÿller, T. H. & Pedersen, T. B. structure and a recursive algorithm which is optimised for Specifying OLAP cubes on XML data Journal of point queries on a DWARF (the recursive algorithm is also Intelligent Information Systems, Springer, 2001, 17, adequate for range queries). However as our approach uses 255-280 Cassandra as a storage model instead of a flat file we do not [6] Lakshman, A. & Malik, P. Cassandra: a decentralized encounter these problems. structured storage system ACM SIGOPS Operating Finally, in [14], the authors use MapReduce to improve Systems Review, ACM, 2010, 44, 35-40 cube construction time but this is only true for the most [7] Marks, G., Roantree, M., Smyth, D.: Optimizing common queries with the rest being constructed in real-time queries for web generated sensor data. In:Proceedings so there is potentially a large overhead in cube construction of the Twenty-Second Australasian Database time depending on the query. In [11], they propose extend- Conference-Volume 115. pp. 47-56.Australian ing DWARF to include dimensional hierarchies, a property Computer Society, Inc. (2011) of OLAP cubes which is not present in DWARF. [5] states that a hierarchical structure of a cube is necessary for a [8] Niemi, T.; NiinimÃd’ki, M.; Nummenmaa, J. & cube constructed from XML sources. They propose stor- Thanisch, P. Constructing an OLAP cube from ing partial DWARFs (where all views are not materialised) distributed XML data Proceedings of the 5th ACM to contain information pertaining to a dimension hierarchy. international workshop on Data Warehousing and The result of this is a modified DWARF structure which can OLAP, 2002, 22-27 use the traditional OLAP operations ROLLUP and DRILL [9] Nguyen, B. T.; Tjoa, A. M.& Mangisengi, O. Meta DOWN. The resulting Hierarchial DWARF structure is sim- Cube-X: An XML Metadata Foundation for ilar to a traditional DWARF structure (Fig 2). However, Interoperability Search among Web Data Warehouses. a DWARF Node can also point to another Node occupying DMDW, 2001, 8 the same dimension level, where this node would contain the [10] Roantree M. and Liu J. A heuristic approach to dimensional hierarchy. An extension of the DWARF Node selecting views for materialization. In Software: schema outlined in Table 1-B where it contains the id of a Practice and Experience 44(10): pp. 1157-1179, 2014. pointer node would accommodate this functionality. [11] Sismanis, Y.; Deligiannakis, A.; Kotidis, Y. & Roussopoulos, N. Hierarchical dwarfs for the rollup cube Proceedings of the 6th ACM international 7. CONCLUSIONS AND FUTURE WORK workshop on Data warehousing and OLAP, 2003, 17-24 Today’s cities have an important new resource, their knowl- [12] Sismanis, Y.; Deligiannakis, A.; Roussopoulos, N. & edge infrastructure, which is generated from low level sensor Kotidis, Y. Dwarf: Shrinking the petacube Proceedings networks, public databases and online information services. of the 2002 ACM SIGMOD international conference on Before this knowledge can be exploited in order to generate Management of data, 2002, 464-475 real impact, we require a number of layers in the technology [13] Trujillo, J.; LujÃan-Mora, S. & Song, I.-Y. Applying , stack for the smart city. Above the data harvesting level, UML and XML for designing and interchanging it is necessary to read and transform data streams and to information for data warehouses and OLAP create the structures (cubes) that higher level applications applications Journal of Database Management (JDM), such as data mining can exploit. In this paper, we presented IGI Global, 2004, 15, 41-72 our novel cube generation approach which takes web gener- [14] Wang, Y.; Song, A. & Luo, J. A ated data and efficiently constructs data cubes. Our usage mapreducemerge-based data cube construction method of the DWARF and NoSQL models delivers new layers of Grid and Cooperative Computing (GCC), 2010 9th efficiency in terms of speed and compression. Our current International Conference on, 2010, 1-6 focus is on cube updates through efficient query primitives for our DWARF cubes. 8. REFERENCES