Efficient Cube Construction for Smart City Data∗

                                               Michael Scriney & Mark Roantree
            Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin 9, Ireland
                           michael.scriney@insight-centre.org, mark.roantree@dcu.ie


ABSTRACT
To deliver powerful smart city environments, there is a re-
quirement to analyse web produced data streams in close to
real time so that city planners can employ up to date pre-
dictive models in both short and long term planning. Data
cubes, fused from multiple sources provide a popular input                             Figure 1: Sample DWARF input
to predictive models. A key component in this infrastructure
is an efficient mechanism for transforming web data (XML
or JSON) into multi-dimensional cubes. In our research, we
                                                                         the same levels of efficiency. However, with the migration of
have developed a framework for efficient transformation of
                                                                         many large data repositories to the NoSQL model, we inves-
XML data from multiple smart city services into DWARF
                                                                         tigate if DWARF cubes based on the NoSQL model can de-
cubes using a NoSQL storage engine. Our evaluation shows
                                                                         liver similar efficiencies and thus, allow a canonical approach
a high level of performance when compared to other ap-
                                                                         to managing XML and JSON smart city data streams.
proaches and thus, provides a platform for predictive models
                                                                            Contribution. Our overall research goal is the mainte-
in a smart city environment.
                                                                         nance of data cubes, fused from the multiple sources listed
                                                                         above. For this paper, we limit ourselves to development
Keywords                                                                 and performance testing the DWARF cubes creation times
Smart City, Data Streams, Cubes, XML Analytics                           and storage size as this provides the fundamentals for our
                                                                         system. Our contribution is the development of a DWARF
                                                                         to NoSQL (bi-directional) mapping mechanism, which facili-
1.    INTRODUCTION                                                       tates the creation of a DWARF cube from XML data and its
   Many smart city applications are fed information from                 storage in a NoSQL database for future retrieval and query-
their online services and repositories through XML and JSON              ing. Our evaluation takes one of these datasets (the bicycle
data objects. As part of our research, we are seeking to                 sharing scheme) and uses a variety of cube definitions to
maintain up to date cubes of information taken from multi-               robustly test and compare its performance.
ple sources in a European capital city (Dublin) in order to                 Structure. This paper is organised as follows: Section
make predictions about the usage of various services. The                2 outlines the structure and implementation of DWARF
data streams in our research include car parks, bicycle shar-            cubes; Section 3 describes DWARF cube storage; Section 4
ing schemes, online auction data, air quality sensor data, and           details the transformation of the in-memory DWARF cube
sales data. While some are not directly associated with the              to a NoSQL database; Section 5 presents our evaluation; re-
smart city project, they may influence decision making and               lated research is presented in Section 6; and in Section 7,
are thus, included in our data cubes. Our data cubes adopt               conclusions are presented.
the DWARF approach [12] which was shown to offer signif-
icant savings in terms of both storage and computational
efficiencies for relational data. Previously, we demonstrated            2.   STRUCTURE AND IMPLEMENTATION
how the DWARF model processes XML data [2], [3] with                          OF DWARF CUBES
∗This work is supported by Science Foundation Ireland un-                   The DWARF [12] compression algorithm produces data
der grant number SFI/12/RC/2289                                          cubes of a compressed form. While clustering algorithms
                                                                         [10], use a form of suffix coalescing for storage efficiency,
                                                                         DWARF uses both suffix and prefix coalescing to detect du-
                                                                         plicate aggregates before they can be computed. The result
                                                                         of this process is a DWARF cube which takes input in the
 c 2016, Copyright is with the authors. Published in the Workshop Pro-   form of a series of tuples which are used to create a DWARF.
ceedings of the EDBT/ICDT 2016 Joint Conference (March 15, 2016, Bor-    Each tuple takes the form:
deaux, France) on CEUR-WS.org (ISSN 1613-0073). Distribution of this     (dimension_1,dimension_2,...,dimension_n,measure).
paper is permitted under the terms of the Creative Commons license CC-   For example the input in Fig. 1 would produce the DWARF
by-nc-nd 4.0
Workshop Proceedings of the EDBT/ICDT 2016 Joint Conference March        cube in Fig. 2.
15, 2016, Bordeaux, France                                                  A DWARF cube is a tree-like structure which contains
                                                                         a particular DWARF Node such as its parent and child
                                                                         cells.

                                                                       • The DWARF_Cell column family contains information
                                                                         regarding a particular DWARF Cell, specifically the
                                                                         id of its parent and pointer node and the measure as-
                                                                         sociated with the cell.

                                                                     The role of the DWARF_Schema column family in Table 1-A
                                                                  is to record an individual DWARF Schema and its metadata.
                                                                  node_count is the amount of DWARF nodes in the schema,
                                                                  cell_count is the amount of cells and size_as_mb is an ap-
                                                                  proximation of the size taken up on disk by the DWARF
                                                                  Schema. The entry_node_id attribute contains the id of
                                                                  the top-level node in the DWARF schema and serves as the
Figure 2: A sample DWARF cube created by Fig. 1 input             entry point for all traversal functions. Finally, the is_cube
                                                                  attribute is a flag which indicates whether or not this par-
                                                                  ticular record is a full DWARF Schema or a DWARF cube
                                                                  constructed from querying a DWARF schema. The role of
two main structures: a DWARF cell and a DWARF node.
                                                                  the DWARF_Cell column family outlined in Table 1-C is to
   A DWARF node is a container for groups of cells which
                                                                  store information about a DWARF Cell. The role of the
share the same parent. At the top level of the tree in a
                                                                  DWARF_Node column family (Table 1-B) is to model all DWARF
DWARF cube there is a root node. This contains the top
                                                                  Nodes in a DWARF cube.
cells of the tree at the highest dimension level.
   A DWARF cell represents a single item in a DWARF
cube. It has a reference key and points to a DWARF node           4.    TRANSFORMATION APPROACH
which contains all of its child cells. The cell itself is con-       We now outline the process used to convert an in-memory
tained in a DWARF node. The value for a particular cell           DWARF Schema into a NoSQL model. This process involves
can be retrieved by starting at the top of the tree and fol-      transforming each Node and Cell in the DWARF structure
lowing the cell paths until the target cell is found. The value   into the relevant NoSQL query which will be used to insert
of a DWARF cell is synonymous with its child’s aggregate          it into the database.
cell. If a DWARF cell does not point to a DWARF node                 The first step in this process is the creation of a DWARF_Schema
further down the tree, it is at the bottom level of the tree      item which is inserted into the DWARF_Schema column fam-
and is called a leaf cell. A leaf cell is the smallest struc-     ily seen in Table 1-A. The id field is obtained by query-
ture in a DWARF cube. The value of a leaf cell is derived         ing the DWARF_Schema column family in the NoSQL data
from the measure item in the tuple list supplied to create        warehouse to determine the next id to be used. The fields
the DWARF cube. This is identical to the measure found in         node_count and cell_count are obtained by scanning the
a traditional Fact Table.                                         DWARF structure in-memory. The field size_as_mb is pop-
                                                                  ulated separately by querying the NoSQL data warehouse to
3.    A NOSQL DWARF MODEL                                         determine the size of the DWARF structure when stored.
  The terminology and notation used in columnar NoSQL                In order to map a DWARF Schema to a NoSQL model,
databases provide the basis of our NoSQL DWARF model.             each Node and Cell in the DWARF must be visited which
Briefly, a columnar NoSQL database is composed of keyspaces       involves a full traversal of the DWARF. Starting from the
and column families. A keyspace can be roughly equated to         root node of the DWARF, the tree is traversed in a breadth-
a database in a relational schema and column families are         first top-down fashion. Using the DWARF presented in Fig.
similar to tables in a RDBMS.                                     2 the root node is processed, then the cell Ireland. From
  In order to correctly model a DWARF Schema, three col-          here the descendant node of Ireland is processed. This con-
umn families must be created: DWARF_Schema, DWARF_Node            tinues until all leaf cells which are descendant from Ireland
and DWARF_Cell. Together, these column families represent         are processed. Afterwards, the root node is revisited and the
the tree-like structure of a DWARF shown in Fig. 2. This          cell France is visited and all of its descendants are evaluated.
model necessitates the construction of a bi-directional model     However, as a DWARF structure contains multiple-inheritance
mapper where a full DWARF Schema can be rebuilt from              (Nodes can have multiple parent cells) precautions must be
the records stored in the NoSQL database. This is achieved        taken in order to prevent Nodes and Cells from being pro-
by reading the records of the DWARF cells and nodes from          cessed multiple times. This is accomplished by a lookup
the NoSQL database and joining them based on their unique         table which records each Node and Cell visited by assign-
ids.                                                              ing them a unique ID. Upon visiting a Cell or Node in the
                                                                  DWARF structure, the lookup table is first checked to en-
     • The DWARF_Schema column family stores information          sure that is has not already been transformed.
       regarding an individual DWARF Schema. This is to              During traversal, the structure of a DWARF Node or Cell
       ensure that multiple schemas can be stored in the same     is evaluated and the relevant INSERT command for a NoSQL
       NoSQL database and is the entry point for query and        database is created. Using Cassandra [6] as a sample NoSQL
       traversal functions.                                       database and the Cassandra Query Language (CQL) as a
                                                                  sample query language, a transformation is presented in Fig.
     • The DWARF_Node column family stores information about      3 where the values of an in-memory DWARF cell are pre-
                                                             DWARF Schema
                                id    node count     cell count size as mb entry node id     is cube
                                int      int             int        int         int            bool
                                                   A: DWARF Schema column family

                                                           DWARF Node
                                         id     parentIds  childrenIds root      schema id
                                         int   set< int > set< int > boolean        int
                                                       B: DWARF Node schema

                                                           DWARF CELL
                  id    key    measure    parentNode    pointerNode   leaf     schema id     dimension table name
                  int   text     int          int            int    boolean       int                text
                                                       C: DWARF Cell schema

                                                 Table 1: NoSQL DWARF Schema


     Sample DWARF Cell Values                                                              Day    Week    Month     TMonth   SMonth
 parentNode: DWARF Node (id 3)                                        Size (MB)            2.1    17.1    54.1      113      338
                                                                      Number of tuples     7358   60102   118934    396756   1181344
 pointerNode        null
 key             ”Fenian St”
                                                                           Table 2: The datasets used in the experiments
 measure              3
 id                   3

INSERT INTO DWARF CELL (id,key,measure,parentNode,
pointerNode,leaf, schema_id, dimension_table_name)
VALUES (3,"Fenian St", 3,3,null,true,1,"Station");


Figure 3: Sample DWARF Cell values and CQL query after
transformation


sented along with the corresponding NoSQL INSERT com-                 Figure 4: MySQL-DWARF Schema for a DWARF cube
mand which is generated after visiting the cell.
   Additionally, if a dimension table is specified in the schema
definition, the dimension_table_name is also updated to in-
                                                                     purpose of these schemas is to firstly show how a NoSQL-
clude the name of the dimension table which contains addi-
                                                                     DWARF compares to its SQL counterpart and subsequently,
tional information about the DWARF Cell.
                                                                     how an unnormalised DWARF model (for the purpose of
   As each appropriate NoSQL query is created, it is added
                                                                     speed in MySQL) performs overall. Five DWARF cubes
a list of Insert expressions. These queries are subsequently
                                                                     were created each in increasing size. Each one representing
executed in a bulk process to populate the column families
                                                                     a time period of bikes data. The five period chosen were
shown in Table 1. Finally, when all column families have
                                                                     one day (Day), one week (Week), one month (Month), two
been populated, the NoSQL store is queried to determine
                                                                     months (TMonth) and six months (SMonth). All periods
the size of the DWARF structure and the size_as_mb field
                                                                     chosen use bikes data based on the CitiBikes dataset [7].
in the DWARF_Schema column family is updated.
                                                                       We now briefly describe the comparison schemas used in
                                                                     our evaluation.
5.   EXPERIMENTS
  All experiments were run on Ubuntu 14.04 LTS 64-bit                MySQL-DWARF.
with an ASUS Z87-PRO V motherboard, an Intel Core i7                    A diagram of the schema can be seen in Fig 4. This schema
4770K processor and 8GB 1600 MHz RAM. For both MySQL                 was chosen as it most accurately describes a dwarf structure
and Cassandra the DWARF cubes were inserted in bulk. All             in a relational database. The reason for the NODE CHILDREN
DWARFs contain 8 dimensions however, they differ in the              and CELL CHILDREN tables is that nodes can contain
number of source tuples used in construction.                        multiple cells and multiple cells can point to the same node.
  As part of our evaluation of NoSQL-DWARF cubes and                 This multi-inheritance like structure is hard to represent ac-
Cassandra as a storage mechanism for a DWARF cube, we                curately in a traditional RDBMS.
used 4 schema models (as shown in Tables 4 and 5) for
the purpose of comparison. The first two schema models               NoSQL-Min.
MySQL and MySQL-Min, represent a DWARF schema in a                     This schema was created to show that the construct of
relational model and a single table DWARF representation             a dwarf node does not need to be stored since the dwarf
respectively. The next two schema models NoSQL-DWARF                 cells contain the ids of their parent and pointer nodes and
and NoSQL-Min represent the DWARF model described                    these nodes can be rebuilt at a later stage. An outline of
here and a single NoSQL table format respectively. The               the schema can be seen in Table 3.
                                                      DWARF CUBE
                                        id    node count cell count size as mb
                                        int       int        int         int
                                                       DWARF Cell
                         id    item   name    leaf root cubeid parentNodeId          childNodeId
                         int    int    text   bool bool    int         int                int

                                              Table 3: NoSQL-Min Schema


          Table 4: DWARF storage performance                            Table 5: DWARF storage time performance


           Size (MB) use to store a DWARF cube                        Time (Milliseconds) taken to insert a DWARF cube
                  Day Week Month TMonth             SMonth                          Day     Week Month TMonth SMonth
 MySQL-DWARF       2      20        80    169        424         MySQL-DWARF        1768 12501       47247    100466   255098
 MySQL-Min        <1       8        33     70        178         MySQL-Min          1107    5955     22243    47936    121221
 NoSQL-DWARF      <1       9        35     73        182         NoSQL-DWARF         927    4368     15955    34203    89257
 NoSQL-Min        <1      11        45     96        243         NoSQL-Min          5699 57153 222044         484498  1219887


   This schema contains only two tables. DWARF Cube and      anticipate the absence of a DWARF Node construct will
DWARF Cell. DWARF Cube contains information for a            have a significant impact on query times as DWARF Node
particular DWARF cube inside the database. the DWARF CELL reconstruction is required.
column family contains information regarding the individual     Storage Time. The time taken to insert a DWARF cube
cells of the cube.                                           into  each schema is outlined in Table 5. The NoSQL-Min
                                                             schema performed worst overall with an insertion time of
MySQL-Min..                                                  1220 seconds for the largest dataset. This is due to the num-
   This schema was designed to test how well MySQL per-      ber of indexes present on each schema. With the NoSQL-
forms using a schema without joins. It is based on the ap-   DWARF schema having one index per table containing the
proach presented in the NoSQL-Min schema.                    id of the DWARF Cell, Node and Cube respectively, the ab-
                                                             sence of a DWARF Node table in the NoSQL-Min schema
5.1 Results and Analysis                                     necessitates the addition of two secondary indexes on the
                                                             DWARF Cell table parentNodeId and childNodeId. The
   Storage Space. In terms of storage size our work out-     presence of these indexes increase the resulting insertion
performs that presented in [1] where the authors stored a    time and size of the cube. The MySQL-DWARF schema had
DWARF containing 400,000 tuples with 8 dimensions in         the second largest insertion time for the largest dataset with
200MB using their standard DWARF implementation and          255 seconds to insert. This is due to the relational nature
260MB using their recursion clustering method. Conversely    of the schema, where each relationship between a Node and
using Cassandra and our DWARF implementation we were         Cell must be recorded and thus, a large volume of inserts
able to store a DWARF cube of 1,181,344 tuples across 8 di-  is necessary to represent the DWARF cube. Conversely,
mensions in 182MB. However, it is important to note that as  with Cassandra, this construct can be described using a set
different datasets were used the degree of compression pro-  datatype which can complete in one insert operation. The
vided by the DWARF algorithm differs which can affect the    NoSQL-DWARF schema performed best with an insertion
resulting size on disc. The MySQL-Min schema performed       time of 89 seconds for the largest dataset.
best for the small datasets with < 1 MB for the smallest
dataset and 70MB for the TMonth dataset. However, for the
largest dataset the NoSQL-DWARF has a smaller resulting      6. RELATED RESEARCH
size (182MB for NoSQL-DWARF and 185MB for MySQL-                In [5] and [8], the authors describe the creation of an
Min). MySQL-DWARF (Fig. 4) performed worst overall           OLAP cube from XML sources. Both methods involve com-
with a size of 2MB for the smallest dataset and 424MB for    bining the XML data sources with data obtained from a rela-
the largest. This is due to its relational design. A DWARF   tional database, where the data obtained from both sources
Node can have many child cells. Each individual relation-    is abstracted. In [5], they model an XML cube using an
ship between a node and a cell must be individually recorded UML Snowflake diagram. A similar method of integrating
in the Node_Children table. Cassandra does not encounter     XML and UML for data warehousing was presented in [13].
such problems as these relations can be stored in a single   The data is combined using a data integrator to determine
set datatype. The MySQL-Min schema had the smallest          which data source is to be queried. The end user of the sys-
size with < 1MB for the smallest cube and a size of 178MB    tem queries the OLAP server using standard OLAP query
for the largest cube. This schema had small cube sizes due   languages e.g. MDX queries. The results are presented in
to the absence of the DWARF Node construct. This sig-        relational form to the user. A similar method of abstract-
nificantly reduces the size on disc as the relationships be- ing the data from its source format is used in our approach
tween a Node and a Cell (which has greatest impact on the    to create the DWARF cube and provide the functionality
MySQL-DWARF schema) need not be stored. However, we          for a DWARF cube to be constructed from multiple source
formats. In [8], they propose an XML based representation
of a resulting data cube but do not address the underlying
problems associated with OLAP cubes (construction time,            [1] Bao, Y.; Leng, F.; Wang, D. & Yu, G. A Clustered
storage space). In addition, the source data must be kept              Dwarf Structure to Speed up Queries on Data Cubes.
intact in its original format which could lead to issues when          JCSE, 2007, 1, 195-210
updating data.                                                     [2] Gui, H. & Roantree, M. Topological XML data cube
   In [4] and [9], they also store data cubes in native XML for-       construction International Journal of Web Engineering
mat similar to our NoSQL-DWARF model. However, their                   and Technology, Inderscience Publishers, 2013, 8,
goals are primarily aimed towards interoperability between             347-368
data warehouses as opposed to querying or data mining.             [3] Gui, H. & Roantree, M. Using a pipeline approach to
   In [1], the authors employ a DWARF clustering model for             build data cube for large xml data streams Database
storing DWARF cubes on a hard disk. They propose a Node                Systems for Advanced Applications, Springer Berlin
indexing approach where a DWARF Node or DWARF Cell                     Heidelberg, 2013, 59-73
does not contain a pointer to its sub Nodes/Cells but in-          [4] HÃijmmer, W.; Bauer, A. & Harde, G. XCube: XML
stead contains the unique ID of its children. This method is           for data warehouses Proceedings of the 6th ACM
adopted in our Cassandra schema. They propose two algo-                international workshop on Data warehousing and
rithms for storing a DWARF cube as a flat file; a hierarchical         OLAP, 2003, 33-40
clustering model optimised for range queries on a DWARF            [5] Jensen, M. R.; MÃÿller, T. H. & Pedersen, T. B.
structure and a recursive algorithm which is optimised for             Specifying OLAP cubes on XML data Journal of
point queries on a DWARF (the recursive algorithm is also              Intelligent Information Systems, Springer, 2001, 17,
adequate for range queries). However as our approach uses              255-280
Cassandra as a storage model instead of a flat file we do not      [6] Lakshman, A. & Malik, P. Cassandra: a decentralized
encounter these problems.                                              structured storage system ACM SIGOPS Operating
   Finally, in [14], the authors use MapReduce to improve              Systems Review, ACM, 2010, 44, 35-40
cube construction time but this is only true for the most
                                                                   [7] Marks, G., Roantree, M., Smyth, D.: Optimizing
common queries with the rest being constructed in real-time
                                                                       queries for web generated sensor data. In:Proceedings
so there is potentially a large overhead in cube construction
                                                                       of the Twenty-Second Australasian Database
time depending on the query. In [11], they propose extend-
                                                                       Conference-Volume 115. pp. 47-56.Australian
ing DWARF to include dimensional hierarchies, a property
                                                                       Computer Society, Inc. (2011)
of OLAP cubes which is not present in DWARF. [5] states
that a hierarchical structure of a cube is necessary for a         [8] Niemi, T.; NiinimÃd’ki, M.; Nummenmaa, J. &
cube constructed from XML sources. They propose stor-                  Thanisch, P. Constructing an OLAP cube from
ing partial DWARFs (where all views are not materialised)              distributed XML data Proceedings of the 5th ACM
to contain information pertaining to a dimension hierarchy.            international workshop on Data Warehousing and
The result of this is a modified DWARF structure which can             OLAP, 2002, 22-27
use the traditional OLAP operations ROLLUP and DRILL               [9] Nguyen, B. T.; Tjoa, A. M.& Mangisengi, O. Meta
DOWN. The resulting Hierarchial DWARF structure is sim-                Cube-X: An XML Metadata Foundation for
ilar to a traditional DWARF structure (Fig 2). However,                Interoperability Search among Web Data Warehouses.
a DWARF Node can also point to another Node occupying                  DMDW, 2001, 8
the same dimension level, where this node would contain the        [10] Roantree M. and Liu J. A heuristic approach to
dimensional hierarchy. An extension of the DWARF Node                  selecting views for materialization. In Software:
schema outlined in Table 1-B where it contains the id of a             Practice and Experience 44(10): pp. 1157-1179, 2014.
pointer node would accommodate this functionality.                 [11] Sismanis, Y.; Deligiannakis, A.; Kotidis, Y. &
                                                                       Roussopoulos, N. Hierarchical dwarfs for the rollup
                                                                       cube Proceedings of the 6th ACM international
7.   CONCLUSIONS AND FUTURE WORK                                       workshop on Data warehousing and OLAP, 2003, 17-24
   Today’s cities have an important new resource, their knowl-     [12] Sismanis, Y.; Deligiannakis, A.; Roussopoulos, N. &
edge infrastructure, which is generated from low level sensor          Kotidis, Y. Dwarf: Shrinking the petacube Proceedings
networks, public databases and online information services.            of the 2002 ACM SIGMOD international conference on
Before this knowledge can be exploited in order to generate            Management of data, 2002, 464-475
real impact, we require a number of layers in the technology       [13] Trujillo, J.; LujÃan-Mora, S. & Song, I.-Y. Applying
                                                                                            ,
stack for the smart city. Above the data harvesting level,             UML and XML for designing and interchanging
it is necessary to read and transform data streams and to              information for data warehouses and OLAP
create the structures (cubes) that higher level applications           applications Journal of Database Management (JDM),
such as data mining can exploit. In this paper, we presented           IGI Global, 2004, 15, 41-72
our novel cube generation approach which takes web gener-          [14] Wang, Y.; Song, A. & Luo, J. A
ated data and efficiently constructs data cubes. Our usage             mapreducemerge-based data cube construction method
of the DWARF and NoSQL models delivers new layers of                   Grid and Cooperative Computing (GCC), 2010 9th
efficiency in terms of speed and compression. Our current              International Conference on, 2010, 1-6
focus is on cube updates through efficient query primitives
for our DWARF cubes.


8.   REFERENCES