                 Proceedings

     25. GI-Workshop „Grundlagen von Datenbanken“
              28.05.2013 – 31.05.2013

               Ilmenau, Germany




                Kai-Uwe Sattler
                Stephan Baumann
                  Felix Beier
                  Heiko Betz
              Francis Gropengießer
                Stefan Hagedorn
                    (Eds.)
Preface

Dear participants,

From May 28 to May 31, 2013, the workshop „Grundlagen von Datenbanken“ of the GI working
group „Grundlagen von Informationssystemen“ within the GI division Datenbanken und
Informationssysteme (DBIS) took place for the 25th time. After Austria in 2011 and the
Spreewald in 2012, Thuringia hosted the workshop for the third time, this year in the small
municipality of Elgersburg at the foot of the Hohe Warte in the Ilm district. The workshop
was organized by the Databases and Information Systems Group of TU Ilmenau.
   The workshop series, launched in 1989 in Volkse near Braunschweig by the Braunschweig
database group and held in Volkse for its first three years, has since established itself as an
institution for the exchange of ideas, especially among young researchers from the
German-speaking countries, in the field of databases and information systems. The
restrictions to German as the presentation language and to purely theory- and
foundations-oriented topics have long been dropped, while the open atmosphere at secluded
venues (and Elgersburg was no exception), with plenty of time for intensive discussions
during the sessions and in the evenings, has remained.
   For this year's workshop, 15 contributions were submitted and each was reviewed by three
members of the 13-member program committee. Of all submissions, 13 were selected for
presentation at the workshop. The topics ranged from almost classical database themes such
as query processing (with XQuery), conceptual modeling (for XML schema evolution), index
structures (for patterns on moving objects), and the detection of column correlations, over
current topics such as MapReduce and cloud databases, to applications in image retrieval,
information extraction, complex event processing, and security.
   The four-day program was completed by two keynotes from renowned database researchers:
Theo Härder presented the WattDB project on an energy-proportional database system, and
Peter Boncz discussed the challenges that modern hardware architectures pose for the
optimization of database systems, accompanied by practical demonstrations. We thank both
of them for coming and for their interesting talks. In two further talks, the sponsors of this
year's workshop, SAP AG and Objectivity Inc., took the opportunity to present the database
technologies behind HANA (SAP AG) and InfiniteGraph (Objectivity Inc.). We therefore
thank the speakers Hannes Rauhe and Timo Wagner, as well as the two companies, for the
financial support of the workshop and thus of the work of the GI working group.
   Thanks are also due to everyone involved in organizing and running the workshop: the
authors for their contributions and talks, the members of the program committee for their
constructive and punctual reviews of the submissions, the staff of the Hotel am Wald in
Elgersburg, the steering committee of the working group in the persons of Günther Specht
and Stefan Conrad, who did not miss the chance to attend the workshop in person, and Eike
Schallehn, who supported us behind the scenes with advice and assistance. The greatest
thanks, however, go to my group's team, which carried out the bulk of the organizational
work: Stephan Baumann, Francis Gropengießer, Heiko Betz, Stefan Hagedorn and Felix Beier.
Without their commitment the workshop would not have been possible. Thank you very much!

Kai-Uwe Sattler


Ilmenau, May 28, 2013
Committee
Program Committee
     • Andreas Heuer, Universität Rostock
     • Eike Schallehn, Universität Magdeburg
     • Erik Buchmann, Karlsruher Institut für Technologie

     • Friederike Klan, Universität Jena
     • Gunter Saake, Universität Magdeburg
     • Günther Specht, Universität Innsbruck
     • Holger Schwarz, Universität Stuttgart

     • Ingo Schmitt, Brandenburgische Technische Universität Cottbus
     • Kai-Uwe Sattler, Technische Universität Ilmenau
     • Katja Hose, Aalborg University

     • Klaus Meyer-Wegener, Universität Erlangen
     • Stefan Conrad, Universität Düsseldorf
     • Torsten Grust, Universität Tübingen

Organizing Committee
     • Kai-Uwe Sattler, TU Ilmenau
     • Stephan Baumann, TU Ilmenau

     • Felix Beier, TU Ilmenau
     • Heiko Betz, TU Ilmenau
     • Francis Gropengießer, TU Ilmenau
     • Stefan Hagedorn, TU Ilmenau




Table of Contents

1 Keynotes
  1.1  WattDB—a Rocky Road to Energy Proportionality (Theo Härder)
  1.2  Optimizing database architecture for machine architecture: is there still hope?
       (Peter Boncz)

2 Workshop Contributions
  2.1  Adaptive Prejoin Approach for Performance Optimization in MapReduce-based
       Warehouses (Weiping Qu, Michael Rappold and Stefan Dessloch)
  2.2  Ein Cloud-basiertes räumliches Decision Support System für die Herausforderungen
       der Energiewende (Golo Klossek, Stefanie Scherzinger and Michael Sterner)
  2.3  Consistency Models for Cloud-based Online Games: the Storage System's Perspective
       (Ziqiang Diao)
  2.4  Analysis of DDoS Detection Systems (Michael Singhof)
  2.5  A Conceptual Model for the XML Schema Evolution (Thomas Nösinger, Meike Klettke
       and Andreas Heuer)
  2.6  Semantic Enrichment of Ontology Mappings: Detecting Relation Types and Complex
       Correspondences (Patrick Arnold)
  2.7  Extraktion und Anreicherung von Merkmalshierarchien durch Analyse unstrukturierter
       Produktrezensionen (Robin Küppers)
  2.8  Ein Verfahren zur Beschleunigung eines neuronalen Netzes für die Verwendung im
       Image Retrieval (Daniel Braun)
  2.9  Auffinden von Spaltenkorrelationen mithilfe proaktiver und reaktiver Verfahren
       (Katharina Büchse)
  2.10 MVAL: Addressing the Insider Threat by Valuation-based Query Processing
       (Stefan Barthel and Eike Schallehn)
  2.11 TrIMPI: A Data Structure for Efficient Pattern Matching on Moving Objects
       (Tsvetelin Polomski and Hans-Joachim Klein)
  2.12 Complex Event Processing in Wireless Sensor Networks (Omran Saleh)
  2.13 XQuery processing over NoSQL stores (Henrique Valer, Caetano Sauer and
       Theo Härder)
           WattDB—a Rocky Road to Energy Proportionality

                                                            Theo Härder
                                      Databases and Information Systems Group
                                        University of Kaiserslautern, Germany
                                                haerder@cs.uni-kl.de


25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
28.05.2013 – 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

Extended Abstract

Energy efficiency is becoming more important in database design, i. e., the work delivered by
a database server should be accomplished with minimal energy consumption. So far, a
substantial number of research papers have examined and optimized the energy consumption
of database servers or single components. In this way, our first efforts were exclusively
focused on the use of flash memory or SSDs in a DBMS context to identify their performance
potential for typical DB operations. In particular, we developed tailor-made algorithms to
support caching for flash-based databases [3], however with limited success concerning the
energy efficiency of the entire database server.
   A key observation made by Tsirogiannis et al. [5] concerning the energy efficiency of single
servers is that the best performing configuration is also the most energy-efficient one, because
power use is not proportional to system utilization and, for this reason, the runtime needed
for accomplishing a computing task essentially determines energy consumption. Based on our
caching experiments for flash-based databases, we came to the same conclusion [2]. Hence,
the server system must be fully utilized to be most energy efficient. However, real-world
workloads do not stress servers continuously. Typically, their average utilization ranges
between 20 and 50% of peak performance [1]. Therefore, traditional single-server DBMSs are
chronically underutilized and operate below their optimal energy-consumption-per-query
ratio. As a result, there is a big optimization opportunity to decrease energy consumption
during off-peak times.
   Because the energy use of single-server systems is far from being energy proportional, we
came up with the hypothesis that better energy efficiency may be achieved by a cluster of
nodes whose size is dynamically adjusted to the current workload demand. For this reason,
we shifted our research focus from inflexible single-server DBMSs to distributed clusters
running on lightweight nodes. Although distributed systems impose some performance
degradation compared to a single, brawny server, they offer higher energy saving potential
in turn.
   Current hardware is not energy proportional, because a single server consumes, even when
idle, a substantial fraction of its peak power [1]. Because typical usage patterns lead to a
server utilization far less than its maximum, the energy efficiency of a server aside from peak
performance is reduced [4]. In order to achieve energy proportionality using commodity
hardware, we have chosen a clustered approach, where each node can be powered
independently. By turning whole nodes on and off, the overall performance and energy
consumption can be fitted to the current workload. Unused servers could be either shut down
or made available to other processes. If present in a cloud, those servers could be leased to
other applications.
   We have developed a research prototype of a distributed DBMS called WattDB on a
scale-out architecture, consisting of n wimpy computing nodes, interconnected by a 1 Gbit/s
Ethernet switch. The cluster currently consists of 10 identical nodes, each composed of an
Intel Atom D510 CPU, 2 GB DRAM and an SSD. The configuration is considered
Amdahl-balanced, i. e., balanced between I/O and network throughput on the one hand and
processing power on the other.
   Compared to InfiniBand, the bandwidth of the interconnecting network is limited but
sufficient to supply the lightweight nodes with data. More expensive, yet faster connections
would have required more powerful processors and more sophisticated I/O subsystems. Such
a design would have pushed the cost beyond limits, especially because we would not have
been able to use commodity hardware. Furthermore, by choosing lightweight components,
the overall energy footprint is low and the smallest configuration, i. e., the one with the
fewest nodes, exhibits low power consumption. Moreover, experiments running on a small
cluster can easily be repeated on a cluster with more powerful nodes.
   A dedicated node is the master node, handling incoming queries and coordinating the
cluster. Some of the nodes each have four hard disks attached and act as storage nodes,
providing persistent data storage to the cluster. The remaining nodes (without hard disk
drives) are called processing nodes. Due to the lack of directly accessible storage, they can
only operate on data provided by other nodes (see Figure 1).
   All nodes can evaluate (partial) query plans and execute DB operators, e. g., sorting,
aggregation, etc., but only the storage nodes can access the DB storage structures, i. e.,
tables and indexes.
[Figure 1: Overview of the WattDB cluster (a master node, processing nodes, and storage
nodes with four attached disks each).]

Each storage node maintains a DB buffer to keep recently referenced pages in main memory,
whereas a processing node does not cache intermediate results. As a consequence, each query
always needs to fetch the qualified records from the corresponding storage nodes.
   Hence, our cluster design results in a shared-nothing architecture where the nodes differ
only in whether they have direct access to DB data on external storage. Each of the nodes is
additionally equipped with a 128 GB solid-state disk (Samsung 830 SSD). The SSDs do not
store the DB data; they provide swap space to support external sorting and persistent
storage for configuration files. We have chosen SSDs because their access latency is much
lower compared to traditional hard disks; hence, they are better suited for temp storage.
   In WattDB, a dedicated component running on the master node, called the
EnergyController, controls the energy consumption. This component monitors the
performance of all nodes in the cluster. Depending on the current query workload and node
utilization, the EnergyController activates and suspends nodes to guarantee a sufficiently
high node utilization for the current workload demand. Suspended nodes consume only a
fraction of the idle power, but can be brought back online in a matter of a few seconds. The
EnergyController also modifies query plans to dynamically distribute the current workload
over all running nodes, thereby achieving a balanced utilization of the active processing
nodes.
   As data-intensive workloads, we submit specific TPC-H queries against a distributed
shared-nothing DBMS, where time and energy use are captured by specific monitoring and
measurement devices. We configure various static clusters of varying sizes and show their
influence on energy efficiency and performance. Further, using the EnergyController and a
load-aware scheduler, we verify the hypothesis that energy proportionality for database
management tasks can be well approximated by dynamic clusters of wimpy computing nodes.

1.   REFERENCES
[1] L. A. Barroso and U. Hölzle. The Case for Energy-Proportional Computing. IEEE
    Computer, 40(12):33–37, 2007.
[2] T. Härder, V. Hudlet, Y. Ou, and D. Schall. Energy efficiency is not enough, energy
    proportionality is needed! In DASFAA Workshops, 1st Int. Workshop on FlashDB,
    LNCS 6637, pages 226–239, 2011.
[3] Y. Ou, T. Härder, and D. Schall. Performance and Power Evaluation of Flash-Aware
    Buffer Algorithms. In DEXA, LNCS 6261, pages 183–197, 2010.
[4] D. Schall, V. Höfner, and M. Kern. Towards an Enhanced Benchmark Advocating
    Energy-Efficient Systems. In TPCTC, LNCS 7144, pages 31–45, 2012.
[5] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah. Analyzing the Energy Efficiency of a
    Database Server. In SIGMOD Conference, pages 231–242, 2010.
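
To make the suspend/resume policy of the EnergyController described above more concrete,
here is a minimal, hypothetical sketch in Python. It is not WattDB code; the node API, the
utilization metric, and the thresholds are assumptions chosen only for illustration.

# Minimal sketch of an EnergyController-style control loop (hypothetical, not WattDB code).
# Assumptions: each node reports a utilization in [0, 1]; suspended nodes can be resumed
# within seconds; thresholds and the averaging policy are invented for illustration.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    utilization: float = 0.0   # fraction of capacity currently used
    active: bool = True

class EnergyController:
    def __init__(self, nodes, low=0.3, high=0.8, min_active=1):
        self.nodes = nodes
        self.low, self.high, self.min_active = low, high, min_active

    def step(self):
        active = [n for n in self.nodes if n.active]
        suspended = [n for n in self.nodes if not n.active]
        avg = sum(n.utilization for n in active) / len(active)

        if avg > self.high and suspended:
            # Cluster is overloaded: wake one suspended node.
            suspended[0].active = True
        elif avg < self.low and len(active) > self.min_active:
            # Cluster is underutilized: suspend the least-loaded node
            # (its work would be redistributed by the query planner).
            idle = min(active, key=lambda n: n.utilization)
            idle.active = False

# Example: a shrinking load leads the controller to power nodes down step by step.
cluster = [Node(f"node{i}", utilization=0.9) for i in range(4)]
ctrl = EnergyController(cluster)
for load in (0.9, 0.5, 0.2, 0.1):
    for n in cluster:
        n.utilization = load if n.active else 0.0
    ctrl.step()
    print(load, [n.name for n in cluster if n.active])

In WattDB itself, such decisions additionally feed back into query planning, so that the
current workload is redistributed over the nodes that remain active.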
Optimizing database architecture for machine architecture:
                   is there still hope?

                                                             Peter Boncz
                                                               CWI
                                                          p.boncz@cwi.nl

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
28.05.2013 – 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

Extended Abstract

In the keynote, I will give some examples of how computer architecture has strongly evolved
in the past decades and how this influences the performance, and therefore the design, of
algorithms and data structures for data management. On the one hand, these changes in
hardware architecture have caused the (continuing) need for new data management research,
i.e. hardware-conscious database research. Here, I will draw examples from
hardware-conscious research performed on the CWI systems MonetDB and Vectorwise.
   This diversification trend in the computer architectural characteristics of the various
solutions in the market seems to be intensifying. This is seen in quite different architectural
options, such as CPU vs. GPU vs. FPGA, but even restricting oneself to just CPUs there
seems to be increasing design variation in architecture and platform behavior. This poses a
challenge to hardware-conscious database research.
   In particular, there is the all too present danger of over-optimizing for one particular
architecture, or of proposing techniques that will have only a very short span of utility. The
question thus is not only to find specific ways to optimize for certain hardware features, but
to do so in a way that works across the full spectrum of architectures, i.e. with robust
techniques.
   I will close the talk with recent work at CWI and Vectorwise on the robustness of query
evaluator performance, describing a project called "Micro-Adaptivity" in which database
systems are made self-adaptive and react immediately to observed performance,
self-optimizing for the combination of current query workload, observed data distributions,
and hardware characteristics.
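
As a rough illustration of the micro-adaptivity idea, the following hypothetical Python sketch
keeps several interchangeable implementations ("flavors") of one primitive and periodically
re-times them so that the currently fastest one is used. The flavors, the exploration policy,
and all names are invented for illustration and are not the Vectorwise implementation.

# Hypothetical sketch of micro-adaptive flavor selection: several equivalent
# implementations of one primitive compete, and the system keeps re-measuring
# them so it can switch whenever another flavor becomes faster.
import time

def filter_loop(values, threshold):
    out = []
    for v in values:          # branch-heavy flavor
        if v < threshold:
            out.append(v)
    return out

def filter_comprehension(values, threshold):
    return [v for v in values if v < threshold]   # list-comprehension flavor

class MicroAdaptiveFilter:
    def __init__(self, flavors, explore_every=100):
        self.flavors = flavors
        self.timings = {f: float("inf") for f in flavors}
        self.calls = 0
        self.explore_every = explore_every

    def __call__(self, values, threshold):
        self.calls += 1
        if self.calls % self.explore_every < len(self.flavors):
            # Exploration: time one flavor on this call.
            flavor = self.flavors[self.calls % self.explore_every]
            start = time.perf_counter()
            result = flavor(values, threshold)
            self.timings[flavor] = time.perf_counter() - start
            return result
        # Exploitation: use the currently fastest flavor.
        best = min(self.flavors, key=lambda f: self.timings[f])
        return best(values, threshold)

adaptive_filter = MicroAdaptiveFilter([filter_loop, filter_comprehension])
data = list(range(10000))
print(len(adaptive_filter(data, 5000)))   # 5000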
     Adaptive Prejoin Approach for Performance Optimization
                in MapReduce-based Warehouses

Weiping Qu (Heterogeneous Information Systems Group, University of Kaiserslautern),
qu@informatik.uni-kl.de
Michael Rappold* (Department of Computer Science, University of Kaiserslautern),
m_rappol@cs.uni-kl.de
Stefan Dessloch (Heterogeneous Information Systems Group, University of Kaiserslautern),
dessloch@informatik.uni-kl.de

* Michael Rappold carried out this work during his master's studies at the University of
Kaiserslautern.
25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
28.05.2013 – 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

ABSTRACT

MapReduce-based warehousing solutions (e.g. Hive) for big data analytics, with the
capability of storing and analyzing high volumes of both structured and unstructured data in
a scalable file system, have emerged recently. Their efficient data loading features enable a
so-called near real-time warehousing solution, in contrast to those offered by conventional
data warehouses with complex, long-running ETL processes.
   However, there are still many opportunities for performance improvements in MapReduce
systems. The performance of analyzing structured data in them cannot match that of
traditional data warehouses. For example, join operations are generally regarded as a
bottleneck of performing generic complex analytics over structured data with MapReduce jobs.
   In this paper, we present an approach for improving performance in MapReduce-based
warehouses by redundantly pre-joining frequently used dimension columns with the fact table
during data transfer and by adapting queries to this join-friendly schema automatically at
runtime using a rewrite component. This approach is driven by statistics derived from
previously executed workloads in terms of join operations.
   The results show that the execution performance is improved by getting rid of join
operations in a set of future workloads whose joins exactly fit the pre-joined fact table
schema, while the performance remains the same for other workloads.

1.   INTRODUCTION

   By packaging complex custom imperative programs (text mining, machine learning, etc.)
into simple map and reduce functions and executing them in parallel on files in a large
scalable file system, MapReduce/Hadoop (1) systems enable analytics on large amounts of
unstructured or structured data in acceptable response time.
   With the continuous growth of data, scalable data stores based on Hadoop/HDFS (2) have
attracted more and more attention for big data analytics. In addition, by means of simply
pulling data into the file system of MapReduce-based systems, unstructured data without
schema information is directly analyzed with parallelizable custom programs, whereas data
can only be queried in traditional data warehouses after it has been loaded by ETL tools
(cleansing, normalization, etc.), which normally takes a long period of time.
   Consequently, many web or business companies add MapReduce systems to their
analytical architecture. For example, Fatma Özcan et al. [12] integrate their DB2 warehouse
with the Hadoop-based analysis tool IBM InfoSphere BigInsights using connectors between
the two platforms. An analytical synthesis is provided, where unstructured data is initially
placed in a Hadoop-based system and analyzed by MapReduce programs. Once its schema
can be defined, it is further loaded into a DB2 warehouse with more efficient analysis
execution capabilities.
   Another example is the data warehousing infrastructure at Facebook, which involves a
web-based tier, a federated MySQL tier and a Hadoop-based analytical cluster (Hive).
   Such an orchestration of various analytical platforms forms a heterogeneous environment
where each platform has a different interface, data model, computational capability, storage
system, etc.
   Pursuing a global optimization in such a heterogeneous environment is always challenging,
since it is generally hard to estimate the computational capability or operational cost
precisely on each autonomous platform. The internal query engine and storage system do not
tend to be exposed to the outside and are not designed for data integration.
   In our case, relational databases and Hadoop will be integrated together to deliver an
analytical cluster. Simply transferring data from relational databases to Hadoop without
considering the computational capabilities in Hadoop can lead to lower performance.

(1) Hadoop is an open-source implementation of the MapReduce framework from the Apache
community, see http://hadoop.apache.org
(2) HDFS (Hadoop Distributed File System) is used to store the data in Hadoop for analysis.
   As an example, performing complex analytical workloads over multiple small/large tables
(loaded from relational databases) in Hadoop leads to a number of join operations, which
slows down the whole processing. The reason is that join performance is normally weak in
MapReduce systems compared to relational databases [15]. Performance limitations have
been shown to be due to several reasons, such as the inherently unary nature of the map and
reduce functions.
   To achieve better global performance in such an analytical synthesis with multiple
platforms, several strategies can be applied from a global perspective.
   One would be simply improving the join implementation on a single MapReduce platform.
There have been several existing works trying to improve join performance in MapReduce
systems [3, 1].
   Another one would be using heuristics for global performance optimization. In this paper,
we take a look at the second one. In order to validate our general idea of improving global
performance on multiple platforms, we deliver our adaptive approach in terms of join
performance. We take the data flow architecture at Facebook as a starting point and the
contributions are summarized as follows:
  1. Adaptively pre-joining tables during data transfer for better performance in
     Hadoop/Hive.

  2. Rewriting incoming queries according to the changing table schema.

   The remainder of this paper is structured as follows: Section 2 describes the background of
this paper. Section 3 gives a naïve approach of fully pre-joining related tables. Based on the
performance observations for this naïve approach, more considerations have been taken into
account and an adaptive pre-join approach is proposed in Section 4, followed by the
implementation and experimental evaluation shown in Section 5. Section 6 discusses related
work. Section 7 concludes with a summary and future work.

2.   BACKGROUND

   In this section, we introduce our starting point, i.e. the analytical data flow architecture at
Facebook and its MapReduce-based analytical platform Hive. In addition, the performance
issue in terms of joins is stated subsequently.

2.1    Facebook Data Flow Architecture

   Instead of using a traditional data warehouse, Facebook uses Hive, a MapReduce-based
analytical platform, to perform analytics on information describing advertisements. The
MapReduce/Hadoop system offers high scalability, which enables Facebook to perform data
analytics over 15 PB of data and to load 60 TB of new data every day [17]. The architecture
of the data flow at Facebook is described as follows.
   As depicted in Figure 1, data is extracted from two types of data sources: a federated
MySQL tier and a web-based tier. The former offers the category, the name and
corresponding information of the advertisements as dimension data, while actions such as
viewing an advertisement, clicking on it, or fanning a Facebook page are extracted as fact
data from the latter.

[Figure 1: Facebook Data Flow Architecture [17] (web servers and Scribe-Hadoop clusters
feeding an ad hoc and a production Hive-Hadoop cluster, with dimension data loaded from
the federated MySQL tier).]

   There are two types of analytical clusters: a production Hive cluster and an ad hoc Hive
cluster. Periodic queries are performed on the production Hive cluster, while ad hoc queries
are executed on the ad hoc Hive cluster.

2.2    Hive

   Hive [16] is an open source data warehousing solution built on top of MapReduce/Hadoop.
Analytics is essentially done by MapReduce jobs and data is still stored and managed in
Hadoop/HDFS.
   Hive supports a higher-level SQL-like language called HiveQL for users who are familiar
with SQL for accessing files in Hadoop/HDFS, which highly increases the productivity of
using MapReduce systems. When a HiveQL query comes in, it is automatically translated
into corresponding MapReduce jobs with the same analytical semantics. For this purpose,
Hive has its own meta-data store which maps the HDFS files to the relational data model.
Files are logically interpreted as relational tables during HiveQL query execution.
   Furthermore, in contrast to the high data loading cost (using ETL jobs) in traditional data
warehouses, Hive benefits from its efficient loading process, which pulls raw files directly into
Hadoop/HDFS and further publishes them as tables. This feature makes Hive much more
suitable for dealing with large volumes of data (i.e. big data).

2.3    Join in Hadoop/Hive

   There has been an ongoing debate comparing parallel database systems and
MapReduce/Hadoop. In [13], experiments showed that the performance of selection,
aggregation and join tasks in Hadoop could not reach that of parallel databases (Vertica &
DBMS-X). Several reasons for the performance difference have also been explained by
Stonebraker et al. in [15], such as repetitive record parsing and high I/O cost due to
non-compression & non-indexing.
   Moreover, as MapReduce was not originally designed to combine information from two or
more data sources, join implementations are always cumbersome [3]. The join performance
relies heavily on the implementation of the MapReduce jobs, which has been considered as
not straightforward.
   As Hive is built on top of MapReduce/Hadoop, the join operation is essentially done by
corresponding MapReduce jobs. Thus, Hive suffers from these issues even though there have
been efforts [5] to improve join performance in MapReduce systems or in Hive.
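
To make the join bottleneck discussed in Section 2.3 concrete, the following sketch simulates
a reduce-side (repartition) join, the generic way of joining two data sets with a single
MapReduce job. It is a simplified Python simulation rather than Hive's actual join code, and
all table and column names are invented. It illustrates that both inputs have to be tagged,
shuffled by the join key, and buffered at the reducers, which is exactly the overhead that the
pre-join approach presented below tries to avoid.

# Simplified simulation of a reduce-side (repartition) join in a MapReduce setting.
# All data and table names are illustrative.
from collections import defaultdict

def map_phase(fact_rows, dim_rows):
    # Emit (join_key, (tag, row)) pairs for both inputs; the tag remembers the source.
    for row in fact_rows:
        yield row["dim_id"], ("fact", row)
    for row in dim_rows:
        yield row["id"], ("dim", row)

def shuffle(pairs):
    # The framework groups all values with the same key on one reducer.
    groups = defaultdict(list)
    for key, tagged_row in pairs:
        groups[key].append(tagged_row)
    return groups

def reduce_phase(groups):
    # Each reducer buffers both sides for its key and builds the joined rows.
    for key, tagged_rows in groups.items():
        facts = [r for tag, r in tagged_rows if tag == "fact"]
        dims = [r for tag, r in tagged_rows if tag == "dim"]
        for f in facts:
            for d in dims:
                yield {**f, **d}

fact = [{"dim_id": 1, "clicks": 10}, {"dim_id": 2, "clicks": 7}]
dim = [{"id": 1, "category": "sports"}, {"id": 2, "category": "news"}]
print(list(reduce_phase(shuffle(map_phase(fact, dim)))))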
3.   FULL PRE-JOIN APPROACH

   Due to the fact that join performance is a bottleneck in Hive with its inherent MapReduce
features, one naïve idea for improving total workload performance would be to simply
eliminate the join task from the workload by performing a rewritten workload with the same
analytical semantics over pre-joined tables created in the data load phase. A performance
gain would be expected by performing a large table scan with the high parallelism of an
increasing number of worker nodes in Hadoop instead of a join. In addition, the scalable
storage system allows us to create redundant pre-joined tables for some workloads with
specific join patterns.
   In an experiment, we tried to validate this strategy. An analytical workload (TPC-H
Query 3) was executed over two data sets of the TPC-H benchmark (with scale factors 5 and
10), once on the original table schema (with the join at runtime) and once on a fully
pre-joined table schema (without join) which fully joins all the related dimension tables with
the fact table during the load phase. In this case, we trade storage overhead for better total
performance.
   As shown on the left side of Figure 2(a), a performance gain for the total workload
(including the join) over the data set with SF 5 can be seen, with 6 GB storage overhead
introduced by fully pre-joining the related tables into one redundant table (shown in
Figure 2(b)). The overall performance can be significantly increased if workloads with the
same join pattern later occur frequently, especially for periodic queries over the production
Hive-Hadoop cluster in the Facebook example.

[Figure 2: Running TPC-H Query 3 on the original and the fully pre-joined table schema:
(a) average runtimes in seconds and (b) accessed data volume in GB for both data set sizes,
with and without pre-join.]

   However, the result of performing the same query on the data set with SF 10 is
disappointing, as there is no performance gain while paying 12.5 GB of storage for
redundancy (shown in Figure 2(b)), which is not what we expected. The reason could be that
the overhead of scanning such a redundant fully pre-joined table as well as the high I/O cost
offset the performance gain as the accessed data volume grows.

4.   ADAPTIVE PRE-JOIN APPROACH

   Taking the lessons learned from the full pre-join approach above, we propose an adaptive
pre-join approach in this paper.
   Instead of pre-joining full dimension tables with the fact table, we try to identify the
dimension columns which occurred frequently in the select, where, etc. clauses of previously
executed queries for filtering, aggregation and so on. We refer to these columns as additional
columns, as compared to the join columns in the join predicates. By collecting a list of
additional column sets from previous queries, for example the periodic queries on the
production Hive-Hadoop cluster, a frequent column set could be extracted.
   One example is illustrated in Figure 3. The frequent set of additional columns has been
extracted. The column r in dimension table α is frequently joined with the fact table in
previous workloads as a filter or aggregate column, and the same holds for the column x in
dimension table β. During the next load phase, the fact table is expanded by redundantly
pre-joining these two additional columns r and x with it.

[Figure 3: Adaptive Pre-joined Schema in Facebook Example (the fact table λ is expanded
with the frequently used dimension columns α.r and β.x to the pre-joined fact table
λ′ = (λ, α.r, β.x)).]

   Depending on the statistics information of previous queries, different frequent sets of
additional columns could be found in diverse time intervals. Thus, the fact table is pre-joined
in an adaptive manner.
   Assuming that the additional columns identified in previous queries will also occur
frequently in future ones (as in the Facebook example), the benefits of the adaptive pre-join
approach are two-fold:
   First, when all the columns (including dimension columns) in a certain incoming query
which requires a join operation are contained in the pre-joined fact table, this query can be
directly performed on the pre-joined fact table without a join.
   Second, the adaptive pre-join approach leads to a smaller table size in contrast to the full
pre-join approach, as only subsets of the dimension tables are pre-joined. Thus, the resulting
storage overhead is reduced, which plays a significant role especially in big data scenarios
(i.e. terabytes, petabytes of data).
   To automatically accomplish the adaptive pre-join approach, three sub-steps are developed:
frequent column set extraction, pre-join and query rewrite.

4.1    Frequent Column Set Extraction

   In the first phase, the statistics collected for extracting the frequent set of additional
columns are formatted as a list of entries, each of which has the following form:

   Set : {Fact, Dim_X.Col_i, Dim_X.Col_j ... Dim_Y.Col_k}

   The join set always starts with the involved fact table,
tured from the select, where, etc. clauses or from the sub-                                                                                                  to answer queries using views in data warehouses. Further-
queries.                                                                                                                                                     more, several subsequent works [14, 10] have focuses on dy-
  The frequent set of additional columns could be extracted                                                                                                  namic view management based on runtime statistics (e.g.
using a set of frequent itemset mining approaches [2, 7, 11]                                                                                                 reference frequency, result data size, execution cost) and
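As a minimal sketch of this extraction step, assume the additional-column sets of previously executed queries have already been parsed out of the workload log; the table and column names are TPC-H-style placeholders, not the actual workload:

    from collections import Counter
    from itertools import combinations

    # Each entry: the fact table of a logged join query plus the additional
    # dimension columns it referenced (placeholder names).
    logged_sets = [
        ("lineitem", {"orders.o_orderdate", "orders.o_shippriority", "customer.c_mktsegment"}),
        ("lineitem", {"orders.o_orderdate", "customer.c_mktsegment"}),
        ("lineitem", {"orders.o_orderdate", "orders.o_shippriority", "customer.c_mktsegment"}),
    ]

    def frequent_column_sets(entries, min_support=2, max_size=3):
        # Naive Apriori-style enumeration of all column subsets per fact table;
        # feasible only because the per-query column sets are small.
        counter = Counter()
        for fact, columns in entries:
            for size in range(1, min(max_size, len(columns)) + 1):
                for subset in combinations(sorted(columns), size):
                    counter[(fact, subset)] += 1
        return [(fact, list(subset), count)
                for (fact, subset), count in counter.items() if count >= min_support]

    for fact, columns, support in frequent_column_sets(logged_sets):
        print(fact, columns, support)

In practice one of the cited mining algorithms [2, 7, 11] would replace this brute-force enumeration, but the input and output have the shape shown here.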
                                                                                                                                                             measured profits for better query performance. In our work,
4.2 Query Rewrite
   As the table schema is changed in our case (i.e., a newly generated fact table schema), initial queries need to be rewritten for successful execution. Since the fact table is pre-joined with a set of dedicated redundant dimension columns, the tables involved in the from clause of the original query can be replaced with the new fact table once all referenced columns are covered by it.
   By storing the mapping from the newly generated fact table schema to the old schema in the catalog, the query rewrite process can be applied easily. Note that the common issue of handling complex sub-queries in Hive is thereby also alleviated if the columns in the sub-query have been pre-joined with the fact table.
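A minimal sketch of such a rewrite, assuming the catalog mapping is available as a simple dictionary and queries are already parsed into table and column lists (the names are illustrative, not the system's actual catalog format):

    # Catalog mapping from the original schema to the adaptively pre-joined
    # fact table (illustrative names).
    PREJOINED_TABLE = "lineitem_prejoined"
    COVERED_COLUMNS = {
        "orders.o_orderdate": "o_orderdate",
        "orders.o_shippriority": "o_shippriority",
        "customer.c_mktsegment": "c_mktsegment",
    }

    def rewrite_from_clause(tables, referenced_columns, fact_table="lineitem"):
        # Rewrite only if every referenced dimension column is covered by the
        # pre-joined table; otherwise keep the original join query untouched.
        dimension_columns = [c for c in referenced_columns
                             if not c.startswith(fact_table + ".")]
        if all(c in COVERED_COLUMNS for c in dimension_columns):
            new_columns = [COVERED_COLUMNS.get(c, c.split(".", 1)[1])
                           for c in referenced_columns]
            return [PREJOINED_TABLE], new_columns
        return tables, referenced_columns

    print(rewrite_from_clause(
        ["customer", "orders", "lineitem"],
        ["customer.c_mktsegment", "orders.o_orderdate", "lineitem.l_extendedprice"]))
    # (['lineitem_prejoined'], ['c_mktsegment', 'o_orderdate', 'l_extendedprice'])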
                                                                                                                                                             7.   CONCLUSION AND FUTURE WORK
5. IMPLEMENTATION AND EVALUATION
   We use Sqoop (an open-source tool for data transfer between Hadoop and relational databases, see http://sqoop.apache.org/) as the basis to implement our approach. The TPC-H benchmark data set with SF 10 is adaptively pre-joined according to the workload statistics and transferred from MySQL to Hive. First, the extracted join pattern information is sent to Sqoop as additional transformation logic embedded in the data transfer jobs, generating the adaptively pre-joined table schema on the original data sources. Furthermore, the generated schema is stored in Hive to enable automatic query rewrite at runtime.
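For illustration, the transformation logic for one frequent column set essentially amounts to a denormalizing SELECT statement like the one built below; the TPC-H names are examples, and the exact wiring into Sqoop's transfer jobs is not shown here:

    def prejoin_import_query(fact, frequent_columns, join_predicates):
        # Build the denormalizing SELECT that pulls the fact table together
        # with its frequently used dimension columns from the source database.
        select_list = ", ".join([fact + ".*"] + sorted(frequent_columns))
        tables = sorted({fact} | {column.split(".")[0] for column in frequent_columns})
        return ("SELECT " + select_list +
                " FROM " + ", ".join(tables) +
                " WHERE " + " AND ".join(join_predicates))

    print(prejoin_import_query(
        "lineitem",
        {"orders.o_orderdate", "orders.o_shippriority", "customer.c_mktsegment"},
        ["lineitem.l_orderkey = orders.o_orderkey",
         "orders.o_custkey = customer.c_custkey"]))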
   We tested the adaptive pre-join approach on a six-node cluster (Xeon quad-core CPUs at 2.53 GHz, 4 GB RAM, 1 TB SATA-II disk, Gigabit Ethernet) running Hadoop and Hive.
   After running the same TPC-H Query 3 over the adaptively pre-joined table schema, the results in Figure 4(a) show that the average runtime is significantly reduced. The join task has been eliminated for this query, and the additional overheads (record parsing, I/O cost) have been relieved due to the smaller amount of redundancy, as shown in Figure 4(b).

Figure 4: Running TPC-H Query 3 on the original, fully pre-joined, and adaptively pre-joined table schemas: (a) average runtime in seconds and (b) data volume accessed for executing the workloads in GB, each for the 10 GB data set and the variants no pre-join, full pre-join, and adaptive pre-join.
6. RELATED WORK
   An adaptively pre-joined fact table is essentially a materialized view in Hive. Creating materialized views in data warehouses is nothing new, but an established technique for query optimization. Since the 1990s, substantial effort [6, 8] has been devoted to answering queries using views in data warehouses. Furthermore, several subsequent works [14, 10] have focused on dynamic view management based on runtime statistics (e.g., reference frequency, result data size, execution cost) and measured profits for better query performance. In our work, we revisit these sophisticated techniques in a MapReduce-based environment.
   Cheetah [4] is a high-performance, custom data warehouse on top of MapReduce. It is very similar to the MapReduce-based warehouse Hive used in this paper. The performance issue of the join implementation has also been addressed in Cheetah. To reduce the network overhead of joining big dimension tables with the fact table at query runtime, big dimension tables are denormalized and all dimension attributes are stored directly in the fact table. In contrast, we choose to denormalize only the frequently used dimension attributes into the fact table, since we believe that a lower I/O cost can be achieved this way.

7. CONCLUSION AND FUTURE WORK
   We propose a schema adaptation approach for global optimization in an analytical synthesis of relational databases and a MapReduce-based warehouse, Hive. As MapReduce systems have weak join performance, frequently used columns of dimension tables are pre-joined with the fact table in an adaptive manner, according to workload statistics, before the data is transferred to Hive. In addition, a rewrite component enables incoming workloads with join operations to be executed transparently over such pre-joined tables. In this way, better performance can be achieved in Hive. Note that this approach is not restricted to a specific platform like Hive; any MapReduce-based warehouse can benefit from it, as complex join operations occur in almost every analytical platform.
   However, the experimental results also show that the performance improvement is not stable as the data volume grows. For example, when the query is executed on a larger pre-joined table, the performance gain from eliminating joins is offset by the record parsing overhead and the high I/O cost during the scan, which results in worse performance. We conclude that the total performance of complex data analytics is affected by multiple factors rather than by a single aspect such as the join.
   With the continuous growth of data, diverse frameworks and platforms (e.g., Hive, Pig) are built for large-scale data analytics and business intelligence applications. Data transfer between different platforms generally takes place in the absence of key information such as operational cost models, resource consumption, and computational capabilities within the platforms, which are autonomous and inherently not designed for data integration. Therefore, we are working towards a generic description of the operational semantics and computational capabilities of different platforms, together with a cost model for performance optimization from a global perspective. The granularity we consider is a single operator in the execution engines. Thus, a global operator model with a generic cost model is expected to improve performance in several use cases, e.g., federated systems.
   Moreover, as an adaptively pre-joined fact table can be regarded as a materialized view in a MapReduce-based warehouse, another open problem is how to handle view maintenance. The work in [9] introduced an incremental loading approach to achieve near real-time data warehousing by using change data capture and change propagation techniques. Ideas from this work could be taken further to improve the performance of the total workload, including the pre-join task.
8. REFERENCES
 [1] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 99-110, New York, NY, USA, 2010. ACM.
 [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487-499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
 [3] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 975-986, New York, NY, USA, 2010. ACM.
 [4] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proc. VLDB Endow., 3(1-2):1459-1468, Sept. 2010.
 [5] A. Gruenheid, E. Omiecinski, and L. Mark. Query optimization using column statistics in Hive. In Proceedings of the 15th Symposium on International Database Engineering & Applications, IDEAS '11, pages 97-105, New York, NY, USA, 2011. ACM.
 [6] A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270-294, Dec. 2001.
 [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1-12, May 2000.
 [8] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 205-216, New York, NY, USA, 1996. ACM.
 [9] T. Jörg and S. Deßloch. Towards generating ETL processes for incremental loading. In Proceedings of the 2008 International Symposium on Database Engineering & Applications, IDEAS '08, pages 101-110, New York, NY, USA, 2008. ACM.
[10] Y. Kotidis and N. Roussopoulos. DynaMat: a dynamic view management system for data warehouses. SIGMOD Rec., 28(2):371-382, June 1999.
[11] H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. Pages 181-192. AAAI Press, 1994.
[12] F. Özcan, D. Hoa, K. S. Beyer, A. Balmin, C. J. Liu, and Y. Li. Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 1161-1164, New York, NY, USA, 2011. ACM.
[13] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 165-178, New York, NY, USA, 2009. ACM.
[14] P. Scheuermann, J. Shim, and R. Vingralek. Watchman: A data warehouse intelligent cache manager. In Proceedings of the 22nd International Conference on Very Large Data Bases, VLDB '96, pages 51-62, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[15] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64-71, Jan. 2010.
[16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE '10: Proceedings of the 26th International Conference on Data Engineering, pages 996-1005. IEEE, Mar. 2010.
[17] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 1013-1020, New York, NY, USA, 2010. ACM.
 A Cloud-based Spatial Decision Support System for the Challenges of the Energy Transition (Energiewende)

                    Golo Klossek                        Stefanie Scherzinger                    Michael Sterner
               Hochschule Regensburg                     Hochschule Regensburg              Hochschule Regensburg
        golo.klossek@stud.hs-regensburg.de        stefanie.scherzinger@hs-regensburg.de    michael.sterner@hs-regensburg.de


ABSTRACT
The energy transition (Energiewende) in Germany raises very concrete questions: which sites are suitable for wind turbines, and where can solar plants be operated economically? Behind these questions lie compute-intensive data processing steps, to be executed on big data from several sources in correspondingly heterogeneous formats. This paper presents one such concrete question and its answer as a MapReduce algorithm. We design a suitable cluster-based infrastructure for a new spatial decision support system and argue for the necessity of a declarative, domain-specific query language.

General Terms
Measurement, Performance, Languages.

Keywords
Cloud computing, MapReduce, Energiewende.

1. INTRODUCTION
   The German federal government's decision to phase out nuclear energy by 2022 and to replace its share of the electricity mix with renewable energies demands a rapid expansion of renewable energy sources. Decisive for the construction of new wind and solar plants are, above all, the achievable profits and the security of the investments. Precise yield forecasts are therefore of great importance. Different sites have to be compared, and the alignment of the wind turbines relative to each other within a wind farm has to be planned carefully. Historical weather data serves as the main basis for these decisions. To calculate the yield of wind turbines, the data basis must be accessible efficiently. It spans long periods of time, since the wind volume not only fluctuates from year to year but also varies from decade to decade [3, 9].
   Site selection, for instance for bank branches, and zoning, i.e. the designation of geographic areas for agriculture, are classic questions for spatial decision support systems [6].
   The challenges for such a spatial decision support system in the context of the energy transition are manifold:

   1. Processing of heterogeneous data formats.

   2. Scalable query processing on big data.

   3. An elastic infrastructure that can be extended as new data sources are tapped.

   4. A declarative, domain-specific query language for complex ad-hoc queries.

   We briefly justify the individual items of this requirements profile. In doing so, we take the position that existing decision support systems based on relational databases cannot satisfy all of these points.
   (1) Historical weather data is partly publicly available, but is also obtained from commercial providers. Prominent sources are the National Center for Atmospheric Research [12] in Boulder, Colorado, the Deutscher Wetterdienst [7], and the Satel-Light [14] database of the European Union. In addition, there are measurements from the university's own experimental wind and solar plants. The multitude of sources, and thus of formats, leads to the classic problems of data integration.
   (2) Data of high temporal resolution, collected over decades, causes data volumes in the big data range. The Deutscher Wetterdienst alone maintains a data archive of 5 petabytes [7]. At such orders of magnitude, NoSQL databases have proven themselves over relational databases [4].
   (3) We provide the infrastructure for an interdisciplinary team of the Regensburg School of Energy and Resources with several projects under construction. To meet the growing demands of our users, the system must adapt elastically to new data sources and new user groups.
   (4) Our users are mostly IT-affine, yet not experienced in developing complex distributed systems. With a domain-specific query language, the authors of this article want to guarantee the intuitive usability of the system.
   Under these considerations, we design our system as a Hadoop compute cluster [1, 5]. This provides scalability to large data volumes (2) and horizontal scalability of the hardware (3). Since historical data is accessed in a read-only manner, the MapReduce approach lends itself naturally. Moreover, Hadoop can process unstructured, heterogeneous data (1). Designing our own query language (4) is an exciting conceptual challenge, because it requires a deep understanding of the users' questions.

Structure. The following sections provide details on our project. In Section 2 we describe a concrete question arising in the site selection of wind power plants. In Section 3 we present our solution as a MapReduce algorithm. Section 4 sketches our infrastructure. Section 5 discusses related work. The last section summarizes our work and outlines its perspectives.

Figure 1: Power output as a function of wind speed (from [9]).

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

2. WIND POTENTIAL ANALYSIS
   A current research project at Hochschule Regensburg is concerned with the potential analysis of wind turbines. It investigates the economic aspects that are decisive for erecting new wind turbines. Based on the predicted full-load hours of a wind turbine, a statement about its profitability can be made. This is determined by the power curve of the turbine and, ultimately, by the wind speeds to be expected.
   Figure 1 (from [9]) sketches the specific power curve of a wind turbine in four phases:

     I) Only above a certain wind speed does the turbine begin to produce electricity.

    II) Over the most important operating range, the power output grows with the third power of the wind speed until the rated power of the turbine is reached.

   III) The output is capped at the rated power of the turbine. The decisive factor for the level of the rated power is the size of the generator.

    IV) The turbine shuts down at excessively high wind speeds in order to prevent mechanical overload.

   As Figure 1 illustrates, computing the work delivered by a wind turbine requires precise knowledge of the stochastic distribution of the wind speed¹. Using corresponding histograms, potential sites for new wind turbines can thus be compared, and turbines with a power curve suited to the specific site can be selected (a small numerical sketch follows below).

¹ We use the terms wind speed and wind force synonymously. Strictly speaking, the wind speed is represented as a vector, whereas the wind force is recorded as a scalar quantity. The wind force can be computed from the wind speed.
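To illustrate how such a histogram feeds a site comparison, the following sketch combines an assumed power curve following the four phases above (cut-in, cubic growth, rated power, cut-out; all numbers are invented for the example) with a wind speed histogram:

    HOURS_PER_YEAR = 8760

    def power_output_kw(v):
        # Illustrative power curve: cut-in at 3 m/s, rated power 2000 kW from
        # 13 m/s, cut-out at 25 m/s. The numbers are assumptions, not data
        # taken from the paper.
        if v < 3 or v >= 25:            # phases I and IV
            return 0.0
        if v < 13:                      # phase II: roughly cubic growth
            return 2000.0 * ((v - 3) / 10.0) ** 3
        return 2000.0                   # phase III: capped at rated power

    def expected_annual_energy_kwh(histogram):
        # histogram: wind speed bin in m/s -> relative frequency (sums to 1).
        return sum(frequency * HOURS_PER_YEAR * power_output_kw(v)
                   for v, frequency in histogram.items())

    site_a = {2: 0.25, 5: 0.35, 9: 0.25, 14: 0.13, 26: 0.02}
    site_b = {2: 0.10, 5: 0.30, 9: 0.35, 14: 0.22, 26: 0.03}
    print(expected_annual_energy_kwh(site_a) < expected_annual_energy_kwh(site_b))  # True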
   As a data basis, for example, the weather data of the National Center for Atmospheric Research [12] and of the Deutscher Wetterdienst [7] are suitable.
   Particularly inland, a high spatial resolution of the meteorological data is important: due to the orography, wind speeds vary strongly even over short distances.
   Figure 2 sketches the resulting task. Geographic areas are subdivided into small cells, which the figure depicts in a strongly simplified way for the sake of clarity. For each quadrant, determined by longitude and latitude, we are interested in the frequency distribution of the wind forces (represented as a histogram).
   Depending on the question at hand, different time periods and different granularities of the quadrants are assumed. Given the sheer size of the data basis, a massively parallel computing approach is required when histograms are to be computed across a large number of quadrants.

Figure 2: Histograms of the wind force distribution.

3. MASSIVELY PARALLEL HISTOGRAM COMPUTATION
   In the following, we present a MapReduce algorithm for computing wind speed distributions in parallel. We consider the platform Apache Hadoop [1], an open-source MapReduce implementation [5].

Figure 3: First MapReduce sequence computing the absolute frequencies.
   Hadoop is designed to handle large volumes of data. An intuitive programming paradigm allows massively parallel data processing steps to be specified. The platform partitions the input into smaller data blocks and distributes them redundantly on the Hadoop Distributed File System [15], which ensures a high level of data safety. As its logical base unit, Hadoop works with simple key/value pairs. Thus, even unstructured or only weakly structured data can be processed ad hoc.
   MapReduce programs are executed in three phases.

   1. In the first phase, a map function is executed in parallel on the partitioned input data. This map function transforms simple key/value pairs into a list of new key/value pairs.

   2. The subsequent shuffle phase redistributes the resulting tuples so that all pairs with the same key end up on the same machine.

   3. The reduce phase typically computes an aggregate function over all tuples with the same key.

   The signatures of the map and reduce functions are usually described as follows [11]:

      Map:       (k1, v1)        → list(k2, v2)
      Reduce:    (k2, list(v2))  → list(k3, v3)

   We now explain our MapReduce algorithm for building histograms of the wind speed distributions. For the sake of a clear presentation, we abstract from the actual input format and restrict ourselves to a single data source. The input tuples contain a timestamp, longitude and latitude as the location, and various measured values.
   We assume, for simplicity, that the location has already been translated into a quadrant ID; a small sketch of such a mapping follows below. This simplification allows a clearer presentation, and at the same time the classification of the records by quadrant is easy to implement. Moreover, we ignore all measured values except the wind force. Table 1 shows a few example records that we process in our running example.
   We emphasize that these simplifying assumptions serve only the purpose of illustration and do not constitute a restriction of our system.

      Quadrant q    Wind force ws
      2             0
      3             7
      4             9
      1             3
      ...           ...

   Table 1: Input data in tabular form.

   We chain two MapReduce sequences:

   • The first sequence determines how often a concrete wind force occurred in a quadrant.

   • The second sequence assembles the computed tuples into histograms.
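The mapping from coordinates to quadrant IDs mentioned above can be as simple as snapping latitude and longitude to a grid; the following sketch uses an assumed resolution of 0.1 degrees:

    def quadrant_id(lat, lon, resolution_deg=0.1):
        # Grid cell ("quadrant") for a coordinate; the 0.1 degree resolution
        # is an assumption chosen for the example.
        return int(lat // resolution_deg), int(lon // resolution_deg)

    # Two nearby measurements around Regensburg fall into the same quadrant:
    print(quadrant_id(49.013, 12.101))   # (490, 121)
    print(quadrant_id(49.019, 12.109))   # (490, 121)
    print(quadrant_id(49.213, 12.301))   # a different quadrant: (492, 123)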
Figure 4: Second MapReduce sequence computing the histograms.




 def map(Datei d, Liste L) :
     foreach (q, ws) in L do
         if (q ∈ Q)
             int count = 1;
             emit ((q, ws), count);
         fi
     od

 Figure 5: Map function of the first sequence.

 def reduce((Quadrant q, Windstärke ws), Liste L) :
     int total = 0;
     foreach count in L do
         total += count;
     od
     emit (q, (ws, total));

 Figure 6: Reduce function of the first sequence.
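The dataflow of this first sequence can be simulated end to end in a few lines of Python; this is only a sketch of map, shuffle, and reduce on Table 1-style records, not Hadoop code:

    from itertools import groupby

    Q = {1, 2}  # quadrants of interest, as in Example 1

    def map_seq1(records):
        # Figure 5: emit ((quadrant, wind force), 1) for each relevant record.
        for q, ws in records:
            if q in Q:
                yield (q, ws), 1

    def shuffle(pairs):
        # Group the values by key, as Hadoop's shuffle phase would.
        pairs = sorted(pairs, key=lambda kv: kv[0])
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            yield key, [value for _, value in group]

    def reduce_seq1(key, counts):
        # Figure 6: sum up the counts per (quadrant, wind force).
        q, ws = key
        return q, (ws, sum(counts))

    records = [(2, 0), (3, 7), (4, 9), (1, 3), (1, 3), (2, 0)]   # Table 1-style input
    sequence1 = [reduce_seq1(key, counts) for key, counts in shuffle(map_seq1(records))]
    print(sequence1)   # [(1, (3, 2)), (2, (0, 2))]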

   By chaining MapReduce sequences, complex transformations are specified from very simple and well-parallelizable computing steps, entirely in the spirit of the divide-and-conquer principle. We now look at both sequences in detail.

3.1 Sequence 1: Absolute Frequencies
   The first sequence is reminiscent of the "WordCount" example, the classic introductory example of MapReduce programming [11]. The input of the map function is the name of a file and its content, namely a list of quadrants and the wind forces measured in them. We assume that only a selected set of quadrants Q is of interest, for instance to examine possible wind turbine sites in the Regensburg area.
   Figure 5 shows the map function in pseudocode. The emit statement produces a new output tuple. In the shuffle phase, the produced tuples are redistributed according to the key component consisting of quadrant and wind force. The reduce function in Figure 6 then computes the frequency of the individual wind force values per quadrant.

   Example 1. Figure 3 visualizes the first sequence on concrete input data. The map function selects only tuples from quadrants 1 and 2 (i.e., Q = {1, 2}). The shuffle phase reorganizes the tuples so that afterwards all tuples with the same values for quadrant and wind force reside on the same machine. Hadoop already collects the count values into a list. From this, the reduce function produces tuples with the quadrant as the key; the value consists of the wind force and its absolute frequency.

3.2 Sequence 2: Histogram Computation
   The output of the first sequence is now processed further. The map function of the second sequence is simply the identity function. The shuffle phase groups the tuples by quadrant. The final assembly of the histograms thus takes place in the reduce function.

   Example 2. Figure 4 shows the processing steps of the second sequence for the running example.
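The second sequence is equally small when sketched in Python; the map step is omitted since it is the identity, and the grouping stands in for the shuffle phase (again only a simulation of the dataflow):

    from collections import defaultdict

    def seq2(sequence1_output):
        # Second sequence: the shuffle groups by quadrant, and the reduce
        # assembles the histogram per quadrant.
        grouped = defaultdict(list)
        for quadrant, (ws, total) in sequence1_output:
            grouped[quadrant].append((ws, total))
        return {quadrant: dict(pairs) for quadrant, pairs in grouped.items()}

    print(seq2([(1, (3, 2)), (1, (5, 4)), (2, (0, 2))]))
    # {1: {3: 2, 5: 4}, 2: {0: 2}}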
4. ARCHITECTURE DESCRIPTION
   Our vision of a Cloud-based spatial decision support system for the questions of the energy transition rests firmly on MapReduce technology.
   The project goes beyond the mere use of cluster computing. The goal is the design of a domain-specific query language, WEnQL, the "Wetter und Energie Query Language" (weather and energy query language). It is being developed in interdisciplinary collaboration with the research institute Regensburg School of Energy and Resources. With it, even MapReduce laypersons shall be able to run algorithms on the cluster.

Figure 7: Architecture of the Cloud-based spatial decision support system.

   Figure 7 illustrates the vision: our users formulate a declarative query in WenQL, for instance to compute the wind speed histograms of a region. A dedicated compiler translates the WenQL query into the established query format Apache Pig [8, 13], which in turn is translated into Java. The Java program is then compiled and executed on the Hadoop cluster.
   Translating in two steps has the advantage that the program in the intermediate language Pig can be analyzed by experts with respect to performance and correctness. Hadoop laypersons, on the other hand, do not need to concern themselves with these internals. We are currently compiling a catalogue of concrete questions from the energy industry in order to identify common query building blocks for WenQL.
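Purely as an illustration of this pipeline (WenQL itself is still to be defined), the Pig Latin that such a compiler might emit for the histogram query of Section 3 could look roughly as follows; the input path, schema, and delimiter are invented for the example:

    ILLUSTRATIVE_PIG_SCRIPT = r"""
    measurements = LOAD 'weather/measurements' USING PigStorage('\t')
                   AS (quadrant:int, wind_speed:int);
    of_interest  = FILTER measurements BY quadrant == 1 OR quadrant == 2;
    by_bin       = GROUP of_interest BY (quadrant, wind_speed);
    histogram    = FOREACH by_bin GENERATE FLATTEN(group), COUNT(of_interest);
    STORE histogram INTO 'results/wind_histograms';
    """
    print(ILLUSTRATIVE_PIG_SCRIPT)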
5. RELATED WORK
   Various fields of computer science research touch upon our work. Many research projects concerned with smart grid technology rely on distributed systems for processing their data [4, 17]. Similar to our project, this decision is driven by the large data volumes, which originate from different sources: weather influences on power plants, electricity exchange prices, the utilization of power grids, and the consumption behaviour of millions of users have to be compared. Smart grid analyses differ from our project in that we only access historical data and therefore place no real-time requirements on the system.
   Questions such as site selection and zoning have a long tradition in the development of dedicated spatial decision support systems [6]. Their architectures usually rest on relational databases for data management. Facing the challenge of scaling to big data and of working with heterogeneous data sources, NoSQL systems such as Hadoop and the Hadoop file system, in contrast, have clear advantages in data management and query processing.
   The definition of declarative query languages for MapReduce algorithms is a very active field of research. Most relevant for us are SQL-like query languages, as realized for instance in the Hive project [2, 16]. However, our users generally do not know SQL. We therefore plan to define our own query language that is as intuitive as possible for our users to learn.

6. SUMMARY AND OUTLOOK
   There is a clear need for a new, big-data-capable generation of spatial decision support systems for diverse questions of the energy transition.
   In this paper, we present our vision of a Cloud-based spatial decision support system. Using the example of wind potential analysis, we show that MapReduce algorithms are fit for strategic questions in energy research.
   We are confident that we will be able to support a broad spectrum of essential decisions. A continuation of our case study is the alignment of wind turbines within a wind farm. Here the dominant wind direction is decisive, in order to position the turbines favourably with respect to each other and to the main wind direction. A single wind turbine can rotate its nacelle by 360° to turn the rotors into the wind; the arrangement of the towers within the farm, however, is fixed. With an unfavourable layout of the towers, wake effects can thus lastingly reduce the yield. Figure 8 (from [10]) visualizes wind force and wind direction as a basis for this decision. Our MapReduce algorithm from Section 3 can be extended accordingly.
Figure 8: Wind force and wind direction (from [10]).

   Moreover, we are currently investigating the siting of solar plants and, more complex still, the strategic use of energy storage, for instance to compensate for calms or night phases.
   With the capabilities of our future decision support system, its scalability to very large data volumes, its flexible handling of heterogeneous data formats, and not least its domain-specific query language, we want to make our contribution to a climate-friendly and sustainable energy supply.

7. ACKNOWLEDGEMENTS
   This work is a project of the Regensburg School of Energy and Resources, an interdisciplinary institution of Hochschule Regensburg and the Technologie- und Wissenschaftsnetzwerk Oberpfalz.

8. REFERENCES
 [1] Apache Hadoop. http://hadoop.apache.org/, 2013.
 [2] Apache Hive. http://hive.apache.org/, 2013.
 [3] O. Brückl. Meteorologische Grundlagen der Windenergienutzung. Vorlesungsskript: Windenergie. Hochschule Regensburg, 2012.
 [4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
 [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
 [6] P. J. Densham. Spatial decision support systems. Geographical Information Systems: Principles and Applications, 1:403-412, 1991.
 [7] Deutscher Wetterdienst. http://www.dwd.de/, 2013.
 [8] A. Gates. Programming Pig. O'Reilly Media, 2011.
 [9] M. Kaltschmitt, W. Streicher, and A. Wiese. Erneuerbare Energien: Systemtechnik, Wirtschaftlichkeit, Umweltaspekte. Springer, 2006.
[10] Lakes Environmental Software. http://www.weblakes.com/, 2013.
[11] C. Lam. Hadoop in Action. Manning Publications, 2010.
[12] National Center for Atmospheric Research (NCAR). http://ncar.ucar.edu/, 2013.
[13] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099-1110. ACM, 2008.
[14] Satel-Light. http://www.satel-light.com/, 2013.
[15] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1-10, 2010.
[16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626-1629, 2009.
[17] D. Wang and L. Xiao. Storage and Query of Condition Monitoring Data in Smart Grid Based on Hadoop. In Computational and Information Sciences (ICCIS), 2012 Fourth International Conference, pages 377-380. IEEE, 2012.
       Consistency Models for Cloud-based Online Games:
               the Storage System’s Perspective

                                                   Ziqiang Diao
                                    Otto-von-Guericke University Magdeburg
                                          39106 Magdeburg, Germany
                                       diao@iti.cs.uni-magdeburg.de


ABSTRACT
The existing architecture for massively multiplayer online role-playing games (MMORPG) based on RDBMS limits availability and scalability. With increasing numbers of players, the storage systems become bottlenecks. Although a Cloud-based architecture has the ability to solve these specific issues, the support for data consistency becomes a new open issue. In this paper, we analyze the data consistency requirements of MMORPGs from the storage system's point of view and highlight the drawbacks of Cassandra with respect to supporting game consistency. A timestamp-based solution is proposed to address this issue. Accordingly, we also present data replication strategies, concurrency control, and system reliability.

Figure 1: Cloud-based architecture of MMORPGs [4]. Clients connect through login, gateway, and chat servers to a zone server consisting of logic and map servers; account data is kept in an RDBMS offered as a service, game data and log data in HDFS/Cassandra, and state data in a Cloud storage system accessed via a data access server and an in-memory DB.
1. INTRODUCTION
   In massively multiplayer online role-playing games (MMORPG), thousands of players cooperate with other players in a virtual game world. Supporting such a huge game world involves often complex application logic and specific requirements. Additionally, we have to bear the burden of managing large amounts of data. The root of the issue is that the existing architectures of MMORPGs use an RDBMS to manage data, which limits availability and scalability.
   Cloud data storage systems are designed for internet applications and are complementary to RDBMS. For example, Cloud systems support system availability and scalability well, but not data consistency. In order to take advantage of both types of storage systems, we have classified the data in MMORPGs into four data sets according to typical data management requirements (e.g., data consistency, system availability, system scalability, data model, security, and real-time processing) in [4]: account data, game data, state data, and log data. We then proposed to apply multiple data management systems (or services) in one MMORPG and to manage the diverse data sets accordingly. Data with strong requirements for data consistency and security (e.g., account data) is still managed by an RDBMS, while data that requires scalability and availability (e.g., log data and state data) is stored in a Cloud data storage system (Cassandra, in this paper). Figure 1 shows the new architecture.
   Unfortunately, there are still some open issues, such as the support of data consistency. According to the CAP theorem, in a partition-tolerant distributed system (e.g., an MMORPG), we have to sacrifice one of the two remaining properties: consistency or availability [5]. If an online game does not guarantee availability, players' requests may fail. If data is inconsistent, players may get data that does not conform to the game logic, which affects their operations. For this reason, we must analyze the data consistency requirements of MMORPGs so as to find a balance between data consistency and system availability.
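This trade-off becomes concrete with Cassandra's tunable consistency levels, sketched here with the DataStax Python driver; the keyspace, tables, and addresses are hypothetical, and this only illustrates the knob involved, not the timestamp-based solution developed later in this paper:

    from datetime import datetime

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])            # hypothetical contact point
    session = cluster.connect("mmorpg")        # hypothetical keyspace

    # Strong consistency within the local data center for a state update:
    update_state = SimpleStatement(
        "UPDATE character_state SET position = %s WHERE char_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(update_state, ("x=12;y=7", 42))

    # A weaker, latency-friendly level for append-only log data:
    append_log = SimpleStatement(
        "INSERT INTO action_log (char_id, ts, action) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(append_log, (42, datetime.utcnow(), "attack"))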
   Although there has been some research focused on the data consistency model of online games, researchers have generally discussed it from the players' or servers' point of view [15, 9, 11], which actually only relates to data synchronization among players. Other existing research did not treat the diverse data accordingly [3], or handled this issue based only on a rough classification of the data [16]. However, we believe the only efficient way to solve this issue is to analyze the consistency requirements of each data set from the storage system's perspective. Hence, we organize the rest of this paper as follows: in Section 2, we highlight the data consistency requirements of the four data sets. In Section 3, we discuss the data consistency issue of our Cloud-based architecture. We explain our timestamp-based solution in detail from Section 4 to Section 6. Then, we point out some optimizations and our future work in Section 7. Finally, we summarize this paper in Section 8.

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).
2. CONSISTENCY REQUIREMENTS OF DIVERSE DATA IN MMORPGS
   Due to their different application scenarios, the four data sets have distinct data consistency requirements. For this reason, we need to apply different consistency models to fulfill them.
   Account data: is stored on the server side, and is created, accessed, and deleted when players log in to or log out of a game. It includes a player's private data and other sensitive information (e.g., user ID, password, and recharge records). Inconsistency of account data might cause trouble for a player as well as for the game provider, or even lead to an economic or legal dispute. Imagine the following two scenarios: a player has changed the password successfully, but when this player logs in to the game again, the new password is not yet effective; or a player has transferred money to the game account, or has made purchases in the game, but the account balance is somehow not properly reflected in the game system. Both cases would affect the player's experience and might result in the loss of customers or an economic loss for the game company. Hence, we need

players in the same game world could be treated equally. It is noteworthy that a zone server generally accesses data from one data center. Hence, we guarantee strong consistency within one data center, and causal consistency among data centers. In other words, when game developers modify the game data, the updated value should be submitted synchronously to all replicas within the same data center, and then propagated asynchronously across data centers.
   State data: for instance, metadata of PCs (player characters) and the state (e.g., position, task, or inventory) of characters, is modified frequently by players during the game. Changes of state data must be perceived by all relevant players synchronously, so that players and NPCs can respond correctly and in a timely manner. An example of the necessity of data synchronization is that players cannot tolerate a dead character continuing to attack other characters. Note that players only access data from the in-memory database during the game. Hence, we need to ensure strong consistency in the in-memory database.
   Another point about managing state data is that updated values must be backed up to the disk-resident database asynchronously. Similarly, game developers also need to take care of data consistency and durability in the disk-resident database; for instance, it is intolerable for a player to find
to access account data under strong consistency guarantees,
                                                                 that her/his last game record is lost when she/he starts the
and manage it with transactions. In a distributed database
                                                                 game again. In contrast to that in the in-memory database,
system, it means that each copy should hold the same view
                                                                 we do not recommend ensuring strong consistency to state
on the data value.
                                                                 data. The reason is as follows: according to the CAP theo-
   Game data: such as world appearance, metadata (name,
                                                                 rem, a distributed database system can only simultaneously
race, appearance, etc.) of NPC (Non Player Character),
                                                                 satisfy two of three the following desirable properties: con-
system configuration files, and game rules, is used by play-
                                                                 sistency, availability, and partition tolerance. Certainly, we
ers and game engine in the entire game, which can only be
                                                                 hope to satisfy both consistency and availability guarantees.
modified by game developers. Players are not as sensitive to
                                                                 However, in the case of network partition or under high net-
game data as to account data. For example, the change of
                                                                 work latency, we have to sacrifice one of them. Obviously,
an NPC’s appearance or name, the duration of a bird ani-
                                                                 we do not want all update operations to be blocked until the
mation, and the game interface may not catch the players’
                                                                 system recovery, which may lead to data loss. Consequently,
attention and have no influence on players’ operations. As a
                                                                 the level of data consistency should be reduced. We propose
result, it seems that strong consistency for game data is not
                                                                 to ensure read-your-writes consistency guarantee [13]. In
so necessary. On the other hand, some changes of the game
                                                                 this paper, it describes that once state data of player A has
data must be propagated to all online players synchronously,
                                                                 been persisted in the Cloud, the subsequent read request of
for instance, the change of the game world’s appearance, the
                                                                 player A will receive the updated values, yet other players
power of a weapon or an NPC, game rules as well as scripts,
                                                                 (or the game engine) may only obtain an outdated version of
and the occurrence frequency of an object during the game.
                                                                 it. From the storage system’s perspective, as long as a quo-
The inconsistency of these data will lead to errors on the
                                                                 rum of replicas has been updated successfully, the commit
game display and logic, unfair competition among players,
                                                                 operation is considered complete. In this case, the storage
or even a server failure. For this reason, we also need to
                                                                 system needs to provide a solution to return the up-to-date
treat data consistency of game data seriously. Game data
                                                                 data to player A. We will discuss it in the next section.
could be stored on both the server side and the client side,
                                                                    Log data: (e.g., player chat history and operation logs)
so we have to deal with it accordingly.
                                                                 is created by players, but used by data analysts for the pur-
   Game data on the client side could only synchronize with
                                                                 pose of data mining. This data will be sorted and cached
servers when a player logs in to or starts a game. For this
                                                                 on the server side during the game, and then bulk stored
reason, causal consistency is required [8, 13]. In this paper,
                                                                 into the database, thereby reducing the conflict rate as well
it means when player A uses client software or browser to
                                                                 as the I/O workload, and increasing the total simultaneous
connect with the game server, the game server will then
                                                                 throughput [2]. The management of log data has three fea-
transmit the latest game data in the form of data packets
                                                                 tures: log data will be appended continually, and its value
to the client side of player A. In this case, the subsequent
                                                                 will not be modified once it is written to the database; The
local access by player A is able to return the updated value.
                                                                 replication of log data from thousands of players to multiple
Player B that has not communicated with the game server
                                                                 nodes will significantly increase the network traffic and even
will still retain the outdated game data.
                                                                 block the network; Moreover, log data is generally organized
   Although both client side and server side store the game
                                                                 and analyzed after a long time. Data analysts are only con-
data, only the game server maintains the authority of it.
                                                                 cerned about the continuous sequence of the data, rather
Furthermore, players in different game worlds cannot com-
                                                                 than the timeliness of the data. Hence, data inconsistency
municate to each other. Therefore, we only need to ensure
                                                                 is accepted in a period of time. For these three reasons,
that the game data is consistent in one zone server so that
                   Account data            Game data                               State data                            Log data
  Modified by      Players                 Game developers                         Players                               Players
  Utilized by      Players & Game engine   Players & Game engine                   Players & Game engine                 Data analysts
  Stored in        Cloud                   Client side | Cloud                     In-memory DB | Cloud                  Cloud
  Data center      Across                  —           | Single / Across           Single       | Across                 Across
  Consistency      Strong                  Causal      | Strong (single DC),       Strong       | Read-your-writes       Timed
  model            consistency             consistency | Causal (across DCs)       consistency  | consistency           consistency

                                        Table 1: Consistency requirements
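Purely for illustration, the per-data-set choices in Table 1 can also be read as a configuration mapping. The following minimal Python sketch uses hypothetical names and is not part of the proposed architecture; it merely restates the table.

    # Consistency plan derived from Table 1 (illustrative names only).
    CONSISTENCY_PLAN = {
        "account_data": {"cloud": "strong"},
        "game_data": {
            "client_side": "causal",
            "cloud_single_dc": "strong",
            "cloud_across_dc": "causal",
        },
        "state_data": {"in_memory_db": "strong", "cloud": "read_your_writes"},
        "log_data": {"cloud": "timed"},
    }

    def required_consistency(data_set, store):
        """Look up which consistency model a given store must provide for a data set."""
        return CONSISTENCY_PLAN[data_set][store]

For example, required_consistency("state_data", "cloud") yields "read_your_writes", matching the table.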


3.  OPPORTUNITIES AND CHALLENGES

In our previous work, we have already presented the capability of the given Cloud-based architecture to support the corresponding consistency model for each data set in MMORPGs [4]. However, we have also pointed out that ensuring read-your-writes consistency for state data and timed consistency for log data efficiently in Cassandra is an open issue. In this section, we discuss this issue in detail.

By customizing the quorum of replicas that must respond to read and write operations, Cassandra provides tunable consistency, which is an inherent advantage for supporting MMORPGs [7, 4]. There are two reasons: first, as long as a write request receives a quorum of responses, it completes successfully. In this case, although data in Cassandra is temporarily inconsistent, the response time of write operations is reduced, and the availability as well as the fault tolerance of the system are ensured. Additionally, a read request is sent to the closest replica, or routed to a quorum or to all replicas, according to the consistency requirement of the client. For example, if a write request is accepted by three (N, N > 0) of all five (M, M >= N) replicas, at least M-N+1 = 3 replicas need to respond to a subsequent read request so that the up-to-date data can be returned (see the sketch at the end of this section). In this case, Cassandra can guarantee read-your-writes consistency or strong consistency. Otherwise, it can only guarantee timed consistency or eventual consistency [7, 13]. Due to its support for tunable consistency, Cassandra has the potential to manage state data and log data of MMORPGs simultaneously, and it is more suitable than other Cloud storage systems that only provide either strong or eventual consistency guarantees.

On the other hand, Cassandra fails to implement tunable consistency efficiently with respect to the MMORPG requirements. For example, M-N+1 replicas of state data have to be compared so as to guarantee read-your-writes consistency. However, state data typically has hundreds of attributes, whose transmission and comparison affect the read performance. If we instead update all replicas while executing write operations, the data in Cassandra is consistent and we can obtain the up-to-date data from the closest replica directly. Unfortunately, this replication strategy significantly increases the network traffic as well as the response time of write operations, and it sacrifices system availability. As a result, implementing read-your-writes consistency efficiently becomes an open issue.

Another drawback is that Cassandra makes all replicas eventually consistent, which does not always match the application scenarios of MMORPGs and reduces the efficiency of the system. The reasons are as follows.

   • Unnecessary for state data: state data of a PC is read by a player from the Cloud storage system only once during the game. The subsequent write operations do not depend on values in the Cloud any more. Hence, after obtaining the up-to-date data from the Cloud, there is no necessity to ensure that all replicas reach a consensus on these values.

   • Increased network traffic: Cassandra utilizes its Read Repair functionality to guarantee eventual consistency [1]. This means that all replicas have to be compared in the background while executing a read operation in order to return the up-to-date data to players, detect outdated data versions, and fix them. In MMORPGs, both state data and log data have a large scale and are distributed over multiple data centers. Hence, the transmission of these data across replicas will significantly increase the network traffic and affect the system performance.
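To make the quorum arithmetic of this section concrete, the following minimal Python sketch (plain arithmetic, not the Cassandra API) expresses the M-N+1 rule and the overlap condition behind it; the function name is ours, not Cassandra's.

    def replicas_to_read(m_total, n_write_acks):
        """Smallest number of replicas a read must contact so that it is guaranteed
        to overlap every write acknowledged by n_write_acks out of m_total replicas."""
        return m_total - n_write_acks + 1

    # Example from the text: a write accepted by N = 3 of M = 5 replicas.
    assert replicas_to_read(5, 3) == 3   # at least M-N+1 = 3 replicas must answer the read
    # Equivalent overlap condition: read quorum + write quorum must exceed the replica count.
    assert 3 + 3 > 5

Whenever the read and write quorums overlap in at least one replica, that replica holds the latest committed version, which is exactly what read-your-writes (or strong) consistency requires.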
Figure 2: Executions of Write Operations. W(1) describes a general backup operation; W(2) shows the process of data persistence when a player quits the game.

Figure 3: Executions of Read Operations. PR(1) shows a general read request from the player; in the case of PR(2), the backup operation is not yet completed when the read request arrives; GER presents the execution of a read operation from the game engine.
4.  A TIMESTAMP-BASED CONSISTENCY SOLUTION

A common method for solving the consistency problem of a Cloud storage system is to build an extra transaction layer on top of the system [6, 3, 14]. Similarly, we have proposed a timestamp-based solution especially for MMORPGs, which is designed based on the features of Cassandra [4]. Cassandra records a timestamp in each column and utilizes it as a version identification (ID). Therefore, we record timestamps obtained from a global server both on the server side and in the Cloud storage system. When we read state data from the Cloud, the timestamps recorded on the server side are sent along with the read request. In this way, we can find the most recent data easily. In the following sections, we introduce this solution in detail.

4.1  Data Access Server

Data access servers are responsible for the data exchange between the in-memory database and the Cloud storage system. They ensure the consistency of state data, maintain timestamp tables, and play the role of global counters as well. In order to balance the workload and to prevent server failures, several data access servers run in parallel in one zone server. Data access servers need to synchronize their system clocks with each other automatically. However, a complete synchronization is not required; a time difference smaller than the backup interval is acceptable.

An important component of the data access servers is the timestamp table, which stores the ID as well as the last modified time (LMT) of state data, and the log status (LS). If a character or an object in the game is active, its LS value is "login"; otherwise, the value of LS is "logout". We utilize a hash function to map the IDs of state data to distinct timestamp tables, which are distributed and partitioned across the data access servers. It is noteworthy that the timestamp tables are partitioned and managed by the data access servers in parallel and that the data processing is simple, so accessing the timestamp tables will not become a bottleneck of the game system.

Note that players can only interact with each other in the same game world, which is managed by one zone server. Moreover, a player cannot switch the zone server freely. Therefore, the data access servers as well as the timestamp tables of different zone servers are independent of each other.

4.2  Data Access

In this subsection, we discuss the data access without considering data replication and concurrency conflicts.

In Figure 2, we show the storage process of state data in the new Cloud-based architecture: the in-memory database takes a consistent snapshot periodically. Using the same hash function employed by the timestamp tables, each data access server periodically obtains the state data it is responsible for from the snapshot. In order to reduce the I/O workload of the Cloud, a data access server generates one message including all its responsible state data as well as a new timestamp TS, and then sends it to the Cloud storage system. In the Cloud, this message is divided into several messages based on the IDs of the state data, each of which still includes TS. In this way, the update failure of one state data item does not block the submission of the other state data. Then, these messages are routed to the appropriate nodes. When a node receives a message, it writes the changes immediately into the commit log, updates the data, and records TS as the version ID in each column. If an update is successful and TS is higher than the existing LMT of this state data, the data access server uses TS to replace the LMT. Note that if a player has quit the game and the state data of the relevant PC has been backed up into the Cloud storage system, the LS of this PC needs to be changed from "login" to "logout", and the relevant state data in the in-memory database needs to be deleted.

Data access servers obtain log data not from the in-memory database, but from the client side. Log data is also updated in batches and gets its timestamp from a data access server. When a node in the Cloud receives log data, it inserts the log data into its value list according to the timestamp. However, the timestamp tables are not modified when the update is complete.
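The backup step just described can be summarized by a short sketch. The following Python code is illustrative only: cloud.write, timestamp_table, and the other names are hypothetical placeholders for components of the architecture, not an existing API, and batching as well as error handling are simplified.

    import time

    def backup_state_snapshot(snapshot, timestamp_table, cloud):
        """Periodic backup performed by one data access server (simplified sketch)."""
        ts = time.time()                  # new timestamp TS; the data access server acts as counter
        # Conceptually one batch message per server; the Cloud splits it by state-data ID.
        for state_id, values in snapshot.items():
            ok = cloud.write(state_id, values, version_id=ts)  # TS is recorded as the version ID
            row = timestamp_table[state_id]
            if ok and ts > row["lmt"]:
                row["lmt"] = ts           # the LMT only advances for successful, newer updates
        # On a player quit (W(2) in Figure 2), LS is additionally flipped from "login" to
        # "logout" and the corresponding in-memory copy is deleted.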
Figure 3 presents the executions of read operations. When a player starts the game, a data access server first obtains the LS information from the timestamp table. If the value is "login", the previous backup operation is not yet completed and the state data is still stored in the in-memory database. In this case, the player gets the state data from the in-memory database directly, and the data access server needs to generate a new timestamp to replace the LMT of the relevant state data. If the value is "logout", the data access server gets the LMT and sends it together with a read request to the Cloud storage system. When the relevant node receives the request, it compares the LMT with its local version ID. If they match, the replica responds to the read request immediately. If they do not match, the read request is sent to other replicas (we discuss this in detail in Section 5). When the data access server receives the state data, it sends it to the in-memory database as well as to the relevant client sides, and it changes the LS from "logout" to "login" in the timestamp table. Note that state data may also be read by the game engine for the purpose of statistics. In this case, the up-to-date data is not necessary, so we do not need to compare the LMT with the version ID.

Data analysts also read data through the data access servers. If a read request contains a timestamp T, the Cloud storage system only returns log data up to T - Δ, because it only guarantees timed consistency for log data.

4.3  Concurrency Control

Concurrency conflicts appear rarely in the storage layer of MMORPGs: the probability of read-write conflicts is low, because only state data with a specific version ID (the same as its LMT) is read by players during the game, and a read request for log data does not return the up-to-date data anyway. Moreover, each data item is periodically updated by only one data access server at a time. Therefore, write-write conflicts occur only when the previous update has not completed for some reason, for example due to serious network latency or a node failure. Fortunately, we can solve these conflicts easily by comparing timestamps. If two processes attempt to update the same state data, the process with the higher timestamp wins, and the other process is canceled because it is out of date. If two processes intend to update the same log data, the process with the lower timestamp wins, and the other process enters the wait queue. The reason is that the values contained in both processes must be stored in the correct order.

5.  DATA REPLICATION

Data in the Cloud typically has multiple replicas for the purpose of increasing data reliability as well as system availability, and of balancing the node workload. On the other hand, data replication also increases the response time and the network traffic, which is not handled well by Cassandra. For most of this section, we focus on resolving this contradiction according to the access features of state data and log data.

5.1  Replication Strategies

Although state data is backed up periodically into the Cloud, only the last updated values will be read when players start the game again. It is noteworthy that data loss in the server layer occurs infrequently. Therefore, we propose to synchronize only a quorum of replicas during the game, so that an update can complete efficiently and does not block the subsequent updates. In addition, players usually start a game again only after a period of time, so the system has enough time to store the state data. For this reason, we propose to update all replicas synchronously when players quit the game. As a result, the subsequent read operation can obtain the updated values quickly.

While using our replication strategies, a replica may contain outdated data when it receives a read request. By comparing the LMT held by the read request with the version ID in a replica, this case can be detected easily. Contrary to the existing approach of Cassandra (which compares M-N+1 replicas and utilizes Read Repair), only the read request is sent to other replicas until the latest values are found. In this way, the network traffic is not increased significantly, and the up-to-date data can still be found easily. However, if the read request comes from the game engine, the replica responds immediately. These strategies ensure that this Cloud-based architecture can manage state data under read-your-writes consistency guarantees.

Similar to state data, a write request for log data is also accepted by a quorum of replicas at first. However, the updated values must then be propagated to the other replicas asynchronously when the Cloud storage system is not busy, and they must be arranged in timestamp order within a predetermined time (Δ), which can be done with the help of the Anti-Entropy functionality of Cassandra [1]. In this way, the Cloud storage system guarantees timed consistency for log data.

5.2  Version Conflict Reconciliation

When the Cloud storage system detects a version conflict between two replicas, the following rules apply: if it is state data, the replica with the higher version ID wins, and the values of the other replica are replaced by the new values; if it is log data, the two replicas perform a sort-merge join by timestamps for the purpose of synchronization.

6.  SYSTEM RELIABILITY

Our Cloud-based architecture for MMORPGs requires the cooperation of multiple components. Unfortunately, each component can fail. In the following, we discuss measures to deal with the different failures.

Cloud storage system failure: the new architecture for MMORPGs is built on Cassandra, which has the ability to deal with its own failures. For example, Cassandra applies commit logs to recover nodes. It is noteworthy that, by using our timestamp-based solution, a failed node that comes back up can simply be regarded as an asynchronous node. Therefore, node recovery and the handling of write and read requests can proceed simultaneously.

In-memory database failure: similarly, we could also apply commit logs to handle this kind of failure so that there is no data loss. However, writing logs affects the real-time response, and the logs become useless once the changes are persisted in the Cloud. Hence, we still have to find a suitable solution in our future work.

Data access server failure: if all data access servers crash, the game can still keep running, but data cannot be backed up to the Cloud until the servers restart, and only players already in the game can continue to play. All data access servers have the same functionality and their system clocks are relatively synchronized, so if one server is down, any other server can replace it.

Timestamp table failure: we utilize the primary/secondary model and a synchronous replication mechanism to maintain the reliability of the timestamp tables. In case all replicas fail, we have to apply the original mechanism of Cassandra to obtain the up-to-date data; in other words, M-N+1 replicas need to be compared. In this way, we can also rebuild the timestamp tables.
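Before turning to optimizations, the conflict-handling rules of Sections 4.3 and 5.2 can be restated compactly. The following Python sketch is illustrative only; the assumed record layout (a "version_id" field for state data, a "ts" field per log entry) is ours, not a prescribed implementation.

    def resolve_state_conflict(replica_a, replica_b):
        # State data: the copy with the higher version ID (timestamp) wins.
        return replica_a if replica_a["version_id"] >= replica_b["version_id"] else replica_b

    def reconcile_log_replicas(log_a, log_b):
        # Log data: merge two append-only lists in timestamp order (sort-merge join).
        merged, i, j = [], 0, 0
        while i < len(log_a) and j < len(log_b):
            if log_a[i]["ts"] <= log_b[j]["ts"]:
                merged.append(log_a[i]); i += 1
            else:
                merged.append(log_b[j]); j += 1
        return merged + log_a[i:] + log_b[j:]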
7.  OPTIMIZATION AND FUTURE WORK

When a data access server updates state data in the Cloud, it propagates a snapshot of the state data to multiple replicas. Note that state data has hundreds of attributes, so the transmission of a large volume of state data may block the network. Therefore, we proposed two optimization strategies in our previous work [4]: first, if only some less important attributes of the state (e.g., the position or orientation of a character) are modified, the backup can be skipped; second, only the timestamp, the ID, and the modified values are sent as messages to the Cloud. However, in order to implement the second optimization strategy, our proposed data access approach, data replication strategies, and concurrency control mechanism have to be changed. For example, even during the game, updated values must then be accepted by all replicas, so that a subsequent read request does not need to compare M-N+1 replicas. We will detail the necessary adjustments in our future work.

It is noteworthy that a data access server stores a timestamp repeatedly into the timestamp table, which increases the workload. A possible optimization is as follows: if a batch write is successful, the data access server caches the timestamp (TS) of this write request. Accordingly, in the timestamp table, we add a new column to each row to maintain a pointer. If a row is active (the value of LS is "login"), the pointer refers to the memory location of TS; if not, it refers to the row's own LMT. When a row becomes inactive, it uses TS to replace its LMT. In this way, the workload of a timestamp table is reduced significantly. However, the LMT and the version ID of state data may then become inconsistent due to a failure of the Cloud storage system or of the data access server.

8.  CONCLUSIONS

Our Cloud-based architecture for MMORPGs can cope with the data management requirements regarding availability and scalability successfully, while supporting data consistency remains an open issue. In this paper, we detailed our timestamp-based solution in theory, which will guide the implementation work in the future. We analyzed the data consistency requirements of each data set from the storage system's perspective, and studied the methods Cassandra provides to guarantee tunable consistency. We found that Cassandra cannot ensure read-your-writes consistency for state data and timed consistency for log data efficiently. Hence, we proposed a timestamp-based solution to improve this, and explained our ideas for concurrency control, data replication strategies, and fault handling in detail. In our future work, we will implement our proposals and the optimization strategies.

9.  ACKNOWLEDGEMENTS

Thanks to Eike Schallehn for his comments.

10.  REFERENCES
[1] Apache. Cassandra, January 2013. http://cassandra.apache.org/.
[2] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Conference on Innovative Data Systems Research (CIDR), pages 223–234, Asilomar, California, USA, 2011.
[3] S. Das, D. Agrawal, and A. E. Abbadi. G-Store: A scalable data store for transactional multi key access in the cloud. In Symposium on Cloud Computing (SoCC), pages 163–174, Indianapolis, Indiana, USA, 2010.
[4] Z. Diao and E. Schallehn. Cloud data management for online games: Potentials and open issues. In Data Management in the Cloud (DMC), Magdeburg, Germany, 2013. Accepted for publication.
[5] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.
[6] F. Gropengießer, S. Baumann, and K.-U. Sattler. Cloudy transactions: Cooperative XML authoring on Amazon S3. In Datenbanksysteme für Business, Technologie und Web (BTW), pages 307–326, Kaiserslautern, Germany, 2011.
[7] A. Lakshman. Cassandra - a decentralized structured storage system. Operating Systems Review, 44(2):35–40, 2010.
[8] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[9] F. W. Li, L. W. Li, and R. W. Lau. Supporting continuous consistency in multiplayer online games. In 12th ACM Multimedia 2004, pages 388–391, New York, New York, USA, 2004.
[10] H. Liu, M. Bowman, and F. Chang. Survey of state melding in virtual worlds. ACM Computing Surveys, 44(4):1–25, 2012.
[11] W. Palant, C. Griwodz, and P. Halvorsen. Consistency requirements in multiplayer online games. In Proceedings of the 5th Workshop on Network and System Support for Games (NETGAMES 2006), page 51, Singapore, 2006.
[12] F. J. Torres-Rojas, M. Ahamad, and M. Raynal. Timed consistency for shared distributed objects. In Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing (PODC '99), pages 163–172, Atlanta, Georgia, USA, 1999.
[13] W. Vogels. Eventually consistent. ACM Queue, 6(6):14–19, 2008.
[14] Z. Wei, G. Pierre, and C.-H. Chi. Scalable transactions for web applications in the cloud. In 15th International Euro-Par Conference, pages 442–453, Delft, The Netherlands, 2009.
[15] K. Zhang and B. Kemme. Transaction models for massively multiplayer online games. In 30th IEEE Symposium on Reliable Distributed Systems (SRDS 2011), pages 31–40, Madrid, Spain, 2011.
[16] K. Zhang, B. Kemme, and A. Denault. Persistence in massively multiplayer online games. In Proceedings of the 7th ACM SIGCOMM Workshop on Network and System Support for Games (NETGAMES 2008), pages 53–58, Worcester, Massachusetts, USA, 2008.
                          Analysis of DDoS Detection Systems

                                                          Michael Singhof
                                                    Heinrich-Heine-Universität
                                                      Institut für Informatik
                                                       Universitätsstraße 1
                                                  40225 Düsseldorf, Deutschland
                                               singhof@cs.uni-duesseldorf.de


ABSTRACT
While there are plenty of papers describing algorithms for detecting distributed denial of service (DDoS) attacks, here an introduction to the considerations preceding such an implementation is given. Therefore, a brief history of and introduction to DDoS attacks is given, showing that these kinds of attacks are nearly two decades old. It is also shown that most algorithms used for the detection of DDoS attacks are outlier detection algorithms, so that intrusion detection can be seen as a part of the KDD research field.

It is then pointed out that no well-known and up-to-date test cases for DDoS detection systems exist. To overcome this problem in a way that allows algorithms to be tested and results to be reproducible for others, we advise using a simulator for DDoS attacks.

The challenge of detecting denial of service attacks in real time is addressed by presenting two recently published methods that try to solve the performance problem in very different ways. We compare both approaches and finally summarise the conclusions drawn from this, especially that methods concentrating on only one network traffic parameter are not able to detect all kinds of distributed denial of service attacks.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering, Information filtering

Keywords
DDoS, Intrusion Detection, KDD, Security

1.  INTRODUCTION

Denial of service (DoS) attacks are attacks that have the goal of making a network service unusable for its legitimate users. This can be achieved in different ways, either by targeting specific weaknesses in that service or by brute force approaches. A particularly well-known and dangerous kind of DoS attack are distributed denial of service attacks. These are more or less brute-force bandwidth DoS attacks carried out by multiple attackers simultaneously.

In general, there are two ways to detect any kind of network attack: signature-based approaches, in which the intrusion detection software compares network input to known attacks, and anomaly detection methods, in which the software is either trained with examples of normal traffic or not trained in advance at all. Obviously, the latter approach, anomaly detection, is more flexible, since normal network traffic does not change as quickly as attack methods. The algorithms used in this field are, essentially, known KDD methods for outlier detection, such as clustering algorithms, classification algorithms or novelty detection algorithms on time series. However, in contrast to many other related tasks such as credit card fraud detection, network attack detection is highly time critical, since attacks have to be detected in near real time. This makes finding suitable methods especially hard, because high precision is necessary, too, in order for an intrusion detection system not to cause more harm than it helps.

The main goal of this research project is to build a distributed denial of service detection system that can be used in today's networks and meets the demands formulated in the previous paragraph. In order to build such a system, many considerations have to be made. Some of these are presented in this work.

The remainder of this paper is structured as follows: In section 2 an introduction to distributed denial of service attacks and known countermeasures is given, and section 3 points out known test datasets. In section 4 some already existing approaches are presented, and finally section 5 concludes this work and gives insight into future work.

2.  INTRODUCTION TO DDOS ATTACKS

Denial of service and distributed denial of service attacks are not a new threat in the internet. In [15] the first notable denial of service attack is dated to 1996, when the internet provider Panix was taken down for a week by a TCP SYN flood attack. The same article dates the first noteworthy distributed denial of service attack to the year 1997, when internet service providers in several countries as well as an IRC network were attacked by a teenager. Since then, many of the more elaborate attacks that worked well in the past have been successfully defused.

Let us, as an example, examine the TCP SYN flood attack.
A TCP connection is established by a three-way handshake: on receiving a SYN request packet to open a TCP connection, the addressed computer has to store some information on the incoming packet and then answers with a SYN ACK packet, which, when a TCP connection is opened regularly, is in turn answered by an ACK packet.

The idea of the SYN flood attack is to cause a memory overrun on the victim by sending many TCP SYN packets. Since the victim has to store information for every such packet, while the attacker just generates new packets and ignores the victim's answers, the whole available memory of the victim can be used up, preventing the victim from opening legitimate connections to regular clients. As a countermeasure, SYN cookies were introduced in [7]. Here, instead of storing the information associated with the half-opened TCP connection in local memory, that information is encoded into the TCP sequence number. Since that number is returned by regular clients when they send the last packet of the three-way handshake described above, and since initial sequence numbers can be chosen arbitrarily by each connection partner, no changes to the TCP implementation on the client side are needed. Essentially, this reduces the SYN flood attack to a mere bandwidth-based attack.
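To illustrate how connection information can be encoded into the initial sequence number instead of being stored, consider the following simplified Python sketch. It is a hypothetical toy version of the idea from [7]: it omits details of real implementations, such as the MSS encoding, and does not correspond to any particular TCP stack.

    import hashlib, struct, time

    SECRET = b"per-server-secret"      # hypothetical secret key

    def _cookie(conn, slot):
        # conn = (src_ip_bytes, src_port, dst_ip_bytes, dst_port); slot = coarse time counter
        data = SECRET + struct.pack("!4sH4sHI", *conn, slot)
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

    def syn_cookie(conn):
        """Initial sequence number derived from the connection tuple instead of stored state."""
        return _cookie(conn, int(time.time()) >> 6)

    def cookie_valid(ack_seq, conn):
        """The client echoes ISN + 1 in its final ACK; accept the current or previous time slot."""
        slot = int(time.time()) >> 6
        return any((ack_seq - 1) & 0xFFFFFFFF == _cookie(conn, s) for s in (slot, slot - 1))

Because the server stores nothing until a valid ACK arrives, a flood of spoofed SYN packets no longer consumes connection memory; the attacker is left with raw bandwidth exhaustion.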
The same applies to many other attack methods that have been used successfully in the past, such as the smurf attack [1] or the fraggle attack. Both of these attacks are so-called reflector attacks that consist of sending an echo packet – an ICMP echo in the case of the smurf attack and a UDP echo in the case of the fraggle attack – to a network's broadcast address. The sender's address in this packet is forged so that the packet appears to be sent by the victim of the attack, so that all replies caused by the echo packet are routed to the victim.

Thus, it seems that nowadays most denial of service attacks are distributed denial of service attacks trying to exhaust the victim's bandwidth. Examples of this are the attacks on Estonian government and business computers in 2007 [12].

As already mentioned, distributed denial of service attacks are denial of service attacks with several participating attackers. The number of participating computers can differ largely, ranging from just a few machines to several thousand. Also, in most cases, the owners of these computers are not aware that they are part of an attack. This lies in the nature of most DDoS attacks, which consist of three steps:

  1. Building or reusing malware that is able to receive commands from the main attacker ("master") and to carry out the attack. A popular DDoS framework is Stacheldraht [9].

  2. Distributing the software created in step one to create a botnet. This step can essentially be carried out with every known method of distributing malware, for example by forged mail attachments or by adding it to software such as pirated copies.

  3. Launching the attack by giving the infected computers the command.

This procedure – from the point of view of the main attacker – has the advantage of not having to maintain a direct connection to the victim. This makes it very hard to track that person. It is notable, though, that during the attacks attributed to Anonymous in the years 2010 and 2012, the Low Orbit Ion Cannon [6] was used. This is originally a tool for stress testing that allows users, among other functions, to voluntarily join a botnet in order to carry out an attack. Since the tool is mainly intended for testing purposes, the queries are not masqueraded, so that it is easy to identify the participating persons. Again, however, the initiator of the attack does not necessarily have to have direct contact with the victim and thus remains unknown.

A great diversity of approaches to solve the problem of detecting DDoS attacks exists. Note again that this work focuses on anomaly detection methods only, that is, methods that essentially make use of outlier detection to distinguish normal traffic from attack traffic. In a field with as many publications as intrusion detection and, even more specialised, DDoS detection, it is not surprising that many different approaches are used, most of which are common in other knowledge discovery research fields as well.

Figure 1: Detection locations for DDoS attacks.

As can be seen in Figure 1, this research area can again be divided into three major categories, namely distributed detection or in-network detection, source end detection, and end point or victim end detection.

By distributed detection approaches we denote all approaches that use more than one node in order to monitor the network traffic. This kind of solution is mostly aimed at internet providers, and sometimes cooperation between more than one or even all ISPs is expected. The main idea of almost all of these systems is to monitor the network traffic inside the backbone network. The monitors are mostly expected to be backbone routers that communicate the results of their monitoring either to a central instance or among each other. These systems allow an early detection of suspicious network traffic, so that an attack can be detected and disabled – by dropping the suspicious network packets – before it reaches the server the attack is aimed at. However, despite these methods being very powerful in theory, they suffer from the main disadvantage of not being employable without the help of one or more ISPs. Currently, this makes these approaches impractical for end users since, to the knowledge of the author, at this moment no ISP uses such an approach.

Source end detection describes approaches that monitor outgoing attack streams. Of course, such methods can only be successful if the owner of an attacking computer is not aware of his computer participating in that attack. A wide deployment of such solutions is necessary for them to have an effect. If this happens, however, these methods have the chance not only to detect distributed denial of service attacks but also to prevent them by stopping the attacking traffic flows. However, in our opinion, the necessity of wide deployment makes a successful usage of these methods – at least in the near future – difficult.
least in the near future – difficult.
   In contrast to the approaches described above, end point detection describes those methods that rely on one host only. In general, this host can either be the same server other applications are running on or, in the case of small networks, a dedicated firewall. Clearly, these approaches suffer from one disadvantage: attacks cannot be detected before the attack packets arrive at their destination, as only those packets can be inspected. On the other hand, end point based methods allow individual deployment and can therefore be used today. Due to this fact, our work focuses on end point approaches.

3. TEST TRACES OF DISTRIBUTED DENIAL OF SERVICE ATTACKS

   Today, the testing of DDoS detection methods unfortunately is not easy, as not many recordings of real or simulated DDoS attacks exist or, at least, are not publicly available. The best known test trace is the KDD Cup 99 data set [3]. A detailed description of this data set is given in [18]. Other known datasets are the 1998 DARPA intrusion detection evaluation data set that has been described in [14] as well as the 1999 DARPA intrusion detection evaluation data set examined in [13].
   In terms of the internet, an age of 14 to 15 years makes these data sets rather old; they therefore cannot reflect today's traffic volume and behaviour in a realistic fashion. Since testing with real distributed denial of service attacks is rather difficult on both a technical and a legal level, we suggest the usage of a DDoS simulator. In order to get a feeling for today's web traffic, we recorded a trace at the main web server of Heinrich-Heine-Universität. Tracing started on 17th September 2012 at eight o'clock local time and lasted until eight o'clock the next day.
   This trace consists of 65612516 packets of IP traffic with 31841 unique communication partners contacting the web server. As can be seen in Table 1, almost all of these packets are TCP traffic. This is not surprising, as the HTTP protocol uses TCP and web page requests are HTTP messages.

   Packet type | No of packets | Percentage
   IP          |      65612516 |   100
   TCP         |      65295894 |    99.5174
   UDP         |            77 |     0.0001
   ICMP        |        316545 |     0.4824

   Protocol | Incoming Traffic | Outgoing Traffic
   IP       |         24363685 |         41248831
   TCP      |         24204679 |         41091215
   UDP      |               77 |                0
   ICMP     |           158929 |           157616

Table 1: Distribution of web traffic on protocol types and incoming and outgoing traffic at the university's web server.

   About one third of the TCP traffic is incoming traffic. This, too, is no surprise, as most clients send small request messages and, in return, get web pages that often include images or other larger data and thus consist of more than one packet. It can also be seen clearly that all of the UDP packets seem to be unwanted packets, as none of these packets is replied to. The low overall number of these packets is an indicator for this fact, too. With ICMP traffic, incoming and outgoing packet numbers are nearly the same, which lies in the nature of this message protocol.
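   The protocol breakdown in Table 1 and the inter-packet arrival times used later in section 4.1 can be derived directly from such a libpcap recording. The following minimal Python sketch illustrates one way to do this with the dpkt library; the file name and the use of dpkt (rather than the libpcap tooling [2] used for the original trace) are our own assumptions.

import dpkt

# Hypothetical file name; the trace of the university's web server itself
# is not publicly available.
TRACE_FILE = "webserver_trace.pcap"

counts = {"IP": 0, "TCP": 0, "UDP": 0, "ICMP": 0}
arrival_times = []   # inter-packet arrival times in seconds
last_ts = None

with open(TRACE_FILE, "rb") as f:
    for ts, buf in dpkt.pcap.Reader(f):
        ip = dpkt.ethernet.Ethernet(buf).data
        if not isinstance(ip, dpkt.ip.IP):
            continue                      # only IP traffic is of interest
        counts["IP"] += 1
        if isinstance(ip.data, dpkt.tcp.TCP):
            counts["TCP"] += 1
        elif isinstance(ip.data, dpkt.udp.UDP):
            counts["UDP"] += 1
        elif isinstance(ip.data, dpkt.icmp.ICMP):
            counts["ICMP"] += 1
        if last_ts is not None:
            arrival_times.append(ts - last_ts)
        last_ts = ts

total = counts["IP"] or 1
for proto, n in counts.items():
    print(f"{proto}: {n} packets ({100.0 * n / total:.4f} %)")

   Splitting the counts into incoming and outgoing traffic would only require an additional comparison of the source and destination addresses against the web server's address.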
   In order to overcome the problems with old traces, as a next step we implement a simulator for distributed denial of service attacks based on the characteristics of the web trace. As the results in [20] show, the network simulators OMNeT++ [19], ns-3 [10] and JiST [5] are, in terms of speed and memory usage, more or less equal. To not let the simulation become either too slow or too inaccurate, it is intended to simulate the nearer neighbourhood of the victim server very accurately. With greater distance to the victim, it is planned to simulate in less detail. In this context, the distance between two network nodes is given by the number of hops between the nodes.
   Simulation results will then be compared with the aforementioned network trace to ensure realistic behaviour. Once the simulation of normal network traffic resembles the real traffic at the victim server closely enough, we will proceed by implementing distributed denial of service attacks in the simulator environment. With this simulator it will then, hopefully, be possible to test existing and new distributed denial of service detection approaches in greater detail than has been possible in the past.

4. EXISTING APPROACHES

   Many approaches to the detection of distributed denial of service attacks already exist. As has been previously pointed out in section 1, in contrast to many other outlier and novelty detection applications in the KDD field, the detection of DDoS attacks is extremely time critical, hence near real time detection is necessary.
   Intuitively, the fewer parameters are observed by an approach, the faster it should work. Therefore, we first take a look at a recently published method that relies on one parameter only.

4.1 Arrival Time Based DDoS Detection

   In [17] the authors propose an approach that is based on irregularities in the inter-packet arrival times. By this term
the authors describe the time that elapses between two subsequent packets.
   The main idea of this work is based on [8], where non-asymptotic fuzzy estimators are used to estimate variable costs. Here, this idea is used to estimate the mean arrival time x̄ of normal traffic packets. Then, the mean arrival time of the current traffic – denoted by tc – is estimated, too, and compared to the overall value. If tc > x̄, the traffic is considered normal traffic, and if tc < x̄ a DDoS attack is assumed to be happening. We suppose here that for a value of tc = x̄ no attack is happening, although this case is not considered in the original paper.
   To get a general feeling for the arrival times, we computed them for our trace. The result is shown in Figure 2. Note that the y-axis is scaled logarithmically, as values for arrival times larger than 0.1 seconds could not be distinguished from zero on a linear y-axis. It can be seen here that most arrival times are very close to zero. It is also noteworthy that, due to the limited precision of libpcap [2], the most common arrival interval is zero.

Figure 2: Arrival times for the university's web-server trace (number of packets over arrival time in seconds, logarithmic y-axis).

   Computing the fuzzy mean estimator for packet arrival times yields the graph presented in Figure 3 and x̄ ≈ 0.00132. Note that, since the choice of the parameter β ∈ [0, 1) is not specified in [17], we chose β = 1/2. We will see, however, that, as far as our understanding of the proposed method goes, this parameter has no further influence.

Figure 3: The fuzzy mean estimator constructed according to [17] (membership value α over arrival times between about 0.00122 s and 0.00142 s).

   To compute the α-cuts of a fuzzy number, one has to compute

      αM = [ x̄ − z_g(α) · σ/√n ,  x̄ + z_g(α) · σ/√n ]

where x̄ is the mean value – i.e. exactly the value that is going to be estimated – and σ is presumably the standard deviation of the arrival times. Also

      g(α) = (1/2 − β/2) · α + β/2

and

      z_g(α) = Φ⁻¹(1 − g(α)).

   Note that αM is the (1 − α)(1 − β) confidence interval for µ, the real mean value of packet arrival times.
   Now, since we are solely interested in the estimation of x̄, only 1M is needed, which is computed to be [x̄, x̄] since

      g(1) = (1/2 − β/2) · 1 + β/2 = 1/2 · (1 − β + β) = 1/2

and

      z_g(1) = Φ⁻¹(1 − g(1)) = Φ⁻¹(1/2) = 0.

   During traffic monitoring, for a given time interval, the current traffic arrival times tc are computed by estimating

      [tc]_α = [ ln(1/(1 − p)) · 1/r_α ,  ln(1/(1 − p)) · 1/l_α ]

where p is some given but again not specified probability and [l_α, r_α] are the α-cuts for E(T) = t̄. As described above, the only value that is of further use is tc, the only value in the interval [tc]_1. Since [E(T)]_1 = [t̄]_1 = [t̄, t̄], it follows that

      [tc]_1 = [ ln(1/(1 − p)) · 1/t̄ ,  ln(1/(1 − p)) · 1/t̄ ]

and thus

      tc = ln(1/(1 − p)) · 1/t̄ = (ln(1) − ln(1 − p)) / t̄.

As ln(1) = 0, this can be further simplified to

      tc = −ln(1 − p)/t̄ ∈ [0, ∞)

with p ∈ [0, 1).
   By this we are able to determine a value for p by choosing the smallest p for which tc ≥ x̄ holds for all intervals in our trace. An interval length of four seconds was chosen to ensure comparability with the results presented in [17].
   During the interval with the highest traffic volume, 53568 packets arrived, resulting in an average arrival time of t̄ ≈ 7.4671 · 10⁻⁵ seconds. Note here that we did not maximise the number of packets per interval but instead let the first interval begin at the first timestamp in our trace, rounded down to full seconds, and split the trace sequentially from there on.
   Now, in order to compute p, one has to set

      p = 1 − e^(−x̄ · t̄)

leading to p ≈ 9.8359 · 10⁻⁸. As soon as this value of p is learned, the approach is essentially a static comparison.
   There are, however, other weaknesses to this approach as well: since the only monitored value is the arrival time, a statement on values such as bandwidth usage cannot be made. Consider an attack where multiple corrupted computers try to download a large file from a server via a TCP connection. This behaviour will result in relatively large packets being sent from the server to the clients, resulting in larger arrival times as well. Still, the server's connection can be jammed by this traffic, thus causing a denial of service.
   From this we draw the conclusion that a method relying on only one parameter – in this example arrival times – cannot detect all kinds of DDoS attacks. Thus, despite its low processing requirements, such an approach is in our opinion not suited for general DDoS detection, even if it seems that it can detect packet flooding attacks with high precision, as stated in the paper.
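   To make the resulting detection rule concrete, the following Python sketch shows the static comparison that remains once p has been learned. It is a simplified reading of the approach in [17] under our assumptions (β = 1/2, four-second intervals); the fuzzy estimator itself and the exact calibration procedure of the original paper are reduced to the mean-based comparison derived above, and the example numbers are the ones quoted in the text.

import math

def calibrate_p(x_bar, t_bar):
    # Calibration as quoted above: p = 1 - exp(-x_bar * t_bar), where t_bar is
    # the mean arrival time of the chosen training interval.
    return 1.0 - math.exp(-x_bar * t_bar)

def classify_interval(arrival_times, x_bar, p):
    # Decision rule of [17] as summarised above: estimate the current mean
    # arrival time tc and report an attack iff tc < x_bar.
    t_mean = sum(arrival_times) / len(arrival_times)
    tc = -math.log(1.0 - p) / t_mean
    return "attack" if tc < x_bar else "normal"

# Numbers from the text: x_bar ~ 0.00132 s and a busiest four-second training
# interval with t_bar ~ 7.4671e-5 s, which yields p ~ 9.8e-8.
x_bar = 0.00132
p = calibrate_p(x_bar, 7.4671e-5)
print(p, classify_interval([2.0e-5] * 10, x_bar, p))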
4.2 Protocol Type Specific DDoS Detection

   In [11] another approach is presented: instead of using the same methods on all types of packets, different procedures are used for different protocol types. This is due to the fact that different protocols show different behaviour. Especially TCP traffic behaviour differs from UDP and ICMP traffic because of its flow control features. By this the authors try to minimise the feature set characterising distributed denial of service attacks for every protocol type separately, such that computation time is minimised, too.
   The proposed detection scheme is described as a four step approach, as shown in Figure 4. The first step is the preprocessing, where all features of the raw network traffic are extracted. Then packets are forwarded to the corresponding modules based on the packet's protocol type.

Figure 4: Protocol specific DDoS detection architecture as proposed in [11].

   The next step is the protocol specific feature selection. Here, per protocol type, the most significant features are selected. This is done by using the linear correlation based feature selection (LCFS) algorithm that has been introduced in [4], which essentially ranks the given features by their correlation coefficients given by

      corr(X, Y) := Σ_{i=1..n} (xᵢ − x̄)(yᵢ − ȳ) / √( Σ_{i=1..n} (xᵢ − x̄)² · Σ_{i=1..n} (yᵢ − ȳ)² )

for two random variables X, Y with values xᵢ, yᵢ, 1 ≤ i ≤ n, respectively. A pseudo code version of LCFS is given in Algorithm 1. As can be seen there, the number of features in the reduced set must be given by the user. This number characterises the trade-off between precision of the detection and detection speed.

Algorithm 1 LCFS algorithm based on [11].
Require: the initial set of all features I, the class-outputs y, the desired number of features n
Ensure: the dimension reduced subset F ⊂ I
 1: for all fᵢ ∈ I do
 2:    compute corr(fᵢ, y)
 3: end for
 4: f := arg max { corr(fᵢ, y) | fᵢ ∈ I }
 5: F := {f}
 6: I := I \ {f}
 7: while |F| < n do
 8:    f := arg max { corr(fᵢ, y) − (1/|F|) · Σ_{fⱼ∈F} corr(fᵢ, fⱼ) | fᵢ ∈ I }
 9:    F := F ∪ {f}
10:    I := I \ {f}
11: end while
12: return F
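   The following Python sketch mirrors Algorithm 1 for tabular data. It is our own illustration, not the implementation evaluated in [11]; the feature names and the toy data are made up, and the correlation coefficient is computed directly from the formula above.

import math

def corr(xs, ys):
    # Correlation coefficient as defined above.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def lcfs(features, y, n):
    # features: dict mapping feature name -> list of values, y: class outputs.
    remaining = dict(features)
    # Lines 1-6: pick the feature correlating most strongly with the class.
    first = max(remaining, key=lambda f: corr(remaining[f], y))
    selected = [first]
    del remaining[first]
    # Lines 7-11: greedily add features with high class correlation and low
    # average correlation to the already selected features.
    while len(selected) < n and remaining:
        def score(f):
            redundancy = sum(corr(remaining[f], features[s]) for s in selected)
            return corr(remaining[f], y) - redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy example with three hypothetical traffic features.
feats = {"pkts_per_s":  [1, 2, 3, 10, 12, 11],
         "bytes_per_s": [2, 4, 6, 20, 24, 22],
         "syn_ratio":   [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]}
labels = [0, 0, 0, 1, 1, 1]
print(lcfs(feats, labels, 2))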
   The third step is the classification of the instances into either normal traffic or DDoS traffic. The classification is trained on the reduced feature set generated in the previous step. The authors tested different well known classification techniques and established C4.5 [16] as the method working best in this case.
   Finally, the outputs of the classifiers are given to the merger in order to report warnings over one alarm generation interface instead of three. The authors mention that there is a check for false positives in the merger, too. However, no further information is given on how this check works, apart from the fact that it is relatively slow.
   The presented experiments have been carried out on the aforementioned KDD Cup data set as well as on two self-made data sets for which the authors attacked a server within the university's campus. The presented results show that on all data sets the DDoS detection accuracy varies in the range of 99.683% to 99.986% if all of the traffic's attributes are used. When reduced to three or five attributes, accuracy stays high with detection rates of 99.481% to 99.972%. At the same time, the computation time shrinks by a factor of two, leading to a per-instance computation time of 0.0116 ms (three attributes) on the KDD Cup data set and 0.0108 ms (three attributes) and 0.0163 ms (five attributes) on the self-recorded data sets of the authors.
   Taking into account the 53568 packets in a four second interval we recorded, the computation time during this interval would be about 53568 · 0.0163 ms ≈ 0.87 seconds. However, there is no information in the paper about the machine that carried out the computations, so this number appears to be rather meaningless. If we suppose a fast machine with no additional tasks, this computation time would be relatively high.
   Nevertheless, the results presented in the paper are promising enough to consider a future re-evaluation on a known machine with our recorded trace and simulated DDoS attacks.

5. CONCLUSION

   We have seen that distributed denial of service attacks are, in comparison to the age of the internet itself, a relatively old threat. Against many of the more sophisticated attacks, specialised counter measures exist, such as TCP SYN cookies to prevent the dangers of SYN flooding. Thus, most DDoS attacks nowadays are pure bandwidth or brute force attacks, and attack detection should focus on these types of attacks, making outlier detection techniques the method of choice. Still, since many DDoS toolkits such as Stacheldraht allow for attacks like SYN flooding, properties of these attacks can still indicate an ongoing attack.
   Also, although much research in the field of DDoS detection has been done during the last two decades, leading to a nearly equally large number of possible solutions, we have seen in section 3 that one of the biggest problems is the unavailability of recent test traces or of a simulator able to produce such traces. With the best known test series
having an age of fourteen years today, the results presented in many of the research papers on this topic are difficult to compare and confirm.
   Even if one can rate the suitability of certain approaches with respect to detecting certain attacks, as seen in section 4, a definite judgement of given methods is not easy. Before starting to implement an own approach to distributed denial of service detection, we therefore want to overcome this problem by implementing a DDoS simulator.
   With the help of this tool, we will subsequently be able to compare existing approaches among each other and to our ideas in a fashion reproducible for others.

6. REFERENCES
 [1] CERT CC. Smurf Attack. http://www.cert.org/advisories/CA-1998-01.html.
 [2] The Homepage of Tcpdump and Libpcap. http://www.tcpdump.org/.
 [3] KDD Cup Dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
 [4] F. Amiri, M. Rezaei Yousefi, C. Lucas, A. Shakery, and N. Yazdani. Mutual Information-based Feature Selection for Intrusion Detection Systems. Journal of Network and Computer Applications, 34(4):1184–1199, 2011.
 [5] R. Barr, Z. J. Haas, and R. van Renesse. JiST: An Efficient Approach to Simulation Using Virtual Machines. Software: Practice and Experience, 35(6):539–576, 2005.
 [6] A. M. Batishchev. Low Orbit Ion Cannon. http://sourceforge.net/projects/loic/.
 [7] D. Bernstein and E. Schenk. TCP SYN Cookies. On-line journal, http://cr.yp.to/syncookies.html, 1996.
 [8] K. A. Chrysafis and B. K. Papadopoulos. Cost–volume–profit Analysis Under Uncertainty: A Model with Fuzzy Estimators Based on Confidence Intervals. International Journal of Production Research, 47(21):5977–5999, 2009.
 [9] D. Dittrich. The 'Stacheldraht' Distributed Denial of Service Attack Tool. http://staff.washington.edu/dittrich/misc/stacheldraht.analysis, 1999.
[10] T. Henderson. ns-3 Overview. http://www.nsnam.org/docs/ns-3-overview.pdf, May 2011.
[11] H. J. Kashyap and D. Bhattacharyya. A DDoS Attack Detection Mechanism Based on Protocol Specific Traffic Features. In Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology, pages 194–200. ACM, 2012.
[12] M. Lesk. The New Front Line: Estonia under Cyberassault. IEEE Security & Privacy, 5(4):76–79, 2007.
[13] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das. The 1999 DARPA Off-line Intrusion Detection Evaluation. Computer Networks, 34(4):579–595, 2000.
[14] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, et al. Evaluating Intrusion Detection Systems: The 1998 DARPA Off-line Intrusion Detection Evaluation. In DARPA Information Survivability Conference and Exposition (DISCEX '00), volume 2, pages 12–26. IEEE, 2000.
[15] G. Loukas and G. Öke. Protection Against Denial of Service Attacks: A Survey. The Computer Journal, 53(7):1020–1037, 2010.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[17] S. N. Shiaeles, V. Katos, A. S. Karakos, and B. K. Papadopoulos. Real Time DDoS Detection Using Fuzzy Estimators. Computers & Security, 31(6):782–790, 2012.
[18] M. Tavallaee, E. Bagheri, W. Lu, and A.-A. Ghorbani. A Detailed Analysis of the KDD CUP 99 Data Set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications, 2009.
[19] A. Varga and R. Hornig. An Overview of the OMNeT++ Simulation Environment. In Proceedings of the 1st International Conference on Simulation Tools and Techniques for Communications, Networks and Systems (Simutools '08), pages 60:1–60:10. ICST, Brussels, Belgium, 2008.
[20] E. Weingartner, H. vom Lehn, and K. Wehrle. A Performance Comparison of Recent Network Simulators. In IEEE International Conference on Communications (ICC '09), pages 1–5, 2009.
         A Conceptual Model for the XML Schema Evolution
                   Overview: Storing, Base-Model-Mapping and Visualization
                                  Thomas Nösinger, Meike Klettke, Andreas Heuer
                                                    Database Research Group
                                                  University of Rostock, Germany
                                       (tn, meike, ah)@informatik.uni-rostock.de


ABSTRACT

In this article the conceptual model EMX (Entity Model for XML-Schema) for dealing with the evolution of XML Schema (XSD) is introduced. The model is a simplified representation of an XSD, which hides the complexity of XSD and offers a graphical presentation. For this purpose a unique mapping is necessary, which is presented together with further information about the visualization and the logical structure. A small example illustrates the relationships between an XSD and an EMX. Finally, the integration into a research prototype developed for dealing with the co-evolution of corresponding XML documents is presented.

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

1. INTRODUCTION

   The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To assure that well-defined XML documents can be understood by every participant (e.g. a user or an application) it is necessary to introduce a document description, which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. An XML document is called valid if it fulfills all restrictions and conditions of an associated XML Schema.
   XML Schemas that have been used for years have to be modified from time to time. The main reason is that the requirements for exchanged information can change. To meet these requirements the schema has to be adapted, for example if additional elements are added into an existing content model, the data type of information changes or integrity constraints are introduced. All in all, every possible structure of an XML Schema definition (XSD) can be changed. A question occurs: in which way can somebody make these adaptions without being forced to understand and deal with the whole complexity of an XSD? One solution is the definition of a conceptual model for simplifying the base-model; in this paper we outline further details of our conceptual model called EMX (Entity Model for XML-Schema).
   A further issue, not covered in this paper but important in the overall context of exchanging information, is the validity of XML documents [5]. Modifications of an XML Schema require adaptions of all XML documents that are valid against the former XML Schema (also known as co-evolution).
   One impractical way to handle this problem is to introduce different versions of an XML Schema, but in this case all versions have to be stored and every participant of the heterogeneous environment has to understand all different document descriptions. An alternative solution is the evolution of the XML Schema, so that just one document description exists at a time. The above mentioned validity problem of XML documents is not solved by this alone, but with a standardized description of the adaptions (e.g. a sequence of operations [8]) and by knowing a conceptual model including the corresponding mapping to the base-model (e.g. XSD), it is possible to derive the necessary XML document transformation steps automatically from the adaptions [7]. The conceptual model is an essential prerequisite for the evolution of XML Schema, a process that is handled here only incidentally and not in detail.
   This paper is organized as follows. Section 2 gives the necessary background of XML Schema and corresponding concepts. Section 3 presents our conceptual model by first giving a formal definition (3.1), followed by the specification of the unique mapping between EMX and XSD (3.2) and the logical structure of the EMX (3.3). After introducing the conceptual model we present an example of an EMX in section 4. In section 5 we describe the practical use of EMX in our prototype, which was developed to handle the co-evolution of XML Schema and XML documents. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND

   In this section we present a common notation used in the rest of the paper. At first we shortly introduce the abstract data model (ADM) and the element information item (EII) of XML Schema, before further details concerning different modeling styles are given.
   The XML Schema abstract data model consists of different components or node types¹; basically these are: type definition components (simple and complex types), declaration components (elements and attributes), model group components, constraint components, group definition components
and annotation components [3]. Additionally, the element information item exists, an XML representation of these components. The element information item defines which content and attributes can be used in an XML Schema. Table 1 gives an overview of the most important components and their concrete representation. The EIIs for embedding externally defined XML Schemas are not explicitly given in the abstract data model (N.N. - Not Named), but they are important components for embedding externally defined element declarations, attribute declarations and type definitions. In the rest of the paper, we will summarize them under the node type "module".

¹An XML Schema can be visualized as a directed graph with different nodes (components); an edge realizes the hierarchy.

   ADM               | Element Information Item
   declarations      | <element>, <attribute>
   group-definitions | <attributeGroup>
   model-groups      | <all>, <choice>, <sequence>, <any>, <anyAttribute>
   type-definitions  | <simpleType>, <complexType>
   N.N.              | EIIs for embedding external schemas (node type "module")
   annotations       | <annotation>
   constraints       | <key>, <keyref>, <unique>, <assert>, <assertion>
   N.N.              | <schema>

Table 1: XML Schema Information Items
   The <schema> "is the document (root) element of any W3C XML Schema. It's both a container for all the declarations and definitions of the schema and a place holder for a number of default values expressed as attributes" [9]. Analyzing the possibilities of specifying declarations and definitions leads to four different modeling styles of XML Schema, these are: Russian Doll, Salami Slice, Venetian Blind and Garden of Eden [6]. These modeling styles mainly influence the re-usability of element declarations or defined data types and also the flexibility of an XML Schema in general. Figure 1 summarizes the modeling styles with their scopes.

   Scope                             |        | Russian Doll | Salami Slice | Venetian Blind | Garden of Eden
   element and attribute declaration | local  |      x       |              |       x        |
                                     | global |              |      x       |                |       x
   type definition                   | local  |      x       |      x       |                |
                                     | global |              |              |       x        |       x

Figure 1: XSD Modeling Styles according to [6]

   The scope of element and attribute declarations as well as the scope of type definitions is global iff the corresponding node is specified as a child of the <schema> and can be referenced (by knowing e.g. the name and namespace). Locally specified nodes are, in contrast, not directly under <schema>; their re-usability is accordingly not given respectively not possible.
   An XML Schema in the Garden of Eden style just contains global declarations and definitions. If the requirements against exchanged information change and the underlying schema has to be adapted, then this modeling style is the most suitable. The advantage of the Garden of Eden style is that all components can be easily identified by knowing the QNAME (qualified name). Furthermore, the position of components within an XML Schema is obvious. A qualified name is a colon separated string of the target namespace of the XML Schema followed by the name of the declaration or definition. The name of a declaration or definition is a string of the data type NCNAME (non-colonized name), a string without colons. The Garden of Eden style is the basic modeling style which is considered in this paper; a transformation between different styles is possible.²

²A student thesis to address the issue of converting different modeling styles into each other is in progress at our professorship; this is not covered in this paper.

3. CONCEPTUAL MODEL

   In [7] the three layer architecture for dealing with XML Schema adaptions (i.e. the XML Schema evolution) was introduced and the correlations between the layers were mentioned. An overview is illustrated in figure 2. The first layer is our conceptual model EMX (Entity Model for XML-Schema), a simplified representation of the second layer. The second layer is the XML Schema (XSD); a unique mapping between these layers exists. This mapping is one of the main aspects of this paper (see section 3.2). The third layer are XML documents or instances; an ambiguous mapping between XSD and XML documents exists. It is ambiguous because of the optionality of structures (e.g. minOccurs = '0'; use = 'optional') or of content types. The third layer and the mapping between layer two and three, as well as the operations for transforming the different layers, are not covered in this paper (parts were published in [7]).

Figure 2: Three Layer Architecture (operations transform an EMX into an EMX', an XSD into an XSD' and an XML document into an XML'; EMX and XSD are connected by a 1 - 1 mapping, XSD and XML documents by a 1 - * mapping).

3.1 Formal Definition

   The conceptual model EMX is a triplet of nodes (N_M), directed edges between nodes (E_M) and features (F_M):

      EMX = (N_M, E_M, F_M)                (1)

Nodes are separated into simple types (st), complex types (ct), elements, attribute-groups, groups (e.g. content models), annotations, constraints and modules (i.e. externally managed XML Schemas). Every node has, under consideration of the element information item of a corresponding XSD, different attributes, e.g. an element node has a name, occurrence values, type information, etc. One of the most important
attributes of every node is the EID (EMX ID), a unique identification value for referencing and localization of every node; an EID is one-to-one in every EMX. The directed edges are defined between nodes by using the EIDs, i.e. every edge is a pair of EID values from a source to a target. The direction defines the include property, which was specified under consideration of the possibilities of an XML Schema. For example, if a model-group of the abstract data model (i.e. an EMX group with EID = 1) contains different elements (e.g. EID = {2, 3}), then two edges exist: (1,2) and (1,3). In section 3.3 further details about allowed edges are specified (see also figure 5). The additional features allow the user-specific setting of the overall process of co-evolution. It is not only possible to specify default values but also to configure the general behaviour of operations (e.g. that only capacity-preserving operations are allowed). Furthermore, all XML Schema properties of the element information item <schema> are included in the additional features. The additional features are not covered in this paper.
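   As an illustration of this formal definition, the following Python sketch models an EMX as EID-keyed nodes and (source, target) edges. The node kinds and the group-contains-elements example are taken from the text; the class layout and the element names are our own simplification and not the data structure of the prototype.

from dataclasses import dataclass, field

NODE_KINDS = {"st", "ct", "element", "attribute-group", "group",
              "annotation", "constraint", "module"}

@dataclass
class EMXNode:
    eid: int                     # unique within one EMX
    kind: str                    # one of NODE_KINDS
    attributes: dict = field(default_factory=dict)   # e.g. name, occurrence

@dataclass
class EMX:
    nodes: dict = field(default_factory=dict)    # EID -> EMXNode
    edges: set = field(default_factory=set)      # (source EID, target EID)
    features: dict = field(default_factory=dict)

    def add_node(self, node: EMXNode):
        assert node.kind in NODE_KINDS and node.eid not in self.nodes
        self.nodes[node.eid] = node

    def add_edge(self, source_eid: int, target_eid: int):
        # "include" direction: the source contains the target.
        assert source_eid in self.nodes and target_eid in self.nodes
        self.edges.add((source_eid, target_eid))

# Example from section 3.1: a group (EID 1) containing two elements.
emx = EMX()
emx.add_node(EMXNode(1, "group"))
emx.add_node(EMXNode(2, "element", {"name": "a"}))
emx.add_node(EMXNode(3, "element", {"name": "b"}))
emx.add_edge(1, 2)
emx.add_edge(1, 3)
print(sorted(emx.edges))   # [(1, 2), (1, 3)]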
3.2 Mapping between XSD and EMX

   An overview of the components of an XSD has been given in table 1. In the following section the unique mapping between these XSD components and the EMX nodes introduced in section 3.1 is specified. Table 2 summarizes the mapping. For every element information item (EII) the corresponding EMX node is given as well as the assigned visualization. For example, an EMX node group represents the abstract data model (ADM) node model-group (see table 1). This ADM node is visualized through the EII content models <all>, <sequence> and <choice>, and the wildcards <any> and <anyAttribute>. In an EMX the visualization of a group is the blue "triangle with a G" in it. Furthermore, if a group contains an element wildcard, then this is visualized by adding a "blue W in a circle"; a similar behaviour takes place if an attribute wildcard is given in an <attributeGroup>.

   EII                                                 | EMX Node        | Visualization
   <element>                                           | element         |
   <attribute>, <attributeGroup>                       | attribute-group |
   <all>, <choice>, <sequence>, <any>, <anyAttribute>  | group           | blue triangle with a "G"
   <simpleType>                                        | st              | implicit and specifiable
   <complexType>                                       | ct              | implicit and derived
   EIIs for embedding external schemas                 | module          |
   <annotation>                                        | annotation      |
   <key>, <keyref>, <unique>                           | constraint      |
   <assert>                                            |                 | implicit in ct
   <assertion>                                         |                 | restriction in st
   <schema>                                            | the EMX itself  |

Table 2: Mapping and Visualization

   The type-definitions are not directly visualized in an EMX. Simple types, for example, can be specified and afterwards be referenced by elements or attributes³ by using the EID of the corresponding EMX node. The complex type is also implicitly given; the type will be automatically derived from the structure of the EMX after finishing the modeling process. The XML Schema specification 1.1 has introduced different logical constraints, which are also integrated in the EMX. These are the EIIs <assert> (for constraints on complex types) and <assertion>. An <assertion> is, under consideration of the specification, a facet of a restricted simple type [4]. The last EII is <schema>; this "root" is an EMX itself. This is also the reason why further information or properties of an XML Schema are stored in the additional features, as mentioned above.

³The EIIs <attribute> and <attributeGroup> are the same in the EMX; an attribute-group is always a container.

3.3 Logical Structure

   After introducing the conceptual model and specifying the mapping between an EMX and an XSD, in the following section details about the logical structure (i.e. the storing model) are given. Also details about the valid edges of an EMX are illustrated. Figure 3 gives an overview of the different relations used as well as the relationships between them. The logical structure is the direct consequence of the used modeling style Garden of Eden, e.g. elements are either element declarations or element references. That's why this separation is also made in the EMX.

Figure 3: Logical Structure of an EMX (the relations Schema, Element, Element_Ref, Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref, Group, Wildcard, CT, ST, ST_List, Facet, Constraint, Path, Assert, Annotation and Module, connected by parent_EID, has_a and same relationships; visualized EMX nodes and external references are marked).

   All in all there are 18 relations, which store the content of an XML Schema and form the base of an EMX. The different nodes reference each other by using the well known foreign key constraints of relational databases. This is expressed by the directed "parent_EID" arrows, e.g. the EMX nodes ("rectangle with thick line") element, st, ct, attribute-group and modules reference the "Schema" itself. If declarations or definitions are externally defined, then the "parent_EID" is the EID of the corresponding module ("blue arrow"). The "Schema" relation is an EMX respectively the root of an EMX, as already mentioned above.
   The "Annotation" relation can reference every other relation, according to the XML Schema specification. Wildcards are realized as an element wildcard, which belongs to a "Group" (i.e. the EII <any>), or as an attribute wildcard, which belongs to a "CT" or an "Attribute_Gr" (i.e. the EII <anyAttribute>). Every "Element" relation (i.e. element declaration) has either a simple type or a complex type, and every "Element_Ref" relation has an element declaration. Attributes and attribute-groups are the same in an EMX, as mentioned above.
   Moreover, figure 3 illustrates the distinction between visualized ("yellow border") and not visualized relations. Under consideration of table 2, six relations are directly visible in an EMX: constraints, annotations, modules, groups and, because of the Garden of Eden style, element references and attribute-group references. Table 3 summarizes which relation of figure 3 belongs to which EMX node of table 2.

   EMX Node        | Relation
   element         | Element, Element_Ref
   attribute-group | Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref
   group           | Group, Wildcard
   st              | ST, ST_List, Facet
   ct              | CT
   annotation      | Annotation
   constraint      | Constraint, Path, Assert
   module          | Module

Table 3: EMX Nodes with Logical Structure

   The EMX node st (i.e. simple type) has three relations. These are the relation "ST" for the most simple types, the relation "ST_List" for the set free storing of simple union types and the relation "Facet" for storing the facets of a simple restriction.
   The attributes of these relations are specified under consideration of the XML Schema specification [4], e.g. an element declaration needs a "name" and a type ("type_EID" as a foreign key) as well as other optional values like the final ("finalV"), default ("defaultV"), "fixed", "nillable", XML Schema "id" or "form" value. Other EMX specific attributes are also given, e.g. the "file_ID" and the "parent_EID" (see figure 3). The element references have a "ref_EID", which is a foreign key to a given element declaration. Moreover, attributes of the occurrence ("minOccurs", "maxOccurs"), the "position" in a content model and the XML Schema "id" are stored. Element references are visualized in an EMX. That's why some values about the position in an EMX are stored, i.e. the coordinates ("x_Pos", "y_Pos") and the "width" and "height" of an EMX node. The same position attributes are given in every other visualized EMX node.
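   The attribute lists above translate directly into relational tables. The following Python sqlite3 sketch shows how the "Element" and "Element_Ref" relations could look; the column names follow the text, but the exact schema of the prototype is not given in the paper, so this is only an assumed illustration with made-up example values.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Element (
    EID        INTEGER PRIMARY KEY,      -- unique within one EMX
    name       TEXT NOT NULL,
    type_EID   INTEGER,                  -- foreign key to a simple or complex type
    finalV     TEXT, defaultV TEXT, fixed TEXT,
    nillable   TEXT, id TEXT, form TEXT,
    file_ID    INTEGER,
    parent_EID INTEGER                   -- the Schema or, if external, a Module
);
CREATE TABLE Element_Ref (
    EID        INTEGER PRIMARY KEY,
    ref_EID    INTEGER REFERENCES Element(EID),
    minOccurs  TEXT, maxOccurs TEXT,
    position   INTEGER, id TEXT,
    x_Pos REAL, y_Pos REAL, width REAL, height REAL,
    parent_EID INTEGER
);
""")
# A global element declaration (Garden of Eden style) and one reference to it.
con.execute("INSERT INTO Element (EID, name, type_EID, parent_EID) "
            "VALUES (2, 'title', 10, 0)")
con.execute("INSERT INTO Element_Ref (EID, ref_EID, minOccurs, maxOccurs, position, parent_EID) "
            "VALUES (5, 2, '1', '1', 1, 1)")
print(con.execute("SELECT e.name FROM Element_Ref r "
                  "JOIN Element e ON r.ref_EID = e.EID").fetchall())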
tion of figure 3 belongs to which EMX node of table 2.                The edges of the formal definition of an EMX can be de-
                                                                   rived by knowing the logical structure and the visualization
         EMX Node                             Relation             of an EMX. Figure 5 illustrates the allowed edges of EMX
            element                    Element, Element Ref        nodes. An edge is always a pair of EIDs, from a source
        attribute-group               Attribute, Atttribute Ref,
                                            Attribute Gr,




                                                                                                                    attribute-group
                                          Attribute Gr Ref




                                                                                               source X




                                                                                                                                                        annotation
            group                         Group, Wildcard




                                                                                                                                                                     constraint
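   To make the storing relations of figure 4 concrete, the following minimal sketch creates the two relations of the EMX node element in an in-memory SQLite database. It is an illustration only, not the storage layer of CodeX; the column names follow figure 4, the column types are assumptions.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE element (
        EID        INTEGER PRIMARY KEY,  -- one-to-one within an EMX
        name       TEXT NOT NULL,
        type_EID   INTEGER,              -- foreign key to the simple or complex type
        finalV     TEXT, defaultV TEXT, fixed TEXT, nillable TEXT, id TEXT, form TEXT,
        file_ID    INTEGER,              -- module the declaration comes from
        parent_EID INTEGER               -- the Schema (or module) it belongs to
    );
    CREATE TABLE element_ref (
        EID        INTEGER PRIMARY KEY,
        ref_EID    INTEGER REFERENCES element(EID),  -- referenced element declaration
        minOccurs  TEXT, maxOccurs TEXT,             -- occurrence values
        position   INTEGER, id TEXT, file_ID INTEGER, parent_EID INTEGER,
        width      INTEGER, height INTEGER, x_Pos INTEGER, y_Pos INTEGER  -- visualization
    );
    """)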
   The edges of the formal definition of an EMX can be derived by knowing the logical structure and the visualization of an EMX. Figure 5 illustrates the allowed edges of EMX nodes. An edge is always a pair of EIDs, from a source (”X”) to a target (”Y”). For example it is possible to add an edge outgoing from an element node to an annotation, constraint, st or ct. A ”black cross” in the figure defines a possible edge.

     Figure 5: Allowed Edges of EMX Nodes (matrix of the source nodes X – element, attribute-group, group, ct, st, annotation, constraint, module, schema – against the target nodes Y; a ”black cross” marks every allowed edge(X,Y), ”yellow arrows” mark nodes that are only implicitly given in a visualization)

   If an EMX is visualized then not all EMX nodes are explicitly given, e.g. the type definitions of the abstract data model (i.e. EMX nodes st, ct; see table 2). In this case the corresponding ”black cross” has to be moved along the given ”yellow arrow”, i.e. an edge in an EMX between a ct (source) and an attribute-group (target) is valid. If this EMX is visualized, then the attribute-group is shown as a child of the group which belongs to the above mentioned ct. Some information is just ”implicitly given” in a visualization of an EMX (e.g. simple types). A ”yellow arrow” which starts and ends in the same field is a hint for a union of different nodes into one node, e.g. if a group contains a wildcard then in the visualization only the group node is visible (extended with the ”blue W”; see table 2).
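   As a small illustration of such a check (not taken from the paper, and listing only the source/target pairs that are explicitly named in the text), an allowed-edge test could be sketched as follows:

    # Partial sketch of an allowed-edge check over EMX node kinds; the complete
    # relation is the matrix of figure 5, only pairs mentioned in the text are listed.
    ALLOWED_EDGES = {
        ("element", "annotation"),
        ("element", "constraint"),
        ("element", "st"),
        ("element", "ct"),
        ("ct", "attribute-group"),
    }

    def edge_allowed(source_kind, target_kind):
        """True if an edge(X, Y) between the two EMX node kinds is permitted."""
        return (source_kind, target_kind) in ALLOWED_EDGES

    assert edge_allowed("element", "annotation")
    assert not edge_allowed("annotation", "element")  # an element is never a child of an annotation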
4. EXAMPLE
   In section 3 the conceptual model EMX was introduced. In the following section an example is given. Figure 6 illustrates an XML Schema in the Garden of Eden modeling style. An event is specified, which contains a place (”ort”) and an id (”event-id”). Furthermore the integration of other
attributes is possible, because of an attribute wildcard in the respective complex type. The place is a sequence of a name and a date (”datum”).

     Figure 6: XML Schema in Garden of Eden Style

   All type definitions (NCNAMEs: ”orttype”, ”eventtype”) and declarations (NCNAMEs: ”event”, ”name”, ”datum”, ”ort” and the attribute ”event-id”) are globally specified. The target namespace prefix is ”eve”, so the QNAME of e.g. the complex type definition ”orttype” is ”eve:orttype”. By using the QNAME every above mentioned definition and declaration can be referenced, so the re-usability of all components is given. Furthermore an attribute wildcard is also specified, i.e. the complex type ”eventtype” contains, apart from the content model sequence and the attribute reference ”eve:event-id”, the element information item <anyAttribute>.
   Figure 7 is the corresponding EMX of the above specified XML Schema. The representation is an obvious simplification, it just contains eight well arranged EMX nodes. These are the elements ”event”, ”ort”, ”name” and ”datum”, an annotation as a child of ”event”, the groups as a child under ”event” and ”ort”, as well as an attribute-group with wildcard. The simple types of the element references ”name” and ”datum” are implicitly given and not visualized. The complex types can be derived by identifying the elements which have no specified simple type but groups as a child (i.e. ”event” and ”ort”).

     Figure 7: EMX to XSD of Figure 6

   The edges are, under consideration of figure 5, pairs of not visualized, internally defined EIDs. The source is the side of the connection without ”black rectangle”, the target is the other side. For example the given annotation is a child of the element ”event” and not the other way round; an element can never be a child of an annotation, neither in the XML Schema specification nor in the EMX.
   The logical structure of the EMX of figure 7 is illustrated in figure 8. The relations of the EMX nodes are given as well as the attributes and corresponding values relevant for the example.

     Figure 8: Logical Structure of Figure 7
       Schema(EID=1, xmlns_xs=http://www.w3.org/2001/XMLSchema, targetName=gvd2013.xsd, TNPrefix=eve)
       Element(EID, name, type_EID, parent_EID): (2, event, 14, 1), (3, name, 11, 1), (4, datum, 12, 1), (5, ort, 13, 1)
       Annotation(EID=10, parent_EID=2, x_Pos=50, y_Pos=100)
       Wildcard(EID=17, parent_EID=14)
       Element_Ref(EID, ref_EID, minOccurs, maxOccurs, parent_EID, x_Pos, y_Pos): (6, 2, 1, 1, 1, 75, 75) [event], (7, 3, 1, 1, 16, 60, 175) [name], (8, 4, 1, 1, 16, 150, 175) [datum], (9, 5, 1, 1, 15, 100, 125) [ort]
       ST(EID, name, mode, parent_EID): (11, xs:string, built-in, 1), (12, xs:date, built-in, 1)
       CT(EID, name, parent_EID): (13, orttype, 1), (14, eventtype, 1)
       Group(EID, mode, parent_EID, x_Pos, y_Pos): (15, sequence, 14, 125, 100) [eventsequence], (16, sequence, 13, 100, 150) [ortsequence]
       Attribute(EID=18, name=event-id, parent_EID=1)
       Attribute_Ref(EID=19, ref_EID=18, parent_EID=14)
       Attribute_Gr(EID=20, parent_EID=1)
       Attribute_Gr_Ref(EID=21, ref_EID=20, parent_EID=14, x_Pos=185, y_Pos=125)

   Next to every tuple of the relations ”Element Ref” and ”Group” small hints are added indicating which tuples are defined (to increase the readability). It is obvious that an EID has to be unique; this is a prerequisite for the logical structure. An EID is created automatically, a user of the EMX can neither influence nor manipulate it.
   The element references contain information about the occurrence (”minOccurs”, ”maxOccurs”), which are not explicitly given in the XSD of figure 6. The XML Schema specification defines default values in such cases. If an element reference does not specify the occurrence values then the standard value ”1” is used; an element reference is obligatory. These default values are also added automatically.
   The stored names of element declarations are NCNAMEs, but by knowing the target namespace of the corresponding schema (i.e. ”eve”) the QNAME can be derived. The name of a type definition is also the NCNAME, but if e.g. a built-in type is specified then the name is the QNAME of the XML Schema specification (”xs:string”, ”xs:date”).
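   The two conventions just described – the default occurrence values and the derivation of a QNAME from a stored NCNAME – can be summarized in a few lines; this sketch only illustrates the rules and is not part of CodeX:

    def apply_occurrence_defaults(min_occurs=None, max_occurs=None):
        # An element reference without explicit values is obligatory: both default to "1".
        return (min_occurs or "1", max_occurs or "1")

    def qname(ncname, prefix="eve"):
        # Built-in types are already stored as QNAMEs of the specification (e.g. "xs:string").
        return ncname if ":" in ncname else prefix + ":" + ncname

    print(apply_occurrence_defaults())    # ('1', '1')
    print(qname("orttype"))               # eve:orttype
    print(qname("xs:string"))             # xs:string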
5. PRACTICAL USE OF EMX
   The co-evolution of XML documents was already mentioned in section 1. At the University of Rostock a research prototype for dealing with this co-evolution was developed: CodeX (Conceptual design and evolution for XML Schema)
[5]. The idea behind it is simple and straightforward at the same time: take an XML Schema, transform it to the specifically developed conceptual model (EMX - Entity Model for XML-Schema), change the simplified conceptual model instead of dealing with the whole complexity of XML Schema, collect this change information (i.e. the user interaction with EMX) and use it to automatically create transformation steps for adapting the XML documents (by using XSLT - Extensible Stylesheet Language Transformations [1]). The mapping between EMX and XSD is unique, so it is possible to describe modifications not only on the EMX but also on the XSD. The transformation and logging language ELaX (Evolution Language for XML-Schema [8]) is used to unify the internally collected information as well as to introduce an interface for dealing directly with XML Schema. Figure 9 illustrates the component model of CodeX, firstly published in [7] but now extended with the ELaX interface.

     Figure 9: Component Model of CodeX [5] (components: a GUI with schema modifications via ELaX and data supply; import and export of XSD, configuration, XML and XSLT files; visualization; evolution engine; model mapping; specification of operations; configuration; XML documents; a knowledge base with model data, evolution specific data and the log; transformation with update notes and evolution results)

   The component model illustrates the different parts for dealing with the co-evolution. The main parts are an import and export component for collecting and providing data of e.g. a user (XML Schemas, configuration files, XML document collections, XSLT files), a knowledge base for storing information (model data, evolution specific data and co-evolution results) and especially the logged ELaX statements (”Log”). The mapping information between XSD and EMX of table 2 is specified in the ”Model data” component.
   Furthermore the CodeX prototype also provides a graphical user interface (”GUI”), a visualization component for the conceptual model and an evolution engine, in which the transformations are derived. The visualization component realizes the visualization of an EMX introduced in table 2. The ELaX interface for modifying imported XML Schemas communicates directly with the evolution engine.
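   The last step of this workflow, applying the derived transformations to the document collection, amounts to an ordinary XSLT run. A generic sketch (not CodeX itself; the file names are placeholders) using lxml could look like this:

    from lxml import etree

    transform = etree.XSLT(etree.parse("adaptation.xslt"))  # stylesheet derived by the evolution engine
    document  = etree.parse("instance.xml")                 # document valid against the old XSD
    adapted   = transform(document)                         # instance adapted to the evolved schema
    print(str(adapted))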
6. CONCLUSION
   Valid XML documents need e.g. an XML Schema, which restricts the possibilities and usage of declarations, definitions and structures in general. In a heterogeneous, changing environment (e.g. an information exchange scenario), also ”old” and long-used XML Schemas have to be modified to meet new requirements and to be up-to-date.
   EMX (Entity Model for XML-Schema) as a conceptual model is a simplified representation of an XSD, which hides its complexity and offers a graphical presentation. A unique mapping exists between every XSD modeled in the Garden of Eden style and an EMX, so it is possible to representatively adapt or modify the conceptual model instead of the XML Schema.
   This article presents the formal definition of an EMX; all in all there are different nodes, which are connected by directed edges. Thereby the abstract data model and element information items of the XML Schema specification were considered, and the allowed edges are specified according to the specification. In general the most important components of an XSD are represented in an EMX, e.g. elements, attributes, simple types, complex types, annotations, constraints, model groups and group definitions. Furthermore the logical structure is presented, which defines not only the underlying storing relations but also the relationships between them. The visualization of an EMX is also defined: outgoing from 18 relations in the logical structure, there are eight EMX nodes in the conceptual model, of which six are visualized.
   Our conceptual model is an essential prerequisite for the prototype CodeX (Conceptual design and evolution for XML Schema) as well as for the above mentioned co-evolution. A remaining step is the finalization of the implementation in CodeX. After this work an evaluation of the usability of the conceptual model is planned. Nevertheless we are confident that the usage is straightforward and that the simplification of EMX in comparison to dealing with the whole complexity of an XML Schema itself is huge.

7. REFERENCES
[1] XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 26-March-2013.
[2] Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/2008/REC-xml-20081126/, November 2008. Online; accessed 26-March-2013.
[3] XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition). http://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/, December 2010. Online; accessed 26-March-2013.
[4] W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/, April 2012. Online; accessed 26-March-2013.
[5] M. Klettke. Conceptual XML Schema Evolution - the CoDEX Approach for Design and Redesign. In BTW Workshops, pages 53–63, 2007.
[6] E. Maler. Schema design rules for UBL...and maybe for you. In XML 2002 Proceedings by deepX, 2002.
[7] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. In Grundlagen von Datenbanken, pages 29–34, 2012.
[8] T. Nösinger, M. Klettke, and A. Heuer. Automatisierte Modelladaptionen durch Evolution - (R)ELaX in the Garden of Eden. Technical Report CS-01-13, Institut für Informatik, Universität Rostock, Rostock, Germany, Jan. 2013. ISSN 0944-5900.
[9] E. van der Vlist. XML Schema. O'Reilly & Associates, Inc., 2002.
    Semantic Enrichment of Ontology Mappings: Detecting
       Relation Types and Complex Correspondences

                                                          Patrick Arnold
                                                          Universität Leipzig
                                              arnold@informatik.uni-leipzig.de


25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Copyright is held by the author/owner(s).

ABSTRACT
While there are numerous tools for ontology matching, most approaches provide only little information about the true nature of the correspondences they discover, restricting themselves to the mere links between matching concepts. However, many disciplines, such as ontology merging, ontology evolution or data transformation, require more-detailed information, such as the concrete relation type of matches or information about the cardinality of a correspondence (one-to-one or one-to-many). In this study we present a new approach where we denote additional semantic information to an initial ontology mapping carried out by a state-of-the-art matching tool. The enriched mapping contains the relation type (like equal, is-a, part-of) of the correspondences as well as complex correspondences. We present different linguistic, structural and background knowledge strategies that allow semi-automatic mapping enrichment, and according to our first internal tests we are already able to add valuable semantic information to an existing ontology mapping.

Keywords
ontology matching, relation type detection, complex correspondences, semantic enrichment

1. INTRODUCTION
   Ontology matching plays a key role in data integration and ontology management. With the ontologies getting increasingly larger and more complex, as in the medical or biological domain, efficient matching tools are an important prerequisite for ontology matching, merging and evolution. There are already various approaches and tools for ontology matching, which exploit many different techniques like lexicographic, linguistic or structural methods in order to identify the corresponding concepts between two ontologies [16], [2]. The determined correspondences build a so-called alignment or ontology mapping, with each correspondence being a triple (s, t, c), where s is a concept in the source ontology, t a concept in the target ontology and c the confidence (similarity).
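   Purely for illustration (this record type is ours, not taken from any of the cited tools), such a correspondence, together with the relation type that the enrichment later annotates, can be written as:

    from dataclasses import dataclass

    @dataclass
    class Correspondence:
        source: str              # concept s in the source ontology
        target: str              # concept t in the target ontology
        confidence: float        # similarity value c
        relation: str = "equal"  # default type of the initial match result

    print(Correspondence("zip code", "postal code", 0.95))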
   These tools are able to highly reduce the effort of manual ontology mapping, but most approaches solely focus on detecting the matching pairs between two ontologies, without giving any specific information about the true nature of these matches. Thus, a correspondence is commonly regarded as an equivalence relation, which is correct for a correspondence like (zip code, postal code), but incorrect for correspondences like (car, vehicle) or (tree trunk, tree), where is-a resp. part-of would be the correct relation type. This restriction is an obvious shortcoming, because in many cases a mapping should also include further kinds of correspondences, such as is-a, part-of or related. Adding this information to a mapping is generally beneficial and has been shown to considerably improve ontology merging [13]. It provides more precise mappings and is also a crucial aspect in related areas, such as data transformation, entity resolution and linked data.
   An example is given in Fig. 1, which depicts the basic idea of our approach. While we get a simple alignment as input, with the mere links between concepts (above picture), we return an enriched alignment with the relation type annotated to each correspondence (lower picture). As we will point out in the course of this study, we use different linguistic methods and background knowledge in order to find the relevant relation type. Besides this, we have to distinguish between simple concepts (as ”Office Software”) and complex concepts, which contain itemizations like ”Monitors and Displays”, and which need a special treatment for relation type detection.
   Another issue of present ontology matchers is their restriction to (1:1)-correspondences, where exactly one source concept matches exactly one target concept. However, this can occasionally lead to inaccurate mappings, because there may occur complex correspondences where more than one source element corresponds to a target element or vice versa, as the two concepts first name and last name correspond to a concept name, leading to a (2:1)-correspondence. We will show in section 5 that distinguishing between one-to-one and one-to-many correspondences plays an important role in data transformation, and that we can exploit the results from the relation type detection to discover such complex matches in a set of (1:1)-matches to add further knowledge to a mapping.
   In this study we present different strategies to assign the relation types to an existing mapping and demonstrate how
complex correspondences can be discovered. Our approach, which we refer to as Enrichment Engine, takes an ontology mapping generated by a state-of-the-art matching tool as input and returns a more-expressive mapping with the relation type added to each correspondence and complex correspondences revealed. According to our first internal tests, we recognized that even simple strategies already add valuable information to an initial mapping and may be a notable gain for current ontology matching tools.

     Figure 1: Input (above) and output (below) of the Enrichment Engine

   Our paper is structured as follows: We discuss related work in section 2 and present the architecture and basic procedure of our approach in section 3. In section 4 we present different strategies to determine the relation types in a mapping, while we discuss the problem of complex correspondence detection in section 5. We finally conclude in section 6.

2. RELATED WORK
   Only a few tools and studies regard different kinds of correspondences or relationships for ontology matching. S-Match [6][7] is one of the first such tools for ”semantic ontology matching”. They distinguish between equivalence, subset (is-a), overlap and mismatch correspondences and try to provide a relationship for any pair of concepts of two ontologies by utilizing standard match techniques and background knowledge from WordNet. Unfortunately, the result mappings tend to become very voluminous with many correspondences per concept, while users are normally interested only in the most relevant ones.
   Taxomap [11] is an alignment tool developed for the geographic domain. It regards the correspondence types equivalence, less/more-general (is-a / inverse is-a) and is-close (”related”) and exploits linguistic techniques and background sources such as WordNet. The linguistic strategies seem rather simple; if a term appears as a part in another term, a more-general relation is assumed, which is not always the case. For example, in Figure 1 the mentioned rule holds for the correspondence between Games and Action Games, but not between Monitors and Monitors and Displays. In [14], the authors evaluated Taxomap for a mapping scenario with 162 correspondences and achieved a recall of 23 % and a precision of 89 %.
   The LogMap tool [9] distinguishes between equivalence and so-called weak (subsumption / is-a) correspondences. It is based on Horn Logic, where first lexicographic and structural knowledge from the ontologies is accumulated to build an initial mapping and subsequently an iterative process is carried out to first enhance the mapping and then to verify the enhancement. This tool is the least precise one with regard to relation type detection, and in evaluations the relation types were not further regarded.
   Several further studies deal with the identification of semantic correspondence types without providing a complete tool or framework. An approach utilizing current search engines is introduced in [10]. For two concepts A, B they generate different search queries like ”A, such as B” or ”A, which is a B” and submit them to a search engine (e.g., Google). They then analyze the snippets of the search engine results, if any, to verify or reject the tested relationship. The approach in [15] uses the Swoogle search engine to detect correspondences and relationship types between concepts of many crawled ontologies. The approach supports equal, subset or mismatch relationships. [17] exploits reasoning and machine learning to determine the relation type of a correspondence, where several structural patterns between ontologies are used as training data.
   Unlike relation type determination, the complex correspondence detection problem has hardly been discussed so far. It was once addressed in [5], coming to the conclusion that there is hardly any approach for complex correspondence detection because of the vast amount of required comparisons in contrast to (1:1)-matching, as well as the many possible operators needed for the mapping function. One key observation for efficient complex correspondence detection has been the need of large amounts of domain knowledge, but until today there is no available tool being able to semi-automatically detect complex matches.
   One remarkable approach is iMAP [4], where complex matches between two schemas could be discovered and even several transformation functions calculated, as RoomPrice = RoomPrice ∗ (1 + TaxPrice). For this, iMAP first calculates (1:1)-matches and then runs an iterative process to gradually combine them to more-complex correspondences. To justify complex correspondences, instance data is analyzed and several heuristics are used. In [8] complex correspondences were also regarded for matching web query interfaces, mainly exploiting co-occurrences. However, in order to derive common co-occurrences, the approach requires a large amount of schemas as input, and thus does not appear appropriate for matching two or few schemas.
   While the approaches presented in this section try to achieve both matching and semantic annotation in one step, thus often tending to neglect the latter part, we will demonstrate a two-step architecture in which we first perform a
schema mapping and then concentrate straight on the enrichment of the mapping (semantic part). Additionally, we want to analyze several linguistic features to provide more qualitative mappings than obtained by the existing tools, and finally develop an independent system that is not restricted to schema and ontology matching, but will be differently exploitable in the wide field of data integration and data analysis.

3. ARCHITECTURE
   As illustrated in Fig. 2 our approach uses a 2-step architecture in which we first calculate an ontology mapping (match result) using our state-of-the-art matching tool COMA 3.0 (step 1) [12] and then perform an enrichment on this mapping (step 2).
   Our 2-step approach for semantic ontology matching offers different advantages. First of all, we reduce complexity compared to 1-step approaches that try to directly determine the correspondence type when comparing concepts in O1 with concepts in O2. For large ontologies, such a direct matching is already time-consuming and error-prone for standard matching. The proposed approaches for semantic matching are even more complex and could not yet demonstrate their general effectiveness.
   Secondly, our approach is generic as it can be used for different domains and in combination with different matching tools for the first step. We can even re-use the tool in different fields, such as entity resolution or text mining. On the other hand, this can also be a disadvantage, since the enrichment step depends on the completeness and quality of the initially determined match result. Therefore, it is important to use powerful tools for the initial matching and possibly to fine-tune their configuration.

     Figure 2: Basic Workflow for Mapping Enrichment

   The basics of the relation type detection, on which we focus in this study, can be seen in the right part of Fig. 2. We provide 4 strategies so far (Compound, Background Knowledge, Itemization, Structure), where each strategy returns the relation type of a given correspondence, or ”undecided” in case no specific type can be determined. In the Enrichment step we thus iterate through each correspondence in the mapping and pass it to each strategy. We eventually annotate the type that was most frequently returned by the strategies (type computation). In this study, we regard 4 distinct relation types: equal, is-a and inv. is-a (composition), part-of and has-a (aggregation), as well as related.
   There are two problems we may encounter when computing the correspondence type. First, all strategies may return ”undecided”. In this case we assign the relation type ”equal”, because it is the default type in the initial match result and possibly the most likely one to hold. Secondly, there might be different outcomes from the strategies, e.g., one returns is-a, one equal and the others undecided. There are different ways to solve this problem, e.g., by prioritizing strategies or relation types. However, we hardly discovered such cases so far, so we currently return ”undecided” and request the user to manually specify the correct type.
   At present, our approach is already able to fully assign relation types to an input mapping using the 4 strategies, which we will describe in detail in the next section. We have not implemented strategies to create complex matches from the match result, but will address a couple of conceivable techniques in section 5.
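   A minimal sketch of this type computation, as we read the description above (a majority vote with the two fallbacks; not the actual implementation), could look as follows:

    from collections import Counter

    def compute_type(correspondence, strategies):
        """strategies: callables that return a relation type or 'undecided'."""
        votes = [strategy(correspondence) for strategy in strategies]
        counts = Counter(v for v in votes if v != "undecided")
        if not counts:
            return "equal"       # all strategies undecided: keep the default type
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return "undecided"   # conflicting outcomes: let the user decide
        return ranked[0][0]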
4. IMPLEMENTED STRATEGIES
   We have implemented 4 strategies to determine the type of a given correspondence. Table 1 gives an overview of the strategies and the relation types they are able to detect. It can be seen that the Background Knowledge approach is especially valuable, as it can help to detect all relationship types. Besides, all strategies are able to identify is-a correspondences.

     Strategy          equal   is-a   part-of   related
     Compounding               X
     Background K.       X     X        X          X
     Itemization         X     X
     Structure                 X        X

     Table 1: Supported correspondence types by the strategies

   In the following let O1, O2 be two ontologies with c1, c2 being two concepts from O1 resp. O2. Further, let C = (c1, c2) be a correspondence between two concepts (we do not regard the confidence value in this study).

4.1 Compound Strategy
   In linguistics, a compound is a special word W that consists of a head WH carrying the basic meaning of W, and a modifier WM that specifies WH [3]. In many cases, a compound thus expresses something more specific than its head, and is therefore a perfect candidate to discover an is-a relationship. For instance, a blackboard is a board or an apple tree is a tree. Such compounds are called endocentric compounds, while exocentric compounds are not related with their head, such as buttercup, which is not a cup, or saw tooth, which is not a tooth. These compounds have a figurative meaning (metaphors) or changed their spelling as the language evolved, and thus do not hold the is-a relation, or only to a very limited extent (like airport, which is a port only in a broad sense). There is a third form of compounds, called appositional or copulative compounds, where the two words are at the same level, and the relation is rather more-general (inverse is-a) than more-specific, as in Bosnia-Herzegowina, which means both Bosnia and Herzegowina, or bitter-sweet, which means both bitter and sweet (not necessarily a ”specific bitter” or a ”specific sweet”). However, this type is quite rare.
   In the following, let A, B be the literals of two concepts of a correspondence. The Compound Strategy analyzes whether B ends with A. If so, it seems likely that B
is a compound with head A, so that the relationship B is-a A (or A inv. is-a B) is likely to hold. The Compound approach allows us to identify the three is-a correspondences shown in Figure 1 (below).
   We added an additional rule to this simple approach: B is only considered a compound of A if length(B) − length(A) ≥ 3, where length(X) is the length of a string X. Thus, we expect the supposed compound to be at least 3 characters longer than the head it matches. This way, we are able to eliminate obviously wrong compound conclusions, like stable is a table, which we call pseudo compounds. The value of 3 is motivated by the observation that typical nouns or adjectives consist of at least 3 letters.
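   The rule just described fits in a few lines; the following sketch is our illustration of it (the lower-casing and trimming are assumptions, not stated in the paper):

    def compound_relation(a, b):
        """Return 'is-a' if b looks like an endocentric compound with head a."""
        a, b = a.strip().lower(), b.strip().lower()
        if b.endswith(a) and len(b) - len(a) >= 3:
            return "is-a"
        return "undecided"

    print(compound_relation("Games", "Action Games"))   # is-a
    print(compound_relation("table", "stable"))         # undecided (pseudo compound filtered out)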
4.2 Background Knowledge
   Background knowledge is commonly of great help in ontology matching to detect more difficult correspondences, especially in special domains. In our approach, we intend to use it for relation type detection. So far, we use WordNet 3.0 to determine the relation that holds between two words (resp. two concepts). WordNet is a powerful dictionary and thesaurus that contains synonym relations (equivalence), hypernym relations (is-a) and holonym relations (part-of) between words [22]. Using the Java API for WordNet Search (JAWS), we built an interface that allows us to answer questions like ”Is X a synonym to Y?”, or ”Is X a direct hypernym of Y?”. The interface is also able to detect cohyponyms, which are two words X, Y that have a common direct hypernym Z. We call a correspondence between two cohyponyms X and Y related, because both concepts are connected to the same father element. For example, the relation between apple tree and pear tree is related, because of the common father concept tree.
   Although WordNet has a limited vocabulary, especially with regard to specific domains, it is a valuable source to detect the relation type that holds between concepts. It allows an excellent precision, because the links in WordNet are manually defined, and it contains all relation types we intend to detect, which the other strategies are not able to achieve.
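   The interface described above is built on JAWS (Java); purely as an illustration, a comparable lookup can be sketched with NLTK's WordNet corpus – this substitution is ours and only mirrors the kinds of questions listed above:

    # Requires: pip install nltk   and   nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def wordnet_relation(x, y):
        for a in wn.synsets(x):
            for b in wn.synsets(y):
                if a == b:
                    return "equal"                          # shared synset: synonyms
                if b in a.hypernyms():
                    return "is-a"                           # y is a direct hypernym of x
                if b in a.part_holonyms():
                    return "part-of"                        # y is a holonym of x
                if set(a.hypernyms()) & set(b.hypernyms()):
                    return "related"                        # cohyponyms under a common hypernym
        return "undecided"

    print(wordnet_relation("car", "automobile"))            # equal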
4.3 Itemization
   In several taxonomies we recognized that itemizations appear very often and cannot be processed with the previously presented strategies. Consider the correspondence (”books and newspapers”, ”newspapers”). The compound strategy would be misled and consider the source concept a compound, resulting in the type ”is-a”, although the opposite is the case (inv. is-a). WordNet would not know the word ”books and newspapers” and return ”undecided”.
   Itemizations thus deserve special treatment. We first split each itemization into its atomic items, where we define an item as a string that does not contain commas, slashes or the words ”and” and ”or”.
   We now show how our approach determines the correspondence types between two concepts C1, C2 where at least one of the two concepts is an itemization with more than one item. Let I1 be the item set of C1 and I2 the item set of C2. Let w1, w2 be two words, with w1 ≠ w2. Our approach works as follows:

   1. In each set I remove each w1 ∈ I which is a hyponym of w2 ∈ I.
   2. In each set I, replace a synonym pair (w1 ∈ I, w2 ∈ I) by w1.
   3. Remove each w1 ∈ I1, w2 ∈ I2 if there is a synonym pair (w1, w2).
   4. Remove each w2 ∈ I2 which is a hyponym of w1 ∈ I1.
   5. Determine the relation type:
      (a) If I1 = ∅, I2 = ∅: equal
      (b) If I1 = ∅, |I2| ≥ 1: is-a
          If I2 = ∅, |I1| ≥ 1: inverse is-a
      (c) If |I1| ≥ 1, |I2| ≥ 1: undecided

The rationale behind this algorithm is that we remove items from the item sets as long as no information gets lost. Then we compare what is left in the two sets and come to the conclusions presented in step 5.
   Let us consider the concept pair C1 = ”books, ebooks, movies, films, cds” and C2 = ”novels, cds”. Our item sets are I1 = {books, ebooks, movies, films, cds}, I2 = {novels, cds}. First, we remove synonyms and hyponyms within each set, because this would cause no loss of information (steps 1+2). We remove films in I1 (because of the synonym movies) and ebooks in I1, because it is a hyponym of books. We have I1 = {books, movies, cds}, I2 = {novels, cds}. Now we remove synonym pairs between the two item sets, so we remove cds in either set (step 3). Lastly, we remove a hyponym in I2 if there is a hypernym in I1 (step 4). We remove novels in I2, because it is a book. We have I1 = {books, movies}, I2 = ∅. Since I1 still contains items, while I2 is empty, we conclude that I1 specifies something more general, i.e., it holds C1 inverse is-a C2.
   If neither item set is empty, we return ”undecided” because we cannot derive an equal or is-a relationship in this case.
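   The following sketch illustrates the five steps (our code, not the evaluated system); the two predicates is_synonym and is_hyponym are placeholders which, in the paper, would be answered via the Background Knowledge strategy:

    import re

    def items(concept):
        """Split an itemization into atomic items (no commas, slashes, 'and', 'or')."""
        return {p.strip() for p in re.split(r",|/|\band\b|\bor\b", concept.lower()) if p.strip()}

    def itemization_relation(c1, c2, is_synonym, is_hyponym):
        i1, i2 = items(c1), items(c2)
        # Steps 1+2: inside each set, drop hyponyms and one word of every synonym pair.
        for s in (i1, i2):
            for w in sorted(s):
                if any(v != w and (is_hyponym(w, v) or is_synonym(w, v)) for v in s):
                    s.remove(w)
        # Step 3: remove synonym pairs across the two sets.
        for w1 in sorted(i1):
            for w2 in sorted(i2):
                if w1 in i1 and w2 in i2 and is_synonym(w1, w2):
                    i1.remove(w1)
                    i2.remove(w2)
        # Step 4: remove items of I2 that are hyponyms of an item of I1.
        for w2 in sorted(i2):
            if any(is_hyponym(w2, w1) for w1 in i1):
                i2.remove(w2)
        # Step 5: compare what is left.
        if not i1 and not i2:
            return "equal"
        if not i1:
            return "is-a"
        if not i2:
            return "inverse is-a"
        return "undecided"

    # The worked example from the text, with toy predicates:
    is_syn = lambda a, b: a == b or {a, b} == {"movies", "films"}
    is_hyp = lambda a, b: (a, b) in {("ebooks", "books"), ("novels", "books")}
    print(itemization_relation("books, ebooks, movies, films, cds", "novels, cds",
                               is_syn, is_hyp))   # -> inverse is-a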
4.4 Structure Strategy
   The structure strategy takes the structure of the ontologies into account. For a correspondence between concepts Y and Z we check whether we can derive a semantic relationship between a father concept X of Y and Z (or vice versa). For an is-a relationship between Y and X we draw the following conclusions:

   • X equiv Z → Y is-a Z
   • X is-a Z → Y is-a Z

For a part-of relationship between Y and X we can analogously derive:

   • X equiv Z → Y part-of Z
   • X part-of Z → Y part-of Z

The approach obviously utilizes the semantics of the intra-ontology relationships to determine the correspondence types for pairs of concepts for which the semantic relationship cannot directly be determined.
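   These four derivation rules can be encoded directly; the following sketch (our illustration) maps the known relationship of Y to its father X and of X to Z onto the type of the correspondence (Y, Z):

    def structure_relation(rel_y_to_father, rel_father_to_z):
        """Derive the type of (Y, Z) from Y's relation to its father X and X's relation to Z."""
        rules = {
            ("is-a", "equal"):      "is-a",
            ("is-a", "is-a"):       "is-a",
            ("part-of", "equal"):   "part-of",
            ("part-of", "part-of"): "part-of",
        }
        return rules.get((rel_y_to_father, rel_father_to_z), "undecided")

    print(structure_relation("is-a", "equal"))   # is-a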
4.5 Comparison
   We tested our strategies and overall system on 3 user-generated mappings in which each correspondence was tagged with its supposed type. After running the scenarios, we checked how many of the non-trivial relations were detected by the program. The 3 scenarios consisted of about 350
.. 750 correspondences. We had a German-language sce-
nario (product catalogs from online shops), a health scenario
(diseases) and a text annotation catalog scenario (everyday
speech).
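To make the structure strategy from Section 4.4 concrete, the following sketch applies its four derivation rules to already known relationships. The data structures and names are illustrative assumptions and are not taken from the actual Enrichment Engine.

# Minimal sketch of the structure strategy (Section 4.4): propagate a known
# relationship between a father concept X and a concept Z down to the child Y.
# The rule tables mirror the four derivation rules; everything else is illustrative.

IS_A_RULES = {            # Y is-a X, and X <rel> Z  =>  Y <rel'> Z
    "equiv": "is-a",
    "is-a": "is-a",
}
PART_OF_RULES = {         # Y part-of X, and X <rel> Z  =>  Y <rel'> Z
    "equiv": "part-of",
    "part-of": "part-of",
}

def derive_type(rel_y_to_x, rel_x_to_z):
    """Return the derived correspondence type between Y and Z, or None."""
    if rel_y_to_x == "is-a":
        return IS_A_RULES.get(rel_x_to_z)
    if rel_y_to_x == "part-of":
        return PART_OF_RULES.get(rel_x_to_z)
    return None

# Example: Y is-a X in the source ontology, and another strategy determined
# X equiv Z; the structure strategy then yields Y is-a Z.
print(derive_type("is-a", "equiv"))     # -> "is-a"
print(derive_type("part-of", "is-a"))   # -> None (undecided)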
   Compounding and Background Knowledge are two independent strategies that separately try to determine the relation type of a correspondence. In our tests we saw that Compounding offers good precision (72 .. 97 %), even though many exocentric compounds and pseudo-compounds exist. By contrast, we recognized only moderate recall, ranging from 12 to 43 %. Compounding can only determine is-a relations; however, it is the only strategy that works in every scenario.
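As a rough illustration of the compound strategy evaluated here (its full description precedes this section), a simple head-matching test could look as follows. The suffix check is an assumption made for illustration; it deliberately ignores exocentric compounds and pseudo-compounds, which is exactly what limits precision in practice.

def compound_is_a(source: str, target: str) -> bool:
    """Very rough head-matching test: 'apple tree' is-a 'tree'.

    Assumes the compound head is the last token, which does not hold for
    exocentric or pseudo-compounds."""
    src_tokens = source.lower().split()
    return len(src_tokens) > 1 and src_tokens[-1] == target.lower()

print(compound_is_a("apple tree", "tree"))                   # True: apple tree is-a tree
print(compound_is_a("books and newspapers", "newspapers"))   # True, but wrong: inverse is-a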
   Background Knowledge has a low or moderate recall (10 .. 50 %), depending on the scenario at hand. However, it offers excellent precision, very close to 100 %, and is the only strategy that is able to determine all relation types we regard. As a matter of fact, it did not work on our German-language scenario and worked only poorly in our health scenario.
   The Structure and Itemization strategies depend strongly on the given schemas and are thus very specific strategies to handle individual cases. They exploit the Compound and Background Knowledge strategies and are thus not independent. Still, they were able to boost the recall to some degree.
   We realized that the best result is gained by exploiting all strategies. Currently, we do not weight the strategies; however, we may do so in order to optimize our system. We finally achieved an overall recall between 46 and 65 % and a precision between 69 and 97 %.

5. COMPLEX CORRESPONDENCES
   Schema and ontology matching tools generally calculate (1:1)-correspondences, where exactly one source element matches exactly one target element. Naturally, either element may take part in different correspondences, as in (name, first name) and (name, last name); however, having these two separate correspondences is very imprecise, and the correct mapping would rather be the single correspondence ((first name, last name), (name)). These kinds of matches are called complex correspondences or one-to-many correspondences.
   The disambiguation between a complex correspondence and two (or more) one-to-one correspondences is an indispensable prerequisite for data transformation, where data from a source database is to be transformed into a target database, as we showed in [1]. Moreover, we showed that each complex correspondence needs a transformation function in order to map data correctly. If elements are of the type string, the transformation function is normally a concatenation in (n:1)-matches and a split in (1:n)-matches. If the elements are of a numerical type, as in the correspondence ((costs), ((operational costs), (material costs), (personnel costs))), a set of numerical operations is normally required.
   There are proprietary solutions that allow users to manually create transformation mappings including complex correspondences, such as Microsoft BizTalk Server [19], Altova MapForce [18] or Stylus Studio [20]; however, to the best of our knowledge there is no matching tool that is able to detect complex correspondences automatically. Besides relation type detection, we therefore intend to discover complex correspondences in the initial mapping, which is a second important step of mapping enrichment.
   We already developed simple methods that exploit the structure of the schemas to transform several (1:1)-correspondences into a complex correspondence, although these approaches will fail in more intricate scenarios. We used the structure of the schemas and the already existing (1:1)-matches to derive complex correspondences. Fig. 3 demonstrates this approach. There are two complex correspondences in the mapping, ((First Name, Last Name), (Name)) and ((Street, City, Zip Code, Country), Address), represented by simple (1:1)-correspondences. Our approach was able to detect both complex correspondences. The first one (name) was detected because first name and last name cannot be mapped to one element at the same time, since the name element can only store either of the two values. The second example (address) is detected since schema data is located in the leaf nodes, not in inner nodes. In database schemas we always expect data to reside in the leaf nodes, so that the match (Address, Address) is considered unreasonable.

Figure 3: Match result containing two complex correspondences (name and address)

   In the first case, our approach would apply the concatenation function, because two values have to be concatenated to match the target value, and in the second case the split function would be applied, because the Address values have to be split into the address components (street, city, zip code, country). The user needs to adjust these functions, e.g., in order to tell the program where in the address string the split operations have to be performed.
   This approach was mostly based on heuristics and would only work in simple cases. Now that we are able to determine the relation types of (1:1)-matches, we can enhance this original approach. If a node takes part in more than one composition relation (part-of / has-a), we can conclude that it is a complex correspondence and can derive it from the (1:1)-correspondences. For instance, if we have the 3 correspondences (day part-of date), (month part-of date), (year part-of date), we could create the complex correspondence ((day, month, year), date).
   We have not implemented this approach so far, and we assume that detecting complex correspondences and the correct transformation function will remain a very challenging issue, so that we intend to investigate additional methods, like using instance data, to achieve more effectiveness. However, by adding these techniques to our existing Enrichment Engine, we are able to present a first solution that semi-automatically determines complex correspondences, which is another step towards more precise ontology matching and an important precondition for data transformation.
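The enhancement sketched above can be illustrated as follows. The correspondence representation, the grouping criterion and the default choice of transformation function are simplifying assumptions for illustration, not the actual implementation.

from collections import defaultdict

# Each (1:1)-correspondence is a triple (source_element, relation_type, target_element).
correspondences = [
    ("day", "part-of", "date"),
    ("month", "part-of", "date"),
    ("year", "part-of", "date"),
    ("first name", "part-of", "name"),
    ("last name", "part-of", "name"),
]

def derive_complex(corrs):
    """Group correspondences whose target takes part in several composition
    relations (part-of / has-a) into one (n:1) complex correspondence."""
    by_target = defaultdict(list)
    for src, rel, tgt in corrs:
        if rel in ("part-of", "has-a"):
            by_target[tgt].append(src)
    complex_corrs = []
    for tgt, sources in by_target.items():
        if len(sources) > 1:
            # (n:1) string targets would typically get a concatenation function;
            # the inverse (1:n) direction would get a split function instead.
            complex_corrs.append((tuple(sources), tgt, "concatenation"))
    return complex_corrs

for sources, target, func in derive_complex(correspondences):
    print(f"({', '.join(sources)}) -> {target}   [default function: {func}]")
# yields ((day, month, year), date) and ((first name, last name), name)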
6. OUTLOOK AND CONCLUSION
   We presented a new approach to semantically enrich ontology mappings by determining the concrete relation type of a correspondence and detecting complex correspondences. For this, we developed a 2-step architecture in which the actual ontology matching and the semantic enrichment are strictly separated. This makes the Enrichment Engine highly generic, so that it is not designed for any specific ontology matching tool and, moreover, can be used independently in various fields different from ontology matching, such as data transformation, entity resolution and text mining.
   In our approach we developed new linguistic strategies to determine the relation type, and with regard to our first internal tests even the rather simple strategies already added much useful information to the input mapping. We also discovered that some strategies (Compounding, and to a lesser degree Itemization and Structure) are rather independent of the language of the ontologies, so that our approach provided remarkable results for both German- and English-language ontologies.
   One important obstacle is the strong dependency on the initial mapping. We recognized that matching tools tend to discover equivalence relations, so that different non-equivalence correspondences are not contained in the initial mapping and thus cannot be detected. It is future work to adjust our tool COMA 3.0 to provide a more convenient input, e.g., by using relaxed configurations. A particular issue we are going to investigate is the use of instance data connected with the concepts to derive the correct relation type if the other strategies (which operate on the meta level) fail. This will also raise time-complexity problems, which we will have to consider in our ongoing research.
   Our approach is still at a rather early stage, and there is still much room for improvement, since the implemented strategies have different restrictions so far. For this reason, we will extend and fine-tune our tool in order to increase effectiveness and precision. Among other aspects, we intend to improve the structure strategy by considering the entire concept path rather than the mere father concept, to add further background knowledge to the system, especially in specific domains, and to investigate further linguistic strategies, for instance, in which way compounds also indicate the part-of relation. Besides relation type detection, we will also concentrate on complex correspondence detection in data transformation to provide further semantic information to ontology mappings.

7. ACKNOWLEDGMENT
   This study was partly funded by the European Commission through Project "LinkedDesign" (No. 284613 FoF-ICT-2011.7.4).

8. REFERENCES
[1] Arnold, P.: The Basics of Complex Correspondences and Functions and their Implementation and Semi-automatic Detection in COMA++. Master's thesis, University of Leipzig, 2011.
[2] Bellahsene, Z., Bonifati, A., Rahm, E. (eds.): Schema Matching and Mapping. Springer (2011)
[3] Bisetto, A., Scalise, S.: Classification of Compounds. University of Bologna, 2009. In: The Oxford Handbook of Compounding, Oxford University Press, pp. 49–82
[4] Dhamankar, R., Lee, Y., Doan, A., Halevy, A., Domingos, P.: iMAP: Discovering Complex Semantic Matches between Database Schemas. In: SIGMOD '04, pp. 383–394
[5] Doan, A., Halevy, A. Y.: Semantic Integration Research in the Database Community: A Brief Survey. In: AI Mag. (2005), pp. 83–94
[6] Giunchiglia, F., Shvaiko, P., Yatskevich, M.: S-Match: An Algorithm and an Implementation of Semantic Matching. In: Proceedings of the European Semantic Web Symposium (2004), LNCS 3053, pp. 61–75
[7] Giunchiglia, F., Autayeu, A., Pane, J.: S-Match: an open source framework for matching lightweight ontologies. In: Semantic Web, vol. 3-3 (2012), pp. 307–317
[8] He, B., Chen-Chuan Chang, K., Han, J.: Discovering complex matchings across web query interfaces: A correlation mining approach. In: KDD '04, pp. 148–157
[9] Jiménez-Ruiz, E., Grau, B. C.: LogMap: Logic-Based and Scalable Ontology Matching. In: International Semantic Web Conference (2011), LNCS 7031, pp. 273–288
[10] van Hage, W. R., Katrenko, S., Schreiber, G.: A Method to Combine Linguistic Ontology-Mapping Techniques. In: International Semantic Web Conference (2005), LNCS 3729, pp. 732–744
[11] Hamdi, F., Safar, B., Niraula, N. B., Reynaud, C.: TaxoMap alignment and refinement modules: Results for OAEI 2010. In: Proceedings of the ISWC Workshop (2010), pp. 212–219
[12] Massmann, S., Raunich, S., Aumueller, D., Arnold, P., Rahm, E.: Evolution of the COMA Match System. In: Proc. Sixth Intern. Workshop on Ontology Matching (2011)
[13] Raunich, S., Rahm, E.: ATOM: Automatic Target-driven Ontology Merging. In: Proc. Int. Conf. on Data Engineering (2011)
[14] Reynaud, C., Safar, B.: Exploiting WordNet as Background Knowledge. In: Proc. Intern. ISWC'07 Ontology Matching (OM-07) Workshop
[15] Sabou, M., d'Aquin, M., Motta, E.: Using the semantic web as background knowledge for ontology mapping. In: Proc. 1st Intern. Workshop on Ontology Matching (2006)
[16] Shvaiko, P., Euzenat, J.: A Survey of Schema-based Matching Approaches. In: J. Data Semantics IV (2005), pp. 146–171
[17] Spiliopoulos, V., Vouros, G., Karkaletsis, V.: On the discovery of subsumption relations for the alignment of ontologies. In: Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010), pp. 69–88
[18] Altova MapForce - Graphical Data Mapping, Conversion, and Integration Tool. http://www.altova.com/mapforce.html
[19] Microsoft BizTalk Server. http://www.microsoft.com/biztalk
[20] XML Editor, XML Tools, and XQuery - Stylus Studio. http://www.stylusstudio.com/
[21] Java API for WordNet Searching (JAWS). http://lyle.smu.edu/~tspell/jaws/index.html
[22] WordNet - A lexical database for English. http://wordnet.princeton.edu/wordnet/
     Extraktion und Anreicherung von Merkmalshierarchien
      durch Analyse unstrukturierter Produktrezensionen

                                                           Robin Küppers
                                                      Institut für Informatik
                                                    Heinrich-Heine-Universität
                                                        Universitätsstr. 1
                                                  40225 Düsseldorf, Deutschland
                                             kueppers@cs.uni-duesseldorf.de

ABSTRACT                                                              tionelle Datenblätter oder Produktbeschreibungen möglich
Wir präsentieren einen Algorithmus zur Extraktion bzw.               wäre, da diese dazu tendieren, die Vorteile eines Produkts zu
Anreicherung von hierarchischen Produktmerkmalen mittels              beleuchten und die Nachteile zu verschweigen. Aus diesem
einer Analyse von unstrukturierten, kundengenerierten Pro-            Grund haben potentielle Kunden ein berechtigtes Interesse
duktrezensionen. Unser Algorithmus benötigt eine initiale            an der subjektiven Meinung anderer Käufer.
Merkmalshierarchie, die in einem rekursiven Verfahren mit             Zudem sind kundengenerierte Produktrezensionen auch für
neuen Untermerkmalen angereichert wird, wobei die natür-             Produzenten interessant, da sie wertvolle Informationen über
liche Ordnung der Merkmale beibehalten wird. Die Funk-                Qualität und Marktakzeptanz eines Produkts aus Kunden-
tionsweise unseres Algorithmus basiert auf häufigen, gram-           sicht enthalten. Diese Informationen können Produzenten
matikalischen Strukturen, die in Produktrezensionen oft be-           dabei helfen, die eigene Produktpalette zu optimieren und
nutzt werden, um Eigenschaften eines Produkts zu beschrei-            besser an Kundenbedürfnisse anzupassen.
ben. Diese Strukturen beschreiben Obermerkmale im Kon-                Mit wachsendem Umsatz der Web-Shops nimmt auch die
text ihrer Untermerkmale und werden von unserem Algo-                 Anzahl der Produktrezensionen stetig zu, so dass es für Kun-
rithmus ausgenutzt, um Merkmale hierarchisch zu ordnen.               den (und Produzenten) immer schwieriger wird, einen um-
                                                                      fassenden Überblick über ein Produkt / eine Produktgrup-
                                                                      pe zu behalten. Deshalb ist unser Ziel eine feingranulare
Kategorien                                                            Zusammenfassung von Produktrezensionen, die es erlaubt
H.2.8 [Database Management]: Database Applications—                   Produkte dynamisch anhand von Produktmerkmalen (pro-
data mining; I.2.7 [Artificial Intelligence]: Natural Lan-            duct features) zu bewerten und mit ähnlichen Produkten zu
guage Processing—text analysis                                        vergleichen. Auf diese Weise wird ein Kunde in die Lage
                                                                      versetzt ein Produkt im Kontext seines eigenen Bedürfnis-
Schlüsselwörter                                                       ses zu betrachten und zu bewerten: beispielsweise spielt das
                                                                      Gewicht einer Kamera keine große Rolle für einen Kunden,
Text Mining, Review Analysis, Product Feature                         aber es wird viel Wert auf eine hohe Bildqualität gelegt.
                                                                      Produzenten können ihre eigene Produktpalette im Kontext
1.   EINLEITUNG                                                       der Konkurrenz analysieren, um z. B. Mängel an den eige-
   Der Einkauf von Waren (z. B. Kameras) und Dienstleis-              nen Produkten zu identifizieren.
tungen (z. B. Hotels) über Web-Shops wie Amazon unter-               Das Ziel unserer Forschung ist ein Gesamtsystem zur Analy-
liegt seit Jahren einem stetigen Wachstum. Web-Shops ge-              se und Präsentation von Produktrezensionen in zusammen-
ben ihren Kunden (i. d. R.) die Möglichkeit die gekaufte Wa-         gefasster Form (vgl. [3]). Dieses System besteht aus mehre-
re in Form einer Rezension zu kommentieren und zu bewer-              ren Komponenten, die verschiedene Aufgaben übernehmen,
ten. Diese kundengenerierten Rezensionen enthalten wert-              wie z.B. die Extraktion von Meinungen und die Bestimmung
volle Informationen über das Produkt, die von potentiellen           der Tonalität bezüglich eines Produktmerkmals (siehe dazu
Kunden für ihre Kaufentscheidung herangezogen werden. Je             auch Abschnitt 2). Im Rahmen dieser Arbeit beschränken
positiver ein Produkt bewertet wird, desto wahrscheinlicher           wir uns auf einen wichtigen Teilaspekt dieses Systems: die
wird es von anderen Kunden gekauft.                                   Extraktion und Anreicherung von hierarchisch organisierten
Der Kunde kann sich so ausführlicher über die Vor- und              Produktmerkmalen.
Nachteile eines Produkts informieren, als dies über redak-           Der Rest dieser Arbeit ist wie folgt gegliedert: zunächst
                                                                      geben wir in Abschnitt 2 einen Überblick über verwandte
                                                                       Arbeiten, die auf unsere Forschung entscheidenden Einfluss
                                                                      hatten. Anschließend präsentieren wir in Abschnitt 3 einen
                                                                      Algorithmus zur Extraktion und zur Anreicherung von hier-
                                                                      archisch organisierten Produktmerkmalen. Eine Bewertung
                                                                      des Algorithmus wird in Abschnitt 4 vorgenommen, sowie
                                                                      einige Ergebnisse präsentiert, die die Effektivität unseres
                                                                      Algorithmus demonstrieren. Die gewonnenen Erkenntnisse
                                                                             werden in Abschnitt 5 diskutiert und zusammengefasst. Des
Weiteren geben wir einen Ausblick auf unsere zukünftige
Forschung.

2.     VERWANDTE ARBEITEN
   Dieser Abschnitt gibt einen kurzen Überblick über ver-
wandte Arbeiten, die einen Einfluss auf unsere Forschung
hatten. Die Analyse von Produktrezensionen basiert auf Al-          Abbildung 1: Beispielhafte Merkmalshierarchie ei-
gorithmen und Methoden aus verschiedensten Disziplinen.             ner Digitalkamera.
Zu den Wichtigsten zählen: Feature Extraction, Opinion Mi-
ning und Sentiment Analysis.
Ein typischer Algorithmus zur merkmalsbasierten Tonali-             Wir haben hauptsächlich Arbeiten vorgestellt, die Merk-
tätsanalyse von Produktrezensionen ist in 3 unterschiedliche       male und Meinungen aus Produktrezensionen extrahieren,
Phasen unterteilt (vgl. [3]):                                       aber Meinungsanalysen sind auch für andere Domänen inter-
                                                                    essant: z. B. verwenden die Autoren von [7] einen von Exper-
     1. Extraktion von Produktmerkmalen.                            ten annotierten Korpus mit Nachrichten, um mit Techniken
                                                                    des maschinellen Lernens einen Klassifikator zu trainieren,
     2. Extraktion von Meinungen über Produktmerkmale.             der zwischen Aussagen (Meinungen) und Nicht-Aussagen
     3. Tonalitätsanalyse der Meinungen.                           unterscheidet. Solche Ansätze sind nicht auf die Extrakti-
                                                                    on von Produktmerkmalen angewiesen.
Man unterscheidet zwischen impliziten und expliziten Merk-
malen[3]: explizite Merkmale werden direkt im Text genannt,
implizite Merkmale müssen aus dem Kontext erschlossen
                                                                    3. ANREICHERUNG VON MERKMALS-
werden. Wir beschränken uns im Rahmen dieser Arbeit auf
die Extraktion expliziter Merkmale.                                    HIERARCHIEN
Die Autoren von [3] extrahieren häufig auftretende, explizi-           Dieser Abschnitt dient der Beschreibung eines neuen Al-
te Merkmale mit dem a-priori Algorithmus. Mit Hilfe dieser          gorithmus zur Anreicherung einer gegebenen, unvollständi-
Produktmerkmale werden Meinungen aus dem Text extra-                gen Merkmalshierarchie mit zusätzlichen Merkmalen. Die-
hiert, die sich auf ein Produktmerkmal beziehen. Die Tona-          se Merkmale werden aus unstrukturierten kundengenerier-
lität einer Meinung wird auf die Tonalität der enthaltenen        ten Produktrezensionen gewonnen, wobei versucht wird die
Adjektive zurückgeführt. Die extrahierten Merkmale werden         natürliche Ordnung der Merkmale (Unter- bzw. Obermerk-
- im Gegensatz zu unserer Arbeit - nicht hierarchisch mo-           malsbeziehung) zu beachten.
delliert.                                                           Die Merkmalshierarchie bildet die Basis für weitergehende
Es gibt auch Ansätze, die versuchen die natürliche Hierar-        Analysen, wie z.B. die gezielte Extraktion von Meinungen
chie von Produktmerkmalen abzubilden. Die Autoren von [1]           und Tonalitäten, die sich auf Produktmerkmale beziehen.
nutzen die tabellarische Struktur von Produktbeschreibun-           Diese nachfolgenden Analyseschritte sind nicht mehr Gegen-
gen aus, um explizite Produktmerkmale zu extrahieren, wo-           stand dieser Arbeit. Produkte (aber auch Dienstleistungen)
bei die hierarchische Struktur aus der Tabellenstruktur ab-         können durch eine Menge von Merkmalen (product features)
geleitet wird. Einen ähnlichen Ansatz verfolgen [5] et. al.: die   beschrieben werden. Produktmerkmale folgen dabei einer
Autoren nutzen ebenfalls die oftmals hochgradige Struktu-           natürlichen, domänenabhängigen Ordnung. Eine derartige
rierung von Produktbeschreibungen aus. Die Produktmerk-             natürliche Hierarchie ist exemplarisch in Abbildung 1 für
male werden mit Clusteringtechniken aus einem Korpus ex-            das Produkt Digitalkamera dargestellt. Offensichtlich ist
trahiert, wobei die Hierarchie der Merkmale durch das Clus-         Display ein Untermerkmal von Digitalkamera und besitzt
tering vorgegeben wird. Die Extraktion von expliziten Merk-         eigene Untermerkmale Auflösung und Farbtemperatur.
malen aus strukturierten Texten ist (i. d. R.) einfacher, als       Hierarchien von Produktmerkmalen können auf Basis von
durch Analyse unstrukturierter Daten.                               strukturierten Texten erzeugt werden, wie z. B. technische
Die Methode von [2] benutzt eine Taxonomie zur Ab-          nauigkeit (≈ 71% [5]). Allerdings tendieren Datenblätter
bildung der Merkmalshierarchie, wobei diese von einem Ex-           se Datenquellen enthalten i. d. R. die wichtigsten Produkt-
perten erstellt wird. Diese Hierarchie bildet die Grundlage         merkmale. Der hohe Strukturierungsgrad dieser Datenquel-
für die Meinungsextraktion. Die Tonalität der Meinungen           len erlaubt eine Extraktion der Merkmale mit hoher Ge-
wird über ein Tonalitätswörterbuch gelöst. Für diesen An-      nauigkeit (≈ 71% [5]). Allerdings tendieren Datenblätter
satz wird - im Gegensatz zu unserer Methode - umfangrei-            und Produktbeschreibungen dazu, ein Produkt relativ ober-
ches Expertenwissen benötigt.                                      flächlich darzustellen oder zu Gunsten des Produkts zu ver-
Die Arbeit von [8] konzentriert sich auf die Extrakti-      flächlich darzustellen oder zu Gunsten des Produkts zu ver-
on von Meinungen und die anschließende Tonalitätsanalyse.          1 eine Reihe von Merkmalen, wie sie häufig in strukturier-
Die Autoren unterscheiden zwischen subjektiven und kom-             ten Datenquellen zu finden sind (helle Knoten). Allerdings
parativen Sätze. Sowohl subjektive, als auch komparative           sind weitere, detailliertere Merkmale denkbar, die für eine
Sätze enthalten Meinungen, wobei im komparativen Fall ei-          Kaufentscheidung von Interesse sein könnten. Beispielsweise
ne Meinung nicht direkt gegeben wird, sondern über einen           könnte das Display einer Digitalkamera zur Fleckenbildung
Vergleich mit einem anderen Produkt erfolgt. Die Autoren            am unteren/oberen Rand neigen. Unterer/Oberer Rand
nutzen komparative Sätze, um Produktgraphen zu erzeu-              wird in diesem Fall zu einem Untermerkmal von Display
gen mit deren Hilfe verschiedene Produkte hinsichtlich eines        und Obermerkmal von Fleckenbildung (dunkle Knoten).
Merkmals geordnet werden können. Die notwendigen Tona-             Eine derartige Anreicherung einer gegebenen, unvollständi-
litätswerte werden einem Wörterbuch entnommen.                    gen Merkmalshierarchie kann durch die Verarbeitung von
kundengenerierten, unstrukturierten Rezensionen erfolgen. Wir halten einen hybriden Ansatz für durchaus sinnvoll: zunächst wird eine initiale Merkmalshierarchie mit hoher Genauigkeit aus strukturierten Daten gewonnen. Anschließend wird diese Hierarchie in einer zweiten Verarbeitungsphase mit zusätzlichen Produktmerkmalen angereichert.
Für den weiteren Verlauf dieses Abschnitts beschränken wir uns auf die zweite Analysephase, d.h. wir nehmen eine initiale Merkmalshierarchie als gegeben an. Für die Evaluation unseres Algorithmus (siehe Abschnitt 4) wurden die initialen Merkmalshierarchien manuell erzeugt.
Unser Algorithmus wurde auf der Basis einer Reihe von einfachen Beobachtungen entworfen, die wir bei der Analyse unseres Rezensionskorpus gemacht haben.

   1. Ein Produktmerkmal wird häufig durch ein Hauptwort repräsentiert.

   2. Viele Hauptwörter können dasselbe Produktmerkmal beschreiben. (Synonyme)

   3. Untermerkmale werden häufig im Kontext ihrer Obermerkmale genannt, wie z. B. "das Ladegerät der Kamera".

   4. Textfragmente, die von Produktmerkmalen handeln, besitzen häufig eine sehr ähnliche grammatikalische Struktur, wie z.B. "die Auflösung der Anzeige" oder "die Laufzeit des Akkus", wobei Unter- und Obermerkmale gemeinsam genannt werden. Die Struktur der Fragmente lautet [DET, NOUN, DET, NOUN], wobei DET einen Artikel und NOUN ein Hauptwort beschreibt.

Der Rest dieses Abschnitts gliedert sich wie folgt: Zunächst werden Definitionen in Unterabschnitt 3.1 eingeführt, die für das weitere Verständnis notwendig sind. Anschließend beschreiben wir in Unterabschnitt 3.2 unsere Analysepipeline, die für die Vorverarbeitung der Produktrezensionen verwendet wurde. Darauf aufbauend wird in Unterabschnitt 3.3 unser Algorithmus im Detail besprochen.

3.1 Definitionen
   Für das Verständnis der nächsten Abschnitte werden einige Begriffe benötigt, die in diesem Unterabschnitt definiert werden sollen:

Token. Ein Token t ist ein Paar t = (v_word, v_POS), wobei v_word das Wort und v_POS die Wortart angibt. Im Rahmen dieser Arbeit wurde das Universal Tagset [6] benutzt.

Merkmal. Wir definieren ein Produktmerkmal f als ein Tripel f = (S, C, p), wobei S eine Menge von Synonymen beschreibt, die als textuelle Realisierung eines Merkmals Verwendung finden können. Die Elemente von S können Worte, Produktbezeichnungen und auch Abkürzungen enthalten. Die Hierarchie wird über C und p kontrolliert, wobei C eine Menge von Untermerkmalen und p das Obermerkmal von f angibt. Das Wurzelelement einer Hierarchie beschreibt das Produkt/die Produktgruppe selbst und besitzt kein Obermerkmal.

POS-Muster. Ein POS-Muster q ist eine geordnete Sequenz von POS-Tags q = [tag1, tag2, ..., tagn], wobei n die Musterlänge beschreibt. Ein POS-Tag beschreibt eine Wortart, z.B. steht DET für einen Artikel, NOUN für ein Hauptwort und ADJ für ein Adjektiv. Weitere Informationen über das Universal Tagset finden sich in [6].

3.2 Analysepipeline
   Für die Verarbeitung und Untersuchung der Produktrezensionen haben wir eine für den NLP-Bereich (Natural Language Processing) typische Standardpipeline benutzt: Die Volltexte der Rezensionen sind für unsere Zwecke zu grobgranular, so dass in einer ersten Phase der Volltext in Sätze zerteilt wird. Anschließend werden die Sätze tokenisiert und die Wortarten der einzelnen Worte bestimmt. Des Weiteren werden Stoppworte markiert - dafür werden Standard-Stoppwortlisten benutzt. Wir beenden die Analysepipeline mit einer Stammformreduktion für jedes Wort, um die verschiedenen Flexionsformen eines Wortes auf eine kanonische Basis zu bringen.
   Für die Bestimmung zusätzlicher Produktmerkmale aus Produktrezensionen sind vor allem Hauptworte interessant, die i. d. R. keine Stoppworte sind. Allerdings ist uns aufgefallen, dass überdurchschnittlich viele Worte fälschlicherweise als Hauptwort erkannt werden - viele dieser Worte sind Stoppworte. Wir nehmen an, dass die variierende grammatikalische Qualität der Produktrezensionen für die hohe Anzahl falsch bestimmter Worte verantwortlich ist. Die Stoppwortmarkierung hilft dabei, diesen Fehler etwas auszugleichen.
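Eine solche Standardpipeline ließe sich z. B. mit NLTK skizzieren; das folgende Beispiel ist eine vereinfachte Illustration und nicht der in der Arbeit verwendete Code. Insbesondere ist der mitgelieferte POS-Tagger von NLTK für Englisch trainiert; für deutsche Rezensionen wäre ein deutsches Modell mit Universal Tagset einzusetzen.

# Vereinfachte Vorverarbeitungspipeline (Illustration, nicht der Originalcode).
# Benötigte NLTK-Ressourcen: punkt, stopwords, averaged_perceptron_tagger,
# universal_tagset (per nltk.download(...) nachladbar).
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = set(stopwords.words("german"))
STEMMER = SnowballStemmer("german")

def preprocess(volltext):
    """Zerlegt eine Rezension in Sätze aus Token-Tupeln
    (wort, wortart, ist_stoppwort, stammform)."""
    saetze = []
    for satz in nltk.sent_tokenize(volltext, language="german"):
        woerter = nltk.word_tokenize(satz, language="german")
        # Achtung: nltk.pos_tag ist für Englisch trainiert und dient hier nur
        # als Platzhalter für einen deutschen Tagger mit Universal Tagset.
        getaggt = nltk.pos_tag(woerter, tagset="universal")
        saetze.append([
            (wort, tag, wort.lower() in STOPWORDS, STEMMER.stem(wort))
            for wort, tag in getaggt
        ])
    return saetze

print(preprocess("Die Auflösung der Anzeige ist hervorragend."))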
3.3 Der Algorithmus
   In diesem Abschnitt beschreiben wir einen neuen Algorithmus, um eine initiale Hierarchie von Produktmerkmalen mit zusätzlichen Merkmalen anzureichern, wobei die natürliche Ordnung der Merkmale erhalten bleibt (siehe Algorithmus 1). Der Algorithmus erwartet 3 Parameter: eine 2-dimensionale Liste von Token T, die sämtliche Token für jeden Satz enthält (dabei beschreibt die erste Dimension die Sätze, die zweite Dimension die einzelnen Wörter), eine initiale Hierarchie von Merkmalen f und eine Menge von POS-Mustern P. Da der Algorithmus rekursiv arbeitet, wird zusätzlich ein Parameter d übergeben, der die maximale Rekursionstiefe angibt. Der Algorithmus bricht ab, sobald die vorgegebene Tiefe erreicht wird (Zeile 1-3).

Kandidatensuche (Zeile 4-11). Um geeignete Kandidaten für neue Produktmerkmale zu finden, werden alle Sätze betrachtet und jeweils entschieden, ob der Satz eine Realisierung des aktuell betrachteten Merkmals enthält oder nicht. Wenn ein Satz eine Realisierung hat, dann wird die Funktion applyPattern aufgerufen. Diese Funktion sucht im übergebenen Satz nach gegebenen POS-Mustern und gibt – sofern mindestens ein Muster anwendbar ist – die entsprechenden Token als Kandidat zurück, wobei die Mustersuche auf das unmittelbare Umfeld der gefundenen Realisierung eingeschränkt wird, damit das korrekte POS-Muster zurückgeliefert wird, da POS-Muster mehrfach innerhalb eines Satzes vorkommen können.
Im Rahmen dieser Arbeit haben wir die folgenden POS-Muster verwendet:

   • [DET, NOUN, DET, NOUN]

   • [DET, NOUN, VERB, DET, ADJ, NOUN]
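Die in der Kandidatensuche verwendete Mustersuche (applyPattern) lässt sich etwa wie folgt andeuten; die Darstellung der Token als Paare (wort, wortart) folgt Unterabschnitt 3.1, die Fenstergröße um die gefundene Realisierung ist eine frei gewählte Annahme.

# Illustrative Skizze der Mustersuche aus der Kandidatensuche (Zeile 4-11);
# Token werden als Paare (wort, wortart) dargestellt, Muster als Listen von
# POS-Tags. Die Fenstergröße um die gefundene Realisierung ist eine Annahme.
PATTERNS = [
    ["DET", "NOUN", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
]

def apply_pattern(satz, patterns, realisierung_pos, fenster=3):
    """Liefert die Token aller Muster, die im unmittelbaren Umfeld der
    Realisierung (Position realisierung_pos im Satz) vorkommen."""
    kandidaten = []
    start = max(0, realisierung_pos - fenster)
    ende = min(len(satz), realisierung_pos + fenster + 1)
    for muster in patterns:
        for i in range(start, ende - len(muster) + 1):
            fenster_tokens = satz[i:i + len(muster)]
            if [tag for _, tag in fenster_tokens] == muster:
                kandidaten.append(fenster_tokens)
    return kandidaten

satz = [("die", "DET"), ("Auflösung", "NOUN"), ("der", "DET"), ("Anzeige", "NOUN")]
print(apply_pattern(satz, PATTERNS, realisierung_pos=3))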
 Algorithm 1: refineHierarchy

      Eingabe : T : Eine 2-dimensionale Liste von Token.
      Eingabe : P : Ein Array von POS-Mustern.
      Eingabe : f : Eine initiale Merkmalshierarchie.
      Eingabe : d : Die maximale Rekursionstiefe.
      Ausgabe: Das Wurzelmerkmal der angereicherten Hierarchie.
   1 if d = 0 then
   2     return f
   3 end
   4 C ← {} ;
   5 for Token[] T′ ∈ T do
   6     for Token t ∈ T′ do
   7         if t.word ∈ f.S then
   8             C ← C ∪ applyPattern(T′, P) ;
   9         end
  10     end
  11 end
  12 for Token[] C′ ∈ C do
  13     for Token t ∈ C′ do
  14         if t.pos ≠ NOUN then
  15             next ;
  16         end
  17         if t.length ≤ 3 then
  18             next ;
  19         end
  20         if hasParent(t.word, f ) then
  21             next ;
  22         end
  23         if isSynonym(t.word, f.S) then
  24             f.S ← f.S ∪ {t.word} ;
  25             next ;
  26         end
  27         f.C ← f.C ∪ {({t.word}, {}, f )} ;
  28     end
  29 end
  30 for Feature f′ ∈ f.C do
  31     refineHierarchy(T, f′, P, d − 1) ;
  32 end
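Zur Verdeutlichung folgt eine mögliche, stark vereinfachte Umsetzung von Algorithmus 1 in Python; die Hilfsfunktionen apply_pattern (siehe Skizze oben), has_parent und is_synonym werden als gegeben angenommen, alle übrigen Namen sind illustrativ.

# Vereinfachte, illustrative Umsetzung von Algorithmus 1 (refineHierarchy).
# Ein Merkmal f wird als Objekt mit den Attributen S (Synonyme), C (Unter-
# merkmale) und p (Obermerkmal) angenommen; Token als Paare (wort, wortart).

class Merkmal:
    def __init__(self, synonyme, obermerkmal=None):
        self.S = set(synonyme)   # textuelle Realisierungen
        self.C = []              # Untermerkmale
        self.p = obermerkmal     # Obermerkmal (None beim Wurzelelement)

def refine_hierarchy(T, f, P, d, apply_pattern, has_parent, is_synonym):
    if d == 0:                                       # Zeile 1-3
        return f
    kandidaten = []                                  # Zeile 4
    for satz in T:                                   # Zeile 5-11: Kandidatensuche
        for i, (wort, _tag) in enumerate(satz):
            if wort in f.S:
                kandidaten.extend(apply_pattern(satz, P, i))
    for kandidat in kandidaten:                      # Zeile 12-29: Validierung
        for wort, tag in kandidat:
            if tag != "NOUN":                        # Heuristik 1
                continue
            if len(wort) <= 3:                       # Heuristik 2
                continue
            if has_parent(wort, f):                  # Heuristik 3 (keine Kreise)
                continue
            if is_synonym(wort, f.S):                # Heuristik 4 (neues Synonym)
                f.S.add(wort)
                continue
            f.C.append(Merkmal({wort}, obermerkmal=f))   # Zeile 27
    for untermerkmal in f.C:                         # Zeile 30-32: Rekursion
        refine_hierarchy(T, untermerkmal, P, d - 1,
                         apply_pattern, has_parent, is_synonym)
    return f

Die Hilfsfunktionen has_parent und is_synonym kapseln dabei die 3. bzw. 4. Heuristik der im Folgenden beschriebenen Validierungsphase.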
Validierungsphase (Zeile 12-29). Die Validierungsphase dient dazu, die gefundenen Kandidaten zu validieren, also zu entscheiden, ob ein Kandidat ein neues Merkmal enthält. Man beachte, dass es sich bei diesem neuen Merkmal um ein Untermerkmal des aktuellen Produktmerkmals handelt, sofern es existiert. Für die Entscheidungsfindung nutzen wir eine Reihe von einfachen Heuristiken. Ein Token t ist kein Produktmerkmal und wird übergangen, falls t.v_word:

   1. kein Hauptwort ist (Zeile 14-16).

   2. keine ausreichende Länge besitzt (Zeile 17-19).

   3. ein Synonym von f (oder eines Obermerkmals von f) ist (Zeile 20-22).

   4. ein neues Synonym von f darstellt (Zeile 23-26).

   Die 3. Heuristik stellt sicher, dass sich keine Kreise in der Hierarchie bilden können. Man beachte, dass Obermerkmale, die nicht direkt voneinander abhängen, gleiche Untermerkmale tragen können.
Die 4. Heuristik dient zum Lernen von vorher unbekannten Synonymen. Dazu wird das Wort mit den Synonymen von f verglichen (z.B. mit der Levenshtein-Distanz) und als Synonym aufgenommen, falls eine ausreichende Ähnlichkeit besteht. Damit soll verhindert werden, dass die falsche Schreibweise eines eigentlich bekannten Merkmals dazu führt, dass ein neuer Knoten in die Hierarchie eingefügt wird.
Wenn der Token t die Heuristiken erfolgreich passiert hat, dann wird t zu einem neuen Untermerkmal von f (Zeile 27).

Rekursiver Aufruf (Zeile 30-32). Nachdem das Merkmal f nun mit zusätzlichen Merkmalen angereichert wurde, wird der Algorithmus rekursiv für alle Untermerkmale von f aufgerufen, um diese mit weiteren Merkmalen zu versehen. Dieser Vorgang wiederholt sich so lange, bis die maximale Rekursionstiefe erreicht wird.

Nachbearbeitungsphase. Die Hierarchie, die von Algorithmus 1 erweitert wurde, muss in einer Nachbearbeitungsphase bereinigt werden, da viele Merkmale enthalten sind, die keine realen Produktmerkmale beschreiben (Rauschen). Für diese Arbeit verwenden wir die relative Häufigkeit eines Untermerkmals im Kontext seines Obermerkmals, um niederfrequente Merkmale (samt Untermerkmalen) aus der Hierarchie zu entfernen. Es sind aber auch andere Methoden denkbar, wie z.B. eine Gewichtung nach tf-idf [4]. Dabei wird nicht nur die Termhäufigkeit (tf) betrachtet, sondern auch die inverse Dokumenthäufigkeit (idf) mit einbezogen. Der idf eines Terms beschreibt die Bedeutsamkeit des Terms in Bezug auf die gesamte Dokumentenmenge.

4. EVALUATION
   In diesem Abschnitt diskutieren wir die Vor- und Nachteile unseres Algorithmus. Um unseren Algorithmus evaluieren zu können, haben wir einen geeigneten Korpus aus Kundenrezensionen zusammengestellt. Unser Korpus besteht aus 4000 Kundenrezensionen von amazon.de aus der Produktgruppe Digitalkamera.
   Wir haben unseren Algorithmus für die genannte Produktgruppe eine Hierarchie anreichern lassen. Die initiale Produkthierarchie enthält ein Obermerkmal, welches die Produktgruppe beschreibt. Zudem wurden häufig gebrauchte Synonyme hinzugefügt, wie z.B. Gerät. Im Weiteren präsentieren wir exemplarisch die angereicherte Hierarchie. Für dieses Experiment wurde die Rekursionstiefe auf 3 gesetzt, niederfrequente Merkmale (relative Häufigkeit < 0,002) wurden eliminiert. Wir haben für diese Arbeit Rezensionen in deutscher Sprache verwendet, aber der Algorithmus kann leicht auf andere Sprachen angepasst werden. Die erzeugte Hierarchie ist in Abbildung 2 dargestellt. Es zeigt sich, dass unser Algorithmus – unter Beachtung der hierarchischen Struktur – eine Reihe wertvoller Merkmale extrahieren konnte: z. B. Batterie mit seinen Untermerkmalen Haltezeit und Verbrauch oder Akkus mit den Untermerkmalen Auflad und Zukauf. Es wurden aber auch viele Merkmale aus den Rezensionen extrahiert, die entweder keine echten Produktmerkmale sind (z.B. Kompakt) oder eine falsche Ober-Untermerkmalsbeziehung abbilden (z. B. Haptik und Kamera). Des Weiteren sind einige Merkmale, wie z. B. Qualität, zu generisch und sollten nicht als Produktmerkmal benutzt werden.
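Die 4. Heuristik (Synonymerkennung über eine Editierdistanz) und die Bereinigung niederfrequenter Merkmale aus der Nachbearbeitungsphase lassen sich beispielsweise so andeuten; die Schwellwerte sind frei gewählte Annahmen, die Merkmalsdarstellung entspricht der Skizze oben.

# Illustrative Skizze zur 4. Heuristik (Synonymerkennung über die
# Levenshtein-Distanz) und zur Bereinigung niederfrequenter Merkmale in der
# Nachbearbeitungsphase; die Schwellwerte sind frei gewählte Annahmen.

def levenshtein(a, b):
    """Klassische Editierdistanz (dynamische Programmierung)."""
    zeile = list(range(len(b) + 1))
    for i, za in enumerate(a, 1):
        neue_zeile = [i]
        for j, zb in enumerate(b, 1):
            neue_zeile.append(min(zeile[j] + 1,                # Löschen
                                  neue_zeile[j - 1] + 1,       # Einfügen
                                  zeile[j - 1] + (za != zb)))  # Ersetzen
        zeile = neue_zeile
    return zeile[-1]

def ist_synonym(wort, synonyme, max_distanz=2):
    """4. Heuristik: ausreichende Ähnlichkeit zu einem bekannten Synonym."""
    return any(levenshtein(wort.lower(), s.lower()) <= max_distanz
               for s in synonyme)

def bereinige(merkmal, haeufigkeit, min_rel_haeufigkeit=0.002):
    """Entfernt niederfrequente Untermerkmale (samt Untermerkmalen);
    haeufigkeit(kind, eltern) liefert die relative Häufigkeit des
    Untermerkmals im Kontext seines Obermerkmals."""
    merkmal.C = [kind for kind in merkmal.C
                 if haeufigkeit(kind, merkmal) >= min_rel_haeufigkeit]
    for kind in merkmal.C:
        bereinige(kind, haeufigkeit, min_rel_haeufigkeit)

print(ist_synonym("Akus", {"Akku", "Akkus"}))   # True: Tippfehler wird als Synonym erkannt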
Abbildung 2: Angereicherte Hierarchie für die Produktgruppe Digitalkamera.

5. RESÜMEE UND AUSBLICK
   In dieser Arbeit wurde ein neuer Algorithmus vorgestellt, der auf Basis einer gegebenen – möglicherweise flachen – Merkmalshierarchie diese Hierarchie mit zusätzlichen Merkmalen anreichert. Die neuen Merkmale werden automatisch aus unstrukturierten Produktrezensionen gewonnen, wobei der Algorithmus versucht, die natürliche Ordnung der Produktmerkmale zu beachten.
   Wir konnten zeigen, dass unser Algorithmus eine initiale Merkmalshierarchie mit sinnvollen Untermerkmalen anreichern kann, allerdings werden auch viele falsche Merkmale extrahiert und in fehlerhafte Merkmalsbeziehungen gebracht. Wir halten unseren Algorithmus dennoch für vielversprechend. Unsere weitere Forschung wird sich auf Teilaspekte dieser Arbeit konzentrieren:

   • Die Merkmalsextraktion muss verbessert werden: Wir haben beobachtet, dass eine Reihe extrahierter Merkmale keine echten Produktmerkmale beschreiben. Dabei handelt es sich häufig um sehr allgemeine Wörter wie z.B. Möglichkeiten. Wir bereiten deshalb den Aufbau einer Stoppwortliste für Produktrezensionen vor. Auf diese Weise könnte diese Problematik abgeschwächt werden.

   • Des Weiteren enthalten die angereicherten Hierarchien teilweise Merkmale, die in einer falschen Beziehung zueinander stehen: z.B. induzieren die Merkmale Akku und Akku-Ladegerät eine Ober-Untermerkmalsbeziehung; Akku kann als Obermerkmal von Ladegerät betrachtet werden. Außerdem konnte beobachtet werden, dass einige Merkmalsbeziehungen alternieren: z.B. existieren 2 Merkmale Taste und Druckpunkt in wechselnder Ober-Untermerkmalsbeziehung.

   • Der Algorithmus benötigt POS-Muster, um Untermerkmale in Sätzen zu finden. Für diese Arbeit wurden die verwendeten POS-Muster manuell konstruiert, aber wir planen, die Konstruktion der POS-Muster weitestgehend zu automatisieren. Dazu ist eine umfangreiche Analyse eines großen Korpus notwendig.

   • Die Bereinigung der erzeugten Hierarchien ist unzureichend - die relative Häufigkeit eines Merkmals reicht als Gewichtung für unsere Zwecke nicht aus. Aus diesem Grund möchten wir mit anderen Gewichtungsmaßen experimentieren.

   • Die Experimente in dieser Arbeit sind sehr einfach gestaltet. Eine sinnvolle Evaluation ist (z. Zt.) nicht möglich, da (unseres Wissens nach) kein geeigneter Testkorpus mit annotierten Merkmalshierarchien existiert. Die Konstruktion eines derartigen Korpus ist geplant.

   • Des Weiteren sind weitere Experimente geplant, um den Effekt der initialen Merkmalshierarchie auf den Algorithmus zu evaluieren. Diese Versuchsreihe umfasst Experimente mit mehrstufigen, initialen Merkmalshierarchien, die sowohl manuell als auch automatisch erzeugt wurden.

   • Abschließend planen wir, unseren Algorithmus zur Extraktion von Produktmerkmalen in einem Gesamtsystem zur automatischen Zusammenfassung und Analyse von Produktrezensionen einzusetzen.
6.   REFERENZEN
[1] M. Acher, A. Cleve, G. Perrouin, P. Heymans,
    C. Vanbeneden, P. Collet, and P. Lahire. On extracting
    feature models from product descriptions. In
    Proceedings of the Sixth International Workshop on
    Variability Modeling of Software-Intensive Systems,
    VaMoS ’12, pages 45–54, New York, NY, USA, 2012.
    ACM.
[2] F. L. Cruz, J. A. Troyano, F. Enríquez, F. J. Ortega,
    and C. G. Vallejo. A knowledge-rich approach to
    feature-based opinion extraction from product reviews.
    In Proceedings of the 2nd international workshop on
    Search and mining user-generated contents, SMUC ’10,
    pages 13–20, New York, NY, USA, 2010. ACM.
[3] M. Hu and B. Liu. Mining and summarizing customer
    reviews. In Proceedings of the tenth ACM SIGKDD
    international conference on Knowledge discovery and
    data mining, KDD ’04, pages 168–177, New York, NY,
    USA, 2004. ACM.
[4] K. S. Jones. A statistical interpretation of term
    specificity and its application in retrieval. Journal of
    Documentation, 28:11–21, 1972.
[5] X. Meng and H. Wang. Mining user reviews: From
    specification to summarization. In Proceedings of the
    ACL-IJCNLP 2009 Conference Short Papers,
    ACLShort ’09, pages 177–180, Stroudsburg, PA, USA,
    2009. Association for Computational Linguistics.
[6] S. Petrov, D. Das, and R. McDonald. A universal
    part-of-speech tagset. In N. Calzolari (Conference Chair), K. Choukri,
    T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani,
    J. Odijk, and S. Piperidis, editors, Proceedings of the
    Eighth International Conference on Language Resources
    and Evaluation (LREC'12), Istanbul, Turkey, May 2012.
    European Language Resources Association (ELRA).
[7] T. Scholz and S. Conrad. Extraction of statements in
    news for a media response analysis. In Proc. of the 18th
    Intl. conf. on Applications of Natural Language
    Processing to Information Systems 2013 (NLDB 2013),
    2013. (to appear).
[8] K. Zhang, R. Narayanan, and A. Choudhary. Voice of
    the customers: Mining online customer reviews for
    product feature-based ranking. In Proceedings of the 3rd
    conference on Online social networks, WOSN’10, pages
    11–11, Berkeley, CA, USA, 2010. USENIX Association.
        Ein Verfahren zur Beschleunigung eines neuronalen
           Netzes für die Verwendung im Image Retrieval

                                                            Daniel Braun
                                                     Heinrich-Heine-Universität
                                                       Institut für Informatik
                                                         Universitätsstr. 1
                                                   D-40225 Düsseldorf, Germany
                                                braun@cs.uni-duesseldorf.de

ABSTRACT                                                              fikator eine untergeordnete Rolle, da man die Berechnung
Künstliche neuronale Netze haben sich für die Mustererken-          vor der eigentlichen Anwendung ausführt. Will man aller-
nung als geeignetes Mittel erwiesen. Deshalb sollen ver-              dings auch während der Nutzung des Systems weiter ler-
schiedene neuronale Netze verwendet werden, um die für               nen, so sollten die benötigten Rechnungen möglichst wenig
ein bestimmtes Objekt wichtigen Merkmale zu identifizier-             Zeit verbrauchen, da der Nutzer ansonsten entweder auf die
en. Dafür werden die vorhandenen Merkmale als erstes                 Berechnung warten muss oder das Ergebnis, dass ihm aus-
durch ein Art2-a System kategorisiert. Damit die Kategorien           gegeben wird, berücksichtigt nicht die durch ihn hinzuge-
verschiedener Objekte sich möglichst wenig überschneiden,           fügten Daten.
muss bei deren Berechnung eine hohe Genauigkeit erzielt                  Für ein fortlaufendes Lernen bieten sich künstliche neu-
werden. Dabei zeigte sich, dass das Art2 System, wie auch             ronale Netze an, da sie so ausgelegt sind, dass jeder neue
die Art2-a Variante, bei steigender Anzahl an Kategorien              Input eine Veränderung des Gedächtnisses des Netzes nach
schnell zu langsam wird, um es im Live-Betrieb verwen-                sich ziehen kann. Solche Netze erfreuen sich, bedingt durch
den zu können. Deshalb wird in dieser Arbeit eine Opti-              die sich in den letzten Jahren häufenden erfolgreichen An-
mierung des Systems vorgestellt, welche durch Abschätzung            wendungen - zum Beispiel in der Mustererkennung - einer
des von dem Art2-a System benutzten Winkels die Anzahl                 steigenden Beliebtheit in verschiedensten Einsatzgebieten,
der möglichen Kategorien für einen Eingabevektor stark ein-         wie zum Beispiel auch im Image Retrieval.
schränkt. Des Weiteren wird eine darauf aufbauende In-                  Der geplante Systemaufbau sieht dabei wie folgt aus: die
dexierung der Knoten angegeben, die potentiell den Speich-            Merkmalsvektoren eines Bildes werden nacheinander einer
erbedarf für die zu überprüfenden Vektoren reduzieren kann.        Clustereinheit übergeben, welche die Merkmalsvektoren clus-
Wie sich in den durchgeführten Tests zeigte, kann die vorge-         tert und die Kategorien der in dem Bild vorkommenden
stellte Abschätzung die Bearbeitungszeit für kleine Cluster-        Merkmale berechnet. Das Clustering der Clustereinheit pas-
radien stark reduzieren.                                              siert dabei fortlaufend. Das bedeutet, dass die einmal be-
                                                                      rechneten Cluster für alle weiteren Bilder verwendet werden.
                                                                      Danach werden die für das Bild gefundenen Kategorien von
Kategorien                                                            Merkmalen an die Analyseeinheit übergeben, in der versucht
H.3.3 [Information Storage and Retrieval]: Information                wird, die für ein bestimmtes Objekt wichtigen Kategorien zu
Search and Retrieval—Clustering; F.1.1 [Computation by                identifizieren. Die dort gefundenen Kategorien werden dann
Abstract Devices]: Models of Computation—Neural Net-                  für die Suche dieser Objekte in anderen Bildern verwendet.
work                                                                  Das Ziel ist es dabei, die Analyseeinheit so zu gestalten,
                                                                      dass sie nach einem initialen Training weiter lernt und so
Schlüsselwörter                                                       neue Merkmale eines Objektes identifizieren soll.
                                                                         Für die Analyseeinheit ist die Verwendung verschiedener
Neuronale Netze, Clustering, Image Retrieval                          neuronaler Netze geplant. Da sie aber für die vorgenomme-
                                                                      nen Optimierungen irrelevant ist, wird im Folgenden nicht
1. EINLEITUNG                                                         weiter auf sie eingegangen.
  Trainiert man ein Retrieval System mit einem festen Kor-               Von dem Clusteringverfahren für die Clustereinheit wer-
pus und wendet die berechneten Daten danach unverän-                 den dabei die folgenden Punkte gefordert:
dert an, so spielt die Berechnungsdauer für einen Klassi-
                                                                         • Das Clustering soll nicht überwacht funktionieren. Das
                                                                           bedeutet, dass es keine Zielvorgabe für die Anzahl der
                                                                           Cluster geben soll. Das System soll also auch bei einem
                                                                           bestehenden Clustering für einen neuen Eingabevektor
                                                                           erkennen, ob er einem Cluster zugewiesen werden kann
                                                                           oder ob ein neuer Cluster erstellt werden muss.

  ten gehören und nicht die Vektoren anderer Objekte die Kategorie verschmutzen.

• Das Clusteringverfahren sollte auch bei einer hohen Anzahl an Clustern, die aus der gewünschten
  hohen Genauigkeit der einzelnen Cluster resultiert, schnell berechnet werden können. Eine einfache
  Veranschaulichung dieser Anforderungen zeigt die Skizze im Anschluss an diese Liste.
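Zur Veranschaulichung dieser Anforderungen skizziert der folgende Python-Ausschnitt ein minimales,
schwellwertbasiertes inkrementelles Clustering (Zuordnen oder Neuanlegen). Es handelt sich ausdrücklich
nicht um das später verwendete Art2-a Netz, sondern nur um eine vereinfachte Illustration; Namen wie
zuordnen_oder_anlegen sind frei gewählt.

    import math

    def kosinus(a, b):
        # Kosinus-Aehnlichkeit zweier Vektoren
        skalar = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return skalar / norm

    def zuordnen_oder_anlegen(vektor, zentren, schwelle=0.98):
        # Unueberwachtes, fortlaufendes Clustering mit begrenzter Ausdehnung:
        # liegt kein Zentrum nahe genug am Eingabevektor, wird ein neuer
        # Cluster angelegt (keine feste Clusteranzahl vorgegeben).
        bester, beste_aehnlichkeit = None, -1.0
        for idx, z in enumerate(zentren):
            s = kosinus(vektor, z)
            if s > beste_aehnlichkeit:
                bester, beste_aehnlichkeit = idx, s
        if bester is None or beste_aehnlichkeit < schwelle:
            zentren.append(list(vektor))      # neue Kategorie
            return len(zentren) - 1
        return bester                          # vorhandene Kategorie

    # Beispiel: drei Vektoren, zwei davon sehr aehnlich
    zentren = []
    for v in ([1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]):
        print(zuordnen_oder_anlegen(v, zentren, schwelle=0.95))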

In dieser Arbeit wird ein Adaptive Resonance Theory Netz [5] verwendet, genauer ein Art2 Netz [1],
da es die beiden ersten Bedingungen erfüllt. Denn dieses neuronale Netz führt ein nicht überwachtes
Clustering aus, wobei es mit jedem Eingabevektor weiter lernt und gegebenenfalls neue Cluster
erschafft. Der genaue Aufbau dieses Systems wird in Kapitel 3 genauer dargestellt.
Zur Beschreibung des Bildes dienen SIFT Deskriptoren [9, 10], welche mit 128 Dimensionen einen sehr
großen Raum für mögliche Kategorien aufspannen. Dadurch wächst die Knotenanzahl innerhalb des Art2
Netzes rapide an, was zu einer Verlangsamung des Netzes führt. Deshalb wird die Art2-a Variante [2]
verwendet, welche das Verhalten des Art2 Systems approximiert. Dieses System hat die Vorteile, dass
es zum einen im Vergleich zu Art2 um mehrere Größenordnungen schneller ist und sich zum anderen
gleichzeitig auch noch größtenteils parallelisieren lässt, wodurch ein weiterer Geschwindigkeitsgewinn
erzielt werden kann.
Dennoch zeigt sich, dass durch die hohe Dimension des Vektors die für die Berechnung der Kategorie
benötigten Skalarprodukte, unter Berücksichtigung der hohen Anzahl an Knoten, weiterhin sehr
rechenintensiv sind. Dadurch steigt auch bei starker Parallelisierung, sofern die maximale Anzahl
paralleler Prozesse begrenzt ist, die Zeit für die Bearbeitung eines neuen Vektors mit fortlaufendem
Training kontinuierlich an. Aus diesem Grund wird in Kapitel 4 eine Erweiterung des Systems
vorgestellt, die die Menge der Kandidaten möglicher Gewinnerknoten schon vor der teuren Berechnung
des Skalarproduktes verkleinert.
Der weitere Aufbau dieser Arbeit sieht dabei wie folgt aus: In dem folgenden Kapitel 2 werden einige
ausgewählte Ansätze aus der Literatur genannt, in denen neuronale Netze für das Image Retrieval
verwendet werden. Um die Plausibilität der Erweiterung zu verstehen, werden dann in Kapitel 3 die
dafür benötigten Mechanismen und Formeln eines Art2-a Systems vorgestellt. Kapitel 4 fokussiert sich
danach auf die vorgeschlagene Erweiterung des bekannten Systems. In Kapitel 5 wird die Erweiterung
evaluiert, um danach in dem folgenden Kapitel eine Zusammenfassung des Gezeigten sowie einen Ausblick
zu geben.

2. VERWANDTE ARBEITEN
In diesem Kapitel werden einige Ansätze aus der Literatur vorgestellt, in denen neuronale Netze für
verschiedene Aufgabenstellungen im Image Retrieval verwendet werden. Darunter fallen Themengebiete
wie Clustering und Klassifikation von Bildern und ihren Merkmalsvektoren.
Ein bekanntes Beispiel für die Verwendung von neuronalen Netzen im Image Retrieval ist das PicSOM
Framework, welches in [8] vorgestellt wird. Dort werden TS-SOMs (Tree Structured Self-Organizing
Maps) für die Bildsuche verwendet. Ein Bild wird dabei durch einen Merkmalsvektor dargestellt. Diese
Vektoren werden dann dem neuronalen Netz präsentiert, welches sie dann der Baumstruktur hinzufügt,
so dass im Idealfall am Ende jedes Bild in der Baumstruktur repräsentiert wird. Bei der Suche wird
der Baum dann durchlaufen und der ähnlichste Knoten als Antwort gewählt. Das Framework verwendet
dabei das Feedback des Nutzers, wodurch nach jeder Iteration das Ergebnis verfeinert wird. Das
neuronale Netz dient hier somit als Klassifikator.
[12] benutzt das Fuzzy Art neuronale Netz, um die Merkmalsvektoren zu klassifizieren. Sie schlagen
dabei eine zweite Bewertungsphase vor, die dazu dient, das Netz an ein erwartetes Ergebnis anzupassen,
das System damit zu überwachen und die Resultate des Netzes zu präzisieren.
In [6] wird ein Radial Basis Funktion Netzwerk (RBF) als Klassifikator verwendet. Eins ihrer Verfahren
lässt dabei den Nutzer einige Bilder nach der Nähe zu ihrem Suchziel bewerten, um diese Bewertung dann
für das Training ihrer Netzwerke zu verwenden. Danach nutzen sie die so trainierten neuronalen Netze
zur Bewertung aller Bilder der Datenbank.
Auch [11] nutzt ein Radial Basis Funktion Netz zur Suche nach Bildern und trainiert dieses mit der
vom Nutzer angegebenen Relevanz des Bildes, wobei das neuronale Netz nach jeder Iteration aus
Bewertung und Suche weiter trainiert wird.
In [3] wird ein Multiple Instance Netzwerk verwendet. Das bedeutet, dass für jede mögliche Klasse von
Bildern ein eigenes neuronales Netz erstellt wird. Danach wird ein Eingabebild jedem dieser Subnetze
präsentiert und gegebenenfalls der dazugehörigen Klasse zugeordnet.

3. ART2-A BESCHREIBUNG
In diesem Kapitel werden die benötigten Mechanismen eines Art2-a Systems vorgestellt. Für das
Verständnis sind dabei nicht alle Funktionen des Systems nötig, weshalb zum Beispiel auf die nähere
Beschreibung der für das Lernen benötigten Formeln und des Preprocessing Fields verzichtet wird. Für
weiterführende Informationen über diese beiden Punkte sowie generell über das Art2-a System sei
deshalb auf [1, 2] verwiesen.

Abbildung 1: Skizze eines Art2-a Systems (Attentional Subsystem mit Preprocessing Field F0, Input
Representation Field F1, Category Representation Field F2 und LTM-Vektoren zJ*; Orienting Subsystem
mit Reset-Modul)

Wie in Bild 1 zu sehen ist, besteht das System aus zwei Subsystemen: einem Attentional Subsystem, in
dem die Bearbeitung und Zuordnung eines an den Eingang angelegten Vektors ausgeführt wird, sowie einem
Orienting Subsystem, welches die Ähnlichkeit des Eingabevektors mit der vorher gewählten
Gewinnerkategorie berechnet und diese bei zu
geringer Nähe zurücksetzt.
Innerhalb des Category Representation Field F2 liegen die Knoten, die für die einzelnen
Vektorkategorien stehen. Dabei wird die Beschreibung der Kategorie in der Long Term Memory (LTM)
gespeichert, die das Feld F2 mit dem Input Representation Field F1 in beide Richtungen verbindet.
Nach [2] gilt für den Aktivitätswert T von Knoten J in dem Feld F2:

    T_J = \begin{cases} \alpha \cdot \sum_{i=1}^{n} I_i, & \text{wenn } J \text{ nicht gebunden ist,} \\ I \cdot z_J^{*}, & \text{wenn } J \text{ gebunden ist.} \end{cases}

I_i steht dabei für den durch das Preprocessing Field F0 berechneten Input in das Feld F1 und α ist
ein wählbarer Parameter, der klein genug ist, so dass die Aktivität eines ungebundenen Knotens für
bestimmte Eingangsvektoren nicht immer größer ist als alle Aktivitätswerte der gebundenen Knoten.
Hierbei gilt ein Knoten als gebunden, wenn ihm mindestens ein Vektor zugeordnet wurde.
Da der Aktivitätswert für alle nicht gebundenen Knoten konstant ist und deshalb nur einmal berechnet
werden muss, ist dieser Fall für eine Effizienzsteigerung von untergeordnetem Interesse und wird
deshalb im Folgenden nicht weiter betrachtet.
Wie in [2] gezeigt wird, sind sowohl I als auch z_J^{*} durch die Anwendung der euklidischen
Normalisierung Einheitsvektoren, weshalb folglich

    \|I\| = \|z_J^{*}\| = 1    (1)

gilt. Deshalb folgt für die Aktivitätswerte der gebundenen Kategorieknoten:

    T_J = I \cdot z_J^{*} = \|I\| \cdot \|z_J^{*}\| \cdot \cos\theta = \cos\theta    (2)

Die Aktivität eines Knotens entspricht damit dem Kosinus des Winkels zwischen dem Eingangsvektor I
und dem LTM-Vektor z_J^{*}. Damit der Knoten mit dem Index J gewählt wird, muss

    T_J = \max_{j}\{T_j\}

gelten, sprich der Knoten mit der maximalen Aktivität wird als mögliche Kategorie gewählt. Dabei wird
bei Gleichheit mehrerer Werte der zuerst gefundene Knoten genommen. Die maximale Distanz, die das
Resetmodul akzeptiert, wird durch den Schwellwert ρ, im Folgenden Vigilance Parameter genannt,
bestimmt, mit dem die für die Art2-a Variante benötigte Schwelle ρ* wie folgt berechnet wird:

    \rho^{*} = \frac{\rho^{2}(1+\sigma)^{2} - (1+\sigma^{2})}{2\sigma}

mit

    \sigma \equiv \frac{cd}{1-d}    (3)

und c und d als frei wählbaren Parametern des Systems, die der Beschränkung

    \frac{cd}{1-d} \le 1

unterliegen. Damit wird der Knoten J genau dann abgelehnt, wenn

    T_J < \rho^{*}    (4)

gilt. Ist das der Fall, wird ein neuer Knoten aktiviert und somit eine neue Kategorie erstellt. Mit
(2) und (4) folgt damit, dass ein Knoten nur dann ausgewählt werden kann, wenn für den Winkel θ
zwischen dem Eingabevektor I und dem gespeicherten LTM-Vektor z_J^{*}

    \cos\theta \ge \rho^{*}    (5)

gilt. Da die einzelnen Rechnungen, die von dem System ausgeführt werden müssen, unabhängig sind, ist
dieses System hochgradig parallelisierbar, weshalb alleine durch Ausnutzung dieser Tatsache die
Berechnungszeit stark gesenkt werden kann. Mit steigender Knotenanzahl lässt sich das System dennoch
weiter optimieren, wie in dem folgenden Kapitel gezeigt werden soll.
Das Art2-a System hat dabei allerdings einen Nachteil, denn bedingt durch die Nutzung des Kosinus des
Winkels zwischen zwei Vektoren werden Vektoren, die linear abhängig sind, in dieselbe Kategorie
gelegt. Dieses Verhalten ist für die geforderte Genauigkeit bei den Kategorien unerwünscht. Dennoch
lässt sich dieses Problem leicht durch die Erhebung weiterer Daten, wie zum Beispiel den
Clustermittelpunkt, lösen, weshalb im Folgenden nicht weiter darauf eingegangen wird.
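Die Auswahllogik dieses Kapitels (Aktivität gebundener Knoten als Skalarprodukt, Wahl des Maximums,
Ablehnung unterhalb der Schwelle ρ*) lässt sich wie folgt als kleine Python-Skizze zusammenfassen.
Preprocessing, Lernen und die Behandlung ungebundener Knoten sind bewusst ausgespart; die Skizze setzt
bereits normierte Vektoren voraus, und alle Namen sind frei gewählt.

    import math

    def rho_stern(rho, c=0.1, d=0.9):
        # Schwelle rho* der Art2-a Variante mit sigma = c*d/(1-d)
        sigma = c * d / (1 - d)
        return (rho**2 * (1 + sigma)**2 - (1 + sigma**2)) / (2 * sigma)

    def waehle_knoten(I, ltm_vektoren, rho):
        # Aktivitaet gebundener Knoten: Skalarprodukt I * zJ (= cos theta bei
        # Einheitsvektoren); liegt der Maximalwert unter rho*, wird eine neue
        # Kategorie signalisiert (None). Bei Gleichheit gewinnt der zuerst
        # gefundene Knoten (max liefert den kleinsten Index).
        schwelle = rho_stern(rho)
        aktivitaeten = [sum(i * z for i, z in zip(I, zJ)) for zJ in ltm_vektoren]
        if not aktivitaeten:
            return None
        J = max(range(len(aktivitaeten)), key=lambda j: aktivitaeten[j])
        return J if aktivitaeten[J] >= schwelle else None

    # Beispiel mit normierten Vektoren und rho = 0.95 (rho* ~ 0.80)
    I = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
    ltm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(waehle_knoten(I, ltm, rho=0.95))   # cos theta = 0.707 < rho* -> None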
4. VORGENOMMENE OPTIMIERUNG
Dieses Kapitel dient der Beschreibung der verwendeten Abschätzung und ihrer Implementierung in das
Art2-a System. Abschließend wird dann noch auf eine weitere Verbesserung, die sich durch diese
Implementierung ergibt, eingegangen. Der Aufbau des Abschnitts ist dabei wie folgt: In Unterabschnitt
4.1 wird das Verfahren zur Abschätzung des Winkels vorgestellt. In dem folgenden Unterabschnitt 4.2
wird dann gezeigt, wie man diese Abschätzung in das Art2-a System integrieren kann. In dem letzten
Unterabschnitt folgt dann eine Vorstellung der Abschätzung als Index für die Knoten.

4.1 Abschätzung des Winkels
In [7] wird eine Methode zur Schätzung der Distanz zwischen einem Anfragevektor und einem Datenvektor
beschrieben. Im Folgenden wird beschrieben, wie man Teile dieses Verfahrens nutzen kann, um damit die
Menge möglicher Knoten schon vor der Berechnung des Aktivitätswertes T_J zu verringern. Das Ziel ist
es, die teure Berechnung des Skalarproduktes zwischen I und z_J^{*} möglichst selten auszuführen und
gleichzeitig möglichst wenig Daten im Speicher vorrätig halten zu müssen. Dafür wird der unbekannte
Winkel θ zwischen den beiden Vektoren P und Q durch die bekannten Winkel α und β zwischen beiden
Vektoren und einer festen Achse T wie folgt abgeschätzt:

    \cos\theta \le \cos(|\alpha-\beta|) = \cos(\alpha-\beta)
               = \cos\alpha\cos\beta + \sin\alpha\sin\beta
               = \cos\alpha\cos\beta + \sqrt{1-\cos^{2}\alpha}\,\sqrt{1-\cos^{2}\beta}    (6)
Abbildung 2: Erweiterung des Art2 Systems mit einem neuen Feld für die Abschätzung des Winkels
(Estimation Field als Schnittstelle zwischen F0 und F2, Knoten 1 bis n in F2, Summen S als LTM)

Als Achse T wird hierbei ein n-dimensionaler, mit Einsen gefüllter Vektor verwendet, wodurch für die
L2-Norm des Achsenvektors \|T\| = \sqrt{n} folgt. Eingesetzt in die Formel

    \cos\theta = \frac{\langle P, Q \rangle}{\|P\|\,\|Q\|}

ergibt sich damit, unter Ausnutzung von (1), für das System mit den Vektoren I und z_J^{*}:

    \cos\alpha = \frac{\sum_{i=1}^{n} I_i}{\sqrt{n}} \quad\text{und}\quad \cos\beta = \frac{\sum_{i=1}^{n} z_{J i}^{*}}{\sqrt{n}}

Mit S^{I} und S^{z_J^{*}} als jeweiliger Summe der Vektorwerte reduziert sich, unter Verwendung der
Formel (6), die Abschätzung des Kosinus vom Winkel θ auf

    \cos\theta \le \frac{S^{I} \cdot S^{z_J^{*}}}{n} + \sqrt{\Bigl(1-\frac{(S^{I})^{2}}{n}\Bigr)\Bigl(1-\frac{(S^{z_J^{*}})^{2}}{n}\Bigr)}
               = \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n}

Diese Abschätzung ermöglicht es nun, die Menge der Kandidaten möglicher Knoten für einen Eingabevektor
I vorzeitig zu reduzieren, indem man ausnutzt, dass der Kosinus des wirklichen Winkels zwischen
Eingabevektor und in der LTM gespeichertem Vektor maximal genauso groß ist wie der mit der gezeigten
Formel geschätzte Wert. Damit ist diese Abschätzung des wirklichen Winkels θ verlustfrei, denn es
können keine Knoten mit einem tatsächlich größeren Kosinuswert des Winkels aus der Menge an Kandidaten
entfernt werden. Daraus resultiert, dass ein Knoten nur dann weiter betrachtet werden muss, wenn die
Bedingung

    \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n} \ge \rho^{*}    (7)

erfüllt wird.
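Die folgende Python-Skizze rechnet die Schranke aus Bedingung (7) für zufällige normierte Vektoren
nach und zeigt, dass der tatsächliche Kosinus die Schranke nie überschreitet. Sie dient nur der
Illustration der Formel; alle Namen sind frei gewählt.

    import math, random

    def kosinus_schranke(S_I, S_z, n):
        # obere Schranke fuer cos(theta) gemaess Formel (6)/(7),
        # berechnet nur aus den Summen der (normierten) Vektorwerte
        return (S_I * S_z + math.sqrt((n - S_I**2) * (n - S_z**2))) / n

    def normiere(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    random.seed(0)
    n = 128                                   # Dimension der SIFT-Deskriptoren
    I  = normiere([random.random() for _ in range(n)])
    zJ = normiere([random.random() for _ in range(n)])

    cos_theta = sum(i * z for i, z in zip(I, zJ))
    schranke  = kosinus_schranke(sum(I), sum(zJ), n)
    print(cos_theta <= schranke + 1e-12)      # True: Abschaetzung ist verlustfrei

    # Kandidatenfilter: Knoten J muss nur betrachtet werden, wenn
    # kosinus_schranke(S_I, S_zJ, n) >= rho_stern gilt.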
4.2 Erweiterung des Systems
Damit die Bedingung (7) ausgenutzt werden kann, wird das Art2 System um ein weiteres Feld, im
Folgenden Estimation Field genannt, erweitert. Dieses Feld soll als Schnittstelle zwischen F0 und F2
dienen und die Abschätzung des Winkels zwischen dem Eingabevektor und dem gespeicherten LTM Vektor
vornehmen. Dazu wird dem Feld, wie in Abbildung 2 gezeigt wird, von dem Feld F0 die Summe S^{I}
übergeben. Innerhalb des Feldes gibt es dann für jeden Knoten J im Feld F2 eine zugehörige Estimation
Unit J'. In der Verbindung von jedem Knoten J zu der ihm zugehörigen Estimation Unit J' wird die
Summe der Werte des jeweiligen LTM Vektors S^{z_J^{*}} als LTM gespeichert. Die Estimation Unit
berechnet dann die Funktion

    f(J) = \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n}

für den ihr zugehörigen Knoten J. Abschließend wird als Aktivierungsfunktion, für die Berechnung der
Ausgabe o_{J'} der Estimation Unit J', die folgende Schwellenfunktion verwendet:

    o_{J'} = \begin{cases} 1, & \text{wenn } f(J) \ge \rho^{*} \\ 0, & \text{sonst} \end{cases}    (8)

Damit ergibt sich für die Aktivitätsberechnung jedes Knotens des F2 Feldes die angepasste Formel

    T_J = \begin{cases} \alpha \cdot \sum_{i} I_i, & \text{wenn } J \text{ nicht gebunden ist,} \\ I \cdot z_J^{*}, & \text{wenn } J \text{ gebunden ist und } o_{J'} = 1 \text{ gilt,} \\ 0, & \text{wenn } o_{J'} = 0 \text{ gilt} \end{cases}    (9)

mit o_{J'} als Ausgabe des Estimation Field zu Knoten J.
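Eine mögliche, stark vereinfachte Umsetzung des Estimation Fields könnte wie folgt aussehen (Annahme:
normierte Vektoren, pro Knoten ist nur die Summe S^{z_J*} gespeichert; alle Namen sind frei gewählt).
Das teure Skalarprodukt wird gemäß (8) und (9) nur berechnet, wenn die Schwellenfunktion den Wert 1
liefert.

    import math

    def estimation_unit(S_I, S_z, n, rho_stern):
        # Schwellenfunktion (8): 1, falls die Schranke f(J) >= rho* ist, sonst 0
        f = (S_I * S_z + math.sqrt((n - S_I**2) * (n - S_z**2))) / n
        return 1 if f >= rho_stern else 0

    def aktivitaet(I, zJ, S_I, S_z, rho_stern):
        # angepasste Aktivitaet (9) fuer einen gebundenen Knoten J:
        # o_J' = 0  ->  T_J = 0, das Skalarprodukt entfaellt
        n = len(I)
        if estimation_unit(S_I, S_z, n, rho_stern) == 0:
            return 0.0
        return sum(i * z for i, z in zip(I, zJ))

    def gewinner(I, ltm_vektoren, rho_stern):
        # Gewinnersuche nur ueber Knoten, die der Filter passieren laesst
        if not ltm_vektoren:
            return None
        S_I = sum(I)
        summen = [sum(z) for z in ltm_vektoren]      # S^{zJ*} je Knoten (LTM)
        werte = [aktivitaet(I, z, S_I, s, rho_stern)
                 for z, s in zip(ltm_vektoren, summen)]
        J = max(range(len(werte)), key=lambda j: werte[j])
        return J if werte[J] >= rho_stern else None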
4.3 Verwendung als Index
Durch die gezeigte Kosinusschätzung werden unnötige Skalarprodukte vermieden und somit das System
beschleunigt. Allerdings kann es bei weiterhin wachsender Anzahl der Knoten, zum Beispiel weil der
Arbeitsspeicher nicht ausreicht, nötig werden, nicht mehr alle LTM Vektoren im Speicher zu halten,
sondern nur ein Set möglicher Kandidaten zu laden und diese dann gezielt zu analysieren. In dem
folgenden Abschnitt wird gezeigt, wie die Abschätzung sinnvoll als Index für die Knoten verwendet
werden kann.
Für die Indexierung wird als Indexstruktur ein B+-Baum mit der Summe der Werte jedes LTM Vektors
S^{z_J^{*}} und der ID J des Knotens als zusammengesetztem Schlüssel verwendet. Für die
Sortierreihenfolge gilt: Zuerst wird nach S^{z_J^{*}} sortiert und dann nach J. Dadurch bleibt der
B+-Baum für partielle Bereichsanfragen nach dem Wert der Summe optimiert. Damit das funktioniert,
muss allerdings die Suche so angepasst werden, dass sie bei einer partiellen Bereichsanfrage für die
ID den kleinstmöglichen Wert einsetzt und dann bei der Ankunft in einem Blatt der Sortierung bis zum
ersten Vorkommen der gesuchten Summe folgt, auch über Blattgrenzen hinweg.
Dieser Index wird nun verwendet, um die Menge der Kandidaten einzuschränken, ohne, wie in der vorher
vorgestellten Optimierung durch die Estimation Unit, alle Knoten durchlaufen zu müssen. Anschaulich
bedeutet das, dass das Art2-a System nur noch die der Menge an Kandidaten
für den Eingabevektor I angehörenden Knoten sehen soll und somit nur in diesen den Gewinnerknoten
suchen muss. Für diesen Zweck müssen mögliche Wertebereiche der gespeicherten S^{z_J^{*}} für einen
beliebigen Eingabevektor festgelegt werden. Dies geschieht wieder mit Hilfe der Bedingung (7):

    \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n} \ge \rho
    \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})} \ge \rho \cdot n - S^{I} \cdot S^{z_J^{*}}

Für ρ·n − S^{I}·S^{z_J^{*}} < 0 ist diese Ungleichung offensichtlich immer erfüllt, da die
Quadratwurzel auf der linken Seite nie negativ ist. Damit ergibt sich die erste Bedingung:

    S^{z_J^{*}} > \frac{\rho \cdot n}{S^{I}}    (10)

Nun wird im Folgenden noch der Fall ρ·n ≥ S^{I}·S^{z_J^{*}} weiter betrachtet:

    \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})} \ge \rho \cdot n - S^{I} \cdot S^{z_J^{*}}
    n \cdot (1-\rho^{2}) - (S^{I})^{2} \ge (S^{z_J^{*}})^{2} - 2\rho\, S^{I} S^{z_J^{*}}
    (n - (S^{I})^{2})(1-\rho^{2}) \ge (S^{z_J^{*}} - \rho \cdot S^{I})^{2}

Damit ergibt sich:

    \sqrt{(n-(S^{I})^{2})(1-\rho^{2})} \ge S^{z_J^{*}} - \rho \cdot S^{I} \ge -\sqrt{(n-(S^{I})^{2})(1-\rho^{2})}    (11)

Mit den Bedingungen (10) und (11) können nun die partiellen Bereichsanfragen an den Index für einen
beliebigen Eingabevektor I wie folgt formuliert werden:

    r_1 = \Bigl[\rho S^{I} - \sqrt{(n-(S^{I})^{2})(1-\rho^{2})},\; \rho S^{I} + \sqrt{(n-(S^{I})^{2})(1-\rho^{2})}\Bigr]
    r_2 = \Bigl[\tfrac{\rho \cdot n}{S^{I}},\; \infty\Bigr]

Da für diese Bereichsanfragen die Bedingung (7) genutzt wird und somit alle geschätzten Kosinuswerte
größer als ρ* sind, hat bei der Verwendung des Indexes das Estimation Field keinen Effekt mehr.
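Zur Veranschaulichung zeigt die folgende Skizze, wie die Bereichsgrenzen r1 und r2 aus den Bedingungen
(10) und (11) berechnet und gegen einen nach (Summe, Knoten-ID) sortierten Index ausgewertet werden
könnten. Ein echter B+-Baum und die beschriebene Blattsuche werden hier nicht nachgebildet; alle Namen
und Beispielwerte sind frei gewählt.

    import bisect
    import math

    def bereichsgrenzen(S_I, n, rho):
        # r1 aus Bedingung (11), r2 aus Bedingung (10)
        w = math.sqrt((n - S_I**2) * (1 - rho**2))
        r1 = (rho * S_I - w, rho * S_I + w)
        r2 = (rho * n / S_I, math.inf)
        return r1, r2

    def kandidaten(index, S_I, n, rho):
        # index: nach (Summe, Knoten-ID) sortierte Liste als Ersatz fuer den
        # im Text vorgeschlagenen B+-Baum; liefert die IDs aller Knoten,
        # deren Summe in r1 oder r2 faellt.
        r1, r2 = bereichsgrenzen(S_I, n, rho)
        treffer = set()
        for lo, hi in (r1, r2):
            links = bisect.bisect_left(index, (lo, -math.inf))
            rechts = bisect.bisect_right(index, (hi, math.inf))
            treffer.update(knoten_id for _, knoten_id in index[links:rechts])
        return treffer

    # Beispiel: Index mit drei Knoten (Summe, ID), Eingabesumme S_I, n = 128
    index = sorted([(9.5, 0), (10.2, 1), (3.1, 2)])
    print(kandidaten(index, S_I=10.0, n=128, rho=0.999))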
                                                                             den Zeitverlust durch den größeren Aufwand weit übersteigt.
                                                                             Da aber möglichst genaue Kategorien erwünscht sind, ist ein
5. EVALUATION                                                                hoher Vigilance Parameter die richtige Wahl. Deshalb kann
   In diesem Kapitel wird die gezeigte Abschätzung evaluiert.               das gezeigte Verfahren für das angestrebte System adaptiert
Der vorgeschlagene Index wird dabei aber noch nicht berück-                 werden.
sichtigt.

5.1 Versuchsaufbau                                                           6.    RESÜMEE UND AUSBLICK
  Für die Evaluierung des gezeigten Ansatzes wurde ein                        In dieser Arbeit wurde eine Optimierung des Art2-a Sys-
Computer mit einem Intel Core 2 Duo E8400 3 GHz als                          tems vorgestellt, die durch Abschätzung des Winkels zwis-
Prozesser und 4 GB RAM benutzt.                                              chen Eingabevektor und gespeichertem Vektor die Menge
  Als Datensatz wurden Bilder von Flugzeugen aus dem                         an zu überprüfenden Kandidaten für hohe Vigilance Werte
Caltech 101 Datensatz [4] verwendet. Diese Bilder zeigen                     stark reduzieren kann. Des Weiteren wurde ein Ansatz zur
verschiedene Flugzeuge auf dem Boden beziehungsweise in                      Indexierung der Knoten basierend auf der für die Abschätz-
der Luft. Für den Geschwindigkeitstest wurden 20 Bilder                     ung nötigen Summe vorgestellt. Auch wenn eine abschlie-
aus dem Pool ausgewählt und nacheinander dem neuronalen                     ßende Analyse des gezeigten noch offen ist, so scheint dieser
Netz präsentiert. Im Schnitt produzieren die benutzten Bil-                 Ansatz dennoch erfolgversprechend für die erwünschten ho-
der dabei 4871 SIFT Feature Vektoren pro Bild.                               hen Vigilance Werte.
  Bedingt dadurch, dass die Ansätze verlustfrei sind, wird                    Aufbauend auf dem gezeigten wird unsere weitere For-
nur die Rechenzeit der verschiedenen Verfahren gegenüber                    schung die folgenden Punkte beinhalten:
• Es wird geprüft, ob die Abschätzung durch die Hinzunahme weiterer Daten verbessert werden kann und
  somit eine weitere Beschleunigung erzielt wird. Dafür kann man, um das Problem der zu geringen
  Präzision der Abschätzung bei kleinerem Vigilance Parameter zu umgehen, die Vektoren teilen und die
  Abschätzung wie in [7] aus den Teilsegmenten der Vektoren zusammensetzen. Dafür bräuchte man aber
  auch die Summe der Quadrate, da die Teilsegmente der Vektoren keine Einheitsvektoren mehr sind.
  Deshalb wird sich noch zeigen, ob der Gewinn an Präzision durch eine Aufteilung den größeren Aufwand
  durch Berechnung und Speicherung weiterer Werte rechtfertigt. Des Weiteren soll damit überprüft
  werden, ob die Abschätzung auch für kleinere Vigilance Werte verwendet werden kann.

• Es wird überprüft, wie groß die Auswirkungen der vorgestellten Verfahren bei einer parallelen
  Berechnung des Gewinnerknotens sind. Des Weiteren wird das Verfahren auf größeren Datenmengen
  getestet, um zu überprüfen, ob eine weitere Beschleunigung nötig ist, damit man das Verfahren im
  Live-Betrieb verwenden kann.

• Die Verwendung der Abschätzung zum Indexieren soll getestet und mit anderen Indexierungsverfahren
  verglichen werden, um ihren Nutzen besser bewerten zu können. Aber vor allem ihre Auswirkungen auf
  das Art2-a System im parallelisierten Betrieb sind noch offen und werden überprüft.

• Danach werden wir die Analyseeinheit konzipieren. Dafür wird als erstes überprüft, welche Daten man
  für ein fortlaufendes Lernen braucht, um einem Objekt keine falschen neuen Kategorien zuzuweisen
  oder richtige Kategorien zu entfernen. Danach soll ein geeignetes neuronales Netz aufgebaut werden,
  um damit die Zuordnung der Kategorien zu den Objekten durchführen zu können. Das Netz muss dann an
  die vorher erhobenen Daten angepasst werden, um die Präzision des Netzes zu erhöhen. Abschließend
  wird das Verfahren dann gegen andere populäre Verfahren getestet.

7. REFERENZEN
[1] G. A. Carpenter and S. Grossberg. ART 2: Self-organization of stable category recognition codes
    for analog input patterns. Applied Optics, 26(23):4919–4930, 1987.
[2] G. A. Carpenter, S. Grossberg, and D. B. Rosen. ART 2-A: An adaptive resonance algorithm for
    rapid category learning and recognition. Neural Networks, 4:493–504, 1991.
[3] S.-C. Chuang, Y.-Y. Xu, H. C. Fu, and H.-C. Huang. A multiple-instance neural networks based
    image content retrieval system. In Proceedings of the First International Conference on
    Innovative Computing, Information and Control, volume 2, pages 412–415, 2006.
[4] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training
    examples: An incremental bayesian approach tested on 101 object categories. CVPR 2004, Workshop
    on Generative-Model Based Vision, 2004.
[5] S. Grossberg. Adaptive pattern classification and universal recoding: II. Feedback, expectation,
    olfaction, illusions. Biological Cybernetics, 23:187–202, 1976.
[6] B. Jyothi and D. Shanker. Neural network approach for image retrieval based on preference
    elicitation. International Journal on Computer Science and Engineering, 2(4):934–941, 2010.
[7] Y. Kim, C.-W. Chung, S.-L. Lee, and D.-H. Kim. Distance approximation techniques to reduce the
    dimensionality for multimedia databases. Knowledge and Information Systems, 2010.
[8] J. T. Laaksonen, J. M. Koskela, and E. Oja. PicSOM: A framework for content-based image database
    retrieval using self-organizing maps. In 11th Scandinavian Conference on Image Analysis,
    pages 151–156, 1999.
[9] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the
    International Conference on Computer Vision, 1999.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of
    Computer Vision, 60:91–110, 2004.
[11] K. N. S., Čabarkapa Slobodan K., Z. G. J., and R. B. D. Implementation of neural network in
    CBIR systems with relevance feedback. Journal of Automatic Control, 16:41–45, 2006.
[12] H.-J. Wang and C.-Y. Chang. Semantic real-world image classification for image retrieval with
    fuzzy-ART neural network. Neural Computing and Applications, 21(8):2137–2151, 2012.
Auffinden von Spaltenkorrelationen mithilfe proaktiver und
                   reaktiver Verfahren

                                                       Katharina Büchse
                                                   Friedrich-Schiller-Universität
                                                       Institut für Informatik
                                                        Ernst-Abbe-Platz 2
                                                            07743 Jena
                                             katharina.buechse@uni-jena.de

KURZFASSUNG
Zur Verbesserung von Statistikdaten in relationalen Datenbanksystemen werden seit einigen Jahren
Verfahren für das Finden von Korrelationen zwischen zwei oder mehr Spalten entwickelt. Dieses Wissen
über Korrelationen ist notwendig, weil der Optimizer des Datenbankmanagementsystems (DBMS) bei der
Anfrageplanerstellung sonst von Unabhängigkeit der Daten ausgeht, was wiederum zu groben Fehlern bei
der Kostenschätzung und somit zu schlechten Ausführungsplänen führen kann.
Die entsprechenden Verfahren gliedern sich grob in proaktive und reaktive Verfahren: Erstere liefern
ein gutes Gesamtbild über sämtliche vorhandenen Daten, müssen dazu allerdings selbst regelmäßig auf
die Daten zugreifen und benötigen somit Kapazität des DBMS. Letztere überwachen und analysieren
hingegen die Anfrageergebnisse und liefern daher nur Korrelationsannahmen für bereits abgefragte
Daten, was einerseits das bisherige Nutzerinteresse sehr gut widerspiegelt, andererseits aber bei
Änderungen des Workloads versagen kann. Dafür wird einzig bei der Überwachung der Anfragen
DBMS-Kapazität benötigt, es erfolgt kein eigenständiger Zugriff auf die Daten.
Im Zuge dieser Arbeit werden beide Ansätze miteinander verbunden, um ihre jeweiligen Vorteile
auszunutzen. Dazu werden die sich ergebenden Herausforderungen, wie sich widersprechende
Korrelationsannahmen, aufgezeigt und als Lösungsansatz u. a. der zusätzliche Einsatz von reaktiv
erstellten Statistiken vorgeschlagen.

Categories and Subject Descriptors
H.2 [Information Systems]: Database Management; H.2.4 [Database Management]: Systems—Query processing

General Terms
Theory, Performance

Keywords
Anfrageoptimierung, Spaltenkorrelation, Feedback

1. EINFÜHRUNG
Die Verwaltung großer Datenmengen benötigt zunehmend leistungsfähigere Algorithmen, da die
Verbesserung der Technik (Hardware) nicht mit dem immer höheren Datenaufkommen heutiger Zeit mithalten
kann. Bspw. werden wissenschaftliche Messergebnisse aufgrund besserer Messtechnik immer genauer und
umfangreicher, sodass Wissenschaftler sie detaillierter, aber auch umfassender analysieren wollen und
müssen, oder Online-Shops speichern sämtliche ihrer Verkaufsdaten und werten sie aus, um dem Benutzer
passend zu seinen Interessen zeitnah und individuell neue Angebote machen zu können.
Zur Verwaltung dieser wie auch anderer Daten sind (im Datenbankbereich) insbesondere schlaue Optimizer
gefragt, weil sie für die Erstellung der Anfragepläne (und somit für die Ausführungszeit einer jeden
Anfrage) verantwortlich sind. Damit sie in ihrer Wahl nicht völlig danebengreifen, gibt es
Statistiken, anhand derer sie eine ungefähre Vorstellung bekommen, wie die vorhandene Datenlandschaft
aussieht. Hierbei ist insbesondere die zu erwartende Tupelanzahl von Interesse, da sie in hohem Maße
die Ausführungszeit einer Anfrage beeinflusst. Je besser die Statistiken die Verteilung der Daten
wiedergeben (und je aktueller sie sind), desto besser ist der resultierende Ausführungsplan. Sind die
Daten unkorreliert (was leider sehr unwahrscheinlich ist), genügt es, pro zu betrachtender Spalte die
Verteilung der Werte innerhalb dieser Spalte zu speichern. Treten in diesem Fall später in den
Anfragen Kombinationen der Spalten auf, ergibt sich die zu erwartende Tupelanzahl mithilfe einfacher
statistischer Weisheiten (durch Multiplikation der Einzelwahrscheinlichkeiten).
Leider versagen diese ab einem bestimmten Korrelationsgrad (also bei korrelierten Daten), und zwar in
dem Sinne, dass die vom Optimizer berechneten Schätzwerte zu stark von der Wirklichkeit abweichen,
was wiederum zu schlechten Ausführungszeiten führt. Diese ließen sich u. U. durch die Wahl eines
anderen Plans, welcher unter Berücksichtigung der Korrelation vom Optimizer erstellt wurde, verringern
oder sogar vermeiden.
Zur Veranschaulichung betrachten wir eine Tabelle, welche u. a. die Spalten A und B besitzt, und eine
Anfrage, welche Teile eben dieser Spalten ausgeben soll. Des Weiteren liege auf Spalte B ein Index,
den wir mit IB bezeichnen wol-
len, und es existiere ein zusammengesetzter Index IA,B für beide Spalten. Beide Indizes seien im DBMS
mithilfe von Bäumen (bspw. B∗-Bäume) implementiert, sodass wir auch (etwas informell) von „flachen“
oder „hohen“ Indizes sprechen können.
Sind beide Spalten unkorreliert, so lohnt sich in der Regel die Abfrage über IA,B. Bei einer starken
Korrelation beider Spalten dagegen könnte die alleinige Verwendung von IB vorteilhaft sein, und zwar
wenn die Werte aus Spalte A i. d. R. durch die Werte aus Spalte B bestimmt werden (ein typisches
Beispiel, welches auch in CORDS [7] anzutreffen ist, wäre eine Tabelle „Auto“ mit den Spalten
A = „Firma“ und B = „Marke“, sodass sich für A Werte wie „Opel“ oder „Mercedes“ und für B Werte wie
„Zafira“ oder „S-Klasse“ ergeben). Statt nun im vergleichsweise hohen Index IA,B erst passende A- und
dann passende B-Werte zu suchen, werden sämtliche Tupel, welche die gewünschten B-Werte enthalten,
über den flacheren Index IB geladen und überprüft, ob die jeweiligen A-Werte der Anfrage entsprechen
(was aufgrund der Abhängigkeit der Regelfall sein sollte).
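Der beschriebene Effekt lässt sich an einem kleinen, frei erfundenen Zahlenbeispiel nachrechnen: Unter
der Unabhängigkeitsannahme schätzt der Optimizer die Trefferzahl als Produkt der
Einzelwahrscheinlichkeiten; bei einer funktionalen Abhängigkeit wie zwischen Marke und Firma liegt er
damit deutlich daneben.

    # Hypothetisches Zahlenbeispiel zur Unabhaengigkeitsannahme:
    # Tabelle "Auto" mit N Tupeln, Anfrage: Firma = 'Opel' AND Marke = 'Zafira'
    N = 1_000_000
    sel_firma = 0.10     # 10 % aller Tupel haben Firma = 'Opel'
    sel_marke = 0.01     #  1 % aller Tupel haben Marke = 'Zafira'

    # Schaetzung unter Unabhaengigkeit: Multiplikation der Einzelwahrscheinlichkeiten
    schaetzung = N * sel_firma * sel_marke
    print(schaetzung)            # 1000 Tupel

    # Tatsaechlich bestimmt die Marke die Firma (funktionale Abhaengigkeit):
    # jedes 'Zafira'-Tupel ist auch ein 'Opel'-Tupel.
    tatsaechlich = N * sel_marke
    print(tatsaechlich)          # 10000 Tupel, also um den Faktor 10 unterschaetzt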
                                                                  sollte das klar sein. Aber nehmen wir an, dass die Kombi-
   Das Wissen über Korrelationen fällt aber natürlich nicht    nation von beiden Methoden analysiert wurde. Da für die
vom Himmel, es hat seinen Preis. Jeder Datenbänkler hofft,       Analyse höchstwahrscheinlich jeweils unterschiedliche Tupel
dass seine Daten unkorreliert sind, weil sein DBMS dann we-       (Wertekombinationen) verwendet wurden, können sich na-
niger Metadaten (also Daten über die Daten) speichern und        türlich auch die Schlüsse unterscheiden. Hier stellt sich nun
verwalten muss, sondern auf die bereits erwähnten statis-        die Frage, welches Ergebnis besser“ ist. Dafür gibt es kei-
                                                                                                 ”
tischen Weisheiten zurückgreifen kann. Sind die Daten da-        ne allgemeine Antwort, gehen wir aber von einer modera-
gegen (stark) korreliert, lässt sich die Erkenntnis darüber     ten Änderung des Anfrageverhaltens aus, ist sicherlich das
nicht so einfach wie die Unabhängigkeit mit (anderen) sta-         reaktive Ergebnis“ kurzfristig entscheidender, während das
                                                                  ”
tistischen Weisheiten verbinden und somit abarbeiten“.              proaktive Ergebnis“ in die längerfristige Planung der Sta-
                                              ”                   ”
   Nicht jede (eher kaum eine) Korrelation stellt eine (schwa-    tistikerstellung mit aufgenommen werden sollte.
che) funktionale Abhängigkeit dar, wie es im Beispiel der
Fall war, wo wir einfach sagen konnten Aus der Marke folgt
                                         ”
die Firma (bis auf wenige Ausnahmen)“. Oft liebäugeln be-        2. GRUNDLAGEN
stimmte Werte der einen Spalte mit bestimmten Werten an-             Wie in der Einleitung bereits angedeutet, können Korrela-
derer Spalten, ohne sich jedoch in irgendeiner Weise auf diese    tionen einem Datenbanknutzer den Tag vermiesen. Um dies
Kombinationen zu beschränken. (In Stuttgart gibt es sicher-      zu verhindern, wurden einige Methoden vorgeschlagen, wel-
lich eine Menge Porsches, aber die gibt es woanders auch.)        che sich auf verschiedene Art und Weise dieser Problematik
Außerdem ändern sie möglicherweise mit der Zeit ihre Vor-       annehmen (z. B. [7, 6]) oder sie sogar ausnutzen (z. B. [4, 8]),
lieben (das Stuttgarter Porschewerk könnte bspw. nach Chi-       um noch an Performance zuzulegen. Letztere sind allerdings
na umziehen) oder schaffen letztere völlig ab (wer braucht       mit hohem Aufwand oder der Möglichkeit, fehlerhafte An-
schon einen Porsche? Oder überhaupt ein Auto?).                  frageergebnisse zu liefern1 , verbunden. Daher konzentrieren
   Deswegen werden für korrelierte Daten zusätzliche Sta-       wir uns hier auf das Erkennen von Korrelationen allein zur
tistiken benötigt, welche nicht nur die Werteverteilung ei-      Verbesserung der Statistiken und wollen hierbei zwischen
ner, sondern die Werteverteilung mehrerer Spalten wiederge-       proaktiven und reaktiven Verfahren unterscheiden.
ben. Diese zusätzlichen Statistiken müssen natürlich irgend-
wo abgespeichert und, was noch viel schlimmer ist, gewartet       2.1 Proaktive (datengetriebene) Verfahren
werden. Somit ergeben sich zusätzlicher Speicherbedarf und         Proaktiv zu handeln bedeutet, etwas auf Verdacht“ zu
                                                                                                             ”
zusätzlicher Aufwand, also viel zu viel von dem, was keiner      tun. Impfungen sind dafür ein gutes Beispiel – mithilfe ei-
so richtig will.                                                  ner Impfung ist der Körper in der Lage, Krankheitserreger
                                                                  zu bekämpfen, aber in vielen Fällen ist unklar, ob er die-
   Da sich ein bisschen statistische Korrelation im Grunde        se Fähigkeit jemals benötigen wird. Da Impfungen auch mit
überall findet, gilt es, die Korrelationen ausfindig zu ma-      Nebenwirkungen verbunden sein können, muss jeder für sich
chen, welche unsere statistischen Weisheiten alt aussehen         entscheiden, ob und wogegen er sich impfen lässt.
lassen und dazu führen, dass das Anfrageergebnis erst nach         Auch Datenbanken können geimpft“ werden, allerdings
                                                                                                  ”
einer gefühlten halben Ewigkeit ausgeben wird. Ob letzte-        handelt es sich bei langen Anfrageausführungszeiten (die
res überhaupt passiert, hängt natürlich auch vom Anfrage-      wir ja bekämpfen wollen) eher um Symptome (wie Bauch-
verhalten auf die Datenbank ab. Wenn die Benutzer sich            schmerzen oder eine laufende Nase), die natürlich unter-
in ihren (meist mithilfe von Anwendungsprogrammen abge-           schiedliche Ursachen haben können. Eine davon bilden ganz
setzten) SQL-Anfragen in der WHERE-Klausel jeweils auf
                                                                  1
eine Spalte beschränken und auf jedwede Verbünde (Joins)          Da die Verfahren direkt in die Anfrageplanerstellung ein-
verzichten, dann ist die Welt in Ordnung. Leider lassen sich      greifen und dabei auf ihr Wissen über Korrelationen aufbau-
Benutzer nur ungern so stark einschränken.                       en, muss, für ein korrektes Anfrageergebnis, dieses Wissen
                                                                  aktuell und vollständig sein.
Der grobe „Impfvorgang“ „gegen“ Korrelationen umfasst zwei Schritte:

1. Es werden Vermutungen aufgestellt, welche Spaltenkombinationen für spätere Anfragen eine Rolle spielen könnten.

2. Es wird kontrolliert, ob diese Kombinationen von Korrelation betroffen sind oder nicht.

Entscheidend dabei ist, dass die Daten bzw. ein Teil der Daten gelesen (und analysiert) werden, und zwar ohne damit konkrete Anfragen zu bedienen, sondern rein zur Ausführung des Verfahrens bzw. der „Impfung“ (in diesem Fall „gegen“ Korrelation, wobei die Korrelation natürlich nicht beseitigt wird, schließlich können wir schlecht den Datenbestand ändern, sondern die Datenbank lernt, damit umzugehen). Das Lesen und Analysieren kostet natürlich Zeit, womit klar wird, dass auch diese „Impfung“ „Nebenwirkungen“ mit sich bringt.

Eine konkrete Umsetzung haben Ilyas et al., aufbauend auf BHUNT [4], mit CORDS [7] vorgestellt. Dieses Verfahren findet Korrelationen zwischen Spaltenpaaren, die Spaltenanzahl pro Spaltenkombination wurde also auf zwei begrenzt.²

Es geht folgendermaßen vor: Im ersten „Impfschritt“ sucht es mithilfe des Katalogs oder mittels Stichproben nach Schlüssel-Fremdschlüssel-Beziehungen und führt somit eine Art Rückabbildung von Datenbank zu Datenmodell durch (engl. „reverse engineering“) [4]. Darauf aufbauend werden dann nur solche Spaltenkombinationen als für die Korrelationssuche infrage kommend angesehen, deren Spalten

a) aus derselben Tabelle stammen oder

b) aus einer Verbundtabelle stammen, wobei der Verbund („Join“) mittels (Un-)Gleichung zwischen Schlüssel- und Fremdschlüsselspalten entstanden ist.

Zudem gibt es zusätzliche Reduktionsregeln (engl. „pruning rules“) für das Finden der Beziehungen und für die Auswahl der zu betrachtenden Spaltenkombinationen. Schließlich kann die Spaltenanzahl sehr hoch sein, was die Anzahl an möglichen Kombinationen gegebenenfalls ins Unermessliche steigert.

Im zweiten „Impfschritt“ wird für jede Spaltenkombination eine Stichprobe entnommen und darauf aufbauend eine Kontingenztabelle erstellt. Letztere dient dann wiederum als Grundlage für einen Chi-Quadrat-Test, der als Ergebnis eine Zahl χ² ≥ 0 liefert. Gilt χ² = 0, so sind die Spalten vollständig unabhängig. Da dieser Fall aber in der Praxis kaum auftritt, muss χ² einen gewissen Schwellwert überschreiten, damit die entsprechende Spaltenkombination als korreliert angesehen wird. Zum Schluss wird eine Art Rangliste der Spaltenkombinationen mit den höchsten χ²-Werten erstellt, und für die obersten n Kombinationen werden zusätzliche Statistikdaten angelegt. Die Zahl n ist dabei u. a. durch die Größe des Speicherplatzes (für Statistikdaten) begrenzt.

² Die Begrenzung wird damit begründet, dass auf diese Weise das beste Aufwand-Nutzen-Verhältnis entsteht. Das Verfahren selbst ist nicht auf Spaltenpaare beschränkt.

2.2 Reaktive (anfragegetriebene) Verfahren

Während wir im vorherigen Abschnitt Vermutungen aufgestellt und auf Verdacht gehandelt haben, um den Datenbankbenutzer glücklich zu machen, gehen wir jetzt davon aus, dass den Benutzer auch weiterhin das interessieren wird, wofür er sich bis jetzt interessiert hat.

Wir ziehen also aus der Vergangenheit Rückschlüsse für die Zukunft, und zwar indem wir den Benutzer bei seinem Tun beobachten und darauf reagieren (daher auch die Bezeichnung „reaktiv“). Dabei achten wir nicht allein auf die gestellten SQL-Anfragen, sondern überwachen vielmehr die von der Datenbank zurückgegebenen Anfrageergebnisse. Diese verraten uns nämlich alles (jeweils 100-prozentig aktuell!) über den Teil der vorhandenen Datenlandschaft, den der Benutzer bis jetzt interessant fand.

Auf diese Weise können bspw. Statistiken erzeugt werden [5, 11, 3] (wobei STHoles [5] und ISOMER [11] sogar in der Lage sind, mehrdimensionale Statistiken zu erstellen), oder es lassen sich mithilfe alter Anfragen neue, ähnliche Anfragen in ihrer Performance verbessern [12]. Sinnvoll kann auch eine Unterbrechung der Anfrageausführung mit damit verbundener Reoptimierung sein [9, 2, 10]. Zu guter Letzt lässt sich mithilfe dieses Ansatzes zumindest herausfinden, welche Statistikdaten entscheidend sein könnten [1].

In [1] haben Aboulnaga et al. auch schon erste Ansätze für eine Analyse auf Spaltenkorrelation vorgestellt, welche später in [6] durch Haas et al. ausgebaut und verbessert wurden. In Analogie zu CORDS werden in [1] und [6] nur Spaltenpaare für die Korrelationssuche in Betracht gezogen. Allerdings fällt die Auswahl der infrage kommenden Spaltenpaare wesentlich leichter aus, weil einfach alle Spaltenpaare, die in den Anfragen (mit hinreichend vielen Daten³) vorkommen, potentielle Kandidaten bilden.

Während in [1] pro auftretendem Wertepaar einer Spaltenkombination ein Quotient aus „Häufigkeit bei Unabhängigkeit“ und „tatsächlicher Häufigkeit“ gebildet und das Spaltenpaar als „korreliert“ angesehen wird, sobald zu viele dieser Quotienten von einem gewissen Wert abweichen, setzen Haas et al. in [6] einen angepassten Chi-Quadrat-Test ein, um Korrelationen zu finden. Dieser ist etwas aufwendiger als die Vorgehensweise von [1], dafür jedoch nicht so fehleranfällig [6]. Zudem stellen Haas et al. in [6] Möglichkeiten vor, wie sich die einzelnen „Korrelationswerte“ pro Spaltenpaar miteinander vergleichen lassen, sodass, ähnlich wie in CORDS, eine Rangliste der am stärksten korrelierten Spaltenpaare erstellt werden kann. Diese kann als Entscheidungshilfe für das Anlegen zusätzlicher Statistikdaten genutzt werden.

3. HERAUSFORDERUNGEN

In [6] wurde bereits vorgeschlagen, dieses Verfahren mit CORDS zu verbinden. Das reaktive Verfahren spricht aufgrund seiner Effizienz für sich, während das proaktive Verfahren eine gewisse Robustheit bietet, sodass bei Lernphasen von [6] (wenn es neu eingeführt wird oder wenn sich die Anfragen ändern) robuste Schätzwerte zur Erstellung eines Anfrageplans berechnet werden können [6]. Dazu sollte CORDS entweder in einem gedrosselten Modus während des normalen Datenbankbetriebs laufen oder während Wartungszeiten ausgeführt werden. Allerdings werden in [6] keine Aussagen darüber getroffen, wie die jeweiligen Ergebnisse beider Verfahren miteinander kombiniert werden sollten.

³ Um aussagefähige Ergebnisse zu bekommen, wird ein gewisses Mindestmaß an Beobachtungen benötigt, insb. in [6].
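Zur Veranschaulichung des zweiten „Impfschritts“ (Kontingenztabelle und Chi-Quadrat-Test auf einer Stichprobe) sei hier eine stark vereinfachte Python-Skizze angegeben. Es handelt sich um eine rein illustrative Annahme; Funktionsnamen wie chi_square_score oder rank_correlated_pairs sind frei gewählt und stammen weder aus CORDS noch aus [6].

from collections import Counter

def chi_square_score(stichprobe):
    # stichprobe: Liste von Wertepaaren (a, b) einer Spaltenkombination.
    # Aus der Stichprobe wird eine Kontingenztabelle aufgebaut und daraus der
    # Chi-Quadrat-Wert berechnet; 0 bedeutet vollständige Unabhängigkeit.
    n = len(stichprobe)
    beobachtet = Counter(stichprobe)                 # Kontingenztabelle
    rand_a = Counter(a for a, _ in stichprobe)       # Randhäufigkeiten Spalte A
    rand_b = Counter(b for _, b in stichprobe)       # Randhäufigkeiten Spalte B
    chi2 = 0.0
    for a in rand_a:
        for b in rand_b:
            erwartet = rand_a[a] * rand_b[b] / n     # Häufigkeit bei Unabhängigkeit
            chi2 += (beobachtet.get((a, b), 0) - erwartet) ** 2 / erwartet
    return chi2

def rank_correlated_pairs(stichproben, schwellwert, n_top):
    # stichproben: {(spalte_1, spalte_2): [(wert_1, wert_2), ...]}
    # Liefert die n_top Spaltenpaare mit den höchsten Chi-Quadrat-Werten oberhalb
    # des Schwellwerts, absteigend sortiert (die "Rangliste").
    bewertet = ((chi_square_score(werte), paar) for paar, werte in stichproben.items())
    rangliste = sorted((b for b in bewertet if b[0] > schwellwert), reverse=True)
    return [(paar, wert) for wert, paar in rangliste[:n_top]]

Für die obersten n Paare einer so erstellten Rangliste würden anschließend zusätzliche Statistikdaten angelegt.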
Folgende Punkte sind dabei zu bedenken:

• Beide Verfahren liefern eine Rangliste mit den als am stärksten von Korrelation betroffenen Spalten. Allerdings sind die den Listen zugrunde liegenden „Korrelationswerte“ (s. bspw. χ² im Abschnitt über proaktive Verfahren) auf unterschiedliche Weise entstanden und lassen sich nicht einfach vergleichen. Liefern beide Listen unterschiedliche Spaltenkombinationen, so kann es passieren, dass eine Kombination, die in der einen Liste sehr weit unten erscheint, stärker korreliert ist als Kombinationen, die auf der anderen Liste sehr weit oben aufgeführt sind.

• Die Daten, welche zu einer gewissen Entscheidung bei den beiden Verfahren führen, ändern sich, werden aber in der Regel nicht gleichzeitig von beiden Verfahren gelesen. Das hängt damit zusammen, dass CORDS zu einem bestimmten Zeitpunkt eine Stichprobe entnimmt und darauf seine Analyse aufbaut, während das Verfahren aus [6] die im Laufe der Zeit angesammelten Anfragedaten auswertet.

• Da zusätzliche Statistikdaten Speicherplatz benötigen und vor allem gewartet werden müssen, ist es nicht sinnvoll, einfach für alle Spaltenkombinationen, die in der einen und/oder der anderen Rangliste vorkommen, gleich zu verfahren und zusätzliche Statistiken zu erstellen.

Zur Verdeutlichung wollen wir die Tabelle aller Firmenwagen eines großen, internationalen IT-Unternehmens betrachten, in welcher zu jedem Wagen u. a. seine Farbe und die Personal- sowie die Abteilungsnummer desjenigen Mitarbeiters verzeichnet ist, der den Wagen hauptsächlich nutzt. Diverse dieser Mitarbeiter wiederum gehen in einem Dresdener mittelständischen Unternehmen ein und aus, welches nur rote KFZ auf seinem Parkplatz zulässt (aus Kapazitätsgründen wurde eine solche, vielleicht etwas seltsam anmutende Regelung eingeführt). Da die Mitarbeiter sich dieser Regelung bei der Wahl ihres Wagens bewusst waren, fahren sie alle ein rotes Auto. Zudem sitzen sie alle in derselben Abteilung.

Allerdings ist das internationale Unternehmen wirklich sehr groß und besitzt viele Firmenwagen sowie unzählige Abteilungen, sodass diese roten Autos in der Gesamtheit der Tabelle nicht auffallen. In diesem Sinne würde das proaktive Verfahren CORDS also (eher) keinen Zusammenhang zwischen der Abteilungsnummer des den Wagen benutzenden Mitarbeiters und der Farbe des Autos erkennen.

Werden aber häufig genau diese Mitarbeiter mit der Farbe ihres Wagens abgefragt, z. B. weil sich diese kuriose Regelung des mittelständischen Unternehmens herumspricht, es keiner so recht glauben will und deswegen die Datenbank konsultiert wird, so könnte ein reaktives Verfahren feststellen, dass beide Spalten korreliert sind. Diese Feststellung tritt insbesondere dann auf, wenn sonst wenige Anfragen an beide betroffenen Spalten gestellt werden, was durchaus möglich ist, weil sonst die Farbe des Wagens eine eher untergeordnete Rolle spielt.

Insbesondere der letztgenannte Umstand macht deutlich, dass es nicht sinnvoll ist, Statistikdaten für die Gesamtheit beider Spalten zu erstellen und zu warten. Aber die Tatsache, dass bestimmte Spezialfälle für den Benutzer besonders interessant sein könnten, die möglicherweise eben gerade mit Korrelation einhergehen, spricht wiederum für eine Art „Hinweis“ an den Optimizer.

4. LÖSUNGSANSATZ

Da CORDS wie auch das Verfahren aus [6] nur Spaltenpaare betrachten und dies mit einem sich experimentell ergebenden Aufwand-Nutzen-Optimum begründen, werden auch wir uns auf Spaltenpaare begrenzen. Allerdings wollen wir uns für die Kombination von proaktiver und reaktiver Korrelationssuche zunächst nicht auf diese beiden Verfahren beschränken, müssen aber doch gewisse Voraussetzungen an die verwendeten Verfahren (und das Datenmodell der Datenbank) stellen. Diese seien hier aufgezählt:

1. Entscheidung über die zu untersuchenden Spaltenkombinationen:

   • Das proaktive Verfahren betreibt „reverse engineering“, um zu entscheiden, welche Spaltenkombinationen untersucht werden sollen.

   • Das Datenmodell der Datenbank ändert sich nicht, bzw. es sind nur geringfügige Änderungen zu erwarten, welche vom proaktiven Verfahren in das von ihm erstellte Datenmodell sukzessive eingearbeitet werden können. Auf diese Weise können wir bei unseren Betrachtungen den ersten „Impfschritt“ vernachlässigen.

2. Datengrundlage für die Untersuchung:

   • Das proaktive Verfahren entnimmt für jegliche zu untersuchende Spaltenkombination eine Stichprobe, welche mit einem Zeitstempel versehen wird. Diese Stichprobe wird so lange aufbewahrt, bis das Verfahren auf „Unkorreliertheit“ plädiert oder für die entsprechende Spaltenkombination eine neue Stichprobe erstellt wird.

   • Das reaktive Verfahren bedient sich eines Query-Feedback-Warehouses, in welchem die Beobachtungen („Query-Feedback-Records“) der Anfragen notiert sind.

3. Vergleich der Ergebnisse:

   • Beide Verfahren geben für jede Spaltenkombination, die sie untersucht haben, einen „Korrelationswert“ aus, der sich innerhalb des Verfahrens vergleichen lässt. Wie dieser genau berechnet wird, ist für uns unerheblich.

   • Aus den höchsten Korrelationswerten ergeben sich zwei Ranglisten der am stärksten korrelierten Spaltenpaare, die wir unterschiedlich auswerten wollen.

Zudem wollen wir davon ausgehen, dass das proaktive Verfahren in einem gedrosselten Modus ausgeführt wird und somit sukzessive seine Rangliste befüllt. (Zusätzliche Wartungszeiträume, bei denen das Verfahren ungedrosselt laufen kann, beschleunigen die Arbeit und bilden somit einen schönen Zusatz, aber da heutzutage viele Datenbanken quasi dauerhaft laufen müssen, wollen wir sie nicht voraussetzen.)
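Eine denkbare Umsetzung des Ranglistenvergleichs, die das Problem der nicht direkt vergleichbaren „Korrelationswerte“ umgeht, ist der Vergleich über Rangpositionen. Die folgende Python-Skizze ist lediglich eine illustrative Annahme (Namen frei gewählt) und keine Festlegung des Verfahrens:

def merge_by_rank(proaktive_liste, reaktive_liste):
    # Beide Listen sind absteigend nach Korrelationsstärke sortierte Folgen von
    # Spaltenpaaren. Verglichen wird nur die Rangposition, nicht der
    # verfahrensspezifische Korrelationswert.
    bester_rang = {}
    for liste in (proaktive_liste, reaktive_liste):
        for rang, paar in enumerate(liste):
            bester_rang[paar] = min(rang, bester_rang.get(paar, rang))
    return sorted(bester_rang, key=bester_rang.get)

# Beispiel: ("farbe", "abteilung") taucht nur reaktiv auf, wird aber wegen seiner
# hohen Rangposition dennoch weit vorn einsortiert.
print(merge_by_rank([("a", "b"), ("c", "d")],
                    [("farbe", "abteilung"), ("a", "b")]))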
Das reaktive Verfahren dagegen wird zu bestimmten Zeitpunkten gestartet, um die sich bis dahin angesammelten Beobachtungen zu analysieren, und gibt nach beendeter Analyse seine Rangliste bekannt. Da es als Grundlage nur die Daten aus dem Query-Feedback-Warehouse benötigt, kann es völlig entkoppelt von der eigentlichen Datenbank laufen.

Ist die reaktive Rangliste bekannt, kann diese mit der (bis dahin angefertigten) proaktiven Rangliste verglichen werden. Tritt eine Spaltenkombination in beiden Ranglisten auf, so bedeutet das, dass diese Korrelation für die bisherigen Anfragen eine Rolle gespielt hat und nicht nur auf Einzelfälle beschränkt ist, sondern auch mittels Analyse einer repräsentativen Stichprobe an Wertepaaren gefunden wurde.

Unter diesen Umständen lassen wir mittels einer Stichprobe Statistikdaten für die betreffende Spaltenkorrelation erstellen. Dabei wählen wir die Stichprobe des proaktiven Verfahrens, solange diese ein gewisses Alter nicht überschritten hat. Ist sie zu alt, wird eine neue Stichprobe entnommen.⁴

Interessanter wird es, wenn nur eines der Verfahren auf Korrelation tippt, während das andere Verfahren die entsprechende Spaltenkombination nicht in seiner Rangliste enthält. Die Ursache dafür liegt entweder darin, dass letzteres Verfahren die Kombination noch nicht analysiert hat (beim reaktiven Verfahren heißt das, dass sie nicht oder zu selten in den Anfragen vorkam), oder darin, dass es bei seiner Analyse zu dem Ergebnis „nicht korreliert“ gekommen ist.

Diese Unterscheidung wollen wir insbesondere in dem Fall vornehmen, wenn einzig das reaktive Verfahren die Korrelation „entdeckt“ hat. Unter der Annahme, dass weitere, ähnliche Anfragen folgen werden, benötigt der Optimizer schnell Statistiken für den abgefragten Bereich. Diese sollen zunächst reaktiv mithilfe der Query-Feedback-Records aus dem Query-Feedback-Warehouse erstellt werden (unter Verwendung von bspw. [11], wobei wir nur zweidimensionale Statistiken benötigen). Das kann wieder völlig getrennt von der eigentlichen Datenbank geschehen, da nur das Query-Feedback-Warehouse als Grundlage dient.

Wir überprüfen nun, ob das proaktive Verfahren das Spaltenpaar schon bearbeitet hat. Dies sollte anhand der Abarbeitungsreihenfolge der infrage kommenden Spaltenpaare erkennbar sein.

Ist dem so, hat das proaktive Verfahren das entsprechende Paar als „unkorreliert“ eingestuft, und wir bleiben bei den reaktiv erstellten Statistiken, die auch nur reaktiv aktualisiert werden. Veralten sie später zu stark aufgrund fehlender Anfragen (und somit fehlenden Nutzerinteresses), können sie gelöscht werden.

Ist dem nicht so, geben wir die entsprechende Kombination an das proaktive Verfahren weiter mit dem Auftrag, diese zu untersuchen.⁵ Beim nächsten Vergleich der Ranglisten muss es für das betrachtete Spaltenpaar eine konkrete Antwort geben. Entscheidet sich das proaktive Verfahren für „korreliert“ und befindet sich das Spaltenpaar auch wieder in der Rangliste des reaktiven Verfahrens, dann löschen wir die reaktiv erstellten Statistiken und erstellen neue Statistiken mittels einer Stichprobe, analog zum ersten Fall. (Die Kombination beider Statistiktypen wäre viel zu aufwendig, u. a. wegen unterschiedlicher Entstehungszeitpunkte.) Wenn das proaktive Verfahren dagegen explizit „unkorreliert“ ausgibt, bleibt es bei den reaktiv erstellten Statistiken, s. oben.

Wenn jedoch nur das proaktive Verfahren eine bestimmte Korrelation erkennt, dann ist diese Erkenntnis zunächst für die Benutzer unerheblich. Sei es, weil der Nutzer diese Spaltenkombination noch gar nicht abgefragt hat, oder weil er bis jetzt nur den Teil der Daten benötigt hat, der scheinbar unkorreliert ist. In diesem Fall markieren wir nur im Datenbankkatalog (wo die Statistiken abgespeichert werden) die beiden Spalten als korreliert und geben dem Optimizer somit ein Zeichen, dass hier hohe Schätzfehler möglich sind und er deswegen robuste Pläne zu wählen hat. Dabei bedeutet „robust“, dass der gewählte Plan für die errechneten Schätzwerte möglicherweise nicht ganz optimal ist, dafür aber bei stärker abweichenden „wahren Werten“ immer noch akzeptable Ergebnisse liefert. Zudem können wir ohne wirklichen Einsatz des reaktiven Verfahrens die Anzahl der Anfragen zählen, die auf diese Spalten zugreifen und bei denen sich der Optimizer stark verschätzt hat. Übersteigt der Zähler einen Schwellwert, werden mithilfe einer neuen Stichprobe (vollständige, also insb. mit Werteverteilung) Statistikdaten erstellt und im Katalog abgelegt.

Der Vollständigkeit halber wollen wir hier noch den Fall erwähnen, dass eine Spaltenkombination weder in der einen noch in der anderen Rangliste vorkommt. Es sollte klar sein, dass diese Kombination als „unkorreliert“ angesehen und somit für die Statistikerstellung nicht weiter betrachtet wird.

⁴ Falls die betroffenen Spalten einen Zähler besitzen, der bei Änderungsoperationen hochgezählt wird (vgl. z. B. [1]), können natürlich auch solche Daten mit in die Wahl der Stichprobe einfließen, allerdings sind hier unterschiedliche „Ausgangszeiten“ zu beachten.

⁵ Dadurch stören wir zwar etwas die vorgegebene Abarbeitungsreihenfolge der infrage kommenden Spaltenpaare, aber der Fall ist ja auch dringend.
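Die beschriebene Fallunterscheidung lässt sich grob wie folgt zusammenfassen (eine minimale Python-Skizze unter eigenen Annahmen; Parameter und Rückgabewerte sind frei gewählt und stellen keine konkrete Implementierung dar):

def entscheide(in_proaktiver_liste, in_reaktiver_liste, proaktiv_bearbeitet):
    # Liefert die Aktion für ein einzelnes Spaltenpaar nach dem Vergleich
    # der beiden Ranglisten.
    if in_proaktiver_liste and in_reaktiver_liste:
        return "Statistik aus (hinreichend junger) Stichprobe anlegen"
    if in_reaktiver_liste:
        if not proaktiv_bearbeitet:
            return ("reaktive Statistik anlegen und Paar vorgezogen "
                    "an das proaktive Verfahren übergeben")
        # proaktiv bereits untersucht und als "unkorreliert" eingestuft
        return "bei der reaktiven Statistik bleiben (nur reaktiv aktualisieren)"
    if in_proaktiver_liste:
        return ("Spalten im Katalog nur als korreliert markieren und "
                "Fehlschätzungen des Optimizers zählen")
    return "als unkorreliert behandeln, keine zusätzliche Statistik"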
5. AUSBLICK

Die hier vorgestellte Vorgehensweise zur Verbesserung der Korrelationsfindung mittels Einsatz zweier unterschiedlicher Verfahren muss weiter vertieft und insbesondere praktisch umgesetzt und getestet werden. Vor allem muss ein passendes Datenmodell für die reaktive Erstellung von Spaltenpaarstatistiken gefunden werden. Das vorgeschlagene Verfahren ISOMER [11] setzt hier auf STHoles [5], ein Datenmodell, welches bei sich stark überschneidenden Anfragen schnell inperformant werden kann. Für den eindimensionalen Fall wurde bereits von Informix-Entwicklern eine performante Lösung vorgestellt [3], welche sich aber nicht einfach auf den zweidimensionalen Fall übertragen lässt.

Eine weitere, noch nicht völlig ausgearbeitete Herausforderung bildet die Tatsache, dass das proaktive Verfahren im gedrosselten Modus läuft und erst sukzessive seine Rangliste erstellt. Das bedeutet, dass wir eigentlich nur Zwischenergebnisse dieser Rangliste mit der reaktiv erstellten Rangliste vergleichen. Dies kann zu unerwünschten Effekten führen, z. B. könnten beide Ranglisten völlig unterschiedliche Spaltenkombinationen enthalten, was einfach der Tatsache geschuldet ist, dass beide Verfahren unterschiedliche Spaltenkombinationen untersucht haben. Um solche Missstände zu vermeiden, muss die proaktive Abarbeitungsreihenfolge der Spaltenpaare überdacht werden. In CORDS wird bspw. als Reduktionsregel vorgeschlagen, nur Spaltenpaare zu betrachten, die im Anfrageworkload vorkommen (dazu müssen von CORDS nur die Anfragen, aber nicht deren Ergebnisse betrachtet werden). Würde sich dann aber der Workload dahingehend ändern, dass völlig neue Spalten oder Tabellen abgefragt werden, hätten wir dasselbe Problem wie bei einem rein reaktiven Verfahren. Deswegen muss hier eine Zwischenlösung gefunden werden, die Spaltenkombinationen aus Anfragen „bevorzugt behandelt“, sich aber nicht darauf beschränkt.

Außerdem muss überlegt werden, wann wir Statistikdaten, die auf Stichproben beruhen, wieder löschen können. Im reaktiven Fall fiel die Entscheidung leicht aus, weil fehlender Zugriff auf die Daten auch ein fehlendes Nutzerinteresse widerspiegelt und auf diese Weise auch keine Aktualisierung mehr stattfindet, sodass die Metadaten irgendwann unbrauchbar werden.

Basieren die Statistiken dagegen auf Stichproben, müssen sie von Zeit zu Zeit aktualisiert werden. Passiert diese Aktualisierung ohne zusätzliche Überprüfung auf Korrelation (welche ja aufgrund geänderten Datenbestands nachlassen könnte), müssen mit der Zeit immer mehr zusätzliche Statistikdaten über Spaltenpaare gespeichert und gewartet werden. Der für Statistikdaten zur Verfügung stehende Speicherplatz im Katalog kann so an seine Grenzen stoßen, außerdem kostet die Wartung wiederum Kapazität des DBMS. Hier müssen sinnvolle Entscheidungen über die Wartung und das „Aufräumen“ nicht mehr benötigter Daten getroffen werden.
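Die angesprochene Zwischenlösung für die proaktive Abarbeitungsreihenfolge – Spaltenpaare aus dem Anfrageworkload bevorzugen, ohne sich darauf zu beschränken – könnte beispielsweise so aussehen (eine rein illustrative Annahme; das Mischungsverhältnis ist frei gewählt):

import itertools

def abarbeitungsreihenfolge(workload_paare, alle_paare, verhaeltnis=3):
    # Auf je `verhaeltnis` Spaltenpaare aus dem Workload folgt ein noch nicht
    # angefragtes Paar, damit neue Spalten und Tabellen nicht dauerhaft
    # unberücksichtigt bleiben.
    workload_menge = set(workload_paare)
    rest = iter([p for p in alle_paare if p not in workload_menge])
    workload = iter(workload_paare)
    reihenfolge = []
    while True:
        block = list(itertools.islice(workload, verhaeltnis))
        zusatz = list(itertools.islice(rest, 1))
        if not block and not zusatz:
            return reihenfolge
        reihenfolge.extend(block + zusatz)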
6. REFERENCES

[1] A. Aboulnaga, P. J. Haas, S. Lightstone, G. M. Lohman, V. Markl, I. Popivanov, and V. Raman. Automated statistics collection in DB2 UDB. In VLDB, pages 1146–1157, 2004.
[2] S. Babu, P. Bizarro, and D. J. DeWitt. Proactive re-optimization. In SIGMOD Conference, pages 107–118. ACM, 2005.
[3] E. Behm, V. Markl, P. Haas, and K. Murthy. Integrating query-feedback based statistics into Informix Dynamic Server, Apr. 03 2008.
[4] P. Brown and P. J. Haas. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB 2003), September 9–12, 2003, Berlin, Germany, pages 668–679, 2003.
[5] N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A multidimensional workload-aware histogram. SIGMOD Rec., 30(2):211–222, May 2001.
[6] P. J. Haas, F. Hueske, and V. Markl. Detecting attribute dependencies from query feedback. In VLDB, pages 830–841. ACM, 2007.
[7] I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004, pages 647–658. ACM Press, 2004.
[8] H. Kimura, G. Huo, A. Rasin, S. Madden, and S. B. Zdonik. Correlation maps: A compressed access method for exploiting soft functional dependencies. PVLDB, 2(1):1222–1233, 2009.
[9] V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004, pages 659–670. ACM Press, 2004.
[10] T. Neumann and C. Galindo-Legaria. Taking the edge off cardinality estimation errors using incremental execution. In BTW, pages 73–92, 2013.
[11] U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. ISOMER: Consistent histogram construction using query feedback. In ICDE, page 39. IEEE Computer Society, 2006.
[12] M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO – DB2's learning optimizer. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01), pages 19–28, Orlando, Sept. 2001.
MVAL: Addressing the Insider Threat by Valuation-based Query Processing

Stefan Barthel and Eike Schallehn
Institute of Technical and Business Information Systems, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
stefan.barthel@ovgu.de, eike.schallehn@ovgu.de

ABSTRACT

The research presented in this paper is inspired by the problems conventional database security mechanisms have in addressing the insider threat, i.e. authorized users abusing granted privileges for illegal or disadvantageous accesses. The basic idea is to restrict the data one user can access by a valuation of data, e.g. a monetary value of data items, and, based on that, introducing limits for accesses. The specific topic of the present paper is the conceptual background of how the process of querying valuated data leads to valuated query results. For this, derivation functions are added by analyzing operations of the relational algebra and SQL.

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION

An acknowledged main threat to data security is fraudulent access by authorized users, often referred to as the insider threat [2]. To address this problem, in [1] we proposed a novel approach of detecting authorization misuse based on a valuation of data, i.e. an assigned description of the worth of the data managed in a system, which could for instance be interpreted as monetary values. Accordingly, possible security leaks exist if users access more valuable data than they are allowed to within a query or cumulated over a given time period. E.g., a bank account manager accessing a single customer record does not represent a problem, while dumping all data in an unrestricted query should be prohibited. Here, common approaches like role-based security mechanisms typically fail.

According to our proposal, the data valuation is first of all based on the relation definitions, i.e. as part of the data dictionary information about the value of data items such as attribute values and, derived from that, entire tuples and relations. Then, a key question is how the valuation of a query result can be derived from the input valuations, because performing operations on the base data causes transformations that have an impact on the data's significance.

This problem is addressed in the research presented here by considering relational and SQL operations and describing possible valuation derivations for them.

2. PRINCIPLES OF DATA VALUATION

In [1] we outlined our approach of a leakage-resistant data valuation which computes a monetary value (mval) for each query. This is based on the following basic principles: Every attribute Ai ∈ R of a base relation schema R is valuated by a certain monetary value (mval(Ai) ∈ ℝ). The attribute valuations for base tables are part of the data dictionary and can for instance be specified as an extension of the SQL DDL:

CREATE TABLE table_1
(
   attribute_1 INT PRIMARY KEY MVAL 0.1,
   attribute_2 UNIQUE COMMENT 'important' MVAL 10,
   attribute_3 DATE
);

With these attribute valuations, we derive a monetary value for one tuple t ∈ r(R), given by Equation (1), as well as the total monetary value of the relation r(R), given by Equation (2), if data is extracted by a query.

    mval(t \in r(R)) = \sum_{A_i \in R} mval(A_i)                              (1)

    mval(r(R)) = \sum_{t \in r(R)} mval(t) = |r(R)| \cdot mval(t \in r(R))     (2)

To be able to consider the mval for a query as well as several queries of one user over a certain period of time, we log all mvals in an alert log and compare the current cumulated mval per user to two thresholds. If a user exceeds the first threshold – the suspicious threshold – she will be categorized as suspect. After additionally exceeding the truncation threshold, her query output will be limited by hiding tuples and presenting a user notification. We embedded our approach in an additional layer in the security defense-in-depth model for raw data, which we have enhanced by a backup entity (see Fig. 1). Furthermore, encryption has to be established to prevent data theft via unauthorized, physical reads as well as backup theft. In this paper we are going into detail about how to handle operations like joins, aggregate functions, stored procedures as well as common functions.

[Figure 1: Security defense model on DBMS and physical level. The DBMS level layers Views, Access Control, Data Valuation, and Encryption around the Raw Data; on the physical level, the Backup Data are likewise protected by Encryption towards the Backup.]

Most of the data stored in a database can be easily identified as directly interpretable.
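As a minimal illustration of Equations (1) and (2), the tuple and relation valuations can be sketched in Python as follows (the dictionary representation and the concrete values are illustrative assumptions only, not part of an actual implementation):

# Attribute valuations as they might be read from the data dictionary,
# here for the table_1 example above (values are illustrative).
mval_attrs = {"attribute_1": 0.1, "attribute_2": 10.0, "attribute_3": 0.0}

def mval_tuple(attr_mvals):
    # Eq. (1): the mval of one tuple is the sum of its attribute valuations.
    return sum(attr_mvals.values())

def mval_relation(attr_mvals, row_count):
    # Eq. (2): the mval of an extracted relation is |r(R)| times the tuple mval.
    return row_count * mval_tuple(attr_mvals)

print(mval_tuple(mval_attrs))           # -> 10.1
print(mval_relation(mval_attrs, 1000))  # -> about 10100 for a query returning 1,000 rows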
One example would be an employee table, where each tuple has a value for the attributes "first name", "surname" and "gender". In this case, it is also quite easy to calculate the monetary value for a query (r(R_emp)) by simply summing all mvals per attribute and multiplying those with the number of involved rows (see Eq. (3)).

    mval(r(R_{emp})) = \sum_{A_i \in R_{emp}} mval(A_i) \cdot |r(R_{emp})|     (3)

However, it becomes more challenging if an additional attribute "license plate number" is added, which does have some unset or unknown attribute values – in most cases NULL values. By knowing there is a NULL value for a certain record, this could be interpreted as either simply unknown whether there is any car, or unset because this person has no car. So there is an uncertainty that could lead to an information gain which would be uncovered if no adequate valuation exists. Some other potentially implicit information gains originate from joins and aggregate functions, which we mention in the regarding sections.

Because the terms information gain and information loss are widely used and do not have a uniform definition, we define them for further use. We call a situation where an attacker received new data (resp. information) an information gain, and the same situation in the view of the data owner an information loss.

Uncertainty Factor

Some operators used for query processing obviously reduce the information content of the result set (e.g. selection, aggregations, semi joins, joins with resulting NULL values), but there is still an uncertain, implicit information gain. Since the information gain by uncertainty is blurry, meaning in some cases more indicative than in others, we have to distinguish uncertainty of one attribute value generated out of one source attribute value (e.g., generated NULL values) and attribute values which are derived from information of several source attribute values (e.g., aggregations). In case of one source attribute value, an information gain by uncertainty has to be less valuable than properly set attribute values. Therefore, the monetary value should be only a percentage of the respective monetary value of an attribute value. If several source attribute values are involved, we recommend to value the computed attribute value as a percentage of the monetary value of all participating source attribute values. In general, we suggest a maximum of 50% for both valuations. Furthermore, we need to consider the overall purpose of our leakage-resistant data valuation, which shall prevent extractions of large amounts of data. Therefore, the percentage needs to be increased with the amount of data, but not in a way that an unset or unknown attribute value becomes as valuable as a properly set one. For that reason, exponential growth is not a suitable option. Additionally, we have to focus on a certain area of application, because a trillion attributes (10^12) are conceivable whereas a septillion attributes (10^24) are currently not realistic. From the overall view on our data valuation, we assume, depending on the application, that the extraction of sensitive data becomes critical when 10^3 up to 10^9 attribute values are extracted. Therefore, the growth of our uncertainty factor UF increases much more until 10^9 attribute values than afterwards, which predominantly points to a logarithmic growth. We also do not need to have a huge difference of the factor if theoretically many more attribute values shall be extracted (e.g., 10^14 and more), because with respect to an extraction-limiting approach, it is way too much data to return. This assumption also refers to a logarithmic increase. We conclude that the most promising formula that was adapted to fit our needs is shown in Eq. (4).

    UF = \frac{1}{30} \log_{10}(|val_{A_i,...,A_k}| + 1)                       (4)
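A small sketch of the uncertainty factor of Eq. (4); the function name and the sample magnitudes are illustrative choices only:

import math

def uncertainty_factor(num_values):
    # Eq. (4): UF = (1/30) * log10(n + 1) for n participating attribute values.
    return math.log10(num_values + 1) / 30.0

# UF grows logarithmically: each additional order of magnitude adds only 1/30,
# so even an extraction of 10^14 attribute values stays below the 50% bound
# suggested above.
for n in (10**3, 10**6, 10**9, 10**14):
    print(n, round(uncertainty_factor(n), 3))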
3. DERIVING VALUATIONS FOR DATABASE OPERATIONS

In this chapter we will describe the valuation derivation for the main database operations by first discussing core relational operations. Furthermore, we address specifics of join operations and finally functions (aggregate, user-defined, stored procedures) which are defined in SQL.

3.1 Core Operations of Relational Algebra

The relational algebra [4] consists of six basic operators, where selection, projection, and rename are unary operations and union, set difference, and Cartesian product are operators that take two relations as input (binary operations). Due to the fact that applying rename to a relation or attribute will not change the monetary value, we will only consider the rest.

Projection

The projection π_attr_list(r(R)) is a unary operation and eliminates all attributes (columns) of an input relation r(R) except those mentioned in the attribute list. For the computation of the monetary value of such a projection, only the mvals of the chosen attributes of the input relation are considered, while taking into account that a projection may eliminate duplicates (shown in Eq. (5)).

    mval(\pi_{A_j,..,A_k}(r(R))) = \sum_{i=j}^{k} mval(A_i) \cdot |\pi_{A_j,..,A_k}(r(R))|     (5)

Selection

According to the relational algebra, a selection of a certain relation σ_pred(r(R)) reduces tuples to a subset which satisfies specified predicates. Because the selection reduces the number of tuples, the calculation of the monetary value does not have to consider those filtered tuples and only the number of present tuples is relevant (shown in Eq. (6)).

    mval(\sigma_{pred}(r(R))) = mval(t \in r(R)) \cdot |\sigma_{pred}(r(R))|                    (6)

Set Union

A relation of all distinct elements (resp. tuples) of any two relations is called the union (denoted by ∪) of those relations. For performing set union, the two involved relations must be union-compatible – they must have the same set of attributes. In symbols, the union is represented as R1 ∪ R2 = {x : x ∈ R1 ∨ x ∈ R2}. However, if two relations contain identical tuples, these tuples exist only once within the resulting relation, meaning duplicates are eliminated. Accordingly, the mval of a union of two relations is computed by adding the mvals of both relations and subtracting the mval of the duplicates (shown in Eq. (7)).

    mval(R_1 \cup R_2) = mval(r(R_1)) + mval(r(R_2)) - \sum_{i} mval(t_i \in r(R_1 \cap R_2))   (7)

Set Difference

The difference of relations R1 and R2 is the relation that contains all the tuples that are in R1 but do not belong to R2. The set difference is denoted by R1 − R2 or R1 \ R2 and defined by R1 \ R2 = {x : x ∈ R1 ∧ x ∉ R2}. Also, the set difference is union-compatible, meaning the relations must have the same number of attributes and the domain of each attribute is the same in both R1 and R2. The mval of a set difference of two relations is computed by subtracting the mval of the tuples that both relations have in common from the monetary value of R1, given by Equation (8).

    mval(R_1 \setminus R_2) = mval(r(R_1)) - \sum_{i} mval(t_i \in r(R_1 \cap R_2))             (8)

Cartesian Product

The Cartesian product, also known as cross product, is an operator which works on two relations, just as set union and set difference. However, the Cartesian product is the costliest operator to evaluate [9], because it combines the tuples of one relation with all the tuples of the other relation – it pairs rows from both tables. Therefore, if the input relations R1 and R2 have n and m rows, respectively, the result set will contain n * m rows and consist of the columns of R1 and the columns of R2. Because the number of tuples of the outgoing relations is known, the monetary value is a summation of all attribute valuations multiplied by the number of rows of both relations, given by Equation (9). We are fully aware that by a user mistake, e.g. using a cross join instead of a natural join, thresholds will be exceeded and the user will be classified as potentially suspicious. However, we recommend a multiplication of the monetary value of both source relations instead of a summation due to the fact that the calculation of the monetary value needs to be consistent also when combining different operators. For that reason, by following our recommendation, we ensure that an inner join is valuated with the same monetary value as the respective combination of a cross join (Cartesian product) and a selection on the join condition.

    mval(r(R_1 \times R_2)) = mval(t \in r(R_1)) \cdot |r(R_1)| + mval(t \in r(R_2)) \cdot |r(R_2)|     (9)
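The derivations of Eqs. (5)–(9) can be sketched as follows, assuming relations are represented as lists of dictionaries and attribute valuations as a dictionary (an illustrative simplification, not an actual implementation):

def mval_projection(attr_mvals, rel, keep):
    # Eq. (5): sum of the kept attribute mvals times the duplicate-free row count.
    distinct = {tuple(row[a] for a in keep) for row in rel}
    return sum(attr_mvals[a] for a in keep) * len(distinct)

def mval_selection(attr_mvals, rel, pred):
    # Eq. (6): mval of one tuple times the number of qualifying tuples.
    return sum(attr_mvals.values()) * sum(1 for row in rel if pred(row))

def mval_union(attr_mvals, rel1, rel2):
    # Eq. (7): mval(r1) + mval(r2) minus the mval of the common (duplicate) tuples.
    t = sum(attr_mvals.values())
    duplicates = [row for row in rel1 if row in rel2]
    return t * (len(rel1) + len(rel2) - len(duplicates))

def mval_difference(attr_mvals, rel1, rel2):
    # Eq. (8): mval(r1) minus the mval of the tuples both relations have in common.
    t = sum(attr_mvals.values())
    common = [row for row in rel1 if row in rel2]
    return t * (len(rel1) - len(common))

def mval_cartesian(attr_mvals1, rel1, attr_mvals2, rel2):
    # Eq. (9) as printed: mval(t in r1) * |r1| + mval(t in r2) * |r2|.
    return (sum(attr_mvals1.values()) * len(rel1)
            + sum(attr_mvals2.values()) * len(rel2))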
3.2 Join Operations

In the context of relational databases, a join is a binary operation of two tables (resp. data sources). The result set of a join is an association of tuples from one table with tuples from another table by concatenating the concerned attributes. Joining is an important operation and most often performance critical to certain queries that target tables whose relationships to each other cannot be followed directly. Because the type of join affects the number of resulting tuples and their attributes, the monetary value of each join needs to be calculated independently.

Inner Join

An inner join produces a result table containing composite rows of the involved tables that match some pre-defined, or explicitly specified, join condition. This join condition can be any simple or compound search condition, but does not have to contain a subquery reference. The valuation of an inner join is computed by the sum of the monetary values of all attributes of a composite row multiplied by the number of rows within the result set. Because the join attribute A_join of two joined tables has to be counted only once, we need to subtract it (shown in Eq. (10)).

    mval(r(R_1 \bowtie R_2)) = |r(R_1 \bowtie R_2)| \cdot (mval(t \in r(R_1)) + mval(t \in r(R_2)) - mval(A_{join}))     (10)

Outer Join

An outer join does not require matching records for each tuple of the concerned tables. The joined result table retains all rows from at least one of the tables mentioned in the FROM clause, as long as those rows are consistent with the search condition. Outer joins are subdivided further into left, right, and full outer joins. The result set of a left outer join (or left join) includes all rows of the first mentioned table (left of the join keyword) merged with attribute values of the right table where the join attribute matches. In case there is no match, attributes of the right table are set to NULL. The right outer join (or right join) will return rows that have data in the right table, even if there are no matching rows in the left table, enhanced by attributes (with NULL values) of the left table. A full outer join is used to retain the non-matching information of all affected tables by including non-matching rows in the result set. To cumulate the monetary value for a query that contains a left or right outer join, we only need to compute the monetary value of an inner join of both tables and add the mval of an antijoin r(R_1 ▷ R_2) ⊆ r(R_1), which includes only tuples of R_1 that do not have a join partner in R_2 (shown in Eq. (11)). For the monetary value of a full outer join, we additionally consider an antijoin r(R_2 ▷ R_1) ⊆ r(R_2), which includes tuples of R_2 that do not have a join partner, given by Equation (12).

    mval(r(R_1 ⟕ R_2)) = mval(r(R_1 \bowtie R_2)) + mval(r(R_1 ▷ R_2))                                   (11)

    mval(r(R_1 ⟗ R_2)) = mval(r(R_1 \bowtie R_2)) + mval(r(R_1 ▷ R_2)) + mval(r(R_2 ▷ R_1))              (12)

Semi Join

A semi join is similar to the inner join, but with the addition that only attributes of one relation are represented in the result set. Semi joins are subdivided further into left and right semi joins. The left semi join operator returns each row from the first input relation (left of the join keyword) when there is a matching row in the second input relation (right of the join keyword). The right semi join is computed vice versa. The monetary value for a query that uses semi joins can be easily cumulated by multiplying the sum of the monetary values of the included attributes with the number of matching rows of the outgoing relation (shown in Eq. (13)).

    mval(r(R_1 ⋉ R_2)) = \sum_{A_i \in R_1} mval(A_i) \cdot |r(R_1 ⋉ R_2)|                               (13)

Nevertheless, we do have an information gain by knowing that join attributes of R_1 have some join partners within R_2 which are not considered. But adding our uncertainty factor UF in this equation would lead to inconsistency when cumulating the mval of a semi join compared to the mval of a combination of a natural join and a projection. In future work, we will solve this issue by presenting a calculation that is based on a combination of projections and joins to cover such an implicit information gain.

3.3 Aggregate Functions

In computer science, an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning. The SQL aggregate functions are useful when mathematical operations must be performed on all or on a group of values. For that reason, they are frequently used with the GROUP BY clause within a SELECT statement. According to the SQL standard, the following aggregate functions are implemented in most DBMS and are the ones used most often: COUNT, AVG, SUM, MAX, and MIN.

All aggregate functions are deterministic, i.e. they return the same value any time they are called by using the same set of input values. SQL aggregate functions return a single value, calculated from values within one column of an arbitrary relation [10]. However, it should be noted that except for COUNT, these functions return a NULL value when no rows are selected. For example, the function SUM performed on no rows returns NULL, not zero as one might expect. Furthermore, except for COUNT, aggregate functions ignore NULL values during computation. All aggregate functions are defined in the SQL:2011 standard or ISO/IEC 9075:2011 (under the general title "Information technology - Database languages - SQL"), which is the seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language.

To be able to compute the monetary value of a derived, aggregated attribute, we need to consider two more factors. First of all, we divide the aggregate functions into two groups: informative and conservative.

1. Informative are those aggregate functions where the aggregated value of a certain aggregate function leads to an information gain over the entire input of all attribute values. This means that every single attribute value participates in the computation of the aggregated attribute value. Representatives for informative aggregate functions are COUNT, AVG and SUM.

2. Conservative, on the contrary, are those functions where the aggregated value is represented by only one attribute value, but in consideration of all other attribute values. So if the aggregated value is again separated from the input set, all other attribute values will remain. Conservative aggregate functions are MAX and MIN.

The second factor that needs to be considered is the number of attributes that are used to compute the aggregated values. In case of a conservative aggregate function, it is simple, because only one attribute value is part of the output. For that reason we recommend to leave the mval of the source attribute unchanged (shown in Eq. (14)).

    mval(A_i) = mval(MAX(A_i)) = mval(MIN(A_i))                                                          (14)

For the informative aggregate functions the computation is more challenging due to several participating attribute values. Because several input attribute values are concerned, we recommend the usage of our uncertainty factor which we already mentioned in a prior section. With the uncertainty factor it is possible to integrate the number of attribute values in a way that a higher number of concerned attributes leads to an increase in percentage terms of the monetary value of the aggregated attribute value, given by Equation (15).

    mval(COUNT(A_i)) = mval(SUM(A_i)) = mval(AVG(A_i)) = \frac{1}{30} \log_{10}(|A_i| + 1) \cdot mval(A_i)     (15)

3.4 Scalar Functions

Besides the SQL aggregate functions, which return a single value calculated from values in a column, there are also scalar functions defined in SQL that return a single value based on the input value. The possibly most commonly used and well-known scalar functions are:

• UCASE() - Converts a field to upper case
• LCASE() - Converts a field to lower case
• LEN() - Returns the length of a text field
• ROUND() - Rounds a number to a specified degree
• FORMAT() - Formats how a field is to be displayed

Returned values of these scalar functions are always derived from one source attribute value, and some of them do not even change the main content of the attribute value. Therefore, we recommend that the monetary value of the source attribute stays untouched.
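Similarly, the join and aggregate valuations of Eqs. (10)–(15) can be sketched as follows (again an illustrative simplification with freely chosen parameter names; tuple mvals and cardinalities are assumed to be known):

import math

def mval_inner_join(tuple_mval_r1, tuple_mval_r2, mval_join_attr, result_rows):
    # Eq. (10): per composite row, both tuple mvals minus the shared join attribute.
    return result_rows * (tuple_mval_r1 + tuple_mval_r2 - mval_join_attr)

def mval_outer_join(inner_join_mval, antijoin_r1_mval, antijoin_r2_mval=0.0):
    # Eq. (11): inner join plus the antijoin of the preserved side;
    # Eq. (12): for a full outer join, both antijoins are added.
    return inner_join_mval + antijoin_r1_mval + antijoin_r2_mval

def mval_semi_join(attr_mvals_r1, matching_rows):
    # Eq. (13): mvals of the attributes of R1 times the number of matching rows.
    return sum(attr_mvals_r1.values()) * matching_rows

def mval_aggregate(func, mval_attr, num_values):
    # Eq. (14): conservative functions (MAX, MIN) keep the attribute mval;
    # Eq. (15): informative functions (COUNT, SUM, AVG) scale it by the
    # uncertainty factor of Eq. (4).
    if func.upper() in ("MAX", "MIN"):
        return mval_attr
    if func.upper() in ("COUNT", "SUM", "AVG"):
        return math.log10(num_values + 1) / 30.0 * mval_attr
    raise ValueError("unsupported aggregate function: " + func)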
3.5   User-Defined Functions
User-defined functions (UDF) are subroutines made up of one or several SQL or programming extension statements that can be used to encapsulate code for reuse. Most database management systems (DBMS) allow users to create their own user-defined functions and do not limit them to the built-in functions of their SQL programming language (e.g., T-SQL, PL/SQL, etc.). User-defined functions in most systems are created by using the CREATE FUNCTION statement, and users other than the owner must be granted appropriate permissions on a function before they can use it. Furthermore, UDFs can be either deterministic or nondeterministic. A deterministic function always returns the same results for equal input, whereas a nondeterministic function may return different results every time it is called.
On the basis of the multiple possibilities offered by most DBMS, it is impossible to estimate all feasible results of a UDF. Also, due to several features like shrinking, concatenating, and encrypting of return values, a valuation of a single output value or an array of output values is practically impossible. For this reason we decided not to calculate the monetary value depending on the output of a UDF; instead, we consider the attribute values that are passed to a UDF (shown in Eq. (16)). This assumption is also the most reliable one, because no matter what happens inside the UDF – which we treat as a black box – the information loss after inserting the input values cannot get worse.

   mval(UDF_output(A_a, .., A_g)) = mval(UDF_input(A_k, .., A_p)) = \sum_{i=k}^{p} mval(A_i)            (16)
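As an illustration of this rule, consider the following hypothetical UDF (the syntax follows the SQL/PSM style used, e.g., by MySQL, and may differ in other systems; table and column names are made up for this sketch):

   CREATE FUNCTION net_salary(salary DECIMAL(10,2), tax_rate DECIMAL(4,2))
   RETURNS DECIMAL(10,2)
   DETERMINISTIC
   RETURN salary - (salary * tax_rate / 100);

   -- A call such as
   --   SELECT net_salary(salary, tax_rate) FROM Employee;
   -- is valued according to Eq. (16) by its inputs,
   -- i.e. mval = mval(salary) + mval(tax_rate),
   -- independent of what the function body computes.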
3.6   Stored Procedures
Stored procedures (SP) are stored similar to user-defined functions (UDF) within a database system. The major difference is that stored procedures have to be called, whereas the return values of UDFs are used in other SQL statements in the same way pre-installed functions are used (e.g., LEN, ROUND, etc.). A stored procedure, which depending on the DBMS is also called proc, sproc, StoredProc or SP, is a group of SQL statements compiled into a single execution plan [13] and mostly developed for applications that need to access a relational database system easily. Furthermore, SPs combine and provide logic also for extensive or complex processing that requires the execution of several SQL statements, which previously had to be implemented in an application. Also a nesting of SPs is feasible by executing one stored procedure from within another. Typical uses for SPs are data validation (integrated into the database) or access control mechanisms [13].
Because stored procedures have such a complex structure, nesting is also legitimate, and SPs are "only" a group of SQL statements, we recommend to value each single statement within an SP and to sum up all partial results (shown in Eq. (17)). With this assumption we follow the principle that single SQL statements are moved into stored procedures to provide simple access for applications which only need to call the procedures.

   mval(SP(r(R_j), .., r(R_k))) = \sum_{i=j}^{k} mval(r(R_i))            (17)

Furthermore, by summing all partial results, we make sure that the worst case of information loss is considered, entirely in line with our general idea of a leakage-resistant data valuation that should prevent a massive data extraction. However, since SPs represent a completed unit, on reaching the truncate threshold the whole SP will be blocked and rolled back. For that reason, we recommend smaller SPs or, respectively, splitting existing SPs in a DBS with enabled leakage-resistant data valuation.
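To make Eq. (17) concrete, consider a hypothetical stored procedure consisting of two statements (again only a sketch; names and syntax are illustrative):

   CREATE PROCEDURE monthly_report(IN dept_id INT)
   BEGIN
     -- statement 1: valued like an ordinary query on Employee
     SELECT last_name, salary FROM Employee WHERE department = dept_id;
     -- statement 2: valued like an ordinary update on Department
     UPDATE Department SET last_report = CURRENT_DATE WHERE id = dept_id;
   END;

   -- Following Eq. (17), mval(monthly_report) is the sum of the values of
   -- both statements; if the truncate threshold is reached, the whole
   -- procedure call is blocked and rolled back.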
4.   RELATED WORK
Conventional database management systems mostly use access control models to face unauthorized access on data. However, these are insufficient when an authorized individual extracts data, regardless of whether she is the owner or has stolen that account. Several methods were conceived to eliminate those weaknesses. We refer to Park and Giordano [14], who give an overview of requirements needed to address the insider threat.
Authorization views partially achieve those crucial goals of an extended access control and have been proposed several times. For example, Rizvi et al. [15] as well as Rosenthal et al. [16] use authorization-transparent views. In detail, incoming user queries are only admitted if they can be answered using information contained in authorization views. Contrary to this, we do not prohibit a query in its entirety. Another approach based on views was introduced by Motro [12]. Motro handles only conjunctive queries and answers a query only with a part of the result set, but without any indication why it is partial. We do handle information-enhancing operations (e.g., joins) as well as coarsening operations (e.g., aggregation), and we do display a user notification. All authorization view approaches require an explicit definition of a view for each possible access need, which also imposes the burden of knowing and directly querying these views. In contrast, the monetary values of attributes are set while defining the tables, and the user can query the tables or views she is used to. Moreover, the equivalence test of general relational queries is undecidable and equivalence for conjunctive queries is known to be NP-complete [3]. Therefore, the leakage-resistant data valuation is more applicable, because it does not have to face those challenges.
However, none of these methods considers the sensitivity level of data that is extracted by an authorized user. In the field of privacy-preserving data publishing (PPDP), on the contrary, several methods are provided for publishing useful information while preserving data privacy. In detail, multiple security-related measures (e.g., k-anonymity [17], l-diversity [11]) have been proposed, which aggregate information within a data extract in a way that it cannot lead to an identification of a single individual. We refer to Fung et al. [5], who give a detailed overview of recent developments in methods and tools of PPDP. However, these mechanisms are mainly used for privacy-preserving tasks and are not in use when an insider accesses data. They are not applicable for our scenario, because they do not consider a line-by-line extraction over time as well as the information loss caused by aggregating attributes.
To the best of our knowledge, there is only the approach of Harel et al. ([6], [7], [8]) that is comparable to our data valuation to prevent suspicious, authorized data extractions. Harel et al. introduce the Misuseability Weight (M-score) that describes the sensitivity level of the data exposed to the user. Hence, Harel et al. focus on the protection of the quality of information, whereas our approach predominantly prevents the extraction of a collection of data (quantity of information). Harel et al. also do not consider extractions over time, logging of malicious requesters, and the backup process. In addition, mapping attributes to a certain monetary value is much more applicable and intuitive than mapping to an artificial M-score.
Our extended authorization control does not limit the system to a simple query-authorization control without any protection against the insider threat; rather, we allow a query to be executed whenever the information carried by the query is legitimate according to the specified authorizations and thresholds.

5.   CONCLUSIONS AND FUTURE WORK
In this paper we described conceptual background details for a novel approach for database security. The key contribution is to derive valuations for query results by considering the most important operations of the relational algebra as well as SQL and providing specific mval functions for each of them. While some of these rules are straightforward, e.g. for core operations like selection and projection, other operations like specific join operations require some more thorough considerations. Further operations, e.g. grouping and aggregation or user-defined functions, would actually require application-specific valuations. To minimize the overhead for using valuation-based security, we discuss and recommend some reasonable valuation functions for these cases, too.
As the results presented here are merely of conceptual nature, our current and future research includes considering implementation alternatives, e.g. integrated with a given DBMS or as part of a middleware or driver, as well as evaluating the overhead and the effectiveness of the approach. We will also come up with a detailed recommendation of how to set monetary values appropriate to different environments and situations. Furthermore, we plan to investigate further possible use cases for data valuation, such as billing of data-providing services on a fine-grained level and controlling benefit/cost trade-offs for data security and safety.

6.   ACKNOWLEDGMENTS
This research has been funded in part by the German Federal Ministry of Education and Science (BMBF) through the Research Program under Contract FKZ: 13N10818.

7.   REFERENCES
 [1] S. Barthel and E. Schallehn. The Monetary Value of Information: A Leakage-Resistant Data Valuation. In BTW Workshops, BTW'2013, pages 131–138. Köln Verlag, 2013.
 [2] E. Bertino and R. Sandhu. Database Security - Concepts, Approaches, and Challenges. IEEE Trans. Dependable and Secure Comp., 2(1):2–19, Mar. 2005.
 [3] A. K. Chandra and P. M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In Proc. of the 9th Annual ACM Symposium on Theory of Computing, STOC'77, pages 77–90. ACM, 1977.
 [4] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6):377–387, June 1970.
 [5] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Comput. Surv., 42(4):14:1–14:53, June 2010.
 [6] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. M-score: Estimating the Potential Damage of Data Leakage Incident by Assigning Misuseability Weight. In Proc. of the 2010 ACM Workshop on Insider Threats, Insider Threats'10, pages 13–20. ACM, 2010.
 [7] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. Eliciting Domain Expert Misuseability Conceptions. In Proc. of the 6th Int'l Conference on Knowledge Capture, K-CAP'11, pages 193–194. ACM, 2011.
 [8] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. M-Score: A Misuseability Weight Measure. IEEE Trans. Dependable Secur. Comput., 9(3):414–428, May 2012.
 [9] T. Helleseth and T. Klove. The Number of Cross-Join Pairs in Maximum Length Linear Sequences. IEEE Transactions on Information Theory, 37(6):1731–1733, Nov. 1991.
[10] P. A. Laplante. Dictionary of Computer Science, Engineering and Technology. CRC Press, London, England, 1st edition, 2000.
[11] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl. Discov. Data, 1(1):1–50, Mar. 2007.
[12] A. Motro. An Access Authorization Model for Relational Databases Based on Algebraic Manipulation of View Definitions. In Proc. of the 5th Int'l Conference on Data Engineering, pages 339–347. IEEE Computer Society, 1989.
[13] J. Natarajan, S. Shaw, R. Bruchez, and M. Coles. Pro T-SQL 2012 Programmer's Guide. Apress, Berlin-Heidelberg, Germany, 3rd edition, 2012.
[14] J. S. Park and J. Giordano. Access Control Requirements for Preventing Insider Threats. In Proc. of the 4th IEEE Int'l Conference on Intelligence and Security Informatics, ISI'06, pages 529–534. Springer, 2006.
[15] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy. Extending Query Rewriting Techniques for Fine-Grained Access Control. In Proc. of the 2004 ACM SIGMOD Int'l Conference on Management of Data, SIGMOD'04, pages 551–562. ACM, 2004.
[16] A. Rosenthal and E. Sciore. View Security as the Basis for Data Warehouse Security. In CAiSE Workshop on Design and Management of Data Warehouses, DMDW'2000, pages 5–6. CEUR-WS, 2000.
[17] L. Sweeney. K-Anonymity: A Model For Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, Oct. 2002.
  TrIMPI: A Data Structure for Efficient Pattern Matching on
                       Moving Objects

                          Tsvetelin Polomski                                             Hans-Joachim Klein
                 Christian-Albrechts-University at Kiel                         Christian-Albrechts-University at Kiel
                    Hermann-Rodewald-Straße 3                                      Hermann-Rodewald-Straße 3
                              24118 Kiel                                                     24118 Kiel
                  tpo@is.informatik.uni-kiel.de                                   hjk@is.informatik.uni-kiel.de

ABSTRACT
Managing movement data efficiently often requires the exploitation of some indexing scheme. Taking into account the kind of queries issued to the given data, several indexing structures have been proposed which focus on spatial, temporal or spatio-temporal data. Since all these approaches consider only raw data of moving objects, they may be well-suited if the queries of interest contain concrete trajectories or spatial regions. However, if the query consists only of a qualitative description of a trajectory, e.g. by stating some properties of the underlying object, sequential scans on the whole trajectory data are necessary to compute the property, even if an indexing structure is available.
The present paper presents some results of ongoing work on a data structure for Trajectory Indexing using Motion Property Information (TrIMPI). The proposed approach is flexible since it allows the user to define application-specific properties of trajectories which have to be used for indexing. Thereby, we show how to efficiently answer queries given in terms of such qualitative descriptions. Since the index structure is built on top of ordinary data structures, it can be implemented in arbitrary database management systems.

Keywords
Moving object databases, motion patterns, indexing structures

1.   INTRODUCTION AND MOTIVATION
Most index structures for trajectories considered in the literature (e.g. [8]) concentrate on (time dependent) positional data, e.g. the R-Tree [9] or the TPR*-Tree [17]. There are different approaches (e.g. [1], [12]) exploiting transformation functions on the original data and thereby reducing the indexing overhead through "light versions" of the trajectories to be indexed. In these approaches only stationary data is being handled. In cases where the queries of interest consist of concrete trajectories or polygons covering them, such indexing schemata as well as trajectory compression techniques (e.g. [1], [6], [10], [12], [13]) may be well-suited. However, there are applications [14] where a query may consist only of a qualitative description, e.g. return all trajectories where the underlying object slowed down (during any time interval) and after that it changed its course. Obviously, the motion properties slowdown and course alteration as well as their temporal adjustment can be computed using formal methods. The crucial point is that, even if an indexing structure is used, the stated properties must be computed for each trajectory, and this results in sequential scan(s) on the whole trajectory data. Time-consuming processing of queries is not acceptable, however, in a scenario where fast reaction on incoming data streams is needed. An example of such a situation, with so-called tracks computed from radar and sonar data as input, is the detection of patterns of skiff movements typical for many piracy attacks [14]. A track comprises the position of an object at a time moment and can hold additional information, e.g. about its current course and velocity. Gathering the tracks of a single object over a time interval yields its trajectory over this interval.
To address the efficiency problem, we propose an indexing scheme which is not primarily focused on the "time-position data" of trajectories but uses meta information about them instead.
We start with a discussion of related work in Section 2. Section 3 provides some formal definitions on trajectories and their motion properties. In Section 4 we introduce the indexing scheme itself and illustrate algorithms for querying it. Section 5 summarizes the present work and outlines our future work.

2.   RELATED WORK
In this section we provide a short overview on previous contributions which are related to our approach. We start the section by reviewing classical indexing structures for moving objects data. Next to this, we show an approach which is similar in general terms to the proposed one, and finally we review literature related to semantical aspects of moving objects.

2.1   Indexing of Spatial, Temporal and Spatio-Temporal Data
The moving object databases community has developed several data structures for indexing movement data. According to [8], these structures can be roughly categorized as structures indexing only spatial data, also known as spatial access methods (SAM); indexing approaches for temporal data, also known as temporal index structures; and those which manage both spatial and temporal data, also known as spatio-temporal index structures. One of the first structures developed for SAMs is the well-known R-Tree [9]. Several extensions of R-Trees have been provided over the years, thus yielding a variety of spatio-temporal index structures. An informal schematic overview on these extensions, including also new developments such as the HTPR*-Tree [7], can be found in [11]. Since all of the proposed access methods focus mainly on the raw
spatio-temporal data, they are well-suited for queries on the history of movement and predicting new positions of moving objects, or for returning the most similar trajectories to a given one. If a query consists only of a qualitative description, however, all the proposed indexing structures are of no use.

2.2   Applying Dimensionality Reduction upon Indexing - the GEMINI Approach
The overall approach we consider in this work is similar to the GEMINI (GEneric Multimedia INdexIng method) indexing scheme presented in [6]. This approach was originally proposed for time series and has later been applied to other types of data, e.g. for motion data in [16]. The main idea behind GEMINI is to reduce the dimensionality of the original data before indexing. Therefore, representatives of much lower dimensionality are created for the data (trajectory or time series) to be indexed by using an appropriate transform, and these representatives are used for indexing. A crucial result in [6] is that the authors proved that, in order to guarantee no false dismissals [12], the exploited transform must retain the distance (or similarity) of the data to be indexed, that is, the distance between representatives should not exceed the distance of the original time series.
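Written as a formula (our notation, not taken verbatim from [6]): if D denotes the distance measure on the original data, D_feature the distance on the representatives and F the transform, the requirement is the lower-bounding condition

   D_feature(F(x), F(y)) ≤ D(x, y)   for all objects x, y,

i.e. distances in the representative space never overestimate the original distances, which is exactly what rules out false dismissals when filtering on the representatives.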
In the mentioned approaches, the authors achieve encouraging results on querying the most similar trajectories (or time series) to a given one. However, since the representatives of the original data are trajectories or time series, respectively, evaluating a query which only describes a motion behavior would result in the inspection of all representatives.

2.3   Semantical Properties of Movement
Semantical properties of movement data have been considered in various works, e.g. in [2], [5], and [15].
The authors of [2] propose a spatio-temporal representation scheme for moving objects in the area of video data. The considered representation scheme distinguishes between spatio-temporal data of trajectories and their topological information, and also utilizes information about distances between pairs of objects. The topological information itself is defined through a set of topological relation operators expressing spatial relations between objects over some time interval, including faraway, disjoint, meet, overlap, is-included-by/includes and same.
In [5], a comprehensive study on the research that has been carried out on data mining and visual analysis of movement patterns is provided. The authors propose a conceptual framework for movement behavior of different moving objects. The extracted behavior patterns are classified according to a taxonomy.
In [15], the authors provide some aspects related to a semantic view of trajectories. They show a conceptual approach for how trajectory behaviors can be described by predicates that involve movement attributes and/or semantic annotations. The provided approach is rather informal and considers behavior analysis of moving objects on a general level.

3.   FORMAL BACKGROUND
This section provides the formal notions as well as the definitions needed throughout the rest of the paper. We start with the term trajectory and later direct our attention to motion properties and patterns.

3.1   Trajectories
In our approach we consider the trajectory τo of an object o simply as a function of time which assigns a position to o at any point in time. Since time plays a role only for the determination of temporal causality between the positions of an object, we abstract from "real time" and use any time domain instead. A time domain is any set which is interval scaled and countably infinite. The first requirement ensures that timestamps can be used for ordering and, furthermore, that the "delay" between two time assignments can be determined. The second requirement ensures that we have an infinite number of "time moments" which can be unambiguously indexed by elements of N. In the following we denote a time domain by T.
Since objects move in a space, we also need a notion for a spatial domain. In the following, let S denote the spatial domain. We require that S is equipped with an adequate metric, such as the Euclidean distance (e.g. for S = R × R), which allows us to measure the spatial distance between objects.
Having the notions of time and space, we can define the term trajectory formally.

   Definition 1. Let T, S and O denote a time domain, a space domain and a set of distinct objects, respectively. Then, the trajectory τo of an object o ∈ O is a function τo : T → S.

For brevity, we can also write the trajectory of an object o ∈ O in the form (o, t0, s0), (o, t1, s1), . . . for those t ∈ T where τo(t) = s is defined. A single element (o, ti, si) is called the track of object o at time ti.

3.2   Motion Patterns
We consider a motion pattern as a sequence of properties of trajectories which reveal some characteristics of the behavior of the underlying moving objects. Such properties may be expressed through any predicates which are important for the particular analysis, such as start, stop, turn, or speedup.

   Definition 2. Let T be a time domain, 𝒯 be the set of trajectories of an object set O over T, and I_T be the set of all closed intervals over T. A motion property on 𝒯 is a function p : 2^𝒯 × I_T → {true, false}.

That is, a motion property is fulfilled for a set of trajectories and a certain time interval if the appropriate predicate is satisfied. To illustrate this definition, some examples of motion properties are provided below:

   • Appearance: Let t ∈ T. Then we define appear(·, ·) as follows: appear({τo}, [t, t]) = true ⇔ ∀t′ ∈ T : τo(t′) ≠ undefined → t ≤ t′. That is, an object "appears" only in the "first" moment it is being observed.

   • Speedup: Let t1, t2 ∈ T and t1 < t2. Then speedup(·, ·) is defined as follows: speedup({τo}, [t1, t2]) = true ⇔ v(τo, t1) < v(τo, t2) ∧ ∀t ∈ T : t1 ≤ t ≤ t2 → v(τo, t1) ≤ v(τo, t) ≤ v(τo, t2), where v(τo, t) denotes the velocity of the underlying moving object o at time t. That is, the predicate speedup is satisfied for a trajectory and a time interval if and only if the velocity of the underlying object is increasing in the considered time interval. Note that the increase may not be strictly monotonic.

   • Move away: Let t1, t2 ∈ T and t1 < t2. Then we define: moveaway({τo1, τo2}, [t1, t2]) = true ⇔ ∀t, t′ ∈ T : t1 ≤ t < t′ ≤ t2 → dist(τo1, τo2, t) < dist(τo1, τo2, t′), where the term dist(τo1, τo2, t) denotes the distance between the underlying moving objects o1 and o2 at time t. That is, two objects are moving away from each other for a time interval if their distance is increasing during the considered time interval.
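In the same manner, further application-specific properties can be added. For instance, a slowdown property – the predicate used in the introductory example – could be defined analogously to speedup (our sketch, not part of the original definition list):

   slowdown({τo}, [t1, t2]) = true ⇔ v(τo, t1) > v(τo, t2) ∧ ∀t ∈ T : t1 ≤ t ≤ t2 → v(τo, t1) ≥ v(τo, t) ≥ v(τo, t2),

i.e. the velocity of the underlying object decreases (not necessarily strictly) over the considered time interval.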
Figure 1: Overview of the index structure

Using motion properties, a motion pattern of a single trajectory or a set of trajectories is defined as a sequence of motion properties ordered by the time intervals in which they are fulfilled. It is important to note that this common definition of a motion pattern allows multiple occurrences of the same motion property in the sequence. In order to get a well-defined notion, it has to be required that the time intervals in which the motion properties are fulfilled are disjoint, or that meaningful preferences on the motion properties are specified in order to allow ordering in case the time intervals overlap.

4.   TRAJECTORY INDEXING USING MOTION PROPERTIES
In this section we explain how the proposed index is being created and used. Index creation starts with the determination of the motion pattern of each trajectory to be indexed. For this purpose, the motion predicates specified by the user are computed. The resulting motion patterns are indexed with references to the original trajectories.
The resulting index is schematically depicted in Figure 1. TrIMPI consists mainly of a data structure holding the raw trajectory data, and secondary index structures for maintaining motion patterns. Thereby, we differentiate between indexing single motion properties and indexing motion patterns.
A query to the index can be stated either through a motion pattern or through a concrete trajectory. The index is searched for motion patterns containing the given one or the computed one, respectively. In both cases, the associated trajectories are returned. The following subsections consider the outlined procedures more precisely.

4.1   Indexing Trajectory Raw Data
Since the focus of TrIMPI is not on querying trajectories by example, the index structure for the raw trajectory data can be rather simple. For our implementation, we considered a trajectory record file as proposed by [3]. This structure (Figure 1) stores trajectories in records of fixed length. The overall structure of the records is as follows:

   ID_o   next_ptr   prev_ptr   {track_0, . . . , track_{num−1}}.

ID_o denotes the identifier of the underlying moving object, next_ptr and prev_ptr are references to the appropriate records holding further parts of the trajectory, and {track_0, . . . , track_{num−1}} is a list of tracks of a predefined fixed length num. If a record r_i for a trajectory τo gets filled, a new record r_j is created for τo holding its further tracks. In this case, next_ptr of r_i is set to point to r_j, and prev_ptr of r_j is set to point to r_i.
Using a trajectory record file, the data is not completely clustered, but choosing an appropriate record size leads to partial clustering of the trajectory data in blocks. This has the advantage that extracting the complete trajectory requires loading only as many blocks as are needed for storing the trajectory.

4.2   Indexing Motion Patterns
For the maintenance of motion patterns we consider two cases - single motion properties and sequences of motion properties. Storing single motion properties allows the efficient finding of trajectories which contain the considered motion property. This is advantageous if the searched property is not often satisfied. Thus, for each motion property p a "list" DBT_p holding all trajectories satisfying this property is maintained. As we shall see in Algorithm 4.3, we have to combine such lists and, thus, a simple unsorted list would not be very favourable. Therefore, we implement these lists through B+-Trees (ordered by the trajectory/object identifiers). An evaluation of the union and intersection of two B+-Trees with m and n leaves can be performed in O(m log((m + n)/m)) [4].
The search for motion patterns with more than one motion property can be conducted through the single DBT_p structures. However, if the query motion pattern is too long, too many intersections of the DBT_p structures will happen and the resulting trajectories will have to be checked for containing properties that match the given order, as well. To overcome this problem, sequences of motion properties are stored in an additional B+-Tree structure DBT. The elements of DBT have the form (p, τo) where p is a motion pattern and o ∈ O. To sort the elements of DBT, we apply lexicographical ordering. As a result, sequences with the same prefix are stored consecutively. Thus, storing of motion patterns that are prefixes of other motion patterns can be omitted.

4.3   Building the Index
The algorithm for the index creation is quite simple. It consists primarily of the following steps:

   • Determine the motion properties for each trajectory τo. Consider, if needed, a sliding window or some reduction or segmenting technique as proposed in [1], [6], [10], [12], [13], for example. Generate a list f of the motion properties of τo, ordered by their appearance in τo.
   • Store τo into the trajectory record file.
   • Apply Algorithm 4.1 to f to generate access keys relevant for indexing.
   • For each generated access key, check whether it is already contained in the index. If this is not the case, store it in the index. Link the trajectory record file entry of τo to the access key.

Algorithm 4.1 is used to generate the index keys of a pattern. An index key is any subpattern p′ = (p′_j)_{j=0}^{m−1} of a pattern p = (p_i)_{i=0}^{n−1} which is defined as follows:

   • For each j ≤ m − 1 there exists i ≤ n − 1 such that p′_j = p_i.
   • For all j, k such that 0 ≤ j < k ≤ m − 1 there exist i, l such that 0 ≤ i < l ≤ n − 1 and p′_j = p_i and p′_k = p_l.

To generate the list of index keys, Algorithm 4.1 proceeds iteratively. At each iteration of the outer loop (lines 3 to 16) the algorithm considers a single element p of the input sequence f. On the one hand, p is being added as an index key to the (interim) result (lines 14 and 15) and on the other hand it is being appended as a suffix to each previously generated index key (inner loop - lines 5 to 13). Algorithm 4.1 utilizes two sets whose elements are lists of motion properties - supplist and entries.
The set supplist contains at each iteration the complete set of index keys, including those which are prefixes of other patterns. The set entries is built in each iteration of the inner loop (lines 5 to 13) by appending the current motion property of the input sequence to any element of supplist. Thereby, at line 14 entries holds only index keys which are no prefixes of other index keys. Since the resulting lists of index keys are stored in a B+-Tree by applying a lexicographical order, sequences of motion properties which are prefixes of other sequences can be omitted. Therefore, the set entries is returned as final result (line 17).
Since the given procedure may result in the computation of up to 2^{k_0} different indexing keys for an input sequence with k_0 motion properties, a global constant G is used to limit the maximal length of index keys. Using an appropriate value for G leads to no drawbacks for the application. Furthermore, the proposed querying algorithm can handle queries longer than G.
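For illustration (an example of ours, not taken from the algorithm description): for the input sequence f = (speedup, turn, stop) and a sufficiently large G, Algorithm 4.1 ends with supplist containing all 2^3 − 1 = 7 non-empty subpatterns

   [speedup], [turn], [stop], [speedup, turn], [speedup, stop], [turn, stop], [speedup, turn, stop],

while the returned set entries contains only the keys that are not prefixes of other keys, namely [stop], [speedup, stop], [turn, stop] and [speedup, turn, stop].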
Algorithm 4.1 Building the indexing keys
Require: f is a sequence of motion properties
Require: G is the maximal length of sequences to be indexed
 1 function createIndexKeys( f )
 2     supplist ← empty set of lists
 3     for all a ∈ f do
 4         entries ← empty set of lists
 5         for all l ∈ supplist do
 6             new ← empty list
 7             if |l| ≤ G then
 8                 new ← l.append(a)
 9             else
10                 new ← l
11             end if
12             entries ← entries ∪ {new}
13         end for
14         entries ← entries ∪ {[a]}
15         supplist ← entries ∪ supplist
16     end for
17     return entries
18 end function

4.4   Searching for Motion Patterns
Since the index is primarily considered to support queries on sequences of motion properties, the appropriate algorithm for evaluating such queries given in the following is rather simple. In its "basic" version, query processing is just traversing the index and returning all trajectories referenced by index keys which contain the queried one (as a subpattern). This procedure is illustrated in Algorithm 4.2. There are, however, some special cases which have to be taken into account.

Algorithm 4.2 Basic querying of trajectories with a sequence of motion properties
Require: s is a sequence of motion properties; |s| ≤ G
Require: DBT is the index containing motion patterns
 1 function GetEntriesFromDBT(s)
 2     result ← {τo | ∃p s.t. s ≤ p ∧ (p, τo) ∈ DBT}
 3     return result
 4 end function

The first of them considers query sequences which are "too short". As stated in Section 4.2, it can be advantageous to evaluate queries containing only few motion properties by examination of the index structures for single motion properties. To be able to define an application-specific notion of "short" queries, we provide besides G an additional global parameter α for which 1 ≤ α < G holds. In Algorithm 4.3, which evaluates queries of patterns of arbitrary length, each pattern of length shorter than α is being handled in the described way (lines 3 to 8). It is important that each trajectory of the interim result is checked for whether it matches the queried pattern (lines 9 to 13).
The other special case are queries longer than G (lines 16 to 24). As we have seen in Algorithm 4.1, in such cases the index keys are cut to prefixes of length G. Thus, the extraction in this case considers the prefix of length G of the query sequence (line 17) and extracts the appropriate trajectories (line 18). Since these trajectories may still not match the query sequence, e.g. by not fulfilling some of the properties appearing at a position after G − 1 in the input sequence, an additional check of the trajectories in the interim result is made (lines 19 to 23).
The last case to consider are query sequences with length between α and G. In these cases, the index DBT holding the index keys is searched through a call to Algorithm 4.2 and the result is returned.

Algorithm 4.3 Querying trajectories with a sequence of arbitrary length
Require: s is a sequence of motion properties
Require: G is the maximal length of stored sequences
Require: DBT_p is the index of the property p
Require: 1 ≤ α < G maximal query length for searching single property indexes
 1 function GetEntries(s)
 2     result ← empty set
 3     if |s| < α then
 4         result ← 𝒯
 5         for all p ∈ s do
 6             suppset ← DBT_p
 7             result ← result ∩ suppset
 8         end for
 9         for all τo ∈ result do
10             if ! match(τo, s) then
11                 result ← result \ {τo}
12             end if
13         end for
14     else if |s| ≤ G then
15         result ← GetEntriesFromDBT(s)
16     else
17         k ← s[0..G − 1]
18         result ← GetEntriesFromDBT(k)
19         for all τo ∈ result do
20             if ! match(τo, s) then
21                 result ← result \ {τo}
22             end if
23         end for
24     end if
25     return result
26 end function

Finally, the function Match (Algorithm 4.4) checks whether a trajectory τo fulfills a pattern s. For this purpose, the list of motion properties of τo is being generated (line 2). Thereafter, s and the generated pattern of τo are traversed (lines 5 to 14) so that it can be checked whether the elements of s can be found in the trajectory pattern of τo in the same order. In this case the function Match returns true, otherwise it returns false.

Algorithm 4.4 Checks whether a trajectory matches a motion pattern
Require: τo is a valid trajectory
Require: s is a sequence of motion properties
 1 function match(τo, s)
 2     motion_properties ← compute the list of motion properties of τo
 3     index_s ← 0
 4     index_props ← 0
 5     while index_props < motion_properties.length do
 6         if motion_properties[index_props] = s[index_s] then
 7             index_s ← index_s + 1
 8         else
 9             index_props ← index_props + 1
10         end if
11         if index_s = s.length then
12             return true
13         end if
14     end while
15     return false
16 end function

5.   CONCLUSIONS AND OUTLOOK
In this paper we provided some first results of ongoing work on an indexing structure for trajectories of moving objects called TrIMPI. The focus of TrIMPI lies not on indexing spatio-temporal data but on the exploitation of motion properties of moving objects. For this purpose, we provided a formal notion of motion properties and showed how they form a motion pattern. Furthermore, we showed how these motion patterns can be used to build a meta index. Algorithms for querying the index were also provided. In the next steps, we will finalize the implementation of TrIMPI and perform tests in the scenario of the automatic detection of piracy attacks mentioned in the Introduction. As a conceptual improvement of the work provided in this paper, we consider a flexibilisation of the definition of motion patterns including arbitrary temporal relations between motion predicates.
6.   ACKNOWLEDGMENTS
The authors would like to give special thanks to their former student Lasse Stehnken for his help in implementing TrIMPI.

7.   REFERENCES
 [1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In D. B. Lomet, editor, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, FODO'93, Chicago, Illinois, USA, October 13-15, 1993, volume 730 of Lecture Notes in Computer Science, pages 69–84. Springer, 1993.
 [2] J.-W. Chang, H.-J. Lee, J.-H. Um, S.-M. Kim, and T.-W. Wang. Content-based retrieval using moving objects' trajectories in video data. In IADIS International Conference Applied Computing, pages 11–18, 2007.
 [3] J.-W. Chang, M.-S. Song, and J.-H. Um. TMN-Tree: New trajectory index structure for moving objects in spatial networks. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, pages 1633–1638. IEEE Computer Society, July 2010.
 [4] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, SODA '00, pages 743–752, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.
 [5] S. Dodge, R. Weibel, and A.-K. Lautenschütz. Towards a taxonomy of movement patterns. Information Visualization, 7(3):240–252, June 2008.
 [6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In R. T. Snodgrass and M. Winslett, editors, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, SIGMOD '94, pages 419–429, New York, NY, USA, 1994. ACM.
 [7] Y. Fang, J. Cao, J. Wang, Y. Peng, and W. Song. HTPR*-Tree: An efficient index for moving objects to support predictive query and partial history query. In L. Wang, J. Jiang, J. Lu, L. Hong, and B. Liu, editors, Web-Age Information Management, volume 7142 of Lecture Notes in Computer Science, pages 26–39. Springer Berlin Heidelberg, 2012.
 [8] R. H. Güting and M. Schneider. Moving Object Databases. Data Management Systems. Morgan Kaufmann, 2005.
 [9] A. Guttman. R-Trees: a dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data, SIGMOD '84, pages 47–57, New York, NY, USA, 1984. ACM.
[10] J. Hershberger and J. Snoeyink. Speeding Up the Douglas-Peucker Line-Simplification Algorithm. In P. Bresnahan, editor, Proceedings of the 5th International Symposium on Spatial Data Handling, SDH'92, Charleston, South Carolina, USA, August 3-7, 1992, pages 134–143. University of South Carolina, Humanities and Social Sciences Computing Lab, August 1992.
[11] C. S. Jensen. TPR-Tree Successors 2000–2012. http://cs.au.dk/~csj/tpr-tree-successors, 2013. Last accessed 24.03.2013.
[12] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3(3):263–286, 2001.
[13] E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In N. Cercone, T. Y. Lin, and X. Wu, editors, Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM'01, San Jose, California, USA, 29 November - 2 December 2001, pages 289–296. IEEE Computer Society, 2001.
[14] T. Polomski and H.-J. Klein. How to Improve Maritime Situational Awareness using Piracy Attack Patterns. 2013. Submitted.
[15] S. Spaccapietra and C. Parent. Adding meaning to your steps (keynote paper). In M. Jeusfeld, L. Delcambre, and T.-W. Ling, editors, Conceptual Modeling - ER 2011, 30th International Conference, ER 2011, Brussels, Belgium, October 31 - November 3, 2011, Proceedings, ER'11, pages 13–31. Springer, 2011.
[16] Y.-S. Tak, J. Kim, and E. Hwang. Hierarchical querying scheme of human motions for smart home environment. Eng. Appl. Artif. Intell., 25(7):1301–1312, Oct. 2012.
[17] Y. Tao, D. Papadias, and J. Sun. The TPR*-tree: an optimized spatio-temporal access method for predictive queries. In J. C. Freytag, P. C. Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors, Proceedings of the 29th international conference on Very large data bases - Volume 29, VLDB '03, pages 790–801. VLDB Endowment, 2003.
     Complex Event Processing in Wireless Sensor Networks

                                                            Omran Saleh
                                          Faculty of Computer Science and Automation
                                                Ilmenau University of Technology
                                                       Ilmenau, Germany
                                                 omran.saleh@tu-ilmenau.de


ABSTRACT
Most WSN applications require the number of deployed sensor nodes to be in the order of hundreds, thousands or more in order to monitor certain phenomena and capture measurements over a long period of time. In a centralized architecture, in which the sensor data captured by all the sensor nodes is sent to a central entity, such large sensor networks generate continuous streams of raw events¹.

¹The terms data, events and tuples are used interchangeably.

In this paper, we describe the design and implementation of a system that carries out complex event detection queries inside wireless sensor nodes. These queries filter and remove undesirable events. They can detect complex events and meaningful information by combining raw events with logical and temporal relationships, and output this information to an external monitoring application for further analysis. This system reduces the amount of data that needs to be sent to the central entity by avoiding the transmission of raw data outside the network. Therefore, it can dramatically reduce the communication burden between nodes and improve the lifetime of sensor networks.

We have implemented our approach for the TinyOS operating system on the TelosB and Mica2 platforms. We conducted a performance evaluation of our method, comparing it with a naive method. The results clearly confirm the effectiveness of our approach.

Keywords
Complex Event Processing, Wireless Sensor Networks, In-network processing, Centralized processing, Non-deterministic Finite State Automata

1. INTRODUCTION
Wireless sensor networks are defined as a distributed and cooperative network of devices, denoted as sensor nodes, that are densely deployed over a region, especially in harsh environments, to gather data about some phenomena in the monitored region. These nodes can sense the surrounding environment and share the information with their neighboring nodes. They are gaining adoption on an increasing scale for tracking and monitoring purposes. Furthermore, sensor nodes are often used for control purposes. They are capable of performing simple processing.

In the near future, wireless sensor networks are expected to offer and make conceivable a wide range of applications and to emerge as an important area of computing. WSN technology holds great potential for various application areas. WSNs are now found in many industrial and civilian application areas, military and security applications, environmental monitoring, disaster prevention, health care applications, etc.

One of the most important issues in the design of WSNs is energy efficiency. Each node should be as energy efficient as possible. Processing a chunk of information is less costly than wireless communication; the ratio between them is commonly assumed to be much smaller than one [19]. There is a significant link between energy efficiency and superfluous data: a sensor node consumes unnecessary energy for the transmission of superfluous data to the central entity, which reduces energy efficiency.

Furthermore, traditional WSN software systems do not aim at efficient processing of continuous data or event streams. Following these observations, we are looking for an approach that achieves high performance and power savings by preventing the generation and transmission of needless data to the central entity. Therefore, it can dramatically reduce the communication burden between nodes and improve the lifetime of sensor networks. This approach takes into account the resource limitations in terms of computation power, memory, and communication. Sensor nodes can employ their processing capabilities to perform some computations. Therefore, an in-network complex event processing² based solution is proposed.

²CEP is discussed in reference [15].

We propose to run a complex event processing engine inside the sensor nodes. The CEP engine transforms the raw data into meaningful and beneficial events that are notified to the users after they have been detected. It is responsible for combining primitive events to identify higher-level complex events. This engine provides an efficient Non-deterministic Finite state Automata (NFA) [1] based implementation to drive the evaluation of complex event queries, where the automaton runs as an integral part of the in-network query plan. It also provides the theoretical basis of CEP as well as particular operators (conjunction, negation, disjunction and sequence operators, etc.).

Complex event processing over data streams has increasingly become an important field due to the increasing number of its applications for wireless sensor networks. Various event detection applications have been proposed for WSNs, e.g. for detecting eruptions of volcanoes [18], forest fires, and for the habitat monitoring of animals [5]. An increasing number of applications in such networks is confronted with the necessity to process voluminous data streams in a real-time fashion.

The rest of the paper is organized as follows: Section 2 provides an overview of the naive approaches for normal data and complex event processing in WSNs. Related works are briefly reviewed in Section 3. We then introduce the overall system architecture for performing complex event processing in sensor networks in Section 4. Section 5 discusses how to create logical query plans to evaluate the sensor portion of queries. Section 6 explains our approach and how queries are implemented by automata. In Section 7, the performance of our system is evaluated using a particular simulator. Finally, Section 8 presents our concluding remarks and future work.

2. NAIVE APPROACHES IN WSNS
The ideas behind the naive approaches, which are definitely different from our approach, lie in the processing of data as the central architectural concept. For normal sensor data processing, the centralized approach proceeds in two steps: the sensor data captured by all the sensor nodes is sent to the sink node and then routed to the central server (base station), where it is stored in a centralized database. High-volume data arrives at the server. Subsequently, query processing takes place on this database by running queries against the stored data. Each query executes once and returns a set of results.

Another approach which adopts the idea of a centralized architecture is the use of a central data stream management system (DSMS), which simply takes the sensor data stream as its input source. Sending all sensor readings to a DSMS is also an option for WSN data processing. A DSMS is defined as a system that manages a data stream, executes continuous queries against a data stream, and supports on-line analysis of rapidly changing data streams [10]. Traditional stream processing systems such as Aurora [2], NiagaraCQ [7], and AnduIN [12] extend relational query processing to work with stream data. Generally, the select, project, join and aggregate operations are supported in these stream systems.

The naive approach for complex event processing in WSNs is similar to the central architectural idea of normal data processing, but instead of using a traditional database or data stream engine, CEP uses a dedicated engine for processing complex events, such as Esper [8], SASE [11] and Cayuga [4], in which sensor data or event streams need to be filtered, aggregated, processed and analyzed to find the events of interest, to identify patterns among them, and finally to take actions if needed.

Reference [11] uses SASE in order to process RFID stream data for a real-world retail management scenario. Paper [3] demonstrates the use of the Esper engine for object detection and tracking in sensor networks. All the aforementioned engines use some variant of an NFA model to detect complex events. Moreover, there are many CEP engines in the field of active databases. Most of the models in these engines are based on fixed data structures such as trees, graphs, finite automata or petri nets. The authors of [6] used a tree-based model. Paper [9] used a petri net based model to detect complex events from an active database. Reference [17] used Timed Petri-Nets (TPN) to detect complex events from RFID streams.

3. RELATED WORKS
It is preferable to perform in-network processing inside the sensor network to reduce the transmission cost between neighboring nodes. This concept is proposed by several systems such as TinyDB [16] and Cougar [19]. The Cougar project applies database system concepts to sensor networks. It uses declarative queries that are similar to SQL to query sensor nodes. Additionally, sensor data in Cougar is treated like a "virtual" relational database. Cougar places on each node an additional query layer that lies between the network and application layers and has the responsibility for in-network processing. This system generates one plan for the leader node to perform aggregation and send the data to a sink node. Another plan is generated for non-leader nodes to measure the sensor status. The query plans are disseminated to the query layers of all sensor nodes. The query layer registers the plan inside the sensor node, enables the desired sensors, and returns results according to this plan.

TinyDB is an acquisitional query processing system for sensor networks which maintains a single, infinitely-long virtual database table. It uses an SQL-like interface to ask for data from the network. In this system, users specify the data they want and the rate at which the data should be refreshed, and the underlying system decides the best plan to be executed. Several in-network aggregation techniques have been proposed in order to extend the lifetime of sensor networks, such as tree-based aggregation protocols, i.e., directed diffusion.

Paper [13] proposes a framework to detect complex events in wireless sensor networks by transforming them into sub-events. In this case, the sub-events can easily be detected by sensor nodes. Reference [14] splits queries into server and node queries, where each query can be executed; the final results from both sides are combined by a results merger. In [20], symbolic aggregate approximation (SAX) is used to transform sensor data into symbolic representations. To detect complex events, a distance metric for string comparison is utilized. These papers are the closest works to our system.

Obviously, there is currently little work on how the idea of in-network processing can be extended and implemented to allow more complex event queries to be resolved within the network.

4. SYSTEM ARCHITECTURE
We have proposed a system architecture in which the data collected at numerous, inexpensive sensor nodes is processed locally. The resulting information is transmitted to larger, more capable and more expensive nodes for further analysis and processing through a specific node called the sink node.

The architecture has three main parts that need to be modified or created to make our system better suited to queries over sensor nodes: 1- Server side: queries are originated at the server side and then forwarded to the nearest sink node. Additionally, this side mainly contains an application that runs on the user's PC (base station). Its main purpose is to collect the result streams from the sensor network and display them. The server side application can offer more functions, i.e., further filtering of the collected data, joining of sensor data, extracting, saving, managing, and searching the semantic information, and applying further complex event processing to incoming events after they have been processed locally in the sensor nodes. Because sensor data can be considered as a data stream, we propose to use a data stream management system to play the role of the server side; for that we selected the AnduIN data stream engine. 2- Sink side: the sink node (also known as root or gateway node) is one of the motes in the network which communicates with the base station directly; all the data collected by sensors is forwarded to a sink node and then to the server side. This node is in charge of disseminating the query that comes from the server side down to all the sensor nodes in the network. 3- Node side: on this side, we have made substantial changes to the traditional application which runs on the nodes themselves to enable database-style queries involving filters, aggregates, a complex event processing operator (engine) and other operators to be executed within the sensor network. These changes are done in order to reduce communication costs and obtain useful information instead of raw data.

When combining the on-sensor portions of the query with the server side query, most of the pieces of the sensor data query are in place. This makes our system more advanced.

5. LOGICAL PLAN
Each sensor node of a network generates tuples. Every tuple may consist of information about the node id and sensor readings. A query plan specifies the tuple flow between all necessary operators and a precise computation plan for each sensor node. Figure 1 (lower plan) illustrates how our query plan can be employed. It corresponds to a directed acyclic graph of operators. We assume the dataflow to be upward. At the bottom, there is a homogeneous data source which generates data tuples that must be processed by the operators belonging to the query plans. Tuples flow through the intermediate operators composed in the query graph. The operators perform the actual processing and eventually forward the data to the sink operator, which transmits the resulting information to the server side (base station). These operators adopt a publish/subscribe mechanism to transfer tuples from one operator to the next.

Figure 1: Logical query plan

We distinguish between three different types of operators within a query graph [12]: 1- Source operator: produces tuples and transfers them to other operators. 2- Sink operator: receives incoming tuples from other operators. 3- Inner operators: receive incoming tuples from the source operator, process them, and transfer the result to the sink operator or other inner operators.

A query plan consists of one source at the bottom of a logical query graph, several inner operators, and one sink at the top, and the tuples flow strictly upward. In our system, we have extended this plan to give the system the capability to perform complex event processing and detection by adding new operators. We have separated the mechanism for detecting complex events from the rest of the normal processing. We have a particular component working as an extra operator or engine within the main process, as can be seen in Figure 1 (upper plan). The detection mechanism takes as input primitive events from lower operators and detects occurrences of composite events, which are passed as output to the rest of the system.
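To make the operator model above concrete, the following minimal Python sketch wires a source, an inner filter operator, and a sink via a publish/subscribe mechanism. It is an illustrative sketch only; the actual system is implemented in TinyOS on the sensor nodes, and all class and method names used here (Operator, subscribe, publish) are our own assumptions.

```python
# Minimal sketch of a publish/subscribe operator chain (illustrative only;
# the real system is implemented in TinyOS/nesC on the sensor nodes).

class Operator:
    def __init__(self):
        self.subscribers = []          # downstream operators

    def subscribe(self, op):
        self.subscribers.append(op)
        return op

    def publish(self, tuple_):
        for op in self.subscribers:
            op.process(tuple_)

    def process(self, tuple_):         # inner operators override this
        self.publish(tuple_)

class Source(Operator):
    def emit(self, tuple_):            # called for every sensor reading
        self.publish(tuple_)

class Filter(Operator):                # an example inner operator
    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate

    def process(self, tuple_):
        if self.predicate(tuple_):
            self.publish(tuple_)

class Sink(Operator):                  # forwards results toward the base station
    def process(self, tuple_):
        print("send to sink node:", tuple_)

# Wiring a tiny plan: source -> filter -> sink
source, sink = Source(), Sink()
source.subscribe(Filter(lambda t: t["temp"] > 30)).subscribe(sink)
source.emit({"node_id": 7, "temp": 33})
```

The CEP engine described in the next section would simply be one more inner operator plugged into such a chain.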
6. IN-NETWORK CEP SYSTEM
Various applications, including WSNs, require the ability to handle complex events among apparently unrelated events and to find interesting and/or special patterns. Users want to be notified immediately as soon as these complex events are detected. Sensor node devices generate massive sensor data streams. These streams continuously produce a variety of primitive events. The continuous events form a sequence of primitive events, and recognition of the sequence supplies a high-level event which the users are interested in.

Sensor event streams have to be automatically filtered, processed, and transformed into significant information. In a non-centralized architecture, CEP has to be performed as close to real time as possible (inside the node). The task of identifying composite events from primitive ones is performed by the complex event processing engine. A CEP engine provides the runtime to perform complex event processing: it accepts queries provided by the user, matches those queries against continuous event streams, and triggers an event or an execution when the conditions specified in the queries have been satisfied. This concept is close to the Event-Condition-Action (ECA) concept in conventional database systems, where an action has to be carried out in response to an event when one or more conditions are satisfied.

Each data tuple from the sensor node is viewed as a primitive event, and it has to be processed inside the node. We have proposed an event detection system that specifically targets applications with limited resources, such as our system. There are four phases for complex event processing in our in-network model: NFA creation, Filtering, Sequence scan and Response, as shown in Figure 2.

Figure 2: CEP Phases

6.1 NFA Creation Phase
The first phase is NFA creation. The NFA's structure is created by translating the sequence pattern, mapping the events to NFA states and edges, where the conditions of the events (generally called event types) are associated with the edges. For pattern matching over sensor node streams, an NFA is employed to represent the structure of an event sequence. For a concrete example, consider the query pattern SEQ(A a, B+ b, C c)³. Figure 3 shows the NFA created for the aforementioned pattern (A, B+, C), where state S0 is the starting state, state S1 stands for the successful detection of an A event, state S2 for the detection of a B event after event A, and state S3 for the detection of a C event after the B event. State S1 contains a self-loop with the condition of a B event. State S3 is the accepting state; reaching this state indicates that the sequence has been detected.

Figure 3: NFA for SEQ(A a, B+ b, C c)

³In this paper, we focus only on the sequence operator SEQ because of the limited number of pages.
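As an illustration of the NFA creation phase, the sketch below translates a SEQ pattern into a simple transition table, installing a self-loop for a Kleene-plus step. The representation (a dictionary of labeled transitions) is an illustrative assumption and not the exact structure used on the motes.

```python
# Illustrative sketch: translate SEQ(A, B+, C) into an NFA transition table.
# States are numbered 0..n; the last state is accepting. A '+' event type
# additionally installs a self-loop on the state reached by that type.

def build_seq_nfa(pattern):
    """pattern: list of event types, e.g. ["A", "B+", "C"]."""
    transitions = {}                      # (state, event_type) -> next state
    state = 0
    for step in pattern:
        event_type = step.rstrip("+")
        transitions[(state, event_type)] = state + 1
        if step.endswith("+"):            # Kleene-plus: allow repetitions
            transitions[(state + 1, event_type)] = state + 1
        state += 1
    return transitions, state             # accepting state = last state

nfa, accepting = build_seq_nfa(["A", "B+", "C"])
# nfa == {(0, 'A'): 1, (1, 'B'): 2, (2, 'B'): 2, (2, 'C'): 3}, accepting == 3
```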
6.2 Filtering Phase
The second phase is to filter the primitive events generated by the sensor nodes at an early stage. Sensor nodes cannot decide by themselves whether a particular event is necessary or not. When additional conditions are added to the system, possible event instances can already be pruned in this first stage.

After filtering, a timestamp operator adds the occurrence time t of the event. This new operator is designed to attach a timestamp t to the events (tuples) before they enter the complex event processing operator; this can be seen in Figure 1. The timestamp attribute value t of an event records the reading of a clock in the system in which the event was created, so that it can reflect the true order of the occurrences of primitive events.
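As a small standalone sketch (not the actual TinyOS operators), the filtering and timestamp steps described above can be pictured as two functions applied to every tuple before it reaches the CEP operator; the function and field names are assumptions.

```python
import time

# Illustrative sketch of the filtering and timestamp steps applied to each
# tuple before it enters the CEP operator (names and fields are assumptions).

def filter_event(tuple_, condition):
    """Drop primitive events that cannot contribute to any pattern."""
    return tuple_ if condition(tuple_) else None

def add_timestamp(tuple_, clock=time.monotonic):
    """Attach the local occurrence time t, preserving the true event order."""
    tuple_["t"] = clock()
    return tuple_

event = {"node_id": 3, "type": "A", "temp": 31}
if (e := filter_event(event, lambda t: t["temp"] > 30)) is not None:
    cep_input = add_timestamp(e)      # forwarded to the sequence scan phase
```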
6.3 Sequence Scan Phase
The third phase is the sequence scan, which detects a pattern match. We have three modes that state the way in which events may contribute to scanning a sequence: UNRESTRICTED, RECENT and FIRST. Every mode has a different behavior. The selection between them depends on the users and the application domain. These modes have advantages and disadvantages, which we illustrate below.

In the UNRESTRICTED mode, each start event e which allows a sequence to move from the initial state to the next state starts a separate sequence detection. In this case, any event occurrence combination that matches the definition of the sequence can be considered as an output. By using this mode, we can obtain all combinations of events which satisfy the sequence. When a sequence is created, it waits in its starting state for the arrival of events. Once a new instance event e arrives, the sequence scan responds as follows: 1- It checks whether the type of the instance (taken from its attributes) and the occurrence time of e satisfy a transition for one of the existing logical sequences. If not, the event is directly rejected. 2- If yes, e is registered in the system (the registration is done in the sliding window) and the sequence advances to the next state. 3- If e allows a sequence to move from the starting state to the next state, the engine creates another logical sequence to process further incoming events, while keeping the original sequence in its current state to receive new events. Therefore, multiple sequences work on the events at the same time. 4- Sequences are deleted when their last received items are not within the time limit; it becomes impossible for them to proceed to the next state, since the time limits for future transitions have already expired.

Next, we use an example to illustrate how the UNRESTRICTED sequence scan works. Suppose we have the pattern⁴ SEQ (A, B+, D) and the sequence of events (tuples) [a1, b2, a3, c4, c5, b6, d7, ...] within 6 time units. Figure 4 shows, step by step, how the aforementioned events are processed. Once the sequence has reached the accepting state (F), the occurrences of SEQ (A, B+, D) are established as: {{a1, b2, d7}, {a1, b6, d7}, {a3, b6, d7}}.

Figure 4: Sequence Scan for SEQ (A, B+, D) within 6 Time Units Using UNRESTRICTED Mode

⁴The terms complex event, composite event, pattern and sequence are used interchangeably.

The drawback of this mode is the use of a large amount of storage to accumulate all the events that participate in the combinations, in addition to the computational overhead for the detection; it consumes more energy. On the other hand, it gives us all possible event combinations, which can be used, e.g., for further analysis. In our system, we only output one of these possibilities to reduce the transmission cost overhead. All registered events are stored in a sliding window. Once an overflow has occurred, the candidate events to be replaced are the newest registered ones from the first sequence. The engine continues to replace events from the first sequence as long as there is no space. When the initial event (the first event in the first sequence combination) would be replaced, the engine starts the replacement from the second sequence, and so on. The engine applies this replacement policy to ensure that the system still has several sequences available to detect a composite event, because replacing the initial events would destroy the whole sequence.
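A compact sketch of the UNRESTRICTED scan over the example above is shown below. It spawns a new logical run whenever an event can extend an existing run and reports every combination that reaches the accepting state. It is a simplification under our own assumptions: each match binds one event per pattern step (as in the three combinations listed above), and the sliding-window and time-unit bookkeeping are omitted.

```python
# Illustrative sketch of UNRESTRICTED sequence scan for SEQ(A, B+, D).
# Simplification: each match binds one event per pattern step; sliding-window
# and time-unit handling are omitted.

from dataclasses import dataclass

@dataclass
class Run:                       # one logical sequence detection
    state: int                   # index of the next pattern step to match
    bound: tuple                 # events collected so far

def unrestricted_scan(pattern, stream):
    runs, matches = [Run(0, ())], []
    for event in stream:
        new_runs = []
        for run in runs:
            if run.state < len(pattern) and event[0].upper() == pattern[run.state]:
                advanced = Run(run.state + 1, run.bound + (event,))
                if advanced.state == len(pattern):
                    matches.append(advanced.bound)     # accepting state reached
                else:
                    new_runs.append(advanced)          # keep scanning
        runs += new_runs                               # old runs stay active too
    return matches

stream = ["a1", "b2", "a3", "c4", "c5", "b6", "d7"]
print(unrestricted_scan(["A", "B", "D"], stream))
# [('a1', 'b2', 'd7'), ('a1', 'b6', 'd7'), ('a3', 'b6', 'd7')]
```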
In the FIRST mode, the earliest occurrence of each contributing event type is used to form the composite event output. Only the first event from a group of events of the same type advances the sequence to the next state. In this mode, there is just one sequence in the system. For every incoming instance e, the automaton engine examines whether its type and occurrence time satisfy a transition from the current state to the next state. If so, the sequence registers the event in the current state and advances to the next state. If not, the event is directly rejected. Suppose we have the pattern SEQ (A, B+, C+, D) and the sequence of tuples [a1, a2, b3, c4, c5, b6, d7, ...] within 6 time units. The result is shown in the upper part of Figure 5.

In the RECENT mode (shown in the lower part of Figure 5, which uses the same pattern and the same sequence of tuples as the FIRST example), the most recent occurrences of the contributing event types are used to form the composite event. In RECENT mode, once an instance satisfies the condition and timing constraint to jump from a state to the next state, the engine stays in the current state, unlike in FIRST mode. This mode tries to find the most recent instance among consecutive instances for that state before moving to the next state. When a1 enters the engine, it satisfies the condition to move from S0 to S1. The engine registers it, stays in S0 and does not jump to the next state, since a newer incoming instance may be more recent than the one currently registered in that state.

Figure 5: First and Recent Modes

The advantages of the FIRST and RECENT modes are the use of less storage to accumulate the events that participate in the combinations, since only a few events are registered in the system, and a low computational overhead for the detection. They consume less energy. Unlike UNRESTRICTED, they do not give all possible matches.
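To contrast the two modes, the following illustrative fragment implements a single-run scan in which the binding policy decides whether an already-bound step keeps its earliest candidate (FIRST) or is replaced by the most recent one before the run advances (RECENT). Again this is a simplified sketch under our own assumptions: one event per step and no timing constraints.

```python
# Illustrative single-run scan; "first" keeps the earliest binding per step,
# "recent" replaces it with the newest candidate until the next step starts.

def single_run_scan(pattern, stream, mode="first"):
    bound, state = [], 0                       # bound[i] = event for step i
    for event in stream:
        etype = event[0].upper()
        if state < len(pattern) and etype == pattern[state]:
            bound.append(event)                # advance to the next step
            state += 1
        elif mode == "recent" and state > 0 and etype == pattern[state - 1]:
            bound[-1] = event                  # re-bind the previous step
    return bound if state == len(pattern) else None

stream = ["a1", "a2", "b3", "c4", "c5", "b6", "d7"]
print(single_run_scan(["A", "B", "C", "D"], stream, mode="first"))
# ['a1', 'b3', 'c4', 'd7']  -- earliest contributing events
print(single_run_scan(["A", "B", "C", "D"], stream, mode="recent"))
# ['a2', 'b3', 'c5', 'd7']  -- most recent candidates before each advance
```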
6.4 Response Phase
Once an accepting state F is reached by the engine, the engine should immediately output the event sequence. This phase is responsible for preparing the output sequence and passing it to the sink operator. The output sequence depends on the mode of the scan. This phase creates the response by reading the sliding window contents. In the case of the FIRST and RECENT modes, the sliding window contains only the events which contribute to the sequence detection. In UNRESTRICTED mode, the engine randomly selects one combination of events which matches the pattern, in order to reduce the transmission cost.

7. EVALUATION
We have completed an initial in-network complex event processing implementation. All the source code, implementing the in-network complex event processing techniques as well as the base station functionality, is written in TinyOS. Our code runs successfully on both real motes and the TinyOS Avrora simulator. The aim of the proposed work is to compare the performance of our system, an in-network processor which includes a complex event engine, with the centralized approach in wireless sensor networks, and to assess the suitability of our approach in an environment where resources are limited. The comparison is done in terms of energy efficiency (amount of energy consumed) and the number of messages transmitted per particular interval in the entire network. The experiment was run with varying SEQ lengths: we started with length 2, then 3, and finally 5. Simulations were run for 60 seconds with one event per second. The performance for different SEQ lengths and different modes with a network of 75 nodes is shown in Figure 6.

Figure 6: Total Energy Consumption

The centralized architecture led to higher energy consumption because sensor nodes transmitted events to the sink node at regular intervals. In our system, we used in-network complex event processing to decrease the number of transmissions of needless events at each sensor node. What we can observe from Figure 6 is summarized as follows: 1- By increasing the SEQ length in our approach, the RAM consumption increases while the energy consumption is reduced. The reason is that a transmission does not occur until the sequence reaches the accepting state, which relatively few events (tuples) will satisfy; hence, the number of transmissions after detections decreases. 2- FIRST is slightly better than RECENT, and both of them are better than UNRESTRICTED in energy consumption. The gap between them results from the processing energy consumption, because UNRESTRICTED needs more processing power while the others need less, as shown in Figure 6.

Figure 7 shows the radio energy consumption of each sensor node and the total number of messages when the SEQ length was 3. The nodes in the centralized architecture sent more messages than in our approach (nearly three times more). Hence, they consumed more radio energy. Additionally, the gateway nodes consumed more radio energy due to receiving and processing the messages from the other sensor nodes. In a 25-node network, the centralized approach consumed nearly 4203 mJ of energy on the sink side, while our approach consumed around 2811 mJ. Thus, our system conserved nearly 1392 mJ (33% of the centralized approach) of energy. In our architecture, the number of transmissions was reduced. Therefore, the radio energy consumption is reduced not only at the sensor nodes but also at the sink nodes.

Figure 7: Energy Consumption vs. Radio Message

8. CONCLUSIONS
Sensor networks provide a considerably challenging programming and computing environment. They require advanced paradigms for software design, due to characteristics such as the limited computational power, limited memory and limited battery power that WSNs suffer from. In this paper, we presented our system, an in-network complex event processing system that efficiently carries out complex event queries inside network nodes.

We have proposed an engine that allows the system to detect complex events and valuable information from primitive events.

We developed a query plan based approach to implement the system. We provided the architecture to collect the events from the sensor network; this architecture includes three sides: the sensor side, which performs in-network complex event processing; the sink side, which delivers the events from the network; and the AnduIN server side, which has the responsibility to display them and perform further analysis.

We demonstrated the effectiveness of our system in a detailed performance study. The results obtained from a comparison between the centralized approach and our approach confirm that our in-network complex event processing increases the lifetime of both small-scale and large-scale sensor networks. We plan to continue our research to build a distributed in-network complex event processing, in which each sensor node has a different complex event processing plan and the nodes can communicate directly with each other to detect complex events.

9. REFERENCES
[1] Nondeterministic finite automaton. http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton.
[2] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J.-h. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. Zdonik. Aurora: a data stream management system. In ACM SIGMOD Conference, page 666, 2003.
[3] R. Bhargavi, V. Vaidehi, P. T. V. Bhuvaneswari, P. Balamuralidhar, and M. G. Chandra. Complex event processing for object tracking and intrusion detection in wireless sensor networks. In ICARCV, pages 848–853. IEEE, 2010.
[4] L. Brenna, A. Demers, J. Gehrke, M. Hong, J. Ossher, B. Panda, M. Riedewald, M. Thatte, and W. White. Cayuga: a high-performance event processing engine. In ACM SIGMOD, pages 1100–1102, New York, NY, USA, 2007. ACM.
[5] A. Cerpa, J. Elson, D. Estrin, L. Girod, M. Hamilton, and J. Zhao. Habitat monitoring: application driver for wireless communications technology. SIGCOMM Comput. Commun. Rev., 31(2 supplement):20–41, Apr. 2001.
[6] S. Chakravarthy, V. Krishnaprasad, E. Anwar, and S.-K. Kim. Composite events for active databases: semantics, contexts and detection. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 606–617, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[7] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD, pages 379–390, New York, NY, USA, 2000. ACM.
[8] EsperTech. Event stream intelligence: Esper & NEsper. http://www.esper.codehaus.org/.
[9] S. Gatziu and K. R. Dittrich. Events in an active object-oriented database system, 1993.
[10] V. Goebel and T. Plagemann. Data stream management systems - a technology for network monitoring and traffic analysis? In ConTEL 2005, volume 2, pages 685–686, June 2005.
[11] D. Gyllstrom, E. Wu, H. Chae, Y. Diao, P. Stahlberg, and G. Anderson. SASE: complex event processing over streams (Demo). In CIDR, pages 407–411, 2007.
[12] D. Klan, M. Karnstedt, K. Hose, L. Ribe-Baumann, and K. Sattler. Stream engines meet wireless sensor networks: cost-based planning and processing of complex queries in AnduIN. Distributed and Parallel Databases, 29(1):151–183, Jan. 2011.
[13] Y. Lai, W. Zeng, Z. Lin, and G. Li. LAMF: framework for complex event processing in wireless sensor networks. In 2nd International Conference on Information Science and Engineering (ICISE), pages 2155–2158, Dec. 2010.
[14] P. Li and W. Bingwen. Design of complex event processing system for wireless sensor networks. In NSWCTC, volume 1, pages 354–357, Apr. 2010.
[15] D. C. Luckham. The power of events. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.
[16] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, Mar. 2005.
[17] J. Xingyi, L. Xiaodong, K. Ning, and Y. Baoping. Efficient complex event processing over RFID data stream. In IEEE/ACIS, pages 75–81, May 2008.
[18] X. Yang, H. B. Lim, T. M. Özsu, and K. L. Tan. In-network execution of monitoring queries in sensor networks. In ACM SIGMOD, pages 521–532, New York, NY, USA, 2007. ACM.
[19] Y. Yao and J. Gehrke. The cougar approach to in-network query processing in sensor networks. SIGMOD Rec., 31(3):9–18, Sept. 2002.
[20] M. Zoumboulakis and G. Roussos. Escalation: complex event detection in wireless sensor networks. In EuroSSC, pages 270–285, 2007.
                       XQuery processing over NoSQL stores

                  Henrique Valer                           Caetano Sauer                        Theo Härder
            University of Kaiserslautern             University of Kaiserslautern        University of Kaiserslautern
                  P.O. Box 3049                            P.O. Box 3049                       P.O. Box 3049
              67653 Kaiserslautern,                    67653 Kaiserslautern,               67653 Kaiserslautern,
                     Germany                                  Germany                             Germany
               valer@cs.uni-kl.de                      csauer@cs.uni-kl.de                haerder@cs.uni-kl.de


ABSTRACT
Using NoSQL stores as the storage layer for declarative query processing with XQuery provides a high-level interface to process data in an optimized manner. The term NoSQL refers to a plethora of new stores which essentially trade off the well-known ACID properties for higher availability or scalability, using techniques such as eventual consistency, horizontal scalability, efficient replication, and schema-less data models. This work proposes a mapping from the data models of different kinds of NoSQL stores (key/value, columnar, and document-oriented) to the XDM data model, thus allowing for standardization and for querying NoSQL data using higher-level languages such as XQuery. This work also explores several optimization scenarios to improve performance on top of these stores. Besides, we add updating semantics to XQuery by introducing simple CRUD-enabling functionalities. Finally, this work analyzes the performance of the system in several scenarios.

Keywords
NoSQL, Big Data, key/value, XQuery, ACID, CAP

1. INTRODUCTION
We have seen a trend towards specialization in the database market in the last few years. There is no longer a one-size-fits-all approach when it comes to storing and dealing with data, and different types of DBMSs are being used to tackle different types of problems. One of these problems is the Big Data topic.

It is not completely clear what Big Data means after all. Lately, it has been characterized by the so-called 3 V's: volume, comprising the actual size of data; velocity, comprising essentially the time span in which data must be analyzed; and variety, comprising the types of data. Big Data applications need to understand how to create solutions along these data dimensions.

RDBMS have had problems when facing Big Data applications, as in web environments. Two of the main reasons for that are scalability and flexibility. The solution RDBMS provide is usually twofold: either (i) a horizontally-scalable architecture, which in database terms generally means giving up joins and also complex multi-row transactions; or (ii) using parallel databases, i.e., using multiple CPUs and disks in parallel to optimize performance. While the latter increases complexity, the former simply gives up operations because they are too hard to implement in distributed environments. Nevertheless, these solutions are neither scalable nor flexible.

NoSQL tackles these problems with a mix of techniques, which involves either weakening ACID properties or allowing more flexible data models. The latter is rather simple: some scenarios, such as web applications, do not conform to a rigid relational schema, cannot be bound to the structures of an RDBMS, and need flexibility. Solutions exist, such as using XML, JSON, or pure key/value stores as the data model for the storage layer. Regarding the former, some NoSQL systems relax consistency by using mechanisms such as multi-version concurrency control, thus allowing for eventually consistent scenarios. Others support atomicity and isolation only when each transaction accesses data within some convenient subset of the database. Atomic operations would require some distributed commit protocol, like two-phase commit, involving all nodes participating in the transaction, and that would definitely not scale. Note that this has nothing to do with SQL, as the acronym NoSQL suggests. Any RDBMS that relaxes ACID properties could scale just as well and keep SQL as its query language.

Nevertheless, when it comes to performance, NoSQL systems have shown some interesting improvements. When considering update- and lookup-intensive OLTP workloads, the scenarios where NoSQL systems are most often considered, the work of [13] shows that the total OLTP time is almost evenly distributed among four possible overheads: logging, locking, latching, and buffer management. In essence, NoSQL systems improve locking by relaxing atomicity when compared to RDBMS.

When considering OLAP scenarios, RDBMS require a rigid schema to perform the usual OLAP queries, whereas most NoSQL stores rely on a brute-force processing model called MapReduce. It is a linearly-scalable programming model for processing and generating large data sets, and it works with any data format or shape. Using MapReduce, parallelization details, fault tolerance, and distribution aspects are transparently handled for the user. Nevertheless, it requires implementing queries from scratch and still suffers from the lack of proper tools to enhance its querying capabilities. Moreover, when executed atop raw files, the processing is inefficient. NoSQL stores provide this structure, so one could provide a higher-level query language to take full advantage of it, like Hive [18], Pig [16], and JAQL [6].

These approaches require learning separate query languages, each of which is specifically made for one implementation. Besides, some of them require schemas, like Hive and Pig, thus making them quite inflexible. On the other hand, there exists a standard that is flexible enough to handle the data flexibility offered by these different stores, whose compilation steps are directly mappable to distributed operations on MapReduce, and that has been standardized for over a decade: XQuery.

Contribution
Consider employing XQuery for implementing the large class of query-processing tasks, such as aggregating, sorting, filtering, transforming, joining, etc., on top of MapReduce as a first step towards standardization in the realm of NoSQL [17]. A second step is essentially to incorporate NoSQL systems as the storage layer of such a framework, providing a significant performance boost for MapReduce queries. This storage layer not only leverages the storage efficiency of RDBMS, but also allows pushdown of projections, filters, and predicate evaluations as close to the storage level as possible, drastically reducing the amount of data used at the query processing level.

This is essentially the contribution of this work: allowing NoSQL stores to be used as the storage layer underneath a MapReduce-based XQuery engine, Brackit [?], a generic XQuery processor that is independent of the storage layer. We rely on Brackit's MapReduce-mapping facility as a transparently distributed execution engine, thus providing scalability. Moreover, we exploit the XDM-mapping layer of Brackit, which provides flexibility by allowing new data models. We created three XDM mappings, investigating three different implementations, encompassing the most used types of NoSQL stores: key/value, column-based, and document-based.

The remainder of this paper is organized as follows. Section 2 introduces the NoSQL models and their characteristics. Section 3 describes the XQuery engine used, Brackit, and the execution environment of XQuery on top of the MapReduce model. Section 4 describes the mappings from the various stores to XDM, besides all implemented optimizations. Section 5 presents the developed experiments and the obtained results. Finally, Section 6 concludes this work.

2. NOSQL STORES
This work focuses on three different types of NoSQL stores, namely key/value, columnar, and document-oriented, represented by Riak [14], HBase [11], and MongoDB [8], respectively.

Riak is the simplest model we dealt with: a pure key/value store. It provides solely read and write operations on uniquely-identified values, referenced by key. It does not provide operations that span multiple data items, and there is no need for a relational schema. It uses the concepts of buckets, keys, and values. Data is stored and referenced by bucket/key pairs. Each bucket defines a virtual key space and can be thought of as a table in a classical relational database. Each key references a unique value, and there are no data type definitions: objects are the only unit of data storage. Moreover, Riak provides automatic load balancing and data replication. It does not have any relationships between data items, even though it attempts this by allowing links between key/value pairs. It provides the most flexibility by allowing a per-request scheme for choosing between availability and consistency. Its distributed system has no master node, thus no single point of failure, and in order to resolve partial ordering it uses Vector Clocks [15].

HBase enhances Riak's data model by allowing columnar data, where a table in HBase can be seen as a map of maps. More precisely, each key is an arbitrary string that maps to a row of data. A row is a map, where columns act as keys and values are uninterpreted arrays of bytes. Columns are grouped into column families, and therefore the full key access specification of a value is a column family concatenated with a column, or, using HBase notation, a qualifier. Column families make the implementation more complex, but their existence enables fine-grained performance tuning, because (i) each column family's performance options are configured independently, like read and write access and disk space consumption; and (ii) the columns of a column family are stored contiguously on disk. Moreover, operations in HBase are atomic at the row level, thus keeping a consistent view of a given row. Data relations exist from column family to qualifiers, and operations are atomic on a per-row basis. HBase chooses consistency over availability, and much of that is reflected in the system architecture. Auto-sharding and automatic replication are also present: sharding is done automatically by dividing the data into regions, and replication is achieved by the master-slave pattern.

MongoDB fosters functionality by allowing more RDBMS-like features, such as secondary indexes, range queries, and sorting. The data unit is a document, which is an ordered set of keys with associated values. Keys are strings, and values, for the first time, are not simply opaque objects or arrays of bytes as in Riak or HBase. In MongoDB, values can be of different data types, such as strings, dates, integers, and even embedded documents. MongoDB provides collections, which are groupings of documents, and databases, which are groupings of collections. Stored documents do not follow any predefined schema. Updates within a single document are transactional. Consistency is also favored over availability in MongoDB, as in HBase, and that is also reflected in the system architecture, which follows a master-worker pattern.

Overall, all systems provide scaling-out, replication, and parallel-computation capabilities. What changes is essentially the data model: Riak seems to be better suited for problems where data is not really relational, like logging. On the other hand, because of the lack of scan capabilities, Riak will not perform that well in situations where data querying is needed. HBase allows for some relationships between data items, besides built-in compression and versioning. It is thus an excellent tool for indexing web pages, which are highly textual (thus benefiting from compression), as well as inter-related and updatable (benefiting from built-in versioning). Finally, MongoDB provides documents as its granularity unit, thus fitting well when the scenario involves highly-variable or unpredictable data.
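To make the three data models tangible, the following sketch shows how the same sensor reading might be laid out in each store, using plain Python dictionaries as stand-ins; the layouts and field names are illustrative assumptions, not the stores' actual wire formats.

```python
# Key/value (Riak-like): a bucket maps opaque keys to opaque values.
riak_style = {
    "readings": {                                    # bucket
        "node7-2013-05-28T10:00": b'{"temp": 33}',   # key -> opaque object
    }
}

# Columnar (HBase-like): row key -> column family -> qualifier -> bytes.
hbase_style = {
    "node7-2013-05-28T10:00": {                # row key
        "measurement": {"temp": b"33"},        # column family -> qualifier
        "meta": {"unit": b"celsius"},
    }
}

# Document (MongoDB-like): a collection of schema-free nested documents.
mongo_style = [
    {"_id": "node7-2013-05-28T10:00",
     "temp": 33,
     "location": {"building": "A", "floor": 2}},   # embedded document
]
```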
3. BRACKIT AND MAPREDUCE
Several different XQuery engines are available as options for querying XML documents. Most of them provide either (i) a lightweight application that can perform queries on documents or collections of documents, or (ii) an XML database that uses XQuery to query documents. The former lacks any sort of storage facility, while the latter is just not flexible enough because of its built-in storage layer. Brackit¹ provides intrinsic flexibility, allowing different storage levels to be "plugged in" without losing the necessary performance when dealing with XML documents [5]. By dividing the components of the system into different modules, namely language, engine, and storage, it gives us the needed flexibility, thus allowing us to use any store for our storage layer.

Compilation
The compilation process in Brackit works as follows: the parser analyzes the query to validate the syntax and to ensure that there are no inconsistencies among parts of the statement. If any syntax errors are detected, the query compiler stops processing and returns the appropriate error message. During this step, a data structure is built, namely an AST (Abstract Syntax Tree). Each node of the tree denotes a construct occurring in the source query and is used throughout the rest of the compilation process. Simple rewrites, like constant folding and the introduction of let bindings, are also done in this step.

The pipelining phase transforms FLWOR expressions into pipelines, the internal, data-flow-oriented representation of FLWORs, discussed later. Optimizations are done atop pipelines, and the compiler uses global semantics stored in the AST to transform the query into a more easily optimized form. For example, the compiler will move predicates if possible, altering the level at which they are applied and potentially improving query performance. This type of operator movement is called predicate pushdown, or filter pushdown, and we will apply it to our stores later on. Further optimizations, such as join recognition and unnesting, are present in Brackit and are discussed in [4]. In the optimization phase, these optimizations are applied to the AST. The distribution phase is specific to distributed scenarios and is where the MapReduce translation takes place. More details about the distribution phase are presented in [17]. At the end of the compilation, the translator receives the final AST and generates a tree of executable physical operators. This compilation process chain is illustrated in Figure 1.
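As a rough illustration of this compilation chain (a sketch under our own naming assumptions, not Brackit's actual API), the phases can be viewed as a simple function pipeline from query string to physical plan:

```python
# Illustrative sketch of the Brackit compilation chain described above.
# The phase names follow the text; the function bodies are placeholders.

def parse(query):      return {"ast": query}            # syntax check, build AST
def pipeline(ast):     return ast                       # FLWORs -> pipelines
def optimize(ast):     return ast                       # e.g. predicate pushdown
def distribute(ast):   return ast                       # MapReduce translation
def translate(ast):    return ["physical operators"]    # final physical plan

def compile_query(query):
    ast = parse(query)
    for phase in (pipeline, optimize, distribute, translate):
        ast = phase(ast)
    return ast

plan = compile_query("for $r in collection('readings') where $r/temp > 30 return $r")
```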
XQuery over MapReduce
Mapping XQuery to the MapReduce model is an alternative to implementing a distributed query processor from scratch, as is normally done in parallel databases. This choice relies on the MapReduce middleware for the distribution aspects. BrackitMR is one such implementation and is discussed in more depth in [17]. It achieves a distributed XQuery engine in Brackit by scaling out using MapReduce.

The system cited above processes collections stored in HDFS as text files, and therefore does not control details about the encoding and management of low-level files. If the DBMS architecture [12] is considered, it implements solely its topmost layer, the set-oriented interface. It executes processes using MapReduce functions, but abstracts this from the final user by compiling XQuery onto the MapReduce model.

It represents each query in MapReduce as a sequence of jobs, where each job processes a section of a FLWOR pipeline. In order to use MapReduce as a query processor, (i) it breaks FLWOR pipelines into map and reduce functions, and (ii) it groups these functions to form a MapReduce job. For (i), it converts the logical-pipeline representation of the FLWOR expression, the AST, into a MapReduce-friendly version. MapReduce uses a tree of splits, which represents the logical plan of a MapReduce-based query. Each split is a non-blocking operator used by MapReduce functions. The structure of splits is rather simple: a split contains an AST and pointers to its successor and predecessor splits. Because splits are organized in a bottom-up fashion, the leaves of the tree are map functions, and the root is a reduce function, which produces the query output.

For (ii), the system uses the split tree to generate possibly multiple MapReduce job descriptions, which can be executed in a distributed manner. The jobs are exactly the ones used by Hadoop MapReduce [20], and therefore we will not go into details here.
mization phase, optimizations are applied to the AST. The
distribution phase is specific to distributed scenarios, and       4.   XDM MAPPINGS
is where MapReduce translation takes place. More details
about the distribution phase are presented in [17]. At the           This section shows how to leverage NoSQL stores to work
end of the compilation, the translator receives the final AST.     as storage layer for XQuery processing. First, we present
It generates a tree of executable physical operators. This         mappings from NoSQL data models to XDM, adding XDM-
compilation process chain is illustrated in Figure 1.              node behavior to these data mappings. Afterwards, we dis-
                                                                   cuss possible optimizations regarding data-filtering techniques.
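As a concrete illustration of this chain, the following small FLWOR
expression is of the kind Brackit compiles; the comments indicate
which phase handles each part. The collection and field names are
assumptions made only for this sketch.

     (: illustrative query; names are not taken from the paper :)
     for $l in collection("lineitem")/lineitem  (: parsing builds one AST node per construct :)
     let $price := $l/extendedprice             (: pipelining turns the FLWOR into a pipeline :)
     where $l/quantity > 30                     (: the optimizer may push this predicate down :)
     return $price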

XQuery over MapReduce
Mapping XQuery to the MapReduce model is an alternative to
implementing a distributed query processor from scratch, as
normally done in parallel databases. This choice relies on the
MapReduce middleware for the distribution aspects. BrackitMR is one
such implementation and is discussed in more depth in [17]. It
achieves a distributed XQuery engine in Brackit by scaling out
using MapReduce.
   The system cited above processes collections stored in HDFS as
text files, and therefore does not control details about the
encoding and management of low-level files. If the DBMS
architecture of [12] is considered, it implements solely its
topmost layer, the set-oriented interface. It executes queries
using MapReduce functions, but abstracts this from the final user
by compiling XQuery onto the MapReduce model.
   It represents each query in MapReduce as a sequence of jobs,
where each job processes a section of a FLWOR pipeline. In order to
use MapReduce as a query processor, it (i) breaks FLWOR pipelines
into map and reduce functions, and (ii) groups these functions to
form MapReduce jobs. For (i), it converts the logical pipeline
representation of the FLWOR expression—the AST—into a
MapReduce-friendly version. This translation uses a tree of splits,
which represents the logical plan of a MapReduce-based query. Each
split is a non-blocking operator used by the MapReduce functions.
The structure of splits is rather simple: each split contains an
AST and pointers to its successor and predecessor splits. Because
splits are organized in a bottom-up fashion, the leaves of the tree
are map functions, and the root is a reduce function—which produces
the query output. For (ii), the system uses the split tree to
generate possibly multiple MapReduce job descriptions, which can be
executed in a distributed manner. The jobs are exactly those used
by Hadoop MapReduce [20], and therefore we do not go into details
here.
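As a hypothetical example of how a pipeline could be divided,
consider the aggregation query below (it assumes the XQuery 3.0
group by clause and the partsupp mapping of Section 4). The
comments mark where a map-side split and the reduce-side split
would roughly cut the pipeline; the actual split points are chosen
by BrackitMR's translator.

     (: map-side split: scan the collection and extract the grouping key :)
     for $ps in collection("partsupp")/partsupp
     let $supp := $ps/references/suppkey
     (: reduce-side split: group, aggregate, and produce the query output :)
     group by $key := $supp
     return <supplier key="{ $key }">{ sum($ps/values/supplycost) }</supplier>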
4.   XDM MAPPINGS
   This section shows how to leverage NoSQL stores as a storage
layer for XQuery processing. First, we present mappings from the
NoSQL data models to XDM, adding XDM-node behavior to these data
mappings. Afterwards, we discuss possible optimizations regarding
data-filtering techniques.
Riak
Riak's mapping strategy starts by constructing a key/value tuple
from its low-level storage representation. This is essentially an
abstraction and is completely dependent on the storage backend used
by Riak. Second, we represent XDM operations on this key/value
tuple. We map data stored within Riak using Riak's linking
mechanism. A key/value pair kv represents an XDM element, and
key/value pairs linked to kv are addressed as children of kv. We
map key/value tuples to XDM elements. The name of the element is
simply the name of the bucket it belongs to. We create one bucket
for the element itself and one extra bucket for each link departing
from the element. Each child element stored in a separate bucket
represents a nested element within the key/value tuple. The name of
the child element is the name of the link between the key/values.
This does not necessarily decrease data locality: buckets are
distributed among nodes based on hashed keys, therefore uniformly
distributing the load on the system. Besides, each element has an
attribute key, which Riak uses to access key/value pairs on the
storage level.
   This allows access at key/value granularity, because every
single element can be accessed within a single get operation. Full
reconstruction of an element el requires one access for each
key/value linked to el. Moreover, Riak provides atomicity only at
the granularity of single key/value pairs; therefore, consistent
updates of multiple key/value tuples cannot be guaranteed.
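For illustration, a key/value pair stored in a bucket part and
linked, under a link named supplied_by, to a pair in another bucket
would surface to the query engine roughly as the following XDM
instance; both names and the keys are invented for this sketch.

     (: hypothetical XDM view of a mapped Riak key/value pair :)
     <part key="p-100">
       <supplied_by key="s-42"/>
     </part>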
Figure 2: Mapping between an HBase row and an XDM instance.

HBase
HBase's mapping strategy starts by constructing a columnar tuple
from the HDFS low-level storage representation. HBase stores
column-family data in separate files within HDFS, and we can
exploit this to create an efficient mapping. Figure 2 presents this
XDM mapping, where we map a table partsupp using two column
families, references and values, and five qualifiers: partkey,
suppkey, availqty, supplycost, and comment. We map each row within
an HBase table to an XDM element. The name of the element is simply
the name of the table it belongs to, and we store the key used to
access the element within HBase as an attribute of the element. The
figure shows two column families: references and values. Each
column family represents a child element whose name is the name of
the column family. Accordingly, each qualifier is nested as a child
within the column-family element from which it descends.
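Concretely, one partsupp row mapped this way is exposed roughly as
the XDM instance sketched below. The row key, the field values, and
the assignment of qualifiers to the two column families are
assumptions based on Figure 2.

     (: sketch of the XDM instance for a single partsupp row :)
     <partsupp key="1.1">
       <references>
         <partkey>1</partkey>
         <suppkey>1</suppkey>
       </references>
       <values>
         <availqty>100</availqty>
         <supplycost>7.99</supplycost>
         <comment>...</comment>
       </values>
     </partsupp>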
MongoDB
MongoDB's mapping strategy is straightforward. Because it stores
JSON-like documents, the mapping consists essentially of a document
field → element mapping. We map each document within a MongoDB
collection to an XDM element. The name of the element is the name
of the collection it belongs to. We store the id—used to access the
document within MongoDB—as an attribute of each element. Nested
within the collection element, each field of the document
represents a child element whose name is the name of the field
itself. Note that MongoDB allows fields to be of type document, so
more complex nested elements can be achieved. Nevertheless, the
mapping rules work recursively, just as described above.
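For example, a document of the shape {"_id": "c-7", "name":
"Customer#7", "nation": {"nationkey": "3"}} stored in a collection
customer would be exposed roughly as follows; the field names and
values are invented for this sketch.

     (: hypothetical XDM view of a mapped MongoDB document :)
     <customer id="c-7">
       <name>Customer#7</name>
       <nation>
         <nationkey>3</nationkey>
       </nation>
     </customer>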
Nodes
We describe the XDM mappings using object-oriented notation. Each
store implements a Node interface that provides node behavior to
data. Brackit interacts with the storage through this interface. It
provides the general rules present in the XDM [19], Namespaces [2],
and XQuery Update Facility [3] standards, resulting in navigational
operations, comparisons, and other functionality. RiakRowNode wraps
Riak's buckets, key/values, and links. HBaseRowNode wraps HBase's
tables, column families, qualifiers, and values. Finally,
MongoRowNode wraps MongoDB's collections, documents, fields, and
values.
   Overall, each instance of these objects represents one unit of
data from the storage level. In order to better grasp the mapping,
we describe the HBase abstraction in more detail, because it
represents the most complex case. Riak's and MongoDB's
representations follow the same approach, but without a
“second-level node”. Tables are not represented within the Node
interface, because their semantics describe where data is logically
stored, not the data itself. Therefore, they are represented by a
separate interface, called Collection. Column families represent a
first-level access. Qualifiers represent a second-level access.
Finally, values represent a value access. Moreover, the first-level,
second-level, and value accesses must keep track of current indexes,
allowing the node to properly implement XDM operations. Figure 3
depicts the mapping. The upper-most part of the picture shows a
node which represents a data row from any of the three stores. The
first layer of nodes—with level = 1st—represents the first-level
access explained previously. The semantics of the first-level
access differ between stores: while Riak and MongoDB interpret it
as a value wrapper, HBase uses it as a column-family wrapper.
Furthermore, HBase is the only implementation that needs a
second-level access, represented by the middle node with level =
2nd, in this example accessing the wrapper of regionkey = “1”.
Finally, the lower-level nodes with level = value access the values
of the structure.

Figure 3: Nodes implementing XDM structure.
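To relate these access levels to query evaluation, the path below
is annotated with the node level that each step resolves to in the
HBase case. The correspondence is our reading of Figure 3, and the
final text() step for the value access is an assumption.

     for $row in collection("partsupp")/partsupp  (: Collection and row node :)
     return $row/values      (: first-level access: column family :)
                /supplycost  (: second-level access: qualifier :)
                /text()      (: value access :)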
Optimizations
We introduce projection and predicate pushdown optimizations. The
only store that allows for predicate pushdown is MongoDB, while
projection (filter) pushdown is realized on all of them. These
optimizations are fundamental advantages of this work when compared
with processing MapReduce over raw files: we can take “shortcuts”
that lead directly to the bytes we want on disk.
   Projection (filter) pushdown is an important optimization for
minimizing the amount of data scanned and processed by the storage
level, as well as for reducing the amount of data passed up to the
query processor. Predicate pushdown is yet another optimization
technique to minimize the amount of data flowing between the
storage and processing layers. The whole idea is to process
predicates as early in the plan as possible, thus pushing them down
to the storage layer.
   In both cases we traverse the AST generated at the beginning of
the compilation step, looking for specific nodes, and when they are
found we annotate the collection node in the AST with this
information. The former looks for path expressions (PathExpr) that
represent a child step from a collection node, or descendants of
collection nodes, because in the HBase implementation we have more
than one access level within the storage. The latter looks for
general-comparison operators, such as equal, not equal, less than,
greater than, less than or equal to, and greater than or equal to.
Afterwards, when accessing the collection on the storage level, we
use the marked collection nodes to filter data instead of passing
it all on to the query engine.
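A query of the following shape would trigger both annotations; the
query and the constant are illustrative only. The path steps below
the collection node are candidates for projection (filter) pushdown
on every store, while the general comparison in the where clause is
a candidate for predicate pushdown on MongoDB.

     for $ps in collection("partsupp")/partsupp
     where $ps/values/availqty >= 1000   (: predicate pushdown (MongoDB only) :)
     return $ps/values/supplycost        (: projection pushdown: only the
                                            referenced qualifiers are read :)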
NoSQL updates
The NoSQL stores we use present different APIs for persisting data.
Even though XQuery does not provide data-storing mechanisms in its
recommendation, it does provide an extension for that end, the
XQuery Update Facility [3]. It allows adding new nodes, deleting or
renaming existing nodes, and replacing existing nodes and their
values. The XQuery Update Facility adds very natural and efficient
persistence capabilities to XQuery, but it adds a lot of complexity
as well. Moreover, some of its constructs require document order,
which is simply not available in the case of Riak. Therefore,
functions with simple semantics, such as insert or put, seem more
attractive, and they achieve the goal of persisting or updating
data.
   The insert function stores a value within the underlying store.
We provide two possible signatures, with or without a $key,
therefore allowing for both insertions and updates.

     db:insert($table as xs:string,
               $key as xs:string,
               $value as node()) as xs:boolean

The delete function deletes a value from the store. We also provide
two possible signatures, with or without $key, therefore allowing
for the deletion of a given key or the dropping of a given table.

     db:delete($table as xs:string,
               $key as xs:string) as xs:boolean
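A minimal usage sketch of these functions is shown below. Only the
signatures above are given by the text; the table name, key, and
element value are invented, and the db prefix is assumed to be
bound as in the listings above.

     (: store one element under an explicit key, then delete it again :)
     let $ok := db:insert("customer", "c-7",
                          <customer id="c-7"><name>Customer#7</name></customer>)
     return
       if ($ok) then db:delete("customer", "c-7")
       else false()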
5.   EXPERIMENTS
   The framework we developed in this work is mainly concerned with
the feasibility of executing XQuery queries atop NoSQL stores.
Therefore, our focus is primarily on a proof of concept. The data
used for our tests comes from the TPC-H benchmark [1]. The dataset
has a size of 1 GB, and we essentially scanned the five biggest
TPC-H tables: part, partsupp, orders, lineitem, and customer. The
experiments were performed on a single machine with an Intel
Centrino Duo dual-core CPU at 2.00 GHz and 4 GB of RAM, running
Ubuntu Linux 10.04 LTS. We used HBase 0.94.1, Riak 1.2.1, and
MongoDB 2.2.1. It is not our goal to assess the scalability of
these systems, but rather their query-processing performance. For
scalability benchmarks, we refer to [9] and [10].
5.1   Results

Figure 4: Latency comparison among stores.

   Figure 4 shows the latency measured for the best scheme of each
store, using a log scale. As we can see, all approaches benefit
from the optimization techniques. The blue column of the graph—full
table scan—shows the latency when scanning all data from the TPC-H
tables. The red column—single column scan—represents the latency
when scanning a single column of each table. Projection (filter)
pushdown explains the improvement in performance compared to the
first scan, since it reduces the amount of data flowing from the
storage to the processing level. The orange column—predicate column
scan—represents the latency when scanning a single column with the
results filtered by a predicate. We chose the predicates so that
they cut the amount of resulting data in half compared with the
single column scan. The query time was reduced by approximately
30%, not reaching the theoretically possible 50% improvement,
essentially because of processing overhead. Nevertheless, it shows
how effective the technique is.
   In scanning scenarios like the ones in this work, MongoDB has
shown to be more efficient than the other stores, consistently
delivering the lowest latency. MongoDB is faster by design: trading
data-storage capacity for data addressability has proved to be a
very efficient solution, although it is also a significant
limitation. Moreover, MongoDB uses pre-caching techniques;
therefore, at run time it can work with data almost entirely from
main memory, especially in scanning scenarios.
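The three workloads correspond to queries of roughly the following
shape; this is our reconstruction for illustration only, and the
predicate constant is arbitrary.

     (: full table scan :)
     for $ps in collection("partsupp")/partsupp
     return $ps

     (: single column scan :)
     for $ps in collection("partsupp")/partsupp
     return $ps/values/supplycost

     (: predicate column scan :)
     for $ps in collection("partsupp")/partsupp
     where $ps/values/availqty >= 5000
     return $ps/values/supplycost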
6.   CONCLUSIONS
   We extended a mechanism that executes XQuery to work with
different NoSQL stores as its storage layer, thus providing a
high-level interface to process data in an optimized manner. We
have shown that our approach is generic enough to work with
different NoSQL implementations.
   Whenever querying these systems with MapReduce—taking advantage
of its linearly scalable programming model for processing and
generating large data sets—parallelization details, fault
tolerance, and distribution aspects are hidden from the user.
Nevertheless, as a data-processing paradigm, MapReduce represents
the past. It is not novel, does not use schemas, and provides a
low-level record-at-a-time API: a scenario that resembles the
1960s, before modern DBMSs. It requires implementing queries from
scratch and still suffers from a lack of proper tools to enhance
its querying capabilities. Moreover, when executed atop raw files,
processing is inefficient, because brute force is the only
processing option. We addressed precisely these two MapReduce
problems: XQuery serves as the higher-level query language, and
NoSQL stores replace raw files, thus increasing performance.
Overall, MapReduce emerges as a solution for situations where
DBMSs are too “hard” to work with, but it should not overlook the
lessons of more than 40 years of database technology.
   Other approaches, such as Hive and Scope, cope with similar
problems. Hive [18] is a framework for data warehousing on top of
Hadoop. Nevertheless, it only provides equi-joins and does not
fully support point access or CRUD operations—inserts into existing
tables are not supported due to the simplicity of its locking
protocols. It also uses raw files at the storage level, supporting
only CSV files, and it is not flexible enough for Big Data
problems, because it cannot understand the structure of Hadoop
files without catalog information. Scope [7] provides a declarative
scripting language targeted at massive data analysis, borrowing
several features from SQL. It also runs atop a distributed
computing platform with a MapReduce-like model, and therefore
suffers from the same problems: a lack of flexibility and
generality, although it is scalable.
7.   REFERENCES
 [1] The TPC-H benchmark. http://www.tpc.org/tpch/, 1999.
 [2] Namespaces in XML 1.1 (second edition).
     http://www.w3.org/TR/xml-names11/, August 2006.
 [3] XQuery Update Facility 1.0.
     http://www.w3.org/TR/2009/CR-xquery-update-10-20090609/,
     June 2009.
 [4] S. Bächle. Separating Key Concerns in Query Processing - Set
     Orientation, Physical Data Independence, and Parallelism. PhD
     thesis, University of Kaiserslautern, 12 2012.
 [5] S. Bächle and C. Sauer. Unleashing XQuery for data-independent
     programming. Submitted, 2011.
 [6] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y.
     Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A
     scripting language for large scale semistructured data
     analysis. PVLDB, 4(12):1272–1283, 2011.
 [7] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib,
     S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel
     processing of massive data sets. Proc. VLDB Endow.,
     1(2):1265–1276, Aug. 2008.
 [8] K. Chodorow and M. Dirolf. MongoDB: The Definitive Guide.
     O'Reilly Media, 2010.
 [9] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and
     R. Sears. Benchmarking cloud serving systems with YCSB. In
     Proceedings of the 1st ACM Symposium on Cloud Computing,
     SoCC '10, pages 143–154, New York, NY, USA, 2010. ACM.
[10] T. Dory, B. Mejhas, P. V. Roy, and N. L. Tran. Measuring
     elasticity for cloud databases. In Proceedings of the Second
     International Conference on Cloud Computing, GRIDs, and
     Virtualization, 2011.
[11] L. George. HBase: The Definitive Guide. O'Reilly Media, 2011.
[12] T. Härder. DBMS architecture - new challenges ahead.
     Datenbank-Spektrum, 14:38–48, 2005.
[13] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker.
     OLTP through the looking glass, and what we found there, 2008.
[14] R. Klophaus. Riak Core: Building distributed applications
     without shared state. In ACM SIGPLAN Commercial Users of
     Functional Programming, CUFP '10, pages 14:1–14:1, New York,
     NY, USA, 2010. ACM.
[15] F. Mattern. Virtual time and global states of distributed
     systems. In C. M. et al., editor, Proc. Workshop on Parallel
     and Distributed Algorithms, pages 215–226. North-Holland /
     Elsevier, 1989.
[16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins.
     Pig Latin: A not-so-foreign language for data processing. In
     Proceedings of the 2008 ACM SIGMOD International Conference
     on Management of Data, SIGMOD '08, pages 1099–1110, New York,
     NY, USA, 2008. ACM.
[17] C. Sauer. XQuery processing in the MapReduce framework.
     Master thesis, Technische Universität Kaiserslautern, 2012.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang,
     S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale
     data warehouse using Hadoop. In ICDE, pages 996–1005, 2010.
[19] N. Walsh, M. Fernández, A. Malhotra, M. Nagy, and J. Marsh.
     XQuery 1.0 and XPath 2.0 Data Model (XDM).
     http://www.w3.org/TR/2007/REC-xpath-datamodel-20070123/,
     January 2007.
[20] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.