                 Proceedings

     25. GI-Workshop „Grundlagen von Datenbanken“
              28.05.2013 – 31.05.2013

               Ilmenau, Germany




                Kai-Uwe Sattler
                Stephan Baumann
                  Felix Beier
                  Heiko Betz
              Francis Gropengießer
                Stefan Hagedorn
                    (Eds.)
Preface

Dear participants,

From May 28 to May 31, 2013, the workshop „Grundlagen von Datenbanken“ of the GI working
group „Grundlagen von Informationssystemen“ within the GI division Datenbanken und
Informationssysteme (DBIS) took place for the 25th time. After Austria in 2011 and the
Spreewald in 2012, Thuringia hosted the workshop for the third time, this year in the small
municipality of Elgersburg at the foot of the Hohe Warte in the Ilm district. The workshop
was organized by the Databases and Information Systems Group of TU Ilmenau.
   The workshop series, launched in 1989 in Volkse near Braunschweig by the Braunschweig
database group and held in Volkse for its first three years, has since established itself as an
institution for the exchange of ideas, especially among young researchers from the
German-speaking countries, in the field of databases and information systems. The
restrictions to German as the presentation language and to purely theory- and
foundations-oriented topics have long been dropped, while the open atmosphere at secluded
venues (and Elgersburg was no exception), with plenty of time for intensive discussions
during the sessions and in the evenings, has remained.
   For this year's workshop, 15 contributions were submitted and each was reviewed by three
members of the 13-member program committee. Of all submissions, 13 were selected for
presentation at the workshop. The topics ranged from almost classical database themes such
as query processing (with XQuery), conceptual modeling (for XML schema evolution), index
structures (for patterns on moving objects), and the detection of column correlations, over
current topics such as MapReduce and cloud databases, to applications in image retrieval,
information extraction, complex event processing, and security.
   The four-day program was completed by two keynotes from renowned database researchers:
Theo Härder presented the WattDB project on an energy-proportional database system, and
Peter Boncz discussed the challenges that modern hardware architectures pose for the
optimization of database systems, accompanied by practical demonstrations. We thank both
of them for coming and for their interesting talks. In two further talks, the sponsors of this
year's workshop, SAP AG and Objectivity Inc., took the opportunity to present the database
technologies behind HANA (SAP AG) and InfiniteGraph (Objectivity Inc.). We therefore
thank the speakers Hannes Rauhe and Timo Wagner, as well as the two companies, for the
financial support of the workshop and thus of the work of the GI working group.
   Thanks are also due to everyone involved in organizing and running the workshop: the
authors for their contributions and talks, the members of the program committee for their
constructive and punctual reviews of the submissions, the staff of the Hotel am Wald in
Elgersburg, the steering committee of the working group in the persons of Günther Specht
and Stefan Conrad, who did not miss the chance to attend the workshop in person, and Eike
Schallehn, who supported us behind the scenes with advice and assistance. The greatest
thanks, however, go to my group's team, which carried out the bulk of the organizational
work: Stephan Baumann, Francis Gropengießer, Heiko Betz, Stefan Hagedorn and Felix Beier.
Without their commitment the workshop would not have been possible. Thank you very much!

Kai-Uwe Sattler


Ilmenau, May 28, 2013
Committee
Program Committee
     • Andreas Heuer, Universität Rostock
     • Eike Schallehn, Universität Magdeburg
     • Erik Buchmann, Karlsruher Institut für Technologie

     • Friederike Klan, Universität Jena
     • Gunter Saake, Universität Magdeburg
     • Günther Specht, Universität Innsbruck
     • Holger Schwarz, Universität Stuttgart

     • Ingo Schmitt, Brandenburgische Technische Universität Cottbus
     • Kai-Uwe Sattler, Technische Universität Ilmenau
     • Katja Hose, Aalborg University

     • Klaus Meyer-Wegener, Universität Erlangen
     • Stefan Conrad, Universität Düsseldorf
     • Torsten Grust, Universität Tübingen

Organizing Committee
     • Kai-Uwe Sattler, TU Ilmenau
     • Stephan Baumann, TU Ilmenau

     • Felix Beier, TU Ilmenau
     • Heiko Betz, TU Ilmenau
     • Francis Gropengießer, TU Ilmenau
     • Stefan Hagedorn, TU Ilmenau




Table of Contents

1 Keynotes
  1.1  WattDB—a Rocky Road to Energy Proportionality (Theo Härder)
  1.2  Optimizing database architecture for machine architecture: is there still hope?
       (Peter Boncz)

2 Workshop Contributions
  2.1  Adaptive Prejoin Approach for Performance Optimization in MapReduce-based
       Warehouses (Weiping Qu, Michael Rappold and Stefan Dessloch)
  2.2  Ein Cloud-basiertes räumliches Decision Support System für die Herausforderungen
       der Energiewende (Golo Klossek, Stefanie Scherzinger and Michael Sterner)
  2.3  Consistency Models for Cloud-based Online Games: the Storage System's Perspective
       (Ziqiang Diao)
  2.4  Analysis of DDoS Detection Systems (Michael Singhof)
  2.5  A Conceptual Model for the XML Schema Evolution (Thomas Nösinger, Meike Klettke
       and Andreas Heuer)
  2.6  Semantic Enrichment of Ontology Mappings: Detecting Relation Types and Complex
       Correspondences (Patrick Arnold)
  2.7  Extraktion und Anreicherung von Merkmalshierarchien durch Analyse unstrukturierter
       Produktrezensionen (Robin Küppers)
  2.8  Ein Verfahren zur Beschleunigung eines neuronalen Netzes für die Verwendung im
       Image Retrieval (Daniel Braun)
  2.9  Auffinden von Spaltenkorrelationen mithilfe proaktiver und reaktiver Verfahren
       (Katharina Büchse)
  2.10 MVAL: Addressing the Insider Threat by Valuation-based Query Processing
       (Stefan Barthel and Eike Schallehn)
  2.11 TrIMPI: A Data Structure for Efficient Pattern Matching on Moving Objects
       (Tsvetelin Polomski and Hans-Joachim Klein)
  2.12 Complex Event Processing in Wireless Sensor Networks (Omran Saleh)
  2.13 XQuery processing over NoSQL stores (Henrique Valer, Caetano Sauer and
       Theo Härder)
           WattDB—a Rocky Road to Energy Proportionality

                                                            Theo Härder
                                      Databases and Information Systems Group
                                        University of Kaiserslautern, Germany
                                                haerder@cs.uni-kl.de


25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
28.05.2013 – 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

Extended Abstract

Energy efficiency is becoming more important in database design, i. e., the work delivered by
a database server should be accomplished with minimal energy consumption. So far, a
substantial number of research papers have examined and optimized the energy consumption
of database servers or single components. In this way, our first efforts were exclusively
focused on the use of flash memory or SSDs in a DBMS context to identify their performance
potential for typical DB operations. In particular, we developed tailor-made algorithms to
support caching for flash-based databases [3], however with limited success concerning the
energy efficiency of the entire database server.
   A key observation made by Tsirogiannis et al. [5] concerning the energy efficiency of single
servers is that the best performing configuration is also the most energy-efficient one, because
power use is not proportional to system utilization and, for this reason, the runtime needed
for accomplishing a computing task essentially determines energy consumption. Based on our
caching experiments for flash-based databases, we came to the same conclusion [2]. Hence,
the server system must be fully utilized to be most energy efficient. However, real-world
workloads do not stress servers continuously. Typically, their average utilization ranges
between 20 and 50% of peak performance [1]. Therefore, traditional single-server DBMSs are
chronically underutilized and operate below their optimal energy-consumption-per-query
ratio. As a result, there is a big optimization opportunity to decrease energy consumption
during off-peak times.
   Because the energy use of single-server systems is far from being energy proportional, we
came up with the hypothesis that better energy efficiency may be achieved by a cluster of
nodes whose size is dynamically adjusted to the current workload demand. For this reason,
we shifted our research focus from inflexible single-server DBMSs to distributed clusters
running on lightweight nodes. Although distributed systems impose some performance
degradation compared to a single, brawny server, they offer higher energy saving potential
in turn.
   Current hardware is not energy proportional, because a single server consumes, even when
idle, a substantial fraction of its peak power [1]. Because typical usage patterns lead to a
server utilization far less than its maximum, the energy efficiency of a server aside from peak
performance is reduced [4]. In order to achieve energy proportionality using commodity
hardware, we have chosen a clustered approach, where each node can be powered
independently. By turning whole nodes on and off, the overall performance and energy
consumption can be fitted to the current workload. Unused servers could be either shut down
or made available to other processes. If present in a cloud, those servers could be leased to
other applications.
   We have developed a research prototype of a distributed DBMS called WattDB on a
scale-out architecture, consisting of n wimpy computing nodes, interconnected by a 1 Gbit/s
Ethernet switch. The cluster currently consists of 10 identical nodes, each composed of an
Intel Atom D510 CPU, 2 GB DRAM and an SSD. The configuration is considered
Amdahl-balanced, i. e., balanced between I/O and network throughput on the one hand and
processing power on the other.
   Compared to InfiniBand, the bandwidth of the interconnecting network is limited but
sufficient to supply the lightweight nodes with data. More expensive, yet faster connections
would have required more powerful processors and more sophisticated I/O subsystems. Such
a design would have pushed the cost beyond limits, especially because we would not have
been able to use commodity hardware. Furthermore, by choosing lightweight components,
the overall energy footprint is low and the smallest configuration, i. e., the one with the
fewest nodes, exhibits low power consumption. Moreover, experiments running on a small
cluster can easily be repeated on a cluster with more powerful nodes.
   A dedicated node is the master node, handling incoming queries and coordinating the
cluster. Some of the nodes each have four hard disks attached and act as storage nodes,
providing persistent data storage to the cluster. The remaining nodes (without hard disk
drives) are called processing nodes. Due to the lack of directly accessible storage, they can
only operate on data provided by other nodes (see Figure 1).
   All nodes can evaluate (partial) query plans and execute DB operators, e. g., sorting,
aggregation, etc., but only the storage nodes can access the DB storage structures, i. e.,
tables and indexes.
[Figure 1: Overview of the WattDB cluster (a master node, processing nodes, and storage
nodes with four attached disks each).]

Each storage node maintains a DB buffer to keep recently referenced pages in main memory,
whereas a processing node does not cache intermediate results. As a consequence, each query
always needs to fetch the qualified records from the corresponding storage nodes.
   Hence, our cluster design results in a shared-nothing architecture where the nodes differ
only in whether they have direct access to DB data on external storage. Each of the nodes is
additionally equipped with a 128 GB solid-state disk (Samsung 830 SSD). The SSDs do not
store the DB data; they provide swap space to support external sorting and persistent
storage for configuration files. We have chosen SSDs because their access latency is much
lower compared to traditional hard disks; hence, they are better suited for temp storage.
   In WattDB, a dedicated component running on the master node, called the
EnergyController, controls the energy consumption. This component monitors the
performance of all nodes in the cluster. Depending on the current query workload and node
utilization, the EnergyController activates and suspends nodes to guarantee a sufficiently
high node utilization for the current workload demand. Suspended nodes consume only a
fraction of the idle power, but can be brought back online in a matter of a few seconds. The
EnergyController also modifies query plans to dynamically distribute the current workload
over all running nodes, thereby achieving a balanced utilization of the active processing
nodes.
   As data-intensive workloads, we submit specific TPC-H queries against a distributed
shared-nothing DBMS, where time and energy use are captured by specific monitoring and
measurement devices. We configure various static clusters of varying sizes and show their
influence on energy efficiency and performance. Further, using the EnergyController and a
load-aware scheduler, we verify the hypothesis that energy proportionality for database
management tasks can be well approximated by dynamic clusters of wimpy computing nodes.

1.   REFERENCES
[1] L. A. Barroso and U. Hölzle. The Case for Energy-Proportional Computing. IEEE
    Computer, 40(12):33–37, 2007.
[2] T. Härder, V. Hudlet, Y. Ou, and D. Schall. Energy efficiency is not enough, energy
    proportionality is needed! In DASFAA Workshops, 1st Int. Workshop on FlashDB,
    LNCS 6637, pages 226–239, 2011.
[3] Y. Ou, T. Härder, and D. Schall. Performance and Power Evaluation of Flash-Aware
    Buffer Algorithms. In DEXA, LNCS 6261, pages 183–197, 2010.
[4] D. Schall, V. Höfner, and M. Kern. Towards an Enhanced Benchmark Advocating
    Energy-Efficient Systems. In TPCTC, LNCS 7144, pages 31–45, 2012.
[5] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah. Analyzing the Energy Efficiency of a
    Database Server. In SIGMOD Conference, pages 231–242, 2010.
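
To make the suspend/resume policy of the EnergyController described above more concrete,
here is a minimal, hypothetical sketch in Python. It is not WattDB code; the node API, the
utilization metric, and the thresholds are assumptions chosen only for illustration.

# Minimal sketch of an EnergyController-style control loop (hypothetical, not WattDB code).
# Assumptions: each node reports a utilization in [0, 1]; suspended nodes can be resumed
# within seconds; thresholds and the averaging policy are invented for illustration.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    utilization: float = 0.0   # fraction of capacity currently used
    active: bool = True

class EnergyController:
    def __init__(self, nodes, low=0.3, high=0.8, min_active=1):
        self.nodes = nodes
        self.low, self.high, self.min_active = low, high, min_active

    def step(self):
        active = [n for n in self.nodes if n.active]
        suspended = [n for n in self.nodes if not n.active]
        avg = sum(n.utilization for n in active) / len(active)

        if avg > self.high and suspended:
            # Cluster is overloaded: wake one suspended node.
            suspended[0].active = True
        elif avg < self.low and len(active) > self.min_active:
            # Cluster is underutilized: suspend the least-loaded node
            # (its work would be redistributed by the query planner).
            idle = min(active, key=lambda n: n.utilization)
            idle.active = False

# Example: a shrinking load leads the controller to power nodes down step by step.
cluster = [Node(f"node{i}", utilization=0.9) for i in range(4)]
ctrl = EnergyController(cluster)
for load in (0.9, 0.5, 0.2, 0.1):
    for n in cluster:
        n.utilization = load if n.active else 0.0
    ctrl.step()
    print(load, [n.name for n in cluster if n.active])

In WattDB itself, such decisions additionally feed back into query planning, so that the
current workload is redistributed over the nodes that remain active.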
Optimizing database architecture for machine architecture:
                   is there still hope?

                                                             Peter Boncz
                                                               CWI
                                                          p.boncz@cwi.nl

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
28.05.2013 – 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

Extended Abstract

In the keynote, I will give some examples of how computer architecture has strongly evolved
in the past decades and how this influences the performance, and therefore the design, of
algorithms and data structures for data management. On the one hand, these changes in
hardware architecture have caused the (continuing) need for new data management research,
i.e. hardware-conscious database research. Here, I will draw examples from
hardware-conscious research performed on the CWI systems MonetDB and Vectorwise.
   This diversification trend in the computer architectural characteristics of the various
solutions in the market seems to be intensifying. This is seen in quite different architectural
options, such as CPU vs. GPU vs. FPGA, but even restricting oneself to just CPUs there
seems to be increasing design variation in architecture and platform behavior. This poses a
challenge to hardware-conscious database research.
   In particular, there is the all too present danger of over-optimizing for one particular
architecture, or of proposing techniques that will have only a very short span of utility. The
question thus is not only to find specific ways to optimize for certain hardware features, but
to do so in a way that works across the full spectrum of architectures, i.e. with robust
techniques.
   I will close the talk with recent work at CWI and Vectorwise on the robustness of query
evaluator performance, describing a project called "Micro-Adaptivity" in which database
systems are made self-adaptive and react immediately to observed performance,
self-optimizing for the combination of current query workload, observed data distributions,
and hardware characteristics.
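
As a rough illustration of the micro-adaptivity idea, the following hypothetical Python sketch
keeps several interchangeable implementations ("flavors") of one primitive and periodically
re-times them so that the currently fastest one is used. The flavors, the exploration policy,
and all names are invented for illustration and are not the Vectorwise implementation.

# Hypothetical sketch of micro-adaptive flavor selection: several equivalent
# implementations of one primitive compete, and the system keeps re-measuring
# them so it can switch whenever another flavor becomes faster.
import time

def filter_loop(values, threshold):
    out = []
    for v in values:          # branch-heavy flavor
        if v < threshold:
            out.append(v)
    return out

def filter_comprehension(values, threshold):
    return [v for v in values if v < threshold]   # list-comprehension flavor

class MicroAdaptiveFilter:
    def __init__(self, flavors, explore_every=100):
        self.flavors = flavors
        self.timings = {f: float("inf") for f in flavors}
        self.calls = 0
        self.explore_every = explore_every

    def __call__(self, values, threshold):
        self.calls += 1
        if self.calls % self.explore_every < len(self.flavors):
            # Exploration: time one flavor on this call.
            flavor = self.flavors[self.calls % self.explore_every]
            start = time.perf_counter()
            result = flavor(values, threshold)
            self.timings[flavor] = time.perf_counter() - start
            return result
        # Exploitation: use the currently fastest flavor.
        best = min(self.flavors, key=lambda f: self.timings[f])
        return best(values, threshold)

adaptive_filter = MicroAdaptiveFilter([filter_loop, filter_comprehension])
data = list(range(10000))
print(len(adaptive_filter(data, 5000)))   # 5000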
     Adaptive Prejoin Approach for Performance Optimization
                in MapReduce-based Warehouses

Weiping Qu (Heterogeneous Information Systems Group, University of Kaiserslautern),
qu@informatik.uni-kl.de
Michael Rappold* (Department of Computer Science, University of Kaiserslautern),
m_rappol@cs.uni-kl.de
Stefan Dessloch (Heterogeneous Information Systems Group, University of Kaiserslautern),
dessloch@informatik.uni-kl.de

* Michael Rappold carried out this work during his master's studies at the University of
Kaiserslautern.
25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken),
28.05.2013 – 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

ABSTRACT

MapReduce-based warehousing solutions (e.g. Hive) for big data analytics, with the
capability of storing and analyzing high volumes of both structured and unstructured data in
a scalable file system, have emerged recently. Their efficient data loading features enable a
so-called near real-time warehousing solution, in contrast to those offered by conventional
data warehouses with complex, long-running ETL processes.
   However, there are still many opportunities for performance improvements in MapReduce
systems. The performance of analyzing structured data in them cannot match that of
traditional data warehouses. For example, join operations are generally regarded as a
bottleneck of performing generic complex analytics over structured data with MapReduce jobs.
   In this paper, we present an approach for improving performance in MapReduce-based
warehouses by redundantly pre-joining frequently used dimension columns with the fact table
during data transfer and by adapting queries to this join-friendly schema automatically at
runtime using a rewrite component. This approach is driven by statistics derived from
previously executed workloads in terms of join operations.
   The results show that the execution performance is improved by getting rid of join
operations in a set of future workloads whose joins exactly fit the pre-joined fact table
schema, while the performance remains the same for other workloads.

1.   INTRODUCTION

   By packaging complex custom imperative programs (text mining, machine learning, etc.)
into simple map and reduce functions and executing them in parallel on files in a large
scalable file system, MapReduce/Hadoop (1) systems enable analytics on large amounts of
unstructured or structured data in acceptable response time.
   With the continuous growth of data, scalable data stores based on Hadoop/HDFS (2) have
attracted more and more attention for big data analytics. In addition, by means of simply
pulling data into the file system of MapReduce-based systems, unstructured data without
schema information is directly analyzed with parallelizable custom programs, whereas data
can only be queried in traditional data warehouses after it has been loaded by ETL tools
(cleansing, normalization, etc.), which normally takes a long period of time.
   Consequently, many web or business companies add MapReduce systems to their
analytical architecture. For example, Fatma Özcan et al. [12] integrate their DB2 warehouse
with the Hadoop-based analysis tool IBM InfoSphere BigInsights using connectors between
the two platforms. An analytical synthesis is provided, where unstructured data is initially
placed in a Hadoop-based system and analyzed by MapReduce programs. Once its schema
can be defined, it is further loaded into a DB2 warehouse with more efficient analysis
execution capabilities.
   Another example is the data warehousing infrastructure at Facebook, which involves a
web-based tier, a federated MySQL tier and a Hadoop-based analytical cluster (Hive).
   Such an orchestration of various analytical platforms forms a heterogeneous environment
where each platform has a different interface, data model, computational capability, storage
system, etc.
   Pursuing a global optimization in such a heterogeneous environment is always challenging,
since it is generally hard to estimate the computational capability or operational cost
precisely on each autonomous platform. The internal query engine and storage system do not
tend to be exposed to the outside and are not designed for data integration.
   In our case, relational databases and Hadoop will be integrated together to deliver an
analytical cluster. Simply transferring data from relational databases to Hadoop without
considering the computational capabilities in Hadoop can lead to lower performance.

(1) Hadoop is an open-source implementation of the MapReduce framework from the Apache
community, see http://hadoop.apache.org
(2) HDFS (Hadoop Distributed File System) is used to store the data in Hadoop for analysis.
   As an example, performing complex analytical workloads over multiple small/large tables
(loaded from relational databases) in Hadoop leads to a number of join operations, which
slows down the whole processing. The reason is that join performance is normally weak in
MapReduce systems compared to relational databases [15]. Performance limitations have
been shown to be due to several reasons, such as the inherently unary nature of the map and
reduce functions.
   To achieve better global performance in such an analytical synthesis with multiple
platforms, several strategies can be applied from a global perspective.
   One would be simply improving the join implementation on a single MapReduce platform.
There have been several existing works trying to improve join performance in MapReduce
systems [3, 1].
   Another one would be using heuristics for global performance optimization. In this paper,
we take a look at the second one. In order to validate our general idea of improving global
performance on multiple platforms, we deliver our adaptive approach in terms of join
performance. We take the data flow architecture at Facebook as a starting point and the
contributions are summarized as follows:
  1. Adaptively pre-joining tables during data transfer for better performance in
     Hadoop/Hive.

  2. Rewriting incoming queries according to the changing table schema.

   The remainder of this paper is structured as follows: Section 2 describes the background of
this paper. Section 3 gives a naïve approach of fully pre-joining related tables. Based on the
performance observations for this naïve approach, more considerations have been taken into
account and an adaptive pre-join approach is proposed in Section 4, followed by the
implementation and experimental evaluation shown in Section 5. Section 6 discusses related
work. Section 7 concludes with a summary and future work.

2.   BACKGROUND

   In this section, we introduce our starting point, i.e. the analytical data flow architecture at
Facebook and its MapReduce-based analytical platform Hive. In addition, the performance
issue in terms of joins is stated subsequently.

2.1    Facebook Data Flow Architecture

   Instead of using a traditional data warehouse, Facebook uses Hive, a MapReduce-based
analytical platform, to perform analytics on information describing advertisements. The
MapReduce/Hadoop system offers high scalability, which enables Facebook to perform data
analytics over 15 PB of data and to load 60 TB of new data every day [17]. The architecture
of the data flow at Facebook is described as follows.
   As depicted in Figure 1, data is extracted from two types of data sources: a federated
MySQL tier and a web-based tier. The former offers the category, the name and
corresponding information of the advertisements as dimension data, while actions such as
viewing an advertisement, clicking on it, or fanning a Facebook page are extracted as fact
data from the latter.

[Figure 1: Facebook Data Flow Architecture [17] (web servers and Scribe-Hadoop clusters
feeding an ad hoc and a production Hive-Hadoop cluster, with dimension data loaded from
the federated MySQL tier).]

   There are two types of analytical clusters: a production Hive cluster and an ad hoc Hive
cluster. Periodic queries are performed on the production Hive cluster, while ad hoc queries
are executed on the ad hoc Hive cluster.

2.2    Hive

   Hive [16] is an open source data warehousing solution built on top of MapReduce/Hadoop.
Analytics is essentially done by MapReduce jobs and data is still stored and managed in
Hadoop/HDFS.
   Hive supports a higher-level SQL-like language called HiveQL for users who are familiar
with SQL for accessing files in Hadoop/HDFS, which highly increases the productivity of
using MapReduce systems. When a HiveQL query comes in, it is automatically translated
into corresponding MapReduce jobs with the same analytical semantics. For this purpose,
Hive has its own meta-data store which maps the HDFS files to the relational data model.
Files are logically interpreted as relational tables during HiveQL query execution.
   Furthermore, in contrast to the high data loading cost (using ETL jobs) in traditional data
warehouses, Hive benefits from its efficient loading process, which pulls raw files directly into
Hadoop/HDFS and further publishes them as tables. This feature makes Hive much more
suitable for dealing with large volumes of data (i.e. big data).

2.3    Join in Hadoop/Hive

   There has been an ongoing debate comparing parallel database systems and
MapReduce/Hadoop. In [13], experiments showed that the performance of selection,
aggregation and join tasks in Hadoop could not reach that of parallel databases (Vertica &
DBMS-X). Several reasons for the performance difference have also been explained by
Stonebraker et al. in [15], such as repetitive record parsing and high I/O cost due to
non-compression & non-indexing.
   Moreover, as MapReduce was not originally designed to combine information from two or
more data sources, join implementations are always cumbersome [3]. The join performance
relies heavily on the implementation of the MapReduce jobs, which has been considered as
not straightforward.
   As Hive is built on top of MapReduce/Hadoop, the join operation is essentially done by
corresponding MapReduce jobs. Thus, Hive suffers from these issues even though there have
been efforts [5] to improve join performance in MapReduce systems or in Hive.
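
To make the join bottleneck discussed in Section 2.3 concrete, the following sketch simulates
a reduce-side (repartition) join, the generic way of joining two data sets with a single
MapReduce job. It is a simplified Python simulation rather than Hive's actual join code, and
all table and column names are invented. It illustrates that both inputs have to be tagged,
shuffled by the join key, and buffered at the reducers, which is exactly the overhead that the
pre-join approach presented below tries to avoid.

# Simplified simulation of a reduce-side (repartition) join in a MapReduce setting.
# All data and table names are illustrative.
from collections import defaultdict

def map_phase(fact_rows, dim_rows):
    # Emit (join_key, (tag, row)) pairs for both inputs; the tag remembers the source.
    for row in fact_rows:
        yield row["dim_id"], ("fact", row)
    for row in dim_rows:
        yield row["id"], ("dim", row)

def shuffle(pairs):
    # The framework groups all values with the same key on one reducer.
    groups = defaultdict(list)
    for key, tagged_row in pairs:
        groups[key].append(tagged_row)
    return groups

def reduce_phase(groups):
    # Each reducer buffers both sides for its key and builds the joined rows.
    for key, tagged_rows in groups.items():
        facts = [r for tag, r in tagged_rows if tag == "fact"]
        dims = [r for tag, r in tagged_rows if tag == "dim"]
        for f in facts:
            for d in dims:
                yield {**f, **d}

fact = [{"dim_id": 1, "clicks": 10}, {"dim_id": 2, "clicks": 7}]
dim = [{"id": 1, "category": "sports"}, {"id": 2, "category": "news"}]
print(list(reduce_phase(shuffle(map_phase(fact, dim)))))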
3.   FULL PRE-JOIN APPROACH

   Due to the fact that join performance is a bottleneck in Hive with its inherent MapReduce
features, one naïve idea for improving total workload performance would be to simply
eliminate the join task from the workload by performing a rewritten workload with the same
analytical semantics over pre-joined tables created in the data load phase. A performance
gain would be expected by performing a large table scan with the high parallelism of an
increasing number of worker nodes in Hadoop instead of a join. In addition, the scalable
storage system allows us to create redundant pre-joined tables for some workloads with
specific join patterns.
   In an experiment, we tried to validate this strategy. An analytical workload (TPC-H
Query 3) was executed over two data sets of the TPC-H benchmark (with scale factors 5 and
10), once on the original table schema (with the join at runtime) and once on a fully
pre-joined table schema (without join) which fully joins all the related dimension tables with
the fact table during the load phase. In this case, we trade storage overhead for better total
performance.
   As shown on the left side of Figure 2(a), a performance gain for the total workload
(including the join) over the data set with SF 5 can be seen, with 6 GB storage overhead
introduced by fully pre-joining the related tables into one redundant table (shown in
Figure 2(b)). The overall performance can be significantly increased if workloads with the
same join pattern later occur frequently, especially for periodic queries over the production
Hive-Hadoop cluster in the Facebook example.

[Figure 2: Running TPC-H Query 3 on the original and the fully pre-joined table schema:
(a) average runtimes in seconds and (b) accessed data volume in GB for both data set sizes,
with and without pre-join.]

   However, the result of performing the same query on the data set with SF 10 is
disappointing, as there is no performance gain while paying 12.5 GB of storage for
redundancy (shown in Figure 2(b)), which is not what we expected. The reason could be that
the overhead of scanning such a redundant fully pre-joined table as well as the high I/O cost
offset the performance gain as the accessed data volume grows.

4.   ADAPTIVE PRE-JOIN APPROACH

   Taking the lessons learned from the full pre-join approach above, we propose an adaptive
pre-join approach in this paper.
   Instead of pre-joining full dimension tables with the fact table, we try to identify the
dimension columns which occurred frequently in the select, where, etc. clauses of previously
executed queries for filtering, aggregation and so on. We refer to these columns as additional
columns, as compared to the join columns in the join predicates. By collecting a list of
additional column sets from previous queries, for example the periodic queries on the
production Hive-Hadoop cluster, a frequent column set could be extracted.
   One example is illustrated in Figure 3. The frequent set of additional columns has been
extracted. The column r in dimension table α is frequently joined with the fact table in
previous workloads as a filter or aggregate column, and the same holds for the column x in
dimension table β. During the next load phase, the fact table is expanded by redundantly
pre-joining these two additional columns r and x with it.

[Figure 3: Adaptive Pre-joined Schema in Facebook Example (the fact table λ is expanded
with the frequently used dimension columns α.r and β.x to the pre-joined fact table
λ′ = (λ, α.r, β.x)).]

   Depending on the statistics information of previous queries, different frequent sets of
additional columns could be found in diverse time intervals. Thus, the fact table is pre-joined
in an adaptive manner.
   Assuming that the additional columns identified in previous queries will also occur
frequently in future ones (as in the Facebook example), the benefits of the adaptive pre-join
approach are two-fold:
   First, when all the columns (including dimension columns) in a certain incoming query
which requires a join operation are contained in the pre-joined fact table, this query can be
directly performed on the pre-joined fact table without a join.
   Second, the adaptive pre-join approach leads to a smaller table size in contrast to the full
pre-join approach, as only subsets of the dimension tables are pre-joined. Thus, the resulting
storage overhead is reduced, which plays a significant role especially in big data scenarios
(i.e. terabytes, petabytes of data).
   To automatically accomplish the adaptive pre-join approach, three sub-steps are developed:
frequent column set extraction, pre-join and query rewrite.

4.1    Frequent Column Set Extraction

   In the first phase, the statistics collected for extracting the frequent set of additional
columns are formatted as a list of entries, each of which has the following form:

   Set : {Fact, Dim_X.Col_i, Dim_X.Col_j ... Dim_Y.Col_k}

   The join set always starts with the involved fact table,
tured from the select, where, etc. clauses or from the sub-                                                                                                  to answer queries using views in data warehouses. Further-
queries.                                                                                                                                                     more, several subsequent works [14, 10] have focuses on dy-
  The frequent set of additional columns could be extracted                                                                                                  namic view management based on runtime statistics (e.g.
using a set of frequent itemset mining approaches [2, 7, 11]                                                                                                 reference frequency, result data size, execution cost) and
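As a minimal sketch of this extraction step, assume the additional-column sets of previously executed queries have already been parsed out of the workload log; the table and column names are TPC-H-style placeholders, not the actual workload:

    from collections import Counter
    from itertools import combinations

    # Each entry: the fact table of a logged join query plus the additional
    # dimension columns it referenced (placeholder names).
    logged_sets = [
        ("lineitem", {"orders.o_orderdate", "orders.o_shippriority", "customer.c_mktsegment"}),
        ("lineitem", {"orders.o_orderdate", "customer.c_mktsegment"}),
        ("lineitem", {"orders.o_orderdate", "orders.o_shippriority", "customer.c_mktsegment"}),
    ]

    def frequent_column_sets(entries, min_support=2, max_size=3):
        # Naive Apriori-style enumeration of all column subsets per fact table;
        # feasible only because the per-query column sets are small.
        counter = Counter()
        for fact, columns in entries:
            for size in range(1, min(max_size, len(columns)) + 1):
                for subset in combinations(sorted(columns), size):
                    counter[(fact, subset)] += 1
        return [(fact, list(subset), count)
                for (fact, subset), count in counter.items() if count >= min_support]

    for fact, columns, support in frequent_column_sets(logged_sets):
        print(fact, columns, support)

In practice one of the cited mining algorithms [2, 7, 11] would replace this brute-force enumeration, but the input and output have the shape shown here.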
                                                                                                                                                             measured profits for better query performance. In our work,
4.2 Query Rewrite
   As the table schema is changed in our case (i.e., a newly generated fact table schema), initial queries need to be rewritten for successful execution. Since the fact table is pre-joined with a set of dedicated redundant dimension columns, the tables involved in the from clause of the original query can be replaced with the new fact table once all referenced columns are covered by it.
   By storing the mapping from the newly generated fact table schema to the old schema in the catalog, the query rewrite process can be applied easily. Note that the common issue of handling complex sub-queries in Hive is thereby also alleviated if the columns in the sub-query have been pre-joined with the fact table.
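A minimal sketch of such a rewrite, assuming the catalog mapping is available as a simple dictionary and queries are already parsed into table and column lists (the names are illustrative, not the system's actual catalog format):

    # Catalog mapping from the original schema to the adaptively pre-joined
    # fact table (illustrative names).
    PREJOINED_TABLE = "lineitem_prejoined"
    COVERED_COLUMNS = {
        "orders.o_orderdate": "o_orderdate",
        "orders.o_shippriority": "o_shippriority",
        "customer.c_mktsegment": "c_mktsegment",
    }

    def rewrite_from_clause(tables, referenced_columns, fact_table="lineitem"):
        # Rewrite only if every referenced dimension column is covered by the
        # pre-joined table; otherwise keep the original join query untouched.
        dimension_columns = [c for c in referenced_columns
                             if not c.startswith(fact_table + ".")]
        if all(c in COVERED_COLUMNS for c in dimension_columns):
            new_columns = [COVERED_COLUMNS.get(c, c.split(".", 1)[1])
                           for c in referenced_columns]
            return [PREJOINED_TABLE], new_columns
        return tables, referenced_columns

    print(rewrite_from_clause(
        ["customer", "orders", "lineitem"],
        ["customer.c_mktsegment", "orders.o_orderdate", "lineitem.l_extendedprice"]))
    # (['lineitem_prejoined'], ['c_mktsegment', 'o_orderdate', 'l_extendedprice'])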
                                                                                                                                                             7.   CONCLUSION AND FUTURE WORK
5. IMPLEMENTATION AND EVALUATION
   We use Sqoop (an open-source tool for data transfer between Hadoop and relational databases, see http://sqoop.apache.org/) as the basis to implement our approach. The TPC-H benchmark data set with SF 10 is adaptively pre-joined according to the workload statistics and transferred from MySQL to Hive. First, the extracted join pattern information is sent to Sqoop as additional transformation logic embedded in the data transfer jobs, generating the adaptively pre-joined table schema on the original data sources. Furthermore, the generated schema is stored in Hive to enable automatic query rewrite at runtime.
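For illustration, the transformation logic for one frequent column set essentially amounts to a denormalizing SELECT statement like the one built below; the TPC-H names are examples, and the exact wiring into Sqoop's transfer jobs is not shown here:

    def prejoin_import_query(fact, frequent_columns, join_predicates):
        # Build the denormalizing SELECT that pulls the fact table together
        # with its frequently used dimension columns from the source database.
        select_list = ", ".join([fact + ".*"] + sorted(frequent_columns))
        tables = sorted({fact} | {column.split(".")[0] for column in frequent_columns})
        return ("SELECT " + select_list +
                " FROM " + ", ".join(tables) +
                " WHERE " + " AND ".join(join_predicates))

    print(prejoin_import_query(
        "lineitem",
        {"orders.o_orderdate", "orders.o_shippriority", "customer.c_mktsegment"},
        ["lineitem.l_orderkey = orders.o_orderkey",
         "orders.o_custkey = customer.c_custkey"]))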
   We tested the adaptive pre-join approach on a six-node cluster (Xeon quad-core CPUs at 2.53 GHz, 4 GB RAM, 1 TB SATA-II disk, Gigabit Ethernet) running Hadoop and Hive.
   After running the same TPC-H Query 3 over the adaptively pre-joined table schema, the results in Figure 4(a) show that the average runtime is significantly reduced. The join task has been eliminated for this query, and the additional overheads (record parsing, I/O cost) have been relieved due to the smaller amount of redundancy, as shown in Figure 4(b).

Figure 4: Running TPC-H Query 3 on the original, fully pre-joined, and adaptively pre-joined table schemas: (a) average runtime in seconds and (b) data volume accessed for executing the workloads in GB, each for the 10 GB data set and the variants no pre-join, full pre-join, and adaptive pre-join.
6. RELATED WORK
   An adaptively pre-joined fact table is essentially a materialized view in Hive. Creating materialized views in data warehouses is nothing new, but an established technique for query optimization. Since the 1990s, substantial effort [6, 8] has been devoted to answering queries using views in data warehouses. Furthermore, several subsequent works [14, 10] have focused on dynamic view management based on runtime statistics (e.g., reference frequency, result data size, execution cost) and measured profits for better query performance. In our work, we revisit these sophisticated techniques in a MapReduce-based environment.
   Cheetah [4] is a high-performance, custom data warehouse on top of MapReduce. It is very similar to the MapReduce-based warehouse Hive used in this paper. The performance issue of the join implementation has also been addressed in Cheetah. To reduce the network overhead of joining big dimension tables with the fact table at query runtime, big dimension tables are denormalized and all dimension attributes are stored directly in the fact table. In contrast, we choose to denormalize only the frequently used dimension attributes into the fact table, since we believe that a lower I/O cost can be achieved this way.

7. CONCLUSION AND FUTURE WORK
   We propose a schema adaptation approach for global optimization in an analytical synthesis of relational databases and a MapReduce-based warehouse, Hive. As MapReduce systems have weak join performance, frequently used columns of dimension tables are pre-joined with the fact table in an adaptive manner, according to workload statistics, before the data is transferred to Hive. In addition, a rewrite component enables incoming workloads with join operations to be executed transparently over such pre-joined tables. In this way, better performance can be achieved in Hive. Note that this approach is not restricted to a specific platform like Hive; any MapReduce-based warehouse can benefit from it, as complex join operations occur in almost every analytical platform.
   However, the experimental results also show that the performance improvement is not stable as the data volume grows. For example, when the query is executed on a larger pre-joined table, the performance gain from eliminating joins is offset by the record parsing overhead and the high I/O cost during the scan, which results in worse performance. We conclude that the total performance of complex data analytics is affected by multiple factors rather than by a single aspect such as the join.
   With the continuous growth of data, diverse frameworks and platforms (e.g., Hive, Pig) are built for large-scale data analytics and business intelligence applications. Data transfer between different platforms generally takes place in the absence of key information such as operational cost models, resource consumption, and computational capabilities within the platforms, which are autonomous and inherently not designed for data integration. Therefore, we are working towards a generic description of the operational semantics and computational capabilities of different platforms, together with a cost model for performance optimization from a global perspective. The granularity we consider is a single operator in the execution engines. Thus, a global operator model with a generic cost model is expected to improve performance in several use cases, e.g., federated systems.
   Moreover, as an adaptively pre-joined fact table can be regarded as a materialized view in a MapReduce-based warehouse, another open problem is how to handle view maintenance. The work in [9] introduced an incremental loading approach to achieve near real-time data warehousing by using change data capture and change propagation techniques. Ideas from this work could be taken further to improve the performance of the total workload, including the pre-join task.
8. REFERENCES
 [1] F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 99-110, New York, NY, USA, 2010. ACM.
 [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487-499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
 [3] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 975-986, New York, NY, USA, 2010. ACM.
 [4] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proc. VLDB Endow., 3(1-2):1459-1468, Sept. 2010.
 [5] A. Gruenheid, E. Omiecinski, and L. Mark. Query optimization using column statistics in Hive. In Proceedings of the 15th Symposium on International Database Engineering & Applications, IDEAS '11, pages 97-105, New York, NY, USA, 2011. ACM.
 [6] A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270-294, Dec. 2001.
 [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1-12, May 2000.
 [8] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 205-216, New York, NY, USA, 1996. ACM.
 [9] T. Jörg and S. Deßloch. Towards generating ETL processes for incremental loading. In Proceedings of the 2008 International Symposium on Database Engineering & Applications, IDEAS '08, pages 101-110, New York, NY, USA, 2008. ACM.
[10] Y. Kotidis and N. Roussopoulos. DynaMat: a dynamic view management system for data warehouses. SIGMOD Rec., 28(2):371-382, June 1999.
[11] H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. Pages 181-192. AAAI Press, 1994.
[12] F. Özcan, D. Hoa, K. S. Beyer, A. Balmin, C. J. Liu, and Y. Li. Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 1161-1164, New York, NY, USA, 2011. ACM.
[13] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 165-178, New York, NY, USA, 2009. ACM.
[14] P. Scheuermann, J. Shim, and R. Vingralek. Watchman: A data warehouse intelligent cache manager. In Proceedings of the 22nd International Conference on Very Large Data Bases, VLDB '96, pages 51-62, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[15] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64-71, Jan. 2010.
[16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE '10: Proceedings of the 26th International Conference on Data Engineering, pages 996-1005. IEEE, Mar. 2010.
[17] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, pages 1013-1020, New York, NY, USA, 2010. ACM.
 A Cloud-based Spatial Decision Support System for the Challenges of the Energy Transition (Energiewende)

                    Golo Klossek                        Stefanie Scherzinger                    Michael Sterner
               Hochschule Regensburg                     Hochschule Regensburg              Hochschule Regensburg
        golo.klossek@stud.hs-regensburg.de        stefanie.scherzinger@hs-regensburg.de    michael.sterner@hs-regensburg.de


ABSTRACT
The energy transition (Energiewende) in Germany raises very concrete questions: which sites are suitable for wind turbines, and where can solar plants be operated economically? Behind these questions lie compute-intensive data processing steps, to be executed on big data from several sources in correspondingly heterogeneous formats. This paper presents one such concrete question and its answer as a MapReduce algorithm. We design a suitable cluster-based infrastructure for a new spatial decision support system and argue for the necessity of a declarative, domain-specific query language.

General Terms
Measurement, Performance, Languages.

Keywords
Cloud computing, MapReduce, Energiewende.

1. INTRODUCTION
   The German federal government's decision to phase out nuclear energy by 2022 and to replace its share of the electricity mix with renewable energies demands a rapid expansion of renewable energy sources. Decisive for the construction of new wind and solar plants are, above all, the achievable profits and the security of the investments. Precise yield forecasts are therefore of great importance. Different sites have to be compared, and the alignment of the wind turbines relative to each other within a wind farm has to be planned carefully. Historical weather data serves as the main basis for these decisions. To calculate the yield of wind turbines, the data basis must be accessible efficiently. It spans long periods of time, since the wind volume not only fluctuates from year to year but also varies from decade to decade [3, 9].
   Site selection, for instance for bank branches, and zoning, i.e. the designation of geographic areas for agriculture, are classic questions for spatial decision support systems [6].
   The challenges for such a spatial decision support system in the context of the energy transition are manifold:

   1. Processing of heterogeneous data formats.

   2. Scalable query processing on big data.

   3. An elastic infrastructure that can be extended as new data sources are tapped.

   4. A declarative, domain-specific query language for complex ad-hoc queries.

   We briefly justify the individual items of this requirements profile. In doing so, we take the position that existing decision support systems based on relational databases cannot satisfy all of these points.
   (1) Historical weather data is partly publicly available, but is also obtained from commercial providers. Prominent sources are the National Center for Atmospheric Research [12] in Boulder, Colorado, the Deutscher Wetterdienst [7], and the Satel-Light [14] database of the European Union. In addition, there are measurements from the university's own experimental wind and solar plants. The multitude of sources, and thus of formats, leads to the classic problems of data integration.
   (2) Data of high temporal resolution, collected over decades, causes data volumes in the big data range. The Deutscher Wetterdienst alone maintains a data archive of 5 petabytes [7]. At such orders of magnitude, NoSQL databases have proven themselves over relational databases [4].
   (3) We provide the infrastructure for an interdisciplinary team of the Regensburg School of Energy and Resources with several projects under construction. To meet the growing demands of our users, the system must adapt elastically to new data sources and new user groups.
   (4) Our users are mostly IT-affine, yet not experienced in developing complex distributed systems. With a domain-specific query language, the authors of this article want to guarantee the intuitive usability of the system.
   Under these considerations, we design our system as a Hadoop compute cluster [1, 5]. This provides scalability to large data volumes (2) and horizontal scalability of the hardware (3). Since historical data is accessed in a read-only manner, the MapReduce approach lends itself naturally. Moreover, Hadoop can process unstructured, heterogeneous data (1). Designing our own query language (4) is an exciting conceptual challenge, because it requires a deep understanding of the users' questions.

Structure. The following sections provide details on our project. In Section 2 we describe a concrete question arising in the site selection of wind power plants. In Section 3 we present our solution as a MapReduce algorithm. Section 4 sketches our infrastructure. Section 5 discusses related work. The last section summarizes our work and outlines its perspectives.

Figure 1: Power output as a function of wind speed (from [9]).

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

2. WIND POTENTIAL ANALYSIS
   A current research project at Hochschule Regensburg is concerned with the potential analysis of wind turbines. It investigates the economic aspects that are decisive for erecting new wind turbines. Based on the predicted full-load hours of a wind turbine, a statement about its profitability can be made. This is determined by the power curve of the turbine and, ultimately, by the wind speeds to be expected.
   Figure 1 (from [9]) sketches the specific power curve of a wind turbine in four phases:

     I) Only above a certain wind speed does the turbine begin to produce electricity.

    II) Over the most important operating range, the power output grows with the third power of the wind speed until the rated power of the turbine is reached.

   III) The output is capped at the rated power of the turbine. The decisive factor for the level of the rated power is the size of the generator.

    IV) The turbine shuts down at excessively high wind speeds in order to prevent mechanical overload.

   As Figure 1 illustrates, computing the work delivered by a wind turbine requires precise knowledge of the stochastic distribution of the wind speed¹. Using corresponding histograms, potential sites for new wind turbines can thus be compared, and turbines with a power curve suited to the specific site can be selected (a small numerical sketch follows below).

¹ We use the terms wind speed and wind force synonymously. Strictly speaking, the wind speed is represented as a vector, whereas the wind force is recorded as a scalar quantity. The wind force can be computed from the wind speed.
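To illustrate how such a histogram feeds a site comparison, the following sketch combines an assumed power curve following the four phases above (cut-in, cubic growth, rated power, cut-out; all numbers are invented for the example) with a wind speed histogram:

    HOURS_PER_YEAR = 8760

    def power_output_kw(v):
        # Illustrative power curve: cut-in at 3 m/s, rated power 2000 kW from
        # 13 m/s, cut-out at 25 m/s. The numbers are assumptions, not data
        # taken from the paper.
        if v < 3 or v >= 25:            # phases I and IV
            return 0.0
        if v < 13:                      # phase II: roughly cubic growth
            return 2000.0 * ((v - 3) / 10.0) ** 3
        return 2000.0                   # phase III: capped at rated power

    def expected_annual_energy_kwh(histogram):
        # histogram: wind speed bin in m/s -> relative frequency (sums to 1).
        return sum(frequency * HOURS_PER_YEAR * power_output_kw(v)
                   for v, frequency in histogram.items())

    site_a = {2: 0.25, 5: 0.35, 9: 0.25, 14: 0.13, 26: 0.02}
    site_b = {2: 0.10, 5: 0.30, 9: 0.35, 14: 0.22, 26: 0.03}
    print(expected_annual_energy_kwh(site_a) < expected_annual_energy_kwh(site_b))  # True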
   As a data basis, for example, the weather data of the National Center for Atmospheric Research [12] and of the Deutscher Wetterdienst [7] are suitable.
   Particularly inland, a high spatial resolution of the meteorological data is important: due to the orography, wind speeds vary strongly even over short distances.
   Figure 2 sketches the resulting task. Geographic areas are subdivided into small cells, which the figure depicts in a strongly simplified way for the sake of clarity. For each quadrant, determined by longitude and latitude, we are interested in the frequency distribution of the wind forces (represented as a histogram).
   Depending on the question at hand, different time periods and different granularities of the quadrants are assumed. Given the sheer size of the data basis, a massively parallel computing approach is required when histograms are to be computed across a large number of quadrants.

Figure 2: Histograms of the wind force distribution.

3. MASSIVELY PARALLEL HISTOGRAM COMPUTATION
   In the following, we present a MapReduce algorithm for computing wind speed distributions in parallel. We consider the platform Apache Hadoop [1], an open-source MapReduce implementation [5].

Figure 3: First MapReduce sequence computing the absolute frequencies.
   Hadoop is designed to handle large volumes of data. An intuitive programming paradigm allows massively parallel data processing steps to be specified. The platform partitions the input into smaller data blocks and distributes them redundantly on the Hadoop Distributed File System [15], which ensures a high level of data safety. As its logical base unit, Hadoop works with simple key/value pairs. Thus, even unstructured or only weakly structured data can be processed ad hoc.
   MapReduce programs are executed in three phases.

   1. In the first phase, a map function is executed in parallel on the partitioned input data. This map function transforms simple key/value pairs into a list of new key/value pairs.

   2. The subsequent shuffle phase redistributes the resulting tuples so that all pairs with the same key end up on the same machine.

   3. The reduce phase typically computes an aggregate function over all tuples with the same key.

   The signatures of the map and reduce functions are usually described as follows [11]:

      Map:       (k1, v1)        → list(k2, v2)
      Reduce:    (k2, list(v2))  → list(k3, v3)

   We now explain our MapReduce algorithm for building histograms of the wind speed distributions. For the sake of a clear presentation, we abstract from the actual input format and restrict ourselves to a single data source. The input tuples contain a timestamp, longitude and latitude as the location, and various measured values.
   We assume, for simplicity, that the location has already been translated into a quadrant ID; a small sketch of such a mapping follows below. This simplification allows a clearer presentation, and at the same time the classification of the records by quadrant is easy to implement. Moreover, we ignore all measured values except the wind force. Table 1 shows a few example records that we process in our running example.
   We emphasize that these simplifying assumptions serve only the purpose of illustration and do not constitute a restriction of our system.

      Quadrant q    Wind force ws
      2             0
      3             7
      4             9
      1             3
      ...           ...

   Table 1: Input data in tabular form.

   We chain two MapReduce sequences:

   • The first sequence determines how often a concrete wind force occurred in a quadrant.

   • The second sequence assembles the computed tuples into histograms.
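The mapping from coordinates to quadrant IDs mentioned above can be as simple as snapping latitude and longitude to a grid; the following sketch uses an assumed resolution of 0.1 degrees:

    def quadrant_id(lat, lon, resolution_deg=0.1):
        # Grid cell ("quadrant") for a coordinate; the 0.1 degree resolution
        # is an assumption chosen for the example.
        return int(lat // resolution_deg), int(lon // resolution_deg)

    # Two nearby measurements around Regensburg fall into the same quadrant:
    print(quadrant_id(49.013, 12.101))   # (490, 121)
    print(quadrant_id(49.019, 12.109))   # (490, 121)
    print(quadrant_id(49.213, 12.301))   # a different quadrant: (492, 123)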
Figure 4: Second MapReduce sequence computing the histograms.




 def map(Datei d, Liste L) :
     foreach (q, ws) in L do
         if (q ∈ Q)
             int count = 1;
             emit ((q, ws), count);
         fi
     od

 Figure 5: Map function of the first sequence.

 def reduce((Quadrant q, Windstärke ws), Liste L) :
     int total = 0;
     foreach count in L do
         total += count;
     od
     emit (q, (ws, total));

 Figure 6: Reduce function of the first sequence.
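The dataflow of this first sequence can be simulated end to end in a few lines of Python; this is only a sketch of map, shuffle, and reduce on Table 1-style records, not Hadoop code:

    from itertools import groupby

    Q = {1, 2}  # quadrants of interest, as in Example 1

    def map_seq1(records):
        # Figure 5: emit ((quadrant, wind force), 1) for each relevant record.
        for q, ws in records:
            if q in Q:
                yield (q, ws), 1

    def shuffle(pairs):
        # Group the values by key, as Hadoop's shuffle phase would.
        pairs = sorted(pairs, key=lambda kv: kv[0])
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            yield key, [value for _, value in group]

    def reduce_seq1(key, counts):
        # Figure 6: sum up the counts per (quadrant, wind force).
        q, ws = key
        return q, (ws, sum(counts))

    records = [(2, 0), (3, 7), (4, 9), (1, 3), (1, 3), (2, 0)]   # Table 1-style input
    sequence1 = [reduce_seq1(key, counts) for key, counts in shuffle(map_seq1(records))]
    print(sequence1)   # [(1, (3, 2)), (2, (0, 2))]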

   By chaining MapReduce sequences, complex transformations are specified from very simple and well-parallelizable computing steps, entirely in the spirit of the divide-and-conquer principle. We now look at both sequences in detail.

3.1 Sequence 1: Absolute Frequencies
   The first sequence is reminiscent of the "WordCount" example, the classic introductory example of MapReduce programming [11]. The input of the map function is the name of a file and its content, namely a list of quadrants and the wind forces measured in them. We assume that only a selected set of quadrants Q is of interest, for instance to examine possible wind turbine sites in the Regensburg area.
   Figure 5 shows the map function in pseudocode. The emit statement produces a new output tuple. In the shuffle phase, the produced tuples are redistributed according to the key component consisting of quadrant and wind force. The reduce function in Figure 6 then computes the frequency of the individual wind force values per quadrant.

   Example 1. Figure 3 visualizes the first sequence on concrete input data. The map function selects only tuples from quadrants 1 and 2 (i.e., Q = {1, 2}). The shuffle phase reorganizes the tuples so that afterwards all tuples with the same values for quadrant and wind force reside on the same machine. Hadoop already collects the count values into a list. From this, the reduce function produces tuples with the quadrant as the key; the value consists of the wind force and its absolute frequency.

3.2 Sequence 2: Histogram Computation
   The output of the first sequence is now processed further. The map function of the second sequence is simply the identity function. The shuffle phase groups the tuples by quadrant. The final assembly of the histograms thus takes place in the reduce function.

   Example 2. Figure 4 shows the processing steps of the second sequence for the running example.
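The second sequence is equally small when sketched in Python; the map step is omitted since it is the identity, and the grouping stands in for the shuffle phase (again only a simulation of the dataflow):

    from collections import defaultdict

    def seq2(sequence1_output):
        # Second sequence: the shuffle groups by quadrant, and the reduce
        # assembles the histogram per quadrant.
        grouped = defaultdict(list)
        for quadrant, (ws, total) in sequence1_output:
            grouped[quadrant].append((ws, total))
        return {quadrant: dict(pairs) for quadrant, pairs in grouped.items()}

    print(seq2([(1, (3, 2)), (1, (5, 4)), (2, (0, 2))]))
    # {1: {3: 2, 5: 4}, 2: {0: 2}}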
4. ARCHITECTURE DESCRIPTION
   Our vision of a Cloud-based spatial decision support system for the questions of the energy transition rests firmly on MapReduce technology.
   The project goes beyond the mere use of cluster computing. The goal is the design of a domain-specific query language, WEnQL, the "Wetter und Energie Query Language" (weather and energy query language). It is being developed in interdisciplinary collaboration with the research institute Regensburg School of Energy and Resources. With it, even MapReduce laypersons shall be able to run algorithms on the cluster.

Figure 7: Architecture of the Cloud-based spatial decision support system.

   Figure 7 illustrates the vision: our users formulate a declarative query in WenQL, for instance to compute the wind speed histograms of a region. A dedicated compiler translates the WenQL query into the established query format Apache Pig [8, 13], which in turn is translated into Java. The Java program is then compiled and executed on the Hadoop cluster.
   Translating in two steps has the advantage that the program in the intermediate language Pig can be analyzed by experts with respect to performance and correctness. Hadoop laypersons, on the other hand, do not need to concern themselves with these internals. We are currently compiling a catalogue of concrete questions from the energy industry in order to identify common query building blocks for WenQL.
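Purely as an illustration of this pipeline (WenQL itself is still to be defined), the Pig Latin that such a compiler might emit for the histogram query of Section 3 could look roughly as follows; the input path, schema, and delimiter are invented for the example:

    ILLUSTRATIVE_PIG_SCRIPT = r"""
    measurements = LOAD 'weather/measurements' USING PigStorage('\t')
                   AS (quadrant:int, wind_speed:int);
    of_interest  = FILTER measurements BY quadrant == 1 OR quadrant == 2;
    by_bin       = GROUP of_interest BY (quadrant, wind_speed);
    histogram    = FOREACH by_bin GENERATE FLATTEN(group), COUNT(of_interest);
    STORE histogram INTO 'results/wind_histograms';
    """
    print(ILLUSTRATIVE_PIG_SCRIPT)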
5. RELATED WORK
   Various fields of computer science research touch upon our work. Many research projects concerned with smart grid technology rely on distributed systems for processing their data [4, 17]. Similar to our project, this decision is driven by the large data volumes, which originate from different sources: weather influences on power plants, electricity exchange prices, the utilization of power grids, and the consumption behaviour of millions of users have to be compared. Smart grid analyses differ from our project in that we only access historical data and therefore place no real-time requirements on the system.
   Questions such as site selection and zoning have a long tradition in the development of dedicated spatial decision support systems [6]. Their architectures usually rest on relational databases for data management. Facing the challenge of scaling to big data and of working with heterogeneous data sources, NoSQL systems such as Hadoop and the Hadoop file system, in contrast, have clear advantages in data management and query processing.
   The definition of declarative query languages for MapReduce algorithms is a very active field of research. Most relevant for us are SQL-like query languages, as realized for instance in the Hive project [2, 16]. However, our users generally do not know SQL. We therefore plan to define our own query language that is as intuitive as possible for our users to learn.

6. SUMMARY AND OUTLOOK
   There is a clear need for a new, big-data-capable generation of spatial decision support systems for diverse questions of the energy transition.
   In this paper, we present our vision of a Cloud-based spatial decision support system. Using the example of wind potential analysis, we show that MapReduce algorithms are fit for strategic questions in energy research.
   We are confident that we will be able to support a broad spectrum of essential decisions. A continuation of our case study is the alignment of wind turbines within a wind farm. Here the dominant wind direction is decisive, in order to position the turbines favourably with respect to each other and to the main wind direction. A single wind turbine can rotate its nacelle by 360° to turn the rotors into the wind; the arrangement of the towers within the farm, however, is fixed. With an unfavourable layout of the towers, wake effects can thus lastingly reduce the yield. Figure 8 (from [10]) visualizes wind force and wind direction as a basis for this decision. Our MapReduce algorithm from Section 3 can be extended accordingly.
Figure 8: Wind force and wind direction (from [10]).

   Moreover, we are currently investigating the siting of solar plants and, more complex still, the strategic use of energy storage, for instance to compensate for calms or night phases.
   With the capabilities of our future decision support system, its scalability to very large data volumes, its flexible handling of heterogeneous data formats, and not least its domain-specific query language, we want to make our contribution to a climate-friendly and sustainable energy supply.

7. ACKNOWLEDGEMENTS
   This work is a project of the Regensburg School of Energy and Resources, an interdisciplinary institution of Hochschule Regensburg and the Technologie- und Wissenschaftsnetzwerk Oberpfalz.

8. REFERENCES
 [1] Apache Hadoop. http://hadoop.apache.org/, 2013.
 [2] Apache Hive. http://hive.apache.org/, 2013.
 [3] O. Brückl. Meteorologische Grundlagen der Windenergienutzung. Vorlesungsskript: Windenergie. Hochschule Regensburg, 2012.
 [4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
 [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
 [6] P. J. Densham. Spatial decision support systems. Geographical Information Systems: Principles and Applications, 1:403-412, 1991.
 [7] Deutscher Wetterdienst. http://www.dwd.de/, 2013.
 [8] A. Gates. Programming Pig. O'Reilly Media, 2011.
 [9] M. Kaltschmitt, W. Streicher, and A. Wiese. Erneuerbare Energien: Systemtechnik, Wirtschaftlichkeit, Umweltaspekte. Springer, 2006.
[10] Lakes Environmental Software. http://www.weblakes.com/, 2013.
[11] C. Lam. Hadoop in Action. Manning Publications, 2010.
[12] National Center for Atmospheric Research (NCAR). http://ncar.ucar.edu/, 2013.
[13] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099-1110. ACM, 2008.
[14] Satel-Light. http://www.satel-light.com/, 2013.
[15] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1-10, 2010.
[16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626-1629, 2009.
[17] D. Wang and L. Xiao. Storage and Query of Condition Monitoring Data in Smart Grid Based on Hadoop. In Computational and Information Sciences (ICCIS), 2012 Fourth International Conference, pages 377-380. IEEE, 2012.
       Consistency Models for Cloud-based Online Games:
               the Storage System’s Perspective

                                                   Ziqiang Diao
                                    Otto-von-Guericke University Magdeburg
                                          39106 Magdeburg, Germany
                                       diao@iti.cs.uni-magdeburg.de


ABSTRACT
The existing architecture for massively multiplayer online role-playing games (MMORPG) based on RDBMS limits availability and scalability. With increasing numbers of players, the storage systems become bottlenecks. Although a Cloud-based architecture has the ability to solve these specific issues, the support for data consistency becomes a new open issue. In this paper, we analyze the data consistency requirements of MMORPGs from the storage system's point of view and highlight the drawbacks of Cassandra with respect to supporting game consistency. A timestamp-based solution is proposed to address this issue. Accordingly, we also present data replication strategies, concurrency control, and system reliability.

Figure 1: Cloud-based architecture of MMORPGs [4]. Clients connect through login, gateway, and chat servers to a zone server consisting of logic and map servers; account data is kept in an RDBMS offered as a service, game data and log data in HDFS/Cassandra, and state data in a Cloud storage system accessed via a data access server and an in-memory DB.
1. INTRODUCTION
   In massively multiplayer online role-playing games (MMORPG), thousands of players cooperate with other players in a virtual game world. Supporting such a huge game world involves often complex application logic and specific requirements. Additionally, we have to bear the burden of managing large amounts of data. The root of the issue is that the existing architectures of MMORPGs use an RDBMS to manage data, which limits availability and scalability.
   Cloud data storage systems are designed for internet applications and are complementary to RDBMS. For example, Cloud systems support system availability and scalability well, but not data consistency. In order to take advantage of both types of storage systems, we have classified the data in MMORPGs into four data sets according to typical data management requirements (e.g., data consistency, system availability, system scalability, data model, security, and real-time processing) in [4]: account data, game data, state data, and log data. We then proposed to apply multiple data management systems (or services) in one MMORPG and to manage the diverse data sets accordingly. Data with strong requirements for data consistency and security (e.g., account data) is still managed by an RDBMS, while data that requires scalability and availability (e.g., log data and state data) is stored in a Cloud data storage system (Cassandra, in this paper). Figure 1 shows the new architecture.
   Unfortunately, there are still some open issues, such as the support of data consistency. According to the CAP theorem, in a partition-tolerant distributed system (e.g., an MMORPG), we have to sacrifice one of the two remaining properties: consistency or availability [5]. If an online game does not guarantee availability, players' requests may fail. If data is inconsistent, players may get data that does not conform to the game logic, which affects their operations. For this reason, we must analyze the data consistency requirements of MMORPGs so as to find a balance between data consistency and system availability.
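This trade-off becomes concrete with Cassandra's tunable consistency levels, sketched here with the DataStax Python driver; the keyspace, tables, and addresses are hypothetical, and this only illustrates the knob involved, not the timestamp-based solution developed later in this paper:

    from datetime import datetime

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])            # hypothetical contact point
    session = cluster.connect("mmorpg")        # hypothetical keyspace

    # Strong consistency within the local data center for a state update:
    update_state = SimpleStatement(
        "UPDATE character_state SET position = %s WHERE char_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(update_state, ("x=12;y=7", 42))

    # A weaker, latency-friendly level for append-only log data:
    append_log = SimpleStatement(
        "INSERT INTO action_log (char_id, ts, action) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(append_log, (42, datetime.utcnow(), "attack"))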
   Although there has been some research focused on the data consistency model of online games, researchers have generally discussed it from the players' or servers' point of view [15, 9, 11], which actually only relates to data synchronization among players. Other existing research did not treat the diverse data accordingly [3], or handled this issue based only on a rough classification of the data [16]. However, we believe the only efficient way to solve this issue is to analyze the consistency requirements of each data set from the storage system's perspective. Hence, we organize the rest of this paper as follows: in Section 2, we highlight the data consistency requirements of the four data sets. In Section 3, we discuss the data consistency issue of our Cloud-based architecture. We explain our timestamp-based solution in detail from Section 4 to Section 6. Then, we point out some optimizations and our future work in Section 7. Finally, we summarize this paper in Section 8.

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).
2. CONSISTENCY REQUIREMENTS OF DIVERSE DATA IN MMORPGS
   Due to their different application scenarios, the four data sets have distinct data consistency requirements. For this reason, we need to apply different consistency models to fulfill them.
   Account data: is stored on the server side, and is created, accessed, and deleted when players log in to or log out of a game. It includes a player's private data and other sensitive information (e.g., user ID, password, and recharge records). Inconsistency of account data might cause trouble for a player as well as for the game provider, or even lead to an economic or legal dispute. Imagine the following two scenarios: a player has changed the password successfully, but when this player logs in to the game again, the new password is not yet effective; or a player has transferred money to the game account, or has made purchases in the game, but the account balance is somehow not properly reflected in the game system. Both cases would affect the player's experience and might result in the loss of customers or an economic loss for the game company. Hence, we need

players in the same game world could be treated equally. It is noteworthy that a zone server generally accesses data from one data center. Hence, we guarantee strong consistency within one data center, and causal consistency among data centers. In other words, when game developers modify the game data, the updated value should be submitted synchronously to all replicas within the same data center, and then propagated asynchronously across data centers.
   State data: for instance, metadata of PCs (player characters) and the state (e.g., position, task, or inventory) of characters, is modified frequently by players during the game. Changes of state data must be perceived by all relevant players synchronously, so that players and NPCs can respond correctly and in a timely manner. An example of the necessity of data synchronization is that players cannot tolerate a dead character continuing to attack other characters. Note that players only access data from the in-memory database during the game. Hence, we need to ensure strong consistency in the in-memory database.
   Another point about managing state data is that updated values must be backed up to the disk-resident database asynchronously. Similarly, game developers also need to take care of data consistency and durability in the disk-resident database; for instance, it is intolerable for a player to find
to access account data under strong consistency guarantees,
                                                                 that her/his last game record is lost when she/he starts the
and manage it with transactions. In a distributed database
                                                                 game again. In contrast to that in the in-memory database,
system, it means that each copy should hold the same view
                                                                 we do not recommend ensuring strong consistency to state
on the data value.
                                                                 data. The reason is as follows: according to the CAP theo-
   Game data: such as world appearance, metadata (name,
                                                                 rem, a distributed database system can only simultaneously
race, appearance, etc.) of NPC (Non Player Character),
                                                                 satisfy two of three the following desirable properties: con-
system configuration files, and game rules, is used by play-
                                                                 sistency, availability, and partition tolerance. Certainly, we
ers and game engine in the entire game, which can only be
                                                                 hope to satisfy both consistency and availability guarantees.
modified by game developers. Players are not as sensitive to
                                                                 However, in the case of network partition or under high net-
game data as to account data. For example, the change of
                                                                 work latency, we have to sacrifice one of them. Obviously,
an NPC’s appearance or name, the duration of a bird ani-
                                                                 we do not want all update operations to be blocked until the
mation, and the game interface may not catch the players’
                                                                 system recovery, which may lead to data loss. Consequently,
attention and have no influence on players’ operations. As a
                                                                 the level of data consistency should be reduced. We propose
result, it seems that strong consistency for game data is not
                                                                 to ensure read-your-writes consistency guarantee [13]. In
so necessary. On the other hand, some changes of the game
                                                                 this paper, it describes that once state data of player A has
data must be propagated to all online players synchronously,
                                                                 been persisted in the Cloud, the subsequent read request of
for instance, the change of the game world’s appearance, the
                                                                 player A will receive the updated values, yet other players
power of a weapon or an NPC, game rules as well as scripts,
                                                                 (or the game engine) may only obtain an outdated version of
and the occurrence frequency of an object during the game.
                                                                 it. From the storage system’s perspective, as long as a quo-
The inconsistency of these data will lead to errors on the
                                                                 rum of replicas has been updated successfully, the commit
game display and logic, unfair competition among players,
                                                                 operation is considered complete. In this case, the storage
or even a server failure. For this reason, we also need to
                                                                 system needs to provide a solution to return the up-to-date
treat data consistency of game data seriously. Game data
                                                                 data to player A. We will discuss it in the next section.
could be stored on both the server side and the client side,
                                                                    Log data: (e.g., player chat history and operation logs)
so we have to deal with it accordingly.
                                                                 is created by players, but used by data analysts for the pur-
   Game data on the client side could only synchronize with
                                                                 pose of data mining. This data will be sorted and cached
servers when a player logs in to or starts a game. For this
                                                                 on the server side during the game, and then bulk stored
reason, causal consistency is required [8, 13]. In this paper,
                                                                 into the database, thereby reducing the conflict rate as well
it means when player A uses client software or browser to
                                                                 as the I/O workload, and increasing the total simultaneous
connect with the game server, the game server will then
                                                                 throughput [2]. The management of log data has three fea-
transmit the latest game data in the form of data packets
                                                                 tures: log data will be appended continually, and its value
to the client side of player A. In this case, the subsequent
                                                                 will not be modified once it is written to the database; The
local access by player A is able to return the updated value.
                                                                 replication of log data from thousands of players to multiple
Player B that has not communicated with the game server
                                                                 nodes will significantly increase the network traffic and even
will still retain the outdated game data.
                                                                 block the network; Moreover, log data is generally organized
   Although both client side and server side store the game
                                                                 and analyzed after a long time. Data analysts are only con-
data, only the game server maintains the authority of it.
                                                                 cerned about the continuous sequence of the data, rather
Furthermore, players in different game worlds cannot com-
                                                                 than the timeliness of the data. Hence, data inconsistency
municate to each other. Therefore, we only need to ensure
                                                                 is accepted in a period of time. For these three reasons,
that the game data is consistent in one zone server so that
                   Account data            Game data                               State data                            Log data
  Modified by      Players                 Game developers                         Players                               Players
  Utilized by      Players & Game engine   Players & Game engine                   Players & Game engine                 Data analysts
  Stored in        Cloud                   Client side | Cloud                     In-memory DB | Cloud                  Cloud
  Data center      Across                  —           | Single / Across           Single       | Across                 Across
  Consistency      Strong                  Causal      | Strong (single DC),       Strong       | Read-your-writes       Timed
  model            consistency             consistency | Causal (across DCs)       consistency  | consistency           consistency

                                        Table 1: Consistency requirements
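Purely for illustration, the per-data-set choices in Table 1 can also be read as a configuration mapping. The following minimal Python sketch uses hypothetical names and is not part of the proposed architecture; it merely restates the table.

    # Consistency plan derived from Table 1 (illustrative names only).
    CONSISTENCY_PLAN = {
        "account_data": {"cloud": "strong"},
        "game_data": {
            "client_side": "causal",
            "cloud_single_dc": "strong",
            "cloud_across_dc": "causal",
        },
        "state_data": {"in_memory_db": "strong", "cloud": "read_your_writes"},
        "log_data": {"cloud": "timed"},
    }

    def required_consistency(data_set, store):
        """Look up which consistency model a given store must provide for a data set."""
        return CONSISTENCY_PLAN[data_set][store]

For example, required_consistency("state_data", "cloud") yields "read_your_writes", matching the table.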


3.  OPPORTUNITIES AND CHALLENGES

In our previous work, we have already presented the capability of the given Cloud-based architecture to support the corresponding consistency model for each data set in MMORPGs [4]. However, we have also pointed out that ensuring read-your-writes consistency for state data and timed consistency for log data efficiently in Cassandra is an open issue. In this section, we discuss this issue in detail.

By customizing the quorum of replicas that must respond to read and write operations, Cassandra provides tunable consistency, which is an inherent advantage for supporting MMORPGs [7, 4]. There are two reasons: first, as long as a write request receives a quorum of responses, it completes successfully. In this case, although data in Cassandra is temporarily inconsistent, the response time of write operations is reduced, and the availability as well as the fault tolerance of the system are ensured. Additionally, a read request is sent to the closest replica, or routed to a quorum or to all replicas, according to the consistency requirement of the client. For example, if a write request is accepted by three (N, N > 0) of all five (M, M >= N) replicas, at least M-N+1 = 3 replicas need to respond to a subsequent read request so that the up-to-date data can be returned (see the sketch at the end of this section). In this case, Cassandra can guarantee read-your-writes consistency or strong consistency. Otherwise, it can only guarantee timed consistency or eventual consistency [7, 13]. Due to its support for tunable consistency, Cassandra has the potential to manage state data and log data of MMORPGs simultaneously, and it is more suitable than other Cloud storage systems that only provide either strong or eventual consistency guarantees.

On the other hand, Cassandra fails to implement tunable consistency efficiently with respect to the MMORPG requirements. For example, M-N+1 replicas of state data have to be compared so as to guarantee read-your-writes consistency. However, state data typically has hundreds of attributes, whose transmission and comparison affect the read performance. If we instead update all replicas while executing write operations, the data in Cassandra is consistent and we can obtain the up-to-date data from the closest replica directly. Unfortunately, this replication strategy significantly increases the network traffic as well as the response time of write operations, and it sacrifices system availability. As a result, implementing read-your-writes consistency efficiently becomes an open issue.

Another drawback is that Cassandra makes all replicas eventually consistent, which does not always match the application scenarios of MMORPGs and reduces the efficiency of the system. The reasons are as follows.

   • Unnecessary for state data: state data of a PC is read by a player from the Cloud storage system only once during the game. The subsequent write operations do not depend on values in the Cloud any more. Hence, after obtaining the up-to-date data from the Cloud, there is no necessity to ensure that all replicas reach a consensus on these values.

   • Increased network traffic: Cassandra utilizes its Read Repair functionality to guarantee eventual consistency [1]. This means that all replicas have to be compared in the background while executing a read operation in order to return the up-to-date data to players, detect outdated data versions, and fix them. In MMORPGs, both state data and log data have a large scale and are distributed over multiple data centers. Hence, the transmission of these data across replicas will significantly increase the network traffic and affect the system performance.
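To make the quorum arithmetic of this section concrete, the following minimal Python sketch (plain arithmetic, not the Cassandra API) expresses the M-N+1 rule and the overlap condition behind it; the function name is ours, not Cassandra's.

    def replicas_to_read(m_total, n_write_acks):
        """Smallest number of replicas a read must contact so that it is guaranteed
        to overlap every write acknowledged by n_write_acks out of m_total replicas."""
        return m_total - n_write_acks + 1

    # Example from the text: a write accepted by N = 3 of M = 5 replicas.
    assert replicas_to_read(5, 3) == 3   # at least M-N+1 = 3 replicas must answer the read
    # Equivalent overlap condition: read quorum + write quorum must exceed the replica count.
    assert 3 + 3 > 5

Whenever the read and write quorums overlap in at least one replica, that replica holds the latest committed version, which is exactly what read-your-writes (or strong) consistency requires.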
Figure 2: Executions of Write Operations. W(1) describes a general backup operation; W(2) shows the process of data persistence when a player quits the game.

Figure 3: Executions of Read Operations. PR(1) shows a general read request from the player; in the case of PR(2), the backup operation is not yet completed when the read request arrives; GER presents the execution of a read operation from the game engine.
4.  A TIMESTAMP-BASED CONSISTENCY SOLUTION

A common method for solving the consistency problem of a Cloud storage system is to build an extra transaction layer on top of the system [6, 3, 14]. Similarly, we have proposed a timestamp-based solution especially for MMORPGs, which is designed based on the features of Cassandra [4]. Cassandra records a timestamp in each column and utilizes it as a version identification (ID). Therefore, we record timestamps obtained from a global server both on the server side and in the Cloud storage system. When we read state data from the Cloud, the timestamps recorded on the server side are sent along with the read request. In this way, we can find the most recent data easily. In the following sections, we introduce this solution in detail.

4.1  Data Access Server

Data access servers are responsible for the data exchange between the in-memory database and the Cloud storage system. They ensure the consistency of state data, maintain timestamp tables, and play the role of global counters as well. In order to balance the workload and to prevent server failures, several data access servers run in parallel in one zone server. Data access servers need to synchronize their system clocks with each other automatically. However, a complete synchronization is not required; a time difference smaller than the backup interval is acceptable.

An important component of the data access servers is the timestamp table, which stores the ID as well as the last modified time (LMT) of state data, and the log status (LS). If a character or an object in the game is active, its LS value is "login"; otherwise, the value of LS is "logout". We utilize a hash function to map the IDs of state data to distinct timestamp tables, which are distributed and partitioned across the data access servers. It is noteworthy that the timestamp tables are partitioned and managed by the data access servers in parallel and that the data processing is simple, so accessing the timestamp tables will not become a bottleneck of the game system.

Note that players can only interact with each other in the same game world, which is managed by one zone server. Moreover, a player cannot switch the zone server freely. Therefore, the data access servers as well as the timestamp tables of different zone servers are independent of each other.

4.2  Data Access

In this subsection, we discuss the data access without considering data replication and concurrency conflicts.

In Figure 2, we show the storage process of state data in the new Cloud-based architecture: the in-memory database takes a consistent snapshot periodically. Using the same hash function employed by the timestamp tables, each data access server periodically obtains the state data it is responsible for from the snapshot. In order to reduce the I/O workload of the Cloud, a data access server generates one message including all its responsible state data as well as a new timestamp TS, and then sends it to the Cloud storage system. In the Cloud, this message is divided into several messages based on the IDs of the state data, each of which still includes TS. In this way, the update failure of one state data item does not block the submission of the other state data. Then, these messages are routed to the appropriate nodes. When a node receives a message, it writes the changes immediately into the commit log, updates the data, and records TS as the version ID in each column. If an update is successful and TS is higher than the existing LMT of this state data, the data access server uses TS to replace the LMT. Note that if a player has quit the game and the state data of the relevant PC has been backed up into the Cloud storage system, the LS of this PC needs to be changed from "login" to "logout", and the relevant state data in the in-memory database needs to be deleted.

Data access servers obtain log data not from the in-memory database, but from the client side. Log data is also updated in batches and gets its timestamp from a data access server. When a node in the Cloud receives log data, it inserts the log data into its value list according to the timestamp. However, the timestamp tables are not modified when the update is complete.
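The backup step just described can be summarized by a short sketch. The following Python code is illustrative only: cloud.write, timestamp_table, and the other names are hypothetical placeholders for components of the architecture, not an existing API, and batching as well as error handling are simplified.

    import time

    def backup_state_snapshot(snapshot, timestamp_table, cloud):
        """Periodic backup performed by one data access server (simplified sketch)."""
        ts = time.time()                  # new timestamp TS; the data access server acts as counter
        # Conceptually one batch message per server; the Cloud splits it by state-data ID.
        for state_id, values in snapshot.items():
            ok = cloud.write(state_id, values, version_id=ts)  # TS is recorded as the version ID
            row = timestamp_table[state_id]
            if ok and ts > row["lmt"]:
                row["lmt"] = ts           # the LMT only advances for successful, newer updates
        # On a player quit (W(2) in Figure 2), LS is additionally flipped from "login" to
        # "logout" and the corresponding in-memory copy is deleted.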
Figure 3 presents the executions of read operations. When a player starts the game, a data access server first obtains the LS information from the timestamp table. If the value is "login", the previous backup operation is not yet completed and the state data is still stored in the in-memory database. In this case, the player gets the state data from the in-memory database directly, and the data access server needs to generate a new timestamp to replace the LMT of the relevant state data. If the value is "logout", the data access server gets the LMT and sends it together with a read request to the Cloud storage system. When the relevant node receives the request, it compares the LMT with its local version ID. If they match, the replica responds to the read request immediately. If they do not match, the read request is sent to other replicas (we discuss this in detail in Section 5). When the data access server receives the state data, it sends it to the in-memory database as well as to the relevant client sides, and it changes the LS from "logout" to "login" in the timestamp table. Note that state data may also be read by the game engine for the purpose of statistics. In this case, the up-to-date data is not necessary, so we do not need to compare the LMT with the version ID.

Data analysts also read data through the data access servers. If a read request contains a timestamp T, the Cloud storage system only returns log data up to T - Δ, because it only guarantees timed consistency for log data.

4.3  Concurrency Control

Concurrency conflicts appear rarely in the storage layer of MMORPGs: the probability of read-write conflicts is low, because only state data with a specific version ID (the same as its LMT) is read by players during the game, and a read request for log data does not return the up-to-date data anyway. Moreover, each data item is periodically updated by only one data access server at a time. Therefore, write-write conflicts occur only when the previous update has not completed for some reason, for example due to serious network latency or a node failure. Fortunately, we can solve these conflicts easily by comparing timestamps. If two processes attempt to update the same state data, the process with the higher timestamp wins, and the other process is canceled because it is out of date. If two processes intend to update the same log data, the process with the lower timestamp wins, and the other process enters the wait queue. The reason is that the values contained in both processes must be stored in the correct order.

5.  DATA REPLICATION

Data in the Cloud typically has multiple replicas for the purpose of increasing data reliability as well as system availability, and of balancing the node workload. On the other hand, data replication also increases the response time and the network traffic, which is not handled well by Cassandra. For most of this section, we focus on resolving this contradiction according to the access features of state data and log data.

5.1  Replication Strategies

Although state data is backed up periodically into the Cloud, only the last updated values will be read when players start the game again. It is noteworthy that data loss in the server layer occurs infrequently. Therefore, we propose to synchronize only a quorum of replicas during the game, so that an update can complete efficiently and does not block the subsequent updates. In addition, players usually start a game again only after a period of time, so the system has enough time to store the state data. For this reason, we propose to update all replicas synchronously when players quit the game. As a result, the subsequent read operation can obtain the updated values quickly.

While using our replication strategies, a replica may contain outdated data when it receives a read request. By comparing the LMT held by the read request with the version ID in a replica, this case can be detected easily. Contrary to the existing approach of Cassandra (which compares M-N+1 replicas and utilizes Read Repair), only the read request is sent to other replicas until the latest values are found. In this way, the network traffic is not increased significantly, and the up-to-date data can still be found easily. However, if the read request comes from the game engine, the replica responds immediately. These strategies ensure that this Cloud-based architecture can manage state data under read-your-writes consistency guarantees.

Similar to state data, a write request for log data is also accepted by a quorum of replicas at first. However, the updated values must then be propagated to the other replicas asynchronously when the Cloud storage system is not busy, and they must be arranged in timestamp order within a predetermined time (Δ), which can be done with the help of the Anti-Entropy functionality of Cassandra [1]. In this way, the Cloud storage system guarantees timed consistency for log data.

5.2  Version Conflict Reconciliation

When the Cloud storage system detects a version conflict between two replicas, the following rules apply: if it is state data, the replica with the higher version ID wins, and the values of the other replica are replaced by the new values; if it is log data, the two replicas perform a sort-merge join by timestamps for the purpose of synchronization.

6.  SYSTEM RELIABILITY

Our Cloud-based architecture for MMORPGs requires the cooperation of multiple components. Unfortunately, each component can fail. In the following, we discuss measures to deal with the different failures.

Cloud storage system failure: the new architecture for MMORPGs is built on Cassandra, which has the ability to deal with its own failures. For example, Cassandra applies commit logs to recover nodes. It is noteworthy that, by using our timestamp-based solution, a failed node that comes back up can simply be regarded as an asynchronous node. Therefore, node recovery and the handling of write and read requests can proceed simultaneously.

In-memory database failure: similarly, we could also apply commit logs to handle this kind of failure so that there is no data loss. However, writing logs affects the real-time response, and the logs become useless once the changes are persisted in the Cloud. Hence, we still have to find a suitable solution in our future work.

Data access server failure: if all data access servers crash, the game can still keep running, but data cannot be backed up to the Cloud until the servers restart, and only players already in the game can continue to play. All data access servers have the same functionality and their system clocks are relatively synchronized, so if one server is down, any other server can replace it.

Timestamp table failure: we utilize the primary/secondary model and a synchronous replication mechanism to maintain the reliability of the timestamp tables. In case all replicas fail, we have to apply the original mechanism of Cassandra to obtain the up-to-date data; in other words, M-N+1 replicas need to be compared. In this way, we can also rebuild the timestamp tables.
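Before turning to optimizations, the conflict-handling rules of Sections 4.3 and 5.2 can be restated compactly. The following Python sketch is illustrative only; the assumed record layout (a "version_id" field for state data, a "ts" field per log entry) is ours, not a prescribed implementation.

    def resolve_state_conflict(replica_a, replica_b):
        # State data: the copy with the higher version ID (timestamp) wins.
        return replica_a if replica_a["version_id"] >= replica_b["version_id"] else replica_b

    def reconcile_log_replicas(log_a, log_b):
        # Log data: merge two append-only lists in timestamp order (sort-merge join).
        merged, i, j = [], 0, 0
        while i < len(log_a) and j < len(log_b):
            if log_a[i]["ts"] <= log_b[j]["ts"]:
                merged.append(log_a[i]); i += 1
            else:
                merged.append(log_b[j]); j += 1
        return merged + log_a[i:] + log_b[j:]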
7.  OPTIMIZATION AND FUTURE WORK

When a data access server updates state data in the Cloud, it propagates a snapshot of the state data to multiple replicas. Note that state data has hundreds of attributes, so the transmission of a large volume of state data may block the network. Therefore, we proposed two optimization strategies in our previous work [4]: first, if only some less important attributes of the state (e.g., the position or orientation of a character) are modified, the backup can be skipped; second, only the timestamp, the ID, and the modified values are sent as messages to the Cloud. However, in order to implement the second optimization strategy, our proposed data access approach, data replication strategies, and concurrency control mechanism have to be changed. For example, even during the game, updated values must then be accepted by all replicas, so that a subsequent read request does not need to compare M-N+1 replicas. We will detail the necessary adjustments in our future work.

It is noteworthy that a data access server stores a timestamp repeatedly into the timestamp table, which increases the workload. A possible optimization is as follows: if a batch write is successful, the data access server caches the timestamp (TS) of this write request. Accordingly, in the timestamp table, we add a new column to each row to maintain a pointer. If a row is active (the value of LS is "login"), the pointer refers to the memory location of TS; if not, it refers to the row's own LMT. When a row becomes inactive, it uses TS to replace its LMT. In this way, the workload of a timestamp table is reduced significantly. However, the LMT and the version ID of state data may then become inconsistent due to a failure of the Cloud storage system or of the data access server.

8.  CONCLUSIONS

Our Cloud-based architecture for MMORPGs can cope with the data management requirements regarding availability and scalability successfully, while supporting data consistency remains an open issue. In this paper, we detailed our timestamp-based solution in theory, which will guide the implementation work in the future. We analyzed the data consistency requirements of each data set from the storage system's perspective, and studied the methods Cassandra provides to guarantee tunable consistency. We found that Cassandra cannot ensure read-your-writes consistency for state data and timed consistency for log data efficiently. Hence, we proposed a timestamp-based solution to improve this, and explained our ideas for concurrency control, data replication strategies, and fault handling in detail. In our future work, we will implement our proposals and the optimization strategies.

9.  ACKNOWLEDGEMENTS

Thanks to Eike Schallehn for his comments.

10.  REFERENCES
[1] Apache. Cassandra, January 2013. http://cassandra.apache.org/.
[2] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Conference on Innovative Data Systems Research (CIDR), pages 223–234, Asilomar, California, USA, 2011.
[3] S. Das, D. Agrawal, and A. E. Abbadi. G-Store: A scalable data store for transactional multi key access in the cloud. In Symposium on Cloud Computing (SoCC), pages 163–174, Indianapolis, Indiana, USA, 2010.
[4] Z. Diao and E. Schallehn. Cloud data management for online games: Potentials and open issues. In Data Management in the Cloud (DMC), Magdeburg, Germany, 2013. Accepted for publication.
[5] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.
[6] F. Gropengießer, S. Baumann, and K.-U. Sattler. Cloudy transactions: Cooperative XML authoring on Amazon S3. In Datenbanksysteme für Business, Technologie und Web (BTW), pages 307–326, Kaiserslautern, Germany, 2011.
[7] A. Lakshman. Cassandra - a decentralized structured storage system. Operating Systems Review, 44(2):35–40, 2010.
[8] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[9] F. W. Li, L. W. Li, and R. W. Lau. Supporting continuous consistency in multiplayer online games. In 12th ACM Multimedia 2004, pages 388–391, New York, New York, USA, 2004.
[10] H. Liu, M. Bowman, and F. Chang. Survey of state melding in virtual worlds. ACM Computing Surveys, 44(4):1–25, 2012.
[11] W. Palant, C. Griwodz, and P. Halvorsen. Consistency requirements in multiplayer online games. In Proceedings of the 5th Workshop on Network and System Support for Games (NETGAMES 2006), page 51, Singapore, 2006.
[12] F. J. Torres-Rojas, M. Ahamad, and M. Raynal. Timed consistency for shared distributed objects. In Proceedings of the 18th Annual ACM Symposium on Principles of Distributed Computing (PODC '99), pages 163–172, Atlanta, Georgia, USA, 1999.
[13] W. Vogels. Eventually consistent. ACM Queue, 6(6):14–19, 2008.
[14] Z. Wei, G. Pierre, and C.-H. Chi. Scalable transactions for web applications in the cloud. In 15th International Euro-Par Conference, pages 442–453, Delft, The Netherlands, 2009.
[15] K. Zhang and B. Kemme. Transaction models for massively multiplayer online games. In 30th IEEE Symposium on Reliable Distributed Systems (SRDS 2011), pages 31–40, Madrid, Spain, 2011.
[16] K. Zhang, B. Kemme, and A. Denault. Persistence in massively multiplayer online games. In Proceedings of the 7th ACM SIGCOMM Workshop on Network and System Support for Games (NETGAMES 2008), pages 53–58, Worcester, Massachusetts, USA, 2008.
                          Analysis of DDoS Detection Systems

                                                          Michael Singhof
                                                    Heinrich-Heine-Universität
                                                      Institut für Informatik
                                                       Universitätsstraße 1
                                                  40225 Düsseldorf, Deutschland
                                               singhof@cs.uni-duesseldorf.de


ABSTRACT
While there are plenty of papers describing algorithms for detecting distributed denial of service (DDoS) attacks, here an introduction to the considerations preceding such an implementation is given. Therefore, a brief history of and introduction to DDoS attacks is given, showing that these kinds of attacks are nearly two decades old. It is also shown that most algorithms used for the detection of DDoS attacks are outlier detection algorithms, so that intrusion detection can be seen as a part of the KDD research field.

It is then pointed out that no well-known and up-to-date test cases for DDoS detection systems exist. To overcome this problem in a way that allows algorithms to be tested and results to be reproducible for others, we advise using a simulator for DDoS attacks.

The challenge of detecting denial of service attacks in real time is addressed by presenting two recently published methods that try to solve the performance problem in very different ways. We compare both approaches and finally summarise the conclusions drawn from this, especially that methods concentrating on only one network traffic parameter are not able to detect all kinds of distributed denial of service attacks.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering, Information filtering

Keywords
DDoS, Intrusion Detection, KDD, Security

1.  INTRODUCTION

Denial of service (DoS) attacks are attacks that have the goal of making a network service unusable for its legitimate users. This can be achieved in different ways, either by targeting specific weaknesses in that service or by brute force approaches. A particularly well-known and dangerous kind of DoS attack are distributed denial of service attacks. These are more or less brute-force bandwidth DoS attacks carried out by multiple attackers simultaneously.

In general, there are two ways to detect any kind of network attack: signature-based approaches, in which the intrusion detection software compares network input to known attacks, and anomaly detection methods, in which the software is either trained with examples of normal traffic or not trained in advance at all. Obviously, the latter approach, anomaly detection, is more flexible, since normal network traffic does not change as quickly as attack methods. The algorithms used in this field are, essentially, known KDD methods for outlier detection, such as clustering algorithms, classification algorithms or novelty detection algorithms on time series. However, in contrast to many other related tasks such as credit card fraud detection, network attack detection is highly time critical, since attacks have to be detected in near real time. This makes finding suitable methods especially hard, because high precision is necessary, too, in order for an intrusion detection system not to cause more harm than it helps.

The main goal of this research project is to build a distributed denial of service detection system that can be used in today's networks and meets the demands formulated in the previous paragraph. In order to build such a system, many considerations have to be made. Some of these are presented in this work.

The remainder of this paper is structured as follows: In section 2 an introduction to distributed denial of service attacks and known countermeasures is given, and section 3 points out known test datasets. In section 4 some already existing approaches are presented, and finally section 5 concludes this work and gives insight into future work.

2.  INTRODUCTION TO DDOS ATTACKS

Denial of service and distributed denial of service attacks are not a new threat in the internet. In [15] the first notable denial of service attack is dated to 1996, when the internet provider Panix was taken down for a week by a TCP SYN flood attack. The same article dates the first noteworthy distributed denial of service attack to the year 1997, when internet service providers in several countries as well as an IRC network were attacked by a teenager. Since then, many of the more elaborate attacks that worked well in the past have been successfully defused.

Let us, as an example, examine the TCP SYN flood attack.
A TCP connection is established by a three-way handshake: on receiving a SYN request packet to open a TCP connection, the addressed computer has to store some information on the incoming packet and then answers with a SYN ACK packet, which, when a TCP connection is opened regularly, is in turn answered by an ACK packet.

The idea of the SYN flood attack is to cause a memory overrun on the victim by sending many TCP SYN packets. Since the victim has to store information for every such packet, while the attacker just generates new packets and ignores the victim's answers, the whole available memory of the victim can be used up, preventing the victim from opening legitimate connections to regular clients. As a countermeasure, SYN cookies were introduced in [7]. Here, instead of storing the information associated with the half-opened TCP connection in local memory, that information is encoded into the TCP sequence number. Since that number is returned by regular clients when they send the last packet of the three-way handshake described above, and since initial sequence numbers can be chosen arbitrarily by each connection partner, no changes to the TCP implementation on the client side are needed. Essentially, this reduces the SYN flood attack to a mere bandwidth-based attack.
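To illustrate how connection information can be encoded into the initial sequence number instead of being stored, consider the following simplified Python sketch. It is a hypothetical toy version of the idea from [7]: it omits details of real implementations, such as the MSS encoding, and does not correspond to any particular TCP stack.

    import hashlib, struct, time

    SECRET = b"per-server-secret"      # hypothetical secret key

    def _cookie(conn, slot):
        # conn = (src_ip_bytes, src_port, dst_ip_bytes, dst_port); slot = coarse time counter
        data = SECRET + struct.pack("!4sH4sHI", *conn, slot)
        return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

    def syn_cookie(conn):
        """Initial sequence number derived from the connection tuple instead of stored state."""
        return _cookie(conn, int(time.time()) >> 6)

    def cookie_valid(ack_seq, conn):
        """The client echoes ISN + 1 in its final ACK; accept the current or previous time slot."""
        slot = int(time.time()) >> 6
        return any((ack_seq - 1) & 0xFFFFFFFF == _cookie(conn, s) for s in (slot, slot - 1))

Because the server stores nothing until a valid ACK arrives, a flood of spoofed SYN packets no longer consumes connection memory; the attacker is left with raw bandwidth exhaustion.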
The same applies to many other attack methods that have been used successfully in the past, such as the smurf attack [1] or the fraggle attack. Both of these attacks are so-called reflector attacks that consist of sending an echo packet – an ICMP echo in the case of the smurf attack and a UDP echo in the case of the fraggle attack – to a network's broadcast address. The sender's address in this packet is forged so that the packet appears to be sent by the victim of the attack, so that all replies caused by the echo packet are routed to the victim.

Thus, it seems that nowadays most denial of service attacks are distributed denial of service attacks trying to exhaust the victim's bandwidth. Examples of this are the attacks on Estonian government and business computers in 2007 [12].

As already mentioned, distributed denial of service attacks are denial of service attacks with several participating attackers. The number of participating computers can differ largely, ranging from just a few machines to several thousand. Also, in most cases, the owners of these computers are not aware that they are part of an attack. This lies in the nature of most DDoS attacks, which consist of three steps:

  1. Building or reusing malware that is able to receive commands from the main attacker ("master") and to carry out the attack. A popular DDoS framework is Stacheldraht [9].

  2. Distributing the software created in step one to create a botnet. This step can essentially be carried out with every known method of distributing malware, for example by forged mail attachments or by adding it to software such as pirated copies.

  3. Launching the attack by giving the infected computers the command.

This procedure – from the point of view of the main attacker – has the advantage of not having to maintain a direct connection to the victim. This makes it very hard to track that person. It is notable, though, that during the attacks attributed to Anonymous in the years 2010 and 2012, the Low Orbit Ion Cannon [6] was used. This is originally a tool for stress testing that allows users, among other functions, to voluntarily join a botnet in order to carry out an attack. Since the tool is mainly intended for testing purposes, the queries are not masqueraded, so that it is easy to identify the participating persons. Again, however, the initiator of the attack does not necessarily have to have direct contact with the victim and thus remains unknown.

A great diversity of approaches to solve the problem of detecting DDoS attacks exists. Note again that this work focuses on anomaly detection methods only, that is, methods that essentially make use of outlier detection to distinguish normal traffic from attack traffic. In a field with as many publications as intrusion detection and, even more specialised, DDoS detection, it is not surprising that many different approaches are used, most of which are common in other knowledge discovery research fields as well.

Figure 1: Detection locations for DDoS attacks.

As can be seen in Figure 1, this research area can again be divided into three major categories, namely distributed detection or in-network detection, source end detection, and end point or victim end detection.

By distributed detection approaches we denote all approaches that use more than one node in order to monitor the network traffic. This kind of solution is mostly aimed at internet providers, and sometimes cooperation between more than one or even all ISPs is expected. The main idea of almost all of these systems is to monitor the network traffic inside the backbone network. The monitors are mostly expected to be backbone routers that communicate the results of their monitoring either to a central instance or among each other. These systems allow an early detection of suspicious network traffic, so that an attack can be detected and disabled – by dropping the suspicious network packets – before it reaches the server the attack is aimed at. However, despite these methods being very powerful in theory, they suffer from the main disadvantage of not being employable without the help of one or more ISPs. Currently, this makes these approaches impractical for end users since, to the knowledge of the author, at this moment no ISP uses such an approach.

Source end detection describes approaches that monitor outgoing attack streams. Of course, such methods can only be successful if the owner of an attacking computer is not aware of his computer participating in that attack. A wide deployment of such solutions is necessary for them to have an effect. If this happens, however, these methods have the chance not only to detect distributed denial of service attacks but also to prevent them by stopping the attacking traffic flows. However, in our opinion, the necessity of wide deployment makes a successful usage of these methods – at least in the near future – difficult.
least in the near future – difficult.
   In contrast to the approaches described above, end point detection describes those methods that rely on one host only. In general, this host can either be the same server other applications are running on or, in the case of small networks, a dedicated firewall. Clearly, these approaches suffer from one disadvantage: attacks cannot be detected before the attack packets arrive at their destination, as only those packets can be inspected. On the other hand, end point based methods allow individual deployment and can therefore be used today. Due to this fact, our work focuses on end point approaches.

3. TEST TRACES OF DISTRIBUTED DENIAL OF SERVICE ATTACKS

   Today, the testing of DDoS detection methods unfortunately is not easy, as not many recordings of real or simulated DDoS attacks exist or, at least, are not publicly available. The best known test trace is the KDD Cup 99 data set [3]. A detailed description of this data set is given in [18]. Other known datasets are the 1998 DARPA intrusion detection evaluation data set that has been described in [14] as well as the 1999 DARPA intrusion detection evaluation data set examined in [13].
   In terms of the internet, an age of 14 to 15 years makes these data sets rather old; they therefore cannot reflect today's traffic volume and behaviour in a realistic fashion. Since testing with real distributed denial of service attacks is rather difficult on both a technical and a legal level, we suggest the usage of a DDoS simulator. In order to get a feeling for today's web traffic, we recorded a trace at the main web server of Heinrich-Heine-Universität. Tracing started on 17th September 2012 at eight o'clock local time and lasted until eight o'clock the next day.
   This trace consists of 65612516 packets of IP traffic with 31841 unique communication partners contacting the web server. As can be seen in Table 1, almost all of these packets are TCP traffic. This is not surprising, as the HTTP protocol uses TCP and web page requests are HTTP messages.

   Packet type | No of packets | Percentage
   IP          |      65612516 |   100
   TCP         |      65295894 |    99.5174
   UDP         |            77 |     0.0001
   ICMP        |        316545 |     0.4824

   Protocol | Incoming Traffic | Outgoing Traffic
   IP       |         24363685 |         41248831
   TCP      |         24204679 |         41091215
   UDP      |               77 |                0
   ICMP     |           158929 |           157616

Table 1: Distribution of web traffic on protocol types and incoming and outgoing traffic at the university's web server.

   About one third of the TCP traffic is incoming traffic. This, too, is no surprise, as most clients send small request messages and, in return, get web pages that often include images or other larger data and thus consist of more than one packet. It can also be seen clearly that all of the UDP packets seem to be unwanted packets, as none of these packets is replied to. The low overall number of these packets is an indicator for this fact, too. With ICMP traffic, incoming and outgoing packet numbers are nearly the same, which lies in the nature of this message protocol.
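   The protocol breakdown in Table 1 and the inter-packet arrival times used later in section 4.1 can be derived directly from such a libpcap recording. The following minimal Python sketch illustrates one way to do this with the dpkt library; the file name and the use of dpkt (rather than the libpcap tooling [2] used for the original trace) are our own assumptions.

import dpkt

# Hypothetical file name; the trace of the university's web server itself
# is not publicly available.
TRACE_FILE = "webserver_trace.pcap"

counts = {"IP": 0, "TCP": 0, "UDP": 0, "ICMP": 0}
arrival_times = []   # inter-packet arrival times in seconds
last_ts = None

with open(TRACE_FILE, "rb") as f:
    for ts, buf in dpkt.pcap.Reader(f):
        ip = dpkt.ethernet.Ethernet(buf).data
        if not isinstance(ip, dpkt.ip.IP):
            continue                      # only IP traffic is of interest
        counts["IP"] += 1
        if isinstance(ip.data, dpkt.tcp.TCP):
            counts["TCP"] += 1
        elif isinstance(ip.data, dpkt.udp.UDP):
            counts["UDP"] += 1
        elif isinstance(ip.data, dpkt.icmp.ICMP):
            counts["ICMP"] += 1
        if last_ts is not None:
            arrival_times.append(ts - last_ts)
        last_ts = ts

total = counts["IP"] or 1
for proto, n in counts.items():
    print(f"{proto}: {n} packets ({100.0 * n / total:.4f} %)")

   Splitting the counts into incoming and outgoing traffic would only require an additional comparison of the source and destination addresses against the web server's address.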
   In order to overcome the problems with old traces, as a next step we implement a simulator for distributed denial of service attacks based on the characteristics of the web trace. As the results in [20] show, the network simulators OMNeT++ [19], ns-3 [10] and JiST [5] are, in terms of speed and memory usage, more or less equal. To not let the simulation become either too slow or too inaccurate, it is intended to simulate the nearer neighbourhood of the victim server very accurately. With greater distance to the victim, it is planned to simulate in less detail. In this context, the distance between two network nodes is given by the number of hops between the nodes.
   Simulation results will then be compared with the aforementioned network trace to ensure realistic behaviour. Once the simulation of normal network traffic resembles the real traffic at the victim server closely enough, we will proceed by implementing distributed denial of service attacks in the simulator environment. With this simulator it will then, hopefully, be possible to test existing and new distributed denial of service detection approaches in greater detail than has been possible in the past.

4. EXISTING APPROACHES

   Many approaches to the detection of distributed denial of service attacks already exist. As has been previously pointed out in section 1, in contrast to many other outlier and novelty detection applications in the KDD field, the detection of DDoS attacks is extremely time critical, hence near real time detection is necessary.
   Intuitively, the fewer parameters are observed by an approach, the faster it should work. Therefore, we first take a look at a recently published method that relies on one parameter only.

4.1 Arrival Time Based DDoS Detection

   In [17] the authors propose an approach that is based on irregularities in the inter-packet arrival times. By this term
the authors describe the time that elapses between two subsequent packets.
   The main idea of this work is based on [8], where non-asymptotic fuzzy estimators are used to estimate variable costs. Here, this idea is used to estimate the mean arrival time x̄ of normal traffic packets. Then, the mean arrival time of the current traffic – denoted by tc – is estimated, too, and compared to the overall value. If tc > x̄, the traffic is considered normal traffic, and if tc < x̄ a DDoS attack is assumed to be happening. We suppose here that for a value of tc = x̄ no attack is happening, although this case is not considered in the original paper.
   To get a general feeling for the arrival times, we computed them for our trace. The result is shown in Figure 2. Note that the y-axis is scaled logarithmically, as values for arrival times larger than 0.1 seconds could not be distinguished from zero on a linear y-axis. It can be seen here that most arrival times are very close to zero. It is also noteworthy that, due to the limited precision of libpcap [2], the most common arrival interval is zero.

Figure 2: Arrival times for the university's web-server trace (number of packets over arrival time in seconds, logarithmic y-axis).

   Computing the fuzzy mean estimator for packet arrival times yields the graph presented in Figure 3 and x̄ ≈ 0.00132. Note that, since the choice of the parameter β ∈ [0, 1) is not specified in [17], we chose β = 1/2. We will see, however, that, as far as our understanding of the proposed method goes, this parameter has no further influence.

Figure 3: The fuzzy mean estimator constructed according to [17] (membership value α over arrival times between about 0.00122 s and 0.00142 s).

   To compute the α-cuts of a fuzzy number, one has to compute

      αM = [ x̄ − z_g(α) · σ/√n ,  x̄ + z_g(α) · σ/√n ]

where x̄ is the mean value – i.e. exactly the value that is going to be estimated – and σ is presumably the standard deviation of the arrival times. Also

      g(α) = (1/2 − β/2) · α + β/2

and

      z_g(α) = Φ⁻¹(1 − g(α)).

   Note that αM is the (1 − α)(1 − β) confidence interval for µ, the real mean value of packet arrival times.
   Now, since we are solely interested in the estimation of x̄, only 1M is needed, which is computed to be [x̄, x̄] since

      g(1) = (1/2 − β/2) · 1 + β/2 = 1/2 · (1 − β + β) = 1/2

and

      z_g(1) = Φ⁻¹(1 − g(1)) = Φ⁻¹(1/2) = 0.

   During traffic monitoring, for a given time interval, the current traffic arrival times tc are computed by estimating

      [tc]_α = [ ln(1/(1 − p)) · 1/r_α ,  ln(1/(1 − p)) · 1/l_α ]

where p is some given but again not specified probability and [l_α, r_α] are the α-cuts for E(T) = t̄. As described above, the only value that is of further use is tc, the only value in the interval [tc]_1. Since [E(T)]_1 = [t̄]_1 = [t̄, t̄], it follows that

      [tc]_1 = [ ln(1/(1 − p)) · 1/t̄ ,  ln(1/(1 − p)) · 1/t̄ ]

and thus

      tc = ln(1/(1 − p)) · 1/t̄ = (ln(1) − ln(1 − p)) / t̄.

As ln(1) = 0, this can be further simplified to

      tc = −ln(1 − p)/t̄ ∈ [0, ∞)

with p ∈ [0, 1).
   By this we are able to determine a value for p by choosing the smallest p for which tc ≥ x̄ holds for all intervals in our trace. An interval length of four seconds was chosen to ensure comparability with the results presented in [17].
   During the interval with the highest traffic volume, 53568 packets arrived, resulting in an average arrival time of t̄ ≈ 7.4671 · 10⁻⁵ seconds. Note here that we did not maximise the number of packets per interval but instead let the first interval begin at the first timestamp in our trace, rounded down to full seconds, and split the trace sequentially from there on.
   Now, in order to compute p, one has to set

      p = 1 − e^(−x̄ · t̄)

leading to p ≈ 9.8359 · 10⁻⁸. As soon as this value of p is learned, the approach is essentially a static comparison.
   There are, however, other weaknesses to this approach as well: since the only monitored value is the arrival time, a statement on values such as bandwidth usage cannot be made. Consider an attack where multiple corrupted computers try to download a large file from a server via a TCP connection. This behaviour will result in relatively large packets being sent from the server to the clients, resulting in larger arrival times as well. Still, the server's connection can be jammed by this traffic, thus causing a denial of service.
   From this we draw the conclusion that a method relying on only one parameter – in this example arrival times – cannot detect all kinds of DDoS attacks. Thus, despite its low processing requirements, such an approach is in our opinion not suited for general DDoS detection, even if it seems that it can detect packet flooding attacks with high precision, as stated in the paper.
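   To make the resulting detection rule concrete, the following Python sketch shows the static comparison that remains once p has been learned. It is a simplified reading of the approach in [17] under our assumptions (β = 1/2, four-second intervals); the fuzzy estimator itself and the exact calibration procedure of the original paper are reduced to the mean-based comparison derived above, and the example numbers are the ones quoted in the text.

import math

def calibrate_p(x_bar, t_bar):
    # Calibration as quoted above: p = 1 - exp(-x_bar * t_bar), where t_bar is
    # the mean arrival time of the chosen training interval.
    return 1.0 - math.exp(-x_bar * t_bar)

def classify_interval(arrival_times, x_bar, p):
    # Decision rule of [17] as summarised above: estimate the current mean
    # arrival time tc and report an attack iff tc < x_bar.
    t_mean = sum(arrival_times) / len(arrival_times)
    tc = -math.log(1.0 - p) / t_mean
    return "attack" if tc < x_bar else "normal"

# Numbers from the text: x_bar ~ 0.00132 s and a busiest four-second training
# interval with t_bar ~ 7.4671e-5 s, which yields p ~ 9.8e-8.
x_bar = 0.00132
p = calibrate_p(x_bar, 7.4671e-5)
print(p, classify_interval([2.0e-5] * 10, x_bar, p))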
4.2 Protocol Type Specific DDoS Detection

   In [11] another approach is presented: instead of using the same methods on all types of packets, different procedures are used for different protocol types. This is due to the fact that different protocols show different behaviour. Especially TCP traffic behaviour differs from UDP and ICMP traffic because of its flow control features. By this the authors try to minimise the feature set characterising distributed denial of service attacks for every protocol type separately, such that computation time is minimised, too.
   The proposed detection scheme is described as a four step approach, as shown in Figure 4. The first step is the preprocessing, where all features of the raw network traffic are extracted. Then packets are forwarded to the corresponding modules based on the packet's protocol type.

Figure 4: Protocol specific DDoS detection architecture as proposed in [11].

   The next step is the protocol specific feature selection. Here, per protocol type, the most significant features are selected. This is done by using the linear correlation based feature selection (LCFS) algorithm that has been introduced in [4], which essentially ranks the given features by their correlation coefficients given by

      corr(X, Y) := Σ_{i=1..n} (xᵢ − x̄)(yᵢ − ȳ) / √( Σ_{i=1..n} (xᵢ − x̄)² · Σ_{i=1..n} (yᵢ − ȳ)² )

for two random variables X, Y with values xᵢ, yᵢ, 1 ≤ i ≤ n, respectively. A pseudo code version of LCFS is given in Algorithm 1. As can be seen there, the number of features in the reduced set must be given by the user. This number characterises the trade-off between precision of the detection and detection speed.

Algorithm 1 LCFS algorithm based on [11].
Require: the initial set of all features I, the class-outputs y, the desired number of features n
Ensure: the dimension reduced subset F ⊂ I
 1: for all fᵢ ∈ I do
 2:    compute corr(fᵢ, y)
 3: end for
 4: f := arg max { corr(fᵢ, y) | fᵢ ∈ I }
 5: F := {f}
 6: I := I \ {f}
 7: while |F| < n do
 8:    f := arg max { corr(fᵢ, y) − (1/|F|) · Σ_{fⱼ∈F} corr(fᵢ, fⱼ) | fᵢ ∈ I }
 9:    F := F ∪ {f}
10:    I := I \ {f}
11: end while
12: return F
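   The following Python sketch mirrors Algorithm 1 for tabular data. It is our own illustration, not the implementation evaluated in [11]; the feature names and the toy data are made up, and the correlation coefficient is computed directly from the formula above.

import math

def corr(xs, ys):
    # Correlation coefficient as defined above.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def lcfs(features, y, n):
    # features: dict mapping feature name -> list of values, y: class outputs.
    remaining = dict(features)
    # Lines 1-6: pick the feature correlating most strongly with the class.
    first = max(remaining, key=lambda f: corr(remaining[f], y))
    selected = [first]
    del remaining[first]
    # Lines 7-11: greedily add features with high class correlation and low
    # average correlation to the already selected features.
    while len(selected) < n and remaining:
        def score(f):
            redundancy = sum(corr(remaining[f], features[s]) for s in selected)
            return corr(remaining[f], y) - redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy example with three hypothetical traffic features.
feats = {"pkts_per_s":  [1, 2, 3, 10, 12, 11],
         "bytes_per_s": [2, 4, 6, 20, 24, 22],
         "syn_ratio":   [0.1, 0.2, 0.1, 0.9, 0.8, 0.9]}
labels = [0, 0, 0, 1, 1, 1]
print(lcfs(feats, labels, 2))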
   The third step is the classification of the instances into either normal traffic or DDoS traffic. The classification is trained on the reduced feature set generated in the previous step. The authors tested different well known classification techniques and established C4.5 [16] as the method working best in this case.
   Finally, the outputs of the classifiers are given to the merger in order to report warnings over one alarm generation interface instead of three. The authors mention that there is a check for false positives in the merger, too. However, no further information is given on how this check works, apart from the fact that it is relatively slow.
   The presented experiments have been carried out on the aforementioned KDD Cup data set as well as on two self-made data sets for which the authors attacked a server within the university's campus. The presented results show that on all data sets the DDoS detection accuracy varies in the range of 99.683% to 99.986% if all of the traffic's attributes are used. When reduced to three or five attributes, accuracy stays high with detection rates of 99.481% to 99.972%. At the same time, the computation time shrinks by a factor of two, leading to a per-instance computation time of 0.0116 ms (three attributes) on the KDD Cup data set and 0.0108 ms (three attributes) and 0.0163 ms (five attributes) on the self-recorded data sets of the authors.
   Taking into account the 53568 packets in a four second interval we recorded, the computation time during this interval would be about 53568 · 0.0163 ms ≈ 0.87 seconds. However, there is no information in the paper about the machine that carried out the computations, so this number appears to be rather meaningless. If we suppose a fast machine with no additional tasks, this computation time would be relatively high.
   Nevertheless, the results presented in the paper are promising enough to consider a future re-evaluation on a known machine with our recorded trace and simulated DDoS attacks.

5. CONCLUSION

   We have seen that distributed denial of service attacks are, in comparison to the age of the internet itself, a relatively old threat. Against many of the more sophisticated attacks, specialised counter measures exist, such as TCP SYN cookies to prevent the dangers of SYN flooding. Thus, most DDoS attacks nowadays are pure bandwidth or brute force attacks, and attack detection should focus on these types of attacks, making outlier detection techniques the method of choice. Still, since many DDoS toolkits such as Stacheldraht allow for attacks like SYN flooding, properties of these attacks can still indicate an ongoing attack.
   Also, although much research in the field of DDoS detection has been done during the last two decades, leading to a nearly equally large number of possible solutions, we have seen in section 3 that one of the biggest problems is the unavailability of recent test traces or of a simulator able to produce such traces. With the best known test series
having an age of fourteen years today, the results presented in many of the research papers on this topic are difficult to compare and confirm.
   Even if one can rate the suitability of certain approaches with respect to detecting certain attacks, as seen in section 4, a definite judgement of given methods is not easy. Before starting to implement an own approach to distributed denial of service detection, we therefore want to overcome this problem by implementing a DDoS simulator.
   With the help of this tool, we will subsequently be able to compare existing approaches among each other and to our ideas in a fashion reproducible for others.

6. REFERENCES
 [1] CERT CC. Smurf Attack. http://www.cert.org/advisories/CA-1998-01.html.
 [2] The Homepage of Tcpdump and Libpcap. http://www.tcpdump.org/.
 [3] KDD Cup Dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
 [4] F. Amiri, M. Rezaei Yousefi, C. Lucas, A. Shakery, and N. Yazdani. Mutual Information-based Feature Selection for Intrusion Detection Systems. Journal of Network and Computer Applications, 34(4):1184–1199, 2011.
 [5] R. Barr, Z. J. Haas, and R. van Renesse. JiST: An Efficient Approach to Simulation Using Virtual Machines. Software: Practice and Experience, 35(6):539–576, 2005.
 [6] A. M. Batishchev. Low Orbit Ion Cannon. http://sourceforge.net/projects/loic/.
 [7] D. Bernstein and E. Schenk. TCP SYN Cookies. On-line journal, http://cr.yp.to/syncookies.html, 1996.
 [8] K. A. Chrysafis and B. K. Papadopoulos. Cost–volume–profit Analysis Under Uncertainty: A Model with Fuzzy Estimators Based on Confidence Intervals. International Journal of Production Research, 47(21):5977–5999, 2009.
 [9] D. Dittrich. The 'Stacheldraht' Distributed Denial of Service Attack Tool. http://staff.washington.edu/dittrich/misc/stacheldraht.analysis, 1999.
[10] T. Henderson. ns-3 Overview. http://www.nsnam.org/docs/ns-3-overview.pdf, May 2011.
[11] H. J. Kashyap and D. Bhattacharyya. A DDoS Attack Detection Mechanism Based on Protocol Specific Traffic Features. In Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology, pages 194–200. ACM, 2012.
[12] M. Lesk. The New Front Line: Estonia under Cyberassault. IEEE Security & Privacy, 5(4):76–79, 2007.
[13] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das. The 1999 DARPA Off-line Intrusion Detection Evaluation. Computer Networks, 34(4):579–595, 2000.
[14] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, et al. Evaluating Intrusion Detection Systems: The 1998 DARPA Off-line Intrusion Detection Evaluation. In DARPA Information Survivability Conference and Exposition (DISCEX '00), volume 2, pages 12–26. IEEE, 2000.
[15] G. Loukas and G. Öke. Protection Against Denial of Service Attacks: A Survey. The Computer Journal, 53(7):1020–1037, 2010.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[17] S. N. Shiaeles, V. Katos, A. S. Karakos, and B. K. Papadopoulos. Real Time DDoS Detection Using Fuzzy Estimators. Computers & Security, 31(6):782–790, 2012.
[18] M. Tavallaee, E. Bagheri, W. Lu, and A.-A. Ghorbani. A Detailed Analysis of the KDD CUP 99 Data Set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications, 2009.
[19] A. Varga and R. Hornig. An Overview of the OMNeT++ Simulation Environment. In Proceedings of the 1st International Conference on Simulation Tools and Techniques for Communications, Networks and Systems (Simutools '08), pages 60:1–60:10. ICST, Brussels, Belgium, 2008.
[20] E. Weingartner, H. vom Lehn, and K. Wehrle. A Performance Comparison of Recent Network Simulators. In IEEE International Conference on Communications (ICC '09), pages 1–5, 2009.
         A Conceptual Model for the XML Schema Evolution
                   Overview: Storing, Base-Model-Mapping and Visualization
                                  Thomas Nösinger, Meike Klettke, Andreas Heuer
                                                    Database Research Group
                                                  University of Rostock, Germany
                                       (tn, meike, ah)@informatik.uni-rostock.de


ABSTRACT

In this article the conceptual model EMX (Entity Model for XML-Schema) for dealing with the evolution of XML Schema (XSD) is introduced. The model is a simplified representation of an XSD, which hides the complexity of XSD and offers a graphical presentation. For this purpose a unique mapping is necessary, which is presented together with further information about the visualization and the logical structure. A small example illustrates the relationships between an XSD and an EMX. Finally, the integration into a research prototype developed for dealing with the co-evolution of corresponding XML documents is presented.

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany.
Copyright is held by the author/owner(s).

1. INTRODUCTION

   The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To assure that well-defined XML documents can be understood by every participant (e.g. a user or an application) it is necessary to introduce a document description, which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. An XML document is called valid if it fulfills all restrictions and conditions of an associated XML Schema.
   XML Schemas that have been used for years have to be modified from time to time. The main reason is that the requirements for exchanged information can change. To meet these requirements the schema has to be adapted, for example if additional elements are added into an existing content model, the data type of information changes or integrity constraints are introduced. All in all, every possible structure of an XML Schema definition (XSD) can be changed. A question occurs: in which way can somebody make these adaptions without being forced to understand and deal with the whole complexity of an XSD? One solution is the definition of a conceptual model for simplifying the base-model; in this paper we outline further details of our conceptual model called EMX (Entity Model for XML-Schema).
   A further issue, not covered in this paper but important in the overall context of exchanging information, is the validity of XML documents [5]. Modifications of an XML Schema require adaptions of all XML documents that are valid against the former XML Schema (also known as co-evolution).
   One impractical way to handle this problem is to introduce different versions of an XML Schema, but in this case all versions have to be stored and every participant of the heterogeneous environment has to understand all different document descriptions. An alternative solution is the evolution of the XML Schema, so that just one document description exists at a time. The above mentioned validity problem of XML documents is not solved by this alone, but with a standardized description of the adaptions (e.g. a sequence of operations [8]) and by knowing a conceptual model including the corresponding mapping to the base-model (e.g. XSD), it is possible to derive the necessary XML document transformation steps automatically from the adaptions [7]. The conceptual model is an essential prerequisite for the evolution of XML Schema, a process that is handled here only incidentally and not in detail.
   This paper is organized as follows. Section 2 gives the necessary background of XML Schema and corresponding concepts. Section 3 presents our conceptual model by first giving a formal definition (3.1), followed by the specification of the unique mapping between EMX and XSD (3.2) and the logical structure of the EMX (3.3). After introducing the conceptual model we present an example of an EMX in section 4. In section 5 we describe the practical use of EMX in our prototype, which was developed to handle the co-evolution of XML Schema and XML documents. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND

   In this section we present a common notation used in the rest of the paper. At first we shortly introduce the abstract data model (ADM) and the element information item (EII) of XML Schema, before further details concerning different modeling styles are given.
   The XML Schema abstract data model consists of different components or node types¹; basically these are: type definition components (simple and complex types), declaration components (elements and attributes), model group components, constraint components, group definition components
and annotation components [3]. Additionally, the element information item exists, an XML representation of these components. The element information item defines which content and attributes can be used in an XML Schema. Table 1 gives an overview of the most important components and their concrete representation. The EIIs for embedding externally defined XML Schemas are not explicitly given in the abstract data model (N.N. - Not Named), but they are important components for embedding externally defined element declarations, attribute declarations and type definitions. In the rest of the paper, we will summarize them under the node type "module".

¹An XML Schema can be visualized as a directed graph with different nodes (components); an edge realizes the hierarchy.

   ADM               | Element Information Item
   declarations      | <element>, <attribute>
   group-definitions | <attributeGroup>
   model-groups      | <all>, <choice>, <sequence>, <any>, <anyAttribute>
   type-definitions  | <simpleType>, <complexType>
   N.N.              | EIIs for embedding external schemas (node type "module")
   annotations       | <annotation>
   constraints       | <key>, <keyref>, <unique>, <assert>, <assertion>
   N.N.              | <schema>

Table 1: XML Schema Information Items
   The <schema> "is the document (root) element of any W3C XML Schema. It's both a container for all the declarations and definitions of the schema and a place holder for a number of default values expressed as attributes" [9]. Analyzing the possibilities of specifying declarations and definitions leads to four different modeling styles of XML Schema, these are: Russian Doll, Salami Slice, Venetian Blind and Garden of Eden [6]. These modeling styles mainly influence the re-usability of element declarations or defined data types and also the flexibility of an XML Schema in general. Figure 1 summarizes the modeling styles with their scopes.

   Scope                             |        | Russian Doll | Salami Slice | Venetian Blind | Garden of Eden
   element and attribute declaration | local  |      x       |              |       x        |
                                     | global |              |      x       |                |       x
   type definition                   | local  |      x       |      x       |                |
                                     | global |              |              |       x        |       x

Figure 1: XSD Modeling Styles according to [6]

   The scope of element and attribute declarations as well as the scope of type definitions is global iff the corresponding node is specified as a child of the <schema> and can be referenced (by knowing e.g. the name and namespace). Locally specified nodes are, in contrast, not directly under <schema>; their re-usability is accordingly not given respectively not possible.
   An XML Schema in the Garden of Eden style just contains global declarations and definitions. If the requirements against exchanged information change and the underlying schema has to be adapted, then this modeling style is the most suitable. The advantage of the Garden of Eden style is that all components can be easily identified by knowing the QNAME (qualified name). Furthermore, the position of components within an XML Schema is obvious. A qualified name is a colon separated string of the target namespace of the XML Schema followed by the name of the declaration or definition. The name of a declaration or definition is a string of the data type NCNAME (non-colonized name), a string without colons. The Garden of Eden style is the basic modeling style which is considered in this paper; a transformation between different styles is possible.²

²A student thesis to address the issue of converting different modeling styles into each other is in progress at our professorship; this is not covered in this paper.

3. CONCEPTUAL MODEL

   In [7] the three layer architecture for dealing with XML Schema adaptions (i.e. the XML Schema evolution) was introduced and the correlations between the layers were mentioned. An overview is illustrated in figure 2. The first layer is our conceptual model EMX (Entity Model for XML-Schema), a simplified representation of the second layer. The second layer is the XML Schema (XSD); a unique mapping between these layers exists. This mapping is one of the main aspects of this paper (see section 3.2). The third layer are XML documents or instances; an ambiguous mapping between XSD and XML documents exists. It is ambiguous because of the optionality of structures (e.g. minOccurs = '0'; use = 'optional') or of content types. The third layer and the mapping between layer two and three, as well as the operations for transforming the different layers, are not covered in this paper (parts were published in [7]).

Figure 2: Three Layer Architecture (operations transform an EMX into an EMX', an XSD into an XSD' and an XML document into an XML'; EMX and XSD are connected by a 1 - 1 mapping, XSD and XML documents by a 1 - * mapping).

3.1 Formal Definition

   The conceptual model EMX is a triplet of nodes (N_M), directed edges between nodes (E_M) and features (F_M):

      EMX = (N_M, E_M, F_M)                (1)

Nodes are separated into simple types (st), complex types (ct), elements, attribute-groups, groups (e.g. content models), annotations, constraints and modules (i.e. externally managed XML Schemas). Every node has, under consideration of the element information item of a corresponding XSD, different attributes, e.g. an element node has a name, occurrence values, type information, etc. One of the most important
attributes of every node is the EID (EMX ID), a unique identification value for referencing and localization of every node; an EID is one-to-one in every EMX. The directed edges are defined between nodes by using the EIDs, i.e. every edge is a pair of EID values from a source to a target. The direction defines the include property, which was specified under consideration of the possibilities of an XML Schema. For example, if a model-group of the abstract data model (i.e. an EMX group with EID = 1) contains different elements (e.g. EID = {2, 3}), then two edges exist: (1,2) and (1,3). In section 3.3 further details about allowed edges are specified (see also figure 5). The additional features allow the user-specific setting of the overall process of co-evolution. It is not only possible to specify default values but also to configure the general behaviour of operations (e.g. that only capacity-preserving operations are allowed). Furthermore, all XML Schema properties of the element information item <schema> are included in the additional features. The additional features are not covered in this paper.
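   As an illustration of this formal definition, the following Python sketch models an EMX as EID-keyed nodes and (source, target) edges. The node kinds and the group-contains-elements example are taken from the text; the class layout and the element names are our own simplification and not the data structure of the prototype.

from dataclasses import dataclass, field

NODE_KINDS = {"st", "ct", "element", "attribute-group", "group",
              "annotation", "constraint", "module"}

@dataclass
class EMXNode:
    eid: int                     # unique within one EMX
    kind: str                    # one of NODE_KINDS
    attributes: dict = field(default_factory=dict)   # e.g. name, occurrence

@dataclass
class EMX:
    nodes: dict = field(default_factory=dict)    # EID -> EMXNode
    edges: set = field(default_factory=set)      # (source EID, target EID)
    features: dict = field(default_factory=dict)

    def add_node(self, node: EMXNode):
        assert node.kind in NODE_KINDS and node.eid not in self.nodes
        self.nodes[node.eid] = node

    def add_edge(self, source_eid: int, target_eid: int):
        # "include" direction: the source contains the target.
        assert source_eid in self.nodes and target_eid in self.nodes
        self.edges.add((source_eid, target_eid))

# Example from section 3.1: a group (EID 1) containing two elements.
emx = EMX()
emx.add_node(EMXNode(1, "group"))
emx.add_node(EMXNode(2, "element", {"name": "a"}))
emx.add_node(EMXNode(3, "element", {"name": "b"}))
emx.add_edge(1, 2)
emx.add_edge(1, 3)
print(sorted(emx.edges))   # [(1, 2), (1, 3)]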
3.2 Mapping between XSD and EMX

   An overview of the components of an XSD has been given in table 1. In the following section the unique mapping between these XSD components and the EMX nodes introduced in section 3.1 is specified. Table 2 summarizes the mapping. For every element information item (EII) the corresponding EMX node is given as well as the assigned visualization. For example, an EMX node group represents the abstract data model (ADM) node model-group (see table 1). This ADM node is visualized through the EII content models <all>, <sequence> and <choice>, and the wildcards <any> and <anyAttribute>. In an EMX the visualization of a group is the blue "triangle with a G" in it. Furthermore, if a group contains an element wildcard, then this is visualized by adding a "blue W in a circle"; a similar behaviour takes place if an attribute wildcard is given in an <attributeGroup>.

   EII                                                 | EMX Node        | Visualization
   <element>                                           | element         |
   <attribute>, <attributeGroup>                       | attribute-group |
   <all>, <choice>, <sequence>, <any>, <anyAttribute>  | group           | blue triangle with a "G"
   <simpleType>                                        | st              | implicit and specifiable
   <complexType>                                       | ct              | implicit and derived
   EIIs for embedding external schemas                 | module          |
   <annotation>                                        | annotation      |
   <key>, <keyref>, <unique>                           | constraint      |
   <assert>                                            |                 | implicit in ct
   <assertion>                                         |                 | restriction in st
   <schema>                                            | the EMX itself  |

Table 2: Mapping and Visualization

   The type-definitions are not directly visualized in an EMX. Simple types, for example, can be specified and afterwards be referenced by elements or attributes³ by using the EID of the corresponding EMX node. The complex type is also implicitly given; the type will be automatically derived from the structure of the EMX after finishing the modeling process. The XML Schema specification 1.1 has introduced different logical constraints, which are also integrated in the EMX. These are the EIIs <assert> (for constraints on complex types) and <assertion>. An <assertion> is, under consideration of the specification, a facet of a restricted simple type [4]. The last EII is <schema>; this "root" is an EMX itself. This is also the reason why further information or properties of an XML Schema are stored in the additional features, as mentioned above.

³The EIIs <attribute> and <attributeGroup> are the same in the EMX; an attribute-group is always a container.

3.3 Logical Structure

   After introducing the conceptual model and specifying the mapping between an EMX and an XSD, in the following section details about the logical structure (i.e. the storing model) are given. Also details about the valid edges of an EMX are illustrated. Figure 3 gives an overview of the different relations used as well as the relationships between them. The logical structure is the direct consequence of the used modeling style Garden of Eden, e.g. elements are either element declarations or element references. That's why this separation is also made in the EMX.

Figure 3: Logical Structure of an EMX (the relations Schema, Element, Element_Ref, Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref, Group, Wildcard, CT, ST, ST_List, Facet, Constraint, Path, Assert, Annotation and Module, connected by parent_EID, has_a and same relationships; visualized EMX nodes and external references are marked).

   All in all there are 18 relations, which store the content of an XML Schema and form the base of an EMX. The different nodes reference each other by using the well known foreign key constraints of relational databases. This is expressed by the directed "parent_EID" arrows, e.g. the EMX nodes ("rectangle with thick line") element, st, ct, attribute-group and modules reference the "Schema" itself. If declarations or definitions are externally defined, then the "parent_EID" is the EID of the corresponding module ("blue arrow"). The "Schema" relation is an EMX respectively the root of an EMX, as already mentioned above.
   The "Annotation" relation can reference every other relation, according to the XML Schema specification. Wildcards are realized as an element wildcard, which belongs to a "Group" (i.e. the EII <any>), or as an attribute wildcard, which belongs to a "CT" or an "Attribute_Gr" (i.e. the EII <anyAttribute>). Every "Element" relation (i.e. element declaration) has either a simple type or a complex type, and every "Element_Ref" relation has an element declaration. Attributes and attribute-groups are the same in an EMX, as mentioned above.
   Moreover, figure 3 illustrates the distinction between visualized ("yellow border") and not visualized relations. Under consideration of table 2, six relations are directly visible in an EMX: constraints, annotations, modules, groups and, because of the Garden of Eden style, element references and attribute-group references. Table 3 summarizes which relation of figure 3 belongs to which EMX node of table 2.

   EMX Node        | Relation
   element         | Element, Element_Ref
   attribute-group | Attribute, Attribute_Ref, Attribute_Gr, Attribute_Gr_Ref
   group           | Group, Wildcard
   st              | ST, ST_List, Facet
   ct              | CT
   annotation      | Annotation
   constraint      | Constraint, Path, Assert
   module          | Module

Table 3: EMX Nodes with Logical Structure

   The EMX node st (i.e. simple type) has three relations. These are the relation "ST" for the most simple types, the relation "ST_List" for the set free storing of simple union types and the relation "Facet" for storing the facets of a simple restriction.
   The attributes of these relations are specified under consideration of the XML Schema specification [4], e.g. an element declaration needs a "name" and a type ("type_EID" as a foreign key) as well as other optional values like the final ("finalV"), default ("defaultV"), "fixed", "nillable", XML Schema "id" or "form" value. Other EMX specific attributes are also given, e.g. the "file_ID" and the "parent_EID" (see figure 3). The element references have a "ref_EID", which is a foreign key to a given element declaration. Moreover, attributes of the occurrence ("minOccurs", "maxOccurs"), the "position" in a content model and the XML Schema "id" are stored. Element references are visualized in an EMX. That's why some values about the position in an EMX are stored, i.e. the coordinates ("x_Pos", "y_Pos") and the "width" and "height" of an EMX node. The same position attributes are given in every other visualized EMX node.
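   The attribute lists above translate directly into relational tables. The following Python sqlite3 sketch shows how the "Element" and "Element_Ref" relations could look; the column names follow the text, but the exact schema of the prototype is not given in the paper, so this is only an assumed illustration with made-up example values.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Element (
    EID        INTEGER PRIMARY KEY,      -- unique within one EMX
    name       TEXT NOT NULL,
    type_EID   INTEGER,                  -- foreign key to a simple or complex type
    finalV     TEXT, defaultV TEXT, fixed TEXT,
    nillable   TEXT, id TEXT, form TEXT,
    file_ID    INTEGER,
    parent_EID INTEGER                   -- the Schema or, if external, a Module
);
CREATE TABLE Element_Ref (
    EID        INTEGER PRIMARY KEY,
    ref_EID    INTEGER REFERENCES Element(EID),
    minOccurs  TEXT, maxOccurs TEXT,
    position   INTEGER, id TEXT,
    x_Pos REAL, y_Pos REAL, width REAL, height REAL,
    parent_EID INTEGER
);
""")
# A global element declaration (Garden of Eden style) and one reference to it.
con.execute("INSERT INTO Element (EID, name, type_EID, parent_EID) "
            "VALUES (2, 'title', 10, 0)")
con.execute("INSERT INTO Element_Ref (EID, ref_EID, minOccurs, maxOccurs, position, parent_EID) "
            "VALUES (5, 2, '1', '1', 1, 1)")
print(con.execute("SELECT e.name FROM Element_Ref r "
                  "JOIN Element e ON r.ref_EID = e.EID").fetchall())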
tion of figure 3 belongs to which EMX node of table 2.                The edges of the formal definition of an EMX can be de-
                                                                   rived by knowing the logical structure and the visualization
         EMX Node                             Relation             of an EMX. Figure 5 illustrates the allowed edges of EMX
            element                    Element, Element Ref        nodes. An edge is always a pair of EIDs, from a source
        attribute-group               Attribute, Atttribute Ref,
                                            Attribute Gr,




                                                                                                                    attribute-group
                                          Attribute Gr Ref




                                                                                               source X




                                                                                                                                                        annotation
            group                         Group, Wildcard




                                                                                                                                                                     constraint
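   To make the storing relations of figure 4 concrete, the following minimal sketch creates the two relations of the EMX node element in an in-memory SQLite database. It is an illustration only, not the storage layer of CodeX; the column names follow figure 4, the column types are assumptions.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE element (
        EID        INTEGER PRIMARY KEY,  -- one-to-one within an EMX
        name       TEXT NOT NULL,
        type_EID   INTEGER,              -- foreign key to the simple or complex type
        finalV     TEXT, defaultV TEXT, fixed TEXT, nillable TEXT, id TEXT, form TEXT,
        file_ID    INTEGER,              -- module the declaration comes from
        parent_EID INTEGER               -- the Schema (or module) it belongs to
    );
    CREATE TABLE element_ref (
        EID        INTEGER PRIMARY KEY,
        ref_EID    INTEGER REFERENCES element(EID),  -- referenced element declaration
        minOccurs  TEXT, maxOccurs TEXT,             -- occurrence values
        position   INTEGER, id TEXT, file_ID INTEGER, parent_EID INTEGER,
        width      INTEGER, height INTEGER, x_Pos INTEGER, y_Pos INTEGER  -- visualization
    );
    """)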
   The edges of the formal definition of an EMX can be derived by knowing the logical structure and the visualization of an EMX. Figure 5 illustrates the allowed edges of EMX nodes. An edge is always a pair of EIDs, from a source (”X”) to a target (”Y”). For example it is possible to add an edge outgoing from an element node to an annotation, constraint, st or ct. A ”black cross” in the figure defines a possible edge.

     Figure 5: Allowed Edges of EMX Nodes (matrix of the source nodes X – element, attribute-group, group, ct, st, annotation, constraint, module, schema – against the target nodes Y; a ”black cross” marks every allowed edge(X,Y), ”yellow arrows” mark nodes that are only implicitly given in a visualization)

   If an EMX is visualized then not all EMX nodes are explicitly given, e.g. the type definitions of the abstract data model (i.e. EMX nodes st, ct; see table 2). In this case the corresponding ”black cross” has to be moved along the given ”yellow arrow”, i.e. an edge in an EMX between a ct (source) and an attribute-group (target) is valid. If this EMX is visualized, then the attribute-group is shown as a child of the group which belongs to the above mentioned ct. Some information is just ”implicitly given” in a visualization of an EMX (e.g. simple types). A ”yellow arrow” which starts and ends in the same field is a hint for a union of different nodes into one node, e.g. if a group contains a wildcard then in the visualization only the group node is visible (extended with the ”blue W”; see table 2).
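   As a small illustration of such a check (not taken from the paper, and listing only the source/target pairs that are explicitly named in the text), an allowed-edge test could be sketched as follows:

    # Partial sketch of an allowed-edge check over EMX node kinds; the complete
    # relation is the matrix of figure 5, only pairs mentioned in the text are listed.
    ALLOWED_EDGES = {
        ("element", "annotation"),
        ("element", "constraint"),
        ("element", "st"),
        ("element", "ct"),
        ("ct", "attribute-group"),
    }

    def edge_allowed(source_kind, target_kind):
        """True if an edge(X, Y) between the two EMX node kinds is permitted."""
        return (source_kind, target_kind) in ALLOWED_EDGES

    assert edge_allowed("element", "annotation")
    assert not edge_allowed("annotation", "element")  # an element is never a child of an annotation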
4. EXAMPLE
   In section 3 the conceptual model EMX was introduced. In the following section an example is given. Figure 6 illustrates an XML Schema in the Garden of Eden modeling style. An event is specified, which contains a place (”ort”) and an id (”event-id”). Furthermore the integration of other
attributes is possible, because of an attribute wildcard in the respective complex type. The place is a sequence of a name and a date (”datum”).

     Figure 6: XML Schema in Garden of Eden Style

   All type definitions (NCNAMEs: ”orttype”, ”eventtype”) and declarations (NCNAMEs: ”event”, ”name”, ”datum”, ”ort” and the attribute ”event-id”) are globally specified. The target namespace prefix is ”eve”, so the QNAME of e.g. the complex type definition ”orttype” is ”eve:orttype”. By using the QNAME every above mentioned definition and declaration can be referenced, so the re-usability of all components is given. Furthermore an attribute wildcard is also specified, i.e. the complex type ”eventtype” contains, apart from the content model sequence and the attribute reference ”eve:event-id”, the element information item <anyAttribute>.
   Figure 7 is the corresponding EMX of the above specified XML Schema. The representation is an obvious simplification, it just contains eight well arranged EMX nodes. These are the elements ”event”, ”ort”, ”name” and ”datum”, an annotation as a child of ”event”, the groups as a child under ”event” and ”ort”, as well as an attribute-group with wildcard. The simple types of the element references ”name” and ”datum” are implicitly given and not visualized. The complex types can be derived by identifying the elements which have no specified simple type but groups as a child (i.e. ”event” and ”ort”).

     Figure 7: EMX to XSD of Figure 6

   The edges are, under consideration of figure 5, pairs of not visualized, internally defined EIDs. The source is the side of the connection without ”black rectangle”, the target is the other side. For example the given annotation is a child of the element ”event” and not the other way round; an element can never be a child of an annotation, neither in the XML Schema specification nor in the EMX.
   The logical structure of the EMX of figure 7 is illustrated in figure 8. The relations of the EMX nodes are given as well as the attributes and corresponding values relevant for the example.

     Figure 8: Logical Structure of Figure 7
       Schema(EID=1, xmlns_xs=http://www.w3.org/2001/XMLSchema, targetName=gvd2013.xsd, TNPrefix=eve)
       Element(EID, name, type_EID, parent_EID): (2, event, 14, 1), (3, name, 11, 1), (4, datum, 12, 1), (5, ort, 13, 1)
       Annotation(EID=10, parent_EID=2, x_Pos=50, y_Pos=100)
       Wildcard(EID=17, parent_EID=14)
       Element_Ref(EID, ref_EID, minOccurs, maxOccurs, parent_EID, x_Pos, y_Pos): (6, 2, 1, 1, 1, 75, 75) [event], (7, 3, 1, 1, 16, 60, 175) [name], (8, 4, 1, 1, 16, 150, 175) [datum], (9, 5, 1, 1, 15, 100, 125) [ort]
       ST(EID, name, mode, parent_EID): (11, xs:string, built-in, 1), (12, xs:date, built-in, 1)
       CT(EID, name, parent_EID): (13, orttype, 1), (14, eventtype, 1)
       Group(EID, mode, parent_EID, x_Pos, y_Pos): (15, sequence, 14, 125, 100) [eventsequence], (16, sequence, 13, 100, 150) [ortsequence]
       Attribute(EID=18, name=event-id, parent_EID=1)
       Attribute_Ref(EID=19, ref_EID=18, parent_EID=14)
       Attribute_Gr(EID=20, parent_EID=1)
       Attribute_Gr_Ref(EID=21, ref_EID=20, parent_EID=14, x_Pos=185, y_Pos=125)

   Next to every tuple of the relations ”Element Ref” and ”Group” small hints are added indicating which tuples are defined (to increase the readability). It is obvious that an EID has to be unique; this is a prerequisite for the logical structure. An EID is created automatically, a user of the EMX can neither influence nor manipulate it.
   The element references contain information about the occurrence (”minOccurs”, ”maxOccurs”), which are not explicitly given in the XSD of figure 6. The XML Schema specification defines default values in such cases. If an element reference does not specify the occurrence values then the standard value ”1” is used; an element reference is obligatory. These default values are also added automatically.
   The stored names of element declarations are NCNAMEs, but by knowing the target namespace of the corresponding schema (i.e. ”eve”) the QNAME can be derived. The name of a type definition is also the NCNAME, but if e.g. a built-in type is specified then the name is the QNAME of the XML Schema specification (”xs:string”, ”xs:date”).
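   The two conventions just described – the default occurrence values and the derivation of a QNAME from a stored NCNAME – can be summarized in a few lines; this sketch only illustrates the rules and is not part of CodeX:

    def apply_occurrence_defaults(min_occurs=None, max_occurs=None):
        # An element reference without explicit values is obligatory: both default to "1".
        return (min_occurs or "1", max_occurs or "1")

    def qname(ncname, prefix="eve"):
        # Built-in types are already stored as QNAMEs of the specification (e.g. "xs:string").
        return ncname if ":" in ncname else prefix + ":" + ncname

    print(apply_occurrence_defaults())    # ('1', '1')
    print(qname("orttype"))               # eve:orttype
    print(qname("xs:string"))             # xs:string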
5. PRACTICAL USE OF EMX
   The co-evolution of XML documents was already mentioned in section 1. At the University of Rostock a research prototype for dealing with this co-evolution was developed: CodeX (Conceptual design and evolution for XML Schema)
[5]. The idea behind it is simple and straightforward at the same time: take an XML Schema, transform it to the specifically developed conceptual model (EMX - Entity Model for XML-Schema), change the simplified conceptual model instead of dealing with the whole complexity of XML Schema, collect this change information (i.e. the user interaction with EMX) and use it to automatically create transformation steps for adapting the XML documents (by using XSLT - Extensible Stylesheet Language Transformations [1]). The mapping between EMX and XSD is unique, so it is possible to describe modifications not only on the EMX but also on the XSD. The transformation and logging language ELaX (Evolution Language for XML-Schema [8]) is used to unify the internally collected information as well as to introduce an interface for dealing directly with XML Schema. Figure 9 illustrates the component model of CodeX, firstly published in [7] but now extended with the ELaX interface.

     Figure 9: Component Model of CodeX [5] (components: a GUI with schema modifications via ELaX and data supply; import and export of XSD, configuration, XML and XSLT files; visualization; evolution engine; model mapping; specification of operations; configuration; XML documents; a knowledge base with model data, evolution specific data and the log; transformation with update notes and evolution results)

   The component model illustrates the different parts for dealing with the co-evolution. The main parts are an import and export component for collecting and providing data of e.g. a user (XML Schemas, configuration files, XML document collections, XSLT files), a knowledge base for storing information (model data, evolution specific data and co-evolution results) and especially the logged ELaX statements (”Log”). The mapping information between XSD and EMX of table 2 is specified in the ”Model data” component.
   Furthermore the CodeX prototype also provides a graphical user interface (”GUI”), a visualization component for the conceptual model and an evolution engine, in which the transformations are derived. The visualization component realizes the visualization of an EMX introduced in table 2. The ELaX interface for modifying imported XML Schemas communicates directly with the evolution engine.
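   The last step of this workflow, applying the derived transformations to the document collection, amounts to an ordinary XSLT run. A generic sketch (not CodeX itself; the file names are placeholders) using lxml could look like this:

    from lxml import etree

    transform = etree.XSLT(etree.parse("adaptation.xslt"))  # stylesheet derived by the evolution engine
    document  = etree.parse("instance.xml")                 # document valid against the old XSD
    adapted   = transform(document)                         # instance adapted to the evolved schema
    print(str(adapted))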
6. CONCLUSION
   Valid XML documents need e.g. an XML Schema, which restricts the possibilities and usage of declarations, definitions and structures in general. In a heterogeneous, changing environment (e.g. an information exchange scenario), also ”old” and long-used XML Schemas have to be modified to meet new requirements and to be up-to-date.
   EMX (Entity Model for XML-Schema) as a conceptual model is a simplified representation of an XSD, which hides its complexity and offers a graphical presentation. A unique mapping exists between every XSD modeled in the Garden of Eden style and an EMX, so it is possible to representatively adapt or modify the conceptual model instead of the XML Schema.
   This article presents the formal definition of an EMX; all in all there are different nodes, which are connected by directed edges. Thereby the abstract data model and element information items of the XML Schema specification were considered, and the allowed edges are specified according to the specification. In general the most important components of an XSD are represented in an EMX, e.g. elements, attributes, simple types, complex types, annotations, constraints, model groups and group definitions. Furthermore the logical structure is presented, which defines not only the underlying storing relations but also the relationships between them. The visualization of an EMX is also defined: outgoing from 18 relations in the logical structure, there are eight EMX nodes in the conceptual model, of which six are visualized.
   Our conceptual model is an essential prerequisite for the prototype CodeX (Conceptual design and evolution for XML Schema) as well as for the above mentioned co-evolution. A remaining step is the finalization of the implementation in CodeX. After this work an evaluation of the usability of the conceptual model is planned. Nevertheless we are confident that the usage is straightforward and that the simplification of EMX in comparison to dealing with the whole complexity of an XML Schema itself is huge.

7. REFERENCES
[1] XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 26-March-2013.
[2] Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/2008/REC-xml-20081126/, November 2008. Online; accessed 26-March-2013.
[3] XQuery 1.0 and XPath 2.0 Data Model (XDM) (Second Edition). http://www.w3.org/TR/2010/REC-xpath-datamodel-20101214/, December 2010. Online; accessed 26-March-2013.
[4] W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/, April 2012. Online; accessed 26-March-2013.
[5] M. Klettke. Conceptual XML Schema Evolution - the CoDEX Approach for Design and Redesign. In BTW Workshops, pages 53–63, 2007.
[6] E. Maler. Schema design rules for UBL...and maybe for you. In XML 2002 Proceedings by deepX, 2002.
[7] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. In Grundlagen von Datenbanken, pages 29–34, 2012.
[8] T. Nösinger, M. Klettke, and A. Heuer. Automatisierte Modelladaptionen durch Evolution - (R)ELaX in the Garden of Eden. Technical Report CS-01-13, Institut für Informatik, Universität Rostock, Rostock, Germany, Jan. 2013. ISSN 0944-5900.
[9] E. van der Vlist. XML Schema. O'Reilly & Associates, Inc., 2002.
    Semantic Enrichment of Ontology Mappings: Detecting
       Relation Types and Complex Correspondences

                                                          Patrick Arnold
                                                          Universität Leipzig
                                              arnold@informatik.uni-leipzig.de


25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Copyright is held by the author/owner(s).

ABSTRACT
While there are numerous tools for ontology matching, most approaches provide only little information about the true nature of the correspondences they discover, restricting themselves to the mere links between matching concepts. However, many disciplines, such as ontology merging, ontology evolution or data transformation, require more-detailed information, such as the concrete relation type of matches or information about the cardinality of a correspondence (one-to-one or one-to-many). In this study we present a new approach where we denote additional semantic information to an initial ontology mapping carried out by a state-of-the-art matching tool. The enriched mapping contains the relation type (like equal, is-a, part-of) of the correspondences as well as complex correspondences. We present different linguistic, structural and background knowledge strategies that allow semi-automatic mapping enrichment, and according to our first internal tests we are already able to add valuable semantic information to an existing ontology mapping.

Keywords
ontology matching, relation type detection, complex correspondences, semantic enrichment

1. INTRODUCTION
   Ontology matching plays a key role in data integration and ontology management. With the ontologies getting increasingly larger and more complex, as in the medical or biological domain, efficient matching tools are an important prerequisite for ontology matching, merging and evolution. There are already various approaches and tools for ontology matching, which exploit many different techniques like lexicographic, linguistic or structural methods in order to identify the corresponding concepts between two ontologies [16], [2]. The determined correspondences build a so-called alignment or ontology mapping, with each correspondence being a triple (s, t, c), where s is a concept in the source ontology, t a concept in the target ontology and c the confidence (similarity).
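   Purely for illustration (this record type is ours, not taken from any of the cited tools), such a correspondence, together with the relation type that the enrichment later annotates, can be written as:

    from dataclasses import dataclass

    @dataclass
    class Correspondence:
        source: str              # concept s in the source ontology
        target: str              # concept t in the target ontology
        confidence: float        # similarity value c
        relation: str = "equal"  # default type of the initial match result

    print(Correspondence("zip code", "postal code", 0.95))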
   These tools are able to highly reduce the effort of manual ontology mapping, but most approaches solely focus on detecting the matching pairs between two ontologies, without giving any specific information about the true nature of these matches. Thus, a correspondence is commonly regarded as an equivalence relation, which is correct for a correspondence like (zip code, postal code), but incorrect for correspondences like (car, vehicle) or (tree trunk, tree), where is-a resp. part-of would be the correct relation type. This restriction is an obvious shortcoming, because in many cases a mapping should also include further kinds of correspondences, such as is-a, part-of or related. Adding this information to a mapping is generally beneficial and has been shown to considerably improve ontology merging [13]. It provides more precise mappings and is also a crucial aspect in related areas, such as data transformation, entity resolution and linked data.
   An example is given in Fig. 1, which depicts the basic idea of our approach. While we get a simple alignment as input, with the mere links between concepts (above picture), we return an enriched alignment with the relation type annotated to each correspondence (lower picture). As we will point out in the course of this study, we use different linguistic methods and background knowledge in order to find the relevant relation type. Besides this, we have to distinguish between simple concepts (as ”Office Software”) and complex concepts, which contain itemizations like ”Monitors and Displays”, and which need a special treatment for relation type detection.
   Another issue of present ontology matchers is their restriction to (1:1)-correspondences, where exactly one source concept matches exactly one target concept. However, this can occasionally lead to inaccurate mappings, because there may occur complex correspondences where more than one source element corresponds to a target element or vice versa, as the two concepts first name and last name correspond to a concept name, leading to a (2:1)-correspondence. We will show in section 5 that distinguishing between one-to-one and one-to-many correspondences plays an important role in data transformation, and that we can exploit the results from the relation type detection to discover such complex matches in a set of (1:1)-matches to add further knowledge to a mapping.
   In this study we present different strategies to assign the relation types to an existing mapping and demonstrate how
complex correspondences can be discovered. Our approach, which we refer to as Enrichment Engine, takes an ontology mapping generated by a state-of-the-art matching tool as input and returns a more-expressive mapping with the relation type added to each correspondence and complex correspondences revealed. According to our first internal tests, we recognized that even simple strategies already add valuable information to an initial mapping and may be a notable gain for current ontology matching tools.

     Figure 1: Input (above) and output (below) of the Enrichment Engine

   Our paper is structured as follows: We discuss related work in section 2 and present the architecture and basic procedure of our approach in section 3. In section 4 we present different strategies to determine the relation types in a mapping, while we discuss the problem of complex correspondence detection in section 5. We finally conclude in section 6.

2. RELATED WORK
   Only a few tools and studies regard different kinds of correspondences or relationships for ontology matching. S-Match [6][7] is one of the first such tools for ”semantic ontology matching”. They distinguish between equivalence, subset (is-a), overlap and mismatch correspondences and try to provide a relationship for any pair of concepts of two ontologies by utilizing standard match techniques and background knowledge from WordNet. Unfortunately, the result mappings tend to become very voluminous with many correspondences per concept, while users are normally interested only in the most relevant ones.
   Taxomap [11] is an alignment tool developed for the geographic domain. It regards the correspondence types equivalence, less/more-general (is-a / inverse is-a) and is-close (”related”) and exploits linguistic techniques and background sources such as WordNet. The linguistic strategies seem rather simple; if a term appears as a part in another term, a more-general relation is assumed, which is not always the case. For example, in Figure 1 the mentioned rule holds for the correspondence between Games and Action Games, but not between Monitors and Monitors and Displays. In [14], the authors evaluated Taxomap for a mapping scenario with 162 correspondences and achieved a recall of 23 % and a precision of 89 %.
   The LogMap tool [9] distinguishes between equivalence and so-called weak (subsumption / is-a) correspondences. It is based on Horn Logic, where first lexicographic and structural knowledge from the ontologies is accumulated to build an initial mapping and subsequently an iterative process is carried out to first enhance the mapping and then to verify the enhancement. This tool is the least precise one with regard to relation type detection, and in evaluations the relation types were not further regarded.
   Several further studies deal with the identification of semantic correspondence types without providing a complete tool or framework. An approach utilizing current search engines is introduced in [10]. For two concepts A, B they generate different search queries like ”A, such as B” or ”A, which is a B” and submit them to a search engine (e.g., Google). They then analyze the snippets of the search engine results, if any, to verify or reject the tested relationship. The approach in [15] uses the Swoogle search engine to detect correspondences and relationship types between concepts of many crawled ontologies. The approach supports equal, subset or mismatch relationships. [17] exploits reasoning and machine learning to determine the relation type of a correspondence, where several structural patterns between ontologies are used as training data.
   Unlike relation type determination, the complex correspondence detection problem has hardly been discussed so far. It was once addressed in [5], coming to the conclusion that there is hardly any approach for complex correspondence detection because of the vast amount of required comparisons in contrast to (1:1)-matching, as well as the many possible operators needed for the mapping function. One key observation for efficient complex correspondence detection has been the need of large amounts of domain knowledge, but until today there is no available tool being able to semi-automatically detect complex matches.
   One remarkable approach is iMAP [4], where complex matches between two schemas could be discovered and even several transformation functions calculated, as RoomPrice = RoomPrice ∗ (1 + TaxPrice). For this, iMAP first calculates (1:1)-matches and then runs an iterative process to gradually combine them to more-complex correspondences. To justify complex correspondences, instance data is analyzed and several heuristics are used. In [8] complex correspondences were also regarded for matching web query interfaces, mainly exploiting co-occurrences. However, in order to derive common co-occurrences, the approach requires a large amount of schemas as input, and thus does not appear appropriate for matching two or few schemas.
   While the approaches presented in this section try to achieve both matching and semantic annotation in one step, thus often tending to neglect the latter part, we will demonstrate a two-step architecture in which we first perform a
schema mapping and then concentrate straight on the enrichment of the mapping (semantic part). Additionally, we want to analyze several linguistic features to provide more qualitative mappings than obtained by the existing tools, and finally develop an independent system that is not restricted to schema and ontology matching, but will be differently exploitable in the wide field of data integration and data analysis.

3. ARCHITECTURE
   As illustrated in Fig. 2 our approach uses a 2-step architecture in which we first calculate an ontology mapping (match result) using our state-of-the-art matching tool COMA 3.0 (step 1) [12] and then perform an enrichment on this mapping (step 2).
   Our 2-step approach for semantic ontology matching offers different advantages. First of all, we reduce complexity compared to 1-step approaches that try to directly determine the correspondence type when comparing concepts in O1 with concepts in O2. For large ontologies, such a direct matching is already time-consuming and error-prone for standard matching. The proposed approaches for semantic matching are even more complex and could not yet demonstrate their general effectiveness.
   Secondly, our approach is generic as it can be used for different domains and in combination with different matching tools for the first step. We can even re-use the tool in different fields, such as entity resolution or text mining. On the other hand, this can also be a disadvantage, since the enrichment step depends on the completeness and quality of the initially determined match result. Therefore, it is important to use powerful tools for the initial matching and possibly to fine-tune their configuration.

     Figure 2: Basic Workflow for Mapping Enrichment

   The basics of the relation type detection, on which we focus in this study, can be seen in the right part of Fig. 2. We provide 4 strategies so far (Compound, Background Knowledge, Itemization, Structure), where each strategy returns the relation type of a given correspondence, or ”undecided” in case no specific type can be determined. In the Enrichment step we thus iterate through each correspondence in the mapping and pass it to each strategy. We eventually annotate the type that was most frequently returned by the strategies (type computation). In this study, we regard 4 distinct relation types: equal, is-a and inv. is-a (composition), part-of and has-a (aggregation), as well as related.
   There are two problems we may encounter when computing the correspondence type. First, all strategies may return ”undecided”. In this case we assign the relation type ”equal”, because it is the default type in the initial match result and possibly the most likely one to hold. Secondly, there might be different outcomes from the strategies, e.g., one returns is-a, one equal and the others undecided. There are different ways to solve this problem, e.g., by prioritizing strategies or relation types. However, we hardly discovered such cases so far, so we currently return ”undecided” and request the user to manually specify the correct type.
   At present, our approach is already able to fully assign relation types to an input mapping using the 4 strategies, which we will describe in detail in the next section. We have not implemented strategies to create complex matches from the match result, but will address a couple of conceivable techniques in section 5.
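   A minimal sketch of this type computation, as we read the description above (a majority vote with the two fallbacks; not the actual implementation), could look as follows:

    from collections import Counter

    def compute_type(correspondence, strategies):
        """strategies: callables that return a relation type or 'undecided'."""
        votes = [strategy(correspondence) for strategy in strategies]
        counts = Counter(v for v in votes if v != "undecided")
        if not counts:
            return "equal"       # all strategies undecided: keep the default type
        ranked = counts.most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            return "undecided"   # conflicting outcomes: let the user decide
        return ranked[0][0]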
4. IMPLEMENTED STRATEGIES
   We have implemented 4 strategies to determine the type of a given correspondence. Table 1 gives an overview of the strategies and the relation types they are able to detect. It can be seen that the Background Knowledge approach is especially valuable, as it can help to detect all relationship types. Besides, all strategies are able to identify is-a correspondences.

     Strategy          equal   is-a   part-of   related
     Compounding               X
     Background K.       X     X        X          X
     Itemization         X     X
     Structure                 X        X

     Table 1: Supported correspondence types by the strategies

   In the following let O1, O2 be two ontologies with c1, c2 being two concepts from O1 resp. O2. Further, let C = (c1, c2) be a correspondence between two concepts (we do not regard the confidence value in this study).

4.1 Compound Strategy
   In linguistics, a compound is a special word W that consists of a head WH carrying the basic meaning of W, and a modifier WM that specifies WH [3]. In many cases, a compound thus expresses something more specific than its head, and is therefore a perfect candidate to discover an is-a relationship. For instance, a blackboard is a board or an apple tree is a tree. Such compounds are called endocentric compounds, while exocentric compounds are not related with their head, such as buttercup, which is not a cup, or saw tooth, which is not a tooth. These compounds have a figurative meaning (metaphors) or changed their spelling as the language evolved, and thus do not hold the is-a relation, or only to a very limited extent (like airport, which is a port only in a broad sense). There is a third form of compounds, called appositional or copulative compounds, where the two words are at the same level, and the relation is rather more-general (inverse is-a) than more-specific, as in Bosnia-Herzegowina, which means both Bosnia and Herzegowina, or bitter-sweet, which means both bitter and sweet (not necessarily a ”specific bitter” or a ”specific sweet”). However, this type is quite rare.
   In the following, let A, B be the literals of two concepts of a correspondence. The Compound Strategy analyzes whether B ends with A. If so, it seems likely that B
is a compound with head A, so that the relationship B is-a A (or A inv. is-a B) is likely to hold. The Compound approach allows us to identify the three is-a correspondences shown in Figure 1 (below).
   We added an additional rule to this simple approach: B is only considered a compound of A if length(B) − length(A) ≥ 3, where length(X) is the length of a string X. Thus, we expect the supposed compound to be at least 3 characters longer than the head it matches. This way, we are able to eliminate obviously wrong compound conclusions, like stable is a table, which we call pseudo compounds. The value of 3 is motivated by the observation that typical nouns or adjectives consist of at least 3 letters.
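   The rule just described fits in a few lines; the following sketch is our illustration of it (the lower-casing and trimming are assumptions, not stated in the paper):

    def compound_relation(a, b):
        """Return 'is-a' if b looks like an endocentric compound with head a."""
        a, b = a.strip().lower(), b.strip().lower()
        if b.endswith(a) and len(b) - len(a) >= 3:
            return "is-a"
        return "undecided"

    print(compound_relation("Games", "Action Games"))   # is-a
    print(compound_relation("table", "stable"))         # undecided (pseudo compound filtered out)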
4.2 Background Knowledge
   Background knowledge is commonly of great help in ontology matching to detect more difficult correspondences, especially in special domains. In our approach, we intend to use it for relation type detection. So far, we use WordNet 3.0 to determine the relation that holds between two words (resp. two concepts). WordNet is a powerful dictionary and thesaurus that contains synonym relations (equivalence), hypernym relations (is-a) and holonym relations (part-of) between words [22]. Using the Java API for WordNet Search (JAWS), we built an interface that allows us to answer questions like ”Is X a synonym to Y?”, or ”Is X a direct hypernym of Y?”. The interface is also able to detect cohyponyms, which are two words X, Y that have a common direct hypernym Z. We call a correspondence between two cohyponyms X and Y related, because both concepts are connected to the same father element. For example, the relation between apple tree and pear tree is related, because of the common father concept tree.
   Although WordNet has a limited vocabulary, especially with regard to specific domains, it is a valuable source to detect the relation type that holds between concepts. It allows an excellent precision, because the links in WordNet are manually defined, and it contains all relation types we intend to detect, which the other strategies are not able to achieve.
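   The interface described above is built on JAWS (Java); purely as an illustration, a comparable lookup can be sketched with NLTK's WordNet corpus – this substitution is ours and only mirrors the kinds of questions listed above:

    # Requires: pip install nltk   and   nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def wordnet_relation(x, y):
        for a in wn.synsets(x):
            for b in wn.synsets(y):
                if a == b:
                    return "equal"                          # shared synset: synonyms
                if b in a.hypernyms():
                    return "is-a"                           # y is a direct hypernym of x
                if b in a.part_holonyms():
                    return "part-of"                        # y is a holonym of x
                if set(a.hypernyms()) & set(b.hypernyms()):
                    return "related"                        # cohyponyms under a common hypernym
        return "undecided"

    print(wordnet_relation("car", "automobile"))            # equal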
4.3 Itemization
   In several taxonomies we recognized that itemizations appear very often and cannot be processed with the previously presented strategies. Consider the correspondence (”books and newspapers”, ”newspapers”). The compound strategy would be misled and consider the source concept a compound, resulting in the type ”is-a”, although the opposite is the case (inv. is-a). WordNet would not know the word ”books and newspapers” and return ”undecided”.
   Itemizations thus deserve special treatment. We first split each itemization into its atomic items, where we define an item as a string that does not contain commas, slashes or the words ”and” and ”or”.
   We now show how our approach determines the correspondence types between two concepts C1, C2 where at least one of the two concepts is an itemization with more than one item. Let I1 be the item set of C1 and I2 the item set of C2. Let w1, w2 be two words, with w1 ≠ w2. Our approach works as follows:

   1. In each set I remove each w1 ∈ I which is a hyponym of w2 ∈ I.
   2. In each set I, replace a synonym pair (w1 ∈ I, w2 ∈ I) by w1.
   3. Remove each w1 ∈ I1, w2 ∈ I2 if there is a synonym pair (w1, w2).
   4. Remove each w2 ∈ I2 which is a hyponym of w1 ∈ I1.
   5. Determine the relation type:
      (a) If I1 = ∅, I2 = ∅: equal
      (b) If I1 = ∅, |I2| ≥ 1: is-a
          If I2 = ∅, |I1| ≥ 1: inverse is-a
      (c) If |I1| ≥ 1, |I2| ≥ 1: undecided

The rationale behind this algorithm is that we remove items from the item sets as long as no information gets lost. Then we compare what is left in the two sets and come to the conclusions presented in step 5.
   Let us consider the concept pair C1 = ”books, ebooks, movies, films, cds” and C2 = ”novels, cds”. Our item sets are I1 = {books, ebooks, movies, films, cds}, I2 = {novels, cds}. First, we remove synonyms and hyponyms within each set, because this would cause no loss of information (steps 1+2). We remove films in I1 (because of the synonym movies) and ebooks in I1, because it is a hyponym of books. We have I1 = {books, movies, cds}, I2 = {novels, cds}. Now we remove synonym pairs between the two item sets, so we remove cds in either set (step 3). Lastly, we remove a hyponym in I2 if there is a hypernym in I1 (step 4). We remove novels in I2, because it is a book. We have I1 = {books, movies}, I2 = ∅. Since I1 still contains items, while I2 is empty, we conclude that I1 specifies something more general, i.e., it holds C1 inverse is-a C2.
   If neither item set is empty, we return ”undecided” because we cannot derive an equal or is-a relationship in this case.
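   The following sketch illustrates the five steps (our code, not the evaluated system); the two predicates is_synonym and is_hyponym are placeholders which, in the paper, would be answered via the Background Knowledge strategy:

    import re

    def items(concept):
        """Split an itemization into atomic items (no commas, slashes, 'and', 'or')."""
        return {p.strip() for p in re.split(r",|/|\band\b|\bor\b", concept.lower()) if p.strip()}

    def itemization_relation(c1, c2, is_synonym, is_hyponym):
        i1, i2 = items(c1), items(c2)
        # Steps 1+2: inside each set, drop hyponyms and one word of every synonym pair.
        for s in (i1, i2):
            for w in sorted(s):
                if any(v != w and (is_hyponym(w, v) or is_synonym(w, v)) for v in s):
                    s.remove(w)
        # Step 3: remove synonym pairs across the two sets.
        for w1 in sorted(i1):
            for w2 in sorted(i2):
                if w1 in i1 and w2 in i2 and is_synonym(w1, w2):
                    i1.remove(w1)
                    i2.remove(w2)
        # Step 4: remove items of I2 that are hyponyms of an item of I1.
        for w2 in sorted(i2):
            if any(is_hyponym(w2, w1) for w1 in i1):
                i2.remove(w2)
        # Step 5: compare what is left.
        if not i1 and not i2:
            return "equal"
        if not i1:
            return "is-a"
        if not i2:
            return "inverse is-a"
        return "undecided"

    # The worked example from the text, with toy predicates:
    is_syn = lambda a, b: a == b or {a, b} == {"movies", "films"}
    is_hyp = lambda a, b: (a, b) in {("ebooks", "books"), ("novels", "books")}
    print(itemization_relation("books, ebooks, movies, films, cds", "novels, cds",
                               is_syn, is_hyp))   # -> inverse is-a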
4.4 Structure Strategy
   The structure strategy takes the structure of the ontologies into account. For a correspondence between concepts Y and Z we check whether we can derive a semantic relationship between a father concept X of Y and Z (or vice versa). For an is-a relationship between Y and X we draw the following conclusions:

   • X equiv Z → Y is-a Z
   • X is-a Z → Y is-a Z

For a part-of relationship between Y and X we can analogously derive:

   • X equiv Z → Y part-of Z
   • X part-of Z → Y part-of Z

The approach obviously utilizes the semantics of the intra-ontology relationships to determine the correspondence types for pairs of concepts for which the semantic relationship cannot directly be determined.
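   These four derivation rules can be encoded directly; the following sketch (our illustration) maps the known relationship of Y to its father X and of X to Z onto the type of the correspondence (Y, Z):

    def structure_relation(rel_y_to_father, rel_father_to_z):
        """Derive the type of (Y, Z) from Y's relation to its father X and X's relation to Z."""
        rules = {
            ("is-a", "equal"):      "is-a",
            ("is-a", "is-a"):       "is-a",
            ("part-of", "equal"):   "part-of",
            ("part-of", "part-of"): "part-of",
        }
        return rules.get((rel_y_to_father, rel_father_to_z), "undecided")

    print(structure_relation("is-a", "equal"))   # is-a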
4.5 Comparison
   We tested our strategies and overall system on 3 user-generated mappings in which each correspondence was tagged with its supposed type. After running the scenarios, we checked how many of the non-trivial relations were detected by the program. The 3 scenarios consisted of about 350
.. 750 correspondences. We had a German-language sce-
nario (product catalogs from online shops), a health scenario
(diseases) and a text annotation catalog scenario (everyday
speech).
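To make the structure strategy from Section 4.4 concrete, the following sketch applies its four derivation rules to already known relationships. The data structures and names are illustrative assumptions and are not taken from the actual Enrichment Engine.

# Minimal sketch of the structure strategy (Section 4.4): propagate a known
# relationship between a father concept X and a concept Z down to the child Y.
# The rule tables mirror the four derivation rules; everything else is illustrative.

IS_A_RULES = {            # Y is-a X, and X <rel> Z  =>  Y <rel'> Z
    "equiv": "is-a",
    "is-a": "is-a",
}
PART_OF_RULES = {         # Y part-of X, and X <rel> Z  =>  Y <rel'> Z
    "equiv": "part-of",
    "part-of": "part-of",
}

def derive_type(rel_y_to_x, rel_x_to_z):
    """Return the derived correspondence type between Y and Z, or None."""
    if rel_y_to_x == "is-a":
        return IS_A_RULES.get(rel_x_to_z)
    if rel_y_to_x == "part-of":
        return PART_OF_RULES.get(rel_x_to_z)
    return None

# Example: Y is-a X in the source ontology, and another strategy determined
# X equiv Z; the structure strategy then yields Y is-a Z.
print(derive_type("is-a", "equiv"))     # -> "is-a"
print(derive_type("part-of", "is-a"))   # -> None (undecided)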
   Compounding and Background Knowledge are two independent strategies that separately try to determine the relation type of a correspondence. In our tests we saw that Compounding offers good precision (72 .. 97 %), even though many exocentric compounds and pseudo-compounds exist. By contrast, we recognized only moderate recall, ranging from 12 to 43 %. Compounding can only determine is-a relations; however, it is the only strategy that works in every scenario.
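As a rough illustration of the compound strategy evaluated here (its full description precedes this section), a simple head-matching test could look as follows. The suffix check is an assumption made for illustration; it deliberately ignores exocentric compounds and pseudo-compounds, which is exactly what limits precision in practice.

def compound_is_a(source: str, target: str) -> bool:
    """Very rough head-matching test: 'apple tree' is-a 'tree'.

    Assumes the compound head is the last token, which does not hold for
    exocentric or pseudo-compounds."""
    src_tokens = source.lower().split()
    return len(src_tokens) > 1 and src_tokens[-1] == target.lower()

print(compound_is_a("apple tree", "tree"))                   # True: apple tree is-a tree
print(compound_is_a("books and newspapers", "newspapers"))   # True, but wrong: inverse is-a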
   Background Knowledge has a low or moderate recall (10 .. 50 %), depending on the scenario at hand. However, it offers excellent precision, very close to 100 %, and is the only strategy that is able to determine all relation types we regard. As a matter of fact, it did not work on our German-language scenario and worked only poorly in our health scenario.
   The Structure and Itemization strategies depend strongly on the given schemas and are thus very specific strategies to handle individual cases. They exploit the Compound and Background Knowledge strategies and are thus not independent. Still, they were able to boost the recall to some degree.
   We realized that the best result is gained by exploiting all strategies. Currently, we do not weight the strategies; however, we may do so in order to optimize our system. We finally achieved an overall recall between 46 and 65 % and a precision between 69 and 97 %.

5. COMPLEX CORRESPONDENCES
   Schema and ontology matching tools generally calculate (1:1)-correspondences, where exactly one source element matches exactly one target element. Naturally, either element may take part in different correspondences, as in (name, first name) and (name, last name); however, having these two separate correspondences is very imprecise, and the correct mapping would rather be the single correspondence ((first name, last name), (name)). These kinds of matches are called complex correspondences or one-to-many correspondences.
   The disambiguation between a complex correspondence and two (or more) one-to-one correspondences is an indispensable prerequisite for data transformation, where data from a source database is to be transformed into a target database, as we showed in [1]. Moreover, we showed that each complex correspondence needs a transformation function in order to map data correctly. If elements are of the type string, the transformation function is normally a concatenation in (n:1)-matches and a split in (1:n)-matches. If the elements are of a numerical type, as in the correspondence ((costs), ((operational costs), (material costs), (personnel costs))), a set of numerical operations is normally required.
   There are proprietary solutions that allow users to manually create transformation mappings including complex correspondences, such as Microsoft BizTalk Server [19], Altova MapForce [18] or Stylus Studio [20]; however, to the best of our knowledge there is no matching tool that is able to detect complex correspondences automatically. Besides relation type detection, we therefore intend to discover complex correspondences in the initial mapping, which is a second important step of mapping enrichment.
   We already developed simple methods that exploit the structure of the schemas to transform several (1:1)-correspondences into a complex correspondence, although these approaches will fail in more intricate scenarios. We used the structure of the schemas and the already existing (1:1)-matches to derive complex correspondences. Fig. 3 demonstrates this approach. There are two complex correspondences in the mapping, ((First Name, Last Name), (Name)) and ((Street, City, Zip Code, Country), Address), represented by simple (1:1)-correspondences. Our approach was able to detect both complex correspondences. The first one (name) was detected because first name and last name cannot be mapped to one element at the same time, since the name element can only store either of the two values. The second example (address) is detected since schema data is located in the leaf nodes, not in inner nodes. In database schemas we always expect data to reside in the leaf nodes, so that the match (Address, Address) is considered unreasonable.

Figure 3: Match result containing two complex correspondences (name and address)

   In the first case, our approach would apply the concatenation function, because two values have to be concatenated to match the target value, and in the second case the split function would be applied, because the Address values have to be split into the address components (street, city, zip code, country). The user needs to adjust these functions, e.g., in order to tell the program where in the address string the split operations have to be performed.
   This approach was mostly based on heuristics and would only work in simple cases. Now that we are able to determine the relation types of (1:1)-matches, we can enhance this original approach. If a node takes part in more than one composition relation (part-of / has-a), we can conclude that it is a complex correspondence and can derive it from the (1:1)-correspondences. For instance, if we have the 3 correspondences (day part-of date), (month part-of date), (year part-of date), we could create the complex correspondence ((day, month, year), date).
   We have not implemented this approach so far, and we assume that detecting complex correspondences and the correct transformation function will remain a very challenging issue, so that we intend to investigate additional methods, like using instance data, to achieve more effectiveness. However, by adding these techniques to our existing Enrichment Engine, we are able to present a first solution that semi-automatically determines complex correspondences, which is another step towards more precise ontology matching and an important precondition for data transformation.
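The enhancement sketched above can be illustrated as follows. The correspondence representation, the grouping criterion and the default choice of transformation function are simplifying assumptions for illustration, not the actual implementation.

from collections import defaultdict

# Each (1:1)-correspondence is a triple (source_element, relation_type, target_element).
correspondences = [
    ("day", "part-of", "date"),
    ("month", "part-of", "date"),
    ("year", "part-of", "date"),
    ("first name", "part-of", "name"),
    ("last name", "part-of", "name"),
]

def derive_complex(corrs):
    """Group correspondences whose target takes part in several composition
    relations (part-of / has-a) into one (n:1) complex correspondence."""
    by_target = defaultdict(list)
    for src, rel, tgt in corrs:
        if rel in ("part-of", "has-a"):
            by_target[tgt].append(src)
    complex_corrs = []
    for tgt, sources in by_target.items():
        if len(sources) > 1:
            # (n:1) string targets would typically get a concatenation function;
            # the inverse (1:n) direction would get a split function instead.
            complex_corrs.append((tuple(sources), tgt, "concatenation"))
    return complex_corrs

for sources, target, func in derive_complex(correspondences):
    print(f"({', '.join(sources)}) -> {target}   [default function: {func}]")
# yields ((day, month, year), date) and ((first name, last name), name)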
6. OUTLOOK AND CONCLUSION
   We presented a new approach to semantically enrich ontology mappings by determining the concrete relation type of a correspondence and detecting complex correspondences. For this, we developed a 2-step architecture in which the actual ontology matching and the semantic enrichment are strictly separated. This makes the Enrichment Engine highly generic, so that it is not designed for any specific ontology matching tool and, moreover, can be used independently in various fields different from ontology matching, such as data transformation, entity resolution and text mining.
   In our approach we developed new linguistic strategies to determine the relation type, and with regard to our first internal tests even the rather simple strategies already added much useful information to the input mapping. We also discovered that some strategies (Compounding, and to a lesser degree Itemization and Structure) are rather independent of the language of the ontologies, so that our approach provided remarkable results for both German- and English-language ontologies.
   One important obstacle is the strong dependency on the initial mapping. We recognized that matching tools tend to discover equivalence relations, so that different non-equivalence correspondences are not contained in the initial mapping and thus cannot be detected. It is future work to adjust our tool COMA 3.0 to provide a more convenient input, e.g., by using relaxed configurations. A particular issue we are going to investigate is the use of instance data connected with the concepts to derive the correct relation type if the other strategies (which operate on the meta level) fail. This will also raise time-complexity problems, which we will have to consider in our ongoing research.
   Our approach is still at a rather early stage, and there is still much room for improvement, since the implemented strategies have different restrictions so far. For this reason, we will extend and fine-tune our tool in order to increase effectiveness and precision. Among other aspects, we intend to improve the structure strategy by considering the entire concept path rather than the mere father concept, to add further background knowledge to the system, especially in specific domains, and to investigate further linguistic strategies, for instance, in which way compounds also indicate the part-of relation. Besides relation type detection, we will also concentrate on complex correspondence detection in data transformation to provide further semantic information to ontology mappings.

7. ACKNOWLEDGMENT
   This study was partly funded by the European Commission through Project "LinkedDesign" (No. 284613 FoF-ICT-2011.7.4).

8. REFERENCES
[1] Arnold, P.: The Basics of Complex Correspondences and Functions and their Implementation and Semi-automatic Detection in COMA++. Master's thesis, University of Leipzig, 2011.
[2] Bellahsene, Z., Bonifati, A., Rahm, E. (eds.): Schema Matching and Mapping. Springer (2011)
[3] Bisetto, A., Scalise, S.: Classification of Compounds. University of Bologna, 2009. In: The Oxford Handbook of Compounding, Oxford University Press, pp. 49–82
[4] Dhamankar, R., Lee, Y., Doan, A., Halevy, A., Domingos, P.: iMAP: Discovering Complex Semantic Matches between Database Schemas. In: SIGMOD '04, pp. 383–394
[5] Doan, A., Halevy, A. Y.: Semantic Integration Research in the Database Community: A Brief Survey. In: AI Mag. (2005), pp. 83–94
[6] Giunchiglia, F., Shvaiko, P., Yatskevich, M.: S-Match: An Algorithm and an Implementation of Semantic Matching. In: Proceedings of the European Semantic Web Symposium (2004), LNCS 3053, pp. 61–75
[7] Giunchiglia, F., Autayeu, A., Pane, J.: S-Match: an open source framework for matching lightweight ontologies. In: Semantic Web, vol. 3-3 (2012), pp. 307–317
[8] He, B., Chen-Chuan Chang, K., Han, J.: Discovering complex matchings across web query interfaces: A correlation mining approach. In: KDD '04, pp. 148–157
[9] Jiménez-Ruiz, E., Grau, B. C.: LogMap: Logic-Based and Scalable Ontology Matching. In: International Semantic Web Conference (2011), LNCS 7031, pp. 273–288
[10] van Hage, W. R., Katrenko, S., Schreiber, G.: A Method to Combine Linguistic Ontology-Mapping Techniques. In: International Semantic Web Conference (2005), LNCS 3729, pp. 732–744
[11] Hamdi, F., Safar, B., Niraula, N. B., Reynaud, C.: TaxoMap alignment and refinement modules: Results for OAEI 2010. In: Proceedings of the ISWC Workshop (2010), pp. 212–219
[12] Massmann, S., Raunich, S., Aumueller, D., Arnold, P., Rahm, E.: Evolution of the COMA Match System. In: Proc. Sixth Intern. Workshop on Ontology Matching (2011)
[13] Raunich, S., Rahm, E.: ATOM: Automatic Target-driven Ontology Merging. In: Proc. Int. Conf. on Data Engineering (2011)
[14] Reynaud, C., Safar, B.: Exploiting WordNet as Background Knowledge. In: Proc. Intern. ISWC'07 Ontology Matching (OM-07) Workshop
[15] Sabou, M., d'Aquin, M., Motta, E.: Using the semantic web as background knowledge for ontology mapping. In: Proc. 1st Intern. Workshop on Ontology Matching (2006)
[16] Shvaiko, P., Euzenat, J.: A Survey of Schema-based Matching Approaches. In: J. Data Semantics IV (2005), pp. 146–171
[17] Spiliopoulos, V., Vouros, G., Karkaletsis, V.: On the discovery of subsumption relations for the alignment of ontologies. In: Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010), pp. 69–88
[18] Altova MapForce - Graphical Data Mapping, Conversion, and Integration Tool. http://www.altova.com/mapforce.html
[19] Microsoft BizTalk Server. http://www.microsoft.com/biztalk
[20] XML Editor, XML Tools, and XQuery - Stylus Studio. http://www.stylusstudio.com/
[21] Java API for WordNet Searching (JAWS). http://lyle.smu.edu/~tspell/jaws/index.html
[22] WordNet - A lexical database for English. http://wordnet.princeton.edu/wordnet/
     Extraktion und Anreicherung von Merkmalshierarchien
      durch Analyse unstrukturierter Produktrezensionen

                                                           Robin Küppers
                                                      Institut für Informatik
                                                    Heinrich-Heine-Universität
                                                        Universitätsstr. 1
                                                  40225 Düsseldorf, Deutschland
                                             kueppers@cs.uni-duesseldorf.de

ABSTRACT                                                              tionelle Datenblätter oder Produktbeschreibungen möglich
Wir präsentieren einen Algorithmus zur Extraktion bzw.               wäre, da diese dazu tendieren, die Vorteile eines Produkts zu
Anreicherung von hierarchischen Produktmerkmalen mittels              beleuchten und die Nachteile zu verschweigen. Aus diesem
einer Analyse von unstrukturierten, kundengenerierten Pro-            Grund haben potentielle Kunden ein berechtigtes Interesse
duktrezensionen. Unser Algorithmus benötigt eine initiale            an der subjektiven Meinung anderer Käufer.
Merkmalshierarchie, die in einem rekursiven Verfahren mit             Zudem sind kundengenerierte Produktrezensionen auch für
neuen Untermerkmalen angereichert wird, wobei die natür-             Produzenten interessant, da sie wertvolle Informationen über
liche Ordnung der Merkmale beibehalten wird. Die Funk-                Qualität und Marktakzeptanz eines Produkts aus Kunden-
tionsweise unseres Algorithmus basiert auf häufigen, gram-           sicht enthalten. Diese Informationen können Produzenten
matikalischen Strukturen, die in Produktrezensionen oft be-           dabei helfen, die eigene Produktpalette zu optimieren und
nutzt werden, um Eigenschaften eines Produkts zu beschrei-            besser an Kundenbedürfnisse anzupassen.
ben. Diese Strukturen beschreiben Obermerkmale im Kon-                Mit wachsendem Umsatz der Web-Shops nimmt auch die
text ihrer Untermerkmale und werden von unserem Algo-                 Anzahl der Produktrezensionen stetig zu, so dass es für Kun-
rithmus ausgenutzt, um Merkmale hierarchisch zu ordnen.               den (und Produzenten) immer schwieriger wird, einen um-
                                                                      fassenden Überblick über ein Produkt / eine Produktgrup-
                                                                      pe zu behalten. Deshalb ist unser Ziel eine feingranulare
Kategorien                                                            Zusammenfassung von Produktrezensionen, die es erlaubt
H.2.8 [Database Management]: Database Applications—                   Produkte dynamisch anhand von Produktmerkmalen (pro-
data mining; I.2.7 [Artificial Intelligence]: Natural Lan-            duct features) zu bewerten und mit ähnlichen Produkten zu
guage Processing—text analysis                                        vergleichen. Auf diese Weise wird ein Kunde in die Lage
                                                                      versetzt ein Produkt im Kontext seines eigenen Bedürfnis-
Schlüsselwörter                                                       ses zu betrachten und zu bewerten: beispielsweise spielt das
                                                                      Gewicht einer Kamera keine große Rolle für einen Kunden,
Text Mining, Review Analysis, Product Feature                         aber es wird viel Wert auf eine hohe Bildqualität gelegt.
                                                                      Produzenten können ihre eigene Produktpalette im Kontext
1.   EINLEITUNG                                                       der Konkurrenz analysieren, um z. B. Mängel an den eige-
   Der Einkauf von Waren (z. B. Kameras) und Dienstleis-              nen Produkten zu identifizieren.
tungen (z. B. Hotels) über Web-Shops wie Amazon unter-               Das Ziel unserer Forschung ist ein Gesamtsystem zur Analy-
liegt seit Jahren einem stetigen Wachstum. Web-Shops ge-              se und Präsentation von Produktrezensionen in zusammen-
ben ihren Kunden (i. d. R.) die Möglichkeit die gekaufte Wa-         gefasster Form (vgl. [3]). Dieses System besteht aus mehre-
re in Form einer Rezension zu kommentieren und zu bewer-              ren Komponenten, die verschiedene Aufgaben übernehmen,
ten. Diese kundengenerierten Rezensionen enthalten wert-              wie z.B. die Extraktion von Meinungen und die Bestimmung
volle Informationen über das Produkt, die von potentiellen           der Tonalität bezüglich eines Produktmerkmals (siehe dazu
Kunden für ihre Kaufentscheidung herangezogen werden. Je             auch Abschnitt 2). Im Rahmen dieser Arbeit beschränken
positiver ein Produkt bewertet wird, desto wahrscheinlicher           wir uns auf einen wichtigen Teilaspekt dieses Systems: die
wird es von anderen Kunden gekauft.                                   Extraktion und Anreicherung von hierarchisch organisierten
Der Kunde kann sich so ausführlicher über die Vor- und              Produktmerkmalen.
Nachteile eines Produkts informieren, als dies über redak-           Der Rest dieser Arbeit ist wie folgt gegliedert: zunächst
                                                                      geben wir in Abschnitt 2 einen Überblick über verwandte
                                                                       Arbeiten, die auf unsere Forschung entscheidenden Einfluss
                                                                      hatten. Anschließend präsentieren wir in Abschnitt 3 einen
                                                                      Algorithmus zur Extraktion und zur Anreicherung von hier-
                                                                      archisch organisierten Produktmerkmalen. Eine Bewertung
                                                                      des Algorithmus wird in Abschnitt 4 vorgenommen, sowie
                                                                      einige Ergebnisse präsentiert, die die Effektivität unseres
                                                                      Algorithmus demonstrieren. Die gewonnenen Erkenntnisse
                                                                             werden in Abschnitt 5 diskutiert und zusammengefasst. Des
Weiteren geben wir einen Ausblick auf unsere zukünftige
Forschung.

2.     VERWANDTE ARBEITEN
   Dieser Abschnitt gibt einen kurzen Überblick über ver-
wandte Arbeiten, die einen Einfluss auf unsere Forschung
hatten. Die Analyse von Produktrezensionen basiert auf Al-          Abbildung 1: Beispielhafte Merkmalshierarchie ei-
gorithmen und Methoden aus verschiedensten Disziplinen.             ner Digitalkamera.
Zu den Wichtigsten zählen: Feature Extraction, Opinion Mi-
ning und Sentiment Analysis.
Ein typischer Algorithmus zur merkmalsbasierten Tonali-             Wir haben hauptsächlich Arbeiten vorgestellt, die Merk-
tätsanalyse von Produktrezensionen ist in 3 unterschiedliche       male und Meinungen aus Produktrezensionen extrahieren,
Phasen unterteilt (vgl. [3]):                                       aber Meinungsanalysen sind auch für andere Domänen inter-
                                                                    essant: z. B. verwenden die Autoren von [7] einen von Exper-
     1. Extraktion von Produktmerkmalen.                            ten annotierten Korpus mit Nachrichten, um mit Techniken
                                                                    des maschinellen Lernens einen Klassifikator zu trainieren,
     2. Extraktion von Meinungen über Produktmerkmale.             der zwischen Aussagen (Meinungen) und Nicht-Aussagen
     3. Tonalitätsanalyse der Meinungen.                           unterscheidet. Solche Ansätze sind nicht auf die Extrakti-
                                                                    on von Produktmerkmalen angewiesen.
Man unterscheidet zwischen impliziten und expliziten Merk-
malen[3]: explizite Merkmale werden direkt im Text genannt,
implizite Merkmale müssen aus dem Kontext erschlossen
                                                                    3. ANREICHERUNG VON MERKMALS-
werden. Wir beschränken uns im Rahmen dieser Arbeit auf
die Extraktion expliziter Merkmale.                                    HIERARCHIEN
Die Autoren von [3] extrahieren häufig auftretende, explizi-           Dieser Abschnitt dient der Beschreibung eines neuen Al-
te Merkmale mit dem a-priori Algorithmus. Mit Hilfe dieser          gorithmus zur Anreicherung einer gegebenen, unvollständi-
Produktmerkmale werden Meinungen aus dem Text extra-                gen Merkmalshierarchie mit zusätzlichen Merkmalen. Die-
hiert, die sich auf ein Produktmerkmal beziehen. Die Tona-          se Merkmale werden aus unstrukturierten kundengenerier-
lität einer Meinung wird auf die Tonalität der enthaltenen        ten Produktrezensionen gewonnen, wobei versucht wird die
Adjektive zurückgeführt. Die extrahierten Merkmale werden         natürliche Ordnung der Merkmale (Unter- bzw. Obermerk-
- im Gegensatz zu unserer Arbeit - nicht hierarchisch mo-           malsbeziehung) zu beachten.
delliert.                                                           Die Merkmalshierarchie bildet die Basis für weitergehende
Es gibt auch Ansätze, die versuchen die natürliche Hierar-        Analysen, wie z.B. die gezielte Extraktion von Meinungen
chie von Produktmerkmalen abzubilden. Die Autoren von [1]           und Tonalitäten, die sich auf Produktmerkmale beziehen.
nutzen die tabellarische Struktur von Produktbeschreibun-           Diese nachfolgenden Analyseschritte sind nicht mehr Gegen-
gen aus, um explizite Produktmerkmale zu extrahieren, wo-           stand dieser Arbeit. Produkte (aber auch Dienstleistungen)
bei die hierarchische Struktur aus der Tabellenstruktur ab-         können durch eine Menge von Merkmalen (product features)
geleitet wird. Einen ähnlichen Ansatz verfolgen [5] et. al.: die   beschrieben werden. Produktmerkmale folgen dabei einer
Autoren nutzen ebenfalls die oftmals hochgradige Struktu-           natürlichen, domänenabhängigen Ordnung. Eine derartige
rierung von Produktbeschreibungen aus. Die Produktmerk-             natürliche Hierarchie ist exemplarisch in Abbildung 1 für
male werden mit Clusteringtechniken aus einem Korpus ex-            das Produkt Digitalkamera dargestellt. Offensichtlich ist
trahiert, wobei die Hierarchie der Merkmale durch das Clus-         Display ein Untermerkmal von Digitalkamera und besitzt
tering vorgegeben wird. Die Extraktion von expliziten Merk-         eigene Untermerkmale Auflösung und Farbtemperatur.
malen aus strukturierten Texten ist (i. d. R.) einfacher, als       Hierarchien von Produktmerkmalen können auf Basis von
durch Analyse unstrukturierter Daten.                               strukturierten Texten erzeugt werden, wie z. B. technische
Die Methode von [2] benutzt eine Taxonomie zur Ab-          nauigkeit (≈ 71% [5]). Allerdings tendieren Datenblätter
bildung der Merkmalshierarchie, wobei diese von einem Ex-           se Datenquellen enthalten i. d. R. die wichtigsten Produkt-
perten erstellt wird. Diese Hierarchie bildet die Grundlage         merkmale. Der hohe Strukturierungsgrad dieser Datenquel-
für die Meinungsextraktion. Die Tonalität der Meinungen           len erlaubt eine Extraktion der Merkmale mit hoher Ge-
wird über ein Tonalitätswörterbuch gelöst. Für diesen An-      nauigkeit (≈ 71% [5]). Allerdings tendieren Datenblätter
satz wird - im Gegensatz zu unserer Methode - umfangrei-            und Produktbeschreibungen dazu, ein Produkt relativ ober-
ches Expertenwissen benötigt.                                      flächlich darzustellen oder zu Gunsten des Produkts zu ver-
Die Arbeit von [8] konzentriert sich auf die Extrakti-      flächlich darzustellen oder zu Gunsten des Produkts zu ver-
on von Meinungen und die anschließende Tonalitätsanalyse.          1 eine Reihe von Merkmalen, wie sie häufig in strukturier-
Die Autoren unterscheiden zwischen subjektiven und kom-             ten Datenquellen zu finden sind (helle Knoten). Allerdings
parativen Sätze. Sowohl subjektive, als auch komparative           sind weitere, detailliertere Merkmale denkbar, die für eine
Sätze enthalten Meinungen, wobei im komparativen Fall ei-          Kaufentscheidung von Interesse sein könnten. Beispielsweise
ne Meinung nicht direkt gegeben wird, sondern über einen           könnte das Display einer Digitalkamera zur Fleckenbildung
Vergleich mit einem anderen Produkt erfolgt. Die Autoren            am unteren/oberen Rand neigen. Unterer/Oberer Rand
nutzen komparative Sätze, um Produktgraphen zu erzeu-              wird in diesem Fall zu einem Untermerkmal von Display
gen mit deren Hilfe verschiedene Produkte hinsichtlich eines        und Obermerkmal von Fleckenbildung (dunkle Knoten).
Merkmals geordnet werden können. Die notwendigen Tona-             Eine derartige Anreicherung einer gegebenen, unvollständi-
litätswerte werden einem Wörterbuch entnommen.                    gen Merkmalshierarchie kann durch die Verarbeitung von
kundengenerierten, unstrukturierten Rezensionen erfolgen. Wir halten einen hybriden Ansatz für durchaus sinnvoll: zunächst wird eine initiale Merkmalshierarchie mit hoher Genauigkeit aus strukturierten Daten gewonnen. Anschließend wird diese Hierarchie in einer zweiten Verarbeitungsphase mit zusätzlichen Produktmerkmalen angereichert.
Für den weiteren Verlauf dieses Abschnitts beschränken wir uns auf die zweite Analysephase, d.h. wir nehmen eine initiale Merkmalshierarchie als gegeben an. Für die Evaluation unseres Algorithmus (siehe Abschnitt 4) wurden die initialen Merkmalshierarchien manuell erzeugt.
Unser Algorithmus wurde auf der Basis einer Reihe von einfachen Beobachtungen entworfen, die wir bei der Analyse unseres Rezensionskorpus gemacht haben.

   1. Ein Produktmerkmal wird häufig durch ein Hauptwort repräsentiert.

   2. Viele Hauptwörter können dasselbe Produktmerkmal beschreiben. (Synonyme)

   3. Untermerkmale werden häufig im Kontext ihrer Obermerkmale genannt, wie z. B. "das Ladegerät der Kamera".

   4. Textfragmente, die von Produktmerkmalen handeln, besitzen häufig eine sehr ähnliche grammatikalische Struktur, wie z.B. "die Auflösung der Anzeige" oder "die Laufzeit des Akkus", wobei Unter- und Obermerkmale gemeinsam genannt werden. Die Struktur der Fragmente lautet [DET, NOUN, DET, NOUN], wobei DET einen Artikel und NOUN ein Hauptwort beschreibt.

Der Rest dieses Abschnitts gliedert sich wie folgt: Zunächst werden Definitionen in Unterabschnitt 3.1 eingeführt, die für das weitere Verständnis notwendig sind. Anschließend beschreiben wir in Unterabschnitt 3.2 unsere Analysepipeline, die für die Vorverarbeitung der Produktrezensionen verwendet wurde. Darauf aufbauend wird in Unterabschnitt 3.3 unser Algorithmus im Detail besprochen.

3.1 Definitionen
   Für das Verständnis der nächsten Abschnitte werden einige Begriffe benötigt, die in diesem Unterabschnitt definiert werden sollen:

Token. Ein Token t ist ein Paar t = (v_word, v_POS), wobei v_word das Wort und v_POS die Wortart angibt. Im Rahmen dieser Arbeit wurde das Universal Tagset [6] benutzt.

Merkmal. Wir definieren ein Produktmerkmal f als ein Tripel f = (S, C, p), wobei S eine Menge von Synonymen beschreibt, die als textuelle Realisierung eines Merkmals Verwendung finden können. Die Elemente von S können Worte, Produktbezeichnungen und auch Abkürzungen enthalten. Die Hierarchie wird über C und p kontrolliert, wobei C eine Menge von Untermerkmalen und p das Obermerkmal von f angibt. Das Wurzelelement einer Hierarchie beschreibt das Produkt/die Produktgruppe selbst und besitzt kein Obermerkmal.

POS-Muster. Ein POS-Muster q ist eine geordnete Sequenz von POS-Tags q = [tag1, tag2, ..., tagn], wobei n die Musterlänge beschreibt. Ein POS-Tag beschreibt eine Wortart, z.B. steht DET für einen Artikel, NOUN für ein Hauptwort und ADJ für ein Adjektiv. Weitere Informationen über das Universal Tagset finden sich in [6].

3.2 Analysepipeline
   Für die Verarbeitung und Untersuchung der Produktrezensionen haben wir eine für den NLP-Bereich (Natural Language Processing) typische Standardpipeline benutzt: Die Volltexte der Rezensionen sind für unsere Zwecke zu grobgranular, so dass in einer ersten Phase der Volltext in Sätze zerteilt wird. Anschließend werden die Sätze tokenisiert und die Wortarten der einzelnen Worte bestimmt. Des Weiteren werden Stoppworte markiert - dafür werden Standard-Stoppwortlisten benutzt. Wir beenden die Analysepipeline mit einer Stammformreduktion für jedes Wort, um die verschiedenen Flexionsformen eines Wortes auf eine kanonische Basis zu bringen.
   Für die Bestimmung zusätzlicher Produktmerkmale aus Produktrezensionen sind vor allem Hauptworte interessant, die i. d. R. keine Stoppworte sind. Allerdings ist uns aufgefallen, dass überdurchschnittlich viele Worte fälschlicherweise als Hauptwort erkannt werden - viele dieser Worte sind Stoppworte. Wir nehmen an, dass die variierende grammatikalische Qualität der Produktrezensionen für die hohe Anzahl falsch bestimmter Worte verantwortlich ist. Die Stoppwortmarkierung hilft dabei, diesen Fehler etwas auszugleichen.
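Eine solche Standardpipeline ließe sich z. B. mit NLTK skizzieren; das folgende Beispiel ist eine vereinfachte Illustration und nicht der in der Arbeit verwendete Code. Insbesondere ist der mitgelieferte POS-Tagger von NLTK für Englisch trainiert; für deutsche Rezensionen wäre ein deutsches Modell mit Universal Tagset einzusetzen.

# Vereinfachte Vorverarbeitungspipeline (Illustration, nicht der Originalcode).
# Benötigte NLTK-Ressourcen: punkt, stopwords, averaged_perceptron_tagger,
# universal_tagset (per nltk.download(...) nachladbar).
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = set(stopwords.words("german"))
STEMMER = SnowballStemmer("german")

def preprocess(volltext):
    """Zerlegt eine Rezension in Sätze aus Token-Tupeln
    (wort, wortart, ist_stoppwort, stammform)."""
    saetze = []
    for satz in nltk.sent_tokenize(volltext, language="german"):
        woerter = nltk.word_tokenize(satz, language="german")
        # Achtung: nltk.pos_tag ist für Englisch trainiert und dient hier nur
        # als Platzhalter für einen deutschen Tagger mit Universal Tagset.
        getaggt = nltk.pos_tag(woerter, tagset="universal")
        saetze.append([
            (wort, tag, wort.lower() in STOPWORDS, STEMMER.stem(wort))
            for wort, tag in getaggt
        ])
    return saetze

print(preprocess("Die Auflösung der Anzeige ist hervorragend."))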
3.3 Der Algorithmus
   In diesem Abschnitt beschreiben wir einen neuen Algorithmus, um eine initiale Hierarchie von Produktmerkmalen mit zusätzlichen Merkmalen anzureichern, wobei die natürliche Ordnung der Merkmale erhalten bleibt (siehe Algorithmus 1). Der Algorithmus erwartet 3 Parameter: eine 2-dimensionale Liste von Token T, die sämtliche Token für jeden Satz enthält (dabei beschreibt die erste Dimension die Sätze, die zweite Dimension die einzelnen Wörter), eine initiale Hierarchie von Merkmalen f und eine Menge von POS-Mustern P. Da der Algorithmus rekursiv arbeitet, wird zusätzlich ein Parameter d übergeben, der die maximale Rekursionstiefe angibt. Der Algorithmus bricht ab, sobald die vorgegebene Tiefe erreicht wird (Zeile 1-3).

Kandidatensuche (Zeile 4-11). Um geeignete Kandidaten für neue Produktmerkmale zu finden, werden alle Sätze betrachtet und jeweils entschieden, ob der Satz eine Realisierung des aktuell betrachteten Merkmals enthält oder nicht. Wenn ein Satz eine Realisierung hat, dann wird die Funktion applyPattern aufgerufen. Diese Funktion sucht im übergebenen Satz nach gegebenen POS-Mustern und gibt – sofern mindestens ein Muster anwendbar ist – die entsprechenden Token als Kandidat zurück, wobei die Mustersuche auf das unmittelbare Umfeld der gefundenen Realisierung eingeschränkt wird, damit das korrekte POS-Muster zurückgeliefert wird, da POS-Muster mehrfach innerhalb eines Satzes vorkommen können.
Im Rahmen dieser Arbeit haben wir die folgenden POS-Muster verwendet:

   • [DET, NOUN, DET, NOUN]

   • [DET, NOUN, VERB, DET, ADJ, NOUN]
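Die in der Kandidatensuche verwendete Mustersuche (applyPattern) lässt sich etwa wie folgt andeuten; die Darstellung der Token als Paare (wort, wortart) folgt Unterabschnitt 3.1, die Fenstergröße um die gefundene Realisierung ist eine frei gewählte Annahme.

# Illustrative Skizze der Mustersuche aus der Kandidatensuche (Zeile 4-11);
# Token werden als Paare (wort, wortart) dargestellt, Muster als Listen von
# POS-Tags. Die Fenstergröße um die gefundene Realisierung ist eine Annahme.
PATTERNS = [
    ["DET", "NOUN", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
]

def apply_pattern(satz, patterns, realisierung_pos, fenster=3):
    """Liefert die Token aller Muster, die im unmittelbaren Umfeld der
    Realisierung (Position realisierung_pos im Satz) vorkommen."""
    kandidaten = []
    start = max(0, realisierung_pos - fenster)
    ende = min(len(satz), realisierung_pos + fenster + 1)
    for muster in patterns:
        for i in range(start, ende - len(muster) + 1):
            fenster_tokens = satz[i:i + len(muster)]
            if [tag for _, tag in fenster_tokens] == muster:
                kandidaten.append(fenster_tokens)
    return kandidaten

satz = [("die", "DET"), ("Auflösung", "NOUN"), ("der", "DET"), ("Anzeige", "NOUN")]
print(apply_pattern(satz, PATTERNS, realisierung_pos=3))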
 Algorithm 1: refineHierarchy

      Eingabe : T : Eine 2-dimensionale Liste von Token.
      Eingabe : P : Ein Array von POS-Mustern.
      Eingabe : f : Eine initiale Merkmalshierarchie.
      Eingabe : d : Die maximale Rekursionstiefe.
      Ausgabe: Das Wurzelmerkmal der angereicherten Hierarchie.
   1 if d = 0 then
   2     return f
   3 end
   4 C ← {} ;
   5 for Token[] T′ ∈ T do
   6     for Token t ∈ T′ do
   7         if t.word ∈ f.S then
   8             C ← C ∪ applyPattern(T′, P) ;
   9         end
  10     end
  11 end
  12 for Token[] C′ ∈ C do
  13     for Token t ∈ C′ do
  14         if t.pos ≠ NOUN then
  15             next ;
  16         end
  17         if t.length ≤ 3 then
  18             next ;
  19         end
  20         if hasParent(t.word, f ) then
  21             next ;
  22         end
  23         if isSynonym(t.word, f.S) then
  24             f.S ← f.S ∪ {t.word} ;
  25             next ;
  26         end
  27         f.C ← f.C ∪ {({t.word}, {}, f )} ;
  28     end
  29 end
  30 for Feature f′ ∈ f.C do
  31     refineHierarchy(T, f′, P, d − 1) ;
  32 end
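Zur Verdeutlichung folgt eine mögliche, stark vereinfachte Umsetzung von Algorithmus 1 in Python; die Hilfsfunktionen apply_pattern (siehe Skizze oben), has_parent und is_synonym werden als gegeben angenommen, alle übrigen Namen sind illustrativ.

# Vereinfachte, illustrative Umsetzung von Algorithmus 1 (refineHierarchy).
# Ein Merkmal f wird als Objekt mit den Attributen S (Synonyme), C (Unter-
# merkmale) und p (Obermerkmal) angenommen; Token als Paare (wort, wortart).

class Merkmal:
    def __init__(self, synonyme, obermerkmal=None):
        self.S = set(synonyme)   # textuelle Realisierungen
        self.C = []              # Untermerkmale
        self.p = obermerkmal     # Obermerkmal (None beim Wurzelelement)

def refine_hierarchy(T, f, P, d, apply_pattern, has_parent, is_synonym):
    if d == 0:                                       # Zeile 1-3
        return f
    kandidaten = []                                  # Zeile 4
    for satz in T:                                   # Zeile 5-11: Kandidatensuche
        for i, (wort, _tag) in enumerate(satz):
            if wort in f.S:
                kandidaten.extend(apply_pattern(satz, P, i))
    for kandidat in kandidaten:                      # Zeile 12-29: Validierung
        for wort, tag in kandidat:
            if tag != "NOUN":                        # Heuristik 1
                continue
            if len(wort) <= 3:                       # Heuristik 2
                continue
            if has_parent(wort, f):                  # Heuristik 3 (keine Kreise)
                continue
            if is_synonym(wort, f.S):                # Heuristik 4 (neues Synonym)
                f.S.add(wort)
                continue
            f.C.append(Merkmal({wort}, obermerkmal=f))   # Zeile 27
    for untermerkmal in f.C:                         # Zeile 30-32: Rekursion
        refine_hierarchy(T, untermerkmal, P, d - 1,
                         apply_pattern, has_parent, is_synonym)
    return f

Die Hilfsfunktionen has_parent und is_synonym kapseln dabei die 3. bzw. 4. Heuristik der im Folgenden beschriebenen Validierungsphase.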
Validierungsphase (Zeile 12-29). Die Validierungsphase dient dazu, die gefundenen Kandidaten zu validieren, also zu entscheiden, ob ein Kandidat ein neues Merkmal enthält. Man beachte, dass es sich bei diesem neuen Merkmal um ein Untermerkmal des aktuellen Produktmerkmals handelt, sofern es existiert. Für die Entscheidungsfindung nutzen wir eine Reihe von einfachen Heuristiken. Ein Token t ist kein Produktmerkmal und wird übergangen, falls t.v_word:

   1. kein Hauptwort ist (Zeile 14-16).

   2. keine ausreichende Länge besitzt (Zeile 17-19).

   3. ein Synonym von f (oder eines Obermerkmals von f) ist (Zeile 20-22).

   4. ein neues Synonym von f darstellt (Zeile 23-26).

   Die 3. Heuristik stellt sicher, dass sich keine Kreise in der Hierarchie bilden können. Man beachte, dass Obermerkmale, die nicht direkt voneinander abhängen, gleiche Untermerkmale tragen können.
Die 4. Heuristik dient zum Lernen von vorher unbekannten Synonymen. Dazu wird das Wort mit den Synonymen von f verglichen (z.B. mit der Levenshtein-Distanz) und als Synonym aufgenommen, falls eine ausreichende Ähnlichkeit besteht. Damit soll verhindert werden, dass die falsche Schreibweise eines eigentlich bekannten Merkmals dazu führt, dass ein neuer Knoten in die Hierarchie eingefügt wird.
Wenn der Token t die Heuristiken erfolgreich passiert hat, dann wird t zu einem neuen Untermerkmal von f (Zeile 27).

Rekursiver Aufruf (Zeile 30-32). Nachdem das Merkmal f nun mit zusätzlichen Merkmalen angereichert wurde, wird der Algorithmus rekursiv für alle Untermerkmale von f aufgerufen, um diese mit weiteren Merkmalen zu versehen. Dieser Vorgang wiederholt sich so lange, bis die maximale Rekursionstiefe erreicht wird.

Nachbearbeitungsphase. Die Hierarchie, die von Algorithmus 1 erweitert wurde, muss in einer Nachbearbeitungsphase bereinigt werden, da viele Merkmale enthalten sind, die keine realen Produktmerkmale beschreiben (Rauschen). Für diese Arbeit verwenden wir die relative Häufigkeit eines Untermerkmals im Kontext seines Obermerkmals, um niederfrequente Merkmale (samt Untermerkmalen) aus der Hierarchie zu entfernen. Es sind aber auch andere Methoden denkbar, wie z.B. eine Gewichtung nach tf-idf [4]. Dabei wird nicht nur die Termhäufigkeit (tf) betrachtet, sondern auch die inverse Dokumenthäufigkeit (idf) mit einbezogen. Der idf eines Terms beschreibt die Bedeutsamkeit des Terms in Bezug auf die gesamte Dokumentenmenge.

4. EVALUATION
   In diesem Abschnitt diskutieren wir die Vor- und Nachteile unseres Algorithmus. Um unseren Algorithmus evaluieren zu können, haben wir einen geeigneten Korpus aus Kundenrezensionen zusammengestellt. Unser Korpus besteht aus 4000 Kundenrezensionen von amazon.de aus der Produktgruppe Digitalkamera.
   Wir haben unseren Algorithmus für die genannte Produktgruppe eine Hierarchie anreichern lassen. Die initiale Produkthierarchie enthält ein Obermerkmal, welches die Produktgruppe beschreibt. Zudem wurden häufig gebrauchte Synonyme hinzugefügt, wie z.B. Gerät. Im Weiteren präsentieren wir exemplarisch die angereicherte Hierarchie. Für dieses Experiment wurde die Rekursionstiefe auf 3 gesetzt, niederfrequente Merkmale (relative Häufigkeit < 0,002) wurden eliminiert. Wir haben für diese Arbeit Rezensionen in deutscher Sprache verwendet, aber der Algorithmus kann leicht auf andere Sprachen angepasst werden. Die erzeugte Hierarchie ist in Abbildung 2 dargestellt. Es zeigt sich, dass unser Algorithmus – unter Beachtung der hierarchischen Struktur – eine Reihe wertvoller Merkmale extrahieren konnte: z. B. Batterie mit seinen Untermerkmalen Haltezeit und Verbrauch oder Akkus mit den Untermerkmalen Auflad und Zukauf. Es wurden aber auch viele Merkmale aus den Rezensionen extrahiert, die entweder keine echten Produktmerkmale sind (z.B. Kompakt) oder eine falsche Ober-Untermerkmalsbeziehung abbilden (z. B. Haptik und Kamera). Des Weiteren sind einige Merkmale, wie z. B. Qualität, zu generisch und sollten nicht als Produktmerkmal benutzt werden.
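Die 4. Heuristik (Synonymerkennung über eine Editierdistanz) und die Bereinigung niederfrequenter Merkmale aus der Nachbearbeitungsphase lassen sich beispielsweise so andeuten; die Schwellwerte sind frei gewählte Annahmen, die Merkmalsdarstellung entspricht der Skizze oben.

# Illustrative Skizze zur 4. Heuristik (Synonymerkennung über die
# Levenshtein-Distanz) und zur Bereinigung niederfrequenter Merkmale in der
# Nachbearbeitungsphase; die Schwellwerte sind frei gewählte Annahmen.

def levenshtein(a, b):
    """Klassische Editierdistanz (dynamische Programmierung)."""
    zeile = list(range(len(b) + 1))
    for i, za in enumerate(a, 1):
        neue_zeile = [i]
        for j, zb in enumerate(b, 1):
            neue_zeile.append(min(zeile[j] + 1,                # Löschen
                                  neue_zeile[j - 1] + 1,       # Einfügen
                                  zeile[j - 1] + (za != zb)))  # Ersetzen
        zeile = neue_zeile
    return zeile[-1]

def ist_synonym(wort, synonyme, max_distanz=2):
    """4. Heuristik: ausreichende Ähnlichkeit zu einem bekannten Synonym."""
    return any(levenshtein(wort.lower(), s.lower()) <= max_distanz
               for s in synonyme)

def bereinige(merkmal, haeufigkeit, min_rel_haeufigkeit=0.002):
    """Entfernt niederfrequente Untermerkmale (samt Untermerkmalen);
    haeufigkeit(kind, eltern) liefert die relative Häufigkeit des
    Untermerkmals im Kontext seines Obermerkmals."""
    merkmal.C = [kind for kind in merkmal.C
                 if haeufigkeit(kind, merkmal) >= min_rel_haeufigkeit]
    for kind in merkmal.C:
        bereinige(kind, haeufigkeit, min_rel_haeufigkeit)

print(ist_synonym("Akus", {"Akku", "Akkus"}))   # True: Tippfehler wird als Synonym erkannt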
Abbildung 2: Angereicherte Hierarchie für die Produktgruppe Digitalkamera.

5. RESÜMEE UND AUSBLICK
   In dieser Arbeit wurde ein neuer Algorithmus vorgestellt, der auf Basis einer gegebenen – möglicherweise flachen – Merkmalshierarchie diese Hierarchie mit zusätzlichen Merkmalen anreichert. Die neuen Merkmale werden automatisch aus unstrukturierten Produktrezensionen gewonnen, wobei der Algorithmus versucht, die natürliche Ordnung der Produktmerkmale zu beachten.
   Wir konnten zeigen, dass unser Algorithmus eine initiale Merkmalshierarchie mit sinnvollen Untermerkmalen anreichern kann, allerdings werden auch viele falsche Merkmale extrahiert und in fehlerhafte Merkmalsbeziehungen gebracht. Wir halten unseren Algorithmus dennoch für vielversprechend. Unsere weitere Forschung wird sich auf Teilaspekte dieser Arbeit konzentrieren:

   • Die Merkmalsextraktion muss verbessert werden: Wir haben beobachtet, dass eine Reihe extrahierter Merkmale keine echten Produktmerkmale beschreiben. Dabei handelt es sich häufig um sehr allgemeine Wörter wie z.B. Möglichkeiten. Wir bereiten deshalb den Aufbau einer Stoppwortliste für Produktrezensionen vor. Auf diese Weise könnte diese Problematik abgeschwächt werden.

   • Des Weiteren enthalten die angereicherten Hierarchien teilweise Merkmale, die in einer falschen Beziehung zueinander stehen: z.B. induzieren die Merkmale Akku und Akku-Ladegerät eine Ober-Untermerkmalsbeziehung; Akku kann als Obermerkmal von Ladegerät betrachtet werden. Außerdem konnte beobachtet werden, dass einige Merkmalsbeziehungen alternieren: z.B. existieren 2 Merkmale Taste und Druckpunkt in wechselnder Ober-Untermerkmalsbeziehung.

   • Der Algorithmus benötigt POS-Muster, um Untermerkmale in Sätzen zu finden. Für diese Arbeit wurden die verwendeten POS-Muster manuell konstruiert, aber wir planen, die Konstruktion der POS-Muster weitestgehend zu automatisieren. Dazu ist eine umfangreiche Analyse eines großen Korpus notwendig.

   • Die Bereinigung der erzeugten Hierarchien ist unzureichend - die relative Häufigkeit eines Merkmals reicht als Gewichtung für unsere Zwecke nicht aus. Aus diesem Grund möchten wir mit anderen Gewichtungsmaßen experimentieren.

   • Die Experimente in dieser Arbeit sind sehr einfach gestaltet. Eine sinnvolle Evaluation ist (z. Zt.) nicht möglich, da (unseres Wissens nach) kein geeigneter Testkorpus mit annotierten Merkmalshierarchien existiert. Die Konstruktion eines derartigen Korpus ist geplant.

   • Des Weiteren sind weitere Experimente geplant, um den Effekt der initialen Merkmalshierarchie auf den Algorithmus zu evaluieren. Diese Versuchsreihe umfasst Experimente mit mehrstufigen, initialen Merkmalshierarchien, die sowohl manuell als auch automatisch erzeugt wurden.

   • Abschließend planen wir, unseren Algorithmus zur Extraktion von Produktmerkmalen in einem Gesamtsystem zur automatischen Zusammenfassung und Analyse von Produktrezensionen einzusetzen.
6.   REFERENZEN
[1] M. Acher, A. Cleve, G. Perrouin, P. Heymans,
    C. Vanbeneden, P. Collet, and P. Lahire. On extracting
    feature models from product descriptions. In
    Proceedings of the Sixth International Workshop on
    Variability Modeling of Software-Intensive Systems,
    VaMoS ’12, pages 45–54, New York, NY, USA, 2012.
    ACM.
[2] F. L. Cruz, J. A. Troyano, F. Enríquez, F. J. Ortega,
    and C. G. Vallejo. A knowledge-rich approach to
    feature-based opinion extraction from product reviews.
    In Proceedings of the 2nd international workshop on
    Search and mining user-generated contents, SMUC ’10,
    pages 13–20, New York, NY, USA, 2010. ACM.
[3] M. Hu and B. Liu. Mining and summarizing customer
    reviews. In Proceedings of the tenth ACM SIGKDD
    international conference on Knowledge discovery and
    data mining, KDD ’04, pages 168–177, New York, NY,
    USA, 2004. ACM.
[4] K. S. Jones. A statistical interpretation of term
    specificity and its application in retrieval. Journal of
    Documentation, 28:11–21, 1972.
[5] X. Meng and H. Wang. Mining user reviews: From
    specification to summarization. In Proceedings of the
    ACL-IJCNLP 2009 Conference Short Papers,
    ACLShort ’09, pages 177–180, Stroudsburg, PA, USA,
    2009. Association for Computational Linguistics.
[6] S. Petrov, D. Das, and R. McDonald. A universal
    part-of-speech tagset. In N. Calzolari (Conference Chair), K. Choukri,
    T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani,
    J. Odijk, and S. Piperidis, editors, Proceedings of the
    Eighth International Conference on Language Resources
    and Evaluation (LREC'12), Istanbul, Turkey, May 2012.
    European Language Resources Association (ELRA).
[7] T. Scholz and S. Conrad. Extraction of statements in
    news for a media response analysis. In Proc. of the 18th
    Intl. conf. on Applications of Natural Language
    Processing to Information Systems 2013 (NLDB 2013),
    2013. (to appear).
[8] K. Zhang, R. Narayanan, and A. Choudhary. Voice of
    the customers: Mining online customer reviews for
    product feature-based ranking. In Proceedings of the 3rd
    conference on Online social networks, WOSN’10, pages
    11–11, Berkeley, CA, USA, 2010. USENIX Association.
        Ein Verfahren zur Beschleunigung eines neuronalen
           Netzes für die Verwendung im Image Retrieval

                                                            Daniel Braun
                                                     Heinrich-Heine-Universität
                                                       Institut für Informatik
                                                         Universitätsstr. 1
                                                   D-40225 Düsseldorf, Germany
                                                braun@cs.uni-duesseldorf.de

ABSTRACT                                                              fikator eine untergeordnete Rolle, da man die Berechnung
Künstliche neuronale Netze haben sich für die Mustererken-          vor der eigentlichen Anwendung ausführt. Will man aller-
nung als geeignetes Mittel erwiesen. Deshalb sollen ver-              dings auch während der Nutzung des Systems weiter ler-
schiedene neuronale Netze verwendet werden, um die für               nen, so sollten die benötigten Rechnungen möglichst wenig
ein bestimmtes Objekt wichtigen Merkmale zu identifizier-             Zeit verbrauchen, da der Nutzer ansonsten entweder auf die
en. Dafür werden die vorhandenen Merkmale als erstes                 Berechnung warten muss oder das Ergebnis, dass ihm aus-
durch ein Art2-a System kategorisiert. Damit die Kategorien           gegeben wird, berücksichtigt nicht die durch ihn hinzuge-
verschiedener Objekte sich möglichst wenig überschneiden,           fügten Daten.
muss bei deren Berechnung eine hohe Genauigkeit erzielt                  Für ein fortlaufendes Lernen bieten sich künstliche neu-
werden. Dabei zeigte sich, dass das Art2 System, wie auch             ronale Netze an, da sie so ausgelegt sind, dass jeder neue
die Art2-a Variante, bei steigender Anzahl an Kategorien              Input eine Veränderung des Gedächtnisses des Netzes nach
schnell zu langsam wird, um es im Live-Betrieb verwen-                sich ziehen kann. Solche Netze erfreuen sich, bedingt durch
den zu können. Deshalb wird in dieser Arbeit eine Opti-              die sich in den letzten Jahren häufenden erfolgreichen An-
mierung des Systems vorgestellt, welche durch Abschätzung            wendungen - zum Beispiel in der Mustererkennung - einer
des von dem Art2-a System benutzten Winkels die Anzahl                 steigenden Beliebtheit in verschiedensten Einsatzgebieten,
der möglichen Kategorien für einen Eingabevektor stark ein-         wie zum Beispiel auch im Image Retrieval.
schränkt. Des Weiteren wird eine darauf aufbauende In-                  Der geplante Systemaufbau sieht dabei wie folgt aus: die
dexierung der Knoten angegeben, die potentiell den Speich-            Merkmalsvektoren eines Bildes werden nacheinander einer
erbedarf für die zu überprüfenden Vektoren reduzieren kann.        Clustereinheit übergeben, welche die Merkmalsvektoren clus-
Wie sich in den durchgeführten Tests zeigte, kann die vorge-         tert und die Kategorien der in dem Bild vorkommenden
stellte Abschätzung die Bearbeitungszeit für kleine Cluster-        Merkmale berechnet. Das Clustering der Clustereinheit pas-
radien stark reduzieren.                                              siert dabei fortlaufend. Das bedeutet, dass die einmal be-
                                                                      rechneten Cluster für alle weiteren Bilder verwendet werden.
                                                                      Danach werden die für das Bild gefundenen Kategorien von
Kategorien                                                            Merkmalen an die Analyseeinheit übergeben, in der versucht
H.3.3 [Information Storage and Retrieval]: Information                wird, die für ein bestimmtes Objekt wichtigen Kategorien zu
Search and Retrieval—Clustering; F.1.1 [Computation by                identifizieren. Die dort gefundenen Kategorien werden dann
Abstract Devices]: Models of Computation—Neural Net-                  für die Suche dieser Objekte in anderen Bildern verwendet.
work                                                                  Das Ziel ist es dabei, die Analyseeinheit so zu gestalten,
                                                                      dass sie nach einem initialen Training weiter lernt und so
Schlüsselwörter                                                       neue Merkmale eines Objektes identifizieren soll.
                                                                         Für die Analyseeinheit ist die Verwendung verschiedener
Neuronale Netze, Clustering, Image Retrieval                          neuronaler Netze geplant. Da sie aber für die vorgenomme-
                                                                      nen Optimierungen irrelevant ist, wird im Folgenden nicht
1. EINLEITUNG                                                         weiter auf sie eingegangen.
  Trainiert man ein Retrieval System mit einem festen Kor-               Von dem Clusteringverfahren für die Clustereinheit wer-
pus und wendet die berechneten Daten danach unverän-                 den dabei die folgenden Punkte gefordert:
dert an, so spielt die Berechnungsdauer für einen Klassi-
                                                                         • Das Clustering soll nicht überwacht funktionieren. Das
                                                                           bedeutet, dass es keine Zielvorgabe für die Anzahl der
                                                                           Cluster geben soll. Das System soll also auch bei einem
                                                                           bestehenden Clustering für einen neuen Eingabevektor
                                                                           erkennen, ob er einem Cluster zugewiesen werden kann
                                                                           oder ob ein neuer Cluster erstellt werden muss.

  ten gehören und nicht die Vektoren anderer Objekte die Kategorie verschmutzen.

• Das Clusteringverfahren sollte auch bei einer hohen Anzahl an Clustern, die aus der gewünschten
  hohen Genauigkeit der einzelnen Cluster resultiert, schnell berechnet werden können. Eine einfache
  Veranschaulichung dieser Anforderungen zeigt die Skizze im Anschluss an diese Liste.
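Zur Veranschaulichung dieser Anforderungen skizziert der folgende Python-Ausschnitt ein minimales,
schwellwertbasiertes inkrementelles Clustering (Zuordnen oder Neuanlegen). Es handelt sich ausdrücklich
nicht um das später verwendete Art2-a Netz, sondern nur um eine vereinfachte Illustration; Namen wie
zuordnen_oder_anlegen sind frei gewählt.

    import math

    def kosinus(a, b):
        # Kosinus-Aehnlichkeit zweier Vektoren
        skalar = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return skalar / norm

    def zuordnen_oder_anlegen(vektor, zentren, schwelle=0.98):
        # Unueberwachtes, fortlaufendes Clustering mit begrenzter Ausdehnung:
        # liegt kein Zentrum nahe genug am Eingabevektor, wird ein neuer
        # Cluster angelegt (keine feste Clusteranzahl vorgegeben).
        bester, beste_aehnlichkeit = None, -1.0
        for idx, z in enumerate(zentren):
            s = kosinus(vektor, z)
            if s > beste_aehnlichkeit:
                bester, beste_aehnlichkeit = idx, s
        if bester is None or beste_aehnlichkeit < schwelle:
            zentren.append(list(vektor))      # neue Kategorie
            return len(zentren) - 1
        return bester                          # vorhandene Kategorie

    # Beispiel: drei Vektoren, zwei davon sehr aehnlich
    zentren = []
    for v in ([1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.0, 1.0, 0.0]):
        print(zuordnen_oder_anlegen(v, zentren, schwelle=0.95))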

In dieser Arbeit wird ein Adaptive Resonance Theory Netz [5] verwendet, genauer ein Art2 Netz [1],
da es die beiden ersten Bedingungen erfüllt. Denn dieses neuronale Netz führt ein nicht überwachtes
Clustering aus, wobei es mit jedem Eingabevektor weiter lernt und gegebenenfalls neue Cluster
erschafft. Der genaue Aufbau dieses Systems wird in Kapitel 3 genauer dargestellt.
Zur Beschreibung des Bildes dienen SIFT Deskriptoren [9, 10], welche mit 128 Dimensionen einen sehr
großen Raum für mögliche Kategorien aufspannen. Dadurch wächst die Knotenanzahl innerhalb des Art2
Netzes rapide an, was zu einer Verlangsamung des Netzes führt. Deshalb wird die Art2-a Variante [2]
verwendet, welche das Verhalten des Art2 Systems approximiert. Dieses System hat die Vorteile, dass
es zum einen im Vergleich zu Art2 um mehrere Größenordnungen schneller ist und sich zum anderen
gleichzeitig auch noch größtenteils parallelisieren lässt, wodurch ein weiterer Geschwindigkeitsgewinn
erzielt werden kann.
Dennoch zeigt sich, dass durch die hohe Dimension des Vektors die für die Berechnung der Kategorie
benötigten Skalarprodukte, unter Berücksichtigung der hohen Anzahl an Knoten, weiterhin sehr
rechenintensiv sind. Dadurch steigt auch bei starker Parallelisierung, sofern die maximale Anzahl
paralleler Prozesse begrenzt ist, die Zeit für die Bearbeitung eines neuen Vektors mit fortlaufendem
Training kontinuierlich an. Aus diesem Grund wird in Kapitel 4 eine Erweiterung des Systems
vorgestellt, die die Menge der Kandidaten möglicher Gewinnerknoten schon vor der teuren Berechnung
des Skalarproduktes verkleinert.
Der weitere Aufbau dieser Arbeit sieht dabei wie folgt aus: In dem folgenden Kapitel 2 werden einige
ausgewählte Ansätze aus der Literatur genannt, in denen neuronale Netze für das Image Retrieval
verwendet werden. Um die Plausibilität der Erweiterung zu verstehen, werden dann in Kapitel 3 die
dafür benötigten Mechanismen und Formeln eines Art2-a Systems vorgestellt. Kapitel 4 fokussiert sich
danach auf die vorgeschlagene Erweiterung des bekannten Systems. In Kapitel 5 wird die Erweiterung
evaluiert, um danach in dem folgenden Kapitel eine Zusammenfassung des Gezeigten sowie einen Ausblick
zu geben.

2. VERWANDTE ARBEITEN
In diesem Kapitel werden einige Ansätze aus der Literatur vorgestellt, in denen neuronale Netze für
verschiedene Aufgabenstellungen im Image Retrieval verwendet werden. Darunter fallen Themengebiete
wie Clustering und Klassifikation von Bildern und ihren Merkmalsvektoren.
Ein bekanntes Beispiel für die Verwendung von neuronalen Netzen im Image Retrieval ist das PicSOM
Framework, welches in [8] vorgestellt wird. Dort werden TS-SOMs (Tree Structured Self-Organizing
Maps) für die Bildsuche verwendet. Ein Bild wird dabei durch einen Merkmalsvektor dargestellt. Diese
Vektoren werden dann dem neuronalen Netz präsentiert, welches sie dann der Baumstruktur hinzufügt,
so dass im Idealfall am Ende jedes Bild in der Baumstruktur repräsentiert wird. Bei der Suche wird
der Baum dann durchlaufen und der ähnlichste Knoten als Antwort gewählt. Das Framework verwendet
dabei das Feedback des Nutzers, wodurch nach jeder Iteration das Ergebnis verfeinert wird. Das
neuronale Netz dient hier somit als Klassifikator.
[12] benutzt das Fuzzy Art neuronale Netz, um die Merkmalsvektoren zu klassifizieren. Sie schlagen
dabei eine zweite Bewertungsphase vor, die dazu dient, das Netz an ein erwartetes Ergebnis anzupassen,
das System damit zu überwachen und die Resultate des Netzes zu präzisieren.
In [6] wird ein Radial Basis Funktion Netzwerk (RBF) als Klassifikator verwendet. Eins ihrer Verfahren
lässt dabei den Nutzer einige Bilder nach der Nähe zu ihrem Suchziel bewerten, um diese Bewertung dann
für das Training ihrer Netzwerke zu verwenden. Danach nutzen sie die so trainierten neuronalen Netze
zur Bewertung aller Bilder der Datenbank.
Auch [11] nutzt ein Radial Basis Funktion Netz zur Suche nach Bildern und trainiert dieses mit der
vom Nutzer angegebenen Relevanz des Bildes, wobei das neuronale Netz nach jeder Iteration aus
Bewertung und Suche weiter trainiert wird.
In [3] wird ein Multiple Instance Netzwerk verwendet. Das bedeutet, dass für jede mögliche Klasse von
Bildern ein eigenes neuronales Netz erstellt wird. Danach wird ein Eingabebild jedem dieser Subnetze
präsentiert und gegebenenfalls der dazugehörigen Klasse zugeordnet.

3. ART2-A BESCHREIBUNG
In diesem Kapitel werden die benötigten Mechanismen eines Art2-a Systems vorgestellt. Für das
Verständnis sind dabei nicht alle Funktionen des Systems nötig, weshalb zum Beispiel auf die nähere
Beschreibung der für das Lernen benötigten Formeln und des Preprocessing Fields verzichtet wird. Für
weiterführende Informationen über diese beiden Punkte sowie generell über das Art2-a System sei
deshalb auf [1, 2] verwiesen.

Abbildung 1: Skizze eines Art2-a Systems (Attentional Subsystem mit Preprocessing Field F0, Input
Representation Field F1, Category Representation Field F2 und LTM-Vektoren zJ*; Orienting Subsystem
mit Reset-Modul)

Wie in Bild 1 zu sehen ist, besteht das System aus zwei Subsystemen: einem Attentional Subsystem, in
dem die Bearbeitung und Zuordnung eines an den Eingang angelegten Vektors ausgeführt wird, sowie einem
Orienting Subsystem, welches die Ähnlichkeit des Eingabevektors mit der vorher gewählten
Gewinnerkategorie berechnet und diese bei zu
geringer Nähe zurücksetzt.
Innerhalb des Category Representation Field F2 liegen die Knoten, die für die einzelnen
Vektorkategorien stehen. Dabei wird die Beschreibung der Kategorie in der Long Term Memory (LTM)
gespeichert, die das Feld F2 mit dem Input Representation Field F1 in beide Richtungen verbindet.
Nach [2] gilt für den Aktivitätswert T von Knoten J in dem Feld F2:

    T_J = \begin{cases} \alpha \cdot \sum_{i=1}^{n} I_i, & \text{wenn } J \text{ nicht gebunden ist,} \\ I \cdot z_J^{*}, & \text{wenn } J \text{ gebunden ist.} \end{cases}

I_i steht dabei für den durch das Preprocessing Field F0 berechneten Input in das Feld F1 und α ist
ein wählbarer Parameter, der klein genug ist, so dass die Aktivität eines ungebundenen Knotens für
bestimmte Eingangsvektoren nicht immer größer ist als alle Aktivitätswerte der gebundenen Knoten.
Hierbei gilt ein Knoten als gebunden, wenn ihm mindestens ein Vektor zugeordnet wurde.
Da der Aktivitätswert für alle nicht gebundenen Knoten konstant ist und deshalb nur einmal berechnet
werden muss, ist dieser Fall für eine Effizienzsteigerung von untergeordnetem Interesse und wird
deshalb im Folgenden nicht weiter betrachtet.
Wie in [2] gezeigt wird, sind sowohl I als auch z_J^{*} durch die Anwendung der euklidischen
Normalisierung Einheitsvektoren, weshalb folglich

    \|I\| = \|z_J^{*}\| = 1    (1)

gilt. Deshalb folgt für die Aktivitätswerte der gebundenen Kategorieknoten:

    T_J = I \cdot z_J^{*} = \|I\| \cdot \|z_J^{*}\| \cdot \cos\theta = \cos\theta    (2)

Die Aktivität eines Knotens entspricht damit dem Kosinus des Winkels zwischen dem Eingangsvektor I
und dem LTM-Vektor z_J^{*}. Damit der Knoten mit dem Index J gewählt wird, muss

    T_J = \max_{j}\{T_j\}

gelten, sprich der Knoten mit der maximalen Aktivität wird als mögliche Kategorie gewählt. Dabei wird
bei Gleichheit mehrerer Werte der zuerst gefundene Knoten genommen. Die maximale Distanz, die das
Resetmodul akzeptiert, wird durch den Schwellwert ρ, im Folgenden Vigilance Parameter genannt,
bestimmt, mit dem die für die Art2-a Variante benötigte Schwelle ρ* wie folgt berechnet wird:

    \rho^{*} = \frac{\rho^{2}(1+\sigma)^{2} - (1+\sigma^{2})}{2\sigma}

mit

    \sigma \equiv \frac{cd}{1-d}    (3)

und c und d als frei wählbaren Parametern des Systems, die der Beschränkung

    \frac{cd}{1-d} \le 1

unterliegen. Damit wird der Knoten J genau dann abgelehnt, wenn

    T_J < \rho^{*}    (4)

gilt. Ist das der Fall, wird ein neuer Knoten aktiviert und somit eine neue Kategorie erstellt. Mit
(2) und (4) folgt damit, dass ein Knoten nur dann ausgewählt werden kann, wenn für den Winkel θ
zwischen dem Eingabevektor I und dem gespeicherten LTM-Vektor z_J^{*}

    \cos\theta \ge \rho^{*}    (5)

gilt. Da die einzelnen Rechnungen, die von dem System ausgeführt werden müssen, unabhängig sind, ist
dieses System hochgradig parallelisierbar, weshalb alleine durch Ausnutzung dieser Tatsache die
Berechnungszeit stark gesenkt werden kann. Mit steigender Knotenanzahl lässt sich das System dennoch
weiter optimieren, wie in dem folgenden Kapitel gezeigt werden soll.
Das Art2-a System hat dabei allerdings einen Nachteil, denn bedingt durch die Nutzung des Kosinus des
Winkels zwischen zwei Vektoren werden Vektoren, die linear abhängig sind, in dieselbe Kategorie
gelegt. Dieses Verhalten ist für die geforderte Genauigkeit bei den Kategorien unerwünscht. Dennoch
lässt sich dieses Problem leicht durch die Erhebung weiterer Daten, wie zum Beispiel den
Clustermittelpunkt, lösen, weshalb im Folgenden nicht weiter darauf eingegangen wird.
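Die Auswahllogik dieses Kapitels (Aktivität gebundener Knoten als Skalarprodukt, Wahl des Maximums,
Ablehnung unterhalb der Schwelle ρ*) lässt sich wie folgt als kleine Python-Skizze zusammenfassen.
Preprocessing, Lernen und die Behandlung ungebundener Knoten sind bewusst ausgespart; die Skizze setzt
bereits normierte Vektoren voraus, und alle Namen sind frei gewählt.

    import math

    def rho_stern(rho, c=0.1, d=0.9):
        # Schwelle rho* der Art2-a Variante mit sigma = c*d/(1-d)
        sigma = c * d / (1 - d)
        return (rho**2 * (1 + sigma)**2 - (1 + sigma**2)) / (2 * sigma)

    def waehle_knoten(I, ltm_vektoren, rho):
        # Aktivitaet gebundener Knoten: Skalarprodukt I * zJ (= cos theta bei
        # Einheitsvektoren); liegt der Maximalwert unter rho*, wird eine neue
        # Kategorie signalisiert (None). Bei Gleichheit gewinnt der zuerst
        # gefundene Knoten (max liefert den kleinsten Index).
        schwelle = rho_stern(rho)
        aktivitaeten = [sum(i * z for i, z in zip(I, zJ)) for zJ in ltm_vektoren]
        if not aktivitaeten:
            return None
        J = max(range(len(aktivitaeten)), key=lambda j: aktivitaeten[j])
        return J if aktivitaeten[J] >= schwelle else None

    # Beispiel mit normierten Vektoren und rho = 0.95 (rho* ~ 0.80)
    I = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]
    ltm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(waehle_knoten(I, ltm, rho=0.95))   # cos theta = 0.707 < rho* -> None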
4. VORGENOMMENE OPTIMIERUNG
Dieses Kapitel dient der Beschreibung der verwendeten Abschätzung und ihrer Implementierung in das
Art2-a System. Abschließend wird dann noch auf eine weitere Verbesserung, die sich durch diese
Implementierung ergibt, eingegangen. Der Aufbau des Abschnitts ist dabei wie folgt: In Unterabschnitt
4.1 wird das Verfahren zur Abschätzung des Winkels vorgestellt. In dem folgenden Unterabschnitt 4.2
wird dann gezeigt, wie man diese Abschätzung in das Art2-a System integrieren kann. In dem letzten
Unterabschnitt folgt dann eine Vorstellung der Abschätzung als Index für die Knoten.

4.1 Abschätzung des Winkels
In [7] wird eine Methode zur Schätzung der Distanz zwischen einem Anfragevektor und einem Datenvektor
beschrieben. Im Folgenden wird beschrieben, wie man Teile dieses Verfahrens nutzen kann, um damit die
Menge möglicher Knoten schon vor der Berechnung des Aktivitätswertes T_J zu verringern. Das Ziel ist
es, die teure Berechnung des Skalarproduktes zwischen I und z_J^{*} möglichst selten auszuführen und
gleichzeitig möglichst wenig Daten im Speicher vorrätig halten zu müssen. Dafür wird der unbekannte
Winkel θ zwischen den beiden Vektoren P und Q durch die bekannten Winkel α und β zwischen beiden
Vektoren und einer festen Achse T wie folgt abgeschätzt:

    \cos\theta \le \cos(|\alpha-\beta|) = \cos(\alpha-\beta)
               = \cos\alpha\cos\beta + \sin\alpha\sin\beta
               = \cos\alpha\cos\beta + \sqrt{1-\cos^{2}\alpha}\,\sqrt{1-\cos^{2}\beta}    (6)
Abbildung 2: Erweiterung des Art2 Systems mit einem neuen Feld für die Abschätzung des Winkels
(Estimation Field als Schnittstelle zwischen F0 und F2, Knoten 1 bis n in F2, Summen S als LTM)

Als Achse T wird hierbei ein n-dimensionaler, mit Einsen gefüllter Vektor verwendet, wodurch für die
L2-Norm des Achsenvektors \|T\| = \sqrt{n} folgt. Eingesetzt in die Formel

    \cos\theta = \frac{\langle P, Q \rangle}{\|P\|\,\|Q\|}

ergibt sich damit, unter Ausnutzung von (1), für das System mit den Vektoren I und z_J^{*}:

    \cos\alpha = \frac{\sum_{i=1}^{n} I_i}{\sqrt{n}} \quad\text{und}\quad \cos\beta = \frac{\sum_{i=1}^{n} z_{J i}^{*}}{\sqrt{n}}

Mit S^{I} und S^{z_J^{*}} als jeweiliger Summe der Vektorwerte reduziert sich, unter Verwendung der
Formel (6), die Abschätzung des Kosinus vom Winkel θ auf

    \cos\theta \le \frac{S^{I} \cdot S^{z_J^{*}}}{n} + \sqrt{\Bigl(1-\frac{(S^{I})^{2}}{n}\Bigr)\Bigl(1-\frac{(S^{z_J^{*}})^{2}}{n}\Bigr)}
               = \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n}

Diese Abschätzung ermöglicht es nun, die Menge der Kandidaten möglicher Knoten für einen Eingabevektor
I vorzeitig zu reduzieren, indem man ausnutzt, dass der Kosinus des wirklichen Winkels zwischen
Eingabevektor und in der LTM gespeichertem Vektor maximal genauso groß ist wie der mit der gezeigten
Formel geschätzte Wert. Damit ist diese Abschätzung des wirklichen Winkels θ verlustfrei, denn es
können keine Knoten mit einem tatsächlich größeren Kosinuswert des Winkels aus der Menge an Kandidaten
entfernt werden. Daraus resultiert, dass ein Knoten nur dann weiter betrachtet werden muss, wenn die
Bedingung

    \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n} \ge \rho^{*}    (7)

erfüllt wird.
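Die folgende Python-Skizze rechnet die Schranke aus Bedingung (7) für zufällige normierte Vektoren
nach und zeigt, dass der tatsächliche Kosinus die Schranke nie überschreitet. Sie dient nur der
Illustration der Formel; alle Namen sind frei gewählt.

    import math, random

    def kosinus_schranke(S_I, S_z, n):
        # obere Schranke fuer cos(theta) gemaess Formel (6)/(7),
        # berechnet nur aus den Summen der (normierten) Vektorwerte
        return (S_I * S_z + math.sqrt((n - S_I**2) * (n - S_z**2))) / n

    def normiere(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    random.seed(0)
    n = 128                                   # Dimension der SIFT-Deskriptoren
    I  = normiere([random.random() for _ in range(n)])
    zJ = normiere([random.random() for _ in range(n)])

    cos_theta = sum(i * z for i, z in zip(I, zJ))
    schranke  = kosinus_schranke(sum(I), sum(zJ), n)
    print(cos_theta <= schranke + 1e-12)      # True: Abschaetzung ist verlustfrei

    # Kandidatenfilter: Knoten J muss nur betrachtet werden, wenn
    # kosinus_schranke(S_I, S_zJ, n) >= rho_stern gilt.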
4.2 Erweiterung des Systems
Damit die Bedingung (7) ausgenutzt werden kann, wird das Art2 System um ein weiteres Feld, im
Folgenden Estimation Field genannt, erweitert. Dieses Feld soll als Schnittstelle zwischen F0 und F2
dienen und die Abschätzung des Winkels zwischen dem Eingabevektor und dem gespeicherten LTM Vektor
vornehmen. Dazu wird dem Feld, wie in Abbildung 2 gezeigt wird, von dem Feld F0 die Summe S^{I}
übergeben. Innerhalb des Feldes gibt es dann für jeden Knoten J im Feld F2 eine zugehörige Estimation
Unit J'. In der Verbindung von jedem Knoten J zu der ihm zugehörigen Estimation Unit J' wird die
Summe der Werte des jeweiligen LTM Vektors S^{z_J^{*}} als LTM gespeichert. Die Estimation Unit
berechnet dann die Funktion

    f(J) = \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n}

für den ihr zugehörigen Knoten J. Abschließend wird als Aktivierungsfunktion, für die Berechnung der
Ausgabe o_{J'} der Estimation Unit J', die folgende Schwellenfunktion verwendet:

    o_{J'} = \begin{cases} 1, & \text{wenn } f(J) \ge \rho^{*} \\ 0, & \text{sonst} \end{cases}    (8)

Damit ergibt sich für die Aktivitätsberechnung jedes Knotens des F2 Feldes die angepasste Formel

    T_J = \begin{cases} \alpha \cdot \sum_{i} I_i, & \text{wenn } J \text{ nicht gebunden ist,} \\ I \cdot z_J^{*}, & \text{wenn } J \text{ gebunden ist und } o_{J'} = 1 \text{ gilt,} \\ 0, & \text{wenn } o_{J'} = 0 \text{ gilt} \end{cases}    (9)

mit o_{J'} als Ausgabe des Estimation Field zu Knoten J.
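Eine mögliche, stark vereinfachte Umsetzung des Estimation Fields könnte wie folgt aussehen (Annahme:
normierte Vektoren, pro Knoten ist nur die Summe S^{z_J*} gespeichert; alle Namen sind frei gewählt).
Das teure Skalarprodukt wird gemäß (8) und (9) nur berechnet, wenn die Schwellenfunktion den Wert 1
liefert.

    import math

    def estimation_unit(S_I, S_z, n, rho_stern):
        # Schwellenfunktion (8): 1, falls die Schranke f(J) >= rho* ist, sonst 0
        f = (S_I * S_z + math.sqrt((n - S_I**2) * (n - S_z**2))) / n
        return 1 if f >= rho_stern else 0

    def aktivitaet(I, zJ, S_I, S_z, rho_stern):
        # angepasste Aktivitaet (9) fuer einen gebundenen Knoten J:
        # o_J' = 0  ->  T_J = 0, das Skalarprodukt entfaellt
        n = len(I)
        if estimation_unit(S_I, S_z, n, rho_stern) == 0:
            return 0.0
        return sum(i * z for i, z in zip(I, zJ))

    def gewinner(I, ltm_vektoren, rho_stern):
        # Gewinnersuche nur ueber Knoten, die der Filter passieren laesst
        if not ltm_vektoren:
            return None
        S_I = sum(I)
        summen = [sum(z) for z in ltm_vektoren]      # S^{zJ*} je Knoten (LTM)
        werte = [aktivitaet(I, z, S_I, s, rho_stern)
                 for z, s in zip(ltm_vektoren, summen)]
        J = max(range(len(werte)), key=lambda j: werte[j])
        return J if werte[J] >= rho_stern else None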
4.3 Verwendung als Index
Durch die gezeigte Kosinusschätzung werden unnötige Skalarprodukte vermieden und somit das System
beschleunigt. Allerdings kann es bei weiterhin wachsender Anzahl der Knoten, zum Beispiel weil der
Arbeitsspeicher nicht ausreicht, nötig werden, nicht mehr alle LTM Vektoren im Speicher zu halten,
sondern nur ein Set möglicher Kandidaten zu laden und diese dann gezielt zu analysieren. In dem
folgenden Abschnitt wird gezeigt, wie die Abschätzung sinnvoll als Index für die Knoten verwendet
werden kann.
Für die Indexierung wird als Indexstruktur ein B+-Baum mit der Summe der Werte jedes LTM Vektors
S^{z_J^{*}} und der ID J des Knotens als zusammengesetztem Schlüssel verwendet. Für die
Sortierreihenfolge gilt: Zuerst wird nach S^{z_J^{*}} sortiert und dann nach J. Dadurch bleibt der
B+-Baum für partielle Bereichsanfragen nach dem Wert der Summe optimiert. Damit das funktioniert,
muss allerdings die Suche so angepasst werden, dass sie bei einer partiellen Bereichsanfrage für die
ID den kleinstmöglichen Wert einsetzt und dann bei der Ankunft in einem Blatt der Sortierung bis zum
ersten Vorkommen der gesuchten Summe folgt, auch über Blattgrenzen hinweg.
Dieser Index wird nun verwendet, um die Menge der Kandidaten einzuschränken, ohne, wie in der vorher
vorgestellten Optimierung durch die Estimation Unit, alle Knoten durchlaufen zu müssen. Anschaulich
bedeutet das, dass das Art2-a System nur noch die der Menge an Kandidaten
für den Eingabevektor I angehörenden Knoten sehen soll und somit nur in diesen den Gewinnerknoten
suchen muss. Für diesen Zweck müssen mögliche Wertebereiche der gespeicherten S^{z_J^{*}} für einen
beliebigen Eingabevektor festgelegt werden. Dies geschieht wieder mit Hilfe der Bedingung (7):

    \frac{S^{I} \cdot S^{z_J^{*}} + \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})}}{n} \ge \rho
    \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})} \ge \rho \cdot n - S^{I} \cdot S^{z_J^{*}}

Für ρ·n − S^{I}·S^{z_J^{*}} < 0 ist diese Ungleichung offensichtlich immer erfüllt, da die
Quadratwurzel auf der linken Seite nie negativ ist. Damit ergibt sich die erste Bedingung:

    S^{z_J^{*}} > \frac{\rho \cdot n}{S^{I}}    (10)

Nun wird im Folgenden noch der Fall ρ·n ≥ S^{I}·S^{z_J^{*}} weiter betrachtet:

    \sqrt{(n-(S^{I})^{2})(n-(S^{z_J^{*}})^{2})} \ge \rho \cdot n - S^{I} \cdot S^{z_J^{*}}
    n \cdot (1-\rho^{2}) - (S^{I})^{2} \ge (S^{z_J^{*}})^{2} - 2\rho\, S^{I} S^{z_J^{*}}
    (n - (S^{I})^{2})(1-\rho^{2}) \ge (S^{z_J^{*}} - \rho \cdot S^{I})^{2}

Damit ergibt sich:

    \sqrt{(n-(S^{I})^{2})(1-\rho^{2})} \ge S^{z_J^{*}} - \rho \cdot S^{I} \ge -\sqrt{(n-(S^{I})^{2})(1-\rho^{2})}    (11)

Mit den Bedingungen (10) und (11) können nun die partiellen Bereichsanfragen an den Index für einen
beliebigen Eingabevektor I wie folgt formuliert werden:

    r_1 = \Bigl[\rho S^{I} - \sqrt{(n-(S^{I})^{2})(1-\rho^{2})},\; \rho S^{I} + \sqrt{(n-(S^{I})^{2})(1-\rho^{2})}\Bigr]
    r_2 = \Bigl[\tfrac{\rho \cdot n}{S^{I}},\; \infty\Bigr]

Da für diese Bereichsanfragen die Bedingung (7) genutzt wird und somit alle geschätzten Kosinuswerte
größer als ρ* sind, hat bei der Verwendung des Indexes das Estimation Field keinen Effekt mehr.
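Zur Veranschaulichung zeigt die folgende Skizze, wie die Bereichsgrenzen r1 und r2 aus den Bedingungen
(10) und (11) berechnet und gegen einen nach (Summe, Knoten-ID) sortierten Index ausgewertet werden
könnten. Ein echter B+-Baum und die beschriebene Blattsuche werden hier nicht nachgebildet; alle Namen
und Beispielwerte sind frei gewählt.

    import bisect
    import math

    def bereichsgrenzen(S_I, n, rho):
        # r1 aus Bedingung (11), r2 aus Bedingung (10)
        w = math.sqrt((n - S_I**2) * (1 - rho**2))
        r1 = (rho * S_I - w, rho * S_I + w)
        r2 = (rho * n / S_I, math.inf)
        return r1, r2

    def kandidaten(index, S_I, n, rho):
        # index: nach (Summe, Knoten-ID) sortierte Liste als Ersatz fuer den
        # im Text vorgeschlagenen B+-Baum; liefert die IDs aller Knoten,
        # deren Summe in r1 oder r2 faellt.
        r1, r2 = bereichsgrenzen(S_I, n, rho)
        treffer = set()
        for lo, hi in (r1, r2):
            links = bisect.bisect_left(index, (lo, -math.inf))
            rechts = bisect.bisect_right(index, (hi, math.inf))
            treffer.update(knoten_id for _, knoten_id in index[links:rechts])
        return treffer

    # Beispiel: Index mit drei Knoten (Summe, ID), Eingabesumme S_I, n = 128
    index = sorted([(9.5, 0), (10.2, 1), (3.1, 2)])
    print(kandidaten(index, S_I=10.0, n=128, rho=0.999))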
                                                                             den Zeitverlust durch den größeren Aufwand weit übersteigt.
                                                                             Da aber möglichst genaue Kategorien erwünscht sind, ist ein
5. EVALUATION                                                                hoher Vigilance Parameter die richtige Wahl. Deshalb kann
   In diesem Kapitel wird die gezeigte Abschätzung evaluiert.               das gezeigte Verfahren für das angestrebte System adaptiert
Der vorgeschlagene Index wird dabei aber noch nicht berück-                 werden.
sichtigt.

5.1 Versuchsaufbau                                                           6.    RESÜMEE UND AUSBLICK
  Für die Evaluierung des gezeigten Ansatzes wurde ein                        In dieser Arbeit wurde eine Optimierung des Art2-a Sys-
Computer mit einem Intel Core 2 Duo E8400 3 GHz als                          tems vorgestellt, die durch Abschätzung des Winkels zwis-
Prozesser und 4 GB RAM benutzt.                                              chen Eingabevektor und gespeichertem Vektor die Menge
  Als Datensatz wurden Bilder von Flugzeugen aus dem                         an zu überprüfenden Kandidaten für hohe Vigilance Werte
Caltech 101 Datensatz [4] verwendet. Diese Bilder zeigen                     stark reduzieren kann. Des Weiteren wurde ein Ansatz zur
verschiedene Flugzeuge auf dem Boden beziehungsweise in                      Indexierung der Knoten basierend auf der für die Abschätz-
der Luft. Für den Geschwindigkeitstest wurden 20 Bilder                     ung nötigen Summe vorgestellt. Auch wenn eine abschlie-
aus dem Pool ausgewählt und nacheinander dem neuronalen                     ßende Analyse des gezeigten noch offen ist, so scheint dieser
Netz präsentiert. Im Schnitt produzieren die benutzten Bil-                 Ansatz dennoch erfolgversprechend für die erwünschten ho-
der dabei 4871 SIFT Feature Vektoren pro Bild.                               hen Vigilance Werte.
  Bedingt dadurch, dass die Ansätze verlustfrei sind, wird                    Aufbauend auf dem gezeigten wird unsere weitere For-
nur die Rechenzeit der verschiedenen Verfahren gegenüber                    schung die folgenden Punkte beinhalten:
• Es wird geprüft, ob die Abschätzung durch die Hinzunahme weiterer Daten verbessert werden kann und
  somit eine weitere Beschleunigung erzielt wird. Dafür kann man, um das Problem der zu geringen
  Präzision der Abschätzung bei kleinerem Vigilance Parameter zu umgehen, die Vektoren teilen und die
  Abschätzung wie in [7] aus den Teilsegmenten der Vektoren zusammensetzen. Dafür bräuchte man aber
  auch die Summe der Quadrate, da die Teilsegmente der Vektoren keine Einheitsvektoren mehr sind.
  Deshalb wird sich noch zeigen, ob der Gewinn an Präzision durch eine Aufteilung den größeren Aufwand
  durch Berechnung und Speicherung weiterer Werte rechtfertigt. Des Weiteren soll damit überprüft
  werden, ob die Abschätzung auch für kleinere Vigilance Werte verwendet werden kann.

• Es wird überprüft, wie groß die Auswirkungen der vorgestellten Verfahren bei einer parallelen
  Berechnung des Gewinnerknotens sind. Des Weiteren wird das Verfahren auf größeren Datenmengen
  getestet, um zu überprüfen, ob eine weitere Beschleunigung nötig ist, damit man das Verfahren im
  Live-Betrieb verwenden kann.

• Die Verwendung der Abschätzung zum Indexieren soll getestet und mit anderen Indexierungsverfahren
  verglichen werden, um ihren Nutzen besser bewerten zu können. Aber vor allem ihre Auswirkungen auf
  das Art2-a System im parallelisierten Betrieb sind noch offen und werden überprüft.

• Danach werden wir die Analyseeinheit konzipieren. Dafür wird als erstes überprüft, welche Daten man
  für ein fortlaufendes Lernen braucht, um einem Objekt keine falschen neuen Kategorien zuzuweisen
  oder richtige Kategorien zu entfernen. Danach soll ein geeignetes neuronales Netz aufgebaut werden,
  um damit die Zuordnung der Kategorien zu den Objekten durchführen zu können. Das Netz muss dann an
  die vorher erhobenen Daten angepasst werden, um die Präzision des Netzes zu erhöhen. Abschließend
  wird das Verfahren dann gegen andere populäre Verfahren getestet.

7. REFERENZEN
[1] G. A. Carpenter and S. Grossberg. ART 2: Self-organization of stable category recognition codes
    for analog input patterns. Applied Optics, 26(23):4919–4930, 1987.
[2] G. A. Carpenter, S. Grossberg, and D. B. Rosen. ART 2-A: An adaptive resonance algorithm for
    rapid category learning and recognition. Neural Networks, 4:493–504, 1991.
[3] S.-C. Chuang, Y.-Y. Xu, H. C. Fu, and H.-C. Huang. A multiple-instance neural networks based
    image content retrieval system. In Proceedings of the First International Conference on
    Innovative Computing, Information and Control, volume 2, pages 412–415, 2006.
[4] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training
    examples: An incremental bayesian approach tested on 101 object categories. CVPR 2004, Workshop
    on Generative-Model Based Vision, 2004.
[5] S. Grossberg. Adaptive pattern classification and universal recoding: II. Feedback, expectation,
    olfaction, illusions. Biological Cybernetics, 23:187–202, 1976.
[6] B. Jyothi and D. Shanker. Neural network approach for image retrieval based on preference
    elicitation. International Journal on Computer Science and Engineering, 2(4):934–941, 2010.
[7] Y. Kim, C.-W. Chung, S.-L. Lee, and D.-H. Kim. Distance approximation techniques to reduce the
    dimensionality for multimedia databases. Knowledge and Information Systems, 2010.
[8] J. T. Laaksonen, J. M. Koskela, and E. Oja. PicSOM: A framework for content-based image database
    retrieval using self-organizing maps. In 11th Scandinavian Conference on Image Analysis,
    pages 151–156, 1999.
[9] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the
    International Conference on Computer Vision, 1999.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of
    Computer Vision, 60:91–110, 2004.
[11] K. N. S., Čabarkapa Slobodan K., Z. G. J., and R. B. D. Implementation of neural network in
    CBIR systems with relevance feedback. Journal of Automatic Control, 16:41–45, 2006.
[12] H.-J. Wang and C.-Y. Chang. Semantic real-world image classification for image retrieval with
    fuzzy-ART neural network. Neural Computing and Applications, 21(8):2137–2151, 2012.
Auffinden von Spaltenkorrelationen mithilfe proaktiver und
                   reaktiver Verfahren

                                                       Katharina Büchse
                                                   Friedrich-Schiller-Universität
                                                       Institut für Informatik
                                                        Ernst-Abbe-Platz 2
                                                            07743 Jena
                                             katharina.buechse@uni-jena.de

KURZFASSUNG
Zur Verbesserung von Statistikdaten in relationalen Datenbanksystemen werden seit einigen Jahren
Verfahren für das Finden von Korrelationen zwischen zwei oder mehr Spalten entwickelt. Dieses Wissen
über Korrelationen ist notwendig, weil der Optimizer des Datenbankmanagementsystems (DBMS) bei der
Anfrageplanerstellung sonst von Unabhängigkeit der Daten ausgeht, was wiederum zu groben Fehlern bei
der Kostenschätzung und somit zu schlechten Ausführungsplänen führen kann.
Die entsprechenden Verfahren gliedern sich grob in proaktive und reaktive Verfahren: Erstere liefern
ein gutes Gesamtbild über sämtliche vorhandenen Daten, müssen dazu allerdings selbst regelmäßig auf
die Daten zugreifen und benötigen somit Kapazität des DBMS. Letztere überwachen und analysieren
hingegen die Anfrageergebnisse und liefern daher nur Korrelationsannahmen für bereits abgefragte
Daten, was einerseits das bisherige Nutzerinteresse sehr gut widerspiegelt, andererseits aber bei
Änderungen des Workloads versagen kann. Dafür wird einzig bei der Überwachung der Anfragen
DBMS-Kapazität benötigt, es erfolgt kein eigenständiger Zugriff auf die Daten.
Im Zuge dieser Arbeit werden beide Ansätze miteinander verbunden, um ihre jeweiligen Vorteile
auszunutzen. Dazu werden die sich ergebenden Herausforderungen, wie sich widersprechende
Korrelationsannahmen, aufgezeigt und als Lösungsansatz u. a. der zusätzliche Einsatz von reaktiv
erstellten Statistiken vorgeschlagen.

Categories and Subject Descriptors
H.2 [Information Systems]: Database Management; H.2.4 [Database Management]: Systems—Query processing

General Terms
Theory, Performance

Keywords
Anfrageoptimierung, Spaltenkorrelation, Feedback

1. EINFÜHRUNG
Die Verwaltung großer Datenmengen benötigt zunehmend leistungsfähigere Algorithmen, da die
Verbesserung der Technik (Hardware) nicht mit dem immer höheren Datenaufkommen heutiger Zeit mithalten
kann. Bspw. werden wissenschaftliche Messergebnisse aufgrund besserer Messtechnik immer genauer und
umfangreicher, sodass Wissenschaftler sie detaillierter, aber auch umfassender analysieren wollen und
müssen, oder Online-Shops speichern sämtliche ihrer Verkaufsdaten und werten sie aus, um dem Benutzer
passend zu seinen Interessen zeitnah und individuell neue Angebote machen zu können.
Zur Verwaltung dieser wie auch anderer Daten sind (im Datenbankbereich) insbesondere schlaue Optimizer
gefragt, weil sie für die Erstellung der Anfragepläne (und somit für die Ausführungszeit einer jeden
Anfrage) verantwortlich sind. Damit sie in ihrer Wahl nicht völlig danebengreifen, gibt es
Statistiken, anhand derer sie eine ungefähre Vorstellung bekommen, wie die vorhandene Datenlandschaft
aussieht. Hierbei ist insbesondere die zu erwartende Tupelanzahl von Interesse, da sie in hohem Maße
die Ausführungszeit einer Anfrage beeinflusst. Je besser die Statistiken die Verteilung der Daten
wiedergeben (und je aktueller sie sind), desto besser ist der resultierende Ausführungsplan. Sind die
Daten unkorreliert (was leider sehr unwahrscheinlich ist), genügt es, pro zu betrachtender Spalte die
Verteilung der Werte innerhalb dieser Spalte zu speichern. Treten in diesem Fall später in den
Anfragen Kombinationen der Spalten auf, ergibt sich die zu erwartende Tupelanzahl mithilfe einfacher
statistischer Weisheiten (durch Multiplikation der Einzelwahrscheinlichkeiten).
Leider versagen diese ab einem bestimmten Korrelationsgrad (also bei korrelierten Daten), und zwar in
dem Sinne, dass die vom Optimizer berechneten Schätzwerte zu stark von der Wirklichkeit abweichen,
was wiederum zu schlechten Ausführungszeiten führt. Diese ließen sich u. U. durch die Wahl eines
anderen Plans, welcher unter Berücksichtigung der Korrelation vom Optimizer erstellt wurde, verringern
oder sogar vermeiden.
Zur Veranschaulichung betrachten wir eine Tabelle, welche u. a. die Spalten A und B besitzt, und eine
Anfrage, welche Teile eben dieser Spalten ausgeben soll. Des Weiteren liege auf Spalte B ein Index,
den wir mit IB bezeichnen wol-
len, und es existiere ein zusammengesetzter Index IA,B für beide Spalten. Beide Indizes seien im DBMS
mithilfe von Bäumen (bspw. B∗-Bäume) implementiert, sodass wir auch (etwas informell) von „flachen“
oder „hohen“ Indizes sprechen können.
Sind beide Spalten unkorreliert, so lohnt sich in der Regel die Abfrage über IA,B. Bei einer starken
Korrelation beider Spalten dagegen könnte die alleinige Verwendung von IB vorteilhaft sein, und zwar
wenn die Werte aus Spalte A i. d. R. durch die Werte aus Spalte B bestimmt werden (ein typisches
Beispiel, welches auch in CORDS [7] anzutreffen ist, wäre eine Tabelle „Auto“ mit den Spalten
A = „Firma“ und B = „Marke“, sodass sich für A Werte wie „Opel“ oder „Mercedes“ und für B Werte wie
„Zafira“ oder „S-Klasse“ ergeben). Statt nun im vergleichsweise hohen Index IA,B erst passende A- und
dann passende B-Werte zu suchen, werden sämtliche Tupel, welche die gewünschten B-Werte enthalten,
über den flacheren Index IB geladen und überprüft, ob die jeweiligen A-Werte der Anfrage entsprechen
(was aufgrund der Abhängigkeit der Regelfall sein sollte).
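Der beschriebene Effekt lässt sich an einem kleinen, frei erfundenen Zahlenbeispiel nachrechnen: Unter
der Unabhängigkeitsannahme schätzt der Optimizer die Trefferzahl als Produkt der
Einzelwahrscheinlichkeiten; bei einer funktionalen Abhängigkeit wie zwischen Marke und Firma liegt er
damit deutlich daneben.

    # Hypothetisches Zahlenbeispiel zur Unabhaengigkeitsannahme:
    # Tabelle "Auto" mit N Tupeln, Anfrage: Firma = 'Opel' AND Marke = 'Zafira'
    N = 1_000_000
    sel_firma = 0.10     # 10 % aller Tupel haben Firma = 'Opel'
    sel_marke = 0.01     #  1 % aller Tupel haben Marke = 'Zafira'

    # Schaetzung unter Unabhaengigkeit: Multiplikation der Einzelwahrscheinlichkeiten
    schaetzung = N * sel_firma * sel_marke
    print(schaetzung)            # 1000 Tupel

    # Tatsaechlich bestimmt die Marke die Firma (funktionale Abhaengigkeit):
    # jedes 'Zafira'-Tupel ist auch ein 'Opel'-Tupel.
    tatsaechlich = N * sel_marke
    print(tatsaechlich)          # 10000 Tupel, also um den Faktor 10 unterschaetzt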
                                                                  sollte das klar sein. Aber nehmen wir an, dass die Kombi-
   Das Wissen über Korrelationen fällt aber natürlich nicht    nation von beiden Methoden analysiert wurde. Da für die
vom Himmel, es hat seinen Preis. Jeder Datenbänkler hofft,       Analyse höchstwahrscheinlich jeweils unterschiedliche Tupel
dass seine Daten unkorreliert sind, weil sein DBMS dann we-       (Wertekombinationen) verwendet wurden, können sich na-
niger Metadaten (also Daten über die Daten) speichern und        türlich auch die Schlüsse unterscheiden. Hier stellt sich nun
verwalten muss, sondern auf die bereits erwähnten statis-        die Frage, welches Ergebnis besser“ ist. Dafür gibt es kei-
                                                                                                 ”
tischen Weisheiten zurückgreifen kann. Sind die Daten da-        ne allgemeine Antwort, gehen wir aber von einer modera-
gegen (stark) korreliert, lässt sich die Erkenntnis darüber     ten Änderung des Anfrageverhaltens aus, ist sicherlich das
nicht so einfach wie die Unabhängigkeit mit (anderen) sta-         reaktive Ergebnis“ kurzfristig entscheidender, während das
                                                                  ”
tistischen Weisheiten verbinden und somit abarbeiten“.              proaktive Ergebnis“ in die längerfristige Planung der Sta-
                                              ”                   ”
   Nicht jede (eher kaum eine) Korrelation stellt eine (schwa-    tistikerstellung mit aufgenommen werden sollte.
che) funktionale Abhängigkeit dar, wie es im Beispiel der
Fall war, wo wir einfach sagen konnten Aus der Marke folgt
                                         ”
die Firma (bis auf wenige Ausnahmen)“. Oft liebäugeln be-        2. GRUNDLAGEN
stimmte Werte der einen Spalte mit bestimmten Werten an-             Wie in der Einleitung bereits angedeutet, können Korrela-
derer Spalten, ohne sich jedoch in irgendeiner Weise auf diese    tionen einem Datenbanknutzer den Tag vermiesen. Um dies
Kombinationen zu beschränken. (In Stuttgart gibt es sicher-      zu verhindern, wurden einige Methoden vorgeschlagen, wel-
lich eine Menge Porsches, aber die gibt es woanders auch.)        che sich auf verschiedene Art und Weise dieser Problematik
Außerdem ändern sie möglicherweise mit der Zeit ihre Vor-       annehmen (z. B. [7, 6]) oder sie sogar ausnutzen (z. B. [4, 8]),
lieben (das Stuttgarter Porschewerk könnte bspw. nach Chi-       um noch an Performance zuzulegen. Letztere sind allerdings
na umziehen) oder schaffen letztere völlig ab (wer braucht       mit hohem Aufwand oder der Möglichkeit, fehlerhafte An-
schon einen Porsche? Oder überhaupt ein Auto?).                  frageergebnisse zu liefern1 , verbunden. Daher konzentrieren
   Deswegen werden für korrelierte Daten zusätzliche Sta-       wir uns hier auf das Erkennen von Korrelationen allein zur
tistiken benötigt, welche nicht nur die Werteverteilung ei-      Verbesserung der Statistiken und wollen hierbei zwischen
ner, sondern die Werteverteilung mehrerer Spalten wiederge-       proaktiven und reaktiven Verfahren unterscheiden.
ben. Diese zusätzlichen Statistiken müssen natürlich irgend-
wo abgespeichert und, was noch viel schlimmer ist, gewartet       2.1 Proaktive (datengetriebene) Verfahren
werden. Somit ergeben sich zusätzlicher Speicherbedarf und         Proaktiv zu handeln bedeutet, etwas auf Verdacht“ zu
                                                                                                             ”
zusätzlicher Aufwand, also viel zu viel von dem, was keiner      tun. Impfungen sind dafür ein gutes Beispiel – mithilfe ei-
so richtig will.                                                  ner Impfung ist der Körper in der Lage, Krankheitserreger
                                                                  zu bekämpfen, aber in vielen Fällen ist unklar, ob er die-
   Da sich ein bisschen statistische Korrelation im Grunde        se Fähigkeit jemals benötigen wird. Da Impfungen auch mit
überall findet, gilt es, die Korrelationen ausfindig zu ma-      Nebenwirkungen verbunden sein können, muss jeder für sich
chen, welche unsere statistischen Weisheiten alt aussehen         entscheiden, ob und wogegen er sich impfen lässt.
lassen und dazu führen, dass das Anfrageergebnis erst nach         Auch Datenbanken können geimpft“ werden, allerdings
                                                                                                  ”
einer gefühlten halben Ewigkeit ausgeben wird. Ob letzte-        handelt es sich bei langen Anfrageausführungszeiten (die
res überhaupt passiert, hängt natürlich auch vom Anfrage-      wir ja bekämpfen wollen) eher um Symptome (wie Bauch-
verhalten auf die Datenbank ab. Wenn die Benutzer sich            schmerzen oder eine laufende Nase), die natürlich unter-
in ihren (meist mithilfe von Anwendungsprogrammen abge-           schiedliche Ursachen haben können. Eine davon bilden ganz
setzten) SQL-Anfragen in der WHERE-Klausel jeweils auf
                                                                  1
eine Spalte beschränken und auf jedwede Verbünde (Joins)          Da die Verfahren direkt in die Anfrageplanerstellung ein-
verzichten, dann ist die Welt in Ordnung. Leider lassen sich      greifen und dabei auf ihr Wissen über Korrelationen aufbau-
Benutzer nur ungern so stark einschränken.                       en, muss, für ein korrektes Anfrageergebnis, dieses Wissen
                                                                  aktuell und vollständig sein.
Der grobe „Impfvorgang“ „gegen“ Korrelationen umfasst zwei Schritte:

1. Es werden Vermutungen aufgestellt, welche Spaltenkombinationen für spätere Anfragen eine Rolle spielen könnten.

2. Es wird kontrolliert, ob diese Kombinationen von Korrelation betroffen sind oder nicht.

Entscheidend dabei ist, dass die Daten bzw. ein Teil der Daten gelesen (und analysiert) werden, und zwar ohne damit konkrete Anfragen zu bedienen, sondern rein zur Ausführung des Verfahrens bzw. der „Impfung“ (in diesem Fall „gegen“ Korrelation, wobei die Korrelation natürlich nicht beseitigt wird, schließlich können wir schlecht den Datenbestand ändern, sondern die Datenbank lernt, damit umzugehen). Das Lesen und Analysieren kostet natürlich Zeit, womit klar wird, dass auch diese „Impfung“ „Nebenwirkungen“ mit sich bringt.

Eine konkrete Umsetzung haben Ilyas et al., aufbauend auf BHUNT [4], mit CORDS [7] vorgestellt. Dieses Verfahren findet Korrelationen zwischen Spaltenpaaren, die Spaltenanzahl pro Spaltenkombination wurde also auf zwei begrenzt.²

Es geht folgendermaßen vor: Im ersten „Impfschritt“ sucht es mithilfe des Katalogs oder mittels Stichproben nach Schlüssel-Fremdschlüssel-Beziehungen und führt somit eine Art Rückabbildung von Datenbank zu Datenmodell durch (engl. „reverse engineering“) [4]. Darauf aufbauend werden dann nur solche Spaltenkombinationen als für die Korrelationssuche infrage kommend angesehen, deren Spalten

a) aus derselben Tabelle stammen oder

b) aus einer Verbundtabelle stammen, wobei der Verbund („Join“) mittels (Un-)Gleichung zwischen Schlüssel- und Fremdschlüsselspalten entstanden ist.

Zudem gibt es zusätzliche Reduktionsregeln (engl. „pruning rules“) für das Finden der Beziehungen und für die Auswahl der zu betrachtenden Spaltenkombinationen. Schließlich kann die Spaltenanzahl sehr hoch sein, was die Anzahl an möglichen Kombinationen gegebenenfalls ins Unermessliche steigert.

Im zweiten „Impfschritt“ wird für jede Spaltenkombination eine Stichprobe entnommen und darauf aufbauend eine Kontingenztabelle erstellt. Letztere dient dann wiederum als Grundlage für einen Chi-Quadrat-Test, der als Ergebnis eine Zahl χ² ≥ 0 liefert. Gilt χ² = 0, so sind die Spalten vollständig unabhängig. Da dieser Fall aber in der Praxis kaum auftritt, muss χ² einen gewissen Schwellwert überschreiten, damit die entsprechende Spaltenkombination als korreliert angesehen wird. Zum Schluss wird eine Art Rangliste der Spaltenkombinationen mit den höchsten χ²-Werten erstellt, und für die obersten n Kombinationen werden zusätzliche Statistikdaten angelegt. Die Zahl n ist dabei u. a. durch die Größe des Speicherplatzes (für Statistikdaten) begrenzt.

² Die Begrenzung wird damit begründet, dass auf diese Weise das beste Aufwand-Nutzen-Verhältnis entsteht. Das Verfahren selbst ist nicht auf Spaltenpaare beschränkt.

2.2 Reaktive (anfragegetriebene) Verfahren

Während wir im vorherigen Abschnitt Vermutungen aufgestellt und auf Verdacht gehandelt haben, um den Datenbankbenutzer glücklich zu machen, gehen wir jetzt davon aus, dass den Benutzer auch weiterhin das interessieren wird, wofür er sich bis jetzt interessiert hat.

Wir ziehen also aus der Vergangenheit Rückschlüsse für die Zukunft, und zwar indem wir den Benutzer bei seinem Tun beobachten und darauf reagieren (daher auch die Bezeichnung „reaktiv“). Dabei achten wir nicht allein auf die gestellten SQL-Anfragen, sondern überwachen vielmehr die von der Datenbank zurückgegebenen Anfrageergebnisse. Diese verraten uns nämlich alles (jeweils 100-prozentig aktuell!) über den Teil der vorhandenen Datenlandschaft, den der Benutzer bis jetzt interessant fand.

Auf diese Weise können bspw. Statistiken erzeugt werden [5, 11, 3] (wobei STHoles [5] und ISOMER [11] sogar in der Lage sind, mehrdimensionale Statistiken zu erstellen), oder es lassen sich mithilfe alter Anfragen neue, ähnliche Anfragen in ihrer Performance verbessern [12]. Sinnvoll kann auch eine Unterbrechung der Anfrageausführung mit damit verbundener Reoptimierung sein [9, 2, 10]. Zu guter Letzt lässt sich mithilfe dieses Ansatzes zumindest herausfinden, welche Statistikdaten entscheidend sein könnten [1].

In [1] haben Aboulnaga et al. auch schon erste Ansätze für eine Analyse auf Spaltenkorrelation vorgestellt, welche später in [6] durch Haas et al. ausgebaut und verbessert wurden. In Analogie zu CORDS werden in [1] und [6] nur Spaltenpaare für die Korrelationssuche in Betracht gezogen. Allerdings fällt die Auswahl der infrage kommenden Spaltenpaare wesentlich leichter aus, weil einfach alle Spaltenpaare, die in den Anfragen (mit hinreichend vielen Daten³) vorkommen, potentielle Kandidaten bilden.

Während in [1] pro auftretendem Wertepaar einer Spaltenkombination ein Quotient aus „Häufigkeit bei Unabhängigkeit“ und „tatsächlicher Häufigkeit“ gebildet und das Spaltenpaar als „korreliert“ angesehen wird, sobald zu viele dieser Quotienten von einem gewissen Wert abweichen, setzen Haas et al. in [6] einen angepassten Chi-Quadrat-Test ein, um Korrelationen zu finden. Dieser ist etwas aufwendiger als die Vorgehensweise von [1], dafür jedoch nicht so fehleranfällig [6]. Zudem stellen Haas et al. in [6] Möglichkeiten vor, wie sich die einzelnen „Korrelationswerte“ pro Spaltenpaar miteinander vergleichen lassen, sodass, ähnlich wie in CORDS, eine Rangliste der am stärksten korrelierten Spaltenpaare erstellt werden kann. Diese kann als Entscheidungshilfe für das Anlegen zusätzlicher Statistikdaten genutzt werden.

3. HERAUSFORDERUNGEN

In [6] wurde bereits vorgeschlagen, dieses Verfahren mit CORDS zu verbinden. Das reaktive Verfahren spricht aufgrund seiner Effizienz für sich, während das proaktive Verfahren eine gewisse Robustheit bietet, sodass bei Lernphasen von [6] (wenn es neu eingeführt wird oder wenn sich die Anfragen ändern) robuste Schätzwerte zur Erstellung eines Anfrageplans berechnet werden können [6]. Dazu sollte CORDS entweder in einem gedrosselten Modus während des normalen Datenbankbetriebs laufen oder während Wartungszeiten ausgeführt werden. Allerdings werden in [6] keine Aussagen darüber getroffen, wie die jeweiligen Ergebnisse beider Verfahren miteinander kombiniert werden sollten.

³ Um aussagefähige Ergebnisse zu bekommen, wird ein gewisses Mindestmaß an Beobachtungen benötigt, insb. in [6].
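Zur Veranschaulichung des zweiten „Impfschritts“ (Kontingenztabelle und Chi-Quadrat-Test auf einer Stichprobe) sei hier eine stark vereinfachte Python-Skizze angegeben. Es handelt sich um eine rein illustrative Annahme; Funktionsnamen wie chi_square_score oder rank_correlated_pairs sind frei gewählt und stammen weder aus CORDS noch aus [6].

from collections import Counter

def chi_square_score(stichprobe):
    # stichprobe: Liste von Wertepaaren (a, b) einer Spaltenkombination.
    # Aus der Stichprobe wird eine Kontingenztabelle aufgebaut und daraus der
    # Chi-Quadrat-Wert berechnet; 0 bedeutet vollständige Unabhängigkeit.
    n = len(stichprobe)
    beobachtet = Counter(stichprobe)                 # Kontingenztabelle
    rand_a = Counter(a for a, _ in stichprobe)       # Randhäufigkeiten Spalte A
    rand_b = Counter(b for _, b in stichprobe)       # Randhäufigkeiten Spalte B
    chi2 = 0.0
    for a in rand_a:
        for b in rand_b:
            erwartet = rand_a[a] * rand_b[b] / n     # Häufigkeit bei Unabhängigkeit
            chi2 += (beobachtet.get((a, b), 0) - erwartet) ** 2 / erwartet
    return chi2

def rank_correlated_pairs(stichproben, schwellwert, n_top):
    # stichproben: {(spalte_1, spalte_2): [(wert_1, wert_2), ...]}
    # Liefert die n_top Spaltenpaare mit den höchsten Chi-Quadrat-Werten oberhalb
    # des Schwellwerts, absteigend sortiert (die "Rangliste").
    bewertet = ((chi_square_score(werte), paar) for paar, werte in stichproben.items())
    rangliste = sorted((b for b in bewertet if b[0] > schwellwert), reverse=True)
    return [(paar, wert) for wert, paar in rangliste[:n_top]]

Für die obersten n Paare einer so erstellten Rangliste würden anschließend zusätzliche Statistikdaten angelegt.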
Folgende Punkte sind dabei zu bedenken:

• Beide Verfahren liefern eine Rangliste mit den als am stärksten von Korrelation betroffenen Spalten. Allerdings sind die den Listen zugrunde liegenden „Korrelationswerte“ (s. bspw. χ² im Abschnitt über proaktive Verfahren) auf unterschiedliche Weise entstanden und lassen sich nicht einfach vergleichen. Liefern beide Listen unterschiedliche Spaltenkombinationen, so kann es passieren, dass eine Kombination, die in der einen Liste sehr weit unten erscheint, stärker korreliert ist als Kombinationen, die auf der anderen Liste sehr weit oben aufgeführt sind.

• Die Daten, welche zu einer gewissen Entscheidung bei den beiden Verfahren führen, ändern sich, werden aber in der Regel nicht gleichzeitig von beiden Verfahren gelesen. Das hängt damit zusammen, dass CORDS zu einem bestimmten Zeitpunkt eine Stichprobe entnimmt und darauf seine Analyse aufbaut, während das Verfahren aus [6] die im Laufe der Zeit angesammelten Anfragedaten auswertet.

• Da zusätzliche Statistikdaten Speicherplatz benötigen und vor allem gewartet werden müssen, ist es nicht sinnvoll, einfach für alle Spaltenkombinationen, die in der einen und/oder der anderen Rangliste vorkommen, gleich zu verfahren und zusätzliche Statistiken zu erstellen.

Zur Verdeutlichung wollen wir die Tabelle aller Firmenwagen eines großen, internationalen IT-Unternehmens betrachten, in welcher zu jedem Wagen u. a. seine Farbe und die Personal- sowie die Abteilungsnummer desjenigen Mitarbeiters verzeichnet ist, der den Wagen hauptsächlich nutzt. Diverse dieser Mitarbeiter wiederum gehen in einem Dresdener mittelständischen Unternehmen ein und aus, welches nur rote KFZ auf seinem Parkplatz zulässt (aus Kapazitätsgründen wurde eine solche, vielleicht etwas seltsam anmutende Regelung eingeführt). Da die Mitarbeiter sich dieser Regelung bei der Wahl ihres Wagens bewusst waren, fahren sie alle ein rotes Auto. Zudem sitzen sie alle in derselben Abteilung.

Allerdings ist das internationale Unternehmen wirklich sehr groß und besitzt viele Firmenwagen sowie unzählige Abteilungen, sodass diese roten Autos in der Gesamtheit der Tabelle nicht auffallen. In diesem Sinne würde das proaktive Verfahren CORDS also (eher) keinen Zusammenhang zwischen der Abteilungsnummer des den Wagen benutzenden Mitarbeiters und der Farbe des Autos erkennen.

Werden aber häufig genau diese Mitarbeiter mit der Farbe ihres Wagens abgefragt, z. B. weil sich diese kuriose Regelung des mittelständischen Unternehmens herumspricht, es keiner so recht glauben will und deswegen die Datenbank konsultiert wird, so könnte ein reaktives Verfahren feststellen, dass beide Spalten korreliert sind. Diese Feststellung tritt insbesondere dann auf, wenn sonst wenige Anfragen an beide betroffenen Spalten gestellt werden, was durchaus möglich ist, weil sonst die Farbe des Wagens eine eher untergeordnete Rolle spielt.

Insbesondere der letztgenannte Umstand macht deutlich, dass es nicht sinnvoll ist, Statistikdaten für die Gesamtheit beider Spalten zu erstellen und zu warten. Aber die Tatsache, dass bestimmte Spezialfälle für den Benutzer besonders interessant sein könnten, die möglicherweise eben gerade mit Korrelation einhergehen, spricht wiederum für eine Art „Hinweis“ an den Optimizer.

4. LÖSUNGSANSATZ

Da CORDS wie auch das Verfahren aus [6] nur Spaltenpaare betrachten und dies mit einem sich experimentell ergebenden Aufwand-Nutzen-Optimum begründen, werden auch wir uns auf Spaltenpaare begrenzen. Allerdings wollen wir uns für die Kombination von proaktiver und reaktiver Korrelationssuche zunächst nicht auf diese beiden Verfahren beschränken, müssen aber doch gewisse Voraussetzungen an die verwendeten Verfahren (und das Datenmodell der Datenbank) stellen. Diese seien hier aufgezählt:

1. Entscheidung über die zu untersuchenden Spaltenkombinationen:

   • Das proaktive Verfahren betreibt „reverse engineering“, um zu entscheiden, welche Spaltenkombinationen untersucht werden sollen.

   • Das Datenmodell der Datenbank ändert sich nicht, bzw. es sind nur geringfügige Änderungen zu erwarten, welche vom proaktiven Verfahren in das von ihm erstellte Datenmodell sukzessive eingearbeitet werden können. Auf diese Weise können wir bei unseren Betrachtungen den ersten „Impfschritt“ vernachlässigen.

2. Datengrundlage für die Untersuchung:

   • Das proaktive Verfahren entnimmt für jegliche zu untersuchende Spaltenkombination eine Stichprobe, welche mit einem Zeitstempel versehen wird. Diese Stichprobe wird so lange aufbewahrt, bis das Verfahren auf „Unkorreliertheit“ plädiert oder für die entsprechende Spaltenkombination eine neue Stichprobe erstellt wird.

   • Das reaktive Verfahren bedient sich eines Query-Feedback-Warehouses, in welchem die Beobachtungen („Query-Feedback-Records“) der Anfragen notiert sind.

3. Vergleich der Ergebnisse:

   • Beide Verfahren geben für jede Spaltenkombination, die sie untersucht haben, einen „Korrelationswert“ aus, der sich innerhalb des Verfahrens vergleichen lässt. Wie dieser genau berechnet wird, ist für uns unerheblich.

   • Aus den höchsten Korrelationswerten ergeben sich zwei Ranglisten der am stärksten korrelierten Spaltenpaare, die wir unterschiedlich auswerten wollen.

Zudem wollen wir davon ausgehen, dass das proaktive Verfahren in einem gedrosselten Modus ausgeführt wird und somit sukzessive seine Rangliste befüllt. (Zusätzliche Wartungszeiträume, bei denen das Verfahren ungedrosselt laufen kann, beschleunigen die Arbeit und bilden somit einen schönen Zusatz, aber da heutzutage viele Datenbanken quasi dauerhaft laufen müssen, wollen wir sie nicht voraussetzen.)
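Eine denkbare Umsetzung des Ranglistenvergleichs, die das Problem der nicht direkt vergleichbaren „Korrelationswerte“ umgeht, ist der Vergleich über Rangpositionen. Die folgende Python-Skizze ist lediglich eine illustrative Annahme (Namen frei gewählt) und keine Festlegung des Verfahrens:

def merge_by_rank(proaktive_liste, reaktive_liste):
    # Beide Listen sind absteigend nach Korrelationsstärke sortierte Folgen von
    # Spaltenpaaren. Verglichen wird nur die Rangposition, nicht der
    # verfahrensspezifische Korrelationswert.
    bester_rang = {}
    for liste in (proaktive_liste, reaktive_liste):
        for rang, paar in enumerate(liste):
            bester_rang[paar] = min(rang, bester_rang.get(paar, rang))
    return sorted(bester_rang, key=bester_rang.get)

# Beispiel: ("farbe", "abteilung") taucht nur reaktiv auf, wird aber wegen seiner
# hohen Rangposition dennoch weit vorn einsortiert.
print(merge_by_rank([("a", "b"), ("c", "d")],
                    [("farbe", "abteilung"), ("a", "b")]))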
Das reaktive Verfahren dagegen wird zu bestimmten Zeitpunkten gestartet, um die sich bis dahin angesammelten Beobachtungen zu analysieren, und gibt nach beendeter Analyse seine Rangliste bekannt. Da es als Grundlage nur die Daten aus dem Query-Feedback-Warehouse benötigt, kann es völlig entkoppelt von der eigentlichen Datenbank laufen.

Ist die reaktive Rangliste bekannt, kann diese mit der (bis dahin angefertigten) proaktiven Rangliste verglichen werden. Tritt eine Spaltenkombination in beiden Ranglisten auf, so bedeutet das, dass diese Korrelation für die bisherigen Anfragen eine Rolle gespielt hat und nicht nur auf Einzelfälle beschränkt ist, sondern auch mittels Analyse einer repräsentativen Stichprobe an Wertepaaren gefunden wurde.

Unter diesen Umständen lassen wir mittels einer Stichprobe Statistikdaten für die betreffende Spaltenkorrelation erstellen. Dabei wählen wir die Stichprobe des proaktiven Verfahrens, solange diese ein gewisses Alter nicht überschritten hat. Ist sie zu alt, wird eine neue Stichprobe entnommen.⁴

Interessanter wird es, wenn nur eines der Verfahren auf Korrelation tippt, während das andere Verfahren die entsprechende Spaltenkombination nicht in seiner Rangliste enthält. Die Ursache dafür liegt entweder darin, dass letzteres Verfahren die Kombination noch nicht analysiert hat (beim reaktiven Verfahren heißt das, dass sie nicht oder zu selten in den Anfragen vorkam), oder darin, dass es bei seiner Analyse zu dem Ergebnis „nicht korreliert“ gekommen ist.

Diese Unterscheidung wollen wir insbesondere in dem Fall vornehmen, wenn einzig das reaktive Verfahren die Korrelation „entdeckt“ hat. Unter der Annahme, dass weitere, ähnliche Anfragen folgen werden, benötigt der Optimizer schnell Statistiken für den abgefragten Bereich. Diese sollen zunächst reaktiv mithilfe der Query-Feedback-Records aus dem Query-Feedback-Warehouse erstellt werden (unter Verwendung von bspw. [11], wobei wir nur zweidimensionale Statistiken benötigen). Das kann wieder völlig getrennt von der eigentlichen Datenbank geschehen, da nur das Query-Feedback-Warehouse als Grundlage dient.

Wir überprüfen nun, ob das proaktive Verfahren das Spaltenpaar schon bearbeitet hat. Dies sollte anhand der Abarbeitungsreihenfolge der infrage kommenden Spaltenpaare erkennbar sein.

Ist dem so, hat das proaktive Verfahren das entsprechende Paar als „unkorreliert“ eingestuft, und wir bleiben bei den reaktiv erstellten Statistiken, die auch nur reaktiv aktualisiert werden. Veralten sie später zu stark aufgrund fehlender Anfragen (und somit fehlenden Nutzerinteresses), können sie gelöscht werden.

Ist dem nicht so, geben wir die entsprechende Kombination an das proaktive Verfahren weiter mit dem Auftrag, diese zu untersuchen.⁵ Beim nächsten Vergleich der Ranglisten muss es für das betrachtete Spaltenpaar eine konkrete Antwort geben. Entscheidet sich das proaktive Verfahren für „korreliert“ und befindet sich das Spaltenpaar auch wieder in der Rangliste des reaktiven Verfahrens, dann löschen wir die reaktiv erstellten Statistiken und erstellen neue Statistiken mittels einer Stichprobe, analog zum ersten Fall. (Die Kombination beider Statistiktypen wäre viel zu aufwendig, u. a. wegen unterschiedlicher Entstehungszeitpunkte.) Wenn das proaktive Verfahren dagegen explizit „unkorreliert“ ausgibt, bleibt es bei den reaktiv erstellten Statistiken, s. oben.

Wenn jedoch nur das proaktive Verfahren eine bestimmte Korrelation erkennt, dann ist diese Erkenntnis zunächst für die Benutzer unerheblich. Sei es, weil der Nutzer diese Spaltenkombination noch gar nicht abgefragt hat, oder weil er bis jetzt nur den Teil der Daten benötigt hat, der scheinbar unkorreliert ist. In diesem Fall markieren wir nur im Datenbankkatalog (wo die Statistiken abgespeichert werden) die beiden Spalten als korreliert und geben dem Optimizer somit ein Zeichen, dass hier hohe Schätzfehler möglich sind und er deswegen robuste Pläne zu wählen hat. Dabei bedeutet „robust“, dass der gewählte Plan für die errechneten Schätzwerte möglicherweise nicht ganz optimal ist, dafür aber bei stärker abweichenden „wahren Werten“ immer noch akzeptable Ergebnisse liefert. Zudem können wir ohne wirklichen Einsatz des reaktiven Verfahrens die Anzahl der Anfragen zählen, die auf diese Spalten zugreifen und bei denen sich der Optimizer stark verschätzt hat. Übersteigt der Zähler einen Schwellwert, werden mithilfe einer neuen Stichprobe (vollständige, also insb. mit Werteverteilung) Statistikdaten erstellt und im Katalog abgelegt.

Der Vollständigkeit halber wollen wir hier noch den Fall erwähnen, dass eine Spaltenkombination weder in der einen noch in der anderen Rangliste vorkommt. Es sollte klar sein, dass diese Kombination als „unkorreliert“ angesehen und somit für die Statistikerstellung nicht weiter betrachtet wird.

⁴ Falls die betroffenen Spalten einen Zähler besitzen, der bei Änderungsoperationen hochgezählt wird (vgl. z. B. [1]), können natürlich auch solche Daten mit in die Wahl der Stichprobe einfließen, allerdings sind hier unterschiedliche „Ausgangszeiten“ zu beachten.

⁵ Dadurch stören wir zwar etwas die vorgegebene Abarbeitungsreihenfolge der infrage kommenden Spaltenpaare, aber der Fall ist ja auch dringend.
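Die beschriebene Fallunterscheidung lässt sich grob wie folgt zusammenfassen (eine minimale Python-Skizze unter eigenen Annahmen; Parameter und Rückgabewerte sind frei gewählt und stellen keine konkrete Implementierung dar):

def entscheide(in_proaktiver_liste, in_reaktiver_liste, proaktiv_bearbeitet):
    # Liefert die Aktion für ein einzelnes Spaltenpaar nach dem Vergleich
    # der beiden Ranglisten.
    if in_proaktiver_liste and in_reaktiver_liste:
        return "Statistik aus (hinreichend junger) Stichprobe anlegen"
    if in_reaktiver_liste:
        if not proaktiv_bearbeitet:
            return ("reaktive Statistik anlegen und Paar vorgezogen "
                    "an das proaktive Verfahren übergeben")
        # proaktiv bereits untersucht und als "unkorreliert" eingestuft
        return "bei der reaktiven Statistik bleiben (nur reaktiv aktualisieren)"
    if in_proaktiver_liste:
        return ("Spalten im Katalog nur als korreliert markieren und "
                "Fehlschätzungen des Optimizers zählen")
    return "als unkorreliert behandeln, keine zusätzliche Statistik"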
5. AUSBLICK

Die hier vorgestellte Vorgehensweise zur Verbesserung der Korrelationsfindung mittels Einsatz zweier unterschiedlicher Verfahren muss weiter vertieft und insbesondere praktisch umgesetzt und getestet werden. Vor allem muss ein passendes Datenmodell für die reaktive Erstellung von Spaltenpaarstatistiken gefunden werden. Das vorgeschlagene Verfahren ISOMER [11] setzt hier auf STHoles [5], ein Datenmodell, welches bei sich stark überschneidenden Anfragen schnell inperformant werden kann. Für den eindimensionalen Fall wurde bereits von Informix-Entwicklern eine performante Lösung vorgestellt [3], welche sich aber nicht einfach auf den zweidimensionalen Fall übertragen lässt.

Eine weitere, noch nicht völlig ausgearbeitete Herausforderung bildet die Tatsache, dass das proaktive Verfahren im gedrosselten Modus läuft und erst sukzessive seine Rangliste erstellt. Das bedeutet, dass wir eigentlich nur Zwischenergebnisse dieser Rangliste mit der reaktiv erstellten Rangliste vergleichen. Dies kann zu unerwünschten Effekten führen, z. B. könnten beide Ranglisten völlig unterschiedliche Spaltenkombinationen enthalten, was einfach der Tatsache geschuldet ist, dass beide Verfahren unterschiedliche Spaltenkombinationen untersucht haben. Um solche Missstände zu vermeiden, muss die proaktive Abarbeitungsreihenfolge der Spaltenpaare überdacht werden. In CORDS wird bspw. als Reduktionsregel vorgeschlagen, nur Spaltenpaare zu betrachten, die im Anfrageworkload vorkommen (dazu müssen von CORDS nur die Anfragen, aber nicht deren Ergebnisse betrachtet werden). Würde sich dann aber der Workload dahingehend ändern, dass völlig neue Spalten oder Tabellen abgefragt werden, hätten wir dasselbe Problem wie bei einem rein reaktiven Verfahren. Deswegen muss hier eine Zwischenlösung gefunden werden, die Spaltenkombinationen aus Anfragen „bevorzugt behandelt“, sich aber nicht darauf beschränkt.

Außerdem muss überlegt werden, wann wir Statistikdaten, die auf Stichproben beruhen, wieder löschen können. Im reaktiven Fall fiel die Entscheidung leicht aus, weil fehlender Zugriff auf die Daten auch ein fehlendes Nutzerinteresse widerspiegelt und auf diese Weise auch keine Aktualisierung mehr stattfindet, sodass die Metadaten irgendwann unbrauchbar werden.

Basieren die Statistiken dagegen auf Stichproben, müssen sie von Zeit zu Zeit aktualisiert werden. Passiert diese Aktualisierung ohne zusätzliche Überprüfung auf Korrelation (welche ja aufgrund geänderten Datenbestands nachlassen könnte), müssen mit der Zeit immer mehr zusätzliche Statistikdaten über Spaltenpaare gespeichert und gewartet werden. Der für Statistikdaten zur Verfügung stehende Speicherplatz im Katalog kann so an seine Grenzen stoßen, außerdem kostet die Wartung wiederum Kapazität des DBMS. Hier müssen sinnvolle Entscheidungen über die Wartung und das „Aufräumen“ nicht mehr benötigter Daten getroffen werden.
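Die angesprochene Zwischenlösung für die proaktive Abarbeitungsreihenfolge – Spaltenpaare aus dem Anfrageworkload bevorzugen, ohne sich darauf zu beschränken – könnte beispielsweise so aussehen (eine rein illustrative Annahme; das Mischungsverhältnis ist frei gewählt):

import itertools

def abarbeitungsreihenfolge(workload_paare, alle_paare, verhaeltnis=3):
    # Auf je `verhaeltnis` Spaltenpaare aus dem Workload folgt ein noch nicht
    # angefragtes Paar, damit neue Spalten und Tabellen nicht dauerhaft
    # unberücksichtigt bleiben.
    workload_menge = set(workload_paare)
    rest = iter([p for p in alle_paare if p not in workload_menge])
    workload = iter(workload_paare)
    reihenfolge = []
    while True:
        block = list(itertools.islice(workload, verhaeltnis))
        zusatz = list(itertools.islice(rest, 1))
        if not block and not zusatz:
            return reihenfolge
        reihenfolge.extend(block + zusatz)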
6. REFERENCES

[1] A. Aboulnaga, P. J. Haas, S. Lightstone, G. M. Lohman, V. Markl, I. Popivanov, and V. Raman. Automated statistics collection in DB2 UDB. In VLDB, pages 1146–1157, 2004.
[2] S. Babu, P. Bizarro, and D. J. DeWitt. Proactive re-optimization. In SIGMOD Conference, pages 107–118. ACM, 2005.
[3] E. Behm, V. Markl, P. Haas, and K. Murthy. Integrating query-feedback based statistics into Informix Dynamic Server, Apr. 03 2008.
[4] P. Brown and P. J. Haas. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB 2003), September 9–12, 2003, Berlin, Germany, pages 668–679, 2003.
[5] N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A multidimensional workload-aware histogram. SIGMOD Rec., 30(2):211–222, May 2001.
[6] P. J. Haas, F. Hueske, and V. Markl. Detecting attribute dependencies from query feedback. In VLDB, pages 830–841. ACM, 2007.
[7] I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004, pages 647–658. ACM Press, 2004.
[8] H. Kimura, G. Huo, A. Rasin, S. Madden, and S. B. Zdonik. Correlation maps: A compressed access method for exploiting soft functional dependencies. PVLDB, 2(1):1222–1233, 2009.
[9] V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004, pages 659–670. ACM Press, 2004.
[10] T. Neumann and C. Galindo-Legaria. Taking the edge off cardinality estimation errors using incremental execution. In BTW, pages 73–92, 2013.
[11] U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. ISOMER: Consistent histogram construction using query feedback. In ICDE, page 39. IEEE Computer Society, 2006.
[12] M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO – DB2's learning optimizer. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01), pages 19–28, Orlando, Sept. 2001.
MVAL: Addressing the Insider Threat by Valuation-based Query Processing

Stefan Barthel and Eike Schallehn
Institute of Technical and Business Information Systems, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
stefan.barthel@ovgu.de, eike.schallehn@ovgu.de

ABSTRACT

The research presented in this paper is inspired by the problems conventional database security mechanisms have in addressing the insider threat, i.e. authorized users abusing granted privileges for illegal or disadvantageous accesses. The basic idea is to restrict the data one user can access by a valuation of data, e.g. a monetary value of data items, and, based on that, introducing limits for accesses. The specific topic of the present paper is the conceptual background of how the process of querying valuated data leads to valuated query results. For this, derivation functions are added by analyzing operations of the relational algebra and SQL.

25th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION

An acknowledged main threat to data security is fraudulent access by authorized users, often referred to as the insider threat [2]. To address this problem, in [1] we proposed a novel approach of detecting authorization misuse based on a valuation of data, i.e. an assigned description of the worth of the data managed in a system, which could for instance be interpreted as monetary values. Accordingly, possible security leaks exist if users access more valuable data than they are allowed to within a query or cumulated over a given time period. E.g., a bank account manager accessing a single customer record does not represent a problem, while dumping all data in an unrestricted query should be prohibited. Here, common approaches like role-based security mechanisms typically fail.

According to our proposal, the data valuation is first of all based on the relation definitions, i.e. as part of the data dictionary information about the value of data items such as attribute values and, derived from that, entire tuples and relations. Then, a key question is how the valuation of a query result can be derived from the input valuations, because performing operations on the base data causes transformations that have an impact on the data's significance.

This problem is addressed in the research presented here by considering relational and SQL operations and describing possible valuation derivations for them.

2. PRINCIPLES OF DATA VALUATION

In [1] we outlined our approach of a leakage-resistant data valuation which computes a monetary value (mval) for each query. This is based on the following basic principles: Every attribute Ai ∈ R of a base relation schema R is valuated by a certain monetary value (mval(Ai) ∈ ℝ). The attribute valuations for base tables are part of the data dictionary and can for instance be specified as an extension of the SQL DDL:

CREATE TABLE table_1
(
   attribute_1 INT PRIMARY KEY MVAL 0.1,
   attribute_2 UNIQUE COMMENT 'important' MVAL 10,
   attribute_3 DATE
);

With these attribute valuations, we derive a monetary value for one tuple t ∈ r(R), given by Equation (1), as well as the total monetary value of the relation r(R), given by Equation (2), if data is extracted by a query.

    mval(t \in r(R)) = \sum_{A_i \in R} mval(A_i)                              (1)

    mval(r(R)) = \sum_{t \in r(R)} mval(t) = |r(R)| \cdot mval(t \in r(R))     (2)

To be able to consider the mval for a query as well as several queries of one user over a certain period of time, we log all mvals in an alert log and compare the current cumulated mval per user to two thresholds. If a user exceeds the first threshold – the suspicious threshold – she will be categorized as suspect. After additionally exceeding the truncation threshold, her query output will be limited by hiding tuples and presenting a user notification. We embedded our approach in an additional layer in the security defense-in-depth model for raw data, which we have enhanced by a backup entity (see Fig. 1). Furthermore, encryption has to be established to prevent data theft via unauthorized, physical reads as well as backup theft. In this paper we are going into detail about how to handle operations like joins, aggregate functions, stored procedures as well as common functions.

[Figure 1: Security defense model on DBMS and physical level. The DBMS level layers Views, Access Control, Data Valuation, and Encryption around the Raw Data; on the physical level, the Backup Data are likewise protected by Encryption towards the Backup.]

Most of the data stored in a database can be easily identified as directly interpretable.
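As a minimal illustration of Equations (1) and (2), the tuple and relation valuations can be sketched in Python as follows (the dictionary representation and the concrete values are illustrative assumptions only, not part of an actual implementation):

# Attribute valuations as they might be read from the data dictionary,
# here for the table_1 example above (values are illustrative).
mval_attrs = {"attribute_1": 0.1, "attribute_2": 10.0, "attribute_3": 0.0}

def mval_tuple(attr_mvals):
    # Eq. (1): the mval of one tuple is the sum of its attribute valuations.
    return sum(attr_mvals.values())

def mval_relation(attr_mvals, row_count):
    # Eq. (2): the mval of an extracted relation is |r(R)| times the tuple mval.
    return row_count * mval_tuple(attr_mvals)

print(mval_tuple(mval_attrs))           # -> 10.1
print(mval_relation(mval_attrs, 1000))  # -> about 10100 for a query returning 1,000 rows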
One example would be an employee table, where each tuple has a value for the attributes "first name", "surname" and "gender". In this case, it is also quite easy to calculate the monetary value for a query (r(R_emp)) by simply summing all mvals per attribute and multiplying those with the number of involved rows (see Eq. (3)).

    mval(r(R_{emp})) = \sum_{A_i \in R_{emp}} mval(A_i) \cdot |r(R_{emp})|     (3)

However, it becomes more challenging if an additional attribute "license plate number" is added, which does have some unset or unknown attribute values – in most cases NULL values. By knowing there is a NULL value for a certain record, this could be interpreted as either simply unknown whether there is any car, or unset because this person has no car. So there is an uncertainty that could lead to an information gain which would be uncovered if no adequate valuation exists. Some other potentially implicit information gains originate from joins and aggregate functions, which we mention in the regarding sections.

Because the terms information gain and information loss are widely used and do not have a uniform definition, we define them for further use. We call a situation where an attacker received new data (resp. information) an information gain, and the same situation in the view of the data owner an information loss.

Uncertainty Factor

Some operators used for query processing obviously reduce the information content of the result set (e.g. selection, aggregations, semi joins, joins with resulting NULL values), but there is still an uncertain, implicit information gain. Since the information gain by uncertainty is blurry, meaning in some cases more indicative than in others, we have to distinguish uncertainty of one attribute value generated out of one source attribute value (e.g., generated NULL values) and attribute values which are derived from information of several source attribute values (e.g., aggregations). In case of one source attribute value, an information gain by uncertainty has to be less valuable than properly set attribute values. Therefore, the monetary value should be only a percentage of the respective monetary value of an attribute value. If several source attribute values are involved, we recommend to value the computed attribute value as a percentage of the monetary value of all participating source attribute values. In general, we suggest a maximum of 50% for both valuations. Furthermore, we need to consider the overall purpose of our leakage-resistant data valuation, which shall prevent extractions of large amounts of data. Therefore, the percentage needs to be increased with the amount of data, but not in a way that an unset or unknown attribute value becomes as valuable as a properly set one. For that reason, exponential growth is not a suitable option. Additionally, we have to focus on a certain area of application, because a trillion attributes (10^12) are conceivable whereas a septillion attributes (10^24) are currently not realistic. From the overall view on our data valuation, we assume, depending on the application, that the extraction of sensitive data becomes critical when 10^3 up to 10^9 attribute values are extracted. Therefore, the growth of our uncertainty factor UF increases much more until 10^9 attribute values than afterwards, which predominantly points to a logarithmic growth. We also do not need to have a huge difference of the factor if theoretically many more attribute values shall be extracted (e.g., 10^14 and more), because with respect to an extraction-limiting approach, it is way too much data to return. This assumption also refers to a logarithmic increase. We conclude that the most promising formula that was adapted to fit our needs is shown in Eq. (4).

    UF = \frac{1}{30} \log_{10}(|val_{A_i,...,A_k}| + 1)                       (4)
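A small sketch of the uncertainty factor of Eq. (4); the function name and the sample magnitudes are illustrative choices only:

import math

def uncertainty_factor(num_values):
    # Eq. (4): UF = (1/30) * log10(n + 1) for n participating attribute values.
    return math.log10(num_values + 1) / 30.0

# UF grows logarithmically: each additional order of magnitude adds only 1/30,
# so even an extraction of 10^14 attribute values stays below the 50% bound
# suggested above.
for n in (10**3, 10**6, 10**9, 10**14):
    print(n, round(uncertainty_factor(n), 3))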
3. DERIVING VALUATIONS FOR DATABASE OPERATIONS

In this chapter we will describe the valuation derivation for the main database operations by first discussing core relational operations. Furthermore, we address specifics of join operations and finally functions (aggregate, user-defined, stored procedures) which are defined in SQL.

3.1 Core Operations of Relational Algebra

The relational algebra [4] consists of six basic operators, where selection, projection, and rename are unary operations and union, set difference, and Cartesian product are operators that take two relations as input (binary operations). Due to the fact that applying rename to a relation or attribute will not change the monetary value, we will only consider the rest.

Projection

The projection π_attr_list(r(R)) is a unary operation and eliminates all attributes (columns) of an input relation r(R) except those mentioned in the attribute list. For the computation of the monetary value of such a projection, only the mvals of the chosen attributes of the input relation are considered, while taking into account that a projection may eliminate duplicates (shown in Eq. (5)).

    mval(\pi_{A_j,..,A_k}(r(R))) = \sum_{i=j}^{k} mval(A_i) \cdot |\pi_{A_j,..,A_k}(r(R))|     (5)

Selection

According to the relational algebra, a selection of a certain relation σ_pred(r(R)) reduces tuples to a subset which satisfies specified predicates. Because the selection reduces the number of tuples, the calculation of the monetary value does not have to consider those filtered tuples and only the number of present tuples is relevant (shown in Eq. (6)).

    mval(\sigma_{pred}(r(R))) = mval(t \in r(R)) \cdot |\sigma_{pred}(r(R))|                    (6)

Set Union

A relation of all distinct elements (resp. tuples) of any two relations is called the union (denoted by ∪) of those relations. For performing set union, the two involved relations must be union-compatible – they must have the same set of attributes. In symbols, the union is represented as R1 ∪ R2 = {x : x ∈ R1 ∨ x ∈ R2}. However, if two relations contain identical tuples, these tuples exist only once within the resulting relation, meaning duplicates are eliminated. Accordingly, the mval of a union of two relations is computed by adding the mvals of both relations and subtracting the mval of the duplicates (shown in Eq. (7)).

    mval(R_1 \cup R_2) = mval(r(R_1)) + mval(r(R_2)) - \sum_{i} mval(t_i \in r(R_1 \cap R_2))   (7)

Set Difference

The difference of relations R1 and R2 is the relation that contains all the tuples that are in R1 but do not belong to R2. The set difference is denoted by R1 − R2 or R1 \ R2 and defined by R1 \ R2 = {x : x ∈ R1 ∧ x ∉ R2}. Also, the set difference is union-compatible, meaning the relations must have the same number of attributes and the domain of each attribute is the same in both R1 and R2. The mval of a set difference of two relations is computed by subtracting the mval of the tuples that both relations have in common from the monetary value of R1, given by Equation (8).

    mval(R_1 \setminus R_2) = mval(r(R_1)) - \sum_{i} mval(t_i \in r(R_1 \cap R_2))             (8)

Cartesian Product

The Cartesian product, also known as cross product, is an operator which works on two relations, just as set union and set difference. However, the Cartesian product is the costliest operator to evaluate [9], because it combines the tuples of one relation with all the tuples of the other relation – it pairs rows from both tables. Therefore, if the input relations R1 and R2 have n and m rows, respectively, the result set will contain n * m rows and consist of the columns of R1 and the columns of R2. Because the number of tuples of the outgoing relations is known, the monetary value is a summation of all attribute valuations multiplied by the number of rows of both relations, given by Equation (9). We are fully aware that by a user mistake, e.g. using a cross join instead of a natural join, thresholds will be exceeded and the user will be classified as potentially suspicious. However, we recommend a multiplication of the monetary value of both source relations instead of a summation due to the fact that the calculation of the monetary value needs to be consistent also when combining different operators. For that reason, by following our recommendation, we ensure that an inner join is valuated with the same monetary value as the respective combination of a cross join (Cartesian product) and a selection on the join condition.

    mval(r(R_1 \times R_2)) = mval(t \in r(R_1)) \cdot |r(R_1)| + mval(t \in r(R_2)) \cdot |r(R_2)|     (9)
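The derivations of Eqs. (5)–(9) can be sketched as follows, assuming relations are represented as lists of dictionaries and attribute valuations as a dictionary (an illustrative simplification, not an actual implementation):

def mval_projection(attr_mvals, rel, keep):
    # Eq. (5): sum of the kept attribute mvals times the duplicate-free row count.
    distinct = {tuple(row[a] for a in keep) for row in rel}
    return sum(attr_mvals[a] for a in keep) * len(distinct)

def mval_selection(attr_mvals, rel, pred):
    # Eq. (6): mval of one tuple times the number of qualifying tuples.
    return sum(attr_mvals.values()) * sum(1 for row in rel if pred(row))

def mval_union(attr_mvals, rel1, rel2):
    # Eq. (7): mval(r1) + mval(r2) minus the mval of the common (duplicate) tuples.
    t = sum(attr_mvals.values())
    duplicates = [row for row in rel1 if row in rel2]
    return t * (len(rel1) + len(rel2) - len(duplicates))

def mval_difference(attr_mvals, rel1, rel2):
    # Eq. (8): mval(r1) minus the mval of the tuples both relations have in common.
    t = sum(attr_mvals.values())
    common = [row for row in rel1 if row in rel2]
    return t * (len(rel1) - len(common))

def mval_cartesian(attr_mvals1, rel1, attr_mvals2, rel2):
    # Eq. (9) as printed: mval(t in r1) * |r1| + mval(t in r2) * |r2|.
    return (sum(attr_mvals1.values()) * len(rel1)
            + sum(attr_mvals2.values()) * len(rel2))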
3.2 Join Operations

In the context of relational databases, a join is a binary operation of two tables (resp. data sources). The result set of a join is an association of tuples from one table with tuples from another table by concatenating the concerned attributes. Joining is an important operation and most often performance critical to certain queries that target tables whose relationships to each other cannot be followed directly. Because the type of join affects the number of resulting tuples and their attributes, the monetary value of each join needs to be calculated independently.

Inner Join

An inner join produces a result table containing composite rows of the involved tables that match some pre-defined, or explicitly specified, join condition. This join condition can be any simple or compound search condition, but does not have to contain a subquery reference. The valuation of an inner join is computed by the sum of the monetary values of all attributes of a composite row multiplied by the number of rows within the result set. Because the join attribute A_join of two joined tables has to be counted only once, we need to subtract it (shown in Eq. (10)).

    mval(r(R_1 \bowtie R_2)) = |r(R_1 \bowtie R_2)| \cdot (mval(t \in r(R_1)) + mval(t \in r(R_2)) - mval(A_{join}))     (10)

Outer Join

An outer join does not require matching records for each tuple of the concerned tables. The joined result table retains all rows from at least one of the tables mentioned in the FROM clause, as long as those rows are consistent with the search condition. Outer joins are subdivided further into left, right, and full outer joins. The result set of a left outer join (or left join) includes all rows of the first mentioned table (left of the join keyword) merged with attribute values of the right table where the join attribute matches. In case there is no match, attributes of the right table are set to NULL. The right outer join (or right join) will return rows that have data in the right table, even if there are no matching rows in the left table, enhanced by attributes (with NULL values) of the left table. A full outer join is used to retain the non-matching information of all affected tables by including non-matching rows in the result set. To cumulate the monetary value for a query that contains a left or right outer join, we only need to compute the monetary value of an inner join of both tables and add the mval of an antijoin r(R_1 ▷ R_2) ⊆ r(R_1), which includes only tuples of R_1 that do not have a join partner in R_2 (shown in Eq. (11)). For the monetary value of a full outer join, we additionally consider an antijoin r(R_2 ▷ R_1) ⊆ r(R_2), which includes tuples of R_2 that do not have a join partner, given by Equation (12).

    mval(r(R_1 ⟕ R_2)) = mval(r(R_1 \bowtie R_2)) + mval(r(R_1 ▷ R_2))                                   (11)

    mval(r(R_1 ⟗ R_2)) = mval(r(R_1 \bowtie R_2)) + mval(r(R_1 ▷ R_2)) + mval(r(R_2 ▷ R_1))              (12)

Semi Join

A semi join is similar to the inner join, but with the addition that only attributes of one relation are represented in the result set. Semi joins are subdivided further into left and right semi joins. The left semi join operator returns each row from the first input relation (left of the join keyword) when there is a matching row in the second input relation (right of the join keyword). The right semi join is computed vice versa. The monetary value for a query that uses semi joins can be easily cumulated by multiplying the sum of the monetary values of the included attributes with the number of matching rows of the outgoing relation (shown in Eq. (13)).

    mval(r(R_1 ⋉ R_2)) = \sum_{A_i \in R_1} mval(A_i) \cdot |r(R_1 ⋉ R_2)|                               (13)

Nevertheless, we do have an information gain by knowing that join attributes of R_1 have some join partners within R_2 which are not considered. But adding our uncertainty factor UF in this equation would lead to inconsistency when cumulating the mval of a semi join compared to the mval of a combination of a natural join and a projection. In future work, we will solve this issue by presenting a calculation that is based on a combination of projections and joins to cover such an implicit information gain.

3.3 Aggregate Functions

In computer science, an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning. The SQL aggregate functions are useful when mathematical operations must be performed on all or on a group of values. For that reason, they are frequently used with the GROUP BY clause within a SELECT statement. According to the SQL standard, the following aggregate functions are implemented in most DBMS and are the ones used most often: COUNT, AVG, SUM, MAX, and MIN.

All aggregate functions are deterministic, i.e. they return the same value any time they are called by using the same set of input values. SQL aggregate functions return a single value, calculated from values within one column of an arbitrary relation [10]. However, it should be noted that except for COUNT, these functions return a NULL value when no rows are selected. For example, the function SUM performed on no rows returns NULL, not zero as one might expect. Furthermore, except for COUNT, aggregate functions ignore NULL values during computation. All aggregate functions are defined in the SQL:2011 standard or ISO/IEC 9075:2011 (under the general title "Information technology - Database languages - SQL"), which is the seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language.

To be able to compute the monetary value of a derived, aggregated attribute, we need to consider two more factors. First of all, we divide the aggregate functions into two groups: informative and conservative.

1. Informative are those aggregate functions where the aggregated value of a certain aggregate function leads to an information gain over the entire input of all attribute values. This means that every single attribute value participates in the computation of the aggregated attribute value. Representatives for informative aggregate functions are COUNT, AVG and SUM.

2. Conservative, on the contrary, are those functions where the aggregated value is represented by only one attribute value, but in consideration of all other attribute values. So if the aggregated value is again separated from the input set, all other attribute values will remain. Conservative aggregate functions are MAX and MIN.

The second factor that needs to be considered is the number of attributes that are used to compute the aggregated values. In case of a conservative aggregate function, it is simple, because only one attribute value is part of the output. For that reason we recommend to leave the mval of the source attribute unchanged (shown in Eq. (14)).

    mval(A_i) = mval(MAX(A_i)) = mval(MIN(A_i))                                                          (14)

For the informative aggregate functions the computation is more challenging due to several participating attribute values. Because several input attribute values are concerned, we recommend the usage of our uncertainty factor which we already mentioned in a prior section. With the uncertainty factor it is possible to integrate the number of attribute values in a way that a higher number of concerned attributes leads to an increase in percentage terms of the monetary value of the aggregated attribute value, given by Equation (15).

    mval(COUNT(A_i)) = mval(SUM(A_i)) = mval(AVG(A_i)) = \frac{1}{30} \log_{10}(|A_i| + 1) \cdot mval(A_i)     (15)

3.4 Scalar Functions

Besides the SQL aggregate functions, which return a single value calculated from values in a column, there are also scalar functions defined in SQL that return a single value based on the input value. The possibly most commonly used and well-known scalar functions are:

• UCASE() - Converts a field to upper case
• LCASE() - Converts a field to lower case
• LEN() - Returns the length of a text field
• ROUND() - Rounds a number to a specified degree
• FORMAT() - Formats how a field is to be displayed

Returned values of these scalar functions are always derived from one source attribute value, and some of them do not even change the main content of the attribute value. Therefore, we recommend that the monetary value of the source attribute stays untouched.
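Similarly, the join and aggregate valuations of Eqs. (10)–(15) can be sketched as follows (again an illustrative simplification with freely chosen parameter names; tuple mvals and cardinalities are assumed to be known):

import math

def mval_inner_join(tuple_mval_r1, tuple_mval_r2, mval_join_attr, result_rows):
    # Eq. (10): per composite row, both tuple mvals minus the shared join attribute.
    return result_rows * (tuple_mval_r1 + tuple_mval_r2 - mval_join_attr)

def mval_outer_join(inner_join_mval, antijoin_r1_mval, antijoin_r2_mval=0.0):
    # Eq. (11): inner join plus the antijoin of the preserved side;
    # Eq. (12): for a full outer join, both antijoins are added.
    return inner_join_mval + antijoin_r1_mval + antijoin_r2_mval

def mval_semi_join(attr_mvals_r1, matching_rows):
    # Eq. (13): mvals of the attributes of R1 times the number of matching rows.
    return sum(attr_mvals_r1.values()) * matching_rows

def mval_aggregate(func, mval_attr, num_values):
    # Eq. (14): conservative functions (MAX, MIN) keep the attribute mval;
    # Eq. (15): informative functions (COUNT, SUM, AVG) scale it by the
    # uncertainty factor of Eq. (4).
    if func.upper() in ("MAX", "MIN"):
        return mval_attr
    if func.upper() in ("COUNT", "SUM", "AVG"):
        return math.log10(num_values + 1) / 30.0 * mval_attr
    raise ValueError("unsupported aggregate function: " + func)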
3.5   User-Defined Functions
User-defined functions (UDF) are subroutines made up of one or several SQL or programming extension statements that can be used to encapsulate code for reuse. Most database management systems (DBMS) allow users to create their own user-defined functions and do not limit them to the built-in functions of their SQL programming language (e.g., T-SQL, PL/SQL, etc.). User-defined functions in most systems are created by using the CREATE FUNCTION statement, and users other than the owner must be granted appropriate permissions on a function before they can use it. Furthermore, UDFs can be either deterministic or nondeterministic. A deterministic function always returns the same results for equal input, whereas a nondeterministic function may return different results every time it is called.
On the basis of the multiple possibilities offered by most DBMS, it is impossible to estimate all feasible results of a UDF. Also, due to several features like shrinking, concatenating, and encrypting of return values, a valuation of a single output value or an array of output values is practically impossible. For this reason we decided not to calculate the monetary value depending on the output of a UDF; instead, we consider the attribute values that are passed to a UDF (shown in Eq. (16)). This assumption is also the most reliable one, because no matter what happens inside the UDF – which we treat as a black box – the information loss after inserting the input values cannot get worse.

   mval(UDF_output(A_a, .., A_g)) = mval(UDF_input(A_k, .., A_p)) = \sum_{i=k}^{p} mval(A_i)            (16)
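As an illustration of this rule, consider the following hypothetical UDF (the syntax follows the SQL/PSM style used, e.g., by MySQL, and may differ in other systems; table and column names are made up for this sketch):

   CREATE FUNCTION net_salary(salary DECIMAL(10,2), tax_rate DECIMAL(4,2))
   RETURNS DECIMAL(10,2)
   DETERMINISTIC
   RETURN salary - (salary * tax_rate / 100);

   -- A call such as
   --   SELECT net_salary(salary, tax_rate) FROM Employee;
   -- is valued according to Eq. (16) by its inputs,
   -- i.e. mval = mval(salary) + mval(tax_rate),
   -- independent of what the function body computes.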
3.6   Stored Procedures
Stored procedures (SP) are stored similar to user-defined functions (UDF) within a database system. The major difference is that stored procedures have to be called, whereas the return values of UDFs are used in other SQL statements in the same way pre-installed functions are used (e.g., LEN, ROUND, etc.). A stored procedure, which depending on the DBMS is also called proc, sproc, StoredProc or SP, is a group of SQL statements compiled into a single execution plan [13] and mostly developed for applications that need to access a relational database system easily. Furthermore, SPs combine and provide logic also for extensive or complex processing that requires the execution of several SQL statements, which previously had to be implemented in an application. Also a nesting of SPs is feasible by executing one stored procedure from within another. Typical uses for SPs are data validation (integrated into the database) or access control mechanisms [13].
Because stored procedures have such a complex structure, nesting is also legitimate, and SPs are "only" a group of SQL statements, we recommend to value each single statement within an SP and to sum up all partial results (shown in Eq. (17)). With this assumption we follow the principle that single SQL statements are moved into stored procedures to provide simple access for applications which only need to call the procedures.

   mval(SP(r(R_j), .., r(R_k))) = \sum_{i=j}^{k} mval(r(R_i))            (17)

Furthermore, by summing all partial results, we make sure that the worst case of information loss is considered, entirely in line with our general idea of a leakage-resistant data valuation that should prevent a massive data extraction. However, since SPs represent a completed unit, on reaching the truncate threshold the whole SP will be blocked and rolled back. For that reason, we recommend smaller SPs or, respectively, splitting existing SPs in a DBS with enabled leakage-resistant data valuation.
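To make Eq. (17) concrete, consider a hypothetical stored procedure consisting of two statements (again only a sketch; names and syntax are illustrative):

   CREATE PROCEDURE monthly_report(IN dept_id INT)
   BEGIN
     -- statement 1: valued like an ordinary query on Employee
     SELECT last_name, salary FROM Employee WHERE department = dept_id;
     -- statement 2: valued like an ordinary update on Department
     UPDATE Department SET last_report = CURRENT_DATE WHERE id = dept_id;
   END;

   -- Following Eq. (17), mval(monthly_report) is the sum of the values of
   -- both statements; if the truncate threshold is reached, the whole
   -- procedure call is blocked and rolled back.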
4.   RELATED WORK
Conventional database management systems mostly use access control models to face unauthorized access on data. However, these are insufficient when an authorized individual extracts data, regardless of whether she is the owner or has stolen that account. Several methods were conceived to eliminate those weaknesses. We refer to Park and Giordano [14], who give an overview of requirements needed to address the insider threat.
Authorization views partially achieve those crucial goals of an extended access control and have been proposed several times. For example, Rizvi et al. [15] as well as Rosenthal et al. [16] use authorization-transparent views. In detail, incoming user queries are only admitted if they can be answered using information contained in authorization views. Contrary to this, we do not prohibit a query in its entirety. Another approach based on views was introduced by Motro [12]. Motro handles only conjunctive queries and answers a query only with a part of the result set, but without any indication why it is partial. We do handle information-enhancing operations (e.g., joins) as well as coarsening operations (e.g., aggregation), and we do display a user notification. All authorization view approaches require an explicit definition of a view for each possible access need, which also imposes the burden of knowing and directly querying these views. In contrast, the monetary values of attributes are set while defining the tables, and the user can query the tables or views she is used to. Moreover, the equivalence test of general relational queries is undecidable and equivalence for conjunctive queries is known to be NP-complete [3]. Therefore, the leakage-resistant data valuation is more applicable, because it does not have to face those challenges.
However, none of these methods considers the sensitivity level of data that is extracted by an authorized user. In the field of privacy-preserving data publishing (PPDP), on the contrary, several methods are provided for publishing useful information while preserving data privacy. In detail, multiple security-related measures (e.g., k-anonymity [17], l-diversity [11]) have been proposed, which aggregate information within a data extract in a way that it cannot lead to an identification of a single individual. We refer to Fung et al. [5], who give a detailed overview of recent developments in methods and tools of PPDP. However, these mechanisms are mainly used for privacy-preserving tasks and are not in use when an insider accesses data. They are not applicable for our scenario, because they do not consider a line-by-line extraction over time as well as the information loss caused by aggregating attributes.
To the best of our knowledge, there is only the approach of Harel et al. ([6], [7], [8]) that is comparable to our data valuation to prevent suspicious, authorized data extractions. Harel et al. introduce the Misuseability Weight (M-score) that describes the sensitivity level of the data exposed to the user. Hence, Harel et al. focus on the protection of the quality of information, whereas our approach predominantly prevents the extraction of a collection of data (quantity of information). Harel et al. also do not consider extractions over time, logging of malicious requesters, and the backup process. In addition, mapping attributes to a certain monetary value is much more applicable and intuitive than mapping to an artificial M-score.
Our extended authorization control does not limit the system to a simple query-authorization control without any protection against the insider threat; rather, we allow a query to be executed whenever the information carried by the query is legitimate according to the specified authorizations and thresholds.

5.   CONCLUSIONS AND FUTURE WORK
In this paper we described conceptual background details for a novel approach for database security. The key contribution is to derive valuations for query results by considering the most important operations of the relational algebra as well as SQL and providing specific mval functions for each of them. While some of these rules are straightforward, e.g. for core operations like selection and projection, other operations like specific join operations require some more thorough considerations. Further operations, e.g. grouping and aggregation or user-defined functions, would actually require application-specific valuations. To minimize the overhead for using valuation-based security, we discuss and recommend some reasonable valuation functions for these cases, too.
As the results presented here are merely of conceptual nature, our current and future research includes considering implementation alternatives, e.g. integrated with a given DBMS or as part of a middleware or driver, as well as evaluating the overhead and the effectiveness of the approach. We will also come up with a detailed recommendation of how to set monetary values appropriate to different environments and situations. Furthermore, we plan to investigate further possible use cases for data valuation, such as billing of data-providing services on a fine-grained level and controlling benefit/cost trade-offs for data security and safety.

6.   ACKNOWLEDGMENTS
This research has been funded in part by the German Federal Ministry of Education and Science (BMBF) through the Research Program under Contract FKZ: 13N10818.

7.   REFERENCES
 [1] S. Barthel and E. Schallehn. The Monetary Value of Information: A Leakage-Resistant Data Valuation. In BTW Workshops, BTW'2013, pages 131–138. Köln Verlag, 2013.
 [2] E. Bertino and R. Sandhu. Database Security - Concepts, Approaches, and Challenges. IEEE Trans. Dependable and Secure Comp., 2(1):2–19, Mar. 2005.
 [3] A. K. Chandra and P. M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In Proc. of the 9th Annual ACM Symposium on Theory of Computing, STOC'77, pages 77–90. ACM, 1977.
 [4] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6):377–387, June 1970.
 [5] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Comput. Surv., 42(4):14:1–14:53, June 2010.
 [6] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. M-score: Estimating the Potential Damage of Data Leakage Incident by Assigning Misuseability Weight. In Proc. of the 2010 ACM Workshop on Insider Threats, Insider Threats'10, pages 13–20. ACM, 2010.
 [7] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. Eliciting Domain Expert Misuseability Conceptions. In Proc. of the 6th Int'l Conference on Knowledge Capture, K-CAP'11, pages 193–194. ACM, 2011.
 [8] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. M-Score: A Misuseability Weight Measure. IEEE Trans. Dependable Secur. Comput., 9(3):414–428, May 2012.
 [9] T. Helleseth and T. Klove. The Number of Cross-Join Pairs in Maximum Length Linear Sequences. IEEE Transactions on Information Theory, 37(6):1731–1733, Nov. 1991.
[10] P. A. Laplante. Dictionary of Computer Science, Engineering and Technology. CRC Press, London, England, 1st edition, 2000.
[11] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl. Discov. Data, 1(1):1–50, Mar. 2007.
[12] A. Motro. An Access Authorization Model for Relational Databases Based on Algebraic Manipulation of View Definitions. In Proc. of the 5th Int'l Conference on Data Engineering, pages 339–347. IEEE Computer Society, 1989.
[13] J. Natarajan, S. Shaw, R. Bruchez, and M. Coles. Pro T-SQL 2012 Programmer's Guide. Apress, Berlin-Heidelberg, Germany, 3rd edition, 2012.
[14] J. S. Park and J. Giordano. Access Control Requirements for Preventing Insider Threats. In Proc. of the 4th IEEE Int'l Conference on Intelligence and Security Informatics, ISI'06, pages 529–534. Springer, 2006.
[15] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy. Extending Query Rewriting Techniques for Fine-Grained Access Control. In Proc. of the 2004 ACM SIGMOD Int'l Conference on Management of Data, SIGMOD'04, pages 551–562. ACM, 2004.
[16] A. Rosenthal and E. Sciore. View Security as the Basis for Data Warehouse Security. In CAiSE Workshop on Design and Management of Data Warehouses, DMDW'2000, pages 5–6. CEUR-WS, 2000.
[17] L. Sweeney. K-Anonymity: A Model For Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, Oct. 2002.
  TrIMPI: A Data Structure for Efficient Pattern Matching on
                       Moving Objects

                          Tsvetelin Polomski                                             Hans-Joachim Klein
                 Christian-Albrechts-University at Kiel                         Christian-Albrechts-University at Kiel
                    Hermann-Rodewald-Straße 3                                      Hermann-Rodewald-Straße 3
                              24118 Kiel                                                     24118 Kiel
                  tpo@is.informatik.uni-kiel.de                                   hjk@is.informatik.uni-kiel.de

ABSTRACT
Managing movement data efficiently often requires the exploitation of some indexing scheme. Taking into account the kind of queries issued to the given data, several indexing structures have been proposed which focus on spatial, temporal or spatio-temporal data. Since all these approaches consider only raw data of moving objects, they may be well-suited if the queries of interest contain concrete trajectories or spatial regions. However, if the query consists only of a qualitative description of a trajectory, e.g. by stating some properties of the underlying object, sequential scans on the whole trajectory data are necessary to compute the property, even if an indexing structure is available.
The present paper presents some results of ongoing work on a data structure for Trajectory Indexing using Motion Property Information (TrIMPI). The proposed approach is flexible since it allows the user to define application-specific properties of trajectories which have to be used for indexing. Thereby, we show how to efficiently answer queries given in terms of such qualitative descriptions. Since the index structure is built on top of ordinary data structures, it can be implemented in arbitrary database management systems.

Keywords
Moving object databases, motion patterns, indexing structures

1.   INTRODUCTION AND MOTIVATION
Most index structures for trajectories considered in the literature (e.g. [8]) concentrate on (time dependent) positional data, e.g. the R-Tree [9] or the TPR*-Tree [17]. There are different approaches (e.g. [1], [12]) exploiting transformation functions on the original data and thereby reducing the indexing overhead through "light versions" of the trajectories to be indexed. In these approaches only stationary data is being handled. In cases where the queries of interest consist of concrete trajectories or polygons covering them, such indexing schemata as well as trajectory compression techniques (e.g. [1], [6], [10], [12], [13]) may be well-suited. However, there are applications [14] where a query may consist only of a qualitative description, e.g. return all trajectories where the underlying object slowed down (during any time interval) and after that it changed its course. Obviously, the motion properties slowdown and course alteration as well as their temporal adjustment can be computed using formal methods. The crucial point is that, even if an indexing structure is used, the stated properties must be computed for each trajectory, and this results in sequential scan(s) on the whole trajectory data. Time-consuming processing of queries is not acceptable, however, in a scenario where fast reaction on incoming data streams is needed. An example of such a situation, with so-called tracks computed from radar and sonar data as input, is the detection of patterns of skiff movements typical for many piracy attacks [14]. A track comprises the position of an object at a time moment and can hold additional information, e.g. about its current course and velocity. Gathering the tracks of a single object over a time interval yields its trajectory over this interval.
To address the efficiency problem, we propose an indexing scheme which is not primarily focused on the "time-position data" of trajectories but uses meta information about them instead.
We start with a discussion of related work in Section 2. Section 3 provides some formal definitions on trajectories and their motion properties. In Section 4 we introduce the indexing scheme itself and illustrate algorithms for querying it. Section 5 summarizes the present work and outlines our future work.

2.   RELATED WORK
In this section we provide a short overview on previous contributions which are related to our approach. We start the section by reviewing classical indexing structures for moving objects data. Next to this, we show an approach which is similar in general terms to the proposed one, and finally we review literature related to semantical aspects of moving objects.

2.1   Indexing of Spatial, Temporal and Spatio-Temporal Data
The moving object databases community has developed several data structures for indexing movement data. According to [8], these structures can be roughly categorized as structures indexing only spatial data, also known as spatial access methods (SAM); indexing approaches for temporal data, also known as temporal index structures; and those which manage both spatial and temporal data, also known as spatio-temporal index structures. One of the first structures developed for SAMs is the well-known R-Tree [9]. Several extensions of R-Trees have been provided over the years, thus yielding a variety of spatio-temporal index structures. An informal schematic overview on these extensions, including also new developments such as the HTPR*-Tree [7], can be found in [11]. Since all of the proposed access methods focus mainly on the raw
spatio-temporal data, they are well-suited for queries on the history of movement and predicting new positions of moving objects, or for returning the most similar trajectories to a given one. If a query consists only of a qualitative description, however, all the proposed indexing structures are of no use.

2.2   Applying Dimensionality Reduction upon Indexing - the GEMINI Approach
The overall approach we consider in this work is similar to the GEMINI (GEneric Multimedia INdexIng method) indexing scheme presented in [6]. This approach was originally proposed for time series and has later been applied to other types of data, e.g. for motion data in [16]. The main idea behind GEMINI is to reduce the dimensionality of the original data before indexing. Therefore, representatives of much lower dimensionality are created for the data (trajectory or time series) to be indexed by using an appropriate transform, and these representatives are used for indexing. A crucial result in [6] is that the authors proved that, in order to guarantee no false dismissals [12], the exploited transform must retain the distance (or similarity) of the data to be indexed, that is, the distance between representatives should not exceed the distance of the original time series.
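Written as a formula (our notation, not taken verbatim from [6]): if D denotes the distance measure on the original data, D_feature the distance on the representatives and F the transform, the requirement is the lower-bounding condition

   D_feature(F(x), F(y)) ≤ D(x, y)   for all objects x, y,

i.e. distances in the representative space never overestimate the original distances, which is exactly what rules out false dismissals when filtering on the representatives.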
In the mentioned approaches, the authors achieve encouraging results on querying the most similar trajectories (or time series) to a given one. However, since the representatives of the original data are trajectories or time series, respectively, evaluating a query which only describes a motion behavior would result in the inspection of all representatives.

2.3   Semantical Properties of Movement
Semantical properties of movement data have been considered in various works, e.g. in [2], [5], and [15].
The authors of [2] propose a spatio-temporal representation scheme for moving objects in the area of video data. The considered representation scheme distinguishes between spatio-temporal data of trajectories and their topological information, and also utilizes information about distances between pairs of objects. The topological information itself is defined through a set of topological relation operators expressing spatial relations between objects over some time interval, including faraway, disjoint, meet, overlap, is-included-by/includes and same.
In [5], a comprehensive study on the research that has been carried out on data mining and visual analysis of movement patterns is provided. The authors propose a conceptual framework for movement behavior of different moving objects. The extracted behavior patterns are classified according to a taxonomy.
In [15], the authors provide some aspects related to a semantic view of trajectories. They show a conceptual approach for how trajectory behaviors can be described by predicates that involve movement attributes and/or semantic annotations. The provided approach is rather informal and considers behavior analysis of moving objects on a general level.

3.   FORMAL BACKGROUND
This section provides the formal notions as well as the definitions needed throughout the rest of the paper. We start with the term trajectory and later direct our attention to motion properties and patterns.

3.1   Trajectories
In our approach we consider the trajectory τo of an object o simply as a function of time which assigns a position to o at any point in time. Since time plays a role only for the determination of temporal causality between the positions of an object, we abstract from "real time" and use any time domain instead. A time domain is any set which is interval scaled and countably infinite. The first requirement ensures that timestamps can be used for ordering and, furthermore, that the "delay" between two time assignments can be determined. The second requirement ensures that we have an infinite number of "time moments" which can be unambiguously indexed by elements of N. In the following we denote a time domain by T.
Since objects move in a space, we also need a notion for a spatial domain. In the following, let S denote the spatial domain. We require that S is equipped with an adequate metric, such as the Euclidean distance (e.g. for S = R × R), which allows us to measure the spatial distance between objects.
Having the notions of time and space, we can define the term trajectory formally.

   Definition 1. Let T, S and O denote a time domain, a space domain and a set of distinct objects, respectively. Then, the trajectory τo of an object o ∈ O is a function τo : T → S.

For brevity, we can also write the trajectory of an object o ∈ O in the form (o, t0, s0), (o, t1, s1), . . . for those t ∈ T where τo(t) = s is defined. A single element (o, ti, si) is called the track of object o at time ti.

3.2   Motion Patterns
We consider a motion pattern as a sequence of properties of trajectories which reveal some characteristics of the behavior of the underlying moving objects. Such properties may be expressed through any predicates which are important for the particular analysis, such as start, stop, turn, or speedup.

   Definition 2. Let T be a time domain, 𝒯 be the set of trajectories of an object set O over T, and I_T be the set of all closed intervals over T. A motion property on 𝒯 is a function p : 2^𝒯 × I_T → {true, false}.

That is, a motion property is fulfilled for a set of trajectories and a certain time interval if the appropriate predicate is satisfied. To illustrate this definition, some examples of motion properties are provided below:

   • Appearance: Let t ∈ T. Then we define appear(·, ·) as follows: appear({τo}, [t, t]) = true ⇔ ∀t′ ∈ T : τo(t′) ≠ undefined → t ≤ t′. That is, an object "appears" only in the "first" moment it is being observed.

   • Speedup: Let t1, t2 ∈ T and t1 < t2. Then speedup(·, ·) is defined as follows: speedup({τo}, [t1, t2]) = true ⇔ v(τo, t1) < v(τo, t2) ∧ ∀t ∈ T : t1 ≤ t ≤ t2 → v(τo, t1) ≤ v(τo, t) ≤ v(τo, t2), where v(τo, t) denotes the velocity of the underlying moving object o at time t. That is, the predicate speedup is satisfied for a trajectory and a time interval if and only if the velocity of the underlying object is increasing in the considered time interval. Note that the increase may not be strictly monotonic.

   • Move away: Let t1, t2 ∈ T and t1 < t2. Then we define: moveaway({τo1, τo2}, [t1, t2]) = true ⇔ ∀t, t′ ∈ T : t1 ≤ t < t′ ≤ t2 → dist(τo1, τo2, t) < dist(τo1, τo2, t′), where the term dist(τo1, τo2, t) denotes the distance between the underlying moving objects o1 and o2 at time t. That is, two objects are moving away from each other for a time interval if their distance is increasing during the considered time interval.
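In the same manner, further application-specific properties can be added. For instance, a slowdown property – the predicate used in the introductory example – could be defined analogously to speedup (our sketch, not part of the original definition list):

   slowdown({τo}, [t1, t2]) = true ⇔ v(τo, t1) > v(τo, t2) ∧ ∀t ∈ T : t1 ≤ t ≤ t2 → v(τo, t1) ≥ v(τo, t) ≥ v(τo, t2),

i.e. the velocity of the underlying object decreases (not necessarily strictly) over the considered time interval.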
Figure 1: Overview of the index structure

Using motion properties, a motion pattern of a single trajectory or a set of trajectories is defined as a sequence of motion properties ordered by the time intervals in which they are fulfilled. It is important to note that this common definition of a motion pattern allows multiple occurrences of the same motion property in the sequence. In order to get a well-defined notion, it has to be required that the time intervals in which the motion properties are fulfilled are disjoint, or that meaningful preferences on the motion properties are specified in order to allow ordering in case the time intervals overlap.

4.   TRAJECTORY INDEXING USING MOTION PROPERTIES
In this section we explain how the proposed index is being created and used. Index creation starts with the determination of the motion pattern of each trajectory to be indexed. For this purpose, the motion predicates specified by the user are computed. The resulting motion patterns are indexed with references to the original trajectories.
The resulting index is schematically depicted in Figure 1. TrIMPI consists mainly of a data structure holding the raw trajectory data, and secondary index structures for maintaining motion patterns. Thereby, we differentiate between indexing single motion properties and indexing motion patterns.
A query to the index can be stated either through a motion pattern or through a concrete trajectory. The index is searched for motion patterns containing the given one or the computed one, respectively. In both cases, the associated trajectories are returned. The following subsections consider the outlined procedures more precisely.

4.1   Indexing Trajectory Raw Data
Since the focus of TrIMPI is not on querying trajectories by example, the index structure for the raw trajectory data can be rather simple. For our implementation, we considered a trajectory record file as proposed by [3]. This structure (Figure 1) stores trajectories in records of fixed length. The overall structure of the records is as follows:

   ID_o   next_ptr   prev_ptr   {track_0, . . . , track_{num−1}}.

ID_o denotes the identifier of the underlying moving object, next_ptr and prev_ptr are references to the appropriate records holding further parts of the trajectory, and {track_0, . . . , track_{num−1}} is a list of tracks of a predefined fixed length num. If a record r_i for a trajectory τo gets filled, a new record r_j is created for τo holding its further tracks. In this case, next_ptr of r_i is set to point to r_j, and prev_ptr of r_j is set to point to r_i.
Using a trajectory record file, the data is not completely clustered, but choosing an appropriate record size leads to partial clustering of the trajectory data in blocks. This has the advantage that extracting the complete trajectory requires loading only as many blocks as are needed for storing the trajectory.

4.2   Indexing Motion Patterns
For the maintenance of motion patterns we consider two cases - single motion properties and sequences of motion properties. Storing single motion properties allows the efficient finding of trajectories which contain the considered motion property. This is advantageous if the searched property is not often satisfied. Thus, for each motion property p a "list" DBT_p holding all trajectories satisfying this property is maintained. As we shall see in Algorithm 4.3, we have to combine such lists and, thus, a simple unsorted list would not be very favourable. Therefore, we implement these lists through B+-Trees (ordered by the trajectory/object identifiers). An evaluation of the union and intersection of two B+-Trees with m and n leaves can be performed in O(m log((m + n)/m)) [4].
The search for motion patterns with more than one motion property can be conducted through the single DBT_p structures. However, if the query motion pattern is too long, too many intersections of the DBT_p structures will happen and the resulting trajectories will have to be checked for containing properties that match the given order, as well. To overcome this problem, sequences of motion properties are stored in an additional B+-Tree structure DBT. The elements of DBT have the form (p, τo) where p is a motion pattern and o ∈ O. To sort the elements of DBT, we apply lexicographical ordering. As a result, sequences with the same prefix are stored consecutively. Thus, storing of motion patterns that are prefixes of other motion patterns can be omitted.

4.3   Building the Index
The algorithm for the index creation is quite simple. It consists primarily of the following steps:

   • Determine the motion properties for each trajectory τo. Consider, if needed, a sliding window or some reduction or segmenting technique as proposed in [1], [6], [10], [12], [13], for example. Generate a list f of the motion properties of τo, ordered by their appearance in τo.
   • Store τo into the trajectory record file.
   • Apply Algorithm 4.1 to f to generate access keys relevant for indexing.
   • For each generated access key, check whether it is already contained in the index. If this is not the case, store it in the index. Link the trajectory record file entry of τo to the access key.

Algorithm 4.1 is used to generate the index keys of a pattern. An index key is any subpattern p′ = (p′_j)_{j=0}^{m−1} of a pattern p = (p_i)_{i=0}^{n−1} which is defined as follows:

   • For each j ≤ m − 1 there exists i ≤ n − 1 such that p′_j = p_i.
   • For all j, k such that 0 ≤ j < k ≤ m − 1 there exist i, l such that 0 ≤ i < l ≤ n − 1 and p′_j = p_i and p′_k = p_l.

To generate the list of index keys, Algorithm 4.1 proceeds iteratively. At each iteration of the outer loop (lines 3 to 16) the algorithm considers a single element p of the input sequence f. On the one hand, p is being added as an index key to the (interim) result (lines 14 and 15) and on the other hand it is being appended as a suffix to each previously generated index key (inner loop - lines 5 to 13). Algorithm 4.1 utilizes two sets whose elements are lists of motion properties - supplist and entries.
The set supplist contains at each iteration the complete set of index keys, including those which are prefixes of other patterns. The set entries is built in each iteration of the inner loop (lines 5 to 13) by appending the current motion property of the input sequence to any element of supplist. Thereby, at line 14 entries holds only index keys which are no prefixes of other index keys. Since the resulting lists of index keys are stored in a B+-Tree by applying a lexicographical order, sequences of motion properties which are prefixes of other sequences can be omitted. Therefore, the set entries is returned as final result (line 17).
Since the given procedure may result in the computation of up to 2^{k_0} different indexing keys for an input sequence with k_0 motion properties, a global constant G is used to limit the maximal length of index keys. Using an appropriate value for G leads to no drawbacks for the application. Furthermore, the proposed querying algorithm can handle queries longer than G.
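For illustration (an example of ours, not taken from the algorithm description): for the input sequence f = (speedup, turn, stop) and a sufficiently large G, Algorithm 4.1 ends with supplist containing all 2^3 − 1 = 7 non-empty subpatterns

   [speedup], [turn], [stop], [speedup, turn], [speedup, stop], [turn, stop], [speedup, turn, stop],

while the returned set entries contains only the keys that are not prefixes of other keys, namely [stop], [speedup, stop], [turn, stop] and [speedup, turn, stop].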
Algorithm 4.1 Building the indexing keys
Require: f is a sequence of motion properties
Require: G is the maximal length of sequences to be indexed
 1 function createIndexKeys( f )
 2     supplist ← empty set of lists
 3     for all a ∈ f do
 4         entries ← empty set of lists
 5         for all l ∈ supplist do
 6             new ← empty list
 7             if |l| ≤ G then
 8                 new ← l.append(a)
 9             else
10                 new ← l
11             end if
12             entries ← entries ∪ {new}
13         end for
14         entries ← entries ∪ {[a]}
15         supplist ← entries ∪ supplist
16     end for
17     return entries
18 end function

4.4   Searching for Motion Patterns
Since the index is primarily considered to support queries on sequences of motion properties, the appropriate algorithm for evaluating such queries given in the following is rather simple. In its "basic" version, query processing is just traversing the index and returning all trajectories referenced by index keys which contain the queried one (as a subpattern). This procedure is illustrated in Algorithm 4.2. There are, however, some special cases which have to be taken into account.

Algorithm 4.2 Basic querying of trajectories with a sequence of motion properties
Require: s is a sequence of motion properties; |s| ≤ G
Require: DBT is the index containing motion patterns
 1 function GetEntriesFromDBT(s)
 2     result ← {τo | ∃p s.t. s ≤ p ∧ (p, τo) ∈ DBT}
 3     return result
 4 end function

The first of them considers query sequences which are "too short". As stated in Section 4.2, it can be advantageous to evaluate queries containing only few motion properties by examination of the index structures for single motion properties. To be able to define an application-specific notion of "short" queries, we provide besides G an additional global parameter α for which 1 ≤ α < G holds. In Algorithm 4.3, which evaluates queries of patterns of arbitrary length, each pattern of length shorter than α is being handled in the described way (lines 3 to 8). It is important that each trajectory of the interim result is checked for whether it matches the queried pattern (lines 9 to 13).
The other special case are queries longer than G (lines 16 to 24). As we have seen in Algorithm 4.1, in such cases the index keys are cut to prefixes of length G. Thus, the extraction in this case considers the prefix of length G of the query sequence (line 17) and extracts the appropriate trajectories (line 18). Since these trajectories may still not match the query sequence, e.g. by not fulfilling some of the properties appearing at a position after G − 1 in the input sequence, an additional check of the trajectories in the interim result is made (lines 19 to 23).
The last case to consider are query sequences with length between α and G. In these cases, the index DBT holding the index keys is searched through a call to Algorithm 4.2 and the result is returned.

Algorithm 4.3 Querying trajectories with a sequence of arbitrary length
Require: s is a sequence of motion properties
Require: G is the maximal length of stored sequences
Require: DBT_p is the index of the property p
Require: 1 ≤ α < G maximal query length for searching single property indexes
 1 function GetEntries(s)
 2     result ← empty set
 3     if |s| < α then
 4         result ← 𝒯
 5         for all p ∈ s do
 6             suppset ← DBT_p
 7             result ← result ∩ suppset
 8         end for
 9         for all τo ∈ result do
10             if ! match(τo, s) then
11                 result ← result \ {τo}
12             end if
13         end for
14     else if |s| ≤ G then
15         result ← GetEntriesFromDBT(s)
16     else
17         k ← s[0..G − 1]
18         result ← GetEntriesFromDBT(k)
19         for all τo ∈ result do
20             if ! match(τo, s) then
21                 result ← result \ {τo}
22             end if
23         end for
24     end if
25     return result
26 end function

Finally, the function Match (Algorithm 4.4) checks whether a trajectory τo fulfills a pattern s. For this purpose, the list of motion properties of τo is being generated (line 2). Thereafter, s and the generated pattern of τo are traversed (lines 5 to 14) so that it can be checked whether the elements of s can be found in the trajectory pattern of τo in the same order. In this case the function Match returns true, otherwise it returns false.

Algorithm 4.4 Checks whether a trajectory matches a motion pattern
Require: τo is a valid trajectory
Require: s is a sequence of motion properties
 1 function match(τo, s)
 2     motion_properties ← compute the list of motion properties of τo
 3     index_s ← 0
 4     index_props ← 0
 5     while index_props < motion_properties.length do
 6         if motion_properties[index_props] = s[index_s] then
 7             index_s ← index_s + 1
 8         else
 9             index_props ← index_props + 1
10         end if
11         if index_s = s.length then
12             return true
13         end if
14     end while
15     return false
16 end function

5.   CONCLUSIONS AND OUTLOOK
In this paper we provided some first results of ongoing work on an indexing structure for trajectories of moving objects called TrIMPI. The focus of TrIMPI lies not on indexing spatio-temporal data but on the exploitation of motion properties of moving objects. For this purpose, we provided a formal notion of motion properties and showed how they form a motion pattern. Furthermore, we showed how these motion patterns can be used to build a meta index. Algorithms for querying the index were also provided. In the next steps, we will finalize the implementation of TrIMPI and perform tests in the scenario of the automatic detection of piracy attacks mentioned in the Introduction. As a conceptual improvement of the work provided in this paper, we consider a flexibilisation of the definition of motion patterns including arbitrary temporal relations between motion predicates.
6.   ACKNOWLEDGMENTS
The authors would like to give special thanks to their former student Lasse Stehnken for his help in implementing TrIMPI.

7.   REFERENCES
 [1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In D. B. Lomet, editor, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, FODO'93, Chicago, Illinois, USA, October 13-15, 1993, volume 730 of Lecture Notes in Computer Science, pages 69–84. Springer, 1993.
 [2] J.-W. Chang, H.-J. Lee, J.-H. Um, S.-M. Kim, and T.-W. Wang. Content-based retrieval using moving objects' trajectories in video data. In IADIS International Conference Applied Computing, pages 11–18, 2007.
 [3] J.-W. Chang, M.-S. Song, and J.-H. Um. TMN-Tree: New trajectory index structure for moving objects in spatial networks. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, pages 1633–1638. IEEE Computer Society, July 2010.
 [4] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, SODA '00, pages 743–752, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.
 [5] S. Dodge, R. Weibel, and A.-K. Lautenschütz. Towards a taxonomy of movement patterns. Information Visualization, 7(3):240–252, June 2008.
 [6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In R. T. Snodgrass and M. Winslett, editors, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, SIGMOD '94, pages 419–429, New York, NY, USA, 1994. ACM.
 [7] Y. Fang, J. Cao, J. Wang, Y. Peng, and W. Song. HTPR*-Tree: An efficient index for moving objects to support predictive query and partial history query. In L. Wang, J. Jiang, J. Lu, L. Hong, and B. Liu, editors, Web-Age Information Management, volume 7142 of Lecture Notes in Computer Science, pages 26–39. Springer Berlin Heidelberg, 2012.
 [8] R. H. Güting and M. Schneider. Moving Object Databases. Data Management Systems. Morgan Kaufmann, 2005.
 [9] A. Guttman. R-Trees: a dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data, SIGMOD '84, pages 47–57, New York, NY, USA, 1984. ACM.
[10] J. Hershberger and J. Snoeyink. Speeding Up the Douglas-Peucker Line-Simplification Algorithm. In P. Bresnahan, editor, Proceedings of the 5th International Symposium on Spatial Data Handling, SDH'92, Charleston, South Carolina, USA, August 3-7, 1992, pages 134–143. University of South Carolina, Humanities and Social Sciences Computing Lab, August 1992.
[11] C. S. Jensen. TPR-Tree Successors 2000–2012. http://cs.au.dk/~csj/tpr-tree-successors, 2013. Last accessed 24.03.2013.
[12] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3(3):263–286, 2001.
[13] E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In N. Cercone, T. Y. Lin, and X. Wu, editors, Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM'01, San Jose, California, USA, 29 November - 2 December 2001, pages 289–296. IEEE Computer Society, 2001.
[14] T. Polomski and H.-J. Klein. How to Improve Maritime Situational Awareness using Piracy Attack Patterns. 2013. Submitted.
[15] S. Spaccapietra and C. Parent. Adding meaning to your steps (keynote paper). In M. Jeusfeld, L. Delcambre, and T.-W. Ling, editors, Conceptual Modeling - ER 2011, 30th International Conference, ER 2011, Brussels, Belgium, October 31 - November 3, 2011, Proceedings, ER'11, pages 13–31. Springer, 2011.
[16] Y.-S. Tak, J. Kim, and E. Hwang. Hierarchical querying scheme of human motions for smart home environment. Eng. Appl. Artif. Intell., 25(7):1301–1312, Oct. 2012.
[17] Y. Tao, D. Papadias, and J. Sun. The TPR*-tree: an optimized spatio-temporal access method for predictive queries. In J. C. Freytag, P. C. Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors, Proceedings of the 29th international conference on Very large data bases - Volume 29, VLDB '03, pages 790–801. VLDB Endowment, 2003.
     Complex Event Processing in Wireless Sensor Networks

                                                            Omran Saleh
                                          Faculty of Computer Science and Automation
                                                Ilmenau University of Technology
                                                       Ilmenau, Germany
                                                 omran.saleh@tu-ilmenau.de


ABSTRACT
Most WSN applications require the number of deployed sensor nodes to be in the order of hundreds, thousands or more in order to monitor certain phenomena and capture measurements over a long period of time. In a centralized architecture, in which the sensor data captured by all the sensor nodes is sent to a central entity, such large sensor networks generate continuous streams of raw events¹.

¹The terms data, events and tuples are used interchangeably.

In this paper, we describe the design and implementation of a system that carries out complex event detection queries inside wireless sensor nodes. These queries filter and remove undesirable events. They can detect complex events and meaningful information by combining raw events with logical and temporal relationships, and output this information to an external monitoring application for further analysis. This system reduces the amount of data that needs to be sent to the central entity by avoiding the transmission of raw data outside the network. Therefore, it can dramatically reduce the communication burden between nodes and improve the lifetime of sensor networks.

We have implemented our approach for the TinyOS operating system on the TelosB and Mica2 platforms. We conducted a performance evaluation of our method, comparing it with a naive method. The results clearly confirm the effectiveness of our approach.

Keywords
Complex Event Processing, Wireless Sensor Networks, In-network processing, Centralized processing, Non-deterministic Finite State Automata

1. INTRODUCTION
Wireless sensor networks are defined as a distributed and cooperative network of devices, denoted as sensor nodes, that are densely deployed over a region, especially in harsh environments, to gather data about some phenomena in the monitored region. These nodes can sense the surrounding environment and share the information with their neighboring nodes. They are gaining adoption on an increasing scale for tracking and monitoring purposes. Furthermore, sensor nodes are often used for control purposes. They are capable of performing simple processing.

In the near future, wireless sensor networks are expected to offer and make conceivable a wide range of applications and to emerge as an important area of computing. WSN technology holds great potential for various application areas. WSNs are now found in many industrial and civilian application areas, military and security applications, environmental monitoring, disaster prevention, health care applications, etc.

One of the most important issues in the design of WSNs is energy efficiency. Each node should be as energy efficient as possible. Processing a chunk of information is less costly than wireless communication; the ratio between them is commonly assumed to be much smaller than one [19]. There is a significant link between energy efficiency and superfluous data: a sensor node consumes unnecessary energy for the transmission of superfluous data to the central entity, which reduces energy efficiency.

Furthermore, traditional WSN software systems do not aim at efficient processing of continuous data or event streams. Following these observations, we are looking for an approach that achieves high performance and power savings by preventing the generation and transmission of needless data to the central entity. Therefore, it can dramatically reduce the communication burden between nodes and improve the lifetime of sensor networks. This approach takes into account the resource limitations in terms of computation power, memory, and communication. Sensor nodes can employ their processing capabilities to perform some computations. Therefore, an in-network complex event processing² based solution is proposed.

²CEP is discussed in reference [15].

We propose to run a complex event processing engine inside the sensor nodes. The CEP engine transforms the raw data into meaningful and beneficial events that are notified to the users after they have been detected. It is responsible for combining primitive events to identify higher-level complex events. This engine provides an efficient Non-deterministic Finite state Automata (NFA) [1] based implementation to drive the evaluation of complex event queries, where the automaton runs as an integral part of the in-network query plan. It also provides the theoretical basis of CEP as well as particular operators (conjunction, negation, disjunction and sequence operators, etc.).

Complex event processing over data streams has increasingly become an important field due to the increasing number of its applications for wireless sensor networks. Various event detection applications have been proposed for WSNs, e.g. for detecting eruptions of volcanoes [18], forest fires, and for the habitat monitoring of animals [5]. An increasing number of applications in such networks is confronted with the necessity to process voluminous data streams in a real-time fashion.

The rest of the paper is organized as follows: Section 2 provides an overview of the naive approaches for normal data and complex event processing in WSNs. Related works are briefly reviewed in Section 3. We then introduce the overall system architecture for performing complex event processing in sensor networks in Section 4. Section 5 discusses how to create logical query plans to evaluate the sensor portion of queries. Section 6 explains our approach and how queries are implemented by automata. In Section 7, the performance of our system is evaluated using a particular simulator. Finally, Section 8 presents our concluding remarks and future work.

2. NAIVE APPROACHES IN WSNS
The ideas behind the naive approaches, which are definitely different from our approach, lie in the processing of data as the central architectural concept. For normal sensor data processing, the centralized approach proceeds in two steps: the sensor data captured by all the sensor nodes is sent to the sink node and then routed to the central server (base station), where it is stored in a centralized database. High-volume data arrives at the server. Subsequently, query processing takes place on this database by running queries against the stored data. Each query executes once and returns a set of results.

Another approach which adopts the idea of a centralized architecture is the use of a central data stream management system (DSMS), which simply takes the sensor data stream as its input source. Sending all sensor readings to a DSMS is also an option for WSN data processing. A DSMS is defined as a system that manages a data stream, executes continuous queries against a data stream, and supports on-line analysis of rapidly changing data streams [10]. Traditional stream processing systems such as Aurora [2], NiagaraCQ [7], and AnduIN [12] extend relational query processing to work with stream data. Generally, the select, project, join and aggregate operations are supported in these stream systems.

The naive approach for complex event processing in WSNs is similar to the central architectural idea of normal data processing, but instead of using a traditional database or data stream engine, CEP uses a dedicated engine for processing complex events, such as Esper [8], SASE [11] and Cayuga [4], in which sensor data or event streams need to be filtered, aggregated, processed and analyzed to find the events of interest, to identify patterns among them, and finally to take actions if needed.

Reference [11] uses SASE in order to process RFID stream data for a real-world retail management scenario. Paper [3] demonstrates the use of the Esper engine for object detection and tracking in sensor networks. All the aforementioned engines use some variant of an NFA model to detect complex events. Moreover, there are many CEP engines in the field of active databases. Most of the models in these engines are based on fixed data structures such as trees, graphs, finite automata or petri nets. The authors of [6] used a tree-based model. Paper [9] used a petri net based model to detect complex events from an active database. Reference [17] used Timed Petri-Nets (TPN) to detect complex events from RFID streams.

3. RELATED WORKS
It is preferable to perform in-network processing inside the sensor network to reduce the transmission cost between neighboring nodes. This concept is proposed by several systems such as TinyDB [16] and Cougar [19]. The Cougar project applies database system concepts to sensor networks. It uses declarative queries that are similar to SQL to query sensor nodes. Additionally, sensor data in Cougar is treated like a "virtual" relational database. Cougar places on each node an additional query layer that lies between the network and application layers and has the responsibility for in-network processing. This system generates one plan for the leader node to perform aggregation and send the data to a sink node. Another plan is generated for non-leader nodes to measure the sensor status. The query plans are disseminated to the query layers of all sensor nodes. The query layer registers the plan inside the sensor node, enables the desired sensors, and returns results according to this plan.

TinyDB is an acquisitional query processing system for sensor networks which maintains a single, infinitely-long virtual database table. It uses an SQL-like interface to ask for data from the network. In this system, users specify the data they want and the rate at which the data should be refreshed, and the underlying system decides the best plan to be executed. Several in-network aggregation techniques have been proposed in order to extend the lifetime of sensor networks, such as tree-based aggregation protocols, i.e., directed diffusion.

Paper [13] proposes a framework to detect complex events in wireless sensor networks by transforming them into sub-events. In this case, the sub-events can easily be detected by sensor nodes. Reference [14] splits queries into server and node queries, where each query can be executed; the final results from both sides are combined by a results merger. In [20], symbolic aggregate approximation (SAX) is used to transform sensor data into symbolic representations. To detect complex events, a distance metric for string comparison is utilized. These papers are the closest works to our system.

Obviously, there is currently little work on how the idea of in-network processing can be extended and implemented to allow more complex event queries to be resolved within the network.

4. SYSTEM ARCHITECTURE
We have proposed a system architecture in which the data collected at numerous, inexpensive sensor nodes is processed locally. The resulting information is transmitted to larger, more capable and more expensive nodes for further analysis and processing through a specific node called the sink node.

The architecture has three main parts that need to be modified or created to make our system better suited to queries over sensor nodes: 1- Server side: queries are originated at the server side and then forwarded to the nearest sink node. Additionally, this side mainly contains an application that runs on the user's PC (base station). Its main purpose is to collect the result streams from the sensor network and display them. The server side application can offer more functions, i.e., further filtering of the collected data, joining of sensor data, extracting, saving, managing, and searching the semantic information, and applying further complex event processing to incoming events after they have been processed locally in the sensor nodes. Because sensor data can be considered as a data stream, we propose to use a data stream management system to play the role of the server side; for that we selected the AnduIN data stream engine. 2- Sink side: the sink node (also known as root or gateway node) is one of the motes in the network which communicates with the base station directly; all the data collected by sensors is forwarded to a sink node and then to the server side. This node is in charge of disseminating the query that comes from the server side down to all the sensor nodes in the network. 3- Node side: on this side, we have made substantial changes to the traditional application which runs on the nodes themselves to enable database-style queries involving filters, aggregates, a complex event processing operator (engine) and other operators to be executed within the sensor network. These changes are done in order to reduce communication costs and obtain useful information instead of raw data.

When combining the on-sensor portions of the query with the server side query, most of the pieces of the sensor data query are in place. This makes our system more advanced.

5. LOGICAL PLAN
Each sensor node of a network generates tuples. Every tuple may consist of information about the node id and sensor readings. A query plan specifies the tuple flow between all necessary operators and a precise computation plan for each sensor node. Figure 1 (lower plan) illustrates how our query plan can be employed. It corresponds to a directed acyclic graph of operators. We assume the dataflow to be upward. At the bottom, there is a homogeneous data source which generates data tuples that must be processed by the operators belonging to the query plans. Tuples flow through the intermediate operators composed in the query graph. The operators perform the actual processing and eventually forward the data to the sink operator, which transmits the resulting information to the server side (base station). These operators adopt a publish/subscribe mechanism to transfer tuples from one operator to the next.

Figure 1: Logical query plan

We distinguish between three different types of operators within a query graph [12]: 1- Source operator: produces tuples and transfers them to other operators. 2- Sink operator: receives incoming tuples from other operators. 3- Inner operators: receive incoming tuples from the source operator, process them, and transfer the result to the sink operator or other inner operators.

A query plan consists of one source at the bottom of a logical query graph, several inner operators, and one sink at the top, and the tuples flow strictly upward. In our system, we have extended this plan to give the system the capability to perform complex event processing and detection by adding new operators. We have separated the mechanism for detecting complex events from the rest of the normal processing. We have a particular component working as an extra operator or engine within the main process, as can be seen in Figure 1 (upper plan). The detection mechanism takes as input primitive events from lower operators and detects occurrences of composite events, which are passed as output to the rest of the system.
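To make the operator model above concrete, the following minimal Python sketch wires a source, an inner filter operator, and a sink via a publish/subscribe mechanism. It is an illustrative sketch only; the actual system is implemented in TinyOS on the sensor nodes, and all class and method names used here (Operator, subscribe, publish) are our own assumptions.

```python
# Minimal sketch of a publish/subscribe operator chain (illustrative only;
# the real system is implemented in TinyOS/nesC on the sensor nodes).

class Operator:
    def __init__(self):
        self.subscribers = []          # downstream operators

    def subscribe(self, op):
        self.subscribers.append(op)
        return op

    def publish(self, tuple_):
        for op in self.subscribers:
            op.process(tuple_)

    def process(self, tuple_):         # inner operators override this
        self.publish(tuple_)

class Source(Operator):
    def emit(self, tuple_):            # called for every sensor reading
        self.publish(tuple_)

class Filter(Operator):                # an example inner operator
    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate

    def process(self, tuple_):
        if self.predicate(tuple_):
            self.publish(tuple_)

class Sink(Operator):                  # forwards results toward the base station
    def process(self, tuple_):
        print("send to sink node:", tuple_)

# Wiring a tiny plan: source -> filter -> sink
source, sink = Source(), Sink()
source.subscribe(Filter(lambda t: t["temp"] > 30)).subscribe(sink)
source.emit({"node_id": 7, "temp": 33})
```

The CEP engine described in the next section would simply be one more inner operator plugged into such a chain.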
6. IN-NETWORK CEP SYSTEM
Various applications, including WSNs, require the ability to handle complex events among apparently unrelated events and to find interesting and/or special patterns. Users want to be notified immediately as soon as these complex events are detected. Sensor node devices generate massive sensor data streams. These streams continuously produce a variety of primitive events. The continuous events form a sequence of primitive events, and recognition of the sequence supplies a high-level event which the users are interested in.

Sensor event streams have to be automatically filtered, processed, and transformed into significant information. In a non-centralized architecture, CEP has to be performed as close to real time as possible (inside the node). The task of identifying composite events from primitive ones is performed by the complex event processing engine. A CEP engine provides the runtime to perform complex event processing: it accepts queries provided by the user, matches those queries against continuous event streams, and triggers an event or an execution when the conditions specified in the queries have been satisfied. This concept is close to the Event-Condition-Action (ECA) concept in conventional database systems, where an action has to be carried out in response to an event when one or more conditions are satisfied.

Each data tuple from the sensor node is viewed as a primitive event, and it has to be processed inside the node. We have proposed an event detection system that specifically targets applications with limited resources, such as our system. There are four phases for complex event processing in our in-network model: NFA creation, Filtering, Sequence scan and Response, as shown in Figure 2.

Figure 2: CEP Phases

6.1 NFA Creation Phase
The first phase is NFA creation. The NFA's structure is created by translating the sequence pattern, mapping the events to NFA states and edges, where the conditions of the events (generally called event types) are associated with the edges. For pattern matching over sensor node streams, an NFA is employed to represent the structure of an event sequence. For a concrete example, consider the query pattern SEQ(A a, B+ b, C c)³. Figure 3 shows the NFA created for the aforementioned pattern (A, B+, C), where state S0 is the starting state, state S1 stands for the successful detection of an A event, state S2 for the detection of a B event after event A, and state S3 for the detection of a C event after the B event. State S1 contains a self-loop with the condition of a B event. State S3 is the accepting state; reaching this state indicates that the sequence has been detected.

Figure 3: NFA for SEQ(A a, B+ b, C c)

³In this paper, we focus only on the sequence operator SEQ because of the limited number of pages.
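As an illustration of the NFA creation phase, the sketch below translates a SEQ pattern into a simple transition table, installing a self-loop for a Kleene-plus step. The representation (a dictionary of labeled transitions) is an illustrative assumption and not the exact structure used on the motes.

```python
# Illustrative sketch: translate SEQ(A, B+, C) into an NFA transition table.
# States are numbered 0..n; the last state is accepting. A '+' event type
# additionally installs a self-loop on the state reached by that type.

def build_seq_nfa(pattern):
    """pattern: list of event types, e.g. ["A", "B+", "C"]."""
    transitions = {}                      # (state, event_type) -> next state
    state = 0
    for step in pattern:
        event_type = step.rstrip("+")
        transitions[(state, event_type)] = state + 1
        if step.endswith("+"):            # Kleene-plus: allow repetitions
            transitions[(state + 1, event_type)] = state + 1
        state += 1
    return transitions, state             # accepting state = last state

nfa, accepting = build_seq_nfa(["A", "B+", "C"])
# nfa == {(0, 'A'): 1, (1, 'B'): 2, (2, 'B'): 2, (2, 'C'): 3}, accepting == 3
```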
6.2 Filtering Phase
The second phase is to filter the primitive events generated by the sensor nodes at an early stage. Sensor nodes cannot decide by themselves whether a particular event is necessary or not. When additional conditions are added to the system, possible event instances can already be pruned in this first stage.

After filtering, a timestamp operator adds the occurrence time t of the event. This new operator is designed to attach a timestamp t to the events (tuples) before they enter the complex event processing operator; this can be seen in Figure 1. The timestamp attribute value t of an event records the reading of a clock in the system in which the event was created, so that it can reflect the true order of the occurrences of primitive events.
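As a small standalone sketch (not the actual TinyOS operators), the filtering and timestamp steps described above can be pictured as two functions applied to every tuple before it reaches the CEP operator; the function and field names are assumptions.

```python
import time

# Illustrative sketch of the filtering and timestamp steps applied to each
# tuple before it enters the CEP operator (names and fields are assumptions).

def filter_event(tuple_, condition):
    """Drop primitive events that cannot contribute to any pattern."""
    return tuple_ if condition(tuple_) else None

def add_timestamp(tuple_, clock=time.monotonic):
    """Attach the local occurrence time t, preserving the true event order."""
    tuple_["t"] = clock()
    return tuple_

event = {"node_id": 3, "type": "A", "temp": 31}
if (e := filter_event(event, lambda t: t["temp"] > 30)) is not None:
    cep_input = add_timestamp(e)      # forwarded to the sequence scan phase
```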
6.3 Sequence Scan Phase
The third phase is the sequence scan, which detects a pattern match. We have three modes that state the way in which events may contribute to scanning a sequence: UNRESTRICTED, RECENT and FIRST. Every mode has a different behavior. The selection between them depends on the users and the application domain. These modes have advantages and disadvantages, which we illustrate below.

In the UNRESTRICTED mode, each start event e which allows a sequence to move from the initial state to the next state starts a separate sequence detection. In this case, any event occurrence combination that matches the definition of the sequence can be considered as an output. By using this mode, we can obtain all combinations of events which satisfy the sequence. When a sequence is created, it waits in its starting state for the arrival of events. Once a new instance event e arrives, the sequence scan responds as follows: 1- It checks whether the type of the instance (taken from its attributes) and the occurrence time of e satisfy a transition for one of the existing logical sequences. If not, the event is directly rejected. 2- If yes, e is registered in the system (the registration is done in the sliding window) and the sequence advances to the next state. 3- If e allows a sequence to move from the starting state to the next state, the engine creates another logical sequence to process further incoming events, while keeping the original sequence in its current state to receive new events. Therefore, multiple sequences work on the events at the same time. 4- Sequences are deleted when their last received items are not within the time limit; it becomes impossible for them to proceed to the next state, since the time limits for future transitions have already expired.

Next, we use an example to illustrate how the UNRESTRICTED sequence scan works. Suppose we have the pattern⁴ SEQ (A, B+, D) and the sequence of events (tuples) [a1, b2, a3, c4, c5, b6, d7, ...] within 6 time units. Figure 4 shows, step by step, how the aforementioned events are processed. Once the sequence has reached the accepting state (F), the occurrences of SEQ (A, B+, D) are established as: {{a1, b2, d7}, {a1, b6, d7}, {a3, b6, d7}}.

Figure 4: Sequence Scan for SEQ (A, B+, D) within 6 Time Units Using UNRESTRICTED Mode

⁴The terms complex event, composite event, pattern and sequence are used interchangeably.

The drawback of this mode is the use of a large amount of storage to accumulate all the events that participate in the combinations, in addition to the computational overhead for the detection; it consumes more energy. On the other hand, it gives us all possible event combinations, which can be used, e.g., for further analysis. In our system, we only output one of these possibilities to reduce the transmission cost overhead. All registered events are stored in a sliding window. Once an overflow has occurred, the candidate events to be replaced are the newest registered ones from the first sequence. The engine continues to replace events from the first sequence as long as there is no space. When the initial event (the first event in the first sequence combination) would be replaced, the engine starts the replacement from the second sequence, and so on. The engine applies this replacement policy to ensure that the system still has several sequences available to detect a composite event, because replacing the initial events would destroy the whole sequence.
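A compact sketch of the UNRESTRICTED scan over the example above is shown below. It spawns a new logical run whenever an event can extend an existing run and reports every combination that reaches the accepting state. It is a simplification under our own assumptions: each match binds one event per pattern step (as in the three combinations listed above), and the sliding-window and time-unit bookkeeping are omitted.

```python
# Illustrative sketch of UNRESTRICTED sequence scan for SEQ(A, B+, D).
# Simplification: each match binds one event per pattern step; sliding-window
# and time-unit handling are omitted.

from dataclasses import dataclass

@dataclass
class Run:                       # one logical sequence detection
    state: int                   # index of the next pattern step to match
    bound: tuple                 # events collected so far

def unrestricted_scan(pattern, stream):
    runs, matches = [Run(0, ())], []
    for event in stream:
        new_runs = []
        for run in runs:
            if run.state < len(pattern) and event[0].upper() == pattern[run.state]:
                advanced = Run(run.state + 1, run.bound + (event,))
                if advanced.state == len(pattern):
                    matches.append(advanced.bound)     # accepting state reached
                else:
                    new_runs.append(advanced)          # keep scanning
        runs += new_runs                               # old runs stay active too
    return matches

stream = ["a1", "b2", "a3", "c4", "c5", "b6", "d7"]
print(unrestricted_scan(["A", "B", "D"], stream))
# [('a1', 'b2', 'd7'), ('a1', 'b6', 'd7'), ('a3', 'b6', 'd7')]
```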
In the FIRST mode, the earliest occurrence of each contributing event type is used to form the composite event output. Only the first event from a group of events of the same type advances the sequence to the next state. In this mode, there is just one sequence in the system. For every incoming instance e, the automaton engine examines whether its type and occurrence time satisfy a transition from the current state to the next state. If so, the sequence registers the event in the current state and advances to the next state. If not, the event is directly rejected. Suppose we have the pattern SEQ (A, B+, C+, D) and the sequence of tuples [a1, a2, b3, c4, c5, b6, d7, ...] within 6 time units. The result is shown in the upper part of Figure 5.

In the RECENT mode (shown in the lower part of Figure 5, which uses the same pattern and the same sequence of tuples as the FIRST example), the most recent occurrences of the contributing event types are used to form the composite event. In RECENT mode, once an instance satisfies the condition and timing constraint to jump from a state to the next state, the engine stays in the current state, unlike in FIRST mode. This mode tries to find the most recent instance among consecutive instances for that state before moving to the next state. When a1 enters the engine, it satisfies the condition to move from S0 to S1. The engine registers it, stays in S0 and does not jump to the next state, since a newer incoming instance may be more recent than the one currently registered in that state.

Figure 5: First and Recent Modes

The advantages of the FIRST and RECENT modes are the use of less storage to accumulate the events that participate in the combinations, since only a few events are registered in the system, and a low computational overhead for the detection. They consume less energy. Unlike UNRESTRICTED, they do not give all possible matches.
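To contrast the two modes, the following illustrative fragment implements a single-run scan in which the binding policy decides whether an already-bound step keeps its earliest candidate (FIRST) or is replaced by the most recent one before the run advances (RECENT). Again this is a simplified sketch under our own assumptions: one event per step and no timing constraints.

```python
# Illustrative single-run scan; "first" keeps the earliest binding per step,
# "recent" replaces it with the newest candidate until the next step starts.

def single_run_scan(pattern, stream, mode="first"):
    bound, state = [], 0                       # bound[i] = event for step i
    for event in stream:
        etype = event[0].upper()
        if state < len(pattern) and etype == pattern[state]:
            bound.append(event)                # advance to the next step
            state += 1
        elif mode == "recent" and state > 0 and etype == pattern[state - 1]:
            bound[-1] = event                  # re-bind the previous step
    return bound if state == len(pattern) else None

stream = ["a1", "a2", "b3", "c4", "c5", "b6", "d7"]
print(single_run_scan(["A", "B", "C", "D"], stream, mode="first"))
# ['a1', 'b3', 'c4', 'd7']  -- earliest contributing events
print(single_run_scan(["A", "B", "C", "D"], stream, mode="recent"))
# ['a2', 'b3', 'c5', 'd7']  -- most recent candidates before each advance
```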
6.4 Response Phase
Once an accepting state F is reached by the engine, the engine should immediately output the event sequence. This phase is responsible for preparing the output sequence and passing it to the sink operator. The output sequence depends on the mode of the scan. This phase creates the response by reading the sliding window contents. In the case of the FIRST and RECENT modes, the sliding window contains only the events which contribute to the sequence detection. In UNRESTRICTED mode, the engine randomly selects one combination of events which matches the pattern, in order to reduce the transmission cost.

7. EVALUATION
We have completed an initial in-network complex event processing implementation. All the source code, implementing the in-network complex event processing techniques as well as the base station functionality, is written in TinyOS. Our code runs successfully on both real motes and the TinyOS Avrora simulator. The aim of the proposed work is to compare the performance of our system, an in-network processor which includes a complex event engine, with the centralized approach in wireless sensor networks, and to assess the suitability of our approach in an environment where resources are limited. The comparison is done in terms of energy efficiency (amount of energy consumed) and the number of messages transmitted per particular interval in the entire network. The experiment was run with varying SEQ lengths: we started with length 2, then 3, and finally 5. Simulations were run for 60 seconds with one event per second. The performance for different SEQ lengths and different modes with a network of 75 nodes is shown in Figure 6.

Figure 6: Total Energy Consumption

The centralized architecture led to higher energy consumption because sensor nodes transmitted events to the sink node at regular intervals. In our system, we used in-network complex event processing to decrease the number of transmissions of needless events at each sensor node. What we can observe from Figure 6 is summarized as follows: 1- By increasing the SEQ length in our approach, the RAM consumption increases while the energy consumption is reduced. The reason is that a transmission does not occur until the sequence reaches the accepting state, which relatively few events (tuples) will satisfy; hence, the number of transmissions after detections decreases. 2- FIRST is slightly better than RECENT, and both of them are better than UNRESTRICTED in energy consumption. The gap between them results from the processing energy consumption, because UNRESTRICTED needs more processing power while the others need less, as shown in Figure 6.

Figure 7 shows the radio energy consumption of each sensor node and the total number of messages when the SEQ length was 3. The nodes in the centralized architecture sent more messages than in our approach (nearly three times more). Hence, they consumed more radio energy. Additionally, the gateway nodes consumed more radio energy due to receiving and processing the messages from the other sensor nodes. In a 25-node network, the centralized approach consumed nearly 4203 mJ of energy on the sink side, while our approach consumed around 2811 mJ. Thus, our system conserved nearly 1392 mJ (33% of the centralized approach) of energy. In our architecture, the number of transmissions was reduced. Therefore, the radio energy consumption is reduced not only at the sensor nodes but also at the sink nodes.

Figure 7: Energy Consumption vs. Radio Message

8. CONCLUSIONS
Sensor networks provide a considerably challenging programming and computing environment. They require advanced paradigms for software design, due to characteristics such as the limited computational power, limited memory and limited battery power that WSNs suffer from. In this paper, we presented our system, an in-network complex event processing system that efficiently carries out complex event queries inside network nodes.

We have proposed an engine that allows the system to detect complex events and valuable information from primitive events.

We developed a query plan based approach to implement the system. We provided the architecture to collect the events from the sensor network; this architecture includes three sides: the sensor side, which performs in-network complex event processing; the sink side, which delivers the events from the network; and the AnduIN server side, which has the responsibility to display them and perform further analysis.

We demonstrated the effectiveness of our system in a detailed performance study. The results obtained from a comparison between the centralized approach and our approach confirm that our in-network complex event processing increases the lifetime of both small-scale and large-scale sensor networks. We plan to continue our research to build a distributed in-network complex event processing, in which each sensor node has a different complex event processing plan and the nodes can communicate directly with each other to detect complex events.

9. REFERENCES
[1] Nondeterministic finite automaton. http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton.
[2] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J.-h. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. Zdonik. Aurora: a data stream management system. In ACM SIGMOD Conference, page 666, 2003.
[3] R. Bhargavi, V. Vaidehi, P. T. V. Bhuvaneswari, P. Balamuralidhar, and M. G. Chandra. Complex event processing for object tracking and intrusion detection in wireless sensor networks. In ICARCV, pages 848–853. IEEE, 2010.
[4] L. Brenna, A. Demers, J. Gehrke, M. Hong, J. Ossher, B. Panda, M. Riedewald, M. Thatte, and W. White. Cayuga: a high-performance event processing engine. In ACM SIGMOD, pages 1100–1102, New York, NY, USA, 2007. ACM.
[5] A. Cerpa, J. Elson, D. Estrin, L. Girod, M. Hamilton, and J. Zhao. Habitat monitoring: application driver for wireless communications technology. SIGCOMM Comput. Commun. Rev., 31(2 supplement):20–41, Apr. 2001.
[6] S. Chakravarthy, V. Krishnaprasad, E. Anwar, and S.-K. Kim. Composite events for active databases: semantics, contexts and detection. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 606–617, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[7] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD, pages 379–390, New York, NY, USA, 2000. ACM.
[8] EsperTech. Event stream intelligence: Esper & NEsper. http://www.esper.codehaus.org/.
[9] S. Gatziu and K. R. Dittrich. Events in an active object-oriented database system, 1993.
[10] V. Goebel and T. Plagemann. Data stream management systems - a technology for network monitoring and traffic analysis? In ConTEL 2005, volume 2, pages 685–686, June 2005.
[11] D. Gyllstrom, E. Wu, H. Chae, Y. Diao, P. Stahlberg, and G. Anderson. SASE: complex event processing over streams (Demo). In CIDR, pages 407–411, 2007.
[12] D. Klan, M. Karnstedt, K. Hose, L. Ribe-Baumann, and K. Sattler. Stream engines meet wireless sensor networks: cost-based planning and processing of complex queries in AnduIN. Distributed and Parallel Databases, 29(1):151–183, Jan. 2011.
[13] Y. Lai, W. Zeng, Z. Lin, and G. Li. LAMF: framework for complex event processing in wireless sensor networks. In 2nd International Conference on Information Science and Engineering (ICISE), pages 2155–2158, Dec. 2010.
[14] P. Li and W. Bingwen. Design of complex event processing system for wireless sensor networks. In NSWCTC, volume 1, pages 354–357, Apr. 2010.
[15] D. C. Luckham. The power of events. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.
[16] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, Mar. 2005.
[17] J. Xingyi, L. Xiaodong, K. Ning, and Y. Baoping. Efficient complex event processing over RFID data stream. In IEEE/ACIS, pages 75–81, May 2008.
[18] X. Yang, H. B. Lim, T. M. Özsu, and K. L. Tan. In-network execution of monitoring queries in sensor networks. In ACM SIGMOD, pages 521–532, New York, NY, USA, 2007. ACM.
[19] Y. Yao and J. Gehrke. The cougar approach to in-network query processing in sensor networks. SIGMOD Rec., 31(3):9–18, Sept. 2002.
[20] M. Zoumboulakis and G. Roussos. Escalation: complex event detection in wireless sensor networks. In EuroSSC, pages 270–285, 2007.
                       XQuery processing over NoSQL stores

                  Henrique Valer                           Caetano Sauer                        Theo Härder
            University of Kaiserslautern             University of Kaiserslautern        University of Kaiserslautern
                  P.O. Box 3049                            P.O. Box 3049                       P.O. Box 3049
              67653 Kaiserslautern,                    67653 Kaiserslautern,               67653 Kaiserslautern,
                     Germany                                  Germany                             Germany
               valer@cs.uni-kl.de                      csauer@cs.uni-kl.de                haerder@cs.uni-kl.de


ABSTRACT
Using NoSQL stores as the storage layer for declarative query processing with XQuery provides a high-level interface to process data in an optimized manner. The term NoSQL refers to a plethora of new stores which essentially trade off the well-known ACID properties for higher availability or scalability, using techniques such as eventual consistency, horizontal scalability, efficient replication, and schema-less data models. This work proposes a mapping from the data models of different kinds of NoSQL stores (key/value, columnar, and document-oriented) to the XDM data model, thus allowing for standardization and for querying NoSQL data using higher-level languages such as XQuery. This work also explores several optimization scenarios to improve performance on top of these stores. Besides, we add updating semantics to XQuery by introducing simple CRUD-enabling functionalities. Finally, this work analyzes the performance of the system in several scenarios.

Keywords
NoSQL, Big Data, key/value, XQuery, ACID, CAP

1. INTRODUCTION
We have seen a trend towards specialization in the database market in the last few years. There is no longer a one-size-fits-all approach when it comes to storing and dealing with data, and different types of DBMSs are being used to tackle different types of problems. One of these problems is the Big Data topic.

It is not completely clear what Big Data means after all. Lately, it has been characterized by the so-called 3 V's: volume, comprising the actual size of data; velocity, comprising essentially the time span in which data must be analyzed; and variety, comprising the types of data. Big Data applications need to understand how to create solutions along these data dimensions.

RDBMS have had problems when facing Big Data applications, as in web environments. Two of the main reasons for that are scalability and flexibility. The solution RDBMS provide is usually twofold: either (i) a horizontally-scalable architecture, which in database terms generally means giving up joins and also complex multi-row transactions; or (ii) using parallel databases, i.e., using multiple CPUs and disks in parallel to optimize performance. While the latter increases complexity, the former simply gives up operations because they are too hard to implement in distributed environments. Nevertheless, these solutions are neither scalable nor flexible.

NoSQL tackles these problems with a mix of techniques, which involves either weakening ACID properties or allowing more flexible data models. The latter is rather simple: some scenarios, such as web applications, do not conform to a rigid relational schema, cannot be bound to the structures of an RDBMS, and need flexibility. Solutions exist, such as using XML, JSON, or pure key/value stores as the data model for the storage layer. Regarding the former, some NoSQL systems relax consistency by using mechanisms such as multi-version concurrency control, thus allowing for eventually consistent scenarios. Others support atomicity and isolation only when each transaction accesses data within some convenient subset of the database. Atomic operations would require some distributed commit protocol, like two-phase commit, involving all nodes participating in the transaction, and that would definitely not scale. Note that this has nothing to do with SQL, as the acronym NoSQL suggests. Any RDBMS that relaxes ACID properties could scale just as well and keep SQL as its query language.

Nevertheless, when it comes to performance, NoSQL systems have shown some interesting improvements. When considering update- and lookup-intensive OLTP workloads, the scenarios where NoSQL systems are most often considered, the work of [13] shows that the total OLTP time is almost evenly distributed among four possible overheads: logging, locking, latching, and buffer management. In essence, NoSQL systems improve locking by relaxing atomicity when compared to RDBMS.

When considering OLAP scenarios, RDBMS require a rigid schema to perform the usual OLAP queries, whereas most NoSQL stores rely on a brute-force processing model called MapReduce. It is a linearly-scalable programming model for processing and generating large data sets, and it works with any data format or shape. Using MapReduce, parallelization details, fault tolerance, and distribution aspects are transparently handled for the user. Nevertheless, it requires implementing queries from scratch and still suffers from the lack of proper tools to enhance its querying capabilities. Moreover, when executed atop raw files, the processing is inefficient. NoSQL stores provide this structure, so one could provide a higher-level query language to take full advantage of it, like Hive [18], Pig [16], and JAQL [6].

These approaches require learning separate query languages, each of which is specifically made for one implementation. Besides, some of them require schemas, like Hive and Pig, thus making them quite inflexible. On the other hand, there exists a standard that is flexible enough to handle the data flexibility offered by these different stores, whose compilation steps are directly mappable to distributed operations on MapReduce, and that has been standardized for over a decade: XQuery.

Contribution
Consider employing XQuery for implementing the large class of query-processing tasks, such as aggregating, sorting, filtering, transforming, joining, etc., on top of MapReduce as a first step towards standardization in the realm of NoSQL [17]. A second step is essentially to incorporate NoSQL systems as the storage layer of such a framework, providing a significant performance boost for MapReduce queries. This storage layer not only leverages the storage efficiency of RDBMS, but also allows pushdown of projections, filters, and predicate evaluations as close to the storage level as possible, drastically reducing the amount of data used at the query processing level.

This is essentially the contribution of this work: allowing NoSQL stores to be used as the storage layer underneath a MapReduce-based XQuery engine, Brackit [?], a generic XQuery processor that is independent of the storage layer. We rely on Brackit's MapReduce-mapping facility as a transparently distributed execution engine, thus providing scalability. Moreover, we exploit the XDM-mapping layer of Brackit, which provides flexibility by allowing new data models. We created three XDM mappings, investigating three different implementations, encompassing the most used types of NoSQL stores: key/value, column-based, and document-based.

The remainder of this paper is organized as follows. Section 2 introduces the NoSQL models and their characteristics. Section 3 describes the XQuery engine used, Brackit, and the execution environment of XQuery on top of the MapReduce model. Section 4 describes the mappings from the various stores to XDM, besides all implemented optimizations. Section 5 presents the developed experiments and the obtained results. Finally, Section 6 concludes this work.

2. NOSQL STORES
This work focuses on three different types of NoSQL stores, namely key/value, columnar, and document-oriented, represented by Riak [14], HBase [11], and MongoDB [8], respectively.

Riak is the simplest model we dealt with: a pure key/value store. It provides solely read and write operations on uniquely-identified values, referenced by key. It does not provide operations that span multiple data items, and there is no need for a relational schema. It uses the concepts of buckets, keys, and values. Data is stored and referenced by bucket/key pairs. Each bucket defines a virtual key space and can be thought of as a table in a classical relational database. Each key references a unique value, and there are no data type definitions: objects are the only unit of data storage. Moreover, Riak provides automatic load balancing and data replication. It does not have any relationships between data items, even though it attempts this by allowing links between key/value pairs. It provides the most flexibility by allowing a per-request scheme for choosing between availability and consistency. Its distributed system has no master node, thus no single point of failure, and in order to resolve partial ordering it uses Vector Clocks [15].

HBase enhances Riak's data model by allowing columnar data, where a table in HBase can be seen as a map of maps. More precisely, each key is an arbitrary string that maps to a row of data. A row is a map, where columns act as keys and values are uninterpreted arrays of bytes. Columns are grouped into column families, and therefore the full key access specification of a value is a column family concatenated with a column, or, using HBase notation, a qualifier. Column families make the implementation more complex, but their existence enables fine-grained performance tuning, because (i) each column family's performance options are configured independently, like read and write access and disk space consumption; and (ii) the columns of a column family are stored contiguously on disk. Moreover, operations in HBase are atomic at the row level, thus keeping a consistent view of a given row. Data relations exist from column family to qualifiers, and operations are atomic on a per-row basis. HBase chooses consistency over availability, and much of that is reflected in the system architecture. Auto-sharding and automatic replication are also present: sharding is done automatically by dividing the data into regions, and replication is achieved by the master-slave pattern.

MongoDB fosters functionality by allowing more RDBMS-like features, such as secondary indexes, range queries, and sorting. The data unit is a document, which is an ordered set of keys with associated values. Keys are strings, and values, for the first time, are not simply opaque objects or arrays of bytes as in Riak or HBase. In MongoDB, values can be of different data types, such as strings, dates, integers, and even embedded documents. MongoDB provides collections, which are groupings of documents, and databases, which are groupings of collections. Stored documents do not follow any predefined schema. Updates within a single document are transactional. Consistency is also favored over availability in MongoDB, as in HBase, and that is also reflected in the system architecture, which follows a master-worker pattern.

Overall, all systems provide scaling-out, replication, and parallel-computation capabilities. What changes is essentially the data model: Riak seems to be better suited for problems where data is not really relational, like logging. On the other hand, because of the lack of scan capabilities, Riak will not perform that well in situations where data querying is needed. HBase allows for some relationships between data items, besides built-in compression and versioning. It is thus an excellent tool for indexing web pages, which are highly textual (thus benefiting from compression), as well as inter-related and updatable (benefiting from built-in versioning). Finally, MongoDB provides documents as its granularity unit, thus fitting well when the scenario involves highly-variable or unpredictable data.
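To make the three data models tangible, the following sketch shows how the same sensor reading might be laid out in each store, using plain Python dictionaries as stand-ins; the layouts and field names are illustrative assumptions, not the stores' actual wire formats.

```python
# Key/value (Riak-like): a bucket maps opaque keys to opaque values.
riak_style = {
    "readings": {                                    # bucket
        "node7-2013-05-28T10:00": b'{"temp": 33}',   # key -> opaque object
    }
}

# Columnar (HBase-like): row key -> column family -> qualifier -> bytes.
hbase_style = {
    "node7-2013-05-28T10:00": {                # row key
        "measurement": {"temp": b"33"},        # column family -> qualifier
        "meta": {"unit": b"celsius"},
    }
}

# Document (MongoDB-like): a collection of schema-free nested documents.
mongo_style = [
    {"_id": "node7-2013-05-28T10:00",
     "temp": 33,
     "location": {"building": "A", "floor": 2}},   # embedded document
]
```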
3. BRACKIT AND MAPREDUCE
Several different XQuery engines are available as options for querying XML documents. Most of them provide either (i) a lightweight application that can perform queries on documents or collections of documents, or (ii) an XML database that uses XQuery to query documents. The former lacks any sort of storage facility, while the latter is just not flexible enough because of its built-in storage layer. Brackit¹ provides intrinsic flexibility, allowing different storage levels to be "plugged in" without losing the necessary performance when dealing with XML documents [5]. By dividing the components of the system into different modules, namely language, engine, and storage, it gives us the needed flexibility, thus allowing us to use any store for our storage layer.

Compilation
The compilation process in Brackit works as follows: the parser analyzes the query to validate the syntax and to ensure that there are no inconsistencies among parts of the statement. If any syntax errors are detected, the query compiler stops processing and returns the appropriate error message. During this step, a data structure is built, namely an AST (Abstract Syntax Tree). Each node of the tree denotes a construct occurring in the source query and is used throughout the rest of the compilation process. Simple rewrites, like constant folding and the introduction of let bindings, are also done in this step.

The pipelining phase transforms FLWOR expressions into pipelines, the internal, data-flow-oriented representation of FLWORs, discussed later. Optimizations are done atop pipelines, and the compiler uses global semantics stored in the AST to transform the query into a more easily optimized form. For example, the compiler will move predicates if possible, altering the level at which they are applied and potentially improving query performance. This type of operator movement is called predicate pushdown, or filter pushdown, and we will apply it to our stores later on. Further optimizations, such as join recognition and unnesting, are present in Brackit and are discussed in [4]. In the optimization phase, these optimizations are applied to the AST. The distribution phase is specific to distributed scenarios and is where the MapReduce translation takes place. More details about the distribution phase are presented in [17]. At the end of the compilation, the translator receives the final AST and generates a tree of executable physical operators. This compilation process chain is illustrated in Figure 1.
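As a rough illustration of this compilation chain (a sketch under our own naming assumptions, not Brackit's actual API), the phases can be viewed as a simple function pipeline from query string to physical plan:

```python
# Illustrative sketch of the Brackit compilation chain described above.
# The phase names follow the text; the function bodies are placeholders.

def parse(query):      return {"ast": query}            # syntax check, build AST
def pipeline(ast):     return ast                       # FLWORs -> pipelines
def optimize(ast):     return ast                       # e.g. predicate pushdown
def distribute(ast):   return ast                       # MapReduce translation
def translate(ast):    return ["physical operators"]    # final physical plan

def compile_query(query):
    ast = parse(query)
    for phase in (pipeline, optimize, distribute, translate):
        ast = phase(ast)
    return ast

plan = compile_query("for $r in collection('readings') where $r/temp > 30 return $r")
```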
XQuery over MapReduce
Mapping XQuery to the MapReduce model is an alternative to implementing a distributed query processor from scratch, as is normally done in parallel databases. This choice relies on the MapReduce middleware for the distribution aspects. BrackitMR is one such implementation and is discussed in more depth in [17]. It achieves a distributed XQuery engine in Brackit by scaling out using MapReduce.

The system cited above processes collections stored in HDFS as text files, and therefore does not control details about the encoding and management of low-level files. If the DBMS architecture [12] is considered, it implements solely its topmost layer, the set-oriented interface. It executes processes using MapReduce functions, but abstracts this from the final user by compiling XQuery onto the MapReduce model.

It represents each query in MapReduce as a sequence of jobs, where each job processes a section of a FLWOR pipeline. In order to use MapReduce as a query processor, (i) it breaks FLWOR pipelines into map and reduce functions, and (ii) it groups these functions to form a MapReduce job. For (i), it converts the logical-pipeline representation of the FLWOR expression, the AST, into a MapReduce-friendly version. MapReduce uses a tree of splits, which represents the logical plan of a MapReduce-based query. Each split is a non-blocking operator used by MapReduce functions. The structure of splits is rather simple: a split contains an AST and pointers to its successor and predecessor splits. Because splits are organized in a bottom-up fashion, the leaves of the tree are map functions, and the root is a reduce function, which produces the query output.

For (ii), the system uses the split tree to generate possibly multiple MapReduce job descriptions, which can be executed in a distributed manner. The jobs are exactly the ones used by Hadoop MapReduce [20], and therefore we will not go into details here.
mization phase, optimizations are applied to the AST. The
distribution phase is specific to distributed scenarios, and       4.   XDM MAPPINGS
is where MapReduce translation takes place. More details
about the distribution phase are presented in [17]. At the           This section shows how to leverage NoSQL stores to work
end of the compilation, the translator receives the final AST.     as storage layer for XQuery processing. First, we present
It generates a tree of executable physical operators. This         mappings from NoSQL data models to XDM, adding XDM-
compilation process chain is illustrated in Figure 1.              node behavior to these data mappings. Afterwards, we dis-
                                                                   cuss possible optimizations regarding data-filtering techniques.
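As a concrete illustration of this chain, the following small FLWOR
expression is of the kind Brackit compiles; the comments indicate
which phase handles each part. The collection and field names are
assumptions made only for this sketch.

     (: illustrative query; names are not taken from the paper :)
     for $l in collection("lineitem")/lineitem  (: parsing builds one AST node per construct :)
     let $price := $l/extendedprice             (: pipelining turns the FLWOR into a pipeline :)
     where $l/quantity > 30                     (: the optimizer may push this predicate down :)
     return $price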

XQuery over MapReduce
Mapping XQuery to the MapReduce model is an alternative to
implementing a distributed query processor from scratch, as
normally done in parallel databases. This choice relies on the
MapReduce middleware for the distribution aspects. BrackitMR is one
such implementation and is discussed in more depth in [17]. It
achieves a distributed XQuery engine in Brackit by scaling out
using MapReduce.
   The system cited above processes collections stored in HDFS as
text files, and therefore does not control details about the
encoding and management of low-level files. If the DBMS
architecture of [12] is considered, it implements solely its
topmost layer, the set-oriented interface. It executes queries
using MapReduce functions, but abstracts this from the final user
by compiling XQuery onto the MapReduce model.
   It represents each query in MapReduce as a sequence of jobs,
where each job processes a section of a FLWOR pipeline. In order to
use MapReduce as a query processor, it (i) breaks FLWOR pipelines
into map and reduce functions, and (ii) groups these functions to
form MapReduce jobs. For (i), it converts the logical pipeline
representation of the FLWOR expression—the AST—into a
MapReduce-friendly version. This translation uses a tree of splits,
which represents the logical plan of a MapReduce-based query. Each
split is a non-blocking operator used by the MapReduce functions.
The structure of splits is rather simple: each split contains an
AST and pointers to its successor and predecessor splits. Because
splits are organized in a bottom-up fashion, the leaves of the tree
are map functions, and the root is a reduce function—which produces
the query output. For (ii), the system uses the split tree to
generate possibly multiple MapReduce job descriptions, which can be
executed in a distributed manner. The jobs are exactly those used
by Hadoop MapReduce [20], and therefore we do not go into details
here.
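As a hypothetical example of how a pipeline could be divided,
consider the aggregation query below (it assumes the XQuery 3.0
group by clause and the partsupp mapping of Section 4). The
comments mark where a map-side split and the reduce-side split
would roughly cut the pipeline; the actual split points are chosen
by BrackitMR's translator.

     (: map-side split: scan the collection and extract the grouping key :)
     for $ps in collection("partsupp")/partsupp
     let $supp := $ps/references/suppkey
     (: reduce-side split: group, aggregate, and produce the query output :)
     group by $key := $supp
     return <supplier key="{ $key }">{ sum($ps/values/supplycost) }</supplier>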
4.   XDM MAPPINGS
   This section shows how to leverage NoSQL stores as a storage
layer for XQuery processing. First, we present mappings from the
NoSQL data models to XDM, adding XDM-node behavior to these data
mappings. Afterwards, we discuss possible optimizations regarding
data-filtering techniques.
Riak
Riak's mapping strategy starts by constructing a key/value tuple
from its low-level storage representation. This is essentially an
abstraction and is completely dependent on the storage backend used
by Riak. Second, we represent XDM operations on this key/value
tuple. We map data stored within Riak using Riak's linking
mechanism. A key/value pair kv represents an XDM element, and
key/value pairs linked to kv are addressed as children of kv. We
map key/value tuples to XDM elements. The name of the element is
simply the name of the bucket it belongs to. We create one bucket
for the element itself and one extra bucket for each link departing
from the element. Each child element stored in a separate bucket
represents a nested element within the key/value tuple. The name of
the child element is the name of the link between the key/values.
This does not necessarily decrease data locality: buckets are
distributed among nodes based on hashed keys, therefore uniformly
distributing the load on the system. Besides, each element has an
attribute key, which Riak uses to access key/value pairs on the
storage level.
   This allows access at key/value granularity, because every
single element can be accessed within a single get operation. Full
reconstruction of an element el requires one access for each
key/value linked to el. Moreover, Riak provides atomicity only at
the granularity of single key/value pairs; therefore, consistent
updates of multiple key/value tuples cannot be guaranteed.
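For illustration, a key/value pair stored in a bucket part and
linked, under a link named supplied_by, to a pair in another bucket
would surface to the query engine roughly as the following XDM
instance; both names and the keys are invented for this sketch.

     (: hypothetical XDM view of a mapped Riak key/value pair :)
     <part key="p-100">
       <supplied_by key="s-42"/>
     </part>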
Figure 2: Mapping between an HBase row and an XDM instance.

HBase
HBase's mapping strategy starts by constructing a columnar tuple
from the HDFS low-level storage representation. HBase stores
column-family data in separate files within HDFS, and we can
exploit this to create an efficient mapping. Figure 2 presents this
XDM mapping, where we map a table partsupp using two column
families, references and values, and five qualifiers: partkey,
suppkey, availqty, supplycost, and comment. We map each row within
an HBase table to an XDM element. The name of the element is simply
the name of the table it belongs to, and we store the key used to
access the element within HBase as an attribute of the element. The
figure shows two column families: references and values. Each
column family represents a child element whose name is the name of
the column family. Accordingly, each qualifier is nested as a child
within the column-family element from which it descends.
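Concretely, one partsupp row mapped this way is exposed roughly as
the XDM instance sketched below. The row key, the field values, and
the assignment of qualifiers to the two column families are
assumptions based on Figure 2.

     (: sketch of the XDM instance for a single partsupp row :)
     <partsupp key="1.1">
       <references>
         <partkey>1</partkey>
         <suppkey>1</suppkey>
       </references>
       <values>
         <availqty>100</availqty>
         <supplycost>7.99</supplycost>
         <comment>...</comment>
       </values>
     </partsupp>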
MongoDB
MongoDB's mapping strategy is straightforward. Because it stores
JSON-like documents, the mapping consists essentially of a document
field → element mapping. We map each document within a MongoDB
collection to an XDM element. The name of the element is the name
of the collection it belongs to. We store the id—used to access the
document within MongoDB—as an attribute of each element. Nested
within the collection element, each field of the document
represents a child element whose name is the name of the field
itself. Note that MongoDB allows fields to be of type document, so
more complex nested elements can be achieved. Nevertheless, the
mapping rules work recursively, just as described above.
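For example, a document of the shape {"_id": "c-7", "name":
"Customer#7", "nation": {"nationkey": "3"}} stored in a collection
customer would be exposed roughly as follows; the field names and
values are invented for this sketch.

     (: hypothetical XDM view of a mapped MongoDB document :)
     <customer id="c-7">
       <name>Customer#7</name>
       <nation>
         <nationkey>3</nationkey>
       </nation>
     </customer>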
Nodes
We describe the XDM mappings using object-oriented notation. Each
store implements a Node interface that provides node behavior to
data. Brackit interacts with the storage through this interface. It
provides the general rules present in the XDM [19], Namespaces [2],
and XQuery Update Facility [3] standards, resulting in navigational
operations, comparisons, and other functionality. RiakRowNode wraps
Riak's buckets, key/values, and links. HBaseRowNode wraps HBase's
tables, column families, qualifiers, and values. Finally,
MongoRowNode wraps MongoDB's collections, documents, fields, and
values.
   Overall, each instance of these objects represents one unit of
data from the storage level. In order to better grasp the mapping,
we describe the HBase abstraction in more detail, because it
represents the most complex case. Riak's and MongoDB's
representations follow the same approach, but without a
“second-level node”. Tables are not represented within the Node
interface, because their semantics describe where data is logically
stored, not the data itself. Therefore, they are represented by a
separate interface, called Collection. Column families represent a
first-level access. Qualifiers represent a second-level access.
Finally, values represent a value access. Moreover, the first-level,
second-level, and value accesses must keep track of current indexes,
allowing the node to properly implement XDM operations. Figure 3
depicts the mapping. The upper-most part of the picture shows a
node which represents a data row from any of the three stores. The
first layer of nodes—with level = 1st—represents the first-level
access explained previously. The semantics of the first-level
access differ between stores: while Riak and MongoDB interpret it
as a value wrapper, HBase uses it as a column-family wrapper.
Furthermore, HBase is the only implementation that needs a
second-level access, represented by the middle node with level =
2nd, in this example accessing the wrapper of regionkey = “1”.
Finally, the lower-level nodes with level = value access the values
of the structure.

Figure 3: Nodes implementing XDM structure.
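To relate these access levels to query evaluation, the path below
is annotated with the node level that each step resolves to in the
HBase case. The correspondence is our reading of Figure 3, and the
final text() step for the value access is an assumption.

     for $row in collection("partsupp")/partsupp  (: Collection and row node :)
     return $row/values      (: first-level access: column family :)
                /supplycost  (: second-level access: qualifier :)
                /text()      (: value access :)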
Optimizations
We introduce projection and predicate pushdown optimizations. The
only store that allows for predicate pushdown is MongoDB, while
projection (filter) pushdown is realized on all of them. These
optimizations are fundamental advantages of this work when compared
with processing MapReduce over raw files: we can take “shortcuts”
that lead directly to the bytes we want on disk.
   Projection (filter) pushdown is an important optimization for
minimizing the amount of data scanned and processed by the storage
level, as well as for reducing the amount of data passed up to the
query processor. Predicate pushdown is yet another optimization
technique to minimize the amount of data flowing between the
storage and processing layers. The whole idea is to process
predicates as early in the plan as possible, thus pushing them down
to the storage layer.
   In both cases we traverse the AST generated at the beginning of
the compilation step, looking for specific nodes, and when they are
found we annotate the collection node in the AST with this
information. The former looks for path expressions (PathExpr) that
represent a child step from a collection node, or descendants of
collection nodes, because in the HBase implementation we have more
than one access level within the storage. The latter looks for
general-comparison operators, such as equal, not equal, less than,
greater than, less than or equal to, and greater than or equal to.
Afterwards, when accessing the collection on the storage level, we
use the marked collection nodes to filter data instead of passing
it all on to the query engine.
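A query of the following shape would trigger both annotations; the
query and the constant are illustrative only. The path steps below
the collection node are candidates for projection (filter) pushdown
on every store, while the general comparison in the where clause is
a candidate for predicate pushdown on MongoDB.

     for $ps in collection("partsupp")/partsupp
     where $ps/values/availqty >= 1000   (: predicate pushdown (MongoDB only) :)
     return $ps/values/supplycost        (: projection pushdown: only the
                                            referenced qualifiers are read :)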
NoSQL updates
The NoSQL stores we use present different APIs for persisting data.
Even though XQuery does not provide data-storing mechanisms in its
recommendation, it does provide an extension for that end, the
XQuery Update Facility [3]. It allows adding new nodes, deleting or
renaming existing nodes, and replacing existing nodes and their
values. The XQuery Update Facility adds very natural and efficient
persistence capabilities to XQuery, but it adds a lot of complexity
as well. Moreover, some of its constructs require document order,
which is simply not available in the case of Riak. Therefore,
functions with simple semantics, such as insert or put, seem more
attractive, and they achieve the goal of persisting or updating
data.
   The insert function stores a value within the underlying store.
We provide two possible signatures, with or without a $key,
therefore allowing for both insertions and updates.

     db:insert($table as xs:string,
               $key as xs:string,
               $value as node()) as xs:boolean

The delete function deletes a value from the store. We also provide
two possible signatures, with or without $key, therefore allowing
for the deletion of a given key or the dropping of a given table.

     db:delete($table as xs:string,
               $key as xs:string) as xs:boolean
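A minimal usage sketch of these functions is shown below. Only the
signatures above are given by the text; the table name, key, and
element value are invented, and the db prefix is assumed to be
bound as in the listings above.

     (: store one element under an explicit key, then delete it again :)
     let $ok := db:insert("customer", "c-7",
                          <customer id="c-7"><name>Customer#7</name></customer>)
     return
       if ($ok) then db:delete("customer", "c-7")
       else false()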
5.   EXPERIMENTS
   The framework we developed in this work is mainly concerned with
the feasibility of executing XQuery queries atop NoSQL stores.
Therefore, our focus is primarily on a proof of concept. The data
used for our tests comes from the TPC-H benchmark [1]. The dataset
has a size of 1 GB, and we essentially scanned the five biggest
TPC-H tables: part, partsupp, orders, lineitem, and customer. The
experiments were performed on a single machine with an Intel
Centrino Duo dual-core CPU at 2.00 GHz and 4 GB of RAM, running
Ubuntu Linux 10.04 LTS. We used HBase 0.94.1, Riak 1.2.1, and
MongoDB 2.2.1. It is not our goal to assess the scalability of
these systems, but rather their query-processing performance. For
scalability benchmarks, we refer to [9] and [10].
5.1   Results

Figure 4: Latency comparison among stores.

   Figure 4 shows the latency measured for the best scheme of each
store, using a log scale. As we can see, all approaches benefit
from the optimization techniques. The blue column of the graph—full
table scan—shows the latency when scanning all data from the TPC-H
tables. The red column—single column scan—represents the latency
when scanning a single column of each table. Projection (filter)
pushdown explains the improvement in performance compared to the
first scan, since it reduces the amount of data flowing from the
storage to the processing level. The orange column—predicate column
scan—represents the latency when scanning a single column with the
results filtered by a predicate. We chose the predicates so that
they cut the amount of resulting data in half compared with the
single column scan. The query time was reduced by approximately
30%, not reaching the theoretically possible 50% improvement,
essentially because of processing overhead. Nevertheless, it shows
how effective the technique is.
   In scanning scenarios like the ones in this work, MongoDB has
shown to be more efficient than the other stores, consistently
delivering the lowest latency. MongoDB is faster by design: trading
data-storage capacity for data addressability has proved to be a
very efficient solution, although it is also a significant
limitation. Moreover, MongoDB uses pre-caching techniques;
therefore, at run time it can work with data almost entirely from
main memory, especially in scanning scenarios.
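The three workloads correspond to queries of roughly the following
shape; this is our reconstruction for illustration only, and the
predicate constant is arbitrary.

     (: full table scan :)
     for $ps in collection("partsupp")/partsupp
     return $ps

     (: single column scan :)
     for $ps in collection("partsupp")/partsupp
     return $ps/values/supplycost

     (: predicate column scan :)
     for $ps in collection("partsupp")/partsupp
     where $ps/values/availqty >= 5000
     return $ps/values/supplycost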
6.   CONCLUSIONS
   We extended a mechanism that executes XQuery to work with
different NoSQL stores as its storage layer, thus providing a
high-level interface to process data in an optimized manner. We
have shown that our approach is generic enough to work with
different NoSQL implementations.
   Whenever querying these systems with MapReduce—taking advantage
of its linearly scalable programming model for processing and
generating large data sets—parallelization details, fault
tolerance, and distribution aspects are hidden from the user.
Nevertheless, as a data-processing paradigm, MapReduce represents
the past. It is not novel, does not use schemas, and provides a
low-level record-at-a-time API: a scenario that resembles the
1960s, before modern DBMSs. It requires implementing queries from
scratch and still suffers from a lack of proper tools to enhance
its querying capabilities. Moreover, when executed atop raw files,
processing is inefficient, because brute force is the only
processing option. We addressed precisely these two MapReduce
problems: XQuery serves as the higher-level query language, and
NoSQL stores replace raw files, thus increasing performance.
Overall, MapReduce emerges as a solution for situations where
DBMSs are too “hard” to work with, but it should not overlook the
lessons of more than 40 years of database technology.
   Other approaches, such as Hive and Scope, cope with similar
problems. Hive [18] is a framework for data warehousing on top of
Hadoop. Nevertheless, it only provides equi-joins and does not
fully support point access or CRUD operations—inserts into existing
tables are not supported due to the simplicity of its locking
protocols. It also uses raw files at the storage level, supporting
only CSV files, and it is not flexible enough for Big Data
problems, because it cannot understand the structure of Hadoop
files without catalog information. Scope [7] provides a declarative
scripting language targeted at massive data analysis, borrowing
several features from SQL. It also runs atop a distributed
computing platform with a MapReduce-like model, and therefore
suffers from the same problems: a lack of flexibility and
generality, although it is scalable.
7.   REFERENCES
 [1] The TPC-H benchmark. http://www.tpc.org/tpch/, 1999.
 [2] Namespaces in XML 1.1 (second edition).
     http://www.w3.org/TR/xml-names11/, August 2006.
 [3] XQuery Update Facility 1.0.
     http://www.w3.org/TR/2009/CR-xquery-update-10-20090609/,
     June 2009.
 [4] S. Bächle. Separating Key Concerns in Query Processing - Set
     Orientation, Physical Data Independence, and Parallelism. PhD
     thesis, University of Kaiserslautern, 12 2012.
 [5] S. Bächle and C. Sauer. Unleashing XQuery for data-independent
     programming. Submitted, 2011.
 [6] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y.
     Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A
     scripting language for large scale semistructured data
     analysis. PVLDB, 4(12):1272–1283, 2011.
 [7] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib,
     S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel
     processing of massive data sets. Proc. VLDB Endow.,
     1(2):1265–1276, Aug. 2008.
 [8] K. Chodorow and M. Dirolf. MongoDB: The Definitive Guide.
     O'Reilly Media, 2010.
 [9] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and
     R. Sears. Benchmarking cloud serving systems with YCSB. In
     Proceedings of the 1st ACM Symposium on Cloud Computing,
     SoCC '10, pages 143–154, New York, NY, USA, 2010. ACM.
[10] T. Dory, B. Mejhas, P. V. Roy, and N. L. Tran. Measuring
     elasticity for cloud databases. In Proceedings of the Second
     International Conference on Cloud Computing, GRIDs, and
     Virtualization, 2011.
[11] L. George. HBase: The Definitive Guide. O'Reilly Media, 2011.
[12] T. Härder. DBMS architecture - new challenges ahead.
     Datenbank-Spektrum, 14:38–48, 2005.
[13] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker.
     OLTP through the looking glass, and what we found there, 2008.
[14] R. Klophaus. Riak Core: Building distributed applications
     without shared state. In ACM SIGPLAN Commercial Users of
     Functional Programming, CUFP '10, pages 14:1–14:1, New York,
     NY, USA, 2010. ACM.
[15] F. Mattern. Virtual time and global states of distributed
     systems. In C. M. et al., editor, Proc. Workshop on Parallel
     and Distributed Algorithms, pages 215–226. North-Holland /
     Elsevier, 1989.
[16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins.
     Pig Latin: A not-so-foreign language for data processing. In
     Proceedings of the 2008 ACM SIGMOD International Conference
     on Management of Data, SIGMOD '08, pages 1099–1110, New York,
     NY, USA, 2008. ACM.
[17] C. Sauer. XQuery processing in the MapReduce framework.
     Master thesis, Technische Universität Kaiserslautern, 2012.
[18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang,
     S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale
     data warehouse using Hadoop. In ICDE, pages 996–1005, 2010.
[19] N. Walsh, M. Fernández, A. Malhotra, M. Nagy, and J. Marsh.
     XQuery 1.0 and XPath 2.0 Data Model (XDM).
     http://www.w3.org/TR/2007/REC-xpath-datamodel-20070123/,
     January 2007.
[20] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.