Proceedings of the 26th GI-Workshop
Grundlagen von Datenbanken


Bozen-Bolzano, Italy, October 21-24, 2014
© 2014 for the individual papers by the papers’ authors. Copying permitted for private
and academic purposes. Re-publication of material from this volume requires permission
by the copyright owners.



Editors:
Friederike Klan
Friedrich-Schiller-Universität Jena
Fakultät für Mathematik und Informatik
Heinz-Nixdorf-Stiftungsprofessur für Verteilte Informationssysteme
Ernst-Abbe-Platz 2
DE-07743 Jena
E-Mail: friederike.klan@uni-jena.de


Günther Specht
Universität Innsbruck
Fakultät für Mathematik, Informatik und Physik
Forschungsgruppe Datenbanken und Informationssysteme
Technikerstrasse 21a
AT-6020 Innsbruck
E-Mail: guenther.specht@uibk.ac.at


Hans Gamper
Freie Universität Bozen-Bolzano
Fakultät für Informatik
Dominikanerplatz 3
IT-39100 Bozen-Bolzano
E-Mail: gamper@inf.unibz.it




Preface

The 26th workshop "Grundlagen von Datenbanken" (GvDB) 2014 took place from October 21 to October 24, 2014 on the Ritten in South Tyrol, a charming high plateau overlooking the Dolomites. The journey there was already a highlight: from the Bozen railway station, the longest cable car in South Tyrol took us up, and from there the Rittner Bahn, an old narrow-gauge tramway, carried us across the larch meadows to the conference venue.
The four-day workshop was organized by the GI working group "Grundlagen von Informationssystemen" within the division Databases and Information Systems (DBIS). Its subject is the conceptual and methodological foundations of databases and information systems, but it is also open to new applications. The workshop series and the working group celebrate their 25th anniversary this year, which makes the working group one of the oldest in the GI. The anniversary workshop was organized jointly by Dr. Friederike Klan of the Heinz-Nixdorf-Stiftungsprofessur für Verteilte Informationssysteme at the Friedrich-Schiller-Universität Jena, Prof. Dr. Günther Specht of the research group Datenbanken und Informationssysteme (DBIS) at the Universität Innsbruck, and Prof. Dr. Johann Gamper of the group Datenbanken und Informationssysteme (DIS) at the Free University of Bozen-Bolzano.
The workshop is intended to foster communication between researchers in the German-speaking countries whose work focuses on the foundations of databases and information systems. In particular, it offers young researchers the opportunity to present their current work to a larger audience in a relaxed atmosphere. Against the backdrop of the impressive South Tyrolean mountains, the workshop, at 1,200 meters above sea level, provided an ideal setting for open and inspiring discussions without time pressure. In total, 14 papers were selected from the submissions after a review process and presented. Particularly noteworthy is the variety of topics: core areas of database systems and database design were covered, as well as information extraction, recommender systems, time series processing, graph algorithms in the GIS area, data privacy and data quality.
The talks were complemented by two keynotes: Ulf Leser, professor at the Humboldt-Universität zu Berlin, gave a keynote on Next Generation Data Integration (for Life Sciences), and Francesco Ricci, professor at the Free University of Bozen-Bolzano, spoke on Context and Recommendations: Challenges and Results. We thank both speakers for their spontaneous willingness to come and for their interesting talks.
Besides the exchange of knowledge, the social component must not be missing either. The two joint excursions will certainly remain a pleasant memory for everyone. On the one hand, we climbed the already snow-covered Rittner Horn (2,260 m), which offers a magnificent view of the Dolomites. On the other hand, an autumn stay in South Tyrol is unthinkable without the so-called Törggelen: a hike to local farm taverns, which serve the delicacies of the year together with chestnuts and new wine. Even the rector of the University of Bozen-Bolzano came up from the valley especially for it.
A conference can only be successful in a good environment. We therefore thank the staff of the Haus der Familie for their work behind the scenes. Further thanks go to all authors, whose contributions and talks are what made an interesting workshop possible, as well as to the program committee and all reviewers for their work. Finally, a big thank you goes to the organization team, which cooperated excellently and interactively across all national borders (Germany, Austria and Italy). The GvDB has never been this international before.

We look forward to seeing you again at the next GvDB workshop.

Günther Specht
Friederike Klan
Johann Gamper



Innsbruck, Jena, Bozen, October 26, 2014




Committee

Organization


 Friederike Klan        Friedrich-Schiller-Universität Jena
 Günther Specht        Universität Innsbruck
 Hans Gamper            Universität Bozen-Bolzano



Program Committee


 Alsayed Algergawy      Friedrich-Schiller-Universität Jena
 Erik Buchmann          Karlsruher Institut für Technologie
 Stefan Conrad          Universität Düsseldorf
 Hans Gamper            Universität Bozen-Bolzano
 Torsten Grust          Universität Tübingen
 Andreas Heuer          Universität Rostock
 Friederike Klan        Friedrich-Schiller-Universität Jena
 Birgitta König-Ries   Friedrich-Schiller-Universität Jena
 Klaus Meyer-Wegener    Universität Erlangen
 Gunter Saake           Universität Magdeburg
 Kai-Uwe Sattler        Technische Universität Ilmenau
 Eike Schallehn         Universität Magdeburg
 Ingo Schmitt           Brandenburgische Technische Universität Cottbus
 Holger Schwarz         Universität Stuttgart
 Günther Specht        Universität Innsbruck



Additional Reviewers


 Mustafa Al-Hajjaji     Universität Magdeburg
 Xiao Chen              Universität Magdeburg
 Doris Silbernagl       Universität Innsbruck




Contents

Next Generation Data Integration (for the Life Sciences) (Keynote)
   Ulf Leser                                                                       9

Context and Recommendations: Challenges and Results (Keynote)
   Francesco Ricci                                                                10

Optimization of Sequences of XML Schema Modifications - The ROfEL Ap-
  proach
  Thomas Nösinger, Andreas Heuer and Meike Klettke                   11

Automatic Decomposition of Multi-Author Documents Using Grammar Analysis
   Michael Tschuggnall and Günther Specht                               17

Proaktive modellbasierte Performance-Analyse und -Vorhersage von Datenbankan-
   wendungen
   Christoph Koch                                                           23

Big Data und der Fluch der Dimensionalität: Die effiziente Suche nach Quasi-
   Identifikatoren in hochdimensionalen Daten
   Hannes Grunert and Andreas Heuer                                           29

Combining Spotify and Twitter Data for Generating a Recent and Public Dataset
  for Music Recommendation
  Martin Pichl, Eva Zangerle and Günther Specht                              35

Incremental calculation of isochrones regarding duration
   Nikolaus Krismer, Günther Specht and Johann Gamper                            41

Software Design Approaches for Mastering Variability in Database Systems
   David Broneske, Sebastian Dorok, Veit Koeppen and Andreas Meister              47

PageBeat - Zeitreihenanalyse und Datenbanken
   Andreas Finger, Ilvio Bruder, Andreas Heuer, Martin Klemkow and Steffen Konerow 53

Databases under the Partial Closed-world Assumption: A Survey
   Simon Razniewski and Werner Nutt                                               59

Towards Semantic Recommendation of Biodiversity Datasets based on Linked
   Open Data
   Felicitas Löffler, Bahar Sateli, René Witte and Birgitta König-Ries 65




Exploring Graph Partitioning for Shortest Path Queries on Road Networks
   Theodoros Chondrogiannis and Johann Gamper                                  71

Missing Value Imputation in Time Series Using Top-k Case Matching
   Kevin Wellenzohn, Hannes Mitterer, Johann Gamper, Michael Böhlen and Mourad
   Khayati                                                                      77

Dominanzproblem bei der Nutzung von Multi-Feature-Ansätzen
  Thomas Böttcher and Ingo Schmitt                                            83

PEL: Position-Enhanced Length Filter for Set Similarity Joins
  Willi Mann and Nikolaus Augsten                                              89




    Next Generation Data Integration (for the Life Sciences)
                                                                 [Abstract]
                                                                   Ulf Leser
                                                      Humboldt-Universität zu Berlin
                                                      Institute for Computer Science
                                                   leser@informatik.hu-berlin.de


ABSTRACT
Ever since the advent of high-throughput biology (e.g., the
Human Genome Project), integrating the large number of
diverse biological data sets has been considered as one of
the most important tasks for advancement in the biolog-
ical sciences. The life sciences also served as a blueprint
for complex integration tasks in the CS community, due to
the availability of a large number of highly heterogeneous
sources and the urgent integration needs. Whereas the early
days of research in this area were dominated by virtual inte-
gration, the currently most successful architecture uses ma-
terialization. Systems are built using ad-hoc techniques and
a large amount of scripting. However, recent years have
seen a shift in the understanding of what a ”data integra-
tion system” actually should do, revitalizing research in this
direction. In this tutorial, we review the past and current
state of data integration (exemplified by the Life Sciences)
and discuss recent trends in detail, which all pose challenges
for the database community.




About the Author
Ulf Leser obtained a Diploma in Computer Science from the
Technische Universität München in 1995. He then worked as
a database developer at the Max Planck Institute for Molecular
Genetics before starting his PhD with the Graduate School for
”Distributed Information Systems” in Berlin. Since 2002 he has
been a professor for Knowledge Management in Bioinformatics
at Humboldt-Universität zu Berlin.




Copyright © by the paper’s authors. Copying permitted only
for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-
Workshop on Foundations of Databases (Grundlagen von Datenbanken),
21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.




  Context and Recommendations: Challenges and Results
                                                                 [Abstract]
                                                              Francesco Ricci
                                                    Free University of Bozen-Bolzano
                                                      Faculty of Computer Science
                                                               fricci@unibz.it


ABSTRACT
Recommender Systems (RSs) are popular tools that automatically compute suggestions for items that are predicted to be interesting and useful to a user. They track users’ actions, which signal users’ preferences, and aggregate them into predictive models of the users’ interests. In addition to the long-term interests, which are normally acquired and modeled in RSs, the specific ephemeral needs of the users, their decision biases, the context of the search, and the context of items’ usage influence the user’s response to and evaluation of the suggested items. But appropriately modeling the user in the situational context and reasoning upon that is still challenging; there are still major technical and practical difficulties to solve: obtaining sufficient and informative data describing user preferences in context; understanding the impact of the contextual dimensions on the user’s decision-making process; and embedding the contextual dimensions in a recommendation computational model. These topics will be illustrated in the talk with examples taken from the recommender systems that we have developed.

About the Author
Francesco Ricci is associate professor of computer science at the Free University of Bozen-Bolzano, Italy. His current research interests include recommender systems, intelligent interfaces, mobile systems, machine learning, case-based reasoning, and the applications of ICT to tourism and eHealth. He has published more than one hundred academic papers on these topics and has been invited to give talks at many international conferences, universities and companies. He is among the editors of the Handbook of Recommender Systems (Springer 2011), a reference text for researchers and practitioners working in this area. He is the editor in chief of the Journal of Information Technology & Tourism and a member of the editorial board of the Journal of User Modeling and User-Adapted Interaction. He is a member of the steering committee of the ACM Conference on Recommender Systems. He has served on the program committees of several conferences, including as a program co-chair of the ACM Conference on Recommender Systems (RecSys), the International Conference on Case-Based Reasoning (ICCBR) and the International Conference on Information and Communication Technologies in Tourism (ENTER).




Copyright © by the paper’s authors. Copying permitted only
for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-
Workshop on Foundations of Databases (Grundlagen von Datenbanken),
21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.




Optimization of Sequences of XML Schema Modifications -
                  The ROfEL Approach

                                    Thomas Nösinger, Meike Klettke, Andreas Heuer
                                                       Database Research Group
                                                     University of Rostock, Germany
                                          (tn, meike, ah)@informatik.uni-rostock.de


ABSTRACT
The transformation language ELaX (Evolution Language for XML-Schema [16]) is a domain-specific language for modifying existing XML Schemas. ELaX was developed to express complex modifications by using add, delete and update statements. Additionally, it is used to consistently log all change operations specified by a user. In this paper we present the rule-based optimization algorithm ROfEL (Rule-based Optimizer for ELaX) for reducing the number of logged operations by identifying and removing unnecessary, redundant and also invalid modifications. This is an essential prerequisite for the co-evolution of XML Schemas and corresponding XML documents.

Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

1. INTRODUCTION
   The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To assure that well-defined XML documents are valid, it is necessary to introduce a document description which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. After using an XML Schema for a period of time, the requirements can change, for example if additional elements are needed, data types change or integrity constraints are introduced. This may result in an adaptation of the XML Schema definition.
   In [16] we presented the transformation language ELaX (Evolution Language for XML-Schema) to describe and formulate these XML Schema modifications. Furthermore, we mentioned briefly that ELaX is also useful to log information about modifications consistently, an essential prerequisite for the co-evolution process of XML Schema and corresponding XML documents [14].
   One problem of storing information over a long period of time is that there can be different unnecessary or redundant modifications. Consider modifications which first add an element and shortly afterwards delete the same element. In the overall context of an efficient realization of modification steps, such operations have to be removed. Further issues are incorrect information (possibly caused by network problems), for example if the same element is deleted twice or the order of modifications is invalid (e.g. update before add).
   The new rule-based optimizer for ELaX (ROfEL - Rule-based Optimizer for ELaX) was developed to solve the above mentioned problems. With ROfEL it is possible to identify unnecessary or redundant operations by using different straightforward optimization rules. Furthermore, the underlying algorithm is capable of correcting invalid modification steps. All in all, ROfEL can reduce the number of modification steps by removing or even correcting the logged ELaX operations.
   This paper is organized as follows. Section 2 gives the necessary background on XML Schema, ELaX and corresponding concepts. Sections 3 and 4 present our approach, by first specifying our rule-based algorithm ROfEL and then showing how the approach can be applied to an example. Related work is discussed in section 5. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND
   In this section we present a common notation used in the remainder of this paper. At first, we shortly introduce the XSD (XML Schema Definition [4]), before details concerning ELaX (Evolution Language for XML-Schema [16]) and the logging of ELaX are given.
   The XML Schema abstract data model consists of different components (simple and complex type definitions, element and attribute declarations, etc.). Additionally, the element information item serves as an XML representation of these components and defines which content and attributes can be used in an XML Schema. The possibility of specifying declarations and definitions in a local or global scope leads to four different modeling styles [13]. One of them is the Garden of Eden style, in which all above mentioned components are globally defined. This results in a high re-usability of declarations and defined data types and influences the flexibility of an XML Schema in general.
   The transformation language ELaX¹ was developed to handle modifications on an XML Schema and to express such modifications formally. The abstract data model, the element information item and the Garden of Eden style were important throughout the development process and influence the EBNF (Extended Backus-Naur Form) like notation of ELaX.

¹ The whole transformation language ELaX is available at: www.ls-dbis.de/elax



   An ELaX statement always starts with ”add”, ”delete” or ”update”, followed by one of the alternative components (simple type, element declaration, etc.) and an identifier of the current component, and is completed with optional tuples of attributes and values (examples follow below, e.g. see figure 1). The identifier is a unique EID (emxid)², a QNAME (qualified name) or a subset of XPath expressions. In the remaining parts we will use the EID as the identifier, but a transformation would easily be possible.

² Our conceptual model is EMX (Entity Model for XML Schema [15]), in which every component of a model has its own, global identifier: EID.

   ELaX statements are logged for further analyses and also as a prerequisite for the rule-based optimizer (see section 3). Figure 1 illustrates the relational schema of the log; the chosen values are simple ones (especially the length).

    file-ID   time   EID   op-Type   msg-Type   content
    1         1      1     add       0          add element name 'name' type 'xs:decimal' id 'EID1' ;
    1         2      1     upd       0          update element name 'name' change type 'xs:string' ;
    1         3      2     add       0          add element name 'count' type 'xs:decimal' id 'EID2' ;
    ...       ...    ...   ...       ...        ...

   Figure 1: Schema with relation for logging ELaX

   The attributes file-ID and time are the composite key of the logging relation; the EID represents the unique identifier of a component of the XSD. The op-Type is a short form for the add, delete (del) or update (upd) operations, the msg-Type distinguishes the different message types (ELaX (0), etc.). Lastly, the content contains the logged ELaX statements. The file-ID and msg-Type are management information, which are not covered in this paper.
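To make this structure concrete, the relation of figure 1 can be mirrored by a small record type; the following Python sketch is only an illustration, and the field names (which simply follow the column names above) are assumptions of this sketch rather than part of ELaX.

    # Illustrative sketch only: one tuple of the logging relation of figure 1.
    from dataclasses import dataclass

    @dataclass
    class LogEntry:
        file_id: int   # management information
        time: int      # file_id and time form the composite key
        eid: int       # unique EMX identifier of the modified component
        op_type: str   # 'add', 'del' or 'upd'
        msg_type: int  # message type, 0 = ELaX
        content: str   # the logged ELaX statement

    log = [
        LogEntry(1, 1, 1, 'add', 0, "add element name 'name' type 'xs:decimal' id 'EID1' ;"),
        LogEntry(1, 2, 1, 'upd', 0, "update element name 'name' change type 'xs:string' ;"),
        LogEntry(1, 3, 2, 'add', 0, "add element name 'count' type 'xs:decimal' id 'EID2' ;"),
    ]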
3. RULE-BASED OPTIMIZER
   The algorithm ROfEL (Rule-based Optimizer for ELaX) was developed to reduce the number of logged ELaX operations. This is possible by combining given operations and/or removing unnecessary or even redundant operations. Furthermore, the algorithm can identify invalid operations in a given log and correct them to a certain degree.
   ROfEL is a rule-based algorithm. Provided that a log of ELaX operations is given (see section 2), the following rules are essential to reduce the number of operations. In compliance with ELaX these operations are delete (del), add or update (upd). If a certain distinction is not necessary, a general operation (op) or a variable (_) is used; empty denotes a not given operation. Additionally, the rules are classified by their purpose: they handle redundant (R), unnecessary (U) or invalid (I) operations. ROfEL stops (S) if no other rules are applicable, for example if no other operation with the same EID is given.

    S: empty → op(EID) ⇒ op(EID)                                             (1)

    // ↓ most recent operation: delete (del) ↓
    R: del(EID) → del(EID) ⇒ del(EID)                                        (2)
    U: add(EID, content) → del(EID) ⇒ empty                                  (3)
    U: upd(EID, content) → del(EID) ⇒ del(EID)                               (4)
          with time(del(EID)) := TIME(del(EID), upd(EID, content))

    // ↓ most recent operation: add ↓
    U: op(EID) → del(EID) → add(EID, content)
          ⇒ op(EID) → add(EID, content)                                      (5)
    I: add(EID, _) → add(EID, content) ⇒ add(EID, content)                   (6)
    I: upd(EID, _) → add(EID, content) ⇒ upd(EID, content)                   (7)

    // ↓ most recent operation: update (upd) ↓
    I: op(EID) → del(EID) → upd(EID, content)
          ⇒ op(EID) → upd(EID, content)                                      (8)
    U: add(EID, content) → upd(EID, content) ⇒ add(EID, content)             (9)
    U: add(EID, content) → upd(EID, content')
          ⇒ add(EID, MERGE(content', content))                               (10)
    R: upd(EID, content) → upd(EID, content) ⇒ upd(EID, content)             (11)
    U: upd(EID, content) → upd(EID, content')
          ⇒ upd(EID, MERGE(content', content))                               (12)

   The rules have to be analyzed sequentially from left to right (→), whereas the left operation comes temporally before the right one (i.e., time(left) < time(right)). To warrant that the operations are working on the same component, the EID of both operations has to be equal. If two operations exist and a rule applies to them, then the result can be found on the right side of ⇒. The time of the result is the time of the prior (left) operation, except if further investigations are explicitly necessary or the time is unknown (e.g. empty).
   Another point of view illustrates that the introduced rules are complete concerning the given operations add, delete and update. Figure 2 represents an operation matrix in which every possible combination is covered by at least one rule. On the x-axis the prior operation and on the y-axis the recent operation are given, whereas the three-valued rules (5) and (8) are reduced to their two most recent operations (i.e. without op(EID)). Each intersection contains the applying rule or rules (considering the possibility of merging the content, see below).

                                prior
       recent        add          delete     update
       add           (6)          (5)        (7)
       delete        (3)          (2)        (4)
       update        (9), (10)    (8)        (11), (12)

   Figure 2: Operation matrix of rules
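This completeness argument can be illustrated by writing the operation matrix of figure 2 down as a simple lookup table; the following Python sketch and its names are assumptions for illustration only, not part of the ROfEL implementation.

    # Sketch only: figure 2 as a lookup table mapping
    # (recent operation, prior operation) to the applicable ROfEL rule number(s).
    OPERATION_MATRIX = {
        ('add', 'add'): (6,),      ('add', 'del'): (5,),    ('add', 'upd'): (7,),
        ('del', 'add'): (3,),      ('del', 'del'): (2,),    ('del', 'upd'): (4,),
        ('upd', 'add'): (9, 10),   ('upd', 'del'): (8,),    ('upd', 'upd'): (11, 12),
    }

    # All nine combinations of {add, del, upd} x {add, del, upd} are covered,
    # which mirrors the claim that the rule set is complete for these operations.
    assert len(OPERATION_MATRIX) == 9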



                                                                                        12
   Rule (4) is one example for further investigations. If a component is deleted (del(EID)) but updated (upd(EID)) before, then it is not possible to replace the prior operation with the result (del(EID)) without analyzing other operations between them. The problem is: if another operation (op(EID')) references the deleted component (e.g. a simple type), but because of ROfEL upd(EID) (the prior operation) is replaced with del(EID), then op(EID') would become invalid. Therefore, the function TIME() is used to determine the correct time of the result. The function is given in pseudocode in figure 3. TIME() has two input parameters and returns a time value, dependent on the existence of an operation which references the EID in its content. If no such operation exists, the time of the result in rule (4) is the time of the left operation (op), otherwise that of the right operation (op'). The lines starting with // are comments and contain further information, some hints or even explanations of variables.

    TIME(op, op'):
    // time(op) = t; time(op') = t'; time(opx) = tx;
    // op.EID == op'.EID; op.EID != opx.EID; t > t';
       begin
         if ((t > tx > t') AND
             (op.EID in opx.content))
           then return t;
         return t';
       end.

   Figure 3: TIME() function of optimizer
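Read operationally, figure 3 only checks whether some other operation opx, with a time stamp between both operations, references the affected EID. A possible Python re-reading of this decision is sketched below; the parameter between_ops and the attribute names are assumptions of this sketch, not the original implementation.

    # Sketch only: the decision of figure 3, assuming entries with .time, .eid and .content.
    def choose_time(op, op_prior, between_ops):
        """Return the time stamp that the result of rule (4) should carry."""
        for opx in between_ops:
            # another component's operation between both time stamps references the EID
            if op_prior.time < opx.time < op.time and str(op.eid) in opx.content:
                return op.time      # keep the more recent time, so the reference stays valid
        return op_prior.time        # default: the time of the prior operation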
   The rules (6), (7) and (8) adapt invalid operations. For example, if a component is updated but deleted before (see rule (8)), then ROfEL has to decide which operation is valid. In this and similar cases the most recent operation is preferred, because it is more difficult (or even impossible) to check the intention of the prior operation. Consequently, in rule (8) del(EID) is removed and rule op(EID) → upd(EID, content) applies (op(EID) could be empty; see rule (1)).
   The rules (10) and (12) remove unnecessary operations by merging the content of the involved operations. The function MERGE() implements this; the pseudocode is presented in figure 4. MERGE() has two input parameters: the content of the most recent (left) and of the prior (right) operation. The content is given as a sequence of attribute-value pairs (see the ELaX description in section 2). The result of the function is the combination of the input, whereas the content of the most recent operation is preferred, analogous to the above mentioned behaviour for the I rules. All attribute-value pairs of the most recent operation are completely inserted into the result. Simultaneously, these attributes are removed from the content of the prior operation. At the end of the function, all remaining attributes of the prior (right) operation are inserted, before the result is returned.

    MERGE(content, content'):
    // content  = (A1 = 'a1', A2 = 'a2', A3 = '',   A4 = 'a4');
    // content' = (A1 = 'a1', A2 = '',   A3 = 'a3', A5 = 'a5');
       begin
         result := {};
         count := 1;
         while (count <= content.size())
           result.add(content.get(count));
           if (content.get(count) in content')
             then
               content'.remove(content.get(count));
           count := count + 1;
         count := 1;
         while (count <= content'.size())
           result.add(content'.get(count));
           count := count + 1;
    // result = (A1 = 'a1', A2 = 'a2', A3 = '', A4 = 'a4', A5 = 'a5');
         return result;
       end.

   Figure 4: MERGE() function of optimizer
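Since the content is a collection of attribute-value pairs in which the most recent operation wins, the effect of figure 4 can be sketched with plain Python dictionaries; this is an illustration under that assumption, not the original code.

    # Sketch only: MERGE() of figure 4 over dicts of attribute-value pairs.
    def merge(recent, prior):
        result = dict(recent)            # pairs of the most recent operation are taken completely
        for attr, value in prior.items():
            if attr not in result:       # only attributes not already set by the recent content
                result[attr] = value     # remaining pairs of the prior operation are appended
        return result

    # The example from the comments in figure 4:
    recent = {'A1': 'a1', 'A2': 'a2', 'A3': '', 'A4': 'a4'}
    prior  = {'A1': 'a1', 'A2': '',   'A3': 'a3', 'A5': 'a5'}
    assert merge(recent, prior) == {'A1': 'a1', 'A2': 'a2', 'A3': '', 'A4': 'a4', 'A5': 'a5'}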
   All mentioned rules, as well as the functions TIME() and MERGE(), are essential parts of the main function ROFEL(); its pseudocode is presented in figure 5. ROFEL() has one input parameter, the log of ELaX operations. This log is a sequence sorted according to time, and it is analyzed in reverse. In general, one operation is pinned (log.get(i)) and compared with the next prior operation (log.get(k)). If log.get(k) modifies the same component as log.get(i) (i.e., the EID is equal) and the time is different, then an applying rule is searched, otherwise the next operation (log.get(k - 1)) is analyzed. The algorithm terminates if the outer loop completes successfully (i.e., no further optimization is possible).

    ROFEL(log):
    // log = ((t1,op1), (t2,op2), ...); t1 < t2 < ...;
     begin
       for (i := log.size(); i >= 2; i := i - 1)
         for (k := i - 1; k >= 1 ; k := k - 1)
           if(!(log.get(i).EID == log.get(k).EID AND
                log.get(i).time != log.get(k).time))
             then continue;
    // R: del(EID) -> del(EID) => del(EID) (2)
           if (log.get(i).op-Type == 1 AND
               log.get(k).op-Type == 1)
             then
               log.remove(i);
               return ROFEL(log);
    // U: upd(EID, content) -> del(EID)
    //     => del(EID) (4)
           if (log.get(i).op-Type == 1 AND
               log.get(k).op-Type == 2)
             then
               temp := TIME(log.get(i), log.get(k));
               if (temp == log.get(i).time)
                 then
                   log.remove(k);
                   return ROFEL(log);
               log.get(k) := log.get(i);
               log.remove(i);
               return ROFEL(log); [...]
    // U: upd(EID,con) -> upd(EID,con')
    //     => upd(EID, MERGE(con',con)) (12)
           if (log.get(i).op-Type == 2 AND
               log.get(k).op-Type == 2)
             then
               temp := MERGE(log.get(i).content,
                             log.get(k).content);
               log.get(k).content := temp;
               log.remove(i);
               return ROFEL(log);
       return log;
     end.

   Figure 5: Main function ROFEL() of optimizer
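To make the control flow of figure 5 concrete, the following deliberately simplified Python sketch covers only rules (2) and (12); log entries are assumed to be dictionaries with the keys time, eid, op and a dictionary-valued content, which is an assumption of this illustration rather than the data model of the paper.

    # Simplified sketch of figure 5: reverse scan, restart after every applied rule.
    def rofel(log):
        log = sorted(log, key=lambda e: e['time'])
        for i in range(len(log) - 1, 0, -1):            # pinned, most recent operation
            for k in range(i - 1, -1, -1):              # prior operation
                if log[i]['eid'] != log[k]['eid']:
                    continue                            # not the same component
                if log[i]['op'] == 'del' and log[k]['op'] == 'del':
                    del log[i]                          # rule (2): redundant delete
                    return rofel(log)
                if log[i]['op'] == 'upd' and log[k]['op'] == 'upd':
                    # rule (12): merge both updates, the more recent content wins
                    log[k]['content'] = {**log[k]['content'], **log[i]['content']}
                    del log[i]
                    return rofel(log)
        return log                                      # no further optimization possible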



   Three rules are presented in figure 5; the missing ones are skipped ([...]). The first rule is (2), the occurrence of redundant delete operations. According to the above mentioned time choosing guidelines, the most recent operation (log.get(i)) is removed. After this the optimizer starts again with the modified log recursively (return ROFEL(log)).
   The second rule is (4), which removes an unnecessary update operation, because the whole referenced component will be deleted later. This rule uses the TIME() function of figure 3 to decide which time should be assigned to the result. If another operation between log.get(i) and log.get(k) exists and this operation contains or references log.get(i).EID, then the most recent time (log.get(i).time) is assigned, otherwise the prior time (log.get(k).time).
   The last rule is (12), in which different updates on the same component are given. The MERGE() function of figure 4 combines the content of both operations, before the content of the prior operation is changed and the most recent operation is removed.
   After introducing detailed information about the concept of the ROfEL algorithm, we want to use it to optimize an example in the next section.

4. EXAMPLE
   In the last section we specified the rule-based algorithm ROfEL (Rule-based Optimizer for ELaX); now we want to explain its use with an example: we want to store some information about a conference. We assume the XML Schema of figure 6 is given; a corresponding XML document is also presented.

   [Figure 6: XML Schema with XML document (schema and document instance not reproduced here)]

   The XML Schema is in the Garden of Eden style and contains four element declarations (conf, name, count, start) and one complex type definition (confType) with a group model (sequence). The group model has three element references, each of which references one of the simple type element declarations mentioned above. The identification of all components is simplified by using an EID; it is visualized as a unique ID attribute (id = "..").
   The log of modification steps to create this XML Schema is presented in figure 7. The relational schema is reduced in comparison to figure 1: the time, the component EID, the op-Type and the content of the modification steps are given. The log contains different modification steps which are not given in the XML Schema (EID > 9). Additionally, some entries are connected within the newly introduced column ROfEL; the red lines and numbers of the original figure represent the involved log entries and the applying ROfEL rule: rule (10) links entries 1 and 2, rule (3) links entries 5 and 8, rule (4) links entries 8 and 12, and rule (2) links entries 12 and 14.

    time   EID   op-Type   content
     1      1     add      add element name 'name' type 'xs:decimal' id 'EID1' ;
     2      1     upd      update element name 'name' change type 'xs:string' ;
     3      2     add      add element name 'count' type 'xs:decimal' id 'EID2' ;
     4      3     add      add element name 'start' type 'xs:date' id 'EID3' ;
     5     42     add      add element name 'stop' type 'xs:date' id 'EID42' ;
     6      4     add      add complextype name 'confType' id 'EID4' ;
     7      5     add      add group mode sequence id 'EID5' in 'EID4' ;
     8     42     upd      update element name 'stop' change type 'xs:string' ;
     9      6     add      add elementref 'name' id 'EID6' in 'EID5' ;
    10      7     add      add elementref 'count' id 'EID7' in 'EID5' ;
    11      8     add      add elementref 'start' id 'EID8' in 'EID5' ;
    12     42     del      delete element name 'stop' ;
    13      9     add      add element name 'conf' type 'confType' id 'EID9' ;
    14     42     del      delete element name 'stop' ;

   Figure 7: XML Schema modification log of figure 6

   The sorted log is analyzed in reverse: the operation with time stamp 14 is pinned and compared with time entry 13. Because the modified component is not the same (EID not equal), the next operation with time 12 is taken. Both operations delete the same component (op-Type == 1). According to rule (2), the redundant entry 14 is removed and ROFEL restarts with the adapted log.
   Rule (4) applies next: a component is updated but deleted later. This rule calls the TIME() function to determine whether the time of the result (i.e., del(EID)) should be 12 or 8. Because no operation between 12 and 8 references EID 42, the time of the result of (4) is 8. The content of time 8 is replaced with delete element name 'stop';, the op-Type is set to 1 and the time entry 12 is deleted.
   Afterwards, ROFEL restarts again and rule (3) can be used to compare the new operation of entry 8 (original entry 12) with the operation of time 5. A component is inserted but deleted later, so all modifications on this component are unnecessary in general. Consequently, both entries are deleted and the component with EID 42 is not given in the XML Schema of figure 6.
   The last applying rule is (10). An element declaration is inserted (time 1) and updated (time 2). Consequently, the MERGE() function is used to combine the content of both operations. According to the ELaX specification, the content of the update operation contains the attribute type with the value xs:string, whereas the add operation contains the attribute type with the value xs:decimal and id with EID1. All attribute-value pairs of the update operation are completely inserted into the output of the function (type = "xs:string"). Simultaneously, the attribute type is removed from the content of the add operation (type = "xs:decimal"). The remaining attributes are inserted into the output (id = "EID1"). Afterwards, the content of entry 1 is replaced by add element name 'name' type 'xs:string' id 'EID1'; and the second entry is deleted (time 2).
   The modification log of figure 7 is optimized with rules (2), (4), (3) and (10). The result is presented in figure 8. All in all, five of the 14 entries are removed, whereas one is replaced by a combination of two others.
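The rule (10) merge step just described can be retraced with plain Python dictionaries; only the attribute names and values are taken from the log entries of figure 7, everything else is an illustrative assumption rather than the paper's implementation.

    # Illustration only: the rule (10) merge of log entries 1 (add) and 2 (update) for EID 1.
    add_content = {'name': 'name', 'type': 'xs:decimal', 'id': 'EID1'}
    upd_content = {'name': 'name', 'type': 'xs:string'}

    merged = {**add_content, **upd_content}   # pairs of the more recent update override the add
    assert merged == {'name': 'name', 'type': 'xs:string', 'id': 'EID1'}
    # i.e. entry 1 becomes: add element name 'name' type 'xs:string' id 'EID1' ;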



    time   EID   op-Type   content
     1      1     add      add element name 'name' type 'xs:string' id 'EID1' ;
     3      2     add      add element name 'count' type 'xs:decimal' id 'EID2' ;
     4      3     add      add element name 'start' type 'xs:date' id 'EID3' ;
     6      4     add      add complextype name 'confType' id 'EID4' ;
     7      5     add      add group mode sequence id 'EID5' in 'EID4' ;
     9      6     add      add elementref 'name' id 'EID6' in 'EID5' ;
    10      7     add      add elementref 'count' id 'EID7' in 'EID5' ;
    11      8     add      add elementref 'start' id 'EID8' in 'EID5' ;
    13      9     add      add element name 'conf' type 'confType' id 'EID9' ;

   Figure 8: XML Schema modification log of figure 7 after using rules (2), (4), (3) and (10) of ROfEL

   This simple example illustrates how ROfEL can reduce the number of logged operations with the rules introduced in section 3. More complex examples are easy to construct and can be solved by using the same rules and the same algorithm.

5. RELATED WORK
   Comparable to the object lifecycle, we create new types or elements, use (e.g. modify, move or rename) them and delete them. The common optimization rules to reduce the number of operations were originally introduced in [10] and are available in other applications in the same way. In [11], rules for reducing a list of user actions (e.g. move, replace, delete, ...) are introduced. In [9], pre- and postconditions of operations are used for deciding which optimizations can be executed. Additional applications can easily be found in further scientific publications.
   Regarding other transformation languages, the most commonly used are XQuery [3] and XSLT (Extensible Stylesheet Language Transformations [1]); for these, there are also approaches to reduce the number of unnecessary or redundant operations. Moreover, different transformations to improve efficiency are mentioned.
   In [12] different ”high-level transformations to prune and merge the stream data flow graph” [12] are applied. ”Such techniques not only simplify the later analyses, but most importantly, they can rewrite some queries” [12], an essential prerequisite for the efficient evaluation of XQuery over streaming data.
   In [5] packages are introduced because of efficiency benefits. A package is a collection of stylesheet modules ”to avoid compiling libraries repeatedly when they are used in multiple stylesheets, and to avoid holding multiple copies of the same library in memory simultaneously” [5]. Furthermore, XSLT works with templates and matching rules for identifying structures in general. If different templates could be applied, automatic or user-given priorities manage which template is chosen. To avoid unexpected behaviour and to improve the efficiency of analyses, it is good practice to remove unnecessary or redundant templates.
   Another XML Schema modification language is XSchemaUpdate [6], which is used in the co-evolution prototype EXup [7]. Especially the auto adaptation guidelines are similar to the ROfEL purpose of reducing the number of modification steps. ”Automatic adaptation will insert or remove the minimum allowed number of elements for instance” [6], i.e., ”a minimal set of updates will be applied to the documents” [6].
   In [8] an approach is presented which deals with four operations (insert, delete, update, move) on a tree representation of XML. It is similar to our algorithm, but we use ELaX as a basis and EIDs instead of update-intensive labelling mechanisms. Moreover, the distinction between property and node, the ”deletion always wins” view, as well as the limitation that a ”reduced sequence might still be reducible” [8] are drawbacks. The optimized reduction algorithm eliminates the last drawback, but needs another complex structure, an operation hyper-graph.

6. CONCLUSION
   The rule-based algorithm ROfEL (Rule-based Optimizer for ELaX) was developed to reduce the number of logged ELaX (Evolution Language for XML-Schema [16]) operations. In general, ELaX statements are add, delete and update operations on the components of an XML Schema, specified by a user.
   ROfEL allows the identification and deletion of unnecessary and redundant modifications by applying different heuristic rules. Additionally, invalid operations are also corrected or removed. In general, if the preconditions and conditions for an adaptation of two ELaX log entries are satisfied (e.g. EID equivalent, op-Type correct, etc.), one rule is applied and the modified, reduced log is returned.
   We are confident that, even though ROfEL is domain-specific and the underlying log is specialized for our needs, the above specified rules are applicable in other scenarios or applications in which the common modification operations add, delete and update are used (minor adaptations preconditioned).
   Future work. The integration of a cost-based component in ROfEL could be very interesting. It is possible that, under consideration of further analyses, the combination of different operations (e.g. rule (10)) is inefficient in general. In this and similar cases a cost function with different thresholds could be defined to guarantee that only efficient adaptations of the log are applied. A convenient cost model would be necessary, but this requires further research.
   Feasibility of the approach. At the University of Rostock we implemented the prototype CodeX (Conceptual design and evolution for XML Schema) for dealing with the co-evolution [14] of XML Schema and XML documents; ROfEL and the corresponding concepts are fully integrated. As we plan to report in combination with the first release of CodeX, the significantly reduced number of logged operations proves that the whole algorithm is definitely feasible.

7. REFERENCES
 [1] XSL Transformations (XSLT) Version 2.0.
     http://www.w3.org/TR/2007/REC-xslt20-20070123/,
     January 2007. Online; accessed 25-June-2014.
 [2] Extensible Markup Language (XML) 1.0 (Fifth Edition).
     http://www.w3.org/TR/2008/REC-xml-20081126/,
     November 2008. Online; accessed 25-June-2014.
 [3] XQuery 1.0: An XML Query Language (Second Edition).
     http://www.w3.org/TR/2010/REC-xquery-20101214/,
     December 2010. Online; accessed 25-June-2014.
 [4] W3C XML Schema Definition Language (XSD) 1.1
     Part 1: Structures. http://www.w3.org/TR/2012/



     REC-xmlschema11-1-20120405/, April 2012. Online;
     accessed 25-June-2014.
 [5] XSL Transformations (XSLT) Version 3.0.
     http://www.w3.org/TR/2013/WD-xslt-30-20131212/,
     December 2013. Online; accessed 25-June-2014.
 [6] F. Cavalieri. Querying and Evolution of XML Schemas
     and Related Documents. Master’s thesis, University of
     Genova, 2009.
 [7] F. Cavalieri. EXup: an engine for the evolution of
     XML schemas and associated documents. In
     Proceedings of the 2010 EDBT/ICDT Workshops,
     EDBT ’10, pages 21:1–21:10, New York, NY, USA,
     2010. ACM.
 [8] F. Cavalieri, G. Guerrini, M. Mesiti, and B. Oliboni.
     On the Reduction of Sequences of XML Document
     and Schema Update Operations. In ICDE Workshops,
     pages 77–86, 2011.
 [9] H. U. Hoppe. Task-oriented Parsing - a Diagnostic
      Method to Be Used by Adaptive Systems. In Proceedings
     of the SIGCHI Conference on Human Factors in
     Computing Systems, CHI ’88, pages 241–247, New
     York, NY, USA, 1988. ACM.
[10] M. Klettke. Modellierung, Bewertung und Evolution
     von XML-Dokumentkollektionen. Habilitation,
     Fakultät für Informatik und Elektrotechnik,
     Universität Rostock, 2007.
[11] R. Kramer. iContract - the Java(tm) Design by
      Contract(tm) tool. In TOOLS ’98: Proceedings of
     the Technology of Object-Oriented Languages and
     Systems, page 295. IEEE Computer Society, 1998.
[12] X. Li and G. Agrawal. Efficient Evaluation of XQuery
      over Streaming Data. In Proc. VLDB ’05, pages
     265–276, 2005.
[13] E. Maler. Schema Design Rules for UBL...and Maybe
     for You. In XML 2002 Proceedings by deepX, 2002.
[14] T. Nösinger, M. Klettke, and A. Heuer. Evolution von
     XML-Schemata auf konzeptioneller Ebene - Übersicht:
     Der CodeX-Ansatz zur Lösung des
     Gültigkeitsproblems. In Grundlagen von Datenbanken,
     pages 29–34, 2012.
[15] T. Nösinger, M. Klettke, and A. Heuer. A Conceptual
     Model for the XML Schema Evolution - Overview:
     Storing, Base-Model-Mapping and Visualization. In
     Grundlagen von Datenbanken, 2013.
[16] T. Nösinger, M. Klettke, and A. Heuer. XML Schema
     Transformations - The ELaX Approach. In DEXA (1),
     pages 293–302, 2013.




      Automatic Decomposition of Multi-Author Documents
                   Using Grammar Analysis

                                          Michael Tschuggnall and Günther Specht
                                                Databases and Information Systems
                                  Institute of Computer Science, University of Innsbruck, Austria
                                   {michael.tschuggnall, guenther.specht}@uibk.ac.at


Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

ABSTRACT
The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of authors. The main idea is thereby to quantify differences of the grammatical writing style of authors and to use this information to build paragraph clusters, whereby each cluster is assigned to a different author. In order to analyze the style of a writer, text is split into single sentences, and for each sentence a full parse tree is calculated. Using the latter, a profile is computed subsequently that represents the main characteristics for each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, whereby a supplementary analysis indicates that in general common classification algorithms perform better than clustering approaches.

Keywords
Text Segmentation, Multi-Author Decomposition, Parse Trees, pq-grams, Clustering

1.   INTRODUCTION
   The growing amount of currently available data is hardly manageable without the use of specific tools and algorithms that provide relevant portions of that data to the user. While this problem is generally addressed with information retrieval approaches, another possibility to significantly reduce the amount of data is to build clusters. Within each cluster, the data is similar according to some predefined features. Thereby many approaches exist that propose algorithms to cluster plain text documents (e.g. [16], [22]) or specific web documents (e.g. [33]) by utilizing various features.
   Approaches which attempt to divide a single text document into distinguishable units like different topics, for example, are usually referred to as text segmentation approaches. Here, also many features including statistical models, similarities between words or other semantic analyses are used. Moreover, text clusters are also used in recent plagiarism detection algorithms (e.g. [34]) which try to build a cluster for the main author and one or more clusters for intrusive paragraphs. Another scenario where the clustering of text is applicable is the analysis of multi-author academic papers: especially the verification of collaborated student works such as bachelor or master theses can be useful in order to determine the amount of work done by each student.
   Using results of previous work in the field of intrinsic plagiarism detection [31] and authorship attribution [32], the assumption that individual authors have significantly different writing styles in terms of the syntax that is used to construct sentences has been reused. For example, the following sentence (extracted from a web blog): "My chair started squeaking a few days ago and it's driving me nuts." (S1) could also be formulated as "Since a few days my chair is squeaking - it's simply annoying." (S2), which is semantically equivalent but differs significantly according to the syntax, as can be seen in Figure 1. The main idea of this work is to quantify those differences by calculating grammar profiles and to use this information to decompose a collaboratively written document, i.e., to assign each paragraph of a document to an author.
   The rest of this paper is organized as follows: Section 2 at first recapitulates the principle of pq-grams, which represent a core concept of the approach. Subsequently the algorithm is presented in detail, which is then evaluated in Section 3 by using different clustering algorithms and data sets. A comparison of clustering and classification approaches is discussed in Section 4, while Section 5 depicts related work. Finally, a conclusion and future work directions are given in Section 6.

2.   ALGORITHM
   In the following the concept of pq-grams is explained, which serves as the basic stylistic measure in this approach to distinguish between authors. Subsequently, the concrete steps performed by the algorithm are discussed in detail.

2.1   Preliminaries: pq-grams
   Similar to n-grams that represent subparts of given length n of a string, pq-grams extract substructures of an ordered, labeled tree [4]. The size of a pq-gram is determined by a stem (p) and a base (q) as shown in Figure 2. Thereby p defines how many nodes are included vertically, and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with p = 2 and q = 3 starting from PP at the left side of tree (S2) shown in Figure 1 would be [PP-NP-DT-JJ-NNS] (the concrete words are omitted).
   The pq-gram index then consists of all possible pq-grams of a tree. In order to obtain all pq-grams, the base is additionally shifted left and right: if less than q nodes exist horizontally, the corresponding place in the pq-gram is filled with *, indicating a missing node. Applying this idea to the previous example, also the pq-gram [PP-IN-*-*-*] (no nodes in the base) is valid, and [PP-NP-*-*-DT] (base shifted left by two), [PP-NP-*-DT-JJ] (base shifted left by one), [PP-NP-JJ-NNS-*] (base shifted right by one) and [PP-NP-NNS-*-*] (base shifted right by two) have to be considered as well. As a last example, all leaves have the pq-gram pattern [leaf_label-*-*-*-*].
   Finally, the pq-gram index is the set of all valid pq-grams of a tree, whereby multiple occurrences of the same pq-grams are also present multiple times in the index.
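To make this construction concrete, the following minimal Python sketch (an illustration, not the authors' implementation; Node, pq_grams and STAR are names chosen here) computes the pq-gram index of a labeled tree for the fixed case p = 2 and a configurable q, following the shifting and *-padding rules described above. Applied to the PP subtree of sentence (S2), it yields, among others, exactly the pq-grams listed in the previous paragraph.

    from collections import Counter
    from typing import List

    STAR = "*"

    class Node:
        """A minimal labeled, ordered tree node (e.g. a POS tag of a parse tree)."""
        def __init__(self, label: str, children: List["Node"] = None):
            self.label = label
            self.children = children or []

    def pq_grams(root: Node, q: int = 3) -> Counter:
        """pq-gram index (a multiset of label tuples) of a tree for p = 2.

        For every node v and every child c of v, the stem is (v, c) and the base
        is a window of q consecutive children of c; the child list is padded with
        '*' so that the base can be shifted left and right. Leaves yield the
        pattern [leaf_label-*-...-*]."""
        index = Counter()

        def visit(v: Node):
            if not v.children:                          # leaf: [label-*-*-*-*]
                index[tuple([v.label] + [STAR] * (1 + q))] += 1
            for c in v.children:
                if not c.children:                      # empty base: one all-star window
                    index[tuple([v.label, c.label] + [STAR] * q)] += 1
                else:                                   # shift the base left and right
                    padded = [STAR] * (q - 1) + [x.label for x in c.children] + [STAR] * (q - 1)
                    for i in range(len(padded) - q + 1):
                        index[(v.label, c.label, *padded[i:i + q])] += 1
                visit(c)

        visit(root)
        return index

    # The PP subtree of sentence (S2) in Figure 1: PP -> IN, NP(DT, JJ, NNS)
    pp = Node("PP", [Node("IN"), Node("NP", [Node("DT"), Node("JJ"), Node("NNS")])])
    for gram, count in sorted(pq_grams(pp).items()):
        print("-".join(gram), count)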
Figure 1: Grammar Trees of the Semantically Equivalent Sentences (S1) and (S2).

Figure 2: Structure of a pq-gram Consisting of Stem p = 2 and Base q = 3.

2.2   Clustering by Authors
   The number of choices an author has to formulate a sentence in terms of grammar structure is rather high, and the assumption in this approach is that the concrete choice is made mostly intuitively and unconsciously. On that basis the grammar of authors is analyzed, which serves as input for common state-of-the-art clustering algorithms to build clusters of text documents or paragraphs. The decision of the clustering algorithms is thereby based on the frequencies of occurring pq-grams, i.e., on pq-gram profiles. In detail, given a text document the algorithm consists of the following steps:

   1. At first the document is preprocessed by eliminating unnecessary whitespaces or non-parsable characters. For example, many data sets are often based on novels and articles of various authors, whereby frequently OCR text recognition is used due to the lack of digital data. Additionally, such documents contain problem sources like chapter numbers and titles or incorrectly parsed picture frames that result in non-alphanumeric characters.

   2. Subsequently, the document is partitioned into single paragraphs. For simplification reasons this is currently done by only detecting multiple line breaks.

   3. Each paragraph is then split into single sentences by utilizing a sentence boundary detection algorithm implemented within the OpenNLP framework1. Then for each sentence a full grammar tree is calculated using the Stanford Parser [19]. For example, Figure 1 depicts the grammar trees resulting from analyzing sentences (S1) and (S2), respectively. The labels of each tree correspond to a part-of-speech (POS) tag of the Penn Treebank set [23], where e.g. NP corresponds to a noun phrase, DT to a determiner or JJS to a superlative adjective. In order to examine only the building structure of sentences, as intended by this work, the concrete words, i.e., the leaves of the tree, are omitted.

   4. Using the grammar trees of all sentences of the document, the pq-gram index is calculated. As shown in Section 2.1 all valid pq-grams of a sentence are extracted and stored into a pq-gram index. By combining all pq-gram indices of all sentences, a pq-gram profile is computed which contains a list of all pq-grams and their corresponding frequency of appearance in the text. Thereby the frequency is normalized by the total number of all appearing pq-grams. As an example, the five most frequently used pq-grams using p = 2 and q = 3 of a sample document are shown in Table 1. The profile is sorted descending by the normalized occurrence, and an additional rank value is introduced that simply defines a natural order which is used in the evaluation (see Section 3).

   5. Finally, each paragraph-profile is provided as input for clustering algorithms, which are asked to build clusters based on the pq-grams contained. Concretely, three different feature sets have been evaluated: (1) the frequencies of occurrences of each pq-gram, (2) the rank of each pq-gram and (3) a union of the latter sets. (A sketch of this profile and feature construction is given below.)

Table 1: Example of the Five Most Frequently Used pq-grams of a Sample Document.

      pq-gram          Occurrence [%]   Rank
      NP-NN-*-*-*           2.68          1
      PP-IN-*-*-*           2.25          2
      NP-DT-*-*-*           1.99          3
      NP-NNP-*-*-*          1.44          4
      S-VP-*-*-VBD          1.08          5

1 Apache OpenNLP, http://incubator.apache.org/opennlp, visited July 2014
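The profile and feature construction of steps 4 and 5 can be summarized in a short sketch. The following Python fragment is an illustration under assumptions, not the authors' code: profile() normalizes a pq-gram index into relative frequencies as in Table 1, and feature_matrix() builds the three evaluated feature sets; the convention that pq-grams absent from a paragraph receive rank 0 is an assumption made here for illustration.

    from collections import Counter

    def profile(index: Counter) -> dict:
        """Normalize a pq-gram index to relative frequencies (step 4)."""
        total = sum(index.values())
        return {gram: count / total for gram, count in index.items()}

    def feature_matrix(profiles, feature_set="all"):
        """Turn per-paragraph pq-gram profiles into fixed-length vectors (step 5).

        feature_set: 'occurrence' (normalized frequencies), 'rank' (position in
        the frequency-sorted profile, 0 for pq-grams absent from a paragraph),
        or 'all' (concatenation of both)."""
        vocabulary = sorted({g for prof in profiles for g in prof})
        rows = []
        for prof in profiles:
            # rank 1 = most frequent pq-gram of this paragraph (cf. Table 1)
            ranked = {g: r + 1 for r, (g, _) in
                      enumerate(sorted(prof.items(), key=lambda kv: -kv[1]))}
            occurrence = [prof.get(g, 0.0) for g in vocabulary]
            rank = [ranked.get(g, 0) for g in vocabulary]
            rows.append({"occurrence": occurrence,
                         "rank": rank,
                         "all": occurrence + rank}[feature_set])
        return rows

In practice the per-paragraph profiles would come from the parse trees of step 3, e.g. profiles = [profile(pq_grams(t)) for t in paragraph_trees], using the pq_grams sketch shown after Section 2.1 and a hypothetical list paragraph_trees holding the parse trees of one paragraph.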
2.3   Utilized Algorithms
   Using the WEKA framework [15], the following clustering algorithms have been evaluated: K-Means [3], Cascaded K-Means (the number of clusters is cascaded and automatically chosen) [5], X-Means [26], Agglomerative Hierarchical Clustering [25], and Farthest First [9].
   For the clustering algorithms K-Means, Hierarchical Clustering and Farthest First the number of clusters has been predefined according to the respective test data. This means that if the test document has been written collaboratively by three authors, the number of clusters has also been set to three. On the other hand, the algorithms Cascaded K-Means and X-Means implicitly decide which number of clusters is optimal. Therefore these algorithms have only been limited by a range, i.e., the minimum and maximum number of clusters has been set to two and six, respectively.

3.   EVALUATION
   The utilization of pq-gram profiles as input features for modern clustering algorithms has been extensively evaluated using different documents and data sets. As clustering and classification problems are closely related, the global aim was to experiment on the accuracy of automatic text clustering using solely the proposed grammar feature, and furthermore to compare it to that of current classification techniques.

3.1   Test Data and Experimental Setup
   In order to evaluate the idea, different documents and test data sets have been used, which are explained in more detail in the following. Thereby single documents have been created which contain paragraphs written by different authors, as well as multiple documents, whereby each document is written by one author. In the latter case, every document is treated as one (large) paragraph for simplification reasons.
   For the experiment, different parameter settings have been evaluated, i.e., the pq-gram values p and q have been varied from 2 to 4, in combination with the three different feature sets. Concretely, the following data sets have been used:

   • Twain-Wells (T-W): This document has been specifically created for the evaluation of in-document clustering. It contains 50 paragraphs of the book "The Adventures of Huckleberry Finn" by Mark Twain, and 50 paragraphs of "The Time Machine" by H. G. Wells2. All paragraphs have been randomly shuffled, whereby the size of each paragraph varies from approximately 25 words up to 280 words.

   • Twain-Wells-Shelley (T-W-S): In a similar fashion a three-author document has been created. It again uses (different) paragraphs of the same books by Twain and Wells, and extends them with paragraphs of the book "Frankenstein; Or, The Modern Prometheus" by Mary Wollstonecraft Shelley. Summarizing, the document contains 50 paragraphs by Mark Twain, 50 paragraphs by H. G. Wells and another 50 paragraphs by Mary Shelley, whereby the paragraph sizes are similar to the Twain-Wells document.

   • The Federalist Papers (FED): Probably the most frequently referenced text corpus in the field of authorship attribution is a series of 85 political essays called "The Federalist Papers" written by John Jay, Alexander Hamilton and James Madison in the 18th century. While most of the authorships are undoubted, many works have studied and questioned the correct authorship of 12 disputed essays [24], which have been excluded in the experiment.

   • The PAN'12 competition corpus (PAN12): As a well-known, state-of-the-art corpus originally created for the use in authorship identification, parts3 of the PAN2012 corpus [18] have been integrated. The corpus is composed of several fiction texts and split into several subtasks that cover small- and common-length documents (1800-6060 words) as well as larger documents (up to 13000 words) and novel-length documents (up to 170,000 words). Finally, the test set used in this evaluation contains 14 documents (paragraphs) written by three authors that are distributed equally.

3.2   Results
   The best results of the evaluation are presented in Table 2, where the best performance for each clusterer over all data sets is shown in subtable (a), and the best configuration for each data set is shown in subtable (b), respectively. With an accuracy of 63.7% the K-Means algorithm worked best by using p = 2, q = 3 and by utilizing all available features. Interestingly, the X-Means algorithm also achieved good results considering the fact that in this case the number of clusters has been assigned automatically by the algorithm. Finally, the hierarchical clusterer performed worst, with an accuracy nearly 10% lower than that of K-Means.
   Regarding the best performances for each test data set, the results for the manually created data sets from novel literature are generally poor. For example, the best result for the two-author document Twain-Wells is only 59.6%, i.e., the accuracy is only slightly better than the baseline percentage of 50%, which can be achieved by randomly assigning paragraphs into two clusters.4 On the other hand, the data sets reused from authorship attribution, namely the FED and the PAN12 data set, achieved very good results with an accuracy of about 89% and 83%, respectively. Nevertheless, as the other data sets have been specifically created for the clustering evaluation, these results may be more expressive. Therefore a comparison between clustering and classification approaches is discussed in the following, showing that the latter achieve significantly better results on those data sets when using the same features.

Table 2: Best Evaluation Results for Each Clustering Algorithm and Test Data Set in Percent.

   (a) Clustering Algorithms
      Method                p   q   Feature Set        Accuracy
      K-Means               3   2   All                  63.7
      X-Means               2   4   Rank                 61.7
      Farthest First        4   2   Occurrence-Rate      58.7
      Cascaded K-Means      2   2   Rank                 55.3
      Hierarchical Clust.   4   3   Occurrence-Rate      54.7

   (b) Test Data Sets
      Data Set      Method         p   q   Feat. Set   Accuracy
      T-W           X-Means        3   2   All           59.6
      T-W-S         X-Means        3   4   All           49.0
      FED           Farth. First   4   3   Rank          89.4
      PAN12-A/B     K-Means        3   3   All           83.3

2 The books have been obtained from the Project Gutenberg library, http://www.gutenberg.org, visited July 2014
3 The subtasks A and B, respectively.
4 In this case X-Means dynamically created two clusters, but the result is still better than that of other algorithms using a fixed number of clusters.
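To illustrate how such a clustering run can be scored, the following hedged sketch uses scikit-learn's KMeans in place of the WEKA clusterers actually employed in the paper; the toy feature matrix, the author labels and the majority-vote accuracy measure are assumptions made here for illustration and are not taken from the paper.

    from collections import Counter
    from sklearn.cluster import KMeans

    def cluster_accuracy(true_labels, predicted):
        """Map each cluster to its majority author and return the fraction of
        correctly assigned paragraphs (one plausible way to score a clustering)."""
        correct = 0
        for c in set(predicted):
            members = [t for t, p in zip(true_labels, predicted) if p == c]
            correct += Counter(members).most_common(1)[0][1]
        return correct / len(true_labels)

    # Toy stand-in: in the evaluation each row would be the pq-gram feature
    # vector of one paragraph (occurrence rates and/or ranks).
    X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
    true_authors = ["Twain", "Twain", "Wells", "Wells"]

    predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(cluster_accuracy(true_authors, predicted))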
4.   COMPARISON OF CLUSTERING AND CLASSIFICATION APPROACHES
   For the given data sets, any clustering problem can be rewritten as a classification problem, with the exception that the latter needs training data. Although a direct comparison should be treated with caution, it still gives an insight into how the two different approaches perform using the same data sets. Therefore an additional evaluation is shown in the following, which compares the performance of the clustering algorithms to the performance of the following classification algorithms: Naive Bayes classifier [17], Bayes Network using the K2 classifier [8], Large Linear Classification using LibLinear [12], Support vector machine using LIBSVM with nu-SVC classification [6], k-nearest-neighbors classifier (kNN) using k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compensate for the missing training data, a 10-fold cross-validation has been used for each classifier.
   Table 3 shows the performance of each classifier compared to the best clustering result using the same data and pq-setting. It can be seen that the classifiers significantly outperform the clustering results for the Twain-Wells and Twain-Wells-Shelley documents. The support vector machine framework (LibSVM) and the linear classifier (LibLinear) performed best, reaching a maximum accuracy of nearly 87% for the Twain-Wells document. Moreover, the average improvement is given in the bottom line, showing that most of the classifiers outperform the best clustering result by over 20% on average. Solely the kNN algorithm achieves minor improvements as it attributed the two-author document with a poor accuracy of only about 60%.
   A similar general improvement could be achieved on the three-author document Twain-Wells-Shelley, as can be seen in subtable (b). Again, LibSVM could achieve an accuracy of about 75%, whereas the best clustering configuration could only reach 49%. Except for the kNN algorithm, all classifiers significantly outperform the best clustering results for every configuration.
   Quite different comparison results have been obtained for the Federalist Papers and PAN12 data sets, respectively. Here, the improvements gained from the classifiers are only minor, and in some cases are even negative, i.e., the classification algorithms perform worse than the clustering algorithms. A general explanation is the good performance of the clustering algorithms on these data sets, especially by utilizing the Farthest First and K-Means algorithms.
   In case of the Federalist Papers data set shown in subtable (c), all algorithms except kNN could achieve at least some improvement. Although the LibLinear classifier could reach an outstanding accuracy of 97%, the global improvement is below 10% for all classifiers. Finally, subtable (d) shows the results for PAN12, where the outcome is quite diverse as some classifiers could improve on the clusterers significantly, whereas others worsen the accuracy even more drastically. A possible explanation might be the small data set (only the subproblems A and B have been used), which may not be suited very well for a reliable evaluation of the clustering approaches.
   Summarizing, the comparison of the different algorithms reveals that in general classification algorithms perform better than clustering algorithms when provided with the same (pq-gram) feature set. Nevertheless, the results of the PAN12 experiment are very diverse and indicate that there might be a problem with the data set itself, and that this comparison should be treated carefully.

Table 3: Best Evaluation Results for each Clustering Algorithm and Test Data Set in Percent.

   (a) Twain-Wells
      p   q   Algorithm      Max    N-Bay   Bay-Net   LibLin   LibSVM   kNN    J48
      2   2   X-Means        57.6   77.8    82.3      85.2     86.9     62.6   85.5
      2   3   X-Means        56.6   79.8    80.8      81.8     83.3     60.6   80.8
      2   4   X-Means        57.6   76.8    79.8      82.2     83.8     58.6   81.0
      3   2   X-Means        59.6   78.8    80.8      81.8     83.6     59.6   80.8
      3   3   X-Means        53.5   76.8    77.8      80.5     82.3     61.6   79.8
      3   4   X-Means        52.5   81.8    79.8      81.8     83.8     63.6   82.0
      4   2   K-Means        52.5   86.9    83.3      83.5     84.3     62.6   81.8
      4   3   X-Means        52.5   79.8    79.8      80.1     80.3     59.6   77.4
      4   4   Farth. First   51.5   72.7    74.7      75.8     77.0     60.6   75.8
          average improvement       24.1    25.0      26.5     27.9      6.2   25.7

   (b) Twain-Wells-Shelley
      p   q   Algorithm      Max    N-Bay   Bay-Net   LibLin   LibSVM   kNN    J48
      2   2   K-Means        44.3   67.8    70.8      74.0     75.2     51.0   73.3
      2   3   X-Means        38.3   65.1    67.1      70.7     72.3     48.3   70.2
      2   4   X-Means        45.6   63.1    68.1      70.5     71.8     49.0   69.3
      3   2   X-Means        45.0   51.7    64.1      67.3     68.8     45.6   65.4
      3   3   X-Means        47.0   57.7    64.8      67.3     68.5     47.0   65.9
      3   4   X-Means        49.0   67.8    67.8      70.5     72.5     46.3   68.3
      4   2   X-Means        36.2   61.1    67.1      69.1     69.5     50.3   65.1
      4   3   K-Means        35.6   53.0    63.8      67.6     70.0     47.0   66.6
      4   4   X-Means        35.6   57.7    66.1      68.5     69.3     42.3   66.8
          average improvement       18.7    24.8      27.7     29.0      5.6   26.0

   (c) Federalist Papers
      p   q   Algorithm      Max    N-Bay   Bay-Net   LibLin   LibSVM   kNN    J48
      2   2   Farth. First   77.3   81.1    86.4      90.9     84.2     74.2   81.8
      2   3   Farth. First   78.8   85.6    87.4      92.4     89.0     78.8   82.8
      2   4   X-Means        78.8   89.4    92.4      90.9     87.3     89.4   85.9
      3   2   K-Means        81.8   82.6    87.9      92.4     85.5     80.3   83.8
      3   3   K-Means        78.8   92.4    92.4      92.4     86.4     81.8   83.8
      3   4   Farth. First   86.4   84.8    90.9      97.0     85.8     81.8   85.6
      4   2   Farth. First   86.6   81.8    89.4      87.9     83.3     77.3   84.1
      4   3   Farth. First   89.4   85.6    92.4      89.4     85.8     80.3   83.3
      4   4   Farth. First   84.8   86.4    90.9      89.4     85.8     84.8   83.6
          average improvement        3.0     7.5       8.9      3.4     -1.6    1.3

   (d) PAN12-A/B
      p   q   Algorithm      Max    N-Bay   Bay-Net   LibLin   LibSVM   kNN     J48
      2   2   K-Means        83.3    83.3   33.3      100.0    100.0    100.0   33.3
      2   3   K-Means        83.3    83.3   33.3      100.0    100.0    100.0   33.3
      2   4   K-Means        83.3    83.4   33.3      100.0    100.0    100.0   33.3
      3   2   K-Means        83.3    75.0   33.3       91.7     91.7    100.0   33.3
      3   3   K-Means        83.3   100.0   33.3      100.0     91.7    100.0   33.3
      3   4   Farth. First   75.0    66.7   33.3      100.0    100.0     91.7   33.3
      4   2   K-Means        83.3    91.7   33.3       91.7     75.0     91.7   33.3
      4   3   K-Means        83.3    75.0   33.3      100.0     75.0     91.7   33.3
      4   4   K-Means        83.3    75.0   33.3      100.0     83.4     83.4   33.3
          average improvement        -0.9  -49.1       15.8      8.4     13.0  -49.1
                                                                              given possibility to modify these techniques to also cluster by au-
5. RELATED WORK                                                               thors instead of topics, this is rarely done. In the following some of
  Most of the traditional document clustering approaches are based            the existing methods are shortly summarized.
on occurrences of words, i.e., inverted indices are built and used to            Probably one of the first approaches that uses stylometry to au-
group documents. Thereby a unit to be clustered conforms exactly              tomatically detect boundaries of authors of collaboratively written



                                                                         20
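Purely as an illustration of these two standard measures (a generic sketch of the textbook definitions, not code from [21] or from the approach evaluated in this paper; the smoothed idf below is only one of several common variants):

```python
import math
from collections import Counter

def tf(term: str, doc: list) -> float:
    """Relative frequency of a term within a single tokenized document."""
    if not doc:
        return 0.0
    return Counter(doc)[term] / len(doc)

def idf(term: str, corpus: list) -> float:
    """Smoothed inverse document frequency over a list of tokenized documents."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

def tf_idf(term: str, doc: list, corpus: list) -> float:
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "grammar trees describe sentence structure".split(),
]
print(tf_idf("cat", corpus[0], corpus))      # occurs in several documents -> lower weight
print(tf_idf("grammar", corpus[2], corpus))  # occurs in only one document -> higher weight
```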
The literature on cluster analysis within a single document to discriminate the authorships in a multi-author document, as it is done in this paper, is surprisingly sparse. On the other hand, many approaches exist to separate a document into paragraphs of different topics, which are generally called text segmentation problems. In this domain, the algorithms often perform vocabulary analysis in various forms like word stem repetitions [27] or word frequency models [29], whereby "methods for finding the topic boundaries include sliding window, lexical chains, dynamic programming, agglomerative clustering and divisive clustering" [7]. Despite the possibility of modifying these techniques to also cluster by authors instead of topics, this is rarely done. In the following, some of the existing methods are briefly summarized.

Probably one of the first approaches that uses stylometry to automatically detect boundaries between the authors of a collaboratively written text is proposed in [13]. Thereby the main intention was not to expose authors or to gain insight into the work distribution, but to provide a methodology for collaborative authors to equalize their style in order to achieve better readability. To extract the style of separated paragraphs, common stylometric features such as word/sentence lengths, POS tag distributions or frequencies of POS classes at sentence-initial and sentence-final positions are considered. An extensive experiment revealed that stylometric features can be used to find authorship boundaries, but that additional research has to be done in order to increase the accuracy and informativeness.

In [14] the authors also tried to divide a collaborative text into different single-author paragraphs. In contrast to the previously described handmade corpus, a large data set has been computationally created by using (well-written) articles of an internet forum. At first, different neural networks have been utilized using several stylometric features. By using 90% of the data for training, the best network could achieve an F-score of 53% for multi-author documents on the remaining 10% of test data. In a second experiment, only letter-bigram frequencies are used as distinguishing features. Thereby an authorship boundary between paragraphs was marked if the cosine distance exceeded a certain threshold. This method reached an F-score of only 42%, and it is suspected that letter bigrams are not suitable for the (short) paragraphs used in the evaluation.
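The boundary criterion used in the second experiment of [14] can be made concrete with a short sketch (our own simplified illustration, not the original implementation; the letter-bigram features follow the description above, while the 0.3 threshold is an arbitrary placeholder):

```python
from collections import Counter
from math import sqrt

def letter_bigrams(text: str) -> Counter:
    """Frequency vector of character bigrams, ignoring case and whitespace."""
    chars = [c.lower() for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def author_boundaries(paragraphs: list, threshold: float = 0.3) -> list:
    """Indices i for which a boundary is assumed between paragraph i and i+1."""
    return [i for i in range(len(paragraphs) - 1)
            if cosine_distance(letter_bigrams(paragraphs[i]),
                               letter_bigrams(paragraphs[i + 1])) > threshold]
```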
A two-stage process to cluster Hebrew Bible texts by authorship is proposed in [20]. Because a first attempt to represent chapters only by bag-of-words led to negative results, the authors additionally incorporated sets of synonyms (which could be generated by comparing the original Hebrew texts with an English translation). With a modified cosine measure comparing these sets for given chapters, two core clusters are compiled by using the ncut algorithm [10]. In the second step, the resulting clusters are used as training data for a support vector machine, which finally assigns every chapter to one of the two core clusters by using the simple bag-of-words features tested earlier. Thereby it can be the case that units originally assigned to one cluster are moved to the other one, depending on the prediction of the support vector machine. With this two-stage approach the authors report a good accuracy of about 80%, whereby it should be considered that the number of potential authors has been fixed to two in the experiment. Nevertheless, the authors state that their approach could be extended to more authors with little effort.

6. CONCLUSION AND FUTURE WORK
In this paper, the automatic creation of paragraph clusters based on the grammar of authors has been evaluated. Different state-of-the-art clustering algorithms have been utilized with different input features and tested on different data sets. The best working algorithm, K-Means, could achieve an accuracy of about 63% over all test sets, whereby good individual results of up to 89% could be reached for some configurations. In contrast, the specifically created documents incorporating two and three authors could only be clustered with a maximum accuracy of 59%.

A comparison between clustering and classification algorithms using the same input features has been implemented. Disregarding the missing training data, it could be observed that classifiers generally produce higher accuracies, with improvements of up to 29%. On the other hand, some classifiers perform worse on average than clustering algorithms over individual data sets when using some pq-gram configurations. Nevertheless, if the maximum accuracy for each algorithm is considered, all classifiers perform significantly better, as can be seen in Figure 3, which illustrates the best performances of all utilized classification and clustering algorithms. The linear classification algorithm LibLinear could reach nearly 88%, outperforming K-Means by 25% over all data sets.

[Figure 3: Best Evaluation Results Over All Data Sets For All Utilized Clustering and Classification Algorithms. (bar chart over K-Means, X-Means, Farthest First, Cascaded K-Means, Hierarchical Clusterer, Naive Bayes, BayesNet, LibLinear, LibSVM, kNN and J48; accuracy in percent; omitted)]
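To illustrate how such a comparison can be set up on the same feature matrix (clustering accuracy obtained via majority-vote label mapping versus a cross-validated linear classifier), the following sketch uses scikit-learn and random placeholder data; it is not the implementation used for the experiments reported here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def clustering_accuracy(labels_true, labels_pred):
    """Map each cluster to its majority class and report the resulting accuracy."""
    correct = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        correct += np.bincount(members).max()
    return correct / len(labels_true)

# Placeholder data: rows = paragraphs, columns = (e.g. pq-gram) feature frequencies.
rng = np.random.default_rng(0)
X = rng.random((300, 50))
y = rng.integers(0, 3, size=300)  # "true" author label per paragraph

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("clustering accuracy:", clustering_accuracy(y, kmeans.labels_))
print("classifier accuracy:", cross_val_score(LinearSVC(dual=False), X, y, cv=10).mean())
```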
Finally, the best classification and clustering results for each data set are shown in Figure 4. Here, too, the classifiers achieve higher accuracies, whereby the PAN12 subsets could be classified 100% correctly. As can be seen, a major improvement can be gained for the novel literature documents. For example, the best classifier reached 87% on the Twain-Wells document, whereas the best clustering approach achieved only 59%.

[Figure 4: Best Clustering and Classification Results For Each Data Set. (bar chart comparing the best clusterer and the best classifier on Twain-Wells, Twain-Wells-Shelley, FED and PAN12-A/B; omitted)]

As shown in this paper, paragraphs of documents can be split and clustered based on grammar features, but the accuracy is below that of classification algorithms. Although the two algorithm types should not be compared directly, as they are designed for different problems, the significant differences in accuracy indicate that classifiers can handle the grammar features better. Nevertheless, future work should focus on evaluating the same features on larger data sets, as clustering algorithms may produce better results with an increasing amount of sample data.

Another possible application could be the creation of whole-document clusters, where documents with similar grammar are grouped together. Although such huge clusters are very difficult to evaluate, due to the lack of ground truth data, navigating thousands of documents based on grammar may be interesting, as has been done for music genres (e.g. [30]) or images (e.g. [11]). Moreover, grammar clusters may also be utilized for modern recommendation algorithms once they have been calculated for large data sets. For example, by analyzing all freely available books from libraries like Project Gutenberg, a system could recommend other books with a similar style based on the user's reading history. Also, an enhancement of the current commercial recommender systems that are used in large online stores like Amazon is conceivable.
7. REFERENCES
[1] D. Aha and D. Kibler. Instance-Based Learning Algorithms. Machine Learning, 6:37–66, 1991.
[2] C. Apte, S. M. Weiss, and B. F. White. Lightweight Document Clustering, Nov. 25 2003. US Patent 6,654,739.
[3] D. Arthur and S. Vassilvitskii. K-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[4] N. Augsten, M. Böhlen, and J. Gamper. The pq-Gram Distance between Ordered Labeled Trees. ACM Transactions on Database Systems (TODS), 2010.
[5] T. Caliński and J. Harabasz. A Dendrite Method for Cluster Analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[7] F. Y. Choi. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33. Association for Computational Linguistics, 2000.
[8] G. F. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, 9(4):309–347, 1992.
[9] S. Dasgupta. Performance Guarantees for Hierarchical Clustering. In Computational Learning Theory, pages 351–363. Springer, 2002.
[10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral Clustering and Normalized Cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551–556. ACM, 2004.
[11] A. Faktor and M. Irani. "Clustering by Composition" - Unsupervised Discovery of Image Categories. In Computer Vision - ECCV 2012, pages 474–487. Springer, 2012.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[13] A. Glover and G. Hirst. Detecting Stylistic Inconsistencies in Collaborative Writing. In The New Writing Environment, pages 147–168. Springer, 1996.
[14] N. Graham, G. Hirst, and B. Marthi. Segmenting Documents by Stylistic Character. Natural Language Engineering, 11(04):397–415, 2005.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[16] A. Hotho, S. Staab, and G. Stumme. Ontologies Improve Text Document Clustering. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 541–544. IEEE, 2003.
[17] G. H. John and P. Langley. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995.
[18] P. Juola. An Overview of the Traditional Authorship Attribution Subtask. In CLEF (Online Working Notes/Labs/Workshop), 2012.
[19] D. Klein and C. D. Manning. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 423–430, Stroudsburg, PA, USA, 2003.
[20] M. Koppel, N. Akiva, I. Dershowitz, and N. Dershowitz. Unsupervised Decomposition of a Document into Authorial Components. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1356–1364, Stroudsburg, PA, USA, 2011.
[21] B. Larsen and C. Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 16–22. ACM, 1999.
[22] Y. Li, S. M. Chung, and J. D. Holt. Text Document Clustering Based on Frequent Word Meaning Sequences. Data & Knowledge Engineering, 64(1):381–404, 2008.
[23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330, June 1993.
[24] F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, 1964.
[25] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26(4):354–359, 1983.
[26] D. Pelleg, A. W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML, pages 727–734, 2000.
[27] J. M. Ponte and W. B. Croft. Text Segmentation by Topic. In Research and Advanced Technology for Digital Libraries, pages 113–125. Springer, 1997.
[28] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[29] J. C. Reynar. Statistical Models for Topic Segmentation. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics, pages 357–364, 1999.
[30] N. Scaringella, G. Zoia, and D. Mlynek. Automatic Genre Classification of Music Content: A Survey. IEEE Signal Processing Magazine, 23(2):133–141, 2006.
[31] M. Tschuggnall and G. Specht. Using Grammar-Profiles to Intrinsically Expose Plagiarism in Text Documents. In Proc. of the 18th Conf. of Natural Language Processing and Information Systems (NLDB), pages 297–302, 2013.
[32] M. Tschuggnall and G. Specht. Enhancing Authorship Attribution by Utilizing Syntax Tree Profiles. In Proc. of the 14th Conf. of the European Chapter of the Assoc. for Computational Linguistics (EACL), pages 195–199, 2014.
[33] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of the 21st Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 46–54. ACM, 1998.
[34] D. Zou, W.-J. Long, and Z. Ling. A Cluster-Based Plagiarism Detection Method. In Notebook Papers of CLEF 2010 LABs and Workshops, 22-23 September, 2010.
Proactive Model-Based Performance Analysis and Prediction of Database Applications

Christoph Koch
Friedrich-Schiller-Universität Jena, Lehrstuhl für Datenbanken und Informationssysteme, Ernst-Abbe-Platz 2, 07743 Jena
Christoph.Koch@uni-jena.de
DATEV eG, Abteilung Datenbanken, Paumgartnerstr. 6 - 14, 90429 Nürnberg
Christoph.Koch@datev.de

ABSTRACT
Modern (database) applications are confronted with ever higher demands regarding flexibility, functionality and availability. Not least for their backend – usually a relational database management system – this results in continuously growing complexity and workload, which has to be recognized, assessed and handled efficiently as early and as proactively as possible. The application and database specialists needed for this, however, are heavily loaded by ever tighter project plans, shorter release cycles and steadily growing system landscapes, so that hardly any capacity remains for regular proactive expert analyses of database performance. To resolve this dilemma, this paper presents an approach with which performance analyses and predictions for applications with a relational database backend can be carried out early, on the basis of the data modeling and synthetic database statistics, and with which the results can be visualized in an easily accessible way.

Categories and Subject Descriptors
Data Models and Database Design, Database Performance

General Terms
Performance, Design

Keywords
Performance, proactivity, statistics, relational databases, modeling, UML, application development

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

1. INTRODUCTION
To meet increasingly complex requirements and to offer maximum user comfort, good performance is a basic prerequisite for modern database applications. Besides the application design and infrastructure components such as the network or the application and web servers, it is largely determined by the performance of the database backend – we restrict ourselves here exclusively to relational database management systems (DBMS) [1]. The database performance of an application is in turn influenced by numerous factors. While hardware and system-side properties are often predetermined by existing infrastructures, the database design and the accesses implemented on the application side via SQL in particular can be shaped largely freely. A further influencing factor is the nature of the data to be stored, whose volume and distribution likewise have a strong impact on performance.

The database design evolves over model structures of different abstraction levels that build on one another, from the conceptual to the physical data model. Already during the development of these models, "design flaws" such as missing or "excessive" normalization can have serious effects on the later response times of the database system. The degree of normalization itself, however, is only a vague indicator of the performance of a database system and can even have a negative effect beyond a certain point. To the best of our knowledge, a simple metric for judging the quality of a database design with respect to the expected performance (as a function of other influencing factors such as the workload) does not exist.

The influence of the workload – represented as the set of SQL statements, together with their execution frequencies, that the application issues against the database system to access the data stored there – behaves somewhat differently. Modern DBMS have a cost-based optimizer for optimizing incoming statements. It computes possible execution plans and, with the help of collected object statistics, selects the cheapest execution plan for processing a SQL statement.
By means of DBMS-internal mechanisms – referred to as EXPLAIN mechanisms in the following – the execution plan determined by the optimizer can be computed and output even before a statement is actually executed. In addition, the EXPLAIN result includes an estimate of the access costs expected for processing the execution plan in terms of CPU and I/O time – referred to simply as costs from here on. Based on this information, (frequently executed) expensive accesses can be identified early with regard to database performance and optimized where necessary. A prerequisite for this procedure, however, is that representative database statistics are available to the DBMS for computing the execution plans, which in particular is not the case for new database applications.
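For illustration, a minimal sketch of such a pre-execution EXPLAIN call is shown below; it assumes a reachable PostgreSQL instance and the psycopg2 driver, whereas the systems discussed later (DB2, Oracle) provide analogous EXPLAIN facilities with different output formats:

```python
import re
import psycopg2  # assumption: PostgreSQL with the psycopg2 driver installed

def explain_cost(dsn: str, statement: str) -> float:
    """Return the optimizer's estimated total cost of the top plan node (no execution)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("EXPLAIN " + statement)
        plan_lines = [row[0] for row in cur.fetchall()]
    # The first plan line looks like: "Seq Scan on orders  (cost=0.00..431.00 rows=100 width=8)"
    match = re.search(r"cost=[\d.]+\.\.([\d.]+)", plan_lines[0])
    return float(match.group(1)) if match else float("nan")

# Hypothetical usage:
# print(explain_cost("dbname=test", "SELECT * FROM orders WHERE customer_id = 42"))
```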
On the other hand, application development and design teams as well as database specialists are confronted with increasingly complex requirements and tasks. Capacity for extensive performance analyses, or even just for acquiring the knowledge they require, is often not available. Not least for this reason, proactive performance analyses increasingly fall out of focus compared with, for example, functional tests.

The model-based concept presented in this paper addresses both of these problems and introduces mechanisms that make a representative proactive analysis of database performance possible in a simple way. After Section 2 positions the approach against alternative and related work, Section 3 focuses on the development process of a database application. Section 4 deals with the proposed proactive approach and presents its essential steps and components. Finally, Section 5 summarizes the paper.

2. RELATED WORK
The goal of the proactive approach to performance analysis and prediction of database applications presented in this paper is the early detection of potential performance problems on the basis of a methodology that is as efficient and as easy to understand as possible. This is also pursued by the approach of [2], whose basic principle – using information about data and data accesses that is known from the requirements analysis of an application for early optimization – is reflected in the present paper as well. However, [2] arrives at early cost estimates of database performance through its own logic modeled on a database optimizer and the underlying model of an open queueing network. The concept presented in Section 4 of this paper, in contrast, uses synthetically generated statistics and database-internal EXPLAIN mechanisms to obtain a cost-based performance estimate. It thus always takes into account both current and future specifics of the individual database optimizers and, unlike [2], remains independent of their internal computation logic. A further difference between the two approaches lies in the presentation of the analysis results. While [2] is limited to tabular representations, the concept presented here uses a visualization based on the Unified Modeling Language (UML).

Similar to [2], further approaches to performance analysis and evaluation are based on the queueing network model. An overview can be found in [3]. According to it, all of these concepts show a rather academic focus and, as a consequence, a largely untested transferability to practice. Studies on the integration into practice-oriented (development) processes, on usability, and on the cost-benefit ratio of the necessary measures are missing. An additional accompanying deficit is the lack of tool support. The concept presented in Section 4 follows a different approach in this respect: it builds directly on established model-based workflows used in practice for the development of database applications (cf. Section 3). In particular, through the use of standardized UML extension mechanisms, it also integrates seamlessly on the tool side into existing UML-supporting infrastructures.

The methodology of synthetic statistics – that is, the artificial creation and manipulation of database statistics – is, besides the approach presented in Section 4, an essential part of [4]. There it is used, on the one hand, to transfer statistics from production environments into a test environment. On the other hand, the approach also provides for targeted manual modification of the statistics in order to detect resulting changes in the execution plans, and in the costs required to process them, by means of subsequent EXPLAIN analyses. With respect to statistics on data volume, for example, this can be used to simulate accesses to a table that is (still) small and holds only a few records as if it already contained an enormous amount of data. Unlike the approach presented here, however, [4] does not provide any further embedding into the development process of database applications.

A further starting point for performance analysis and optimization exists in the concept of autonomous database tuning [5], [6], [7] – that is, the continuous optimization of the physical design of already existing databases by the DBMS itself. An autonomous system recognizes potential problems on the basis of learned knowledge and initiates suitable optimization measures before negative effects arise. This includes, for example, the autonomous execution of a data reorganization to counteract continuously increasing access times. The tuning advisors meanwhile available in many variants per system, such as [8] and [9], can be viewed similarly; they do not intervene in the system automatically, but give the administrator recommendations for sensible actions. Neither autonomous tuning nor the tuning advisors are to be classified as alternatives to the approach presented in this paper. Rather, these concepts can complement each other: application development can take place on the basis of the concept presented in Section 4, while various tuning advisors and the mechanisms of autonomous tuning are used for the later administration and evolution of the application.

3. DEVELOPMENT PROCESS OF DATABASE APPLICATIONS
The development process of applications can be described by the System Development Lifecycle (SDLC) and divided into different phases, ranging from requirements analysis to the operation and maintenance of the finished software [1].
[Figure 1: Phases and actors in the Database and Software Development Lifecycle (DBLC and SDLC). The diagram juxtaposes the parallel database-side phases (database design, implementation and loading, test and evaluation/tuning, operation, database maintenance) and application-side phases (analysis, detail design, implementation/prototyping, test and evaluation/debugging, operation, maintenance of the application) together with typical actors such as project manager, business analyst, database designer/architect, software designer/architect, programmer, tester, database administrator and system administrator; diagram omitted.]

In addition to the application development itself, further processes for planning and providing a suitable infrastructure are necessary. For database applications this includes, among other things, the development process of the database, which according to [1] can likewise be formalized by a model similar to the SDLC – the Database Lifecycle (DBLC). Both development processes run in parallel and, especially in larger companies and projects, are carried out by different actors. Based on [1], Figure 1 gives an overview. It visualizes development phases running in parallel and a selection of responsible actors, whose concrete composition and distribution of tasks, however, depend strongly on the project size and the project team. Two observations are particularly important here. First, similar development processes take place in parallel for the application and the database – for instance application design and database design. Second, a large number of actors can be involved in the overall development process, so that designers, programmers, testers and administrators usually form disjoint groups of people.

[Figure 2: Performance-relevant development steps. The diagram maps the phases analysis, design, implementation, test and operation/maintenance to the conceptual and physical data model, the SQL statements, and the development, quality assurance and production systems; diagram omitted.]

From the perspective of database performance and the influencing factors already mentioned, the development process of database applications reduces to the tasks visualized in Figure 2. Based on the analyzed requirements, a conceptual data model is developed during database design, which is subsequently refined into the physical data model. Since this paper restricts itself to the relational DBMS prevailing in practice, the intermediate product of the logical data model (relational mapping) that is common in theory is omitted.

Once the design phase is completed, the implementation begins. On the database side, the physical data model is transformed by means of the Data Definition Language (DDL) into a database schema within an installed and suitably configured DBMS, and any available test data is loaded. On the application side, the SQL statements for accessing the database are developed in parallel, and the application itself is implemented. After individual modules are completed, continuous tests take place using the development and quality assurance systems; initially, however, these are limited to checking functional correctness. Performance investigations, in particular with respect to the database accesses, usually take place in a targeted manner only at the end of the implementation phase, by means of load tests that are laborious to prepare and have to be run on the quality assurance system.

The consequences of this procedure for the detection and treatment of performance problems can be severe. Bottlenecks are only noticed late (in operation) and, due to the advanced state of the development process, can only be corrected with great effort. If they are even rooted in unfavorable design decisions, for example concerning the data modeling, a subsequent correction is almost impossible because of numerous dependencies (application logic, SQL statements, test data sets, etc.), separated responsibilities and usually tight project schedules. Experience from the author's work environment has repeatedly confirmed this.
[Figure 3: Approach to proactive model-based performance analysis and prediction. The diagram shows performance indicators attached to the conceptual data model, its refinement into the physical data model (1.), the mapping of the design and the generation of synthetic statistics on a test system (2.), the EXPLAIN-based evaluation of the SQL statements coming from the development system (3.), and the resulting performance model with estimated costs per execution plan (EP1, EP2) (4.); diagram omitted.]

4. PROACTIVE MODEL-BASED PERFORMANCE ANALYSIS
As an alternative to performance analysis by means of load tests (cf. Section 3), the EXPLAIN mechanisms mentioned at the beginning lend themselves to checking SQL performance. With their help, once the physical database design (including indexes, etc.) is available, evaluations of execution plans and estimated costs for the developed SQL statements can be carried out already in early stages of the implementation phase. Insights gained in this way can be used directly by the designer/programmer to optimize the SQL statements that have just been designed or implemented. Due to the temporal proximity to application and database design, performance optimizations based on data model adjustments (normalization/denormalization) are also possible without major effort.

The described procedure has the advantage that potential performance problems can be recognized by the very actors (designers/programmers) who know best how to solve them through design changes. On the other hand, EXPLAIN analyses and the understanding of execution plans require a degree of expertise that designers/programmers usually do not have. A database administrator (DBA), who does have this expertise, is in turn too distant from the functional requirements, so that while he can recognize potential performance outliers, he cannot assess them from a functional point of view. If, for example, an application runs a very complex evaluation once a month using a correspondingly long-running SQL statement, this query would appear critical to the DBA in an EXPLAIN analysis, because he knows neither that it implements a functionally complex process nor that it is executed only once per month. For a DBA to inform himself in detail about the functional requirements and special processes of each application in an infrastructure of often more than 100 different applications, or for a designer/programmer to build up the know-how required to assess execution plans, personnel capacity is needed that is usually not available.

Another problem that arises in connection with early EXPLAIN analyses stems from the third performance factor mentioned above: the data. While such data is largely available when an existing application is evolved further, there are normally no representative data sets for newly developed applications. Consequently, suitable database statistics as a basis for the EXPLAIN evaluations are also missing. The result is execution plans and cost estimates that often have little in common with those of the later productive use of the statements and are therefore (almost) useless for a proactive performance analysis.

The proactive model-based approach to performance analysis and prediction presented in the following addresses both problems: the missing representative data basis for database statistics and the designers'/programmers' lack of expertise in evaluating execution plans. To provide suitable database statistics, the approach envisages generating them synthetically from performance indicators. The problem of lacking expertise is addressed by a simple model-based representation of the obtained EXPLAIN results. How this representation is designed and how it interacts with the performance indicators is explained in the remainder of this section with reference to Figure 3.

4.1 Performance Indicators in the Data Model
In this work, performance indicators denote selected metadata about entities and their attributes (or about tables and their columns) that provide insight into the expected real data sets and, in conjunction with the database design and the infrastructure, allow first conclusions about the future database performance. They include information about the expected data volumes, such as the expected number of rows per table, and figures describing the data distribution – for example value ranges, probabilities of individual values, or the cardinality per column. Much of this information is part of the outcome of the requirements analysis; it is therefore known early in the SDLC and has been captured by the business analyst. The documentation ranges from purely textual descriptions to deeply structured representations. A uniformly standardized form for capturing performance indicators in the DBLC, however, does not exist so far, which is why this metadata hardly, if at all, feeds into the further development process.
In practice, data modeling, like large parts of application modeling, is based on the UML. However, UML was originally not designed to represent data structures in the sense of entity-relationship modeling, so that the connection of both worlds – and with it the modeling of application and data structures with a common language in a common tool – was only created by approaches such as [10] or the draft of the OMG's IMM standard [11]. The prerequisite for this is in each case the UML profile specification, which makes it possible to extend existing UML objects through new stereotypes.

To make the performance indicators described above usable for the further development process and to capture them in a standardized way within existing infrastructures and tool landscapes, the UML profile mechanism can be used as well. For example, with a suitable profile, as indicated schematically in step 1 of Figure 3, a new object "entity_extended" could be derived from a UML object "entity"; an additional property "cardinality" of the new object can then hold information about the data volume expected in production for an entity/table.
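Purely as an illustration (a hypothetical structure, not the UML profile itself), the tagged values of such a stereotype could be mirrored in a small data structure that a later statistics-generation step consumes:

```python
from dataclasses import dataclass, field

@dataclass
class ColumnIndicator:
    """Performance indicators for one column, mirroring tagged values of a stereotype."""
    name: str
    cardinality: int                 # expected number of distinct values
    null_fraction: float = 0.0       # expected share of NULL values
    value_range: tuple = None        # optional (low, high) bounds

@dataclass
class EntityIndicator:
    """Performance indicators for one entity/table ("entity_extended" analogue)."""
    table: str
    expected_rows: int               # the "cardinality" tag of the stereotype
    columns: list = field(default_factory=list)

# Hypothetical example taken from a requirements analysis:
customer = EntityIndicator(
    table="CUSTOMER",
    expected_rows=5_000_000,
    columns=[
        ColumnIndicator("CUSTOMER_ID", cardinality=5_000_000),
        ColumnIndicator("COUNTRY", cardinality=40, value_range=("AD", "ZW")),
    ],
)
```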
4.2 Synthetic Database Statistics
One of the obstacles to proactive performance analyses and predictions identified at the beginning was the missing representative data basis for database statistics. These statistics are normally collected by the DBMS from the stored data itself. In contrast, the concept presented here follows the approach of prescribing statistics to the DBMS without having to keep representative data sets in the database for this purpose. Although very few DBMS offer predefined interfaces for this, all statistics information is usually stored in DBMS-internal tables that can be manipulated, as is the case, for example, with DB2 or Oracle [12].

Database statistics contain information about data volumes and data distributions as well as figures concerning the physical storage, such as the number of database pages used per table. While the former correspond in content to the performance indicators described above, the statistics on physical storage are internal, DBMS-dependent quantities. With the help of suitable estimation rules provided by the DBMS vendors to support database design, these figures can, however, also be approximated on the basis of the performance indicators. As shown in step 2 of Figure 3, it is thus possible to artificially create representative database statistics early in the SDLC/DBLC from suitably formalized performance indicators.
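As a sketch of what such catalog manipulation can look like in practice: the following assumes an Oracle connection obtained via cx_Oracle/python-oracledb and reuses the EntityIndicator structure sketched in Section 4.1; the DBMS_STATS procedures are Oracle's documented statistics-setting interface, while the block-count estimate and the surrounding glue are assumptions of this illustration (DB2 offers comparable updatable SYSSTAT views):

```python
def set_synthetic_statistics(conn, schema: str, entity) -> None:
    """Impose synthetic optimizer statistics derived from performance indicators.

    `conn`: open Oracle connection; `entity`: object with attributes
    table, expected_rows and columns (name, cardinality), as sketched above.
    """
    cur = conn.cursor()
    # Table-level statistics: expected row count plus a crude block-count estimate.
    cur.execute("""
        BEGIN
          DBMS_STATS.SET_TABLE_STATS(
            ownname => :owner, tabname => :tab,
            numrows => :nrows, numblks => :nblocks);
        END;""",
        {"owner": schema, "tab": entity.table,
         "nrows": entity.expected_rows,
         "nblocks": max(1, entity.expected_rows // 100)})
    # Column-level statistics: expected number of distinct values per column.
    for col in entity.columns:
        cur.execute("""
            BEGIN
              DBMS_STATS.SET_COLUMN_STATS(
                ownname => :owner, tabname => :tab,
                colname => :col, distcnt => :ndv);
            END;""",
            {"owner": schema, "tab": entity.table,
             "col": col.name, "ndv": col.cardinality})
    conn.commit()
```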
4.3 EXPLAIN and Performance Model
On the basis of synthetic database statistics, proactive performance predictions can be carried out, as shown in steps 3 and 4 of Figure 3, by means of the EXPLAIN functionality provided by the DBMS, the SQL workload, and the database schema derivable from the physical data model. The resulting, sometimes complex execution plans, however, can only be evaluated adequately with sufficient expertise and available personnel capacity, so that this problem initially persists. A main reason why execution plans are hard to understand is their hierarchical representation as an access tree, whereas the designer/programmer thinks in relational terms when modeling or developing SQL statements. The simplified presentation of execution plans referred to in this approach as the performance model attempts to resolve this discrepancy.

The performance model is based on the physical data model and thus on a form of representation familiar to the designer/programmer. In addition, it contains the information from the EXPLAIN results that is essential for this group of people. This includes the costs estimated by the DBMS for executing the entire statement as well as for important operators such as table or index accesses and table joins – in each case scaled by the expected execution frequency of the statement. Further details within the execution plans, such as the concrete processing order of individual operators or estimated predicate selectivities, are deliberately omitted by the model for the sake of simplicity and comprehensibility. When several statements are analyzed at the same time, the estimated costs are aggregated at the level of the individual objects.

The central component of the performance model is a diagram representation that is likewise based on the physical data model. Using color highlighting and suitable evaluation metrics, all objects are classified and visualized according to the access costs estimated by the DBMS for processing the workload. In this way, a designer/programmer learns early on which areas of the database schema need to be optimized from a performance perspective, or which critical SQL statements need to be redesigned. Figure 3 shows an exemplary visualized performance model for two statements/execution plans (EP). While the lower area is largely marked green/uncritical, the upper part of the diagram contains potentially performance-critical accesses marked red, which have to be examined specifically and optimized at a suitable place (SQL statement, database design) (cf. the dashed arrows in Figure 3).

The technical realization of the performance model and its diagram representation is done, analogously to the capture of the performance indicators, via the UML profile mechanism, which also in this respect guarantees the compatibility of the presented approach with existing tool infrastructures.
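The aggregation and classification just described can be illustrated by the following minimal sketch (our own illustration, not the UML-profile-based realization; the thresholds and cost figures are arbitrary placeholders): operator costs taken from EXPLAIN results are scaled by the expected execution frequency, summed per schema object, and mapped to the traffic-light rating used in the diagram.

```python
from collections import defaultdict

# One entry per plan operator taken from an EXPLAIN result:
# (statement id, schema object touched, estimated operator cost)
EXPLAINED_OPERATORS = [
    ("stmt1", "CUSTOMER", 420.0),
    ("stmt1", "IDX_CUSTOMER_COUNTRY", 35.0),
    ("stmt2", "ORDERS", 15.0),
]
EXECUTIONS_PER_DAY = {"stmt1": 10_000, "stmt2": 50}

def aggregate_costs(operators, frequencies):
    """Sum frequency-scaled operator costs per schema object."""
    totals = defaultdict(float)
    for stmt, obj, cost in operators:
        totals[obj] += cost * frequencies.get(stmt, 1)
    return dict(totals)

def classify(cost, red=1_000_000.0, yellow=10_000.0):
    """Map an aggregated cost to the color used in the performance-model diagram."""
    return "red" if cost >= red else "yellow" if cost >= yellow else "green"

for obj, cost in aggregate_costs(EXPLAINED_OPERATORS, EXECUTIONS_PER_DAY).items():
    print(f"{obj}: {cost:.0f} -> {classify(cost)}")
```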


4.4 Procedure of an Analysis/Prediction
For the designer/programmer, the proactive approach shown in Figure 3 envisages the following procedure. After a database design draft has been completed (1.), he triggers an automatism (2.) that maps the design to a database schema and creates synthetic database statistics from the performance indicators he has modeled. Using a further routine (3. and 4.), the designer/programmer then starts a simulation process that produces performance predictions for a given workload on the basis of the EXPLAIN mechanisms and prepares them as a performance model. From there, he uses the diagram representation to inform himself about possible critical accesses, which he then analyzes and optimizes in a targeted manner.
5. ZUSAMMENFASSUNG
Datenbank-Performance ist ein wichtiger, oftmals jedoch vernachlässigter Faktor in der Anwendungsentwicklung. Durch moderne Anforderungen und dazu implementierte Anwendungen sehen sich speziell deren Datenbank-Backends mit kontinuierlich wachsenden Herausforderungen insbesondere betreffend der Performance konfrontiert. Diese können nur bewältigt werden, wenn das Thema Datenbank-Performance intensiver betrachtet und durch proaktive Analysen (beispielsweise mittels EXPLAIN-Mechanismen) kontinuierlich verfolgt wird. Doch auch dann sind einzelne Hindernisse unvermeidlich: fehlende repräsentative Daten(-mengen) und Expertise/Kapazitäten zur Analyse.

Der vorliegende Beitrag präsentiert zur Lösung dieser Probleme einen modellbasierten Ansatz, der auf Basis synthetisch erzeugter Statistiken proaktive Performance-Analysen sowie -Vorhersagen erlaubt und die daraus gewonnenen Ergebnisse in einer einfach verständlichen Form visualisiert. Die technologische Grundlage dafür bietet die in der Praxis vorherrschende Modellierungssprache UML mit ihrer UML-Profil-Spezifikation. Sie erlaubt es, das hier vorgestellte Konzept und die dazu benötigten Komponenten mit vorhandenen technischen Mitteln abzubilden und nahtlos in bestehende UML-Infrastrukturen zu integrieren.

6. AUSBLICK
Bei dem im Beitrag vorgestellten Konzept handelt es sich um einen auf Basis wiederkehrender praktischer Problemstellungen und der daraus gewonnenen Erfahrungen konstruierten Ansatz. Während die technische Umsetzbarkeit einzelner Teilaspekte wie etwa der Erfassung von Performance-Indikatoren oder der Konstruktion des Performance-Modells auf Basis von UML-Profilen bereits geprüft wurde, steht eine prototypische Implementierung des gesamten Prozesses zur Performance-Analyse noch aus.

Zuvor sind weitere Detailbetrachtungen nötig. So ist beispielsweise zu klären, in welchem Umfang Performance-Indikatoren im Datenmodell vom Analysten/Designer sinnvoll erfasst werden sollten. Dabei ist ein Kompromiss zwischen maximalem Detailgrad und minimal nötigem Informationsgehalt anzustreben, sodass der Aufwand zur Angabe von Performance-Indikatoren möglichst gering ist, mit deren Hilfe aber dennoch eine repräsentative Performance-Vorhersage ermöglicht wird.

Weiterhin gilt es, eine geeignete Metrik zur Bewertung/Kategorisierung der Analyseergebnisse zu entwickeln. Hier steht die Frage im Vordergrund, wann ein Zugriff anhand seiner Kosten als schlecht und wann er als gut zu bewerten ist. Ein teurer Zugriff ist nicht zwangsweise ein schlechter, wenn er beispielsweise zur Realisierung einer komplexen Funktionalität verwendet wird.

Zuletzt sei noch die Erfassung beziehungsweise Beschaffung der für die EXPLAIN-Analysen notwendigen Workload erwähnt. Diese muss dem vorgestellten proaktiven Analyseprozess zugänglich gemacht werden, um anhand des beschriebenen Konzepts frühzeitige Performance-Untersuchungen durchführen zu können. Im einfachsten Fall könnte angenommen werden, dass sämtliche SQL-Statements (inklusive ihrer Ausführungshäufigkeit) vom Designer/Programmierer ebenfalls im Datenmodell, beispielsweise als zusätzliche Merkmale von Methoden in der UML-Klassenmodellierung, zu erfassen und kontinuierlich zu pflegen wären. Dies wäre jedoch ein sehr aufwändiges Verfahren, das der gewünschten hohen Praxistauglichkeit des proaktiven Ansatzes entgegensteht. Somit sind alternative Varianten zur Beschaffung der Workload für den Analyseprozess zu untersuchen und abzuwägen.
                   Big Data und der Fluch der Dimensionalität
    Die effiziente Suche nach Quasi-Identifikatoren in hochdimensionalen Daten
                            Hannes Grunert                                                    Andreas Heuer
                     Lehrstuhl für Datenbank- und                                       Lehrstuhl für Datenbank- und
                         Informationssysteme                                                Informationssysteme
                          Universität Rostock                                                Universität Rostock
                       Albert-Einstein-Straße 22                                          Albert-Einstein-Straße 22
                 hg(at)informatik.uni-rostock.de                                    ah(at)informatik.uni-rostock.de

Kurzfassung
In smarten Umgebungen werden häufig große Datenmengen durch eine Vielzahl von Sensoren erzeugt. In vielen Fällen werden dabei mehr Informationen generiert und verarbeitet, als in Wirklichkeit vom Assistenzsystem benötigt wird. Dadurch lässt sich mehr über den Nutzer erfahren und sein Recht auf informationelle Selbstbestimmung ist verletzt.
Bestehende Methoden zur Sicherstellung der Privatheitsansprüche von Nutzern basieren auf dem Konzept sogenannter Quasi-Identifikatoren. Wie solche Quasi-Identifikatoren erkannt werden können, wurde in der bisherigen Forschung weitestgehend vernachlässigt.
In diesem Artikel stellen wir einen Algorithmus vor, der identifizierende Attributmengen schnell und vollständig erkennt. Die Evaluierung des Algorithmus erfolgt am Beispiel einer Datenbank mit personenbezogenen Informationen.

ACM Klassifikation
K.4.1 [Computers and Society]: Public Policy Issues—Privacy; H.2.4 [Database Management]: Systems—Query Processing

Stichworte
Datenbanken, Datenschutz, Big Data

1. EINLEITUNG
Assistenzsysteme sollen den Nutzer bei der Arbeit (Ambient Assisted Working) und in der Wohnung (Ambient Assisted Living) unterstützen. Durch verschiedene Sensoren werden Informationen über die momentane Situation und die Handlungen des Anwenders gesammelt. Diese Daten werden durch das System gespeichert und mit weiteren Daten, beispielsweise mit dem Facebook-Profil des Nutzers, verknüpft. Durch die so gewonnenen Informationen lassen sich Vorlieben, Verhaltensmuster und zukünftige Ereignisse berechnen. Daraus werden die Intentionen und zukünftigen Handlungen des Benutzers abgeleitet, sodass die smarte Umgebung eigenständig auf die Bedürfnisse des Nutzers reagieren kann.
In Assistenzsystemen [17] werden häufig wesentlich mehr Informationen gesammelt als benötigt. Außerdem hat der Nutzer meist keinen oder nur einen sehr geringen Einfluss auf die Speicherung und Verarbeitung seiner personenbezogenen Daten. Dadurch ist sein Recht auf informationelle Selbstbestimmung verletzt. Durch eine Erweiterung des Assistenzsystems um eine Datenschutzkomponente, welche die Privatheitsansprüche des Nutzers gegen den Informationsbedarf des Systems überprüft, kann diese Problematik behoben werden.
Zwei Hauptaspekte des Datenschutzes sind Datenvermeidung und Datensparsamkeit. In §3a des Bundesdatenschutzgesetzes [1] wird gefordert, dass
„[d]ie Erhebung, Verarbeitung und Nutzung personenbezogener Daten und die Auswahl und Gestaltung von Datenverarbeitungssystemen [...] an dem Ziel auszurichten [sind], so wenig personenbezogene Daten wie möglich zu erheben, zu verarbeiten oder zu nutzen.“
Mittels einer datensparsamen Weitergabe der Sensor- und Kontext-Informationen an die Analysewerkzeuge des Assistenzsystems wird nicht nur die Datenschutzfreundlichkeit des Systems verbessert. Bei der Vorverdichtung der Daten durch Selektion, Aggregation und Komprimierung am Sensor selbst lässt sich die Effizienz des Systems steigern. Die Privatheitsansprüche und der Informationsbedarf der Analysewerkzeuge können als Integritätsbedingungen im Datenbanksystem umgesetzt werden. Durch die Integritätsbedingungen lassen sich die notwendigen Algorithmen zur Anonymisierung und Vorverarbeitung direkt auf dem Datenbestand ausführen. Eine Übertragung in externe Programme bzw. Module, die sich evtl. auf anderen Recheneinheiten befinden, entfällt somit.
Für die Umsetzung von Datenschutzbestimmungen in smarten Umgebungen wird derzeit das PArADISE¹-Framework entwickelt, welches insbesondere die Aspekte der Datensparsamkeit und Datenvermeidung in heterogenen Systemumgebungen realisieren soll.
In [3] stellen wir ein einfaches XML-Schema vor, mit dem sich Privatheitsansprüche durch den Nutzer von smarten Systemen formulieren lassen.

¹ Privacy-aware assistive distributed information system environment

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
Dabei wird eine Anwendung innerhalb eines abgeschlossenen Systems in ihre Funktionalitäten aufgeteilt. Für jede Funktionalität lässt sich festlegen, welche Informationen in welchem Detailgrad an das System weitergegeben werden dürfen. Dazu lassen sich einzelne Attribute zu Attributkombinationen zusammenfassen, die angefragt werden können.

Für einen unerfahrenen Nutzer ist das Festlegen von sinnvollen Einstellungen nur schwer möglich. Die Frage, die sich ihm stellt, ist nicht die, ob er seine persönlichen Daten schützen soll, sondern vielmehr, welche Daten es wert sind, geschützt zu werden. Zur Kennzeichnung schützenswerter Daten werden u.a. sogenannte Quasi-Identifikatoren [2] verwendet. In diesem Artikel stellen wir einen neuen Ansatz vor, mit dem Quasi-Identifikatoren schnell und vollständig erkannt werden können.

Der Rest des Artikels ist wie folgt strukturiert: Kapitel 2 gibt einen aktuellen Überblick über den Stand der Forschung im Bereich der Erkennung von Quasi-Identifikatoren. Im folgenden Kapitel gehen wir detailliert darauf ein, wie schützenswerte Daten definiert sind und wie diese effizient erkannt werden können. Kapitel 4 evaluiert den Ansatz anhand eines Datensatzes. Das letzte Kapitel fasst den Beitrag zusammen und gibt einen Ausblick auf zukünftige Arbeiten.

2. STAND DER TECHNIK
In diesem Kapitel stellen wir bestehende Konzepte zur Ermittlung von Quasi-Identifikatoren (QI) vor. Außerdem werden Techniken vorgestellt, die in unseren Algorithmus eingeflossen sind.

2.1 Quasi-Identifikatoren
Zum Schutz personenbezogener Daten existieren Konzepte wie k-anonymity [16], l-diversity [8] und t-closeness [7]. Diese Konzepte unterteilen die Attribute einer Relation in Schlüssel, Quasi-Identifikatoren, sensitive Daten und sonstige Daten. Ziel ist es, dass die sensitiven Daten sich nicht eindeutig einer bestimmten Person zuordnen lassen. Da durch Schlüsselattribute Tupel eindeutig bestimmt werden können, dürfen diese unter keinen Umständen zusammen mit den sensitiven Attributen veröffentlicht werden.

Während Schlüssel im Laufe des Datenbankentwurfes festgelegt werden, lassen sich Quasi-Identifikatoren erst beim Vorliegen der Daten feststellen, da sie von den konkreten Attributwerten der Relation abhängen. Der Begriff Quasi-Identifikator wurde von Dalenius [2] geprägt und bezeichnet „a subset of attributes that can uniquely identify most tuples in a table“.

Für „most tuples“ wird häufig ein Grenzwert p festgelegt, der bestimmt, ob eine Attributkombination ein Quasi-Identifikator ist oder nicht. Dieser Grenzwert lässt sich beispielsweise in relationalen Datenbanken durch zwei SQL-Anfragen wie folgt bestimmen:

    p = (SELECT COUNT(*) FROM (SELECT DISTINCT A1, ..., An FROM table)) / (SELECT COUNT(*) FROM table)    (1)

Wird für p der Wert 1 gewählt, so sind die gefundenen QI mit diesem Grenzwert auch Schlüssel der Relation. Um eine Vergleichbarkeit unseres Algorithmus mit dem von Motwani und Xu zu gewährleisten, verwenden wir ebenfalls die in (1) definierte „distinct ratio“ (nach [12]).
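Eine minimale Java-Skizze, wie sich die „distinct ratio“ aus (1) für eine konkrete Attributkombination über die beiden genannten SQL-Anfragen per JDBC bestimmen lässt; Verbindungsdaten sowie Tabellen- und Attributnamen sind frei gewählt:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Skizze: distinct ratio p nach (1) für eine Attributkombination berechnen.
public class DistinctRatio {

    // Führt eine COUNT-Anfrage aus und liefert den Zählwert zurück.
    static long count(Connection con, String sql) throws Exception {
        try (Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // p = Anzahl der verschiedenen Wertkombinationen / Anzahl aller Tupel.
    static double distinctRatio(Connection con, String table, String... attributes)
            throws Exception {
        String attrList = String.join(", ", attributes);
        long distinct = count(con,
                "SELECT COUNT(*) FROM (SELECT DISTINCT " + attrList
                        + " FROM " + table + ") d");
        long total = count(con, "SELECT COUNT(*) FROM " + table);
        return (double) distinct / total;
    }

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/testdb", "user", "password")) {
            double p = distinctRatio(con, "adult", "age", "education", "native_country");
            System.out.println("distinct ratio p = " + p);
            System.out.println("QI bei Grenzwert 0,9? " + (p >= 0.9));
        }
    }
}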
Da es für den Ausdruck „die meisten“ keinen standardisierten Quantor gibt, formulieren wir ihn mit dem Zeichen ∀^{≥p}, wobei p den Prozentsatz der eindeutig identifizierbaren Tupel (t_i) angibt. Ein Quasi-Identifikator QI := {A1, ..., An} ist für eine Relation R entsprechend definiert:

Quasi-Identifikator. ∀^{≥p} t1, t2 ∈ R [t1 ≠ t2 ⇒ ∃ A ∈ QI: t1(A) ≠ t2(A)]

Wie beim Datenbankentwurf reicht es auch für die Angabe von Quasi-Identifikatoren aus, wenn die minimale Menge von Attributen angegeben wird, welche die Eigenschaft eines QI hat. Eine solche Menge wird als minimaler Quasi-Identifikator bezeichnet.

Minimaler Quasi-Identifikator. X ist ein minimaler Quasi-Identifikator (mQI), wenn X ein Quasi-Identifikator ist und jede nicht-leere Teilmenge Y von X kein Quasi-Identifikator ist.
X ist mQI: X ist QI ∧ (∄ Y ⊂ X: (Y ≠ ∅) ∧ (Y ist QI))

Insbesondere ist X kein minimaler Quasi-Identifikator, wenn eine Teilmenge X \ {A} von X mit A ∈ X existiert, die ein Quasi-Identifikator ist. Das Finden von allen Quasi-Identifikatoren stellt ein NP-vollständiges Problem dar, weil die Menge der zu untersuchenden Teilmengen exponentiell mit der Anzahl der Attribute einer Relation steigt. Besteht eine Relation aus n Attributen, so existieren insgesamt 2^n Attributkombinationen, für die ermittelt werden muss, ob sie ein QI sind.

In [12] stellen Motwani und Xu einen Algorithmus zum effizienten Erkennen von minimalen Quasi-Identifikatoren vor. Dieser baut auf der von Mannila et al. [10] vorgeschlagenen, ebenenweisen Erzeugung von Attributmengen auf. Dabei wird die Minimalitätseigenschaft von Quasi-Identifikatoren sofort erkannt und der Suchraum beim Durchlauf auf der nächsten Ebene eingeschränkt.

Der Algorithmus ist effizienter, als alle 2^n Teilmengen zu testen, allerdings stellt die von Big-Data-Anwendungen erzeugte Datenmenge eine neue Herausforderung dar. Insbesondere die hohe Dimensionalität und die Vielfalt der Daten sind ernst zu nehmende Probleme. Aus diesem Grund schlagen wir im folgenden Kapitel einen neuen Algorithmus vor, der auf dem Algorithmus von Motwani und Xu aufsetzt.

2.2 Sideways Information Passing
Der von uns entwickelte Algorithmus verwendet Techniken, die bereits beim Sideways Information Passing (SIP, [4]) eingesetzt werden. Der grundlegende Ansatz von SIP besteht darin, dass während der Ausführung von Anfrageplänen Tupel nicht weiter betrachtet werden, sofern mit Sicherheit feststeht, dass sie keinen Bezug zu Tupeln aus anderen Relationen besitzen.

Durch das frühzeitige Erkennen solcher Tupel wird der zu betrachtende Suchraum eingeschränkt und die Ausführungszeit von Anfragen reduziert. Besonders effektiv ist dieses Vorgehen, wenn das Wissen über diese „magic sets“ [14] zwischen den Teilen eines Anfrageplans ausgetauscht und in höheren Ebenen des Anfrageplans mit eingebunden wird. Beim SIP werden zudem weitere Techniken wie Bloomjoins [9] und Semi-Joins eingesetzt, um den Anfrageplan weiter zu optimieren.

2.3 Effiziente Erfragung von identifizierenden Attributmengen
In [5] wird ein Algorithmus zur Ermittlung von identifizierenden Attributmengen (IA) in einer relationalen Datenbank beschrieben. Wird für eine Attributmenge erkannt,
dass diese eine IA für eine Relation R ist, so sind auch alle Obermengen dieser Attributmenge IA für R. Ist für eine Relation bestehend aus den Attributen A, B und C bekannt, dass B eine identifizierende Attributmenge ist, dann sind auch AB, BC und ABC eine IA der Relation.

Ist eine Attributmenge hingegen keine IA für R, so sind auch alle Teilmengen dieser Attributmenge keine IA. Wenn beispielsweise AC keine IA für R ist, dann sind auch weder A noch C identifizierende Attributmengen für R. Attributmengen, die keine identifizierende Attributmenge sind, werden als negierte Schlüssel bezeichnet.

Der in [5] vorgestellte Algorithmus nutzt diese Eigenschaften, um anhand eines Dialoges mit dem Nutzer die Schlüsseleigenschaften einer bereits existierenden Relation festzulegen. Dabei wird dem Nutzer ein Ausschnitt der Relationstabelle präsentiert, anhand derer entschieden werden soll, ob eine Attributkombination Schlüssel ist oder nicht. Wird in einer Teilrelation festgestellt, dass die Attributmenge Tupel mit gleichen Attributwerten besitzt, so kann die Attributkombination weder für die Teilmenge noch für die gesamte Relation ein Schlüssel sein.

3. ALGORITHMUS
In diesem Kapitel stellen wir einen neuen Algorithmus zum Finden von minimalen Quasi-Identifikatoren vor. Der Algorithmus beschränkt sich dabei auf die Einschränkung der zu untersuchenden Attributkombinationen. Der entwickelte Ansatz führt dabei den von [12] vorgestellten Bottom-Up-Ansatz mit einem gegenläufigen Top-Down-Verfahren zusammen.

3.1 Bottom-Up
Der von Motwani und Xu in [12] vorgestellte Ansatz zum Erkennen aller Quasi-Identifikatoren innerhalb einer Relation nutzt einen in [10] präsentierten Algorithmus. Dabei werden für eine Relation mit n Attributen ebenenweise von den einelementigen zu den n-elementigen Attributkombinationen Tests durchgeführt. Wird für eine i-elementige (1 ≤ i < n) Attributkombination [...] gesetzte QIs besitzt, da so der Suchraum gleich zu Beginn stark eingeschränkt wird.

Algorithm 1: bottomUp
  Data: database table tbl, list of attributes elements
  Result: a set with all minimal QI qiLowerSet
  initialization();
  for element in elements do
      set := set ∪ {element}
  end
  while set is not empty do
      for Set testSet: set do
          double p := getPercentage(testSet, tbl);
          if p ≥ threshold then
              qiLowerSet := qiLowerSet ∪ {testSet};
          end
      end
      set := buildNewLowerSet(set, elements);
  end
  return qiLowerSet;

Algorithm 2: buildNewLowerSet
  Data: current lower set lSet, list of attributes elements
  Result: the new lower set lSetNew
  Set lSetNew := new Set();
  for Set set: lSet do
      for Attribut A: elements do
          if ∄ q ∈ qiLowerSet : q ⊆ set then
              lSetNew := lSetNew ∪ {set ∪ {A}};
          end
      end
  end
  return lSetNew;
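Zur Veranschaulichung zeigt die folgende, stark vereinfachte Java-Skizze eine ebenenweise Bottom-Up-Suche mit Pruning im Sinne der Algorithmen 1 und 2. Die Duplikatprüfung (getPercentage im Pseudocode) wird als Funktion übergeben, etwa die oben skizzierte distinct ratio; alle Bezeichner sind frei gewählt, die Skizze ist keine Implementierung der Autoren.

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleFunction;

// Vereinfachte Skizze der ebenenweisen Bottom-Up-Suche nach minimalen QI
// mit Pruning bereits gefundener QI (vgl. Algorithmus 1 und 2).
public class BottomUpSketch {

    static Set<Set<String>> bottomUp(List<String> attributes,
                                     ToDoubleFunction<Set<String>> getPercentage,
                                     double threshold) {
        Set<Set<String>> minimalQIs = new HashSet<>();
        Set<Set<String>> level = new HashSet<>();
        for (String a : attributes) {              // Ebene 1: einelementige Kombinationen
            level.add(Set.of(a));
        }
        while (!level.isEmpty()) {
            for (Set<String> candidate : level) {
                // Kombinationen, die bereits einen gefundenen QI enthalten,
                // können nicht minimal sein und werden übersprungen.
                if (minimalQIs.stream().anyMatch(candidate::containsAll)) {
                    continue;
                }
                if (getPercentage.applyAsDouble(candidate) >= threshold) {
                    minimalQIs.add(candidate);     // alle echten Teilmengen wurden zuvor geprüft
                }
            }
            level = nextLevel(level, attributes, minimalQIs);
        }
        return minimalQIs;
    }

    // Entspricht buildNewLowerSet: erweitert nur Kombinationen ohne enthaltenen QI.
    static Set<Set<String>> nextLevel(Set<Set<String>> level, List<String> attributes,
                                      Set<Set<String>> minimalQIs) {
        Set<Set<String>> next = new HashSet<>();
        for (Set<String> combo : level) {
            if (minimalQIs.stream().anyMatch(combo::containsAll)) {
                continue;                          // Pruning: Obermengen eines QI sind nie minimal
            }
            for (String a : attributes) {
                if (!combo.contains(a)) {
                    Set<String> extended = new HashSet<>(combo);
                    extended.add(a);
                    next.add(extended);
                }
            }
        }
        return next;
    }
}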
3.2 Top-Down

Algorithm 3: topDown
  ...
      for Set testSet: set do
          double p := getPercentage(testSet, tbl);
          if p < threshold then
              optOutSet := optOutSet ∪ {subset};
          else
              qiUpperSet := qiUpperSet ∪ {testSet};
              for Set o: qiSet do
                  if testSet ⊂ o then
                      qiUpperSet := qiUpperSet - {o};
                  end
              end
          end
      end
      set := buildNewUpper(set);
  end
  return qiUpperSet;

Der Top-Down-Ansatz hebt die Nachteile des Bottom-Up-Vorgehens auf: Der Algorithmus arbeitet effizient, wenn QIs aus vielen Attributen zusammengesetzt sind, und für den Fall, dass die gesamte Relation kein QI ist, wird dies bei der ersten Überprüfung erkannt und der Algorithmus terminiert dann umgehend.

Besteht die Relation hingegen aus vielen kleinen QIs, dann wird der Suchraum erst zum Ende des Algorithmus stark eingeschränkt. Ein weiterer Nachteil liegt in der erhöhten Rechenzeit, auf die in der Evaluation näher eingegangen wird.

3.3 Bottom-Up+Top-Down
Der in diesem Artikel vorgeschlagene Algorithmus kombiniert die oben vorgestellten Verfahren. Dabei werden die Verfahren im Wechsel angewandt und das Wissen über (negierte) Quasi-Identifikatoren wie beim Sideways Information Passing [4] untereinander ausgetauscht. Es wird pro Berechnungsschritt entweder die Top-Down- oder die Bottom-Up-Methode angewandt und das Ergebnis an die jeweils andere Methode übergeben. Der Algorithmus terminiert, sobald alle Attributebenen durch eine der beiden Methoden abgearbeitet wurden oder das Bottom-Up-Vorgehen keine Attributkombinationen mehr zu überprüfen hat. In Abbildung 1 ist die Arbeitsweise des Algorithmus anhand einer Beispielrelation mit sechs Attributen dargestellt. Die rot markierten Kombinationen stehen dabei für negierte QI, grün markierte für minimale QI und gelb markierte für potentiell minimale QI.
Um zu entscheiden, welcher Algorithmus im nächsten Zyklus angewandt wird, wird eine Wichtungsfunktion eingeführt. Die Überprüfung einer einzelnen Attributkombination auf Duplikate hat eine Laufzeit von O(n*log(n)), wobei n die Anzahl der Tupel in der Relation ist. Die Überprüfung der Tupel hängt aber auch von der Größe der Attributkombination ab. Besteht ein zu überprüfendes Tupel aus mehreren Attributen, so müssen im Datenbanksystem auch mehr Daten für die Duplikaterkennung in den Arbeitsspeicher geladen werden. Durch große Datenmengen werden Seiten schnell aus dem Arbeitsspeicher verdrängt, obwohl sie später wieder benötigt werden. Dadurch steigt die Rechenzeit weiter an.

Für eine vereinfachte Wichtungsfunktion nehmen wir an, dass alle Attribute den gleichen Speicherplatz belegen. Die Anzahl der Attribute in einer Attributkombination bezeichnen wir mit m. Für die Duplikaterkennung ergibt sich dann eine Laufzeit von O((n*m)*log(n*m)).

Da die Anzahl der Tupel für jede Duplikaterkennung konstant bleibt, kann n aus der Kostenabschätzung entfernt werden. Die Kosten für die Überprüfung einer einzelnen Attributkombination mit m Attributen betragen demnach O(m*log(m)). Die Gesamtkosten für das Überprüfen der möglichen Quasi-Identifikatoren werden mit W_AVG bezeichnet. W_AVG ergibt sich aus dem Produkt der Kosten für das Überprüfen einer einzelnen Attributkombination und der Anzahl der Attributkombinationen (AttrK_n) mit n Attributen.

    W_AVG := AttrK_n * log(m) * m                                    (2)

Soll die Wichtungsfunktion präziser sein, so lässt sich der Aufwand abschätzen, indem für jede Attributkombination X die Summe s über die Attributgrößen von X gebildet und anschließend gewichtet wird. Die Einzelgewichte werden anschließend zum Gesamtgewicht aufsummiert.

    W_AVG := Σ_{X ∈ AttrK_n} log(s) * s;   s = Σ_{A ∈ X} size(A)     (3)

Diese Wichtung eignet sich allerdings nur, wenn Zugang zu den Metadaten der Datenbankrelation besteht.

Algorithm 5: bottomUpTopDown
  Data: database table tbl, list of attributes attrList
  Result: a set with all minimal quasi-identifier qiSet
  attrList.removeConstantAttributes();
  Set upperSet := new Set({attrList});
  Set lowerSet := new Set(attrList);
  // Sets to check for each algorithm
  int bottom := 0;
  int top := attrList.size();
  while (bottom <= top) and (lowerSet is not empty) do
      calculateWeights();
      if isLowerSetNext then
          bottomUp();
          buildNewLowerSet();
          bottom++;
          // Remove new QI from upper set
          modifyUpperSet();
      else
          topDown();
          buildNewUpperSet();
          top--;
          // Remove new negated QI from lower set
          modifyLowerSet();
      end
  end
  qiSet := qiLowerSet ∪ qiUpperSet;
  return qiSet;
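Eine mögliche, stark vereinfachte Umsetzung der Wichtung (2) für die Entscheidung in calculateWeights() bzw. isLowerSetNext könnte wie folgt aussehen. Die konkrete Entscheidungslogik der Autoren geht aus dem Beitrag nicht hervor; die Skizze vergleicht lediglich die geschätzten Kosten der jeweils nächsten Ebene und wählt die günstigere Methode.

import java.util.Set;

// Skizze zur vereinfachten Wichtung (2): W_AVG = |AttrK| * m * log(m).
public class WeightingSketch {

    // Geschätzte Kosten für das Prüfen einer Menge von Attributkombinationen der Größe m.
    static double weight(Set<Set<String>> combinations, int m) {
        if (combinations.isEmpty() || m <= 1) {
            return 0.0;                    // log(1) = 0 bzw. nichts zu prüfen
        }
        return combinations.size() * m * Math.log(m);
    }

    // true, wenn im nächsten Zyklus die Bottom-Up-Ebene (lowerSet) geprüft werden soll.
    static boolean isLowerSetNext(Set<Set<String>> lowerSet, int bottomLevel,
                                  Set<Set<String>> upperSet, int topLevel) {
        return weight(lowerSet, bottomLevel) <= weight(upperSet, topLevel);
    }
}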

4. EVALUATION
Für die Evaluation des Algorithmus wurde die „Adult“-Relation aus dem UCI Machine Learning Repository [6] verwendet. Die Relation besteht aus anonymisierten, personenbezogenen Daten, bei denen Schlüssel sowie Vor- und Nachname von Personen entfernt wurden. Die übrigen 15 Attribute enthalten Angaben zu Alter, Ehestand, Staatsangehörigkeit und Schulabschluss. Die Relation besteht insgesamt aus 32561 Tupeln, die zunächst im CSV-Format vorlagen und in eine Datenbank geparst wurden.

Die Evaluation erfolgte in einer Client-Server-Umgebung. Als Server dient eine virtuelle Maschine, die mit einer 64-Bit-CPU (vier Kerne @ 2 GHz und jeweils 4 MB Cache) und 4 GB Arbeitsspeicher ausgestattet ist. Auf dieser wurde eine MySQL-Datenbank mit InnoDB als Speichersystem verwendet. Der Client wurde mit einem i7-3630QM als CPU betrieben. Dieser bestand ebenfalls aus vier Kernen, die jeweils über 2,3 GHz und 6 MB Cache verfügten. Als Arbeitsspeicher standen 8 GB zur Verfügung. Als Laufzeitumgebung wurde Java SE 8u5 eingesetzt.

Der Datensatz wurde mit jedem Algorithmus getestet. Um zu ermitteln, wie sich die Algorithmen bei verschiedenen Grenzwerten für Quasi-Identifikatoren verhalten, wurden die Tests mit 10 Grenzwerten zwischen 50% und 99% wiederholt.

Die Tests mit den Top-Down- und Bottom-Up-Algorithmen benötigten im Schnitt gleich viele Tablescans (siehe Abbildung 2). Die Top-Down-Methode lieferte bessere Ergebnisse bei hohen QI-Grenzwerten, Bottom-Up ist besser bei niedrigeren Grenzwerten. Bei der Laufzeit (siehe Abbildung 3) liegt die Bottom-Up-Methode deutlich vor dem Top-Down-Ansatz. Grund hierfür sind die großen Attributkombinationen, die der Top-Down-Algorithmus zu Beginn überprüfen muss.

Der Bottom-Up+Top-Down-Ansatz liegt sowohl hinsichtlich der Laufzeit als auch bei der Anzahl der Attributvergleiche deutlich vorne. Die Anzahl der Tablescans konnte im Vergleich zum Bottom-Up-Verfahren zwischen 67,4% (4076 statt 12501 Scans; Grenzwert: 0.5) und 96,8% (543 statt 16818 Scans; Grenzwert: 0.9) reduziert werden. Gleiches gilt für die Laufzeit (58,1% bis 97,5%; siehe Abbildung 3).

[Abbildung 2, Balkendiagramm: Anzahl Tablescans über der Anzahl der Attribute in der Attributkombination (1-15); Reihen: Brute-Force, Bottom-Up, Top-Down, Bottom-Up+Top-Down (AVG).]
Abbildung 2: Verhältnis von der Anzahl der Attribute in den Attributkombinationen zur Anzahl von Tablescans (Adult-DB, Grenzwert 90%)
Wie in Abbildung 3 zu erkennen ist, nimmt die Laufzeit beim Bottom-Up+Top-Down-Verfahren im Grenzwertbereich von 70%-90% stark ab. Interessant ist dies aus zwei Gründen. Erstens nimmt die Anzahl der Quasi-Identifikatoren bis 90% ebenfalls ab (179 bei 50%, 56 bei 90%). Dies legt nahe, dass die Skalierung des Verfahrens neben der Dimension der Relation (Anzahl von Tupeln und Attributen) auch von der Anzahl der vorhandenen QIs abhängt. Um den Zusammenhang zu bestätigen, sind aber weitere Untersuchungen erforderlich.

Zweitens wird dieser Grenzwertbereich in der Literatur [13] häufig benutzt, um besonders schützenswerte Daten hervorzuheben. Durch die gute Skalierung des Algorithmus in diesem Bereich lassen sich diese QIs schnell feststellen.

[Abbildung 3, Liniendiagramm: Laufzeit in Sekunden über dem Grenzwert in % (50-99); Reihen: Bottom-Up, Top-Down, Bottom-Up+Top-Down (AVG).]
Abbildung 3: Vergleich der Laufzeit der verschiedenen Algorithmen (Adult-DB)

5. AUSBLICK
In dieser Arbeit stellten wir einen effizienten Algorithmus zur Erkennung von QI in hochdimensionalen Daten vor. Anhand eines Beispiels mit Sensordaten zeigten wir die Eignung in Assistenzsystemen. Darüber hinaus ermitteln wir derzeit, inwiefern sich QIs in temporalen Datenbanken feststellen lassen. Das so gewonnene Wissen über schützenswerte Daten wird in unser Gesamtprojekt zur datenschutzfreundlichen Anfrageverarbeitung in Assistenzsystemen eingebunden.

In späteren Untersuchungen werden wir testen, welche weiteren Quasi-Identifikatoren sich aus der Kombination von Daten verschiedener Relationen ableiten lassen. Der dafür verwendete Datensatz besteht aus Sensordaten, die im Smart Appliance Lab des Graduiertenkollegs MuSAMA durch ein Tool [11] aufgezeichnet wurden. Die Daten umfassen dabei Bewegungsprofile, die mittels RFID-Tags und eines SensFloor [15] erfasst wurden, aber auch Informationen zu Licht und Temperatur. Eine Verknüpfung der Basis-Relationen erfolgt dabei über die ermittelten Quasi-Identifikatoren.

6. DANKSAGUNG
Hannes Grunert wird durch die Deutsche Forschungsgemeinschaft (DFG) im Rahmen des Graduiertenkollegs 1424 (Multimodal Smart Appliance Ensembles for Mobile Applications - MuSAMA) gefördert. Wir danken den anonymen Gutachtern für ihre Anregungen und Kommentare.

7. LITERATUR
[1] Bundesrepublik Deutschland. Bundesdatenschutzgesetz in der Fassung der Bekanntmachung vom 14. Januar 2003, das zuletzt durch Artikel 1 des Gesetzes vom 14. August 2009 geändert worden ist, 2010.
[2] T. Dalenius. Finding a Needle In a Haystack or Identifying Anonymous Census Records. Journal of Official Statistics, 2(3):329–336, 1986.
[3] H. Grunert. Privacy Policy for Smart Environments. http://www.ls-dbis.de/pp4se, 2014. Zuletzt aufgerufen am 17.07.2014.
[4] Z. G. Ives and N. E. Taylor. Sideways information passing for push-style query processing. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 774–783. IEEE, 2008.
[5] M. Klettke. Akquisition von Integritätsbedingungen in Datenbanken. PhD thesis, Universität Rostock, 1997.
[6] R. Kohavi and B. Becker. Adult Data Set. http://archive.ics.uci.edu/ml/datasets/Adult, 1996. Zuletzt aufgerufen am 17.07.2014.
[7] N. Li, T. Li, and S. Venkatasubramanian. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In ICDE, volume 7, pages 106–115, 2007.
[8] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007.
[9] L. F. Mackert. R* optimizer validation and performance evaluation for distributed queries. In Readings in database systems, pages 219–229. Morgan Kaufmann Publishers Inc., 1988.
[10] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
[11] D. Moos. Konzepte und Lösungen für Datenaufzeichnungen in heterogenen dynamischen Umgebungen. Bachelorarbeit, Universität Rostock, 2011.
[12] R. Motwani and Y. Xu. Efficient algorithms for masking and finding quasi-identifiers. In Proceedings of the Conference on Very Large Data Bases (VLDB), pages 83–93, 2007.
[13] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, 1998.
[14] P. Seshadri, J. M. Hellerstein, H. Pirahesh, T. Leung, R. Ramakrishnan, D. Srivastava, P. J. Stuckey, and S. Sudarshan. Cost-based optimization for magic: Algebra and implementation. In ACM SIGMOD Record, volume 25, pages 435–446. ACM, 1996.
[15] A. Steinhage and C. Lauterbach. SensFloor®: Ein AAL-Sensorsystem für Sicherheit, Homecare und Komfort. Ambient Assisted Living - AAL, 2008.
[16] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
[17] M. Weiser. The computer for the 21st century. Scientific American, 265(3):94–104, 1991.
      Combining Spotify and Twitter Data for Generating a
     Recent and Public Dataset for Music Recommendation

                     Martin Pichl                               Eva Zangerle                       Günther Specht
             Databases and Information                   Databases and Information              Databases and Information
                       Systems                                     Systems                                Systems
           Institute of Computer Science               Institute of Computer Science          Institute of Computer Science
              University of Innsbruck,                    University of Innsbruck,               University of Innsbruck,
                        Austria                                     Austria                                Austria
            martin.pichl@uibk.ac.at                    eva.zangerle@uibk.ac.at               guenther.specht@uibk.ac.at

ABSTRACT
In this paper, we present a dataset based on publicly available information. It contains listening histories of Spotify users who posted what they are listening to at the moment on the microblogging platform Twitter. The dataset was derived using the Twitter Streaming API and is updated regularly. To show an application of this dataset, we implement and evaluate a pure collaborative filtering based recommender system. The performance of this system can be seen as a baseline approach for evaluating further, more sophisticated recommendation approaches. These approaches will be implemented and benchmarked against our baseline approach in future works.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering; H.2.8 [Database Applications]: Data mining

General Terms
Algorithms, Experimentation

Keywords
Music Recommender Systems, Collaborative Filtering, Social Media

1. INTRODUCTION
More and more music is available to be consumed, due to new distribution channels enabled by the rise of the web. Those new distribution channels, for instance music streaming platforms, generate and store valuable data about users and their listening behavior. However, most of the time the data gathered by these companies is not publicly available. There are datasets available based on such private data corpora, which are widely used for implementing and evaluating recommender systems, e.g., the Million Song Dataset (MSD) [4]; however, such datasets like the MSD are often not recent anymore. Thus, in order to address the problem of a lack of recent and publicly available data for the development and evaluation of recommender systems, we exploit the fact that many users of music streaming platforms post what they are listening to on the microblogging platform Twitter. An example for such a tweet is “#NowPlaying Human (The Killers) #craigcardiff #spotify http://t.co/N08f2MsdSt”. Using a dataset derived from such tweets, we implement and evaluate a collaborative filtering (CF) based music recommender system and show that this is a promising approach. Music recommender systems are of interest, as the volume and variety of available music has increased dramatically, as mentioned in the beginning. Besides commercial vendors like Spotify¹, there are also open platforms like SoundCloud² or Promo DJ³, which foster this development. On those platforms, users can upload and publish their own creations. As more and more music is available to be consumed, it gets difficult for the user or rather customer to navigate through it. By giving music recommendations, recommender systems help the user to identify music he or she wants to listen to without browsing through the whole collection. By supporting the user in finding items he or she likes, the platform operators benefit from an increased usability and thus increase customer satisfaction.

As the recommender system implemented in this work delivers suitable results, we will gradually enlarge the dataset by further sources and assess how the enlargements influence the performance of the recommender system in future work. Additionally, as the dataset also contains time stamps and a part of the captured tweets contains a geolocation, more sophisticated recommendation approaches utilizing this additional context-based information can be compared against the baseline pure CF-based approach in future works.

The remainder of this paper is structured as follows: in Section 2 we present the dataset creation process as well as the dataset itself in more detail. Afterwards, in Section 3 we briefly present the recommendation approach, which is evaluated in Section 4. Before we present the conclusion drawn from the evaluation in Section 6, related work is discussed in Section 5.

¹ http://www.spotify.com
² http://soundcloud.com
³ http://promodj.com

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
2. THE SPOTIFY DATASET
In this section, the dataset⁴ used for developing and evaluating the recommender system is presented.

2.1 Dataset Creation
For the crawling of a sufficiently large dataset, we relied on the Twitter Streaming API, which allows for crawling tweets containing specified keywords. Since July 2011, we crawled for tweets containing the keywords nowplaying, listento and listeningto. Until October 2014, we were able to crawl more than 90 million tweets.
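A minimal sketch of such a keyword-based crawl is shown below. The paper does not name a concrete client library; the sketch assumes the twitter4j library with OAuth credentials taken from its default configuration, and it simply prints the expanded URLs contained in matching tweets.

import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.URLEntity;

// Sketch: crawl tweets containing the tracked keywords via the Twitter Streaming API
// (assumption: twitter4j as client library; credentials come from twitter4j.properties).
public class NowPlayingCrawler {

    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                for (URLEntity url : status.getURLEntities()) {
                    // Twitter delivers the resolved form of shortened URLs as well.
                    System.out.println(status.getId() + "\t" + url.getExpandedURL());
                }
            }
        });
        // Track the three keywords used for building the dataset.
        stream.filter(new FilterQuery().track("nowplaying", "listento", "listeningto"));
    }
}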
In contrast to other contributions aiming at extracting music information from Twitter, where the tweet's content is used to extract artist and track information from [17, 7, 16], we propose to exploit the subset of crawled tweets containing a URL leading to the website of the Spotify music streaming service⁵. I.e., information about the artist and the track is extracted from the website mentioned in the tweet, rather than from the content of the tweet. This enables an unambiguous resolution of the tweets, in contrast to the contributions mentioned above, where the text of the tweets is compared to entries in the reference database using some similarity measure. A typical tweet, published via Spotify, is depicted in the following: “#nowPlaying I Tried by Total on #Spotify http://t.co/ZaFHZAokbV”, where a user published that he or she listened to the track “I Tried” by the band “Total” on Spotify. Additionally, a shortened URL is provided. Besides this shortened URL, Twitter also provides the according resolved URL via its API. This allows for directly identifying all Spotify-URLs by searching for all URLs containing the string “spotify.com” or “spoti.fi”. By following the identified URLs, the artist and the track can be extracted from the title tag of the according website. For instance, the title of the website behind the URL stated above is “I tried by Total on Spotify”. Using the regular expression “(.*) by (.*) on.*”, the name of the track (group 1) and the artist (group 2) can be extracted.
                                                                         log(Number of Tweets)




   By applying the presented approach to the crawled tweets,
we were able to extract artist and track information from                                         100
7.08% of all tweets or rather 49.45% of all tweets containing
at least one URL. We refer to the subset of tweets, for which
we are able to extract the artist and the track, as “matched
tweets”. An overview of the captured tweets is given in Table
1. 1.94% of the tweets containing a Spotify-URL couldn’t
                                                                                                   10
be matched due to HTTP 404 Not Found and HTTP 500
Internal Server errors.

     Restriction          Number of Tweets       Percentage
     None                       90,642,123         100.00%
     At least one URL           12,971,482          14.31%
     A Spotify-URL               6,541,595           7.22%                                                0          50,000      100,000        150,000   200,000
                                                                                                                              Number of Users
     Matched                     6,414,702           7.08%

       Table 1: Captured and Matched Tweets                          Figure 1:                                  Number of Tweets versus Number of
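To make the title-tag extraction from Section 2.1 concrete, the following minimal Python sketch applies the stated regular expression to a page title; the example title is the one from the running example, while the helper name is illustrative and not taken from the actual crawling pipeline.

    import re

    # Regular expression from Section 2.1: group 1 = track name, group 2 = artist name.
    TITLE_PATTERN = re.compile(r"(.*) by (.*) on.*")

    def extract_track_and_artist(page_title):
        # Return (track, artist) from a Spotify page title, or None if it does not match.
        match = TITLE_PATTERN.match(page_title)
        if match is None:
            return None
        return match.group(1), match.group(2)

    print(extract_track_and_artist("I Tried by Total on Spotify"))   # ('I Tried', 'Total')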
2.2   Dataset Description
  Based on the raw data presented in the previous Section, we generate a final dataset of user-track-artist triples which contains 5,504,496 tweets of 569,722 unique users who listened to 322,647 tracks by 69,271 artists. In this final dataset, users considered not valuable for recommendations, i.e., the @SpotifyNowPlay Twitter account which retweets tweets sent via @Spotify, are removed. These users were identified manually by the authors.
  As is typical for social media datasets, our dataset has a long-tailed distribution among the users and their respective number of posted tweets [5]. This means that there are only a few users tweeting rather often in this dataset, while numerous users tweet rarely and can be found in the long tail. This long-tailed distribution can be seen in Table 2 and Figure 1, where the logarithm of the number of tweets is plotted against the corresponding number of users.

     Number of Tweets      Number of Users
     >0                           569,722
     >1                           354,969
     >10                           91,217
     >100                           7,419
     >1,000                           198

       Table 2: Number of Tweets and Number of Users

  Figure 1: Number of Tweets versus Number of Users (log(Number of Tweets) over Number of Users)

  The performance of a pure collaborative filtering-based recommender system increases with the level of detail of a user profile. Especially for new users in a system, where no or only little data is available about them, this poses a problem, as no suitable recommendations can be computed. In our case, problematic users are users who tweeted rarely and thus can be found in the long tail.

4 available at: http://dbis-twitterdata.uibk.ac.at/spotifyDataset/
5 http://www.spotify.com

  Besides the long tail among the number of posted tweets, there is another long tail among the distribution of the artist play-counts in the dataset: there are a few popular artists occurring in a large number of tweets and many artists that are mentioned only in a limited number of tweets. This is shown in Figure 2, where the logarithm of the number of tweets in which an artist occurs (the play-count) is plotted against the number of artists. Thus, this plot states how many artists are mentioned how often in the dataset.

  Figure 2: Play-Count versus Number of Artists (log(Number of Tweets) over Number of Artists)

  How the presented dataset is used as input and evaluation data for a music recommender system is presented in the next Section.

3.   THE BASELINE RECOMMENDATION APPROACH
  In order to present how the dataset can be applied, we use our dataset as input and evaluation data for an artist recommendation system. This recommender system is based on the open source machine learning library Mahout [2]. The performance of this recommender system is shown in Section 4 and serves as a benchmark for future work.

3.1   Recommendation Approach
  For showing the usefulness of our dataset, we implemented a user-based CF approach. User-based CF recommends items by solely utilizing past user-item interactions. For the music recommender system, a user-item interaction states that a user listened to a certain track by a certain artist. Thus, the past user-item interactions represent the listening history of a user. In the following, we describe our basic approach for computing artist recommendations and provide details about the implementation.
  In order to estimate the similarity of two users, we computed a linear combination of the Jaccard coefficients [10] based on the listening histories of the users. The Jaccard coefficient is defined in Equation 1 and measures the proportion of common items in two sets.

    jaccard_{i,j} = |A_i ∩ A_j| / |A_i ∪ A_j|                (1)

  For each user, there are two listening histories we take into consideration: the set of all tracks a user listened to and the set of all artists a user listened to. Thus, we are able to compute an artist similarity (artistSim) and a track similarity (trackSim) as shown in Equations 2 and 3.

    artistSim_{i,j} = |artists_i ∩ artists_j| / |artists_i ∪ artists_j|                (2)

    trackSim_{i,j} = |tracks_i ∩ tracks_j| / |tracks_i ∪ tracks_j|                (3)

  The final user similarity is computed as a weighted average of both the artistSim and the trackSim, as depicted in Equation 4.

    sim_{i,j} = w_a * artistSim_{i,j} + w_t * trackSim_{i,j}                (4)

  The weights w_a and w_t determine the influence of the artist and the track listening history on the user similarity, where w_a + w_t = 1. Thus, if w_t = 0, only the artist listening history is taken into consideration. We call such a recommender system an artist-based recommender system. Analogously, if w_a = 0, we call such a recommender system track-based. If w_a > 0 ∧ w_t > 0, both the artist and the track listening histories are used; hence, we obtain a hybrid recommender system for artist recommendations.
  The presented weights have to be predetermined. In this work, we use a grid search for finding suitable input parameters for our recommender system, as described in Section 4.2.
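As an illustration of Equations 1 to 4, the following Python sketch computes the weighted user similarity from listening histories represented as plain sets; the function names and the example histories are illustrative and are not part of the Mahout-based implementation.

    def jaccard(a, b):
        # Equation 1: proportion of common items in two sets.
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    def user_similarity(artists_i, tracks_i, artists_j, tracks_j, w_a, w_t):
        # Equations 2-4: weighted average of artist and track similarity, with w_a + w_t = 1.
        artist_sim = jaccard(artists_i, artists_j)   # Equation 2
        track_sim = jaccard(tracks_i, tracks_j)      # Equation 3
        return w_a * artist_sim + w_t * track_sim    # Equation 4

    # Illustrative listening histories; the track-based setting uses w_a = 0, w_t = 1.
    sim = user_similarity({"Total", "Moloko"}, {"I Tried", "Sing It Back"},
                          {"Total"}, {"I Tried"},
                          w_a=0.0, w_t=1.0)
    print(sim)   # 0.5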
4.   EVALUATION
  In this Section we present the performance of the implemented artist recommender system, but also discuss the limitations of the conducted offline evaluation.

4.1   Evaluation Setup
  The performance of the recommender system with different input parameters was evaluated using precision and recall. Although we focus on the precision, for the sake of completeness we also include the recall in the evaluation, as is usual in the field of information retrieval [3]. The metrics were computed using a Leave-n-Out algorithm, which can be described as follows (a sketch is given after the list):

    1. Randomly remove n items from the listening history of a user

    2. Recommend m items to the user

    3. Calculate precision and recall by comparing the m recommended and the n removed items

    4. Repeat steps 1 to 3 p times

    5. Calculate the mean precision and the mean recall
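A minimal sketch of this Leave-n-Out procedure is shown below; it assumes a hypothetical recommend(train, m) function and represents listening histories as Python sets, so it illustrates the evaluation protocol rather than the actual evaluation code.

    import random

    def leave_n_out(history, recommend, n=10, m=5, p=5, seed=0):
        # history: set of items the user listened to; recommend(train, m) returns m items.
        rng = random.Random(seed)
        precisions, recalls = [], []
        for _ in range(p):
            removed = set(rng.sample(sorted(history), n))   # step 1: hide n items
            train = history - removed
            recommended = set(recommend(train, m))          # step 2: recommend m items
            hits = len(recommended & removed)               # step 3: compare
            precisions.append(hits / m)
            recalls.append(hits / n)
        # step 5: mean precision and mean recall over the p repetitions
        return sum(precisions) / p, sum(recalls) / p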

  Each evaluation in the following Sections has been repeated five times (p = 5) and the size of the test set was fixed to 10 items (n = 10). Thus, we can evaluate the performance of the recommender for recommending up to 10 items.

4.2   Determining the Input Parameters
  In order to determine good input parameters for the recommender system, a grid search was conducted. Therefore, we define a grid of parameters and the possible combinations are evaluated using a performance measure [9]. In our case, we relied on the precision of the recommender system (cf. Figure 3), as the task of a music recommender system is to find a certain number of items a user will listen to (or buy), but not necessarily to find all good items. Precision is a reasonable metric for this so-called Find Good Items task [8] and was assessed using the explained Leave-n-Out algorithm. For this grid search, we recommended one item and the size of the test set was fixed to 10 items. In order to find good input parameters, the following grid parameters determining the computation of the user similarity were altered (a sketch of the grid search is given after the parameter lists):

    • Number of nearest neighbors k

    • Weight of the artist similarity w_a

    • Weight of the track similarity w_t

  The result can be seen in Figure 3. For our dataset, the best results are achieved with a track-based recommender system (w_a = 0, w_t = 1) and 80 nearest neighbors (k = 80). Thus, for the performance evaluation of the recommender system in the next Section, we use the following parameters:

    • Number of nearest neighbors 80

    • Weight of the artist similarity 0

    • Weight of the track similarity 1
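The following Python sketch shows the kind of grid search described above; evaluate_precision is a stand-in for the actual Mahout-based precision@1 evaluation, and the candidate grid values are illustrative assumptions, since the paper only states the resulting optimum (k = 80, w_a = 0, w_t = 1).

    def grid_search(evaluate_precision):
        # evaluate_precision(k, w_a, w_t) -> precision@1, e.g. via the Leave-n-Out sketch above.
        best = None
        for k in range(10, 101, 10):                   # number of nearest neighbors
            for w_a in (0.0, 0.25, 0.5, 0.75, 1.0):    # artist weight; track weight is 1 - w_a
                w_t = 1.0 - w_a
                precision = evaluate_precision(k, w_a, w_t)
                if best is None or precision > best[0]:
                    best = (precision, k, w_a, w_t)
        return best   # (best precision, k, w_a, w_t)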
4.3   Performance of the Baseline Recommender System
  In this Section, the performance of the recommender system using the optimized input parameters is presented. Prior to the evaluation, we also examined real implementations of music recommender systems: Last.fm, a music discovery service, for instance recommends 6 artists6 when displaying a certain artist. If an artist is displayed on Spotify7, 7 similar artists are recommended on the first page. This number of items also corresponds to the work of Miller [11], who argues that people are able to process about 7 items at a glance, or rather that the span of attention is too short for processing long lists of items. The precision@6 and the precision@7 of our recommender are 0.20 and 0.19, respectively. In such a setting, 20% of the recommended items computed by the proposed recommender system would be a hit. In other words, a customer should be interested in at least two of the recommended artists. An overview of the precision@n of the recommender is given in Table 3.

6 http://www.last.fm/music/Lana+Del+Rey
7 http://play.spotify.com/artist/00FQb4jTyendYWaN8pK0wa

  Figure 3: Precision and Recall of the Track-Based Recommender (precision over the number of k-nearest neighbors for the artist-based, hybrid and track-based recommender)

     n     Precision     Recall     Upper Bound
     1          0.49       0.05            0.10
     5          0.23       0.11            0.50
     6          0.20       0.12            0.60
     7          0.19       0.13            0.70
     10         0.15       0.15            1.00

  Table 3: Precision and Recall of the Track-Based Recommender

  Figure 4: Precision and Recall of the Track-Based Recommender (precision, recall and recall upper bound over the number of recommended items)

  As shown in Figure 4, with an increasing number of recommendations, the performance of the presented recommender system declines. Thus, for a high number of recommendations the recommender system is rather limited. This is because the chance of false positives increases if the size of the test set is kept constant. For computing the recall metric, the 10 items in the test set are considered as relevant items (and hence are desirable to recommend to the user). The recall metric describes the fraction of relevant artists that are recommended, i.e., when recommending 5 items, even if all 5 recommended items are relevant, the maximum recall is still only 50%, as 10 items are considered relevant. Thus, in the evaluation setup, recall is bounded by an upper limit, which is the number of recommended items divided by the size of the test set.
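The Upper Bound column of Table 3 follows directly from this observation; a one-line sketch with the fixed test-set size n = 10 from Section 4.1 reproduces it.

    def recall_upper_bound(m, n=10):
        # With n relevant items held out, recommending m items can hit at most min(m, n) of them.
        return min(m, n) / n

    print([recall_upper_bound(m) for m in (1, 5, 6, 7, 10)])   # [0.1, 0.5, 0.6, 0.7, 1.0]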
4.4   Limitations of the Evaluation
  Besides discussing the results, it is worth mentioning two limitations of the evaluation approach: First, only recommendations for items the user already interacted with can be evaluated [5]. If something new is recommended, it cannot be stated whether the user likes the item or not. We can only state that it is not part of the user's listening history in our dataset.

Thus, this evaluation does not fit perfectly to the intended use of providing recommendations for new artists. However, this evaluation approach enabled us to find the optimal input parameters using a grid search. Secondly, as we do not have any preference values, the assumption has to be made that a certain user likes the artists he/she listened to.
  Both drawbacks can be eliminated by conducting a user-centric evaluation [5]. Thus, in future work, it would be worthwhile to conduct a user experiment using the optimized recommender system.

5.   RELATED WORK
  As already mentioned in the introduction, there exist several other publicly available datasets suitable for music recommendations. A quick overview of these datasets is given in this Section.
  One of the biggest available music datasets is the Million Song Dataset (MSD) [4]. This dataset contains information about one million songs from different sources. Beside real user play counts, it provides audio features of the songs and is therefore suitable for CF-, CB- and hybrid recommender systems. At the moment, the Taste Profile subset8 of the MSD is bigger than the dataset presented in this work; however, it was released in 2011 and is therefore not as recent.
  Beside the MSD, Yahoo! also published big datasets9 containing ratings for artists and songs suitable for CF. The biggest dataset contains 136,000 songs along with ratings given by 1.8 million users. Additionally, genre information is provided in the dataset. The data itself was gathered by monitoring users of the Yahoo! Music Services between 2002 and 2006. Like the MSD, the Yahoo dataset is less recent. In addition to the ratings, the Yahoo dataset contains genre information which can be exploited by a hybrid recommender system.
  Celma also provides a music dataset containing data retrieved from last.fm10, a music discovery service. It contains users, artists and play counts as well as the MusicBrainz identifiers for 360,000 users. This dataset was published in 2010 [5].
  Beside the datasets presented above, which are based on data of private companies, there exist several datasets based on publicly available information. Sources exploited have been websites in general [12, 15, 14], Internet radios posting their play lists [1] and micro-blogging platforms, in particular Twitter [17, 13]. However, using these sources has a drawback: cleaning and matching the data requires high effort.
  One of the datasets most similar to the dataset used in this work is the Million Musical Tweets Dataset11 by Hauger et al. [7]. Like our dataset, it was created using the Twitter streaming API, from September 2011 to April 2013; however, all tweets not containing a geolocation were removed and thus it is much smaller. The dataset contains 1,086,808 tweets by 215,375 users. Within the dataset, 25,060 unique artists have been identified [7].
  Another dataset based on publicly available data, which is similar to the MovieLens dataset, is the MovieTweetings dataset published by Dooms et al. [6]. The MovieTweetings dataset is continually updated and has the same format as the MovieLens dataset, in order to foster exchange. At the moment, a snapshot containing 200,000 ratings is available12. The dataset is generated by crawling well-structured tweets and extracting the desired information using regular expressions. Using these regular expressions, the name of the movie, the rating and the corresponding user are extracted. The data is afterwards linked to the IMDb, the Internet Movie Database13.

8 http://labrosa.ee.columbia.edu/millionsong/tasteprofile
9 available at: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
10 http://www.last.fm
11 available at: http://www.cp.jku.at/datasets/MMTD/
12 https://github.com/sidooms/MovieTweetings
13 http://www.imdb.com

6.   CONCLUSION AND FUTURE WORK
  In this work we have shown that the presented dataset is valuable for evaluating and benchmarking different approaches for music recommendation. We implemented a working music recommender system; however, as shown in Section 4, for a high number of recommendations the performance of our baseline recommendation approach is limited. Thus, we see a need for action at two points: First, we will enrich the dataset with further context-based information that is available, in this case the time stamp or the geolocation. Secondly, hybrid recommender systems utilizing this additional context-based information are of interest. Therefore, in future work, we will focus on the implementation of such recommender systems and compare them to the presented baseline approach. First experiments were already conducted with a recommender system trying to exploit the geolocation. Two different implementations are evaluated at the moment: The first uses the normalized linear distance between two users for approximating a user similarity. The second one, which at an early stage of evaluation seems to be the more promising one, increases the user similarity if a certain distance threshold is underrun. However, there remains the open question of how to determine this distance threshold.

7.   REFERENCES
 [1] N. Aizenberg, Y. Koren, and O. Somekh. Build your own music recommender by modeling internet radio streams. In Proceedings of the 21st International Conference on World Wide Web (WWW 2012), pages 1–10. ACM, 2012.
 [2] Apache Software Foundation. What is Apache Mahout?, March 2014. Retrieved July 13, 2014, from http://mahout.apache.org.
 [3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition) (ACM Press Books). Addison-Wesley Professional, 2 edition, 2011.
 [4] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In A. Klapuri and C. Leider, editors, Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), pages 591–596. University of Miami, 2011.
 [5] Ò. Celma. Music Recommendation and Discovery - The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer, 2010.
 [6] S. Dooms, T. De Pessemier, and L. Martens. Movietweetings: a movie rating dataset collected from twitter. In Workshop on Crowdsourcing and Human Computation for Recommender Systems at the 7th ACM Conference on Recommender Systems (RecSys 2013), 2013.
 [7] D. Hauger, M. Schedl, A. Kosir, and M. Tkalcic. The million musical tweet dataset - what we can learn from microblogs. In A. de Souza Britto Jr., F. Gouyon, and S. Dixon, editors, Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pages 189–194, 2013.
 [8] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, Jan. 2004.
 [9] C. W. Hsu, C. C. Chang, and C. J. Lin. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003.
[10] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, Feb. 1912.
[11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 62:81–97, 1956.
[12] A. Passant. dbrec - Music Recommendations Using DBpedia. In Proceedings of the 9th International Semantic Web Conference (ISWC 2010), volume 6497 of Lecture Notes in Computer Science, pages 209–224. Springer Berlin Heidelberg, 2010.
[13] M. Schedl. Leveraging Microblogs for Spatiotemporal Music Information Retrieval. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013), pages 796–799, 2013.
[14] M. Schedl, P. Knees, and G. Widmer. Investigating web-based approaches to revealing prototypical music artists in genre taxonomies. In Proceedings of the 1st International Conference on Digital Information Management (ICDIM 2006), pages 519–524. IEEE, 2006.
[15] M. Schedl, C. C. Liem, G. Peeters, and N. Orio. A Professionally Annotated and Enriched Multimodal Data Set on Popular Music. In Proceedings of the 4th ACM Multimedia Systems Conference (MMSys 2013), pages 78–83, February–March 2013.
[16] M. Schedl and D. Schnitzer. Hybrid Retrieval Approaches to Geospatial Music Recommendation. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2013.
[17] E. Zangerle, W. Gassler, and G. Specht. Exploiting twitter's collective knowledge for music recommendations. In Proceedings of the 2nd Workshop on Making Sense of Microposts (#MSM2012), pages 14–17, 2012.

  Incremental calculation of isochrones regarding duration

  Nikolaus Krismer, University of Innsbruck, Austria, nikolaus.krismer@uibk.ac.at
  Günther Specht, University of Innsbruck, Austria, guenther.specht@uibk.ac.at
  Johann Gamper, Free University of Bozen-Bolzano, Italy, gamper@inf.unibz.it

ABSTRACT
An isochrone in a spatial network is the minimal, possibly disconnected subgraph that covers all locations from where a query point is reachable within a given time span and by a given arrival time [5]. A novel approach for computing isochrones in multimodal spatial networks is presented in this paper. The basic idea of this incremental calculation is to reuse already computed isochrones when a new request with the same query point is sent, but with a different duration. Some of the major challenges of the new calculation approach are described and solutions to the most problematic ones are outlined on the basis of the already established MINE and MINEX algorithms. The development of the incremental calculation is done by using six different cases of computation. Three of them apply to the MINEX algorithm, which uses a vertex expiration mechanism, and three cases to MINE without vertex expiration. Possible evaluations are also suggested to ensure the correctness of the incremental calculation. In the end, some further tasks for future research are outlined.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Spatial databases and GIS

General Terms
Algorithms

Keywords
isochrone, incremental calculation

1.   INTRODUCTION
  Throughout the past years, interactive online maps have become a popular tool for planning routes of any kind. Nowadays everybody with access to the internet is able to easily get support when travelling from a given point to a specific target. The websites enabling such navigation usually calculate routes using efficient shortest path (SP) algorithms. One of the most famous examples of these tools is Google's map service GoogleMaps1. For a long time it was only possible to calculate routes using one transportation system (by car, by train or by bus). This is known as routing within unimodal spatial networks. Recent developments enabled the computation of routes combining various transportation systems, even if some systems are bound to schedules. This has become popular under the term "multimodal routing" (or routing in multimodal spatial networks).
  Less famous, but algorithmically very interesting, is finding the answer to the question where someone can travel to in a given amount of time, starting at a certain time from a given place. The result is known as an isochrone. Within multimodal spatial networks it has been defined by Gamper et al. [5]. Websites using isochrones include Mapnificent2 and SimpleFleet3 [4].
  One major advantage of isochrones is that they can be used for reachability analyses of any kind. They are helpful in various fields including city planning and emergency management. While some providers, like SimpleFleet and Mapnificent, enable the computation of isochrones based on pre-calculated information or with heuristic data, the calculation of isochrones is a non-trivial and time-intensive task. Although some improvements to the algorithms that can be used for isochrone computation have been published at the Free University of Bozen-Bolzano in [7], one major drawback is that the task is always performed from scratch. It is not possible to create the result of a twenty-minute isochrone (meaning that the travelling time from/to a query point q is less than or equal to twenty minutes) based on the result of a 15-minute isochrone (the travelling time is often referred to as maximal duration dmax). Incremental calculation could dramatically speed up the computation of isochrones if other isochrones for the same point q are available. This is especially true for long travel times. However, the computation based on cached results has not been realised until now and is complex. As can be seen from Figures 1 and 2, it is not sufficient to extend the outline of the isochrone, because there might be some network hubs (e.g. stations of the public transportation system) which extend the isochrone result into new, possibly disconnected areas.

1 http://maps.google.com
2 http://www.mapnificent.net
3 http://www.simplefleet.eu

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.



  Figure 1: Isochrone with dmax of 10 minutes

  Figure 2: Isochrone with dmax of 15 minutes

  This paper presents the calculation of incremental isochrones in multimodal spatial networks on top of already developed algorithms and cached results. It illustrates some ideas that need to be addressed when extending the algorithms by the incremental calculation approach. The remainder of this paper is structured as follows. Section 2 covers related work. Section 3 is split into three parts: the first part describes challenges that will have to be faced during the implementation of incremental isochrones; possible solutions to the outlined problems are also discussed shortly there. The second part deals with the different cases that are regarded during computation and how these cases differ, while the third part points out some evaluations and tests that will have to be performed to ensure the correctness of the implementation. Section 4 consists of a conclusion and lists some possible future work.

2.   RELATED WORK
  The calculation of isochrones in multimodal spatial networks can be done using various algorithms. The method introduced by Bauer et al. [1] suffers from a high initial loading time and is limited by the available memory, since the entire network is loaded into memory at the beginning. Another algorithm, called Multimodal Incremental Network Expansion (MINE), which has been proposed by Gamper et al. [5], overcomes the limitation that the whole network has to be loaded, but is restricted by the size of the isochrone result, since all points in the isochrone are still located in memory. To overcome this limitation, the Multimodal Incremental Network Expansion with vertex eXpiration (MINEX) algorithm has been developed by Gamper et al. [6], introducing vertex expiration (also called node expiration). This mechanism eliminates unnecessary nodes from memory as soon as possible and therefore reduces the memory needed during computation.
  There are some more routing algorithms that do not load the entire network into main memory. One well-known approach, which is not specific to isochrone calculation but to query processing in spatial networks in general, called Incremental Euclidean Restriction (IER), has been introduced by Papadias [8] in 2003. This algorithm loads chunks of the network into memory that are specified by the Euclidean distance. The Incremental Network Expansion (INE) algorithm has also been introduced in the publication of Papadias. It is basically an extension of the Dijkstra shortest path algorithm. Deng et al. [3] improved the ideas of Papadias et al., accessing less network data to perform the calculations. The open source routing software "pgRouting"4, which calculates routes on top of the spatial database PostGIS5 (an extension to the well-known relational database PostgreSQL), uses an approach similar to IER. Instead of the Euclidean distance it uses the network distance to load the spatial network.
  In 2013, similar ideas have been applied to MINEX and resulted in an algorithm called Multimodal Range Network Expansion (MRNEX). It has been developed at the Free University of Bozen-Bolzano by Innerebner [7]. Instead of loading the needed data edge-by-edge from the network, the data is loaded in chunks, as it is done in IER. Depending on the chunk size, this approach is able to reduce the number of network accesses considerably and therefore reduces calculation time.
  Recently, the term "optimal location queries" has been proposed by some researchers, such as Chen et al. [2]. These queries are closely related to isochrones, since they "find a location for setting up a new server such that the maximum cost of clients being served by the servers (including the new server) is minimized".

4 http://pgrouting.org
5 http://postgis.net

3.   INCREMENTAL CALCULATION REGARDING ISOCHRONE DURATION
  In this paper, the MINE and MINEX algorithms are extended by a new idea that is defined as "incremental calculation". This allows the creation of new results based on already computed and cached isochrones with different durations, but with the same query point q (defined as base isochrones). This type of computation is complex, since it is not sufficient to extend an isochrone from its border points. In theory, it is necessary to re-calculate the isochrone from every node in the spatial network that is part of the base isochrone and connected to other nodes. Although this is true for a highly connected spatial network, it might not be the only or even the best way for a real-world multimodal spatial network with various transportation systems. The isochrone calculation based on already known results should be doable with respect to all of the isochrone's border points and all the public transportation system stations that are part of the base isochrone. These network hubs are, in reality, the only nodes which can cause new, possibly disconnected areas to become part of an isochrone with a different travelling time.
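To make the setting concrete before discussing the incremental cases, the following Python sketch shows a duration-bounded network expansion in the spirit of MINE/INE, heavily simplified: it is unimodal, keeps the whole adjacency list in memory and ignores schedules and vertex expiration, so it only illustrates the kind of computation that the incremental approach tries to avoid repeating from scratch.

    import heapq

    def isochrone_nodes(adjacency, q, dmax):
        # adjacency: {node: [(neighbor, travel_time), ...]}; q: query point; dmax: duration budget.
        # Returns the reachable nodes together with their durations from q (Dijkstra with a cutoff).
        durations = {q: 0.0}
        queue = [(0.0, q)]
        while queue:
            d, node = heapq.heappop(queue)
            if d > durations.get(node, float("inf")):
                continue                      # stale queue entry
            for neighbor, travel_time in adjacency.get(node, []):
                nd = d + travel_time
                if nd <= dmax and nd < durations.get(neighbor, float("inf")):
                    durations[neighbor] = nd
                    heapq.heappush(queue, (nd, neighbor))
        return durations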



  As it is important for the incremental calculation, the vertex expiration introduced by Gamper et al. in [6] will now be summarized shortly. The aim of the proposed approach is to remove loaded network nodes from memory as soon as possible. However, to keep performance high, nodes should never be loaded twice and therefore they should not be eliminated from memory too soon. Removal should only occur when all computations regarding the node have been performed. States are assigned to every node to assist in finding the optimal moment for memory elimination. The state of a node can either be "open", "closed" or "expired". Every loaded node is labelled with the open state in the beginning. If all of its outgoing edges are traversed, its state changes to closed. However, the node itself has to be kept in memory in order to avoid cyclic network expansions. A node reaches the expired state if all nodes in its neighbourhood have reached the closed or expired state. It can then safely be removed from memory and is not available for further computations without reloading it from the network. Since this is problematic for the incremental calculation approach, this aspect has been described in more detail.

3.1   Challenges
  There are some challenges that need to be addressed when implementing an incremental calculation for the MINE and MINEX algorithms. The most obvious problem is related to the vertex expiration of the MINEX algorithm. If nodes have already expired, they will not be available to the calculation of isochrones with different durations. To take care of this problem, all nodes n that are connected to other nodes which are not in the direct neighbourhood of n are added to a list l_hubs. These nodes are the ones we referred to as network hubs. Besides the hub node n itself, further information is stored in this list: the time t of arrival at the node and the remaining distance d that can be used. With this information it is possible to continue computation from any hub with a modified travelling time for the algorithms.
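A minimal sketch of such an l_hubs entry and of recording it during the base computation is given below; the class and function names are illustrative and not taken from the actual MINE/MINEX implementation.

    from dataclasses import dataclass

    @dataclass
    class HubEntry:
        node: int             # identifier of the network hub n
        arrival_time: float   # time t of arrival at the hub during the base computation
        remaining: float      # remaining distance/duration d that was still available at the hub

    def record_hub(l_hubs, node, arrival_time, dmax_base):
        # Called whenever the expansion reaches a network hub; kept alongside the cached
        # base isochrone (together with its dmax_base) for later incremental requests.
        l_hubs.append(HubEntry(node, arrival_time, dmax_base - arrival_time))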
  The list l_hubs needs to be stored in addition to the isochrone's maximal travelling time and the isochrone result itself, so that it can be used for incremental calculation. None of this information needs to be held in memory during the computation of the base isochrone itself; it is only touched on incremental calculation. Therefore, runtime and memory consumption of the isochrone algorithms will not be influenced much.
  Other problems include modifications of the spatial network in combination with incremental isochrones. If some change is applied to the underlying network, the base isochrones cannot be used for incremental calculation any more, since it cannot be guaranteed that the network modification does not influence the base isochrone. Changes in the schedules of one or more modalities (for example the public transportation systems) could cause problems as well, as they would also influence the base isochrone. Schedule alterations can be triggered by a service provider. Traffic jams and similar factors can lead to delays in the transportation system and thus also have to be considered. Although it should be possible to overcome both limitations, or at least to limit their impact, this will not be discussed further in this paper.

3.2   Types of calculation
  There are six different cases that have to be kept in mind when calculating an isochrone with travelling time dmax using a base isochrone with duration dmax_base: three applying to algorithms without vertex expiration and three cases for the ones using vertex expiration.

3.2.1   Cases dmax = dmax_base
  The first two and most simple cases, for the MINE and the MINEX algorithm, are the ones where dmax is equal to dmax_base. In these cases it is obvious that the calculation result can be returned directly without any further modification. There is no need to respect expired nodes, since no (re)calculation needs to be performed.

3.2.2   Cases dmax < dmax_base
  The third, also simple, case is the one where dmax is less than dmax_base for algorithms without vertex expiration. In this situation all nodes can be iterated and checked for suitability. If the duration is less than or equal to dmax, then the node also belongs to the new result; otherwise it does not. In the fourth case, where the duration is less than dmax_base and nodes have expired (and therefore are not available in memory any more), the isochrone can be shrunk from its borders. The network hubs do not need any special treatment, since no new areas can become part of the result if the available time decreased. The only necessary task is the recalculation of the durations from the query point to the nodes in the isochrone and, possibly, the reloading of expired nodes. This can be done either from the query point or from the border points. The duration d from the query point q to a network node n is then equal to (assuming that the border point with the minimal distance to n is named bp):

    d(q, n) = d(q, bp) − d(bp, n)
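As a small worked example of this recomputation, under the stated assumption that bp is the border point of the base isochrone closest to n: if the 15-minute base isochrone reaches bp after d(q, bp) = 15 minutes and n lies d(bp, n) = 7 minutes inside the border, then d(q, n) = 15 − 7 = 8 minutes, so n remains in a shrunk isochrone with dmax = 10 but not in one with dmax = 5. In code form:

    def still_reachable(d_q_bp, d_bp_n, dmax):
        # d(q, n) = d(q, bp) - d(bp, n); the node stays in the shrunk isochrone if d(q, n) <= dmax.
        return (d_q_bp - d_bp_n) <= dmax

    print(still_reachable(15, 7, 10))   # True:  d(q, n) = 8 <= 10
    print(still_reachable(15, 7, 5))    # False: d(q, n) = 8 > 5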
as it would also do without incremental calculation.                    be recorded to allow comparison. The incremental calcula-
  In table 1 and in table 2 the recently mentioned calculation          tion can only be seen as successful, if there are situations
types are summarised shortly. The six different cases can be            where they perform better than the common calculation.
distinguished with ease using these two tables.                         As mentioned before, this is expected to be true for at least
                                                                        large isochrone durations, since large portions of the spatial
network does not need to be loaded then.
   Besides these automatically executed tests, it will be pos-
sible to perform manual tests using a graphical user inter-
face. This system is under heavy development at the mo-
ment and has been named IsoMap. Regardless of its young
state it will enable any user to calculate isochrones with and
without the incremental approach and to visually compare
the results with each other.

                 MINE
    dmax < dmax_base   iterating nodes from base isochrone,
                       checking if travel time is <= dmax
    dmax = dmax_base   no change
    dmax > dmax_base   extend base isochrone by
                       border points and with list l_hubs

Table 1: Incremental calculation without vertex expiration
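The case distinction summarised in Table 1 and in Table 2 below can be sketched compactly in code. The following Python fragment is only an illustration and not the MINE/MINEX implementation: the graph layout, the BaseIsochrone record and the helper _expand are assumptions, and the multimodal details hidden behind the network hubs in l_hubs (schedules, changes of modality) are reduced to plain edge weights.

# Illustrative sketch only (assumed data layout, not the authors' code):
# reusing a cached base isochrone for a new maximal travelling time dmax.
import heapq
from dataclasses import dataclass, field

Graph = dict   # vertex -> list of (neighbour, travel_time) pairs

@dataclass
class BaseIsochrone:
    dmax_base: float
    result: dict                  # vertex -> minimal travel time (<= dmax_base)
    border_points: set            # vertices on the isochrone's outline
    l_hubs: list = field(default_factory=list)   # network hubs inside the result

def _expand(graph, seeds, dmax):
    """Bounded Dijkstra that continues from already known travel times."""
    dist = dict(seeds)
    heap = [(t, v) for v, t in seeds.items()]
    heapq.heapify(heap)
    while heap:
        t, v = heapq.heappop(heap)
        if t > dist.get(v, float("inf")) or t > dmax:
            continue
        for w, cost in graph.get(v, []):
            new_t = t + cost
            if new_t <= dmax and new_t < dist.get(w, float("inf")):
                dist[w] = new_t
                heapq.heappush(heap, (new_t, w))
    return dist

def incremental_isochrone(graph, base, dmax):
    if dmax == base.dmax_base:                 # no change
        return dict(base.result)
    if dmax < base.dmax_base:                  # shrink: keep nodes within dmax
        return {v: t for v, t in base.result.items() if t <= dmax}
    # dmax > dmax_base: grow from the border points and the stored hubs,
    # since hubs may make new, possibly disconnected areas reachable
    seeds = {v: base.result[v] for v in set(base.border_points) | set(base.l_hubs)}
    grown = _expand(graph, seeds, dmax)
    return {**grown, **base.result}            # cached travel times take precedence

For the variant with vertex expiration (Table 2), the shrink case would start from the isochrone's border instead of filtering all cached nodes, since expired vertices are no longer held in memory.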
                 MINEX
    dmax < dmax_base   shrink base isochrone from border
    dmax = dmax_base   no change
    dmax > dmax_base   extend base isochrone by
                       border points and with list l_hubs

Table 2: Incremental calculation with vertex expiration

   Although the different types of computations are intro-
duced using the MINE and MINEX algorithms, they also
apply to the MRNEX method. When using MRNEX, the
same basic idea can be used to enable incremental calcula-
tions. In addition, the same advantages and disadvantages
apply to the incremental calculation using MRNEX com-
pared to MINEX that also apply to the non-incremental
setup.

4.   CONCLUSION AND FUTURE WORK
   In this paper an approach to enable the calculation of
isochrones with the help of already known results was pre-
sented. The necessary steps will be realised in the near fu-
ture, so that runtime comparisons between incrementally cal-
culated isochrones and isochrones created without the pre-
sented approach will be available shortly. The ideas devel-
oped throughout this paper hardly influence the time needed
for the calculation of base isochrones. The only additional
complexity is generated by storing a list l_hubs besides
the base isochrone. However, this is easy to manage and
since the list does not contain any complex data structures,
the changes should be doable without any noticeable conse-
quence to the runtime of the algorithms.
   Future work will extend the incremental procedure to fur-
ther calculation parameters, especially to the arrival time,
the travelling speed and the query point q of the isochrone.
Computations on top of cached results are also realisable for
changing arrival times and/or travel speeds. It should even
                                                                        be possible to use base isochrones with completely different
3.3     Evaluation                                                      query points in the context of the incremental approach. If
   The evaluations that will need to be carried out to ensure           the isochrone calculation for a duration of twenty minutes
the correctness of the implementation can be based on freely            reaches a point after five minutes the 15-minute isochrone of
available datasets, such as OpenStreetMap6 . Schedules from             this point has to be part of the computed result (if the arrival
various public transportation systems could be used and                 times are respected). Therefore, cached results can decrease
since they might be subject to licensing, it is planned to             the algorithm runtimes even for different query points, espe-
create some test schedules. This data can then be used                  cially if they are calculated for points that can cause complex
as mockups and as a replacement of the license-bound real-              calculations like airports or train stations.
world schedules. It is also planned to realise all the described           Open fields that could be addressed include the research of
tests in the context of a continuous integration setup. They            incremental calculation under conditions where public trans-
will therefore be automatically executed ensuring the cor-              portation system schedules may vary due to trouble in the
rectness throughout various software changes.                           traffic system. The influence of changes in the underlying
   The basic idea of the evaluation is to calculate incremental         spatial networks to the incremental procedure could also be
isochrones on the basis of isochrones with different durations          part of future research. It is planned to use the incremen-
and to compare them with isochrones calculated without the              tal calculation approach to calculate city round trips and to
incremental approach. If both results are exactly the same,                 allow the creation of sightseeing tours for tourists with the
the incremental calculation can be regarded as correct.                 help of isochrones. This computation will soon be enabled
   There will be various tests that need to be executed in              in cities where it is not possible by now.
order to cover all the different cases described in section                Further improvements regarding the calculation runtime
3.2. As such, all the cases will be performed with and with-            of isochrones can be done as well. In this field, some ex-
out vertex expiration. The durations of the base isochrones             aminations with different databases and even with different
will cover the three cases per algorithm (less than, equal to           types of databases (in particular graph databases and other
and greater than the duration of the incremental calculated             NoSQL systems) are planned.
isochrone). Additional tests, such as testing for vertex ex-
piration of the incremental calculation result, will be imple-          5.   REFERENCES
mented as well. Furthermore, the calculation times of both
- the incremental and the non-incremental approach - will               [1] V. Bauer, J. Gamper, R. Loperfido, S. Profanter,
                                                                            S. Putzer, and I. Timko. Computing isochrones in
6
    http://www.openstreetmap.org                                            multi-modal, schedule-based transport networks. In



    Proceedings of the 16th ACM SIGSPATIAL
    International Conference on Advances in Geographic
    Information Systems, GIS ’08, pages 78:1–78:2, New
    York, NY, USA, 2008. ACM.
[2] Z. Chen, Y. Liu, R. C.-W. Wong, J. Xiong, G. Mai, and
    C. Long. Efficient algorithms for optimal location
    queries in road networks. In SIGMOD Conference,
    pages 123–134, 2014.
[3] K. Deng, X. Zhou, H. Shen, S. Sadiq, and X. Li.
    Instance optimal query processing in spatial networks.
    The VLDB Journal, 18(3):675–693, 2009.
[4] A. Efentakis, N. Grivas, G. Lamprianidis,
    G. Magenschab, and D. Pfoser. Isochrones, traffic and
    demographics. In SIGSPATIAL/GIS, pages 538–541,
    2013.
[5] J. Gamper, M. Böhlen, W. Cometti, and
    M. Innerebner. Defining isochrones in multimodal
    spatial networks. In Proceedings of the 20th ACM
    International Conference on Information and
    Knowledge Management, CIKM ’11, pages 2381–2384,
    New York, NY, USA, 2011. ACM.
[6] J. Gamper, M. Böhlen, and M. Innerebner. Scalable
    computation of isochrones with network expiration. In
    A. Ailamaki and S. Bowers, editors, Scientific and
    Statistical Database Management, volume 7338 of
    Lecture Notes in Computer Science, pages 526–543.
    Springer Berlin Heidelberg, 2012.
[7] M. Innerebner. Isochrone in Multimodal Spatial
    Networks. PhD thesis, Free University of
    Bozen-Bolzano, 2013.
[8] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao.
    Query processing in spatial network databases. In
    Proceedings of the 29th International Conference on
    Very Large Data Bases - Volume 29, VLDB ’03, pages
    802–813. VLDB Endowment, 2003.




   Software Design Approaches for Mastering Variability in
                    Database Systems

                      David Broneske, Sebastian Dorok, Veit Köppen, Andreas Meister*
                                                *author names are in lexicographical order
                                              Otto-von-Guericke-University Magdeburg
                                     Institute for Technical and Business Information Systems
                                                        Magdeburg, Germany
                                                   firstname.lastname@ovgu.de

ABSTRACT                                                                       e.g., vectorization and SSD storage, to efficiently process
For decades, database vendors have developed traditional                       and manage petabytes of data [8]. Exploiting variability to
database systems for different application domains with high-                  design a tailor-made DBS for applications while making the
ly differing requirements. These systems are extended with                     variability manageable, that is keeping maintenance effort,
additional functionalities to make them applicable for yet                     time, and cost reasonable, is what we call mastering vari-
another data-driven domain. The database community ob-                         ability in DBSs.
served that these “one size fits all” systems provide poor per-                   Currently, DBSs are designed either as one-size-fits-all
formance for special domains; systems that are tailored for a                  DBSs, meaning that all possible use cases or functionalities
single domain usually perform better, have smaller memory                      are integrated at implementation time into a single DBS,
footprint, and less energy consumption. These advantages                       or as specialized solutions. The first approach does not
do not only originate from different requirements, but also                    scale down, for instance, to embedded devices. The second
from differences within individual domains, such as using a                    approach leads to situations, where for each new applica-
certain storage device.                                                        tion scenario data management is reinvented to overcome
   However, implementing specialized systems means to re-                      resource restrictions, new requirements, and rapidly chang-
implement large parts of a database system again and again,                    ing hardware. This usually leads to an increased time to
which is neither feasible for many customers nor efficient in                  market, high development cost, as well as high maintenance
terms of costs and time. To overcome these limitations, we                     cost. Moreover, both approaches provide limited capabilities
envision applying techniques known from software product                       for managing variability in DBSs. For that reason, software
lines to database systems in order to provide tailor-made                      product line (SPL) techniques could be applied to the data
and robust database systems for nearly every application                       management domain. In SPLs, variants are concrete pro-
scenario with reasonable effort in cost and time.                              grams that satisfy the requirements of a specific application
                                                                               domain [7]. With this, we are able to provide tailor-made
                                                                               and robust DBSs for various use cases. Initial results in the
General Terms                                                                  context of embedded systems, expose benefits of applying
Database, Software Engineering                                                 SPLs to DBSs [17, 22].
                                                                                  The remainder of this paper is structured as follows: In
Keywords                                                                       Section 2, we describe variability in a database system re-
                                                                               garding hardware and software. We review three approaches
Variability, Database System, Software Product Line                            to design DBSs in Section 3, namely, the one-size-fits-all, the
                                                                               specialization, and the SPL approach. Moreover, we com-
1. INTRODUCTION                                                                pare these approaches w.r.t. robustness and maturity of pro-
   In recent years, data management has become increasingly                    vided DBSs, the effort of managing variability, and the level
important in a variety of application domains, such as auto-                   of tailoring for specific application domains. Because of the
motive engineering, life sciences, and web analytics. Every                    superiority of the SPL approach, we argue to apply this ap-
application domain has its unique, different functional and                    proach to the implementation process of a DBS. Hence, we
non-functional requirements leading to a great diversity of                    provide research questions in Section 4 that have to be an-
database systems (DBSs). For example, automotive data                          swered to realize the vision of mastering variability in DBSs
management requires DBSs with small storage and memory                         using SPL techniques.
consumption to deploy them on embedded devices. In con-
trast, big-data applications, such as in life sciences, require
large-scale DBSs, which exploit newest hardware trends,                        2.   VARIABILITY IN DATABASE SYSTEMS
                                                                                 Variability in a DBS can be found in software as well as
                                                                               hardware. Hardware variability is given due to the use of
Copyright c by the paper’s authors. Copying permitted only                     different devices with specific properties for data processing
for private and academic purposes.
                                                                               and storage. Variability in software is reflected by differ-
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-
Workshop on Foundations of Databases (Grundlagen von Datenbanken),             ent functionalities that have to be provided by the DBS
21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.        for a specific application. Additionally, the combination of



hardware and software functionality for concrete application             Field Programmable Gate Array: FPGAs are pro-
domains increases variability.                                        grammable stream processors, providing only a limited stor-
                                                                      age capacity. They consist of several independent logic cells
2.1     Hardware                                                      consisting of a storage unit and a lookup table. The in-
   In the past decade, the research community exploited aris-         terconnect between logic cells and the lookup tables can be
ing hardware features by tailor-made algorithms to achieve            reprogrammed during run time to perform any possible func-
optimized performance. These algorithms effectively utilize,          tion (e.g., sorting, selection).
e.g., caches [19] or vector registers of Central Processing
Units (CPUs) using AVX- [27] and SSE-instructions [28].               2.1.2    Storage Devices
Furthermore, the usage of co-processors for accelerating data            Similar to the processing device, current systems offer a
processing opens up another dimension [12]. In the follow-            variety of different storage devices used for data processing.
ing, we consider processing and storage devices and sketch            In this section, we discuss different properties of current stor-
the variability arising from their different properties.              age devices.
                                                                         Hard Disk Drive: The Hard Disk Drive (HDD), as a
2.1.1    Processing Devices                                           non-volatile storage device, consists of several disks. The
   To sketch the heterogeneity of current systems, possible           disks of an HDD rotate, while a movable head reads or writes
(co-)processors are summarized in Figure 1. Current sys-              information. Hence, sequential access patterns are well sup-
tems do not only include a CPU or an Accelerated Process-             ported in contrast to random accesses.
ing Unit (APU), but also co-processors, such as Many In-                 Solid State Drive: Since no mechanical units are used,
tegrated Cores (MICs), Graphical Processing Units (GPUs),             Solid State Drives (SSDs) support random access without
and Field Programmable Gate Arrays (FPGAs). In the fol-               high delay. For this, SSDs use flash-memory to persistently
lowing, we give a short description of varying processor prop-        store information [20]. Each write wears out the flash cells.
erties. A more extensive overview is presented in our recent          Consequently, the write patterns of database systems must
work [5].                                                             be changed compared to HDD-based systems.
                                                                         Main-Memory: While using main memory as main stor-
age, the access gap between primary and secondary storage
is removed, introducing main-memory access as the new bot-
tleneck [19]. However, main-memory systems cannot omit
secondary storage types completely, because main memory is
volatile. Thus, efficient persistence mechanisms are needed
for main-memory systems.

Figure 1: Future system architecture [23] (components: CPU, APU, main memory, GPU, MIC, FPGA, HDD, and SSD, connected via front-side bus, memory bus, PCIe bus, and I/O controller)

   To conclude, current architectures offer several different
processor and storage types. Each type has a unique archi-
tecture and specific characteristics. Hence, to ensure high
performance, the processing characteristics of processors as
                                                                      well as the access characteristics of the underlying storage
   Central Processing Unit: Nowadays, CPUs consist of                 devices have to be considered. For example, if several pro-
several independent cores, enabling parallel execution of dif-        cessing devices are available within a DBS, the DBS must
ferent calculations. CPUs use pipelining, Single Instruction          provide suitable algorithms and functionality to fully utilize
Multiple Data (SIMD) capabilities, and branch prediction              all available devices to provide peak performance.
to efficiently process conditional statements (e.g., if state-
ments). Hence, CPUs are well suited for control intensive             2.2     Software Functionality
algorithms.                                                              Besides hardware, DBS functionality is another source
   Graphical Processing Unit: Providing larger SIMD reg-              of variability in a DBS. In Figure 2, we show an excerpt
isters and a higher number of cores than CPUs, GPUs offer             of DBS functionalities and their dependencies. For exam-
a higher degree of parallelism compared to CPUs. In or-               ple, for different application domains different query types
der to perform calculations, data has to be transferred from          might be interesting. However, to improve performance
main memory to GPU memory. GPUs offer an own memory                   or development cost, only required query types should be
hierarchy with different memory types.                                used within a system. This example can be extended to
   Accelerated Processing Unit: APUs are introduced to                other functional requirements. Furthermore, a DBS pro-
combine the advantages of CPUs and GPUs by including                  vides database operators, such as aggregation functions or
both on one chip. Since the APU can directly access main              joins. Thereby, database operators perform differently de-
memory, the transfer bottleneck of dedicated GPUs is elim-            pending on the used storage and processing model [1]. For
inated. However, due to space limitations, fewer GPU                 example, row stores are very efficient when complete tuples
cores fit on the APU die compared to a dedicated GPU, lead-           should be retrieved, while column stores in combination with
ing to reduced computational power compared to dedicated              operator-at-a-time processing enable fast processing of single
GPUs.                                                                 columns [18]. Another technique to enable efficient access
   Many Integrated Core: MICs use several integrated and              to data is to use index structures. Thereby, the choice of an
interconnected CPU cores. With this, MICs offer a high                appropriate index structure for the specific data and query
parallelism while still featuring CPU properties. However,            types is essential to guarantee best performance [15, 24].
similar to the GPU, MICs suffer from the transfer bottle-                Note, we omit comprehensive relationships between func-
neck.                                                                 tionalities properties in Figure 2 due to complexity. Some



Figure 2: Excerpt of DBMS-Functionality (feature diagram; legend: mandatory, optional, OR, XOR. DBS-Functionality comprises Query Type (Exact, kNN, Range), Storage Model (Row Store, Column Store), Operator (Join: Nested-loops, Block-nested-loops, Hash, Sort-merge; Selection; Sorting: Radix, Bitonic merge; Grouping: Hash-based, Sort-based), Transaction, and Processing Model (Operator-at-a-time, Tuple-at-a-time, Vectorized Processing))


functionalities are mandatory in a DBS and others are op-                                 database-application scenario. For example, a DBS for high-
tional, such as support for transactions. Furthermore, it is                              performance analysis can exploit newest hardware features,
possible that some alternatives can be implemented together                               such as SIMD, to speed up analysis workloads. Moreover,
and others only exclusively.                                                              we can meet limited space requirements in embedded sys-
                                                                                          tems by removing unnecessary functionality [22], such as the
2.3      Putting it all together                                                          support for range queries. However, exploiting variability is
   So far, we considered variability in hardware and software                             one part of mastering variability in DBSs. The second part
functionality separately. When using a DBS for a specific                                 is to manage variability efficiently to reduce development
application domain, we also have to consider special require-                             and maintenance effort.
ments of this domain as well as the interaction between hard-                                In this section, first, we describe three different approaches
ware and software.                                                                        to design and implement DBSs. Then, we compare these ap-
   Special requirements comprise functional as well as non-                               proaches regarding their applicability to arbitrary database
functional ones. Examples for functional requirements are                                 scenarios. Moreover, we assess the effort to manage vari-
user-defined aggregation functions (e.g., to perform genome                               ability in DBSs. Besides managing and exploiting the vari-
analysis tasks directly in a DBS [9]). Other applications                                 ability in database systems, we also consider the robustness
require support for spatial queries, such as geo-information                              and correctness of tailor-made DBSs created by using the
systems. Thus, special data types as well as index structures                             discussed approaches.
are required to support these queries efficiently.
   Besides performance, memory footprint and energy effi-                                 3.1       One-Size-Fits-All Design Approach
ciency are other examples for non-functional requirements.                                   One way to design database systems is to integrate all con-
For example, a DBS for embedded devices must have a small                                 ceivable data management functionality into one single DBS.
memory footprint due to resource restrictions. For that rea-                              We call this approach the one-size-fits-all design approach
son, unnecessary functionality is removed and data process-                               and DBSs designed according to this approach one-size-fits-
ing is implemented as memory efficient as possible. In this                               all DBSs. Thereby, support for hardware features as well
scenario, tuple-at-a-time processing is preferred, because in-                            as DBMS functionality are integrated into one code base.
termediate results during data processing are smaller than                                Thus, one-size-fits-all DBSs provide a rich set of functional-
in operator-at-a-time processing, which leads to less memory                              ity. Examples of database systems that follow the one-size-
consumption [29].                                                                         fits-all approach are PostgreSQL, Oracle, and IBM DB2. As
   In contrast, in large-scale data processing, operators should                          one-size-fits-all DBSs are monolithic software systems, im-
perform as fast as possible by exploiting underlying hard-                                plemented functionality is highly interconnected on the code
ware and available indexes. Thereby, exploiting underlying                                level. Thus, removing functionality is mostly not possible.
hardware is another source of variability as different pro-                                  DBSs that follow the one-size-fits-all design approach aim
cessing devices have different characteristics regarding pro-                             at providing a comprehensive set of DBS functionality to
cessing model and data access [6]. To illustrate this fact,                               deal with most database application scenarios. The claim for
we depict different storage models for DBS in Figure 2. For                               generality often introduces functional overhead that leads to
example, column-storage is preferred on GPUs, because row-                                performance losses. Moreover, customers pay for function-
storage leads to an inefficient memory access pattern that de-                            ality they do not really need.
teriorates the possible performance benefits of GPUs [13].
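The effect of the storage model discussed in Sections 2.2 and 2.3 can be illustrated with a toy example. The data and access patterns below are invented and only show why retrieving complete tuples favours a row layout while scanning a single attribute favours a column layout.

# Toy illustration (invented data): the same relation in a row layout and
# in a column layout, and the two access patterns from the text above.
rows = [                      # row store: one record per tuple
    {"id": 1, "price": 10.0, "qty": 3},
    {"id": 2, "price": 4.5, "qty": 7},
]
columns = {                   # column store: one array per attribute
    "id":    [1, 2],
    "price": [10.0, 4.5],
    "qty":   [3, 7],
}

# Retrieving a complete tuple is one lookup in the row store ...
first_tuple = rows[0]
# ... but needs one access per attribute in the column store.
first_tuple_from_columns = {name: col[0] for name, col in columns.items()}

# Aggregating a single attribute scans one dense array in the column store,
# whereas the row store touches every attribute of every tuple.
total_price_rows = sum(r["price"] for r in rows)
total_price_cols = sum(columns["price"])
assert total_price_rows == total_price_cols == 14.5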
                                                                                          3.2       Specialization Design Approach
3. APPROACHES TO DESIGN TAILOR-                                                              In contrast to one-size-fits-all DBSs, DBSs can also be de-
                                                                                          signed and developed to fit very specific use cases. We call
   MADE DATABASE SYSTEMS                                                                  this design approach the specialization design approach and
  The variability in hardware and software of DBSs can                                    DBSs designed accordingly, specialized DBSs. Such DBSs
be exploited to tailor database systems for nearly every                                  are designed to provide only that functionality that is needed



for the respective use case, such as text processing, data                                                                                                                                                                                                                                                                                      a) general applicability to arbitrary database applications,
warehousing, or scientific database applications [25]. Spe-                                                                                                                                                                                                                                                                                     b) effort for managing variability, and
cialized DBSs are often completely redesigned from scratch                                                                                                                                                                                                                                                                                      c) maturity of the deployed database system.
to meet application requirements and do not follow common                                                                                                                                                                                                                                                                                    Although the one-size-fits-all design approach aims at pro-
design considerations for database systems, such as locking                                                                                                                                                                                                                                                                                  viding a comprehensive set of DBS functionality to deal
and latching to guarantee multi-user access [25]. Specialized                                                                                                                                                                                                                                                                                with most database application scenarios, a one-size-fits-all
DBSs remove the overhead of unneeded functionality. Thus,                                                                                                                                                                                                                                                                                    database is not applicable to use cases in automotive, em-
developers can highly focus on exploiting hardware and func-                                                                                                                                                                                                                                                                                 bedded, and ubiquitous computing. As soon as tailor-made
tional variability to provide tailor-made DBSs that meet                                                                                                                                                                                                                                                                                     software is required to meet especially storage limitations,
high-performance criteria or limited storage space require-                                                                                                                                                                                                                                                                                  one-size-fits-all database systems cannot be used. Moreover,
ments. Therefore, huge parts of the DBS (if not all) must                                                                                                                                                                                                                                                                                    specialized database systems for one specific use case outper-
be newly developed, implemented, and tested which leads                                                                                                                                                                                                                                                                                      form one-size-fits-all database systems by orders of magni-
to duplicate implementation efforts, and thus, increased de-                                                                                                                                                                                                                                                                                 tude [25]. Thus, although one-size-fits-all database systems
velopment costs.                                                                                                                                                                                                                                                                                                                             can be applied, they are often not the best choice regarding
                                                                                                                                                                                                                                                                                                                                             performance. For that reason, we consider the applicability
                                                                                                                                                                                                                                                                                                                                             of one-size-fits-all database systems to arbitrary use cases
                                                                                                                                                                                                                                                                                                                                             as limited. In contrast, specialized database systems have
3.3          Software Product Line Design Approach                                                                                                                                                                                                                                                                                           a very good applicability as they are designed for that pur-
  In the specialization design approach, a new DBS must                                                                                                                                                                                                                                                                                      pose.
be developed and implemented from scratch for every con-                                                                                                                                                                                                                                                                                        The applicability of the SPL design approach is good as
ceivable database application. To avoid this overhead, the                                                                                                                                                                                                                                                                                   it also creates database systems tailor-made for specific use
SPL design approach reuses already implemented and tested                                                                                                                                                                                                                                                                                    cases. Moreover, the SPL design approach explicitly consid-
parts of a DBS to create a tailor-made DBS.                                                                                                                                                                                                                                                                                                  ers variability during software design and implementation
                                                                                                                                                                                                                                                                                                                                             and provides methods and techniques to manage it [2]. For
                                                    Domain Analysis                                                                                                                                                                                                Domain Implementation
                                                                                                                              FAME-DBMS




                                                                                                                          Buffer Manager
                                                                                                                                                                                                                                                     refines class Btree
                                                                                                                                                                                                                                                     {
                                                                                                                                                                                                                                                                                                                                             that reason, we assess the effort of managing variability with
  Domain
                                                                                                                                                                                                                                                                                                                                             the SPL design approach as lower than managing variability
                                                       Storage                                                                                                                                                                                        public :
                                                                                                                                                                                                                                                      bool PutData(RECORD& r); enum DataTypes
                     OS-Abstraction                                                                   Memory Alloc                               Replacement

                                                                                                                                                                                                                                                                                                       #include
                                                                                                                                                                                                                                                                                 {                       "BtreeIndexPage.h"
 knowledge                                                                                                                                                                                                                               Mapping                                       DataType_None,
             Win32       Linux          NutOS                       Storage                     Dynamic              Static                LFU                 LRU                     Access


                                                  Data Types                               Index                                                                                                                                                                                       DataType_Bool, refines class Btree {
                                                                                                                                                                                                                                                      };
                                                                                                                                                                                                                                                                                       DataType_Byte, BtreePageRef

                                                                                                                                                                                                                                                                                                                                             using a one-size-fits-all or specialized design approach.
                                           Data Dictionary         Data Types           Index                                                                  API         Optimizer            Transaction      SQL Engine



                                                                                                                                                                                                                                                     #include "include.h"              DataType_Short, GetNewPage()
                                                                                                            B+-Tree
                                                                                                                                                                                                        Stream-based        Relational
                                      Tables             Columns
                                                                                 List
                                                                                 List             B+-Tree                      update             remove             get                put
                                                                                                                                                                                                           queries           queries
                                                                                                                                                                                                                                                     #include "DataDictionary.h"       ...                        {
                                                                                                                                                                                                                                                     #include "PrimaryIndex.h"
                                                   add                  search            remove              update
                                                                                                                                                                                                Aggregation
                                                                                                                                                                                                  queries
                                                                                                                                                                                                                Select queries                                                                                  .
                                                                                                                                                                                                                                                                                 };


                                                                                                                                                                                                                                                                                                                                                We assess the maturity of one-size-fits-all database sys-
                                                                                                                                                                                                                                                                                                                .
                                                                                                                                                                                                                                                     class Btree : public
                                                                                                                                                                                                                                                                                                                .
                                                                                                                                                                                                                                                     PrimaryIndex
                                                                                                                                                                                                                                                                                                                  }
                                                                                                                                                                                                                                                     {
                                                                                                                                                                                                                                                     public:
                                                                                                                                                                                                                                                     Btree() :PrimaryIndex(true)

                                                                                                                                                                                                                                                     }
                                                                                                                                                                                                                                                                {…}
                                                                                                                                                                                                                                                                                                                                             tems as very good, as these systems are developed and tested
                          New
                                                                                                                                                     Features
                                                                                                                                                                                                                                                                                                  Common
                                                                                                                                                                                                                                                                                               implementation
Figure 3: Managing Variability (customer requirements → customization / feature selection over the FAME-DBMS feature model: OS-Abstraction, Storage with Data Dictionary, Data Types, and Index (List, B+-Tree), Buffer Manager, Access → product generation from implementation artifacts → tailored product)

   To make use of SPL techniques, a special workflow has to be followed, which is sketched in Figure 3 [2]. At first, the domain is modeled, e.g., by using a feature model – a tree-like structure representing features and their dependencies. With this, the variability is captured and implementation artifacts can be derived for each feature. The second step, the domain implementation, is to implement each feature using a compositional or an annotative approach. The third step of the workflow is to customize the product – in our case, the database system – which is then generated.
   By using the SPL design approach, we are able to implement a database system from a set of features that are mostly already provided. In the best case, only non-existing features must be implemented. Thus, the feature pool constantly grows and features can be reused in other database systems. Applying this design approach to DBSs makes it possible to create DBSs tailored to specific use cases while reducing functional overhead as well as development time. Thus, the SPL design approach aims at the middle ground between the one-size-fits-all and the specialization design approach.
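To make this workflow more tangible, the following minimal Python sketch mimics its three steps: domain modeling as a feature tree, customization as a feature selection, and product generation from per-feature implementation artifacts. The feature names follow Figure 3, but the artifact file names and the dependency rule are invented for illustration; this is not the FAME-DBMS implementation.

# Step 1: domain modeling -- a tiny feature tree mirroring Figure 3
# (shown only for documentation; the rules below use flat names).
FEATURE_TREE = {
    "FAME-DBMS": ["OS-Abstraction", "Storage", "Buffer Manager", "Access"],
    "Storage": ["Data Dictionary", "Data Types", "Index"],
    "Index": ["List", "B+-Tree"],
}
REQUIRES = {"B+-Tree": {"Data Types"}}   # invented cross-tree constraint

# Step 2: domain implementation -- one implementation artifact per feature
# (file names are placeholders).
ARTIFACTS = {
    "OS-Abstraction": "os_abstraction.c",
    "Data Dictionary": "data_dictionary.c",
    "Data Types": "data_types.c",
    "List": "index_list.c",
    "B+-Tree": "index_btree.c",
    "Buffer Manager": "buffer_manager.c",
    "Access": "access.c",
}

def generate_product(selection):
    """Step 3: customization and product generation.

    Validates the selection against the dependency rule and returns the
    implementation artifacts that make up the tailor-made DBMS."""
    for feature in selection:
        missing = REQUIRES.get(feature, set()) - set(selection)
        if missing:
            raise ValueError(f"{feature} requires {missing}")
    return [ARTIFACTS[f] for f in selection if f in ARTIFACTS]

# An embedded product that needs only a list index and no buffer manager:
print(generate_product(["OS-Abstraction", "Data Types", "List", "Access"]))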
3.4   Characterization of Design Approaches
   In this section, we characterize the three design approaches discussed above regarding a) applicability, b) management effort, and c) maturity.
   One-size-fits-all DBMSs have matured over decades. Specialized database systems are mostly implemented from scratch, so the probability of errors in the code is rather high, leading to a moderate maturity and robustness of the software. The SPL design approach also enables the creation of tailor-made database systems, but from approved features that are already implemented and tested. Thus, we assess the maturity of database systems created via the SPL design approach as good.
   In Table 1, we summarize our assessment of the three software design approaches regarding the above criteria.

                                      Approach
 Criteria                One-Size-Fits-All   Specialization   SPL
 a) Applicability                −                 ++          +
 b) Management effort            −                 −           +
 c) Maturity                    ++                 ◦           +

Table 1: Characteristics of approaches
Legend: ++ = very good, + = good, ◦ = moderate, − = limited

   The one-size-fits-all and the specialization design approach are each very good in one of the three categories. The one-size-fits-all design approach provides robust and mature DBSs. The specialization design approach provides the greatest applicability and can be used for nearly every use case. The SPL design approach, in contrast, provides a balanced assessment regarding all criteria. Thus, against the backdrop of increasing variability due to the increasing variety of use cases and hardware, and the need to guarantee mature and robust DBSs, the SPL design approach should be applied to develop future DBSs. Otherwise, the development costs for yet another DBS that has to meet the special requirements of the next data-driven domain will limit the use of DBSs in such fields.



                                                                                                                                                                                                                                                                                                                                        50
4.    ARISING RESEARCH QUESTIONS
   Our assessment in the previous section shows that the SPL design approach is the best choice for mastering variability in DBSs. To the best of our knowledge, the SPL design approach has been applied to DBSs only in academic settings (e.g., in [22]). This previous research was based on BerkeleyDB. Although BerkeleyDB offers the essential functionality of a DBS (e.g., a processing engine), several functionalities of relational DBSs are missing (e.g., optimizer, SQL interface). Although this missing functionality has been partially researched (e.g., the storage manager [16] and the SQL parser [26]), no holistic evaluation of a DBS SPL is available. In particular, the optimizer of a DBS (e.g., the query optimizer), with its huge number of crosscutting concerns, is currently not considered in research. So, there is still a need for research to fully apply SPL techniques to all parts of a DBS. Specifically, we need methods for modeling variability in DBSs, efficient implementation techniques, and methods for implementing variability-aware database operations.

4.1   Modeling
   For modeling variability in feature-oriented SPLs, feature models are state of the art [4]. A feature model is a set of features whose dependencies are hierarchically modeled. Since variability in DBSs comprises hardware, software, and their interaction, the following research questions arise:

RQ-M1: What is a good granularity for modeling a variable DBS?
In order to define an SPL for DBSs, we have to model the features of a DBS. Such features can be modeled at different levels of granularity [14]. Thus, we have to find an applicable level of granularity for modeling our SPL for DBSs. Moreover, we also have to consider the dependencies between hardware and software. Furthermore, we have to find a way to model the hardware and these dependencies. In this context, another research question emerges:

RQ-M2: What is the best way to model hardware and its properties in an SPL?
Hardware has become very complex, and researchers demand to develop a better understanding of the impact of hardware on algorithm performance, especially when parallelized [3, 5]. Thus, the question arises which properties of the hardware are worth capturing in a feature model. Furthermore, when thinking about numerical properties, such as CPU frequency or the amount of memory, we have to find a suitable technique to represent them in feature models. One possibility is to use attributes of extended feature models [4], which has to be explored for applicability.
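One way to make RQ-M2 concrete is sketched below in Python: numerical hardware properties are attached as attributes to features of an extended feature model, and a single cross-tree constraint between a software feature and a hardware attribute is checked. The feature names, attribute values, and the constraint are invented for illustration only.

from dataclasses import dataclass, field

@dataclass
class Feature:
    """A feature of the product line, optionally carrying numeric attributes."""
    name: str
    children: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)  # e.g. hardware properties

# Hypothetical hardware features with attributes such as CPU frequency and memory size.
hardware = Feature("Hardware", children=[
    Feature("CPU", attributes={"frequency_ghz": 2.4, "cores": 8, "simd_width_bits": 256}),
    Feature("Memory", attributes={"capacity_gb": 16}),
])

storage = Feature("Storage", children=[
    Feature("Index", children=[Feature("List"), Feature("BPlusTree")]),
    Feature("BufferManager", attributes={"min_memory_gb": 4}),
])

def violates_memory_constraint(selection, hw):
    """Example cross-tree constraint between a software feature attribute
    and a hardware attribute: the buffer manager needs enough main memory."""
    mem = next(f for f in hw.children if f.name == "Memory").attributes["capacity_gb"]
    return any(f.attributes.get("min_memory_gb", 0) > mem for f in selection)

selection = [storage.children[1]]                        # BufferManager selected
print(violates_memory_constraint(selection, hardware))   # -> False on this hardware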
4.2   Implementing
   In the literature, there are several methods for implementing an SPL. However, most of them are not applicable to our use case. Databases rely on highly tuned operations to achieve peak performance. Thus, variability-enabled implementation techniques must not harm the performance, which leads to the research question:

RQ-I1: What is a good variability-aware implementation technique for an SPL of DBSs?
Many state-of-the-art implementation techniques are based on inheritance or additional function calls, which cause performance penalties. A technique that allows for variability without performance penalties is preprocessor directives. However, maintaining preprocessor-based SPLs is horrible, which has earned this approach the name #ifdef Hell [11, 10]. So, there is a trade-off between performance and maintainability [22], but also granularity [14]. It could be beneficial to prioritize maintainability for some parts of a DBS and performance for others.

RQ-I2: How to combine different implementation techniques for SPLs?
If the answer to RQ-I1 is to use different implementation techniques within the same SPL, we have to find an approach to combine them. For example, database operators and their different hardware optimizations must be implemented using annotative approaches for performance reasons, whereas the query optimizer can be implemented using compositional approaches supporting maintainability; the SPL product generator has to be aware of these different implementation techniques and their interactions.
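The two families of techniques mentioned in RQ-I1 and RQ-I2 can be contrasted in a small Python sketch. Ordinary booleans stand in here for the compile-time #ifdef directives of the annotative approach, and a trivial mixin composition stands in for compositional feature modules; feature names and the refinement are invented for illustration.

# Annotative style: one code base, variability marked inside the code.
# In C/C++ the flags below would be #ifdef blocks resolved at build time.
FEATURE_BTREE = True
FEATURE_STATISTICS = False

def scan_annotative(rows, key):
    result = [r for r in rows if r[0] == key]
    if FEATURE_BTREE:
        result.sort()                      # stand-in for index-supported access
    if FEATURE_STATISTICS:
        print("lookup:", key)              # optional bookkeeping feature
    return result

# Compositional style: each feature is a separate module (here: a mixin)
# and the generator composes only the selected ones into a product.
class BaseScan:
    def scan(self, rows, key):
        return [r for r in rows if r[0] == key]

class BTreeFeature:
    def scan(self, rows, key):
        return sorted(super().scan(rows, key))   # refines the base behavior

def compose(*features):
    """Generate a product class from the selected feature modules."""
    return type("Product", features + (BaseScan,), {})

Product = compose(BTreeFeature)
rows = [("b", 2), ("a", 1), ("a", 3)]
print(scan_annotative(rows, "a"), Product().scan(rows, "a"))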
RQ-I3: How to deal with functionality extensions?
Thinking about changing requirements during the usage of the DBS, we should be able to extend the functionality in case user requirements change. Therefore, we have to find a solution to deploy updates from an extended SPL in order to integrate newly requested functionality into a running DBS. Some ideas are presented in [21]; however, due to the increased complexity of hardware and software requirements, an adaptation or extension is necessary.
4.3   Customization
   In the final customization, features of the product line that apply to the current use case are selected. State-of-the-art approaches just list available features and show which features are still available for further configuration. However, in our scenario, it could be helpful to get further information about the configuration possibilities. Thus, another research question is:

RQ-C1: How to support the user in obtaining the best selection?
In fact, it is possible to help the user in identifying suitable configurations for his use case. If he starts to select functionality that has to be provided by the generated system, we can give him advice on which hardware yields the best performance for his algorithms. However, to achieve this, we have to investigate another research question:

RQ-C2: How to find the optimal algorithms for a given hardware?
To answer this research question, we have to investigate the relation between algorithmic design and the impact of the hardware on the execution. Hence, suitable properties of algorithms have to be identified that influence performance on the given hardware, e.g., access pattern, size of used data structures, or result sizes.
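A possible direction for RQ-C1 and RQ-C2 is sketched below: properties of the available hardware and of candidate operator variants are matched by a simple scoring rule in Python. The property names, the variants, and the scoring rule itself are purely illustrative assumptions, not measured results.

# Hypothetical hardware description, e.g. taken from feature model attributes.
hardware = {"simd_width_bits": 256, "cores": 8, "l3_cache_mb": 20}

# Candidate scan variants with the hardware properties they benefit from.
variants = [
    {"name": "branching scan", "needs_simd": 0, "parallel": False},
    {"name": "vectorized scan", "needs_simd": 128, "parallel": False},
    {"name": "parallel vectorized scan", "needs_simd": 128, "parallel": True},
]

def choose_variant(hw, candidates):
    """Pick the 'best' applicable variant under a toy scoring rule."""
    def score(v):
        if v["needs_simd"] > hw["simd_width_bits"]:
            return -1                      # not applicable on this hardware
        return v["needs_simd"] + (hw["cores"] if v["parallel"] else 0)
    return max(candidates, key=score)["name"]

print(choose_variant(hardware, variants))  # -> "parallel vectorized scan"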
5.    CONCLUSIONS
   DBSs are used for more and more use cases. However, with an increasing diversity of use cases and increasing heterogeneity of available hardware, it is getting more



                                                                 51
challenging to design an optimal DBS while guaranteeing low implementation and maintenance effort at the same time. To solve this issue, we review three design approaches, namely the one-size-fits-all, the specialization, and the software product line design approach. By comparing these three design approaches, we conclude that the SPL design approach is a promising way to master variability in DBSs and to provide mature data management solutions with reduced implementation and maintenance effort. However, there is currently no comprehensive software product line available in the field of DBSs. Thus, we present several research questions that have to be answered to fully apply the SPL design approach to DBSs.

6.   ACKNOWLEDGMENTS
   This work has been partly funded by the German BMBF under Contract No. 13N10818 and Bayer Pharma AG.
7.   REFERENCES
 [1] D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. Row-stores: How Different Are They Really? In SIGMOD, pages 967–980. ACM, 2008.
 [2] S. Apel, D. Batory, C. Kästner, and G. Saake. Feature-Oriented Software Product Lines. Springer, 2013.
 [3] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB, 7(1):85–96, 2013.
 [4] D. Benavides, S. Segura, and A. Ruiz-Cortés. Automated Analysis of Feature Models 20 Years Later: A Literature Review. Inf. Sys., 35(6):615–636, 2010.
 [5] D. Broneske, S. Breß, M. Heimel, and G. Saake. Toward Hardware-Sensitive Database Operations. In EDBT, pages 229–234, 2014.
 [6] D. Broneske, S. Breß, and G. Saake. Database Scan Variants on Modern CPUs: A Performance Study. In IMDM@VLDB, 2014.
 [7] K. Czarnecki and U. W. Eisenecker. Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co., 2000.
 [8] S. Dorok, S. Breß, H. Läpple, and G. Saake. Toward Efficient and Reliable Genome Analysis Using Main-Memory Database Systems. In SSDBM, pages 34:1–34:4. ACM, 2014.
 [9] S. Dorok, S. Breß, and G. Saake. Toward Efficient Variant Calling Inside Main-Memory Database Systems. In BIOKDD-DEXA. IEEE, 2014.
[10] J. Feigenspan, C. Kästner, S. Apel, J. Liebig, M. Schulze, R. Dachselt, M. Papendieck, T. Leich, and G. Saake. Do Background Colors Improve Program Comprehension in the #ifdef Hell? Empir. Softw. Eng., 18(4):699–745, 2013.
[11] J. Feigenspan, M. Schulze, M. Papendieck, C. Kästner, R. Dachselt, V. Köppen, M. Frisch, and G. Saake. Supporting Program Comprehension in Large Preprocessor-Based Software Product Lines. IET Softw., 6(6):488–501, 2012.
[12] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. TODS, 34(4):21:1–21:39, 2009.
[13] B. He and J. X. Yu. High-throughput Transaction Executions on Graphics Processors. PVLDB, 4(5):314–325, Feb. 2011.
[14] C. Kästner, S. Apel, and M. Kuhlemann. Granularity in Software Product Lines. In ICSE, pages 311–320. ACM, 2008.
[15] V. Köppen, M. Schäler, and R. Schröter. Toward Variability Management to Tailor High Dimensional Index Implementations. In RCIS, pages 452–457. IEEE, 2014.
[16] T. Leich, S. Apel, and G. Saake. Using Step-wise Refinement to Build a Flexible Lightweight Storage Manager. In ADBIS, pages 324–337. Springer-Verlag, 2005.
[17] J. Liebig, S. Apel, C. Lengauer, and T. Leich. RobbyDBMS: A Case Study on Hardware/Software Product Line Engineering. In FOSD, pages 63–68. ACM, 2009.
[18] A. Lübcke, V. Köppen, and G. Saake. Heuristics-based Workload Analysis for Relational DBMSs. In UNISCON, pages 25–36. Springer, 2012.
[19] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Database Architecture for the New Bottleneck: Memory Access. VLDB J., 9(3):231–246, 2000.
[20] R. Micheloni, A. Marelli, and K. Eshghi. Inside Solid State Drives (SSDs). Springer, 2012.
[21] M. Rosenmüller. Towards Flexible Feature Composition: Static and Dynamic Binding in Software Product Lines. Dissertation, University of Magdeburg, Germany, June 2011.
[22] M. Rosenmüller, N. Siegmund, H. Schirmeier, J. Sincero, S. Apel, T. Leich, O. Spinczyk, and G. Saake. FAME-DBMS: Tailor-made Data Management Solutions for Embedded Systems. In SETMDM, pages 1–6. ACM, 2008.
[23] M. Saecker and V. Markl. Big Data Analytics on Modern Hardware Architectures: A Technology Survey. In eBISS, pages 125–149. Springer, 2012.
[24] M. Schäler, A. Grebhahn, R. Schröter, S. Schulze, V. Köppen, and G. Saake. QuEval: Beyond High-Dimensional Indexing à la Carte. PVLDB, 6(14):1654–1665, 2013.
[25] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). In VLDB, pages 1150–1160, 2007.
[26] S. Sunkle, M. Kuhlemann, N. Siegmund, M. Rosenmüller, and G. Saake. Generating Highly Customizable SQL Parsers. In SETMDM, pages 29–33. ACM, 2008.
[27] T. Willhalm, I. Oukid, I. Müller, and F. Faerber. Vectorizing Database Column Scans with Complex Predicates. In ADMS@VLDB, pages 1–12, 2013.
[28] J. Zhou and K. A. Ross. Implementing Database Operations Using SIMD Instructions. In SIGMOD, pages 145–156. ACM, 2002.
[29] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, CWI Amsterdam, 2009.
                                                                52
              PageBeat - Zeitreihenanalyse und Datenbanken

                  Andreas Finger                        Ilvio Bruder                        Andreas Heuer
                 Institut für Informatik                Institut für Informatik             Institut für Informatik
                  Universität Rostock                   Universität Rostock                 Universität Rostock
                    18051 Rostock                       18051 Rostock                       18051 Rostock
               andreas.finger@uni-rostock.de            ilvio.bruder@uni-rostock.de         andreas.heuer@uni-rostock.de

                              Steffen Konerow                          Martin Klemkow
                              Mandarin Medien GmbH                     Mandarin Medien GmbH
                              Graf-Schack-Allee 9                      Graf-Schack-Allee 9
                              19053 Schwerin                           19053 Schwerin
                              sk@mandarin-medien.de                    mk@mandarin-medien.de

ABSTRACT
Time series data and their analysis are an important means for assessment, control, and prediction in many application areas. For time series analysis there is a large variety of methods and techniques that are implemented in statistics software and can nowadays be used conveniently without any implementation effort of one's own. In most cases one is dealing with massive amounts of data or even data streams. Accordingly, there are specialized management tools, such as data stream management systems for processing data streams or time series databases for storing and querying time series. This article gives a brief overview of these areas and, in particular, illustrates their applicability in a project for analyzing and predicting the state of web servers. The challenge within this project, "PageBeat", is to analyze massive numbers of time series in real time and to store them for further analysis processes. In addition, the results are to be prepared and visualized for specific target groups, and notifications are to be triggered. The article describes the approach chosen in the project and the techniques and tools used for it.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

General Terms
Big Data, Data Mining and Knowledge Discovery, Streaming Data

Keywords
Data analysis, R, Time Series Database

1.   INTRODUCTION
Time series are naturally ordered sequences of observed values. Time series analysis deals with methods for describing such data, for instance with the goal of analyzing (understanding), predicting, or controlling them. Corresponding methods are available in free and commercial statistics software such as R1, Matlab2, Weka3 [7], SPSS4, and others, which allows convenient data evaluation without any implementation effort of one's own. Typical time series analysis techniques are, for example, the detection of trend and seasonality, where the trend represents the long-term increase and seasonality represents recurring patterns (every year at Christmas, sales go up). In this way, dependencies in the data are examined which make it possible to forecast future values with the help of suitable models. In an application that captures a large number of measurements at high temporal resolution, large amounts of data accumulate quickly. These data are to be analyzed in real time and, if required, stored persistently for further evaluation. For this purpose there exist, on the one hand, approaches from data stream processing and, on the other hand, database systems specialized in storing time series (time series databases). Since statistical analyses with, for example, stand-alone R applications only work as long as the data to be analyzed do not exceed the size of main memory, it is necessary to integrate the statistical analysis into database

1 R – programming language for statistical computing and visualization by the R Foundation for Statistical Computing, http://www.r-project.org.
2 Matlab – commercial software for solving and visualizing mathematical problems by the developer The MathWorks, http://www.mathworks.de.
3 Weka – Waikato Environment for Knowledge Analysis, a toolkit for data mining and machine learning by the University of Waikato, http://www.cs.waikato.ac.nz/ml/weka/.
4 SPSS – commercial statistics and analytics software by IBM, http://www-01.ibm.com/software/de/analytics/spss.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.



                                                                          53
systems. The goal here is transparent access to partitioned data and their analysis by means of partitioned statistical models. In [6], various options for such an integration are described and have already been implemented in prototypes based on PostgreSQL. Commercial products such as Oracle R Enterprise [4] also integrate statistical analysis at the database level. In the open-source area there is a large variety of approaches for handling time series; among them, InfluxDB5 stood out to us as a particularly suitable tool.
The challenge within the project "PageBeat", described in the following, is to combine innovative and production-ready open-source solutions from the above-mentioned areas for processing large amounts of time series data. In the following, the project is introduced, then several candidate techniques are discussed, and finally the chosen concept and first results are presented.
2.   THE PAGEBEAT PROJECT
With "PageBeat", a software suite offered as "Software as a Service" (SaaS) is being developed specifically for observing and checking web applications. This is initially done within a ZIM cooperation project funded by the German Federal Ministry of Economics. The goal of the software is to observe and report on the current technical status of a web application (website, content management system, e-commerce system, web service) and to predict technical problems on the basis of suitable indicators (hardware- and software-specific parameters). The reports are prepared and presented for different user groups (system administrators, software developers, heads of department, management, marketing) and their respective requirements. By means of "PageBeat", error reports are thus generated automatically; they inform about acute as well as foreseeable critical changes of the operating parameters of a web application and are presented in a way specific to each target group.
The underlying key figures are a set of data that reflect the state of the overall system in the application domain of web-shop systems. These are metrics of the server operating system (such as CPU or RAM utilization) as well as application-specific metrics (such as the runtime of database queries). These data are described semantically, and the corresponding metadata are stored in a knowledge base. Beyond that, the use of further context information that can influence the technical status of the system is envisaged. This may, for instance, be weather data: for the cinema operator Cinestar, a rainy weekend in the forecast suggests a high load on the online cinema-ticket shop. Another example would be information from software development: for code changes with a certain timestamp, effects can be traced in the analyses at that point in time. Changing, adding, or paying attention to relevant content on the web pages can lead to significant changes in the analyses, e.g., when advertisements are placed or when films about to be released are rated on social platforms.
   Currently, as broad a spectrum of data as possible is collected at high temporal resolution in order to be able to infer, in a process of data exploration, relationships that are not obvious at first, or to validate assumptions. At present, more than 300 metrics are sampled every 10 s on 14 servers from 9 customer projects. These data are stored and, in addition, processed immediately. For example, downsampling takes place for all of the 300 metrics mentioned: the temporal resolution is reduced to time windows of different sizes using various aggregate functions, and the results are stored. Other analysis functions quantize the values with respect to their membership in status classes (such as optimal, normal, critical) and likewise store the results. In this way, large amounts of data accumulate very quickly. Currently, the data store contains about 40 GB of data, and at the current number of observed values we see a growth of about 1 GB of data per week. On the basis of the collected data, time-critical analyses such as outlier detection or the detection of critical patterns must be carried out almost in real time in order to allow customers to intervene in time. Furthermore, a prediction of future values is intended to reveal critical developments early on. The challenge in the project is to cope with the large data volume while guaranteeing near-real-time processing by the analysis functions.
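The downsampling and status classification just described can be illustrated with a small Python sketch; the window size, aggregate function, metric values, and thresholds below are invented for illustration and do not reflect the thresholds actually used in PageBeat.

from statistics import mean

# 10-second samples of a single metric (here: CPU utilisation in percent);
# values and timestamps are made up for illustration.
samples = [(1400000000 + i * 10, 20.0 + (i % 6) * 12) for i in range(18)]

def downsample(points, window_s=60, agg=mean):
    """Reduce the temporal resolution by aggregating fixed-size time windows."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % window_s, []).append(value)
    return [(start, agg(values)) for start, values in sorted(buckets.items())]

def status(value, normal_from=30.0, critical_from=80.0):
    """Quantize a value into the status classes optimal / normal / critical
    (thresholds are illustrative only)."""
    if value >= critical_from:
        return "critical"
    return "normal" if value >= normal_from else "optimal"

for window_start, avg in downsample(samples):
    print(window_start, round(avg, 1), status(avg))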
3.   TIME SERIES ANALYSIS AND DATABASES
As part of evaluating software suitable for the project, we examined various approaches to data stream processing and to the analysis and management of time series. The goal was to use freely available software that, in addition, builds on technical expertise already present in the company.

3.1   Data Stream Management Systems
Processing continuous data streams is one aspect of our project. Data stream processing systems offer the possibility to formulate continuous queries over data streams that are converted into temporary relations. This can be done, for instance, with operators of the SQL-like Continuous Query Language [2] developed in the Stream project [1]. If more complex patterns are to be recognized in data streams, one also speaks of complex event processing. In the context of our project, such a pattern corresponds, for example, to an increase in page views due to a marketing campaign, which leads to a higher system load (cpu-usage), which in turn is reflected in rising time-to-first-byte values and, in a critical range, should trigger a notification or even an automatic scaling-up of the available resources. Complex event processing systems such as Esper [5] offer the possibility to formulate queries for such patterns over data streams and to implement corresponding reactions. Since Esper, as one of the few freely available systems suitable for production use, is implemented in Java and .NET, and the corresponding development capacities are not available in the company, none of the DSMSs or CEP systems mentioned will be used in the project.

5 InfluxDB - An open-source distributed time series database with no external dependencies. http://influxdb.com.



                                                                 54
However, their architecture served as a guide for developing our own system for PageBeat, built with technologies already in use in the company (such as node.js6, RabbitMQ7, MongoDB8, and others).

3.2   Data Analysis Tools
For the statistical evaluation of the data in the project, tools are needed that make it possible, without great implementation effort, to apply various methods to the collected data and to examine their suitability. Various mathematical tools are available for this. Commercial products include the already mentioned Matlab and SPSS. Among freely available software, one can fall back on WEKA and, above all, R. R in particular is very widespread and is supported by a large developer community. As a result, a large number of methods for data preparation, statistical analysis, and the corresponding visualization have already been implemented for R. Especially with regard to the analysis of time series, R is the more suitable choice compared to WEKA because of the wide range of available time series analysis packages. With RStudio9, a comfortable development environment is available as well. Furthermore, the web framework Shiny10 allows R applications to be made available on the web quickly and thus supports rapid application development. R with its associated extensions therefore constitutes a suitable environment for the project for evaluating data analysis methods and for data exploration. In the further course of the project, and in the transition to a production system, the data analysis, such as the computation of predictions, will be reimplemented within node.js.
3.3   Database Support
Classical object-relational DBMSs such as Oracle11, IBM Informix12, and PostgreSQL13 support the storage, querying, and evaluation of time series to varying degrees. PostgreSQL, for example, allows the use of window functions, e.g., for computing aggregate values over corresponding time intervals. The IBM Informix TimeSeries Solution [3] provides containers for storing time series data, which is intended to optimize storage requirements, increase query speed, and reduce the complexity of queries. Oracle not only supports storing and querying time series but also integrates comprehensive statistical analysis functionality by means of Oracle R Technologies [4]. Here, the R application developer has the option of using Oracle Data Frames in order to achieve data locality: the code is executed in the Oracle environment, where the data reside, and not the other way around. Moreover, this provides transparent access to the data, and scaling aspects are handled by the DBMS.
Besides the classical ORDBMSs, there is a large number of databases specialized in time series, such as OpenTSDB14, KairosDB15, and RRDB16. Each of these is a write-optimized data store in the form of a schemaless database together with query, analysis, and visualization functionality built on top of it. They should therefore rather be characterized as event-processing or monitoring systems. Beyond the time series databases mentioned so far, InfluxDB17 caught our attention while researching software suitable for the project. InfluxDB uses Google's key-value store LevelDB18, which is based on log-structured merge-trees, and thus aims at a high throughput for write operations. A disadvantage, on the other hand, are lengthy delete operations for entire time ranges that are no longer needed. When stored, the individual time series are partitioned sequentially into so-called shards, where each shard is stored in a separate database. Setting up different shard spaces in advance (4 hours, 1 day, 1 week, etc.) makes it possible to compensate for the slow deletion of time ranges by simply dropping entire shards, i.e., entire databases (drop database). Distributing the shards across different computing nodes, which in turn can be organized in different clusters, allows the data to be distributed and, if desired, also replicated redundantly to different nodes. Distributing the data across different nodes also makes it possible to distribute the computation of aggregates over time windows smaller than the shard size, thus achieving data locality and a performance advantage. Here, too, it is advisable to plan shard sizes ahead. Queries to InfluxDB can be formulated by means of an SQL-like query language via an HTTP interface. Various aggregate functions are provided which produce output grouped, for example, by time intervals over an entire time range; the use of regular expressions is supported:

select median(used) from /cpu\.*/
where time > now() - 4h group by time(5m)

This computes and outputs the median of the "used" value for every 5-minute window of the last 4 hours, for all CPUs.
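To illustrate the HTTP interface, a small Python sketch using the requests library is shown below. The endpoint path and the u/p/q parameters follow the 0.8-era InfluxDB API as far as we can tell, so they, as well as host, database name, and credentials, should be treated as assumptions and checked against the InfluxDB documentation.

import requests  # third-party HTTP client

# Assumed 0.8-style InfluxDB HTTP API; adjust endpoint and parameters to the
# version actually deployed. Host, database name and credentials are made up.
INFLUX_URL = "http://localhost:8086/db/pagebeat/series"

query = 'select median(used) from /cpu\\.*/ where time > now() - 4h group by time(5m)'

response = requests.get(INFLUX_URL, params={"u": "reader", "p": "secret", "q": query})
response.raise_for_status()

# Expected result format (a list of series with name/columns/points) follows
# the 0.8 API documentation and is likewise an assumption.
for series in response.json():
    print(series["name"], len(series["points"]), "rows")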
6 node.js - a cross-platform runtime environment for server-side and networking applications. http://nodejs.org/.
7 RabbitMQ - Messaging that just works. http://www.rabbitmq.com.
8 MongoDB - An open-source document database. http://www.mongodb.org/.
9 RStudio - open source and enterprise-ready professional software for the R statistical computing environment. http://www.rstudio.com.
10 Shiny - A web application framework for R. http://shiny.rstudio.com.
11 Oracle. http://www.oracle.com.
12 IBM Informix. http://www-01.ibm.com/software/data/informix/.
13 PostgreSQL. http://www.postgresql.org/.
14 OpenTSDB - Scalable Time Series Database. http://opentsdb.net/.
15 KairosDB - Fast Scalable Time Series Database. https://code.google.com/p/kairosdb/.
16 RRDB - Round Robin Database. http://oss.oetiker.ch/rrdtool/.
17 InfluxDB - An open-source distributed time series database with no external dependencies. http://influxdb.com/.
18 LevelDB - A fast and lightweight key-value database library by Google. http://code.google.com/p/leveldb/.



                                                                55
In addition to normal queries, so-called continuous queries can be set up, which, for example, allow simple downsampling of measured data:

select count(name) from clicks
group by time(1h) into clicks.count.1h

InfluxDB is still at an early stage of development and is being developed continuously. For instance, it has been announced that in the future it will be possible to store metadata about time series (units, sampling rate, etc.) and to implement user-defined aggregate functions. InfluxDB is a promising tool for our application, but it remains to be seen to what extent it is suitable for production use. For this reason, MongoDB, a data store well proven in the company, is currently used in parallel with InfluxDB.
4.   THE PAGEBEAT SOLUTION
In the PageBeat project, several solution approaches were tested; practicability for use in the company, rapid feasibility, and the free availability of the tools employed played the decisive role.

4.1   Data Flow
The data flow within the overall architecture is shown in Figure 1. The measurement data are collected by a drone19 as well as by client simulators and load-test servers at equidistant time intervals (usually 10 s). The collected data are made available to a logging service via a REST interface and are enqueued at a message server. From there, they are processed according to their signature by registered analysis and interpretation processes, where the validation of the incoming data and the assignment to registered analysis functions are carried out by means of a knowledge base. Results are in turn made available as messages and, where intended, stored persistently. Results that have entered the message queue in this way can then trigger further analyses or interpretations or the sending of a notification. The data explorer allows inspecting raw data and analysis results already integrated into PageBeat, as well as testing future analysis functions.

Figure 1: Data flow (data stream sources – drone, load-test servers, client simulation – feed preprocessing / data cleaning and integration; ad-hoc analysis, e.g. outlier detection, uses the knowledge base; results and raw data go to the data store, which serves long-term analysis and the data explorer)
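The message-based processing described above can be sketched as follows in Python; the queue names, the message layout, and the use of the pika client for RabbitMQ are assumptions made for this illustration — in PageBeat itself these components are implemented with node.js.

import json
import pika  # third-party RabbitMQ client, standing in for the node.js services

# Connection parameters, queue names and message layout are invented for this sketch.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="pagebeat.metrics")
channel.queue_declare(queue="pagebeat.results")

def handle_metric(ch, method, properties, body):
    """Validate an incoming measurement and dispatch a (placeholder) result."""
    message = json.loads(body)  # e.g. {"metric": "cpu.load", "ts": ..., "value": ...}
    if "metric" in message and "value" in message:
        result = {"metric": message["metric"], "status": "normal"}  # placeholder analysis
        ch.basic_publish(exchange="", routing_key="pagebeat.results",
                         body=json.dumps(result))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="pagebeat.metrics", on_message_callback=handle_metric)
channel.start_consuming()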
4.2   Knowledge Base
The knowledge base forms the foundation of the modular analysis and interpretation processes. The "ParameterValues" shown in Figure 2 represent the measurement data and their properties such as name, description, or unit. ParameterValues can be combined into logical groups (Parameters), such as the ParameterValues "system", "load", "iowait", and "max" belonging to the Parameter "cpu". Parameters are linked to visualization components and customer data as well as to analyses and interpretations. Analyses and interpretations are built modularly and each consist of input and output data (ParameterValues) as well as references to the program code. Furthermore, specific method parameters are assigned to them, for instance the start and end of a time window, thresholds, or other model parameters. The knowledge base is mapped to a relational schema in MySQL.
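The structure just described can be outlined in a few lines of Python. The class names follow the description above, while the concrete fields, the example group, and the example analysis are assumptions; the actual knowledge base is a relational MySQL schema, not this sketch.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ParameterValue:
    """A single measured quantity and its descriptive metadata."""
    name: str
    description: str = ""
    unit: str = ""

@dataclass
class Parameter:
    """A logical group of ParameterValues, e.g. 'cpu'."""
    name: str
    values: List[ParameterValue] = field(default_factory=list)

@dataclass
class Analysis:
    """A modular analysis step: inputs, outputs, method parameters, code reference."""
    name: str
    inputs: List[ParameterValue] = field(default_factory=list)
    outputs: List[ParameterValue] = field(default_factory=list)
    method_parameters: dict = field(default_factory=dict)  # e.g. window, thresholds
    code_reference: str = ""                                # reference to the program code

cpu = Parameter("cpu", [ParameterValue("system"), ParameterValue("load"),
                        ParameterValue("iowait"), ParameterValue("max")])
outlier_check = Analysis("outlier detection",
                         inputs=cpu.values,
                         outputs=[ParameterValue("cpu.status")],
                         method_parameters={"window_s": 600, "threshold": 3.0},
                         code_reference="analysis/outlier.R")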
4.3   Storage of the Time Series
The measurement data as well as the analysis and interpretation results are stored, on the one hand, in MongoDB, the schema-free database proven in the company and optimized for high-frequency writes. On the other hand, we now also use InfluxDB in parallel with MongoDB. For example, the continuous queries available in InfluxDB can be used for automatic downsampling and thus for a data reduction of the data collected every 10 seconds. The downsampling is currently done by computing the means of time windows ranging from one minute up to one day and thus automatically generates different temporal resolutions for all measured values. In addition, the SQL-like query language of InfluxDB provides a large number of aggregate functions useful for statistical evaluation (min, max, mean, median, stddev, percentile, histogram, etc.). Furthermore, it is planned that in the future it will be possible to implement user-defined functions with custom analysis functionality (such as autocorrelation, cross-correlation, prediction, etc.) at the database level, and also to automatically join different time series on a timestamp attribute. This would support cross-time-series analysis (e.g., correlation) already at the database level and would reduce the effort of reimplementing R functionality from the data exploration phase. Since conventional databases do not achieve this high write performance and hardly support queries specialized for time series, InfluxDB appears to be a suitable candidate for use within PageBeat.

19 An agent installed on the system under observation for data collection.
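As an example of an analysis function that currently still lives outside the database (cf. the autocorrelation view in Figure 4 and Section 4.5), a plain-Python sketch of the sample autocorrelation is shown below; in the project such functions are provided by R packages, so this only illustrates the computation, and the example load curve is made up.

def autocorrelation(series, max_lag=10):
    """Sample autocorrelation of a time series for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    acf = []
    for lag in range(1, max_lag + 1):
        cov = sum((series[t] - mean) * (series[t - lag] - mean)
                  for t in range(lag, n)) / n
        acf.append(cov / var)
    return acf

# A strongly periodic load curve should show peaks at multiples of its period.
load = [30, 50, 80, 50, 30, 50, 80, 50, 30, 50, 80, 50, 30, 50, 80, 50]
print([round(r, 2) for r in autocorrelation(load, max_lag=8)])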



                                                                  56
Figure 2: Excerpt of the knowledge base schema

Figure 4: Autocorrelation
4.4 Data Exploration
Data exploration is intended to give administrators as well as end users the possibility to analyse the data relevant to them with the right tools. During development we use data exploration as a tool for identifying relevant analysis methods and for evaluating and visualising the data streams. Figure 3 shows a simple user interface, implemented with Shiny, for data analysis with R, with access to different databases (InfluxDB and MongoDB). It offers various controls for selecting the time range, the analysis function and its parameters, as well as visualisation parameters. Here, average CPU usage and average disk access times from a selection of 10 time series are displayed. The interaction element at the bottom allows intervals to be selected and the granularity to be adjusted. Similar visualisation methods can also be used to display autocorrelation analyses, see Figure 4.

Figure 3: Data
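As a companion to the autocorrelation view referenced above (Figure 4), the following self-contained R sketch computes and plots an autocorrelation function; the simulated AR(1) series is an assumption standing in for a real metric such as CPU load.

    # Synthetic series with temporal dependence (AR(1) process), one value per minute
    series <- arima.sim(model = list(ar = 0.8), n = 720)
    # Empirical autocorrelation up to a lag of 60 minutes
    acf(series, lag.max = 60, main = "Autocorrelation of a simulated metric")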
4.5 Analysis and Interpretation
Analyses are basic operations such as the computation of mean, median, standard deviation, autocorrelation and others, whose results can be stored persistently if necessary or passed directly as input to other processing steps. The analysis functions are specified in the knowledge base; the actual implementation is to be realised as close as possible to the data being analysed, where possible using aggregate or user-defined functions of the database system. For this purpose, knowledge base and analysis are linked via a "method codepath". Interpretations work analogously to analyses, but they capture computation rules, for instance for the overall index (Pagebeat factor) of the system or of individual subsystems, e.g. by combining the analysis results of individual time series in a weighted manner. Furthermore, interpretations have an info type, which serves the user-specific presentation of results. Figure 5, for example, shows aggregated parameters displayed as a traffic light (red = critical, yellow = warning, green = normal, blue = optimal), which quickly conveys an impression of the state of various system parameters.

Figure 5: Traffic light

Analysis functionality that goes beyond aggregations at the database level is implemented and evaluated by us in an experimental environment based on R. A large number of statistical analysis methods and methods for preparing complex data structures are available in the form of R packages. In addition, the R package "Shiny Server" makes it convenient to provide R functionality on the web. An essential part of our experimental environment is the Pagebeat Data Explorer (see Figure 3). It builds on the techniques just mentioned and allows inspecting the collected raw data and "playing" with analysis methods and prediction models.
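To illustrate how such a web front end can be assembled with Shiny, here is a minimal, self-contained sketch of an explorer-style interface; the controls, the simulated data and the app as a whole are illustrative assumptions and not the actual Pagebeat Data Explorer.

    library(shiny)

    ui <- fluidPage(
      titlePanel("Data explorer (sketch)"),
      sidebarLayout(
        sidebarPanel(
          dateRangeInput("range", "Time range"),
          selectInput("fun", "Analysis", choices = c("raw series", "autocorrelation"))
        ),
        mainPanel(plotOutput("plot"))
      )
    )

    server <- function(input, output) {
      output$plot <- renderPlot({
        # In the real system the series would be fetched from InfluxDB or MongoDB;
        # here a simulated AR(1) series stands in for a monitored metric.
        series <- arima.sim(model = list(ar = 0.8), n = 300)
        if (input$fun == "autocorrelation") acf(series) else plot(series, ylab = "value")
      })
    }

    shinyApp(ui = ui, server = server)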



5. SUMMARY AND OUTLOOK
Pagebeat is a project in which performant storage and fast ad-hoc evaluation of the data are of particular importance. To this end, various solution approaches were considered, and the favoured solution based on InfluxDB and R was described.
The conceptual phase is complete, the project infrastructure has been implemented, and first analysis methods such as outlier detection or autocorrelation have been tried out. We are currently investigating the possibilities of predicting time series values. To this end, results of the autocorrelation analysis are used to identify dependencies within time series, in order to be able to estimate the quality of predictions. Furthermore, it is planned to execute analyses closer to the database in order to support data locality.





   Databases under the Partial Closed-world Assumption:
                         A Survey

                          Simon Razniewski                                                       Werner Nutt
                Free University of Bozen-Bolzano                                   Free University of Bozen-Bolzano
                         Dominikanerplatz 3                                                 Dominikanerplatz 3
                         39100 Bozen, Italy                                                 39100 Bozen, Italy
                       razniewski@inf.unibz.it                                                nutt@inf.unibz.it

ABSTRACT
Databases are traditionally considered either under the closed-world or the open-world assumption. In some scenarios, however, a middle ground, the partial closed-world assumption, is needed, which has received less attention so far.
In this survey we review foundational work on the partial closed-world assumption and then discuss work done in our group in recent years on various aspects of reasoning over databases under this assumption.
We first discuss the conceptual foundations of this assumption. We then list the main decision problems and the known results. Finally, we discuss implementation approaches and extensions.

1. INTRODUCTION
Data completeness is an important aspect of data quality. Traditionally, it is assumed that a database reflects exactly the state of affairs in an application domain, that is, a fact that is true in the real world is stored in the database, and a fact that is missing in the database does not hold in the real world. This is known as the closed-world assumption (CWA). Later approaches have discussed the meaning of databases that are missing facts that hold in the real world and thus are incomplete. This is called the open-world assumption (OWA) [16, 7].
A middle view, which we call the partial closed-world assumption (PCWA), has received less attention until recently. Under the PCWA, some parts of the database are assumed to be closed (complete), while others are assumed to be open (possibly incomplete). So far, the former parts were specified using completeness statements, while the latter parts are the complement of the complete parts.

Example. As an example, consider a problem arising in the management of school data in the province of Bolzano, Italy, which motivated the technical work reported here. The IT department of the provincial school administration runs a database for storing school data, which is maintained in a decentralized manner, as each school is responsible for its own data. Since there are numerous schools in this province, the overall database is notoriously incomplete. However, periodically the statistics department of the province queries the school database to generate statistical reports. These statistics are the basis for administrative decisions such as the opening and closing of classes, the assignment of teachers to schools and others. It is therefore important that these statistics are correct. Therefore, the IT department is interested in finding out which data has to be complete in order to guarantee correctness of the statistics, and on which basis the guarantees can be given.
Broadly, we investigated the following research questions:

1. How to describe complete parts of a database?

2. How to find out whether a query answer over a partially closed database is complete?

3. If a query answer is not complete, how to find out which kind of data can be missing, and which similar queries are complete?

Work Overview. The first work on the PCWA is from Motro [10]. He used queries to describe complete parts and introduced the problem of inferring the completeness of other queries (QC) from such completeness statements. Later work by Halevy [8] introduced tuple-generating dependencies or table completeness (TC) statements for the specification of complete parts. A detailed complexity study of TC-QC entailment was done by Razniewski and Nutt [13].
Later work by Razniewski and Nutt has focussed on databases with null values [12] and geographic databases [14].
There has also been work on RDF data [3]. Savkovic et al. [18, 17] have focussed on implementation techniques, leveraging especially on logic programming.
Also the derivation of completeness from data-aware business process descriptions has been discussed [15].
Current work is focussing on reasoning wrt. database instances and on queries with negation [4].

Outline. This paper is structured as follows. In Section 2, we discuss conceptual foundations, in particular the partial closed-world assumption. In Section 3 we present the main reasoning problems in this framework and known results. Section 4 discusses implementation techniques. Section 5 presents extensions and Section 6 discusses current work and open problems.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.



2. CONCEPTUAL FOUNDATIONS

2.1 Standard Definitions
In the following, we fix our notation for standard concepts from database theory. We assume a set of relation symbols Σ, the signature. A database instance D is a finite set of ground atoms with relation symbols from Σ. For a relation symbol R ∈ Σ we write R(D) to denote the interpretation of R in D, that is, the set of atoms in D with relation symbol R. A condition G is a set of atoms using relations from Σ and possibly the comparison predicates < and ≤. As common, we write a condition as a sequence of atoms, separated by commas. A condition is safe if each of its variables occurs in a relational atom. A conjunctive query is written in the form Q(s̄) :− B, where B is a safe condition, s̄ is a vector of terms, and every variable in s̄ occurs in B. We often refer to the entire query by the symbol Q. As usual, we call Q(s̄) the head, B the body, the variables in s̄ the distinguished variables, and the remaining variables in B the nondistinguished variables of Q. We generically use the symbol L for the subcondition of B containing the relational atoms and M for the subcondition containing the comparisons. If B contains no comparisons, then Q is a relational conjunctive query.
The result of evaluating Q over a database instance D is denoted as Q(D). Containment and equivalence of queries are defined as usual. A conjunctive query is minimal if no relational atom can be removed from its body without leading to a non-equivalent query.
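As a small illustration of this notation (the query below is ours, not part of the running example): for the conjunctive query

\[ Q(n) \;:\!-\; \mathit{student}(n, l, c),\; l \le 5, \]

the head is Q(n), the variable n is distinguished, l and c are nondistinguished, the subcondition L consists of the relational atom student(n, l, c), and M consists of the comparison l ≤ 5; dropping the comparison would make Q a relational conjunctive query.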
2.2 Running Example
For our examples throughout the paper, we will use a drastically simplified extract taken from the schema of the Bolzano school database, containing the following two tables:

- student(name, level, code),
- person(name, gender).

The table student contains records about students, that is, their names and the level and code of the class they are in. The table person contains records about persons (students, teachers, etc.), that is, their names and genders.

2.3 Completeness
Open and closed world semantics were first discussed by Reiter in [16], where he formalized earlier work on negation as failure [2] from a database point of view. The closed-world assumption corresponds to the assumption that the whole database is complete, while the open-world assumption corresponds to the assumption that nothing is known about the completeness of the database.

Partial Database. The first and very basic concept is that of a partially complete database or partial database [10]. A database can only be incomplete with respect to another database that is considered to be complete. So we model a partial database as a pair of database instances: one instance that describes the complete state, and another instance that describes the actual, possibly incomplete state. Formally, a partial database is a pair D = (Di, Da) of two database instances Di and Da such that Da ⊆ Di. In the style of [8], we call Di the ideal database, and Da the available database. The requirement that Da is included in Di formalizes the intuition that the available database contains no more information than the ideal one.

Example 1. Consider a partial database DS for a school with two students, Hans and Maria, and one teacher, Carlo, as follows:

DiS = { student(Hans, 3, A), student(Maria, 5, C),
        person(Hans, male), person(Maria, female),
        person(Carlo, male) },
DaS = DiS \ { person(Carlo, male), student(Maria, 5, C) },

that is, the available database misses the facts that Maria is a student and that Carlo is a person.
Next, we define statements to express that parts of the information in Da are complete with regard to the ideal database Di. We distinguish query completeness and table completeness statements.

Query Completeness. For a query Q, the query completeness statement Compl(Q) says that Q can be answered completely over the available database. Formally, Compl(Q) is satisfied by a partial database D, denoted as D |= Compl(Q), if Q(Da) = Q(Di).

Example 2. Consider the above defined partial database DS and the query

Q1(n) :− student(n, l, c), person(n, 'male'),

asking for all male students. Over both the available database DaS and the ideal database DiS, this query returns exactly Hans. Thus, DS satisfies the query completeness statement for Q1, that is, DS |= Compl(Q1).

Abiteboul et al. [1] introduced the notion of certain and possible answers over databases under the open-world assumption. Query completeness can also be seen as a relation between certain and possible answers: a query over a partially complete database is complete if the certain and the possible answers coincide.

Table completeness. A table completeness (TC) statement allows one to say that a certain part of a relation is complete, without requiring the completeness of other parts of the database [8]. It has two components, a relation R and a condition G. Intuitively, it says that all tuples of the ideal relation R that satisfy condition G in the ideal database are also present in the available relation R.
Formally, let R(s̄) be an R-atom and let G be a condition such that R(s̄), G is safe. We remark that G can contain relational and built-in atoms and that we do not make any safety assumptions about G alone. Then Compl(R(s̄); G) is a table completeness statement. It has an associated query, which is defined as QR(s̄);G(s̄) :− R(s̄), G. The statement is satisfied by D = (Di, Da), written D |= Compl(R(s̄); G), if QR(s̄);G(Di) ⊆ R(Da). Note that the ideal instance Di is used to determine those tuples in the ideal version R(Di) that satisfy G, and that the statement is satisfied if these tuples are present in the available version R(Da). In the sequel, we will denote a TC statement generically as C and refer to the associated query simply as QC.
If we introduce different schemas Σi and Σa for the ideal and the available database, respectively, we can view the TC statement C = Compl(R(s̄); G) equivalently as the TGD (tuple-generating dependency) δC : Ri(s̄), Gi → Ra(s̄) from Σi to Σa.



It is straightforward to see that a partial database satisfies the TC statement C if and only if it satisfies the TGD δC. This view of TC statements is especially useful for implementations.

Example 3. In the partial database DS defined above, we can observe that in the available relation person, the teacher Carlo is missing, while all students are present. Thus, person is complete for all students. The available relation student contains Hans, who is the only male student. Thus, student is complete for all male persons. Formally, these two observations can be written as table completeness statements:

C1 = Compl(person(n, g); student(n, l, c)),
C2 = Compl(student(n, l, c); person(n, 'male')),

which, as seen, are satisfied by the partial database DS.
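Instantiating the TGD view from above with C1 gives, as an illustration,

\[ \delta_{C_1}\colon\; \mathit{person}^i(n, g),\, \mathit{student}^i(n, l, c) \;\rightarrow\; \mathit{person}^a(n, g), \]

read as: every person who is a student in the ideal database also occurs in the available person relation.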
                                                                              not complete, as later results on query determinacy show:
  One can prove that table completeness cannot be expressed                   the given query may be complete although no conjunctive
by query completeness statements, because the latter require                  rewriting exists.
completeness of the relevant parts of all the tables that ap-                    While Levy et al. could show that rewritability of conjunc-
pear in the statement, while the former only talks about the                  tive queries as conjunctive queries is decidable [9], general
completeness of a single table.                                               rewritability of conjunctive queries by conjunctive queries is
                                                                              still open: An extensive discussion on that issue was pub-
                                                                              lished in 2005 by Segoufin and Vianu where it is shown that
   Example 4. As an illustration, consider the table completeness             it is possible that conjunctive queries can be rewritten using
statement C1 that states that person is complete for all students. The        other conjunctive queries, but the rewriting is not a conjunc-
corresponding query QC1 that asks for all persons that are students           tive query [19]. They also introduced the notion of query
is                                                                            determinacy, which for conjunctive queries implies second
             QC1 (n, g) :− person(n, g), student(n, l, c).                    order rewritability. The decidability of query determinacy
                                                                              for conjunctive queries is an open problem to date.
Evaluating QC1 over DiS gives the result { Hans, Maria }. However,
evaluating it over DaS returns only { Hans }. Thus, DS does not                  Halevy [8] suggested local completeness statements, which
satisfy the completeness of the query QC1 although it satisfies the           we, for a better distinction from the QC statements, call table
table completeness statement C1 .                                             completeness (TC) statements, as an alternate formalism for
                                                                              expressing partial completeness of an incomplete database.
Reasoning. As usual, a set S1 of TC- or QC-statements en-                     These statements allow one to express completeness of parts
tails another set S2 (we write S1 |= S2 ) if every partial database           of relations independent from the completeness of other parts
that satisfies all elements of S1 also satisfies all elements of S2 .         of the database. The main problem he addressed was how to
                                                                              derive query completeness from table completeness (TC-QC).
                                                                              He reduced TC-QC to the problem of queries independent
   Example 5. Consider the query Q(n) :− student(n, 7, c),                    of updates (QIU) [5]. However, this reduction introduces
person(n,0 male0 ) that asks for all male students in level 7. The            negation, and thus, except for trivial cases, generates QIU
TC statements C1 and C2 entail completeness of this query, because            instances for which no decision procedures are known. As
we ensure that all persons that are students and all male students            a consequence, the decidability of TC-QC remained largely
are in the database. Note that these are not the minimal precon-              open. Moreover, he demonstrated that by taking into ac-
ditions, as it would be enough to only have male persons in the               count the concrete database instance and exploiting the key
database who are student in level 7, and students in level 7, who             constraints over it, additional queries can be shown to be
are male persons.                                                             complete.
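In the notation just introduced, the claim of Example 5 can be written compactly as the entailment

\[ \{\,C_1, C_2\,\} \models \mathit{Compl}(Q) \qquad \text{where } Q(n) \;:\!-\; \mathit{student}(n, 7, c),\, \mathit{person}(n, \text{'male'}). \]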
                                                                                 Razniewski and Nutt provided decision procedures for TC-
   While TC statements are a natural way to describe com-                     QC in [13]. They showed that for queries under bag semantics
pleteness of available data (“These parts of the data are com-                and for minimal queries under set semantics, weakest precon-
plete”), QC statements capture requirements for data qual-                    ditions for query completeness can be expressed in terms of
ity (“For these queries we need complete answers”). Thus,                     table completeness statements, which allow to reduce TC-QC
checking whether a set of TC statements entails a set of                      entailment to TC-TC entailment.
QC statements (TC-QC entailment) is the practically most                         For the problem of TC-TC entailment, they showed that it
relevant inference. Checking TC-TC entailment is useful                       is equivalent to query containment.
when managing sets of TC statements. Moreover, as we                             For QC-QC entailment, they showed that the problem is
will show later on, TC-QC entailment for aggregate queries                    decidable for queries under bag semantics.
with count and sum can be reduced to TC-TC entailment for                        For aggregate queries, they showed that for the aggregate
non-aggregate queries. If completeness guarantees are given                   functions SUM and COUNT, TC-QC has the same complexity
in terms of query completeness, also QC-QC entailment is of                   as TC-QC for nonaggregate queries under bag semantics. For
interest.                                                                     the aggregate functions MIN and MAX, they showed that



For reasoning wrt. a database instance, they showed that TC-QC becomes computationally harder than without an instance, while QC-QC surprisingly becomes solvable, whereas without an instance, decidability is open.
In [12], Nutt and Razniewski discussed TC-QC entailment reasoning over databases that contain null values. Null values as used in SQL are ambiguous, as they can indicate either that no attribute value exists or that a value exists but is unknown. Nutt and Razniewski studied completeness reasoning for both interpretations, and showed that when allowing both interpretations at the same time, it becomes necessary to syntactically distinguish between different kinds of null values. They presented an encoding for doing that in standard SQL databases. With this technique, any SQL DBMS evaluates complete queries correctly with respect to the different meanings that null values can carry.
The main results are summarized in Table 1.

  Problem   Work by                 Results
  QC-QC     Motro 1989              Query rewritability is a sufficient condition for QC-QCs
            Razniewski/Nutt 2011    QC-QCb is equivalent to query containment
  TC-TC     Razniewski/Nutt 2011    TC-TC is equivalent to query containment
  TC-QC     Levy 1996               Decision procedure for trivial cases
            Razniewski/Nutt 2011    TC-QCb is equivalent to TC-TC, TC-QCs is equivalent to TC-TC up to asymmetric cases
            Razniewski/Nutt 2012    Decision procedures for TC-QCs over databases with nulls

  Table 1: Main results

4. IMPLEMENTATION TECHNIQUES
Systems for reasoning can be developed from scratch; however, it may be useful to implement them using existing technology as far as possible. So far, it was investigated how completeness reasoning can be reduced to answer set programming, in particular using the DLV system.
The MAGIK system developed by Savkovic et al. [17] demonstrates how to use meta-information about the completeness of a database to assess the quality of the answers returned by a query. The system holds table-completeness (TC) statements, by which one can express that a table is partially complete, that is, it contains all facts about some aspect of the domain.
Given a query, MAGIK determines from such meta-information whether the database contains sufficient data for the query answer to be complete (TC-QC entailment). If, according to the TC statements, the database content is not sufficient for a complete answer, MAGIK explains which further TC statements are needed to guarantee completeness.
MAGIK extends and complements theoretical work on modeling and reasoning about data completeness by providing the first implementation of a reasoner. The reasoner operates by translating completeness reasoning tasks into logic programs, which are executed by an answer set engine.
In [18], Savkovic et al. present an extension to MAGIK that computes, for a query that may be incomplete, complete approximations from above and from below. With this extension, they show how to reformulate the original query in such a way that answers are guaranteed to be complete. If there exists a more general complete query, there is a unique most specific one, which is found. If there exists a more specific complete query, there may even be infinitely many. In this case, the least specific specializations whose size is bounded by a threshold provided by the user are found. Generalizations are computed by a fixpoint iteration, employing an answer set programming engine. Specializations are found leveraging unification from logic programming.

5. EXTENSIONS AND APPLICATION SCENARIOS

Complete generalizations and specializations. When a query is not guaranteed to be complete, it may be interesting to know which similar queries are complete. For instance, when a query for all students in level 5 is not complete, it may still be the case that the query for students in classes 5b and 5c is complete. Such information is especially interesting for interaction with a completeness reasoning system. In [11], Savkovic et al. defined the notions of the most general complete specialization and the most specific complete generalization, and discussed techniques to find those.

Completeness over Business Processes. In many applications, data is managed via well documented processes. If information about such processes exists, one can draw conclusions about completeness as well. In [15], Razniewski et al. presented a formalization of so-called quality-aware processes that create data in the real world and store it in the company's information system, possibly at a later point. They then showed how one can check the completeness of database queries in a certain state of the process or after the execution of a sequence of actions, by leveraging on query containment, a well-studied problem in database theory. Finally, they showed how the results can be extended to the more expressive formalism of colored Petri nets.

Spatial Data. Volunteered geographical information systems are gaining popularity. The most established one is OpenStreetMap (OSM), but also classical commercial map services such as Google Maps now allow users to take part in the content creation.



Assessing the quality of spatial information is essential for making informed decisions based on the data, and it is particularly challenging when the data is provided in a decentralized, crowd-based manner. In [14], Razniewski and Nutt showed how information about the completeness of features in certain regions can be used to annotate query answers with completeness information. They provided a characterization of the necessary reasoning and showed that when taking into account the available database, more completeness can be derived. OSM already contains some completeness statements, which are originally intended for coordination among the editors of the map. A contribution was also to show that these statements are not only useful for the producers of the data but also for the consumers.

RDF Data. With thousands of RDF data sources today available on the Web, covering disparate and possibly overlapping knowledge domains, the problem of providing high-level descriptions (in the form of metadata) of their content becomes crucial. In [3], Darari et al. discussed reasoning about the completeness of semantic web data sources. They showed how the previous theory can be adapted for RDF data sources, what peculiarities the SPARQL query language offers, and how completeness statements themselves can be expressed in RDF.
They also discussed the foundation for the expression of completeness statements about RDF data sources. This allows one to complement existing proposals like VOID, which mainly deal with quantitative descriptions, with qualitative descriptions about completeness. The second aspect of their work is to show that completeness statements can be useful for the semantic web in practice. On the theoretical side, they provide a formalization of completeness for RDF data sources and techniques to reason about the completeness of query answers. From the practical side, completeness statements can be easily embedded in current descriptions of data sources and thus readily used. The results on RDF data have been implemented by Darari et al. in a demo system called CORNER [6].

6. CURRENT WORK
In this section we list problems that our group is currently working on.

6.1 SPARQL Queries with Negation
RDF data is often treated as incomplete, following the Open-World Assumption. On the other hand, SPARQL, the standard query language over RDF, usually follows the Closed-World Assumption, assuming RDF data to be complete. This creates a semantic gap between RDF and SPARQL. In current work, Darari et al. [4] address how to close this gap in terms of certain answers and possible answers using completeness statements. Table 2 shows current results for the relations between query answers, certain answers and possible answers for queries with negation. The queries are assumed to be of the form Q(s̄) :− P+, ¬P−, where P+ is the positive part and P− is the negative part. We use the letters C and N to indicate which parts are complete; e.g., Q(s̄) :− N, ¬C indicates that the positive part is not complete and the negative part is complete. As the table shows, depending on the complete parts, the query answer may either be equal to the possible answers, to the certain answers, both, or none.
Note that the above results hold for conjunctive queries in general, and thus do not only apply to SPARQL but also to other query languages with negation, such as SQL.

  Completeness Pattern    Relationship between Certain Answers (CA), Query Answers (QA), and Possible Answers (PA)
  Q :− C                  CA = QA = PA
  Q :− N                  CA = QA ⊆ PA = inf
  Q :− N, ¬N              ∅ = CA ⊆ QA ⊆ PA = inf
  Q :− C, ¬C              CA = QA = PA
  Q :− N, ¬C              CA = QA ⊆ PA = inf
  Q :− C, ¬N              ∅ = CA ⊆ QA = PA

  Table 2: Relation between query result, certain answers and possible answers for queries with negation. The arguments of Q are irrelevant and therefore omitted.

6.2 Instance Reasoning
Another line of current work concerns completeness reasoning wrt. a database instance. We are currently looking into completeness statements that are simpler than TC statements in the sense that they do not contain any joins. For such statements, reasoning is still exponential in the size of the database schema, but experimental results suggest that in use cases the reasoning is feasible. A challenge, however, is to develop a procedure that is algorithmically complete.

7. ACKNOWLEDGEMENT
We thank our collaborators Fariz Darari, Flip Korn, Paramita Mirza, Marco Montali, Sergey Paramonov, Giuseppe Pirró, Radityo Eko Prasojo, Ognjen Savkovic and Divesh Srivastava.
This work has been partially supported by the project "MAGIC: Managing Completeness of Data" funded by the province of Bozen-Bolzano.

8. REFERENCES
[1] S. Abiteboul, P.C. Kanellakis, and G. Grahne. On the representation and querying of sets of possible worlds. In Proc. SIGMOD, pages 34–48, 1987.
[2] Keith L. Clark. Negation as failure. In Logic and Data Bases, pages 293–322. Springer, 1978.
[3] Fariz Darari, Werner Nutt, Giuseppe Pirrò, and Simon Razniewski. Completeness statements about RDF data sources and their use for query answering. In International Semantic Web Conference (1), pages 66–83, 2013.
[4] Fariz Darari, Simon Razniewski, and Werner Nutt. Bridging the semantic gap between RDF and SPARQL using completeness statements. ISWC, 2013.
[5] Ch. Elkan. Independence of logic database queries and updates. In Proc. PODS, pages 154–160, 1990.
[6] Radityo Eko Prasojo, Fariz Darari, and Werner Nutt. CORNER: A completeness reasoner for the semantic web (poster). ESWC, 2013.
[7] T. Imieliński and W. Lipski, Jr. Incomplete information in relational databases. J. ACM, 31:761–791, 1984.



[8] Alon Y. Levy. Obtaining complete answers from incomplete databases. In Proceedings of the International Conference on Very Large Data Bases, pages 402–412, 1996.
[9] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In PODS, pages 95–104, 1995.
[10] A. Motro. Integrity = Validity + Completeness. ACM TODS, 14(4):480–502, 1989.
[11] Werner Nutt, Sergey Paramonov, and Ognjen Savkovic. An ASP approach to query completeness reasoning. TPLP, 13(4-5-Online-Supplement), 2013.
[12] Werner Nutt and Simon Razniewski. Completeness of queries over SQL databases. In CIKM, pages 902–911, 2012.
[13] S. Razniewski and W. Nutt. Completeness of queries over incomplete databases. In VLDB, 2011.
[14] S. Razniewski and W. Nutt. Assessing the completeness of geographical data (short paper). In BNCOD, 2013.
[15] Simon Razniewski, Marco Montali, and Werner Nutt. Verification of query completeness over processes. In BPM, pages 155–170, 2013.
[16] Raymond Reiter. On closed world data bases. In Logic and Data Bases, pages 55–76, 1977.
[17] Ognjen Savkovic, Paramita Mirza, Sergey Paramonov, and Werner Nutt. Magik: managing completeness of data. In CIKM, pages 2725–2727, 2012.
[18] Ognjen Savkovic, Paramita Mirza, Alex Tomasi, and Werner Nutt. Complete approximations of incomplete queries. PVLDB, 6(12):1378–1381, 2013.
[19] L. Segoufin and V. Vianu. Views and queries: Determinacy and rewriting. In Proc. PODS, pages 49–60, 2005.




         Towards Semantic Recommendation of Biodiversity
               Datasets based on Linked Open Data

Felicitas Löffler, Dept. of Mathematics and Computer Science, Friedrich Schiller University Jena, Germany
Bahar Sateli, Semantic Software Lab, Dept. of Computer Science and Software Engineering, Concordia University, Montréal, Canada
René Witte, Semantic Software Lab, Dept. of Computer Science and Software Engineering, Concordia University, Montréal, Canada
Birgitta König-Ries, Friedrich Schiller University Jena, Germany, and German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Germany

ABSTRACT
Conventional content-based filtering methods recommend documents based on extracted keywords. They calculate the similarity between keywords and user interests and return a list of matching documents. In the long run, this approach often leads to overspecialization and fewer new entries with respect to a user's preferences. Here, we propose a semantic recommender system using Linked Open Data for the user profile and adding semantic annotations to the index. Linked Open Data allows recommendations beyond the content domain and supports the detection of new information. One research area with a strong need for the discovery of new information is biodiversity. Due to their heterogeneity, the exploration of biodiversity data requires interdisciplinary collaboration. Personalization, in particular in recommender systems, can help to link the individual disciplines in biodiversity research and to discover relevant documents and datasets from various sources. We developed a first prototype for our semantic recommender system in this field, where a multitude of existing vocabularies facilitate our approach.

Categories and Subject Descriptors
H.3.3 [Information Storage And Retrieval]: Information Search and Retrieval; H.3.5 [Information Storage And Retrieval]: Online Information Services

General Terms
Design, Human Factors

Keywords
content filtering, diversity, Linked Open Data, recommender systems, semantic indexing, semantic recommendation

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

1. INTRODUCTION
Content-based recommender systems observe a user's browsing behaviour and record their interests [1]. By means of natural language processing and machine learning techniques, the user's preferences are extracted and stored in a user profile. The same methods are utilized to obtain suitable content keywords to establish a content profile. Based on previously seen documents, the system attempts to recommend similar content. Therefore, a mathematical representation of the user and content profiles is needed. A widely used scheme is TF-IDF (term frequency-inverse document frequency) weights [19]. Computed from the frequency of keywords appearing in a document, these term vectors capture the influence of keywords in a document or preferences in a user profile. The angle between these vectors describes the distance or closeness of the profiles and is calculated with similarity measures, like the cosine similarity. The recommendation lists of these traditional, keyword-based recommender systems often contain results very similar to those already seen, leading to overspecialization [11] and the "Filter Bubble" effect [17]: the user obtains only content matching the stored preferences; other related documents that do not perfectly match the stored interests are not displayed. Thus, increasing diversity in recommendations has become a research area of its own [21, 25, 24, 18, 3, 6, 23], mainly used to improve the recommendation results in news or movie portals.
One field where content recommender systems could enhance daily work is research. Scientists need to be aware of relevant research in their own but also neighboring fields. Increasingly, in addition to literature, the underlying data itself and even data that has not been used in publications are being made publicly available. An important example for such a discipline is biodiversity research, which explores the variety of species and their genetic and characteristic diversity [12]. The morphological and genetic information of an organism, together with the ecological and geographical context, forms a highly diverse structure. Collected and stored in different data formats, the datasets often contain or link to spatial, temporal and environmental data [22]. Many important research questions cannot be answered by working with individual datasets or data collected by one group, but require meta-analysis across a wide range of data. Since the analysis of biodiversity data is quite time-consuming, there is a strong need for personalization and new filtering techniques in this research area. Ordinary search functions in relevant data portals or databases, e.g., the Global Biodiversity Information Facility (GBIF)1 and the Catalog of Life,2 only return data that match the user's query exactly and fail at finding more diverse and semantically related content.




Also, user interests are not taken into account in the result list. We believe our semantic-based content recommender system could facilitate the difficult and time-consuming research process in this domain.
Here, we propose a new semantic-based content recommender system that represents the user profile as Linked Open Data (LOD) [9] and incorporates semantic annotations into the recommendation process. Additionally, the search engine is connected to a terminology server and utilizes the provided vocabularies for a recommendation. The result list contains more diverse predictions and includes hierarchical concepts or individuals.
The structure of this paper is as follows: Next, we describe related work. Section 3 presents the architecture of our semantic recommender system and some implementation details. In Section 4, an application scenario is discussed. Finally, conclusions and future work are presented in Section 5.

2. RELATED WORK
The major goal of diversity research in recommender systems is to counteract overspecialization [11] and to recommend related products, articles or documents. More books of an author or different movies of a genre are the classical applications, mainly used in recommender systems based on collaborative filtering methods. In order to enhance the variety in book recommendations, Ziegler et al. [25] enrich user profiles with taxonomical super-topics. The recommendation list generated by this extended profile is merged with a rank in reverse order, called dissimilarity rank. Depending on a certain diversification factor, this merging process supports more or less diverse recommendations. Larger diversification factors lead to more diverse products beyond user interests. Zhang and Hurley [24] favor another mathematical solution and describe the balance between diversity and similarity as a constrained optimization problem. They compute a dissimilarity matrix according to applied criteria, e.g., movie genres, and assign a matching function to find a subset of products that are diverse as well as similar. One hybrid approach by van Setten [21] combines the results of several conventional algorithms, e.g., collaborative and case-based, to improve movie recommendations. Mainly focused on news or social media, approaches using content-based filtering methods try to present different viewpoints on an event to decrease the media bias in news portals [18, 3] or to facilitate the filtering of comments [6, 23].
Apart from Ziegler et al., none of the presented approaches
that several types of relations can be taken into account. For instance, for a user interested in "geology", the profile contains the concept "geology" that also permits the recommendation of inferred concepts, e.g., "fossil". The idea of recommending related concepts was first introduced by Middelton et al. [15]. They developed Quickstep, a recommender system for research papers with ontological terms in the user profile and for paper categories. The ontology only considers is-a relationships and omits other relation types (e.g., part-of). Another simple hierarchical approach from Shoval et al. [13] calculates the distance among concepts in a profile hierarchy. They distinguish between perfect, close and weak match. When the concept appears in both a user's and a document's profile, it is called a perfect match. In a close match, the concept emerges only in one of the profiles and a child or parent concept appears in the other. The largest distance is called a weak match, where only one of the profiles contains a grandchild or grandparent concept. Finally, a weighted sum over all matching categories leads to the recommendation list. This ontological filtering method was integrated into the news recommender system epaper. Another semantically enhanced recommender system is Athena [10]. The underlying ontology is used to explore the semantic neighborhood in the news domain. The authors compared several ontology-based similarity measures with the traditional TF-IDF approach. However, this system lacks a connection to a search engine that allows querying large datasets.
All presented systems use manually established vocabularies with a limited number of classes. None of them utilize a generic user profile to store the preferences in a semantic format (RDF/XML or OWL). The FOAF (Friend Of A Friend) project3 provides a vocabulary for describing and connecting people, e.g., demographic information (name, address, age) or interests. As one of the first, in 2006 Celma [2] leveraged FOAF in his music recommender system to store users' preferences. Our approach goes beyond the FOAF interests by incorporating another generic user model vocabulary, the Intelleo User Modelling Ontology (IUMO).4 Besides user interests, IUMO offers elements to store learning goals, competences and recommendation preferences. This allows adapting the results to a user's previous knowledge or recommending only documents for a specific task.

3. DESIGN AND IMPLEMENTATION
In this section, we describe the architecture and some implementation details of our semantic-based recommender system (Figure 1). The user model component, described in Section 3.1, contains all user information. The source files, described in Section 3.2, are analyzed with GATE [5], as de-
have considered semantic technologies. However, utilizing               scribed in Section 3.3. Additionally, GATE is connected with
ontologies and storing user or document profiles in triple              a terminology server (Section 3.2) to annotate documents
stores represents a large potential for diversity research in           with concepts from the provided biodiversity vocabularies.
recommender systems. Frasincar et al. [7] define semanti-               In Section 3.4, we explain how the annotated documents are
cally enhanced recommenders as systems with an underly-                 indexed with GATE Mı́mir [4]. The final recommendation list
ing knowledge base. This can either be linguistic-based [8],            is generated in the recommender component (Section 3.5).
where only linguistic relations (e.g., synonymy, hypernomy,
meronymy, antonymy) are considered, or ontology-based. In               3.1    User profile
the latter case, the content and the user profile are repre-               The user interests are stored in an RDF/XML format uti-
sented with concepts of an ontology. This has the advantage             lizing the FOAF vocabulary for general user information. In
1                                                                       3
 GBIF, http://www.gbif.org                                               FOAF, http://xmlns.com/foaf/spec/
2                                                                       4
 Catalog of Life, http://www.catalogueoflife.org/col/                    IUMO, http://intelleo.eu/ontologies/user-model/
search/all/                                                             spec/



                                                                   66
                       Figure 1: The architecture of our semantic content recommender system


order to improve the recommendations regarding a user’s                    existing vocabularies. Furthermore, biodiversity is an inter-
previous knowledge and to distinguish between learning goals,              disciplinary field, where the results from several sources have
interests and recommendation preferences, we incorporate                   to be linked to gain new knowledge. A recommender system
the Intelleo User Modelling Ontology for an extended profile               for this domain needs to support scientists by improving this
description. Recommendation preferences will contain set-                  linking process and helping them finding relevant content in
tings in respect of visualization, e.g., highlighting of interests,        an acceptable time.
and recommender control options, e.g., keyword-search or                      Researchers in the biodiversity domain are advised to store
more diverse results. Another adjustment will adapt the                    their datasets together with metadata, describing informa-
result set according to a user’s previous knowledge. In order              tion about their collected data. A very common metadata
to enhance the comprehensibility for a beginner, the system                format is ABCD.7 This XML-based standard provides ele-
could provide synonyms; and for an expert the recommender                  ments for general information (e.g., author, title, address),
could include more specific documents.                                     as well as additional biodiversity related metadata, like infor-
   The interests are stored in form of links to LOD resources.             mation about taxonomy, scientific name, units or gathering.
For instance, in our example profile in Listing 1, a user is               Very often, each taxon needs specific ABCD fields, e.g., fossil
interested in “biotic mesoscopic physical object”, which is a              datasets include data about the geological era. Therefore,
concept from the ENVO5 ontology. Note that the interest                    several additional ABCD-related metadata standards have
entry in the RDF file does not contain the textual description,            emerged (e.g., ABCDEFG8 , ABCDDNA9 ). One document
but the link to the concept in the ontology, i.e., http://purl.            may contain the metadata of one or more species observations
obolibrary.org/obo/ENVO_01000009. Currently, we only                       in a textual description. This provides for annotation and
support explicit user modelling. Thus, the user information                indexing for a semantic search. For our prototype, we use the
has to be added manually to the RDF/XML file. Later, we                    ABCDEFG metadata files provided by the GFBio10 project;
intend to develop a user profiling component, which gathers                specifically, metadata files from the Museum für Naturkunde
a user’s interests automatically. The profile is accessible via            (MfN).11 An example for an ABCDEFG metadata file is
an Apache Fuseki6 server.                                                  presented in Listing 2, containing the core ABCD structure
                                                                           as well as additional information about the geological era.
Listing 1: User profile with interests stored as                           The terminology server supplied by the GFBio project of-
Linked Open Data URIs                                                      fers access to several biodiversity vocabularies, e.g., ENVO,
                                                                           BEFDATA, TDWGREGION. It also provides a SPARQL


Felicitas
                                                                           3.3    Semantic annotation
Loeffler                                       The source documents are analyzed and annotated accord-
Felicitas Loeffler                                  ing to the vocabularies provided by the terminology server.
Female
                                                that offers several standard language engineering components
Friedrich Schiller University Jena                      [5]. We developed a custom GATE pipeline (Figure 2) that

felicitas.loeffler@uni−jena.de                      analyzes the documents: First, the documents are split into
                                                     included in the GATE distribution. Afterwards, an ‘Anno-

                                                                           tation Set Transfer’ processing resource adds the original
                                                                           7
3.2    Source files and terminology server                                 8
                                                                              ABCD, http://www.tdwg.org/standards/115/
                                                                              ABCDEFG, http://www.geocase.eu/efg
  The content provided by our recommender comes from the                    9
                                                                              ABCDDNA, http://www.tdwg.org/standards/640/
biodiversity domain. This research area offers a wide range of             10
                                                                              GFBio, http://www.gfbio.org
5                                                                          11
 ENVO, http://purl.obolibrary.org/obo/envo.owl                                MfN, http://www.naturkundemuseum-berlin.de/
6                                                                          12
 Apache Fuseki, http://jena.apache.org/documentation/                         GFBio terminology server, http://terminologies.gfbio.
serving_data/                                                               org/sparql/



                                                                      67
                    Figure 2: The GFBio pipeline in GATE presenting the GFBio annotations


markups of the ABCDEFG files to the annotation set, e.g.,               the user in steering the recommendation process actively.
abcd:HigherTaxon. The following ontology-aware ‘Large KB                The recommender component is still under development and
Gazetteer’ is connected to the terminology server. For each             has not been added to the implementation yet.
document, all occurring ontology classes are added as specific
“gfbioAnnot” annotations that have both instance (link to
                                                                        Listing 2: Excerpt from a biodiversity metadata file
the concrete source document) and class URI. At the end, a
                                                                        in ABCDEFG format [20]
‘GATE Mı́mir Processing Resource’ submits the annotated
documents to the semantic search engine.                                
3.4      Semantic indexing                                              
                                                                        
   For semantic indexing, we are using GATE Mı́mir:13 “Mı́mir           
                                                                        MfN − Fossil invertebrates
is a multi-paradigm information management index and                    Gastropods, bivalves, brachiopods, sponges
repository which can be used to index and search over text,                   
annotations, semantic schemas (ontologies), and semantic                
metadata (instance data)” [4]. Besides ordinary keyword-                Gastropods, Bivalves, Brachiopods, Sponges
based search, Mı́mir incorporates the previously generated              
semantic annotations from GATE to the index. Addition-                  
ally, it can be connected to the terminology server, allowing           
                                                                        MfN
queries over the ontologies. All index relevant annotations             MfN − Fossil invertebrates Ia
and the connection to the terminology server are specified in           MB.Ga.3895
an index template.                                                      
                                                                        
                                                                        
3.5      Content recommender                                            Euomphaloidea
                                                                        Family
  The Java-based content recommender sends a SPARQL                     
query to the Fuseki Server and obtains the interests and                
preferred recommendation techniques from the user profile               Euomphalus sp.
                                                                        
SPARQL query to the Mı́mir server. Presently, this query                
asks only for child nodes (Figure 3). The result set contains           
ABCDEFG metadata files related to a user’s interests. We                
                                                                        
intend to experiment with further semantic relations in the             
future, e.g., object properties. Assuming that a specific fossil        
used to live in rocks, it might be interesting to know if other         System
species, living in this geological era, occured in rocks. An-           Triassic
other filtering method would be to use parent or grandparent            
                                                                        
provide control options and feedback mechanisms to support              
                                                                        
13
     GATE Mı́mir, https://gate.ac.uk/mimir/



                                                                   68
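To make the first step of the recommender in Section 3.5 more concrete, the following minimal Java sketch retrieves the stored interests from the Fuseki server with Apache Jena's ARQ API, mirroring a simplified form of the query shown later in Listing 3. It is only an illustration: the endpoint URL, the namespace behind the um: prefix and the class name are assumptions, not details taken from the described implementation.

    import org.apache.jena.query.*;

    // Hedged sketch: fetch a user's interests (LOD URIs) from the Fuseki server.
    // The endpoint URL and the um: namespace URI are assumptions.
    public class InterestFetcher {
        public static void main(String[] args) {
            String endpoint = "http://localhost:3030/profiles/query";  // assumed Fuseki dataset
            String query =
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "PREFIX um:   <http://intelleo.eu/ontologies/user-model/ns/> " +  // assumed IUMO namespace
                "SELECT ?label ?interest WHERE { " +
                "  ?s foaf:firstName \"Felicitas\" . " +
                "  ?s um:TopicPreference ?interest . " +
                "  ?interest rdfs:label ?label }";

            try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    // Each interest is a Linked Open Data URI, e.g. an ENVO concept;
                    // the label holds its textual description.
                    System.out.println(row.get("label") + " -> " + row.get("interest"));
                }
            }
        }
    }

In the full system the returned URIs would then parameterize the second, semantic query against Mímir; here they are simply printed.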
Figure 3: A search for “biotic mesoscopic physical object” returning documents about fossils (child concept)

4. APPLICATION

The semantic content recommender system allows the recommendation of more specific and diverse ABCDEFG metadata files with respect to the stored user interests. Listing 3 shows the query to obtain the interests from a user profile, introduced in Listing 1. The result contains a list of (LOD) URIs to concepts in an ontology.

Listing 3: SPARQL query to retrieve user interests

SELECT ?label ?interest ?syn
WHERE
{
    ?s foaf:firstName "Felicitas" .
    ?s um:TopicPreference ?interest .
    ?interest rdfs:label ?label .
    ?interest oboInOwl:hasRelatedSynonym ?syn
}

In this example, the user would like to obtain biodiversity datasets about a "biotic mesoscopic physical object", which is the textual description of http://purl.obolibrary.org/obo/ENVO_01000009. This technical term might be incomprehensible for a beginner, e.g., a student, who would prefer a description like "organic material feature". Thus, for a later adjustment of the result according to a user's previous knowledge, the system additionally returns synonyms.

The returned interest (LOD) URI is utilized for a second query to the search engine (Figure 3). The connection to the terminology server allows Mímir to search within the ENVO ontology (Figure 4) and to include related child concepts as well as their children and individuals. Since there is no metadata file containing the exact term "biotic mesoscopic physical object", a simple keyword-based search would fail. However, Mímir can retrieve more specific information than what is stored in the user profile and returns biodiversity metadata files about "fossil". That ontology class is a child node of "biotic mesoscopic physical object" and represents a semantic relation. Due to a high similarity regarding the content of the metadata files, the result set in Figure 3 contains only documents which closely resemble each other.

Figure 4: An excerpt from the ENVO ontology

5. CONCLUSIONS

We introduced our new semantically enhanced content recommender system for the biodiversity domain. Its main benefit lies in the connection to a search engine supporting integrated textual, linguistic and ontological queries. We are using existing vocabularies from the terminology server of the GFBio project. The recommendation list contains not only classical keyword-based results, but also documents including semantically related concepts.

In future work, we intend to integrate semantic-based recommender algorithms to obtain further diverse results and to support the interdisciplinary linking process in biodiversity research. We will set up an experiment to evaluate the algorithms on large datasets with the established classification metrics Precision and Recall [14]. Additionally, we would like to extend the recommender component with control options for the user [1]. Integrated into a portal, the result list should be adapted according to a user's recommendation settings or adjusted to previous knowledge. These control functions allow the user to actively steer the recommendation process. We are planning to utilize the new layered evaluation approach for interactive adaptive systems from Paramythis, Weibelzahl and Masthoff [16]. Since adaptive systems present different results to each user, ordinary evaluation metrics are not appropriate. Thus, accuracy, validity, usability, scrutability and transparency will be assessed in several layers, e.g., the collection of input data and their interpretation or the decision upon the adaptation strategy. This should lead to an improved consideration of adaptivity in the evaluation process.
6. ACKNOWLEDGMENTS

This work was supported by the DAAD (German Academic Exchange Service)14 through the PPP Canada program and by the DFG (German Research Foundation)15 within the GFBio project.

14 DAAD, https://www.daad.de/de/
15 DFG, http://www.dfg.de

7. REFERENCES

[1] F. Bakalov, M.-J. Meurs, B. König-Ries, B. Sateli, R. Witte, G. Butler, and A. Tsang. An approach to controlling user models and personalization effects in recommender systems. In Proceedings of the 2013 International Conference on Intelligent User Interfaces, IUI '13, pages 49–56, New York, NY, USA, 2013. ACM.
[2] Ò. Celma. FOAFing the music: Bridging the semantic gap in music recommendation. In Proceedings of the 5th International Semantic Web Conference, pages 927–934, Athens, GA, USA, 2006.
[3] S. Chhabra and P. Resnick. CubeThat: News article recommender. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 295–296, New York, NY, USA, 2012. ACM.
[4] H. Cunningham, V. Tablan, I. Roberts, M. Greenwood, and N. Aswani. Information extraction and semantic annotation for multi-paradigm information management. In M. Lupu, K. Mayer, J. Tait, and A. J. Trippe, editors, Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series, pages 307–327. Springer Berlin Heidelberg, 2011.
[5] H. Cunningham et al. Text Processing with GATE (Version 6). University of Sheffield, Dept. of Computer Science, 2011.
[6] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg. Opinion space: A scalable tool for browsing online comments. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10, pages 1175–1184, New York, NY, USA, 2010. ACM.
[7] F. Frasincar, W. IJntema, F. Goossen, and F. Hogenboom. A semantic approach for news recommendation. In Business Intelligence Applications and the Web: Models, Systems and Technologies, pages 102–121. IGI Global, 2011.
[8] F. Getahun, J. Tekli, R. Chbeir, M. Viviani, and K. Yétongnon. Relating RSS news/items. In M. Gaedke, M. Grossniklaus, and O. Díaz, editors, ICWE, volume 5648 of Lecture Notes in Computer Science, pages 442–452. Springer, 2009.
[9] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.
[10] W. IJntema, F. Goossen, F. Frasincar, and F. Hogenboom. Ontology-based news recommendation. In Proceedings of the 2010 EDBT/ICDT Workshops, EDBT '10, pages 16:1–16:6, New York, NY, USA, 2010. ACM.
[11] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, pages 73–105. Springer, 2011.
[12] M. Loreau. Excellence in Ecology. International Ecology Institute, Oldendorf, Germany, 2010.
[13] V. Maidel, P. Shoval, B. Shapira, and M. Taieb-Maimon. Ontological content-based filtering for personalised newspapers: A method and its evaluation. Online Information Review, 34(5):729–756, 2010.
[14] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[15] S. E. Middleton, N. R. Shadbolt, and D. C. De Roure. Ontological user profiling in recommender systems. ACM Transactions on Information Systems, 22(1):54–88, Jan. 2004.
[16] A. Paramythis, S. Weibelzahl, and J. Masthoff. Layered evaluation of interactive adaptive systems: Framework and formative methods. User Modeling and User-Adapted Interaction, 20(5):383–453, Dec. 2010.
[17] E. Pariser. The Filter Bubble: What the Internet Is Hiding from You. Viking, 2011.
[18] S. Park, S. Kang, S. Chung, and J. Song. NewsCube: Delivering multiple aspects of news to mitigate media bias. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '09, pages 443–452, New York, NY, USA, 2009. ACM.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24:513–523, 1988.
[20] Museum für Naturkunde Berlin. Fossil invertebrates, UnitID: MB.Ga.3895. http://coll.mfn-berlin.de/u/MB_Ga_3895.html.
[21] M. van Setten. Supporting People in Finding Information: Hybrid Recommender Systems and Goal-Based Structuring. PhD thesis, Telematica Instituut, University of Twente, The Netherlands, 2005.
[22] R. Walls, J. Deck, R. Guralnick, S. Baskauf, R. Beaman, et al. Semantics in support of biodiversity knowledge discovery: An introduction to the Biological Collections Ontology and related ontologies. PLoS ONE, 9(3):e89606, 2014.
[23] D. Wong, S. Faridani, E. Bitton, B. Hartmann, and K. Goldberg. The diversity donut: Enabling participant control over the diversity of recommended responses. In CHI '11 Extended Abstracts on Human Factors in Computing Systems, CHI EA '11, pages 1471–1476, New York, NY, USA, 2011. ACM.
[24] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys '08, pages 123–130, New York, NY, USA, 2008. ACM.
[25] C.-N. Ziegler, G. Lausen, and L. Schmidt-Thieme. Taxonomy-driven computation of product recommendations. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, CIKM '04, pages 406–415, New York, NY, USA, 2004. ACM.
Exploring Graph Partitioning for Shortest Path Queries on Road Networks

Theodoros Chondrogiannis
Free University of Bozen-Bolzano
tchond@inf.unibz.it

Johann Gamper
Free University of Bozen-Bolzano
gamper@inf.unibz.it

ABSTRACT

Computing the shortest path between two locations in a road network is an important problem that has found numerous applications. The classic solution for the problem is Dijkstra's algorithm [1]. Although simple and elegant, the algorithm has proven to be inefficient for very large road networks. To address this deficiency of Dijkstra's algorithm, a plethora of techniques that introduce some preprocessing to reduce the query time have been proposed. In this paper, we propose Partition-based Shortcuts (PbS), a technique based on graph partitioning which offers fast query processing and supports efficient edge weight updates. We present a shortcut computation scheme which exploits the traits of a graph partition. We also present a modified version of the bidirectional search [2], which uses the precomputed shortcuts to efficiently answer shortest path queries. Moreover, we introduce the Corridor Matrix (CM), a partition-based structure which is exploited to reduce the search space during the processing of shortest path queries when the source and the target point are close. Finally, we evaluate the performance of our modified algorithm in terms of preprocessing cost and query runtime for various graph partitioning configurations.

Keywords

Shortest path, road networks, graph partitioning

1. INTRODUCTION

Computing the shortest path between two locations in a road network is a fundamental problem and has found numerous applications. The problem can be formally defined as follows. Let G(V, E) be a directed weighted graph with vertices V and edges E. For each edge e ∈ E, a weight l(e) is assigned, which usually represents the length of e or the time required to cross e. A path p between two vertices s, t ∈ V is a sequence of connected edges, p(s, t) = ⟨(s, v1), (v1, v2), . . . , (vk, vt)⟩ where (vk, vk+1) ∈ E, that connects s and t. The shortest path between two vertices s and t is the path p(s, t) that has the shortest distance among all paths that connect s and t.

The classic solution for the shortest path problem is Dijkstra's algorithm [1]. Given a source s and a destination t in a road network G, Dijkstra's algorithm traverses the vertices in G in ascending order of their distances to s. However, Dijkstra's algorithm comes with a major shortcoming. When the distance between the source and the target vertex is large, the algorithm has to expand a very large subset of the vertices in the graph. To address this shortcoming, several techniques have been proposed over the last few decades [3]. Such techniques require a high start-up cost, but in terms of query processing they outperform Dijkstra's algorithm by orders of magnitude.

Although most of the proposed techniques offer fast query processing, the preprocessing is always performed under the assumption that the weights of a road network remain unchanged over time. Moreover, the preprocessing is metric-specific, thus it needs to be repeated for each metric. The recently proposed Customizable Route Planning [4] applies preprocessing for various metrics, i.e., distance, time, turn cost and fuel consumption. Such an approach allows a fast computation of shortest path queries using any metric desired by the user, at the cost of some extra space. Moreover, the update cost for the weights is low, since the structure is designed such that only a small part of the preprocessed information has to be recomputed. In this paper, our aim is to develop an approach which offers even faster query processing, while keeping the update cost of the preprocessed information low. This is particularly important in dynamic networks, where edge weights might change frequently, e.g., due to traffic jams.

The contributions of this paper can be summarized as follows:

• We present Partition-based Shortcuts (PbS), a preprocessing method which is based on Customizable Route Planning (CRP), but computes more shortcuts in order to reduce the query processing time.

• We propose the Corridor Matrix (CM), a pruning technique which can be used for shortest path queries when the source and the target are very close and the precomputed shortcuts cannot be exploited.

• We run experiments for several different partition configurations and we evaluate our approach in terms of both preprocessing and query processing cost.

The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe in detail the preprocessing phase of our method. In Section 4, we introduce the Corridor Matrix. In Section 5, we present a modified version of the bidirectional search algorithm. In Section 6, we show preliminary results of an empirical evaluation. Section 7 concludes the paper and points to future research directions.
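Since all techniques discussed below are measured against this baseline, the following is a minimal Java sketch of the plain Dijkstra search described above. It is an illustration only; the adjacency-list representation and class names are our own assumptions, not the implementation evaluated in Section 6.

    import java.util.*;

    // Hedged sketch of the baseline Dijkstra search described in Section 1.
    // The adjacency-list layout (int[]{target, weight} per edge) is an assumption.
    public class DijkstraBaseline {

        /** adj.get(u) holds pairs {v, l(u,v)} for every edge (u, v) ∈ E. */
        static double[] shortestDistances(List<List<int[]>> adj, int source) {
            int n = adj.size();
            double[] dist = new double[n];
            Arrays.fill(dist, Double.POSITIVE_INFINITY);
            dist[source] = 0.0;

            // Priority queue ordered by tentative distance; entries are {vertex, distance}.
            PriorityQueue<double[]> queue =
                    new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[1]));
            queue.add(new double[]{source, 0.0});

            while (!queue.isEmpty()) {
                double[] top = queue.poll();
                int u = (int) top[0];
                if (top[1] > dist[u]) continue;      // stale entry, vertex already settled
                for (int[] edge : adj.get(u)) {      // relax the outgoing edges of u
                    int v = edge[0];
                    double w = edge[1];
                    if (dist[u] + w < dist[v]) {
                        dist[v] = dist[u] + w;
                        queue.add(new double[]{v, dist[v]});
                    }
                }
            }
            return dist;
        }
    }

A point-to-point query can stop as soon as t is settled, but when t is far from s the search still settles a large portion of the graph, which is exactly the shortcoming that the preprocessing techniques of Section 2 and the PbS method aim to address.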
2. RELATED WORK

The preprocessing-based techniques that have been proposed in order to reduce the time required for processing shortest path queries can be classified into different categories [3]. Goal-directed techniques use either heuristics or precomputed information in order to limit the search space by excluding vertices that are not in the direction of the target. For example, A* [5] search uses the Euclidean distance as a lower bound. ALT [6] uses precomputed shortest path distances to a carefully selected set of landmarks and produces the lower bound using the triangle inequality. Some goal-directed techniques exploit graph partitioning in order to prune the search space and speed up queries. Precomputed Cluster Distances (PCD) [7] partitions the graph into k components, computes the distance between all pairs of components and uses the distances between components to compute lower bounds. Arc Flags [8] maintains a vector of k bits for each edge, where the i-th bit is set if the arc lies on a shortest path to some vertex of component i. Otherwise, all edges of component i are pruned by the search algorithm.

Path-coherent techniques take advantage of the fact that shortest paths in road networks are often spatially coherent. To illustrate the concept of spatial coherence, let us consider four locations s, s′, t and t′ in a road network. If s is close to s′ and t is close to t′, the shortest path from s to t is likely to share vertices with the shortest path from s′ to t′. Spatial coherence methods precompute all shortest paths and then use data structures to index the paths and answer queries efficiently. For example, Spatially Induced Linkage Cognizance (SILC) [9] uses a quad-tree [10] to store the paths. Path-Coherent Pairs Decomposition (PCPD) [11] computes unique path-coherent pairs and retrieves any shortest path recursively in almost linear time in the size of the path.

Bounded-hop techniques aim to reduce a shortest path query to a number of look-ups. Transit Node Routing (TNR) [12] is an indexing method that imposes a grid on the road network and precomputes the shortest paths from within each grid cell C to a set of vertices that are deemed important for C (so-called access nodes of C). More approaches are based on the theory of 2-hop labeling [13]. During preprocessing, a label L(u) is computed for each vertex u of the graph such that for any pair u, v of vertices, the distance dist(u, v) can be determined by only looking at the labels L(u) and L(v). A natural special case of this approach is Hub Labeling (HL) [14], in which the label L(u) associated with vertex u consists of a set of vertices (the hubs of u), together with their distances from u.

Finally, hierarchical techniques aim to impose a total order on the nodes, deeming nodes that are crossed by many shortest paths more important. Highway Hierarchies (HH) [15] and their direct descendant Contraction Hierarchies (CH) organize the nodes in the road network into a hierarchy based on their relative importance, and create shortcuts among vertices at the same level of the hierarchy. Arterial Hierarchies (AH) [16] are inspired by CH, but produce shortcuts by imposing a grid on the graph. AH outperform CH in terms of both asymptotic and practical performance [17]. Some hierarchical approaches exploit graph partitioning to create shortcuts. HEPV [18] and HiTi [19] are techniques that pre-compute the distance between any two boundary vertices and create a new overlay graph. By partitioning the overlay graph and repeating the process several times, a hierarchy of partitions is created, which is used to process shortest path queries.

The recent Customizable Route Planning (CRP) [4] is the closest work to our own. CRP is able to handle various arbitrary metrics and can also handle dynamic edge weight updates. CRP uses PUNCH [20], a graph partitioning algorithm tailored to road networks. CRP pre-computes distances between boundary vertices in each component and then applies a modified bidirectional search algorithm which expands only the shortcuts and the edges in the source or the target component. The main difference between our approach and CRP is that, instead of computing only shortcuts between border nodes in each component, we compute shortcuts from every node of a component to the border nodes of the same component. The extra shortcuts enable the bidirectional algorithm to start directly from the border nodes, while CRP has to scan the original edges of the source and the target component.

3. PBS PREPROCESSING

The Partition-based Shortcuts (PbS) method we propose exploits graph partitioning to produce shortcuts in a preprocessing phase, which during the query phase are used to efficiently compute shortest path queries. The idea is similar to the concept of transit nodes [12]. Every shortest path between two nodes located in different partitions (also termed components) can be expressed as a combination of three smaller shortest paths. Consider the graph in Figure 1 and a query q(s, t), where s ∈ C1 and t ∈ C5. The shortest path from s to t can be expressed as p(s, bs) + p(bs, bt) + p(bt, t), where bs ∈ {b1, b2} and bt ∈ {b3, b4, b5}. Before PbS is able to process shortest path queries, a preprocessing phase is required, which consists of three steps: graph partitioning, in-component shortcut computation and shortcut graph construction.

3.1 Graph Partitioning

The first step in the pre-processing phase is the graph partitioning. Let G(V, E) be a graph with vertices V and edges E. A partition of G is a set P(G) = {C1, . . . , Ck} of connected subgraphs Ci of G, also referred to as components of G. For the set P(G), all components must be pairwise disjoint, i.e., Ci ∩ Cj = ∅ for i ≠ j. Moreover, let V1, . . . , V|P(G)| be the sets of vertices of each component. The vertex sets of all components must cover the vertex set of the graph, i.e., V1 ∪ . . . ∪ V|P(G)| = V. We assign a tag to each node of the original graph, which indicates the component the node is located in. The set of connecting edges, EC ⊆ E, is the set of all edges in the graph for which the source and target nodes belong to different components, i.e., (n, n′) ∈ E such that n ∈ Ci, n′ ∈ Cj and Ci ≠ Cj. Finally, we define the border nodes of a component C. A node n ∈ C is a border node of C if there exists a connecting edge e = (n, n′) or e = (n′, n), i.e., n′ is not in C. If e = (n, n′), n is called an outgoing border node of C, whereas if e = (n′, n), n is called an incoming border node of C. The set of all border nodes of a graph is referred to as B. Figure 1 illustrates a graph partitioned into five components. The filled nodes are the border nodes. Note that for ease of exposition we use only undirected graphs in the examples.

Figure 1: Partitioned graph into five components.
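To make the preceding definitions concrete, here is a minimal Java sketch that derives the connecting edges and the incoming/outgoing border nodes from a component tagging. It is a sketch under assumed data structures (edge list of int pairs, a comp[] array of component tags), not the paper's implementation.

    import java.util.*;

    // Hedged sketch of the definitions in Section 3.1: given one component tag per node,
    // collect the connecting edges E_C and the incoming/outgoing border nodes.
    public class PartitionBorders {

        static void classify(List<int[]> edges, int[] comp,
                             List<int[]> connectingEdges,
                             Set<Integer> incomingBorder,
                             Set<Integer> outgoingBorder) {
            for (int[] e : edges) {
                int u = e[0], v = e[1];
                if (comp[u] != comp[v]) {        // endpoints in different components: connecting edge
                    connectingEdges.add(e);
                    outgoingBorder.add(u);       // u has an edge leaving its component
                    incomingBorder.add(v);       // v has an edge entering its component
                }
            }
        }
    }

The union of the two border sets forms B, the vertex set of the shortcut graph constructed in Section 3.3.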
We characterize a graph partition as good if it minimizes the number of connecting edges between the components. However, graph partitioning is an NP-hard problem, thus an optimal solution is out of the question [21]. A popular approach is multilevel graph partitioning (MGP), which can be found in many software libraries, such as METIS [22]. Algorithms such as PUNCH [20] and Spatial Partition Clustering (SPC) [23] take advantage of road network characteristics in order to provide a more efficient graph partitioning. We use METIS for graph partitioning since it is the most efficient approach out of all available ones [24]. METIS requires only the number of components as an argument in order to perform the partitioning. The number of components influences both the number of the in-component shortcuts and the size of the shortcut graph.

3.2 In-component Shortcuts

The second step of the preprocessing phase is the computation of the in-component shortcuts. For each node n in the original graph, we compute the shortest path from the node to every outgoing border node of the component in which n is located. Then we create outgoing shortcuts which abstract the shortest path from n to each outgoing border node. The incoming shortcuts are computed in a similar fashion. Thus, the total number of in-component shortcuts, S, is

S = Σ_{i=1}^{k} N_i × (|B_i^inc| + |B_i^out|),

where N_i is the number of nodes in component Ci and B_i^inc, B_i^out are the incoming and outgoing border nodes of Ci, respectively. Figure 2 shows the in-component shortcuts for a node located in component C2.

Figure 2: In-component shortcuts for a given node.

For each border node in a component, b ∈ C, we execute Dijkstra's algorithm with b as source and all other nodes (including border nodes) in C as targets. Depending on the type of the source node, the expansion strategy is different. When an incoming border node is the source, forward edges are expanded; vice versa, when an outgoing border node is the source, incoming edges are expanded. This strategy ensures that the maximum number of node expansions is at most twice the number of border nodes of G.

3.3 Shortcut Graph Construction

The third step of the preprocessing phase of our approach is the construction of the shortcut graph. Given a graph G, the shortcut graph of G is a graph Gsc(B, Esc), where B is the set of border nodes of G and Esc = EC ∪ SG is the union of the connecting edges, EC, of G and the shortcuts, SG, from every incoming border node to every outgoing border node of the same component. Thus, the number of vertices and edges in the shortcut graph is, respectively,

|B| = Σ_{i=1}^{k} |B_i^inc ∪ B_i^out|   and   |Esc| = Σ_{i=1}^{k} (|B_i^inc| × |B_i^out|) + |EC|.

Figure 3 shows the shortcut graph of our running example. Notice that only border nodes are vertices of the shortcut graph. The set of edges consists of connecting edges and the in-component shortcuts between the border nodes of the same component. Note that there is no need for extra computations in order to populate the shortcut graph.

Figure 3: Shortcut Graph illustrated over the original.

4. CORRIDOR MATRIX

In Section 3 we presented how PbS creates shortcuts in order to answer queries when the source and the target points are in different components. However, when the source and the target points of a query are located in the same component, the shortest path may lie entirely inside the component. Therefore, the search algorithm will never reach the border nodes and the shortcuts will not be expanded. In such a case, the common approach is to use bidirectional search to return the shortest path. However, if the components of the partitioned graph are large, the query processing can be quite slow. In order to improve the processing time of such queries, we partition each component again into sub-components, and for each component, we compute its Corridor Matrix (CM). In general, given a partition of a graph G into k components, the Corridor Matrix (CM) of G is a k × k matrix, where each cell C(i, j) of the CM contains a list of components that are crossed by some shortest path from a node s ∈ Ci to a node t ∈ Cj. We call such a list the corridor from Ci to Cj. The concept of the CM is similar to Arc-Flags [8], but the CM requires much less space. The space complexity of the CM is O(k³), where k is the number of components in the partition, while the space complexity of Arc-Flags is |E| × k², where |E| is the number of edges in the original graph.

        C1   C2   C3   C4   C5
   C1   ∅                   {C2, C3}
   C2        ∅
   C3             ∅
   C4                  ∅
   C5                       ∅

Figure 4: Corridor Matrix example.
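To make the corridor look-up concrete, here is a minimal Java sketch of a Corridor Matrix whose cells are stored as bitsets, roughly along the lines of the bitmap representation discussed below. All class and method names are our own assumptions, not the paper's code.

    import java.util.BitSet;

    // Hedged sketch of a Corridor Matrix over k sub-components.
    // Cell (i, j) stores the sub-components crossed by some shortest path
    // from sub-component i to sub-component j, encoded as a BitSet of length k.
    public class CorridorMatrix {
        private final int k;
        private final BitSet[][] cells;   // null cell = empty corridor, no space allocated

        public CorridorMatrix(int k) {
            this.k = k;
            this.cells = new BitSet[k][k];
        }

        /** Record that some shortest path from sub-component i to j crosses sub-component c. */
        public void addToCorridor(int i, int j, int c) {
            if (cells[i][j] == null) {
                cells[i][j] = new BitSet(k);
            }
            cells[i][j].set(c);
        }

        /**
         * Pruning test for an in-component query: a node tagged with sub-component c
         * may be expanded for a query from sub-component i to sub-component j only if
         * c is the source or target sub-component or lies inside the corridor.
         */
        public boolean mayExpand(int i, int j, int c) {
            if (c == i || c == j) {
                return true;
            }
            BitSet corridor = cells[i][j];
            return corridor != null && corridor.get(c);
        }
    }

During preprocessing, addToCorridor could be fed from the Floyd-Warshall pass over the shortcut graph described in the next paragraph; during an in-component query, mayExpand would decide which sub-components the bidirectional search is allowed to enter.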
   To optimize the look-up time in CM, we implemented each com-                          Name                  Region              # Vertices            # Edges
ponent list using a bitmap of length k. Therefore, the space com-                         CAL             California/Nevada        1,890,815            4,657,742
plexity of the CM in the worst case is O(k3 ). The actual space                           FLA                  Florida             1,070,376            2,712,798
occupied by the CM is smaller, since we do not allocate space for                         BAY               SF Bay Area             321,270              800,172
bitmaps when the component list is empty. For the computation of                          NY               New York City            264,346              733,846
the Corridor Matrix, we generate the Shortcut Graph in the same                          ROME              Center of Rome             3353                8,859
way as described in Section 3.3. To compute the distances between
all pairs of vertices, we use the Floyd-Warshall algorithm [25],                                           Table 1: Dataset characteristics.
which is specifically designed to compute the all-pair shortest path
distance efficiently. After having computed the distances between
the nodes, instead of retrieving each shortest path, we retrieve only
the components that are crossed by each path, and we update the               contain 1000 queries each. We make sure that the distance of ev-
CM accordingly.                                                               ery query in set Qi is smaller than the distance of every query
5. SHORTEST PATH ALGORITHM
   In order to process a shortest path query from a source point s to a target point t, we first determine the components Cs and Ct in which the nodes s and t are located. If Cs = Ct, we execute a modified bidirectional search from s to t. Note that the shortcuts are not used for processing queries for which the source and target are located in the same component C. Instead, we retrieve the appropriate corridor from the CM of C, which contains a list of sub-components. Then, we apply bidirectional search and prune all nodes that belong to sub-components which are not in the retrieved corridor.
   In the case that the points s and t are not located in the same component, we exploit the pre-computed shortcuts. First, we retrieve the lengths of the in-component outgoing shortcuts from s to all the outgoing borders of Cs and the lengths of the in-component incoming shortcuts from all the incoming borders of Ct to t. Then we apply a many-to-many bidirectional search in the overlay graph from all the outgoing borders of Cs to all the incoming borders of Ct. We use the lengths of the in-component shortcuts (retrieved in the first step) as initial weights for the source and target nodes of the bidirectional search in the Shortcut Graph. The list of edges constituting the path is a set of connecting edges of the original graph and in-component shortcuts. For each shortcut we retrieve the pre-computed set of original edges. The cost to retrieve the original path is linear in the size of the path. After the retrieval we replace the shortcuts with the list of edges in the original graph and return the new edge list, which is the shortest path from s to t in the original graph.
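The pruning rule of the in-component case can be sketched as follows. For brevity the sketch uses a plain unidirectional Dijkstra search, whereas the description above applies the same rule inside a bidirectional search; all type and parameter names are illustrative assumptions, not the actual PbS code.

    // Sketch: Dijkstra restricted to the corridor retrieved from the CM of component C.
    import java.util.Arrays;
    import java.util.BitSet;
    import java.util.List;
    import java.util.PriorityQueue;

    final class CorridorPrunedSearch {

        record Edge(int to, double w) {}   // weighted edge of the original graph

        /**
         * graph     : adjacency lists of the original graph restricted to component C
         * subCompOf : subCompOf[u] = sub-component id of node u within C
         * corridor  : bitmap retrieved from the CM of C for (subCompOf[s], subCompOf[t])
         * Returns the shortest distance from s to t, expanding only nodes whose
         * sub-component is contained in the corridor.
         */
        static double query(List<List<Edge>> graph, int[] subCompOf, BitSet corridor, int s, int t) {
            double[] dist = new double[graph.size()];
            Arrays.fill(dist, Double.POSITIVE_INFINITY);
            dist[s] = 0.0;
            PriorityQueue<double[]> pq = new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
            pq.add(new double[]{0.0, s});
            while (!pq.isEmpty()) {
                double[] top = pq.poll();
                int u = (int) top[1];
                if (top[0] > dist[u]) continue;              // stale queue entry
                if (u == t) return dist[u];                  // target settled
                for (Edge e : graph.get(u)) {
                    // Corridor pruning: skip nodes of sub-components not listed in the corridor.
                    if (!corridor.get(subCompOf[e.to()])) continue;
                    double nd = dist[u] + e.w();
                    if (nd < dist[e.to()]) {
                        dist[e.to()] = nd;
                        pq.add(new double[]{nd, e.to()});
                    }
                }
            }
            return Double.POSITIVE_INFINITY;                 // t not reachable inside the corridor
        }
    }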
6. PRELIMINARY RESULTS
   In this section, we compare our PbS method with CRP, the method our own approach is based on, and CH, a lightweight yet very efficient state-of-the-art approach for shortest path queries in road networks [17]. CRP can handle arbitrary metrics and edge weight updates, while CH is a technique with fast pre-processing and relatively low query processing time. We implemented the basic versions of CRP and PbS in Java. The CH algorithm in the experiments is from the GraphHopper Route Planner [26]. Due to the different implementations of the graph models between ours and CH, we do not measure the runtime. Instead, for preprocessing we count the extra shortcuts created by each algorithm, while for query processing we count the number of expanded nodes.
   For the experiments we follow the same evaluation setting as in [17]. We use 5 publicly available datasets [27], four of which are part of the US road network, while the smallest one represents the road network of Rome. We present the characteristics of each dataset in Table 1. In order to compare our PbS approach and CRP with CH, we run our experiments over 5 query sets Q1–Q5, which contain 1000 queries each. We make sure that the distance of every query in set Qi is smaller than the distance of every query in set Qi+1. We also evaluate the CM separately by comparing our CM implementation against Arc Flags and the original bidirectional search for a set of 1000 random queries in the ROME dataset. We use a small dataset in order to simulate in-component query processing.

       Dataset    Region               # Nodes      # Edges
       CAL        California/Nevada    1,890,815    4,657,742
       FLA        Florida              1,070,376    2,712,798
       BAY        SF Bay Area            321,270      800,172
       NY         New York City          264,346      733,846
       ROME       Center of Rome           3,353        8,859

                  Table 1: Dataset characteristics.

6.1 Preprocessing and Space Overhead
   Figures 5 and 6 show a series of measurements for the preprocessing cost of our approach in comparison to CRP and CH over the four largest datasets. Figure 5 shows how many shortcuts are created by each approach. The extra shortcuts can be translated into the space overhead required in order to speed up shortest path queries. CH uses shortcuts which represent only two edges, while the shortcuts in PbS and CRP are composed of much longer sequences. The difference between the numbers of shortcuts produced by CRP and CH is much smaller. In short, PbS produces about two orders of magnitude more shortcuts than CRP and CH. Moreover, we can observe that the number of shortcuts produced by PbS decreases as the number of components increases.

   [Figure 5: Preprocessing: # of shortcuts vs. # of components. Panels: (a) NY, (b) BAY, (c) FLA, (d) CAL; y-axis: number of shortcuts created by CH, CRP, and PbS.]

   The same tendency as observed for the number of shortcuts can be observed for the preprocessing time. In Figure 6, we can see that PbS requires much more time than CRP and CH in order to create shortcuts. However, we should also notice that the update cost for CRP and PbS is only a small portion of the preprocessing cost.
When an edge weight changes, we need to update only the shortcuts that contain that particular edge. In contrast, for CH the update cost is the same as the preprocessing cost, since a change in a single weight can influence the entire hierarchy.

   [Figure 6: Preprocessing: time vs. # of components. Panels: (a) NY, (b) BAY, (c) FLA, (d) CAL; y-axis: preprocessing time (sec) for CH, CRP, and PbS.]

6.2 Query Processing
   Figure 7 shows a series of measurements of the performance of CRP and PbS. We evaluate both techniques for different partitions and various numbers of components. An important observation is the tendency of the performance for CRP and PbS. The performance of CRP gets worse for partitions with many components, while the opposite happens for PbS. The reason is that for partitions with few components, PbS manages to process many queries with two look-ups (the case where the source and the target are in adjacent components).

   [Figure 7: Performance of shortest path queries vs. # of components. Panels: (a) NY, (b) BAY, (c) FLA, (d) CAL; y-axis: number of expanded nodes (·10⁴) for CRP and PbS.]

   In Figure 8 we compare CH with CRP (we choose the best result) and two configurations of PbS: PbS-BT, which is the configuration that leads to the best performance, and PbS-AVG, which is the average performance of PbS among all configurations. We can see that PbS outperforms CRP in all datasets from Q1 to Q5. However, CH is faster in terms of query processing than our PbS approach. CH is more suitable for static networks, as the constructed hierarchy of shortcuts enables the shortest path algorithm to expand much fewer nodes.

   [Figure 8: Performance of shortest path queries vs. query sets. Panels: (a) NY, (b) BAY, (c) FLA, (d) CAL; x-axis: query sets Q1–Q5; compared methods: CH, CRP, PbS-BT, PbS-AVG.]

6.3 In-component Queries
   In Figure 9, we compare the performance of our bidirectional algorithm using the proposed CM, the original bidirectional search, and the bidirectional algorithm using Arc Flags. We observe that the bidirectional search is the slowest, since no pruning is applied. Between Arc Flags and CM, the Arc Flags provide slightly better pruning and thus fewer nodes expanded by the bidirectional search. On the other hand, the preprocessing time required to compute the Arc Flags is significantly higher than the time required to compute the CM.

   [Figure 9: Evaluation of Arc Flags & CM using the ROME dataset. Panels: (a) preprocessing time (ms), (b) visited nodes; compared methods: bidirectional search, Arc Flags, CM.]

7. CONCLUSION
   In this paper we presented PbS, an approach which uses graph partitioning in order to compute shortcuts and speed up shortest path queries in road networks. Our aim was a solution which supports efficient and incremental updates of edge weights, yet is efficient enough for many real-world applications. In the evaluation, we showed that our PbS approach outperforms CRP. PbS supports edge weight updates, as any change in the weight of an edge can influence only shortcuts in a single component. On the other hand, CH is faster than our PbS approach. However, CH cannot handle edge weight updates well, as almost the entire hierarchy of shortcuts has to be recomputed every time a single weight changes. For queries where the source and the target are in the same component, we introduced the CM. The efficiency of the CM in query processing approaches the efficiency of Arc Flags, while consuming much less space.
   In future work, we plan to extend our approach to support multi-modal transportation networks, where the computation has to consider a time schedule, and dynamic and traffic-aware networks, where the weights of the edges change over time. We will also improve the preprocessing phase of our approach both in terms of time overhead, by using parallel processing, and space overhead, by using compression techniques or by storing some of the precomputed information on disk.

8. REFERENCES
[1] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1):269–271, December 1959.
[2] I. S. Pohl. Bi-directional and Heuristic Search in Path Problems. PhD thesis, Stanford, CA, USA, 1969. AAI7001588.
[3] H. Bast, D. Delling, A. Goldberg, M. Müller, T. Pajor, P. Sanders, D. Wagner, and R. Werneck. Route planning in transportation networks. Technical Report MSR-TR-2014-4, January 2014.
[4] D. Delling, A. V. Goldberg, T. Pajor, and R. F. Werneck. Customizable route planning. In Proc. of the 10th Int. Symposium on Experimental Algorithms (SEA), pages 376–387, 2011.
[5] P. Hart, N. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[6] A. V. Goldberg and C. Harrelson. Computing the shortest path: A* search meets graph theory. In Proc. of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 156–165, 2005.
[7] J. Maue, P. Sanders, and D. Matijevic. Goal-directed shortest-path queries using precomputed cluster distances. Journal on Experimental Algorithms, 14:2:3.2–2:3.27, January 2010.
[8] E. Köhler, R. H. Möhring, and H. Schilling. Fast point-to-point shortest path computations with arc-flags. In Proc. of the 9th DIMACS Implementation Challenge, 2006.
[9] J. Sankaranarayanan, H. Alborzi, and H. Samet. Efficient query processing on spatial networks. In Proc. of the 2005 Int. Workshop on Geographic Information Systems (GIS), page 200, 2005.
[10] R. A. Finkel and J. L. Bentley. Quad trees: A data structure for retrieval on composite keys. Acta Informatica, 4(1):1–9, 1974.
[11] J. Sankaranarayanan, H. Samet, and H. Alborzi. Path oracles for spatial networks. In Proc. of the 35th VLDB Conf., pages 1210–1221, 2009.
[12] H. Bast, S. Funke, D. Matijevic, P. Sanders, and D. Schultes. In transit to constant time shortest-path queries in road networks. In Proc. of the Workshop on Algorithm Engineering and Experiments, pages 45–59, 2007.
[13] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In Proc. of the 13th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 937–946, 2002.
[14] I. Abraham, D. Delling, A. V. Goldberg, and R. F. Werneck. A hub-based labeling algorithm for shortest paths in road networks. In Proc. of the 10th Int. Symposium on Experimental Algorithms, pages 230–241, 2011.
[15] P. Sanders and D. Schultes. Highway hierarchies hasten exact shortest path queries. In Proc. of the 13th European Symposium on Algorithms (ESA), pages 568–579, 2005.
[16] A. D. Zhu, H. Ma, X. Xiao, S. Luo, Y. Tang, and S. Zhou. Shortest path and distance queries on road networks: Towards bridging theory and practice. In Proc. of the 32nd SIGMOD Conf., pages 857–868, 2013.
[17] L. Wu, X. Xiao, D. Deng, G. Cong, and A. D. Zhu. Shortest path and distance queries on road networks: An experimental evaluation. In Proc. of the 39th VLDB Conf., pages 406–417, 2012.
[18] Y. W. Huang, N. Jing, and E. A. Rundensteiner. Hierarchical path views: A model based on fragmentation and transportation road types. In Proc. of the 3rd ACM Workshop on Geographic Information Systems (GIS), 1995.
[19] S. Jung and S. Pramanik. HiTi graph model of topographical roadmaps in navigation systems. In Proc. of the 12th ICDE Conf., pages 76–84, 1996.
[20] D. Delling, A. V. Goldberg, I. Razenshteyn, and R. F. Werneck. Graph partitioning with natural cuts. In Proc. of the 35th Int. Parallel & Distributed Processing Symposium (IPDPS), pages 1135–1146, 2011.
[21] A. E. Feldmann and L. Foschini. Balanced partitions of trees and applications. In Proc. of the 29th Symposium on Theoretical Aspects of Computer Science, volume 14, pages 100–111, Paris, France, 2012.
[22] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
[23] Y. W. Huang, N. Jing, and E. Rundensteiner. Effective graph clustering for path queries in digital map databases. In Proc. of the 5th Int. Conf. on Information and Knowledge Management, pages 215–222, 1996.
[24] X. Sui, D. Nguyen, M. Burtscher, and K. Pingali. Parallel graph partitioning on multicore architectures. In Proc. of the 23rd Int. Conf. on Languages and Compilers for Parallel Computing, pages 246–260, 2011.
[25] R. W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5:345, 1962.
[26] GraphHopper Route Planner. https://graphhopper.com.
[27] 9th DIMACS Implementation Challenge: Shortest Paths. http://www.dis.uniroma1.it/challenge9/.
Missing Value Imputation in Time Series using Top-k Case Matching

Kevin Wellenzohn, Free University of Bozen-Bolzano, kevin.wellenzohn@unibz.it
Hannes Mitterer, Free University of Bozen-Bolzano, hannes.mitterer@unibz.it
Johann Gamper, Free University of Bozen-Bolzano, gamper@inf.unibz.it
M. H. Böhlen, University of Zurich, boehlen@ifi.uzh.ch
Mourad Khayati, University of Zurich, mkhayati@ifi.uzh.ch

ABSTRACT
In this paper, we present a simple yet effective algorithm, called the Top-k Case Matching algorithm, for the imputation of missing values in streams of time series data that are similar to each other. The key idea of the algorithm is to look for the k situations in the historical data that are most similar to the current situation and to derive the missing value from the measured values at these k time points. To efficiently identify the top-k most similar historical situations, we adopt Fagin's Threshold Algorithm, yielding an algorithm with sub-linear runtime complexity with high probability, and linear complexity in the worst case (excluding the initial sorting of the data, which is done only once). We provide the results of a first experimental evaluation using real-world meteorological data. Our algorithm achieves a high accuracy and is more accurate and efficient than two more complex state-of-the-art solutions.

Keywords
Time series, imputation of missing values, Threshold Algorithm

1. INTRODUCTION
   Time series data is ubiquitous, e.g., in the financial stock market or in meteorology. In many applications time series data is incomplete, that is, some values are missing for various reasons, e.g., sensor failures or transmission errors. However, many applications assume complete data and hence need to recover missing values before further data processing is possible.
   In this paper, we focus on the imputation of missing values in long streams of meteorological time series data. As a case study, we use real-world meteorological data collected by the Südtiroler Beratungsring¹ (SBR), which is an organization that provides professional and independent consultancy to the local wine and apple farmers, e.g., to determine the optimal harvesting time or to warn about potential threats, such as apple scab, fire blight, or frost. Especially frost is dangerous, as it can destroy the harvest within a few minutes unless the farmers react immediately. The Südtiroler Beratungsring operates more than 120 weather stations spread all over South Tyrol, each of which collects up to 20 measurements every five minutes, including temperature, humidity, etc. The weather stations frequently suffer outages due to sensor failures or errors in the transmission of the data. However, the continuous monitoring of the current weather condition is crucial to immediately warn about imminent threats such as frost, and therefore the need arises to recover those missing values as soon as they are detected.
   In this paper, we propose an accurate and efficient method to automatically recover missing values. The need for a continuous monitoring of the weather condition at the SBR has two important implications for our solution. Firstly, the proposed algorithm has to be efficient enough to complete the imputation before the next set of measurements arrives in a few minutes time. Secondly, the algorithm cannot use future measurements, which would facilitate the imputation, since they are not yet available.
   The key idea of our Top-k Case Matching algorithm is to seek the k time points in the historical data when the measured values at a set of reference stations were most similar to the measured values at the current time point (i.e., the time point when a value is missing). The missing value is then derived from the values at these k past time points. While a naïve solution to identify the top-k most similar historical situations would have to scan the entire data set, we adopt Fagin's Threshold Algorithm, which efficiently answers top-k queries by scanning, on average, only a small portion of the data. The runtime complexity of our solution is derived from the Threshold Algorithm and is sub-linear with high probability and linear in the worst case, when all data need to be scanned. We provide the results of a first experimental evaluation using real-world meteorological data from the SBR. The results are promising both in terms of efficiency and accuracy. Our algorithm achieves a high accuracy and is more accurate than two state-of-the-art solutions.
   The rest of the paper is organized as follows. In Section 2, we review the existing literature about imputation methods for missing values. In Section 3, we introduce the basic notation and a running example. In Section 4, we present our Top-k Case Matching algorithm for the imputation of missing values, followed by the results of an experimental evaluation in Section 5. Section 6 concludes the paper and outlines ideas for future work.

¹http://www.beratungsring.org/

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

2. RELATED WORK
   Khayati et al. [4] present an algorithm, called REBOM, which recovers blocks of missing values in irregular (with non-repeating trends) time series data.
The algorithm is based on an iterated truncated matrix decomposition technique. It builds a matrix which stores the time series containing the missing values and its k most correlated time series according to the Pearson correlation coefficient [7]. The missing values are first initialized using a simple interpolation technique, e.g., linear interpolation. Then, the matrix is iteratively decomposed using the truncated Singular Value Decomposition (SVD). By multiplying the three matrices obtained from the decomposition, the algorithm is able to accurately approximate the missing values. Due to its quadratic runtime complexity, REBOM is not scalable for long time series data.
   Khayati et al. [5] further investigate the use of matrix decomposition techniques for the imputation of missing values. They propose an algorithm with linear space complexity based on the Centroid Decomposition, which is an approximation of SVD. Due to the memory-efficient implementation, the algorithm scales to long time series. The imputation follows a similar strategy as the one used in REBOM.
   The above techniques are designed to handle missing values in static time series. Therefore, they are not applicable in our scenario, as we have to continuously impute missing values as soon as they appear. A naïve approach that runs the algorithms each time a missing value occurs is not feasible due to their relatively high runtime complexity.
   There are numerous statistical approaches for the imputation of missing values, ranging from simple ones such as linear or spline interpolation all the way up to more complex models such as the ARIMA model. The ARIMA model [1] is frequently used for forecasting future values, but can be used for backcasting missing values as well, although this is a less common use case. A recent comparison of statistical imputation techniques for meteorological data is presented in [9]. The paper comprises several simple techniques, such as the (weighted) average of concurrent measurements at nearby reference stations, but also computationally more intensive algorithms, such as neural networks.

3. BACKGROUND
   Let S = {s1, . . . , sn} be a set of time series. Each time series s ∈ S has an associated set of reference time series Rs, Rs ⊆ S \ {s}. The value of a time series s ∈ S at time t is denoted as s(t). A sliding window of a time series s is denoted as s([t1, t2]) and represents all values between t1 and t2.

   EXAMPLE 1. Table 1 shows four temperature time series in a time window w = [1, 7], which in our application corresponds to seven timestamps in a range of 30 minutes. s is the base time series from the weather station in Schlanders, and Rs = {r1, r2, r3} is the associated set of reference time series containing the stations of Kortsch, Göflan, and Laas, respectively. The temperature value s(7) is missing. Figure 1 visualizes this example graphically.

       t ∈ w     s(t)      r1(t)     r2(t)     r3(t)
         1       16.1°     15.0°     15.9°     14.1°
         2       15.8°     15.2°     15.7°     13.9°
         3       15.9°     15.2°     15.8°     14.1°
         4       16.2°     15.0°     15.9°     14.2°
         5       16.5°     15.3°     15.7°     14.5°
         6       16.1°     15.2°     16.0°     14.1°
         7         ?       15.0°     16.0°     14.3°

       Table 1: Four time series in a window w = [1, 7].

   [Figure 1: Visualization of the time series data: temperature in degree Celsius over the timestamps 1–7 for s (Schlanders), r1 (Kortsch), r2 (Göflan), and r3 (Laas).]

   The Top-k Case Matching algorithm we propose assumes that the time series data is aligned, which generally is not the case for our data. Each weather station collects new measurements roughly every 5 minutes and transmits them to a central server. Since the stations are not perfectly synchronized, the timestamps of the measurements typically differ, e.g., one station collects measurements at 09:02, 09:07, . . . , while another station collects them at 09:04, 09:09, . . . . Therefore, in a pre-processing step we align the time series data using linear interpolation, which yields measurement values every 5 minutes (e.g., 00:00, 00:05, 00:10, . . . ). If we observe a gap of more than 10 minutes in the measurements, we assume that the value is missing.
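A minimal sketch of this alignment step is given below. It assumes that the raw readings of one station are available as (timestamp in minutes, value) pairs sorted by time, and that the grid boundaries are multiples of 5 minutes; the names and the exact treatment of gaps are our own illustration, not the SBR system's code.

    // Sketch: align raw readings to a 5-minute grid via linear interpolation;
    // grid points surrounded by a gap of more than 10 minutes are marked as missing (NaN).
    import java.util.TreeMap;

    final class Alignment {

        /** raw: timestamp (minutes since some epoch) -> measured value, sorted by time. */
        static double[] alignTo5MinGrid(TreeMap<Long, Double> raw, long from, long to) {
            int n = (int) ((to - from) / 5) + 1;
            double[] grid = new double[n];
            for (int i = 0; i < n; i++) {
                long t = from + 5L * i;
                Long lo = raw.floorKey(t), hi = raw.ceilingKey(t);
                if (lo == null || hi == null || hi - lo > 10) {
                    grid[i] = Double.NaN;               // gap of more than 10 minutes: missing
                } else if (lo.equals(hi)) {
                    grid[i] = raw.get(lo);              // exact hit at a raw reading
                } else {                                // linear interpolation between lo and hi
                    double f = (t - lo) / (double) (hi - lo);
                    grid[i] = raw.get(lo) + f * (raw.get(hi) - raw.get(lo));
                }
            }
            return grid;
        }
    }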



   For the imputation of missing values we assign to each time series s a set Rs of reference time series, which are similar to s. The notion of similarity between two time series is tricky, though. Intuitively, we want time series to be similar when they have similar values and behave similarly, i.e., values increase and decrease roughly at the same time and by the same amount.
   As a simple heuristic for time series similarity, we use the spatial proximity between the stations that record the respective time series. The underlying assumption is that, if the weather stations are nearby (say within a radius of 5 kilometers), the measured values should be similar, too. Based on this assumption, we manually compiled a list of 3–5 reference time series for each time series. This heuristic turned out to work well in most cases, though there are situations where the assumption simply does not hold. One reason for the generally good results is most likely that in our data set the over 100 weather stations cover a relatively small area, and hence the stations are very close to each other.

4. TOP-K CASE MATCHING
   Weather phenomena are often repeating, meaning that, for example, during a hot summer day in 2014 the temperatures measured at the various weather stations are about the same as those measured during an equally hot summer day in 2011. We use this observation for the imputation of missing values. Let s be a time series where the current measurement at time θ, s(θ), is missing. The assumption on which we base the imputation is as follows: if we find historical situations in the reference time series Rs such that the past values are very close to the current values at time θ, then also the past measurements in s should be very similar to the missing value s(θ). Based on this assumption, the algorithm searches for similar climatic situations in historical measurements, thereby leveraging the vast history of weather records collected by the SBR.
   More formally, given a base time series s with reference time series Rs, we are looking for the k timestamps (i.e., historical situations) D = {t1, . . . , tk}, ti < θ, which minimize the error function
      δ(t) = Σ_{r ∈ Rs} |r(θ) − r(t)|.

That is, δ(t) ≤ δ(t′) for all t ∈ D and t′ ∉ D ∪ {θ}. The error function δ(t) is the accumulated absolute difference between the current temperature r(θ) and the temperature at time t, r(t), over all reference time series r ∈ Rs. Once D is determined, the missing value is recovered using some aggregation function g({s(t) | t ∈ D}) over the measured values of the time series s at the timestamps in D. In our experiments we tested the average and the median as aggregation functions (cf. Section 5).

   EXAMPLE 2. We show the imputation of the missing value s(7) in Table 1 using the average as aggregation function g. For the imputation, we seek the k = 2 most similar historical situations. The two timestamps D = {4, 1} minimize δ(t), with δ(4) = |15.0° − 15.0°| + |16.0° − 15.9°| + |14.3° − 14.2°| = 0.2° and δ(1) = 0.3°. The imputation is then simply the average of the base station measurements at times t = 4 and t = 1, i.e., s(7) = avg(16.2°, 16.1°) = ½ (16.2° + 16.1°) = 16.15°.

   A naïve implementation of this algorithm would have to scan the entire database of historical data to find the k timestamps that minimize δ(t). This is, however, not scalable for huge time series data, hence a more efficient technique is needed.
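The following sketch spells out this naïve baseline, assuming each (aligned) series is stored as an array indexed by timestamp with NaN marking missing values; it is meant only to make the cost of the full scan explicit, and all names are illustrative.

    // Naïve imputation: scan all past timestamps, compute δ(t), keep the k smallest,
    // and aggregate the base series values at those timestamps with the average.
    import java.util.Comparator;
    import java.util.PriorityQueue;

    final class NaiveTopKImputation {

        static double impute(double[] s, double[][] refs, int theta, int k) {
            // Max-heap on δ so that the worst of the k current candidates is on top.
            PriorityQueue<double[]> topK =
                    new PriorityQueue<>(Comparator.comparingDouble((double[] c) -> c[1]).reversed());

            for (int t = 0; t < theta; t++) {               // only past timestamps are considered
                if (Double.isNaN(s[t])) continue;           // base value at t must itself be present
                double delta = 0.0;                         // δ(t) = Σ_r |r(θ) − r(t)|
                for (double[] r : refs) delta += Math.abs(r[theta] - r[t]);
                if (Double.isNaN(delta)) continue;          // a reference value was missing
                if (topK.size() < k) topK.add(new double[]{t, delta});
                else if (delta < topK.peek()[1]) { topK.poll(); topK.add(new double[]{t, delta}); }
            }

            double sum = 0.0;                               // aggregation g: average of s(t), t ∈ D
            int cnt = 0;
            for (double[] c : topK) { sum += s[(int) c[0]]; cnt++; }
            return cnt == 0 ? Double.NaN : sum / cnt;
        }
    }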
4.1 Fagin's Threshold Algorithm
   What we are actually trying to do is to answer a top-k query for the k timestamps which minimize δ(t). There exist efficient algorithms for top-k queries. For example, Fagin's algorithm [2] solves this problem by looking only at a small fraction of the data. Since the first presentation of Fagin's algorithm there have been two noteworthy improvements, namely the Threshold Algorithm (TA) by Fagin et al. [3] and a probabilistic extension by Theobald et al. [8]. The latter approach speeds up TA by relaxing the requirement to find the exact top-k answers and providing approximations with probabilistic guarantees.
   Our Top-k Case Matching algorithm is a variation of TA with slightly different settings. Fagin et al. assume objects with m attributes, a grade for each attribute, and a monotone aggregation function f : R^m → R, which aggregates the m grades of an object into an overall grade. The monotonicity property is defined as follows.

   DEFINITION 1 (Monotonicity). Let x1, . . . , xm and x′1, . . . , x′m be the m grades for objects X and X′, respectively. The aggregation function f is monotone if f(x1, . . . , xm) ≤ f(x′1, . . . , x′m), given that xi ≤ x′i for each 1 ≤ i ≤ m.

   TA finds the k objects that maximize the function f. To do so, it requires two modes of accessing the data, namely sorted access and random access. The sorted access is ensured by maintaining a sorted list Li for each attribute mi, ordered by the grade in descending order. TA keeps a bounded buffer of size k and scans each list Li in parallel until the buffer contains k objects and the lowest ranked object in the buffer has an aggregated grade that is greater than or equal to some threshold τ. The threshold τ is computed using the aggregation function f over the grades last seen under sorted access for each list Li.

   EXAMPLE 3. Table 2 shows four objects {A, B, C, D} and their grades for the two attributes interestingness and popularity. Let us assume that k = 2 and the aggregation function is f(x1, x2) = x1 + x2. Further, assume that the bounded buffer currently contains {(C, 18), (A, 16)} and the algorithm has read the data up to the entries marked with * in Table 2. At this point the algorithm computes the threshold using the interestingness grade of object B and the popularity grade of object C, yielding τ = f(5, 9) = 5 + 9 = 14. Since the lowest ranked object in the buffer, object A, has an aggregated grade that is greater than τ, we can conclude that C and A are the top-2 objects. Note that the algorithm never read object D, yet it can conclude that D cannot be part of the top-k list.

       interestingness           popularity
       Object    grade           Object    grade
         A        10               B        10
         C         9               C         9 *
         B         5 *             D         8
         D         4               A         6

       Table 2: Threshold Algorithm example. The entries marked with * are the positions last seen under sorted access in Example 3.
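For illustration, the following compact sketch implements the classic TA loop for a sum aggregation function, in the spirit of Example 3. It is not the adapted algorithm of the next section, and the data layout (per-attribute sorted lists plus a random-access grade table) is an assumption of the sketch.

    // Sketch of the classic Threshold Algorithm with parallel sorted access.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    final class ThresholdAlgorithmSketch {

        /**
         * sortedLists[i][d] = {object, grade}: d-th entry of attribute i under sorted access.
         * grades[o][i]      = grade of object o for attribute i (random access).
         * Returns (up to) k objects with the largest sum of grades.
         */
        static List<Integer> topK(int[][][] sortedLists, int[][] grades, int k) {
            int m = sortedLists.length;
            Map<Integer, Integer> buffer = new HashMap<>();   // object -> aggregated grade
            for (int depth = 0; depth < sortedLists[0].length; depth++) {
                int threshold = 0;
                for (int i = 0; i < m; i++) {
                    int[] entry = sortedLists[i][depth];      // sorted access
                    threshold += entry[1];                    // τ = f of the grades last seen
                    if (!buffer.containsKey(entry[0])) {      // random access to the other grades
                        int agg = 0;
                        for (int j = 0; j < m; j++) agg += grades[entry[0]][j];
                        buffer.put(entry[0], agg);
                    }
                }
                while (buffer.size() > k) {                   // keep only the k best objects seen so far
                    int worst = Collections.min(buffer.entrySet(), Map.Entry.comparingByValue()).getKey();
                    buffer.remove(worst);
                }
                // Stopping rule: the lowest ranked buffered object reaches the threshold τ.
                if (buffer.size() == k && Collections.min(buffer.values()) >= threshold) break;
            }
            return new ArrayList<>(buffer.keySet());
        }
    }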
4.2 Adapting the Threshold Algorithm
   In order to use the Threshold Algorithm for the imputation of missing values in time series data, we have to adapt it. Instead of looking for the top-k objects that maximize the aggregation function f, we want to find the top-k timestamps that minimize the error function δ(t) over the reference time series Rs. Similar to TA, we need sorted access to the data. Therefore, for each time series r ∈ Rs we define Lr to be the time series r ordered first by value and then by timestamp in ascending order. Table 3 shows the sorted data for the three reference time series of our running example (the markers in the table are explained below).

       Lr1                   Lr2                   Lr3
       t    r1(t)            t    r2(t)            t    r3(t)
       1    15.0°  (4)       2    15.7°            2    13.9°
       4    15.0°  (1)       5    15.7°            1    14.1°
       7    15.0°  *         3    15.8°            3    14.1°
       2    15.2°            1    15.9°            6    14.1°
       3    15.2°            4    15.9°  (5)       4    14.2°  (3)
       6    15.2°            6    16.0°  (2)       7    14.3°  *
       5    15.3°            7    16.0°  *         5    14.5°  (6)

       Table 3: Time series sorted by temperature. Entries marked with * hold the current values r(θ) at θ = 7, where the scan starts; the numbers in parentheses give the iteration of Example 4 in which an entry is visited.

   The general idea of our modified TA algorithm is the following. The scan of each sorted list starts at the current element, i.e., the element with the timestamp t = θ. Instead of scanning the lists Lr only in one direction as TA does, we scan each list sequentially in two directions. Hence, as an initialization step, the algorithm places two pointers, pos+r and pos−r, at the current value r(θ) of time series r (the entries marked with * in Table 3). During the execution of the algorithm, pointer pos+r is only incremented (i.e., moved down the list), whereas pos−r is only decremented (i.e., moved up the list). To maintain the k highest ranking timestamps, the algorithm uses a bounded buffer of size k. A new timestamp t′ is added only if the buffer is either not yet full or δ(t′) < δ(t̲), where t̲ is the last (i.e., lowest ranking) timestamp in the buffer. In the latter case the timestamp t̲ is removed from the buffer.
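The insertion semantics of the bounded buffer can be sketched as follows; the method names mirror those used in Algorithm 1 below, but the implementation is only an illustration under our own assumptions.

    // Sketch: bounded buffer of at most k (timestamp, δ) pairs; the entry with the
    // largest error is evicted when a better candidate arrives.
    import java.util.Comparator;
    import java.util.PriorityQueue;

    final class BoundedBuffer {
        private final int k;
        // Max-heap on δ, so the lowest ranking (largest error) timestamp sits at the head.
        private final PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[1]).reversed());

        BoundedBuffer(int k) { this.k = k; }

        void addWithPriority(int t, double delta) {
            if (heap.size() < k) heap.add(new double[]{t, delta});
            else if (delta < heap.peek()[1]) { heap.poll(); heap.add(new double[]{t, delta}); }
        }

        int size()              { return heap.size(); }
        double largestError()   { return heap.isEmpty() ? Double.POSITIVE_INFINITY : heap.peek()[1]; }
        boolean contains(int t) { return heap.stream().anyMatch(e -> (int) e[0] == t); }
    }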
   After this initialization, the algorithm iterates over the lists Lr in a round-robin fashion, i.e., once the last list is reached, the algorithm wraps around and continues again with the first list. In each iteration, exactly one list Lr is processed, and either pointer pos+r or pos−r is advanced, depending on which of the two values they point to has the smaller absolute difference to the current value at time θ, r(θ). This process grows a neighborhood around the element r(θ) in each list. Whenever a pointer is advanced by one position, the timestamp t at the new position is processed. At this point, the algorithm needs random access to the values r(t) in each list to compute the error function δ(t). Time t is added to the bounded buffer using the semantics described above.
   The algorithm terminates once the error at the lowest ranking timestamp t̲ among the k timestamps in the buffer is less than or equal to the threshold, i.e., δ(t̲) ≤ τ. The threshold τ is defined as τ = Σ_{r ∈ Rs} |r(θ) − r(pos_r)|, where pos_r is either pos+r or pos−r, depending on which pointer was advanced last. That is, τ is the sum over all lists Lr of the absolute differences between r(θ) and the value under pos+r or pos−r.

   EXAMPLE 4. We illustrate the Top-k Case Matching algorithm for k = 2 and θ = 7. Table 4 shows the state of the algorithm in each iteration i. The first column shows the iteration counter i, the second the buffer with the k current best timestamps, and the last column the threshold τ. The buffer entries are tuples of the form (t, δ(t)). In iteration i = 1, the algorithm moves the pointer to t = 4 in list Lr1 and adds (t = 4, δ(4) = 0.2°) to the buffer. Since δ(4) = 0.2° > 0.0° = τ, the algorithm continues. The pointer in Lr2 is moved to t = 6, and (6, 0.4°) is added to the buffer. In iteration i = 4, timestamp 6 is replaced by timestamp 1. Finally, in iteration i = 6, the error at timestamp t = 1 is smaller than or equal to τ, i.e., δ(1) = 0.3° ≤ τ6 = 0.3°. The algorithm terminates and returns the two timestamps D = {4, 1}.

       Iteration i    Buffer                      Threshold τi
           1          (4, 0.2°)                       0.0°
           2          (4, 0.2°), (6, 0.4°)            0.0°
           3          (4, 0.2°), (6, 0.4°)            0.1°
           4          (4, 0.2°), (1, 0.3°)            0.1°
           5          (4, 0.2°), (1, 0.3°)            0.2°
           6          (4, 0.2°), (1, 0.3°)            0.3°

       Table 4: Finding the k = 2 most similar historical situations.

4.3 Implementation
   Algorithm 1 shows the pseudo code of the Top-k Case Matching algorithm. The algorithm has three input parameters: a set of time series Rs, the current timestamp θ, and the parameter k. It returns the top-k most similar timestamps to the current timestamp θ. In line 2 the algorithm initializes the bounded buffer of size k, and in line 4 the pointers pos+r and pos−r are initialized for each reference time series r ∈ Rs. In each iteration of the loop in line 7, the algorithm advances either pos+r or pos−r (by calling Algorithm 2) and reads a new timestamp t. The timestamp t is added to the bounded buffer using the semantics described before. In line 15, the algorithm computes the threshold τ. If the buffer contains k timestamps and δ(t̲) ≤ τ holds, the top-k most similar timestamps have been found and the algorithm terminates.

   Algorithm 1: Top-k Case Matching
   Data: Reference time series Rs, current time θ, and k
   Result: k timestamps that minimize δ(t)
    1  L ← {Lr | r ∈ Rs}
    2  buffer ← boundedBuffer(k)
    3  for r ∈ Rs do
    4      pos−r, pos+r ← position of r(θ) in Lr
    5  end
    6  while L ≠ ∅ do
    7      for Lr ∈ L do
    8          t ← AdvancePointer(Lr)
    9          if t = NIL then
   10              L ← L \ {Lr}
   11          else
   12              if t ∉ buffer then
   13                  buffer.addWithPriority(t, δ(t))
   14              end
   15              τ ← ComputeThreshold(L)
   16              if buffer.size() = k and buffer.largestError() ≤ τ then
   17                  return buffer
   18              end
   19          end
   20      end
   21  end
   22  return buffer

   Algorithm 2 is responsible for moving the pointers pos+r and pos−r for each list Lr. The algorithm uses three utility functions. The first is next(), which takes a pointer as input and returns the next position by either incrementing or decrementing it, depending on the direction of the pointer. If next() reaches the end of a list, it returns NIL. The utility functions timestamp() and value() return the timestamp and the value of a list Lr at a given position, respectively. There are four cases, which the algorithm has to distinguish:

   1. None of the two pointers has reached the beginning or the end of the list. In this case, the algorithm checks which pointer to advance (line 5). The pointer that is closer to r(θ) after advancing is moved by one position. In case of a tie, we arbitrarily decided to advance pos+r.
   2. Only pos−r has reached the beginning of the list: the algorithm increments pos+r (line 11).
   3. Only pos+r has reached the end of the list: the algorithm decrements pos−r (line 13).
   4. The two pointers have reached the beginning respectively the end of the list: no pointer is moved.

   In the first three cases, the algorithm returns the timestamp that was discovered after advancing the pointer. In the last case, NIL is returned.
   At the moment we use an in-memory implementation of the algorithm, which loads the whole data set into main memory. More specifically, we keep two copies of the data in memory: the data sorted by timestamp for fast random access, and the data sorted by value and timestamp for fast sorted access.
time series r ∈ Rs . In each iteration of the loop in line 7, the algo-          value and timestamp for fast sorted access.
rithm advances either pos+           −
                            r or posr (by calling Algorithm 2) and                  Note that we did not normalize the raw data using some standard
reads a new timestamp t. The timestamp t is added to the bounded                 technique like the z-score normalization, as we cannot compute
buffer using the semantics described before. In line 15, the algo-               that efficiently for streams of data without increasing the complex-
rithm computes the threshold τ . If the buffer contains k timestamps             ity of our algorithm.
and we have δ(t) ≤ τ , the top-k most similar timestamps were
                 ¯
found and the algorithm   terminates.                                            4.4    Proof of Correctness
   Algorithm 2 is responsible for moving the pointers pos+       r and              The correctness of the Top-k Case Matching algorithm follows
pos−                  r
    r for each list L . The algorithm uses three utility functions.              directly from the correctness of the Threshold Algorithm. What
The first is next(), which takes a pointer as input and returns the              remains to be shown, however, is that the aggregation function δ(t)
next position by either incrementing or decrementing, depending                  is monotone.



                                                                            80
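To make the interplay of the bounded buffer, the two pointers per reference series, and the threshold τ concrete, the following minimal Python sketch mirrors the search procedure of Algorithm 1 under simplifying assumptions: the data are in-memory dicts mapping timestamps to values, θ is assumed to be present in every reference series, and all names are ours. It is an illustration, not the authors' implementation.

    import bisect

    def neighbors(sorted_pairs, current):
        # Yield (|value - current|, timestamp) from a list of (value, timestamp)
        # pairs sorted by value, nearest value first (two pointers moving outwards).
        hi = bisect.bisect_left(sorted_pairs, (current,))
        lo = hi - 1
        while lo >= 0 or hi < len(sorted_pairs):
            d_lo = abs(current - sorted_pairs[lo][0]) if lo >= 0 else float("inf")
            d_hi = abs(current - sorted_pairs[hi][0]) if hi < len(sorted_pairs) else float("inf")
            if d_hi <= d_lo:                      # tie: advance the upper pointer
                yield d_hi, sorted_pairs[hi][1]
                hi += 1
            else:
                yield d_lo, sorted_pairs[lo][1]
                lo -= 1

    def topk_case_matching(ref_series, theta, k):
        # ref_series: dict name -> dict timestamp -> value; theta must occur in
        # every reference series. Returns the k timestamps t with the smallest
        # aggregated error delta(t) = sum_r |r(theta) - r(t)|.
        gens = {r: neighbors(sorted((v, t) for t, v in s.items() if t != theta), s[theta])
                for r, s in ref_series.items()}
        def delta(t):
            return sum(abs(s[theta] - s[t]) for s in ref_series.values())
        buffer = {}                               # timestamp -> delta(t), at most k entries
        last = {r: 0.0 for r in gens}             # error at the current pointer of each list
        while gens:
            for r in list(gens):
                try:
                    d, t = next(gens[r])
                except StopIteration:
                    del gens[r]
                    continue
                last[r] = d
                if t not in buffer and all(t in s for s in ref_series.values()):
                    buffer[t] = delta(t)
                    if len(buffer) > k:           # bounded buffer: keep only the k best
                        del buffer[max(buffer, key=buffer.get)]
            tau = sum(last.values())              # threshold as in Fagin's TA
            if len(buffer) == k and max(buffer.values()) <= tau:
                break                             # no unseen timestamp can be more similar
        return sorted(buffer, key=buffer.get)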
Algorithm 2: AdvancePointer
    Data: List Lr where to advance a pointer
    Result: Next timestamp to look at or NIL
     1  pos ← NIL
     2  if next(pos+_r) <> NIL and next(pos−_r) <> NIL then
     3      ∆+ ← |r(θ) − value(Lr[next(pos+_r)])|
     4      ∆− ← |r(θ) − value(Lr[next(pos−_r)])|
     5      if ∆+ ≤ ∆− then
     6          pos, pos+_r ← next(pos+_r)
     7      else
     8          pos, pos−_r ← next(pos−_r)
     9      end
    10  else if next(pos+_r) <> NIL and next(pos−_r) = NIL then
    11      pos, pos+_r ← next(pos+_r)
    12  else if next(pos+_r) = NIL and next(pos−_r) <> NIL then
    13      pos, pos−_r ← next(pos−_r)
    14  end
    15  if pos <> NIL then
    16      return timestamp(Lr[pos])
    17  else
    18      return NIL
    19  end
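For readers who prefer runnable code over pseudocode, a direct Python transcription of Algorithm 2 could look as follows; representing the list Lr as value-sorted (value, timestamp) pairs and the two positions as plain integers are our simplifying assumptions, not the authors' code.

    def advance_pointer(L, r_theta, pos_plus, pos_minus):
        # One step of Algorithm 2 on a list L of (value, timestamp) pairs sorted
        # by value. pos_plus/pos_minus are the current positions of pos+_r and
        # pos-_r; returns (timestamp or None, new_pos_plus, new_pos_minus).
        nxt_plus = pos_plus + 1 if pos_plus + 1 < len(L) else None    # next(pos+_r)
        nxt_minus = pos_minus - 1 if pos_minus - 1 >= 0 else None     # next(pos-_r)
        pos = None
        if nxt_plus is not None and nxt_minus is not None:
            delta_plus = abs(r_theta - L[nxt_plus][0])
            delta_minus = abs(r_theta - L[nxt_minus][0])
            if delta_plus <= delta_minus:         # tie: advance pos+_r (line 5)
                pos = pos_plus = nxt_plus
            else:
                pos = pos_minus = nxt_minus
        elif nxt_plus is not None:                # only pos-_r reached the start (line 11)
            pos = pos_plus = nxt_plus
        elif nxt_minus is not None:               # only pos+_r reached the end (line 13)
            pos = pos_minus = nxt_minus
        # otherwise both ends are reached and no pointer is moved
        if pos is None:
            return None, pos_plus, pos_minus
        return L[pos][1], pos_plus, pos_minus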

4.4 Proof of Correctness
The correctness of the Top-k Case Matching algorithm follows directly from the correctness of the Threshold Algorithm. What remains to be shown, however, is that the aggregation function δ(t) is monotone.

THEOREM 4.1. The aggregation function δ(t) is a monotonically increasing function.

PROOF. Let t1 and t2 be two timestamps such that |r(θ) − r(t1)| ≤ |r(θ) − r(t2)| for each r ∈ Rs. Then it trivially follows that δ(t1) ≤ δ(t2), as the aggregation function δ is the sum of |r(θ) − r(t)| over all r ∈ Rs and, by definition, each component of δ(t1) is less than or equal to the corresponding component of δ(t2).

4.5 Theoretical Bounds
The space and runtime bounds of the algorithm follow directly from the probabilistic guarantees of TA, which has sub-linear cost with high probability and linear cost in the worst case. Note that sorting the raw data to build the lists Lr is a one-time pre-processing step with complexity O(n log n). After that, the system can insert new measurements into the sorted lists with logarithmic cost.

5. EXPERIMENTAL EVALUATION
In this section, we present preliminary results of an experimental evaluation of the proposed Top-k Case Matching algorithm. First, we study the impact of parameter k on the Top-k Case Matching and a baseline algorithm. The baseline algorithm, referred to as "Simple Average", imputes the missing value s(θ) with the average of the values in the reference time series at time θ, i.e., s(θ) = (1/|Rs|) Σ_{r∈Rs} r(θ). Second, we compare our solution with two state-of-the-art competitors, REBOM [4] and CD [5].

5.1 Varying k
In this experiment, we study the impact of parameter k on the accuracy and the runtime of our algorithm. We picked five base stations distributed all over South Tyrol, each having two to five reference stations. We simulated a failure of the base stations during a time interval, w, of 8 days in the month of April 2013. This amounts to a total of 11452 missing values. We then used the Top-k Case Matching (using both the average and the median as aggregation function g) and Simple Average algorithms to impute the missing values. As a measure of accuracy we use the average absolute difference between the real value s(θ) and the imputed value s∗(θ), i.e., ∆ = (1/|w|) Σ_{θ∈w} |s(θ) − s∗(θ)|.

Figure 2 shows how the accuracy of the algorithms changes with varying k. Interestingly and somewhat unexpectedly, ∆ decreases as k increases. This is contrary to what we expected, since with an increasing k the error function δ(t) also grows, and therefore less similar historical situations are used for the imputation. However, a careful analysis of the results revealed that for low values of k the algorithm is more sensitive to outliers, and due to the often low quality of the raw data the imputation is flawed.

Figure 2: Impact of k on accuracy (average difference ∆ in °C over parameter k for Top-k (Average), Top-k (Median), and Simple Average).

Table 5 shows an example of flawed raw data. The first row is the current situation, and we assume that the value in the gray box is missing and needs to be recovered. The search for the k = 3 most similar situations using our algorithm yields the three rows at the bottom. Notice that one base station value is 39.9° around midnight of a day in August, which is obviously very unlikely. By increasing k, the impact of such outliers is reduced and hence ∆ decreases. Furthermore, using the median as aggregation function reduces the impact of outliers and therefore yields better results than the average.

    Timestamp           s         r1        r2        r3
    2013-04-16 19:35    18.399°   17.100°   19.293°   18.043°
    2012-08-24 01:40    18.276°   17.111°   19.300°   18.017°
    2004-09-29 15:50    19.644°   17.114°   19.259°   18.072°
    2003-08-02 01:10    39.900°   17.100°   19.365°   18.065°

Table 5: Example of flawed raw data.

Figure 3 shows the runtime, which for the Top-k Case Matching algorithm increases linearly with k. Notice that, although the imputation of missing values for 8 days takes several minutes, the algorithm is fast enough to continuously impute missing values in our application at the SBR. The experiment essentially corresponds to a scenario where an error occurs in 11452 base stations at the same time. With 120 weather stations operated by the SBR, the number of missing values at each point in time is only a tiny fraction of the missing values that we simulated in this experiment.

5.2 Comparison with CD and REBOM
In this experiment, we compare the Top-k Case Matching algorithm with two state-of-the-art algorithms, REBOM [4] and CD [5]. We used four time series, each containing 50,000 measurements, which corresponds roughly to half a year of temperature measurements. We simulated a week of missing values (i.e., 2017 measurements) in one time series and used the other three as reference time series for the imputation.
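Both experiments report the accuracy measure ∆ from Section 5.1, which is simply the mean absolute imputation error over the failure window. A minimal sketch (names ours):

    def accuracy_delta(real, imputed):
        # Delta from Section 5.1: average absolute difference between the real
        # and the imputed value over the simulated failure window w; real and
        # imputed are dicts mapping timestamp -> value for the same window.
        window = sorted(set(real) & set(imputed))
        return sum(abs(real[t] - imputed[t]) for t in window) / len(window)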
Figure 3: Impact of k on runtime (runtime in seconds over parameter k for Top-k (Average), Top-k (Median), and Simple Average).

The box plot in Figure 4 shows how the imputation error |s(θ) − s∗(θ)| is distributed for each of the four algorithms. The left and right lines of each box are the first and third quartile, respectively. The line inside the box denotes the median, and the left and right whiskers are the 2.5% and 97.5% percentiles, which means that the plot incorporates 95% of the values and omits statistical outliers. The experiment clearly shows that the Top-k Case Matching algorithm is able to impute the missing values more accurately than CD and REBOM. Although not visualized, the maximum observed error of our algorithm, 2.29° (Average) and 2.21° (Median), is also considerably lower than 3.71° for CD and 3.6° for REBOM.

Figure 4: Comparison with REBOM and CD (box plots of the absolute difference in °C for Top-k (Median), Top-k (Average), CD, and REBOM).

In terms of runtime, the Top-k Case Matching algorithm needed 16 seconds for the imputation of the 2017 missing measurements, whereas CD and REBOM each needed roughly 10 minutes. Note, however, that this large difference in runtime is also due to the fact that CD and REBOM need to compute the Pearson correlation coefficient, which is a time-intensive operation.

6. CONCLUSION AND FUTURE WORK
In this paper, we presented a simple yet efficient and accurate algorithm, termed Top-k Case Matching, for the imputation of missing values in time series data, where the time series are similar to each other. The basic idea of the algorithm is to look for the k situations in the historical data that are most similar to the current situation and to derive the missing values from the data at these time points. Our Top-k Case Matching algorithm is based on Fagin's Threshold Algorithm. We presented the results of a first experimental evaluation. The Top-k Case Matching algorithm achieves a high accuracy and outperforms two state-of-the-art solutions both in terms of accuracy and runtime.

As next steps we will continue with the evaluation of the algorithm, taking into account also model-based techniques such as DynaMMo [6] and other statistical approaches outlined in [9]. We will further study the impact of complex weather phenomena that we observed in our data, such as the foehn. The foehn induces shifting effects in the time series data, as the warm wind causes the temperature to increase rapidly by up to 15° as soon as the foehn reaches another station.

There are several possibilities to further improve the algorithm. First, we would like to explore whether the algorithm can dynamically determine an optimal value for the parameter k, which is currently given by the user. Second, we would like to make the algorithm more robust against outliers. For example, the algorithm could consider only historical situations that occur roughly at the same time of the day. Moreover, we can broaden the definition of "current situation" to consider not only the current timestamp, but rather a small window of consecutive timestamps. This should make the ranking more robust against anomalies in the raw data and weather phenomena such as the foehn. Third, right now the similarity between time series is based solely on temperature data. We would like to include the other time series data collected by the weather stations, such as humidity, precipitation, wind, etc. Finally, the algorithm should be able to automatically choose the currently hand-picked reference time series based on some similarity measure, such as the Pearson correlation coefficient.

7. ACKNOWLEDGEMENTS
The work has been done as part of the DASA project, which is funded by the Foundation of the Free University of Bozen-Bolzano. We wish to thank our partners at the Südtiroler Beratungsring and the Research Centre for Agriculture and Forestry Laimburg for the good collaboration and the helpful domain insights they provided, in particular Armin Hofer, Martin Thalheimer, and Robert Wiedmer.

8. REFERENCES
[1] G. E. P. Box and G. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.
[2] R. Fagin. Combining fuzzy information from multiple systems (extended abstract). In PODS '96, pages 216–226, New York, NY, USA, 1996. ACM.
[3] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS '01, pages 102–113, New York, NY, USA, 2001. ACM.
[4] M. Khayati and M. H. Böhlen. REBOM: Recovery of blocks of missing values in time series. In COMAD '12, pages 44–55, 2012.
[5] M. Khayati, M. H. Böhlen, and J. Gamper. Memory-efficient centroid decomposition for long time series. In ICDE '14, pages 100–111, 2014.
[6] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. DynaMMo: Mining and summarization of coevolving sequences with missing values. In KDD '09, pages 507–516, New York, NY, USA, 2009. ACM.
[7] A. Mueen, S. Nath, and J. Liu. Fast approximate correlation for massive time-series data. In SIGMOD '10, pages 171–182, New York, NY, USA, 2010. ACM.
[8] M. Theobald, G. Weikum, and R. Schenkel. Top-k query evaluation with probabilistic guarantees. In VLDB '04, pages 648–659. VLDB Endowment, 2004.
[9] C. Yozgatligil, S. Aslan, C. Iyigun, and I. Batmaz. Comparison of missing value imputation methods in time series: The case of Turkish meteorological data. Theoretical and Applied Climatology, 112(1-2):143–167, 2013.
The Dominance Problem in the Use of Multi-Feature Approaches

Thomas Böttcher
Technical University Cottbus-Senftenberg
Walther-Pauer-Str. 2, 03046 Cottbus
tboettcher@tu-cottbus.de

Ingo Schmitt
Technical University Cottbus-Senftenberg
Walther-Pauer-Str. 2, 03046 Cottbus
schmitt@tu-cottbus.de

ABSTRACT
Comparing objects with respect to different properties also yields different results. Numerous studies have shown that using several properties can achieve significant improvements in retrieval. A major problem in using several properties, however, is the comparability of the individual properties with respect to the aggregation: frequently, one property is dominated by another. Many normalization approaches try to solve this problem, but use only limited information. In this paper we present an approach that allows the degree of dominance to be measured and thus also enables an evaluation of different normalization approaches.

Keywords
Dominance, score normalization, aggregation, feature

1. INTRODUCTION
In information retrieval (IR), multimedia retrieval (MMR), data mining (DM), and many other fields, comparing objects is essential, e.g., for detecting similar objects or duplicates, or for classifying the objects under investigation. The comparison of objects from an object set O is usually based on their property values. In MMR, properties (features) such as colors, edges, or textures are frequently used. In many cases, a single property is not sufficient for an exhaustive comparison of objects. Figure 1 uses the example of a color histogram to show the weaknesses of a single property: although the two objects differ considerably, they have very similar color histograms. Instead of a single property, a suitable combination of several features should be used in order to achieve more accurate results through improved expressiveness [16].

Figure 1: Different objects with very high color similarity.

The (pairwise) comparison of objects with respect to properties is carried out by means of a distance or similarity measure¹. When several properties are used, the individual distances can be combined by an aggregation function into an overall distance. The use of different distance measures and aggregation functions, however, raises several problems: different distance measures satisfy different algebraic properties, and not all distance measures are equally suited for specific problems. Metric index structures or data mining algorithms, for instance, require the triangle inequality to hold. Further problems can arise from the properties of the aggregation function, which may, e.g., destroy the monotonicity or other algebraic properties of the individual distance measures. These problems, however, are not the focus of this paper.

¹Both can be converted into each other [Sch06]; in the following we therefore assume distance measures.

For a similarity comparison of objects based on several features, the individual features are expected to influence the aggregation result equally. Frequently, however, there is an imbalance that affects the results so strongly that individual features have little or no influence. If algebraic properties are missing or the dominance is too strong, the features and their associated distance measures can no longer be used meaningfully within a suitable feature combination. In image analysis, moreover, increasingly complex features are extracted from the image data. The computation of distances based on these features thus becomes more and more specialized, and it cannot be guaranteed which algebraic properties are satisfied. With the growing use of many individual features, the risk that one or a few features dominate also increases.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
The core focus of this work is the analysis of multi-feature aggregations with respect to the dominance of individual features. We first define the dominance of a property and show when such a dominance manifests itself. We then introduce a measure for quantifying the degree of dominance. We will furthermore show that the approaches of existing normalization techniques are not always sufficient to solve the dominance problem. In addition, this measure enables the evaluation of different normalization approaches.

The paper is structured as follows. Section 2 recapitulates some fundamentals of distance functions and aggregation. Section 3 is concerned with the definition of dominance, illustrates its effects with an example, and introduces a new measure for quantifying the degree of dominance. Section 4 gives an overview of existing approaches. Section 5 concludes with a summary and an outlook on future work.

2. FUNDAMENTALS
This section defines the basic notions and the notation used in this paper. Distance computations on different features usually also require the use of different distance measures, which in many cases are specifically optimized or adapted to the respective property. For a distance computation over several features, correspondingly different distance measures are therefore needed.

A distance measure between two objects based on a property p is defined as a function d : O × O → R≥0. A distance value based on a comparison of two objects o_r and o_s with respect to a single property p_j is denoted by d_j(o_r, o_s) ∈ R≥0. Different distance measures thus also have different properties. The following four properties are used to classify distance measures: self-identity: ∀o ∈ O : d(o, o) = 0; positivity: ∀o_r ≠ o_s ∈ O : d(o_r, o_s) > 0; symmetry: ∀o_r, o_s ∈ O : d(o_r, o_s) = d(o_s, o_r); and the triangle inequality: ∀o_r, o_s, o_t ∈ O : d(o_r, o_t) ≤ d(o_r, o_s) + d(o_s, o_t). A distance function that satisfies all four properties is called a metric [11].

If the comparison of two objects o_r, o_s ∈ O based on a single property is not sufficient to determine the desired (dis)similarity, several properties have to be used. For a distance computation with m properties p = (p_1, ..., p_m), first the partial distances δ^j_rs = d_j(o_r, o_s) are determined. The partial distance values δ^j_rs are then aggregated into an overall distance by an aggregation function agg : R^m_≥0 → R_≥0. The set of all aggregated distances (a triangular matrix) over all object pairs from O is given by δ^j = (δ^j_1, δ^j_2, ..., δ^j_l) with l = (n² − n)/2. This approach allows the aggregation to be determined on the respective individual distance values. The individual distance functions d_j are self-contained and thus optimized for the property itself.

3. DOMINANCE PROBLEM
So far we have only briefly introduced the dominance problem. This section provides a detailed motivation and introduction to the problem. To this end, we first introduce the notions of overweighting and dominance problem. The effects of the dominance problem on the aggregation result are then illustrated by an example. Finally, we define a measure to quantify the degree of dominance.

3.1 Problem Definition
As already mentioned, the use of many different properties (features) and their partly specialized distance measures is not trivial and raises several challenges. In this subsection, the dominance problem is defined more precisely.

First, we define the core problem of aggregating several distance values.

Problem: For a similarity comparison of objects based on several features, the individual features should influence the aggregation result equally. If the partial distances δ^j_rs of a distance measure d_j dominate the aggregation result, this dominance should be reduced or eliminated.

Open at this point are the questions of when the dominance of a property occurs, how it affects the aggregation result, and how the degree of dominance can be measured.

The result of aggregating individual distance values is again a distance value, which, however, should depend on all individual distance values equally. If the value ranges of the distance functions used in the aggregation are not identical, the aggregation result can be distorted. As a simple example, consider two distance functions d_1 and d_2, where d_1 maps all distances to the interval [0, 1] and d_2 maps all distances to [0, 128]. For an aggregation function d_agg that sums up the individual distances, d_2 influences the aggregation result considerably more than d_1. In general, the aggregated distance values are then influenced more or less strongly than desired by the individual distance values of one of the distance functions used in the aggregation. We call this effect an overweighting. The degree of overweighting can be determined by means of correlation analysis (e.g., after Pearson [10] or Spearman [13]).

Definition 1 (Overweighting of a distance function). Given two distance functions d_j and d_k whose distance values δ^j, with respect to an aggregation function agg, influence the aggregation result more strongly than δ^k, i.e., the difference of the correlation values ρ(δ^j, δ^agg) − ρ(δ^k, δ^agg) > ε, we call d_j overweighted with respect to d_k.

An empirical study has shown that from a value of ε ≥ 0.2 onwards the aggregation result is noticeably biased in favor of one distance function. Based on the notion of overweighting, we define the dominance problem.

Definition 2 (Dominance problem). A dominance problem exists if one distance function d_j is overweighted with respect to another distance function d_k.

The problem of overweighting for different value ranges onto which the distances are mapped is, however, already widely known. In many cases, normalization techniques are applied (e.g., in data mining [12] or in biometrics [5]); they prepare distances from different sources for aggregation.
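As an illustration of Definition 1 (our sketch, not code from the paper), the overweighting test can be written down directly; SciPy's spearmanr is assumed as the correlation measure, and ε = 0.2 is the empirical threshold quoted above.

    from itertools import combinations
    from scipy.stats import spearmanr

    def pairwise_distances(objects, dist):
        # partial distances delta^j: one value per object pair (triangular matrix)
        return [dist(o_r, o_s) for o_r, o_s in combinations(objects, 2)]

    def overweighted(delta_j, delta_k, delta_agg, eps=0.2):
        # Definition 1: d_j is overweighted w.r.t. d_k if its partial distances
        # correlate more strongly with the aggregated distances than those of
        # d_k, by a margin of more than eps.
        rho_j, _ = spearmanr(delta_j, delta_agg)
        rho_k, _ = spearmanr(delta_k, delta_agg)
        return rho_j - rho_k > eps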
To avoid overweighting, distances are frequently normalized to a fixed interval (usually [0, 1]). This at least solves the problem in our previous example. The dominance problem, however, does not only occur for different value ranges; it can also arise for distance functions that are all normalized to the same value range. The following subsection demonstrates this with an example.

3.2 Example of a Dominance Problem
Figure 2 shows three distance distributions ν1, ν2, and ν3, obtained from a sample, for the corresponding distance functions d1, d2, and d3. The value range of the functions is defined as the interval [0, 1]. Despite the normalization to [0, 1], the sampled values occur in different intervals: the distance values of the sample of ν1 lie in the interval [0.2, 0.9], those of ν2 in [0.3, 0.5], and those of ν3 in [0.8, 0.9]. Even though these are simulated data, such distributions are frequently encountered in MMR.

Figure 2: Distance distributions of different distance functions (simulated data); histograms of the frequency over the distance values for (a) ν1, (b) ν2, and (c) ν3.

We now consider the distance functions d1 and d2. With respect to an exemplary aggregation function² aggQ_{d1,d2}(o_r, o_s) = d1(o_r, o_s) · d2(o_r, o_s), it can now be shown that d1 influences the aggregated distance value more strongly than d2.

²The dominance problem also occurs with other aggregation functions such as sum, mean, etc., and can additionally be caused, e.g., by the minimum/maximum function.

Figure 3 shows two different rankings of all 10 distance values between five random objects from the distributions ν1 and ν2, as well as the aggregation via aggQ. The distance ID is an identifier for an object pair. Looking at the first five ranks of the aggregated distances, we see that the top-5 objects of distance function d1 coincide completely with those of the aggregation, whereas for distance function d2 only two values appear in the ranking of the aggregated distances. The same holds for ranks 6–10. Thus, distance function d1 dominates distance function d2. Looking again at the intervals of the distributions ν1 and ν2, it becomes apparent that the dominance is due to the large difference between the distribution intervals (0.7 vs. 0.2). A dominance therefore manifests itself above all when there is a large difference between the respective intervals of the distance distributions.

3.3 Measuring the Dominance
To quantify the overweighting from our example, and thus the dominance, the correlation between the distances of d1 (d2) and the aggregated distances from d_agg is determined. Several methods can be used to compute the correlation. If, as in the example above, only the ranks are used, Spearman's rank correlation coefficient [13] suggests itself:

    ρ(A, B) = Cov(Rank(A), Rank(B)) / (σ_Rank(A) · σ_Rank(B))
    with Cov(X, Y) = E[(X − µ_X) · (Y − µ_Y)]                                    (1)

Here, Cov(X, Y) is the covariance of X and Y defined via the expected value. For the previous example we obtain a Spearman correlation of ρ1 = 0.94 for d1 and ρ2 = 0.45 for d2. The difference of the correlation values is ρ1 − ρ2 = 0.49. From ε = 0.2 onwards an overweighting of a distance function can be diagnosed; hence, with ρ1 − ρ2 = 0.49 > 0.2, we have shown a strong overweighting of d1 with respect to d2 regarding the aggregation result.

Using the rank values, however, incurs a loss of information. An alternative computation without loss of information would be possible with Pearson's correlation coefficient [10]. If the rank information is sufficient, Spearman's rank correlation coefficient offers the advantage of being less susceptible to outliers [14].
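Equation (1) can be spelled out without any library support; the following sketch (names ours) computes Spearman's coefficient as the Pearson correlation of the rank vectors, averaging tied ranks.

    def ranks(values):
        # ranks as used in Eq. (1); ties receive the average of their ranks
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1        # ranks start at 1
            for k in range(i, j + 1):
                result[order[k]] = avg_rank
            i = j + 1
        return result

    def spearman(a, b):
        # Eq. (1): Pearson correlation computed on the rank vectors of a and b
        ra, rb = ranks(a), ranks(b)
        n = len(ra)
        mean_a, mean_b = sum(ra) / n, sum(rb) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb)) / n
        std_a = (sum((x - mean_a) ** 2 for x in ra) / n) ** 0.5
        std_b = (sum((y - mean_b) ** 2 for y in rb) / n) ** 0.5
        return cov / (std_a * std_b)

Applied to the ten distance values of d1, d2, and aggQ listed in Figure 3 below, this reproduces ρ1 ≈ 0.94 and ρ2 ≈ 0.45.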
    Rank    d1      Distance ID    d2      Distance ID    aggQ    Distance ID
    1       0.729   1              0.487   8              0.347   8
    2       0.712   8              0.481   5              0.285   4
    3       0.694   4              0.426   10             0.266   1
    4       0.547   9              0.425   7              0.235   5
    5       0.488   5              0.421   3              0.205   9
    6       0.473   7              0.411   4              0.201   7
    7       0.394   10             0.375   9              0.168   10
    8       0.351   3              0.367   6              0.148   3
    9       0.337   2              0.365   1              0.112   6
    10      0.306   6              0.316   2              0.106   2

Figure 3: Dominance problem for different distributions.

So far we have compared the correlation between the aggregated values and the values of one distance distribution at a time. To directly relate two different distance distributions with respect to an aggregated distribution, the two correlation values ρ1 and ρ2 of the distance functions d1 and d2, regarding their influence on the aggregation result, are first represented graphically [6]. To this end, the correlation values are plotted as points in [−1, 1]². For an equal influence on the aggregation result, the points should lie on the diagonal through the origin with slope m = 1. We call this line the calibration line. For our example it suffices to consider only positive correlation values. All points below this line thus indicate a stronger influence of d1; analogously, all points above this line (the hatched area) indicate a stronger influence of d2. Figure 4 graphically shows the correlations ρ1 and ρ2 of our example with respect to the aggregation result.

Figure 4: Graphical representation of the correlations ρ1 and ρ2 with respect to the aggregation result (point (ρ1, ρ2) with position vector u and angle α above the ρ1 axis).

To determine the deviation from the desired state, we compute the angle between the position vector u = (ρ1, ρ2)^T through the point (ρ1, ρ2) and the horizontal coordinate axis [6]. The angle α is given by α = arctan(ρ2/ρ1). This angle lies in [0, π/2], while the calibration line encloses an angle of π/4 with the horizontal axis. For a signed characterization of the overweighting, all correlation points below the calibration line shall receive a positive value and all correlation points above it a negative value. As a measure of dominance we now define the following calculation [6]:

    Cal_err(δ^i, δ^j, δ^agg) = 1 − (4/π) · arctan( Corr(δ^j, δ^agg) / Corr(δ^i, δ^agg) )        (2)

Here, Corr(X, Y) denotes a suitable correlation measure, in our case Spearman's rank correlation coefficient. We call this measure the calibration error; an error of 0 means that there is no dominance and both distance functions contribute equally to the aggregation result. The range of the calibration error Cal_err is [−1, 1]. For our example, using Spearman's rank correlation coefficient, we obtain Cal_err(d1, d2, d_agg) = 0.43, which shows that d1 influences the aggregation result more strongly than d2.

3.4 Summary
In this section we have shown when a dominance problem occurs and how large its influence on the aggregation result can be. Using Equation (2), it is now possible to measure the degree of the dominance problem, i.e., the calibration error. A main reason for the occurrence of the dominance problem lies in the distribution of the distances: if the intervals in which the distances lie differ in size, the dominance of one property is unavoidable. If these intervals of the distance distributions can be aligned with each other without violating the ranking, this could solve the dominance problem. Furthermore, the calibration error measure enables the evaluation of normalization approaches.

4. STATE OF THE ART
Aggregation based on several properties is a widely studied field, and there is already a large number of works dealing with score normalization. In many cases, especially in IR, such approaches are evaluated directly via the quality of the search results on various document collections, e.g., TREC collections³. This procedure, however, provides hardly any indication of why some normalization approaches are better suited for certain applications than others [6].

We first consider various linear normalizations of the form normalize(δ) = y_min + ((δ − x_min) / (x_max − x_min)) · (y_max − y_min) [15], where x_min, x_max, y_min, and y_max denote different normalization parameters. Table 1 lists some of these linear approaches [15, 5, 9, 6].

    Name       y_min    y_max    x_min     x_max
    Min-Max    0        1        min(δ)    max(δ)
    Fitting    0
inspecting any set element) that s0 cannot reach threshold
                                                                                                 Figure 1: Overview of functions.
tC with s1 . Similarly, minoverlap(tC , s0 , s2 ) = 10.1, thus
s2 is too large to meet the threshold with s0 . In fact,
minsize(tC , s0 ) = 6.4 and maxsize(tC , s0 ) = 15.6.
Prefix length. The prefix length is |s0| − tO + 1 for a given overlap threshold tO and set s0. For normalized thresholds t, the prefix length does not only depend on s0, but also on the sets we compare to. If we compare to s1, the minimum prefix size of s0 is minprefix(t, s0, s1) = |s0| − minoverlap(t, s0, s1) + 1. When we index one of the join partners, we do not know the size of the matching partners upfront and need to cover the worst case; this results in the prefix length maxprefix(t, s0) = |s0| − minsize(t, s0) + 1 [7], which does not depend on s1. For typical Jaccard thresholds t ≥ 0.8, this reduces the number of tokens to be processed during the candidate generation phase by 80% or more.

For self joins we can further reduce the prefix length [12] w.r.t. maxprefix: when the index is built on-the-fly in increasing order of the sets, then the indexed prefix of s0 will never be compared to any set s1 with |s1| < |s0|. This allows us to reduce the prefix length to midprefix(t, s0) = |s0| − minoverlap(t, s0, s0) + 1.

Positional filter. The minimum prefix length for a pair of sets is often smaller than the worst-case length, which we use to build and probe the index. When we probe the index with a token from the prefix of s0 and find a match in the prefix of set s1, then the matching token may be outside the optimal prefix. If this is the first matching token between s0 and s1, we do not need to consider the pair. In general, a candidate pair s0, s1 must be considered only if

    minoverlap(t, s0, s1) ≤ o + min{|s0| − p0, |s1| − p1},    (1)

where o is the current overlap (i.e., the number of matching tokens so far, excluding the current match) and p0 (p1) is the position of the current match in the prefix of s0 (s1); positions start at 0.

The positional filter is stricter than the prefix filter and is applied on top of it. The pruning power of the positional filter is larger for prefix matches further to the right (i.e., when p0, p1 increase). Since the prefix filter may produce the same candidate pair multiple times (once for each match in the prefix), an interesting situation arises: a pair that passes the positional filter for the first match may not pass the filter for later matches. Thus, the positional filter is applied to pairs that are already in the candidate set whenever a new match is found. To correctly apply the positional filter, we need to maintain the overlap value for each pair in the candidate set. We illustrate the positional filter with examples.

Example 1. Set s0 in Figure 2 is the probing set (prefix length maxprefix = 4), s1 is the indexed set (prefix length midprefix = 2, assuming a self join). Set s1 is returned from the index due to the match on g (the first match between s0 and s1). The required overlap is ⌈minoverlapC(0.8, s0, s1)⌉ = 8. Since there are only 6 tokens left in s1 after the match, the maximum overlap we can get is 7, and the pair is pruned. This is also confirmed by the positional filter condition (1) (o = 0, p0 = 3, p1 = 1).

Example 2. Assume a situation similar to Figure 2, but the match on g is the second match (i.e., o = 1, p0 = 3, p1 = 1). Condition (1) holds and the pair cannot be pruned, i.e., it remains in the candidate set.

Example 3. Consider Figure 3 with probing set s0 and indexed set s1. The match on token a adds the pair (s0, s1) to the candidate set. Condition (1) holds for the match on a (o = 0, p0 = 0, p1 = 0), and the pair is not pruned by the positional filter. For the next match (on e), however, condition (1) does not hold (o = 1, p0 = 1, p1 = 4) and the positional filter removes the pair from the candidate set. Thus, the positional filter not only prevents pairs from entering the candidate set, but may also remove them later.
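For Jaccard thresholds, the prefix lengths and condition (1) can be written down compactly. The following is a minimal illustration using the standard Jaccard equivalents of minoverlap and minsize; the set sizes used in the Example 1 check are read off Figure 2 and are therefore assumptions:

    import math

    def minoverlap_jaccard(t, size0, size1):
        # J(s0,s1) >= t  <=>  |s0 n s1| >= t/(1+t) * (|s0| + |s1|)
        return math.ceil(t / (1.0 + t) * (size0 + size1))

    def minsize_jaccard(t, size0):
        return math.ceil(t * size0)

    def maxprefix(t, size0):
        # worst-case prefix length: |s0| - minsize(t, s0) + 1
        return size0 - minsize_jaccard(t, size0) + 1

    def midprefix(t, size0):
        # self-join prefix length: |s0| - minoverlap(t, s0, s0) + 1
        return size0 - minoverlap_jaccard(t, size0, size0) + 1

    def positional_ok(required_overlap, o, size0, p0, size1, p1):
        # condition (1): minoverlap <= o + min(|s0| - p0, |s1| - p1)
        return required_overlap <= o + min(size0 - p0, size1 - p1)

    # Prefix savings: for t = 0.8 and |s0| = 100 only 21 of 100 tokens
    # are probed during candidate generation.
    print(maxprefix(0.8, 100))               # 21

    # Example 1 (sizes as suggested by Figure 2, an assumption):
    # required overlap 8, o = 0, p0 = 3, p1 = 1, |s0| = 10, |s1| = 8
    print(positional_ok(8, 0, 10, 3, 8, 1))  # False, i.e., the pair is pruned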



Figure 2: Sets with a matching token in the prefix: match impossible due to the positions of the matching tokens and the remaining tokens.

Figure 3: Sets with two matching tokens: pruning of the candidate pair by the second match.

2.2 Improving the Prefix Filter

The prefix filter often produces candidates that will be removed immediately in the next filter stage, the positional filter (see Example 1). Ideally, such candidates are not produced at all. This issue is addressed in the mpjoin algorithm [7] as outlined below.

Consider condition (1) for the positional filter. We split the condition into two new conditions by expanding the minimum such that the conjunction of the new conditions is equivalent to the positional filter condition:

    minoverlap(t, s0, s1) ≤ o + |s0| − p0    (2)
    minoverlap(t, s0, s1) ≤ o + |s1| − p1    (3)

The mpjoin algorithm leverages condition (3) as follows. The probing sets s0 are processed in increasing size order, so |s0| grows monotonically during the execution of the algorithm. Hence, for a specific set s1, minoverlap grows monotonically. We assume o = 0 (and justify this assumption later). For a given index entry (s1, p1), the right side of condition (3) is constant, while the left side can only grow. After the condition fails to hold for the first time, it will never hold again, and the index list entry is removed. For a given indexed set s1, this improvement changes the effective length of the prefix (i.e., the part of the sets where we may detect matches) w.r.t. a probing set s0 to minprefix(t, s0, s1) = |s1| − minoverlap(t, s0, s1) + 1, which is optimal. On the downside, a shorter prefix may require more work in the verification phase: in some cases, the verification can start after the prefix, as will be discussed in Section 2.3.

2.3 Verification

Efficient verification techniques are crucial for fast set similarity joins. We revisit a baseline algorithm and two improvements, which affect the verification speed of both false and true positives. Unless explicitly mentioned, the term prefix subsequently refers to maxprefix (probing set) resp. midprefix (indexing set) as discussed in the earlier sections.

Since the sets are sorted, we compute the overlap in a merge fashion. At each merge step, we verify whether the current overlap and the remaining set size are sufficient to achieve the threshold, i.e., we check positional filter condition (1).

Figure 4: Verification: where to start?

(A) Prefix overlap [12]: At verification time we already know the overlap between the two prefixes of a candidate pair. This piece of information should be leveraged. Note that we cannot simply continue verification after the two prefixes. This is illustrated in Figure 4: there is 1 match in the prefixes of s0 and s1; when we start verification after the prefixes, we miss token h. Token h occurs after the prefix of s0 but inside the prefix of s1. Instead, we compare the last elements of the prefixes: for the set with the smaller element (s0), we start verification after the prefix (g). For the other set (s1) we leverage the number of matches in the prefix (overlap o). Since the leftmost positions where these matches can appear are the first o elements, we skip o tokens and start at position o (token e in s1). There is no risk of double-counting tokens w.r.t. overlap o since we start after the end of the prefix in s0.

(B) Position of last match [7]: A further improvement is to store the position of the last match. Then we start the verification in set s1 after this position (h in s1, Figure 4).

Small candidate set vs. fast verification. The positional filter is applied to each candidate pair returned by the prefix filter. The same candidate pair may be returned multiple times for different matches in the prefix. The positional filter potentially removes existing candidate pairs when they appear again (cf. Section 2.1). This reduces the size of the candidate set, but comes at the cost of (a) lookups in the candidate set, (b) deletions from the candidate set, and (c) bookkeeping of the overlaps for each candidate pair. Overall, it might be more efficient to batch-verify a larger candidate set than to incrementally maintain the candidates; Ribeiro and Härder [7] empirically analyze this trade-off.
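As an illustration of the verification step described above, the following is a minimal Python sketch (not the implementation evaluated in this paper). It assumes that sets are sorted lists of integer tokens, that required is the precomputed ⌈minoverlap⌉, that o is the overlap already counted in the prefixes, and that start0/start1 are start positions chosen in the spirit of improvements (A) and (B):

    def verify(s0, s1, required, o, start0, start1):
        # Merge the two sorted suffixes and count matches on top of the
        # prefix overlap o; stop as soon as the threshold is unreachable.
        i, j, overlap = start0, start1, o
        while i < len(s0) and j < len(s1):
            # Early termination in the spirit of condition (1): even if all
            # remaining tokens matched, 'required' could not be reached.
            if overlap + min(len(s0) - i, len(s1) - j) < required:
                return False
            if s0[i] == s1[j]:
                overlap += 1
                i += 1
                j += 1
            elif s0[i] < s1[j]:
                i += 1
            else:
                j += 1
        return overlap >= required

Under (A), start0 would be the first position after the prefix of the set whose prefix ends with the smaller token (s0 in Figure 4) and the start position in the other set would be o; under (B), the start in s1 can additionally be moved past the last recorded prefix match.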



3. POSITION-ENHANCED LENGTH FILTERING

In this section, we motivate the position-enhanced length filter (PEL), derive the new filter function pmaxsize, discuss the effect of PEL on self vs. foreign joins, and show how to apply PEL to previous algorithms.

Figure 5: Impossible and possible set sizes based on the position in s0 and the size-dependent minoverlap. (a) Match impossible. (b) Match possible.

Motivation. The introduction of the position-enhanced length filter is inspired by examples for positional filtering like Figure 5(a). In set s1, the only match in the prefix occurs at the leftmost position. Despite this being the leftmost match in s1, the positional filter removes s1: the overlap threshold cannot be reached due to the position of the match in s0. Apparently, the position of the token in the probing set can render a match with an indexed set impossible, independently of the matching position in the index set. Let us analyze how we need to modify the example such that it passes the positional filter: the solution is to shorten the indexed set s1, as shown in Figure 5(b). This suggests that a tighter limit on the set size can be derived from the position of the matching token.

Deriving the PEL filter. For the example in Figure 5(a) the first part of the positional filter, i.e., condition (2), does not hold. We solve the equation minoverlap(t, s0, s1) ≤ |s0| − p0 for |s1| by replacing minoverlap with its definition for the different similarity functions. The result is pmaxsize(t, s0, p0), an upper bound on the size of eligible sets in the index. This bound is at the core of the PEL filter, and definitions of pmaxsize for various similarity measures are listed in Table 1.

Application of PEL. We integrate the pmaxsize upper bound into the prefix filter. The basic prefix filter algorithm processes a probing set as follows: loop over the tokens of the probing set from position p0 = 0 to maxprefix(t, s0) − 1 and probe each token against the index. The index returns a list of sets (their IDs) which contain this token. The sets in these lists are ordered by increasing size, so we stop processing a list when we hit a set that is larger than pmaxsize(t, s0, p0).

Intuitively, we move half of the positional filter to the prefix filter, where we can evaluate it at lower cost: (a) the value of pmaxsize needs to be computed only once for each probing token; (b) we check pmaxsize against the size of each index list entry, which is a simple integer comparison. Overall, this is much cheaper than the candidate lookup that the positional filter must do for each index match.

Figure 6: Illustrating possible set sizes.

Self Joins vs. Foreign Joins. The PEL filter is more powerful on foreign joins than on self joins. In self joins, the size of the probing set is an upper bound for the set size in the index. For all the similarity functions in Table 1, pmaxsize is below the probing set size in less than 50% of the prefix positions. Figure 6 gives an example: the probing set size is 1000, the Jaccard threshold is 0.8, so minsize(0.8, 1000) = 800, maxsize(0.8, 1000) = 1250, and the prefix size is 201. The x-axis represents the position in the prefix, the y-axis represents bounds for the set size of the other set. The region between minsize and maxsize is the base region. The base region is partitioned into four regions (A, B, C, and D) by the probing set size and pmaxsize. For foreign joins, our filter reduces the base region to A + C. If we assume that all set sizes are equally likely in the individual inverted lists of the index, our filter cuts the number of index list entries that must be processed by 50%. Since the tokens are typically ordered by their frequency, the list length will increase with increasing matching position. Thus the gain of PEL in practical settings can be expected to be even higher. This analysis holds for all parameters of Jaccard and Dice. For Cosine, the situation is trickier since pmaxsize is quadratic and describes a parabola. Again, this is in our favor since the parabola is open to the top, and the curve that splits the base region is below the diagonal.

For self joins, the only relevant regions are A and B since the size of the sets is bounded by the probing set size. Our filter reduces the relevant region from A + B to A. As Figure 6 illustrates, this reduction is smaller than the reduction for foreign joins. For the similarity functions in Table 1, B is always less than a quarter of the full region A + B. In the example, region B covers about 0.22 of A + B.

Algorithm 1: AllPairs-PEL(Sp, I, t)
   Version using pmaxsize for foreign joins;
   input : Sp collection of outer sets, I inverted list index covering maxprefix of inner sets, t similarity threshold
   output: res set of result pairs (similarity at least t)
 1 foreach s0 in Sp do
 2     M = {};  /* hashmap: candidate set → count */
 3     for p0 ← 0 to maxprefix(t, s0) − 1 do
 4         for s1 in I_{s0[p0]} do
 5             if |s1| < minsize(t, s0) then
 6                 remove index entry with s1 from I_{s0[p0]};
 7             else if |s1| > pmaxsize(t, s0, p0) then
 8                 break;
 9             else
10                 if M[s1] = ∅ then
11                     M = M ∪ (s1, 0);
12                 M[s1] = M[s1] + 1;
13         end
14     end
15     /* Verify() verifies the candidates in M */
16     res = res ∪ Verify(s0, M, t);
17 end

Algorithm. Algorithm 1 shows AllPairs-PEL², a version of AllPairs enhanced with our PEL filter. AllPairs-PEL is designed for foreign joins, i.e., the index is constructed in a preprocessing step before the join is executed. The only difference w.r.t. AllPairs is that AllPairs-PEL uses pmaxsize(t, s0, p0) instead of maxsize(t, s0) in the condition on line 7. The extensions of the algorithms ppjoin and mpjoin with PEL are similar.

² We use the -PEL suffix for algorithm variants that make use of our PEL filter.

An enhancement that is limited to ppjoin and mpjoin is to simplify the positional filter: PEL ensures that no candidate set can fail on the first condition (Equation 2) of the split positional filter. Therefore, we remove the first part of the minimum in the original positional filter (Equation 1), such that the minimum is no longer needed.
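To make the Jaccard case concrete (a minimal sketch assuming the standard Jaccard form of minoverlap; Table 1 itself is not reproduced here): solving t/(1+t) · (|s0| + |s1|) ≤ |s0| − p0 for |s1| yields pmaxsize_J(t, s0, p0) = (1+t)/t · (|s0| − p0) − |s0|, which collapses to maxsize_J(t, s0) = |s0|/t at p0 = 0 and reproduces the numbers of Figure 6:

    def pmaxsize_jaccard(t, size0, p0):
        # upper bound on |s1| derived from minoverlap(t,s0,s1) <= |s0| - p0
        return (1.0 + t) / t * (size0 - p0) - size0

    # Figure 6 setting: |s0| = 1000, t = 0.8, prefix positions 0..200
    print(pmaxsize_jaccard(0.8, 1000, 0))     # 1250.0  (= maxsize)
    print(pmaxsize_jaccard(0.8, 1000, 200))   # 800.0   (= minsize)

    # Probing loop in the spirit of Algorithm 1, lines 4-8: 'lst' holds
    # (set_id, size1, p1) index entries ordered by increasing size1.
    def probe_list(lst, t, size0, p0, minsize0):
        candidates = []
        for set_id, size1, p1 in lst:
            if size1 < minsize0:
                continue   # Algorithm 1 would remove this entry from the index
            if size1 > pmaxsize_jaccard(t, size0, p0):
                break      # PEL: all remaining entries are at least as large
            candidates.append((set_id, p1))
        return candidates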



Note that the removal of index entries on line 6 is the easiest way to apply minsize, but in real-world scenarios it only makes sense if a single join is executed. For a similarity search scenario, we recommend applying binary search on the lists. For multiple joins with the same indexed sets in a row, we suggest using an overlay over the index that stores, for each list, the pointer where to start.

4. EXPERIMENTS

We compare the algorithms AllPairs [4] and mpjoin [7] with and without our PEL extension on both self and foreign joins. Our implementation works on integers, which we order by the frequency of appearance in the collection. The time to generate integers from tokens is not measured in our experiments since it is the same for all algorithms. We also do not consider the indexing time for foreign joins, which is considered a preprocessing step. The use of PEL has no impact on the index construction. The prefix sizes are maxprefix for foreign joins and midprefix for self joins. For self joins, we include the indexing time in the overall runtime since the index is built incrementally on-the-fly. We report results for Jaccard and Cosine similarity; the results for Dice show similar behavior. Our experiments are executed on the following real-world data sets:

  • DBLP³: Snapshot (February 2014) of the DBLP bibliographic database. We concatenate authors and title of each entry and generate tokens by splitting on whitespace.

  • TREC⁴: References from the MEDLINE database, years 1987–1991. We concatenate author, title, and abstract, remove punctuation, and split on whitespace.

  • ENRON⁵: Real e-mail messages published by FERC after the ENRON bankruptcy. We concatenate subject and body fields, remove punctuation, and split on whitespace.

³ http://www.informatik.uni-trier.de/~Ley/db/
⁴ http://trec.nist.gov/data/t9_filtering.html
⁵ https://www.cs.cmu.edu/~enron/

Table 2: Input set characteristics.
           #sets in collection   set size min   set size max   set size avg   #of diff. tokens
  DBLP     3.9 · 10^6            2              283            12             1.34 · 10^6
  TREC     3.5 · 10^5            2              628            134            3.4 · 10^5
  ENRON    5 · 10^5              1              192 000        298            7.3 · 10^6

Table 2 lists basic characteristics of the input sets. We conduct our experiments on an Intel Xeon 2.60 GHz machine with 128 GB RAM running Debian 7.6 'wheezy'. We compile our code with gcc -O3. Claims about results on "all" thresholds for a particular data set refer to the thresholds {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. We stop tests whose runtime exceeds one hour.

Foreign Joins. For foreign joins, we join a collection of sets with a copy of itself, but do not leverage the fact that the collections are identical. Figures 7(a) and 7(b) show the performance on DBLP with Jaccard similarity threshold 0.75 and Cosine similarity threshold 0.85. These thresholds produce result sets of similar size. We observe a speedup of factor 3.5 for AllPairs-PEL over AllPairs with Jaccard, and a speedup of 3.8 with Cosine. From mpjoin to mpjoin-PEL we observe a speedup of 4.0 with Jaccard and 4.2 with Cosine. Thus, the PEL filter provides a substantial speed advantage on these data points. For the other Jaccard thresholds and mpjoin vs. mpjoin-PEL, the maximum speedup is 4.1 and the minimum speedup is 1.02. For threshold 0.5, only mpjoin-PEL finishes within the time limit of one hour. Among all Cosine thresholds and mpjoin vs. mpjoin-PEL, the maximum speedup is 4.2 (tC = 0.85), the minimum speedup is 1.14 (tC = 0.95). We only consider Cosine thresholds tC ≥ 0.75, because the non-PEL variants exceed the time limit for smaller thresholds. There is no data point where PEL slows down an algorithm. It is also worth noting that AllPairs-PEL beats mpjoin by a factor of 2.7 with Jaccard threshold tJ = 0.75 and 3.3 with Cosine threshold tC = 0.85; we observe such speedups also for other thresholds.

Figure 7(c) shows the performance on TREC with Jaccard threshold tJ = 0.75. The speedup of AllPairs-PEL compared to AllPairs is 1.64, and of mpjoin-PEL compared to mpjoin it is 2.3. The minimum speedup of mpjoin-PEL over mpjoin across all thresholds is 1.26 (tJ = 0.95), the maximum speedup is 2.3 (tJ = 0.75). Performance gains on ENRON are slightly smaller: we observe speedups of 1.15 (AllPairs-PEL over AllPairs) and 1.85 (mpjoin-PEL over mpjoin) for Jaccard threshold tJ = 0.75, as illustrated in Figure 7(d). The minimum speedup of mpjoin-PEL over mpjoin is 1.24 (tJ = 0.9 and 0.95), the maximum speedup is 2.0 (tJ = 0.6).

Figure 8(a) shows the number of processed index entries (i.e., the overall length of the inverted lists that must be scanned) for Jaccard threshold tJ = 0.75 on TREC. The number of index entries increases by a factor of 1.67 for AllPairs w.r.t. AllPairs-PEL, and by a factor of 4.0 for mpjoin w.r.t. mpjoin-PEL.

Figure 8(b) shows the number of candidates that must be verified for Jaccard threshold tJ = 0.75 on TREC. For AllPairs, PEL decreases the number of candidates. This is because AllPairs does not apply any further filters before verification. For mpjoin, the number of candidates increases by 20%. This is due to the smaller number of matches from the prefix index in the case of PEL: later matches can remove pairs from the candidate set (using the positional filter) and thus decrease its size. However, the larger candidate set for PEL does not seriously impact the overall performance: the positional filter is also applied in the verification phase, where the extra candidate pairs are pruned immediately.

Self joins. Due to space constraints, we only show results for DBLP and ENRON, i.e., the input sets with the smallest and the largest average set sizes, respectively. Figures 7(e) and 7(f) show the performance of the algorithms on DBLP and ENRON with Jaccard threshold tJ = 0.75. Our PEL filter provides a speedup of about 1.22 for AllPairs and 1.17 for mpjoin on DBLP. The maximum speedup we observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for tJ = 0.95 there is no speed difference between mpjoin and mpjoin-PEL. On the large sets of ENRON, the performance is worse for AllPairs-PEL because verification takes more time than PEL can save in the probing phase (by reducing the number of processed index entries). There is almost no difference between mpjoin and mpjoin-PEL.
The maximum increase in speed is 9% (threshold 0.8, mpjoin), the maximum slowdown is 30% (threshold 0.6, AllPairs).

Figure 7: Join times. (a) Foreign join, DBLP, tJ = 0.75. (b) Foreign join, DBLP, tC = 0.85. (c) Foreign join, TREC, tJ = 0.75. (d) Foreign join, ENRON, tJ = 0.75. (e) Self join, DBLP, tJ = 0.75. (f) Self join, ENRON, tJ = 0.75.

Figure 8: TREC (foreign join), tJ = 0.75. (a) Number of processed index entries. (b) Number of candidates to be verified.

Summarizing, PEL substantially improves the runtime in foreign join scenarios. For self joins, PEL is less effective and, in some cases, may even slightly increase the runtime.

5. RELATED WORK

Sarawagi and Kirpal [8] first discuss efficient algorithms for exact set similarity joins. Chaudhuri et al. [5] propose SSJoin as an in-database operator for set similarity joins and introduce the prefix filter. AllPairs [4] uses the prefix filter with an inverted list index. The ppjoin algorithm [12] extends AllPairs by the positional filter and introduces the suffix filter, which reduces the candidate set before the final verification. The mpjoin algorithm [7] improves over ppjoin by reducing the number of entries returned from the index. AdaptJoin [10] takes the opposite approach and drastically reduces the number of candidates at the expense of longer prefixes. Gionis et al. [6] propose an approximate algorithm based on LSH for set similarity joins. Recently, an SQL operator for the token generation problem was introduced [3].

6. CONCLUSIONS

We presented PEL, a new filter based on the pmaxsize upper bound derived in this paper. PEL can be easily plugged into algorithms that store prefixes in an inverted list index (e.g., AllPairs, ppjoin, or mpjoin). For these algorithms, PEL will effectively reduce the number of list entries that must be processed. This reduces the overall lookup time in the inverted list index at the cost of a potentially larger candidate set. We analyzed this trade-off for foreign joins and self joins. Our empirical evaluation demonstrated that the PEL filter improves performance in almost any foreign join and also in some self join scenarios, despite the fact that it may increase the number of candidates to be verified.

7. REFERENCES

[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proc. VLDB, pages 918–929, 2006.
[2] N. Augsten, M. H. Böhlen, and J. Gamper. The pq-gram distance between ordered labeled trees. ACM TODS, 35(1), 2010.
[3] N. Augsten, A. Miraglia, T. Neumann, and A. Kemper. On-the-fly token similarity joins in relational databases. In Proc. SIGMOD, pages 1495–1506. ACM, 2014.
[4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proc. WWW, pages 131–140, 2007.
[5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. ICDE, page 5. IEEE, 2006.
[6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB, pages 518–529, 1999.
[7] L. A. Ribeiro and T. Härder. Generalizing prefix filtering to improve set similarity joins. Information Systems, 36(1):62–78, 2011.
[8] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proc. SIGMOD, pages 743–754. ACM, 2004.
[9] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: A large-scale study in the Orkut social network. In Proc. SIGKDD, pages 678–684. ACM, 2005.
[10] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering? An adaptive framework for similarity join and search. In Proc. SIGMOD, pages 85–96. ACM, 2012.
[11] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. In Proc. VLDB, 2008.
[12] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM TODS, 36(3):15, 2011.


