Proceedings of the 26th GI-Workshop Foundations of Databases (Grundlagen von Datenbanken), Bozen-Bolzano, Italy, October 21-24, 2014.

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners.

Editors:

Friederike Klan
Friedrich-Schiller-Universität Jena, Fakultät für Mathematik und Informatik, Heinz-Nixdorf-Stiftungsprofessur für Verteilte Informationssysteme, Ernst-Abbe-Platz 2, DE-07743 Jena
E-Mail: friederike.klan@uni-jena.de

Günther Specht
Universität Innsbruck, Fakultät für Mathematik, Informatik und Physik, Forschungsgruppe Datenbanken und Informationssysteme, Technikerstrasse 21a, AT-6020 Innsbruck
E-Mail: guenther.specht@uibk.ac.at

Hans Gamper
Freie Universität Bozen-Bolzano, Fakultät für Informatik, Dominikanerplatz 3, IT-39100 Bozen-Bolzano
E-Mail: gamper@inf.unibz.it

Preface

The 26th workshop "Foundations of Databases" (Grundlagen von Datenbanken, GvDB) 2014 took place from October 21 to 24, 2014 on the Ritten in South Tyrol, a scenic high plateau overlooking the Dolomites. The journey there was already a highlight: from the Bozen railway station, the longest cable car in South Tyrol took the participants up the mountain, and the Rittner Bahn, an old narrow-gauge tramway, then crossed the larch meadows to the conference venue.

The four-day workshop was organized by the GI working group "Foundations of Information Systems" within the special interest group on Databases and Information Systems (DBIS). Its subject is the conceptual and methodological foundations of databases and information systems, while remaining open to new applications. The workshop series and the working group celebrate their 25th anniversary this year, which makes the working group one of the oldest in the GI. The anniversary workshop was organized jointly by Dr. Friederike Klan of the Heinz Nixdorf Endowed Chair for Distributed Information Systems at the Friedrich-Schiller-Universität Jena, Prof. Dr. Günther Specht of the Databases and Information Systems (DBIS) research group at the Universität Innsbruck, and Prof. Dr. Johann Gamper of the Databases and Information Systems (DIS) group at the Free University of Bozen-Bolzano.

The workshop is intended to foster communication among researchers in the German-speaking countries who work on the foundations of databases and information systems. In particular, it gives young researchers the opportunity to present their current work to a larger audience in a relaxed atmosphere. Against the backdrop of the impressive South Tyrolean mountains, the workshop, held at 1200 meters above sea level, offered an ideal setting for open and inspiring discussions without time pressure. In total, 14 papers were selected from the submissions after a review process and presented at the workshop. The diversity of topics is particularly noteworthy: core areas of database systems and database design were covered, as well as information extraction, recommender systems, time series processing, graph algorithms in the GIS area, data privacy, and data quality. The presentations were complemented by two keynotes: Ulf Leser, professor at the Humboldt-Universität zu Berlin, gave a keynote on Next Generation Data Integration (for the Life Sciences), and Francesco Ricci, professor at the Free University of Bozen-Bolzano, spoke on Context and Recommendations: Challenges and Results.
We thank both speakers for their spontaneous willingness to come and for their interesting talks.

Besides the exchange of knowledge, the social component must not be missing either. The two joint excursions will certainly remain a fond memory for all participants. On the one hand, we climbed the already snow-covered Rittner Horn (2,260 m), which offers a magnificent view of the Dolomites. On the other hand, no autumn stay in South Tyrol would be complete without the so-called Törggelen: a hike to local farm taverns that serve the delicacies of the year together with chestnuts and new wine. Even the rector of the University of Bozen-Bolzano came up from the valley especially to join us.

A conference can only succeed in a good environment. We therefore thank the staff of the Haus der Familie for their work behind the scenes. Further thanks go to all authors, whose contributions and presentations made an interesting workshop possible in the first place, as well as to the program committee and all reviewers for their work. Finally, a big thank-you goes to the organization team, which cooperated superbly and interactively across national borders (Germany, Austria and Italy). The GvDB has never been this international before.

We look forward to seeing you again at the next GvDB workshop.

Günther Specht, Friederike Klan, Johann Gamper
Innsbruck, Jena, Bozen, October 26, 2014

Committee

Organization:
Friederike Klan, Friedrich-Schiller-Universität Jena
Günther Specht, Universität Innsbruck
Hans Gamper, Universität Bozen-Bolzano

Program Committee:
Alsayed Algergawy, Friedrich-Schiller-Universität Jena
Erik Buchmann, Karlsruher Institut für Technologie
Stefan Conrad, Universität Düsseldorf
Hans Gamper, Universität Bozen-Bolzano
Torsten Grust, Universität Tübingen
Andreas Heuer, Universität Rostock
Friederike Klan, Friedrich-Schiller-Universität Jena
Birgitta König-Ries, Friedrich-Schiller-Universität Jena
Klaus Meyer-Wegener, Universität Erlangen
Gunter Saake, Universität Magdeburg
Kai-Uwe Sattler, Technische Universität Ilmenau
Eike Schallehn, Universität Magdeburg
Ingo Schmitt, Brandenburgische Technische Universität Cottbus
Holger Schwarz, Universität Stuttgart
Günther Specht, Universität Innsbruck

Additional Reviewers:
Mustafa Al-Hajjaji, Universität Magdeburg
Xiao Chen, Universität Magdeburg
Doris Silbernagl, Universität Innsbruck

Contents

Next Generation Data Integration (for the Life Sciences) (Keynote)
Ulf Leser 9

Context and Recommendations: Challenges and Results (Keynote)
Francesco Ricci 10

Optimization of Sequences of XML Schema Modifications - The ROfEL Approach
Thomas Nösinger, Andreas Heuer and Meike Klettke 11

Automatic Decomposition of Multi-Author Documents Using Grammar Analysis
Michael Tschuggnall and Günther Specht 17

Proaktive modellbasierte Performance-Analyse und -Vorhersage von Datenbankanwendungen
Christoph Koch 23

Big Data und der Fluch der Dimensionalität: Die effiziente Suche nach Quasi-Identifikatoren in hochdimensionalen Daten
Hannes Grunert and Andreas Heuer 29

Combining Spotify and Twitter Data for Generating a Recent and Public Dataset for Music Recommendation
Martin Pichl, Eva Zangerle and Günther Specht 35

Incremental calculation of isochrones regarding duration
Nikolaus Krismer, Günther Specht and Johann Gamper 41

Software Design Approaches for Mastering Variability in Database Systems
David Broneske, Sebastian Dorok, Veit Koeppen and Andreas Meister 47
PageBeat - Zeitreihenanalyse und Datenbanken
Andreas Finger, Ilvio Bruder, Andreas Heuer, Martin Klemkow and Steffen Konerow 53

Databases under the Partial Closed-world Assumption: A Survey
Simon Razniewski and Werner Nutt 59

Towards Semantic Recommendation of Biodiversity Datasets based on Linked Open Data
Felicitas Löffler, Bahar Sateli, René Witte and Birgitta König-Ries 65

Exploring Graph Partitioning for Shortest Path Queries on Road Networks
Theodoros Chondrogiannis and Johann Gamper 71

Missing Value Imputation in Time Series Using Top-k Case Matching
Kevin Wellenzohn, Hannes Mitterer, Johann Gamper, Michael Böhlen and Mourad Khayati 77

Dominanzproblem bei der Nutzung von Multi-Feature-Ansätzen
Thomas Böttcher and Ingo Schmitt 83

PEL: Position-Enhanced Length Filter for Set Similarity Joins
Willi Mann and Nikolaus Augsten 89


Next Generation Data Integration (for the Life Sciences) [Abstract]

Ulf Leser
Humboldt-Universität zu Berlin
Institute for Computer Science
leser@informatik.hu-berlin.de

ABSTRACT
Ever since the advent of high-throughput biology (e.g., the Human Genome Project), integrating the large number of diverse biological data sets has been considered one of the most important tasks for advancement in the biological sciences. The life sciences also served as a blueprint for complex integration tasks in the CS community, due to the availability of a large number of highly heterogeneous sources and the urgent integration needs. Whereas the early days of research in this area were dominated by virtual integration, the currently most successful architecture uses materialization. Systems are built using ad-hoc techniques and a large amount of scripting. However, recent years have seen a shift in the understanding of what a "data integration system" actually should do, revitalizing research in this direction. In this tutorial, we review the past and current state of data integration (exemplified by the life sciences) and discuss recent trends in detail, all of which pose challenges for the database community.

About the Author
Ulf Leser obtained a Diploma in Computer Science at the Technische Universität München in 1995. He then worked as a database developer at the Max-Planck-Institute for Molecular Genetics before starting his PhD with the Graduate School for "Distributed Information Systems" in Berlin. Since 2002 he has been a professor for Knowledge Management in Bioinformatics at Humboldt-Universität zu Berlin.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.


Context and Recommendations: Challenges and Results [Abstract]

Francesco Ricci
Free University of Bozen-Bolzano
Faculty of Computer Science
fricci@unibz.it
ABSTRACT
Recommender Systems (RSs) are popular tools that automatically compute suggestions for items that are predicted to be interesting and useful to a user. They track users' actions, which signal users' preferences, and aggregate them into predictive models of the users' interests. In addition to the long-term interests, which are normally acquired and modeled in RSs, the specific ephemeral needs of the users, their decision biases, the context of the search, and the context of items' usage do influence the user's response to and evaluation of the suggested items. But appropriately modeling the user in the situational context and reasoning upon that is still challenging; there are still major technical and practical difficulties to solve: obtaining sufficient and informative data describing user preferences in context; understanding the impact of the contextual dimensions on the user's decision-making process; and embedding the contextual dimensions in a recommendation computational model. These topics will be illustrated in the talk, with examples taken from the recommender systems that we have developed.

About the Author
Francesco Ricci is associate professor of computer science at the Free University of Bozen-Bolzano, Italy. His current research interests include recommender systems, intelligent interfaces, mobile systems, machine learning, case-based reasoning, and the applications of ICT to tourism and eHealth. He has published more than one hundred academic papers on these topics and has been invited to give talks at many international conferences, universities and companies. He is among the editors of the Handbook of Recommender Systems (Springer 2011), a reference text for researchers and practitioners working in this area. He is the editor in chief of the Journal of Information Technology & Tourism and serves on the editorial board of the Journal of User Modeling and User-Adapted Interaction. He is a member of the steering committee of the ACM Conference on Recommender Systems. He has served on the program committees of several conferences, including as a program co-chair of the ACM Conference on Recommender Systems (RecSys), the International Conference on Case-Based Reasoning (ICCBR) and the International Conference on Information and Communication Technologies in Tourism (ENTER).

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.


Optimization of Sequences of XML Schema Modifications - The ROfEL Approach

Thomas Nösinger, Meike Klettke, Andreas Heuer
Database Research Group
University of Rostock, Germany
(tn, meike, ah)@informatik.uni-rostock.de

ABSTRACT
The transformation language ELaX (Evolution Language for XML-Schema [16]) is a domain-specific language for modifying existing XML Schemas. ELaX was developed to express complex modifications by using add, delete and update statements. Additionally, it is used to consistently log all change operations specified by a user. In this paper we present the rule-based optimization algorithm ROfEL (Rule-based Optimizer for ELaX) for reducing the number of logged operations by identifying and removing unnecessary, redundant and also invalid modifications. This is an essential prerequisite for the co-evolution of XML Schemas and corresponding XML documents.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
1. INTRODUCTION
The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To assure that well-defined XML documents are valid, it is necessary to introduce a document description which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. After an XML Schema has been used for a period of time, the requirements can change, for example if additional elements are needed, data types change or integrity constraints are introduced. This may result in the adaptation of the XML Schema definition.

In [16] we presented the transformation language ELaX (Evolution Language for XML-Schema) to describe and formulate these XML Schema modifications. Furthermore, we mentioned briefly that ELaX is also useful to log information about modifications consistently, an essential prerequisite for the co-evolution process of XML Schema and corresponding XML documents [14].

One problem of storing information over a long period of time is that there can be different unnecessary or redundant modifications. Consider modifications which first add an element and shortly afterwards delete the same element. In the overall context of an efficient realization of modification steps, such operations have to be removed. Further issues are incorrect information (possibly caused by network problems), for example if the same element is deleted twice or the order of modifications is invalid (e.g. update before add).

The new rule-based optimizer for ELaX (ROfEL - Rule-based Optimizer for ELaX) has been developed to solve the above mentioned problems. With ROfEL it is possible to identify unnecessary or redundant operations by using different straightforward optimization rules. Furthermore, the underlying algorithm is capable of correcting invalid modification steps. All in all, ROfEL can reduce the number of modification steps by removing or even correcting the logged ELaX operations.

This paper is organized as follows. Section 2 gives the necessary background on XML Schema, ELaX and corresponding concepts. Section 3 and section 4 present our approach, by first specifying our rule-based algorithm ROfEL and then showing how our approach can be applied to an example. Related work is presented in section 5. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND
In this section we present a common notation used in the remainder of this paper. At first, we shortly introduce the XSD (XML Schema Definition [4]), before details concerning ELaX (Evolution Language for XML-Schema [16]) and the logging of ELaX are given.

The XML Schema abstract data model consists of different components (simple and complex type definitions, element and attribute declarations, etc.). Additionally, the element information item serves as an XML representation of these components and defines which content and attributes can be used in an XML Schema. The possibility of specifying declarations and definitions in a local or global scope leads to four different modeling styles [13]. One of them is the Garden of Eden style, in which all above mentioned components are globally defined. This results in a high re-usability of declarations and defined data types and influences the flexibility of an XML Schema in general.

The transformation language ELaX¹ was developed to handle modifications of an XML Schema and to express such modifications formally. The abstract data model, the element information item and the Garden of Eden style were important throughout the development process and influence the EBNF (Extended Backus-Naur Form) like notation of ELaX.

¹ The whole transformation language ELaX is available at: www.ls-dbis.de/elax
An ELaX statement always starts with "add", "delete" or "update", followed by one of the alternative components (simple type, element declaration, etc.) and an identifier of the current component, and is completed with optional tuples of attributes and values (examples follow, e.g. see figure 1). The identifier is a unique EID (emxid)², a QNAME (qualified name) or a subset of XPath expressions. In the remaining parts we will use the EID as the identifier, but a transformation would easily be possible.

ELaX statements are logged for further analyses and also as a prerequisite for the rule-based optimizer (see section 3). Figure 1 illustrates the relational schema of the log. The chosen values are simple ones (especially the length).

  file-ID  time  EID  op-Type  msg-Type  content
  1        1     1    add      0         add element name 'name' type 'xs:decimal' id 'EID1' ;
  1        2     1    upd      0         update element name 'name' change type 'xs:string' ;
  1        3     2    add      0         add element name 'count' type 'xs:decimal' id 'EID2' ;
  ...      ...   ...  ...      ...       ...

  Figure 1: Schema with relation for logging ELaX

The attributes file-ID and time are the composite key of the logging relation; the EID represents the unique identifier of a component of the XSD. The op-Type is a short form for the add, delete (del) or update (upd) operations, the msg-Type stands for the different message types (ELaX (0), etc.). Lastly, the content contains the logged ELaX statements. The file-ID and msg-Type are management information, which are not covered in this paper.

² Our conceptual model is EMX (Entity Model for XML Schema [15]), in which every component of a model has its own, global identifier: EID
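As an illustration only, the following minimal Python sketch shows one way such logged ELaX statements could be represented in memory for rule processing. The tuple layout (time, EID, op-Type, content) mirrors the logging relation of figure 1, with file-ID and msg-Type omitted because they are not used by the optimizer; the layout and names are assumptions made here, not part of ELaX or ROfEL.

  # Hypothetical in-memory form of the first entries of figure 1:
  # (time, EID, op-Type, content), content as attribute-value pairs.
  log = [
      (1, "EID1", "add", {"name": "name",  "type": "xs:decimal", "id": "EID1"}),
      (2, "EID1", "upd", {"name": "name",  "type": "xs:string"}),
      (3, "EID2", "add", {"name": "count", "type": "xs:decimal", "id": "EID2"}),
  ]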
3. RULE-BASED OPTIMIZER
The algorithm ROfEL (Rule-based Optimizer for ELaX) was developed to reduce the number of logged ELaX operations. This is possible by combining given operations and/or removing unnecessary or even redundant operations. Furthermore, the algorithm can identify invalid operations in a given log and correct them to a certain degree.

ROfEL is a rule-based algorithm. Provided that a log of ELaX operations is given (see section 2), the following rules are essential to reduce the number of operations. In compliance with ELaX these operations are delete (del), add or update (upd). If a certain distinction is not necessary, a general operation (op) or a variable (_) is used; empty denotes a not given operation. Additionally, the rules are classified by their purpose to handle redundant (R), unnecessary (U) or invalid (I) operations. ROfEL stops (S) if no other rules are applicable, for example if no other operation with the same EID is given.

  S: empty → op(EID) ⇒ op(EID)                                      (1)

  // most recent operation: delete (del)
  R: del(EID) → del(EID) ⇒ del(EID)                                 (2)
  U: add(EID, content) → del(EID) ⇒ empty                           (3)
  U: upd(EID, content) → del(EID) ⇒ del(EID)                        (4)
     with time(del(EID)) := TIME(del(EID), upd(EID, content))

  // most recent operation: add
  U: op(EID) → del(EID) → add(EID, content)
     ⇒ op(EID) → add(EID, content)                                  (5)
  I: add(EID, _) → add(EID, content) ⇒ add(EID, content)            (6)
  I: upd(EID, _) → add(EID, content) ⇒ upd(EID, content)            (7)

  // most recent operation: update (upd)
  I: op(EID) → del(EID) → upd(EID, content)
     ⇒ op(EID) → upd(EID, content)                                  (8)
  U: add(EID, content) → upd(EID, content) ⇒ add(EID, content)      (9)
  U: add(EID, content) → upd(EID, content')
     ⇒ add(EID, MERGE(content', content))                           (10)
  R: upd(EID, content) → upd(EID, content) ⇒ upd(EID, content)      (11)
  U: upd(EID, content) → upd(EID, content')
     ⇒ upd(EID, MERGE(content', content))                           (12)

The rules have to be analyzed sequentially from left to right (→), whereas the left operation comes temporally before the right one (i.e., time(left) < time(right)). To warrant that the operations are working on the same component, the EID of both operations is equal. If two operations exist and a rule applies to them, then the result can be found on the right side of ⇒. The time of the result is the time of the prior (left) operation, except if further investigations are explicitly necessary or the time is unknown (e.g. empty).

Another point of view illustrates that the introduced rules are complete concerning the given operations add, delete and update. Figure 2 represents an operation matrix, in which every possible combination is covered by at least one rule. On the x-axis the prior operation and on the y-axis the most recent operation are given, whereas the three-valued rules (5) and (8) are reduced to the two most recent operations (e.g. without op(EID)). The cell at each intersection contains the applying rule or rules (considering the possibility of merging the content, see below).

                            prior operation
                      add         delete    update
  recent   add        (6)         (5)       (7)
           delete     (3)         (2)       (4)
           update     (9), (10)   (8)       (11), (12)

  Figure 2: Operation matrix of rules

Rule (4) is one example for further investigations. If a component is deleted (del(EID)) but updated (upd(EID)) before, then it is not possible to replace the prior operation with the result (del(EID)) without analyzing other operations between them. The problem is: if another operation (op(EID')) references the deleted component (e.g. a simple type), but because of ROfEL upd(EID) (the prior operation) is replaced with del(EID), then op(EID') would be invalid. Therefore, the function TIME() is used to determine the correct time of the result. The function is given in pseudocode in figure 3. TIME() has two input parameters and returns a time value, depending on the existence of an operation which references the EID in its content. If no such operation exists, the time of the result in rule (4) is the time of the left operation (op), otherwise that of the right operation (op'). The lines starting with // are comments and contain further information, some hints or even explanations of variables.

  TIME(op, op'):
  // time(op) = t; time(op') = t'; time(opx) = tx;
  // op.EID == op'.EID; op.EID != opx.EID; t > t';
  begin
    if ((t > tx > t') AND (op.EID in opx.content))
    then return t;
    return t';
  end.

  Figure 3: TIME() function of optimizer

The rules (6), (7) and (8) adapt invalid operations. For example, if a component is updated but deleted before (see rule (8)), then ROfEL has to decide which operation is valid. In this and similar cases the most recent operation is preferred, because it is more difficult (or even impossible) to check the intention of the prior operation. Consequently, in rule (8) del(EID) is removed and rule op(EID) → upd(EID, content) applies (op(EID) could be empty; see rule (1)).
The rules (10) and (12) remove unnecessary operations by merging the content of the involved operations. The function MERGE() implements this; the pseudocode is presented in figure 4. MERGE() has two input parameters, the content of the most recent (left) and the prior (right) operation. The content is given as a sequence of attribute-value pairs (see the ELaX description in section 2). The result of the function is the combination of the input, whereas the content of the most recent operation is preferred, analogical to the above mentioned behaviour for I rules. All attribute-value pairs of the most recent operation are completely inserted into the result. Simultaneously, these attributes are removed from the content of the prior operation. At the end of the function, all remaining attributes of the prior (right) operation are inserted, before the result is returned.

  MERGE(content, content'):
  // content  = (A1 = 'a1', A2 = 'a2', A3 = '', A4 = 'a4');
  // content' = (A1 = 'a1', A2 = '', A3 = 'a3', A5 = 'a5');
  begin
    result := {};
    count := 1;
    while (count <= content.size())
      result.add(content.get(count));
      if (content.get(count) in content')
      then content'.remove(content.get(count));
      count := count + 1;
    count := 1;
    while (count <= content'.size())
      result.add(content'.get(count));
      count := count + 1;
    // result = (A1 = 'a1', A2 = 'a2', A3 = '', A4 = 'a4', A5 = 'a5');
    return result;
  end.

  Figure 4: MERGE() function of optimizer

All mentioned rules, as well as the functions TIME() and MERGE(), are essential parts of the main function ROFEL(); the pseudocode is presented in figure 5. ROFEL() has one input parameter, the log of ELaX operations. This log is a sequence sorted according to time; it is analyzed in reverse. In general, one operation is pinned (log.get(i)) and compared with the next, prior operation (log.get(k)). If log.get(k) modifies the same component as log.get(i) (i.e., the EID is equal) and the time is different, then an applying rule is searched, otherwise the next operation (log.get(k - 1)) is analyzed. The algorithm terminates if the outer loop completes successfully (i.e., no further optimization is possible).

  ROFEL(log):
  // log = ((t1,op1), (t2,op2), ...); t1 < t2 < ...;
  begin
    for (i := log.size(); i >= 2; i := i - 1)
      for (k := i - 1; k >= 1; k := k - 1)
        if (!(log.get(i).EID == log.get(k).EID AND
              log.get(i).time != log.get(k).time))
        then continue;
        // R: del(EID) -> del(EID) => del(EID) (2)
        if (log.get(i).op-Type == 1 AND log.get(k).op-Type == 1)
        then
          log.remove(i);
          return ROFEL(log);
        // U: upd(EID, content) -> del(EID) => del(EID) (4)
        if (log.get(i).op-Type == 1 AND log.get(k).op-Type == 2)
        then
          temp := TIME(log.get(i), log.get(k));
          if (temp == log.get(i).time)
          then
            log.remove(k);
            return ROFEL(log);
          log.get(k) := log.get(i);
          log.remove(i);
          return ROFEL(log);
        [...]
        // U: upd(EID,con) -> upd(EID,con') => upd(EID, MERGE(con',con)) (12)
        if (log.get(i).op-Type == 2 AND log.get(k).op-Type == 2)
        then
          temp := MERGE(log.get(i).content, log.get(k).content);
          log.get(k).content := temp;
          log.remove(i);
          return ROFEL(log);
    return log;
  end.

  Figure 5: Main function ROFEL() of optimizer
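The pseudocode of figures 3 to 5 translates almost directly into executable code. The following Python sketch implements a simplified subset of ROfEL, covering rules (2), (3), (4), (10) and (12) on the tuple-based log representation sketched in section 2; the remaining rules are omitted, and the numeric operation codes (1 = del, 2 = upd as in figure 5, 3 = add) are partly an assumption. It is meant to illustrate the reduce-and-restart structure of figure 5, not to reproduce the authors' implementation.

  # Simplified ROfEL sketch; a log entry is (time, eid, op, content).
  DEL, UPD, ADD = 1, 2, 3

  def merge(recent, prior):
      # MERGE() of figure 4: attribute-value pairs of the most recent
      # operation win; remaining attributes of the prior one are kept.
      result = dict(prior)
      result.update(recent)
      return result

  def time_of_result(log, i, k):
      # TIME() of figure 3: keep the recent time if an operation between
      # the two entries references the EID in its content, else the prior.
      t_i, eid = log[i][0], log[i][1]
      t_k = log[k][0]
      for t_x, eid_x, _, c_x in log:
          if t_k < t_x < t_i and eid_x != eid and eid in c_x.values():
              return t_i
      return t_k

  def rofel(log):
      # Main loop of figure 5: pin the most recent entry, search backwards
      # for an entry on the same component, apply a rule, restart recursively.
      for i in range(len(log) - 1, 0, -1):
          t_i, eid_i, op_i, c_i = log[i]
          for k in range(i - 1, -1, -1):
              t_k, eid_k, op_k, c_k = log[k]
              if eid_i != eid_k or t_i == t_k:
                  continue
              if op_i == DEL and op_k == DEL:              # rule (2)
                  del log[i]
                  return rofel(log)
              if op_i == DEL and op_k == ADD:              # rule (3)
                  del log[i]
                  del log[k]
                  return rofel(log)
              if op_i == DEL and op_k == UPD:              # rule (4)
                  if time_of_result(log, i, k) == t_i:
                      del log[k]
                  else:
                      log[k] = (t_k, eid_k, DEL, c_i)
                      del log[i]
                  return rofel(log)
              if op_i == UPD and op_k == ADD:              # rules (9)/(10)
                  log[k] = (t_k, eid_k, ADD, merge(c_i, c_k))
                  del log[i]
                  return rofel(log)
              if op_i == UPD and op_k == UPD:              # rules (11)/(12)
                  log[k] = (t_k, eid_k, UPD, merge(c_i, c_k))
                  del log[i]
                  return rofel(log)
              # combinations covered by the omitted rules are skipped here
      return log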
Three rules are presented in figure 5; the missing ones are skipped ([...]). The first rule is (2), the occurrence of redundant delete operations. According to the above mentioned time choosing guidelines, the most recent operation (log.get(i)) is removed. After this the optimizer starts again with the modified log recursively (return ROFEL(log)).

The second rule is (4), which removes an unnecessary update operation, because the whole referenced component will be deleted later. This rule uses the TIME() function of figure 3 to decide which time should be assigned to the result. If another operation between log.get(i) and log.get(k) exists and this operation contains or references log.get(i).EID, then the most recent time (log.get(i).time) is assigned, otherwise the prior time (log.get(k).time).

The last rule is (12), where different updates on the same component are given. The MERGE() function of figure 4 combines the content of both operations, before the content of the prior operation is changed and the most recent operation is removed.

After introducing detailed information about the concept of the ROfEL algorithm, we want to use it to optimize an example in the next section.

4. EXAMPLE
In the last section we specified the rule-based algorithm ROfEL (Rule-based Optimizer for ELaX); now we want to explain its use with an example: we want to store some information about a conference. We assume the XML Schema of figure 6 is given; a corresponding XML document is also presented. The XML Schema is in the Garden of Eden style and contains four element declarations (conf, name, count, start) and one complex type definition (confType) with a group model (sequence). The group model has three element references, which reference one of the simple type element declarations mentioned above. The identification of all components is simplified by using an EID; it is visualized as a unique ID attribute (id = "..").

  [Figure 6: XML Schema with XML document]

The log of modification steps to create this XML Schema is presented in figure 7. The relational schema is reduced in comparison to figure 1. The time, the component EID, the op-Type and the content of the modification steps are given. The log contains different modification steps which are not given in the XML Schema (EID > 9). Additionally, some entries are connected within the newly introduced column ROfEL; in the original figure, red lines and numbers link the log entries involved in the rule applications discussed below.

  time  EID  op-Type  content
  1     1    add      add element name 'name' type 'xs:decimal' id 'EID1' ;
  2     1    upd      update element name 'name' change type 'xs:string' ;
  3     2    add      add element name 'count' type 'xs:decimal' id 'EID2' ;
  4     3    add      add element name 'start' type 'xs:date' id 'EID3' ;
  5     42   add      add element name 'stop' type 'xs:date' id 'EID42' ;
  6     4    add      add complextype name 'confType' id 'EID4' ;
  7     5    add      add group mode sequence id 'EID5' in 'EID4' ;
  8     42   upd      update element name 'stop' change type 'xs:string' ;
  9     6    add      add elementref 'name' id 'EID6' in 'EID5' ;
  10    7    add      add elementref 'count' id 'EID7' in 'EID5' ;
  11    8    add      add elementref 'start' id 'EID8' in 'EID5' ;
  12    42   del      delete element name 'stop' ;
  13    9    add      add element name 'conf' type 'confType' id 'EID9' ;
  14    42   del      delete element name 'stop' ;

  Figure 7: XML Schema modification log of figure 6

The sorted log is analyzed reversely: the operation with time stamp 14 is pinned and compared with time entry 13. Because the modified component is not the same (EID not equal), the next operation with time 12 is taken. Both operations delete the same component (op-Type == 1). According to rule (2), the redundant entry 14 is removed and ROFEL restarts with the adapted log. Rule (4) applies next: a component is updated but deleted later. This rule calls the TIME() function to determine whether the time of the result (i.e., del(EID)) should be 12 or 8. Because no operation between 12 and 8 references EID 42, the time of the result of (4) is 8.
The content of time 8 is replaced with delete element name 'stop';, the op-Type is set to 1 and the time entry 12 is deleted.

Afterwards, ROFEL restarts again and rule (3) can be used to compare the new operation of entry 8 (original entry 12) with the operation of time 5. A component is inserted but deleted later, so all modifications on this component are unnecessary in general. Consequently, both entries are deleted and the component with EID 42 is not given in the XML Schema of figure 6.

The last applying rule is (10). An element declaration is inserted (time 1) and updated (time 2). Consequently, the MERGE() function is used to combine the content of both operations. According to the ELaX specification, the content of the update operation contains the attribute type with the value xs:string, whereas the add operation contains the attribute type with the value xs:decimal and id with EID1. All attribute-value pairs of the update operation are completely inserted into the output of the function (type = "xs:string"). Simultaneously, the attribute type is removed from the content of the add operation (type = "xs:decimal"). The remaining attributes are inserted in the output (id = "EID1"). Afterwards, the content of entry 1 is replaced by add element 'name' type "xs:string" id "EID1"; and the second entry is deleted (time 2).

The modification log of figure 7 is optimized with rules (2), (4), (3) and (10). It is presented in figure 8. All in all, five of 14 entries are removed, whereas one is replaced by a combination of two others.

  time  EID  op-Type  content
  1     1    add      add element name 'name' type 'xs:string' id 'EID1' ;
  3     2    add      add element name 'count' type 'xs:decimal' id 'EID2' ;
  4     3    add      add element name 'start' type 'xs:date' id 'EID3' ;
  6     4    add      add complextype name 'confType' id 'EID4' ;
  7     5    add      add group mode sequence id 'EID5' in 'EID4' ;
  9     6    add      add elementref 'name' id 'EID6' in 'EID5' ;
  10    7    add      add elementref 'count' id 'EID7' in 'EID5' ;
  11    8    add      add elementref 'start' id 'EID8' in 'EID5' ;
  13    9    add      add element name 'conf' type 'confType' id 'EID9' ;

  Figure 8: XML Schema modification log of figure 7 after using rules (2), (4), (3) and (10) of ROfEL

This simple example illustrates how ROfEL can reduce the number of logged operations introduced in section 3. More complex examples are easy to construct and can be solved by using the same rules and the same algorithm.
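Assuming the simplified rofel() sketch from section 3 is in scope (together with its DEL/UPD/ADD constants), the reduction of the EID1 and EID42 entries of figure 7 can be reproduced on an abbreviated log; the snippet below is illustrative only and leaves out the entries that are not touched by any rule.

  # Abbreviated figure 7 log: only the EID1 and EID42 entries.
  mini_log = [
      (1,  "EID1",  ADD, {"name": "name", "type": "xs:decimal", "id": "EID1"}),
      (2,  "EID1",  UPD, {"name": "name", "type": "xs:string"}),
      (5,  "EID42", ADD, {"name": "stop", "type": "xs:date", "id": "EID42"}),
      (8,  "EID42", UPD, {"name": "stop", "type": "xs:string"}),
      (12, "EID42", DEL, {"name": "stop"}),
      (14, "EID42", DEL, {"name": "stop"}),
  ]
  print(rofel(mini_log))
  # [(1, 'EID1', 3, {'name': 'name', 'type': 'xs:string', 'id': 'EID1'})]
  # Rules (2), (4) and (3) eliminate every EID42 entry; rule (10) folds the
  # update of EID1 into its add, matching the first entry of figure 8.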
5. RELATED WORK
Comparable to the object lifecycle, we create new types or elements, use them (e.g. modify, move or rename) and delete them. The common optimization rules to reduce the number of operations were originally introduced in [10] and are available in other applications in the same way. In [11], rules for reducing a list of user actions (e.g. move, replace, delete, ...) are introduced. In [9], pre- and postconditions of operations are used for deciding which optimizations can be executed. Additional applications can easily be found in further scientific disquisitions.

Regarding other transformation languages, the most commonly used are XQuery [3] and XSLT (Extensible Stylesheet Language Transformations [1]); for these there are also approaches to reduce the number of unnecessary or redundant operations. Moreover, different transformations to improve efficiency are mentioned.

In [12] different "high-level transformations to prune and merge the stream data flow graph" [12] are applied. "Such techniques not only simplify the later analyses, but most importantly, they can rewrite some queries" [12], an essential prerequisite for the efficient evaluation of XQuery over streaming data.

In [5] packages are introduced because of efficiency benefits.
A package is a collection of stylesheet modules "to avoid compiling libraries repeatedly when they are used in multiple stylesheets, and to avoid holding multiple copies of the same library in memory simultaneously" [5]. Furthermore, XSLT works with templates and matching rules for identifying structures in general. If different templates could be applied, automatic or user-given priorities manage which template is chosen. To avoid unexpected behaviour and improve the efficiency of analyses, it is good practice to remove unnecessary or redundant templates.

Another XML Schema modification language is XSchemaUpdate [6], which is used in the co-evolution prototype EXup [7]. Especially the auto adaptation guidelines are similar to the ROfEL purpose of reducing the number of modification steps. "Automatic adaptation will insert or remove the minimum allowed number of elements for instance" [6], i.e., "a minimal set of updates will be applied to the documents" [6].

In [8] an approach is presented which deals with four operations (insert, delete, update, move) on a tree representation of XML. It is similar to our algorithm, but we use ELaX as basis and EIDs instead of update-intensive labelling mechanisms. Moreover, the distinction between property and node, the "deletion always wins" view, as well as the limitation that a "reduced sequence might still be reducible" [8] are drawbacks. The optimized reduction algorithm eliminates the last drawback, but needs another complex structure, an operation hyper-graph.

6. CONCLUSION
The rule-based algorithm ROfEL (Rule-based Optimizer for ELaX) was developed to reduce the number of logged ELaX (Evolution Language for XML-Schema [16]) operations. In general, ELaX statements are add, delete and update operations on the components of an XML Schema, specified by a user.

ROfEL allows the identification and deletion of unnecessary and redundant modifications by applying different heuristic rules. Additionally, invalid operations are also corrected or removed. In general, if the preconditions and conditions for an adaptation of two ELaX log entries are satisfied (e.g. EID equivalent, op-Type correct, etc.), one rule is applied and the modified, reduced log is returned.

We are confident that, even if ROfEL is domain specific and the underlying log is specialized for our needs, the above specified rules are applicable in other scenarios or applications in which the common modification operations add, delete and update are used (minor adaptations preconditioned).

Future work. The integration of a cost-based component into ROfEL could be very interesting. It is possible that, under consideration of further analyses, the combination of different operations (e.g. rule (10)) is inefficient in general. In this and similar cases a cost function with different thresholds could be defined to guarantee that only efficient adaptations of the log are applied. A convenient cost model would be necessary, but this requires further research.

Feasibility of the approach. At the University of Rostock we implemented the prototype CodeX (Conceptual design and evolution for XML Schema) for dealing with the co-evolution [14] of XML Schema and XML documents; ROfEL and corresponding concepts are fully integrated. As we plan to report in combination with the first release of CodeX, the significantly reduced number of logged operations proves that the whole algorithm is definitely feasible.

7. REFERENCES
[1] XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 25-June-2014.
[2] Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/2008/REC-xml-20081126/, November 2008. Online; accessed 25-June-2014.
[3] XQuery 1.0: An XML Query Language (Second Edition). http://www.w3.org/TR/2010/REC-xquery-20101214/, December 2010. Online; accessed 25-June-2014.
[4] W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/, April 2012. Online; accessed 25-June-2014.
[5] XSL Transformations (XSLT) Version 3.0. http://www.w3.org/TR/2013/WD-xslt-30-20131212/, December 2013. Online; accessed 25-June-2014.
[6] F. Cavalieri. Querying and Evolution of XML Schemas and Related Documents. Master's thesis, University of Genova, 2009.
[7] F. Cavalieri. EXup: an engine for the evolution of XML schemas and associated documents. In Proceedings of the 2010 EDBT/ICDT Workshops, EDBT '10, pages 21:1-21:10, New York, NY, USA, 2010. ACM.
[8] F. Cavalieri, G. Guerrini, M. Mesiti, and B. Oliboni. On the Reduction of Sequences of XML Document and Schema Update Operations. In ICDE Workshops, pages 77-86, 2011.
[9] H. U. Hoppe. Task-oriented Parsing - a Diagnostic Method to Be Used Adaptive Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '88, pages 241-247, New York, NY, USA, 1988. ACM.
[10] M. Klettke. Modellierung, Bewertung und Evolution von XML-Dokumentkollektionen. Habilitation, Fakultät für Informatik und Elektrotechnik, Universität Rostock, 2007.
[11] R. Kramer. iContract - the Java(tm) Design by Contract(tm) tool. In TOOLS '98: Proceedings of the Technology of Object-Oriented Languages and Systems, page 295. IEEE Computer Society, 1998.
[12] X. Li and G. Agrawal. Efficient Evaluation of XQuery over Streaming Data. In Proc. VLDB '05, pages 265-276, 2005.
[13] E. Maler. Schema Design Rules for UBL...and Maybe for You. In XML 2002 Proceedings by deepX, 2002.
[14] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. In Grundlagen von Datenbanken, pages 29-34, 2012.
[15] T. Nösinger, M. Klettke, and A. Heuer. A Conceptual Model for the XML Schema Evolution - Overview: Storing, Base-Model-Mapping and Visualization. In Grundlagen von Datenbanken, 2013.
[16] T. Nösinger, M. Klettke, and A. Heuer. XML Schema Transformations - The ELaX Approach. In DEXA (1), pages 293-302, 2013.
Automatic Decomposition of Multi-Author Documents Using Grammar Analysis

Michael Tschuggnall and Günther Specht
Databases and Information Systems
Institute of Computer Science, University of Innsbruck, Austria
{michael.tschuggnall, guenther.specht}@uibk.ac.at

ABSTRACT
The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of authors. The main idea is thereby to quantify differences of the grammatical writing style of authors and to use this information to build paragraph clusters, whereby each cluster is assigned to a different author. In order to analyze the style of a writer, text is split into single sentences, and for each sentence a full parse tree is calculated. Using the latter, a profile is subsequently computed that represents the main characteristics of each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, whereby a supplementary analysis indicates that in general common classification algorithms perform better than clustering approaches.

Keywords
Text Segmentation, Multi-Author Decomposition, Parse Trees, pq-grams, Clustering

1. INTRODUCTION
The growing amount of currently available data is hardly manageable without the use of specific tools and algorithms that provide relevant portions of that data to the user. While this problem is generally addressed with information retrieval approaches, another possibility to significantly reduce the amount of data is to build clusters. Within each cluster, the data is similar according to some predefined features. Thereby many approaches exist that propose algorithms to cluster plain text documents (e.g. [16], [22]) or specific web documents (e.g. [33]) by utilizing various features. Approaches which attempt to divide a single text document into distinguishable units like different topics, for example, are usually referred to as text segmentation approaches. Here, also many features including statistical models, similarities between words or other semantic analyses are used. Moreover, text clusters are also used in recent plagiarism detection algorithms (e.g. [34]), which try to build a cluster for the main author and one or more clusters for intrusive paragraphs. Another scenario where the clustering of text is applicable is the analysis of multi-author academic papers: especially the verification of collaborative student works such as bachelor or master theses can be useful in order to determine the amount of work done by each student.

Using results of previous work in the field of intrinsic plagiarism detection [31] and authorship attribution [32], the assumption that individual authors have significantly different writing styles in terms of the syntax that is used to construct sentences has been reused. For example, the following sentence (extracted from a web blog): "My chair started squeaking a few days ago and it's driving me nuts." (S1) could also be formulated as "Since a few days my chair is squeaking - it's simply annoying." (S2), which is semantically equivalent but differs significantly according to the syntax, as can be seen in Figure 1. The main idea of this work is to quantify those differences by calculating grammar profiles and to use this information to decompose a collaboratively written document, i.e., to assign each paragraph of a document to an author.

The rest of this paper is organized as follows: Section 2 at first recapitulates the principle of pq-grams, which represent a core concept of the approach. Subsequently the algorithm is presented in detail, which is then evaluated in Section 3 by using different clustering algorithms and data sets. A comparison of clustering and classification approaches is discussed in Section 4, while Section 5 depicts related work. Finally, a conclusion and future work directions are given in Section 6.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
is generally addressed with information retrieval approaches, an- other possibility to significantly reduce the amount of data is to 2. ALGORITHM build clusters. Within each cluster, the data is similar according to In the following the concept of pq-grams is explained, which some predefined features. Thereby many approaches exist that pro- serves as the basic stylistic measure in this approach to distinguish pose algorithms to cluster plain text documents (e.g. [16], [22]) or between authors. Subsequently, the concrete steps performed by specific web documents (e.g. [33]) by utilizing various features. the algorithm are discussed in detail. Approaches which attempt to divide a single text document into distinguishable units like different topics, for example, are usu- 2.1 Preliminaries: pq-grams ally referred to as text segmentation approaches. Here, also many Similar to n-grams that represent subparts of given length n of features including statistical models, similarities between words or a string, pq-grams extract substructures of an ordered, labeled tree other semantic analyses are used. Moreover, text clusters are also [4]. The size of a pq-gram is determined by a stem (p) and a base used in recent plagiarism detection algorithms (e.g. [34]) which (q) like it is shown in Figure 2. Thereby p defines how much nodes are included vertically, and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with p = 2 and q = 3 starting from PP at the left side of tree (S2) shown in Figure 1 would be [PP-NP-DT-JJ-NNS] (the concrete words are omitted). Copyright c by the paper’s authors. Copying permitted only for The pq-gram index then consists of all possible pq-grams of private and academic purposes. a tree. In order to obtain all pq-grams, the base is shifted left In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- Workshop on Foundations of Databases (Grundlagen von Datenbanken), and right additionally: If then less than p nodes exist horizon- 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. tally, the corresponding place in the pq-gram is filled with *, in- 17 (S1) steps: S S CC S 1. At first the document is preprocessed by eliminating unnec- (and) essary whitespaces or non-parsable characters. For exam- NP VP NP VP ple, many data sets often are based on novels and articles of various authors, whereby frequently OCR text recognition is PRP NN VBD S PRP VBZ (My) (chair) (started) (it) ('s) VP used due to the lack of digital data. Additionally, such doc- VBG S uments contain problem sources like chapter numbers and VP (driving) titles or incorrectly parsed picture frames that result in non- alphanumeric characters. VBG ADVP NP NP (squeaking) NP RB PRP NNS 2. Subsequently, the document is partitioned into single para- (ago) (me) (nuts) graphs. For simplification reasons this is currently done by DT JJ NNS only detecting multiple line breaks. (a) (few) (days) 3. Each paragraph is then split into single sentences by utiliz- ing a sentence boundary detection algorithm implemented (S2) S within the OpenNLP framework1 . Then for each sentence a full grammar tree is calculated using the Stanford Parser S - S [19]. For example, Figure 1 depicts the grammar trees re- PP NP VP NP VP sulting from analyzing sentences (S1) and (S2), respectively. 
2.2 Clustering by Authors
The number of choices an author has to formulate a sentence in terms of grammar structure is rather high, and the assumption in this approach is that the concrete choice is made mostly intuitively and unconsciously. On that basis the grammar of authors is analyzed, which serves as input for common state-of-the-art clustering algorithms to build clusters of text documents or paragraphs. The decision of the clustering algorithms is thereby based on the frequencies of occurring pq-grams, i.e., on pq-gram profiles. In detail, given a text document, the algorithm consists of the following steps:

1. At first the document is preprocessed by eliminating unnecessary whitespace and non-parsable characters. For example, many data sets are based on novels and articles of various authors, where OCR text recognition is frequently used due to the lack of digital data. Additionally, such documents contain problem sources like chapter numbers and titles or incorrectly parsed picture frames that result in non-alphanumeric characters.

2. Subsequently, the document is partitioned into single paragraphs. For simplification reasons this is currently done by only detecting multiple line breaks.

3. Each paragraph is then split into single sentences by utilizing a sentence boundary detection algorithm implemented within the OpenNLP framework¹. Then for each sentence a full grammar tree is calculated using the Stanford Parser [19]. For example, Figure 1 depicts the grammar trees resulting from analyzing sentences (S1) and (S2), respectively. The labels of each tree correspond to a part-of-speech (POS) tag of the Penn Treebank set [23], where e.g. NP corresponds to a noun phrase, DT to a determiner or JJS to a superlative adjective. In order to examine only the building structure of sentences, as intended by this work, the concrete words, i.e., the leaves of the tree, are omitted.

4. Using the grammar trees of all sentences of the document, the pq-gram index is calculated. As shown in Section 2.1, all valid pq-grams of a sentence are extracted and stored in a pq-gram index. By combining the pq-gram indices of all sentences, a pq-gram profile is computed which contains a list of all pq-grams and their corresponding frequency of appearance in the text. Thereby the frequency is normalized by the total number of all appearing pq-grams. As an example, the five most frequently used pq-grams using p = 2 and q = 3 of a sample document are shown in Table 1. The profile is sorted in descending order by the normalized occurrence, and an additional rank value is introduced that simply defines a natural order which is used in the evaluation (see Section 3).

5. Finally, each paragraph profile is provided as input to the clustering algorithms, which are asked to build clusters based on the pq-grams contained. Concretely, three different feature sets have been evaluated: (1) the frequencies of occurrence of each pq-gram, (2) the rank of each pq-gram, and (3) a union of the two sets.

  pq-gram        Occurrence [%]   Rank
  NP-NN-*-*-*    2.68             1
  PP-IN-*-*-*    2.25             2
  NP-DT-*-*-*    1.99             3
  NP-NNP-*-*-*   1.44             4
  S-VP-*-*-VBD   1.08             5

  Table 1: Example of the Five Most Frequently Used pq-grams of a Sample Document.

¹ Apache OpenNLP, http://incubator.apache.org/opennlp, visited July 2014
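Steps 4 and 5 can be pictured with the short sketch below, which builds normalized pq-gram profiles per paragraph and clusters them. The paper performs the clustering with WEKA; here scikit-learn's KMeans is used purely as a stand-in, only the occurrence-frequency feature set (1) is shown, and the parser output is assumed to be already available as pq-gram lists per paragraph (e.g. produced by the pq_grams() sketch above). Names and parameters are illustrative, not the authors' implementation.

  # Normalized pq-gram profiles per paragraph, clustered with a stand-in
  # algorithm (scikit-learn KMeans instead of the WEKA clusterers).
  from collections import Counter
  from sklearn.feature_extraction import DictVectorizer
  from sklearn.cluster import KMeans

  def profile(pq_gram_list):
      # Frequency of each pq-gram, normalized by the total number of pq-grams.
      counts = Counter(pq_gram_list)
      total = sum(counts.values())
      return {gram: n / total for gram, n in counts.items()}

  def cluster_paragraphs(paragraph_pq_grams, n_authors):
      profiles = [profile(p) for p in paragraph_pq_grams]
      features = DictVectorizer(sparse=False).fit_transform(profiles)
      return KMeans(n_clusters=n_authors, n_init=10).fit_predict(features)

  # Toy example: two "paragraphs" with clearly different pq-gram distributions.
  paragraphs = [
      ["NP-NN-*-*-*", "NP-NN-*-*-*", "PP-IN-*-*-*"],
      ["S-VP-*-*-VBD", "S-VP-*-*-VBD", "NP-DT-*-*-*"],
  ]
  print(cluster_paragraphs(paragraphs, n_authors=2))  # e.g. [0 1] or [1 0]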
number of clusters is cascaded and automatically chosen) [5], X- • The PAN’12 competition corpus (PAN12): As a well-known, Means [26], Agglomerative Hierarchical Clustering [25], and Far- state-of-the-art corpus originally created for the use in au- thest First [9]. thorship identification, parts3 of the PAN2012 corpus [18] For the clustering algorithms K-Means, Hierarchical Clustering have been integrated. The corpus is composed of several and Farthest First the number of clusters has been predefined ac- fiction texts and split into several subtasks that cover small- cording to the respective test data. This means if the test document and common-length documents (1800-6060 words) as well has been collaborated by three authors, the number of clusters has as larger documents (up to 13000 words) and novel-length also been set to three. On the other hand, the algorithms Cascaded documents (up to 170,000 words). Finally, the test setused in K-Means and X-Means implicitly decide which amount of clusters this evaluation contains 14 documents (paragraphs) written is optimal. Therefore these algorithms have been limited only in by three authors that are distributed equally. ranges, i.e., the minimum and maximum number of clusters has been set to two and six, respectively. 3.2 Results The best results of the evaluation are presented in Table 2, where 3. EVALUATION the best performance for each clusterer over all data sets is shown in The utilization of pq-gram profiles as input features for mod- subtable (a), and the best configuration for each data set is shown ern clustering algorithms has been extensively evaluated using dif- in subtable (b), respectively. With an accuracy of 63.7% the K- ferent documents and data sets. As clustering and classification Means algorithm worked best by using p = 2, q = 3 and by uti- problems are closely related, the global aim was to experiment on lizing all available features. Interestingly, the X-Means algorithm the accuracy of automatic text clustering using solely the proposed also achieved good results considering the fact that in this case the grammar feature, and furthermore to compare it to those of current number of clusters has been assigned automatically by the algo- classification techniques. rithm. Finally, the hierarchical cluster performed worst gaining an accuracy of nearly 10% less than K-Means. 3.1 Test Data and Experimental Setup Regarding the best performances for each test data set, the re- In order to evaluate the idea, different documents and test data sults for the manually created data sets from novel literature are sets have been used, which are explained in more detail in the fol- generally poor. For example, the best result for the two-author doc- lowing. Thereby single documents have been created which con- ument Twain-Wells is only 59.6%, i.e., the accuracy is only slightly tain paragraphs written by different authors, as well as multiple better than the baseline percentage of 50%, which can be achieved documents, whereby each document is written by one author. In by randomly assigning paragraphs into two clusters.4 On the other the latter case, every document is treated as one (large) paragraph hand, the data sets reused from authorship attribution, namely the for simplification reasons. FED and the PAN12 data set, achieved very good results with an For the experiment, different parameter settings have been eval- accuracy of about 89% and 83%, respectively. 
3.2 Results
The best results of the evaluation are presented in Table 2, where the best performance of each clusterer over all data sets is shown in subtable (a), and the best configuration for each data set is shown in subtable (b), respectively. With an accuracy of 63.7% the K-Means algorithm worked best by using p = 2, q = 3 and by utilizing all available features. Interestingly, the X-Means algorithm also achieved good results considering the fact that in this case the number of clusters has been assigned automatically by the algorithm. Finally, the hierarchical clusterer performed worst, gaining an accuracy of nearly 10% less than K-Means.

Regarding the best performances for each test data set, the results for the manually created data sets from novel literature are generally poor. For example, the best result for the two-author document Twain-Wells is only 59.6%, i.e., the accuracy is only slightly better than the baseline percentage of 50%, which can be achieved by randomly assigning paragraphs into two clusters.⁴ On the other hand, the data sets reused from authorship attribution, namely the FED and the PAN12 data sets, achieved very good results with an accuracy of about 89% and 83%, respectively. Nevertheless, as the other data sets have been specifically created for the clustering evaluation, those results may be more expressive. Therefore a comparison between clustering and classification approaches is discussed in the following, showing that the latter achieve significantly better results on those data sets when using the same features.

  (a) Clustering Algorithms
  Method               p   q   Feature Set       Accuracy
  K-Means              3   2   All               63.7
  X-Means              2   4   Rank              61.7
  Farthest First       4   2   Occurrence-Rate   58.7
  Cascaded K-Means     2   2   Rank              55.3
  Hierarchical Clust.  4   3   Occurrence-Rate   54.7

  (b) Test Data Sets
  Data Set    Method        p   q   Feat. Set   Accuracy
  T-W         X-Means       3   2   All         59.6
  T-W-S       X-Means       3   4   All         49.0
  FED         Farth. First  4   3   Rank        89.4
  PAN12-A/B   K-Means       3   3   All         83.3

  Table 2: Best Evaluation Results for Each Clustering Algorithm and Test Data Set in Percent.

⁴ In this case X-Means dynamically created two clusters, but the result is still better than that of other algorithms using a fixed number of clusters.

4. COMPARISON OF CLUSTERING AND CLASSIFICATION APPROACHES
For the given data sets, any clustering problem can be rewritten as a classification problem, with the exception that the latter needs training data. Although a direct comparison should be treated with caution, it still gives an insight into how the two different approaches perform using the same data sets. Therefore an additional evaluation is shown in the following, which compares the performance of the clustering algorithms to the performance of the following classification algorithms: Naive Bayes classifier [17], Bayes Network using the K2 classifier [8], Large Linear Classification using LibLinear [12], Support Vector Machine using LIBSVM with nu-SVC classification [6], k-nearest-neighbors classifier (kNN) using k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compensate for the missing training data, a 10-fold cross-validation has been used for each classifier.
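A comparable classification experiment can be sketched as follows; again scikit-learn is used only as a stand-in for the WEKA classifiers named above (a linear SVM in place of LibLinear/LibSVM), operating on the same pq-gram feature vectors as before. The setup is illustrative and assumes that author labels are available for the paragraphs.

  # 10-fold cross-validated classification on the pq-gram features,
  # as a stand-in for the WEKA classifiers used in the paper.
  from sklearn.feature_extraction import DictVectorizer
  from sklearn.model_selection import cross_val_score
  from sklearn.naive_bayes import GaussianNB
  from sklearn.svm import LinearSVC

  def classify_accuracy(paragraph_profiles, author_labels, clf=None):
      # paragraph_profiles: list of {pq-gram: normalized frequency} dicts
      # author_labels: one ground-truth author label per paragraph
      X = DictVectorizer(sparse=False).fit_transform(paragraph_profiles)
      clf = clf or LinearSVC()
      return cross_val_score(clf, X, author_labels, cv=10).mean()

  # Usage (hypothetical data): compare a linear SVM with Naive Bayes.
  # acc_svm = classify_accuracy(profiles, labels)
  # acc_nb  = classify_accuracy(profiles, labels, clf=GaussianNB())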
Therefore an additional evalua- (a) Twain-Wells tion is shown in the following, which compares the performance of the clustering algorithms to the performance of the the following p 2 q 2 Algorithm K-Means Max 44.3 N-Bay 67.8 Bay-Net 70.8 LibLin 74.0 LibSVM 75.2 kNN 51.0 J48 73.3 classification algorithms: Naive Bayes classifier [17], Bayes Net- 2 3 X-Means 38.3 65.1 67.1 70.7 72.3 48.3 70.2 2 4 X-Means 45.6 63.1 68.1 70.5 71.8 49.0 69.3 work using the K2 classifier [8], Large Linear Classification using 3 2 X-Means 45.0 51.7 64.1 67.3 68.8 45.6 65.4 3 3 X-Means 47.0 57.7 64.8 67.3 68.5 47.0 65.9 LibLinear [12], Support vector machine using LIBSVM with nu- 3 4 X-Means 49.0 67.8 67.8 70.5 72.5 46.3 68.3 SVC classification [6], k-nearest-neighbors classifier (kNN) using 4 4 2 3 X-Means K-Means 36.2 35.6 61.1 53.0 67.1 63.8 69.1 67.6 69.5 70.0 50.3 47.0 65.1 66.6 k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compen- 4 4 X-Means 35.6 57.7 66.1 68.5 69.3 42.3 66.8 average improvement 18.7 24.8 27.7 29.0 5.6 26.0 sate the missing training data, a 10-fold cross-validation has been used for each classifier. (b) Twain-Wells-Shelley Table 3 shows the performance of each classifier compared to the p q Algorithm Max N-Bay Bay-Net LibLin LibSVM kNN J48 best clustering result using the same data and pq-setting. It can be 2 2 Farth. First 77.3 81.1 86.4 90.9 84.2 74.2 81.8 2 3 Farth. First 78.8 85.6 87.4 92.4 89.0 78.8 82.8 seen that the classifiers significantly outperform the clustering re- 2 4 X-Means 78.8 89.4 92.4 90.9 87.3 89.4 85.9 sults for the Twain-Wells and Twain-Wells-Shelley documents. The 3 3 2 3 K-Means K-Means 81.8 78.8 82.6 92.4 87.9 92.4 92.4 92.4 85.5 86.4 80.3 81.8 83.8 83.8 support vector machine framework (LibSVM) and the linear classi- 3 4 Farth. First 86.4 84.8 90.9 97.0 85.8 81.8 85.6 4 2 Farth. First 86.6 81.8 89.4 87.9 83.3 77.3 84.1 fier (LibLinear) performed best, reaching a maximum accuracy of 4 3 Farth. First 89.4 85.6 92.4 89.4 85.8 80.3 83.3 4 4 Farth. First 84.8 86.4 90.9 89.4 85.8 84.8 83.6 nearly 87% for the Twain-Wells document. Moreover, the average average improvement 3.0 7.5 8.9 3.4 -1.6 1.3 improvement is given in the bottom line, showing that most of the (c) Federalist Papers classifiers outperform the best clustering result by over 20% in av- erage. Solely the kNN algorithm achieves minor improvements as p q Algorithm Max N-Bay Bay-Net LibLin LibSVM kNN J48 it attributed the two-author document with a poor accuracy of about 2 2 2 3 K-Means K-Means 83.3 83.3 83.3 83.3 33.3 33.3 100.0 100.0 100.0 100.0 100.0 100.0 33.3 33.3 60% only. 2 4 K-Means 83.3 83.4 33.3 100.0 100.0 100.0 33.3 3 2 K-Means 83.3 75.0 33.3 91.7 91.7 100.0 33.3 A similar general improvement could be achieved on the three- 3 3 K-Means 83.3 100.0 33.3 100.0 91.7 100.0 33.3 author document Twain-Wells-Shelley as can be seen in subtable 3 4 4 2 Farth. First K-Means 75.0 83.3 66.7 91.7 33.3 33.3 100.0 91.7 100.0 75.0 91.7 91.7 33.3 33.3 (b). Again, LibSVM could achieve an accuracy of about 75%, 4 4 3 4 K-Means K-Means 83.3 83.3 75.0 75.0 33.3 33.3 100.0 100.0 75.0 83.4 91.7 83.4 33.3 33.3 whereas the best clustering configuration could only reach 49%. average improvement -0.9 -49.1 15.8 8.4 13.0 -49.1 Except for the kNN algorithm, all classifiers significantly outper- (d) PAN12-A/B form the best clustering results for every configuration. 
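To make the comparison procedure concrete, the following sketch feeds one and the same feature matrix to a clustering algorithm and to several classifiers evaluated with 10-fold cross-validation. It is an illustration only: scikit-learn models stand in for the WEKA implementations used in the paper, the Bayes network with the K2 classifier has no direct counterpart here and is omitted, the decision tree is CART rather than C4.5, and the feature matrix and author labels are random placeholders instead of real pq-gram profiles. Clustering "accuracy" is computed by mapping each cluster to its majority author, which is one common convention; the paper does not state which mapping it uses.

# Illustrative sketch only; see the assumptions stated above.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 50))        # placeholder: 100 paragraphs, 50 pq-gram features
y = rng.integers(0, 2, 100)      # placeholder: author label per paragraph

def clustering_accuracy(labels_true, labels_pred):
    """Map each cluster to its majority author and report the resulting accuracy."""
    correct = 0
    for cluster in set(labels_pred):
        members = labels_true[labels_pred == cluster]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels_true)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("clustering:", clustering_accuracy(y, clusters))

classifiers = {
    "Naive Bayes": GaussianNB(),
    "linear classifier (LibLinear-like)": LinearSVC(),
    "SVM (nu-SVC, LibSVM-like)": NuSVC(),
    "kNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "decision tree (CART, not C4.5)": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(name, round(scores.mean(), 3))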
Quite different comparison results have been obtained for the Table 3: Best Evaluation Results for each Clustering Algorithm Federalist Papers and PAN12 data sets, respectively. Here, the im- and Test Data Set in Percent. provements gained from the classifiers are only minor, and in some cases are even negative, i.e., the classification algorithms perform worse than the clustering algorithms. A general explanation is the to one document. The main idea is often to compute topically re- good performance of the clustering algorithms on these data sets, lated document clusters and to assist web search engines to be able especially by utilizing the Farthest First and K-Means algorithms. to provide better results to the user, whereby the algorithms pro- In case of the Federalist Papers data set shown in subtable (c), posed frequently are also patented (e.g. [2]). Regularly applied all algorithms except kNN could achieve at least some improve- concepts in the feature extraction phase are the term frequency tf , ment. Although the LibLinear classifier could reach an outstanding which measures how often a word in a document occurs, and the accuracy of 97%, the global improvement is below 10% for all clas- term frequency-inverse document frequency tf − idf , which mea- sifiers. Finally, subtable (d) shows the results for PAN12, where the sures the significance of a word compared to the whole document outcome is quite diverse as some classifiers could improve the clus- collection. An example of a classical approach using these tech- terers significantly, whereas others worsen the accuracy even more niques is published in [21]. drastically. A possible explanation might be the small data set (only The literature on cluster analysis within a single document to the subproblems A and B have been used), which may not be suited discriminate the authorships in a multi-author document like it is very well for a reliable evaluation of the clustering approaches. done in this paper is surprisingly sparse. On the other hand, many approaches exist to separate a document into paragraphs of differ- Summarizing, the comparison of the different algorithms reveal ent topics, which are generally called text segmentation problems. that in general classification algorithms perform better than cluster- In this domain, the algorithms often perform vocabulary analysis ing algorithms when provided with the same (pq-gram) feature set. in various forms like word stem repetitions [27] or word frequency Nevertheless, the results of the PAN12 experiment are very diverse models [29], whereby ”methods for finding the topic boundaries and indicate that there might be a problem with the data set itself, include sliding window, lexical chains, dynamic programming, ag- and that this comparison should be treated carefully. glomerative clustering and divisive clustering” [7]. Despite the given possibility to modify these techniques to also cluster by au- 5. RELATED WORK thors instead of topics, this is rarely done. In the following some of Most of the traditional document clustering approaches are based the existing methods are shortly summarized. on occurrences of words, i.e., inverted indices are built and used to Probably one of the first approaches that uses stylometry to au- group documents. Thereby a unit to be clustered conforms exactly tomatically detect boundaries of authors of collaboratively written 20 text is proposed in [13]. 
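For reference, the term weights mentioned above are usually defined as follows (a standard textbook formulation, not quoted from [21]): the term frequency tf(t, d) counts how often term t occurs in document d, and

    tf-idf(t, d) = tf(t, d) · log( N / df(t) )

where N is the number of documents in the collection and df(t) is the number of documents containing t.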
Thereby the main intention was not to ex- K-­‐Means   pose authors or to gain insight into the work distribution, but to pro- X-­‐Means   vide a methodology for collaborative authors to equalize their style Farthest  First   Cascaded  K-­‐Means   in order to achieve better readability. To extract the style of sepa- Hierarchical  Clusterer   rated paragraphs, common stylometric features such as word/sentence lengths, POS tag distributions or frequencies of POS classes at Naive  Bayes   BayesNet   sentence-initial and sentence-final positions are considered. An ex- LibLinear   tensive experiment revealed that styolmetric features can be used to LibSVM   find authorship boundaries, but that there has to be done additional kNN   J48   research in order to increase the accuracy and informativeness. 0   10   20   30   40   50   60   70   80   90   100   In [14] the authors also tried to divide a collaborative text into Accuracy  [%]   different single-author paragraphs. In contrast to the previously described handmade corpus, a large data set has been computation- ally created by using (well-written) articles of an internet forum. At Figure 3: Best Evaluation Results Over All Data Sets For All first, different neural networks have been utilized using several sty- Utilized Clustering and Classification Algorithms. lometric features. By using 90% of the data for training, the best network could achieve an F-score of 53% for multi-author docu- ments on the remaining 10% of test data. In a second experiment, Twain-­‐Wells   only letter-bigram frequencies are used as distinguishing features. Thereby an authorship boundary between paragraphs was marked Twain-­‐Wells-­‐Shelley   if the cosine distance exceeded a certain threshold. This method reached an F-score of only 42%, and it is suspected that letter- Best  Clusterer   FED   bigrams are not suitable for the (short) paragraphs used in the eval- Best  Classifier   uation. PAN12-­‐A/B   A two-stage process to cluster Hebrew Bible texts by authorship is proposed in [20]. Because a first attempt to represent chapters 0   20   40   60   80   100   only by bag-of-words led to negative results, the authors addition- Accuracy  [%]   ally incorporated sets of synonyms (which could be generated by comparing the original Hebrew texts with an English translation). With a modified cosine-measure comparing these sets for given Figure 4: Best Clustering and Classification Results For Each chapters, two core clusters are compiled by using the ncut algo- Data Set. rithm [10]. In the second step, the resulting clusters are used as training data for a support vector machine, which finally assigns every chapter to one of the two core clusters by using the simple linear classification algorithm LibLinear could reach nearly 88%, bag-of-words features tested earlier. Thereby it can be the case, outperforming K-Means by 25% over all data sets. that units originally assigned to one cluster are moved to the other Finally, the best classification and clustering results for each data one, depending on the prediction of the support vector machine. set are shown in Figure 4. Consequently the classifiers achieve With this two-stage approach the authors report a good accuracy of higher accuracies, whereby the PAN12 subsets could be classified about 80%, whereby it should be considered that the size of poten- 100% correctly. As can be seen, a major improvement can be tial authors has been fixed to two in the experiment. Nevertheless, gained for the novel literature documents. 
For example, the best the authors state that their approach could be extended for more classifier reached 87% on the Twain-Wells document, whereas the authors with less effort. best clustering approach achieved only 59%. As shown in this paper, paragraphs of documents can be split 6. CONCLUSION AND FUTURE WORK and clustered based on grammar features, but the accuracy is below In this paper, the automatic creation of paragraph clusters based that of classification algorithms. Although the two algorithm types on the grammar of authors has been evaluated. Different state-of- should not be compared directly as they are designed to manage the-art clustering algorithms have been utilized with different input different problems, the significant differences in accuracies indi- features and tested on different data sets. The best working algo- cate that classifiers can handle the grammar features better. Never- rithm K-Means could achieve an accuracy of about 63% over all theless future work should focus on evaluating the same features on test sets, whereby good individual results of up to 89% could be larger data sets, as clustering algorithms may produce better results reached for some configurations. On the contrary, the specifically with increasing amount of sample data. created documents incorporating two and three authors could only Another possible application could be the creation of whole doc- be clustered with a maximum accuracy of 59%. ument clusters, where documents with similar grammar are grouped A comparison between clustering and classification algorithms together. Despite the fact that such huge clusters are very difficult to using the same input features has been implemented. Disregarding evaluate - due to the lack of ground truth data - a navigation through the missing training data, it could be observed that classifiers gen- thousands of documents based on grammar may be interesting like erally produce higher accuracies with improvements of up to 29%. it has been done for music genres (e.g. [30]) or images (e.g. [11]). On the other hand, some classifiers perform worse on average than Moreover, grammar clusters may also be utilized for modern rec- clustering algorithms over individual data sets when using some pq- ommendation algorithms once they have been calculated for large gram configurations. Nevertheless, if the maximum accuracy for data sets. For example, by analyzing all freely available books from each algorithm is considered, all classifiers perform significantly libraries like Project Gutenberg, a system could recommend other better as can be seen in Figure 3. Here the best performances of all books with a similar style based on the users reading history. Also, utilized classification and clustering algorithms are illustrated. The an enhancement of current commercial recommender systems that 21 are used in large online stores like Amazon is conceivable. [18] P. Juola. An Overview of the Traditional Authorship Attribution Subtask. In CLEF (Online Working 7. REFERENCES Notes/Labs/Workshop), 2012. [1] D. Aha and D. Kibler. Instance-Based Learning Algorithms. [19] D. Klein and C. D. Manning. Accurate Unlexicalized Machine Learning, 6:37–66, 1991. Parsing. In Proceedings of the 41st Annual Meeting on [2] C. Apte, S. M. Weiss, and B. F. White. Lightweight Association for Computational Linguistics - Volume 1, ACL Document Clustering, Nov. 25 2003. US Patent 6,654,739. ’03, pages 423–430, Stroudsburg, PA, USA, 2003. [3] D. Arthur and S. Vassilvitskii. 
K-means++: The advantages [20] M. Koppel, N. Akiva, I. Dershowitz, and N. Dershowitz. of careful seeding. In Proceedings of the Eighteenth Annual Unsupervised Decomposition of a Document into Authorial ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, Components. In Proc. of the 49th Annual Meeting of the pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Association for Computational Linguistics: Human Industrial and Applied Mathematics. Language Technologies - Volume 1, HLT ’11, pages [4] N. Augsten, M. Böhlen, and J. Gamper. The pq-Gram 1356–1364, Stroudsburg, PA, USA, 2011. Distance between Ordered Labeled Trees. ACM Transactions [21] B. Larsen and C. Aone. Fast and Effective Text Mining Using on Database Systems (TODS), 2010. Linear-Time Document Clustering. In Proceedings of the 5th [5] T. Caliński and J. Harabasz. A Dendrite Method for Cluster ACM SIGKDD international conference on Knowledge Analysis. Communications in Statistics - Theory and discovery and data mining, pages 16–22. ACM, 1999. Methods, 3(1):1–27, 1974. [22] Y. Li, S. M. Chung, and J. D. Holt. Text Document [6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Clustering Based on Frequent Word Meaning Sequences. Vector Machines. ACM Transactions on Intelligent Systems Data & Knowledge Engineering, 64(1):381–404, 2008. and Technology (TIST), 2(3):27, 2011. [23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. [7] F. Y. Choi. Advances in Domain Independent Linear Text Building a large annotated corpus of English: The Penn Segmentation. In Proceedings of the 1st North American Treebank. Computational Linguistics, 19:313–330, June chapter of the Association for Computational Linguistics 1993. conference, pages 26–33. Association for Computational [24] F. Mosteller and D. Wallace. Inference and Disputed Linguistics, 2000. Authorship: The Federalist. Addison-Wesley, 1964. [8] G. F. Cooper and E. Herskovits. A Bayesian Method for the [25] F. Murtagh. A Survey of Recent Advances in Hierarchical Induction of Probabilistic Networks From Data. Machine Clustering Algorithms. The Computer Journal, learning, 9(4):309–347, 1992. 26(4):354–359, 1983. [9] S. Dasgupta. Performance Guarantees for Hierarchical [26] D. Pelleg, A. W. Moore, et al. X-means: Extending K-means Clustering. In Computational Learning Theory, pages with Efficient Estimation of the Number of Clusters. In 351–363. Springer, 2002. ICML, pages 727–734, 2000. [10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: [27] J. M. Ponte and W. B. Croft. Text Segmentation by Topic. In Spectral Clustering and Normalized Cuts. In Proceedings of Research and Advanced Technology for Digital Libraries, the tenth ACM SIGKDD international conference on pages 113–125. Springer, 1997. Knowledge discovery and data mining, pages 551–556. [28] J. R. Quinlan. C4.5: Programs for Machine Learning, ACM, 2004. volume 1. Morgan Kaufmann, 1993. [11] A. Faktor and M. Irani. “Clustering by Composition” - [29] J. C. Reynar. Statistical Models for Topic Segmentation. In Unsupervised Discovery of Image Categories. In Computer Proc. of the 37th annual meeting of the Association for Vision–ECCV 2012, pages 474–487. Springer, 2012. Computational Linguistics on Computational Linguistics, [12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. pages 357–364, 1999. Lin. LIBLINEAR: A Library for Large Linear Classification. [30] N. Scaringella, G. Zoia, and D. Mlynek. 
Automatic Genre The Journal of Machine Learning Research, 9:1871–1874, Classification of Music Content: a Survey. Signal Processing 2008. Magazine, IEEE, 23(2):133–141, 2006. [13] A. Glover and G. Hirst. Detecting Stylistic Inconsistencies in [31] M. Tschuggnall and G. Specht. Using Grammar-Profiles to Collaborative Writing. In The New Writing Environment, Intrinsically Expose Plagiarism in Text Documents. In Proc. pages 147–168. Springer, 1996. of the 18th Conf. of Natural Language Processing and [14] N. Graham, G. Hirst, and B. Marthi. Segmenting Documents Information Systems (NLDB), pages 297–302, 2013. by Stylistic Character. Natural Language Engineering, [32] M. Tschuggnall and G. Specht. Enhancing Authorship 11(04):397–415, 2005. Attribution By Utilizing Syntax Tree Profiles. In Proc. of the [15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, 14th Conf. of the European Chapter of the Assoc. for and I. H. Witten. The WEKA Data Mining Software: an Computational Ling. (EACL), pages 195–199, 2014. Update. ACM SIGKDD explorations newsletter, [33] O. Zamir and O. Etzioni. Web Document Clustering: A 11(1):10–18, 2009. Feasibility Demonstration. In Proc. of the 21st annual [16] A. Hotho, S. Staab, and G. Stumme. Ontologies Improve international ACM conference on Research and development Text Document Clustering. In Data Mining, 2003. ICDM in information retrieval (SIGIR), pages 46–54. ACM, 1998. 2003. Third IEEE International Conference on, pages [34] D. Zou, W.-J. Long, and Z. Ling. A Cluster-Based 541–544. IEEE, 2003. Plagiarism Detection Method. In Notebook Papers of CLEF [17] G. H. John and P. Langley. Estimating Continuous 2010 LABs and Workshops, 22-23 September, 2010. Distributions in Bayesian Classifiers. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995. 22 Proaktive modellbasierte Performance-Analyse und -Vorhersage von Datenbankanwendungen Christoph Koch Friedrich-Schiller-Universität Jena Lehrstuhl für Datenbanken und DATEV eG Informationssysteme Abteilung Datenbanken Ernst-Abbe-Platz 2 Paumgartnerstr. 6 - 14 07743 Jena 90429 Nürnberg Christoph.Koch@uni-jena.de Christoph.Koch@datev.de KURZFASSUNG 1. EINLEITUNG Moderne (Datenbank-)Anwendungen sehen sich in der heutigen Zur Erfüllung komplexerer Anforderungen und maximalen Zeit mit immer höheren Anforderungen hinsichtlich Flexibilität, Benutzerkomforts ist gute Performance eine Grundvoraussetzung Funktionalität oder Verfügbarkeit konfrontiert. Nicht zuletzt für für moderne Datenbankanwendungen. Neben Anwendungs- deren Backend – ein meist relationales Datenbankmanagement- Design und Infrastrukturkomponenten wie Netzwerk oder system – entsteht dadurch eine kontinuierlich steigende Kom- Anwendungs- beziehungsweise Web-Server wird sie maßgeblich plexität und Workload, die es frühestmöglich proaktiv zu er- durch die Performance ihres Datenbank-Backends – wir beschrän- kennen, einzuschätzen und effizient zu bewältigen gilt. Die dazu ken uns hier ausschließlich auf relationale Datenbankmanage- nötigen Anwendungs- und Datenbankspezialisten sind jedoch mentsysteme (DBMS) – bestimmt [1]. Dabei ist die Datenbank- aufgrund immer engerer Projektpläne, kürzerer Release-Zyklen Performance einer Anwendung selbst ebenfalls durch zahlreiche und weiter wachsender Systemlandschaften stark ausgelastet, Faktoren beeinflusst. 
Während Hardware- und systemseitige sodass für regelmäßige proaktive Expertenanalysen hinsichtlich Eigenschaften oftmals durch bestehende Infrastrukturen vor- der Datenbank-Performance kaum Kapazität vorhanden ist. gegeben sind, können speziell das Datenbank-Design sowie die Zur Auflösung dieses Dilemmas stellt dieser Beitrag ein anwendungsseitig implementierten Zugriffe mittels SQL weit- Verfahren vor, mit dessen Hilfe frühzeitig auf Grundlage der gehend frei gestaltet werden. Hinzu kommt als Einflussfaktor Datenmodellierung und synthetischer Datenbankstatistiken Per- noch die Beschaffenheit der zu speichernden/gespeicherten Daten, formance-Analysen und -Vorhersagen für Anwendungen mit die sich in Menge und Verteilung ebenfalls stark auf die relationalem Datenbank-Backend durchgeführt und deren Performance auswirkt. Ergebnisse auf leicht zugängliche Weise visualisiert werden können. Das Datenbank-Design entwickelt sich über unterschiedlich abstrakte, aufeinander aufbauende Modellstrukturen vom konzep- tionellen hin zum physischen Datenmodell. Bereits bei der Kategorien und Themenbeschreibung Entwicklung dieser Modelle können „Designfehler“ wie beispiels- Data Models and Database Design, Database Performance weise fehlende oder „übertriebene“ Normalisierungen gravierende Auswirkungen auf die späteren Antwortzeiten des Datenbank- Allgemeine Bestimmungen systems haben. Der Grad an Normalisierung selbst ist jedoch nur Performance, Design als vager Anhaltspunkt für die Performance von Datenbank- systemen anzusehen, der sich ab einem gewissen Maß auch negativ auswirken kann. Eine einfache Metrik zur Beurteilung der Schlüsselwörter Qualität des Datenbank-Designs bezüglich der zu erwartenden Performance, Proaktivität, Statistiken, relationale Datenbanken, Performance (in Abhängigkeit anderer Einflussfaktoren, wie etwa Modellierung, UML, Anwendungsentwicklung der Workload) existiert nach vorhandenem Kenntnisstand nicht. Etwas abweichend dazu verhält es sich mit dem Einfluss der Workload – repräsentiert als Menge von SQL-Statements und der Häufigkeit ihrer Ausführung, die von der Anwendung an das Datenbanksystem zum Zugriff auf dort gespeicherte Daten abgesetzt wird. Moderne DBMS besitzen einen kostenbasierten Copyright © by the paper’s authors. Copying permitted only Optimierer zur Optimierung eingehender Statements. Dieser for private and academic purposes. berechnet mögliche Ausführungspläne und wählt unter Zu- In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26 GI- hilfenahme von gesammelten Objekt-Statistiken den günstigsten Workshop on Foundations of Databases (Grundlagen von Ausführungsplan zur Abarbeitung eines SQL-Statements aus. Datenbanken), Mittels DBMS-internen Mechanismen – im Folgenden als 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. 23 EXPLAIN-Mechanismen bezeichnet – besteht die Möglichkeit, netzwerks. Ein Überblick dazu findet sich in [3]. Demnach zeigt noch vor der eigentlichen Ausführung von Statements den vom sich für all diese Konzepte ein eher wissenschaftlicher Fokus und Optimierer bestimmten optimalen Ausführungsplan ermitteln und eine damit einhergehende weitgehend unerprobte Übertragbarkeit ausgeben zu lassen. Zusätzlich umfasst das EXPLAIN-Ergebnis auf die Praxis. 
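Zur Veranschaulichung des oben angesprochenen EXPLAIN-Mechanismus zeigt die folgende Skizze, wie sich ein Ausführungsplan abfragen lässt, ohne das Statement tatsächlich auszuführen. Sie ist eine rein illustrative Annahme: SQLite mit EXPLAIN QUERY PLAN dient nur als lauffähiger Stellvertreter; Systeme wie DB2 oder Oracle liefern über ihre jeweiligen EXPLAIN-Schnittstellen zusätzlich die im Text genannten Kostenschätzungen. Tabellen- und Spaltennamen sind frei erfunden.

# Illustrative Skizze: SQLite als Stellvertreter für herstellerspezifische
# EXPLAIN-Mechanismen; Schema und Daten frei erfunden.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kunde (id INTEGER PRIMARY KEY, plz TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_kunde_plz ON kunde(plz)")

# Ausführungsplan ermitteln, bevor das Statement ausgeführt wird
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM kunde WHERE plz = ?", ("07743",)
)
for row in plan:
    print(row)   # liefert u.a. eine textuelle Beschreibung des Zugriffs (z.B. Indexnutzung)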
So fehlen Studien zur Integration in praxisnahe eine Abschätzung der zur Abarbeitung des Ausführungsplans (Entwicklungs-)Prozesse, zur Benutzerfreundlichkeit sowie zum erwarteten Zugriffskosten bezüglich der CPU-und I/O-Zeit – Kosten-Nutzen-Verhältnis der notwendigen Maßnahmen. Ein fortan als Kosten bezeichnet. Anhand dieser Informationen damit einhergehendes Defizit ist zusätzlich der mangelnde Tool- können bereits frühzeitig in Hinblick auf die Datenbank- support. Das in Kapitel 4 vorgestellte Konzept verfolgt diesbe- Performance (häufige) teure Zugriffe erkannt und gegebenenfalls züglich einen davon abweichenden Ansatz. Es baut direkt auf optimiert werden. Voraussetzung für dieses Vorgehen ist aller- etablierten modellbasierten Praxisabläufen bei der Entwicklung dings, dass dem DBMS zur Berechnung der Ausführungspläne von Datenbankanwendungen auf (vgl. Kapitel 3). Insbesondere repräsentative Datenbank-Statistiken vorliegen, was insbeson- durch die Verwendung von standardisierten UML-Erweiterungs- dere für neue Datenbankanwendungen nicht der Fall ist. mechanismen integriert es sich auch Tool-seitig nahtlos in bestehende UML-unterstützende Infrastrukturen. Auf der anderen Seite sehen sich sowohl Anwendungsentwickler- beziehungsweise -designerteams als auch Datenbankspezialisten Die Methodik der synthetischen Statistiken – also dem künstli- mit immer komplexeren Anforderungen und Aufgaben konfron- chen Erstellen sowie Manipulieren von Datenbank-Statistiken – tiert. Kapazitäten für umfangreiche Performance-Analysen oder ist neben dem in Kapitel 4 vorgestellten Ansatz wesentlicher auch nur die Aneignung des dafür nötigen Wissens sind oft nicht Bestandteil von [4]. Sie wird zum einen verwendet, um Statistiken gegeben. Nicht zuletzt deshalb geraten proaktive Performance- aus Produktionsumgebungen in eine Testumgebung zu trans- Analysen verglichen mit beispielsweise funktionalen Tests ver- ferieren. Zum anderen sieht der Ansatz aber auch die gezielte mehrt aus dem Fokus. manuelle Veränderung der Statistiken vor, um mögliche dadurch entstehende Änderungen in den Ausführungsplänen und den zu Das im vorliegenden Beitrag vorgestellte modellbasierte Konzept deren Abarbeitung benötigten Kosten mithilfe anschließender setzt an diesen beiden Problemen an und stellt Mechanismen vor, EXPLAIN-Analysen feststellen zu können. Dies kann beispiels- um auf einfache Art und Weise eine repräsentative proaktive weise bezogen auf Statistiken zur Datenmenge dafür genutzt Analyse der Datenbank-Performance zu ermöglichen. Nachdem in werden, um Zugriffe auf eine (noch) kleine Tabelle mit wenigen Kapitel 2 eine Abgrenzung zu alternativen/verwandten Ansätzen Datensätzen bereits so zu simulieren, als ob diese eine enorme gegeben wird, rückt Kapitel 3 den Entwicklungsprozess einer Menge an Daten umfasst. Weitere Einbettungen in den Entwick- Datenbank-Anwendung in den Fokus. Kapitel 4 beschäftigt sich lungsprozess von Datenbankanwendungen sieht [4] gegenüber mit dem entwickelten proaktiven Ansatz und stellt wesentliche dem hier vorgestellten Ansatz allerdings nicht vor. Schritte/Komponenten vor. Abschließend fasst Kapitel 5 den Ein weiterer Ansatzpunkt zur Performance-Analyse und -Optimie- Beitrag zusammen. rung existiert im Konzept des autonomen Datenbank-Tunings [5][6],[7] – also dem fortlaufenden Optimieren des physischen 2. VERWANDTE ARBEITEN Designs von bereits bestehenden Datenbanken durch das DBMS Das Ziel des im Beitrag vorgestellten proaktiven Ansatzes zur selbst. 
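Die in [4] beschriebene gezielte Veränderung von Statistiken lässt sich dem Prinzip nach wie folgt skizzieren; Struktur und Zahlen sind frei gewählte Annahmen, das tatsächliche Schreiben in die DBMS-internen Statistiktabellen ist nur angedeutet.

# Rein illustrative Skizze zum Prinzip der Statistik-Manipulation nach [4]:
# Die Statistiken einer kleinen Testtabelle werden so überschrieben, als läge
# bereits der produktive Datenbestand vor; ein anschließendes EXPLAIN liefert
# dann Pläne und Kosten für die simulierte Datenmenge.
test_statistik = {"zeilen": 1_000, "seiten": 20}

def simuliere_datenmenge(statistik, faktor):
    """Skaliert Zeilen- und Seitenzahl, um eine große Tabelle zu simulieren."""
    return {name: wert * faktor for name, wert in statistik.items()}

print(simuliere_datenmenge(test_statistik, faktor=10_000))
# In einem realen DBMS würden diese Werte in die manipulierbaren
# Statistiktabellen des Katalogs geschrieben (vgl. [12]).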
Ein autonomes System erkennt anhand von erlerntem Performance-Analyse und -Vorhersage von Datenbankanwendun- Wissen potentielle Probleme und leitet passende Optimierungs- gen ist die frühzeitige Erkennung von potentiellen Perfor- maßnahmen ein, bevor sich daraus negative Auswirkungen mance-Problemen auf Basis einer möglichst effizienten, leicht ergeben. Dazu zählt beispielsweise die autonome Durchführung verständlichen Methodik. Dies verfolgt auch der Ansatz von [2], einer Reorganisierung von Daten, um fortwährend steigenden dessen Grundprinzip – Informationen über Daten und Datenzu- Zugriffszeiten entgegenzuwirken. Ähnlich können auch die griffe, die aus der Anforderungsanalyse einer Anwendung bekannt mittlerweile je System vielseitig vorhandenen Tuning-Advisor wie sind, zur frühzeitigen Optimierung zu nutzen – sich auch im beispielsweise [8] und [9] angesehen werden, die zwar nicht auto- vorliegenden Beitrag wiederfindet. Dabei gelangt [2] durch eine matisch optimierend ins System eingreifen, dem Administrator eigene, dem Datenbank-Optimierer nachempfundene Logik und aber Empfehlungen zu sinnvoll durchzuführenden Aktionen dem verwendeten Modell des offenen Warteschlangennetzwerks geben. Sowohl das autonome Tuning als auch die Tuning-Advisor frühzeitig zu Kostenabschätzungen bezüglich der Datenbank- sind nicht als Alternative zu dem im vorliegenden Beitrag Performance. Das in Kapitel 4 des vorliegenden Beitrags vorge- vorgestellten Ansatz einzuordnen. Vielmehr können sich diese stellte Konzept nutzt dagegen synthetisch erzeugte Statistiken und Konzepte ergänzen, indem die Anwendungsentwicklung auf Basis datenbankinterne EXPLAIN-Mechanismen, um eine kostenmäßi- des in Kapitel 4 vorgestellten Konzepts erfolgt und für die spätere ge Performance-Abschätzung zu erhalten. Damit berücksichtigt es Anwendungsadministration/ -evolution verschiedene Tuning- stets sowohl aktuelle als auch zukünftige Spezifika einzelner Advisor und die Mechanismen des autonomen Tunings zum Ein- Datenbank-Optimierer und bleibt entgegen [2] von deren interner satz kommen. Berechnungslogik unabhängig. Ein weiterer Unterschied zwischen beiden Ansätzen besteht in der Präsentation der Analyse- 3. ENTWICKLUNGSPROZESS VON ergebnisse. Während sich [2] auf tabellarische Darstellungen beschränkt, nutzt das im Beitrag vorstellte Konzept eine auf der DATENBANKANWENDUNGEN Grundlage der Unified Modeling Language (UML) visualisierte Der Entwicklungsprozess von Anwendungen lässt sich anhand Darstellungsform. des System Development Lifecycle (SDLC) beschreiben und in verschiedene Phasen von der Analyse der Anforderungen bis hin Ähnlich wie [2] basieren auch weitere Ansätze zur Performance- zum Betrieb/zur Wartung der fertigen Software gliedern [1]. Analyse und -Evaluation auf dem Modell des Warteschlangen- 24 Project Manager Analyse Analyse Business Analyst Datenbank Software Datenbank Daten- Reports Designer/ Designer/ Detail Design Design modelle Prozesse Architekt Architekt Implementierung Program- Implementierung mierer und Laden Erstellen Prototyping Laden Test und Tuning Test und Debugging Tester Auswertung Auswertung Datenbank System- Administrator Betrieb Administrator Datenbank- Wartung der Wartung Anwendung Abbildung 1: Phasen und Akteure im Database und Software Development Lifecycle (DBLC und SDLC) Zusätzlich zur reinen Anwendungsentwicklung sind weitere der Entwicklungsprozess von Datenbankanwendungen auf die in Abläufe zur Planung und Bereitstellung einer geeigneten Infra- Abbildung 2 visualisierten Aufgaben. 
Anhand der analysierten struktur nötig. Für Datenbankanwendungen wäre das unter ande- Anforderungen wird im Datenbank-Design ein konzeptionelles rem der Entwicklungsprozess der Datenbank, welcher sich nach Datenmodell entwickelt, das anschließend hin zum physischen [1] ebenfalls durch ein dem SDLC ähnliches Modell – dem Data- Datenmodell verfeinert wird. Da sich der Beitrag auf die in der base Lifecycle (DBLC) – formalisieren lässt. Beide Entwicklungs- Praxis vorherrschenden relationalen DBMS beschränkt, wird auf prozesse verlaufen zeitlich parallel und werden insbesondere in das in der Theorie gebräuchliche Zwischenprodukt des logischen größeren Unternehmen/Projekten durch verschiedene Akteure Datenmodells (relationale Abbildung) verzichtet. realisiert. Auf Grundlage von [1] liefert Abbildung 1 eine Über- sicht dazu. Sie visualisiert parallel ablaufende Entwicklungspha- Nachdem die Design-Phase abgeschlossen ist, beginnt die sen und eine Auswahl an zuständigen Akteuren, deren konkrete Implementierung. Datenbankseitig wird dabei das physische Zusammensetzung/Aufgabenverteilung aber stark abhängig von Datenmodell mittels Data Definition Language (DDL) in ein der Projektgröße und dem Projektteam ist. Wichtig sind hier be- Datenbankschema innerhalb eines installierten und geeignet sonders zwei Erkenntnisse. Zum einen finden ähnliche Entwick- konfigurierten DBMS umgesetzt und möglicherweise vorhandene lungsprozesse bei Anwendung und Datenbank parallel statt – in Testdaten geladen. Anwendungsseitig erfolgt parallel dazu die etwa das Anwendungsdesign und das Datenbankdesign. Zum Entwicklung von SQL-Statements zum Zugriff auf die Datenbank anderen können sehr viele Akteure am gesamten Entwicklungs- sowie die Implementierung der Anwendung selbst. Nach prozess beteiligt sein, sodass Designer, Programmierer, Tester und Fertigstellung einzelner Module finden mithilfe des Entwick- Administratoren in der Regel disjunkte Personenkreise bilden. lungs- und Qualitätssicherungssystems kontinuierliche Tests statt, die sich allerdings anfangs auf die Prüfung funktionaler Analyse Konzeptionelles Korrektheit beschränken. Performance-Untersuchungen, insbe- Datenmodell Physisches sondere bezogen auf die Datenbankzugriffe, erfolgen in der Regel Datenmodell erst gezielt zum Abschluss der Implementierungsphase mittels Design aufwändig vorzubereitender und im Qualitätssicherungssystem durchzuführender Lasttests. Impl. SQL Die Folgen aus diesem Vorgehen für die Erkennung und Behand- Test Entwicklungs- SQL- lung von Performance-Problemen sind mitunter gravierend. Eng- system Statements pässe werden erst spät (im Betrieb) bemerkt und sind aufgrund Qualitäts- Betrieb sicherungs- des fortgeschrittenen Entwicklungsprozesses nur mit hohem Produktions- system Aufwand zu korrigieren. Basieren sie gar auf unvorteilhaften Wartung system Design-Entscheidungen beispielsweise bezogen auf die Daten- modellierung, ist eine nachträgliche Korrektur aufgrund zahlrei- Abbildung 2: Performance-relevante Entwicklungsschritte cher Abhängigkeiten (Anwendungslogik, SQL-Statements, Test- datenbestände, etc.), getrennten Zuständigkeiten und in der Regel Aus dem Blickwinkel der Datenbank-Performance und der darauf engen Projektzeitplänen nahezu ausgeschlossen. Erfahrungen aus einwirkenden bereits genannten Einflussfaktoren reduziert sich dem Arbeitsumfeld des Autors haben dies wiederholt bestätigt. 25 Performance Indikatoren Abbildung und Statistikerzeugung Konzeptionelles 1. 2. Datenmodell Physisches Kosten Datenmodell EXPLAIN EP2 EP1 3. 4. 
SQL Entwicklungs- Testsystem Performance-Modell system SQL- Qualitäts- Statements Produktions- sicherungs-system system Abbildung 3: Ansatz zur proaktiven modellbasierten Performance-Analyse und -Vorhersage bei Anwendungsweiterentwicklungen weitgehend vorliegen, exis- 4. PROAKTIVE MODELLBASIERTE tieren für neu zu entwickelnde Anwendungen im Normalfall keine PERFORMANCE-ANALYSE repräsentativen Datenbestände. Somit fehlen auch geeignete Alternativ zur Performance-Analyse mittels Lasttests (vgl. Kapitel Datenbankstatistiken zur Grundlage für die EXPLAIN-Auswer- 3) bieten sich zur Kontrolle der SQL-Performance die eingangs tungen. Die Folge sind Ausführungspläne und Kostenabschätzun- erwähnten EXPLAIN-Mechanismen an. Mit deren Hilfe lassen gen, die mit denen eines späteren produktiven Einsatzes der State- sich bei vorliegendem physischen Datenbank-Design (inklusive ments oftmals nur wenig gemeinsam haben und für eine proaktive Indexe, etc.) bereits in frühen Abschnitten der Implementierungs- Performance-Analyse somit (nahezu) unverwertbar sind. phase Auswertungen zu Ausführungsplänen und geschätzten Der im folgenden Kapitel vorgestellte proaktive modellbasierte Kosten für entwickelte SQL-Statements durchführen. Auf diese Ansatz zur Performance-Analyse und -Vorhersage greift beide Weise gewonnene Erkenntnisse können vom Designer/Program- Probleme auf: die fehlende repräsentative Datenbasis für Daten- mierer direkt genutzt werden, um Optimierungen in Hinblick auf bankstatistiken und die mangelnde Expertise zur Ausführungs- die quasi grade entworfenen/implementierten SQL-Statements planbewertung durch Designer/Programmierer. Dabei sieht dieser durchzuführen. Durch die gegebene zeitliche Nähe zum Anwen- Ansatz zur Bereitstellung geeigneter Datenbankstatistiken ein dungs- und Datenbank-Design sind auch Performance-Optimie- synthetisches Erzeugen anhand von Performance-Indikatoren vor. rungen auf Basis von Datenmodellanpassungen (Normalisie- Das Problem der mangelnden Expertise wird durch eine einfache rung/Denormalisierung) ohne größeren Aufwand möglich. modellbasierte Darstellung von gewonnenen EXPLAIN-Ergeb- Das beschriebene Vorgehen hat zwar den Vorteil, dass mögliche nissen adressiert. Wie diese gestaltet ist und mit den Performance- Performance-Probleme schon von den Akteuren (Designer/Pro- Indikatoren zusammenwirkt verdeutlichen die weiteren Ausfüh- grammierer) erkannt werden können, die diese durch Design- rungen des Kapitels anhand Abbildung 3. Änderungen am effektivsten zu lösen wissen. Demgegenüber erfordern die EXPLAIN-Analysen und das Verständnis der Aus- 4.1 Performance-Indikatoren im Datenmodell führungspläne einen Grad an Expertise, den Designer/Program- Als Performance-Indikatoren bezeichnet die vorliegende Arbeit mierer in der Regel nicht besitzen. Ein Datenbank Administrator ausgewählte Metadaten zu Entitäten und deren Attributen (DBA), der über diese verfügt, ist wiederum von den fachlichen (beziehungsweise zu Tabellen und deren Spalten), die Aufschluss Anforderungen zu distanziert, sodass er zwar mögliche Perfor- über die erwarteten realen Datenbestände geben und in Zusam- mance-Ausreißer erkennen, nicht aber fachlich bewerten kann. menhang mit dem Datenbank-Design und der Infrastruktur erste Führt eine Anwendung beispielsweise einmal monatlich eine sehr Rückschlüsse auf die zukünftige Datenbank-Performance erlau- komplexe Auswertung mithilfe eines entsprechend Laufzeit- ben. 
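Wie sich solche Indikatoren erfassen und zu synthetischen Statistikwerten verdichten lassen, deutet die folgende Skizze an. Sie ist eine rein illustrative Annahme: Eine einfache Python-Datenstruktur steht stellvertretend für die im Beitrag vorgesehene Erfassung per UML-Profil, und die Seitenabschätzung ist eine generische Näherung, nicht die im Text erwähnten herstellerspezifischen Abschätzungsvorschriften.

# Skizze: Performance-Indikatoren als Datenstruktur und daraus abgeleitete
# synthetische Statistikwerte; Namen und Formeln sind illustrative Annahmen.
import math
from dataclasses import dataclass, field

@dataclass
class SpaltenIndikator:
    name: str
    kardinalitaet: int      # erwartete Anzahl unterschiedlicher Werte
    breite_bytes: int

@dataclass
class EntitaetsIndikator:
    name: str
    erwartete_zeilen: int   # entspricht z.B. dem Merkmal "cardinality" des Stereotyps
    spalten: list = field(default_factory=list)

def synthetische_statistiken(entitaet, seitengroesse=8192):
    zeilenbreite = sum(s.breite_bytes for s in entitaet.spalten)
    return {
        "tabelle": entitaet.name,
        "zeilen": entitaet.erwartete_zeilen,
        "seiten": math.ceil(entitaet.erwartete_zeilen * zeilenbreite / seitengroesse),
        "spalten_kardinalitaeten": {s.name: s.kardinalitaet for s in entitaet.spalten},
    }

kunde = EntitaetsIndikator("KUNDE", erwartete_zeilen=5_000_000, spalten=[
    SpaltenIndikator("ID", 5_000_000, 8),
    SpaltenIndikator("PLZ", 8_000, 6),
])
print(synthetische_statistiken(kunde))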
Dazu zählen Informationen zu den erwarteten Datenmengen intensiven SQL-Statements durch, dann würde dem DBA diese wie in etwa die erwartete Anzahl an Zeilen pro Tabelle und Abfrage bei EXPLAIN-Analysen als kritisch erscheinen. Denn er Kennzahlen zur Datenverteilung – beispielsweise in Form von weiß weder, dass damit ein fachlich aufwändiger Prozess Wertebereichsangaben, Einzelwertwahrscheinlichkeiten oder der durchgeführt wird, noch dass es sich dabei um eine einmalig pro Kardinalität pro Spalte. Viele dieser Informationen sind Teil des Monat auszuführende Abfrage handelt. Um sich als DBA in einer Ergebnisses der Anforderungsanalyse und somit frühzeitig im Infrastruktur von nicht selten mehr als 100 unterschiedlichen SDLC bekannt und vom Business Analyst erfasst worden. Dabei Anwendungen über die fachlichen Anforderungen und speziellen reicht die Dokumentation von rein textuellen Beschreibungen bis Prozesse jeder einzelnen im Detail zu informieren beziehungs- hin zu tief strukturierten Darstellungen. Eine einheitlich stan- weise um sich als Designer/Programmierer das nötige Knowhow dardisierte Form zur Erfassung von Performance-Indikatoren im zur Ausführungsplanbewertung aufzubauen, ist personelle DBLC existiert jedoch bislang nicht, wodurch die Metadaten Kapazität vonnöten, die in der Regel nicht verfügbar ist. kaum bis gar nicht in den weiteren Entwicklungsprozess ein- fließen. Ein anderes Problem, dass sich in Zusammenhang mit frühzei- tigen EXPLAIN-Analysen zeigt, begründet sich in dem dritten In der Praxis basiert die Datenmodellierung ähnlich wie weite zuvor genannten Performance-Faktor: den Daten. Während diese Teile der Anwendungsmodellierung auf der Sprache UML. Dabei 26 wurde diese ursprünglich nicht zur Abbildung von Daten- der Designer/Programmierer beim Modellieren oder dem Ent- strukturen im Sinn einer Entity-Relationship-Modellierung kon- wickeln von SQL-Statements auf relationale Weise. Die im vorlie- zipiert, sodass die Verbindung beider Welten – und damit die genden Ansatz als Performance-Modell bezeichnete vereinfachte Modellierung von Anwendung und Datenstrukturen mithilfe einer Präsentation von Ausführungsplänen versucht, diese Diskrepanz gemeinsamen Sprache in einem gemeinsamen Tool – erst durch aufzulösen. Ansätze wie [10] oder auch den Entwurf zum IMM Standard der OMG [11] geschaffen wurde. Die Voraussetzung dafür bildet Das Performance-Modell basiert auf dem physischen Datenmo- jeweils die UML-Profil-Spezifikation, die es ermöglicht, beste- dell und damit auf einer dem Designer/Programmierer bekannten hende UML-Objekte über Neu-Stereotypisierungen zu erweitern. Darstellungsform. Zusätzlich umfasst es die für diesen Personen- kreis wesentlichen Informationen aus den EXPLAIN-Ergebnissen. Um die zuvor genannten Performance-Indikatoren für den weite- Dazu zählen die vom DBMS abgeschätzten Kosten für die Aus- ren Entwicklungsprozess nutzbar zu machen und sie innerhalb führung des gesamten Statements sowie wichtiger Operatoren wie bestehender Infrastrukturen/Tool-Landschaften standardisiert zu Tabellen- beziehungsweise Indexzugriffe oder Tabellenverknü- erfassen, kann ebenfalls der UML-Profil-Mechanismus genutzt pfungen mittels Join – jeweils skaliert um die erwartete Aus- werden. So ließe sich beispielsweise mithilfe eines geeigneten führungshäufigkeit des Statements. Weitere Detailinformationen Profils wie in Abbildung 3 in 1. 
schematisch angedeutet aus einem innerhalb der Ausführungspläne wie beispielsweise die konkrete UML-Objekt „entity“ ein neues Objekt „entity_extended“ ablei- Abarbeitungsreihenfolge einzelner Operatoren oder Angaben zu ten, das in einem zusätzlichen Merkmal „cardinality“ Infor- abgeschätzten Prädikat-Selektivitäten werden vom Modell zum mationen über die produktiv erwartete Datenmenge zu einer Zweck der Einfachheit und Verständlichkeit bewusst vernach- Entität/Tabelle aufnehmen kann. lässigt. Für die gleichzeitige Analyse mehrerer Statements erfolgt eine Aggregation der jeweils abgeschätzten Kosten auf Objekt- 4.2 Synthetische Datenbankstatistiken ebene. Eines der eingangs aufgezeigten Hindernisse für proaktive Perfor- Zentrale Komponente im Performance-Modell ist eine ebenfalls mance-Analysen beziehungsweise -Vorhersagen bestand in der dem physischen Datenmodell angelehnte Diagrammdarstellung. fehlenden repräsentativen Datenbasis für Datenbank-Statisti- Mithilfe farblicher Hervorhebung und geeigneter Bewertungs- ken. Diese Statistiken werden im Normalfall vom DBMS anhand metriken sollen sämtliche Objekte gemäß den vom DBMS der gespeicherten Daten selbst gesammelt. Dem entgegen verfolgt geschätzten Zugriffskosten zur Abarbeitung der Workload das hier vorgestellte Konzept den Ansatz, dem DBMS Statistiken klassifiziert und visualisiert werden. Auf diese Weise kann ein vorzugeben, ohne dazu datenbankseitig repräsentative Datenbe- Designer/Programmierer frühzeitig Auskunft über aus Perfor- stände vorhalten zu müssen. Dafür bieten zwar die wenigsten mance-Perspektive zu optimierende Bereiche im Datenbank- DBMS vordefinierte Schnittstellen an, allerdings sind sämtliche schema beziehungsweise kritische, alternativ zu konzipierende Statistik-Informationen in der Regel innerhalb DBMS-interner SQL-Statements erhalten. Abbildung 3 veranschaulicht exempla- manipulierbarer Tabellen gespeichert, wie dies beispielswiese risch ein visualisiertes Performance-Modell für zwei Statements/ auch bei DB2 oder Oracle der Fall ist [12]. Ausführungspläne (EP). Während der untere Bereich weitgehend grün/unkritisch markiert ist, befinden sich im oberen Diagramm- Datenbankstatistiken enthalten Informationen über Datenmengen teil mögliche Performance-kritische rot gekennzeichnete Zugriffe, und Datenverteilungen sowie Kennzahlen zur physischen Spei- die es gezielt zu untersuchen und an geeigneter Stelle (SQL-State- cherung wie beispielsweise die Anzahl der verwendeten Daten- ment, Datenbank-Design) zu optimieren gilt (vgl. gestrichelte bankseiten pro Tabelle. Während erstere inhaltlich den zuvor Pfeile in Abbildung 3). beschriebenen Performance-Indikatoren entsprechen, sind die Statistikdaten zur physischen Speicherung interne DBMS-abhän- Die technische Realisierung des Performance-Modells sowie der gige Größen. Mithilfe geeigneter, von den DBMS-Herstellern zur dazugehörigen Diagrammdarstellung erfolgt analog zur Erfassung Unterstützung beim Datenbank-Design bereitgestellter Abschät- der Performance-Indikatoren über den UML-Profil-Mechanismus, zungsvorschriften lassen sich aber auch diese Kennzahlen auf wodurch auch in diesem Punkt die Kompatibilität des vorge- Grundlage der Performance-Indikatoren approximieren. Somit ist stellten Ansatzes zu bestehenden Tool-Infrastrukturen gewähr- es wie in Abbildung 3 in 2. gezeigt möglich, anhand geeignet leistet ist. formalisierter Performance-Indikatoren frühzeitig im SDLC/ DBLC repräsentative Datenbankstatistiken künstlich zu erzeugen. 
4.4 Ablauf einer Analyse/Vorhersage Für den Designer/Programmierer sieht der in Abbildung 3 4.3 EXPLAIN und Performance-Modell vorgestellte proaktive Ansatz folgende Vorgehensweise vor. Auf Grundlage von synthetischen Datenbankstatistiken können Nachdem nach 1. ein Datenbank-Design-Entwurf fertiggestellt ist, wie in Abbildung 3 in 3. und 4. zu sehen, mittels der vom DBMS initiiert er in 2. einen Automatismus zur Abbildung des Designs bereitgestellten EXPLAIN-Funktionalität, der SQL-Workload in ein Datenbank-Schema sowie zur Erstellung von synthetischen und dem aus dem physischen Datenmodell ableitbaren Daten- Datenbank-Statistiken anhand der von ihm modellierten Perfor- bankschema proaktive Performance-Vorhersagen durchgeführt mance-Indikatoren. Mithilfe einer weiteren Routine startet der werden. Die resultierenden, teils komplexen Ausführungspläne Designer/Programmierer in 3. und 4. anschließend einen Simu- lassen sich allerdings nur mit ausreichend Expertise und vor- lationsprozess, der auf Basis der EXPLAIN-Mechanismen Perfor- handenen personellen Kapazitäten angemessen auswerten, sodass mance-Vorhersagen für eine gegebene Workload erstellt und diese diese Problematik vorläufig weiterbesteht. Eine Hauptursache, die als Performance-Modell aufbereitet. Von dort aus informiert er das Verständnis von Ausführungsplänen erschwert, ist ihre sich mithilfe der Diagrammdarstellung über mögliche kritische hierarchische Darstellung als Zugriffsbaum. Demgegenüber denkt Zugriffe, die er daraufhin gezielt analysiert und optimiert. 27 5. ZUSAMMENFASSUNG Ansatzes entgegensteht. Somit sind alternative Varianten zur Datenbank-Performance ist ein wichtiger, oftmals jedoch vernach- Beschaffung der Workload für den Analyseprozess zu lässigter Faktor in der Anwendungsentwicklung. Durch moderne untersuchen und abzuwägen. Anforderungen und dazu implementierte Anwendungen sehen sich speziell deren Datenbank-Backends mit kontinuierlich 7. LITERATUR wachsenden Herausforderungen insbesondere betreffend der [1] C. Coronel, S. Morris, P. Rob. Database Systems: Design, Performance konfrontiert. Diese können nur bewältigt werden, Implementation, and Management, Course Technology, 10. wenn das Thema Datenbank-Performance intensiver betrachtet Auflage, 2011. und durch proaktive Analysen (beispielsweise mittels EXPLAIN- [2] S. Salza, M. Renzetti. A Modeling Tool for Workload Mechanismen) kontinuierlich verfolgt wird. Doch auch dann sind Analysis and Performance Tuning of Parallel Database einzelne Hindernisse unvermeidlich: fehlende repräsentative Applications, Proceedings in ADBIS'97, 09.1997 Daten(-mengen) und Expertise/Kapazitäten zur Analyse. http://www.bcs.org/upload/pdf/ewic_ad97_paper38.pdf Der vorliegende Beitrag präsentiert zur Lösung dieser Probleme [3] R. Osman, W. J. Knottenbelt. Database system performance einen modellbasierten Ansatz, der auf Basis synthetisch erzeugter evaluation models: A survey, Artikel in Performance Statistiken proaktive Performance-Analysen sowie -Vorhersagen Evaluation, Elsevier Verlag, 10.2012 erlaubt und die daraus gewonnenen Ergebnisse in einer einfach http://dx.doi.org/10.1016/j.peva.2012.05.006 verständlichen Form visualisiert. Die technologische Grundlage dafür bietet die in der Praxis vorherrschende Modellierungs- [4] Tata Consultancy Services. System and method for SQL sprache UML mit ihrer UML-Profil-Spezifikation. 
Sie erlaubt es performance assurance services, Internationales Patent das hier vorgestellte Konzept und die dazu benötigten Kom- PCT/IN2011/000348, 11.2011 ponenten mit vorhandenen technischen Mitteln abzubilden und http://dx.doi.org/10.1016/j.peva.2012.05.006 nahtlos in bestehende UML-Infrastrukturen zu integrieren. [5] D. Wiese. Gewinnung, Verwaltung und Anwendung von Performance-Daten zur Unterstützung des autonomen 6. AUSBLICK Datenbank-Tuning, Dissertation, Fakultät für Mathematik Bei dem im Beitrag vorgestellten Konzept handelt es sich um und Informatik, Friedrich-Schiller-Universität Jena, 05.2011. einen auf Basis wiederkehrender praktischer Problemstellungen http://www.informatik.uni-jena.de/dbis/alumni/wiese/pubs/D und den daraus gewonnenen Erfahrungen konstruierten Ansatz. issertation__David_Wiese.pdf Während die technische Umsetzbarkeit einzelner Teilaspekte wie [6] S. Chaudhuri, V. Narasayya. A Self-Tuning Database etwa die Erfassung von Performance-Indikatoren oder die Kon- Systems: A Decade of Progress, Proceedings in VLDB'07, struktion des Performance-Modells auf Basis von UML-Profilen 09.2007 bereits geprüft wurde, steht eine prototypische Implementierung http://research.microsoft.com/pubs/76506/vldb07-10yr.pdf des gesamten Prozesses zur Performance-Analyse noch aus. [7] N. Bruno, S. Chaudhuri. An Online Approach to Physical Zuvor sind weitere Detailbetrachtungen nötig. So ist beispiels- Design Tuning, Proceedings in ICDE'07, 04.2007 weise zu klären, in welchem Umfang Performance-Indikatoren http://research.microsoft.com/pubs/74112/continuous.pdf im Datenmodell vom Analyst/Designer sinnvoll erfasst werden [8] Oracle Corporation. Oracle Database 2 Day DBA 12c sollten. Dabei ist ein Kompromiss zwischen maximalem Release 1 (12.1) – Monitoring and Tuning the Database, Detailgrad und minimal nötigem Informationsgehalt anzustreben, 2013. sodass der Aufwand zur Angabe von Performance-Indikatoren http://docs.oracle.com/cd/E16655_01/server.121/e17643/mo möglichst gering ist, mit deren Hilfe aber dennoch eine ntune.htm#ADMQS103 repräsentative Performance-Vorhersage ermöglicht wird. [9] Microsoft Corporation. SQL Server 2005 – Database Engine Weiterhin gilt es, eine geeignete Metrik zur Bewertung/Katego- Tuning Advisor (DTA) in SQL Server 2005, Technischer risierung der Analyseergebnisse zu entwickeln. Hier steht die Artikel, 2006. Frage im Vordergrund, wann ein Zugriff anhand seiner Kosten als http://download.microsoft.com/download/4/7/a/47a548b9- schlecht und wann er als gut zu bewerten ist. Ein teurer Zugriff ist 249e-484c-abd7-29f31282b04d/SQL2005DTA.doc nicht zwangsweise ein schlechter, wenn er beispielsweise zur Realisierung einer komplexen Funktionalität verwendet wird. [10] C.-M. Lo. A Study of Applying a Model-Driven Approach to the Development of Database Applications, Dissertation, Zuletzt sei noch die Erfassung beziehungsweise Beschaffung der Department of Information Management, National Taiwan für die EXPLAIN-Analysen notwendigen Workload erwähnt. University of Science and Technology, 06.2012. Diese muss dem vorgestellten proaktiven Analyseprozess [11] Object Management Group. Information Management zugänglich gemacht werden, um anhand des beschriebenen Metamodel (IMM) Specification Draft Version 8.0, Konzepts frühzeitige Performance-Untersuchungen durchführen Spezifikationsentwurf, 03.2009. zu können. 
Im einfachsten Fall könnte angenommen werden, dass http://www.omgwiki.org/imm/doku.php sämtliche SQL-Statements (inklusive ihrer Ausführungshäu- figkeit) vom Designer/Programmierer ebenfalls im Datenmodell [12] N. Burgold, M. Gerstmann, F. Leis. Statistiken in beispielsweise als zusätzliche Merkmale von Methoden in der relationalen DBMSen und Möglichkeiten zu deren UML-Klassenmodellierung zu erfassen und kontinuierlich zu synthetischer Erzeugung, Projektarbeit, Fakultät für pflegen wären. Dies wäre jedoch ein sehr aufwändiges Verfahren, Mathematik und Informatik, Friedrich-Schiller-Universität das der gewünschten hohen Praxistauglichkeit des proaktiven Jena, 05.2014. 28 Big Data und der Fluch der Dimensionalität Die effiziente Suche nach Quasi-Identifikatoren in hochdimensionalen Daten Hannes Grunert Andreas Heuer Lehrstuhl für Datenbank- und Lehrstuhl für Datenbank- und Informationssysteme Informationssysteme Universität Rostock Universität Rostock Albert-Einstein-Straße 22 Albert-Einstein-Straße 22 hg(at)informatik.uni-rostock.de ah(at)informatik.uni-rostock.de Kurzfassung gen Handlungen des Benutzers abgeleitet, sodass die smarte In smarten Umgebungen werden häufig große Datenmengen Umgebung eigenständig auf die Bedürfnisse des Nutzers rea- durch eine Vielzahl von Sensoren erzeugt. In vielen Fällen gieren kann. werden dabei mehr Informationen generiert und verarbei- In Assistenzsystemen [17] werden häufig wesentlich mehr tet als in Wirklichkeit vom Assistenzsystem benötigt wird. Informationen gesammelt als benötigt. Außerdem hat der Dadurch lässt sich mehr über den Nutzer erfahren und sein Nutzer meist keinen oder nur einen sehr geringen Einfluss Recht auf informationelle Selbstbestimmung ist verletzt. auf die Speicherung und Verarbeitung seiner personenbe- Bestehende Methoden zur Sicherstellung der Privatheits- zogenen Daten. Dadurch ist sein Recht auf informationel- ansprüche von Nutzern basieren auf dem Konzept sogenann- le Selbstbestimmung verletzt. Durch eine Erweiterung des ter Quasi-Identifikatoren. Wie solche Quasi-Identifikatoren Assistenzsystems um eine Datenschutzkomponente, welche erkannt werden können, wurde in der bisherigen Forschung die Privatheitsansprüche des Nutzers gegen den Informati- weitestgehend vernachlässigt. onsbedarf des Systems überprüft, kann diese Problematik In diesem Artikel stellen wir einen Algorithmus vor, der behoben werden. identifizierende Attributmengen schnell und vollständig er- Zwei Hauptaspekte des Datenschutzes sind Datenvermei- kennt. Die Evaluierung des Algorithmus erfolgt am Beispiel dung und Datensparsamkeit. In §3a des Bundesdatenschutz- einer Datenbank mit personenbezogenen Informationen. gesetzes [1] wird gefordert, dass [d]ie Erhebung, Verarbeitung und Nutzung ” ACM Klassifikation personenbezogener Daten und die Auswahl und K.4.1 [Computer and Society]: Public Policy Issues— Gestaltung von Datenverarbeitungssystemen [...] Privacy; H.2.4 [Database Management]: Systems—Que- an dem Ziel auszurichten [sind], so wenig perso- ry Processing nenbezogene Daten wie möglich zu erheben, zu verarbeiten oder zu nutzen.“. Stichworte Mittels einer datensparsamen Weitergabe der Sensor- und Datenbanken, Datenschutz, Big Data Kontext-Informationen an die Analysewerkzeuge des Assis- tenzsystems wird nicht nur die Datenschutzfreundlichkeit des Systems verbessert. Bei der Vorverdichtung der Daten 1. 
EINLEITUNG durch Selektion, Aggregation und Komprimierung am Sen- Assistenzsysteme sollen den Nutzer bei der Arbeit (Am- sor selbst lässt sich die Effizienz des Systems steigern. Die bient Assisted Working) und in der Wohnung (Ambient Privatheitsansprüche und der Informationsbedarf der Ana- Assisted Living) unterstützen. Durch verschiedene Senso- lysewerkzeuge können als Integritätsbedingungen im Daten- ren werden Informationen über die momentane Situation banksystem umgesetzt werden. Durch die Integritätsbedin- und die Handlungen des Anwenders gesammelt. Diese Da- gungen lassen sich die notwendigen Algorithmen zur An- ten werden durch das System gespeichert und mit weiteren onymisierung und Vorverarbeitung direkt auf dem Datenbe- Daten, beispielsweise mit dem Facebook-Profil des Nutzers stand ausführen. Eine Übertragung in externe Programme verknüpft. Durch die so gewonnenen Informationen lassen bzw. Module, die sich evtl. auf anderen Recheneinheiten be- sich Vorlieben, Verhaltensmuster und zukünftige Ereignis- finden, entfällt somit. se berechnen. Daraus werden die Intentionen und zukünfti- Für die Umsetzung von Datenschutzbestimmungen in smarten Umgebungen wird derzeit das PArADISE1 - Framework entwickelt, welches insbesondere die Aspekte der Datensparsamkeit und Datenvermeidung in heteroge- nen Systemumgebungen realisieren soll. In [3] stellen wir ein einfaches XML-Schema vor, mit der Copyright c by the paper’s authors. Copying permitted only sich Privatheitsansprüche durch den Nutzer von smarten for private and academic purposes. Systemen formulieren lassen. Dabei wird eine Anwendung In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 1 Workshop on Foundations of Databases (Grundlagen von Datenbanken), Privacy-aware assistive distributed information system 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. environment 29 innerhalb eines abgeschlossenen Systems in ihre Funktionali- pel (ti ) angibt. Ein Quasi-Identifikator QI := {A1 , ..., An } täten aufgeteilt. Für jede Funktionalität lässt sich festlegen, ist für eine Relation R entsprechend definiert: welche Informationen in welchem Detailgrad an das System ≥p weitergegeben werden dürfen. Dazu lassen sich einzelne At- Quasi-Identifikator. ∀ t1 , t2 ∈ R [t1 6= t2 ⇒ ∃ A ∈ QI: tribute zu Attributkombinationen zusammenfassen, die an- t1 (A) 6= t2 (A)] gefragt werden können. Wie beim Datenbankentwurf reicht es auch für die Anga- Für einen unerfahrenen Nutzer ist das Festlegen von sinn- be von Quasi-Identifikatoren aus, wenn die minimale Men- vollen Einstellungen nur schwer möglich. Die Frage, die sich ge von Attributen angegeben wird, welche die Eigenschaft ihm stellt, ist nicht die, ob er seine persönlichen Daten schüt- eines QI hat. Eine solche Menge wird als minimaler Quasi- zen soll, sondern vielmehr, welche Daten es wert sind, ge- Identifikator bezeichnet. schützt zu werden. Zur Kennzeichnung schützenswerter Da- ten werden u.a. sogenannte Quasi-Identifikatoren [2] verwen- minimaler Quasi-Identifikator. X ist ein minimaler det. In diesem Artikel stellen wir einen neuen Ansatz vor, Quasi-Identifikator (mQI), wenn X ein Quasi-Identifikator mit dem Quasi-Identifikatoren schnell und vollständig er- ist und jede nicht-leere Teilmenge Y von X kein Quasi- kannt werden können. Identifikator ist. 
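Zur Veranschaulichung der eben eingeführten Begriffe prüft die folgende Skizze, welcher Anteil der Tupel einer Relation durch eine Attributmenge eindeutig identifiziert wird und ob damit ein gegebener Grenzwert p erreicht ist. Sie ist rein illustrativ; pandas sowie Daten- und Attributnamen sind frei gewählte Annahmen und nicht Bestandteil des Beitrags.

# Illustrative Skizze (keine Implementierung aus dem Beitrag).
import pandas as pd

df = pd.DataFrame({                     # frei erfundene Beispieldaten
    "plz":    ["18055", "18055", "07743", "07743", "18055"],
    "geburt": [1980, 1991, 1980, 1955, 1991],
    "geschl": ["m", "w", "m", "w", "w"],
})

def eindeutig_anteil(df, attribute):
    """Anteil der Tupel, die über die Attributmenge eindeutig identifizierbar sind."""
    return float((~df.duplicated(subset=list(attribute), keep=False)).mean())

def ist_quasi_identifikator(df, attribute, p=0.8):
    return eindeutig_anteil(df, attribute) >= p

print(eindeutig_anteil(df, {"plz"}))                             # sehr kleiner Anteil
print(ist_quasi_identifikator(df, {"plz", "geburt"}, p=0.5))     # mit Grenzwert p = 0.5 ein QI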
Der Rest des Artikels ist wie folgt strukturiert: Kapitel 2 X ist mQI: X ist QI ∧ (@ Y ⊂ X: (Y 6= ) ∧ (Y ist QI)) gibt einen aktuellen Überblick über den Stand der Forschung Insbesondere ist X kein minimaler Quasi-Identifikator, im Bereich der Erkennung von Quasi-Identifikatoren. Im fol- wenn eine Teilmenge X-{A} von X mit A ∈ X existiert, genden Kapitel gehen wir detailliert darauf ein, wie schüt- die ein Quasi-Identifikator ist. Das Finden von allen Quasi- zenswerte Daten definiert sind und wie diese effizient erkannt Identifikatoren stellt ein NP-vollständiges Problem dar, weil werden können. Kapitel 4 evaluiert den Ansatz anhand eines die Menge der zu untersuchenden Teilmengen exponentiell Datensatzes. Das letzte Kapitel fasst den Beitrag zusammen zur Anzahl der Attribute einer Relation steigt. Besteht eine und gibt einen Ausblick auf zukünftige Arbeiten. Relation aus n Attributen, so existieren insgesamt 2n Attri- butkombinationen, für die ermittelt werden muss, ob sie ein 2. STAND DER TECHNIK QI sind. In diesem Kapitel stellen wir bestehende Konzepte zur In [12] stellen Motwani und Xu einen Algorithmus zum ef- Ermittlung von Quasi-Identifikatoren (QI) vor. Außerdem fizienten Erkennen von minimalen Quasi-Identifikatoren vor. werden Techniken vorgestellt, die in unseren Algorithmus Dieser baut auf die von Mannila et. al [10] vorgeschlagene, eingefloßen sind. ebenenweise Erzeugung von Attributmengen auf. Dabei wird die Minimalitätseigenschaft von Quasi-Identifikatoren sofort 2.1 Quasi-Identifikatoren erkannt und der Suchraum beim Durchlauf auf der nächsten Zum Schutz personenbezogener Daten existieren Konzep- Ebene eingeschränkt. te wie k-anonymity [16], l-diversity [8] und t-closeness [7]. Der Algorithmus ist effizienter als alle 2n Teilmengen zu Diese Konzepte unterteilen die Attribute einer Relation in testen, allerdings stellt die von Big-Data-Anwendungen er- Schlüssel, Quasi-Identifikatoren, sensitive Daten und sons- zeugte Datenmenge eine neue Herausforderung dar. Insbe- tige Daten. Ziel ist es, dass die sensitiven Daten sich nicht sondere die hohe Dimensionalität und die Vielfalt der Daten eindeutig zu einer bestimmten Person zuordnen lassen. Da sind ernst zu nehmende Probleme. Aus diesem Grund schla- durch Schlüsselattribute Tupel eindeutig bestimmt werden gen wir im folgenden Kapitel einen neuen Algorithmus vor, können, dürfen diese unter keinen Umständen zusammen der auf den Algorithmus von Motwani und Xu aufsetzt. mit den sensitiven Attributen veröffentlicht werden. Während Schlüssel im Laufe des Datenbankentwurfes fest- 2.2 Sideways Information Passing gelegt werden, lassen sich Quasi-Identifikatoren erst beim Der von uns entwickelte Algorithmus verwendet Techni- Vorliegen der Daten feststellen, da sie von den konkreten ken, die bereits beim Sideways Information Passing (SIP, Attributwerten der Relation abhängen. Der Begriff Quasi- [4]) eingesetzt werden. Der grundlegende Ansatz von SIP Identifikator wurde von Dalenius [2] geprägt und bezeichnet besteht darin, dass während der Ausführung von Anfrage- a subset of attributes that can uniquely identify most tuples plänen Tupel nicht weiter betrachtet werden, sofern mit Si- ” in a table“. cherheit feststeht, dass sie keinen Bezug zu Tupeln aus an- Für most tuples“ wird häufig ein Grenzwert p festge- deren Relationen besitzen. ” legt, der bestimmt, ob eine Attributkombination ein Quasi- Durch das frühzeitige Erkennen solcher Tupel wird der Identifikator ist oder nicht. 
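For reference, the two definitions can be restated compactly; the rendering below is our reading of the prose above, with the quantifier ∀^{≥p} meaning "for at least the fraction p of all tuple pairs".

% Reconstruction of the definitions as we read them from the surrounding text.
\[
  \text{QI: } \forall^{\geq p}\; t_1, t_2 \in R\colon\; t_1 \neq t_2 \;\Rightarrow\; \exists A \in QI\colon\; t_1(A) \neq t_2(A)
\]
\[
  \text{mQI: } X \text{ ist mQI} \;\Leftrightarrow\; X \text{ ist QI} \;\wedge\; \nexists\, Y \subset X\colon\; (Y \neq \emptyset) \wedge (Y \text{ ist QI})
\]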
Dieser Grenzwert lässt sich bei- zu betrachtende Suchraum eingeschränkt und die Ausfüh- spielsweise in relationalen Datenbanken durch zwei SQL- rungszeit von Anfragen reduziert. Besonders effektiv ist die- Anfragen wie folgt bestimmen: ses Vorgehen, wenn das Wissen über diese magic sets“ [14] ” zwischen den Teilen eines Anfrageplans ausgetauscht und p = COUNT DISTINCT *COUNT FROM (SELECT FROM table) ∗ FROM table in höheren Ebenen des Anfrageplans mit eingebunden wird. (1) Beim SIP werden zudem weitere Techniken wie Bloomjoins Wird für p der Wert 1 gewählt, so sind die gefundenen QI [9] und Semi-Joins eingesetzt um den Anfrageplan weiter zu mit diesem Grenzwert auch Schlüssel der Relation. Um eine optimieren. Vergleichbarkeit unseres Algorithmus mit dem von Motwani und Xu zu gewährleisten, verwenden wir ebenfalls die in (1) 2.3 Effiziente Erfragung von identifizieren- definierte distinct ratio“ (nach [12]). ” den Attributmengen Da es für den Ausdruck die meisten“ keinen standardisier- ” ≥p In [5] wird ein Algorithmus zur Ermittlung von identi- ten Quantor gibt, formulieren wir ihn mit dem Zeichen: ∀ , fizierenden Attributmengen (IA) in einer relationalen Da- wobei p den Prozentsatz der eindeutig identifizierbaren Tu- tenbank beschrieben. Wird für eine Attributmenge erkannt, 30 dass diese eine IA für eine Relation R ist, so sind auch alle Algorithm 1: bottomUp Obermengen dieser Attributmenge IA für R. Ist für eine Re- Data: database table tbl, list of attributes elements lation bestehend aus den Attributen A, B und C bekannt, Result: a set with all minimal QI qiLowerSet dass B eine identifizierende Attributmenge ist, dann sind initialization(); auch AB, BC und ABC eine IA der Relation. for element in elements do Ist eine Attributmenge hingegen keine IA für R, so sind set := set ∪ {element} auch alle Teilmengen dieser Attributmenge keine IA. Wenn end beispielsweise AC keine IA für R ist, dann sind auch weder A while set is not empty do noch C identifizierende Attributmengen für R. Attributmen- for Set testSet: set do gen, die keine identifizierende Attributmenge sind, werden double p := getPercentage(testSet, tbl); als negierte Schlüssel bezeichnet. if p ≥ threshold then Der in [5] vorgestellte Algorithmus nutzt diese Eigenschaf- qiLowerSet := qiLowerSet ∪ {testSet}; ten um anhand eines Dialoges mit dem Nutzer die Schlüs- end seleigenschaften einer bereits existierenden Relation festzu- end legen. Dabei wird dem Nutzer ein Ausschnitt der Relations- set := buildNewLowerSet(set, elements); tabelle präsentiert anhand derer entschieden werden soll, ob end eine Attributkombination Schlüssel ist oder nicht. Wird in return qiLowerSet; einer Teilrelation festgestellt, dass die Attributmenge Tu- pel mit gleichen Attributwerten besitzt, so kann die Attri- butkombination für die Teilmenge, als auch für die gesamte Relation kein Schlüssel sein. Algorithm 2: buildNewLowerSet Data: current lower set lSet, list of attributes elements 3. ALGORITHMUS Result: the new lower set lSetNew In diesem Kapitel stellen wir einen neuen Algorithmus Set lSetNew := new Set(); zum Finden von minimalen Quasi-Identifikatoren vor. Der for Set set: lSet do Algorithmus beschränkt sich dabei auf die Einschränkung for Attribut A: elements do der zu untersuchenden Attributkombinationen. Der entwi- if @q ∈ qiLowerSet : q ⊆ set then ckelte Ansatz führt dabei den von [12] vorgestellten Bottom- lSetNew := lSetNew ∪ {set ∪ {A}}; Up-Ansatz mit einen gegenläufigen Top-Down-Verfahren zu- end sammen. 
end 3.1 Bottom-Up end return lSetNew; Der von Motwani und Xu in [12] vorgestellte Ansatz zum Erkennen aller Quasi-Identifikatoren innerhalb einer Rela- tion nutzt einen in [10] präsentierten Algorithmus. Dabei wird für eine Relation mit n Attributen ebenenweise von gesetzte QIs besitzt, da so der Suchraum gleich zu Beginn den einelementigen zu n-elementigen Attributkombinatio- stark eingeschränkt wird. nen Tests durchgeführt. Wird für eine i-elementige (1≤i testSet: set do double p := getPercentage(testSet, tbl); Passing [4] untereinander ausgetauscht. Es wird pro Berech- if p < threshold then nungsschritt entweder die Top-Down- oder die Bottom-Up- optOutSet := optOutSet ∪ {subset}; Methode angewandt und das Ergebnis an die jeweils ande- else re Methode übergeben. Der Algorithmus terminiert, sobald qiUpperSet := qiUpperSet ∪ {testSet}; alle Attributebenen durch einen der beiden Methoden abge- for Set o: qiSet do arbeitet wurden oder das Bottom-Up-Vorgehen keine Attri- if testSet ⊂ o then butkombinationen mehr zu überprüfen hat. In Abbildung 1 qiUpperSet := qiUpperSet - {o}; ist die Arbeitsweise des Algorithmus anhand einer Beispiel- end relation mit sechs Attributen dargestellt. Die rot markierten end Kombinationen stehen dabei für negierte QI, grün markierte end für minimale QI und gelb markierte für potentiell minimale end QI. set := buildNewUpper(set); Um zu entscheiden, welcher Algorithmus im nächsten Zy- end klus angewandt wird, wird eine Wichtungsfunktion einge- return qiUpperSet; führt. Die Überprüfung einer einzelnen Attributkombinati- on auf Duplikate hat eine Laufzeit von O(n*log(n)), wobei n die Anzahl der Tupel in der Relation ist. Die Überprü- Der Top-Down-Ansatz hebt die Nachteile des Bottom-Up- fung der Tupel hängt aber auch von der Größe der Attri- Vorgehens auf: der Algorithmus arbeitet effizient, wenn QIs butkombination ab. Besteht ein zu überprüfendes Tupel aus aus vielen Attributen zusammengesetzt sind und für den mehreren Attributen, so müssen im Datenbanksystem auch Fall, dass die gesamte Relation kein QI ist, wird dies bei der mehr Daten in den Arbeitsspeicher für die Duplikaterken- ersten Überprüfung erkannt und der Algorithmus terminiert nung geladen werden. Durch große Datenmengen werden dann umgehend. Seiten schnell aus dem Arbeitsspeicher verdrängt, obwohl Besteht die Relation hingegen aus vielen kleinen QIs, dann sie später wieder benötigt werden. Dadurch steigt die Re- wird der Suchraum erst zum Ende des Algorithmus stark chenzeit weiter an. eingeschränkt. Ein weiterer Nachteil liegt in der erhöhten Für eine vereinfachte Wichtungsfunktion nehmen wir an, Rechenzeit, auf die in der Evaluation näher eingegangen dass alle Attribute den gleichen Speicherplatz belegen. Die wird. Anzahl der Attribute in einer Attributkombination bezeich- nen wir mit m. Für die Duplikaterkennung ergibt sich dann 3.3 Bottom-Up+Top-Down eine Laufzeit von O((n*m)*log(n*m)). Der in diesem Artikel vorgeschlagene Algorithmus kom- Da die Anzahl der Tupel für jede Duplikaterkennung kon- biniert die oben vorgestellten Verfahren. Dabei werden die stant bleibt, kann n aus der Kostenabschätzung entfernt Verfahren im Wechsel angewandt und das Wissen über (ne- werden. Die Kosten für die Überprüfung einer einzelnen gierte) Quasi-Identifikatoren wie beim Sideways Information 32 Algorithm 5: bottomUpTopDown Die Evaluation erfolgte in einer Client-Server-Umgebung. 
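The level-wise candidate construction used by the bottom-up pass (Algorithms 1 and 2) can be sketched as follows. The Java fragment is a simplified illustration with our own identifiers, not the evaluated implementation: each surviving attribute set is extended by one attribute, and any candidate that already contains a known minimal quasi-identifier is pruned, since such a superset cannot be minimal.

import java.util.*;

// Sketch of the level-wise bottom-up step (in the spirit of Algorithms 1 and 2):
// candidates of the next level are built by adding one attribute to each surviving
// set; candidates that contain an already found minimal QI are pruned.
// All identifiers are our own, not taken from the paper.
public class LevelwiseCandidates {

    static Set<Set<String>> buildNextLevel(Set<Set<String>> currentLevel,
                                           List<String> attributes,
                                           Set<Set<String>> foundMinimalQIs) {
        Set<Set<String>> nextLevel = new HashSet<>();
        for (Set<String> candidate : currentLevel) {
            for (String attribute : attributes) {
                if (candidate.contains(attribute)) {
                    continue;                   // only genuine extensions
                }
                Set<String> extended = new HashSet<>(candidate);
                extended.add(attribute);
                if (containsKnownQI(extended, foundMinimalQIs)) {
                    continue;                   // prune: a superset of a minimal QI cannot be minimal
                }
                nextLevel.add(extended);
            }
        }
        return nextLevel;
    }

    static boolean containsKnownQI(Set<String> candidate, Set<Set<String>> foundMinimalQIs) {
        for (Set<String> qi : foundMinimalQIs) {
            if (candidate.containsAll(qi)) {
                return true;
            }
        }
        return false;
    }
}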
Data: database table tbl, list of attributes attrList Als Server dient eine virtuelle Maschine, die mit einer 64-Bit- Result: a set with all minimal quasi-identifier qiSet CPU (vier Kerne @ 2 GHz und jeweils 4 MB Cache) und 4 attrList.removeConstantAttributes(); GB Arbeitsspeicher ausgestattet ist. Auf dieser wurde eine Set upperSet := new Set({attrList}); MySQL-Datenbank mit InnoDB als Speichersystem verwen- Set lowerSet := new Set(attrList); det. Der Client wurde mit einem i7-3630QM als CPU betrie- // Sets to check for each algorithm ben. Dieser bestand ebenfalls aus vier Kernen, die jeweils int bottom := 0; über 2,3 GHz und 6 MB Cache verfügten. Als Arbeitsspei- int top := attrList.size(); cher standen 8 GB zur Verfügung. Als Laufzeitumgebung while (bottom<=top) or (lowerSet is empty) do wurde Java SE 8u5 eingesetzt. calculateWeights(); Der Datensatz wurde mit jedem Algorithmus getestet. if isLowerSetNext then Um zu ermitteln, wie die Algorithmen sich bei verschiede- bottomUp(); nen Grenzwerten für Quasi-Identifikatoren verhalten, wur- buildNewLowerSet(); den die Tests mit 10 Grenzwerten zwischen 50% und 99% bottom++; wiederholt. // Remove new QI from upper set Die Tests mit den Top-Down- und Bottom-Up- modifyUpperSet(); Algorithmen benötigten im Schnitt gleich viele Tablescans (siehe Abbildung 2). Die Top-Down-Methode lieferte bes- else sere Ergebnisse bei hohen QI-Grenzwerten, Bottom-Up topDown(); ist besser bei niedrigeren Grenzwerten. Bei der Laufzeit buildNewUpperSet(); (siehe Abbildung 3) liegt die Bottom-Up-Methode deutlich top--; vor dem Top-Down-Ansatz. Grund hierfür sind die großen // Remove new negated QI from lower set Attributkombinationen, die der Top-Down-Algorithmus zu modifyLowerSet(); Beginn überprüfen muss. end Der Bottom-Up+Top-Down-Ansatz liegt hinsichtlich end Laufzeit als auch bei der Anzahl der Attributvergleiche qiSet := qiLowerSet ∪ qiUpperSet; deutlich vorne. Die Anzahl der Tablescans konnte im Ver- return qiSet; gleich zum Bottom-Up-Verfahren zwischen 67,4% (4076 statt 12501 Scans; Grenzwert: 0.5) und 96,8% (543 statt 16818 Scans; Grenzwert 0.9) reduziert werden. Gleiches gilt Attributkombination mit m Attributen beträgt demnach für die Laufzeit (58,1% bis 97,5%; siehe Abbildung 3). O((m*log(m)). Die Gesamtkosten für das Überprüfen der möglichen Quasi-Identifikatoren werden mit WAV G bezeichnet. WAV G 6000 Anzahl Tablescans ergibt sich aus dem Produkt für das Überprüfen einer ein- zelnen Attributkombination und der Anzahl der Attribut- kombinationen (AttrKn ) mit n Attributen. 4000 WAV G := AttrKn ∗ log(m) ∗ m (2) 2000 Soll die Wichtungsfunktion präziser sein, so lässt sich der Aufwand abschätzen, indem für jede Attributkombination X die Summe s über die Attributgrößen von X gebildet und 0 anschließend gewichtet wird. Die Einzelgewichte werden an- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 schließend zum Gesamtgewicht aufsummiert. Anzahl Attribute in der Attributkombination P P WAV G := log(s) ∗ s; s = size(A) (3) Brute-Force X∈AttrKn A∈X Bottom-Up Diese Wichtung eignet sich allerdings nur, wenn Zugang Top-Down zu den Metadaten der Datenbankrelation besteht. Bottom-Up+Top-Down (AVG) 4. EVALUATION Abbildung 2: Verhältnis von der Anzahl der Attri- Für die Evaluation des Algorithmus wurde die Adult“- bute in den Attributkombinationen zur Anzahl von ” Tablescans (Adult-DB, Grenzwert 90%) Relation aus dem UCI Machine Learning Repository [6] ver- wendet. 
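The weighting function that decides whether the next step is a bottom-up or a top-down pass can be restated compactly. The two variants below are our reading of formulas (2) and (3): a uniform estimate over all combinations AttrK_n of size m, and a refined estimate that sums the attribute sizes s_X of each combination X (only applicable when the relation's metadata are accessible).

% Our reconstruction of the weighting formulas from the surrounding text.
\[
  W_{AVG} := |AttrK_n| \cdot m \cdot \log(m)
\]
\[
  W_{AVG} := \sum_{X \in AttrK_n} s_X \cdot \log(s_X), \qquad s_X = \sum_{A \in X} size(A)
\]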
Die Relation besteht aus anonymisierten, personen- bezogenen Daten, bei denen Schlüssel sowie Vor- und Nach- Wie in Abbildung 3 zu erkennen ist, nimmt die Lauf- name von Personen entfernt wurden. Die übrigen 15 Attri- zeit beim Bottom-Up+Top-Down-Verfahren im Grenz- bute enthalten Angaben zu Alter, Ehestand, Staatsangehö- wertbereich von 70%-90% stark ab. Interessant ist dies rigkeit und Schulabschluss. Die Relation besteht insgesamt aus zwei Gründen. Erstens nimmt die Anzahl der Quasi- aus 32561 Tupeln, die zunächst im CSV-Format vorlagen Identifikatoren bis 90% ebenfalls ab (179 bei 50%, 56 bei und in eine Datenbank geparst wurden. 90%). Dies legt nahe, dass die Skalierung des Verfahrens neben der Dimension der Relation (Anzahl von Tupel und 33 Attributen) auch von der Anzahl der vorhandenen QIs Bekanntmachung vom 14. Januar 2003, das zuletzt abhängt. Um den Zusammenhang zu bestätigen, sind aber durch Artikel 1 des Gesetzes vom 14. August 2009 weitere Untersuchungen erforderlich. geändert worden ist, 2010. Zweitens wird dieser Grenzwertbereich in der Literatur [2] T. Dalenius. Finding a Needle In a Haystack or [13] häufig benutzt, um besonders schützenswerte Daten her- Identifying Anonymous Census Records. Journal of vorzuheben. Durch die gute Skalierung des Algorithmus in Official Statistics, 2(3):329–336, 1986. diesem Bereich lassen sich diese QIs schnell feststellen. [3] H. Grunert. Privacy Policy for Smart Environments. http://www.ls-dbis.de/pp4se, 2014. zuletzt aufgerufen am 17.07.2014. 8000 [4] Z. G. Ives and N. E. Taylor. Sideways information Laufzeit in Sekunden passing for push-style query processing. In Data 6000 Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 774–783. IEEE, 4000 2008. [5] M. Klettke. Akquisition von Integritätsbedingungen in 2000 Datenbanken. PhD thesis, Universität Rostock, 1997. [6] R. Kohavi and B. Becker. Adult Data Set. http://archive.ics.uci.edu/ml/datasets/Adult, 0 1996. zuletzt aufgerufen am 17.07.2014. 50 60 70 80 90 95 99 [7] N. Li, T. Li, and S. Venkatasubramanian. t-Closeness: Grenzwert in % Privacy Beyond k-Anonymity and l-Diversity. In ICDE, volume 7, pages 106–115, 2007. Bottom-Up [8] A. Machanavajjhala, D. Kifer, J. Gehrke, and Top-Down M. Venkitasubramaniam. l-diversity: Privacy beyond Bottom-Up+Top-Down(AVG) k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007. [9] L. F. Mackert. R* optimizer validation and Abbildung 3: Vergleich der Laufzeit der verschiede- performance evaluation for distributed queries. In nen Algorithmen (Adult-DB) Readings in database systems, pages 219–229. Morgan Kaufmann Publishers Inc., 1988. [10] H. Mannila, H. Toivonen, and A. I. Verkamo. 5. AUSBLICK Discovery of frequent episodes in event sequences. In dieser Arbeit stellten wir einen effizienten Algorithmus Data Mining and Knowledge Discovery, 1(3):259–289, zur Erkennung von QI in hochdimensionalen Daten vor. An- 1997. hand eines Beispiels mit Sensordaten zeigten wir die Eignung [11] D. Moos. Konzepte und Lösungen für in Assistenzsystemen. Darüber hinaus ermitteln wir derzeit, Datenaufzeichnungen in heterogenen dynamischen inwiefern sich QIs in temporalen Datenbanken feststellen Umgebungen. Bachelorarbeit, Universität Rostock, lassen. Das so gewonnene Wissen über schützenswerte Daten 2011. wird in unser Gesamtprojekt zur datenschutzfreundlichen [12] R. Motwani and Y. Xu. Efficient algorithms for Anfrageverarbeitung in Assistenzsystemen eingebunden. masking and finding quasi-identifiers. 
In Proceedings In späteren Untersuchungen werden wir testen, welche of the Conference on Very Large Data Bases (VLDB), weiteren Quasi-Identifikatoren sich aus der Kombination pages 83–93, 2007. von Daten verschiedener Relationen ableiten lassen. Der [13] P. Samarati and L. Sweeney. Protecting privacy when dafür verwendete Datensatz besteht aus Sensordaten, die disclosing information: k-anonymity and its im Smart Appliance Lab des Graduiertenkollegs MuSA- enforcement through generalization and suppression. MA durch ein Tool [11] aufgezeichnet wurden. Die Daten Technical report, Technical report, SRI International, umfassen dabei Bewegungsprofile, die mittels RFID-Tags 1998. und einen Sensfloor [15] erfasst wurden, aber auch Infor- [14] P. Seshadri, J. M. Hellerstein, H. Pirahesh, T. Leung, mationen zu Licht und Temperatur. Eine Verknüpfung der R. Ramakrishnan, D. Srivastava, P. J. Stuckey, and Basis-Relationen erfolgt dabei über die ermittelten Quasi- S. Sudarshan. Cost-based optimization for magic: Identifikatoren. Algebra and implementation. In ACM SIGMOD Record, volume 25, pages 435–446. ACM, 1996. 6. DANKSAGUNG [15] A. Steinhage and C. Lauterbach. Sensfloor (r): Ein Hannes Grunert wird durch die Deutsche Forschungsge- AAL Sensorsystem für Sicherheit, Homecare und meinschaft (DFG) im Rahmen des Graduiertenkollegs 1424 Komfort. Ambient Assisted Living-AAL, 2008. (Multimodal Smart Appliance Ensembles for Mobile Appli- [16] L. Sweeney. k-anonymity: A model for protecting cations - MuSAMA) gefördert. Wir danken den anonymen privacy. International Journal of Uncertainty, Gutachtern für ihre Anregungen und Kommentare. Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002. 7. LITERATUR [17] M. Weiser. The computer for the 21st century. [1] Bundesrepublik Deutschland. Scientific american, 265(3):94–104, 1991. Bundesdatenschutzgesetz in der Fassung der 34 Combining Spotify and Twitter Data for Generating a Recent and Public Dataset for Music Recommendation Martin Pichl Eva Zangerle Günther Specht Databases and Information Databases and Information Databases and Information Systems Systems Systems Institute of Computer Science Institute of Computer Science Institute of Computer Science University of Innsbruck, University of Innsbruck, University of Innsbruck, Austria Austria Austria martin.pichl@uibk.ac.at eva.zangerle@uibk.ac.at guenther.specht@uibk.ac.at ABSTRACT recommender systems, i.e., the million song dataset (MSD) In this paper, we present a dataset based on publicly avail- [4], however such datasets like the MSD often are not recent able information. It contains listening histories of Spotify anymore. Thus, in order to address the problem of a lack users, who posted what they are listening at the moment of recent and public available data for the development and on the micro blogging platform Twitter. The dataset was evaluation of recommender systems, we exploit the fact that derived using the Twitter Streaming API and is updated many users of music streaming platforms post what they are regularly. To show an application of this dataset, we imple- listening to on the microblogging Twitter. An example for ment and evaluate a pure collaborative filtering based rec- such a tweet is “#NowPlaying Human (The Killers) #craig- ommender system. The performance of this system can be cardiff #spotify http://t.co/N08f2MsdSt”. 
Using a dataset seen as a baseline approach for evaluating further, more so- derived from such tweets, we implement and evaluate a col- phisticated recommendation approaches. These approaches laborative filtering (CF) based music recommender system will be implemented and benchmarked against our baseline and show that this is a promising approach. Music recom- approach in future works. mender systems are of interest, as the volume and variety of available music increased dramatically, as mentioned in the beginning. Besides commercial vendors like Spotify1 , Categories and Subject Descriptors there are also open platforms like SoundCloud2 or Promo H.3.3 [Information Search and Retrieval]: Information DJ3 , which foster this development. On those platforms, filtering; H.2.8 [Database Applications]: Data mining users can upload and publish their own creations. As more and more music is available to be consumed, it gets difficult for the user or rather customer to navigate through it. By General Terms giving music recommendations, recommender systems help Algorithms, Experimentation the user to identify music he or she wants to listen to with- out browsing through the whole collection. By supporting Keywords the user finding items he or she likes, the platform opera- tors benefit from an increased usability and thus increase Music Recommender Systems, Collaborative Filtering, So- the customer satisfaction. cial Media As the recommender system implemented in this work de- livers suitable results, we will gradually enlarge the dataset 1. INTRODUCTION by further sources and assess how the enlargements influ- More and more music is available to be consumed, due ences the performance of the recommender system in fu- to new distribution channels enabled by the rise of the web. ture work. Additionally, as the dataset also contains time Those new distribution channels, for instance music stream- stamps and a part of the captured tweets contains a ge- ing platforms, generate and store valuable data about users olocation, more sophisticated recommendation approaches and their listening behavior. However, most of the time the utilizing these additional context based information can be data gathered by these companies is not publicly available. compared against the baseline pure CF-based approach in There are datasets available based on such private data cor- future works. pora, which are widely used for implementing and evaluating The remainder of this paper is structured as follows: in Section 2 we present the dataset creation process as well as the dataset itself in more detail. Afterwards, in Section 3 we briefly present the recommendation approach, which is eval- uated in Section 4. Before we present the conclusion drawn from the evaluation on Section 6, related work is discussed in Section 5. Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. 1 http://www.spotify.com In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 2 Workshop on Foundations of Databases (Grundlagen von Datenbanken), http://soundcloud.com 3 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. http://promodj.com 35 2. THE SPOTIFY DATASET 2.2 Dataset Description In this Section, the used dataset 4 for developing and eval- Based on the raw data presented in the previous Sec- uating the recommender system is presented. 
tion, we generate a final dataset of - triples which contains 5,504,496 tweets of 569,722 unique 2.1 Dataset Creation users who listened to 322,647 tracks by 69,271 artists. In For the crawling of a sufficiently large dataset, we relied on this final dataset, users considered as not valuable for rec- the Twitter Streaming API which allows for crawling tweets ommendations, i.e., the @SpotifyNowPlay Twitter account containing specified keywords. Since July 2011, we crawled which retweets tweets sent via @Spotify, are removed. These for tweets containing the keywords nowplaying, listento and users were identified manually by the authors. listeningto. Until October 2014, we were able to crawl more As typical for social media datasets, our dataset has a than 90 million tweets. In contrast to other contributions long-tailed distribution among the users and their respective aiming at extracting music information from Twitter, where number of posted tweets [5]. This means that there are only the tweet’s content is used to extract artist and track in- a few number of users tweeting rather often in this dataset formation from [17, 7, 16], we propose to exploit the subset and numerous users are tweeted rarely which can be found of crawled tweets containing a URL leading to the website in the long-tail. This long-tailed distribution can be seen in of the Spotify music streaming service5 . I.e., information Table 2 and Figure 1, where the logarithm of the number of about the artist and the track are extracted from the web- tweets is plotted against the corresponding number of users. site mentioned in the tweet, rather than from the content of the tweet. This enables us an unambiguous resolution Number of Tweets Number of Users of the tweets, in contradiction to the contributions men- >0 569,722 tioned above, where the text of the tweets is compared to >1 354,969 entries in the reference database using some similarity mea- >10 91,217 sure. A typical tweet, published via Spotify, is depicted in >100 7,419 the following: “#nowPlaying I Tried by Total on #Spotify >1,000 198 http://t.co/ZaFH ZAokbV”, where a user published that he or she listened to the track “I Tried” by the band “Total” on Table 2: Number of Tweets and Number of Users Spotify. Additionally, a shortened URL is provided. Besides this shortened URL, Twitter also provides the according re- solved URL via its API. This allows for directly identifying all Spotify-URLs by searching for all URLs containing the string “spotify.com” or “spoti.fi”. By following the identified 4,000 URLs, the artist and the track can be extracted from the title tag of the according website. For instance, the title of the website behind the URL stated above is “I tried 1,000 by Total on Spotify ”. Using the regular expression “(.*) by (.*) on.*” the name of the track (group 1) and the artist (group 2) can be extracted. log(Number of Tweets) By applying the presented approach to the crawled tweets, we were able to extract artist and track information from 100 7.08% of all tweets or rather 49.45% of all tweets containing at least one URL. We refer to the subset of tweets, for which we are able to extract the artist and the track, as “matched tweets”. An overview of the captured tweets is given in Table 1. 1.94% of the tweets containing a Spotify-URL couldn’t 10 be matched due to HTTP 404 Not Found and HTTP 500 Internal Server errors. 
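The matching step described above (filtering resolved URLs for "spotify.com" or "spoti.fi" and applying the regular expression "(.*) by (.*) on.*" to the page title) can be sketched as follows. The Java fragment is a minimal illustration with hypothetical identifiers; it is not the crawler used to build the dataset.

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the matching step: page titles of resolved Spotify URLs are parsed
// with the regular expression "(.*) by (.*) on.*", yielding the track (group 1)
// and the artist (group 2). Identifiers are our own.
public class SpotifyTitleParser {

    private static final Pattern TITLE_PATTERN = Pattern.compile("(.*) by (.*) on.*");

    static boolean isSpotifyUrl(String resolvedUrl) {
        return resolvedUrl.contains("spotify.com") || resolvedUrl.contains("spoti.fi");
    }

    static Optional<String[]> extractTrackAndArtist(String pageTitle) {
        Matcher m = TITLE_PATTERN.matcher(pageTitle);
        if (m.matches()) {
            return Optional.of(new String[]{m.group(1), m.group(2)});  // {track, artist}
        }
        return Optional.empty();                                       // unmatched tweet
    }

    public static void main(String[] args) {
        extractTrackAndArtist("I tried by Total on Spotify")
                .ifPresent(r -> System.out.println("track=" + r[0] + ", artist=" + r[1]));
    }
}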
Restriction Number of Tweets Percentage None 90,642,123 100.00% At least one URL 12,971,482 14.31% A Spotify-URL 6,541,595 7.22% 0 50,000 100,000 150,000 200,000 Number of Users Matched 6,414,702 7.08% Table 1: Captured and Matched Tweets Figure 1: Number of Tweets versus Number of Users Facilitating the dataset creation approach previously pre- sented, we are able to gather 6,414,702 tweets and extract The performance of a pure collaborative filtering-based artist and track data from the contained Spotify-URLs. recommender system increases with the detailedness of a user profile. Especially for new users in a system, where no or only little data is available about them, this poses a 4 available at: http://dbis-twitterdata.uibk.ac.at/ problem as no suitable recommendations can be computed. spotifyDataset/ In our case, problematic users are users who tweeted rarely 5 http://www.spotify.com and thus can be found in the long tail. 36 Besides the long-tail among the number of posted tweets, based on the listening histories of the user. The Jaccard- there is another long-tail among the distribution of the artist Coefficient is defined in Equation 1 and measures the pro- play-counts in the dataset: there are a few popular artists portion of common items in two sets. occurring in a large number of tweets and many artists are mentioned only in a limited number of tweets. This is shown |Ai ∩ Aj | in Figure 2, where the logarithm of the number of tweets in jaccardi,j = (1) |Ai ∪ Aj | which an artist occurs in (the play-count) is plotted against the number of artists. Thus, this plot states how many For each user, there are two listening histories we take artists are mentioned how often in the dataset. into consideration: the set of all tracks a user listened to and the set of all artists a user listened to. Thus, we are able to compute a artist similartiy (artistSim) and a track similarity (trackSim) as shown in Equations 2 and 3. |artistsi ∩ artistsj | artistSimi,j = (2) |artistsi ∪ artistsj | 4,000 |tracksi ∩ tracksj | trackSimi,j = (3) 1,000 |tracksi ∪ tracksj | log(Number of Tweets) The final user similarity is computed using a weighted average of both, the artistSim and trackSim as depicted in Equation 4. 100 simi,j = wa ∗ artistSimi,j + wt ∗ trackSimi,j (4) The weights wa and wt determine the influence of the 10 artist- and the track listening history on the user similar- ity, where wa + wt = 1. Thus, if wt = 0, only the artist listening history is taken into consideration. We call such a recommender system an artist-based recommender system. Analogously, if wa = 0 we call such a recommender system track-based. If wa > 0 ∧ wt > 0, both the artist- and track 0 5000 10000 15000 20000 Number of Artists listening histories are used. Hence, we facilitate a hybrid recommender system for artist recommendations. The presented weights have to be predetermined. In this Figure 2: Play-Count versus Number of Artists work, we use a grid-search for finding suitable input param- eter for our recommender system as described in Section 4.2. How the presented dataset is used as input- and evaluation data for a music recommender system, is presented in the 4. EVALUATION next Section. In this Section we present the performance of the imple- mented artist recommender system, but also discuss the lim- 3. THE BASELINE RECOMMENDATION AP- itations of the conducted offline evaluation. 
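Equations (1)-(4) translate directly into code. The following Java sketch is our own illustration (it does not use the Mahout API): it computes the Jaccard coefficients over the artist and track listening histories and combines them with the weights wa and wt, where wa + wt = 1.

import java.util.HashSet;
import java.util.Set;

// Sketch of the user similarity from Equations (1)-(4): Jaccard coefficients over
// the artist and track listening histories, combined as a weighted average with
// wa + wt = 1. Identifiers are our own.
public class UserSimilarity {

    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;                              // convention chosen here for two empty histories
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();   // Equation (1)
    }

    static double similarity(Set<String> artistsI, Set<String> artistsJ,
                             Set<String> tracksI, Set<String> tracksJ,
                             double wa, double wt) {
        double artistSim = jaccard(artistsI, artistsJ);        // Equation (2)
        double trackSim = jaccard(tracksI, tracksJ);           // Equation (3)
        return wa * artistSim + wt * trackSim;                 // Equation (4)
    }
}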
PROACH In order to present how the dataset can be applied, we 4.1 Evaluation Setup use our dataset as input and evaluation data for an artist The performance of the recommender system with differ- recommendation system. This recommender system is based ent input parameters was evaluated using precision and re- on the open source machine learning library Mahout[2]. The call. Although we focus on the precision, for the sake of com- performance of this recommender system is shown in Section pleteness we also include the recall into the evaluation, as 4 and serves as a benchmark for future work. this is usual in the field of information retrieval [3]. The met- rics were computed using a Leave-n-Out algorithm, which 3.1 Recommendation Approach can be described as follows: For showing the usefulness of our dataset, we implemented a User-based CF approach. User-based CF recommends 1. Randomly remove n items from the listening history items by solely utilizing past user-item interactions. For the of a user music recommender system, a user-item interaction states 2. Recommend m items to the user that a user listened to a certain track by a certain artist. Thus, the past user-item interactions represent the listening 3. Calculate precision and recall by comparing the m rec- history of a user. In the following, we describe our basic ommended and the n removed items approach taken for computing artist recommendations and provide details about the implementation. 4. Repeat step 1 to 3 p times In order to estimate the similarity of two users, we com- puted a linear combination of the Jaccard-Coefficients [10] 5. Calculate the mean precision and the mean recall 37 Each evaluation in the following Sections has been re- peated five times (p = 5) and the size of the test set was fixed to 10 items (r = 10). Thus, we can evaluate the per- formance of the recommender for recommending up to 10 0.5 items. 4.2 Determining the Input Parameters In order to determine good input parameters for the rec- 0.4 ommender system, a grid search was conducted. Therefore, we define a grid of parameters and the possible combina- Recommender Precision tions are evaluated using a performance measure [9]. In our ● Artist 0.3 case, we relied on the precision of the recommender system Hybrid (cf. Figure 3), as the task of a music recommender system Track is to find a certain number of items a user will listen to (or buy), but not necessarily to find all good items. Precision 0.2 is a reasonable metric for this so called Find Good Items task [8] and was assessed using the explained Leave-n-Out algorithm. For this grid search, we recommended one item 0.1 ● ● ● ● and the size of the test set was fixed to 10 items. In order ● ● ● ● ● to find good input parameters, the following grid parame- ● ● ters determining the computation of the user similarity were altered: 0.0 ● 0 10 20 30 40 50 60 70 80 90 100 • Number of nearest neighbors k k−Nearest Neighbors • Weight of the artist similarity wa Figure 3: Precision and Recall of the Track-Based • Weight of the track similarity wt Recommender The result can be seen in Figure 3. For our dataset it n Precision Recall Upper Bound holds, that the best results are achieved with a track-based 1 0.49 0.05 0.10 recommender system (wa = 0,wt = 1) and 80 nearest neigh- 5 0.23 0.11 0.50 bors (k = 80). 
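A single round of the Leave-n-Out procedure listed above can be sketched as follows. The recommender itself is abstracted behind a placeholder interface and all identifiers are our own; precision is the fraction of the m recommended items that were held out, recall the fraction of the n held-out items that were recommended.

import java.util.*;

// Sketch of one Leave-n-Out round: n held-out items, m recommendations,
// precision = hits/m, recall = hits/n. The recommender is a placeholder
// interface; in the paper a Mahout-based user-based CF system is used.
public class LeaveNOutEvaluation {

    interface Recommender {
        List<String> recommend(Set<String> remainingHistory, int m);
    }

    static double[] evaluateOnce(Set<String> fullHistory, Recommender recommender,
                                 int n, int m, Random rnd) {
        List<String> items = new ArrayList<>(fullHistory);
        Collections.shuffle(items, rnd);
        int cut = Math.min(n, items.size());
        Set<String> heldOut = new HashSet<>(items.subList(0, cut));        // removed from the history
        Set<String> remaining = new HashSet<>(items.subList(cut, items.size()));

        List<String> recommended = recommender.recommend(remaining, m);
        long hits = recommended.stream().filter(heldOut::contains).count();

        double precision = recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
        double recall = heldOut.isEmpty() ? 0.0 : (double) hits / heldOut.size();
        return new double[]{precision, recall};                            // averaged over p repetitions
    }
}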
Thus, for the performance evaluation of the 6 0.20 0.12 0.60 recommender system in the next Section, we use the follow- 7 0.19 0.13 0.70 ing parameters: 10 0.15 0.15 1.00 • Number of nearest neighbors 80 Table 3: Precision and Recall of the Track-Based • Weight of the artist similarity 0 Recommender • Weight of the track similarity 1 As shown in Figure 4, with an increasing number of recom- mendations, the performance of the presented recommender 4.3 Performance of the Baseline Recommender system declines. Thus, for a high number of recommenda- System tions the recommender system is rather limited. This is, In this Section, the performance of the recommender sys- as the chance of false positives increases if the size of the tem using the optimized input parameters is presented. Prior test set is kept constant. For computing the recall metric, to the evaluation, we also examined real implementations the 10 items in the test set are considered as relevant items of music recommender systems: Last.fm, a music discovery (and hence are desirable to recommend to the user). The service, for instance recommends 6 artists6 when display- recall metric describes the fraction of relevant artists who ing a certain artist. If an artist is displayed on Spotify7 , are recommended, i.e., when recommending 5 items, even 7 similar artists are recommended at the first page. This if all items are considered relevant, the maximum recall is number of items also corresponds to the work of Miller [11], still only 50% as 10 items are considered as relevant. Thus, who argues that people are able to process about 7 items at in the evaluation setup, recall is bound by an upper limit, a glance, or rather that the span of attention is too short which is the number of recommended items divided by the for processing long lists of items. The precision@6 and the size of the test set. precision@7 of our recommender are 0.20 and 0.19, respec- tively. In such a setting, 20% of the recommended items 4.4 Limitations of the Evaluation computed by the proposed recommender system would be a Beside discussing the results, it is worth to mention also hit. In other words, a customer should be interested in at two limitations in the evaluation approach: First, only rec- least in two of the recommended artists. An overview about ommendations for items the user already interacted with can the precision@n of the recommender is given in Table 3. be evaluated [5]. If something new is recommended, it can’t 6 http://www.last.fm/music/Lana+Del+Rey be stated whether the user likes the item or not. We can 7 only state that it is not part of the user’s listening history http://play.spotify.com/artist/ 00FQb4jTyendYWaN8pK0wa in our dataset. Thus, this evaluation doesn’t fit to the per- 38 1.0 by monitoring users using the Yahoo! Music Services be- tween 2002 and 2006. Again, the MSD dataset, the Yahoo 0.9 dataset is less recent. Additionally to the ratings, the Yahoo dataset contains genre information which can be exploited 0.8 by a hybrid recommender system. Celma also provides a music dataset, containing data re- 0.7 trieved from last.fm10 , a music discovery service. It con- tains user, artists and play counts as well as the MusicBrainz identifiers for 360,000 users. This dataset was published in Precision / Recall 0.6 Legend 2010 [5]. ● Precision Beside the datasets presented above, which are based on 0.5 ● Recall data of private companies, there exist several datasets based Upper Bound on publicly available information. 
Sources exploited have 0.4 been websites in general [12, 15, 14], Internet radios posting ● their play lists [1] and micro-blogging platforms, in partic- 0.3 ● ular Twitter [17, 13]. However, using these sources has a ● drawback: For cleaning and matching the data, high effort ● 0.2 ● ● is necessary. ● ● ● One of the most similar datasets to the dataset used in 0.1 this work, is the Million Musical Tweets Dataset 11 dataset by Hauger et al. [7]. Like our dataset, it was created using ● 0.0 the Twitter streaming API from September 2011 to April 1 5 10 2013, however, all tweets not containing a geolocation were Number of Recommended Items removed and thus it is much smaller. The dataset con- tains 1,086,808 tweets by 215,375 users. Among the dataset, Figure 4: Precision and Recall of the Track-Based 25,060 unique artists have been identified [7]. Recommender Another dataset based on publicly available data which is similar to the MovieLens dataset, is the MovieTweetings dataset published by Dooms et al. [6]. The MovieTweet- fectly to the intended use of providing recommendations for ings dataset is continually updated and has the same format new artists. However, this evaluation approach enabled us as the MovieLens dataset, in order to foster exchange. At to find the optimal input parameters using a grid search. the moment, a snapshot containing 200,000 ratings is avail- Secondly, as we don’t have any preference values, the as- able12 . The dataset is generated by crawling well-structured sumption that a certain user likes the artist he/she listened tweets and extracting the desired information using regular to, has to be made. expressions. Using this regular expressions, the name of the Both drawbacks can be eliminated by conducting a user- movie, the rating and the corresponding user is extracted. centric evaluation [5]. Thus, in a future work, it would be The data is afterwards linked to the IMDb, the Internet worth to conduct a user-experiment using the optimized rec- Movie Database 13 . ommender system. 6. CONCLUSION AND FUTURE WORK 5. RELATED WORK In this work we have shown that the presented dataset As already mentioned in the introduction, there exist sev- is valuable for evaluating and benchmarking different ap- eral other publicly available datasets suitable for music rec- proaches for music recommendation. We implemented a ommendations. A quick overview of these datasets is given working music recommender systems, however as shown in in this Section. Section 4, for a high number of recommendations the perfor- One of the biggest available music datasets is the Million mance of our baseline recommendation approach is limited. Song Dataset (MSD) [4]. This dataset contains information Thus, we see a need for action at two points: First we will about one million songs from different sources. Beside real enrich the dataset with further context based information user play counts, it provides audio features of the songs and that is available: in this case this can be the time stamp is therefore suitable for CF-, CB- and hybrid recommender or the geolocation. Secondly, hybrid recommender system systems. At the moment, the Taste Profile subset8 of the utilizing this additional context based information are from MSD is bigger than the dataset presented in this work, how- interest. Therefore, in future works, we will focus on the ever it was released 2011 and is therefore not as recent. implementation of such recommender systems and compare Beside the MSD, also Yahoo! 
published big datasets9 con- them to the presented baseline approach. First experiments taining ratings for artists and songs suitable for CF. The were already conducted with a recommender system trying biggest dataset contains 136,000 songs along with ratings to exploit the geolocation. Two different implementations given by 1.8 million users. Additionally, the genre informa- are evaluated at the moment: The first uses the normalized tion is provided in the dataset. The data itself was gathered linear distance between two users for approximating a user 10 8 http://labrosa.ee.columbia.edu/millionsong/ http://www.last.fm 11 tasteprofile available at: http://www.cp.jku.at/datasets/MMTD/ 9 12 available at: http://webscope.sandbox.yahoo.com/ https://github.com/sidooms/MovieTweetings 13 catalog.php?datatype=r http://www.imdb.com 39 similarity. The second one, which in an early stage of eval- [14] M. Schedl, P. Knees, and G. Widmer. Investigating uation seems to be the more promising one, increases the web-based approaches to revealing prototypical music user similarity if a certain distance threshold is underrun. artists in genre taxonomies. In Proceedings of the 1st However, there remains the open question how to determine International Conference on Digital Information this distance threshold. Management (ICDIM 2006), pages 519–524. IEEE, 2006. 7. REFERENCES [15] M. Schedl, C. C. Liem, G. Peeters, and N. Orio. A [1] N. Aizenberg, Y. Koren, and O. Somekh. Build your Professionally Annotated and Enriched Multimodal own music recommender by modeling internet radio Data Set on Popular Music. In Proceedings of the 4th streams. In Proceedings of the 21st International ACM Multimedia Systems Conference (MMSys 2013), Conference on World Wide Web (WWW 2012), pages pages 78–83, February–March 2013. 1–10. ACM, 2012. [16] M. Schedl and D. Schnitzer. Hybrid Retrieval [2] Apache Software Foundation. What is Apache Approaches to Geospatial Music Recommendation. In Mahout?, March 2014. Retrieved July 13, 2014, from Proceedings of the 35th Annual International ACM http://mahout.apache.org. SIGIR Conference on Research and Development in [3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval (SIGIR), 2013. Information Retrieval: The Concepts and Technology [17] E. Zangerle, W. Gassler, and G. Specht. Exploiting behind Search (2nd Edition) (ACM Press Books). twitter’s collective knowledge for music Addison-Wesley Professional, 2 edition, 2011. recommendations. In Proceedings of the 2nd Workshop [4] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and on Making Sense of Microposts (#MSM2012), pages P. Lamere. The million song dataset. In A. Klapuri 14–17, 2012. and C. Leider, editors, Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), pages 591–596. University of Miami, 2011. [5] Ò. Celma. Music Recommendation and Discovery - The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer, 2010. [6] S. Dooms, T. De Pessemier, and L. Martens. Movietweetings: a movie rating dataset collected from twitter. In Workshop on Crowdsourcing and Human Computation for Recommender Systems at the 7th ACM Conference on Recommender Systems (RecSys 2013), 2013. [7] D. Hauger, M. Schedl, A. Kosir, and M. Tkalcic. The million musical tweet dataset - what we can learn from microblogs. In A. de Souza Britto Jr., F. Gouyon, and S. 
Dixon, editors, Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pages 189–194, 2013. [8] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, Jan. 2004. [9] C. W. Hsu, C. C. Chang, and C. J. Lin. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003. [10] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, Feb. 1912. [11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. 62:81–97, 1956. [12] A. Passant. dbrec - Music Recommendations Using DBpedia. In Proceedings of the 9th International Semantic Web Conference (ISWC 2010), volume 6497 of Lecture Notes in Computer Science, pages 209–224. Springer Berlin Heidelberg, 2010. [13] M. Schedl. Leveraging Microblogs for Spatiotemporal Music Information Retrieval. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013), pages 796 – 799, 2013. 40 Incremental calculation of isochrones regarding duration Nikolaus Krismer Günther Specht Johann Gamper University of Innsbruck, University of Innsbruck, Free University of Austria Austria Bozen-Bolzano, Italy nikolaus.krismer@uibk.ac.at guenther.specht@uibk.ac.at gamper@inf.unibz.it ABSTRACT target. The websites enabling such a navigation usually cal- An isochrone in a spatial network is the minimal, possibly culate routes using efficient shortest path (SP) algorithms. disconnected subgraph that covers all locations from where One of the most famous examples of these tools is Google’s a query point is reachable within a given time span and by map service named GoogleMaps1 . For a long time it was a given arrival time [5]. A novel approach for computing possible to calculate routes using one transportation system isochrones in multimodal spatial networks is presented in (by car, by train or by bus) only. This is known as rout- this paper. The basic idea of this incremental calculation is ing within unimodal spatial networks. Recent developments to reuse already computed isochrones when a new request enabled the computation combining various transportation with the same query point is sent, but with different dura- systems within the same route, even if some systems are tion. Some of the major challenges of the new calculation bound to schedules. This has become popular under the attempt are described and solutions to the most problematic term “multimodal routing” (or routing in multimodal spa- ones are outlined on basis of the already established MINE tial networks). and MINEX algorithms. The development of the incremen- Less famous, but algorithmic very interesting, is to find tal calculation is done by using six different cases of com- the answer to the question where someone can travel to in putation. Three of them apply to the MINEX algorithm, a given amount of time starting at a certain time from a which uses a vertex expiration mechanism, and three cases given place. The result is known as isochrone. Within mul- to MINE without vertex expiration. Possible evaluations are timodal spatial networks it has been defined by Gamper et also suggested to ensure the correctness of the incremental al. [5]. Websites using isochrones include Mapnificent2 and calculation. 
In the end some further tasks for future research SimpleFleet3 [4]. are outlined. One major advantage of isochrones is that they can be used for reachability analyses of any kind. They are help- ful in various fields including city planning and emergency Categories and Subject Descriptors management. While some providers, like SimpleFleet and H.2.8 [Database Applications]: Spatial databases and Mapnificent, enable the computation of isochrones based on GIS pre-calculated information or with heuristic data, the cal- culation of isochrones is a non-trivial and time-intense task. Although some improvements to the algorithms that can be General Terms used for isochrone computation have been published at the Algorithms Free University of Bozen-Bolzano in [7], one major drawback is that the task is always performed from scratch. It is not Keywords possible to create the result of a twenty-minute-isochrone (meaning that the travelling time from/to a query point q isochrone, incremental calculation is less than or equal to twenty minutes) based on the re- sult from a 15-minute-isochrone (the travelling time is often 1. INTRODUCTION referred to as maximal duration dmax). The incremental Throughout the past years interactive online maps have calculation could dramatically speed up the computation of become a famous tool for planning routes of any kind. Nowa- isochrones, if there are other ones for the same point q avail- days everybody with access to the internet is able to easily able. This is especially true for long travel times. However, get support when travelling from a given point to a specific the computation based on cached results has not been re- alised until now and is complex. As one could see from figures 1 and 2 it is not sufficient to extend the outline of the isochrone, because there might be some network hubs (e.g. stations of the public transportation system) which extend the isochrone result into new, possibly disconnected areas. Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. 1 http://maps.google.com In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 2 Workshop on Foundations of Databases (Grundlagen von Datenbanken), http://www.mapnificent.net 3 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. http://www.simplefleet.eu 41 troduced by Bauer et. al [1] suffers from high initial loading time and is limited by the available memory, since the entire network is loaded into memory at the beginning. Another algorithm, called Multimodal Incremental Network Expan- sion (MINE), which has been proposed by Gamper et al. [5] overcomes the limitation that the whole network has to be loaded, but is restricted to the size of the isochrone result, since all points in the isochrone are still located in memory. To overcome this limitation, the Multimodal Incremental Network Expansion with vertex eXpiration (MINEX) algo- rithm has been developed by Gamper et al. [6] introducing vertex expiration (also called node expiration). This mech- anism eliminates unnecessary nodes from memory as soon as possible and therefore reduces the memory needed during computation. There are some more routing algorithms that do not load the entire network into the main memory. One well-known, which is not specific for isochrone calculation, but for query processing in spatial networks in general, called Incremental Figure 1: Isochrone with dmax of 10 minutes Euclidean Restriction (IER), has been introduced by Papa- dias [8] in 2003. 
This algorithm loads chunks of the network into memory that are specified by the euclidean distance. The Incremental Network Expiration (INE) algorithm has also been introduced in the publication of Papadias. It is basically an extension of the Dijkstra shortest path algo- rithm. Deng et al. [3] improved the ideas of Papadias et al. accessing less network data to perform the calculations. The open source routing software “pgRouting”4 , which calculates routes on top of the spatial database PostGIS5 (an extension to the well-known relational database PostgreSQL) uses an approach similar to IER. Instead of the euclidean distance it uses the network distance to load the spatial network. In 2013 similar ideas have been applied to MINEX and resulted in an algorithm called Multimodal Range Network Expansion (MRNEX). It has been developed at the Free University of Bozen-Bolzano by Innerebner [7]. Instead of loading the needed data edge-by-edge from the network, it is loaded using chunks, like it is done in IER. Depending on their size this approach is able to reduce the number of network accesses by far and therefore reduces calculation Figure 2: Isochrone with dmax of 15 minutes time. Recently the term “Optimal location queries” has been proposed by some researches like Chen et al. [2]. These This paper presents the calculation of incremental isochrones queries are closely related to isochrones, since they “find a in multimodal spatial networks on top of already developed location for setting up a new server such that the maximum algorithms and cached results. It illustrates some ideas that cost of clients being served by the servers (including the new need to be addressed when extending the algorithms by the server) is minimized”. incremental calculation approach. The remainder of this pa- per is structured as follows. Section 2 includes related work. Section 3 is split into three parts: the first part describes 3. INCREMENTAL CALCULATION challenges that will have to be faced during the implemen- REGARDING ISOCHRONE DURATION tation of incremental isochrones. Possible solutions to the In this paper the MINE and MINEX algorithms are ex- outlined problems are also discussed shortly here. The sec- tended by a new idea that is defined as “incremental cal- ond part deals with different cases that are regarded during culation”. This allows the creation of new results based on computation and how these cases differ, while the third part already computed and cached isochrones with different du- points out some evaluations and tests that will have to be rations, but with the same query point q (defined as base- performed to ensure the correctness of the implementation. isochrones). This type of computation is complex, since it is Section 4 consists of a conclusion and lists some possible not sufficient to extend an isochrone from its border points. future work. In theory it is necessary to re-calculate the isochrone from every node in the spatial network that is part of the base- 2. RELATED WORK isochrone and connected to other nodes. Although this is 4 The calculation of isochrones in multimodal spatial net- http://pgrouting.org 5 works can be done using various algorithms. The method in- http://postgis.net 42 true for a highly connected spatial network it might not be nations can be triggered by a service provider. Traffic jams the only or even best way for a real-world multimodal spatial and similar factors can lead to delays in the transportation network with various transportation systems. 
The isochrone system and thus also have to be considered. Although it calculation based on already known results should be doable should be possible to overcome both limitations or at least with respect to all the isochrone’s border points and all the limit their impact, it will not be further discussed in this public transportation system stations that are part of the paper. base isochrone. These network hubs in reality are the only nodes, which can cause new, possibly disconnected areas to 3.2 Types of calculation become part of an isochrone with different travelling time. There are six different cases that have to be kept in mind As it is important for the incremental calculation, the ver- when calculating an isochrone with travelling time dmax us- tex expiration that is introduced by Gamper et al. in [6] ing a base isochrone with duration dmax_base: three apply- will now be summarized shortly. The aim of the proposed ing to algorithms without vertex expiration and three cases approach is to remove loaded network nodes as soon as pos- for the ones using vertex expiration. sible from memory. However, to keep performance high, nodes should never be double-loaded at any time and there- 3.2.1 Cases dmax = dmax_base fore they should not be eliminated from memory too soon. The first two and most simple cases for the MINE and Removal should only occur when all computations regard- MINEX algorithm, are the ones where dmax is equal to ing the node have been performed. States are assigned to dmax_base. In these cases it is obvious that the calculation every node to assist in finding the optimal timeslot for mem- result can be returned directly without any further modifi- ory elimination. The state of a node can either be “open”, cation. It is not needed to respect expired nodes, since no “closed” or “expired”. Every loaded node is labelled with the (re)calculation needs to be performed. open state in the beginning. If all of its outgoing edges are traversed, its state changes to closed. However, the node 3.2.2 Cases dmax < dmax_base itself has to be kept in memory in order to avoid cyclic The third, also simple case, is the one where dmax is less network expansions. A node reaches the expired state, if than dmax_base for algorithms without vertex expiration. all nodes in its neighbourhood reached the closed or expired In this situation all nodes can be iterated and checked for state. It then can safely be removed from memory and is not suitability. If the duration is less or equal to dmax, then available for further computations without reloading it from the node also belongs to the new result, otherwise it does the network. Since this is problematic for the incremental not. In the fourth case, where the duration is less than calculation approach this aspect is described in more detail. dmax_base and nodes were expired (and therefore are not available in memory any more), the isochrone can be shrunk 3.1 Challenges from its borders. The network hubs do not need any special treatment, since no new areas can become part of the result There are some challenges that need to be addressed when if the available time decreased. The only necessary task is implementing an incremental calculation for the MINE and the recalculation of the durations from the query point to MINEX algorithm. The most obvious problem is related to the nodes in the isochrone and to possibly reload expired the vertex expiration of the MINEX algorithm. If nodes al- nodes. 
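The expiration rule summarized above can be sketched compactly. The Java fragment below illustrates only the state transition (OPEN after loading, CLOSED once all outgoing edges are traversed, EXPIRED once every neighbour is closed or expired); types and names are our own, not taken from the MINEX implementation.

import java.util.List;

// Sketch of the vertex expiration rule: a node becomes CLOSED once all of its
// outgoing edges have been traversed and may be EXPIRED (removed from memory)
// as soon as every neighbour is CLOSED or EXPIRED. Types are illustrative.
public class VertexExpiration {

    enum State { OPEN, CLOSED, EXPIRED }

    static class Node {
        State state = State.OPEN;
        List<Node> neighbours;

        Node(List<Node> neighbours) {
            this.neighbours = neighbours;
        }
    }

    static boolean mayExpire(Node node) {
        if (node.state != State.CLOSED) {
            return false;                        // only closed nodes are candidates for removal
        }
        for (Node neighbour : node.neighbours) {
            if (neighbour.state == State.OPEN) {
                return false;                    // an open neighbour may still need this node
            }
        }
        return true;                             // all neighbours closed or expired
    }
}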
3.2 Types of calculation
There are six different cases that have to be kept in mind when calculating an isochrone with travelling time dmax using a base isochrone with duration dmax_base: three applying to the algorithms without vertex expiration and three for the ones using vertex expiration.

3.2.1 Cases dmax = dmax_base
The first two and most simple cases for the MINE and MINEX algorithms are the ones where dmax is equal to dmax_base. In these cases it is obvious that the calculation result can be returned directly without any further modification. It is not needed to respect expired nodes, since no (re)calculation needs to be performed.

3.2.2 Cases dmax < dmax_base
The third, also simple case is the one where dmax is less than dmax_base for algorithms without vertex expiration. In this situation all nodes can be iterated and checked for suitability. If the duration is less than or equal to dmax, then the node also belongs to the new result, otherwise it does not. In the fourth case, where the duration is less than dmax_base and nodes were expired (and therefore are not available in memory any more), the isochrone can be shrunk from its borders. The network hubs do not need any special treatment, since no new areas can become part of the result if the available time decreased. The only necessary task is the recalculation of the durations from the query point to the nodes in the isochrone and to possibly reload expired nodes. This can be done either from the query point or from the border points. The duration d from the query point q to a network node n is then equal to (assuming that the border point with the minimal distance to n is named bp):

d(q, n) = d(q, bp) − d(bp, n)
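As a small illustration of the shrinking case and of the duration formula above, here is a Python sketch continuing the hypothetical CachedIsochrone structure from the previous sketch; it is not taken from the actual MINE/MINEX code.

```python
def shrink_without_expiration(cache: CachedIsochrone, d_max_new: float) -> dict[int, float]:
    """Case dmax < dmax_base for MINE: iterate all cached nodes and keep those
    whose duration from the query point does not exceed the new travelling time."""
    assert d_max_new < cache.d_max
    return {node: d for node, d in cache.nodes.items() if d <= d_max_new}

def duration_via_border_point(d_q_bp: float, d_bp_n: float) -> float:
    """Recompute d(q, n) = d(q, bp) - d(bp, n) for a node n lying between the
    query point q and the closest border point bp of the base isochrone."""
    return d_q_bp - d_bp_n
```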
3.2.3 Cases dmax > dmax_base
The remaining two cases, where dmax_base is less than dmax, are much more complex. They differ in the fact that new, possibly disconnected areas can become part of the result and therefore it is not sufficient to look at all the base isochrone's border points. The new areas become available as a result of connections caused by network hubs that are often bound to some kind of schedule. A real-world example is a train station where a train is leaving at time t_train due to its schedule and arriving at a remote station at or before time dmax (in fact any time later than dmax_base is feasible). The time t_train has to be later than the arrival time at the station (and after the isochrone's starting time). Since all network hubs are saved with all the needed information to the list l_hubs, it is not of any interest whether the algorithm uses vertex expiration or not. The points located at the isochrone's outline are still in memory. Since only network hubs can create new isochrone areas, it is sufficient to grow the isochrone from its border and from all the network hubs located in the isochrone. The only effect that vertex expiration causes is a smaller memory footprint of the calculation, as it would also do without incremental calculation.

In Table 1 and Table 2 the recently mentioned calculation types are summarised shortly. The six different cases can be distinguished with ease using these two tables.

Table 1: Incremental calculation without vertex expiration (MINE)
  dmax < dmax_base: iterate the nodes of the base isochrone, checking if the travel time is <= dmax
  dmax = dmax_base: no change
  dmax > dmax_base: extend the base isochrone from its border points and the list l_hubs

Table 2: Incremental calculation with vertex expiration (MINEX)
  dmax < dmax_base: shrink the base isochrone from its border
  dmax = dmax_base: no change
  dmax > dmax_base: extend the base isochrone from its border points and the list l_hubs

Although the different types of computation are introduced using the MINE and MINEX algorithms, they also apply to the MRNEX method. When using MRNEX the same basic idea can be used to enable incremental calculations. In addition, the same advantages and disadvantages apply to the incremental calculation using MRNEX compared to MINEX that also apply to the non-incremental setup.
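The six cases of Tables 1 and 2 can be summarised as one dispatch routine. The following Python sketch, again using the hypothetical CachedIsochrone structure from the earlier sketches and with placeholder expansion steps, only illustrates the case distinction, not the actual network expansion.

```python
def incremental_isochrone(cache: CachedIsochrone, d_max: float,
                          vertex_expiration: bool) -> dict[int, float]:
    """Dispatch the six cases of Tables 1 and 2 (illustrative sketch)."""
    if d_max == cache.d_max:
        return dict(cache.nodes)                    # MINE and MINEX: return the cached result
    if d_max < cache.d_max:
        if not vertex_expiration:
            # MINE: iterate the cached nodes and keep those within the new budget.
            return {node: d for node, d in cache.nodes.items() if d <= d_max}
        return shrink_from_border(cache, d_max)     # MINEX: shrink from the border
    return grow_from_border_and_hubs(cache, d_max)  # both: grow from border points and l_hubs

def shrink_from_border(cache: CachedIsochrone, d_max: float) -> dict[int, float]:
    # Placeholder: a real implementation walks inwards from the border points,
    # recomputing durations via d(q, n) = d(q, bp) - d(bp, n) and reloading expired nodes.
    return {node: d for node, d in cache.nodes.items() if d <= d_max}

def grow_from_border_and_hubs(cache: CachedIsochrone, d_max: float) -> dict[int, float]:
    # Placeholder: a real implementation continues the network expansion from every
    # border point and from every hub in cache.hubs with the enlarged budget d_max.
    return dict(cache.nodes)
```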
3.3 Evaluation
The evaluations that will need to be carried out to ensure the correctness of the implementation can be based on freely available datasets such as OpenStreetMap (http://www.openstreetmap.org). Schedules from various public transportation systems could be used; since they might be subject to licensing, it is planned to create some test schedules. This data can then be used as mockups and as a replacement for the license-bound real-world schedules. It is also planned to realise all the described tests in the context of a continuous integration setup. They will therefore be executed automatically, ensuring correctness throughout various software changes.

The basic idea of the evaluation is to calculate incremental isochrones on the basis of isochrones with different durations and to compare them with isochrones calculated without the incremental approach. If both results are exactly the same, the incremental calculation can be regarded as correct.

There will be various tests that need to be executed in order to cover all the different cases described in Section 3.2. As such, all the cases will be performed with and without vertex expiration. The durations of the base isochrones will cover the three cases per algorithm (less than, equal to and greater than the duration of the incrementally calculated isochrone). Additional tests, such as testing for vertex expiration of the incremental calculation result, will be implemented as well. Furthermore, the calculation times of both the incremental and the non-incremental approach will be recorded to allow comparison. The incremental calculation can only be seen as successful if there are situations where it performs better than the common calculation. As mentioned before, this is expected to be true for at least large isochrone durations, since large portions of the spatial network do not need to be loaded then.

Besides these automatically executed tests, it will be possible to perform manual tests using a graphical user interface. This system is under heavy development at the moment and has been named IsoMap. Regardless of its young state, it will enable any user to calculate isochrones with and without the incremental approach and to visually compare the results with each other.

4. CONCLUSION AND FUTURE WORK
In this paper an approach to enable the calculation of isochrones with the help of already known results was presented. The necessary steps will be realised in the near future, so that runtime comparisons between incrementally calculated isochrones and isochrones created without the presented approach will be available shortly. The ideas developed throughout this paper hardly influence the time needed for the calculation of base isochrones. The only additional complexity is generated by storing the list l_hubs besides the base isochrone. However, this is easy to manage and, since the list does not contain any complex data structures, the changes should be doable without any noticeable consequence for the runtime of the algorithms.

Future work will extend the incremental procedure to further calculation parameters, especially to the arrival time, the travelling speed and the query point q of the isochrone. Computations on top of cached results are also realisable for changing arrival times and/or travel speeds. It should even be possible to use base isochrones with completely different query points in the context of the incremental approach. If the isochrone calculation for a duration of twenty minutes reaches a point after five minutes, the 15-minute isochrone of this point has to be part of the computed result (if the arrival times are respected). Therefore, cached results can decrease the algorithm runtimes even for different query points, especially if they are calculated for points that can cause complex calculations, such as airports or train stations.

Open fields that could be addressed include incremental calculation under conditions where public transportation schedules may vary due to trouble in the traffic system. The influence of changes in the underlying spatial networks on the incremental procedure could also be part of future research. It is planned to use the incremental calculation approach to calculate city round trips and to allow the creation of sightseeing tours for tourists with the help of isochrones. This computation will soon be enabled in cities where it is not possible by now. Further improvements regarding the calculation runtime of isochrones can be made as well. In this field, some examinations with different databases and even with different types of databases (in particular graph databases and other NoSQL systems) are planned.

5. REFERENCES
[1] V. Bauer, J. Gamper, R. Loperfido, S. Profanter, S. Putzer, and I. Timko. Computing isochrones in multi-modal, schedule-based transport networks. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '08, pages 78:1–78:2, New York, NY, USA, 2008. ACM.
[2] Z. Chen, Y. Liu, R. C.-W. Wong, J. Xiong, G. Mai, and C. Long. Efficient algorithms for optimal location queries in road networks. In SIGMOD Conference, pages 123–134, 2014.
[3] K. Deng, X. Zhou, H. Shen, S. Sadiq, and X. Li. Instance optimal query processing in spatial networks. The VLDB Journal, 18(3):675–693, 2009.
[4] A. Efentakis, N. Grivas, G. Lamprianidis, G. Magenschab, and D. Pfoser. Isochrones, traffic and demographics. In SIGSPATIAL/GIS, pages 538–541, 2013.
[5] J. Gamper, M. Böhlen, W. Cometti, and M. Innerebner. Defining isochrones in multimodal spatial networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2381–2384, New York, NY, USA, 2011. ACM.
[6] J. Gamper, M. Böhlen, and M. Innerebner. Scalable computation of isochrones with network expiration. In A. Ailamaki and S. Bowers, editors, Scientific and Statistical Database Management, volume 7338 of Lecture Notes in Computer Science, pages 526–543. Springer Berlin Heidelberg, 2012.
[7] M. Innerebner. Isochrone in Multimodal Spatial Networks. PhD thesis, Free University of Bozen-Bolzano, 2013.
[8] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In Proceedings of the 29th International Conference on Very Large Data Bases, VLDB '03, pages 802–813. VLDB Endowment, 2003.

Software Design Approaches for Mastering Variability in Database Systems
David Broneske, Sebastian Dorok, Veit Köppen, Andreas Meister (author names are in lexicographical order)
Otto-von-Guericke-University Magdeburg, Institute for Technical and Business Information Systems, Magdeburg, Germany
firstname.lastname@ovgu.de

ABSTRACT
For decades, database vendors have developed traditional database systems for different application domains with highly differing requirements. These systems are extended with additional functionalities to make them applicable for yet another data-driven domain. The database community observed that these "one size fits all" systems provide poor performance for special domains; systems that are tailored for a single domain usually perform better, have a smaller memory footprint, and consume less energy. These advantages do not only originate from different requirements, but also from differences within individual domains, such as using a certain storage device. However, implementing specialized systems means re-implementing large parts of a database system again and again, which is neither feasible for many customers nor efficient in terms of costs and time. To overcome these limitations, we envision applying techniques known from software product lines to database systems in order to provide tailor-made and robust database systems for nearly every application scenario with reasonable effort in cost and time.

General Terms
Database, Software Engineering

Keywords
Variability, Database System, Software Product Line

1. INTRODUCTION
In recent years, data management has become increasingly important in a variety of application domains, such as automotive engineering, life sciences, and web analytics. Every application domain has its unique functional and non-functional requirements, leading to a great diversity of database systems (DBSs). For example, automotive data management requires DBSs with small storage and memory consumption to deploy them on embedded devices. In contrast, big-data applications, such as in the life sciences, require large-scale DBSs which exploit the newest hardware trends, e.g., vectorization and SSD storage, to efficiently process and manage petabytes of data [8]. Exploiting variability to design a tailor-made DBS for applications while making the variability manageable, that is, keeping maintenance effort, time, and cost reasonable, is what we call mastering variability in DBSs.

Currently, DBSs are designed either as one-size-fits-all DBSs, meaning that all possible use cases or functionalities are integrated at implementation time into a single DBS, or as specialized solutions. The first approach does not scale down, for instance, to embedded devices. The second approach leads to situations where, for each new application scenario, data management is reinvented to overcome resource restrictions, new requirements, and rapidly changing hardware. This usually leads to an increased time to market, high development cost, as well as high maintenance cost. Moreover, both approaches provide limited capabilities for managing variability in DBSs. For that reason, software product line (SPL) techniques could be applied to the data management domain. In SPLs, variants are concrete programs that satisfy the requirements of a specific application domain [7]. With this, we are able to provide tailor-made and robust DBSs for various use cases. Initial results in the context of embedded systems expose the benefits of applying SPLs to DBSs [17, 22].

The remainder of this paper is structured as follows: In Section 2, we describe variability in a database system regarding hardware and software. We review three approaches to design DBSs in Section 3, namely the one-size-fits-all, the specialization, and the SPL approach. Moreover, we compare these approaches w.r.t. robustness and maturity of the provided DBSs, the effort of managing variability, and the level of tailoring for specific application domains. Because of the superiority of the SPL approach, we argue to apply this approach to the implementation process of a DBS. Hence, we provide research questions in Section 4 that have to be answered to realize the vision of mastering variability in DBSs using SPL techniques.

2. VARIABILITY IN DATABASE SYSTEMS
Variability in a DBS can be found in software as well as in hardware. Hardware variability is given due to the use of different devices with specific properties for data processing and storage. Variability in software is reflected by different functionalities that have to be provided by the DBS for a specific application. Additionally, the combination of hardware and software functionality for concrete application domains increases variability.

2.1 Hardware
In the past decade, the research community exploited arising hardware features by tailor-made algorithms to achieve optimized performance. These algorithms effectively utilize, e.g., caches [19] or vector registers of Central Processing Units (CPUs) using AVX [27] and SSE instructions [28]. Furthermore, the usage of co-processors for accelerating data processing opens up another dimension [12]. In the following, we consider processing and storage devices and sketch the variability arising from their different properties.
2.1.1 Processing Devices
To sketch the heterogeneity of current systems, possible (co-)processors are summarized in Figure 1. Current systems do not only include a CPU or an Accelerated Processing Unit (APU), but also co-processors, such as Many Integrated Cores (MICs), Graphical Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). In the following, we give a short description of the varying processor properties. A more extensive overview is presented in our recent work [5].

Figure 1: Future system architecture [23] (CPU, APU, MIC, GPU, and FPGA attached to main memory via front-side bus, memory bus, and PCIe bus; HDD and SSD attached via the I/O controller)

Central Processing Unit: Nowadays, CPUs consist of several independent cores, enabling parallel execution of different calculations. CPUs use pipelining, Single Instruction Multiple Data (SIMD) capabilities, and branch prediction to efficiently process conditional statements (e.g., if statements). Hence, CPUs are well suited for control-intensive algorithms.

Graphical Processing Unit: Providing larger SIMD registers and a higher number of cores than CPUs, GPUs offer a higher degree of parallelism compared to CPUs. In order to perform calculations, data has to be transferred from main memory to GPU memory. GPUs offer their own memory hierarchy with different memory types.

Accelerated Processing Unit: APUs were introduced to combine the advantages of CPUs and GPUs by including both on one chip. Since the APU can directly access main memory, the transfer bottleneck of dedicated GPUs is eliminated. However, due to space limitations, considerably fewer GPU cores fit on the APU die compared to a dedicated GPU, leading to reduced computational power compared to dedicated GPUs.

Many Integrated Core: MICs use several integrated and interconnected CPU cores. With this, MICs offer high parallelism while still featuring CPU properties. However, similar to the GPU, MICs suffer from the transfer bottleneck.

Field Programmable Gate Array: FPGAs are programmable stream processors, providing only a limited storage capacity. They consist of several independent logic cells consisting of a storage unit and a lookup table. The interconnect between logic cells and the lookup tables can be reprogrammed during run time to perform any possible function (e.g., sorting, selection).

2.1.2 Storage Devices
Similar to the processing devices, current systems offer a variety of different storage devices used for data processing. In this section, we discuss different properties of current storage devices.

Hard Disk Drive: The Hard Disk Drive (HDD), as a non-volatile storage device, consists of several disks. The disks of an HDD rotate, while a movable head reads or writes information. Hence, sequential access patterns are well supported, in contrast to random accesses.

Solid State Drive: Since no mechanical units are used, Solid State Drives (SSDs) support random access without high delay. For this, SSDs use flash memory to persistently store information [20]. Each write wears out the flash cells. Consequently, the write patterns of database systems must be changed compared to HDD-based systems.

Main Memory: When using main memory as the main storage, the access gap between primary and secondary storage is removed, introducing main-memory access as the new bottleneck [19]. However, main-memory systems cannot omit secondary storage types completely, because main memory is volatile. Thus, efficient persistence mechanisms are needed for main-memory systems.

To conclude, current architectures offer several different processor and storage types. Each type has a unique architecture and specific characteristics. Hence, to ensure high performance, the processing characteristics of the processors as well as the access characteristics of the underlying storage devices have to be considered. For example, if several processing devices are available within a DBS, the DBS must provide suitable algorithms and functionality to fully utilize all available devices to provide peak performance.

2.2 Software Functionality
Besides hardware, DBS functionality is another source of variability in a DBS. In Figure 2, we show an excerpt of DBS functionalities and their dependencies. For example, for different application domains different query types might be interesting. However, to improve performance or development cost, only the required query types should be used within a system. This example can be extended to other functional requirements. Furthermore, a DBS provides database operators, such as aggregation functions or joins. Thereby, database operators perform differently depending on the used storage and processing model [1]. For example, row stores are very efficient when complete tuples should be retrieved, while column stores in combination with operator-at-a-time processing enable fast processing of single columns [18]. Another technique to enable efficient access to data is to use index structures. Thereby, the choice of an appropriate index structure for the specific data and query types is essential to guarantee the best performance [15, 24]. Note that we omit comprehensive relationships between functionality properties in Figure 2 due to complexity. Some functionalities are mandatory in a DBS and others are optional, such as support for transactions. Furthermore, it is possible that some alternatives can be implemented together and others only exclusively.

Figure 2: Excerpt of DBMS functionality as a feature diagram (mandatory, optional, OR, and XOR features: query types, storage model, processing model, operators such as join, selection, sorting, and grouping, and transactions)
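To illustrate what such a feature diagram expresses, the following Python sketch encodes a small, hypothetical excerpt of Figure 2 (mandatory features and XOR groups) and checks whether a feature selection is a valid product; real SPL tooling offers far richer analyses.

```python
# Hypothetical, simplified encoding of part of Figure 2; the feature names are taken
# from the diagram, the rules and function names are illustrative only.
MANDATORY = {"Query Type", "Storage Model", "Processing Model", "Operator"}
XOR_GROUPS = {
    "Storage Model": {"Row Store", "Column Store"},
    "Processing Model": {"Operator-at-a-time", "Tuple-at-a-time", "Vectorized Processing"},
}

def is_valid_product(selection: set[str]) -> bool:
    """A selection is valid if all mandatory features are present and every
    selected XOR group contains exactly one chosen alternative."""
    if not MANDATORY <= selection:
        return False
    return all(len(selection & alternatives) == 1
               for parent, alternatives in XOR_GROUPS.items() if parent in selection)

# A column-store product with operator-at-a-time processing and no transactions:
print(is_valid_product({"Query Type", "Storage Model", "Column Store",
                        "Processing Model", "Operator-at-a-time", "Operator"}))  # True
```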
2.3 Putting it all together
So far, we considered variability in hardware and software functionality separately. When using a DBS for a specific application domain, we also have to consider special requirements of this domain as well as the interaction between hardware and software.

Special requirements comprise functional as well as non-functional ones. Examples of functional requirements are user-defined aggregation functions (e.g., to perform genome analysis tasks directly in a DBS [9]). Other applications require support for spatial queries, such as geo-information systems. Thus, special data types as well as index structures are required to support these queries efficiently.

Besides performance, memory footprint and energy efficiency are other examples of non-functional requirements. For example, a DBS for embedded devices must have a small memory footprint due to resource restrictions. For that reason, unnecessary functionality is removed and data processing is implemented as memory-efficiently as possible. In this scenario, tuple-at-a-time processing is preferred, because intermediate results during data processing are smaller than in operator-at-a-time processing, which leads to less memory consumption [29].

In contrast, in large-scale data processing, operators should perform as fast as possible by exploiting the underlying hardware and available indexes. Thereby, exploiting the underlying hardware is another source of variability, as different processing devices have different characteristics regarding processing model and data access [6]. To illustrate this fact, we depict different storage models for a DBS in Figure 2. For example, column storage is preferred on GPUs, because row storage leads to an inefficient memory access pattern that deteriorates the possible performance benefits of GPUs [13].

3. APPROACHES TO DESIGN TAILOR-MADE DATABASE SYSTEMS
The variability in hardware and software of DBSs can be exploited to tailor database systems for nearly every database-application scenario. For example, a DBS for high-performance analysis can exploit the newest hardware features, such as SIMD, to speed up analysis workloads. Moreover, we can meet limited space requirements in embedded systems by removing unnecessary functionality [22], such as the support for range queries. However, exploiting variability is only one part of mastering variability in DBSs. The second part is to manage variability efficiently to reduce development and maintenance effort.

In this section, we first describe three different approaches to design and implement DBSs. Then, we compare these approaches regarding their applicability to arbitrary database scenarios. Moreover, we assess the effort to manage variability in DBSs. Besides managing and exploiting the variability in database systems, we also consider the robustness and correctness of tailor-made DBSs created by using the discussed approaches.

3.1 One-Size-Fits-All Design Approach
One way to design database systems is to integrate all conceivable data management functionality into one single DBS. We call this approach the one-size-fits-all design approach and DBSs designed according to this approach one-size-fits-all DBSs. Thereby, support for hardware features as well as DBMS functionality is integrated into one code base. Thus, one-size-fits-all DBSs provide a rich set of functionality. Examples of database systems that follow the one-size-fits-all approach are PostgreSQL, Oracle, and IBM DB2. As one-size-fits-all DBSs are monolithic software systems, the implemented functionality is highly interconnected on the code level. Thus, removing functionality is mostly not possible. DBSs that follow the one-size-fits-all design approach aim at providing a comprehensive set of DBS functionality to deal with most database application scenarios. The claim for generality often introduces functional overhead that leads to performance losses. Moreover, customers pay for functionality they do not really need.

3.2 Specialization Design Approach
In contrast to one-size-fits-all DBSs, DBSs can also be designed and developed to fit very specific use cases. We call this design approach the specialization design approach and DBSs designed accordingly specialized DBSs. Such DBSs are designed to provide only the functionality that is needed for the respective use case, such as text processing, data warehousing, or scientific database applications [25]. Specialized DBSs are often completely redesigned from scratch to meet application requirements and do not follow common design considerations for database systems, such as locking and latching to guarantee multi-user access [25]. Specialized DBSs remove the overhead of unneeded functionality. Thus, developers can highly focus on exploiting hardware and functional variability to provide tailor-made DBSs that meet high-performance criteria or limited storage space requirements. Therefore, huge parts of the DBS (if not all) must be newly developed, implemented, and tested, which leads to duplicate implementation efforts, and thus, increased development costs.

3.3 Software Product Line Design Approach
In the specialization design approach, a new DBS must be developed and implemented from scratch for every conceivable database application. To avoid this overhead, the SPL design approach reuses already implemented and tested parts of a DBS to create a tailor-made DBS.

Figure 3: Managing Variability (the SPL workflow from domain analysis with a feature model, over the domain implementation of the single features, to customization by a feature selection and the final product generation, illustrated with FAME-DBMS features such as OS-Abstraction, Buffer Manager, Storage, Data Dictionary, Data Types, Index, and B+-Tree)

To make use of SPL techniques, a special workflow has to be followed, which is sketched in Figure 3 [2]. At first, the domain is modeled, e.g., by using a feature model – a tree-like structure representing features and their dependencies. With this, the variability is captured and implementation artifacts can be derived for each feature. The second step, the domain implementation, is to implement each feature using a compositional or annotative approach. The third step of the workflow is to customize the product – in our case, the database system – which will be generated afterwards. By using the SPL design approach, we are able to implement a database system from a set of features which are mostly already provided. In the best case, only non-existing features must be implemented. Thus, the feature pool constantly grows and features can be reused in other database systems. Applying this design approach to DBSs enables us to create DBSs tailored for specific use cases while reducing functional overhead as well as development time. Thus, the SPL design approach aims at the middle ground between the one-size-fits-all and the specialization design approach.
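The customization and product-generation step of this workflow can be pictured with a toy Python sketch: each implemented feature contributes an implementation artifact, and a product is composed from the artifacts of the selected features. The mapping and function names below are illustrative and not taken from FAME-DBMS or any concrete SPL tool.

```python
# Toy illustration of "feature selection -> product generation"; artifacts are reduced
# to code snippets here, whereas real SPLs compose classes, refinements, or #ifdef blocks.
ARTIFACTS = {
    "Storage": "class Storage { /* page put/get */ };",
    "B+-Tree": "class Btree : public PrimaryIndex { /* ... */ };",
    "Buffer Manager": "class BufferManager { /* LRU replacement */ };",
}

def generate_product(selected_features: list[str]) -> str:
    """Compose the artifacts of the selected features into one product."""
    missing = [f for f in selected_features if f not in ARTIFACTS]
    if missing:
        raise ValueError(f"features without implementation artifacts: {missing}")
    return "\n".join(ARTIFACTS[f] for f in selected_features)

print(generate_product(["Storage", "B+-Tree"]))
```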
3.4 Characterization of Design Approaches
In this section, we characterize the three design approaches discussed above regarding:
a) general applicability to arbitrary database applications,
b) effort for managing variability, and
c) maturity of the deployed database system.

Although the one-size-fits-all design approach aims at providing a comprehensive set of DBS functionality to deal with most database application scenarios, a one-size-fits-all database is not applicable to use cases in automotive, embedded, and ubiquitous computing. As soon as tailor-made software is required to meet especially storage limitations, one-size-fits-all database systems cannot be used. Moreover, specialized database systems for one specific use case outperform one-size-fits-all database systems by orders of magnitude [25]. Thus, although one-size-fits-all database systems can be applied, they are often not the best choice regarding performance. For that reason, we consider the applicability of one-size-fits-all database systems to arbitrary use cases as limited. In contrast, specialized database systems have a very good applicability, as they are designed for that purpose. The applicability of the SPL design approach is good, as it also creates database systems tailor-made for specific use cases. Moreover, the SPL design approach explicitly considers variability during software design and implementation and provides methods and techniques to manage it [2]. For that reason, we assess the effort of managing variability with the SPL design approach as lower than managing variability using a one-size-fits-all or specialized design approach.

We assess the maturity of one-size-fits-all database systems as very good, as these systems have been developed and tested over decades. Specialized database systems are mostly implemented from scratch, so the possibility of errors in the code is rather high, leading to a moderate maturity and robustness of the software. The SPL design approach also enables the creation of tailor-made database systems, but from approved features that are already implemented and tested. Thus, we assess the maturity of database systems created via the SPL design approach as good.

In Table 1, we summarize our assessment of the three software design approaches regarding the above criteria.

Table 1: Characteristics of approaches
  Criteria               One-Size-Fits-All   Specialization   SPL
  a) Applicability       −                   ++               +
  b) Management effort   −                   −                +
  c) Maturity            ++                  o                +
  Legend: ++ = very good, + = good, o = moderate, − = limited

The one-size-fits-all and the specialization design approach are each very good in one of the three categories. The one-size-fits-all design approach provides robust and mature DBSs. The specialization design approach provides the greatest applicability and can be used for nearly every use case. The SPL design approach, in contrast, provides a balanced assessment regarding all criteria. Thus, against the backdrop of increasing variability due to an increasing variety of use cases and hardware, while guaranteeing mature and robust DBSs, the SPL design approach should be applied to develop future DBSs. Otherwise, the development costs for yet another DBS which has to meet the special requirements of the next data-driven domain will limit the use of DBSs in such fields.
4. ARISING RESEARCH QUESTIONS
Our assessment in the previous section shows that the SPL design approach is the best choice for mastering variability in DBSs. To the best of our knowledge, the SPL design approach has been applied to DBSs only in academic settings (e.g., in [22]). This previous research was based on BerkeleyDB. Although BerkeleyDB offers the essential functionality of a DBS (e.g., a processing engine), several functionalities of relational DBSs were missing (e.g., optimizer, SQL interface). Although these missing functionalities were partially researched (e.g., the storage manager [16] and the SQL parser [26]), no holistic evaluation of a DBS SPL is available. Especially the optimizer in a DBS (e.g., the query optimizer), with its huge number of crosscutting concerns, is currently not considered in research. So, there is still the need for research to fully apply SPL techniques to all parts of a DBS. Specifically, we need methods for modeling variability in DBSs as well as efficient implementation techniques and methods for implementing variability-aware database operations.

4.1 Modeling
For modeling variability in feature-oriented SPLs, feature models are the state of the art [4]. A feature model is a set of features whose dependencies are hierarchically modeled. Since variability in DBSs comprises hardware, software, and their interaction, the following research questions arise:

RQ-M1: What is a good granularity for modeling a variable DBS?
In order to define an SPL for DBSs, we have to model the features of a DBS. Such features can be modeled with different levels of granularity [14]. Thus, we have to find an applicable level of granularity for modeling our SPL for DBSs. Moreover, we also have to consider the dependencies between hardware and software. Furthermore, we have to find a way to model the hardware and these dependencies. In this context, another research question emerges:

RQ-M2: What is the best way to model hardware and its properties in an SPL?
Hardware has become very complex, and researchers demand to develop a better understanding of the impact of hardware on algorithm performance, especially when parallelized [3, 5]. Thus, the question arises which properties of the hardware are worth being captured in a feature model. Furthermore, when thinking about numerical properties, such as CPU frequency or the amount of memory, we have to find a suitable technique to represent them in feature models. One possibility are attributes of extended feature models [4], which have to be explored for applicability.
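One conceivable direction for RQ-M2, not prescribed by the paper, is to attach numerical hardware properties as attributes to features, in the spirit of extended feature models; the following Python sketch with hypothetical names and thresholds only illustrates the idea.

```python
from dataclasses import dataclass

@dataclass
class HardwareFeature:
    """A feature annotated with numerical attributes, as in extended feature models."""
    name: str
    attributes: dict[str, float]   # e.g., number of cores, SIMD width, memory size

cpu = HardwareFeature("CPU", {"cores": 8, "simd_width": 8, "memory_gb": 64})
gpu = HardwareFeature("GPU", {"cores": 2048, "simd_width": 32, "memory_gb": 4})

def prefers_column_store(device: HardwareFeature) -> bool:
    # Toy rule following the observation above that column storage suits GPUs better;
    # the threshold is purely illustrative.
    return device.attributes.get("simd_width", 1.0) >= 16

print(prefers_column_store(cpu), prefers_column_store(gpu))   # False True
```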
4.2 Implementing
In the literature, there are several methods for implementing an SPL. However, most of them are not applicable to our use case. Databases rely on highly tuned operations to achieve peak performance. Thus, variability-enabled implementation techniques must not harm performance, which leads to the research question:

RQ-I1: What is a good variability-aware implementation technique for an SPL of DBSs?
Many state-of-the-art implementation techniques are based on inheritance or additional function calls, which cause performance penalties. A technique that allows for variability without performance penalties is preprocessor directives. However, maintaining preprocessor-based SPLs is notoriously hard, which has earned this approach the name #ifdef Hell [11, 10]. So, there is a trade-off between performance and maintainability [22], but also granularity [14]. It could be beneficial to prioritize maintainability for some parts of a DBS and performance for others.

RQ-I2: How to combine different implementation techniques for SPLs?
If the answer to RQ-I1 is to use different implementation techniques within the same SPL, we have to find an approach to combine these. For example, database operators and their different hardware optimizations must be implemented using annotative approaches for performance reasons, but the query optimizer can be implemented using compositional approaches supporting maintainability; the SPL product generator has to be aware of these different implementation techniques and their interactions.

RQ-I3: How to deal with functionality extensions?
Thinking about changing requirements during the usage of the DBS, we should be able to extend the functionality in case the user requirements change. Therefore, we have to find a solution to deploy updates from an extended SPL in order to integrate the newly requested functionality into a running DBS. Some ideas are presented in [21]; however, due to the increased complexity of hardware and software requirements, an adaption or extension is necessary.

4.3 Customization
In the final customization, the features of the product line that apply to the current use case are selected. State-of-the-art approaches just list available features and show which features are still available for further configuration. However, in our scenario, it could be helpful to get further information about the configuration possibilities. Thus, another research question is:

RQ-C1: How to support the user to obtain the best selection?
In fact, it is possible to help the user in identifying suitable configurations for his use case. If he starts to select functionality that has to be provided by the generated system, we can give him advice on which hardware yields the best performance for his algorithms. However, to achieve this, we have to investigate another research question:

RQ-C2: How to find the optimal algorithms for a given hardware?
To answer this research question, we have to investigate the relation between algorithmic design and the impact of the hardware on the execution. Hence, suitable properties of algorithms have to be identified that influence performance on the given hardware, e.g., access patterns, sizes of used data structures, or result sizes.

5. CONCLUSIONS
DBSs are used for more and more use cases. However, with an increasing diversity of use cases and increasing heterogeneity of available hardware, it is getting more challenging to design an optimal DBS while guaranteeing low implementation and maintenance effort at the same time. To solve this issue, we review three design approaches, namely the one-size-fits-all, the specialization, and the software product line design approach. By comparing these three design approaches, we conclude that the SPL design approach is a promising way to master variability in DBSs and to provide mature data management solutions with reduced implementation and maintenance effort. However, there is currently no comprehensive software product line in the field of DBSs available. Thus, we present several research questions that have to be answered to fully apply the SPL design approach to DBSs.

6. ACKNOWLEDGMENTS
This work has been partly funded by the German BMBF under Contract No. 13N10818 and Bayer Pharma AG.

7. REFERENCES
[1] D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. Row-stores: How Different Are They Really? In SIGMOD, pages 967–980. ACM, 2008.
[2] S. Apel, D. Batory, C. Kästner, and G. Saake. Feature-Oriented Software Product Lines. Springer, 2013.
[3] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB, 7(1):85–96, 2013.
[4] D. Benavides, S. Segura, and A. Ruiz-Cortés. Automated Analysis of Feature Models 20 Years Later: A Literature Review. Inf. Sys., 35(6):615–636, 2010.
[5] D. Broneske, S. Breß, M. Heimel, and G. Saake. Toward Hardware-Sensitive Database Operations. In EDBT, pages 229–234, 2014.
[6] D. Broneske, S. Breß, and G. Saake. Database Scan Variants on Modern CPUs: A Performance Study. In IMDM@VLDB, 2014.
[7] K. Czarnecki and U. W. Eisenecker. Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co., 2000.
[8] S. Dorok, S. Breß, H. Läpple, and G. Saake. Toward Efficient and Reliable Genome Analysis Using Main-Memory Database Systems. In SSDBM, pages 34:1–34:4. ACM, 2014.
[9] S. Dorok, S. Breß, and G. Saake. Toward Efficient Variant Calling Inside Main-Memory Database Systems. In BIOKDD-DEXA. IEEE, 2014.
[10] J. Feigenspan, C. Kästner, S. Apel, J. Liebig, M. Schulze, R. Dachselt, M. Papendieck, T. Leich, and G. Saake. Do Background Colors Improve Program Comprehension in the #ifdef Hell? Empir. Softw. Eng., 18(4):699–745, 2013.
[11] J. Feigenspan, M. Schulze, M. Papendieck, C. Kästner, R. Dachselt, V. Köppen, M. Frisch, and G. Saake. Supporting Program Comprehension in Large Preprocessor-Based Software Product Lines. IET Softw., 6(6):488–501, 2012.
[12] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. TODS, 34(4):21:1–21:39, 2009.
[13] B. He and J. X. Yu. High-throughput Transaction Executions on Graphics Processors. PVLDB, 4(5):314–325, Feb. 2011.
[14] C. Kästner, S. Apel, and M. Kuhlemann. Granularity in Software Product Lines. In ICSE, pages 311–320. ACM, 2008.
[15] V. Köppen, M. Schäler, and R. Schröter. Toward Variability Management to Tailor High Dimensional Index Implementations. In RCIS, pages 452–457. IEEE, 2014.
[16] T. Leich, S. Apel, and G. Saake. Using Step-wise Refinement to Build a Flexible Lightweight Storage Manager. In ADBIS, pages 324–337. Springer-Verlag, 2005.
[17] J. Liebig, S. Apel, C. Lengauer, and T. Leich. RobbyDBMS: A Case Study on Hardware/Software Product Line Engineering. In FOSD, pages 63–68. ACM, 2009.
[18] A. Lübcke, V. Köppen, and G. Saake. Heuristics-based Workload Analysis for Relational DBMSs. In UNISCON, pages 25–36. Springer, 2012.
[19] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Database Architecture for the New Bottleneck: Memory Access. VLDB J., 9(3):231–246, 2000.
[20] R. Micheloni, A. Marelli, and K. Eshghi. Inside Solid State Drives (SSDs). Springer, 2012.
[21] M. Rosenmüller. Towards Flexible Feature Composition: Static and Dynamic Binding in Software Product Lines. Dissertation, University of Magdeburg, Germany, June 2011.
[22] M. Rosenmüller, N. Siegmund, H. Schirmeier, J. Sincero, S. Apel, T. Leich, O. Spinczyk, and G. Saake. FAME-DBMS: Tailor-made Data Management Solutions for Embedded Systems. In SETMDM, pages 1–6. ACM, 2008.
[23] M. Saecker and V. Markl. Big Data Analytics on Modern Hardware Architectures: A Technology Survey. In eBISS, pages 125–149. Springer, 2012.
[24] M. Schäler, A. Grebhahn, R. Schröter, S. Schulze, V. Köppen, and G. Saake. QuEval: Beyond High-Dimensional Indexing à la Carte. PVLDB, 6(14):1654–1665, 2013.
[25] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). In VLDB, pages 1150–1160, 2007.
[26] S. Sunkle, M. Kuhlemann, N. Siegmund, M. Rosenmüller, and G. Saake. Generating Highly Customizable SQL Parsers. In SETMDM, pages 29–33. ACM, 2008.
[27] T. Willhalm, I. Oukid, I. Müller, and F. Faerber. Vectorizing Database Column Scans with Complex Predicates. In ADMS@VLDB, pages 1–12, 2013.
[28] J. Zhou and K. A. Ross. Implementing Database Operations Using SIMD Instructions. In SIGMOD, pages 145–156. ACM, 2002.
[29] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, CWI Amsterdam, 2009.

PageBeat – Time Series Analysis and Databases
Andreas Finger, Ilvio Bruder, Andreas Heuer (Institut für Informatik, Universität Rostock, 18051 Rostock; andreas.finger@uni-rostock.de, ilvio.bruder@uni-rostock.de, andreas.heuer@uni-rostock.de)
Steffen Konerow, Martin Klemkow (Mandarin Medien GmbH, Graf-Schack-Allee 9, 19053 Schwerin; sk@mandarin-medien.de, mk@mandarin-medien.de)

ABSTRACT
Time series data and their analysis are an important means for assessment, control, and prediction in many application areas. For time series analysis there is a large number of methods and techniques which are implemented in statistics software and can nowadays be used comfortably without any implementation effort of one's own. In most cases one has to deal with massive amounts of data or even data streams. Accordingly, there are specialized management tools, such as data stream management systems for processing data streams or time series databases for storing and querying time series. The following article gives a short overview of this area and, in particular, illustrates its applicability in a project for analysing and predicting the state of web servers. The challenge within this project, "PageBeat", is to analyse massive numbers of time series in real time and to store them for further analysis processes. In addition, the results have to be prepared and visualized for specific target groups and notifications have to be triggered. The article describes the approach chosen in the project and the techniques and tools employed for it.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics – complexity measures, performance measures

General Terms
Big Data, Data Mining and Knowledge Discovery, Streaming Data

Keywords
Data analysis, R, Time Series Database
1. INTRODUCTION
Time series are naturally ordered sequences of observation values. Time series analysis is concerned with methods for describing such data, for example with the goal of analysing (understanding), predicting, or controlling (steering) the data. Corresponding methods are available in free and commercial statistics software such as R (a programming language for statistical computing and visualization by the R Foundation for Statistical Computing, http://www.r-project.org), Matlab (commercial software for solving and visualizing mathematical problems by The Mathworks, http://www.mathworks.de), Weka [7] (the Waikato Environment for Knowledge Analysis, a toolbox for data mining and machine learning by the University of Waikato, http://www.cs.waikato.ac.nz/ml/weka/), SPSS (commercial statistics and analytics software by IBM, http://www-01.ibm.com/software/de/analytics/spss), and others, which makes comfortable data analysis possible without any implementation effort of one's own. Typical methods of time series analysis are the determination of trend and seasonality, where the trend represents the longer-term increase and the seasonality represents recurring patterns (every year at Christmas, sales go up). In this way, dependencies in the data are examined which enable a forecast of future values with the help of suitable models.

In an application that records a large number of measurements at high temporal resolution, large amounts of data accumulate quickly. These data have to be analysed in real time and, if required, stored persistently for further evaluation. For this there are, on the one hand, approaches from data stream processing and, on the other hand, database systems specialized in storing time series (time series databases). Since statistical analyses with, for example, stand-alone R applications only work as long as the data to be analysed does not exceed the size of main memory, it is necessary to integrate the statistical analysis into database systems. The goal is transparent access to partitioned data and their analysis by means of partitioned statistical models. In [6], different options for such an integration are described, and they have already been implemented in prototypes based on PostgreSQL. Commercial products such as Oracle R Enterprise [4] also integrate statistical analysis at the database level. In the open-source area there is a multitude of approaches for dealing with time series, among which InfluxDB (an open-source distributed time series database with no external dependencies, http://influxdb.com) struck us as a particularly suitable tool.

The challenge within the project "PageBeat" described in the following is to combine innovative and production-ready open-source solutions from the mentioned areas for processing large amounts of time series data within the project. In the following, the project is introduced, then the various candidate techniques are discussed, and finally the chosen concept and first results are presented.
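As a pointer to what trend and seasonality extraction looks like in practice, here is a small Python sketch on synthetic data; the project itself performs this kind of exploration in R, so the library choice (pandas/statsmodels) is only illustrative.

```python
# Decompose a synthetic hourly series into trend, seasonal, and residual components.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2014-01-01", periods=24 * 28, freq="H")      # four weeks, hourly
values = (np.linspace(0, 5, len(idx))                              # slow upward trend
          + 2 * np.sin(2 * np.pi * idx.hour / 24)                  # daily seasonality
          + np.random.normal(scale=0.5, size=len(idx)))            # noise
series = pd.Series(values, index=idx)

decomposition = seasonal_decompose(series, model="additive", period=24)
print(decomposition.trend.dropna().head())
print(decomposition.seasonal.head())
```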
2. THE PAGEBEAT PROJECT
With "PageBeat", a software suite offered as "Software as a Service" (SaaS) is being developed specifically for observing and checking web applications. This is initially done within a ZIM cooperation project funded by the German Federal Ministry of Economics. The goal of the software is to observe and report on the current technical status of a web application (website, content management system, e-commerce system, web service) and to predict technical problems on the basis of suitable indicators (hardware- and software-specific parameters). The reports are prepared and presented for different user groups (system administrators, software developers, heads of department, management, marketing) and their respective requirements. By means of "PageBeat", error reports are thus created automatically that inform about acute as well as foreseeable critical changes of the operating parameters of a web application and are presented in a target-group-specific way.

The underlying indicators are a set of data reflecting the state of the overall system in the application area of web shop systems. These are indicators of the server operating system (such as CPU or RAM utilization) as well as application-specific indicators (such as the runtime of database queries). These data are semantically described, and the corresponding metadata are stored in a knowledge base. Beyond that, the use of further context information that can influence the technical status of the system is being considered. This can be, for instance, weather data: if a rainy weekend is forecast for the cinema operator Cinestar, a high load on the online cinema ticket shop can be expected. Another example would be information from software development: for code changes with a certain timestamp, effects can be detected in the analyses at that point in time. Changing, adding, or taking note of relevant content on the web pages can lead to significant changes in the analyses, e.g., when advertising is placed or when film ratings for newly released films appear on social platforms.

Currently, as broad a spectrum of data as possible is recorded at high temporal resolution in order to be able to infer correlations in a process of data exploration that are not obvious at first, or to validate assumptions. At present, more than 300 indicators are sampled every 10 s on 14 servers from 9 customer projects. These data are stored and also processed further immediately. For example, a downsampling takes place for all of the 300 indicators mentioned: the temporal resolution is reduced to time windows of different sizes using various aggregate functions, and the results are stored. Other analysis functions quantise the values with respect to their membership in status classes (such as optimal, normal, critical) and store the results as well. In this way, large amounts of data accumulate very quickly. Currently the data store contains about 40 GB of data, and with the current number of observed values we see a growth of about 1 GB of data per week. On the basis of the collected data, time-critical analyses such as outlier detection or the detection of critical patterns have to be performed in near real time in order to allow customers to intervene in time. Furthermore, a prediction of future values is intended to point out critical developments early. The challenge in the project is to cope with the large data volume while guaranteeing near-real-time processing by the analysis functions.
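The downsampling and status quantisation described above can be pictured with a short Python/pandas sketch on a synthetic 10-second metric; the thresholds and window sizes are made up for illustration and do not come from the project.

```python
import numpy as np
import pandas as pd

# One day of a synthetic CPU metric sampled every 10 seconds.
idx = pd.date_range("2014-10-21", periods=6 * 60 * 24, freq="10s")
cpu_used = pd.Series(np.random.rand(len(idx)) * 100, index=idx, name="cpu.used")

# Downsampling to coarser resolutions with different aggregate functions.
per_minute_mean = cpu_used.resample("1min").mean()
per_hour_max = cpu_used.resample("1h").max()
per_day_median = cpu_used.resample("1D").median()

# Quantisation into status classes (thresholds are illustrative only).
status = pd.cut(per_minute_mean, bins=[0, 60, 85, 100],
                labels=["optimal", "normal", "critical"])
print(status.value_counts())
```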
3. TIME SERIES ANALYSIS AND DATABASES
In the course of evaluating software suitable for the project, we examined various approaches to data stream processing and to the analysis and management of time series. The goal was to use freely available software that, in addition, builds on the technical expertise already present in the company.

3.1 Data Stream Management Systems
Processing continuous data streams is one aspect of our project. Data stream management systems offer the possibility to formulate continuous queries over data streams that are converted into temporary relations. This can be done, for instance, with operators of the SQL-like Continuous Query Language [2] developed in the Stream project [1]. If more complex patterns are to be recognized in data streams, one also speaks of complex event processing. In the context of our project, such a pattern corresponds, for example, to an increase in the number of page requests due to a marketing campaign, which results in a higher system load (cpu-usage), which in turn is reflected in rising time-to-first-byte values and, in a critical range, should lead to a notification or even to an automatic scaling-up of the available resources. Complex event processing systems such as Esper [5] offer the possibility to formulate queries for such patterns over data streams and to implement corresponding reactions. Since Esper, as one of the few freely available systems suitable for productive use, is implemented in Java and .NET, and corresponding development capacities are not available in the company, none of the mentioned DSMSs or CEP systems will be used in the project. Their architecture, however, served as a guide for the development of an own system for PageBeat based on techniques already used in the company (such as node.js (http://nodejs.org), RabbitMQ (http://www.rabbitmq.com), MongoDB (http://www.mongodb.org), and others).
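The pattern described above (sustained high CPU usage together with rising time-to-first-byte values triggering a notification) can be sketched without a CEP engine in a few lines of Python; this is only an illustration of the idea, not the project's node.js/RabbitMQ implementation.

```python
from collections import deque

WINDOW = 6              # six consecutive 10 s samples, i.e. one minute
CPU_CRITICAL = 90.0     # percent; illustrative threshold
TTFB_CRITICAL = 1.5     # seconds; illustrative threshold

cpu_window: deque = deque(maxlen=WINDOW)
ttfb_window: deque = deque(maxlen=WINDOW)

def notify(message: str) -> None:
    print("ALERT:", message)   # placeholder for the project's notification channel

def on_sample(cpu_usage: float, time_to_first_byte: float) -> None:
    """Feed one measurement pair; emit an alert if both stay critical for a full window."""
    cpu_window.append(cpu_usage)
    ttfb_window.append(time_to_first_byte)
    if (len(cpu_window) == WINDOW
            and min(cpu_window) > CPU_CRITICAL
            and min(ttfb_window) > TTFB_CRITICAL):
        notify("sustained overload: consider scaling up resources")
```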
Deren Architektur diente jedoch zur Orientie- Möglichkeit Oracle Data Frames zu verwenden, um Daten- rung bei der Entwicklung eines eigenen mit im Unternehmen lokalität zu erreichen. Dabei wird der Code in der Oracle- eingesetzten Techniken (etwa node.js6 , RabbitMQ7 , Mon- Umgebung ausgeführt, dort wo die Daten liegen und nicht goDB8 , u.a.) Systems für PageBeat. umgekehrt. Außerdem erfolgt so ein transparenter Zugriff auf die Daten und Aspekte der Skalierung werden durch das 3.2 Werkzeuge zur Datenanalyse DBMS abgewickelt. Zur statistischen Auswertung der Daten im Projekt werden Neben den klassischen ORDBMS existieren eine Vielzahl Werkzeuge benötigt, die es ohne großen Implementierungs- von auf Zeitserien spezialisierte Datenbanken wie OpenTSDB14 , aufwand ermöglichen verschiedene Verfahren auf die erhobe- KairosDB15 , RRDB16 . Dabei handelt es sich jeweils um einen nen Daten anzuwenden und auf ihre Eignung hin zu untersu- auf Schreibzugriffe optimierten Datenspeicher in Form einer chen. Hierfür stehen verschiedene mathematische Werkzeuge schemalosen Datenbank und darauf zugreifende Anfrage-, zur Verfügung. Kommerzielle Produkte sind etwa die bereits Analyse- und Visualisierungsfunktionalität. Man sollte sie erwähnten Matlab oder SPSS. Im Bereich frei verfügbarer deshalb vielmehr als Ereignis-Verarbeitungs- oder Monitoring- Software kann man auf WEKA und vor allem R zurückgrei- Systeme bezeichnen. Neben den bisher genannten Zeitserien- fen. Besonders R ist sehr weit verbreitet und wird von ei- datenbanken ist uns bei der Recherche von für das Projekt ner großen Entwicklergemeinde getragen. Dadurch sind für geeigneter Software InfluxDB17 aufgefallen. InfluxDB ver- R bereits eine Vielzahl von Verfahren zur Datenaufberei- wendet Googles auf Log-structured merge-trees basierenden tung und deren statistischer Analyse bis hin zur entspre- key-value Store LevelDB18 und setzt somit auf eine hohen chenden Visualisierung implementiert. Gerade in Bezug auf Durchsatz bzgl. Schreiboperationen. Einen Nachteil hinge- die Analyse von Zeitreihen ist R aufgrund vielfältiger ver- gen stellen langwierige Löschoperationen ganzer nicht mehr fügbarer Pakete zur Zeitreihenanalyse gegenüber WEKA die benötigter Zeitbereiche dar. Die einzelnen Zeitreihen werden geeignetere Wahl. Mit RStudio9 steht außerdem eine kom- bei der Speicherung sequenziell in sogenannte Shards unter- fortable Entwicklungsumgebung zur Verfügung. Weiterhin teilt, wobei jeder Shard in einer einzelnen Datenbank gespei- können mit dem Web Framework Shiny10 schnell R Anwen- chert wird. Eine vorausschauenden Einrichtung verschiede- dungen im Web bereit gestellt werden und unterstützt so- ner Shard-Spaces (4 Stunden, 1 Tag, 1 Woche etc.) ermög- mit eine zügige Anwendungsentwicklung. Somit stellt R mit licht es, das langsame Löschen von Zeitbereichen durch das den zugehörigen Erweiterungen die für das Projekt geeignete einfache Löschen ganzer Shards also ganzer Datenbanken Umgebung zur Evaluierung von Datenanalyseverfahren und (drop database) zu kompensieren. Eine verteilte Speicherung zur Datenexploration dar. Im weiteren Verlauf des Projektes der Shards auf verschiedenen Rechnerknoten die wiederum und in der Überführung in ein produktives System wird die in verschiedenen Clustern organisiert sein können, ermög- Datenanalyse, etwa die Berechnung von Vorhersagen, inner- licht eine Verteilung der Daten, die falls gewünscht auch red- halb von node.js reimplementiert. undant mittels Replikation auf verschiedene Knoten erfolgen kann. 
Besides the classical ORDBMS, there is a large number of databases specialized in time series, such as OpenTSDB14, KairosDB15 and RRDB16. Each of them is a data store optimized for write access in the form of a schema-less database, together with query, analysis and visualization functionality built on top of it. One should therefore rather describe them as event processing or monitoring systems. Beyond the time series databases mentioned so far, InfluxDB17 caught our attention during the search for software suitable for the project. InfluxDB uses Google's key-value store LevelDB18, which is based on log-structured merge-trees, and thus aims at a high throughput of write operations. A disadvantage, on the other hand, are lengthy delete operations for entire time ranges that are no longer needed. When stored, the individual time series are partitioned sequentially into so-called shards, each shard being kept in a separate database. Setting up different shard spaces in advance (4 hours, 1 day, 1 week, etc.) makes it possible to compensate for the slow deletion of time ranges by simply dropping entire shards, i.e. entire databases (drop database). Distributed storage of the shards on different nodes, which in turn can be organized in different clusters, allows the data to be distributed and, if desired, also replicated redundantly to several nodes. Distributing the data across several nodes also makes it possible to distribute the computation of aggregates over time windows smaller than the shard size and thus to achieve data locality and a performance advantage. Here, too, it is advisable to plan shard sizes ahead. Queries to InfluxDB can be formulated in an SQL-like query language via an HTTP interface. Various aggregate functions are provided that produce output grouped, for example, by time intervals over an entire time range; the use of regular expressions is supported:

  select median(used) from /cpu\.*/
  where time > now() - 4h group by time(5m)

Here, the median of the "used" value is computed and returned for every 5-minute window of the last 4 hours for all CPUs. Besides normal queries, so-called continuous queries can also be set up, which, for instance, allow simple downsampling of measurement data:

  select count(name) from clicks
  group by time(1h) into clicks.count.1h

InfluxDB is still at an early stage and is being developed continuously. It has been announced, for example, that in the future it will be possible to store metadata about time series (units, sampling rate, etc.) and to implement user-defined aggregate functions. InfluxDB is a promising tool for our application, although it remains to be seen to what extent it is suitable for productive use.

14 OpenTSDB - Scalable Time Series Database. http://opentsdb.net/.
15 KairosDB - Fast Scalable Time Series Database. https://code.google.com/p/kairosdb/.
16 RRDB - Round Robin Database. http://oss.oetiker.ch/rrdtool/.
17 InfluxDB - An open-source distributed time series database with no external dependencies. http://influxdb.com/.
18 LevelDB - A fast and lightweight key-value database library by Google. http://code.google.com/p/leveldb/.
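For comparison, here is a minimal R sketch that reproduces the semantics of the median query above on a locally held data frame; the sample data are generated and only stand in for real PageBeat measurements.

  # Raw samples taken every 10 seconds over the last 4 hours (synthetic data).
  raw <- data.frame(
    ts   = seq(Sys.time() - 4 * 3600, Sys.time(), by = 10),
    used = runif(1441, 20, 80)
  )

  # Assign each sample to its 5-minute window and compute the median per window.
  raw$window <- as.POSIXct(floor(as.numeric(raw$ts) / 300) * 300,
                           origin = "1970-01-01")
  five_min_median <- aggregate(used ~ window, data = raw, FUN = median)
  head(five_min_median)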
For this reason, MongoDB, a data store already well proven in the company, is currently used in parallel with InfluxDB.

4. SOLUTION IN PAGEBEAT

In the PageBeat project, several solution approaches were tested; practicability for use in the company, fast realizability and the free availability of the tools employed played the decisive role.

4.1 Data Flow

The data flow within the overall architecture is shown in Figure 1. The measurement data are collected by a drone19 as well as by client simulators and load test servers at equidistant time intervals (usually 10 s). The collected data are handed to a logging service via a REST interface and are placed in the queue of a message server. From there they are processed, according to their signature, by registered analysis and interpretation processes; the validation of the incoming data and the assignment to registered analysis functions is carried out by means of a knowledge base. Results are in turn made available as messages and, where intended, stored persistently. Results that have entered the message queue in this way can then trigger further analyses or interpretations or the sending of a notification. The Data Explorer allows inspecting raw data and analysis results already integrated into PageBeat, as well as testing future analysis functions.

(Abbildung 1: Datenfluss. Data flow diagram: data stream (drone, load test server, client simulation, etc.), preprocessing / data cleaning, integration, ad-hoc analysis (outliers, etc.) and long-term analysis, connected to the knowledge base, the data store, the Data Explorer and the results.)

4.2 Knowledge Base

The knowledge base forms the foundation of the modularly structured analysis and interpretation processes. The "ParameterValues" shown in Figure 2 represent the measurement data and their properties such as name, description or unit. ParameterValues can be combined into logical groups (Parameters), for example the ParameterValues "system", "load", "iowait" and "max" into the parameter "cpu". Parameters are linked with visualization components and customer data as well as with analyses and interpretations. Analyses and interpretations are built modularly and each consist of input and output data (ParameterValues) as well as references to the program code. Furthermore, specific method parameters are assigned to them, for instance the start and end of a time window, thresholds or other model parameters. The knowledge base is mapped to a relational schema in MySQL.

19 An agent installed on the system under observation for data collection.
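A minimal sketch of how such a relational schema could look in MySQL follows; the table and column names are assumptions for illustration, not the actual PageBeat schema.

  -- Logical groups such as 'cpu'.
  CREATE TABLE parameter (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(64) NOT NULL
  );

  -- Individual measured values such as 'system', 'load', 'iowait', 'max'.
  CREATE TABLE parameter_value (
    id           INT PRIMARY KEY AUTO_INCREMENT,
    parameter_id INT NOT NULL,
    name         VARCHAR(64) NOT NULL,
    description  TEXT,
    unit         VARCHAR(32),
    FOREIGN KEY (parameter_id) REFERENCES parameter(id)
  );

  -- Analyses with input/output ParameterValues, a code reference and method parameters.
  CREATE TABLE analysis (
    id           INT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(64) NOT NULL,
    code_path    VARCHAR(255),
    input_value  INT,
    output_value INT,
    window_start DATETIME,
    window_end   DATETIME,
    threshold    DOUBLE,
    FOREIGN KEY (input_value)  REFERENCES parameter_value(id),
    FOREIGN KEY (output_value) REFERENCES parameter_value(id)
  );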
(Abbildung 2: Ausschnitt Schema Wissensbasis. Excerpt of the knowledge base schema, relating Analysis, Visualisation, Parameter, Interpretation and Customer Data.)

4.3 Storage of the Time Series

The measurement data as well as the analysis and interpretation results are stored, on the one hand, in MongoDB, the schema-free database well proven in the company and optimized for high-frequency write operations. On the other hand, we meanwhile rely on InfluxDB in parallel to MongoDB. For example, the continuous queries available in InfluxDB can be used for automatic downsampling and thus for a data reduction of the data collected at 10-second intervals. The downsampling is currently done by computing mean values over time windows with lengths from 1 minute up to one day, and thus automatically generates different temporal resolutions for all measured values. In addition, the SQL-like query language of InfluxDB provides a large number of aggregate functions that are helpful for statistical evaluation (min, max, mean, median, stddev, percentile, histogram, etc.). Furthermore, it is planned that in the future user-defined functions with custom analysis functionality (such as autocorrelation, cross-correlation, forecasting, etc.) can be implemented at the database level, and that different time series can be joined automatically on a timestamp attribute. This would support cross-series analysis (e.g. correlation) already at the database level and reduces the effort of reimplementing R functionality from the data exploration phase. Since conventional databases neither reach this high write performance nor offer much support for queries specialized on time series, InfluxDB appears to be a suitable candidate for use within PageBeat.

4.4 Data Exploration

Data exploration is meant to give administrators and also end users the possibility to analyze the data relevant to them with the right tools. During development we use data exploration as a means to identify relevant analysis methods and to evaluate and visualize the data streams. Figure 3 shows a simple user interface, implemented with Shiny, for data evaluation with R and with access to different databases, InfluxDB and MongoDB. Various controls allow selecting the time range, the analysis function and its parameters, as well as visualization settings. In the figure, average CPU usage and average disk access times from a selection of 10 time series are displayed. With the interaction element at the bottom, intervals can be selected and the granularity can be adjusted. With similar visualization methods, autocorrelation analyses can be visualized as well, see Figure 4.

(Abbildung 4: Autokorrelation. Visualization of an autocorrelation analysis.)

4.5 Analysis and Interpretation

Analyses are basic operations such as the computation of mean, median, standard deviation, autocorrelation and others, whose results can be stored persistently if necessary or passed directly as input to further processing steps. The analysis functions are specified in the knowledge base; the actual implementation is to be placed as close as possible to the data to be analyzed, where possible using aggregate or user-defined functions of the database system.
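The following minimal R sketch shows such basis operations on an assumed numeric vector of CPU load samples taken every 10 seconds; the data are synthetic.

  cpu <- runif(360, 20, 80)            # one hour of 10-second samples (synthetic)

  basic_stats <- c(mean = mean(cpu),
                   median = median(cpu),
                   sd = sd(cpu))

  # Autocorrelation up to a lag of 60 samples (10 minutes); pronounced peaks would
  # indicate periodic behaviour that a forecasting model could exploit.
  acf_result <- acf(cpu, lag.max = 60, plot = FALSE)
  head(acf_result$acf)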
For this purpose, the knowledge base and the analyses are linked via a "method codepath". Interpretations work analogously to analyses but encode computation rules, for instance for the overall index (Pagebeat factor) of the system or of individual subsystems, e.g. by combining the analysis results of individual time series in a weighted manner. Furthermore, interpretations carry an info type, which serves the user-specific preparation of results. Figure 5, for instance, shows the display of aggregated parameters in traffic-light form (red = critical, yellow = warning, green = normal, blue = optimal), which quickly conveys an impression of the state of various system parameters.

(Abbildung 5: Ampel. Traffic-light display of aggregated parameters.)

Analysis functionality that goes beyond aggregations at the database level is implemented and evaluated by us in an experimental environment. This environment is based on R, so that a large number of statistical analysis methods and methods for preparing complex data structures are available in the form of R packages. In addition, the R package "Shiny Server" allows R functionality to be conveniently made available on the web. An essential part of our experimental environment is the Pagebeat Data Explorer (see Figure 3). It builds on the techniques mentioned and allows inspecting the collected raw data or "playing" with analysis methods and forecasting models.

(Abbildung 3: Daten. Screenshot of the Shiny-based Data Explorer user interface.)

5. SUMMARY AND OUTLOOK

Pagebeat is a project in which performant storage and fast ad-hoc evaluation of the data are particularly important. To this end, different solution approaches were examined and the favored solution based on InfluxDB and R was described. The conceptual phase is completed, the project infrastructure has been implemented, and first analysis methods such as outlier detection or autocorrelation have been tried out. Currently we are investigating the possibilities of forecasting time series values. For this purpose, results of the autocorrelation analysis are used to identify dependencies within time series in order to be able to estimate the quality of forecasts. Furthermore, it is planned to execute analyses closer to the database in order to support data locality.

6. REFERENCES
[1] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. Technical Report 2004-20, Stanford InfoLab, 2004.
[2] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford InfoLab, 2003.
[3] K. Chinda and R. Vijay. Informix TimeSeries solution. http://www.ibm.com/developerworks/data/library/techarticle/dm-1203timeseries, 2012.
[4] Oracle Corporation. R technologies from Oracle. http://www.oracle.com/technetwork/topics/bigdata/r-offerings-1566363.html, 2014.
[5] EsperTech. Esper. http://esper.codehaus.org, 2014.
[6] U. Fischer, L. Dannecker, L. Siksnys, F. Rosenthal, M. Boehm, and W. Lehner. Towards integrated data analytics: Time series forecasting in DBMS. Datenbank-Spektrum, 13(1):45-53, 2013.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

Databases under the Partial Closed-world Assumption: A Survey

Simon Razniewski, Werner Nutt
Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100 Bozen, Italy
razniewski@inf.unibz.it, nutt@inf.unibz.it

ABSTRACT
Databases are traditionally considered either under the closed-world or the open-world assumption. In some scenarios, however, a middle ground, the partial closed-world assumption, is needed, which has received less attention so far. In this survey we review foundational work on the partial closed-world assumption and then discuss work done in our group in recent years on various aspects of reasoning

centralized manner, as each school is responsible for its own data. Since there are numerous schools in this province, the overall database is notoriously incomplete. However, periodically the statistics department of the province queries the school database to generate statistical reports. These statistics are the basis for administrative decisions such as the opening and closing of classes, the assignment of teachers to schools and others.
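For illustration, a hypothetical statistics query of this kind over the student table introduced in Section 2.2 could look as follows; its result can only be trusted for such decisions if the relevant part of the data is complete.

  -- Number of students per level; meaningful for planning only if the
  -- student table is complete for the levels of interest.
  SELECT level, count(*) AS num_students
  FROM student
  GROUP BY level;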
It is therefore important that these statistics are over databases under this assumption. correct. Therefore, the IT department is interested in finding We first discuss the conceptual foundations of this assump- out which data has to be complete in order to guarantee cor- tion. We then list the main decision problems and the known rectness of the statistics, and on which basis the guarantees results. Finally, we discuss implementational approaches and can be given. extensions. Broadly, we investigated the following research questions: 1. How to describe complete parts of a database? 1. INTRODUCTION Data completeness is an important aspect of data quality. 2. How to find out, whether a query answer over a par- Traditionally, it is assumed that a database reflects exactly tially closed database is complete? the state of affairs in an application domain, that is, a fact that is true in the real world is stored in the database, and a 3. If a query answer is not complete, how to find out which fact that is missing in the database does not hold in the real kind of data can be missing, and which similar queries world. This is known as the closed-world assumption (CWA). are complete? Later approaches have discussed the meaning of databases that are missing facts that hold in the real world and thus are incomplete. This is called the open-world assumption Work Overview. The first work on the PCWA is from (OWA) [16, 7]. Motro [10]. He used queries to describe complete parts and A middle view, which we call the partial closed-world as- introduced the problem of inferring the completeness of other sumption (PCWA), has received less attention until recently. queries (QC) from such completeness statements. Later work Under the PCWA, some parts of the database are assumed by Halevy [8] introduced tuple-generating dependencies or to be closed (complete), while others are assumed to be open table completeness (TC) statements for specification of com- (possibly incomplete). So far, the former parts were specified plete parts. A detailed complexity study of TC-QC entailment using completeness statements, while the latter parts are the was done by Razniewski and Nutt [13]. complement of the complete parts. Later work by Razniewski and Nutt has focussed on databases with null values [12] and geographic databases [14]. Example. As an example, consider a problem arising in the There has also been work on RDF data [3]. Savkovic management of school data in the province of Bolzano, Italy, et al. [18, 17] have focussed on implementation techniques, which motivated the technical work reported here. The IT leveraging especially on logic programming. department of the provincial school administration runs a Also the derivation of completeness from data-aware busi- database for storing school data, which is maintained in a de- ness process descriptions has been discussed [15]. Current work is focussing on reasoning wrt. database in- stances and on queries with negation [4]. Outline. This paper is structured as follows. In Section 2, we discuss conceptual foundations, in particular the par- tial closed-world assumption. In Section 3 we present main Copyright c by the paper’s authors. Copying permitted only for reasoning problems in this framework and known results. private and academic purposes. Section 4 discusses implementation techniques. Section 5 In: G. Specht, H. Gamper, F. 
Klan (eds.): Proceedings of the 26th GI- Workshop on Foundations of Databases (Grundlagen von Datenbanken), presents extension and Section 6 discusses current work and 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. open problems. 59 2. CONCEPTUAL FOUNDATIONS Example 1. Consider a partial database DS for a school with two students, Hans and Maria, and one teacher, Carlo, as follows: 2.1 Standard Definitions In the following, we fix our notation for standard concepts DiS = {student(Hans, 3, A), student(Maria, 5, C), from database theory. We assume a set of relation symbols person(Hans, male), person(Maria, female), Σ, the signature. A database instance D is a finite set of ground atoms with relation symbols from Σ. For a relation symbol person(Carlo, male) }, R ∈ Σ we write R(D) to denote the interpretation of R in D, that DaS = DiS \ { person(Carlo, male), student(Maria, 5, C) }, is, the set of atoms in D with relation symbol R. A condition G is a set of atoms using relations from Σ and possibly the that is, the available database misses the facts that Maria is a student comparison predicates < and ≤. As common, we write a and that Carlo is a person. condition as a sequence of atoms, separated by commas. A Next, we define statements to express that parts of the in- condition is safe if each of its variables occurs in a relational formation in Da are complete with regard to the ideal database atom. A conjunctive query is written in the form Q(s̄) :− B, Di . We distinguish query completeness and table complete- where B is a safe condition, s̄ is a vector of terms, and every ness statements. variable in s̄ occurs in B. We often refer to the entire query by the symbol Q. As usual, we call Q(s̄) the head, B the body, the variables in s̄ the distinguished variables, and the Query Completeness. For a query Q, the query completeness statement Compl(Q) says that Q can be answered completely remaining variables in B the nondistinguished variables of Q. over the available database. Formally, Compl(Q) is satisfied by We generically use the symbol L for the subcondition of B a partial database D, denoted as D |= Compl(Q), if Q(Da ) = containing the relational atoms and M for the subcondition Q(Di ). containing the comparisons. If B contains no comparisons, then Q is a relational conjunctive query. Example 2. Consider the above defined partial database DS and The result of evaluating Q over a database instance D is the query denoted as Q(D). Containment and equivalence of queries are defined as usual. A conjunctive query is minimal if no Q1 (n) :− student(n, l, c), person(n, ’male’), relational atom can be removed from its body without leading asking for all male students. Over both, the available database DaS to a non-equivalent query. and the ideal database DiS , this query returns exactly Hans. Thus, 2.2 Running Example DS satisfies the query completeness statement for Q1 , that is, For our examples throughout the paper, we will use a dras- DS |= Compl(Q1 ). tically simplified extract taken from the schema of the Bolzano school database, containing the following two tables: Abiteboul et al. [1] introduced the notion of certain and possible answers over databases under the open-world as- - student(name, level, code), sumption. Query completeness can also be seen as a relation - person(name, gender). 
between certain and possible answers: A query over a par- The table student contains records about students, that is, tially complete database is complete, if the certain and the their names and the level and code of the class we are in. possible answers coincide. The table person contains records about persons (students, teachers, etc.), that is, their names and genders. Table completeness. A table completeness (TC) statement allows one to say that a certain part of a relation is com- 2.3 Completeness plete, without requiring the completeness of other parts of Open and closed world semantics were first discussed by the database [8]. It has two components, a relation R and Reiter in [16], where he formalized earlier work on negation a condition G. Intuitively, it says that all tuples of the ideal as failure [2] from a database point of view. The closed-world relation R that satisfy condition G in the ideal database are assumption corresponds to the assumption that the whole also present in the available relation R. database is complete, while the open-world assumption cor- Formally, let R(s̄) be an R-atom and let G be a condition responds to the assumption that nothing is known about the such that R(s̄), G is safe. We remark that G can contain re- completeness of the database. lational and built-in atoms and that we do not make any safety assumptions about G alone. Then Compl(R(s̄); G) is a Partial Database. The first and very basic concept is that table completeness statement. It has an associated query, which of a partially complete database or partial database [10]. A is defined as QR(s̄);G (s̄) :− R(s̄), G. The statement is satisfied database can only be incomplete with respect to another by D = (Di , Da ), written D |= Compl(R(s̄); G), if QR(s̄);G (Di ) ⊆ database that is considered to be complete. So we model a R(Da ). Note that the ideal instance D̂ is used to determine partial database as a pair of database instances: one instance those tuples in the ideal version R(Di ) that satisfy G and that that describes the complete state, and another instance that the statement is satisfied if these tuples are present in the describes the actual, possibly incomplete state. Formally, a available version R(Da ). In the sequel, we will denote a TC partial database is a pair D = (Di , Da ) of two database instances statement generically as C and refer to the associated query Di and Da such that Da ⊆ Di . In the style of [8], we call Di simply as QC . the ideal database, and Da the available database. The require- If we introduce different schemas Σi and Σa for the ideal ment that Da is included in Di formalizes the intuition that and the available database, respectively, we can view the the available database contains no more information than the TC statement C = Compl(R(s̄); G) equivalently as the TGD (= ideal one. tuple-generating dependency) δC : Ri (s̄), Gi → Ra (s̄) from Σi to 60 Σa . It is straightforward to see that a partial database satisfies 3. CHARACTERIZATIONS AND DECISION the TC statement C if and only if it satisfies the TGD δC . PROCEDURES The view of TC statements is especially useful for imple- mentations. Motro [10] introduced the notion of partially incomplete and incorrect databases as databases that can both miss facts that hold in the real world or contain facts that do not hold Example 3. In the partial database DS defined above, we can there. 
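On a concrete instance, the query completeness check of Example 2 amounts to comparing the answers over the ideal and the available database. A hedged SQL sketch, assuming the two versions are stored as separate tables named student_i/person_i and student_a/person_a (names chosen purely for illustration): Q1 is complete exactly if the following difference is empty.

  SELECT s.name
  FROM   student_i s JOIN person_i p ON p.name = s.name
  WHERE  p.gender = 'male'
  EXCEPT
  SELECT s.name
  FROM   student_a s JOIN person_a p ON p.name = s.name
  WHERE  p.gender = 'male';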
He described partial completeness in terms of query observe that in the available relation person, the teacher Carlo is completeness (QC) statements, which express that the answer missing, while all students are present. Thus, person is complete of a query is complete. The query completeness statements for all students. The available relation student contains Hans, who express that to some parts of the database the closed-world is the only male student. Thus, student is complete for all male assumption applies, while for the rest of the database, the persons. Formally, these two observations can be written as table open-world assumption applies. He studied how the com- completeness statements: pleteness of a given query can be deduced from the com- pleteness of other queries, which is QC-QC entailment. His C1 = Compl(person(n, g); student(n, l, c)), solution was based on rewriting queries using views: to infer C2 = Compl(student(n, l, c); person(n, ’male’)), that a given query is complete whenever a set of other queries are complete, he would search for a conjunctive rewriting in which, as seen, are satisfied by the partial database DS . terms of the complete queries. This solution is correct, but not complete, as later results on query determinacy show: One can prove that table completeness cannot be expressed the given query may be complete although no conjunctive by query completeness statements, because the latter require rewriting exists. completeness of the relevant parts of all the tables that ap- While Levy et al. could show that rewritability of conjunc- pear in the statement, while the former only talks about the tive queries as conjunctive queries is decidable [9], general completeness of a single table. rewritability of conjunctive queries by conjunctive queries is still open: An extensive discussion on that issue was pub- lished in 2005 by Segoufin and Vianu where it is shown that Example 4. As an illustration, consider the table completeness it is possible that conjunctive queries can be rewritten using statement C1 that states that person is complete for all students. The other conjunctive queries, but the rewriting is not a conjunc- corresponding query QC1 that asks for all persons that are students tive query [19]. They also introduced the notion of query is determinacy, which for conjunctive queries implies second QC1 (n, g) :− person(n, g), student(n, l, c). order rewritability. The decidability of query determinacy for conjunctive queries is an open problem to date. Evaluating QC1 over DiS gives the result { Hans, Maria }. However, evaluating it over DaS returns only { Hans }. Thus, DS does not Halevy [8] suggested local completeness statements, which satisfy the completeness of the query QC1 although it satisfies the we, for a better distinction from the QC statements, call table table completeness statement C1 . completeness (TC) statements, as an alternate formalism for expressing partial completeness of an incomplete database. Reasoning. As usual, a set S1 of TC- or QC-statements en- These statements allow one to express completeness of parts tails another set S2 (we write S1 |= S2 ) if every partial database of relations independent from the completeness of other parts that satisfies all elements of S1 also satisfies all elements of S2 . of the database. The main problem he addressed was how to derive query completeness from table completeness (TC-QC). He reduced TC-QC to the problem of queries independent Example 5. 
Consider the query Q(n) :− student(n, 7, c), of updates (QIU) [5]. However, this reduction introduces person(n,0 male0 ) that asks for all male students in level 7. The negation, and thus, except for trivial cases, generates QIU TC statements C1 and C2 entail completeness of this query, because instances for which no decision procedures are known. As we ensure that all persons that are students and all male students a consequence, the decidability of TC-QC remained largely are in the database. Note that these are not the minimal precon- open. Moreover, he demonstrated that by taking into ac- ditions, as it would be enough to only have male persons in the count the concrete database instance and exploiting the key database who are student in level 7, and students in level 7, who constraints over it, additional queries can be shown to be are male persons. complete. Razniewski and Nutt provided decision procedures for TC- While TC statements are a natural way to describe com- QC in [13]. They showed that for queries under bag semantics pleteness of available data (“These parts of the data are com- and for minimal queries under set semantics, weakest precon- plete”), QC statements capture requirements for data qual- ditions for query completeness can be expressed in terms of ity (“For these queries we need complete answers”). Thus, table completeness statements, which allow to reduce TC-QC checking whether a set of TC statements entails a set of entailment to TC-TC entailment. QC statements (TC-QC entailment) is the practically most For the problem of TC-TC entailment, they showed that it relevant inference. Checking TC-TC entailment is useful is equivalent to query containment. when managing sets of TC statements. Moreover, as we For QC-QC entailment, they showed that the problem is will show later on, TC-QC entailment for aggregate queries decidable for queries under bag semantics. with count and sum can be reduced to TC-TC entailment for For aggregate queries, they showed that for the aggregate non-aggregate queries. If completeness guarantees are given functions SUM and COUNT, TC-QC has the same complexity in terms of query completeness, also QC-QC entailment is of as TC-QC for nonaggregate queries under bag semantics. For interest. the aggregate functions MIN and MAX, they showed that 61 Problem Work by Results Query rewritability is a sufficient Motro 1989 QC-QC condition for QC-QCs Razniewski/Nutt QC-QCb is equivalent to query 2011 containment Razniewski/Nutt TC-TC is equivalent to query TC-TC 2011 containment Levy 1996 Decision procedure for trivial cases TC-QC TC-QCb is equivalent to TC-TC, Razniewski/Nutt TC-QCs is equivalent to TC-TC up 2011 to asymmetric cases Razniewski/Nutt Decision procedures for TC-QCs 2012 over databases with nulls Table 1: Main results TC-QC has the same complexity as TC-QC for nonaggregate that computes for a query that may be incomplete, complete queries under set semantics. approximations from above and from below. With this exten- For reasoning wrt. a database instance, they showed that sion, they show how to reformulate the original query in such TC-QC becomes computationally harder than without an in- a way that answers are guaranteed to be complete. If there stance, while QC-QC surprisingly becomes solvable, whereas exists a more general complete query, there is a unique most without an instance, decidability is open. specific one, which is found. If there exists a more specific complete query, there may even be infinitely many. 
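Spelled out in the TGD view introduced in Section 2, the statements C1 and C2 used in Example 5 read as follows (schemas Σi and Σa as above; this is only Example 5 made explicit):

  δ_C1 :  person_i(n, g), student_i(n, l, c)       →  person_a(n, g)
  δ_C2 :  student_i(n, l, c), person_i(n, 'male')  →  student_a(n, l, c)

For the query Q(n) :− student(n, 7, c), person(n, 'male'), every ideal answer n satisfies the bodies of both dependencies, so δ_C2 forces student_a(n, 7, c) and δ_C1 forces person_a(n, 'male') to be present, which is the intuition behind the entailment claimed in Example 5.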
In this In [12], Nutt and Razniewski discussed TC-QC entailment case, the least specific specializations whose size is bounded reasoning over databases that contain null values. Null val- by a threshold provided by the user is found. Generalizations ues as used in SQL are ambiguous, as they can indicate either are computed by a fixpoint iteration, employing an answer set that no attribute value exists or that a value exists, but is un- programming engine. Specializations are found leveraging known. Nutt and Razniewski studied completeness reason- unification from logic programming. ing for both interpretations, and showed that when allowing both interpretations at the same time, it becomes necessary to syntactically distinguish between different kinds of null val- 5. EXTENSIONS AND APPLICATIONS SCE- ues. They presented an encoding for doing that in standard NARIOS SQL databases. With this technique, any SQL DBMS evalu- ates complete queries correctly with respect to the different meanings that null values can carry. Complete generalizations and specializations. When a The main results are summarized in Table 1. query is not guaranteed to be complete, it may be interesting to know which similar queries are complete. For instance, when a query for all students in level 5 is not complete, it 4. IMPLEMENTATION TECHNIQUES may still be the case that the query for students in classes 5b Systems for reasoning can be developed from scratch, how- and 5c is complete. Such information is especially interesting ever it may be useful to implement them using existing tech- for interaction with a completeness reasoning system. In [11], nology as far as possible. So far, it was investigated how Savkovic et al. defined the notion of most general complete completeness reasoning can be reduced to answer set pro- specialization and the most specific comple generalization, gramming, in particular using the DLV system. and discussed techniques to find those. The MAGIK system developed by Savkovic et al. [18] demonstrates how to use meta-information about the com- Completeness over Business Processes. In many appli- pleteness of a database to assess the quality of the answers cations, data is managed via well documented processes. If returned by a query. The system holds table-completeness information about such processes exists, one can draw con- (TC) statements, by which one can express that a table is par- clusions about completeness as well. In [15], Razniewski et tially complete, that is, it contains all facts about some aspect al. presented a formalization of so-called quality-aware pro- of the domain. cesses that create data in the real world and store it in the Given a query, MAGIK determines from such meta- company’s information system possibly at a later point. They information whether the database contains sufficient data then showed how one can check the completeness of database for the query answer to be complete (TC-QC entailment). queries in a certain state of the process or after the execution If, according to the TC statements, the database content is of a sequence of actions, by leveraging on query contain- not sufficient for a complete answer, MAGIK explains which ment, a well-studied problem in database theory. Finally, further TC statements are needed to guarantee completeness. they showed how the results can be extended to the more MAGIK extends and complements theoretical work on expressive formalism of colored Petri nets. 
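To give a flavour of the ASP-based implementation route described in Section 4, here is a toy encoding in DLV/clingo-style syntax for instance-based checking of the table completeness statement C2 on the data of Example 1; it is an illustration only, not MAGIK's actual logic programs.

  % Ideal and available database (Example 1).
  student_i(hans, 3, a).   student_i(maria, 5, c).
  person_i(hans, male).    person_i(maria, female).   person_i(carlo, male).
  student_a(hans, 3, a).
  person_a(hans, male).    person_a(maria, female).

  % C2: every male student of the ideal database must appear in the available student table.
  required_student(N, L, C) :- student_i(N, L, C), person_i(N, male).

  % A violation is derived if a required tuple is missing from the available database.
  violation(c2, N) :- required_student(N, L, C), not student_a(N, L, C).

On this instance no violation is derived, matching the observation in Example 3 that C2 is satisfied.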
modeling and reasoning about data completeness by provid- ing the first implementation of a reasoner. The reasoner op- erates by translating completeness reasoning tasks into logic Spatial Data. Volunteered geographical information sys- tems are gaining popularity. The most established one is programs, which are executed by an answer set engine. OpenStreetMap (OSM), but also classical commercial map In [17], Savkovic et al. present an extension to MAGIK services such as Google Maps now allow users to take part in 62 the content creation. Relationship between Completeness Assessing the quality of spatial information is essential for Certain Answers, Query P Pattern making informed decisions based on the data, and particu- Answers, and Possible Answers larly challenging when the data is provided in a decentral- Q :− C CA = QA = PA ized, crowd-based manner. In [14], Razniewski and Nutt Q :− N CA = QA ⊆ PA = inf showed how information about the completeness of features Q :− N, ¬N ∅ = CA ⊆ QA ⊆ PA = inf in certain regions can be used to annotate query answers with Q :− C, ¬C CA = QA = PA completeness information. They provided a characterization Q :− N, ¬C CA = QA ⊆ PA = inf of the necessary reasoning and show that when taking into Q :− C, ¬N ∅ = CA ⊆ QA = PA account the available database, more completeness can be de- rived. OSM already contains some completeness statements, Table 2: Relation between query result, certain answers and which are originally intended for coordination among the ed- possible answers for queries with negation. The arguments itors of the map. A contribution was also to show that these of Q are irrelevant and therefore omitted. statements are not only useful for the producers of the data but also for the consumers. query answer may either be equal to the possible answers, to RDF Data. With thousands of RDF data sources today avail- the certain answers, both, or none. able on the Web, covering disparate and possibly overlapping Note that the above results hold for conjunctive queries in knowledge domains, the problem of providing high-level de- general, and thus do not only apply to SPARQL but also to scriptions (in the form of metadata) of their content becomes other query languages with negation, such as SQL. crucial. In [3], Darari et al. discussed reasoning about the completeness of semantic web data sources. They showed 6.2 Instance Reasoning how the previous theory can be adapted for RDF data sources, Another line of current work concerns completeness rea- what peculiarities the SPARQL query language offers and soning wrt. a database instance. We are currently looking into how completeness statements themselves can be expressed completeness statements which are simpler than TC state- in RDF. ments in the sense that we do not contain any joins. For They also discussed the foundation for the expression of such statements, reasoning is still exponential in the size of completeness statements about RDF data sources. This al- the database schema, but experimental results suggest that in lows to complement with qualitative descriptions about com- use cases, the reasoning is feasible. A challenge is however pleteness the existing proposals like VOID that mainly deal to develop a procedure which is algorithmically complete. with quantitative descriptions. The second aspect of their work is to show that completeness statements can be useful for the semantic web in practice. On the theoretical side, 7. 
ACKNOWLEDGEMENT they provide a formalization of completeness for RDF data We thank our collaborators Fariz Darari, Flip Korn, Paramita sources and techniques to reason about the completeness of Mirza, Marco Montali, Sergey Paramonov, Giuseppe Pirró, query answers. From the practical side, completeness state- Radityo Eko Prasojo, Ognjen Savkovic and Divesh Srivas- ments can be easily embedded in current descriptions of data tava. sources and thus readily used. The results on RDF data have This work has been partially supported by the project been implemented by Darari et al. in a demo system called “MAGIC: Managing Completeness of Data” funded by the CORNER [6]. province of Bozen-Bolzano. 6. CURRENT WORK 8. REFERENCES In this section we list problems that our group is currently [1] S. Abiteboul, P.C. Kanellakis, and G. Grahne. On the working on. representation and querying of sets of possible worlds. In Proc. SIGMOD, pages 34–48, 1987. 6.1 SPARQL Queries with Negation [2] Keith L Clark. Negation as failure. In Logic and data bases, pages 293–322. Springer, 1978. RDF data is often treated as incomplete, following the Open-World Assumption. On the other hand, SPARQL, the [3] Fariz Darari, Werner Nutt, Giuseppe Pirrò, and Simon standard query language over RDF, usually follows the Closed- Razniewski. Completeness statements about RDF data World Assumption, assuming RDF data to be complete. What sources and their use for query answering. In then happens is the semantic gap between RDF and SPARQL. International Semantic Web Conference (1), pages 66–83, In current work, Darari et al. [4] address how to close the se- 2013. mantic gap between RDF and SPARQL, in terms of certain an- [4] Fariz Darari, Simon Razniewski, and Werner Nutt. swers and possible answers using completeness statements. Bridging the semantic gap between RDF and SPARQL Table 2 shows current results for the relations between query using completeness statements. ISWC, 2013. answers, certain answers and possible answers for queries [5] Ch. Elkan. Independence of logic database queries and with negation. The queries are assumed to be of the form updates. In Proc. PODS, pages 154–160, 1990. Q(s̄) :− P+ , ¬P− , where P+ is the positive part and P− is the [6] Radityo Eko Prasojo Fariz Darari and Werner Nutt. negative part. Then we use letters C and N to indicate which CORNER: A completeness reasoner for the semantic parts are complete. E.g. Q(s̄) :− N, ¬C indicates that the pos- web (poster). ESWC, 2013. itive part is not complete and the negative part is complete. [7] T. Imieliński and W. Lipski, Jr. Incomplete information As the table shows, depending on the complete parts, the in relational databases. J. ACM, 31:761–791, 1984. 63 [8] Alon Y. Levy. Obtaining complete answers from of geographical data (short paper). In BNCOD, 2013. incomplete databases. In Proceedings of the International [15] Simon Razniewski, Marco Montali, and Werner Nutt. Conference on Very Large Data Bases, pages 402–412, 1996. Verification of query completeness over processes. In [9] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, BPM, pages 155–170, 2013. and Divesh Srivastava. Answering queries using views. [16] Raymond Reiter. On closed world data bases. In Logic In PODS, pages 95–104, 1995. and Data Bases, pages 55–76, 1977. [10] A. Motro. Integrity = Validity + Completeness. ACM [17] Ognjen Savkovic, Paramita Mirza, Sergey Paramonov, TODS, 14(4):480–502, 1989. and Werner Nutt. 
Magik: managing completeness of [11] Werner Nutt, Sergey Paramonov, and Ognjen Savkovic. data. In CIKM, pages 2725–2727, 2012. An ASP approach to query completeness reasoning. [18] Ognjen Savkovic, Paramita Mirza, Alex Tomasi, and TPLP, 13(4-5-Online-Supplement), 2013. Werner Nutt. Complete approximations of incomplete [12] Werner Nutt and Simon Razniewski. Completeness of queries. PVLDB, 6(12):1378–1381, 2013. queries over SQL databases. In CIKM, pages 902–911, [19] L. Segoufin and V. Vianu. Views and queries: 2012. Determinacy and rewriting. In Proc. PODS, pages [13] S. Razniewski and W. Nutt. Completeness of queries 49–60, 2005. over incomplete databases. In VLDB, 2011. [14] S. Razniewski and W. Nutt. Assessing the completeness 64 Towards Semantic Recommendation of Biodiversity Datasets based on Linked Open Data Felicitas Löffler Bahar Sateli René Witte Birgitta König-Ries Dept. of Mathematics Semantic Software Lab Semantic Software Lab Friedrich Schiller University and Computer Science Dept. of Computer Science Dept. of Computer Science Jena, Germany and Friedrich Schiller University and Software Engineering and Software Engineering German Centre for Integrative Jena, Germany Concordia University Concordia University Biodiversity Research (iDiv) Montréal, Canada Montréal, Canada Halle-Jena-Leipzig, Germany ABSTRACT 1. INTRODUCTION Conventional content-based filtering methods recommend Content-based recommender systems observe a user’s brows- documents based on extracted keywords. They calculate the ing behaviour and record the interests [1]. By means of natu- similarity between keywords and user interests and return a ral language processing and machine learning techniques, the list of matching documents. In the long run, this approach user’s preferences are extracted and stored in a user profile. often leads to overspecialization and fewer new entries with The same methods are utilized to obtain suitable content respect to a user’s preferences. Here, we propose a seman- keywords to establish a content profile. Based on previously tic recommender system using Linked Open Data for the seen documents, the system attempts to recommend similar user profile and adding semantic annotations to the index. content. Therefore, a mathematical representation of the user Linked Open Data allows recommendations beyond the con- and content profile is needed. A widely used scheme are TF- tent domain and supports the detection of new information. IDF (term frequency-inverse document frequency) weights One research area with a strong need for the discovery of [19]. Computed from the frequency of keywords appearing new information is biodiversity. Due to their heterogeneity, in a document, these term vectors capture the influence of the exploration of biodiversity data requires interdisciplinary keywords in a document or preferences in a user profile. The collaboration. Personalization, in particular in recommender angle between these vectors describes the distance or the systems, can help to link the individual disciplines in bio- closeness of the profiles and is calculated with similarity mea- diversity research and to discover relevant documents and sures, like the cosine similarity. The recommendation lists of datasets from various sources. 
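As a toy illustration of the TF-IDF weighting and cosine similarity sketched above, the following R snippet matches a made-up user profile against two made-up documents; real systems of course use proper tokenization and much larger vocabularies.

  docs <- c(doc1 = "fossil gastropod triassic limestone",
            doc2 = "bird migration tracking sensor",
            user = "fossil triassic geology")

  terms <- unique(unlist(strsplit(docs, " ")))
  tf <- sapply(docs, function(d) {
    w <- strsplit(d, " ")[[1]]
    sapply(terms, function(t) sum(w == t) / length(w))
  })
  idf   <- log(length(docs) / rowSums(tf > 0))
  tfidf <- tf * idf                       # term-by-document TF-IDF matrix

  cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  cosine(tfidf[, "user"], tfidf[, "doc1"])  # > 0: shared terms "fossil", "triassic"
  cosine(tfidf[, "user"], tfidf[, "doc2"])  # 0: no overlapping terms, never recommended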
We developed a first prototype these traditional, keyword-based recommender systems often for our semantic recommender system in this field, where a contain very similar results to those already seen, leading multitude of existing vocabularies facilitate our approach. to overspecialization [11] and the “Filter-Bubble”-effect [17]: The user obtains only content according to the stored prefer- ences, other related documents not perfectly matching the Categories and Subject Descriptors stored interests are not displayed. Thus, increasing diversity H.3.3 [Information Storage And Retrieval]: Informa- in recommendations has become an own research area [21, 25, tion Search and Retrieval; H.3.5 [Information Storage 24, 18, 3, 6, 23], mainly used to improve the recommendation And Retrieval]: Online Information Services results in news or movie portals. One field where content recommender systems could en- hance daily work is research. Scientists need to be aware General Terms of relevant research in their own but also neighboring fields. Design, Human Factors Increasingly, in addition to literature, the underlying data itself and even data that has not been used in publications are being made publicly available. An important example Keywords for such a discipline is biodiversity research, which explores content filtering, diversity, Linked Open Data, recommender the variety of species and their genetic and characteristic systems, semantic indexing, semantic recommendation diversity [12]. The morphological and genetic information of an organism, together with the ecological and geographical context, forms a highly diverse structure. Collected and stored in different data formats, the datasets often contain or link to spatial, temporal and environmental data [22]. Many important research questions cannot be answered by working with individual datasets or data collected by one group, but require meta-analysis across a wide range of data. Since the analysis of biodiversity data is quite time-consuming, there is Copyright c by the paper’s authors. Copying permitted only a strong need for personalization and new filtering techniques for private and academic purposes. in this research area. Ordinary search functions in relevant In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- data portals or databases, e.g., the Global Biodiversity In- Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. 65 formation Facility (GBIF)1 and the Catalog of Life,2 only that several types of relations can be taken into account. return data that match the user’s query exactly and fail at For instance, for a user interested in “geology”, the profile finding more diverse and semantically related content. Also, contains the concept “geology” that also permits the recom- user interests are not taken into account in the result list. mendation of inferred concepts, e.g., “fossil”. The idea of We believe our semantic-based content recommender system recommending related concepts was first introduced by Mid- could facilitate the difficult and time-consuming research delton et al. [15]. They developed Quickstep, a recommender process in this domain. system for research papers with ontological terms in the user Here, we propose a new semantic-based content recom- profile and for paper categories. 
The ontology only considers mender system that represents the user profile as Linked is-a relationships and omits other relation types (e.g., part- Open Data (LOD) [9] and incorporates semantic annotations of). Another simple hierarchical approach from Shoval et into the recommendation process. Additionally, the search al. [13] calculates the distance among concepts in a profile engine is connected to a terminology server and utilizes the hierarchy. They distinguish between perfect, close and weak provided vocabularies for a recommendation. The result list match. When the concept appears in both a user’s and docu- contains more diverse predictions and includes hierarchical ment’s profile, it is called a perfect match. In a close match, concepts or individuals. the concept emerges only in one of the profiles and a child or The structure of this paper is as follows: Next, we de- parent concept appears in the other. The largest distance is scribe related work. Section 3 presents the architecture of called a weak match, where only one of the profiles contains a our semantic recommender system and some implementation grandchild or grandparent concept. Finally, a weighted sum details. In Section 4, an application scenario is discussed. Fi- over all matching categories leads to the recommendation nally, conclusions and future work are presented in Section 5. list. This ontological filtering method was integrated into the news recommender system epaper. Another semantically en- hanced recommender system is Athena [10]. The underlying 2. RELATED WORK ontology is used to explore the semantic neighborhood in the The major goal of diversity research in recommender sys- news domain. The authors compared several ontology-based tems is to counteract overspecialization [11] and to recom- similarity measures with the traditional TF-IDF approach. mend related products, articles or documents. More books However, this system lacks of a connection to a search engine of an author or different movies of a genre are the classical that allows to query large datasets. applications, mainly used in recommender systems based on All presented systems use manually established vocabular- collaborative filtering methods. In order to enhance the vari- ies with a limited number of classes. None of them utilize ety in book recommendations, Ziegler et al. [25] enrich user a generic user profile to store the preferences in a seman- profiles with taxonomical super-topics. The recommendation tic format (RDF/XML or OWL). The FOAF (Friend Of A list generated by this extended profile is merged with a rank Friend) project3 provides a vocabulary for describing and in reverse order, called dissimilarity rank. Depending on a connecting people, e.g., demographic information (name, ad- certain diversification factor, this merging process supports dress, age) or interests. As one of the first, in 2006 Celma [2] more or less diverse recommendations. Larger diversification leveraged FOAF in his music recommender system to store factors lead to more diverse products beyond user interests. users’ preferences. Our approach goes beyond the FOAF Zhang and Hurley [24] favor another mathematical solution interests, by incorporating another generic user model vo- and describe the balance between diversity and similarity as cabulary, the Intelleo User Modelling Ontology (IUMO).4 a constrained optimization problem. 
They compute a dis- Besides user interests, IUMO offers elements to store learning similarity matrix according to applied criterias, e.g., movie goals, competences and recommendation preferences. This genres, and assign a matching function to find a subset of allows to adapt the results to a user’s previous knowledge or products that are diverse as well as similar. One hybrid to recommend only documents for a specific task. approach by van Setten [21] combines the results of several conventional algorithms, e.g., collaborative and case-based, to improve movie recommendations. Mainly focused on news 3. DESIGN AND IMPLEMENTATION or social media, approaches using content-based filtering In this section, we describe the architecture and some methods try to present different viewpoints on an event to implementation details of our semantic-based recommender decrease the media bias in news portals [18, 3] or to facilitate system (Figure 1). The user model component, described in the filtering of comments [6, 23]. Section 3.1, contains all user information. The source files, Apart from Ziegler et al., none of the presented approaches described in Section 3.2, are analyzed with GATE [5], as de- have considered semantic technologies. However, utilizing scribed in Section 3.3. Additionally, GATE is connected with ontologies and storing user or document profiles in triple a terminology server (Section 3.2) to annotate documents stores represents a large potential for diversity research in with concepts from the provided biodiversity vocabularies. recommender systems. Frasincar et al. [7] define semanti- In Section 3.4, we explain how the annotated documents are cally enhanced recommenders as systems with an underly- indexed with GATE Mı́mir [4]. The final recommendation list ing knowledge base. This can either be linguistic-based [8], is generated in the recommender component (Section 3.5). where only linguistic relations (e.g., synonymy, hypernomy, meronymy, antonymy) are considered, or ontology-based. In 3.1 User profile the latter case, the content and the user profile are repre- The user interests are stored in an RDF/XML format uti- sented with concepts of an ontology. This has the advantage lizing the FOAF vocabulary for general user information. In 1 3 GBIF, http://www.gbif.org FOAF, http://xmlns.com/foaf/spec/ 2 4 Catalog of Life, http://www.catalogueoflife.org/col/ IUMO, http://intelleo.eu/ontologies/user-model/ search/all/ spec/ 66 Figure 1: The architecture of our semantic content recommender system order to improve the recommendations regarding a user’s existing vocabularies. Furthermore, biodiversity is an inter- previous knowledge and to distinguish between learning goals, disciplinary field, where the results from several sources have interests and recommendation preferences, we incorporate to be linked to gain new knowledge. A recommender system the Intelleo User Modelling Ontology for an extended profile for this domain needs to support scientists by improving this description. Recommendation preferences will contain set- linking process and helping them finding relevant content in tings in respect of visualization, e.g., highlighting of interests, an acceptable time. and recommender control options, e.g., keyword-search or Researchers in the biodiversity domain are advised to store more diverse results. Another adjustment will adapt the their datasets together with metadata, describing informa- result set according to a user’s previous knowledge. 
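A Turtle sketch of a profile in the spirit of Listing 1 is given below; the um: namespace URI and most property choices are assumptions made here for illustration, and only foaf:firstName and um:TopicPreference also appear in the SPARQL query of Listing 3.

  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix um:   <http://intelleo.eu/ontologies/user-model/ns/> .
  @prefix obo:  <http://purl.obolibrary.org/obo/> .

  <#me> a foaf:Person ;
      foaf:firstName "Felicitas" ;
      foaf:familyName "Loeffler" ;
      foaf:gender "female" ;
      # interest stored as a link to the LOD resource, not as a textual keyword
      um:TopicPreference obo:ENVO_01000009 .   # "biotic mesoscopic physical object"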
In order tion about their collected data. A very common metadata to enhance the comprehensibility for a beginner, the system format is ABCD.7 This XML-based standard provides ele- could provide synonyms; and for an expert the recommender ments for general information (e.g., author, title, address), could include more specific documents. as well as additional biodiversity related metadata, like infor- The interests are stored in form of links to LOD resources. mation about taxonomy, scientific name, units or gathering. For instance, in our example profile in Listing 1, a user is Very often, each taxon needs specific ABCD fields, e.g., fossil interested in “biotic mesoscopic physical object”, which is a datasets include data about the geological era. Therefore, concept from the ENVO5 ontology. Note that the interest several additional ABCD-related metadata standards have entry in the RDF file does not contain the textual description, emerged (e.g., ABCDEFG8 , ABCDDNA9 ). One document but the link to the concept in the ontology, i.e., http://purl. may contain the metadata of one or more species observations obolibrary.org/obo/ENVO_01000009. Currently, we only in a textual description. This provides for annotation and support explicit user modelling. Thus, the user information indexing for a semantic search. For our prototype, we use the has to be added manually to the RDF/XML file. Later, we ABCDEFG metadata files provided by the GFBio10 project; intend to develop a user profiling component, which gathers specifically, metadata files from the Museum für Naturkunde a user’s interests automatically. The profile is accessible via (MfN).11 An example for an ABCDEFG metadata file is an Apache Fuseki6 server. presented in Listing 2, containing the core ABCD structure as well as additional information about the geological era. Listing 1: User profile with interests stored as The terminology server supplied by the GFBio project of- Linked Open Data URIs fers access to several biodiversity vocabularies, e.g., ENVO, BEFDATA, TDWGREGION. It also provides a SPARQL Felicitas 3.3 Semantic annotation Loeffler The source documents are analyzed and annotated accord- Felicitas Loeffler ing to the vocabularies provided by the terminology server. Female that offers several standard language engineering components Friedrich Schiller University Jena [5]. We developed a custom GATE pipeline (Figure 2) that felicitas.loeffler@uni−jena.de analyzes the documents: First, the documents are split into included in the GATE distribution. Afterwards, an ‘Anno- tation Set Transfer’ processing resource adds the original 7 3.2 Source files and terminology server 8 ABCD, http://www.tdwg.org/standards/115/ ABCDEFG, http://www.geocase.eu/efg The content provided by our recommender comes from the 9 ABCDDNA, http://www.tdwg.org/standards/640/ biodiversity domain. This research area offers a wide range of 10 GFBio, http://www.gfbio.org 5 11 ENVO, http://purl.obolibrary.org/obo/envo.owl MfN, http://www.naturkundemuseum-berlin.de/ 6 12 Apache Fuseki, http://jena.apache.org/documentation/ GFBio terminology server, http://terminologies.gfbio. serving_data/ org/sparql/ 67 Figure 2: The GFBio pipeline in GATE presenting the GFBio annotations markups of the ABCDEFG files to the annotation set, e.g., the user in steering the recommendation process actively. abcd:HigherTaxon. The following ontology-aware ‘Large KB The recommender component is still under development and Gazetteer’ is connected to the terminology server. 
For each has not been added to the implementation yet. document, all occurring ontology classes are added as specific “gfbioAnnot” annotations that have both instance (link to Listing 2: Excerpt from a biodiversity metadata file the concrete source document) and class URI. At the end, a in ABCDEFG format [20] ‘GATE Mı́mir Processing Resource’ submits the annotated documents to the semantic search engine. 3.4 Semantic indexing For semantic indexing, we are using GATE Mı́mir:13 “Mı́mir MfN − Fossil invertebrates is a multi-paradigm information management index and Gastropods, bivalves, brachiopods, sponges repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic metadata (instance data)” [4]. Besides ordinary keyword- Gastropods, Bivalves, Brachiopods, Sponges based search, Mı́mir incorporates the previously generated semantic annotations from GATE to the index. Addition- ally, it can be connected to the terminology server, allowing MfN queries over the ontologies. All index relevant annotations MfN − Fossil invertebrates Ia and the connection to the terminology server are specified in MB.Ga.3895 an index template. 3.5 Content recommender Euomphaloidea Family The Java-based content recommender sends a SPARQL query to the Fuseki Server and obtains the interests and preferred recommendation techniques from the user profile Euomphalus sp. SPARQL query to the Mı́mir server. Presently, this query asks only for child nodes (Figure 3). The result set contains ABCDEFG metadata files related to a user’s interests. We intend to experiment with further semantic relations in the future, e.g., object properties. Assuming that a specific fossil used to live in rocks, it might be interesting to know if other System species, living in this geological era, occured in rocks. An- Triassic other filtering method would be to use parent or grandparent provide control options and feedback mechanisms to support 13 GATE Mı́mir, https://gate.ac.uk/mimir/ 68 Figure 3: A search for “biotic mesoscopic physical object” returning documents about fossils (child concept) 4. APPLICATION The semantic content recommender system allows the recommendation of more specific and diverse ABCDEFG metadata files with respect to the stored user interests. List- ing 3 shows the query to obtain the interests from a user profile, introduced in Listing 1. The result contains a list of (LOD) URIs to concepts in an ontology. Figure 4: An excerpt from the ENVO ontology Listing 3: SPARQL query to retrieve user interests 5. CONCLUSIONS SELECT ?label ?interest ?syn WHERE We introduced our new semantically enhanced content { recommender system for the biodiversity domain. Its main ?s foaf:firstName "Felicitas" . benefit lays in the connection to a search engine supporting ?s um:TopicPreference ?interest . ?interest rdfs:label ?label . integrated textual, linguistic and ontological queries. We are ?interest oboInOwl:hasRelatedSynonym ?syn using existing vocabularies from the terminology server of the } GFBio project. The recommendation list contains not only classical keyword-based results, but documents including In this example, the user would like to obtain biodiversity semantically related concepts. datasets about a “biotic mesoscopic physical object”, which In future work, we intend to integrate semantic-based rec- is the textual description of http://purl.obolibrary.org/ ommender algorithms to obtain further diverse results and to obo/ENVO_01000009. 
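A possible Java counterpart of this step, using Apache Jena to run the query of Listing 3 against the Fuseki server and to collect the interest URIs that are subsequently sent to Mímir, is sketched below; the endpoint URL and the IUMO namespace are assumptions.

import org.apache.jena.query.*;

// Sketch: retrieve a user's interests (LOD URIs) from the Fuseki server,
// as the content recommender does before querying Mimir (Section 3.5).
public class InterestLookupSketch {
    public static void main(String[] args) {
        String endpoint = "http://localhost:3030/profiles/sparql"; // assumed Fuseki endpoint

        String query =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> " +
            "PREFIX um: <http://intelleo.eu/ontologies/user-model/ns/> " +  // assumed namespace
            "SELECT ?label ?interest ?syn WHERE { " +
            "  ?s foaf:firstName \"Felicitas\" . " +
            "  ?s um:TopicPreference ?interest . " +
            "  ?interest rdfs:label ?label . " +
            "  ?interest oboInOwl:hasRelatedSynonym ?syn " +
            "}";

        try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                // The interest URI (e.g. the ENVO concept) is what gets sent to Mimir;
                // label and synonyms can be shown to the user.
                System.out.println(row.getResource("interest").getURI()
                        + " (" + row.getLiteral("label").getString() + ")");
            }
        }
    }
}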
This technical term might be incom- support the interdisciplinary linking process in biodiversity prehensible for a beginner, e.g., a student, who would prefer research. We will set up an experiment to evaluate the algo- a description like “organic material feature”. Thus, for a rithms in large datasets with the established classification later adjustment of the result according to a user’s previous metrics Precision and Recall [14]. Additionally, we would knowledge, the system additionally returns synonyms. like to extend the recommender component with control op- The returned interest (LOD) URI is utilized for a second tions for the user [1]. Integrated into a portal, the result query to the search engine (Figure 3). The connection to the list should be adapted according to a user’s recommendation terminology server allows Mı́mir to search within the ENVO settings or adjusted to previous knowledge. These control ontology (Figure 4) and to include related child concepts functions allow the user to actively steer the recommenda- as well as their children and individuals. Since there is no tion process. We are planning to utilize the new layered metadata file containing the exact term “biotic mesoscopic evaluation approach for interactive adaptive systems from physical object”, a simple keyword-based search would fail. Paramythis, Weibelzahl and Masthoff [16]. Since adaptive However, Mı́mir can retrieve more specific information than systems present different results to each user, ordinary eval- stored in the user profile and is returning biodiversity meta- uation metrics are not appropriate. Thus, accuracy, validity, data files about “fossil”. That ontology class is a child node of usability, scrutability and transparency will be assessed in “biotic mesoscopic physical object” and represents a semantic several layers, e.g., the collection of input data and their relation. Due to a high similarity regarding the content of interpretation or the decision upon the adaptation strategy. the metadata files, the result set in Figure 3 contains only This should lead to an improved consideration of adaptivity documents which closely resemble each other. in the evaluation process. 69 6. ACKNOWLEDGMENTS P. B. Kantor, editors, Recommender Systems Handbook, This work was supported by DAAD (German Academic pages 73–105. Springer, 2011. Exchange Service)14 through the PPP Canada program and [12] M. Loreau. Excellence in ecology. International Ecology by DFG (German Research Foundation)15 within the GFBio Institute, Oldendorf, Germany, 2010. project. [13] V. Maidel, P. Shoval, B. Shapira, and M. Taieb-Maimon. Ontological content-based filtering 7. REFERENCES for personalised newspapers: A method and its evaluation. Online Information Review, 34 Issue [1] F. Bakalov, M.-J. Meurs, B. König-Ries, B. Sateli, 5:729–756, 2010. R. Witte, G. Butler, and A. Tsang. An approach to [14] C. D. Manning, P. Raghavan, and H. Schütze. controlling user models and personalization effects in Introduction to Information Retrieval. Cambridge recommender systems. In Proceedings of the 2013 University Press, 2008. international conference on Intelligent User Interfaces, [15] S. E. Middleton, N. R. Shadbolt, and D. C. D. Roure. IUI ’13, pages 49–56, New York, NY, USA, 2013. ACM. Ontological user profiling in recommender systems. [2] Ò. Celma. FOAFing the music: Bridging the semantic ACM Trans. Inf. Syst., 22(1):54–88, Jan. 2004. gap in music recommendation. In Proceedings of 5th [16] A. Paramythis, S. Weibelzahl, and J. Masthoff. 
Layered International Semantic Web Conference, pages 927–934, evaluation of interactive adaptive systems: Framework Athens, GA, USA, 2006. and formative methods. User Modeling and [3] S. Chhabra and P. Resnick. Cubethat: News article User-Adapted Interaction, 20(5):383–453, Dec. 2010. recommender. In Proceedings of the sixth ACM [17] E. Pariser. The Filter Bubble - What the internet is conference on Recommender systems, RecSys ’12, pages hiding from you. Viking, 2011. 295–296, New York, NY, USA, 2012. ACM. [18] S. Park, S. Kang, S. Chung, and J. Song. Newscube: [4] H. Cunningham, V. Tablan, I. Roberts, M. Greenwood, delivering multiple aspects of news to mitigate media and N. Aswani. Information extraction and semantic bias. In Proceedings of the SIGCHI Conference on annotation for multi-paradigm information Human Factors in Computing Systems, CHI ’09, pages management. In M. Lupu, K. Mayer, J. Tait, and A. J. 443–452, New York, NY, USA, 2009. ACM. Trippe, editors, Current Challenges in Patent [19] G. Salton and C. Buckley. Term-weighting approaches Information Retrieval, volume 29 of The Information in automatic text retrieval. Information Processing and Retrieval Series, pages 307–327. Springer Berlin Management, 24:513–523, 1988. Heidelberg, 2011. [20] Museum für Naturkunde Berlin. Fossil invertebrates, [5] H. Cunningham et al. Text Processing with GATE UnitID:MB.Ga.3895. (Version 6). University of Sheffield, Dept. of Computer http://coll.mfn-berlin.de/u/MB_Ga_3895.html. Science, 2011. [21] M. van Setten. Supporting people in finding [6] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg. information: hybrid recommender systems and Opinion space: A scalable tool for browsing online goal-based structuring. PhD thesis, Telematica Instituut, comments. In Proceedings of the SIGCHI Conference University of Twente, The Netherlands, 2005. on Human Factors in Computing Systems, CHI ’10, pages 1175–1184, New York, NY, USA, 2010. ACM. [22] R. Walls, J. Deck, R. Guralnick, S. Baskauf, R. Beaman, and et al. Semantics in Support of [7] F. Frasincar, W. IJntema, F. Goossen, and Biodiversity Knowledge Discovery: An Introduction to F. Hogenboom. A semantic approach for news the Biological Collections Ontology and Related recommendation. Business Intelligence Applications Ontologies. PLoS ONE 9(3): e89606, 2014. and the Web: Models, Systems and Technologies, IGI Global, pages 102–121, 2011. [23] D. Wong, S. Faridani, E. Bitton, B. Hartmann, and K. Goldberg. The diversity donut: enabling participant [8] F. Getahun, J. Tekli, R. Chbeir, M. Viviani, and control over the diversity of recommended responses. In K. Yétongnon. Relating RSS News/Items. In CHI ’11 Extended Abstracts on Human Factors in M. Gaedke, M. Grossniklaus, and O. Dı́az, editors, Computing Systems, CHI EA ’11, pages 1471–1476, ICWE, volume 5648 of Lecture Notes in Computer New York, NY, USA, 2011. ACM. Science, pages 442–452. Springer, 2009. [24] M. Zhang and N. Hurley. Avoiding monotony: [9] T. Health and C. Bizer. Linked Data: Evolving the Web Improving the diversity of recommendation lists. In into a Global Data Space. Synthesis Lectures on the Proceedings of the 2008 ACM Conference on Semantic Web: Theory and Technology. Morgan & Recommender Systems, RecSys ’08, pages 123–130, New Claypool, 2011. York, NY, USA, 2008. ACM. [10] W. IJntema, F. Goossen, F. Frasincar, and [25] C.-N. Ziegler, G. Lausen, and L. Schmidt-Thieme. F. Hogenboom. Ontology-based news recommendation. 
Taxonomy-driven computation of product In Proceedings of the 2010 EDBT/ICDT Workshops, recommendations. In Proceedings of the Thirteenth EDBT ’10, pages 16:1–16:6, New York, NY, USA, 2010. ACM International Conference on Information and ACM. Knowledge Management, CIKM ’04, pages 406–415, [11] P. Lops, M. de Gemmis, and G. Semeraro. New York, NY, USA, 2004. ACM. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and 14 DAAD, https://www.daad.de/de/ 15 DFG, http://www.dfg.de 70 Exploring Graph Partitioning for Shortest Path Queries on Road Networks Theodoros Chondrogiannis Johann Gamper Free University of Bozen-Bolzano Free University of Bozen-Bolzano tchond@inf.unibz.it gamper@inf.unibz.it ABSTRACT The classic solution for the shortest path problem is Dijkstra’s al- Computing the shortest path between two locations in a road net- gorithm [1]. Given a source s and a destination t in a road network work is an important problem that has found numerous applica- G, Dijkstra’s algorithm traverses the vertices in G in ascending or- tions. The classic solution for the problem is Dijkstra’s algo- der of their distances to s. However, Dijkstra’s algorithm comes rithm [1]. Although simple and elegant, the algorithm has proven with a major shortcoming. When the distance between the source to be inefficient for very large road networks. To address this defi- and the target vertex is high, the algorithm has to expand a very ciency of Dijkstra’s algorithm, a plethora of techniques that intro- large subset of the vertices in the graph. To address this short- duce some preprocessing to reduce the query time have been pro- coming, several techniques have been proposed over the last few posed. In this paper, we propose Partition-based Shortcuts (PbS), a decades [3]. Such techniques require a high start-up cost, but in technique based on graph-partitioning which offers fast query pro- terms of query processing they outperform Dijkstra’s algorithm by cessing and supports efficient edge weight updates. We present a orders of magnitude. shortcut computation scheme, which exploits the traits of a graph Although most of the proposed techniques offer fast query pro- partition. We also present a modified version of the bidirectional cessing, the preprocessing is always performed under the assump- search [2], which uses the precomputed shortcuts to efficiently an- tion that the weights of a road network remain unchanged over swer shortest path queries. Moreover, we introduce the Corridor time. Moreover, the preprocessing is metric-specific, thus for dif- Matrix (CM), a partition-based structure which is exploited to re- ferent metrics the preprocessing needs to be performed for each duce the search space during the processing of shortest path queries metric. The recently proposed Customizable Route Planning [4] when the source and the target point are close. Finally, we evaluate applies preprocessing for various metrics, i.e., distance, time, turn the performance of our modified algorithm in terms of preprocess- cost and fuel consumption. Such an approach allows a fast com- ing cost and query runtime for various graph partitioning configu- putation of shortest path queries using any metric desired by the rations. user, at the cost of some extra space. Moreover, the update cost for the weights is low since the structure is designed such that only a small part of the preprocessed information has to be recomputed. 
Keywords In this paper, our aim is to develop an approach which offers even Shortest path, road networks, graph partitioning faster query processing, while keeping the update cost of the pre- processed information low. This is particularly important in dy- namic networks, where edge weights might frequently change, e.g., 1. INTRODUCTION due to traffic jams. Computing the shortest path between two locations in a road The contributions of this paper can be summarized as follows: network is a fundamental problem and has found numerous ap- • We present Partitioned-based Shortcuts (PbS), a preprocess- plications. The problem can be formally defined as follows. Let ing method which is based on Customizable Route Planning G(V, E) be a directed weighted graph with vertices V and edges (CRP), but computes more shortcuts in order to reduce the E. For each edge e ∈ E, a weight l(e) is assigned, which usually query processing time. represents the length of e or the time required to cross e. A path p between two vertices s, t ∈ V is a sequence of connected edges, • We propose the Corridor Matrix (CM), a pruning technique p(s, t) = h(s, v1 ), (v1 , v2 ), . . . , (vk , vt )i where (vk , vk+1 ) ∈ E, which can be used for shortest path queries when the source that connects s and t. The shortest path between two vertices s and and the target are very close and the precomputed shortcuts t is the path p(s, t) that has the shortest distance among all paths cannot be exploited. that connect s and t. • We run experiments for several different partition configura- tions and we evaluate our approach in terms of both prepro- cessing and query processing cost. The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe in detail the prepro- In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26 GI- cessing phase of our method. In Section 5, we present a modified Workshop on Foundations of Databases (Grundlagen von Datenbanken), version of the bidirectional search algorithm. In Section 6, we show 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. Copyright c by the paper’s authors. Copying permitted only for private preliminary results of an empirical evaluation. Section 7 concludes and academic purposes.. the paper and points to future research directions. 71 2. RELATED WORK in each component and then CRP applies a modified bidirectional The preprocessing based techniques that have been proposed search algorithm which expands only the shortcuts and the edges in in order to reduce the time required for processing shortest path the source or the target component. The main difference between queries can be classified into different categories [3]. Goal-directed our approach and CRP is that, instead of computing only shortcuts techniques use either heuristics or precomputed information in or- between border nodes in each component, we compute shortcuts der to limit the search space by excluding vertices that are not in from every node of a component to the border nodes of the same the direction of the target. For example, A∗ [5] search uses the component. The extra shortcuts enable the bidirectional algorithm Euclidean distance as a lower bound. ALT [6] uses precomputed to start directly from the border nodes, while CRP has to scan the shortest path distances to a carefully selected set of landmarks and original edges of the source and the target component. produces the lower bound using the triangle inequality. 
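For reference, the classic baseline recalled above can be sketched as follows over the adjacency-list representation implied by the definition of G(V, E) and l(e); this is an illustrative implementation, not code from the paper.

import java.util.*;

// Sketch of Dijkstra's algorithm on a directed weighted graph G(V, E),
// with l(e) given as non-negative integer edge weights.
public class DijkstraSketch {
    /** adj.get(u) holds edges as {v, l(u,v)}; returns dist(s, t). */
    static long shortestPath(List<List<int[]>> adj, int s, int t) {
        long[] dist = new long[adj.size()];
        Arrays.fill(dist, Long.MAX_VALUE);
        dist[s] = 0;
        // Vertices are settled in ascending order of their distance to s.
        PriorityQueue<long[]> queue =                       // queue entries: {dist, vertex}
                new PriorityQueue<>(Comparator.comparingLong(e -> e[0]));
        queue.add(new long[]{0, s});
        while (!queue.isEmpty()) {
            long[] top = queue.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) continue;                 // stale entry, skip
            if (u == t) return dist[t];                     // target settled
            for (int[] edge : adj.get(u)) {
                int v = edge[0], w = edge[1];
                if (dist[u] != Long.MAX_VALUE && dist[u] + w < dist[v]) {
                    dist[v] = dist[u] + w;                  // relax edge (u, v)
                    queue.add(new long[]{dist[v], v});
                }
            }
        }
        return dist[t];                                     // MAX_VALUE if t is unreachable
    }
}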
Some goal- directed techniques exploit graph partitioning in order to prune the 3. PBS PREPROCESSING search space and speed-up queries. Precomputed Cluster Distances The Partition-based Shortcuts (PbS) method we propose ex- (PCD) [7] partitions the graph into k components, computes the ploits graph partitioning to produce shortcuts in a preprocessing distance between all pairs of components and uses the distances be- phase, which during the query phase are used to efficiently com- tween components to compute lower bounds. Arc Flags [8] main- pute shortest path queries. The idea is similar to the concept of tains a vector of k bits for each edge, where the i-th bit is set if the transit nodes [12]. Every shortest path between two nodes lo- arc lies on a shortest path to some vertex of component i. Other- cated in different partitions (also termed components) can be ex- wise, all edges of component i are pruned by the search algorithm. pressed as a combination of three smaller shortest paths. Con- Path Coherent techniques take advantage of the fact that shortest sider the graph in Figure 1 and a query q(s, t), where s ∈ C1 paths in road networks are often spatially coherent. To illustrate the and t ∈ C5 . The shortest path from s to t can be expressed as concept of spatial coherence, let us consider four locations s, s0 , t p(s, bs ) + p(bs , bt ) + p(bt , t), where bs ∈ {b1 , b2 } and bt ∈ and t0 in a road network. If s is close to s0 and t is close to t0 , the {b3 , b4 , b5 }. Before PbS is able to process shortest path queries, shortest path from s to t is likely to share vertices with the shortest a preprocessing phase is required, which consists of three steps: path from s0 to t0 . Spatial coherence methods precompute all short- graph partitioning, in-component shortcut computation and short- est paths and use then some data structures to index the paths and cut graph construction. answer queries efficiently. For example, Spatially Induced Linkage Cognizance (SILC) [9] use a quad-tree [10] to store the paths. Path- 3.1 Graph Partitioning Coherent Pairs Decomposition (PCPD) [11] computes unique path The first step in the pre-processing phase is the graph partition- coherent pairs and retrieves any shortest path recursively in almost ing. Let G(V, E) be a graph with vertices V and edges E. A linear time to the size of the path. partition of G is a set P (G) = {C1 , . . . , Ck } of connected sub- Bounded-hop techniques aim to reduce a shortest path query to graphs Ci of G, also referred to as components of G. For the set a number of look-ups. Transit Node Routing (TNR) [12] is an in- P (G), all components must be disjoint, i.e., C1 ∩ . . . ∩ Ck = ∅. dexing method that imposes a grid on the road network and re- Moreover, let V1 , . . . , V|P (G)| be the sets of vertices of each com- computes the shortest paths from within each grid cell C to a set ponent. The vertex sets of all components must cover the vertex set of vertices that are deemed important for C (so-called access nodes of the graph, i.e., V1 ∪ . . . ∪ V|P (G)| = V . We assign a tag to each of C). More approaches are based on the theory of 2-hop label- node of the original graph, which indicates the component the node ing [13]. During preprocessing, a label L(u) is computed for each is located in. 
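Assuming the three ingredients of this decomposition are available, the combination step itself is a minimization over all border-node pairs, as the following sketch shows. The paper does not enumerate the pairs explicitly but seeds a many-to-many bidirectional search on the shortcut graph (Section 5); the sketch and its packed map keys are therefore only an illustration of the decomposition, not the proposed query algorithm.

import java.util.Map;

// Sketch of the decomposition p(s, b_s) + p(b_s, b_t) + p(b_t, t): the shortest
// s-t distance is the minimum of the three-part sums over all outgoing border
// nodes b_s of C_s and all incoming border nodes b_t of C_t.
public class CombineSketch {
    /** toBorder: length of the shortcut from s to each b_s;
     *  fromBorder: length of the shortcut from each b_t to t;
     *  between: border-to-border distances, keyed by the packed pair (b_s, b_t). */
    static long distance(Map<Integer, Long> toBorder, Map<Integer, Long> fromBorder,
                         Map<Long, Long> between) {
        long best = Long.MAX_VALUE;
        for (Map.Entry<Integer, Long> out : toBorder.entrySet()) {
            for (Map.Entry<Integer, Long> in : fromBorder.entrySet()) {
                // pack the border-node pair (b_s, b_t) into one long key
                Long mid = between.get(((long) out.getKey() << 32)
                        | (in.getKey() & 0xffffffffL));
                if (mid == null) continue;
                long cand = out.getValue() + mid + in.getValue();
                if (cand < best) best = cand;
            }
        }
        return best;
    }
}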
The set of connecting edges, EC ⊆ E, is the set of all vertex u of the graph such that for any pair u, v of vertices, the edges in the graph for which the source and target nodes belong to distance dist(u, v) can be determined by only looking at the labels different components, i.e., (n, n0 ) ∈ E such that n ∈ Ci , n0 ∈ Cj L(u) and L(v). A natural special case of this approach is Hub La- and Ci 6= Cj . Finally, we define the border nodes of a component beling (HL) [14], in which the label L(u) associated with vertex C. A node n ∈ C is a border node of C if there exists a connecting u consists of a set of vertices (the hubs of u), together with their edge e = (n, n0 ) or e = (n0 , n), i.e., n0 is not in C. If e = (n, n0 ), distances from u. n is called outgoing border node of C, whereas if e = (n0 , n), n Finally, Hierarchical techniques aim to impose a total order on is called incoming border node of C. The set of all border nodes the nodes as they deem nodes that are crossed by many shortest of a graph is referred to as B. Figure 1 illustrates a graph parti- paths as more important. Highway Hierarchies (HH) [15] and its tioned into five components. The filled nodes are the border nodes. direct descendant Contraction Hierarchies (CH) organize the nodes Note that for ease of exposition we use only undirected graphs in in the road network into a hierarchy based on their relative im- the examples. portance, and create shortcuts among vertices at the same level of the hierarchy. Arterial Hierarchies (AH) [16] are inspired by CH, but produce shortcuts by imposing a grid on the graph. AH outperform CH in terms of both asymptotic and practical perfor- mance [17]. Some hierarchical approaches exploit graph partition to create shortcuts. HEPV [18] and HiTi [19] are techniques that pre-computes the distance between any two boundary vertices and create a new overlay graph. By partitioning the overlay graph and repeating the process several times, a hierarchy of partitions is cre- ated, which is used to process shortest path queries. The recent Customizable Route Planning (CRP) [4] is the clos- est work to our own. CRP is able to handle various arbitrary met- rics and can also handle dynamic edge weight updates. CRP uses PUNCH [20], a graph partitioning algorithm tailored to road net- works. CRP pre-computes distances between boundary vertices Figure 1: Partitioned graph into five components. 72 We characterize a graph partition as good if it minimizes the Thus, the number of vertices and edges in the shortcut graph is, number of connecting edges between the components. However, respectively, graph partitioning is an N P -hard problem, thus an optimal solu- k tion is out of the question [21]. A popular approach is multilevel X |B| = |Biinc ∪ Biout | and graph partitioning (MGP), which can be found in many software i=1 libraries, such as METIS [22]. Algorithms such as PUNCH [20] k X and Spatial Partition Clustering (SPC) [23] take advantage of road |Esc | = (|Biinc | × |Biout |) + EC . network characteristics in order to provide a more efficient graph i=1 partitioning. We use METIS for graph partitioning since it is the most efficient approach out of all available ones [24]. METIS re- Figure 3 shows the shortcut graph of our running example. Notice quires only the number of components as an argument in order to that only border nodes are vertices of the shortcut graph. The set of perform the partitioning. 
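A straightforward way to derive these sets, given a node-to-component assignment, is sketched below; the data layout is an assumption, since the paper does not prescribe one.

import java.util.*;

// Sketch: derive connecting edges and incoming/outgoing border nodes from a
// node-to-component assignment, following the definitions in Section 3.1.
public class PartitionSketch {
    final int[] component;                 // component[v] = component tag of node v
    final List<int[]> connectingEdges = new ArrayList<>();   // edges (u, v) across components
    final Set<Integer>[] outBorder;        // outgoing border nodes per component
    final Set<Integer>[] inBorder;         // incoming border nodes per component

    @SuppressWarnings("unchecked")
    PartitionSketch(int k, int[] component, List<List<Integer>> adj) {
        this.component = component;
        outBorder = new HashSet[k];
        inBorder = new HashSet[k];
        for (int i = 0; i < k; i++) {
            outBorder[i] = new HashSet<>();
            inBorder[i] = new HashSet<>();
        }
        for (int u = 0; u < adj.size(); u++) {
            for (int v : adj.get(u)) {
                if (component[u] != component[v]) {          // connecting edge (u, v)
                    connectingEdges.add(new int[]{u, v});
                    outBorder[component[u]].add(u);          // u is an outgoing border node
                    inBorder[component[v]].add(v);           // v is an incoming border node
                }
            }
        }
    }
}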
The number of components influences edges consists of connecting edges and the in-component shortcuts both the number of the in-component shortcuts and the size of the between the border nodes of the same component. Note that there shortcut graph. is no need for extra computations in order to populate the shortcut graph. 3.2 In-component Shortcuts The second step of the preprocessing phase is the computation of the in-component shortcuts. For each node n in the original graph, we compute the shortest path from the node to every outgoing bor- der node of the component in which n is located. Then we create outgoing shortcuts which abstract the shortest path from n to each outgoing border node. The incoming shortcuts are computed in a similar fashion. Thus, the total number of in-component shortcuts, S, is k X S= Ni × (|Biinc | + |Biout |), i=1 where Ni is the number of nodes in component Ci and Biinc , Biout are the incoming and outgoing border nodes of Ci , respectiv- Figure 3: Shortcut Graph illustrated over the original. elly. Figure 2 shows the in-component shortcuts for a node located in component C2 . 4. CORRIDOR MATRIX In Section 3 we presented how PbS creates shortcuts in order to answer queries when the source and the target points are in differ- ent components. However, when the source and the target points of a query are located in the same component, the shortest path may lie entirely inside the component. Therefore, the search algo- rithm will never reach the border nodes and the shortcuts will not be expanded. In such a case, the common approach is to use bidi- rectional search to return the shortest path. However, if the compo- nents of the partitioned graph are large, the query processing can be quite slow. In order to improve the processing time of such queries, we partition each component again into sub-components, and for each component, we compute its Corridor Matrix (CM). In gen- Figure 2: In-component shortcuts for a given node. eral, given a partition of a graph G in k components, the Corridor Matrix (CM) of G is a k × k matrix, where each cell C(i, j) of For each border node in a component, b ∈ C, we execute Di- CM contains a list of components that are crossed by some short- jkstra’s algorithm with b as source and all other nodes (including est path from a node s ∈ Ci to a node t ∈ Cj . We call such a border nodes) in C as targets. Depending on the type of the source list the corridor from Ci to Cj . The concept of the CM is similar node, the expansion strategy is different. When an incoming bor- to Arc-Flags [8], but the CM requires much less space. The space der node is the source, forward edges are expanded; vice versa, complexity of the CM is O(k3 ), where k is the number of compo- when an outgoing border node is the source, incoming edges are nents in the partition, while the space complexity of Arc-Flags is expanded. This strategy ensures that the maximum number of node |E| × k2 , where |E| is the number of edges in the original graph. expansions is at most twice the number of border nodes of G. C1 C2 C3 C4 C5 3.3 Shortcut Graph Construction C1 ∅ {C2 , C3 } The third step of the preprocessing phase of our approach is the C2 ∅ construction of the shortcut graph. Given a graph G, the shortcut C3 ∅ graph of G is a graph Gsc (B, Esc ), where B is the set of border C4 ∅ nodes of G and Esc = EC ∪ SG is the union of the connecting C5 ∅ edges, EC , of G and the shortcuts, SG , from every incoming bor- der node to every outgoing border node of the same component. 
Figure 4: Corridor Matrix example. 73 To optimize the look-up time in CM, we implemented each com- Name Region # Vertices # Edges ponent list using a bitmap of length k. Therefore, the space com- CAL California/Nevada 1,890,815 4,657,742 plexity of the CM in the worst case is O(k3 ). The actual space FLA Florida 1,070,376 2,712,798 occupied by the CM is smaller, since we do not allocate space for BAY SF Bay Area 321,270 800,172 bitmaps when the component list is empty. For the computation of NY New York City 264,346 733,846 the Corridor Matrix, we generate the Shortcut Graph in the same ROME Center of Rome 3353 8,859 way as described in Section 3.3. To compute the distances between all pairs of vertices, we use the Floyd-Warshall algorithm [25], Table 1: Dataset characteristics. which is specifically designed to compute the all-pair shortest path distance efficiently. After having computed the distances between the nodes, instead of retrieving each shortest path, we retrieve only the components that are crossed by each path, and we update the contain 1000 queries each. We make sure that the distance of ev- CM accordingly. ery query in set Qi is smaller than the distance of every query in set Qi+1 . We also evaluate the CM separately by comparing our CM implementation against Arc Flags and the original bidi- 5. SHORTEST PATH ALGORITHM rectional search for a set of 1000 random queries in the ROME In order to process a shortest path query from a source point s dataset. We use a small dataset in order to simulate in-component to a target point t, we first determine the components of the graph query processing. the nodes s ∈ Cs and t ∈ Ct are located in. If Cs = Ct , we execute a modified bidirectional search from s to t. Note that the 6.1 Preprocessing and Space Overhead shortcuts are not used for processing queries for which the source Figures 5 and 6 show a series of measurements for the prepro- and target are located in the same component C. Instead, we re- cessing cost of our approach in comparison to CRP and CH over trieve the appropriate corridor from the CM of C, which contains the four largest datasets. Figure 5 shows how many shortcuts are a list of sub-components. Then, we apply bidirectional search and created by each approach. The extra shortcuts can be translated prune all nodes that belong to sub-components which are not in the into the space overhead required in order to speed-up shortest path retrieved corridor. queries. CH uses shortcuts which represent only two edges, while In the case that the points s and t are not located in the same the shortcuts in PbS and CRP are composed of much longer se- component, we exploit the pre-computed shortcuts. First, we re- quences. The difference between the shortcuts produced by CRP trieve the lengths of the in-component outgoing shortcuts from s to and CH is much less. In short, PbS produces about two orders of all the outgoing borders of Cs and the length of the in-component magnitude more shortcuts than CRP and CH. Moreover, we can ob- incoming shortcuts from all the incoming borders of Ct to t. Then serve that the number of shortcuts produced by PbS is getting lower we apply a many-to-many bidirectional search in the overlay graph as the number of components is increasing. from all the outgoing borders of Cs to all the incoming borders of Ct . We use the length of the in-component shortcuts (retrieved CH CRP PbS in the first step) as initial weights for the source and target nodes of the bidirectional search in the Shortcut Graph. 
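One possible realization of the bitmap-based CM and of the pruning test used by the restricted bidirectional search is sketched below; the representation is an assumption, and populating the corridors from the Floyd-Warshall paths is only indicated by the insertion method.

import java.util.BitSet;

// Sketch of the Corridor Matrix (Section 4): cell (i, j) stores, as a bitmap of
// length k, the components crossed by some shortest path from C_i to C_j.
public class CorridorMatrixSketch {
    private final BitSet[][] cell;
    private final int k;

    CorridorMatrixSketch(int k) {
        this.k = k;
        cell = new BitSet[k][k];             // bitmaps are allocated lazily to save space
    }

    /** Record that some shortest path from C_i to C_j crosses component c. */
    void addToCorridor(int i, int j, int c) {
        if (cell[i][j] == null) cell[i][j] = new BitSet(k);
        cell[i][j].set(c);
    }

    /** Pruning test used by the bidirectional search inside a component:
     *  a node of sub-component c is expanded only if c lies in the corridor. */
    boolean inCorridor(int i, int j, int c) {
        return cell[i][j] != null && cell[i][j].get(c);
    }
}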
The list of edges 3 ·10 7 shortcuts 3 ·107 shortcuts consisting the path is a set of connecting edges of the original graph and in-component shortcuts. For each shortcut we retrieve the pre- computed set of the original edges. The cost to retrieve the original 2 2 path is linear to the size of the path. After the retrieval we replace the shortcuts with the list of edges in the original graph and we re- 1 1 turn the new edge list, which is the shortest path from s to t in the original graph. 0 0 128 256 384 512 128 256 384 512 6. PRELIMINARY RESULTS (a) NY (b) BAY In this section, we compare our PbS method with CRP, the 1 ·108 shortcuts 2 ·108 shortcuts method our own approach is based on, and CH, a lightweight yet very efficient state-of-the-art approach for shortest path queries in 0.75 1.5 road networks [17]. CRP can handle arbitrary metrics and edge weight updates, while CH is a technique with fast pre-processing 0.5 1 and relatively low query processing time. We implemented in Java the basic version of CRP and PbS. The CH algorithm in the ex- 0.25 0.5 periments is from Graphhopper Route Planner [26]. Due to the different implementations of the graph models between ours and 0 256 512 768 1,024 0 512 1,024 1,536 2,048 CH, we do not measure the runtime. Instead, for preprocessing we count the extra shortcuts created by each algorithm, while for query (c) FLA (d) CAL processing we count the number of expanded nodes. For the experiments we follow the same evaluation setting as Figure 5: Preprocessing: # of shortcuts vs. # of components. in [17]. We use 5 publicly available datasets [27], four of of which are a part of the US road network, and the smallest one represents The same tendency as observed for the number of shortcuts can the road network of Rome. We present the characteristics of each be observed for the preprocessing time. In Figure 6, we can see dataset in Table 1. In order to compare our PbS approach and CRP that PbS requires much more time than CRP and CH in order to with CH, we run our experiments over 5 query sets Q1 –Q5, which create shortcuts. However, we should also notice that the update 74 cost for CRP and PbS is only a small portion of the preprocessing CRP PbS cost. When an edge weight changes, we need to update only the ·104 expanded nodes ·104 expanded nodes shortcuts that contains that particular edge. In contrast, for CH the 1 1 the update cost is the same as the preprocesing cost since a change 0.75 0.75 in a single weight can influence the entire hierarchy. 0.5 0.5 CH CRP PbS preprocessing time(sec) preprocessing time(sec) 0.25 0.25 300 300 0 0 128 256 384 512 128 256 384 512 200 200 (a) NY (b) BAY ·104 expanded nodes ·104 expanded nodes 2 3 100 100 1.5 2 0 0 128 256 384 512 128 256 384 512 1 (a) NY (b) BAY 1 preprocessing time(sec) preprocessing time(sec) 0.5 1,500 3,000 0 0 256 512 768 1,024 512 1,024 1,536 2,048 1,000 2,000 (c) FLA (d) CAL 500 1,000 Figure 7: Performance of shortest path queries vs. # of components. 0 0 256 512 768 1,024 512 1,024 1,536 2,048 7. CONCLUSION (c) FLA (d) CAL In this paper we presented PbS, an approach which uses graph partitioning in order to compute shortcuts and speed-up shortest Figure 6: Preprocessing: time vs. # of components. path queries in road networks. Our aim was a solution which sup- ports efficient and incremental updates of edge weights, yet is ef- ficient enough in many real-world applications. In the evaluation, 6.2 Query Processing we showed that our PbS approach outperforms CRP. 
PbS supports Figure 7 shows a series of measurements of the performance of edge weight updates as any change in the weight of an edge can CRP and PbS. We evaluate both techniques for different partitions influence only shortcuts in a single component. On the other hand, and various numbers of components. An important observation is CH is faster than our PbS approach. However, CH cannot handle the tendency of the performance for CRP and PbS. The perfor- well edge weight updates as almost the entire hierarchy of short- mance of CRP gets worse for partitions with many components cuts has to be recomputed every time a single weight changes. For while the opposite happens for PbS. The reason is that for parti- queries where the source and the target are in the same component, tions with few components, PbS manages to process many queries we introduced the CM. The efficiency of the CM in query process- with two look-ups (the case where the source and the target are in ing approaches the efficiency of Arc Flags, while consuming much adjacent components). less space. In Figure 8 we compare CH with CRP (we choose the best result) In future work, we plan to extend our approach to support multi- and two configurations of PbS: PbS-BT, which is the configuration modal transportation networks, where the computation has to con- that leads to the best performance, and PbS-AVG, which is the aver- sider a time schedule, and dynamic and traffic aware networks, age performance of PbS among all configurations. We can see that where the weights of the edges change over time. We will also PbS outperforms CRP in all datasets from Q1 to Q5 . However, CH improve the preprocessing phase of our approach both in terms of is faster in terms of query processing than our PbS approach. CH time overhead, by using parallel processing, and space overhead, is more suitable for static networks as the constructed hierarchy of by using compression techniques or storing some of the precom- shortcuts enables the shortest path algorithm to expand much fewer puted information on the disk. nodes. 6.3 In-component Queries 8. REFERENCES In Figure 9, we compare the performance of our bidirectional [1] E. W. Dijkstra. A note on two problems in connexion with algorithm using the proposed CM, the original bidirectional search graphs. Numerische Mathematik, 1(1):269–271, December and the bidirectional algorithm using Arc Flags. We observe that 1959. the bidirectional search is the slowest since no pruning is applied. [2] I. S. Pohl. Bi-directional and Heuristic Search in Path Between Arc Flags and CM, the Arc Flags provide slightly better Problems. PhD thesis, Stanford, CA, USA, 1969. pruning thus fewer expanded nodes by the bidirectional search. On AAI7001588. the other hand, the preprocessing time required to compute the Arc [3] H. Bast, D. Delling, A. Goldberg, M. Müller, T. Pajor, Flags is significantly higher than the time required to compute the P. Sanders, D. Wagner, and R Werneck. Route planning in CM. transportation networks. (MSR-TR-2014-4), January 2014. 75 CH CRP PbS-BT PbS-AVG Int. Workshop on Geographic Information Systems (GIS), page 200, 2005. 8,000 [10] R.A. Finkel and J. L. Bentley. Quad trees: A data structure 8,000 for retrieval on composite keys. Acta Informatica, 4(1):1–9, 6,000 6,000 1974. [11] J. Sankaranarayanan and H. Samet, H. andi Alborzi. Path 4,000 4,000 Oracles for Spatial Networks. In Proc. of the 35th VLDB 2,000 2,000 Conf., pages 1210–1221, 2009. [12] H. Bast, S. Funke, D Matijevic, P. Sanders, and D. 
Schultes. 0 Q1 Q2 Q3 Q4 Q5 0 Q1 Q2 Q3 Q4 Q5 In Transit to Constant Time Shortest-Path Queries in Road Networks. In Proc. of the Workshop on Algorithm (a) NY (b) BAY Engineering and Experiments, pages 45–59, 2007. ·104 ·104 [13] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. 3 Reachability and distance queries via 2-hop labels. In Proc. 1.5 of the 13th ACM-SIAM Symposium on Discrete Algorithms 2 (SODA), pages 937–946, 2002. 1 [14] I. Abraham, D. Delling, A. V. Goldberg, and R. F. Werneck. A hub-based labeling algorithm for shortest paths in road 0.5 1 networks. In Proc. of the 10th Int. Symposium on Experimental Algorithms, pages 230–241, 2011. 0 0 Q1 Q2 Q3 Q4 Q5 Q1 Q2 Q3 Q4 Q5 [15] P. Sanders and D. Schultes. Highway Hierarchies Hasten (c) FLA (d) CAL Exact Shortest Path Queries. In Proc. of the 13th European Conf. on Algorithms (ESA), pages 568–579, 2005. Figure 8: Performance of shortest path queries vs. query sets. [16] A. D. Zhu, H. Ma, X. Xiao, S. Luo, Y. Tang, and S. Zhou. Shortest Path and Distance Queries on Road Networks: Towards Bridging Theory and Practice. In Proc. of the 32nd Bidirectional Arc Flags CM SIGMOD Conf., pages 857–868, 2013. 12 3,000 [17] L. Wu, X. Xiao, D. Deng, G. Cong, and A. D. Zhu. Shortest Path and Distance Queries on Road Networks : An 9 Experimental Evaluation. In Proc. of the 39th VLDB Conf., 2,000 pages 406–417, 2012. 6 [18] Y. W. Huang, N. Jing, and E. A. Rundensteiner. Hierarchical 1,000 path views : A model based on fragmentation and 3 transportation road types. In Proc. of the 3rd ACM Workshop 0 0 Geographic Information Systems (GIS),, 1995. 8 16 24 32 40 48 8 16 24 32 40 48 [19] S. Jung and S. Pramanik. Hiti graph model of topographical (a) Preprocessing time (ms) (b) Visited nodes roadmaps in navigation systems. In Proc. of the 12th ICDE Conf., pages 76–84, 1996. Figure 9: Evaluation of Arc Flags & CM using ROME dataset. [20] D. Delling, A. V. Goldberg, I. Razenshteyn, and R. F. Werneck. Graph Partitioning with Natural Cuts. In Proc. of the 35th Int. Parallel & Distributed Processing Symposium [4] D. Delling, A. V. Goldberg, T. Pajor, and R. F. Werneck. (IPDPS), pages 1135–1146, 2011. Customizable route planning. In Proc. of the 10th Int. [21] A. E. Feldmann and L/ Foschini. Balanced Partitions of Symposium on Experimental Algorithms (SEA), pages Trees and Applications. In 29th Symp. on Theoretical 376–387, 2011. Aspects of Computer Science, volume 14, pages 100–111, [5] P. Hart, N. Nilsson, and B. Raphael. Formal Basis for the Paris, France, 2012. Heuristic Determination of Minimum Cost PAths. IEEE [22] G. Karypis and V. Kumar. A Fast and High Quality Transactions of Systems Science and Cybernetics, Multilevel Scheme for Partitioning Irregular Graphs. SIAM 4(2):100–107, 1968. Journal on Scientific Computing, 20(1):359–392, 1998. [6] A. V. Goldberg and C. Harrelson. Computing the Shortest [23] Y. W. Huang, N. Jing, and E. Rundensteiner. Effective Graph Path : A * Search Meets Graph Theory. In Proc. of the 16th Clustering for Path Queries in Digital Map Databases. In ACM-SIAM Symposium on Discrete Algorithms (SODA), Proc. of the 5th Int. Conf. on Information and Knowledge pages 156–165, 2005. Management, pages 215–222, 1996. [7] J. Maue, P. Sanders, and D. Matijevic. Goal-directed [24] X. Sui, D. Nguyen, M. Burtscher, and K. Pingali. Parallel shortest-path queries using precomputed cluster distances. graph partitioning on multicore architectures. In Proc. of the Journal on Experimental Algorithms, 14:2:3.2–2:3.27, 23rd Int. Conf. 
on Languages and Compilers for Parallel January 2010. Computing, pages 246–260, 2011. [8] E. Köhler, R. H. Möhring, and H. Schilling. Fast [25] R. W. Floyd. Algorithm 97: Shortest path. Communications point-to-point shortest path computations with arc-flags. In of the ACM, 5:345, 1962. Proc. of the 9th DIMACS Implementation Challenge, 2006. [26] https://graphhopper.com. [9] J. Sankaranarayanan, H. Alborzi, and H. Samet. Efficient [27] http://www.dis.uniroma1.it/challenge9/. query processing on spatial networks. In Proc. of the 2005 76 Missing Value Imputation in Time Series using Top-k Case Matching Kevin Wellenzohn Hannes Mitterer Johann Gamper Free University of Free University of Free University of Bozen-Bolzano Bozen-Bolzano Bozen-Bolzano kevin.wellenzohn@unibz.it hannes.mitterer@unibz.it gamper@inf.unibz.it M. H. Böhlen Mourad Khayati University of Zurich University of Zurich boehlen@ifi.uzh.ch mkhayati@ifi.uzh.ch ABSTRACT pecially frost is dangerous as it can destroy the harvest within a In this paper, we present a simple yet effective algorithm, called few minutes unless the farmers react immediately. The Südtiroler the Top-k Case Matching algorithm, for the imputation of miss- Beratungsring operates more than 120 weather stations spread all ing values in streams of time series data that are similar to each over South Tyrol, where each of them collects every five minutes other. The key idea of the algorithm is to look for the k situations up to 20 measurements including temperature, humidity etc. The in the historical data that are most similar to the current situation weather stations frequently suffer outages due to sensor failures or and to derive the missing value from the measured values at these k errors in the transmission of the data. However, the continuous time points. To efficiently identify the top-k most similar historical monitoring of the current weather condition is crucial to immedi- situations, we adopt Fagin’s Threshold Algorithm, yielding an al- ately warn about imminent threats such as frost and therefore the gorithm with sub-linear runtime complexity with high probability, need arises to recover those missing values as soon as they are de- and linear complexity in the worst case (excluding the initial sort- tected. ing of the data, which is done only once). We provide the results In this paper, we propose an accurate and efficient method to of a first experimental evaluation using real-world meteorological automatically recover missing values. The need for a continuous data. Our algorithm achieves a high accuracy and is more accurate monitoring of the weather condition at the SBR has two important and efficient than two more complex state of the art solutions. implications for our solution. Firstly, the proposed algorithm has to be efficient enough to complete the imputation before the next set of measurements arrive in a few minutes time. Secondly, the Keywords algorithm cannot use future measurements which would facilitate Time series, imputation of missing values, Threshold Algorithm the imputation, since they are not yet available. The key idea of our Top-k Case Matching algorithm is to seek for the k time points in the historical data when the measured val- 1. INTRODUCTION ues at a set of reference stations were most similar to the measured Time series data is ubiquitous, e.g., in the financial stock mar- values at the current time point (i.e., the time point when a value is ket or in meteorology. In many applications time series data is in- missing). 
The missing value is then derived from the values at the k complete, that is some values are missing for various reasons, e.g., past time points. While a naïve solution to identify the top-k most sensor failures or transmission errors. However, many applications similar historical situations would have to scan the entire data set, assume complete data, hence need to recover missing values before we adopt Fagin’s Threshold Algorithm, which efficiently answers further data processing is possible. top-k queries by scanning, on average, only a small portion of the In this paper, we focus on the imputation of missing values in data. The runtime complexity of our solution is derived from the long streams of meteorological time series data. As a case study, Threshold Algorithm and is sub-linear with high probability and we use real-world meteorological data collected by the Südtiroler linear in the worst case, when all data need to be scanned. We pro- Beratungsring1 (SBR), which is an organization that provides pro- vide the results of a first experimental evaluation using real-world fessional and independent consultancy to the local wine and apple meteorological data from the SBR. The results are promising both farmers, e.g., to determine the optimal harvesting time or to warn in terms of efficiency and accuracy. Our algorithm achieves a high about potential threats, such as apple scab, fire blight, or frost. Es- accuracy and is more accurate than two state of the art solutions. 1 The rest of the paper is organized as follows. In Section 2, we http://www.beratungsring.org/ review the existing literature about imputation methods for missing values. In Section 3, we introduce the basic notation and a running example. In Section 4, we present our Top-k Case Matching algo- rithm for the imputation of missing values, followed by the results of an experimental evaluation in Section 5. Section 6 concludes the paper and outlines ideas for future work. Copyright © by the paper’s authors. Copying permitted only for private and academic purposes. In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 2. RELATED WORK Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. Khayati et al. [4] present an algorithm, called REBOM, which 77 recovers blocks of missing values in irregular (with non repeating t∈w s(t) r1 (t) r2 (t) r3 (t) trends) time series data. The algorithm is based on an iterated trun- 1 16.1° 15.0° 15.9° 14.1° cated matrix decomposition technique. It builds a matrix which 2 15.8° 15.2° 15.7° 13.9° stores the time series containing the missing values and its k most 3 15.9° 15.2° 15.8° 14.1° correlated time series according to the Pearson correlation coeffi- 4 16.2° 15.0° 15.9° 14.2° cient [7]. The missing values are first initialized using a simple 5 16.5° 15.3° 15.7° 14.5° interpolation technique, e.g., linear interpolation. Then, the ma- 6 16.1° 15.2° 16.0° 14.1° trix is iteratively decomposed using the truncated Singular Value 7 ? 15.0° 16.0° 14.3° Decomposition (SVD). By multiplying the three matrices obtained from the decomposition, the algorithm is able to accurately approx- Table 1: Four time series in a window w = [1, 7]. imate the missing values. Due to its quadratic runtime complexity, REBOM is not scalable for long time series data. s (Schlanders) r1 (Kortsch) Khayati et al. 
[5] further investigate the use of matrix decompo- r2 (Göflan) r3 (Laas) sition techniques for the imputation of missing values. They pro- Temperature in Degree Celsius pose an algorithm with linear space complexity based on the Cen- troid Decomposition, which is an approximation of SVD. Due to 16 the memory-efficient implementation, the algorithm scales to long time series. The imputation follows a similar strategy as the one used in REBOM. 15 The above techniques are designed to handle missing values in static time series. Therefore, they are not applicable in our sce- nario, as we have to continuously impute missing values as soon 14 as they appear. A naïve approach to run the algorithms each time 1 2 3 4 5 6 7 a missing value occurs is not feasible due to their relatively high runtime complexity. Timestamps There are numerous statistical approaches for the imputation of missing values, including easy ones such as linear or spline interpo- Figure 1: Visualization of the time series data. lation, all the way up to more complex models such as the ARIMA model. The ARIMA model [1] is frequently used for forecasting future values, but can be used for backcasting missing values as For the imputation of missing values we assign to each time se- well, although this is a less common use case. A recent comparison ries s a set Rs of reference time series, which are similar to s. of statistical imputation techniques for meteorological data is pre- The notion of similarity between two time series is tricky, though. sented in [9]. The paper comprises several simple techniques, such Intuitively, we want time series to be similar when they have sim- as the (weighted) average of concurrent measurements at nearby ilar values and behave similarly, i.e., values increase and decrease reference stations, but also computationally more intensive algo- roughly at the same time and by the same amount. rithms, such as neural networks. As a simple heuristic for time series similarity, we use the spa- tial proximity between the stations that record the respective time 3. BACKGROUND series. The underlying assumption is that, if the weather stations are nearby (say within a radius of 5 kilometers), the measured val- Let S = {s1 , . . . , sn } be a set of time series. Each time series, ues should be similar, too. Based on this assumption, we manually s ∈ S, has associated a set of reference time series Rs , Rs ⊆ compiled a list of 3–5 reference time series for each time series. S \ {s}. The value of a time series s ∈ S at time t is denoted as This heuristic turned out to work well in most cases, though there s(t). A sliding window of a time series s is denoted as s([t1 , t2 ]) are situations where the assumption simply does not hold. One rea- and represents all values between t1 and t2 . son for the generally good results is most likely that in our data E XAMPLE 1. Table 1 shows four temperature time series in a set the over 100 weather stations cover a relatively small area, and time window w = [1, 7], which in our application corresponds to hence the stations are very close to each other. seven timestamps in a range of 30 minutes. s is the base time series from the weather station in Schlanders, and Rs = {r1 , r2 , r3 } is 4. TOP-K CASE MATCHING the associated set of reference time series containing the stations Weather phenomena are often repeating, meaning that for exam- of Kortsch, Göflan, and Laas, respectively. 
The temperature value ple during a hot summer day in 2014 the temperature measured at s(7) is missing. Figure 1 visualizes this example graphically. the various weather stations are about the same as those measured The Top-k Case Matching algorithm we propose assumes that during an equally hot summer day in 2011. We use this observa- the time series data is aligned, which generally is not the case for tion for the imputation of missing values. Let s be a time series our data. Each weather station collects roughly every 5 minutes where the current measurement at time θ, s(θ), is missing. Our new measurements and transmits them to a central server. Since assumption on which we base the imputation is as follows: if we the stations are not perfectly synchronized, the timestamps of the find historical situations in the reference time series Rs such that measurements typically differ, e.g., one station collects measure- the past values are very close to the current values at time θ, then ments at 09:02, 09:07, . . . , while another station collects them at also the past measurements in s should be very similar to the miss- 09:04, 09:09, . . . . Therefore, in a pre-processing step we align the ing value s(θ). Based on this assumption, the algorithm searches time series data using linear interpolation, which yields measure- for similar climatic situations in historical measurements, thereby ment values every 5 minutes (e.g., 00:00, 00:05, 00:10, . . . ). If we leveraging the vast history of weather records collected by the SBR. observe a gap of more than 10 minutes in the measurements, we More formally, given a base time series s with reference time assume that the value is missing. series Rs , we are looking for the k timestamps (i.e., historical sit- 78 uations), D = {t1 , . . . , tk }, ti < θ, which minimize the error popularity. Let us assume that k = 2 and the aggregation function function f (x1 , x2 ) = x1 + x2 . Further, assume that the bounded X buffer currently contains {(C, 18), (A, 16)} and the algorithm has δ(t) = |r(θ) − r(t)|. read the data up to the boxes shown in gray. At this point the al- r∈Rs gorithm computes the threshold using the interestingness That is, δ(t) ≤ δ(t ) for all t ∈ D and t0 6∈ D ∪ {θ}. The er- 0 grade for object B and the popularity grade of object C, yield- ror function δ(t) is the accumulated absolute difference between ing τ = f (5, 9) = 5 + 9 = 14. Since the lowest ranked object in the current temperature r(θ) and the temperature at time t, r(t), the buffer, object A, has an aggregated grade that is greater than τ , over all reference time series r ∈ Rs . Once D is determined, we can conclude that C and A are the top-2 objects. Note that the the missing value is recovered using some aggregation function algorithm never read object D, yet it can conclude that D cannot g ({s(t)|∀t ∈ D}) over the measured values of the time series s be part of the top-k list. at the timestamps in D. In our experiments we tested the average and the median as aggregation function (cf. Section 5). interestingness popularity E XAMPLE 2. We show the imputation of the missing value s(7) in Table 1 using as aggregation function g the average. For Object grade Object grade the imputation, we seek the k = 2 most similar historical sit- A 10 B 10 uations. The two timestamps D = {4, 1} minimize δ(t) with C 9 C 9 δ(4) = |15.0° − 15.0°| + |16.0° − 15.9°| + |14.3° − 14.2°| = 0.2° B 5 D 8 and δ(1) = 0.3°. 
The imputation is then simply the average D 4 A 6 of the base station measurements at time t = 4 and t = 1, i.e.,s(7) = avg(16.2°, 16.1°) = 12 (16.2° + 16.1°) = 16.15°. Table 2: Threshold Algorithm example. A naïve implementation of this algorithm would have to scan the entire database of historical data to find the k timestamps that 4.2 Adapting the Threshold Algorithm minimize δ(t). This is, however, not scalable for huge time series In order to use the Threshold Algorithm for the imputation of data, hence a more efficient technique is needed. missing values in time series data, we have to adapt it. Instead of looking for the top-k objects that maximize the aggregation func- 4.1 Fagin’s Threshold Algorithm tion f , we want to find the top-k timestamps that minimize the What we are actually trying to do is to answer a top-k query for error function δ(t) over the reference time series Rs . Similar to the k timestamps which minimize δ(t). There exist efficient algo- TA, we need sorted access to the data. Therefore, for each time rithms for top-k queries. For example, Fagin’s algorithm [2] solves series r ∈ Rs we define Lr to be the time series r ordered first this problem by looking only at a small fraction of the data. Since by value and then by timestamp in ascending order. Table 3 shows the first presentation of Fagin’s algorithm there were two notewor- the sorted data for the three reference time series of our running ex- thy improvements, namely the Threshold Algorithm (TA) by Fagin ample (ignore the gray boxes and small subscript numbers for the et al. [3] and a probabilistic extension by Theobald et al. [8]. The moment). latter approach speeds up TA by relaxing the requirement to find the exact top-k answers and providing approximations with proba- Lr1 Lr2 Lr3 bilistic guarantees. Our Top-k Case Matching algorithm is a variation of TA with t r1 (t) t r2 (t) t r3 (t) slightly different settings. Fagin et al. assume objects with m at- 1 15.0° 4 2 15.7° 2 13.9° tributes, a grade for each attribute and a monotone aggregation 4 15.0° 1 5 15.7° 1 14.1° function f : Rm 7→ R, which aggregates the m grades of an ob- 7 15.0° 3 15.8° 3 14.1° ject into an overall grade. The monotonicity property is defined as 2 15.2° 1 15.9° 6 14.1° follows. 3 15.2° 4 15.9° 5 4 14.2° 3 6 15.2° 6 16.0° 2 7 14.3° D EFINITION 1. (Monotonicity) Let x1 , . . . , xm and 5 15.3° 7 16.0° 5 14.5° 6 x01 , . . . , x0m be the m grades for objects X and X 0 , re- spectively. The aggregation function f is monotone if Table 3: Time series sorted by temperature. f (x1 , . . . , xm ) ≤ f (x01 , . . . , x0m ) given that xi ≤ x0i for each 1 ≤ i ≤ m. The general idea of our modified TA algorithm is the following. The TA finds the k objects that maximize the function f . To do The scan of each sorted lists starts at the current element, i.e., the so it requires two modes of accessing the data, one being sorted and element with the timestamp t = θ. Instead of scanning the lists Lri the other random access. The sorted access is ensured by maintain- only in one direction as TA does, we scan each list sequentially ing a sorted list Li for each attribute mi , ordered by the grade in in two directions. Hence, as an initialization step, the algorithm − descending order. TA keeps a bounded buffer of size k and scans places two pointers, pos+ r and posr , at the current value r(θ) of each list Li in parallel until the buffer contains k objects and the time series r (the gray boxes in Table 3). 
During the execution of lowest ranked object in the buffer has an aggregated grade that is the algorithm, pointer pos+ r is only incremented (i.e., moved down greater than or equal to some threshold τ . The threshold τ is com- the list), whereas pos− r is only decremented (i.e., moved up the puted using the aggregation function f over the grades last seen list). To maintain the k highest ranking timestamps, the algorithm under the sorted access for each list Li . uses a bounded buffer of size k. A new timestamp t0 is added only if the buffer is either not yet full or δ(t0 ) < δ(t), where t is the last E XAMPLE 3. Table 2 shows four objects {A, B, C, D} and (i.e., lowest ranking) timestamp in the buffer. ¯In the latter ¯ case the their grade for the two attributes interestingness and timestamp t is removed from the buffer. ¯ 79 After this initialization, the algorithm iterates over the lists Lr in Algorithm 1: Top−k Case Matching round robin fashion, i.e., once the last list is reached, the algorithm Data: Reference time series Rs , current time θ, and k wraps around and continues again with the first list. In each iter- Result: k timestamps that minimize δ(t) ation, exactly one list Lr is processed, and either pointer pos+ r or r 1 L ← {L |r ∈ Rs } pos−r is advanced, depending on which value the two pointers point 2 buffer ← boundendBuffer(k) to has a smaller absolute difference to the current value at time θ, 3 for r ∈ Rs do r(θ). This process grows a neighborhood around the element r(θ) 4 pos− + r , posr ← position of r(θ) in L r 5 end in each list. Whenever a pointer is advanced by one position, the 6 while L <> ∅ do timestamp t at the new position is processed. At this point, the 7 for Lr ∈ L do algorithm needs random access to the values r(t) in each list to 8 t ← AdvancePointer(Lr ) compute the error function δ(t). Time t is added to the bounded 9 if t = N IL then buffer using the semantics described above. 10 L ← L \ {Lr } The algorithm terminates once the error at the lowest ranking 11 else 12 if t 6∈ buffer then timestamp, t, among the k timestamps in the buffer is less or equal ¯ 13 buffer.addWithPriority(t, δ(t)) to thePthreshold, i.e., δ(t) ≤ τ . The threshold τ is defined as 14 end τ = r∈Rs |r(θ) − r(pos ¯ )|, where pos is either pos+ or pos− , r r r r 15 τ ← ComputeThreshold(L) depending on which pointer was advanced last. That is, τ is the 16 if buffer.size() = k sum over all lists Lr of the absolute differences between r(θ) and and buffer.largestError() ≤ τ then the value under pos+ − return buffer r or posr . 17 18 end E XAMPLE 4. We illustrate the Top-k Case Matching algorithm 19 end for k = 2 and θ = 7. Table 4 shows the state of the algorithm in 20 end each iteration i. The first column shows an iteration counter i, the 21 end 22 return buffer second the buffer with the k current best timestamps, and the last column the threshold τ . The buffer entries are tuples of the form (t, δ(t)). In iteration i = 1, the algorithm moves the pointer to t = 4 in list Lr1 and adds (t = 4, δ(4) = 0.2°) to the buffer. Since on the direction of the pointer. If next() reaches the end of a list, δ(4) = 0.2° > 0.0° = τ , the algorithm continues. The pointer it returns N IL. The utility functions timestamp() and value() in Lr2 is moved to t = 6, and (6, 0.4°) is added to the buffer. In return the timestamp and value of a list Lr at a given position, re- iteration i = 4, timestamp 6 is replaced by timestamp 1. Finally, spectively. 
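Because the pseudocode of Algorithm 1 is hard to read in this rendering, the following Python sketch restates its control flow. The helper callbacks advance_pointer, delta, and threshold correspond to Algorithm 2, the error function, and the threshold τ defined above; the dictionary-based bounded buffer is an illustrative simplification rather than the authors' data structure.

def topk_case_matching(sorted_lists, k, advance_pointer, delta, threshold):
    # sorted_lists: {r: list of (value, timestamp) ordered by value, then timestamp}
    # advance_pointer(r): moves pos+_r or pos-_r and returns the newly reached
    #                     timestamp, or None once both ends of L_r are reached
    # delta(t):           error of timestamp t over all reference series
    # threshold():        current tau = sum over r of |r(theta) - r(pos_r)|
    buffer = {}                        # bounded buffer: timestamp -> delta(t), size <= k
    active = set(sorted_lists)
    while active:
        for r in list(active):         # round-robin over the remaining lists
            t = advance_pointer(r)
            if t is None:
                active.discard(r)
                continue
            if t not in buffer:
                d = delta(t)           # random access into all reference series
                if len(buffer) < k:
                    buffer[t] = d
                else:
                    worst = max(buffer, key=buffer.get)
                    if d < buffer[worst]:      # replace the lowest-ranking timestamp
                        del buffer[worst]
                        buffer[t] = d
            if len(buffer) == k and max(buffer.values()) <= threshold():
                return sorted(buffer, key=buffer.get)
    return sorted(buffer, key=buffer.get)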
There are four cases, which the algorithm has to distin- in iteration i = 6, the error at timestamp t = 1 is smaller or equal guish: to τ , i.e., δ(1) = 0.3° ≤ τ6 = 0.3°. The algorithm terminates and returns the two timestamps D = {4, 1}. 1. None of the two pointers reached the beginning or end of the list. In this case, the algorithm checks which pointer to ad- vance (line 5). The pointer that is closer to r(θ) after advanc- Iteration i Buffer Threshold τi ing is moved by one position. In case of a tie, we arbitrarily 1 (4, 0.2°) 0.0° decided to advance pos+ r . 2 (4, 0.2°), (6, 0.4°) 0.0° 3 (4, 0.2°), (6, 0.4°) 0.1° 2. Only pos− r reached the beginning of the list: the algorithm 4 (4, 0.2°), (1, 0.3°) 0.1° increments pos+ r (line 11). 5 (4, 0.2°), (1, 0.3°) 0.2° 6 (4, 0.2°), (1, 0.3°) 0.3° 3. Only pos+ r reached the end of the list: the algorithm decre- ments pos− r (line 13). Table 4: Finding the k = 2 most similar historical situations. 4. The two pointers reached the beginning respective end of the list: no pointer is moved. 4.3 Implementation In the first three cases, the algorithm returns the timestamp that Algorithm 1 shows the pseudo code of the Top-k Case Matching was discovered after advancing the pointer. In the last case, N IL is algorithm. The algorithm has three input parameters: a set of time returned. series Rs , the current timestamp θ, and the parameter k. It returns At the moment we use an in-memory implementation of the al- the top-k most similar timestamps to the current timestamp θ. In gorithm, which loads the whole data set into main memory. More line 2 the algorithm initializes the bounded buffer of size k, and in specifically, we keep two copies of the data in memory: the data line 4 the pointers pos+ − r and posr are initialized for each reference sorted by timestamp for fast random access and the data sorted by time series r ∈ Rs . In each iteration of the loop in line 7, the algo- value and timestamp for fast sorted access. rithm advances either pos+ − r or posr (by calling Algorithm 2) and Note that we did not normalize the raw data using some standard reads a new timestamp t. The timestamp t is added to the bounded technique like the z-score normalization, as we cannot compute buffer using the semantics described before. In line 15, the algo- that efficiently for streams of data without increasing the complex- rithm computes the threshold τ . If the buffer contains k timestamps ity of our algorithm. and we have δ(t) ≤ τ , the top-k most similar timestamps were ¯ found and the algorithm terminates. 4.4 Proof of Correctness Algorithm 2 is responsible for moving the pointers pos+ r and The correctness of the Top-k Case Matching algorithm follows pos− r r for each list L . The algorithm uses three utility functions. directly from the correctness of the Threshold Algorithm. What The first is next(), which takes a pointer as input and returns the remains to be shown, however, is that the aggregation function δ(t) next position by either incrementing or decrementing, depending is monotone. 80 ∗ Algorithm 2: AdvancePointer ference between P the real value∗ s(θ) and the imputed value s (θ), Data: List Lr where to advance a pointer i.e., ∆ = |w| θ∈w |s(θ) − s (θ)| 1 Result: Next timestamp to look at or N IL Figure 2 shows how the accuracy of the algorithms changes with 1 pos ← N IL varying k. Interestingly and somewhat unexpectedly, ∆ decreases if next(pos+ − as k increases. 
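The four cases of the pointer-advancing step can be written down directly; the sketch below is an illustrative rendering of Algorithm 2, where the ListScanner class and its field names are assumptions rather than the authors' code.

class ListScanner:
    # Scans one sorted list L_r (a list of (value, timestamp) pairs) in two
    # directions around the start position of r(theta).
    def __init__(self, L_r, start_pos, r_theta):
        self.L = L_r
        self.pos_plus = start_pos      # only ever incremented (moves down the list)
        self.pos_minus = start_pos     # only ever decremented (moves up the list)
        self.r_theta = r_theta

    def advance(self):
        down_ok = self.pos_plus + 1 < len(self.L)
        up_ok = self.pos_minus - 1 >= 0
        if down_ok and up_ok:
            # Case 1: advance the pointer whose next value stays closer to r(theta);
            # ties go to pos+.
            d_plus = abs(self.r_theta - self.L[self.pos_plus + 1][0])
            d_minus = abs(self.r_theta - self.L[self.pos_minus - 1][0])
            if d_plus <= d_minus:
                self.pos_plus += 1
                return self.L[self.pos_plus][1]
            self.pos_minus -= 1
            return self.L[self.pos_minus][1]
        if down_ok:                    # Case 2: only pos- reached the beginning
            self.pos_plus += 1
            return self.L[self.pos_plus][1]
        if up_ok:                      # Case 3: only pos+ reached the end
            self.pos_minus -= 1
            return self.L[self.pos_minus][1]
        return None                    # Case 4: both ends reached, list exhausted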
This is somehow contrary to what we expected, 2 r ) <> N IL and next(posr ) <> N IL then ∆+ ← |r(θ) − value(Lr [next(pos+ since with an increasing k also the error function δ(t) grows, and 3 r )])| ∆− ← |r(θ) − value(Lr [next(pos− therefore less similar historical situations are used for the imputa- 4 r )])| 5 if ∆+ ≤ ∆− then tion. However, after a careful analysis of the results it turned out pos, pos+ + that for low values of k the algorithm is more sensitive to outliers, 6 r ← next(posr ) 7 else and due to the often low quality of the raw data the imputation is 8 pos, pos− − r ← next(posr ) flawed. 9 end Top-k (Average) Average Difference ∆ in °C + − 10 else if next(posr ) <> N IL and next(posr ) = N IL then 0.8 11 + pos, posr ← next(posr )+ Top-k (Median) + − Simple Average 12 else if next(posr ) = N IL and next(posr ) <> N IL then 13 pos, pos− − r ← next(posr ) 0.7 14 end 15 if pos <> N IL then 0.6 16 return timestamp(Lr [pos]) 17 else 18 return N IL 0.5 19 end 0 50 100 Parameter k T HEOREM 4.1. The aggregation function δ(t) is a monotoni- Figure 2: Impact of k on accuracy. cally increasing function. P ROOF. Let t1 and t2 be two timestamps such that |r(θ) − Table 5 shows an example of flawed raw data. The first row is r(t1 )| ≤ |r(θ) − r(t2 )| for each r ∈ Rs . Then it trivially fol- the current situation, and we assume that the value in the gray box lows that δ(t1 ) ≤ δ(t2 ) as the aggregation function δ is the sum of is missing and need to be recovered. The search for the k = 3 |r(θ) − r(t1 )| over each r ∈ Rs and, by definition, each compo- most similar situations using our algorithm yields the three rows nent of δ(t1 ) is less than or equal to the corresponding component at the bottom. Notice that one base station value is 39.9° around in δ(t2 ). midnight of a day in August, which is obviously a very unlikely thing to happen. By increasing k, the impact of such outliers is 4.5 Theoretical Bounds reduced and hence ∆ decreases. Furthermore, using the median as The space and runtime bounds of the algorithm follow directly aggregation function reduces the impact of outliers and therefore from the probabilistic guarantees of TA, which has sub-linear cost yields better results than the average. with high probability and linear cost in the worst case. Note Timestamp s r1 r2 r3 that sorting the raw data to build the lists Lr is a one-time pre- processing step with complexity O(n log n). After that the system 2013-04-16 19:35 18.399° 17.100° 19.293° 18.043° can insert new measurements efficiently into the sorted lists with 2012-08-24 01:40 18.276° 17.111° 19.300° 18.017° logarithmic cost. 2004-09-29 15:50 19.644° 17.114° 19.259° 18.072° 2003-08-02 01:10 39.900° 17.100° 19.365° 18.065° 5. EXPERIMENTAL EVALUATION Table 5: Example of flawed raw data. In this section, we present preliminary results of an experimental evaluation of the proposed Top-k Case Matching algorithm. First, Figure 3 shows the runtime, which for the Top-k Case Match- we study the impact of parameter k on the Top-k Case Matching ing algorithm linearly increases with k. Notice that, although the and a baseline algorithm. The baseline algorithm, referred to as imputation of missing values for 8 days takes several minutes, the “Simple Average”, imputes the missing value s(θ) with the average algorithm is fast enough to continuously impute missing values in of thePvalues in the reference time series at time θ, i.e., s(θ) = our application at the SBR. The experiment essentially corresponds r∈Rs r(θ). 
Second, we compare our solution with two state 1 |Rs | to a scenario, where in 11452 base stations an error occurs at the of the art competitors, REBOM [4] and CD [5]. same time. With 120 weather stations operated by the SBR, the number of missing values at each time is only a tiny fraction of the 5.1 Varying k missing values that we simulated in this experiment. In this experiment, we study the impact of parameter k on the accuracy and the runtime of our algorithm. We picked five base 5.2 Comparison with CD and REBOM stations distributed all over South Tyrol, each having two to five In this experiment, we compare the Top-k Case Matching algo- reference stations. We simulated a failure of the base station dur- rithm with two state-of-the-art algorithms, REBOM [4] and CD [5]. ing a time interval, w, of 8 days in the month of April 2013. This We used four time series, each containing 50.000 measurements, amounts to a total of 11452 missing values. We then used the Top-k which corresponds roughly to half a year of temperature measure- Case Matching (using both the average and median as aggregation ments. We simulated a week of missing values (i.e., 2017 measure- function g) and Simple Average algorithms to impute the missing ments) in one time series and used the other three as reference time values. As a measure of accuracy we use the average absolute dif- series for the imputation. 81 further study the impact of complex weather phenomena that we 800 observed in our data, such as the foehn. The foehn induces shifting effects in the time series data, as the warm wind causes the temper- Runtime (sec) 600 ature to increase rapidly by up to 15° as soon as the foehn reaches Top-k (Average) another station. 400 Top-k (Median) There are several possibilities to further improve the algorithm. Simple Average First, we would like to explore whether the algorithm can dynam- 200 ically determine an optimal value for the parameter k, which is 0 currently given by the user. Second, we would like to make the 0 50 100 algorithm more robust against outliers. For example, the algorithm Parameter k could consider only historical situations that occur roughly at the same time of the day. Moreover, we can bend the definition of “cur- Figure 3: Impact of k on runtime. rent situation” to not only consider the current timestamp, but rather a small window of consecutive timestamps. This should make the ranking more robust against anomalies in the raw data and weather The box plot in Figure 4 shows how the imputation error |s(θ) − phenomena such as the foehn. Third, right now the similarity be- s∗ (θ)| is distributed for each of the four algorithms. The left and tween time series is based solely on temperature data. We would right line of the box are the first and third quartile, respectively. like to include the other time series data collected by the weather The line inside the box denotes the median and the left and right stations, such as humidity, precipitation, wind, etc. Finally, the al- whiskers are the 2.5% and 97.5% percentile, which means that the gorithm should be able to automatically choose the currently hand- plot incorporates 95% of the values and omits statistical outliers. picked reference time series based on some similarity measures, The experiment clearly shows that the Top-k Case Matching algo- such as the Pearson correlation coefficient. rithm is able to impute the missing values more accurately than CD and REBOM. 
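For reference, the Simple Average baseline and the accuracy measure ∆ used in this evaluation take only a few lines of Python; the function names are illustrative, and window denotes the set of timestamps with simulated failures.

def simple_average(Rs, theta):
    # Baseline: impute s(theta) as the average of the reference values at time theta.
    return sum(r[theta] for r in Rs) / len(Rs)

def avg_abs_difference(s_true, s_imputed, window):
    # Delta = (1/|w|) * sum over the gap w of |s(theta) - s*(theta)|.
    return sum(abs(s_true[t] - s_imputed[t]) for t in window) / len(window)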
Although not visualized, also the maximum observed error for our algorithm is with 2.29° (Average) and 2.21° (Median) 7. ACKNOWLEDGEMENTS considerably lower than 3.71° for CD and 3.6° for REBOM. The work has been done as part of the DASA project, which is funded by the Foundation of the Free University of Bozen-Bolzano. We wish to thank our partners at the Südtiroler Beratungsring and Top-k the Research Centre for Agriculture and Forestry Laimburg for the (Median) good collaboration and helpful domain insights they provided, in Top-k particular Armin Hofer, Martin Thalheimer, and Robert Wiedmer. (Average) CD 8. REFERENCES [1] G. E. P. Box and G. Jenkins. Time Series Analysis, Forecasting REBOM and Control. Holden-Day, Incorporated, 1990. [2] R. Fagin. Combining fuzzy information from multiple systems 0 0.5 1 1.5 2 (extended abstract). In PODS’96, pages 216–226, New York, Absolute Difference in °C NY, USA, 1996. ACM. [3] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation Figure 4: Comparison with REBOM and CD. algorithms for middleware. In PODS ’01, pages 102–113, New York, NY, USA, 2001. ACM. In terms of runtime, the Top-k Case Matching algorithm needed [4] M. Khayati and M. H. Böhlen. REBOM: recovery of blocks of 16 seconds for the imputation of the 2017 missing measurements, missing values in time series. In COMAD’12, pages 44–55, whereas CD and REBOM needed roughly 10 minutes each. Note, 2012. however, that this large difference in run time is also due to the [5] M. Khayati, M. H. Böhlen, and J. Gamper. Memory-efficient fact that CD and REBOM need to compute the Pearson correlation centroid decomposition for long time series. In ICDE’14, coefficient which is a time intensive operation. pages 100–111, 2014. [6] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. Dynammo: 6. CONCLUSION AND FUTURE WORK Mining and summarization of coevolving sequences with In this paper, we presented a simple yet efficient and accurate al- missing values. In KDD’09, pages 507–516, New York, NY, gorithm, termed Top-k Case Matching, for the imputation of miss- USA, 2009. ACM. ing values in time series data, where the time series are similar to [7] A. Mueen, S. Nath, and J. Liu. Fast approximate correlation each other. The basic idea of the algorithm is to look for the k sit- for massive time-series data. In SIGMOD’10, pages 171–182, uations in the historical data that are most similar to the current sit- New York, NY, USA, 2010. ACM. uation and to derive the missing values from the data at these time [8] M. Theobald, G. Weikum, and R. Schenkel. Top-k query points. Our Top-k Case Matching algorithm is based on Fagin’s evaluation with probabilistic guarantees. In VLDB’04, pages Threshold Algorithm. We presented the results of a first experi- 648–659. VLDB Endowment, 2004. mental evaluation. The Top-k Case Matching algorithm achieves a [9] C. Yozgatligil, S. Aslan, C. Iyigun, and I. Batmaz. high accuracy and outperforms two state of the art solutions both Comparison of missing value imputation methods in time in terms of accuracy and runtime. series: the case of turkish meteorological data. Theoretical As next steps we will continue with the evaluation of the algo- and Applied Climatology, 112(1-2):143–167, 2013. rithm, taking into account also model based techniques such as Dy- naMMo [6] and other statistical approaches outlined in [9]. 
We will 82 Dominanzproblem bei der Nutzung von Multi-Feature-Ansätzen Thomas Böttcher Ingo Schmitt Technical University Cottbus-Senftenberg Technical University Cottbus-Senftenberg Walther-Pauer-Str. 2, 03046 Cottbus Walther-Pauer-Str. 2, 03046 Cottbus tboettcher@tu-cottbus.de schmitt@tu-cottbus.de ABSTRACT Ein Vergleich von Objekten anhand unterschiedlicher Eigen- schaften liefert auch unterschiedliche Ergebnisse. Zahlreiche Arbeiten haben gezeigt, dass die Verwendung von mehreren Eigenschaften signifikante Verbesserungen im Bereich des Retrievals erzielen kann. Ein großes Problem bei der Verwen- Figure 1: Unterschiedliche Objekte mit sehr hoher dung mehrerer Eigenschaften ist jedoch die Vergleichbarkeit Farbähnlichkeit der Einzeleigenschaften in Bezug auf die Aggregation. Häu- fig wird eine Eigenschaft von einer anderen dominiert. Viele Normalisierungsansätze versuchen dieses Problem zu lösen, von Eigenschaften erfolgt mittels eines Distanz- bzw. Ähn- nutzen aber nur eingeschränkte Informationen. In dieser Ar- lichkeitsmaßes1 . Bei der Verwendung mehrerer Eigenschaf- beit werden wir einen Ansatz vorstellen, der die Messung des ten lassen sich Distanzen mittels einer Aggregationsfunktion Grades der Dominanz erlaubt und somit auch eine Evaluie- verknüpfen und zu einer Gesamtdistanz zusammenfassen. rung verschiedener Normalisierungsansätze. Der Einsatz von unterschiedlichen Distanzmaßen und Ag- gregationsfunktionen bringt jedoch verschiedene Probleme mit sich: Keywords Verschiedene Distanzmaße erfüllen unterschiedliche alge- Dominanz, Score-Normalisierung, Aggregation, Feature braische Eigenschaften und nicht alle Distanzmaße sind für spezielle Probleme gleich geeignet. So erfordern Ansätze zu metrischen Indexverfahren oder Algorithmen im Data- 1. EINLEITUNG Mining die Erfüllung der Dreiecksungleichung. Weitere Pro- Im Bereich des Information-Retrievals (IR), Multimedia- bleme können durch die Eigenschaften der Aggregations- Retrievals (MMR), Data-Mining (DM) und vielen anderen funktion auftreten. So kann diese z.B. die Monotonie oder Gebieten ist ein Vergleich von Objekten essentiell, z.B. zur andere algebraische Eigenschaften der Einzeldistanzmaße Erkennung ähnlicher Objekte bzw. Duplikate oder zur Klas- zerstören. Diese Probleme sollen jedoch nicht im Fokus die- sifizierung der untersuchten Objekte. Der Vergleich von Ob- ser Arbeit stehen. jekten einer Objektmenge O basiert dabei in der Regel auf Für einen Ähnlichkeitsvergleich von Objekten anhand meh- deren Eigenschaftswerten. Im Bereich des MMR sind Eigen- rerer Merkmale wird erwartet, dass die Einzelmerkmale glei- schaften (Features) wie Farben, Kanten oder Texturen häu- chermaßen das Aggregationsergebnis beeinflussen. Häufig fig genutzte Merkmale. In vielen Fällen genügt es für einen gibt es jedoch ein Ungleichgewicht, welches die Ergebnisse erschöpfenden Vergleich von Objekten nicht, nur eine Eigen- so stark beeinflusst, dass einzelne Merkmale keinen oder nur schaft zu verwenden. Abbildung 1 zeigt anhand des Beispiels einen geringen Einfluss besitzen. Fehlen algebraische Eigen- eines Farbhistogramms die Schwächen einer einzelnen Eigen- schaften oder gibt es eine zu starke Dominanz, so können die schaft. Obwohl beide Objekte sich deutlich unterscheiden so Merkmale und dazugehörigen Distanzmaße nicht mehr sinn- weisen sie ein sehr ähnliches Farbhistogramm auf. voll innerhalb einer geeigneten Merkmalskombination einge- Statt einer Eigenschaft sollte vielmehr eine geeignete Kombi- setzt werden. 
Im Bereich der Bildanalyse werden zudem im- nation verschiedener Merkmale genutzt werden, um mittels mer komplexere Eigenschaften aus den Bilddaten extrahiert. einer verbesserten Ausdruckskraft [16] genauere Ergebnissen Damit wird auch die Berechnung der Distanzen basierend zu erzielen. Der (paarweise) Vergleich von Objekten anhand auf diesen Eigenschaften immer spezieller und es kann nicht sichergestellt werden welche algebraische Eigenschaften er- füllt werden. Durch die vermehrte Verwendung von vielen Einzelmerkmalen steigt auch das Risiko der Dominanz eines oder weniger Merkmale. Kernfokus dieser Arbeit ist dabei die Analyse von Multi- Feature-Aggregationen in Bezug auf die Dominanz einzelner Copyright © by the paper’s authors. Copying permitted only for private and academic purposes. Merkmale. Wir werden zunächst die Dominanz einer Eigen- In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 1 Workshop on Foundations of Databases (Grundlagen von Datenbanken), Beide lassen sich ineinander überführen [Sch06], im Folgen- 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. den gehen wir daher von Distanzmaßen aus. 83 schaft definieren und zeigen wann sich eine solche Dominanz Beispiel erläutert werden. Abschließend werden wir ein Maß manifestiert. Anschließend führen wir ein Maß zur Messung definieren, um den Grad der Dominanz messen zu können. des Dominanzgrades ein. Wir werden darüber hinaus zei- gen, dass die Ansätze bestehender Normalisierungsverfah- 3.1 Problemdefinition ren nicht immer ausreichen um das Problem der Dominanz Wie bereits erwähnt ist der Einsatz vieler, unterschiedlicher zu lösen. Zusätzlich ermöglicht dieses Maß die Evaluation Eigenschaften (Features) und ihrer teilweise speziellen Di- verschiedener Normalisierungsansätze. stanzmaße nicht trivial und bringt einige Herausforderungen Die Arbeit ist dabei wie folgt aufgebaut. In Kapitel 2 werden mit sich. Das Problem der Dominanz soll in diesem Unter- noch einmal einige Grundlagen zur Distanzfunktion und zur abschnitt noch einmal genauer definiert werden. Aggregation dargelegt. Kapitel 3 beschäftigt sich mit der Zunächst definieren wir das Kernproblem bei der Aggre- Definition der Dominanz und zeigt anhand eines Beispiels gation mehrerer Distanzwerte. die Auswirkungen. Weiterhin wird ein neues Maß zur Mes- Problem: Für einen Ähnlichkeitsvergleich von Objekten sung des Dominanzgrades vorgestellt. Kapitel 4 liefert einen anhand mehrerer Merkmale sollen die Einzelmerkmale glei- Überblick über bestehende Ansätze. Kapitel 5 gibt eine Zu- chermaßen das Aggregationsergebnis beeinflussen. Dominie- j sammenfassung und einen Ausblick für zukünftige Arbeiten. ren die partiellen Distanzen δrs eines Distanzmaßes dj das Aggregationsergebnis, so soll diese Dominanz reduziert bzw. 2. GRUNDLAGEN beseitigt werden. Offen ist an dieser Stelle die Frage, wann eine Dominanz ei- Das folgende Kapitel definiert die grundlegenden Begriffe ner Eigenschaft auftritt, wie sich diese auf das Aggregations- und die Notationen, die in dieser Arbeit verwendet werden. ergebnis auswirkt und wie der Grad der Dominanz gemessen Distanzberechnungen auf unterschiedlichen Merkmalen er- werden kann. fordern in der Regel auch den Einsatz unterschiedlicher Di- Das Ergebnis einer Aggregation von Einzeldistanzwerten ist stanzmaße. Diese sind in vielen Fällen speziell auf die Eigen- erneut ein Distanzwert. Dieser soll jedoch von allen Einzeldi- schaft selbst optimiert bzw. angepasst. 
Für eine Distanzbe- stanzwerten gleichermaßen abhängen. Ist der Wertebereich, rechnung auf mehreren Merkmalen werden dementsprechend der zur Aggregation verwendeten Distanzfunktionen nicht auch unterschiedliche Distanzmaße benötigt. identisch, so kann eine Verfälschung des Aggregationsergeb- Ein Distanzmaß zwischen zwei Objekten basierend auf einer nisses auftreten. Als einfaches Beispiel seien hier zwei Di- Eigenschaft p sei als eine Funktion d : O × O 7→ R≥0 defi- stanzfunktionen d1 und d2 genannt, wobei d1 alle Distanzen niert. Ein Distanzwert basierend auf einem Objektvergleich auf das Intervall [0, 1] und d2 alle Distanzen auf [0, 128] ab- zwischen or und os über einer einzelnen Eigenschaft pj wird bildet. Betrachtet man nun eine Aggregationsfunktion dagg , mit dj (or , os ) ∈ R≥0 beschrieben. Unterschiedliche Distanz- die Einzeldistanzen aufsummiert, so zeigt sich, dass d2 das maße besitzen damit auch unterschiedliche Eigenschaften. Aggregationsergebnis erheblich mehr beeinflusst als d1 . Zur Klassifikation der unterschiedlichen Distanzmaße wer- Allgemein werden dann die aggregierten Distanzwerte stär- den folgende vier Eigenschaften genutzt: ker oder schwächer durch Einzeldistanzwerte einer (zur Ag- Selbstidentität: ∀o ∈ O : d(o, o) = 0, Positivität: ∀or 6= gregation verwendeten) Distanzfunktion beeinflusst als ge- os ∈ O : d(or , os ) > 0, Symmetrie: ∀or , os ∈ O : wünscht. Wir bezeichnen diesen Effekt als eine Überwer- d(or , os ) = d(os , or ) und Dreiecksungleichung: ∀or , os , ot ∈ tung. Der Grad der Überbewertung lässt sich mittels Korre- O : d(or , ot ) ≤ d(or , os ) + d(os , ot ). lationsanalyse (z.B. nach Pearson [10] oder Spearman [13]) Erfüllt eine Distanzfunktion alle vier Eigenschaften so wird bestimmen. sie als Metrik bezeichnet [11]. Ist der Vergleich zweier Objekte anhand einer einzelnen Ei- Definition 1 (Überbewertung einer Distanzfunktion). genschaft nicht mehr ausreichend, um die gewünschte (Un-) Für zwei Distanzfunktionen dj und dk , bei der die Distanz- Ähnlichkeit für zwei Objekte or ,os ∈ O zu bestimmen , so werte δ j in Abhängigkeit einer Aggregationsfunktion agg ist die Verwendung mehrerer Eigenschaften nötig. Für ei- das Aggregationsergebnis stärker beeinflussen als δ k , also ne Distanzberechnung mit m Eigenschaften p = (p1 . . . pm ) die Differenz der Korrelationswerte j werden zunächst die partiellen Distanzen δrs = dj (or , os ) ρ(δ j , δ agg ) − ρ(δ k , δ agg ) >  ist, bezeichnen wir dj als bestimmt. Anschließend werden die partiellen Distanzwerte überbewertet gegenüber dk . j δrs mittels einer Aggregationsfunktion agg : Rm ≥0 7→ R≥0 zu einer Gesamtdistanz aggregiert. Die Menge aller aggre- Eine empirische Untersuchung hat gezeigt, dass sich ab ei- gierten Distanzen (Dreiecksmatrix) für Objektpaar aus O, nem Wert  ≥ 0.2 eine Beeinträchtigung des Aggregations- 2 sei durch δ j = (δ1j , δ2j . . . , δlj ) mit l = n 2−n bestimmt. Die- ergebnisses zu Gunsten einer Distanzfunktion zeigt. ser Ansatz erlaubt eine Bestimmung der Aggregation auf Ausgehend von einer Überbewertung definieren wir das Pro- den jeweiligen Einzeldistanzwerten. Die Einzeldistanzfunk- blem der Dominanz. tionen dj sind in sich geschlossen und damit optimiert auf die Eigenschaft selbst. Definition 2 (Dominanzproblem). Ein Dominanzpro- blem liegt vor, wenn es eine Überbewertung einer Distanz- funktion dj gegenüber dk gibt. 3. DOMINANZPROBLEM Bisher haben wir das Problem der Dominanz nur kurz ein- Das Problem einer Überbewertung bei unterschiedlichen geführt. 
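To make the notation concrete, the following short Python sketch computes the partial distance vectors δ^j over all object pairs and their element-wise aggregation; the object representation, the fixed pair order, and the sum as default aggregation function are illustrative assumptions, not part of the paper.

from itertools import combinations

def partial_distances(objects, d_j):
    # delta^j: the distances d_j(o_r, o_s) for all (n^2 - n)/2 object pairs,
    # enumerated in a fixed pair order.
    return [d_j(o_r, o_s) for o_r, o_s in combinations(objects, 2)]

def aggregated_distances(objects, distance_functions, agg=sum):
    # Element-wise aggregation of the partial distance vectors; the sum is only
    # one possible aggregation function agg: R^m_{>=0} -> R_{>=0}.
    partials = [partial_distances(objects, d_j) for d_j in distance_functions]
    return [agg(values) for values in zip(*partials)]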
Eine detaillierte Motivation und Heranführung an Wertebereichen in denen die Distanzen abgebildet werden ist das Problem soll in diesem Kapitel erfolgen. Hierzu werden jedoch bereits weitreichend bekannt. In vielen Fällen kom- wir zunächst die Begriffe Überbewertung und Dominanzpro- men Normalisierungsverfahren (z.B. im Data-Mining [12] blem einführen. Die Auswirkungen des Dominanzproblem oder in der Biometrie [5]) zum Einsatz. Diese bereiten Di- auf das Aggregationsergebnis sollen anschließend durch ein stanzen aus verschiedenen Quellen für eine Aggregation vor. 84 Zur Vermeidung einer Überbewertung werden Distanzen aggQd ,d (or , os ) = d1 (or , os ) ∗ d2 (or , os ) kann nun gezeigt 1 2 häufig auf ein festes Intervall normalisiert (i.d.R. auf [0,1]). werden, dass d1 stärker den aggregierten Distanzwert beein- Damit ist zumindest das Problem in unserem vorherigen Bei- flusst als d2 . spiel gelöst. In Abbildung 3 sind zwei verschiedene Rangfolgen aller 10 Das Problem der Dominanz tritt jedoch nicht nur bei un- Distanzwerte zwischen fünf zufälligen Objekten der Vertei- terschiedlichen Wertebereichen auf. Auch bei Distanzfunk- lungen ν1 und ν2 dargestellt, sowie die Aggregation mittels tionen, die alle auf den gleichen Wertebereich normalisiert aggQ . Die Distanz-ID definiert hierbei einen Identifikator sind, kann das Dominanzproblem auftreten. Im folgenden für ein Objektpaar. Betrachtet man die ersten fünf Rän- Abschnitt soll anhand eines Beispiels dieses Dominanzpro- ge der aggregierten Distanzen, so sieht man, dass die top- blem demonstriert werden. 5-Objekte von Distanzfunktion d1 komplett mit denen der Aggregation übereinstimmen, während bei Distanzfunktion 3.2 Beispiel eines Dominanzproblems d2 lediglich zwei Werte in der Rangfolge der aggregierten In Abbildung 2 sind drei Distanzverteilungen ν1 , ν2 und ν3 Distanzen auftreten. Gleiches gilt für die Ränge 6–10. Da- aus einer Stichprobe zu den zugehörigen Distanzfunktionen mit zeigt die Distanzfunktion d1 eine Dominanz gegenüber d1 , d2 sowie d3 dargestellt. Der Wertebereich der Funktio- der Distanzfunktion d2 . Schaut man sich noch einmal die nen sei auf das Intervall [0,1] definiert. Die Werte aus der Intervalle der Verteilung ν1 und ν2 an, so zeigt sich, dass die Stichprobe treten ungeachtet der Normalisierung auf [0, 1] Dominanz dem großen Unterschied der Verteilungsintervalle jedoch in unterschiedlichen Intervallen auf. Die Distanzwer- (0.7 vs. 0.2) obliegt. Eine Dominanz manifestiert sich also te der Stichprobe von ν1 liegen im Intervall [0.2, 0.9], von ν2 vor allem wenn eine große Differenz zwischen den jeweiligen im Intervall [0.3, 0.5] und in ν3 im Intervall [0.8, 0.9]. Auch Intervallen der Distanzverteilungen liegt. wenn es sich hierbei um simulierte Daten handelt so sind solche Verteilungen im Bereich des MMR häufig anzutref- 3.3 Messung der Dominanz fen. Um die Überwertung aus unserem Beispiel und somit die 0.12 Dominanz zu quantifizieren, wird die Korrelation zwischen 0.1 den Distanzen von d1 (d2 ) und der aggregierten Distanzen aus dagg bestimmt. Zur Berechnung der Korrelation kön- nen mehrere Verfahren genutzt werden. Verwendet man wie 0.08 Häufigkeit 0.06 im obigen Beispiel nur die Ränge, so bietet sich Spearmans 0.04 Rangkorrelationskoeffizient an [13]. 0.02 Cov(Rang(A), Rang(B)) ρ(A, B) = mit 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 σRang(A) ∗ σRang(B) (1) Distanz (a) ν1 Cov(X, Y ) = E [(X − µx ) ∗ (Y − µy )] 0.12 Hierbei sei Cov(X, Y ) die über den Erwartungswert von X 0.1 und Y definierte Kovarianz. 
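Formula (1) can be implemented directly on ranks; the self-contained Python sketch below uses average ranks for ties and is interchangeable with library routines such as scipy.stats.spearmanr.

def ranks(values):
    # 1-based average ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for p in range(i, j + 1):
            r[order[p]] = avg
        i = j + 1
    return r

def spearman(a, b):
    # rho(A, B) = Cov(rank(A), rank(B)) / (sigma_rank(A) * sigma_rank(B)), Eq. (1).
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb)) / n
    sa = (sum((x - ma) ** 2 for x in ra) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in rb) / n) ** 0.5
    return cov / (sa * sb)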
Bezogen auf das vorherige Bei- spiel erhalten wir eine Korrelation nach Spearman für d1 von ρ1 = 0.94 und für d2 ρ2 = 0.45. Die Differenz der Korrela- 0.08 Häufigkeit 0.06 tionswerte liegt dabei bei ρ1 − ρ2 = 0.49. Ab  = 0.2 lässt 0.04 sich eine Überbewertung einer Distanzfunktion feststellen. 0.02 Somit haben wir mit ρ1 − ρ2 = 0.49 > 0.2 eine starke Über- bewertung von d1 gegenüber d2 in Bezug auf das Aggrega- 0 0 0.1 0.2 0.3 0.4 0.5 Distanz 0.6 0.7 0.8 0.9 1 tionsergebnis gezeigt. (b) ν2 Durch die Verwendung der Rangwerte gibt es allerdings einen Informationsverlust. Eine alternative Berechnung ohne 0.12 Informationsverlust wäre durch Pearsons Korrelationskoeffi- 0.1 zienten möglich [10]. Genügen die Ranginformationen, dann 0.08 bietet Spearmans Rangkorrelationskoeffizient durch eine ge- ringere Anfälligkeit gegenüber Ausreißern an [14]. Häufigkeit 0.06 Bisher haben wir die Korrelation zwischen den aggregier- 0.04 ten Werten und denen aus je einer Distanzverteilung vergli- 0.02 chen. Um direkt eine Beziehung zwischen zwei verschiede- nen Distanzverteilungen bzgl. einer aggregierten Verteilung zu bestimmen, werden zunächst die zwei Korrelationswerte 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Distanz (c) ν3 ρ1 und ρ2 der Distanzfunktionen d1 und d2 bzgl. ihres Ein- flusses auf das Aggregationsergebnis graphisch dargestellt [6]. Hierzu werden die jeweiligen Werte der Korrelation als Figure 2: Distanzverteilung verschiedener Distanz- Punkte in [−1, 1]2 definiert. Für eine gleichmäßige Beein- funktionen (simulierte Daten) flussung des Aggregationsergebnisses sollten sich die Punk- te auf der Diagonalen durch den Koordinatenursprung mit Wir betrachten nun die Distanzfunktionen d1 und d2 . Be- züglich einer beispielhaften Aggregationsfunktion2 gationsfunktionen wie Summe, Mittelwert etc. auf und kann zusätzlich eine Dominanz hervorrufen, z.B. bei der Mini- 2 Das Problem der Dominanz tritt auch bei anderen Aggre- mum/Maximumfunktion. 85 1 Rang d1 Distanz-ID d2 Distanz-ID aggQ Distanz-ID 1 0.729 1 0.487 8 0.347 8 0.8 2 0.712 8 0.481 5 0.285 4 3 0.694 4 0.426 10 0.266 1 0.6 4 0.547 9 0.425 7 0.235 5 ρ2 (ρ1, ρ2) 5 0.488 5 0.421 3 0.205 9 0.4 6 0.473 7 0.411 4 0.201 7 u 7 0.394 10 0.375 9 0.168 10 0.2 8 0.351 3 0.367 6 0.148 3 α 9 0.337 2 0.365 1 0.112 6 0 0 0.2 0.4 0.6 0.8 1 10 0.306 6 0.316 2 0.106 2 ρ1 Figure 3: Dominanzproblem bei unterschiedlichen Verteilungen Figure 4: Graphische Darstellung der Korrelation ρ1 und ρ2 auf das Aggregationsergebnis dem Anstieg m = 1 befinden. Wir bezeichnen diese Gerade 3.4 Zusammenfassung als Kalibrierungslinie. Für unser Beispiel genügt es, nur po- Wir haben in diesem Kapitel gezeigt wann ein Dominanz- sitive Korrelationswerte zu betrachten. Damit kennzeichnen problem auftritt und wie groß der Einfluss auf das Aggrega- alle Punkte unterhalb dieser Linie einen größeren Einfluss tionsergebnis sein kann. Mit der Verwendung von Gleichung durch d1 . Analog gilt bei allen Punkten oberhalb dieser Li- (2) ist es nun möglich den Grad des Dominanzproblems bzw. nie (grau schraffierter Bereich) eine größere Beeinflussung den Kalibrierungsfehler messen zu können. Ein Hauptgrund durch d2 . Abbildung 4 zeigt graphisch die Korrelation für für das Auftreten des Dominanzproblem liegt in der Vertei- unser Beispiel von ρ1 und ρ2 auf das Aggregationsergebnis. lung der Distanzen. 
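The setting of this example can be reproduced with a few lines of Python: two distance distributions with clearly different effective intervals, combined by the product aggregation agg_Q, typically produce a correlation gap well above the empirical threshold ε = 0.2 of Definition 1. The random sample below is illustrative only and is not the paper's simulated data.

import random
from scipy.stats import spearmanr   # interchangeable with the spearman() sketch above

random.seed(0)
delta_1 = [random.uniform(0.2, 0.9) for _ in range(10)]   # wide interval, like nu_1
delta_2 = [random.uniform(0.3, 0.5) for _ in range(10)]   # narrow interval, like nu_2
delta_agg = [a * b for a, b in zip(delta_1, delta_2)]      # agg_Q = d_1 * d_2

rho_1 = spearmanr(delta_1, delta_agg).correlation
rho_2 = spearmanr(delta_2, delta_agg).correlation
overestimated = (rho_1 - rho_2) > 0.2     # Definition 1 with the empirical epsilon
print(rho_1, rho_2, overestimated)        # d_1 typically dominates the aggregate ranking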
Sind die Intervalle, in denen die Distan- Um die Abweichung vom gewünschten Zustand zu bestim- zen liegen unterschiedlich groß, so ist die Dominanz einer men, ermitteln wir den Winkel zwischen dem Ortsvektor Eigenschaft unvermeidbar. Können diese Intervalle der Di- u = (ρ1 , ρ2 )T durch den Punkt (ρ1 , ρ2 ) und der horizon- ~ stanzverteilungen aneinander angeglichen werden ohne da- talen Koordinatenachse   [6]. Der Winkel α ergibt sich dann bei die Rangfolge zu verletzen, so könnte dies das Dominanz- durch α = arctan ρρ21 Dieser Winkel liegt zwischen [0, Π 2 ], problem lösen. Weiterhin ermöglicht das Maß des Kalibrie- während die Kalibrierungslinie mit der horizontalen Ach- rungsfehlers die Evaluation von Normalisierungsansätzen. se einen Winkel von Π 4 einschließt. Für eine vorzeichenbe- haftete Kennzeichnung der Überbewertung sollen nun alle 4. STAND DER TECHNIK Korrelationspunkte unterhalb der Kalibrierungslinie einen Die Aggregation auf Basis mehrerer Eigenschaften ist ein positiven Wert und alle Korrelationspunkte oberhalb einen weit verbreitetes Feld. Es gibt bereits eine Vielzahl von Ar- negativen Wert erhalten. Für ein Maß der Dominanz defi- beiten die sich mit dem Thema der Score-Normalization be- nieren wir nun folgende Berechnung [6]: schäftigten. Die Evaluierung solcher Ansätze erfolgt in vielen Fällen, vor allem im Bereich des IR, direkt über die Auswer-   tung der Qualität der Suchergebnisse anhand verschiedener 4 Corr(δ j , δ agg ) Calerr (δ i , δ j , δ agg ) = 1 − arctan (2) Dokumentenkollektionen, z.B. TREC-Kollektionen3 . Dieses π Corr(δ i , δ agg ) Vorgehen liefert aber kaum Anhaltspunkte, warum sich ei- nige Normalisierungsansätze besser für bestimmte Anwen- Hierbei definiert Corr(X, Y ) ein geeignetes Korrelations- dungen eignen als andere [6]. maß, in unserem Fall der Rangkorrelationskoeffizient von Betrachten wir zunächst verschiedene lineare Normalisierun- Spearman. Wir bezeichnen dieses Maß als Kalibrierungsfeh- δ−xmin gen der Form normalize(δ) = ymin + xmax (ymax − ler, wobei ein Fehler von 0 bedeutet, dass es keine Dominanz −xmin gibt und somit beide Distanzfunktionen gleichermaßen in ymin ) [15], wobei die Bezeichnungen xmin , xmax , ymin und das Aggregationsergebnis einfließen. Der Wertebereich des ymax verschiedene Normalisierungsparameter darstellen. Ta- Kalibrierungsfehlers Calerr liegt in [−1, 1]. Für unser Bei- belle 1 stellt einige solcher linearer Ansätze dar [15, 5, 9, 6]. spiel erhalten wir unter Verwendung von Spearmans Rang- korrelationskoeffizienten Calerr (d1 , d2 , dagg ) = 0.43, womit erkennbar ist, dass d1 das Aggregationsergebnis stärker be- Name ymin ymax xmin xmax einflusst als d2 . Min-Max 0 1 min(δ) max(δ) Fitting 0 |s1 | we conclude (without inspecting any set element) that s0 cannot reach threshold Figure 1: Overview of functions. tC with s1 . Similarly, minoverlap(tC , s0 , s2 ) = 10.1, thus s2 is too large to meet the threshold with s0 . In fact, minsize(tC , s0 ) = 6.4 and maxsize(tC , s0 ) = 15.6. The positional filter is stricter than the prefix filter and Prefix length. The prefix length is |s0 | − tO + 1 for is applied on top of it. The pruning power of the positional a given overlap threshold tO and set s0 . For normalized filter is larger for prefix matches further to right (i.e., when thresholds t the prefix length does not only depend on s0 , p0 , p1 increase). Since the prefix filter may produce the same but also on the sets we compare to. 
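Both the calibration error of Equation (2) and the generic linear normalization of Section 4 translate directly into code. In the sketch below, Spearman's coefficient serves as the correlation measure Corr and the parameter names follow Table 1; these implementation details are illustrative choices.

import math
from scipy.stats import spearmanr

def cal_err(delta_i, delta_j, delta_agg):
    # Calibration error, Eq. (2): 0 means both distance functions contribute equally;
    # positive values indicate a stronger influence of d_i, negative values of d_j.
    rho_i = spearmanr(delta_i, delta_agg).correlation
    rho_j = spearmanr(delta_j, delta_agg).correlation
    return 1.0 - (4.0 / math.pi) * math.atan(rho_j / rho_i)

def normalize(delta, x_min, x_max, y_min=0.0, y_max=1.0):
    # Generic linear normalization; Min-Max is the special case y_min = 0, y_max = 1,
    # x_min = min(delta), x_max = max(delta).
    return [y_min + (d - x_min) / (x_max - x_min) * (y_max - y_min) for d in delta]

For the running example (ρ1 = 0.94, ρ2 = 0.45), cal_err evaluates to about 0.43, matching the value reported above.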
If we compare to s1 , the candidate pair multiple times (for each match in the prefix), minimum prefix size of |s0 | is minprefix(t, s0 , s1 ) = |s0 | − an interesting situation arises: a pair that passes the posi- minoverlap(t, s0 , s1 ) + 1. When we index one of the join tional filter for the first match may not pass the filter for partners, we do not know the size of the matching partners later matches. Thus, the positional filter is applied to pairs upfront and need to cover the worst case; this results in the that are already in the candidate set whenever a new match prefix length maxprefix(t, s0 ) = |s0 |−minsize(t, s0 )+1 [7], is found. To correctly apply the positional filter we need which does not depend on s1 . For typical Jaccard thresholds to maintain the overlap value for each pair in the candidate t ≥ 0.8, this reduces the number of tokens to be processed set. We illustrate the positional filter with examples. during the candidate generation phase by 80 % or more. Example 1. Set s0 in Figure 2 is the probing set (prefix For self joins we can further reduce the prefix length [12] length maxprefix = 4), s1 is the indexed set (prefix length w.r.t. maxprefix: when the index is built on-the-fly in in- midprefix = 2, assuming self join). Set s1 is returned from creasing order of the sets, then the indexed prefix of s0 will the index due to the match on g (first match between s0 and never be compared to any set s1 with |s1 | < |s0 |. This al- s1 ). The required overlap is dminoverlapC (0.8, s0 , s1 )e = lows us to reduce the prefix length to midprefix(t, s0 ) = 8. Since there are only 6 tokens left in s1 after the match, |s0 | − minoverlap(t, s0 , s0 ) + 1. the maximum overlap we can get is 7, and the pair is pruned. Positional filter. The minimum prefix length for a pair This is also confirmed by the positional filter condition (1) of sets is often smaller than the worst case length, which we (o = 0, p0 = 3, p1 = 1). use to build and probe the index. When we probe the index Example 2. Assume a situation similar to Figure 2, but with a token from the prefix of s0 and find a match in the the match on g is the second match (i.e., o = 1, p0 = 3, prefix of set s1 , then the matching token may be outside the p1 = 1). Condition (1) holds and the pair can not be pruned, optimal prefix. If this is the first matching token between i.e., it remains in the candidate set. s0 and s1 , we do not need to consider the pair. In general, Example 3. Consider Figure 3 with probing set s0 and a candidate pair s0 , s1 must be considered only if indexed set s1 . The match on token a adds pair (s0 , s1 ) to the candidate set. Condition (1) holds for the match on a minoverlap(t, s0 , s1 ) ≤ o + min{|s0 | − p0 , |s1 | − p1 }, (1) (o = 0, p0 = 0, p1 = 0), and the pair is not pruned by where o is the current overlap (i.e., number of matching the positional filter. For the next match (on e), however, tokens so far excluding the current match) and p0 (p1 ) is condition (1) does not hold (o = 1, p0 = 1, p1 = 4) and the position of the current match in the prefix of s0 (s1 ); the positional filter removes the pair from the candidate set. positions start at 0. Thus, the positional filter does not only avoid pairs to enter 90 pred: C(s0 , s1 ) ≥ 0.8 s0 : b c e f g h ? ? ? pr ⇒ dminoverlap(s0 , s1 , 0.8)e = 8 s1 : a e h ? ? ? ? ? ? idx 7 s0 : c e f g ? ? ? ? ? ? probing set (pr) Figure 4: Verification: where to start? s1 : a g ? ? ? ? ? ? 
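For reference, a small Python sketch of these size bounds and of condition (1), assuming the standard Jaccard instantiations of minsize, maxsize, and minoverlap (the paper's Table 1, which lists the per-measure definitions, is not legible in this rendering); the ceil/floor handling is an implementation choice.

import math

def minsize(t, n0):            # smallest eligible set size (Jaccard): t * |s0|
    return t * n0

def maxsize(t, n0):            # largest eligible set size (Jaccard): |s0| / t
    return n0 / t

def minoverlap(t, n0, n1):     # required overlap (Jaccard): t/(1+t) * (|s0| + |s1|)
    return t / (1.0 + t) * (n0 + n1)

def maxprefix(t, n0):          # probing prefix length
    return n0 - int(math.ceil(minsize(t, n0))) + 1

def midprefix(t, n0):          # indexing prefix length for self joins
    return n0 - int(math.ceil(minoverlap(t, n0, n0))) + 1

def positional_filter_holds(t, n0, n1, o, p0, p1):
    # Condition (1): the pair remains a candidate only if this returns True.
    return math.ceil(minoverlap(t, n0, n1)) <= o + min(n0 - p0, n1 - p1)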
indexed set (idx) pred: J(s0 , s1 ) ≥ 0.7 pred: J(s0 , s1 ) ≥ 0.7 7 ⇒ dminoverlap(. . .)e = 6 ⇒ dminoverlap(. . .)e = 5 s0 : c d e ? ? ? ? pr s0 : c d e ? ? ? ? pr Figure 2: Sets with matching token In prefix: match impossible due to positions of matching tokens and s1 : e ? ? ? ? ? idx s1 : e ? ? ? ? idx remaining tokens. (a) Match impossible (b) Match possible pred: C(s0 , s1 ) ≥ 0.6 Figure 5: Impossible and possible set sizes based on ⇒ dminoverlap(s0 , s1 , 0.8)e = 8 position in s0 and the size-dependent minoverlap. 14 s0 : a e ? ? ? ? ? ? ? ? ? ? ? ? ? ? pr midprefix (indexing set) as discussed in earlier sections. s1 : a b c d e ? ? ? ? ? idx Since the sets are sorted, we compute the overlap in a +1 +1 5 =7<8 merge fashion. At each merge step, we verify if the current overlap and the remaining set size are sufficient to achieve Figure 3: Sets with two matching tokens: pruning the threshold, i.e., we check positional filter condition (1). of candidate pair by second match. (A) Prefix overlap [12] : At verification time we already know the overlap between the two prefixes of a candidate pair. This piece of information should be leveraged. Note the candidate set, but may remove them later. that we cannot simply continue verification after the two prefixes. This is illustrated in Figure 4: there is 1 match in 2.2 Improving the Prefix Filter the prefixes of s0 and s1 ; when we start verification after the The prefix filter often produces candidates that will be prefixes, we miss token h. Token h occurs after the prefix removed immediately in the next filter stage, the positional of s0 but inside the prefix of s1 . Instead, we compare the filter (see Example 1). Ideally, such candidates are not pro- last element of the prefixes: for the set with the smaller duced at all. This issue is addressed in the mpjoin algo- element (s0 ), we start verification after the prefix (g). For rithm [7] as outlined below. the other set (s1 ) we leverage the number of matches in the Consider condition (1) for the positional filter. We split prefix (overlap o). Since the leftmost positions where these the condition into two new conditions by expanding the min- matches can appear are the first o elements, we skip o tokens imum such that the conjunction of the new conditions is and start at position o (token e in s1 ). There is no risk of equivalent to the positional filter condition: double-counting tokens w.r.t. overlap o since we start after the end of the prefix in s0 . minoverlap(t, s0 , s1 ) ≤ o + |s0 | − p0 (2) (B) Position of last match [7] : A further improvement is minoverlap(t, s0 , s1 ) ≤ o + |s1 | − p1 (3) to store the position of the last match. Then we start the verification in set s1 after this position (h in s1 , Figure 4). The mpjoin algorithm leverages condition (2) as follows. The probing sets s0 are processed in increasing size order, so Small candidate set vs. fast verification. The po- |s0 | grows monotonically during the execution of the algo- sitional filter is applied on each candidate pair returned by rithm. Hence, for a specific set s1 , minoverlap grows mono- the prefix filter. The same candidate pair may be returned tonically. We assume o = 0 (and justify this assumption multiple times for different matches in the prefix. The po- later). For a given index entry (s1 , p1 ), the right side of con- sitional filter potentially removes existing candidate pairs dition (2) is constant, while the left side can only grow. Af- when they appear again (cf. Section 2.1). 
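The index-pruning rule used by mpjoin can be sketched as a single predicate: with o = 0, it keeps only the part of the split positional filter whose right-hand side depends solely on the index entry (s1, p1). The sketch reuses the Jaccard minoverlap from above; the function name is an assumption.

def index_entry_still_useful(t, n0, n1, p1):
    # With o = 0 the split condition reduces to minoverlap(t, s0, s1) <= |s1| - p1.
    # Since |s0| only grows during the join, the entry (s1, p1) can be discarded
    # permanently the first time this returns False.
    return minoverlap(t, n0, n1) <= n1 - p1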
This reduces the ter the condition fails to hold for the first time, it will never size of the candidate set, but comes at the cost of (a) lookups hold again, and the index list entry is removed. For a given in the candidate set, (b) deletions from the candidate set, index set s1 , this improvement changes the effective length and (c) book keeping of the overlaps for each candidate pair. of the prefix (i.e., the part of the sets where we may detect Overall, it might be more efficient to batch-verify a larger matches) w.r.t. a probing set s0 to minprefix(t, s0 , s1 ) = candidate set than to incrementally maintain the candidates; |s1 | − minoverlap(t, s0 , s1 ) + 1, which is optimal. On the Ribeiro and Härder [7] empirically analyze this trade-off. downside, a shorter prefix may require more work in the verification phase: in some cases, the verification can start 3. POSITION-ENHANCED LENGTH FIL- after the prefix as will be discussed in Section 2.3. TERING 2.3 Verification In this section, we motivate the position-enhanced length Efficient verification techniques are crucial for fast set sim- filter (PEL), derive the new filter function pmaxsize, dis- ilarity joins. We revisit a baseline algorithm and two im- cuss the effect of PEL on self vs. foreign joins, and show how provements, which affect the verification speed of both false to apply PEL to previous algorithms. and true positives. Unless explicitly mentioned, the term Motivation. The introduction of the position-enhanced prefix subsequently refers to maxprefix (probing set) resp. length filter is inspired by examples for positional filtering 91 1250 base region. The base region is partitioned into four regions maxsize (A, B, C, and D) by the probing set size and pmaxsize. For C D probing set size foreign joins, our filter reduces the base region to A+C. If we set size assume that all set sizes occur equally likely in the individual 1000 inverted lists of the index, our filter cuts the number of index B list entries that must be processed by 50%. Since the tokens A pmaxsize are typically ordered by their frequency, the list length will minsize increase with increasing matching position. Thus the gain of 800 PEL in practical settings can be expected to be even higher. 0 100 maxprefix 200 This analysis holds for all parameters of Jaccard and Dice. position in prefix For Cosine, the situation is more tricky since pmaxsize is quadratic and describes a parabola. Again, this is in our Figure 6: Illustrating possible set sizes. favor since the parabola is open to the top, and the curve that splits the base region is below the diagonal. For self joins, the only relevant regions are A and B since like Figure 5(a). In set s1 , the only match in the prefix oc- the size of the sets is bounded by the probing set size. Our curs at the leftmost position. Despite this being the leftmost filter reduces the relevant region from A + B to A. As Fig- match in s1 , the positional filter removes s1 : the overlap ure 6 illustrates, this reduction is smaller than the reduction threshold cannot be reached due the position of the match for foreign joins. For the similarity functions in Table 1, B in s0 . Apparently, the position of the token in the probing is always less than a quarter of the full region A + B. In the set can render a match of the index sets impossible, inde- example, region B covers about 0.22 of A + B. pendently of the matching position in the index set. 
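To make the verification routine concrete, the following Python sketch performs the merge-style overlap computation with the early-termination test of condition (1); the start positions start0/start1 and the prefix overlap o are assumed to be chosen according to rules (A) and (B) above, and the function name is illustrative.

def verify(s0, s1, required_overlap, o, start0, start1):
    # s0, s1 are sorted token lists; o matches are already known from the prefixes,
    # and scanning starts behind the last counted match in each set.
    i, j = start0, start1
    while i < len(s0) and j < len(s1):
        if o + min(len(s0) - i, len(s1) - j) < required_overlap:
            return False                      # threshold no longer reachable
        if s0[i] == s1[j]:
            o += 1
            i += 1
            j += 1
        elif s0[i] < s1[j]:
            i += 1
        else:
            j += 1
    return o >= required_overlap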
Let us analyze how we need to modify the example such that it passes the positional filter: the solution is to shorten index Algorithm 1: AllPairs-PEL(Sp , I, t) set s1 , as shown in Figure 5(b). This suggests that some Version using pmaxsize for foreign join; tighter limit on the set size can be derived from the position input : Sp collection of outer sets, I inverted list index of the matching token. covering maxprefix of inner sets, t similarity Deriving the PEL filter. For the example in threshold Figure 5(a) the first part of the positional filter, i.e., output: res set of result pairs (similarity at least t) condition (2), does not hold. We solve the equation 1 foreach s0 in Sp do minoverlap(t, s0 , s1 ) ≤ |s0 | − p0 to |s1 | by replacing 2 M = {}; /* Hashmap: candidate set → count */ minoverlap with its definition for the different similarity 3 for p0 ← 0 to maxprefix(t, s0 ) − 1 do functions. The result is pmaxsize(t, s0 , p0 ), an upper 4 for s1 in Is0 [p] do bound on the size of eligible sets in the index. This bound 5 if |s1 | < minsize(t, s0 ) then is at the core of the PEL filter, and definitions of pmaxsize 6 remove index entry with s1 from Is0 [p] ; for various similarity measures are listed in Table 1. 7 else if |s1 | > pmaxsize(t, s0 , p0 ) then Application of PEL. We integrate the pmaxsize 8 break; upper bound into the prefix filter. The basic prefix filter 9 else algorithm processes a probing set as follows: loop over 10 if M [s1 ] = ∅ then the tokens of the probing set from position p0 = 0 to 11 M = M ∪ (s1 , 0); maxprefix(t, s0 ) − 1 and probe each token against the 12 M [s1 ] = M [s1 ] + 1; index. The index returns a list of sets (their IDs) which 13 end contain this token. The sets in these lists are ordered by 14 end increasing size, so we stop processing a list when we hit a 15 /* Verify() verifies the candidates in M */ set that is larger than pmaxsize(t, s0 , p0 ). 16 res = res ∪ V erif y(s0 , M, t); Intuitively, we move half of the positional filter to the 17 end prefix filter, where we can evaluate it at lower cost: (a) the value of pmaxsize needs to be computed only once for each probing token; (b) we check pmaxsize against the size of Algorithm. Algorithm 1 shows AllPairs-PEL2 , a ver- each index list entry, which is a simple integer comparison. sion of AllPairs enhanced with our PEL filter. AllPairs- Overall, this is much cheaper than the candidate lookup that PEL is designed for foreign joins, i.e., the index is con- the positional filter must do for each index match. structed in a preprocessing step before the join is executed. Self Joins vs. Foreign Joins. The PEL filter is more The only difference w.r.t. AllPairs is that AllPairs-PEL uses powerful on foreign joins than on self joins. In self joins, pmaxsize(t, s0 , p0 ) instead of maxsize(t, s0 ) in the condi- the size of the probing set is an upper bound for the set tion on line 7. The extensions of the algorithms ppjoin and size in the index. For all the similarity functions in Table 1, mpjoin with PEL are similar. pmaxsize is below the probing set size in less than 50% An enhancement that is limited to ppjoin and mpjoin is to of the prefix positions. Figure 6 gives an example: The simplify the positional filter: PEL ensures that no candidate probing set size is 1000, the Jaccard threshold is 0.8, so set can fail on the first condition (Equation 2) of the split minsize(0.8, 1000) = 800, maxsize(0.8, 1000) = 1250, and positional filter. Therefore, we remove the first part of the the prefix size is 201. 
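Under a Jaccard threshold, the derivation just described can be carried out in closed form; the sketch below states the resulting bound. Since Table 1 of the paper is not legible in this rendering, treat this as the standard Jaccard instantiation obtained from the stated derivation rather than a quotation.

def pmaxsize(t, n0, p0):
    # For Jaccard, solving  t/(1+t) * (n0 + n1) <= n0 - p0  for n1 gives
    #   n1 <= (n0 - (1 + t) * p0) / t,
    # which equals maxsize(t, n0) = n0/t at p0 = 0 and decreases linearly with p0.
    return (n0 - (1.0 + t) * p0) / t

As a sanity check against Figure 6 (t = 0.8, |s0| = 1000): pmaxsize(0.8, 1000, 0) = 1250 = maxsize, and pmaxsize(0.8, 1000, 200) = 800 = minsize.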
The x-axis represents the position in the prefix, the y-axis represents bounds for the set size of the 2 We use the -PEL suffix for algorithm variants that make other set. The region between minsize and maxsize is the use of our PEL filter. 92 collections are identical. Figures 7(a) and 7(b) show the per- Table 2: Input set characteristics. formance on DBLP with Jaccard similarity threshold 0.75 #sets in set size # of diff. and Cosine similarity 0.85. These thresholds produce result collection min max avg tokens sets of similar size. We observe a speedup of factor 3.5 for DBLP 3.9 · 106 2 283 12 1.34 · 106 AllPairs-PEL over AllPairs with Jaccard, and a speedup of TREC 3.5 · 105 2 628 134 3.4 · 105 3.8 with Cosine. For mpjoin to mpjoin-PEL we observe a 5 ENRON 5 · 10 1 192 000 298 7.3 · 106 speedup of 4.0 with Jaccard and 4.2 with Cosine. Thus, the PEL filter provides a substantial speed advantage on these data points. For other Jaccard thresholds and mpjoin vs. minimum in the original positional filter (Equation 1), such mpjoin-PEL, the maximum speedup is 4.1 and the minimum that the minimum is no longer needed. speedup is 1.02. For threshold 0.5, only mpjoin-PEL finishes Note that the removal of index entries on line 6 is the eas- within the time limit of one hour. Among all Cosine thresh- iest solution to apply minsize, but in real-world scenarios, olds and mpjoin vs. mpjoin-PEL, the maximum speedup is it only makes sense for a single join to be executed. For 4.2 (tC = 0.85), the minimum speedup is 1.14 (tC = 0.95). a similarity search scenario, we recommend to apply binary We only consider Cosine thresholds tC ≥ 0.75, because the search on the lists. For multiple joins with the same indexed non-PEL variants exceed the time limit for smaller thresh- sets in a row, we suggest to use an overlay over the index olds. There is no data point where PEL slows down an that stores the pointer for each list where to start. algorithm. It is also worth noting that AllPairs-PEL beats mpjoin by a factor of 2.7 with Jaccard threshold tJ = 0.75 4. EXPERIMENTS and 3.3 on Cosine threshold tC = 0.85; we observe such speedups also on other thresholds. We compare the algorithms AllPairs [4] and mpjoin [7] Figure 7(c) shows the performance on TREC with Jac- with and without our PEL extension on both self and for- card threshold tJ = 0.75. The speedup for AllPairs-PEL eign joins. Our implementation works on integers, which we compared to AllPairs is 1.64, and for mpjoin-PEL compared order by the frequency of appearance in the collection. The to mpjoin 2.3. The minimum speedup of mpjoin over all time to generate integers from tokens is not measured in our thresholds is 1.26 (tJ = 0.95), the maximum speedup is experiments since it is the same for all algorithms. We also 2.3 (tJ = 0.75). Performance gains on ENRON are slightly do not consider the indexing time for foreign joins, which smaller – we observe speedups of 1.15 (AllPairs-PEL over is considered a preprocessing step. The use of PEL has no AllPairs), and 1.85 (mpjoin-PEL over mpjoin) on Jaccard impact on the index construction. The prefix sizes are max- threshold tJ = 0.75 as illustrated in Figure 7(d). The mini- prefix for foreign joins and midprefix for self joins. For self mum speedup of mpjoin over mpjoin-PEL is 1.24 (tJ = 0.9 joins, we include the indexing time in the overall runtime and 0.95), the maximum speedup is 2.0 (tJ = 0.6). since the index is built incrementally on-the-fly. 
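Since the pseudocode of Algorithm 1 is interleaved with other text in this rendering, the probing loop of AllPairs-PEL is restated below as a Python sketch. It builds on the maxprefix, minsize, and pmaxsize functions sketched earlier; verify_candidates is an assumed helper that wraps the merge-based verification, and the index layout (token to list of (set_id, set_size) entries ordered by increasing size) is an assumption consistent with the description above.

from collections import defaultdict

def allpairs_pel(probing_sets, index, t):
    result = []
    for s0 in probing_sets:                      # s0 is a sorted list of tokens
        n0 = len(s0)
        counts = defaultdict(int)                # candidate set id -> matches in prefix
        for p0 in range(maxprefix(t, n0)):
            for s1_id, n1 in index[s0[p0]]:
                if n1 < minsize(t, n0):
                    continue                     # Algorithm 1 removes such entries instead
                if n1 > pmaxsize(t, n0, p0):
                    break                        # PEL: all remaining entries are too large
                counts[s1_id] += 1
        result.extend(verify_candidates(s0, counts, t))
    return result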
We report Figure 8(a) shows the number of processed index entries results for Jaccard and Cosine similarity, the results for Dice (i.e., the overall length of the inverted lists that must be show similar behavior. Our experiments are executed on the scanned) for Jaccard threshold tJ = 0.75 on TREC. The following real-world data sets: number of index entries increases by a factor of 1.67 for AllPairs w.r.t. AllPairs-PEL, and a factor of 4.0 for mpjoin • DBLP3 : Snapshot (February 2014) of the DBLP bib- w.r.t. mpjoin-PEL. liographic database. We concatenate authors and ti- Figure 8(b) shows the number of candidates that must tle of each entry and generate tokens by splitting on be verified for Jaccard threshold tJ = 0.75 on TREC. On whitespace. AllPairs, PEL decreases the number of candidates. This is because AllPairs does not apply any further filters before • TREC4 : References from the MEDLINE database, verification. On mpjoin, the number of candidates increases years 1987–1991. We concatenate author, title, and by 20%. This is due to the smaller number of matches from abstract, remove punctuation, and split on whitespace. the prefix index in the case of PEL: later matches can remove • ENRON5 : Real e-mail messages published by FERC pairs from the candidate set (using the positional filter) and after the ENRON bankruptcy. We concatenate sub- thus decrease its size. However, the larger candidate set ject and body fields, remove punctuation, and split on for PEL does not seriously impact the overall performance: whitespace. the positional filter is also applied in the verification phase, where the extra candidate pairs are pruned immediately. Table 2 lists basic characteristics of the input sets. We Self joins. Due to space constraints, we only show re- conduct our experiments on an Intel Xeon 2.60GHz machine sults for DBLP and ENRON, i.e., the input sets with the with 128 GB RAM running Debian 7.6 ’wheezy’. We com- smallest and the largest average set sizes, respectively. Fig- pile our code with gcc -O3. Claims about results on “all” ure 7(e) and 7(f) show the performance of the algorithms on thresholds for a particular data set refer to the thresholds DBLP and ENRON with Jaccard threshold tJ = 0.75. Our {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. We stop tests whose PEL filter provides a speed up of about 1.22 for AllPairs, runtime exceeds one hour. and 1.17 for mpjoin on DBLP. The maximum speedup we Foreign Joins. For foreign joins, we join a collection of observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for sets with a copy of itself, but do not leverage the fact that the tJ = 0.95 there is no speed difference between mpjoin and mpjoin-PEL. On the large sets of ENRON, the performance 3 http://www.informatik.uni-trier.de/~Ley/db/ is worse for AllPairs-PEL because verification takes more 4 http://trec.nist.gov/data/t9_filtering.html time than PEL can save in the probing phase (by reducing 5 https://www.cs.cmu.edu/~enron/ the number of processed index entries). There is almost no 93 sec sec sec 400 sec sec sec 500 500 100 150 400 300 30 400 80 AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL 300 300 100 20 60 200 AllPairs AllPairs AllPairs AllPairs AllPairs AllPairs 200 200 40 mpjoin mpjoin mpjoin mpjoin mpjoin mpjoin 50 100 10 100 100 20 0 0 0 0 0 0 (a) Foreign join, (b) Foreign join, (c) Foreign join, (d) Foreign j., EN- (e) Self join, (f) Self join, EN- DBLP, tJ = 0.75. DBLP, tC = 0.85. 
Foreign Joins. For foreign joins, we join a collection of sets with a copy of itself, but do not leverage the fact that the two inputs are identical.

Figure 8(a) shows the number of processed index entries (i.e., the overall length of the inverted lists that must be scanned) for Jaccard threshold tJ = 0.75 on TREC. The number of index entries increases by a factor of 1.67 for AllPairs w.r.t. AllPairs-PEL, and by a factor of 4.0 for mpjoin w.r.t. mpjoin-PEL.

Figure 8(b) shows the number of candidates that must be verified for Jaccard threshold tJ = 0.75 on TREC. On AllPairs, PEL decreases the number of candidates. This is because AllPairs does not apply any further filters before verification. On mpjoin, the number of candidates increases by 20%. This is due to the smaller number of matches from the prefix index in the case of PEL: later matches can remove pairs from the candidate set (using the positional filter) and thus decrease its size. However, the larger candidate set for PEL does not seriously impact the overall performance: the positional filter is also applied in the verification phase, where the extra candidate pairs are pruned immediately.

Self Joins. Due to space constraints, we only show results for DBLP and ENRON, i.e., the input sets with the smallest and the largest average set sizes, respectively. Figures 7(e) and 7(f) show the performance of the algorithms on DBLP and ENRON with Jaccard threshold tJ = 0.75. Our PEL filter provides a speedup of about 1.22 for AllPairs and of 1.17 for mpjoin on DBLP. The maximum speedup we observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for tJ = 0.95 there is no speed difference between mpjoin and mpjoin-PEL. On the large sets of ENRON, the performance is worse for AllPairs-PEL because verification takes more time than PEL can save in the probing phase (by reducing the number of processed index entries). There is almost no difference between mpjoin and mpjoin-PEL. The maximum increase in speed is 9% (threshold 0.8, mpjoin); the maximum slowdown is 30% (threshold 0.6, AllPairs).

Summarizing, PEL substantially improves the runtime in foreign join scenarios. For self joins, PEL is less effective and, in some cases, may even slightly increase the runtime.

Figure 7: Join times (runtime in seconds for AllPairs, AllPairs-PEL, mpjoin, and mpjoin-PEL). (a) Foreign join, DBLP, tJ = 0.75; (b) foreign join, DBLP, tC = 0.85; (c) foreign join, TREC, tJ = 0.75; (d) foreign join, ENRON, tJ = 0.75; (e) self join, DBLP, tJ = 0.75; (f) self join, ENRON, tJ = 0.75.

Figure 8: TREC (foreign join), tJ = 0.75. (a) Number of processed index entries; (b) number of candidates to be verified.

5. RELATED WORK

Sarawagi and Kirpal [8] first discuss efficient algorithms for exact set similarity joins. Chaudhuri et al. [5] propose SSJoin as an in-database operator for set similarity joins and introduce the prefix filter. AllPairs [4] uses the prefix filter with an inverted list index. The ppjoin algorithm [12] extends AllPairs by the positional filter and introduces the suffix filter, which reduces the candidate set before the final verification. The mpjoin algorithm [7] improves over ppjoin by reducing the number of entries returned from the index. AdaptJoin [10] takes the opposite approach and drastically reduces the number of candidates at the expense of longer prefixes. Gionis et al. [6] propose an approximate algorithm based on LSH for set similarity joins. Recently, an SQL operator for the token generation problem was introduced [3].

6. CONCLUSIONS

We presented PEL, a new filter based on the pmaxsize upper bound derived in this paper. PEL can be easily plugged into algorithms that store prefixes in an inverted list index (e.g., AllPairs, ppjoin, or mpjoin). For these algorithms, PEL will effectively reduce the number of list entries that must be processed. This reduces the overall lookup time in the inverted list index at the cost of a potentially larger candidate set. We analyzed this trade-off for foreign joins and self joins. Our empirical evaluation demonstrated that the PEL filter improves performance in almost any foreign join and also in some self join scenarios, despite the fact that it may increase the number of candidates to be verified.
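To illustrate where such a filter hooks into these algorithms, the sketch below shows a generic, AllPairs-style probing loop over a size-sorted inverted index in which an upper bound on the candidate size truncates each inverted-list scan. It is only a sketch under our own assumptions: max_candidate_size uses the ordinary Jaccard length bound |r|/t as a stand-in, whereas PEL would plug in the tighter, position-dependent pmaxsize bound derived earlier in the paper; the helper names, the chosen prefix length, and the omission of the minimum-size check and of the verification phase are ours.

    import math
    from collections import defaultdict

    def prefix_len(size, t):
        # One common prefix length for Jaccard threshold t.
        return size - math.ceil(t * size) + 1

    def build_index(collection, t):
        """collection: id -> token list; all lists sorted by one global token order.
        Index the prefix tokens; each inverted list is sorted by set size."""
        index = defaultdict(list)
        for sid, tokens in collection.items():
            for tok in tokens[:prefix_len(len(tokens), t)]:
                index[tok].append((sid, len(tokens)))
        for lst in index.values():
            lst.sort(key=lambda entry: entry[1])
        return index

    def max_candidate_size(r_len, pos, t):
        # Stand-in bound: the plain length filter |r| / t. PEL would return the
        # tighter, position-dependent pmaxsize value here instead.
        return r_len / t

    def probe(index, r_tokens, t):
        """Collect candidate ids for the probing set r. Because the lists are
        size-sorted, entries beyond the upper bound are never touched."""
        candidates = set()
        for pos, tok in enumerate(r_tokens[:prefix_len(len(r_tokens), t)]):
            bound = max_candidate_size(len(r_tokens), pos, t)
            for sid, size in index.get(tok, ()):
                if size > bound:
                    break  # truncate the inverted-list scan
                candidates.add(sid)
        return candidates

A verification phase that computes the actual overlap of each candidate pair (and may apply the positional filter, as discussed above) would then run over the returned candidate set.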
7. REFERENCES

[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proc. VLDB, pages 918–929, 2006.
[2] N. Augsten, M. H. Böhlen, and J. Gamper. The pq-gram distance between ordered labeled trees. ACM TODS, 35(1), 2010.
[3] N. Augsten, A. Miraglia, T. Neumann, and A. Kemper. On-the-fly token similarity joins in relational databases. In Proc. SIGMOD, pages 1495–1506. ACM, 2014.
[4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. WWW, 7:131–140, 2007.
[5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. ICDE, page 5. IEEE, 2006.
[6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB, pages 518–529, 1999.
[7] L. A. Ribeiro and T. Härder. Generalizing prefix filtering to improve set similarity joins. Information Systems, 36(1):62–78, 2011.
[8] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proc. SIGMOD, pages 743–754. ACM, 2004.
[9] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: A large-scale study in the Orkut social network. In Proc. SIGKDD, pages 678–684. ACM, 2005.
[10] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proc. SIGMOD, pages 85–96. ACM, 2012.
[11] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. In Proc. VLDB, 2008.
[12] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM TODS, 36(3):15, 2011.