Proceedings of the 26th GI-Workshop Foundations of Databases (Grundlagen von Datenbanken), Bozen-Bolzano, Italy, October 21-24, 2014.

Copyright © 2014 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners.

Editors:

Friederike Klan
Friedrich-Schiller-Universität Jena, Fakultät für Mathematik und Informatik, Heinz-Nixdorf-Stiftungsprofessur für Verteilte Informationssysteme, Ernst-Abbe-Platz 2, DE-07743 Jena
E-Mail: friederike.klan@uni-jena.de

Günther Specht
Universität Innsbruck, Fakultät für Mathematik, Informatik und Physik, Forschungsgruppe Datenbanken und Informationssysteme, Technikerstrasse 21a, AT-6020 Innsbruck
E-Mail: guenther.specht@uibk.ac.at

Hans Gamper
Freie Universität Bozen-Bolzano, Fakultät für Informatik, Dominikanerplatz 3, IT-39100 Bozen-Bolzano
E-Mail: gamper@inf.unibz.it

Preface

The 26th workshop "Foundations of Databases" (Grundlagen von Datenbanken, GvDB) 2014 took place from October 21 to 24, 2014 on the Ritten in South Tyrol, a scenic high plateau overlooking the Dolomites. The journey there was already a highlight: from the Bozen railway station, the longest cable car in South Tyrol took the participants up the mountain, and the Rittner Bahn, an old narrow-gauge tramway, then crossed the larch meadows to the conference venue.

The four-day workshop was organized by the GI working group "Foundations of Information Systems" within the special interest group on Databases and Information Systems (DBIS). Its subject is the conceptual and methodological foundations of databases and information systems, while remaining open to new applications. The workshop series and the working group celebrate their 25th anniversary this year, which makes the working group one of the oldest in the GI. The anniversary workshop was organized jointly by Dr. Friederike Klan of the Heinz Nixdorf Endowed Chair for Distributed Information Systems at the Friedrich-Schiller-Universität Jena, Prof. Dr. Günther Specht of the Databases and Information Systems (DBIS) research group at the Universität Innsbruck, and Prof. Dr. Johann Gamper of the Databases and Information Systems (DIS) group at the Free University of Bozen-Bolzano.

The workshop is intended to foster communication among researchers in the German-speaking countries who work on the foundations of databases and information systems. In particular, it gives young researchers the opportunity to present their current work to a larger audience in a relaxed atmosphere. Against the backdrop of the impressive South Tyrolean mountains, the workshop, held at 1200 meters above sea level, offered an ideal setting for open and inspiring discussions without time pressure. In total, 14 papers were selected from the submissions after a review process and presented at the workshop. The diversity of topics is particularly noteworthy: core areas of database systems and database design were covered, as well as information extraction, recommender systems, time series processing, graph algorithms in the GIS area, data privacy, and data quality. The presentations were complemented by two keynotes: Ulf Leser, professor at the Humboldt-Universität zu Berlin, gave a keynote on Next Generation Data Integration (for the Life Sciences), and Francesco Ricci, professor at the Free University of Bozen-Bolzano, spoke on Context and Recommendations: Challenges and Results.
We thank both speakers for their spontaneous willingness to come and for their interesting talks.

Besides the exchange of knowledge, the social component must not be missing either. The two joint excursions will certainly remain a fond memory for all participants. On the one hand, we climbed the already snow-covered Rittner Horn (2,260 m), which offers a magnificent view of the Dolomites. On the other hand, no autumn stay in South Tyrol would be complete without the so-called Törggelen: a hike to local farm taverns that serve the delicacies of the year together with chestnuts and new wine. Even the rector of the University of Bozen-Bolzano came up from the valley especially to join us.

A conference can only succeed in a good environment. We therefore thank the staff of the Haus der Familie for their work behind the scenes. Further thanks go to all authors, whose contributions and presentations made an interesting workshop possible in the first place, as well as to the program committee and all reviewers for their work. Finally, a big thank-you goes to the organization team, which cooperated superbly and interactively across national borders (Germany, Austria and Italy). The GvDB has never been this international before.

We look forward to seeing you again at the next GvDB workshop.

Günther Specht, Friederike Klan, Johann Gamper
Innsbruck, Jena, Bozen, October 26, 2014

Committee

Organization:
Friederike Klan, Friedrich-Schiller-Universität Jena
Günther Specht, Universität Innsbruck
Hans Gamper, Universität Bozen-Bolzano

Program Committee:
Alsayed Algergawy, Friedrich-Schiller-Universität Jena
Erik Buchmann, Karlsruher Institut für Technologie
Stefan Conrad, Universität Düsseldorf
Hans Gamper, Universität Bozen-Bolzano
Torsten Grust, Universität Tübingen
Andreas Heuer, Universität Rostock
Friederike Klan, Friedrich-Schiller-Universität Jena
Birgitta König-Ries, Friedrich-Schiller-Universität Jena
Klaus Meyer-Wegener, Universität Erlangen
Gunter Saake, Universität Magdeburg
Kai-Uwe Sattler, Technische Universität Ilmenau
Eike Schallehn, Universität Magdeburg
Ingo Schmitt, Brandenburgische Technische Universität Cottbus
Holger Schwarz, Universität Stuttgart
Günther Specht, Universität Innsbruck

Additional Reviewers:
Mustafa Al-Hajjaji, Universität Magdeburg
Xiao Chen, Universität Magdeburg
Doris Silbernagl, Universität Innsbruck

Contents

Next Generation Data Integration (for the Life Sciences) (Keynote)
Ulf Leser 9

Context and Recommendations: Challenges and Results (Keynote)
Francesco Ricci 10

Optimization of Sequences of XML Schema Modifications - The ROfEL Approach
Thomas Nösinger, Andreas Heuer and Meike Klettke 11

Automatic Decomposition of Multi-Author Documents Using Grammar Analysis
Michael Tschuggnall and Günther Specht 17

Proaktive modellbasierte Performance-Analyse und -Vorhersage von Datenbankanwendungen
Christoph Koch 23

Big Data und der Fluch der Dimensionalität: Die effiziente Suche nach Quasi-Identifikatoren in hochdimensionalen Daten
Hannes Grunert and Andreas Heuer 29

Combining Spotify and Twitter Data for Generating a Recent and Public Dataset for Music Recommendation
Martin Pichl, Eva Zangerle and Günther Specht 35

Incremental calculation of isochrones regarding duration
Nikolaus Krismer, Günther Specht and Johann Gamper 41

Software Design Approaches for Mastering Variability in Database Systems
David Broneske, Sebastian Dorok, Veit Koeppen and Andreas Meister 47
PageBeat - Zeitreihenanalyse und Datenbanken
Andreas Finger, Ilvio Bruder, Andreas Heuer, Martin Klemkow and Steffen Konerow 53

Databases under the Partial Closed-world Assumption: A Survey
Simon Razniewski and Werner Nutt 59

Towards Semantic Recommendation of Biodiversity Datasets based on Linked Open Data
Felicitas Löffler, Bahar Sateli, René Witte and Birgitta König-Ries 65

Exploring Graph Partitioning for Shortest Path Queries on Road Networks
Theodoros Chondrogiannis and Johann Gamper 71

Missing Value Imputation in Time Series Using Top-k Case Matching
Kevin Wellenzohn, Hannes Mitterer, Johann Gamper, Michael Böhlen and Mourad Khayati 77

Dominanzproblem bei der Nutzung von Multi-Feature-Ansätzen
Thomas Böttcher and Ingo Schmitt 83

PEL: Position-Enhanced Length Filter for Set Similarity Joins
Willi Mann and Nikolaus Augsten 89


Next Generation Data Integration (for the Life Sciences) [Abstract]

Ulf Leser
Humboldt-Universität zu Berlin
Institute for Computer Science
leser@informatik.hu-berlin.de

ABSTRACT
Ever since the advent of high-throughput biology (e.g., the Human Genome Project), integrating the large number of diverse biological data sets has been considered one of the most important tasks for advancement in the biological sciences. The life sciences also served as a blueprint for complex integration tasks in the CS community, due to the availability of a large number of highly heterogeneous sources and the urgent integration needs. Whereas the early days of research in this area were dominated by virtual integration, the currently most successful architecture uses materialization. Systems are built using ad-hoc techniques and a large amount of scripting. However, recent years have seen a shift in the understanding of what a "data integration system" actually should do, revitalizing research in this direction. In this tutorial, we review the past and current state of data integration (exemplified by the life sciences) and discuss recent trends in detail, all of which pose challenges for the database community.

About the Author
Ulf Leser obtained a Diploma in Computer Science at the Technische Universität München in 1995. He then worked as a database developer at the Max-Planck-Institute for Molecular Genetics before starting his PhD with the Graduate School for "Distributed Information Systems" in Berlin. Since 2002 he has been a professor for Knowledge Management in Bioinformatics at Humboldt-Universität zu Berlin.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.


Context and Recommendations: Challenges and Results [Abstract]

Francesco Ricci
Free University of Bozen-Bolzano
Faculty of Computer Science
fricci@unibz.it
ABSTRACT
Recommender Systems (RSs) are popular tools that automatically compute suggestions for items that are predicted to be interesting and useful to a user. They track users' actions, which signal users' preferences, and aggregate them into predictive models of the users' interests. In addition to the long-term interests, which are normally acquired and modeled in RSs, the specific ephemeral needs of the users, their decision biases, the context of the search, and the context of items' usage do influence the user's response to and evaluation of the suggested items. But appropriately modeling the user in the situational context and reasoning upon that is still challenging; there are still major technical and practical difficulties to solve: obtaining sufficient and informative data describing user preferences in context; understanding the impact of the contextual dimensions on the user's decision-making process; and embedding the contextual dimensions in a recommendation computational model. These topics will be illustrated in the talk, with examples taken from the recommender systems that we have developed.

About the Author
Francesco Ricci is associate professor of computer science at the Free University of Bozen-Bolzano, Italy. His current research interests include recommender systems, intelligent interfaces, mobile systems, machine learning, case-based reasoning, and the applications of ICT to tourism and eHealth. He has published more than one hundred academic papers on these topics and has been invited to give talks at many international conferences, universities and companies. He is among the editors of the Handbook of Recommender Systems (Springer 2011), a reference text for researchers and practitioners working in this area. He is the editor in chief of the Journal of Information Technology & Tourism and serves on the editorial board of the Journal of User Modeling and User-Adapted Interaction. He is a member of the steering committee of the ACM Conference on Recommender Systems. He has served on the program committees of several conferences, including as a program co-chair of the ACM Conference on Recommender Systems (RecSys), the International Conference on Case-Based Reasoning (ICCBR) and the International Conference on Information and Communication Technologies in Tourism (ENTER).

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.


Optimization of Sequences of XML Schema Modifications - The ROfEL Approach

Thomas Nösinger, Meike Klettke, Andreas Heuer
Database Research Group
University of Rostock, Germany
(tn, meike, ah)@informatik.uni-rostock.de

ABSTRACT
The transformation language ELaX (Evolution Language for XML-Schema [16]) is a domain-specific language for modifying existing XML Schemas. ELaX was developed to express complex modifications by using add, delete and update statements. Additionally, it is used to consistently log all change operations specified by a user. In this paper we present the rule-based optimization algorithm ROfEL (Rule-based Optimizer for ELaX) for reducing the number of logged operations by identifying and removing unnecessary, redundant and also invalid modifications. This is an essential prerequisite for the co-evolution of XML Schemas and corresponding XML documents.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
1. INTRODUCTION
The eXtensible Markup Language (XML) [2] is one of the most popular formats for exchanging and storing structured and semi-structured information in heterogeneous environments. To assure that well-defined XML documents are valid, it is necessary to introduce a document description which contains information about allowed structures, constraints, data types and so on. XML Schema [4] is one commonly used standard for dealing with this problem. After an XML Schema has been used for a period of time, the requirements can change, for example if additional elements are needed, data types change or integrity constraints are introduced. This may result in the adaptation of the XML Schema definition.

In [16] we presented the transformation language ELaX (Evolution Language for XML-Schema) to describe and formulate these XML Schema modifications. Furthermore, we mentioned briefly that ELaX is also useful to log information about modifications consistently, an essential prerequisite for the co-evolution process of XML Schema and corresponding XML documents [14].

One problem of storing information over a long period of time is that there can be different unnecessary or redundant modifications. Consider modifications which first add an element and shortly afterwards delete the same element. In the overall context of an efficient realization of modification steps, such operations have to be removed. Further issues are incorrect information (possibly caused by network problems), for example if the same element is deleted twice or the order of modifications is invalid (e.g. update before add).

The new rule-based optimizer for ELaX (ROfEL - Rule-based Optimizer for ELaX) has been developed to solve the above mentioned problems. With ROfEL it is possible to identify unnecessary or redundant operations by using different straightforward optimization rules. Furthermore, the underlying algorithm is capable of correcting invalid modification steps. All in all, ROfEL can reduce the number of modification steps by removing or even correcting the logged ELaX operations.

This paper is organized as follows. Section 2 gives the necessary background on XML Schema, ELaX and corresponding concepts. Section 3 and section 4 present our approach, by first specifying our rule-based algorithm ROfEL and then showing how our approach can be applied to an example. Related work is presented in section 5. Finally, in section 6 we draw our conclusions.

2. TECHNICAL BACKGROUND
In this section we present a common notation used in the remainder of this paper. At first, we shortly introduce the XSD (XML Schema Definition [4]), before details concerning ELaX (Evolution Language for XML-Schema [16]) and the logging of ELaX are given.

The XML Schema abstract data model consists of different components (simple and complex type definitions, element and attribute declarations, etc.). Additionally, the element information item serves as an XML representation of these components and defines which content and attributes can be used in an XML Schema. The possibility of specifying declarations and definitions in a local or global scope leads to four different modeling styles [13]. One of them is the Garden of Eden style, in which all above mentioned components are globally defined. This results in a high re-usability of declarations and defined data types and influences the flexibility of an XML Schema in general.

The transformation language ELaX¹ was developed to handle modifications of an XML Schema and to express such modifications formally. The abstract data model, the element information item and the Garden of Eden style were important throughout the development process and influence the EBNF (Extended Backus-Naur Form) like notation of ELaX.

¹ The whole transformation language ELaX is available at: www.ls-dbis.de/elax
An ELaX statement always starts with "add", "delete" or "update", followed by one of the alternative components (simple type, element declaration, etc.) and an identifier of the current component, and is completed with optional tuples of attributes and values (examples follow, e.g. see figure 1). The identifier is a unique EID (emxid)², a QNAME (qualified name) or a subset of XPath expressions. In the remaining parts we will use the EID as the identifier, but a transformation would easily be possible.

ELaX statements are logged for further analyses and also as a prerequisite for the rule-based optimizer (see section 3). Figure 1 illustrates the relational schema of the log. The chosen values are simple ones (especially the length).

  file-ID  time  EID  op-Type  msg-Type  content
  1        1     1    add      0         add element name 'name' type 'xs:decimal' id 'EID1' ;
  1        2     1    upd      0         update element name 'name' change type 'xs:string' ;
  1        3     2    add      0         add element name 'count' type 'xs:decimal' id 'EID2' ;
  ...      ...   ...  ...      ...       ...

  Figure 1: Schema with relation for logging ELaX

The attributes file-ID and time are the composite key of the logging relation; the EID represents the unique identifier of a component of the XSD. The op-Type is a short form for the add, delete (del) or update (upd) operations, the msg-Type stands for the different message types (ELaX (0), etc.). Lastly, the content contains the logged ELaX statements. The file-ID and msg-Type are management information, which are not covered in this paper.

² Our conceptual model is EMX (Entity Model for XML Schema [15]), in which every component of a model has its own, global identifier: EID
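As an illustration only, the following minimal Python sketch shows one way such logged ELaX statements could be represented in memory for rule processing. The tuple layout (time, EID, op-Type, content) mirrors the logging relation of figure 1, with file-ID and msg-Type omitted because they are not used by the optimizer; the layout and names are assumptions made here, not part of ELaX or ROfEL.

  # Hypothetical in-memory form of the first entries of figure 1:
  # (time, EID, op-Type, content), content as attribute-value pairs.
  log = [
      (1, "EID1", "add", {"name": "name",  "type": "xs:decimal", "id": "EID1"}),
      (2, "EID1", "upd", {"name": "name",  "type": "xs:string"}),
      (3, "EID2", "add", {"name": "count", "type": "xs:decimal", "id": "EID2"}),
  ]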
3. RULE-BASED OPTIMIZER
The algorithm ROfEL (Rule-based Optimizer for ELaX) was developed to reduce the number of logged ELaX operations. This is possible by combining given operations and/or removing unnecessary or even redundant operations. Furthermore, the algorithm can identify invalid operations in a given log and correct them to a certain degree.

ROfEL is a rule-based algorithm. Provided that a log of ELaX operations is given (see section 2), the following rules are essential to reduce the number of operations. In compliance with ELaX these operations are delete (del), add or update (upd). If a certain distinction is not necessary, a general operation (op) or a variable (_) is used; empty denotes a not given operation. Additionally, the rules are classified by their purpose to handle redundant (R), unnecessary (U) or invalid (I) operations. ROfEL stops (S) if no other rules are applicable, for example if no other operation with the same EID is given.

  S: empty → op(EID) ⇒ op(EID)                                      (1)

  // most recent operation: delete (del)
  R: del(EID) → del(EID) ⇒ del(EID)                                 (2)
  U: add(EID, content) → del(EID) ⇒ empty                           (3)
  U: upd(EID, content) → del(EID) ⇒ del(EID)                        (4)
     with time(del(EID)) := TIME(del(EID), upd(EID, content))

  // most recent operation: add
  U: op(EID) → del(EID) → add(EID, content)
     ⇒ op(EID) → add(EID, content)                                  (5)
  I: add(EID, _) → add(EID, content) ⇒ add(EID, content)            (6)
  I: upd(EID, _) → add(EID, content) ⇒ upd(EID, content)            (7)

  // most recent operation: update (upd)
  I: op(EID) → del(EID) → upd(EID, content)
     ⇒ op(EID) → upd(EID, content)                                  (8)
  U: add(EID, content) → upd(EID, content) ⇒ add(EID, content)      (9)
  U: add(EID, content) → upd(EID, content')
     ⇒ add(EID, MERGE(content', content))                           (10)
  R: upd(EID, content) → upd(EID, content) ⇒ upd(EID, content)      (11)
  U: upd(EID, content) → upd(EID, content')
     ⇒ upd(EID, MERGE(content', content))                           (12)

The rules have to be analyzed sequentially from left to right (→), whereas the left operation comes temporally before the right one (i.e., time(left) < time(right)). To warrant that the operations are working on the same component, the EID of both operations is equal. If two operations exist and a rule applies to them, then the result can be found on the right side of ⇒. The time of the result is the time of the prior (left) operation, except if further investigations are explicitly necessary or the time is unknown (e.g. empty).

Another point of view illustrates that the introduced rules are complete concerning the given operations add, delete and update. Figure 2 represents an operation matrix, in which every possible combination is covered by at least one rule. On the x-axis the prior operation and on the y-axis the most recent operation are given, whereas the three-valued rules (5) and (8) are reduced to the two most recent operations (e.g. without op(EID)). The cell at each intersection contains the applying rule or rules (considering the possibility of merging the content, see below).

                            prior operation
                      add         delete    update
  recent   add        (6)         (5)       (7)
           delete     (3)         (2)       (4)
           update     (9), (10)   (8)       (11), (12)

  Figure 2: Operation matrix of rules

Rule (4) is one example for further investigations. If a component is deleted (del(EID)) but updated (upd(EID)) before, then it is not possible to replace the prior operation with the result (del(EID)) without analyzing other operations between them. The problem is: if another operation (op(EID')) references the deleted component (e.g. a simple type), but because of ROfEL upd(EID) (the prior operation) is replaced with del(EID), then op(EID') would be invalid. Therefore, the function TIME() is used to determine the correct time of the result. The function is given in pseudocode in figure 3. TIME() has two input parameters and returns a time value, depending on the existence of an operation which references the EID in its content. If no such operation exists, the time of the result in rule (4) is the time of the left operation (op), otherwise that of the right operation (op'). The lines starting with // are comments and contain further information, some hints or even explanations of variables.

  TIME(op, op'):
  // time(op) = t; time(op') = t'; time(opx) = tx;
  // op.EID == op'.EID; op.EID != opx.EID; t > t';
  begin
    if ((t > tx > t') AND (op.EID in opx.content))
    then return t;
    return t';
  end.

  Figure 3: TIME() function of optimizer

The rules (6), (7) and (8) adapt invalid operations. For example, if a component is updated but deleted before (see rule (8)), then ROfEL has to decide which operation is valid. In this and similar cases the most recent operation is preferred, because it is more difficult (or even impossible) to check the intention of the prior operation. Consequently, in rule (8) del(EID) is removed and rule op(EID) → upd(EID, content) applies (op(EID) could be empty; see rule (1)).
The rules (10) and (12) remove unnecessary operations by merging the content of the involved operations. The function MERGE() implements this; the pseudocode is presented in figure 4. MERGE() has two input parameters, the content of the most recent (left) and the prior (right) operation. The content is given as a sequence of attribute-value pairs (see the ELaX description in section 2). The result of the function is the combination of the input, whereas the content of the most recent operation is preferred, analogical to the above mentioned behaviour for I rules. All attribute-value pairs of the most recent operation are completely inserted into the result. Simultaneously, these attributes are removed from the content of the prior operation. At the end of the function, all remaining attributes of the prior (right) operation are inserted, before the result is returned.

  MERGE(content, content'):
  // content  = (A1 = 'a1', A2 = 'a2', A3 = '', A4 = 'a4');
  // content' = (A1 = 'a1', A2 = '', A3 = 'a3', A5 = 'a5');
  begin
    result := {};
    count := 1;
    while (count <= content.size())
      result.add(content.get(count));
      if (content.get(count) in content')
      then content'.remove(content.get(count));
      count := count + 1;
    count := 1;
    while (count <= content'.size())
      result.add(content'.get(count));
      count := count + 1;
    // result = (A1 = 'a1', A2 = 'a2', A3 = '', A4 = 'a4', A5 = 'a5');
    return result;
  end.

  Figure 4: MERGE() function of optimizer

All mentioned rules, as well as the functions TIME() and MERGE(), are essential parts of the main function ROFEL(); the pseudocode is presented in figure 5. ROFEL() has one input parameter, the log of ELaX operations. This log is a sequence sorted according to time; it is analyzed in reverse. In general, one operation is pinned (log.get(i)) and compared with the next, prior operation (log.get(k)). If log.get(k) modifies the same component as log.get(i) (i.e., the EID is equal) and the time is different, then an applying rule is searched, otherwise the next operation (log.get(k - 1)) is analyzed. The algorithm terminates if the outer loop completes successfully (i.e., no further optimization is possible).

  ROFEL(log):
  // log = ((t1,op1), (t2,op2), ...); t1 < t2 < ...;
  begin
    for (i := log.size(); i >= 2; i := i - 1)
      for (k := i - 1; k >= 1; k := k - 1)
        if (!(log.get(i).EID == log.get(k).EID AND
              log.get(i).time != log.get(k).time))
        then continue;
        // R: del(EID) -> del(EID) => del(EID) (2)
        if (log.get(i).op-Type == 1 AND log.get(k).op-Type == 1)
        then
          log.remove(i);
          return ROFEL(log);
        // U: upd(EID, content) -> del(EID) => del(EID) (4)
        if (log.get(i).op-Type == 1 AND log.get(k).op-Type == 2)
        then
          temp := TIME(log.get(i), log.get(k));
          if (temp == log.get(i).time)
          then
            log.remove(k);
            return ROFEL(log);
          log.get(k) := log.get(i);
          log.remove(i);
          return ROFEL(log);
        [...]
        // U: upd(EID,con) -> upd(EID,con') => upd(EID, MERGE(con',con)) (12)
        if (log.get(i).op-Type == 2 AND log.get(k).op-Type == 2)
        then
          temp := MERGE(log.get(i).content, log.get(k).content);
          log.get(k).content := temp;
          log.remove(i);
          return ROFEL(log);
    return log;
  end.

  Figure 5: Main function ROFEL() of optimizer
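The pseudocode of figures 3 to 5 translates almost directly into executable code. The following Python sketch implements a simplified subset of ROfEL, covering rules (2), (3), (4), (10) and (12) on the tuple-based log representation sketched in section 2; the remaining rules are omitted, and the numeric operation codes (1 = del, 2 = upd as in figure 5, 3 = add) are partly an assumption. It is meant to illustrate the reduce-and-restart structure of figure 5, not to reproduce the authors' implementation.

  # Simplified ROfEL sketch; a log entry is (time, eid, op, content).
  DEL, UPD, ADD = 1, 2, 3

  def merge(recent, prior):
      # MERGE() of figure 4: attribute-value pairs of the most recent
      # operation win; remaining attributes of the prior one are kept.
      result = dict(prior)
      result.update(recent)
      return result

  def time_of_result(log, i, k):
      # TIME() of figure 3: keep the recent time if an operation between
      # the two entries references the EID in its content, else the prior.
      t_i, eid = log[i][0], log[i][1]
      t_k = log[k][0]
      for t_x, eid_x, _, c_x in log:
          if t_k < t_x < t_i and eid_x != eid and eid in c_x.values():
              return t_i
      return t_k

  def rofel(log):
      # Main loop of figure 5: pin the most recent entry, search backwards
      # for an entry on the same component, apply a rule, restart recursively.
      for i in range(len(log) - 1, 0, -1):
          t_i, eid_i, op_i, c_i = log[i]
          for k in range(i - 1, -1, -1):
              t_k, eid_k, op_k, c_k = log[k]
              if eid_i != eid_k or t_i == t_k:
                  continue
              if op_i == DEL and op_k == DEL:              # rule (2)
                  del log[i]
                  return rofel(log)
              if op_i == DEL and op_k == ADD:              # rule (3)
                  del log[i]
                  del log[k]
                  return rofel(log)
              if op_i == DEL and op_k == UPD:              # rule (4)
                  if time_of_result(log, i, k) == t_i:
                      del log[k]
                  else:
                      log[k] = (t_k, eid_k, DEL, c_i)
                      del log[i]
                  return rofel(log)
              if op_i == UPD and op_k == ADD:              # rules (9)/(10)
                  log[k] = (t_k, eid_k, ADD, merge(c_i, c_k))
                  del log[i]
                  return rofel(log)
              if op_i == UPD and op_k == UPD:              # rules (11)/(12)
                  log[k] = (t_k, eid_k, UPD, merge(c_i, c_k))
                  del log[i]
                  return rofel(log)
              # combinations covered by the omitted rules are skipped here
      return log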
Three rules are presented in figure 5; the missing ones are skipped ([...]). The first rule is (2), the occurrence of redundant delete operations. According to the above mentioned time choosing guidelines, the most recent operation (log.get(i)) is removed. After this the optimizer starts again with the modified log recursively (return ROFEL(log)).

The second rule is (4), which removes an unnecessary update operation, because the whole referenced component will be deleted later. This rule uses the TIME() function of figure 3 to decide which time should be assigned to the result. If another operation between log.get(i) and log.get(k) exists and this operation contains or references log.get(i).EID, then the most recent time (log.get(i).time) is assigned, otherwise the prior time (log.get(k).time).

The last rule is (12), where different updates on the same component are given. The MERGE() function of figure 4 combines the content of both operations, before the content of the prior operation is changed and the most recent operation is removed.

After introducing detailed information about the concept of the ROfEL algorithm, we want to use it to optimize an example in the next section.

4. EXAMPLE
In the last section we specified the rule-based algorithm ROfEL (Rule-based Optimizer for ELaX); now we want to explain its use with an example: we want to store some information about a conference. We assume the XML Schema of figure 6 is given; a corresponding XML document is also presented. The XML Schema is in the Garden of Eden style and contains four element declarations (conf, name, count, start) and one complex type definition (confType) with a group model (sequence). The group model has three element references, which reference one of the simple type element declarations mentioned above. The identification of all components is simplified by using an EID; it is visualized as a unique ID attribute (id = "..").

  [Figure 6: XML Schema with XML document]

The log of modification steps to create this XML Schema is presented in figure 7. The relational schema is reduced in comparison to figure 1. The time, the component EID, the op-Type and the content of the modification steps are given. The log contains different modification steps which are not given in the XML Schema (EID > 9). Additionally, some entries are connected within the newly introduced column ROfEL; in the original figure, red lines and numbers link the log entries involved in the rule applications discussed below.

  time  EID  op-Type  content
  1     1    add      add element name 'name' type 'xs:decimal' id 'EID1' ;
  2     1    upd      update element name 'name' change type 'xs:string' ;
  3     2    add      add element name 'count' type 'xs:decimal' id 'EID2' ;
  4     3    add      add element name 'start' type 'xs:date' id 'EID3' ;
  5     42   add      add element name 'stop' type 'xs:date' id 'EID42' ;
  6     4    add      add complextype name 'confType' id 'EID4' ;
  7     5    add      add group mode sequence id 'EID5' in 'EID4' ;
  8     42   upd      update element name 'stop' change type 'xs:string' ;
  9     6    add      add elementref 'name' id 'EID6' in 'EID5' ;
  10    7    add      add elementref 'count' id 'EID7' in 'EID5' ;
  11    8    add      add elementref 'start' id 'EID8' in 'EID5' ;
  12    42   del      delete element name 'stop' ;
  13    9    add      add element name 'conf' type 'confType' id 'EID9' ;
  14    42   del      delete element name 'stop' ;

  Figure 7: XML Schema modification log of figure 6

The sorted log is analyzed reversely: the operation with time stamp 14 is pinned and compared with time entry 13. Because the modified component is not the same (EID not equal), the next operation with time 12 is taken. Both operations delete the same component (op-Type == 1). According to rule (2), the redundant entry 14 is removed and ROFEL restarts with the adapted log. Rule (4) applies next: a component is updated but deleted later. This rule calls the TIME() function to determine whether the time of the result (i.e., del(EID)) should be 12 or 8. Because no operation between 12 and 8 references EID 42, the time of the result of (4) is 8.
The content of time 8 is replaced with delete element name 'stop';, the op-Type is set to 1 and the time entry 12 is deleted.

Afterwards, ROFEL restarts again and rule (3) can be used to compare the new operation of entry 8 (original entry 12) with the operation of time 5. A component is inserted but deleted later, so all modifications on this component are unnecessary in general. Consequently, both entries are deleted and the component with EID 42 is not given in the XML Schema of figure 6.

The last applying rule is (10). An element declaration is inserted (time 1) and updated (time 2). Consequently, the MERGE() function is used to combine the content of both operations. According to the ELaX specification, the content of the update operation contains the attribute type with the value xs:string, whereas the add operation contains the attribute type with the value xs:decimal and id with EID1. All attribute-value pairs of the update operation are completely inserted into the output of the function (type = "xs:string"). Simultaneously, the attribute type is removed from the content of the add operation (type = "xs:decimal"). The remaining attributes are inserted in the output (id = "EID1"). Afterwards, the content of entry 1 is replaced by add element 'name' type "xs:string" id "EID1"; and the second entry is deleted (time 2).

The modification log of figure 7 is optimized with rules (2), (4), (3) and (10). It is presented in figure 8. All in all, five of 14 entries are removed, whereas one is replaced by a combination of two others.

  time  EID  op-Type  content
  1     1    add      add element name 'name' type 'xs:string' id 'EID1' ;
  3     2    add      add element name 'count' type 'xs:decimal' id 'EID2' ;
  4     3    add      add element name 'start' type 'xs:date' id 'EID3' ;
  6     4    add      add complextype name 'confType' id 'EID4' ;
  7     5    add      add group mode sequence id 'EID5' in 'EID4' ;
  9     6    add      add elementref 'name' id 'EID6' in 'EID5' ;
  10    7    add      add elementref 'count' id 'EID7' in 'EID5' ;
  11    8    add      add elementref 'start' id 'EID8' in 'EID5' ;
  13    9    add      add element name 'conf' type 'confType' id 'EID9' ;

  Figure 8: XML Schema modification log of figure 7 after using rules (2), (4), (3) and (10) of ROfEL

This simple example illustrates how ROfEL can reduce the number of logged operations introduced in section 3. More complex examples are easy to construct and can be solved by using the same rules and the same algorithm.
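Assuming the simplified rofel() sketch from section 3 is in scope (together with its DEL/UPD/ADD constants), the reduction of the EID1 and EID42 entries of figure 7 can be reproduced on an abbreviated log; the snippet below is illustrative only and leaves out the entries that are not touched by any rule.

  # Abbreviated figure 7 log: only the EID1 and EID42 entries.
  mini_log = [
      (1,  "EID1",  ADD, {"name": "name", "type": "xs:decimal", "id": "EID1"}),
      (2,  "EID1",  UPD, {"name": "name", "type": "xs:string"}),
      (5,  "EID42", ADD, {"name": "stop", "type": "xs:date", "id": "EID42"}),
      (8,  "EID42", UPD, {"name": "stop", "type": "xs:string"}),
      (12, "EID42", DEL, {"name": "stop"}),
      (14, "EID42", DEL, {"name": "stop"}),
  ]
  print(rofel(mini_log))
  # [(1, 'EID1', 3, {'name': 'name', 'type': 'xs:string', 'id': 'EID1'})]
  # Rules (2), (4) and (3) eliminate every EID42 entry; rule (10) folds the
  # update of EID1 into its add, matching the first entry of figure 8.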
5. RELATED WORK
Comparable to the object lifecycle, we create new types or elements, use them (e.g. modify, move or rename) and delete them. The common optimization rules to reduce the number of operations were originally introduced in [10] and are available in other applications in the same way. In [11], rules for reducing a list of user actions (e.g. move, replace, delete, ...) are introduced. In [9], pre- and postconditions of operations are used for deciding which optimizations can be executed. Additional applications can easily be found in further scientific disquisitions.

Regarding other transformation languages, the most commonly used are XQuery [3] and XSLT (Extensible Stylesheet Language Transformations [1]); for these there are also approaches to reduce the number of unnecessary or redundant operations. Moreover, different transformations to improve efficiency are mentioned.

In [12] different "high-level transformations to prune and merge the stream data flow graph" [12] are applied. "Such techniques not only simplify the later analyses, but most importantly, they can rewrite some queries" [12], an essential prerequisite for the efficient evaluation of XQuery over streaming data.

In [5] packages are introduced because of efficiency benefits.
A package is a collection of stylesheet modules "to avoid compiling libraries repeatedly when they are used in multiple stylesheets, and to avoid holding multiple copies of the same library in memory simultaneously" [5]. Furthermore, XSLT works with templates and matching rules for identifying structures in general. If different templates could be applied, automatic or user-given priorities manage which template is chosen. To avoid unexpected behaviour and improve the efficiency of analyses, it is good practice to remove unnecessary or redundant templates.

Another XML Schema modification language is XSchemaUpdate [6], which is used in the co-evolution prototype EXup [7]. Especially the auto adaptation guidelines are similar to the ROfEL purpose of reducing the number of modification steps. "Automatic adaptation will insert or remove the minimum allowed number of elements for instance" [6], i.e., "a minimal set of updates will be applied to the documents" [6].

In [8] an approach is presented which deals with four operations (insert, delete, update, move) on a tree representation of XML. It is similar to our algorithm, but we use ELaX as basis and EIDs instead of update-intensive labelling mechanisms. Moreover, the distinction between property and node, the "deletion always wins" view, as well as the limitation that a "reduced sequence might still be reducible" [8] are drawbacks. The optimized reduction algorithm eliminates the last drawback, but needs another complex structure, an operation hyper-graph.

6. CONCLUSION
The rule-based algorithm ROfEL (Rule-based Optimizer for ELaX) was developed to reduce the number of logged ELaX (Evolution Language for XML-Schema [16]) operations. In general, ELaX statements are add, delete and update operations on the components of an XML Schema, specified by a user.

ROfEL allows the identification and deletion of unnecessary and redundant modifications by applying different heuristic rules. Additionally, invalid operations are also corrected or removed. In general, if the preconditions and conditions for an adaptation of two ELaX log entries are satisfied (e.g. EID equivalent, op-Type correct, etc.), one rule is applied and the modified, reduced log is returned.

We are confident that, even if ROfEL is domain specific and the underlying log is specialized for our needs, the above specified rules are applicable in other scenarios or applications in which the common modification operations add, delete and update are used (minor adaptations preconditioned).

Future work. The integration of a cost-based component into ROfEL could be very interesting. It is possible that, under consideration of further analyses, the combination of different operations (e.g. rule (10)) is inefficient in general. In this and similar cases a cost function with different thresholds could be defined to guarantee that only efficient adaptations of the log are applied. A convenient cost model would be necessary, but this requires further research.

Feasibility of the approach. At the University of Rostock we implemented the prototype CodeX (Conceptual design and evolution for XML Schema) for dealing with the co-evolution [14] of XML Schema and XML documents; ROfEL and corresponding concepts are fully integrated. As we plan to report in combination with the first release of CodeX, the significantly reduced number of logged operations proves that the whole algorithm is definitely feasible.

7. REFERENCES
[1] XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 25-June-2014.
[2] Extensible Markup Language (XML) 1.0 (Fifth Edition). http://www.w3.org/TR/2008/REC-xml-20081126/, November 2008. Online; accessed 25-June-2014.
[3] XQuery 1.0: An XML Query Language (Second Edition). http://www.w3.org/TR/2010/REC-xquery-20101214/, December 2010. Online; accessed 25-June-2014.
[4] W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/2012/REC-xmlschema11-1-20120405/, April 2012. Online; accessed 25-June-2014.
[5] XSL Transformations (XSLT) Version 3.0. http://www.w3.org/TR/2013/WD-xslt-30-20131212/, December 2013. Online; accessed 25-June-2014.
[6] F. Cavalieri. Querying and Evolution of XML Schemas and Related Documents. Master's thesis, University of Genova, 2009.
[7] F. Cavalieri. EXup: an engine for the evolution of XML schemas and associated documents. In Proceedings of the 2010 EDBT/ICDT Workshops, EDBT '10, pages 21:1-21:10, New York, NY, USA, 2010. ACM.
[8] F. Cavalieri, G. Guerrini, M. Mesiti, and B. Oliboni. On the Reduction of Sequences of XML Document and Schema Update Operations. In ICDE Workshops, pages 77-86, 2011.
[9] H. U. Hoppe. Task-oriented Parsing - a Diagnostic Method to Be Used Adaptive Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '88, pages 241-247, New York, NY, USA, 1988. ACM.
[10] M. Klettke. Modellierung, Bewertung und Evolution von XML-Dokumentkollektionen. Habilitation, Fakultät für Informatik und Elektrotechnik, Universität Rostock, 2007.
[11] R. Kramer. iContract - the Java(tm) Design by Contract(tm) tool. In TOOLS '98: Proceedings of the Technology of Object-Oriented Languages and Systems, page 295. IEEE Computer Society, 1998.
[12] X. Li and G. Agrawal. Efficient Evaluation of XQuery over Streaming Data. In Proc. VLDB '05, pages 265-276, 2005.
[13] E. Maler. Schema Design Rules for UBL...and Maybe for You. In XML 2002 Proceedings by deepX, 2002.
[14] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. In Grundlagen von Datenbanken, pages 29-34, 2012.
[15] T. Nösinger, M. Klettke, and A. Heuer. A Conceptual Model for the XML Schema Evolution - Overview: Storing, Base-Model-Mapping and Visualization. In Grundlagen von Datenbanken, 2013.
[16] T. Nösinger, M. Klettke, and A. Heuer. XML Schema Transformations - The ELaX Approach. In DEXA (1), pages 293-302, 2013.
Automatic Decomposition of Multi-Author Documents Using Grammar Analysis

Michael Tschuggnall and Günther Specht
Databases and Information Systems
Institute of Computer Science, University of Innsbruck, Austria
{michael.tschuggnall, guenther.specht}@uibk.ac.at

ABSTRACT
The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of authors. The main idea is thereby to quantify differences of the grammatical writing style of authors and to use this information to build paragraph clusters, whereby each cluster is assigned to a different author. In order to analyze the style of a writer, text is split into single sentences, and for each sentence a full parse tree is calculated. Using the latter, a profile is subsequently computed that represents the main characteristics of each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, whereby a supplementary analysis indicates that in general common classification algorithms perform better than clustering approaches.

Keywords
Text Segmentation, Multi-Author Decomposition, Parse Trees, pq-grams, Clustering

1. INTRODUCTION
The growing amount of currently available data is hardly manageable without the use of specific tools and algorithms that provide relevant portions of that data to the user. While this problem is generally addressed with information retrieval approaches, another possibility to significantly reduce the amount of data is to build clusters. Within each cluster, the data is similar according to some predefined features. Thereby many approaches exist that propose algorithms to cluster plain text documents (e.g. [16], [22]) or specific web documents (e.g. [33]) by utilizing various features. Approaches which attempt to divide a single text document into distinguishable units like different topics, for example, are usually referred to as text segmentation approaches. Here, also many features including statistical models, similarities between words or other semantic analyses are used. Moreover, text clusters are also used in recent plagiarism detection algorithms (e.g. [34]), which try to build a cluster for the main author and one or more clusters for intrusive paragraphs. Another scenario where the clustering of text is applicable is the analysis of multi-author academic papers: especially the verification of collaborative student works such as bachelor or master theses can be useful in order to determine the amount of work done by each student.

Using results of previous work in the field of intrinsic plagiarism detection [31] and authorship attribution [32], the assumption that individual authors have significantly different writing styles in terms of the syntax that is used to construct sentences has been reused. For example, the following sentence (extracted from a web blog): "My chair started squeaking a few days ago and it's driving me nuts." (S1) could also be formulated as "Since a few days my chair is squeaking - it's simply annoying." (S2), which is semantically equivalent but differs significantly according to the syntax, as can be seen in Figure 1. The main idea of this work is to quantify those differences by calculating grammar profiles and to use this information to decompose a collaboratively written document, i.e., to assign each paragraph of a document to an author.

The rest of this paper is organized as follows: Section 2 at first recapitulates the principle of pq-grams, which represent a core concept of the approach. Subsequently the algorithm is presented in detail, which is then evaluated in Section 3 by using different clustering algorithms and data sets. A comparison of clustering and classification approaches is discussed in Section 4, while Section 5 depicts related work. Finally, a conclusion and future work directions are given in Section 6.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.
is generally addressed with information retrieval approaches, an- other possibility to significantly reduce the amount of data is to 2. ALGORITHM build clusters. Within each cluster, the data is similar according to In the following the concept of pq-grams is explained, which some predefined features. Thereby many approaches exist that pro- serves as the basic stylistic measure in this approach to distinguish pose algorithms to cluster plain text documents (e.g. [16], [22]) or between authors. Subsequently, the concrete steps performed by specific web documents (e.g. [33]) by utilizing various features. the algorithm are discussed in detail. Approaches which attempt to divide a single text document into distinguishable units like different topics, for example, are usu- 2.1 Preliminaries: pq-grams ally referred to as text segmentation approaches. Here, also many Similar to n-grams that represent subparts of given length n of features including statistical models, similarities between words or a string, pq-grams extract substructures of an ordered, labeled tree other semantic analyses are used. Moreover, text clusters are also [4]. The size of a pq-gram is determined by a stem (p) and a base used in recent plagiarism detection algorithms (e.g. [34]) which (q) like it is shown in Figure 2. Thereby p defines how much nodes are included vertically, and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with p = 2 and q = 3 starting from PP at the left side of tree (S2) shown in Figure 1 would be [PP-NP-DT-JJ-NNS] (the concrete words are omitted). Copyright c by the paper’s authors. Copying permitted only for The pq-gram index then consists of all possible pq-grams of private and academic purposes. a tree. In order to obtain all pq-grams, the base is shifted left In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- Workshop on Foundations of Databases (Grundlagen von Datenbanken), and right additionally: If then less than p nodes exist horizon- 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. tally, the corresponding place in the pq-gram is filled with *, in- 17 (S1) steps: S S CC S 1. At first the document is preprocessed by eliminating unnec- (and) essary whitespaces or non-parsable characters. For exam- NP VP NP VP ple, many data sets often are based on novels and articles of various authors, whereby frequently OCR text recognition is PRP NN VBD S PRP VBZ (My) (chair) (started) (it) ('s) VP used due to the lack of digital data. Additionally, such doc- VBG S uments contain problem sources like chapter numbers and VP (driving) titles or incorrectly parsed picture frames that result in non- alphanumeric characters. VBG ADVP NP NP (squeaking) NP RB PRP NNS 2. Subsequently, the document is partitioned into single para- (ago) (me) (nuts) graphs. For simplification reasons this is currently done by DT JJ NNS only detecting multiple line breaks. (a) (few) (days) 3. Each paragraph is then split into single sentences by utiliz- ing a sentence boundary detection algorithm implemented (S2) S within the OpenNLP framework1 . Then for each sentence a full grammar tree is calculated using the Stanford Parser S - S [19]. For example, Figure 1 depicts the grammar trees re- PP NP VP NP VP sulting from analyzing sentences (S1) and (S2), respectively. 
2.2 Clustering by Authors
The number of choices an author has to formulate a sentence in terms of grammar structure is rather high, and the assumption in this approach is that the concrete choice is made mostly intuitively and unconsciously. On that basis the grammar of authors is analyzed, which serves as input for common state-of-the-art clustering algorithms to build clusters of text documents or paragraphs. The decision of the clustering algorithms is thereby based on the frequencies of occurring pq-grams, i.e., on pq-gram profiles. In detail, given a text document, the algorithm consists of the following steps:

1. At first the document is preprocessed by eliminating unnecessary whitespace and non-parsable characters. For example, many data sets are based on novels and articles of various authors, where OCR text recognition is frequently used due to the lack of digital data. Additionally, such documents contain problem sources like chapter numbers and titles or incorrectly parsed picture frames that result in non-alphanumeric characters.

2. Subsequently, the document is partitioned into single paragraphs. For simplification reasons this is currently done by only detecting multiple line breaks.

3. Each paragraph is then split into single sentences by utilizing a sentence boundary detection algorithm implemented within the OpenNLP framework¹. Then for each sentence a full grammar tree is calculated using the Stanford Parser [19]. For example, Figure 1 depicts the grammar trees resulting from analyzing sentences (S1) and (S2), respectively. The labels of each tree correspond to a part-of-speech (POS) tag of the Penn Treebank set [23], where e.g. NP corresponds to a noun phrase, DT to a determiner or JJS to a superlative adjective. In order to examine only the building structure of sentences, as intended by this work, the concrete words, i.e., the leaves of the tree, are omitted.

4. Using the grammar trees of all sentences of the document, the pq-gram index is calculated. As shown in Section 2.1, all valid pq-grams of a sentence are extracted and stored in a pq-gram index. By combining the pq-gram indices of all sentences, a pq-gram profile is computed which contains a list of all pq-grams and their corresponding frequency of appearance in the text. Thereby the frequency is normalized by the total number of all appearing pq-grams. As an example, the five most frequently used pq-grams using p = 2 and q = 3 of a sample document are shown in Table 1. The profile is sorted in descending order by the normalized occurrence, and an additional rank value is introduced that simply defines a natural order which is used in the evaluation (see Section 3).

5. Finally, each paragraph profile is provided as input to the clustering algorithms, which are asked to build clusters based on the pq-grams contained. Concretely, three different feature sets have been evaluated: (1) the frequencies of occurrence of each pq-gram, (2) the rank of each pq-gram, and (3) a union of the two sets.

  pq-gram        Occurrence [%]   Rank
  NP-NN-*-*-*    2.68             1
  PP-IN-*-*-*    2.25             2
  NP-DT-*-*-*    1.99             3
  NP-NNP-*-*-*   1.44             4
  S-VP-*-*-VBD   1.08             5

  Table 1: Example of the Five Most Frequently Used pq-grams of a Sample Document.

¹ Apache OpenNLP, http://incubator.apache.org/opennlp, visited July 2014
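Steps 4 and 5 can be pictured with the short sketch below, which builds normalized pq-gram profiles per paragraph and clusters them. The paper performs the clustering with WEKA; here scikit-learn's KMeans is used purely as a stand-in, only the occurrence-frequency feature set (1) is shown, and the parser output is assumed to be already available as pq-gram lists per paragraph (e.g. produced by the pq_grams() sketch above). Names and parameters are illustrative, not the authors' implementation.

  # Normalized pq-gram profiles per paragraph, clustered with a stand-in
  # algorithm (scikit-learn KMeans instead of the WEKA clusterers).
  from collections import Counter
  from sklearn.feature_extraction import DictVectorizer
  from sklearn.cluster import KMeans

  def profile(pq_gram_list):
      # Frequency of each pq-gram, normalized by the total number of pq-grams.
      counts = Counter(pq_gram_list)
      total = sum(counts.values())
      return {gram: n / total for gram, n in counts.items()}

  def cluster_paragraphs(paragraph_pq_grams, n_authors):
      profiles = [profile(p) for p in paragraph_pq_grams]
      features = DictVectorizer(sparse=False).fit_transform(profiles)
      return KMeans(n_clusters=n_authors, n_init=10).fit_predict(features)

  # Toy example: two "paragraphs" with clearly different pq-gram distributions.
  paragraphs = [
      ["NP-NN-*-*-*", "NP-NN-*-*-*", "PP-IN-*-*-*"],
      ["S-VP-*-*-VBD", "S-VP-*-*-VBD", "NP-DT-*-*-*"],
  ]
  print(cluster_paragraphs(paragraphs, n_authors=2))  # e.g. [0 1] or [1 0]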
number of clusters is cascaded and automatically chosen) [5], X- • The PAN’12 competition corpus (PAN12): As a well-known, Means [26], Agglomerative Hierarchical Clustering [25], and Far- state-of-the-art corpus originally created for the use in au- thest First [9]. thorship identification, parts3 of the PAN2012 corpus [18] For the clustering algorithms K-Means, Hierarchical Clustering have been integrated. The corpus is composed of several and Farthest First the number of clusters has been predefined ac- fiction texts and split into several subtasks that cover small- cording to the respective test data. This means if the test document and common-length documents (1800-6060 words) as well has been collaborated by three authors, the number of clusters has as larger documents (up to 13000 words) and novel-length also been set to three. On the other hand, the algorithms Cascaded documents (up to 170,000 words). Finally, the test setused in K-Means and X-Means implicitly decide which amount of clusters this evaluation contains 14 documents (paragraphs) written is optimal. Therefore these algorithms have been limited only in by three authors that are distributed equally. ranges, i.e., the minimum and maximum number of clusters has been set to two and six, respectively. 3.2 Results The best results of the evaluation are presented in Table 2, where 3. EVALUATION the best performance for each clusterer over all data sets is shown in The utilization of pq-gram profiles as input features for mod- subtable (a), and the best configuration for each data set is shown ern clustering algorithms has been extensively evaluated using dif- in subtable (b), respectively. With an accuracy of 63.7% the K- ferent documents and data sets. As clustering and classification Means algorithm worked best by using p = 2, q = 3 and by uti- problems are closely related, the global aim was to experiment on lizing all available features. Interestingly, the X-Means algorithm the accuracy of automatic text clustering using solely the proposed also achieved good results considering the fact that in this case the grammar feature, and furthermore to compare it to those of current number of clusters has been assigned automatically by the algo- classification techniques. rithm. Finally, the hierarchical cluster performed worst gaining an accuracy of nearly 10% less than K-Means. 3.1 Test Data and Experimental Setup Regarding the best performances for each test data set, the re- In order to evaluate the idea, different documents and test data sults for the manually created data sets from novel literature are sets have been used, which are explained in more detail in the fol- generally poor. For example, the best result for the two-author doc- lowing. Thereby single documents have been created which con- ument Twain-Wells is only 59.6%, i.e., the accuracy is only slightly tain paragraphs written by different authors, as well as multiple better than the baseline percentage of 50%, which can be achieved documents, whereby each document is written by one author. In by randomly assigning paragraphs into two clusters.4 On the other the latter case, every document is treated as one (large) paragraph hand, the data sets reused from authorship attribution, namely the for simplification reasons. FED and the PAN12 data set, achieved very good results with an For the experiment, different parameter settings have been eval- accuracy of about 89% and 83%, respectively. 
3.2 Results
The best results of the evaluation are presented in Table 2, where the best performance of each clusterer over all data sets is shown in subtable (a), and the best configuration for each data set is shown in subtable (b), respectively. With an accuracy of 63.7% the K-Means algorithm worked best by using p = 2, q = 3 and by utilizing all available features. Interestingly, the X-Means algorithm also achieved good results considering the fact that in this case the number of clusters has been assigned automatically by the algorithm. Finally, the hierarchical clusterer performed worst, gaining an accuracy of nearly 10% less than K-Means.

Regarding the best performances for each test data set, the results for the manually created data sets from novel literature are generally poor. For example, the best result for the two-author document Twain-Wells is only 59.6%, i.e., the accuracy is only slightly better than the baseline percentage of 50%, which can be achieved by randomly assigning paragraphs into two clusters.⁴ On the other hand, the data sets reused from authorship attribution, namely the FED and the PAN12 data sets, achieved very good results with an accuracy of about 89% and 83%, respectively. Nevertheless, as the other data sets have been specifically created for the clustering evaluation, those results may be more expressive. Therefore a comparison between clustering and classification approaches is discussed in the following, showing that the latter achieve significantly better results on those data sets when using the same features.

  (a) Clustering Algorithms
  Method               p   q   Feature Set       Accuracy
  K-Means              3   2   All               63.7
  X-Means              2   4   Rank              61.7
  Farthest First       4   2   Occurrence-Rate   58.7
  Cascaded K-Means     2   2   Rank              55.3
  Hierarchical Clust.  4   3   Occurrence-Rate   54.7

  (b) Test Data Sets
  Data Set    Method        p   q   Feat. Set   Accuracy
  T-W         X-Means       3   2   All         59.6
  T-W-S       X-Means       3   4   All         49.0
  FED         Farth. First  4   3   Rank        89.4
  PAN12-A/B   K-Means       3   3   All         83.3

  Table 2: Best Evaluation Results for Each Clustering Algorithm and Test Data Set in Percent.

⁴ In this case X-Means dynamically created two clusters, but the result is still better than that of other algorithms using a fixed number of clusters.

4. COMPARISON OF CLUSTERING AND CLASSIFICATION APPROACHES
For the given data sets, any clustering problem can be rewritten as a classification problem, with the exception that the latter needs training data. Although a direct comparison should be treated with caution, it still gives an insight into how the two different approaches perform using the same data sets. Therefore an additional evaluation is shown in the following, which compares the performance of the clustering algorithms to the performance of the following classification algorithms: Naive Bayes classifier [17], Bayes Network using the K2 classifier [8], Large Linear Classification using LibLinear [12], Support Vector Machine using LIBSVM with nu-SVC classification [6], k-nearest-neighbors classifier (kNN) using k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compensate for the missing training data, a 10-fold cross-validation has been used for each classifier.
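A comparable classification experiment can be sketched as follows; again scikit-learn is used only as a stand-in for the WEKA classifiers named above (a linear SVM in place of LibLinear/LibSVM), operating on the same pq-gram feature vectors as before. The setup is illustrative and assumes that author labels are available for the paragraphs.

  # 10-fold cross-validated classification on the pq-gram features,
  # as a stand-in for the WEKA classifiers used in the paper.
  from sklearn.feature_extraction import DictVectorizer
  from sklearn.model_selection import cross_val_score
  from sklearn.naive_bayes import GaussianNB
  from sklearn.svm import LinearSVC

  def classify_accuracy(paragraph_profiles, author_labels, clf=None):
      # paragraph_profiles: list of {pq-gram: normalized frequency} dicts
      # author_labels: one ground-truth author label per paragraph
      X = DictVectorizer(sparse=False).fit_transform(paragraph_profiles)
      clf = clf or LinearSVC()
      return cross_val_score(clf, X, author_labels, cv=10).mean()

  # Usage (hypothetical data): compare a linear SVM with Naive Bayes.
  # acc_svm = classify_accuracy(profiles, labels)
  # acc_nb  = classify_accuracy(profiles, labels, clf=GaussianNB())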
Therefore an additional evalua- (a) Twain-Wells tion is shown in the following, which compares the performance of the clustering algorithms to the performance of the the following p 2 q 2 Algorithm K-Means Max 44.3 N-Bay 67.8 Bay-Net 70.8 LibLin 74.0 LibSVM 75.2 kNN 51.0 J48 73.3 classification algorithms: Naive Bayes classifier [17], Bayes Net- 2 3 X-Means 38.3 65.1 67.1 70.7 72.3 48.3 70.2 2 4 X-Means 45.6 63.1 68.1 70.5 71.8 49.0 69.3 work using the K2 classifier [8], Large Linear Classification using 3 2 X-Means 45.0 51.7 64.1 67.3 68.8 45.6 65.4 3 3 X-Means 47.0 57.7 64.8 67.3 68.5 47.0 65.9 LibLinear [12], Support vector machine using LIBSVM with nu- 3 4 X-Means 49.0 67.8 67.8 70.5 72.5 46.3 68.3 SVC classification [6], k-nearest-neighbors classifier (kNN) using 4 4 2 3 X-Means K-Means 36.2 35.6 61.1 53.0 67.1 63.8 69.1 67.6 69.5 70.0 50.3 47.0 65.1 66.6 k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compen- 4 4 X-Means 35.6 57.7 66.1 68.5 69.3 42.3 66.8 average improvement 18.7 24.8 27.7 29.0 5.6 26.0 sate the missing training data, a 10-fold cross-validation has been used for each classifier. (b) Twain-Wells-Shelley Table 3 shows the performance of each classifier compared to the p q Algorithm Max N-Bay Bay-Net LibLin LibSVM kNN J48 best clustering result using the same data and pq-setting. It can be 2 2 Farth. First 77.3 81.1 86.4 90.9 84.2 74.2 81.8 2 3 Farth. First 78.8 85.6 87.4 92.4 89.0 78.8 82.8 seen that the classifiers significantly outperform the clustering re- 2 4 X-Means 78.8 89.4 92.4 90.9 87.3 89.4 85.9 sults for the Twain-Wells and Twain-Wells-Shelley documents. The 3 3 2 3 K-Means K-Means 81.8 78.8 82.6 92.4 87.9 92.4 92.4 92.4 85.5 86.4 80.3 81.8 83.8 83.8 support vector machine framework (LibSVM) and the linear classi- 3 4 Farth. First 86.4 84.8 90.9 97.0 85.8 81.8 85.6 4 2 Farth. First 86.6 81.8 89.4 87.9 83.3 77.3 84.1 fier (LibLinear) performed best, reaching a maximum accuracy of 4 3 Farth. First 89.4 85.6 92.4 89.4 85.8 80.3 83.3 4 4 Farth. First 84.8 86.4 90.9 89.4 85.8 84.8 83.6 nearly 87% for the Twain-Wells document. Moreover, the average average improvement 3.0 7.5 8.9 3.4 -1.6 1.3 improvement is given in the bottom line, showing that most of the (c) Federalist Papers classifiers outperform the best clustering result by over 20% in av- erage. Solely the kNN algorithm achieves minor improvements as p q Algorithm Max N-Bay Bay-Net LibLin LibSVM kNN J48 it attributed the two-author document with a poor accuracy of about 2 2 2 3 K-Means K-Means 83.3 83.3 83.3 83.3 33.3 33.3 100.0 100.0 100.0 100.0 100.0 100.0 33.3 33.3 60% only. 2 4 K-Means 83.3 83.4 33.3 100.0 100.0 100.0 33.3 3 2 K-Means 83.3 75.0 33.3 91.7 91.7 100.0 33.3 A similar general improvement could be achieved on the three- 3 3 K-Means 83.3 100.0 33.3 100.0 91.7 100.0 33.3 author document Twain-Wells-Shelley as can be seen in subtable 3 4 4 2 Farth. First K-Means 75.0 83.3 66.7 91.7 33.3 33.3 100.0 91.7 100.0 75.0 91.7 91.7 33.3 33.3 (b). Again, LibSVM could achieve an accuracy of about 75%, 4 4 3 4 K-Means K-Means 83.3 83.3 75.0 75.0 33.3 33.3 100.0 100.0 75.0 83.4 91.7 83.4 33.3 33.3 whereas the best clustering configuration could only reach 49%. average improvement -0.9 -49.1 15.8 8.4 13.0 -49.1 Except for the kNN algorithm, all classifiers significantly outper- (d) PAN12-A/B form the best clustering results for every configuration. 
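To make the comparison procedure concrete, the following sketch feeds one and the same feature matrix to a clustering algorithm and to several classifiers evaluated with 10-fold cross-validation. It is an illustration only: scikit-learn models stand in for the WEKA implementations used in the paper, the Bayes network with the K2 classifier has no direct counterpart here and is omitted, the decision tree is CART rather than C4.5, and the feature matrix and author labels are random placeholders instead of real pq-gram profiles. Clustering "accuracy" is computed by mapping each cluster to its majority author, which is one common convention; the paper does not state which mapping it uses.

# Illustrative sketch only; see the assumptions stated above.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 50))        # placeholder: 100 paragraphs, 50 pq-gram features
y = rng.integers(0, 2, 100)      # placeholder: author label per paragraph

def clustering_accuracy(labels_true, labels_pred):
    """Map each cluster to its majority author and report the resulting accuracy."""
    correct = 0
    for cluster in set(labels_pred):
        members = labels_true[labels_pred == cluster]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels_true)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("clustering:", clustering_accuracy(y, clusters))

classifiers = {
    "Naive Bayes": GaussianNB(),
    "linear classifier (LibLinear-like)": LinearSVC(),
    "SVM (nu-SVC, LibSVM-like)": NuSVC(),
    "kNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "decision tree (CART, not C4.5)": DecisionTreeClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(name, round(scores.mean(), 3))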
Quite different comparison results have been obtained for the Table 3: Best Evaluation Results for each Clustering Algorithm Federalist Papers and PAN12 data sets, respectively. Here, the im- and Test Data Set in Percent. provements gained from the classifiers are only minor, and in some cases are even negative, i.e., the classification algorithms perform worse than the clustering algorithms. A general explanation is the to one document. The main idea is often to compute topically re- good performance of the clustering algorithms on these data sets, lated document clusters and to assist web search engines to be able especially by utilizing the Farthest First and K-Means algorithms. to provide better results to the user, whereby the algorithms pro- In case of the Federalist Papers data set shown in subtable (c), posed frequently are also patented (e.g. [2]). Regularly applied all algorithms except kNN could achieve at least some improve- concepts in the feature extraction phase are the term frequency tf , ment. Although the LibLinear classifier could reach an outstanding which measures how often a word in a document occurs, and the accuracy of 97%, the global improvement is below 10% for all clas- term frequency-inverse document frequency tf − idf , which mea- sifiers. Finally, subtable (d) shows the results for PAN12, where the sures the significance of a word compared to the whole document outcome is quite diverse as some classifiers could improve the clus- collection. An example of a classical approach using these tech- terers significantly, whereas others worsen the accuracy even more niques is published in [21]. drastically. A possible explanation might be the small data set (only The literature on cluster analysis within a single document to the subproblems A and B have been used), which may not be suited discriminate the authorships in a multi-author document like it is very well for a reliable evaluation of the clustering approaches. done in this paper is surprisingly sparse. On the other hand, many approaches exist to separate a document into paragraphs of differ- Summarizing, the comparison of the different algorithms reveal ent topics, which are generally called text segmentation problems. that in general classification algorithms perform better than cluster- In this domain, the algorithms often perform vocabulary analysis ing algorithms when provided with the same (pq-gram) feature set. in various forms like word stem repetitions [27] or word frequency Nevertheless, the results of the PAN12 experiment are very diverse models [29], whereby ”methods for finding the topic boundaries and indicate that there might be a problem with the data set itself, include sliding window, lexical chains, dynamic programming, ag- and that this comparison should be treated carefully. glomerative clustering and divisive clustering” [7]. Despite the given possibility to modify these techniques to also cluster by au- 5. RELATED WORK thors instead of topics, this is rarely done. In the following some of Most of the traditional document clustering approaches are based the existing methods are shortly summarized. on occurrences of words, i.e., inverted indices are built and used to Probably one of the first approaches that uses stylometry to au- group documents. Thereby a unit to be clustered conforms exactly tomatically detect boundaries of authors of collaboratively written 20 text is proposed in [13]. 
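For reference, the term weights mentioned above are usually defined as follows (a standard textbook formulation, not quoted from [21]): the term frequency tf(t, d) counts how often term t occurs in document d, and

    tf-idf(t, d) = tf(t, d) · log( N / df(t) )

where N is the number of documents in the collection and df(t) is the number of documents containing t.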
Thereby the main intention was not to ex- K-­‐Means   pose authors or to gain insight into the work distribution, but to pro- X-­‐Means   vide a methodology for collaborative authors to equalize their style Farthest  First   Cascaded  K-­‐Means   in order to achieve better readability. To extract the style of sepa- Hierarchical  Clusterer   rated paragraphs, common stylometric features such as word/sentence lengths, POS tag distributions or frequencies of POS classes at Naive  Bayes   BayesNet   sentence-initial and sentence-final positions are considered. An ex- LibLinear   tensive experiment revealed that styolmetric features can be used to LibSVM   find authorship boundaries, but that there has to be done additional kNN   J48   research in order to increase the accuracy and informativeness. 0   10   20   30   40   50   60   70   80   90   100   In [14] the authors also tried to divide a collaborative text into Accuracy  [%]   different single-author paragraphs. In contrast to the previously described handmade corpus, a large data set has been computation- ally created by using (well-written) articles of an internet forum. At Figure 3: Best Evaluation Results Over All Data Sets For All first, different neural networks have been utilized using several sty- Utilized Clustering and Classification Algorithms. lometric features. By using 90% of the data for training, the best network could achieve an F-score of 53% for multi-author docu- ments on the remaining 10% of test data. In a second experiment, Twain-­‐Wells   only letter-bigram frequencies are used as distinguishing features. Thereby an authorship boundary between paragraphs was marked Twain-­‐Wells-­‐Shelley   if the cosine distance exceeded a certain threshold. This method reached an F-score of only 42%, and it is suspected that letter- Best  Clusterer   FED   bigrams are not suitable for the (short) paragraphs used in the eval- Best  Classifier   uation. PAN12-­‐A/B   A two-stage process to cluster Hebrew Bible texts by authorship is proposed in [20]. Because a first attempt to represent chapters 0   20   40   60   80   100   only by bag-of-words led to negative results, the authors addition- Accuracy  [%]   ally incorporated sets of synonyms (which could be generated by comparing the original Hebrew texts with an English translation). With a modified cosine-measure comparing these sets for given Figure 4: Best Clustering and Classification Results For Each chapters, two core clusters are compiled by using the ncut algo- Data Set. rithm [10]. In the second step, the resulting clusters are used as training data for a support vector machine, which finally assigns every chapter to one of the two core clusters by using the simple linear classification algorithm LibLinear could reach nearly 88%, bag-of-words features tested earlier. Thereby it can be the case, outperforming K-Means by 25% over all data sets. that units originally assigned to one cluster are moved to the other Finally, the best classification and clustering results for each data one, depending on the prediction of the support vector machine. set are shown in Figure 4. Consequently the classifiers achieve With this two-stage approach the authors report a good accuracy of higher accuracies, whereby the PAN12 subsets could be classified about 80%, whereby it should be considered that the size of poten- 100% correctly. As can be seen, a major improvement can be tial authors has been fixed to two in the experiment. Nevertheless, gained for the novel literature documents. 
For example, the best the authors state that their approach could be extended for more classifier reached 87% on the Twain-Wells document, whereas the authors with less effort. best clustering approach achieved only 59%. As shown in this paper, paragraphs of documents can be split 6. CONCLUSION AND FUTURE WORK and clustered based on grammar features, but the accuracy is below In this paper, the automatic creation of paragraph clusters based that of classification algorithms. Although the two algorithm types on the grammar of authors has been evaluated. Different state-of- should not be compared directly as they are designed to manage the-art clustering algorithms have been utilized with different input different problems, the significant differences in accuracies indi- features and tested on different data sets. The best working algo- cate that classifiers can handle the grammar features better. Never- rithm K-Means could achieve an accuracy of about 63% over all theless future work should focus on evaluating the same features on test sets, whereby good individual results of up to 89% could be larger data sets, as clustering algorithms may produce better results reached for some configurations. On the contrary, the specifically with increasing amount of sample data. created documents incorporating two and three authors could only Another possible application could be the creation of whole doc- be clustered with a maximum accuracy of 59%. ument clusters, where documents with similar grammar are grouped A comparison between clustering and classification algorithms together. Despite the fact that such huge clusters are very difficult to using the same input features has been implemented. Disregarding evaluate - due to the lack of ground truth data - a navigation through the missing training data, it could be observed that classifiers gen- thousands of documents based on grammar may be interesting like erally produce higher accuracies with improvements of up to 29%. it has been done for music genres (e.g. [30]) or images (e.g. [11]). On the other hand, some classifiers perform worse on average than Moreover, grammar clusters may also be utilized for modern rec- clustering algorithms over individual data sets when using some pq- ommendation algorithms once they have been calculated for large gram configurations. Nevertheless, if the maximum accuracy for data sets. For example, by analyzing all freely available books from each algorithm is considered, all classifiers perform significantly libraries like Project Gutenberg, a system could recommend other better as can be seen in Figure 3. Here the best performances of all books with a similar style based on the users reading history. Also, utilized classification and clustering algorithms are illustrated. The an enhancement of current commercial recommender systems that 21 are used in large online stores like Amazon is conceivable. [18] P. Juola. An Overview of the Traditional Authorship Attribution Subtask. In CLEF (Online Working 7. REFERENCES Notes/Labs/Workshop), 2012. [1] D. Aha and D. Kibler. Instance-Based Learning Algorithms. [19] D. Klein and C. D. Manning. Accurate Unlexicalized Machine Learning, 6:37–66, 1991. Parsing. In Proceedings of the 41st Annual Meeting on [2] C. Apte, S. M. Weiss, and B. F. White. Lightweight Association for Computational Linguistics - Volume 1, ACL Document Clustering, Nov. 25 2003. US Patent 6,654,739. ’03, pages 423–430, Stroudsburg, PA, USA, 2003. [3] D. Arthur and S. Vassilvitskii. 
K-means++: The advantages [20] M. Koppel, N. Akiva, I. Dershowitz, and N. Dershowitz. of careful seeding. In Proceedings of the Eighteenth Annual Unsupervised Decomposition of a Document into Authorial ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, Components. In Proc. of the 49th Annual Meeting of the pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Association for Computational Linguistics: Human Industrial and Applied Mathematics. Language Technologies - Volume 1, HLT ’11, pages [4] N. Augsten, M. Böhlen, and J. Gamper. The pq-Gram 1356–1364, Stroudsburg, PA, USA, 2011. Distance between Ordered Labeled Trees. ACM Transactions [21] B. Larsen and C. Aone. Fast and Effective Text Mining Using on Database Systems (TODS), 2010. Linear-Time Document Clustering. In Proceedings of the 5th [5] T. Caliński and J. Harabasz. A Dendrite Method for Cluster ACM SIGKDD international conference on Knowledge Analysis. Communications in Statistics - Theory and discovery and data mining, pages 16–22. ACM, 1999. Methods, 3(1):1–27, 1974. [22] Y. Li, S. M. Chung, and J. D. Holt. Text Document [6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Clustering Based on Frequent Word Meaning Sequences. Vector Machines. ACM Transactions on Intelligent Systems Data & Knowledge Engineering, 64(1):381–404, 2008. and Technology (TIST), 2(3):27, 2011. [23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. [7] F. Y. Choi. Advances in Domain Independent Linear Text Building a large annotated corpus of English: The Penn Segmentation. In Proceedings of the 1st North American Treebank. Computational Linguistics, 19:313–330, June chapter of the Association for Computational Linguistics 1993. conference, pages 26–33. Association for Computational [24] F. Mosteller and D. Wallace. Inference and Disputed Linguistics, 2000. Authorship: The Federalist. Addison-Wesley, 1964. [8] G. F. Cooper and E. Herskovits. A Bayesian Method for the [25] F. Murtagh. A Survey of Recent Advances in Hierarchical Induction of Probabilistic Networks From Data. Machine Clustering Algorithms. The Computer Journal, learning, 9(4):309–347, 1992. 26(4):354–359, 1983. [9] S. Dasgupta. Performance Guarantees for Hierarchical [26] D. Pelleg, A. W. Moore, et al. X-means: Extending K-means Clustering. In Computational Learning Theory, pages with Efficient Estimation of the Number of Clusters. In 351–363. Springer, 2002. ICML, pages 727–734, 2000. [10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: [27] J. M. Ponte and W. B. Croft. Text Segmentation by Topic. In Spectral Clustering and Normalized Cuts. In Proceedings of Research and Advanced Technology for Digital Libraries, the tenth ACM SIGKDD international conference on pages 113–125. Springer, 1997. Knowledge discovery and data mining, pages 551–556. [28] J. R. Quinlan. C4.5: Programs for Machine Learning, ACM, 2004. volume 1. Morgan Kaufmann, 1993. [11] A. Faktor and M. Irani. “Clustering by Composition” - [29] J. C. Reynar. Statistical Models for Topic Segmentation. In Unsupervised Discovery of Image Categories. In Computer Proc. of the 37th annual meeting of the Association for Vision–ECCV 2012, pages 474–487. Springer, 2012. Computational Linguistics on Computational Linguistics, [12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. pages 357–364, 1999. Lin. LIBLINEAR: A Library for Large Linear Classification. [30] N. Scaringella, G. Zoia, and D. Mlynek. 
Automatic Genre The Journal of Machine Learning Research, 9:1871–1874, Classification of Music Content: a Survey. Signal Processing 2008. Magazine, IEEE, 23(2):133–141, 2006. [13] A. Glover and G. Hirst. Detecting Stylistic Inconsistencies in [31] M. Tschuggnall and G. Specht. Using Grammar-Profiles to Collaborative Writing. In The New Writing Environment, Intrinsically Expose Plagiarism in Text Documents. In Proc. pages 147–168. Springer, 1996. of the 18th Conf. of Natural Language Processing and [14] N. Graham, G. Hirst, and B. Marthi. Segmenting Documents Information Systems (NLDB), pages 297–302, 2013. by Stylistic Character. Natural Language Engineering, [32] M. Tschuggnall and G. Specht. Enhancing Authorship 11(04):397–415, 2005. Attribution By Utilizing Syntax Tree Profiles. In Proc. of the [15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, 14th Conf. of the European Chapter of the Assoc. for and I. H. Witten. The WEKA Data Mining Software: an Computational Ling. (EACL), pages 195–199, 2014. Update. ACM SIGKDD explorations newsletter, [33] O. Zamir and O. Etzioni. Web Document Clustering: A 11(1):10–18, 2009. Feasibility Demonstration. In Proc. of the 21st annual [16] A. Hotho, S. Staab, and G. Stumme. Ontologies Improve international ACM conference on Research and development Text Document Clustering. In Data Mining, 2003. ICDM in information retrieval (SIGIR), pages 46–54. ACM, 1998. 2003. Third IEEE International Conference on, pages [34] D. Zou, W.-J. Long, and Z. Ling. A Cluster-Based 541–544. IEEE, 2003. Plagiarism Detection Method. In Notebook Papers of CLEF [17] G. H. John and P. Langley. Estimating Continuous 2010 LABs and Workshops, 22-23 September, 2010. Distributions in Bayesian Classifiers. In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995. 22 Proaktive modellbasierte Performance-Analyse und -Vorhersage von Datenbankanwendungen Christoph Koch Friedrich-Schiller-Universität Jena Lehrstuhl für Datenbanken und DATEV eG Informationssysteme Abteilung Datenbanken Ernst-Abbe-Platz 2 Paumgartnerstr. 6 - 14 07743 Jena 90429 Nürnberg Christoph.Koch@uni-jena.de Christoph.Koch@datev.de KURZFASSUNG 1. EINLEITUNG Moderne (Datenbank-)Anwendungen sehen sich in der heutigen Zur Erfüllung komplexerer Anforderungen und maximalen Zeit mit immer höheren Anforderungen hinsichtlich Flexibilität, Benutzerkomforts ist gute Performance eine Grundvoraussetzung Funktionalität oder Verfügbarkeit konfrontiert. Nicht zuletzt für für moderne Datenbankanwendungen. Neben Anwendungs- deren Backend – ein meist relationales Datenbankmanagement- Design und Infrastrukturkomponenten wie Netzwerk oder system – entsteht dadurch eine kontinuierlich steigende Kom- Anwendungs- beziehungsweise Web-Server wird sie maßgeblich plexität und Workload, die es frühestmöglich proaktiv zu er- durch die Performance ihres Datenbank-Backends – wir beschrän- kennen, einzuschätzen und effizient zu bewältigen gilt. Die dazu ken uns hier ausschließlich auf relationale Datenbankmanage- nötigen Anwendungs- und Datenbankspezialisten sind jedoch mentsysteme (DBMS) – bestimmt [1]. Dabei ist die Datenbank- aufgrund immer engerer Projektpläne, kürzerer Release-Zyklen Performance einer Anwendung selbst ebenfalls durch zahlreiche und weiter wachsender Systemlandschaften stark ausgelastet, Faktoren beeinflusst. 
Während Hardware- und systemseitige sodass für regelmäßige proaktive Expertenanalysen hinsichtlich Eigenschaften oftmals durch bestehende Infrastrukturen vor- der Datenbank-Performance kaum Kapazität vorhanden ist. gegeben sind, können speziell das Datenbank-Design sowie die Zur Auflösung dieses Dilemmas stellt dieser Beitrag ein anwendungsseitig implementierten Zugriffe mittels SQL weit- Verfahren vor, mit dessen Hilfe frühzeitig auf Grundlage der gehend frei gestaltet werden. Hinzu kommt als Einflussfaktor Datenmodellierung und synthetischer Datenbankstatistiken Per- noch die Beschaffenheit der zu speichernden/gespeicherten Daten, formance-Analysen und -Vorhersagen für Anwendungen mit die sich in Menge und Verteilung ebenfalls stark auf die relationalem Datenbank-Backend durchgeführt und deren Performance auswirkt. Ergebnisse auf leicht zugängliche Weise visualisiert werden können. Das Datenbank-Design entwickelt sich über unterschiedlich abstrakte, aufeinander aufbauende Modellstrukturen vom konzep- tionellen hin zum physischen Datenmodell. Bereits bei der Kategorien und Themenbeschreibung Entwicklung dieser Modelle können „Designfehler“ wie beispiels- Data Models and Database Design, Database Performance weise fehlende oder „übertriebene“ Normalisierungen gravierende Auswirkungen auf die späteren Antwortzeiten des Datenbank- Allgemeine Bestimmungen systems haben. Der Grad an Normalisierung selbst ist jedoch nur Performance, Design als vager Anhaltspunkt für die Performance von Datenbank- systemen anzusehen, der sich ab einem gewissen Maß auch negativ auswirken kann. Eine einfache Metrik zur Beurteilung der Schlüsselwörter Qualität des Datenbank-Designs bezüglich der zu erwartenden Performance, Proaktivität, Statistiken, relationale Datenbanken, Performance (in Abhängigkeit anderer Einflussfaktoren, wie etwa Modellierung, UML, Anwendungsentwicklung der Workload) existiert nach vorhandenem Kenntnisstand nicht. Etwas abweichend dazu verhält es sich mit dem Einfluss der Workload – repräsentiert als Menge von SQL-Statements und der Häufigkeit ihrer Ausführung, die von der Anwendung an das Datenbanksystem zum Zugriff auf dort gespeicherte Daten abgesetzt wird. Moderne DBMS besitzen einen kostenbasierten Copyright © by the paper’s authors. Copying permitted only Optimierer zur Optimierung eingehender Statements. Dieser for private and academic purposes. berechnet mögliche Ausführungspläne und wählt unter Zu- In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26 GI- hilfenahme von gesammelten Objekt-Statistiken den günstigsten Workshop on Foundations of Databases (Grundlagen von Ausführungsplan zur Abarbeitung eines SQL-Statements aus. Datenbanken), Mittels DBMS-internen Mechanismen – im Folgenden als 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. 23 EXPLAIN-Mechanismen bezeichnet – besteht die Möglichkeit, netzwerks. Ein Überblick dazu findet sich in [3]. Demnach zeigt noch vor der eigentlichen Ausführung von Statements den vom sich für all diese Konzepte ein eher wissenschaftlicher Fokus und Optimierer bestimmten optimalen Ausführungsplan ermitteln und eine damit einhergehende weitgehend unerprobte Übertragbarkeit ausgeben zu lassen. Zusätzlich umfasst das EXPLAIN-Ergebnis auf die Praxis. 
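Zur Veranschaulichung des oben angesprochenen EXPLAIN-Mechanismus zeigt die folgende Skizze, wie sich ein Ausführungsplan abfragen lässt, ohne das Statement tatsächlich auszuführen. Sie ist eine rein illustrative Annahme: SQLite mit EXPLAIN QUERY PLAN dient nur als lauffähiger Stellvertreter; Systeme wie DB2 oder Oracle liefern über ihre jeweiligen EXPLAIN-Schnittstellen zusätzlich die im Text genannten Kostenschätzungen. Tabellen- und Spaltennamen sind frei erfunden.

# Illustrative Skizze: SQLite als Stellvertreter für herstellerspezifische
# EXPLAIN-Mechanismen; Schema und Daten frei erfunden.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kunde (id INTEGER PRIMARY KEY, plz TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_kunde_plz ON kunde(plz)")

# Ausführungsplan ermitteln, bevor das Statement ausgeführt wird
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM kunde WHERE plz = ?", ("07743",)
)
for row in plan:
    print(row)   # liefert u.a. eine textuelle Beschreibung des Zugriffs (z.B. Indexnutzung)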
So fehlen Studien zur Integration in praxisnahe eine Abschätzung der zur Abarbeitung des Ausführungsplans (Entwicklungs-)Prozesse, zur Benutzerfreundlichkeit sowie zum erwarteten Zugriffskosten bezüglich der CPU-und I/O-Zeit – Kosten-Nutzen-Verhältnis der notwendigen Maßnahmen. Ein fortan als Kosten bezeichnet. Anhand dieser Informationen damit einhergehendes Defizit ist zusätzlich der mangelnde Tool- können bereits frühzeitig in Hinblick auf die Datenbank- support. Das in Kapitel 4 vorgestellte Konzept verfolgt diesbe- Performance (häufige) teure Zugriffe erkannt und gegebenenfalls züglich einen davon abweichenden Ansatz. Es baut direkt auf optimiert werden. Voraussetzung für dieses Vorgehen ist aller- etablierten modellbasierten Praxisabläufen bei der Entwicklung dings, dass dem DBMS zur Berechnung der Ausführungspläne von Datenbankanwendungen auf (vgl. Kapitel 3). Insbesondere repräsentative Datenbank-Statistiken vorliegen, was insbeson- durch die Verwendung von standardisierten UML-Erweiterungs- dere für neue Datenbankanwendungen nicht der Fall ist. mechanismen integriert es sich auch Tool-seitig nahtlos in bestehende UML-unterstützende Infrastrukturen. Auf der anderen Seite sehen sich sowohl Anwendungsentwickler- beziehungsweise -designerteams als auch Datenbankspezialisten Die Methodik der synthetischen Statistiken – also dem künstli- mit immer komplexeren Anforderungen und Aufgaben konfron- chen Erstellen sowie Manipulieren von Datenbank-Statistiken – tiert. Kapazitäten für umfangreiche Performance-Analysen oder ist neben dem in Kapitel 4 vorgestellten Ansatz wesentlicher auch nur die Aneignung des dafür nötigen Wissens sind oft nicht Bestandteil von [4]. Sie wird zum einen verwendet, um Statistiken gegeben. Nicht zuletzt deshalb geraten proaktive Performance- aus Produktionsumgebungen in eine Testumgebung zu trans- Analysen verglichen mit beispielsweise funktionalen Tests ver- ferieren. Zum anderen sieht der Ansatz aber auch die gezielte mehrt aus dem Fokus. manuelle Veränderung der Statistiken vor, um mögliche dadurch entstehende Änderungen in den Ausführungsplänen und den zu Das im vorliegenden Beitrag vorgestellte modellbasierte Konzept deren Abarbeitung benötigten Kosten mithilfe anschließender setzt an diesen beiden Problemen an und stellt Mechanismen vor, EXPLAIN-Analysen feststellen zu können. Dies kann beispiels- um auf einfache Art und Weise eine repräsentative proaktive weise bezogen auf Statistiken zur Datenmenge dafür genutzt Analyse der Datenbank-Performance zu ermöglichen. Nachdem in werden, um Zugriffe auf eine (noch) kleine Tabelle mit wenigen Kapitel 2 eine Abgrenzung zu alternativen/verwandten Ansätzen Datensätzen bereits so zu simulieren, als ob diese eine enorme gegeben wird, rückt Kapitel 3 den Entwicklungsprozess einer Menge an Daten umfasst. Weitere Einbettungen in den Entwick- Datenbank-Anwendung in den Fokus. Kapitel 4 beschäftigt sich lungsprozess von Datenbankanwendungen sieht [4] gegenüber mit dem entwickelten proaktiven Ansatz und stellt wesentliche dem hier vorgestellten Ansatz allerdings nicht vor. Schritte/Komponenten vor. Abschließend fasst Kapitel 5 den Ein weiterer Ansatzpunkt zur Performance-Analyse und -Optimie- Beitrag zusammen. rung existiert im Konzept des autonomen Datenbank-Tunings [5][6],[7] – also dem fortlaufenden Optimieren des physischen 2. VERWANDTE ARBEITEN Designs von bereits bestehenden Datenbanken durch das DBMS Das Ziel des im Beitrag vorgestellten proaktiven Ansatzes zur selbst. 
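Die in [4] beschriebene gezielte Veränderung von Statistiken lässt sich dem Prinzip nach wie folgt skizzieren; Struktur und Zahlen sind frei gewählte Annahmen, das tatsächliche Schreiben in die DBMS-internen Statistiktabellen ist nur angedeutet.

# Rein illustrative Skizze zum Prinzip der Statistik-Manipulation nach [4]:
# Die Statistiken einer kleinen Testtabelle werden so überschrieben, als läge
# bereits der produktive Datenbestand vor; ein anschließendes EXPLAIN liefert
# dann Pläne und Kosten für die simulierte Datenmenge.
test_statistik = {"zeilen": 1_000, "seiten": 20}

def simuliere_datenmenge(statistik, faktor):
    """Skaliert Zeilen- und Seitenzahl, um eine große Tabelle zu simulieren."""
    return {name: wert * faktor for name, wert in statistik.items()}

print(simuliere_datenmenge(test_statistik, faktor=10_000))
# In einem realen DBMS würden diese Werte in die manipulierbaren
# Statistiktabellen des Katalogs geschrieben (vgl. [12]).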
Ein autonomes System erkennt anhand von erlerntem Performance-Analyse und -Vorhersage von Datenbankanwendun- Wissen potentielle Probleme und leitet passende Optimierungs- gen ist die frühzeitige Erkennung von potentiellen Perfor- maßnahmen ein, bevor sich daraus negative Auswirkungen mance-Problemen auf Basis einer möglichst effizienten, leicht ergeben. Dazu zählt beispielsweise die autonome Durchführung verständlichen Methodik. Dies verfolgt auch der Ansatz von [2], einer Reorganisierung von Daten, um fortwährend steigenden dessen Grundprinzip – Informationen über Daten und Datenzu- Zugriffszeiten entgegenzuwirken. Ähnlich können auch die griffe, die aus der Anforderungsanalyse einer Anwendung bekannt mittlerweile je System vielseitig vorhandenen Tuning-Advisor wie sind, zur frühzeitigen Optimierung zu nutzen – sich auch im beispielsweise [8] und [9] angesehen werden, die zwar nicht auto- vorliegenden Beitrag wiederfindet. Dabei gelangt [2] durch eine matisch optimierend ins System eingreifen, dem Administrator eigene, dem Datenbank-Optimierer nachempfundene Logik und aber Empfehlungen zu sinnvoll durchzuführenden Aktionen dem verwendeten Modell des offenen Warteschlangennetzwerks geben. Sowohl das autonome Tuning als auch die Tuning-Advisor frühzeitig zu Kostenabschätzungen bezüglich der Datenbank- sind nicht als Alternative zu dem im vorliegenden Beitrag Performance. Das in Kapitel 4 des vorliegenden Beitrags vorge- vorgestellten Ansatz einzuordnen. Vielmehr können sich diese stellte Konzept nutzt dagegen synthetisch erzeugte Statistiken und Konzepte ergänzen, indem die Anwendungsentwicklung auf Basis datenbankinterne EXPLAIN-Mechanismen, um eine kostenmäßi- des in Kapitel 4 vorgestellten Konzepts erfolgt und für die spätere ge Performance-Abschätzung zu erhalten. Damit berücksichtigt es Anwendungsadministration/ -evolution verschiedene Tuning- stets sowohl aktuelle als auch zukünftige Spezifika einzelner Advisor und die Mechanismen des autonomen Tunings zum Ein- Datenbank-Optimierer und bleibt entgegen [2] von deren interner satz kommen. Berechnungslogik unabhängig. Ein weiterer Unterschied zwischen beiden Ansätzen besteht in der Präsentation der Analyse- 3. ENTWICKLUNGSPROZESS VON ergebnisse. Während sich [2] auf tabellarische Darstellungen beschränkt, nutzt das im Beitrag vorstellte Konzept eine auf der DATENBANKANWENDUNGEN Grundlage der Unified Modeling Language (UML) visualisierte Der Entwicklungsprozess von Anwendungen lässt sich anhand Darstellungsform. des System Development Lifecycle (SDLC) beschreiben und in verschiedene Phasen von der Analyse der Anforderungen bis hin Ähnlich wie [2] basieren auch weitere Ansätze zur Performance- zum Betrieb/zur Wartung der fertigen Software gliedern [1]. Analyse und -Evaluation auf dem Modell des Warteschlangen- 24 Project Manager Analyse Analyse Business Analyst Datenbank Software Datenbank Daten- Reports Designer/ Designer/ Detail Design Design modelle Prozesse Architekt Architekt Implementierung Program- Implementierung mierer und Laden Erstellen Prototyping Laden Test und Tuning Test und Debugging Tester Auswertung Auswertung Datenbank System- Administrator Betrieb Administrator Datenbank- Wartung der Wartung Anwendung Abbildung 1: Phasen und Akteure im Database und Software Development Lifecycle (DBLC und SDLC) Zusätzlich zur reinen Anwendungsentwicklung sind weitere der Entwicklungsprozess von Datenbankanwendungen auf die in Abläufe zur Planung und Bereitstellung einer geeigneten Infra- Abbildung 2 visualisierten Aufgaben. 
Anhand der analysierten struktur nötig. Für Datenbankanwendungen wäre das unter ande- Anforderungen wird im Datenbank-Design ein konzeptionelles rem der Entwicklungsprozess der Datenbank, welcher sich nach Datenmodell entwickelt, das anschließend hin zum physischen [1] ebenfalls durch ein dem SDLC ähnliches Modell – dem Data- Datenmodell verfeinert wird. Da sich der Beitrag auf die in der base Lifecycle (DBLC) – formalisieren lässt. Beide Entwicklungs- Praxis vorherrschenden relationalen DBMS beschränkt, wird auf prozesse verlaufen zeitlich parallel und werden insbesondere in das in der Theorie gebräuchliche Zwischenprodukt des logischen größeren Unternehmen/Projekten durch verschiedene Akteure Datenmodells (relationale Abbildung) verzichtet. realisiert. Auf Grundlage von [1] liefert Abbildung 1 eine Über- sicht dazu. Sie visualisiert parallel ablaufende Entwicklungspha- Nachdem die Design-Phase abgeschlossen ist, beginnt die sen und eine Auswahl an zuständigen Akteuren, deren konkrete Implementierung. Datenbankseitig wird dabei das physische Zusammensetzung/Aufgabenverteilung aber stark abhängig von Datenmodell mittels Data Definition Language (DDL) in ein der Projektgröße und dem Projektteam ist. Wichtig sind hier be- Datenbankschema innerhalb eines installierten und geeignet sonders zwei Erkenntnisse. Zum einen finden ähnliche Entwick- konfigurierten DBMS umgesetzt und möglicherweise vorhandene lungsprozesse bei Anwendung und Datenbank parallel statt – in Testdaten geladen. Anwendungsseitig erfolgt parallel dazu die etwa das Anwendungsdesign und das Datenbankdesign. Zum Entwicklung von SQL-Statements zum Zugriff auf die Datenbank anderen können sehr viele Akteure am gesamten Entwicklungs- sowie die Implementierung der Anwendung selbst. Nach prozess beteiligt sein, sodass Designer, Programmierer, Tester und Fertigstellung einzelner Module finden mithilfe des Entwick- Administratoren in der Regel disjunkte Personenkreise bilden. lungs- und Qualitätssicherungssystems kontinuierliche Tests statt, die sich allerdings anfangs auf die Prüfung funktionaler Analyse Konzeptionelles Korrektheit beschränken. Performance-Untersuchungen, insbe- Datenmodell Physisches sondere bezogen auf die Datenbankzugriffe, erfolgen in der Regel Datenmodell erst gezielt zum Abschluss der Implementierungsphase mittels Design aufwändig vorzubereitender und im Qualitätssicherungssystem durchzuführender Lasttests. Impl. SQL Die Folgen aus diesem Vorgehen für die Erkennung und Behand- Test Entwicklungs- SQL- lung von Performance-Problemen sind mitunter gravierend. Eng- system Statements pässe werden erst spät (im Betrieb) bemerkt und sind aufgrund Qualitäts- Betrieb sicherungs- des fortgeschrittenen Entwicklungsprozesses nur mit hohem Produktions- system Aufwand zu korrigieren. Basieren sie gar auf unvorteilhaften Wartung system Design-Entscheidungen beispielsweise bezogen auf die Daten- modellierung, ist eine nachträgliche Korrektur aufgrund zahlrei- Abbildung 2: Performance-relevante Entwicklungsschritte cher Abhängigkeiten (Anwendungslogik, SQL-Statements, Test- datenbestände, etc.), getrennten Zuständigkeiten und in der Regel Aus dem Blickwinkel der Datenbank-Performance und der darauf engen Projektzeitplänen nahezu ausgeschlossen. Erfahrungen aus einwirkenden bereits genannten Einflussfaktoren reduziert sich dem Arbeitsumfeld des Autors haben dies wiederholt bestätigt. 25 Performance Indikatoren Abbildung und Statistikerzeugung Konzeptionelles 1. 2. Datenmodell Physisches Kosten Datenmodell EXPLAIN EP2 EP1 3. 4. 
SQL Entwicklungs- Testsystem Performance-Modell system SQL- Qualitäts- Statements Produktions- sicherungs-system system Abbildung 3: Ansatz zur proaktiven modellbasierten Performance-Analyse und -Vorhersage bei Anwendungsweiterentwicklungen weitgehend vorliegen, exis- 4. PROAKTIVE MODELLBASIERTE tieren für neu zu entwickelnde Anwendungen im Normalfall keine PERFORMANCE-ANALYSE repräsentativen Datenbestände. Somit fehlen auch geeignete Alternativ zur Performance-Analyse mittels Lasttests (vgl. Kapitel Datenbankstatistiken zur Grundlage für die EXPLAIN-Auswer- 3) bieten sich zur Kontrolle der SQL-Performance die eingangs tungen. Die Folge sind Ausführungspläne und Kostenabschätzun- erwähnten EXPLAIN-Mechanismen an. Mit deren Hilfe lassen gen, die mit denen eines späteren produktiven Einsatzes der State- sich bei vorliegendem physischen Datenbank-Design (inklusive ments oftmals nur wenig gemeinsam haben und für eine proaktive Indexe, etc.) bereits in frühen Abschnitten der Implementierungs- Performance-Analyse somit (nahezu) unverwertbar sind. phase Auswertungen zu Ausführungsplänen und geschätzten Der im folgenden Kapitel vorgestellte proaktive modellbasierte Kosten für entwickelte SQL-Statements durchführen. Auf diese Ansatz zur Performance-Analyse und -Vorhersage greift beide Weise gewonnene Erkenntnisse können vom Designer/Program- Probleme auf: die fehlende repräsentative Datenbasis für Daten- mierer direkt genutzt werden, um Optimierungen in Hinblick auf bankstatistiken und die mangelnde Expertise zur Ausführungs- die quasi grade entworfenen/implementierten SQL-Statements planbewertung durch Designer/Programmierer. Dabei sieht dieser durchzuführen. Durch die gegebene zeitliche Nähe zum Anwen- Ansatz zur Bereitstellung geeigneter Datenbankstatistiken ein dungs- und Datenbank-Design sind auch Performance-Optimie- synthetisches Erzeugen anhand von Performance-Indikatoren vor. rungen auf Basis von Datenmodellanpassungen (Normalisie- Das Problem der mangelnden Expertise wird durch eine einfache rung/Denormalisierung) ohne größeren Aufwand möglich. modellbasierte Darstellung von gewonnenen EXPLAIN-Ergeb- Das beschriebene Vorgehen hat zwar den Vorteil, dass mögliche nissen adressiert. Wie diese gestaltet ist und mit den Performance- Performance-Probleme schon von den Akteuren (Designer/Pro- Indikatoren zusammenwirkt verdeutlichen die weiteren Ausfüh- grammierer) erkannt werden können, die diese durch Design- rungen des Kapitels anhand Abbildung 3. Änderungen am effektivsten zu lösen wissen. Demgegenüber erfordern die EXPLAIN-Analysen und das Verständnis der Aus- 4.1 Performance-Indikatoren im Datenmodell führungspläne einen Grad an Expertise, den Designer/Program- Als Performance-Indikatoren bezeichnet die vorliegende Arbeit mierer in der Regel nicht besitzen. Ein Datenbank Administrator ausgewählte Metadaten zu Entitäten und deren Attributen (DBA), der über diese verfügt, ist wiederum von den fachlichen (beziehungsweise zu Tabellen und deren Spalten), die Aufschluss Anforderungen zu distanziert, sodass er zwar mögliche Perfor- über die erwarteten realen Datenbestände geben und in Zusam- mance-Ausreißer erkennen, nicht aber fachlich bewerten kann. menhang mit dem Datenbank-Design und der Infrastruktur erste Führt eine Anwendung beispielsweise einmal monatlich eine sehr Rückschlüsse auf die zukünftige Datenbank-Performance erlau- komplexe Auswertung mithilfe eines entsprechend Laufzeit- ben. 
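Wie sich solche Indikatoren erfassen und zu synthetischen Statistikwerten verdichten lassen, deutet die folgende Skizze an. Sie ist eine rein illustrative Annahme: Eine einfache Python-Datenstruktur steht stellvertretend für die im Beitrag vorgesehene Erfassung per UML-Profil, und die Seitenabschätzung ist eine generische Näherung, nicht die im Text erwähnten herstellerspezifischen Abschätzungsvorschriften.

# Skizze: Performance-Indikatoren als Datenstruktur und daraus abgeleitete
# synthetische Statistikwerte; Namen und Formeln sind illustrative Annahmen.
import math
from dataclasses import dataclass, field

@dataclass
class SpaltenIndikator:
    name: str
    kardinalitaet: int      # erwartete Anzahl unterschiedlicher Werte
    breite_bytes: int

@dataclass
class EntitaetsIndikator:
    name: str
    erwartete_zeilen: int   # entspricht z.B. dem Merkmal "cardinality" des Stereotyps
    spalten: list = field(default_factory=list)

def synthetische_statistiken(entitaet, seitengroesse=8192):
    zeilenbreite = sum(s.breite_bytes for s in entitaet.spalten)
    return {
        "tabelle": entitaet.name,
        "zeilen": entitaet.erwartete_zeilen,
        "seiten": math.ceil(entitaet.erwartete_zeilen * zeilenbreite / seitengroesse),
        "spalten_kardinalitaeten": {s.name: s.kardinalitaet for s in entitaet.spalten},
    }

kunde = EntitaetsIndikator("KUNDE", erwartete_zeilen=5_000_000, spalten=[
    SpaltenIndikator("ID", 5_000_000, 8),
    SpaltenIndikator("PLZ", 8_000, 6),
])
print(synthetische_statistiken(kunde))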
Dazu zählen Informationen zu den erwarteten Datenmengen intensiven SQL-Statements durch, dann würde dem DBA diese wie in etwa die erwartete Anzahl an Zeilen pro Tabelle und Abfrage bei EXPLAIN-Analysen als kritisch erscheinen. Denn er Kennzahlen zur Datenverteilung – beispielsweise in Form von weiß weder, dass damit ein fachlich aufwändiger Prozess Wertebereichsangaben, Einzelwertwahrscheinlichkeiten oder der durchgeführt wird, noch dass es sich dabei um eine einmalig pro Kardinalität pro Spalte. Viele dieser Informationen sind Teil des Monat auszuführende Abfrage handelt. Um sich als DBA in einer Ergebnisses der Anforderungsanalyse und somit frühzeitig im Infrastruktur von nicht selten mehr als 100 unterschiedlichen SDLC bekannt und vom Business Analyst erfasst worden. Dabei Anwendungen über die fachlichen Anforderungen und speziellen reicht die Dokumentation von rein textuellen Beschreibungen bis Prozesse jeder einzelnen im Detail zu informieren beziehungs- hin zu tief strukturierten Darstellungen. Eine einheitlich stan- weise um sich als Designer/Programmierer das nötige Knowhow dardisierte Form zur Erfassung von Performance-Indikatoren im zur Ausführungsplanbewertung aufzubauen, ist personelle DBLC existiert jedoch bislang nicht, wodurch die Metadaten Kapazität vonnöten, die in der Regel nicht verfügbar ist. kaum bis gar nicht in den weiteren Entwicklungsprozess ein- fließen. Ein anderes Problem, dass sich in Zusammenhang mit frühzei- tigen EXPLAIN-Analysen zeigt, begründet sich in dem dritten In der Praxis basiert die Datenmodellierung ähnlich wie weite zuvor genannten Performance-Faktor: den Daten. Während diese Teile der Anwendungsmodellierung auf der Sprache UML. Dabei 26 wurde diese ursprünglich nicht zur Abbildung von Daten- der Designer/Programmierer beim Modellieren oder dem Ent- strukturen im Sinn einer Entity-Relationship-Modellierung kon- wickeln von SQL-Statements auf relationale Weise. Die im vorlie- zipiert, sodass die Verbindung beider Welten – und damit die genden Ansatz als Performance-Modell bezeichnete vereinfachte Modellierung von Anwendung und Datenstrukturen mithilfe einer Präsentation von Ausführungsplänen versucht, diese Diskrepanz gemeinsamen Sprache in einem gemeinsamen Tool – erst durch aufzulösen. Ansätze wie [10] oder auch den Entwurf zum IMM Standard der OMG [11] geschaffen wurde. Die Voraussetzung dafür bildet Das Performance-Modell basiert auf dem physischen Datenmo- jeweils die UML-Profil-Spezifikation, die es ermöglicht, beste- dell und damit auf einer dem Designer/Programmierer bekannten hende UML-Objekte über Neu-Stereotypisierungen zu erweitern. Darstellungsform. Zusätzlich umfasst es die für diesen Personen- kreis wesentlichen Informationen aus den EXPLAIN-Ergebnissen. Um die zuvor genannten Performance-Indikatoren für den weite- Dazu zählen die vom DBMS abgeschätzten Kosten für die Aus- ren Entwicklungsprozess nutzbar zu machen und sie innerhalb führung des gesamten Statements sowie wichtiger Operatoren wie bestehender Infrastrukturen/Tool-Landschaften standardisiert zu Tabellen- beziehungsweise Indexzugriffe oder Tabellenverknü- erfassen, kann ebenfalls der UML-Profil-Mechanismus genutzt pfungen mittels Join – jeweils skaliert um die erwartete Aus- werden. So ließe sich beispielsweise mithilfe eines geeigneten führungshäufigkeit des Statements. Weitere Detailinformationen Profils wie in Abbildung 3 in 1. 
schematisch angedeutet aus einem innerhalb der Ausführungspläne wie beispielsweise die konkrete UML-Objekt „entity“ ein neues Objekt „entity_extended“ ablei- Abarbeitungsreihenfolge einzelner Operatoren oder Angaben zu ten, das in einem zusätzlichen Merkmal „cardinality“ Infor- abgeschätzten Prädikat-Selektivitäten werden vom Modell zum mationen über die produktiv erwartete Datenmenge zu einer Zweck der Einfachheit und Verständlichkeit bewusst vernach- Entität/Tabelle aufnehmen kann. lässigt. Für die gleichzeitige Analyse mehrerer Statements erfolgt eine Aggregation der jeweils abgeschätzten Kosten auf Objekt- 4.2 Synthetische Datenbankstatistiken ebene. Eines der eingangs aufgezeigten Hindernisse für proaktive Perfor- Zentrale Komponente im Performance-Modell ist eine ebenfalls mance-Analysen beziehungsweise -Vorhersagen bestand in der dem physischen Datenmodell angelehnte Diagrammdarstellung. fehlenden repräsentativen Datenbasis für Datenbank-Statisti- Mithilfe farblicher Hervorhebung und geeigneter Bewertungs- ken. Diese Statistiken werden im Normalfall vom DBMS anhand metriken sollen sämtliche Objekte gemäß den vom DBMS der gespeicherten Daten selbst gesammelt. Dem entgegen verfolgt geschätzten Zugriffskosten zur Abarbeitung der Workload das hier vorgestellte Konzept den Ansatz, dem DBMS Statistiken klassifiziert und visualisiert werden. Auf diese Weise kann ein vorzugeben, ohne dazu datenbankseitig repräsentative Datenbe- Designer/Programmierer frühzeitig Auskunft über aus Perfor- stände vorhalten zu müssen. Dafür bieten zwar die wenigsten mance-Perspektive zu optimierende Bereiche im Datenbank- DBMS vordefinierte Schnittstellen an, allerdings sind sämtliche schema beziehungsweise kritische, alternativ zu konzipierende Statistik-Informationen in der Regel innerhalb DBMS-interner SQL-Statements erhalten. Abbildung 3 veranschaulicht exempla- manipulierbarer Tabellen gespeichert, wie dies beispielswiese risch ein visualisiertes Performance-Modell für zwei Statements/ auch bei DB2 oder Oracle der Fall ist [12]. Ausführungspläne (EP). Während der untere Bereich weitgehend grün/unkritisch markiert ist, befinden sich im oberen Diagramm- Datenbankstatistiken enthalten Informationen über Datenmengen teil mögliche Performance-kritische rot gekennzeichnete Zugriffe, und Datenverteilungen sowie Kennzahlen zur physischen Spei- die es gezielt zu untersuchen und an geeigneter Stelle (SQL-State- cherung wie beispielsweise die Anzahl der verwendeten Daten- ment, Datenbank-Design) zu optimieren gilt (vgl. gestrichelte bankseiten pro Tabelle. Während erstere inhaltlich den zuvor Pfeile in Abbildung 3). beschriebenen Performance-Indikatoren entsprechen, sind die Statistikdaten zur physischen Speicherung interne DBMS-abhän- Die technische Realisierung des Performance-Modells sowie der gige Größen. Mithilfe geeigneter, von den DBMS-Herstellern zur dazugehörigen Diagrammdarstellung erfolgt analog zur Erfassung Unterstützung beim Datenbank-Design bereitgestellter Abschät- der Performance-Indikatoren über den UML-Profil-Mechanismus, zungsvorschriften lassen sich aber auch diese Kennzahlen auf wodurch auch in diesem Punkt die Kompatibilität des vorge- Grundlage der Performance-Indikatoren approximieren. Somit ist stellten Ansatzes zu bestehenden Tool-Infrastrukturen gewähr- es wie in Abbildung 3 in 2. gezeigt möglich, anhand geeignet leistet ist. formalisierter Performance-Indikatoren frühzeitig im SDLC/ DBLC repräsentative Datenbankstatistiken künstlich zu erzeugen. 
4.4 Ablauf einer Analyse/Vorhersage Für den Designer/Programmierer sieht der in Abbildung 3 4.3 EXPLAIN und Performance-Modell vorgestellte proaktive Ansatz folgende Vorgehensweise vor. Auf Grundlage von synthetischen Datenbankstatistiken können Nachdem nach 1. ein Datenbank-Design-Entwurf fertiggestellt ist, wie in Abbildung 3 in 3. und 4. zu sehen, mittels der vom DBMS initiiert er in 2. einen Automatismus zur Abbildung des Designs bereitgestellten EXPLAIN-Funktionalität, der SQL-Workload in ein Datenbank-Schema sowie zur Erstellung von synthetischen und dem aus dem physischen Datenmodell ableitbaren Daten- Datenbank-Statistiken anhand der von ihm modellierten Perfor- bankschema proaktive Performance-Vorhersagen durchgeführt mance-Indikatoren. Mithilfe einer weiteren Routine startet der werden. Die resultierenden, teils komplexen Ausführungspläne Designer/Programmierer in 3. und 4. anschließend einen Simu- lassen sich allerdings nur mit ausreichend Expertise und vor- lationsprozess, der auf Basis der EXPLAIN-Mechanismen Perfor- handenen personellen Kapazitäten angemessen auswerten, sodass mance-Vorhersagen für eine gegebene Workload erstellt und diese diese Problematik vorläufig weiterbesteht. Eine Hauptursache, die als Performance-Modell aufbereitet. Von dort aus informiert er das Verständnis von Ausführungsplänen erschwert, ist ihre sich mithilfe der Diagrammdarstellung über mögliche kritische hierarchische Darstellung als Zugriffsbaum. Demgegenüber denkt Zugriffe, die er daraufhin gezielt analysiert und optimiert. 27 5. ZUSAMMENFASSUNG Ansatzes entgegensteht. Somit sind alternative Varianten zur Datenbank-Performance ist ein wichtiger, oftmals jedoch vernach- Beschaffung der Workload für den Analyseprozess zu lässigter Faktor in der Anwendungsentwicklung. Durch moderne untersuchen und abzuwägen. Anforderungen und dazu implementierte Anwendungen sehen sich speziell deren Datenbank-Backends mit kontinuierlich 7. LITERATUR wachsenden Herausforderungen insbesondere betreffend der [1] C. Coronel, S. Morris, P. Rob. Database Systems: Design, Performance konfrontiert. Diese können nur bewältigt werden, Implementation, and Management, Course Technology, 10. wenn das Thema Datenbank-Performance intensiver betrachtet Auflage, 2011. und durch proaktive Analysen (beispielsweise mittels EXPLAIN- [2] S. Salza, M. Renzetti. A Modeling Tool for Workload Mechanismen) kontinuierlich verfolgt wird. Doch auch dann sind Analysis and Performance Tuning of Parallel Database einzelne Hindernisse unvermeidlich: fehlende repräsentative Applications, Proceedings in ADBIS'97, 09.1997 Daten(-mengen) und Expertise/Kapazitäten zur Analyse. http://www.bcs.org/upload/pdf/ewic_ad97_paper38.pdf Der vorliegende Beitrag präsentiert zur Lösung dieser Probleme [3] R. Osman, W. J. Knottenbelt. Database system performance einen modellbasierten Ansatz, der auf Basis synthetisch erzeugter evaluation models: A survey, Artikel in Performance Statistiken proaktive Performance-Analysen sowie -Vorhersagen Evaluation, Elsevier Verlag, 10.2012 erlaubt und die daraus gewonnenen Ergebnisse in einer einfach http://dx.doi.org/10.1016/j.peva.2012.05.006 verständlichen Form visualisiert. Die technologische Grundlage dafür bietet die in der Praxis vorherrschende Modellierungs- [4] Tata Consultancy Services. System and method for SQL sprache UML mit ihrer UML-Profil-Spezifikation. 
Sie erlaubt es performance assurance services, Internationales Patent das hier vorgestellte Konzept und die dazu benötigten Kom- PCT/IN2011/000348, 11.2011 ponenten mit vorhandenen technischen Mitteln abzubilden und http://dx.doi.org/10.1016/j.peva.2012.05.006 nahtlos in bestehende UML-Infrastrukturen zu integrieren. [5] D. Wiese. Gewinnung, Verwaltung und Anwendung von Performance-Daten zur Unterstützung des autonomen 6. AUSBLICK Datenbank-Tuning, Dissertation, Fakultät für Mathematik Bei dem im Beitrag vorgestellten Konzept handelt es sich um und Informatik, Friedrich-Schiller-Universität Jena, 05.2011. einen auf Basis wiederkehrender praktischer Problemstellungen http://www.informatik.uni-jena.de/dbis/alumni/wiese/pubs/D und den daraus gewonnenen Erfahrungen konstruierten Ansatz. issertation__David_Wiese.pdf Während die technische Umsetzbarkeit einzelner Teilaspekte wie [6] S. Chaudhuri, V. Narasayya. A Self-Tuning Database etwa die Erfassung von Performance-Indikatoren oder die Kon- Systems: A Decade of Progress, Proceedings in VLDB'07, struktion des Performance-Modells auf Basis von UML-Profilen 09.2007 bereits geprüft wurde, steht eine prototypische Implementierung http://research.microsoft.com/pubs/76506/vldb07-10yr.pdf des gesamten Prozesses zur Performance-Analyse noch aus. [7] N. Bruno, S. Chaudhuri. An Online Approach to Physical Zuvor sind weitere Detailbetrachtungen nötig. So ist beispiels- Design Tuning, Proceedings in ICDE'07, 04.2007 weise zu klären, in welchem Umfang Performance-Indikatoren http://research.microsoft.com/pubs/74112/continuous.pdf im Datenmodell vom Analyst/Designer sinnvoll erfasst werden [8] Oracle Corporation. Oracle Database 2 Day DBA 12c sollten. Dabei ist ein Kompromiss zwischen maximalem Release 1 (12.1) – Monitoring and Tuning the Database, Detailgrad und minimal nötigem Informationsgehalt anzustreben, 2013. sodass der Aufwand zur Angabe von Performance-Indikatoren http://docs.oracle.com/cd/E16655_01/server.121/e17643/mo möglichst gering ist, mit deren Hilfe aber dennoch eine ntune.htm#ADMQS103 repräsentative Performance-Vorhersage ermöglicht wird. [9] Microsoft Corporation. SQL Server 2005 – Database Engine Weiterhin gilt es, eine geeignete Metrik zur Bewertung/Katego- Tuning Advisor (DTA) in SQL Server 2005, Technischer risierung der Analyseergebnisse zu entwickeln. Hier steht die Artikel, 2006. Frage im Vordergrund, wann ein Zugriff anhand seiner Kosten als http://download.microsoft.com/download/4/7/a/47a548b9- schlecht und wann er als gut zu bewerten ist. Ein teurer Zugriff ist 249e-484c-abd7-29f31282b04d/SQL2005DTA.doc nicht zwangsweise ein schlechter, wenn er beispielsweise zur Realisierung einer komplexen Funktionalität verwendet wird. [10] C.-M. Lo. A Study of Applying a Model-Driven Approach to the Development of Database Applications, Dissertation, Zuletzt sei noch die Erfassung beziehungsweise Beschaffung der Department of Information Management, National Taiwan für die EXPLAIN-Analysen notwendigen Workload erwähnt. University of Science and Technology, 06.2012. Diese muss dem vorgestellten proaktiven Analyseprozess [11] Object Management Group. Information Management zugänglich gemacht werden, um anhand des beschriebenen Metamodel (IMM) Specification Draft Version 8.0, Konzepts frühzeitige Performance-Untersuchungen durchführen Spezifikationsentwurf, 03.2009. zu können. 
Im einfachsten Fall könnte angenommen werden, dass http://www.omgwiki.org/imm/doku.php sämtliche SQL-Statements (inklusive ihrer Ausführungshäu- figkeit) vom Designer/Programmierer ebenfalls im Datenmodell [12] N. Burgold, M. Gerstmann, F. Leis. Statistiken in beispielsweise als zusätzliche Merkmale von Methoden in der relationalen DBMSen und Möglichkeiten zu deren UML-Klassenmodellierung zu erfassen und kontinuierlich zu synthetischer Erzeugung, Projektarbeit, Fakultät für pflegen wären. Dies wäre jedoch ein sehr aufwändiges Verfahren, Mathematik und Informatik, Friedrich-Schiller-Universität das der gewünschten hohen Praxistauglichkeit des proaktiven Jena, 05.2014. 28 Big Data und der Fluch der Dimensionalität Die effiziente Suche nach Quasi-Identifikatoren in hochdimensionalen Daten Hannes Grunert Andreas Heuer Lehrstuhl für Datenbank- und Lehrstuhl für Datenbank- und Informationssysteme Informationssysteme Universität Rostock Universität Rostock Albert-Einstein-Straße 22 Albert-Einstein-Straße 22 hg(at)informatik.uni-rostock.de ah(at)informatik.uni-rostock.de Kurzfassung gen Handlungen des Benutzers abgeleitet, sodass die smarte In smarten Umgebungen werden häufig große Datenmengen Umgebung eigenständig auf die Bedürfnisse des Nutzers rea- durch eine Vielzahl von Sensoren erzeugt. In vielen Fällen gieren kann. werden dabei mehr Informationen generiert und verarbei- In Assistenzsystemen [17] werden häufig wesentlich mehr tet als in Wirklichkeit vom Assistenzsystem benötigt wird. Informationen gesammelt als benötigt. Außerdem hat der Dadurch lässt sich mehr über den Nutzer erfahren und sein Nutzer meist keinen oder nur einen sehr geringen Einfluss Recht auf informationelle Selbstbestimmung ist verletzt. auf die Speicherung und Verarbeitung seiner personenbe- Bestehende Methoden zur Sicherstellung der Privatheits- zogenen Daten. Dadurch ist sein Recht auf informationel- ansprüche von Nutzern basieren auf dem Konzept sogenann- le Selbstbestimmung verletzt. Durch eine Erweiterung des ter Quasi-Identifikatoren. Wie solche Quasi-Identifikatoren Assistenzsystems um eine Datenschutzkomponente, welche erkannt werden können, wurde in der bisherigen Forschung die Privatheitsansprüche des Nutzers gegen den Informati- weitestgehend vernachlässigt. onsbedarf des Systems überprüft, kann diese Problematik In diesem Artikel stellen wir einen Algorithmus vor, der behoben werden. identifizierende Attributmengen schnell und vollständig er- Zwei Hauptaspekte des Datenschutzes sind Datenvermei- kennt. Die Evaluierung des Algorithmus erfolgt am Beispiel dung und Datensparsamkeit. In §3a des Bundesdatenschutz- einer Datenbank mit personenbezogenen Informationen. gesetzes [1] wird gefordert, dass [d]ie Erhebung, Verarbeitung und Nutzung ” ACM Klassifikation personenbezogener Daten und die Auswahl und K.4.1 [Computer and Society]: Public Policy Issues— Gestaltung von Datenverarbeitungssystemen [...] Privacy; H.2.4 [Database Management]: Systems—Que- an dem Ziel auszurichten [sind], so wenig perso- ry Processing nenbezogene Daten wie möglich zu erheben, zu verarbeiten oder zu nutzen.“. Stichworte Mittels einer datensparsamen Weitergabe der Sensor- und Datenbanken, Datenschutz, Big Data Kontext-Informationen an die Analysewerkzeuge des Assis- tenzsystems wird nicht nur die Datenschutzfreundlichkeit des Systems verbessert. Bei der Vorverdichtung der Daten 1. 
EINLEITUNG durch Selektion, Aggregation und Komprimierung am Sen- Assistenzsysteme sollen den Nutzer bei der Arbeit (Am- sor selbst lässt sich die Effizienz des Systems steigern. Die bient Assisted Working) und in der Wohnung (Ambient Privatheitsansprüche und der Informationsbedarf der Ana- Assisted Living) unterstützen. Durch verschiedene Senso- lysewerkzeuge können als Integritätsbedingungen im Daten- ren werden Informationen über die momentane Situation banksystem umgesetzt werden. Durch die Integritätsbedin- und die Handlungen des Anwenders gesammelt. Diese Da- gungen lassen sich die notwendigen Algorithmen zur An- ten werden durch das System gespeichert und mit weiteren onymisierung und Vorverarbeitung direkt auf dem Datenbe- Daten, beispielsweise mit dem Facebook-Profil des Nutzers stand ausführen. Eine Übertragung in externe Programme verknüpft. Durch die so gewonnenen Informationen lassen bzw. Module, die sich evtl. auf anderen Recheneinheiten be- sich Vorlieben, Verhaltensmuster und zukünftige Ereignis- finden, entfällt somit. se berechnen. Daraus werden die Intentionen und zukünfti- Für die Umsetzung von Datenschutzbestimmungen in smarten Umgebungen wird derzeit das PArADISE1 - Framework entwickelt, welches insbesondere die Aspekte der Datensparsamkeit und Datenvermeidung in heteroge- nen Systemumgebungen realisieren soll. In [3] stellen wir ein einfaches XML-Schema vor, mit der Copyright c by the paper’s authors. Copying permitted only sich Privatheitsansprüche durch den Nutzer von smarten for private and academic purposes. Systemen formulieren lassen. Dabei wird eine Anwendung In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 1 Workshop on Foundations of Databases (Grundlagen von Datenbanken), Privacy-aware assistive distributed information system 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. environment 29 innerhalb eines abgeschlossenen Systems in ihre Funktionali- pel (ti ) angibt. Ein Quasi-Identifikator QI := {A1 , ..., An } täten aufgeteilt. Für jede Funktionalität lässt sich festlegen, ist für eine Relation R entsprechend definiert: welche Informationen in welchem Detailgrad an das System ≥p weitergegeben werden dürfen. Dazu lassen sich einzelne At- Quasi-Identifikator. ∀ t1 , t2 ∈ R [t1 6= t2 ⇒ ∃ A ∈ QI: tribute zu Attributkombinationen zusammenfassen, die an- t1 (A) 6= t2 (A)] gefragt werden können. Wie beim Datenbankentwurf reicht es auch für die Anga- Für einen unerfahrenen Nutzer ist das Festlegen von sinn- be von Quasi-Identifikatoren aus, wenn die minimale Men- vollen Einstellungen nur schwer möglich. Die Frage, die sich ge von Attributen angegeben wird, welche die Eigenschaft ihm stellt, ist nicht die, ob er seine persönlichen Daten schüt- eines QI hat. Eine solche Menge wird als minimaler Quasi- zen soll, sondern vielmehr, welche Daten es wert sind, ge- Identifikator bezeichnet. schützt zu werden. Zur Kennzeichnung schützenswerter Da- ten werden u.a. sogenannte Quasi-Identifikatoren [2] verwen- minimaler Quasi-Identifikator. X ist ein minimaler det. In diesem Artikel stellen wir einen neuen Ansatz vor, Quasi-Identifikator (mQI), wenn X ein Quasi-Identifikator mit dem Quasi-Identifikatoren schnell und vollständig er- ist und jede nicht-leere Teilmenge Y von X kein Quasi- kannt werden können. Identifikator ist. 
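Zur Veranschaulichung der eben eingeführten Begriffe prüft die folgende Skizze, welcher Anteil der Tupel einer Relation durch eine Attributmenge eindeutig identifiziert wird und ob damit ein gegebener Grenzwert p erreicht ist. Sie ist rein illustrativ; pandas sowie Daten- und Attributnamen sind frei gewählte Annahmen und nicht Bestandteil des Beitrags.

# Illustrative Skizze (keine Implementierung aus dem Beitrag).
import pandas as pd

df = pd.DataFrame({                     # frei erfundene Beispieldaten
    "plz":    ["18055", "18055", "07743", "07743", "18055"],
    "geburt": [1980, 1991, 1980, 1955, 1991],
    "geschl": ["m", "w", "m", "w", "w"],
})

def eindeutig_anteil(df, attribute):
    """Anteil der Tupel, die über die Attributmenge eindeutig identifizierbar sind."""
    return float((~df.duplicated(subset=list(attribute), keep=False)).mean())

def ist_quasi_identifikator(df, attribute, p=0.8):
    return eindeutig_anteil(df, attribute) >= p

print(eindeutig_anteil(df, {"plz"}))                             # sehr kleiner Anteil
print(ist_quasi_identifikator(df, {"plz", "geburt"}, p=0.5))     # mit Grenzwert p = 0.5 ein QI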
Der Rest des Artikels ist wie folgt strukturiert: Kapitel 2 X ist mQI: X ist QI ∧ (@ Y ⊂ X: (Y 6= ) ∧ (Y ist QI)) gibt einen aktuellen Überblick über den Stand der Forschung Insbesondere ist X kein minimaler Quasi-Identifikator, im Bereich der Erkennung von Quasi-Identifikatoren. Im fol- wenn eine Teilmenge X-{A} von X mit A ∈ X existiert, genden Kapitel gehen wir detailliert darauf ein, wie schüt- die ein Quasi-Identifikator ist. Das Finden von allen Quasi- zenswerte Daten definiert sind und wie diese effizient erkannt Identifikatoren stellt ein NP-vollständiges Problem dar, weil werden können. Kapitel 4 evaluiert den Ansatz anhand eines die Menge der zu untersuchenden Teilmengen exponentiell Datensatzes. Das letzte Kapitel fasst den Beitrag zusammen zur Anzahl der Attribute einer Relation steigt. Besteht eine und gibt einen Ausblick auf zukünftige Arbeiten. Relation aus n Attributen, so existieren insgesamt 2n Attri- butkombinationen, für die ermittelt werden muss, ob sie ein 2. STAND DER TECHNIK QI sind. In diesem Kapitel stellen wir bestehende Konzepte zur In [12] stellen Motwani und Xu einen Algorithmus zum ef- Ermittlung von Quasi-Identifikatoren (QI) vor. Außerdem fizienten Erkennen von minimalen Quasi-Identifikatoren vor. werden Techniken vorgestellt, die in unseren Algorithmus Dieser baut auf die von Mannila et. al [10] vorgeschlagene, eingefloßen sind. ebenenweise Erzeugung von Attributmengen auf. Dabei wird die Minimalitätseigenschaft von Quasi-Identifikatoren sofort 2.1 Quasi-Identifikatoren erkannt und der Suchraum beim Durchlauf auf der nächsten Zum Schutz personenbezogener Daten existieren Konzep- Ebene eingeschränkt. te wie k-anonymity [16], l-diversity [8] und t-closeness [7]. Der Algorithmus ist effizienter als alle 2n Teilmengen zu Diese Konzepte unterteilen die Attribute einer Relation in testen, allerdings stellt die von Big-Data-Anwendungen er- Schlüssel, Quasi-Identifikatoren, sensitive Daten und sons- zeugte Datenmenge eine neue Herausforderung dar. Insbe- tige Daten. Ziel ist es, dass die sensitiven Daten sich nicht sondere die hohe Dimensionalität und die Vielfalt der Daten eindeutig zu einer bestimmten Person zuordnen lassen. Da sind ernst zu nehmende Probleme. Aus diesem Grund schla- durch Schlüsselattribute Tupel eindeutig bestimmt werden gen wir im folgenden Kapitel einen neuen Algorithmus vor, können, dürfen diese unter keinen Umständen zusammen der auf den Algorithmus von Motwani und Xu aufsetzt. mit den sensitiven Attributen veröffentlicht werden. Während Schlüssel im Laufe des Datenbankentwurfes fest- 2.2 Sideways Information Passing gelegt werden, lassen sich Quasi-Identifikatoren erst beim Der von uns entwickelte Algorithmus verwendet Techni- Vorliegen der Daten feststellen, da sie von den konkreten ken, die bereits beim Sideways Information Passing (SIP, Attributwerten der Relation abhängen. Der Begriff Quasi- [4]) eingesetzt werden. Der grundlegende Ansatz von SIP Identifikator wurde von Dalenius [2] geprägt und bezeichnet besteht darin, dass während der Ausführung von Anfrage- a subset of attributes that can uniquely identify most tuples plänen Tupel nicht weiter betrachtet werden, sofern mit Si- ” in a table“. cherheit feststeht, dass sie keinen Bezug zu Tupeln aus an- Für most tuples“ wird häufig ein Grenzwert p festge- deren Relationen besitzen. ” legt, der bestimmt, ob eine Attributkombination ein Quasi- Durch das frühzeitige Erkennen solcher Tupel wird der Identifikator ist oder nicht. 
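For reference, the two definitions can be restated compactly; the rendering below is our reading of the prose above, with the quantifier ∀^{≥p} meaning "for at least the fraction p of all tuple pairs".

% Reconstruction of the definitions as we read them from the surrounding text.
\[
  \text{QI: } \forall^{\geq p}\; t_1, t_2 \in R\colon\; t_1 \neq t_2 \;\Rightarrow\; \exists A \in QI\colon\; t_1(A) \neq t_2(A)
\]
\[
  \text{mQI: } X \text{ ist mQI} \;\Leftrightarrow\; X \text{ ist QI} \;\wedge\; \nexists\, Y \subset X\colon\; (Y \neq \emptyset) \wedge (Y \text{ ist QI})
\]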
Dieser Grenzwert lässt sich bei- zu betrachtende Suchraum eingeschränkt und die Ausfüh- spielsweise in relationalen Datenbanken durch zwei SQL- rungszeit von Anfragen reduziert. Besonders effektiv ist die- Anfragen wie folgt bestimmen: ses Vorgehen, wenn das Wissen über diese magic sets“ [14] ” zwischen den Teilen eines Anfrageplans ausgetauscht und p = COUNT DISTINCT *COUNT FROM (SELECT FROM table) ∗ FROM table in höheren Ebenen des Anfrageplans mit eingebunden wird. (1) Beim SIP werden zudem weitere Techniken wie Bloomjoins Wird für p der Wert 1 gewählt, so sind die gefundenen QI [9] und Semi-Joins eingesetzt um den Anfrageplan weiter zu mit diesem Grenzwert auch Schlüssel der Relation. Um eine optimieren. Vergleichbarkeit unseres Algorithmus mit dem von Motwani und Xu zu gewährleisten, verwenden wir ebenfalls die in (1) 2.3 Effiziente Erfragung von identifizieren- definierte distinct ratio“ (nach [12]). ” den Attributmengen Da es für den Ausdruck die meisten“ keinen standardisier- ” ≥p In [5] wird ein Algorithmus zur Ermittlung von identi- ten Quantor gibt, formulieren wir ihn mit dem Zeichen: ∀ , fizierenden Attributmengen (IA) in einer relationalen Da- wobei p den Prozentsatz der eindeutig identifizierbaren Tu- tenbank beschrieben. Wird für eine Attributmenge erkannt, 30 dass diese eine IA für eine Relation R ist, so sind auch alle Algorithm 1: bottomUp Obermengen dieser Attributmenge IA für R. Ist für eine Re- Data: database table tbl, list of attributes elements lation bestehend aus den Attributen A, B und C bekannt, Result: a set with all minimal QI qiLowerSet dass B eine identifizierende Attributmenge ist, dann sind initialization(); auch AB, BC und ABC eine IA der Relation. for element in elements do Ist eine Attributmenge hingegen keine IA für R, so sind set := set ∪ {element} auch alle Teilmengen dieser Attributmenge keine IA. Wenn end beispielsweise AC keine IA für R ist, dann sind auch weder A while set is not empty do noch C identifizierende Attributmengen für R. Attributmen- for Set testSet: set do gen, die keine identifizierende Attributmenge sind, werden double p := getPercentage(testSet, tbl); als negierte Schlüssel bezeichnet. if p ≥ threshold then Der in [5] vorgestellte Algorithmus nutzt diese Eigenschaf- qiLowerSet := qiLowerSet ∪ {testSet}; ten um anhand eines Dialoges mit dem Nutzer die Schlüs- end seleigenschaften einer bereits existierenden Relation festzu- end legen. Dabei wird dem Nutzer ein Ausschnitt der Relations- set := buildNewLowerSet(set, elements); tabelle präsentiert anhand derer entschieden werden soll, ob end eine Attributkombination Schlüssel ist oder nicht. Wird in return qiLowerSet; einer Teilrelation festgestellt, dass die Attributmenge Tu- pel mit gleichen Attributwerten besitzt, so kann die Attri- butkombination für die Teilmenge, als auch für die gesamte Relation kein Schlüssel sein. Algorithm 2: buildNewLowerSet Data: current lower set lSet, list of attributes elements 3. ALGORITHMUS Result: the new lower set lSetNew In diesem Kapitel stellen wir einen neuen Algorithmus Set lSetNew := new Set(); zum Finden von minimalen Quasi-Identifikatoren vor. Der for Set set: lSet do Algorithmus beschränkt sich dabei auf die Einschränkung for Attribut A: elements do der zu untersuchenden Attributkombinationen. Der entwi- if @q ∈ qiLowerSet : q ⊆ set then ckelte Ansatz führt dabei den von [12] vorgestellten Bottom- lSetNew := lSetNew ∪ {set ∪ {A}}; Up-Ansatz mit einen gegenläufigen Top-Down-Verfahren zu- end sammen. 
end 3.1 Bottom-Up end return lSetNew; Der von Motwani und Xu in [12] vorgestellte Ansatz zum Erkennen aller Quasi-Identifikatoren innerhalb einer Rela- tion nutzt einen in [10] präsentierten Algorithmus. Dabei wird für eine Relation mit n Attributen ebenenweise von gesetzte QIs besitzt, da so der Suchraum gleich zu Beginn den einelementigen zu n-elementigen Attributkombinatio- stark eingeschränkt wird. nen Tests durchgeführt. Wird für eine i-elementige (1≤i testSet: set do double p := getPercentage(testSet, tbl); Passing [4] untereinander ausgetauscht. Es wird pro Berech- if p < threshold then nungsschritt entweder die Top-Down- oder die Bottom-Up- optOutSet := optOutSet ∪ {subset}; Methode angewandt und das Ergebnis an die jeweils ande- else re Methode übergeben. Der Algorithmus terminiert, sobald qiUpperSet := qiUpperSet ∪ {testSet}; alle Attributebenen durch einen der beiden Methoden abge- for Set o: qiSet do arbeitet wurden oder das Bottom-Up-Vorgehen keine Attri- if testSet ⊂ o then butkombinationen mehr zu überprüfen hat. In Abbildung 1 qiUpperSet := qiUpperSet - {o}; ist die Arbeitsweise des Algorithmus anhand einer Beispiel- end relation mit sechs Attributen dargestellt. Die rot markierten end Kombinationen stehen dabei für negierte QI, grün markierte end für minimale QI und gelb markierte für potentiell minimale end QI. set := buildNewUpper(set); Um zu entscheiden, welcher Algorithmus im nächsten Zy- end klus angewandt wird, wird eine Wichtungsfunktion einge- return qiUpperSet; führt. Die Überprüfung einer einzelnen Attributkombinati- on auf Duplikate hat eine Laufzeit von O(n*log(n)), wobei n die Anzahl der Tupel in der Relation ist. Die Überprü- Der Top-Down-Ansatz hebt die Nachteile des Bottom-Up- fung der Tupel hängt aber auch von der Größe der Attri- Vorgehens auf: der Algorithmus arbeitet effizient, wenn QIs butkombination ab. Besteht ein zu überprüfendes Tupel aus aus vielen Attributen zusammengesetzt sind und für den mehreren Attributen, so müssen im Datenbanksystem auch Fall, dass die gesamte Relation kein QI ist, wird dies bei der mehr Daten in den Arbeitsspeicher für die Duplikaterken- ersten Überprüfung erkannt und der Algorithmus terminiert nung geladen werden. Durch große Datenmengen werden dann umgehend. Seiten schnell aus dem Arbeitsspeicher verdrängt, obwohl Besteht die Relation hingegen aus vielen kleinen QIs, dann sie später wieder benötigt werden. Dadurch steigt die Re- wird der Suchraum erst zum Ende des Algorithmus stark chenzeit weiter an. eingeschränkt. Ein weiterer Nachteil liegt in der erhöhten Für eine vereinfachte Wichtungsfunktion nehmen wir an, Rechenzeit, auf die in der Evaluation näher eingegangen dass alle Attribute den gleichen Speicherplatz belegen. Die wird. Anzahl der Attribute in einer Attributkombination bezeich- nen wir mit m. Für die Duplikaterkennung ergibt sich dann 3.3 Bottom-Up+Top-Down eine Laufzeit von O((n*m)*log(n*m)). Der in diesem Artikel vorgeschlagene Algorithmus kom- Da die Anzahl der Tupel für jede Duplikaterkennung kon- biniert die oben vorgestellten Verfahren. Dabei werden die stant bleibt, kann n aus der Kostenabschätzung entfernt Verfahren im Wechsel angewandt und das Wissen über (ne- werden. Die Kosten für die Überprüfung einer einzelnen gierte) Quasi-Identifikatoren wie beim Sideways Information 32 Algorithm 5: bottomUpTopDown Die Evaluation erfolgte in einer Client-Server-Umgebung. 
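The level-wise candidate construction used by the bottom-up pass (Algorithms 1 and 2) can be sketched as follows. The Java fragment is a simplified illustration with our own identifiers, not the evaluated implementation: each surviving attribute set is extended by one attribute, and any candidate that already contains a known minimal quasi-identifier is pruned, since such a superset cannot be minimal.

import java.util.*;

// Sketch of the level-wise bottom-up step (in the spirit of Algorithms 1 and 2):
// candidates of the next level are built by adding one attribute to each surviving
// set; candidates that contain an already found minimal QI are pruned.
// All identifiers are our own, not taken from the paper.
public class LevelwiseCandidates {

    static Set<Set<String>> buildNextLevel(Set<Set<String>> currentLevel,
                                           List<String> attributes,
                                           Set<Set<String>> foundMinimalQIs) {
        Set<Set<String>> nextLevel = new HashSet<>();
        for (Set<String> candidate : currentLevel) {
            for (String attribute : attributes) {
                if (candidate.contains(attribute)) {
                    continue;                   // only genuine extensions
                }
                Set<String> extended = new HashSet<>(candidate);
                extended.add(attribute);
                if (containsKnownQI(extended, foundMinimalQIs)) {
                    continue;                   // prune: a superset of a minimal QI cannot be minimal
                }
                nextLevel.add(extended);
            }
        }
        return nextLevel;
    }

    static boolean containsKnownQI(Set<String> candidate, Set<Set<String>> foundMinimalQIs) {
        for (Set<String> qi : foundMinimalQIs) {
            if (candidate.containsAll(qi)) {
                return true;
            }
        }
        return false;
    }
}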
Data: database table tbl, list of attributes attrList Als Server dient eine virtuelle Maschine, die mit einer 64-Bit- Result: a set with all minimal quasi-identifier qiSet CPU (vier Kerne @ 2 GHz und jeweils 4 MB Cache) und 4 attrList.removeConstantAttributes(); GB Arbeitsspeicher ausgestattet ist. Auf dieser wurde eine Set upperSet := new Set({attrList}); MySQL-Datenbank mit InnoDB als Speichersystem verwen- Set lowerSet := new Set(attrList); det. Der Client wurde mit einem i7-3630QM als CPU betrie- // Sets to check for each algorithm ben. Dieser bestand ebenfalls aus vier Kernen, die jeweils int bottom := 0; über 2,3 GHz und 6 MB Cache verfügten. Als Arbeitsspei- int top := attrList.size(); cher standen 8 GB zur Verfügung. Als Laufzeitumgebung while (bottom<=top) or (lowerSet is empty) do wurde Java SE 8u5 eingesetzt. calculateWeights(); Der Datensatz wurde mit jedem Algorithmus getestet. if isLowerSetNext then Um zu ermitteln, wie die Algorithmen sich bei verschiede- bottomUp(); nen Grenzwerten für Quasi-Identifikatoren verhalten, wur- buildNewLowerSet(); den die Tests mit 10 Grenzwerten zwischen 50% und 99% bottom++; wiederholt. // Remove new QI from upper set Die Tests mit den Top-Down- und Bottom-Up- modifyUpperSet(); Algorithmen benötigten im Schnitt gleich viele Tablescans (siehe Abbildung 2). Die Top-Down-Methode lieferte bes- else sere Ergebnisse bei hohen QI-Grenzwerten, Bottom-Up topDown(); ist besser bei niedrigeren Grenzwerten. Bei der Laufzeit buildNewUpperSet(); (siehe Abbildung 3) liegt die Bottom-Up-Methode deutlich top--; vor dem Top-Down-Ansatz. Grund hierfür sind die großen // Remove new negated QI from lower set Attributkombinationen, die der Top-Down-Algorithmus zu modifyLowerSet(); Beginn überprüfen muss. end Der Bottom-Up+Top-Down-Ansatz liegt hinsichtlich end Laufzeit als auch bei der Anzahl der Attributvergleiche qiSet := qiLowerSet ∪ qiUpperSet; deutlich vorne. Die Anzahl der Tablescans konnte im Ver- return qiSet; gleich zum Bottom-Up-Verfahren zwischen 67,4% (4076 statt 12501 Scans; Grenzwert: 0.5) und 96,8% (543 statt 16818 Scans; Grenzwert 0.9) reduziert werden. Gleiches gilt Attributkombination mit m Attributen beträgt demnach für die Laufzeit (58,1% bis 97,5%; siehe Abbildung 3). O((m*log(m)). Die Gesamtkosten für das Überprüfen der möglichen Quasi-Identifikatoren werden mit WAV G bezeichnet. WAV G 6000 Anzahl Tablescans ergibt sich aus dem Produkt für das Überprüfen einer ein- zelnen Attributkombination und der Anzahl der Attribut- kombinationen (AttrKn ) mit n Attributen. 4000 WAV G := AttrKn ∗ log(m) ∗ m (2) 2000 Soll die Wichtungsfunktion präziser sein, so lässt sich der Aufwand abschätzen, indem für jede Attributkombination X die Summe s über die Attributgrößen von X gebildet und 0 anschließend gewichtet wird. Die Einzelgewichte werden an- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 schließend zum Gesamtgewicht aufsummiert. Anzahl Attribute in der Attributkombination P P WAV G := log(s) ∗ s; s = size(A) (3) Brute-Force X∈AttrKn A∈X Bottom-Up Diese Wichtung eignet sich allerdings nur, wenn Zugang Top-Down zu den Metadaten der Datenbankrelation besteht. Bottom-Up+Top-Down (AVG) 4. EVALUATION Abbildung 2: Verhältnis von der Anzahl der Attri- Für die Evaluation des Algorithmus wurde die Adult“- bute in den Attributkombinationen zur Anzahl von ” Tablescans (Adult-DB, Grenzwert 90%) Relation aus dem UCI Machine Learning Repository [6] ver- wendet. 
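The weighting function that decides whether the next step is a bottom-up or a top-down pass can be restated compactly. The two variants below are our reading of formulas (2) and (3): a uniform estimate over all combinations AttrK_n of size m, and a refined estimate that sums the attribute sizes s_X of each combination X (only applicable when the relation's metadata are accessible).

% Our reconstruction of the weighting formulas from the surrounding text.
\[
  W_{AVG} := |AttrK_n| \cdot m \cdot \log(m)
\]
\[
  W_{AVG} := \sum_{X \in AttrK_n} s_X \cdot \log(s_X), \qquad s_X = \sum_{A \in X} size(A)
\]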
Die Relation besteht aus anonymisierten, personen- bezogenen Daten, bei denen Schlüssel sowie Vor- und Nach- Wie in Abbildung 3 zu erkennen ist, nimmt die Lauf- name von Personen entfernt wurden. Die übrigen 15 Attri- zeit beim Bottom-Up+Top-Down-Verfahren im Grenz- bute enthalten Angaben zu Alter, Ehestand, Staatsangehö- wertbereich von 70%-90% stark ab. Interessant ist dies rigkeit und Schulabschluss. Die Relation besteht insgesamt aus zwei Gründen. Erstens nimmt die Anzahl der Quasi- aus 32561 Tupeln, die zunächst im CSV-Format vorlagen Identifikatoren bis 90% ebenfalls ab (179 bei 50%, 56 bei und in eine Datenbank geparst wurden. 90%). Dies legt nahe, dass die Skalierung des Verfahrens neben der Dimension der Relation (Anzahl von Tupel und 33 Attributen) auch von der Anzahl der vorhandenen QIs Bekanntmachung vom 14. Januar 2003, das zuletzt abhängt. Um den Zusammenhang zu bestätigen, sind aber durch Artikel 1 des Gesetzes vom 14. August 2009 weitere Untersuchungen erforderlich. geändert worden ist, 2010. Zweitens wird dieser Grenzwertbereich in der Literatur [2] T. Dalenius. Finding a Needle In a Haystack or [13] häufig benutzt, um besonders schützenswerte Daten her- Identifying Anonymous Census Records. Journal of vorzuheben. Durch die gute Skalierung des Algorithmus in Official Statistics, 2(3):329–336, 1986. diesem Bereich lassen sich diese QIs schnell feststellen. [3] H. Grunert. Privacy Policy for Smart Environments. http://www.ls-dbis.de/pp4se, 2014. zuletzt aufgerufen am 17.07.2014. 8000 [4] Z. G. Ives and N. E. Taylor. Sideways information Laufzeit in Sekunden passing for push-style query processing. In Data 6000 Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 774–783. IEEE, 4000 2008. [5] M. Klettke. Akquisition von Integritätsbedingungen in 2000 Datenbanken. PhD thesis, Universität Rostock, 1997. [6] R. Kohavi and B. Becker. Adult Data Set. http://archive.ics.uci.edu/ml/datasets/Adult, 0 1996. zuletzt aufgerufen am 17.07.2014. 50 60 70 80 90 95 99 [7] N. Li, T. Li, and S. Venkatasubramanian. t-Closeness: Grenzwert in % Privacy Beyond k-Anonymity and l-Diversity. In ICDE, volume 7, pages 106–115, 2007. Bottom-Up [8] A. Machanavajjhala, D. Kifer, J. Gehrke, and Top-Down M. Venkitasubramaniam. l-diversity: Privacy beyond Bottom-Up+Top-Down(AVG) k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3, 2007. [9] L. F. Mackert. R* optimizer validation and Abbildung 3: Vergleich der Laufzeit der verschiede- performance evaluation for distributed queries. In nen Algorithmen (Adult-DB) Readings in database systems, pages 219–229. Morgan Kaufmann Publishers Inc., 1988. [10] H. Mannila, H. Toivonen, and A. I. Verkamo. 5. AUSBLICK Discovery of frequent episodes in event sequences. In dieser Arbeit stellten wir einen effizienten Algorithmus Data Mining and Knowledge Discovery, 1(3):259–289, zur Erkennung von QI in hochdimensionalen Daten vor. An- 1997. hand eines Beispiels mit Sensordaten zeigten wir die Eignung [11] D. Moos. Konzepte und Lösungen für in Assistenzsystemen. Darüber hinaus ermitteln wir derzeit, Datenaufzeichnungen in heterogenen dynamischen inwiefern sich QIs in temporalen Datenbanken feststellen Umgebungen. Bachelorarbeit, Universität Rostock, lassen. Das so gewonnene Wissen über schützenswerte Daten 2011. wird in unser Gesamtprojekt zur datenschutzfreundlichen [12] R. Motwani and Y. Xu. Efficient algorithms for Anfrageverarbeitung in Assistenzsystemen eingebunden. masking and finding quasi-identifiers. 
In Proceedings In späteren Untersuchungen werden wir testen, welche of the Conference on Very Large Data Bases (VLDB), weiteren Quasi-Identifikatoren sich aus der Kombination pages 83–93, 2007. von Daten verschiedener Relationen ableiten lassen. Der [13] P. Samarati and L. Sweeney. Protecting privacy when dafür verwendete Datensatz besteht aus Sensordaten, die disclosing information: k-anonymity and its im Smart Appliance Lab des Graduiertenkollegs MuSA- enforcement through generalization and suppression. MA durch ein Tool [11] aufgezeichnet wurden. Die Daten Technical report, Technical report, SRI International, umfassen dabei Bewegungsprofile, die mittels RFID-Tags 1998. und einen Sensfloor [15] erfasst wurden, aber auch Infor- [14] P. Seshadri, J. M. Hellerstein, H. Pirahesh, T. Leung, mationen zu Licht und Temperatur. Eine Verknüpfung der R. Ramakrishnan, D. Srivastava, P. J. Stuckey, and Basis-Relationen erfolgt dabei über die ermittelten Quasi- S. Sudarshan. Cost-based optimization for magic: Identifikatoren. Algebra and implementation. In ACM SIGMOD Record, volume 25, pages 435–446. ACM, 1996. 6. DANKSAGUNG [15] A. Steinhage and C. Lauterbach. Sensfloor (r): Ein Hannes Grunert wird durch die Deutsche Forschungsge- AAL Sensorsystem für Sicherheit, Homecare und meinschaft (DFG) im Rahmen des Graduiertenkollegs 1424 Komfort. Ambient Assisted Living-AAL, 2008. (Multimodal Smart Appliance Ensembles for Mobile Appli- [16] L. Sweeney. k-anonymity: A model for protecting cations - MuSAMA) gefördert. Wir danken den anonymen privacy. International Journal of Uncertainty, Gutachtern für ihre Anregungen und Kommentare. Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002. 7. LITERATUR [17] M. Weiser. The computer for the 21st century. [1] Bundesrepublik Deutschland. Scientific american, 265(3):94–104, 1991. Bundesdatenschutzgesetz in der Fassung der 34 Combining Spotify and Twitter Data for Generating a Recent and Public Dataset for Music Recommendation Martin Pichl Eva Zangerle Günther Specht Databases and Information Databases and Information Databases and Information Systems Systems Systems Institute of Computer Science Institute of Computer Science Institute of Computer Science University of Innsbruck, University of Innsbruck, University of Innsbruck, Austria Austria Austria martin.pichl@uibk.ac.at eva.zangerle@uibk.ac.at guenther.specht@uibk.ac.at ABSTRACT recommender systems, i.e., the million song dataset (MSD) In this paper, we present a dataset based on publicly avail- [4], however such datasets like the MSD often are not recent able information. It contains listening histories of Spotify anymore. Thus, in order to address the problem of a lack users, who posted what they are listening at the moment of recent and public available data for the development and on the micro blogging platform Twitter. The dataset was evaluation of recommender systems, we exploit the fact that derived using the Twitter Streaming API and is updated many users of music streaming platforms post what they are regularly. To show an application of this dataset, we imple- listening to on the microblogging Twitter. An example for ment and evaluate a pure collaborative filtering based rec- such a tweet is “#NowPlaying Human (The Killers) #craig- ommender system. The performance of this system can be cardiff #spotify http://t.co/N08f2MsdSt”. 
Using a dataset seen as a baseline approach for evaluating further, more so- derived from such tweets, we implement and evaluate a col- phisticated recommendation approaches. These approaches laborative filtering (CF) based music recommender system will be implemented and benchmarked against our baseline and show that this is a promising approach. Music recom- approach in future works. mender systems are of interest, as the volume and variety of available music increased dramatically, as mentioned in the beginning. Besides commercial vendors like Spotify1 , Categories and Subject Descriptors there are also open platforms like SoundCloud2 or Promo H.3.3 [Information Search and Retrieval]: Information DJ3 , which foster this development. On those platforms, filtering; H.2.8 [Database Applications]: Data mining users can upload and publish their own creations. As more and more music is available to be consumed, it gets difficult for the user or rather customer to navigate through it. By General Terms giving music recommendations, recommender systems help Algorithms, Experimentation the user to identify music he or she wants to listen to with- out browsing through the whole collection. By supporting Keywords the user finding items he or she likes, the platform opera- tors benefit from an increased usability and thus increase Music Recommender Systems, Collaborative Filtering, So- the customer satisfaction. cial Media As the recommender system implemented in this work de- livers suitable results, we will gradually enlarge the dataset 1. INTRODUCTION by further sources and assess how the enlargements influ- More and more music is available to be consumed, due ences the performance of the recommender system in fu- to new distribution channels enabled by the rise of the web. ture work. Additionally, as the dataset also contains time Those new distribution channels, for instance music stream- stamps and a part of the captured tweets contains a ge- ing platforms, generate and store valuable data about users olocation, more sophisticated recommendation approaches and their listening behavior. However, most of the time the utilizing these additional context based information can be data gathered by these companies is not publicly available. compared against the baseline pure CF-based approach in There are datasets available based on such private data cor- future works. pora, which are widely used for implementing and evaluating The remainder of this paper is structured as follows: in Section 2 we present the dataset creation process as well as the dataset itself in more detail. Afterwards, in Section 3 we briefly present the recommendation approach, which is eval- uated in Section 4. Before we present the conclusion drawn from the evaluation on Section 6, related work is discussed in Section 5. Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. 1 http://www.spotify.com In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 2 Workshop on Foundations of Databases (Grundlagen von Datenbanken), http://soundcloud.com 3 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. http://promodj.com 35 2. THE SPOTIFY DATASET 2.2 Dataset Description In this Section, the used dataset 4 for developing and eval- Based on the raw data presented in the previous Sec- uating the recommender system is presented. 
tion, we generate a final dataset of - triples which contains 5,504,496 tweets of 569,722 unique 2.1 Dataset Creation users who listened to 322,647 tracks by 69,271 artists. In For the crawling of a sufficiently large dataset, we relied on this final dataset, users considered as not valuable for rec- the Twitter Streaming API which allows for crawling tweets ommendations, i.e., the @SpotifyNowPlay Twitter account containing specified keywords. Since July 2011, we crawled which retweets tweets sent via @Spotify, are removed. These for tweets containing the keywords nowplaying, listento and users were identified manually by the authors. listeningto. Until October 2014, we were able to crawl more As typical for social media datasets, our dataset has a than 90 million tweets. In contrast to other contributions long-tailed distribution among the users and their respective aiming at extracting music information from Twitter, where number of posted tweets [5]. This means that there are only the tweet’s content is used to extract artist and track in- a few number of users tweeting rather often in this dataset formation from [17, 7, 16], we propose to exploit the subset and numerous users are tweeted rarely which can be found of crawled tweets containing a URL leading to the website in the long-tail. This long-tailed distribution can be seen in of the Spotify music streaming service5 . I.e., information Table 2 and Figure 1, where the logarithm of the number of about the artist and the track are extracted from the web- tweets is plotted against the corresponding number of users. site mentioned in the tweet, rather than from the content of the tweet. This enables us an unambiguous resolution Number of Tweets Number of Users of the tweets, in contradiction to the contributions men- >0 569,722 tioned above, where the text of the tweets is compared to >1 354,969 entries in the reference database using some similarity mea- >10 91,217 sure. A typical tweet, published via Spotify, is depicted in >100 7,419 the following: “#nowPlaying I Tried by Total on #Spotify >1,000 198 http://t.co/ZaFH ZAokbV”, where a user published that he or she listened to the track “I Tried” by the band “Total” on Table 2: Number of Tweets and Number of Users Spotify. Additionally, a shortened URL is provided. Besides this shortened URL, Twitter also provides the according re- solved URL via its API. This allows for directly identifying all Spotify-URLs by searching for all URLs containing the string “spotify.com” or “spoti.fi”. By following the identified 4,000 URLs, the artist and the track can be extracted from the title tag of the according website. For instance, the title of the website behind the URL stated above is “I tried 1,000 by Total on Spotify ”. Using the regular expression “(.*) by (.*) on.*” the name of the track (group 1) and the artist (group 2) can be extracted. log(Number of Tweets) By applying the presented approach to the crawled tweets, we were able to extract artist and track information from 100 7.08% of all tweets or rather 49.45% of all tweets containing at least one URL. We refer to the subset of tweets, for which we are able to extract the artist and the track, as “matched tweets”. An overview of the captured tweets is given in Table 1. 1.94% of the tweets containing a Spotify-URL couldn’t 10 be matched due to HTTP 404 Not Found and HTTP 500 Internal Server errors. 
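The matching step described above (filtering resolved URLs for "spotify.com" or "spoti.fi" and applying the regular expression "(.*) by (.*) on.*" to the page title) can be sketched as follows. The Java fragment is a minimal illustration with hypothetical identifiers; it is not the crawler used to build the dataset.

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the matching step: page titles of resolved Spotify URLs are parsed
// with the regular expression "(.*) by (.*) on.*", yielding the track (group 1)
// and the artist (group 2). Identifiers are our own.
public class SpotifyTitleParser {

    private static final Pattern TITLE_PATTERN = Pattern.compile("(.*) by (.*) on.*");

    static boolean isSpotifyUrl(String resolvedUrl) {
        return resolvedUrl.contains("spotify.com") || resolvedUrl.contains("spoti.fi");
    }

    static Optional<String[]> extractTrackAndArtist(String pageTitle) {
        Matcher m = TITLE_PATTERN.matcher(pageTitle);
        if (m.matches()) {
            return Optional.of(new String[]{m.group(1), m.group(2)});  // {track, artist}
        }
        return Optional.empty();                                       // unmatched tweet
    }

    public static void main(String[] args) {
        extractTrackAndArtist("I tried by Total on Spotify")
                .ifPresent(r -> System.out.println("track=" + r[0] + ", artist=" + r[1]));
    }
}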
Restriction Number of Tweets Percentage None 90,642,123 100.00% At least one URL 12,971,482 14.31% A Spotify-URL 6,541,595 7.22% 0 50,000 100,000 150,000 200,000 Number of Users Matched 6,414,702 7.08% Table 1: Captured and Matched Tweets Figure 1: Number of Tweets versus Number of Users Facilitating the dataset creation approach previously pre- sented, we are able to gather 6,414,702 tweets and extract The performance of a pure collaborative filtering-based artist and track data from the contained Spotify-URLs. recommender system increases with the detailedness of a user profile. Especially for new users in a system, where no or only little data is available about them, this poses a 4 available at: http://dbis-twitterdata.uibk.ac.at/ problem as no suitable recommendations can be computed. spotifyDataset/ In our case, problematic users are users who tweeted rarely 5 http://www.spotify.com and thus can be found in the long tail. 36 Besides the long-tail among the number of posted tweets, based on the listening histories of the user. The Jaccard- there is another long-tail among the distribution of the artist Coefficient is defined in Equation 1 and measures the pro- play-counts in the dataset: there are a few popular artists portion of common items in two sets. occurring in a large number of tweets and many artists are mentioned only in a limited number of tweets. This is shown |Ai ∩ Aj | in Figure 2, where the logarithm of the number of tweets in jaccardi,j = (1) |Ai ∪ Aj | which an artist occurs in (the play-count) is plotted against the number of artists. Thus, this plot states how many For each user, there are two listening histories we take artists are mentioned how often in the dataset. into consideration: the set of all tracks a user listened to and the set of all artists a user listened to. Thus, we are able to compute a artist similartiy (artistSim) and a track similarity (trackSim) as shown in Equations 2 and 3. |artistsi ∩ artistsj | artistSimi,j = (2) |artistsi ∪ artistsj | 4,000 |tracksi ∩ tracksj | trackSimi,j = (3) 1,000 |tracksi ∪ tracksj | log(Number of Tweets) The final user similarity is computed using a weighted average of both, the artistSim and trackSim as depicted in Equation 4. 100 simi,j = wa ∗ artistSimi,j + wt ∗ trackSimi,j (4) The weights wa and wt determine the influence of the 10 artist- and the track listening history on the user similar- ity, where wa + wt = 1. Thus, if wt = 0, only the artist listening history is taken into consideration. We call such a recommender system an artist-based recommender system. Analogously, if wa = 0 we call such a recommender system track-based. If wa > 0 ∧ wt > 0, both the artist- and track 0 5000 10000 15000 20000 Number of Artists listening histories are used. Hence, we facilitate a hybrid recommender system for artist recommendations. The presented weights have to be predetermined. In this Figure 2: Play-Count versus Number of Artists work, we use a grid-search for finding suitable input param- eter for our recommender system as described in Section 4.2. How the presented dataset is used as input- and evaluation data for a music recommender system, is presented in the 4. EVALUATION next Section. In this Section we present the performance of the imple- mented artist recommender system, but also discuss the lim- 3. THE BASELINE RECOMMENDATION AP- itations of the conducted offline evaluation. 
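Equations (1)-(4) translate directly into code. The following Java sketch is our own illustration (it does not use the Mahout API): it computes the Jaccard coefficients over the artist and track listening histories and combines them with the weights wa and wt, where wa + wt = 1.

import java.util.HashSet;
import java.util.Set;

// Sketch of the user similarity from Equations (1)-(4): Jaccard coefficients over
// the artist and track listening histories, combined as a weighted average with
// wa + wt = 1. Identifiers are our own.
public class UserSimilarity {

    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;                              // convention chosen here for two empty histories
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<T> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();   // Equation (1)
    }

    static double similarity(Set<String> artistsI, Set<String> artistsJ,
                             Set<String> tracksI, Set<String> tracksJ,
                             double wa, double wt) {
        double artistSim = jaccard(artistsI, artistsJ);        // Equation (2)
        double trackSim = jaccard(tracksI, tracksJ);           // Equation (3)
        return wa * artistSim + wt * trackSim;                 // Equation (4)
    }
}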
PROACH In order to present how the dataset can be applied, we 4.1 Evaluation Setup use our dataset as input and evaluation data for an artist The performance of the recommender system with differ- recommendation system. This recommender system is based ent input parameters was evaluated using precision and re- on the open source machine learning library Mahout[2]. The call. Although we focus on the precision, for the sake of com- performance of this recommender system is shown in Section pleteness we also include the recall into the evaluation, as 4 and serves as a benchmark for future work. this is usual in the field of information retrieval [3]. The met- rics were computed using a Leave-n-Out algorithm, which 3.1 Recommendation Approach can be described as follows: For showing the usefulness of our dataset, we implemented a User-based CF approach. User-based CF recommends 1. Randomly remove n items from the listening history items by solely utilizing past user-item interactions. For the of a user music recommender system, a user-item interaction states 2. Recommend m items to the user that a user listened to a certain track by a certain artist. Thus, the past user-item interactions represent the listening 3. Calculate precision and recall by comparing the m rec- history of a user. In the following, we describe our basic ommended and the n removed items approach taken for computing artist recommendations and provide details about the implementation. 4. Repeat step 1 to 3 p times In order to estimate the similarity of two users, we com- puted a linear combination of the Jaccard-Coefficients [10] 5. Calculate the mean precision and the mean recall 37 Each evaluation in the following Sections has been re- peated five times (p = 5) and the size of the test set was fixed to 10 items (r = 10). Thus, we can evaluate the per- formance of the recommender for recommending up to 10 0.5 items. 4.2 Determining the Input Parameters In order to determine good input parameters for the rec- 0.4 ommender system, a grid search was conducted. Therefore, we define a grid of parameters and the possible combina- Recommender Precision tions are evaluated using a performance measure [9]. In our ● Artist 0.3 case, we relied on the precision of the recommender system Hybrid (cf. Figure 3), as the task of a music recommender system Track is to find a certain number of items a user will listen to (or buy), but not necessarily to find all good items. Precision 0.2 is a reasonable metric for this so called Find Good Items task [8] and was assessed using the explained Leave-n-Out algorithm. For this grid search, we recommended one item 0.1 ● ● ● ● and the size of the test set was fixed to 10 items. In order ● ● ● ● ● to find good input parameters, the following grid parame- ● ● ters determining the computation of the user similarity were altered: 0.0 ● 0 10 20 30 40 50 60 70 80 90 100 • Number of nearest neighbors k k−Nearest Neighbors • Weight of the artist similarity wa Figure 3: Precision and Recall of the Track-Based • Weight of the track similarity wt Recommender The result can be seen in Figure 3. For our dataset it n Precision Recall Upper Bound holds, that the best results are achieved with a track-based 1 0.49 0.05 0.10 recommender system (wa = 0,wt = 1) and 80 nearest neigh- 5 0.23 0.11 0.50 bors (k = 80). 
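A single round of the Leave-n-Out procedure listed above can be sketched as follows. The recommender itself is abstracted behind a placeholder interface and all identifiers are our own; precision is the fraction of the m recommended items that were held out, recall the fraction of the n held-out items that were recommended.

import java.util.*;

// Sketch of one Leave-n-Out round: n held-out items, m recommendations,
// precision = hits/m, recall = hits/n. The recommender is a placeholder
// interface; in the paper a Mahout-based user-based CF system is used.
public class LeaveNOutEvaluation {

    interface Recommender {
        List<String> recommend(Set<String> remainingHistory, int m);
    }

    static double[] evaluateOnce(Set<String> fullHistory, Recommender recommender,
                                 int n, int m, Random rnd) {
        List<String> items = new ArrayList<>(fullHistory);
        Collections.shuffle(items, rnd);
        int cut = Math.min(n, items.size());
        Set<String> heldOut = new HashSet<>(items.subList(0, cut));        // removed from the history
        Set<String> remaining = new HashSet<>(items.subList(cut, items.size()));

        List<String> recommended = recommender.recommend(remaining, m);
        long hits = recommended.stream().filter(heldOut::contains).count();

        double precision = recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
        double recall = heldOut.isEmpty() ? 0.0 : (double) hits / heldOut.size();
        return new double[]{precision, recall};                            // averaged over p repetitions
    }
}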
Thus, for the performance evaluation of the 6 0.20 0.12 0.60 recommender system in the next Section, we use the follow- 7 0.19 0.13 0.70 ing parameters: 10 0.15 0.15 1.00 • Number of nearest neighbors 80 Table 3: Precision and Recall of the Track-Based • Weight of the artist similarity 0 Recommender • Weight of the track similarity 1 As shown in Figure 4, with an increasing number of recom- mendations, the performance of the presented recommender 4.3 Performance of the Baseline Recommender system declines. Thus, for a high number of recommenda- System tions the recommender system is rather limited. This is, In this Section, the performance of the recommender sys- as the chance of false positives increases if the size of the tem using the optimized input parameters is presented. Prior test set is kept constant. For computing the recall metric, to the evaluation, we also examined real implementations the 10 items in the test set are considered as relevant items of music recommender systems: Last.fm, a music discovery (and hence are desirable to recommend to the user). The service, for instance recommends 6 artists6 when display- recall metric describes the fraction of relevant artists who ing a certain artist. If an artist is displayed on Spotify7 , are recommended, i.e., when recommending 5 items, even 7 similar artists are recommended at the first page. This if all items are considered relevant, the maximum recall is number of items also corresponds to the work of Miller [11], still only 50% as 10 items are considered as relevant. Thus, who argues that people are able to process about 7 items at in the evaluation setup, recall is bound by an upper limit, a glance, or rather that the span of attention is too short which is the number of recommended items divided by the for processing long lists of items. The precision@6 and the size of the test set. precision@7 of our recommender are 0.20 and 0.19, respec- tively. In such a setting, 20% of the recommended items 4.4 Limitations of the Evaluation computed by the proposed recommender system would be a Beside discussing the results, it is worth to mention also hit. In other words, a customer should be interested in at two limitations in the evaluation approach: First, only rec- least in two of the recommended artists. An overview about ommendations for items the user already interacted with can the precision@n of the recommender is given in Table 3. be evaluated [5]. If something new is recommended, it can’t 6 http://www.last.fm/music/Lana+Del+Rey be stated whether the user likes the item or not. We can 7 only state that it is not part of the user’s listening history http://play.spotify.com/artist/ 00FQb4jTyendYWaN8pK0wa in our dataset. Thus, this evaluation doesn’t fit to the per- 38 1.0 by monitoring users using the Yahoo! Music Services be- tween 2002 and 2006. Again, the MSD dataset, the Yahoo 0.9 dataset is less recent. Additionally to the ratings, the Yahoo dataset contains genre information which can be exploited 0.8 by a hybrid recommender system. Celma also provides a music dataset, containing data re- 0.7 trieved from last.fm10 , a music discovery service. It con- tains user, artists and play counts as well as the MusicBrainz identifiers for 360,000 users. This dataset was published in Precision / Recall 0.6 Legend 2010 [5]. ● Precision Beside the datasets presented above, which are based on 0.5 ● Recall data of private companies, there exist several datasets based Upper Bound on publicly available information. 
Sources exploited have 0.4 been websites in general [12, 15, 14], Internet radios posting ● their play lists [1] and micro-blogging platforms, in partic- 0.3 ● ular Twitter [17, 13]. However, using these sources has a ● drawback: For cleaning and matching the data, high effort ● 0.2 ● ● is necessary. ● ● ● One of the most similar datasets to the dataset used in 0.1 this work, is the Million Musical Tweets Dataset 11 dataset by Hauger et al. [7]. Like our dataset, it was created using ● 0.0 the Twitter streaming API from September 2011 to April 1 5 10 2013, however, all tweets not containing a geolocation were Number of Recommended Items removed and thus it is much smaller. The dataset con- tains 1,086,808 tweets by 215,375 users. Among the dataset, Figure 4: Precision and Recall of the Track-Based 25,060 unique artists have been identified [7]. Recommender Another dataset based on publicly available data which is similar to the MovieLens dataset, is the MovieTweetings dataset published by Dooms et al. [6]. The MovieTweet- fectly to the intended use of providing recommendations for ings dataset is continually updated and has the same format new artists. However, this evaluation approach enabled us as the MovieLens dataset, in order to foster exchange. At to find the optimal input parameters using a grid search. the moment, a snapshot containing 200,000 ratings is avail- Secondly, as we don’t have any preference values, the as- able12 . The dataset is generated by crawling well-structured sumption that a certain user likes the artist he/she listened tweets and extracting the desired information using regular to, has to be made. expressions. Using this regular expressions, the name of the Both drawbacks can be eliminated by conducting a user- movie, the rating and the corresponding user is extracted. centric evaluation [5]. Thus, in a future work, it would be The data is afterwards linked to the IMDb, the Internet worth to conduct a user-experiment using the optimized rec- Movie Database 13 . ommender system. 6. CONCLUSION AND FUTURE WORK 5. RELATED WORK In this work we have shown that the presented dataset As already mentioned in the introduction, there exist sev- is valuable for evaluating and benchmarking different ap- eral other publicly available datasets suitable for music rec- proaches for music recommendation. We implemented a ommendations. A quick overview of these datasets is given working music recommender systems, however as shown in in this Section. Section 4, for a high number of recommendations the perfor- One of the biggest available music datasets is the Million mance of our baseline recommendation approach is limited. Song Dataset (MSD) [4]. This dataset contains information Thus, we see a need for action at two points: First we will about one million songs from different sources. Beside real enrich the dataset with further context based information user play counts, it provides audio features of the songs and that is available: in this case this can be the time stamp is therefore suitable for CF-, CB- and hybrid recommender or the geolocation. Secondly, hybrid recommender system systems. At the moment, the Taste Profile subset8 of the utilizing this additional context based information are from MSD is bigger than the dataset presented in this work, how- interest. Therefore, in future works, we will focus on the ever it was released 2011 and is therefore not as recent. implementation of such recommender systems and compare Beside the MSD, also Yahoo! 
published big datasets9 con- them to the presented baseline approach. First experiments taining ratings for artists and songs suitable for CF. The were already conducted with a recommender system trying biggest dataset contains 136,000 songs along with ratings to exploit the geolocation. Two different implementations given by 1.8 million users. Additionally, the genre informa- are evaluated at the moment: The first uses the normalized tion is provided in the dataset. The data itself was gathered linear distance between two users for approximating a user 10 8 http://labrosa.ee.columbia.edu/millionsong/ http://www.last.fm 11 tasteprofile available at: http://www.cp.jku.at/datasets/MMTD/ 9 12 available at: http://webscope.sandbox.yahoo.com/ https://github.com/sidooms/MovieTweetings 13 catalog.php?datatype=r http://www.imdb.com 39 similarity. The second one, which in an early stage of eval- [14] M. Schedl, P. Knees, and G. Widmer. Investigating uation seems to be the more promising one, increases the web-based approaches to revealing prototypical music user similarity if a certain distance threshold is underrun. artists in genre taxonomies. In Proceedings of the 1st However, there remains the open question how to determine International Conference on Digital Information this distance threshold. Management (ICDIM 2006), pages 519–524. IEEE, 2006. 7. REFERENCES [15] M. Schedl, C. C. Liem, G. Peeters, and N. Orio. A [1] N. Aizenberg, Y. Koren, and O. Somekh. Build your Professionally Annotated and Enriched Multimodal own music recommender by modeling internet radio Data Set on Popular Music. In Proceedings of the 4th streams. In Proceedings of the 21st International ACM Multimedia Systems Conference (MMSys 2013), Conference on World Wide Web (WWW 2012), pages pages 78–83, February–March 2013. 1–10. ACM, 2012. [16] M. Schedl and D. Schnitzer. Hybrid Retrieval [2] Apache Software Foundation. What is Apache Approaches to Geospatial Music Recommendation. In Mahout?, March 2014. Retrieved July 13, 2014, from Proceedings of the 35th Annual International ACM http://mahout.apache.org. SIGIR Conference on Research and Development in [3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval (SIGIR), 2013. Information Retrieval: The Concepts and Technology [17] E. Zangerle, W. Gassler, and G. Specht. Exploiting behind Search (2nd Edition) (ACM Press Books). twitter’s collective knowledge for music Addison-Wesley Professional, 2 edition, 2011. recommendations. In Proceedings of the 2nd Workshop [4] T. Bertin-Mahieux, D. P. W. Ellis, B. Whitman, and on Making Sense of Microposts (#MSM2012), pages P. Lamere. The million song dataset. In A. Klapuri 14–17, 2012. and C. Leider, editors, Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), pages 591–596. University of Miami, 2011. [5] Ò. Celma. Music Recommendation and Discovery - The Long Tail, Long Fail, and Long Play in the Digital Music Space. Springer, 2010. [6] S. Dooms, T. De Pessemier, and L. Martens. Movietweetings: a movie rating dataset collected from twitter. In Workshop on Crowdsourcing and Human Computation for Recommender Systems at the 7th ACM Conference on Recommender Systems (RecSys 2013), 2013. [7] D. Hauger, M. Schedl, A. Kosir, and M. Tkalcic. The million musical tweet dataset - what we can learn from microblogs. In A. de Souza Britto Jr., F. Gouyon, and S. 
Dixon, editors, Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pages 189–194, 2013. [8] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5–53, Jan. 2004. [9] C. W. Hsu, C. C. Chang, and C. J. Lin. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2003. [10] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, Feb. 1912. [11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. 62:81–97, 1956. [12] A. Passant. dbrec - Music Recommendations Using DBpedia. In Proceedings of the 9th International Semantic Web Conference (ISWC 2010), volume 6497 of Lecture Notes in Computer Science, pages 209–224. Springer Berlin Heidelberg, 2010. [13] M. Schedl. Leveraging Microblogs for Spatiotemporal Music Information Retrieval. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013), pages 796 – 799, 2013. 40 Incremental calculation of isochrones regarding duration Nikolaus Krismer Günther Specht Johann Gamper University of Innsbruck, University of Innsbruck, Free University of Austria Austria Bozen-Bolzano, Italy nikolaus.krismer@uibk.ac.at guenther.specht@uibk.ac.at gamper@inf.unibz.it ABSTRACT target. The websites enabling such a navigation usually cal- An isochrone in a spatial network is the minimal, possibly culate routes using efficient shortest path (SP) algorithms. disconnected subgraph that covers all locations from where One of the most famous examples of these tools is Google’s a query point is reachable within a given time span and by map service named GoogleMaps1 . For a long time it was a given arrival time [5]. A novel approach for computing possible to calculate routes using one transportation system isochrones in multimodal spatial networks is presented in (by car, by train or by bus) only. This is known as rout- this paper. The basic idea of this incremental calculation is ing within unimodal spatial networks. Recent developments to reuse already computed isochrones when a new request enabled the computation combining various transportation with the same query point is sent, but with different dura- systems within the same route, even if some systems are tion. Some of the major challenges of the new calculation bound to schedules. This has become popular under the attempt are described and solutions to the most problematic term “multimodal routing” (or routing in multimodal spa- ones are outlined on basis of the already established MINE tial networks). and MINEX algorithms. The development of the incremen- Less famous, but algorithmic very interesting, is to find tal calculation is done by using six different cases of com- the answer to the question where someone can travel to in putation. Three of them apply to the MINEX algorithm, a given amount of time starting at a certain time from a which uses a vertex expiration mechanism, and three cases given place. The result is known as isochrone. Within mul- to MINE without vertex expiration. Possible evaluations are timodal spatial networks it has been defined by Gamper et also suggested to ensure the correctness of the incremental al. [5]. Websites using isochrones include Mapnificent2 and calculation. 
In the end some further tasks for future research SimpleFleet3 [4]. are outlined. One major advantage of isochrones is that they can be used for reachability analyses of any kind. They are help- ful in various fields including city planning and emergency Categories and Subject Descriptors management. While some providers, like SimpleFleet and H.2.8 [Database Applications]: Spatial databases and Mapnificent, enable the computation of isochrones based on GIS pre-calculated information or with heuristic data, the cal- culation of isochrones is a non-trivial and time-intense task. Although some improvements to the algorithms that can be General Terms used for isochrone computation have been published at the Algorithms Free University of Bozen-Bolzano in [7], one major drawback is that the task is always performed from scratch. It is not Keywords possible to create the result of a twenty-minute-isochrone (meaning that the travelling time from/to a query point q isochrone, incremental calculation is less than or equal to twenty minutes) based on the re- sult from a 15-minute-isochrone (the travelling time is often 1. INTRODUCTION referred to as maximal duration dmax). The incremental Throughout the past years interactive online maps have calculation could dramatically speed up the computation of become a famous tool for planning routes of any kind. Nowa- isochrones, if there are other ones for the same point q avail- days everybody with access to the internet is able to easily able. This is especially true for long travel times. However, get support when travelling from a given point to a specific the computation based on cached results has not been re- alised until now and is complex. As one could see from figures 1 and 2 it is not sufficient to extend the outline of the isochrone, because there might be some network hubs (e.g. stations of the public transportation system) which extend the isochrone result into new, possibly disconnected areas. Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. 1 http://maps.google.com In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 2 Workshop on Foundations of Databases (Grundlagen von Datenbanken), http://www.mapnificent.net 3 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. http://www.simplefleet.eu 41 troduced by Bauer et. al [1] suffers from high initial loading time and is limited by the available memory, since the entire network is loaded into memory at the beginning. Another algorithm, called Multimodal Incremental Network Expan- sion (MINE), which has been proposed by Gamper et al. [5] overcomes the limitation that the whole network has to be loaded, but is restricted to the size of the isochrone result, since all points in the isochrone are still located in memory. To overcome this limitation, the Multimodal Incremental Network Expansion with vertex eXpiration (MINEX) algo- rithm has been developed by Gamper et al. [6] introducing vertex expiration (also called node expiration). This mech- anism eliminates unnecessary nodes from memory as soon as possible and therefore reduces the memory needed during computation. There are some more routing algorithms that do not load the entire network into the main memory. One well-known, which is not specific for isochrone calculation, but for query processing in spatial networks in general, called Incremental Figure 1: Isochrone with dmax of 10 minutes Euclidean Restriction (IER), has been introduced by Papa- dias [8] in 2003. 
This algorithm loads chunks of the network into memory that are specified by the euclidean distance. The Incremental Network Expiration (INE) algorithm has also been introduced in the publication of Papadias. It is basically an extension of the Dijkstra shortest path algo- rithm. Deng et al. [3] improved the ideas of Papadias et al. accessing less network data to perform the calculations. The open source routing software “pgRouting”4 , which calculates routes on top of the spatial database PostGIS5 (an extension to the well-known relational database PostgreSQL) uses an approach similar to IER. Instead of the euclidean distance it uses the network distance to load the spatial network. In 2013 similar ideas have been applied to MINEX and resulted in an algorithm called Multimodal Range Network Expansion (MRNEX). It has been developed at the Free University of Bozen-Bolzano by Innerebner [7]. Instead of loading the needed data edge-by-edge from the network, it is loaded using chunks, like it is done in IER. Depending on their size this approach is able to reduce the number of network accesses by far and therefore reduces calculation Figure 2: Isochrone with dmax of 15 minutes time. Recently the term “Optimal location queries” has been proposed by some researches like Chen et al. [2]. These This paper presents the calculation of incremental isochrones queries are closely related to isochrones, since they “find a in multimodal spatial networks on top of already developed location for setting up a new server such that the maximum algorithms and cached results. It illustrates some ideas that cost of clients being served by the servers (including the new need to be addressed when extending the algorithms by the server) is minimized”. incremental calculation approach. The remainder of this pa- per is structured as follows. Section 2 includes related work. Section 3 is split into three parts: the first part describes 3. INCREMENTAL CALCULATION challenges that will have to be faced during the implemen- REGARDING ISOCHRONE DURATION tation of incremental isochrones. Possible solutions to the In this paper the MINE and MINEX algorithms are ex- outlined problems are also discussed shortly here. The sec- tended by a new idea that is defined as “incremental cal- ond part deals with different cases that are regarded during culation”. This allows the creation of new results based on computation and how these cases differ, while the third part already computed and cached isochrones with different du- points out some evaluations and tests that will have to be rations, but with the same query point q (defined as base- performed to ensure the correctness of the implementation. isochrones). This type of computation is complex, since it is Section 4 consists of a conclusion and lists some possible not sufficient to extend an isochrone from its border points. future work. In theory it is necessary to re-calculate the isochrone from every node in the spatial network that is part of the base- 2. RELATED WORK isochrone and connected to other nodes. Although this is 4 The calculation of isochrones in multimodal spatial net- http://pgrouting.org 5 works can be done using various algorithms. The method in- http://postgis.net 42 true for a highly connected spatial network it might not be nations can be triggered by a service provider. Traffic jams the only or even best way for a real-world multimodal spatial and similar factors can lead to delays in the transportation network with various transportation systems. 
The isochrone system and thus also have to be considered. Although it calculation based on already known results should be doable should be possible to overcome both limitations or at least with respect to all the isochrone’s border points and all the limit their impact, it will not be further discussed in this public transportation system stations that are part of the paper. base isochrone. These network hubs in reality are the only nodes, which can cause new, possibly disconnected areas to 3.2 Types of calculation become part of an isochrone with different travelling time. There are six different cases that have to be kept in mind As it is important for the incremental calculation, the ver- when calculating an isochrone with travelling time dmax us- tex expiration that is introduced by Gamper et al. in [6] ing a base isochrone with duration dmax_base: three apply- will now be summarized shortly. The aim of the proposed ing to algorithms without vertex expiration and three cases approach is to remove loaded network nodes as soon as pos- for the ones using vertex expiration. sible from memory. However, to keep performance high, nodes should never be double-loaded at any time and there- 3.2.1 Cases dmax = dmax_base fore they should not be eliminated from memory too soon. The first two and most simple cases for the MINE and Removal should only occur when all computations regard- MINEX algorithm, are the ones where dmax is equal to ing the node have been performed. States are assigned to dmax_base. In these cases it is obvious that the calculation every node to assist in finding the optimal timeslot for mem- result can be returned directly without any further modifi- ory elimination. The state of a node can either be “open”, cation. It is not needed to respect expired nodes, since no “closed” or “expired”. Every loaded node is labelled with the (re)calculation needs to be performed. open state in the beginning. If all of its outgoing edges are traversed, its state changes to closed. However, the node 3.2.2 Cases dmax < dmax_base itself has to be kept in memory in order to avoid cyclic The third, also simple case, is the one where dmax is less network expansions. A node reaches the expired state, if than dmax_base for algorithms without vertex expiration. all nodes in its neighbourhood reached the closed or expired In this situation all nodes can be iterated and checked for state. It then can safely be removed from memory and is not suitability. If the duration is less or equal to dmax, then available for further computations without reloading it from the node also belongs to the new result, otherwise it does the network. Since this is problematic for the incremental not. In the fourth case, where the duration is less than calculation approach this aspect is described in more detail. dmax_base and nodes were expired (and therefore are not available in memory any more), the isochrone can be shrunk 3.1 Challenges from its borders. The network hubs do not need any special treatment, since no new areas can become part of the result There are some challenges that need to be addressed when if the available time decreased. The only necessary task is implementing an incremental calculation for the MINE and the recalculation of the durations from the query point to MINEX algorithm. The most obvious problem is related to the nodes in the isochrone and to possibly reload expired the vertex expiration of the MINEX algorithm. If nodes al- nodes. 
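The expiration rule summarized above can be sketched compactly. The Java fragment below illustrates only the state transition (OPEN after loading, CLOSED once all outgoing edges are traversed, EXPIRED once every neighbour is closed or expired); types and names are our own, not taken from the MINEX implementation.

import java.util.List;

// Sketch of the vertex expiration rule: a node becomes CLOSED once all of its
// outgoing edges have been traversed and may be EXPIRED (removed from memory)
// as soon as every neighbour is CLOSED or EXPIRED. Types are illustrative.
public class VertexExpiration {

    enum State { OPEN, CLOSED, EXPIRED }

    static class Node {
        State state = State.OPEN;
        List<Node> neighbours;

        Node(List<Node> neighbours) {
            this.neighbours = neighbours;
        }
    }

    static boolean mayExpire(Node node) {
        if (node.state != State.CLOSED) {
            return false;                        // only closed nodes are candidates for removal
        }
        for (Node neighbour : node.neighbours) {
            if (neighbour.state == State.OPEN) {
                return false;                    // an open neighbour may still need this node
            }
        }
        return true;                             // all neighbours closed or expired
    }
}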
3.2 Types of calculation
There are six different cases that have to be kept in mind when calculating an isochrone with travelling time dmax using a base isochrone with duration dmax_base: three applying to the algorithms without vertex expiration and three for the ones using vertex expiration.

3.2.1 Cases dmax = dmax_base
The first two and most simple cases for the MINE and MINEX algorithms are the ones where dmax is equal to dmax_base. In these cases it is obvious that the calculation result can be returned directly without any further modification. It is not needed to respect expired nodes, since no (re)calculation needs to be performed.

3.2.2 Cases dmax < dmax_base
The third, also simple case is the one where dmax is less than dmax_base for algorithms without vertex expiration. In this situation all nodes can be iterated and checked for suitability. If the duration is less than or equal to dmax, then the node also belongs to the new result, otherwise it does not. In the fourth case, where the duration is less than dmax_base and nodes were expired (and therefore are not available in memory any more), the isochrone can be shrunk from its borders. The network hubs do not need any special treatment, since no new areas can become part of the result if the available time decreased. The only necessary task is the recalculation of the durations from the query point to the nodes in the isochrone and to possibly reload expired nodes. This can be done either from the query point or from the border points. The duration d from the query point q to a network node n is then equal to (assuming that the border point with the minimal distance to n is named bp):

d(q, n) = d(q, bp) − d(bp, n)
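As a small illustration of the shrinking case and of the duration formula above, here is a Python sketch continuing the hypothetical CachedIsochrone structure from the previous sketch; it is not taken from the actual MINE/MINEX code.

```python
def shrink_without_expiration(cache: CachedIsochrone, d_max_new: float) -> dict[int, float]:
    """Case dmax < dmax_base for MINE: iterate all cached nodes and keep those
    whose duration from the query point does not exceed the new travelling time."""
    assert d_max_new < cache.d_max
    return {node: d for node, d in cache.nodes.items() if d <= d_max_new}

def duration_via_border_point(d_q_bp: float, d_bp_n: float) -> float:
    """Recompute d(q, n) = d(q, bp) - d(bp, n) for a node n lying between the
    query point q and the closest border point bp of the base isochrone."""
    return d_q_bp - d_bp_n
```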
3.2.3 Cases dmax > dmax_base
The remaining two cases, where dmax_base is less than dmax, are much more complex. They differ in the fact that new, possibly disconnected areas can become part of the result and therefore it is not sufficient to look at all the base isochrone's border points. The new areas become available as a result of connections caused by network hubs that are often bound to some kind of schedule. A real-world example is a train station where a train is leaving at time t_train due to its schedule and arriving at a remote station at or before time dmax (in fact any time later than dmax_base is feasible). The time t_train has to be later than the arrival time at the station (and after the isochrone's starting time). Since all network hubs are saved with all the needed information to the list l_hubs, it is not of any interest whether the algorithm uses vertex expiration or not. The points located at the isochrone's outline are still in memory. Since only network hubs can create new isochrone areas, it is sufficient to grow the isochrone from its border and from all the network hubs located in the isochrone. The only effect that vertex expiration causes is a smaller memory footprint of the calculation, as it would also do without incremental calculation.

In Table 1 and Table 2 the recently mentioned calculation types are summarised shortly. The six different cases can be distinguished with ease using these two tables.

Table 1: Incremental calculation without vertex expiration (MINE)
  dmax < dmax_base: iterate the nodes of the base isochrone, checking if the travel time is <= dmax
  dmax = dmax_base: no change
  dmax > dmax_base: extend the base isochrone from its border points and the list l_hubs

Table 2: Incremental calculation with vertex expiration (MINEX)
  dmax < dmax_base: shrink the base isochrone from its border
  dmax = dmax_base: no change
  dmax > dmax_base: extend the base isochrone from its border points and the list l_hubs

Although the different types of computation are introduced using the MINE and MINEX algorithms, they also apply to the MRNEX method. When using MRNEX the same basic idea can be used to enable incremental calculations. In addition, the same advantages and disadvantages apply to the incremental calculation using MRNEX compared to MINEX that also apply to the non-incremental setup.
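The six cases of Tables 1 and 2 can be summarised as one dispatch routine. The following Python sketch, again using the hypothetical CachedIsochrone structure from the earlier sketches and with placeholder expansion steps, only illustrates the case distinction, not the actual network expansion.

```python
def incremental_isochrone(cache: CachedIsochrone, d_max: float,
                          vertex_expiration: bool) -> dict[int, float]:
    """Dispatch the six cases of Tables 1 and 2 (illustrative sketch)."""
    if d_max == cache.d_max:
        return dict(cache.nodes)                    # MINE and MINEX: return the cached result
    if d_max < cache.d_max:
        if not vertex_expiration:
            # MINE: iterate the cached nodes and keep those within the new budget.
            return {node: d for node, d in cache.nodes.items() if d <= d_max}
        return shrink_from_border(cache, d_max)     # MINEX: shrink from the border
    return grow_from_border_and_hubs(cache, d_max)  # both: grow from border points and l_hubs

def shrink_from_border(cache: CachedIsochrone, d_max: float) -> dict[int, float]:
    # Placeholder: a real implementation walks inwards from the border points,
    # recomputing durations via d(q, n) = d(q, bp) - d(bp, n) and reloading expired nodes.
    return {node: d for node, d in cache.nodes.items() if d <= d_max}

def grow_from_border_and_hubs(cache: CachedIsochrone, d_max: float) -> dict[int, float]:
    # Placeholder: a real implementation continues the network expansion from every
    # border point and from every hub in cache.hubs with the enlarged budget d_max.
    return dict(cache.nodes)
```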
3.3 Evaluation
The evaluations that will need to be carried out to ensure the correctness of the implementation can be based on freely available datasets such as OpenStreetMap (http://www.openstreetmap.org). Schedules from various public transportation systems could be used; since they might be subject to licensing, it is planned to create some test schedules. This data can then be used as mockups and as a replacement for the license-bound real-world schedules. It is also planned to realise all the described tests in the context of a continuous integration setup. They will therefore be executed automatically, ensuring correctness throughout various software changes.

The basic idea of the evaluation is to calculate incremental isochrones on the basis of isochrones with different durations and to compare them with isochrones calculated without the incremental approach. If both results are exactly the same, the incremental calculation can be regarded as correct.

There will be various tests that need to be executed in order to cover all the different cases described in Section 3.2. As such, all the cases will be performed with and without vertex expiration. The durations of the base isochrones will cover the three cases per algorithm (less than, equal to and greater than the duration of the incrementally calculated isochrone). Additional tests, such as testing for vertex expiration of the incremental calculation result, will be implemented as well. Furthermore, the calculation times of both the incremental and the non-incremental approach will be recorded to allow comparison. The incremental calculation can only be seen as successful if there are situations where it performs better than the common calculation. As mentioned before, this is expected to be true for at least large isochrone durations, since large portions of the spatial network do not need to be loaded then.

Besides these automatically executed tests, it will be possible to perform manual tests using a graphical user interface. This system is under heavy development at the moment and has been named IsoMap. Regardless of its young state, it will enable any user to calculate isochrones with and without the incremental approach and to visually compare the results with each other.

4. CONCLUSION AND FUTURE WORK
In this paper an approach to enable the calculation of isochrones with the help of already known results was presented. The necessary steps will be realised in the near future, so that runtime comparisons between incrementally calculated isochrones and isochrones created without the presented approach will be available shortly. The ideas developed throughout this paper hardly influence the time needed for the calculation of base isochrones. The only additional complexity is generated by storing the list l_hubs besides the base isochrone. However, this is easy to manage and, since the list does not contain any complex data structures, the changes should be doable without any noticeable consequence for the runtime of the algorithms.

Future work will extend the incremental procedure to further calculation parameters, especially to the arrival time, the travelling speed and the query point q of the isochrone. Computations on top of cached results are also realisable for changing arrival times and/or travel speeds. It should even be possible to use base isochrones with completely different query points in the context of the incremental approach. If the isochrone calculation for a duration of twenty minutes reaches a point after five minutes, the 15-minute isochrone of this point has to be part of the computed result (if the arrival times are respected). Therefore, cached results can decrease the algorithm runtimes even for different query points, especially if they are calculated for points that can cause complex calculations, such as airports or train stations.

Open fields that could be addressed include incremental calculation under conditions where public transportation schedules may vary due to trouble in the traffic system. The influence of changes in the underlying spatial networks on the incremental procedure could also be part of future research. It is planned to use the incremental calculation approach to calculate city round trips and to allow the creation of sightseeing tours for tourists with the help of isochrones. This computation will soon be enabled in cities where it is not possible by now. Further improvements regarding the calculation runtime of isochrones can be made as well. In this field, some examinations with different databases and even with different types of databases (in particular graph databases and other NoSQL systems) are planned.

5. REFERENCES
[1] V. Bauer, J. Gamper, R. Loperfido, S. Profanter, S. Putzer, and I. Timko. Computing isochrones in multi-modal, schedule-based transport networks. In Proceedings of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS '08, pages 78:1–78:2, New York, NY, USA, 2008. ACM.
[2] Z. Chen, Y. Liu, R. C.-W. Wong, J. Xiong, G. Mai, and C. Long. Efficient algorithms for optimal location queries in road networks. In SIGMOD Conference, pages 123–134, 2014.
[3] K. Deng, X. Zhou, H. Shen, S. Sadiq, and X. Li. Instance optimal query processing in spatial networks. The VLDB Journal, 18(3):675–693, 2009.
[4] A. Efentakis, N. Grivas, G. Lamprianidis, G. Magenschab, and D. Pfoser. Isochrones, traffic and demographics. In SIGSPATIAL/GIS, pages 538–541, 2013.
[5] J. Gamper, M. Böhlen, W. Cometti, and M. Innerebner. Defining isochrones in multimodal spatial networks. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 2381–2384, New York, NY, USA, 2011. ACM.
[6] J. Gamper, M. Böhlen, and M. Innerebner. Scalable computation of isochrones with network expiration. In A. Ailamaki and S. Bowers, editors, Scientific and Statistical Database Management, volume 7338 of Lecture Notes in Computer Science, pages 526–543. Springer Berlin Heidelberg, 2012.
[7] M. Innerebner. Isochrone in Multimodal Spatial Networks. PhD thesis, Free University of Bozen-Bolzano, 2013.
[8] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In Proceedings of the 29th International Conference on Very Large Data Bases, VLDB '03, pages 802–813. VLDB Endowment, 2003.

Software Design Approaches for Mastering Variability in Database Systems
David Broneske, Sebastian Dorok, Veit Köppen, Andreas Meister (author names are in lexicographical order)
Otto-von-Guericke-University Magdeburg, Institute for Technical and Business Information Systems, Magdeburg, Germany
firstname.lastname@ovgu.de

ABSTRACT
For decades, database vendors have developed traditional database systems for different application domains with highly differing requirements. These systems are extended with additional functionalities to make them applicable for yet another data-driven domain. The database community observed that these "one size fits all" systems provide poor performance for special domains; systems that are tailored for a single domain usually perform better, have a smaller memory footprint, and consume less energy. These advantages do not only originate from different requirements, but also from differences within individual domains, such as using a certain storage device. However, implementing specialized systems means re-implementing large parts of a database system again and again, which is neither feasible for many customers nor efficient in terms of costs and time. To overcome these limitations, we envision applying techniques known from software product lines to database systems in order to provide tailor-made and robust database systems for nearly every application scenario with reasonable effort in cost and time.

General Terms
Database, Software Engineering

Keywords
Variability, Database System, Software Product Line

1. INTRODUCTION
In recent years, data management has become increasingly important in a variety of application domains, such as automotive engineering, life sciences, and web analytics. Every application domain has its unique functional and non-functional requirements, leading to a great diversity of database systems (DBSs). For example, automotive data management requires DBSs with small storage and memory consumption to deploy them on embedded devices. In contrast, big-data applications, such as in the life sciences, require large-scale DBSs which exploit the newest hardware trends, e.g., vectorization and SSD storage, to efficiently process and manage petabytes of data [8]. Exploiting variability to design a tailor-made DBS for applications while making the variability manageable, that is, keeping maintenance effort, time, and cost reasonable, is what we call mastering variability in DBSs.

Currently, DBSs are designed either as one-size-fits-all DBSs, meaning that all possible use cases or functionalities are integrated at implementation time into a single DBS, or as specialized solutions. The first approach does not scale down, for instance, to embedded devices. The second approach leads to situations where, for each new application scenario, data management is reinvented to overcome resource restrictions, new requirements, and rapidly changing hardware. This usually leads to an increased time to market, high development cost, as well as high maintenance cost. Moreover, both approaches provide limited capabilities for managing variability in DBSs. For that reason, software product line (SPL) techniques could be applied to the data management domain. In SPLs, variants are concrete programs that satisfy the requirements of a specific application domain [7]. With this, we are able to provide tailor-made and robust DBSs for various use cases. Initial results in the context of embedded systems expose the benefits of applying SPLs to DBSs [17, 22].

The remainder of this paper is structured as follows: In Section 2, we describe variability in a database system regarding hardware and software. We review three approaches to design DBSs in Section 3, namely the one-size-fits-all, the specialization, and the SPL approach. Moreover, we compare these approaches w.r.t. robustness and maturity of the provided DBSs, the effort of managing variability, and the level of tailoring for specific application domains. Because of the superiority of the SPL approach, we argue to apply this approach to the implementation process of a DBS. Hence, we provide research questions in Section 4 that have to be answered to realize the vision of mastering variability in DBSs using SPL techniques.

2. VARIABILITY IN DATABASE SYSTEMS
Variability in a DBS can be found in software as well as in hardware. Hardware variability is given due to the use of different devices with specific properties for data processing and storage. Variability in software is reflected by different functionalities that have to be provided by the DBS for a specific application. Additionally, the combination of hardware and software functionality for concrete application domains increases variability.

2.1 Hardware
In the past decade, the research community exploited arising hardware features by tailor-made algorithms to achieve optimized performance. These algorithms effectively utilize, e.g., caches [19] or vector registers of Central Processing Units (CPUs) using AVX [27] and SSE instructions [28]. Furthermore, the usage of co-processors for accelerating data processing opens up another dimension [12]. In the following, we consider processing and storage devices and sketch the variability arising from their different properties.
2.1.1 Processing Devices
To sketch the heterogeneity of current systems, possible (co-)processors are summarized in Figure 1. Current systems do not only include a CPU or an Accelerated Processing Unit (APU), but also co-processors, such as Many Integrated Cores (MICs), Graphical Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). In the following, we give a short description of the varying processor properties. A more extensive overview is presented in our recent work [5].

Figure 1: Future system architecture [23] (CPU, APU, MIC, GPU, and FPGA attached to main memory via front-side bus, memory bus, and PCIe bus; HDD and SSD attached via the I/O controller)

Central Processing Unit: Nowadays, CPUs consist of several independent cores, enabling parallel execution of different calculations. CPUs use pipelining, Single Instruction Multiple Data (SIMD) capabilities, and branch prediction to efficiently process conditional statements (e.g., if statements). Hence, CPUs are well suited for control-intensive algorithms.

Graphical Processing Unit: Providing larger SIMD registers and a higher number of cores than CPUs, GPUs offer a higher degree of parallelism compared to CPUs. In order to perform calculations, data has to be transferred from main memory to GPU memory. GPUs offer their own memory hierarchy with different memory types.

Accelerated Processing Unit: APUs were introduced to combine the advantages of CPUs and GPUs by including both on one chip. Since the APU can directly access main memory, the transfer bottleneck of dedicated GPUs is eliminated. However, due to space limitations, considerably fewer GPU cores fit on the APU die compared to a dedicated GPU, leading to reduced computational power compared to dedicated GPUs.

Many Integrated Core: MICs use several integrated and interconnected CPU cores. With this, MICs offer high parallelism while still featuring CPU properties. However, similar to the GPU, MICs suffer from the transfer bottleneck.

Field Programmable Gate Array: FPGAs are programmable stream processors, providing only a limited storage capacity. They consist of several independent logic cells consisting of a storage unit and a lookup table. The interconnect between logic cells and the lookup tables can be reprogrammed during run time to perform any possible function (e.g., sorting, selection).

2.1.2 Storage Devices
Similar to the processing devices, current systems offer a variety of different storage devices used for data processing. In this section, we discuss different properties of current storage devices.

Hard Disk Drive: The Hard Disk Drive (HDD), as a non-volatile storage device, consists of several disks. The disks of an HDD rotate, while a movable head reads or writes information. Hence, sequential access patterns are well supported, in contrast to random accesses.

Solid State Drive: Since no mechanical units are used, Solid State Drives (SSDs) support random access without high delay. For this, SSDs use flash memory to persistently store information [20]. Each write wears out the flash cells. Consequently, the write patterns of database systems must be changed compared to HDD-based systems.

Main Memory: When using main memory as the main storage, the access gap between primary and secondary storage is removed, introducing main-memory access as the new bottleneck [19]. However, main-memory systems cannot omit secondary storage types completely, because main memory is volatile. Thus, efficient persistence mechanisms are needed for main-memory systems.

To conclude, current architectures offer several different processor and storage types. Each type has a unique architecture and specific characteristics. Hence, to ensure high performance, the processing characteristics of the processors as well as the access characteristics of the underlying storage devices have to be considered. For example, if several processing devices are available within a DBS, the DBS must provide suitable algorithms and functionality to fully utilize all available devices to provide peak performance.

2.2 Software Functionality
Besides hardware, DBS functionality is another source of variability in a DBS. In Figure 2, we show an excerpt of DBS functionalities and their dependencies. For example, for different application domains different query types might be interesting. However, to improve performance or development cost, only the required query types should be used within a system. This example can be extended to other functional requirements. Furthermore, a DBS provides database operators, such as aggregation functions or joins. Thereby, database operators perform differently depending on the used storage and processing model [1]. For example, row stores are very efficient when complete tuples should be retrieved, while column stores in combination with operator-at-a-time processing enable fast processing of single columns [18]. Another technique to enable efficient access to data is to use index structures. Thereby, the choice of an appropriate index structure for the specific data and query types is essential to guarantee the best performance [15, 24]. Note that we omit comprehensive relationships between functionality properties in Figure 2 due to complexity. Some functionalities are mandatory in a DBS and others are optional, such as support for transactions. Furthermore, it is possible that some alternatives can be implemented together and others only exclusively.

Figure 2: Excerpt of DBMS functionality as a feature diagram (mandatory, optional, OR, and XOR features: query types, storage model, processing model, operators such as join, selection, sorting, and grouping, and transactions)
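To illustrate what such a feature diagram expresses, the following Python sketch encodes a small, hypothetical excerpt of Figure 2 (mandatory features and XOR groups) and checks whether a feature selection is a valid product; real SPL tooling offers far richer analyses.

```python
# Hypothetical, simplified encoding of part of Figure 2; the feature names are taken
# from the diagram, the rules and function names are illustrative only.
MANDATORY = {"Query Type", "Storage Model", "Processing Model", "Operator"}
XOR_GROUPS = {
    "Storage Model": {"Row Store", "Column Store"},
    "Processing Model": {"Operator-at-a-time", "Tuple-at-a-time", "Vectorized Processing"},
}

def is_valid_product(selection: set[str]) -> bool:
    """A selection is valid if all mandatory features are present and every
    selected XOR group contains exactly one chosen alternative."""
    if not MANDATORY <= selection:
        return False
    return all(len(selection & alternatives) == 1
               for parent, alternatives in XOR_GROUPS.items() if parent in selection)

# A column-store product with operator-at-a-time processing and no transactions:
print(is_valid_product({"Query Type", "Storage Model", "Column Store",
                        "Processing Model", "Operator-at-a-time", "Operator"}))  # True
```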
2.3 Putting it all together
So far, we considered variability in hardware and software functionality separately. When using a DBS for a specific application domain, we also have to consider special requirements of this domain as well as the interaction between hardware and software.

Special requirements comprise functional as well as non-functional ones. Examples of functional requirements are user-defined aggregation functions (e.g., to perform genome analysis tasks directly in a DBS [9]). Other applications require support for spatial queries, such as geo-information systems. Thus, special data types as well as index structures are required to support these queries efficiently.

Besides performance, memory footprint and energy efficiency are other examples of non-functional requirements. For example, a DBS for embedded devices must have a small memory footprint due to resource restrictions. For that reason, unnecessary functionality is removed and data processing is implemented as memory-efficiently as possible. In this scenario, tuple-at-a-time processing is preferred, because intermediate results during data processing are smaller than in operator-at-a-time processing, which leads to less memory consumption [29].

In contrast, in large-scale data processing, operators should perform as fast as possible by exploiting the underlying hardware and available indexes. Thereby, exploiting the underlying hardware is another source of variability, as different processing devices have different characteristics regarding processing model and data access [6]. To illustrate this fact, we depict different storage models for a DBS in Figure 2. For example, column storage is preferred on GPUs, because row storage leads to an inefficient memory access pattern that deteriorates the possible performance benefits of GPUs [13].

3. APPROACHES TO DESIGN TAILOR-MADE DATABASE SYSTEMS
The variability in hardware and software of DBSs can be exploited to tailor database systems for nearly every database-application scenario. For example, a DBS for high-performance analysis can exploit the newest hardware features, such as SIMD, to speed up analysis workloads. Moreover, we can meet limited space requirements in embedded systems by removing unnecessary functionality [22], such as the support for range queries. However, exploiting variability is only one part of mastering variability in DBSs. The second part is to manage variability efficiently to reduce development and maintenance effort.

In this section, we first describe three different approaches to design and implement DBSs. Then, we compare these approaches regarding their applicability to arbitrary database scenarios. Moreover, we assess the effort to manage variability in DBSs. Besides managing and exploiting the variability in database systems, we also consider the robustness and correctness of tailor-made DBSs created by using the discussed approaches.

3.1 One-Size-Fits-All Design Approach
One way to design database systems is to integrate all conceivable data management functionality into one single DBS. We call this approach the one-size-fits-all design approach and DBSs designed according to this approach one-size-fits-all DBSs. Thereby, support for hardware features as well as DBMS functionality is integrated into one code base. Thus, one-size-fits-all DBSs provide a rich set of functionality. Examples of database systems that follow the one-size-fits-all approach are PostgreSQL, Oracle, and IBM DB2. As one-size-fits-all DBSs are monolithic software systems, the implemented functionality is highly interconnected on the code level. Thus, removing functionality is mostly not possible. DBSs that follow the one-size-fits-all design approach aim at providing a comprehensive set of DBS functionality to deal with most database application scenarios. The claim for generality often introduces functional overhead that leads to performance losses. Moreover, customers pay for functionality they do not really need.

3.2 Specialization Design Approach
In contrast to one-size-fits-all DBSs, DBSs can also be designed and developed to fit very specific use cases. We call this design approach the specialization design approach and DBSs designed accordingly specialized DBSs. Such DBSs are designed to provide only the functionality that is needed for the respective use case, such as text processing, data warehousing, or scientific database applications [25]. Specialized DBSs are often completely redesigned from scratch to meet application requirements and do not follow common design considerations for database systems, such as locking and latching to guarantee multi-user access [25]. Specialized DBSs remove the overhead of unneeded functionality. Thus, developers can highly focus on exploiting hardware and functional variability to provide tailor-made DBSs that meet high-performance criteria or limited storage space requirements. Therefore, huge parts of the DBS (if not all) must be newly developed, implemented, and tested, which leads to duplicate implementation efforts, and thus, increased development costs.

3.3 Software Product Line Design Approach
In the specialization design approach, a new DBS must be developed and implemented from scratch for every conceivable database application. To avoid this overhead, the SPL design approach reuses already implemented and tested parts of a DBS to create a tailor-made DBS.

Figure 3: Managing Variability (the SPL workflow from domain analysis with a feature model, over the domain implementation of the single features, to customization by a feature selection and the final product generation, illustrated with FAME-DBMS features such as OS-Abstraction, Buffer Manager, Storage, Data Dictionary, Data Types, Index, and B+-Tree)

To make use of SPL techniques, a special workflow has to be followed, which is sketched in Figure 3 [2]. At first, the domain is modeled, e.g., by using a feature model – a tree-like structure representing features and their dependencies. With this, the variability is captured and implementation artifacts can be derived for each feature. The second step, the domain implementation, is to implement each feature using a compositional or annotative approach. The third step of the workflow is to customize the product – in our case, the database system – which will be generated afterwards. By using the SPL design approach, we are able to implement a database system from a set of features which are mostly already provided. In the best case, only non-existing features must be implemented. Thus, the feature pool constantly grows and features can be reused in other database systems. Applying this design approach to DBSs enables us to create DBSs tailored for specific use cases while reducing functional overhead as well as development time. Thus, the SPL design approach aims at the middle ground between the one-size-fits-all and the specialization design approach.
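The customization and product-generation step of this workflow can be pictured with a toy Python sketch: each implemented feature contributes an implementation artifact, and a product is composed from the artifacts of the selected features. The mapping and function names below are illustrative and not taken from FAME-DBMS or any concrete SPL tool.

```python
# Toy illustration of "feature selection -> product generation"; artifacts are reduced
# to code snippets here, whereas real SPLs compose classes, refinements, or #ifdef blocks.
ARTIFACTS = {
    "Storage": "class Storage { /* page put/get */ };",
    "B+-Tree": "class Btree : public PrimaryIndex { /* ... */ };",
    "Buffer Manager": "class BufferManager { /* LRU replacement */ };",
}

def generate_product(selected_features: list[str]) -> str:
    """Compose the artifacts of the selected features into one product."""
    missing = [f for f in selected_features if f not in ARTIFACTS]
    if missing:
        raise ValueError(f"features without implementation artifacts: {missing}")
    return "\n".join(ARTIFACTS[f] for f in selected_features)

print(generate_product(["Storage", "B+-Tree"]))
```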
3.4 Characterization of Design Approaches
In this section, we characterize the three design approaches discussed above regarding:
a) general applicability to arbitrary database applications,
b) effort for managing variability, and
c) maturity of the deployed database system.

Although the one-size-fits-all design approach aims at providing a comprehensive set of DBS functionality to deal with most database application scenarios, a one-size-fits-all database is not applicable to use cases in automotive, embedded, and ubiquitous computing. As soon as tailor-made software is required to meet especially storage limitations, one-size-fits-all database systems cannot be used. Moreover, specialized database systems for one specific use case outperform one-size-fits-all database systems by orders of magnitude [25]. Thus, although one-size-fits-all database systems can be applied, they are often not the best choice regarding performance. For that reason, we consider the applicability of one-size-fits-all database systems to arbitrary use cases as limited. In contrast, specialized database systems have a very good applicability, as they are designed for that purpose. The applicability of the SPL design approach is good, as it also creates database systems tailor-made for specific use cases. Moreover, the SPL design approach explicitly considers variability during software design and implementation and provides methods and techniques to manage it [2]. For that reason, we assess the effort of managing variability with the SPL design approach as lower than managing variability using a one-size-fits-all or specialized design approach.

We assess the maturity of one-size-fits-all database systems as very good, as these systems have been developed and tested over decades. Specialized database systems are mostly implemented from scratch, so the possibility of errors in the code is rather high, leading to a moderate maturity and robustness of the software. The SPL design approach also enables the creation of tailor-made database systems, but from approved features that are already implemented and tested. Thus, we assess the maturity of database systems created via the SPL design approach as good.

In Table 1, we summarize our assessment of the three software design approaches regarding the above criteria.

Table 1: Characteristics of approaches
  Criteria               One-Size-Fits-All   Specialization   SPL
  a) Applicability       −                   ++               +
  b) Management effort   −                   −                +
  c) Maturity            ++                  o                +
  Legend: ++ = very good, + = good, o = moderate, − = limited

The one-size-fits-all and the specialization design approach are each very good in one of the three categories. The one-size-fits-all design approach provides robust and mature DBSs. The specialization design approach provides the greatest applicability and can be used for nearly every use case. The SPL design approach, in contrast, provides a balanced assessment regarding all criteria. Thus, against the backdrop of increasing variability due to an increasing variety of use cases and hardware, while guaranteeing mature and robust DBSs, the SPL design approach should be applied to develop future DBSs. Otherwise, the development costs for yet another DBS which has to meet the special requirements of the next data-driven domain will limit the use of DBSs in such fields.
4. ARISING RESEARCH QUESTIONS
Our assessment in the previous section shows that the SPL design approach is the best choice for mastering variability in DBSs. To the best of our knowledge, the SPL design approach has been applied to DBSs only in academic settings (e.g., in [22]). This previous research was based on BerkeleyDB. Although BerkeleyDB offers the essential functionality of a DBS (e.g., a processing engine), several functionalities of relational DBSs were missing (e.g., optimizer, SQL interface). Although these missing functionalities were partially researched (e.g., the storage manager [16] and the SQL parser [26]), no holistic evaluation of a DBS SPL is available. Especially the optimizer in a DBS (e.g., the query optimizer), with its huge number of crosscutting concerns, is currently not considered in research. So, there is still the need for research to fully apply SPL techniques to all parts of a DBS. Specifically, we need methods for modeling variability in DBSs as well as efficient implementation techniques and methods for implementing variability-aware database operations.

4.1 Modeling
For modeling variability in feature-oriented SPLs, feature models are the state of the art [4]. A feature model is a set of features whose dependencies are hierarchically modeled. Since variability in DBSs comprises hardware, software, and their interaction, the following research questions arise:

RQ-M1: What is a good granularity for modeling a variable DBS?
In order to define an SPL for DBSs, we have to model the features of a DBS. Such features can be modeled with different levels of granularity [14]. Thus, we have to find an applicable level of granularity for modeling our SPL for DBSs. Moreover, we also have to consider the dependencies between hardware and software. Furthermore, we have to find a way to model the hardware and these dependencies. In this context, another research question emerges:

RQ-M2: What is the best way to model hardware and its properties in an SPL?
Hardware has become very complex, and researchers demand to develop a better understanding of the impact of hardware on algorithm performance, especially when parallelized [3, 5]. Thus, the question arises which properties of the hardware are worth being captured in a feature model. Furthermore, when thinking about numerical properties, such as CPU frequency or the amount of memory, we have to find a suitable technique to represent them in feature models. One possibility are attributes of extended feature models [4], which have to be explored for applicability.
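One conceivable direction for RQ-M2, not prescribed by the paper, is to attach numerical hardware properties as attributes to features, in the spirit of extended feature models; the following Python sketch with hypothetical names and thresholds only illustrates the idea.

```python
from dataclasses import dataclass

@dataclass
class HardwareFeature:
    """A feature annotated with numerical attributes, as in extended feature models."""
    name: str
    attributes: dict[str, float]   # e.g., number of cores, SIMD width, memory size

cpu = HardwareFeature("CPU", {"cores": 8, "simd_width": 8, "memory_gb": 64})
gpu = HardwareFeature("GPU", {"cores": 2048, "simd_width": 32, "memory_gb": 4})

def prefers_column_store(device: HardwareFeature) -> bool:
    # Toy rule following the observation above that column storage suits GPUs better;
    # the threshold is purely illustrative.
    return device.attributes.get("simd_width", 1.0) >= 16

print(prefers_column_store(cpu), prefers_column_store(gpu))   # False True
```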
4.2 Implementing
In the literature, there are several methods for implementing an SPL. However, most of them are not applicable to our use case. Databases rely on highly tuned operations to achieve peak performance. Thus, variability-enabled implementation techniques must not harm performance, which leads to the research question:

RQ-I1: What is a good variability-aware implementation technique for an SPL of DBSs?
Many state-of-the-art implementation techniques are based on inheritance or additional function calls, which cause performance penalties. A technique that allows for variability without performance penalties is preprocessor directives. However, maintaining preprocessor-based SPLs is notoriously hard, which has earned this approach the name #ifdef Hell [11, 10]. So, there is a trade-off between performance and maintainability [22], but also granularity [14]. It could be beneficial to prioritize maintainability for some parts of a DBS and performance for others.

RQ-I2: How to combine different implementation techniques for SPLs?
If the answer to RQ-I1 is to use different implementation techniques within the same SPL, we have to find an approach to combine these. For example, database operators and their different hardware optimizations must be implemented using annotative approaches for performance reasons, but the query optimizer can be implemented using compositional approaches supporting maintainability; the SPL product generator has to be aware of these different implementation techniques and their interactions.

RQ-I3: How to deal with functionality extensions?
Thinking about changing requirements during the usage of the DBS, we should be able to extend the functionality in case the user requirements change. Therefore, we have to find a solution to deploy updates from an extended SPL in order to integrate the newly requested functionality into a running DBS. Some ideas are presented in [21]; however, due to the increased complexity of hardware and software requirements, an adaption or extension is necessary.

4.3 Customization
In the final customization, the features of the product line that apply to the current use case are selected. State-of-the-art approaches just list available features and show which features are still available for further configuration. However, in our scenario, it could be helpful to get further information about the configuration possibilities. Thus, another research question is:

RQ-C1: How to support the user to obtain the best selection?
In fact, it is possible to help the user in identifying suitable configurations for his use case. If he starts to select functionality that has to be provided by the generated system, we can give him advice on which hardware yields the best performance for his algorithms. However, to achieve this, we have to investigate another research question:

RQ-C2: How to find the optimal algorithms for a given hardware?
To answer this research question, we have to investigate the relation between algorithmic design and the impact of the hardware on the execution. Hence, suitable properties of algorithms have to be identified that influence performance on the given hardware, e.g., access patterns, sizes of used data structures, or result sizes.

5. CONCLUSIONS
DBSs are used for more and more use cases. However, with an increasing diversity of use cases and increasing heterogeneity of available hardware, it is getting more challenging to design an optimal DBS while guaranteeing low implementation and maintenance effort at the same time. To solve this issue, we review three design approaches, namely the one-size-fits-all, the specialization, and the software product line design approach. By comparing these three design approaches, we conclude that the SPL design approach is a promising way to master variability in DBSs and to provide mature data management solutions with reduced implementation and maintenance effort. However, there is currently no comprehensive software product line in the field of DBSs available. Thus, we present several research questions that have to be answered to fully apply the SPL design approach to DBSs.

6. ACKNOWLEDGMENTS
This work has been partly funded by the German BMBF under Contract No. 13N10818 and Bayer Pharma AG.

7. REFERENCES
[1] D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. Row-stores: How Different Are They Really? In SIGMOD, pages 967–980. ACM, 2008.
[2] S. Apel, D. Batory, C. Kästner, and G. Saake. Feature-Oriented Software Product Lines. Springer, 2013.
[3] C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB, 7(1):85–96, 2013.
[4] D. Benavides, S. Segura, and A. Ruiz-Cortés. Automated Analysis of Feature Models 20 Years Later: A Literature Review. Inf. Sys., 35(6):615–636, 2010.
[5] D. Broneske, S. Breß, M. Heimel, and G. Saake. Toward Hardware-Sensitive Database Operations. In EDBT, pages 229–234, 2014.
[6] D. Broneske, S. Breß, and G. Saake. Database Scan Variants on Modern CPUs: A Performance Study. In IMDM@VLDB, 2014.
[7] K. Czarnecki and U. W. Eisenecker. Generative Programming: Methods, Tools, and Applications. ACM Press/Addison-Wesley Publishing Co., 2000.
[8] S. Dorok, S. Breß, H. Läpple, and G. Saake. Toward Efficient and Reliable Genome Analysis Using Main-Memory Database Systems. In SSDBM, pages 34:1–34:4. ACM, 2014.
[9] S. Dorok, S. Breß, and G. Saake. Toward Efficient Variant Calling Inside Main-Memory Database Systems. In BIOKDD-DEXA. IEEE, 2014.
[10] J. Feigenspan, C. Kästner, S. Apel, J. Liebig, M. Schulze, R. Dachselt, M. Papendieck, T. Leich, and G. Saake. Do Background Colors Improve Program Comprehension in the #ifdef Hell? Empir. Softw. Eng., 18(4):699–745, 2013.
[11] J. Feigenspan, M. Schulze, M. Papendieck, C. Kästner, R. Dachselt, V. Köppen, M. Frisch, and G. Saake. Supporting Program Comprehension in Large Preprocessor-Based Software Product Lines. IET Softw., 6(6):488–501, 2012.
[12] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Coprocessing on Graphics Processors. TODS, 34(4):21:1–21:39, 2009.
[13] B. He and J. X. Yu. High-throughput Transaction Executions on Graphics Processors. PVLDB, 4(5):314–325, Feb. 2011.
[14] C. Kästner, S. Apel, and M. Kuhlemann. Granularity in Software Product Lines. In ICSE, pages 311–320. ACM, 2008.
[15] V. Köppen, M. Schäler, and R. Schröter. Toward Variability Management to Tailor High Dimensional Index Implementations. In RCIS, pages 452–457. IEEE, 2014.
[16] T. Leich, S. Apel, and G. Saake. Using Step-wise Refinement to Build a Flexible Lightweight Storage Manager. In ADBIS, pages 324–337. Springer-Verlag, 2005.
[17] J. Liebig, S. Apel, C. Lengauer, and T. Leich. RobbyDBMS: A Case Study on Hardware/Software Product Line Engineering. In FOSD, pages 63–68. ACM, 2009.
[18] A. Lübcke, V. Köppen, and G. Saake. Heuristics-based Workload Analysis for Relational DBMSs. In UNISCON, pages 25–36. Springer, 2012.
[19] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Database Architecture for the New Bottleneck: Memory Access. VLDB J., 9(3):231–246, 2000.
[20] R. Micheloni, A. Marelli, and K. Eshghi. Inside Solid State Drives (SSDs). Springer, 2012.
[21] M. Rosenmüller. Towards Flexible Feature Composition: Static and Dynamic Binding in Software Product Lines. Dissertation, University of Magdeburg, Germany, June 2011.
[22] M. Rosenmüller, N. Siegmund, H. Schirmeier, J. Sincero, S. Apel, T. Leich, O. Spinczyk, and G. Saake. FAME-DBMS: Tailor-made Data Management Solutions for Embedded Systems. In SETMDM, pages 1–6. ACM, 2008.
[23] M. Saecker and V. Markl. Big Data Analytics on Modern Hardware Architectures: A Technology Survey. In eBISS, pages 125–149. Springer, 2012.
[24] M. Schäler, A. Grebhahn, R. Schröter, S. Schulze, V. Köppen, and G. Saake. QuEval: Beyond High-Dimensional Indexing à la Carte. PVLDB, 6(14):1654–1665, 2013.
[25] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The End of an Architectural Era (It's Time for a Complete Rewrite). In VLDB, pages 1150–1160, 2007.
[26] S. Sunkle, M. Kuhlemann, N. Siegmund, M. Rosenmüller, and G. Saake. Generating Highly Customizable SQL Parsers. In SETMDM, pages 29–33. ACM, 2008.
[27] T. Willhalm, I. Oukid, I. Müller, and F. Faerber. Vectorizing Database Column Scans with Complex Predicates. In ADMS@VLDB, pages 1–12, 2013.
[28] J. Zhou and K. A. Ross. Implementing Database Operations Using SIMD Instructions. In SIGMOD, pages 145–156. ACM, 2002.
[29] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, CWI Amsterdam, 2009.

PageBeat – Time Series Analysis and Databases
Andreas Finger, Ilvio Bruder, Andreas Heuer (Institut für Informatik, Universität Rostock, 18051 Rostock; andreas.finger@uni-rostock.de, ilvio.bruder@uni-rostock.de, andreas.heuer@uni-rostock.de)
Steffen Konerow, Martin Klemkow (Mandarin Medien GmbH, Graf-Schack-Allee 9, 19053 Schwerin; sk@mandarin-medien.de, mk@mandarin-medien.de)

ABSTRACT
Time series data and their analysis are an important means for assessment, control, and prediction in many application areas. For time series analysis there is a large number of methods and techniques which are implemented in statistics software and can nowadays be used comfortably without any implementation effort of one's own. In most cases one has to deal with massive amounts of data or even data streams. Accordingly, there are specialized management tools, such as data stream management systems for processing data streams or time series databases for storing and querying time series. The following article gives a short overview of this area and, in particular, illustrates its applicability in a project for analysing and predicting the state of web servers. The challenge within this project, "PageBeat", is to analyse massive numbers of time series in real time and to store them for further analysis processes. In addition, the results have to be prepared and visualized for specific target groups and notifications have to be triggered. The article describes the approach chosen in the project and the techniques and tools employed for it.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics – complexity measures, performance measures

General Terms
Big Data, Data Mining and Knowledge Discovery, Streaming Data

Keywords
Data analysis, R, Time Series Database
1. INTRODUCTION
Time series are naturally ordered sequences of observation values. Time series analysis is concerned with methods for describing such data, for example with the goal of analysing (understanding), predicting, or controlling (steering) the data. Corresponding methods are available in free and commercial statistics software such as R (a programming language for statistical computing and visualization by the R Foundation for Statistical Computing, http://www.r-project.org), Matlab (commercial software for solving and visualizing mathematical problems by The Mathworks, http://www.mathworks.de), Weka [7] (the Waikato Environment for Knowledge Analysis, a toolbox for data mining and machine learning by the University of Waikato, http://www.cs.waikato.ac.nz/ml/weka/), SPSS (commercial statistics and analytics software by IBM, http://www-01.ibm.com/software/de/analytics/spss), and others, which makes comfortable data analysis possible without any implementation effort of one's own. Typical methods of time series analysis are the determination of trend and seasonality, where the trend represents the longer-term increase and the seasonality represents recurring patterns (every year at Christmas, sales go up). In this way, dependencies in the data are examined which enable a forecast of future values with the help of suitable models.

In an application that records a large number of measurements at high temporal resolution, large amounts of data accumulate quickly. These data have to be analysed in real time and, if required, stored persistently for further evaluation. For this there are, on the one hand, approaches from data stream processing and, on the other hand, database systems specialized in storing time series (time series databases). Since statistical analyses with, for example, stand-alone R applications only work as long as the data to be analysed does not exceed the size of main memory, it is necessary to integrate the statistical analysis into database systems. The goal is transparent access to partitioned data and their analysis by means of partitioned statistical models. In [6], different options for such an integration are described, and they have already been implemented in prototypes based on PostgreSQL. Commercial products such as Oracle R Enterprise [4] also integrate statistical analysis at the database level. In the open-source area there is a multitude of approaches for dealing with time series, among which InfluxDB (an open-source distributed time series database with no external dependencies, http://influxdb.com) struck us as a particularly suitable tool.

The challenge within the project "PageBeat" described in the following is to combine innovative and production-ready open-source solutions from the mentioned areas for processing large amounts of time series data within the project. In the following, the project is introduced, then the various candidate techniques are discussed, and finally the chosen concept and first results are presented.
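As a pointer to what trend and seasonality extraction looks like in practice, here is a small Python sketch on synthetic data; the project itself performs this kind of exploration in R, so the library choice (pandas/statsmodels) is only illustrative.

```python
# Decompose a synthetic hourly series into trend, seasonal, and residual components.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2014-01-01", periods=24 * 28, freq="H")      # four weeks, hourly
values = (np.linspace(0, 5, len(idx))                              # slow upward trend
          + 2 * np.sin(2 * np.pi * idx.hour / 24)                  # daily seasonality
          + np.random.normal(scale=0.5, size=len(idx)))            # noise
series = pd.Series(values, index=idx)

decomposition = seasonal_decompose(series, model="additive", period=24)
print(decomposition.trend.dropna().head())
print(decomposition.seasonal.head())
```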
2. THE PAGEBEAT PROJECT
With "PageBeat", a software suite offered as "Software as a Service" (SaaS) is being developed specifically for observing and checking web applications. This is initially done within a ZIM cooperation project funded by the German Federal Ministry of Economics. The goal of the software is to observe and report on the current technical status of a web application (website, content management system, e-commerce system, web service) and to predict technical problems on the basis of suitable indicators (hardware- and software-specific parameters). The reports are prepared and presented for different user groups (system administrators, software developers, heads of department, management, marketing) and their respective requirements. By means of "PageBeat", error reports are thus created automatically that inform about acute as well as foreseeable critical changes of the operating parameters of a web application and are presented in a target-group-specific way.

The underlying indicators are a set of data reflecting the state of the overall system in the application area of web shop systems. These are indicators of the server operating system (such as CPU or RAM utilization) as well as application-specific indicators (such as the runtime of database queries). These data are semantically described, and the corresponding metadata are stored in a knowledge base. Beyond that, the use of further context information that can influence the technical status of the system is being considered. This can be, for instance, weather data: if a rainy weekend is forecast for the cinema operator Cinestar, a high load on the online cinema ticket shop can be expected. Another example would be information from software development: for code changes with a certain timestamp, effects can be detected in the analyses at that point in time. Changing, adding, or taking note of relevant content on the web pages can lead to significant changes in the analyses, e.g., when advertising is placed or when film ratings for newly released films appear on social platforms.

Currently, as broad a spectrum of data as possible is recorded at high temporal resolution in order to be able to infer correlations in a process of data exploration that are not obvious at first, or to validate assumptions. At present, more than 300 indicators are sampled every 10 s on 14 servers from 9 customer projects. These data are stored and also processed further immediately. For example, a downsampling takes place for all of the 300 indicators mentioned: the temporal resolution is reduced to time windows of different sizes using various aggregate functions, and the results are stored. Other analysis functions quantise the values with respect to their membership in status classes (such as optimal, normal, critical) and store the results as well. In this way, large amounts of data accumulate very quickly. Currently the data store contains about 40 GB of data, and with the current number of observed values we see a growth of about 1 GB of data per week. On the basis of the collected data, time-critical analyses such as outlier detection or the detection of critical patterns have to be performed in near real time in order to allow customers to intervene in time. Furthermore, a prediction of future values is intended to point out critical developments early. The challenge in the project is to cope with the large data volume while guaranteeing near-real-time processing by the analysis functions.
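The downsampling and status quantisation described above can be pictured with a short Python/pandas sketch on a synthetic 10-second metric; the thresholds and window sizes are made up for illustration and do not come from the project.

```python
import numpy as np
import pandas as pd

# One day of a synthetic CPU metric sampled every 10 seconds.
idx = pd.date_range("2014-10-21", periods=6 * 60 * 24, freq="10s")
cpu_used = pd.Series(np.random.rand(len(idx)) * 100, index=idx, name="cpu.used")

# Downsampling to coarser resolutions with different aggregate functions.
per_minute_mean = cpu_used.resample("1min").mean()
per_hour_max = cpu_used.resample("1h").max()
per_day_median = cpu_used.resample("1D").median()

# Quantisation into status classes (thresholds are illustrative only).
status = pd.cut(per_minute_mean, bins=[0, 60, 85, 100],
                labels=["optimal", "normal", "critical"])
print(status.value_counts())
```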
3. TIME SERIES ANALYSIS AND DATABASES
In the course of evaluating software suitable for the project, we examined various approaches to data stream processing and to the analysis and management of time series. The goal was to use freely available software that, in addition, builds on the technical expertise already present in the company.

3.1 Data Stream Management Systems
Processing continuous data streams is one aspect of our project. Data stream management systems offer the possibility to formulate continuous queries over data streams that are converted into temporary relations. This can be done, for instance, with operators of the SQL-like Continuous Query Language [2] developed in the Stream project [1]. If more complex patterns are to be recognized in data streams, one also speaks of complex event processing. In the context of our project, such a pattern corresponds, for example, to an increase in the number of page requests due to a marketing campaign, which results in a higher system load (cpu-usage), which in turn is reflected in rising time-to-first-byte values and, in a critical range, should lead to a notification or even to an automatic scaling-up of the available resources. Complex event processing systems such as Esper [5] offer the possibility to formulate queries for such patterns over data streams and to implement corresponding reactions. Since Esper, as one of the few freely available systems suitable for productive use, is implemented in Java and .NET, and corresponding development capacities are not available in the company, none of the mentioned DSMSs or CEP systems will be used in the project. Their architecture, however, served as a guide for the development of an own system for PageBeat based on techniques already used in the company (such as node.js (http://nodejs.org), RabbitMQ (http://www.rabbitmq.com), MongoDB (http://www.mongodb.org), and others).
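The pattern described above (sustained high CPU usage together with rising time-to-first-byte values triggering a notification) can be sketched without a CEP engine in a few lines of Python; this is only an illustration of the idea, not the project's node.js/RabbitMQ implementation.

```python
from collections import deque

WINDOW = 6              # six consecutive 10 s samples, i.e. one minute
CPU_CRITICAL = 90.0     # percent; illustrative threshold
TTFB_CRITICAL = 1.5     # seconds; illustrative threshold

cpu_window: deque = deque(maxlen=WINDOW)
ttfb_window: deque = deque(maxlen=WINDOW)

def notify(message: str) -> None:
    print("ALERT:", message)   # placeholder for the project's notification channel

def on_sample(cpu_usage: float, time_to_first_byte: float) -> None:
    """Feed one measurement pair; emit an alert if both stay critical for a full window."""
    cpu_window.append(cpu_usage)
    ttfb_window.append(time_to_first_byte)
    if (len(cpu_window) == WINDOW
            and min(cpu_window) > CPU_CRITICAL
            and min(ttfb_window) > TTFB_CRITICAL):
        notify("sustained overload: consider scaling up resources")
```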
Deren Architektur diente jedoch zur Orientie- Möglichkeit Oracle Data Frames zu verwenden, um Daten- rung bei der Entwicklung eines eigenen mit im Unternehmen lokalität zu erreichen. Dabei wird der Code in der Oracle- eingesetzten Techniken (etwa node.js6 , RabbitMQ7 , Mon- Umgebung ausgeführt, dort wo die Daten liegen und nicht goDB8 , u.a.) Systems für PageBeat. umgekehrt. Außerdem erfolgt so ein transparenter Zugriff auf die Daten und Aspekte der Skalierung werden durch das 3.2 Werkzeuge zur Datenanalyse DBMS abgewickelt. Zur statistischen Auswertung der Daten im Projekt werden Neben den klassischen ORDBMS existieren eine Vielzahl Werkzeuge benötigt, die es ohne großen Implementierungs- von auf Zeitserien spezialisierte Datenbanken wie OpenTSDB14 , aufwand ermöglichen verschiedene Verfahren auf die erhobe- KairosDB15 , RRDB16 . Dabei handelt es sich jeweils um einen nen Daten anzuwenden und auf ihre Eignung hin zu untersu- auf Schreibzugriffe optimierten Datenspeicher in Form einer chen. Hierfür stehen verschiedene mathematische Werkzeuge schemalosen Datenbank und darauf zugreifende Anfrage-, zur Verfügung. Kommerzielle Produkte sind etwa die bereits Analyse- und Visualisierungsfunktionalität. Man sollte sie erwähnten Matlab oder SPSS. Im Bereich frei verfügbarer deshalb vielmehr als Ereignis-Verarbeitungs- oder Monitoring- Software kann man auf WEKA und vor allem R zurückgrei- Systeme bezeichnen. Neben den bisher genannten Zeitserien- fen. Besonders R ist sehr weit verbreitet und wird von ei- datenbanken ist uns bei der Recherche von für das Projekt ner großen Entwicklergemeinde getragen. Dadurch sind für geeigneter Software InfluxDB17 aufgefallen. InfluxDB ver- R bereits eine Vielzahl von Verfahren zur Datenaufberei- wendet Googles auf Log-structured merge-trees basierenden tung und deren statistischer Analyse bis hin zur entspre- key-value Store LevelDB18 und setzt somit auf eine hohen chenden Visualisierung implementiert. Gerade in Bezug auf Durchsatz bzgl. Schreiboperationen. Einen Nachteil hinge- die Analyse von Zeitreihen ist R aufgrund vielfältiger ver- gen stellen langwierige Löschoperationen ganzer nicht mehr fügbarer Pakete zur Zeitreihenanalyse gegenüber WEKA die benötigter Zeitbereiche dar. Die einzelnen Zeitreihen werden geeignetere Wahl. Mit RStudio9 steht außerdem eine kom- bei der Speicherung sequenziell in sogenannte Shards unter- fortable Entwicklungsumgebung zur Verfügung. Weiterhin teilt, wobei jeder Shard in einer einzelnen Datenbank gespei- können mit dem Web Framework Shiny10 schnell R Anwen- chert wird. Eine vorausschauenden Einrichtung verschiede- dungen im Web bereit gestellt werden und unterstützt so- ner Shard-Spaces (4 Stunden, 1 Tag, 1 Woche etc.) ermög- mit eine zügige Anwendungsentwicklung. Somit stellt R mit licht es, das langsame Löschen von Zeitbereichen durch das den zugehörigen Erweiterungen die für das Projekt geeignete einfache Löschen ganzer Shards also ganzer Datenbanken Umgebung zur Evaluierung von Datenanalyseverfahren und (drop database) zu kompensieren. Eine verteilte Speicherung zur Datenexploration dar. Im weiteren Verlauf des Projektes der Shards auf verschiedenen Rechnerknoten die wiederum und in der Überführung in ein produktives System wird die in verschiedenen Clustern organisiert sein können, ermög- Datenanalyse, etwa die Berechnung von Vorhersagen, inner- licht eine Verteilung der Daten, die falls gewünscht auch red- halb von node.js reimplementiert. undant mittels Replikation auf verschiedene Knoten erfolgen kann. 
Besides the classical ORDBMS, there is a large number of databases specialized in time series, such as OpenTSDB14, KairosDB15 and RRDB16. Each of them is a data store optimized for write access in the form of a schema-less database, together with query, analysis and visualization functionality built on top of it. One should therefore rather describe them as event processing or monitoring systems. Beyond the time series databases mentioned so far, InfluxDB17 caught our attention during the search for software suitable for the project. InfluxDB uses Google's key-value store LevelDB18, which is based on log-structured merge-trees, and thus aims at a high throughput of write operations. A disadvantage, on the other hand, are lengthy delete operations for entire time ranges that are no longer needed. When stored, the individual time series are partitioned sequentially into so-called shards, each shard being kept in a separate database. Setting up different shard spaces in advance (4 hours, 1 day, 1 week, etc.) makes it possible to compensate for the slow deletion of time ranges by simply dropping entire shards, i.e. entire databases (drop database). Distributed storage of the shards on different nodes, which in turn can be organized in different clusters, allows the data to be distributed and, if desired, also replicated redundantly to several nodes. Distributing the data across several nodes also makes it possible to distribute the computation of aggregates over time windows smaller than the shard size and thus to achieve data locality and a performance advantage. Here, too, it is advisable to plan shard sizes ahead. Queries to InfluxDB can be formulated in an SQL-like query language via an HTTP interface. Various aggregate functions are provided that produce output grouped, for example, by time intervals over an entire time range; the use of regular expressions is supported:

  select median(used) from /cpu\.*/
  where time > now() - 4h group by time(5m)

Here, the median of the "used" value is computed and returned for every 5-minute window of the last 4 hours for all CPUs. Besides normal queries, so-called continuous queries can also be set up, which, for instance, allow simple downsampling of measurement data:

  select count(name) from clicks
  group by time(1h) into clicks.count.1h

InfluxDB is still at an early stage and is being developed continuously. It has been announced, for example, that in the future it will be possible to store metadata about time series (units, sampling rate, etc.) and to implement user-defined aggregate functions. InfluxDB is a promising tool for our application, although it remains to be seen to what extent it is suitable for productive use.

14 OpenTSDB - Scalable Time Series Database. http://opentsdb.net/.
15 KairosDB - Fast Scalable Time Series Database. https://code.google.com/p/kairosdb/.
16 RRDB - Round Robin Database. http://oss.oetiker.ch/rrdtool/.
17 InfluxDB - An open-source distributed time series database with no external dependencies. http://influxdb.com/.
18 LevelDB - A fast and lightweight key-value database library by Google. http://code.google.com/p/leveldb/.
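For comparison, here is a minimal R sketch that reproduces the semantics of the median query above on a locally held data frame; the sample data are generated and only stand in for real PageBeat measurements.

  # Raw samples taken every 10 seconds over the last 4 hours (synthetic data).
  raw <- data.frame(
    ts   = seq(Sys.time() - 4 * 3600, Sys.time(), by = 10),
    used = runif(1441, 20, 80)
  )

  # Assign each sample to its 5-minute window and compute the median per window.
  raw$window <- as.POSIXct(floor(as.numeric(raw$ts) / 300) * 300,
                           origin = "1970-01-01")
  five_min_median <- aggregate(used ~ window, data = raw, FUN = median)
  head(five_min_median)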
For this reason, MongoDB, a data store already well proven in the company, is currently used in parallel with InfluxDB.

4. SOLUTION IN PAGEBEAT

In the PageBeat project, several solution approaches were tested; practicability for use in the company, fast realizability and the free availability of the tools employed played the decisive role.

4.1 Data Flow

The data flow within the overall architecture is shown in Figure 1. The measurement data are collected by a drone19 as well as by client simulators and load test servers at equidistant time intervals (usually 10 s). The collected data are handed to a logging service via a REST interface and are placed in the queue of a message server. From there they are processed, according to their signature, by registered analysis and interpretation processes; the validation of the incoming data and the assignment to registered analysis functions is carried out by means of a knowledge base. Results are in turn made available as messages and, where intended, stored persistently. Results that have entered the message queue in this way can then trigger further analyses or interpretations or the sending of a notification. The Data Explorer allows inspecting raw data and analysis results already integrated into PageBeat, as well as testing future analysis functions.

(Abbildung 1: Datenfluss. Data flow diagram: data stream (drone, load test server, client simulation, etc.), preprocessing / data cleaning, integration, ad-hoc analysis (outliers, etc.) and long-term analysis, connected to the knowledge base, the data store, the Data Explorer and the results.)

4.2 Knowledge Base

The knowledge base forms the foundation of the modularly structured analysis and interpretation processes. The "ParameterValues" shown in Figure 2 represent the measurement data and their properties such as name, description or unit. ParameterValues can be combined into logical groups (Parameters), for example the ParameterValues "system", "load", "iowait" and "max" into the parameter "cpu". Parameters are linked with visualization components and customer data as well as with analyses and interpretations. Analyses and interpretations are built modularly and each consist of input and output data (ParameterValues) as well as references to the program code. Furthermore, specific method parameters are assigned to them, for instance the start and end of a time window, thresholds or other model parameters. The knowledge base is mapped to a relational schema in MySQL.

19 An agent installed on the system under observation for data collection.
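A minimal sketch of how such a relational schema could look in MySQL follows; the table and column names are assumptions for illustration, not the actual PageBeat schema.

  -- Logical groups such as 'cpu'.
  CREATE TABLE parameter (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(64) NOT NULL
  );

  -- Individual measured values such as 'system', 'load', 'iowait', 'max'.
  CREATE TABLE parameter_value (
    id           INT PRIMARY KEY AUTO_INCREMENT,
    parameter_id INT NOT NULL,
    name         VARCHAR(64) NOT NULL,
    description  TEXT,
    unit         VARCHAR(32),
    FOREIGN KEY (parameter_id) REFERENCES parameter(id)
  );

  -- Analyses with input/output ParameterValues, a code reference and method parameters.
  CREATE TABLE analysis (
    id           INT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(64) NOT NULL,
    code_path    VARCHAR(255),
    input_value  INT,
    output_value INT,
    window_start DATETIME,
    window_end   DATETIME,
    threshold    DOUBLE,
    FOREIGN KEY (input_value)  REFERENCES parameter_value(id),
    FOREIGN KEY (output_value) REFERENCES parameter_value(id)
  );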
(Abbildung 2: Ausschnitt Schema Wissensbasis. Excerpt of the knowledge base schema, relating Analysis, Visualisation, Parameter, Interpretation and Customer Data.)

4.3 Storage of the Time Series

The measurement data as well as the analysis and interpretation results are stored, on the one hand, in MongoDB, the schema-free database well proven in the company and optimized for high-frequency write operations. On the other hand, we meanwhile rely on InfluxDB in parallel to MongoDB. For example, the continuous queries available in InfluxDB can be used for automatic downsampling and thus for a data reduction of the data collected at 10-second intervals. The downsampling is currently done by computing mean values over time windows with lengths from 1 minute up to one day, and thus automatically generates different temporal resolutions for all measured values. In addition, the SQL-like query language of InfluxDB provides a large number of aggregate functions that are helpful for statistical evaluation (min, max, mean, median, stddev, percentile, histogram, etc.). Furthermore, it is planned that in the future user-defined functions with custom analysis functionality (such as autocorrelation, cross-correlation, forecasting, etc.) can be implemented at the database level, and that different time series can be joined automatically on a timestamp attribute. This would support cross-series analysis (e.g. correlation) already at the database level and reduces the effort of reimplementing R functionality from the data exploration phase. Since conventional databases neither reach this high write performance nor offer much support for queries specialized on time series, InfluxDB appears to be a suitable candidate for use within PageBeat.

4.4 Data Exploration

Data exploration is meant to give administrators and also end users the possibility to analyze the data relevant to them with the right tools. During development we use data exploration as a means to identify relevant analysis methods and to evaluate and visualize the data streams. Figure 3 shows a simple user interface, implemented with Shiny, for data evaluation with R and with access to different databases, InfluxDB and MongoDB. Various controls allow selecting the time range, the analysis function and its parameters, as well as visualization settings. In the figure, average CPU usage and average disk access times from a selection of 10 time series are displayed. With the interaction element at the bottom, intervals can be selected and the granularity can be adjusted. With similar visualization methods, autocorrelation analyses can be visualized as well, see Figure 4.

(Abbildung 4: Autokorrelation. Visualization of an autocorrelation analysis.)

4.5 Analysis and Interpretation

Analyses are basic operations such as the computation of mean, median, standard deviation, autocorrelation and others, whose results can be stored persistently if necessary or passed directly as input to further processing steps. The analysis functions are specified in the knowledge base; the actual implementation is to be placed as close as possible to the data to be analyzed, where possible using aggregate or user-defined functions of the database system.
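The following minimal R sketch shows such basis operations on an assumed numeric vector of CPU load samples taken every 10 seconds; the data are synthetic.

  cpu <- runif(360, 20, 80)            # one hour of 10-second samples (synthetic)

  basic_stats <- c(mean = mean(cpu),
                   median = median(cpu),
                   sd = sd(cpu))

  # Autocorrelation up to a lag of 60 samples (10 minutes); pronounced peaks would
  # indicate periodic behaviour that a forecasting model could exploit.
  acf_result <- acf(cpu, lag.max = 60, plot = FALSE)
  head(acf_result$acf)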
For this purpose, the knowledge base and the analyses are linked via a "method codepath". Interpretations work analogously to analyses but encode computation rules, for instance for the overall index (Pagebeat factor) of the system or of individual subsystems, e.g. by combining the analysis results of individual time series in a weighted manner. Furthermore, interpretations carry an info type, which serves the user-specific preparation of results. Figure 5, for instance, shows the display of aggregated parameters in traffic-light form (red = critical, yellow = warning, green = normal, blue = optimal), which quickly conveys an impression of the state of various system parameters.

(Abbildung 5: Ampel. Traffic-light display of aggregated parameters.)

Analysis functionality that goes beyond aggregations at the database level is implemented and evaluated by us in an experimental environment. This environment is based on R, so that a large number of statistical analysis methods and methods for preparing complex data structures are available in the form of R packages. In addition, the R package "Shiny Server" allows R functionality to be conveniently made available on the web. An essential part of our experimental environment is the Pagebeat Data Explorer (see Figure 3). It builds on the techniques mentioned and allows inspecting the collected raw data or "playing" with analysis methods and forecasting models.

(Abbildung 3: Daten. Screenshot of the Shiny-based Data Explorer user interface.)

5. SUMMARY AND OUTLOOK

Pagebeat is a project in which performant storage and fast ad-hoc evaluation of the data are particularly important. To this end, different solution approaches were examined and the favored solution based on InfluxDB and R was described. The conceptual phase is completed, the project infrastructure has been implemented, and first analysis methods such as outlier detection or autocorrelation have been tried out. Currently we are investigating the possibilities of forecasting time series values. For this purpose, results of the autocorrelation analysis are used to identify dependencies within time series in order to be able to estimate the quality of forecasts. Furthermore, it is planned to execute analyses closer to the database in order to support data locality.

6. REFERENCES
[1] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. Technical Report 2004-20, Stanford InfoLab, 2004.
[2] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford InfoLab, 2003.
[3] K. Chinda and R. Vijay. Informix TimeSeries solution. http://www.ibm.com/developerworks/data/library/techarticle/dm-1203timeseries, 2012.
[4] Oracle Corporation. R technologies from Oracle. http://www.oracle.com/technetwork/topics/bigdata/r-offerings-1566363.html, 2014.
[5] EsperTech. Esper. http://esper.codehaus.org, 2014.
[6] U. Fischer, L. Dannecker, L. Siksnys, F. Rosenthal, M. Boehm, and W. Lehner. Towards integrated data analytics: Time series forecasting in DBMS. Datenbank-Spektrum, 13(1):45-53, 2013.
[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

Databases under the Partial Closed-world Assumption: A Survey

Simon Razniewski, Werner Nutt
Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100 Bozen, Italy
razniewski@inf.unibz.it, nutt@inf.unibz.it

ABSTRACT
Databases are traditionally considered either under the closed-world or the open-world assumption. In some scenarios, however, a middle ground, the partial closed-world assumption, is needed, which has received less attention so far. In this survey we review foundational work on the partial closed-world assumption and then discuss work done in our group in recent years on various aspects of reasoning

centralized manner, as each school is responsible for its own data. Since there are numerous schools in this province, the overall database is notoriously incomplete. However, periodically the statistics department of the province queries the school database to generate statistical reports. These statistics are the basis for administrative decisions such as the opening and closing of classes, the assignment of teachers to schools and others.
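For illustration, a hypothetical statistics query of this kind over the student table introduced in Section 2.2 could look as follows; its result can only be trusted for such decisions if the relevant part of the data is complete.

  -- Number of students per level; meaningful for planning only if the
  -- student table is complete for the levels of interest.
  SELECT level, count(*) AS num_students
  FROM student
  GROUP BY level;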
It is therefore important that these statistics are over databases under this assumption. correct. Therefore, the IT department is interested in finding We first discuss the conceptual foundations of this assump- out which data has to be complete in order to guarantee cor- tion. We then list the main decision problems and the known rectness of the statistics, and on which basis the guarantees results. Finally, we discuss implementational approaches and can be given. extensions. Broadly, we investigated the following research questions: 1. How to describe complete parts of a database? 1. INTRODUCTION Data completeness is an important aspect of data quality. 2. How to find out, whether a query answer over a par- Traditionally, it is assumed that a database reflects exactly tially closed database is complete? the state of affairs in an application domain, that is, a fact that is true in the real world is stored in the database, and a 3. If a query answer is not complete, how to find out which fact that is missing in the database does not hold in the real kind of data can be missing, and which similar queries world. This is known as the closed-world assumption (CWA). are complete? Later approaches have discussed the meaning of databases that are missing facts that hold in the real world and thus are incomplete. This is called the open-world assumption Work Overview. The first work on the PCWA is from (OWA) [16, 7]. Motro [10]. He used queries to describe complete parts and A middle view, which we call the partial closed-world as- introduced the problem of inferring the completeness of other sumption (PCWA), has received less attention until recently. queries (QC) from such completeness statements. Later work Under the PCWA, some parts of the database are assumed by Halevy [8] introduced tuple-generating dependencies or to be closed (complete), while others are assumed to be open table completeness (TC) statements for specification of com- (possibly incomplete). So far, the former parts were specified plete parts. A detailed complexity study of TC-QC entailment using completeness statements, while the latter parts are the was done by Razniewski and Nutt [13]. complement of the complete parts. Later work by Razniewski and Nutt has focussed on databases with null values [12] and geographic databases [14]. Example. As an example, consider a problem arising in the There has also been work on RDF data [3]. Savkovic management of school data in the province of Bolzano, Italy, et al. [18, 17] have focussed on implementation techniques, which motivated the technical work reported here. The IT leveraging especially on logic programming. department of the provincial school administration runs a Also the derivation of completeness from data-aware busi- database for storing school data, which is maintained in a de- ness process descriptions has been discussed [15]. Current work is focussing on reasoning wrt. database in- stances and on queries with negation [4]. Outline. This paper is structured as follows. In Section 2, we discuss conceptual foundations, in particular the par- tial closed-world assumption. In Section 3 we present main Copyright c by the paper’s authors. Copying permitted only for reasoning problems in this framework and known results. private and academic purposes. Section 4 discusses implementation techniques. Section 5 In: G. Specht, H. Gamper, F. 
Klan (eds.): Proceedings of the 26th GI- Workshop on Foundations of Databases (Grundlagen von Datenbanken), presents extension and Section 6 discusses current work and 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. open problems. 59 2. CONCEPTUAL FOUNDATIONS Example 1. Consider a partial database DS for a school with two students, Hans and Maria, and one teacher, Carlo, as follows: 2.1 Standard Definitions In the following, we fix our notation for standard concepts DiS = {student(Hans, 3, A), student(Maria, 5, C), from database theory. We assume a set of relation symbols person(Hans, male), person(Maria, female), Σ, the signature. A database instance D is a finite set of ground atoms with relation symbols from Σ. For a relation symbol person(Carlo, male) }, R ∈ Σ we write R(D) to denote the interpretation of R in D, that DaS = DiS \ { person(Carlo, male), student(Maria, 5, C) }, is, the set of atoms in D with relation symbol R. A condition G is a set of atoms using relations from Σ and possibly the that is, the available database misses the facts that Maria is a student comparison predicates < and ≤. As common, we write a and that Carlo is a person. condition as a sequence of atoms, separated by commas. A Next, we define statements to express that parts of the in- condition is safe if each of its variables occurs in a relational formation in Da are complete with regard to the ideal database atom. A conjunctive query is written in the form Q(s̄) :− B, Di . We distinguish query completeness and table complete- where B is a safe condition, s̄ is a vector of terms, and every ness statements. variable in s̄ occurs in B. We often refer to the entire query by the symbol Q. As usual, we call Q(s̄) the head, B the body, the variables in s̄ the distinguished variables, and the Query Completeness. For a query Q, the query completeness statement Compl(Q) says that Q can be answered completely remaining variables in B the nondistinguished variables of Q. over the available database. Formally, Compl(Q) is satisfied by We generically use the symbol L for the subcondition of B a partial database D, denoted as D |= Compl(Q), if Q(Da ) = containing the relational atoms and M for the subcondition Q(Di ). containing the comparisons. If B contains no comparisons, then Q is a relational conjunctive query. Example 2. Consider the above defined partial database DS and The result of evaluating Q over a database instance D is the query denoted as Q(D). Containment and equivalence of queries are defined as usual. A conjunctive query is minimal if no Q1 (n) :− student(n, l, c), person(n, ’male’), relational atom can be removed from its body without leading asking for all male students. Over both, the available database DaS to a non-equivalent query. and the ideal database DiS , this query returns exactly Hans. Thus, 2.2 Running Example DS satisfies the query completeness statement for Q1 , that is, For our examples throughout the paper, we will use a dras- DS |= Compl(Q1 ). tically simplified extract taken from the schema of the Bolzano school database, containing the following two tables: Abiteboul et al. [1] introduced the notion of certain and possible answers over databases under the open-world as- - student(name, level, code), sumption. Query completeness can also be seen as a relation - person(name, gender). 
between certain and possible answers: A query over a par- The table student contains records about students, that is, tially complete database is complete, if the certain and the their names and the level and code of the class we are in. possible answers coincide. The table person contains records about persons (students, teachers, etc.), that is, their names and genders. Table completeness. A table completeness (TC) statement allows one to say that a certain part of a relation is com- 2.3 Completeness plete, without requiring the completeness of other parts of Open and closed world semantics were first discussed by the database [8]. It has two components, a relation R and Reiter in [16], where he formalized earlier work on negation a condition G. Intuitively, it says that all tuples of the ideal as failure [2] from a database point of view. The closed-world relation R that satisfy condition G in the ideal database are assumption corresponds to the assumption that the whole also present in the available relation R. database is complete, while the open-world assumption cor- Formally, let R(s̄) be an R-atom and let G be a condition responds to the assumption that nothing is known about the such that R(s̄), G is safe. We remark that G can contain re- completeness of the database. lational and built-in atoms and that we do not make any safety assumptions about G alone. Then Compl(R(s̄); G) is a Partial Database. The first and very basic concept is that table completeness statement. It has an associated query, which of a partially complete database or partial database [10]. A is defined as QR(s̄);G (s̄) :− R(s̄), G. The statement is satisfied database can only be incomplete with respect to another by D = (Di , Da ), written D |= Compl(R(s̄); G), if QR(s̄);G (Di ) ⊆ database that is considered to be complete. So we model a R(Da ). Note that the ideal instance D̂ is used to determine partial database as a pair of database instances: one instance those tuples in the ideal version R(Di ) that satisfy G and that that describes the complete state, and another instance that the statement is satisfied if these tuples are present in the describes the actual, possibly incomplete state. Formally, a available version R(Da ). In the sequel, we will denote a TC partial database is a pair D = (Di , Da ) of two database instances statement generically as C and refer to the associated query Di and Da such that Da ⊆ Di . In the style of [8], we call Di simply as QC . the ideal database, and Da the available database. The require- If we introduce different schemas Σi and Σa for the ideal ment that Da is included in Di formalizes the intuition that and the available database, respectively, we can view the the available database contains no more information than the TC statement C = Compl(R(s̄); G) equivalently as the TGD (= ideal one. tuple-generating dependency) δC : Ri (s̄), Gi → Ra (s̄) from Σi to 60 Σa . It is straightforward to see that a partial database satisfies 3. CHARACTERIZATIONS AND DECISION the TC statement C if and only if it satisfies the TGD δC . PROCEDURES The view of TC statements is especially useful for imple- mentations. Motro [10] introduced the notion of partially incomplete and incorrect databases as databases that can both miss facts that hold in the real world or contain facts that do not hold Example 3. In the partial database DS defined above, we can there. 
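On a concrete instance, the query completeness check of Example 2 amounts to comparing the answers over the ideal and the available database. A hedged SQL sketch, assuming the two versions are stored as separate tables named student_i/person_i and student_a/person_a (names chosen purely for illustration): Q1 is complete exactly if the following difference is empty.

  SELECT s.name
  FROM   student_i s JOIN person_i p ON p.name = s.name
  WHERE  p.gender = 'male'
  EXCEPT
  SELECT s.name
  FROM   student_a s JOIN person_a p ON p.name = s.name
  WHERE  p.gender = 'male';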
He described partial completeness in terms of query observe that in the available relation person, the teacher Carlo is completeness (QC) statements, which express that the answer missing, while all students are present. Thus, person is complete of a query is complete. The query completeness statements for all students. The available relation student contains Hans, who express that to some parts of the database the closed-world is the only male student. Thus, student is complete for all male assumption applies, while for the rest of the database, the persons. Formally, these two observations can be written as table open-world assumption applies. He studied how the com- completeness statements: pleteness of a given query can be deduced from the com- pleteness of other queries, which is QC-QC entailment. His C1 = Compl(person(n, g); student(n, l, c)), solution was based on rewriting queries using views: to infer C2 = Compl(student(n, l, c); person(n, ’male’)), that a given query is complete whenever a set of other queries are complete, he would search for a conjunctive rewriting in which, as seen, are satisfied by the partial database DS . terms of the complete queries. This solution is correct, but not complete, as later results on query determinacy show: One can prove that table completeness cannot be expressed the given query may be complete although no conjunctive by query completeness statements, because the latter require rewriting exists. completeness of the relevant parts of all the tables that ap- While Levy et al. could show that rewritability of conjunc- pear in the statement, while the former only talks about the tive queries as conjunctive queries is decidable [9], general completeness of a single table. rewritability of conjunctive queries by conjunctive queries is still open: An extensive discussion on that issue was pub- lished in 2005 by Segoufin and Vianu where it is shown that Example 4. As an illustration, consider the table completeness it is possible that conjunctive queries can be rewritten using statement C1 that states that person is complete for all students. The other conjunctive queries, but the rewriting is not a conjunc- corresponding query QC1 that asks for all persons that are students tive query [19]. They also introduced the notion of query is determinacy, which for conjunctive queries implies second QC1 (n, g) :− person(n, g), student(n, l, c). order rewritability. The decidability of query determinacy for conjunctive queries is an open problem to date. Evaluating QC1 over DiS gives the result { Hans, Maria }. However, evaluating it over DaS returns only { Hans }. Thus, DS does not Halevy [8] suggested local completeness statements, which satisfy the completeness of the query QC1 although it satisfies the we, for a better distinction from the QC statements, call table table completeness statement C1 . completeness (TC) statements, as an alternate formalism for expressing partial completeness of an incomplete database. Reasoning. As usual, a set S1 of TC- or QC-statements en- These statements allow one to express completeness of parts tails another set S2 (we write S1 |= S2 ) if every partial database of relations independent from the completeness of other parts that satisfies all elements of S1 also satisfies all elements of S2 . of the database. The main problem he addressed was how to derive query completeness from table completeness (TC-QC). He reduced TC-QC to the problem of queries independent Example 5. 
Consider the query Q(n) :− student(n, 7, c), of updates (QIU) [5]. However, this reduction introduces person(n,0 male0 ) that asks for all male students in level 7. The negation, and thus, except for trivial cases, generates QIU TC statements C1 and C2 entail completeness of this query, because instances for which no decision procedures are known. As we ensure that all persons that are students and all male students a consequence, the decidability of TC-QC remained largely are in the database. Note that these are not the minimal precon- open. Moreover, he demonstrated that by taking into ac- ditions, as it would be enough to only have male persons in the count the concrete database instance and exploiting the key database who are student in level 7, and students in level 7, who constraints over it, additional queries can be shown to be are male persons. complete. Razniewski and Nutt provided decision procedures for TC- While TC statements are a natural way to describe com- QC in [13]. They showed that for queries under bag semantics pleteness of available data (“These parts of the data are com- and for minimal queries under set semantics, weakest precon- plete”), QC statements capture requirements for data qual- ditions for query completeness can be expressed in terms of ity (“For these queries we need complete answers”). Thus, table completeness statements, which allow to reduce TC-QC checking whether a set of TC statements entails a set of entailment to TC-TC entailment. QC statements (TC-QC entailment) is the practically most For the problem of TC-TC entailment, they showed that it relevant inference. Checking TC-TC entailment is useful is equivalent to query containment. when managing sets of TC statements. Moreover, as we For QC-QC entailment, they showed that the problem is will show later on, TC-QC entailment for aggregate queries decidable for queries under bag semantics. with count and sum can be reduced to TC-TC entailment for For aggregate queries, they showed that for the aggregate non-aggregate queries. If completeness guarantees are given functions SUM and COUNT, TC-QC has the same complexity in terms of query completeness, also QC-QC entailment is of as TC-QC for nonaggregate queries under bag semantics. For interest. the aggregate functions MIN and MAX, they showed that 61 Problem Work by Results Query rewritability is a sufficient Motro 1989 QC-QC condition for QC-QCs Razniewski/Nutt QC-QCb is equivalent to query 2011 containment Razniewski/Nutt TC-TC is equivalent to query TC-TC 2011 containment Levy 1996 Decision procedure for trivial cases TC-QC TC-QCb is equivalent to TC-TC, Razniewski/Nutt TC-QCs is equivalent to TC-TC up 2011 to asymmetric cases Razniewski/Nutt Decision procedures for TC-QCs 2012 over databases with nulls Table 1: Main results TC-QC has the same complexity as TC-QC for nonaggregate that computes for a query that may be incomplete, complete queries under set semantics. approximations from above and from below. With this exten- For reasoning wrt. a database instance, they showed that sion, they show how to reformulate the original query in such TC-QC becomes computationally harder than without an in- a way that answers are guaranteed to be complete. If there stance, while QC-QC surprisingly becomes solvable, whereas exists a more general complete query, there is a unique most without an instance, decidability is open. specific one, which is found. If there exists a more specific complete query, there may even be infinitely many. 
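Spelled out in the TGD view introduced in Section 2, the statements C1 and C2 used in Example 5 read as follows (schemas Σi and Σa as above; this is only Example 5 made explicit):

  δ_C1 :  person_i(n, g), student_i(n, l, c)       →  person_a(n, g)
  δ_C2 :  student_i(n, l, c), person_i(n, 'male')  →  student_a(n, l, c)

For the query Q(n) :− student(n, 7, c), person(n, 'male'), every ideal answer n satisfies the bodies of both dependencies, so δ_C2 forces student_a(n, 7, c) and δ_C1 forces person_a(n, 'male') to be present, which is the intuition behind the entailment claimed in Example 5.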
In this In [12], Nutt and Razniewski discussed TC-QC entailment case, the least specific specializations whose size is bounded reasoning over databases that contain null values. Null val- by a threshold provided by the user is found. Generalizations ues as used in SQL are ambiguous, as they can indicate either are computed by a fixpoint iteration, employing an answer set that no attribute value exists or that a value exists, but is un- programming engine. Specializations are found leveraging known. Nutt and Razniewski studied completeness reason- unification from logic programming. ing for both interpretations, and showed that when allowing both interpretations at the same time, it becomes necessary to syntactically distinguish between different kinds of null val- 5. EXTENSIONS AND APPLICATIONS SCE- ues. They presented an encoding for doing that in standard NARIOS SQL databases. With this technique, any SQL DBMS evalu- ates complete queries correctly with respect to the different meanings that null values can carry. Complete generalizations and specializations. When a The main results are summarized in Table 1. query is not guaranteed to be complete, it may be interesting to know which similar queries are complete. For instance, when a query for all students in level 5 is not complete, it 4. IMPLEMENTATION TECHNIQUES may still be the case that the query for students in classes 5b Systems for reasoning can be developed from scratch, how- and 5c is complete. Such information is especially interesting ever it may be useful to implement them using existing tech- for interaction with a completeness reasoning system. In [11], nology as far as possible. So far, it was investigated how Savkovic et al. defined the notion of most general complete completeness reasoning can be reduced to answer set pro- specialization and the most specific comple generalization, gramming, in particular using the DLV system. and discussed techniques to find those. The MAGIK system developed by Savkovic et al. [18] demonstrates how to use meta-information about the com- Completeness over Business Processes. In many appli- pleteness of a database to assess the quality of the answers cations, data is managed via well documented processes. If returned by a query. The system holds table-completeness information about such processes exists, one can draw con- (TC) statements, by which one can express that a table is par- clusions about completeness as well. In [15], Razniewski et tially complete, that is, it contains all facts about some aspect al. presented a formalization of so-called quality-aware pro- of the domain. cesses that create data in the real world and store it in the Given a query, MAGIK determines from such meta- company’s information system possibly at a later point. They information whether the database contains sufficient data then showed how one can check the completeness of database for the query answer to be complete (TC-QC entailment). queries in a certain state of the process or after the execution If, according to the TC statements, the database content is of a sequence of actions, by leveraging on query contain- not sufficient for a complete answer, MAGIK explains which ment, a well-studied problem in database theory. Finally, further TC statements are needed to guarantee completeness. they showed how the results can be extended to the more MAGIK extends and complements theoretical work on expressive formalism of colored Petri nets. 
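To give a flavour of the ASP-based implementation route described in Section 4, here is a toy encoding in DLV/clingo-style syntax for instance-based checking of the table completeness statement C2 on the data of Example 1; it is an illustration only, not MAGIK's actual logic programs.

  % Ideal and available database (Example 1).
  student_i(hans, 3, a).   student_i(maria, 5, c).
  person_i(hans, male).    person_i(maria, female).   person_i(carlo, male).
  student_a(hans, 3, a).
  person_a(hans, male).    person_a(maria, female).

  % C2: every male student of the ideal database must appear in the available student table.
  required_student(N, L, C) :- student_i(N, L, C), person_i(N, male).

  % A violation is derived if a required tuple is missing from the available database.
  violation(c2, N) :- required_student(N, L, C), not student_a(N, L, C).

On this instance no violation is derived, matching the observation in Example 3 that C2 is satisfied.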
modeling and reasoning about data completeness by provid- ing the first implementation of a reasoner. The reasoner op- erates by translating completeness reasoning tasks into logic Spatial Data. Volunteered geographical information sys- tems are gaining popularity. The most established one is programs, which are executed by an answer set engine. OpenStreetMap (OSM), but also classical commercial map In [17], Savkovic et al. present an extension to MAGIK services such as Google Maps now allow users to take part in 62 the content creation. Relationship between Completeness Assessing the quality of spatial information is essential for Certain Answers, Query P Pattern making informed decisions based on the data, and particu- Answers, and Possible Answers larly challenging when the data is provided in a decentral- Q :− C CA = QA = PA ized, crowd-based manner. In [14], Razniewski and Nutt Q :− N CA = QA ⊆ PA = inf showed how information about the completeness of features Q :− N, ¬N ∅ = CA ⊆ QA ⊆ PA = inf in certain regions can be used to annotate query answers with Q :− C, ¬C CA = QA = PA completeness information. They provided a characterization Q :− N, ¬C CA = QA ⊆ PA = inf of the necessary reasoning and show that when taking into Q :− C, ¬N ∅ = CA ⊆ QA = PA account the available database, more completeness can be de- rived. OSM already contains some completeness statements, Table 2: Relation between query result, certain answers and which are originally intended for coordination among the ed- possible answers for queries with negation. The arguments itors of the map. A contribution was also to show that these of Q are irrelevant and therefore omitted. statements are not only useful for the producers of the data but also for the consumers. query answer may either be equal to the possible answers, to RDF Data. With thousands of RDF data sources today avail- the certain answers, both, or none. able on the Web, covering disparate and possibly overlapping Note that the above results hold for conjunctive queries in knowledge domains, the problem of providing high-level de- general, and thus do not only apply to SPARQL but also to scriptions (in the form of metadata) of their content becomes other query languages with negation, such as SQL. crucial. In [3], Darari et al. discussed reasoning about the completeness of semantic web data sources. They showed 6.2 Instance Reasoning how the previous theory can be adapted for RDF data sources, Another line of current work concerns completeness rea- what peculiarities the SPARQL query language offers and soning wrt. a database instance. We are currently looking into how completeness statements themselves can be expressed completeness statements which are simpler than TC state- in RDF. ments in the sense that we do not contain any joins. For They also discussed the foundation for the expression of such statements, reasoning is still exponential in the size of completeness statements about RDF data sources. This al- the database schema, but experimental results suggest that in lows to complement with qualitative descriptions about com- use cases, the reasoning is feasible. A challenge is however pleteness the existing proposals like VOID that mainly deal to develop a procedure which is algorithmically complete. with quantitative descriptions. The second aspect of their work is to show that completeness statements can be useful for the semantic web in practice. On the theoretical side, 7. 
ACKNOWLEDGEMENT they provide a formalization of completeness for RDF data We thank our collaborators Fariz Darari, Flip Korn, Paramita sources and techniques to reason about the completeness of Mirza, Marco Montali, Sergey Paramonov, Giuseppe Pirró, query answers. From the practical side, completeness state- Radityo Eko Prasojo, Ognjen Savkovic and Divesh Srivas- ments can be easily embedded in current descriptions of data tava. sources and thus readily used. The results on RDF data have This work has been partially supported by the project been implemented by Darari et al. in a demo system called “MAGIC: Managing Completeness of Data” funded by the CORNER [6]. province of Bozen-Bolzano. 6. CURRENT WORK 8. REFERENCES In this section we list problems that our group is currently [1] S. Abiteboul, P.C. Kanellakis, and G. Grahne. On the working on. representation and querying of sets of possible worlds. In Proc. SIGMOD, pages 34–48, 1987. 6.1 SPARQL Queries with Negation [2] Keith L Clark. Negation as failure. In Logic and data bases, pages 293–322. Springer, 1978. RDF data is often treated as incomplete, following the Open-World Assumption. On the other hand, SPARQL, the [3] Fariz Darari, Werner Nutt, Giuseppe Pirrò, and Simon standard query language over RDF, usually follows the Closed- Razniewski. Completeness statements about RDF data World Assumption, assuming RDF data to be complete. What sources and their use for query answering. In then happens is the semantic gap between RDF and SPARQL. International Semantic Web Conference (1), pages 66–83, In current work, Darari et al. [4] address how to close the se- 2013. mantic gap between RDF and SPARQL, in terms of certain an- [4] Fariz Darari, Simon Razniewski, and Werner Nutt. swers and possible answers using completeness statements. Bridging the semantic gap between RDF and SPARQL Table 2 shows current results for the relations between query using completeness statements. ISWC, 2013. answers, certain answers and possible answers for queries [5] Ch. Elkan. Independence of logic database queries and with negation. The queries are assumed to be of the form updates. In Proc. PODS, pages 154–160, 1990. Q(s̄) :− P+ , ¬P− , where P+ is the positive part and P− is the [6] Radityo Eko Prasojo Fariz Darari and Werner Nutt. negative part. Then we use letters C and N to indicate which CORNER: A completeness reasoner for the semantic parts are complete. E.g. Q(s̄) :− N, ¬C indicates that the pos- web (poster). ESWC, 2013. itive part is not complete and the negative part is complete. [7] T. Imieliński and W. Lipski, Jr. Incomplete information As the table shows, depending on the complete parts, the in relational databases. J. ACM, 31:761–791, 1984. 63 [8] Alon Y. Levy. Obtaining complete answers from of geographical data (short paper). In BNCOD, 2013. incomplete databases. In Proceedings of the International [15] Simon Razniewski, Marco Montali, and Werner Nutt. Conference on Very Large Data Bases, pages 402–412, 1996. Verification of query completeness over processes. In [9] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, BPM, pages 155–170, 2013. and Divesh Srivastava. Answering queries using views. [16] Raymond Reiter. On closed world data bases. In Logic In PODS, pages 95–104, 1995. and Data Bases, pages 55–76, 1977. [10] A. Motro. Integrity = Validity + Completeness. ACM [17] Ognjen Savkovic, Paramita Mirza, Sergey Paramonov, TODS, 14(4):480–502, 1989. and Werner Nutt. 
Magik: managing completeness of [11] Werner Nutt, Sergey Paramonov, and Ognjen Savkovic. data. In CIKM, pages 2725–2727, 2012. An ASP approach to query completeness reasoning. [18] Ognjen Savkovic, Paramita Mirza, Alex Tomasi, and TPLP, 13(4-5-Online-Supplement), 2013. Werner Nutt. Complete approximations of incomplete [12] Werner Nutt and Simon Razniewski. Completeness of queries. PVLDB, 6(12):1378–1381, 2013. queries over SQL databases. In CIKM, pages 902–911, [19] L. Segoufin and V. Vianu. Views and queries: 2012. Determinacy and rewriting. In Proc. PODS, pages [13] S. Razniewski and W. Nutt. Completeness of queries 49–60, 2005. over incomplete databases. In VLDB, 2011. [14] S. Razniewski and W. Nutt. Assessing the completeness 64 Towards Semantic Recommendation of Biodiversity Datasets based on Linked Open Data Felicitas Löffler Bahar Sateli René Witte Birgitta König-Ries Dept. of Mathematics Semantic Software Lab Semantic Software Lab Friedrich Schiller University and Computer Science Dept. of Computer Science Dept. of Computer Science Jena, Germany and Friedrich Schiller University and Software Engineering and Software Engineering German Centre for Integrative Jena, Germany Concordia University Concordia University Biodiversity Research (iDiv) Montréal, Canada Montréal, Canada Halle-Jena-Leipzig, Germany ABSTRACT 1. INTRODUCTION Conventional content-based filtering methods recommend Content-based recommender systems observe a user’s brows- documents based on extracted keywords. They calculate the ing behaviour and record the interests [1]. By means of natu- similarity between keywords and user interests and return a ral language processing and machine learning techniques, the list of matching documents. In the long run, this approach user’s preferences are extracted and stored in a user profile. often leads to overspecialization and fewer new entries with The same methods are utilized to obtain suitable content respect to a user’s preferences. Here, we propose a seman- keywords to establish a content profile. Based on previously tic recommender system using Linked Open Data for the seen documents, the system attempts to recommend similar user profile and adding semantic annotations to the index. content. Therefore, a mathematical representation of the user Linked Open Data allows recommendations beyond the con- and content profile is needed. A widely used scheme are TF- tent domain and supports the detection of new information. IDF (term frequency-inverse document frequency) weights One research area with a strong need for the discovery of [19]. Computed from the frequency of keywords appearing new information is biodiversity. Due to their heterogeneity, in a document, these term vectors capture the influence of the exploration of biodiversity data requires interdisciplinary keywords in a document or preferences in a user profile. The collaboration. Personalization, in particular in recommender angle between these vectors describes the distance or the systems, can help to link the individual disciplines in bio- closeness of the profiles and is calculated with similarity mea- diversity research and to discover relevant documents and sures, like the cosine similarity. The recommendation lists of datasets from various sources. 
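As a toy illustration of the TF-IDF weighting and cosine similarity sketched above, the following R snippet matches a made-up user profile against two made-up documents; real systems of course use proper tokenization and much larger vocabularies.

  docs <- c(doc1 = "fossil gastropod triassic limestone",
            doc2 = "bird migration tracking sensor",
            user = "fossil triassic geology")

  terms <- unique(unlist(strsplit(docs, " ")))
  tf <- sapply(docs, function(d) {
    w <- strsplit(d, " ")[[1]]
    sapply(terms, function(t) sum(w == t) / length(w))
  })
  idf   <- log(length(docs) / rowSums(tf > 0))
  tfidf <- tf * idf                       # term-by-document TF-IDF matrix

  cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  cosine(tfidf[, "user"], tfidf[, "doc1"])  # > 0: shared terms "fossil", "triassic"
  cosine(tfidf[, "user"], tfidf[, "doc2"])  # 0: no overlapping terms, never recommended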
We developed a first prototype these traditional, keyword-based recommender systems often for our semantic recommender system in this field, where a contain very similar results to those already seen, leading multitude of existing vocabularies facilitate our approach. to overspecialization [11] and the “Filter-Bubble”-effect [17]: The user obtains only content according to the stored prefer- ences, other related documents not perfectly matching the Categories and Subject Descriptors stored interests are not displayed. Thus, increasing diversity H.3.3 [Information Storage And Retrieval]: Informa- in recommendations has become an own research area [21, 25, tion Search and Retrieval; H.3.5 [Information Storage 24, 18, 3, 6, 23], mainly used to improve the recommendation And Retrieval]: Online Information Services results in news or movie portals. One field where content recommender systems could en- hance daily work is research. Scientists need to be aware General Terms of relevant research in their own but also neighboring fields. Design, Human Factors Increasingly, in addition to literature, the underlying data itself and even data that has not been used in publications are being made publicly available. An important example Keywords for such a discipline is biodiversity research, which explores content filtering, diversity, Linked Open Data, recommender the variety of species and their genetic and characteristic systems, semantic indexing, semantic recommendation diversity [12]. The morphological and genetic information of an organism, together with the ecological and geographical context, forms a highly diverse structure. Collected and stored in different data formats, the datasets often contain or link to spatial, temporal and environmental data [22]. Many important research questions cannot be answered by working with individual datasets or data collected by one group, but require meta-analysis across a wide range of data. Since the analysis of biodiversity data is quite time-consuming, there is Copyright c by the paper’s authors. Copying permitted only a strong need for personalization and new filtering techniques for private and academic purposes. in this research area. Ordinary search functions in relevant In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- data portals or databases, e.g., the Global Biodiversity In- Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. 65 formation Facility (GBIF)1 and the Catalog of Life,2 only that several types of relations can be taken into account. return data that match the user’s query exactly and fail at For instance, for a user interested in “geology”, the profile finding more diverse and semantically related content. Also, contains the concept “geology” that also permits the recom- user interests are not taken into account in the result list. mendation of inferred concepts, e.g., “fossil”. The idea of We believe our semantic-based content recommender system recommending related concepts was first introduced by Mid- could facilitate the difficult and time-consuming research delton et al. [15]. They developed Quickstep, a recommender process in this domain. system for research papers with ontological terms in the user Here, we propose a new semantic-based content recom- profile and for paper categories. 
The ontology only considers mender system that represents the user profile as Linked is-a relationships and omits other relation types (e.g., part- Open Data (LOD) [9] and incorporates semantic annotations of). Another simple hierarchical approach from Shoval et into the recommendation process. Additionally, the search al. [13] calculates the distance among concepts in a profile engine is connected to a terminology server and utilizes the hierarchy. They distinguish between perfect, close and weak provided vocabularies for a recommendation. The result list match. When the concept appears in both a user’s and docu- contains more diverse predictions and includes hierarchical ment’s profile, it is called a perfect match. In a close match, concepts or individuals. the concept emerges only in one of the profiles and a child or The structure of this paper is as follows: Next, we de- parent concept appears in the other. The largest distance is scribe related work. Section 3 presents the architecture of called a weak match, where only one of the profiles contains a our semantic recommender system and some implementation grandchild or grandparent concept. Finally, a weighted sum details. In Section 4, an application scenario is discussed. Fi- over all matching categories leads to the recommendation nally, conclusions and future work are presented in Section 5. list. This ontological filtering method was integrated into the news recommender system epaper. Another semantically en- hanced recommender system is Athena [10]. The underlying 2. RELATED WORK ontology is used to explore the semantic neighborhood in the The major goal of diversity research in recommender sys- news domain. The authors compared several ontology-based tems is to counteract overspecialization [11] and to recom- similarity measures with the traditional TF-IDF approach. mend related products, articles or documents. More books However, this system lacks of a connection to a search engine of an author or different movies of a genre are the classical that allows to query large datasets. applications, mainly used in recommender systems based on All presented systems use manually established vocabular- collaborative filtering methods. In order to enhance the vari- ies with a limited number of classes. None of them utilize ety in book recommendations, Ziegler et al. [25] enrich user a generic user profile to store the preferences in a seman- profiles with taxonomical super-topics. The recommendation tic format (RDF/XML or OWL). The FOAF (Friend Of A list generated by this extended profile is merged with a rank Friend) project3 provides a vocabulary for describing and in reverse order, called dissimilarity rank. Depending on a connecting people, e.g., demographic information (name, ad- certain diversification factor, this merging process supports dress, age) or interests. As one of the first, in 2006 Celma [2] more or less diverse recommendations. Larger diversification leveraged FOAF in his music recommender system to store factors lead to more diverse products beyond user interests. users’ preferences. Our approach goes beyond the FOAF Zhang and Hurley [24] favor another mathematical solution interests, by incorporating another generic user model vo- and describe the balance between diversity and similarity as cabulary, the Intelleo User Modelling Ontology (IUMO).4 a constrained optimization problem. 
They compute a dis- Besides user interests, IUMO offers elements to store learning similarity matrix according to applied criterias, e.g., movie goals, competences and recommendation preferences. This genres, and assign a matching function to find a subset of allows to adapt the results to a user’s previous knowledge or products that are diverse as well as similar. One hybrid to recommend only documents for a specific task. approach by van Setten [21] combines the results of several conventional algorithms, e.g., collaborative and case-based, to improve movie recommendations. Mainly focused on news 3. DESIGN AND IMPLEMENTATION or social media, approaches using content-based filtering In this section, we describe the architecture and some methods try to present different viewpoints on an event to implementation details of our semantic-based recommender decrease the media bias in news portals [18, 3] or to facilitate system (Figure 1). The user model component, described in the filtering of comments [6, 23]. Section 3.1, contains all user information. The source files, Apart from Ziegler et al., none of the presented approaches described in Section 3.2, are analyzed with GATE [5], as de- have considered semantic technologies. However, utilizing scribed in Section 3.3. Additionally, GATE is connected with ontologies and storing user or document profiles in triple a terminology server (Section 3.2) to annotate documents stores represents a large potential for diversity research in with concepts from the provided biodiversity vocabularies. recommender systems. Frasincar et al. [7] define semanti- In Section 3.4, we explain how the annotated documents are cally enhanced recommenders as systems with an underly- indexed with GATE Mı́mir [4]. The final recommendation list ing knowledge base. This can either be linguistic-based [8], is generated in the recommender component (Section 3.5). where only linguistic relations (e.g., synonymy, hypernomy, meronymy, antonymy) are considered, or ontology-based. In 3.1 User profile the latter case, the content and the user profile are repre- The user interests are stored in an RDF/XML format uti- sented with concepts of an ontology. This has the advantage lizing the FOAF vocabulary for general user information. In 1 3 GBIF, http://www.gbif.org FOAF, http://xmlns.com/foaf/spec/ 2 4 Catalog of Life, http://www.catalogueoflife.org/col/ IUMO, http://intelleo.eu/ontologies/user-model/ search/all/ spec/ 66 Figure 1: The architecture of our semantic content recommender system order to improve the recommendations regarding a user’s existing vocabularies. Furthermore, biodiversity is an inter- previous knowledge and to distinguish between learning goals, disciplinary field, where the results from several sources have interests and recommendation preferences, we incorporate to be linked to gain new knowledge. A recommender system the Intelleo User Modelling Ontology for an extended profile for this domain needs to support scientists by improving this description. Recommendation preferences will contain set- linking process and helping them finding relevant content in tings in respect of visualization, e.g., highlighting of interests, an acceptable time. and recommender control options, e.g., keyword-search or Researchers in the biodiversity domain are advised to store more diverse results. Another adjustment will adapt the their datasets together with metadata, describing informa- result set according to a user’s previous knowledge. 
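A Turtle sketch of a profile in the spirit of Listing 1 is given below; the um: namespace URI and most property choices are assumptions made here for illustration, and only foaf:firstName and um:TopicPreference also appear in the SPARQL query of Listing 3.

  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix um:   <http://intelleo.eu/ontologies/user-model/ns/> .
  @prefix obo:  <http://purl.obolibrary.org/obo/> .

  <#me> a foaf:Person ;
      foaf:firstName "Felicitas" ;
      foaf:familyName "Loeffler" ;
      foaf:gender "female" ;
      # interest stored as a link to the LOD resource, not as a textual keyword
      um:TopicPreference obo:ENVO_01000009 .   # "biotic mesoscopic physical object"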
In order tion about their collected data. A very common metadata to enhance the comprehensibility for a beginner, the system format is ABCD.7 This XML-based standard provides ele- could provide synonyms; and for an expert the recommender ments for general information (e.g., author, title, address), could include more specific documents. as well as additional biodiversity related metadata, like infor- The interests are stored in form of links to LOD resources. mation about taxonomy, scientific name, units or gathering. For instance, in our example profile in Listing 1, a user is Very often, each taxon needs specific ABCD fields, e.g., fossil interested in “biotic mesoscopic physical object”, which is a datasets include data about the geological era. Therefore, concept from the ENVO5 ontology. Note that the interest several additional ABCD-related metadata standards have entry in the RDF file does not contain the textual description, emerged (e.g., ABCDEFG8 , ABCDDNA9 ). One document but the link to the concept in the ontology, i.e., http://purl. may contain the metadata of one or more species observations obolibrary.org/obo/ENVO_01000009. Currently, we only in a textual description. This provides for annotation and support explicit user modelling. Thus, the user information indexing for a semantic search. For our prototype, we use the has to be added manually to the RDF/XML file. Later, we ABCDEFG metadata files provided by the GFBio10 project; intend to develop a user profiling component, which gathers specifically, metadata files from the Museum für Naturkunde a user’s interests automatically. The profile is accessible via (MfN).11 An example for an ABCDEFG metadata file is an Apache Fuseki6 server. presented in Listing 2, containing the core ABCD structure as well as additional information about the geological era. Listing 1: User profile with interests stored as The terminology server supplied by the GFBio project of- Linked Open Data URIs fers access to several biodiversity vocabularies, e.g., ENVO, BEFDATA, TDWGREGION. It also provides a SPARQL Felicitas 3.3 Semantic annotation Loeffler The source documents are analyzed and annotated accord- Felicitas Loeffler ing to the vocabularies provided by the terminology server. Female that offers several standard language engineering components Friedrich Schiller University Jena [5]. We developed a custom GATE pipeline (Figure 2) that felicitas.loeffler@uni−jena.de analyzes the documents: First, the documents are split into included in the GATE distribution. Afterwards, an ‘Anno- tation Set Transfer’ processing resource adds the original 7 3.2 Source files and terminology server 8 ABCD, http://www.tdwg.org/standards/115/ ABCDEFG, http://www.geocase.eu/efg The content provided by our recommender comes from the 9 ABCDDNA, http://www.tdwg.org/standards/640/ biodiversity domain. This research area offers a wide range of 10 GFBio, http://www.gfbio.org 5 11 ENVO, http://purl.obolibrary.org/obo/envo.owl MfN, http://www.naturkundemuseum-berlin.de/ 6 12 Apache Fuseki, http://jena.apache.org/documentation/ GFBio terminology server, http://terminologies.gfbio. serving_data/ org/sparql/ 67 Figure 2: The GFBio pipeline in GATE presenting the GFBio annotations markups of the ABCDEFG files to the annotation set, e.g., the user in steering the recommendation process actively. abcd:HigherTaxon. The following ontology-aware ‘Large KB The recommender component is still under development and Gazetteer’ is connected to the terminology server. 
For each has not been added to the implementation yet. document, all occurring ontology classes are added as specific “gfbioAnnot” annotations that have both instance (link to Listing 2: Excerpt from a biodiversity metadata file the concrete source document) and class URI. At the end, a in ABCDEFG format [20] ‘GATE Mı́mir Processing Resource’ submits the annotated documents to the semantic search engine. 3.4 Semantic indexing For semantic indexing, we are using GATE Mı́mir:13 “Mı́mir MfN − Fossil invertebrates is a multi-paradigm information management index and Gastropods, bivalves, brachiopods, sponges repository which can be used to index and search over text, annotations, semantic schemas (ontologies), and semantic metadata (instance data)” [4]. Besides ordinary keyword- Gastropods, Bivalves, Brachiopods, Sponges based search, Mı́mir incorporates the previously generated semantic annotations from GATE to the index. Addition- ally, it can be connected to the terminology server, allowing MfN queries over the ontologies. All index relevant annotations MfN − Fossil invertebrates Ia and the connection to the terminology server are specified in MB.Ga.3895 an index template. 3.5 Content recommender Euomphaloidea Family The Java-based content recommender sends a SPARQL query to the Fuseki Server and obtains the interests and preferred recommendation techniques from the user profile Euomphalus sp. SPARQL query to the Mı́mir server. Presently, this query asks only for child nodes (Figure 3). The result set contains ABCDEFG metadata files related to a user’s interests. We intend to experiment with further semantic relations in the future, e.g., object properties. Assuming that a specific fossil used to live in rocks, it might be interesting to know if other System species, living in this geological era, occured in rocks. An- Triassic other filtering method would be to use parent or grandparent provide control options and feedback mechanisms to support 13 GATE Mı́mir, https://gate.ac.uk/mimir/ 68 Figure 3: A search for “biotic mesoscopic physical object” returning documents about fossils (child concept) 4. APPLICATION The semantic content recommender system allows the recommendation of more specific and diverse ABCDEFG metadata files with respect to the stored user interests. List- ing 3 shows the query to obtain the interests from a user profile, introduced in Listing 1. The result contains a list of (LOD) URIs to concepts in an ontology. Figure 4: An excerpt from the ENVO ontology Listing 3: SPARQL query to retrieve user interests 5. CONCLUSIONS SELECT ?label ?interest ?syn WHERE We introduced our new semantically enhanced content { recommender system for the biodiversity domain. Its main ?s foaf:firstName "Felicitas" . benefit lays in the connection to a search engine supporting ?s um:TopicPreference ?interest . ?interest rdfs:label ?label . integrated textual, linguistic and ontological queries. We are ?interest oboInOwl:hasRelatedSynonym ?syn using existing vocabularies from the terminology server of the } GFBio project. The recommendation list contains not only classical keyword-based results, but documents including In this example, the user would like to obtain biodiversity semantically related concepts. datasets about a “biotic mesoscopic physical object”, which In future work, we intend to integrate semantic-based rec- is the textual description of http://purl.obolibrary.org/ ommender algorithms to obtain further diverse results and to obo/ENVO_01000009. 
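A possible Java counterpart of this step, using Apache Jena to run the query of Listing 3 against the Fuseki server and to collect the interest URIs that are subsequently sent to Mímir, is sketched below; the endpoint URL and the IUMO namespace are assumptions.

import org.apache.jena.query.*;

// Sketch: retrieve a user's interests (LOD URIs) from the Fuseki server,
// as the content recommender does before querying Mimir (Section 3.5).
public class InterestLookupSketch {
    public static void main(String[] args) {
        String endpoint = "http://localhost:3030/profiles/sparql"; // assumed Fuseki endpoint

        String query =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> " +
            "PREFIX um: <http://intelleo.eu/ontologies/user-model/ns/> " +  // assumed namespace
            "SELECT ?label ?interest ?syn WHERE { " +
            "  ?s foaf:firstName \"Felicitas\" . " +
            "  ?s um:TopicPreference ?interest . " +
            "  ?interest rdfs:label ?label . " +
            "  ?interest oboInOwl:hasRelatedSynonym ?syn " +
            "}";

        try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                // The interest URI (e.g. the ENVO concept) is what gets sent to Mimir;
                // label and synonyms can be shown to the user.
                System.out.println(row.getResource("interest").getURI()
                        + " (" + row.getLiteral("label").getString() + ")");
            }
        }
    }
}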
This technical term might be incom- support the interdisciplinary linking process in biodiversity prehensible for a beginner, e.g., a student, who would prefer research. We will set up an experiment to evaluate the algo- a description like “organic material feature”. Thus, for a rithms in large datasets with the established classification later adjustment of the result according to a user’s previous metrics Precision and Recall [14]. Additionally, we would knowledge, the system additionally returns synonyms. like to extend the recommender component with control op- The returned interest (LOD) URI is utilized for a second tions for the user [1]. Integrated into a portal, the result query to the search engine (Figure 3). The connection to the list should be adapted according to a user’s recommendation terminology server allows Mı́mir to search within the ENVO settings or adjusted to previous knowledge. These control ontology (Figure 4) and to include related child concepts functions allow the user to actively steer the recommenda- as well as their children and individuals. Since there is no tion process. We are planning to utilize the new layered metadata file containing the exact term “biotic mesoscopic evaluation approach for interactive adaptive systems from physical object”, a simple keyword-based search would fail. Paramythis, Weibelzahl and Masthoff [16]. Since adaptive However, Mı́mir can retrieve more specific information than systems present different results to each user, ordinary eval- stored in the user profile and is returning biodiversity meta- uation metrics are not appropriate. Thus, accuracy, validity, data files about “fossil”. That ontology class is a child node of usability, scrutability and transparency will be assessed in “biotic mesoscopic physical object” and represents a semantic several layers, e.g., the collection of input data and their relation. Due to a high similarity regarding the content of interpretation or the decision upon the adaptation strategy. the metadata files, the result set in Figure 3 contains only This should lead to an improved consideration of adaptivity documents which closely resemble each other. in the evaluation process. 69 6. ACKNOWLEDGMENTS P. B. Kantor, editors, Recommender Systems Handbook, This work was supported by DAAD (German Academic pages 73–105. Springer, 2011. Exchange Service)14 through the PPP Canada program and [12] M. Loreau. Excellence in ecology. International Ecology by DFG (German Research Foundation)15 within the GFBio Institute, Oldendorf, Germany, 2010. project. [13] V. Maidel, P. Shoval, B. Shapira, and M. Taieb-Maimon. Ontological content-based filtering 7. REFERENCES for personalised newspapers: A method and its evaluation. Online Information Review, 34 Issue [1] F. Bakalov, M.-J. Meurs, B. König-Ries, B. Sateli, 5:729–756, 2010. R. Witte, G. Butler, and A. Tsang. An approach to [14] C. D. Manning, P. Raghavan, and H. Schütze. controlling user models and personalization effects in Introduction to Information Retrieval. Cambridge recommender systems. In Proceedings of the 2013 University Press, 2008. international conference on Intelligent User Interfaces, [15] S. E. Middleton, N. R. Shadbolt, and D. C. D. Roure. IUI ’13, pages 49–56, New York, NY, USA, 2013. ACM. Ontological user profiling in recommender systems. [2] Ò. Celma. FOAFing the music: Bridging the semantic ACM Trans. Inf. Syst., 22(1):54–88, Jan. 2004. gap in music recommendation. In Proceedings of 5th [16] A. Paramythis, S. Weibelzahl, and J. Masthoff. 
Layered International Semantic Web Conference, pages 927–934, evaluation of interactive adaptive systems: Framework Athens, GA, USA, 2006. and formative methods. User Modeling and [3] S. Chhabra and P. Resnick. Cubethat: News article User-Adapted Interaction, 20(5):383–453, Dec. 2010. recommender. In Proceedings of the sixth ACM [17] E. Pariser. The Filter Bubble - What the internet is conference on Recommender systems, RecSys ’12, pages hiding from you. Viking, 2011. 295–296, New York, NY, USA, 2012. ACM. [18] S. Park, S. Kang, S. Chung, and J. Song. Newscube: [4] H. Cunningham, V. Tablan, I. Roberts, M. Greenwood, delivering multiple aspects of news to mitigate media and N. Aswani. Information extraction and semantic bias. In Proceedings of the SIGCHI Conference on annotation for multi-paradigm information Human Factors in Computing Systems, CHI ’09, pages management. In M. Lupu, K. Mayer, J. Tait, and A. J. 443–452, New York, NY, USA, 2009. ACM. Trippe, editors, Current Challenges in Patent [19] G. Salton and C. Buckley. Term-weighting approaches Information Retrieval, volume 29 of The Information in automatic text retrieval. Information Processing and Retrieval Series, pages 307–327. Springer Berlin Management, 24:513–523, 1988. Heidelberg, 2011. [20] Museum für Naturkunde Berlin. Fossil invertebrates, [5] H. Cunningham et al. Text Processing with GATE UnitID:MB.Ga.3895. (Version 6). University of Sheffield, Dept. of Computer http://coll.mfn-berlin.de/u/MB_Ga_3895.html. Science, 2011. [21] M. van Setten. Supporting people in finding [6] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg. information: hybrid recommender systems and Opinion space: A scalable tool for browsing online goal-based structuring. PhD thesis, Telematica Instituut, comments. In Proceedings of the SIGCHI Conference University of Twente, The Netherlands, 2005. on Human Factors in Computing Systems, CHI ’10, pages 1175–1184, New York, NY, USA, 2010. ACM. [22] R. Walls, J. Deck, R. Guralnick, S. Baskauf, R. Beaman, and et al. Semantics in Support of [7] F. Frasincar, W. IJntema, F. Goossen, and Biodiversity Knowledge Discovery: An Introduction to F. Hogenboom. A semantic approach for news the Biological Collections Ontology and Related recommendation. Business Intelligence Applications Ontologies. PLoS ONE 9(3): e89606, 2014. and the Web: Models, Systems and Technologies, IGI Global, pages 102–121, 2011. [23] D. Wong, S. Faridani, E. Bitton, B. Hartmann, and K. Goldberg. The diversity donut: enabling participant [8] F. Getahun, J. Tekli, R. Chbeir, M. Viviani, and control over the diversity of recommended responses. In K. Yétongnon. Relating RSS News/Items. In CHI ’11 Extended Abstracts on Human Factors in M. Gaedke, M. Grossniklaus, and O. Dı́az, editors, Computing Systems, CHI EA ’11, pages 1471–1476, ICWE, volume 5648 of Lecture Notes in Computer New York, NY, USA, 2011. ACM. Science, pages 442–452. Springer, 2009. [24] M. Zhang and N. Hurley. Avoiding monotony: [9] T. Health and C. Bizer. Linked Data: Evolving the Web Improving the diversity of recommendation lists. In into a Global Data Space. Synthesis Lectures on the Proceedings of the 2008 ACM Conference on Semantic Web: Theory and Technology. Morgan & Recommender Systems, RecSys ’08, pages 123–130, New Claypool, 2011. York, NY, USA, 2008. ACM. [10] W. IJntema, F. Goossen, F. Frasincar, and [25] C.-N. Ziegler, G. Lausen, and L. Schmidt-Thieme. F. Hogenboom. Ontology-based news recommendation. 
Taxonomy-driven computation of product In Proceedings of the 2010 EDBT/ICDT Workshops, recommendations. In Proceedings of the Thirteenth EDBT ’10, pages 16:1–16:6, New York, NY, USA, 2010. ACM International Conference on Information and ACM. Knowledge Management, CIKM ’04, pages 406–415, [11] P. Lops, M. de Gemmis, and G. Semeraro. New York, NY, USA, 2004. ACM. Content-based recommender systems: State of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and 14 DAAD, https://www.daad.de/de/ 15 DFG, http://www.dfg.de 70 Exploring Graph Partitioning for Shortest Path Queries on Road Networks Theodoros Chondrogiannis Johann Gamper Free University of Bozen-Bolzano Free University of Bozen-Bolzano tchond@inf.unibz.it gamper@inf.unibz.it ABSTRACT The classic solution for the shortest path problem is Dijkstra’s al- Computing the shortest path between two locations in a road net- gorithm [1]. Given a source s and a destination t in a road network work is an important problem that has found numerous applica- G, Dijkstra’s algorithm traverses the vertices in G in ascending or- tions. The classic solution for the problem is Dijkstra’s algo- der of their distances to s. However, Dijkstra’s algorithm comes rithm [1]. Although simple and elegant, the algorithm has proven with a major shortcoming. When the distance between the source to be inefficient for very large road networks. To address this defi- and the target vertex is high, the algorithm has to expand a very ciency of Dijkstra’s algorithm, a plethora of techniques that intro- large subset of the vertices in the graph. To address this short- duce some preprocessing to reduce the query time have been pro- coming, several techniques have been proposed over the last few posed. In this paper, we propose Partition-based Shortcuts (PbS), a decades [3]. Such techniques require a high start-up cost, but in technique based on graph-partitioning which offers fast query pro- terms of query processing they outperform Dijkstra’s algorithm by cessing and supports efficient edge weight updates. We present a orders of magnitude. shortcut computation scheme, which exploits the traits of a graph Although most of the proposed techniques offer fast query pro- partition. We also present a modified version of the bidirectional cessing, the preprocessing is always performed under the assump- search [2], which uses the precomputed shortcuts to efficiently an- tion that the weights of a road network remain unchanged over swer shortest path queries. Moreover, we introduce the Corridor time. Moreover, the preprocessing is metric-specific, thus for dif- Matrix (CM), a partition-based structure which is exploited to re- ferent metrics the preprocessing needs to be performed for each duce the search space during the processing of shortest path queries metric. The recently proposed Customizable Route Planning [4] when the source and the target point are close. Finally, we evaluate applies preprocessing for various metrics, i.e., distance, time, turn the performance of our modified algorithm in terms of preprocess- cost and fuel consumption. Such an approach allows a fast com- ing cost and query runtime for various graph partitioning configu- putation of shortest path queries using any metric desired by the rations. user, at the cost of some extra space. Moreover, the update cost for the weights is low since the structure is designed such that only a small part of the preprocessed information has to be recomputed. 
Keywords In this paper, our aim is to develop an approach which offers even Shortest path, road networks, graph partitioning faster query processing, while keeping the update cost of the pre- processed information low. This is particularly important in dy- namic networks, where edge weights might frequently change, e.g., 1. INTRODUCTION due to traffic jams. Computing the shortest path between two locations in a road The contributions of this paper can be summarized as follows: network is a fundamental problem and has found numerous ap- • We present Partitioned-based Shortcuts (PbS), a preprocess- plications. The problem can be formally defined as follows. Let ing method which is based on Customizable Route Planning G(V, E) be a directed weighted graph with vertices V and edges (CRP), but computes more shortcuts in order to reduce the E. For each edge e ∈ E, a weight l(e) is assigned, which usually query processing time. represents the length of e or the time required to cross e. A path p between two vertices s, t ∈ V is a sequence of connected edges, • We propose the Corridor Matrix (CM), a pruning technique p(s, t) = h(s, v1 ), (v1 , v2 ), . . . , (vk , vt )i where (vk , vk+1 ) ∈ E, which can be used for shortest path queries when the source that connects s and t. The shortest path between two vertices s and and the target are very close and the precomputed shortcuts t is the path p(s, t) that has the shortest distance among all paths cannot be exploited. that connect s and t. • We run experiments for several different partition configura- tions and we evaluate our approach in terms of both prepro- cessing and query processing cost. The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe in detail the prepro- In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26 GI- cessing phase of our method. In Section 5, we present a modified Workshop on Foundations of Databases (Grundlagen von Datenbanken), version of the bidirectional search algorithm. In Section 6, we show 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. Copyright c by the paper’s authors. Copying permitted only for private preliminary results of an empirical evaluation. Section 7 concludes and academic purposes.. the paper and points to future research directions. 71 2. RELATED WORK in each component and then CRP applies a modified bidirectional The preprocessing based techniques that have been proposed search algorithm which expands only the shortcuts and the edges in in order to reduce the time required for processing shortest path the source or the target component. The main difference between queries can be classified into different categories [3]. Goal-directed our approach and CRP is that, instead of computing only shortcuts techniques use either heuristics or precomputed information in or- between border nodes in each component, we compute shortcuts der to limit the search space by excluding vertices that are not in from every node of a component to the border nodes of the same the direction of the target. For example, A∗ [5] search uses the component. The extra shortcuts enable the bidirectional algorithm Euclidean distance as a lower bound. ALT [6] uses precomputed to start directly from the border nodes, while CRP has to scan the shortest path distances to a carefully selected set of landmarks and original edges of the source and the target component. produces the lower bound using the triangle inequality. 
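For reference, the classic baseline recalled above can be sketched as follows over the adjacency-list representation implied by the definition of G(V, E) and l(e); this is an illustrative implementation, not code from the paper.

import java.util.*;

// Sketch of Dijkstra's algorithm on a directed weighted graph G(V, E),
// with l(e) given as non-negative integer edge weights.
public class DijkstraSketch {
    /** adj.get(u) holds edges as {v, l(u,v)}; returns dist(s, t). */
    static long shortestPath(List<List<int[]>> adj, int s, int t) {
        long[] dist = new long[adj.size()];
        Arrays.fill(dist, Long.MAX_VALUE);
        dist[s] = 0;
        // Vertices are settled in ascending order of their distance to s.
        PriorityQueue<long[]> queue =                       // queue entries: {dist, vertex}
                new PriorityQueue<>(Comparator.comparingLong(e -> e[0]));
        queue.add(new long[]{0, s});
        while (!queue.isEmpty()) {
            long[] top = queue.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) continue;                 // stale entry, skip
            if (u == t) return dist[t];                     // target settled
            for (int[] edge : adj.get(u)) {
                int v = edge[0], w = edge[1];
                if (dist[u] != Long.MAX_VALUE && dist[u] + w < dist[v]) {
                    dist[v] = dist[u] + w;                  // relax edge (u, v)
                    queue.add(new long[]{dist[v], v});
                }
            }
        }
        return dist[t];                                     // MAX_VALUE if t is unreachable
    }
}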
Some goal- directed techniques exploit graph partitioning in order to prune the 3. PBS PREPROCESSING search space and speed-up queries. Precomputed Cluster Distances The Partition-based Shortcuts (PbS) method we propose ex- (PCD) [7] partitions the graph into k components, computes the ploits graph partitioning to produce shortcuts in a preprocessing distance between all pairs of components and uses the distances be- phase, which during the query phase are used to efficiently com- tween components to compute lower bounds. Arc Flags [8] main- pute shortest path queries. The idea is similar to the concept of tains a vector of k bits for each edge, where the i-th bit is set if the transit nodes [12]. Every shortest path between two nodes lo- arc lies on a shortest path to some vertex of component i. Other- cated in different partitions (also termed components) can be ex- wise, all edges of component i are pruned by the search algorithm. pressed as a combination of three smaller shortest paths. Con- Path Coherent techniques take advantage of the fact that shortest sider the graph in Figure 1 and a query q(s, t), where s ∈ C1 paths in road networks are often spatially coherent. To illustrate the and t ∈ C5 . The shortest path from s to t can be expressed as concept of spatial coherence, let us consider four locations s, s0 , t p(s, bs ) + p(bs , bt ) + p(bt , t), where bs ∈ {b1 , b2 } and bt ∈ and t0 in a road network. If s is close to s0 and t is close to t0 , the {b3 , b4 , b5 }. Before PbS is able to process shortest path queries, shortest path from s to t is likely to share vertices with the shortest a preprocessing phase is required, which consists of three steps: path from s0 to t0 . Spatial coherence methods precompute all short- graph partitioning, in-component shortcut computation and short- est paths and use then some data structures to index the paths and cut graph construction. answer queries efficiently. For example, Spatially Induced Linkage Cognizance (SILC) [9] use a quad-tree [10] to store the paths. Path- 3.1 Graph Partitioning Coherent Pairs Decomposition (PCPD) [11] computes unique path The first step in the pre-processing phase is the graph partition- coherent pairs and retrieves any shortest path recursively in almost ing. Let G(V, E) be a graph with vertices V and edges E. A linear time to the size of the path. partition of G is a set P (G) = {C1 , . . . , Ck } of connected sub- Bounded-hop techniques aim to reduce a shortest path query to graphs Ci of G, also referred to as components of G. For the set a number of look-ups. Transit Node Routing (TNR) [12] is an in- P (G), all components must be disjoint, i.e., C1 ∩ . . . ∩ Ck = ∅. dexing method that imposes a grid on the road network and re- Moreover, let V1 , . . . , V|P (G)| be the sets of vertices of each com- computes the shortest paths from within each grid cell C to a set ponent. The vertex sets of all components must cover the vertex set of vertices that are deemed important for C (so-called access nodes of the graph, i.e., V1 ∪ . . . ∪ V|P (G)| = V . We assign a tag to each of C). More approaches are based on the theory of 2-hop label- node of the original graph, which indicates the component the node ing [13]. During preprocessing, a label L(u) is computed for each is located in. 
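Assuming the three ingredients of this decomposition are available, the combination step itself is a minimization over all border-node pairs, as the following sketch shows. The paper does not enumerate the pairs explicitly but seeds a many-to-many bidirectional search on the shortcut graph (Section 5); the sketch and its packed map keys are therefore only an illustration of the decomposition, not the proposed query algorithm.

import java.util.Map;

// Sketch of the decomposition p(s, b_s) + p(b_s, b_t) + p(b_t, t): the shortest
// s-t distance is the minimum of the three-part sums over all outgoing border
// nodes b_s of C_s and all incoming border nodes b_t of C_t.
public class CombineSketch {
    /** toBorder: length of the shortcut from s to each b_s;
     *  fromBorder: length of the shortcut from each b_t to t;
     *  between: border-to-border distances, keyed by the packed pair (b_s, b_t). */
    static long distance(Map<Integer, Long> toBorder, Map<Integer, Long> fromBorder,
                         Map<Long, Long> between) {
        long best = Long.MAX_VALUE;
        for (Map.Entry<Integer, Long> out : toBorder.entrySet()) {
            for (Map.Entry<Integer, Long> in : fromBorder.entrySet()) {
                // pack the border-node pair (b_s, b_t) into one long key
                Long mid = between.get(((long) out.getKey() << 32)
                        | (in.getKey() & 0xffffffffL));
                if (mid == null) continue;
                long cand = out.getValue() + mid + in.getValue();
                if (cand < best) best = cand;
            }
        }
        return best;
    }
}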
The set of connecting edges, EC ⊆ E, is the set of all vertex u of the graph such that for any pair u, v of vertices, the edges in the graph for which the source and target nodes belong to distance dist(u, v) can be determined by only looking at the labels different components, i.e., (n, n0 ) ∈ E such that n ∈ Ci , n0 ∈ Cj L(u) and L(v). A natural special case of this approach is Hub La- and Ci 6= Cj . Finally, we define the border nodes of a component beling (HL) [14], in which the label L(u) associated with vertex C. A node n ∈ C is a border node of C if there exists a connecting u consists of a set of vertices (the hubs of u), together with their edge e = (n, n0 ) or e = (n0 , n), i.e., n0 is not in C. If e = (n, n0 ), distances from u. n is called outgoing border node of C, whereas if e = (n0 , n), n Finally, Hierarchical techniques aim to impose a total order on is called incoming border node of C. The set of all border nodes the nodes as they deem nodes that are crossed by many shortest of a graph is referred to as B. Figure 1 illustrates a graph parti- paths as more important. Highway Hierarchies (HH) [15] and its tioned into five components. The filled nodes are the border nodes. direct descendant Contraction Hierarchies (CH) organize the nodes Note that for ease of exposition we use only undirected graphs in in the road network into a hierarchy based on their relative im- the examples. portance, and create shortcuts among vertices at the same level of the hierarchy. Arterial Hierarchies (AH) [16] are inspired by CH, but produce shortcuts by imposing a grid on the graph. AH outperform CH in terms of both asymptotic and practical perfor- mance [17]. Some hierarchical approaches exploit graph partition to create shortcuts. HEPV [18] and HiTi [19] are techniques that pre-computes the distance between any two boundary vertices and create a new overlay graph. By partitioning the overlay graph and repeating the process several times, a hierarchy of partitions is cre- ated, which is used to process shortest path queries. The recent Customizable Route Planning (CRP) [4] is the clos- est work to our own. CRP is able to handle various arbitrary met- rics and can also handle dynamic edge weight updates. CRP uses PUNCH [20], a graph partitioning algorithm tailored to road net- works. CRP pre-computes distances between boundary vertices Figure 1: Partitioned graph into five components. 72 We characterize a graph partition as good if it minimizes the Thus, the number of vertices and edges in the shortcut graph is, number of connecting edges between the components. However, respectively, graph partitioning is an N P -hard problem, thus an optimal solu- k tion is out of the question [21]. A popular approach is multilevel X |B| = |Biinc ∪ Biout | and graph partitioning (MGP), which can be found in many software i=1 libraries, such as METIS [22]. Algorithms such as PUNCH [20] k X and Spatial Partition Clustering (SPC) [23] take advantage of road |Esc | = (|Biinc | × |Biout |) + EC . network characteristics in order to provide a more efficient graph i=1 partitioning. We use METIS for graph partitioning since it is the most efficient approach out of all available ones [24]. METIS re- Figure 3 shows the shortcut graph of our running example. Notice quires only the number of components as an argument in order to that only border nodes are vertices of the shortcut graph. The set of perform the partitioning. 
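A straightforward way to derive these sets, given a node-to-component assignment, is sketched below; the data layout is an assumption, since the paper does not prescribe one.

import java.util.*;

// Sketch: derive connecting edges and incoming/outgoing border nodes from a
// node-to-component assignment, following the definitions in Section 3.1.
public class PartitionSketch {
    final int[] component;                 // component[v] = component tag of node v
    final List<int[]> connectingEdges = new ArrayList<>();   // edges (u, v) across components
    final Set<Integer>[] outBorder;        // outgoing border nodes per component
    final Set<Integer>[] inBorder;         // incoming border nodes per component

    @SuppressWarnings("unchecked")
    PartitionSketch(int k, int[] component, List<List<Integer>> adj) {
        this.component = component;
        outBorder = new HashSet[k];
        inBorder = new HashSet[k];
        for (int i = 0; i < k; i++) {
            outBorder[i] = new HashSet<>();
            inBorder[i] = new HashSet<>();
        }
        for (int u = 0; u < adj.size(); u++) {
            for (int v : adj.get(u)) {
                if (component[u] != component[v]) {          // connecting edge (u, v)
                    connectingEdges.add(new int[]{u, v});
                    outBorder[component[u]].add(u);          // u is an outgoing border node
                    inBorder[component[v]].add(v);           // v is an incoming border node
                }
            }
        }
    }
}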
The number of components influences edges consists of connecting edges and the in-component shortcuts both the number of the in-component shortcuts and the size of the between the border nodes of the same component. Note that there shortcut graph. is no need for extra computations in order to populate the shortcut graph. 3.2 In-component Shortcuts The second step of the preprocessing phase is the computation of the in-component shortcuts. For each node n in the original graph, we compute the shortest path from the node to every outgoing bor- der node of the component in which n is located. Then we create outgoing shortcuts which abstract the shortest path from n to each outgoing border node. The incoming shortcuts are computed in a similar fashion. Thus, the total number of in-component shortcuts, S, is k X S= Ni × (|Biinc | + |Biout |), i=1 where Ni is the number of nodes in component Ci and Biinc , Biout are the incoming and outgoing border nodes of Ci , respectiv- Figure 3: Shortcut Graph illustrated over the original. elly. Figure 2 shows the in-component shortcuts for a node located in component C2 . 4. CORRIDOR MATRIX In Section 3 we presented how PbS creates shortcuts in order to answer queries when the source and the target points are in differ- ent components. However, when the source and the target points of a query are located in the same component, the shortest path may lie entirely inside the component. Therefore, the search algo- rithm will never reach the border nodes and the shortcuts will not be expanded. In such a case, the common approach is to use bidi- rectional search to return the shortest path. However, if the compo- nents of the partitioned graph are large, the query processing can be quite slow. In order to improve the processing time of such queries, we partition each component again into sub-components, and for each component, we compute its Corridor Matrix (CM). In gen- Figure 2: In-component shortcuts for a given node. eral, given a partition of a graph G in k components, the Corridor Matrix (CM) of G is a k × k matrix, where each cell C(i, j) of For each border node in a component, b ∈ C, we execute Di- CM contains a list of components that are crossed by some short- jkstra’s algorithm with b as source and all other nodes (including est path from a node s ∈ Ci to a node t ∈ Cj . We call such a border nodes) in C as targets. Depending on the type of the source list the corridor from Ci to Cj . The concept of the CM is similar node, the expansion strategy is different. When an incoming bor- to Arc-Flags [8], but the CM requires much less space. The space der node is the source, forward edges are expanded; vice versa, complexity of the CM is O(k3 ), where k is the number of compo- when an outgoing border node is the source, incoming edges are nents in the partition, while the space complexity of Arc-Flags is expanded. This strategy ensures that the maximum number of node |E| × k2 , where |E| is the number of edges in the original graph. expansions is at most twice the number of border nodes of G. C1 C2 C3 C4 C5 3.3 Shortcut Graph Construction C1 ∅ {C2 , C3 } The third step of the preprocessing phase of our approach is the C2 ∅ construction of the shortcut graph. Given a graph G, the shortcut C3 ∅ graph of G is a graph Gsc (B, Esc ), where B is the set of border C4 ∅ nodes of G and Esc = EC ∪ SG is the union of the connecting C5 ∅ edges, EC , of G and the shortcuts, SG , from every incoming bor- der node to every outgoing border node of the same component. 
Figure 4: Corridor Matrix example. 73 To optimize the look-up time in CM, we implemented each com- Name Region # Vertices # Edges ponent list using a bitmap of length k. Therefore, the space com- CAL California/Nevada 1,890,815 4,657,742 plexity of the CM in the worst case is O(k3 ). The actual space FLA Florida 1,070,376 2,712,798 occupied by the CM is smaller, since we do not allocate space for BAY SF Bay Area 321,270 800,172 bitmaps when the component list is empty. For the computation of NY New York City 264,346 733,846 the Corridor Matrix, we generate the Shortcut Graph in the same ROME Center of Rome 3353 8,859 way as described in Section 3.3. To compute the distances between all pairs of vertices, we use the Floyd-Warshall algorithm [25], Table 1: Dataset characteristics. which is specifically designed to compute the all-pair shortest path distance efficiently. After having computed the distances between the nodes, instead of retrieving each shortest path, we retrieve only the components that are crossed by each path, and we update the contain 1000 queries each. We make sure that the distance of ev- CM accordingly. ery query in set Qi is smaller than the distance of every query in set Qi+1 . We also evaluate the CM separately by comparing our CM implementation against Arc Flags and the original bidi- 5. SHORTEST PATH ALGORITHM rectional search for a set of 1000 random queries in the ROME In order to process a shortest path query from a source point s dataset. We use a small dataset in order to simulate in-component to a target point t, we first determine the components of the graph query processing. the nodes s ∈ Cs and t ∈ Ct are located in. If Cs = Ct , we execute a modified bidirectional search from s to t. Note that the 6.1 Preprocessing and Space Overhead shortcuts are not used for processing queries for which the source Figures 5 and 6 show a series of measurements for the prepro- and target are located in the same component C. Instead, we re- cessing cost of our approach in comparison to CRP and CH over trieve the appropriate corridor from the CM of C, which contains the four largest datasets. Figure 5 shows how many shortcuts are a list of sub-components. Then, we apply bidirectional search and created by each approach. The extra shortcuts can be translated prune all nodes that belong to sub-components which are not in the into the space overhead required in order to speed-up shortest path retrieved corridor. queries. CH uses shortcuts which represent only two edges, while In the case that the points s and t are not located in the same the shortcuts in PbS and CRP are composed of much longer se- component, we exploit the pre-computed shortcuts. First, we re- quences. The difference between the shortcuts produced by CRP trieve the lengths of the in-component outgoing shortcuts from s to and CH is much less. In short, PbS produces about two orders of all the outgoing borders of Cs and the length of the in-component magnitude more shortcuts than CRP and CH. Moreover, we can ob- incoming shortcuts from all the incoming borders of Ct to t. Then serve that the number of shortcuts produced by PbS is getting lower we apply a many-to-many bidirectional search in the overlay graph as the number of components is increasing. from all the outgoing borders of Cs to all the incoming borders of Ct . We use the length of the in-component shortcuts (retrieved CH CRP PbS in the first step) as initial weights for the source and target nodes of the bidirectional search in the Shortcut Graph. 
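One possible realization of the bitmap-based CM and of the pruning test used by the restricted bidirectional search is sketched below; the representation is an assumption, and populating the corridors from the Floyd-Warshall paths is only indicated by the insertion method.

import java.util.BitSet;

// Sketch of the Corridor Matrix (Section 4): cell (i, j) stores, as a bitmap of
// length k, the components crossed by some shortest path from C_i to C_j.
public class CorridorMatrixSketch {
    private final BitSet[][] cell;
    private final int k;

    CorridorMatrixSketch(int k) {
        this.k = k;
        cell = new BitSet[k][k];             // bitmaps are allocated lazily to save space
    }

    /** Record that some shortest path from C_i to C_j crosses component c. */
    void addToCorridor(int i, int j, int c) {
        if (cell[i][j] == null) cell[i][j] = new BitSet(k);
        cell[i][j].set(c);
    }

    /** Pruning test used by the bidirectional search inside a component:
     *  a node of sub-component c is expanded only if c lies in the corridor. */
    boolean inCorridor(int i, int j, int c) {
        return cell[i][j] != null && cell[i][j].get(c);
    }
}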
The list of edges 3 ·10 7 shortcuts 3 ·107 shortcuts consisting the path is a set of connecting edges of the original graph and in-component shortcuts. For each shortcut we retrieve the pre- computed set of the original edges. The cost to retrieve the original 2 2 path is linear to the size of the path. After the retrieval we replace the shortcuts with the list of edges in the original graph and we re- 1 1 turn the new edge list, which is the shortest path from s to t in the original graph. 0 0 128 256 384 512 128 256 384 512 6. PRELIMINARY RESULTS (a) NY (b) BAY In this section, we compare our PbS method with CRP, the 1 ·108 shortcuts 2 ·108 shortcuts method our own approach is based on, and CH, a lightweight yet very efficient state-of-the-art approach for shortest path queries in 0.75 1.5 road networks [17]. CRP can handle arbitrary metrics and edge weight updates, while CH is a technique with fast pre-processing 0.5 1 and relatively low query processing time. We implemented in Java the basic version of CRP and PbS. The CH algorithm in the ex- 0.25 0.5 periments is from Graphhopper Route Planner [26]. Due to the different implementations of the graph models between ours and 0 256 512 768 1,024 0 512 1,024 1,536 2,048 CH, we do not measure the runtime. Instead, for preprocessing we count the extra shortcuts created by each algorithm, while for query (c) FLA (d) CAL processing we count the number of expanded nodes. For the experiments we follow the same evaluation setting as Figure 5: Preprocessing: # of shortcuts vs. # of components. in [17]. We use 5 publicly available datasets [27], four of of which are a part of the US road network, and the smallest one represents The same tendency as observed for the number of shortcuts can the road network of Rome. We present the characteristics of each be observed for the preprocessing time. In Figure 6, we can see dataset in Table 1. In order to compare our PbS approach and CRP that PbS requires much more time than CRP and CH in order to with CH, we run our experiments over 5 query sets Q1 –Q5, which create shortcuts. However, we should also notice that the update 74 cost for CRP and PbS is only a small portion of the preprocessing CRP PbS cost. When an edge weight changes, we need to update only the ·104 expanded nodes ·104 expanded nodes shortcuts that contains that particular edge. In contrast, for CH the 1 1 the update cost is the same as the preprocesing cost since a change 0.75 0.75 in a single weight can influence the entire hierarchy. 0.5 0.5 CH CRP PbS preprocessing time(sec) preprocessing time(sec) 0.25 0.25 300 300 0 0 128 256 384 512 128 256 384 512 200 200 (a) NY (b) BAY ·104 expanded nodes ·104 expanded nodes 2 3 100 100 1.5 2 0 0 128 256 384 512 128 256 384 512 1 (a) NY (b) BAY 1 preprocessing time(sec) preprocessing time(sec) 0.5 1,500 3,000 0 0 256 512 768 1,024 512 1,024 1,536 2,048 1,000 2,000 (c) FLA (d) CAL 500 1,000 Figure 7: Performance of shortest path queries vs. # of components. 0 0 256 512 768 1,024 512 1,024 1,536 2,048 7. CONCLUSION (c) FLA (d) CAL In this paper we presented PbS, an approach which uses graph partitioning in order to compute shortcuts and speed-up shortest Figure 6: Preprocessing: time vs. # of components. path queries in road networks. Our aim was a solution which sup- ports efficient and incremental updates of edge weights, yet is ef- ficient enough in many real-world applications. In the evaluation, 6.2 Query Processing we showed that our PbS approach outperforms CRP. 
PbS supports Figure 7 shows a series of measurements of the performance of edge weight updates as any change in the weight of an edge can CRP and PbS. We evaluate both techniques for different partitions influence only shortcuts in a single component. On the other hand, and various numbers of components. An important observation is CH is faster than our PbS approach. However, CH cannot handle the tendency of the performance for CRP and PbS. The perfor- well edge weight updates as almost the entire hierarchy of short- mance of CRP gets worse for partitions with many components cuts has to be recomputed every time a single weight changes. For while the opposite happens for PbS. The reason is that for parti- queries where the source and the target are in the same component, tions with few components, PbS manages to process many queries we introduced the CM. The efficiency of the CM in query process- with two look-ups (the case where the source and the target are in ing approaches the efficiency of Arc Flags, while consuming much adjacent components). less space. In Figure 8 we compare CH with CRP (we choose the best result) In future work, we plan to extend our approach to support multi- and two configurations of PbS: PbS-BT, which is the configuration modal transportation networks, where the computation has to con- that leads to the best performance, and PbS-AVG, which is the aver- sider a time schedule, and dynamic and traffic aware networks, age performance of PbS among all configurations. We can see that where the weights of the edges change over time. We will also PbS outperforms CRP in all datasets from Q1 to Q5 . However, CH improve the preprocessing phase of our approach both in terms of is faster in terms of query processing than our PbS approach. CH time overhead, by using parallel processing, and space overhead, is more suitable for static networks as the constructed hierarchy of by using compression techniques or storing some of the precom- shortcuts enables the shortest path algorithm to expand much fewer puted information on the disk. nodes. 6.3 In-component Queries 8. REFERENCES In Figure 9, we compare the performance of our bidirectional [1] E. W. Dijkstra. A note on two problems in connexion with algorithm using the proposed CM, the original bidirectional search graphs. Numerische Mathematik, 1(1):269–271, December and the bidirectional algorithm using Arc Flags. We observe that 1959. the bidirectional search is the slowest since no pruning is applied. [2] I. S. Pohl. Bi-directional and Heuristic Search in Path Between Arc Flags and CM, the Arc Flags provide slightly better Problems. PhD thesis, Stanford, CA, USA, 1969. pruning thus fewer expanded nodes by the bidirectional search. On AAI7001588. the other hand, the preprocessing time required to compute the Arc [3] H. Bast, D. Delling, A. Goldberg, M. Müller, T. Pajor, Flags is significantly higher than the time required to compute the P. Sanders, D. Wagner, and R Werneck. Route planning in CM. transportation networks. (MSR-TR-2014-4), January 2014. 75 CH CRP PbS-BT PbS-AVG Int. Workshop on Geographic Information Systems (GIS), page 200, 2005. 8,000 [10] R.A. Finkel and J. L. Bentley. Quad trees: A data structure 8,000 for retrieval on composite keys. Acta Informatica, 4(1):1–9, 6,000 6,000 1974. [11] J. Sankaranarayanan and H. Samet, H. andi Alborzi. Path 4,000 4,000 Oracles for Spatial Networks. In Proc. of the 35th VLDB 2,000 2,000 Conf., pages 1210–1221, 2009. [12] H. Bast, S. Funke, D Matijevic, P. Sanders, and D. 
Schultes. 0 Q1 Q2 Q3 Q4 Q5 0 Q1 Q2 Q3 Q4 Q5 In Transit to Constant Time Shortest-Path Queries in Road Networks. In Proc. of the Workshop on Algorithm (a) NY (b) BAY Engineering and Experiments, pages 45–59, 2007. ·104 ·104 [13] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. 3 Reachability and distance queries via 2-hop labels. In Proc. 1.5 of the 13th ACM-SIAM Symposium on Discrete Algorithms 2 (SODA), pages 937–946, 2002. 1 [14] I. Abraham, D. Delling, A. V. Goldberg, and R. F. Werneck. A hub-based labeling algorithm for shortest paths in road 0.5 1 networks. In Proc. of the 10th Int. Symposium on Experimental Algorithms, pages 230–241, 2011. 0 0 Q1 Q2 Q3 Q4 Q5 Q1 Q2 Q3 Q4 Q5 [15] P. Sanders and D. Schultes. Highway Hierarchies Hasten (c) FLA (d) CAL Exact Shortest Path Queries. In Proc. of the 13th European Conf. on Algorithms (ESA), pages 568–579, 2005. Figure 8: Performance of shortest path queries vs. query sets. [16] A. D. Zhu, H. Ma, X. Xiao, S. Luo, Y. Tang, and S. Zhou. Shortest Path and Distance Queries on Road Networks: Towards Bridging Theory and Practice. In Proc. of the 32nd Bidirectional Arc Flags CM SIGMOD Conf., pages 857–868, 2013. 12 3,000 [17] L. Wu, X. Xiao, D. Deng, G. Cong, and A. D. Zhu. Shortest Path and Distance Queries on Road Networks : An 9 Experimental Evaluation. In Proc. of the 39th VLDB Conf., 2,000 pages 406–417, 2012. 6 [18] Y. W. Huang, N. Jing, and E. A. Rundensteiner. Hierarchical 1,000 path views : A model based on fragmentation and 3 transportation road types. In Proc. of the 3rd ACM Workshop 0 0 Geographic Information Systems (GIS),, 1995. 8 16 24 32 40 48 8 16 24 32 40 48 [19] S. Jung and S. Pramanik. Hiti graph model of topographical (a) Preprocessing time (ms) (b) Visited nodes roadmaps in navigation systems. In Proc. of the 12th ICDE Conf., pages 76–84, 1996. Figure 9: Evaluation of Arc Flags & CM using ROME dataset. [20] D. Delling, A. V. Goldberg, I. Razenshteyn, and R. F. Werneck. Graph Partitioning with Natural Cuts. In Proc. of the 35th Int. Parallel & Distributed Processing Symposium [4] D. Delling, A. V. Goldberg, T. Pajor, and R. F. Werneck. (IPDPS), pages 1135–1146, 2011. Customizable route planning. In Proc. of the 10th Int. [21] A. E. Feldmann and L/ Foschini. Balanced Partitions of Symposium on Experimental Algorithms (SEA), pages Trees and Applications. In 29th Symp. on Theoretical 376–387, 2011. Aspects of Computer Science, volume 14, pages 100–111, [5] P. Hart, N. Nilsson, and B. Raphael. Formal Basis for the Paris, France, 2012. Heuristic Determination of Minimum Cost PAths. IEEE [22] G. Karypis and V. Kumar. A Fast and High Quality Transactions of Systems Science and Cybernetics, Multilevel Scheme for Partitioning Irregular Graphs. SIAM 4(2):100–107, 1968. Journal on Scientific Computing, 20(1):359–392, 1998. [6] A. V. Goldberg and C. Harrelson. Computing the Shortest [23] Y. W. Huang, N. Jing, and E. Rundensteiner. Effective Graph Path : A * Search Meets Graph Theory. In Proc. of the 16th Clustering for Path Queries in Digital Map Databases. In ACM-SIAM Symposium on Discrete Algorithms (SODA), Proc. of the 5th Int. Conf. on Information and Knowledge pages 156–165, 2005. Management, pages 215–222, 1996. [7] J. Maue, P. Sanders, and D. Matijevic. Goal-directed [24] X. Sui, D. Nguyen, M. Burtscher, and K. Pingali. Parallel shortest-path queries using precomputed cluster distances. graph partitioning on multicore architectures. In Proc. of the Journal on Experimental Algorithms, 14:2:3.2–2:3.27, 23rd Int. Conf. 
on Languages and Compilers for Parallel January 2010. Computing, pages 246–260, 2011. [8] E. Köhler, R. H. Möhring, and H. Schilling. Fast [25] R. W. Floyd. Algorithm 97: Shortest path. Communications point-to-point shortest path computations with arc-flags. In of the ACM, 5:345, 1962. Proc. of the 9th DIMACS Implementation Challenge, 2006. [26] https://graphhopper.com. [9] J. Sankaranarayanan, H. Alborzi, and H. Samet. Efficient [27] http://www.dis.uniroma1.it/challenge9/. query processing on spatial networks. In Proc. of the 2005 76 Missing Value Imputation in Time Series using Top-k Case Matching Kevin Wellenzohn Hannes Mitterer Johann Gamper Free University of Free University of Free University of Bozen-Bolzano Bozen-Bolzano Bozen-Bolzano kevin.wellenzohn@unibz.it hannes.mitterer@unibz.it gamper@inf.unibz.it M. H. Böhlen Mourad Khayati University of Zurich University of Zurich boehlen@ifi.uzh.ch mkhayati@ifi.uzh.ch ABSTRACT pecially frost is dangerous as it can destroy the harvest within a In this paper, we present a simple yet effective algorithm, called few minutes unless the farmers react immediately. The Südtiroler the Top-k Case Matching algorithm, for the imputation of miss- Beratungsring operates more than 120 weather stations spread all ing values in streams of time series data that are similar to each over South Tyrol, where each of them collects every five minutes other. The key idea of the algorithm is to look for the k situations up to 20 measurements including temperature, humidity etc. The in the historical data that are most similar to the current situation weather stations frequently suffer outages due to sensor failures or and to derive the missing value from the measured values at these k errors in the transmission of the data. However, the continuous time points. To efficiently identify the top-k most similar historical monitoring of the current weather condition is crucial to immedi- situations, we adopt Fagin’s Threshold Algorithm, yielding an al- ately warn about imminent threats such as frost and therefore the gorithm with sub-linear runtime complexity with high probability, need arises to recover those missing values as soon as they are de- and linear complexity in the worst case (excluding the initial sort- tected. ing of the data, which is done only once). We provide the results In this paper, we propose an accurate and efficient method to of a first experimental evaluation using real-world meteorological automatically recover missing values. The need for a continuous data. Our algorithm achieves a high accuracy and is more accurate monitoring of the weather condition at the SBR has two important and efficient than two more complex state of the art solutions. implications for our solution. Firstly, the proposed algorithm has to be efficient enough to complete the imputation before the next set of measurements arrive in a few minutes time. Secondly, the Keywords algorithm cannot use future measurements which would facilitate Time series, imputation of missing values, Threshold Algorithm the imputation, since they are not yet available. The key idea of our Top-k Case Matching algorithm is to seek for the k time points in the historical data when the measured val- 1. INTRODUCTION ues at a set of reference stations were most similar to the measured Time series data is ubiquitous, e.g., in the financial stock mar- values at the current time point (i.e., the time point when a value is ket or in meteorology. In many applications time series data is in- missing). 
The missing value is then derived from the values at the k complete, that is some values are missing for various reasons, e.g., past time points. While a naïve solution to identify the top-k most sensor failures or transmission errors. However, many applications similar historical situations would have to scan the entire data set, assume complete data, hence need to recover missing values before we adopt Fagin’s Threshold Algorithm, which efficiently answers further data processing is possible. top-k queries by scanning, on average, only a small portion of the In this paper, we focus on the imputation of missing values in data. The runtime complexity of our solution is derived from the long streams of meteorological time series data. As a case study, Threshold Algorithm and is sub-linear with high probability and we use real-world meteorological data collected by the Südtiroler linear in the worst case, when all data need to be scanned. We pro- Beratungsring1 (SBR), which is an organization that provides pro- vide the results of a first experimental evaluation using real-world fessional and independent consultancy to the local wine and apple meteorological data from the SBR. The results are promising both farmers, e.g., to determine the optimal harvesting time or to warn in terms of efficiency and accuracy. Our algorithm achieves a high about potential threats, such as apple scab, fire blight, or frost. Es- accuracy and is more accurate than two state of the art solutions. 1 The rest of the paper is organized as follows. In Section 2, we http://www.beratungsring.org/ review the existing literature about imputation methods for missing values. In Section 3, we introduce the basic notation and a running example. In Section 4, we present our Top-k Case Matching algo- rithm for the imputation of missing values, followed by the results of an experimental evaluation in Section 5. Section 6 concludes the paper and outlines ideas for future work. Copyright © by the paper’s authors. Copying permitted only for private and academic purposes. In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 2. RELATED WORK Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. Khayati et al. [4] present an algorithm, called REBOM, which 77 recovers blocks of missing values in irregular (with non repeating t∈w s(t) r1 (t) r2 (t) r3 (t) trends) time series data. The algorithm is based on an iterated trun- 1 16.1° 15.0° 15.9° 14.1° cated matrix decomposition technique. It builds a matrix which 2 15.8° 15.2° 15.7° 13.9° stores the time series containing the missing values and its k most 3 15.9° 15.2° 15.8° 14.1° correlated time series according to the Pearson correlation coeffi- 4 16.2° 15.0° 15.9° 14.2° cient [7]. The missing values are first initialized using a simple 5 16.5° 15.3° 15.7° 14.5° interpolation technique, e.g., linear interpolation. Then, the ma- 6 16.1° 15.2° 16.0° 14.1° trix is iteratively decomposed using the truncated Singular Value 7 ? 15.0° 16.0° 14.3° Decomposition (SVD). By multiplying the three matrices obtained from the decomposition, the algorithm is able to accurately approx- Table 1: Four time series in a window w = [1, 7]. imate the missing values. Due to its quadratic runtime complexity, REBOM is not scalable for long time series data. s (Schlanders) r1 (Kortsch) Khayati et al. 
[5] further investigate the use of matrix decompo- r2 (Göflan) r3 (Laas) sition techniques for the imputation of missing values. They pro- Temperature in Degree Celsius pose an algorithm with linear space complexity based on the Cen- troid Decomposition, which is an approximation of SVD. Due to 16 the memory-efficient implementation, the algorithm scales to long time series. The imputation follows a similar strategy as the one used in REBOM. 15 The above techniques are designed to handle missing values in static time series. Therefore, they are not applicable in our sce- nario, as we have to continuously impute missing values as soon 14 as they appear. A naïve approach to run the algorithms each time 1 2 3 4 5 6 7 a missing value occurs is not feasible due to their relatively high runtime complexity. Timestamps There are numerous statistical approaches for the imputation of missing values, including easy ones such as linear or spline interpo- Figure 1: Visualization of the time series data. lation, all the way up to more complex models such as the ARIMA model. The ARIMA model [1] is frequently used for forecasting future values, but can be used for backcasting missing values as For the imputation of missing values we assign to each time se- well, although this is a less common use case. A recent comparison ries s a set Rs of reference time series, which are similar to s. of statistical imputation techniques for meteorological data is pre- The notion of similarity between two time series is tricky, though. sented in [9]. The paper comprises several simple techniques, such Intuitively, we want time series to be similar when they have sim- as the (weighted) average of concurrent measurements at nearby ilar values and behave similarly, i.e., values increase and decrease reference stations, but also computationally more intensive algo- roughly at the same time and by the same amount. rithms, such as neural networks. As a simple heuristic for time series similarity, we use the spa- tial proximity between the stations that record the respective time 3. BACKGROUND series. The underlying assumption is that, if the weather stations are nearby (say within a radius of 5 kilometers), the measured val- Let S = {s1 , . . . , sn } be a set of time series. Each time series, ues should be similar, too. Based on this assumption, we manually s ∈ S, has associated a set of reference time series Rs , Rs ⊆ compiled a list of 3–5 reference time series for each time series. S \ {s}. The value of a time series s ∈ S at time t is denoted as This heuristic turned out to work well in most cases, though there s(t). A sliding window of a time series s is denoted as s([t1 , t2 ]) are situations where the assumption simply does not hold. One rea- and represents all values between t1 and t2 . son for the generally good results is most likely that in our data E XAMPLE 1. Table 1 shows four temperature time series in a set the over 100 weather stations cover a relatively small area, and time window w = [1, 7], which in our application corresponds to hence the stations are very close to each other. seven timestamps in a range of 30 minutes. s is the base time series from the weather station in Schlanders, and Rs = {r1 , r2 , r3 } is 4. TOP-K CASE MATCHING the associated set of reference time series containing the stations Weather phenomena are often repeating, meaning that for exam- of Kortsch, Göflan, and Laas, respectively. 
The temperature value ple during a hot summer day in 2014 the temperature measured at s(7) is missing. Figure 1 visualizes this example graphically. the various weather stations are about the same as those measured The Top-k Case Matching algorithm we propose assumes that during an equally hot summer day in 2011. We use this observa- the time series data is aligned, which generally is not the case for tion for the imputation of missing values. Let s be a time series our data. Each weather station collects roughly every 5 minutes where the current measurement at time θ, s(θ), is missing. Our new measurements and transmits them to a central server. Since assumption on which we base the imputation is as follows: if we the stations are not perfectly synchronized, the timestamps of the find historical situations in the reference time series Rs such that measurements typically differ, e.g., one station collects measure- the past values are very close to the current values at time θ, then ments at 09:02, 09:07, . . . , while another station collects them at also the past measurements in s should be very similar to the miss- 09:04, 09:09, . . . . Therefore, in a pre-processing step we align the ing value s(θ). Based on this assumption, the algorithm searches time series data using linear interpolation, which yields measure- for similar climatic situations in historical measurements, thereby ment values every 5 minutes (e.g., 00:00, 00:05, 00:10, . . . ). If we leveraging the vast history of weather records collected by the SBR. observe a gap of more than 10 minutes in the measurements, we More formally, given a base time series s with reference time assume that the value is missing. series Rs , we are looking for the k timestamps (i.e., historical sit- 78 uations), D = {t1 , . . . , tk }, ti < θ, which minimize the error popularity. Let us assume that k = 2 and the aggregation function function f (x1 , x2 ) = x1 + x2 . Further, assume that the bounded X buffer currently contains {(C, 18), (A, 16)} and the algorithm has δ(t) = |r(θ) − r(t)|. read the data up to the boxes shown in gray. At this point the al- r∈Rs gorithm computes the threshold using the interestingness That is, δ(t) ≤ δ(t ) for all t ∈ D and t0 6∈ D ∪ {θ}. The er- 0 grade for object B and the popularity grade of object C, yield- ror function δ(t) is the accumulated absolute difference between ing τ = f (5, 9) = 5 + 9 = 14. Since the lowest ranked object in the current temperature r(θ) and the temperature at time t, r(t), the buffer, object A, has an aggregated grade that is greater than τ , over all reference time series r ∈ Rs . Once D is determined, we can conclude that C and A are the top-2 objects. Note that the the missing value is recovered using some aggregation function algorithm never read object D, yet it can conclude that D cannot g ({s(t)|∀t ∈ D}) over the measured values of the time series s be part of the top-k list. at the timestamps in D. In our experiments we tested the average and the median as aggregation function (cf. Section 5). interestingness popularity E XAMPLE 2. We show the imputation of the missing value s(7) in Table 1 using as aggregation function g the average. For Object grade Object grade the imputation, we seek the k = 2 most similar historical sit- A 10 B 10 uations. The two timestamps D = {4, 1} minimize δ(t) with C 9 C 9 δ(4) = |15.0° − 15.0°| + |16.0° − 15.9°| + |14.3° − 14.2°| = 0.2° B 5 D 8 and δ(1) = 0.3°. 
The imputation is then simply the average D 4 A 6 of the base station measurements at time t = 4 and t = 1, i.e.,s(7) = avg(16.2°, 16.1°) = 12 (16.2° + 16.1°) = 16.15°. Table 2: Threshold Algorithm example. A naïve implementation of this algorithm would have to scan the entire database of historical data to find the k timestamps that 4.2 Adapting the Threshold Algorithm minimize δ(t). This is, however, not scalable for huge time series In order to use the Threshold Algorithm for the imputation of data, hence a more efficient technique is needed. missing values in time series data, we have to adapt it. Instead of looking for the top-k objects that maximize the aggregation func- 4.1 Fagin’s Threshold Algorithm tion f , we want to find the top-k timestamps that minimize the What we are actually trying to do is to answer a top-k query for error function δ(t) over the reference time series Rs . Similar to the k timestamps which minimize δ(t). There exist efficient algo- TA, we need sorted access to the data. Therefore, for each time rithms for top-k queries. For example, Fagin’s algorithm [2] solves series r ∈ Rs we define Lr to be the time series r ordered first this problem by looking only at a small fraction of the data. Since by value and then by timestamp in ascending order. Table 3 shows the first presentation of Fagin’s algorithm there were two notewor- the sorted data for the three reference time series of our running ex- thy improvements, namely the Threshold Algorithm (TA) by Fagin ample (ignore the gray boxes and small subscript numbers for the et al. [3] and a probabilistic extension by Theobald et al. [8]. The moment). latter approach speeds up TA by relaxing the requirement to find the exact top-k answers and providing approximations with proba- Lr1 Lr2 Lr3 bilistic guarantees. Our Top-k Case Matching algorithm is a variation of TA with t r1 (t) t r2 (t) t r3 (t) slightly different settings. Fagin et al. assume objects with m at- 1 15.0° 4 2 15.7° 2 13.9° tributes, a grade for each attribute and a monotone aggregation 4 15.0° 1 5 15.7° 1 14.1° function f : Rm 7→ R, which aggregates the m grades of an ob- 7 15.0° 3 15.8° 3 14.1° ject into an overall grade. The monotonicity property is defined as 2 15.2° 1 15.9° 6 14.1° follows. 3 15.2° 4 15.9° 5 4 14.2° 3 6 15.2° 6 16.0° 2 7 14.3° D EFINITION 1. (Monotonicity) Let x1 , . . . , xm and 5 15.3° 7 16.0° 5 14.5° 6 x01 , . . . , x0m be the m grades for objects X and X 0 , re- spectively. The aggregation function f is monotone if Table 3: Time series sorted by temperature. f (x1 , . . . , xm ) ≤ f (x01 , . . . , x0m ) given that xi ≤ x0i for each 1 ≤ i ≤ m. The general idea of our modified TA algorithm is the following. The TA finds the k objects that maximize the function f . To do The scan of each sorted lists starts at the current element, i.e., the so it requires two modes of accessing the data, one being sorted and element with the timestamp t = θ. Instead of scanning the lists Lri the other random access. The sorted access is ensured by maintain- only in one direction as TA does, we scan each list sequentially ing a sorted list Li for each attribute mi , ordered by the grade in in two directions. Hence, as an initialization step, the algorithm − descending order. TA keeps a bounded buffer of size k and scans places two pointers, pos+ r and posr , at the current value r(θ) of each list Li in parallel until the buffer contains k objects and the time series r (the gray boxes in Table 3). 
During the execution of lowest ranked object in the buffer has an aggregated grade that is the algorithm, pointer pos+ r is only incremented (i.e., moved down greater than or equal to some threshold τ . The threshold τ is com- the list), whereas pos− r is only decremented (i.e., moved up the puted using the aggregation function f over the grades last seen list). To maintain the k highest ranking timestamps, the algorithm under the sorted access for each list Li . uses a bounded buffer of size k. A new timestamp t0 is added only if the buffer is either not yet full or δ(t0 ) < δ(t), where t is the last E XAMPLE 3. Table 2 shows four objects {A, B, C, D} and (i.e., lowest ranking) timestamp in the buffer. ¯In the latter ¯ case the their grade for the two attributes interestingness and timestamp t is removed from the buffer. ¯ 79 After this initialization, the algorithm iterates over the lists Lr in Algorithm 1: Top−k Case Matching round robin fashion, i.e., once the last list is reached, the algorithm Data: Reference time series Rs , current time θ, and k wraps around and continues again with the first list. In each iter- Result: k timestamps that minimize δ(t) ation, exactly one list Lr is processed, and either pointer pos+ r or r 1 L ← {L |r ∈ Rs } pos−r is advanced, depending on which value the two pointers point 2 buffer ← boundendBuffer(k) to has a smaller absolute difference to the current value at time θ, 3 for r ∈ Rs do r(θ). This process grows a neighborhood around the element r(θ) 4 pos− + r , posr ← position of r(θ) in L r 5 end in each list. Whenever a pointer is advanced by one position, the 6 while L <> ∅ do timestamp t at the new position is processed. At this point, the 7 for Lr ∈ L do algorithm needs random access to the values r(t) in each list to 8 t ← AdvancePointer(Lr ) compute the error function δ(t). Time t is added to the bounded 9 if t = N IL then buffer using the semantics described above. 10 L ← L \ {Lr } The algorithm terminates once the error at the lowest ranking 11 else 12 if t 6∈ buffer then timestamp, t, among the k timestamps in the buffer is less or equal ¯ 13 buffer.addWithPriority(t, δ(t)) to thePthreshold, i.e., δ(t) ≤ τ . The threshold τ is defined as 14 end τ = r∈Rs |r(θ) − r(pos ¯ )|, where pos is either pos+ or pos− , r r r r 15 τ ← ComputeThreshold(L) depending on which pointer was advanced last. That is, τ is the 16 if buffer.size() = k sum over all lists Lr of the absolute differences between r(θ) and and buffer.largestError() ≤ τ then the value under pos+ − return buffer r or posr . 17 18 end E XAMPLE 4. We illustrate the Top-k Case Matching algorithm 19 end for k = 2 and θ = 7. Table 4 shows the state of the algorithm in 20 end each iteration i. The first column shows an iteration counter i, the 21 end 22 return buffer second the buffer with the k current best timestamps, and the last column the threshold τ . The buffer entries are tuples of the form (t, δ(t)). In iteration i = 1, the algorithm moves the pointer to t = 4 in list Lr1 and adds (t = 4, δ(4) = 0.2°) to the buffer. Since on the direction of the pointer. If next() reaches the end of a list, δ(4) = 0.2° > 0.0° = τ , the algorithm continues. The pointer it returns N IL. The utility functions timestamp() and value() in Lr2 is moved to t = 6, and (6, 0.4°) is added to the buffer. In return the timestamp and value of a list Lr at a given position, re- iteration i = 4, timestamp 6 is replaced by timestamp 1. Finally, spectively. 
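Because the pseudocode of Algorithm 1 is hard to read in this rendering, the following Python sketch restates its control flow. The helper callbacks advance_pointer, delta, and threshold correspond to Algorithm 2, the error function, and the threshold τ defined above; the dictionary-based bounded buffer is an illustrative simplification rather than the authors' data structure.

def topk_case_matching(sorted_lists, k, advance_pointer, delta, threshold):
    # sorted_lists: {r: list of (value, timestamp) ordered by value, then timestamp}
    # advance_pointer(r): moves pos+_r or pos-_r and returns the newly reached
    #                     timestamp, or None once both ends of L_r are reached
    # delta(t):           error of timestamp t over all reference series
    # threshold():        current tau = sum over r of |r(theta) - r(pos_r)|
    buffer = {}                        # bounded buffer: timestamp -> delta(t), size <= k
    active = set(sorted_lists)
    while active:
        for r in list(active):         # round-robin over the remaining lists
            t = advance_pointer(r)
            if t is None:
                active.discard(r)
                continue
            if t not in buffer:
                d = delta(t)           # random access into all reference series
                if len(buffer) < k:
                    buffer[t] = d
                else:
                    worst = max(buffer, key=buffer.get)
                    if d < buffer[worst]:      # replace the lowest-ranking timestamp
                        del buffer[worst]
                        buffer[t] = d
            if len(buffer) == k and max(buffer.values()) <= threshold():
                return sorted(buffer, key=buffer.get)
    return sorted(buffer, key=buffer.get)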
There are four cases, which the algorithm has to distin- in iteration i = 6, the error at timestamp t = 1 is smaller or equal guish: to τ , i.e., δ(1) = 0.3° ≤ τ6 = 0.3°. The algorithm terminates and returns the two timestamps D = {4, 1}. 1. None of the two pointers reached the beginning or end of the list. In this case, the algorithm checks which pointer to ad- vance (line 5). The pointer that is closer to r(θ) after advanc- Iteration i Buffer Threshold τi ing is moved by one position. In case of a tie, we arbitrarily 1 (4, 0.2°) 0.0° decided to advance pos+ r . 2 (4, 0.2°), (6, 0.4°) 0.0° 3 (4, 0.2°), (6, 0.4°) 0.1° 2. Only pos− r reached the beginning of the list: the algorithm 4 (4, 0.2°), (1, 0.3°) 0.1° increments pos+ r (line 11). 5 (4, 0.2°), (1, 0.3°) 0.2° 6 (4, 0.2°), (1, 0.3°) 0.3° 3. Only pos+ r reached the end of the list: the algorithm decre- ments pos− r (line 13). Table 4: Finding the k = 2 most similar historical situations. 4. The two pointers reached the beginning respective end of the list: no pointer is moved. 4.3 Implementation In the first three cases, the algorithm returns the timestamp that Algorithm 1 shows the pseudo code of the Top-k Case Matching was discovered after advancing the pointer. In the last case, N IL is algorithm. The algorithm has three input parameters: a set of time returned. series Rs , the current timestamp θ, and the parameter k. It returns At the moment we use an in-memory implementation of the al- the top-k most similar timestamps to the current timestamp θ. In gorithm, which loads the whole data set into main memory. More line 2 the algorithm initializes the bounded buffer of size k, and in specifically, we keep two copies of the data in memory: the data line 4 the pointers pos+ − r and posr are initialized for each reference sorted by timestamp for fast random access and the data sorted by time series r ∈ Rs . In each iteration of the loop in line 7, the algo- value and timestamp for fast sorted access. rithm advances either pos+ − r or posr (by calling Algorithm 2) and Note that we did not normalize the raw data using some standard reads a new timestamp t. The timestamp t is added to the bounded technique like the z-score normalization, as we cannot compute buffer using the semantics described before. In line 15, the algo- that efficiently for streams of data without increasing the complex- rithm computes the threshold τ . If the buffer contains k timestamps ity of our algorithm. and we have δ(t) ≤ τ , the top-k most similar timestamps were ¯ found and the algorithm terminates. 4.4 Proof of Correctness Algorithm 2 is responsible for moving the pointers pos+ r and The correctness of the Top-k Case Matching algorithm follows pos− r r for each list L . The algorithm uses three utility functions. directly from the correctness of the Threshold Algorithm. What The first is next(), which takes a pointer as input and returns the remains to be shown, however, is that the aggregation function δ(t) next position by either incrementing or decrementing, depending is monotone. 80 ∗ Algorithm 2: AdvancePointer ference between P the real value∗ s(θ) and the imputed value s (θ), Data: List Lr where to advance a pointer i.e., ∆ = |w| θ∈w |s(θ) − s (θ)| 1 Result: Next timestamp to look at or N IL Figure 2 shows how the accuracy of the algorithms changes with 1 pos ← N IL varying k. Interestingly and somewhat unexpectedly, ∆ decreases if next(pos+ − as k increases. 
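The four cases of the pointer-advancing step can be written down directly; the sketch below is an illustrative rendering of Algorithm 2, where the ListScanner class and its field names are assumptions rather than the authors' code.

class ListScanner:
    # Scans one sorted list L_r (a list of (value, timestamp) pairs) in two
    # directions around the start position of r(theta).
    def __init__(self, L_r, start_pos, r_theta):
        self.L = L_r
        self.pos_plus = start_pos      # only ever incremented (moves down the list)
        self.pos_minus = start_pos     # only ever decremented (moves up the list)
        self.r_theta = r_theta

    def advance(self):
        down_ok = self.pos_plus + 1 < len(self.L)
        up_ok = self.pos_minus - 1 >= 0
        if down_ok and up_ok:
            # Case 1: advance the pointer whose next value stays closer to r(theta);
            # ties go to pos+.
            d_plus = abs(self.r_theta - self.L[self.pos_plus + 1][0])
            d_minus = abs(self.r_theta - self.L[self.pos_minus - 1][0])
            if d_plus <= d_minus:
                self.pos_plus += 1
                return self.L[self.pos_plus][1]
            self.pos_minus -= 1
            return self.L[self.pos_minus][1]
        if down_ok:                    # Case 2: only pos- reached the beginning
            self.pos_plus += 1
            return self.L[self.pos_plus][1]
        if up_ok:                      # Case 3: only pos+ reached the end
            self.pos_minus -= 1
            return self.L[self.pos_minus][1]
        return None                    # Case 4: both ends reached, list exhausted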
This is somehow contrary to what we expected, 2 r ) <> N IL and next(posr ) <> N IL then ∆+ ← |r(θ) − value(Lr [next(pos+ since with an increasing k also the error function δ(t) grows, and 3 r )])| ∆− ← |r(θ) − value(Lr [next(pos− therefore less similar historical situations are used for the imputa- 4 r )])| 5 if ∆+ ≤ ∆− then tion. However, after a careful analysis of the results it turned out pos, pos+ + that for low values of k the algorithm is more sensitive to outliers, 6 r ← next(posr ) 7 else and due to the often low quality of the raw data the imputation is 8 pos, pos− − r ← next(posr ) flawed. 9 end Top-k (Average) Average Difference ∆ in °C + − 10 else if next(posr ) <> N IL and next(posr ) = N IL then 0.8 11 + pos, posr ← next(posr )+ Top-k (Median) + − Simple Average 12 else if next(posr ) = N IL and next(posr ) <> N IL then 13 pos, pos− − r ← next(posr ) 0.7 14 end 15 if pos <> N IL then 0.6 16 return timestamp(Lr [pos]) 17 else 18 return N IL 0.5 19 end 0 50 100 Parameter k T HEOREM 4.1. The aggregation function δ(t) is a monotoni- Figure 2: Impact of k on accuracy. cally increasing function. P ROOF. Let t1 and t2 be two timestamps such that |r(θ) − Table 5 shows an example of flawed raw data. The first row is r(t1 )| ≤ |r(θ) − r(t2 )| for each r ∈ Rs . Then it trivially fol- the current situation, and we assume that the value in the gray box lows that δ(t1 ) ≤ δ(t2 ) as the aggregation function δ is the sum of is missing and need to be recovered. The search for the k = 3 |r(θ) − r(t1 )| over each r ∈ Rs and, by definition, each compo- most similar situations using our algorithm yields the three rows nent of δ(t1 ) is less than or equal to the corresponding component at the bottom. Notice that one base station value is 39.9° around in δ(t2 ). midnight of a day in August, which is obviously a very unlikely thing to happen. By increasing k, the impact of such outliers is 4.5 Theoretical Bounds reduced and hence ∆ decreases. Furthermore, using the median as The space and runtime bounds of the algorithm follow directly aggregation function reduces the impact of outliers and therefore from the probabilistic guarantees of TA, which has sub-linear cost yields better results than the average. with high probability and linear cost in the worst case. Note Timestamp s r1 r2 r3 that sorting the raw data to build the lists Lr is a one-time pre- processing step with complexity O(n log n). After that the system 2013-04-16 19:35 18.399° 17.100° 19.293° 18.043° can insert new measurements efficiently into the sorted lists with 2012-08-24 01:40 18.276° 17.111° 19.300° 18.017° logarithmic cost. 2004-09-29 15:50 19.644° 17.114° 19.259° 18.072° 2003-08-02 01:10 39.900° 17.100° 19.365° 18.065° 5. EXPERIMENTAL EVALUATION Table 5: Example of flawed raw data. In this section, we present preliminary results of an experimental evaluation of the proposed Top-k Case Matching algorithm. First, Figure 3 shows the runtime, which for the Top-k Case Match- we study the impact of parameter k on the Top-k Case Matching ing algorithm linearly increases with k. Notice that, although the and a baseline algorithm. The baseline algorithm, referred to as imputation of missing values for 8 days takes several minutes, the “Simple Average”, imputes the missing value s(θ) with the average algorithm is fast enough to continuously impute missing values in of thePvalues in the reference time series at time θ, i.e., s(θ) = our application at the SBR. The experiment essentially corresponds r∈Rs r(θ). 
Second, we compare our solution with two state 1 |Rs | to a scenario, where in 11452 base stations an error occurs at the of the art competitors, REBOM [4] and CD [5]. same time. With 120 weather stations operated by the SBR, the number of missing values at each time is only a tiny fraction of the 5.1 Varying k missing values that we simulated in this experiment. In this experiment, we study the impact of parameter k on the accuracy and the runtime of our algorithm. We picked five base 5.2 Comparison with CD and REBOM stations distributed all over South Tyrol, each having two to five In this experiment, we compare the Top-k Case Matching algo- reference stations. We simulated a failure of the base station dur- rithm with two state-of-the-art algorithms, REBOM [4] and CD [5]. ing a time interval, w, of 8 days in the month of April 2013. This We used four time series, each containing 50.000 measurements, amounts to a total of 11452 missing values. We then used the Top-k which corresponds roughly to half a year of temperature measure- Case Matching (using both the average and median as aggregation ments. We simulated a week of missing values (i.e., 2017 measure- function g) and Simple Average algorithms to impute the missing ments) in one time series and used the other three as reference time values. As a measure of accuracy we use the average absolute dif- series for the imputation. 81 further study the impact of complex weather phenomena that we 800 observed in our data, such as the foehn. The foehn induces shifting effects in the time series data, as the warm wind causes the temper- Runtime (sec) 600 ature to increase rapidly by up to 15° as soon as the foehn reaches Top-k (Average) another station. 400 Top-k (Median) There are several possibilities to further improve the algorithm. Simple Average First, we would like to explore whether the algorithm can dynam- 200 ically determine an optimal value for the parameter k, which is 0 currently given by the user. Second, we would like to make the 0 50 100 algorithm more robust against outliers. For example, the algorithm Parameter k could consider only historical situations that occur roughly at the same time of the day. Moreover, we can bend the definition of “cur- Figure 3: Impact of k on runtime. rent situation” to not only consider the current timestamp, but rather a small window of consecutive timestamps. This should make the ranking more robust against anomalies in the raw data and weather The box plot in Figure 4 shows how the imputation error |s(θ) − phenomena such as the foehn. Third, right now the similarity be- s∗ (θ)| is distributed for each of the four algorithms. The left and tween time series is based solely on temperature data. We would right line of the box are the first and third quartile, respectively. like to include the other time series data collected by the weather The line inside the box denotes the median and the left and right stations, such as humidity, precipitation, wind, etc. Finally, the al- whiskers are the 2.5% and 97.5% percentile, which means that the gorithm should be able to automatically choose the currently hand- plot incorporates 95% of the values and omits statistical outliers. picked reference time series based on some similarity measures, The experiment clearly shows that the Top-k Case Matching algo- such as the Pearson correlation coefficient. rithm is able to impute the missing values more accurately than CD and REBOM. 
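For reference, the Simple Average baseline and the accuracy measure ∆ used in this evaluation take only a few lines of Python; the function names are illustrative, and window denotes the set of timestamps with simulated failures.

def simple_average(Rs, theta):
    # Baseline: impute s(theta) as the average of the reference values at time theta.
    return sum(r[theta] for r in Rs) / len(Rs)

def avg_abs_difference(s_true, s_imputed, window):
    # Delta = (1/|w|) * sum over the gap w of |s(theta) - s*(theta)|.
    return sum(abs(s_true[t] - s_imputed[t]) for t in window) / len(window)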
Although not visualized, also the maximum observed error for our algorithm is with 2.29° (Average) and 2.21° (Median) 7. ACKNOWLEDGEMENTS considerably lower than 3.71° for CD and 3.6° for REBOM. The work has been done as part of the DASA project, which is funded by the Foundation of the Free University of Bozen-Bolzano. We wish to thank our partners at the Südtiroler Beratungsring and Top-k the Research Centre for Agriculture and Forestry Laimburg for the (Median) good collaboration and helpful domain insights they provided, in Top-k particular Armin Hofer, Martin Thalheimer, and Robert Wiedmer. (Average) CD 8. REFERENCES [1] G. E. P. Box and G. Jenkins. Time Series Analysis, Forecasting REBOM and Control. Holden-Day, Incorporated, 1990. [2] R. Fagin. Combining fuzzy information from multiple systems 0 0.5 1 1.5 2 (extended abstract). In PODS’96, pages 216–226, New York, Absolute Difference in °C NY, USA, 1996. ACM. [3] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation Figure 4: Comparison with REBOM and CD. algorithms for middleware. In PODS ’01, pages 102–113, New York, NY, USA, 2001. ACM. In terms of runtime, the Top-k Case Matching algorithm needed [4] M. Khayati and M. H. Böhlen. REBOM: recovery of blocks of 16 seconds for the imputation of the 2017 missing measurements, missing values in time series. In COMAD’12, pages 44–55, whereas CD and REBOM needed roughly 10 minutes each. Note, 2012. however, that this large difference in run time is also due to the [5] M. Khayati, M. H. Böhlen, and J. Gamper. Memory-efficient fact that CD and REBOM need to compute the Pearson correlation centroid decomposition for long time series. In ICDE’14, coefficient which is a time intensive operation. pages 100–111, 2014. [6] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. Dynammo: 6. CONCLUSION AND FUTURE WORK Mining and summarization of coevolving sequences with In this paper, we presented a simple yet efficient and accurate al- missing values. In KDD’09, pages 507–516, New York, NY, gorithm, termed Top-k Case Matching, for the imputation of miss- USA, 2009. ACM. ing values in time series data, where the time series are similar to [7] A. Mueen, S. Nath, and J. Liu. Fast approximate correlation each other. The basic idea of the algorithm is to look for the k sit- for massive time-series data. In SIGMOD’10, pages 171–182, uations in the historical data that are most similar to the current sit- New York, NY, USA, 2010. ACM. uation and to derive the missing values from the data at these time [8] M. Theobald, G. Weikum, and R. Schenkel. Top-k query points. Our Top-k Case Matching algorithm is based on Fagin’s evaluation with probabilistic guarantees. In VLDB’04, pages Threshold Algorithm. We presented the results of a first experi- 648–659. VLDB Endowment, 2004. mental evaluation. The Top-k Case Matching algorithm achieves a [9] C. Yozgatligil, S. Aslan, C. Iyigun, and I. Batmaz. high accuracy and outperforms two state of the art solutions both Comparison of missing value imputation methods in time in terms of accuracy and runtime. series: the case of turkish meteorological data. Theoretical As next steps we will continue with the evaluation of the algo- and Applied Climatology, 112(1-2):143–167, 2013. rithm, taking into account also model based techniques such as Dy- naMMo [6] and other statistical approaches outlined in [9]. 
We will 82 Dominanzproblem bei der Nutzung von Multi-Feature-Ansätzen Thomas Böttcher Ingo Schmitt Technical University Cottbus-Senftenberg Technical University Cottbus-Senftenberg Walther-Pauer-Str. 2, 03046 Cottbus Walther-Pauer-Str. 2, 03046 Cottbus tboettcher@tu-cottbus.de schmitt@tu-cottbus.de ABSTRACT Ein Vergleich von Objekten anhand unterschiedlicher Eigen- schaften liefert auch unterschiedliche Ergebnisse. Zahlreiche Arbeiten haben gezeigt, dass die Verwendung von mehreren Eigenschaften signifikante Verbesserungen im Bereich des Retrievals erzielen kann. Ein großes Problem bei der Verwen- Figure 1: Unterschiedliche Objekte mit sehr hoher dung mehrerer Eigenschaften ist jedoch die Vergleichbarkeit Farbähnlichkeit der Einzeleigenschaften in Bezug auf die Aggregation. Häu- fig wird eine Eigenschaft von einer anderen dominiert. Viele Normalisierungsansätze versuchen dieses Problem zu lösen, von Eigenschaften erfolgt mittels eines Distanz- bzw. Ähn- nutzen aber nur eingeschränkte Informationen. In dieser Ar- lichkeitsmaßes1 . Bei der Verwendung mehrerer Eigenschaf- beit werden wir einen Ansatz vorstellen, der die Messung des ten lassen sich Distanzen mittels einer Aggregationsfunktion Grades der Dominanz erlaubt und somit auch eine Evaluie- verknüpfen und zu einer Gesamtdistanz zusammenfassen. rung verschiedener Normalisierungsansätze. Der Einsatz von unterschiedlichen Distanzmaßen und Ag- gregationsfunktionen bringt jedoch verschiedene Probleme mit sich: Keywords Verschiedene Distanzmaße erfüllen unterschiedliche alge- Dominanz, Score-Normalisierung, Aggregation, Feature braische Eigenschaften und nicht alle Distanzmaße sind für spezielle Probleme gleich geeignet. So erfordern Ansätze zu metrischen Indexverfahren oder Algorithmen im Data- 1. EINLEITUNG Mining die Erfüllung der Dreiecksungleichung. Weitere Pro- Im Bereich des Information-Retrievals (IR), Multimedia- bleme können durch die Eigenschaften der Aggregations- Retrievals (MMR), Data-Mining (DM) und vielen anderen funktion auftreten. So kann diese z.B. die Monotonie oder Gebieten ist ein Vergleich von Objekten essentiell, z.B. zur andere algebraische Eigenschaften der Einzeldistanzmaße Erkennung ähnlicher Objekte bzw. Duplikate oder zur Klas- zerstören. Diese Probleme sollen jedoch nicht im Fokus die- sifizierung der untersuchten Objekte. Der Vergleich von Ob- ser Arbeit stehen. jekten einer Objektmenge O basiert dabei in der Regel auf Für einen Ähnlichkeitsvergleich von Objekten anhand meh- deren Eigenschaftswerten. Im Bereich des MMR sind Eigen- rerer Merkmale wird erwartet, dass die Einzelmerkmale glei- schaften (Features) wie Farben, Kanten oder Texturen häu- chermaßen das Aggregationsergebnis beeinflussen. Häufig fig genutzte Merkmale. In vielen Fällen genügt es für einen gibt es jedoch ein Ungleichgewicht, welches die Ergebnisse erschöpfenden Vergleich von Objekten nicht, nur eine Eigen- so stark beeinflusst, dass einzelne Merkmale keinen oder nur schaft zu verwenden. Abbildung 1 zeigt anhand des Beispiels einen geringen Einfluss besitzen. Fehlen algebraische Eigen- eines Farbhistogramms die Schwächen einer einzelnen Eigen- schaften oder gibt es eine zu starke Dominanz, so können die schaft. Obwohl beide Objekte sich deutlich unterscheiden so Merkmale und dazugehörigen Distanzmaße nicht mehr sinn- weisen sie ein sehr ähnliches Farbhistogramm auf. voll innerhalb einer geeigneten Merkmalskombination einge- Statt einer Eigenschaft sollte vielmehr eine geeignete Kombi- setzt werden. 
Im Bereich der Bildanalyse werden zudem im- nation verschiedener Merkmale genutzt werden, um mittels mer komplexere Eigenschaften aus den Bilddaten extrahiert. einer verbesserten Ausdruckskraft [16] genauere Ergebnissen Damit wird auch die Berechnung der Distanzen basierend zu erzielen. Der (paarweise) Vergleich von Objekten anhand auf diesen Eigenschaften immer spezieller und es kann nicht sichergestellt werden welche algebraische Eigenschaften er- füllt werden. Durch die vermehrte Verwendung von vielen Einzelmerkmalen steigt auch das Risiko der Dominanz eines oder weniger Merkmale. Kernfokus dieser Arbeit ist dabei die Analyse von Multi- Feature-Aggregationen in Bezug auf die Dominanz einzelner Copyright © by the paper’s authors. Copying permitted only for private and academic purposes. Merkmale. Wir werden zunächst die Dominanz einer Eigen- In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 1 Workshop on Foundations of Databases (Grundlagen von Datenbanken), Beide lassen sich ineinander überführen [Sch06], im Folgen- 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. den gehen wir daher von Distanzmaßen aus. 83 schaft definieren und zeigen wann sich eine solche Dominanz Beispiel erläutert werden. Abschließend werden wir ein Maß manifestiert. Anschließend führen wir ein Maß zur Messung definieren, um den Grad der Dominanz messen zu können. des Dominanzgrades ein. Wir werden darüber hinaus zei- gen, dass die Ansätze bestehender Normalisierungsverfah- 3.1 Problemdefinition ren nicht immer ausreichen um das Problem der Dominanz Wie bereits erwähnt ist der Einsatz vieler, unterschiedlicher zu lösen. Zusätzlich ermöglicht dieses Maß die Evaluation Eigenschaften (Features) und ihrer teilweise speziellen Di- verschiedener Normalisierungsansätze. stanzmaße nicht trivial und bringt einige Herausforderungen Die Arbeit ist dabei wie folgt aufgebaut. In Kapitel 2 werden mit sich. Das Problem der Dominanz soll in diesem Unter- noch einmal einige Grundlagen zur Distanzfunktion und zur abschnitt noch einmal genauer definiert werden. Aggregation dargelegt. Kapitel 3 beschäftigt sich mit der Zunächst definieren wir das Kernproblem bei der Aggre- Definition der Dominanz und zeigt anhand eines Beispiels gation mehrerer Distanzwerte. die Auswirkungen. Weiterhin wird ein neues Maß zur Mes- Problem: Für einen Ähnlichkeitsvergleich von Objekten sung des Dominanzgrades vorgestellt. Kapitel 4 liefert einen anhand mehrerer Merkmale sollen die Einzelmerkmale glei- Überblick über bestehende Ansätze. Kapitel 5 gibt eine Zu- chermaßen das Aggregationsergebnis beeinflussen. Dominie- j sammenfassung und einen Ausblick für zukünftige Arbeiten. ren die partiellen Distanzen δrs eines Distanzmaßes dj das Aggregationsergebnis, so soll diese Dominanz reduziert bzw. 2. GRUNDLAGEN beseitigt werden. Offen ist an dieser Stelle die Frage, wann eine Dominanz ei- Das folgende Kapitel definiert die grundlegenden Begriffe ner Eigenschaft auftritt, wie sich diese auf das Aggregations- und die Notationen, die in dieser Arbeit verwendet werden. ergebnis auswirkt und wie der Grad der Dominanz gemessen Distanzberechnungen auf unterschiedlichen Merkmalen er- werden kann. fordern in der Regel auch den Einsatz unterschiedlicher Di- Das Ergebnis einer Aggregation von Einzeldistanzwerten ist stanzmaße. Diese sind in vielen Fällen speziell auf die Eigen- erneut ein Distanzwert. Dieser soll jedoch von allen Einzeldi- schaft selbst optimiert bzw. angepasst. 
Für eine Distanzbe- stanzwerten gleichermaßen abhängen. Ist der Wertebereich, rechnung auf mehreren Merkmalen werden dementsprechend der zur Aggregation verwendeten Distanzfunktionen nicht auch unterschiedliche Distanzmaße benötigt. identisch, so kann eine Verfälschung des Aggregationsergeb- Ein Distanzmaß zwischen zwei Objekten basierend auf einer nisses auftreten. Als einfaches Beispiel seien hier zwei Di- Eigenschaft p sei als eine Funktion d : O × O 7→ R≥0 defi- stanzfunktionen d1 und d2 genannt, wobei d1 alle Distanzen niert. Ein Distanzwert basierend auf einem Objektvergleich auf das Intervall [0, 1] und d2 alle Distanzen auf [0, 128] ab- zwischen or und os über einer einzelnen Eigenschaft pj wird bildet. Betrachtet man nun eine Aggregationsfunktion dagg , mit dj (or , os ) ∈ R≥0 beschrieben. Unterschiedliche Distanz- die Einzeldistanzen aufsummiert, so zeigt sich, dass d2 das maße besitzen damit auch unterschiedliche Eigenschaften. Aggregationsergebnis erheblich mehr beeinflusst als d1 . Zur Klassifikation der unterschiedlichen Distanzmaße wer- Allgemein werden dann die aggregierten Distanzwerte stär- den folgende vier Eigenschaften genutzt: ker oder schwächer durch Einzeldistanzwerte einer (zur Ag- Selbstidentität: ∀o ∈ O : d(o, o) = 0, Positivität: ∀or 6= gregation verwendeten) Distanzfunktion beeinflusst als ge- os ∈ O : d(or , os ) > 0, Symmetrie: ∀or , os ∈ O : wünscht. Wir bezeichnen diesen Effekt als eine Überwer- d(or , os ) = d(os , or ) und Dreiecksungleichung: ∀or , os , ot ∈ tung. Der Grad der Überbewertung lässt sich mittels Korre- O : d(or , ot ) ≤ d(or , os ) + d(os , ot ). lationsanalyse (z.B. nach Pearson [10] oder Spearman [13]) Erfüllt eine Distanzfunktion alle vier Eigenschaften so wird bestimmen. sie als Metrik bezeichnet [11]. Ist der Vergleich zweier Objekte anhand einer einzelnen Ei- Definition 1 (Überbewertung einer Distanzfunktion). genschaft nicht mehr ausreichend, um die gewünschte (Un-) Für zwei Distanzfunktionen dj und dk , bei der die Distanz- Ähnlichkeit für zwei Objekte or ,os ∈ O zu bestimmen , so werte δ j in Abhängigkeit einer Aggregationsfunktion agg ist die Verwendung mehrerer Eigenschaften nötig. Für ei- das Aggregationsergebnis stärker beeinflussen als δ k , also ne Distanzberechnung mit m Eigenschaften p = (p1 . . . pm ) die Differenz der Korrelationswerte j werden zunächst die partiellen Distanzen δrs = dj (or , os ) ρ(δ j , δ agg ) − ρ(δ k , δ agg ) >  ist, bezeichnen wir dj als bestimmt. Anschließend werden die partiellen Distanzwerte überbewertet gegenüber dk . j δrs mittels einer Aggregationsfunktion agg : Rm ≥0 7→ R≥0 zu einer Gesamtdistanz aggregiert. Die Menge aller aggre- Eine empirische Untersuchung hat gezeigt, dass sich ab ei- gierten Distanzen (Dreiecksmatrix) für Objektpaar aus O, nem Wert  ≥ 0.2 eine Beeinträchtigung des Aggregations- 2 sei durch δ j = (δ1j , δ2j . . . , δlj ) mit l = n 2−n bestimmt. Die- ergebnisses zu Gunsten einer Distanzfunktion zeigt. ser Ansatz erlaubt eine Bestimmung der Aggregation auf Ausgehend von einer Überbewertung definieren wir das Pro- den jeweiligen Einzeldistanzwerten. Die Einzeldistanzfunk- blem der Dominanz. tionen dj sind in sich geschlossen und damit optimiert auf die Eigenschaft selbst. Definition 2 (Dominanzproblem). Ein Dominanzpro- blem liegt vor, wenn es eine Überbewertung einer Distanz- funktion dj gegenüber dk gibt. 3. DOMINANZPROBLEM Bisher haben wir das Problem der Dominanz nur kurz ein- Das Problem einer Überbewertung bei unterschiedlichen geführt. 
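To make the notation concrete, the following short Python sketch computes the partial distance vectors δ^j over all object pairs and their element-wise aggregation; the object representation, the fixed pair order, and the sum as default aggregation function are illustrative assumptions, not part of the paper.

from itertools import combinations

def partial_distances(objects, d_j):
    # delta^j: the distances d_j(o_r, o_s) for all (n^2 - n)/2 object pairs,
    # enumerated in a fixed pair order.
    return [d_j(o_r, o_s) for o_r, o_s in combinations(objects, 2)]

def aggregated_distances(objects, distance_functions, agg=sum):
    # Element-wise aggregation of the partial distance vectors; the sum is only
    # one possible aggregation function agg: R^m_{>=0} -> R_{>=0}.
    partials = [partial_distances(objects, d_j) for d_j in distance_functions]
    return [agg(values) for values in zip(*partials)]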
Eine detaillierte Motivation und Heranführung an Wertebereichen in denen die Distanzen abgebildet werden ist das Problem soll in diesem Kapitel erfolgen. Hierzu werden jedoch bereits weitreichend bekannt. In vielen Fällen kom- wir zunächst die Begriffe Überbewertung und Dominanzpro- men Normalisierungsverfahren (z.B. im Data-Mining [12] blem einführen. Die Auswirkungen des Dominanzproblem oder in der Biometrie [5]) zum Einsatz. Diese bereiten Di- auf das Aggregationsergebnis sollen anschließend durch ein stanzen aus verschiedenen Quellen für eine Aggregation vor. 84 Zur Vermeidung einer Überbewertung werden Distanzen aggQd ,d (or , os ) = d1 (or , os ) ∗ d2 (or , os ) kann nun gezeigt 1 2 häufig auf ein festes Intervall normalisiert (i.d.R. auf [0,1]). werden, dass d1 stärker den aggregierten Distanzwert beein- Damit ist zumindest das Problem in unserem vorherigen Bei- flusst als d2 . spiel gelöst. In Abbildung 3 sind zwei verschiedene Rangfolgen aller 10 Das Problem der Dominanz tritt jedoch nicht nur bei un- Distanzwerte zwischen fünf zufälligen Objekten der Vertei- terschiedlichen Wertebereichen auf. Auch bei Distanzfunk- lungen ν1 und ν2 dargestellt, sowie die Aggregation mittels tionen, die alle auf den gleichen Wertebereich normalisiert aggQ . Die Distanz-ID definiert hierbei einen Identifikator sind, kann das Dominanzproblem auftreten. Im folgenden für ein Objektpaar. Betrachtet man die ersten fünf Rän- Abschnitt soll anhand eines Beispiels dieses Dominanzpro- ge der aggregierten Distanzen, so sieht man, dass die top- blem demonstriert werden. 5-Objekte von Distanzfunktion d1 komplett mit denen der Aggregation übereinstimmen, während bei Distanzfunktion 3.2 Beispiel eines Dominanzproblems d2 lediglich zwei Werte in der Rangfolge der aggregierten In Abbildung 2 sind drei Distanzverteilungen ν1 , ν2 und ν3 Distanzen auftreten. Gleiches gilt für die Ränge 6–10. Da- aus einer Stichprobe zu den zugehörigen Distanzfunktionen mit zeigt die Distanzfunktion d1 eine Dominanz gegenüber d1 , d2 sowie d3 dargestellt. Der Wertebereich der Funktio- der Distanzfunktion d2 . Schaut man sich noch einmal die nen sei auf das Intervall [0,1] definiert. Die Werte aus der Intervalle der Verteilung ν1 und ν2 an, so zeigt sich, dass die Stichprobe treten ungeachtet der Normalisierung auf [0, 1] Dominanz dem großen Unterschied der Verteilungsintervalle jedoch in unterschiedlichen Intervallen auf. Die Distanzwer- (0.7 vs. 0.2) obliegt. Eine Dominanz manifestiert sich also te der Stichprobe von ν1 liegen im Intervall [0.2, 0.9], von ν2 vor allem wenn eine große Differenz zwischen den jeweiligen im Intervall [0.3, 0.5] und in ν3 im Intervall [0.8, 0.9]. Auch Intervallen der Distanzverteilungen liegt. wenn es sich hierbei um simulierte Daten handelt so sind solche Verteilungen im Bereich des MMR häufig anzutref- 3.3 Messung der Dominanz fen. Um die Überwertung aus unserem Beispiel und somit die 0.12 Dominanz zu quantifizieren, wird die Korrelation zwischen 0.1 den Distanzen von d1 (d2 ) und der aggregierten Distanzen aus dagg bestimmt. Zur Berechnung der Korrelation kön- nen mehrere Verfahren genutzt werden. Verwendet man wie 0.08 Häufigkeit 0.06 im obigen Beispiel nur die Ränge, so bietet sich Spearmans 0.04 Rangkorrelationskoeffizient an [13]. 0.02 Cov(Rang(A), Rang(B)) ρ(A, B) = mit 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 σRang(A) ∗ σRang(B) (1) Distanz (a) ν1 Cov(X, Y ) = E [(X − µx ) ∗ (Y − µy )] 0.12 Hierbei sei Cov(X, Y ) die über den Erwartungswert von X 0.1 und Y definierte Kovarianz. 
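Formula (1) can be implemented directly on ranks; the self-contained Python sketch below uses average ranks for ties and is interchangeable with library routines such as scipy.stats.spearmanr.

def ranks(values):
    # 1-based average ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for p in range(i, j + 1):
            r[order[p]] = avg
        i = j + 1
    return r

def spearman(a, b):
    # rho(A, B) = Cov(rank(A), rank(B)) / (sigma_rank(A) * sigma_rank(B)), Eq. (1).
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb)) / n
    sa = (sum((x - ma) ** 2 for x in ra) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in rb) / n) ** 0.5
    return cov / (sa * sb)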
Bezogen auf das vorherige Bei- spiel erhalten wir eine Korrelation nach Spearman für d1 von ρ1 = 0.94 und für d2 ρ2 = 0.45. Die Differenz der Korrela- 0.08 Häufigkeit 0.06 tionswerte liegt dabei bei ρ1 − ρ2 = 0.49. Ab  = 0.2 lässt 0.04 sich eine Überbewertung einer Distanzfunktion feststellen. 0.02 Somit haben wir mit ρ1 − ρ2 = 0.49 > 0.2 eine starke Über- bewertung von d1 gegenüber d2 in Bezug auf das Aggrega- 0 0 0.1 0.2 0.3 0.4 0.5 Distanz 0.6 0.7 0.8 0.9 1 tionsergebnis gezeigt. (b) ν2 Durch die Verwendung der Rangwerte gibt es allerdings einen Informationsverlust. Eine alternative Berechnung ohne 0.12 Informationsverlust wäre durch Pearsons Korrelationskoeffi- 0.1 zienten möglich [10]. Genügen die Ranginformationen, dann 0.08 bietet Spearmans Rangkorrelationskoeffizient durch eine ge- ringere Anfälligkeit gegenüber Ausreißern an [14]. Häufigkeit 0.06 Bisher haben wir die Korrelation zwischen den aggregier- 0.04 ten Werten und denen aus je einer Distanzverteilung vergli- 0.02 chen. Um direkt eine Beziehung zwischen zwei verschiede- nen Distanzverteilungen bzgl. einer aggregierten Verteilung zu bestimmen, werden zunächst die zwei Korrelationswerte 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Distanz (c) ν3 ρ1 und ρ2 der Distanzfunktionen d1 und d2 bzgl. ihres Ein- flusses auf das Aggregationsergebnis graphisch dargestellt [6]. Hierzu werden die jeweiligen Werte der Korrelation als Figure 2: Distanzverteilung verschiedener Distanz- Punkte in [−1, 1]2 definiert. Für eine gleichmäßige Beein- funktionen (simulierte Daten) flussung des Aggregationsergebnisses sollten sich die Punk- te auf der Diagonalen durch den Koordinatenursprung mit Wir betrachten nun die Distanzfunktionen d1 und d2 . Be- züglich einer beispielhaften Aggregationsfunktion2 gationsfunktionen wie Summe, Mittelwert etc. auf und kann zusätzlich eine Dominanz hervorrufen, z.B. bei der Mini- 2 Das Problem der Dominanz tritt auch bei anderen Aggre- mum/Maximumfunktion. 85 1 Rang d1 Distanz-ID d2 Distanz-ID aggQ Distanz-ID 1 0.729 1 0.487 8 0.347 8 0.8 2 0.712 8 0.481 5 0.285 4 3 0.694 4 0.426 10 0.266 1 0.6 4 0.547 9 0.425 7 0.235 5 ρ2 (ρ1, ρ2) 5 0.488 5 0.421 3 0.205 9 0.4 6 0.473 7 0.411 4 0.201 7 u 7 0.394 10 0.375 9 0.168 10 0.2 8 0.351 3 0.367 6 0.148 3 α 9 0.337 2 0.365 1 0.112 6 0 0 0.2 0.4 0.6 0.8 1 10 0.306 6 0.316 2 0.106 2 ρ1 Figure 3: Dominanzproblem bei unterschiedlichen Verteilungen Figure 4: Graphische Darstellung der Korrelation ρ1 und ρ2 auf das Aggregationsergebnis dem Anstieg m = 1 befinden. Wir bezeichnen diese Gerade 3.4 Zusammenfassung als Kalibrierungslinie. Für unser Beispiel genügt es, nur po- Wir haben in diesem Kapitel gezeigt wann ein Dominanz- sitive Korrelationswerte zu betrachten. Damit kennzeichnen problem auftritt und wie groß der Einfluss auf das Aggrega- alle Punkte unterhalb dieser Linie einen größeren Einfluss tionsergebnis sein kann. Mit der Verwendung von Gleichung durch d1 . Analog gilt bei allen Punkten oberhalb dieser Li- (2) ist es nun möglich den Grad des Dominanzproblems bzw. nie (grau schraffierter Bereich) eine größere Beeinflussung den Kalibrierungsfehler messen zu können. Ein Hauptgrund durch d2 . Abbildung 4 zeigt graphisch die Korrelation für für das Auftreten des Dominanzproblem liegt in der Vertei- unser Beispiel von ρ1 und ρ2 auf das Aggregationsergebnis. lung der Distanzen. 
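The setting of this example can be reproduced with a few lines of Python: two distance distributions with clearly different effective intervals, combined by the product aggregation agg_Q, typically produce a correlation gap well above the empirical threshold ε = 0.2 of Definition 1. The random sample below is illustrative only and is not the paper's simulated data.

import random
from scipy.stats import spearmanr   # interchangeable with the spearman() sketch above

random.seed(0)
delta_1 = [random.uniform(0.2, 0.9) for _ in range(10)]   # wide interval, like nu_1
delta_2 = [random.uniform(0.3, 0.5) for _ in range(10)]   # narrow interval, like nu_2
delta_agg = [a * b for a, b in zip(delta_1, delta_2)]      # agg_Q = d_1 * d_2

rho_1 = spearmanr(delta_1, delta_agg).correlation
rho_2 = spearmanr(delta_2, delta_agg).correlation
overestimated = (rho_1 - rho_2) > 0.2     # Definition 1 with the empirical epsilon
print(rho_1, rho_2, overestimated)        # d_1 typically dominates the aggregate ranking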
Sind die Intervalle, in denen die Distan- Um die Abweichung vom gewünschten Zustand zu bestim- zen liegen unterschiedlich groß, so ist die Dominanz einer men, ermitteln wir den Winkel zwischen dem Ortsvektor Eigenschaft unvermeidbar. Können diese Intervalle der Di- u = (ρ1 , ρ2 )T durch den Punkt (ρ1 , ρ2 ) und der horizon- ~ stanzverteilungen aneinander angeglichen werden ohne da- talen Koordinatenachse   [6]. Der Winkel α ergibt sich dann bei die Rangfolge zu verletzen, so könnte dies das Dominanz- durch α = arctan ρρ21 Dieser Winkel liegt zwischen [0, Π 2 ], problem lösen. Weiterhin ermöglicht das Maß des Kalibrie- während die Kalibrierungslinie mit der horizontalen Ach- rungsfehlers die Evaluation von Normalisierungsansätzen. se einen Winkel von Π 4 einschließt. Für eine vorzeichenbe- haftete Kennzeichnung der Überbewertung sollen nun alle 4. STAND DER TECHNIK Korrelationspunkte unterhalb der Kalibrierungslinie einen Die Aggregation auf Basis mehrerer Eigenschaften ist ein positiven Wert und alle Korrelationspunkte oberhalb einen weit verbreitetes Feld. Es gibt bereits eine Vielzahl von Ar- negativen Wert erhalten. Für ein Maß der Dominanz defi- beiten die sich mit dem Thema der Score-Normalization be- nieren wir nun folgende Berechnung [6]: schäftigten. Die Evaluierung solcher Ansätze erfolgt in vielen Fällen, vor allem im Bereich des IR, direkt über die Auswer-   tung der Qualität der Suchergebnisse anhand verschiedener 4 Corr(δ j , δ agg ) Calerr (δ i , δ j , δ agg ) = 1 − arctan (2) Dokumentenkollektionen, z.B. TREC-Kollektionen3 . Dieses π Corr(δ i , δ agg ) Vorgehen liefert aber kaum Anhaltspunkte, warum sich ei- nige Normalisierungsansätze besser für bestimmte Anwen- Hierbei definiert Corr(X, Y ) ein geeignetes Korrelations- dungen eignen als andere [6]. maß, in unserem Fall der Rangkorrelationskoeffizient von Betrachten wir zunächst verschiedene lineare Normalisierun- Spearman. Wir bezeichnen dieses Maß als Kalibrierungsfeh- δ−xmin gen der Form normalize(δ) = ymin + xmax (ymax − ler, wobei ein Fehler von 0 bedeutet, dass es keine Dominanz −xmin gibt und somit beide Distanzfunktionen gleichermaßen in ymin ) [15], wobei die Bezeichnungen xmin , xmax , ymin und das Aggregationsergebnis einfließen. Der Wertebereich des ymax verschiedene Normalisierungsparameter darstellen. Ta- Kalibrierungsfehlers Calerr liegt in [−1, 1]. Für unser Bei- belle 1 stellt einige solcher linearer Ansätze dar [15, 5, 9, 6]. spiel erhalten wir unter Verwendung von Spearmans Rang- korrelationskoeffizienten Calerr (d1 , d2 , dagg ) = 0.43, womit erkennbar ist, dass d1 das Aggregationsergebnis stärker be- Name ymin ymax xmin xmax einflusst als d2 . Min-Max 0 1 min(δ) max(δ) Fitting 0 |s1 | we conclude (without inspecting any set element) that s0 cannot reach threshold Figure 1: Overview of functions. tC with s1 . Similarly, minoverlap(tC , s0 , s2 ) = 10.1, thus s2 is too large to meet the threshold with s0 . In fact, minsize(tC , s0 ) = 6.4 and maxsize(tC , s0 ) = 15.6. The positional filter is stricter than the prefix filter and Prefix length. The prefix length is |s0 | − tO + 1 for is applied on top of it. The pruning power of the positional a given overlap threshold tO and set s0 . For normalized filter is larger for prefix matches further to right (i.e., when thresholds t the prefix length does not only depend on s0 , p0 , p1 increase). Since the prefix filter may produce the same but also on the sets we compare to. 
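Both the calibration error of Equation (2) and the generic linear normalization of Section 4 translate directly into code. In the sketch below, Spearman's coefficient serves as the correlation measure Corr and the parameter names follow Table 1; these implementation details are illustrative choices.

import math
from scipy.stats import spearmanr

def cal_err(delta_i, delta_j, delta_agg):
    # Calibration error, Eq. (2): 0 means both distance functions contribute equally;
    # positive values indicate a stronger influence of d_i, negative values of d_j.
    rho_i = spearmanr(delta_i, delta_agg).correlation
    rho_j = spearmanr(delta_j, delta_agg).correlation
    return 1.0 - (4.0 / math.pi) * math.atan(rho_j / rho_i)

def normalize(delta, x_min, x_max, y_min=0.0, y_max=1.0):
    # Generic linear normalization; Min-Max is the special case y_min = 0, y_max = 1,
    # x_min = min(delta), x_max = max(delta).
    return [y_min + (d - x_min) / (x_max - x_min) * (y_max - y_min) for d in delta]

For the running example (ρ1 = 0.94, ρ2 = 0.45), cal_err evaluates to about 0.43, matching the value reported above.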
If we compare to s1 , the candidate pair multiple times (for each match in the prefix), minimum prefix size of |s0 | is minprefix(t, s0 , s1 ) = |s0 | − an interesting situation arises: a pair that passes the posi- minoverlap(t, s0 , s1 ) + 1. When we index one of the join tional filter for the first match may not pass the filter for partners, we do not know the size of the matching partners later matches. Thus, the positional filter is applied to pairs upfront and need to cover the worst case; this results in the that are already in the candidate set whenever a new match prefix length maxprefix(t, s0 ) = |s0 |−minsize(t, s0 )+1 [7], is found. To correctly apply the positional filter we need which does not depend on s1 . For typical Jaccard thresholds to maintain the overlap value for each pair in the candidate t ≥ 0.8, this reduces the number of tokens to be processed set. We illustrate the positional filter with examples. during the candidate generation phase by 80 % or more. Example 1. Set s0 in Figure 2 is the probing set (prefix For self joins we can further reduce the prefix length [12] length maxprefix = 4), s1 is the indexed set (prefix length w.r.t. maxprefix: when the index is built on-the-fly in in- midprefix = 2, assuming self join). Set s1 is returned from creasing order of the sets, then the indexed prefix of s0 will the index due to the match on g (first match between s0 and never be compared to any set s1 with |s1 | < |s0 |. This al- s1 ). The required overlap is dminoverlapC (0.8, s0 , s1 )e = lows us to reduce the prefix length to midprefix(t, s0 ) = 8. Since there are only 6 tokens left in s1 after the match, |s0 | − minoverlap(t, s0 , s0 ) + 1. the maximum overlap we can get is 7, and the pair is pruned. Positional filter. The minimum prefix length for a pair This is also confirmed by the positional filter condition (1) of sets is often smaller than the worst case length, which we (o = 0, p0 = 3, p1 = 1). use to build and probe the index. When we probe the index Example 2. Assume a situation similar to Figure 2, but with a token from the prefix of s0 and find a match in the the match on g is the second match (i.e., o = 1, p0 = 3, prefix of set s1 , then the matching token may be outside the p1 = 1). Condition (1) holds and the pair can not be pruned, optimal prefix. If this is the first matching token between i.e., it remains in the candidate set. s0 and s1 , we do not need to consider the pair. In general, Example 3. Consider Figure 3 with probing set s0 and a candidate pair s0 , s1 must be considered only if indexed set s1 . The match on token a adds pair (s0 , s1 ) to the candidate set. Condition (1) holds for the match on a minoverlap(t, s0 , s1 ) ≤ o + min{|s0 | − p0 , |s1 | − p1 }, (1) (o = 0, p0 = 0, p1 = 0), and the pair is not pruned by where o is the current overlap (i.e., number of matching the positional filter. For the next match (on e), however, tokens so far excluding the current match) and p0 (p1 ) is condition (1) does not hold (o = 1, p0 = 1, p1 = 4) and the position of the current match in the prefix of s0 (s1 ); the positional filter removes the pair from the candidate set. positions start at 0. Thus, the positional filter does not only avoid pairs to enter 90 pred: C(s0 , s1 ) ≥ 0.8 s0 : b c e f g h ? ? ? pr ⇒ dminoverlap(s0 , s1 , 0.8)e = 8 s1 : a e h ? ? ? ? ? ? idx 7 s0 : c e f g ? ? ? ? ? ? probing set (pr) Figure 4: Verification: where to start? s1 : a g ? ? ? ? ? ? 
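For reference, a small Python sketch of these size bounds and of condition (1), assuming the standard Jaccard instantiations of minsize, maxsize, and minoverlap (the paper's Table 1, which lists the per-measure definitions, is not legible in this rendering); the ceil/floor handling is an implementation choice.

import math

def minsize(t, n0):            # smallest eligible set size (Jaccard): t * |s0|
    return t * n0

def maxsize(t, n0):            # largest eligible set size (Jaccard): |s0| / t
    return n0 / t

def minoverlap(t, n0, n1):     # required overlap (Jaccard): t/(1+t) * (|s0| + |s1|)
    return t / (1.0 + t) * (n0 + n1)

def maxprefix(t, n0):          # probing prefix length
    return n0 - int(math.ceil(minsize(t, n0))) + 1

def midprefix(t, n0):          # indexing prefix length for self joins
    return n0 - int(math.ceil(minoverlap(t, n0, n0))) + 1

def positional_filter_holds(t, n0, n1, o, p0, p1):
    # Condition (1): the pair remains a candidate only if this returns True.
    return math.ceil(minoverlap(t, n0, n1)) <= o + min(n0 - p0, n1 - p1)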
indexed set (idx) pred: J(s0 , s1 ) ≥ 0.7 pred: J(s0 , s1 ) ≥ 0.7 7 ⇒ dminoverlap(. . .)e = 6 ⇒ dminoverlap(. . .)e = 5 s0 : c d e ? ? ? ? pr s0 : c d e ? ? ? ? pr Figure 2: Sets with matching token In prefix: match impossible due to positions of matching tokens and s1 : e ? ? ? ? ? idx s1 : e ? ? ? ? idx remaining tokens. (a) Match impossible (b) Match possible pred: C(s0 , s1 ) ≥ 0.6 Figure 5: Impossible and possible set sizes based on ⇒ dminoverlap(s0 , s1 , 0.8)e = 8 position in s0 and the size-dependent minoverlap. 14 s0 : a e ? ? ? ? ? ? ? ? ? ? ? ? ? ? pr midprefix (indexing set) as discussed in earlier sections. s1 : a b c d e ? ? ? ? ? idx Since the sets are sorted, we compute the overlap in a +1 +1 5 =7<8 merge fashion. At each merge step, we verify if the current overlap and the remaining set size are sufficient to achieve Figure 3: Sets with two matching tokens: pruning the threshold, i.e., we check positional filter condition (1). of candidate pair by second match. (A) Prefix overlap [12] : At verification time we already know the overlap between the two prefixes of a candidate pair. This piece of information should be leveraged. Note the candidate set, but may remove them later. that we cannot simply continue verification after the two prefixes. This is illustrated in Figure 4: there is 1 match in 2.2 Improving the Prefix Filter the prefixes of s0 and s1 ; when we start verification after the The prefix filter often produces candidates that will be prefixes, we miss token h. Token h occurs after the prefix removed immediately in the next filter stage, the positional of s0 but inside the prefix of s1 . Instead, we compare the filter (see Example 1). Ideally, such candidates are not pro- last element of the prefixes: for the set with the smaller duced at all. This issue is addressed in the mpjoin algo- element (s0 ), we start verification after the prefix (g). For rithm [7] as outlined below. the other set (s1 ) we leverage the number of matches in the Consider condition (1) for the positional filter. We split prefix (overlap o). Since the leftmost positions where these the condition into two new conditions by expanding the min- matches can appear are the first o elements, we skip o tokens imum such that the conjunction of the new conditions is and start at position o (token e in s1 ). There is no risk of equivalent to the positional filter condition: double-counting tokens w.r.t. overlap o since we start after the end of the prefix in s0 . minoverlap(t, s0 , s1 ) ≤ o + |s0 | − p0 (2) (B) Position of last match [7] : A further improvement is minoverlap(t, s0 , s1 ) ≤ o + |s1 | − p1 (3) to store the position of the last match. Then we start the verification in set s1 after this position (h in s1 , Figure 4). The mpjoin algorithm leverages condition (2) as follows. The probing sets s0 are processed in increasing size order, so Small candidate set vs. fast verification. The po- |s0 | grows monotonically during the execution of the algo- sitional filter is applied on each candidate pair returned by rithm. Hence, for a specific set s1 , minoverlap grows mono- the prefix filter. The same candidate pair may be returned tonically. We assume o = 0 (and justify this assumption multiple times for different matches in the prefix. The po- later). For a given index entry (s1 , p1 ), the right side of con- sitional filter potentially removes existing candidate pairs dition (2) is constant, while the left side can only grow. Af- when they appear again (cf. Section 2.1). 
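The index-pruning rule used by mpjoin can be sketched as a single predicate: with o = 0, it keeps only the part of the split positional filter whose right-hand side depends solely on the index entry (s1, p1). The sketch reuses the Jaccard minoverlap from above; the function name is an assumption.

def index_entry_still_useful(t, n0, n1, p1):
    # With o = 0 the split condition reduces to minoverlap(t, s0, s1) <= |s1| - p1.
    # Since |s0| only grows during the join, the entry (s1, p1) can be discarded
    # permanently the first time this returns False.
    return minoverlap(t, n0, n1) <= n1 - p1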
This reduces the ter the condition fails to hold for the first time, it will never size of the candidate set, but comes at the cost of (a) lookups hold again, and the index list entry is removed. For a given in the candidate set, (b) deletions from the candidate set, index set s1 , this improvement changes the effective length and (c) book keeping of the overlaps for each candidate pair. of the prefix (i.e., the part of the sets where we may detect Overall, it might be more efficient to batch-verify a larger matches) w.r.t. a probing set s0 to minprefix(t, s0 , s1 ) = candidate set than to incrementally maintain the candidates; |s1 | − minoverlap(t, s0 , s1 ) + 1, which is optimal. On the Ribeiro and Härder [7] empirically analyze this trade-off. downside, a shorter prefix may require more work in the verification phase: in some cases, the verification can start 3. POSITION-ENHANCED LENGTH FIL- after the prefix as will be discussed in Section 2.3. TERING 2.3 Verification In this section, we motivate the position-enhanced length Efficient verification techniques are crucial for fast set sim- filter (PEL), derive the new filter function pmaxsize, dis- ilarity joins. We revisit a baseline algorithm and two im- cuss the effect of PEL on self vs. foreign joins, and show how provements, which affect the verification speed of both false to apply PEL to previous algorithms. and true positives. Unless explicitly mentioned, the term Motivation. The introduction of the position-enhanced prefix subsequently refers to maxprefix (probing set) resp. length filter is inspired by examples for positional filtering 91 1250 base region. The base region is partitioned into four regions maxsize (A, B, C, and D) by the probing set size and pmaxsize. For C D probing set size foreign joins, our filter reduces the base region to A+C. If we set size assume that all set sizes occur equally likely in the individual 1000 inverted lists of the index, our filter cuts the number of index B list entries that must be processed by 50%. Since the tokens A pmaxsize are typically ordered by their frequency, the list length will minsize increase with increasing matching position. Thus the gain of 800 PEL in practical settings can be expected to be even higher. 0 100 maxprefix 200 This analysis holds for all parameters of Jaccard and Dice. position in prefix For Cosine, the situation is more tricky since pmaxsize is quadratic and describes a parabola. Again, this is in our Figure 6: Illustrating possible set sizes. favor since the parabola is open to the top, and the curve that splits the base region is below the diagonal. For self joins, the only relevant regions are A and B since like Figure 5(a). In set s1 , the only match in the prefix oc- the size of the sets is bounded by the probing set size. Our curs at the leftmost position. Despite this being the leftmost filter reduces the relevant region from A + B to A. As Fig- match in s1 , the positional filter removes s1 : the overlap ure 6 illustrates, this reduction is smaller than the reduction threshold cannot be reached due the position of the match for foreign joins. For the similarity functions in Table 1, B in s0 . Apparently, the position of the token in the probing is always less than a quarter of the full region A + B. In the set can render a match of the index sets impossible, inde- example, region B covers about 0.22 of A + B. pendently of the matching position in the index set. 
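To make the verification routine concrete, the following Python sketch performs the merge-style overlap computation with the early-termination test of condition (1); the start positions start0/start1 and the prefix overlap o are assumed to be chosen according to rules (A) and (B) above, and the function name is illustrative.

def verify(s0, s1, required_overlap, o, start0, start1):
    # s0, s1 are sorted token lists; o matches are already known from the prefixes,
    # and scanning starts behind the last counted match in each set.
    i, j = start0, start1
    while i < len(s0) and j < len(s1):
        if o + min(len(s0) - i, len(s1) - j) < required_overlap:
            return False                      # threshold no longer reachable
        if s0[i] == s1[j]:
            o += 1
            i += 1
            j += 1
        elif s0[i] < s1[j]:
            i += 1
        else:
            j += 1
    return o >= required_overlap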
Let us analyze how we need to modify the example such that it passes the positional filter: the solution is to shorten index Algorithm 1: AllPairs-PEL(Sp , I, t) set s1 , as shown in Figure 5(b). This suggests that some Version using pmaxsize for foreign join; tighter limit on the set size can be derived from the position input : Sp collection of outer sets, I inverted list index of the matching token. covering maxprefix of inner sets, t similarity Deriving the PEL filter. For the example in threshold Figure 5(a) the first part of the positional filter, i.e., output: res set of result pairs (similarity at least t) condition (2), does not hold. We solve the equation 1 foreach s0 in Sp do minoverlap(t, s0 , s1 ) ≤ |s0 | − p0 to |s1 | by replacing 2 M = {}; /* Hashmap: candidate set → count */ minoverlap with its definition for the different similarity 3 for p0 ← 0 to maxprefix(t, s0 ) − 1 do functions. The result is pmaxsize(t, s0 , p0 ), an upper 4 for s1 in Is0 [p] do bound on the size of eligible sets in the index. This bound 5 if |s1 | < minsize(t, s0 ) then is at the core of the PEL filter, and definitions of pmaxsize 6 remove index entry with s1 from Is0 [p] ; for various similarity measures are listed in Table 1. 7 else if |s1 | > pmaxsize(t, s0 , p0 ) then Application of PEL. We integrate the pmaxsize 8 break; upper bound into the prefix filter. The basic prefix filter 9 else algorithm processes a probing set as follows: loop over 10 if M [s1 ] = ∅ then the tokens of the probing set from position p0 = 0 to 11 M = M ∪ (s1 , 0); maxprefix(t, s0 ) − 1 and probe each token against the 12 M [s1 ] = M [s1 ] + 1; index. The index returns a list of sets (their IDs) which 13 end contain this token. The sets in these lists are ordered by 14 end increasing size, so we stop processing a list when we hit a 15 /* Verify() verifies the candidates in M */ set that is larger than pmaxsize(t, s0 , p0 ). 16 res = res ∪ V erif y(s0 , M, t); Intuitively, we move half of the positional filter to the 17 end prefix filter, where we can evaluate it at lower cost: (a) the value of pmaxsize needs to be computed only once for each probing token; (b) we check pmaxsize against the size of Algorithm. Algorithm 1 shows AllPairs-PEL2 , a ver- each index list entry, which is a simple integer comparison. sion of AllPairs enhanced with our PEL filter. AllPairs- Overall, this is much cheaper than the candidate lookup that PEL is designed for foreign joins, i.e., the index is con- the positional filter must do for each index match. structed in a preprocessing step before the join is executed. Self Joins vs. Foreign Joins. The PEL filter is more The only difference w.r.t. AllPairs is that AllPairs-PEL uses powerful on foreign joins than on self joins. In self joins, pmaxsize(t, s0 , p0 ) instead of maxsize(t, s0 ) in the condi- the size of the probing set is an upper bound for the set tion on line 7. The extensions of the algorithms ppjoin and size in the index. For all the similarity functions in Table 1, mpjoin with PEL are similar. pmaxsize is below the probing set size in less than 50% An enhancement that is limited to ppjoin and mpjoin is to of the prefix positions. Figure 6 gives an example: The simplify the positional filter: PEL ensures that no candidate probing set size is 1000, the Jaccard threshold is 0.8, so set can fail on the first condition (Equation 2) of the split minsize(0.8, 1000) = 800, maxsize(0.8, 1000) = 1250, and positional filter. Therefore, we remove the first part of the the prefix size is 201. 
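Under a Jaccard threshold, the derivation just described can be carried out in closed form; the sketch below states the resulting bound. Since Table 1 of the paper is not legible in this rendering, treat this as the standard Jaccard instantiation obtained from the stated derivation rather than a quotation.

def pmaxsize(t, n0, p0):
    # For Jaccard, solving  t/(1+t) * (n0 + n1) <= n0 - p0  for n1 gives
    #   n1 <= (n0 - (1 + t) * p0) / t,
    # which equals maxsize(t, n0) = n0/t at p0 = 0 and decreases linearly with p0.
    return (n0 - (1.0 + t) * p0) / t

As a sanity check against Figure 6 (t = 0.8, |s0| = 1000): pmaxsize(0.8, 1000, 0) = 1250 = maxsize, and pmaxsize(0.8, 1000, 200) = 800 = minsize.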
The x-axis represents the position in the prefix, the y-axis represents bounds for the set size of the 2 We use the -PEL suffix for algorithm variants that make other set. The region between minsize and maxsize is the use of our PEL filter. 92 collections are identical. Figures 7(a) and 7(b) show the per- Table 2: Input set characteristics. formance on DBLP with Jaccard similarity threshold 0.75 #sets in set size # of diff. and Cosine similarity 0.85. These thresholds produce result collection min max avg tokens sets of similar size. We observe a speedup of factor 3.5 for DBLP 3.9 · 106 2 283 12 1.34 · 106 AllPairs-PEL over AllPairs with Jaccard, and a speedup of TREC 3.5 · 105 2 628 134 3.4 · 105 3.8 with Cosine. For mpjoin to mpjoin-PEL we observe a 5 ENRON 5 · 10 1 192 000 298 7.3 · 106 speedup of 4.0 with Jaccard and 4.2 with Cosine. Thus, the PEL filter provides a substantial speed advantage on these data points. For other Jaccard thresholds and mpjoin vs. minimum in the original positional filter (Equation 1), such mpjoin-PEL, the maximum speedup is 4.1 and the minimum that the minimum is no longer needed. speedup is 1.02. For threshold 0.5, only mpjoin-PEL finishes Note that the removal of index entries on line 6 is the eas- within the time limit of one hour. Among all Cosine thresh- iest solution to apply minsize, but in real-world scenarios, olds and mpjoin vs. mpjoin-PEL, the maximum speedup is it only makes sense for a single join to be executed. For 4.2 (tC = 0.85), the minimum speedup is 1.14 (tC = 0.95). a similarity search scenario, we recommend to apply binary We only consider Cosine thresholds tC ≥ 0.75, because the search on the lists. For multiple joins with the same indexed non-PEL variants exceed the time limit for smaller thresh- sets in a row, we suggest to use an overlay over the index olds. There is no data point where PEL slows down an that stores the pointer for each list where to start. algorithm. It is also worth noting that AllPairs-PEL beats mpjoin by a factor of 2.7 with Jaccard threshold tJ = 0.75 4. EXPERIMENTS and 3.3 on Cosine threshold tC = 0.85; we observe such speedups also on other thresholds. We compare the algorithms AllPairs [4] and mpjoin [7] Figure 7(c) shows the performance on TREC with Jac- with and without our PEL extension on both self and for- card threshold tJ = 0.75. The speedup for AllPairs-PEL eign joins. Our implementation works on integers, which we compared to AllPairs is 1.64, and for mpjoin-PEL compared order by the frequency of appearance in the collection. The to mpjoin 2.3. The minimum speedup of mpjoin over all time to generate integers from tokens is not measured in our thresholds is 1.26 (tJ = 0.95), the maximum speedup is experiments since it is the same for all algorithms. We also 2.3 (tJ = 0.75). Performance gains on ENRON are slightly do not consider the indexing time for foreign joins, which smaller – we observe speedups of 1.15 (AllPairs-PEL over is considered a preprocessing step. The use of PEL has no AllPairs), and 1.85 (mpjoin-PEL over mpjoin) on Jaccard impact on the index construction. The prefix sizes are max- threshold tJ = 0.75 as illustrated in Figure 7(d). The mini- prefix for foreign joins and midprefix for self joins. For self mum speedup of mpjoin over mpjoin-PEL is 1.24 (tJ = 0.9 joins, we include the indexing time in the overall runtime and 0.95), the maximum speedup is 2.0 (tJ = 0.6). since the index is built incrementally on-the-fly. 
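Since the pseudocode of Algorithm 1 is interleaved with other text in this rendering, the probing loop of AllPairs-PEL is restated below as a Python sketch. It builds on the maxprefix, minsize, and pmaxsize functions sketched earlier; verify_candidates is an assumed helper that wraps the merge-based verification, and the index layout (token to list of (set_id, set_size) entries ordered by increasing size) is an assumption consistent with the description above.

from collections import defaultdict

def allpairs_pel(probing_sets, index, t):
    result = []
    for s0 in probing_sets:                      # s0 is a sorted list of tokens
        n0 = len(s0)
        counts = defaultdict(int)                # candidate set id -> matches in prefix
        for p0 in range(maxprefix(t, n0)):
            for s1_id, n1 in index[s0[p0]]:
                if n1 < minsize(t, n0):
                    continue                     # Algorithm 1 removes such entries instead
                if n1 > pmaxsize(t, n0, p0):
                    break                        # PEL: all remaining entries are too large
                counts[s1_id] += 1
        result.extend(verify_candidates(s0, counts, t))
    return result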
We report Figure 8(a) shows the number of processed index entries results for Jaccard and Cosine similarity, the results for Dice (i.e., the overall length of the inverted lists that must be show similar behavior. Our experiments are executed on the scanned) for Jaccard threshold tJ = 0.75 on TREC. The following real-world data sets: number of index entries increases by a factor of 1.67 for AllPairs w.r.t. AllPairs-PEL, and a factor of 4.0 for mpjoin • DBLP3 : Snapshot (February 2014) of the DBLP bib- w.r.t. mpjoin-PEL. liographic database. We concatenate authors and ti- Figure 8(b) shows the number of candidates that must tle of each entry and generate tokens by splitting on be verified for Jaccard threshold tJ = 0.75 on TREC. On whitespace. AllPairs, PEL decreases the number of candidates. This is because AllPairs does not apply any further filters before • TREC4 : References from the MEDLINE database, verification. On mpjoin, the number of candidates increases years 1987–1991. We concatenate author, title, and by 20%. This is due to the smaller number of matches from abstract, remove punctuation, and split on whitespace. the prefix index in the case of PEL: later matches can remove • ENRON5 : Real e-mail messages published by FERC pairs from the candidate set (using the positional filter) and after the ENRON bankruptcy. We concatenate sub- thus decrease its size. However, the larger candidate set ject and body fields, remove punctuation, and split on for PEL does not seriously impact the overall performance: whitespace. the positional filter is also applied in the verification phase, where the extra candidate pairs are pruned immediately. Table 2 lists basic characteristics of the input sets. We Self joins. Due to space constraints, we only show re- conduct our experiments on an Intel Xeon 2.60GHz machine sults for DBLP and ENRON, i.e., the input sets with the with 128 GB RAM running Debian 7.6 ’wheezy’. We com- smallest and the largest average set sizes, respectively. Fig- pile our code with gcc -O3. Claims about results on “all” ure 7(e) and 7(f) show the performance of the algorithms on thresholds for a particular data set refer to the thresholds DBLP and ENRON with Jaccard threshold tJ = 0.75. Our {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. We stop tests whose PEL filter provides a speed up of about 1.22 for AllPairs, runtime exceeds one hour. and 1.17 for mpjoin on DBLP. The maximum speedup we Foreign Joins. For foreign joins, we join a collection of observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for sets with a copy of itself, but do not leverage the fact that the tJ = 0.95 there is no speed difference between mpjoin and mpjoin-PEL. On the large sets of ENRON, the performance 3 http://www.informatik.uni-trier.de/~Ley/db/ is worse for AllPairs-PEL because verification takes more 4 http://trec.nist.gov/data/t9_filtering.html time than PEL can save in the probing phase (by reducing 5 https://www.cs.cmu.edu/~enron/ the number of processed index entries). There is almost no 93 sec sec sec 400 sec sec sec 500 500 100 150 400 300 30 400 80 AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL 300 300 100 20 60 200 AllPairs AllPairs AllPairs AllPairs AllPairs AllPairs 200 200 40 mpjoin mpjoin mpjoin mpjoin mpjoin mpjoin 50 100 10 100 100 20 0 0 0 0 0 0 (a) Foreign join, (b) Foreign join, (c) Foreign join, (d) Foreign j., EN- (e) Self join, (f) Self join, EN- DBLP, tJ = 0.75. DBLP, tC = 0.85. 
Foreign Joins. For foreign joins, we join a collection of sets with a copy of itself, but do not leverage the fact that the two inputs are identical.

Figure 8(a) shows the number of processed index entries (i.e., the overall length of the inverted lists that must be scanned) for Jaccard threshold tJ = 0.75 on TREC. The number of index entries increases by a factor of 1.67 for AllPairs w.r.t. AllPairs-PEL, and by a factor of 4.0 for mpjoin w.r.t. mpjoin-PEL.

Figure 8(b) shows the number of candidates that must be verified for Jaccard threshold tJ = 0.75 on TREC. On AllPairs, PEL decreases the number of candidates. This is because AllPairs does not apply any further filters before verification. On mpjoin, the number of candidates increases by 20%. This is due to the smaller number of matches from the prefix index in the case of PEL: later matches can remove pairs from the candidate set (using the positional filter) and thus decrease its size. However, the larger candidate set for PEL does not seriously impact the overall performance: the positional filter is also applied in the verification phase, where the extra candidate pairs are pruned immediately.

Self Joins. Due to space constraints, we only show results for DBLP and ENRON, i.e., the input sets with the smallest and the largest average set sizes, respectively. Figures 7(e) and 7(f) show the performance of the algorithms on DBLP and ENRON with Jaccard threshold tJ = 0.75. Our PEL filter provides a speedup of about 1.22 for AllPairs and of 1.17 for mpjoin on DBLP. The maximum speedup we observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for tJ = 0.95 there is no speed difference between mpjoin and mpjoin-PEL. On the large sets of ENRON, the performance is worse for AllPairs-PEL because verification takes more time than PEL can save in the probing phase (by reducing the number of processed index entries). There is almost no difference between mpjoin and mpjoin-PEL. The maximum increase in speed is 9% (threshold 0.8, mpjoin); the maximum slowdown is 30% (threshold 0.6, AllPairs).

Summarizing, PEL substantially improves the runtime in foreign join scenarios. For self joins, PEL is less effective and, in some cases, may even slightly increase the runtime.

Figure 7: Join times (runtime in seconds for AllPairs, AllPairs-PEL, mpjoin, and mpjoin-PEL). (a) Foreign join, DBLP, tJ = 0.75; (b) foreign join, DBLP, tC = 0.85; (c) foreign join, TREC, tJ = 0.75; (d) foreign join, ENRON, tJ = 0.75; (e) self join, DBLP, tJ = 0.75; (f) self join, ENRON, tJ = 0.75.

Figure 8: TREC (foreign join), tJ = 0.75. (a) Number of processed index entries; (b) number of candidates to be verified.

5. RELATED WORK

Sarawagi and Kirpal [8] first discuss efficient algorithms for exact set similarity joins. Chaudhuri et al. [5] propose SSJoin as an in-database operator for set similarity joins and introduce the prefix filter. AllPairs [4] uses the prefix filter with an inverted list index. The ppjoin algorithm [12] extends AllPairs by the positional filter and introduces the suffix filter, which reduces the candidate set before the final verification. The mpjoin algorithm [7] improves over ppjoin by reducing the number of entries returned from the index. AdaptJoin [10] takes the opposite approach and drastically reduces the number of candidates at the expense of longer prefixes. Gionis et al. [6] propose an approximate algorithm based on LSH for set similarity joins. Recently, an SQL operator for the token generation problem was introduced [3].

6. CONCLUSIONS

We presented PEL, a new filter based on the pmaxsize upper bound derived in this paper. PEL can be easily plugged into algorithms that store prefixes in an inverted list index (e.g., AllPairs, ppjoin, or mpjoin). For these algorithms, PEL will effectively reduce the number of list entries that must be processed. This reduces the overall lookup time in the inverted list index at the cost of a potentially larger candidate set. We analyzed this trade-off for foreign joins and self joins. Our empirical evaluation demonstrated that the PEL filter improves performance in almost any foreign join and also in some self join scenarios, despite the fact that it may increase the number of candidates to be verified.
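To illustrate where such a filter hooks into these algorithms, the sketch below shows a generic, AllPairs-style probing loop over a size-sorted inverted index in which an upper bound on the candidate size truncates each inverted-list scan. It is only a sketch under our own assumptions: max_candidate_size uses the ordinary Jaccard length bound |r|/t as a stand-in, whereas PEL would plug in the tighter, position-dependent pmaxsize bound derived earlier in the paper; the helper names, the chosen prefix length, and the omission of the minimum-size check and of the verification phase are ours.

    import math
    from collections import defaultdict

    def prefix_len(size, t):
        # One common prefix length for Jaccard threshold t.
        return size - math.ceil(t * size) + 1

    def build_index(collection, t):
        """collection: id -> token list; all lists sorted by one global token order.
        Index the prefix tokens; each inverted list is sorted by set size."""
        index = defaultdict(list)
        for sid, tokens in collection.items():
            for tok in tokens[:prefix_len(len(tokens), t)]:
                index[tok].append((sid, len(tokens)))
        for lst in index.values():
            lst.sort(key=lambda entry: entry[1])
        return index

    def max_candidate_size(r_len, pos, t):
        # Stand-in bound: the plain length filter |r| / t. PEL would return the
        # tighter, position-dependent pmaxsize value here instead.
        return r_len / t

    def probe(index, r_tokens, t):
        """Collect candidate ids for the probing set r. Because the lists are
        size-sorted, entries beyond the upper bound are never touched."""
        candidates = set()
        for pos, tok in enumerate(r_tokens[:prefix_len(len(r_tokens), t)]):
            bound = max_candidate_size(len(r_tokens), pos, t)
            for sid, size in index.get(tok, ()):
                if size > bound:
                    break  # truncate the inverted-list scan
                candidates.add(sid)
        return candidates

A verification phase that computes the actual overlap of each candidate pair (and may apply the positional filter, as discussed above) would then run over the returned candidate set.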
7. REFERENCES

[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proc. VLDB, pages 918–929, 2006.
[2] N. Augsten, M. H. Böhlen, and J. Gamper. The pq-gram distance between ordered labeled trees. ACM TODS, 35(1), 2010.
[3] N. Augsten, A. Miraglia, T. Neumann, and A. Kemper. On-the-fly token similarity joins in relational databases. In Proc. SIGMOD, pages 1495–1506. ACM, 2014.
[4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. WWW, 7:131–140, 2007.
[5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. ICDE, page 5. IEEE, 2006.
[6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB, pages 518–529, 1999.
[7] L. A. Ribeiro and T. Härder. Generalizing prefix filtering to improve set similarity joins. Information Systems, 36(1):62–78, 2011.
[8] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proc. SIGMOD, pages 743–754. ACM, 2004.
[9] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: A large-scale study in the Orkut social network. In Proc. SIGKDD, pages 678–684. ACM, 2005.
[10] J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proc. SIGMOD, pages 85–96. ACM, 2012.
[11] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. In Proc. VLDB, 2008.
[12] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM TODS, 36(3):15, 2011.