Proceedings 25. GI-Workshop „Grundlagen von Datenbanken“ 28.05.2013 – 31.05.2013 Ilmenau, Deutschland Kai-Uwe Sattler Stephan Baumann Felix Beier Heiko Betz Francis Gropengießer Stefan Hagedorn (Hrsg.) ii Vorwort Liebe Teilnehmerinnen und Teilnehmer, mittlerweile zum 25. Mal fand vom 28.5. bis 31.5.2013 der Workshop Grundlagen ” von Datenbanken“ des GI-Arbeitskreises Grundlagen von Informationssystemen“ im ” Fachbereich Datenbanken und Informationssysteme (DBIS) statt. Nach Österreich im Jahr 2011 und dem Spreewald im Jahr 2012 war bereits zum dritten Mal Thüringen der Austragungsort – diesmal die kleine Gemeinde Elgersburg am Fuße der Hohen Warte im Ilm-Kreis. Organisiert wurde der Workshop vom Fachgebiet Datenbanken und Informationssysteme der TU Ilmenau. Die Workshop-Reihe, die 1989 in Volkse bei Braunschweig vom Braunschweiger Datenbanklehrstuhl ins Leben gerufen wurde und die ersten 3 Jahre auch in Volkse blieb, hat sich inzwischen als eine Institution für den Gedankenaustausch gerade für Nachwuchswissenschaftler/-innen aus dem deutschsprachigen Raum im Bereich Da- tenbanken und Informationssysteme etabliert. Längst sind dabei die Beschränkungen auf Deutsch als Vortragssprache und reine theorie- und grundlagenorientierte Themen gefallen – auch wenn die offene Atmosphäre an abgeschiedenen Tagungsorten (und Elgersburg stellte hier keine Ausnahme dar) mit viel Zeit für intensive Diskussionen während der Sitzungen und an den Abenden geblieben sind. Für den diesjährigen Workshop wurden 15 Beiträge eingereicht und von jeweils drei Mitgliedern des 13-köpfigen Programmkomitees begutachtet. Aus allen eingereichten Beiträgen wurden 13 für die Präsentation auf dem Workshop ausgewählt. Die Band- breite der Themen reichte dabei von fast schon klassischen Datenbankthemen wie Anfrageverarbeitung (mit XQuery), konzeptueller Modellierung (für XML Schemaevo- lution), Indexstrukturen (für Muster auf bewegten Objekten) und dem Auffinden von Spaltenkorrelationen über aktuelle Themen wie MapReduce und Cloud-Datenbanken bis hin zu Anwendungen im Bereich Image Retrieval, Informationsextraktion, Complex Event Processing sowie Sicherheitsaspekten. Vervollständigt wurde das viertägige Programm durch zwei Keynotes von namhaften Datenbankforschern: Theo Härder stellte das WattDB-Projekt eines energieproportio- nalen Datenbanksystems vor und Peter Boncz diskutierte die Herausforderungen an die Optimierung von Datenbanksysteme durch moderne Hardwarearchitekturen - un- tersetzt mit praktischen Vorführungen. Beiden sei an dieser Stelle für ihr Kommen und ihre interessanten Vorträge gedankt. In zwei weiteren Vorträgen nutzten die Sponso- ren des diesjährigen Workshops, SAP AG und Objectivity Inc., die Gelegenheit, die Datenbanktechnologien hinter HANA (SAP AG) und InfiniteGraph (Objectivity Inc.) vorzustellen. Hannes Rauhe und Timo Wagner als Vortragenden möchten wir daher genauso wie den beiden Unternehmen für die finanzielle Unterstützung des Workshops und damit der Arbeit des GI-Arbeitskreises danken. 
Gedankt sei an dieser Stelle auch allen, die an der Organisation und Durchführung beteiligt waren: den Autoren für ihre Beiträge und Vorträge, den Mitgliedern des Pro- grammkomitees für ihre konstruktive und pünktliche Begutachtung der Einreichungen, den Mitarbeitern vom Hotel am Wald in Elgersburg, dem Leitungsgremium des Ar- beitskreises in Person von Günther Specht und Stefan Conrad, die es sich nicht nehmen iii ließen, persönlich am Workshop teilzunehmen, sowie Eike Schallehn, der im Hinter- grund mit Rat und Tat zur Seite stand. Der größte Dank gilt aber meinem Fachge- bietsteam, das den Großteil der Organisationsarbeit geleistet hat: Stephan Baumann, Francis Gropengießer, Heiko Betz, Stefan Hagedorn und Felix Beier. Ohne ihr Enga- gement wäre der Workshop nicht möglich gewesen. Herzlichen Dank! Kai-Uwe Sattler Ilmenau am 28.5.2013 iv v Komitee Programm-Komitee • Andreas Heuer, Universität Rostock • Eike Schallehn, Universität Magdeburg • Erik Buchmann, Karlsruher Institut für Technologie • Friederike Klan, Universität Jena • Gunter Saake, Universität Magdeburg • Günther Specht, Universität Innsbruck • Holger Schwarz, Universität Stuttgart • Ingo Schmitt, Brandenburgische Technische Universität Cottbus • Kai-Uwe Sattler, Technische Universität Ilmenau • Katja Hose, Aalborg University • Klaus Meyer-Wegener, Universität Erlangen • Stefan Conrad, Universität Düsseldorf • Torsten Grust, Universität Tübingen Organisations-Komitee • Kai-Uwe Sattler, TU Ilmenau • Stephan Baumann, TU Ilmenau • Felix Beier, TU Ilmenau • Heiko Betz, TU Ilmenau • Francis Gropengießer, TU Ilmenau • Stefan Hagedorn, TU Ilmenau vi vii Inhaltsverzeichnis 1 Keynotes 1 1.1 WattDB—a Rocky Road to Energy Proportionality Theo Härder . . . . 1 1.2 Optimizing database architecture for machine architecture: is there still hope? Peter Boncz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Workshop-Beiträge 5 2.1 Adaptive Prejoin Approach for Performance Optimization in MapReduce- based Warehouses Weiping Qu, Michael Rappold und Stefan Dessloch . 5 2.2 Ein Cloud-basiertes räumliches Decision Support System für die Her- ausforderungen der Energiewende Golo Klossek, Stefanie Scherzinger und Michael Sterner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Consistency Models for Cloud-based Online Games: the Storage Sys- tem’s Perspective Ziqiang Diao . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Analysis of DDoS Detection Systems Michael Singhof . . . . . . . . . . 22 2.5 A Conceptual Model for the XML Schema Evolution Thomas Nösinger, Meike Klettke und Andreas Heuer . . . . . . . . . . . . . . . . . . . . . 28 2.6 Semantic Enrichment of Ontology Mappings: Detecting Relation Types and Complex Correspondences Patrick Arnold . . . . . . . . . . . . . . 34 2.7 Extraktion und Anreicherung von Merkmalshierarchien durch Analyse unstrukturierter Produktrezensionen Robin Küppers . . . . . . . . . . . 40 2.8 Ein Verfahren zur Beschleunigung eines neuronalen Netzes für die Ver- wendung im Image Retrieval Daniel Braun . . . . . . . . . . . . . . . . 46 2.9 Auffinden von Spaltenkorrelationen mithilfe proaktiver und reaktiver Verfahren Katharina Büchse . . . . . . . . . . . . . . . . . . . . . . . . 52 2.10 MVAL: Addressing the Insider Threat by Valuation-based Query Pro- cessing Stefan Barthel und Eike Schallehn . . . . . . . . . . . . . . . . . 58 2.11 TrIMPI: A Data Structure for Efficient Pattern Matching on Moving Objects Tsvetelin Polomski und Hans-Joachim Klein . . . . . . . . . . . 
64
2.12 Complex Event Processing in Wireless Sensor Networks  Omran Saleh . . . . . . . . . 69
2.13 XQuery processing over NoSQL stores  Henrique Valer, Caetano Sauer und Theo Härder . . . . . . . . . 75

WattDB—a Rocky Road to Energy Proportionality

Theo Härder
Databases and Information Systems Group
University of Kaiserslautern, Germany
haerder@cs.uni-kl.de

Extended Abstract

Energy efficiency is becoming more important in database design, i.e., the work delivered by a database server should be accomplished with minimal energy consumption. So far, a substantial number of research papers have examined and optimized the energy consumption of database servers or single components. In this way, our first efforts were exclusively focused on the use of flash memory or SSDs in a DBMS context to identify their performance potential for typical DB operations. In particular, we developed tailor-made algorithms to support caching for flash-based databases [3], however with limited success concerning the energy efficiency of the entire database server.

A key observation made by Tsirogiannis et al. [5] concerning the energy efficiency of single servers is that the best performing configuration is also the most energy-efficient one, because power use is not proportional to system utilization and, for this reason, the runtime needed for accomplishing a computing task essentially determines energy consumption. Based on our caching experiments for flash-based databases, we came to the same conclusion [2]. Hence, the server system must be fully utilized to be most energy efficient. However, real-world workloads do not stress servers continuously. Typically, their average utilization ranges between 20 and 50% of peak performance [1]. Therefore, traditional single-server DBMSs are chronically underutilized and operate below their optimal energy-consumption-per-query ratio. As a result, there is a big optimization opportunity to decrease energy consumption during off-peak times.

Because the energy use of single-server systems is far from being energy proportional, we came up with the hypothesis that better energy efficiency may be achieved by a cluster of nodes whose size is dynamically adjusted to the current workload demand. For this reason, we shifted our research focus from inflexible single-server DBMSs to distributed clusters running on lightweight nodes. Although distributed systems impose some performance degradation compared to a single, brawny server, they offer higher energy saving potential in turn.

Current hardware is not energy proportional, because a single server consumes, even when idle, a substantial fraction of its peak power [1]. Because typical usage patterns lead to a server utilization far less than its maximum, the energy efficiency of a server aside from peak performance is reduced [4]. In order to achieve energy proportionality using commodity hardware, we have chosen a clustered approach, where each node can be powered independently. By turning on/off whole nodes, the overall performance and energy consumption can be fitted to the current workload. Unused servers could be either shut down or made available to other processes. If present in a cloud, those servers could be leased to other applications.

We have developed a research prototype of a distributed DBMS called WattDB on a scale-out architecture, consisting of n wimpy computing nodes, interconnected by a 1 GBit/s Ethernet switch. The cluster currently consists of 10 identical nodes, composed of an Intel Atom D510 CPU, 2 GB DRAM and an SSD. The configuration is considered Amdahl-balanced, i.e., balanced between I/O and network throughput on the one hand and processing power on the other. Compared to InfiniBand, the bandwidth of the interconnecting network is limited but sufficient to supply the lightweight nodes with data. More expensive, yet faster connections would have required more powerful processors and more sophisticated I/O subsystems. Such a design would have pushed the cost beyond limits, especially because we would not have been able to use commodity hardware. Furthermore, by choosing lightweight components, the overall energy footprint is low, and the smallest configuration, i.e., the one with the fewest nodes, exhibits low power consumption. Moreover, experiments running on a small cluster can easily be repeated on a cluster with more powerful nodes.
A dedicated node is the master node, handling incoming queries and coordinating the cluster. Some of the nodes have four hard disks attached each and act as storage nodes, providing persistent data storage to the cluster. The remaining nodes (without hard disk drives) are called processing nodes. Due to the lack of directly accessible storage, they can only operate on data provided by other nodes (see Figure 1). All nodes can evaluate (partial) query plans and execute DB operators, e.g., sorting, aggregation, etc., but only the storage nodes can access the DB storage structures, i.e., tables and indexes.

Figure 1: Overview of the WattDB cluster (master node, processing nodes with SSDs, and storage nodes with four disks each)

Each storage node maintains a DB buffer to keep recently referenced pages in main memory, whereas a processing node does not cache intermediate results. As a consequence, each query always needs to fetch the qualified records from the corresponding storage nodes. Hence, our cluster design results in a shared-nothing architecture where the nodes differ only in whether they have direct access to DB data on external storage. Each of the nodes is additionally equipped with a 128 GB solid-state disk (Samsung 830 SSD). The SSDs do not store the DB data; they provide swap space to support external sorting and persistent storage for configuration files. We have chosen SSDs because their access latency is much lower compared to traditional hard disks; hence, they are better suited for temp storage.

In WattDB, a dedicated component running on the master node, called EnergyController, controls the energy consumption. This component monitors the performance of all nodes in the cluster. Depending on the current query workload and node utilization, the EnergyController activates and suspends nodes to guarantee a sufficiently high node utilization under the current workload demand. Suspended nodes consume only a fraction of the idle power, but can be brought back online in a matter of a few seconds. The EnergyController also modifies query plans to dynamically distribute the current workload over all running nodes, thereby achieving balanced utilization of the active processing nodes.

As data-intensive workloads, we submit specific TPC-H queries against a distributed shared-nothing DBMS, where time and energy use are captured by specific monitoring and measurement devices. We configure various static clusters of varying sizes and show their influence on energy efficiency and performance. Further, using an EnergyController and a load-aware scheduler, we verify the hypothesis that energy proportionality for database management tasks can be well approximated by dynamic clusters of wimpy computing nodes.
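The node-scheduling decision sketched above can be pictured as a simple utilization-driven policy. The following Python fragment is only an illustration of that idea, not WattDB code; the thresholds, node counts, and function names are assumptions chosen for the example.

```python
# Illustrative sketch (not WattDB code): decide how many cluster nodes should
# be active, given the current utilization of the active nodes.
# Thresholds and node counts are made-up example values.

MIN_NODES, MAX_NODES = 1, 10          # smallest / largest configuration
TARGET_LOW, TARGET_HIGH = 0.4, 0.8    # desired utilization band

def plan_active_nodes(active_nodes, utilizations):
    """Return the number of nodes that should be powered on."""
    avg = sum(utilizations) / len(utilizations)
    if avg > TARGET_HIGH and active_nodes < MAX_NODES:
        return active_nodes + 1        # wake up a suspended node
    if avg < TARGET_LOW and active_nodes > MIN_NODES:
        return active_nodes - 1        # suspend an underutilized node
    return active_nodes                # utilization is inside the band

# Example: 4 active nodes, mostly idle -> shrink the cluster to 3 nodes.
print(plan_active_nodes(4, [0.2, 0.3, 0.1, 0.25]))
```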
1. REFERENCES
[1] L. A. Barroso and U. Hölzle. The Case for Energy-Proportional Computing. IEEE Computer, 40(12):33–37, 2007.
[2] T. Härder, V. Hudlet, Y. Ou, and D. Schall. Energy efficiency is not enough, energy proportionality is needed! In DASFAA Workshops, 1st Int. Workshop on FlashDB, LNCS 6637, pages 226–239, 2011.
[3] Y. Ou, T. Härder, and D. Schall. Performance and Power Evaluation of Flash-Aware Buffer Algorithms. In DEXA, LNCS 6261, pages 183–197, 2010.
[4] D. Schall, V. Höfner, and M. Kern. Towards an Enhanced Benchmark Advocating Energy-Efficient Systems. In TPCTC, LNCS 7144, pages 31–45, 2012.
[5] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah. Analyzing the Energy Efficiency of a Database Server. In SIGMOD Conference, pages 231–242, 2010.

Optimizing database architecture for machine architecture: is there still hope?

Peter Boncz
CWI
p.boncz@cwi.nl

Extended Abstract

In the keynote, I will give some examples of how computer architecture has strongly evolved in the past decennia and how this influences the performance, and therefore the design, of algorithms and data structures for data management. On the one hand, these changes in hardware architecture have caused the (continuing) need for new data management research, i.e. hardware-conscious database research. Here, I will draw examples from hardware-conscious research performed on the CWI systems MonetDB and Vectorwise.

This diversification trend in the computer architectural characteristics of the various solutions in the market seems to be intensifying. This is seen in quite different architectural options, such as CPU vs GPU vs FPGA, but even restricting oneself to just CPUs there seems to be increasing design variation in architecture and platform behavior. This poses a challenge to hardware-conscious database research. In particular, there is the all too present danger to over-optimize for one particular architecture, or to propose techniques that will have only a very short span of utility. The question thus is not only to find specific ways to optimize for certain hardware features, but to do so in a way that works across the full spectrum of architectures, i.e. with robust techniques.

I will close the talk with recent work at CWI and Vectorwise on the robustness of query evaluator performance, describing a project called "Micro-Adaptivity" in which database systems are made self-adaptive and react immediately to observed performance, self-optimizing to the combination of current query workload, observed data distributions, and hardware characteristics.
Adaptive Prejoin Approach for Performance Optimization in MapReduce-based Warehouses

Weiping Qu (Heterogeneous Information Systems Group, University of Kaiserslautern), qu@informatik.uni-kl.de
Michael Rappold* (Department of Computer Science, University of Kaiserslautern), m_rappol@cs.uni-kl.de
Stefan Dessloch (Heterogeneous Information Systems Group, University of Kaiserslautern), dessloch@informatik.uni-kl.de
* finished his work during his master study at University of Kaiserslautern

ABSTRACT
MapReduce-based warehousing solutions (e.g. Hive) for big data analytics, with the capabilities of storing and analyzing high volumes of both structured and unstructured data in a scalable file system, have emerged recently. Their efficient data loading features enable a so-called near real-time warehousing solution, in contrast to those offered by conventional data warehouses with complex, long-running ETL processes.
However, there are still many opportunities for performance improvements in MapReduce systems. The performance of analyzing structured data in them cannot cope with the one in traditional data warehouses. For example, join operations are generally regarded as a bottleneck of performing generic complex analytics over structured data with MapReduce jobs.
In this paper, we present one approach for improving performance in MapReduce-based warehouses by redundantly pre-joining frequently used dimension columns with the fact table during data transfer and adapting queries to this join-friendly schema automatically at runtime using a rewrite component. This approach is driven by statistics information derived from previously executed workloads in terms of join operations.
The results show that the execution performance is improved by getting rid of join operations in a set of future workloads whose join exactly fits the pre-joined fact table schema, while the performance still remains the same for other workloads.

1. INTRODUCTION
By packaging complex custom imperative programs (text mining, machine learning, etc.) into simple map and reduce functions and executing them in parallel on files in a large scalable file system, MapReduce/Hadoop (an open-source implementation of the MapReduce framework from the Apache community, see http://hadoop.apache.org) systems enable analytics on large amounts of unstructured or structured data in acceptable response time.
With the continuous growth of data, scalable data stores based on Hadoop/HDFS (Hadoop Distributed File System, used to store the data in Hadoop for analysis) have achieved more and more attention for big data analytics. In addition, by means of simply pulling data into the file system of MapReduce-based systems, unstructured data without schema information is directly analyzed with parallelizable custom programs, whereas data can only be queried in traditional data warehouses after it has been loaded by ETL tools (cleansing, normalization, etc.), which normally takes a long period of time.
Consequently, many web or business companies add MapReduce systems to their analytical architecture. For example, Fatma Özcan et al. [12] integrate their DB2 warehouse with the Hadoop-based analysis tool IBM Infosphere BigInsights using connectors between these two platforms. An analytical synthesis is provided, where unstructured data is initially placed in a Hadoop-based system and analyzed by MapReduce programs. Once its schema can be defined, it is further loaded into a DB2 warehouse with more efficient analysis execution capabilities. Another example is the data warehousing infrastructure at Facebook, which involves a web-based tier, a federated MySQL tier and a Hadoop-based analytical cluster (Hive). Such an orchestration of various analytical platforms forms a heterogeneous environment where each platform has a different interface, data model, computational capability, storage system, etc.
INTRODUCTION environment is always challenging, since it is generally hard By packaging complex custom imperative programs (text to estimate the computational capability or operational cost mining, machine learning, etc.) into simple map and reduce concisely on each autonomous platform. The internal query functions and executing them in parallel on files in a large engine and storage system do not tend to be exposed to outside and are not designed for data integration. ∗finished his work during his master study at university of In our case, relational databases and Hadoop will be in- kaiserslautern tegrated together to deliver an analytical cluster. Simply transferring data from relational databases to Hadoop with- out considering the computational capabilities in Hadoop can lead to lower performance. As an example, performing complex analytical workloads over multiple small/large tables (loaded from relational data- 1 one open-source implementation of MapReduce framework th 25 GI-Workshop on Foundations of Databases (Grundlagen von Daten- from Apache community, see http://hadoop.apache.org 2 banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Hadoop Distributed File System - is used to store the data Copyright is held by the author/owner(s). in Hadoop for analysis information describing advertisements – their category, there name, daily in the form the advertiser information etc. The data sets originating in the latter through a set of l mostly correspond to actions such as viewing an advertisement, consumption. clicking on it, fanning a Facebook page etc. In traditional data The data from th warehousing terminology, more often than not the data in the Hadoop clusters processes dump compressing them into the Hive-Had bases) in Hadoop leads to a number of join operations which failures and also n slows down the whole processing. The reason is that the much load on the join performance is normally weak in MapReduce systems Scribe-Hadoop Clusters running the scrape Web Servers avoiding extra loa as compared to relational databases [15]. Performance limi- any notions of stro tations have been shown due to several reasons such as the order to avoid loc inherent unary feature of map and reduce functions. database server b cannot be read ev To achieve better global performance in such an analytical data from that par synthesis with multiple platforms from a global perspective servers, there are a of view, several strategies can be applied. Hive replication the scrapes and by data a daily dum One would be simply improving the join implementation Hadoop clusters. T Production Hive-Hadoop on single MapReduce platform. There have been several ex- Adhoc Hive-Hadoop Cluster tables. isting works trying to improve join performance in MapRe- Cluster As shown in Figu duce systems [3, 1]. where the data b Another one would be using heuristics for global perfor- stream processes. Hadoop cluster - i mance optimization. In this paper, we will take a look at the strict delivery dea second one. In order to validate our general idea of improv- Hive-Hadoop clus ing global performance on multiple platforms, we deliver our Federated MySQL well as any ad ho data sets. The ad adaptive approach in terms of join performance. We take the Figure 1: Data Flow Architecture run production job data flow architecture at Facebook as a starting point and Figure 1: Facebook Data Flow Architecture[17] the contributions are summarized as follows: 1014 1. 
2.2 Hive
Hive [16] is an open source data warehousing solution built on top of MapReduce/Hadoop. Analytics is essentially done by MapReduce jobs, and data is still stored and managed in Hadoop/HDFS.
Hive supports a higher-level SQL-like language called HiveQL for users who are familiar with SQL for accessing files in Hadoop/HDFS, which highly increases the productivity of using MapReduce systems. When a HiveQL query comes in, it is automatically translated into corresponding MapReduce jobs with the same analytical semantics. For this purpose, Hive has its own meta-data store which maps the HDFS files to the relational data model. Files are logically interpreted as relational tables during HiveQL query execution.
Furthermore, in contrast to the high data loading cost (using ETL jobs) in traditional data warehouses, Hive benefits from its efficient loading process, which pulls raw files directly into Hadoop/HDFS and further publishes them as tables. This feature makes Hive much more suitable for dealing with large volumes of data (i.e. big data).

2.3 Join in Hadoop/Hive
There has been an ongoing debate comparing parallel database systems and MapReduce/Hadoop. In [13], experiments showed that the performance of selection, aggregation and join tasks in Hadoop could not reach that of parallel databases (Vertica & DBMS-X). Several reasons for the performance difference have also been explained by Stonebraker et al. in [15], such as repetitive record parsing and high I/O cost due to non-compression & non-indexing.
Moreover, as MapReduce was not originally designed to combine information from two or more data sources, join implementations are always cumbersome [3]. The join performance relies heavily on the implementation of MapReduce jobs, which has been considered as not straightforward.
As Hive is built on top of MapReduce/Hadoop, the join operation is essentially done by corresponding MapReduce jobs. Thus, Hive suffers from these issues even though there have been efforts [5] to improve join performance in MapReduce systems or in Hive.
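To illustrate why joins are awkward to express in plain MapReduce, the following self-contained Python sketch simulates a reduce-side (repartition) join between a fact and a dimension table. It is a simplified illustration of the general pattern discussed above, not code from Hive or from the cited systems; the table and column names are invented.

```python
from collections import defaultdict

# Simplified reduce-side join: both inputs are mapped to (join_key, tagged_row),
# the shuffle groups rows by key, and the reducer builds the cross product of
# fact and dimension rows per key.

fact = [{"ad_id": 1, "clicks": 10}, {"ad_id": 2, "clicks": 5}]
dim  = [{"ad_id": 1, "category": "sports"}, {"ad_id": 2, "category": "news"}]

def map_phase():
    for row in fact:
        yield row["ad_id"], ("F", row)        # tag fact rows
    for row in dim:
        yield row["ad_id"], ("D", row)        # tag dimension rows

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    for key, values in groups.items():
        facts = [r for tag, r in values if tag == "F"]
        dims  = [r for tag, r in values if tag == "D"]
        for f in facts:                        # cross product per join key
            for d in dims:
                yield {**f, **d}

joined = list(reduce_phase(shuffle(map_phase())))
print(joined)   # two joined records, each carrying clicks and category
```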
3. FULL PRE-JOIN APPROACH
Due to the fact that the join performance is a performance bottleneck in Hive with its inherent MapReduce features, one naïve idea for improving total workload performance would be to simply eliminate the join task from the workload by performing a rewritten workload with the same analytical semantics over pre-joined tables created in the data load phase. A performance gain would be expected by performing a large table scan with the high parallelism of an increasing number of working nodes in Hadoop instead of a join. In addition, the scalable storage system allows us to create redundant pre-joined tables for some workloads with specific join patterns.
In an experiment, we tried to validate this strategy. An analytical workload (TPC-H Query 3) was executed over two data sets of the TPC-H benchmark (with scale factor 5 & 10), on the original table schema (with join at runtime) and on a fully pre-joined table schema (without join) which fully joins all the related dimension tables with the fact table during the load phase, respectively. In this case, we trade storage overhead for better total performance.
As shown on the left side of Figure 2(a), a performance gain for the total workload (including the join) over the data set with SF 5 can be seen, with 6 GB storage overhead introduced by fully pre-joining the related tables into one redundant table (shown in Figure 2(b)). The overall performance can be significantly increased if workloads with the same join pattern later occur frequently, especially for periodic queries over the production Hive-Hadoop cluster in the Facebook example.
However, the result of performing the same query on the data set with SF 10 is disappointing, as there is no performance gain while paying 12.5 GB of storage for redundancy (shown in Figure 2(b)), which is not what we expected. The reason could be that the overhead of scanning such a redundant, fully pre-joined table and the high I/O cost offset the performance gain as the accessed data volume grows.

Figure 2: Running TPC-H Query 3 on Original and Full Pre-joined Table Schema; (a) Average Runtimes (sec), (b) Accessed Data Volume (GB)
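The full pre-join itself amounts to denormalizing every dimension table into the fact table once, at load time. The short Python sketch below shows this denormalization on toy data; the table and column names are invented for illustration and do not come from the paper's TPC-H setup.

```python
# Toy full pre-join: every dimension row is folded into the fact table at load
# time, so later queries can scan a single wide table instead of joining.
fact = [
    {"order_id": 1, "cust_id": 10, "price": 99.0},
    {"order_id": 2, "cust_id": 11, "price": 42.0},
]
customer = {10: {"cust_name": "alice", "segment": "BUILDING"},
            11: {"cust_name": "bob",   "segment": "MACHINERY"}}

def full_prejoin(fact_rows, dim_by_key, fk):
    """Return a fully denormalized copy of the fact table."""
    return [{**row, **dim_by_key[row[fk]]} for row in fact_rows]

wide = full_prejoin(fact, customer, "cust_id")

# A query that previously needed a join now becomes a plain scan + filter:
building_revenue = sum(r["price"] for r in wide if r["segment"] == "BUILDING")
print(wide, building_revenue)
```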
4. ADAPTIVE PRE-JOIN APPROACH
Taking the lessons learned from the full pre-join approach above, we propose an adaptive pre-join approach in this paper.
Instead of pre-joining full dimension tables with the fact table, we try to identify the dimension columns which occurred frequently in the select, where, etc. clauses of previously executed queries for filtering, aggregation and so on. We refer to these columns as additional columns, as compared to the join columns in the join predicates. By collecting a list of additional column sets from previous queries, for example the periodic queries on the production Hive-Hadoop cluster, a frequent column set could be extracted.
One example is illustrated in Figure 3. The frequent set of additional columns has been extracted: the column r in dimension table α is frequently joined with the fact table in the previous workloads as a filter or aggregate column, as is the column x in dimension table β. During the next load phase, the fact table is expanded by redundantly pre-joining these two additional columns r and x with it.

Figure 3: Adaptive Pre-joined Schema in the Facebook example (the fact table λ is expanded to λ' = (λ, α.r, β.x) with the frequently used columns from dimension tables α and β)

Depending on the statistics information of previous queries, different frequent sets of additional columns could be found in diverse time intervals. Thus, the fact table is pre-joined in an adaptive manner.
Assume that the additional columns identified in previous queries will also frequently occur in future ones (as in the Facebook example); the benefits of the adaptive pre-join approach are then two-fold:
First, when all the columns (including dimension columns) in a certain incoming query which requires a join operation are contained in the pre-joined fact table, this query can be directly performed on the pre-joined fact table without a join.
Second, the adaptive pre-join approach leads to a smaller table size in contrast to the full pre-join approach, as only subsets of the dimension tables are pre-joined. Thus, the resulting storage overhead is reduced, which plays a significant role especially in big data scenarios (i.e. terabytes or petabytes of data).
To automatically accomplish the adaptive pre-join approach, three sub-steps are developed: frequent column set extraction, pre-join and query rewrite.

4.1 Frequent Column Set Extraction
In the first phase, the statistics collected for extracting the frequent set of additional columns are formatted as a list of entries, each of which has the following form:
Set: {Fact, DimX.Coli, DimX.Colj, ..., DimY.Colk}
The join set always starts with the involved fact table, while the joint dimension columns are identified and captured from the select, where, etc. clauses or from the sub-queries. The frequent set of additional columns could be extracted using a set of frequent itemset mining approaches [2, 7, 11].
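As a concrete illustration of this first sub-step, the following Python sketch counts how often each set of additional columns occurs in a workload log and reports the sets above a support threshold. It is a deliberately simple stand-in for the frequent itemset mining algorithms cited above [2, 7, 11]; the log format and the threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Each workload entry: fact table plus the additional dimension columns it used.
workload_log = [
    ("fact", frozenset({"alpha.r", "beta.x"})),
    ("fact", frozenset({"alpha.r", "beta.x", "beta.y"})),
    ("fact", frozenset({"alpha.r"})),
]

def frequent_column_sets(log, min_support=2):
    """Count every non-empty subset of additional columns and keep the frequent ones."""
    counts = Counter()
    for _fact, cols in log:
        for size in range(1, len(cols) + 1):
            for subset in combinations(sorted(cols), size):
                counts[subset] += 1
    return {subset: n for subset, n in counts.items() if n >= min_support}

print(frequent_column_sets(workload_log))
# e.g. ('alpha.r',): 3 and ('alpha.r', 'beta.x'): 2 survive the threshold,
# so alpha.r and beta.x become candidates for pre-joining into the fact table.
```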
4.2 Query Rewrite
As the table schema is changed in our case (i.e. a newly generated fact table schema), initial queries need to be rewritten for successful execution. Since the fact table is pre-joined with a set of dedicated redundant dimension columns, the tables which are involved in the from clause of the original query can be replaced with this new fact table once all the columns have been covered by it.
By storing the mapping from the newly generated fact table schema to the old schema in the catalog, the query rewrite process can be easily applied. Note that the common issue of handling complex sub-queries in Hive can thereby be facilitated if the columns in the sub-query have been pre-joined with the fact table.
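The rewrite step can be pictured as a purely syntactic substitution driven by the catalog mapping. The Python sketch below rewrites a toy HiveQL-like query against the original star schema into a query against the pre-joined fact table; the query text, table names and catalog structure are invented for illustration and are not the paper's actual rewrite component.

```python
# Catalog entry (assumed format): which original tables/columns are covered
# by the pre-joined fact table and under which name it is stored in Hive.
catalog = {
    "prejoined_fact": {
        "covers_tables": {"fact", "alpha", "beta"},
        "covers_columns": {"fact.sales", "alpha.r", "beta.x"},
    }
}

def rewrite(query_tables, query_columns, original_from, original_where):
    """Replace the joined tables by the pre-joined fact table if it covers the query."""
    for wide_table, info in catalog.items():
        if query_tables <= info["covers_tables"] and query_columns <= info["covers_columns"]:
            # Strip table prefixes: all columns now live in the wide fact table.
            where = original_where.replace("alpha.", "").replace("beta.", "").replace("fact.", "")
            return f"SELECT sales FROM {wide_table} WHERE {where}"
    return f"SELECT sales FROM {original_from} WHERE {original_where}"  # fall back: keep the join

print(rewrite({"fact", "alpha", "beta"},
              {"fact.sales", "alpha.r", "beta.x"},
              "fact JOIN alpha ON ... JOIN beta ON ...",
              "alpha.r = 5 AND beta.x > 10"))
# -> SELECT sales FROM prejoined_fact WHERE r = 5 AND x > 10
```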
5. IMPLEMENTATION AND EVALUATION
We use Sqoop (an open source tool for data transfer between Hadoop and relational databases, see http://sqoop.apache.org/) as the basis to implement our approach. The TPC-H benchmark data set with SF 10 is adaptively pre-joined according to the workload statistics and transferred from MySQL to Hive. First, the extracted join pattern information is sent to Sqoop as additional transformation logic embedded in the data transfer jobs for generating the adaptive pre-joined table schema on the original data sources. Furthermore, the generated schema is stored in Hive to enable automatic query rewrite at runtime.
We tested the adaptive pre-join approach on a six-node cluster (Xeon Quadcore CPU at 2.53 GHz, 4 GB RAM, 1 TB SATA-II disk, Gigabit Ethernet) running Hadoop and Hive. After running the same TPC-H Query 3 over the adaptive pre-joined table schema, the result in Figure 4(a) shows that the average runtime is significantly reduced. The join task has been eliminated for this query, and the additional overheads (record parsing, I/O cost) have been relieved due to the smaller size of redundancy, as shown in Figure 4(b).

Figure 4: Running TPC-H Query 3 on Original, Full Pre-joined and Adaptive Pre-joined Table Schema; (a) Average Runtimes (sec), (b) Accessed Data Volume (GB)

6. RELATED WORK
An adaptively pre-joined fact table is essentially a materialized view in Hive. Creating materialized views in data warehouses is nothing new but a technique used for query optimization. Since the 1990s, a substantial effort [6, 8] has been made to answer queries using views in data warehouses. Furthermore, several subsequent works [14, 10] have focused on dynamic view management based on runtime statistics (e.g. reference frequency, result data size, execution cost) and measured profits for better query performance. In our work, we reviewed these sophisticated techniques in a MapReduce-based environment.
Cheetah [4] is a high performance, custom data warehouse on top of MapReduce. It is very similar to the MapReduce-based warehouse Hive introduced in this paper. The performance issue of the join implementation has also been addressed in Cheetah. To reduce the network overhead for joining big dimension tables with the fact table at query runtime, big dimension tables are denormalized and all the dimension attributes are directly stored in the fact table. In contrast, we choose to only denormalize the frequently used dimension attributes with the fact table, since we believe that less I/O cost can be achieved in this way.

7. CONCLUSION AND FUTURE WORK
We propose a schema adaption approach for global optimization in an analytical synthesis of relational databases and a MapReduce-based warehouse, Hive. As MapReduce systems have weak join performance, frequently used columns of dimension tables are pre-joined with the fact table according to useful workload statistics in an adaptive manner before being transferred to Hive. Besides, a rewrite component enables the transparent execution of incoming workloads with join operations over such pre-joined tables. In this way, better performance can be achieved in Hive. Note that this approach is not restricted to any specific platform like Hive. Any MapReduce-based warehouse can benefit from it, as generic complex join operations occur in almost every analytical platform.
However, the experimental results also show that the performance improvement is not stable while the data volume grows continuously. For example, when the query is executed on one larger pre-joined table, the performance gain from eliminating joins is offset by the impact caused by the record parsing overhead and high I/O cost during the scan, which results in worse performance. This shows that the total performance of complex data analytics is affected by multiple metrics rather than a single consideration, e.g. join.
With the continuous growth of data, diverse frameworks and platforms (e.g. Hive, Pig) are built for large-scale data analytics and business intelligence applications. Data transfer between different platforms generally takes place in the absence of key information such as operational cost model, resource consumption, computational capability etc. within platforms which are autonomous and inherently not designed for data integration. Therefore, we are looking at a generic description of the operational semantics with their computational capabilities on different platforms and a cost model for performance optimization from a global perspective. The granularity we are observing is a single operator in the execution engines. Thus, a global operator model with a generic cost model is expected for performance improvement in several use cases, e.g. federated systems.
Moreover, as an adaptively pre-joined fact table is regarded as a materialized view in a MapReduce-based warehouse, another open problem left is how to handle the view maintenance issue.
The work from [9] introduced an incre- mental loading approach to achieve near real-time dataware- International Conference on Management of data, housing by using change data capture and change propaga- SIGMOD ’09, pages 165–178, New York, NY, USA, tion techniques. Ideas from this work could be taken further 2009. ACM. to improve the performance of total workload including the [14] P. Scheuermann, J. Shim, and R. Vingralek. pre-join task. Watchman: A data warehouse intelligent cache manager. In Proceedings of the 22th International 8. REFERENCES Conference on Very Large Data Bases, VLDB ’96, [1] F. N. Afrati and J. D. Ullman. Optimizing joins in a pages 51–62, San Francisco, CA, USA, 1996. Morgan map-reduce environment. In Proceedings of the 13th Kaufmann Publishers Inc. International Conference on Extending Database [15] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, Technology, EDBT ’10, pages 99–110, New York, NY, E. Paulson, A. Pavlo, and A. Rasin. Mapreduce and USA, 2010. ACM. parallel dbmss: friends or foes? Commun. ACM, [2] R. Agrawal and R. Srikant. Fast algorithms for mining 53(1):64–71, Jan. 2010. association rules in large databases. In Proceedings of [16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, the 20th International Conference on Very Large Data N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a Bases, VLDB ’94, pages 487–499, San Francisco, CA, petabyte scale data warehouse using Hadoop. In ICDE USA, 1994. Morgan Kaufmann Publishers Inc. ’10: Proceedings of the 26th International Conference [3] S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. on Data Engineering, pages 996–1005. IEEE, Mar. Shekita, and Y. Tian. A comparison of join algorithms 2010. for log processing in mapreduce. In Proceedings of the [17] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, 2010 ACM SIGMOD International Conference on N. Jain, J. Sen Sarma, R. Murthy, and H. Liu. Data Management of data, SIGMOD ’10, pages 975–986, warehousing and analytics infrastructure at facebook. New York, NY, USA, 2010. ACM. In Proceedings of the 2010 ACM SIGMOD [4] S. Chen. Cheetah: a high performance, custom data International Conference on Management of data, warehouse on top of mapreduce. Proc. VLDB Endow., SIGMOD ’10, pages 1013–1020, New York, NY, USA, 3(1-2):1459–1468, Sept. 2010. 2010. ACM. [5] A. Gruenheid, E. Omiecinski, and L. Mark. Query optimization using column statistics in hive. In Proceedings of the 15th Symposium on International Database Engineering & Applications, IDEAS ’11, pages 97–105, New York, NY, USA, 2011. ACM. [6] A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, Dec. 2001. [7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29(2):1–12, May 2000. [8] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data, SIGMOD ’96, pages 205–216, New York, NY, USA, 1996. ACM. [9] T. Jörg and S. Deßloch. Towards generating etl processes for incremental loading. In Proceedings of the 2008 international symposium on Database engineering & applications, IDEAS ’08, pages 101–110, New York, NY, USA, 2008. ACM. [10] Y. Kotidis and N. Roussopoulos. Dynamat: a dynamic view management system for data warehouses. SIGMOD Rec., 28(2):371–382, June 1999. [11] H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. pages 181–192. AAAI Press, 1994. [12] F. Özcan, D. 
Hoa, K. S. Beyer, A. Balmin, C. J. Liu, and Y. Li. Emerging trends in the enterprise data analytics: connecting hadoop and db2 warehouse. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, SIGMOD ’11, pages 1161–1164, New York, NY, USA, 2011. ACM. [13] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD Ein Cloud-basiertes räumliches Decision Support System für die Herausforderungen der Energiewende Golo Klossek Stefanie Scherzinger Michael Sterner Hochschule Regensburg Hochschule Regensburg Hochschule Regensburg golo.klossek stefanie.scherzinger michael.sterner @stud.hs-regensburg.de @hs-regensburg.de @hs-regensburg.de KURZFASSUNG Die Standortfindung etwa für Bankfilialen und die Zo- Die Energiewende in Deutschland wirft sehr konkrete Fra- nierung, also das Ausweisen geographischer Flächen für die gestellungen auf: Welche Standorte eignen sich für Wind- Landwirtschaft, sind klassische Fragestellungen für räumli- kraftwerke, wo können Solaranlagen wirtschaftlich betrieben che Entscheidungsunterstützungssysteme [6]. werden? Dahinter verbergen sich rechenintensive Datenver- Die Herausforderungen an solch ein Spatial Decision Sup- arbeitungsschritte, auszuführen auf Big Data aus mehreren port System im Kontext der Energiewende sind vielfältig: Datenquellen, in entsprechend heterogenen Formaten. Diese 1. Verarbeitung heterogener Datenformate. Arbeit stellt exemplarisch eine konkrete Fragestellung und ihre Beantwortung als MapReduce Algorithmus vor. Wir 2. Skalierbare Anfragebearbeitung auf Big Data. konzipieren eine geeignete, Cluster-basierte Infrastruktur für ein neues Spatial Decision Support System und legen die 3. Eine elastische Infrastruktur, die mit der Erschließung Notwendigkeit einer deklarativen, domänenspezifischen An- neuer Datenquellen ausgebaut werden kann. fragesprache dar. 4. Eine deklarative, domänenspezifische Anfragesprache für komplexe ad-hoc Anfragen. Allgemeine Begriffe Measurement, Performance, Languages. Wir begründen kurz die Eckpunkte dieses Anforderungs- profils im Einzelnen. Dabei vertreten wir den Standpunkt, Stichworte dass existierende Entscheidungsunterstützungssysteme auf Basis relationaler Datenbanken diese nicht in allen Punkten Cloud-Computing, MapReduce, Energiewende. erfüllen können. (1) Historische Wetterdaten sind zum Teil öffentlich zu- 1. EINLEITUNG gänglich, werden aber auch von kommerziellen Anbietern Der Beschluss der Bundesregierung, zum Jahr 2022 aus bezogen. Prominente Vertreter sind das National Center for der Kernenergie auszusteigen und deren Anteil am Strom- Atmospheric Research [12] in Boulder Colorado, der Deut- Mix durch erneuerbare Energien zu ersetzen, fordert einen sche Wetterdienst [7] und die Satel-Light [14] Datenbank der rasanten Ausbau der erneuerbaren Energien. Entscheidend Europäischen Union. Hinzu kommen Messwerte der hoch- für den Bau neuer Windkraft- und Solaranlagen sind vor schuleigenen experimentellen Windkraft- und Solaranlagen. allem die zu erzielenden Gewinne und die Sicherheit der In- Die Vielzahl der Quellen und somit der Formate führen zu vestitionen. Somit sind präzise Ertragsprognosen von großer den klassischen Problemen der Datenintegration. Bedeutung. 
Unterschiedliche Standorte sind zu vergleichen, (2) Daten in hoher zeitlicher Auflösung, die über Jahr- die Ausrichtung der Windkraftanlagen zueinander in den zehnte hinweg erhoben werden, verursachen Datenvolumi- Windparks ist sorgfältig zu planen. Als Entscheidungsgrund- na im Big Data Bereich. Der Deutsche Wetterdienst allein lage dienen hierzu vor allem historische Wetterdaten. Für verwaltet ein Datenarchiv von 5 Petabyte [7]. Bei solchen die Kalkulation des Ertrags von Windkraftanlagen muss ef- Größenordnung haben sich NoSQL Datenbanken gegenüber fizient auf die Datenbasis zugegriffen werden können. Diese relationalen Datenbanken bewährt [4]. erstreckt sich über große Zeiträume, da das Windaufkom- (3) Wir stellen die Infrastruktur für ein interdisziplinäres men nicht nur jährlich schwankt, sondern auch dekadenweise Team der Regensburg School of Energy and Resources mit variiert [3, 9]. mehreren im Aufbau befindlichen Projekten bereit. Um den wachsenden Anforderungen unserer Nutzer gerecht werden zu können, muss das System elastisch auf neue Datenquellen und neue Nutzergruppen angepasst werden können. (4) Unsere Nutzer sind überwiegend IT-affin, doch nicht erfahren in der Entwicklung komplexer verteilter Systeme. Mit einer domänenspezifischen Anfragesprache wollen die Autoren dieses Artikels die intuitive Nutzbarkeit des Sys- tems gewährleisten. Unter diesen Gesichtspunkten konzipieren wir unser Sys- 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- tem als Hadoop-Rechencluster [1, 5]. Damit sind die Ska- banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Copyright is held by the author/owner(s). lierbarkeit auf große Datenmengen (2) und die horizonta- le Skalierbarkeit der Hardware gegeben (3). Da auf histo- rische Daten ausschließlich lesend zugegriffen wird, bietet sich der MapReduce Ansatz geradezu an. Zudem erlaubt Hadoop das Verarbeiten unstrukturierter, heterogener Da- ten (1). Der Entwurf einer eigenen Anfragesprache (4) stellt dabei eine spannende und konzeptionelle Herausforderung dar, weil hierfür ein tiefes Verständnis für die Fragestellun- gen der Nutzer erforderlich ist. Struktur. Die folgenden Kapitel liefern Details zu unserem Vorhaben. In Kapitel 2 beschreiben wir eine konkrete Frage- stellung bei der Standortfindung von Windkraftwerken. In Kapitel 3 stellen wir unsere Lösung als MapReduce Algorith- mus dar. Kapitel 4 skizziert unsere Infrastruktur. Im 6. Ka- pitel wird auf verwandte Arbeiten eingegangen. Das letzte Abbildung 1: Aussagen über die Leistung in Abhän- Kapitel gibt eine Zusammenfassung unserer Arbeit und zeigt gigkeit zur Windgeschwindigkeit (aus [9]). deren Perspektive auf. 2. WINDPOTENTIALANALYSE Ein aktuelles Forschungsprojekt der Hochschule Regens- burg beschäftigt sich mit der Potentialanalyse von Wind- kraftanlagen. Hier werden die wirtschaftlichen Aspekte, die für das Errichten neuer Windkraftanlagen entscheidend sind, untersucht. Mithilfe der prognostizierten Volllaststunden ei- ner Windkraftanlage kann eine Aussage über die Rentabi- lität getroffen werden. Diese ist bestimmt durch die Leis- tungskennlinie der Windkraftanlage und letztlich durch die zu erwartenden Windgeschwindigkeiten. Abbildung 1 (aus [9]) skizziert die spezifische Leistungs- kennlinie einer Windkraftanlage in vier Phasen: I) Erst ab einer gewissen Windgeschwindigkeit beginnt die Anlage Strom zu produzieren. 
II) Die Leistung steigt über den wichtigsten Arbeitsbereich in der dritten Potenz zur Windgeschwindigkeit an, bis die Nennleistung der Anlage erreicht ist.

III) Die Ausgangsleistung wird auf die Nennleistung der Anlage begrenzt. Ausschlaggebend für die Höhe der Nennleistung ist die Auslegungsgröße des Generators.

IV) Die Windkraftanlage schaltet sich bei zu hohen Windgeschwindigkeiten ab, um eine mechanische Überbelastung zu verhindern.

Wie Abbildung 1 verdeutlicht, ist zum Errechnen der abgegebenen Arbeit einer Windkraftanlage eine genaue Kenntnis der stochastischen Verteilung der Windgeschwindigkeit notwendig. (Wir verwenden die Begriffe Windgeschwindigkeit und Windstärke synonym. Streng genommen wird die Windgeschwindigkeit als Vektor dargestellt, während die Windstärke als skalare Größe erfasst wird. Dabei kann die Windstärke aus der Windgeschwindigkeit errechnet werden.) Mithilfe entsprechender Histogramme können somit potentielle Standorte für neue Windkraftanlagen verglichen und Anlagen mit geeigneter Leistungskennlinie passend für den spezifischen Standort ausgewählt werden.

Als Datenbasis eignen sich etwa die Wetterdaten des Forschungsinstituts National Center for Atmospheric Research [12] und die des Deutschen Wetterdienstes [7].

Insbesondere im Binnenland ist eine hohe räumliche Auflösung der meteorologischen Daten wichtig. Aufgrund der Orographie variieren die Windgeschwindigkeiten schon bei kurzen Distanzen stark.

Abbildung 2 skizziert die resultierende Aufgabenstellung: Geographische Flächen werden kleinräumig unterteilt, was die Abbildung aus Gründen der Anschaulichkeit stark vereinfacht darstellt. Für jeden Quadranten, bestimmt durch Längen- und Breitengrad, interessiert die Häufigkeitsverteilung der Windstärken (dargestellt als Histogramm).

Abbildung 2: Histogramme über die Windstärkeverteilung.

Je nach Fragestellung wird von unterschiedlichen Zeiträumen und unterschiedlicher Granularität der Quadranten ausgegangen. Aufgrund der schieren Größe der Datenbasis ist hier ein massiv paralleler Rechenansatz gefordert, wenn über eine Vielzahl von Quadranten hinweg Histogramme berechnet werden sollen.
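Zur Veranschaulichung, wie aus einem solchen Histogramm und einer Leistungskennlinie eine Ertragsabschätzung entsteht, skizziert der folgende Python-Ausschnitt die Rechnung auf erfundenen Beispielwerten; die Kennlinie, die Klassen und alle Zahlen sind reine Annahmen zur Illustration und stammen nicht aus dem Projekt.

```python
# Skizze (Annahme, keine Projektdaten): erwartete Jahresenergie einer Anlage
# aus dem Windstärke-Histogramm eines Quadranten und einer Leistungskennlinie.

# Histogramm: Windstärke (m/s, Klassenmitte) -> relative Häufigkeit
histogramm = {3: 0.25, 5: 0.30, 7: 0.25, 9: 0.15, 12: 0.05}

def leistung_kw(v, nennleistung_kw=2000.0):
    """Stark vereinfachte Kennlinie mit den vier Phasen aus Abbildung 1."""
    if v < 3:               # Phase I: unterhalb der Anlaufgeschwindigkeit
        return 0.0
    if v < 12:              # Phase II: Leistung ~ dritte Potenz der Windgeschwindigkeit
        return min(nennleistung_kw, 1.2 * v ** 3)
    if v < 25:              # Phase III: Begrenzung auf die Nennleistung
        return nennleistung_kw
    return 0.0              # Phase IV: Abschaltung bei zu hohen Windgeschwindigkeiten

stunden_pro_jahr = 8760
energie_kwh = sum(p * leistung_kw(v) * stunden_pro_jahr for v, p in histogramm.items())
volllaststunden = energie_kwh / 2000.0
print(round(energie_kwh), round(volllaststunden))
```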
3. MASSIV PARALLELE HISTOGRAMMBERECHNUNG
Im Folgenden stellen wir einen MapReduce Algorithmus zur parallelen Berechnung von Windgeschwindigkeitsverteilungen vor. Wir betrachten dabei die Plattform Apache Hadoop [1], eine quelloffene MapReduce Implementierung [5]. Hadoop ist dafür ausgelegt, mit großen Datenmengen umzugehen. Ein intuitives Programmierparadigma erlaubt es, massiv parallele Datenverarbeitungsschritte zu spezifizieren. Die Plattform partitioniert die Eingabe in kleinere Datenblöcke und verteilt diese redundant auf dem Hadoop Distributed File System [15]. Dadurch wird eine hohe Datensicherheit gewährleistet. Als logische Basiseinheit arbeitet Hadoop mit einfachen Schlüssel/Werte-Paaren. Somit können selbst unstrukturierte oder nur schwach strukturierte Daten ad hoc verarbeitet werden.

MapReduce Programme werden in drei Phasen ausgeführt:
1. In der ersten Phase wird auf den partitionierten Eingabedaten eine Map-Funktion parallel ausgeführt. Diese Map-Funktion transformiert einfache Schlüssel/Werte-Paare in eine Liste von neuen Schlüssel/Werte-Paaren.
2. Die anschließende Shuffle-Phase verteilt die entstandenen Tupel so um, dass nun alle Paare mit dem gleichen Schlüssel an demselben Rechner vorliegen.
3. Die Reduce-Phase berechnet meist eine Aggregatfunktion auf allen Tupeln mit demselben Schlüssel.

Die Signaturen der Map- und Reduce-Funktion werden üblicherweise wie folgt beschrieben [11]:
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(k3, v3)

Wir erläutern nun unseren MapReduce Algorithmus zum Erstellen von Histogrammen der Windgeschwindigkeitsverteilungen. Im Sinne einer anschaulichen Darstellung abstrahieren wir von dem tatsächlichen Eingabeformat und beschränken uns auf nur eine Datenquelle. Die Eingabetupel enthalten einen Zeitstempel, den Längen- und Breitengrad als Ortsangabe und diverse Messwerte.

Wir nehmen vereinfachend an, dass die Ortsangabe bereits in eine Quadranten-ID übersetzt ist. Diese Vereinfachung erlaubt eine übersichtlichere Darstellung, gleichzeitig ist die Klassifikation der Datensätze nach Quadranten einfach umzusetzen. Zudem ignorieren wir alle Messwerte bis auf die Windstärke. Tabelle 1 zeigt exemplarisch einige Datensätze, die wir in unserem laufenden Beispiel verarbeiten. Wir betonen an dieser Stelle, dass diese vereinfachenden Annahmen nur der Anschaulichkeit dienen und keine Einschränkung unseres Systems darstellen.

  Quadrant q | Windstärke ws
  2          | 0
  3          | 7
  4          | 9
  1          | 3
  ...        | ...

Tabelle 1: Tabellarisch dargestellte Eingabedaten.

Wir schalten zwei MapReduce-Sequenzen in Reihe:
• Die erste Sequenz ermittelt, wie oft in einem Quadranten eine konkrete Windstärke aufgetreten ist.
• Die zweite Sequenz fasst die berechneten Tupel zu Histogrammen zusammen.
Durch das Aneinanderreihen von MapReduce-Sequenzen werden ganz im Sinne des Prinzips „teile und herrsche“ mit sehr einfachen und gut parallelisierbaren Rechenschritten komplexe Transformationen spezifiziert. Wir betrachten nun beide Sequenzen im Detail.

3.1 Sequenz 1: Absolute Häufigkeiten
Die erste Sequenz erinnert an das „WordCount“-Beispiel, das klassische Einsteigerbeispiel für MapReduce Programmierung [11]. Die Eingabe der Map-Funktion ist der Name einer Datei und deren Inhalt, nämlich eine Liste der Quadranten und der darin gemessenen Windstärke. Wir nehmen an, dass nur eine ausgewählte Menge von Quadranten Q interessiert, etwa um mögliche Standorte von Windkraftanlagen im Regensburger Raum zu untersuchen.

Abbildung 5 zeigt die Map-Funktion in Pseudocode. Die Anweisung emit produziert ein neues Ausgabetupel. In der Shuffle-Phase werden die produzierten Tupel nach der Schlüsselkomponente aus Quadrant und Windstärke umverteilt. Die Reduce-Funktion in Abbildung 6 berechnet nun die Häufigkeit der einzelnen Windstärkewerte pro Quadrant.

  def map(Datei d, Liste L):
    foreach (q, ws) in L do
      if (q ∈ Q)
        int count = 1;
        emit((q, ws), count);
      fi
    od

Abbildung 5: Map-Funktion der ersten Sequenz.

  def reduce((Quadrant q, Windstärke ws), Liste L):
    int total = 0;
    foreach count in L do
      total += count;
    od
    emit(q, (ws, total));

Abbildung 6: Reduce-Funktion der ersten Sequenz.

Beispiel 1. Abbildung 3 visualisiert die erste Sequenz anhand konkreter Eingabedaten. Die Map-Funktion selektiert nur Tupel aus den Quadranten 1 und 2 (d.h. Q = {1, 2}). Die Shuffle-Phase reorganisiert die Tupel so, dass anschließend alle Tupel mit den gleichen Werten für Quadrant und Windstärke bei demselben Rechner vorliegen. Hadoop fasst dabei die count-Werte bereits zu einer Liste zusammen. Die Reduce-Funktion produziert daraus Tupel mit dem Quadranten als Schlüssel. Der Wert setzt sich aus der Windstärke und ihrer absoluten Häufigkeit zusammen. □

Abbildung 3: Erste MapReduce-Sequenz zur Berechnung der absoluten Häufigkeiten.
4. ARCHITECTURE DESCRIPTION

Our vision of a Cloud-based spatial decision support system for the questions raised by the Energiewende rests firmly on MapReduce technology. The project goes beyond the mere use of cluster computing, however: the goal is the design of a domain-specific query language WEnQL, the "Wetter und Energie Query Language". It is being developed in interdisciplinary collaboration with the research institute Regensburg School of Energy and Resources. With it, even MapReduce laypersons should be able to run algorithms on the cluster.

Figure 7 illustrates the vision: our users formulate a declarative query in WenQL, for instance to compute the wind speed histograms of a region. A dedicated compiler translates the WenQL query into the established query format Apache Pig [8, 13], which in turn is translated to Java. The Java program is then compiled and executed on the Hadoop cluster. Translating in two steps has the advantage that the program in the intermediate language Pig can be analyzed by experts with respect to performance and correctness, whereas Hadoop laypersons do not need to concern themselves with these internals. We are currently compiling a catalogue of concrete questions from the energy industry in order to identify common query building blocks for WenQL.

[Figure 7: Architecture of the Cloud-based spatial decision support system.]

5. RELATED WORK

Our work touches on several research areas of computer science. Many research projects concerned with smart grid technology rely on distributed systems to process their data [4, 17]. As in our project, this decision is driven by the large data volumes, which originate from different sources: weather influences on power plants, electricity exchange prices, the load of power grids, and the consumption behaviour of millions of users have to be compared. Smart grid analyses differ from our project in that we only work on historical data and therefore impose no real-time requirements on the system.

Questions such as site selection and zoning have a long tradition in the development of dedicated spatial decision support systems [6]. Their architectures usually rest on relational databases for data management. Faced with the challenge of scaling to Big Data and of working with heterogeneous data sources, NoSQL systems such as Hadoop and the Hadoop file system offer clear advantages in data management and query processing.

The definition of declarative query languages for MapReduce algorithms is a very active research area. Most relevant for us are SQL-like query languages, as realized for instance in the Hive project [2, 16]. However, our users typically do not know SQL. We therefore plan to define our own query language that is as intuitive as possible for our users to learn.
6. SUMMARY AND OUTLOOK

There is a clear need for a new, Big-Data-capable generation of spatial decision support systems for the diverse questions of the Energiewende. In this paper we present our vision of such a Cloud-based spatial decision support system. Using wind potential analysis as an example, we show that MapReduce algorithms are suitable for strategic questions in energy research.

We are confident that we will be able to support a broad range of essential decisions. One continuation of our case study is the alignment of wind turbines within a wind farm. Here the dominant wind direction is decisive in order to position the turbines favourably relative to each other and to the main wind direction. A single wind turbine can rotate its nacelle by 360° to turn the rotors into the wind, but the arrangement of the towers within the farm is fixed. With an unfavourable tower layout, wake effects can therefore permanently reduce the yield. Figure 8 (from [10]) visualizes wind speed and wind direction as a basis for such decisions. Our MapReduce algorithm from Section 3 can be extended accordingly.

[Figure 8: Information on wind speed and wind direction (from [10]).]

Beyond that, we are currently investigating the siting of solar plants and, more complex still, the strategic use of energy storage, for instance to bridge calms or night phases.

With the capabilities of our future decision support system, namely scalability to very large data volumes, flexible handling of heterogeneous data formats, and not least a domain-specific query language, we want to contribute to a climate-friendly and sustainable energy supply.

7. ACKNOWLEDGEMENTS

This work is a project of the Regensburg School of Energy and Resources, an interdisciplinary institution of the Hochschule Regensburg and the Technologie- und Wissenschaftsnetzwerk Oberpfalz.

8. REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org/, 2013.
[2] Apache Hive. http://hive.apache.org/, 2013.
[3] O. Brückl. Meteorologische Grundlagen der Windenergienutzung. Vorlesungsskript Windenergie, Hochschule Regensburg, 2012.
[4] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
[6] P. J. Densham. Spatial decision support systems. Geographical Information Systems: Principles and Applications, 1:403–412, 1991.
[7] Deutscher Wetterdienst. http://www.dwd.de/, 2013.
[8] A. Gates. Programming Pig. O'Reilly Media, 2011.
[9] M. Kaltschmitt, W. Streicher, and A. Wiese. Erneuerbare Energien: Systemtechnik, Wirtschaftlichkeit, Umweltaspekte. Springer, 2006.
[10] Lakes Environmental Software. http://www.weblakes.com/, 2013.
[11] C. Lam. Hadoop in Action. Manning Publications, 2010.
[12] National Center for Atmospheric Research (NCAR). http://ncar.ucar.edu/, 2013.
[13] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110. ACM, 2008.
[14] Satel-Light. http://www.satel-light.com/, 2013.
[15] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, 2010.
[16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.
[17] D. Wang and L. Xiao. Storage and Query of Condition Monitoring Data in Smart Grid Based on Hadoop. In Computational and Information Sciences (ICCIS), 2012 Fourth International Conference, pages 377–380. IEEE, 2012.


Consistency Models for Cloud-based Online Games: the Storage System's Perspective

Ziqiang Diao
Otto-von-Guericke University Magdeburg
39106 Magdeburg, Germany
diao@iti.cs.uni-magdeburg.de

ABSTRACT

The existing architecture for massively multiplayer online role-playing games (MMORPG) based on RDBMS limits availability and scalability. With increasing numbers of players, the storage systems become bottlenecks. Although a Cloud-based architecture has the ability to solve these specific issues, the support for data consistency becomes a new open issue. In this paper, we analyze the data consistency requirements in MMORPGs from the storage system's point of view and highlight the drawbacks of Cassandra with respect to supporting game consistency. A timestamp-based solution is proposed to address this issue. Accordingly, we also present data replication strategies, concurrency control, and system reliability.

[Figure 1: Cloud-based Architecture of MMORPGs [4], showing clients, login/gateway/chat servers, zone servers (logic and map servers), an RDBMS as a Service for account data, HDFS/Cassandra for game data and log data, and an in-memory DB with data access servers in front of the Cloud storage system for state data.]

1. INTRODUCTION

In massively multiplayer online role-playing games (MMORPG), thousands of players can cooperate with other players in a virtual game world. Supporting such a huge game world means following often complex application logic and specific requirements; additionally, we have to bear the burden of managing large amounts of data. The root of the issue is that the existing architectures of MMORPGs use RDBMS to manage data, which limits availability and scalability.

Cloud data storage systems are designed for internet applications and are complementary to RDBMS: Cloud systems support system availability and scalability well, but not data consistency. In order to take advantage of these two types of storage systems, we have classified data in MMORPGs into four data sets according to typical data management requirements (e.g., data consistency, system availability, system scalability, data model, security, and real-time processing) in [4]: account data, game data, state data, and log data. We have then proposed to apply multiple data management systems (or services) in one MMORPG and to manage the diverse data sets accordingly. Data with strong requirements for consistency and security (e.g., account data) is still managed by an RDBMS, while data that requires scalability and availability (e.g., log data and state data) is stored in a Cloud data storage system (Cassandra, in this paper). Figure 1 shows the new architecture.
Unfortunately, there are still some open issues, such as the support of data consistency. According to the CAP theorem, in a partition-tolerant distributed system (e.g., an MMORPG) we have to sacrifice one of the two remaining properties: consistency or availability [5]. If an online game does not guarantee availability, players' requests may fail. If data is inconsistent, players may get data that does not conform to the game logic, which affects their operations. For this reason, we must analyze the data consistency requirements of MMORPGs so as to find a balance between data consistency and system availability.

Although there has been some research on consistency models for online games, it generally discusses consistency from the players' or the servers' point of view [15, 9, 11], which only relates to data synchronization among players. Other existing work does not treat the diverse data sets individually [3], or handles the issue based only on a rough classification of the data [16]. However, we believe the only efficient way to solve this issue is to analyze the consistency requirements of each data set from the storage system's perspective. Hence, we organize the rest of this paper as follows: in Section 2, we highlight the data consistency requirements of the four data sets. In Section 3, we discuss the data consistency issues of our Cloud-based architecture. We explain our timestamp-based solution in detail in Sections 4 to 6. We then point out some optimizations and our future work in Section 7. Finally, we summarize this paper in Section 8.

2. CONSISTENCY REQUIREMENTS OF DIVERSE DATA IN MMORPGS

Due to their different application scenarios, the four data sets have distinct data consistency requirements. For this reason, we need to apply different consistency models to fulfill them.

Account data is stored on the server side and is created, accessed, and deleted when players log in to or log out of a game. It includes a player's private data and other sensitive information (e.g., user ID, password, and recharge records). Inconsistent account data might cause trouble for a player as well as for the game provider, or even lead to an economic or legal dispute. Imagine the following two scenarios: a player has changed the password successfully, but when this player logs in to the game again, the new password is not effective; or a player has transferred money to the game account, or has spent money in the game, but the account balance is not properly reflected in the game system. Both cases would hurt the player's experience and might result in the loss of customers or in economic loss for the game company. Hence, we need to access account data under strong consistency guarantees and manage it with transactions. In a distributed database system, this means that each copy should hold the same view of the data value.
Game data, such as the world's appearance, the metadata (name, race, appearance, etc.) of NPCs (Non-Player Characters), system configuration files, and game rules, is used by players and the game engine throughout the entire game and can only be modified by game developers. Players are not as sensitive to game data as to account data. For example, a change in an NPC's appearance or name, the duration of a bird animation, or the game interface may not catch the players' attention and has no influence on their operations. As a result, strong consistency for game data does not seem necessary. On the other hand, some changes of the game data must be propagated to all online players synchronously, for instance changes to the game world's appearance, the power of a weapon or an NPC, game rules and scripts, and the occurrence frequency of an object during the game. Inconsistencies in these data will lead to errors in the game display and logic, unfair competition among players, or even a server failure. For this reason, we also need to take the consistency of game data seriously. Game data can be stored on both the server side and the client side, so we have to deal with both accordingly.

Game data on the client side can only synchronize with the servers when a player logs in to or starts a game. For this reason, causal consistency is required [8, 13]. In this paper, this means that when player A uses the client software or a browser to connect to the game server, the game server transmits the latest game data in the form of data packets to the client side of player A. The subsequent local access by player A is then able to return the updated value. Player B, who has not communicated with the game server, will still retain the outdated game data.

Although both the client side and the server side store the game data, only the game server holds the authoritative copy. Furthermore, players in different game worlds cannot communicate with each other. Therefore, we only need to ensure that the game data is consistent within one zone server, so that players in the same game world are treated equally. It is noteworthy that a zone server generally accesses data from one data center. Hence, we guarantee strong consistency within one data center and causal consistency among data centers. In other words, when game developers modify the game data, the updated value should be submitted synchronously to all replicas within the same data center and then propagated asynchronously across data centers.

State data, for instance the metadata of PCs (Player Characters) and the state (e.g., position, task, or inventory) of characters, is modified frequently by players during the game. Changes of state data must be perceived by all relevant players synchronously, so that players and NPCs can respond correctly and in time. An example for the necessity of data synchronization is that players cannot tolerate a dead character that continues to attack other characters. Note that players only access data from the in-memory database during the game. Hence, we need to ensure strong consistency in the in-memory database.

Another point about managing state data is that updated values must be backed up to the disk-resident database asynchronously. Game developers also need to take care of data consistency and durability in the disk-resident database; for instance, it is intolerable for a player to find that her/his last game record is lost when she/he starts the game again. In contrast to the in-memory database, we do not recommend ensuring strong consistency for state data there. The reason is as follows: according to the CAP theorem, a distributed database system can only satisfy two of the three desirable properties consistency, availability, and partition tolerance simultaneously. Certainly, we would like to satisfy both the consistency and the availability guarantee. However, in the case of a network partition or under high network latency, we have to sacrifice one of them. Obviously, we do not want all update operations to be blocked until the system recovers, which may lead to data loss. Consequently, the level of data consistency should be reduced. We propose to ensure a read-your-writes consistency guarantee [13]. In this paper, this means that once state data of player A has been persisted in the Cloud, a subsequent read request of player A will receive the updated values, yet other players (or the game engine) may only obtain an outdated version of it. From the storage system's perspective, as soon as a quorum of replicas has been updated successfully, the commit operation is considered complete. In this case, the storage system needs to provide a solution to return the up-to-date data to player A. We will discuss this in the next section.
Log data (e.g., player chat history and operation logs) is created by players but used by data analysts for the purpose of data mining. This data is sorted and cached on the server side during the game and then bulk-stored into the database, thereby reducing the conflict rate as well as the I/O workload and increasing the total throughput [2]. The management of log data has three features: log data is appended continually, and its value is not modified once it is written to the database; the replication of log data from thousands of players to multiple nodes significantly increases the network traffic and may even block the network; and log data is generally organized and analyzed only after a long time, so data analysts are only concerned with the continuous sequence of the data rather than its timeliness. Hence, data inconsistency is acceptable for a period of time. For these three reasons, a deadline-based consistency model, such as timed consistency, is more suitable for log data [12, 10]. In this paper, timed consistency specifically means that update operations are performed on a quorum of replicas instantaneously at time t, and the updated values are then propagated to all other replicas within a time bounded by t + Δ [10]. Additionally, to maintain the linear order of the log data, new values need to be sorted with the original values before being appended to a replica. In other words, we execute a sort-merge join by timestamp when two replicas are out of sync. Under the timed consistency guarantee, data analysts can at time t + Δ obtain a continuous sequence of the log data up to time t.

Table 1 summarizes the consistency requirements of the four data sets.

    Table 1: Consistency requirements

                   Account data   Game data                   State data                  Log data
    Modified by    Players        Game developers             Players                     Players
    Utilized by    Players &      Players & Game engine       Players & Game engine       Data analysts
                   Game engine
    Stored in      Cloud          Client side / Cloud         In-memory DB / Cloud        Cloud
    Data center    Across         n/a / Single & Across       Single / Across             Across
    Consistency    Strong         Causal / Strong (single     Strong / Read-your-writes   Timed
    model          consistency    DC), Causal (across DCs)    consistency                 consistency
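The sort-merge reconciliation of two out-of-sync log replicas mentioned above can be pictured with a minimal Python sketch. The record layout (timestamp, entry) and the function name are our own assumptions; the text only prescribes that replicas converge to one timestamp-ordered sequence within Δ.

    import heapq

    # Illustrative sketch: merge two timestamp-ordered log replicas into one
    # consistent, timestamp-ordered sequence, dropping duplicated entries.
    # The record layout (timestamp, entry) is an assumption for this example.

    def merge_log_replicas(replica_a, replica_b):
        merged = []
        for record in heapq.merge(replica_a, replica_b):   # both inputs already sorted
            if not merged or merged[-1] != record:          # skip entries both replicas hold
                merged.append(record)
        return merged

    if __name__ == "__main__":
        a = [(1, "login"), (4, "trade"), (9, "chat")]
        b = [(1, "login"), (6, "move"), (9, "chat")]
        print(merge_log_replicas(a, b))
        # [(1, 'login'), (4, 'trade'), (6, 'move'), (9, 'chat')]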
3. OPPORTUNITIES AND CHALLENGES

In our previous work, we have already presented the capability of the given Cloud-based architecture to support the corresponding consistency model for each data set in MMORPGs [4]. However, we have also pointed out that ensuring read-your-writes consistency for state data and timed consistency for log data efficiently in Cassandra is an open issue. In this section, we discuss this in detail.

By customizing the quorum of replicas that must respond to read and write operations, Cassandra provides tunable consistency, which is an inherent advantage for supporting MMORPGs [7, 4]. There are two reasons: first, as soon as a write request receives a quorum of responses, it completes successfully. In this case, although data in Cassandra is inconsistent, the response time of write operations is reduced, and availability as well as fault tolerance of the system is ensured. Second, a read request is sent to the closest replica, or routed to a quorum of replicas or to all replicas, according to the consistency requirement of the client. For example, if a write request is accepted by three (N, N > 0) of all five (M, M >= N) replicas, at least three replicas (M − N + 1) need to respond to a subsequent read request so that the up-to-date data can be returned. In this case, Cassandra can guarantee read-your-writes or strong consistency; otherwise, it can only guarantee timed or eventual consistency [7, 13]. Due to its support of tunable consistency, Cassandra has the potential to manage state data and log data of MMORPGs simultaneously, and it is more suitable than other Cloud storage systems that only provide either strong or eventual consistency guarantees.

On the other hand, Cassandra fails to implement tunable consistency efficiently with respect to MMORPG requirements. For example, M − N + 1 replicas of state data have to be compared in order to guarantee read-your-writes consistency. However, state data typically has hundreds of attributes, and the transmission and comparison of these replicas affect the read performance. If we instead update all replicas while executing write operations, data in Cassandra is consistent and we can obtain the up-to-date data directly from the closest replica. Unfortunately, this replication strategy significantly increases the network traffic as well as the response time of write operations, and it sacrifices system availability. As a result, implementing read-your-writes consistency efficiently becomes an open issue.

Another drawback is that Cassandra makes all replicas eventually consistent, which does not always match the application scenarios of MMORPGs and reduces the efficiency of the system. The reasons are as follows.

• Unnecessary for state data: state data of a PC is read by a player from the Cloud storage system only once during the game. Subsequent write operations do not depend on values in the Cloud any more. Hence, after the up-to-date data has been obtained from the Cloud, there is no necessity to ensure that all replicas reach a consensus on these values.

• Increased network traffic: Cassandra uses its Read Repair functionality to guarantee eventual consistency [1]. This means that all replicas have to be compared in the background while executing a read operation in order to return the up-to-date data to players, detect outdated data versions, and fix them. In MMORPGs, both state data and log data are large in scale and distributed over multiple data centers. Hence, transmitting these data across replicas will significantly increase the network traffic and affect the system performance.
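The quorum arithmetic behind tunable consistency can be checked with a few lines of Python. The sketch below only illustrates the counting argument used above (a read is guaranteed to see the latest write whenever the read set must overlap the write set, i.e. R >= M − N + 1); the helper names are ours, and nothing here is Cassandra API code.

    # Illustrative quorum check: with M replicas, a write acknowledged by N of
    # them, and a read that contacts R of them, the read is guaranteed to see
    # the latest write iff the two sets must overlap: R + N > M.

    def read_sees_latest_write(M: int, N: int, R: int) -> bool:
        return R + N > M

    if __name__ == "__main__":
        M, N = 5, 3                      # the example used in the text
        for R in range(1, M + 1):
            print(f"read from {R} of {M} replicas:",
                  "up-to-date guaranteed" if read_sees_latest_write(M, N, R)
                  else "possibly stale")
        # R = 3 (= M - N + 1) is the smallest read quorum that guarantees
        # read-your-writes consistency for this write quorum.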
4. A TIMESTAMP-BASED CONSISTENCY SOLUTION

A common method for solving the consistency problem of a Cloud storage system is to build an extra transaction layer on top of the system [6, 3, 14]. Similarly, we have proposed a timestamp-based solution especially for MMORPGs, which is designed around the features of Cassandra [4]. Cassandra records a timestamp in each column and uses it as a version identification (ID). We therefore record timestamps from a global server both on the server side and in the Cloud storage system. When we read state data from the Cloud, the timestamps recorded on the server side are sent along with the read request. In this way, we can find the most recent data easily. In the following sections, we introduce this solution in detail.

4.1 Data Access Server

Data access servers are responsible for the data exchange between the in-memory database and the Cloud storage system. They ensure the consistency of state data, maintain timestamp tables, and also play the role of global counters. In order to balance the workload and to prevent server failures, several data access servers run in parallel in one zone server. Data access servers need to synchronize their system clocks with each other automatically. However, a complete synchronization is not required; a time difference smaller than the data backup interval is acceptable.

An important component of the data access servers is the timestamp table, which stores the ID as well as the last modified time (LMT) of state data, and the log status (LS). If a character or an object in the game is active, its LS value is "login"; otherwise, the value is "logout". We use a hash function to map the IDs of state data to distinct timestamp tables, which are distributed and partitioned across the data access servers. It is noteworthy that the timestamp tables are partitioned and managed by the data access servers in parallel and that the data processing is simple, so accessing the timestamp tables will not become a bottleneck of the game system. Note that players can only interact with each other in the same game world, which is managed by one zone server, and a player cannot switch the zone server freely. Therefore, the data access servers as well as the timestamp tables of different zone servers are independent of each other.
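A minimal Python sketch of such a partitioned timestamp table is shown below. The hash-based partitioning and the record fields (ID, LMT, LS) follow the description above, but the concrete layout and the class names are our own illustrative assumptions, not the paper's implementation.

    # Illustrative sketch of partitioned timestamp tables: each data access
    # server owns one partition; a hash of the state-data ID selects the owner.
    # Field names (LMT = last modified time, LS = log status) follow the text.

    class TimestampTable:
        def __init__(self):
            self.rows = {}                               # id -> {"LMT": int, "LS": str}

        def update_lmt(self, state_id, ts):
            row = self.rows.setdefault(state_id, {"LMT": 0, "LS": "logout"})
            if ts > row["LMT"]:                          # only newer timestamps replace the LMT
                row["LMT"] = ts

        def set_status(self, state_id, status):
            self.rows.setdefault(state_id, {"LMT": 0, "LS": "logout"})["LS"] = status

        def lookup(self, state_id):
            return self.rows.get(state_id)

    class DataAccessLayer:
        def __init__(self, num_servers):
            self.partitions = [TimestampTable() for _ in range(num_servers)]

        def table_for(self, state_id):
            return self.partitions[hash(state_id) % len(self.partitions)]

    if __name__ == "__main__":
        layer = DataAccessLayer(num_servers=3)
        layer.table_for("PC-42").update_lmt("PC-42", ts=1001)
        layer.table_for("PC-42").set_status("PC-42", "login")
        print(layer.table_for("PC-42").lookup("PC-42"))  # {'LMT': 1001, 'LS': 'login'}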
4.2 Data Access

In this subsection, we discuss data access without considering data replication and concurrency conflicts.

Figure 2 shows the storage process for state data in the new Cloud-based architecture: the in-memory database takes a consistent snapshot periodically. Using the same hash function employed by the timestamp tables, each data access server obtains its corresponding state data from the snapshot periodically. In order to reduce the I/O workload of the Cloud, a data access server generates one message that includes all the state data it is responsible for as well as a new timestamp TS, and sends it to the Cloud storage system. In the Cloud, this message is split into several messages based on the IDs of the state data, each of which still includes TS. In this way, the update failure of one state data item does not block the submission of other state data. These messages are then routed to the appropriate nodes. When a node receives a message, it writes the changes immediately into the commit log, updates the data, and records TS as the version ID in each column. If an update is successful and TS is higher than the existing LMT of this state data, the data access server replaces the LMT with TS. Note that if a player has quit the game and the state data of the relevant PC has been backed up to the Cloud storage system, the LS of this PC must be changed from "login" to "logout", and the relevant state data in the in-memory database must be deleted.

[Figure 2: Executions of write operations. W(1) describes a general backup operation; W(2) shows the process of data persistence when a player quits the game.]

Data access servers obtain log data not from the in-memory database but from the client side. Log data is also updated in batches and receives its timestamp from a data access server. When a node in the Cloud receives log data, it inserts the log data into its value list according to the timestamp. However, the timestamp tables are not modified when such an update completes.

Figure 3 presents the execution of read operations. When a player starts the game, a data access server first obtains the LS information from the timestamp table. If the value is "login", the previous backup operation has not yet completed and the state data is still stored in the in-memory database. In this case, the player gets the state data directly from the in-memory database, and the data access server generates a new timestamp to replace the LMT of the relevant state data. If the value is "logout", the data access server gets the LMT and sends it together with a read request to the Cloud storage system. When the relevant node receives the request, it compares the LMT with its local version ID. If they match, the replica responds to the read request immediately. If they do not match, the read request is forwarded to other replicas (we discuss this in detail in Section 5). When the data access server receives the state data, it sends it to the in-memory database as well as to the relevant client sides, and changes the LS in the timestamp table from "logout" to "login". Note that state data may also be read by the game engine for the purpose of statistics. In this case, the up-to-date data is not necessary, so we do not need to compare the LMT with the version ID.

[Figure 3: Executions of read operations. PR(1) shows a general read request from the player; in the case of PR(2), the backup operation is not yet completed when the read request arrives; GER presents the execution of a read operation from the game engine.]

Data analysts also read data through the data access servers. If a read request contains a timestamp T, the Cloud storage system only returns log data up to T − Δ, because it only guarantees timed consistency for log data.
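The read path for state data can be summarized in a short Python sketch: the LMT from the timestamp table travels with the read request, and a replica answers only if its version ID matches, otherwise the request moves on to the next replica. The replica structures below are our own stand-ins for the Cloud storage nodes, not Cassandra code.

    # Illustrative read path for state data: the last modified time (LMT) from
    # the timestamp table is shipped with the read request; a replica may answer
    # only if its local version ID matches, otherwise the next replica is asked.

    def read_state(state_id, lmt, replicas, from_game_engine=False):
        """replicas: list of dicts mapping state_id -> (version_id, state_data)."""
        for replica in replicas:
            version_id, data = replica.get(state_id, (None, None))
            if from_game_engine:            # statistics queries accept possibly stale data
                return data
            if version_id == lmt:           # this replica already holds the backed-up version
                return data
        raise LookupError("no replica holds the expected version yet")

    if __name__ == "__main__":
        stale = {"PC-42": (990,  {"pos": (1, 1)})}
        fresh = {"PC-42": (1001, {"pos": (7, 3)})}
        print(read_state("PC-42", lmt=1001, replicas=[stale, fresh]))   # {'pos': (7, 3)}
        print(read_state("PC-42", lmt=1001, replicas=[stale], from_game_engine=True))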
4.3 Concurrency Control

Concurrency conflicts rarely appear in the storage layer of MMORPGs: the probability of read-write conflicts is low because only state data with a specific version ID (the same as its LMT) is read by players during the game, and a read request for log data does not return the up-to-date data. Moreover, a given data item is updated periodically by only one data access server at a time. Therefore, write-write conflicts occur only when a previous update has not completed for some reason, for example due to serious network latency or a node failure. Fortunately, we can resolve these conflicts easily by comparing timestamps. If two processes attempt to update the same state data, the process with the higher timestamp wins and the other process is canceled, because it is out of date. If two processes intend to update the same log data, the process with the lower timestamp wins and the other process enters the wait queue, because the values contained in both processes must be stored in the correct order.

5. DATA REPLICATION

Data in the Cloud typically has multiple replicas for the purpose of increasing data reliability and system availability and of balancing the node workload. On the other hand, data replication also increases the response time and the network traffic, which Cassandra does not handle well. In most of this section, we focus on resolving this contradiction according to the access characteristics of state data and log data.

5.1 Replication Strategies

Although state data is backed up periodically into the Cloud, only the last updated values are read when a player starts the game again. It is noteworthy that data loss in the server layer occurs infrequently. Therefore, we propose to synchronize only a quorum of replicas during the game, so that an update can complete efficiently and does not block subsequent updates. In addition, players usually start a game again only after a period of time, so the system has enough time to store the state data. For this reason, we propose to update all replicas synchronously when a player quits the game. As a result, the subsequent read operation can obtain the updated values quickly.

While using our replication strategies, a replica may contain outdated data when it receives a read request. By comparing the LMT carried by the read request with the version ID in the replica, this case can be detected easily. Contrary to the existing approach of Cassandra (which compares M − N + 1 replicas and uses Read Repair), only the read request is forwarded to other replicas until the latest values are found. In this way, the network traffic is not increased significantly, and the up-to-date data can still be found easily. However, if the read request comes from the game engine, the replica responds immediately. These strategies ensure that our Cloud-based architecture can manage state data under read-your-writes consistency guarantees.

Similar to state data, a write request for log data is also accepted by a quorum of replicas at first. However, the updated values must then be propagated to the other replicas asynchronously when the Cloud storage system is not busy and arranged in timestamp order within a predetermined time (Δ), which can be done with the help of the Anti-Entropy functionality of Cassandra [1]. In this way, the Cloud storage system guarantees timed consistency for log data.

5.2 Version Conflict Reconciliation

When the Cloud storage system detects a version conflict between two replicas, it proceeds as follows: for state data, the replica with the higher version ID wins, and the values of the other replica are replaced by the new values; for log data, the two replicas perform a sort-merge join by timestamp for the purpose of synchronization.
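The timestamp rules of Sections 4.3 and 5.2 can be condensed into a small Python sketch: state data is resolved last-writer-wins, log data is applied in ascending timestamp order. The data structures are illustrative assumptions; the paper specifies only the decision rules.

    # Illustrative conflict handling following the timestamp rules above:
    # - state data: the write with the higher timestamp wins (last writer wins);
    # - log data:   writes are applied in ascending timestamp order.

    def resolve_state_conflict(current, incoming):
        """current/incoming: (timestamp, state_dict); keep the newer version."""
        return incoming if incoming[0] > current[0] else current

    def apply_log_writes(pending_writes):
        """pending_writes: list of (timestamp, entry); lower timestamps go first."""
        return sorted(pending_writes)

    if __name__ == "__main__":
        print(resolve_state_conflict((1001, {"hp": 80}), (1005, {"hp": 55})))
        # (1005, {'hp': 55})  -> the out-of-date update with timestamp 1001 is discarded
        print(apply_log_writes([(9, "chat"), (4, "trade")]))
        # [(4, 'trade'), (9, 'chat')]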
ability, and balancing the node workload. On the other In-memory database failure: similarly, we can also apply hand, data replication increases the response time and the comment logs to handle this kind of failure so that there network traffic as well, which cannot be handled well by is no data loss. However, writing logs affects the real-time Cassandra. For most of this section, we focus on resolving response. Moreover, logs are useless when changes are per- this contradiction according to access features of state data sisted in the Cloud. Hence, we have to find a solution in our and log data. future work. Data access server failure: If all data access servers crash, 5.1 Replication Strategies the game can still keep running, whereas data cannot be Although state data is backed up periodically into the backed up to the Cloud until servers restart, and only play- Cloud, only the last updated values will be read when play- ers already in the game can continue to play; Data access ers start the game again. It is noteworthy that the data servers have the same functionality and their system clocks loss in the server layer occurs infrequently. Therefore, we are relatively synchronized, so if one server is down, any propose to synchronize only a quorum of replicas during the other servers can replace it. game, so that an update can complete effectively and won’t Timestamp table failure: We utilize the primary/secondary block the subsequent updates. In addition, players usually model and the synchronous replication mechanism to main- start a game again after a period of time, so the system has tain the reliability of timestamp tables. In the case of all enough time to store state data. For this reason, we propose replicas failure, we have to apply the original feature of Cas- to update all replicas synchronously when players quit the sandra to obtain the up-to-date data. In other words, M- N+1 replicas need to be compared. In this way, we can and V. Yushprakh. Megastore: Providing scalable, rebuild timestamp tables as well. highly available storage for interactive services. In Conference on Innovative Data Systems 7. OPTIMIZATION AND FUTURE WORK Research(CIDR), pages 223–234, Asilomar, California, USA, 2011. When a data access server updates state data in the Cloud, [3] S. Das, D. Agrawal, and A. E. Abbadi. G-store: a it propagates a snapshot of state data to multiple replicas. scalable data store for transactional multi key access in Note that state data has hundreds of attributes, so the trans- the cloud. In Symposium on Cloud Computing(SoCC), mission of a large volume of state data may block the net- pages 163–174, Indianapolis, Indiana, USA, 2010. work. Therefore, we proposed two optimization strategies in our previous work [4]: if only some less important at- [4] Z. Diao and E. Schallehn. Cloud Data Management tributes of the state (e.g., the position or orientation of a for Online Games : Potentials and Open Issues. In character) are modified, the backup can be skipped; Only Data Management in the Cloud (DMC), Magdeburg, the timestamp, ID, and the modified values are sent as mes- Germany, 2013. Accepted for publication. sages to the Cloud. However, in order to achieve the second [5] S. Gilbert and N. Lynch. Brewer’s conjecture and the optimization strategy, our proposed data access approach, feasibility of consistent, available, partition-tolerant data replication strategies, and concurrency control mech- web services. ACM Special Interest Group on anism have to be changed. 
However, in order to implement the second optimization strategy, our proposed data access approach, data replication strategies, and concurrency control mechanism have to be changed. For example, even during the game, updated values must then be accepted by all replicas, so that a subsequent read request does not need to compare M − N + 1 replicas. We will detail this adjustment in our future work.

It is noteworthy that a data access server stores a timestamp repeatedly in the timestamp table, which increases the workload. A possible optimization is as follows: if a batch write is successful, the data access server caches the timestamp (TS) of this write request. Accordingly, we add a new column to each row of the timestamp table to maintain a pointer. If a row is active (the LS value is "login"), the pointer refers to the memory location of TS; if not, it refers to the row's own LMT. When a row becomes inactive, it replaces its LMT with TS. In this way, the workload of a timestamp table is reduced significantly. However, the LMT and the version ID of state data may then become inconsistent due to a failure of the Cloud storage system or of the data access server.

8. CONCLUSIONS

Our Cloud-based architecture for MMORPGs copes successfully with the data management requirements regarding availability and scalability, while supporting data consistency remains an open issue. In this paper, we detailed our timestamp-based solution in theory, which will guide the implementation work in the future. We analyzed the data consistency requirements of each data set from the storage system's perspective and studied the mechanisms of Cassandra for guaranteeing tunable consistency. We found that Cassandra cannot ensure read-your-writes consistency for state data and timed consistency for log data efficiently. Hence, we proposed a timestamp-based solution to improve this, and explained our ideas for concurrency control, data replication strategies, and fault handling in detail. In our future work, we will implement our proposals and the optimization strategies.

9. ACKNOWLEDGEMENTS

Thanks to Eike Schallehn for his comments.

10. REFERENCES

[1] Apache. Cassandra, January 2013. http://cassandra.apache.org/.
[2] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Conference on Innovative Data Systems Research (CIDR), pages 223–234, Asilomar, California, USA, 2011.
[3] S. Das, D. Agrawal, and A. E. Abbadi. G-Store: a scalable data store for transactional multi key access in the cloud. In Symposium on Cloud Computing (SoCC), pages 163–174, Indianapolis, Indiana, USA, 2010.
[4] Z. Diao and E. Schallehn. Cloud Data Management for Online Games: Potentials and Open Issues. In Data Management in the Cloud (DMC), Magdeburg, Germany, 2013. Accepted for publication.
[5] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.
[6] F. Gropengießer, S. Baumann, and K.-U. Sattler. Cloudy transactions: cooperative XML authoring on Amazon S3. In Datenbanksysteme für Business, Technologie und Web (BTW), pages 307–326, Kaiserslautern, Germany, 2011.
[7] A. Lakshman. Cassandra - A Decentralized Structured Storage System. Operating Systems Review, 44(2):35–40, 2010.
[8] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[9] F. W. Li, L. W. Li, and R. W. Lau. Supporting continuous consistency in multiplayer online games. In ACM Multimedia 2004, pages 388–391, New York, New York, USA, 2004.
[10] H. Liu, M. Bowman, and F. Chang. Survey of state melding in virtual worlds. ACM Computing Surveys, 44(4):1–25, 2012.
[11] W. Palant, C. Griwodz, and P. Halvorsen. Consistency requirements in multiplayer online games. In Proceedings of the 5th Workshop on Network and System Support for Games, NETGAMES 2006, page 51, Singapore, 2006.
[12] F. J. Torres-Rojas, M. Ahamad, and M. Raynal. Timed consistency for shared distributed objects. In Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '99), pages 163–172, Atlanta, Georgia, USA, 1999.
[13] W. Vogels. Eventually consistent. ACM Queue, 6(6):14–19, 2008.
[14] Z. Wei, G. Pierre, and C.-H. Chi. Scalable Transactions for Web Applications in the Cloud. In 15th International Euro-Par Conference, pages 442–453, Delft, The Netherlands, 2009.
[15] K. Zhang and B. Kemme. Transaction Models for Massively Multiplayer Online Games. In 30th IEEE Symposium on Reliable Distributed Systems (SRDS 2011), pages 31–40, Madrid, Spain, 2011.
[16] K. Zhang, B. Kemme, and A. Denault. Persistence in massively multiplayer online games. In Proceedings of the 7th ACM SIGCOMM Workshop on Network and System Support for Games, NETGAMES 2008, pages 53–58, Worcester, Massachusetts, USA, 2008.


Analysis of DDoS Detection Systems

Michael Singhof
Heinrich-Heine-Universität
Institut für Informatik
Universitätsstraße 1
40225 Düsseldorf, Deutschland
singhof@cs.uni-duesseldorf.de

ABSTRACT

While there are plenty of papers describing algorithms for detecting distributed denial of service (DDoS) attacks, here an introduction to the considerations preceding such an implementation is given. Therefore, a brief history of and introduction to DDoS attacks is given, showing that these kinds of attacks are nearly two decades old. It is also shown that most algorithms used for the detection of DDoS attacks are outlier detection algorithms, such that intrusion detection can be seen as a part of the KDD research field. It is then pointed out that no well-known and up-to-date test cases for DDoS detection systems exist. To overcome this problem in a way that allows algorithms to be tested and results to be reproduced by others, we advise using a simulator for DDoS attacks. The challenge of detecting denial of service attacks in real time is addressed by presenting two recently published methods that try to solve the performance problem in very different ways. We compare both approaches and finally summarise the conclusions drawn from this, especially that methods concentrating on one network traffic parameter only are not able to detect all kinds of distributed denial of service attacks.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—Data Mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering, Information Filtering

Keywords

DDoS, Intrusion Detection, KDD, Security

1. INTRODUCTION

Denial of service (DoS) attacks are attacks that have the goal of making a network service unusable for its legitimate users. This can be achieved in different ways, either by targeting specific weaknesses in that service or by brute force approaches. A particularly well-known and dangerous kind of DoS attack are distributed denial of service attacks. These are more or less brute-force bandwidth DoS attacks carried out by multiple attackers simultaneously.

In general, there are two ways to detect any kind of network attack: signature-based approaches, in which the intrusion detection software compares network input to known attacks, and anomaly detection methods, in which the software is either trained with examples of normal traffic or not previously trained at all. Obviously, the latter method is more flexible, since normal network traffic does not change as quickly as attack methods. The algorithms used in this field are, essentially, known KDD methods for outlier detection, such as clustering algorithms, classification algorithms, or novelty detection algorithms on time series. However, in contrast to many other related tasks such as credit card fraud detection, network attack detection is highly time critical since attacks have to be detected in near real time. This makes finding suitable methods especially hard, because high precision is necessary, too, in order for an intrusion detection system to not cause more harm than it helps.

The main goal of this research project is to build a distributed denial of service detection system that can be used in today's networks and meets the demands formulated in the previous paragraph. In order to build such a system, many considerations have to be made. Some of these are presented in this work.

The remainder of this paper is structured as follows: in section 2 an introduction to distributed denial of service attacks and known countermeasures is given, and section 3 points out known test datasets. In section 4 some already existing approaches are presented, and finally section 5 concludes this work and gives insight into future work.

2. INTRODUCTION TO DDOS ATTACKS

Denial of service and distributed denial of service attacks are not a new threat in the internet. In [15] the first notable denial of service attack is dated to 1996, when the internet provider Panix was taken down for a week by a TCP SYN flood attack. The same article dates the first noteworthy distributed denial of service attack to the year 1997, when internet service providers in several countries as well as an IRC network were attacked by a teenager. Since then, many of the more elaborate attacks that worked well in the past have been successfully defused.

Let us, as an example, examine the TCP SYN flood attack. A TCP connection is established by a three-way handshake.
In order to build such a system, H.2.8 [Database Management]: Database Applications— many considerations have to be done. Some of these are Data Mining; H.3.3 [Information Storage and Retrieval]: presented in this work. Information Search and Retrieval—Clustering, Information The remainder of this paper is structured as follows: In filtering section 2 an introduction to distributed denial of service at- tacks and known countermeasures is given, section 3 points Keywords out known test datasets. In section 4 some already existing approaches are presented and finally section 5 concludes this DDoS, Intrusion Detection, KDD, Security work and gives insight in future work. 1. INTRODUCTION 2. INTRODUCTION TO DDOS ATTACKS Denial of service (DoS) attacks are attacks that have the goal of making a network service unusable for its legitimate Denial of service and distributed denial of service attacks users. This can be achieved in different ways, either by are not a new threat in the internet. In [15] the first notable denial of service attack is dated to 1996 when the internet provider Panix was taken down for a week by a TCP SYN flood attack. The same article dates the first noteworthy distributed denial of service attack to the year 1997 when internet service providers in several countries as well as an IRC network were attacked by a teenager. Since then, many of the more elaborate attacks that worked well in the past, have been successfully defused. 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Let us, as an example, examine the TCP SYN flood at- Copyright is held by the author/owner(s). tack. A TCP connection is established by a three way hand- shake. On getting a SYN request packet, in order to open a TCP connection, the addressed computer has to store some information on the incoming packet and then answers with a SYN ACK packet which is, on regularly opening a TCP connection, again replied by an ACK packet. The idea of the SYN flood attack is to cause a memory overrun on the victim by sending many TCP SYN packets. As for every such packet the victim has to store information while the attacker just generates new packets and ignores the victim’s answers. By this the whole available memory of the victim can be used up, thus disabling the victim to open le- gitimate connection to regular clients. As a countermeasure, Figure 1: Detection locations for DDoS attacks. in [7] SYN cookies were introduced. Here, instead of storing the information associated with the only half opened TCP connection in the local memory, that information is coded testing that allows users, among other functions, to volun- into the TCP sequence number. Since that number is re- tary join a botnet in order to carry out an attack. Since turned by regular clients on sending the last packet of the the tool is mainly for testing purposes, the queries are not already described three way handshake and initial sequence masqueraded so that it is easy to identify the participat- numbers can be arbitrarily chosen by each connection part- ing persons. Again, however, the initiator of the attack does ner, no changes on the TCP implementation of the client not necessarily have to have direct contact to the victim and side have to be made. Essentially, this reduces the SYN thus remains unknown. cookie attack to a mere bandwidth based attack. 
The same applies to many other attack methods that have been used successfully in the past, such as the smurf attack [1] or the fraggle attack. Both of these are so-called reflector attacks, which consist of sending an echo packet, ICMP echo in the case of the smurf attack and UDP echo in the case of the fraggle attack, to a network's broadcast address. The sender's address in this packet is forged so that the packet appears to have been sent by the victim of the attack, and all replies caused by the echo packet are therefore routed to the victim.

Thus, it seems that nowadays most denial of service attacks are distributed denial of service attacks trying to exhaust the victim's bandwidth. Examples of this are the attacks on Estonian government and business computers in 2007 [12]. As already mentioned, distributed denial of service attacks are denial of service attacks with several participating attackers. The number of participating computers can differ greatly, ranging from just a few machines to several thousand. Also, in most cases, the owners of these computers are not aware that they are part of an attack. This lies in the nature of most DDoS attacks, which consist of three steps:

1. Building or reusing a malware that is able to receive commands from the main attacker ("master") and to carry out the attack. A popular DDoS framework is Stacheldraht [9].

2. Distributing the software created in step one to create a botnet. This step can essentially be carried out with every known method of distributing malware, for example by forged mail attachments or by adding it to software such as pirate copies.

3. Launching the attack by giving the infected computers the command.

This procedure has, from the point of view of the main attacker, the advantage of not having to maintain a direct connection to the victim, which makes it very hard to track that person. It is notable, though, that during attacks attributed to Anonymous in the years 2010 and 2012 Low Orbit Ion Cannon [6] was used. This is originally a stress-testing tool that allows users, among other functions, to voluntarily join a botnet in order to carry out an attack. Since the tool is mainly intended for testing purposes, the queries are not masqueraded, so it is easy to identify the participating persons. Again, however, the initiator of the attack does not necessarily have to have direct contact with the victim and thus remains unknown.
A great diversity of approaches to the problem of detecting DDoS attacks exists. Note again that this work focuses on anomaly detection methods only, that is, methods that essentially use outlier detection to distinguish normal traffic from attack traffic. In a field with as many publications as intrusion detection and, even more specialised, DDoS detection, it is not surprising that many different approaches are used, most of which are common in other knowledge discovery research fields as well.

As can be seen in Figure 1, this research area can again be divided into three major categories, namely distributed detection or in-network detection, source end detection, and end point or victim end detection.

[Figure 1: Detection locations for DDoS attacks.]

By distributed detection approaches we denote all approaches that use more than one node in order to monitor the network traffic. This kind of solution is mostly aimed at internet providers, and sometimes cooperation between more than one or even all ISPs is expected. The main idea of almost all of these systems is to monitor the network traffic inside the backbone network. The monitors are mostly expected to be backbone routers that communicate the results of their monitoring either to a central instance or among each other. These systems allow an early detection of suspicious network traffic, so that an attack can be detected and disabled, by dropping the suspicious network packets, before it reaches the server the attack is aimed at. However, despite these methods being very powerful in theory, they suffer from the main disadvantage of not being deployable without the help of one or more ISPs. Currently, this makes these approaches impractical for end users since, to the knowledge of the author, at this moment no ISP uses such an approach.

Source end detection describes approaches that monitor outgoing attack streams. Of course, such methods can only be successful if the owner of an attacking computer is not aware of his computer participating in that attack. A wide deployment of such solutions is necessary for them to have an effect. If this happens, however, these methods have the chance not only to detect distributed denial of service attacks but also to prevent them by stopping the attacking traffic flows. In our opinion, however, the necessity of wide deployment makes a successful usage of these methods, at least in the near future, difficult.

In contrast to the approaches described above, end point detection describes those methods that rely on one host only. In general, this host can be either the same server other applications are running on or, in the case of small networks, a dedicated firewall. Clearly, these approaches suffer from one disadvantage: attacks cannot be detected before the attack packets arrive at their destination, as only those packets can be inspected. On the other hand, end point based methods allow individual deployment and can therefore be used today. Due to this fact, our work focuses on end point approaches.

3. TEST TRACES OF DISTRIBUTED DENIAL OF SERVICE ATTACKS

Today, testing DDoS detection methods is unfortunately not easy, as not many recordings of real or simulated DDoS attacks exist or, at least, are publicly available. The best known test trace is the KDD Cup 99 data set [3]; a detailed description of this data set is given in [18]. Other known datasets are the 1998 DARPA intrusion detection evaluation data set, described in [14], as well as the 1999 DARPA intrusion detection evaluation data set examined in [13].

In terms of the internet, with an age of 14 to 15 years, these data sets are rather old and therefore cannot reflect today's traffic volume and behaviour in a realistic fashion. Since testing with real distributed denial of service attacks is rather difficult on both a technical and a legal level, we suggest the usage of a DDoS simulator. In order to get a feeling for today's web traffic, we recorded a trace at the main web server of Heinrich-Heine-Universität. Tracing started on 17th September 2012 at eight o'clock local time and lasted until eight o'clock the next day.
This trace consists of 65612516 packets of IP traffic, with 31841 unique communication partners contacting the web server. As can be seen in Table 1, almost all of these packets are TCP traffic. This is not surprising, as the HTTP protocol uses the TCP protocol and web page requests are HTTP messages.

About one third of the TCP traffic is incoming traffic. This, too, is no surprise, as most clients send small request messages and, in return, get web pages that often include images or other larger data and thus consist of more than one packet. It can also be seen clearly that all of the UDP packets seem to be unwanted packets, as none of them is replied to; the low overall number of these packets is an indicator of this fact, too. For ICMP traffic, incoming and outgoing packet numbers are nearly the same, which lies in the nature of this message protocol.

    Packet type    No of packets    Percentage
    IP             65612516         100
    TCP            65295894         99.5174
    UDP            77               0.0001
    ICMP           316545           0.4824

    Protocol       Incoming traffic    Outgoing traffic
    IP             24363685            41248831
    TCP            24204679            41091215
    UDP            77                  0
    ICMP           158929              157616

    Table 1: Distribution of web traffic over protocol types, and incoming and outgoing traffic at the university's web server.

[Figure 2: Arrival times for the university's web server trace (number of packets over inter-packet arrival times, logarithmic y-axis).]

In order to overcome the problems with old traces, as a next step we are implementing a simulator for distributed denial of service attacks based on the characteristics of the web trace. As the results in [20] show, the network simulators OMNeT++ [19], ns-3 [10] and JiST [5] are, in terms of speed and memory usage, more or less equal. To keep the simulation from becoming either too slow or too inaccurate, it is intended to simulate the nearer neighbourhood of the victim server very accurately and, with greater distance to the victim, to simulate in less detail. In this context, the distance between two network nodes is given by the number of hops between the nodes.

The simulation results will then be compared with the aforementioned network trace to ensure realistic behaviour. Once the simulation of normal network traffic resembles the real traffic at the victim server closely enough, we will proceed by implementing distributed denial of service attacks in the simulator environment. With this simulator it will then, hopefully, be possible to test existing and new distributed denial of service detection approaches in greater detail than has been possible in the past.

4. EXISTING APPROACHES

Many approaches to the detection of distributed denial of service attacks already exist. As pointed out in section 1, in contrast to many other outlier and novelty detection applications in the KDD field, the detection of DDoS attacks is extremely time critical, hence near real-time detection is necessary.

Intuitively, the fewer parameters an approach observes, the faster it should work. Therefore, we first take a look at a recently published method that relies on one parameter only.
4.2 Protocol Type Specific DDoS Detection

In [11] another approach is presented: instead of using the same methods on all types of packets, different procedures are used for different protocol types. This is due to the fact that different protocols show different behaviour. Especially TCP traffic behaviour differs from UDP and ICMP traffic because of its flow control features. By this the authors try to minimise the feature set characterising distributed denial of service attacks for every protocol type separately, such that computation time is minimised, too.

The proposed detection scheme is described as a four step approach, as shown in Figure 4. Here, the first step is the preprocessing, where all features of the raw network traffic are extracted. Then packets are forwarded to the corresponding modules based on the packet's protocol type.

[Figure 4: Protocol specific DDoS detection architecture as proposed in [11].]

The next step is the protocol specific feature selection. Here, per protocol type, the most significant features are selected. This is done by using the linear correlation based feature selection (LCFS) algorithm that has been introduced in [4], which essentially ranks the given features by their correlation coefficients given by
$$\mathrm{corr}(X, Y) := \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
for two random variables X, Y with values xi, yi, 1 ≤ i ≤ n, respectively. A pseudo code version of LCFS is given in Algorithm 1.

Algorithm 1: LCFS algorithm based on [11].
Require: the initial set of all features I, the class-outputs y, the desired number of features n
Ensure: the dimension-reduced subset F ⊂ I
 1: for all f_i ∈ I do
 2:   compute corr(f_i, y)
 3: end for
 4: f := arg max_{f_i ∈ I} corr(f_i, y)
 5: F := {f}
 6: I := I \ {f}
 7: while |F| < n do
 8:   f := arg max_{f_i ∈ I} ( corr(f_i, y) − (1/|F|) Σ_{f_j ∈ F} corr(f_i, f_j) )
 9:   F := F ∪ {f}
10:   I := I \ {f}
11: end while
12: return F

As can be seen there, the number of features in the reduced set must be given by the user. This number characterises the trade-off between the precision of the detection and the detection speed.
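Algorithm 1 is compact enough to restate directly. The sketch below is a straightforward reading of the pseudocode and of the correlation coefficient defined above; the helper names and the dict-based feature representation are our own assumptions, not the interface used in [11].

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient corr(X, Y) as defined above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

def lcfs(features, y, n):
    """Linear correlation based feature selection (Algorithm 1).

    features -- dict mapping feature name to its list of values
    y        -- list of class outputs (e.g. 0 = normal, 1 = DDoS)
    n        -- desired number of features
    Returns the names of the selected features in selection order.
    """
    remaining = dict(features)
    # start with the feature that correlates best with the class output
    first = max(remaining, key=lambda f: pearson(remaining[f], y))
    selected = [first]
    del remaining[first]
    while len(selected) < n and remaining:
        def merit(f):
            # reward class correlation, penalise redundancy with already selected features
            redundancy = sum(pearson(remaining[f], features[g]) for g in selected)
            return pearson(remaining[f], y) - redundancy / len(selected)
        best = max(remaining, key=merit)
        selected.append(best)
        del remaining[best]
    return selected
```

In the protocol specific setting of [11], such a selection would be run once per protocol module on that module's own feature set.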
The third step is the classification of the instances as either normal traffic or DDoS traffic. The classification is trained on the reduced feature set generated in the previous step. The authors tested different well known classification techniques and established C4.5 [16] as the method working best in this case.

Finally, the outputs of the classifiers are given to the merger to be able to report warnings over one alarm generation interface instead of three. The authors mention that there is a check for false positives in the merger, too. However, there is no further information given on how this check works, apart from the fact that it is relatively slow.

The presented experiments have been carried out on the aforementioned KDD Cup data set as well as on two self-made data sets for which the authors attacked a server within the university's campus. The presented results show that on all data sets the DDoS detection accuracy varies in the range of 99.683% to 99.986% if all of the traffic's attributes are used. When reduced to three or five attributes, accuracy stays high with a DDoS detection of 99.481% to 99.972%. At the same time, the computation time shrinks by a factor of two, leading to a per-instance computation time of 0.0116 ms (three attributes) on the KDD Cup data set and 0.0108 ms (three attributes) and 0.0163 ms (five attributes) on the self-recorded data sets of the authors.

Taking into account the 53568 packets in a four second interval we recorded, the computation time during this interval would be about 53568 · 0.0163 ms ≈ 0.87 seconds. However, there is no information about the machine that carried out the computations given in the paper, such that this number appears to be rather meaningless. If we suppose a fast machine with no additional tasks, this computation time would be relatively high.

Nevertheless, the results presented in the paper at hand are promising enough to consider a future re-evaluation on a known machine with our recorded trace and simulated DDoS attacks.

5. CONCLUSION

We have seen that distributed denial of service attacks are, in comparison to the age of the internet itself, a relatively old threat. Against many of the more sophisticated attacks specialised counter measures exist, such as TCP SYN cookies to prevent the dangers of SYN flooding. Thus, most DDoS attacks nowadays are pure bandwidth or brute force attacks, and attack detection should focus on these types of attacks, making outlier detection techniques the method of choice. Still, since many DDoS toolkits such as Stacheldraht allow for attacks like SYN flooding, properties of these attacks can still indicate an ongoing attack.

Also, although much research in the field of DDoS detection has been done during the last two decades, leading to a nearly equally large number of possible solutions, we have seen in section 3 that one of the biggest problems is the unavailability of recent test traces or of a simulator able to produce such traces. With the best known test series being fourteen years old today, the results presented in many of the research papers on this topic are difficult to compare and confirm.

Even if one can rate the suitability of certain approaches with respect to detecting certain kinds of attacks, as seen in section 4, a definite judgement of the given methods is not easy. We therefore, before starting to implement an own approach to distributed denial of service detection, want to overcome this problem by implementing a DDoS simulator.

With the help of this tool, we will subsequently be able to compare existing approaches among each other and to our own ideas in a fashion reproducible for others.

6. REFERENCES
[1] CERT CC. Smurf Attack. http://www.cert.org/advisories/CA-1998-01.html.
[2] The Homepage of Tcpdump and Libpcap. http://www.tcpdump.org/.
[3] KDD Cup Dataset. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
[4] F. Amiri, M. Rezaei Yousefi, C. Lucas, A. Shakery, and N. Yazdani. Mutual Information-based Feature Selection for Intrusion Detection Systems. Journal of Network and Computer Applications, 34(4):1184–1199, 2011.
[5] R. Barr, Z. J. Haas, and R. van Renesse. JiST: An Efficient Approach to Simulation Using Virtual Machines. Software: Practice and Experience, 35(6):539–576, 2005.
[6] A. M. Batishchev. Low Orbit Ion Cannon. http://sourceforge.net/projects/loic/.
[7] D. Bernstein and E. Schenk. TCP SYN Cookies. On-line journal, http://cr.yp.to/syncookies.html, 1996.
[8] K. A. Chrysafis and B. K. Papadopoulos. Cost–volume–profit Analysis Under Uncertainty: A Model with Fuzzy Estimators Based on Confidence Intervals. International Journal of Production Research, 47(21):5977–5999, 2009.
[9] D. Dittrich. The 'Stacheldraht' Distributed Denial of Service Attack Tool. http://staff.washington.edu/dittrich/misc/stacheldraht.analysis, 1999.
[10] T. Henderson. ns-3 Overview. http://www.nsnam.org/docs/ns-3-overview.pdf, May 2011.
[11] H. J. Kashyap and D. Bhattacharyya. A DDoS Attack Detection Mechanism Based on Protocol Specific Traffic Features. In Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology, pages 194–200. ACM, 2012.
[12] M. Lesk. The New Front Line: Estonia under Cyberassault. Security & Privacy, IEEE, 5(4):76–79, 2007.
[13] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das. The 1999 DARPA Off-line Intrusion Detection Evaluation. Computer Networks, 34(4):579–595, 2000.
[14] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, et al. Evaluating Intrusion Detection Systems: The 1998 DARPA Off-line Intrusion Detection Evaluation. In DARPA Information Survivability Conference and Exposition, 2000. DISCEX'00. Proceedings, volume 2, pages 12–26. IEEE, 2000.
[15] G. Loukas and G. Öke. Protection Against Denial of Service Attacks: A Survey. The Computer Journal, 53(7):1020–1037, 2010.
[16] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[17] S. N. Shiaeles, V. Katos, A. S. Karakos, and B. K. Papadopoulos. Real Time DDoS Detection Using Fuzzy Estimators. Computers & Security, 31(6):782–790, 2012.
[18] M. Tavallaee, E. Bagheri, W. Lu, and A.-A. Ghorbani. A Detailed Analysis of the KDD CUP 99 Data Set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications 2009, 2009.
[19] A. Varga and R. Hornig. An Overview of the OMNeT++ Simulation Environment. In Proceedings of the 1st International Conference on Simulation Tools and Techniques for Communications, Networks and Systems & Workshops, Simutools '08, pages 60:1–60:10, Brussels, Belgium, 2008. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering).
[20] E. Weingartner, H. vom Lehn, and K. Wehrle. A Performance Comparison of Recent Network Simulators. In Communications, 2009. ICC '09. IEEE International Conference on, pages 1–5, 2009.

A Conceptual Model for the XML Schema Evolution
Overview: Storing, Base-Model-Mapping and Visualization

Thomas Nösinger, Meike Klettke, Andreas Heuer
Database Research Group
University of Rostock, Germany
(tn, meike, ah)@informatik.uni-rostock.de

ABSTRACT
In this article the conceptual model EMX (Entity Model for XML-Schema) for dealing with the evolution of XML Schema (XSD) is introduced. The model is a simplified representation of an XSD, which hides the complexity of XSD and offers a graphical presentation. For this purpose a unique mapping is necessary, which is presented as well as further information about the visualization and the logical structure.

our conceptual model called EMX (Entity Model for XML-Schema). A further issue, not covered in this paper, but important in the overall context of exchanging information, is the validity of XML documents [5]. Modifications of an XML Schema require adaptions of all XML documents that are valid against the former XML Schema (also known as co-evolution). One unpractical way to handle this problem is to introduce
A small example illustrates the relation- different versions of an XML Schema, but in this case all ships between an XSD and an EMX. Finally, the integration versions have to be stored and every participant of the het- into a developed research prototype for dealing with the co- erogeneous environment has to understand all different doc- evolution of corresponding XML documents is presented. ument descriptions. An alternative solution is the evolution of the XML Schema, so that just one document description exists at one time. The above mentioned validity problem 1. INTRODUCTION of XML documents is not solved, but with the standardized The eXtensible Markup Language (XML) [2] is one of the description of the adaptions (e.g. a sequence of operations most popular formats for exchanging and storing structured [8]) and by knowing a conceptual model inclusively the cor- and semi-structured information in heterogeneous environ- responding mapping to the base-model (e.g. XSD), it is ments. To assure that well-defined XML documents can be possible to derive necessary XML document transformation understood by every participant (e.g. user or application) steps automatically out of the adaptions [7]. The conceptual it is necessary to introduce a document description, which model is an essential prerequisite for the here not in detail contains information about allowed structures, constraints, but incidentally handled process of the evolution of XML data types and so on. XML Schema [4] is one commonly used Schema. standard for dealing with this problem. An XML document This paper is organized as follows. Section 2 gives the is called valid, if it fulfills all restrictions and conditions of necessary background of XML Schema and corresponding an associated XML Schema. concepts. Section 3 presents our conceptual model by first XML Schema that have been used for years have to be giving a formal definition (3.1), followed by the specification modified from time to time. The main reason is that the of the unique mapping between EMX and XSD (3.2) and requirements for exchanged information can change. To the logical structure of the EMX (3.3). After introducing meet these requirements the schema has to be adapted, for the conceptual model we present an example of an EMX in example if additional elements are added into an existing section 4. In section 5 we describe the practical use of content model, the data type of information changed or in- EMX in our prototype, which was developed for handle the tegrity constraints are introduced. All in all every possi- co-evolution of XML Schema and XML documents. Finally ble structure of an XML Schema definition (XSD) can be in section 6 we draw our conclusions. changed. A question occurs: In which way can somebody make these adaptions without being coerced to understand 2. TECHNICAL BACKGROUND and deal with the whole complexity of an XSD? One solu- In this section we present a common notation used in the tion is the definition of a conceptual model for simplifying rest of the paper. At first we will shortly introduce the the base-model; in this paper we outline further details of abstract data model (ADM) and element information item (EII) of XML Schema, before further details concerning dif- ferent modeling styles are given. 
The XML Schema abstract data model consists of different components or node types1 , basically these are: type defi- nition components (simple and complex types), declaration components (elements and attributes), model group compo- nents, constraint components, group definition components 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- 1 banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. An XML Schema can be visualized as a directed graph with Copyright is held by the author/owner(s). different nodes (components); an edge realizes the hierarchy and annotation components [3]. Additionally the element against exchanged information change and the underlying information item exists, an XML representation of these schema has to be adapted then this modeling style is the components. The element information item defines which most suitable. The advantage of the Garden of Eden style content and attributes can be used in an XML Schema. Ta- is that all components can be easily identified by knowing ble 1 gives an overview about the most important compo- the QNAME (qualified name). Furthermore the position of nents and their concrete representation. The , components within an XML Schema is obvious. A qualified name is a colon separated string of the target namespace of ADM Element Information Item the XML Schema followed by the name of the declaration declarations , or definition. The name of a declaration and definition is group-definitions a string of the data type NCNAME (non-colonized name), model-groups , , , a string without colons. The Garden of Eden style is the , basic modeling style which is considered in this paper, a type-definitions , transformation between different styles is possible.2 N.N. , , 3. CONCEPTUAL MODEL , In [7] the three layer architecture for dealing with XML annotations Schema adaptions (i.e. the XML Schema evolution) was constraints , , , introduced and the correlations between them were men- , tioned. An overview is illustrated in figure 2. The first N.N. EMX Table 1: XML Schema Information Items EMX Operation EMX‘ , and are not explicitly 1 - 1 mapping given in the abstract data model (N.N. - Not Named), but Schema they are important components for embedding externally XML defined XML Schema (esp. element declarations, attribute XSD Operation XSD‘ declarations and type definitions). In the rest of the pa- 1 - * mapping per, we will summarize them under the node type ”module”. XML The ”is the document (root) element of any W3C Operation XML Schema. It’s both a container for all the declarations XML XML‘ and definitions of the schema and a place holder for a number of default values expressed as attributes” [9]. Analyzing the possibilities of specifying declarations and definitions leads Figure 2: Three Layer Architecture to four different modeling styles of XML Schema, these are: Russian Doll, Salami Slice, Venetian Blind and Garden of layer is our conceptual model EMX (Entity Model for XML- Eden [6]. These modeling styles influence mainly the re- Schema), a simplified representation of the second layer. usability of element declarations or defined data types and This layer is the XML Schema (XSD), where a unique map- also the flexibility of an XML Schema in general. Figure ping between these layers exists. The mapping is one of the 1 summarizes the modeling styles with their scopes. The main aspects of this paper (see section 3.2). The third layer are XML documents or instances, an ambiguous mapping between XSD and XML documents exists. 
It is ambiguous because of the optionality of structures (e.g. minOccurs = Garden of Eden Venetian Blind ’0’; use = ’optional’) or content types (e.g. ). The Russian Doll Salami Slice third layer and the mapping between layer two and three, as well as the operations for transforming the different layers are not covered in this paper (parts were published in [7]). Scope element and attribute local x x 3.1 Formal Definition declaration global x x The conceptual model EMX is a triplet of nodes (NM ), local x x directed edges between nodes (EM ) and features (FM ). type definition global x x EM X = (NM , EM , FM ) (1) Figure 1: XSD Modeling Styles according to [6] Nodes are separated in simple types (st), complex types (ct), elements, attribute-groups, groups (e.g. content model), an- scope of element and attribute declarations as well as the notations, constraints and modules (i.e. externally managed scope of type definitions is global iff the corresponding node XML Schemas). Every node has under consideration of the is specified as a child of the and can be referenced element information item of a corresponding XSD different (by knowing e.g. the name and namespace). Locally speci- attributes, e.g. an element node has a name, occurrence fied nodes are in contrast not directly under , the values, type information, etc. One of the most important re-usability is not given respectively not possible. 2 A student thesis to address the issue of converting different An XML Schema in the Garden of Eden style just con- modeling styles into each other is in progress at our profes- tains global declarations and definitions. If the requirements sorship; this is not covered in this paper. attributes of every node is the EID (EMX ID), a unique visualized by adding a ”blue W in a circle”, a similar be- identification value for referencing and localization of every haviour takes place if an attribute wildcard is given in an node; an EID is one-to-one in every EMX. The directed . edges are defined between nodes by using the EIDs, i.e. ev- The type-definitions are not directly visualized in an EMX. ery edge is a pair of EID values from a source to a tar- Simple types for example can be specified and afterwards be get. The direction defines the include property, which was referenced by elements or attributes 3 by using the EID of the specified under consideration of the possibilities of an XML corresponding EMX node. The complex type is also implic- Schema. For example if a model-group of the abstract data itly given, the type will be automatically derived from the model (i.e. an EMX group with ”EID = 1”) contains dif- structure of the EMX after finishing the modeling process. ferent elements (e.g. EID = {2,3}), then two edges exist: The XML Schema specification 1.1 has introduced different (1,2) and (1,3). In section 3.3 further details about allowed logical constraints, which are also integrated in the EMX. edges are specified (see also figure 5). The additional fea- These are the EIIs (for constraints on complex tures allow the user-specific setting of the overall process types) and . An is under consider- of co-evolution. It is not only possible to specify default ation of the specification a facet of a restricted simple type values but also to configure the general behaviour of opera- [4]. The last EII is , this ”root” is an EMX itself. tions (e.g. only capacity-preserving operations are allowed). 
This is also the reason why further information or properties Furthermore all XML Schema properties of the element in- of an XML Schema are stored in the additional features as formation item are included in the additional mentioned above. features. The additional features are not covered in this paper. 3.3 Logical Structure After introducing the conceptual model and specifying the 3.2 Mapping between XSD and EMX mapping between an EMX and XSD, in the following section An overview about the components of an XSD has been details about the logical structure (i.e. the storing model) given in table 1. In the following section the unique map- are given. Also details about the valid edges of an EMX are ping between these XSD components and the EMX nodes illustrated. Figure 3 gives an overview about the different introduced in section 3.1 is specified. Table 2 summarizes relations used as well as the relationships between them. the mapping. For every element information item (EII) the The logical structure is the direct consequence of the used EII EMX Node Visualization element Path Constraint ST_List Facet Annotation , attribute- group Attribute Assert Element ST Attribute , , group _Ref Element_ Attribute Attribute CT Schema Ref _Gr _Gr_Ref @ @ st implicit and @ specifiable Group Wildcard Module ct implicit and derived EMX visualized extern @ Attribute , module Relation parent_EIDparent_EID has_asame @ node in EMX Element , , Figure 3: Logical Structure of an EMX annotation modeling style Garden of Eden, e.g. elements are either , , constraint element declarations or element references. That’s why this separation is also made in the EMX. implicit in ct All in all there are 18 relations, which store the content of restriction in st an XML Schema and form the base of an EMX. The different the EMX itself nodes reference each other by using the well known foreign key constraints of relational databases. This is expressed by Table 2: Mapping and Visualization using the directed ”parent EID” arrows, e.g. the EMX nodes (”rectangle with thick line”) element, st, ct, attribute-group corresponding EMX node is given as well as the assigned vi- and modules reference the ”Schema” itself. If declarations sualization. For example an EMX node group represents the or definitions are externally defined then the ”parent EID” abstract data model (ADM) node model-group (see table 1). is the EID of the corresponding module (”blue arrow”). The This ADM node is visualized through the EII content mod- ”Schema” relation is an EMX respectively the root of an els , and , and the wildcards EMX as already mentioned above. and . In an EMX the visualization of a group is the blue ”triangle with a G” in it. Further- 3 The EII and are the same more if a group contains an element wildcard then this is in the EMX, an attribute-group is always a container The ”Annotation” relation can reference every other re- specified under consideration of the XML Schema specifica- lation according to the XML Schema specification. Wild- tion [4], e.g. an element declaration needs a ”name” and a cards are realized as an element wildcard, which belongs to type (”type EID” as a foreign key) as well as other optional a ”Group” (i.e. EII ), or they can be attribute wild- values like the final (”finalV”), default (”defaultV”), ”fixed”, cards which belongs to a ”CT” or ”Attribute Gr” (i.e. EII ”nillable”, XML Schema ”id” or ”form” value. Other EMX ). Every ”Element” relation (i.e. element specific attributes are also given, e.g. 
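Read as a data structure, the formal definition EMX = (NM, EM, FM) with its EID-based edges can be paraphrased in a few lines. The following sketch is only our illustration of section 3.1 under our own naming; it is not code from the CodeX prototype. The example at the end mirrors the group-with-two-elements example given for the edge definition, with element names borrowed from the later example.

```python
from dataclasses import dataclass, field

NODE_KINDS = {"element", "st", "ct", "attribute-group", "group",
              "annotation", "constraint", "module"}

@dataclass
class EMXNode:
    eid: int                    # unique EID within one EMX
    kind: str                   # one of NODE_KINDS
    attributes: dict = field(default_factory=dict)   # e.g. name, occurrence, type information

@dataclass
class EMX:
    nodes: dict = field(default_factory=dict)        # N_M: EID -> EMXNode
    edges: set = field(default_factory=set)          # E_M: directed pairs (source EID, target EID)
    features: dict = field(default_factory=dict)     # F_M: additional features of the overall process

    def add_node(self, node: EMXNode):
        assert node.kind in NODE_KINDS and node.eid not in self.nodes
        self.nodes[node.eid] = node

    def add_edge(self, source_eid: int, target_eid: int):
        # an edge is a pair of EIDs expressing the include property
        assert source_eid in self.nodes and target_eid in self.nodes
        self.edges.add((source_eid, target_eid))

# the example from section 3.1: a group (EID 1) containing two elements (EIDs 2 and 3)
emx = EMX()
emx.add_node(EMXNode(1, "group"))
emx.add_node(EMXNode(2, "element", {"name": "name"}))
emx.add_node(EMXNode(3, "element", {"name": "datum"}))
emx.add_edge(1, 2)
emx.add_edge(1, 3)
```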
the ”file ID” and the declaration) has either a simple type or a complex type, ”parent EID” (see figure 3). The element references have a and every ”Element Ref” relation has an element declara- ”ref EID”, which is a foreign key to a given element declara- tion. Attributes and attribute-groups are the same in an tion. Moreover attributes of the occurrence (”minOccurs”, EMX, as mentioned above. ”maxOccurs”), the ”position” in a content model and the Moreover figure 3 illustrates the distinction between visu- XML Schema ”id” are stored. Element references are visual- alized (”yellow border”) and not visualized relations. Under ized in an EMX. That’s why some values about the position consideration of table 2 six relations are direct visible in in an EMX are stored, i.e. the coordinates (”x Pos”, ”y Pos”) an EMX: constraints, annotations, modules, groups and be- and the ”width” and ”height” of an EMX node. The same cause of the Garden of Eden style element references and position attributes are given in every other visualized EMX attribute-group references. Table 3 summarizes which rela- node. tion of figure 3 belongs to which EMX node of table 2. The edges of the formal definition of an EMX can be de- rived by knowing the logical structure and the visualization EMX Node Relation of an EMX. Figure 5 illustrates the allowed edges of EMX element Element, Element Ref nodes. An edge is always a pair of EIDs, from a source attribute-group Attribute, Atttribute Ref, Attribute Gr, attribute-group Attribute Gr Ref source X annotation group Group, Wildcard constraint edge(X,Y) element schema module st ST, ST List, Facet group ct CT ct st annotation Annotation target Y constraint Contraint, Path, Assert element x x x module Module attribute-group x x x x group x x x Table 3: EMX Nodes with Logical Structure ct x x x st x x x x annotation x x x x x x x x The EMX node st (i.e. simple type) has three relations. constraint x x x These are the relation ”ST” for the most simple types, the re- module x lation ”ST List” for set free storing of simple union types and implicitly given the relation ”Facet” for storing facets of a simple restriction type. Constraints are realized through the relation ”Path” Figure 5: Allowed Edges of EMX Nodes for storing all used XPath statements for the element infor- mation items (EII) , and and (”X”) to a target (”Y”). For example it is possible to add the relation ”Constraint” for general properties e.g. name, an edge outgoing from an element node to an annotation, XML Schema id, visualization information, etc. Further- constraint, st or ct. A ”black cross” in the figure defines a more the relation ”Assert” is used for storing logical con- possible edge. If an EMX is visualized then not all EMX straints against complex types (i.e. EII ) and sim- nodes are explicitly given, e.g. the type-definitions of the ple types (i.e. EII ). Figure 4 illustrates the abstract data model (i.e. EMX nodes st, ct; see table 2). In this case the corresponding ”black cross” has to be moved element element_ref along the given ”yellow arrow”, i.e. an edge in an EMX be- PK EID PK EID tween a ct (source) and an attribute-group (target) is valid. name FK ref_EID If this EMX is visualized, then the attribute-group is shown FK type_EID minOccurs finalV maxOccurs as a child of the group which belongs to above mentioned defaultV position ct. Some information are just ”implicitly given” in a visu- fixed id nillable FK file_ID alization of an EMX (e.g. simple types). 
A ”yellow arrow” id form FK parent_EID width which starts and ends in the same field is a hint for an union FK file_ID height of different nodes into one node, e.g. if a group contains a FK parent_EID x_Pos y_Pos wildcard then in the visualization only the group node is visible (extended with the ”blue W”; see table 2). Figure 4: Relations of EMX Node element 4. EXAMPLE stored information concerning the EMX node element re- In section 3 the conceptual model EMX was introduced. spectively the relations ”Element” and ”Element Ref”. Both In the following section an example is given. Figure 6 il- relations have in common, that every tuple is identified by lustrates an XML Schema in the Garden of Eden modeling using the primary key EID. The EID is one-to-one in ev- style. An event is specified, which contains a place (”ort”) ery EMX as mentioned above. The other attributes are and an id (”event-id”). Furthermore the integration of other the connection without ”black rectangle”, the target is the other side. For example the given annotation is a child of the element ”event” and not the other way round; an element can never be a child of an annotation, neither in the XML Schema specification nor in the EMX. The logical structure of the EMX of figure 7 is illustrated in figure 8. The relations of the EMX nodes are given as well Schema EID xmlns_xs targetName TNPrefix 1 http://www.w3.org/2001/XMlSchema gvd2013.xsd eve Element Annotation EID name type_EID parent_EID EID parent_EID x_Pos y_Pos 2 event 14 1 10 2 50 100 3 name 11 1 Wildcard 4 datum 12 1 EID parent_EID 5 ort 13 1 17 14 Element_Ref EID ref_EID minOccurs maxOccurs parent_EID x_Pos y_Pos 6 2 1 1 1 75 75 event Figure 6: XML Schema in Garden of Eden Style 7 3 1 1 16 60 175 name 8 4 1 1 16 150 175 datum 9 5 1 1 15 100 125 ort attributes is possible, because of an attribute wildcard in ST CT the respective complex type. The place is a sequence of a EID name mode parent_EID EID name parent_EID name and a date (”datum”). 11 xs:string built-in 1 13 orttype 1 All type definitions (NCNAME s: ”orttype”, ”eventtype”) 12 xs:date built-in 1 14 eventtype 1 and declarations (NCNAME s: ”event”, ”name”, ”datum”, Group EID mode parent_EID x_Pos y_Pos ”ort” and the attribute ”event-id”) are globally specified. 15 sequence 14 125 100 eventsequence The target namespace is ”eve”, so the QNAME of e.g. the 16 sequence 13 100 150 ortsequence complex type definition ”orttype” is ”eve:orttype”. By using Attribute Attribute_Ref the QNAME every above mentioned definition and decla- EID name parent_EID EID ref_EID parent_EID ration can be referenced, so the re-usability of all compo- 18 event-id 1 19 18 14 nents is given. Furthermore an attribute wildcard is also Attribute_Gr Attribute_Gr_Ref specified, i.e. the complex type ”eventtype” contains apart EID parent_EID EID ref_EID parent_EID x_Pos y_Pos from the content model sequence and the attribute refer- 20 1 21 20 14 185 125 ence ”eve:event-id” the element information item . Figure 8: Logical Structure of Figure 7 Figure 7 is the corresponding EMX of the above specified XML Schema. The representation is an obvious simplifica- as the attributes and corresponding values relevant for the example. Next to every tuple of the relations ”Element Ref” and ”Group” small hints which tuples are defined are added (for increasing the readability). It is obvious that an EID has to be unique, this is a prerequisite for the logical struc- ture. 
An EID is created automatically, a user of the EMX can neither influence nor manipulate it. The element references contain information about the oc- currence (”minOccurs”, ”maxOccurs”), which are not explic- itly given in the XSD of figure 6. The XML Schema spec- ification defines default values in such cases. If an element reference does not specify the occurrence values then the standard value ”1” is used; an element reference is obliga- tory. These default values are also added automatically. Figure 7: EMX to XSD of Figure 6 The stored names of element declarations are NCNAME s, but by knowing the target namespace of the corresponding tion, it just contains eight well arranged EMX nodes. These schema (i.e. ”eve”) the QNAME can be derived. The name are the elements ”event”, ”ort”, ”name” and ”datum”, an an- of a type definition is also the NCNAME, but if e.g. a built- notation as a child of ”event”, the groups as a child under in type is specified then the name is the QNAME of the ”event” and ”ort”, as well as an attribute-group with wild- XML Schema specification (”xs:string”, ”xs:date”). card. The simple types of the element references ”name” and ”datum” are implicitly given and not visualized. The complex types can be derived by identifying the elements 5. PRACTICAL USE OF EMX which have no specified simple type but groups as a child The co-evolution of XML documents was already men- (i.e. ”event” and ”ort”). tioned in section 1. At the University of Rostock a research The edges are under consideration of figure 5 pairs of not prototype for dealing with this co-evolution was developed: visualized, internally defined EIDs. The source is the side of CodeX (Conceptual design and evolution for XML Schema) [5]. The idea behind it is simple and straightforward at the modeled XSD and an EMX, so it is possible to representa- same time: Take an XML Schema, transform it to the specif- tively adapt or modify the conceptual model instead of the ically developed conceptual model (EMX - Entity Model for XML Schema. XML-Schema), change the simplified conceptual model in- This article presents the formal definition of an EMX, all stead of dealing with the whole complexity of XML Schema, in all there are different nodes, which are connected by di- collect these changing information (i.e. the user interaction rected edges. Thereby the abstract data model and element with EMX) and use them to create automatically trans- information item of the XML Schema specification were con- formation steps for adapting the XML documents (by us- sidered, also the allowed edges are specified according to ing XSLT - Extensible Stylesheet Language Transformations the specification. In general the most important compo- [1]). The mapping between EMX and XSD is unique, so it is nents of an XSD are represented in an EMX, e.g. elements, possible to describe modifications not only on the EMX but attributes, simple types, complex types, annotations, con- also on the XSD. The transformation and logging language strains, model groups and group definitions. Furthermore ELaX (Evolution Language for XML-Schema [8]) is used to the logical structure is presented, which defines not only the unify the internally collected information as well as intro- underlying storing relations but also the relationships be- duce an interface for dealing directly with XML Schema. tween them. 
The visualization of an EMX is also defined: Figure 9 illustrates the component model of CodeX, firstly outgoing from 18 relations in the logical structure, there are published in [7] but now extended with the ELaX interface. eight EMX nodes in the conceptual model, from which six are visualized. Results Our conceptual model is an essential prerequisite for the prototype CodeX (Conceptual design and evolution for XML GUI Schema modifications ELaX Data supply Schema) as well as for the above mentioned co-evolution. A Visualization ELaX Import Export remaining step is the finalization of the implementation in XSD CodeX. After this work an evaluation of the usability of the Evolution engine XSD Config XML XSLT XSD Config conceptual model is planned. Nevertheless we are confident, that the usage is straightforward and the simplification of EMX in comparison to deal with the whole complexity of Model Spezification Configuration XML an XML Schema itself is huge. mapping of operation documents Update notes & 7. REFERENCES Model data Evolution spezific data evolution results Knowledge Transformation base Log [1] XSL Transformations (XSLT) Version 2.0. CodeX http://www.w3.org/TR/2007/REC-xslt20-20070123/, January 2007. Online; accessed 26-March-2013. Figure 9: Component Model of CodeX [5] [2] Extensible Markup Language (XML) 1.0 (Fifth Edition). The component model illustrates the different parts for http://www.w3.org/TR/2008/REC-xml-20081126/, dealing with the co-evolution. The main parts are an im- November 2008. Online; accessed 26-March-2013. port and export component for collecting and providing data [3] XQuery 1.0 and XPath 2.0 Data Model (XDM) of e.g. a user (XML Schemas, configuration files, XML doc- (Second Edition). http://www.w3.org/TR/2010/ ument collections, XSLT files), a knowledge base for stor- REC-xpath-datamodel-20101214/, December 2010. ing information (model data, evolution specific data and Online; accessed 26-March-2013. co-evolution results) and especially the logged ELaX state- [4] W3C XML Schema Definition Language (XSD) 1.1 ments (”Log”). The mapping information between XSD and Part 1: Structures. http://www.w3.org/TR/2012/ EMX of table 2 are specified in the ”Model data” component. REC-xmlschema11-1-20120405/, April 2012. Online; Furthermore the CodeX prototype also provides a graph- accessed 26-March-2013. ical user interface (”GUI”), a visualization component for [5] M. Klettke. Conceptual XML Schema Evolution - the the conceptual model and an evolution engine, in which the CoDEX Approach for Design and Redesign. In BTW transformations are derived. The visualization component Workshops, pages 53–63, 2007. realizes the visualization of an EMX introduced in table 2. [6] E. Maler. Schema design rules for ubl...and maybe for The ELaX interface for modifying imported XML Schemas you. In XML 2002 Proceedings by deepX, 2002. communicates directly with the evolution engine. [7] T. Nösinger, M. Klettke, and A. Heuer. Evolution von XML-Schemata auf konzeptioneller Ebene - Übersicht: 6. CONCLUSION Der CodeX-Ansatz zur Lösung des Gültigkeitsproblems. Valid XML documents need e.g. an XML Schema, which In Grundlagen von Datenbanken, pages 29–34, 2012. restricts the possibilities and usage of declarations, defini- [8] T. Nösinger, M. Klettke, and A. Heuer. Automatisierte tions and structures in general. In a heterogeneous changing Modelladaptionen durch Evolution - (R)ELaX in the environment (e.g. an information exchange scenario), also Garden of Eden. 
Technical Report CS-01-13, Institut ”old” and longtime used XML Schema have to be modified für Informatik, Universität Rostock, Rostock, Germany, to meet new requirements and to be up-to-date. Jan. 2013. Published as technical report CS-01-13 EMX (Entity Model for XML-Schema) as a conceptual under ISSN 0944-5900. model is a simplified representation of an XSD, which hides [9] E. van der Vlist. XML Schema. O’Reilly & Associates, its complexity and offers a graphical presentation. A unique Inc., 2002. mapping exists between every in the Garden of Eden style Semantic Enrichment of Ontology Mappings: Detecting Relation Types and Complex Correspondences ∗ Patrick Arnold Universität Leipzig arnold@informatik.uni-leipzig.de ABSTRACT being a tripe (s, t, c), where s is a concept in the source ontol- While there are numerous tools for ontology matching, most ogy, t a concept in the target ontology and c the confidence approaches provide only little information about the true na- (similarity). ture of the correspondences they discover, restricting them- These tools are able to highly reduce the effort of man- selves on the mere links between matching concepts. How- ual ontology mapping, but most approaches solely focus on ever, many disciplines such as ontology merging, ontology detecting the matching pairs between two ontologies, with- evolution or data transformation, require more-detailed in- out giving any specific information about the true nature formation, such as the concrete relation type of matches or of these matches. Thus, a correspondence is commonly re- information about the cardinality of a correspondence (one- garded an equivalence relation, which is correct for a corre- to-one or one-to-many). In this study we present a new ap- spondence like (zip code, postal code), but incorrect for cor- proach where we denote additional semantic information to respondences like (car, vehicle) or (tree trunk, tree), where an initial ontology mapping carried out by a state-of-the-art is-a resp. part-of would be the correct relation type. This re- matching tool. The enriched mapping contains the relation striction is an obvious shortcoming, because in many cases type (like equal, is-a, part-of) of the correspondences as well a mapping should also include further kinds of correspon- as complex correspondences. We present different linguistic, dences, such as is-a, part-of or related. Adding these infor- structural and background knowledge strategies that allow mation to a mapping is generally beneficial and has been semi-automatic mapping enrichment, and according to our shown to considerably improve ontology merging [13]. It first internal tests we are already able to add valuable se- provides more precise mappings and is also a crucial aspect mantic information to an existing ontology mapping. in related areas, such as data transformation, entity resolu- tion and linked data. An example is given in Fig. 1, which depicts the basic Keywords idea of our approach. While we get a simple alignment as ontology matching, relation type detection, complex corre- input, with the mere links between concepts (above picture), spondences, semantic enrichment we return an enriched alignment with the relation type an- notated to each correspondence (lower picture). As we will 1. INTRODUCTION point out in the course of this study, we use different linguis- tic methods and background knowledge in order to find the Ontology matching plays a key role in data integration relevant relation type. 
Besides this, we have to distinguish and ontology management. With the ontologies getting in- between simple concepts (as ”Office Software”) and complex creasingly larger and more complex, as in the medical or concepts, which contain itemizations like ”Monitors and Dis- biological domain, efficient matching tools are an important plays”, and which need a special treatment for relation type prerequisite for ontology matching, merging and evolution. detection. There are already various approaches and tools for ontol- Another issue of present ontology matchers is their restric- ogy matching, which exploit most different techniques like tion to (1:1)-correspondences, where exactly one source con- lexicographic, linguistic or structural methods in order to cept matches exactly one target concept. However, this can identify the corresponding concepts between two ontologies occasionally lead to inaccurate mappings, because there may [16], [2]. The determined correspondences build a so-called occur complex correspondences where more than one source alignment or ontology mapping, with each correspondence element corresponds to a target element or vice versa, as ∗ the two concepts first name and last name correspond to a concept name, leading to a (2:1)-correspondence. We will show in section 5 that distinguishing between one-to-one and one-to-many correspondences plays an important role in data transformation, and that we can exploit the results from the relation type detection to discover such complex matches in a set of (1:1)-matches to add further knowledge to a mapping. 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- In this study we present different strategies to assign the banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. relation types to an existing mapping and demonstrate how Copyright is held by the author/owner(s). lence, less/more-general (is-a / inverse is-a) and is-close (”re- lated”) and exploits linguistic techniques and background sources such as WordNet. The linguistic strategies seem rather simple; if a term appears as a part in another term, a more-general relation is assumed which is not always the case. For example, in Figure 1 the mentioned rule holds for the correspondence between Games and Action Games, but not between M onitors and M onitors and Displays. In [14], the authors evaluated Taxomap for a mapping scenario with 162 correspondences and achieved a recall of 23 % and a precision of 89 %. The LogMap tool [9] distinguishes between equivalence and so-called weak (subsumption / is-a) correspondences. It is based on Horn Logic, where first lexicographic and struc- tural knowledge from the ontologies is accumulated to build an initial mapping and subsequently an iterative process is carried out to first enhance the mapping and then to verify the enhancement. This tool is the least precise one with regard to relation type detection, and in evaluations the re- lation types were not further regarded. Several further studies deal with the identification of se- mantic correspondence types without providing a complete tool or framework. An approach utilizing current search engines is introduced in [10]. For two concepts A, B they generate different search queries like ”A, such as B” or ”A, which is a B” and submit them to a search engine (e.g., Google). They then analyze the snippets of the search en- gine results, if any, to verify or reject the tested relation- Figure 1: Input (above) and output (below) of the ship. 
The approach in [15] uses the Swoogle search engine Enrichment Engine to detect correspondences and relationship types between concepts of many crawled ontologies. The approach sup- ports equal, subset or mismatch relationships. [17] exploits complex correspondences can be discovered. Our approach, reasoning and machine learning to determine the relation which we refer to as Enrichment Engine, takes an ontology type of a correspondence, where several structural patterns mapping generated by a state-of-the-art matching tool as in- between ontologies are used as training data. put and returns a more-expressive mapping with the relation Unlike relation type determination, the complex corre- type added to each correspondence and complex correspon- spondence detection problem has hardly been discussed so dences revealed. According to our first internal tests, we far. It was once addressed in [5], coming to the conclusion recognized that even simple strategies already add valuable that there is hardly any approach for complex correspon- information to an initial mapping and may be a notable gain dence detection because of the vast amount of required com- for current ontology matching tools. parisons in contrast to (1:1)-matching, as well as the many Our paper is structured as follows: We discuss related possible operators needed for the mapping function. One work in section 2 and present the architecture and basic key observation for efficient complex correspondence detec- procedure of our approach in section 3. In section 4 we tion has been the need of large amounts of domain knowl- present different strategies to determine the relation types edge, but until today there is no available tool being able to in a mapping, while we discuss the problem of complex cor- semi-automatically detect complex matches. respondence detection in section 5. We finally conclude in One remarkable approach is iMAP [4], where complex section 6. matches between two schemas could be discovered and even several transformation functions calculated, as RoomP rice = 2. RELATED WORK RoomP rice∗(1+T axP rice). For this, iMAP first calculates (1:1)-matches and then runs an iterative process to gradu- Only a few tools and studies regard different kinds of ally combine them to more-complex correspondences. To correspondences or relationships for ontology matching. S- justify complex correspondences, instance data is analyzed Match [6][7] is one of the first such tools for ”semantic ontol- and several heuristics are used. In [8] complex correspon- ogy matching”. They distinguish between equivalence, sub- dences were also regarded for matching web query inter- set (is-a), overlap and mismatch correspondences and try faces, mainly exploiting co-occurrences. However, in order to provide a relationship for any pair of concepts of two to derive common co-occurrences, the approach requires a ontologies by utilizing standard match techniques and back- large amount of schemas as input, and thus does not appear ground knowledge from WordNet. Unfortunately, the result appropriate for matching two or few schemas. mappings tend to become very voluminous with many corre- While the approaches presented in this section try to a- spondences per concept, while users are normally interested chieve both matching and semantic annotation in one step, only in the most relevant ones. 
thus often tending to neglect the latter part, we will demon- Taxomap [11] is an alignment tool developed for the geo- strate a two-step architecture in which we first perform a graphic domain. It regards the correspondence types equiva- schema mapping and then concentrate straight on the en- Strategy equal is-a part-of related Compounding X richment of the mapping (semantic part). Additionally, we Background K. X X X X want to analyze several linguistic features to provide more Itemization X X qualitative mappings than obtained by the existing tools, Structure X X and finally develop an independent system that is not re- stricted to schema and ontology matching, but will be dif- Table 1: Supported correspondence types by the ferently exploitable in the wide field of date integration and strategies data analysis. ”undecided”. In this case we assign the relation type ”equal”, 3. ARCHITECTURE because it is the default type in the initial match result and As illustrated in Fig. 2 our approach uses a 2-step ar- possibly the most likely one to hold. Secondly, there might chitecture in which we first calculate an ontology mapping be different outcomes from the strategies, e.g., one returns (match result) using our state-of-the-art matching tool is-a, one equal and the others undecided. There are different COMA 3.0 (step 1) [12] and then perform an enrichment ways to solve this problem, e.g., by prioritizing strategies or on this mapping (step 2). relation types. However, we hardly discovered such cases so Our 2-step approach for semantic ontology matching offers far, so we currently return ”undecided” and request the user different advantages. First of all, we reduce complexity com- to manually specify the correct type. pared to 1-step approaches that try to directly determine the At the present, our approach is already able to fully assign correspondence type when comparing concepts in O1 with relation types to an input mapping using the 4 strategies, concepts in O2 . For large ontologies, such a direct match- which we will describe in detail in the next section. We have ing is already time-consuming and error-prone for standard not implemented strategies to create complex matches from matching. The proposed approaches for semantic matching the match result, but will address a couple of conceivable are even more complex and could not yet demonstrate their techniques in section 5. general effectiveness. Secondly, our approach is generic as it can be used for different domains and in combination with different match- 4. IMPLEMENTED STRATEGIES ing tools for the first step. We can even re-use the tool in We have implemented 4 strategies to determine the type different fields, such as entity resolution or text mining. On of a given correspondence. Table 1 gives an overview of the the other hand, this can also be a disadvantage, since the strategies and the relation types they are able to detect. It enrichment step depends on the completeness and quality of can be seen that the Background Knowledge approach is the initially determined match result. Therefore, it is im- especially valuable, as it can help to detect all relationship portant to use powerful tools for the initial matching and types. Besides, all strategies are able to identify is-a corre- possibly to fine-tune their configuration. spondences. In the following let O1 , O2 be two ontologies with c1 , c2 being two concepts from O1 resp. O2 . 
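The type computation described above is essentially a vote over the four strategies. The sketch below is our reading of that procedure – an all-undecided result falls back to equal, conflicting answers are left undecided for manual inspection; the strategy interface (callables returning a type string) is an assumption, not the actual Enrichment Engine API.

```python
from collections import Counter

UNDECIDED = "undecided"

def enrich_mapping(correspondences, strategies):
    """Annotate each correspondence (c1, c2) with a relation type.

    correspondences -- iterable of (source_concept, target_concept) pairs
    strategies      -- callables returning 'equal', 'is-a', 'inv. is-a',
                       'part-of', 'has-a', 'related' or 'undecided'
    """
    enriched = []
    for c1, c2 in correspondences:
        votes = [s(c1, c2) for s in strategies]
        decided = [v for v in votes if v != UNDECIDED]
        if not decided:
            # all strategies undecided: keep the default type of the initial match result
            rel_type = "equal"
        else:
            counts = Counter(decided)
            best, best_count = counts.most_common(1)[0]
            ties = [t for t, c in counts.items() if c == best_count]
            # conflicting outcomes without a clear majority are left to the user
            rel_type = best if len(ties) == 1 else UNDECIDED
        enriched.append((c1, c2, rel_type))
    return enriched
```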
Further, let C = (c1 , c2 ) be a correspondence between two concepts (we do not regard the confidence value in this study). 4.1 Compound Strategy In linguistics, a compound is a special word W that con- sists of a head WH carrying the basic meaning of W , and a modifier WM that specifies WH [3]. In many cases, a compound thus expresses something more specific than its head, and is therefore a perfect candidate to discover an is-a relationship. For instance, a blackboard is a board or an apple tree is a tree. Such compounds are called endocen- tric compounds, while exocentric compounds are not related Figure 2: Basic Workflow for Mapping Enrichment with their head, such as buttercup, which is not a cup, or saw tooth, which is not a tooth. These compounds are of literal The basics of the relation type detection, on which we fo- meaning (metaphors) or changed their spelling as the lan- cus in this study, can be seen in the right part of Fig. 2. We guage evolved, and thus do not hold the is-a relation, or only provide 4 strategies so far (Compound, Background Knowl- to a very limited extent (like airport, which is a port only in edge, Itemization, Structure), where each strategy returns a broad sense). There is a third form of compounds, called the relation type of a given correspondence, or ”undecided” appositional or copulative compounds, where the two words in case no specific type can be determined. In the Enrich- are at the same level, and the relation is rather more-general ment step we thus iterate through each correspondence in (inverse is-a) than more-specific, as in Bosnia-Herzegowina, the mapping and pass it to each strategy. We eventually which means both Bosnia and Herzegowina, or bitter-sweet, annotate the type that was most frequently returned by the which means both bitter and sweet (not necessarily a ”spe- strategies (type computation). In this study, we regard 4 cific bitter” or a ”specific sweet”). However, this type is quite distinct relation types: equal, is-a and inv. is-a (composi- rare. tion), part-of and has-a (aggregation), as well as related. In the following, let A, B be the literals of two con- There are two problems we may encounter when comput- cepts of a correspondence. The Compound Strategy ana- ing the correspondence type. First, all strategies may return lyzes whether B ends with A. If so, it seems likely that B is a compound with head A, so that the relationship B is-a by w1 . A (or A inv. is-a B) is likely to hold. The Compound ap- proach allows us to identify the three is-a correspondences 3. Remove each w1 ∈ I1 , w2 ∈ I2 if there is a synonym shown in Figure 1 (below). pair (w1 , w2 ). We added an additional rule to this simple approach: B is 4. Remove each w2 ∈ I2 which is a hyponym of w1 ∈ I1 . only considered a compound to A if length(B)−length(A) ≥ 3, where length(X) is the length of a string X. Thus, we 5. Determine the relation type: expect the supposed compound to be at least 3 characters longer than the head it matches. This way, we are able to (a) If I1 = ∅, I2 = ∅: equal eliminate obviously wrong compound conclusions, like sta- (b) If I1 = ∅, |I2 | ≥ 1: is-a ble is a table, which we call pseudo compounds. The value If I2 = ∅, |I1 | ≥ 1: inverse is-a of 3 is motivated by the observation that typical nouns or (c) If |I1 | ≥ 1, I2 ≥ 1: undecided adjectives consist of at least 3 letters. The rationale behind this algorithm is that we remove items 4.2 Background Knowledge from the item sets as long as no information gets lost. 
Then Background knowledge is commonly of great help in on- we compare what is left in the two sets and come to the tology matching to detect more difficult correspondences, conclusions presented in step 5. especially in special domains. In our approach, we intend to Let us consider the concept pair C1 = ”books, ebooks, use it for relation type detection. So far, we use WordNet movies, films, cds” and C2 =”novels, cds”. Our item sets are 3.0 to determine the relation that holds between two words I1 = {books, ebooks, movies, f ilms, cds}, I2 = {novels, cds}. (resp. two concepts). WordNet is a powerful dictionary and First, we remove synonyms and hyponyms within each set, thesaurus that contains synonym relations (equivalence), hy- because this would cause no loss of information (steps 1+2). pernym relations (is-a) and holonym relations (part-of) be- We remove f ilms in I1 (because of the synonym movies) tween words [22]. Using the Java API for WordNet Search and ebooks in I1 , because it is a hyponym of books. We have (JAWS), we built an interface that allows to answer ques- I1 = {books, movies, cds} , I2 = {novels, cds}. Now we re- tions like ”Is X a synonym to Y?”, or ”Is X a direct hyper- move synonym pairs between the two item sets, so we remove nym of Y?”. The interface is also able to detect cohyponyms, cds in either set (step 3). Lastly, we remove a hyponym in I1 which are two words X, Y that have a common direct hyper- if there is a hypernym in I2 (step 4). We remove novel in I2 , nym Z. We call a correspondence between two cohyponyms because it is a book. We have I1 = {books, movies} , I2 = ∅. X and Y related, because both concepts are connected to Since I1 still contains items, while I2 is empty, we conclude the same father element. For example, the relation between that I1 specifies something more general, i.e., it holds C1 apple tree and pear tree is related, because of the common inverse is-a C2 . father concept tree. If neither item set is empty, we return ”undecided” because Although WordNet has a limited vocabulary, especially we cannot derive an equal or is-a relationship in this case. with regard to specific domains, it is a valuable source to detect the relation type that holds between concepts. It al- 4.4 Structure Strategy lows an excellent precision, because the links in WordNet are The structure strategy takes the structure of the ontolo- manually defined, and contains all relation types we intend gies into account. For a correspondence between concepts to detect, which the other strategies are not able to achieve. Y and Z we check whether we can derive a semantic rela- tionship between a father concept X of Y and Z (or vice 4.3 Itemization versa). For an is-a relationship between Y and X we draw In several taxonomies we recognized that itemizations ap- the following conclusions: pear very often, and which cannot be processed with the pre- viously presented strategies. Consider the correspondence • X equiv Z → Y is-a Z (”books and newspapers”, ”newspapers”). The compound • X is-a Z → Y is-a Z strategy would be mislead and consider the source concept a compound, resulting in the type ”is-a”, although the op- For a part-of relationship between Y and X we can analo- posite is the case (inv. is-a). WordNet would not know the gously derive: word ”books and newspapers” and return ”undecided”. Itemizations thus deserve special treatment. 
We first split • X equiv Z → Y part-of Z each itemization in its atomic items, where we define an item as a string that does not contain commas, slashes or the • X part-of Z → Y part-of Z words ”and” and ”or”. The approach obviously utilizes the semantics of the intra- We now show how our approach determines the correspon- ontology relationships to determine the correspondence types dence types between two concepts C1 , C2 where at least one for pairs of concepts for which the semantic relationship can- of the two concepts is an itemization with more than one not directly be determined. item. Let I1 be the item set of C1 and I2 the item set of C2 . Let w1 , w2 be two words, with w1 6= w2 . Our approach 4.5 Comparison works as follows: We tested our strategies and overall system on 3 user- 1. In each set I remove each w1 ∈ I which is a hyponym generated mappings in which each correspondence was tagged of w2 ∈ I. with its supposed type. After running the scenarios, we checked how many of the non-trivial relations were detected 2. In each set I, replace a synonym pair (w1 ∈ I, w2 ∈ I) by the program. The 3 scenario consisted of about 350 .. 750 correspondences. We had a German-language sce- nario (product catalogs from online shops), a health scenario (diseases) and a text annotation catalog scenario (everyday speech). Compounding and Background Knowledge are two inde- pendent strategies that separately try to determine the rela- tion type of a correspondence. In our tests we saw that Com- pounding offers a good precision (72 .. 97 %), even without the many exocentric and pseudo-compounds that exist. By contrast, we recognized only moderate recall, ranging from 12 to 43 %. Compounding is only able to determine is-a relations, however, it is the only strategy that invariably works. Background Knowledge has a low or moderate recall (10 .. Figure 3: Match result containing two complex cor- 50 %), depending on the scenario at hand. However, it offers respondences (name and address) an excellent precision being very close to 100 % and is the only strategy that is able to determine all relation types we regard. As matter of fact, it did not work on our German- structure of the schemas to transform several (1:1)-corres- language example and only poorly in our health scenario. pondences into a complex correspondence, although these Structure and Itemization strategy depend much on the approaches will fail in more intricate scenarios. We used given schemas and are thus very specific strategies to han- the structure of the schemas and the already existing (1:1)- dle individual cases. They exploit the Compound and Back- matches to derive complex correspondences. Fig. 3 demon- ground Knowledge Strategy and are thus not independent. strates this approach. There are two complex correspon- Still, they were able to boost the recall to some degree. dences in the mapping, ( (First Name, Last Name), (Name)) We realized that the best result is gained by exploiting and ( (Street, City, Zip Code, Country), Address), repre- all strategies. Currently, we do not weight the strategies, sented by simple (1:1)-correspondences. Our approach was however, we may do so in order to optimize our system. We able to detect both complex correspondences. The first one finally achieved an overall recall between 46 and 65 % and (name) was detected, because first name and last name can- precision between 69 and 97 %. not be mapped to one element at the same time, since the name element can only store either of the two values. The 5. 
COMPLEX CORRESPONDENCES second example (address) is detected since schema data is Schema and ontology matching tools generally calculate located in the leaf nodes, not in inner nodes. In database (1:1)-correspondences, where exactly one source element schemas we always expect data to reside in the leaf nodes, matches exactly one target element. Naturally, either el- so that the match (Address, Address) is considered unrea- ement may take part in different correspondences, as in sonable. (name, first name) and (name, last name), however, having In the first case, our approach would apply the concatena- these two separate correspondences is very imprecise and the tion function, because two values have to be concatenated to correct mapping would rather be the single correspondence match the target value, and in the second case the split func- ( (first name, last name), (name)). These kind of matches tion would be applied, because the Address values have to are called complex correspondences or one-to-many corre- be split into the address components (street, city, zip code, spondences. country). The user needs to adjust these functions, e.g., in The disambiguation between a complex correspondence order to tell the program where in the address string the or 2 (or more) one-to-one correspondences is an inevitable split operations have to be performed. premise for data transformation where data from a source This approach was mostly based on heuristics and would database is to be transformed into a target database, which only work in simple cases. Now that we are able to de- we could show in [1]. Moreover, we could prove that each termine the relation types of (1:1)-matches, we can enhance complex correspondence needs a transformation function in this original approach. If a node takes part in more than one order to correctly map data. If elements are of the type composition relation (part-of / has-a), we can conclude that string, the transformation function is normally concatena- it is a complex correspondence and can derive it from the tion in (n:1)-matches and split in (1:n)-matches. If the el- (1:1)-correspondences. For instance, if we have the 3 corre- ements are of a numerical type, as in the correspondence spondences (day part-of date), (month part-of date), (year ( (costs), ((operational costs), (material costs), (personnel part-of date) we could create the complex correspondence ( costs))), a set of numerical operations is normally required. (day, month, year), date). There are proprietary solutions that allow to manually We have not implemented this approach so far, and we as- create transformation mappings including complex corre- sume that detecting complex correspondences and the cor- spondences, such as Microsoft Biztalk Server [19], Altova rect transformation function will still remain a very challeng- MapForce [18] or Stylus Studio [20], however, to the best ing issue, so that we intend to investigate additional methods of our knowledge there is no matching tool that is able to like using instance data to allow more effectiveness. How- detect complex correspondences automatically. Next to rela- ever, adding these techniques to our existing Enrichment tion type detection, we therefore intend to discover complex Engine, we are able to present a first solution that semi- correspondences in the initial mapping, which is a second automatically determines complex correspondences, which important step of mapping enrichment. 
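A sketch of this derivation, assuming the (1:1)-correspondences have already been typed by the enrichment step; the returned transformation label follows the concatenation/split rule for string-typed elements, and all names are illustrative.

from collections import defaultdict

def derive_complex_correspondences(typed_correspondences):
    # Group part-of correspondences by their "whole"; every target that is the whole
    # in more than one composition relation yields one complex (n:1) correspondence.
    parts_by_whole = defaultdict(list)
    for source, target, rel_type in typed_correspondences:
        if rel_type == "part-of":                 # source part-of target
            parts_by_whole[target].append(source)
    complex_matches = []
    for whole, parts in parts_by_whole.items():
        if len(parts) > 1:
            # string-typed elements: concatenation for (n:1), split in the (1:n) direction
            complex_matches.append((tuple(parts), whole, "concatenate"))
    return complex_matches

# e.g. [("day", "date", "part-of"), ("month", "date", "part-of"), ("year", "date", "part-of")]
# yields [(("day", "month", "year"), "date", "concatenate")]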
is another step towards more precise ontology matching, and We already developed simple methods that exploit the an important condition for data transformation. 6. OUTLOOK AND CONCLUSION [4] Dhamankar, R., Yoonkyong, L., Doan, A., Halevy, A., We presented a new approach to semantically enrich ontol- Domingos, P.: iMAP: Discovering Complex Semantic ogy mappings by determining the concrete relation type of a Matches between Database Schemas. In: SIGMOD ’04, correspondence and detecting complex correspondences. For pp. 383–394 this, we developed a 2-step architecture in which the actual [5] Doan, A., Halevy, A. Y.: Semantic Integration ontology matching and the semantic enrichment are strictly Research in the Database Community: A Brief Survey. separated. This makes the Enrichment Engine highly generic In AI Mag. (2005), pp. 83–94 so that it is not designed for any specific ontology matching [6] Giunchiglia, F., Shvaiko, P., Yatskevich, M.: S-Match: tool, and moreover, can be used independently in various An Algorithm and an Implementation of Semantic fields different from ontology matching, such as data trans- Matching. Proceedings of the European Semantic Web formation, entity resolution and text mining. Symposium (2004), LNCS 3053, pp. 61–75 In our approach we developed new linguistic strategies [7] Giunchiglia, F., Autayeu, A., Pane, J.: S-Match: an to determine the relation type, and with regard to our first open source framework for matching lightweight internal tests even the rather simple strategies already added ontologies. In: Semantic Web, vol. 3-3 (2012), pp. much useful information to the input mapping. We also 307-317 discovered that some strategies (Compounding, and to a less [8] He, B., Chen-Chuan Chang, H., Han, J.: Discovering degree Itemization and Structure) are rather independent complex matchings across web query interfaces: A from the language of the ontologies, so that our approach correlation mining approach. In: KDD ’04, pp. 148–157 provided remarkable results both in German and English- [9] Jiménez-Ruiz, E., Grau, B. C.: LogMap: Logic-Based language ontologies. and Scalable Ontology Matching. In: International One important obstacle is the strong dependency to the Semantic Web Conference (2011), LNCS 7031, pp. initial mapping. We recognized that matching tools tend to 273–288 discover equivalence relations, so that different non-equiva- [10] van Hage, W. R., Katrenko, S., Schreiber, G. A lence correspondences are not contained by the initial map- Method to Combine Linguistic Ontology-Mapping ping, and can thus not be detected. It is future work to Techniques. In: International Semantic Web Conference adjust our tool COMA 3.0 to provide a more convenient in- (2005), LNCS 3729, pp. 732–744 put, e.g., by using relaxed configurations. A particular issue [11] Hamdi, F., Safar, B., Niraula, N. B., Reynaud, C.: we are going to investigate is the use of instance data con- TaxoMap alignment and refinement modules: Results nected with the concepts to derive the correct relation type for OAEI 2010. Proceedings of the ISWC Workshop if the other strategies (which operate on the meta level) fail. (2010), pp. 212–219 This will also result in a time-complexity problem, which we [12] Massmann, S., Raunich, S., Aumueller, D., Arnold, P., will have to consider in our ongoing research. Rahm, E. Evolution of the COMA Match System. Proc. Our approach is still in a rather early state, and there Sixth Intern. 
Workshop on Ontology Matching (2011) is still much space for improvement, since the implemented strategies have different restrictions so far. For this reason, [13] Raunich, S.,Rahm, E.: ATOM: Automatic we will extend and fine-tune our tool in order to increase Target-driven Ontology Merging. Proc. Int. Conf. on effectiveness and precision. Among other aspects, we intend Data Engineering (2011) to improve the structure strategy by considering the entire [14] Reynaud, C., Safar, B.: Exploiting WordNet as concept path rather than the mere father concept, to add Background Knowledge. Proc. Intern. ISWCŠ07 further background knowledge to the system, especially in Ontology Matching (OM-07) Workshop specific domains, and to investigate further linguistic strate- [15] Sabou, M., d’Aquin, M., Motta, E.: Using the gies, for instance, in which way compounds also indicate the semantic web as background knowledge for ontology part-of relation. Next to relation type detection, we will also mapping. Proc. 1st Intern. Workshop on on Ontology concentrate on complex correspondence detection in data Matching (2006). transformation to provide further semantic information to [16] Shvaiko, P., Euzenat, J.: A Survey of Schema-based ontology mappings. Matching Approaches. J. Data Semantics IV (2005), pp. 146–171 7. ACKNOWLEDGMENT [17] Spiliopoulos, V., Vouros, G., Karkaletsis, V: On the discovery of subsumption relations for the alignment of This study was partly funded by the European Commis- ontologies. Web Semantics: Science, Services and sion through Project ”LinkedDesign” (No. 284613 FoF-ICT- Agents on the World Wide Web 8 (2010), pp. 69-88 2011.7.4). [18] Altova MapForce - Graphical Data Mapping, Conversion, and Integration Tool. 8. REFERENCES http://www.altova.com/mapforce.html [1] Arnold P.: The Basics of Complex Correspondences [19] Microsoft BizTalk Server. and Functions and their Implementation and http://www.microsoft.com/biztalk Semi-automatic Detection in COMA++ (Master’s [20] XML Editor, XML Tools, and XQuery - Stylus thesis), University of Leipzig, 2011. Studio. http://www.stylusstudio.com/ [2] Bellahsene., Z., Bonifati, A., Rahm, E. (eds.): Schema [21] Java API for WordNet Searching (JAWS), Matching and Mapping, Springer (2011) http://lyle.smu.edu/~tspell/jaws/index.html [3] Bisetto, A., Scalise, S.: Classification of Compounds. [22] WordNet - A lexical database for English, University of Bologna, 2009. In: The Oxford Handbook http://wordnet.princeton.edu/wordnet/ of Compounding, Oxford University Press, pp. 49-82. Extraktion und Anreicherung von Merkmalshierarchien durch Analyse unstrukturierter Produktrezensionen Robin Küppers Institut für Informatik Heinrich-Heine-Universität Universitätsstr. 1 40225 Düsseldorf, Deutschland kueppers@cs.uni-duesseldorf.de ABSTRACT tionelle Datenblätter oder Produktbeschreibungen möglich Wir präsentieren einen Algorithmus zur Extraktion bzw. wäre, da diese dazu tendieren, die Vorteile eines Produkts zu Anreicherung von hierarchischen Produktmerkmalen mittels beleuchten und die Nachteile zu verschweigen. Aus diesem einer Analyse von unstrukturierten, kundengenerierten Pro- Grund haben potentielle Kunden ein berechtigtes Interesse duktrezensionen. Unser Algorithmus benötigt eine initiale an der subjektiven Meinung anderer Käufer. 
Merkmalshierarchie, die in einem rekursiven Verfahren mit Zudem sind kundengenerierte Produktrezensionen auch für neuen Untermerkmalen angereichert wird, wobei die natür- Produzenten interessant, da sie wertvolle Informationen über liche Ordnung der Merkmale beibehalten wird. Die Funk- Qualität und Marktakzeptanz eines Produkts aus Kunden- tionsweise unseres Algorithmus basiert auf häufigen, gram- sicht enthalten. Diese Informationen können Produzenten matikalischen Strukturen, die in Produktrezensionen oft be- dabei helfen, die eigene Produktpalette zu optimieren und nutzt werden, um Eigenschaften eines Produkts zu beschrei- besser an Kundenbedürfnisse anzupassen. ben. Diese Strukturen beschreiben Obermerkmale im Kon- Mit wachsendem Umsatz der Web-Shops nimmt auch die text ihrer Untermerkmale und werden von unserem Algo- Anzahl der Produktrezensionen stetig zu, so dass es für Kun- rithmus ausgenutzt, um Merkmale hierarchisch zu ordnen. den (und Produzenten) immer schwieriger wird, einen um- fassenden Überblick über ein Produkt / eine Produktgrup- pe zu behalten. Deshalb ist unser Ziel eine feingranulare Kategorien Zusammenfassung von Produktrezensionen, die es erlaubt H.2.8 [Database Management]: Database Applications— Produkte dynamisch anhand von Produktmerkmalen (pro- data mining; I.2.7 [Artificial Intelligence]: Natural Lan- duct features) zu bewerten und mit ähnlichen Produkten zu guage Processing—text analysis vergleichen. Auf diese Weise wird ein Kunde in die Lage versetzt ein Produkt im Kontext seines eigenen Bedürfnis- Schlüsselwörter ses zu betrachten und zu bewerten: beispielsweise spielt das Gewicht einer Kamera keine große Rolle für einen Kunden, Text Mining, Review Analysis, Product Feature aber es wird viel Wert auf eine hohe Bildqualität gelegt. Produzenten können ihre eigene Produktpalette im Kontext 1. EINLEITUNG der Konkurrenz analysieren, um z. B. Mängel an den eige- Der Einkauf von Waren (z. B. Kameras) und Dienstleis- nen Produkten zu identifizieren. tungen (z. B. Hotels) über Web-Shops wie Amazon unter- Das Ziel unserer Forschung ist ein Gesamtsystem zur Analy- liegt seit Jahren einem stetigen Wachstum. Web-Shops ge- se und Präsentation von Produktrezensionen in zusammen- ben ihren Kunden (i. d. R.) die Möglichkeit die gekaufte Wa- gefasster Form (vgl. [3]). Dieses System besteht aus mehre- re in Form einer Rezension zu kommentieren und zu bewer- ren Komponenten, die verschiedene Aufgaben übernehmen, ten. Diese kundengenerierten Rezensionen enthalten wert- wie z.B. die Extraktion von Meinungen und die Bestimmung volle Informationen über das Produkt, die von potentiellen der Tonalität bezüglich eines Produktmerkmals (siehe dazu Kunden für ihre Kaufentscheidung herangezogen werden. Je auch Abschnitt 2). Im Rahmen dieser Arbeit beschränken positiver ein Produkt bewertet wird, desto wahrscheinlicher wir uns auf einen wichtigen Teilaspekt dieses Systems: die wird es von anderen Kunden gekauft. Extraktion und Anreicherung von hierarchisch organisierten Der Kunde kann sich so ausführlicher über die Vor- und Produktmerkmalen. Nachteile eines Produkts informieren, als dies über redak- Der Rest dieser Arbeit ist wie folgt gegliedert: zunächst geben wir in Abschnitt 2 einen Überblick über verwandte Arbeiten, die auf unsere Forschung entscheidenen Einfluss hatten. Anschließend präsentieren wir in Abschnitt 3 einen Algorithmus zur Extraktion und zur Anreicherung von hier- archisch organisierten Produktmerkmalen. 
Eine Bewertung des Algorithmus wird in Abschnitt 4 vorgenommen, sowie einige Ergebnisse präsentiert, die die Effektivität unseres Algorithmus demonstrieren. Die gewonnenen Erkenntnisse 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. werden in Abschnitt 5 diskutiert und zusammengefasst. Des Copyright is held by the author/owner(s). Weiteren geben wir einen Ausblick auf unsere zukünftige Forschung. 2. VERWANDTE ARBEITEN Dieser Abschnitt gibt einen kurzen Überblick über ver- wandte Arbeiten, die einen Einfluss auf unsere Forschung hatten. Die Analyse von Produktrezensionen basiert auf Al- Abbildung 1: Beispielhafte Merkmalshierarchie ei- gorithmen und Methoden aus verschiedensten Disziplinen. ner Digitalkamera. Zu den Wichtigsten zählen: Feature Extraction, Opining Mi- ning und Sentiment Analysis. Ein typischer Algorithmus zur merkmalsbasierten Tonali- Wir haben hauptsächlich Arbeiten vorgestellt, die Merk- tätsanalyse von Produktrezensionen ist in 3 unterschiedliche male und Meinungen aus Produktrezensionen extrahieren, Phasen unterteilt (vgl. [3]): aber Meinungsanalysen sind auch für andere Domänen inter- essant: z. B. verwenden die Autoren von [7] einen von Exper- 1. Extraktion von Produktmerkmalen. ten annotierten Korpus mit Nachrichten, um mit Techniken des maschinellen Lernens einen Klassifikator zu trainieren, 2. Extraktion von Meinungen über Produktmerkmale. der zwischen Aussagen (Meinungen) und Nicht-Aussagen 3. Tonalitätsanalyse der Meinungen. unterscheidet. Solche Ansätze sind nicht auf die Extrakti- on von Produktmerkmalen angewiesen. Man unterscheidet zwischen impliziten und expliziten Merk- malen[3]: explizite Merkmale werden direkt im Text genannt, implizite Merkmale müssen aus dem Kontext erschlossen 3. ANREICHERUNG VON MERKMALS- werden. Wir beschränken uns im Rahmen dieser Arbeit auf die Extraktion expliziter Merkmale. HIERARCHIEN Die Autoren von [3] extrahieren häufig auftretende, explizi- Dieser Abschnitt dient der Beschreibung eines neuen Al- te Merkmale mit dem a-priori Algorithmus. Mit Hilfe dieser gorithmus zur Anreicherung einer gegebenen, unvollständi- Produktmerkmale werden Meinungen aus dem Text extra- gen Merkmalshierarchie mit zusätzlichen Merkmalen. Die- hiert, die sich auf ein Produktmerkmal beziehen. Die Tona- se Merkmale werden aus unstrukturierten kundengenerier- lität einer Meinung wird auf die Tonalität der enthaltenen ten Produktrezensionen gewonnen, wobei versucht wird die Adjektive zurückgeführt. Die extrahierten Merkmale werden natürliche Ordnung der Merkmale (Unter- bzw. Obermerk- - im Gegensatz zu unserer Arbeit - nicht hierarchisch mo- malsbeziehung) zu beachten. delliert. Die Merkmalshierarchie bildet die Basis für weitergehende Es gibt auch Ansätze, die versuchen die natürliche Hierar- Analysen, wie z.B. die gezielte Extraktion von Meinungen chie von Produktmerkmalen abzubilden. Die Autoren von [1] und Tonalitäten, die sich auf Produktmerkmale beziehen. nutzen die tabellarische Struktur von Produktbeschreibun- Diese nachfolgenden Analyseschritte sind nicht mehr Gegen- gen aus, um explizite Produktmerkmale zu extrahieren, wo- stand dieser Arbeit. Produkte (aber auch Dienstleistungen) bei die hierarchische Struktur aus der Tabellenstruktur ab- können durch eine Menge von Merkmalen (product features) geleitet wird. Einen ähnlichen Ansatz verfolgen [5] et. al.: die beschrieben werden. 
Produktmerkmale folgen dabei einer Autoren nutzen ebenfalls die oftmals hochgradige Struktu- natürlichen, domänenabhängigen Ordnung. Eine derartige rierung von Produktbeschreibungen aus. Die Produktmerk- natürliche Hierarchie ist exemplarisch in Abbildung 1 für male werden mit Clusteringtechniken aus einem Korpus ex- das Produkt Digitalkamera dargestellt. Offensichtlich ist trahiert, wobei die Hierarchie der Merkmale durch das Clus- Display ein Untermerkmal von Digitalkamera und besitzt tering vorgegeben wird. Die Extraktion von expliziten Merk- eigene Untermerkmale Auflösung und Farbtemperatur. malen aus strukturierten Texten ist (i. d. R.) einfacher, als Hierarchien von Produktmerkmalen können auf Basis von durch Analyse unstrukturierter Daten. strukturierten Texten erzeugt werden, wie z. B. technische Die Methode von [2] et. al. benutzt eine Taxonomie zur Ab- Datenblättern und Produktbeschreibungen (vgl. [5]). Die- bildung der Merkmalshierarchie, wobei diese von einem Ex- se Datenquellen enthalten i. d. R. die wichtigsten Produkt- perten erstellt wird. Diese Hierarchie bildet die Grundlage merkmale. Der hohe Strukturierungsgrad dieser Datenquel- für die Meinungsextraktion. Die Tonalität der Meinungen len erlaubt eine Extraktion der Merkmale mit hoher Ge- wird über ein Tonalitätswörterbuch gelöst. Für diesen An- nauigkeit (≈ 71% [5]). Allerdings tendieren Datenblätter satz wird - im Gegensatz zu unserer Methode - umfangrei- und Produktbeschreibungen dazu, ein Produkt relativ ober- ches Expertenwissen benötigt. flächlich darzustellen oder zu Gunsten des Produkts zu ver- Die Arbeit von [8] et. al. konzentriert sich auf die Extrakti- zerren. Zum Beispiel enthält die Hierarchie in Abbildung on von Meinungen und die anschließende Tonalitätsanalyse. 1 eine Reihe von Merkmalen, wie sie häufig in strukturier- Die Autoren unterscheiden zwischen subjektiven und kom- ten Datenquellen zu finden sind (helle Knoten). Allerdings parativen Sätze. Sowohl subjektive, als auch komparative sind weitere, detailliertere Merkmale denkbar, die für eine Sätze enthalten Meinungen, wobei im komparativen Fall ei- Kaufentscheidung von Interesse sein könnten. Beispielsweise ne Meinung nicht direkt gegeben wird, sondern über einen könnte das Display einer Digitalkamera zur Fleckenbildung Vergleich mit einem anderen Produkt erfolgt. Die Autoren am unteren/oberen Rand neigen. Unterer/Oberer Rand nutzen komparative Sätze, um Produktgraphen zu erzeu- wird in diesem Fall zu einem Untermerkmal von Display gen mit deren Hilfe verschiedene Produkte hinsichtlich eines und Obermerkmal von Fleckenbildung (dunkle Knoten). Merkmals geordnet werden können. Die notwendigen Tona- Eine derartige Anreicherung einer gegebenen, unvollständi- litätswerte werden einem Wörterbuch entnommen. gen Merkmalshierarchie kann durch die Verarbeitung von kundengenerierten, unstrukturierten Rezensionen erfolgen. z.B. steht DET für einen Artikel, NOUN für ein Hauptwort Wir halten einen hybriden Ansatz für durchaus sinnvoll: zu- und ADJ für ein Adjektiv. Weitere Informationen über das nächst wird eine initiale Merkmalshierarchie mit hoher Ge- Universal Tagset finden sich in [6]. nauigkeit aus strukturierten Daten gewonnen. Anschließend wird diese Hierarchie in einer zweiten Verarbeitungshase mit 3.2 Analysepipeline zusätzlichen Produktmerkmalen angereichert. 
Für die Verarbeitung und Untersuchung der Produktre- Für den weiteren Verlauf dieses Abschnitts beschränken wir zensionen haben wir eine für den NLP-Bereich (Natural Lan- uns auf die zweite Analysephase, d.h. wir nehmen eine in- guage Processing) typische Standardpipeline benutzt: die itiale Merkmalshierarchie als gegeben an. Für die Evaluation Volltexte der Rezensionen sind für unsere Zwecke zu grob- unseres Algorithmus (siehe Abschnitt 4) wurden die initia- granular, so dass in einer ersten Phase der Volltext in Sätze len Merkmalshierarchien manuell erzeugt. zerteilt wird. Anschließend werden die Sätze tokenisiert und Unser Algorithmus wurde auf der Basis einer Reihe von ein- die Wortarten der einzelnen Worte bestimmt. Des Weite- fachen Beobachtungen entworfen, die wir bei der Analyse ren werden Stoppworte markiert - dafür werden Standard- von unserem Rezensionskorpus gemacht haben. Stoppwortlisten benutzt. Wir beenden die Analysepipeline mit einer Stammformreduktion für jedes Wort, um die ver- 1. Ein Produktmerkmal wird häufig durch ein Hauptwort schiedenen Flexionsformen eines Wortes auf eine kanonische repräsentiert. Basis zu bringen. 2. Viele Hauptwörter können dasselbe Produktmerkmal Für die Bestimmung zusätzlicher Produktmerkmale aus Pro- beschreiben. (Synonyme) duktrezensionen, sind vor allem Hauptworte interessant, die i. d. R. keine Stoppworte sind. Allerdings ist uns aufgefal- 3. Untermerkmale werden häufig im Kontext ihrer Ober- len, dass überdurchschnittlich viele Worte fälschlicherweise merkmale genannt, wie z. B. ”das Ladegerät der Ka- als ein Hauptwort erkannt werden - viele dieser Worte sind mera”. Stoppworte. Wir nehmen an, dass die variierende, gramma- 4. Textfragmente, die von Produktmerkmalen handeln, tikalische Qualität der Produktrezensionen für die hohe An- besitzen häufig eine sehr ähnliche grammatikalische zahl falsch bestimmer Worte verantwortlich ist. Die Stopp- Struktur, wie z.B. ”die Auflösung der Anzeige” oder wortmarkierung hilft dabei, diesen Fehler etwas auszuglei- ”die Laufzeit des Akkus”, wobei Unter- und Obermerk- chen. male gemeinsam genannt werden. Die Struktur der 3.3 Der Algorithmus Fragmente lautet [DET, NOUN, DET, NOUN], wo- bei DET einen Artikel und NOUN ein Hauptwort be- In diesem Abschnitt beschreiben wir einen neuen Algorith- schreibt. mus, um eine initiale Hierarchie von Produktmerkmalen mit zusätzlichen Merkmalen anzureichern, wobei die natürliche Der Rest dieses Abschnitts gliedert sich wie folgt: zunächst Ordnung der Merkmale erhalten bleibt (siehe Algorithmus 1). werden Definitionen in Unterabschnitt 3.1 eingeführt, die Der Algorithmus erwartet 3 Parameter: eine 2-dimensionale für das weitere Verständnis notwendig sind. Anschließend Liste von Token T , die sämtliche Token für jeden Satz ent- beschreiben wir unsere Analysepipeline, die für die Vorver- hält (dabei beschreibt die erste Dimension die Sätze, die arbeitung der Produktrezensionen verwendet wurde, in Un- zweite Dimensionen die einzelnen Wörter), eine initiale Hier- terabschnitt 3.2. Darauf aufbauend wird in Unterabschnitt archie von Merkmalen f und eine Menge von POS-Mustern 3.3 unser Algorithmus im Detail besprochen. P . Da der Algorithmus rekursiv arbeitet, wird zusätzlich ein Parameter d übergeben, der die maximale Rekursionstiefe 3.1 Definitionen angibt. Der Algorithmus bricht ab, sobald die vorgegebene Für das Verständnis der nächsten Abschnitte werden eini- Tiefe erreicht wird (Zeile 1-3). 
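Eine solche Standardpipeline lässt sich beispielsweise mit spaCy nachbilden; die folgende Skizze ist eine unverbindliche Annahme (Bibliothek, Modellname "de_core_news_sm" sowie die Lemmatisierung als Ersatz für die Stammformreduktion sind nicht Teil der Arbeit).

import spacy

# Deutsches spaCy-Modell als Stellvertreter für die beschriebene Pipeline:
# Satzzerlegung, Tokenisierung, POS-Tagging (Universal Tagset), Stoppwortmarkierung.
nlp = spacy.load("de_core_news_sm")

def preprocess(review_text):
    # Liefert pro Satz eine Liste von (Wort, POS-Tag, ist_Stoppwort)-Tripeln;
    # tok.lemma_ dient hier als Näherung der Stammformreduktion.
    doc = nlp(review_text)
    return [[(tok.lemma_, tok.pos_, tok.is_stop) for tok in sent]
            for sent in doc.sents]

# Beispiel: preprocess("Die Auflösung der Anzeige ist hervorragend.") sollte für das
# erste Satzfragment die POS-Folge DET, NOUN, DET, NOUN liefern.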
ge Begriffe benötigt, die in diesem Unterabschnitt definiert werden sollen: Kandidatensuche (Zeile 4-11). Um geeignete Kandida- Token. Ein Token t ist ein Paar t = (vword , vP OS ), wobei ten für neue Produktmerkmale zu finden, werden alle Sätze vword das Wort und vpos die Wortart angibt. Im Rahmen betrachtet und jeweils entschieden, ob der Satz eine Realisie- dieser Arbeit wurde das Universal Tagset [6] benutzt. rung des aktuell betrachteten Merkmals enthält oder nicht. Wenn ein Satz eine Realisierung hat, dann wird die Funkti- Merkmal. Wir definieren ein Produktmerkmal f als ein on applyP atterns aufgerufen. Diese Funktion sucht im über- Tripel f = (S, C, p), wobei S eine Menge von Synonymen be- gebenen Satz nach gegebenen POS-Mustern und gibt – so- schreibt, die als textuelle Realisierung eines Merkmals Ver- fern mindestens ein Muster anwendbar ist – die entsprechen- wendung finden können. Die Elemente von S können Wor- den Token als Kandidat zurück, wobei die Mustersuche auf te, Produktbezeichnungen und auch Abkürzungen enthal- das unmittelbare Umfeld der gefundenen Realisierung einge- ten. Die Hierarchie wird über C und p kontrolliert, wobei schränkt wird, damit das korrekte POS-Muster zurückgelie- C eine Menge von Untermerkmalen und p das Obermerk- fert wird, da POS-Muster mehrfach innerhalb eines Satzes mal von f angibt. Das Wurzelelement einer Hierarchie be- vorkommen können. schreibt das Produkt/die Produktgruppe selbst und besitzt Im Rahmen dieser Arbeit haben wie die folgenden POS- kein Obermerkmal. Muster verwendet: • [DET, NOUN, DET, NOUN] POS-Muster. Ein POS-Muster q ist eine geordnete Sequenz von POS-Tags p = [tag1 , tag2 , . . . , tagn ], wobei n die Mus- • [DET, NOUN, VERB, DET, ADJ, NOUN] terlänge beschreibt. Ein POS-Tag beschreibt eine Wortart, Algorithm 1: refineHierarchy Synonymen. Dazu wird das Wort mit den Synonymen von f verglichen (z.B. mit der Levenshtein-Distanz) und als Syn- onym aufgenommen, falls eine ausreichende Ähnlichkeit be- Eingabe : T : Eine 2-dimensionale Liste von Token. steht. Damit soll verhindert werden, dass die falsche Schreib- Eingabe : P : Ein Array von POS-Mustern. weise eines eigentlich bekannten Merkmals dazu führt, dass Eingabe : f : Eine initiale Merkmalshierarchie. ein neuer Knoten in die Hierarchie eingefügt wird. Eingabe : d : Die maximale Rekursionstiefe. Ausgabe: Das Wurzelmerkmal der angereicherten Wenn der Token t die Heuristiken erfolgreich passiert hat, Hierarchie. dann wird t zu einem neuen Untermerkmal von f (Zeile 27). 1 if d = 0 then 2 return f Rekursiver Aufruf (Zeile 30-32). Nachdem das Merkmal 3 end f nun mit zusätzlichen Merkmalen angereichert wurde, wird 4 C ← {} ; der Algorithmus rekursiv für alle Untermerkmale von f auf- 5 for Token[] T ′ ∈ T do gerufen, um diese mit weiteren Merkmalen zu versehen. Die- 6 for Token t ∈ T ′ do ser Vorgang wiederholt sich solange, bis die maximale Re- 7 if t.word ∈Sf.S then kursionstiefe erreicht wird. 8 C ← C applyP attern(T ′ , P ) ; 9 end 10 end Nachbearbeitungsphase. Die Hierarchie, die von Algorith- 11 end mus 1 erweitert wurde, muss in einer Nachbearbeitungspha- 12 for Token[] C ′ ∈ C do se bereinigt werden, da viele Merkmale enthalten sind, die 13 for Token t ∈ C ′ do keine realen Produktmerkmale beschreiben (Rauschen). Für 14 if t.pos 6= NOUN then diese Arbeit verwenden wir die relative Häufigkeit eines Un- 15 next ; termerkmals im Kontext seines Obermerkmals, um nieder- 16 end frequente Merkmale (samt Untermerkmalen) aus der Hier- 17 if t.length ≤ 3 then archie zu entfernen. 
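Die Definitionen aus Abschnitt 3.1 und der Ablauf von Algorithmus 1 lassen sich kompakt etwa so skizzieren; die Fensterbehandlung in apply_patterns sowie die Funktion is_valid_candidate (die Validierungsheuristiken, siehe unten) sind Annahmen dieser Skizze und geben nicht die Implementierung der Arbeit wieder.

from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Token:
    word: str   # v_word: das (stammformreduzierte) Wort
    pos: str    # v_POS: Wortart nach dem Universal Tagset, z. B. "DET", "NOUN"

@dataclass
class Feature:
    synonyms: Set[str]                                        # S: textuelle Realisierungen
    children: List["Feature"] = field(default_factory=list)   # C: Untermerkmale
    parent: Optional["Feature"] = None                        # p: Obermerkmal (None bei der Wurzel)

POS_PATTERNS = [
    ["DET", "NOUN", "DET", "NOUN"],                 # z. B. "die Auflösung der Anzeige"
    ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"],
]

def apply_patterns(sentence, patterns, feature):
    # Liefert Token-Fenster, in denen ein POS-Muster im unmittelbaren Umfeld einer
    # Realisierung des Merkmals auftritt (die Fenstergrenzen sind hier vereinfacht).
    hits = []
    for i, tok in enumerate(sentence):
        if tok.word not in feature.synonyms:
            continue
        for pattern in patterns:
            for start in range(max(0, i - len(pattern) + 1), i + 1):
                window = sentence[start:start + len(pattern)]
                if [t.pos for t in window] == pattern:
                    hits.append(window)
    return hits

def refine_hierarchy(sentences, feature, patterns, depth):
    # Vereinfachte Fassung von refineHierarchy: Kandidatensuche, Validierung, Rekursion.
    if depth == 0:
        return feature
    candidates = []
    for sentence in sentences:                       # Kandidatensuche (Zeile 4-11)
        if any(tok.word in feature.synonyms for tok in sentence):
            candidates.extend(apply_patterns(sentence, patterns, feature))
    for window in candidates:                        # Validierung (Zeile 12-29)
        for tok in window:
            if is_valid_candidate(tok, feature):     # Heuristiken 1-4, siehe unten
                feature.children.append(Feature({tok.word}, [], feature))
    for child in feature.children:                   # rekursiver Aufruf (Zeile 30-32)
        refine_hierarchy(sentences, child, patterns, depth - 1)
    return feature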
Es sind aber auch andere Methoden 18 next ; 19 end denkbar, wie z.B. eine Gewichtung nach tf-idf [4]. Dabei wird 20 if hasP arent(t.word, f ) then nicht nur die Termhäufigkeit (tf ) betrachtet, sondern auch 21 next ; die inverse Dokumenthäufigkeit (idf ) mit einbezogen. Der 22 end idf eines Terms beschreibt die Bedeutsamkeit des Terms im 23 if isSynonym(t.word, f.S) then Bezug auf die gesamte Dokumentenmenge. 24 f.S ← t.word ; 25 next ; 26 end S 4. EVALUATION 27 f.C ← f.C ({t.word}, {}, f ) ; In diesem Abschnitt diskutieren wir die Vor- und Nachteile 28 end unseres Algorithmus. Um unseren Algorithmus evaluieren zu 29 end können, haben wir einen geeigneten Korpus aus Kundenre- 30 for Feature[] f ′ ∈ f.C do zensionen zusammengestellt. Unser Korpus besteht aus 4000 31 ref ineHierarchy(T, f ′ , P, d − 1); Kundenrezensionen von amazon.de aus der Produktgruppe 32 end Digitalkamera. Wir haben unseren Algorithmus für die genannte Produkt- gruppe eine Hierarchie anreichern lassen. Die initiale Pro- dukthierarchie enthält ein Obermerkmal, welches die Pro- duktgruppe beschreibt. Zudem wurden häufig gebrauchte Validierungsphase (Zeile 12-29). Die Validierungsphase Synonyme hinzugefügt, wie z.B. Gerät. Im Weiteren prä- dient dazu die gefundenen Kandidaten zu validieren, also sentieren wir exemplarisch die angereicherte Hierarchie. Für zu entscheiden, ob ein Kandidat ein neues Merkmal enthält. dieses Experiment wurde die Rekursionstiefe auf 3 gesetzt, Man beachte, dass es sich bei diesem neuen Merkmal um niederfrequente Merkmale (relative Häufigkeit < 0, 002) wur- ein Untermerkmal des aktuellen Produktmerkmals handelt, den eliminiert. Wir haben für diese Arbeit Rezensionen in sofern es existiert. Für die Entscheidungsfindung nutzen wir Deutscher Sprache verwendet, aber der Algorithmus kann eine Reihe von einfachen Heuristiken. Ein Token t ist kein leicht auf andere Sprachen angepasst werden. Die erzeug- Produktmerkmal und wird übergangen, falls t.vword : te Hierarchie ist in Abbildung 2 dargestellt. Es zeigt sich, dass unser Algorithmus – unter Beachtung der hierarchi- 1. kein Hauptwort ist (Zeile 14-16). schen Struktur – eine Reihe wertvoller Merkmale extrahieren 2. keine ausreichende Länge besitzt (Zeile 17-19). konnte: z. B. Batterie mit seinen Untermerkmalen Halte- zeit und Verbrauch oder Akkus mit den Untermerkmalen 3. ein Synonym von f (oder eines Obermerkmals von f ) Auflad und Zukauf. Es wurden aber auch viele Merkmale ist (Zeile 20-22). aus den Rezensionen extrahiert, die entweder keine echten Produktmerkmale sind (z.B. Kompakt oder eine falsche 4. ein neues Synonym von f darstellt (Zeile 23-26). Ober-Untermerkmalsbeziehung abbilden (z. B. Haptik und Kamera). Des Weiteren sind einige Merkmale, wie z. B. Die 3. Heuristik stellt sicher, dass sich keine Kreise in der Qualität zu generisch und sollten nicht als Produktmerk- Hierarchie bilden können. Man beachte, dass Obermerkma- mal benutzt werden. le, die nicht direkt voneinander abhängen, gleiche Unter- merkmale tragen können. Die 4. Heuristik dient zum Lernen von vorher unbekannten malen anreichert. Die neuen Merkmale werden automatisch aus unstrukturierten Produktrezensionen gewonnen, wobei der Algorithmus versucht die natürliche Ordnung der Pro- duktmerkmale zu beachten. Wir konnten zeigen, dass unser Algorithmus eine initiale Merkmalshierarchie mit sinnvollen Untermerkmalen anrei- chern kann, allerdings werden auch viele falsche Merkma- le extrahiert und in fehlerhafte Merkmalsbeziehungen ge- bracht. 
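Die vier Heuristiken der Validierungsphase und die Bereinigung über die relative Häufigkeit könnten etwa wie folgt aussehen; der Ähnlichkeitsvergleich verwendet hier ersatzweise difflib statt der genannten Levenshtein-Distanz, der Schwellwert 0,85 ist eine Annahme dieser Skizze, der Grenzwert 0,002 entspricht dem im Experiment verwendeten Wert.

from difflib import SequenceMatcher

def is_valid_candidate(tok, feature, sim_threshold=0.85):
    if tok.pos != "NOUN":                        # 1. kein Hauptwort
        return False
    if len(tok.word) <= 3:                       # 2. keine ausreichende Länge
        return False
    ancestor = feature                           # 3. Synonym von f oder eines Obermerkmals
    while ancestor is not None:                  #    (verhindert Kreise in der Hierarchie)
        if tok.word in ancestor.synonyms:
            return False
        ancestor = ancestor.parent
    for syn in feature.synonyms:                 # 4. Schreibvariante als neues Synonym lernen
        if SequenceMatcher(None, tok.word, syn).ratio() >= sim_threshold:
            feature.synonyms.add(tok.word)
            return False
    return True

def prune(feature, rel_freq, threshold=0.002):
    # Nachbearbeitung: niederfrequente Untermerkmale samt Teilbaum entfernen;
    # rel_freq(child, parent) liefert die relative Häufigkeit des Untermerkmals im
    # Kontext seines Obermerkmals (Zählung während der Kandidatensuche angenommen).
    feature.children = [c for c in feature.children if rel_freq(c, feature) >= threshold]
    for child in feature.children:
        prune(child, rel_freq, threshold)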
Wir halten unseren Algorithmus dennoch für viel- versprechend. Unsere weitere Forschung wird sich auf Teila- spekte dieser Arbeit konzentrieren: • Die Merkmalsextraktion muss verbessert werden: wir haben beobachtet, dass eine Reihe extrahierter Merk- male keine echten Produktmerkmale beschreiben. Da- bei handelt es sich häufig um sehr allgemeine Wörter wie z.B. Möglichkeiten. Wir bereiten deshalb den Aufbau einer Stoppwortliste für Produktrezensionen vor. Auf diese Weise könnte diese Problematik abge- schwächt werden. • Des Weiteren enthalten die angereicherten Hierarchi- en teilweise Merkmale, die in einer falschen Beziehung zueinander stehen, z.B. induzieren die Merkmale Ak- ku und Akku-Ladegerät eine Ober-Untermerkmals- beziehung: Akku kann als Obermerkmal von Ladege- rät betrachtet werden. Außerdem konnte beobachtet werden, dass einige Merkmalsbeziehungen alternieren: z.B. existieren 2 Merkmale Taste und Druckpunkt in wechselnder Ober-Untermerkmalbeziehung. • Der Algorithmus benötigt POS-Muster, um Untermerk- male in Sätzen zu finden. Für diese Arbeit wurden die verwendeten POS-Muster manuell konstruiert, aber wir planen die Konstruktion der POS-Muster weitestge- hend zu automatisieren. Dazu ist eine umfangreiche Analyse eines großen Korpus notwendig. • Die Bereinigung der erzeugten Hierarchien ist unzurei- chend - die relative Häufigkeit eines Merkmals reicht als Gewichtung für unsere Zwecke nicht aus. Aus die- sem Grund möchten wir mit anderen Gewichtungsma- ßen experimentieren. • Die Experimente in dieser Arbeit sind sehr einfach ge- staltet. Eine sinnvolle Evaluation ist (z. Zt.) nicht mög- lich, da (unseres Wissens nach) kein geeigneter Test- korpus mit annotierten Merkmalshierarchien existiert. Die Konstruktion eines derartigen Korpus ist geplant. • Des Weiteren sind weitere Experimente geplant, um den Effekt der initialen Merkmalshierarchie auf den Algorithmus zu evaluieren. Diese Versuchsreihe um- fasst Experimente mit mehrstufigen, initialen Merk- malshierarchien, die sowohl manuell, als auch automa- tisch erzeugt wurden. Abbildung 2: Angereicherte Hierarchie für die Pro- • Abschließend planen wir die Verwendung unseres Al- duktgruppe Digitalkamera. gorithmus zur Extraktion von Produktmerkmalen in einem Gesamtsystem zur automatischen Zusammen- fassung und Analyse von Produktrezensionen einzu- setzen. 5. RESÜMEE UND AUSBLICK In dieser Arbeit wurde ein neuer Algorithmus vorgestellt, der auf Basis einer gegebenen – möglicherweise flachen – Merkmalshierarchie diese Hierarchie mit zusätzlichen Merk- 6. REFERENZEN [1] M. Acher, A. Cleve, G. Perrouin, P. Heymans, C. Vanbeneden, P. Collet, and P. Lahire. On extracting feature models from product descriptions. In Proceedings of the Sixth International Workshop on Variability Modeling of Software-Intensive Systems, VaMoS ’12, pages 45–54, New York, NY, USA, 2012. ACM. [2] F. L. Cruz, J. A. Troyano, F. Enrı́quez, F. J. Ortega, and C. G. Vallejo. A knowledge-rich approach to feature-based opinion extraction from product reviews. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, SMUC ’10, pages 13–20, New York, NY, USA, 2010. ACM. [3] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 168–177, New York, NY, USA, 2004. ACM. [4] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972. [5] X. 
Meng and H. Wang. Mining user reviews: From specification to summarization. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort ’09, pages 177–180, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. [6] S. Petrov, D. Das, and R. McDonald. A universal part-of-speech tagset. In N. C. C. Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, may 2012. European Language Resources Association (ELRA). [7] T. Scholz and S. Conrad. Extraction of statements in news for a media response analysis. In Proc. of the 18th Intl. conf. on Applications of Natural Language Processing to Information Systems 2013 (NLDB 2013), 2013. (to appear). [8] K. Zhang, R. Narayanan, and A. Choudhary. Voice of the customers: Mining online customer reviews for product feature-based ranking. In Proceedings of the 3rd conference on Online social networks, WOSN’10, pages 11–11, Berkeley, CA, USA, 2010. USENIX Association. Ein Verfahren zur Beschleunigung eines neuronalen Netzes für die Verwendung im Image Retrieval Daniel Braun Heinrich-Heine-Universität Institut für Informatik Universitätsstr. 1 D-40225 Düsseldorf, Germany braun@cs.uni-duesseldorf.de ABSTRACT fikator eine untergeordnete Rolle, da man die Berechnung Künstliche neuronale Netze haben sich für die Mustererken- vor der eigentlichen Anwendung ausführt. Will man aller- nung als geeignetes Mittel erwiesen. Deshalb sollen ver- dings auch während der Nutzung des Systems weiter ler- schiedene neuronale Netze verwendet werden, um die für nen, so sollten die benötigten Rechnungen möglichst wenig ein bestimmtes Objekt wichtigen Merkmale zu identifizier- Zeit verbrauchen, da der Nutzer ansonsten entweder auf die en. Dafür werden die vorhandenen Merkmale als erstes Berechnung warten muss oder das Ergebnis, dass ihm aus- durch ein Art2-a System kategorisiert. Damit die Kategorien gegeben wird, berücksichtigt nicht die durch ihn hinzuge- verschiedener Objekte sich möglichst wenig überschneiden, fügten Daten. muss bei deren Berechnung eine hohe Genauigkeit erzielt Für ein fortlaufendes Lernen bieten sich künstliche neu- werden. Dabei zeigte sich, dass das Art2 System, wie auch ronale Netze an, da sie so ausgelegt sind, dass jeder neue die Art2-a Variante, bei steigender Anzahl an Kategorien Input eine Veränderung des Gedächtnisses des Netzes nach schnell zu langsam wird, um es im Live-Betrieb verwen- sich ziehen kann. Solche Netze erfreuen sich, bedingt durch den zu können. Deshalb wird in dieser Arbeit eine Opti- die sich in den letzten Jahren häufenden erfolgreichen An- mierung des Systems vorgestellt, welche durch Abschätzung wendungen - zum Beispiel in der Mustererkennung - einer des von dem Art2-a System benutzen Winkels die Anzahl steigenden Beliebtheit in verschiedensten Einsatzgebieten, der möglichen Kategorien für einen Eingabevektor stark ein- wie zum Beispiel auch im Image Retrieval. schränkt. Des Weiteren wird eine darauf aufbauende In- Der geplante Systemaufbau sieht dabei wie folgt aus: die dexierung der Knoten angegeben, die potentiell den Speich- Merkmalsvektoren eines Bildes werden nacheinander einer erbedarf für die zu überprüfenden Vektoren reduzieren kann. 
Clustereinheit übergeben, welche die Merkmalsvektoren clus- Wie sich in den durchgeführten Tests zeigte, kann die vorge- tert und die Kategorien der in dem Bild vorkommenden stellte Abschätzung die Bearbeitungszeit für kleine Cluster- Merkmale berechnet. Das Clustering der Clustereinheit pas- radien stark reduzieren. siert dabei fortlaufend. Das bedeutet, dass die einmal be- rechneten Cluster für alle weiteren Bilder verwendet werden. Danach werden die für das Bild gefundenen Kategorien von Kategorien Merkmalen an die Analyseeinheit übergeben, in der versucht H.3.3 [Information Storage and Retrieval]: Information wird, die für ein bestimmtes Objekt wichtigen Kategorien zu Search and Retrieval—Clustering; F.1.1 [Computation by identifizieren. Die dort gefundenen Kategorien werden dann Abstract Devices]: Models of Computation—Neural Net- für die Suche dieser Objekte in anderen Bildern verwendet. work Das Ziel ist es dabei, die Analyseeinheit so zu gestalten, dass sie nach einem initialen Training weiter lernt und so Schlüsselwörter neue Merkmale eines Objektes identifizieren soll. Für die Analyseeinheit ist die Verwendung verschiedener Neuronale Netze, Clustering, Image Retrieval neuronaler Netze geplant. Da sie aber für die vorgenomme- nen Optimierungen irrelevant ist, wird im Folgenden nicht 1. EINLEITUNG weiter auf sie eingegangen. Trainiert man ein Retrieval System mit einem festen Kor- Von dem Clusteringverfahren für die Clustereinheit wer- pus und wendet die berechneten Daten danach unverän- den dabei die folgenden Punkte gefordert: dert an, so spielt die Berechnungsdauer für einen Klassi- • Das Clustering soll nicht überwacht funktionieren. Das bedeutet, dass es keine Zielvorgabe für die Anzahl der Cluster geben soll. Das System soll also auch bei einem bestehenden Clustering für einen neuen Eingabevektor erkennen, ob er einem Cluster zugewiesen werden kann oder ob ein neuer Cluster erstellt werden muss. • Die Ausdehnung der Cluster soll begrenzt sein. Das 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. soll dazu führen, dass gefundene Merkmalskategorien Copyright is held by the author/owner(s). mit höherer Wahrscheinlichkeit zu bestimmten Objek- ten gehören und nicht die Vektoren anderer Objekte Orienting Subsystem Attentional Subsystem die Kategorie verschmutzen. • Das Clustering Verfahren sollte auch bei einer hohen Reset Category Representation Field Anzahl an Clustern, die aus der gewünschten hohen Genauigkeit der einzelnen Cluster resultiert, schnell Reset Modul berechnet werden können. LTM In dieser Arbeit wird ein Adaptive Resonance Theory Netz [5] verwendet, genauer ein Art2 Netz [1], da es die beiden er- zJ* sten Bedingungen erfüllt. Denn dieses neuronale Netz führt Input Representation Field ein nicht überwachtes Clustering aus, wobei es mit jedem Eingabevektor weiter lernt und gegebenenfalls neue Cluster I erschafft. Der genaue Aufbau dieses Systems wird in Kapitel 3 genauer dargestellt. I Preprocessing Field Zur Beschreibung des Bildes dienen SIFT Deskriptoren [9, 10], welche mit 128 Dimensionen einen sehr großen Raum für mögliche Kategorien aufspannen. Dadurch wächst die Kno- I0 tenanzahl innerhalb des Art2 Netzes rapide an, was zu einer Verlangsamung des Netzes führt. Deshalb wird die Art2-a Variante [2] verwendet, welche das Verhalten des Art2 Sys- Abbildung 1: Skizze eines Art2-a Systems tems approximiert. 
Dieses System hat die Vorteile, dass es zum Einen im Vergleich zu Art2 um mehrere Größenord- nungen schneller ist und sich zum Anderen gleichzeitig auch durchlaufen und der ähnlichste Knoten als Antwort gewählt. noch größtenteils parallelisieren lässt, wodurch ein weiterer Das Framework verwendet dabei das Feedback des Nutzers, Geschwindigkeitsgewinn erzielt werden kann. wodurch nach jeder Iteration das Ergebnis verfeinert wird. Dennoch zeigt sich, dass durch die hohe Dimension des Das neuronale Netz dient hier somit als Klassifikator. Vektors, die für die Berechnung der Kategorie benötigten [12] benutzt das Fuzzy Art neuronale Netz, um die Merk- Skalarprodukte, unter Berücksichtigung der hohen Anzahl malsvektoren zu klassifizieren. Sie schlagen dabei eine zweite an Knoten, weiterhin sehr rechenintensiv sind. Dadurch Bewertungsphase vor, die dazu dient, das Netz an ein er- steigt auch bei starker Parallelisierung, sofern die maximale wartetes Ergebnis anzupassen, das System damit zu über- Anzahl paralleler Prozesse begrenzt ist, die Zeit für die Bear- wachen und die Resultate des Netzes zu präzisieren. beitung eines neuen Vektors mit fortlaufendem Training kon- In [6] wird ein Radial Basis Funktion Netzwerk (RBF) als tinuierlich an. Aus diesem Grund wird in Kapitel 4 eine Er- Klassifikator verwendet. Eins ihrer Verfahren lässt dabei den weiterung des Systems vorgestellt, die die Menge der Kandi- Nutzer einige Bilder nach der Nähe zu ihrem Suchziel bew- daten möglicher Gewinnerknoten schon vor der teuren Be- erten, um diese Bewertung dann für das Training ihrer Net- rechnung des Skalarproduktes verkleinert. zwerke zu verwenden. Danach nutzen sie die so trainierten Der weitere Aufbau dieser Arbeit sieht dabei wie folgt aus: neuronalen Netze zur Bewertung aller Bilder der Datenbank. in dem folgenden Kapitel 2 werden einige ausgewählte An- Auch [11] nutzt ein Radial Basis Funktion Netz zur Suche sätze aus der Literatur genannt, in denen neuronale Netze nach Bildern und trainiert dieses mit der vom Nutzer angege- für das Image Retrieval verwendet werden. Um die Plausi- benen Relevanz des Bildes, wobei das neuronale Netz nach bilität der Erweiterung zu verstehen, werden dann in Kapi- jeder Iteration aus Bewertung und Suche weiter trainiert tel 3 die dafür benötigten Mechanismen und Formeln eines wird. Art2-a Systems vorgestellt. Kapitel 4 fokussiert sich danach In [3] wird ein Multiple Instance Netzwerk verwendet. auf die vorgeschlagene Erweiterung des bekannten Systems. Das bedeutet, dass für jede mögliche Klasse von Bildern In Kapitel 5 wird die Erweiterung evaluiert, um danach in ein eigenes neuronales Netz erstellt wird. Danach wird ein dem folgenden Kapitel eine Zusammenfassung des Gezeigten Eingabebild jedem dieser Subnetze präsentiert und gegebe- sowie einen Ausblick zu geben. nenfalls der dazugehörigen Klasse zugeordnet. 2. VERWANDTE ARBEITEN 3. ART2-A BESCHREIBUNG In diesem Kapitel werden einige Ansätze aus der Liter- In diesem Kapitel werden die benötigten Mechanismen atur vorgestellt, in denen neuronale Netze für verschiedene eines Art2-a Systems vorgestellt. Für das Verständnis sind Aufgabenstellungen im Image Retrieval verwendet werden. dabei nicht alle Funktionen des Systems nötig, weshalb zum Darunter fallen Themengebiete wie Clustering und Klassi- Beispiel auf die nähere Beschreibung der für das Lernen fikation von Bildern und ihren Merkmalsvektoren. benötigten Formeln und des Preprocessing Fields verzichtet Ein bekanntes Beispiel für die Verwendung von neuronalen wird. 
Für weiterführende Informationen über diese beiden Netzen im Image Retrieval ist das PicSOM Framework, wel- Punkte sowie generell über das Art2-a System sei deshalb ches in [8] vorgestellt wird. Dort werden TS-SOMs (Tree auf [1, 2] verwiesen. Structured Self Orienting Maps) für die Bildsuche verwen- Wie in Bild 1 zu sehen ist, besteht das System aus zwei det. Ein Bild wird dabei durch einen Merkmalsvektor darge- Subsystemen: einem Attentional Subsystem, in dem die Be- stellt. Diese Vektoren werden dann dem neuronalen Netz arbeitung und Zuordnung eines an den Eingang angelegten präsentiert, welches sie dann der Baumstruktur hinzufügt, Vektors ausgeführt wird, sowie einem Orienting Subsystem, so dass im Idealfall am Ende jedes Bild in der Baumstruk- welches die Ähnlichkeit des Eingabevektors mit der vorher tur repräsentiert wird. Bei der Suche wird der Baum dann gewählten Gewinnerkategorie berechnet und diese bei zu geringer Nähe zurücksetzt. unterliegen. Damit wird der Knoten J genau dann abge- Innerhalb des Category Representation Field F2 liegen die lehnt, wenn Knoten die für die einzelnen Vektorkategorien stehen. Dabei wird die Beschreibung der Kategorie in der Long Term Mem- T J < ρ∗ (4) ory (LTM) gespeichert, die das Feld F2 mit dem Input Rep- resentation Field F1 in beide Richtungen verbindet. gilt. Ist das der Fall, wird ein neuer Knoten aktiviert und Nach [2] gilt für den Aktivitätswert T von Knoten J in somit eine neue Kategorie erstellt. Mit 2 und 4 folgt damit, dem Feld F2 : dass ein Knoten nur dann ausgewählt werden kann, wenn für den Winkel θ zwischen dem Eingabevektor I und dem { ∑ gespeicherten LTM-Vektor zJ∗ α· n i=1 Ii , wenn J nicht gebunden ist, TJ = I · zJ∗ , wenn J gebunden ist. cos θ ≥ ρ∗ (5) Ii steht dabei für den durch das Preprocessing Field F0 berechneten Input in das Feld F1 und α ist ein wählbarer Pa- gilt. Da die einzelnen Rechnungen, die von dem System rameter, der klein genug ist, so dass die Aktivität eines unge- ausgeführt werden müssen, unabhängig sind, ist dieses Sys- bundenen Knotens für bestimmte Eingangsvektoren nicht tem hochgradig parallelisierbar, weshalb alleine durch Aus- immer größer ist als alle Aktivitätswerte der gebundenen nutzung dieser Tatsache die Berechnungszeit stark gesenkt Knoten. Hierbei gilt ein Knoten als gebunden, wenn ihm werden kann. Mit steigender Knotenanzahl lässt sich das mindestens ein Vektor zugeordnet wurde. System dennoch weiter optimieren, wie in dem folgenden Da der Aktivitätswert für alle nicht gebundenen Knoten Kapitel gezeigt werden soll. konstant ist und deshalb nur einmal berechnet werden muss, Das Art2-a System hat dabei allerdings einen Nachteil, ist dieser Fall für eine Effizienzsteigerung von untergeord- denn bedingt durch die Nutzung des Kosinus des Winkels netem Interesse und wird deshalb im Folgenden nicht weiter zwischen zwei Vektoren werden Vektoren, die linear abhäng- betrachtet. ig sind, in dieselbe Kategorie gelegt. Dieses Verhalten ist für Wie in [2] gezeigt wird, sind sowohl I als auch zJ∗ , durch die geforderte Genauigkeit bei den Kategorien unerwünscht. die Anwendung der euklidischen Normalisierung, Einheits- Dennoch lässt sich dieses Problem leicht durch die Erhebung vektoren, weshalb folglich weiterer Daten, wie zum Beispiel den Clustermittelpunkt, lösen, weshalb im Folgenden nicht weiter darauf eingegangen wird. ∥I∥ = ∥zJ∗ ∥ = 1 (1) gilt. Deshalb folgt für die Aktivitätswerte der gebunden 4. 
VORGENOMMENE OPTIMIERUNG Kategorieknoten: Dieses Kapitel dient der Beschreibung der verwendeten Abschätzung und ihrer Implementierung in das Art2-a Sys- TJ = I · zJ∗ tem. Abschließend wird dann noch auf eine weitere Verbes- = ∥I∥ · ∥zJ∗ ∥ · cos θ serung, die sich durch diese Implementierung ergibt, einge- gangen. Der Aufbau des Abschnitts ist dabei wie folgt: in = cos θ (2) Unterabschnitt 1 wird das Verfahren zur Abschätzung des Die Aktivität eines Knotens entspricht damit dem Winkel Winkels vorgestellt. In dem folgenden Unterabschnitt 2 wird zwischen dem Eingangsvektor I und dem LTM-Vektor zJ∗ . dann gezeigt, wie man diese Abschätzung in das Art2-a Sys- Damit der Knoten mit dem Index J gewählt wird, muss tem integrieren kann. In dem letzten Unterabschnitt folgt dann eine Vorstellung der Abschätzung als Index für die Knoten. TJ = max{Tj } 4.1 Abschätzung des Winkels j gelten, sprich der Knoten mit der maximalen Aktivität wird als mögliche Kategorie gewählt. Dabei wird bei Gleich- In [7] wird eine Methode zur Schätzung der Distanz zwis- heit mehrerer Werte der zu erst gefundene Knoten genom- chen einem Anfragevektor und einem Datenvektor beschrie- men. Die maximale Distanz, die das Resetmodul akzep- ben. Im Folgenden wird beschrieben, wie man Teile dieses tiert, wird durch den Schwellwert ρ, im Folgenden Vigilance Verfahrens nutzen kann, um damit die Menge möglicher Parameter genannt, bestimmt, mit dem die, für die Art2-a Knoten schon vor der Berechnung des Aktivitätswertes TJ Variante benötigte, Schwelle ρ∗ wie folgt berechnet wird: zu verringern. Das Ziel ist es, die teure Berechnung des Skalarproduktes zwischen I und zJ∗ möglichst selten auszu- führen und gleichzeitig möglichst wenig Daten im Speicher ρ2 (1 + σ)2 − (1 + σ 2 ) ρ∗ = vorrätig halten zu müssen. Dafür wird der unbekannte Win- 2σ kel θ zwischen den beiden Vektoren P und Q durch die mit bekannten Winkel α und β zwischen beiden Vektoren und einer festen Achse T wie folgt approximiert: cd σ≡ (3) 1−d cos θ ≤ cos (|α − β|) und c und d als frei wählbare Parameter des Systems, die der Beschränkung = cos (α − β) = cos α cos β + sin α sin β cd √ √ ≤1 = cos α cos β + 1 − cos α2 1 − cos β 2 (6) 1−d Damit die Bedingung (7) ausgenutzt werden kann, wird das Art2 System um ein weiteres Feld, im Folgenden Estima- F2 1 2 3 . . . n tion Field genannt, erweitert. Dieses Feld soll als Schnittstel- le zwischen F0 und F2 dienen und die Abschätzung des Winkels zwischen dem Eingabevektor und dem gespeicher- ten LTM Vektor vornehmen. Dazu wird dem Feld, wie in z *n Abbildung 2 gezeigt wird, von dem Feld F0 die Summe S I S übergeben. Innerhalb des Feldes gibt es dann für jeden ′ Knoten J im Feld F2 eine zugehörige Estimation Unit J . In der Verbindung von jedem Knoten J zu der ihm zugehörigen ′ Estimation I F0 Estimation Unit J wird ∗ die Summe der Werte des jeweili- Field S gen LTM Vektors S zJ als LTM gespeichert. Die Estimation Unit berechnet dann die Funktion = LTM √ ∗ ∗2 S I ∗ S zJ + (n − S I 2 )(n − S zJ ) Abbildung 2: Erweiterung des Art2 Systems mit f (J) = n einem neuen Feld für die Abschätzung des Winkels für den ihr zugehörigen Knoten J. Abschließend wird als Aktivierungsfunktion, für die Berechnung der Ausgabe oJ ′ ′ Als Achse T wird hierbei ein n-dimensionaler mit Einsen der Estimation Unit J , die folgende Schwellenfunktion ver- gefüllter Vektor verwendet, wodurch für die L2-Norm des wendet: √ Achsenvektors ∥T ∥ = n folgt. 
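Als kleine Skizze (die Funktionsnamen sind frei gewählt): die Schwelle ρ* und die Abschätzung f(J) lassen sich direkt aus den Komponentensummen S^I und S^{z*_J} berechnen.

import math

def rho_star(rho, c=0.1, d=0.9):
    # Art2-a-Schwelle rho* aus dem Vigilance-Parameter rho (Formel (3));
    # c = 0.1 und d = 0.9 sind die in der Evaluation verwendeten Werte.
    sigma = c * d / (1 - d)
    return (rho**2 * (1 + sigma)**2 - (1 + sigma**2)) / (2 * sigma)

def cosine_upper_bound(s_input, s_ltm, n):
    # f(J): obere Schranke für cos(theta) zwischen dem Einheitsvektor I und z*_J,
    # berechnet allein aus den Summen S^I und S^{z*_J}; max(..., 0.0) fängt lediglich
    # Rundungsfehler ab, analytisch gilt n - S^2 >= 0 für Einheitsvektoren.
    return (s_input * s_ltm
            + math.sqrt(max(n - s_input**2, 0.0) * max(n - s_ltm**2, 0.0))) / n

Ein Knoten J muss nur dann weiter betrachtet werden, wenn cosine_upper_bound(S_I, S_zJ, n) die Schwelle rho_star(rho) erreicht.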
Substituting this into the formula

cos θ = ⟨P, Q⟩ / (∥P∥ · ∥Q∥)

and exploiting (1), we obtain for the system with the vectors I and z_J*:

cos α = (∑_{i=1..n} I_i) / √n   and   cos β = (∑_{i=1..n} z_{J,i}*) / √n.

With S^I and S^{zJ*} denoting the respective sums of the vector components, the estimate of the cosine of the angle θ from formula (6) reduces to

cos θ ≤ S^I · S^{zJ*} / n + √( (1 − (S^I)²/n) · (1 − (S^{zJ*})²/n) )
      = ( S^I · S^{zJ*} + √( (n − (S^I)²) · (n − (S^{zJ*})²) ) ) / n.

This estimate now makes it possible to reduce the set of candidate nodes for an input vector I in advance, by exploiting the fact that the true angle between the input vector and the vector stored in the LTM is at most as large as the angle estimated by this formula. The estimate of the true angle θ is therefore lossless: no node with an actually larger cosine value can be removed from the candidate set. Consequently, a node only has to be considered further if the condition

( S^I · S^{zJ*} + √( (n − (S^I)²) · (n − (S^{zJ*})²) ) ) / n ≥ ρ*        (7)

is satisfied.

4.2 Extending the System

For the output o_J′ of the estimation unit J′, the following threshold function is used as activation function:

o_J′ = 1, if f(J) ≥ ρ*;   0, otherwise.        (8)

For the activity computation of every node of the F2 field, this yields the adapted formula

T_J = α · ∑_i I_i,   if J is not bound,
T_J = I · z_J*,      if J is bound and o_J′ = 1,
T_J = 0,             if o_J′ = 0,        (9)

with o_J′ being the output of the estimation field for node J.

4.3 Using the Estimate as an Index

The cosine estimate shown above avoids unnecessary scalar products and thus speeds up the system. With a further growing number of nodes, however – for example, because main memory no longer suffices – it may become necessary not to keep all LTM vectors in memory any more, but to load only a set of possible candidates and to analyze these selectively. The following section shows how the estimate can reasonably be used as an index for the nodes.

For indexing, a B+-tree is used as index structure, with the sum of the values of each LTM vector, S^{zJ*}, and the node ID J as a composite key. The sort order is: first by S^{zJ*}, then by J. The B+-tree thus remains optimized for partial range queries on the value of the sum. For this to work, however, the search has to be adapted so that, for a partial range query, it uses the smallest possible value for the ID and, on arriving at a leaf, follows the sort order – also across leaf boundaries – up to the first occurrence of the requested sum.

This index is now used to restrict the set of candidates without having to iterate over all nodes, as in the optimization via the estimation unit presented before. Intuitively, this means that the Art2-a system should only see those nodes that belong to the candidate set for the input vector I and has to search for the winner node only among them. For this purpose, possible value ranges of the stored S^{zJ*} have to be determined for an arbitrary input vector. This is again done with the help of condition (7):

( S^I · S^{zJ*} + √( (n − (S^I)²) · (n − (S^{zJ*})²) ) ) / n ≥ ρ*
√( (n − (S^I)²) · (n − (S^{zJ*})²) ) ≥ ρ* · n − S^I · S^{zJ*}

For ρ* · n − S^I · S^{zJ*} < 0 this inequality is obviously always satisfied, since the square root on the left-hand side is always non-negative. This yields the first condition:

S^{zJ*} > ρ* · n / S^I        (10)

The case ρ* · n ≥ S^I · S^{zJ*} is considered next:

√( (n − (S^I)²) · (n − (S^{zJ*})²) ) ≥ ρ* · n − S^I · S^{zJ*}
n · (1 − ρ*²) − (S^I)² ≥ (S^{zJ*})² − 2ρ* · S^I · S^{zJ*}
(n − (S^I)²) · (1 − ρ*²) ≥ (S^{zJ*} − ρ* · S^I)²

From this it follows that

√( (n − (S^I)²) · (1 − ρ*²) ) ≥ S^{zJ*} − ρ* · S^I ≥ −√( (n − (S^I)²) · (1 − ρ*²) )        (11)

With conditions (10) and (11), the partial range queries against the index for an arbitrary input vector I can now be formulated as:

r1 = [ ρ* · S^I − √( (n − (S^I)²)(1 − ρ*²) ),  ρ* · S^I + √( (n − (S^I)²)(1 − ρ*²) ) ]
r2 = [ ρ* · n / S^I,  ∞ ]

Since these range queries use condition (7), so that all estimated angles are larger than ρ*, the estimation field no longer has any effect when the index is used.

5. EVALUATION

In this chapter the estimate presented above is evaluated. The proposed index is not yet taken into account here.

5.1 Experimental Setup

For the evaluation of the presented approach, a computer with an Intel Core 2 Duo E8400 3 GHz processor and 4 GB RAM was used. As data set, images of airplanes from the Caltech 101 data set [4] were used; these images show various airplanes on the ground or in the air. For the speed test, 20 images were selected from the pool and presented to the neural network one after another. Since the approaches are lossless, only the computation time of the different methods is compared; no losses in result quality are to be expected. Moreover, possible parallelization is not considered further, because with a limited number of parallel processes the number of nodes per process grows with every additional image and thus slows the process down. As possible values for the threshold ρ, the two values 0.95 and 0.98, which are frequently mentioned in the literature, as well as the value 0.999 were used. For the remaining parameters required by formulas (3) and (9), c = 0.1, d = 0.9 and α = 0 were chosen.

Figure 3: Time measurements for ρ = 0.95

5.2 Results

For the smaller vigilance values of 0.95 and 0.98, Figures 3 and 4 show that the estimate brings hardly any benefit here; it is even slower than the original system. The reason is that the estimate, which uses only a single value – the sum – is far too imprecise at these vigilance values to filter out enough nodes, since almost all nodes lie above the bound. Because hardly any time is gained, the additional overhead makes the system slower. With an increasing vigilance parameter the benefit of the method grows, since the number of removed nodes increases significantly. This is clearly visible in Figure 5, which shows the computation time required for a value of 0.999. In this case the presented estimate filters out very many nodes, so that the time gained far outweighs the time lost to the larger overhead. Since categories that are as precise as possible are desired, a high vigilance parameter is the right choice; the presented method can therefore be adopted for the intended system.

6. SUMMARY AND OUTLOOK

In this paper an optimization of the Art2-a system was presented that, by estimating the angle between the input vector and the stored vector, can strongly reduce the set of candidates to be checked for high vigilance values. Furthermore, an approach for indexing the nodes based on the sum required for the estimate was presented.
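As an illustration of the candidate filtering derived in Sections 4.1 and 4.3 above, the following Python sketch computes the sum-based bound and the two range queries; a sorted list with binary search merely stands in for the B+-tree, and all names are illustrative:

import bisect
import math

def cos_upper_bound(s_i, s_z, n):
    # the value f(J) computed by the estimation field: an upper bound on
    # cos(theta) obtained from the component sums only, cf. condition (7)
    return (s_i * s_z + math.sqrt(max(0.0, (n - s_i ** 2) * (n - s_z ** 2)))) / n

def candidate_ranges(s_i, n, rho_star):
    # value ranges of stored sums S^{zJ*} that can still satisfy condition (7),
    # cf. conditions (10) and (11)
    w = math.sqrt(max(0.0, (n - s_i ** 2) * (1 - rho_star ** 2)))
    r1 = (rho_star * s_i - w, rho_star * s_i + w)
    r2 = (rho_star * n / s_i, float("inf"))
    return r1, r2

def candidate_nodes(index, s_i, n, rho_star):
    # index: list of (sum_of_ltm_vector, node_id) sorted by the sum -- a plain
    # stand-in for the B+-tree described above
    keys = [s for s, _ in index]
    hits = set()
    for lo, hi in candidate_ranges(s_i, n, rho_star):
        left = bisect.bisect_left(keys, lo)
        right = len(keys) if hi == float("inf") else bisect.bisect_right(keys, hi)
        hits.update(node for _, node in index[left:right])
    return hits
    # nodes whose stored sums fall outside both ranges cannot satisfy (7) and
    # would therefore be rejected by the reset test anyway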
Auch wenn eine abschlie- aus dem Pool ausgewählt und nacheinander dem neuronalen ßende Analyse des gezeigten noch offen ist, so scheint dieser Netz präsentiert. Im Schnitt produzieren die benutzten Bil- Ansatz dennoch erfolgversprechend für die erwünschten ho- der dabei 4871 SIFT Feature Vektoren pro Bild. hen Vigilance Werte. Bedingt dadurch, dass die Ansätze verlustfrei sind, wird Aufbauend auf dem gezeigten wird unsere weitere For- nur die Rechenzeit der verschiedenen Verfahren gegenüber schung die folgenden Punkte beinhalten: fortlaufendes Lernen braucht, um einem Objekt keine falschen neuen Kategorien zuzuweisen oder richtige Ka- tegorien zu entfernen. Danach soll ein geeignetes neu- ronales Netz aufgebaut werden, um damit die Zuord- nung der Kategorien zu den Objekten durchführen zu können. Das Netz muss dann an die vorher erhobenen Daten angepasst werden, um die Präzision des Netzes zu erhöhen. Abschließend wird das Verfahren dann gegen andere populäre Verfahren getestet. 7. REFERENZEN [1] G. A. Carpenter and S. Grossberg. Art 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, Abbildung 4: Zeitmessung für ρ = 0.98 26(23):4919–4930, 1987. [2] G. A. Carpenter, S. Grossberg, and D. B. Rosen. Art 2-a: an adaptive resonance algorithm for rapid category learning and recognition. In Neural Networks, volume 4, pages 493–504, 1991. [3] S.-C. Chuang, Y.-Y. Xu, H. C. Fu, and H.-C. Huang. A multiple-instance neural networks based image content retrieval system. In Proceedings of the First International Conference on Innovative Computing, Information and Control, volume 2, pages 412–415, 2006. [4] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object Abbildung 5: Zeitmessung für ρ = 0.999 categories, 2004. CVPR 2004, Workshop on Generative-Model Based Vision. [5] S. Grossberg. Adaptive pattern classification and • Es wird geprüft, ob die Abschätzung durch die Hinzu- universal recording: II. Feedback, expectation, nahme weiterer Daten verbessert werden kann und so- olfaction, illusions. Biological Cybernetics, 23:187–202, mit eine weitere Beschleunigung erzielt wird. Dafür 1976. kann man, um das Problem der zu geringen Präzision [6] B. Jyothi and D. Shanker. Neural network approach der Abschätzung bei kleinerem Vigilance Parameter for image retrieval based on preference elicitation. zu umgehen, die Vektoren teilen und die Abschätzung International Journal on Computer Science and wie in [7] aus den Teilsegmenten der Vektoren zusam- Engineering, 2(4):934–941, 2010. mensetzen. Dafür bräuchte man aber auch die Summe [7] Y. Kim, C.-W. Chung, S.-L. Lee, and D.-H. Kim. der Quadrate, da die Teilsegmente der Vektoren keine Distance approximation techniques to reduce the Einheitsvektoren mehr sind. Deshalb wird es sich noch dimensionality for multimedia databases. Knowledge zeigen, ob der Gewinn an Präzision durch eine Auftei- and Information Systems, 2010. lung den größeren Aufwand durch Berechnung und Speicherung weiterer Werte rechtfertigt. Des Weiteren [8] L. Koskela, , J. T. Laaksonen, J. M. Koskela, and soll damit überprüft werden, ob die Abschätzung auch E. Oja. Picsom a framework for content-based image für kleinere Vigilance Werte verwendet werden kann. database retrieval using self-organizing maps. In In 11th Scandinavian Conference on Image Analysis, • Es wird überprüft, wie groß die Auswirkungen der pages 151–156, 1999. 
vorgestellten Verfahren bei einer parallelen Berechnung [9] D. G. Lowe. Object recognition from local des Gewinnerknotens sind. Des Weiteren wird das scale-invariant features. In Proceedings of the Verfahren auf größeren Datenmengen getestet, um zu International Conference on Computer Vision, 1999. überprüfen, ob eine weitere Beschleunigung nötig ist, [10] D. G. Lowe. Distinctive image features from damit man das Verfahren im Live Betrieb verwenden scale-invariant keypoints. International Journal of kann. Computer Vision, 60:91–110, 2004. • Die Verwendung der Abschätzung zum Indexieren soll [11] K. N. S., Čabarkapa Slobodan K., Z. G. J., and R. B. getestet und mit anderen Indexierungsverfahren ver- D. Implementation of neural network in cbir systems glichen werden, um ihren Nutzen besser bewerten zu with relevance feedback. Journal of Automatic können. Aber vor allem ihre Auswirkungen auf das Control, 16:41–45, 2006. Art2-a System im parallelisierten Betrieb sind noch [12] H.-J. Wang and C.-Y. Chang. Semantic real-world offen und werden überprüft. image classification for image retrieval with fuzzy-art neural network. Neural Computing and Applications, • Danach werden wir die Analyseeinheit konzipieren. Da- 21(8):2137–2151, 2012. für wird als erstes überprüft, welche Daten man für ein Auffinden von Spaltenkorrelationen mithilfe proaktiver und reaktiver Verfahren Katharina Büchse Friedrich-Schiller-Universität Institut für Informatik Ernst-Abbe-Platz 2 07743 Jena katharina.buechse@uni-jena.de KURZFASSUNG Keywords Zur Verbesserung von Statistikdaten in relativen Datenbank- Anfrageoptimierung, Spaltenkorrelation, Feedback systemen werden seit einigen Jahren Verfahren für das Fin- den von Korrelationen zwischen zwei oder mehr Spalten 1. EINFÜHRUNG entwickelt. Dieses Wissen über Korrelationen ist notwen- dig, weil der Optimizer des Datenbankmanagementsystems Die Verwaltung großer Datenmengen benötigt zunehmend (DBMS) bei der Anfrageplanerstellung sonst von Unabhän- leistungsfähigere Algorithmen, da die Verbesserung der Tech- gigkeit der Daten ausgeht, was wiederum zu groben Fehlern nik (Hardware) nicht mit dem immer höheren Datenauf- bei der Kostenschätzung und somit zu schlechten Ausfüh- kommen heutiger Zeit mithalten kann. Bspw. werden wis- rungsplänen führen kann. senschaftliche Messergebnisse aufgrund besserer Messtech- Die entsprechenden Verfahren gliedern sich grob in proak- nik immer genauer und umfangreicher, sodass Wissenschaft- tive und reaktive Verfahren: Erstere liefern ein gutes Ge- ler sie detaillierter, aber auch umfassender analysieren wol- samtbild über sämtliche vorhandenen Daten, müssen dazu len und müssen, oder Online-Shops speichern sämtliche ihrer allerdings selbst regelmäßig auf die Daten zugreifen und be- Verkaufsdaten und werten sie aus, um dem Benutzer passend nötigen somit Kapazität des DBMS. Letztere überwachen zu seinen Interessen zeitnah und individuell neue Angebote und analysieren hingegen die Anfrageergebnisse und liefern machen zu können. daher nur Korrelationsannahmen für bereits abgefragte Da- Zur Verwaltung dieser wie auch anderer Daten sind (im ten, was einerseits das bisherige Nutzerinteresse sehr gut wi- Datenbankbereich) insbesondere schlaue Optimizer gefragt, derspiegelt, andererseits aber bei Änderungen des Workloads weil sie für die Erstellung der Anfragepläne (und somit für versagen kann. Dafür wird einzig bei der Überwachung der die Ausführungszeit einer jeden Anfrage) verantwortlich sind. 
Anfragen DBMS-Kapazität benötigt, es erfolgt kein eigen- Damit sie in ihrer Wahl nicht völlig daneben greifen, gibt ständiger Zugriff auf die Daten. es Statistiken, anhand derer sie eine ungefähre Vorstellung Im Zuge dieser Arbeit werden beide Ansätze miteinan- bekommen, wie die vorhandene Datenlandschaft aussieht. der verbunden, um ihre jeweiligen Vorteile auszunutzen. Da- Hierbei ist insbesondere die zu erwartende Tupelanzahl von zu werden die sich ergebenden Herausforderungen, wie sich Interesse, da sie in hohem Maße die Ausführungszeit einer widersprechende Korrelationsannahmen, aufgezeigt und als Anfrage beeinflusst. Je besser die Statistiken die Verteilung Lösungsansatz u. a. der zusätzliche Einsatz von reaktiv er- der Daten wiedergeben (und je aktueller sie sind), desto bes- stellten Statistiken vorgeschlagen. ser ist der resultierende Ausführungsplan. Sind die Daten unkorreliert (was leider sehr unwahrscheinlich ist), genügt es, pro zu betrachtender Spalte die Verteilung der Werte Categories and Subject Descriptors innerhalb dieser Spalte zu speichern. Treten in diesem Fall später in den Anfragen Kombinationen der Spalten auf, er- H.2 [Information Systems]: Database Management; H.2.4 gibt sich die zu erwartende Tupelanzahl mithilfe einfacher [Database Management]: Systems—Query processing statistischer Weisheiten (durch Multiplikation der Einzel- wahrscheinlichkeiten). Leider versagen diese ab einem bestimmten Korrelations- General Terms grad (also bei korrelierten Daten), und zwar in dem Sinne, Theory, Performance dass die vom Optimizer berechneten Schätzwerte zu stark von der Wirklichkeit abweichen, was wiederum zu schlech- ten Ausführungszeiten führt. Diese ließen sich u.U. durch die Wahl eines anderen Plans, welcher unter Berücksichtigung der Korrelation vom Optimizer erstellt wurde, verringern oder sogar vermeiden. Zur Veranschaulichung betrachten wir eine Tabelle, wel- th che u. a. die Spalten A und B besitzt, und eine Anfrage, 25 GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. welche Teile eben dieser Spalten ausgeben soll. Desweiteren Copyright is held by the author/owner(s). liege auf Spalte B ein Index, den wir mit IB bezeichnen wol- len, und es existiere ein zusammengesetzter Index IA,B für Daher gibt es zwei grundsätzliche Möglichkeiten: Entwe- beide Spalten. Beide Indizes seien im DBMS mithilfe von der schauen wir dem Benutzer auf die Finger und suchen Bäumen (bspw. B∗ -Bäume) implementiert, sodass wir auch in den von ihm abgefragten Daten nach Korrelationen (das (etwas informell) von flachen“ oder hohen“ Indizes spre- entspricht einer reaktiven Vorgehensweise), oder wir suchen ” ” ” chen können. uns selbst“ ein paar Daten der Datenbank aus“, die wir ” Sind beide Spalten unkorreliert, so lohnt sich in der Regel untersuchen wollen (und gehen somit proaktiv vor). Beide die Abfrage über IA,B . Bei einer starken Korrelation bei- Vorgehensweisen haben ihre Vor- und Nachteile. Während der Spalten dagegen könnte die alleinige Verwendung von im reaktiven Fall keine Daten speziell zur Korrelationsfin- IB vorteilhaft sein, und zwar wenn die Werte aus Spalte A dung angefasst“ werden müssen, hier aber alle Daten, die ” i.d.R. 
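As an illustration of how the independence assumption misestimates correlated columns, here is a small Python sketch with invented counts for the Auto(Firma, Marke) example discussed below:

def estimate_rows_independent(n_rows, selectivities):
    # cardinality estimate under the independence assumption:
    # multiply the single-column selectivities
    est = n_rows
    for sel in selectivities:
        est *= sel
    return est

# hypothetical table Auto(Firma, Marke) with 1,000,000 rows, 50 distinct
# Firma values and 500 distinct Marke values (uniform-distribution assumption)
n = 1_000_000
est = estimate_rows_independent(n, [1 / 50, 1 / 500])  # Firma='Opel' AND Marke='Zafira'
print(est)   # 40 rows estimated
# If Marke functionally determines Firma, the true count is n * 1/500 = 2000
# rows, i.e. the optimizer underestimates by a factor of 50 -- exactly the
# situation in which multi-column statistics or a correlation flag are needed.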
durch die Werte aus Spalte B bestimmt werden (ein bis zu einer bestimmten Anfrage nie abgefragt wurden, als typisches Beispiel, welches auch in CORDS [7] anzutreffen unkorreliert gelten, müssen wir für die proaktive Methode ist, wäre eine Tabelle Auto“ mit den Spalten A = Firma“ (also nur zum Feststellen, ob Korrelation vorherrscht) extra ” ” und B = Marke“, sodass sich für A Werte wie Opel“ oder Daten lesen, sind aber für (fast) alle Eventualitäten gewapp- ” ” Mercedes“ und für B Werte wie Zafira“ oder S-Klasse“ er- net. ” ” ” geben). Statt nun im vergleichsweise hohen Index IA,B erst Interessanterweise kann es vorkommen, dass beide Metho- passende A- und dann passende B-Werte zu suchen, werden den für ein und dieselbe Spaltenkombination unterschied- sämtliche Tupel, welche die gewünschten B-Werte enthalten, liche Ergebnisse liefern (der Einfachheit halber beschrän- über den flacheren Index IB geladen und überprüft, ob die ken wir uns hierbei auf die möglichen Ergebnisse korre- ” jeweiligen A-Werte der Anfrage entsprechen (was aufgrund liert“ oder unabhängig“). Für den Fall, dass die reaktive ” der Abhängigkeit der Regelfall sein sollte). Methode eine Spaltenkombination gar nicht betrachtet hat, sollte das klar sein. Aber nehmen wir an, dass die Kombi- Das Wissen über Korrelationen fällt aber natürlich nicht nation von beiden Methoden analysiert wurde. Da für die vom Himmel, es hat seinen Preis. Jeder Datenbänkler hofft, Analyse höchstwahrscheinlich jeweils unterschiedliche Tupel dass seine Daten unkorreliert sind, weil sein DBMS dann we- (Wertekombinationen) verwendet wurden, können sich na- niger Metadaten (also Daten über die Daten) speichern und türlich auch die Schlüsse unterscheiden. Hier stellt sich nun verwalten muss, sondern auf die bereits erwähnten statis- die Frage, welches Ergebnis besser“ ist. Dafür gibt es kei- ” tischen Weisheiten zurückgreifen kann. Sind die Daten da- ne allgemeine Antwort, gehen wir aber von einer modera- gegen (stark) korreliert, lässt sich die Erkenntnis darüber ten Änderung des Anfrageverhaltens aus, ist sicherlich das nicht so einfach wie die Unabhängigkeit mit (anderen) sta- reaktive Ergebnis“ kurzfristig entscheidender, während das ” tistischen Weisheiten verbinden und somit abarbeiten“. proaktive Ergebnis“ in die längerfristige Planung der Sta- ” ” Nicht jede (eher kaum eine) Korrelation stellt eine (schwa- tistikerstellung mit aufgenommen werden sollte. che) funktionale Abhängigkeit dar, wie es im Beispiel der Fall war, wo wir einfach sagen konnten Aus der Marke folgt ” die Firma (bis auf wenige Ausnahmen)“. Oft liebäugeln be- 2. GRUNDLAGEN stimmte Werte der einen Spalte mit bestimmten Werten an- Wie in der Einleitung bereits angedeutet, können Korrela- derer Spalten, ohne sich jedoch in irgendeiner Weise auf diese tionen einem Datenbanknutzer den Tag vermiesen. Um dies Kombinationen zu beschränken. (In Stuttgart gibt es sicher- zu verhindern, wurden einige Methoden vorgeschlagen, wel- lich eine Menge Porsches, aber die gibt es woanders auch.) che sich auf verschiedene Art und Weise dieser Problematik Außerdem ändern sie möglicherweise mit der Zeit ihre Vor- annehmen (z. B. [7, 6]) oder sie sogar ausnutzen (z. B. [4, 8]), lieben (das Stuttgarter Porschewerk könnte bspw. nach Chi- um noch an Performance zuzulegen. Letztere sind allerdings na umziehen) oder schaffen letztere völlig ab (wer braucht mit hohem Aufwand oder der Möglichkeit, fehlerhafte An- schon einen Porsche? Oder überhaupt ein Auto?). frageergebnisse zu liefern1 , verbunden. 
Daher konzentrieren Deswegen werden für korrelierte Daten zusätzliche Sta- wir uns hier auf das Erkennen von Korrelationen allein zur tistiken benötigt, welche nicht nur die Werteverteilung ei- Verbesserung der Statistiken und wollen hierbei zwischen ner, sondern die Werteverteilung mehrerer Spalten wiederge- proaktiven und reaktiven Verfahren unterscheiden. ben. Diese zusätzlichen Statistiken müssen natürlich irgend- wo abgespeichert und, was noch viel schlimmer ist, gewartet 2.1 Proaktive (datengetriebene) Verfahren werden. Somit ergeben sich zusätzlicher Speicherbedarf und Proaktiv zu handeln bedeutet, etwas auf Verdacht“ zu ” zusätzlicher Aufwand, also viel zu viel von dem, was keiner tun. Impfungen sind dafür ein gutes Beispiel – mithilfe ei- so richtig will. ner Impfung ist der Körper in der Lage, Krankheitserreger zu bekämpfen, aber in vielen Fällen ist unklar, ob er die- Da sich ein bisschen statistische Korrelation im Grunde se Fähigkeit jemals benötigen wird. Da Impfungen auch mit überall findet, gilt es, die Korrelationen ausfindig zu ma- Nebenwirkungen verbunden sein können, muss jeder für sich chen, welche unsere statistischen Weisheiten alt aussehen entscheiden, ob und wogegen er sich impfen lässt. lassen und dazu führen, dass das Anfrageergebnis erst nach Auch Datenbanken können geimpft“ werden, allerdings ” einer gefühlten halben Ewigkeit ausgeben wird. Ob letzte- handelt es sich bei langen Anfrageausführungszeiten (die res überhaupt passiert, hängt natürlich auch vom Anfrage- wir ja bekämpfen wollen) eher um Symptome (wie Bauch- verhalten auf die Datenbank ab. Wenn die Benutzer sich schmerzen oder eine laufende Nase), die natürlich unter- in ihren (meist mithilfe von Anwendungsprogrammen abge- schiedliche Ursachen haben können. Eine davon bilden ganz setzten) SQL-Anfragen in der WHERE-Klausel jeweils auf 1 eine Spalte beschränken und auf jedwede Verbünde (Joins) Da die Verfahren direkt in die Anfrageplanerstellung ein- verzichten, dann ist die Welt in Ordnung. Leider lassen sich greifen und dabei auf ihr Wissen über Korrelationen aufbau- Benutzer nur ungern so stark einschränken. en, muss, für ein korrektes Anfrageergebnis, dieses Wissen aktuell und vollständig sein. klar Korrelationen zwischen den Daten, wobei natürlich erst 2.2 Reaktive (anfragegetriebene) Verfahren ein gewisses Maß an Korrelation überhaupt als krankhaft“ Während wir im vorherigen Abschnitt Vermutungen auf- ” anzusehen ist. (Es benötigt ja auch eine gewisse Menge an gestellt und auf Verdacht gehandelt haben, um den Daten- Bakterien, damit eine Krankheit mit ihren Symptomen aus- bankbenutzer glücklich zu machen, gehen wir jetzt davon bricht.) Der grobe Impfvorgang“ gegen“ Korrelationen um- aus, dass den Benutzer auch weiterhin das interessieren wird, ” ” fasst zwei Schritte: wofür er sich bis jetzt interessiert hat. Wir ziehen also aus der Vergangenheit Rückschlüsse für 1. Es werden Vermutungen aufgestellt, welche Spalten- die Zukunft, und zwar indem wir den Benutzer bei seinem kombinationen für spätere Anfragen eine Rolle spielen Tun beobachten und darauf reagieren (daher auch die Be- könnten. zeichnung reaktiv“). Dabei achten wir nicht allein auf die ” 2. Es wird kontrolliert, ob diese Kombinationen von Kor- gestellten SQL-Anfragen, sondern überwachen viel mehr die relation betroffen sind oder nicht. von der Datenbank zurückgegebenen Anfrageergebnisse. Die- se verraten uns nämlich alles (jeweils 100-prozentig aktuell!) Entscheidend dabei ist, dass die Daten bzw. 
ein Teil der über den Teil der vorhandenen Datenlandschaft, den der Be- Daten gelesen (und analysiert) werden, und zwar ohne da- nutzer bis jetzt interessant fand. mit konkrete Anfragen zu bedienen, sondern rein zur Aus- Auf diese Weise können bspw. Statistiken erzeugt werden führung des Verfahrens bzw. der Impfung“ (in diesem Fall [5, 11, 3] (wobei STHoles [5] und ISOMER [11] sogar in ” gegen“ Korrelation, wobei die Korrelation natürlich nicht der Lage sind, mehrdimensionale Statistiken zu erstellen) ” beseitigt wird, schließlich können wir schlecht den Datenbe- oder es lassen sich mithilfe alter Anfragen neue, ähnliche stand ändern, sondern die Datenbank lernt, damit umzuge- Anfragen in ihrer Performance verbessern [12]. Sinnvoll kann hen). Das Lesen und Analysieren kostet natürlich Zeit, wo- auch eine Unterbrechung der Anfrageausführung mit damit mit klar wird, dass auch diese Impfung“ Nebenwirkungen“ verbundener Reoptimierung sein [9, 2, 10]. Zu guter letzt ” ” mit sich bringt. lässt sich mithilfe dieses Ansatzes zumindest herausfinden, Eine konkrete Umsetzung haben Ilyas et al., aufbauend welche Statistikdaten entscheidend sein könnten [1]. auf BHUNT [4], mit CORDS [7] vorgestellt. Dieses Verfah- In [1] haben Aboulnaga et al. auch schon erste Ansätze für ren findet Korrelationen zwischen Spaltenpaaren, die Spal- eine Analyse auf Spaltenkorrelation vorgestellt, welche spä- tenanzahl pro Spaltenkombination wurde also auf zwei be- ter in [6] durch Haas et al. ausgebaut und verbessert wurden. grenzt.2 In Analogie zu CORDS werden in [1] und [6] nur Spaltenpaa- Es geht folgendermaßen vor: Im ersten Impfschritt“ sucht re für die Korrelationssuche in Betracht gezogen. Allerdings ” es mithilfe des Katalogs oder mittels Stichproben nach Schlüs- fällt die Auswahl der infrage kommenden Spaltenpaare we- sel-Fremdschlüssel-Beziehungen und führt somit eine Art sentlich leichter aus, weil einfach alle Spaltenpaare, die in Rückabbildung von Datenbank zu Datenmodell durch (engl. den Anfragen (mit hinreichend vielen Daten3 ) vorkommen, reverse engineering“) [4]. Darauf aufbauend werden dann potentielle Kandidaten bilden. ” nur solche Spaltenkombinationen als für die Korrelationssu- Während in [1] pro auftretendes Wertepaar einer Spalten- che infrage kommend angesehen, deren Spalten kombination ein Quotient aus Häufigkeit bei Unabhängig- ” keit“ und tatsächliche Häufigkeit“ gebildet und das Spal- a) aus derselben Tabelle stammen oder ” tenpaar als korreliert“ angesehen wird, sobald zu viele die- ” b) aus einer Verbundtabelle stammen, wobei der Verbund ser Quotienten von einem gewissen Wert abweichen, setzen ( Join“) mittels (Un-) Gleichung zwischen Schlüssel- Haas et al. in [6] einen angepassten Chi-Quadrat-Test ein, ” um Korrelationen zu finden. Dieser ist etwas aufwendiger als und Fremdschlüsselspalten entstanden ist. die Vorgehensweise von [1], dafür jedoch nicht so fehleranfäl- Zudem gibt es zusätzliche Reduktionsregeln (engl. pruning lig [6]. Zudem stellen Haas et al. in [6] Möglichkeiten vor, wie ” rules“) für das Finden der Beziehungen und für die Aus- sich die einzelnen Korrelationswerte“ pro Spaltenpaar mit- ” wahl der zu betrachtenden Spaltenkombinationen. Schließ- einander vergleichen lassen, sodass, ähnlich wie in CORDS, lich kann die Spaltenanzahl sehr hoch sein, was die Anzahl eine Rangliste der am stärksten korrelierten Spaltenpaare an möglichen Kombinationen gegebenenfalls ins Unermess- erstellt werden kann. Diese kann als Entscheidungshilfe für liche steigert. 
das Anlegen zusätzlicher Statistikdaten genutzt werden. Im zweiten Impfschritt“ wird für jede Spaltenkombinati- ” on eine Stichprobe entnommen und darauf aufbauend eine Kontingenztabelle erstellt. Letztere dient dann wiederum als 3. HERAUSFORDERUNGEN Grundlage für einen Chi-Quadrat-Test, der als Ergebnis eine In [6] wurde bereits vorgeschlagen, dieses Verfahren mit Zahl χ2 ≥ 0 liefert. Gilt χ2 = 0, so sind die Spalten voll- CORDS zu verbinden. Das reaktive Verfahren spricht auf- ständig unabhängig. Da dieser Fall aber in der Praxis kaum grund seiner Effizienz für sich, während das proaktive Ver- auftritt, muss χ2 einen gewissen Schwellwert überschreiten, fahren eine gewisse Robustheit bietet und somit bei Lern- damit die entsprechende Spaltenkombination als korreliert phasen von [6] (wenn es neu eingeführt wird oder wenn sich angesehen wird. Zum Schluss wird eine Art Rangliste der die Anfragen ändern) robuste Schätzwerte zur Erstellung Spaltenkombinationen mit den höchsten χ2 -Werten erstellt eines Anfrageplans berechnet werden können [6]. Dazu soll- und für die obersten n Kombinationen werden zusätzliche te CORDS entweder in einem gedrosselten Modus während Statistikdaten angelegt. Die Zahl n ist dabei u. a. durch die des normalen Datenbankbetriebs laufen oder während War- Größe des Speicherplatzes (für Statistikdaten) begrenzt. tungszeiten ausgeführt werden. Allerdings werden in [6] kei- ne Aussagen darüber getroffen, wie die jeweiligen Ergebnis- 2 Die Begrenzung wird damit begründet, dass auf diese Weise 3 das beste Aufwand-Nutzen-Verhältnis entsteht. Das Verfah- Um aussagefähige Ergebnisse zu bekommen, wird ein ge- ren selbst ist nicht auf Spaltenpaare beschränkt. wisses Mindestmaß an Beobachtungen benötigt, insb. in [6]. se beider Verfahren miteinander kombiniert werden sollten. ders interessant sein könnten, die möglicherweise eben gera- Folgende Punkte sind dabei zu bedenken: de mit Korrelation einhergehen, spricht wiederum für eine Art Hinweis“ an den Optimizer. • Beide Verfahren liefern eine Rangliste mit den als am ” stärksten von Korrelation betroffenen Spalten. Aller- dings sind die den Listen zugrunde liegenden Korrela- 4. LÖSUNGSANSATZ ” tionswerte“ (s. bspw. χ2 im Abschnitt über proaktive Da CORDS wie auch das Verfahren aus [6] nur Spalten- Verfahren) auf unterschiedliche Weise entstanden und paare betrachten und dies mit einem sich experimentell erge- lassen sich nicht einfach vergleichen. Liefern beide Lis- benem Aufwand-Nutzen-Optimum begründen, werden auch ten unterschiedliche Spaltenkombinationen, so kann es wir uns auf Spaltenpaare begrenzen. Allerdings wollen wir passieren, dass eine Kombination, die in der eine Lis- uns für die Kombination von proaktiver und reaktiver Kor- te sehr weit unten erscheint, stärker korreliert ist, als relationssuche zunächst nicht auf diese beiden Verfahren be- Kombinationen, die auf der anderen Liste sehr weit schränken, müssen aber doch gewisse Voraussetzungen an oben aufgeführt sind. die verwendeten Verfahren (und das Datenmodell der Da- tenbank) stellen. Diese seien hier aufgezählt: • Die Daten, welche zu einer gewissen Entscheidung bei den beiden Verfahren führen, ändern sich, werden aber 1. Entscheidung über die zu untersuchenden Spaltenkom- in der Regel nicht gleichzeitig von beiden Verfahren ge- binationen: lesen. 
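A minimal Python sketch of this second step – building a contingency table from a sample and computing the χ² statistic; the sample values are invented, and real implementations (CORDS, [6]) add pruning rules, sampling bounds and an adapted test:

from collections import Counter

def chi_square(pairs):
    # Pearson chi-square statistic for a sample of (a, b) value pairs,
    # computed from their contingency table
    n = len(pairs)
    table = Counter(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n        # expected count under independence
            observed = table.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2

sample = [("Opel", "Zafira")] * 40 + [("Opel", "Corsa")] * 35 + \
         [("Mercedes", "S-Klasse")] * 25
print(chi_square(sample))   # clearly above 0: the column pair looks correlated
# chi2 == 0 would mean complete independence; in practice a threshold decides
# which pairs enter the ranking of the most strongly correlated column pairs.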
Das hängt damit zusammen, dass CORDS zu ei- nem bestimmten Zeitpunkt eine Stichprobe entnimmt • Das proaktive Verfahren betreibt reverse engi- und darauf seine Analyse aufbaut, während das Ver- ” neering“, um zu entscheiden, welche Spaltenkom- fahren aus [6] die im Laufe der Zeit angesammelten binationen untersucht werden sollen. Anfragedaten auswertet. • Das Datenmodell der Datenbank ändert sich nicht, • Da zusätzliche Statistikdaten Speicherplatz benötigen bzw. sind nur geringfügige Änderungen zu erwar- und vor allem gewartet werden müssen, ist es nicht ten, welche vom proaktiven Verfahren in das von sinnvoll, einfach für alle Spaltenkombinationen, die in ihm erstellte Datenmodell sukzessive eingearbei- der einen und/oder der anderen Rangliste vorkommen, tet werden können. Auf diese Weise können wir gleich zu verfahren und zusätzliche Statistiken zu er- bei unseren Betrachtungen den ersten Impfschritt“ ” stellen. vernachlässigen. Zur Verdeutlichung wollen wir die Tabelle aller Firmen- 2. Datengrundlage für die Untersuchung: wagen eines großen, internationalen IT-Unternehmens be- trachten, in welcher zu jedem Wagen u. a. seine Farbe und • Das proaktive Verfahren entnimmt für jegliche zu die Personal- sowie die Abteilungsnummer desjenigen Mitar- untersuchende Spaltenkombination eine Stichpro- beiters verzeichnet ist, der den Wagen hauptsächlich nutzt. be, welche mit einem Zeitstempel versehen wird. Diverse dieser Mitarbeiter wiederum gehen in einem Dres- Diese Stichprobe wird solange aufbewahrt, bis das dener mittelständischen Unternehmen ein und aus, welches Verfahren auf Unkorreliertheit“ plädiert oder für nur rote KFZ auf seinem Parkplatz zulässt (aus Kapazitäts- ” die entsprechende Spaltenkombination eine neue gründen wurde eine solche, vielleicht etwas seltsam anmu- Stichprobe erstellt wird. tende Regelung eingeführt). Da die Mitarbeiter sich dieser Regelung bei der Wahl ihres Wagens bewusst waren, fahren • Das reaktive Verfahren bedient sich eines Que- sie alle ein rotes Auto. Zudem sitzen sie alle in derselben ry-Feedback-Warehouses, in welchem die Beob- Abteilung. achtungen ( Query-Feedback-Records“) der An- ” Allerdings ist das internationale Unternehmen wirklich fragen notiert sind. sehr groß und besitzt viele Firmenwagen sowie unzählige Abteilungen, sodass diese roten Autos in der Gesamtheit der 3. Vergleich der Ergebnisse: Tabelle nicht auffallen. In diesem Sinne würde das proaktive Verfahren CORDS also (eher) keinen Zusammenhang zwi- • Beide Verfahren geben für jede Spaltenkombinati- schen der Abteilungsnummer des den Wagen benutzenden on, die sie untersucht haben, einen Korrelations- ” Mitarbeiters und der Farbe des Autos erkennen. wert“ aus, der sich innerhalb des Verfahrens ver- Werden aber häufig genau diese Mitarbeiter mit der Farbe gleichen lässt. Wie dieser genau berechnet wird, ihres Wagens abgefragt, z. B. weil sich diese kuriose Rege- ist für uns unerheblich. lung des mittelständischen Unternehmens herumspricht, es • Aus den höchsten Korrelationswerten ergeben sich keiner so recht glauben will und deswegen die Datenbank zwei Ranglisten der am stärksten korrelierten Spal- konsultiert, so könnte ein reaktives Verfahren feststellen, tenpaare, die wir unterschiedlich auswerten wol- dass beide Spalten korreliert sind. Diese Feststellung tritt len. 
insbesondere dann auf, wenn sonst wenig Anfragen an beide betroffenen Spalten gestellt werden, was durchaus möglich Zudem wollen wir davon ausgehen, dass das proaktive Ver- ist, weil sonst die Farbe des Wagens eine eher untergeordnete fahren in einem gedrosselten Modus ausgeführt wird und Rolle spielt. somit sukzessive seine Rangliste befüllt. (Zusätzliche War- Insbesondere der letztgenannte Umstand macht deutlich, tungszeiträume, bei denen das Verfahren ungedrosselt lau- dass es nicht sinnvoll ist, Statistikdaten für die Gesamtheit fen kann, beschleunigen die Arbeit und bilden somit einen beider Spalten zu erstellen und zu warten. Aber die Tat- schönen Zusatz, aber da heutzutage viele Datenbanken quasi sache, dass bestimmte Spezialfälle für den Benutzer beson- dauerhaft laufen müssen, wollen wir sie nicht voraussetzen.) Das reaktive Verfahren dagegen wird zu bestimmten Zeit- in der Rangliste des reaktiven Verfahrens, dann löschen wir punkten gestartet, um die sich bis dahin angesammelten Be- die reaktiv erstellten Statistiken und erstellen neue Statis- obachtungen zu analysieren, und gibt nach beendeter Ana- tiken mittels einer Stichprobe, analog zum ersten Fall. (Die lyse seine Rangliste bekannt. Da es als Grundlage nur die Kombination beider Statistiktypen wäre viel zu aufwendig, Daten aus dem Query-Feedback-Warehouse benötigt, kann u. a. wegen unterschiedlicher Entstehungszeitpunkte.) Wenn es völlig entkoppelt von der eigentlichen Datenbank laufen. das proaktive Verfahren dagegen explizit unkorreliert“ aus- ” gibt, bleibt es bei den reaktiv erstellten Statistiken, s. oben. Ist die reaktive Rangliste bekannt, kann diese mit der (bis dahin angefertigten) proaktiven Rangliste verglichen wer- Wenn jedoch nur das proaktive Verfahren eine bestimmte den. Tritt eine Spaltenkombination in beiden Ranglisten auf, Korrelation erkennt, dann ist diese Erkenntnis zunächst für so bedeutet das, dass diese Korrelation für die bisherigen An- die Benutzer unerheblich. Sei es, weil der Nutzer diese Spal- fragen eine Rolle gespielt hat und nicht nur auf Einzelfälle tenkombination noch gar nicht abgefragt hat, oder weil er beschränkt ist, sondern auch mittels Analyse einer repräsen- bis jetzt nur den Teil der Daten benötigt hat, der scheinbar tativen Stichprobe an Wertepaaren gefunden wurde. unkorreliert ist. In diesem Fall markieren wir nur im Daten- Unter diesen Umständen lassen wir mittels einer Stichpro- bankkatolog (wo die Statistiken abgespeichert werden) die be Statistikdaten für die betreffende Spaltenkorrelation er- beiden Spalten als korreliert und geben dem Optimizer somit stellen. Dabei wählen wir die Stichprobe des proaktiven Ver- ein Zeichen, dass hier hohe Schätzfehler möglich sind und fahrens, solange diese ein gewisses Alter nicht überschritten er deswegen robuste Pläne zu wählen hat. Dabei bedeutet hat. Ist sie zu alt, wird eine neue Stichprobe entnommen.4 robust“, dass der gewählte Plan für die errechneten Schätz- ” werte möglicherweise nicht ganz optimal ist, dafür aber bei Interessanter wird es, wenn nur eines der Verfahren auf stärker abweichenden wahren Werten“ immer noch akzep- ” Korrelation tippt, während das andere Verfahren die ent- table Ergebnisse liefert. Zudem können wir ohne wirklichen sprechende Spaltenkombination nicht in seiner Rangliste ent- Einsatz des reaktiven Verfahrens die Anzahl der Anfragen hält. 
Die Ursache dafür liegt entweder darin, dass letzteres zählen, die auf diese Spalten zugreifen und bei denen sich Verfahren die Kombination noch nicht analysiert hat (beim der Optimizer stark verschätzt hat. Übersteigt der Zähler reaktiven Verfahren heißt das, dass sie nicht oder zu selten einen Schwellwert, werden mithilfe einer neuen Stichprobe in den Anfragen vorkam), oder bei seiner Analyse zu dem (vollständige, also insb. mit Werteverteilung) Statistikdaten Ergebnis nicht korreliert“ gekommen ist. erstellt und im Katalog abgelegt. ” Diese Unterscheidung wollen wir insbesondere in dem Fall vornehmen, wenn einzig das reaktive Verfahren die Korre- Der Vollständigkeit halber wollen wir hier noch den Fall lation entdeckt“ hat. Unter der Annahme, dass weitere, erwähnen, dass eine Spaltenkombination weder in der einen, ” ähnliche Anfragen folgen werden, benötigt der Optimizer noch in der anderen Rangliste vorkommt. Es sollte klar sein, schnell Statistiken für den abgefragten Bereich. Diese sol- dass diese Kombination als unkorreliert“ angesehen und so- ” len zunächst reaktiv mithilfe der Query-Feedback-Records mit für die Statistikerstellung nicht weiter betrachtet wird. aus der Query-Feedback-Warehouse erstellt werden (unter Verwendung von bspw. [11], wobei wir nur zweidimensionale Statistiken benötigen). Das kann wieder völlig getrennt von 5. AUSBLICK der eigentlichen Datenbank geschehen, da nur das Query- Die hier vorgestellte Vorgehensweise zur Verbesserung der Feedback-Warehouse als Grundlage dient. Korrelationsfindung mittels Einsatz zweier unterschiedlicher Wir überprüfen nun, ob das proaktive Verfahren das Spal- Verfahren muss weiter vertieft und insbesondere praktisch tenpaar schon bearbeitet hat. Dies sollte anhand der Ab- umgesetzt und getestet werden. Vor allem muss ein passen- arbeitungsreihenfolge der infrage kommenden Spaltenpaare des Datenmodell für die reaktive Erstellung von Spalten- erkennbar sein. paarstatistiken gefunden werden. Das vorgeschlagene Ver- Ist dem so, hat das proaktive Verfahren das entsprechen- fahren ISOMER [11] setzt hier auf STHoles [5], einem Da- de Paar als unkorreliert“ eingestuft und wir bleiben bei den ” tenmodell, welches bei sich stark überschneidenden Anfra- reaktiv erstellten Statistiken, die auch nur reaktiv aktuali- gen schnell inperformant werden kann. Für den eindimen- siert werden. Veralten sie später zu stark aufgrund fehlender sionalen Fall wurde bereits von Informix-Entwicklern eine Anfragen (und somit fehlendem Nutzerinteresse), können sie performante Lösung vorgestellt [3], welche sich aber nicht gelöscht werden. einfach auf den zweidimensionalen Fall übertragen lässt. Ist dem nicht so, geben wir die entsprechende Kombina- tion an das proaktive Verfahren weiter mit dem Auftrag, Eine weitere, noch nicht völlig ausgearbeitete Herausfor- diese zu untersuchen.5 Beim nächsten Vergleich der Ranglis- derung bildet die Tatsache, dass das proaktive Verfahren im ten muss es für das betrachtete Spaltenpaar eine konkrete gedrosselten Modus läuft und erst sukzessive seine Rangliste Antwort geben. Entscheidet sich das proaktive Verfahren für erstellt. Das bedeutet, dass wir eigentlich nur Zwischener- korreliert“ und befindet sich das Spaltenpaar auch wieder ” gebnisse dieser Rangliste mit der reaktiv erstellten Ranglis- 4 te vergleichen. Dies kann zu unerwünschten Effekten füh- Falls die betroffenen Spalten einen Zähler besitzen, der bei ren, z. B. könnten beide Ranglisten völlig unterschiedliche Änderungsoperationen hochgezählt wird (vgl. z. B. 
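Purely as an illustration, the combination rules described above can be condensed into the following Python sketch; all identifiers and the misestimate threshold are invented and not part of the proposed approach:

def decide(pair, reactive_rank, proactive_rank, proactive_done, misestimates, limit=100):
    """Schematic decision for one column pair.

    reactive_rank / proactive_rank -- sets of pairs currently ranked as correlated
    proactive_done                 -- pairs the proactive method has already analyzed
    misestimates                   -- per-pair counter of badly estimated queries
    """
    in_r, in_p = pair in reactive_rank, pair in proactive_rank
    if in_r and in_p:
        return "build two-column statistics from a (sufficiently fresh) sample"
    if in_r and not in_p:
        if pair in proactive_done:                 # proactive verdict: uncorrelated
            return "keep the reactively built statistics only"
        return "build reactive statistics now and queue the pair for a proactive check"
    if in_p and not in_r:
        if misestimates.get(pair, 0) > limit:
            return "build full statistics from a new sample"
        return "only mark the pair as correlated in the catalog (choose robust plans)"
    return "treat the pair as uncorrelated"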
[1]), kön- Spaltenkombinationen enthalten, was einfach der Tatsache nen natürlich auch solche Daten mit in die Wahl der Stich- geschuldet ist, dass beide Verfahren unterschiedliche Spal- probe einfließen, allerdings sind hier unterschiedliche Aus- gangszeiten“ zu beachten. ” tenkombinationen untersucht haben. Um solche Missstände 5 Dadurch stören wir zwar etwas die vorgegebene Abarbei- zu vermeiden, muss die proaktive Abarbeitungsreihenfolge tungsreihenfolge der infrage kommenden Spaltenpaare, aber der Spaltenpaare überdacht werden. In CORDS wird bspw. der Fall ist ja auch dringend. als Reduktionsregel vorgeschlagen, nur Spaltenpaare zu be- trachten, die im Anfrageworkload vorkommen (dazu müssen [9] V. Markl, V. Raman, D. Simmen, G. Lohman, von CORDS nur die Anfragen, aber nicht deren Ergebnis- H. Pirahesh, and M. Cilimdzic. Robust query se betrachtet werden). Würde sich dann aber der Workload processing through progressive optimization. In ACM, dahingehend ändern, dass völlig neue Spalten oder Tabel- editor, Proceedings of the 2004 ACM SIGMOD len abgefragt werden, hätten wir dasselbe Problem wie bei International Conference on Management of Data einem rein reaktiven Verfahren. Deswegen muss hier eine 2004, Paris, France, June 13–18, 2004, pages 659–670. Zwischenlösung gefunden werden, die Spaltenkombinationen ACM Press, 2004. aus Anfragen bevorzugt behandelt“, sich aber nicht darauf [10] T. Neumann and C. Galindo-Legaria. Taking the edge ” beschränkt. off cardinality estimation errors using incremental Außerdem muss überlegt werden, wann wir Statistikda- execution. In BTW, pages 73–92, 2013. ten, die auf Stichproben beruhen, wieder löschen können. [11] U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and Im reaktiven Fall fiel die Entscheidung leicht aus, weil feh- T. M. Tran. ISOMER: Consistent histogram lender Zugriff auf die Daten auch ein fehlendes Nutzerinter- construction using query feedback. In ICDE, page 39. esse widerspiegelt und auf diese Weise auch keine Aktuali- IEEE Computer Society, 2006. sierung mehr stattfindet, sodass die Metadaten irgendwann [12] M. Stillger, G. Lohman, V. Markl, and M. Kandil. unbrauchbar werden. LEO - DB2’s learning optimizer. In Proceedings of the Basieren die Statistiken dagegen auf Stichproben, müs- 27th International Conference on Very Large Data sen sie von Zeit zu Zeit aktualisiert werden. Passiert diese Bases(VLDB ’01), pages 19–28, Orlando, Sept. 2001. Aktualisierung ohne zusätzliche Überprüfung auf Korrelati- on (welche ja aufgrund geänderten Datenbestands nachlas- sen könnte), müssen mit der Zeit immer mehr zusätzliche Statistikdaten über Spaltenpaare gespeichert und gewartet werden. Der für Statistikdaten zur Verfügung stehende Spei- cherplatz im Katalog kann so an seine Grenzen treten, au- ßerdem kostet die Wartung wiederum Kapazität des DBMS. Hier müssen sinnvolle Entscheidungen über die Wartung und das Aufräumen“ nicht mehr benötigter Daten getroffen wer- ” den. 6. REFERENCES [1] A. Aboulnaga, P. J. Haas, S. Lightstone, G. M. Lohman, V. Markl, I. Popivanov, and V. Raman. Automated statistics collection in DB2 UDB. In VLDB, pages 1146–1157, 2004. [2] S. Babu, P. Bizarro, and D. J. DeWitt. Proactive re-optimization. In SIGMOD Conference, pages 107–118. ACM, 2005. [3] E. Behm, V. Markl, P. Haas, and K. Murthy. Integrating query-feedback based statistics into informix dynamic server, Apr. 03 2008. [4] P. Brown and P. J. Haas. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. 
In VLDB 2003: Proceedings of 29th International Conference on Very Large Data Bases, September 9–12, 2003, Berlin, Germany, pages 668–679, 2003. [5] N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: a multidimensional workload-aware histogram. SIGMOD Rec., 30(2):211–222, May 2001. [6] P. J. Haas, F. Hueske, and V. Markl. Detecting attribute dependencies from query feedback. In VLDB, pages 830–841. ACM, 2007. [7] I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In ACM, editor, Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data 2004, Paris, France, June 13–18, 2004, pages 647–658, pub-ACM:adr, 2004. ACM Press. [8] H. Kimura, G. Huo, A. Rasin, S. Madden, and S. B. Zdonik. Correlation maps: A compressed access method for exploiting soft functional dependencies. PVLDB, 2(1):1222–1233, 2009. MVAL: Addressing the Insider Threat by Valuation-based Query Processing Stefan Barthel Eike Schallehn Institute of Technical and Business Information Institute of Technical and Business Information Systems Systems Otto-von-Guericke-University Magdeburg Otto-von-Guericke-University Magdeburg Magdeburg, Germany Magdeburg, Germany stefan.barthel@ovgu.de eike.schallehn@ovgu.de ABSTRACT by considering relational and SQL operations and describing The research presented in this paper is inspired by problems possible valuation derivations for them. of conventional database security mechanisms to address the insider threat, i.e. authorized users abusing granted privi- 2. PRINCIPLES OF DATA VALUATION leges for illegal or disadvantageous accesses. The basic idea In [1] we outlined our approach of a leakage-resistant data is to restrict the data one user can access by a valuation valuation which computes a monetary value (mval) for each of data, e.g. a monetary value of data items, and, based query. This is based on the following basic principles: Every on that, introducing limits for accesses. The specific topic attribute Ai ∈ R of a base relation schema R is valuated by of the present paper is the conceptual background, how the a certain monetary value (mval(Ai ) ∈ R). The attribute process of querying valuated data leads to valuated query valuation for base tables are part of the data dictionary and results. For this, by analyzing operations of the relational can for instance be specified as an extension of the SQL algebra and SQL, derivation functions are added. DDL: CREATE TABLE table_1 1. INTRODUCTION ( An acknowledged main threat to data security are fraud- attribute_1 INT PRIMARY KEY MVAL 0.1, ulent accesses by authorized users, often referred to as the attribute_2 UNIQUE COMMENT ’important’ MVAL 10, insider threat [2]. To address this problem, in [1] we pro- attribute_3 DATE posed a novel approach of detecting authorization misuse ); based on a valuation of data, i.e. of an assigned descrip- With these attribute valuations, we derive a monetary tion of the worth of data management in a system, which value for one tuple t ∈ r(R) given by Equation (1), as well could for instance be interpreted as monetary values. Ac- as the total monetary value of the relation r(R) given by cordingly, possible security leaks exist if users access more Equation (2), if data is extracted by a query. valuable data than they are allowed to within a query or cumulated over a given time period. 
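As an illustration of Equations (1) and (2) and of the two-threshold policy described in the following paragraphs, here is a minimal Python sketch; the attribute valuations are taken from the DDL example above, while the threshold values and all names are invented:

def mval_tuple(attr_mvals):
    # Eq. (1): value of one tuple = sum of the valuations of its attributes
    return sum(attr_mvals.values())

def mval_relation(attr_mvals, row_count):
    # Eq. (2): value of the extracted relation = number of rows * value per tuple
    return row_count * mval_tuple(attr_mvals)

# valuations of table_1 from the DDL above; attribute_3 carries no MVAL clause
# and is assumed here to contribute 0
table_1 = {"attribute_1": 0.1, "attribute_2": 10.0, "attribute_3": 0.0}

class MvalMonitor:
    """Alert-log sketch: cumulate the mval per user and compare it to the
    suspicious and truncation thresholds."""
    def __init__(self, suspicious=1_000.0, truncation=10_000.0):
        self.suspicious, self.truncation = suspicious, truncation
        self.alert_log = {}                       # user -> cumulated mval
    def register_query(self, user, query_mval):
        total = self.alert_log.get(user, 0.0) + query_mval
        self.alert_log[user] = total
        if total > self.truncation:
            return "truncate"                     # hide tuples, notify the user
        if total > self.suspicious:
            return "suspect"                      # user is categorized as suspect
        return "ok"

monitor = MvalMonitor()
q = mval_relation(table_1, 5000)                  # 5000 extracted rows -> 50500.0
print(q, monitor.register_query("clerk", q))      # 50500.0 truncate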
E.g., a bank account X manager accessing a single customer record does not repre- mval(t ∈ r(R)) = mval(Ai ) (1) sent a problem, while dumping all data in an unrestricted Ai ∈R query should be prohibited. Here, common approaches like X role-based security mechanisms typically fail. mval(r(R)) = mval(t) = |r(R)| ∗ mval(t ∈ r(R)) (2) According to our proposal, the data valuation is first of t∈r(R) all based on the relation definitions, i.e. as part of the data dictionary information about the value of data items such as To be able to consider the mval for a query as well as sev- attribute values and, derived from that, entire tuples and re- eral queries of one user over a certain period of time, we log lations. Then, a key question is how the valuation of a query all mvals in an alert log and compare the current cumulated result can be derived from the input valuations, because per- mval per user to two thresholds. If a user exceeds the first forming operations on the base data causes transformations threshold – suspicious threshold – she will be categorized as that have an impact on the data’s significance. suspect. After additionally exceeding the truncation thresh- This problem is addressed in the research presented here old her query output will be limited by hiding tuples and presenting a user notification. We embedded our approach in an additional layer in the security defense-in-depth model for raw data, which we have enhanced by a backup entity (see Fig. 1). Furthermore, encryption has to be established to prevent data theft via unauthorized, physical reads as well as backup theft. In this paper we are going into detail about how to handle operations like joins, aggregate func- tions, stored procedures as well as common functions. 25th GI-Workshop on Foundations of Databases (Grundlagen von Daten- banken), 28.05.2013 - 31.05.2013, Ilmenau, Germany. Most of the data stored in a database can be easily iden- Copyright is held by the author/owner(s). tified as directly interpretable. One example would be an Views by uncertainty has to be less valuable than properly set at- tribute values. Therefore, the monetary value should be Access Control only a percentage of the respective monetary value of an at- tribute value. If several source attribute values are involved, Data Valuation we recommend to value the computed attribute value as a Encryption DBMS Level percentage of the monetary value of all participating source Raw attribute values. In general, we suggest a maximum of 50% Data for both valuations. Furthermore, we need to consider the overall purpose of our leakage-resistant data valuation which Physical Level shall prevent extractions of large amounts of data. There- fore, the percentage needs to be increased with the amount of data, but not in a way that an unset or unknown attribute value becomes equivalent valuable than a properly set one. For that reason, exponential growth is not a suitable option. Backup Additionally, we have to focus a certain area of application, Data because a trillion attributes (1012 ) are conceivable whereas a Encryption septillion attributes (1024 ) are currently not realistic. From Backup the overall view on our data valuation, we assume depending on the application, that the extraction of sensitive data be- Figure 1: Security defense model on DBMS and comes critical when 103 up to 109 attribute values will be ex- physical level tracted. 
Therefore, the growth of our uncertainty factor UF increases much more until 109 attribute values than after- wards, which predominantly points to a logarithmic growth. employee-table, where each tuple has a value for attributes We also do not need to have a huge difference of the factor if "first name", "surname" and "gender". In this case, it is theoretically much more attribute values shall be extracted also quite easy to calculate the monetary value for a query (e.g., 1014 and more), because with respect to an extraction (r(Remp )) by simply summarizing all mval per attribute and limiting approach, it is way too much data to return. This multiply those with the number of involved rows (see Eq. assumption does also refer to a logarithmic increase. We (3)). conclude that the most promising formula that was adapted to fit our needs is shown in Eq. (4). X mval(r(Remp )) = mval(Ai ) ∗ |r(Remp )| (3) 1 Ai ∈Remp UF = log10 (|valAi ,...,Ak | + 1) (4) 30 However, it becomes more challenging if an additional at- tribute "license plate number" is added, which does have some unset or unknown attribute values - in most cases NULL values. By knowing there is a NULL value for a 3. DERIVING VALUATIONS FOR certain record, this could be interpreted as either simply un- known whether there is any car or unset because this person DATABASE OPERATIONS has no car. So there is an uncertainty that could lead to an In this chapter we will describe valuation derivation for information gain which would be uncovered if no adequate main database operations by first discussing core relational valuation exists. Some other potentially implicit informa- operations. Furthermore, we address specifics of join oper- tion gains are originated from joins and aggregate functions ations and finally functions (aggregate, user-defined, stored which we do mention in the regarding section. procedures) which are defined in SQL. Because the terms information gain and information loss are widely used and do not have a uniform definition, we do define them for further use. We call a situation where an 3.1 Core Operations of Relational Algebra attacker received new data (resp. information) information The relational algebra [4] consists of six basic operators, gain and the same situation in the view of the data owner where selection, projection, and rename are unary opera- an information loss. tions and union, set difference, and Cartesian product are operators that take two relations as input (binary opera- Uncertainty Factor tion). Due to the fact that applying rename to a relation or Some operators used for query processing obviously reduce attribute will not change the monetary value, we will only the information content of the result set (e.g. selection, ag- consider the rest. gregations, semi joins, joins with resulting NULL values), but there is still an uncertain, implicit information gain. Since, the information gain by uncertainty is blurry, mean- Projection ing in some cases more indicative than in others, we have The projection πattr_list (r(R)) is a unary operation and to distinguish uncertainty of one attribute value generated eliminates all attributes (columns) of an input relation r(R) out of one source attribute value (e.g., generated NULL val- except those mentioned in the attribute list. For computa- ues) and attribute values which are derived from informa- tion of the monetary value of such a projection, only mval tion of several source attribute values (e.g., aggregations). 
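A small sketch of the uncertainty factor of Eq. (4) above; the input sizes are chosen only to show the growth behaviour:

import math

def uncertainty_factor(num_attribute_values):
    # Eq. (4): UF = 1/30 * log10(|val| + 1), a slowly growing factor for the
    # value ranges discussed above (10^3 .. 10^9 extracted attribute values)
    return math.log10(num_attribute_values + 1) / 30.0

for k in (10**3, 10**6, 10**9, 10**14):
    print(k, round(uncertainty_factor(k), 3))
# 10^3 -> ~0.1, 10^6 -> ~0.2, 10^9 -> ~0.3, 10^14 -> ~0.467: the factor keeps
# growing, but an uncertain value never reaches the weight of a properly set one.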
for chosen attributes of the input relation are considered In case of one source attribute value, an information gain while taking into account that a projection may eliminate duplicates (shown in Eq. (5)). fully aware that by a user mistake, e.g. using cross join instead of natural join, thresholds will be exceeded and the X k user will be classified as potentially suspicious. However, we mval(πAj ,..,Ak (r(R))) = mval(Ai ) ∗ |πAj ,..,Ak (r(R))| recommend a multiplication of the monetary value of both i=j source relations instead of a summation due to the fact that (5) the calculation of the monetary value needs to be consistent also by combining different operators. For that reason, by Selection following our recommendation, we ensure that an inner join According to the relational algebra, a selection of a certain is valuated with the same monetary value as the respective relation σpred r(R) reduces tuples to a subset which satisfy combination of a cross join (Cartesian product) and selection specified predicates. Because the selection reduces the num- on the join condition. ber of tuples, the calculation of the monetary value does not have to consider those filtered tuples and only the number mval(r(R1 × R2 )) = of present tuples are relevant (shown in Eq. (6)). mval(t ∈ r(R1 )) ∗ |r(R1 )| + mval(t ∈ r(R2 )) ∗ |r(R2 )| (9) mval(σpred (r(R))) = mval(t ∈ r(R)) ∗ |σpred (r(R))| (6) Set Union 3.2 Join Operations In the context of relational databases, a join is a binary A relation of all distinct elements (resp. tuples) of any two operation of two tables (resp. data sources). The result set relations is called the union (denoted by ∪) of those re- of a join is an association of tuples from one table with tuples lations. For performing set union, the two involved rela- from another table by concatenating concerned attributes. tions must be union-compatible – they must have the same Joining is an important operation and most often perfor- set of attributes. In symbols, the union is represented as mance critical to certain queries that target tables whose R1 ∪ R2 = {x : x ∈ R1 ∨ x ∈ R2 }. However, if two re- relationships to each other cannot be followed directly. Be- lations contain identical tuples, within a resulting relation cause the type of join affects the number of resulting tuples these tuples do only exist once, meaning duplicates are elim- and their attributes, the monetary value of each join needs inated. Accordingly, the mval of a union of two relations is to be calculated independently. computed by adding mval of both relations, subtracted with mval of duplicates (shown in Eq. (7)). Inner Join mval(R1 ∪ R2 ) = mval(r(R1 ))+ An inner join produces a result table containing composite X rows of involved tables that match some pre-defined, or ex- mval(r(R2 )) − mval(ti ∈ r(R1 ∩ R2 ) (7) plicitly specified, join condition. This join condition can be i any simple or compound search condition, but does not have to contain a subquery reference. The valuation of an inner Set Difference join is computed by the sum of the monetary values of all The difference of relations R1 and R2 is the relation that attributes of a composite row multiplied by the number of contains all the tuples that are in R1 , but do not belong to rows within the result set. Because the join attribute Ajoin R2 . The set difference is denoted by R1 − R2 or R1 \R2 and of two joined tables has to be counted only once, we need defined by R1 \R2 = {x : x ∈ R1 ∧ x ∈ / R2 }. Also, the set to subtract it (shown in Eq. (10)). 
3.2 Join Operations
In the context of relational databases, a join is a binary operation on two tables (resp. data sources). The result set of a join associates tuples from one table with tuples from another table by concatenating the concerned attributes. Joining is an important operation and most often performance-critical for queries that target tables whose relationships to each other cannot be followed directly. Because the type of join affects the number of resulting tuples and their attributes, the monetary value of each join needs to be calculated independently.

Inner Join
An inner join produces a result table containing composite rows of the involved tables that match some pre-defined, or explicitly specified, join condition. This join condition can be any simple or compound search condition, but does not have to contain a subquery reference. The valuation of an inner join is computed as the sum of the monetary values of all attributes of a composite row, multiplied by the number of rows within the result set. Because the join attribute A_join of the two joined tables has to be counted only once, we need to subtract it (shown in Eq. (10)).

mval(r(R1 ⋈ R2)) = |r(R1 ⋈ R2)| · ( mval(t ∈ r(R1)) + mval(t ∈ r(R2)) − mval(A_join) )   (10)

Outer Join
An outer join does not require matching records for each tuple of the concerned tables. The joined result table retains all rows from at least one of the tables mentioned in the FROM clause, as long as those rows are consistent with the search condition. Outer joins are subdivided further into left, right, and full outer joins. The result set of a left outer join (or left join) includes all rows of the first mentioned table (left of the join keyword), merged with the attribute values of the right table where the join attribute matches. In case there is no match, the attributes of the right table are set to NULL. The right outer join (or right join) returns the rows that have data in the right table, even if there are no matching rows in the left table, enhanced by the attributes (with NULL values) of the left table. A full outer join is used to retain the non-matching information of all affected tables by including the non-matching rows in the result set. To cumulate the monetary value for a query that contains a left or right outer join, we only need to compute the monetary value of an inner join of both tables and add the mval of an antijoin r(R1 ▷ R2) ⊆ r(R1), which includes only the tuples of R1 that do not have a join partner in R2 (shown in Eq. (11)). For the monetary value of a full outer join, we additionally consider an antijoin r(R2 ▷ R1) ⊆ r(R2), which includes the tuples of R2 that do not have a join partner, as given by Equation (12).

mval(r(R1 ⟕ R2)) = mval(r(R1 ⋈ R2)) + mval(r(R1 ▷ R2))   (11)

mval(r(R1 ⟗ R2)) = mval(r(R1 ⋈ R2)) + mval(r(R1 ▷ R2)) + mval(r(R2 ▷ R1))   (12)

Semi Join
A semi join is similar to the inner join, but with the addition that only the attributes of one relation are represented in the result set. Semi joins are subdivided further into left and right semi joins. The left semi join operator returns each row from the first input relation (left of the join keyword) when there is a matching row in the second input relation (right of the join keyword). The right semi join is computed vice versa. The monetary value for a query that uses semi joins can easily be cumulated by multiplying the sum of the monetary values of the included attributes with the number of matching rows of the resulting relation (shown in Eq. (13)).

mval(r(R1 ⋉ R2)) = ( Σ_{A_i ∈ R1} mval(A_i) ) · |r(R1 ⋉ R2)|   (13)

Nevertheless, we do have an information gain by knowing which join attributes of R1 have join partners within R2, which is not considered here. But adding our uncertainty factor UF to this equation would lead to an inconsistency when comparing the cumulated mval of a semi join with the mval of a combination of a natural join and a projection. In future work, we will solve this issue by presenting a calculation that is based on a combination of projections and joins to cover such an implicit information gain.
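As an illustration, the join valuations of Eqs. (10)–(12) could be computed as in the following Python sketch; per-attribute values and row counts are assumed to be known, and the names are ours rather than the paper's.

    def mval_inner_join(n_result_rows, attrs_r1, attrs_r2, join_attr, attr_mval):
        # Eq. (10): composite-row valuation (join attribute counted once)
        # multiplied by the number of rows in the join result
        per_row = (sum(attr_mval[a] for a in attrs_r1)
                   + sum(attr_mval[a] for a in attrs_r2)
                   - attr_mval[join_attr])
        return n_result_rows * per_row

    def mval_left_outer_join(mval_inner, mval_antijoin_r1):
        # Eq. (11): inner-join valuation plus the valuation of the unmatched R1 tuples
        return mval_inner + mval_antijoin_r1

    def mval_full_outer_join(mval_inner, mval_antijoin_r1, mval_antijoin_r2):
        # Eq. (12): additionally add the valuation of the unmatched R2 tuples
        return mval_inner + mval_antijoin_r1 + mval_antijoin_r2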
3.3 Aggregate Functions
In computer science, an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning. The SQL aggregate functions are useful when mathematical operations must be performed on all or on a group of values. For that reason, they are frequently used with the GROUP BY clause within a SELECT statement. According to the SQL standard, the following aggregate functions are implemented in most DBMS and are the ones used most often: COUNT, AVG, SUM, MAX, and MIN. All aggregate functions are deterministic, i.e. they return the same value any time they are called with the same set of input values. SQL aggregate functions return a single value, calculated from the values within one column of an arbitrary relation [10]. However, it should be noted that, except for COUNT, these functions return a NULL value when no rows are selected. For example, the function SUM performed on no rows returns NULL, not zero as one might expect. Furthermore, except for COUNT, aggregate functions ignore NULL values during computation. All aggregate functions are defined in the SQL:2011 standard, or ISO/IEC 9075:2011 (under the general title "Information technology – Database languages – SQL"), which is the seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language.

To be able to compute the monetary value of a derived, aggregated attribute, we need to consider two more factors. First of all, we divide the aggregate functions into two groups: informative and conservative.

1. Informative are those aggregate functions where the aggregated value leads to an information gain over the entire input of all attribute values. This means that every single attribute value participates in the computation of the aggregated attribute value. Representatives of informative aggregate functions are COUNT, AVG and SUM.

2. Conservative, on the contrary, are those functions where the aggregated value is represented by only one attribute value, but in consideration of all other attribute values. So if the aggregated value is again separated from the input set, all other attribute values remain. Conservative aggregate functions are MAX and MIN.

The second factor that needs to be considered is the number of attributes that are used to compute the aggregated values. In the case of a conservative aggregate function this is simple, because only one attribute value is part of the output. For that reason we recommend leaving the mval of the source attribute unchanged (shown in Eq. (14)).

mval(A_i) = mval(MAX(A_i)) = mval(MIN(A_i))   (14)

For the informative aggregate functions the computation is more challenging due to the several participating attribute values. Because several input attribute values are concerned, we recommend the usage of our uncertainty factor, which we already mentioned in a prior section. With the uncertainty factor it is possible to integrate the number of attribute values in such a way that a higher number of concerned attributes leads to a percentage increase of the monetary value of the aggregated attribute value, as given by Equation (15).

mval(COUNT(A_i)) = mval(SUM(A_i)) = mval(AVG(A_i)) = (1/30) · log10(|A_i| + 1) · mval(A_i)   (15)
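In Python, the two valuation rules for aggregates can be written as the following sketch, a direct transcription of Eqs. (14) and (15); the function names are ours.

    import math

    def mval_conservative_aggregate(attr_mval):
        # Eq. (14): MAX/MIN expose only one source value, so the valuation stays unchanged
        return attr_mval

    def mval_informative_aggregate(attr_mval, n_input_values):
        # Eq. (15): COUNT/SUM/AVG are scaled with the uncertainty factor
        # (1/30) * log10(|Ai| + 1), i.e. more participating values raise the valuation
        return (1.0 / 30.0) * math.log10(n_input_values + 1) * attr_mval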
3.4 Scalar Functions
Besides the SQL aggregate functions, which return a single value calculated from the values in a column, there are also scalar functions defined in SQL that return a single value based on the input value. The possibly most commonly used and well-known scalar functions are:

• UCASE() – converts a field to upper case
• LCASE() – converts a field to lower case
• LEN() – returns the length of a text field
• ROUND() – rounds a number to a specified degree
• FORMAT() – formats how a field is to be displayed

The returned values of these scalar functions are always derived from one source attribute value, and some of them do not even change the main content of the attribute value. Therefore, we recommend that the monetary value of the source attribute stays untouched.

3.5 User-Defined Functions
User-defined functions (UDF) are subroutines made up entirely of one or several SQL or programming extension statements that can be used to encapsulate code for reuse. Most database management systems (DBMS) allow users to create their own user-defined functions and do not limit them to the built-in functions of their SQL programming language (e.g., T-SQL, PL/SQL, etc.). User-defined functions in most systems are created by using the CREATE FUNCTION statement, and users other than the owner must be granted appropriate permissions on a function before they can use it. Furthermore, UDFs can be either deterministic or nondeterministic. A deterministic function always returns the same results if the input is the same, whereas a nondeterministic function may return different results every time it is called.

Given the multiple possibilities offered by most DBMS, it is impossible to estimate all feasible results of a UDF. Also, due to several features like shrinking, concatenating, and encrypting of return values, a valuation of a single output value or an array of output values is practically impossible. For this reason we decided not to calculate the monetary value depending on the output of a UDF; instead, we consider the attribute values that are passed to the UDF (shown in Eq. (16)). This assumption is also the most reliable one, because no matter what happens inside the UDF – like a black box – the information loss cannot get worse than that of its input.

mval(UDF_output(A_a, .., A_g)) = mval(UDF_input(A_k, .., A_p)) = Σ_{i=k}^{p} mval(A_i)   (16)
3.6 Stored Procedures
Stored procedures (SP) are stored similarly to user-defined functions (UDF) within a database system. The major difference is that stored procedures have to be called, whereas the return values of UDFs are used in other SQL statements in the same way pre-installed functions are used (e.g., LEN, ROUND, etc.). A stored procedure, which depending on the DBMS is also called proc, sproc, StoredProc or SP, is a group of SQL statements compiled into a single execution plan [13] and mostly developed for applications that need easy access to a relational database system. Furthermore, SPs combine and provide logic for extensive or complex processing that requires the execution of several SQL statements, which previously had to be implemented in an application. Also, nesting of SPs is feasible by executing one stored procedure from within another. Typical uses of SPs are data validation (integrated into the database) or access control mechanisms [13].

Because stored procedures have such a complex structure, nesting is legitimate, and SPs are "only" a group of SQL statements, we recommend to value each single statement within an SP and to sum up all partial results (shown in Eq. (17)). With this assumption we follow the principle that single SQL statements are moved into stored procedures to provide simple access for applications, which only need to call the procedures.

mval(SP(r(R_j), .., r(R_k))) = Σ_{i=j}^{k} mval(r(R_i))   (17)

Furthermore, by summing all partial results, we make sure that the worst case of information loss is considered, entirely in line with our general idea of a leakage-resistant data valuation that should prevent a massive data extraction. However, since SPs represent a completed unit, on reaching the truncate threshold the whole SP will be blocked and rolled back. For that reason, we recommend smaller SPs, resp. splitting existing SPs, in a DBS with leakage-resistant data valuation enabled.
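A hedged sketch of this rule in Python: the valuation of a stored procedure is the sum of the valuations of its single statements (Eq. (17)). The threshold check only mirrors the recommendation that an SP exceeding the truncate threshold is blocked as a whole; the parameter name is illustrative, not taken from the paper.

    def mval_stored_procedure(statement_mvals, truncate_threshold=None):
        # Eq. (17): sum the monetary values of all statements of the SP
        total = sum(statement_mvals)
        if truncate_threshold is not None and total > truncate_threshold:
            # the whole SP is treated as one unit: block it and roll back
            raise PermissionError("truncate threshold exceeded - SP blocked and rolled back")
        return total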
4. RELATED WORK
Conventional database management systems mostly use access control models to face unauthorized access to data. However, these are insufficient when an authorized individual extracts data, regardless of whether she is the owner of the account or has stolen it. Several methods were conceived to eliminate these weaknesses. We refer to Park and Giordano [14], who give an overview of the requirements needed to address the insider threat.

Authorization views partially achieve those crucial goals of an extended access control and have been proposed several times. For example, Rizvi et al. [15] as well as Rosenthal et al. [16] use authorization-transparent views. In detail, incoming user queries are only admitted if they can be answered using information contained in authorization views. Contrary to this, we do not prohibit a query in its entirety. Another approach based on views was introduced by Motro [12]. Motro handles only conjunctive queries and answers a query only with a part of the result set, but without any indication why it is partial. We, in contrast, handle information-enhancing operations (e.g., joins) as well as coarsening operations (e.g., aggregation), and we do display a user notification. All authorization view approaches require an explicit definition of a view for each possible access need, which also imposes the burden of knowing and directly querying these views. In contrast, the monetary values of attributes are set while defining the tables, and the user can query the tables or views she is used to. Moreover, the equivalence test of general relational queries is undecidable, and equivalence for conjunctive queries is known to be NP-complete [3]. Therefore, the leakage-resistant data valuation is more applicable, because it does not have to face those challenges.

However, none of these methods considers the sensitivity level of the data that is extracted by an authorized user. In the field of privacy-preserving data publishing (PPDP), on the contrary, several methods are provided for publishing useful information while preserving data privacy. In detail, multiple security-related measures (e.g., k-anonymity [17], l-diversity [11]) have been proposed, which aggregate information within a data extract in such a way that it cannot lead to the identification of a single individual. We refer to Fung et al. [5], who give a detailed overview of recent developments in methods and tools of PPDP. However, these mechanisms are mainly used for privacy-preserving tasks and are not in use when an insider accesses data. They are not applicable to our scenario, because they consider neither a line-by-line extraction over time nor the information loss caused by aggregating attributes.

To the best of our knowledge, there is only the approach of Harel et al. ([6], [7], [8]) that is comparable to our data valuation to prevent suspicious, authorized data extractions. Harel et al. introduce the Misuseability Weight (M-score) that describes the sensitivity level of the data exposed to the user. Hence, Harel et al. focus on the protection of the quality of information, whereas our approach predominantly prevents the extraction of a collection of data (quantity of information). Harel et al. also do not consider extractions over time, logging of malicious requesters, or the backup process. In addition, mapping attributes to a certain monetary value is much more applicable and intuitive than mapping them to an artificial M-score.

Our extended authorization control does not limit the system to a simple query-authorization control without any protection against the insider threat; rather, we allow a query to be executed whenever the information carried by the query is legitimate according to the specified authorizations and thresholds.

5. CONCLUSIONS AND FUTURE WORK
In this paper we described the conceptual background of a novel approach to database security. The key contribution is to derive valuations for query results by considering the most important operations of the relational algebra as well as SQL and providing specific mval functions for each of them. While some of these rules are straightforward, e.g. for core operations like selection and projection, other operations like specific join operations require more thorough considerations. Further operations, e.g. grouping and aggregation or user-defined functions, would actually require application-specific valuations. To minimize the overhead for using valuation-based security, we discuss and recommend some reasonable valuation functions for these cases, too.

As the results presented here are merely of a conceptual nature, our current and future research includes considering implementation alternatives, e.g. integrated with a given DBMS or as part of a middleware or driver, as well as evaluating the overhead and the effectiveness of the approach. We will also come up with a detailed recommendation of how to set monetary values appropriately for different environments and situations. Furthermore, we plan to investigate further possible use cases for data valuation, such as billing of data-providing services on a fine-grained level and controlling benefit/cost trade-offs for data security and safety.

6. ACKNOWLEDGMENTS
This research has been funded in part by the German Federal Ministry of Education and Science (BMBF) through the Research Program under Contract FKZ: 13N10818.
7. REFERENCES
[1] S. Barthel and E. Schallehn. The Monetary Value of Information: A Leakage-Resistant Data Valuation. In BTW Workshops, BTW'2013, pages 131–138. Köln Verlag, 2013.
[2] E. Bertino and R. Sandhu. Database Security – Concepts, Approaches, and Challenges. IEEE Trans. Dependable and Secure Comp., 2(1):2–19, Mar. 2005.
[3] A. K. Chandra and P. M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In Proc. of the 9th Annual ACM Symposium on Theory of Computing, STOC'77, pages 77–90. ACM, 1977.
[4] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6):377–387, June 1970.
[5] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Comput. Surv., 42(4):14:1–14:53, June 2010.
[6] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. M-score: Estimating the Potential Damage of Data Leakage Incidents by Assigning Misuseability Weight. In Proc. of the 2010 ACM Workshop on Insider Threats, Insider Threats'10, pages 13–20. ACM, 2010.
[7] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. Eliciting Domain Expert Misuseability Conceptions. In Proc. of the 6th Int'l Conference on Knowledge Capture, K-CAP'11, pages 193–194. ACM, 2011.
[8] A. Harel, A. Shabtai, L. Rokach, and Y. Elovici. M-Score: A Misuseability Weight Measure. IEEE Trans. Dependable Secur. Comput., 9(3):414–428, May 2012.
[9] T. Helleseth and T. Klove. The Number of Cross-Join Pairs in Maximum Length Linear Sequences. IEEE Transactions on Information Theory, 37(6):1731–1733, Nov. 1991.
[10] P. A. Laplante. Dictionary of Computer Science, Engineering and Technology. CRC Press, London, England, 1st edition, 2000.
[11] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. L-Diversity: Privacy Beyond k-Anonymity. ACM Trans. Knowl. Discov. Data, 1(1):1–50, Mar. 2007.
[12] A. Motro. An Access Authorization Model for Relational Databases Based on Algebraic Manipulation of View Definitions. In Proc. of the 5th Int'l Conference on Data Engineering, pages 339–347. IEEE Computer Society, 1989.
[13] J. Natarajan, S. Shaw, R. Bruchez, and M. Coles. Pro T-SQL 2012 Programmer's Guide. Apress, Berlin-Heidelberg, Germany, 3rd edition, 2012.
[14] J. S. Park and J. Giordano. Access Control Requirements for Preventing Insider Threats. In Proc. of the 4th IEEE Int'l Conference on Intelligence and Security Informatics, ISI'06, pages 529–534. Springer, 2006.
[15] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy. Extending Query Rewriting Techniques for Fine-Grained Access Control. In Proc. of the 2004 ACM SIGMOD Int'l Conference on Management of Data, SIGMOD'04, pages 551–562. ACM, 2004.
[16] A. Rosenthal and E. Sciore. View Security as the Basis for Data Warehouse Security. In CAiSE Workshop on Design and Management of Data Warehouses, DMDW'2000, pages 5–6. CEUR-WS, 2000.
[17] L. Sweeney. K-Anonymity: A Model For Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, Oct. 2002.

TrIMPI: A Data Structure for Efficient Pattern Matching on Moving Objects
Tsvetelin Polomski and Hans-Joachim Klein
Christian-Albrechts-University at Kiel, Hermann-Rodewald-Straße 3, 24118 Kiel
tpo@is.informatik.uni-kiel.de, hjk@is.informatik.uni-kiel.de

ABSTRACT
Managing movement data efficiently often requires the exploitation of some indexing scheme. Taking into account the kind of queries issued to the given data, several indexing structures have been proposed which focus on spatial, temporal or spatio-temporal data. Since all these approaches consider only the raw data of moving objects, they may be well-suited if the queries of interest contain concrete trajectories or spatial regions. However, if the query consists only of a qualitative description of a trajectory, e.g. by stating some properties of the underlying object, sequential scans on the whole trajectory data are necessary to compute the property, even if an indexing structure is available. The present paper presents some results of an ongoing work on a data structure for Trajectory Indexing using Motion Property Information (TrIMPI). The proposed approach is flexible since it allows the user to define application-specific properties of trajectories which have to be used for indexing. Thereby, we show how to efficiently answer queries given in terms of such qualitative descriptions. Since the index structure is built on top of ordinary data structures, it can be implemented in arbitrary database management systems.

Keywords
Moving object databases, motion patterns, indexing structures
1. INTRODUCTION AND MOTIVATION
Most index structures for trajectories considered in the literature (e.g. [8]) concentrate on (time dependent) positional data, e.g. the R-Tree [9] or the TPR*-Tree [17]. There are different approaches (e.g. [1], [12]) exploiting transformation functions on the original data and thereby reducing the indexing overhead through "light versions" of the trajectories to be indexed. In these approaches only stationary data is being handled. In cases where the queries of interest consist of concrete trajectories or polygons covering them, such indexing schemata as well as trajectory compression techniques (e.g. [1], [6], [10], [12], [13]) may be well-suited. However, there are applications [14] where a query may consist only of a qualitative description, e.g. return all trajectories where the underlying object slowed down (during any time interval) and after that changed its course. Obviously, the motion properties slowdown and course alteration as well as their temporal adjustment can be computed using formal methods. The crucial point is that, even if an indexing structure is used, the stated properties must be computed for each trajectory, and this results in sequential scan(s) on the whole trajectory data. Time-consuming processing of queries is not acceptable, however, in a scenario where fast reaction on incoming data streams is needed. An example of such a situation, with so-called tracks computed from radar and sonar data as input, is the detection of patterns of skiff movements typical for many piracy attacks [14]. A track comprises the position of an object at a time moment and can hold additional information, e.g. about its current course and velocity. Gathering the tracks of a single object over a time interval yields its trajectory over this interval.
To address the efficiency problem, we propose an indexing scheme which is not primarily focused on the "time-position data" of trajectories but uses meta information about them instead.
We start with a discussion of related work in Section 2. Section 3 provides some formal definitions on trajectories and their motion properties. In Section 4 we introduce the indexing scheme itself and illustrate algorithms for querying it. Section 5 summarizes the present work and outlines our future work.

2. RELATED WORK
In this section we provide a short overview on previous contributions which are related to our approach. We start the section by reviewing classical indexing structures for moving objects data. Next to this, we show an approach which is similar in general terms to the proposed one, and finally we review literature related to semantical aspects of moving objects.
2.1 Indexing of Spatial, Temporal and Spatio-Temporal Data
The moving object databases community has developed several data structures for indexing movement data. According to [8], these structures can be roughly categorized as structures indexing only spatial data, also known as spatial access methods (SAM); indexing approaches for temporal data, also known as temporal index structures; and those which manage both spatial and temporal data, also known as spatio-temporal index structures. One of the first structures developed for SAMs is the well-known R-Tree [9]. Several extensions of R-Trees have been provided over the years, thus yielding a variety of spatio-temporal index structures. An informal schematic overview of these extensions, including also new developments such as the HTPR*-Tree [7], can be found in [11]. Since all of the proposed access methods focus mainly on the raw spatio-temporal data, they are well-suited for queries on the history of movement, for predicting new positions of moving objects, or for returning the most similar trajectories to a given one. If a query consists only of a qualitative description, however, all the proposed indexing structures are of no use.

2.2 Applying Dimensionality Reduction upon Indexing – the GEMINI Approach
The overall approach we consider in this work is similar to the GEMINI (GEneric Multimedia INdexIng method) indexing scheme presented in [6]. This approach was originally proposed for time series and has later been applied to other types of data, e.g. to motion data in [16]. The main idea behind GEMINI is to reduce the dimensionality of the original data before indexing. Therefore, representatives of much lower dimensionality are created for the data (trajectory or time series) to be indexed by using an appropriate transform, and these representatives are used for indexing. A crucial result in [6] is the proof that, in order to guarantee no false dismissals [12], the exploited transform must retain the distance (or similarity) of the data to be indexed, that is, the distance between representatives should not exceed the distance of the original time series. In the mentioned approaches, the authors achieve encouraging results on querying the most similar trajectories (or time series) to a given one. However, since the representatives of the original data are trajectories or time series, respectively, evaluating a query which only describes a motion behavior would result in the inspection of all representatives.

2.3 Semantical Properties of Movement
Semantical properties of movement data have been considered in various works, e.g. in [2], [5], and [15].
The authors of [2] propose a spatio-temporal representation scheme for moving objects in the area of video data. The considered representation scheme distinguishes between spatio-temporal data of trajectories and their topological information, and also utilizes information about distances between pairs of objects. The topological information itself is defined through a set of topological relations operators expressing spatial relations between objects over some time interval, including faraway, disjoint, meet, overlap, is-included-by/includes and same.
In [5], a comprehensive study of the research that has been carried out on data mining and visual analysis of movement patterns is provided. The authors propose a conceptual framework for the movement behavior of different moving objects. The extracted behavior patterns are classified according to a taxonomy.
In [15], the authors provide some aspects related to a semantic view of trajectories. They show a conceptual approach for how trajectory behaviors can be described by predicates that involve movement attributes and/or semantic annotations. The provided approach is rather informal and considers behavior analysis of moving objects on a general level.
3. FORMAL BACKGROUND
This section provides the formal notions as well as the definitions needed throughout the rest of the paper. We start with the term trajectory and then direct our attention to motion properties and patterns.

3.1 Trajectories
In our approach we consider the trajectory τ_o of an object o simply as a function of time which assigns a position to o at any point in time. Since time plays a role only for the determination of temporal causality between the positions of an object, we abstract from "real time" and use any time domain instead. A time domain is any set which is interval scaled and countably infinite. The first requirement ensures that timestamps can be used for ordering and, furthermore, that the "delay" between two time assignments can be determined. The second requirement ensures that we have an infinite number of "time moments" which can be unambiguously indexed by elements of N. In the following we denote a time domain by T.
Since objects move in a space, we also need a notion of a spatial domain. In the following, let S denote the spatial domain. We require that S is equipped with an adequate metric, such as the Euclidean distance (e.g. for S = R × R), which allows us to measure the spatial distance between objects.
Having the notions of time and space, we can formally define the term trajectory.

Definition 1. Let T, S and O denote a time domain, a space domain and a set of distinct objects, respectively. Then, the trajectory τ_o of an object o ∈ O is a function τ_o : T → S.

For brevity, we can also write the trajectory of an object o ∈ O in the form (o, t_0, s_0), (o, t_1, s_1), ... for those t ∈ T where τ_o(t) = s is defined. A single element (o, t_i, s_i) is called the track of object o at time t_i.
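A track and a trajectory could, for instance, be represented as in the following illustrative Python sketch (the names are ours, not the paper's):

    from collections import namedtuple

    # one track = (object id, time moment, position); a trajectory is the
    # time-ordered sequence of the tracks of one object, i.e. Definition 1
    # restricted to the sampled time moments
    Track = namedtuple("Track", ["obj_id", "t", "pos"])

    def trajectory(tracks, obj_id):
        return sorted((tr for tr in tracks if tr.obj_id == obj_id), key=lambda tr: tr.t)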
3.2 Motion Patterns
We consider a motion pattern as a sequence of properties of trajectories which reveal some characteristics of the behavior of the underlying moving objects. Such properties may be expressed through any predicates which are important for the particular analysis, such as start, stop, turn, or speedup.

Definition 2. Let T be a time domain, 𝒯 the set of trajectories of an object set O over T, and I_T the set of all closed intervals over T. A motion property on 𝒯 is a function p : 2^𝒯 × I_T → {true, false}.

That is, a motion property is fulfilled for a set of trajectories and a certain time interval if the appropriate predicate is satisfied. To illustrate this definition, some examples of motion properties are provided below:

• Appearance: Let t ∈ T. Then we define appear(·, ·) as follows: appear({τ_o}, [t, t]) = true ⇔ ∀t′ ∈ T : τ_o(t′) ≠ undefined → t ≤ t′. That is, an object "appears" only in the "first" moment it is being observed.

• Speedup: Let t_1, t_2 ∈ T and t_1 < t_2. Then speedup(·, ·) is defined as follows: speedup({τ_o}, [t_1, t_2]) = true ⇔ v(τ_o, t_1) < v(τ_o, t_2) ∧ ∀t ∈ T : t_1 ≤ t ≤ t_2 → v(τ_o, t_1) ≤ v(τ_o, t) ≤ v(τ_o, t_2), where v(τ_o, t) denotes the velocity of the underlying moving object o at time t. That is, the predicate speedup is satisfied for a trajectory and a time interval if and only if the velocity of the underlying object is increasing in the considered time interval. Note that the increase may not be strictly monotonic.

• Move away: Let t_1, t_2 ∈ T and t_1 < t_2. Then we define: moveaway({τ_o1, τ_o2}, [t_1, t_2]) = true ⇔ ∀t, t′ ∈ T : t_1 ≤ t < t′ ≤ t_2 → dist(τ_o1, τ_o2, t) < dist(τ_o1, τ_o2, t′), where dist(τ_o1, τ_o2, t) denotes the distance between the underlying moving objects o1 and o2 at time t. That is, two objects are moving away from each other in a time interval if their distance is increasing during the considered time interval.

Using motion properties, a motion pattern of a single trajectory or a set of trajectories is defined as a sequence of motion properties ordered by the time intervals in which they are fulfilled. It is important to note that this common definition of a motion pattern allows multiple occurrences of the same motion property in the sequence. In order to get a well-defined notion it has to be required that the time intervals in which the motion properties are fulfilled are disjoint, or that meaningful preferences on the motion properties are specified in order to allow an ordering in case the time intervals overlap.
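The speedup property, for example, can be checked on the sampled velocity values of a trajectory as in the following sketch, a discrete reading of the definition above (our own code, not the authors'):

    def speedup(velocities):
        # velocities: samples of v(tau_o, t) for the time moments in [t1, t2]
        if len(velocities) < 2 or velocities[0] >= velocities[-1]:
            return False
        # every intermediate velocity must lie between the boundary values
        return all(velocities[0] <= v <= velocities[-1] for v in velocities)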
4. TRAJECTORY INDEXING USING MOTION PROPERTIES
In this section we explain how the proposed index is created and used. Index creation starts with the determination of the motion pattern of each trajectory to be indexed. For this purpose, the motion predicates specified by the user are computed. The resulting motion patterns are indexed with references to the original trajectories.
The resulting index is schematically depicted in Figure 1. TrIMPI consists mainly of a data structure holding the raw trajectory data, and secondary index structures for maintaining motion patterns. Thereby, we differentiate between indexing single motion properties and indexing motion patterns.
A query to the index can be stated either through a motion pattern or through a concrete trajectory. The index is searched for motion patterns containing the given one or the computed one, respectively. In both cases, the associated trajectories are returned. The following subsections consider the outlined procedures more precisely.

[Figure 1: Overview of the index structure]

4.1 Indexing Trajectory Raw Data
Since the focus of TrIMPI is not on querying trajectories by example, the index structure for the raw trajectory data can be rather simple. For our implementation, we considered a trajectory record file as proposed by [3]. This structure (Figure 1) stores trajectories in records of fixed length. The overall structure of the records is as follows:

IDo | next_ptr | prev_ptr | {track_0, ..., track_num−1}

IDo denotes the identifier of the underlying moving object, next_ptr and prev_ptr are references to the appropriate records holding further parts of the trajectory, and {track_0, ..., track_num−1} is a list of tracks of a predefined fixed length num. If a record r_i for a trajectory τ_o gets filled, a new record r_j is created for τ_o holding its further tracks. In this case, next_ptr of r_i is set to point to r_j, and prev_ptr of r_j is set to point to r_i.
Using a trajectory record file, the data is not completely clustered, but choosing an appropriate record size leads to a partial clustering of the trajectory data in blocks. This has the advantage that extracting a complete trajectory requires loading only as many blocks as are needed for storing the trajectory.

4.2 Indexing Motion Patterns
For the maintenance of motion patterns we consider two cases – single motion properties and sequences of motion properties. Storing single motion properties allows the efficient finding of trajectories which contain the considered motion property. This is advantageous if the searched property is not often satisfied. Thus, for each motion property p a "list" DBT_p holding all trajectories satisfying this property is maintained. As we shall see in Algorithm 4.3, we have to combine such lists and, thus, a simple unsorted list would not be very favourable. Therefore, we implement these lists through B+-Trees (ordered by the trajectory/object identifiers). An evaluation of the union and intersection of two B+-Trees with m and n leaves can be performed in O(m log((m+n)/m)) [4].
The search for motion patterns with more than one motion property can be conducted through the single DBT_p structures. However, if the query motion pattern is too long, too many intersections of the DBT_p structures will happen, and the resulting trajectories will have to be checked for containing properties that match the given order as well. To overcome this problem, sequences of motion properties are stored in an additional B+-Tree structure DBT. The elements of DBT have the form (p, τ_o), where p is a motion pattern and o ∈ O. To sort the elements of DBT, we apply lexicographical ordering. As a result, sequences with the same prefix are stored consecutively. Thus, storing motion patterns that are prefixes of other motion patterns can be omitted.
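If the DBT_p lists are kept sorted by trajectory identifier, combining them boils down to sorted-set operations. The following sketch intersects two such lists, using plain sorted Python lists as a stand-in for the B+-trees (illustrative only):

    from bisect import bisect_left

    def intersect_sorted(ids_a, ids_b):
        # skip ahead in ids_b with binary search instead of scanning linearly
        result, i = [], 0
        for x in ids_a:
            i = bisect_left(ids_b, x, i)
            if i == len(ids_b):
                break
            if ids_b[i] == x:
                result.append(x)
        return result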
4.3 Building the Index
The algorithm for the index creation is quite simple. It consists primarily of the following steps:

• Determine the motion properties for each trajectory τ_o. Consider, if needed, a sliding window or some reduction or segmenting technique as proposed in [1], [6], [10], [12], [13], for example. Generate a list f of the motion properties of τ_o, ordered by their appearance in τ_o.
• Store τ_o into the trajectory record file.
• Apply Algorithm 4.1 to f to generate the access keys relevant for indexing.
• For each generated access key, check whether it is already contained in the index. If this is not the case, store it in the index. Link the trajectory record file entry of τ_o to the access key.

Algorithm 4.1 Building the indexing keys
Require: f is a sequence of motion properties
Require: G is the maximal length of sequences to be indexed
 1  function createIndexKeys(f)
 2    supplist ← empty set of lists
 3    for all a ∈ f do
 4      entries ← empty set of lists
 5      for all l ∈ supplist do
 6        new ← empty list
 7        if |l| ≤ G then
 8          new ← l.append(a)
 9        else
10          new ← l
11        end if
12        entries ← entries ∪ {new}
13      end for
14      entries ← entries ∪ {[a]}
15      supplist ← entries ∪ supplist
16    end for
17    return entries
18  end function

Algorithm 4.1 is used to generate the index keys of a pattern. An index key is any subpattern p′ = (p′_j), j = 0, ..., m−1, of a pattern p = (p_i), i = 0, ..., n−1, which is defined as follows:

• For each j ≤ m − 1 there exists an i ≤ n − 1 such that p′_j = p_i.
• For all j, k with 0 ≤ j < k ≤ m − 1 there exist i, l such that 0 ≤ i < l ≤ n − 1 and p′_j = p_i and p′_k = p_l.

To generate the list of index keys, Algorithm 4.1 proceeds iteratively. In each iteration of the outer loop (lines 3 to 16) the algorithm considers a single element a of the input sequence f. On the one hand, a is added as an index key to the (interim) result (lines 14 and 15), and on the other hand it is appended as a suffix to each previously generated index key (inner loop – lines 5 to 13). Algorithm 4.1 utilizes two sets whose elements are lists of motion properties – supplist and entries. The set supplist contains in each iteration the complete set of index keys, including those which are prefixes of other patterns. The set entries is built in each iteration of the inner loop (lines 5 to 13) by appending the current motion property of the input sequence to every element of supplist. Thereby, at line 14 entries holds only index keys which are not prefixes of other index keys. Since the resulting lists of index keys are stored in a B+-Tree by applying a lexicographical order, sequences of motion properties which are prefixes of other sequences can be omitted. Therefore, the set entries is returned as the final result (line 17).
Since the given procedure may result in the computation of up to 2^{k_0} different indexing keys for an input sequence with k_0 motion properties, a global constant G is used to limit the maximal length of index keys. Using an appropriate value for G leads to no drawbacks for the application. Furthermore, the proposed querying algorithm can handle queries longer than G.
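The key generation can be transcribed to Python almost literally (a sketch; tuples stand in for the lists of motion properties so that they can be kept in sets):

    def create_index_keys(f, G):
        supplist = set()      # all keys generated so far, including prefixes
        entries = set()
        for a in f:
            entries = set()
            for key in supplist:
                # extend every previous key by the current property (cut off at length G)
                entries.add(key + (a,) if len(key) <= G else key)
            entries.add((a,))                 # the property itself is a key
            supplist = entries | supplist
        return entries                        # keys that are no prefixes of other keys

    # e.g. create_index_keys(["slowdown", "turn"], G=4)
    # yields {("slowdown", "turn"), ("turn",)}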
4.4 Searching for Motion Patterns
Since the index is primarily intended to support queries on sequences of motion properties, the algorithm for evaluating such queries given in the following is rather simple. In its "basic" version, query processing just traverses the index and returns all trajectories referenced by index keys which contain the queried one (as a subpattern). This procedure is illustrated in Algorithm 4.2.

Algorithm 4.2 Basic querying of trajectories with a sequence of motion properties
Require: s is a sequence of motion properties; |s| ≤ G
Require: DBT is the index containing motion patterns
 1  function GetEntriesFromDBT(s)
 2    result ← {τ_o | ∃p s.t. s ≤ p ∧ (p, τ_o) ∈ DBT}
 3    return result
 4  end function

There are, however, some special cases which have to be taken into account. The first of them concerns query sequences which are "too short". As stated in Section 4.2, it can be advantageous to evaluate queries containing only few motion properties by examining the index structures for single motion properties. To be able to define an application-specific notion of "short" queries, we provide, besides G, an additional global parameter α for which 1 ≤ α < G holds. In Algorithm 4.3, which evaluates queries of patterns of arbitrary length, each pattern of length shorter than α is handled in the described way (lines 3 to 8). It is important to note that each trajectory of the interim result has to be checked whether it matches the queried pattern (lines 9 to 13).
The other special case are queries longer than G (lines 16 to 24). As we have seen in Algorithm 4.1, in such cases the index keys are cut to prefixes of length G. Thus, the extraction in this case considers the prefix of length G of the query sequence (line 17) and extracts the appropriate trajectories (line 18). Since these trajectories may still not match the query sequence, e.g. by not fulfilling some of the properties appearing at a position after G − 1 in the input sequence, an additional check of the trajectories in the interim result is made (lines 19 to 23).
The last case to consider are query sequences with a length between α and G. In these cases, the index DBT holding the index keys is searched through a call to Algorithm 4.2 and the result is returned.

Algorithm 4.3 Querying trajectories with a sequence of arbitrary length
Require: s is a sequence of motion properties
Require: G is the maximal length of stored sequences
Require: DBT_p is the index of the property p
Require: 1 ≤ α < G is the maximal query length for searching single property indexes
 1  function GetEntries(s)
 2    result ← empty set
 3    if |s| < α then
 4      result ← 𝒯
 5      for all p ∈ s do
 6        suppset ← DBT_p
 7        result ← result ∩ suppset
 8      end for
 9      for all τ_o ∈ result do
10        if ¬ match(τ_o, s) then
11          result ← result \ {τ_o}
12        end if
13      end for
14    else if |s| ≤ G then
15      result ← GetEntriesFromDBT(s)
16    else
17      k ← s[0..G − 1]
18      result ← GetEntriesFromDBT(k)
19      for all τ_o ∈ result do
20        if ¬ match(τ_o, s) then
21          result ← result \ {τ_o}
22        end if
23      end for
24    end if
25    return result
26  end function

Finally, the function Match (Algorithm 4.4) checks whether a trajectory τ_o fulfills a pattern s. For this purpose, the list of motion properties of τ_o is generated (line 2). Thereafter, s and the generated pattern of τ_o are traversed (lines 5 to 14) so that it can be checked whether the elements of s can be found in the trajectory pattern of τ_o in the same order. In this case the function Match returns true, otherwise it returns false.

Algorithm 4.4 Checks whether a trajectory matches a motion pattern
Require: τ_o is a valid trajectory
Require: s is a sequence of motion properties
 1  function match(τ_o, s)
 2    motion_properties ← compute the list of motion properties of τ_o
 3    index_s ← 0
 4    index_props ← 0
 5    while index_props < motion_properties.length do
 6      if motion_properties[index_props] = s[index_s] then
 7        index_s ← index_s + 1
 8      else
 9        index_props ← index_props + 1
10      end if
11      if index_s = s.length then
12        return true
13      end if
14    end while
15    return false
16  end function
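In Python, the Match test corresponds to a standard subsequence check over the trajectory's property list (a sketch only; here both pointers advance on a match, which yields the behaviour described in the text):

    def match(trajectory_props, s):
        # does the queried pattern s occur in trajectory_props in the same order?
        i = 0
        for p in trajectory_props:
            if i < len(s) and p == s[i]:
                i += 1
            if i == len(s):
                return True
        return False

    # match(["appear", "speedup", "turn", "stop"], ["speedup", "stop"])  -> True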
5. CONCLUSIONS AND OUTLOOK
In this paper we provided some first results of an ongoing work on an indexing structure for trajectories of moving objects called TrIMPI. The focus of TrIMPI lies not on indexing spatio-temporal data but on the exploitation of motion properties of moving objects. For this purpose, we provided a formal notion of motion properties and showed how they form a motion pattern. Furthermore, we showed how these motion patterns can be used to build a meta index. Algorithms for querying the index were also provided. In the next steps, we will finalize the implementation of TrIMPI and perform tests in the scenario of the automatic detection of piracy attacks mentioned in the Introduction. As a conceptual improvement of the work provided in this paper, we consider a flexibilisation of the definition of motion patterns including arbitrary temporal relations between motion predicates.

6. ACKNOWLEDGMENTS
The authors would like to give special thanks to their former student Lasse Stehnken for his help in implementing TrIMPI.

7. REFERENCES
[1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In D. B. Lomet, editor, Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, FODO'93, Chicago, Illinois, USA, October 13-15, 1993, volume 730 of Lecture Notes in Computer Science, pages 69–84. Springer, 1993.
[2] J.-W. Chang, H.-J. Lee, J.-H. Um, S.-M. Kim, and T.-W. Wang. Content-based retrieval using moving objects' trajectories in video data. In IADIS International Conference Applied Computing, pages 11–18, 2007.
[3] J.-W. Chang, M.-S. Song, and J.-H. Um. TMN-Tree: New trajectory index structure for moving objects in spatial networks. In Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, pages 1633–1638. IEEE Computer Society, July 2010.
[4] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, SODA'00, pages 743–752, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.
[5] S. Dodge, R. Weibel, and A.-K. Lautenschütz. Towards a taxonomy of movement patterns. Information Visualization, 7(3):240–252, June 2008.
[6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In R. T. Snodgrass and M. Winslett, editors, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, SIGMOD'94, pages 419–429, New York, NY, USA, 1994. ACM.
[7] Y. Fang, J. Cao, J. Wang, Y. Peng, and W. Song. HTPR*-Tree: An efficient index for moving objects to support predictive query and partial history query. In L. Wang, J. Jiang, J. Lu, L. Hong, and B. Liu, editors, Web-Age Information Management, volume 7142 of Lecture Notes in Computer Science, pages 26–39. Springer Berlin Heidelberg, 2012.
[8] R. H. Güting and M. Schneider. Moving Object Databases. Data Management Systems. Morgan Kaufmann, 2005.
[9] A. Guttman. R-Trees: a dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data, SIGMOD'84, pages 47–57, New York, NY, USA, 1984. ACM.
[10] J. Hershberger and J. Snoeyink. Speeding Up the Douglas-Peucker Line-Simplification Algorithm. In P. Bresnahan, editor, Proceedings of the 5th International Symposium on Spatial Data Handling, SDH'92, Charleston, South Carolina, USA, August 3-7, 1992, pages 134–143. University of South Carolina, Humanities and Social Sciences Computing Lab, August 1992.
[11] C. S. Jensen. TPR-Tree Successors 2000–2012. http://cs.au.dk/~csj/tpr-tree-successors, 2013. Last accessed 24.03.2013.
[12] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3(3):263–286, 2001.
[13] E. J. Keogh, S. Chu, D. Hart, and M. J. Pazzani. An online algorithm for segmenting time series. In N. Cercone, T. Y. Lin, and X. Wu, editors, Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM'01, San Jose, California, USA, 29 November - 2 December 2001, pages 289–296. IEEE Computer Society, 2001.
[14] T. Polomski and H.-J. Klein. How to Improve Maritime Situational Awareness using Piracy Attack Patterns. 2013. Submitted.
[15] S. Spaccapietra and C. Parent. Adding meaning to your steps (keynote paper). In M. Jeusfeld, L. Delcambre, and T.-W. Ling, editors, Conceptual Modeling - ER 2011, 30th International Conference, ER 2011, Brussels, Belgium, October 31 - November 3, 2011, Proceedings, ER'11, pages 13–31. Springer, 2011.
[16] Y.-S. Tak, J. Kim, and E. Hwang. Hierarchical querying scheme of human motions for smart home environment. Eng. Appl. Artif. Intell., 25(7):1301–1312, Oct. 2012.
[17] Y. Tao, D. Papadias, and J. Sun. The TPR*-tree: an optimized spatio-temporal access method for predictive queries. In J. C. Freytag, P. C. Lockemann, S. Abiteboul, M. J. Carey, P. G. Selinger, and A. Heuer, editors, Proceedings of the 29th international conference on Very large data bases - Volume 29, VLDB'03, pages 790–801. VLDB Endowment, 2003.
Complex Event Processing in Wireless Sensor Networks
Omran Saleh
Faculty of Computer Science and Automation, Ilmenau University of Technology, Ilmenau, Germany
omran.saleh@tu-ilmenau.de

ABSTRACT
Most WSN applications need the number of deployed sensor nodes to be in the order of hundreds, thousands or more in order to monitor certain phenomena and capture measurements over a long period of time. Such large sensor networks generate continuous streams of raw events (the terms data, events and tuples are used interchangeably) in the case of a centralized architecture, in which the sensor data captured by all the sensor nodes is sent to a central entity. In this paper, we describe the design and implementation of a system that carries out complex event detection queries inside the wireless sensor nodes. These queries filter and remove undesirable events. They can detect complex events and meaningful information by combining raw events with logical and temporal relationships, and output this information to an external monitoring application for further analysis. This system reduces the amount of data that needs to be sent to the central entity by avoiding transmitting the raw data outside the network. Therefore, it can dramatically reduce the communication burden between nodes and improve the lifetime of sensor networks. We have implemented our approach for the TinyOS operating system, for the TelosB and Mica2 platforms. We conducted a performance evaluation of our method, comparing it with a naive method. The results clearly confirm the effectiveness of our approach.

Keywords
Complex Event Processing, Wireless Sensor Networks, In-network processing, centralized processing, Non-deterministic Finite state Automata
1. INTRODUCTION
Wireless sensor networks are defined as a distributed and cooperative network of devices, denoted as sensor nodes, that are densely deployed over a region, especially in harsh environments, to gather data about some phenomena in this monitored region. These nodes can sense the surrounding environment and share the information with their neighboring nodes. They are gaining adoption on an increasing scale for tracking and monitoring purposes. Furthermore, sensor nodes are often used for control purposes. They are capable of performing simple processing.
In the near future, wireless sensor networks are expected to offer and make conceivable a wide range of applications and to emerge as an important area of computing. WSN technology is exciting, with boundless potential for various application areas. Sensor networks are now found in many industrial and civilian application areas, military and security applications, environmental monitoring, disaster prevention and health care applications, etc.
One of the most important issues in the design of WSNs is energy efficiency. Each node should be as energy efficient as possible. Processing a chunk of information is less costly than wireless communication; the ratio between them is commonly supposed to be much smaller than one [19]. There is a significant link between energy efficiency and superfluous data: a sensor node consumes unnecessary energy for the transmission of superfluous data to the central entity, which reduces the energy efficiency. Furthermore, traditional WSN software systems apparently do not aim at efficient processing of continuous data or event streams. Following these observations, we are looking for an approach that lets our system gain high performance and save power by preventing the generation and transmission of needless data to the central entity. Therefore, it can dramatically reduce the communication burden between nodes and improve the lifetime of sensor networks. This approach takes into account the resource limitations in terms of computation power, memory, and communication. Sensor nodes can employ their processing capabilities to perform some computations. Therefore, an in-network complex event processing (CEP, see [15]) based solution is proposed.
We propose to run a complex event processing engine inside the sensor nodes. The CEP engine is implemented to transform the raw data into meaningful and beneficial events that are notified to the users after they have been detected. It is responsible for combining primitive events to identify higher-level complex events. This engine provides an efficient Non-deterministic Finite state Automata (NFA) [1] based implementation to drive the evaluation of the complex event queries, where the automaton runs as an integral part of the in-network query plan. It also provides the theoretical basis of CEP and supports us with particular operators (conjunction, negation, disjunction and sequence operators, etc.).
Complex event processing over data streams has increasingly become an important field due to the increasing number of its applications for wireless sensor networks. There have been various event detection applications proposed for WSNs, e.g. for detecting eruptions of volcanoes [18], forest fires, and for the habitat monitoring of animals [5]. An increasing number of applications in such networks is confronted with the necessity to process voluminous data streams in a real-time fashion.
The rest of the paper is organized as follows: Section 2 provides an overview of the naive approaches for normal data and complex event processing in WSNs. Related works are briefly reviewed in Section 3. Then we introduce the overall system architecture for performing complex event processing in sensor networks in Section 4. Section 5 discusses how to create logical query plans to evaluate sensor portion queries. Section 6 explains our approach and how queries are implemented by automata. In Section 7, the performance of our system is evaluated using a particular simulator. Finally, Section 8 presents our concluding remarks and future work.
2. NAIVE APPROACHES IN WSNS
The ideas behind the naive approaches, which are fundamentally different from our approach, lie in the processing of data as the central architectural concept. For normal sensor data processing, the centralized approach proceeds in two steps: the sensor data captured by all the sensor nodes is sent to the sink node and then routed to the central server (base station), where it is stored in a centralized database. High-volume data arrives at the server. Subsequently, query processing takes place on this database by running queries against the stored data. Each query executes one time and returns a set of results.
Another approach which adopts the idea of a centralized architecture is the use of a central data stream management system (DSMS), which simply takes the sensor data stream as input source. Sending all sensor readings to a DSMS is also an option for WSN data processing. A DSMS is defined as a system that manages a data stream, executes a continuous query against a data stream and supports on-line analysis of rapidly changing data streams [10]. Traditional stream processing systems such as Aurora [2], NiagaraCQ [7], and AnduIN [12] extend relational query processing to work with stream data. Generally, the select, project, join and aggregate operations are supported in these stream systems.
The naive approach for complex event processing in WSNs is similar to the central architectural idea of normal data processing, but instead of using a traditional database or data stream engine, CEP uses a dedicated engine for processing complex events such as Esper [8], SASE [11] and Cayuga [4], in which sensor data or event streams need to be filtered, aggregated, processed and analyzed to find the events of interest and identify some patterns among them, and finally to take actions if needed.
Reference [11] uses SASE in order to process RFID stream data for a real-world retail management scenario. Paper [3] demonstrates the use of the Esper engine for object detection and tracking in sensor networks. All the aforementioned engines use some variant of an NFA model to detect the complex events. Moreover, there are many CEP engines in the field of active databases. Most of the models in these engines are based on fixed data structures such as trees, graphs, finite automata or petri nets. The authors of [6] used a tree based model. Paper [9] used a petri net based model to detect complex events in an active database. Reference [17] used a Timed Petri-Net (TPN) to detect complex events from RFID streams.

3. RELATED WORKS
It is preferable to perform in-network processing inside the sensor network to reduce the transmission cost between neighboring nodes. This concept is proposed by several systems such as TinyDB [16] and Cougar [19]. The Cougar project applies database system concepts to sensor networks. It uses declarative queries that are similar to SQL to query sensor nodes. Additionally, sensor data in Cougar is considered like a "virtual" relational database. Cougar places on each node an additional query layer that lies between the network and application layers and has the responsibility of in-network processing. This system generates one plan for the leader node to perform aggregation and send the data to a sink node. Another plan is generated for non-leader nodes to measure the sensor status. The query plans are disseminated to the query layers of all sensor nodes. The query layer registers the plan inside the sensor node, enables the desired sensors, and returns results according to this plan.
TinyDB is an acquisitional query processing system for sensor networks which maintains a single, infinitely-long virtual database table. It uses an SQL-like interface to ask for data from the network. In this system, users specify the data they want and the rate at which the data should be refreshed, and the underlying system decides the best plan to be executed. Several in-network aggregation techniques have been proposed in order to extend the lifetime of sensor networks, such as tree-based aggregation protocols, i.e., directed diffusion.
Paper [13] proposes a framework to detect complex events in wireless sensor networks by transforming them into sub-events. In this case, the sub-events can easily be detected by sensor nodes. Reference [14] splits queries into server and node queries, where each query can be executed. The final results from both sides are combined by a results merger. In [20], symbolic aggregate approximation (SAX) is used to transform sensor data into symbolic representations. To detect complex events, a distance metric for string comparison is utilized. These papers are the works closest to our system. Obviously, there is currently little work on how the idea of in-network processing can be extended and implemented to allow more complex event queries to be resolved within the network.
data processing, but instead of using traditional database and data stream engine, CEP uses a dedicated engine for processing complex events such as Esper [8], SASE [11] and 4. SYSTEM ARCHITECTURE Cayuga [4], in which sensor data or events streams need to We have proposed a system architecture in which collected be filtered, aggregated, processed and analyzed to find the data at numerous, inexpensive sensor nodes are processed events of interest and identify some patterns among them, locally. The resulting information is transmitted to larger, finally take actions if needed. more capable and more expensive nodes for further analysis Reference [11] uses SASE in order to process RFID stream and processing through specific node called sink node. data for a real-world retail management scenario. Paper [3] The architecture has three main parts that need to be demonstrates the use of Esper engine for object detection modified or created to make our system better suited to tracking in sensor networks. All the aforementioned engines queries over sensor nodes: 1- Server side: queries will be use some variant of a NFA model to detect the complex originated at server side and then forwarded to the near- event. Moreover, there are many CEP engines in the field est sink node. Additionally, this side mainly contains an application that runs on the user’s PC (base station). Its main purpose is to collect the results stream over the sen- sor network and display them. Server side application can offer more functions i.e., further filtering for the collected data, perform joining on sensor data, extract, save, man- age, and search the semantic information and apply further complex event processing on incoming events after process- ing them locally in sensor nodes. Because sensor data can be considered as a data stream, we proposed to use a data stream management system to play a role of server side, for that we selected AnduIN data stream engine. 2- Sink side: sink node (also known as root or gateway node) is one of the motes in the network which communicates with the base sta- tion directly, all the data collected by sensors is forwarded to a sink node and then to server side. This node will be in charge of disseminating the query down to all the sensor Figure 1: Logical query plan nodes in the network that comes from server side. 3- Node side: in this side, we have made huge changes to the tra- ditional application which runs on the nodes themselves to mechanism takes as input primitive events from lower oper- enable database manner queries involving filters, aggregates, ators and detects occurrences of composite events which are complex event processing operator (engine) and other oper- used as an output to the rest of the system. ators to be slightly executed within sensor networks. These changes are done in order to reduce communication costs 6. IN-NETWORK CEP SYSTEM and get useful information instead of raw data. Various applications including WSNs require the ability to When combining on-sensor portions of the query with the handle complex events among apparently unrelated events server side query, most of the pieces of the sensor data query and find interesting and/or special patterns. Users want are in place. This makes our system more advanced. to be notified immediately as soon as these complex events are detected. Sensor node devices generate massive sensor 5. LOGICAL PLAN data streams. 
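To make the operator roles and the publish/subscribe handoff concrete, the following Python sketch wires a minimal plan with one source, one inner filter operator, and one sink. It is our illustration only (the actual node-side code is written in TinyOS, see Section 7), and the tuple fields node and temp are assumed for the example.

class Operator:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, op):
        # Downstream operators register themselves with their upstream operator.
        self.subscribers.append(op)

    def publish(self, tup):
        # Hand the tuple to every subscribed operator.
        for op in self.subscribers:
            op.process(tup)

    def process(self, tup):
        raise NotImplementedError

class Source(Operator):
    """Source operator: produces tuples and transfers them to other operators."""
    def emit(self, readings):
        for tup in readings:
            self.publish(tup)

class Filter(Operator):
    """Inner operator: forwards only tuples that satisfy a predicate."""
    def __init__(self, predicate):
        super().__init__()
        self.predicate = predicate

    def process(self, tup):
        if self.predicate(tup):
            self.publish(tup)

class Sink(Operator):
    """Sink operator: collects what would be transmitted to the base station."""
    def __init__(self):
        super().__init__()
        self.out = []

    def process(self, tup):
        self.out.append(tup)

# Wire a tiny plan: source -> filter(temp > 30) -> sink
source, flt, sink = Source(), Filter(lambda t: t["temp"] > 30), Sink()
source.subscribe(flt)
flt.subscribe(sink)
source.emit([{"node": 1, "temp": 25}, {"node": 1, "temp": 34}])
print(sink.out)  # [{'node': 1, 'temp': 34}]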
6. IN-NETWORK CEP SYSTEM

Various applications, including WSNs, require the ability to handle complex events among apparently unrelated events and to find interesting and/or special patterns. Users want to be notified immediately as soon as these complex events are detected. Sensor nodes generate massive sensor data streams, which continuously produce a variety of primitive events. The continuous events form a sequence of primitive events, and recognizing such a sequence yields a high-level event that the users are interested in.

Sensor event streams have to be automatically filtered, processed, and transformed into meaningful information. In a non-centralized architecture, CEP has to be performed as close to real time as possible, i.e., inside the node. The task of identifying composite events from primitive ones is performed by the complex event processing engine. The CEP engine provides the runtime for complex event processing: it accepts queries provided by the user, matches those queries against continuous event streams, and triggers an event or an action when the conditions specified in the queries are satisfied. This is close to the Event-Condition-Action (ECA) concept in conventional database systems, where an action has to be carried out in response to an event once one or more conditions are satisfied.

Each data tuple from the sensor node is viewed as a primitive event and has to be processed inside the node. We propose an event detection system that specifically targets applications with limited resources, such as ours. There are four phases for complex event processing in our in-network model: NFA creation, filtering, sequence scan, and response, as shown in figure 2.

Figure 2: CEP phases

6.1 NFA Creation Phase

The first phase is NFA creation. The NFA's structure is created by translating the sequence pattern: the events are mapped to NFA states and edges, and the conditions of the events (generally called event types) are associated with the edges. For pattern matching over sensor node streams, the NFA represents the structure of an event sequence. As a concrete example, consider the query pattern SEQ(A a, B+ b, C c). (In this paper, we focus only on the sequence operator SEQ because of the limited number of pages.) Figure 3 shows the NFA created for this pattern (A, B+, C): state S0 is the starting state, state S1 stands for the successful detection of an A event, state S2 for the detection of a B event after the A event, and state S3 for the detection of a C event after the B event. State S1 contains a self-loop with the condition of a B event. State S3 is the accepting state; reaching this state indicates that the sequence has been detected.

Figure 3: NFA for SEQ(A a, B+ b, C c)

6.2 Filtering Phase

The second phase filters the primitive events generated by sensor nodes at an early stage. Sensor nodes cannot decide by themselves whether a particular event is necessary or not; when additional conditions are added to the system, possible event instances can be pruned at this first stage. After filtering, a timestamp operator adds the occurrence time t of the event. This new operator is designed to add a timestamp t to the events (tuples) before they enter the complex event processing operator, as can be seen in figure 1. The timestamp attribute value of an event records the reading of a clock in the system in which the event was created; in this case it reflects the true order of occurrence of the primitive events.

6.3 Sequence Scan Phase

The third phase is the sequence scan, which detects a pattern match. We have three modes that state the way in which events may contribute to scanning a sequence: UNRESTRICTED, RECENT, and FIRST. Every mode has a different behavior; the selection between them depends on the users and the application domain. These modes have advantages and disadvantages, which we illustrate below.

In the UNRESTRICTED mode, each start event e which allows a sequence to move from the initial state to the next state starts a separate sequence detection. In this case, any event occurrence combination that matches the definition of the sequence can be considered as an output; by using this mode, we get all possible event combinations which satisfy the sequence. When the sequence is created, it waits for the arrival of events in its starting state. Once a new event instance e arrives, the sequence scan responds as follows: 1- It checks whether the type of the instance (from its attributes) and the occurrence time of e satisfy a transition for one of the existing logical sequences; if not, the event is directly rejected. 2- If yes, e is registered in the system (the registration is done in the sliding window) and the sequence advances to the next state. 3- If e allows a sequence to move from the starting state to the next state, the engine creates another logical sequence to process further incoming events, while keeping the original sequence in its current state to receive new events; therefore, multiple sequences work on the events at the same time. 4- Sequences whose last received items are not within the time limit are deleted; it becomes impossible for them to proceed to the next state, since the time limits for future transitions have already expired.

Next, we use an example to illustrate how the UNRESTRICTED sequence scan works. Suppose we have the pattern SEQ(A, B+, D) and the sequence of events (tuples) [a1, b2, a3, c4, c5, b6, d7, ...] within 6 time units. (The terms complex event, composite event, pattern, and sequence are used interchangeably.) Figure 4 shows, step by step, how these events are processed. Once the sequence has reached the accepting state (F), the occurrences of SEQ(A, B+, D) are established as: {{a1, b2, d7}, {a1, b6, d7}, {a3, b6, d7}}.

Figure 4: Sequence scan for SEQ(A, B+, D) within 6 time units using UNRESTRICTED mode
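The three combinations above can be checked with a brute-force enumeration that mimics what UNRESTRICTED mode accepts for this pattern. This sketch is our illustration (it treats the integer suffixes as timestamps, picks one B per match as in the combinations listed in the text, and measures the window from the A event to the D event); it is not the in-node NFA engine.

events = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("C", 5), ("B", 6), ("D", 7)]
WINDOW = 6  # time units

def unrestricted_seq_abd(stream, window):
    # Enumerate every (A, B, D) combination in temporal order within the window.
    a_events = [t for k, t in stream if k == "A"]
    b_events = [t for k, t in stream if k == "B"]
    d_events = [t for k, t in stream if k == "D"]
    matches = []
    for a in a_events:
        for b in b_events:
            for d in d_events:
                if a < b < d and d - a <= window:
                    matches.append((f"a{a}", f"b{b}", f"d{d}"))
    return matches

print(unrestricted_seq_abd(events, WINDOW))
# [('a1', 'b2', 'd7'), ('a1', 'b6', 'd7'), ('a3', 'b6', 'd7')]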
The drawback of this mode is the high storage needed to accumulate all the events that participate in the combinations, in addition to the computation overhead for the detection; it consumes more energy. On the other hand, it gives us all possible event combinations, which can be useful, e.g., for further analysis. In our system, we only output one of these possibilities to reduce the transmission cost overhead.

All registered events are stored in a sliding window. Once an overflow has occurred, the candidate events are the newest registered ones from the first sequence. The engine continues to replace events from the first sequence as long as there is no space. When the initial event (the first event in the first sequence combination) is replaced, the engine starts the replacement from the second sequence, and so on. The engine applies this replacement policy to ensure that the system still has several sequences with which to detect a composite event, because replacing the initial events would destroy the whole sequence.

In the FIRST mode, the earliest occurrence of each contributing event type is used to form the composite event output. Only the first event from a group of events of the same type advances the sequence to the next state. In this mode, there is just one sequence in the system. The automaton engine examines every incoming instance e to see whether its type and occurrence time satisfy a transition from the current state to the next state. If so, the sequence registers the event in the current state and advances to the next state; if not, the event is directly rejected. Suppose we have the pattern SEQ(A, B+, C+, D) and the sequence of tuples [a1, a2, b3, c4, c5, b6, d7, ...] within 6 time units; the result is shown in the upper part of figure 5.

In the RECENT mode (the lower part of figure 5, with the same pattern and the same sequence of tuples), the most recent occurrences of the contributing event types are used to form the composite event. In RECENT mode, once an instance satisfies the condition and timing constraint to jump from one state to the next, the engine stays in the current state, unlike in FIRST mode. This mode tries to find the most recent instance among consecutive instances for that state before moving to the next state. When a1 enters the engine, it satisfies the condition to move from S0 to S1; the engine registers it but stays in S0 and does not jump to the next state, because a newly incoming instance may be more recent than the last one registered for the current state.

Figure 5: FIRST and RECENT modes

The advantage of the FIRST and RECENT modes is that they use less storage to accumulate the events that participate in the combinations: only a few events are registered in the system, and the computation overhead for the detection is low, so they consume less energy. Unlike UNRESTRICTED, they do not give all possible matches.
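A possible single-pass reading of the FIRST policy is sketched below for SEQ(A, B+, D) over the same event stream as before. This is our simplification (B+ collapses to "the first B", and the window is measured from the first registered event), not the authors' automaton implementation.

def first_mode_seq_abd(stream, window):
    registered = []               # events forming the single candidate match
    expected = ["A", "B", "D"]    # B+ collapses to "first B" under FIRST
    state = 0
    for etype, t in stream:
        if state == len(expected):            # already accepted
            break
        if etype == expected[state]:
            if registered and t - registered[0][1] > window:
                return None                    # time limit for the sequence expired
            registered.append((etype, t))
            state += 1
    return registered if state == len(expected) else None

events = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("C", 5), ("B", 6), ("D", 7)]
print(first_mode_seq_abd(events, 6))
# [('A', 1), ('B', 2), ('D', 7)]  -> the earliest contributing events a1, b2, d7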
6.4 Response Phase

Once an accepting state F is reached, the engine should immediately output the event sequence. This phase is responsible for preparing the output sequence and passing it to the sink operator. The output sequence depends on the scan mode. The phase creates the response by reading the sliding window contents. In the case of the FIRST and RECENT modes, the sliding window contains only the events which contributed to the sequence detection. In UNRESTRICTED mode, the engine randomly selects one combination of events which matches the pattern, in order to reduce the transmission cost.

7. EVALUATION

We have completed an initial in-network complex event processing implementation. All the source code, implementing the in-network complex event processing techniques as well as the base station functionality, is written in TinyOS. Our code runs successfully both on real motes and on the TinyOS Avrora simulator. The aim of the evaluation is to compare the performance of our system (an in-network processor that includes a complex event engine) with the centralized approach in wireless sensor networks, and to assess the suitability of our approach in an environment where resources are limited. The comparison is done in terms of energy efficiency (amount of energy consumed) and the number of messages transmitted per interval in the entire network. The experiment was run for varying SEQ lengths: we started with length 2, then 3, and finally 5. Simulations were run for 60 seconds with one event per second. The performance for different SEQ lengths and different modes in a network of 75 nodes is shown in figure 6.

Figure 6: Total energy consumption

The centralized architecture led to higher energy consumption because the sensor nodes transmitted events to the sink node at regular periods. In our system, we used in-network complex event processing to decrease the number of transmissions of needless events at each sensor node. What we can observe from figure 6 is summarized as follows: 1- By increasing the SEQ length in our approach, the RAM usage increases while the energy consumption is reduced. The reason is that no transmission occurs until the sequence reaches the accepting state, and relatively few events (tuples) will satisfy it; hence, the number of transmissions after detections decreases. 2- FIRST is slightly better than RECENT, and both are better than UNRESTRICTED in energy consumption. The gap between them results from the processing energy consumption: UNRESTRICTED needs more processing power while the others need less, as shown in figure 6.

Figure 7 shows the radio energy consumption of each sensor node and the total number of messages for SEQ length 3. The nodes in the centralized architecture sent nearly three times more messages than in our approach and hence consumed more radio energy. Additionally, the gateway nodes consumed more radio energy due to receiving and processing the messages from the other sensor nodes. In a 25-node network, the centralized approach consumed about 4203 mJ on the sink side, while our approach consumed around 2811 mJ. Thus, our system saved nearly 1392 mJ (33% of the centralized approach). In our architecture, the number of transmissions was reduced; therefore, the radio energy consumption is reduced not only at the sensor nodes but also at the sink nodes.

Figure 7: Energy consumption vs. radio messages
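The reported savings for the 25-node network can be checked directly:

centralized_mj = 4203
in_network_mj = 2811
saved_mj = centralized_mj - in_network_mj      # 1392 mJ
saved_pct = 100 * saved_mj / centralized_mj    # about 33.1 %
print(saved_mj, round(saved_pct, 1))           # 1392 33.1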
8. CONCLUSIONS

Sensor networks provide a considerably challenging programming and computing environment. They require advanced paradigms for software design due to characteristics such as the limited computational power, limited memory, and limited battery power that WSNs suffer from. In this paper, we presented our in-network complex event processing system, which efficiently carries out complex event queries inside the network nodes. We proposed an engine that allows the system to detect complex events and valuable information from primitive events. We developed a query-plan-based approach to implement the system and provided an architecture to collect the events from the sensor network. This architecture includes three sides: the sensor side, which performs in-network complex event processing; the sink side, which delivers the events from the network to the server side; and the AnduIN server side, which is responsible for displaying them and performing further analysis. We demonstrated the effectiveness of our system in a detailed performance study. The results obtained from a comparison between the centralized approach and our approach confirm that our in-network complex event processing increases the lifetime of small-scale and large-scale sensor networks. We plan to continue our research to build distributed in-network complex event processing, in which each sensor node has a different complex event processing plan and the nodes can communicate directly with each other to detect complex events.

9. REFERENCES
[1] Nondeterministic finite automaton. http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton.
[2] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, J.-h. Hwang, A. Maskey, A. Rasin, A. Singer, M. Stonebraker, N. Tatbul, Y. Xing, R. Yan, and S. Zdonik. Aurora: a data stream management system. In ACM SIGMOD Conference, page 666, 2003.
[3] R. Bhargavi, V. Vaidehi, P. T. V. Bhuvaneswari, P. Balamuralidhar, and M. G. Chandra. Complex event processing for object tracking and intrusion detection in wireless sensor networks. In ICARCV, pages 848–853. IEEE, 2010.
[4] L. Brenna, A. Demers, J. Gehrke, M. Hong, J. Ossher, B. Panda, M. Riedewald, M. Thatte, and W. White. Cayuga: a high-performance event processing engine. In ACM SIGMOD, pages 1100–1102, New York, NY, USA, 2007. ACM.
[5] A. Cerpa, J. Elson, D. Estrin, L. Girod, M. Hamilton, and J. Zhao. Habitat monitoring: application driver for wireless communications technology. SIGCOMM Comput. Commun. Rev., 31(2 supplement):20–41, Apr. 2001.
[6] S. Chakravarthy, V. Krishnaprasad, E. Anwar, and S.-K. Kim. Composite events for active databases: semantics, contexts and detection. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 606–617, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[7] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD, pages 379–390, New York, NY, USA, 2000. ACM.
[8] EsperTech. Event stream intelligence: Esper & NEsper. http://www.esper.codehaus.org/.
[9] S. Gatziu and K. R. Dittrich. Events in an active object-oriented database system, 1993.
[10] V. Goebel and T. Plagemann. Data stream management systems - a technology for network monitoring and traffic analysis? In ConTEL 2005, volume 2, pages 685–686, June 2005.
[11] D. Gyllstrom, E. Wu, H. Chae, Y. Diao, P. Stahlberg, and G. Anderson. SASE: complex event processing over streams (Demo). In CIDR, pages 407–411, 2007.
[12] D. Klan, M. Karnstedt, K. Hose, L. Ribe-Baumann, and K. Sattler. Stream engines meet wireless sensor networks: cost-based planning and processing of complex queries in AnduIN. Distributed and Parallel Databases, 29(1):151–183, Jan. 2011.
[13] Y. Lai, W. Zeng, Z. Lin, and G. Li. LAMF: framework for complex event processing in wireless sensor networks. In 2nd International Conference on (ICISE), pages 2155–2158, Dec. 2010.
[14] P. Li and W. Bingwen. Design of complex event processing system for wireless sensor networks. In NSWCTC, volume 1, pages 354–357, Apr. 2010.
[15] D. C. Luckham. The Power of Events. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.
[16] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, Mar. 2005.
[17] J. Xingyi, L. Xiaodong, K. Ning, and Y. Baoping. Efficient complex event processing over RFID data stream. In IEEE/ACIS, pages 75–81, May 2008.
[18] X. Yang, H. B. Lim, T. M. Özsu, and K. L. Tan. In-network execution of monitoring queries in sensor networks. In ACM SIGMOD, pages 521–532, New York, NY, USA, 2007. ACM.
[19] Y. Yao and J. Gehrke. The cougar approach to in-network query processing in sensor networks. SIGMOD Rec., 31(3):9–18, Sept. 2002.
[20] M. Zoumboulakis and G. Roussos. Escalation: complex event detection in wireless sensor networks. In EuroSSC, pages 270–285, 2007.

XQuery processing over NoSQL stores

Henrique Valer, Caetano Sauer, Theo Härder
University of Kaiserslautern, P.O. Box 3049, 67653 Kaiserslautern, Germany
valer@cs.uni-kl.de, csauer@cs.uni-kl.de, haerder@cs.uni-kl.de

ABSTRACT
Using NoSQL stores as the storage layer for declarative query processing with XQuery provides a high-level interface to process data in an optimized manner. The term NoSQL refers to a plethora of new stores which essentially trade off the well-known ACID properties for higher availability or scalability, using techniques such as eventual consistency, horizontal scalability, efficient replication, and schema-less data models. This work proposes a mapping from the data models of different kinds of NoSQL stores—key/value, columnar, and document-oriented—to the XDM data model, thus allowing for standardization and for querying NoSQL data using higher-level languages such as XQuery.

Two of the main reasons for that are scalability and flexibility. The solution RDBMS provide is usually twofold: either (i) a horizontally-scalable architecture, which in database terms generally means giving up joins and also complex multi-row transactions; or (ii) parallel databases, using multiple CPUs and disks in parallel to optimize performance. While the latter increases complexity, the former simply gives up operations because they are too hard to implement in distributed environments. Nevertheless, these solutions are neither scalable nor flexible. NoSQL tackles these problems with a mix of techniques, which involves either weakening ACID properties or allowing more flexible data models.
The latter is rather simple: This work also explores several optimization scenarios to im- some scenarios—such as web applications—do not conform prove performance on top of these stores. Besides, we also to a rigid relational schema, cannot be bound to the struc- add updating semantics to XQuery by introducing simple tures of a RDBMS, and need flexibility. Solutions exist, such CRUD-enabling functionalities. Finally, this work analyzes as using XML, JSON, pure key/value stores, etc, as data the performance of the system in several scenarios. model for the storage layer. Regarding the former, some NoSQL systems relax consistency by using mechanisms such as multi-version concurrency control, thus allowing for even- Keywords tually consistent scenarios. Others support atomicity and NoSQL, Big Data, key/value, XQuery, ACID, CAP isolation only when each transaction accesses data within some convenient subset of the database data. Atomic oper- 1. INTRODUCTION ations would require some distributed commit protocol—like two-phase commit—involving all nodes participating in the We have seen a trend towards specialization in database transaction, and that would definitely not scale. Note that markets in the last few years. There is no more one-size- this has nothing to do with SQL, as the acronym NoSQL fits-all approach when comes to storing and dealing with suggests. Any RDBMS that relaxes ACID properties could data, and different types of DBMSs are being used to tackle scale just as well, and keep SQL as querying language. different types of problems. One of these being the Big Data Nevertheless, when it comes to performance, NoSQL sys- topic. tems have shown some interesting improvements. When It is not completely clear what Big Data means after all. considering update- and lookup-intensive OLTP workloads— Lately, it is being characterized by the so-called 3 V’s: vol- scenarios where NoSQL are most often considered—the work ume—comprising the actual size of data; velocity—compri- of [13] shows that the total OLTP time is almost evenly sing essentially a time span in which data data must be distributed among four possible overheads: logging, locking, analyzed; and variety—comprising types of data. Big Data latching, and buffer management. In essence, NoSQL sys- applications need to understand how to create solutions in tems improve locking by relaxing atomicity, when compared these data dimensions. to RDBMS. RDBMS have had problems when facing Big Data appli- When considering OLAP scenarios, RDBMS require rigid cations, like in web environments. Two of the main reasons schema to perform usual OLAP queries, whereas most NoSQL stores rely on a brute-force processing model called MapRe- duce. It is a linearly-scalable programming model for pro- cessing and generating large data sets, and works with any data format or shape. Using MapReduce capabilities, par- allelization details, fault-tolerance, and distribution aspects are transparently offered to the user. Nevertheless, it re- 24th GI-Workshop on Foundations of Databases (Grundlagen von Daten- quires implementing queries from scratch and still suffers banken), 29.05.2012 - 01.06.2012, Lübbenau, Germany. from the lack of proper tools to enhance its querying capa- Copyright is held by the author/owner(s). bilities. Moreover, when executed atop raw files, the pro- balancing and data replication. It does not have any rela- cessing is inefficient. 
NoSQL stores provide this structure, tionship between data, even though it tries by adding link thus one could provide a higher-level query language to take between key/value pairs. It provides the most flexibility, by full advantage of it, like Hive [18], Pig [16], and JAQL [6]. allowing for a per-request scheme on choosing between avail- These approaches require learning separated query lan- ability or consistency. Its distributed system has no master guages, each of which specifically made for the implementa- node, thus no single point of failure, and in order to solve tion. Besides, some of them require schemas, like Hive and partial ordering, it uses Vector Clocks [15]. Pig, thus making them quite inflexible. On the other hand, HBase enhances Riak’s data model by allowing colum- there exists a standard that is flexible enough to handle the nar data, where a table in HBase can be seen as a map of offered data flexibility of these different stores, whose compi- maps. More precisely, each key is an arbitrary string that lation steps are directly mappable to distributed operations maps to a row of data. A row is a map, where columns on MapReduce, and is been standardized for over a decade: act as keys, and values are uninterpreted arrays of bytes. XQuery. Columns are grouped into column families, and therefore, the full key access specification of a value is through column Contribution family concatenated with a column—or using HBase nota- Consider employing XQuery for implementing the large class tion: a qualifier. Column families make the implementation of query-processing tasks, such as aggregating, sorting, fil- more complex, but their existence enables fine-grained per- tering, transforming, joining, etc, on top of MapReduce as a formance tuning, because (i) each column family’s perfor- first step towards standardization on the realms of NoSQL mance options are configured independently, like read and [17]. A second step is essentially to incorporate NoSQL sys- write access, and disk space consumption; and (ii) columns tems as storage layer of such framework, providing a sig- of a column family are stored contiguously in disk. More- nificant performance boost for MapReduce queries. This over, operations in HBase are atomic in the row level, thus storage layer not only leverages the storage efficiency of keeping a consistent view of a given row. Data relations RDBMS, but allows for pushdown projections, filters, and exist from column family to qualifiers, and operations are predicate evaluations to be done as close to the storage level atomic on a per-row basis. HBase chooses consistency over as possible, drastically reducing the amount of data used on availability, and much of that reflects on the system archi- the query processing level. tecture. Auto-sharding and automatic replication are also This is essentially the contribution of this work: allowing present: shardling is automatically done by dividing data for NoSQL stores to be used as storage layer underneath in regions, and replication is achieved by the master-slave a MapReduce-based XQuery engine, Brackit[?]—a generic pattern. XQuery processor, independent of storage layer. We rely MongoDB fosters functionality by allowing more RDBMS- on Brackit’s MapReduce-mapping facility as a transparently like features, such as secondary indexes, range queries, and distributed execution engine, thus providing scalability. Mo- sorting. 
The data unit is a document, which is an ordered reover, we exploit the XDM-mapping layer of Brackit, which set of keys with associated values. Keys are strings, and provides flexibility by using new data models. We created values, for the first time, are not simply objects, or arrays three XDM-mappings, investigating three different imple- of bytes as in Riak or HBase. In MongoDB, values can be mentations, encompassing the most used types of NoSQL of different data types, such as strings, date, integers, and stores: key/value, column-based, and document-based. even embedded documents. MongoDB provides collections, The remainder of this paper is organized as follows. Sec- which are grouping of documents, and databases, which are tion 2 introduces the NoSQL models and their characteris- grouping of collections. Stored documents do not follow any tics. Section 3 describes the used XQuery engine, Brackit, predefined schema. Updates within a single document are and the execution environment of XQuery on top of the transactional. Consistency is also taken over availability in MapReduce model. Section 4 describes the mappings from MongoDB, as in HBase, and that also reflects in the system various stores to XDM, besides all implemented optimiza- architecture, that follows a master-worker pattern. tions. Section 5 exposes the developed experiments and the Overall, all systems provide scaling-out, replication, and obtained results. Finally, Section 6 concludes this work. parallel-computation capabilities. What changes is essen- tially the data-model: Riak seams to be better suited for 2. NOSQL STORES problems where data is not really relational, like logging. On the other hand, because of the lack of scan capabilities, on This work focuses on three different types of NoSQL stores, situations where data querying is needed, Riak will not per- namely key/value, columnar, and document-oriented, repre- form that well. HBase allows for some relationship between sented by Riak [14], HBase[11], and MongoDB[8], respec- data, besides built-in compression and versioning. It is thus tively. an excellent tool for indexing web pages, which are highly Riak is the simplest model we dealt with: a pure key/- textual (thus benefiting from compression), as well as inter- value store. It provides solely read and write operations to related and updatable (benefiting from built-in versioning). uniquely-identified values, referenced by key. It does not Finally, MongoDB provides documents as granularity unit, provide operations that span across multiple data items and thus fitting well when the scenario involves highly-variable there is no need for relational schema. It uses concepts or unpredictable data. such as buckets, keys, and values. Data is stored and ref- erenced by bucket/key pairs. Each bucket defines a virtual key space and can be thought of as tables in classical rela- 3. BRACKIT AND MAPREDUCE tional databases. Each key references a unique value, and Several different XQuery engines are available as options there are no data type definitions: objects are the only unit for querying XML documents. Most of them provide ei- of data storage. Moreover, Riak provides automatic load ther (i) a lightweight application that can perform queries on documents, or collections of documents, or (ii) an XML XQuery over MapReduce database that uses XQuery to query documents. 
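To contrast the three data models described in this section, the following sketch (ours, not an API of Riak, HBase, or MongoDB) shows them as nested Python structures; all keys and values are invented example data.

# Riak: bucket -> key -> opaque value, optionally linked to other bucket/key pairs.
riak_like = {
    "users": {                                   # bucket (a virtual key space)
        "u1": {"value": b"raw bytes",            # values are uninterpreted objects
               "links": [("orders", "o7")]},     # link to another bucket/key pair
    }
}

# HBase: table -> row key -> column family -> qualifier -> value (bytes).
hbase_like = {
    "partsupp": {                                # table
        "row-1": {
            "references": {"partkey": b"155190", "suppkey": b"7706"},
            "values":     {"availqty": b"8076", "supplycost": b"993.49"},
        }
    }
}

# MongoDB: database -> collection -> document; values are typed and may nest.
mongo_like = {
    "tpch": {                                    # database
        "partsupp": [                            # collection
            {"_id": "doc-1", "partkey": 155190, "suppkey": 7706,
             "supplier": {"name": "Supplier#7706", "nation": "GERMANY"}},
        ]
    }
}

# Full key access in HBase goes table -> row -> column family -> qualifier:
print(hbase_like["partsupp"]["row-1"]["values"]["availqty"])  # b'8076'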
The for- Mapping XQuery to the MapReduce model is an alternative mer lacks any sort of storage facility, while the latter is just to implementing a distributed query processor from scratch, not flexible enough, because of the built-in storage layer. as normally done in parallel databases. This choice relies Brackit1 provides intrinsic flexibility, allowing for different on the MapReduce middleware for the distribution aspects. storage levels to be “plugged in”, without lacking the neces- BrackitMR is one such implementation, and is more deeply sary performance when dealing with XML documents [5]. discussed in [17]. It achieves a distributed XQuery engine in By dividing the components of the system into different Brackit by scaling out using MapReduce. modules, namely language, engine, and storage, it gives us The system hitherto cited processes collections stored in the needed flexibility, thus allowing us to use any store for HDFS as text files, and therefore does not control details our storage layer. about encoding and management of low-level files. If the DBMS architecture [12] is considered, it implements solely Compilation the topmost layer of it, the set-oriented interface. It executes The compilation process in Brackit works as follows: the processes using MapReduce functions, but abstracts this parser analyzes the query to validate the syntax and ensure from the final user by compiling XQuery over the MapRe- that there are no inconsistencies among parts of the state- duce model. ment. If any syntax errors are detected, the query compiler It represents each query in MapReduce as sequence of jobs, stops processing and returns the appropriate error message. where each job processes a section of a FLWOR pipeline. Throughout this step, a data structure is built, namely an In order to use MapReduce as a query processor, (i) it AST (Abstract Syntax Tree). Each node of the tree de- breaks FLWOR pipelines are into map and reduce functions, notes a construct occurring in the source query, and is used and (ii) groups these functions to form a MapReduce job. through the rest of the compilation process. Simple rewrites, On (i), it converts the logical-pipeline representation of the like constant folding, and the introduction of let bindings are FLWOR expression—AST—to a MapReduce-friendly ver- also done in this step. sion. MapReduce uses a tree of splits, which represents the The pipelining phase transforms FLWOR expressions into logical plan of a MapReduce-based query. Each split is a pipelines—the internal, data-flow-oriented representation of non-blocking operator used by MapReduce functions. The FLWORs, discussed later. Optimizations are done atop structure of splits is rather simple: it contains an AST and pipelines, and the compiler uses global semantics stored in pointers to successor and predecessor splits. Because splits the AST to transform the query into a more-easily-optimized are organized in a bottom-up fashion, leaves of the tree are form. For example, the compiler will move predicates if pos- map functions, and the root is a reduce function—which sible, altering the level at which they are applied and poten- produces the query output. tially improving query performance. This type of opera- On (ii), the system uses the split tree to generate pos- tion movement is called predicate pushdown, or filter push- sibly multiple MapReduce job descriptions, which can be down, and we will apply them to our stores later on. More executed in a distributed manner. 
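The split-tree organization just described can be sketched as follows. This is our simplification based on the description of BrackitMR above, not its actual classes; the split names and the single job description produced at the end are assumptions for illustration.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Split:
    name: str                                   # stands in for the AST fragment of this split
    predecessors: List["Split"] = field(default_factory=list)
    successor: Optional["Split"] = None

def connect(pred: "Split", succ: "Split") -> None:
    # Wire the tree bottom-up: a predecessor split feeds its successor.
    pred.successor = succ
    succ.predecessors.append(pred)

def walk(split: "Split"):
    # Yield the split and, recursively, everything below it.
    yield split
    for p in split.predecessors:
        yield from walk(p)

def describe_job(root: "Split") -> dict:
    # Leaves of the split tree act as map functions, the root as the reduce function.
    leaves = [s.name for s in walk(root) if not s.predecessors]
    return {"map": leaves, "reduce": root.name}

# A FLWOR pipeline with two scan branches feeding one grouping/output split.
scan_a = Split("scan collection('partsupp')")
scan_b = Split("scan collection('lineitem')")
group = Split("group by / return")
connect(scan_a, group)
connect(scan_b, group)
print(describe_job(group))
# {'map': ["scan collection('partsupp')", "scan collection('lineitem')"], 'reduce': 'group by / return'}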
Jobs are exactly the ones optimizations such as join recognition, and unnesting are used on Hadoop MapReduce [20], and therefore we will not present in Brackit and are discussed in [4]. In the opti- go into details here. mization phase, optimizations are applied to the AST. The distribution phase is specific to distributed scenarios, and 4. XDM MAPPINGS is where MapReduce translation takes place. More details about the distribution phase are presented in [17]. At the This section shows how to leverage NoSQL stores to work end of the compilation, the translator receives the final AST. as storage layer for XQuery processing. First, we present It generates a tree of executable physical operators. This mappings from NoSQL data models to XDM, adding XDM- compilation process chain is illustrated in Figure 1. node behavior to these data mappings. Afterwards, we dis- cuss possible optimizations regarding data-filtering techniques. Riak Riak’s mapping strategy starts by constructing a key/value tuple from its low-level storage representation. This is es- sentially an abstraction and is completely dependent on the storage used by Riak. Second, we represent XDM opera- tions on this key/value tuple. We map data stored within Riak utilizing Riak’s linking mechanism. A key/value pair kv represents an XDM element, and key/value pairs linked to kv are addressed as children of kv. We map key/value tuples as XDM elements. The name of the element is sim- ply the name of the bucket it belongs to. We create one bucket for the element itself, and one extra bucket for each link departing from the element. Each child element stored Figure 1: Compilation process in Brackit [5]. in a separated bucket represents a nested element within the key/value tuple. The name of the element is the name of the link between key/values. This does not necessarily decrease data locality: buckets are stored among distributed nodes 1 Available at http:\\www.brackit.org based on hashed keys, therefore uniformly distributing the Figure 2: Mapping between an HBase row and an XDM instance. load on the system. Besides, each element has an attribute behavior to data. Brackit interacts with the storage using key which Riak uses to access key/value pairs on the storage this interface. It provides general rules present in XDM [19], level. Namespaces [2], and Xquery Update Facility [3] standards, It allows access using key/value as granularity, because resulting in navigational operations, comparisons, and other every single element can be accessed within a single get op- functionalities. RiakRowNode wraps Riak’s buckets, key/- eration. Full reconstruction of an element el requires one ac- values, and links. HBaseRowNode wraps HBase’s tables, col- cess for each key/value linked to el. Besides, Riak provides umn families, qualifiers, and values. Finally, MongoRowN- atomicity using single key/value pairs as granularity, there- ode wraps MongoDB’s collections, documents, fields, and fore consistent updates of multiple key/value tuples cannot values. be guaranteed. Overall, each instance of these objects represents one unit of data from the storage level. In order to better grasp the HBase mapping, we describe the HBase abstraction in more de- HBase’s mapping strategy starts by constructing a colum- tails, because it represents the more complex case. Riak’s nar tuple from the HDFS low-level-storage representation. 
and MongoDB’s representation follow the same approach, HBase stores column-family data in separated files within but without a “second-level node”. Tables are not repre- HDFS, therefore we can use this to create an efficient map- sented within the Node interface, because their semantics ping. Figure 2 presents this XDM mapping, where we map represent where data is logically stored, and not data itself. a table partsupp using two column families: references and Therefore, they are represented using a separated interface, values, five qualifiers: partkey, suppkey, availqty, supplycost, called Collection. Column families represent a first-level- and comment. We map each row within an HBase table to access. Qualifiers represent a second-level-access. Finally, an XDM element. The name of the element is simply the values represent a value-access. Besides, first-level-access, name of the table it belongs to, and we store the key used second-level-access, and value-access must keep track of cur- to access such element within HBase as an attribute in the rent indexes, allowing the node to properly implement XDM element. The figure shows two column families: references operations. Figure 3 depicts the mapping. The upper-most and values. Each column family represents a child element, part of the picture shows a node which represents a data whose name is the name of the column family. Accordingly, row from any of the three different stores. The first layer each qualifier is nested as a child within the column-family of nodes—with level = 1st—represents the first-level-access, element from which it descends. explained previously. The semantic of first-level-access dif- fers within different stores: while Riak and MongoDB inter- MongoDB pret it as a value wrapper, HBase prefers a column family MongoDB’s mapping strategy is straight-forward. Because wrapper. Following, HBase is the only implementation that it stores JSON-like documents, the mapping consists essen- needs a second-level-access, represented by the middle-most tially of a document field → element mapping. We map node with level = 2nd, in this example accessing the wrap- each document within a MongoDB collection to an XDM el- per of regionkey = “1”. Finally, lower-level nodes with level ement. The name of the element is the name of the collection = value access values from the structure. it belongs to. We store the id —used to access the document within MongoDB—as an attribute on each element. Nested Optimizations within the collection element, each field of the document We introduce projection and predicate pushdowns optimiza- represents a child element, whose name is the name of the tions. The only storage that allows for predicate push- field itself. Note that MongoDB allows fields to be of type down is MongoDB, while filter pushdown is realized on all of document, therefore more complex nested elements can be them. These optimizations are fundamental advantages of achieved. Nevertheless, the mapping rules work recursively, this work, when compared with processing MapReduce over just as described above. raw files: we can take “shortcuts” that takes us directly to the bytes we want in the disk. Nodes Filter and projections pushdown are an important opti- We describe XDM mappings using object-oriented notation. mization for minimizing the amount of data scanned and Each store implements a Node interface that provides node processed by storage levels, as well as reducing the amount a $key, therefore allowing for both insertions and updates. 
db:insert($table as xs:string, $key as xs:string, $value as node()) as xs:boolean The delete function deletes a values from the store. We also provide two possible signatures: with or without $key, therefore allowing for deletion of a giveng key, or droping a given table. db:delete($table as xs:string, $key as xs:string) as xs:boolean 5. EXPERIMENTS The framework we developed in this work is mainly con- cerned with the feasibility of executing XQuery queries atop NoSQL stores. Therefore, our focus is primarily on the proof of concept. The data used for our tests comes from the TPC- H benchmark [1]. The dataset size we used has 1GB, and we essentially scanned the five biggest tables on TPC-H: part, partsupp, order, lineitem, and customer. The experi- ments were performed in a single Intel Centrino Duo dual- core CPU with 2.00 GHz, with 4GB RAM, running Ubuntu Linux 10.04 LTS. HBase used is version 0.94.1, Riak is 1.2.1, and MongoDB is 2.2.1. It is not our goal to assess the scal- ability of these systems, but rather their query-procedure Figure 3: Nodes implementing XDM structure. performance. For scalability benchmarks, we refer to [9] and [10]. of data passed up to the query processor. Predicate push- 5.1 Results down is yet another optimization technique to minimize the amount of data flowing between storage and processing lay- ers. The whole idea is to process predicates as early in the plan as possible, thus pushing them to the storage layer. On both cases we traverse the AST, generated in the be- ginning of the compilation step, looking for specific nodes, and when found we annotate the collection node on the AST with this information. The former looks for path expres- sions (PathExpr ) that represent a child step from a collec- tion node, or for descendants of collection nodes, because in the HBase implementation we have more than one access level within storage. The later looks for general-comparison operators, such as equal, not equal, less than, greater than, less than or equal to, and greater than or equal to. After- wards, when accessing the collection on the storage level, we use the marked collections nodes to filter data, without Figure 4: Latency comparison among stores. further sending it to the query engine. Figure 4 shows the gathered latency times of the best NoSQL updates schemes of each store, using log-scale. As we can see, all ap- The used NoSQL stores present different API to persist data. proaches take advantage from the optimization techniques. Even though XQuery does not provide data-storing mecha- The blue column of the graph—full table scan—shows the nisms on its recommendation, it does provide an extension latency when scanning all data from TPC-H tables. The red called XQuery Update Facility [3] for that end. It allows column —single column scan—represents the latency when to add new nodes, delete or rename existing nodes, and re- scanning a simple column of each table. Filter pushdown op- place existing nodes and their values. XQuery Update Fa- timizations explain the improvement in performance when cility adds very natural and efficient persistence-capabilities compared to the first scan, reducing the amount of data flow- to XQuery, but it adds lots of complexity as well. More- ing from storage to processing level. The orange column— over, some of the constructions need document-order, which predicate column scan—represents the latency when scan- is simply not possible in the case of Riak. 
Therefore, simple- ning a single column and where results were filtered by a semantic functions such as “insert” or “put” seem more at- predicate. We have chosen predicates to cut in half the tractive, and achieve the goal of persisting or updating data. amount of resulting data when compared with single column The insert function stores a value within the underlying scan. The querying time was reduced in approximately 30%, store. We provide two possible signatures: with or without not reaching the 50% theoretically-possible-improvement rate, essentially because of processing overhead. Nevertheless, it Independence, and Parallelism. PhD thesis, University shows how efficient the technique is. of Kaiserslautern, 12 2012. In scanning scenarios like the ones on this work, MongoDB [5] S. Bächle and C. Sauer. Unleashing xquery for has shown to be more efficient than the other stores, by al- data-independent programming. Submitted, 2011. ways presenting better latency. MongoDB was faster by de- [6] K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, sign: trading of data-storage capacity for data-addressability M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. has proved to be a very efficiency-driven solution, although Shekita. Jaql: A scripting language for large scale being a huge limitation. Moreover, MongoDB uses pre- semistructured data analysis. PVLDB, caching techniques. Therefore, at run-time it allows work- 4(12):1272–1283, 2011. ing with data almost solely from main memory, specially in [7] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, scanning scenarios. D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. 6. CONCLUSIONS VLDB Endow., 1(2):1265–1276, Aug. 2008. We extended a mechanism that executes XQuery to work [8] K. Chodorow and M. Dirolf. MongoDB: The with different NoSQL stores as storage layer, thus providing Definitive Guide. Oreilly Series. O’Reilly Media, a high-level interface to process data in an optimized man- Incorporated, 2010. ner. We have shown that our approach is generic enough to [9] B. F. Cooper, A. Silberstein, E. Tam, work with different NoSQL implementations. R. Ramakrishnan, and R. Sears. Benchmarking cloud Whenever querying these systems with MapReduce—ta- serving systems with ycsb. In Proceedings of the 1st king advantage of its linearly-scalable programming model ACM symposium on Cloud computing, SoCC ’10, for processing and generating large-data sets—parallelization pages 143–154, New York, NY, USA, 2010. ACM. details, fault-tolerance, and distribution aspects are hidden [10] T. Dory, B. Mejhas, P. V. Roy, and N. L. Tran. from the user. Nevertheless, as a data-processing paradigm, Measuring elasticity for cloud databases. In MapReduce represents the past. It is not novel, does not use Proceedings of the The Second International schemas, and provides a low-level record-at-a-time API: a Conference on Cloud Computing, GRIDs, and scenario that represents the 1960’s, before modern DBMS’s. Virtualization, 2011. It requires implementing queries from scratch and still suf- [11] L. George. HBase: The Definitive Guide. O’Reilly fers from the lack of proper tools to enhance its querying Media, 2011. capabilities. Moreover, when executed atop raw files, the [12] T. Härder. Dbms architecture - new challenges ahead. processing is inefficient—because brute force is the only pro- Datenbank-Spektrum, 14:38–48, 2005. cessing option. We solved precisely these two MapReduce [13] S. Harizopoulos, D. J. Abadi, S. 
Madden, and problems: XQuery works as the higher-level query language, M. Stonebraker. Oltp through the looking glass, and and NoSQL stores replace raw files, thus increasing perfor- what we found there, 2008. mance. Overall, MapReduce emerges as solution for situ- [14] R. Klophaus. Riak core: building distributed ations where DBMS’s are too “hard” to work with, but it applications without shared state. In ACM SIGPLAN should not overlook the lessons of more than 40 years of Commercial Users of Functional Programming, CUFP database technology. ’10, pages 14:1–14:1, New York, NY, USA, 2010. Other approaches cope with similar problems, like Hive, ACM. and Scope. Hive [18] is a framework for data warehousing on [15] F. Mattern. Virtual time and global states of top of Hadoop. Nevertheless, it only provides equi-joins, and distributed systems. In C. M. et al., editor, Proc. does not fully support point access, or CRUD operations— Workshop on Parallel and Distributed Algorithms, inserts into existing tables are not supported due to sim- pages 215–226, North-Holland / Elsevier, 1989. plicity in the locking protocols. Moreover, it uses raw files as storage level, supporting only CSV files. Moreover, Hive [16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and is not flexible enough for Big Data problems, because it is A. Tomkins. Pig latin: a not-so-foreign language for not able to understand the structure of Hadoop files with- data processing. In Proceedings of the 2008 ACM out some catalog information. Scope [7] provides a declar- SIGMOD international conference on Management of ative scripting language targeted for massive data analysis, data, SIGMOD ’08, pages 1099–1110, New York, NY, borrowing several features from SQL. It also runs atop a USA, 2008. ACM. distributed computing platform, a MapReduce-like model, [17] C. Sauer. Xquery processing in the mapreduce therefore suffering from the same problems: lack of flexibil- framework. Master thesis, Technische Universität ity and generality, although being scalable. Kaiserslautern, 2012. [18] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - 7. REFERENCES a petabyte scale data warehouse using hadoop. In [1] The tpc-h benchmark. http://www.tpc.org/tpch/, ICDE, pages 996–1005, 2010. 1999. [19] N. Walsh, M. Fernández, A. Malhotra, M. Nagy, and [2] Namespaces in xml 1.1 (second edition). J. Marsh. XQuery 1.0 and XPath 2.0 data model http://www.w3.org/TR/xml-names11/, August 2006. (XDM). http://www.w3.org/TR/2007/ [3] Xquery update facility 1.0. http://www.w3.org/TR/ REC-xpath-datamodel-20070123/, January 2007. 2009/CR-xquery-update-10-20090609/, June 2009. [20] T. White. Hadoop: The Definitive Guide. O’Reilly [4] S. Bächle. Separating Key Concerns in Query Media, 2012. Processing - Set Orientation, Physical Data