=Paper= {{Paper |id=Vol-1366/paper17.pdf |storemode=property |title=Large-scale Analysis of Event Data |pdfUrl=https://ceur-ws.org/Vol-1366/paper17.pdf |volume=Vol-1366 |dblpUrl=https://dblp.org/rec/conf/gvd/HagedornSG15 }} ==Large-scale Analysis of Event Data== https://ceur-ws.org/Vol-1366/paper17.pdf
                            Large-scale Analysis of Event Data

                Stefan Hagedorn                           Kai-Uwe Sattler                           Michael Gertz
               TU Ilmenau, Germany                      TU Ilmenau, Germany                      Heidelberg University,
             stefan.hagedorn@tu-                       kus@tu-ilmenau.de                               Germany
                  ilmenau.de                                                                   gertz@informatik.uni-
                                                                                                   heidelberg.de


ABSTRACT                                                                   of such information about events are temporal taggers for
With the availability of numerous sources and the devel-                   the temporal component and geo-taggers for the spatial or
opment of sophisticated text analysis and information re-                  geographic components [21]. Such taggers detect, extract,
trieval techniques, more and more spatio-temporal data are                 and normalize respective expressions in textual data and
extracted from texts such as news documents or social net-                 provide subsequent tools as the basis for further tasks such
work data. Temporal and geographic information obtained                    as document clustering or classification. For example, from
this way often form some kind of event, describing when                    the sentence “Obama visited Germany in April 2009”, a tem-
and where something happened. An important task in the                     poral tagger would detect the expression “April 2009” and
context of business intelligence and document exploration                  normalize it to “2009-04”; similarly, a geo-tagger would ex-
applications is the correlation of events in terms of their                tract the expression “Germany” and often associate further
temporal, geographic or even semantic properties. In this                  information with it. This might include the spatial extent in
paper we discuss the tasks related to event data analysis,                 the form of a polygonal description or that Germany is part
ranging from the extraction of events to determining events                of Europe (using some concept hierarchy). Such a pair of
that are similar in terms of space and time by using skyline               temporal component and geographic component then forms
processing and clustering. We present a framework imple-                   an event. Given that today’s taggers become more and more
mented in Apache Spark that provides operators supporting                  sophisticated, large repositories of event information can be
these tasks and thus allows to build analysis pipelines.                   built from news articles, blog entries, social network post-
                                                                           ings, and medical records, to name but a few. The task we
                                                                           focus on in this paper is to find events that are correlated to
1.   INTRODUCTION                                                          a given event in terms of its time and place of occurrence.
   Traditionally, research on querying and analyzing spatio-               The result of such a query is a list of pointers to documents
temporal data focuses on data obtained from sensors that                   in which similar (correlated) events have been detected.
record the evolution and movement of objects in 2- or 3-                      For such correlation tasks, one is facing several challenges
dimensional space over time. Typical examples of such data                 ranging from extracting event data from documents, deal-
include trajectories of moving objects (e.g., [7]) and remotely-           ing with imprecise event specifications resulting from the
sensed data (e.g., [10]). Spatio-temporal data, however, do                extraction process, to exploiting different correlation tech-
not only arise in the context of sensor data or, more gen-                 niques for finding similar events, to scalable processing for
erally, from observing objects with their spatial extent over              large datasets.
time. In this paper, we consider events as an important type                  In this paper, we describe a framework for analyzing event
of spatio-temporal data that thus far have been neglected                  data and detecting correlations between events. Based on
in the context of data analysis. Events play an important                  a model for spatio-temporal events at different granularities
role in information retrieval and text analysis in general,                we discuss corresponding distance measures. We present the
most prominently in the task of topic detection and track-                 analysis tasks and discuss how these tasks can be supported
ing (TDT) [2, 13]. An event is often described as “something               by a set of basic operators for extracting event data from text
that happens at some place at some time”, e.g., [24]. Thus,                documents, preparing the data, and analyzing event correla-
events inherently have a spatial and a temporal component.                 tions using top-k and skyline processing as well as clustering.
   A prominent source for event data are textual informa-                  We further describe how these operators are supported as
tion from newsfeeds, websites, and social media such as                    transformation operators in Apache Spark and outline the
blogs, tweets or social networks. Underlying the extraction                support for explorative analyses.



                                                                           2.   EVENT MODEL
                                                                              An important data analysis task is to find similar events,
                                                                           i.e. events that share the same context or have other values
                                                                           in common. In this paper, we consider the spatio-temporal
                                                                           aspect of events for determining similarities, i.e., we focus
27th GI-Workshop on Foundations of Databases (Grundlagen von Daten-
banken), 26.05.2015 - 29.05.2015, Magdeburg, Germany.                      on the similarity of their respective location and/or time of
Copyright is held by the author/owner(s).                                  occurrence.

                                                                      90
For our analysis framework, we assume an event model in                  value for time (e.g., distance in days) and location (e.g. dis-
which information about events has been extracted from                   tance in meters). Both values can be combined into a single
some document (see Sect. 4) and is represented by useful                 (weighted) result. Although we cannot use the traditional
and necessary information like ID, description, origin, etc.             distance functions for events that are imprecise in at least
as well as a temporal and a geographic component, describ-               one component, we can specify new distance functions ac-
ing the when and where. The expressions underlying these                 cordingly. However, the definition of such functions is be-
components are based on concept hierarchies for time and                 yond the scope of this paper and we will give only a brief
space, as shown in Figure 1.                                             overview below.
                                                                           First, we convert the imprecise event data into intervals
                       year          country
                                                                         and regions. As imprecise temporal data is defined as a
                                                                         month or year (to be imprecise the day part is missing), we
                                                                         can create an interval of days that starts with the minimal
                      month           state                              possible day, i.e., the first day of the corresponding month or
                                                                         a year and ends with the maximal possible day, which is the
                       day             city
                                                                         last day of the month or year, respectively. Each subinter-
                                                                         val is called a valid instance of this interval. As an example
                                                                         consider the imprecise temporal expression “2014-05” that is
Figure 1: Concept hierarchies for temporal and ge-                       mapped to the (right-open) interval i = [2014-05-01, 2014-
ographic information                                                     05-30]). Note that any subinterval in i is a valid instance
                                                                         of the temporal expression. Analogously, for imprecise ge-
                                                                         ographic expression, i.e., expressions not at the city level,
   The temporal as well as the spatial expressions can be
                                                                         a polygon (or its minimum bounding box) is used. Then,
of different granularities. For temporal expressions we con-
                                                                         a function that calculates the distance between such inter-
sider days to be of the finest and years of the coarsest gran-
                                                                         vals and between polygons/boxes is needed. Such distance
ularity. Of course one could also extend this model with
                                                                         function can yield a scalar value, which may be the average
further granularity levels, such as weeks, hours, or minutes.
                                                                         distance between points in the respective intervals or poly-
In this paper, however, and for the sake of simplicity, we
                                                                         gons, or it can yield an interval, representing the minimum
will only consider days, months, and years. We denote the
                                                                         and maximum distance between the intervals/polygons.
corresponding domains as T = {Tday , Tmonth , Tyear }. For
                                                                           The Hausdorff distance is a typical approach to calculate
example, “2001-05-20”, “2013-05”, and “1995” are all valid
                                                                         the distance between two intervals or polygons as a scalar
temporal expressions from these three different domains.
                                                                         value. Its main idea is to compute the maximum distance
   Analogously, geographic expressions are taken from the
                                                                         between any point in one interval to any other point of the
domains in G = {Gcity , Gstate , Gcountry }. We assume that
                                                                         second interval. For two temporal intervals t1 and t2 , the
with each expression a spatial object in the form of a single
                                                                         distance can be computed as
polygon (without holes) is associated. For example, the ge-
ographic expressions “Heidelberg” and “Germany” are both                      dH (t1 , t2 ) := max{max dh (i1 , t2 ), max dh (i2 , t1 )}
                                                                                                   i1 ∈t1            i2 ∈t2
valid expressions of type Gcity and Gcountry , respectively.
                                                                         where the distance dh (i, t) between a day i and a temporal
Definition. (Event) Given concept hierarchies T and G for                interval t defined as dh (i, t) := mini0 ∈t (|i − i0 |).
     temporal and geographic expressions, respectively. An                 For polygons/boxes, the Hausdorff distance can be defined
     event e = ht, gi consists of a temporal expression t                as follows:
     with t.type ∈ T and a geographic expression g with
     g.type ∈ G.                                                                        dH (g1 , g2 ) := max min ||x − y||
                                                                                                        x∈g1 y∈g2

  Examples of (imprecise) event specifications are (2013-09-                This is the greatest distance from a point in g1 to the
02, Munich), (1955, Germany), or (2000-04, Bavaria). To                  closest point in g2 . However, the calculation of this distance
account for these types of uncertainties, in our framework               measure can become very expensive as the distance from
we make the following assumptions:                                       every point in the first interval to every other point in the
                                                                         second interval has to be computed. For temporal intervals
  1. Temporal and geographic expressions of the finest gran-
                                                                         this is not a big problem, because when taking years as the
     ularity are certain.
                                                                         most imprecise and days as most precise level, such an in-
  2. Every temporal (resp. geographic) expression of type                terval can have only 365 (or 366) points at most. However,
     P 0 that refines a given temporal (resp. geographic) ex-            for polygons the set of inner points is in principle infinite.
     pression of type P , with P 0 being of finer granularity            On the one hand, one could use the cities that are located
     than P , is equally likely.                                         in the respective region. This approach may not lead to
                                                                         good results as many cities may be clustered around a large
Distance Measures. To determine the similarity between                   metropolis, whereas in rural regions only very few cities can
events, a distance measure is needed that takes both the                 be found. On the other hand, one could impose a raster
temporal and the geographic component of an event into                   on the respective regions using a fixed grid. Then, only
account, both of which can be imprecise.                                 the points of this grid will be used for calculations. Then,
   Precise events are those events for which both the location           however, the definition of the grid size may be difficult since
and temporal information is given as a fine-grained concept,             choosing a small grid step size may lead to many data points
i.e., they are defined on the city and day level. For such               that will take part in the calculation and thus, will not im-
events, calculating the distance is trivial resulting in a scalar        prove the computation time. Choosing a large grid size will

                                                                    91
result in only very few data points, which might again re-                All events extracted from a document or document col-
sult in incorrect distance values. In order to use common              lection are then stored based on our event model described
distance functions, one could reduce a polygon to a single             in Sect. 2. For each event detected, it describes the normal-
point that represents this polygon, e.g., the center point.            ized temporal and geographic expression, the granularity of
However, these simplifications result in distances that may            the expressions, and also some offset information (in what
not necessarily reflect the real distances. Therefore, more            document and at what position each of the two components
sophisticated distance functions are needed.                           have been determined).

                                                                       3.2    Correlation Analysis
3.     ANALYSIS TASKS
                                                                         Correlations on event data can be computed in different
   Analysis of event data comprises several tasks. Depending           ways, depending on specific applications. In the following,
on the sources of events (structured data, text documents)             we discuss the three operations for correlation computations.
and the specific questions different tasks have to be com-
bined in an analysis pipeline. In addition, such analyses
                                                                       Nearest Neighbor Queries.
are often interactive and explorative processes requiring a
                                                                         A straightforward solution to find correlated, i.e., similar,
flexible combination of a set of powerful extraction and cor-
                                                                       events is to find the neighbors of a given reference event.
relation operators. The whole process is depicted in Fig. 2.
                                                                       Given a set of events E, a reference event er and a dis-
In the following, we describe a basic set of tasks that our
                                                                       tance function dist, the task is to find the set kNN(er ) of
framework supports by dedicated operators.
                                                                       the k nearest events. In the case of our spatio-temporal
3.1     Event Information Extraction                                   event data this requires a single distance measure, which is
                                                                       usually defined using weights for the spatial and temporal
   Prior to any analysis task event information needs to be
                                                                       distances. These weights are necessary, because we cannot
extracted from the text documents of a given corpus, such
                                                                       directly combine the spatial distance with the temporal dis-
as Wikipedia or news articles. As an event is assumed to be
                                                                       tance into one distance measure. However, with the weights
composed of a temporal expression describing the “when” of
                                                                       we can adjust the impact of the spatial and temporal dis-
an event and a geographic expression describing the “where”
                                                                       tance to the overall distance:
of an event, respective event components need to be de-
termined in a given text and mapped to some standard for-                    dist(e1 , e2 ) = wg · distg (e1 , e2 ) + wt · distt (e1 , e2 )
mat. The function of a temporal tagger is to determine such
temporal information and to map each piece of information              where wt , wg ∈ [0, 1] and wt + wg = 1.
found to a temporal expression. One typically distinguishes
between explicit, implicit and relative information tempo-             Skyline.
ral information. Explicit temporal information refers to a                In contrast to the Nearest Neighbor search, Skyline queries
sequence of tokens that describe an exact date, month in a             do not need weights for the spatial and temporal distance,
year or year, e.g., “April 1, 2015”. Implicit temporal infor-          which are often difficult to determine. Adapted to our sce-
mation often refers to holidays or some named events, such             nario, the notion of the Skyline algorithm is to find those
as “Independence Day 2015” or “Summer Olympics 2012”.                  events in E that “match” a query event q = htq , gq i best.
Finally, relative temporal information can only be deter-              Since we consider two dimension for events, time and space,
mined using the context or document in which the token                 it is thus intuitive to employ a skyline-based approach as
sequence appears. For example, in this paper, the string               there might be events that match tq well but not gq , and
“last year” would refer to the year 2014. Popular tempo-               vice versa. A core concept of skylines is the dominance re-
ral taggers such as HeidelTime [21] are able to detect such            lationship. The skyline Sq consists of all events that are not
information and map them to a standard format, which is                dominated by any other event in E with respect to q:
mostly based on the TIMEX3 annotation standard. For                          Sq = {ei |ei ∈ E ∧ ¬∃ej ∈ E : ej 6= ei ∧ ej q ei }
example, the above explicit temporal information “April 1,
2015” would be mapped to the temporal expression “2015-                   where q denotes a dominance relationship with ej q ei
04-01”.                                                                meaning that ej is “in closer proximity to q” than ei . Because
   Similarly, for geographic information in documents, a geo-          the dominance of an event with respect to another event
tagger is employed. Popular tools include GeoNames1 or Ya-             is determined based on their respective distances to q, the
hoo! Placemaker2 . Mapping token sequences found in a text             distance function outlined before comes into play.
to some standardized format, however, is less trivial. For ex-
ample, most taggers would map the string “Magdeburg” to                Clustering.
a latitude/longitude pair rather than to some complex poly-               Other than in the two approaches mentioned before, for
gon description. Also, several geo-taggers also include some           clustering, a definition of a reference event is not needed. It
information about the granularity of the geographic expres-            is typically understood as the task of grouping objects based
sion based on concept hierarchies.                                     on the principle of maximizing the intra-class similarity and
   Once a temporal tagger and a geo-tagger have extracted              minimizing the inter-class similarity. However, there is a
and mapped temporal expressions from a given text, events              rich diversity in specific definitions of clustering and many
are formed. In the most simple case, an event is a pair con-           different techniques exist [1].
sisting of a temporal expression and a geographic expression              The clusters can be formed using distance values between
found in close text proximity, typically at the sentence level.        other events. Thus, events that belong to the same cluster
1
                                                                       can be considered to be correlated as they occur around the
    http://www.geonames.org/                                           same time and place.
2
    https://developer.yahoo.com/boss/geo/

                                                                  92
Text documents
                                                                                                            Event
                                                                                                            Model
                                Extraction



Contextual Information
(e.g., GeoNames)                                                                                                                                        Applications
                                                      Mapping &                                                        Event
                                                     Normalization                                                  Correlation &
                                                                                                                     Resolution                            YAGO2
                                                                                                Raw event                                   Event          Import
      Vocabularies                                                                                data                                    repository

                                                                                    Processing pipeline


                                                    Figure 2: Overview of event analysis pipeline.

4.       PROCESSING PIPELINE                                                                           CalcDistance: This implements a transformation operator
   After the events have been extracted from the text docu-                                                 for calculating the geographic and temporal distance
ment using the approach as outline in Section 2, the events                                                 dist of each event of an RDD to a given reference event.
are fed into the correlation pipeline. This pipeline is imple-                                         TopKOp: This operator computes the k nearest neighbors as
mented using the Apache Spark3 platform. However, it can                                                    a top-k list of events from an input RDD produced by
be easily ported to other platforms like Apache Flink4 or                                                   CalcDistance. Parameters to this operator are k as
even Google’s new DataFlow SDK5 if they provide a Scala-                                                    well as the weights for the geographic (wg ) and tem-
based API. We chose Spark over other MapReduce based ap-                                                    poral (wt ) distance.
proaches, e.g., Pig, because it provides a very efficient data
structure called resilient distributed datasets (RDD). RDDs                                            SkylineOp: This operator computes the skyline of event re-
are immutable, in-memory partitioned collections that can                                                   cords from an RDD produced by CalcDistance. Our
be distributed among the cluster nodes. With the help of                                                    implementation adopts the grid-based approach using
such a data structure, the performance of the computations                                                  bitsets described in [14] for the Spark platform. The
is significantly faster than in MapReduce programs that ma-                                                 dominance relation can be passed as parameter to the
terialize their intermediate results to disk. Furthermore,                                                  operator.
Spark allows to implement iterative models that are needed
                                                                                                       ClusteringOp: Finding groups of correlated events is real-
for our computations as well as for programs following the
                                                                                                            ized by the ClusteringOp operator implementing a
MapReduce paradigm.
                                                                                                            parallel variant of DBSCAN [9] for spatio-temporal
                                                                                                            data. Parameters are the standard clustering parame-
                                GeoDB
                                                                                                            ters ε and MinPts as well as a global distance function
                                                                                                            taking both geographic and temporal distances into
    RawEventData          PrepareEvents         EventData
                                                             time, location
                                                                                                            account.
                            reference

                                                                                                       5.    TOWARDS AN INTERACTIVE EXPLO-
                              event                                             distance-func
                                 CalcDistance                                 ClusteringOp

    dominates        distance                               k, weights                                       RATION OF EVENT CORRELATIONS
       SkylineOp            EventDistanceData        TopKOp                                               We will use the above processing pipeline to build an
                                                                                                       event repository that makes the extracted event data pub-
      EventSkyline                                  EventList                 EventCluster             licly available as Open Data. As an underlying data struc-
                                                                                                       ture we are going to use the Linked Data principle, which
                                                                                                       allows for a flexible schema for different types of events and
                     Figure 3: Framework overview                                                      also makes the information contained in the event itself ma-
                                                                                                       chine readable. Furthermore, one can easily add links be-
 The Spark operators represent transformations on these                                                tween events that have been identified to be correlated and
RDD. Fig. 3 gives an overview of the correlation pipeline,                                             thus make these correlations also available for querying and
where the tasks of the operators is described below:                                                   processing. Adding links to other datasets like YAGO2 or
                                                                                                       DBpedia will enrich our event repository with even more
PrepareEvents: This operator transforms a set of raw (tex-
                                                                                                       useful and broader information that can be used for data
     tual) event data into a set of event records ht, qi con-
                                                                                                       analysis and exploration.
     forming to our framework. This means that textual
                                                                                                          The event repository will be free to download and will also
     temporal and geographic properties are normalized into
                                                                                                       be made queryable via web services. Via an API users can
     numerical values, i.e., date/time values and points or
                                                                                                       run Spark programs (using operators introduced in Sect. 3)
     polygons for the spatial descriptions such as names of
                                                                                                       or SPARQL queries.
     cities or locations.
                                                                                                          Next to the event repository we plan to develop a data
3
  http://spark.apache.org                                                                              exploration platform that will allow users to interactively
4
  http://flink.apache.org                                                                              analyze the data. Different from traditional business intelli-
5
  https://cloud.google.com/dataflow/                                                                   gence frameworks that provide only a predefined set of tools

                                                                                                  93
      Figure 4: Screenshot of Zeppelin after executing a Pig script. The results are visualized as bar chart.


for reporting and visualization, our planned platform allows            also make use of Spark modules like SparkSQL and Spark
very different interaction paradigms. We integrate the idea             Streaming. Although the Spark support lets users create
of notebooks that was inspired by Mathematica’s interac-                efficient data analytics programs, it may be hard for a non-
tive documents and IPython’s (and Jupyter’s) notebooks.                 programmer to create such programs.
With the help of such notebooks users can create their own                 We argue that a declarative approach will be easier to use
programs that fit their needs and run them in a managed                 for users that are not familiar with the Spark API and the
cluster environment without the need to set up any hardware             concepts of their RDDs. For this reason, we integrate a new
or other software before. The idea of web-based notebooks               interpreter that allows to run Pig scripts. Since our event
allows users to organize text, scripts, and plots on a single           repository uses Linked Data, we do not use the conventional
webpage so that all information for data analytics can be               Pig interpreter, but our extended Pig version that allows to
kept in one place.                                                      use SPARQL-like features, such as BGP, in Pig scripts [11].
   For our data exploration tool we chose the Apache Zep-               To do so, we enhanced Pig’s FILTER operator. This allows
pelin6 , because it already has support for Spark programs              the user to easily define queries on the Linked Data sets with
and also provides the possibility to visualize script results.          a more powerful language than SPARQL. Though, the triple
Content in the format of CSV, HTML, or images can di-                   structure of the Linked Data concept is not very well suited
rectly be visualized as tables, bar charts, line graphs, and            for a tuple-based execution environment as it requires a lot
scatter plots. A notebook in Zeppelin is organized in para-             of (self-)joins to reconstruct all details of an entity from the
graphs where each of them can contain and execute a script              triples. To solve this issue, in [11] we introduced a new triple
of a different (supported) language. On execution, the script           bag format (based on the concept introduced in [12]) that
content of a paragraph is sent to the server, which will for-           keeps the flexibility of triples with the convenience of tuples.
ward it to the appropriate interpreter. The interpreter is              We also showed that this format can lead to significant per-
chosen by a magic keyword given by the user at the begin-               formance gains.
ning of the script, e.g. %spark. When the execution has                    However, since Pig programs are compiled into MapRe-
finished, the interpreter will return the results to the server,        duce programs, the execution of these programs will take
which in turn will publish them on the website. Figure 4                longer than for Spark programs. Thus, to combine the ad-
shows a screenshot of a notebook in Zeppelin that executed              vantages of both approaches, in our ongoing work we use
a Pig [16] script and shows the results of that script in a bar         the Pig Latin language and compile it into Spark programs.
chart.                                                                  This will allow users to write intuitive data flow scripts with-
   Zeppelin currently supports shell scripts and Spark pro-             out the need to know a specific API and characteristics of
grams written in Scala or Python out of the box and can                 the underlying platform and to get very efficient programs
6
                                                                        that will execute faster than MapReduce programs.
    https://zeppelin.incubator.apache.org

                                                                   94
6.   RELATED WORK                                                         [6] L. Chen, K. Hwang, and J. Wu. MapReduce Skyline
   For our event correlation framework, we employ concepts                    Query Processing with a New Angular Partitioning
developed in different areas of information extraction, event                 Approach. In IPDPSW, May 2012.
similarity especially for spatio-temporal similarity, as well             [7] L. Chen, M. T. Özsu, and V. Oria. Robust and fast
as skylines and clustering algorithms for distributed envi-                   similarity search for moving object trajectories. In
ronments.                                                                     SIGMOD, 2005.
   To extract the spatial and temporal information from tex-              [8] B.-R. Dai and I.-C. Lin. Efficient Map/Reduce-Based
tual data, we use concepts that were described, e.g., in [22,                 DBSCAN Algorithm with Optimized Data Partition.
23]. In addition to these works, the notions of event similar-                In CLOUD, June 2012.
ity and relationships between events has also been studied                [9] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A
in the context of social media, e.g., [3, 17, 18].                            Density-Based Algorithm for Discovering Clusters in
   Our correlation operators mainly focus on Skylines and                     Large Spatial Databases with Noise. KDD, 1996.
clustering. Among the numerous techniques that build on                  [10] A. Ganguly, O. Omitaomu, Y. Fang, S. Khan, and
the concept of skyline queries [5], there also has been some                  B. Bhaduri. Knowledge discovery from sensor data for
work that study skylines for geographic and uncertain or                      scientific applications. In Learning from Data Streams.
probabilistic data. These include spatial skyline queries [19,                2007.
20] where no temporal aspects are considered for data points.            [11] S. Hagedorn, K. Hose, and K.-U. Sattler. Sparqling
   To compute skylines for large datasets, new algorithms                     pig - processing linked data with pig latin. In BTW,
have been proposed that allow the computation in MapRe-                       March 2015.
duce environments. Here, an important challenge is to parti-             [12] H. Kim, P. Ravindra, and K. Anyanwu. From
tion the data in a way, that the partitioning itself is efficient             SPARQL to MapReduce: The Journey Using a Nested
and data is distributed to all nodes equally to ensure that                   TripleGroup Algebra. PVLDB, 4(12):1426–1429, 2011.
all nodes get approx. the same workload to avoid one node                [13] J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi.
having to do all the work while the others are idle. In [14]                  Topic detection and tracking with spatio-temporal
an approach using a grid based partitioning is introduced,
                                                                              evidence. In Advances in Information Retrieval,
and [6] shows an angular based partitioning approach.
                                                                              volume 2633, pages 251–265. 2003.
   For clustering, density based algorithms like DBSCAN [9]
                                                                         [14] K. Mullesgaard, J. L. Pederseny, H. Lu, and Y. Zhou.
are chosen if the number of clusters is not known apriori. For
                                                                              Efficient Skyline Computation in MapReduce. In
dealing with spatio-temporal data an extension to DBSCAN
                                                                              EDBT, 2014.
has been proposed in [4], that uses two ε parameters (one for
spatial and one for temporal distance). To run the clustering            [15] M. Noticewala and D. Vaghela. MR-IDBSCAN:
algorithms in a distributed environment for MapReduce, we                     Efficient Parallel Incremental DBSCAN Algorithm
rely on concepts developed in [15, 8].                                        using MapReduce. Intl J Computer Applications,
                                                                              93(4):13–17, 2014.
                                                                         [16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and
7.   SUMMARY                                                                  A. Tomkins. Pig latin: a not-so-foreign language for
  In this paper we presented our approach for an event cor-                   data processing. In SIGMOD, 2008.
relation framework based on Apache Spark. The framework                  [17] W. Ribarsky, D. Skau, X. Wang, W. Dou, and M. X.
provides operators for importing data as well as for clus-                    Zhou. Leadline: Interactive visual analysis of text data
tering, skylines and k nearest neighbor queries. We chose                     through event identification and exploration. VAST,
Apache Zeppelin for our exploration tool to allow an incre-                   0:93–102, 2012.
mental creation of scripts and an immediate visualization of             [18] A. D. Sarma, A. Jain, and C. Yu. Dynamic
results.                                                                      relationship and event discovery. In WSDM, pages
                                                                              207–216, 2011.
Acknowledgements                                                         [19] M. Sharifzadeh, C. Shahabi, and L. Kazemi.
This work was funded by the German Research Foundation                        Processing spatial skyline queries in both vector
(DFG) under grant no. SA782/22.                                               spaces and spatial network databases. TODS, 2009.
                                                                         [20] W. Son, M.-W. Lee, H.-K. Ahn, and S. won Hwang.
                                                                              Spatial skyline queries: An efficient geometric
8.   REFERENCES                                                               algorithm. In SSTD, pages 247–264, 2009.
 [1] C. C. Aggarwal and C. K. Reddy. Data clustering:                    [21] J. Strötgen and M. Gertz. Multilingual and
     algorithms and applications. CRC Press, 2013.                            cross-domain temporal tagging. Language Resources
 [2] J. Allan, editor. Topic detection and tracking:                          and Evaluation, 47(2):269–298, 2013.
     event-based information organization. Kluwer                        [22] J. Strötgen and M. Gertz. Proximity2-aware ranking
     Academic Publishers, 2002.                                               for textual, temporal, and geographic queries. In
 [3] H. Becker, M. Naaman, and L. Gravano. Learning                           CIKM, pages 739–744, 2013.
     similarity metrics for event identification in social               [23] J. Strötgen, M. Gertz, and C. Junghans. An
     media. In WSDM, pages 291–300, 2010.                                     event-centric model for multilingual document
 [4] D. Birant and A. Kut. ST-DBSCAN: An algorithm for                        similarity. In SIGIR, pages 953–962, 2011.
     clustering spatial–temporal data. Data & Knowledge                  [24] Y. Yang, T. Pierce, and J. Carbonell. A study of
     Engineering, 2007.                                                       retrospective and on-line event detection. In SIGIR,
 [5] S. Börzsönyi, D. Kossmann, and K. Stocker. The                         pages 28–36, 1998.
     skyline operator. In ICDE, pages 421–430, 2001.

                                                                    95