<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-scale Analysis of Event Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kai-Uwe Sattler TU Ilmenau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germany kus@tu-ilmenau.de</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Michael Gertz Heidelberg University</institution>
          ,
          <addr-line>Germany heidelberg.de</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Stefan Hagedorn TU Ilmenau</institution>
          ,
          <addr-line>Germany ilmenau.de</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>90</fpage>
      <lpage>95</lpage>
      <abstract>
        <p>With the availability of numerous sources and the development of sophisticated text analysis and information retrieval techniques, more and more spatio-temporal data are extracted from texts such as news documents or social network data. Temporal and geographic information obtained this way often form some kind of event, describing when and where something happened. An important task in the context of business intelligence and document exploration applications is the correlation of events in terms of their temporal, geographic or even semantic properties. In this paper we discuss the tasks related to event data analysis, ranging from the extraction of events to determining events that are similar in terms of space and time by using skyline processing and clustering. We present a framework implemented in Apache Spark that provides operators supporting these tasks and thus allows to build analysis pipelines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Traditionally, research on querying and analyzing
spatiotemporal data focuses on data obtained from sensors that
record the evolution and movement of objects in 2- or
3dimensional space over time. Typical examples of such data
include trajectories of moving objects (e.g., [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) and
remotelysensed data (e.g., [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). Spatio-temporal data, however, do
not only arise in the context of sensor data or, more
generally, from observing objects with their spatial extent over
time. In this paper, we consider events as an important type
of spatio-temporal data that thus far have been neglected
in the context of data analysis. Events play an important
role in information retrieval and text analysis in general,
most prominently in the task of topic detection and
tracking (TDT) [
        <xref ref-type="bibr" rid="ref13 ref2">2, 13</xref>
        ]. An event is often described as “something
that happens at some place at some time”, e.g., [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Thus,
events inherently have a spatial and a temporal component.
      </p>
      <p>
        A prominent source for event data are textual
information from newsfeeds, websites, and social media such as
blogs, tweets or social networks. Underlying the extraction
of such information about events are temporal taggers for
the temporal component and geo-taggers for the spatial or
geographic components [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Such taggers detect, extract,
and normalize respective expressions in textual data and
provide subsequent tools as the basis for further tasks such
as document clustering or classification. For example, from
the sentence “Obama visited Germany in April 2009”, a
temporal tagger would detect the expression “April 2009” and
normalize it to “2009-04”; similarly, a geo-tagger would
extract the expression “Germany” and often associate further
information with it. This might include the spatial extent in
the form of a polygonal description or that Germany is part
of Europe (using some concept hierarchy). Such a pair of
temporal component and geographic component then forms
an event. Given that today’s taggers become more and more
sophisticated, large repositories of event information can be
built from news articles, blog entries, social network
postings, and medical records, to name but a few. The task we
focus on in this paper is to find events that are correlated to
a given event in terms of its time and place of occurrence.
The result of such a query is a list of pointers to documents
in which similar (correlated) events have been detected.
      </p>
      <p>For such correlation tasks, one is facing several challenges
ranging from extracting event data from documents,
dealing with imprecise event specifications resulting from the
extraction process, to exploiting different correlation
techniques for finding similar events, to scalable processing for
large datasets.</p>
      <p>In this paper, we describe a framework for analyzing event
data and detecting correlations between events. Based on
a model for spatio-temporal events at different granularities
we discuss corresponding distance measures. We present the
analysis tasks and discuss how these tasks can be supported
by a set of basic operators for extracting event data from text
documents, preparing the data, and analyzing event
correlations using top-k and skyline processing as well as clustering.
We further describe how these operators are supported as
transformation operators in Apache Spark and outline the
support for explorative analyses.
2.</p>
    </sec>
    <sec id="sec-2">
      <title>EVENT MODEL</title>
      <p>An important data analysis task is to find similar events,
i.e. events that share the same context or have other values
in common. In this paper, we consider the spatio-temporal
aspect of events for determining similarities, i.e., we focus
on the similarity of their respective location and/or time of
occurrence.</p>
      <p>For our analysis framework, we assume an event model in
which information about events has been extracted from
some document (see Sect. 4) and is represented by useful
and necessary information like ID, description, origin, etc.
as well as a temporal and a geographic component,
describing the when and where. The expressions underlying these
components are based on concept hierarchies for time and
space, as shown in Figure 1.</p>
      <p>year
month
day
country
state
city</p>
      <p>The temporal as well as the spatial expressions can be
of different granularities. For temporal expressions we
consider days to be of the finest and years of the coarsest
granularity. Of course one could also extend this model with
further granularity levels, such as weeks, hours, or minutes.
In this paper, however, and for the sake of simplicity, we
will only consider days, months, and years. We denote the
corresponding domains as T = {Tday, Tmonth, Tyear}. For
example, “2001-05-20”, “2013-05”, and “1995” are all valid
temporal expressions from these three different domains.</p>
      <p>Analogously, geographic expressions are taken from the
domains in G = {Gcity, Gstate, Gcountry}. We assume that
with each expression a spatial object in the form of a single
polygon (without holes) is associated. For example, the
geographic expressions “Heidelberg” and “Germany” are both
valid expressions of type Gcity and Gcountry, respectively.
Definition. (Event) Given concept hierarchies T and G for
temporal and geographic expressions, respectively. An
event e = ht, gi consists of a temporal expression t
with t.type ∈ T and a geographic expression g with
g.type ∈ G.</p>
      <p>Examples of (imprecise) event specifications are
(2013-0902, Munich), (1955, Germany), or (2000-04, Bavaria). To
account for these types of uncertainties, in our framework
we make the following assumptions:
1. Temporal and geographic expressions of the finest
granularity are certain.
2. Every temporal (resp. geographic) expression of type
P 0 that refines a given temporal (resp. geographic)
expression of type P , with P 0 being of finer granularity
than P , is equally likely.</p>
      <p>Distance Measures. To determine the similarity between
events, a distance measure is needed that takes both the
temporal and the geographic component of an event into
account, both of which can be imprecise.</p>
      <p>Precise events are those events for which both the location
and temporal information is given as a fine-grained concept,
i.e., they are defined on the city and day level. For such
events, calculating the distance is trivial resulting in a scalar
value for time (e.g., distance in days) and location (e.g.
distance in meters). Both values can be combined into a single
(weighted) result. Although we cannot use the traditional
distance functions for events that are imprecise in at least
one component, we can specify new distance functions
accordingly. However, the definition of such functions is
beyond the scope of this paper and we will give only a brief
overview below.</p>
      <p>First, we convert the imprecise event data into intervals
and regions. As imprecise temporal data is defined as a
month or year (to be imprecise the day part is missing), we
can create an interval of days that starts with the minimal
possible day, i.e., the first day of the corresponding month or
a year and ends with the maximal possible day, which is the
last day of the month or year, respectively. Each
subinterval is called a valid instance of this interval. As an example
consider the imprecise temporal expression “2014-05” that is
mapped to the (right-open) interval i = [2014-05-01,
201405-30]). Note that any subinterval in i is a valid instance
of the temporal expression. Analogously, for imprecise
geographic expression, i.e., expressions not at the city level,
a polygon (or its minimum bounding box) is used. Then,
a function that calculates the distance between such
intervals and between polygons/boxes is needed. Such distance
function can yield a scalar value, which may be the average
distance between points in the respective intervals or
polygons, or it can yield an interval, representing the minimum
and maximum distance between the intervals/polygons.</p>
      <p>The Hausdorff distance is a typical approach to calculate
the distance between two intervals or polygons as a scalar
value. Its main idea is to compute the maximum distance
between any point in one interval to any other point of the
second interval. For two temporal intervals t1 and t2, the
distance can be computed as
dH (t1, t2) := max{max dh(i1, t2), max dh(i2, t1)}
i1∈t1 i2∈t2
where the distance dh(i, t) between a day i and a temporal
interval t defined as dh(i, t) := mini0∈t(|i − i0|).</p>
      <p>For polygons/boxes, the Hausdorff distance can be defined
as follows:
dH (g1, g2) := max</p>
      <p>x∈g1 ym∈ign2 ||x − y||</p>
      <p>This is the greatest distance from a point in g1 to the
closest point in g2. However, the calculation of this distance
measure can become very expensive as the distance from
every point in the first interval to every other point in the
second interval has to be computed. For temporal intervals
this is not a big problem, because when taking years as the
most imprecise and days as most precise level, such an
interval can have only 365 (or 366) points at most. However,
for polygons the set of inner points is in principle infinite.
On the one hand, one could use the cities that are located
in the respective region. This approach may not lead to
good results as many cities may be clustered around a large
metropolis, whereas in rural regions only very few cities can
be found. On the other hand, one could impose a raster
on the respective regions using a fixed grid. Then, only
the points of this grid will be used for calculations. Then,
however, the definition of the grid size may be difficult since
choosing a small grid step size may lead to many data points
that will take part in the calculation and thus, will not
improve the computation time. Choosing a large grid size will
result in only very few data points, which might again
result in incorrect distance values. In order to use common
distance functions, one could reduce a polygon to a single
point that represents this polygon, e.g., the center point.
However, these simplifications result in distances that may
not necessarily reflect the real distances. Therefore, more
sophisticated distance functions are needed.</p>
    </sec>
    <sec id="sec-3">
      <title>ANALYSIS TASKS</title>
      <p>Analysis of event data comprises several tasks. Depending
on the sources of events (structured data, text documents)
and the specific questions different tasks have to be
combined in an analysis pipeline. In addition, such analyses
are often interactive and explorative processes requiring a
flexible combination of a set of powerful extraction and
correlation operators. The whole process is depicted in Fig. 2.
In the following, we describe a basic set of tasks that our
framework supports by dedicated operators.
3.1</p>
    </sec>
    <sec id="sec-4">
      <title>Event Information Extraction</title>
      <p>
        Prior to any analysis task event information needs to be
extracted from the text documents of a given corpus, such
as Wikipedia or news articles. As an event is assumed to be
composed of a temporal expression describing the “when” of
an event and a geographic expression describing the “where”
of an event, respective event components need to be
determined in a given text and mapped to some standard
format. The function of a temporal tagger is to determine such
temporal information and to map each piece of information
found to a temporal expression. One typically distinguishes
between explicit, implicit and relative information
temporal information. Explicit temporal information refers to a
sequence of tokens that describe an exact date, month in a
year or year, e.g., “April 1, 2015”. Implicit temporal
information often refers to holidays or some named events, such
as “Independence Day 2015” or “Summer Olympics 2012”.
Finally, relative temporal information can only be
determined using the context or document in which the token
sequence appears. For example, in this paper, the string
“last year” would refer to the year 2014. Popular
temporal taggers such as HeidelTime [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] are able to detect such
information and map them to a standard format, which is
mostly based on the TIMEX3 annotation standard. For
example, the above explicit temporal information “April 1,
2015” would be mapped to the temporal expression
“201504-01”.
      </p>
      <p>Similarly, for geographic information in documents, a
geotagger is employed. Popular tools include GeoNames1 or
Yahoo! Placemaker2. Mapping token sequences found in a text
to some standardized format, however, is less trivial. For
example, most taggers would map the string “Magdeburg” to
a latitude/longitude pair rather than to some complex
polygon description. Also, several geo-taggers also include some
information about the granularity of the geographic
expression based on concept hierarchies.</p>
      <p>Once a temporal tagger and a geo-tagger have extracted
and mapped temporal expressions from a given text, events
are formed. In the most simple case, an event is a pair
consisting of a temporal expression and a geographic expression
found in close text proximity, typically at the sentence level.</p>
      <sec id="sec-4-1">
        <title>1http://www.geonames.org/</title>
      </sec>
      <sec id="sec-4-2">
        <title>2https://developer.yahoo.com/boss/geo/</title>
        <p>All events extracted from a document or document
collection are then stored based on our event model described
in Sect. 2. For each event detected, it describes the
normalized temporal and geographic expression, the granularity of
the expressions, and also some offset information (in what
document and at what position each of the two components
have been determined).
3.2</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Correlation Analysis</title>
      <p>Correlations on event data can be computed in different
ways, depending on specific applications. In the following,
we discuss the three operations for correlation computations.</p>
      <sec id="sec-5-1">
        <title>Nearest Neighbor Queries.</title>
        <p>A straightforward solution to find correlated, i.e., similar,
events is to find the neighbors of a given reference event.
Given a set of events E, a reference event er and a
distance function dist, the task is to find the set kNN(er) of
the k nearest events. In the case of our spatio-temporal
event data this requires a single distance measure, which is
usually defined using weights for the spatial and temporal
distances. These weights are necessary, because we cannot
directly combine the spatial distance with the temporal
distance into one distance measure. However, with the weights
we can adjust the impact of the spatial and temporal
distance to the overall distance:</p>
        <p>
          dist(e1, e2) = wg · distg(e1, e2) + wt · distt(e1, e2)
where wt, wg ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] and wt + wg = 1.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Skyline.</title>
        <p>In contrast to the Nearest Neighbor search, Skyline queries
do not need weights for the spatial and temporal distance,
which are often difficult to determine. Adapted to our
scenario, the notion of the Skyline algorithm is to find those
events in E that “match” a query event q = htq, gqi best.
Since we consider two dimension for events, time and space,
it is thus intuitive to employ a skyline-based approach as
there might be events that match tq well but not gq, and
vice versa. A core concept of skylines is the dominance
relationship. The skyline Sq consists of all events that are not
dominated by any other event in E with respect to q:</p>
        <p>Sq = {ei|ei ∈ E ∧ ¬∃ej ∈ E : ej 6= ei ∧ ej q ei}
where q denotes a dominance relationship with ej q ei
meaning that ej is “in closer proximity to q” than ei. Because
the dominance of an event with respect to another event
is determined based on their respective distances to q, the
distance function outlined before comes into play.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Clustering.</title>
        <p>
          Other than in the two approaches mentioned before, for
clustering, a definition of a reference event is not needed. It
is typically understood as the task of grouping objects based
on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity. However, there is a
rich diversity in specific definitions of clustering and many
different techniques exist [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>The clusters can be formed using distance values between
other events. Thus, events that belong to the same cluster
can be considered to be correlated as they occur around the
same time and place.</p>
        <p>Text documents
Contextual Information
(e.g., GeoNames)</p>
        <p>Vocabularies</p>
        <p>Extraction</p>
        <p>Mapping &amp;
Normalization</p>
        <p>Event
Correlation &amp;</p>
        <p>Resolution
Raw event</p>
        <p>data</p>
        <p>Processing pipeline</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>PROCESSING PIPELINE</title>
      <p>After the events have been extracted from the text
document using the approach as outline in Section 2, the events
are fed into the correlation pipeline. This pipeline is
implemented using the Apache Spark3 platform. However, it can
be easily ported to other platforms like Apache Flink4 or
even Google’s new DataFlow SDK5 if they provide a
Scalabased API. We chose Spark over other MapReduce based
approaches, e.g., Pig, because it provides a very efficient data
structure called resilient distributed datasets (RDD). RDDs
are immutable, in-memory partitioned collections that can
be distributed among the cluster nodes. With the help of
such a data structure, the performance of the computations
is significantly faster than in MapReduce programs that
materialize their intermediate results to disk. Furthermore,
Spark allows to implement iterative models that are needed
for our computations as well as for programs following the
MapReduce paradigm.
PrepareEvents: This operator transforms a set of raw
(textual) event data into a set of event records ht, qi
conforming to our framework. This means that textual
temporal and geographic properties are normalized into
numerical values, i.e., date/time values and points or
polygons for the spatial descriptions such as names of
cities or locations.</p>
      <sec id="sec-6-1">
        <title>3http://spark.apache.org</title>
      </sec>
      <sec id="sec-6-2">
        <title>4http://flink.apache.org</title>
      </sec>
      <sec id="sec-6-3">
        <title>5https://cloud.google.com/dataflow/</title>
        <p>CalcDistance: This implements a transformation operator
for calculating the geographic and temporal distance
dist of each event of an RDD to a given reference event.
TopKOp: This operator computes the k nearest neighbors as
a top-k list of events from an input RDD produced by
CalcDistance. Parameters to this operator are k as
well as the weights for the geographic (wg) and
temporal (wt) distance.</p>
        <p>
          SkylineOp: This operator computes the skyline of event
records from an RDD produced by CalcDistance. Our
implementation adopts the grid-based approach using
bitsets described in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for the Spark platform. The
dominance relation can be passed as parameter to the
operator.
        </p>
        <p>
          ClusteringOp: Finding groups of correlated events is
realized by the ClusteringOp operator implementing a
parallel variant of DBSCAN [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for spatio-temporal
data. Parameters are the standard clustering
parameters ε and MinPts as well as a global distance function
taking both geographic and temporal distances into
account.
5.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>TOWARDS AN INTERACTIVE EXPLO</title>
    </sec>
    <sec id="sec-8">
      <title>RATION OF EVENT CORRELATIONS</title>
      <p>We will use the above processing pipeline to build an
event repository that makes the extracted event data
publicly available as Open Data. As an underlying data
structure we are going to use the Linked Data principle, which
allows for a flexible schema for different types of events and
also makes the information contained in the event itself
machine readable. Furthermore, one can easily add links
between events that have been identified to be correlated and
thus make these correlations also available for querying and
processing. Adding links to other datasets like YAGO2 or
DBpedia will enrich our event repository with even more
useful and broader information that can be used for data
analysis and exploration.</p>
      <p>The event repository will be free to download and will also
be made queryable via web services. Via an API users can
run Spark programs (using operators introduced in Sect. 3)
or SPARQL queries.</p>
      <p>Next to the event repository we plan to develop a data
exploration platform that will allow users to interactively
analyze the data. Different from traditional business
intelligence frameworks that provide only a predefined set of tools
for reporting and visualization, our planned platform allows
very different interaction paradigms. We integrate the idea
of notebooks that was inspired by Mathematica’s
interactive documents and IPython’s (and Jupyter’s) notebooks.
With the help of such notebooks users can create their own
programs that fit their needs and run them in a managed
cluster environment without the need to set up any hardware
or other software before. The idea of web-based notebooks
allows users to organize text, scripts, and plots on a single
webpage so that all information for data analytics can be
kept in one place.</p>
      <p>
        For our data exploration tool we chose the Apache
Zeppelin6, because it already has support for Spark programs
and also provides the possibility to visualize script results.
Content in the format of CSV, HTML, or images can
directly be visualized as tables, bar charts, line graphs, and
scatter plots. A notebook in Zeppelin is organized in
paragraphs where each of them can contain and execute a script
of a different (supported) language. On execution, the script
content of a paragraph is sent to the server, which will
forward it to the appropriate interpreter. The interpreter is
chosen by a magic keyword given by the user at the
beginning of the script, e.g. %spark. When the execution has
finished, the interpreter will return the results to the server,
which in turn will publish them on the website. Figure 4
shows a screenshot of a notebook in Zeppelin that executed
a Pig [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] script and shows the results of that script in a bar
chart.
      </p>
      <p>Zeppelin currently supports shell scripts and Spark
programs written in Scala or Python out of the box and can</p>
      <sec id="sec-8-1">
        <title>6https://zeppelin.incubator.apache.org</title>
        <p>also make use of Spark modules like SparkSQL and Spark
Streaming. Although the Spark support lets users create
efficient data analytics programs, it may be hard for a
nonprogrammer to create such programs.</p>
        <p>
          We argue that a declarative approach will be easier to use
for users that are not familiar with the Spark API and the
concepts of their RDDs. For this reason, we integrate a new
interpreter that allows to run Pig scripts. Since our event
repository uses Linked Data, we do not use the conventional
Pig interpreter, but our extended Pig version that allows to
use SPARQL-like features, such as BGP, in Pig scripts [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
To do so, we enhanced Pig’s FILTER operator. This allows
the user to easily define queries on the Linked Data sets with
a more powerful language than SPARQL. Though, the triple
structure of the Linked Data concept is not very well suited
for a tuple-based execution environment as it requires a lot
of (self-)joins to reconstruct all details of an entity from the
triples. To solve this issue, in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] we introduced a new triple
bag format (based on the concept introduced in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]) that
keeps the flexibility of triples with the convenience of tuples.
We also showed that this format can lead to significant
performance gains.
        </p>
        <p>However, since Pig programs are compiled into
MapReduce programs, the execution of these programs will take
longer than for Spark programs. Thus, to combine the
advantages of both approaches, in our ongoing work we use
the Pig Latin language and compile it into Spark programs.
This will allow users to write intuitive data flow scripts
without the need to know a specific API and characteristics of
the underlying platform and to get very efficient programs
that will execute faster than MapReduce programs.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>RELATED WORK</title>
      <p>For our event correlation framework, we employ concepts
developed in different areas of information extraction, event
similarity especially for spatio-temporal similarity, as well
as skylines and clustering algorithms for distributed
environments.</p>
      <p>
        To extract the spatial and temporal information from
textual data, we use concepts that were described, e.g., in [
        <xref ref-type="bibr" rid="ref22 ref23">22,
23</xref>
        ]. In addition to these works, the notions of event
similarity and relationships between events has also been studied
in the context of social media, e.g., [
        <xref ref-type="bibr" rid="ref17 ref18 ref3">3, 17, 18</xref>
        ].
      </p>
      <p>
        Our correlation operators mainly focus on Skylines and
clustering. Among the numerous techniques that build on
the concept of skyline queries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], there also has been some
work that study skylines for geographic and uncertain or
probabilistic data. These include spatial skyline queries [
        <xref ref-type="bibr" rid="ref19 ref20">19,
20</xref>
        ] where no temporal aspects are considered for data points.
      </p>
      <p>
        To compute skylines for large datasets, new algorithms
have been proposed that allow the computation in
MapReduce environments. Here, an important challenge is to
partition the data in a way, that the partitioning itself is efficient
and data is distributed to all nodes equally to ensure that
all nodes get approx. the same workload to avoid one node
having to do all the work while the others are idle. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
an approach using a grid based partitioning is introduced,
and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] shows an angular based partitioning approach.
      </p>
      <p>
        For clustering, density based algorithms like DBSCAN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
are chosen if the number of clusters is not known apriori. For
dealing with spatio-temporal data an extension to DBSCAN
has been proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], that uses two ε parameters (one for
spatial and one for temporal distance). To run the clustering
algorithms in a distributed environment for MapReduce, we
rely on concepts developed in [
        <xref ref-type="bibr" rid="ref15 ref8">15, 8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>7. SUMMARY</title>
      <p>In this paper we presented our approach for an event
correlation framework based on Apache Spark. The framework
provides operators for importing data as well as for
clustering, skylines and k nearest neighbor queries. We chose
Apache Zeppelin for our exploration tool to allow an
incremental creation of scripts and an immediate visualization of
results.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgements</title>
      <p>This work was funded by the German Research Foundation
(DFG) under grant no. SA782/22.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          .
          <article-title>Data clustering: algorithms and applications</article-title>
          . CRC Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          , editor.
          <source>Topic detection and tracking: event-based information organization</source>
          . Kluwer Academic Publishers,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naaman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Gravano</surname>
          </string-name>
          .
          <article-title>Learning similarity metrics for event identification in social media</article-title>
          .
          <source>In WSDM</source>
          , pages
          <fpage>291</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Birant</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kut. ST-DBSCAN</surname>
          </string-name>
          :
          <article-title>An algorithm for clustering spatial-temporal data</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bo</surname>
          </string-name>
          ¨rzso¨nyi,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Stocker</surname>
          </string-name>
          .
          <article-title>The skyline operator</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>421</fpage>
          -
          <lpage>430</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu. MapReduce Skyline Query</surname>
          </string-name>
          <article-title>Processing with a New Angular Partitioning Approach</article-title>
          . In IPDPSW, May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          , M. T. O¨zsu, and
          <string-name>
            <given-names>V.</given-names>
            <surname>Oria</surname>
          </string-name>
          .
          <article-title>Robust and fast similarity search for moving object trajectories</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.-R.</given-names>
            <surname>Dai</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Efficient Map/Reduce-Based DBSCAN Algorithm with Optimized Data Partition</article-title>
          . In CLOUD,
          <year>June 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise</article-title>
          .
          <source>KDD</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Omitaomu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhaduri</surname>
          </string-name>
          .
          <article-title>Knowledge discovery from sensor data for scientific applications</article-title>
          .
          <source>In Learning from Data Streams</source>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hagedorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          K.-U. Sattler.
          <article-title>Sparqling pig - processing linked data with pig latin</article-title>
          . In BTW, March
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ravindra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Anyanwu</surname>
          </string-name>
          .
          <article-title>From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1426</fpage>
          -
          <lpage>1429</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Makkonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ahonen-Myka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Salmenkivi</surname>
          </string-name>
          .
          <article-title>Topic detection and tracking with spatio-temporal evidence</article-title>
          .
          <source>In Advances in Information Retrieval</source>
          , volume
          <volume>2633</volume>
          , pages
          <fpage>251</fpage>
          -
          <lpage>265</lpage>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mullesgaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Pederseny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Efficient Skyline Computation in MapReduce</article-title>
          . In EDBT,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Noticewala</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Vaghela</surname>
          </string-name>
          .
          <article-title>MR-IDBSCAN: Efficient Parallel Incremental DBSCAN Algorithm using MapReduce</article-title>
          .
          <source>Intl J Computer Applications</source>
          ,
          <volume>93</volume>
          (
          <issue>4</issue>
          ):
          <fpage>13</fpage>
          -
          <lpage>17</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Olston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Reed</surname>
          </string-name>
          , U. Srivastava,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          .
          <article-title>Pig latin: a not-so-foreign language for data processing</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ribarsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Skau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          . Leadline:
          <article-title>Interactive visual analysis of text data through event identification and exploration</article-title>
          .
          <source>VAST</source>
          ,
          <volume>0</volume>
          :
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>A. D. Sarma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Jain</surname>
            , and
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Dynamic relationship and event discovery</article-title>
          .
          <source>In WSDM</source>
          , pages
          <fpage>207</fpage>
          -
          <lpage>216</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharifzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          .
          <article-title>Processing spatial skyline queries in both vector spaces and spatial network databases</article-title>
          .
          <source>TODS</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>W.</given-names>
            <surname>Son</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Lee</surname>
          </string-name>
          , H.
          <article-title>-</article-title>
          <string-name>
            <surname>K. Ahn</surname>
          </string-name>
          , and S. won Hwang.
          <article-title>Spatial skyline queries: An efficient geometric algorithm</article-title>
          .
          <source>In SSTD</source>
          , pages
          <fpage>247</fpage>
          -
          <lpage>264</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stro</surname>
          </string-name>
          <article-title>¨tgen and M. Gertz. Multilingual and cross-domain temporal tagging</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>47</volume>
          (
          <issue>2</issue>
          ):
          <fpage>269</fpage>
          -
          <lpage>298</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stro</surname>
          </string-name>
          <article-title>¨tgen and M. Gertz. Proximity2-aware ranking for textual, temporal, and geographic queries</article-title>
          .
          <source>In CIKM</source>
          , pages
          <fpage>739</fpage>
          -
          <lpage>744</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stro</surname>
          </string-name>
          ¨tgen, M. Gertz, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Junghans</surname>
          </string-name>
          .
          <article-title>An event-centric model for multilingual document similarity</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <fpage>953</fpage>
          -
          <lpage>962</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pierce</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          .
          <article-title>A study of retrospective and on-line event detection</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>