=Paper=
{{Paper
|id=Vol-1366/paper17.pdf
|storemode=property
|title=Large-scale Analysis of Event Data
|pdfUrl=https://ceur-ws.org/Vol-1366/paper17.pdf
|volume=Vol-1366
|dblpUrl=https://dblp.org/rec/conf/gvd/HagedornSG15
}}
==Large-scale Analysis of Event Data==
Large-scale Analysis of Event Data

Stefan Hagedorn (TU Ilmenau, Germany, stefan.hagedorn@tu-ilmenau.de), Kai-Uwe Sattler (TU Ilmenau, Germany, kus@tu-ilmenau.de), Michael Gertz (Heidelberg University, Germany, gertz@informatik.uni-heidelberg.de)

ABSTRACT

With the availability of numerous sources and the development of sophisticated text analysis and information retrieval techniques, more and more spatio-temporal data are extracted from texts such as news documents or social network data. Temporal and geographic information obtained this way often forms some kind of event, describing when and where something happened. An important task in the context of business intelligence and document exploration applications is the correlation of events in terms of their temporal, geographic, or even semantic properties. In this paper we discuss the tasks related to event data analysis, ranging from the extraction of events to determining events that are similar in terms of space and time using skyline processing and clustering. We present a framework implemented in Apache Spark that provides operators supporting these tasks and thus allows building analysis pipelines.

27th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 26.05.2015 - 29.05.2015, Magdeburg, Germany. Copyright is held by the author/owner(s).

1. INTRODUCTION

Traditionally, research on querying and analyzing spatio-temporal data focuses on data obtained from sensors that record the evolution and movement of objects in 2- or 3-dimensional space over time. Typical examples of such data include trajectories of moving objects (e.g., [7]) and remotely sensed data (e.g., [10]). Spatio-temporal data, however, do not only arise in the context of sensor data or, more generally, from observing objects with their spatial extent over time. In this paper, we consider events as an important type of spatio-temporal data that has thus far been neglected in the context of data analysis. Events play an important role in information retrieval and text analysis in general, most prominently in the task of topic detection and tracking (TDT) [2, 13]. An event is often described as "something that happens at some place at some time", e.g., [24]. Thus, events inherently have a spatial and a temporal component.

A prominent source for event data is textual information from newsfeeds, websites, and social media such as blogs, tweets, or social networks. Underlying the extraction of such information about events are temporal taggers for the temporal component and geo-taggers for the spatial or geographic components [21]. Such taggers detect, extract, and normalize the respective expressions in textual data and thus provide the basis for subsequent tasks such as document clustering or classification. For example, from the sentence "Obama visited Germany in April 2009", a temporal tagger would detect the expression "April 2009" and normalize it to "2009-04"; similarly, a geo-tagger would extract the expression "Germany" and often associate further information with it. This might include the spatial extent in the form of a polygonal description or the fact that Germany is part of Europe (using some concept hierarchy). Such a pair of a temporal and a geographic component then forms an event. Given that today's taggers become more and more sophisticated, large repositories of event information can be built from news articles, blog entries, social network postings, and medical records, to name but a few.

The task we focus on in this paper is to find events that are correlated to a given event in terms of its time and place of occurrence. The result of such a query is a list of pointers to documents in which similar (correlated) events have been detected. For such correlation tasks, one faces several challenges, ranging from extracting event data from documents and dealing with imprecise event specifications resulting from the extraction process, to exploiting different correlation techniques for finding similar events and scaling the processing to large datasets.

In this paper, we describe a framework for analyzing event data and detecting correlations between events. Based on a model for spatio-temporal events at different granularities, we discuss corresponding distance measures. We present the analysis tasks and discuss how these tasks can be supported by a set of basic operators for extracting event data from text documents, preparing the data, and analyzing event correlations using top-k and skyline processing as well as clustering. We further describe how these operators are realized as transformation operators in Apache Spark and outline the support for explorative analyses.

2. EVENT MODEL

An important data analysis task is to find similar events, i.e., events that share the same context or have other values in common. In this paper, we consider the spatio-temporal aspect of events for determining similarities, i.e., we focus on the similarity of their respective location and/or time of occurrence.

For our analysis framework, we assume an event model in which information about events has been extracted from some document (see Sect. 4) and is represented by useful and necessary information like ID, description, origin, etc., as well as a temporal and a geographic component describing the when and where. The expressions underlying these components are based on concept hierarchies for time and space, as shown in Figure 1.

Figure 1: Concept hierarchies for temporal and geographic information (temporal: day, month, year; geographic: city, state, country).

The temporal as well as the spatial expressions can be of different granularities. For temporal expressions, we consider days to be of the finest and years of the coarsest granularity. Of course, one could extend this model with further granularity levels, such as weeks, hours, or minutes. In this paper, however, and for the sake of simplicity, we will only consider days, months, and years. We denote the corresponding domains as T = {T_day, T_month, T_year}. For example, "2001-05-20", "2013-05", and "1995" are all valid temporal expressions from these three different domains. Analogously, geographic expressions are taken from the domains in G = {G_city, G_state, G_country}. We assume that with each expression a spatial object in the form of a single polygon (without holes) is associated. For example, the geographic expressions "Heidelberg" and "Germany" are valid expressions of type G_city and G_country, respectively.

Definition. (Event) Given concept hierarchies T and G for temporal and geographic expressions, respectively, an event e = <t, g> consists of a temporal expression t with t.type ∈ T and a geographic expression g with g.type ∈ G.

Examples of (imprecise) event specifications are (2013-09-02, Munich), (1955, Germany), or (2000-04, Bavaria). To account for these types of uncertainty, we make the following assumptions in our framework:

1. Temporal and geographic expressions of the finest granularity are certain.

2. Every temporal (resp. geographic) expression of type P' that refines a given temporal (resp. geographic) expression of type P, with P' being of finer granularity than P, is equally likely.

Distance Measures. To determine the similarity between events, a distance measure is needed that takes both the temporal and the geographic component of an event into account, both of which can be imprecise. Precise events are those events for which both the location and the temporal information are given as fine-grained concepts, i.e., they are defined on the city and day level. For such events, calculating the distance is trivial and results in a scalar value for time (e.g., distance in days) and location (e.g., distance in meters). Both values can be combined into a single (weighted) result. Although we cannot use these traditional distance functions for events that are imprecise in at least one component, we can specify new distance functions accordingly. The definition of such functions is beyond the scope of this paper, however, and we give only a brief overview below.

First, we convert the imprecise event data into intervals and regions. As an imprecise temporal expression is given as a month or a year (to be imprecise, the day part is missing), we can create an interval of days that starts with the minimal possible day, i.e., the first day of the corresponding month or year, and ends with the maximal possible day, i.e., the last day of the month or year, respectively. Each subinterval is called a valid instance of this interval. As an example, consider the imprecise temporal expression "2014-05", which is mapped to the interval i = [2014-05-01, 2014-05-31]. Note that any subinterval of i is a valid instance of the temporal expression. Analogously, an imprecise geographic expression, i.e., an expression not at the city level, is represented by a polygon (or its minimum bounding box). Then, a function is needed that calculates the distance between such intervals and between polygons/boxes. Such a distance function can yield a scalar value, which may be the average distance between points in the respective intervals or polygons, or it can yield an interval, representing the minimum and maximum distance between the intervals/polygons.

The Hausdorff distance is a typical approach to calculate the distance between two intervals or polygons as a scalar value. Its main idea is to compute the maximum distance from any point in one interval to the closest point of the other interval. For two temporal intervals t1 and t2, the distance can be computed as

  d_H(t1, t2) := max{ max_{i1 ∈ t1} d_h(i1, t2), max_{i2 ∈ t2} d_h(i2, t1) }

where the distance d_h(i, t) between a day i and a temporal interval t is defined as d_h(i, t) := min_{i' ∈ t} |i − i'|.

For polygons/boxes, the (directed) Hausdorff distance can be defined as follows:

  d_H(g1, g2) := max_{x ∈ g1} min_{y ∈ g2} ||x − y||

This is the greatest distance from a point in g1 to the closest point in g2. However, the calculation of this distance measure can become very expensive, as the distance from every point in the first interval or polygon to every point in the second one has to be computed. For temporal intervals this is not a big problem: with years as the most imprecise and days as the most precise level, such an interval can have at most 365 (or 366) points. For polygons, however, the set of inner points is in principle infinite. On the one hand, one could use the cities that are located in the respective region as sample points. This approach may not lead to good results, as many cities may be clustered around a large metropolis, whereas in rural regions only very few cities can be found. On the other hand, one could impose a raster on the respective regions using a fixed grid, so that only the points of this grid are used for the calculations. However, choosing the grid size is difficult: a small grid step size leads to many data points taking part in the calculation and thus does not improve the computation time, whereas a large grid size results in only very few data points, which might again lead to inaccurate distance values. In order to use common distance functions, one could also reduce a polygon to a single point that represents it, e.g., its center point. However, these simplifications result in distances that may not reflect the real distances. Therefore, more sophisticated distance functions are needed.

3. ANALYSIS TASKS

The analysis of event data comprises several tasks. Depending on the sources of events (structured data, text documents) and the specific questions, different tasks have to be combined in an analysis pipeline. In addition, such analyses are often interactive and explorative processes requiring a flexible combination of a set of powerful extraction and correlation operators. The whole process is depicted in Fig. 2. In the following, we describe a basic set of tasks that our framework supports by dedicated operators.

3.1 Event Information Extraction

Prior to any analysis task, event information needs to be extracted from the text documents of a given corpus, such as Wikipedia or news articles. As an event is assumed to be composed of a temporal expression describing the "when" and a geographic expression describing the "where" of an event, the respective event components need to be determined in a given text and mapped to some standard format. The function of a temporal tagger is to determine such temporal information and to map each piece of information found to a temporal expression. One typically distinguishes between explicit, implicit, and relative temporal information. Explicit temporal information refers to a sequence of tokens that describes an exact date, a month in a year, or a year, e.g., "April 1, 2015". Implicit temporal information often refers to holidays or named events, such as "Independence Day 2015" or "Summer Olympics 2012". Finally, relative temporal information can only be determined using the context or the document in which the token sequence appears. For example, in this paper the string "last year" would refer to the year 2014. Popular temporal taggers such as HeidelTime [21] are able to detect such information and map it to a standard format, which is mostly based on the TIMEX3 annotation standard. For example, the above explicit temporal information "April 1, 2015" would be mapped to the temporal expression "2015-04-01".

Similarly, for geographic information in documents, a geo-tagger is employed. Popular tools include GeoNames (http://www.geonames.org/) and Yahoo! Placemaker (https://developer.yahoo.com/boss/geo/). Mapping token sequences found in a text to some standardized format, however, is less trivial. For example, most taggers would map the string "Magdeburg" to a latitude/longitude pair rather than to some complex polygon description. Also, several geo-taggers include information about the granularity of the geographic expression based on concept hierarchies.

Once a temporal tagger and a geo-tagger have extracted and mapped the temporal and geographic expressions from a given text, events are formed. In the simplest case, an event is a pair consisting of a temporal expression and a geographic expression found in close text proximity, typically at the sentence level. All events extracted from a document or document collection are then stored based on our event model described in Sect. 2. For each detected event, it captures the normalized temporal and geographic expression, the granularity of the expressions, and also some offset information (in what document and at what position each of the two components was found).

3.2 Correlation Analysis

Correlations on event data can be computed in different ways, depending on the specific application. In the following, we discuss three operations for correlation computations.

Nearest Neighbor Queries. A straightforward way to find correlated, i.e., similar, events is to find the neighbors of a given reference event. Given a set of events E, a reference event e_r, and a distance function dist, the task is to find the set kNN(e_r) of the k nearest events. In the case of our spatio-temporal event data, this requires a single distance measure, which is usually defined using weights for the spatial and temporal distances. These weights are necessary because we cannot directly combine the spatial distance with the temporal distance into one distance measure. With the weights, however, we can adjust the impact of the spatial and temporal distance on the overall distance:

  dist(e1, e2) = w_g · dist_g(e1, e2) + w_t · dist_t(e1, e2)

where w_t, w_g ∈ [0, 1] and w_t + w_g = 1.

Skyline. In contrast to nearest neighbor search, skyline queries do not need weights for the spatial and temporal distance, which are often difficult to determine. Adapted to our scenario, the idea of the skyline algorithm is to find those events in E that "match" a query event q = <t_q, g_q> best. Since we consider two dimensions for events, time and space, it is intuitive to employ a skyline-based approach, as there might be events that match t_q well but not g_q, and vice versa. A core concept of skylines is the dominance relationship. The skyline S_q consists of all events that are not dominated by any other event in E with respect to q:

  S_q = {e_i | e_i ∈ E ∧ ¬∃ e_j ∈ E : e_j ≠ e_i ∧ e_j ≺_q e_i}

where ≺_q denotes a dominance relationship, with e_j ≺_q e_i meaning that e_j is "in closer proximity to q" than e_i. Because the dominance of an event with respect to another event is determined based on their respective distances to q, the distance functions outlined before come into play.

Clustering. Unlike the two approaches mentioned before, clustering does not need the definition of a reference event. It is typically understood as the task of grouping objects based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. However, there is a rich diversity of specific definitions of clustering, and many different techniques exist [1]. The clusters can be formed using the distance values between events.
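As an illustration of the interval mapping described above, the following sketch converts a (possibly imprecise) temporal expression into its closed day interval. The function name and the string-based input format are our own choices for this example, not part of the framework:

```python
import calendar
from datetime import date

def to_interval(expr: str):
    """Map a temporal expression to the closed interval
    [first possible day, last possible day].
    Supported granularities: 'YYYY-MM-DD', 'YYYY-MM', 'YYYY'."""
    parts = expr.split("-")
    if len(parts) == 3:                      # precise day: degenerate interval
        d = date(int(parts[0]), int(parts[1]), int(parts[2]))
        return (d, d)
    if len(parts) == 2:                      # month granularity
        y, m = int(parts[0]), int(parts[1])
        last = calendar.monthrange(y, m)[1]  # number of days in that month
        return (date(y, m, 1), date(y, m, last))
    y = int(parts[0])                        # year granularity
    return (date(y, 1, 1), date(y, 12, 31))

print(to_interval("2014-05"))  # interval covering 2014-05-01 .. 2014-05-31
```

Any subinterval of the returned interval is then a valid instance of the expression in the sense defined above.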
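A minimal sketch of the temporal Hausdorff distance, assuming intervals are represented as pairs of Python dates (our own representation). For intervals, the outer maximum is attained at an interval endpoint, so only the endpoints need to be checked:

```python
from datetime import date

def hausdorff_days(t1, t2):
    """Symmetric Hausdorff distance (in days) between two closed
    day intervals t1 = (a1, b1) and t2 = (a2, b2)."""
    def point_dist(i, lo, hi):     # d_h(i, t): distance of day i to [lo, hi]
        if i < lo:
            return (lo - i).days
        if i > hi:
            return (i - hi).days
        return 0
    def directed(a, b):            # max over days in a of d_h(., b),
        lo, hi = b                 # attained at an endpoint of a
        return max(point_dist(a[0], lo, hi), point_dist(a[1], lo, hi))
    return max(directed(t1, t2), directed(t2, t1))

may = (date(2014, 5, 1), date(2014, 5, 31))    # "2014-05"
june = (date(2014, 6, 1), date(2014, 6, 30))   # "2014-06"
print(hausdorff_days(may, june))  # → 31
```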
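The grid-based approximation discussed above can be sketched as follows. Boxes, the step parameter, and the function names are illustrative assumptions; the step size makes the accuracy/cost trade-off explicit:

```python
def grid_points(box, step):
    """Sample a fixed raster of points from an axis-aligned bounding box
    ((xmin, ymin), (xmax, ymax))."""
    (xmin, ymin), (xmax, ymax) = box
    pts = []
    x = xmin
    while x <= xmax + 1e-9:
        y = ymin
        while y <= ymax + 1e-9:
            pts.append((x, y))
            y += step
        x += step
    return pts

def hausdorff_grid(box1, box2, step=1.0):
    """Directed Hausdorff distance between two boxes, approximated on the
    raster; a smaller step gives more points and a better approximation."""
    p1, p2 = grid_points(box1, step), grid_points(box2, step)
    d = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return max(min(d(x, y) for y in p2) for x in p1)
```

For example, `hausdorff_grid(((0, 0), (1, 1)), ((3, 0), (4, 1)), 1.0)` rasterizes each unit box into its four corners and returns the largest corner-to-nearest-corner distance.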
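A small sketch of the weighted combination and top-k selection, assuming the per-component distances to the reference event have already been computed (as the CalcDistance step does in the framework); the record layout is our own invention for this example:

```python
import heapq

def knn(events, k, w_g, w_t):
    """k nearest neighbors under the weighted distance
    dist(e) = w_g * dist_g(e) + w_t * dist_t(e), with w_g + w_t = 1."""
    assert abs(w_g + w_t - 1.0) < 1e-9
    return heapq.nsmallest(k, events,
                           key=lambda e: w_g * e["dist_g"] + w_t * e["dist_t"])

events = [
    {"id": "a", "dist_g": 10.0, "dist_t": 1.0},
    {"id": "b", "dist_g": 2.0,  "dist_t": 5.0},
    {"id": "c", "dist_g": 4.0,  "dist_t": 4.0},
]
print([e["id"] for e in knn(events, k=2, w_g=0.5, w_t=0.5)])  # → ['b', 'c']
```

Shifting the weights changes the ranking: with w_g close to 0, event "a" (temporally closest) would win instead.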
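The dominance relation and skyline definition above can be sketched with a naive block-nested-loop scan; the framework's SkylineOp uses a distributed grid/bitset algorithm [14] instead, so this is only meant to illustrate the definition (record layout again our own):

```python
def dominates(a, b):
    """a dominates b w.r.t. the query q if a is at least as close in both
    dimensions and strictly closer in at least one."""
    return (a["dist_g"] <= b["dist_g"] and a["dist_t"] <= b["dist_t"]
            and (a["dist_g"] < b["dist_g"] or a["dist_t"] < b["dist_t"]))

def skyline(events):
    """All events not dominated by any other event."""
    return [e for e in events
            if not any(dominates(o, e) for o in events if o is not e)]

events = [
    {"id": "a", "dist_g": 10.0, "dist_t": 1.0},
    {"id": "b", "dist_g": 2.0,  "dist_t": 5.0},
    {"id": "c", "dist_g": 4.0,  "dist_t": 4.0},
    {"id": "d", "dist_g": 5.0,  "dist_t": 5.0},  # dominated by b and c
]
print(sorted(e["id"] for e in skyline(events)))  # → ['a', 'b', 'c']
```

Note that no weights are required: "a", "b", and "c" all survive because each is best in some trade-off between the two distances.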
Thus, events that belong to the same cluster can be considered to be correlated, as they occur around the same time and place.

Figure 2: Overview of the event analysis pipeline (extraction, mapping & normalization, event correlation & resolution, and import into a raw event data repository, supported by contextual information such as GeoNames and YAGO2 vocabularies).

4. PROCESSING PIPELINE

After the events have been extracted from the text documents using the approach outlined above, they are fed into the correlation pipeline. This pipeline is implemented on the Apache Spark (http://spark.apache.org) platform. However, it can easily be ported to other platforms like Apache Flink (http://flink.apache.org) or even Google's new Dataflow SDK (https://cloud.google.com/dataflow/) if they provide a Scala-based API. We chose Spark over other MapReduce-based approaches, e.g., Pig, because it provides a very efficient data structure called resilient distributed datasets (RDDs). RDDs are immutable, in-memory, partitioned collections that can be distributed among the cluster nodes. With the help of such a data structure, computations are significantly faster than in MapReduce programs, which materialize their intermediate results to disk. Furthermore, Spark allows implementing the iterative models that are needed for our computations as well as programs following the MapReduce paradigm.

Figure 3: Framework overview (RawEventData is transformed by PrepareEvents, using a GeoDB, into EventData; CalcDistance, given a reference event and distance functions, derives EventDistanceData, which feeds TopKOp (k, weights), SkylineOp (dominance relation), and ClusteringOp to produce EventList, EventSkyline, and EventCluster).

The Spark operators represent transformations on these RDDs. Fig. 3 gives an overview of the correlation pipeline; the tasks of the operators are described below:

PrepareEvents: This operator transforms a set of raw (textual) event data into a set of event records <t, g> conforming to our framework. This means that textual temporal and geographic properties are normalized into numerical values, i.e., date/time values and points or polygons for spatial descriptions such as names of cities or locations.

CalcDistance: This transformation operator calculates the geographic and temporal distance of each event of an RDD to a given reference event.

TopKOp: This operator computes the k nearest neighbors as a top-k list of events from an input RDD produced by CalcDistance. Parameters to this operator are k as well as the weights for the geographic (w_g) and temporal (w_t) distance.

SkylineOp: This operator computes the skyline of event records from an RDD produced by CalcDistance. Our implementation adapts the grid-based approach using bitsets described in [14] to the Spark platform. The dominance relation can be passed as a parameter to the operator.

ClusteringOp: Finding groups of correlated events is realized by the ClusteringOp operator, which implements a parallel variant of DBSCAN [9] for spatio-temporal data. Parameters are the standard clustering parameters ε and MinPts as well as a global distance function taking both geographic and temporal distances into account.

5. TOWARDS AN INTERACTIVE EXPLORATION OF EVENT CORRELATIONS

We will use the above processing pipeline to build an event repository that makes the extracted event data publicly available as Open Data. As the underlying data structure we are going to use the Linked Data principle, which allows for a flexible schema for different types of events and also makes the information contained in the events themselves machine readable. Furthermore, one can easily add links between events that have been identified as correlated and thus make these correlations available for querying and processing as well. Adding links to other datasets like YAGO2 or DBpedia will enrich our event repository with even more useful and broader information that can be used for data analysis and exploration.

The event repository will be free to download and will also be made queryable via web services. Via an API, users can run Spark programs (using the operators introduced in Sect. 3) or SPARQL queries.

Next to the event repository, we plan to develop a data exploration platform that allows users to interactively analyze the data. Different from traditional business intelligence frameworks, which provide only a predefined set of tools for reporting and visualization, our planned platform allows very different interaction paradigms. We integrate the idea of notebooks, inspired by Mathematica's interactive documents and IPython's (and Jupyter's) notebooks. With the help of such notebooks, users can create their own programs that fit their needs and run them in a managed cluster environment without having to set up any hardware or software first. The idea of web-based notebooks allows users to organize text, scripts, and plots on a single webpage, so that all information for data analytics can be kept in one place.

For our data exploration tool we chose Apache Zeppelin (https://zeppelin.incubator.apache.org), because it already has support for Spark programs and also provides the possibility to visualize script results. Content in the format of CSV, HTML, or images can directly be visualized as tables, bar charts, line graphs, and scatter plots. A notebook in Zeppelin is organized in paragraphs, where each paragraph can contain and execute a script of a different (supported) language. On execution, the script content of a paragraph is sent to the server, which forwards it to the appropriate interpreter. The interpreter is chosen by a magic keyword given by the user at the beginning of the script, e.g., %spark. When the execution has finished, the interpreter returns the results to the server, which in turn publishes them on the website. Figure 4 shows a screenshot of a notebook in Zeppelin that executed a Pig [16] script and shows the results of that script in a bar chart.

Figure 4: Screenshot of Zeppelin after executing a Pig script. The results are visualized as a bar chart.

Zeppelin currently supports shell scripts and Spark programs written in Scala or Python out of the box and can also make use of Spark modules like SparkSQL and Spark Streaming. Although the Spark support lets users create efficient data analytics programs, it may be hard for a non-programmer to create such programs. We argue that a declarative approach will be easier to use for users who are not familiar with the Spark API and the concept of RDDs. For this reason, we integrate a new interpreter that allows running Pig scripts. Since our event repository uses Linked Data, we do not use the conventional Pig interpreter, but our extended Pig version, which allows the use of SPARQL-like features, such as basic graph patterns (BGPs), in Pig scripts [11]. To do so, we enhanced Pig's FILTER operator. This allows the user to easily define queries on the Linked Data sets with a more powerful language than SPARQL. However, the triple structure of the Linked Data concept is not very well suited for a tuple-based execution environment, as it requires a lot of (self-)joins to reconstruct all details of an entity from the triples. To solve this issue, in [11] we introduced a new triple bag format (based on the concept introduced in [12]) that combines the flexibility of triples with the convenience of tuples. We also showed that this format can lead to significant performance gains.

However, since Pig programs are compiled into MapReduce programs, their execution takes longer than that of Spark programs. Thus, to combine the advantages of both approaches, in our ongoing work we use the Pig Latin language and compile it into Spark programs. This will allow users to write intuitive data flow scripts without needing to know a specific API or the characteristics of the underlying platform, and to obtain very efficient programs that execute faster than MapReduce programs.

6. RELATED WORK

For our event correlation framework, we employ concepts developed in different areas: information extraction, event similarity (especially spatio-temporal similarity), as well as skylines and clustering algorithms for distributed environments.

To extract the spatial and temporal information from textual data, we use concepts that were described, e.g., in [22, 23]. In addition to these works, the notions of event similarity and relationships between events have also been studied in the context of social media, e.g., [3, 17, 18].

Our correlation operators mainly focus on skylines and clustering. Among the numerous techniques that build on the concept of skyline queries [5], there has also been work studying skylines for geographic and uncertain or probabilistic data. This includes spatial skyline queries [19, 20], where no temporal aspects are considered for the data points. To compute skylines for large datasets, new algorithms have been proposed that allow the computation in MapReduce environments. An important challenge here is to partition the data such that the partitioning itself is efficient and the data is distributed equally to all nodes, so that all nodes get approximately the same workload and no node has to do all the work while the others are idle. In [14] an approach using a grid-based partitioning is introduced, and [6] presents an angular partitioning approach.

For clustering, density-based algorithms like DBSCAN [9] are chosen if the number of clusters is not known a priori. For dealing with spatio-temporal data, an extension of DBSCAN has been proposed in [4] that uses two ε parameters (one for the spatial and one for the temporal distance). To run the clustering algorithms in a distributed MapReduce environment, we rely on concepts developed in [15, 8].

7. SUMMARY

In this paper we presented our approach for an event correlation framework based on Apache Spark. The framework provides operators for importing data as well as for clustering, skylines, and k-nearest-neighbor queries. We chose Apache Zeppelin as our exploration tool to allow the incremental creation of scripts and an immediate visualization of results.

Acknowledgements

This work was funded by the German Research Foundation (DFG) under grant no. SA782/22.

8. REFERENCES

[1] C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. CRC Press, 2013.
[2] J. Allan, editor. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, 2002.
[3] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In WSDM, pages 291–300, 2010.
[4] D. Birant and A. Kut. ST-DBSCAN: An algorithm for clustering spatial–temporal data. Data & Knowledge Engineering, 2007.
[5] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421–430, 2001.
[6] L. Chen, K. Hwang, and J. Wu. MapReduce skyline query processing with a new angular partitioning approach. In IPDPSW, May 2012.
[7] L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, 2005.
[8] B.-R. Dai and I.-C. Lin. Efficient Map/Reduce-based DBSCAN algorithm with optimized data partition. In CLOUD, June 2012.
[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
[10] A. Ganguly, O. Omitaomu, Y. Fang, S. Khan, and B. Bhaduri. Knowledge discovery from sensor data for scientific applications. In Learning from Data Streams, 2007.
[11] S. Hagedorn, K. Hose, and K.-U. Sattler. SPARQLing Pig – Processing Linked Data with Pig Latin. In BTW, March 2015.
[12] H. Kim, P. Ravindra, and K. Anyanwu. From SPARQL to MapReduce: The journey using a nested TripleGroup algebra. PVLDB, 4(12):1426–1429, 2011.
[13] J. Makkonen, H. Ahonen-Myka, and M. Salmenkivi. Topic detection and tracking with spatio-temporal evidence. In Advances in Information Retrieval, volume 2633, pages 251–265, 2003.
[14] K. Mullesgaard, J. L. Pedersen, H. Lu, and Y. Zhou. Efficient skyline computation in MapReduce. In EDBT, 2014.
[15] M. Noticewala and D. Vaghela. MR-IDBSCAN: Efficient parallel incremental DBSCAN algorithm using MapReduce. International Journal of Computer Applications, 93(4):13–17, 2014.
[16] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, 2008.
[17] W. Ribarsky, D. Skau, X. Wang, W. Dou, and M. X. Zhou. LeadLine: Interactive visual analysis of text data through event identification and exploration. In VAST, pages 93–102, 2012.
[18] A. D. Sarma, A. Jain, and C. Yu. Dynamic relationship and event discovery. In WSDM, pages 207–216, 2011.
[19] M. Sharifzadeh, C. Shahabi, and L. Kazemi. Processing spatial skyline queries in both vector spaces and spatial network databases. TODS, 2009.
[20] W. Son, M.-W. Lee, H.-K. Ahn, and S.-w. Hwang. Spatial skyline queries: An efficient geometric algorithm. In SSTD, pages 247–264, 2009.
[21] J. Strötgen and M. Gertz. Multilingual and cross-domain temporal tagging. Language Resources and Evaluation, 47(2):269–298, 2013.
[22] J. Strötgen and M. Gertz. Proximity2-aware ranking for textual, temporal, and geographic queries. In CIKM, pages 739–744, 2013.
[23] J. Strötgen, M. Gertz, and C. Junghans. An event-centric model for multilingual document similarity. In SIGIR, pages 953–962, 2011.
[24] Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In SIGIR, pages 28–36, 1998.
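To illustrate density-based clustering with separate spatial and temporal thresholds, here is a simplified, single-machine sketch in the spirit of ST-DBSCAN [4]; the framework's ClusteringOp is a parallel DBSCAN variant, and the scalar event coordinates used here are a deliberate simplification:

```python
def st_dbscan(events, eps_g, eps_t, min_pts):
    """Density-based clustering where the neighborhood requires closeness
    in both space (eps_g) and time (eps_t). Each event is a (geo, time)
    pair of plain numbers here, purely for illustration.
    Returns one cluster label per event; -1 means noise."""
    def neighbors(i):
        gi, ti = events[i]
        return [j for j, (gj, tj) in enumerate(events)
                if abs(gi - gj) <= eps_g and abs(ti - tj) <= eps_t]

    labels = [None] * len(events)
    cluster = 0
    for i in range(len(events)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # noise (may later become a border point)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:    # j is a core point: expand the cluster
                queue.extend(m for m in more
                             if labels[m] is None or labels[m] == -1)
        cluster += 1
    return labels

# two dense groups, far apart in both dimensions, plus one outlier
data = [(0, 0), (1, 0), (0, 1), (50, 50), (51, 50), (50, 51), (200, 0)]
print(st_dbscan(data, eps_g=2, eps_t=2, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```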
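To illustrate how the operators chain into a pipeline, the following plain-Python stand-ins mimic the data flow PrepareEvents → CalcDistance → TopKOp. The real operators are Spark transformations on RDDs; the parsing format and all record details here are invented for the example:

```python
def prepare_events(raw):
    # "PrepareEvents" stand-in: normalize textual records into
    # (time, location) pairs; the "t;g" format is a fake for illustration.
    return [tuple(map(float, r.split(";"))) for r in raw]

def calc_distance(events, ref):
    # "CalcDistance" stand-in: attach temporal and geographic distances
    # to a given reference event.
    rt, rg = ref
    return [{"event": e, "dist_t": abs(e[0] - rt), "dist_g": abs(e[1] - rg)}
            for e in events]

def top_k(scored, k, w_t, w_g):
    # "TopKOp" stand-in: weighted combination, keep the k closest.
    return sorted(scored,
                  key=lambda s: w_t * s["dist_t"] + w_g * s["dist_g"])[:k]

raw = ["10;3", "11;4", "50;40"]
result = top_k(calc_distance(prepare_events(raw), ref=(10.0, 3.0)),
               k=1, w_t=0.5, w_g=0.5)
print(result[0]["event"])  # → (10.0, 3.0)
```

In Spark, each function would instead be a transformation producing a new RDD, so the same chaining style carries over directly.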