=Paper=
{{Paper
|id=Vol-1743/paper16
|storemode=property
|title=DAZIO: Detecting Activity Zones based on Input/Output Call and SMS Activity
|pdfUrl=https://ceur-ws.org/Vol-1743/paper16.pdf
|volume=Vol-1743
|authors=Miguel Nuñez-del-Prado,Ana Luna,Romain Gauthier
|dblpUrl=https://dblp.org/rec/conf/simbig/Nunez-del-Prado16a
}}
==DAZIO: Detecting Activity Zones based on Input/Output Call and SMS Activity==
DAZIO: Detecting Activity Zones based on Input/Output call and SMS activity Miguel Nuñez-del-Prado-Cortez Ana Luna Romain Gauthier Universidad del Pacı́fico Universidad del Pacı́fico Instersec Labs Av. Salaverry 2020 Av. Salaverry 2020 Parı́s - France Lima - Perú Lima - Perú romain.gauthier@intersec.com m.nunezdelpradoc@up.edu.pe ae.lunaa@up.edu.pe Abstract of installing new infrastructure, to provide a better QoS or to plan their infrastructure. Mobile telecoms operators possess an Therefore, in the present paper, we will identify enormous quantity of data, which could be activity and transit zones to monitor and to predict used to reduce the cost of installing new the activity levels in the telecoms operators net- infrastructure, to provide a better QoS or work. These monitoring and prediction are based to plan their infrastructure. Thus, they are on the SMS and calls input/output activity levels concerned to model, understand and pre- issued from the Telecoms Italia Big Data Chal- dict SMS and calls activity levels in their lenge1 . The results of the present study is directly infrastructures. Besides, SMS and call applied for: (1) targeting advertisement to activ- activities analysis can open new business ity zones; (2) proposing a suitable place to open a opportunities for geomarketing as well as new store in a city or (3) planning where to add trade area analysis. In the present effort, cell towers to improve QoS. we detected activity zones with a differ- In the present effort, we describe a methodology to ence of only 0.5 km from the reference detect activity and transit zones. More precisely, activity areas extracted from Geo-tweets. the contribution of this work is twofold. On one We also used Markov chains to represent hand, we present an Activity Markov chain model and predict SMS and call activity lev- to represent activity levels. On the other hand, we els, achieving a prediction success rate be- predict future activity levels using the aforemen- tween 80% and 90%. tioned model. The rest of the paper is organized as follows. First, Section 2 describes the related 1 INTRODUCTION works on activity zones detection. Then, Section 3 presents the datasets we use for experiments. Telecoms data is a rich information source for Next, Section 4 introduces our technique to detect many purposes, ranging from urban planning and model activity zones as well as the approach (Toole et al., 2012), human mobility patterns to forecast activity levels. Section 5 shows corre- (Ficek and Kencl, 2012; Gambs et al., 2011), lation measurements of activity levels versus pol- points of interest detection (Vieira et al., 2010), lution and weather conditions. Finally, Section 6 epidemic spread modeling (Lima et al., 2013), concludes the paper and depicts some future direc- community detection (Morales et al., 2013), disas- tions. ter planning (Pulse, 2013) and social interactions (Eagle et al., 2013). 2 RELATED WORK One common effort for these applications is to de- Dense areas detection has been studied from a termine dense areas where many users stay for a human mobility point of view, using fine grain significant amount of time, namely activity zones. and coarse-grained location data. As an example Another task is to identify contiguous zones relay- of fine-grained location, the work of (Gambs et ing activity zones (i.e., transit zones). Thus, in our al., 2011) use mobility traces of 172 Yellow Cabs context, the detection of activity and transit zones 1 Telecoms Italia Big Data Challenge web- as well as the interaction between identified ac- site: www.telecomitalia.com/tit/en/ tivity zones are crucial tasks for reducing the cost bigdatachallenge.html 130 Taxis, issued from GPS, in San Francisco Bay (Pi- pared to the ground truth. orkowski et al., 2009) to detect taxi’s point of in- Another, more refined technique to identify dense terests (POI). These POIs are equivalent to activity areas respecting natural tessellation is presented zones, which tend to be zones with high pedes- by (Vieira et al., 2010). Authors use CDR loca- trian presence. The authors rely on the begin- tions from calls of one million users during four end heuristic (Gambs et al., 2010) and cluster- months over an area of 80 000 km2 . They propose ing algorithms, such as Density Joinable (Zhou et a method composed of three phases: the first step al., 2004), Density Time (Hariharan and Toyama, is the graph construction, which relies on Delau- 2004) and Time Density clustering (Gambs et al., nay triangulation (Dobkin and Laszlo, 1987). The 2010) to detect POIs in San Francisco city. An- triangulation algorithm makes connexions (edges) other fine-grain data used to detect activity areas between near antennas (vertex) maximizing the are issued from Geo-social networks. size of the angles of the triangles. Once the graph The work of (Qu and Zhang, 2013) uses is built, all edges are weighted by the total activ- Foursquare’s check-ins from 446 users during ten ity and by the number of users of both connected months for identifying trade areas. They rely on antennas. The second phase is the computation of four different techniques like Center of Mass lo- dense areas based on a maximum spanning tree cation, the most commonly checked-in location, build using the Kruskal algorithm (Kruskal, 1956). the place with the highest check-in density and Taking as input the weighted graph G, the idea the center of mass of the most frequently visited behind this algorithm is to find a subgraph of G, location cluster. The algorithm use for cluster- which maximizes the density and does not con- ing is DBSCAN (Ester et al., 1996). Once they tain any cycle. At last, the post-processing phase have identified the activity centers, they mark the uses (Shiloach and Vishkin, 1982) algorithm to es- boundary of the area using drive-time/distance tablish groups of antennas representing dense ar- polygon (Kures and Pinkovitz, 2011). This tech- eas. Thus, the algorithm groups adjacent vertex nique consists of computing the decay distance from previously computed sub-graph to find a set from a given store to home or work (authors as- of close vertex (a set of antennas). The authors val- sume that the two most checked-in places are idate their results empirically based on the subway home and work). The drawback of this approach structure of the region under study. Inspired by is that the selected users are conditioned to check- these works, we propose a novel ad-hoc method- in in the store under study. Thus, the dataset is ology to find activity (dense) zones as well as to biased. model and forecast their activity levels using the data provided by the Telecom Italia Big Data Chal- Other works use coarse grain location from Call lenge (TIM challenge). We describe this dataset in Data Records (CDR). For instance, the work of the next section. (Isaacman et al., 2011) uses Hartigan’s leader clus- tering algorithm (Hartigan, 1975) to identify dense 3 DATASET areas. First, authors sort antennas by the amount of time that phones contact the antenna. Once data Datasets provided by the TIM challenge were is sorted, the clustering algorithm takes the first collected in the cities of Milan and Trento over antenna as the centroid of a cluster. Then, it veri- November and December 2013. Our study only fies if the next antenna is within a distance d from takes into account the dataset gathered from Milan the centroid. If it is not the case, the antenna be- city. Nevertheless, all the analysis and the method- comes the centroid of a new cluster. In the case ology could be generalized for any city. In the next the antenna is within the distance, the algorithm subsections, we will describe the datasets provided computes the new centroid as the weighted aver- for Milan city. These datasets were used to detect age. They repeat the process until all antennas be- and model the activity zones (primary datasets) long to a cluster. Researchers use CDR locations and to find some correlation between activity and of 97 and 71 thousand unique users in Los An- other measures like air quality and weather condi- geles and New York cities collected over 2 and a tions (auxiliary datasets). It should be noted that half months as well as 19 volunteers as the ground space was discretized in a grid, and all measures truth to validate their results. They were able to are normalized to correspond in one square of the estimate dense areas with an error of 3 miles com- grid. 131 3.1 Main datasets number of vehicles that belongs to Milan, the The datasets of Milan city we use to detect and number of vehicles that is not from Milan, model the activity zones are: the number vehicles with engine ignition systems, in movement and stopped (c.f. Milan Grid is a geographical segmentation over Table 3). the city to aggregate the measurements of the other datasets. The area of each square is Id Time Direction 55 225 m2 , and it has 10 000 squares in the 60 17/12/13 18:00 WEST Avg speed Std speed Mi plates form of a point (x, y) and the latitude and 24 95 21 longitude belonging to this x, y position. An Non-Mi plates Ignition Mov/Stopped example of the described data as well as the 62 2 2/0 grid over Milan are introduced in Table 1 and Table 3: Example of private mobility data in Milan Figure 1, respectively. Point Latitude Longitude 3.2 Auxiliary datasets x1 , y1 9.011 45.568 We also used additional datasets to analyze activ- ity zones, dynamics, and correlations, as we show Table 1: Example of the Milan grid data in the following paragraphs. Telecommunications - MI to MI provides infor- mation regarding the directional interaction strength, between the city of Milan and dif- ferent areas based on the calls exchanged be- tween Telecom Italia Mobile users. More precisely, this dataset contains the origin and destination Id squares, the time and the direc- tional interaction strength i.e., Activity (c.f. Table 4) Figure 1: Milan grid Id 1 Id 2 Time Activity Telecommunications (SMS, Call and Internet) 1 3 1383345474 0.24 provides information about the activity of a Table 4: Example of activity data between Milan zones square concerning received and sent SMS, incoming and outcoming calls as well as Precipitation describes the intensity and the pre- internet usage. This data is temporal ag- cipitation type over the city of Milan. In gregated in timeslots of ten minutes and more detail, the dataset uses a coarse spa- provides the measure of the activity of a tial aggregation by dividing Milan city into given event as well as the square id (c.f. Table four quadrants (northeast, northwest, south- 2). This kind of information is organized in a east and southwest). The intensity value of way that SMS-in and SMS-out activity scale the phenomenon is between 0 and 3, the per- are given in arbitrary units, and their values cent of coverage of a given quadrant and the range from 0 to 1. precipitation type between 0 and 2, where 0 Id Time Country SMS-in means absence of precipitation, 1 is rain and 1 1383265200 39 0.24 3 is snow (c.f. Table 5). SMS-out Call-in Call-out Data 0.16 0.108 0.026 6.83 Time Id Intensity Coverage Type Table 2: Example of activity data in Milan 201311060220 1 1 45 1 201311060220 2 0 0 0 201311060220 3 0 0 0 Private Transportation (Cobra Telematics) 201311060220 4 2 78 1 gives information about the private mobility in Milan city by measuring the speed, the Table 5: Example of activity data between Milan zones 132 Air Quality describes the air pollution monitor- 4.1 Detecting activity levels ing system of Milan city obtained by us- The basic idea behind this method is to have a ing various types of sensors located within good representation of the activity variation levels the city limits. This environmental dataset over the time. Activity levels could be classified measures a different kind of contamination in three different degrees, low, medium and high. agents, such as Ammonia, Nitrogen Dioxide, Cumulative distribution of incoming/outcoming Total Nitrogen, Particulate Matter 2.5 µm SMS and call activity levels illustrated by a Heat (PM2.5), Particulate Matter 10 µm (PM10), map over Milan city, as shown in Figure 2, were Benzene, Sulphur Dioxide, Black Carbon, used to analyze data. The objective is to, empir- Carbon Monoxide and Ozone. An exam- ically, find a suitable threshold to distinguish a ple of pollution measure is given in Table square with high activity level from a square with 6, where the characteristics of this particular medium or low activity levels represented as green sensor are in Table 7. and red in Figure 3, respectively. Sensor id Time Measure 5823 2013/12/30 04:00 1.9 Table 6: Example of air quality measure Sensor id Lat/Lon Pollution 5823 45.24/9.27 Carbon Monoxide Table 7: Description of the sensor 5823 Social Pulse contains data derived from an analy- sis of geolocalized tweets originated in Mi- lan. This dataset provides a user id, DB- Pedia entity, tweets language, municipality, time, timestamp and location (c.f. Table 8). User Entities Language 5fa4b1cc71 Halloween En Figure 2: Milan activity heat map Municipality Timestamp Lat, Lon Milan 1383260474 9.21, 45.49 Figure 3 depicts the cumulative distribution of Table 8: Example of Social Pulse data the aggregated incoming and outcoming SMS and call activity of the telecommunications dataset (c.f. Based on the aforementioned datasets, we have Subsection 3.1). The Heat map is built for an ac- implemented our experiments using the main and tivity threshold of 25 units. Based on this visual- the auxiliary datasets. These experiments are de- ization technique, that amount of units seems to tailed in sections 4 and 5, respectively. be a good trade-off between compact and well- separated activity zones. Heat maps were used 4 EXPERIMENTS to represent represent tourist activity as shown by In the present section, we describe our methodol- Olteanu et al. (Olteanu et al., 2011). ogy to discover, model and predict the behavior In order to detect groups of squares representing of a zone. We distinguish two different areas, the an activity zone, we can use a high activity thresh- activity zone, where people stay on a regular ba- old (c.f. Subsection 4.2). In addition, we study, in sis for a significant amount of time and the transit detail, the activity over work hours to analyze the zone which is the area used by individuals to go difference between busy and idle squares. Figure from one activity zone to another. In the next sub- 4 shows the difference in activity levels between sections, we describe how to recognize an activity activity and transit zones. From 8 AM to 8 PM the zone from a transit zone (Subsection 4.1), how to activity is considerably high, that is why that area model activity levels (Subsection 4.2) and finally is composed of a vast number of squares during how to predict them (Subsection 4.3). the day. On the other hand, transit zones display a 133 Figure 3: Cumulative distribution of the sms/call activity with heat maps Figure 5: Vehicles speed heat map based on Cobra dataset much lower activity level and a higher fluctuation throughout the day. Figure 4: Activity over time in workdays between 8AM- 8PM. Activity zone (left) and transit zone (right) Taking into account the elements as mentioned earlier, we use the Heat map and activity threshold presented in Figure 3 for detecting high activity zones. Nevertheless, we need to define the borders of these activity zones. From Figure 4, we can in- fer that this irregularity of transit zone represents Figure 6: Centroids of the activity zones in Milan city (blue), movement. Thus, the Cobra dataset gathers the centroid of the clusters from Geo tweets (green) and com- modities issued from Foursquares check-ins shops (red) and information about the movement of vehicles and restaurants (purple). we depicted this information in a Heat map over Milan city in Figure 5. The speed combined with the activity level allowed us to detect the activity geolocalized tweets dataset to verify the accuracy zones, as well as their borders. As the result of of our methodology. Applying DBSCAN (Ester the combination of this two variables, we obtained et al., 1996) clustering algorithm with at least 5 28 activity areas and the centroids of these activity points per cluster within a radius of 3 km over zones which are shown in Figure 6 in blue color. 8 282 users (i.e.,109 762 geolocalized tweets). We Since we do not have the ground truth, we used the obtained 24 clusters depicted as green points in 134 Figure 6. Thus, some groups are close to the iden- (transition probabilities). In our case, an Activ- tified activity zones in the northeast and south. In ity Markov chain is a probabilistic automaton (PA) Northwestern, activity zones are represented by model that represents, in a compact way, the occu- only one cluster due to the proximity of geolocal- pation (activity) of a square or activity zone. The ized tweets and the approach of DBSCAN to build nodes symbolize the state (low, medium or high) clusters. The Downtown area has many clusters of the squares or zones ([activity level] [zone due to commodities concentration. To verify this code], ex: L A) and edges, weighted with a prob- fact, we have included two categories of check-ins ability, represent the transition from one state to from Foursquare, like shops (red) and restaurants another over time windows. This model could be (purple). One thing that surprised us was the de- expressed in the form of a graph or the form of a tection of a large group of geolocated tweets in transition matrix (c.f. Figure 9). the southern outskirts of Milans grid. Using these The process for building an Activity Markov data we found that the distance between the cen- model is divided into two stages. The first one troids of the clusters and activity zones is 0.5 km is basically to order the events in a chronological closer. These results validated the accuracy of our way. Then we classify them as low, if they have heuristic method to find activity zones. Finally, we less than 15 activity units, as medium if activity present the centroid of the detected activity zones units are between 15 and 25 and as high if there in Figure 7. These regions are used in the next are more than 25 activity units. Then the transi- subsection for predicting purpose. tion matrix is built by counting the variations from one level to another, taking care to avoid loops. When events are not recorded anymore, the matrix is normalized to obtain the transition probabilities. As shown in Figure 8, we have divided the time into 4 different windows depending on the range of time studied, each one has 6 hours and starts at 6:00 am; for both weekdays and weekends; giv- ing a total number of eight windows. Furthermore, in each time window interactions between differ- ent levels of activity are also modeled. Moreover, we matched, after a model processing, an activity zone of a time window with another (blue arrows). We finally conclude, from the stationary vector of Markov chains, that activity areas are occupied in only 11% and free in 71%. It is important to point out that, for improving the accuracy of the anal- ysis, it would be better to divide the time slots in less intervals instead of 6 hours. Figure 7: Centroids of the activity zones in Milan. Until this point we showed that we are able to identify high activity squares, activity and transit Up to this point, we are able to identify activity zones. In the next subsections, we detail how to zones. Thus, the next tasks are to model the behav- predict, in an unprecedented way and with a very ior activity levels in the detected regions as well as acceptable rate of success, not only activity levels to predict the activity levels in the identified activ- but also their possible changes. ity zones. 4.2 Modeling activity levels 4.3 Prediction of activity levels Relying on the thresholds mentioned above, we Anticipating high activity levels within an “ac- can build an Activity Markov chains model to rep- tivity area” allows Telecoms operators to plan or resent the change of activity levels over time. An avoid unnecessary investment in infrastructure, as Activity Markov chain model is a stochastic pro- well as ensure the QoS or to start a new en- cess where the changes of states are related to a trepreneurship. In this paper, our prediction, in- probability associated with various state changes spired from the work of Gambs et al. (Gambs et 135 Figure 8: Example of an activity Markov chain (activity over time windows) al., 2012), is performed using the transition matrix inputs, the algorithm returns the maximal outgo- and allows us to obtain changes of activities be- ing probability from the transition matrix taking tween temporary windows and within them. Then, into account only columns corresponding to the predictions of activity changes within the same same time windows of the index row (local tran- window and estimations can be made. For exam- sition from line 1 to 4 of the Algorithm 1). For in- ple, we wonder what is the probability of passing stance, in Figure 8, if the actual state is medium on from a low activity state to a high one or what is the time windows from 18:00h to 0:00h on week- the likelihood for a given activity to remain in the days, the prediction algorithm will give an output same state when we consider the next window and in the high level in the same time window. Another the same range of time. To answer these questions, kind of prediction is to take into account others we rely on the algorithm presented in Algorithm 1. columns instead of those that belong to the same time window (inter time windows transition from Algorithm 1: Prediction algorithm line 6 to 8 of the Algorithm 1). Given the low in Data: TransitionMatrix, indexRow, the time windows from 18h to 0h on weekdays in inTimeWindows Figure 8, the algorithm will output the low state on Result: indexColumn time windows from 0h to 6h on weekends. In the 1 if inTimeWindows then case of the output we have the same probability, 2 //predict next state in the time windows ties are break randomly. 3 i=maxOutgoingProbOnWin(indexRow) 4 indexColumn = i 5 else 6 //predict the next state when the time window change 7 i=maxOutgoingProbOnNextWin(indexRow) 8 indexColumn = i 9 return indexColumn Figure 9: Transition matrix example. Where H=high, M=medium, L=low, W=Weekday, We=Weekend and be- gin end hours. As a way of example we show two temporary More precisely, Algorithm 1 takes as input a tran- windows framed with purple numbers inside. sition matrix (transitionMatrix). Where the index of the row in the transition matrix corresponds to To validate the accuracy of the predictions, we the actual state of the system (indexRow) and a used data from the whole month of November as boolean value is used to indicate whether the pre- training set and the first 16 days of December as diction is local (inTimeWindows). Based on these testing set (we did not take into account New Year 136 we are going to use the auxiliary datasets to ana- lyze activity zones interaction and to study possi- ble correlation with the levels or evaluate how the weather impact the utilization of the Telecoms ser- vice. 5 PLAYING WITH OTHER DATASETS In the present section, we will study the interac- tion between detected activity zones (Subsection 5.1); the correlation between activity levels versus pollution measures (Subsection 5.2) and the influ- ence of the weather on the activity levels in the Telecoms operator infrastructure (Subsection 5.3). Figure 10: Success rate of prediction. Where local refers to prediction inside a time windows and transition (trans) refers 5.1 Interaction between activity zones to prediction over time windows.(#) means prediction value. Using the directional interaction activity dataset square zone between zones in the area of Milan, we plotted a Local 0.86 0.7 graph to visualize the communication exchange, Trans. 0.86 0.89 as well as various activity levels as we can appre- ciate in Figure 11, where the width of the edges Table 9: Table summarizing results of Figure 10 accounts for the logarithm of the aggregated activ- ity for the whole month of November. To extend celebration to avoid special dates that separate our the semantic of this graph, we modulated the size study from normal behavior). The results are de- of the nodes according to the amount of tweets picted in Figure 10, where the success rate (Equa- emitted from the corresponding zone taking into tion 1) is the ratio between the correct prediction account global pulses dataset (c.f. Subsection 3.2). and the total number of predictions and the num- ber of overall forecast. Note that the number of predictions is indicated in parentheses at the bot- tom part of each bar. #goodprediction success rate = (1) #predictions We observe, from Table 9, the success rate for both kinds of predictions, namely (1) within a time window (local) and (2) in different time win- dows (trans.). It is important to note that there are two distinct scenarios; the first one considers both local and trans. predictions based on Activity Markov chains models which were built from ac- tivity levels of squares; while the second scenario takes into account activity levels from detected ac- tivity zones to forecast future values. We did not performed k-fold cross validation since the train- ing set of a month is representative of the mobil- Figure 11: Interaction graph of activity zones. ity pattern. Thus, adding more mobility traces to the training test does not contribute to increase the We observe that Milan city has a star topology, success rate. where there is a central node that communicates So far, we are able to model, identify and predict with the other peripheral nodes. Another interest- activity levels in activity zones. In the next section, ing fact is that small nodes tend to communicate 137 to the central node. Nevertheless, there are a few exchanges between small contiguous nodes. 5.2 Forecasting pollution through activity In this subsection, we study the correlation be- tween the activity level presented on the telecom- munication dataset and air quality measurements (both described in Subsection 3.2) to forecast the pollution level of the Milan city based on the ac- tivity of the telecom operator antennas. Figure 12 shows the results of the correlation of the activity with respect to different polluting gasses as well as the number of vehicles in movement, ignition or stopped. Figure 13: Correlation of the activity of a square with radia- tion sensors. Figure 12: Correlation of the activity with pollution measures and private mobility (Cobra). Figure 14: Activity level of outgoing SMS and call. We found out that the activity has a positive cor- relation with PM10 (particulates matter with a di- red and light blue bars). Nevertheless, people tend ameter of 10 microns or less) and PM2.5 pollution to call or send less SMS when it is snowing (dark measures (fine particles of 2.5 micrometers of di- and light green bars). ameter or less). From Figure 13 we can visualize that the activity levels has a positive correlation 6 CONCLUSION AND FUTURE with the radiation measures and a negative corre- DIRECTIONS lation with the relative humidity. Our purpose in this research is to understand the activity of the telecommunication network by an- 5.3 Influence of weather on the activity alyzing several aspects of Milan’s phone traffic We compare the outgoing SMS and call activity flows. We were interested in the definition and in the presence of different weather phenomena’s morphology of the activity and transit zones; the scenarios, like rain, snow or the absence of both. prediction of activity levels over different regions For this purpose, we used the Precipitation dataset with a success rate between 80% and 90%; the in- (c.f. Subsection 3.2). teractions between the different activity zones and From Figure 14, we can observe that people send the influence of the weather and the pollution on more SMS and give more calls in presence of rainy that activity. Thus, our results offer a new way of weather, even if rain is slight (blue, red, yellow looking at the telecommunication traffic data by and green bars) than in normal conditions (dark, examining the various connections between appar- 138 ently uncorrelated datasets, providing insights to Alexander Varshavsky. 2011. Identifying important manage and to optimize the whole network. For places in peoples lives from cellular network data. In Per- vasive Computing, June. business opportunities, this means (1) new geo- marketing opportunities through a better under- J. B. Kruskal. 1956. On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. In Proceed- standing of users communication patterns, (2) new ings of the American Mathematical Society, 7. trade area analysis, (3) cheaper network load bal- ancing as well as (4) improved QoS. In the future, M Kures and Ryan Pinkovitz. 2011. Downtown and business district market analysis. http://fyi.uwex.edu/ we would like to study and analyze in detail the downtown-market-analysis/. opinions (sentiment analysis) discussed by users Antonio Lima, Manlio De Domenico, Veljko Pejovic, and and generators of tweets and identifying the geo- Mirco Musolesi. 2013. Exploiting cellular data for dis- location of these activity areas. Another line of ease containment and information campaigns strategies in investigation would include coarse mobility Call countrywide epidemics. Computing Research Repository, abs/1306(4534):1–7, Jun. Data Record (CDR) to take into account as an ad- ditional element for detecting activity and traffic J. Morales, W. Creixell, J. Borondo , J. C. Losada , and areas. R. M. Benito. 2013. Understanding ethnical interactions on ivory coast. In Data for Development (D4D) challenge, pages 115–120, May. References Ana-Maria Olteanu, Roberto Trasarti, T Couronn, Fosca Gi- D. P. Dobkin and M. J. Laszlo. 1987. Primitives for the ma- annotti, Mirco Nanni, Zbigniew Smoreda, and Cezary nipulation of three-dimensional subdivisions. In Proceed- Ziemlicki. 2011. Gsm data analysis for tourism applica- ings of the Third Annual Symposium on Computational tion. In Proceedings of the 7th International Symposium Geometry, pages 86–99, New York, NY, USA. on Spatial Data Quality (ISSDQ). Nathan Eagle, Alex S. Pentland, and David Lazer. 2013. Michal Piorkowski, Natasa Sarafijanovic-Djukic, and Inferring social network structure using mobile phone Matthias Grossglauser. 2009. CRAWDAD data set data. Proceedings of the National Academy of Sciences, epfl/mobility (v. 2009-02-24). Downloaded from 106(36):15274–15278., September. http://crawdad.org/epfl/mobility/, February. Martin Ester, Hans-Peter Kriegel, Joerg Sander, and Xiaowei United Nations Global Pulse. 2013. Mobile phone Xu. 1996. A density-based algorithm for discovering network data for development. http://www. clusters in large spatial databases with noise. In Second unglobalpulse.org/research, October. International Conference on Knowledge Discovery and Yan Qu and Jun Zhang. 2013. Trade area analysis using user Data Mining, pages 226–231. generated mobile location data. In International Confer- M. Ficek and L. Kencl. 2012. Inter-call mobility model: ence on World Wide Web, pages 1053–1064. A spatio-temporal refinement of call data records using Yossi Shiloach and Uzi Vishkin. 1982. An o(log n) parallel a gaussian mixture model. In Conference on Computer connectivity algorithm. Journal of Algorithms, 3(1):57– Communications, pages 469–477, March. 67. S. Gambs, M.-O. Nuñez Killijian, and M.. del Prado Cortez. Jameson L. Toole, Michael Ulm, Marta C. González, and Di- 2010. Gepeto: A geoprivacy-enhancing toolkit. In Ad- etmar Bauer. 2012. Inferring land use from mobile phone vanced Information Networking and Applications Work- activity. In Proceedings of the ACM SIGKDD Interna- shops (WAINA), 2010 IEEE 24th International Conference tional Workshop on Urban Computing, pages 1–8, New on, pages 1071–1076, April. York, NY, USA. Sébastien Gambs, Marc-Olivier Killijian, and Miguel M.R. Vieira, V. Frias-Martinez, N. Oliver, and E. Frias- Núòez del Prado Cortez. 2011. Show me how you move Martinez. 2010. Characterizing dense urban areas from and i will tell you who you are. Transition on Data Pri- mobile phone-call data: Discovery and social dynamics. vacy, 4(2):103–126, August. In Second International Conference on Social Computing, pages 241–248, August. Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. 2012. Next place prediction using mo- Changqing Zhou, Dan Frankowski, Pamela Ludford, Shashi bility markov chains. In Proceedings of the First Work- Shekhar, and Loren Terveen. 2004. Discovering personal shop on Measurement, Privacy, and Mobility, page 3. gazetteers: An interactive clustering approach. In Inter- ACM. national Workshop on Geographic Information Systems, pages 266–273, November. Ramaswamy Hariharan and Kentaro Toyama. 2004. Project lachesis: parsing and modeling location histories. In In Geographic Information Science, pages 106–124. John A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th edition. Sibren Isaacman, Richard Becke, Ramn Cceres, Stephen Kobourov, Margaret Martonosi, James Rowland, and 139