=Paper=
{{Paper
|id=Vol-1743/paper16
|storemode=property
|title=DAZIO: Detecting Activity Zones based on Input/Output Call and SMS Activity
|pdfUrl=https://ceur-ws.org/Vol-1743/paper16.pdf
|volume=Vol-1743
|authors=Miguel Nuñez-del-Prado,Ana Luna,Romain Gauthier
|dblpUrl=https://dblp.org/rec/conf/simbig/Nunez-del-Prado16a
}}
==DAZIO: Detecting Activity Zones based on Input/Output Call and SMS Activity==
<pdf width="1500px">https://ceur-ws.org/Vol-1743/paper16.pdf</pdf>
<pre>
    DAZIO: Detecting Activity Zones based on Input/Output call and SMS
                                 activity

Miguel Nuñez-del-Prado-Cortez       Ana Luna                                     Romain Gauthier
   Universidad del Pacífico   Universidad del Pacífico                           Intersec Labs
      Av. Salaverry 2020         Av. Salaverry 2020                                 Paris - France
         Lima - Perú               Lima - Perú                          romain.gauthier@intersec.com
    m.nunezdelpradoc@up.edu.pe              ae.lunaa@up.edu.pe


                     Abstract                              of installing new infrastructure, to provide a better
                                                           QoS or to plan their infrastructure.
     Mobile telecoms operators possess an                  Therefore, in the present paper, we will identify
     enormous quantity of data, which could be             activity and transit zones to monitor and to predict
     used to reduce the cost of installing new             the activity levels in the telecoms operators net-
     infrastructure, to provide a better QoS or            work. These monitoring and prediction are based
     to plan their infrastructure. Thus, they are          on the SMS and calls input/output activity levels
     concerned to model, understand and                    issued from the Telecom Italia Big Data
     predict SMS and call activity levels in their         Challenge1 . The results of the present study can be
     infrastructures. Besides, SMS and call                directly applied to: (1) targeting advertisement to
     activity analysis can open new business               activity zones; (2) proposing a suitable place to open a
     opportunities for geomarketing as well as             new store in a city or (3) planning where to add
     trade area analysis. In the present effort,           cell towers to improve QoS.
     we detected activity zones with a differ-             In the present effort, we describe a methodology to
     ence of only 0.5 km from the reference                detect activity and transit zones. More precisely,
     activity areas extracted from Geo-tweets.             the contribution of this work is twofold. On one
     We also used Markov chains to represent               hand, we present an Activity Markov chain model
     and predict SMS and call activity lev-                to represent activity levels. On the other hand, we
     els, achieving a prediction success rate be-          predict future activity levels using the aforemen-
     tween 80% and 90%.                                    tioned model. The rest of the paper is organized
                                                           as follows. First, Section 2 describes the related
1    INTRODUCTION                                          works on activity zones detection. Then, Section
                                                           3 presents the datasets we use for experiments.
Telecoms data is a rich information source for             Next, Section 4 introduces our technique to detect
many purposes, ranging from urban planning                 and model activity zones as well as the approach
(Toole et al., 2012), human mobility patterns              to forecast activity levels. Section 5 shows corre-
(Ficek and Kencl, 2012; Gambs et al., 2011),               lation measurements of activity levels versus pol-
points of interest detection (Vieira et al., 2010),        lution and weather conditions. Finally, Section 6
epidemic spread modeling (Lima et al., 2013),              concludes the paper and depicts some future direc-
community detection (Morales et al., 2013), disas-         tions.
ter planning (Pulse, 2013) and social interactions
(Eagle et al., 2013).                                      2       RELATED WORK
One common effort for these applications is to             Dense area detection has been studied from a
determine dense areas where many users stay for a          human mobility point of view, using fine-grained
significant amount of time, namely activity zones.         and coarse-grained location data. As an example
Another task is to identify contiguous zones linking       of fine-grained location data, the work of (Gambs et
activity zones (i.e., transit zones). Thus, in our         al., 2011) uses mobility traces of 172 Yellow Cabs
context, the detection of activity and transit zones       1 Telecom Italia Big Data Challenge website:
as well as the interaction between identified              www.telecomitalia.com/tit/en/bigdatachallenge.html
activity zones are crucial tasks for reducing the cost


Taxis, issued from GPS, in San Francisco Bay             pared to the ground truth.
(Piorkowski et al., 2009) to detect taxis' points of     Another, more refined technique to identify dense
interest (POIs). These POIs are equivalent to activity   areas respecting natural tessellation is presented
zones, which tend to be zones with high pedes-           by (Vieira et al., 2010). Authors use CDR loca-
trian presence. The authors rely on the begin-           tions from calls of one million users during four
end heuristic (Gambs et al., 2010) and cluster-          months over an area of 80 000 km2 . They propose
ing algorithms, such as Density Joinable (Zhou et        a method composed of three phases: the first step
al., 2004), Density Time (Hariharan and Toyama,          is the graph construction, which relies on Delau-
2004) and Time Density clustering (Gambs et al.,         nay triangulation (Dobkin and Laszlo, 1987). The
2010) to detect POIs in San Francisco city.              triangulation algorithm creates connections (edges)
Another fine-grained data source used to detect          between nearby antennas (vertices), maximizing the
activity areas is geo-social networks.                   minimum angle of the triangles. Once the graph
The work of (Qu and Zhang, 2013) uses                    is built, all edges are weighted by the total activ-
Foursquare’s check-ins from 446 users during ten         ity and by the number of users of both connected
months for identifying trade areas. They rely on         antennas. The second phase is the computation of
four different techniques: center-of-mass location,      dense areas based on a maximum spanning tree
the most commonly checked-in location,                   built using the Kruskal algorithm (Kruskal, 1956).
the place with the highest check-in density and          Taking as input the weighted graph G, the idea
the center of mass of the most frequently visited        behind this algorithm is to find a subgraph of G,
location cluster. The algorithm used for                 which maximizes the density and does not
clustering is DBSCAN (Ester et al., 1996). Once they     contain any cycle. Finally, the post-processing phase
have identified the activity centers, they mark the      uses the algorithm of (Shiloach and Vishkin, 1982) to
boundary of the area using a drive-time/distance         establish groups of antennas representing dense
polygon (Kures and Pinkovitz, 2011). This                areas. Thus, the algorithm groups adjacent vertices
technique consists of computing the decay distance       from the previously computed subgraph to find a set
from a given store to home or work (the authors          of close vertices (a set of antennas). The authors
assume that the two most checked-in places are           validate their results empirically based on the subway
home and work). The drawback of this approach            structure of the region under study. Inspired by
is that the selected users are conditioned to check-     these works, we propose a novel ad-hoc method-
in in the store under study. Thus, the dataset is        ology to find activity (dense) zones as well as to
biased.                                                  model and forecast their activity levels using the
                                                         data provided by the Telecom Italia Big Data
Other works use coarse-grained location data from Call   Challenge (TIM challenge). We describe this dataset in
Data Records (CDR). For instance, the work of            the next section.
(Isaacman et al., 2011) uses Hartigan’s leader clus-
tering algorithm (Hartigan, 1975) to identify dense      3   DATASET
areas. First, authors sort antennas by the amount
of time that phones contact the antenna. Once data       Datasets provided by the TIM challenge were
is sorted, the clustering algorithm takes the first      collected in the cities of Milan and Trento over
antenna as the centroid of a cluster. Then, it veri-     November and December 2013. Our study only
fies if the next antenna is within a distance d from     takes into account the dataset gathered from Milan
the centroid. If it is not the case, the antenna be-     city. Nevertheless, all the analysis and the method-
comes the centroid of a new cluster. In the case         ology could be generalized for any city. In the next
the antenna is within the distance, the algorithm        subsections, we will describe the datasets provided
computes the new centroid as the weighted aver-          for Milan city. These datasets were used to detect
age. They repeat the process until all antennas be-      and model the activity zones (primary datasets)
long to a cluster. Researchers use CDR locations         and to find some correlation between activity and
of 97 and 71 thousand unique users in Los An-            other measures like air quality and weather condi-
geles and New York cities collected over 2 and a         tions (auxiliary datasets). It should be noted that
half months as well as 19 volunteers as the ground       space was discretized in a grid, and all measures
truth to validate their results. They were able to       are normalized to correspond to one square of the
estimate dense areas with an error of 3 miles com-       grid.
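
The leader clustering procedure attributed to (Isaacman et al., 2011) in
Section 2 can be sketched as follows. This is only an illustrative sketch
under stated assumptions, not the cited authors' implementation: the input
format (lat, lon, contact_time), the haversine distance, the choice of the
nearest centroid when several are within range, and the names
haversine_km and leader_clustering are all hypothetical. Contact times are
assumed to be strictly positive.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def leader_clustering(antennas, d_km):
    """Cluster antennas given as (lat, lon, contact_time) tuples.

    Antennas are visited by decreasing contact time. An antenna joins the
    nearest cluster whose centroid lies within d_km (assumption: nearest
    wins); otherwise it becomes the centroid of a new cluster.
    """
    clusters = []
    for lat, lon, weight in sorted(antennas, key=lambda a: a[2], reverse=True):
        best = None
        for cluster in clusters:
            dist = haversine_km((lat, lon), cluster["centroid"])
            if dist <= d_km and (best is None or dist < best[0]):
                best = (dist, cluster)
        if best is None:
            clusters.append({"centroid": (lat, lon), "weight": weight,
                             "members": [(lat, lon)]})
        else:
            cluster = best[1]
            total = cluster["weight"] + weight
            c_lat, c_lon = cluster["centroid"]
            # The new centroid is the contact-time-weighted average.
            cluster["centroid"] = ((c_lat * cluster["weight"] + lat * weight) / total,
                                   (c_lon * cluster["weight"] + lon * weight) / total)
            cluster["weight"] = total
            cluster["members"].append((lat, lon))
    return clusters
```

Averaging latitudes and longitudes directly is a simplification that is
reasonable only at city scale.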


3.1 Main datasets                                                      number of vehicles that belong to Milan, the
The datasets of Milan city we use to detect and                        number of vehicles that are not from Milan,
model the activity zones are:                                          the number of vehicles with engine ignition
                                                                       systems, in movement and stopped (c.f.
Milan Grid is a geographical segmentation over                         Table 3).
    the city to aggregate the measurements of the
    other datasets. The area of each square is                               Id               Time           Direction
    55 225 m2 , and the grid has 10 000 squares, each                        60           17/12/13 18:00      WEST
                                                                          Avg speed         Std speed        Mi plates
    given as a point (x, y) with the latitude and                            24                 95              21
    longitude of that (x, y) position. An                                Non-Mi plates       Ignition       Mov/Stopped
    example of the described data as well as the                             62                  2              2/0
    grid over Milan are introduced in Table 1 and
                                                                  Table 3: Example of private mobility data in Milan
    Figure 1, respectively.

             Point      Longitude    Latitude                    3.2   Auxiliary datasets
             x1 , y1     9.011        45.568                     We also used additional datasets to analyze activ-
                                                                 ity zones, dynamics, and correlations, as we show
      Table 1: Example of the Milan grid data
                                                                 in the following paragraphs.

                                                                 Telecommunications - MI to MI provides infor-
                                                                      mation regarding the directional interaction
                                                                      strength, between the city of Milan and dif-
                                                                      ferent areas based on the calls exchanged be-
                                                                      tween Telecom Italia Mobile users. More
                                                                      precisely, this dataset contains the origin and
                                                                      destination Id squares, the time and the direc-
                                                                      tional interaction strength i.e., Activity (c.f.
                                                                      Table 4)
                  Figure 1: Milan grid
                                                                           Id 1    Id 2       Time           Activity
Telecommunications (SMS, Call and Internet)                                 1       3      1383345474         0.24
     provides information about the activity of a                Table 4: Example of activity data between Milan zones
     square concerning received and sent SMS,
     incoming and outgoing calls as well as
                                                                 Precipitation describes the intensity and the
     internet usage. This data is temporally
                                                                     precipitation type over the city of Milan. In
     aggregated in timeslots of ten minutes and
                                                                     more detail, the dataset uses a coarse spa-
     provides the measure of the activity of a
                                                                     tial aggregation by dividing Milan city into
     given event as well as the square id (c.f. Table
                                                                     four quadrants (northeast, northwest, south-
     2). This kind of information is organized in a
                                                                     east and southwest). The intensity value of
     way that SMS-in and SMS-out activity scale
                                                                     the phenomenon is between 0 and 3, the per-
     are given in arbitrary units, and their values
                                                                     cent of coverage of a given quadrant and the
     range from 0 to 1.
                                                                     precipitation type between 0 and 2, where 0
          Id           Time         Country       SMS-in
                                                                     means absence of precipitation, 1 is rain and
           1        1383265200        39           0.24              2 is snow (c.f. Table 5).
        SMS-out       Call-in       Call-out       Data
         0.16          0.108         0.026         6.83
                                                                            Time          Id   Intensity   Coverage      Type
     Table 2: Example of activity data in Milan                         201311060220       1       1          45          1
                                                                        201311060220       2       0          0           0
                                                                        201311060220       3       0          0           0
Private Transportation (Cobra Telematics)                               201311060220       4       2          78          1
    gives information about the private mobility
    in Milan city by measuring the speed, the                    Table 5: Example of precipitation data in Milan


Air Quality describes the air pollution monitor-                4.1   Detecting activity levels
    ing system of Milan city obtained by us-                    The basic idea behind this method is to have a
    ing various types of sensors located within                 good representation of the activity variation levels
    the city limits. This environmental dataset                 over time. Activity levels can be classified
    measures different kinds of contamination                   into three degrees: low, medium and high.
    agents, such as Ammonia, Nitrogen Dioxide,                  Cumulative distribution of incoming/outgoing
    Total Nitrogen, Particulate Matter 2.5 µm                   SMS and call activity levels illustrated by a Heat
    (PM2.5), Particulate Matter 10 µm (PM10),                   map over Milan city, as shown in Figure 2, were
    Benzene, Sulphur Dioxide, Black Carbon,                     used to analyze data. The objective is to, empir-
    Carbon Monoxide and Ozone. An exam-                         ically, find a suitable threshold to distinguish a
    ple of pollution measure is given in Table                  square with high activity level from a square with
    6, where the characteristics of this particular             medium or low activity levels represented as green
    sensor are in Table 7.                                      and red in Figure 3, respectively.
       Sensor id          Time                  Measure
         5823        2013/12/30 04:00            1.9
      Table 6: Example of air quality measure


      Sensor id      Lat/Lon            Pollution
        5823        45.24/9.27      Carbon Monoxide
      Table 7: Description of the sensor 5823

Social Pulse contains data derived from an analy-
    sis of geolocalized tweets originated in Mi-
    lan. This dataset provides a user id, DB-
    Pedia entity, tweets language, municipality,
    time, timestamp and location (c.f. Table 8).
             User           Entities      Language
          5fa4b1cc71       Halloween         En
                                                                           Figure 2: Milan activity heat map
          Municipality     Timestamp      Lon, Lat
             Milan        1383260474     9.21, 45.49
                                                                Figure 3 depicts the cumulative distribution of
       Table 8: Example of Social Pulse data                    the aggregated incoming and outgoing SMS and
                                                                call activity of the telecommunications dataset (c.f.
Based on the aforementioned datasets, we have                   Subsection 3.1). The Heat map is built for an ac-
implemented our experiments using the main and                  tivity threshold of 25 units. Based on this visual-
the auxiliary datasets. These experiments are de-               ization technique, that amount of units seems to
tailed in sections 4 and 5, respectively.                       be a good trade-off between compact and well-
                                                                separated activity zones. Heat maps were used
4   EXPERIMENTS
                                                                to represent tourist activity, as shown by
In the present section, we describe our methodol-               Olteanu et al. (Olteanu et al., 2011).
ogy to discover, model and predict the behavior                    In order to detect groups of squares representing
of a zone. We distinguish two different areas, the              an activity zone, we can use a high activity thresh-
activity zone, where people stay on a regular ba-               old (c.f. Subsection 4.2). In addition, we study, in
sis for a significant amount of time and the transit            detail, the activity over work hours to analyze the
zone which is the area used by individuals to go                difference between busy and idle squares. Figure
from one activity zone to another. In the next sub-             4 shows the difference in activity levels between
sections, we describe how to recognize an activity              activity and transit zones. From 8 AM to 8 PM the
zone from a transit zone (Subsection 4.1), how to               activity is considerably high, that is why that area
model activity levels (Subsection 4.2) and finally              is composed of a vast number of squares during
how to predict them (Subsection 4.3).                           the day. On the other hand, transit zones display a


Figure 3: Cumulative distribution of the sms/call activity with
heat maps
                                                                     Figure 5: Vehicles speed heat map based on Cobra dataset

much lower activity level and a higher fluctuation
throughout the day.


Figure 4: Activity over time in workdays between 8AM-
8PM. Activity zone (left) and transit zone (right)

Taking into account the elements mentioned
earlier, we use the Heat map and activity threshold
presented in Figure 3 for detecting high activity
zones. Nevertheless, we need to define the borders
of these activity zones. From Figure 4, we can in-
fer that this irregularity of transit zone represents               Figure 6: Centroids of the activity zones in Milan city (blue),
movement. Thus, the Cobra dataset gathers the                       centroid of the clusters from Geo tweets (green) and com-
                                                                    modities issued from Foursquares check-ins shops (red) and
information about the movement of vehicles and                      restaurants (purple).
we depicted this information in a Heat map over
Milan city in Figure 5. The speed combined with
the activity level allowed us to detect the activity                geolocalized tweets dataset to verify the accuracy
zones, as well as their borders. As the result of                   of our methodology. We applied the DBSCAN (Ester
the combination of these two variables, we obtained                 et al., 1996) clustering algorithm with at least 5
28 activity areas and the centroids of these activity               points per cluster within a radius of 3 km over
zones, which are shown in Figure 6 in blue.                         8 282 users (i.e., 109 762 geolocalized tweets) and
Since we do not have the ground truth, we used the                  obtained 24 clusters depicted as green points in


Figure 6. Thus, some groups are close to the                    (transition probabilities). In our case, an
identified activity zones in the northeast and south. In        Activity Markov chain is a probabilistic automaton (PA)
the northwest, activity zones are represented by                model that represents, in a compact way, the
only one cluster due to the proximity of                        occupation (activity) of a square or activity zone. The
geolocalized tweets and the approach of DBSCAN to build         nodes symbolize the state (low, medium or high)
clusters. The Downtown area has many clusters                   of the squares or zones ([activity level] [zone
due to commodities concentration. To verify this                code], ex: L A) and edges, weighted with a prob-
fact, we have included two categories of check-ins              ability, represent the transition from one state to
from Foursquare, like shops (red) and restaurants               another over time windows. This model could be
(purple). One thing that surprised us was the                   expressed in the form of a graph or the form of a
detection of a large group of geolocated tweets in              transition matrix (c.f. Figure 9).
the southern outskirts of Milan's grid. Using these             The process for building an Activity Markov
data we found that the distance between the                     model is divided into two stages. The first one
centroids of the clusters and activity zones is                 is to order the events chronologically and to
only 0.5 km. These results validated the accuracy of our        classify them as low if they have
heuristic method to find activity zones. Finally, we            less than 15 activity units, as medium if activity
present the centroids of the detected activity zones            units are between 15 and 25, and as high if there
in Figure 7. These regions are used in the next                 are more than 25 activity units. Then the
subsection for prediction purposes.                             transition matrix is built by counting the variations from
                                                                one level to another, taking care to avoid loops.
                                                                When events are not recorded anymore, the matrix
                                                                is normalized to obtain the transition probabilities.
                                                                As shown in Figure 8, we have divided the time
                                                                into 4 different windows depending on the range
                                                                of time studied; each one lasts 6 hours, the first
                                                                starting at 6:00 am, for both weekdays and weekends,
                                                                giving a total of eight windows. Furthermore,
                                                                in each time window interactions between differ-
                                                                ent levels of activity are also modeled. Moreover,
                                                                we matched, after a model processing, an activity
                                                                zone of a time window with another (blue arrows).
                                                                We finally conclude, from the stationary vector of
                                                                the Markov chains, that activity areas are occupied
                                                                only 11% of the time and free 71% of the time. It is
                                                                important to point out that, to improve the accuracy
                                                                of the analysis, it would be better to divide the time
                                                                into intervals shorter than 6 hours.
    Figure 7: Centroids of the activity zones in Milan.         Until this point we showed that we are able to
                                                                identify high activity squares, activity and transit
Up to this point, we are able to identify activity              zones. In the next subsections, we detail how to
zones. Thus, the next tasks are to model the                    predict, in an unprecedented way and with a very
behavior of activity levels in the detected regions as          acceptable rate of success, not only activity levels
well as to predict the activity levels in the identified        but also their possible changes.
activity zones.
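
The two stages used to build an Activity Markov model (Subsection 4.2),
classification with the 15 and 25 unit thresholds followed by counting and
normalizing transitions, can be sketched as below. This is a minimal sketch
under assumptions and not the authors' code: the input is taken to be a
chronologically ordered list of activity values for one square or zone, time
windows are ignored for brevity, and the names classify, transition_matrix
and keep_self_loops are hypothetical. Since the text asks to avoid loops but
also discusses the probability of remaining in the same state, keeping or
dropping self-transitions is left as a parameter.

```python
from collections import Counter

LEVELS = ("L", "M", "H")  # low, medium, high


def classify(activity, low=15.0, high=25.0):
    """Map an activity value to a level using the thresholds of the text."""
    if activity < low:
        return "L"
    if activity <= high:
        return "M"
    return "H"


def transition_matrix(activities, keep_self_loops=True):
    """Count level changes between consecutive events and normalize each row.

    Returns a dict of dicts: matrix[src][dst] is the estimated probability
    of moving from level src to level dst. Rows without any observed
    transition are left at 0.0.
    """
    states = [classify(a) for a in activities]
    counts = Counter(zip(states, states[1:]))
    matrix = {}
    for src in LEVELS:
        row = {dst: counts[(src, dst)] for dst in LEVELS
               if keep_self_loops or dst != src}
        total = sum(row.values())
        matrix[src] = {dst: (n / total if total else 0.0)
                       for dst, n in row.items()}
    return matrix
```

For the eight time windows described in the text, one such matrix could be
estimated per window, or the window label could be added to the state.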

4.2 Modeling activity levels                                    4.3   Prediction of activity levels
Relying on the thresholds mentioned above, we                   Anticipating high activity levels within an “ac-
can build an Activity Markov chains model to rep-               tivity area” allows Telecoms operators to plan or
resent the change of activity levels over time. An              avoid unnecessary investment in infrastructure, as
Activity Markov chain model is a stochastic pro-                well as ensure the QoS or to start a new en-
cess where the changes of states are related to a               trepreneurship. In this paper, our prediction, in-
probability associated with various state changes               spired from the work of Gambs et al. (Gambs et


                     Figure 8: Example of an activity Markov chain (activity over time windows)


al., 2012), is performed using the transition matrix         inputs, the algorithm returns the maximal outgo-
and allows us to obtain changes of activities be-            ing probability from the transition matrix taking
tween temporary windows and within them. Then,               into account only columns corresponding to the
predictions of activity changes within the same              same time windows of the index row (local tran-
window and estimations can be made. For exam-                sition from line 1 to 4 of the Algorithm 1). For in-
ple, we wonder what is the probability of passing            stance, in Figure 8, if the actual state is medium on
from a low activity state to a high one or what is           the time windows from 18:00h to 0:00h on week-
the likelihood for a given activity to remain in the         days, the prediction algorithm will give an output
same state when we consider the next window and              in the high level in the same time window. Another
the same range of time. To answer these questions,           kind of prediction is to take into account other
we rely on the algorithm presented in Algorithm 1.           columns instead of those that belong to the same
                                                             time window (inter time windows transition from
  Algorithm 1: Prediction algorithm                          line 6 to 8 of the Algorithm 1). Given the low in
   Data: TransitionMatrix, indexRow,                         the time windows from 18h to 0h on weekdays in
          inTimeWindows                                      Figure 8, the algorithm will output the low state on
   Result: indexColumn                                       time windows from 0h to 6h on weekends. If
1 if inTimeWindows then                                      several outputs have the same probability,
2      //predict next state in the time windows              ties are broken randomly.
3      i=maxOutgoingProbOnWin(indexRow)
4      indexColumn = i
5 else
6      //predict the next state when the time
       window change
7      i=maxOutgoingProbOnNextWin(indexRow)
8      indexColumn = i
9   return indexColumn                                       Figure 9: Transition matrix example. Where H=high,
                                                             M=medium, L=low, W=Weekday, We=Weekend and be-
                                                             gin end hours. As a way of example we show two temporary
More precisely, Algorithm 1 takes as input a tran-           windows framed with purple numbers inside.
sition matrix (transitionMatrix). Where the index
of the row in the transition matrix corresponds to           To validate the accuracy of the predictions, we
the actual state of the system (indexRow) and a              used data from the whole month of November as
boolean value is used to indicate whether the pre-           training set and the first 16 days of December as
diction is local (inTimeWindows). Based on these             testing set (we did not take into account New Year


                                                       136
celebration to avoid special dates that separate our study from normal
behavior). The results are depicted in Figure 10, where the success rate
(Equation 1) is the ratio between the number of correct predictions and
the total number of predictions. Note that the number of predictions is
indicated in parentheses at the bottom part of each bar.

    \mathrm{success\ rate} = \frac{\#\mathrm{good\ predictions}}{\#\mathrm{predictions}} \qquad (1)

Figure 10: Success rate of prediction, where local refers to prediction
inside a time window and transition (trans) refers to prediction across
time windows. (#) denotes the number of predictions.

                Prediction   square   zone
                Local         0.86    0.7
                Trans.        0.86    0.89

      Table 9: Summary of the success rates shown in Figure 10

   We observe, from Table 9, the success rate for both kinds of
predictions, namely (1) within a time window (local) and (2) across
different time windows (trans.). It is important to note that there are
two distinct scenarios: the first one considers both local and trans.
predictions based on Activity Markov chain models which were built from
activity levels of squares, while the second scenario takes into account
activity levels from detected activity zones to forecast future values.
We did not perform k-fold cross validation since the training set of a
month is representative of the mobility pattern. Thus, adding more
mobility traces to the training set does not contribute to increasing the
success rate.
So far, we are able to model, identify and predict activity levels in
activity zones. In the next section, we are going to use the auxiliary
datasets to analyze the interaction between activity zones, to study
possible correlations with the activity levels, and to evaluate how the
weather impacts the use of the telecom service.

5     PLAYING WITH OTHER DATASETS
In the present section, we will study the interaction between detected
activity zones (Subsection 5.1); the correlation between activity levels
and pollution measures (Subsection 5.2); and the influence of the weather
on the activity levels in the telecom operator's infrastructure
(Subsection 5.3).

5.1   Interaction between activity zones
Using the directional interaction activity dataset between zones in the
area of Milan, we plotted a graph to visualize the communication
exchange, as well as the various activity levels, as shown in Figure 11,
where the width of the edges accounts for the logarithm of the aggregated
activity for the whole month of November. To extend the semantics of this
graph, we modulated the size of the nodes according to the amount of
tweets emitted from the corresponding zone, taking into account the
global pulses dataset (c.f. Subsection 3.2).

Figure 11: Interaction graph of activity zones.

We observe that the city of Milan has a star topology, where there is a
central node that communicates with the other peripheral nodes. Another
interesting fact is that small nodes tend to communicate with the central
node. Nevertheless, there are a few exchanges between small contiguous
nodes.
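
The construction of the interaction graph described in this subsection can be illustrated with a short Python sketch. It is not the authors' implementation: the record layout (source zone, target zone, activity), the zone names, the numeric values and the use of the natural logarithm are all assumptions made for the example, and the hub is simply taken to be the zone connected to the largest number of distinct zones.

```python
import math
from collections import defaultdict


def aggregate_interactions(records):
    """Sum the activity of every directed (source, target) pair of zones."""
    totals = defaultdict(float)
    for source, target, activity in records:
        if source != target:  # ignore traffic that stays inside one zone
            totals[(source, target)] += activity
    return dict(totals)


def edge_widths(totals):
    """Edge width as the logarithm of the aggregated activity.

    The base of the logarithm is an assumption (natural logarithm here).
    """
    return {edge: math.log(value) for edge, value in totals.items() if value > 0}


def hub(totals):
    """Zone that exchanges traffic with the largest number of distinct zones."""
    neighbours = defaultdict(set)
    for source, target in totals:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return max(neighbours, key=lambda zone: len(neighbours[zone]))


# Hypothetical records: (source zone, target zone, activity).
records = [
    ("centre", "north", 120.0),
    ("north", "centre", 80.0),
    ("centre", "south", 60.0),
    ("east", "centre", 30.0),
    ("centre", "north", 30.0),
    ("north", "east", 2.0),
]

totals = aggregate_interactions(records)
widths = edge_widths(totals)
central_zone = hub(totals)
```

In a star topology such as the one reported for Milan, the zone returned by `hub` would be the central node, and the few edges between peripheral zones would have comparatively small widths.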

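The prediction procedure of Algorithm 1 and the success rate of Equation 1, on which the results above rely, can also be made concrete with a minimal Python sketch. Several points are assumptions rather than details taken from the paper: a state is represented as a pair (level, time window), the transition matrix is a dictionary of rows, and the two helper functions of Algorithm 1 are interpreted as "most probable column inside the current window" and "most probable column outside it". The window labels and probabilities are invented for the example.

```python
import random


def predict(transition_matrix, index_row, in_time_windows, rng=random):
    """Sketch of Algorithm 1: return the most probable next state.

    States are (level, window) pairs; this encoding is an assumption.
    """
    _, current_window = index_row
    row = transition_matrix[index_row]
    # Local prediction keeps the columns of the current window;
    # transition prediction keeps the columns of the other windows.
    candidates = {
        state: prob
        for state, prob in row.items()
        if (state[1] == current_window) == in_time_windows
    }
    if not candidates:
        return None
    best = max(candidates.values())
    # Ties are broken randomly, as stated in the text.
    return rng.choice([s for s, p in candidates.items() if p == best])


def success_rate(predicted, observed):
    """Equation 1: number of good predictions over number of predictions."""
    if not predicted:
        raise ValueError("at least one prediction is required")
    good = sum(1 for p, o in zip(predicted, observed) if p == o)
    return good / len(predicted)


# Hypothetical row of a transition matrix for the state (high, weekday 6h-12h).
matrix = {
    ("H", "W 6-12"): {
        ("H", "W 6-12"): 0.6,
        ("M", "W 6-12"): 0.1,
        ("M", "W 12-18"): 0.2,
        ("L", "W 12-18"): 0.1,
    },
}

local_prediction = predict(matrix, ("H", "W 6-12"), True)
transition_prediction = predict(matrix, ("H", "W 6-12"), False)
rate = success_rate(["H", "M", "L", "H"], ["H", "M", "M", "H"])
```

Passing a seeded `random.Random` instance as `rng` would make the tie-breaking reproducible, which may be convenient when comparing success rates between runs.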
5.2 Forecasting pollution through activity
In this subsection, we study the correlation between the activity level
in the telecommunication dataset and the air quality measurements (both
described in Subsection 3.2), in order to forecast the pollution level of
the city of Milan from the activity of the telecom operator's antennas.
Figure 12 shows the correlation of the activity with different polluting
gases, as well as with the number of vehicles in movement, in ignition or
stopped.
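
As an illustration of the kind of computation involved, the sketch below computes a correlation coefficient between an activity series and two measurement series. The paper does not state which coefficient was used; Pearson's coefficient is assumed here, and all the numbers are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)


# Hypothetical aligned series, one value per time slot.
activity = [10, 20, 30, 40, 50]
pm10 = [12, 19, 33, 38, 52]
humidity = [90, 80, 70, 60, 50]

r_pm10 = pearson(activity, pm10)          # positive in this toy example
r_humidity = pearson(activity, humidity)  # negative in this toy example
```

A positive coefficient means that the measurement tends to rise with the activity level, and a negative one that it tends to fall; neither implies a causal link.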


Figure 12: Correlation of the activity with pollution measures and
private mobility (Cobra).

We found out that the activity has a positive correlation with the PM10
(particulate matter with a diameter of 10 microns or less) and PM2.5
(fine particles of 2.5 micrometers in diameter or less) pollution
measures. From Figure 13 we can see that the activity levels have a
positive correlation with the radiation measures and a negative
correlation with the relative humidity.

Figure 13: Correlation of the activity of a square with radiation
sensors.

5.3 Influence of weather on the activity
We compare the outgoing SMS and call activity under different weather
scenarios, such as rain, snow or the absence of both. For this purpose,
we used the Precipitation dataset (c.f. Subsection 3.2).
From Figure 14, we can observe that people send more SMS and make more
calls in rainy weather, even if the rain is slight (blue, red, yellow and
green bars), than in normal conditions (dark, red and light blue bars).
Nevertheless, people tend to call less or send fewer SMS when it is
snowing (dark and light green bars).

Figure 14: Activity level of outgoing SMS and call.

6    CONCLUSION AND FUTURE DIRECTIONS
Our purpose in this research is to understand the activity of the
telecommunication network by analyzing several aspects of Milan’s phone
traffic flows. We were interested in the definition and morphology of the
activity and transit zones; the prediction of activity levels over
different regions with a success rate between 80% and 90%; the
interactions between the different activity zones; and the influence of
the weather and the pollution on that activity. Thus, our results offer a
new way of looking at the telecommunication traffic data by examining the
various connections between apparently uncorrelated datasets, providing
insights to manage and to optimize the whole network. For business
opportunities, this means (1) new geomarketing opportunities through a
better understanding of users' communication patterns, (2) new trade area
analysis, (3) cheaper network load balancing as well as (4) improved QoS.
In the future, we would like to study and analyze in detail the opinions
(sentiment analysis) expressed by users and generators of tweets, and to
identify the geolocation of these activity areas. Another line of
investigation would be to include coarse mobility Call Data Records (CDR)
as an additional element for detecting activity and traffic areas.

References
D. P. Dobkin and M. J. Laszlo. 1987. Primitives for the manipulation of
   three-dimensional subdivisions. In Proceedings of the Third Annual
   Symposium on Computational Geometry, pages 86–99, New York, NY, USA.

Nathan Eagle, Alex S. Pentland, and David Lazer. 2013. Inferring social
   network structure using mobile phone data. Proceedings of the National
   Academy of Sciences, 106(36):15274–15278, September.

Martin Ester, Hans-Peter Kriegel, Joerg Sander, and Xiaowei Xu. 1996. A
   density-based algorithm for discovering clusters in large spatial
   databases with noise. In Second International Conference on Knowledge
   Discovery and Data Mining, pages 226–231.

M. Ficek and L. Kencl. 2012. Inter-call mobility model: A
   spatio-temporal refinement of call data records using a Gaussian
   mixture model. In Conference on Computer Communications, pages
   469–477, March.

S. Gambs, M.-O. Killijian, and M. Núñez del Prado Cortez. 2010. GEPETO:
   A geoprivacy-enhancing toolkit. In Advanced Information Networking and
   Applications Workshops (WAINA), 2010 IEEE 24th International
   Conference on, pages 1071–1076, April.

Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado
   Cortez. 2011. Show me how you move and I will tell you who you are.
   Transactions on Data Privacy, 4(2):103–126, August.

Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado
   Cortez. 2012. Next place prediction using mobility Markov chains. In
   Proceedings of the First Workshop on Measurement, Privacy, and
   Mobility, page 3. ACM.

Ramaswamy Hariharan and Kentaro Toyama. 2004. Project Lachesis: parsing
   and modeling location histories. In Geographic Information Science,
   pages 106–124.

John A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, Inc.,
   New York, NY, USA, 99th edition.

Sibren Isaacman, Richard Becker, Ramón Cáceres, Stephen Kobourov,
   Margaret Martonosi, James Rowland, and Alexander Varshavsky. 2011.
   Identifying important places in people's lives from cellular network
   data. In Pervasive Computing, June.

J. B. Kruskal. 1956. On the Shortest Spanning Subtree of a Graph and the
   Traveling Salesman Problem. In Proceedings of the American
   Mathematical Society, 7.

M. Kures and Ryan Pinkovitz. 2011. Downtown and business district market
   analysis. http://fyi.uwex.edu/downtown-market-analysis/.

Antonio Lima, Manlio De Domenico, Veljko Pejovic, and Mirco Musolesi.
   2013. Exploiting cellular data for disease containment and information
   campaigns strategies in countrywide epidemics. Computing Research
   Repository, abs/1306(4534):1–7, Jun.

J. Morales, W. Creixell, J. Borondo, J. C. Losada, and R. M. Benito.
   2013. Understanding ethnical interactions on Ivory Coast. In Data for
   Development (D4D) challenge, pages 115–120, May.

Ana-Maria Olteanu, Roberto Trasarti, T. Couronné, Fosca Giannotti, Mirco
   Nanni, Zbigniew Smoreda, and Cezary Ziemlicki. 2011. GSM data analysis
   for tourism application. In Proceedings of the 7th International
   Symposium on Spatial Data Quality (ISSDQ).

Michal Piorkowski, Natasa Sarafijanovic-Djukic, and Matthias
   Grossglauser. 2009. CRAWDAD data set epfl/mobility (v. 2009-02-24).
   Downloaded from http://crawdad.org/epfl/mobility/, February.

United Nations Global Pulse. 2013. Mobile phone network data for
   development. http://www.unglobalpulse.org/research, October.

Yan Qu and Jun Zhang. 2013. Trade area analysis using user generated
   mobile location data. In International Conference on World Wide Web,
   pages 1053–1064.

Yossi Shiloach and Uzi Vishkin. 1982. An O(log n) parallel connectivity
   algorithm. Journal of Algorithms, 3(1):57–67.

Jameson L. Toole, Michael Ulm, Marta C. González, and Dietmar Bauer.
   2012. Inferring land use from mobile phone activity. In Proceedings of
   the ACM SIGKDD International Workshop on Urban Computing, pages 1–8,
   New York, NY, USA.

M. R. Vieira, V. Frias-Martinez, N. Oliver, and E. Frias-Martinez. 2010.
   Characterizing dense urban areas from mobile phone-call data:
   Discovery and social dynamics. In Second International Conference on
   Social Computing, pages 241–248, August.

Changqing Zhou, Dan Frankowski, Pamela Ludford, Shashi Shekhar, and Loren
   Terveen. 2004. Discovering personal gazetteers: An interactive
   clustering approach. In International Workshop on Geographic
   Information Systems, pages 266–273, November.

</pre>