=Paper= {{Paper |id=Vol-1810/BIGQP_paper_05 |storemode=property |title=The Effect of Weather on User-Generated Big Geo Data in Mobile Phone Networks |pdfUrl=https://ceur-ws.org/Vol-1810/BIGQP_paper_05.pdf |volume=Vol-1810 |authors=Carolina Arias Muñoz,Maria Antonia Brovelli |dblpUrl=https://dblp.org/rec/conf/edbt/MunozB17 }} ==The Effect of Weather on User-Generated Big Geo Data in Mobile Phone Networks== https://ceur-ws.org/Vol-1810/BIGQP_paper_05.pdf
    The effect of weather on user-generated big geo data in
                    mobile phone networks

                    Carolina Arias Muñoz                                                 Maria Antonia Brovelli
                     Politecnico di Milano                                                 Politecnico di Milano
              Via Valleggio 11, 22100 Como Italy                                    Via Valleggio 11, 22100 Como Italy
                  carolina.arias@polimi.it                                              maria.brovelli@polimi.it


ABSTRACT
User-generated data in mobile networks are commonly used as proxies for human activity and
mobility, and they can be applied to a wide variety of research problems including mobility, city
planning, tourism, event detection, urban well-being and many others. Few studies have explored
the relationship between environmental variables and mobile usage data. This work tests this
possible relationship in the city of Milan, as a first approach towards a predictive model for
weather conditions/environmental stress from mobile usage data. The data used correspond to two
months (November and December 2013) of Call Detail Records (CDRs) of sent and received SMS,
incoming and outgoing calls, and internet traffic, as well as precipitation data from the
meteorological network. According to the Kruskal-Wallis test results, we can conclude that at the
95% confidence level the null hypothesis of equality of medians can be rejected: there is a
significant relationship between telecommunications data and precipitation intensity levels.

CCS Concepts
• Applied computing • Applied computing~Internet telephony • Applied computing~Mathematics and
statistics • Information systems~Open source software

Keywords
Mobile phone data; spatial data mining; urban planning; Big Geo Data; Kruskal-Wallis; Extraction
of spatial relations in Big Data.

1. INTRODUCTION
User-generated data in mobile networks are commonly used as proxies for human activity and
mobility, and they can be applied to a wide variety of research problems including mobility, city
planning, tourism, event detection, urban well-being and many others. Few studies have explored
the relationship between environmental variables and mobile usage data. Two studies are worth
mentioning. The first important work is [1], where the authors explored the influence of weather
on mobile phone usage and (indirectly) human behavior. They used factor analysis to reduce
dimensionality and redundancy in some meteorological variables, and then spectral analysis to
unveil significant periodical components in the time series of the remaining factors (output of
the factor analysis) and mobile telecom traffic intensity. Later, [2] reproduced the experiment of
[1], both to validate the datasets they had available and to confirm the authors' results. Their
focus was on creating a predictive model of weather conditions and on evaluating an autoregressive
integrated moving average (ARIMA) model and its forecast.

In both studies, the correlation between temperature (or variables related to thermal perception)
and telecommunications data is high, but the results related to precipitation do not match,
probably due to a series of factors including the data quality and the different human behavior in
the two regions used as case studies.

Taking into account these previous works, we wanted to test whether there is any relationship
specifically between precipitation and mobile outgoing calls in the city of Milan, as a first
approach towards a predictive model for weather conditions/environmental stress from mobile usage
data. In this work we consider:
• The use of only Free and Open Source tools.
• The use of the Kruskal-Wallis test instead of factor analysis.

2. DATA AND METHODS
This research is based on two types of data, namely the user-generated traffic in mobile networks
and precipitation data. The latter is considered, for our purposes, an indicator of the weather
conditions.

2.1 User-generated big geo data in mobile phone networks
Telecom Italia, together with SpazioDati, MIT Media Lab, EIT ICT Labs, Polytechnic University of
Milan, Northeastern University, University of Trento, Fondazione Bruno Kessler and Trento RISE,
has been organizing the Telecom Italia Big Data Challenge, providing various geo-referenced and
anonymized datasets. For the 2014 edition, they provided data for two Italian areas: the city of
Milan and the Province of Trentino [3]. These data are available¹ to the public under the Open
Database License (ODbL).
From all the Telecom open data available, the data used correspond to two months (November and
December 2013) of mobile telecommunications in the city of Milan. The datasets are user-generated
telecommunication traffic, resulting from a computation over the Call Detail Records (CDRs) of
sent and received SMS, incoming and outgoing calls, and Internet traffic.

1 https://dandelion.eu/datamine/open-big-data/

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017
Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of
this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0
All datasets have a temporal aggregation of ten minutes, for a total of 14,877,485 records. Data
are provided in a series of CSV files, each containing one day of records, organized according to
the following schema:
• Square id: the id of the square that is part of the Milano grid (see Figure 1).
• Time interval: the beginning of the time interval.
• Country code: the phone country code of a nation.
• Mobile network activity: the activity (a number) inside the Square id, in terms of outgoing and
  received SMS/calls and internet connection (one column for each variable) during the Time
  interval and sent from the nation identified by the Country code.
The CDRs provided by Telecom Italia are not the real records; they are values proportional to the
actual records, in order to provide anonymized data. To understand how these values were
calculated, please refer to [3].

2.1.1 Spatial distribution of the mobile network data
Since the data come from different companies that have adopted different standards, their
irregular spatial distribution is aggregated onto a square grid (100 columns by 100 rows) covering
the city of Milan, with a cell size of 235 meters, in the WGS84 reference system (EPSG:4326). The
Milano grid is available in GeoJSON format (see Figure 1).

Figure 1. Representation of the Milano grid, with d = 235 meters

In the CSV files, each record is related to a specific cell id of the Milano grid, so that each
record can be referenced to a grid cell. Once the data are represented spatially, they can be seen
as a series of raster maps, one for each time stamp (i.e. 144 raster maps per day for each
variable).

Figure 2. Mean values (left) and maximum values (right) of outgoing calls in the city of Milan,
November-December 2013

Figure 2 summarizes the spatial behavior of the data. Mean and maximum values cluster in areas
characterized by a high density of buildings, population and activities, such as new residential
centers, airports, industrial zones, etc.

2.1.2 Temporal distribution of the mobile network data
Figure 3 gives a clear view of the temporal behavior of the data. These are radial box plots,
where data are summarized by time stamp, meaning that the value of each hour is the sum of all
grid values for a specific time stamp.

Figure 3. Radial box plot of incoming calls in the city of Milan. a) 11/29/13 and 12/05/13 (normal
week). b) 12/25/13 and 12/31/13 (last week of the year, New Year's Eve).

The left plot shows data from December's first week, which is representative of the rest of the
weeks between November and December 2013. It shows a strong daily seasonality of incoming calls
corresponding to working and non-working hours; the same behavior is observed for the rest of the
variables (outgoing calls, incoming and outgoing SMS, Internet connection), indicating a temporal
pattern in human behavior. Likewise, there is a weekly seasonality between working days and
weekends, with Sunday being the day with the least activity. The right plot shows data from
December's last week, where peaks for December 25th and 31st can be seen clearly.

2.2 Precipitation data
Precipitation maps for the city of Milan between November and December 2013 come from ARPA²,
Lombardy's regional agency for environmental protection. An Optimal Interpolation (OI) method,
explained in [5], was used by [4] to interpolate the hourly-cumulated precipitation onto a regular
grid of 1.5 km, using data from ARPA Lombardia's mesoscale meteorological network. Figure 4 shows
the maximum values during November and December 2013, where high values are concentrated outside
the city of Milan, especially to the southeast of the city.

2 http://ita.arpalombardia.it/ita/index.asp
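
As an illustration of the raster view described in Section 2.1.1, the following minimal sketch
arranges the records of one variable into one 100 x 100 array per time stamp. It is not one of the
scripts used in this work. The column names (square_id, time_interval, country_code, activity) are
hypothetical, and the sketch assumes that square ids run from 1 to 10,000 row by row; the actual
file layout and id numbering should be checked against [3].

```python
import numpy as np
import pandas as pd

N_ROWS, N_COLS = 100, 100          # Milano grid size stated in the text
SLOTS_PER_DAY = 24 * 60 // 10      # ten-minute aggregation -> 144 slots per day


def to_rasters(records: pd.DataFrame) -> dict:
    """Return {time stamp: 100x100 array} for one activity variable.

    Column names and the row-by-row id numbering (1..10000) are assumptions.
    """
    rasters = {}
    for stamp, group in records.groupby("time_interval"):
        # Sum over country codes so each cell has a single value per stamp.
        per_cell = group.groupby("square_id")["activity"].sum()
        flat = np.zeros(N_ROWS * N_COLS)
        flat[per_cell.index.to_numpy() - 1] = per_cell.to_numpy()
        rasters[stamp] = flat.reshape(N_ROWS, N_COLS)
    return rasters


# Tiny synthetic example: two time stamps, cell 1 seen from two countries.
demo = pd.DataFrame({
    "square_id": [1, 1, 10000, 101],
    "time_interval": [0, 0, 0, 600],
    "country_code": [39, 44, 39, 39],
    "activity": [0.5, 0.25, 2.0, 1.0],
})
demo_rasters = to_rasters(demo)
```

Cells with no record for a given time stamp are left at zero here; whether a missing record should
be read as zero activity or as missing data is a choice this sketch does not settle.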
Figure 4. Milan's precipitation map of maximum values (mm), November-December 2013

In Lombardy, the meteorological autumn and winter of 2013 were very mild and rainy, with
exceptionally heavy snowfall [6]. The cause was a succession of disturbed situations,
characterized by currents coming from southern quadrants, which favored an almost continuous
supply of warm, moist air, in particular from the end of December onwards.

Figure 5. Milano grid precipitation map of maximum values (mm), November-December 2013

Figure 5 shows the rain distribution within the Milano grid area, summarized by time stamp as in
the box plots. The different colors mark the thresholds of precipitation intensity³. There are two
main peaks, on November 16th and December 27th.

2.3 Methods of analysis
The Kruskal-Wallis test is the non-parametric counterpart to the analysis of variance (ANOVA)
test. It allows samples of the same variable to be compared by the difference in their medians
[7]. To detect any differences in telecommunications activity between the levels of precipitation
intensity (high, moderate, slight and no rain), a Kruskal-Wallis test was used in each cell
(square) of the Milano grid: the analysis was made for each location and all time stamps (see
Figure 6).

Figure 6. Schema of the map arrangement used to perform the Kruskal-Wallis test.

This test was used because assumptions of normality could not be confidently made for each of the
cells of the Milano grid. In this analysis, the response variable is the telecommunications
activity, spatially distributed within the city of Milan, in zones in which the precipitation can
be high, moderate or slight (factor levels) or not be present at all, and its behavior can differ
depending on whether it is a weekday or a weekend. The Kruskal-Wallis test was used to test the
null hypothesis that the precipitation intensity levels have equal population medians. We want to
know whether people's telecommunications activity changes: e.g., is the median of people's
outgoing calls significantly different on days of heavy rain?

2.4 Data processing
Precipitation maps and the telecommunications activity records were analyzed on a common
space-time basis using Python with Spyder (Scientific PYthon Development EnviRonment)⁴ and the
scipy.stats⁵ and pandas⁶ libraries. For data visualization, QGIS⁷ and the d3.js⁸, rickshaw.js⁹
and rbox.js¹⁰ JavaScript libraries were used.
The original files were imported into a MongoDB NoSQL database¹¹. Different Python scripts were
used for data processing:
• Asciitotable.py: conversion of precipitation maps into data frame data structures.
• exploration.py: basic data statistics, distribution fitting and histogram analysis.
• pre_kw.py: sum of values ignoring country code, aggregation of data by hour, join of telecom and
  precipitation data, creation of precipitation intensity and day-of-the-week categories.
• kruskal.py: Kruskal-Wallis calculation by location.
All scripts used can be found on GitHub: https://github.com/carolinarias/Kruskal-Wallis-Spatial.git
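
The preparation steps listed for pre_kw.py can be sketched with pandas as follows. This is a
minimal illustration under stated assumptions, not the code in the repository: the column names
(square_id, time, calls_out, hour, precip_mm), the toy values and, in particular, the intensity
thresholds are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical thresholds (mm per hour); the classes actually used follow footnote 3.
BINS = [-float("inf"), 0.0, 2.0, 6.0, float("inf")]
LABELS = ["no rain", "slight", "moderate", "high"]


def prepare(telecom: pd.DataFrame, rain: pd.DataFrame) -> pd.DataFrame:
    """Sum over country codes, aggregate to hours, join rain, add categories."""
    # Ten-minute records are assigned to the hour they fall in.
    telecom = telecom.assign(hour=telecom["time"].dt.floor("60min"))
    # Grouping by cell and hour ignores the country code and sums its rows.
    hourly = telecom.groupby(["square_id", "hour"], as_index=False)["calls_out"].sum()
    merged = hourly.merge(rain, on=["square_id", "hour"], how="inner")
    merged["intensity"] = pd.cut(merged["precip_mm"], bins=BINS, labels=LABELS)
    merged["day_type"] = merged["hour"].dt.dayofweek.map(
        lambda d: "weekend" if d >= 5 else "weekday"
    )
    return merged


# Toy input: one cell, two hours of a Saturday, two country codes.
telecom = pd.DataFrame({
    "square_id": [1, 1, 1, 1],
    "time": pd.to_datetime([
        "2013-11-16 10:00", "2013-11-16 10:10",
        "2013-11-16 10:10", "2013-11-16 11:00",
    ]),
    "country_code": [39, 39, 44, 39],
    "calls_out": [1.0, 2.0, 0.5, 4.0],
})
rain = pd.DataFrame({
    "square_id": [1, 1],
    "hour": pd.to_datetime(["2013-11-16 10:00", "2013-11-16 11:00"]),
    "precip_mm": [0.0, 2.5],
})
table = prepare(telecom, rain)
```

The join assumes that the precipitation values have already been resampled from the 1.5 km grid to
the Milano grid cells, which is a separate step not shown here.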


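The per-cell test described in Section 2.3 can be sketched with scipy.stats.kruskal as follows.
This is an illustrative sketch rather than the kruskal.py script itself. It assumes a table with
hypothetical columns square_id, intensity and calls_out, drops intensity levels that do not have
more than five observations in a cell, and marks a cell as nan when fewer than two levels remain
or when all remaining values are identical, since the test cannot be computed in those cases.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

ALPHA = 0.05      # significance level used in the text
MIN_OBS = 5       # a level needs more than five observations to be kept


def kruskal_by_cell(table: pd.DataFrame, value: str = "calls_out") -> pd.Series:
    """Return 1.0 (medians differ), 0.0 (no evidence) or nan per grid cell."""
    flags = {}
    for cell, group in table.groupby("square_id"):
        samples = [
            g[value].to_numpy()
            for _, g in group.groupby("intensity")
            if len(g) > MIN_OBS
        ]
        if len(samples) < 2 or np.ptp(np.concatenate(samples)) == 0:
            flags[cell] = np.nan      # test not applicable in this cell
            continue
        p_value = kruskal(*samples).pvalue
        flags[cell] = float(p_value <= ALPHA)
    return pd.Series(flags, name="kw_significant")


# Synthetic data: cell 1 has clearly separated levels, cell 2 is constant.
demo = pd.DataFrame({
    "square_id": [1] * 20 + [2] * 12,
    "intensity": ["no rain"] * 6 + ["slight"] * 6 + ["moderate"] * 6 + ["high"] * 2
                 + ["no rain"] * 6 + ["moderate"] * 6,
    "calls_out": list(range(10, 16)) + list(range(1, 7)) + list(range(20, 26))
                 + [30, 31] + [2.5] * 12,
})
result = kruskal_by_cell(demo)
```

In the synthetic example the two "high" observations of cell 1 are discarded before the test,
mirroring the exclusion of heavy rain when too few measurements are available.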
3 http://www.arpa.piemonte.gov.it/rischinaturali/tematismi/meteo/osservazioni/radar/intensita-precipitazione.html?delta=0
4 https://pythonhosted.org/spyder/
5 https://docs.scipy.org/doc/scipy/reference/stats.html
6 http://pandas.pydata.org/
7 http://www.qgis.org/it/site/
8 https://d3js.org/
9 http://code.shutterstock.com/rickshaw/
10 https://bl.ocks.org/davidwclin/ad5d13db260caeffe9b3
11 www.mongodb.com

3. RESULTS AND DISCUSSION
We calculated a series of Kruskal-Wallis maps for each of the telecommunications variables
(incoming and outgoing SMS/calls). High (heavy) rain was not considered because the number of
measurements was less than five in the majority of the cells; for the Kruskal-Wallis test to work,
the samples must have more than five observations. Figure 7 shows an example for outgoing calls
(which is also representative of the results for the other variables), where the value 1 indicates
that the test passed: the null hypothesis being that the population medians are all equal, a
P-value ≤ α (0.05 in our case) means that the differences between the medians are statistically
significant. The Kruskal-Wallis test reveals that the medians of the telecommunication activity
are significantly different across the precipitation intensity levels: moderate, slight and no
rain.

Figure 7. Kruskal-Wallis map for outgoing calls, November-December 2013

The test also identified areas where certain levels of precipitation are common (i.e. areas with
the same value of moderate precipitation = 2.5 mm), identified on the map as nan.

4. CONCLUSIONS AND FUTURE WORK
The results discussed above are a promising step towards a holistic understanding of the complex
relationship between environmental and social dynamics, and a starting point for further smart
cities and human geography analysis.
According to the Kruskal-Wallis test results, we can conclude that at the 95% confidence level the
null hypothesis of equality of medians can be rejected: there is a significant relationship
between telecommunications data and precipitation intensity levels.
A subsequent analysis would try to identify the causality of the relationship between
precipitation and telecommunications activity: e.g., how strong/weak is the relationship? Is there
any primary process or feature which may have a spatial and/or temporal component?
We will continue this research taking into account:
• Citizen-generated geographic content vs. official sensor data.
• Testing not only precipitation but also other weather variables such as temperature, solar
  radiation, wind direction, etc.
• Using a larger data sample (e.g. a one-year time series of data).
• Integrating additional data (e.g. traffic data, census data, historical social media data, etc.).
We hope to explore further the hypothesis of predicting weather conditions/environmental stress
with the help of mobile data.

5. REFERENCES
[1] Sagl, G., Beinat, E., Resch, B., & Blaschke, T. (2011, June). Integrated geo-sensing: A case
    study on the relationships between weather and mobile phone usage in northern Italy. In
    Spatial Data Mining and Geographical Knowledge Services (ICSDM), 2011 IEEE International
    Conference on (pp. 208-213). IEEE.
[2] Craveiro, P., Ramos, F.M.V., Kanjo, E., Mawass, N.E., 2013. Towards an early warning system:
    the effect of weather on mobile phone usage. A case study in Abidjan, 1–11.
[3] Barlacchi, G., De Nadai, M., Larcher, R., Casella, A., Chitic, C., Torrisi, G., Antonelli, F.,
    Vespignani, A., Pentland, A., Lepri, B., 2015. A multi-source dataset of urban life in the
    city of Milan and the Province of Trentino. Sci. Data 2, 150055.
    DOI= 10.1038/sdata.2015.55.
[4] Lussana, C., Salvati, M.R., Pellegrini, U., Uboldi, F., 2009. Efficient high-resolution 3-D
    interpolation of meteorological variables for operational use. Adv. Sci. Res. 3, 105–112.
[5] Uboldi, F., Lussana, C., Salvati, M., 2008. Three-dimensional spatial interpolation of surface
    meteorological observations from high-resolution local networks. Meteorol. Appl. 15, 331–345.
[6] ARPA Agenzia Regionale per la Protezione dell'Ambiente, 2015. Sintesi Meteoclimatica Inverno
    2013/2014.
[7] Wheeler, D., Shaw, G., Barr, S., 2013. Statistical Techniques in Geographical Analysis, Third
    Edition. Taylor & Francis.