=Paper=
{{Paper
|id=Vol-1810/BIGQP_paper_05
|storemode=property
|title=The Effect of Weather on User-Generated Big Geo Data in Mobile Phone Networks
|pdfUrl=https://ceur-ws.org/Vol-1810/BIGQP_paper_05.pdf
|volume=Vol-1810
|authors=Carolina Arias Muñoz,Maria Antonia Brovelli
|dblpUrl=https://dblp.org/rec/conf/edbt/MunozB17
}}
==The Effect of Weather on User-Generated Big Geo Data in Mobile Phone Networks==
Carolina Arias Muñoz, Politecnico di Milano, Via Valleggio 11, 22100 Como, Italy, carolina.arias@polimi.it

Maria Antonia Brovelli, Politecnico di Milano, Via Valleggio 11, 22100 Como, Italy, maria.brovelli@polimi.it

===Abstract===
User-generated data in mobile networks are commonly used as proxies for human activity and mobility, and they support a wide variety of research problems including mobility, city planning, tourism, event detection, urban well-being and many others. Few studies have explored the relationship between environmental variables and mobile usage data. This work tests this possible relationship in the city of Milan, as a first approach towards a predictive model for weather conditions/environmental stress from mobile usage data. The data used correspond to two months (November and December 2013) of Call Detail Records (CDRs) of sent and received SMS, incoming and outgoing calls, and Internet traffic, together with the precipitation data from the meteorological network. According to the Kruskal-Wallis test results, the null hypothesis of equality of medians can be rejected at the 95% confidence level: there is a significant relationship between telecommunications data and precipitation intensity levels.

===CCS Concepts===
• Applied computing • Applied computing~Internet telephony • Applied computing~Mathematics and statistics • Information systems~Open source software

===Keywords===
Mobile phone data; spatial data mining; urban planning; Big Geo Data; Kruskal-Wallis; Extraction of spatial relations in Big Data.

2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

===1. Introduction===
User-generated data in mobile networks are commonly used as proxies for human activity and mobility, and they support a wide variety of research problems including mobility, city planning, tourism, event detection, urban well-being and many others. Few studies have explored the relationship between environmental variables and mobile usage data.

Two studies are worth mentioning. The first important work is [1], where the authors explored the influence of weather on mobile phone usage and (indirectly) human behavior. They used factor analysis to reduce dimensionality and redundancy in some meteorological variables, and then spectral analysis to unveil significant periodical components in the time series of the remaining factors (output of the factor analysis) and of mobile telecom traffic intensity. Later, [2] reproduced the experiment of [1], both to validate the datasets they had available and to confirm the authors' results. Their focus was on creating a predictive model of weather conditions and on evaluating an autoregressive integrated moving average (ARIMA) model and its forecast.

In both studies, the correlation between temperature (or variables related to thermal perception) and telecommunications data is high, but the results related to precipitation do not match, probably because of a series of factors including data quality and the different human behavior in the two regions used as case studies.

Taking these previous works into account, we wanted to test whether there is any relationship specifically between precipitation and mobile outgoing calls in the city of Milan, as a first approach towards a predictive model for weather conditions/environmental stress from mobile usage data. In this work we consider:

* The use of only Free and Open Source tools.
* The use of the Kruskal-Wallis test instead of factor analysis.

===2. Data and Methods===
This research is based on two types of data, namely the user-generated traffic in mobile networks and precipitation data. The latter is considered, for our purposes, an indicator of the weather conditions.

====2.1 User-generated big geo data in mobile phone networks====
Telecom Italia, together with SpazioDati, MIT Media Lab, EIT ICT Labs, Polytechnic University of Milan, Northeastern University, University of Trento, Fondazione Bruno Kessler and Trento RISE, has been organizing the Telecom Italia Big Data Challenge, providing various geo-referenced and anonymized datasets. For the 2014 edition, they provided data for two Italian areas: the city of Milan and the Province of Trentino [3]. These data are available to the public (https://dandelion.eu/datamine/open-big-data/) under the Open Database License (ODbL).

From all the Telecom open data available, the data used correspond to two months (November and December 2013) of mobile telecommunications in the city of Milan. The datasets are user-generated telecommunication traffic, corresponding to the result of a computation over the Call Detail Records (CDRs) of sent and received SMS, incoming and outgoing calls, and Internet traffic.

All datasets have a temporal aggregation of ten minutes, for a total of 14,877,485 records. Data are provided in a series of CSV files, each containing one day of records, organized according to the following schema:

* Square id: the id of the square that is part of the Milano grid (see Figure 1).
* Time interval: the beginning of the time interval.
* Country code: the phone country code of a nation.
* Mobile network activity: the activity (a number) inside the Square id, in terms of outgoing and received SMS/calls and Internet connection (one column for each variable), during the Time interval and sent from the nation identified by the Country code.

The CDR records provided by Telecom Italia are not the real records; they are values proportional to the actual records, in order to provide anonymized data. To understand how these values were calculated, please refer to [3].

=====2.1.1 Spatial distribution of the mobile network data=====
Because the data come from different companies that have adopted different standards, their irregular spatial distribution is aggregated into a square grid (100 columns by 100 rows) covering the city of Milan, with a square cell size of 235 meters, referenced to WGS84 (EPSG:4326). The Milano grid is available in GeoJSON format (see Figure 1).

Figure 1. Representation of the Milano grid, with d = 235 meters.

In the CSV files, each record is related to a specific cell id of the Milano grid, so that each record can be referenced to a grid cell. Once the data are represented spatially, they can be seen as a series of raster maps, one for each time stamp (i.e. 144 raster maps per day for each variable).

Figure 2. Mean values (left) and maximum values (right) of outgoing calls in the city of Milan between November and December 2013.

Figure 2 summarizes the spatial behavior of the data. Mean and maximum values cluster in areas characterized by a high density of buildings, population and activities, such as new residential centers, airports, industrial zones, etc.

=====2.1.2 Temporal distribution of the mobile network data=====
Figure 3 gives a clear view of the temporal behavior of the data. These are radial box plots, where data are summarized by time stamp, meaning that the value of each hour is the sum of all grid values of a specific time stamp.

Figure 3. Radial box plot of incoming calls in the city of Milan. a) 11/29/13 and 12/05/13 (normal week). b) 12/25/13 and 12/31/13 (last week of the year, New Year's Eve).

The left plot shows data from the first week of December, which is representative of all the other weeks between November and December 2013. It shows a strong daily seasonality of incoming calls corresponding to working and non-working hours; the same behavior is observed for the rest of the variables (outgoing calls, incoming and outgoing SMS, Internet connection), indicating a temporal pattern in human behavior. Likewise, there is a weekly seasonality between working days and weekends, with Sunday being the day with the least activity. The right plot shows data from the last week of December, where peaks for December 25th and 31st can be seen clearly.

====2.2 Precipitation data====
Precipitation maps for the city of Milan between November and December 2013 come from ARPA, Lombardy's regional agency for environmental protection (http://ita.arpalombardia.it/ita/index.asp). An Optimal Interpolation (OI) method, explained in [5], was used by [4] to interpolate the hourly cumulated precipitation on a regular grid of 1.5 km, using data from ARPA Lombardia's mesoscale meteorological network. Figure 4 shows the maximum values during November and December 2013, where high values are concentrated outside the city of Milan, especially in the southeast part of the city.

Figure 4. Milan's precipitation map of maximum values (mm) between November and December 2013.

In Lombardy, the meteorological autumn and winter of 2013 were very mild and rainy, with exceptionally heavy snowfall [6]. The cause is the succession of disturbed situations, characterized by currents coming from southern quadrants, which favored an almost continuous supply of warm, moist air, in particular from the end of December onwards.

Figure 5. Milano grid precipitation map of maximum values (mm) between November and December 2013.

Figure 5 shows the rain distribution within the Milano grid area, summarized by time stamp as in the box plots. The different colors mark the thresholds of precipitation intensity (http://www.arpa.piemonte.gov.it/rischinaturali/tematismi/meteo/osservazioni/radar/intensita-precipitazione.html?delta=0). There are two main peaks, on November 16th and December 27th.

====2.3 Methods of analysis====
The Kruskal-Wallis test is the non-parametric counterpart to the analysis of variance (ANOVA) test. It compares samples of the same variable through the differences in their medians [7]. To detect any differences in telecommunications activity between the levels of precipitation intensity (high, moderate, slight and no rain), a Kruskal-Wallis test was applied in each cell (square) of the Milano grid: the analysis was made for each location and all time stamps (see Figure 6).

Figure 6. Schema of the map arrangement used to perform the Kruskal-Wallis test.

This test was used because assumptions of normality could not be confidently made for each of the cells of the Milano grid. In this analysis, the response variable is the telecommunications activity, spatially distributed within the city of Milan, in zones where the precipitation can be high, moderate or slight (factors) or not be present at all, and its behavior can differ depending on whether it is a weekday or a weekend. The Kruskal-Wallis test was used to test the null hypothesis that the precipitation intensity levels have equal population medians. We want to know whether people's behavior in telecommunications activity can change: for example, does the median of people's outgoing calls differ significantly on days of heavy rain?

====2.4 Data processing====
Precipitation maps and the telecommunications activity records were analyzed on a common space-time basis using Python with Spyder, the Scientific PYthon Development EnviRonment (https://pythonhosted.org/spyder/), and the scipy.stats (https://docs.scipy.org/doc/scipy/reference/stats.html) and pandas (http://pandas.pydata.org/) libraries. Data visualization used QGIS (http://www.qgis.org/it/site/) and the d3.js (https://d3js.org/), rickshaw.js (http://code.shutterstock.com/rickshaw/) and rbox.js (https://bl.ocks.org/davidwclin/ad5d13db260caeffe9b3) JavaScript libraries. The original files were imported into a MongoDB NoSQL database (www.mongodb.com). Different Python scripts were used for data processing:

* Asciitotable.py: conversion of precipitation maps into data frame data structures.
* exploration.py: basic data statistics, distribution fitting and histogram analysis.
* pre_kw.py: sum of values ignoring country code, aggregation of data by hour, join of telecom and precipitation data, creation of precipitation intensity and days of the week categories.
* kruskal.py: Kruskal-Wallis calculation by location.

All scripts used can be found on GitHub: https://github.com/carolinarias/Kruskal-Wallis-Spatial.git

===3. Results and Discussion===
We calculated a series of Kruskal-Wallis maps for each of the telecommunications variables (incoming and outgoing SMS/calls). High (heavy) rain was not considered because the number of measurements was less than five in the majority of the cells; for the Kruskal-Wallis test to work, the samples must have more than five observations. Figure 7 shows an example for outgoing calls (which also represents the results for the other variables), where the value 1 indicates that the test passed: the null hypothesis being that the population medians are all equal, a P-value ≤ α (0.05 in our case) means that the differences between the medians are statistically significant. The Kruskal-Wallis test reveals that the medians of the telecommunication activity are significantly different across the precipitation intensity levels: moderate, slight and no rain.

Figure 7. Kruskal-Wallis map for outgoing calls between November and December 2013.

The test also identified areas where certain levels of precipitation are common (i.e. areas with the same value of moderate precipitation = 2.5 mm), identified on the map as NaN.

===4. Conclusions and Future Work===
The results discussed above are a promising step towards a holistic understanding of the complex relationship between environmental and social dynamics, and a starting point for further smart cities and human geography analysis.

According to the Kruskal-Wallis test results, the null hypothesis of equality of medians can be rejected at the 95% confidence level: there is a significant relationship between telecommunications data and precipitation intensity levels.

A follow-up analysis would try to identify the causality of the relationship between precipitation and telecommunications activity: for example, how strong or weak is the relationship? Is there any primary process or feature which may have a spatial and/or temporal component?

We will continue this research taking into account:

* Citizen-generated geographic content vs. official sensor data.
* Testing not only precipitation but other weather variables such as temperature, sun radiation, wind direction, etc.
* Using a larger data sample (i.e. a one-year time series of data).
* Integrating additional data (i.e. traffic data, census data, historical social media data, etc.).

We hope to explore further the hypothesis of predicting weather conditions/environmental stress with the help of mobile data.

===5. References===
[1] Sagl, G., Beinat, E., Resch, B., Blaschke, T., 2011. Integrated geo-sensing: A case study on the relationships between weather and mobile phone usage in Northern Italy. In 2011 IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM), 208–213. IEEE.

[2] Craveiro, P., Ramos, F.M.V., Kanjo, E., Mawass, N.E., 2013. Towards an early warning system: the effect of weather on mobile phone usage. A case study in Abidjan, 1–11.

[3] Barlacchi, G., De Nadai, M., Larcher, R., Casella, A., Chitic, C., Torrisi, G., Antonelli, F., Vespignani, A., Pentland, A., Lepri, B., 2015. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Sci. Data 2, 150055. DOI: 10.1038/sdata.2015.55.

[4] Lussana, C., Salvati, M.R., Pellegrini, U., Uboldi, F., 2009. Efficient high-resolution 3-D interpolation of meteorological variables for operational use. Adv. Sci. Res. 3, 105–112.

[5] Uboldi, F., Lussana, C., Salvati, M., 2008. Three-dimensional spatial interpolation of surface meteorological observations from high-resolution local networks. Meteorol. Appl. 15, 331–345.

[6] ARPA Agenzia Regionale per la Protezione dell'Ambiente, 2015. Sintesi Meteoclimatica Inverno 2013/2014.

[7] Wheeler, D., Shaw, G., Barr, S., 2013. Statistical Techniques in Geographical Analysis, Third Edition. Taylor & Francis.
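
The preprocessing steps listed for pre_kw.py in Section 2.4 (sum ignoring country code, hourly aggregation, join with precipitation, intensity and day-type categories) can be illustrated with a minimal standard-library sketch. It is not the authors' script. The function names, the tuple layout of the rows, and in particular the intensity cut points `SLIGHT_MAX` and `MODERATE_MAX` are assumptions made for illustration; the paper takes its classes from the ARPA Piemonte table it cites and does not list the thresholds. The sketch also assumes the precipitation values have already been resampled onto the Milano grid.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical cut points in mm/h. The paper takes its intensity classes from
# the ARPA Piemonte table it cites and does not list the actual thresholds.
SLIGHT_MAX = 2.0
MODERATE_MAX = 6.0


def intensity_level(mm_per_hour):
    """Map hourly precipitation to one of the four levels named in the paper."""
    if mm_per_hour <= 0:
        return "no rain"
    if mm_per_hour < SLIGHT_MAX:
        return "slight"
    if mm_per_hour < MODERATE_MAX:
        return "moderate"
    return "high"


def hourly_activity(rows):
    """Sum ten-minute activity per (square_id, hour), ignoring country code."""
    totals = defaultdict(float)
    for square_id, start, _country_code, activity in rows:
        hour = start.replace(minute=0, second=0, microsecond=0)
        totals[(square_id, hour)] += activity
    return dict(totals)


def join_with_rain(totals, rain):
    """Attach intensity level and day type to each hourly total.

    rain maps (square_id, hour) to hourly precipitation in mm, assumed to be
    already resampled onto the Milano grid.
    """
    joined = []
    for key in sorted(totals):
        if key not in rain:
            continue  # no precipitation value for this cell and hour
        square_id, hour = key
        day_type = "weekend" if hour.weekday() >= 5 else "weekday"
        joined.append(
            (square_id, hour, totals[key], intensity_level(rain[key]), day_type)
        )
    return joined


# Toy rows: (square_id, interval start, country code, outgoing-call activity).
rows = [
    (1, datetime(2013, 11, 16, 10, 0), 39, 1.5),
    (1, datetime(2013, 11, 16, 10, 0), 33, 0.5),
    (1, datetime(2013, 11, 16, 10, 50), 39, 2.0),
    (1, datetime(2013, 11, 16, 11, 0), 39, 4.0),
]
rain = {
    (1, datetime(2013, 11, 16, 10, 0)): 2.5,
    (1, datetime(2013, 11, 16, 11, 0)): 0.0,
}
joined = join_with_rain(hourly_activity(rows), rain)
```

Each tuple in `joined` carries one hourly observation for one cell, which is the shape a per-cell Kruskal-Wallis comparison needs. With pandas, the same steps would typically be a group-by followed by a merge.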
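
The per-cell Kruskal-Wallis comparison described in Sections 2.3 and 3 can be sketched as follows. The paper's scripts rely on scipy.stats, which provides a `kruskal` function; the version below re-implements the statistic with the standard library only so that it is self-contained, and it should be read as an illustration rather than as the authors' kruskal.py. The function names, the record layout, the per-cell dropping of levels with five or fewer observations, and the coding of uncomputable cells as NaN are assumptions.

```python
import math
from collections import defaultdict


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution, via Q(a+1,z) = Q(a,z) + z^a e^-z / Gamma(a+1)."""
    z = max(x, 0.0) / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2 - 1e-9:
        q += z**a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a list of non-empty samples, with tie correction."""
    values = sorted(v for g in groups for v in g)
    n = len(values)
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    if tie_term == n**3 - n:
        # All observations are identical, so the statistic is undefined.
        return math.nan, math.nan
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n**3 - n)
    return h, chi2_sf(h, len(groups) - 1)


def kruskal_by_cell(records, alpha=0.05, min_obs=5):
    """records: iterable of (square_id, intensity_level, activity).

    Returns 1 where the medians differ significantly (p <= alpha), 0 where
    they do not, and NaN where the test cannot be computed (assumed coding).
    """
    by_cell = defaultdict(lambda: defaultdict(list))
    for square_id, level, activity in records:
        by_cell[square_id][level].append(activity)
    result = {}
    for square_id, levels in by_cell.items():
        # Keep only levels with more than min_obs observations in this cell.
        groups = [g for g in levels.values() if len(g) > min_obs]
        if len(groups) < 2:
            result[square_id] = math.nan
            continue
        _h, p = kruskal_wallis(groups)
        result[square_id] = math.nan if math.isnan(p) else int(p <= alpha)
    return result


# Toy data: cell 1 has clearly separated levels, cell 2 has constant activity.
levels = {"no rain": range(1, 7), "slight": range(7, 13), "moderate": range(13, 19)}
records = [(1, lvl, float(v)) for lvl, vals in levels.items() for v in vals]
records += [(2, lvl, 5.0) for lvl in levels for _ in range(6)]
cell_map = kruskal_by_cell(records)
```

In this toy input, cell 1 is flagged with 1 because its three levels do not overlap, while cell 2 yields NaN because all its observations are identical. Whether this matches the exact condition behind the NaN areas in Figure 7 is not stated in the paper.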