<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The effect of weather on user-generated big geo data in mobile phone networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carolina Arias Muñoz</string-name>
          <email>carolina.arias@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Antonia Brovelli</string-name>
          <email>maria.brovelli@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Via Valleggio 11, 22100 Como</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>User-generated data in mobile networks are commonly used as proxies for human activity and mobility, and they can be applied to a wide variety of research problems, including mobility, city planning, tourism, event detection, urban well-being and many others. Few studies have explored the relationship between environmental variables and mobile usage data. This work tests this possible relationship in the city of Milan, as a first approach towards a predictive model for weather conditions/environmental stress from mobile usage data. The data used correspond to two months (November and December 2013) of Call Detail Records (CDRs) of sent and received SMS, incoming and outgoing calls, and Internet traffic, together with precipitation data from the meteorological network. According to the Kruskal-Wallis test results, we can conclude that, at the 95% confidence level, the null hypothesis of equality of medians can be rejected: there is a significant relationship between telecommunications data and precipitation intensity levels.</p>
        <p>CCS Concepts: • Applied computing • Applied computing~Internet telephony • Applied computing~Mathematics and statistics • Information systems~Open source software</p>
      </abstract>
      <kwd-group>
        <kwd>Mobile phone data</kwd>
        <kwd>spatial data mining</kwd>
        <kwd>urban planning</kwd>
        <kwd>Big Geo Data</kwd>
        <kwd>Kruskal-Wallis</kwd>
        <kwd>Extraction of spatial relations in Big Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        User-generated data in mobile networks are commonly used as
proxies for human activity and mobility, and they can be applied to a
wide variety of research problems, including mobility, city
planning, tourism, event detection, urban well-being and many
others. Few studies have explored the relationship between
environmental variables and mobile usage data.
Two studies are worth mentioning. The first important work
was done by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the authors explored the influence of weather
on mobile phone usage and (indirectly) human behavior. They used
factor analysis to reduce dimensionality and redundancy in some
meteorological variables, and then spectral analysis to unveil
significant periodic components in the time series of the
remaining factors (output of the factor analysis) and mobile
telecom traffic intensity. Later on, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] reproduced the experiment of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
both to validate the datasets they had available and to confirm the
authors’ results. Their focus was on creating a predictive
model of weather conditions and evaluating an autoregressive
integrated moving average (ARIMA) model and its forecast.
      </p>
      <p>In both studies, the correlation between temperature (or variables
related to thermal perception) and telecommunications data is
high, but the results related to precipitation do not match, probably
due to a series of factors including the data quality and the different
human behavior in the two regions used as case studies.</p>
      <p>Taking into account these previous works, we wanted to test whether there
is any relationship specifically between precipitation and mobile
outgoing calls in the city of Milan, as a first approach towards a
predictive model for weather conditions/environmental stress from
mobile usage data. In this work we consider:</p>
      <list list-type="bullet">
        <list-item><p>The use of only Free and Open Source tools.</p></list-item>
        <list-item><p>The use of the Kruskal-Wallis test instead of factor analysis.</p></list-item>
      </list>
      <p><sup>1</sup> https://dandelion.eu/datamine/open-big-data/</p>
      <p>2017, Copyright is with the authors. Published in the Workshop
Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017,
Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this
paper is permitted under the terms of the Creative Commons license
CC-by-nc-nd 4.0.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA AND METHODS</title>
      <p>This research is based on two types of data, namely the
user-generated traffic in mobile networks and precipitation data. The
latter is considered, for our purposes, an indicator of the weather
conditions.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 User-generated big geo data in mobile phone networks</title>
      <p>
        Telecom Italia, together with SpazioDati, MIT Media Lab, EIT ICT
Labs, Polytechnic University of Milan, Northeastern University,
University of Trento, Fondazione Bruno Kessler and Trento RISE,
has been organizing the Telecom Italia Big Data Challenge,
providing various geo-referenced and anonymized datasets. For the
2014 edition, they provided data for two Italian areas: the city of
Milan and the Province of Trentino [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These data are available<sup>1</sup>
to the public under the Open Database License (ODbL).
From all the Telecom open data available, the data used correspond to
two months (November and December 2013) of mobile
telecommunications in the city of Milan. The datasets are
user-generated telecommunication traffic, resulting from
computation over the Call Detail Records (CDRs) of sent and
received SMS, incoming and outgoing calls, and Internet traffic.
All datasets have a temporal aggregation of ten minutes, for a
total of 14,877,485 records. Data are provided in a series of CSV files,
each containing one day of records, organized according to the
following schema:
      </p>
      <list list-type="bullet">
        <list-item><p>Square id: the id of the square that is part of the Milano grid (see Figure 1).</p></list-item>
        <list-item><p>Time interval: the beginning of the time interval.</p></list-item>
        <list-item><p>Country code: the phone country code of a nation.</p></list-item>
        <list-item><p>Mobile network activity: the activity (a number) inside the Square id, in terms of outgoing and received SMS/calls and Internet connection (one column for each variable), during the Time interval and sent from the nation identified by the Country code.</p></list-item>
      </list>
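      <p>As an illustration of the schema above, the following minimal Python sketch sums the activity over country codes and aggregates the ten-minute records by hour with pandas. The column names, the epoch-millisecond time format and the sample values are assumptions made for this example; they are not taken from the original files.</p>
      <preformat>
```python
import pandas as pd

# Hypothetical column names: the schema is only described in prose.
COLUMNS = ["square_id", "time_interval", "country_code",
           "sms_in", "sms_out", "call_in", "call_out", "internet"]
ACTIVITY = COLUMNS[3:]
MS_PER_HOUR = 3_600_000


def aggregate_hourly(records: pd.DataFrame) -> pd.DataFrame:
    """Sum activity over country codes and over the ten-minute slots of each hour."""
    df = records.copy()
    # Assumption: time_interval is the slot start in Unix epoch milliseconds.
    hour_start_ms = (df["time_interval"] // MS_PER_HOUR) * MS_PER_HOUR
    df["hour"] = pd.to_datetime(hour_start_ms, unit="ms")
    # Grouping by cell and hour drops the country code and merges the slots.
    return df.groupby(["square_id", "hour"])[ACTIVITY].sum().reset_index()


# Made-up sample: one cell, two ten-minute slots, two country codes.
sample = pd.DataFrame(
    [
        [1, 1383260400000, 39, 1.0, 2.0, 3.0, 4.0, 5.0],
        [1, 1383260400000, 33, 0.5, 0.5, 0.5, 0.5, 0.5],
        [1, 1383261000000, 39, 1.0, 1.0, 1.0, 1.0, 1.0],
    ],
    columns=COLUMNS,
)
hourly = aggregate_hourly(sample)
```
      </preformat>
      <p>In this sketch the three sample rows collapse into a single hourly record for the cell, which is the granularity later matched with the hourly precipitation maps.</p>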
      <p>
        The CDR values provided by Telecom Italia are not the real
records; they are values proportional to the actual records, so that
the data remain anonymized. To understand how these values were
calculated, please refer to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>2.1.1 Spatial distribution of the mobile network data</title>
        <p>Since the data come from different companies which have adopted
different standards, the spatially irregular data are
aggregated on a square grid (100 columns by 100 rows) covering
the city of Milan, with a square cell size of 235 meters, in the WGS84
coordinate reference system (EPSG:4326). The Milano Grid is available in GeoJSON
format (see Figure 1).</p>
        <p>In the CSV files, each record is related to a specific cell id of the
Milano grid, so that each record can be referenced to
its grid cell. Once the data are represented spatially, they can be seen
as a series of raster maps, one for each time stamp (i.e. 144 raster
maps per day for each variable).</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2 Temporal distribution of the mobile network data</title>
        <p>Figure 3 gives a clear view of the temporal behavior of the data. It shows
radial box plots in which the data are summarized by time stamp:
the value for each hour is the sum of all grid values at that
time stamp.</p>
        <p>The left plot shows data from December’s first week, which is
representative of the rest of the weeks between November and
December 2013. It shows a strong daily seasonality of incoming
calls corresponding to working and non-working hours; the same
behavior is observed for the rest of the variables (outgoing calls,
incoming and outgoing SMS, Internet connection), indicating a
temporal pattern in human behavior. Likewise, there is a weekly
seasonality between working days and weekends, with Sunday being the
day with the least activity. The right plot shows data from December’s
last week, where peaks for December 25th and 31st can be seen
clearly.</p>
      </sec>
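      <p>The spatial and temporal views described above can be sketched as follows. The snippet assumes that cell ids run from 1 to 10000 in row-major order, which is not stated in this paper, and it uses hypothetical column names and made-up values. It places the records of one time stamp on a 100 by 100 array and sums a variable over the whole grid for each time stamp.</p>
      <preformat>
```python
import numpy as np
import pandas as pd

N_ROWS, N_COLS = 100, 100       # grid size given in the text
SLOTS_PER_DAY = 24 * 60 // 10   # ten-minute slots, i.e. 144 rasters per day


def to_raster(cell_ids, values):
    """Place one value per cell on the grid; cells without a record stay NaN."""
    raster = np.full(N_ROWS * N_COLS, np.nan)
    # Assumption: ids run from 1 to 10000 in row-major order.
    raster[np.asarray(cell_ids) - 1] = np.asarray(values, dtype=float)
    return raster.reshape(N_ROWS, N_COLS)


def total_per_timestamp(df, variable):
    """Sum a variable over all grid cells for each time stamp."""
    return df.groupby("time_interval")[variable].sum()


# Made-up records: three cells in the first slot, one cell in the second.
records = pd.DataFrame({
    "square_id": [1, 101, 10000, 1],
    "time_interval": [0, 0, 0, 600000],
    "call_out": [5.0, 7.0, 9.0, 2.0],
})
first_slot = records[records["time_interval"] == 0]
raster = to_raster(first_slot["square_id"], first_slot["call_out"])
totals = total_per_timestamp(records, "call_out")
```
      </preformat>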
    </sec>
    <sec id="sec-4">
      <title>2.2 Precipitation data</title>
      <p>
        Precipitation maps for the city of Milan between November and
December 2013 come from ARPA<sup>2</sup>, Lombardy's regional agency for
environmental protection. An Optimal Interpolation (OI)
method, explained in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], was used by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to interpolate, on a regular
grid of 1.5 km, the hourly cumulated precipitation using data from
ARPA Lombardia’s mesoscale meteorological network. Figure 4
shows the maximum values during November and December 2013,
where high values are concentrated outside the city of Milan,
especially to the southeast of the city.
      </p>
      <p>
        In Lombardy, the meteorological autumn and winter of 2013 were
very mild and rainy, with exceptionally heavy snowfall [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
cause was a succession of disturbed situations, characterized by
currents coming from southern quadrants, which favored an almost
continuous supply of warm, moist air, in particular from the end
of December onwards.
      </p>
      <p>
        The Kruskal-Wallis test is the non-parametric counterpart of the
analysis of variance (ANOVA) test. It compares samples of
the same variable by the differences in their medians [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To detect
any differences in telecommunications activity between the
levels of precipitation intensity (high, moderate, slight and no rain)<sup>3</sup>,
a Kruskal-Wallis test was applied in each cell (square) of the Milano grid:
the analysis was made for each location and all time stamps (see
Figure 6).
      </p>
      <p>This test was used because normality could not be
confidently assumed for each of the cells of the Milano grid. In this
analysis, the response variable is the telecommunications activity,
spatially distributed within the city of Milan, in zones where the
precipitation can be high, moderate or slight (factor levels) or not be present
at all, and its behavior can differ depending on whether it is a weekday or a
weekend. The Kruskal-Wallis test was used to test the null
hypothesis that the precipitation intensity levels have equal
population medians. We want to know whether people's
telecommunications behavior changes: e.g., is the median of people's outgoing
calls significantly different on days of heavy rain?</p>
      <p><sup>3</sup> http://www.arpa.piemonte.gov.it/rischinaturali/tematismi/meteo/osservazioni/radar/intensita-precipitazione.html?delta=0</p>
      <p><sup>4</sup> https://pythonhosted.org/spyder/</p>
      <p><sup>5</sup> https://docs.scipy.org/doc/scipy/reference/stats.html</p>
      <p><sup>6</sup> http://pandas.pydata.org/</p>
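      <p>For a single grid cell, the test described above can be sketched with scipy.stats.kruskal. The activity values below are invented for illustration and the variable names are hypothetical; only the grouping by precipitation intensity level and the 0.05 significance level follow the text.</p>
      <preformat>
```python
from scipy import stats

ALPHA = 0.05  # significance level used in the paper

# Made-up hourly outgoing-call activity for one cell, grouped by rain level.
no_rain = [52.1, 48.3, 50.7, 49.9, 51.4, 53.0]
slight = [45.2, 44.8, 47.1, 46.0, 43.9, 45.5]
moderate = [38.4, 40.1, 39.2, 37.8, 41.0, 39.9]

# H statistic and p-value for the null hypothesis of equal population medians.
statistic, p_value = stats.kruskal(no_rain, slight, moderate)

# The null hypothesis is rejected when the p-value does not exceed ALPHA.
reject_null = ALPHA >= p_value
```
      </preformat>
      <p>Because the test works on ranks rather than on the raw values, it does not require the activity in a cell to be normally distributed, which is the reason given above for preferring it to ANOVA.</p>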
    </sec>
    <sec id="sec-5">
      <title>2.4 Data processing</title>
      <p>Precipitation maps and the telecommunications activity records
were analyzed on a common space-time basis using Python with the
Spyder (Scientific PYthon Development EnviRonment)<sup>4</sup> environment and the
scipy.stats<sup>5</sup> and pandas<sup>6</sup> libraries. For data visualization, we used QGIS<sup>7</sup> and the
d3.js<sup>8</sup>, rickshaw.js<sup>9</sup> and rbox.js<sup>10</sup> JavaScript libraries.</p>
      <p>The original files were imported into a MongoDB NoSQL database<sup>11</sup>.
Several Python scripts were used for data processing:</p>
      <list list-type="bullet">
        <list-item><p>Asciitotable.py: conversion of precipitation maps into data frame structures.</p></list-item>
        <list-item><p>exploration.py: basic data statistics, distribution fitting and histogram analysis.</p></list-item>
        <list-item><p>pre_kw.py: summing values ignoring the country code, aggregating data by hour, joining telecom and precipitation data, and creating the precipitation intensity and day-of-the-week categories.</p></list-item>
        <list-item><p>kruskal.py: Kruskal-Wallis calculation by location.</p></list-item>
      </list>
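      <p>The join and categorization steps listed for pre_kw.py might look like the sketch below. This is not the authors' script: the column names are hypothetical, the class limits in mm/h are placeholders because the limits are not given in the text, and the precipitation values are assumed to be already resampled to the Milano grid cells.</p>
      <preformat>
```python
import numpy as np
import pandas as pd

# Placeholder class limits in mm/h (assumption, not taken from the paper).
SLIGHT_MAX = 2.0
MODERATE_MAX = 6.0


def join_and_categorise(telecom, precip):
    """Join hourly telecom and precipitation records and add the two categories."""
    merged = telecom.merge(precip, on=["square_id", "hour"], how="inner")
    # Right-closed bins: 0 is "no rain", (0, 2] slight, (2, 6] moderate, above 6 high.
    merged["intensity"] = pd.cut(
        merged["precip_mm"],
        bins=[-np.inf, 0.0, SLIGHT_MAX, MODERATE_MAX, np.inf],
        labels=["no rain", "slight", "moderate", "high"],
    )
    # dayofweek is 0 for Monday and 6 for Sunday.
    is_weekend = merged["hour"].dt.dayofweek >= 5
    merged["day_type"] = np.where(is_weekend, "weekend", "weekday")
    return merged


# Made-up records for one cell: a Friday, a Saturday and a Monday.
hours = pd.to_datetime(["2013-11-01 10:00", "2013-11-02 10:00", "2013-11-04 10:00"])
telecom = pd.DataFrame({"square_id": 1, "hour": hours, "call_out": [30.0, 12.0, 28.0]})
precip = pd.DataFrame({"square_id": 1, "hour": hours, "precip_mm": [0.0, 2.5, 8.0]})
table = join_and_categorise(telecom, precip)
```
      </preformat>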
      <p>All scripts used can be found on GitHub:
https://github.com/carolinarias/Kruskal-Wallis-Spatial.git</p>
    </sec>
    <sec id="sec-6">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>We calculated a series of Kruskal-Wallis maps for each of the
telecommunications variables (incoming and outgoing SMS/calls).
High (heavy) rain was not considered because the number of
measurements was less than five in the majority of the cells; for
the Kruskal-Wallis test to work, the samples must have more than
five observations. Figure 7 shows an example for outgoing calls
(which is also representative of the results for the other variables), where the
value 1 indicates that the test passed. Since the null hypothesis is that
the population medians are all equal, a p-value ≤ α (0.05 in our
case) means that the differences between the medians are
statistically significant. The Kruskal-Wallis test reveals that the
medians of the telecommunication activity are significantly
different across the precipitation intensity levels
moderate, slight and no rain.
The test also identified areas where certain levels of precipitation
are common (i.e. areas with the same value of moderate
precipitation = 2.5 mm), identified on the map as NaN.</p>
      <p><sup>7</sup> http://www.qgis.org/it/site/</p>
      <p><sup>8</sup> https://d3js.org/</p>
      <p><sup>9</sup> http://code.shutterstock.com/rickshaw/</p>
      <p><sup>10</sup> https://bl.ocks.org/davidwclin/ad5d13db260caeffe9b3</p>
      <p><sup>11</sup> www.mongodb.com</p>
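      <p>A per-cell map of test outcomes, as discussed above, could be assembled along the following lines. This is a sketch under stated assumptions rather than the published kruskal.py: the column names and sample values are hypothetical, groups with five or fewer observations are dropped, and NaN is returned whenever the test cannot be computed for a cell, which is only one possible reading of the NaN areas on the map.</p>
      <preformat>
```python
import numpy as np
import pandas as pd
from scipy import stats

ALPHA = 0.05
MIN_OBS = 5  # a group is kept only if it has more than this many observations


def kw_flag(cell_df, variable="call_out"):
    """Return 1.0 if medians differ significantly, 0.0 if not, NaN if untestable."""
    groups = [g[variable].to_numpy()
              for _, g in cell_df.groupby("intensity", observed=True)
              if len(g) > MIN_OBS]
    if 2 > len(groups):
        return np.nan  # nothing to compare
    if np.unique(np.concatenate(groups)).size == 1:
        return np.nan  # all values identical, the statistic is undefined
    _, p_value = stats.kruskal(*groups)
    return 1.0 if ALPHA >= p_value else 0.0


def kw_map(df, variable="call_out"):
    """Apply the test independently to every grid cell."""
    return pd.Series({cell: kw_flag(g, variable)
                      for cell, g in df.groupby("square_id")})


# Made-up data: cell 1 has three rain levels, cell 2 only one.
demo = pd.DataFrame({
    "square_id": [1] * 18 + [2] * 6,
    "intensity": (["no rain"] * 6 + ["slight"] * 6 + ["moderate"] * 6
                  + ["no rain"] * 6),
    "call_out": [52.1, 48.3, 50.7, 49.9, 51.4, 53.0,
                 45.2, 44.8, 47.1, 46.0, 43.9, 45.5,
                 38.4, 40.1, 39.2, 37.8, 41.0, 39.9,
                 10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})
flags = kw_map(demo)
```
      </preformat>
      <p>Checking for identical values before calling the test keeps the sketch independent of how a given SciPy version reports that degenerate case.</p>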
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>The results discussed above are a promising step towards a holistic
understanding of the complex relationship between environmental
and social dynamics, and a starting point for further smart cities and
human geography analysis.</p>
      <p>According to the Kruskal-Wallis test results, we can conclude that,
at the 95% confidence level, the null hypothesis of equality of
medians can be rejected: there is a significant relationship between
telecommunications data and precipitation intensity levels.</p>
      <p>A follow-up analysis would try to characterize the causality of the
relationship between precipitation and telecommunications
activity: e.g., how strong or weak is the relationship? Is there any
primary process or feature that may have a spatial and/or
temporal component?</p>
      <p>We will continue this research taking into account:</p>
      <list list-type="bullet">
        <list-item><p>Citizen-generated geographic content vs. official sensor data.</p></list-item>
        <list-item><p>Testing not only precipitation but also other weather variables, such as temperature, solar radiation, wind direction, etc.</p></list-item>
        <list-item><p>Using a larger data sample (e.g. a one-year time series of data).</p></list-item>
        <list-item><p>Integrating additional data (e.g. traffic data, census data, historical social media data, etc.).</p></list-item>
      </list>
      <p>We hope to explore further the hypothesis of predicting weather
conditions / environmental stress with the help of mobile data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Sagl</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beinat</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Resch</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Blaschke</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2011</year>
          , June).
          <article-title>Integrated geo-sensing: A case study on the relationships between weather and mobile phone usage in northern Italy</article-title>
          .
          <source>In 2011 IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM)</source>
          (pp.
          <fpage>208</fpage>
          -
          <lpage>213</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Craveiro</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramos</surname>
            ,
            <given-names>F.M.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanjo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mawass</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Towards an early warning system: the effect of weather on mobile phone usage. A case study in Abidjan</article-title>
          ,
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Barlacchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Nadai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larcher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chitic</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torrisi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antonelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vespignani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pentland</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lepri</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>A multi-source dataset of urban life in the city of Milan and the Province of Trentino</article-title>
          .
          <source>Sci. Data</source>
          <volume>2</volume>
          ,
          <fpage>150055</fpage>
          . DOI=
          <pub-id pub-id-type="doi">10.1038/sdata.2015.55</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Lussana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salvati</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pellegrini</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uboldi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Efficient high-resolution 3-D interpolation of meteorological variables for operational use</article-title>
          .
          <source>Adv. Sci. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Uboldi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lussana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salvati</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2008</year>
          .
          <article-title>Three‐dimensional spatial interpolation of surface meteorological observations from high‐resolution local networks</article-title>
          .
          <source>Meteorol. Appl</source>
          .
          <volume>15</volume>
          ,
          <fpage>331</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <collab>ARPA Agenzia Regionale per la Protezione dell'Ambiente</collab>
          ,
          <year>2015</year>
          .
          <source>Sintesi Meteoclimatica Inverno 2013/2014</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wheeler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaw</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Statistical Techniques in Geographical Analysis, Third Edition</article-title>
          . Taylor &amp; Francis.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>