=Paper= {{Paper |id=Vol-2029/paper7 |storemode=property |title=Toward a Privacy-aware Data Collector for Economic and Urban Analytics |pdfUrl=https://ceur-ws.org/Vol-2029/paper7.pdf |volume=Vol-2029 |authors=Miguel Nuñez-del-Prado,Bruno Esposito,Ana Luna |dblpUrl=https://dblp.org/rec/conf/simbig/Nunez-del-Prado17 }} ==Toward a Privacy-aware Data Collector for Economic and Urban Analytics== https://ceur-ws.org/Vol-2029/paper7.pdf
Toward a Privacy-aware Data Collector for Economic and Urban Analytics

Miguel Nuñez-del-Prado, Bruno Esposito, Ana Luna
Universidad del Pacífico
Av. Salaverry 2020
Lima - Peru

{m.nunezdelpradoc, bn.espositoa, ae.lunaa}@up.edu.pe

Abstract

Nowadays, there is a mature set of tools and techniques for data analytics, which help data scientists extract knowledge from raw heterogeneous data. Nonetheless, there is still a lack of spatio-temporal historical datasets for studying everyday life phenomena, such as vehicular congestion, press influence, the effect of politicians' comments on stock exchange markets, the relation between the evolution of food prices and temperatures or rainfall, and the resilience of social structures against extreme climate events, among others. Unfortunately, few datasets combine different sources of urban data in order to carry out studies of phenomena occurring in cities (i.e., Urban Analytics). To solve this problem, we have implemented a Web crawler platform for gathering different kinds of available public datasets.

1 Introduction

Providing citizens with free access to raw data is one of the new global trends. These data, generated on a daily basis, could come from different sources such as governmental entities, NGOs, companies or Public Administration Entities, social networks, newspapers, congestion services, etc. Therefore, the data format must be standardized to ease access, use, generation of information and sharing. Thus, it is crucial that governments and private organizations, which have valuable data in their systems, servers, and databases, make these datasets available for the common benefit while taking citizens' privacy into account.

Unfortunately, Latin American countries, and Peru in particular, lack historical data available to citizens. In May 2017, the government published a Legislative Decree¹ to create the National Authority for Transparency and Access to Public Information and to strengthen the Regime of Personal Data Protection. This first step would allow not only greater transparency on the part of certain institutions but also the possibility for citizens to become partners and authors of solutions that could improve the quality of life of our society. Peru is beginning to generate an open data culture and is developing an open data portal at the national level². Nonetheless, some phenomena need fine-grained data. For instance, in a research paper (Srivastava, 2017), the author highlights the need to collect segregated data on the urban poor for inclusive urban planning. The huge scarcity of segregated data prevents a comprehensive understanding of their vulnerabilities. Segregated data on the urban poor are essential for inclusive planning and for building sustainable cities. Undoubtedly, having segregated data on the urban poor brings many benefits to urban planning, not only for inclusive planning but also for understanding vulnerability, knowing the contribution of the urban poor to the urban economy, and prioritizing actions.

¹ Legislative Decree 1353: "Decreto legislativo que crea la autoridad nacional de transparencia y acceso a la información pública, fortalece el régimen de protección de datos personales y la regulación de la gestión de intereses"
² Sistema Nacional de Información Ambiental: sinia.minam.gob.pe/

Our main contribution is an alternative for gathering data generated on a daily basis from public sources, and for storing, organizing and sharing them, in order to perform Urban Big Data Analytics under a privacy-aware framework in a developing country such as Peru. The information collection has followed a sanitization process, which protects the identity of citizens and brands located in Peru. The aim of this platform is to provide an urban dataset for studying different phenomena in urban environments, such as urban planning, online emergency detection, vulnerability, climate change, resilience and even poverty.

The present paper is organized as follows: Section 2 describes the related works; Sections 3, 4 and 5 detail the framework architecture, the collected data and data statistics of some datasets, respectively. Then, Section 6 shows an application of Urban Analytics. Finally, Section 7 concludes our work.

2 Related works

Open Data promotes innovation through societal participation in the use of the data. Such datasets include measurement data from city-wide sensor networks in smart cities as well as from citizen sensors. In the current section, we present some data collection efforts to tackle urban planning problems, to detect emergencies and to show city insights in real time.

Concerning data collection for urban planning, (Rathore et al., 2016) propose a smart city data collection platform. This platform gathers information about floods, water usage, traffic, vehicular mobility traces, parking lots, pollution, social networks and weather from smart homes, smart parking, vehicular networking, water and weather, and environmental pollution monitoring systems. The authors use the collected information for urban planning decision-making. Nevertheless, in (Santos et al., 2017), the authors claim that this kind of measurement needs some context. Thus, they proposed the Human-Aware Sensor Network Ontology for Smart Cities (HASNetO-SC) to describe knowledge associated with data collection from city-wide sensor networks with an appropriate level of contextual metadata for data understanding. They implemented the architecture for data collection in an urban metropolitan area in Brazil. Consequently, the platform allows citizens who have little to no knowledge about the collection environment and the collected data to access and process the information.

Regarding emergency detection, the work of (Xu et al., 2016) proposes a mechanism to gather information from Weibo³ about urban emergency events. The platform discovers the What, Where, When, Who, and Why of a given emergency event from Weibo comments. To complete these pieces of information, the platform relies on Social Sensors and Crowdsourcing layers. The former receives textual data from Weibo users. The latter extracts the basic elements of an emergency event (what, when, where, who, and why) to provide information for rescue services or decision making.

³ Weibo website: tw.weibo.com

Urban data gathering, an essential element of modern cities, raises great challenges such as data volume, velocity, data quality, privacy, and security, among others. In (Panagiotou et al., 2016), the authors describe the development of a set of techniques that aim at effective and efficient urban data management in real settings in Dublin city. The solutions were integrated into a system that is currently used by the city. The system can detect multiple types of incidents, each one focusing on a different input source. Hence, the solutions can identify events by analyzing, in real time, GPS trajectories, data coming from sensors installed in junctions, or textual information coming from social media. The authors developed analysis modules to forecast load so that they can efficiently manage the volume and velocity. Besides, they combine information to infer events from data anomalies. Noisy data and erroneous measurements are also dealt with. Moreover, machine learning algorithms are used to identify relevant tweets, thereby avoiding low-quality data. Another real-time data collection application can be found in London, where the information can be viewed in a dashboard⁴. More details on this work can be found in (Gray et al., 2016), whose main idea is to understand the city dynamics in a better way. The system gathers data from different third-party entities and open data platforms, provided as CSV files, JSON objects or HTML pages.

⁴ London city dashboard: citydashboard.org/london

Finally, there is also a multi-source dataset of urban life in Milano and Trentino (Barlacchi et al., 2015). The authors put together different datasets, such as spatial grid, social pulse, telecommunications, precipitations, weather, electricity, and news. They spatially locate all the data in a grid, which allows comparing datasets generated by various companies using different standards. The idea behind this dataset is to provide a testbed for different solutions to urban problems like energy consumption, mobility planning, tourism and migrant flows, urban structures and interactions, event detection, urban well-being, etc.

In the next section, we detail the framework proposed in the present effort.

3 Data gathering framework

In the present section, we describe the different components of the architecture of the Web crawler. Figure 1 depicts the different parts of the architecture, where each part is responsible for a given task as follows:

Crawler: this artifact is responsible for reading the Uniform Resource Locators (URLs) from the database to download the target web pages from different websites.

Data base: this database engine stores the list of URLs provided by the user and the rules for reading and extracting relevant fields.

Web Scraper: it receives the downloaded web pages and the extraction rules, and parses the web pages to gather the needed fields of a given web page.

Storing: generates the Comma-Separated Values (CSV) files for each treated website.

Control: verifies that the aforementioned parts (i.e., Web Scraper, Crawler and Storing) are alive and able to perform their tasks without problems.

Figure 1: Data gathering framework

The process begins with the manual creation of a list of target URLs and the rules needed to extract relevant fields from them. This information is then stored in the Data Base. The target URLs are chosen by the user. The data extraction starts with the Crawler reading the URLs from the Data Base to download the target Web pages. Then, the Web Scraper receives a set of Web pages as input and generates an index I of all the Web pages in each registered Website. Next, the Web Scraper extracts the data from the gathered web pages listed in the index I, using the particular structure defined for a given website. It is worth noting that the website structures used for extracting data are stored in the database beforehand. For implementing this part of the crawler, we relied on the Scrapy library⁵.

⁵ Scrapy: scrapy.org

Extracted data is temporarily stored in the database to generate a Comma-Separated Values (CSV) file at the end of the day. Therefore, we store the datasets on a daily basis to build a historical repository. Finally, the last part of the framework is the Control mechanism, which verifies the state of the crawler, scraper and file generation and notifies by email if something goes wrong while gathering data from the different Websites.

In the next section, we detail the different datasets collected by the platform.

4 Collected datasets

In this section, we detail the variables of the collected datasets, which are available⁶. It is important to note that a sanitization process was performed over the datasets. We consider people's and brands' names as sensitive information; they are pseudonymized and erased, respectively. Concerning the opinion dataset, the comment field is sanitized by erasing stop words and sorting the remaining words alphabetically to reduce the impact of a de-anonymization re-link attack (Gambs et al., 2014). These sanitization processes are carried out to prevent a privacy breach. Please note that the sanitization process is performed off-line. Thus, the sanitization does not limit the extraction of useful information.

⁶ BITMAP Urban Analytics: bitmap.com.pe/urbands.html

It is possible to extract unstructured datasets from the targeted websites (e.g., news from newspapers, satellite imagery, etc.). In this work, we present two non-tabular datasets: newspapers and social networks. Nonetheless, a tabular structure has been applied to them for easier readability.

In the following paragraphs, we describe each of the nine categories of data sources as well as the datasets in each category.

Beauty consists of the description, date, price and category of cosmetic products (cf. Table 1).

    category   date         price   title
    mujer      2016-08-04   69.0    Noir de Nuit
    mujer      2016-08-04   0.0     Orianité

    Table 1: Sample of beauty dataset.

Climate category contains data from monitoring stations, atmospheric pollutants and radiation provided by the National Meteorology and Hydrology Service of Peru (SENAMHI). Table 2 shows atmospheric pollutants (CO, NO, NO2, NOX, O3, PM10, PM2.5, SO2) acquired from several monitoring networks distributed in Metropolitan Lima and the Province of Lima. It also presents the date and time of the measurement.

    CO     NO      NO2    NOX          O3
    0.6    13.9    19.3   33.2         3.5
    PM10   PM2.5   SO2    date         hour
    51.6   34.0    2.5    2016-08-04   00:00

    Table 2: Sample of Climate dataset.

Table 3 specifies the set of meteorological data (humidity and temperature) of different monitoring stations. It also reports the name of the station where the data was measured, the date and the UTM location of the record (department, district, latitude, longitude and altitude).

    altitude    district    date       hum
    3928        AYAVIRI     16-08-04   9
    lat         lon         temper.    type
    14°52'22"   70°35'34"   19         Met

    Table 3: Sample of Station dataset.

Table 4 describes the data corresponding to solar radiation levels in different regions of Peru.

    arequipa     cajamarca   cusco   puno
    8.0          7.0         9.0     9.0
    date         ica         junin   tacna
    2016-08-04   6.0         9.0     4.0
    lima         moquegua    piura
    2.0          8.0         8.0

    Table 4: Sample of UV Radiation dataset.

Markets category reports the maximum and minimum prices and the description of basic necessity products from three different supermarket and supplier markets in Lima. Table 5 describes the data of products for sale in markets. This dataset records the name, category, minimum price, maximum price, average price and date.

    title       type        min price
    acelga      acelga      4.0
    max price   avg price   date
    5.0         4.5         2016-08-04

    Table 5: Sample of a Market dataset.

Medicament category comprises prices of medicines provided by the Ministry of Health of Peru (MINSA). Table 6 shows the registered drug dataset containing the condition of the drug, address, technical director, pharmacy name, price, name, country, and date of manufacture.

    Condition        address           director
    Con receta       C. Cordoba 2300   Luis Guia
    manufacturer     date              Working hours
    Hersil           2016-08-27        L-V 9:00-20:20
    name             country           regulation
    Bot. Pharmalys   Peru              NG1279
    price            register          phone
    2.5              NG1279            2660488
    holder           location
    Hersil           Lince - Lima

    Table 6: Sample of Medicament dataset.

Newspapers category contains news from different print media in Peru. Table 7 describes the newspapers dataset, composed of the publication date, author, and section of the newspaper.

    content     date                  author
    long text   2014-02-14 13:33:47   journalist name
    section     location              title
    mundo       mundo/eeuu            Edward Snowden

    Table 7: Sample of written press dataset.

Real Estate Market comprises prices for the sale and rental of houses, apartments, and offices nationwide. Table 8 shows the real estate data, with columns stating whether the property is for sale or rent, describing the property, and giving its address. Additionally, the longitude and latitude of the property are provided.



    title         section    description
    text1         alquiler   text2
    location      area       price
    Rep. Panama   320m       $5000
    lon           lat        date
    -77.032584    -12.0431   2016-08-06

    Table 8: Sample of Real Estate Market dataset.

Social Networks category contains geo-referenced comments from people on different topics, as well as social relations.

Table 9 shows opinions of various users, the language in which the opinion was written, as well as the region of origin. It is worth noting that the user id was pseudonymized using a hash function. In the same spirit, comments were sanitized to reduce re-identification risk.

    user id      timestamp       language
    1059254686   1476728629010   es
    lon          lat             region
    -77.0364     -12.0513        Lima, Pe.
    comment
    alza bar cafe centro el futuro gifs los puño

    Table 9: Sample of Opinion dataset.

Table 10 represents the friendship links between users of the social network.

    user id1     user id2
    1059254686   1059254367
    1059254686   2259254876

    Table 10: Sample of Social links dataset.

Stock Market category contains two datasets, one for money exchange rates and one for stock exchange markets. The former contains the historical exchange rates from Soles to other foreign currencies, as shown in Table 11.

    currency      buy     sell    date
    Swiss franc   3.346   3.686   2016-08-04
    Euro          3.684   3.819   2016-08-04

    Table 11: Sample of money exchange dataset.

The latter dataset contains the transactions of the Stock Exchange of Lima. It records the price of the last transaction, opening price, purchase, sale, company, dates, amount, the amount of the stock, operations, and price variation. These variables are detailed in Table 12.

    pre        open        sale       company   date
    7.8        7.7         7.56       Alicorp   2016-08-04
    currency   amountNeg   mnemonic   noShare   noOper
    S/         1 325 647   ALICORC1   173 677   16
    sector     segm        last       var       sale
    IND        RV1         7.56       -3.08     7.7

    Table 12: Sample of Stock Exchange dataset.

Transportation contains traffic jams on main avenues as well as domestic and international departing and arriving flights at Jorge Chavez airport in Lima, Peru.

Tables 13 and 14 detail the congestion (jams) and alerts datasets, respectively. These datasets contain information on some points in Lima city: the street, city, date, and latitude and longitude coordinates of the alerts or the level of congestion. Table 13 also indicates the level of traffic, the node where the traffic level is reset, as well as the speed of the traffic.

    street             city         date
    Av. Los Frutales   La Molina    2016-08-23
    latitude           longitude    traffic level
    -12.071628         -76.964632   2.0
    node               speed
    Calatrava          4.719

    Table 13: Sample of Jams dataset.

In the case of Table 14, the types and subtypes of alerts are gathered in addition to the above-mentioned data.

    Street         city         date
    Av. Aviación   San Borja    2016-08-23
    latitude       longitude    type
    -12.086116     -77.003996   JAM
    subtype
    JAM HEAVY TRAFFIC

    Table 14: Sample of Alerts dataset.

Table 15 shows the dataset for both arriving and departing flights at Jorge Chavez airport in Lima, Peru. It reports the name of the airline, the city of origin or destination, the status of the flight, and the belt on which the suitcases are delivered. The estimated and scheduled times, the door and the flight number are also reported.


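The gathering flow of Section 3 (a Data Base holding URLs and extraction rules, then Crawler, Web Scraper and Storing) can be illustrated with a small self-contained sketch. The authors relied on Scrapy; the version below deliberately swaps it for the Python standard library only, with regular expressions standing in for the extraction rules, an in-memory SQLite table standing in for the Data Base, and a caller-supplied `fetch` function standing in for the download step. The table layout, the example URL and all function names are assumptions made for illustration.

```python
import csv
import io
import re
import sqlite3


def build_db():
    """In-memory stand-in for the Data Base: target URLs plus extraction rules."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE rules (url TEXT, field TEXT, pattern TEXT)")
    db.executemany(
        "INSERT INTO rules VALUES (?, ?, ?)",
        [
            # Hypothetical target page and rules (assumptions).
            ("https://example.org/market", "title", r'<h1 class="title">(.*?)</h1>'),
            ("https://example.org/market", "price", r'<span class="price">(.*?)</span>'),
        ],
    )
    return db


def crawl(db, fetch):
    """Crawler: read the distinct target URLs and download each page."""
    query = "SELECT DISTINCT url FROM rules ORDER BY url"
    urls = [row[0] for row in db.execute(query)]
    return {url: fetch(url) for url in urls}


def scrape(db, pages):
    """Web Scraper: apply the rules stored for each URL to its downloaded page."""
    records = []
    query = "SELECT field, pattern FROM rules WHERE url = ? ORDER BY rowid"
    for url, html in pages.items():
        record = {}
        for field, pattern in db.execute(query, (url,)):
            match = re.search(pattern, html, flags=re.DOTALL)
            record[field] = match.group(1).strip() if match else ""
        records.append(record)
    return records


def store_csv(records, day):
    """Storing: serialize the day's records as CSV text, adding a date column."""
    if not records:
        return ""
    out = io.StringIO()
    fieldnames = list(records[0]) + ["date"]
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({**record, "date": day})
    return out.getvalue()
```

A run chains the stages as `store_csv(scrape(db, crawl(db, fetch)), day)`. A Control step in the spirit of the paper could wrap each stage in a `try`/`except` and send a notification on failure; it is left out here to keep the sketch short.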
    Airline          city         state
    Peruvian         Cusco        Landing
    belt             date         estimated time
    4                2016-07-22   10:50
    scheduled time   door         flight
    11:00            3            210

    Table 15: Sample of Airport traffic dataset.

    Dataset            Temporal granularity   Georeferenced   Spatial granularity¹ ²   Types
    Beauty             1.02                   False           None                     float, str, date
    Money ex.          3.19                   False           None                     float, str, date
    Stock ex.          1.16                   False           None                     float, int., str, date
    Ate                1.18                   False           None                     float, TSTP, date
    Cpo-de-marte       1.18                   False           None                     float, TSTP, date
    Carabayllo         1.18                   False           None                     float, TSTP, date
    Stations           1.16                   True            UTM                      float, int., str, date
    Huachipa           1.18                   False           None                     float, TSTP, date
    Puente-piedra      1.18                   False           None                     float, TSTP, date
    Radiacion UV       1.16                   False           Region                   int., date
    San Borja          1.18                   False           None                     float, TSTP, date
    S. J. Lurigancho   1.18                   False           None                     float, TSTP, date
    S. M. de Porres    1.18                   False           None                     float, TSTP, date
    Sta. Anita         1.18                   False           None                     float, TSTP, date
   Table 16 shows the size of the data, the num-                               V. M. del Triunfo
                                                                               Real estate3
                                                                                                     1.18
                                                                                                     3.13
                                                                                                                   False
                                                                                                                   True
                                                                                                                                  None
                                                                                                                                  LL
                                                                                                                                                 float, TSTP, date
                                                                                                                                                 float, str, date
ber of attributes and the number of records per                                Real estate1          2.52          True           LL             float, str, date
                                                                               Real estate2          2.57          True           LL             float, int., str, date
data set. On the other hand, Table 17 synthe-                                  Medicines             1.17          False          District       float, int., str, date
                                                                               Commerce1             1.2           False          None           float, str, date
sizes the characteristics linked to data types, head-                          Commerce2             1.2           False          None           float, str, date
                                                                               Markets               3.09          False          None           float, str, date
ers, and temporal space granularity. Concerning                                Opinion               -             True           LL             float, TSTP, str
temporal granularity range from 1.02 to 6.84 min-                              Newsp.1
                                                                               Newsp.2
                                                                                                     3.89
                                                                                                     1.18
                                                                                                                   False
                                                                                                                   False
                                                                                                                                  None
                                                                                                                                  None
                                                                                                                                                 str, date
                                                                                                                                                 str, int., date
utes. With regard to spatial granularity, there are                            Newsp.3
                                                                               Newsp.4
                                                                                                     1.35
                                                                                                     6.84
                                                                                                                   False
                                                                                                                   False
                                                                                                                                  None
                                                                                                                                  None
                                                                                                                                                 int., str, date
                                                                                                                                                 int., str, date
seven datasets georeferenced with UTM coordi-                                  Arrivals              1.21          False          None           TSTP, str, int., date
                                                                               Departures            1.17          False          None           TSTP, str, int., date
nates (i.e., latitude and longitude).                                          Alerts                1.16          True           LL             float, str, date
                                                                               Jams                  1.16          True           LL             float, int., str, date
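
The per-dataset figures reported in Table 16 (data points, attributes, and size in Mb) can be recomputed for a single file. The following is only a minimal sketch under stated assumptions: the paper does not say how the datasets are stored or how these figures were obtained, so the CSV format, the function name summarize_csv, its has_header flag (Table 17 indicates that not every dataset has a header), the use of decimal megabytes, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile


def summarize_csv(path, has_header=True):
    """Return (data_points, attributes, size_mb) for one CSV file.

    Assumption: one record per line, optionally preceded by a header line.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        first_row = next(reader, None)
        if first_row is None:  # empty file
            return 0, 0, 0.0
        attributes = len(first_row)
        # Stream the remaining rows so that large files are not loaded at once.
        remaining = sum(1 for _ in reader)
    data_points = remaining if has_header else remaining + 1
    size_mb = os.path.getsize(path) / 1_000_000  # decimal megabytes (assumed)
    return data_points, attributes, round(size_mb, 1)


# Made-up sample with the four attributes listed for the beauty dataset.
sample = (
    "category,date,price,article\n"
    "arrugas,2016-07-22,45.0,Essential cutis graso\n"
    "arrugas,2016-07-23,50.0,Essential cutis graso\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

summary = summarize_csv(sample_path)
os.remove(sample_path)
print(summary)
```

For the made-up sample, the function reports 2 data points and 4 attributes; the size rounds to 0.0 Mb, much as Table 16 shows 0.0 for its smallest dataset.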
    1 UTM is the Universal Transverse Mercator coordinate system.
    2 LL is the Latitude - Longitude coordinate system.

    Table 17: Spatial-temporal granularity summary and data types of
    different datasets.

      Data set        DataFrame            Data points    Attributes    Size (Mb)
      Beauty          Beauty               22,190         4             1.2
      Stock Market    Money ex.            997            4             0.0
      Stock Market    Stock ex.            7,253          16            0.8
      Weather         Ate                  2,952          10            0.2
      Weather         Campo-de-marte       3,291          10            0.2
      Weather         Carabayllo           2,796          10            0.1
      Weather         Stations             291,839        11            19.3
      Weather         Huachipa             2,322          10            0.1
      Weather         Puente-piedra        2,968          10            0.2
      Weather         Radiacion UV         3,181          11            0.2
      Weather         San-borja            3,180          10            0.2
      Weather         S. J. Lurigancho     2,565          10            0.1
      Weather         S. M. de Porres      2,890          10            0.2
      Weather         Sta. Anita           3,233          10            0.2
      Weather         V. M. del Triunfo    3,195          10            0.2
      Real estate     Real estate3         415,225        9             172.6
      Real estate     Real estate1         159,665        9             107.3
      Real estate     Real estate2         162,736        9             140.7
      Medicines       Medicines            555,7353       14            1,311.6
      Markets         Commerce1            136,433        6             12.3
      Markets         Commerce2            490,582        5             37.1
      Markets         Markets              11,178         6             0.7
      Opinion         Opinion              6’979,829      7             239
      News.           Newspapers1          1’560,134      6             1,129.6
      News.           Newspapers2          13,392         6             23.4
      News.           Newspapers3          27,060         6             11.4
      News.           Newspapers4          40,451         6             16.1
      Transport       Arrival              726,624        9             39.3
      Transport       Departure            679,260        9             41.2
      Transport       Alerts               351,690        7             33.3
      Transport       Jams                 1’643,277      8             187.5

    Table 16: Summary of different data sets size.

   In the next section, we describe some statistics of our datasets.

5       Dataset Statistics
In this section, we detail the statistics of the different datasets in the
categories described in Section 3.

Beauty.- Table 18 shows the most interesting characteristics of the beauty
    dataset, described in Table 1. On the one hand, two of the four
    attributes of the table contain categorical (discrete) values, for which
    the mode is the most important statistic. These modes are "wrinkles" and
    "Essential greasy skin" for the category and article attributes,
    respectively. On the other hand, the price is a numerical attribute, for
    which we report the mean, median, standard deviation, minimum and
    maximum values. It is important to note that no attribute contains null
    values (c.f., NAs).

      N    variable                 type     mean    median     std
      0    category                 str      -       -          -
      1    date                     date     -       -          -
      2    price                    float    935     45         35152
      3    article                  str      -       -          -
      N    mode                     min      max     NAs        %NAs
      0    arrugas                  -        -       0          0
      1    -                        -        -       0          0
      2    0                        0        1400000 0          0
      3    Essential cutis graso    -        -       0          0

    Table 18: Statistics of beauty dataset.

Climate.- We describe the characteristics associated with climate data. Two
    datasets are described: 1) data from meteorological stations and their
    measurements (Table 19); and 2) data on pollutants by district
    (Table 20). Other datasets are not described due to lack of space.
    In Table 19, we have a large number of cate-



    gorical attributes, which represent the characteristics of the
    monitoring stations. However, each of these monitoring stations also
    measures two meteorological characteristics (humidity and temperature),
    which are represented by numerical values. For these two attributes we
    report the mean, median, standard deviation, mode, and minimum and
    maximum values.

      N     variable         type       mean       median      std
      0     altitude         integer    2143.82    2485.00     1560.81
      1     department       str        -          -           -
      2     district         str        -          -           -
      3     station          str        -          -           -
      4     date             date       -          -           -
      5     humidity         float      62.60      62.00       226.54
      6     latitude         str        -          -           -
      7     longitude        str        -          -           -
      8     province         str        -          -           -
      9     temperature      float      17.03      16.70       41.39
      10    type             str        -          -           -
      N     mode             min        max        NAs         %NAs
      0     3812.00          0.00       5192.00    0           0
      1     -                -          -          291839      1
      2     -                -          -          291839      1
      3     CABO INGA        -          -          272508      1
      4     -                -          -          0           0
      5     100.00           5.00       45059.00   43820       0
      6     12° 46' 17.86"   -          -          0           0
      7     75° 0' 44.52"    -          -          0           0
      8     -                -          -          291839      1
      9     20.80            -30.80     4974.20    37258       0
      10    Meteorologica    -          -          272508      1

    Table 19: Statistics of stations dataset.

    In contrast, the pollutant data by district (c.f., Table 20) contain a
    large amount of numerical data. As in the previous descriptions, we
    report the mean, median, standard deviation, mode, minimum and maximum
    values.

      N    Var.     type     mean       median    std
      0    CO       float    1.299      1.2       7.659
      1    NO       float    46.657     40.1      31.31
      2    NO2      float    18.084     16.5      9.719
      3    NOX      float    64.632     58.95     35.193
      4    O3       float    7.661      5.4       9.329
      5    PM10     float    115.786    107.25    55.978
      6    PM2.5    float    35.145     28.9      27.485
      7    SO2      float    11.274     9.6       10.499
      8    date     date     -          -         -
      9    horas    TSTP     -          -         -
      N    mode     min      max        NAs       %NAs
      0    1.4      0.0      410.9      64        0.02
      1    27.0     1.2      263.4      708       0.24
      2    19.6     0.1      164.3      720       0.24
      3    63.9     0.8      328.7      708       0.24
      4    0.5      0.3      198.3      9         0.0
      5    94.5     0.0      948.0      10        0.0
      6    0.0      0.0      203.0      518       0.18
      7    8.7      2.7      353.3      290       0.1
      8    -        -        -          0         0.0
      9    -        -        -          1         0.0

    Table 20: Statistics of pollutants dataset.

Markets.- Here we describe the statistics of a market and a supermarket that
    sell groceries. This category was described in Table 5. Table 21 shows
    statistics about products. In the variable title, the most popular
    product is potato. The most frequent type is red-headed onion. The
    minimum price averages 2.427 PEN and varies between 0.57 PEN and
    13.00 PEN. The maximum price averages 3.057 PEN and varies between
    0.71 PEN and 14.00 PEN. Finally, the average price is 2.746 PEN and
    fluctuates between 0.61 PEN and 13.5 PEN.

      N    Var.                   type     mean     median    std
      0    title                  str      -        -         -
      1    type                   str      -        -         -
      2    min price              float    2.427    1.5       2.339
      3    max price              float    3.057    2.0       2.828
      4    av. price              float    2.746    1.75      2.525
      5    fecha                  date     -        -         -
      N    mode                   min      max      NAs       %NAs
      0    PAPA                   -        -        0         0.0
      1    CEBOLLA CABEZA ROJA    -        -        0         0.0
      2    2.0                    0.57     13.0     0         0.0
      3    2.0                    0.71     14.0     0         0.0
      4    1.25                   0.61     13.5     0         0.0
      5    -                      -        -        0         0.0

    Table 21: Statistics of market dataset.

Newspapers.- Table 22 shows that the most frequent content is text1, which
    is a given news item (we prefer not to share the content due to the lack
    of space). The most frequent author, section, location, and title are
    Carlos Batalla, executive zone, world / current, and "5 tips for a
    startup to survive", respectively. It should be noted that the
    percentages of missing values for content, author, and location are
    81%, 82% and 95%, respectively.

Real Estate.- This category contains three different datasets. We only show
    one table due to lack of space. The data describe rentals and/or real
    estate sales and are mostly categorical.
    In Table 23 we have only two attributes with numerical values, which are
    associated with the location of the property. For these attributes we
    report the mean, median, standard deviation, mode, minimum and maximum
    values. On the other hand, we have another group of characteristics with
    categorical values. Among them, the most frequent section



 N     variable              type      mean      median        std             N                 variable            type         mean     median       std
                                                                               0                    rue               str           -          -         -
 0     content                str       -          -            -              1                    city              str           -          -         -
 1     date                  date       -          -            -              2                   date              date           -          -         -
                                                                               3                 latitude            float      -12.087     -12.09    0.008
 2     author                 str       -          -            -              4                longitude            float       -77.01    -77.016    0.036
 3     section                str       -          -            -              5                 subtype              str           -          -         -
 4     location               str       -          -            -              6                   type               str           -          -         -
                                                                               N                  mode               min          max        NAs      %NAs
 5     title                  str       -          -            -              0          Av. Javier Prado Este        -            -         0        0.0
 N     mode                  min       max        NAs        %NAs              1                San Isidro             -            -         0        0.0
                                                                               2                      -                -            -         0        0.0
 1     texto1                  -        -       1264992       0.81             3               -12.091415           -12.1       -12.067       0        0.0
 2     Carlos                  -        -       1275035       0.82             4               -77.003755          -77.072      -76.949       0        0.0
       Batalla                                                                 5        JAM HEAVY TRAFFIC              -            -       28659      0.08
                                                                               6                   JAM                 -            -         0        0.0
 3     zona-                  -         -           0             0
       ejecutiva
 4     mun./act.              -         -       1480648          0.95
                                                                                         Table 24: Statistics of alerts dataset.
 5     5       cons.          -         -          0               0
       para      que                                                                     12.091415, 77.003755 where alerts of
       una startup
       sobre.                                                                          type jam and subtype jam heavy traffic are of-
                                                                                       ten reported.
     Table 22: Statistics of newspapers dataset.
                                                                                       Table 25 shows the street and the node where
                                                                                       most congestion is reported Av. Circun-
      variable is renting. We can also see that prop-                                  valacin del Golf Los Incas in the la Molina
      erties of 100m2 are the most ”offered”, using                                    district with an average traffic level of three
      the attribute area. Regarding the price, the                                     on a scale of one to five. Finally, it shows the
      most frequent value is $900. It is important                                     average speed of 2.59 Km/h.
      to note that, although we have a large amount                                N     variable                  type        mean       median      std
      of null data (NAs), these represent a rather                                 0     calle                       str         -           -         -
                                                                                   1     ciudad                      str         -           -         -
      low percentage due to a large amount of data                                 2     fecha                      date         -           -         -
                                                                                   3     latitud                   float      -12.084     -12.086    0.009
      available.                                                                   4     longitud                  float      -76.995     -76.993    0.035
                                                                                   5     nivel de trafico         integer      3.286        3.0      0.713
        N      variable        type     mean    median     std                     6     nodo                        str         -           -         -
        0         title         str       -        -        -                      7     velocidad                 float       2.597       2.369     1.302
        1       section         str       -        -        -                      N     mode                       min         max         NAs      %NAs
        2    description        str       -        -        -                      0     Av.     Circunvalacin        -          -           0        0.0
                                                                                         del Golf Los Incas
        3      location         str       -        -        -
                                                                                   1     La Molina                   -           -          0         0.0
        4        area           str       -        -        -
                                                                                   2     -                           -           -          0         0.0
        5        price          str       -        -        -
                                                                                   3     -12.076636               -12.112     -12.061       0         0.0
        6     longitude        float   -77.01   -77.03    0.07
                                                                                   4     -76.963088               -77.081     -76.941       0         0.0
        7      latitude        float   -77.01   -77.03    0.07
                                                                                   5     3.0                        1.0         5.0         2         0.0
        8        date          date       -        -        -
                                                                                   6     Av. Circunvalacin El        -           -          0         0.0
        N        mode          min      max      NAs     %NAs
                                                                                         Golf Los Incas
        0   Alquiler de...       -        -       0        0.0
                                                                                   7     2.025                    0.139        12.289     14288      0.01
        1      alquiler          -        -       0        0.0
        2      Rento...          -        -      2524     0.02
        3
        4
             Ubicacin...
                100 m
                                 -
                                 -
                                          -
                                          -
                                                27912
                                                 2790
                                                          0.17
                                                          0.02
                                                                                         Table 25: Statistics of jams dataset.
        5     US$ 900            -        -       0        0.0
        6       -77.03        -77.76   -76.13     0        0.0
        7       -77.03        -77.76   -76.13     0        0.0
        8           -            -        -       0        0.0

      Table 23: Statistics of real estate dataset.

Transportation.- We describe the statistics of the datasets related
    to the terrestrial transportation mode. We do not show the air
    transport dataset due to lack of space. For land transport,
    Tables 24 and 25 report the statistics on alerts and congestion,
    respectively.

      Table 24 shows that the most reported street is Av. Javier
      Prado. In the city variable, which corresponds to the district,
      San Isidro is the most reported district. More precisely in the
      coordinates

Stock Market.- This category has two different datasets: money
    exchange and stock market. The former is detailed in Table 26.
    The latter is not described due to lack of space in the present
    work.

         N     variable       type      mean     median     std
         0     currency       str         -         -        -
         1     purchase       float     2.823     3.315    1.134
         2     sell           float     2.891     3.397    1.432
         3     date           date        -         -        -
         N     mode           min       max       NAs      %NAs
         0     Swiss franc      -         -       264      0.26
         1     0.031          0.025     4.513     146      0.15
         2     0.034          0.0       4.827     264      0.26
         3     -                -         -         0      0.0

      Table 26: Statistics of money exchange dataset.

         Concerning the money exchange dataset (cf. Table 26), half of
         the variables are categorical (currency and date) and the
         other half are numerical (purchase and sell). It is worth
         noting that the mode of the currency attribute is Swiss
         franc. Please note that there are some null values (no
         data), which are counted in the NAs column; their percentage
         is given by %NAs.

                              Figure 3: Alerts
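
As an illustration, the per-variable statistics reported in the tables of this section (mean, median, std, mode, min, max, NAs and %NAs) could be computed along the following lines. This is only a minimal sketch and not the platform's actual code: the function name summarize, the use of None to mark a missing value, the choice of the sample standard deviation, the reading of %NAs as the share of missing values rounded to two decimals, and the toy values are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize(values):
    """Summarize one column; None marks a missing value (NA)."""
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)
    summary = {
        # Most frequent non-missing value (on ties, the first one seen).
        "mode": Counter(present).most_common(1)[0][0] if present else None,
        "NAs": n_missing,
        # Share of missing values, rounded to two decimals (assumption).
        "%NAs": round(n_missing / len(values), 2) if values else 0.0,
    }
    # Numerical metrics only make sense for numeric columns.
    numeric = all(isinstance(v, (int, float)) for v in present)
    if present and numeric:
        summary.update(
            mean=mean(present),
            median=median(present),
            # Sample standard deviation; it needs at least two values.
            std=stdev(present) if len(present) > 1 else 0.0,
            min=min(present),
            max=max(present),
        )
    return summary


# Hypothetical toy columns, not taken from the collected datasets.
purchase = [3.2, 3.3, None, 3.3, 0.03]
currency = ["Swiss franc", "US dollar", "Swiss franc", None]
print(summarize(purchase))
print(summarize(currency))
```

For a categorical column such as currency, only the mode and the missing-value counts are returned, which mirrors the dashes shown in the numerical columns of the tables.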

In the present section, we have not described the statistical metrics
of two data categories, medicaments and social networks, because of
lack of space. In the next section, we propose an example of Urban
Analytics to show the potential of our datasets.

6       Traffic congestion application
To show a possible application of our datasets, we take the Congestion
(cf. Table 13) and Alerts (cf. Table 14) datasets of the Transport
category to analyze traffic jams in a given district of Lima. We
filter the congestion reports and alerts to keep all records produced
in Lince district. Finally, we extract a CSV file containing this
traffic data to analyze it.
   In the present example, we make a visual analysis of congestion to
show the potential of our dataset. We rely on Qlik7 to depict
Figure 2.

Figure 2: Minimum, maximum and average speeds in different streets
and avenues of Lima

   Figure 2 shows the traffic level in red, and the minimal and
maximal speeds in blue and yellow, respectively. This analysis was
done for five different avenues and streets. As we can see, the lower
the speed, the higher the traffic level. Another interesting fact is
that the reported maximal and minimal speeds in Paseo de la República
Avenue have the same value (i.e., 1.6Km), which indicates high
congestion on this avenue.

   Figure 3 shows the distribution of the 17700 alerts gathered with
our platform. We note that Jam heavy traffic and Jam stand still
traffic are the most reported alerts, with 5500 alerts each.

Figure 4: Heatmap of reported alerts (top), real estate (middle) and
business locations (bottom) in Lince district.

   Finally, using Fusion Tables8, we can draw a geolocated heatmap of
reported alerts and jams in Figure 4A. As we can see, the most
reported and congested part is the road interchange between Javier
Prado Avenue and Paseo de la República highway, at the bottom right
of the figure. We have analyzed this segment of the city to study the
traffic level between this important road interchange and the
location of our university (i.e., Universidad del Pacífico) in the
upper left part of the figure.

    7 Qlik: qlikid.qlik.com
    8 Fusion Tables: sites.google.com/site/fusiontablestalks

   A simple analysis is already interesting but
crossing datasets could reveal useful insights. Therefore, we use
datasets from real estate and business locations. The former is the
dataset described in Table 23. The latter is a private dataset
containing the code, latitude, and longitude of businesses. Relying
on these datasets, we measured the influence of real estate and
business locations on congestion. Figure 5 shows a heat-map of the
distance among the dataset locations (where light and strong blue
mean short and large distance, respectively). It is possible to see
that alerts are influenced by real estate and business locations,
and jams only by real estate locations.

Figure 5: Heatmap of reported alerts (top) and jams (bottom) in Lince
district.

7   Conclusion

In the present work, we have described the architecture of a Web
crawler platform that gathers information about nine different
categories of datasets for urban analytics. The main contribution of
this work is the provision of information to the scientific community
and policy makers for analyzing and studying social behavior and
urban phenomena in a developing country such as Peru. We have
collected data following a privacy-aware structure. Sensitive
information about citizens or brand names has been sanitized. These
datasets have enabled us to implement a Knowledge Tier Platform for
Graph Mining (Nunez-del Prado et al., 2016) and perform urban
analytics (Di Clemente et al., 2017) or study urban resilience (Abbar
et al., 2016).
   In the future, we plan to extend the crawler to collect more
information from new publicly available Web sites. We also plan to
carry out a privacy risk analysis of the described datasets.

References
Sofiane Abbar, Tahar Zanouda, and Javier Borge-Holthoefer. 2016.
  Robustness and resilience of cities around the world. arXiv
  preprint arXiv:1608.01709.
Gianni Barlacchi, Marco De Nadai, Roberto Larcher, Antonio Casella,
  Cristiana Chitic, Giovanni Torrisi, Fabrizio Antonelli, Alessandro
  Vespignani, Alex Pentland, and Bruno Lepri. 2015. A multi-source
  dataset of urban life in the city of Milan and the Province of
  Trentino. Scientific Data 2.
Riccardo Di Clemente, Miguel Luengo-Oroz, Matias Travizano, Bapu
  Vaitla, and Marta C Gonzalez. 2017. Sequence of purchases in credit
  card data reveal life styles in urban populations. arXiv preprint
  arXiv:1703.00409.
Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado
  Cortez. 2014. De-anonymization attack on geolocated data. Journal
  of Computer and System Sciences 80(8):1597–1614.
Steven Gray, Oliver O’Brien, and Stephan Hügel. 2016. Collecting and
  visualizing real-time urban data through city dashboards. Built
  Environment 42(3):498–509.
Miguel Nunez-del Prado, Edgardo Bravo, Miguel Sierra, Miguel Canchay,
  and Isaias Hoyos. 2016. Knowledge tier platform for graph mining in
  (smart) cities. In Proceedings of Symposium on Information
  Management and Big Data.
Nikolaos Panagiotou, Nikolas Zygouras, Ioannis Katakis, Dimitrios
  Gunopulos, Nikos Zacheilas, Ioannis Boutsis, Vana Kalogeraki,
  Stephen Lynch, and Brendan O’Brien. 2016. Intelligent urban data
  monitoring for smart cities. In Joint European Conference on
  Machine Learning and Knowledge Discovery in Databases. Springer,
  pages 177–192.
M Mazhar Rathore, Awais Ahmad, Anand Paul, and Seungmin Rho. 2016.
  Urban planning and building smart cities based on the internet of
  things using big data analytics. Computer Networks 101:63–80.
Henrique Santos, Vasco Furtado, Paulo Pinheiro, and Deborah L
  McGuinness. 2017. Contextual data collection for smart cities.
  arXiv preprint arXiv:1704.01802.
Ambey Kumar Srivastava. 2017. Segregated data of urban poor for
  inclusive urban planning in India: Needs and challenges. SAGE Open
  7(1):2158244016689377.
Zheng Xu, Yunhuai Liu, Neil Yen, Lin Mei, Xiangfeng Luo, Xiao Wei,
  and Chuanping Hu. 2016. Crowdsourcing based description of urban
  emergency events using social media big data. IEEE Transactions on
  Cloud Computing.