=Paper=
{{Paper
|id=Vol-2029/paper7
|storemode=property
|title=Toward a Privacy-aware Data Collector for Economic and Urban Analytics
|pdfUrl=https://ceur-ws.org/Vol-2029/paper7.pdf
|volume=Vol-2029
|authors=Miguel Nuñez-del-Prado,Bruno Esposito,Ana Luna
|dblpUrl=https://dblp.org/rec/conf/simbig/Nunez-del-Prado17
}}
==Toward a Privacy-aware Data Collector for Economic and Urban Analytics==
Toward a Privacy-aware Data Collector for Economic and Urban Analytics Miguel Nunez-del-Prado, Bruno Esposito, Ana Luna Universidad del Pacı́fico Av. Salaverry 2020 Lima - Peru {m.nunezdelpradoc, bn.espositoa, ae.lunaa}@up.edu.pe Abstract lished Legislative Decree1 to create the National Authority for Transparency and Access to Public Nowadays, there are a mature set of tools Information; and, strengthen the Regime of Per- and techniques for data analytic, which sonal Data Protection. This first step would allow help Data Scientist to extract knowledge not only greater transparency on the part of cer- from raw heterogeneous data. Nonethe- tain institutions but also the possibility of the citi- less, there is still a lack spatio-temporal zens of becoming a partner and author of solutions historical dataset allowing to study ev- that could improve the life quality of our society. eryday life phenomena, such as vehicular Peru is beginning to generate an open data culture congestion, press influence, the effect of and developing an open data portal at the national politicians comments on stock exchange level2 . Nonetheless, some phenomena need fine- markets, the relation between food prices grained data. For instance, in a research paper evolution and temperatures or rainfall, so- (Srivastava, 2017), the author highlights the need cial structure resilience against extreme to collect segregated data of urban poor for inclu- climate events, among others. Unfortu- sive urban planning. The huge scarcity of segre- nately, there are few datasets combining gated data does not allow to make a comprehen- different sources of urban data in order to sive understanding of their vulnerabilities. Segre- carry out studies of phenomena occurring gated data of urban poor are essential for inclusive in cities (i.e., Urban Analytics). To solve planning and to build sustainable cities. 
Undoubt- this problem, we have implemented a Web edly, there are many benefits of having segregated crawler platform for gathering a different data of urban poor in urban planning, not only for kind of available public datasets. inclusive planning but also to understand the vul- nerability, to know the contribution of urban poor 1 Introduction in urban economy and to prioritize actions. Providing citizens with free access to raw data is Our main contribution is presenting an alterna- one of the new global trends. These data, gen- tive to gathering daily basis generated data (from erated on a daily basis, could come from differ- public sources for storing, organizing and shar- ent sources such as governmental entities, NGOs, ing) to perform Urban Big Data Analytics under companies or Public Administration Entities, so- a privacy aware framework in a developing coun- cial networks, newspapers, congestion services, try such as Peru. The information collection has etc. Therefore, the data format must be a standard followed a sanitization process, which assures the to make easier the access, use, generation of infor- identity safety of citizens and brands located in mation and sharing. Thus, it is crucial that govern- Peru. The aim of this platform is to provide an ments and private organizations, which have valu- urban dataset for studying different phenomena in able data in their systems, servers, and databases urban environments such as urban planning, on- make available these datasets for common benefits 1 Legislative Decree 1353: ”Decreto legislativo que crea la but taking into account citizen’s privacy. autoridad nacional de transparencia y acceso a la información Unfortunately, Latin American countries and in pública, fortalece el régimen de protección de datos person- ales y la regulación de la gestión de intereses” particular Peru lacks of historical data available 2 Sistema Nacional de Información Ambiental: sinia. to citizens. 
In May 2017, the government pub- minam.gob.pe/ 91 line emergency detection, vulnerability, climate Sensors and Crowdsourcing layers. The former re- change, resilience and even poverty. ceives textual data from Weibo users. The latter The present paper is organized as follows: Sec- extracts the basic elements of an emergency event tion 2 describes the related works, Sections 3, 4 (what, when, where, who, and why) to provide in- and 5 detail the framework architecture, the col- formation for rescue services or decision making. lected data and data statistics of some datasets, re- Regarding urban data gathering, which is an es- spectively. Then, Section 6 shows an application sential element of modern cities, a great challenge of Urban Analytics. Finally, Section 7 concludes appears, such as data volume, velocity, data qual- our work. ity, privacy, and security, among others. In the paper (Panagiotou et al., 2016), authors describe 2 Related works the development of a set of techniques that aim at Open Data promotes innovation thoughts soci- effective and efficient urban data management in etal participation with the use of the data. Such real settings in Dublin city. The solutions were datasets include measurement data from city-wide integrated into a system that is currently used by sensor networks on smart cities as well as from the city. The system can detect multiple types of citizen sensors. In the current section, we present incidents, each one focusing on a different input some efforts for data collection to tackle urban source. Hence, the solutions can identify events planning problems, to detect emergencies and to by analyzing in real-time GPS trajectories, data show city insights in real time. coming from sensors installed in junctions, or tex- Concerning data collection for urban planning, tual information coming from social media. Au- (Rathore et al., 2016) propose a smart city data thors developed analysis modules to forecast load collection platform. 
This platform gathers infor- so that they can manage efficiently the volume and mation about floods, water usage, traffic, vehicu- velocity. Besides, they combine information to in- lar mobility traces, parking lots, pollution, social fer events from data anomalies. Noisy data and networks and weather from smart homes, smart erroneous measurements are also dealt. Moreover, parking, vehicular networking, water & weather machine learning algorithms are used to identified and environmental pollution monitoring systems. relevant tweets avoiding in that way low-quality The authors use the collected information for ur- data. Another real-time data collection application ban planning decision-making. Nevertheless, in can be found in London where the information can (Santos et al., 2017), authors claim a need to give be viewed in a dashboard4 . More details on this some context to this kind of measurements. Thus, work can be found in the reference (Gray et al., they proposed the Human-Aware Sensor Network 2016), where its main idea is to understand the city Ontology for Smart Cities (HASNetO-SC) to de- dynamics in a better way. The system gathers data scribe knowledge associated with data collection from different third party entities and open data from city-wide sensor networks with an appropri- platforms given as CSV file, JSON object or an ate level of contextual metadata for data under- HTML. standing. Therefore, they implemented the archi- At last, there is also a multi-source dataset of tecture for data collection in an urban metropolitan urban life in Milano and Trentino (Barlacchi et al., area in Brazil. Consequently, the platform opens 2015). The authors put together different data sets, the possibility that citizens, who have a little to no such as spatial grid, social pulse, telecommuni- knowledge about the collection environment and cations, precipitations, weather, electricity, news, the collected data, to access and process the infor- and social pulse. 
The scientists locate spatially mation. all the data in a grid, which allows comparing About emergencies detection, the work of (Xu datasets generated in various companies using dif- et al., 2016) proposes a mechanism to gather in- ferent standards. The idea behind this dataset is formation from Weibo3 about urban emergency to make a testbed for different solutions to ur- events. The platform discovers What, Where, ban problems like energy consumption, mobility When, Who, and Why of a given emergency event planning, tourism and migrant flows, urban struc- from Weibo comments. Thus, to complete these tures and interactions, event detection, urban well- pieces of information, the platform relies on Social 4 London city dashboard: citydashboard.org/ 3 Weibo website: tw.weibo.com london 92 being, etc. the particular structure settled for a given website. In the next section, we detail the framework pro- It is worth noting that websites structures for ex- posed in the present effort. tracting data are stored in the database beforehand. For implementing this part of the crawler, we re- 3 Data gathering framework lied on the Scrappy library5 . In the present section, we describe the different Extracted data is temporally stored in the components of the architecture of the Web crawler. database to generate a Comma Separated Vector Figure 1 depicts the different parts of the architec- CSV file at the end of a day. Therefore, we store ture, where each part is responsible for a given task the datasets in a daily basis to build a historical as follows: repository. Finally, the last part of the framework is the Control mechanism that verifies the state of the crawler, scraper and file generation to notify by email if something goes wrong while gathering data from the different Websites. In the next section, we detail the different datasets collected by the platform. 
Figure 1: Data gathering framework 4 Collected datasets In this section, we detail the variables of the col- Crawler: this artifact is responsible for reading lected datasets, which are available6 . It is impor- the Uniform Resource Locators (URLs) from tant to note that a sanitization process was per- the database to download the target web formed over the datasets. Therefore, we con- pages from different websites. sider as sensible information people’s and brands’ names, which are pseudonymized and erased, re- Data base: this data base engine stores a list of spectively. Concerning the opinion dataset, the URL provided by the user and the rules for comment field is sanitized by erasing stop words reading and extracting relevant fields. and sorting words alphabetivally to reduce the im- Web Scraper: it receives the downloaded web pact of a De-anonymization re-link attack (Gambs pages and the extracting rules to parse the et al., 2014). Consequently, these sanitization pro- web pages for gathering the needed fields of cesses are carried out to prevent a privacy breach. a given web page. Please note that the sanitization process is per- formed off-line. Thus, the sanitization does not Storing: generate the Comma Separated Vector limit the extraction of useful information. (CSV) files for each treated web site. It is possible to extract unstructured datasets from the websites targeted (i.e. news from news- Control: verifies the aforementioned parts (i.e., papers, satellite imagery, etc.). In this work, we Web Scraper, Crawler and Storing) are alive present two non-tabular datasets: newspapers and and are able to perform their task without social networks. Nonetheless, a tabular structure problems. has been applied to them for easier readability. 
The process begins with the manual creation of In the following paragraphs, we described each a list of target URLs and the rules needed to extract of the nine categories of data sources as well as the relevant fields from these. Then, those information datasets in each category. are stored in the Data Base. The target URLs are chosen by the user. The data extraction starts with Beauty consists of a description, date, price, cat- the Crawler reading the (URLs) from the Data egory and cosmetics products (c.f. Table 1). Base to download target Web pages. Then, the Web Scrapper receives a set of Web pages as in- Climate category contains data from monitoring put. Consequently, it generates an index I of all stations, atmospheric pollutants and radiation the Web pages in each registered Website. Next, 5 Scrapy: scrapy.org the Web Scrapper extract the data from the differ- 6 BITMAP Urban Analytics: bitmap.com.pe/ ent gathered web pages, listed in the index I, using urbands.html 93 category date price title Table 5 describes the product data for sale in mujer 2016-08-04 69.0 Noir de Nuit markets. This dataset records the name, cate- mujer 2016-08-04 0.0 Orianité gory, minimum price, maximum, average and date. Table 1: Sample of beauty dataset. title type min price acelga acelga 4.0 provided by the National Meteorological and max price avg price date Hydro-graphic Service of Peru (SENAMHI). 5.0 4.5 2016-08-04 Table 2 shows atmospheric pollutants (CO, N O, N O2 , N OX, O3 , P M10 , P M2.5 , SO2 ) Table 5: Sample of a Market dataset. acquired from several monitoring networks distributed in Metropolitan Lima and the Medicament category comprises prices of Province of Lima. It also presents the date medicines provided by the Ministry of and time of the measurement. Health of Peru (c.f., MINSA). 
Table 6 shows the registered drug dataset containing the CO NO NO2 NOX O3 condition of the drug, address, technical di- 0.6 13.9 19.3 33.2 3.5 rector, pharmacy name, price, name, country, PM10 PM2.5 SO2 date hour and date of manufacture. 51.6 34.0 2.5 2016-08-04 00:00 Condition address director Table 2: Sample of Climate dataset. Con receta C. Cordoba 2300 Luis Guia manufacturer date Working hours Hersil 2016-08-27 L-V 9:00-20:20 Table 3 specifies the set of meteorological name country regulation data (humidity and temperature) of different Bot. Pharmalys Peru NG1279 price register phone monitoring stations. It also reports the station 2.5 NG1279 2660488 name where it was measured, the date and the holder location UTM location of the record (department, dis- Hersil Lince - Lima trict, latitude and longitude and altitude). Table 6: Sample of Medicament dataset. altitude district date hum 3928 AYAVIRI 16-08-04 9 Newspapers category contains news from differ- lat lon temper. type ent print media in Peru. Table 7 describes 14 52’22” 70 35’34” 19 Met newspapers dataset composed of the publica- tion date, author, and the section of newspa- Table 3: Sample of Station dataset. per. Table 4 describes the data corresponding to content date author solar radiation levels in different regions in long text 2014-02-14 journalist Peru. 13:33:47 name section location title arequipa cajamarca cusco puno mundo mundo/eeuu Edward 8.0 7.0 9.0 9.0 Snowden date ica junin tacna 2016-08-04 6.0 9.0 4.0 Table 7: Sample of written press dataset. lima moquegua piura 2.0 8.0 8.0 Real Estate Market comprehends prices for houses, apartments, and offices sales and Table 4: Sample of UV Radiation dataset. rental nationwide. Table 8 shows real estate data grouped in columns detailing whether Markets category reports maximum and mini- the state of the property is a sale or rental, mum prices, description of different products describes the property, and its address. 
of the first necessity of three different mar- Additionally, the longitude and latitude of kets of supermarkets and suppliers in Lima. the property are provided. 94 title section description dates, amount, the amount of the stock, op- text1 alquiler text2 erations, and price variation. These variables location area price are detailed in Table 12. Rep. Panama 320m $5000 pre open sale company date lon lat date 7.8 currency 7.7 amountNeg 7.56 mnemonic Alicorp noShare 2016-08-04 noOper -77.032584 -12.0431 2016-08-06 S/ sector 1 325 647 segm ALICORC1 last 173 677 var 16 sale IND RV1 7.56 -3.08 7.7 Table 8: Sample of Real Estate Market dataset. Table 12: Sample of Stock Exchange dataset. Social Networks category contains geo- referenced comments from people on Transportation contains main avenues traffic different topics and social relations. jams and domestic as well as international departure and arrivals flights at Jorge Chavez Table 9 shows opinions of various users, the airport in Lima, Peru. language in which the opinion was made as well as the region of origin. It is worth noting Tables 13 and 14 detail datasets of both alerts that user id was pseudonymized using a hash and congestion, respectively. These datasets function. In the same spirit, comments were contain information of some points in Lima sanitized to reduce re-identification risk. city. It details the street, city, date, latitude and longitude coordinates of the alerts or the user id timestamp language level of congestion. Table 13 also indicates 1059254686 1476728629010 es the level of traffic, the node where the traffic lon lat region level is reset as well as the speed of the traffic. -77.0364 -12.0513 Lima, Pe. comment street city date alza bar cafe centro el futuro gifs los puño Av. Los Frutales La Molina 2016-08-23 latitude longitude traffic level Table 9: Sample of Opinion dataset. 
-12.071628 -76.964632 2.0 node speed Table 10 represents the friendship links be- Calatrava 4.719 tween users of the social network. Table 13: Sample of Jams dataset. user id1 user id2 1059254686 1059254367 In the case of Table 14, the types and sub- 1059254686 2259254876 types of alerts are gathered in addition to the above-mentioned data. Table 10: Sample of Social links dataset. Street city date Av. Aviación San Borja 2016-08-23 Stock Market category contains two datasets for latitude longitude type money exchange rates and stock exchange markets. The former contains the different -12.086116 -77.003996 JAM historical exchange rates, from Soles to other subtype foreign exchange as shown in Table 11. JAM HEAVY TRAFFIC currency buy sell date Table 14: Sample of Alerts dataset. Swiss franc 3.346 3.686 2016-08-04 Table 15 shows the dataset for both arrivals Euro 3.684 3.819 2016-08-04 and departures flights from Jorge Chavez air- Table 11: Sample of money exchange dataset. port in Lima, Peru. The name of the airline, the city of origin or destination, the status of The latter dataset also contains the transac- the flight, the belt in which the suitcases are tions of the Stock Exchange of Lima. This delivered. The estimated and scheduled time, dataset contains the price of the last transac- and the door and flight number are also re- tion, opening price, purchase, sale, company, ported. 95 Dataset Temporal Geo refe- Spatial Types Airline city state Granu- renced granula- larity rity1 2 Peruvian Cusco Landing Beauty 1.02 False None float, str, date belt date estimated time Money ex. Stock ex. 
3.19 1.16 False False None None float, str, date float, int., str, date 4 2016-07-22 10:50 Ate 1.18 False None float, TSTP, date Cpo-de-marte 1.18 False None float, TSTP, date scheduled time door flight Carabayllo 1.18 False None float, TSTP, date Stations 1.16 True UTM float, int., str, date 11:00 3 210 Huachipa 1.18 False None float, TSTP, date Puente-piedra 1.18 False None float, TSTP, date Radiacion UV 1.16 False Region int., date Table 15: Sample of Airport traffic dataset. San Borja 1.18 False None float, TSTP, date S. J. Lurigancho 1.18 False None float, TSTP, date S. M. de Porres 1.18 False None float, TSTP, date Sta. Anita 1.18 False None float, TSTP, date Table 16 shows the size of the data, the num- V. M. del Triunfo Real estate3 1.18 3.13 False True None LL float, TSTP, date float, str, date ber of attributes and the number of records per Real estate1 2.52 True LL float, str, date Real estate2 2.57 True LL float, int., str, date data set. On the other hand, Table 17 synthe- Medicines 1.17 False District float, int., str, date Commerce1 1.2 False None float, str, date sizes the characteristics linked to data types, head- Commerce2 1.2 False None float, str, date Markets 3.09 False None float, str, date ers, and temporal space granularity. Concerning Opinion - True LL float, TSTP, str temporal granularity range from 1.02 to 6.84 min- Newsp.1 Newsp.2 3.89 1.18 False False None None str, date str, int., date utes. With regard to spatial granularity, there are Newsp.3 Newsp.4 1.35 6.84 False False None None int., str, date int., str, date seven datasets georeferenced with UTM coordi- Arrivals 1.21 False None TSTP, str, int., date Departures 1.17 False None TSTP, str, int., date nates (i.e., latitude and longitude). Alerts 1.16 True LL float, str, date Jams 1.16 True LL float, int., str, date 1 Data set DataFrame Data points Attri- Size UTM is the Universal Transverse Mercator coordinate sys- butes (Mb) Beauty Beauty 22,190 4 1.2 tem. 2 Stock Market Money ex. 
997 4 0.0 LL is the Latitude - Longitude coordinate system. Stock Market Stock ex. 7,253 16 0.8 Weather Ate 2,952 10 0.2 Weather Campo-de-marte 3,291 10 0.2 Table 17: Spatial-temporal granularity summary Weather Carabayllo 2,796 10 0.1 Weather Stations 291,839 11 19.3 and data types of different datasets. Weather Huachipa 2,322 10 0.1 Weather Puente-piedra 2,968 10 0.2 Weather Radiacion UV 3,181 11 0.2 Weather San-borja 3,180 10 0.2 Weather S. J. Lurigancho 2,565 10 0.1 mode is the most important. These values Weather Weather S. M. d Porres Sta. Anita 2,890 3,233 10 10 0.2 0.2 are ”wrinkles” and ”Essential greasy skin” Weather Real estate V. M. del Triunfo Real estate3 3,195 415,225 10 9 0.2 172.6 for the category and article attributes, respec- Real estate Real estate1 159,665 9 107.3 tively. On the other hand, we have a numer- Real estate Real estate2 162,736 9 140.7 Medicines Medicines 555,7353 14 1,311.6 ical attribute that is the price, of which we Markets Commerce1 136,433 6 12.3 Markets Commerce2 490,582 5 37.1 have the mean, median, standard deviation, Markets Markets 11,178 6 0.7 Opinion Opinion 6’979,829 7 239 minimum and maximum values. It is impor- News. News. Newspapers1 Newspapers2 1’560,134 13,392 6 6 1,129.6 23.4 tant to note that, no attribute contains null News. News. Newspapers3 Newspapers4 27,060 40,451 6 6 11.4 16.1 values (c.f., NAs). Transport Arrival 726,624 9 39.3 Transport Departure 679,260 9 41.2 N variable type mean median std Transport Alerts 351,690 7 33.3 0 category str - - - Transport Jams 1’643,277 8 187.5 1 date date - - - 2 price float 935 45 35152 3 article str - - - Table 16: Summary of different data sets size. N mode min max NAs %NAs 0 arrugas - - 0 0 1 - - - 0 0 In the next section, we describe some statistics 2 0 0 1400000 0 0 3 Essential cutis graso - - 0 0 of our datasets. Table 18: Statistics of beauty dataset. 
5 Dataset Statistics In this section, we detail the statistics of the differ- Climate.- We describe the characteristics associ- ent dataset in the described categories in Section ated with climate data. Two datasets will be 3. described: 1) data from meteorological sta- Beauty.- Table 18 shows the most interesting tions and their measurements (Table 19); and, characteristics of the beauty dataset, de- 2) data on pollutants by districts (Tables 20). scribed in Table 1. On the one hand, two of Other datasets are not described due to lack the four attributes of the table contains cate- of space. gorical values (discrete values) for which the In Table 19, we have a large number of cate- 96 N variable type mean median std 0 altitude integer 2143.82 2485.00 1560.81 market and super market that sell groceries. 1 2 department district str str - - - - - - This category was described in Table 5. 3 station str - - - 4 date date - - - Table 21 shows statistics about product. 5 humidity float 62.60 62.00 226.54 6 latitude str - - - In the variable title, the most popu- 7 longitude str - - - 8 province str - - - lar product is potato. The most men- 9 10 temperature type float str 17.03 - 16.70 - 41.39 - tioned type is red-headed onion. The N 0 mode 3812.00 min 0.00 max 5192.00 NAs 0 %NAs 0 minimum price has an average value of 1 - - - 291839 1 2, 427P EN and varies between 0.57P EN 2 - - - 291839 1 3 CABO INGA - - 272508 1 and 13.00P EN . The maximum price is 4 - - - 0 0 5 100.00 5.00 45059.00 43820 0 on average 3.057P EN and varies between 6 12 46’ 17.86” - - 0 0 7 75 0’ 44.52” - - 0 0 0.71P EN and 14.00P EN . Finally, the av- 8 9 - 20.80 - -30.80 - 4974.20 291839 37258 1 0 erage price is 2.746P EN and fluctuates be- 10 Meteorologica - - 272508 1 tween 0.61P EN and 13.5P EN . Table 19: Statistics of stations dataset. N Var. 
type mean median std 0 title str - - - 1 type str - - - gorical attributes and represent the character- 2 min price float 2.427 1.5 2.339 istics of the monitoring stations. However, 3 max price float 3.057 2.0 2.828 4 av. price float 2.746 1.75 2.525 each of these monitoring stations measures 5 fecha date - - - two meteorological characteristics, which are N mode min max NAs %NAs represented by numerical values. These two 0 PAPA - - 0 0.0 1 CEBOLLA - - 0 0.0 attributes have characteristics such as the CABEZA mean, median, standard deviation, mode, and ROJA minimum and maximum values. 2 2.0 0.57 13.0 0 0.0 3 2.0 0.71 14.0 0 0.0 In contrast, the pollutant data by districts (c.f., 4 1.25 0.61 13.5 0 0.0 Table 20) contain a large amount of numeri- 5 - - - 0 0.0 cal data. As already mentioned in the previ- Table 21: Statistics of market dataset. ous descriptions, some of them are the mean, median, standard deviation, mode, minimum and maximum values. Newspapers.- Table 22 shows the most frequent content that is the text1, which is a given news N Var. type mean median std (we prefer not to share the content due to the 0 CO float 1.299 1.2 7.659 1 NO float 46.657 40.1 31.31 lack of space). Then, the most cited author, 2 NO2 float 18.084 16.5 9.719 section, and title are Carlos Battle, executive 3 NOX float 64.632 58.95 35.193 zone, world / Current and 5 tips for a startup 4 O3 float 7.661 5.4 9.329 to survive, respectively. It should be noted 5 PM10 float 115.786 107.25 55.978 6 PM2.5 float 35.145 28.9 27.485 that the percentage of missing values of con- 7 SO2 float 11.274 9.6 10.499 tent, author, and location are 81%, 82% and 8 date date - - - 95%. 9 horas TSTP - - - N mode min max NAs %NAs 0 1.4 0.0 410.9 64 0.02 Real Estate.- This dataset contains three different 1 27.0 1.2 263.4 708 0.24 datasets. We only show one table due to lack 2 19.6 0.1 164.3 720 0.24 of space. 
Described data are rents and/or real 3 63.9 0.8 328.7 708 0.24 estate sales, the data are mostly categorical. 4 0.5 0.3 198.3 9 0.0 5 94.5 0.0 948.0 10 0.0 In Table 23 we have only two attributes with 6 0.0 0.0 203.0 518 0.18 numerical values associated with the loca- 7 8.7 2.7 353.3 290 0.1 8 - - - 0 0.0 tion of the property. These data have statis- 9 - - - 1 0.0 tics such as mean, median, standard devi- ation, mode, minimum and maximum val- Table 20: Statistics of pollutants dataset. ues. On the other hand, we have another group of characteristics with categorical val- Markets.- Here we describe the statistics of a ues. Among them, the more frequent section 97 N variable type mean median std N variable type mean median std 0 rue str - - - 0 content str - - - 1 city str - - - 1 date date - - - 2 date date - - - 3 latitude float -12.087 -12.09 0.008 2 author str - - - 4 longitude float -77.01 -77.016 0.036 3 section str - - - 5 subtype str - - - 4 location str - - - 6 type str - - - N mode min max NAs %NAs 5 title str - - - 0 Av. Javier Prado Este - - 0 0.0 N mode min max NAs %NAs 1 San Isidro - - 0 0.0 2 - - - 0 0.0 1 texto1 - - 1264992 0.81 3 -12.091415 -12.1 -12.067 0 0.0 2 Carlos - - 1275035 0.82 4 -77.003755 -77.072 -76.949 0 0.0 Batalla 5 JAM HEAVY TRAFFIC - - 28659 0.08 6 JAM - - 0 0.0 3 zona- - - 0 0 ejecutiva 4 mun./act. - - 1480648 0.95 Table 24: Statistics of alerts dataset. 5 5 cons. - - 0 0 para que 12.091415, 77.003755 where alerts of una startup sobre. type jam and subtype jam heavy traffic are of- ten reported. Table 22: Statistics of newspapers dataset. Table 25 shows the street and the node where most congestion is reported Av. Circun- variable is renting. We can also see that prop- valacin del Golf Los Incas in the la Molina erties of 100m2 are the most ”offered”, using district with an average traffic level of three the attribute area. Regarding the price, the on a scale of one to five. Finally, it shows the most frequent value is $900. 
It is important average speed of 2.59 Km/h. to note that, although we have a large amount N variable type mean median std of null data (NAs), these represent a rather 0 calle str - - - 1 ciudad str - - - low percentage due to a large amount of data 2 fecha date - - - 3 latitud float -12.084 -12.086 0.009 available. 4 longitud float -76.995 -76.993 0.035 5 nivel de trafico integer 3.286 3.0 0.713 N variable type mean median std 6 nodo str - - - 0 title str - - - 7 velocidad float 2.597 2.369 1.302 1 section str - - - N mode min max NAs %NAs 2 description str - - - 0 Av. Circunvalacin - - 0 0.0 del Golf Los Incas 3 location str - - - 1 La Molina - - 0 0.0 4 area str - - - 2 - - - 0 0.0 5 price str - - - 3 -12.076636 -12.112 -12.061 0 0.0 6 longitude float -77.01 -77.03 0.07 4 -76.963088 -77.081 -76.941 0 0.0 7 latitude float -77.01 -77.03 0.07 5 3.0 1.0 5.0 2 0.0 8 date date - - - 6 Av. Circunvalacin El - - 0 0.0 N mode min max NAs %NAs Golf Los Incas 0 Alquiler de... - - 0 0.0 7 2.025 0.139 12.289 14288 0.01 1 alquiler - - 0 0.0 2 Rento... - - 2524 0.02 3 4 Ubicacin... 100 m - - - - 27912 2790 0.17 0.02 Table 25: Statistics of jams dataset. 5 US$ 900 - - 0 0.0 6 -77.03 -77.76 -76.13 0 0.0 7 -77.03 -77.76 -76.13 0 0.0 Stock Market.- This category has two different 8 - - - 0 0.0 datasets, which are money exchange and Table 23: Statistics of real estate dataset. stock market. The former is detailed in Ta- ble 26. The latter dataset is not described due Transportation.- We describe the statistics of the to the lack of space in the present work. datasets related to the terrestrial transporta- N variable type mean median std tion mode, As far as the dataset of air trans- 0 currency str - - - 1 purchase float 2.823 3.315 1.134 port is concerned, we do not show them due 2 sell float 2.891 3.397 1.432 to lack of space. For land transport, Table 24 3 date date - - - and 25 report the statistics on alerts and con- N mode min max NAs %NAs gestion, respectively. 
0 Swiss franc - - 264 0.26 1 0.031 0.025 4.513 146 0.15 Table 24 shows the most reported street, 2 0.034 0.0 4.827 264 0.26 3 - - - 0 0.0 which is Av. Javier Prado. In the city variable, that corresponds to the dis- Table 26: Statistics of money exchange dataset. trict, San Isidro is the most reported dis- trict. More precisely in the coordinates Concerning the money exchange dataset (c.f., 98 Tabla 26). We have the half of variables cat- egorical (currency and date) and the another half numerical (purchase and sell). It is worth noting that the mode of the currency attribute is Swiss franc. Please note that there are some null values (no data), which are represented by the NAs and the percentage is given by Figure 3: Alerts %NAs. In the present section, we have not described the Figure 4 shows the distribution of the 17700 statistical metrics of two data categories, which are alerts gathered with our platform. We note that medicaments and social networks because of lack Jam heavy traffic and Jam stand still traffic are the of space. In the next section, we propose an ex- most reported alerts with 5500 alerts each one. ample of Urban Analytics to show the potential of our datasets. 6 Traffic congestion application To show a possible application of our datasets, we take the Congestion (c.f., Table 13) and Alerts (c.f., Table 14) datasets of the Transport category to analyze traffic jams in a given district of Lima. Consequently, we filter the congestion reports and alerts of Lince district. Then, we select all records produced in this district. Finally, we extract a CSV file containing traffic data to analyze it. In the present example, we make a visual analy- sis of congestion to show the enormous potential- ity of our dataset. Subsequently, we rely on Qlik7 to depict Figure 2. Figure 4: Heatmap of reported alerts (top), real estate (middle) and business locations (bottom) in Lince district. 
Figure 2: Minimum, maximum, and average speeds in different streets and avenues of Lima

Figure 2 shows the traffic level in red, and the minimal and maximal speeds in blue and yellow, respectively. This analysis was done for five different avenues and streets. As we can see, the lower the speed, the higher the traffic level. Another interesting fact is that the reported maximal and minimal speeds in Paseo de la República Avenue have the same value (i.e., 1.6 Km/h), which indicates high congestion in this avenue.

Finally, using Fusion Tables [8], we can draw a geolocated heatmap of reported alerts and jams in Figure 4A. As we can see, the most reported and congested part is the interchange between Javier Prado Avenue and the Paseo de la República highway, at the bottom right of the figure. We have analyzed this segment of the city to study the traffic level between this important interchange and the location of our university (i.e., Universidad del Pacífico), in the upper left part of the figure.

A simple analysis is already interesting, but crossing datasets could reveal useful insights. Therefore, we use datasets from real estate and business locations. The former is the dataset described in Table 23. The latter is a private dataset containing the code, latitude, and longitude of businesses. Relying on these datasets, we measured the influence of real estate and business locations on congestion. Figure 5 shows a heatmap (where light and strong blue mean short and large distance, respectively) of the distance among the locations of the datasets. It is possible to see that alerts are influenced by real estate and business locations, whereas jams are influenced only by real estate locations.

Figure 5: Heatmap of reported alerts (top) and jams (bottom) in Lince district.

[7] Qlik: qlikid.qlik.com
[8] Fusion Tables: sites.google.com/site/fusiontablestalks

7 Conclusion

In the present work, we have described the architecture of a Web crawler platform to gather information about nine different categories of datasets for urban analytics. The main contribution of this work is the provision of information to the scientific community and policy makers for analyzing and studying social behavior and urban phenomena in a developing country such as Peru. We have collected data following a privacy-aware structure. Sensitive information about citizens or brand names has been sanitized. These datasets have enabled us to implement a Knowledge Tier Platform for Graph Mining (Nunez-del Prado et al., 2016) and to perform urban analytics (Di Clemente et al., 2017) or study urban resilience (Abbar et al., 2016).

In the future, we plan to extend the crawler to collect more information from new publicly available Web sites. We also plan to make a privacy risk analysis of the described datasets.

References

Sofiane Abbar, Tahar Zanouda, and Javier Borge-Holthoefer. 2016. Robustness and resilience of cities around the world. arXiv preprint arXiv:1608.01709.

Gianni Barlacchi, Marco De Nadai, Roberto Larcher, Antonio Casella, Cristiana Chitic, Giovanni Torrisi, Fabrizio Antonelli, Alessandro Vespignani, Alex Pentland, and Bruno Lepri. 2015. A multi-source dataset of urban life in the city of Milan and the province of Trentino. Scientific Data 2.

Riccardo Di Clemente, Miguel Luengo-Oroz, Matias Travizano, Bapu Vaitla, and Marta C Gonzalez. 2017. Sequence of purchases in credit card data reveal life styles in urban populations. arXiv preprint arXiv:1703.00409.

Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. 2014. De-anonymization attack on geolocated data. Journal of Computer and System Sciences 80(8):1597–1614.

Steven Gray, Oliver O'Brien, and Stephan Hügel. 2016. Collecting and visualizing real-time urban data through city dashboards. Built Environment 42(3):498–509.

Miguel Nunez-del Prado, Edgardo Bravo, Miguel Sierra, Miguel Canchay, and Isaias Hoyos. 2016. Knowledge tier platform for graph mining in (smart) cities. In Proceedings of Symposium on Information Management and Big Data.

Nikolaos Panagiotou, Nikolas Zygouras, Ioannis Katakis, Dimitrios Gunopulos, Nikos Zacheilas, Ioannis Boutsis, Vana Kalogeraki, Stephen Lynch, and Brendan OBrien. 2016. Intelligent urban data monitoring for smart cities. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pages 177–192.

M Mazhar Rathore, Awais Ahmad, Anand Paul, and Seungmin Rho. 2016. Urban planning and building smart cities based on the internet of things using big data analytics. Computer Networks 101:63–80.

Henrique Santos, Vasco Furtado, Paulo Pinheiro, and Deborah L McGuinness. 2017. Contextual data collection for smart cities. arXiv preprint arXiv:1704.01802.

Ambey Kumar Srivastava. 2017. Segregated data of urban poor for inclusive urban planning in India: Needs and challenges. SAGE Open 7(1):2158244016689377.

Zheng Xu, Yunhuai Liu, Neil Yen, Lin Mei, Xiangfeng Luo, Xiao Wei, and Chuanping Hu. 2016. Crowdsourcing based description of urban emergency events using social media big data. IEEE Transactions on Cloud Computing.