         Knowledge Tier Platform for Graph Mining in (Smart) Cities
                   Miguel Nuñez-del-Prado Edgardo Bravo Miguel Sierra
                               Isaias Hoyos Miguel Canchay
                                     Universidad del Pacífico
                                      Av. Salaverry 2020
                                         Lima - Peru

                     Abstract

    In this paper, we present a knowledge tier platform to collect
    information from cities in the form of graphs. The platform enables
    people to share information about the area where they live, allowing
    them to report on pollution, crime levels, traffic jams, street
    topology, businesses, markets, etc. The main objective is to provide
    information about a city, stored in Elasticsearch, to find
    spatio-temporal patterns using graph mining techniques based on
    Apache Spark GraphX.

1   Introduction

In recent years, we have seen an explosion of data from online activity,
user-generated content, health, scientific computing, mobile phone
activity, etc. The volume of data increases with the daily transactions of
people in urban centers and keeps growing. By 2030, 60% of the world's
population will live in cities, with 27 megacities of more than 10 million
inhabitants (Chourabi et al., 2012). One way to cope with this growth is to
build new instruments for gathering and combining information continuously
(Hernández-Muñoz et al., 2011). Consequently, the number of collaborative
platforms for data collection is increasing. For instance, WebCar is a
platform that collects GPS data from vehicles to estimate traffic in a city
(Lo et al., 2008). In the field of human health, Psychlog (Gaggioli et al.,
2013) is a mobile phone platform designed to collect users' psychological,
physiological, and activity information for mental health research, relying
on a self-report questionnaire. A last example is an Internet site developed
to collect data for a multicenter study of ethical decision-making (Avidan
et al., 2005).

   In this paper, we present a knowledge tier platform to collect
information on cities in the form of graphs. This platform enables people to
share their knowledge of the area where they live, allowing them to report
on pollution, crime levels, traffic jams, street topology, businesses,
markets, etc. The primary objective is to provide information about the city
to find spatio-temporal patterns using graph mining techniques.

   The present paper is organized as follows. Section 2 introduces some
basic concepts, while Section 3 describes the platform architecture.
Sections 4 and 5 show some preliminary results and present the discussion
about the platform. Finally, Section 6 concludes the paper and presents
future work.

2   Basic Concepts

In this section, we introduce some basic concepts, such as graphs, knowledge
tiers and Spark, which are used to describe the platform.

2.1   Graph

A graph is a mathematical structure composed of vertices (also called nodes
or points) connected through edges (also called lines or arcs), as depicted
in Figure 1. A graph G = (V, E) is composed of a set V of vertices and a set
E of edges. In our context, this structure allows us to represent street
intersections as geo-referenced nodes and roads as edges.

2.2   Haversine distance

The Haversine distance (Shumaker and Sinnott, 1984) computes the shortest
(great-circle) distance between two points, given by their latitude and
longitude, on the Earth's surface.

$$
\begin{aligned}
dlon &= lon_2 - lon_1 \\
dlat &= lat_2 - lat_1 \\
a &= \left(\sin\left(\frac{dlat}{2}\right)\right)^2 + \cos(lat_1) \times \cos(lat_2) \times \left(\sin\left(\frac{dlon}{2}\right)\right)^2 \\
c &= 2 \times \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right) \\
d &= R \times c
\end{aligned}
\tag{1}
$$
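As an illustration, the Haversine distance of Section 2.2 can be sketched in a few lines of Python. This is a minimal sketch and not the platform's actual code: the function name, the mean Earth radius of 6371 km and the two sample coordinates are assumptions made for the example.

```python
import math

# Assumed mean Earth radius in kilometres; the paper does not fix a value for R.
EARTH_RADIUS_KM = 6371.0


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    # The trigonometric functions expect radians.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Guard against rounding errors pushing a slightly above 1.
    a = min(1.0, a)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return radius * c


# Two hypothetical geo-referenced nodes (street intersections) in Lima.
node_a = (-12.0464, -77.0428)
node_b = (-12.0931, -77.0465)
edge_length_km = haversine(*node_a, *node_b)
```

Here the distance between two geo-referenced nodes is used as the length of the edge joining them; the same function could serve to find the node nearest to any other geolocated observation.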

In Equation 1, lat and lon denote the latitude and longitude of the two
points, and R is the radius of the Earth.

           Figure 1: Example of a graph.

2.3   Knowledge Tiers

Since we are able to model the street network of a city in the form of a
graph, each node and edge can carry a weight representing a different
phenomenon of the city, such as: (1) congestion, (2) crime, (3) pollution,
(4) population density, (5) urban transportation, (6) subway network, etc.
Thus, for each phenomenon, we have a graph modeling this particular fact.
Finally, we can stack these graphs node by node, as depicted in Figure 2, to
obtain a knowledge stack.

            Figure 2: Knowledge tiers¹.

2.4   Apache Spark

Apache Spark is an open source cluster computing framework originally
developed at the University of California, Berkeley. The code was later
maintained by the Apache Software Foundation. Spark provides distributed
computation, taking charge of task dispatching, scheduling, and basic I/O
functionalities. These functionalities are available through Java, Python,
Scala and R interfaces.

            Figure 3: Spark framework.

   As shown in Figure 3, Apache Spark provides, at the top of its framework,
a tool for graph mining called GraphX². This API allows parallel graph
computation and integrates tools for extraction, transformation and load.
More detail about the architecture as well as the capabilities of Spark is
given in the next section.

3   System Overview

In this section, we describe the architecture of our platform. As
illustrated in Figure 4, our platform collects data from OpenStreetMap³
(OSM) to build the graph representing streets and intersections in the form
of comma-separated values (CSV) files. Then, these CSV files are stored in a
NoSQL database. We use Elasticsearch⁴ as the NoSQL database because it is a
scalable, flexible and performant search and analytics engine (cf. Figure 5).

            Figure 4: Example of a graph over streets.

   ¹ Fereshteh ASGARI, Inferring User Multimodal Trajectories from Cellular
     Network Metadata in Metropolitan Areas
   ² GraphX: http://spark.apache.org/graphx/
   ³ OSM: https://www.openstreetmap.org/
   ⁴ Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

   Once data is saved in the NoSQL database, we are able to analyze the
knowledge tiers, represented and combined in the form of graphs, through
Spark GraphX, as depicted in Figure 5. For instance, with this platform, we
could optimize supply chains in cities by minimizing cost, avoiding traffic
jams and passing through zones with low crime rates. We can also discover
spatial patterns to understand the common features of areas with high crime
rates in a city. All these analytics could be performed using programming
languages such as Scala⁵, Java⁶, Python⁷ or R⁸.
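To make the routing use case concrete, the sketch below combines several knowledge tiers into a single edge cost and searches for the cheapest route. It is a plain-Python illustration under stated assumptions, not the platform's GraphX implementation: the toy graph, the tier names, the importance factors and the linear combination of tiers are all hypothetical, and Dijkstra's algorithm is used here only as one possible search method.

```python
import heapq

# Hypothetical knowledge tiers: one weight per undirected street segment and phenomenon.
tiers = {
    "length_km": {("A", "B"): 1.0, ("B", "D"): 1.0, ("A", "C"): 1.5, ("C", "D"): 1.5},
    "congestion": {("A", "B"): 0.9, ("B", "D"): 0.8, ("A", "C"): 0.1, ("C", "D"): 0.2},
    "crime": {("A", "B"): 0.7, ("B", "D"): 0.6, ("A", "C"): 0.1, ("C", "D"): 0.1},
}
# Assumed importance of each tier in the combined cost.
tier_importance = {"length_km": 1.0, "congestion": 2.0, "crime": 3.0}


def combine_tiers(tiers, importance):
    """Collapse the stacked tiers into one adjacency list with a single cost per edge."""
    graph = {}
    for (u, v) in tiers["length_km"]:
        cost = sum(importance[name] * weights[(u, v)] for name, weights in tiers.items())
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    return graph


def cheapest_route(graph, source, target):
    """Dijkstra's algorithm; returns (total cost, node list) or (inf, []) if unreachable."""
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, edge_cost in graph.get(node, []):
            if neighbour not in settled:
                heapq.heappush(queue, (cost + edge_cost, neighbour, path + [neighbour]))
    return float("inf"), []


graph = combine_tiers(tiers, tier_importance)
total_cost, route = cheapest_route(graph, "A", "D")
```

In this toy example the longer route through C is preferred, because the shorter route through B is penalized by its congestion and crime weights. Changing the importance factors changes the trade-off between the tiers.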

            Figure 5: Overview of the Knowledge Tier Platform.

   Finally, we implement a Python script to visualize the result of the
pattern mining process using Google Maps⁹. In the next section, we present
some preliminary visualizations of graphs stored in the platform.

   ⁵ Scala: www.scala-lang.org
   ⁶ Java: www.java.com
   ⁷ Python: www.python.org
   ⁸ R: www.r-project.org
   ⁹ Google Maps: maps.google.com

4   Preliminary results

In this section, we present some preliminary results of the Knowledge Tier
Platform regarding data gathering and visualization.

   Concerning data collection, we have carried out two campaigns to collect
data from streets and tweets in Lima, Peru. The former campaign was performed
in the month of May, collecting approximately 100,000 nodes and 420,000
edges. The latter campaign was carried out from April to June, obtaining
approximately 7.1 million geolocated tweets.

   Regarding visualization, the platform allows plotting a graph over a
cartography, where the nodes are placed at the intersections of the streets
and the edges model the streets connecting nodes or intersections, as shown
in Figure 6.

            Figure 6: Visualization of the graph over the

            Figure 7: Visualization of the heatmap of tweets over the
            cartography.

   Heatmaps are another visualization option. In our case, heatmaps are
generated based on node weights. For example, Figure 7 presents a heatmap of
the tweets collected in the platform. It is worth noting that each tweet is
assigned to its nearest node, based on the latitude and longitude of both
nodes and tweets. We use the Haversine function as the distance function
(cf. Subsection 2.2). In the next section, we discuss the platform and
present our vision of its application to research on Smart Cities.

5   Discussion

We firmly believe in the potential of this project as the cornerstone to
enable new research directions. Graphs have been widely used to model
different kinds of phenomena ranging from: urban street

