A user-centered process for the analysis and visualization of open data sets

Marcia Tejeda1 and Diego Torres1,2[0000−0001−7533−0133]

1 Dto. CyT, UNQ, Roque Sáenz Peña 352, Bernal, Argentina tejedamarcia@gmail.com
2 LIFIA, CICPBA-Facultad de Informática, UNLP, La Plata, Argentina diego.torres@lifia.info.unlp.edu.ar

Abstract. Open data is growing all the time throughout the world. Open government is advancing, and more and more open data portals are available for anyone to consult. Joining and combining two or more data sources can provide information or knowledge that was not previously available. Combining different datasets usually calls for statistical and computational techniques such as data mining and machine learning. Although this practice is currently carried out by data science professionals, this work invites the community in general to do it through a user-centered process. This article presents such a process and a web tool that implements it.

Keywords: Open datasets · User-centered process · Data visualization.

1 Introduction

Currently, a large number of public administrations and non-governmental organizations in the world are opening their data so that any person or entity can use them [1, 2]. Projects that combine smart cities and the Internet of Things present a scenario of proliferation of open and standardized data, where IoT and open data meet interoperability and open standards [3, 4]. Open data is published as data sets (datasets), which can cover different areas: science [5], economy [6], transport [7], education. Open data [8] is data that anyone can access, use and share. It can come from any source and cover different topics: science, technology, economy, finance, education, among others [2].
However, the lack of aggregation and visualization facilities makes open data difficult for users to understand and manipulate [9]. Combining different data sources increases the amount of information that can be extracted from open data. To a large extent, open data is produced for a specific purpose. However, combining two open datasets that describe events in the same geographic area can generate new interpretations, for example combining cases of disease infection with descriptions of housing and economic development. Interoperability is one of the goals of open data [4]. Through the use of standard formats, it is possible to connect computer systems through data sharing. Open data presents great opportunities to be combined with studies other than the ones that originated it, since this speeds up research and provides scientific and social advances of great impact. The use of open data during the advance of COVID-19 has demonstrated its importance [10, 11].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Information is processed data that may have some kind of utility or value. Converting data into information involves producing knowledge and understanding that were not previously available [12]. Taking advantage of open data and the information it provides, new knowledge could be generated by studying the relationships between different datasets. This is possible using tools from the fields of statistics and computing: data mining [13, 14] and machine learning [14, 15]. It is difficult for people without analysis skills to process the large amount of available open data. There are some alternatives, like the ones described in [9]. However, allowing the general public to interpret open data is a constant concern of governments [16].
There are some approaches that do not require programming skills. Tableau enables end users to create visualizations and organize information. Other approaches for end users simplify viewing but still require some advanced technical knowledge [17]. Google Maps is a well-known tool for combining georeferenced data; however, it has limitations [18].

This work focuses on presenting a process that allows the analysis and visualization of open georeferenced data [19] from a user-centered perspective. We propose a tool that allows relating georeferenced datasets so that they can be combined and analyzed in a simple way. The approach is aimed at people with no programming skills. The strategy is to accompany users through a process that involves viewing the datasets, analyzing them on their own, combining them with other datasets available in the application, and then analyzing and displaying the resulting information on maps. The process combines transformation and analysis functionalities, such as clustering algorithms, that are generally offered as software modules to be manipulated by developers.

This article is organized as follows. Section 2 describes the proposed approach to combining and visualizing datasets. The whole merge and display process is described in Section 3, which includes implementation details. Finally, Section 4 presents the conclusions and future work.

2 Motivation and approach

The generation of open data allows its free use for analysis, visualization and, possibly, use in other contexts. Open science and citizen science are contexts in which a large number of datasets are generated, as are governments with open data policies.
It is natural for datasets to be generated in a specific context and then released for public use; however, their use and combination is complex. For example, the Encuesta Permanente de Hogares (Permanent Household Survey) in Argentina 3 lists the housing characteristics of Argentina, and the citizen science project GeoVin 4 records appearances of the insect vector of Chagas disease, endemic in Latin America. The combination of both datasets could be interesting for analyzing whether there is any relationship between the number of vector insect occurrences and the housing conditions in the region, even though neither of the datasets was conceived in terms of the other. Thus, it is possible to combine different datasets to analyze an endless number of variables.

The focus of this work is to propose a user-centered, easy-to-use tool for the analysis and visualization of open data. The emphasis is on defining a usable process that articulates strategies for combining datasets that were created in isolation from each other, and strategies for displaying the combined data. The following sections first describe the basic tools for combining and displaying datasets in isolation. Then the user-centered process is described, showing how it abstracts from the combination and display algorithms and turns them into simple utilities of a larger process.

2.1 Combination of datasets

The main goal of dataset combination is to add value to a data set with values from another data set. This makes it possible to take advantage of different datasets together, increasing their potential and generating more valuable information. We propose the design of three types of strategies for the combination of datasets. All three follow the philosophy of ontology alignment [3], based on reconciling ontologies, in this case datasets.
Alignment consists of finding relationships between concepts that belong to different sources [20]. Each merge strategy takes two datasets and returns a single one with the result of the merge. For this, we define a dataset as a matrix, or table, where the columns indicate characteristics and each row holds the values of one element. As datasets are geolocated, each row contains coordinates and the characteristics of the element found at those coordinates. For example, if dataset A contains 3 elements, each with a latitude, a longitude, and the characteristics CarA and CarB, then we can write the dataset as A(3×4), since A has 3 rows (one for each element) and 4 columns (latitude, longitude, CarA and CarB). More generally, a dataset D(m×n) has m rows and n columns.

3 https://www.indec.gob.ar/indec/web/Institucional-Indec-BasesDeDatos, accessed on July 29, 2020.
4 http://geovin.com.ar, accessed on July 29, 2020.

The proposed strategies consist of aligning each row of a dataset with one or more rows of another dataset. If a row is aligned, a new row is generated in the result dataset, where the characteristics of the first row are concatenated with those of the aligned row. This gives the idea of augmenting the original row with information from the aligned one. In other words, if the data set A(n×m) is combined with the data set B(n′×m′), the result is a new data set C(p×q) where n ≤ p ≤ n · n′ and m ≤ q ≤ m + m′. Three alignment strategies are presented below, although the list could be longer.

Closest point In this combination, for each element located at a point of the dataset to which information will be added -the 'base' set-, the strategy finds the element at the closest point (up to a parameterizable maximum distance) in the data set to be added. Once the datasets have been combined, this results in a single row with the data from both points.
When applying this combination, two nearby points become one that contains the information of both. In other words, the information about the elements found at that point has been enriched. If the base dataset is B(n×m) and the dataset to be added is S(j×k), the resulting dataset will have at most the shape R(n×(m+k+1)), since it keeps the number of rows of the base dataset, concatenates the columns of the added dataset, and includes an additional column with the distance information. If a base row has no nearby point, it is removed from the result. Both files must have georeferenced information. This combination has one parameter, the maximum distance at which the closest point may be, which controls how the information of the points is augmented. If this were not parameterizable, every point would have a closest point even if they were many kilometers apart, which could produce unwanted information. The distance to the closest point is also saved for each row of the resulting dataset.

Radial distance In this case, for each point of one data set, the points of another data set that lie within a certain parameterizable distance are searched, thus forming a circle of given radius around each point. The size of the circle around a point is defined by a distance parameter. Both datasets must have georeferenced information. If the base dataset is B(n×m) and the dataset to be added is S(j×k), the resulting dataset will have at most the shape R((n·j)×(m+k+1)), since in the worst case each row of the base dataset relates to all the rows of the dataset to be added. The number of columns follows the logic of the previous combination.
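As a minimal sketch of the closest-point strategy (the function and variable names here are illustrative assumptions, not taken from the prototype's actual code), the combination of two small georeferenced datasets could look like:

```python
# Illustrative sketch of the "closest point" combination; names and
# toy data are invented for this example.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def closest_point_merge(base, add, max_km):
    """For each row (lat, lon, *features*) of `base`, concatenate the
    features of the nearest `add` row within `max_km`, plus the distance
    itself. Base rows with no neighbour inside `max_km` are dropped,
    as described in the text."""
    result = []
    for row in base:
        best, best_d = None, max_km
        for cand in add:
            d = haversine_km(row[0], row[1], cand[0], cand[1])
            if d <= best_d:
                best, best_d = cand, d
        if best is not None:
            result.append(row + best[2:] + (best_d,))
    return result

# Invented toy data: housing records and insect sightings, as
# (lat, lon, feature) tuples.
base = [(-34.710, -58.280, "houseA"), (-34.900, -57.950, "houseB")]
add = [(-34.712, -58.279, 5), (-40.000, -62.000, 9)]
merged = closest_point_merge(base, add, max_km=2.0)
# Only "houseA" has a sighting within 2 km, so `merged` has a single
# row shaped (lat, lon, house feature, sightings, distance).
```

The radial-distance strategy follows the same scheme, except that every `add` row within the radius produces a result row, so a base row may appear several times in the result.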
Equal characteristic This combination adds the information of a data set (not necessarily georeferenced) to a georeferenced data set based on some similarity between their columns other than the location. The size of the result is bounded in the same way as in the radial distance combination.

2.2 Configurations and visualizations

In addition to the combination of datasets presented in the previous section, datasets can be displayed, in their original version or in a combined one. We present three ways of displaying maps.

– Simple map: Shows on a map the detail of the information contained in the rows located at the position indicated by the latitude and longitude.
– Layered map: Displays the information of two datasets on a map at the same time. Each set is displayed on a different layer of the map. Layers can be viewed individually or together.
– Clustered map: Shows, for a single data set, the information of the points it contains, grouped in clusters. The data can be analyzed with different clustering algorithms. One of them is KMeans [15], an unsupervised algorithm that takes a parameter K defining the number of clusters (groups) into which the information should be grouped. Based on this parameter, the algorithm divides the information into K groups according to their characteristics (see Fig. 1). The other algorithm applied in this work is MeanShift [21]. It also groups information but, unlike KMeans, it receives no parameters and decides, based on the information it is classifying, how many clusters to form.

Fig. 1. Clustered map
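Both clustering algorithms are available in scikit-learn, the library the prototype uses (Section 3.1); the following sketch, on invented coordinates, shows the parameter difference between them:

```python
# Sketch of the clustered-map analysis with scikit-learn; the
# coordinates below are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans, MeanShift

# One (latitude, longitude) row per georeferenced record,
# forming two spatially separated groups.
points = np.array([
    [-34.700, -58.270], [-34.710, -58.280], [-34.705, -58.275],
    [-34.920, -57.950], [-34.910, -57.960], [-34.915, -57.955],
])

# KMeans requires the number of clusters K as a parameter.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# MeanShift takes no such parameter: it estimates the number of
# clusters from the data itself.
meanshift = MeanShift().fit(points)

# kmeans.labels_ assigns each point to one of the K groups, which the
# tool can then draw as clusters on the map.
```

In the tool, the user only chooses the algorithm (and K for KMeans); the fitting and the rendering of the resulting groups on the map are hidden behind the visualization process.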
3 Combining and displaying open datasets process

All the activities of manipulating open datasets, analyzing them and visualizing them are simplified, from the user's perspective, as a sequence of processes, illustrated in Fig. 2. The figure shows the sequence of, and communication between, processes and sub-processes with specific tasks, for example those dedicated to importing files in CSV (comma-separated values) or JSON format.

Fig. 2. Combining and displaying open datasets process

As stated before, the general process is geared towards simplifying dataset manipulation tasks so that a person without programming skills can analyze datasets. The processes and sub-processes act as black boxes, providing a high level of abstraction and increasing the simplicity of the activity. The processes are described below in order:

Fig. 3. Combining process screens.

– Dataset import: Allows the user to add a dataset file to the system. The user does not need to understand the format of the data set. This process includes sub-processes to decode different formats, for example CSV or JSON. These sub-processes can be extended to support more formats.
– Geo normalization: Once the datasets have been imported, the columns representing the geolocation information must be detected. If they cannot be detected automatically, the user is asked to indicate the column(s) with the latitude and longitude information.
– Storage: The normalized datasets are stored in the system database, where they are available to the following processes.

Fig. 4. Layered map visualization screens.
– Transformation: This process modifies the general structure of a data set. Sub-processes include renaming columns, transforming the values or formats of a column, and deleting a set of columns.
– Combination: This process encompasses the operations that combine datasets. The sub-processes included are those described above as Closest point, Radial distance and Equal characteristic. New ones can also be added. Fig. 3 shows the sequence of screens used to combine two datasets through the closest-point strategy. The final screen shows a field where the maximum distance is indicated and, below it, a preview of the result before saving the changes. The preview was considered a relevant functionality, since much of the analysis work requires a trial-and-error stage.
– Visualization: This is the final process. At this point, the user decides how to visualize the work done with the datasets, whether simple or combined. The sub-processes seen in the figure correspond to the visualizations previously described and, as with the combination ones, they can be extended. As an example, Fig. 4 shows the steps required, after selecting two datasets, to visually combine them on a map by overlaying layers. On the left of the figure is the visualization option; on the right, the result on a map that includes the points, with a control (in the upper right corner) to show both datasets simultaneously or one at a time.

3.1 Prototype

Both the process and the prototype have been developed through proofs of concept and interviews with open data users and professionals in disciplines other than software engineering. In particular, the people with whom the evaluations
and interviews have been carried out are dedicated to sociology, economics and different studies related to student mobility in universities.

The developed prototype is implemented as a Web application. It is defined as a backend and frontend client-server architecture, which communicate through a RESTful API. The backend is built with Django REST Framework in Python 3, stores the datasets in an sqlite3 database, and uses Pandas dataframes together with scikit-learn (clustering and LabelEncoder) for the combination and clustering functionalities. React, a JavaScript library developed and maintained by Facebook, was chosen as the frontend tool, together with Leaflet (via react-leaflet) for maps and Material-UI for the interface. Fig. 5 details the organization of the architecture.

Fig. 5. Architecture

4 Conclusions and future work

Open data sets have proliferated in recent times so that any citizen can consume them; however, the volume of data in these sets requires easy-to-use tools. This work presents a user-centered process model to analyze, combine and visualize open data sets. It is modularized into sub-processes, which can be extended. The approach is bundled with an implementation of basic combination, visualization and clustering capabilities.

As future work, we highlight the need to carry out usability evaluations with a significant number of users, since the present work includes only conceptual tests at the moment. It is also desirable to incorporate more combination and visualization sub-processes.

References

1. Davies, T., Perini, F., Alanso, J.: Researching the emerging impacts of open data (2013)
2. Kitchin, R.: The Data Revolution. Sage Publications Ltd (2014)
3. Ahlgren, B., Hidell, M., Ngai, E.C.H.: Internet of things for smart cities: Interoperability and open data. IEEE Internet Computing 20(6) (2016) 52–56
4.
Domingo, A., Bellalta, B., Palacin, M., Oliver, M., Almirall, E.: Public open sensor data: Revolutionizing smart cities. IEEE Technology and Society Magazine 32(4) (2013) 50–56
5. Arza, V., Fressoli, M., López, E.: Ciencia abierta en Argentina: un mapa de experiencias actuales. Ciencia, Docencia y Tecnología 28(55) (2017)
6. Mouromtsev, D., d'Aquin, M.: Open Data for Education: Linked, Shared, and Reusable Data for Teaching and Learning. Springer Verlag (2016)
7. Kujala, R., Weckström, C., Darst, R.K., Mladenović, M.N., Saramäki, J.: A collection of public transport network data sets for 25 cities. Scientific Data 5 (2018) 180089
8. The Open Data Handbook
9. Saddiqa, M., Larsen, B., Magnussen, R., Rasmussen, L.L., Pedersen, J.M.: Open data visualization in Danish schools: a case study (2019)
10. Amaro, R.E., Mulholland, A.J.: A community letter regarding sharing biomolecular simulation data for COVID-19. Journal of Chemical Information and Modeling (2020)
11. Moorthy, V., Restrepo, A.M.H., Preziosi, M.P., Swaminathan, S.: Data sharing for novel coronavirus (COVID-19). Bulletin of the World Health Organization 98(3) (2020) 150
12. Engvall, E.B.T.: Open data? Data, information, document or record? Emerald Group Publishing Limited (2014)
13. Maimon, O., Rokach, L.: Data Mining and Knowledge Discovery Handbook (Second Edition). Springer Verlag (2010)
14. Sammut, C., Webb, G.I.: Encyclopedia of Machine Learning and Data Mining. Springer Verlag (2017)
15. Witten, I.H., Frank, E.: Data Mining - Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)
16. Sieber, R.E., Johnson, P.A.: Civic open data at a crossroads: Dominant models and current challenges. Government Information Quarterly 32(3) (2015) 308–315. https://doi.org/10.1016/j.giq.2015.05.003
17. Ahrens, J., Geveci, B., Law, C.: ParaView: An end-user tool for large data visualization.
The Visualization Handbook 717 (2005)
18. McQuire, S.: One map to rule them all? Google Maps as digital technical object. Communication and the Public 4(2) (2019) 150–165. https://doi.org/10.1177/2057047319850192
19. Hill, L.L.: Georeferencing - The Geographic Associations of Information. Cambridge, MA: The MIT Press (2006)
20. Euzenat, J.: An API for ontology alignment. In: International Semantic Web Conference, Springer (2004) 698–712
21. Anand, S., Mittal, S., Tuzel, O., Meer, P.: Semi-supervised kernel mean shift clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(6) (2013) 1201–1215