=Paper=
{{Paper
|id=Vol-2841/BigVis_1
|storemode=property
|title=BigDataVoyant: Automated Profiling of Large Geospatial Data
|pdfUrl=https://ceur-ws.org/Vol-2841/BigVis_1.pdf
|volume=Vol-2841
|authors=Pantelis Mitropoulos,Kostas Patroumpas,Dimitrios Skoutas,Thodoris Vakkas,Spiros Athanasiou
|dblpUrl=https://dblp.org/rec/conf/edbt/MitropoulosPSVA21
}}
==BigDataVoyant: Automated Profiling of Large Geospatial Data==
https://ceur-ws.org/Vol-2841/BigVis_1.pdf
BigDataVoyant: Automated Profiling of Large Geospatial Data

Pantelis Mitropoulos, Geospatial Enabling Technologies, Greece, pmitropoulos@getmap.gr
Kostas Patroumpas, Athena Research Center, Greece, kpatro@athenarc.gr
Dimitrios Skoutas, Athena Research Center, Greece, dskoutas@athenarc.gr
Thodoris Vakkas, Geospatial Enabling Technologies, Greece, tvakkas@getmap.gr
Spiros Athanasiou, Athena Research Center, Greece, spathan@athenarc.gr

ABSTRACT

We envisage an open, extensible, and scalable data profiling framework over various types of geospatial data, including vector, raster, and multidimensional assets. In this paper, we outline our work in progress regarding the design and implementation of BigDataVoyant, a software platform for profiling big geospatial data. This software is able to ingest data in various spatial formats and reference systems. Its main goal is to extract and visualize a large variety of metadata and descriptions about data quality and characteristics, both in an interactive and in a fully automated manner. We suggest a processing flow for such profiling and discuss a preliminary, yet comprehensive list of metadata items already supported by the open-source software prototype we are implementing. Finally, we outline open issues and extensions of the proposed framework to broaden its usefulness and strengthen its appeal to the geospatial data community.

1 INTRODUCTION

As the availability, volume, and variety of data from different sources grow, it is crucial for stakeholders to assess their relevance and suitability for a given type of analysis, application, or service. Data profiling comprises a collection of operations and processes for extracting metadata from a given dataset, thus facilitating decisions on its potential utilization. Such metadata may involve schema information, statistics, samples, or other informative summaries over the data, thus offering extensive and objective indicators for assessing datasets in terms of business value, fitness for purpose, and quality. In addition, a variety of visualizations (tables, charts, graphs, timelines, etc.) may be applied against this metadata to convey the significance and interpret the value of the underlying data.

In proprietary data catalogues, organizational data lakes, or scientific repositories, each containing numerous and perhaps heterogeneous data assets, offering users search capabilities against rich collections of extracted metadata can greatly facilitate data discovery. Exploring the schema, semantics, and actual contents of open data repositories through such profiles can also help evaluate their usefulness in data integration tasks. For data traded in marketplaces, potential consumers can examine such profiles (especially any quality indicators) on the available datasets, compare them, and then determine whether to purchase and which one(s) to choose.

This is all the more important for geospatial data, represented not only as vectors, but also in raster or multidimensional formats. In the context of the OpertusMundi project^1, we build a trusted, robust, and scalable pan-European geospatial data marketplace. Metadata extracted from such datasets can indicate their coverage, timeliness, consistency, and completeness, along with other domain-specific properties (topology, scale, resolution, etc.) crucial for the entire lifecycle of spatial asset provision, discovery, sharing, purchasing, and use.

Data profiling as a means of exploring, analyzing, and interpreting big data assets has attracted a lot of interest over the past decade from researchers and practitioners alike. Challenges and important use cases have been presented in recent surveys [1, 9], which, however, focus on relational data, overlooking the specific requirements of geospatial data. Among the main challenges recognized is the ability to handle input data from heterogeneous sources and to deal with the computational complexity and scalability of profiling functionalities. Interpreting the output profiles is also challenging, since it typically requires domain expertise and may depend on the use case (e.g., exploration, integration, cleansing). Another recent survey [3] focuses on methods and systems that enable automatic indexing and interactive searching over large collections of datasets to find those that fit users' needs. Discovering dependencies between attributes in relational data has also led to interesting techniques; for instance, algorithms for simultaneously identifying unique column combinations, inclusion and functional dependencies [6], as well as order dependencies [4], have been proposed. Prototype systems have also been developed for data profiling. Data Civilizer [5] extracts profiling signatures and creates a metadata graph to facilitate discovery of joinable datasets or those relevant to user tasks. The GOODS platform [8] can also infer provenance metadata and annotation tags to enable efficient and scalable discovery of datasets. Moreover, modern platforms for data science tasks like Kaggle^2 or digital marketplaces like Dawex^3 include data profiling mechanisms, as well as keyword-based search for data discovery.

Admittedly, the aforementioned generic schemes lack inherent capabilities for geospatial data profiling. Yet, GIS software platforms already support Exploratory Spatial Data Analysis (ESDA). ESDA employs spatial mining and analysis tools [2, 10] that allow users to visualize spatial distributions, identify outliers, discover patterns like clusters or hot spots, etc. Such tools are widely used in full-fledged GIS platforms like ArcGIS^4 and QGIS^5 or

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 https://www.opertusmundi.eu/
2 https://www.kaggle.com/
3 https://www.dawex.com/en/
4 https://www.esri.com/arcgis/
5 https://qgis.org/
geospatially-aware DBMSs like PostGIS^6 or Oracle Spatial and Graph^7. Although powerful in capabilities, data profiling in GIS is not streamlined, but may involve manual execution of a series of operations. Sometimes, step-by-step user intervention is necessary: invoking a given functionality (e.g., heatmap creation), specifying parameters, choosing map rendering options, etc. The cost of purchasing a software license may also be a hindrance, as may the required usage or programming skills.

In contrast, we propose BigDataVoyant, an open-source, interactive, modular, and extensible framework specifically tailored for in-depth profiling of geospatial data. This platform aims to extract metadata from different types of spatial assets (vector, raster, multidimensional data) and allow data scientists to control its degree of automation. In its interactive, step-by-step mode, the analyst configures each operation, inspects its results, and optionally invokes it again with revised settings for advanced, detailed data exploration. In a fully automated mode, when triggered by inexperienced users or involved in a broader processing workflow (e.g., a pipeline), a default or user-preconfigured parametrization may be applied to run profiling as a batch job. In either mode, metadata items become progressively available and can be graphically visualized, thus expediting spatial data exploration. With BigDataVoyant, we aim to support efficient and robust profiling of large geodatasets with modern processing schemes (data partitioning, distributed processing) and provide it as a service with a RESTful API. Overall, we deem that the rich and extensible collection of automated metadata generated by BigDataVoyant will offer a powerful and comprehensive means of assessing quantity, quality, and variety in spatial data assets for mission-critical applications and services.

2 PROFILING FRAMEWORK

As illustrated in Figure 1, we envisage a data profiling framework that can handle large geospatial datasets of different types. We propose to automatically extract metadata not only from vector or raster data assets available in a variety of sources and formats (files, DBMSs, or WFS services), but also from multidimensional scientific data having a spatial reference (like meteorological, hydrological, or other sensor measurements). The profiler engine is the core processing component. Using an extensible set of software libraries and APIs, it can ingest a geospatial dataset and apply a set of processing tasks. Each task computes metadata according to a configuration specified by the analyst or automatically determined by the system, concerning the target metadata and parameter settings for the task. The resulting metadata include statistics in JSON format, geospatial features in Well-Known Text or Binary (WKT/WKB) representations, maps as PNG images, etc. All extracted metadata is stored in a repository to be used for dataset search and reporting. This metadata is also interactively visualized in a dashboard as lists, graphs, charts, and maps of various types, uncovering latent patterns, trends, and even issues (e.g., outliers, inconsistent or missing values) with the data. Employing a human-in-the-loop paradigm, the data scientist can steer or fine-tune the execution by choosing specific metadata items of interest and adjusting their parameters in order to delve into more detailed inspection of data characteristics.

Figure 1: Processing flow for automated geodata profiling.

Next, we present a non-exhaustive list of metadata items that can be automatically extracted from several types of geospatial data. We also discuss how such metadata can be visualized as lists, maps, charts, etc.

2.1 Profiling of Vector Data Assets

Vector datasets may include thematic attributes, whereas geometry is typically stored in WKT, WKB, or BLOB. Profiling of vector data can be applied in different fashions to automatically compute metadata regarding the entire data asset, its spatial attribute (geometry), or its thematic attributes, as listed in Table 1. Next, we briefly outline each category of such metadata.

Table 1: Automated metadata computed over vector data

Metadata           | Description                                | Scope
Native CRS         | Coordinate reference system (EPSG)         | Geometry
Spatial extent     | Rectilinear MBR covering all features      | Geometry
Feature count      | Number of rows (records)                   | Dataset
Convex Hull        | Convex polygon enclosing all features      | Geometry
Concave Hull       | Concave polygon enclosing all features     | Geometry
Attribute names    | List of column names of all attributes     | Thematic
Attribute types    | List of data types of all attributes       | Thematic
Cardinality        | Count of NOT NULL values per attribute     | Thematic
Value pattern      | Semantic domain of values                  | Thematic
Value distribution | Frequency histogram of attribute values    | Thematic
𝑁-tiles            | Values dividing data into 𝑁 equal parts    | Thematic
Distinct values    | List of categorical values                 | Thematic
Frequent values    | Top-𝑘 most frequent categorical values     | Thematic
Attr. dependencies | Key, functional, conditional dependencies  | Thematic
Heatmap            | Colormap with varying intensity            | Geometry
Clusters, outliers | Spatial clusters of features (e.g., POIs)  | Geometry
Samples            | Data portion(s) of limited size/extent     | Dataset

2.1.1 Entire dataset. This examines the vector data as a whole. For instance, it may involve computation of simple numerical values (e.g., feature count) or extraction of data samples, either for the entire area or a user-specified one, e.g., to allow the analyst or a potential customer to examine data quality for an area of their interest before using or purchasing an asset.

2.1.2 Spatial attributes. These are automatically computed against a geometry attribute, usually projected according to a known Coordinate Reference System (CRS)^8. The spatial extent is captured by the Minimum Bounding Rectangle (MBR, BBOX) of the vectors, but is often insufficient or misleading due to the "dead space" induced by its rectangular shape. Thus, the convex and concave hulls of the geometries may provide further insight, allowing users to illustrate and inspect them together with the

6 https://postgis.net/
7 https://www.oracle.com/database/technologies/spatialandgraph.html
8 Typically defined according to the EPSG registry: https://www.epsg.org
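As an illustration of the geometry-scope metadata in Table 1, the spatial extent (MBR) and the convex hull can be sketched in plain Python over point coordinates; this is a didactic sketch using Andrew's monotone chain, not the actual GeoVaex-based implementation, which also handles concave hulls and arbitrary OGC geometry types.

```python
def mbr(points):
    """Minimum Bounding Rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def _cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means a left turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Convex hull (counter-clockwise) via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for chain, seq in ((lower, pts), (upper, reversed(pts))):
        for p in seq:
            while len(chain) >= 2 and _cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
    return lower[:-1] + upper[:-1]

pts = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]
bbox = mbr(pts)            # rectilinear extent of all features
hull = convex_hull(pts)    # interior point (2, 1) is excluded
```

Note how the hull discards the interior point, conveying the actual coverage more faithfully than the rectangular extent, which is exactly the "dead space" argument made above.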
Table 2: Automated metadata computed over raster data

Metadata           | Description                                    | Scope
Native CRS         | Coordinate reference system (EPSG)             | Dataset
Spatial extent     | Rectangle with the image bounds                | Raster
Resolution         | Pixel size (e.g., in meters) per axis          | Raster
Width, Height      | Number of pixels per axis                      | Raster
Bands              | Count of the available bands                   | Raster
Value distribution | Histogram(s) of raster values                  | Band
Band statistics    | Min, max, avg, stdev, etc. of raster values    | Band
COG                | Boolean: is Cloud Optimized GeoTIFF?           | Raster
Pixel (bit) depth  | Integer denoting the range of values stored    | Band
NoData             | Special value(s) denoting NoData pixels        | Band
Samples            | Custom portion(s) by user-specified box        | Raster

Table 3: Automated metadata over multidimensional data

Metadata           | Description                                    | Scope
Native CRS         | Coordinate reference system (EPSG)             | Dataset
Dimension count    | Number of dimensions in the dataset            | Dataset
Dimension info     | Name, length, etc. of each dimension           | Dataset
Variable count     | Number of variables in the dataset             | Dataset
Variable info      | Name, type, shape, attributes per variable     | Dataset
Spatial extent     | Range of values per spatial dimension          | Variable
Temporal range     | Timespan of stored values per variable         | Variable
Value distribution | Histogram of values per variable               | Variable
NoData             | Special value(s) denoting NoData               | Variable
Samples            | Custom portion(s) by user preferences          | Dataset


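Among the geometry-scope items in Table 1 are spatial clusters and outliers, detected with density-based methods like DBSCAN [7]. The sketch below captures DBSCAN's core idea on 2-d points, assuming a naive quadratic neighbor search for clarity; the actual profiler would rely on an indexed, partition-aware implementation to scale.

```python
def dbscan(points, eps, min_pts):
    """Assign a cluster id to each point; -1 marks noise/outliers."""
    def neighbors(i):
        # naive O(n) range query within distance eps (squared compare)
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise (may become border)
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster # noise reachable from a core -> border
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:  # expand only through core points
                queue.extend(k for k in more
                             if labels[k] is None or labels[k] == -1)
    return labels

# two dense groups of POIs plus one isolated feature
pois = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pois, eps=2, min_pts=3)
```

The `eps` and `min_pts` arguments correspond to the manually configured clustering parameters discussed in Section 3.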
MBR and better assess the spatial coverage of a given asset. Examining the spatial data distribution may be even more informative. Features like clustering (e.g., using DBSCAN [7]) or heatmaps computed over the geometries can reveal concentration, density, and other spatial patterns [10] in the underlying features, e.g., hotspots for certain POI categories like bars or shops.

2.1.3 Thematic attributes. For this part, relational data profiling can be applied [1]. A list of the available attributes and their data types (e.g., string, integer, double, date, timestamp) can illustrate the schema. Further, for each thematic attribute, the number or percentage of missing values and its cardinality (i.e., the number of distinct values) can be computed. When depicted in a histogram or a chart, these can illustrate the quality of data across all attributes and possibly even indicate correlations amongst them. Classification of the semantic domain for certain attributes (e.g., phone numbers, web pages, names) may also be examined. This requires knowledge of specific patterns that can be progressively collected from the corpus of data assets handled by the profiler.

For a particular thematic attribute, extra processing can provide the distribution of its values, depicted as frequency histograms (equi-width, equi-depth, V-optimal, etc.). For numerical attributes, 𝑁-tiles (e.g., quartiles) may also be provided. For a categorical attribute (e.g., POI categories or road classifications), lists of the distinct or most frequent values may further convey its representativeness.

Discovery of attribute dependencies, such as key or functional dependencies, concerns multi-attribute profiling [6]. Such features may indicate that values in one set of columns functionally determine the value of another column; e.g., specific POI categories like pharmacies always include phone numbers for emergency calls. Other dependencies may be conditional; e.g., street names are unique for each city or postal code.

2.2 Profiling of Raster Data Assets

Raster data includes digital aerial photographs, satellite imagery, scanned maps, etc., typically organized in one or more bands. For example, a color RGB image has three bands (Red, Green, Blue), a digital elevation model (DEM) has just one holding elevation values, whereas a multispectral image may have many bands. Table 2 lists some automated metadata that may be extracted from rasters.

Certain metadata items are similar to those proposed for vector data (e.g., spatial extent, native CRS, sampling). However, since rasters are tessellations of a 2-d surface into cells (pixels), raster profiling tools require different functionality from that applied against vectors. Besides, some information may not always be possible to calculate. For instance, the native CRS for a TIFF image may be unknown if its accompanying .tfw file does not include it in its header. The spatial extent (MBR, BBOX) may be deduced from the original raster without reprojection. If the native CRS is known, raster resolution can be accompanied with units of measurement.

Raster-specific metadata concerns resolution, image size, as well as whether the imagery is a COG (i.e., Cloud Optimized GeoTIFF). Information about NoData values is also important; this is typically a specially designated value (e.g., None, -999) indicating missing information, and it could differ per band.

Finally, band-related metadata concern the number of bands, the data type/depth of each one, and statistics (min, max, avg, stdev, etc.) of values per band. A default histogram of cell values per band can give a more detailed insight about raster quality.

2.3 Profiling of Multidimensional Data Assets

Multidimensional data is typically used to store environmental, climate, or meteorological information with measurements for multiple variables, with NetCDF^9 its most common representative. Dimensions (each with a name and a length) are used to represent real physical dimensions, such as time, latitude, longitude, or elevation, and define the structure of the data. Variables (each specified with a name, a type, a shape, and often accompanied by their units) hold multi-dimensional arrays of values of the same type. Profiling of such data assets involves some automatically extracted metadata that are common with vector and raster data, but due to the inherent complexity of this data format, it requires specialized handling. Table 3 lists automated metadata extractable from multidimensional data assets, with some of them concerning the entire dataset (e.g., native CRS) while others are computed over individual variables.

Histograms (one per variable) may be created in order to visualize the distribution of the values. In addition, special attributes (such as "missing_value" and "_FillValue" in NetCDF files) may be used to identify missing or undefined data; these values should be treated like the NoData ones in raster layers. It may also be possible to extract metadata features depending on the type of the variables in the dataset. For example, considering the time dimension, one may identify the time granularity of data per variable, any gaps in measurements, etc. Such information is valuable for potential users in order to assess the coverage, timespan, and completeness per variable of interest in such multidimensional data. Finally, extraction of ad-hoc samples according to user preferences, such as a rectangular area of interest or an interval on a given dimension (e.g., a time period), should also be possible.

9 https://www.unidata.ucar.edu/software/netcdf/
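Several of the single-attribute thematic items described in Section 2.1.3 (missing values, cardinality, frequent values, 𝑁-tiles) can be sketched in a few lines of plain Python. The helper name `profile_attribute` and the sample field values are illustrative only; the actual profiler computes these over GeoVaex dataframes.

```python
from collections import Counter
from statistics import quantiles

def profile_attribute(values, top_k=3):
    """Per-attribute thematic profile: missing count, cardinality,
    top-k frequent values, and quartiles for numerical attributes."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "cardinality": len(set(present)),
        "frequent": Counter(present).most_common(top_k),
    }
    # N-tiles only make sense for numerical attributes (here: quartiles)
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["quartiles"] = quantiles(present, n=4)
    return profile

# hypothetical categorical attribute with one missing value
poi_category = ["bar", "shop", "bar", None, "pharmacy", "bar"]
profile = profile_attribute(poi_category)

# hypothetical numerical attribute, yielding quartile cut points
elevation_profile = profile_attribute([10, 20, 30, 40])
```

Attribute dependencies (key, functional, conditional) require multi-column analysis and are beyond this single-column sketch; see [6] for dedicated algorithms.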
3 IMPLEMENTATION STATUS & OUTLOOK

In the context of the OpertusMundi project, we are developing an open-source prototype for BigDataVoyant^10. Current functionality mostly concerns its profiler engine, which leverages various Python libraries to offer an API for automated metadata extraction and manipulation. Python offers a robust, user-friendly environment for data scientists, and its ecosystem includes many popular libraries for geospatial data processing and visualization. Existing profiling tools^11 in Python are mostly based on Pandas^12 and unfortunately do not support geospatial data. We considered extending them, but Pandas frames do not scale for large datasets. Besides, its spatial extension GeoPandas can only handle vectors, and its performance over complex geometries is slow.

As scalability is a major concern in BigDataVoyant, we opted for another solution by extending the memory-efficient Vaex library^13 with spatial capabilities. Presently, we support all types of 2-dimensional vector geometries, like (multi)points, (multi)linestrings, (multi)polygons, collections, etc., according to OGC specifications^14. At a later stage, we plan to handle vector data with 𝑑 > 2 dimensions, e.g., road segments with linear
referencing. If necessary, we will also consider extra metadata features for particular applications, e.g., examining whether a road network is connected and can support routing. On top of this GeoVaex extension^15 that we have been building for vector data, a beta version of our profiler can automatically extract most of the metadata in Table 1 for moderately large datasets. Further, we employ GDAL/OGR^16 to support raster data profiling (Table 2) for a wide range of file formats, and the netCDF4 Python module^17 for multidimensional data profiling (Table 3).

Parameters like the number of clusters or histogram buckets, the weight and radius for heatmaps, etc., are currently configured manually. We will examine whether this could be automated in a data-driven manner according to data characteristics (e.g., size, spatial extent). Besides, some methods are applicable only to specific geometry types (e.g., concave hulls on points only), so such data conversions are handled in the background without user intervention.

Of course, scalability against very large spatial datasets is the main challenge. Although the libraries employed in profiling can generally scale to big datasets, computation of certain metadata requires particular handling, especially where some kind of spatial aggregation is involved. For instance, special algorithms are needed to treat alpha shapes over the whole dataset, compute heatmaps or clusters, etc., when data partitioning is applied to boost computation efficiency. Initial tests of our approach employing GeoVaex verify its strong potential in terms of scalability and efficiency. Figure 2 exemplifies a subset of metadata automatically computed over a large vector dataset extracted from OpenStreetMap with more than 25 million polygons in Germany, and visualized in a dashboard. Once the software reaches a stable level, we plan an extensive empirical study against real-world geodatasets of various types and formats.

Figure 2: Visualization of automated metadata profiling a vector dataset of 25 million polygons (buildings in Germany) extracted from OpenStreetMap.

Data cleansing and detection of outliers is also challenging, mostly regarding the spatial aspect (geometries, pixels) in large data collections. We intend to use clustering results to isolate features that stand out, accompanied by an "outlier confidence weight", which will obviously be affected by the parameter values set in the algorithm. We deem that this will offer data scientists and stakeholders a valuable tool to identify possible errors in the data, refine their analyses, and improve quality in their products.

Finally, this spatial data profiling is extensible thanks to the modular design of BigDataVoyant. Although our proposed collection of automated metadata is already rich and covers diverse types of geospatial data assets, extra features may be added based on feedback from stakeholders in the geospatial value chain and open-source developers. Complementing them with intuitive, interactive visualizations will further improve the cognitive interpretation and reveal latent aspects in the geospatial data.

10 Publicly available at https://github.com/OpertusMundi/BigDataVoyant
11 E.g., https://github.com/pandas-profiling/
12 https://pandas.pydata.org/
13 https://vaex.readthedocs.io
14 https://www.ogc.org/standards/sfa
15 https://github.com/OpertusMundi/geovaex
16 https://gdal.org/
17 https://unidata.github.io/netcdf4-python/netCDF4/index.html

ACKNOWLEDGMENTS

This work was partially funded by the European Union's H2020 project OpertusMundi (870228).

REFERENCES

[1] Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: a survey. VLDB Journal, 24(4):557–581, 2015.
[2] L. Anselin, Y. W. Kim, and I. Syabri. Web-based analytical tools for the exploration of spatial data. J. Geogr. Syst., 6(2):197–218, 2004.
[3] A. Chapman, E. Simperl, L. Koesten, G. Konstantinidis, L. D. Ibáñez, E. Kacprzak, and P. Groth. Dataset search: a survey. VLDB Journal, 29(1):251–272, 2020.
[4] C. Consonni, P. Sottovia, A. Montresor, and Y. Velegrakis. Discovering order dependencies through order compatibility. In EDBT, pages 409–420, 2019.
[5] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The Data Civilizer system. In CIDR, 2017.
[6] J. Ehrlich, M. Roick, L. Schulze, J. Zwiener, T. Papenbrock, and F. Naumann. Holistic data profiling: Simultaneous discovery of various metadata. In EDBT, pages 305–316, 2016.
[7] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.
[8] A. Y. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang. Goods: Organizing Google's datasets. In SIGMOD, pages 795–806, 2016.
[9] F. Naumann. Data profiling revisited. SIGMOD Record, 42(4):40–49, 2013.
[10] S. Shekhar, M. R. Evans, J. M. Kang, and P. Mohan. Identifying patterns in spatial information: A survey of methods. Wiley Interdiscip. Rev. Data Min. Knowl. Discov., 1(3):193–214, 2011.