<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">BigDataVoyant: Automated Profiling of Large Geospatial Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Pantelis</forename><surname>Mitropoulos</surname></persName>
							<email>pmitropoulos@getmap.gr</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Geospatial Enabling Technologies</orgName>
								<address>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kostas</forename><surname>Patroumpas</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Athena Research Center</orgName>
								<address>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dimitrios</forename><surname>Skoutas</surname></persName>
							<email>dskoutas@athenarc.gr</email>
							<affiliation key="aff2">
								<orgName type="institution">Athena Research Center</orgName>
								<address>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Spiros</forename><surname>Athanasiou</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">Athena Research Center</orgName>
								<address>
									<country key="GR">Greece</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">BigDataVoyant: Automated Profiling of Large Geospatial Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">FD5FFE438C4C998AD673248C8BBAA455</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We envisage an open, extensible, and scalable data profiling framework over various types of geospatial data, including vector, raster and multidimensional assets. In this paper, we outline our work in progress regarding the design and implementation of BigDataVoyant, a software platform for profiling big geospatial data. This software is able to ingest data in various spatial formats and reference systems. Its main goal is to extract and visualize a large variety of metadata and descriptions about data quality and characteristics both in an interactive as well as in a fully automated manner. We suggest a processing flow for such profiling and discuss a preliminary, yet comprehensive list of metadata items already supported by the open-source software prototype we are implementing. Finally, we outline open issues and extensions of the proposed framework to broaden its usefulness and strengthen its appeal to the geospatial data community.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>As the availability, the volume, and the variety of data from different sources grow, it is crucial for stakeholders to assess its relevance and suitability for a given type of analysis, application, or service. Data profiling comprises a collection of operations and processes for extracting metadata from a given dataset and thus facilitating decision making on its potential utilization. Such metadata may involve schema information, statistics, samples, or other informative summaries over the data, thus offering extensive and objective indicators for assessing datasets in terms of business value, fitness for purpose, and quality. In addition, a variety of visualizations (tables, charts, graphs, timelines, etc.) may be applied against this metadata to convey the significance and interpret the value of the underlying data.</p><p>In proprietary data catalogues, organizational data lakes, or scientific repositories, each containing numerous and perhaps heterogeneous data assets, offering users search capabilities against rich collections of their extracted metadata can greatly facilitate data discovery. Exploring the schema, semantics, and actual contents of open data repositories through such profiles can also help evaluate their usefulness in data integration tasks. For data traded in marketplaces, potential consumers can examine such profiles (especially any quality indicators) on the available datasets, compare them, and then determine whether to purchase and which one(s) to choose. This is all the more important for geospatial data, represented not only as vectors, but also in raster or multi-dimensional formats. In the context of the OpertusMundi project<ref type="foot" target="#foot_0">1</ref>, we build a trusted, robust, and scalable pan-European geospatial data marketplace. 
Metadata extracted from such datasets can indicate their coverage, timeliness, consistency, and completeness, along with other domain-specific properties (topology, scale, resolution, etc.) crucial for the entire lifecycle of spatial asset provision, discovery, sharing, purchasing, and use.</p><p>Data profiling as a means of exploring, analyzing, and interpreting big data assets has attracted a lot of interest over the past decade from researchers and practitioners alike. Challenges and important use cases have been presented in recent surveys <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b8">9]</ref>, which however focus on relational data, overlooking the specific requirements of geospatial data. Among the main challenges recognized is the ability to handle input data from heterogeneous sources and to deal with the computational complexity and scalability of profiling functionalities. Interpreting the output profiles is also challenging, since it typically requires domain expertise and may depend on the use case (e.g., exploration, integration, cleansing). Another recent survey <ref type="bibr" target="#b2">[3]</ref> focuses on methods and systems that enable automatic indexing and interactive searching over large collections of datasets that fit users' needs. Discovering dependencies between attributes in relational data has also led to interesting techniques. For instance, algorithms for simultaneously identifying unique column combinations, inclusion and functional dependencies <ref type="bibr" target="#b5">[6]</ref>, as well as order dependencies <ref type="bibr" target="#b3">[4]</ref> have been proposed. Prototype systems have also been developed for data profiling. Data Civilizer <ref type="bibr" target="#b4">[5]</ref> extracts profiling signatures and creates a metadata graph to facilitate discovery of joinable datasets or those relevant to user tasks. 
The GOODS platform <ref type="bibr" target="#b7">[8]</ref> can also infer provenance metadata and annotation tags to enable efficient and scalable discovery of datasets. Moreover, modern platforms for data science tasks like Kaggle<ref type="foot" target="#foot_1">2</ref> or digital marketplaces like Dawex<ref type="foot" target="#foot_2">3</ref> include data profiling mechanisms, as well as keyword-based search for data discovery.</p><p>Admittedly, the aforementioned generic schemes lack inherent capabilities for geospatial data profiling. Yet, GIS software platforms already support Exploratory Spatial Data Analysis (ESDA). ESDA employs spatial mining and analysis tools <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b9">10]</ref> that allow users to visualize spatial distributions, identify outliers, discover patterns like clusters or hot spots, etc. Such tools are widely used in full-fledged GIS platforms like ArcGIS<ref type="foot" target="#foot_3">4</ref> and QGIS<ref type="foot" target="#foot_4">5</ref> or geospatially-aware DBMS like PostGIS<ref type="foot">6</ref> or Oracle Spatial and Graph<ref type="foot">7</ref>. Although these platforms are powerful, data profiling in GIS is not streamlined and may involve manual execution of a series of operations. Sometimes step-by-step user intervention is necessary: invoking a given functionality (e.g., heatmap creation), specifying its parameters, choosing map rendering options, etc. The cost of a software license may also be a hindrance, as may the usage or programming skills required.</p><p>In contrast, we propose BigDataVoyant, an open-source, interactive, modular, and extensible framework specifically tailored for in-depth profiling of geospatial data. This platform aims to extract metadata from different types of spatial assets (vector, raster, multidimensional data) and allow data scientists to control its degree of automation. 
In its interactive, step-by-step mode, the analyst configures each operation, inspects its results, and optionally invokes it again with revised settings for advanced, detailed data exploration. In a fully automated mode, when triggered by inexperienced users or involved in a broader processing workflow (e.g., pipeline), a default or a user-preconfigured parametrization may be applied to run profiling as a batch job. In either mode, metadata items become progressively available and can be graphically visualized, thus expediting spatial data exploration. With BigDataVoyant, we aim to support efficient and robust profiling of large geodatasets with modern processing schemes (data partitioning, distributed processing) and provide it as a service with a RESTful API. Overall, we deem that the rich and extensible collection of automated metadata generated by BigDataVoyant will offer a powerful and comprehensive means of assessing quantity, quality, and variety in spatial data assets for mission-critical applications and services.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PROFILING FRAMEWORK</head><p>As illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, we envisage a data profiling framework that can handle large geospatial datasets of different types. We propose to automatically extract metadata not only from vector or raster data assets available in a variety of sources and formats (files, DBMS, or WFS services), but also from multidimensional scientific data with a spatial reference (like meteorological, hydrological, or other sensor measurements). The profiler engine is the core processing component. Using an extensible set of software libraries and APIs, it can ingest a geospatial dataset and apply a set of processing tasks. Each task computes metadata according to a configuration specified by the analyst or automatically determined by the system, concerning the target metadata and parameter settings for the task. The resulting metadata includes statistics in JSON format, geospatial features in Well-Known Text or Binary (WKT/WKB) representations, maps as PNG images, etc. All extracted metadata is stored in a repository to be used for dataset search and reporting. This metadata is also interactively visualized in a dashboard as lists, graphs, charts, and maps of various types, uncovering latent patterns, trends, and even issues (e.g., outliers, inconsistent or missing values) with the data. Employing a human-in-the-loop paradigm, the data scientist can steer or fine-tune the execution by choosing specific metadata items of interest and adjusting their parameters to inspect data characteristics in more detail.</p><p>Next, we present a non-exhaustive list of metadata items that can be automatically extracted from several types of geospatial data. We also discuss how such metadata can be visualized as lists, maps, charts, etc.</p><note place="foot" n="6">https://postgis.net/</note><note place="foot" n="7">https://www.oracle.com/database/technologies/spatialandgraph.html</note></div>
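<div xmlns="http://www.tei-c.org/ns/1.0"><p>The processing flow described above can be illustrated with a minimal sketch, assuming a vector dataset that has already been ingested as a list of (x, y) points. The task registry <code>TASKS</code>, the function <code>run_profile</code>, and the shape of the configuration dictionary are hypothetical names introduced here for illustration only; they are not the BigDataVoyant API. Each task receives the data and its own parameters and returns one metadata item, so that the collected profile can be serialized as JSON with the spatial extent expressed in WKT.</p>
<p><code lang="python"><![CDATA[
```python
import json
from typing import Any, Callable, Dict, List, Tuple

Point = Tuple[float, float]
# A task takes the ingested features plus its parameters and returns one metadata item.
Task = Callable[[List[Point], Dict[str, Any]], Any]


def feature_count(points: List[Point], params: Dict[str, Any]) -> int:
    """Number of features in the dataset."""
    return len(points)


def spatial_extent(points: List[Point], params: Dict[str, Any]) -> str:
    """Bounding rectangle of all points, serialized as a WKT polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    ring = [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]
    return "POLYGON ((" + ", ".join(f"{x:g} {y:g}" for x, y in ring) + "))"


def sample(points: List[Point], params: Dict[str, Any]) -> List[Point]:
    """First `size` features (default 10), as a simple deterministic sample."""
    return points[: params.get("size", 10)]


# Hypothetical registry: adding a metadata item means adding an entry here.
TASKS: Dict[str, Task] = {
    "feature_count": feature_count,
    "spatial_extent": spatial_extent,
    "sample": sample,
}


def run_profile(points: List[Point], config: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Run every task named in `config` with its parameters and collect the results."""
    return {name: TASKS[name](points, params) for name, params in config.items()}


if __name__ == "__main__":
    pois = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
    config = {"feature_count": {}, "spatial_extent": {}, "sample": {"size": 2}}
    print(json.dumps(run_profile(pois, config), indent=2))
```
]]></code></p>
<p>In this sketch, the fully automated mode corresponds to calling <code>run_profile</code> once with a default configuration, whereas the interactive mode corresponds to calling it repeatedly with a single task and revised parameters.</p></div>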
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Profiling of Vector Data Assets</head><p>Vector datasets may include thematic attributes, whereas geometry is typically stored in WKT, WKB, or BLOB. Profiling of vector data can be applied in different fashions to automatically compute metadata regarding the entire data asset, its spatial attribute (geometry), or its thematic attributes, as listed in Table <ref type="table" target="#tab_0">1</ref>. Next, we briefly outline each category of such metadata.</p><p>2.1.1 Entire dataset. This examines the vector data as a whole. For instance, it may involve computation of simple numerical values (e.g., feature count) or extraction of data samples, either for the entire area or a user-specified one, e.g. to allow the analyst or potential customer to examine data quality for an area of their interest before using or purchasing an asset.</p><p>2.1.2 Spatial attributes. These are automatically computed against a geometry attribute, usually projected according to a known Coordinate Reference System (CRS)<ref type="foot" target="#foot_5">8</ref>. The spatial extent is captured by the Minimum Bounding Rectangle (MBR, BBOX) of vectors, but the MBR is often insufficient or misleading due to the "dead space" induced by its rectangular shape. Thus, the convex and concave hulls of the geometries may provide further insight, allowing users to illustrate and inspect them together with the MBR and better assess the spatial coverage of a given asset. Examining the spatial data distribution may be even more informative. Features like clustering (e.g., using DBSCAN <ref type="bibr" target="#b6">[7]</ref>) or heatmaps computed over the geometries can reveal concentration, density, and other spatial patterns <ref type="bibr" target="#b9">[10]</ref> in the underlying features, e.g., hotspots for certain POI categories like bars or shops.</p></div>
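<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the notion of "dead space" concrete, the following minimal sketch computes the MBR and the convex hull of a point set and reports the fraction of the MBR not covered by the hull. It assumes planar (projected) coordinates and uses Andrew's monotone chain algorithm, which is a standard technique chosen here for illustration and not necessarily the one used in our prototype; the helper names <code>mbr</code>, <code>convex_hull</code>, and <code>dead_space_ratio</code> are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List, Tuple

Point = Tuple[float, float]


def mbr(points: List[Point]) -> Tuple[float, float, float, float]:
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product OA x OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Drop points that would make a right turn or be collinear.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def polygon_area(ring: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def dead_space_ratio(points: List[Point]) -> float:
    """Fraction of the MBR area that lies outside the convex hull."""
    x0, y0, x1, y1 = mbr(points)
    mbr_area = (x1 - x0) * (y1 - y0)
    if mbr_area == 0:
        return 0.0
    return 1.0 - polygon_area(convex_hull(points)) / mbr_area


if __name__ == "__main__":
    pois = [(0, 0), (4, 0), (0, 4), (1, 1)]
    print(mbr(pois), convex_hull(pois), dead_space_ratio(pois))
```
]]></code></p>
<p>For the four points in the usage example, the hull is a triangle covering half of the MBR, so half of the rectangle is dead space. A concave hull would typically tighten this estimate further, at the cost of an extra parameter.</p></div>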
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3">Thematic Attributes.</head><p>For this part, relational data profiling can be applied <ref type="bibr" target="#b0">[1]</ref>. A list of the available attributes and their data types (e.g., string, integer, double, date, timestamp, etc.) can illustrate the schema. Further, for each thematic attribute, the number or percentage of missing values, and its cardinality (i.e., the number of distinct values) can be computed. When depicted in a histogram or a chart, they can illustrate the quality of data across all attributes and possibly even indicate correlations amongst them. Classification of the semantic domain for certain attributes (e.g., phone numbers, web pages, names, etc.) may be also examined. This requires knowledge of specific patterns that can be progressively collected from the corpus of data assets handled by the profiler.</p><p>For a particular thematic attribute, extra processing can provide the distribution of its values, depicted as frequency histograms (equi-width, equi-depth, V-optimal, etc.). For numerical attributes, 𝑁 -tiles (e.g., quartiles) may be also provided. For a categorical attribute (e.g., POI categories or road classifications), lists of the distinct or most frequent values may further convey its representativeness.</p><p>Discovery of attribute dependencies, such as key or functional dependencies, concerns multi-attribute profiling <ref type="bibr" target="#b5">[6]</ref>. Such features may indicate that values in one set of columns functionally determine the value of another column, e.g., specific POI categories like pharmacies always include phone numbers for emergency calls. Other dependencies may be conditional, e.g. street names are unique for each city or postal code.</p></div>
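<div xmlns="http://www.tei-c.org/ns/1.0"><p>The single-attribute and multi-attribute metadata discussed above can be sketched with the Python standard library alone. The example assumes that an attribute is given as a list of values with <code>None</code> marking missing entries, and that records are dictionaries; the function names <code>profile_attribute</code>, <code>n_tiles</code>, and <code>holds_fd</code> are hypothetical and serve only as an illustration. Both the count of non-null values and the count of distinct values are reported, and the functional dependency check is a naive single pass rather than one of the discovery algorithms cited above.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter
from statistics import quantiles
from typing import Any, Dict, Hashable, List, Optional, Sequence


def profile_attribute(values: Sequence[Optional[Hashable]], top_k: int = 3) -> Dict[str, Any]:
    """Missing values, non-null count, distinct count and top-k values of one attribute."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    missing = len(values) - len(present)
    return {
        "missing": missing,
        "missing_pct": 100.0 * missing / len(values) if values else 0.0,
        "non_null": len(present),
        "distinct": len(counts),
        "frequent": counts.most_common(top_k),
    }


def n_tiles(values: Sequence[Optional[float]], n: int = 4) -> List[float]:
    """The n-1 cut points dividing the non-missing values into n equal parts."""
    present = [v for v in values if v is not None]
    return quantiles(present, n=n)


def holds_fd(rows: Sequence[Dict[str, Any]], lhs: Sequence[str], rhs: str) -> bool:
    """Check whether the attributes in `lhs` functionally determine `rhs`."""
    seen: Dict[tuple, Any] = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # same left-hand side, different right-hand side
        seen.setdefault(key, row[rhs])
    return True


if __name__ == "__main__":
    categories = ["bar", "shop", "bar", None, "bar", "shop", "cafe"]
    print(profile_attribute(categories, top_k=2))
    print(n_tiles([1, 2, None, 3, 4, 5, 6, 7]))
    rows = [
        {"street": "Main St", "city": "Athens", "country": "GR"},
        {"street": "Main St", "city": "Patras", "country": "GR"},
        {"street": "Oak Ave", "city": "Athens", "country": "GR"},
    ]
    print(holds_fd(rows, ["city"], "country"), holds_fd(rows, ["street"], "city"))
```
]]></code></p>
<p>In the usage example, the city determines the country, whereas the street name alone does not determine the city; the street name is unique only in combination with the city, which mirrors the conditional dependency mentioned above.</p></div>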
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Profiling of Raster Data Assets</head><p>Raster data includes digital aerial photographs, satellite imagery, scanned maps, etc., typically organized in one or more bands. For example, a color RGB image has three bands (Red, Green, Blue), a digital elevation model (DEM) has just one band holding elevation values, whereas a multispectral image may have many bands. Table <ref type="table" target="#tab_1">2</ref> lists some automated metadata that may be extracted from rasters.</p><p>Certain metadata items are similar to those proposed for vector data (e.g., spatial extent, native CRS, sampling). However, since rasters are tessellations of a 2-d surface into cells (pixels), raster profiling tools require different functionality from that applied against vectors. Besides, some information may not always be possible to calculate. For instance, native CRS for a TIFF image may be unknown if its accompanying .tfw file does not include it in its header. Spatial extent (MBR, BBOX) may be deduced from the original raster without reprojection. If the native CRS is known, raster resolution can be accompanied by units of measurement.</p><p>Raster-specific metadata concerns resolution, image size, as well as whether the imagery is COG (i.e., Cloud Optimized GeoTIFF). Information about NoData values is also important; this is typically a specially designated value (e.g., None, -999) to indicate missing information and could differ per band.</p><p>Finally, band-related metadata concern the number of bands, the data type/depth of each one, and statistics (min, max, avg, stdev, etc.) of values per band. A default histogram of cell values per band can give a more detailed insight about raster quality.</p></div>
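<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the band-related metadata, the following sketch computes per-band statistics and an equi-width histogram while excluding NoData cells. It assumes that a band has already been read into memory as a list of rows of numbers (in practice the cells would come from a raster library such as GDAL); the function names and the choice of the population standard deviation are assumptions made for this example only.</p>
<p><code lang="python"><![CDATA[
```python
from statistics import mean, pstdev
from typing import Any, Dict, List, Optional, Sequence

Number = float


def valid_cells(band: Sequence[Sequence[Number]], nodata: Optional[Number]) -> List[Number]:
    """Flatten a band and drop the cells equal to its NoData value."""
    return [v for row in band for v in row if v != nodata]


def band_statistics(band: Sequence[Sequence[Number]], nodata: Optional[Number] = None) -> Dict[str, Any]:
    """Min, max, average and standard deviation over the valid cells of one band."""
    total = sum(len(row) for row in band)
    cells = valid_cells(band, nodata)
    stats: Dict[str, Any] = {"count": len(cells), "nodata_count": total - len(cells)}
    if cells:
        stats.update(min=min(cells), max=max(cells), avg=mean(cells), stdev=pstdev(cells))
    return stats


def equi_width_histogram(values: Sequence[Number], buckets: int) -> List[int]:
    """Counts per bucket, with all buckets spanning the same value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1.0  # avoid a zero width for constant bands
    counts = [0] * buckets
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bucket.
        index = min(int((v - lo) / width), buckets - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    dem = [[1, 2, -999], [3, -999, 4]]  # toy elevation band, -999 marks NoData
    print(band_statistics(dem, nodata=-999))
    print(equi_width_histogram(valid_cells(dem, -999), buckets=3))
```
]]></code></p>
<p>Since the NoData value may differ per band, such a function would be invoked once per band with that band's own designated value.</p></div>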
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Profiling of Multidimensional Data Assets</head><p>Multidimensional data is typically used to store environmental, climate, or meteorological information with measurements for multiple variables, with NetCDF<ref type="foot" target="#foot_6">9</ref> its most common representative. Dimensions (each with a name and a length) are used to represent real physical dimensions, such as time, latitude, longitude, or elevation, and define the structure of data. Variables (each specified with a name, a type, a shape, and often accompanied by their units) hold multi-dimensional arrays of values of the same type. Profiling of such data assets involves some automatically extracted metadata that are common with vector and raster data, but due to the inherent complexity in this data format, it requires specialized handling. Table <ref type="table" target="#tab_2">3</ref> lists automated metadata extractable from multidimensional data assets, with some of them concerning the entire dataset (e.g. native CRS) while others are computed over individual variables.</p><p>Histograms (one per variable) may be created in order to visualize the distribution of the values. In addition, special attributes (such as "missing_value" and "_FillValue" in netCDF files) may be used to identify missing or undefined data; these values should be treated like the NoData ones in raster layers. It may also be possible to extract metadata features depending on the type of the variables in the dataset. For example, considering the time dimension, identify the time granularity of data per variable, any gaps in measurements, etc. Such information is valuable for potential users in order to assess the coverage, timespan, and completeness per variable of interest in such multidimensional data. 
Finally, extraction of ad-hoc samples according to user preferences, such as a rectangular area of interest or an interval on a given dimension (e.g., a time period), should also be possible.</p></div>
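<div xmlns="http://www.tei-c.org/ns/1.0"><p>The temporal metadata mentioned above (timespan, granularity, and gaps per variable) can be sketched as follows, assuming that the values of the time dimension have already been decoded into Python <code>datetime</code> objects. The function name <code>temporal_profile</code> is hypothetical, and taking the most frequent step between consecutive timestamps as the granularity is a simple heuristic chosen for this illustration.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Any, Dict, List, Sequence


def temporal_profile(timestamps: Sequence[datetime]) -> Dict[str, Any]:
    """Timespan, dominant time step and gaps of a series of timestamps."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        raise ValueError("at least two timestamps are needed")
    pairs = list(zip(ts, ts[1:]))
    steps: List[timedelta] = [later - earlier for earlier, later in pairs]
    # Heuristic: the most frequent step is taken as the granularity.
    granularity = Counter(steps).most_common(1)[0][0]
    # A gap is any pair of consecutive timestamps further apart than that.
    gaps = [(earlier, later) for earlier, later in pairs if later - earlier > granularity]
    return {"start": ts[0], "end": ts[-1], "granularity": granularity, "gaps": gaps}


if __name__ == "__main__":
    day = datetime(2020, 1, 1)
    hours = [0, 1, 2, 5, 6]  # measurements for 03:00 and 04:00 are absent
    profile = temporal_profile([day + timedelta(hours=h) for h in hours])
    print(profile["granularity"], profile["gaps"])
```
]]></code></p>
<p>The same idea applies to any ordered dimension, e.g., to detect missing levels along an elevation axis.</p></div>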
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">IMPLEMENTATION STATUS &amp; OUTLOOK</head><p>In the context of the OpertusMundi project, we are developing an open-source prototype for BigDataVoyant <ref type="foot" target="#foot_7">10</ref> . Current functionality mostly concerns its profiler engine, which leverages various Python libraries to offer an API for automated metadata extraction and manipulation. Python offers a robust, user-friendly environment for data scientists, and its ecosystem includes many popular libraries for geospatial data processing and visualization. Existing profiling tools<ref type="foot" target="#foot_8">11</ref> in Python are mostly based on Pandas<ref type="foot" target="#foot_9">12</ref> and unfortunately do not support geospatial data. We considered extending them, but Pandas frames do not scale for large datasets. Besides, its spatial extension GeoPandas can only handle vectors and its performance over complex geometries is slow.</p><p>As scalability is a major concern in BigDataVoyant, we opted for another solution by extending the memory-efficient Vaex library<ref type="foot" target="#foot_10">13</ref> with spatial capabilities. Presently, we actually support all types of 2-dimensional vector geometries, like (multi)points, (multi)linestrings, (multi)polygons, collections, etc. according to OGC specifications<ref type="foot" target="#foot_11">14</ref> . At a later stage, we plan to handle vector data with 𝑑 &gt; 2 dimensions, e.g., road segments with linear referencing. If necessary, we will also consider extra metadata features for particular applications, e.g., examine whether a road network is connected and can support routing. On top of this GeoVaex extension<ref type="foot" target="#foot_12">15</ref> we have been building for vector data, a beta version of our profiler can automatically extract most of the metadata in Table <ref type="table" target="#tab_0">1</ref> for moderately large datasets. 
Further, we employ GDAL/OGR<ref type="foot" target="#foot_13">16</ref> to support raster data profiling (Table <ref type="table" target="#tab_1">2</ref>) for a wide range of file formats, and the netCDF4 Python module<ref type="foot" target="#foot_14">17</ref> for multidimensional data profiling (Table <ref type="table" target="#tab_2">3</ref>).</p><p>Parameters like number of clusters or histogram buckets, weight and radius for heatmaps, etc., are currently configured manually. We will examine whether this could be automated in a data-driven manner according to data characteristics (e.g., size, spatial extent). Besides, some methods are applicable only to specific geometry types (e.g., concave hulls on points only), so the necessary geometry conversions are handled in the background without user intervention.</p><p>Of course, scalability against very large spatial datasets is the main challenge. Although the libraries employed in profiling can generally scale to big datasets, computation of certain metadata requires particular handling, especially where some kind of spatial aggregation is involved. For instance, special algorithms are needed to treat alpha shapes in the whole dataset, compute heatmaps or clusters, etc., when data partitioning is applied to boost computation efficiency. Initial tests of our approach employing GeoVaex verify its strong potential in terms of scalability and efficiency. Figure <ref type="figure" target="#fig_1">2</ref> exemplifies a subset of metadata automatically computed over a large vector dataset extracted from OpenStreetMap with more than 25 million polygons in Germany, and visualized in a dashboard. Once the software reaches a stable level, we plan an extensive empirical study against real-world geodatasets of various types and formats.</p><p>Data cleansing and detection of outliers are also challenging, mostly regarding the spatial aspect (geometries, pixels) in large data collections. 
We intend to use clustering results and isolate features that stand out, accompanied by an 'outlier confidence weight', which will evidently be affected by the parameter values set in the algorithm. We deem that this will offer data scientists and stakeholders a valuable tool to identify possible errors in the data, refine their analyses, and improve quality in their products.</p><p>Finally, this spatial data profiling is extensible thanks to the modular design of BigDataVoyant. Although our proposed collection of automated metadata is already rich and covers diverse types of geospatial data assets, extra features may be added based on feedback from stakeholders in the geospatial value chain and open source developers. Complementing them with intuitive, interactive visualizations will further improve the cognitive interpretation and reveal latent aspects in the geospatial data.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Processing flow for automated geodata profiling.</figDesc><graphic coords="2,309.71,83.69,231.64,131.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Visualization of automated metadata profiling a vector dataset of 25 million polygons (buildings in Germany) extracted from OpenStreetMap.</figDesc><graphic coords="4,309.71,83.69,231.65,220.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Automated metadata computed over vector data</figDesc><table><row><cell>Metadata</cell><cell>Description</cell><cell>Scope</cell></row><row><cell>Native CRS</cell><cell>Coordinate reference system (EPSG)</cell><cell>Geometry</cell></row><row><cell>Spatial extent</cell><cell>Rectilinear MBR covering all features</cell><cell>Geometry</cell></row><row><cell>Feature count</cell><cell>Number of rows (records)</cell><cell>Dataset</cell></row><row><cell>Convex Hull</cell><cell>Convex polygon enclosing all features</cell><cell>Geometry</cell></row><row><cell>Concave Hull</cell><cell>Concave polygon enclosing all features</cell><cell>Geometry</cell></row><row><cell>Attribute names</cell><cell>List of column names of all attributes</cell><cell>Thematic</cell></row><row><cell>Attribute types</cell><cell>List of data types of all attributes</cell><cell>Thematic</cell></row><row><cell>Cardinality</cell><cell>Count of NOT NULL values per attribute</cell><cell>Thematic</cell></row><row><cell>Value pattern</cell><cell>Semantic domain of values</cell><cell>Thematic</cell></row><row><cell>Value distribution</cell><cell>Frequency histogram of attribute values</cell><cell>Thematic</cell></row><row><cell>𝑁-tiles</cell><cell>Values dividing data into 𝑁 equal parts</cell><cell>Thematic</cell></row><row><cell>Distinct values</cell><cell>List of categorical values</cell><cell>Thematic</cell></row><row><cell>Frequent values</cell><cell>Top-𝑘 most frequent categorical values</cell><cell>Thematic</cell></row><row><cell>Attr. dependencies</cell><cell>Key, functional, conditional dependencies</cell><cell>Thematic</cell></row><row><cell>Heatmap</cell><cell>Colormap with varying intensity</cell><cell>Geometry</cell></row><row><cell>Clusters, outliers</cell><cell>Spatial clusters of features (e.g. POIs)</cell><cell>Geometry</cell></row><row><cell>Samples</cell><cell>Data portion(s) of limited size/extent</cell><cell>Dataset</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Automated metadata computed over raster data</figDesc><table><row><cell>Metadata</cell><cell>Description</cell><cell>Scope</cell></row><row><cell>Native CRS</cell><cell>Coordinate reference system (EPSG)</cell><cell>Dataset</cell></row><row><cell>Spatial extent</cell><cell>Rectangle with the image bounds</cell><cell>Raster</cell></row><row><cell>Resolution</cell><cell>Pixel size (e.g., in meters) per axis</cell><cell>Raster</cell></row><row><cell>Width, Height</cell><cell>Number of pixels per axis</cell><cell>Raster</cell></row><row><cell>Bands</cell><cell>Count of the available bands</cell><cell>Raster</cell></row><row><cell>Value distribution</cell><cell>Histogram(s) of raster values</cell><cell>Band</cell></row><row><cell>Band statistics</cell><cell>Min, max, avg, stdev, etc. of raster values</cell><cell>Band</cell></row><row><cell>COG</cell><cell>Boolean: is Cloud Optimized GeoTIFF?</cell><cell>Raster</cell></row><row><cell>Pixel (bit) depth</cell><cell>Integer denoting the range of values stored</cell><cell>Band</cell></row><row><cell>NoData</cell><cell>Special value(s) denoting NoData pixels</cell><cell>Band</cell></row><row><cell>Samples</cell><cell>Custom portion(s) by user-specified box</cell><cell>Raster</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Automated metadata over multidimensional data</figDesc><table><row><cell>Metadata</cell><cell>Description</cell><cell>Scope</cell></row><row><cell>Native CRS</cell><cell>Coordinate reference system (EPSG)</cell><cell>Dataset</cell></row><row><cell>Dimension count</cell><cell>Number of dimensions in the dataset</cell><cell>Dataset</cell></row><row><cell>Dimension info</cell><cell>Name, length, etc. of each dimension</cell><cell>Dataset</cell></row><row><cell>Variable count</cell><cell>Number of variables in the dataset</cell><cell>Dataset</cell></row><row><cell>Variable info</cell><cell>Name, type, shape, attributes per variable</cell><cell>Dataset</cell></row><row><cell>Spatial extent</cell><cell>Range of values per spatial dimension</cell><cell>Variable</cell></row><row><cell>Temporal range</cell><cell>Timespan of stored values per variable</cell><cell>Variable</cell></row><row><cell>Value distribution</cell><cell>Histogram of values per variable</cell><cell>Variable</cell></row><row><cell>NoData</cell><cell>Special value(s) denoting NoData</cell><cell>Variable</cell></row><row><cell>Samples</cell><cell>Custom portion(s) by user preferences</cell><cell>Dataset</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.opertusmundi.eu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.kaggle.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.dawex.com/en/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://www.esri.com/arcgis/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://qgis.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">Typically defined according to the EPSG registry: https://www.epsg.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">https://www.unidata.ucar.edu/software/netcdf/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">Publicly available at https://github.com/OpertusMundi/BigDataVoyant</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">E.g., https://github.com/pandas-profiling/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">https://pandas.pydata.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_10">https://vaex.readthedocs.io</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_11">https://www.ogc.org/standards/sfa</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_12">https://github.com/OpertusMundi/geovaex</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_13">https://gdal.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_14">https://unidata.github.io/netcdf4-python/netCDF4/index.html</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This work was partially funded by the European Union's H2020 project OpertusMundi (870228).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Profiling relational data: a survey</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Abedjan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Golab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="557" to="581" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Web-based analytical tools for the exploration of spatial data</title>
		<author>
			<persName><forename type="first">L</forename><surname>Anselin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Syabri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Geogr. Syst.</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="197" to="218" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Dataset search: a survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chapman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Koesten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Konstantinidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">D</forename><surname>Ibáñez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kacprzak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Groth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB Journal</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="251" to="272" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Discovering order dependencies through order compatibility</title>
		<author>
			<persName><forename type="first">C</forename><surname>Consonni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sottovia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Montresor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Velegrakis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="409" to="420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The data civilizer system</title>
		<author>
			<persName><forename type="first">D</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Fernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Abedjan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stonebraker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">F</forename><surname>Ilyas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Madden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ouzzani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIDR</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Holistic data profiling: Simultaneous discovery of various metadata</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ehrlich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Roick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schulze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zwiener</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Papenbrock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="305" to="316" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A density-based algorithm for discovering clusters in large spatial databases with noise</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD</title>
				<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="226" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Goods: Organizing Google&apos;s datasets</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Korn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Polyzotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Whang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="795" to="806" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Data profiling revisited</title>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="40" to="49" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Identifying patterns in spatial information: A survey of methods</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shekhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mohan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wiley Interdiscip. Rev. Data Min. Knowl. Discov.</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="193" to="214" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
