<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>volume 4168 of Lec-
Journal of Geo</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j</article-id>
      <title-group>
        <article-title>Advancements of Randomized Data Structures for Geospatial Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Walther</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Supervised by Martin Werner, Professorship for Big Geospatial Data Management, Technical University of Munich</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>94</volume>
      <fpage>12</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The increasing amount of geospatial data acquired through various sensory improvements, together with increased governmental and societal interest in creating this data, results in rising issues in data storage and processing for this data type. This work therefore describes the setting for a probabilistic data structure based on Random Projections and Bloom Filters to enable a more efficient processing of geospatial data. The paper states four research questions, which, apart from general considerations on probabilistic data structures, elaborate on the advantages of such data structures in their use together with new hardware like field-programmable gate arrays (FPGAs) or quantum computers. The main contribution is to structure the problem space and define solution approaches for future research in this field.</p>
      </abstract>
      <kwd-group>
        <kwd>geospatial data</kwd>
        <kwd>geo-data management</kwd>
        <kwd>randomized data structures</kwd>
        <kwd>bloom filter</kwd>
        <kwd>geographic information systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the amount of data generated has increased with the ability of technical devices to capture and store this information, as well as with governmental and societal willingness to create this data persistently. In the geospatial domain, which combines location information with attribute and sometimes temporal information [1], this became especially apparent as mobile technical devices, such as phones, and earth observation devices, like satellites and drones, create increasing data streams. This imposes significant challenges on data processing solutions, as pure data generation does not imply application-oriented usage. On the one hand, global analyses are challenging due to the huge storage footprints of the generated data [2]; on the other hand, edge computing solutions, like direct processing on drones or satellites, are challenged with processing only the data they generate, not to speak of setting this data in context to external data streams.</p>
      <p>Especially on a global scale, the evaluation of single small areas is often inefficient: the represented data is frequently sparse, and information on neighboring regions is often not independent [2]. Furthermore, acquired spatial information often has no exact location but only an approximate one [3]. Still, current procedures result in huge datasets, which are hard to query if, e.g., statistics based on single sparse elements need to be calculated. Improvements can be achieved either through compression and simplification of the data or through improved query processing during data analysis to access the data more efficiently in big data environments. However, for geospatial data and data queries, only a few compression algorithms, dimension reduction techniques, or database access patterns were developed, which will be described in section 2.</p>
      <p>To proceed based on this, my doctoral research will focus on improving the fundamentals of randomized data structures and query processing to make them more applicable to geospatial data and larger databases. This includes theoretical considerations based on information theory, implementations to achieve efficiently running algorithms, and the consideration of hardware implications. The final goal is to propose a new application-proofed randomized data structure for geospatial data.</p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy. paul.walther@tum.de (P. Walther), ORCID 0000-0002-5101-5793.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <p>Geospatial data, in general, can appear in different formats such as land surveys, GPS information, photography, remote sensing images, digitized maps, point clouds, and more [3]. Today’s standards for storing geometric data include point cloud representations, simplicial complexes, topological approaches, and CityGML [4], which are mostly stored in object-oriented [5], object-relational [6, 7, 8], or graph-based databases [9]. The most used representation is a form of simplicial complexes and the rasterization of the data [10]. This means that the to-be-described area or volume is quantized into distinct spatial pixels/voxels, each associated with one or more attribute values. One way to improve the efficiency of the storage is to reduce the data footprint. For normal databases, this can be done in two ways: reduce the numerosity of the samples or reduce the dimensionality [11]. We can fit a probabilistic model to the data to learn a mapping to a lower dimensional subspace.</p>
      <p>The probably easiest way to do so is a random projection, which projects data to a lower dimensional subspace by picking a random subset of the properties [12]. Another probabilistic data structure is the Bloom Filter (BF), as shown in Figure 1 [13]. BFs, which were initially developed in the 1970s, are probabilistic randomized data structures for representing sets, trading data footprint and computational effort against error probability [2]. For construction, a bit array B of size m is set to one at the indexes produced by a set of k linear, evenly distributed hash functions h_i that map the input sample to integer numbers &lt; m.</p>
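      <p>As a minimal, illustrative sketch of the coordinate-sampling reading of a random projection described above (the dimensions, names, and seed are assumptions of this example, not taken from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def make_random_projection(n, m, rng):
    """Project by picking a random subset of m of the n coordinates [12]."""
    idx = rng.choice(n, size=m, replace=False)  # fixed random coordinate set
    return (lambda x: x[idx]), idx

x = rng.random(1000)                        # flattened high-dimensional data
project, idx = make_random_projection(1000, 32, rng)
y = project(x)                              # 32-dimensional footprint of x
```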
      <p>For a membership query, a candidate sample is, conversely, mapped with the same hash functions h_i to a set of indexes. If the BF B is zero at one of these indices, the tested sample is not part of the set; otherwise, it is in the set with a certain probability. Special about this structure is that it does not allow False Negatives in such membership queries [14]. Therefore, if False Positives are acceptable or can be mitigated by other means, a BF-based data structure forms a system with a controllable error rate [2]. The downside of this data structure is that, in its initial form, it does not allow the deletion of objects from a computed set. Various modifications of this approach were therefore proposed in the past, e.g., counting BFs, which try to mitigate this issue but often introduce False Negatives [15].</p>
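      <p>The construction and membership query described above can be sketched as follows; the salted SHA-256 hash family and the sizes are illustrative choices of this sketch, not the paper's:</p>

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: k hash functions over a bit array of size m [13]."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indexes(self, item):
        # k evenly distributed hash functions, derived by salting SHA-256
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for j in self._indexes(item):
            self.bits[j] = 1

    def might_contain(self, item):
        # a zero at any index rules the item out; no false negatives
        return all(self.bits[j] for j in self._indexes(item))

bf = BloomFilter(m=1024, k=4)
bf.add("cell_48.137_11.575")   # e.g. an id of a geospatial raster cell
```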
      <p>Using BFs to represent geospatial data has rarely been reported before: e.g., [2] proposes a way to compress sparse binary global datasets with BF-like data structures in main memory while keeping pixel-wise accessibility. Their proposition improves upon existing solutions in specific cases, like the pixel-wise random access of temporally ordered but spatially labeled information. For random projections, to the best of our knowledge, no previous work on geospatial data has been reported. Conversely, using BFs for efficient set queries is common in various fields [14] but not explicitly reported for spatial data to the author's knowledge.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Research Questions</title>
      <p>The main goal of the project is the development of a
new data structure for geospatial data leveraging the
advantages of randomized data structures. This leads to
the following research questions (RQs):</p>
      <p>RQ1 How can a randomized data structure be modified for the efficient storage and retrieval of geospatial data?
RQ2 How is a spatial movement of the stored data efficiently represented in terms of database modifications?
RQ3 What are the properties of a geospatial database that enable continuous scaling of the database storage footprint?
RQ4 How do the newly proposed geospatial data representations leverage computation on new hardware like FPGAs and quantum computers?</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methods</title>
      <p>Based on the research questions in section 3, an improvement of spatial storage and retrieval may be achieved by further developing and adapting existing randomized data structures for this data environment. As a first consideration, the work focuses on data structures based on Random Projections and BFs as described in section 2. In the following, the research questions are expanded with a planned methodology to answer them. This is the basis for finally proposing a new geospatial data structure in the future.</p>
      <sec id="sec-3-1">
        <title>4.1. Efficient Storage and Retrieval</title>
        <p>Efficient storage and retrieval of geospatial data includes element-wise access, deletion, and modification (later referred to as basic operations). A distinction has to be made between the exact storage and retrieval of a single data element through proper indexing and mapping, and the efficient loading of a probabilistic representation of a data element.</p>
        <p>The aforementioned basic operations are possible for randomly projected footprints of the original data. In contrast, a standard BF supports the addition of single elements to the data set, but the removal or modification of an element can only be conducted by a recalculation of the whole filter. Proposals of Counting BFs [16] try to mitigate this issue but are less efficient. Part of this work is, therefore, to adapt the BF architecture, either by providing more efficient structures for geospatial data based on the counting BF architecture or by adding an external modification data structure that keeps track of modifications and triggers recalculations of the filter, or parts of the filter, based on a to-be-defined schedule or rule.</p>
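        <p>A counting BF in the sense discussed here can be sketched as follows; this is a minimal sketch under assumed sizes and hash family (real counting BFs additionally bound the counter width):</p>

```python
import hashlib

class CountingBloomFilter:
    """Counting BF sketch: counters instead of bits permit deletion [16]."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for j in self._indexes(item):
            self.counts[j] += 1

    def remove(self, item):
        # only sound for items that were actually added; removing a
        # never-added item may create false negatives for other items
        for j in self._indexes(item):
            if self.counts[j] > 0:
                self.counts[j] -= 1

    def might_contain(self, item):
        return all(self.counts[j] > 0 for j in self._indexes(item))

cbf = CountingBloomFilter(m=512, k=3)
cbf.add("tile_a")
cbf.remove("tile_a")   # deletion is possible without rebuilding the filter
```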
        <p>Another, non-basic operation that will be considered in this work is the merging or linking of several datasets through probabilistic data structures. This is especially important for computation in distributed systems, where different devices take over different operations and their results must be merged. Finding the right merge keys poses a challenging task. In contrast, simple bitwise OR operations can merge the BF representations of data sets. However, there are issues with this approach: the optimal BF construction depends on the number of samples to be represented. As the total number is not expected to be known during the independent construction of the two datasets, the combined filter may have an increased FP rate. Therefore, it is necessary to investigate how to modify the BF to perform merges efficiently, especially on representations of geospatial data.</p>
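        <p>The bitwise-OR merge and its effect on the fill ratio can be illustrated with a small standalone sketch (the parameters and hash family are assumptions of this example):</p>

```python
import hashlib

M, K = 256, 3  # both filters must share the same construction parameters

def indexes(item):
    for i in range(K):
        yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M

def build(items):
    bits = [0] * M
    for item in items:
        for j in indexes(item):
            bits[j] = 1
    return bits

def merge(a, b):
    # bitwise OR yields a filter representing the union of the two sets
    return [x | y for x, y in zip(a, b)]

def might_contain(bits, item):
    return all(bits[j] for j in indexes(item))

a, b = build(["tile_1", "tile_2"]), build(["tile_3"])
merged = merge(a, b)
# the merged filter is at least as full as either input, which is why
# independently sized filters can end up with an elevated FP rate
assert sum(merged) >= max(sum(a), sum(b))
```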
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Spatial Movement</title>
        <p>Moving geospatial data that is stored in probabilistic databases is a topic that, to our knowledge, has not been discussed so far. At the same time, it is an often performed operation in the geospatial domain. For data structures based on random projections, a movement of the input data in the spatial domain has to result in a corresponding movement of the projection. With rasterized data x ∈ R^n, a lower-dimensional space R^m with n &gt;&gt; m, a random projection P, and a movement T in the spatial domain, this means it should hold that
y = P(x) ∈ R^m → P(T(x)) = T(P(x))
(1)
To still achieve efficient random projections, it needs to be investigated how this changes the set of randomly selected points to project.</p>
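        <p>The following numeric sketch (1-D, cyclic movement; all names and sizes are assumptions of this example) makes the requirement concrete: the randomly selected coordinate set has to co-move with the data so that projecting the moved data reproduces the original projection.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, shift = 100, 10, 3

x = rng.random(n)                           # rasterized data, flattened
idx = rng.choice(n, size=m, replace=False)  # randomly selected coordinates

Tx = np.roll(x, shift)  # movement T: a cyclic shift in the spatial domain
# shifting the coordinate set along with the data makes the projection
# of the moved data coincide with the projection of the original data
assert np.allclose(Tx[(idx + shift) % n], x[idx])
```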
        <p>For BFs, it has to be discussed how a spatial movement of the data can be expressed as a modification of the BF, such that a simple movement of the data does not require a recalculation of the whole BF representation. Therefore, it is to be investigated how certain spatial patterns can be encoded by efficient use of the hash functions or by extending the standard BF with additional hierarchies. The main challenge here is the dimension difference between spatial locations (2D or more) and standard 1D BFs. The ideas here involve either mapping the input data to 1D, e.g., through space-filling curves, or encoding the spatial domain by mapping spatial data points only to subsets of the final BF. The second option enables a spatial movement by applying simple bit-shift operations on the stored data, as shown in Figure 2. Alternatively, a second dimension can be added to the BF, which encodes the location separately, or locality-sensitive hashing techniques may be used.</p>
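        <p>One reading of the subset-mapping idea can be sketched as follows: if each discrete spatial row is mapped to its own fixed block of the filter, a movement by whole rows reduces to shifting blocks instead of recalculating the filter. The block layout and all sizes are assumptions of this sketch, not the paper's design.</p>

```python
ROWS, BLOCK = 8, 32  # 8 discrete spatial rows, 32 filter bits per row

def move_rows(bits, row_shift):
    """Translate the represented data by row_shift rows via block shifts."""
    blocks = [bits[r * BLOCK:(r + 1) * BLOCK] for r in range(ROWS)]
    shifted = [blocks[(r - row_shift) % ROWS] for r in range(ROWS)]
    return [bit for block in shifted for bit in block]

bits = [0] * (ROWS * BLOCK)
bits[0] = 1                   # one set bit, belonging to row 0
moved = move_rows(bits, 1)    # move the stored data down by one row
```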
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Scaling of the Database Footprint</title>
        <p>Both considered randomized data structures, Random Projections and BFs, have the useful property of relatively easy resizing. For Random Projections, a fraction of the randomly picked elements still holds a fraction of the initial information: taking only a subset of the randomly projected points results in a representation that still keeps initial information. The open question in this case is how to choose the subset most efficiently.</p>
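        <p>A minimal sketch of this resizing property (sizes are illustrative): keeping a subset of the projected points is itself a coarser random projection of the original data.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
idx = rng.choice(n, size=64, replace=False)  # a 64-dimensional projection
x = rng.random(n)
y = x[idx]

# shrink the representation by keeping only 16 of the projected points;
# the result equals a direct random projection onto the kept coordinates
keep = rng.choice(64, size=16, replace=False)
y_small = y[keep]
assert np.allclose(y_small, x[idx[keep]])
```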
        <p>For BFs, it is given by construction that one half of the filter still holds a valid representation to which additional data can be added. However, scaling with a factor ≠ 2 is not easily possible [2]. This is especially important as today's hardware often also scales by factors of 2, e.g., 8 GB → 16 GB RAM. Therefore, reducing the size of an, e.g., 64 GB large data set by a factor of 1/2 is not always efficient, as it leaves either too much or no space for system processes. Therefore, scaling by a rational factor should be introduced. This may be conducted by the superposition of non-uniformly distributed hash functions.</p>
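        <p>The factor-2 property can be sketched as an OR-fold of the two filter halves: because an index taken mod m and then mod m/2 equals the index mod m/2, every stored item still hits in the folded filter. Power-of-two sizes and the hash family are assumptions of this sketch; rational factors are exactly what this construction does not provide.</p>

```python
import hashlib

M, K = 1024, 4  # M a power of two so that halving can be repeated

def indexes(item, m):
    for i in range(K):
        yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % m

def build(items, m):
    bits = [0] * m
    for item in items:
        for j in indexes(item, m):
            bits[j] = 1
    return bits

def fold(bits):
    """Halve the filter by OR-ing its halves; query with indexes mod m/2."""
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]

items = [f"cell_{i}" for i in range(50)]
small = fold(build(items, M))
# (h mod M) mod (M/2) == h mod (M/2), so no stored item is lost
assert all(all(small[j] for j in indexes(t, M // 2)) for t in items)
```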
        <p>For very small representations of the data, this fractioning is not as easy for BFs: an optimal number of hash functions can be computed from the number of to-be-stored samples and the requested filter size, and for very small filters a non-natural number of hash functions would often be optimal. Therefore, the author proposes to use a rational number of hash functions by applying one of the functions only with a probability representing the rational fraction of the number.</p>
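        <p>One way to realize the proposed rational number of hash functions can be sketched as follows. To keep add and query consistent (and thereby avoid false negatives), this sketch derives the extra-hash coin deterministically from the item itself; that design choice, like the hash family, is an assumption of the example, not taken from the paper.</p>

```python
import hashlib
import math

def indexes(item, k_rational, m):
    # always apply floor(k) hash functions; apply one extra for a
    # deterministic pseudo-random fraction frac(k) of the items, so the
    # expected number of applied hash functions equals k_rational
    k = math.floor(k_rational)
    frac = k_rational - k
    coin = int(hashlib.sha256(f"coin:{item}".encode()).hexdigest(), 16) % 1000
    if frac * 1000 > coin:
        k += 1
    for i in range(k):
        yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % m

counts = [len(list(indexes(f"s{i}", 2.5, 64))) for i in range(10000)]
mean = sum(counts) / len(counts)
assert 0.1 > abs(mean - 2.5)  # about 2.5 hash functions applied on average
```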
      </sec>
      <sec id="sec-3-4">
        <title>4.4. Computation on New Hardware</title>
        <p>For the computation on future hardware, especially small representations of the to-be-considered data are needed. For instance, FPGAs have to deal with limited storage space. This means an embedding of the original data that fits the hardware requirements is needed. The same is true for quantum computers, which are still restricted to a limited number of qubits. By using randomly projected representations of the original complex geo-data, it is therefore still possible to compute in this hardware-restricted environment. Furthermore, BFs can map the original data to a bit array. This is especially helpful for computation with quantum computers, as it enables a simple and gate-efficient data embedding using only non-complex basis encoding.</p>
        <p>Based on the above considerations and propositions, a new method to store probabilistic representations of geospatial data sets is to be developed. This will enable the efficient usage of big geospatial data in new application domains and on new hardware, such as efficient route planning on remote vehicles and the efficient provision of geospatial data for quantum computers. To actually enable a broad usage of these newly proposed representations, they finally need to be integrated into a framework for use with existing GIS software. Results obtained during the development of the new data structure may also offer new possibilities to improve singular computationally expensive collection, simulation, processing, or presentation tasks of geospatial data.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 507196470.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>