<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>May</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Spatial Data using Knowledge Graphs - a Case Study of Microflora Danica</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Katja Hose</string-name>
          <email>katja.hose@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mads Corfixen</string-name>
          <email>ma@bio.aau.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Heede</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomer Sagi</string-name>
          <email>tsagi@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mads Albertsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas D. Nielsen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Spatial Data Integration, Knowledge Graph, S2 Geometry</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Chemistry and Bioscience, Aalborg University</institution>
          ,
          <addr-line>Aalborg</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Aalborg University</institution>
          ,
          <addr-line>Aalborg</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Logic and Computation, TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <issue>2024</issue>
      <abstract>
        <p>Integrating semantically related, multi-modal, heterogeneous data sources is challenging, especially if one of the modalities includes spatial data, such as field measurements organized in geographical grids. Since geographical grids can have different rotations, be translated along one or more axes, or have different resolutions, a particular challenge when integrating such data is to reduce the information loss from projecting different grids into a common format. In this paper, we study this problem and sketch a method for integrating such spatial data using knowledge graphs. We discuss this solution in the context of a real-world use case, where we integrate geographically annotated microbial data (Microflora Danica) as well as environmental data to enable joint analysis. The first results of our experiments show that our method reduces the information loss compared to baseline methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Microflora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Integrating and linking semantically related heterogeneous and multi-modal data is an important
task that is often very challenging [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. While many works have focused on integrating tabular
data, defined over a fixed relational schema, recent years have seen a diversification into
combining multi-modal data (such as semi-structured data from different modalities) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
data with a flexible schema [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The need for data integration appears in a multitude of domains. One such domain is the
study of microbiomes, i.e., the interaction between microbes and the environment they inhabit.
Understanding these interactions is essential for solving current and future environmental
challenges [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such data is inherently multi-modal because the DNA sequence data is of
a different modality than the semantic relations between microbial species and the spatial
information about their environments. Spatial information in general often refers to vector or
raster data. Vector data consists of (x, y, z)-coordinates associated with real-world measures of
a specific area, where the area’s shape is defined by vector coordinates and can take the form
of points, lines, polylines, and polygons. On the other hand, raster data is a defined grid over
an area where each grid cell has a measurement value. The raster file itself is associated with
information such as the geographical location and the height and width of the grid.
      </p>
      <p>
        One method to facilitate the integration of multi-modal data is through a knowledge graph
(KG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. An advantage of using KGs for data integration is that they can flexibly support
multiple modalities [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. KGs can cover strings, images, and audio, as well as links between
them, effectively allowing heterogeneous data to be mapped directly without requiring any
transformations to conform to strict schemas required by other data integration solutions.
Another advantage is that KGs are associated with ontologies that clearly define the semantics
of the concepts in the data sources. In contrast, table and column headers in a relational database
are often only clearly defined in the mind of the designer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In this paper, we present a case study for integrating spatial heterogeneous multi-modal data
sources using a KG. The case study is based on the Microflora Danica<sup>1</sup> (MfD) dataset, an archive
of microorganisms in Denmark. We present the challenges and possible solutions of enriching
this dataset with data from EcoDes-DK15 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a high-resolution dataset of ecological descriptors,
and soil maps of Denmark [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We focus on integrating the data along the spatial dimension,
as the datasets are connected through their geographical properties, i.e., GPS coordinates for
the MfD dataset and spatial raster maps for the EcoDes-DK15 and soil maps. The remainder
of this paper is structured as follows. First, in Section 2, we provide an overview of existing
approaches for integrating multi-modal data with a spatial component. Then, in Section 3, we
present the different data sources of our case study and their different modalities. Next, in
Section 4, we present our KG design for spatial integration and show how it can be extended
with microbial data. In Section 5, we evaluate our spatial integration approach based on the
information loss and compare it to a baseline method. Finally, Section 6 concludes the paper
with a summary and an outlook to future work.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Data integration using KGs has recently gained attention, mainly due to their flexibility, but also
because of the advantages when dealing with heterogeneous multi-modal data [
        <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
        ]. While
many studies have explored integrating or transforming spatial data into KGs, they focus mostly
on vector data [
        <xref ref-type="bibr" rid="ref1 ref12 ref13 ref14">1, 12, 13, 14</xref>
        ]. Integration of raster data has not been thoroughly explored.
      </p>
      <p>
        GeoTriples [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposes a method for transforming spatial data in vector format into RDF.
The method extends the mapping languages R2RML<sup>2</sup> and RML<sup>3</sup> to define rules for mapping
structured and semi-structured, heterogeneous vector data to RDF graphs. They recognize that
spatial data also exists in raster format but only provide a framework for transforming spatial
vector data into RDF triples. TripleGeo [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a similar method, but the proposed framework
also does not support raster data.
      </p>
      <sec id="sec-3-1">
        <p><sup>1</sup> https://www.bio.aau.dk/forskning/projekter/microflora-danica
<sup>2</sup> https://www.w3.org/TR/r2rml/
<sup>3</sup> https://rml.io/specs/rml/</p>
        <p>
          Tran et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] propose an ETL process for integrating files in raster format to build an RDF
triple store over a designated area. They aggregate raster cells to extract a single value for a
territorial area, which leads to a loss of fine-grained information.
        </p>
        <p>
          Zhu et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] model observations as aggregations on discrete global grid systems, albeit
without providing an actual framework. Moreover, they describe the integration of singular
event geometries, such as wildfires, rather than integrating raster data.
        </p>
        <p>Several hierarchical discrete global grids (HDGG) have been proposed to represent spatial
data, including H3<sup>4</sup>, Bing Maps Tile System<sup>5</sup>, and S2 Geometry<sup>6</sup>. Common among them is the
hierarchical division of the Earth into subsets. H3 utilizes hexagons, but, since hexagons cannot
be perfectly subdivided into smaller hexagons, child cells are only approximately contained
within their parent cells. Bing Maps Tile System projects the Earth onto a map; however, this
projection distorts the scales in proportion to the distance to the poles. The S2 Geometry
decomposes Earth into a hierarchy of cells. Instead of mapping points to a plane, S2 maps
them to a perfect sphere. Since Earth is closer to being a sphere than a plane, this creates less
distortion. At the top-most level of the S2 hierarchy, Earth is represented by six cells, perfectly
covering the Earth, with each lower level subdividing each cell into four children.</p>
        <p>
          The KnowWhereGraph (KWG) [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ] is a KG using S2 Geometry as the spatial
component. The KWG focuses on how to model spatial data as RDF triples, but without providing a
framework for mapping between raster files and S2 Geometry. Additionally, they recognize the
advantages and limitations of the S2 Geometry as the spatial component but do not quantify
the error. Thus, our work is based upon the KWG, using it as a spatial layer to enable spatial
data integration.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data Sources and Use Case Description</title>
      <p>In this paper, we consider a use case where we want to integrate several real-world heterogeneous
data sources of microbes and ecological descriptors of their habitats, exploiting the capabilities
of KGs for flexible multi-modal integration of spatially related data. In this section, we describe
the available data sources and how they shape our case study.</p>
      <p>The Microflora Danica Dataset. The Microflora Danica<sup>7</sup> (MfD) dataset comprises more than
10,000 measured samples collected from different sample sites in Denmark. The MfD dataset
was created to provide a database of all microbes in Denmark, and contains several different
modalities. Information on where each sample was taken is geographical vector data, and DNA
reads can be considered as very long strings. Each sample in the MfD dataset is sequenced at
least once, a process that constructs probable DNA reads from a sample. Through this
process, each sample is associated with approximately 10 million DNA reads, r, each being a
fragment of a full genome consisting of the four bases found in DNA, Adenine (A), Thymine (T),
Cytosine (C), and Guanine (G), i.e., <inline-formula><tex-math>r \in \{A, T, C, G\}^{+}</tex-math></inline-formula>. In total, the MfD contains 28 TB of reads.</p>
      <p>The structure of the MfD dataset is shown in Figure 1. Each sample site is described in the
Fieldsample Metadata, which contains information such as GPS location and habitat type,
with the habitat types linked to three external taxonomies. Each sample is sequenced multiple
times, each sequencing being described in the Sequencing Metadata, linked to its related
sample site through a fieldsample_barcode. The Sequencing Metadata describes how each
sample was sequenced and handled in the laboratory. Each sequencing leads to a number of
reads, collected in a single file we denote as Read Data, connected to its related sequencing
through a sequence_id. In a Read Data file, there is, besides the reads themselves, a certainty
measure for each base of the reads. Finally, each site was sampled as part of different projects,
each project potentially collecting multiple samples. The projects are described in the Project
Metadata, containing information on the parties responsible for the projects. Each sample site
is linked to its related project through a project_id.</p>
      <p><sup>4</sup> https://www.uber.com/en-SE/blog/h3/
<sup>5</sup> https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system
<sup>6</sup> https://s2geometry.io
<sup>7</sup> https://www.bio.aau.dk/forskning/projekter/microflora-danica</p>
      <p>
        Besides information on the reads, sequences, and sample sites, we are also provided with a mapping
from each sample to potential microbial species (taxa) present in that sample. Since reads
are only fragments of the complete genome of a taxon, the mappings are uncertain. They are
structured as shown in Table 1. Each Operational Taxonomic Unit (OTU) is a DNA sequence
that encodes a specific, potentially undiscovered taxon. Each MFD_X column represents a
fieldsample_barcode and shows how many reads from a sample site were mapped to that OTU.
      </p>
      <p>
        Environmental Raster Files. The EcoDes-DK15 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and soil maps [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] datasets contain
measurements of ecological descriptors across Denmark, providing us a means to quantify the
microbial habitats of the MfD dataset. Both datasets are in the form of raster files, adding a new
modality that needs to be integrated besides the ones present in the MfD dataset.
      </p>
      <p>A raster file is a matrix organized into a grid of rows and columns of raster cells, each cell
representing a geographical area. Each raster cell of a raster file is furthermore associated with
a value representing information about the geographical area, such as temperature or pH. In
total, the environmental data contains approximately 200 GB across approximately 100 files,
each containing a diferent type of measure.</p>
      <p>A raster file contains the geo-location of the upper-left cell of its matrix and a transformation
matrix describing the translations and rotations needed to derive the spatial locations of all
other cells. In our datasets, raster cells have a resolution of
10 × 10 meters, and the spatial location is given in the Universal Transverse Mercator (UTM)
coordinate system. In the UTM coordinate system, each location within a zone is defined by the pair (x, y),
with x, the easting, being the eastward distance measured relative to the central meridian of the
UTM zone (offset by a false easting of 500,000 m), and y, the northing, being the
distance north of the equator.</p>
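      <p>As a minimal sketch of how such a transformation can be applied, the following example maps a raster cell index to the UTM coordinates of the cell center. It assumes a six-parameter affine transform of the kind commonly stored with geospatial rasters; the parameter names, the function name cell_center, and the example origin are illustrative assumptions and are not taken from the EcoDes-DK15 or soil map files.</p>
      <preformat>
```python
from typing import Tuple

# Hypothetical affine parameters (a, b, c, d, e, f), assumed for illustration:
#   x = a * col + b * row + c
#   y = d * col + e * row + f
# (c, f) is the UTM location of the upper-left corner of the raster.
Affine = Tuple[float, float, float, float, float, float]


def cell_center(transform: Affine, row: int, col: int) -> Tuple[float, float]:
    """Return the UTM (easting, northing) of the center of cell (row, col)."""
    a, b, c, d, e, f = transform
    # Shift by half a cell so the result is the cell center, not its corner.
    col_c, row_c = col + 0.5, row + 0.5
    x = a * col_c + b * row_c + c
    y = d * col_c + e * row_c + f
    return x, y


# Illustrative north-up raster with 10 m cells; the origin is made up.
example = (10.0, 0.0, 500000.0, 0.0, -10.0, 6200000.0)
print(cell_center(example, row=0, col=0))  # (500005.0, 6199995.0)
```
      </preformat>
      <p>A rotated raster would simply have non-zero b and d terms; the mapping to a common grid does not need to treat that case separately.</p>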
      <p>The S2 Geometry. In order to handle the integration of raster files with potentially different
transformation matrices, we integrate them into a common grid, the S2 Geometry, as all raster
cells in a raster file can be mapped to a set of corresponding cells in the S2 Geometry, regardless
of the transformation matrix.</p>
      <p>The S2 Geometry is a 31-level hierarchical grid that decomposes Earth into a hierarchy of
cells. At the top-most level (level 0) of the hierarchy, Earth is divided into six cells perfectly
covering it, while each higher level of granularity subdivides each cell into four children, such
that there are 24 cells at level 1, 96 cells at level 2, and so on.</p>
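      <p>The cell counts above follow from six top-level cells that are each split into four children per level. A small sketch of this arithmetic is given below; the function name s2_cell_count is an illustrative assumption and not part of any S2 library.</p>
      <preformat>
```python
def s2_cell_count(level: int) -> int:
    """Number of S2 cells at a level: 6 top-level cells, each split in 4 per level."""
    if level not in range(31):
        raise ValueError("S2 levels range from 0 to 30")
    return 6 * 4 ** level


for level in (0, 1, 2):
    print(level, s2_cell_count(level))  # 6, 24 and 96 cells
```
      </preformat>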
    </sec>
    <sec id="sec-5">
      <title>4. Spatial Integration Approach</title>
      <p>In this section, we present our KG design for spatial integration, demonstrating how the S2
Geometry can be adopted to independently integrate the spatial aspect of each data source.
Further, we show how we extend the KG design to integrate the rest of the MfD dataset. The
code is available on GitHub<sup>8</sup>.</p>
      <p>Spatial Data. We build upon well-established ontologies to model the spatial dimension, as
shown in Table 2. The geo ontology is used to model the spatial relations between raster
cells, S2 cells, and MfD sample sites; the upper level ontology oboe is used to model the
environmental observations and measurements; and the kwg-ont ontology is the ontology of
the KnowWhereGraph.</p>
      <p>We can consider a raster file as a collection of raster cells, where each raster cell is associated
with at least one measurement. When the raster cells of several raster files perfectly overlap,
as they do in our use case, each raster cell is associated with multiple measurements
(pH value, soil salinity, etc.). This is modeled using the oboe concepts ObservationCollection,
Observation, Measurement, and MeasurementType; and properties hasMember, hasMeasurement,
hasValue, and containsMeasurementsOfType (see Figure 2).</p>
      <p>To model the spatial relation between S2 Geometry and the MfD sample sites, we link each
site to the S2 cell that covers it, using the GPS location obtained from the MfD dataset, described
in Section 3. For the spatial relation between the raster cells and S2 Geometry, we associate
each raster cell with the set of S2 cells that covers it.</p>
      <sec id="sec-5-1">
        <p><sup>8</sup> https://github.com/MicrobialDarkMatter/MfD-spatial-integration</p>
        <p>To model these spatial relations between
raster cells, S2 cells, and MfD sample sites, we utilize the geo ontology properties coveredBy and
covers, see Figure 2, where the data from the raster files is colored in orange, the data from the
MfD dataset is colored in blue, and the spatial integration layer, S2 Geometry, in bold.</p>
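        <p>The following sketch illustrates how such links could be generated and then used to find the raster cells relevant to a sample site. It is an illustration under stated assumptions rather than the pipeline from the repository: triples are plain string tuples, the prefixed names geo:coveredBy and geo:covers are shorthand for the properties named above, the direction of each link is assumed, and all identifiers starting with ex: are hypothetical.</p>
        <preformat>
```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def spatial_triples(raster_cell: str, s2_cells: Iterable[str],
                    sample_site: str, site_s2_cell: str) -> List[Triple]:
    """Build covers/coveredBy links for one raster cell and one sample site."""
    triples: List[Triple] = []
    for s2 in s2_cells:
        # Assumed direction: the raster cell is covered by each S2 cell.
        triples.append((raster_cell, "geo:coveredBy", s2))
        triples.append((s2, "geo:covers", raster_cell))
    # The sample site is linked to the single S2 cell that covers it.
    triples.append((sample_site, "geo:coveredBy", site_s2_cell))
    return triples


def raster_cells_for_site(triples: List[Triple], site: str) -> List[str]:
    """Follow site -> S2 cell -> raster cell through the shared S2 layer."""
    covering = {o for s, p, o in triples if s == site and p == "geo:coveredBy"}
    return sorted({o for s, p, o in triples
                   if p == "geo:covers" and s in covering})


# Hypothetical identifiers, for illustration only.
graph = spatial_triples("ex:raster/42", ["ex:s2/a", "ex:s2/b"],
                        "ex:site/MFD_1", "ex:s2/a")
print(raster_cells_for_site(graph, "ex:site/MFD_1"))  # ['ex:raster/42']
```
        </preformat>
        <p>Because both data sources only refer to S2 cells, neither needs to know about the grid of the other.</p>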
        <p>Our current datasets do not have significant overlaps, so we do not have significant
interoperability conflicts among them. However, in the future, we would like to integrate data from
different sources; therefore, a semantic alignment pipeline will be needed.</p>
        <p>Spatial Integration. The spatial aspect of each data source is mapped to the S2 Geometry. The
integration can be split into two types: the vector-based GPS integration of the MfD sample
sites and the area-based integration of raster files. We focus on the integration of raster files
in this section, as the integration of the vector-based MfD sample sites simply requires us to
integrate the given coordinates directly to the covering S2 cell.</p>
        <p>As described in Section 3, the S2 Geometry is a hierarchy of levels determining the granularity
of S2 cells. Increasing the S2 level by one reduces the area of each S2 cell
by a factor of 4. Therefore, the S2 level affects the number of S2 cells required to cover
each raster cell; a high granularity requires many S2 cells, whereas a low granularity requires
fewer. Consequently, lower granularities of the S2 level increase the over-coverage, i.e., the
area of S2 cells that lies outside their associated raster cells. The over-coverage is
illustrated for two different granularities of the S2 level in Figure 3, where the S2 cells (red)
cover area outside the raster cell (green).</p>
        <p>Ideally, a set of S2 cells would cover each raster cell perfectly;
in practice, this is not the case, so choosing the highest
granularity yields the lowest possible over-coverage. The drawback
is that each additional level quadruples the storage requirements, as
each S2 cell has four child cells.</p>
        <p>The EcoDes-DK15 and soil raster files have a resolution of 100
m<sup>2</sup>; hence, to minimize over-coverage, an S2 level of 24 is chosen
as S2 cells have an area of approximately 0.3 m<sup>2</sup> at that S2 level.</p>
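        <p>The approximate cell area per level can be checked with a short calculation. The sketch below assumes a spherical Earth with a mean radius of 6,371 km and divides its surface evenly among all cells of a level; real S2 cells vary somewhat in area depending on their position, so the values are only averages. The function name is an illustrative assumption.</p>
        <preformat>
```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an approximation


def mean_s2_cell_area_m2(level: int) -> float:
    """Average cell area in square meters: sphere surface / (6 * 4**level)."""
    surface = 4.0 * math.pi * EARTH_RADIUS_M ** 2
    return surface / (6 * 4 ** level)


for level in (20, 24):
    print(level, round(mean_s2_cell_area_m2(level), 2))
```
        </preformat>
        <p>Under these assumptions, level 24 gives an average of roughly 0.3 square meters per cell, in line with the figure quoted above.</p>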
        <p>Irrespective of the S2 level, S2 cells over-covering a raster cell
will overlap neighboring raster cells. This is not desirable since
each S2 cell should only correspond to one value per feature, e.g.,
we cannot associate two pH values with the same location. To
address this issue, we use a majority rule; however, one could also aggregate the values, as long
as they are continuous. The majority rule disassociates S2 cells from any raster cells in which
their centroids are not located. However, disassociating S2 cells from a raster cell will introduce
under-coverage, i.e., the sub-area of a raster cell that is not covered by any S2 cells.</p>
        <p>Figure 3: S2 cells covering raster cells at different levels.</p>
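        <p>A minimal planar sketch of the majority rule is shown below. It simplifies the setting by treating raster cells as axis-aligned boxes and S2 cells as centroid points in the same coordinate system; the identifiers and helper names are hypothetical, and a real implementation would obtain centroids from an S2 library.</p>
        <preformat>
```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def contains(box: Box, point: Point) -> bool:
    """True if the point lies in the half-open box [min, max)."""
    min_x, min_y, max_x, max_y = box
    x, y = point
    return x >= min_x and max_x > x and y >= min_y and max_y > y


def assign_by_centroid(raster_cells: Dict[str, Box],
                       s2_centroids: Dict[str, Point]) -> Dict[str, Optional[str]]:
    """Map each S2 cell to the one raster cell that contains its centroid."""
    assignment: Dict[str, Optional[str]] = {}
    for s2_id, centroid in s2_centroids.items():
        owner = None
        for raster_id, box in raster_cells.items():
            if contains(box, centroid):
                owner = raster_id
                break
        assignment[s2_id] = owner
    return assignment


# Two neighboring 10 m raster cells and two S2 cells near their shared edge.
rasters = {"r1": (0.0, 0.0, 10.0, 10.0), "r2": (10.0, 0.0, 20.0, 10.0)}
centroids = {"s2_a": (9.8, 5.0), "s2_b": (10.2, 5.0)}
print(assign_by_centroid(rasters, centroids))  # {'s2_a': 'r1', 's2_b': 'r2'}
```
        </preformat>
        <p>Both S2 cells straddle the shared edge, yet each ends up with exactly one raster cell, which is what keeps a single value per feature and location.</p>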
        <p>A potential error introduced by over- and under-coverage is integrating an MfD site to an S2
cell corresponding to an incorrect raster cell. For example, in the left of Figure 3, the sample
site (cross) is linked to an S2 cell that covers the shown raster cell instead of an S2 cell covering
the raster cell below it. S2 cells of higher granularity diminish this problem.</p>
        <p>In general, the integration of raster files is challenging due to different raster files possibly
having different transformations, i.e., resolutions, rotations, and translations. However, utilizing
the S2 Geometry, the integration of raster files is independent across different transformations.
Thus, our approach affords the integration of arbitrary raster files.</p>
        <p>
          Microbial Data. The MfD dataset contains sample-specific information such as metadata,
DNA reads, sequencing metadata, sample-to-OTU mappings, and habitat information of the
sample site. As the focus of this paper is on spatial integration, we omit details on this part;
however, we outline the design for this integration in Figure 4. Note that each entity has
properties, but due to limitations, we do not discuss metadata associated with the microbial
data. Each sample site, marked in blue, has a habitat type, such as ’forests’ or ’grasslands’,
which we map to a corresponding concept in the Environment Ontology, ENVO [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], marked
in green. Furthermore, each sample has a mapping to an OTU, which encodes a specific
microbial taxon. These taxa are mapped to corresponding taxa in the Taxonomy
Database ontology, NCBITaxon<sup>9</sup>, marked in green. Both the habitat and taxon mappings to
external ontologies require some entity resolution methods to account for, e.g., spelling errors.
Finally, since the DNA reads, marked in red, take up 28 TB of storage, we do not keep them
directly in the graph, but instead store a reference to an external key-value store.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Evaluation</title>
      <sec id="sec-6-1">
        <p>In this section, we evaluate the proposed framework in terms of
the information loss of using the S2 Geometry as the integration
layer and compare it to a baseline method. Furthermore, we
highlight potential issues when using the S2 Geometry.</p>
        <p>Information Loss. We observe two contributing factors to the
information loss from our integration to the S2 Geometry. The
first type of information loss is over-coverage (Equation 1), where a part of the area of the
minimal set of covering S2 cells associated with a raster cell is outside the bounds of the raster
cell. The over-coverage is calculated as the proportionate area of the set of S2 cells not contained
within the raster cell. The second type of information loss results from the implementation of
the majority rule, and is denoted under-coverage (Equation 2), where a raster cell is only partly
covered by its associated minimal set of covering S2 cells. The under-coverage is calculated
as the proportionate area of the raster cell not contained within the set of S2 cells. These two
types of information loss are visualized in Figure 5, with the blue area being over-coverage and
the red area being under-coverage.</p>
        <p>Figure 5: Over- and under-coverage.</p>
        <p><sup>9</sup> https://obofoundry.org/ontology/ncbitaxon.html</p>
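        <p><disp-formula><label>(1)</label><tex-math>\mathrm{OC} = \frac{|\mathrm{S2} \setminus \mathrm{Raster}|}{|\mathrm{S2}|}</tex-math></disp-formula></p>
        <p><disp-formula><label>(2)</label><tex-math>\mathrm{UC} = \frac{|\mathrm{Raster} \setminus \mathrm{S2}|}{|\mathrm{Raster}|}</tex-math></disp-formula></p>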
        <p>Since the two contributing factors are equally unwanted, we define the information loss as
the harmonic mean of the over- and under-coverage (Equation 3). Due to the nature of the
harmonic mean, higher values are desirable; however, this is not the case for the over- and
under-coverage calculations. Therefore, we use the complement of each coverage. This yields
an information accuracy, which we subtract from 1 to get the loss.</p>
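        <p><disp-formula><label>(3)</label><tex-math>\text{Information Loss} = 1 - 2 \cdot \frac{(1 - \mathrm{OC}) \cdot (1 - \mathrm{UC})}{(1 - \mathrm{OC}) + (1 - \mathrm{UC})}</tex-math></disp-formula></p>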
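        <p>Equations 1 to 3 can be computed directly from three areas: the area of the covering S2 cells, the area of the raster cell, and the area of their intersection. The sketch below does so; the function names and the example areas are illustrative assumptions and do not reproduce values from the experiments.</p>
        <preformat>
```python
def over_coverage(s2_area: float, overlap_area: float) -> float:
    """Equation 1: share of the S2 cover that lies outside the raster cell."""
    return (s2_area - overlap_area) / s2_area


def under_coverage(raster_area: float, overlap_area: float) -> float:
    """Equation 2: share of the raster cell not covered by any S2 cell."""
    return (raster_area - overlap_area) / raster_area


def information_loss(oc: float, uc: float) -> float:
    """Equation 3: one minus the harmonic mean of (1 - OC) and (1 - UC)."""
    acc_o, acc_u = 1.0 - oc, 1.0 - uc
    if acc_o + acc_u == 0.0:
        return 1.0  # no overlap at all: everything is lost
    return 1.0 - 2.0 * acc_o * acc_u / (acc_o + acc_u)


# Made-up areas in square meters: raster cell 100, S2 cover 102, overlap 99.
oc = over_coverage(102.0, 99.0)
uc = under_coverage(100.0, 99.0)
print(round(information_loss(oc, uc), 4))  # 0.0198
```
        </preformat>
        <p>Since 1 - OC is the overlap divided by the S2 area and 1 - UC is the overlap divided by the raster area, the harmonic mean reduces to twice the overlap divided by the sum of the two areas, which is a convenient check on an implementation.</p>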
        <p>The mean information loss for integrating with and without the majority rule is reported in
Figure 6 for different S2 levels, based on a random sample of 100 raster cells. Low-granularity
S2 levels have a mean information loss approaching 1, whereas higher granularities of the S2
level reduce the information loss. The advantage of using the majority rule is visible for higher
granularities, starting from S2 level 20. However, for low-granularity S2 cells, the majority rule
introduces high under-coverage: a large S2 cell covers many raster cells, but its centroid is
located within only one of them, so many raster cells become associated with no S2 cells.</p>
        <p>The S2 levels 1 through 15 and 27 through 30 are not shown, as these approach an information
loss of 1 and 0, respectively. At S2 level 24, where the information loss is 0.022, the information
loss begins to stagnate with higher granularity of the S2 levels, which is an indication that this
is a suitable level to integrate the use case raster files into.</p>
        <p>Using the S2 Geometry as the spatial layer is not without issues, especially when using a
high granularity for the S2 cells. The grid of the raster files used for this case study is approximately
35,000 × 45,000, resulting in 1.575 billion raster cells. For an S2 level of 24, each raster cell is
linked to 625 S2 cells on average. Combining these two numbers yields a total of approximately
1 trillion triples. However, as the raster files for this case study lack measurements for raster
cells located in the ocean, the actual number of integrated raster cells is reduced by around
75%, yielding approximately 250 billion triples solely for the integration of raster cells.</p>
        <p>Up- and Downsampling. To evaluate the gain of using the S2 Geometry as an integration layer,
we compare it to using up- and downsampling for combining raster files of different resolutions.
Upsampling refers to taking a set of cells at a lower granularity and then aggregating them into
higher granularity and vice versa for downsampling. We exemplify the information loss from
up- and downsampling via two hypothetical resolutions 10 × 10 and 7.5 × 7.5 meters, in red and
blue, respectively, as illustrated in Figure 7.</p>
        <p>In order to upsample and downsample, we define how the majority rule works in this setting.
For the highlighted red cell, we see that it is covered by four different blue cells. Since only two
of the blue cells have their centroids within the red cell, we downsample only into those two
cells. Conversely, we upsample the highlighted two blue cells into the single red cell in which
their centroids are located. We note that upsampling is more difficult than simply aggregating
if we deal with categorical attributes.</p>
        <p>An information loss of 0.5 is obtained for upsampling into the 10 × 10 grid and 0.298 for
downsampling into the 7.5 × 7.5 grid. In comparison, integrating the two grids into the S2
Geometry at level 24 results in an information loss of 0.028 and 0.019, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>In this paper, we have presented an approach for integrating multi-modal heterogeneous data
sources through a spatial KG layer in the context of integrating geographically annotated
microbial data and environmental features to enable joint analysis. While the emphasis has
been on integrating the spatial data sources, we have also discussed the design of the complete
multi-modal data integration of raster data, DNA reads, and ontologies. We propose an approach
based on the S2 Geometry that can integrate raster files of different resolutions, translations,
and rotations without performing significant aggregations of the raster cell measurements. In
the future, we plan to work on improving scalability given the large number of RDF triples
related to the use of the S2 Geometry as well as to capture provenance through the use of the
PROV-O ontology<sup>10</sup>.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is partially funded by the Villum Foundation (DarkScience project, 50093). We further
thank the SustainScapes group from Aarhus University for sharing the environmental dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <article-title>Semantic Data Integration</article-title>
          ,
          <source>in: Handbook of Big Data Technologies</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Stahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <article-title>Why Is Data Integration So Hard?</article-title>
          ,
          <source>in: Measuring the Data Universe</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Phan</surname>
          </string-name>
          , et al.,
          <article-title>Integration of multi-modal biomedical data to predict cancer grade and patient survival</article-title>
          , in: BHI, IEEE,
          <year>2016</year>
          , pp.
          <fpage>577</fpage>
          -
          <lpage>580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Big Data Integration</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>6</volume>
          (
          <year>2013</year>
          )
          <fpage>1188</fpage>
          -
          <lpage>1189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Timmis</surname>
          </string-name>
          , et al.,
          <article-title>The contribution of microbial biotechnology to sustainable development goals</article-title>
          ,
          <source>Microbial Biotechnology</source>
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>984</fpage>
          -
          <lpage>987</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellandi</surname>
          </string-name>
          , et al.,
          <article-title>Toward a General Framework for Multimodal Big Data Analysis</article-title>
          ,
          <source>Big Data</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>408</fpage>
          -
          <lpage>424</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
          <article-title>Multi-Modal Knowledge Graph Construction and Application: A Survey</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>715</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Leida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gusmini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <article-title>Semantics-aware data integration for heterogeneous data sources</article-title>
          ,
          <source>J. Ambient Intell. Humaniz. Comput</source>
          .
          <volume>4</volume>
          (
          <year>2013</year>
          )
          <fpage>471</fpage>
          -
          <lpage>491</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Assmann</surname>
          </string-name>
          , et al.,
          <article-title>EcoDes-DK15: High-resolution ecological descriptors of vegetation and terrain derived from Denmark's national airborne laser scanning data set</article-title>
          ,
          <source>ESSD</source>
          <volume>14</volume>
          (
          <year>2022</year>
          )
          <fpage>823</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Gomes</surname>
          </string-name>
          , et al.,
          <article-title>Soil assessment in Denmark: Towards soil functional mapping and beyond</article-title>
          ,
          <source>Frontiers in Soil Science</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simkus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <article-title>Semantic Querying of Integrated Raster and Relational Data: A Virtual Knowledge Graph Approach</article-title>
          , in: RuleML+RR (Companion),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kyzirakos</surname>
          </string-name>
          , et al.,
          <article-title>GeoTriples: Transforming geospatial data into RDF graphs using R2RML and RML mappings</article-title>
          ,
          <source>J. Web Semant</source>
          .
          <volume>52-53</volume>
          (
          <year>2018</year>
          )
          <fpage>16</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gür</surname>
          </string-name>
          , et al.,
          <article-title>A foundation for spatial data warehouses on the Semantic Web</article-title>
          ,
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <fpage>557</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Andersen</surname>
          </string-name>
          , et al.,
          <article-title>Publishing Danish Agricultural Government Data as Semantic Web Data</article-title>
          , in: JIST,
          <year>2014</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Patroumpas</surname>
          </string-name>
          , et al.,
          <article-title>TripleGeo: an ETL Tool for Transforming Geospatial Data into RDF Triples</article-title>
          , in: EDBT/ICDT Workshops, volume
          <volume>1133</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org
          ,
          <year>2014</year>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tran</surname>
          </string-name>
          , et al.,
          <article-title>Semantic Integration of Raster Data for Earth Observation: An RDF Dataset of Territorial Unit Versions with their Land Cover</article-title>
          ,
          <source>ISPRS Int. J. Geo Inf</source>
          .
          <volume>9</volume>
          (
          <year>2020</year>
          )
          <fpage>503</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , et al.,
          <article-title>Environmental Observations in Knowledge Graphs</article-title>
          , in: DaMaLOS,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          , et al.,
          <article-title>Know, Know Where, KnowWhereGraph: A Densely Connected, Cross-Domain Knowledge Graph and Geo-Enrichment Service Stack for Applications in Environmental Intelligence</article-title>
          ,
          <source>AI Mag</source>
          .
          <volume>43</volume>
          (
          <year>2022</year>
          )
          <fpage>30</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          , et al.,
          <article-title>A Pattern for Features on a Hierarchical Spatial Grid</article-title>
          , in: IJCKG, ACM,
          <year>2021</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Buttigieg</surname>
          </string-name>
          , et al.,
          <article-title>The environment ontology: contextualising biological and biomedical entities</article-title>
          ,
          <source>J. Biomed. Semant</source>
          .
          <volume>4</volume>
          (
          <year>2013</year>
          )
          <fpage>43</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>