<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Integrating Multi-Modal Spatial Data using Knowledge Graphs - a Case Study of Microflora Danica</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mads</forename><surname>Corfixen</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Aalborg University</orgName>
								<address>
									<settlement>Aalborg</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Heede</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Aalborg University</orgName>
								<address>
									<settlement>Aalborg</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tomer</forename><surname>Sagi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Aalborg University</orgName>
								<address>
									<settlement>Aalborg</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mads</forename><surname>Albertsen</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Chemistry and Bioscience</orgName>
								<orgName type="institution">Aalborg University</orgName>
								<address>
									<settlement>Aalborg</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><forename type="middle">D</forename><surname>Nielsen</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Aalborg University</orgName>
								<address>
									<settlement>Aalborg</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Katja</forename><surname>Hose</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Aalborg University</orgName>
								<address>
									<settlement>Aalborg</settlement>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Institute of Logic and Computation</orgName>
								<orgName type="institution">TU Wien</orgName>
								<address>
									<settlement>Vienna</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Integrating Multi-Modal Spatial Data using Knowledge Graphs - a Case Study of Microflora Danica</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">72B848C4B3D0A3B8EE51644C7F7F2157</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Spatial Data Integration</term>
					<term>Knowledge Graph</term>
					<term>S2 Geometry</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Integrating semantically related, multi-modal, heterogeneous data sources is challenging, especially if one of the modalities includes spatial data, such as field measurements organized in geographical grids. Since geographical grids can have different rotations, be translated along one or more axes, or have different resolutions, a particular challenge when integrating such data is to reduce the information loss from projecting different grids into a common format. In this paper, we study this problem and sketch a method for integrating such spatial data using knowledge graphs. We discuss this solution in the context of a real-world use case, where we integrate geographically annotated microbial data (Microflora Danica) as well as environmental data to enable joint analysis. The first results of our experiments show that our method reduces the information loss compared to baseline methods.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Integrating and linking semantically related heterogeneous and multi-modal data is an important task that is often very challenging <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. While many works have focused on integrating tabular data, defined over a fixed relational schema, recent years have seen a diversification into combining multi-modal data (such as semi-structured data from different modalities) <ref type="bibr" target="#b2">[3]</ref> and data with a flexible schema <ref type="bibr" target="#b3">[4]</ref>.</p><note place="foot">SeWebMeDA-2024: 7th International Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics, May 26, 2024, Hersonissos, Greece. Email: maco@cs.aau.dk (M. Corfixen); thhe@cs.aau.dk (T. Heede); tsagi@cs.aau.dk (T. Sagi); ma@bio.aau.dk (M. Albertsen); tdn@cs.aau.dk (T. D. Nielsen); katja.hose@tuwien.ac.at (K. Hose). ORCID: 0009-0004-7237-9694 (M. Corfixen); 0009-0004-7934-1293 (T. Heede); 0000-0002-8916-0128 (T. Sagi); 0000-0002-6151-190X (M. Albertsen); 0000-0002-4823-6341 (T. D. Nielsen); 0000-0001-7025-8099 (K. Hose).</note><p>The need for data integration appears in a multitude of domains. One such domain is the study of microbiomes, i.e., the interaction between microbes and the environment they inhabit. Understanding these interactions is essential for solving current and future environmental challenges <ref type="bibr" target="#b4">[5]</ref>. Such data is inherently multi-modal because the DNA sequence data is of a different modality than the semantic relations between microbial species and the spatial information about their environments. Spatial information in general often refers to vector or raster data. Vector data consists of (x, y, z)-coordinates associated with real-world measures of a specific area, where the area's shape is defined by vector coordinates and can take the form of points, lines, polylines, and polygons. 
On the other hand, raster data is a defined grid over an area where each grid cell has a measurement value. The raster file itself is associated with information such as the geographical location and the height and width of the grid.</p><p>One method to facilitate the integration of multi-modal data is through a knowledge graph (KG) <ref type="bibr" target="#b5">[6]</ref>. An advantage of using KGs for data integration is that they can flexibly support multiple modalities <ref type="bibr" target="#b6">[7]</ref>. KGs can cover strings, images, and audio, as well as links between them, effectively allowing heterogeneous data to be mapped directly without requiring any transformations to conform to strict schemas required by other data integration solutions. Another advantage is that KGs are associated with ontologies that clearly define the semantics of the concepts in the data sources. In contrast, table and column headers in a relational database are often only clearly defined in the mind of the designer <ref type="bibr" target="#b7">[8]</ref>.</p><p>In this paper, we present a case study for integrating spatial heterogeneous multi-modal data sources using a KG. The case study is based on the Microflora Danica<ref type="foot" target="#foot_0">1</ref> (MfD) dataset, an archive of microorganisms in Denmark. We present the challenges and possible solutions of enriching this dataset with data from EcoDes-DK15 <ref type="bibr" target="#b8">[9]</ref>, a high-resolution dataset of ecological descriptors, and soil maps of Denmark <ref type="bibr" target="#b9">[10]</ref>. We focus on integrating the data along the spatial dimension, as the datasets are connected through their geographical properties, i.e., GPS coordinates for the MfD dataset and spatial raster maps for the EcoDes-DK15 and soil maps. The remainder of this paper is structured as follows. 
First, in Section 2, we provide an overview of existing approaches for integrating multi-modal data with a spatial component. Then, in Section 3, we present the different data sources of our case study and their different modalities. Next, in Section 4, we present our KG design for spatial integration and show how it can be extended with microbial data. In Section 5, we evaluate our spatial integration approach based on the information loss and compare it to a baseline method. Finally, Section 6 concludes the paper with a summary and an outlook on future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Data integration using KGs has recently gained attention, mainly due to their flexibility, but also because of the advantages when dealing with heterogeneous multi-modal data <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b10">11]</ref>. While many studies have explored integrating or transforming spatial data into KGs, they focus mostly on vector data <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref>. Integration of raster data has not been thoroughly explored.</p><p>GeoTriples <ref type="bibr" target="#b11">[12]</ref> proposes a method for transforming spatial data in vector format into RDF. The method extends the mapping languages R2RML<ref type="foot" target="#foot_1">2</ref> and RML<ref type="foot" target="#foot_2">3</ref> to define rules for mapping structured and semi-structured, heterogeneous vector data to RDF graphs. They recognize that spatial data also exists in raster format but only provide a framework for transforming spatial vector data into RDF triples. TripleGeo <ref type="bibr" target="#b14">[15]</ref> is a similar method, but the proposed framework also does not support raster data.</p><p>Tran et al. <ref type="bibr" target="#b15">[16]</ref> propose an ETL process for integrating files in raster format to build an RDF triple store over a designated area. They aggregate raster cells to extract a single value for a territorial area, which leads to a loss of fine-grained information.</p><p>Zhu et al. <ref type="bibr" target="#b16">[17]</ref> model observations as aggregations on discrete global grid systems, albeit without providing an actual framework. 
Moreover, they describe the integration of singular event geometries, such as wildfires, rather than integrating raster data.</p><p>Several hierarchical discrete global grids (HDGG) have been proposed to represent spatial data, including H3<ref type="foot" target="#foot_3">4</ref>, Bing Maps Tile System<ref type="foot" target="#foot_4">5</ref>, and S2 Geometry<ref type="foot" target="#foot_5">6</ref>. Common among them is the hierarchical division of the Earth into subsets. H3 utilizes hexagons, but, since hexagons cannot be perfectly subdivided into smaller hexagons, child cells are only approximately contained within their parent cells. Bing Maps Tile System projects the Earth onto a map; however, this projection distorts scale increasingly toward the poles. The S2 Geometry decomposes Earth into a hierarchy of cells. Instead of mapping points to a plane, S2 maps them to a perfect sphere. Since Earth is closer to being a sphere than a plane, this creates less distortion. At the top-most level of the S2 hierarchy, Earth is represented by six cells, perfectly covering the Earth, with each lower level subdividing each cell into four children.</p><p>The KnowWhereGraph (KWG) <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> is a KG using S2 Geometry as the spatial component. The KWG focuses on how to model spatial data as RDF triples, but without providing a framework for mapping between raster files and S2 Geometry. Additionally, they recognize the advantages and limitations of the S2 Geometry as the spatial component but do not quantify the error. Thus, our work is based upon the KWG, using it as a spatial layer to enable spatial data integration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data Sources and Use Case Description</head><p>In this paper, we consider a use case where we want to integrate several real-world heterogeneous data sources of microbes and ecological descriptors of their habitats, exploiting the capabilities of KGs for flexible multi-modal integration of spatially related data. In this section, we describe the available data sources and how they shape our case study.</p><p>The Microflora Danica Dataset. The Microflora Danica<ref type="foot" target="#foot_6">7</ref> (MfD) dataset comprises more than 10,000 measured samples collected from different sample sites in Denmark. The MfD dataset was created to provide a database of all microbes in Denmark, and contains several different modalities. Information on where each sample was taken is geographical vector data, and DNA reads can be considered as very long strings. Each sample in the MfD dataset is sequenced at least once; sequencing is a process that constructs probable DNA reads from a sample. Through this process, each sample is associated with approximately 10 million DNA reads, 𝑅, each being a fragment of a full genome consisting of the four bases found in DNA, Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), i.e., 𝑅 ∈ {𝐴, 𝑇, 𝐶, 𝐺}⁺. In total, the MfD contains 28 TB of reads.</p><p>The structure of the MfD dataset is shown in Figure <ref type="figure" target="#fig_0">1</ref>. Each sample site is described in the Fieldsample Metadata, which contains information such as GPS location and habitat type, with the habitat types linked to three external taxonomies. Each sample is sequenced one or more times, with each sequencing described in the Sequencing Metadata and linked to its related sample site through a fieldsample_barcode. The Sequencing Metadata describes how each sample was sequenced and handled in the laboratory. 
Each sequencing leads to a number of reads, collected in a single file we denote as Read Data, connected to their related sequencing through a sequence_id. In a Read Data file, there is, besides the reads themselves, a certainty measure for each base of the reads. Finally, each site was sampled as part of different projects, each project potentially collecting multiple samples. The projects are described in the Project Metadata, containing information on the parties responsible for the projects. Each sample site is linked to its related project through a project_id. Besides information on the reads, sequences, and sample site, we are also provided a mapping from each sample to potential microbial species (taxons) present in that sample. Since reads are only fragments of the complete genome of a taxon, the mappings are uncertain. They are structured as shown in Table <ref type="table">1</ref>. Each Operational Taxonomic Unit (OTU) is a DNA sequence that encodes a specific, potentially undiscovered taxon. Each MFD_X column represents a fieldsample_barcode and shows how many reads from a sample site were mapped to that OTU.</p></div>
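<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of how the sample-to-OTU mapping could be held in memory, the following Python fragment stores read counts per OTU and fieldsample_barcode and lists the OTUs observed in a sample. The counts are the two example rows of Table 1; the function name otus_in_sample and the min_reads threshold are assumptions made for illustration and are not part of the MfD tooling.</p>
<p><code lang="python">
```python
# Read counts per OTU and sample barcode, copied from the example rows of Table 1.
otu_counts = {
    "OTU_1": {"MFD_1": 0, "MFD_2": 7},
    "OTU_2": {"MFD_1": 633, "MFD_2": 482},
}


def otus_in_sample(counts, barcode, min_reads=1):
    """Return the OTUs with at least min_reads reads mapped from one sample.

    min_reads is a hypothetical threshold: since the mappings are uncertain,
    a stricter cut-off than a single read may be preferable in practice.
    """
    return sorted(
        otu
        for otu, per_sample in counts.items()
        if per_sample.get(barcode, 0) >= min_reads
    )


print(otus_in_sample(otu_counts, "MFD_1"))  # ['OTU_2']
print(otus_in_sample(otu_counts, "MFD_2"))  # ['OTU_1', 'OTU_2']
```
</code></p>
<p>Keeping the raw counts, rather than a presence flag, leaves the choice of threshold to the analysis that consumes the mapping.</p></div>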
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Mappings from sample sites to taxons</p><table><row><cell>OTU</cell><cell>MFD_1</cell><cell>MFD_2</cell><cell>⋯</cell><cell>Kingdom</cell><cell>⋯</cell><cell>Species</cell></row><row><cell>OTU_1</cell><cell>0</cell><cell>7</cell><cell>⋯</cell><cell>Archaea</cell><cell>⋯</cell><cell>MFD_s_17257</cell></row><row><cell>OTU_2</cell><cell>633</cell><cell>482</cell><cell>⋯</cell><cell>Bacteria</cell><cell>⋯</cell><cell>Corynebacterium</cell></row><row><cell>⋮</cell><cell>⋮</cell><cell>⋮</cell><cell>⋱</cell><cell>⋮</cell><cell>⋱</cell><cell>⋮</cell></row></table><p>Environmental Raster Files. The EcoDes-DK15 <ref type="bibr" target="#b8">[9]</ref> and soil maps <ref type="bibr" target="#b9">[10]</ref> datasets contain measurements of ecological descriptors across Denmark, providing us with a means to quantify the microbial habitats of the MfD dataset. Both datasets are in the form of raster files, adding a new modality that needs to be integrated besides the ones present in the MfD dataset.</p><p>A raster file is a matrix organized into a grid of rows and columns of raster cells, each cell representing a geographical area. Each raster cell of a raster file is furthermore associated with a value representing information about the geographical area, such as temperature or pH. In total, the environmental data contains approximately 200 GB across approximately 100 files, each containing a different type of measure.</p><p>A raster file contains the geo-location of the upper-left cell of its matrix and a transformation matrix describing the translations and rotations needed to derive the spatial locations of all other cells. In our datasets, raster cells have a resolution of 10 × 10 meters, and the spatial location is given in the Universal Transverse Mercator (UTM) coordinate system. In the UTM coordinate system, each location within a zone is defined by the pair (𝑒, 𝑛) with 𝑒, easting, being the distance east of the zone's reference line (the central meridian of the zone shifted by a false easting of 500 km), and 𝑛, northing, being the distance north of the equator.</p><p>The S2 Geometry. 
In order to handle the integration of raster files with potentially different transformation matrices, we integrate them into a common grid, the S2 Geometry, as all raster cells in a raster file can be mapped to a set of corresponding cells in the S2 Geometry, regardless of the transformation matrix.</p><p>The S2 Geometry is a 31-level hierarchical grid that decomposes Earth into a hierarchy of cells. At the top-most level (level 0) of the hierarchy, Earth is divided into six cells perfectly covering it, while each higher level of granularity subdivides each cell into four children, such that there are 24 cells at level 1, 96 cells at level 2, and so on.</p></div>
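<div xmlns="http://www.tei-c.org/ns/1.0"><p>The growth of the hierarchy can be made concrete with a short calculation. The Python sketch below computes the number of cells per level from the rule stated above (six cells at level 0, four children per cell) and derives a mean cell area. The Earth surface area constant is an assumption (roughly 510 million square kilometres), and the result is only an average, since individual S2 cells at the same level differ in size.</p>
<p><code lang="python">
```python
# Assumption: mean surface area of the Earth in square metres (about 510 million km^2).
EARTH_SURFACE_M2 = 5.101e14


def s2_cell_count(level: int) -> int:
    """Number of S2 cells at a level: six face cells, each split in four per level."""
    if level not in range(0, 31):
        raise ValueError("S2 levels run from 0 to 30")
    return 6 * 4 ** level


def mean_cell_area_m2(level: int) -> float:
    """Average cell area at a level; individual cells deviate from this mean."""
    return EARTH_SURFACE_M2 / s2_cell_count(level)


for level in (0, 1, 2, 24):
    print(level, s2_cell_count(level), mean_cell_area_m2(level))
```
</code></p>
<p>Under this assumption, levels 1 and 2 give 24 and 96 cells, and the mean cell area at level 24 comes out at roughly 0.3 square metres.</p></div>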
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Spatial Integration Approach</head><p>In this section, we present our KG design for spatial integration, demonstrating how the S2 Geometry can be adopted to independently integrate the spatial aspect of each data source. Further, we show how we extend the KG design to integrate the rest of the MfD dataset. The code is available on GitHub<ref type="foot" target="#foot_7">8</ref>.</p><p>Spatial Data. We build upon well-established ontologies to model the spatial dimension, as shown in Table <ref type="table" target="#tab_0">2</ref>. The geo ontology is used to model the spatial relations between raster cells, S2 cells, and MfD sample sites; the upper-level ontology oboe is used to model the environmental observations and measurements; and kwg-ont is the ontology of the KnowWhereGraph. We can consider a raster file as a collection of raster cells, where each raster cell is associated with at least one measurement. For raster files whose raster cells perfectly overlap one another, as they do in our use case, each raster cell is associated with multiple measurements (pH value, soil salinity, etc.). This is modeled using the oboe concepts ObservationCollection, Observation, Measurement, and MeasurementType, and the properties hasMember, hasMeasurement, hasValue, and containsMeasurementsOfType (see Figure <ref type="figure" target="#fig_1">2</ref>).</p><p>To model the spatial relation between S2 Geometry and the MfD sample sites, we link each site to the S2 cell that covers it, using the GPS location obtained from the MfD dataset, described in Section 3. For the spatial relation between the raster cells and S2 Geometry, we associate each raster cell with the set of S2 cells that covers it. 
To model these spatial relations between raster cells, S2 cells, and MfD sample sites, we utilize the geo ontology properties coveredBy and covers, see Figure <ref type="figure" target="#fig_1">2</ref>, where the data from the raster files is colored in orange, the data from the MfD dataset is colored in blue, and the spatial integration layer, S2 Geometry, in bold. Our current datasets do not have significant overlaps, so we do not have significant interoperability conflicts among them. However, in the future, we would like to integrate data from different sources; therefore, a semantic alignment pipeline will be needed.</p></div>
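<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate the shape of the resulting graph, the following Python sketch emits triples for one raster cell and one sample site as plain tuples of prefixed names. It is a simplified reading of the design, not the implementation from the repository: the ex: namespace and every identifier under it are hypothetical, the local names coveredBy and covers are used as written in the text, and the concepts ObservationCollection and MeasurementType are left out because their exact use is only visible in Figure 2.</p>
<p><code lang="python">
```python
# Prefixes geo, oboe and rdf follow Table 2. The ex: namespace and all
# identifiers built from it are hypothetical placeholders.
def raster_cell_triples(cell_id, s2_cell_ids, measurements):
    """Return (subject, predicate, object) triples for one raster cell."""
    cell = f"ex:rasterCell/{cell_id}"
    triples = [(cell, "rdf:type", "oboe:Observation")]
    for s2_id in s2_cell_ids:
        s2_cell = f"ex:s2Cell/{s2_id}"
        # Assumed direction: the raster cell is covered by each linked S2 cell.
        triples.append((cell, "geo:coveredBy", s2_cell))
        triples.append((s2_cell, "geo:covers", cell))
    for measure, value in measurements.items():
        measurement = f"ex:measurement/{cell_id}/{measure}"
        triples.append((cell, "oboe:hasMeasurement", measurement))
        triples.append((measurement, "rdf:type", "oboe:Measurement"))
        triples.append((measurement, "oboe:hasValue", str(value)))
    return triples


def sample_site_triples(site_id, s2_cell_id):
    """Link one MfD sample site to the single S2 cell that covers it."""
    site = f"ex:sampleSite/{site_id}"
    s2_cell = f"ex:s2Cell/{s2_cell_id}"
    return [(site, "geo:coveredBy", s2_cell), (s2_cell, "geo:covers", site)]


example = raster_cell_triples("r42", ["a1", "a2"], {"pH": 6.5, "salinity": 0.2})
example += sample_site_triples("MFD_1", "a1")
for triple in example:
    print(triple)
```
</code></p>
<p>In this reading, a sample site and a raster cell are joined by following coveredBy from the site to its S2 cell and covers from that S2 cell to the raster cell, which is the path a joint query would traverse.</p></div>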
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Spatial Integration.</head><p>The spatial aspect of each data source is mapped to the S2 Geometry. The integration can be split into two types: the vector-based GPS integration of the MfD sample sites and the area-based integration of raster files. We focus on the integration of raster files in this section, as the integration of the vector-based MfD sample sites simply requires us to link the given coordinates directly to the covering S2 cell.</p><p>As described in Section 3, the S2 Geometry is a hierarchy of levels determining the granularity of S2 cells. Increasing the S2 level by one reduces the area of each S2 cell by a factor of 4. Therefore, the S2 level affects the number of S2 cells required to cover each raster cell; a high granularity requires many S2 cells, whereas a low granularity requires fewer. Consequently, lower granularities of the S2 level increase the over-coverage, i.e., the area of the S2 cells that lies outside their associated raster cells. The over-coverage is illustrated for two different granularities of the S2 level in Figure <ref type="figure">3</ref>, where the S2 cells (red) cover area outside the raster cell (green).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 3: S2 cells covering raster cells at different levels</head><p>Ideally, a set of S2 cells covers each raster cell perfectly; however, this is not the case in practice, and as such, choosing the highest granularity will yield the lowest possible over-coverage. The issue with this is that each level quadruples storage requirements, as each S2 cell has four child cells.</p><p>The EcoDes-DK15 and soil raster files have a resolution of 100 m²; hence, to keep over-coverage low, an S2 level of 24 is chosen, as S2 cells have an area of approximately 0.3 m² at that S2 level.</p><p>Irrespective of the S2 level, S2 cells over-covering a raster cell will overlap neighboring raster cells. This is not desirable since each S2 cell should only correspond to one value per feature, e.g., we cannot associate two pH values with the same location. To address this issue, we use a majority rule; however, one could also aggregate the values, as long as they are continuous. The majority rule disassociates S2 cells from any raster cells in which their centroids are not located. However, disassociating S2 cells from a raster cell will introduce under-coverage, i.e., the sub-area of a raster cell that is not covered by any S2 cells.</p><p>A potential error introduced by over- and under-coverage is integrating an MfD site to an S2 cell corresponding to an incorrect raster cell. For example, on the left of Figure <ref type="figure">3</ref>, the sample site (cross) is linked to an S2 cell that covers the shown raster cell instead of an S2 cell covering the raster cell below it. S2 cells of higher granularity diminish this problem.</p><p>In general, the integration of raster files is challenging due to different raster files possibly having different transformations, i.e., resolutions, rotations, and translations. However, utilizing the S2 Geometry, the integration of raster files is independent across different transformations. 
Thus, our approach affords the integration of arbitrary raster files.</p><p>Microbial Data. The MfD dataset contains sample-specific information such as metadata, DNA reads, sequencing metadata, sample-to-OTU mappings, and habitat information of the sample site. As the focus of this paper is on spatial integration, we omit details on this part; however, we outline the design for this integration in Figure <ref type="figure" target="#fig_2">4</ref>. Note that each entity has properties, but due to space limitations, we do not discuss metadata associated with the microbial data. Each sample site, marked in blue, has a habitat type, such as 'forests' or 'grasslands', which we map to a corresponding concept in the Environment Ontology, ENVO <ref type="bibr" target="#b19">[20]</ref>, marked in green. Furthermore, each sample has a mapping to an OTU, which encodes a specific microbial taxon. These taxons are mapped to corresponding taxons in the Taxonomy Database ontology, NCBITaxon<ref type="foot" target="#foot_8">9</ref>, marked in green. Both the habitat and taxon mappings to external ontologies require some entity resolution methods to account for, e.g., spelling errors. Finally, since the DNA reads, marked in red, take up 28 TB of storage, we do not keep them directly in the graph, but instead store a reference to an external key-value store.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Evaluation</head><p>In this section, we evaluate the proposed framework in terms of the information loss of using the S2 Geometry as the integration layer and compare it to a baseline method. Furthermore, we highlight potential issues when using the S2 Geometry.</p><p>Information Loss. We observe two contributing factors to the information loss from our integration to the S2 Geometry. The first type of information loss is over-coverage (Equation <ref type="formula">1</ref>), where a part of the area of the minimal set of covering S2 cells associated with a raster cell is outside the bounds of the raster cell. The over-coverage is calculated as the proportionate area of the set of S2 cells not contained within the raster cell. The second type of information loss results from the implementation of the majority rule, and is denoted under-coverage (Equation <ref type="formula" target="#formula_1">2</ref>), where a raster cell is only partly covered by its associated minimal set of covering S2 cells. The under-coverage is calculated as the proportionate area of the raster cell not contained within the set of S2 cells. These two types of information loss are visualized in Figure <ref type="figure" target="#fig_3">5</ref>, with the blue area being over-coverage and the red area being under-coverage.</p><formula xml:id="formula_1">\mathrm{OC} = \frac{|\mathrm{S2} \setminus \mathrm{Raster}|}{|\mathrm{S2}|} \quad (1) \qquad \mathrm{UC} = \frac{|\mathrm{Raster} \setminus \mathrm{S2}|}{|\mathrm{Raster}|}<label>(2)</label></formula><p>Since the two contributing factors are equally unwanted, we base the information loss on the harmonic mean of the over- and under-coverage (Equation <ref type="formula" target="#formula_2">3</ref>). Because higher values are desirable for a harmonic mean, whereas lower values are desirable for the over- and under-coverage, we use the complement of each coverage. 
This yields an information accuracy, which we subtract from 1 to get the loss.</p><formula xml:id="formula_2">\text{Information Loss} = 1 - \frac{2\,(1 - \mathrm{OC}) \cdot (1 - \mathrm{UC})}{(1 - \mathrm{OC}) + (1 - \mathrm{UC})}<label>(3)</label></formula><p>The mean information loss for integrating with and without the majority rule is reported in Figure <ref type="figure" target="#fig_4">6</ref> for different S2 levels, based on a random sample of 100 raster cells. Low-granularity S2 levels have a mean information loss approaching 1, whereas higher granularities of the S2 level reduce the information loss. The advantage of using the majority rule is visible for higher granularities, starting from S2 level 20. However, for low-granularity S2 cells, the majority rule introduces high under-coverage: a large S2 cell overlaps many raster cells, but its centroid lies within only one of them, so many raster cells become associated with no S2 cells.</p><p>The S2 levels 1 through 15 and 27 through 30 are not shown, as these approach an information loss of 1 and 0, respectively. At S2 level 24, where the information loss is 0.022, the information loss begins to stagnate with higher granularity of the S2 levels, which indicates that this is a suitable level for integrating the use case raster files.</p><p>Using the S2 Geometry as the spatial layer is not without issues, especially when using a high granularity for the S2 cells. The grid of the raster files used for this case study is approximately 35,000 × 45,000 cells, resulting in 1.575 billion raster cells. For an S2 level of 24, each raster cell is linked to 625 S2 cells on average. Combining these two numbers yields a total of approximately 1 trillion triples. However, as the raster files for this case study lack measurements for raster cells located in the ocean, the actual number of integrated raster cells diminishes by around 75%, yielding approximately 250 billion triples solely for the integration of raster cells.</p><p>Up- and Downsampling. 
To evaluate the gain of using the S2 Geometry as an integration layer, we compare it to using up- and downsampling for combining raster files of different resolutions. Upsampling refers to taking a set of cells from a finer grid and aggregating them into a cell of a coarser grid, whereas downsampling refers to distributing the value of a cell of a coarser grid to cells of a finer grid. We exemplify the information loss from up- and downsampling via two hypothetical resolutions, 10 × 10 and 7.5 × 7.5 meters, in red and blue, respectively, as illustrated in Figure <ref type="figure" target="#fig_5">7</ref>.</p><p>In order to upsample and downsample, we define how the majority rule works in this setting. For the highlighted red cell, we see that it is covered by four different blue cells. Since only two of the blue cells have their centroids within the red cell, we downsample only into those two cells. Conversely, we upsample the highlighted two blue cells into the single red cell in which their centroids are located. We note that upsampling is more difficult than simply aggregating if we deal with categorical attributes.</p><p>An information loss of 0.5 is obtained for upsampling into the 10 × 10 grid and 0.298 for downsampling into the 7.5 × 7.5 grid. In comparison, integrating the two grids into the S2 Geometry at level 24 results in an information loss of 0.028 and 0.019, respectively.</p></div>
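<div xmlns="http://www.tei-c.org/ns/1.0"><p>The interplay of over-coverage, under-coverage, the majority rule, and Equation 3 can be traced with a small planar model. The Python sketch below is a toy setting under stated assumptions and does not reproduce the figures reported in this section: the covering cells are flat, axis-aligned squares rather than S2 cells on the sphere, and the function names, the cell size of 0.5 m, and the grid offset of 0.2 m are chosen only for illustration.</p>
<p><code lang="python">
```python
import math


def information_loss(oc, uc):
    """Equation 3: one minus the harmonic mean of (1 - OC) and (1 - UC)."""
    a, b = 1.0 - oc, 1.0 - uc
    if a + b == 0.0:
        return 1.0  # both complements vanish, so nothing is represented correctly
    return 1.0 - 2.0 * a * b / (a + b)


def coverage(raster, size, origin, majority_rule):
    """Return (OC, UC) for one raster cell covered by a planar square grid.

    raster is (x0, y0, x1, y1). The covering grid has edge length size and a
    grid corner at (origin, origin). Flat, axis-aligned cells are a
    simplifying assumption; real S2 cells are quadrilaterals on the sphere.
    """
    x0, y0, x1, y1 = raster
    cols = range(math.floor((x0 - origin) / size), math.ceil((x1 - origin) / size))
    rows = range(math.floor((y0 - origin) / size), math.ceil((y1 - origin) / size))
    selected = 0.0  # total area of the covering cells kept for this raster cell
    inside = 0.0    # part of that area lying within the raster cell
    for i in cols:
        for j in rows:
            cx, cy = origin + i * size, origin + j * size
            mx, my = cx + size / 2, cy + size / 2  # centroid of the covering cell
            centroid_inside = mx >= x0 and x1 > mx and my >= y0 and y1 > my
            if majority_rule and not centroid_inside:
                continue  # majority rule: keep only cells whose centroid is inside
            w = max(0.0, min(x1, cx + size) - max(x0, cx))
            h = max(0.0, min(y1, cy + size) - max(y0, cy))
            selected += size * size
            inside += w * h
    raster_area = (x1 - x0) * (y1 - y0)
    oc = (selected - inside) / selected if selected > 0 else 0.0
    uc = max(0.0, raster_area - inside) / raster_area
    return oc, uc


cell = (0.0, 0.0, 10.0, 10.0)  # one 10 x 10 m raster cell
for rule in (False, True):
    oc, uc = coverage(cell, size=0.5, origin=0.2, majority_rule=rule)
    print(rule, round(oc, 4), round(uc, 4), round(information_loss(oc, uc), 4))
```
</code></p>
<p>In this toy configuration, the majority rule trades part of the over-coverage for a small under-coverage and lowers the loss from about 0.049 to about 0.040. With a covering cell much larger than the raster cell, the centroid typically falls outside it, no covering cell is kept, and the loss becomes 1, which mirrors the behavior described for low-granularity S2 levels.</p></div>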
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper, we have presented an approach for integrating multi-modal heterogeneous data sources through a spatial KG layer in the context of integrating geographically annotated microbial data and environmental features to enable joint analysis. While the emphasis has been on integrating the spatial data sources, we have also discussed the design of the complete multi-modal data integration of raster data, DNA reads, and ontologies. We propose an approach based on the S2 Geometry that can integrate raster files of different resolutions, translations, and rotations without performing significant aggregations of the raster cell measurements. In the future, we plan to work on improving scalability given the large number of RDF triples related to the use of the S2 Geometry as well as to capture provenance through the use of the PROV-O ontology<ref type="foot" target="#foot_9">10</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of the Microflora Danica dataset</figDesc><graphic coords="4,141.38,230.65,312.49,51.11" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Central part of KG design regarding spatial data</figDesc><graphic coords="6,99.71,150.10,395.85,68.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Extending the KG design with MfD microbial data</figDesc><graphic coords="7,99.71,383.68,395.77,107.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Over- and undercoverage</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: S2 error comparison at different hierarchy levels, with and without majority rule</figDesc><graphic coords="8,96.58,536.17,277.09,117.98" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Upsampling and downsampling rasters of different resolutions</figDesc><graphic coords="8,390.35,527.58,106.26,124.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Ontologies for the spatial part of the KG design</figDesc><table><row><cell>Prefix</cell><cell>IRI</cell></row><row><cell>geo</cell><cell>&lt;http://www.opengis.net/ont/geosparql#&gt;</cell></row><row><cell>oboe</cell><cell>&lt;http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#&gt;</cell></row><row><cell>kwg-ont</cell><cell>&lt;http://stko-kwg.geog.ucsb.edu/lod/ontology#&gt;</cell></row><row><cell>rdf</cell><cell>&lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.bio.aau.dk/forskning/projekter/microflora-danica</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.w3.org/TR/r2rml/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://rml.io/specs/rml/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://www.uber.com/en-SE/blog/h3/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://s2geometry.io</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://www.bio.aau.dk/forskning/projekter/microflora-danica</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">https://github.com/MicrobialDarkMatter/MfD-spatial-integration</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">https://obofoundry.org/ontology/ncbitaxon.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">https://www.w3.org/TR/prov-o/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is partially funded by the Villum Foundation (DarkScience project, 50093). We further thank the SustainScapes group from Aarhus University for sharing the environmental dataset.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Semantic Data Integration</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cheatham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Pesquita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of Big Data Technologies</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="263" to="305" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Why Is Data Integration So Hard?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Stahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Staab</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Measuring the Data Universe</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="23" to="34" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Integration of multi-modal biomedical data to predict cancer grade and patient survival</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Phan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">BHI, IEEE</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="577" to="580" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Big Data Integration</title>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow.</title>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1188" to="1189" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The contribution of microbial biotechnology to sustainable development goals</title>
		<author>
			<persName><forename type="first">K</forename><surname>Timmis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Microbial biotechnology</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="984" to="987" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Toward a General Framework for Multimodal Big Data Analysis</title>
		<author>
			<persName><forename type="first">V</forename><surname>Bellandi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Big Data</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="408" to="424" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Multi-Modal Knowledge Graph Construction and Application: A Survey</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Knowl. Data Eng</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="715" to="735" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Semantics-aware data integration for heterogeneous data sources</title>
		<author>
			<persName><forename type="first">M</forename><surname>Leida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gusmini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davies</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Ambient Intell. Humaniz. Comput</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="471" to="491" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">EcoDes-DK15: High-resolution ecological descriptors of vegetation and terrain derived from Denmark&apos;s national airborne laser scanning data set</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Assmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ESSD</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="823" to="844" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Soil assessment in Denmark: Towards soil functional mapping and beyond</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C O</forename><surname>Gomes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Soil Science</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Semantic Querying of Integrated Raster and Relational Data: A Virtual Knowledge Graph Approach</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simkus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Calvanese</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>Companion</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">GeoTriples: Transforming geospatial data into RDF graphs using R2RML and RML mappings</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kyzirakos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Web Semant</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page" from="16" to="32" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A foundation for spatial data warehouses on the Semantic Web</title>
		<author>
			<persName><forename type="first">N</forename><surname>Gür</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="557" to="587" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Andersen</surname></persName>
		</author>
		<title level="m">Publishing Danish Agricultural Government Data as Semantic Web Data</title>
				<imprint>
			<publisher>JIST</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="178" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">TripleGeo: an ETL Tool for Transforming Geospatial Data into RDF Triples</title>
		<author>
			<persName><forename type="first">K</forename><surname>Patroumpas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT/ICDT Workshops</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<imprint>
			<publisher>CEUR-WS.org</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">1133</biblScope>
			<biblScope unit="page" from="275" to="278" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Semantic Integration of Raster Data for Earth Observation: An RDF Dataset of Territorial Unit Versions with their Land Cover</title>
		<author>
			<persName><forename type="first">B</forename><surname>Tran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ISPRS Int. J. Geo Inf</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">503</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Environmental Observations in Knowledge Graphs</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zhu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
			<publisher>DaMaLOS</publisher>
			<biblScope unit="page" from="1" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Know, Know Where, KnowWhereGraph: A Densely Connected, Cross-Domain Knowledge Graph and Geo-Enrichment Service Stack for Applications in Environmental Intelligence</title>
		<author>
			<persName><forename type="first">K</forename><surname>Janowicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI Mag</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page" from="30" to="39" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A Pattern for Features on a Hierarchical Spatial Grid</title>
		<author>
			<persName><forename type="first">C</forename><surname>Shimizu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCKG, ACM</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="108" to="114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">The environment ontology: contextualising biological and biomedical entities</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">L</forename><surname>Buttigieg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Biomed. Semant</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page">43</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
