
May

1613-0073

Multi-Modal Spatial Data using Knowledge Graphs - a Case Study of Microflora Danica

Katja Hose

katja.hose@tuwien.ac.at 1 2

Mads Corfixen

ma@bio.aau.dk 1

Thomas Heede

Tomer Sagi

tsagi@cs.aau.dk 1

Mads Albertsen

Thomas D. Nielsen

Spatial Data Integration, Knowledge Graph, S2 Geometry

0 Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark

1 Department of Computer Science, Aalborg University, Aalborg, Denmark

2 Institute of Logic and Computation, TU Wien, Vienna, Austria

2024

26 2024

Integrating semantically related, multi-modal, heterogeneous data sources is challenging, especially if one of the modalities includes spatial data, such as field measurements organized in geographical grids. Since geographical grids can have different rotations, be translated along one or more axes, or have different resolutions, a particular challenge when integrating such data is to reduce the information loss from projecting different grids into a common format. In this paper, we study this problem and sketch a method for integrating such spatial data using knowledge graphs. We discuss this solution in the context of a real-world use case, where we integrate geographically annotated microbial data (Microflora Danica) as well as environmental data to enable joint analysis. The first results of our experiments show that our method reduces the information loss compared to baseline methods.


CEUR ceur-ws.org

1. Introduction

Integrating and linking semantically related heterogeneous and multi-modal data is an important task that is often very challenging [1, 2]. While many works have focused on integrating tabular data, defined over a fixed relational schema, recent years have seen a diversification into combining multi-modal data (such as semi-structured data from different modalities) [3] and data with a flexible schema [4].

The need for data integration appears in a multitude of domains. One such domain is the study of microbiomes, i.e., the interaction between microbes and the environment they inhabit. Understanding these interactions is essential for solving current and future environmental challenges [5]. Such data is inherently multi-modal because the DNA sequence data is of a different modality than the semantic relations between microbial species and the spatial information about their environments. Spatial information in general often refers to vector or raster data. Vector data consists of (x, y, z)-coordinates associated with real-world measures of a specific area, where the area's shape is defined by vector coordinates and can take the form of points, lines, polylines, and polygons. Raster data, on the other hand, is a grid defined over an area, where each grid cell has a measurement value. The raster file itself is associated with information such as the geographical location and the height and width of the grid.

One method to facilitate the integration of multi-modal data is through a knowledge graph (KG) [6]. An advantage of using KGs for data integration is that they can flexibly support multiple modalities [7]. KGs can cover strings, images, and audio, as well as links between them, effectively allowing heterogeneous data to be mapped directly without requiring any transformations to conform to the strict schemas required by other data integration solutions. Another advantage is that KGs are associated with ontologies that clearly define the semantics of the concepts in the data sources. In contrast, table and column headers in a relational database are often only clearly defined in the mind of the designer [8].

In this paper, we present a case study for integrating spatial heterogeneous multi-modal data sources using a KG. The case study is based on the Microflora Danica¹ (MfD) dataset, an archive of microorganisms in Denmark. We present the challenges and possible solutions of enriching this dataset with data from EcoDes-DK15 [9], a high-resolution dataset of ecological descriptors, and soil maps of Denmark [10]. We focus on integrating the data along the spatial dimension, as the datasets are connected through their geographical properties, i.e., GPS coordinates for the MfD dataset and spatial raster maps for the EcoDes-DK15 and soil maps.

The remainder of this paper is structured as follows. First, in Section 2, we provide an overview of existing approaches for integrating multi-modal data with a spatial component. Then, in Section 3, we present the different data sources of our case study and their different modalities. Next, in Section 4, we present our KG design for spatial integration and show how it can be extended with microbial data. In Section 5, we evaluate our spatial integration approach based on the information loss and compare it to a baseline method. Finally, Section 6 concludes the paper with a summary and an outlook to future work.

2. Related Work

Data integration using KGs has recently gained attention, mainly due to their flexibility, but also because of the advantages when dealing with heterogeneous multi-modal data [ 6, 11 ]. While many studies have explored integrating or transforming spatial data into KGs, they focus mostly on vector data [ 1, 12, 13, 14 ]. Integration of raster data has not been thoroughly explored.

GeoTriples [12] proposes a method for transforming spatial data in vector format into RDF. The method extends the mapping languages R2RML² and RML³ to define rules for mapping structured and semi-structured, heterogeneous vector data to RDF graphs. The authors recognize that spatial data also exists in raster format but only provide a framework for transforming spatial vector data into RDF triples. TripleGeo [15] is a similar method, but the proposed framework also does not support raster data.

¹ https://www.bio.aau.dk/forskning/projekter/microflora-danica

² https://www.w3.org/TR/r2rml/

³ https://rml.io/specs/rml/

Tran et al. [ 16 ] propose an ETL process for integrating files in raster format to build an RDF triple store over a designated area. They aggregate raster cells to extract a single value for a territorial area, which leads to a loss of fine-grained information.

Zhu et al. [ 17 ] model observations as aggregations on discrete global grid systems, albeit without providing an actual framework. Moreover, they describe the integration of singular event geometries, such as wildfires, rather than integrating raster data.

Several hierarchical discrete global grids (HDGG) have been proposed to represent spatial data, including H3⁴, Bing Maps Tile System⁵, and S2 Geometry⁶. Common among them is the hierarchical division of the Earth into subsets. H3 utilizes hexagons, but, since hexagons cannot be perfectly subdivided into smaller hexagons, child cells are only approximately contained within their parent cells. Bing Maps Tile System projects the Earth onto a map; however, this projection distorts the scales in proportion to the distance to the poles. The S2 Geometry decomposes Earth into a hierarchy of cells. Instead of mapping points to a plane, S2 maps them to a perfect sphere. Since Earth is closer to being a sphere than a plane, this creates less distortion. At the top-most level of the S2 hierarchy, Earth is represented by six cells, perfectly covering the Earth, with each lower level subdividing each cell into four children.

The KnowWhereGraph (KWG) [ 18, 19 ] is a KG using S2 Geometry as the spatial component. The KWG focuses on how to model spatial data as RDF triples, but without providing a framework for mapping between raster files and S2 Geometry. Additionally, they recognize the advantages and limitations of the S2 Geometry as the spatial component but do not quantify the error. Thus, our work is based upon the KWG, using it as a spatial layer to enable spatial data integration.

3. Data Sources and Use Case Description

In this paper, we consider a use case where we want to integrate several real-world heterogeneous data sources of microbes and ecological descriptors of their habitats, exploiting the capabilities of KGs for flexible multi-modal integration of spatially related data. In this section, we describe the available data sources and how they shape our case study.

The Microflora Danica Dataset. The Microflora Danica⁷ (MfD) dataset comprises more than 10,000 measured samples collected from different sample sites in Denmark. The MfD dataset was created to provide a database of all microbes in Denmark and contains several different modalities. Information on where each sample was taken is geographical vector data, and DNA reads can be considered as very long strings. Each sample in the MfD dataset is sequenced at least once, which is a process that constructs probable DNA reads from a sample. Through this process, each sample is associated with approximately 10 million DNA reads, each being a fragment of a full genome consisting of the four bases found in DNA, Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), i.e., a non-empty string over the alphabet {A, T, C, G}. In total, the MfD contains 28 TB of reads.

The structure of the MfD dataset is shown in Figure 1. Each sample site is described in the Fieldsample Metadata, which contains information such as GPS location and habitat type, with the habitat types linked to three external taxonomies. Each sample is sequenced one or more times, each sequencing being described in the Sequencing Metadata and linked to its related sample site through a fieldsample_barcode. The Sequencing Metadata describes how each sample was sequenced and handled in the laboratory. Each sequencing leads to a number of reads, collected in a single file we denote as Read Data, connected to their related sequencing through a sequence_id. Besides the reads themselves, a Read Data file contains a certainty measure for each base of the reads. Finally, each site was sampled as part of different projects, each project potentially collecting multiple samples. The projects are described in the Project Metadata, containing information on the parties responsible for the projects. Each sample site is linked to its related project through a project_id.

⁴ https://www.uber.com/en-SE/blog/h3/

⁵ https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system

⁶ https://s2geometry.io

⁷ https://www.bio.aau.dk/forskning/projekter/microflora-danica

Besides information on the reads, sequences, and sample sites, we are also provided with a mapping from each sample to potential microbial species (taxa) present in that sample. Since reads are only fragments of the complete genome of a taxon, the mappings are uncertain. They are structured as shown in Table 1. Each Operational Taxonomic Unit (OTU) is a DNA sequence that encodes a specific, potentially undiscovered taxon. Each MFD_X column represents a fieldsample_barcode and shows how many reads from a sample site were mapped to that OTU.

Environmental Raster Files. The EcoDes-DK15 [9] and soil maps [10] datasets contain measurements of ecological descriptors across Denmark, providing us with a means to quantify the microbial habitats of the MfD dataset. Both datasets are in the form of raster files, adding a new modality that needs to be integrated besides the ones present in the MfD dataset.

A raster file is a matrix organized into a grid of rows and columns of raster cells, each cell representing a geographical area. Each raster cell of a raster file is furthermore associated with a value representing information about the geographical area, such as temperature or pH. In total, the environmental data contains approximately 200 GB across approximately 100 files, each containing a different type of measure.

A raster file contains the geo-location of the upper-left cell of its matrix and a transformation matrix describing the translations and rotations needed to derive the spatial locations of all other cells. In our datasets, raster cells have a resolution of 10 × 10 meters, and the spatial location is given in the Universal Transverse Mercator (UTM) coordinate system. In the UTM coordinate system, each location is defined by an (easting, northing) pair, where the easting is the eastward distance within a UTM zone, measured relative to the zone's central meridian (which is assigned a false easting of 500,000 m), and the northing is the distance north of the equator.
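To make the role of the transformation matrix concrete, the following minimal sketch derives the location of an arbitrary raster cell from the upper-left origin and six affine coefficients. The coefficient ordering follows the common GDAL-style geotransform, which is an assumption here; the class and function names are hypothetical and are not taken from the MfD code base.

```python
from typing import NamedTuple, Tuple


class GeoTransform(NamedTuple):
    """Six affine coefficients (hypothetical container, GDAL-style ordering assumed)."""
    x_origin: float    # easting of the upper-left corner of the upper-left cell
    col_step_x: float  # change in easting per column
    row_step_x: float  # change in easting per row (non-zero only for rotated grids)
    y_origin: float    # northing of the upper-left corner of the upper-left cell
    col_step_y: float  # change in northing per column (non-zero only for rotated grids)
    row_step_y: float  # change in northing per row (negative for north-up rasters)


def cell_corner(gt: GeoTransform, row: float, col: float) -> Tuple[float, float]:
    """Return (easting, northing) of the upper-left corner of cell (row, col)."""
    easting = gt.x_origin + col * gt.col_step_x + row * gt.row_step_x
    northing = gt.y_origin + col * gt.col_step_y + row * gt.row_step_y
    return easting, northing


def cell_center(gt: GeoTransform, row: int, col: int) -> Tuple[float, float]:
    """Return (easting, northing) of the center of cell (row, col)."""
    return cell_corner(gt, row + 0.5, col + 0.5)


# Illustrative north-up raster with 10 x 10 meter cells; the origin values are made up.
example = GeoTransform(500000.0, 10.0, 0.0, 6200000.0, 0.0, -10.0)
corner = cell_corner(example, row=2, col=3)
center = cell_center(example, row=0, col=0)
```

Two raster files with different origins, resolutions, or rotations differ only in these six coefficients, which is why their cells generally do not line up.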

The S2 Geometry. In order to handle the integration of raster files with potentially different transformation matrices, we integrate them into a common grid, the S2 Geometry, as all raster cells in a raster file can be mapped to a set of corresponding cells in the S2 Geometry, regardless of the transformation matrix.

The S2 Geometry is a 31-level hierarchical grid that decomposes Earth into a hierarchy of cells. At the top-most level (level 0) of the hierarchy, Earth is divided into six cells perfectly covering it, while each higher level of granularity subdivides each cell into four children, such that there are 24 cells at level 1, 96 cells at level 2, and so on.
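The growth of the hierarchy can be checked with a few lines of code. This is a minimal sketch that only encodes the counting rule stated above (six cells at level 0, four children per cell); the function name is hypothetical.

```python
S2_MAX_LEVEL = 30  # levels 0..30 give the 31 levels of the hierarchy


def s2_cell_count(level: int) -> int:
    """Number of S2 cells covering Earth at the given level."""
    if not 0 <= level <= S2_MAX_LEVEL:
        raise ValueError(f"level must be in 0..{S2_MAX_LEVEL}, got {level}")
    return 6 * 4 ** level


# Cell counts for the first three levels: 6, 24, and 96.
first_levels = [s2_cell_count(level) for level in range(3)]
```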

4. Spatial Integration Approach

In this section, we present our KG design for spatial integration, demonstrating how the S2 Geometry can be adopted to independently integrate the spatial aspect of each data source. Further, we show how we extend the KG design to integrate the rest of the MfD dataset. The code is available on GitHub⁸.

Spatial Data. We build upon well-established ontologies to model the spatial dimension, as shown in Table 2. The geo ontology is used to model the spatial relations between raster cells, S2 cells, and MfD sample sites; the upper level ontology oboe is used to model the environmental observations and measurements; and the kwg-ont ontology is the ontology of the KnowWhereGraph.

We can consider a raster file as a collection of raster cells, where each raster cell is associated with at least one measurement. For raster files where their raster cells perfectly overlap one another, such as they do in our use case, each raster cell is associated with multiple measurements (pH value, soil salinity, etc.). This is modeled using the oboe concepts ObservationCollection, Observation, Measurement, and MeasurementType; and properties hasMember, hasMeasurement, hasValue, and containsMeasurementsOfType (see Figure 2).

To model the spatial relation between S2 Geometry and the MfD sample sites, we link each site to the S2 cell that covers it, using the GPS location obtained from the MfD dataset, described in Section 3. For the spatial relation between the raster cells and S2 Geometry, we associate each raster cell with the set of S2 cells that covers it. To model these spatial relations between raster cells, S2 cells, and MfD sample sites, we utilize the geo ontology properties coveredBy and covers, see Figure 2, where the data from the raster files is colored in orange, the data from the MfD dataset is colored in blue, and the spatial integration layer, S2 Geometry, is shown in bold.

⁸ https://github.com/MicrobialDarkMatter/MfD-spatial-integration
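The way the S2 layer joins the two sources can be illustrated with plain subject-predicate-object tuples. This is a minimal sketch, not the KG implementation itself: the identifiers such as site:1 and s2:a are hypothetical, the prefixed property name is written as a plain string because its full IRI is not given in the text, and a real deployment would use an RDF store instead of Python lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Property name as used in the text; the prefix expansion is an assumption left open here.
COVERED_BY = "geo:coveredBy"


def spatial_triples(
    site_to_s2: Dict[str, str],
    raster_to_s2: Dict[str, Iterable[str]],
) -> List[Triple]:
    """Link each sample site to its covering S2 cell and each raster cell to its S2 cells."""
    triples: List[Triple] = [(site, COVERED_BY, s2) for site, s2 in site_to_s2.items()]
    for raster_cell, s2_cells in raster_to_s2.items():
        triples.extend((raster_cell, COVERED_BY, s2) for s2 in s2_cells)
    return triples


def raster_cells_for_site(
    triples: Iterable[Triple], site: str, raster_cells: Set[str]
) -> Set[str]:
    """Return the raster cells that share at least one S2 cell with the given site."""
    covered_by: Dict[str, Set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == COVERED_BY:
            covered_by[subject].add(obj)
    site_cells = covered_by.get(site, set())
    return {rc for rc in raster_cells if covered_by.get(rc, set()) & site_cells}


triples = spatial_triples(
    {"site:1": "s2:a"},
    {"raster:r1": ["s2:a", "s2:b"], "raster:r2": ["s2:c"]},
)
matches = raster_cells_for_site(triples, "site:1", {"raster:r1", "raster:r2"})
```

The sample site and the raster cell never reference each other directly; they are related only through the shared S2 cell, which is what allows each source to be integrated independently.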

Our current datasets do not have significant overlaps, so we do not have significant interoperability conflicts among them. However, in the future, we would like to integrate data from different sources; therefore, a semantic alignment pipeline will be needed.

Spatial Integration. The spatial aspect of each data source is mapped to the S2 Geometry. The integration can be split into two types: the vector-based GPS integration of the MfD sample sites and the area-based integration of raster files. We focus on the integration of raster files in this section, as the integration of the vector-based MfD sample sites simply requires us to integrate the given coordinates directly to the covering S2 cell.

As described in Section 3, the S2 Geometry is a hierarchy of levels determining the granularity of S2 cells. Increasing the S2 level by one reduces the area of each S2 cell by a factor of 4. Therefore, the S2 level affects the number of S2 cells required to cover each raster cell; a high granularity requires many S2 cells, whereas a low granularity requires fewer. Consequently, lower granularities of the S2 level increase the over-coverage, i.e., the area of S2 cells that lies outside their associated raster cells. The over-coverage is illustrated for two different granularities of the S2 level in Figure 3, where the S2 cells (red) cover area outside the raster cell (green).

Ideally, a set of S2 cells would cover each raster cell perfectly; in practice this is not the case, so choosing the highest granularity yields the lowest possible over-coverage. The drawback is that each additional level roughly quadruples the storage requirements, as each S2 cell has four child cells.

The EcoDes-DK15 and soil raster files have a resolution of 10 × 10 meters (100 m² per raster cell); hence, to keep over-coverage low, an S2 level of 24 is chosen, as S2 cells have an area of approximately 0.3 m² at that S2 level.
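The area figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below divides an approximate value for Earth's surface area (an assumption, about 5.1 × 10¹⁴ m²) evenly over all cells of a level. It gives only a global average: actual S2 cell areas vary with location on the sphere, so this is not meant to reproduce the per-raster-cell counts reported in the evaluation.

```python
EARTH_SURFACE_M2 = 5.1e14  # approximate surface area of Earth in square meters (assumption)


def mean_s2_cell_area_m2(level: int) -> float:
    """Global average area of an S2 cell at the given level."""
    return EARTH_SURFACE_M2 / (6 * 4 ** level)


def approx_cells_per_raster_cell(level: int, raster_cell_area_m2: float = 100.0) -> float:
    """Rough number of average-sized S2 cells whose total area equals one raster cell."""
    return raster_cell_area_m2 / mean_s2_cell_area_m2(level)


area_24 = mean_s2_cell_area_m2(24)            # close to the 0.3 square meters stated above
ratio = mean_s2_cell_area_m2(23) / area_24    # one level coarser means four times the area
```

The factor of four between consecutive levels is also the reason storage requirements grow so quickly with the chosen level.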

Figure 3: S2 cells covering raster cells at different levels

Irrespective of the S2 level, S2 cells over-covering a raster cell will overlap neighboring raster cells. This is not desirable since each S2 cell should only correspond to one value per feature, e.g., we cannot associate two pH values with the same location. To address this issue, we use a majority rule; however, one could also aggregate the values, as long as they are continuous. The majority rule disassociates S2 cells from any raster cells in which their centroids are not located. However, disassociating S2 cells from a raster cell will introduce under-coverage, i.e., the sub-area of a raster cell that is not covered by any S2 cells.
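The centroid-based majority rule can be sketched as follows. This is a simplified illustration under stated assumptions: the S2 cell centroids are already given in the same planar coordinate system as the raster, the raster is axis-aligned with square cells, and cell indices are counted from a lower-left origin. The function names and identifiers are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
CellIndex = Tuple[int, int]


def raster_cell_of(point: Point, origin: Point, cell_size: float) -> CellIndex:
    """Index (col, row) of the axis-aligned raster cell that contains the point."""
    col = math.floor((point[0] - origin[0]) / cell_size)
    row = math.floor((point[1] - origin[1]) / cell_size)
    return col, row


def assign_by_centroid(
    centroids: Dict[str, Point], origin: Point, cell_size: float
) -> Dict[CellIndex, List[str]]:
    """Associate every S2 cell with exactly one raster cell: the one holding its centroid."""
    assignment: Dict[CellIndex, List[str]] = defaultdict(list)
    for s2_id, centroid in centroids.items():
        assignment[raster_cell_of(centroid, origin, cell_size)].append(s2_id)
    return dict(assignment)


# Two small cells straddling the border between raster cells (0, 0) and (1, 0).
result = assign_by_centroid(
    {"s2:a": (9.9, 0.5), "s2:b": (10.1, 0.5)}, origin=(0.0, 0.0), cell_size=10.0
)
```

Because every S2 cell ends up in exactly one raster cell, each S2 cell carries a single value per feature, at the price of the under-coverage described above.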

A potential error introduced by over- and under-coverage is linking an MfD site to an S2 cell that corresponds to an incorrect raster cell. For example, in the left of Figure 3, the sample site (cross) is linked to an S2 cell that covers the shown raster cell instead of an S2 cell covering the raster cell below it. S2 cells of higher granularity diminish this problem.

In general, the integration of raster files is challenging because different raster files may have different transformations, i.e., resolutions, rotations, and translations. However, utilizing the S2 Geometry, the integration of raster files is independent across different transformations. Thus, our approach affords the integration of arbitrary raster files.

Microbial Data. The MfD dataset contains sample-specific information such as metadata, DNA reads, sequencing metadata, sample-to-OTU mappings, and habitat information of the sample site. As the focus of this paper is on spatial integration, we omit details on this part; however, we outline the design for this integration in Figure 4. Note that each entity has properties, but due to space limitations, we do not discuss metadata associated with the microbial data. Each sample site, marked in blue, has a habitat type, such as 'forests' or 'grasslands', which we map to a corresponding concept in the Environment Ontology, ENVO [20], marked in green. Furthermore, each sample has a mapping to an OTU, which encodes a specific microbial taxon. These taxa are mapped to corresponding taxa in the Taxonomy Database ontology, NCBITaxon⁹, marked in green. Both the habitat and taxon mappings to external ontologies require some entity resolution methods to account for, e.g., spelling errors. Finally, since the DNA reads, marked in red, take up 28 TB of storage, we do not keep them directly in the graph, but instead store a reference to an external key-value store.

5. Evaluation

In this section, we evaluate the proposed framework in terms of the information loss of using the S2 Geometry as the integration layer and compare it to a baseline method. Furthermore, we highlight potential issues when using the S2 Geometry.

Information Loss. We observe two contributing factors to the information loss from our integration to the S2 Geometry. The first type of information loss is over-coverage (Equation 1), where a part of the area of the minimal set of covering S2 cells associated with a raster cell is outside the bounds of the raster cell. The over-coverage is calculated as the proportionate area of the set of S2 cells not contained within the raster cell. The second type of information loss results from the implementation of the majority rule, and is denoted under-coverage (Equation 2), where a raster cell is only partly covered by its associated minimal set of covering S2 cells. The under-coverage is calculated as the proportionate area of the raster cell not contained within the set of S2 cells. These two types of information loss are visualized in Figure 5, with the blue area being over-coverage and the red area being under-coverage.

Figure 5: Over- and under-coverage

⁹ https://obofoundry.org/ontology/ncbitaxon.html

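\[ \mathrm{OC} = \frac{|\mathrm{S2} \setminus \mathrm{Raster}|}{|\mathrm{S2}|} \tag{1} \]

\[ \mathrm{UC} = \frac{|\mathrm{Raster} \setminus \mathrm{S2}|}{|\mathrm{Raster}|} \tag{2} \]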

Since the two contributing factors are equally unwanted, we define the information loss in terms of the harmonic mean of the over- and under-coverage (Equation 3). For a harmonic mean, higher values are desirable, whereas for the over- and under-coverage lower values are desirable. Therefore, we use the complement of each coverage, 1 − OC and 1 − UC. Their harmonic mean yields an information accuracy, which we subtract from 1 to get the loss.

\[ \text{Information Loss} = 1 - \frac{2\,(1 - \mathrm{OC}) \cdot (1 - \mathrm{UC})}{(1 - \mathrm{OC}) + (1 - \mathrm{UC})} \tag{3} \]
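Equations 1 to 3 translate directly into code once the relevant areas are known. The following minimal sketch takes the area of the S2 cell set, the area of the raster cell, and the area of their intersection as inputs; it uses the identities |S2 ⧵ Raster| = |S2| − |S2 ∩ Raster| and |Raster ⧵ S2| = |Raster| − |S2 ∩ Raster|. The function names are hypothetical, and how the areas are obtained is left open.

```python
def over_coverage(s2_area: float, overlap_area: float) -> float:
    """Equation 1: share of the S2 cell set that lies outside the raster cell."""
    return (s2_area - overlap_area) / s2_area


def under_coverage(raster_area: float, overlap_area: float) -> float:
    """Equation 2: share of the raster cell that is not covered by the S2 cell set."""
    return (raster_area - overlap_area) / raster_area


def information_loss(oc: float, uc: float) -> float:
    """Equation 3: one minus the harmonic mean of the complements of OC and UC."""
    acc_oc, acc_uc = 1.0 - oc, 1.0 - uc
    if acc_oc + acc_uc == 0.0:
        return 1.0  # both complements are zero, so nothing is represented correctly
    return 1.0 - 2.0 * acc_oc * acc_uc / (acc_oc + acc_uc)


# Illustrative areas in square meters: 120 covered by S2 cells, 90 of it inside a 100 m2 raster cell.
oc = over_coverage(120.0, 90.0)
uc = under_coverage(100.0, 90.0)
loss = information_loss(oc, uc)
```

As with any harmonic mean, the loss is dominated by the worse of the two factors: a covering that is perfect in one respect but poor in the other still scores badly.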

The mean information loss for integrating with and without the majority rule is reported in Figure 6 for different S2 levels, based on a random sample of 100 raster cells. Low granularity S2 levels have a mean information loss approaching 1, whereas higher granularities of the S2 level reduce the information loss. The advantage of using the majority rule is visible for higher granularities, starting from S2 level 20. However, for low granularity S2 cells, the majority rule introduces high under-coverage: a large S2 cell covers many raster cells, but its centroid is located within only one of them, so many raster cells become associated with no S2 cells.

The S2 levels 1 through 15 and 27 through 30 are not shown, as these approach an information loss of 1 and 0, respectively. At S2 level 24, where the information loss is 0.022, the information loss begins to stagnate with higher granularity of the S2 levels, which is an indication that this is a suitable level to integrate the use case raster files into.

Using the S2 Geometry as the spatial layer is not without issues, especially when using a high granularity for the S2 cells. The grid of the raster files used for this case study is approximately 35,000 × 45,000, resulting in 1.575 billion raster cells. For an S2 level of 24, each raster cell is linked to 625 S2 cells on average. Combining these two numbers yields a total of approximately 1 trillion triples. However, as the raster files for this case study lack measurements for raster cells located in the ocean, the actual number of integrated raster cells diminishes by around 75%, yielding approximately 250 billion triples solely for the integration of raster cells.

Up- and Downsampling. To evaluate the gain of using the S2 Geometry as an integration layer, we compare it to using up- and downsampling for combining raster files of different resolutions. Upsampling refers to taking a set of cells at a lower granularity and then aggregating them into higher granularity and vice versa for downsampling. We exemplify the information loss from up- and downsampling via two hypothetical resolutions, 10 × 10 and 7.5 × 7.5 meters, in red and blue, respectively, as illustrated in Figure 7.

In order to upsample and downsample, we define how the majority rule works in this setting. For the highlighted red cell, we see that it is covered by four different blue cells. Since only two of the blue cells have their centroids within the red cell, we downsample only into those two cells. Conversely, we upsample the highlighted two blue cells into the single red cell in which their centroids are located. We note that upsampling is more difficult than simply aggregating if we deal with categorical attributes.
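The same centroid rule and loss measure can be applied between two planar grids. The sketch below selects, for one target cell, the cells of a second square grid whose centroids fall inside it, and then computes the information loss of Equation 3 from rectangle areas. It assumes axis-aligned grids and a freely chosen grid offset; since the offsets used in Figure 7 are not stated in the text, it is not expected to reproduce the numbers reported for that figure. All names are hypothetical.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def grid_cells_by_centroid(
    target: Rect, size: float, origin: Tuple[float, float] = (0.0, 0.0)
) -> List[Rect]:
    """Cells of a square grid (side `size`, anchored at `origin`) with centroids inside `target`."""
    x_min, y_min, x_max, y_max = target
    ox, oy = origin
    selected: List[Rect] = []
    for col in range(math.floor((x_min - ox) / size), math.floor((x_max - ox) / size) + 1):
        for row in range(math.floor((y_min - oy) / size), math.floor((y_max - oy) / size) + 1):
            cx, cy = ox + (col + 0.5) * size, oy + (row + 0.5) * size
            if x_min <= cx < x_max and y_min <= cy < y_max:
                selected.append((ox + col * size, oy + row * size,
                                 ox + (col + 1) * size, oy + (row + 1) * size))
    return selected


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two axis-aligned rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def resampling_loss(target: Rect, size: float, origin: Tuple[float, float] = (0.0, 0.0)) -> float:
    """Information loss (Equation 3) of representing `target` by centroid-selected grid cells."""
    cells = grid_cells_by_centroid(target, size, origin)
    if not cells:
        return 1.0  # no cell is associated with the target, so it is entirely uncovered
    covered = sum((c[2] - c[0]) * (c[3] - c[1]) for c in cells)
    inside = sum(overlap_area(c, target) for c in cells)
    target_area = (target[2] - target[0]) * (target[3] - target[1])
    acc_oc = inside / covered       # equals 1 - OC
    acc_uc = inside / target_area   # equals 1 - UC
    if acc_oc + acc_uc == 0.0:
        return 1.0
    return 1.0 - 2.0 * acc_oc * acc_uc / (acc_oc + acc_uc)


# A 10 x 10 meter target cell and a 7.5 x 7.5 meter grid, both anchored at the same corner.
loss = resampling_loss((0.0, 0.0, 10.0, 10.0), size=7.5)
```

Shifting the `origin` argument changes which cells are selected and therefore the loss, which illustrates why direct resampling between grids depends on their relative translation.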

An information loss of 0.5 is obtained for upsampling into the 10 × 10 grid and 0.298 for downsampling into the 7.5 × 7.5 grid. In comparison, integrating the two grids into the S2 Geometry at level 24 results in an information loss of 0.028 and 0.019, respectively.

6. Conclusion

In this paper, we have presented an approach for integrating multi-modal heterogeneous data sources through a spatial KG layer in the context of integrating geographically annotated microbial data and environmental features to enable joint analysis. While the emphasis has been on integrating the spatial data sources, we have also discussed the design of the complete multi-modal data integration of raster data, DNA reads, and ontologies. We propose an approach based on the S2 Geometry that can integrate raster files of different resolutions, translations, and rotations without performing significant aggregations of the raster cell measurements. In the future, we plan to work on improving scalability given the large number of RDF triples related to the use of the S2 Geometry as well as to capture provenance through the use of the PROV-O ontology¹⁰.

Acknowledgments

This work is partially funded by the Villum Foundation (DarkScience project, 50093). We further thank the SustainScapes group from Aarhus University for sharing the environmental dataset.

References

[1] Cheatham, Pesquita, Semantic Data Integration, in: Handbook of Big Data Technologies, Springer, 2017, pp. 263-305.

[2] Stahl, Staab, Why Is Data Integration So Hard?, in: Measuring the Data Universe, Springer, 2018, pp. 23-34.

[3] J. H. Phan, et al., Integration of multi-modal biomedical data to predict cancer grade and patient survival, in: BHI, IEEE, 2016, pp. 577-580.

[4] X. L. Dong, Srivastava, Big Data Integration, Proc. VLDB Endow. 6 (2013) 1188-1189.

[5] Timmis, et al., The contribution of microbial biotechnology to sustainable development goals, Microbial biotechnology 10 (2017) 984-987.

[6] Bellandi, et al., Toward a General Framework for Multimodal Big Data Analysis, Big Data 10 (2022) 408-424.

[7] Zhu, et al., Multi-Modal Knowledge Graph Construction and Application: A Survey, IEEE Trans. Knowl. Data Eng. 36 (2024) 715-735.

[8] Leida, Gusmini, Davies, Semantics-aware data integration for heterogeneous data sources, J. Ambient Intell. Humaniz. Comput. 4 (2013) 471-491.

[9] J. J. Assmann, et al., EcoDes-DK15: High-resolution ecological descriptors of vegetation and terrain derived from Denmark's national airborne laser scanning data set, ESSD 14 (2022) 823-844.

[10] L. C. o. Gomes, Soil assessment in Denmark: Towards soil functional mapping and beyond, Frontiers in Soil Science (2023).

[11] Ghosh, Simkus, Calvanese, Semantic Querying of Integrated Raster and Relational Data: A Virtual Knowledge Graph Approach, in: RuleML+RR (Companion), 2023.

[12] Kyzirakos, et al., GeoTriples: Transforming geospatial data into RDF graphs using R2RML and RML mappings, J. Web Semant. 52-53 (2018) 16-32.

[13] Gür, et al., A foundation for spatial data warehouses on the Semantic Web, Semantic Web 9 (2018) 557-587.

[14] A. B. Andersen, et al., Publishing Danish Agricultural Government Data as Semantic Web Data, in: JIST, 2014, pp. 178-186.

[15] Patroumpas, et al., TripleGeo: an ETL Tool for Transforming Geospatial Data into RDF Triples, in: EDBT/ICDT Workshops, volume 1133 of CEUR Workshop Proceedings, CEUR-WS.org, 2014, pp. 275-278.

[16] Tran, et al., Semantic Integration of Raster Data for Earth Observation: An RDF Dataset of Territorial Unit Versions with their Land Cover, ISPRS Int. J. Geo Inf. 9 (2020) 503.

[17] Zhu, et al., Environmental Observations in Knowledge Graphs, in: DaMaLOS, 2021, pp. 1-11.

[18] Janowicz, et al., Know, Know Where, KnowWhereGraph: A Densely Connected, Cross-Domain Knowledge Graph and Geo-Enrichment Service Stack for Applications in Environmental Intelligence, AI Mag. 43 (2022) 30-39.

[19] Shimizu, et al., A Pattern for Features on a Hierarchical Spatial Grid, in: IJCKG, ACM, 2021, pp. 108-114.

[20] P. L. Buttigieg, et al., The environment ontology: contextualising biological and biomedical entities, J. Biomed. Semant. 4 (2013) 43.