
Metadata to Knowledge Graphs

Nora Abdelmageed [1, 2]
nora.abdelmageed@uni-jena.de

Birgitta König-Ries [1, 2]
birgitta.koenig-ries@uni-jena.de

[1] Friedrich Schiller University Jena, Jena, Germany
[2] Michael Stifel Center Jena

2022


Metadata describes data: it records the who, when, where, how, and why of data collection. Ideally, it is available in a machine-understandable format such as RDF, which enables queries in structured query languages such as SPARQL and supports further data usage. In this paper, we investigate metadata as a source for generating Knowledge Graphs (KGs). We introduce a fully automatic approach that transforms raw metadata files into a KG. Our resources and code are publicly available.

1. Introduction

case; however, we expect our method to be domain-independent.

Methodology

Figure 1 shows the four phases of our pipeline.

[Figure 1: Pipeline overview with four phases: Data Acquisition, Ontology Development, Match & Populate, and Release.]

1) Data Acquisition. We collected our metadata files from various biodiversity data portals to develop the data model and to evaluate our matching technique.

2) Ontology Development. We crafted our data model, the Biodiversity Metadata Ontology (BMO), in a data-driven process. We applied several cleaning steps to the collected data. During this phase, we held several meetings with a biodiversity expert to validate and review our conceptual model. In addition, we developed mean-based techniques to transform BMO into the embedding space (BMO_E).

3) Match & Populate. We use unsupervised learning methods for ontology matching and instance population. For matching, we compute the cosine similarity in the embedding space between the ontological embeddings, BMO_E, and the metadata embeddings, Keys_E. We use embeddings to capture the semantic meaning of words. For population, we add a triple if and only if its value has the expected datatype. For example, we accept the triple (author, phone, XXX) only if "XXX" is a phone number. We implemented these validations using regular expressions.

4) Release. We published our resources and code under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Apache License 2.0, respectively.
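The matching and population steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word vectors, the similarity threshold, the phone-number pattern, and the function names `mean_embedding`, `match_key`, and `populate` are all assumptions made for illustration. A real pipeline would load pretrained word embeddings instead of the hand-written vectors used here.

```python
import math
import re

# Hypothetical toy word vectors (assumption); a real pipeline would load
# pretrained embeddings instead.
WORD_VECTORS = {
    "author": [0.9, 0.1, 0.0],
    "creator": [0.8, 0.2, 0.1],
    "phone": [0.1, 0.9, 0.0],
    "telephone": [0.2, 0.8, 0.1],
    "number": [0.1, 0.6, 0.3],
    "title": [0.0, 0.1, 0.9],
}


def mean_embedding(label):
    """Average the vectors of the known words in a label; None if none are known."""
    words = [w for w in re.split(r"[\W_]+", label.lower()) if w in WORD_VECTORS]
    if not words:
        return None
    dim = len(next(iter(WORD_VECTORS.values())))
    return [sum(WORD_VECTORS[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_key(key, ontology_labels, threshold=0.8):
    """Return the most similar ontology label at or above the threshold, else None.

    The threshold value is an assumption, not taken from the paper.
    """
    key_vec = mean_embedding(key)
    if key_vec is None:
        return None
    best_label, best_score = None, threshold
    for label in ontology_labels:
        label_vec = mean_embedding(label)
        if label_vec is None:
            continue
        score = cosine(key_vec, label_vec)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical datatype pattern (assumption): optional "+", then 6 to 20
# digits, spaces, parentheses, slashes, or hyphens.
DATATYPE_PATTERNS = {
    "phone": re.compile(r"\+?[\d\s()/-]{6,20}"),
}


def populate(subject, predicate, value):
    """Return a triple only if the value matches the predicate's expected datatype.

    Predicates without a registered pattern are rejected in this sketch.
    """
    pattern = DATATYPE_PATTERNS.get(predicate)
    value = value.strip()
    if pattern is None or not pattern.fullmatch(value):
        return None
    return (subject, predicate, value)
```

With these toy vectors, a key such as `telephone_number` is split into two words, averaged, and matched to the label `phone`, while a value such as "XXX" is rejected by `populate` because it does not look like a phone number. Rejecting predicates that have no registered pattern is a conservative choice of this sketch; the source does not state how such cases are handled.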

Acknowledgments

The authors thank the Carl Zeiss Foundation for the financial support of the project “A Virtual Werkstatt for Digitization in the Sciences (K3, P5)” within the scope of the program line “Breakthroughs: Exploring Intelligent Systems for Digitization - explore the basics, use applications”. In addition, we thank Cornelia Fürstenau, Sirko Schindler, Muhammad Abbady, and Jan Martin Keil for the fruitful discussions.
