=Paper=
{{Paper
|id=Vol-3324/om2022_poster3
|storemode=property
|title=Meta2KG: transforming metadata to knowledge graphs
|pdfUrl=https://ceur-ws.org/Vol-3324/om2022_poster3.pdf
|volume=Vol-3324
|authors=Nora Abdelmageed,Birgitta König-Ries
|dblpUrl=https://dblp.org/rec/conf/semweb/AbdelmageedK22
}}
==Meta2KG: transforming metadata to knowledge graphs==
Meta2KG: Transforming Metadata to Knowledge Graphs

Nora Abdelmageed¹﹐²﹐³, Birgitta König-Ries¹﹐²﹐³

¹ Heinz Nixdorf Chair for Distributed Information Systems
² Michael Stifel Center Jena
³ Friedrich Schiller University Jena, Jena, Germany

Abstract

Metadata is used to describe data. It includes information about the who, when, where, how, and why of data collection. Ideally, it should be in a machine-understandable format like RDF. This enables queries using structured query languages like SPARQL and empowers further data usage. In this paper, we investigate metadata as a source for generating Knowledge Graphs (KGs). We introduce a fully automatic approach that transforms raw metadata files into a Knowledge Graph (KG). Our resources and code are publicly available¹.

Keywords: Metadata Analysis, RDF, Matching, Knowledge Graph, Embeddings

1. Introduction

Knowledge Graphs (KGs) are widely used to represent information about entities of interest and their relations [1]. Lately, this includes information encoded in scientific datasets. Often, these datasets are accompanied by metadata describing the who, when, where, how, and why of data collection. Transforming metadata into KGs increases the FAIRness [2] of the data by enhancing its reusability. Embeddings are a well-established technique that captures the semantics of a given word or sentence. Previous works have shown their significant impact on many Natural Language Processing (NLP) applications [3]. In this work, we transform raw metadata files into a KG using an embedding-based matching technique. We tested our technique on a biodiversity use case; however, we expect our method to be domain-independent.

2. Methodology

Figure 1 shows the four phases of our pipeline.

1) Data Acquisition: We collected our metadata files from various biodiversity data portals to develop the data model and evaluate our matching technique.
¹ https://github.com/fusion-jena/Meta2KG

Ontology Matching @ISWC 2022. nora.abdelmageed@uni-jena.de (N. Abdelmageed); birgitta.koenig-ries@uni-jena.de (B. König-Ries). ORCID: 0000-0002-1405-6860 (N. Abdelmageed); 0000-0002-2382-9722 (B. König-Ries). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Figure 1: Abstract overview of our workflow to transform raw metadata to KG.

2) Ontology Development: The data-driven process of crafting our data model, the Biodiversity Metadata Ontology (BMO). We applied several cleaning steps to the collected data. During this phase, we held several meetings with a biodiversity expert to validate and review our conceptual model. In addition, we developed mean-based techniques to transform BMO into the embedding space (BMO E).

3) Match & Populate: Our unsupervised learning methods for ontology matching and instance population. For matching, we used cosine similarity in the embedding space between the ontological embeddings (BMO E) and the metadata embeddings (Keys E). We used embeddings to capture the semantic meaning of words. For population, we add a triple if and only if its value has the expected datatype. For example, we accept the triple (author, phone, XXX) only if "XXX" is a phone number. We implemented this kind of validation using regular expressions.
4) Release: We published our resources and code under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and the Apache License 2.0, respectively.

Acknowledgments

The authors thank the Carl Zeiss Foundation for the financial support of the project "A Virtual Werkstatt for Digitization in the Sciences (K3, P5)" within the scope of the program line "Breakthroughs: Exploring Intelligent Systems for Digitization - explore the basics, use applications". In addition, we thank Cornelia Fürstenau, Sirko Schindler, Muhammad Abbady, and Jan Martin Keil for the fruitful discussions.

References

[1] A. Hogan, E. Blomqvist, M. Cochez, C. d'Amato, G. de Melo, C. Gutiérrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, A. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge, Morgan & Claypool Publishers, 2021. doi:10.2200/S01125ED1V01Y202109DSK022.

[2] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR guiding principles for scientific data management and stewardship, Scientific Data 3 (2016). doi:10.1038/sdata.2016.18.

[3] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Trans. Assoc. Comput. Linguistics 5 (2017) 135–146. URL: https://doi.org/10.1162/tacl_a_00051. doi:10.1162/tacl_a_00051.