<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards a Gateway for Knowledge Graph Schemas Collection, Analysis, and Embedding</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mattia</forename><surname>Fumagalli</surname></persName>
							<email>mattia.fumagalli@unibz.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
								<address>
									<settlement>Bolzano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Boffo</surname></persName>
							<email>marco.boffo@studenti.unitn.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Qascom SRL</orgName>
								<address>
									<settlement>Vicenza</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Daqian</forename><surname>Shi</surname></persName>
							<email>daqian.shi@unitn.it</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Trento</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mayukh</forename><surname>Bagchi</surname></persName>
							<email>mayukh.bagchi@unitn.it</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Trento</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fausto</forename><surname>Giunchiglia</surname></persName>
							<email>fausto.giunchiglia@unitn.it</email>
							<affiliation key="aff2">
								<orgName type="institution">University of Trento</orgName>
								<address>
									<settlement>Trento</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards a Gateway for Knowledge Graph Schemas Collection, Analysis, and Embedding</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D5232A8AF53FF57033437338D1B643AC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Knowledge graphs</term>
					<term>knowledge graphs catalog</term>
					<term>knowledge graphs analysis</term>
					<term>knowledge graphs embedding</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>One of the significant barriers to the training of statistical models on knowledge graphs is the difficulty that scientists have in finding the best input data to address their prediction goal. A further key challenge is determining how to manipulate these relational data, which often come in the form of triples (i.e., subject, predicate, object), to enable the learning process. Many high-quality catalogs of knowledge graphs are currently available. However, their primary goal is the re-usability of these resources, and their interconnection, in the context of the Semantic Web. This paper describes the LiveSchema initiative, namely the first version of a gateway whose main scope is to leverage the gold mine of data collected by the many existing catalogs of relational data such as ontologies and knowledge graphs. At the current state, LiveSchema contains ∼ 1000 datasets from 4 main sources and offers some key facilities, which allow users to: i) evolve LiveSchema, by aggregating other source catalogs and repositories as input sources; ii) query all the collected resources; iii) transform each given dataset into formal concept analysis matrices that enable analysis and visualization services; iv) generate models and tensors from each given dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Finding the best data to train statistical models and properly address the target learning goal is widely recognized as one of the most pivotal tasks in Machine Learning (ML) <ref type="bibr" target="#b0">[1]</ref>. ML models highly depend, indeed, on the quality of data they receive as input. While, so far, the development of highly efficient and scalable learning methods, to address critical prediction tasks (see, for instance, image classification and information patterns recognition <ref type="bibr" target="#b1">[2]</ref>), helped data scientists and analytics professionals in scaling their activities, the process of finding, selecting and improving the quality of these data still requires a considerable amount of time and manual effort <ref type="bibr" target="#b2">[3]</ref>. This latter challenge is also present when statistical models are trained on knowledge representations, like knowledge graphs and ontologies <ref type="bibr" target="#b3">[4]</ref>, where the data received as input are graph-structured data, consisting of entities (or nodes) and labeled links, or edges, (relations between entities). In this setting, the final learning goal can be identified as the prediction of missing relations between nodes, the prediction of nodes properties, and the clustering of the nodes based on their connections, these being common goals arising in many scenarios, such as analysis of social networks and biological pathways <ref type="bibr" target="#b4">[5]</ref>. 
Consequently, the efficacy of ML algorithms directly depends on the quality of the input graphs, as well as on their relevance to the domain of application.</p><p>In this paper, leveraging the ideas presented in <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref> and <ref type="bibr" target="#b8">[9]</ref>, where an approach to analyzing knowledge graph schemas to address Entity Type Recognition (ETR) tasks has been devised, we introduce the LiveSchema initiative, namely the first version of a gateway whose main scope is to exploit the gold mine of data collected by the many existing catalogs of relational data such as ontologies and knowledge graphs. At the current state, LiveSchema contains ∼ 1000 datasets from 4 sources and offers some key facilities, which allow users to: i) continuously update a catalog of knowledge graphs, by aggregating other source catalogs and repositories; ii) query all the collected resources; iii) transform each given dataset into formal concept analysis matrices that enable analysis and visualization services; iv) generate models and tensors from each given dataset. LiveSchema is accessible at http://liveschema.eu/ and is ready to be demonstrated. The admin functionalities can be accessed and tested at http://liveschema.eu/user/login, by using 'reviewer' as admin/password. This demonstration paper is organized as follows. Section 2 introduces the motivation guiding the initiative. Section 3 gives a brief overview of the data architecture. Section 4 shows how LiveSchema can be evolved with new knowledge data. Section 5 illustrates the main LiveSchema components. Section 6 provides an example of how LiveSchema can be used. Section 7 discusses some implications and limitations. Section 8 provides the conclusion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Motivation</head><p>As an example scenario, suppose that a data scientist needs to run a standard Entity Type Recognition (ETR) task, <ref type="foot" target="#foot_0">1</ref> as described in <ref type="bibr" target="#b7">[8]</ref> and <ref type="bibr" target="#b11">[12]</ref>, where the goal is to recognize objects of the type 'Person' across a set of multiple tabular data, coming, for instance, from an open data repository. This may mean that she needs to find a reference ontology containing: i. the target class and its corresponding label; ii. possibly a large number of properties for the target class, to increase the probability of matching some of the input test properties; iii. possibly a low number of overlapping properties, in order to decrease the number of false-negative/positive predictions.</p><p>The process of searching, analyzing, and transforming the target ontology can take a long time and involve considerable effort. The scientist has, indeed, to go through a broad search over the available resources and related catalogs, possibly checking multiple data versions and formats. Moreover, once the candidate resources are identified, she should run an analysis of the data, to better understand their reliability concerning the target task. Additionally, this analysis (see, for instance, the simple data about the number of properties associated with each class) requires a processing phase that is assumed to be set up and run directly by the scientist. As a final step, if the scientist succeeds in finding the data she needs, a transformation process must be run to re-use the relational data in the reference ETR setup. What if the scientist could run all these operations in one single place, with the support of ready-to-use built-in facilities?</p><p>The idea of LiveSchema arose precisely from this key challenge. 
Firstly, the gateway aims at supporting scientists in better finding the relational data they need. Indeed, by leveraging the updates of some of the best state-of-the-art catalogs, LiveSchema should offer an aggregation service that allows searching and keeping track of the evolution of the knowledge representation development community in one place.</p><p>Moreover, by implementing some key state-of-the-art libraries, LiveSchema aims to facilitate the data analysis and preparation process. Most of the implemented libraries, indeed, require an ad-hoc set-up and may involve the combination of multiple components and environments, involving some coding and development skills that not all pure data scientists have. In this sense, LiveSchema aims to offer a platform that unites data analysis, data processing, and machine learning model deployment, making them easily accessible, reusable, and less time-consuming.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data Architecture</head><p>The current version of LiveSchema is grounded in the CKAN<ref type="foot" target="#foot_1">2</ref> open-source data management system, which is widely recognized as one of the most reliable tools for managing open data. We concentrate on the fundamental distinction in CKAN which informs the data architecture of LiveSchema, namely that between dataset and resource. <ref type="foot" target="#foot_2">3</ref> A dataset is defined as a set of data (e.g., BBC Sport Ontology) which may contain several resources representing the physical embodiment of the dataset in different downloadable formats (e.g., BBC Sport Ontology in Turtle and FCA formats). This distinction allows us, as a major advance from mainstream catalogs such as <ref type="bibr" target="#b12">[13]</ref>, to exploit fine-grained metadata properties from the Application Profile for European Data Portals (DCAT-AP), <ref type="foot" target="#foot_3">4</ref> which makes a conceptually identical distinction between dataset and distribution. The additional advantage of using DCAT-AP is that it organizes metadata into mandatory, recommended, and optional properties, which are considered key for facilitating different levels of semantic interoperability amongst data catalogs.</p><p>We now elucidate the metadata specification, i.e. the selected metadata properties for datasets and distributions considered for the current version of LiveSchema: i. 
Dataset:</p><formula xml:id="formula_0">-MANDATORY: description, title; -RECOMMENDED: dataset distribution,</formula><p>Notice that the distinction between dataset and distribution metadata is non-trivial, in the sense that metadata properties like format, license, byte size, and download URL are associated with a distribution and not with the dataset itself.</p><p>Our first observation concerns the two major advantages that the aforementioned data distinction and metadata specification bring to LiveSchema. Firstly, metadata enforces 'FAIR'ification <ref type="bibr" target="#b13">[14]</ref> of the KG schemas (which are 'data' in this case), thus rendering them findable, accessible, interoperable, and reusable for the machine learning tasks that LiveSchema targets. <ref type="foot" target="#foot_4">5</ref> Secondly, as a consequence of the first advantage, the metadata-enhanced KG schemas also play a pivotal role in initiating, enhancing, and sustaining reproducibility <ref type="bibr" target="#b15">[16]</ref>, which is key for LiveSchema vis-à-vis the target machine learning ecosystem in which it participates.</p><p>Our second observation concerns the future extensibility of the metadata specification of LiveSchema. The starting distinction between dataset and distribution can help bootstrap the extension of the initial metadata specification to ontology-specific metadata which, mutatis mutandis, preserves the same distinction via the notions of ontology conceptualization and ontology serialization <ref type="bibr" target="#b16">[17]</ref>. One of the key advantages of using ontology-specific metadata in LiveSchema is that the user can perform a highly customized (conjunctive) search, for instance even at the level of logical formalism or ontology design language, thus retrieving the most compatible schema for the machine learning task at hand. 
In this direction, we plan to exploit the MOD <ref type="bibr" target="#b17">[18]</ref> ontology metadata proposals in the immediate future.</p></div>
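The dataset/distribution distinction described above can be sketched in code. The following fragment is an illustrative data model only (class and field names are our own, not LiveSchema's actual schema): DCAT-AP's mandatory properties sit on the dataset, while format, byte size, license, and download URL sit on the distribution.

```python
from dataclasses import dataclass, field

@dataclass
class Distribution:
    # DCAT-AP associates these properties with the distribution, not the dataset
    download_url: str
    fmt: str          # e.g. "TURTLE", "FCA"
    byte_size: int
    license: str

@dataclass
class Dataset:
    # DCAT-AP mandatory dataset properties
    title: str
    description: str
    # recommended: the dataset's distributions
    distributions: list = field(default_factory=list)

# One conceptual dataset, two physical embodiments in different formats
bbc = Dataset(
    title="BBC Sport Ontology",
    description="Ontology used by the BBC to model sports events and roles",
)
bbc.distributions.append(
    Distribution("http://example.org/bbc.ttl", "TURTLE", 12_345, "CC-BY")
)
bbc.distributions.append(
    Distribution("http://example.org/bbc.fca", "FCA", 2_048, "CC-BY")
)
```

The URLs and sizes are invented; the point is that format-level metadata never leaks onto the dataset object.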
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evolving LiveSchema</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Data Collection and Development</head><p>At the current state, LiveSchema relies on four main state-of-the-art catalogs, namely LOV, DERI,<ref type="foot" target="#foot_5">6</ref> FINTO<ref type="foot" target="#foot_6">7</ref> and a custom catalog, which is still under construction, <ref type="foot" target="#foot_7">8</ref> where some selected resources are stored.</p><p>Each catalog is associated with a provider, which is the person or organization that uploaded the data set and is in charge of its maintenance in the source catalog. From each catalog, multiple data sets have been individually scraped and uploaded in an automated way. Currently, LOV is the catalog providing most of the data sets, being one of the most widely used catalogs of vocabularies in the Semantic Web context. The extension of LiveSchema with new catalogs is part of the immediate future work.</p><p>Given the selected catalogs, 26 types of metadata have been scraped, namely: i. Catalog: id, name, title, logo, URL, description; ii. Data-Set: id, name, title, notes (description), issued, modified (last modified), language, uri (landingPage), contact-uri (homepage / contactPoint), maintainer / publisher (Provider), author / creator (Provider), license-id (license), license-url (license), owner-org (Catalog), version (versionInfo), tags (keyword), source; iii. Provider: id, name, title, uri.</p><p>We carefully checked the datasets during data scraping to ensure that LiveSchema does not break any license agreement. Currently, five kinds of licenses are admitted, given their restrictions (all of them part of the Creative Commons<ref type="foot" target="#foot_8">9</ref> initiative). These license constraints need to be checked since we both provide access to the datasets and manipulate their content to produce the derived resources described next. 
As we parse the datasets from their sources, which come in various formats, <ref type="foot" target="#foot_9">10</ref> we serialize them into the most common ones, namely RDF and Turtle. More advanced output formats can be generated through the processing operations enabled by the LiveSchema services, namely CSV (where all the triples and metadata of the input relational data are stored in a datasheet format), CUE (where all the cue metrics are provided), FCA (i.e., the FCA transformation matrix result), VIS (the format used to enable the visualization functionalities), and EMB (the format used to generate a statistical model based on a knowledge embedding process).</p></div>
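The CSV derived format described above stores all the triples of a parsed dataset in a datasheet. A minimal sketch of that transformation, using stdlib facilities and invented triples (the column names are our assumption, not LiveSchema's exact layout):

```python
import csv
import io

# Illustrative parsed triples (subject, predicate, object)
triples = [
    ("foaf:Person", "rdf:type", "rdfs:Class"),
    ("foaf:name", "rdfs:domain", "foaf:Person"),
    ("foaf:knows", "rdfs:domain", "foaf:Person"),
]

# Write the triples into a datasheet-style CSV
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["subject", "predicate", "object"])  # header row
writer.writerows(triples)
sheet = buf.getvalue()
```

In LiveSchema the same idea is applied to every parsed dataset, so downstream services can consume one uniform tabular view.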
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Stoking LiveSchema</head><p>LiveSchema is managed by a group of knowledge experts, software engineers, and data scientists who contribute to the development and evolution of the whole system. This group of experts, whom we call here LiveSchema administrators, or simply admins, besides handling maintenance issues, is in charge of applying the evolution component functionalities. These functionalities play central roles in the process of populating the catalog. Two evolution operations can be applied by the LiveSchema admin. Firstly, LiveSchema provides an automated evolution process, which is composed of a scraping phase and a parsing phase. A few checkpoints are provided for administrators to supervise the output of the automatic processes. Secondly, manual dataset uploading, reviewing, and managing are also available through the LiveSchema services.</p><p>An example of the manually created list, containing new useful data sets, which are not present in the other selected catalogs, with all their relevant information and metadata, is accessible. <ref type="foot" target="#foot_10">11</ref> Some of these data sets are not directly obtainable from the web and had to be downloaded, unzipped, or edited, and then uploaded on GitHub in order to render them collectible via a URL link. Once the data is scraped, a second key semi-automated parsing process is applied.</p><p>The parsing process is very simple: it is executed iteratively and has the goal of producing two main outputs, namely a set of serialized data sets and a set of parsed data sets. The first output is produced by scanning the data set list and parsing it using the RDFLib Python library, <ref type="foot" target="#foot_11">12</ref> namely a library used to process RDF resources. The produced output is used to generate more standard reference formats, which, in the current setting, are RDF and Turtle. 
We also allow for the generation of an xlsx (or csv) file encoding all the information (e.g., triples and metadata) about the data set, to easily enable the other applications provided by the catalog. In this step, the key role of the admin in charge of the parsing process is to edit the data set list, in order to filter out undesired data sets and parse only the ones that are required. The second output is produced by scanning each triple of the input data set. Filtering over the predicates, which specifies the focus of the dataset, is applied just before some services are invoked. </p></div>
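The predicate-filtering step just described can be sketched as follows; the kept-predicate list here is a hypothetical example, not the actual filter LiveSchema applies:

```python
# Predicates considered relevant for a given service (illustrative choice)
KEPT_PREDICATES = {"rdfs:domain", "rdfs:subClassOf"}

def filter_triples(triples, kept=KEPT_PREDICATES):
    """Scan each triple of the input dataset, keeping only focused predicates."""
    return [(s, p, o) for (s, p, o) in triples if p in kept]

triples = [
    ("foaf:name", "rdfs:domain", "foaf:Person"),
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
    ("foaf:Person", "rdfs:comment", "A person."),
]
focused = filter_triples(triples)  # drops the rdfs:comment triple
```

A service such as the FCA generator would then operate on `focused` rather than on the full triple set.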
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Forging Datasets</head><p>All the datasets that are gathered from the source catalogs and uploaded to LiveSchema can then be transformed and used as input of the available functionalities. The current LiveSchema version contains six main functionalities, which are: 1) the FCA generator, namely the process by which data can be converted into the FCA format; 2) the CUEs generator, i.e., the process by which the CUEs (as defined in Section 2) are generated and encoded in the CUE format; 3) the Visualization generator, namely the process by which the input data can be visualized and analyzed (see the VIS format); 4) the Knowledge Embedder, i.e., the application by which a model can be created out of the input data, by applying one or some of the libraries provided by the PyKEEN package <ref type="bibr" target="#b18">[19]</ref> (see the EMB reference format); <ref type="foot" target="#foot_12">13</ref> 5) the Query Catalog service, which allows running SPARQL queries; 6) the knowledge graph visualizer, namely an implementation of the WebVOWL library. <ref type="foot" target="#foot_14">15</ref> This set of functionalities can be easily accessed and reused via API services, and can also be easily extended; e.g., 4), 5) and 6) can be run by directly using .rdf files as input. Each functionality may require an ad hoc format to produce its output and, in some cases, may have dependencies on the input format of other functionalities, e.g., 1), 2) and 3) involve new formats.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> above provides an example of the LiveSchema data set management page, where all the available functionalities for managing, analyzing, and transforming the data are presented. <ref type="foot" target="#foot_15">16</ref> A set of metadata, tags, and information about the reference source catalog is also provided to users on the top left of the page. 
All the new formats (if present) are accessible on the corresponding functionality page.</p></div>
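To make the FCA transformation concrete, the following sketch builds a formal context, i.e., a binary class × property matrix, from rdfs:domain triples. This is an illustrative reconstruction of the idea, not LiveSchema's actual implementation, and the triples are invented:

```python
def fca_matrix(triples):
    """Build a formal context: rows are classes, columns are properties,
    and a cell is 1 iff the property has the class as its rdfs:domain."""
    domain = [(s, o) for s, p, o in triples if p == "rdfs:domain"]
    classes = sorted({o for _, o in domain})
    props = sorted({s for s, _ in domain})
    has = set((o, s) for s, o in domain)
    matrix = [[1 if (c, pr) in has else 0 for pr in props] for c in classes]
    return classes, props, matrix

triples = [
    ("foaf:name", "rdfs:domain", "foaf:Person"),
    ("foaf:knows", "rdfs:domain", "foaf:Person"),
    ("sport:venue", "rdfs:domain", "sport:Event"),
]
classes, props, matrix = fca_matrix(triples)
# classes: ["foaf:Person", "sport:Event"]
# props:   ["foaf:knows", "foaf:name", "sport:venue"]
```

A matrix of this shape is what downstream analysis (cue metrics, lotus/UpSet visualization) can consume.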
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">LiveSchema Components</head><p>The main components of LiveSchema are combined as in Figure <ref type="figure" target="#fig_1">2</ref>. LiveSchema includes five main components: (1) the user interfaces (UIs) and (2) the APIs, provided by the CKAN platform and partially customized according to our set-up needs, which provide the main access points to the LiveSchema environment; (3) the stoking components and (4) the forging components, which cover the main novel contributions of the LiveSchema initiative and offer the possibility to harvest, generate, and process data; (5) the storage, which allows collecting in one place all the data gathered from other catalogs, provided by the users, or generated through the services. All these components are grouped into three main layers: the presentation layer, the service layer, and the data layer. Presentation layer. This layer enables a community of users: i. to maintain the whole gateway and its applications, and ii. to suggest and upload new resources or edit already existing ones. LiveSchema is mainly managed by a group of expert knowledge engineers, software engineers, and data scientists who contribute to the development and evolution of the whole catalog. Moreover, a group of guest users can also be involved in the collaborative development of the storage, by uploading and editing new data sets and, possibly, creating new input reference catalogs, following well-founded guidelines provided by the knowledge engineers who administer the ecosystem. The definition of the types of access and the different roles played by the LiveSchema users is part of the immediate future work. The APIs allow users to exploit all of the website's core functionalities from external code. 
Using the API, developers are able to: get JSON-formatted lists of vocabularies, with providers (namely, the agents who created the data sets), source catalogs, or other LiveSchema information; get a full JSON representation of a data set or other related information derived from the analysis of the data set; search for data sets, providers, or other resources matching a query; create, update, and delete data sets, with related metadata and information; and get an activity stream of recently changed data sets on LiveSchema, also obtaining the versioning information of each resource. The UIs allow users to access data and functionalities. Two types of user interface are present, namely the front-end and the back-end interface. The former is a customization of the standard CKAN template, where the home page gives access to all the contents of the website through five main widgets: 1) a menu with the top-level categories of the catalog, 2) a search form to easily access and browse data sets, 3) a showcase of the top services and a list of the source reference catalogs, 4) recent activities. By contrast, the back-end user interface can be accessed only with credentials, and allows for the editing and submission of existing or new data, or enables the usage of some further applications.</p><p>Service layer. The stoking components are mainly necessary to check and gather any new knowledge resources from a set of previously selected catalogs. LiveSchema mainly relies on a manual process and a semi-automated process for data insertion. The former can be applied by any type of user, by submitting a new resource through a dedicated panel, but requires a review process by the administrator users. The latter is applied by selecting source catalogs as input and can be used to keep track of their updates. 
This stoking facility can be customized primarily by determining how often the source catalogs must be checked and by defining what types of data sets can be collected and uploaded into the main storage. Currently, the quality criteria that allow the uploading of a data set are its size, the type of license, and the correct format of its content. Along with the stoking components, the LiveSchema forging components encode a set of functionalities aimed at the analysis and transformation of data, and at the generation of new formats. All these functionalities are aimed at supporting scientists in the re-use of the selected relational data.</p></div>
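The admission criteria just listed (size, license, content format) can be sketched as a simple gate. The thresholds, the format list, and the exact license identifiers below are assumptions, since the paper only states that five Creative Commons licenses are admitted:

```python
# Illustrative quality gate for dataset uploads. All constants are assumptions.
ADMITTED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA", "CC-BY-NC", "CC-BY-NC-SA"}
MAX_BYTES = 50 * 1024 * 1024          # hypothetical size cap
PARSABLE_FORMATS = {"rdf", "ttl", "owl", "n3"}

def admit(byte_size, license_id, fmt):
    """Return True iff the dataset passes the size, license, and format checks."""
    return (byte_size <= MAX_BYTES
            and license_id in ADMITTED_LICENSES
            and fmt.lower() in PARSABLE_FORMATS)
```

For example, a small CC-BY Turtle file would pass, while a GPL-licensed file of the same size would be rejected on the license check alone.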
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Using LiveSchema</head><p>The scope of this section is mainly to show how the LiveSchema processing component works. Through a running example, we illustrate how a user can exploit the main functionalities. All the described operations can be directly tested by exploring and using the LiveSchema ecosystem, which is accessible at http://liveschema.eu/. The admin functionalities can be accessed and tested at http://liveschema.eu/user/login, by using 'reviewer' as admin/password.</p></div>
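Since LiveSchema is grounded in CKAN, the ecosystem can also be explored programmatically through CKAN's standard action API (e.g., `package_search`). The sketch below only builds such a request and parses a canned response in the standard CKAN envelope: the endpoint path is CKAN's, but the response data is invented and no network call is made.

```python
import json
from urllib.parse import urlencode

# Build a CKAN package_search request (not sent here)
query = urlencode({"q": "person", "rows": 10})
url = f"http://liveschema.eu/api/3/action/package_search?{query}"

# Canned example of CKAN's standard response envelope
canned_response = json.loads("""
{"success": true,
 "result": {"count": 2,
            "results": [{"name": "foaf", "title": "FOAF"},
                        {"name": "schema-org", "title": "Schema.org"}]}}
""")
hits = [d["title"] for d in canned_response["result"]["results"]]
```

A real client would fetch `url` and apply the same parsing to the live response.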
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Analyzing Relational Data</head><p>As an example scenario, suppose we need to run a standard entity type recognition task, as described in <ref type="bibr" target="#b7">[8]</ref> and <ref type="bibr" target="#b11">[12]</ref>, where we may need to recognize objects of the type 'Person' across a set of multiple tabular data, coming, for instance, from an open data repository. This may involve the need to find a reference relational model with: i. the target class and its corresponding label; ii. possibly a large number of properties for the target class, in order to increase the probability of matching some of the input test properties; iii. possibly a low number of overlapping properties, in order to decrease the number of false-negative/positive predictions.</p><p>A LiveSchema user can perform a simple search across the data sets available in the catalog and then run an analysis to select the best one. The LiveSchema search facility exploits the CKAN search engine and allows for a quick 'Google-style' keyword search. All the data set, provider, and group fields are searchable, and users can use any of them to search for the desired entity. Thanks to this search functionality, it is possible to provide a complete and customized service to the scientist looking for the desired ontology. The basic supported search options are: i. search over all the data set attributes, namely by using any of the applied metadata; ii. full-text search; iii. fuzzy matching, namely an option to search for closely matching terms instead of exact matches; iv. search via API. 
Now, suppose that the user identifies three candidate resources for the target ETR task, namely Schema.org <ref type="foot" target="#foot_16">17</ref> (the reference standard supporting the indexing of web documents by Google<ref type="foot" target="#foot_17">18</ref> ), FOAF <ref type="foot" target="#foot_18">19</ref> (a widely used vocabulary in the context of social networks), and the BBC sport ontology <ref type="foot" target="#foot_19">20</ref> (the ontology used by the BBC to model sports events and roles). The next step is to access each single data set and check its meta-information, which can be done by first generating the FCA format for the selected resources.</p><p>Each LiveSchema data set has a dedicated page collecting its information, where each processing functionality can be accessed. Here, information about the related source catalog is provided as well, and the available standard reference formats can be downloaded. The FCA functionality can be accessed through the corresponding tab and allows for the generation of the corresponding matrix for each given input relational model. On the FCA service page it is also possible to customize the generation of the matrix by filtering the target predicates. Then, multiple insights can be extracted by using the functionalities represented by each tab on the data set page. By downloading all the cue information, comparisons between the three representations of 'Person' provided by each ontology can be run. Figure <ref type="figure" target="#fig_2">3</ref> (a) represents the cue values for the given resources. From a quick benchmark it is clear that in Schema.org, even if the cue of Person is not at the top, the given class has a high centrality, with a score of 23.</p><p>Besides the quantification of the cues, further analysis can be run by visualizing the intersection of some of the top classes of the given resources. 
Figure <ref type="figure" target="#fig_2">3</ref> (b) represents an example of a knowledge lotus that can be extracted from the input resources. Knowledge lotuses are Venn diagrams that can be used to focus on specific parts of the input resources, and they are particularly useful to represent the diversity of classes in terms of their (un-)shared properties. The yellow petals of the lotus show the number of properties that are distinctive for the given class. Further analysis can be run by applying the UpSet (multiple set) visualization facilities, which allow us to analyze the intersections between classes by selecting more than 6 sets (the limit for knowledge lotuses). LiveSchema allows for both knowledge lotuses and UpSet visualization by embedding the functionalities of the Intervene visualization environment. <ref type="foot" target="#foot_20">21</ref> This environment was created for the visualization of multiple genomic regions and gene sets (or lists of items). The main goal of the provided visualization options is to facilitate the analysis and interpretation of the input resource. An illustrative example of the representation of a resource utilizing the UpSet module is also provided, with bars encoding the size of each properties intersection set.</p></div>
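The set arithmetic behind a knowledge lotus can be sketched directly. The property sets below are invented for illustration and do not reflect the actual contents of the three vocabularies:

```python
# Hypothetical property sets for the class 'Person' in three ontologies
person = {
    "schema.org": {"name", "email", "birthDate", "jobTitle"},
    "FOAF":       {"name", "email", "homepage"},
    "BBC Sport":  {"name", "discipline"},
}

# The lotus core: properties shared by all three representations
shared = set.intersection(*person.values())

# The "yellow petals": properties distinctive to each ontology
distinctive = {
    onto: props - set.union(*(p for o, p in person.items() if o != onto))
    for onto, props in person.items()
}
```

With these toy sets, `shared` holds only `name`, while each ontology keeps its own distinctive petal (e.g., `homepage` for FOAF); an UpSet plot generalizes exactly this computation to more than six sets.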
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Embedding Relational Data</head><p>Once the scientist has selected her resource, she is ready to embed it and generate a statistical model out of it. Notice that the current release of LiveSchema allows for distributional embedding techniques only. The implementation of symbolic approaches, such as Inductive Logic Programming, for addressing new tasks like, for instance, class expression learning, is part of the immediate future work. In the current setting, LiveSchema relies on a recent library collecting most of the state-of-the-art techniques for graph embedding, namely the PyKEEN library <ref type="bibr" target="#b18">[19]</ref>. PyKEEN is a widely used solution for generating custom embedding models. It allows selection across a wide range of training approaches with multiple parameters and outputs a .pkl file which can be directly imported into ML pipelines.</p><p>Figure <ref type="figure" target="#fig_5">5</ref> shows a screenshot of the LiveSchema KnowledgeEmbedder page, where various parameters can be selected to reach the specific learning goal. We can choose the "embedding model" from among state-of-the-art algorithms, like TransE, RESCAL or DistMult <ref type="bibr" target="#b19">[20]</ref>, and settings like the "loss function", which is typically used to minimize the error of the model: it reduces multiple features of the model to a single number, namely a scalar value, which allows candidate solutions to be ranked and compared <ref type="bibr" target="#b20">[21]</ref>.</p><p>Notice that the LiveSchema data sets encode relational models with no instance data (e.g., we have the DBpedia schema, but we do not have the so-called ABOX). This did not prevent us from adapting the embedding process and focusing on the schema level only (relying on relational data we always have, indeed, triples: heads, tails, and relations). 
Besides opening the possibility of testing a new application scenario, this does not exclude applying the standard approach where populated schemas are used as input. Populating LiveSchema with this kind of data is part of the immediate future work.</p></div>
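To make the roles of the "embedding model" and "loss function" settings concrete, here is a minimal sketch of the TransE scoring function and a margin-based loss over schema-level triples. The embeddings are random stand-ins for what PyKEEN would actually train, and all entity and relation names are illustrative assumptions, not LiveSchema data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of a schema-level graph: triples over classes and
# properties only (no ABOX), e.g. (Person, subClassOf, Agent).
entities = ["Person", "Agent", "Organization"]
relations = ["subClassOf", "memberOf"]
dim = 16

# Random initial embeddings; training (done by PyKEEN in LiveSchema)
# would adjust these so that plausible triples score higher.
E = {e: rng.normal(size=dim) for e in entities}
R = {r: rng.normal(size=dim) for r in relations}

def transe_score(h, r, t):
    """TransE plausibility: the smaller ||h + r - t||, the more
    plausible the triple, so return the negated L2 norm."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

def margin_loss(pos_score, neg_score, margin=1.0):
    """Pairwise margin loss: the scalar objective a loss-function
    setting controls, pushing true triples above corrupted ones."""
    return max(0.0, margin + neg_score - pos_score)
```

The loss collapses the model's behaviour on a true/corrupted triple pair to a single scalar, which is exactly what allows candidate models to be ranked and compared during training.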
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion</head><p>We believe that LiveSchema can usefully support the study of relational data for both knowledge representation tasks (e.g., designing a knowledge base for enabling interoperability between systems) and machine learning tasks, in particular those that rely on relational data structures for training their models. In this section, we discuss the implications of the initiative. Moreover, by identifying the limitations of our current setting, we also discuss opportunities for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Implications</head><p>Firstly, LiveSchema can support scientists in finding the relational data they need. By leveraging the updates of some of the best state-of-the-art catalogs, LiveSchema can offer an aggregation service that keeps track of the evolution of the knowledge representation development community in one place. Notice that this does not aim to replace the function of each single source catalog. The scientist can indeed access the source catalog and its ad hoc services, if needed, directly from the LiveSchema gateway, which is also an opportunity to increase the visibility of the vocabularies themselves.</p><p>Another key point is that LiveSchema can represent an opportunity to bridge the gap between two key artificial intelligence communities, namely the knowledge representation and the machine learning communities. While most of the data present in LiveSchema are indeed in a format compliant with the requirements of knowledge representation applications, each data set can also be transformed so that it can be easily employed in machine learning set-ups. The analysis and embedding facilities offer further support in this direction. We believe that this is a way of supporting the exploitation of the huge amount of work done by the community and of making the relational model more accessible to machine learning scientists.</p><p>Moreover, we integrate state-of-the-art libraries to support data scientists in the data analysis and preparation phases. Most of these libraries require coding and development skills, which limits their usability for data scientists. To address this issue, LiveSchema offers a platform that unites data analysis, data preprocessing, and machine learning model deployment, making these steps easily accessible and usable.</p><p>Finally, the overall project was also devised to pave the way for large case studies. 
Integrating knowledge representation and machine learning scenarios may indeed lead to different ways of designing relational structures, with a different focus on some of their features or constraints (e.g., the number of properties used to describe a class, or the overlap between classes). Moreover, data scientists reusing relational models for their predictive tasks may better realize which relational models work better than others for a specific learning target, and how they should be tuned to better support that task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Limitations</head><p>Developing LiveSchema into a community of data scientists that exchange and reuse data to the benefit of the AI community is our long-term objective, and this drives the agenda for immediate future work. To achieve this goal with the current set-up, there is still a gap that needs to be bridged.</p><p>As the LiveSchema observatory grows, serious challenges regarding the scalability of the approach will need to be handled. One issue is that the current version of the evolution component cannot automatically detect duplicated resources coming from different vocabularies. Another pending issue, which is part of the agenda for immediate future work, is the definition of a processing functionality that enables users to work with multiple data sets together; this would be an important option, especially for supporting data integration tasks and the development of more robust machine learning models. A possible way of implementing this functionality is to develop a new version of the current FCA conversion process, where multiple data sets can be given as input and then merged, by computing their similarities, into one single file. The output file would then be used as a single resource containing the information of all its component datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion</head><p>In this paper, we introduced the first version of the LiveSchema gateway, which aims at exploiting the relational representation of ontologies not only for their classical applications but also for their use in machine learning scenarios. The long-term goal of LiveSchema is to leverage the gold mine of data collected by many of the existing relational knowledge representation catalogs and offer a family of services to easily access, analyze, transform, and re-use data, with an emphasis on setting up relational machine learning pipelines and on predictive tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: LiveSchema data set page (from an admin perspective).</figDesc><graphic coords="6,151.80,161.43,291.68,266.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Overview of the LiveSchema components.</figDesc><graphic coords="7,172.63,385.11,250.02,137.83" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Cue values for the class 'Person'.</figDesc><graphic coords="10,151.80,256.37,291.69,179.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Etypes properties intersection: the UpSet visualization</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The Knowledge Embedding interface in Liveschema (zoom in for more details).</figDesc><graphic coords="11,130.96,84.19,333.34,166.02" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Relational data was proven to be key also in a transfer learning setting<ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://docs.ckan.org/en/2.9/user-guide.html#what-is-ckan</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://docs.ckan.org/en/538-package-install-docs/publishing-datasets.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://ec.europa.eu/isa2/solutions/dcat-application-profile-data-portals-europe_en/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">To get an idea of what a FAIR catalog consists of, the work presented in<ref type="bibr" target="#b14">[15]</ref> and accessible at https://github.com/ OntoUML/ontouml-models provides a key reference</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://vocab.deri.ie/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">http://finto.fi/en/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">http://liveschema.eu/organization/knowdive</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">https://creativecommons.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">https://rdflib.readthedocs.io/en/stable/plugin_parsers.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_10">https://github.com/knowdive/resources/blob/master/otherVocabs.xlsx</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_11">https://github.com/RDFLib/rdflib</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_12">https://pypi.org/project/pykeen/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_13">https://rdflib.readthedocs.io/en/stable/intro_to_sparql.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_14">http://vowl.visualdataweb.org/webvowl.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_15">three current reference resources .rdf, .ttl and .csv are ready to be downloaded.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="17" xml:id="foot_16">https://schema.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_17">https://www.google.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="19" xml:id="foot_18">http://xmlns.com/foaf/spec/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="20" xml:id="foot_19">https://www.bbc.co.uk/ontologies/sport</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="21" xml:id="foot_20">https://intervene.shinyapps.io/intervene/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The research conducted by Mattia Fumagalli is supported by the "Dense and Deep Geographic Virtual Knowledge Graphs for Visual Analysis -D2G2" project, funded by the Autonomous Province of Bolzano. The research conducted by Fausto Giunchiglia and Mayukh Bagchi has received funding from JIDEP -under grant agreement number 101058732. The research conducted by Daqian Shi has received funding from the program of China Scholarships Council (No. 202007820024).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Garbage in, garbage out? do machine learning application papers in social computing report where humanlabeled training data comes from?</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Geiger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency</title>
				<meeting>the 2020 Conference on Fairness, Accountability, and Transparency</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="325" to="336" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Pattern recognition and machine learning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Anzai</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">The elements of statistical learning: data mining, inference, and prediction</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Springer Science &amp; Business Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">What is a knowledge graph?</title>
		<author>
			<persName><forename type="first">M</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Domain-Specific Knowledge Graph Construction</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A review of relational machine learning for knowledge graphs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nickel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tresp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gabrilovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">104</biblScope>
			<biblScope unit="page" from="11" to="33" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Concepts as (recognition) abilities</title>
		<author>
			<persName><forename type="first">F</forename><surname>Giunchiglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">FOIS</title>
		<imprint>
			<biblScope unit="page" from="153" to="166" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Teleologies: Objects, actions and functions</title>
		<author>
			<persName><forename type="first">F</forename><surname>Giunchiglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on conceptual modeling</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="520" to="534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Entity type recognition-dealing with the diversity of knowledge</title>
		<author>
			<persName><forename type="first">F</forename><surname>Giunchiglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Seventeenth International Conference on Principles of Knowledge Representation and Reasoning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">On knowledge diversity</title>
		<author>
			<persName><forename type="first">F</forename><surname>Giunchiglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>JOWO</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Ontology-driven cross-domain transfer learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Conti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Giunchiglia</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">330</biblScope>
			<biblScope unit="page">249</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Towards understanding classification and identification</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Giunchiglia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Pacific Rim International Conference on Artificial Intelligence</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="71" to="84" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Entity type recognition for heterogeneous semantic graphs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sleeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Finin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI Magazine</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="75" to="86" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Linked open vocabularies (lov): a gateway to reusable semantic vocabularies on the web</title>
		<author>
			<persName><forename type="first">P.-Y</forename><surname>Vandenbussche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Atemezing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Poveda-Villalón</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Vatant</surname></persName>
		</author>
		<idno type="DOI">10.3233/SW-160213</idno>
	</analytic>
	<monogr>
		<title level="j">Semantic Web Journal</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="437" to="452" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The fair guiding principles for scientific data management and stewardship</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Wilkinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dumontier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Aalbersberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Appleton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Axton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Blomberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-W</forename><surname>Boiten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename><surname>Da Silva Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">E</forename><surname>Bourne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific data</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A fair model catalog for ontology-driven conceptual modeling research</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P F</forename><surname>Barcelos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">P</forename><surname>Sales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fumagalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Fonseca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">V</forename><surname>Sousa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Romanenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kritz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Guizzardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conceptual Modeling: 41st International Conference, ER 2022</title>
				<meeting><address><addrLine>Hyderabad, India</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">October 17-20, 2022. 2022</date>
			<biblScope unit="page" from="3" to="17" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The role of metadata in reproducible computational research</title>
		<author>
			<persName><forename type="first">J</forename><surname>Leipzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nüst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">T</forename><surname>Hoyt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Greenberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Patterns</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page">100322</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Omv-ontology metadata vocabulary</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hartmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sure</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Haase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Palma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suarez-Figueroa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISWC</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">3729</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Mod: metadata for ontology description and publication</title>
		<author>
			<persName><forename type="first">B</forename><surname>Dutta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nandini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Shahi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Dublin Core and Metadata Applications</title>
				<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Pykeen 1.0: a python library for training and evaluating knowledge graph embeddings</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Berrendorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">T</forename><surname>Hoyt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Vermue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sharifzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tresp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="3723" to="3728" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Knowledge graph embedding: A survey of approaches and applications</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Guo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="2724" to="2743" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Deep learning in neural networks: An overview</title>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural networks</title>
		<imprint>
			<biblScope unit="volume">61</biblScope>
			<biblScope unit="page" from="85" to="117" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
