     RICDaM: Recommending Interoperable and
            Consistent Data Models

                       Daniela Oliveira and Mathieu d’Aquin

Data Science Institute, Insight SFI Research Centre for Data Analytics, NUI Galway
                         {first,last}@insight-centre.org



        Abstract One of the core functionalities of Knowledge Graphs is that
        data is not required to adhere to strictly defined data models. Nonetheless,
        the RDF model provides specifications for publishing data on the Web
        that not only describe entities and relationships but also focus on
        enabling interoperability between datasets. However, finding the right
        ontologies to model a dataset is a challenge, since several valid data
        models exist and there is no clear agreement between them. We present
        a demonstration of an interface that allows users to customise a data
        model based on recommendations obtained with the RICDaM framework.
        This framework focuses on ranking candidates based on three metrics
        that measure the relevance of the candidate, the interoperability of
        the candidate's neighbourhood, and the overall consistency of the
        proposed model. The interface allows the user to refine the recommended
        data model and to complete it when the recommendations are lacking or
        do not directly fit the user's intention.
        The demo can be found at http://afel.insight-centre.org/ricdam/

        Keywords: Knowledge Graph · Ontologies · Linked Open Data.


1     Introduction
Data models have an important role in data integration on the Web because they
define how data is connected and stored. The RDF data model uses subject-
predicate-object statements (i.e., triples) to model datasets. The data model is
usually supported by one or more ontologies that provide standard nomenclature
to describe concepts. From the ontologies, classes are used to annotate the entity
types of subjects and objects in an RDF dataset, while datatype and object
properties are used to model predicates. A survey conducted over several years
with Linked Data providers found that the third most common barrier to publishing
Linked Data was "Selecting appropriate ontologies to represent our data" [8].
If data publishers cannot find appropriate ontologies to model their data, they
tend to create their own ontologies or extend upper-level ontologies to meet their
requirements. Therefore, knowledge from similar domains ends up following
different standards or data models that do not always focus on interoperability with
other datasets in the same domain, making it challenging to integrate data from
multiple, existing knowledge graphs, as well as to model new data consistently.
This issue can be found in different domains [2, 6].
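    As an illustration of the modelling described above, the following sketch uses
the rdflib Python library to build a tiny RDF graph in which classes annotate
entity types and object and datatype properties model predicates; the entities,
namespaces, and values are invented for illustration only.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, FOAF, DCTERMS

    EX = Namespace("http://example.org/")
    g = Graph()

    book, author = EX["book/1"], EX["person/1"]
    g.add((book, RDF.type, EX.Book))                      # class annotates the entity type
    g.add((author, RDF.type, FOAF.Person))
    g.add((book, DCTERMS.creator, author))                # object property (entity to entity)
    g.add((book, DCTERMS.title, Literal("Some Title")))   # datatype property (entity to literal)

    print(g.serialize(format="turtle"))
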
    We propose the RICDaM framework (Recommending Interoperable and Consistent
Data Models), which produces a ranked list of candidate data models
that not only fit the data but are also interoperable with a Knowledge Graph
of multiple published RDF data sources. The output of the framework is the
correspondence between a list of triple patterns (i.e., domain, property, range
triple) from an input dataset and a ranked list of candidate triple patterns. We
exploit the content and graph structure of this Knowledge Graph to compute
scores that consider the accuracy, interoperability, and consistency of the
candidates. These scores are combined into a single score that is weighted according
to the user’s preferences or use-case. In this demo, we present an interface that
explores the use-case of aligning two datasets in the library data domain with
published RDF library data models. The demonstrator shows the output of the
framework and allows the user to choose the weights of the scores and manually
customise the automatic recommendations generated by the framework.
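    To make the shape of this output concrete, the sketch below shows one possible,
hypothetical representation of the result: each input triple pattern maps to
candidate triple patterns ranked by a weighted combination of the three scores,
with the weights standing in for the user's preferences. The names and numbers
are illustrative and not the framework's actual API.

    from typing import NamedTuple

    class Candidate(NamedTuple):
        domain: str
        prop: str
        range_: str
        content: float           # how well the candidate matches the input
        interoperability: float  # how connected the candidate is in the background KG
        consistency: float       # how well the candidate agrees with the rest of the model

    def combined_score(c, w_content=0.4, w_interop=0.3, w_consist=0.3):
        # Weights are tunable by the user or use-case; the values here are arbitrary.
        return w_content * c.content + w_interop * c.interoperability + w_consist * c.consistency

    candidates = {
        ("ebook", "creator", "agent"): [
            Candidate("bibo:Document", "dcterms:contributor", "foaf:Agent", 0.8, 0.9, 0.7),
            Candidate("schema:Book", "schema:author", "schema:Person", 0.9, 0.6, 0.6),
        ],
    }
    for pattern, options in candidates.items():
        best = max(options, key=combined_score)
        print(pattern, "->", (best.domain, best.prop, best.range_))
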
    The problem of creating a data model for a dataset is related to schema
matching, which maps relationships and concepts. A variety of schema matching
solutions have been proposed for different types of data (e.g., [7, 5, 4, 1]) with
the purpose of integrating heterogeneous datasets within a domain. Mapping
languages are also popular to align heterogeneous data sources with the RDF
data model. The RML [3] is an example of a mapping language that enables the
conversion of heterogeneous data formats to RDF. Contrary to these approaches,
our framework recommends data models based on existing Knowledge Graphs
by focusing not only on accuracy but also on boosting candidates that are the
most interoperable within the graph.


2     Framework Overview

The RICDaM framework, illustrated in Figure 1, has three stages: (1) Building
Background Knowledge, (2) Candidate Generation, and (3) Candidate Ranking.



2.1   Building Background Knowledge

The goal of this stage is to build the background knowledge structures that will
facilitate the next stages of the framework. The first task includes creating a
single KG from multiple existing RDF datasets. The entities of the KG are col-
lected in a document store and indexed using an inverted index to enable full-text
search. These entities maintain their original predicates but are also connected
to external entities via the underlying ontology graph that facilitates candidate
ranking. We enrich this ontology graph by inferring missing information (e.g., the
entity type of entities linked via owl:sameAs relationships) and via ontology matching.
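    As a rough sketch of the full-text search involved here, assuming a simple
token-based inverted index rather than the actual document store implementation:

    from collections import defaultdict

    # Toy entity documents; in the framework these come from the merged KG.
    documents = {
        "bnf:12345": "Les Misérables Victor Hugo novel",
        "dnb:67890": "Faust Johann Wolfgang von Goethe tragedy",
    }

    index = defaultdict(set)  # token -> identifiers of entities mentioning it
    for entity, text in documents.items():
        for token in text.lower().split():
            index[token].add(entity)

    def search(literal):
        # Entities whose document shares at least one token with the literal.
        hits = [index[t] for t in literal.lower().split() if t in index]
        return set.union(*hits) if hits else set()

    print(search("Victor Hugo"))  # -> {'bnf:12345'}
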
[Figure 1: RDF sources feed the Knowledge Building stage (Knowledge Graph,
Document Store, pre-computed metadata, Random Forest); an input dataset
(CSV, JSON, RDF) feeds Candidate Generation (unranked entity type, object
property, and datatype property candidates); Candidate Ranking combines
Content, Interoperability, and Consistency scores into ranked data model
candidates.]

                         Figure 1. Workflow diagram.


   In a final step, we extract metadata from the document store (e.g., entity
type frequency), and we train a Random Forest model on datatype property
values of the KG to facilitate the generation of datatype property candidates.
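    The exact features and training setup are not detailed here, so the following
scikit-learn sketch is only one plausible reading: simple surface features of a
literal value are used to predict the datatype property it likely belongs to.

    from sklearn.ensemble import RandomForestClassifier

    def features(value: str):
        # Hypothetical surface features of a literal value.
        return [len(value),
                sum(ch.isdigit() for ch in value) / max(len(value), 1),
                value.count(" ")]

    X = [features(v) for v in ["1862", "Victor Hugo", "978-0140444308"]]
    y = ["dcterms:issued", "dcterms:creator", "bibo:isbn"]

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(clf.predict([features("1915")]))  # plausibly 'dcterms:issued'
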


2.2   Candidate Generation and Ranking

This stage generates a list of entity type and property candidates for an input
dataset. For the entity types, the candidate list is obtained by searching literals
in the inverted index of the document store. Datatype properties are obtained
with the Random Forest classification model, which produces datatype property
predictions for input literal values. Object properties are inferred from the
relationships between the entity type candidates in the ontology graph.
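    For the object properties, a minimal sketch of this inference step could look
as follows, assuming the ontology graph exposes (domain, property, range) edges;
the edges listed are examples only.

    # Toy ontology edges: (domain class, object property, range class).
    ontology_edges = [
        ("bibo:Document", "dcterms:contributor", "foaf:Agent"),
        ("schema:Book", "schema:author", "schema:Person"),
        ("schema:Book", "schema:publisher", "schema:Organization"),
    ]

    def object_property_candidates(domain_candidates, range_candidates):
        # Keep only edges connecting an entity type candidate of the subject
        # to an entity type candidate of the object.
        return [(d, p, r) for d, p, r in ontology_edges
                if d in domain_candidates and r in range_candidates]

    print(object_property_candidates({"bibo:Document", "schema:Book"}, {"foaf:Agent"}))
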
    We propose to rank candidates using a Content score and an Interoperability
score. The Content score combines metrics based on string similarity, search
result frequency, and graph distances. This score assesses the appropriateness
of a candidate to match an entity type or property from the input dataset. The
Interoperability score contains information from the document store and from
a sub-graph that restricts edges to relationships of equivalence, subsumption,
or relatedness extracted via ontology matching techniques. This score measures
how interoperable a candidate is, considering the frequency of the candidate in
the KG and the graph neighbourhood of the candidate. Higher Interoperability
scores translate into a more connected and relevant candidate. This score uses
the interoperability metric and the neighbourhood size. The neighbourhood size
represents the total number of related neighbours up to a distance in the graph,
while the interoperability metric counts the frequency of these neighbours in the
document store. In contrast with the interoperability metric, the neighbourhood
size rewards neighbours that do not appear in the RDF data sources but are
still connected to the candidate in the ontology graph.
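    A small sketch of these two quantities, assuming the sub-graph is available as
an adjacency list and the document store exposes per-type frequencies (both toy
values here):

    # Toy sub-graph of equivalence/subsumption/relatedness edges (distance 1).
    neighbours = {
        "bibo:Document": ["schema:Book", "dcterms:BibliographicResource", "ex:Publication"],
    }
    # Toy frequencies of entity types observed in the document store.
    frequency = {"schema:Book": 120_000, "dcterms:BibliographicResource": 35_000}

    def interoperability_signals(candidate):
        related = neighbours.get(candidate, [])
        neighbourhood_size = len(related)  # counts every related neighbour
        interoperability_metric = sum(frequency.get(n, 0) for n in related)  # only those in the RDF sources
        return neighbourhood_size, interoperability_metric

    print(interoperability_signals("bibo:Document"))  # -> (3, 155000)
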
     Up to this point, candidates are ranked independently of each other. Therefore,
a final metric, the Consistency score, is computed. This score increases the likelihood
of the same candidate being suggested for the same input entity type or property
and, at the same time, boosts triples that are more commonly encountered in the
Knowledge Graph. This consistency is achieved by rewarding triples that appear
together (co-occur) in the KG, while also guaranteeing that the same triple elements
are assigned the same entity types or properties throughout the data model. Thus,
the Consistency score combines Content, Interoperability, and co-occurrence
frequencies into a single score that ranks candidate data model triples in terms
of their adequacy and interoperability.
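    As a loose illustration of this last step (the actual combination function is
not given here), co-occurrence counts from the KG can be normalised and folded
into the earlier scores:

    # Toy co-occurrence counts of candidate triples in the background KG.
    co_occurrence = {
        ("bibo:Document", "dcterms:contributor", "foaf:Agent"): 80_000,
        ("bibo:Document", "schema:author", "foaf:Agent"): 1_200,
    }
    total = sum(co_occurrence.values())

    def consistency_score(triple, content, interoperability):
        # Illustrative combination only: frequently co-occurring triples are rewarded.
        co = co_occurrence.get(triple, 0) / total
        return (content + interoperability + co) / 3

    print(consistency_score(("bibo:Document", "dcterms:contributor", "foaf:Agent"), 0.8, 0.9))
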


3   Demonstration

This demonstration presents the top candidate data model proposed to the user
in response to an input dataset in the library domain. The interface gives an
overview of the best ranked candidates for each triple but also allows the user
to adapt the data model to their preference and use-case.
    The main functionalities included in the demo are (1) an overview of the
output of the framework, (2) customisation of the data model, (3) tuning of the
parameters to produce different candidate rankings, and (4) exporting the data
model as a set of mappings between the input and the produced data model.
    When the user makes a manual change to the data model, they can choose
to propagate that change to maintain consistency across the dataset or keep
the change local to the modified cell. The tuning parameters allow the user
to customise the ranking of the candidates and obtain different top-ranked data
models, which can speed up the modelling process by surfacing the candidates the
user is looking for. Finally, the user can export the data model produced
and apply the mappings to translate their original input data to an RDF dataset
that is potentially more interoperable with the datasets in the KG.
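    A sketch of what applying such exported mappings could look like for a CSV
input is shown below; the mapping layout and the chosen vocabulary terms are
assumptions for illustration, not the demo's actual export format.

    import csv, io
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")
    BIBO = Namespace("http://purl.org/ontology/bibo/")
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    # Hypothetical mapping: an entity type plus one property per CSV column.
    mapping = {"__type__": BIBO.Document,
               "title": DCTERMS.title,
               "contributor": DCTERMS.contributor}

    rows = csv.DictReader(io.StringIO("id,title,contributor\n1,Faust,Goethe\n"))
    g = Graph()
    for row in rows:
        subject = EX["record/" + row["id"]]
        g.add((subject, RDF.type, mapping["__type__"]))
        for column, predicate in mapping.items():
            if column in row:  # skips the pseudo-column '__type__'
                g.add((subject, predicate, Literal(row[column])))

    print(g.serialize(format="turtle"))
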
    Table 1 presents the datasets used in the demo. The dashed line separates
datasets used to build the background KG (top) and datasets used as input
(bottom). Overall, the datasets contain a variety of records such as books, audio
records, and periodicals. For example, for pgterms:ebook in Project Gutenberg,
the suggestion is to use bibo:Document as the most interoperable, relevant, and
consistent entity type, together with properties such as dcterms:contributor.
Through the interface, this can be changed, for example, to schema:Book, which
is also commonly used in the background KG.
               Table 1. Summary of the source RDF libraries chosen.

Library name                       Handle      URL                            # Entities

British Library                    British     https://www.bl.uk              17 289 195
Bibliothèque Nationale de France   French      https://www.bnf.fr             38 100 563
Deutsche Nationalbibliothek        German      https://www.dnb.de             50 340 156
Biblioteca Nacional de Portugal    Portuguese  http://www.bnportugal.gov.pt    2 437 096
Biblioteca Nacional de España      Spanish     http://www.bne.es              20 752 087
------------------------------------------------------------------------------------------
Project Gutenberg (RDF)            Gutenberg   https://www.gutenberg.org         856 476
Open Library (JSON)                OpenL       https://openlibrary.org        54 678 367



4   Conclusion
In general, designing a data model is not a trivial task. When considering
integration with existing datasets, the task becomes more complex. Our demo
facilitates the task of finding the best possible data model according to certain
criteria by producing a ranked list of candidates to match entity types and
properties in a dataset. In the future, a complete tool using this framework would
allow the user to search the indexed ontologies or add new ontologies to the
graph to complete the model when the framework fails to find the desired class.
For JSON or CSV input datasets, it would also be possible to generate an RML
mapping file to facilitate the process of translating the data to RDF.


References
1. Alserafi, A., Abelló, A., Romero, O., and Calders, T.: Keeping the Data Lake in
   Form: Proximity Mining for Pre-Filtering Schema Matching. ACM Trans. Inf. Syst.
   38(3), 26:1–26:30 (2020)
2. d’Aquin, M., Adamou, A., and Dietze, S.: Assessing the educational linked data
   landscape. In: Proceedings of the 5th Annual ACM Web Science Conference. WebSci
   ’13, pp. 43–46. Association for Computing Machinery, Paris, France (2013)
3. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., and Van
   de Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Hetero-
   geneous Data. In: LDOW (2014)
4. Ermilov, I., and Ngomo, A.-C.N.: TAIPAN: Automatic Property Mapping for Tab-
   ular Data. In: Knowledge Engineering and Knowledge Management, pp. 163–179.
   Springer International Publishing, Cham (2016)
5. Limaye, G., Sarawagi, S., and Chakrabarti, S.: Annotating and searching web tables
   using entities, types and relationships. Proc. VLDB Endow. 3(1), 1338–1347 (2010)
6. Park, H., and Kipp, M.: Library Linked Data Models: Library Data in the Semantic
   Web. Cataloging & Classification Quarterly 57(5), 261–277 (2019)
7. Pei, J., Hong, J., and Bell, D.: A Novel Clustering-Based Approach to Schema
   Matching. In: Advances in Information Systems. LNCS, pp. 60–69. Springer, Heidel-
   berg (2006)
8. Smith-Yoshimura, K.: Analysis of 2018 International Linked Data Survey for Im-
   plementers. The Code4Lib Journal (2018)