<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RICDaM: Recommending Interoperable and Consistent Data Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Institute, Insight SFI Research Centre for Data Analytics</institution>
          ,
          <addr-line>NUI Galway</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>One of the core functionalities of Knowledge Graphs is that data is not required to adhere to strictly defined data models. Nonetheless, the RDF model provides specifications for publishing data on the Web that not only describe entities and relationships but also focus on enabling interoperability between datasets. However, finding the right ontologies to model a dataset is a challenge since several valid data models exist and there is no clear agreement between them. We present a demonstration of an interface that allows users to customise a data model based on recommendations obtained with the RICDaM framework. This framework focuses on ranking candidates based on three metrics that measure the relevancy of the candidate, the interoperability of the neighbourhood of the candidate, and the overall consistency of the proposed model. The interface allows the user to refine the recommended data model and complete it when the framework's suggestions are lacking or do not directly fit the user's intention. The demo can be found at http://afel.insight-centre.org/ricdam/</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Linked Open Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data models play an important role in data integration on the Web because they
define how data is connected and stored. The RDF data model uses
subject-predicate-object statements (i.e., triples) to model datasets. The data model is
usually supported by one or more ontologies that provide a standard nomenclature
to describe concepts. From the ontologies, classes are used to annotate the entity
types of subjects and objects in an RDF dataset, while datatype and object
properties are used to model predicates. A survey conducted over several years with
Linked Data providers found that the third most common barrier to
publishing Linked Data was "Selecting appropriate ontologies to represent our data" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
If data publishers cannot find appropriate ontologies to model their data, they
tend to create their own ontologies or extend upper-level ontologies to meet their
requirements. Therefore, knowledge from similar domains ends up following
different standards or data models that do not always focus on interoperability with
other datasets in the same domain, making it challenging to integrate data from
multiple, existing knowledge graphs, as well as to model new data consistently.
This issue can be found in different domains [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ].
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
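      <p>As a concrete illustration of the triple-based modelling described above, the following minimal sketch represents RDF statements as plain Python tuples rather than using an RDF library; the library-domain IRIs and the helper function are illustrative assumptions, not part of the framework.</p>

```python
# Minimal sketch of RDF-style modelling with plain tuples.
# rdf:type annotates entity types; the IRIs below are hypothetical.
RDF_TYPE = "rdf:type"

triples = [
    # Classes from ontologies annotate the entity types of subjects/objects.
    ("ex:book1", RDF_TYPE, "bibo:Document"),
    ("ex:author1", RDF_TYPE, "foaf:Person"),
    # An object property connects two entities...
    ("ex:book1", "dcterms:contributor", "ex:author1"),
    # ...while a datatype property connects an entity to a literal value.
    ("ex:book1", "dcterms:title", "Moby Dick"),
]

def entity_types(subject, triples):
    """Return the declared entity types of a subject."""
    return [o for s, p, o in triples if s == subject and p == RDF_TYPE]

print(entity_types("ex:book1", triples))  # ['bibo:Document']
```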
      <p>We propose the RICDaM framework (Recommending Interoperable and
Consistent Data Models) which produces a ranked list of candidate data models
that not only fit the data but are also interoperable with a Knowledge Graph
of multiple published RDF data sources. The output of the framework is the
correspondence between a list of triple patterns (i.e., domain, property, range
triple) from an input dataset and a ranked list of candidate triple patterns. We
exploit the content and graph structure of this Knowledge Graph to compute
scores that consider the accuracy, interoperability, and consistency of the
candidates. These scores are combined into a single score that is weighted according
to the user's preferences or use-case. In this demo, we present an interface that
explores the use-case of aligning two datasets in the library data domain with
published RDF library data models. The demonstrator shows the output of the
framework and allows the user to choose the weights of the scores and manually
customise the automatic recommendations generated by the framework.</p>
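      <p>The weighted combination of the three scores can be sketched as follows; the weight names, default values, and candidate score tuples are illustrative assumptions, since the actual weights are chosen by the user.</p>

```python
# Sketch of combining the three RICDaM scores into one ranking score.
# Weights are user-chosen; the defaults here are arbitrary assumptions.
def combined_score(content, interoperability, consistency,
                   w_content=0.4, w_interop=0.3, w_consist=0.3):
    """Weighted sum of the per-candidate scores."""
    return (w_content * content
            + w_interop * interoperability
            + w_consist * consistency)

# Hypothetical candidates for one input entity type, given as
# (content, interoperability, consistency) score tuples.
candidates = {
    "bibo:Document": (0.9, 0.8, 0.7),
    "schema:Book": (0.85, 0.9, 0.6),
}
ranked = sorted(candidates, key=lambda c: combined_score(*candidates[c]),
                reverse=True)
print(ranked)  # ['bibo:Document', 'schema:Book']
```

      <p>Changing the weights changes the ranking, which is exactly what the demo's tuning parameters expose to the user.</p>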
      <p>
        The problem of creating a data model for a dataset is related to schema
matching, which maps relationships and concepts. A variety of schema matching
solutions have been proposed for different types of data (e.g., [
        <xref ref-type="bibr" rid="ref1 ref4 ref5 ref7">7, 5, 4, 1</xref>
        ]) with
the purpose of integrating heterogeneous datasets within a domain. Mapping
languages are also popular to align heterogeneous data sources with the RDF
data model. The RML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an example of a mapping language that enables the
conversion of heterogeneous data formats to RDF. In contrast to these approaches,
our framework recommends data models based on existing Knowledge Graphs,
focusing not only on accuracy but also on boosting the candidates that are most
interoperable within the graph.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Framework Overview</title>
      <p>The RICDaM framework, illustrated in Figure 1, has three stages: (1) Building
Background Knowledge, (2) Candidate Generation, and (3) Candidate Ranking.</p>
      <p><bold>Building Background Knowledge.</bold> The goal of this stage is to build the background knowledge structures that will
facilitate the next stages of the framework. The first task includes creating a
single KG from multiple existing RDF datasets. The entities of the KG are
collected in a document store and indexed using an inverted index to enable full-text
search. These entities maintain their original predicates but are also connected
to external entities via the underlying ontology graph that facilitates candidate
ranking. We enrich this ontology graph by inferring missing information (e.g.,
entity types of owl:sameAs relationships) and via ontology matching.</p>
      <p>[Figure 1: Overview of the RICDaM framework. The background Knowledge Graph comprises a data layer (RDF sources, document store with pre-computed metadata) and an ontology layer (entity types, object and datatype properties). Candidate Generation takes an input dataset (CSV, JSON, RDF) and produces unranked entity type, object property, and datatype property candidates, using the document store and a Random Forest model. Candidate Ranking combines Content, Interoperability, and Consistency scores into ranked entity type, property, and data model candidates.]</p>
      <p>In a final step, we extract metadata from the document store (e.g., entity
type frequency), and we train a Random Forest model on datatype property
values of the KG to facilitate the generation of datatype property candidates.</p>
      <p><bold>Candidate Generation.</bold> This stage generates a list of entity type and property candidates for an input
dataset. For the entity types, the candidate list is obtained by searching literals
in the inverted index of the document store. Datatype properties are obtained
with the Random Forest classification model, which produces datatype
property predictions for input literal values. Object properties are inferred from the
relationships between the entity type candidates in the ontology graph.</p>
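      <p>The datatype-property prediction step relies on a trained Random Forest; the sketch below shows only the kind of lexical features a literal value might be turned into before classification. The feature names and rules are assumptions for illustration, not the framework's actual feature set.</p>

```python
import re

# Hypothetical lexical features for classifying a literal value into a
# datatype property (e.g., a year-like literal vs. a title-like string).
def literal_features(value: str) -> dict:
    return {
        "length": len(value),
        "digit_ratio": sum(c.isdigit() for c in value) / max(len(value), 1),
        "looks_like_year": bool(re.fullmatch(r"\d{4}", value)),
        "word_count": len(value.split()),
    }

# A feature vector like this would be fed to the classifier, which could
# then predict e.g. a date-valued property for "1851" and a title-like
# property for a longer text.
print(literal_features("1851"))
```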
      <p><bold>Candidate Ranking.</bold> We propose to rank candidates using a Content score and an Interoperability
score. The Content score combines metrics based on string similarity, search
result frequency, and graph distances. This score assesses the appropriateness
of a candidate to match an entity type or property from an input dataset. The
Interoperability score contains information from the document store and from
a sub-graph that restricts edges to relationships of equivalence, subsumption,
or relatedness extracted via ontology matching techniques. This score measures
how interoperable a candidate is, considering the frequency of the candidate in
the KG and the graph neighbourhood of the candidate. Higher Interoperability
scores translate into a more connected and relevant candidate. This score uses
the interoperability metric and the neighbourhood size. The neighbourhood size
represents the total number of related neighbours up to a given distance in the graph,
while the interoperability metric counts the frequency of these neighbours in the
document store. In contrast with the interoperability metric, the neighbourhood
size rewards neighbours that do not appear in the RDF data sources but are
still connected to the candidate in the ontology graph.</p>
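      <p>The two quantities behind the Interoperability score can be sketched over a toy ontology graph; the classes, edges, and document-store frequencies below are invented for illustration.</p>

```python
from collections import deque

# Toy ontology sub-graph restricted to equivalence/subsumption/relatedness edges.
ontology_graph = {
    "bibo:Document": ["schema:Book", "dcterms:BibliographicResource"],
    "schema:Book": ["schema:CreativeWork"],
    "dcterms:BibliographicResource": [],
    "schema:CreativeWork": [],
}
# How often each neighbour occurs in the document store (toy counts).
doc_store_freq = {"schema:Book": 120, "schema:CreativeWork": 300}

def neighbourhood(node, max_dist):
    """Neighbourhood size input: all neighbours within max_dist hops (BFS)."""
    seen, queue = {node}, deque([(node, 0)])
    while queue:
        current, dist = queue.popleft()
        if dist == max_dist:
            continue
        for nxt in ontology_graph.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    seen.discard(node)
    return seen

def interoperability_metric(node, max_dist):
    """Frequency of the candidate's neighbourhood in the document store."""
    return sum(doc_store_freq.get(n, 0) for n in neighbourhood(node, max_dist))

# dcterms:BibliographicResource never occurs in the RDF sources, yet it
# still contributes to the neighbourhood size, as described above.
print(len(neighbourhood("bibo:Document", 2)), interoperability_metric("bibo:Document", 2))
```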
      <p>Up to this point, candidates are ranked independently of each other. Therefore, a
final metric computes the Consistency score. This score increases the likelihood
of the same candidate being suggested for the same input entity type or property
and, at the same time, boosts triples that are more commonly encountered in the
Knowledge Graph. This consistency is achieved by rewarding triples that
appear together (co-occur) in the KG, while also guaranteeing that the
same triple elements are assigned the same entity types or properties throughout
the data model. Therefore, the Consistency score combines Content, Interoperability,
and co-occurrence frequencies into a single score that ranks candidate data model
triples in terms of their adequacy and interoperability.</p>
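      <p>A minimal sketch of the co-occurrence idea behind the Consistency score follows; the counts and the combination rule are assumptions for illustration, not the framework's actual formula.</p>

```python
# Toy co-occurrence counts of (domain, property, range) triple patterns in the KG.
cooccurrence = {
    ("bibo:Document", "dcterms:contributor", "foaf:Person"): 50,
    ("bibo:Document", "dcterms:title", "xsd:string"): 80,
}
TOTAL = sum(cooccurrence.values())

def consistency_score(triple, content, interop):
    """Reward candidate triples by their relative co-occurrence frequency
    in the KG, on top of the Content and Interoperability scores."""
    co = cooccurrence.get(triple, 0) / TOTAL
    return (content + interop) / 2 * (1 + co)

# A triple pattern never seen together in the KG gets no boost:
print(consistency_score(("ex:A", "ex:p", "ex:B"), 0.5, 0.5))  # 0.5
```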
    </sec>
    <sec id="sec-3">
      <title>Demonstration</title>
      <p>This demonstration presents the top candidate data model proposed to the user
in response to an input dataset in the library domain. The interface gives an
overview of the best ranked candidates for each triple but also allows the user
to adapt the data model to their preference and use-case.</p>
      <p>The main functionalities included in the demo are (1) an overview of the
output of the framework, (2) customisation of the data model, (3) tuning of the
parameters to produce different candidate rankings, and (4) exporting the data
model as a set of mappings between the input and the produced data model.</p>
      <p>When the user makes a manual change to the data model, they can choose
to propagate that change to maintain consistency across the dataset or keep
the change local to the modified cell. The tuning parameters allow the user
to customise the ranking of the candidates, obtaining different top data models
that can speed up the modelling process by suggesting the candidates the user
is looking for more easily. Finally, the user can export the data model produced
and apply the mappings to translate their original input data to an RDF dataset
that is potentially more interoperable with the datasets in the KG.</p>
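      <p>The "propagate or keep local" choice can be sketched as follows; the mapping data structure and function name are illustrative assumptions about the interface's behaviour.</p>

```python
def apply_change(mappings, cell, new_candidate, propagate):
    """Replace the candidate in one cell; if propagate is True, apply the
    same change to every cell mapping the same input element."""
    target_input = mappings[cell]["input"]
    for i, m in enumerate(mappings):
        if i == cell or (propagate and m["input"] == target_input):
            m["candidate"] = new_candidate
    return mappings

# Hypothetical mappings between input elements and recommended candidates.
mappings = [
    {"input": "pgterms:ebook", "candidate": "bibo:Document"},
    {"input": "pgterms:ebook", "candidate": "bibo:Document"},
    {"input": "pgterms:agent", "candidate": "foaf:Agent"},
]
# Propagating keeps the data model consistent across the dataset:
apply_change(mappings, 0, "schema:Book", propagate=True)
print([m["candidate"] for m in mappings])  # ['schema:Book', 'schema:Book', 'foaf:Agent']
```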
      <p>Table 1 presents the datasets used in the demo. The dashed line separates
datasets used to build the background KG (top) and datasets used as input
(bottom). Overall, the datasets contain a variety of records such as books, audio
records, and periodicals. For example, for pgterms:ebook in Project Gutenberg,
the suggestion is to use bibo:Document as the most interoperable, relevant, and
consistent entity type, together with properties such as dcterms:contributor.
Through the interface, this can be changed, for example, to schema:Book, which
is also frequently used in the background KG.</p>
      <p>[Table 1: Datasets used in the demo, with language and URL. Background KG sources: Bibliothèque Nationale de France (French, https://www.bnf.fr), Deutsche Nationalbibliothek, Biblioteca Nacional de Portugal (Portuguese, http://www.bnportugal.gov.pt), Biblioteca Nacional de España (Spanish, http://www.bne.es). Input datasets: Project Gutenberg (RDF, https://www.gutenberg.org), Open Library (JSON, https://openlibrary.org).]</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In general, designing a data model is not a trivial task. When considering
integration with existing datasets, the task becomes more complex. Our demo
facilitates the task of finding the best possible data model according to certain
criteria by producing a ranked list of candidates to match entity types and
properties in a dataset. In the future, a complete tool using this framework would
allow the user to search the indexed ontologies or add new ontologies to the
graph to complete the model when the recommendations fail to include the desired class.
For JSON or CSV input datasets, it would also be possible to generate an RML
mapping file to facilitate the process of translating the data to RDF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Alserafi, A., Abello, A., Romero, O., Calders, T.: Keeping the Data Lake in Form: Proximity Mining for Pre-Filtering Schema Matching. ACM Trans. Inf. Syst. 38(3), 26:1–26:30 (2020)</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. d'Aquin, M., Adamou, A., Dietze, S.: Assessing the educational linked data landscape. In: Proceedings of the 5th Annual ACM Web Science Conference, WebSci '13, pp. 43–46. Association for Computing Machinery, Paris, France (2013)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In: LDOW (2014)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Ermilov, I., Ngomo, A.-C.N.: TAIPAN: Automatic Property Mapping for Tabular Data. In: Knowledge Engineering and Knowledge Management, pp. 163–179. Springer International Publishing, Cham (2016)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow. 3(1), 1338–1347 (2010)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Park, H., Kipp, M.: Library Linked Data Models: Library Data in the Semantic Web. Cataloging &amp; Classification Quarterly 57(5), 261–277 (2019)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Pei, J., Hong, J., Bell, D.: A Novel Clustering-Based Approach to Schema Matching. In: Advances in Information Systems, LNCS, pp. 60–69. Springer, Heidelberg (2006)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Smith-Yoshimura, K.: Analysis of 2018 International Linked Data Survey for Implementers. The Code4Lib Journal (2018)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>