<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>4th International Workshop on Knowledge Graph Construction</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Preserving the Alignment of LD with Source Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alex Randles</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Declan O'Sullivan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre for Digital Content, Trinity College Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>A significant proportion of linked data (LD) is created by mapping data from a variety of sources. Linked data has been described as highly dynamic in nature, with source data being continuously changed, which could impact the quality of the linked data and related mapping artefacts. Changes which have occurred in the source data of linked data datasets should be propagated into the resulting dataset to provide an accurate representation of the underlying data sources. These changes can occur at an extremely fast rate, which can make it difficult to propagate each change in a timely manner. Surprisingly, despite the growth of linked data publication on the web of data, there exists no standard to address the dynamics of the data. An approach which captures changes in the source data used by mapping artefacts to create linked data datasets will help to address the dynamics involved in the publication process. Furthermore, capturing changes in a machine-readable format will allow software agents to automatically process them and take appropriate actions to preserve the alignment between mapping artefacts and the data sources used to create the linked data dataset. Moreover, the ability to monitor the source data and detect changes regularly will support a mechanism to automatically send notifications of changes and potential alignment issues to data producers, thereby providing the information needed to guide them in improving alignment. Evaluating an approach designed to address the dynamics of linked data is important to provide evidence of sufficient usability. This paper describes the evaluation of the Mapping Quality Improvement (MQI) Framework, which focuses on change detection of source data used to create linked data and aims to support data producers in providing timely data to consumers and in improving the quality, maintenance and reuse of related mapping artefacts. The evaluation of the MQI framework involved 55 participants with varying levels of background knowledge.</p>
      </abstract>
      <kwd-group>
        <kwd>Usability Testing</kwd>
        <kwd>Dataset Dynamics</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Mappings</kwd>
        <kwd>Data Quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Declarative uplift mapping artefacts are used to generate linked data datasets and contain rules
for converting data that is not in the Resource Description Framework (RDF) format, such as XML, CSV
or relational data, into an RDF representation [19]. Various representations of these mapping
artefacts exist, such as the RDB to RDF Mapping Language (R2RML) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is the World Wide Web
Consortium (W3C) recommendation for transforming relational data into RDF and allows
customized transformation rules to be defined. Another prominent representation is RDF
Mapping Language (RML) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which extends R2RML to allow more diverse source data formats,
such as XML, CSV and JSON. The resulting linked data datasets are highly dynamic in nature with
resources continuously being added and removed in an attempt to improve data quality by
updating resources and respective vocabularies as they evolve [23]. The dynamics of
a linked data dataset are often measured by the 'freshness' quality dimension, which relates to the age
and occurrences of changes in data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and has been described as one of the most important
aspects of linked data quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Such 'freshness' is crucial to underpin machine learning
processes enabling an Internet of Things and People [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Interestingly, the issue of detecting and
propagating changes in linked data has been discussed for over a decade; however, no de facto
or standards-based approach exists to tackle the problem [22].
4th International Workshop on Knowledge Graph Construction, May 28, 2023, Crete, Greece
alex.randles@adaptcentre.ie (A. Randles); declan.osullivan@adaptcentre.ie (D. O’Sullivan)
Existing approaches
[
        <xref ref-type="bibr" rid="ref12 ref13">12,13,21</xref>
        ] in the state of the art predominantly propose methods to address the dynamics of
resources and interlinks in linked data datasets; however, one approach [24] exists which targets
the dynamics of the source data of linked data and focuses on relational data and R2RML
mappings. In this paper, an approach is proposed to capture change information for source data in heterogeneous
formats used to create linked data datasets and to allow these changes to be propagated into the
resulting data, with the aim of supporting an increase in linked data dataset freshness [20]. In
addition, a notification policy approach is included, which enables data producers to be informed
of changes in a timely manner. A usability evaluation has been conducted on the proposed
approach in an attempt to validate the design with end users [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In addition, usability testing
provides an opportunity to support collaboration between domain experts and computer
scientists when developing tools and processes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Characterizing end users based
on their background knowledge allows the level of knowledge needed to use the tool effectively to be
determined [19].
      </p>
      <p>In this paper we discuss the design and evaluation of the second iteration of the MQI
Framework [15,19,20]. The first iteration of the framework included a component designed to
assess and refine the quality of R2RML mappings. The component uses the Mapping Quality
Improvement Ontology (MQIO)<sup>2</sup> [16,17] to represent captured mapping quality information in
RDF format. The second iteration of the framework retains the original functionality and
adds a component for change detection of source data represented in
heterogeneous formats. In addition, the component detects links between detected changes and
respective mappings. Changes which are captured by this component are represented in the
Ontology for Source Change Detection (OSCD)<sup>3</sup> [20]. We also describe an extension to the
functionality of the framework to provide suggestions to agents on how to improve alignment.
The objective of the framework is to improve the quality of mappings, while preserving alignment
with underlying data sources, with the aim of providing fresh data to consumers. The remainder
of this paper is structured as follows: Section 1 discusses the design of the MQI framework, including
the utilization of OSCD. Section 2 outlines additional functionality integrated into the framework
as a result of the evaluation. Section 3 describes the evaluation setup and results. Section 4
discusses related work in the state of the art. Section 5 outlines future work and concludes the
paper.</p>
    </sec>
    </sec>
    <sec id="sec-2">
      <title>1. Assessing LD Alignment</title>
      <p>
        The first iteration of the MQI framework [18,19] included a component to assess and refine the
quality of R2RML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] mappings involved in the generation of linked data datasets. The second
iteration of MQI extends the original functionality to add a component to detect source data
changes and link them with respective mappings. In addition, support for RML [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] mapping
artefacts is included, allowing source data represented in heterogeneous formats to be used as
input to the framework. The detected changes are represented according to OSCD [20], which
was previously developed by the authors of this paper in order to model information related to
source data changes. The framework uses the ontology to represent and interchange
information related to changes detected in source data. The ontology was designed to be format
independent and can represent detected changes in any source data format supported by R2RML and RML, such as XML,
CSV, JSON and relational data. In addition, OSCD enables notification policies to be
defined using the Rei policy ontology [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which provide mapping engineers with timely
information on detected changes and their associated potential alignment issues in the resulting
linked data dataset. Representing changes in OSCD allows them to be linked with the mapping
artefact itself and with associated quality reports, such as those represented in MQIO [16,17]
by the mapping quality assessment and improvement component of the framework. Figure
1 presents a diagram of the source data change detection component of the MQI framework.
2 MQIO specification at https://w3id.org/MQIO
3 OSCD specification at https://w3id.org/OSCD
      </p>
      <sec id="sec-2-1">
        <title>The process is outlined below.</title>
        <p>• Input: Two versions of the source data and the respective mappings, represented in RML or
R2RML, are input into the framework. Often, the mappings will have been
previously uploaded to the mapping quality assessment and refinement component of
the MQI framework. In addition, notification details can be input in order to create a
policy which defines when users will be notified of detected changes in the source data.
• Change Detection: Changes are detected between the versions using existing
methods, such as file comparison. Thereafter, the detected changes (and the
notification details input into the framework) are uplifted into RDF format, resulting in two
named graphs.
• Analyze Changes: The detected changes are linked with the input mapping
artefacts in order to identify changes which could impact them, for example a data
reference in a mapping that no longer exists in the current source data.
• Output: The resulting two named graphs detail the detected source data changes and the
notification policy. Changes are detected periodically until the notification policy
becomes invalid through fulfillment of a change threshold or the passing of an end date.</p>
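        <p>The change detection step outlined above can be illustrated with a minimal sketch. It assumes CSV source data with an identifier column and uses only the Python standard library as a stand-in for the file comparison libraries used by the framework; the function names and sample values are hypothetical and are not taken from the MQI implementation.</p>
        <preformat><![CDATA[
```python
import csv
import io


def read_table(text):
    """Parse CSV text into (column names, rows as dicts)."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def detect_changes(old_text, new_text, key="ID"):
    """Compare two versions of a CSV file at column and row level.

    Rows are matched on the `key` column, which is an assumption of this
    sketch rather than a documented behaviour of the framework.
    """
    old_cols, old_rows = read_table(old_text)
    new_cols, new_rows = read_table(new_text)
    old_keys = {row[key] for row in old_rows}
    new_keys = {row[key] for row in new_rows}
    return {
        "inserted_columns": [c for c in new_cols if c not in old_cols],
        "deleted_columns": [c for c in old_cols if c not in new_cols],
        "inserted_rows": sorted(new_keys - old_keys),
        "deleted_rows": sorted(old_keys - new_keys),
    }


# Made-up sample versions, used only to exercise the sketch.
OLD = "ID,Name\n10,Venus\n"
NEW = "ID,FirstName,LastName\n10,Venus,Williams\n11,Serena,Williams\n"
changes = detect_changes(OLD, NEW)
print(changes)
```
]]></preformat>
        <p>In the framework itself, the detected changes would then be uplifted into RDF according to OSCD; the dictionary returned here only stands in for that intermediate result.</p>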
        <p>The MQI framework is implemented using the following technologies.</p>
        <p>
          • Several Python libraries are used in the implementation. The Flask library [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] was
used to create a web application with a Graphical User Interface (GUI). The RDFLib
library [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] enables execution of SPARQL queries and is used to query and update RDF
data. A number of file comparison methods are used: XMLDiff<sup>4</sup> is used to compare XML
files, CSVDiff<sup>5</sup> is used to compare CSV files, and the MySQL<sup>6</sup> library is used to compare
relational data.
• SPARQL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is used to link graphs containing detected changes and associated
notification policy, with respective mapping artefacts.
• R2RML [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is used to uplift information captured by the MQI framework, that is
detected changes and notification details, into RDF format.
        </p>
        <p>Detecting changes using the implementation involves the following steps.
1. The versions of source data input into the GUI are compared using one of the
aforementioned methods and the result is stored.
2. The results are uplifted into RDF using an R2RML mapping expressed according to
OSCD (see next section).
3. Input notification details are uplifted using an R2RML mapping expressed according to
the Rei policy ontology [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
4. SPARQL queries are used to retrieve the information needed to provide an
overview of the detected changes to users and to link changes<sup>7</sup> with respective mappings in
order to identify potential alignment issues.
4 https://pypi.org/project/xml-diff/
5 https://pypi.org/project/csv-diff/
6 https://pypi.org/project/mysql-connector-python/</p>
        <p>
The MusicBrainz project<sup>8</sup> maintains an online music encyclopedia, which contains music
metadata, such as artists, labels, recordings and releases. The project has created 12 R2RML [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
mappings designed to uplift information in the encyclopedia into a linked data representation;
one of them is designed to transform the releases of artists. The source data of this mapping<sup>9</sup> is a table
(“releases”) in a relational database, and the mapping contains two term maps to map the ID (“gid”) and name
(“name”) of the release. For instance, 5 releases have been added and detected by the framework;
these should be propagated into the resulting dataset in order to preserve the freshness of the
data [20]. The screenshot shows the names of the releases which have been added to the source
data. It is difficult to determine when the mapping should be regenerated, as releases could be
added frequently or infrequently. Therefore, a notification policy should be defined to ensure
timely updates of relevant information, which provides an indication of when the mapping should
be regenerated in order to capture new releases.
        </p>
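        <p>The role of the notification policy can be sketched as follows. The sketch assumes a policy defined by a change threshold and an end date, as described for the output of the change detection process; the class, its field names and the inclusive treatment of the end date are hypothetical choices for illustration and do not reproduce the Rei-based representation used by the framework.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import date


@dataclass
class NotificationPolicy:
    """Hypothetical stand-in for a policy built from the notification details."""

    change_threshold: int  # notify once this many changes have been detected
    end_date: date         # last day on which the source data is monitored

    def is_valid(self, change_count: int, today: date) -> bool:
        """The policy stays valid until the threshold is met or the end date has passed."""
        return change_count < self.change_threshold and today <= self.end_date

    def should_notify(self, change_count: int) -> bool:
        """A notification is due once the number of changes reaches the threshold."""
        return change_count >= self.change_threshold


# Example values chosen for illustration only.
policy = NotificationPolicy(change_threshold=5, end_date=date(2023, 12, 31))
print(policy.is_valid(3, date(2023, 6, 1)))  # still monitoring
print(policy.should_notify(5))               # threshold reached
```
]]></preformat>
        <p>With a threshold of 5, the five added releases in the MusicBrainz example would be enough to trigger a notification under these assumptions.</p>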
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Improving LD Alignment</title>
      <p>
        Additional functionality has been added to the framework since the conclusion of the evaluation
described in this paper. The functionality is designed to automatically suggest actions which
could be executed to improve the level of alignment between mappings and respective source
data. In addition, Shapes Constraint Language (SHACL) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] constraints are proposed to assess the
level of alignment.
7 https://raw.githubusercontent.com/alex-randles/KGCW-2023-Supplementary/main/linking_query.rq
8 https://musicbrainz.org/
9 https://raw.githubusercontent.com/metabrainz/MusicBrainz-R2RML/master/mappings/release.ttl
      </p>
      <sec id="sec-3-1">
        <title>2.1. Alignment of Source Changes with Mapping</title>
        <p>The sample source data (“people.csv”)<sup>10</sup> has had a column referenced in a related mapping<sup>11</sup>
removed; therefore, the mapping is no longer compatible with the current source data. The
removed column (“Address”) contained data on the location of people; however, a column
(“Postcode”) containing their postcodes (“Change Count: 4”) has been inserted. The framework
compares the names of the previous columns with the names of the current columns to identify indications that
they are related. The comparison is completed using WordNet Similarity<sup>12</sup>, which is software
designed to measure semantic similarity between a pair of concepts. The similarity score for
“Address” and “Postcode” is 52% (0.52), which indicates that they are related. The framework will
provide a suggestion in this case as the score is above the threshold (&gt; 0.25). Thereafter, the
framework will automatically update the mapping by executing a SPARQL query<sup>13</sup> if the
suggestion is accepted by the user.</p>
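        <p>The suggestion logic described above can be sketched as a threshold test over candidate columns. The sketch below makes the similarity measure pluggable; its default scorer is a plain character-overlap ratio from the Python standard library, which is not the WordNet measure used by the framework and is included only so the example runs without external resources. All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
from difflib import SequenceMatcher

THRESHOLD = 0.25  # threshold stated in the text


def string_similarity(a, b):
    """Stand-in scorer based on character overlap, NOT a WordNet measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest_replacement(removed_column, inserted_columns, scorer=string_similarity):
    """Return (candidate, score) for the best match above THRESHOLD, else None."""
    scored = [(scorer(removed_column, c), c) for c in inserted_columns]
    if not scored:
        return None
    score, best = max(scored)
    return (best, score) if score > THRESHOLD else None


def reported_score(a, b):
    """Fixed scorer that reproduces the 52% score reported for this pair."""
    return 0.52 if {a, b} == {"Address", "Postcode"} else 0.0


suggestion = suggest_replacement("Address", ["Postcode"], scorer=reported_score)
print(suggestion)
```
]]></preformat>
        <p>A semantic measure could be passed in through the scorer argument; the threshold test itself would stay the same.</p>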
      </sec>
      <sec id="sec-3-2">
        <title>2.2. SHACL Shapes</title>
        <p>
          SHACL [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a W3C recommendation designed to validate the quality of RDF graphs, which can
be applied to mappings represented in RDF format, such as R2RML [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and RML [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Shapes refer
to constraints defined using the properties and classes in the SHACL vocabulary. The functionality
to generate SHACL shapes from the original source data has been integrated into the MQI
framework. The shapes can be applied to mappings at any point during their evolution in order
to easily allow the identification of alignment issues with underlying data sources. The
framework generates a shape which validates if each data reference in the mapping is in the
source data. Table 1 presents the pseudocode used to generate the shapes and the resulting
shape for the RML mapping used in the evaluation.
10 https://raw.githubusercontent.com/alex-randles/KGCW-2023-Supplementary/main/people.csv
11 https://raw.githubusercontent.com/alex-randles/KGCW-2023-Supplementary/main/sample_mapping.ttl
12 https://www.nltk.org/howto/wordnet.html
13 https://raw.githubusercontent.com/alex-randles/KGCW-2023-Supplementary/main/update_query.rq
        </p>
        <p>Pseudocode (A) outlines the process involved in generating a SHACL shape
(B) from the RML mapping. The same process can be applied to R2RML mappings and involves
adding each attribute (e.g. column, element, row) name in the original source data into a SHACL
list (sh:in), which can be used to validate that the attribute exists in the current source data. In
this case, the sample mapping is no longer compatible with the source data as the “Address”
column has been replaced by “Postcode”, so the mapping should be updated accordingly. The following
SHACL validation report<sup>14</sup> (Listing 1) will be generated when the shape shown is executed on the
sample mapping.</p>
        <preformat>[ a sh:ValidationReport ;
  sh:conforms false ;
  sh:result [
    a sh:ValidationResult ;
    sh:resultSeverity sh:Violation ;
    sh:sourceConstraintComponent sh:InConstraintComponent ;
    sh:sourceShape _:n839 ;
    sh:focusNode _:n959 ;
    sh:value "Address" ;
    sh:resultPath rml:reference ;
    sh:resultMessage "Data reference no longer in source data" ;
  ]
] .</preformat>
        <p>Listing 1: SHACL validation report generated when sample shape executed</p>
        <p>
          The SHACL validation report (sh:ValidationReport) is expressed in the SHACL
validation report vocabulary<sup>15</sup>. The report shown includes 1 violation
(sh:ValidationResult), which indicates that a column (sh:value) is no longer present in
the source data of the mapping (sh:resultMessage). The validation report is machine-readable and
queryable by SPARQL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], as it is represented in RDF format, which can be used to automatically
update the mapping in order to preserve alignment.
        </p>
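        <p>The idea of collecting attribute names into a SHACL list can be sketched as a small generator of Turtle text. This is an illustrative reconstruction, not the pseudocode of Table 1: the shape name, the ex: namespace, the use of sh:targetSubjectsOf and the choice of which column names to pass in are all assumptions of the sketch. Only constructs from the SHACL vocabulary that appear in Listing 1 or the text (sh:in, rml:reference and the message string) are relied upon.</p>
        <preformat><![CDATA[
```python
def build_reference_shape(column_names):
    """Return Turtle text for a shape that restricts rml:reference to known columns."""
    # Escape backslashes and quotes so that each name is a valid Turtle string literal.
    literals = " ".join(
        '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for name in column_names
    )
    return f"""@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ex: <http://example.org/shapes#> .

ex:ReferenceShape a sh:NodeShape ;
    sh:targetSubjectsOf rml:reference ;
    sh:property [
        sh:path rml:reference ;
        sh:in ( {literals} ) ;
        sh:message "Data reference no longer in source data" ;
    ] .
"""


# Hypothetical column names for the sample file after the change.
shape = build_reference_shape(["Name", "Postcode"])
print(shape)
```
]]></preformat>
        <p>Validating a mapping that still references “Address” against a shape like this one would be expected to produce a violation of the kind shown in Listing 1, since “Address” is not a member of the sh:in list.</p>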
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Evaluation</title>
      <p>The following section describes the user evaluation conducted on the change detection
component of the MQI framework. Firstly, the methodology and metrics used in the study are
discussed. Thereafter, the results of the study are described. Finally, a discussion on the
hypotheses is presented. The hypotheses related to this study were:
• H1) The framework facilitates the identification of changes in source data and
relationships with respective mappings;
• H2) The participants' background knowledge influences the successful completion of the
tasks.</p>
      <p>The hypotheses were defined to allow measurement of the level of knowledge required to
successfully interact with the framework.
14 https://raw.githubusercontent.com/alex-randles/KGCW-2023-Supplementary/main/shacl_report.ttl
15 https://www.w3.org/TR/shacl/#dfn-validation-report-vocabulary</p>
      <sec id="sec-4-1">
        <title>3.1. Methodology</title>
        <p>A user evaluation was conducted to test the hypotheses related to this study. The participants
were grouped into two cohorts, a student cohort and an expert cohort, depending on their level of background
knowledge. The participants were provided with sample source data and a related mapping,
which allowed them to interact with the framework in order to identify source data changes
and links with mappings. Hypothesis H1 was tested by analyzing the results of each cohort for
the Understanding Questionnaire and the Post-Study System Usability Questionnaire (PSSUQ).
Hypothesis H2 was tested by comparing the results of these questionnaires for both cohorts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Metrics</title>
        <p>The metrics used for the evaluation included the scores and comments of three questionnaires
and qualitative data analysis from the data captured in the questionnaires.</p>
        <p>
          Post-Study System Usability Questionnaire (PSSUQ). The PSSUQ [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is a standardized
questionnaire which measures the satisfaction provided by a piece of software to users. The
questionnaire was developed by IBM and has undergone extensive psychometric evaluation,
unlike similar questionnaires such as the System Usability Scale (SUS)<sup>16</sup>. The questionnaire
consists of 19 positive statements related to satisfaction with the software, each scored on a Likert
scale from 1 (best case) to 7 (worst case). In addition, an open comment section accompanies
each question. Four metrics are measured by the PSSUQ: system usefulness
(SysUse), information quality (InfoQual), interface quality (IntQual) and Overall.
        </p>
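        <p>The four PSSUQ metrics can be computed as averages over groups of items. The sketch below assumes the item grouping commonly reported for the 19-item version of the questionnaire (items 1-8, 9-15, 16-18 and 1-19); the paper does not state the grouping, so it should be read as an assumption, and the sample responses are made up.</p>
        <preformat><![CDATA[
```python
from statistics import mean

# Item ranges assumed for the 19-item PSSUQ; not stated in this paper.
SUBSCALES = {
    "SysUse": range(1, 9),     # items 1-8
    "InfoQual": range(9, 16),  # items 9-15
    "IntQual": range(16, 19),  # items 16-18
    "Overall": range(1, 20),   # items 1-19
}


def pssuq_scores(responses):
    """Average one participant's ratings per metric.

    `responses` maps an item number (1-19) to a rating from 1 (best case)
    to 7 (worst case); unanswered items are left out of the mapping.
    """
    scores = {}
    for name, items in SUBSCALES.items():
        answered = [responses[i] for i in items if i in responses]
        scores[name] = mean(answered) if answered else None
    return scores


# Made-up responses: every item rated 2, except item 9 rated 4.
example = {i: 2 for i in range(1, 20)}
example[9] = 4
scores = pssuq_scores(example)
print(scores)
```
]]></preformat>
        <p>Because lower ratings are better on this scale, a metric improves as its average decreases.</p>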
        <p>Understanding Questionnaire. A questionnaire<sup>17</sup> (Table 2) was created to test whether
participants could understand the change detection information provided by the MQI framework.
The questionnaire included two sections which related to the change detection processes
(Section 1) and changes which have been detected in the source data and links with respective
mappings (Section 2).</p>
        <p>The questions in Section 1 (S1) were designed to request information related to the total
number of changes detected in the source data (S1.Q1), notification policy details (S1.Q3-5), and
related mapping details (S1.Q6). The questions in Section 2 (S2) were designed to request
information related to types of changes detected (S2.Q1) and other details about them, such as
location and number of values changed (S2.Q2-6).</p>
        <p>Ontology Application Questionnaire. In addition, a questionnaire was created to ask for
feedback from participants in the expert cohort on the application of OSCD in the graph used
during the experiment (Table 3). The student cohort were not asked for feedback on the
application as they have limited ontology design knowledge.
16 https://www.usability.gov/how-to-and-tools/methods/system-usability-scale.html
17 Complete questionnaire available at https://forms.gle/oLCRyXZQQsEmfMis6</p>
        <p>It was hoped the questionnaire would allow feedback to be gathered related to the developed
ontology (Q1) and the application of OSCD (Q2) within the graph used in the experiment. In
addition, the open comment question (Q3) allowed additional diverse feedback to be gathered.</p>
        <p>
          Thematic Analysis. Thematic analysis [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is a qualitative data analysis method used to
identify patterns. The method involves deriving themes from data which represent discovered
patterns. The themes consist of codes which relate to a specific area within the design of a
software tool. The analysis was completed on the qualitative data collected in the open comment
sections of the PSSUQ. The process involved the following six steps: 1)
familiarization with the data; 2) generation of initial codes; 3) searching for themes; 4)
reviewing themes; 5) defining and naming themes; and 6) producing the report. The themes and
codes were iteratively refined during the analysis.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Experiment Setup</title>
        <p>The following section discusses the participants involved in the evaluation and tasks which they
were asked to complete.</p>
        <p>Sample Size. Participants in the student cohort have little experience in creating and
operating the mappings involved in creating linked data datasets; however, they have a basic knowledge of semantic web technologies, such
as RDF and R2RML. Participants in the expert cohort are researchers who are very
knowledgeable about RDF and related mapping languages. These participants have experience in
creating and operating mappings in a research environment. Initially, 48 students were recruited,
which was reduced to 45 participants after the inclusion/exclusion criteria were applied. The expert
cohort consisted of 10 participants.</p>
        <p>Tasks. The tasks<sup>18</sup> to be undertaken by participants were designed to evaluate the main
characteristics of the source data change detection component of the MQI framework. Tasks 1-2
involved the quality assessment of the mapping related to the source data. Tasks 3-7 involved
initiation of the change detection process on the source data. Task 8 involved the examination of
an overview of the change detection processes. Tasks 9 and 10 applied only to the expert cohort;
they were designed to gather expert feedback on the application of OSCD within the
graphs generated. As previously stated, the participants in the student cohort were not asked for
this feedback as their knowledge of ontology design and application is limited. Tasks 11-12 involved
the examination of detected links between changes in the source data and the mapping. Task 13
involved the completion of the questionnaires which measured perceived satisfaction and
understanding.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Experiment Data</title>
        <p>
          The data provided to the participants consisted of three items: 1) RML [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] mapping; 2) Original
source data; and 3) Changed source data. The original source data and mapping were retrieved
from the RML test case files<sup>19</sup>. The data contains information about famous sports personalities,
such as their names, a unique identifier (ID), associated sport and place of birth. The changed
source data was derived from the test cases and additional similar changes were added by the
lead author of this paper. Listing 2 presents the two versions of the source data used during the
experiment.
18 https://drive.google.com/file/d/14Rjmi-xVXJ8GJZAdC9uNHxlOP1r8fQa7
19 https://rml.io/test-cases/
        </p>
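        <p>Original source data columns: ID, Name</p>
        <p>Changed source data columns: ID, FirstName, LastName, Sport, City</p>
        <p>Listing 2: Column headers of the two versions of the source data</p>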
        <p>The RML mapping<sup>20</sup> used during the experiment was designed to uplift the information in the
original version of the source data. The name of the “ID” column referenced in the mapping is
unchanged between the versions of the source data; however, 3 additional values have been added
to the column. New columns, “Sport” and “City”, have been added with additional data. Other
changes between the versions have resulted in the mapping becoming incompatible with the
current version of the source data, as the “Name” column has been split into “FirstName” and
“LastName”. Therefore, the alignment between the mapping and source data should
be improved to prevent a decrease in quality [19]. The graph generated by the MQI framework,
which contains the changes detected during the experiment expressed in OSCD, is available<sup>21</sup>.</p>
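        <p>The incompatibility described above can be made concrete with a short sketch that lists the mapping references which no longer resolve against the changed column headers. The column names come from the experiment data; the template string and function names are hypothetical and only illustrate how references to “ID” and “Name” might appear in a mapping.</p>
        <preformat><![CDATA[
```python
import re


def template_references(template):
    """Extract the column names used in {...} placeholders of a template string."""
    return re.findall(r"\{([^{}]+)\}", template)


def impacted_references(references, current_columns):
    """Return the references that are absent from the current source data."""
    current = set(current_columns)
    return sorted({ref for ref in references if ref not in current})


# Hypothetical template plus one direct reference to the "Name" column.
references = template_references("http://example.com/{ID}/{Name}") + ["Name"]
current_columns = ["ID", "FirstName", "LastName", "Sport", "City"]
impacted = impacted_references(references, current_columns)
print(impacted)
```
]]></preformat>
        <p>Under these assumptions, only “Name” is reported, which matches the split of that column into “FirstName” and “LastName”, while the “ID” reference remains aligned.</p>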
      </sec>
      <sec id="sec-4-5">
        <title>3.5. Experiment Execution</title>
        <p>Participants in both cohorts were informed that assistance was available via email, and contact
details were provided to them.</p>
        <p>Completion of Experiment. The participants in both cohorts completed the experiment in the
same way, apart from the questionnaire, which included 1 additional section for the
expert cohort, as outlined in Section 3.2. First, they were provided with a document<sup>22</sup>,
which contained the following: 1) MQI framework details; 2) experiment details;
and 3) a task sheet. Thereafter, they accessed the framework using the provided details and
completed the tasks, including the questionnaire.</p>
        <p>Experiment Assistance. None of the participants in either cohort required assistance to
complete the tasks involved in the experiment.</p>
      </sec>
      <sec id="sec-4-6">
        <title>3.6. Experiment Results: Student Cohort</title>
        <p>The results of the student cohort consisted of the scores of the PSSUQ and results from the
understanding questionnaire.</p>
      </sec>
      <sec id="sec-4-7">
        <title>3.6.1. PSSUQ Results</title>
        <p>20 https://raw.githubusercontent.com/kg-construct/rml-test-cases/master/test-cases/RMLTC0002aCSV/mapping.ttl
21 https://raw.githubusercontent.com/alex-randles/KGCW-2023-Supplementary/main/evaluation_graph.ttl
22 https://drive.google.com/file/d/1pemmMIuW3cLeuSsAnqAKkMLWaeMQ9aqc
23 https://drive.google.com/file/d/1llIW23lI3Y25ChxQsVttsT4oRmBWTrSB</p>
        <p>
          The metric scores were compared with acceptable thresholds found in research [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Each
metric scored better than its threshold by at least 20%: system usefulness
(63%), information quality (46%), interface quality (21%) and overall (47%). The third quartile is
the value below which 75% of the data points fall. Most questions (18 out of
19) had a third quartile of 3 or less. Only one question (Q9), which related to error messages, had a third quartile of 4;
however, this question is commonly noted as an outlier because no error messages are
shown during most experiments [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The overall results of the PSSUQ indicated sufficient
satisfaction, with the metrics and questions scoring better than their respective thresholds.
        </p>
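        <p>The third quartile used in the analysis above can be computed with the Python standard library. The ratings in this sketch are made up, and the choice of the inclusive interpolation method is an assumption; the paper does not state which quartile method was applied.</p>
        <preformat><![CDATA[
```python
from statistics import quantiles


def third_quartile(ratings):
    """Return the third quartile (Q3) of a list of Likert ratings."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
    return quantiles(ratings, n=4, method="inclusive")[2]


# Made-up ratings for a single question on the 1 (best) to 7 (worst) scale.
ratings = [1, 1, 2, 2, 2, 3, 3, 4, 5]
q3 = third_quartile(ratings)
print(q3, q3 <= 3)
```
]]></preformat>
        <p>A question meets the criterion reported above when its third quartile is 3 or less, meaning that roughly three quarters of the ratings fall at or below 3.</p>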
      </sec>
      <sec id="sec-4-8">
        <title>3.6.2. Understanding Questionnaire Results</title>
        <p>Table 4 presents the scores of the understanding questionnaire for the student cohort. The mean
score (μ) and standard deviation (σ) for each section of the understanding questions are shown.</p>
        <p>Most (9 out of 12) questions had a mean score of at least 80% correct, which indicates
sufficient overall understanding by participants of the information presented to them about
the changes detected. In addition, the low standard deviation of both sections (&lt; 0.35) indicates that
the scores are clustered around the mean. However, the two worst-scoring questions (2 out
of 12) scored below 60%; they related to information presented on the number of mappings impacted
by the changes detected and their related thresholds. This information will require further clarification,
involving the addition of textual descriptions to the interface.</p>
      </sec>
      <sec id="sec-4-9">
        <title>3.7. Experiment Results: Expert Cohort</title>
        <p>The results of the expert cohort consisted of the PSSUQ scores, scores of the understanding
questionnaire and feedback on the application of OSCD in the graph used during the experiment.</p>
      </sec>
      <sec id="sec-4-10">
        <title>3.7.1. PSSUQ Results</title>
        <p>
          The metric scores were compared with acceptable thresholds found in research [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Each
metric scored better than its threshold by at least 10%, which included system usefulness
(56%), information quality (27%), interface quality (14%) and overall (38%). Most questions
(18 out of 19) had a third quartile of 3 or less. As with the results of the student cohort, only
one question (Q9), which related to error messages, had a third quartile of 4. The overall results
of the PSSUQ indicate sufficient satisfaction, with the metrics and questions scoring better than
their respective thresholds.
        </p>
      </sec>
      <sec id="sec-4-11">
        <title>3.7.2. Understanding Questionnaire Results</title>
        <p>Table 5 presents the scores of the understanding questionnaire for the expert cohort.</p>
        <p>Most (11 out of 12) questions had a mean score of at least 80% correct, which indicated an
overall sufficient understanding by participants of the information presented to them by the
MQI framework. In addition, the low standard deviation of both sections (&lt; 0.25) indicated that
the scores were clustered around the mean. The worst-scoring question (1 out of
12) had 70% correct and related to a change description provided by the framework. This lower
score could be a result of tool-tips being incompatible with the browsers of certain
participants.</p>
      </sec>
      <sec id="sec-4-12">
        <title>3.7.3. Results related to Application of OSCD</title>
        <p>Each comment received from the expert cohort through the OSCD application questionnaire was
reviewed to identify whether the expert had indicated a recommendation related to the ontology.
Thereafter, the lead researcher considered whether the recommendation should
be addressed. An extract of the recommendations received from experts and how they were addressed
by the lead researcher is presented in Table 6. One such comment was:
“provenance data related to who made the changes but
it might be difficult to find that info in the ontology
metadata. Also the time period the change has been
made (after how long the change was made). But these
are only minor things and only some suggestions to
consider.”</p>
        <p>The recommendations from experts resulted in the addition of two properties in OSCD. In
addition, the other feedback affirmed sufficient applicability by providing comments such as “It
seems to be well presented.”, “Seems clear to me” and “Seems like a useful tool”.</p>
      </sec>
      <sec id="sec-4-13">
        <title>3.8. Thematic Analysis</title>
        <p>Thematic analysis was conducted in order to identify patterns in the qualitative data of both
cohorts following the six-step process outlined in Section 3.2. The themes and codes<sup>24</sup> were
created using a “bottom-up” approach, which involved defining them as they emerged from the
data. The final report was produced using Taguette [14], a qualitative data tagging
framework that presents the references for each code in the data. The themes and associated
codes defined as a result of thematic analysis are presented in Table 7. The defined themes and
codes were designed to group discovered negative and positive patterns. For instance, “Positive
GUI Requirements” indicated patterns related to sufficient GUI requirements, such as aesthetic
interface and clear layout, while “Negative GUI Requirements” indicated patterns related to
insufficient GUI requirements. Therefore, the frequency of codes in the themes can be used to
identify limitations of the usability of the MQI framework.
24 Code descriptions at https://drive.google.com/file/d/1G7pIyl2QxdhsaaL49iMQiB1F-SzHW_PS</p>
        <p>The results of the thematic analysis indicated overall positive usability during the experiment
with nearly 80% of codes related to positive themes, which included “User friendly”, “Positive
user experience”, “Positive GUI Requirements” and “Useful”. The most common negative codes
were in the “Negative GUI requirements” theme and mainly related to the number of tabs which
the framework opened during the experiment, which resulted in limited navigation.</p>
      </sec>
      <sec id="sec-4-14">
        <title>3.9. Hypotheses</title>
        <sec id="sec-4-14-1">
          <title>Based on the experiment undertaken, the hypotheses are examined below.</title>
          <p>Hypothesis H1: The framework facilitates the identification of changes in source data and
relationships with respective mappings. Based on an analysis of the experiment results gathered
for both cohorts, it is reasonable to assert that Hypothesis H1 is supported. The PSSUQ scores
indicated that the usability provided by the framework was sufficient for completing the tasks,
with both cohorts scoring better than the acceptable thresholds by at least 14%. The understanding
questionnaire, which provides evidence that the links between source data changes and
respective mappings were understood, scored highly in both sections for both cohorts.
Across both cohorts, section 1 averaged 87% correct and section 2 averaged 89% correct, giving
an average of 88% correct over both sections. The results indicated that participants with
varying levels of knowledge were able to understand information related to changes in the source
data of respective mappings. In addition, the most common themes (Positive user experience,
Positive GUI requirements, User friendly) discovered by thematic analysis identified patterns
related to positive overall usability.</p>
          <p>Hypothesis H2: The participants' background knowledge influences the successful
completion of the tasks. Based on an analysis of the experiment results gathered for both cohorts,
it is reasonable to assert that Hypothesis H2 is not supported. Satisfaction with the usability,
which was measured through the PSSUQ, was similar for participants in both cohorts.
The PSSUQ scores of both cohorts were better than the acceptable
thresholds found in research by similar margins, with a mean of 44% for students and 34% for experts.
Furthermore, the results of the understanding questionnaire indicated that participants in both
cohorts similarly understood the information provided by the framework. The scores of the
understanding questionnaire were similar, with a difference of 6% between their mean scores. In
addition, the small difference of 0.04 between the standard deviations indicated that the scores
of both cohorts were similarly clustered around their means. Moreover, no participant in either cohort
required assistance in order to complete the experiment. Therefore, it can be concluded that
participants with limited knowledge of semantic web technologies can successfully interact with
the framework to complete the tasks.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Related Work</title>
      <p>A comparative study [22] has been conducted which discusses existing approaches to detect,
propagate and describe changes in resources and interlinks of linked data datasets. The study
compares the approaches based on requirements derived from community use cases, related to
aspects such as discovery, granularity level, change modelling and notification mechanisms. The
survey provided inspiration for the development of certain aspects of the MQI framework, such
as the change monitoring and notification mechanism.</p>
      <p>
        The most similar approach [24] proposes a framework for supporting alignment between
relational databases and RDF views. The approach focuses on R2RML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] mappings, which are
designed to transform relational data. Changesets are computed by the framework and contain
information used to detect differences between two versions of datasets. The changesets are
automatically computed using mappings, which transform instance data from a relational
database into a target ontology. The formalism has been described as a simpler language than
R2RML. Unlike the MQI framework, the approach has been designed specifically for relational
data and does not provide support for heterogeneous formats and respective RML [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] mappings.
However, the work provided insights into the requirements for the MQI framework.
      </p>
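      <p>The general idea of a changeset between two versions of a dataset can be illustrated with a minimal Python sketch. It is a plain set difference over triples and is not the mapping-based computation proposed in [24]; the function name and the example triples are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
Triple = tuple[str, str, str]


def compute_changeset(old: set[Triple], new: set[Triple]) -> dict[str, set[Triple]]:
    """Return the triples added to and removed from a dataset between versions."""
    return {"added": new - old, "removed": old - new}


# Hypothetical triples for two versions of the same RDF view.
v1 = {("ex:alice", "ex:worksFor", "ex:tcd"), ("ex:alice", "ex:name", "Alice")}
v2 = {("ex:alice", "ex:worksFor", "ex:adapt"), ("ex:alice", "ex:name", "Alice")}

changeset = compute_changeset(v1, v2)
print(changeset["added"])
print(changeset["removed"])
```
]]></preformat>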
      <p>
        DSNotify (DataSet Notify) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is an approach designed to detect changes in linked data
datasets. The changes detected include the creation, removal, movement and update of resources in the dataset.
The framework detects changes using a monitoring component, which periodically executes a
SPARQL query on the dataset and allows specific instance types to be targeted. A feature vector
is created for each triple in the data retrieved from the query, which can be used later for
detecting change events, by comparing these vectors. The triples in the datasets are modeled
using the DSNotify EventSet vocabulary, which was created by the researchers specifically for the
use case. The modelling of resource changes in a machine-readable format provided inspiration
for the development of OSCD [20], which models source data changes instead of resources.
      </p>
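      <p>Comparing feature vectors between two monitoring runs can be sketched as follows. This is a simplified illustration and not DSNotify's implementation: it assumes one feature vector per resource (a frozen set of property-value pairs), uses exact equality where a real system would likely use a similarity measure, and all names and data are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Optional

Event = tuple[str, Optional[str], Optional[str]]


def detect_events(old: dict[str, frozenset], new: dict[str, frozenset]) -> list[Event]:
    """Compare two snapshots (IRI -> feature vector) and list change events."""
    events: list[Event] = []
    removed = {iri: fv for iri, fv in old.items() if iri not in new}
    created = {iri: fv for iri, fv in new.items() if iri not in old}

    # A removed and a created resource with identical features is treated as a move.
    for old_iri, fv in list(removed.items()):
        match = next((iri for iri, new_fv in created.items() if new_fv == fv), None)
        if match is not None:
            events.append(("move", old_iri, match))
            del removed[old_iri]
            del created[match]

    events += [("remove", iri, None) for iri in removed]
    events += [("create", None, iri) for iri in created]
    # Same IRI in both snapshots but different features is an update.
    events += [("update", iri, iri) for iri in old if iri in new and old[iri] != new[iri]]
    return events


# Hypothetical snapshots from two executions of the monitoring query.
before = {
    "ex:a": frozenset({("name", "A")}),
    "ex:b": frozenset({("name", "B")}),
    "ex:c": frozenset({("name", "C")}),
}
after = {
    "ex:a2": frozenset({("name", "A")}),
    "ex:b": frozenset({("name", "B2")}),
    "ex:d": frozenset({("name", "D")}),
}
for event in detect_events(before, after):
    print(event)
```
]]></preformat>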
      <p>DELTA-LD [21] is an approach which detects and classifies changes in resources and interlinks
between two versions of linked data datasets. The approach classifies resources that have both
their IRI and representation changed. In addition, the approach aids in selecting the same
resource in a different version of data which can be used to update a dataset. The approach
proposes the DELTA-LD change model, which is used to represent detected changes and includes
an ontology with two levels of granularity. The change model provided inspiration for the
categorization of changes in OSCD.</p>
      <p>
        sparqlPuSH [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a flexible approach designed to enable the real-time notification and
broadcasting of changes in RDF stores. Notifications are sent in real-time to any RSS or Atom
reader. SPARQL query results are delivered through the PubSubHubbub (PuSH) protocol<sup>25</sup> when new
RDF data is detected. The approach allows users to subscribe to a subset of the content in an RDF
store. The users will receive a notification message each time content in the subset has changed.
The objective is to provide a push-model where users do not have to identify new changes
themselves. The approach provided useful background information for the MQI framework, which
also provides push notifications, although these relate to source data changes. To the best of our
knowledge, the MQI framework is the only approach which provides a notification mechanism
for changes detected in source data used to generate linked data.
      </p>
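      <p>The push model described above, in which subscribers are notified instead of polling for changes, can be sketched in a few lines of Python. This is an in-memory illustration only: it does not implement the HTTP-based PubSubHubbub protocol or sparqlPuSH, and the class name, topic strings and messages are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Callable


class Hub:
    """Minimal in-memory publish/subscribe hub (no HTTP, unlike PubSubHubbub)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        """Register a callback for changes in one subset (topic) of the store."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, change: str) -> int:
        """Push a change to every subscriber of the topic; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(change)
        return len(callbacks)


hub = Hub()
inbox: list[str] = []
hub.subscribe("ex:people", inbox.append)

hub.publish("ex:people", "ex:alice was updated")   # delivered to the subscriber
hub.publish("ex:places", "ex:dublin was created")  # no subscriber for this subset
print(inbox)
```
]]></preformat>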
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>
        The component of the MQI framework evaluated in this paper demonstrated the
ability to facilitate the timely propagation of source data changes into the resulting linked data.
It therefore supports the preservation of alignment between mappings and the source data used to
generate linked data, resulting in improved quality with respect to metrics in the freshness dimension [20].
Furthermore, the information captured by the framework can provide indications of suitability
for the application of consumers and improve trustworthiness by providing additional
provenance [20]. Moreover, the evaluation approach followed for the framework could be applied
to similar tools in order to validate them with respective end users. The usability testing of the
framework provided a method for collaboration with participants who are domain experts (i.e.
mapping experts) and early-stage mapping engineers (i.e. students), with a large sample size (55
participants), when compared with existing approaches [
        <xref ref-type="bibr" rid="ref12 ref13">12,13,21,24</xref>
        ]. The grouping of
participants allowed diverse feedback to be gathered, which was compared in order to identify
the level of background knowledge required to successfully interact with the framework. The
results indicated that expert and non-expert mapping engineers could benefit from use of the
change detection component. In addition, it is hoped the additional functionality added since the
evaluation will be a step closer to autonomic maintenance of alignment, by allowing software
agents to understand detected changes and automatically take appropriate actions in order to
propagate them and prevent a decrease in data quality [20].
25 https://github.com/pubsubhubbub/PubSubHubbub
      </p>
      <p>
        Future work includes the completion of the implementation of the new functionality discussed
in Section 2, designed to provide suggestions to agents to aid in improving alignment between
the mappings and data sources used to generate linked data datasets. An evaluation will be
conducted on the new functionality in order to ensure the framework provides sufficient usability
for respective end users. The evaluation will be structured similarly to the one described in this
paper; however, slightly different metrics will be used. Satisfaction will be measured similarly
using the PSSUQ [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]; however, understanding will not be measured. Instead, the level of
alignment between a provided mapping and source data will be compared before and after the
tasks have been completed, therefore, identifying whether an improvement has been made.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This research was conducted with the financial support of the SFI AI Centre for Research Training
under Grant Agreement No. 18/CRT/6223 at the ADAPT SFI Research Centre (Grant No.
13/RC/2106_P2) at Trinity College Dublin.</p>
      <p>[14] Rémi Rampin and Vicky Rampin. 2021. Taguette: open-source qualitative data analysis. J.
Open Source Softw. 6, 68 (2021), 3522.</p>
      <p>[15] Alex Randles, Ademar Crotti Junior, and Declan O’Sullivan. 2020. A Framework for
Assessing and Refining the Quality of R2RML mappings. In Proceedings of the 22nd
International Conference on Information Integration and Web-Based Applications &amp; Services
(iiWAS2020). DOI:https://doi.org/10.1145/3428757.3429089</p>
      <p>[16] Alex Randles, Ademar Crotti Junior, and Declan O’Sullivan. 2020. Towards a vocabulary for
mapping quality assessment. In 15th International Workshop on Ontology Matching
collocated with the 19th International Semantic Web Conference (ISWC 2020), 2020, 241–242.</p>
      <p>[17] Alex Randles, Ademar Crotti Junior, and Declan O’Sullivan. 2021. A Vocabulary for
Describing Mapping Quality Assessment, Refinement and Validation. In 2021 IEEE 15th
International Conference on Semantic Computing (ICSC), 425–430.
DOI:https://doi.org/10.1109/ICSC50631.2021.00076</p>
      <p>[18] Alex Randles and Declan O’Sullivan. 2021. Assessing quality of R2RML mappings for OSi’s
Linked Open Data portal. 4th Int. Work. Geospatial Linked Data ESWC 2021 (2021).</p>
      <p>[19] Alex Randles and Declan O’Sullivan. 2022. Evaluating Quality Improvement techniques
within the Linked Data Generation Process. In 18th International Conference on Semantic
Systems (SEMANTiCS).</p>
      <p>[20] Alex Randles and Declan O’Sullivan. 2022. Modeling &amp; Analyzing Changes within LD Source
Data. In 8th Workshop on Managing the Evolution and Preservation of the Data Web
(MEPDaW) co-located with the 21st International Semantic Web Conference (ISWC 2022).</p>
      <p>[21] Anuj Singh, Rob Brennan, and Declan O’Sullivan. 2018. DELTA-LD: A Change Detection
Approach for Linked Datasets. In 4th Workshop on Managing the Evolution and
Preservation of the Data Web (MEPDaW) co-located with the 15th Extended Semantic Web
Conference (ESWC 2018).</p>
      <p>[22] Jürgen Umbrich, Boris Villazón-Terrazas, and Michael Hausenblas. 2010. Dataset dynamics
compendium: a comparative study. In Proceedings of the First International Conference on
Consuming Linked Data-Volume 665, 49–60.</p>
      <p>[23] Jürgen Umbrich, Michael Hausenblas, Aidan Hogan, Axel Polleres, and Stefan Decker.
Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources. Retrieved from
http://code.google.com/p/pubsubhubbub/</p>
      <p>[24] Vânia Vidal, Narciso Arruda, Matheus Cruz, Marco Casanova, Carlos Brito, and Valéria
Pequeno. 2017. Computing changesets for RDF views of relational data. In Workshop on
Managing the Evolution and Preservation of the Data Web (MEPDaW 2017), 43–58.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Haytham</given-names>
            <surname>Assem</surname>
          </string-name>
          , Lei Xu, Teodora Sandra Buda, and
          <string-name>
            <given-names>Declan</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Machine learning as a service for enabling Internet of Things and People</article-title>
          .
          <source>Pers. Ubiquitous Comput</source>
          .
          <volume>20</volume>
          ,
          <issue>6</issue>
          (
          <year>2016</year>
          ),
          <fpage>899</fpage>
          -
          <lpage>914</lpage>
          . DOI:https://doi.org/10.1007/s00779-016-0963-3
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mokrane</given-names>
            <surname>Bouzeghoub</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>A framework for analysis of data freshness</article-title>
          .
          <source>In Proceedings of the 2004 international workshop on Information quality in information systems</source>
          ,
          <fpage>59</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Souripriya</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Seema</given-names>
            <surname>Sundara</surname>
          </string-name>
          , and Richard Cyganiak.
          <year>2012</year>
          .
          <article-title>R2RML: RDB to RDF Mapping Language</article-title>
          .
          <source>W3C Recomm</source>
          . (
          <year>2012</year>
          ). DOI:https://doi.org/10.1017/CBO9781107415324.004
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Anastasia</given-names>
            <surname>Dimou</surname>
          </string-name>
          , Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Erik Mannens, and Rik Van de Walle.
          <year>2014</year>
          .
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>In Proceedings of the Workshop on Linked Data on the Web co-located with the 23rd International World Wide Web Conference (WWW 2014)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Miguel</given-names>
            <surname>Grinberg</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Flask web development: developing web applications with python</article-title>
          .
          <source>O'Reilly Media</source>
          , Inc.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Harris</surname>
          </string-name>
          , Andy Seaborne, and Eric Prud'hommeaux.
          <year>2013</year>
          .
          <article-title>SPARQL 1.1 query language</article-title>
          .
          <source>W3C Recomm</source>
          .
          <volume>21</volume>
          ,
          <issue>10</issue>
          (
          <year>2013</year>
          ),
          <fpage>778</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Lalana</given-names>
            <surname>Kagal</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Rei: A policy language for the me-centric project</article-title>
          . (
          <year>2002</year>
          ). DOI:https://doi.org/10.13016/M2MG5B-HRA9
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Holger</given-names>
            <surname>Knublauch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Shapes Constraint Language (SHACL), W3C Recommendation 20 July 2017</article-title>
          . URL https://www.w3.org/TR/shacl (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D</given-names>
            <surname>Krech</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Rdflib: A python library for working with rdf</article-title>
          . Online https://github.com/RDFLib/rdflib (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>James R.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Psychometric Evaluation of the PSSUQ Using Data from Five Years of Usability Studies</article-title>
          .
          <source>Int. J. Hum. Comput. Interact</source>
          .
          <volume>14</volume>
          ,
          <issue>3-4</issue>
          ,
          <fpage>463</fpage>
          -
          <lpage>488</lpage>
          . DOI:https://doi.org/10.1080/10447318.2002.9669130
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Lorelli S</given-names>
            <surname>Nowell</surname>
          </string-name>
          , Jill M Norris, Deborah E White, and Nancy J Moules.
          <year>2017</year>
          .
          <article-title>Thematic analysis: Striving to meet the trustworthiness criteria</article-title>
          .
          <source>Int. J. Qual. Methods</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Passant</surname>
          </string-name>
          and Pablo N Mendes
          .
          <year>2010</year>
          .
          <article-title>sparqlPuSH: Proactive Notification of Data Updates in RDF Stores Using PubSubHubbub</article-title>
          . In SFSW.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Niko</given-names>
            <surname>Popitsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Haslhofer</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>DSNotify - A solution for event detection and link maintenance in dynamic datasets</article-title>
          .
          <source>J. Web Semant.</source>
          <volume>9</volume>
          ,
          <issue>3</issue>
          ,
          <fpage>266</fpage>
          -
          <lpage>283</lpage>
          . DOI:https://doi.org/10.1016/j.websem.2011.05.002
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>