Modelling & Analyzing Changes within LD Source Data
Alex Randles1 and Declan O’Sullivan1
  1
      ADAPT Centre for Digital Content, Trinity College Dublin, Ireland

                 Abstract
                 Linked data (LD) datasets are highly dynamic in nature with continuously changing data. These
                 datasets are often created using mappings which transform the source data into Linked Data
                 format. These datasets may not be regenerated and changes which occur within the source data
                 may not be reflected in the Linked Data dataset. These source data changes could impact the
                 quality of the linked data or the artefacts used to create the Linked Data. Capturing information
                 relating to these changes using a domain-specific ontology will result in expressive machine-
                 readable information which can facilitate the propagation of such changes into the Linked Data
                 dataset. A system which uses the ontology to detect the impact of source data changes on the
                 mapping artefacts and Linked Data dataset would promote high quality and the availability of
                 fresh, up-to-date linked data. In this paper we focus on the design of a proposed ontology which
                 is used to capture information relating to changes which have occurred within the source data
                 that has been used to generate Linked Data datasets. Furthermore, the ontology is used within
                 a proposed component within a mapping quality framework to allow users to easily analyze
                 and address the impact of the changes on the mapping artefacts and the resulting linked data
                 dataset. Moreover, the component regularly monitors the source data to detect and represent
                 recent changes. In the paper, an example of how the component and ontology are being applied
                 in an industry-led use case is also described. It is hoped that the approach proposed
                 offers one possible unified approach to ensure that linked data is kept in sync with underlying
                 changes in source data.

                 Keywords 1
                 Dynamics; Linked Data; Mappings; Data Quality; Ontology Design.

8th Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW)
co-located with the 21st International Semantic Web Conference (ISWC 2022)
EMAIL: alex.randles@adaptcentre.ie (A. 1); declan.osullivan@adaptcentre.ie (A. 2)
ORCID: 0000-0001-6231-3801 (A. 1); 0000-0003-1090-3548 (A. 2)
              ©️ 2022 Copyright for this paper by its authors.
              Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
              CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
    Linked data datasets are highly dynamic with resources continuously being added or removed [24].
“Freshness” is often mentioned within the context of dataset dynamics [3], which has been referred to
as a quality dimension that relates to the age and occurrences of changes within data. It has been
described as one of the most important attributes of data quality [3]. Data publishers are continuously
attempting to improve the quality of data by updating vocabularies or adding and removing resources
in datasets [25]. Consequently, there is a need to tackle the dynamics related to constantly evolving
linked data [12].
    The research challenge of how to identify and handle these changes has been present within the
Linked Data community for over a decade [24]. However, there exists no consensus about how best to
detect and propagate changes that occur within the linked data domain [24]. While approaches exist
[12,13,28,29] to model and handle changes within the resulting data, none offer an approach which
links and analyses source data changes with the corresponding mapping artefacts. We hope to tackle
these state-of-the-art limitations by proposing a unified approach which would allow users to represent
changes using a specific ontology with associated policies which would ensure the linked data is kept in
sync with the underlying data sources. Furthermore, ensuring these changes are propagated into the
resulting data will help to promote the freshness and quality of the linked data. Moreover, providing a
change notification mechanism will ensure that linked data publishers are aware of what changes in
source data have occurred in a timely fashion, allowing them to take appropriate actions to prevent a
decrease in data quality of the linked data that they publish [12].
    In this paper we discuss the design and evaluation of our proposed Change Detection Ontology
(CDO) which is used to capture changes which have occurred in the source data of published Linked
Data datasets. The ontology is specifically targeted at datasets which have been created using
declarative uplift mapping artefacts. Furthermore, we showcase the design of a Change Detection
component which uses the ontology to define information related to changes which have occurred in
source data in formats such as XML, CSV, and relational databases. Moreover, the impact of these
changes on the declarative uplift mappings is analyzed by the component. The Change Detection
component [20] has been integrated within our previously introduced Mapping Quality Vocabulary
(MQV) [15,16] framework [17,18]. The framework was originally designed to assess and refine the
quality of uplift mappings; however, the design has been extended to include the component such that
quality processes are triggered depending on the type of changes that have been detected in the source
data. The evolution of the MQV framework is now referred to simply as the Mapping Quality (MQ)
framework, which we plan to evolve into a linked data quality improvement ecosystem. The objective
of the research is to improve mapping and dataset quality while promoting mapping maintenance and
re-use. The remainder of this paper is structured as follows: Section 2 describes the design and
evaluation of the Change Detection Ontology (CDO); Section 3 discusses the design and
implementation of the Change Detection Component; Section 4 discusses related work in the area and
Section 5 concludes the paper and outlines some future work.

2. Change Detection Ontology
The Change Detection Ontology (CDO) [19] is designed to represent and interchange information
related to changes which have occurred in the source data of Linked Data datasets. The objective of the
ontology is to represent the information in a machine-readable format which would allow mapping
engineers and software agents to analyze the changes and update the mapping where appropriate.
Furthermore, the ontology could be used to indicate that a dataset needs to be regenerated as important
changes in the source data may not being reflected in the current published version. It is hoped the
ontology will assist in improving mapping and dataset quality while promoting mapping maintenance
and reuse. The ontology is primarily targeted at data producers and consumers of Linked Data who are
concerned with freshness and quality of linked data.

2.1.    Ontology Design
   The proposed Change Detection Ontology (CDO) extends the Linking Open Descriptions of Events
(LODE) [22] vocabulary to represent the changes. LODE is designed to represent historical linked data
events and allows linking with other event related ontologies. In addition, the ontology reuses the Rei
Policy Ontology [8] to represent a notification policy which defines when users are notified of changes.
The Rei Policy Ontology is a universal policy language which can be applied in various domains. The
flexibility of the policy language makes it suitable for modelling the necessary information. The design
of the CDO ontology has been inspired by previous work in linked data change detection
[3,20,25,28,29]. Figure 1 shows the class interaction diagram for CDO. Concepts within the diagram
are represented using different colors.
                     Figure 1: Class interaction diagram of Change Detection Ontology (CDO)

     Changes which occur in the source data are represented by different change types. For instance, if
data is inserted into the source data, a data insertion change (cdo:InsertSourceData) has
occurred. Defining types allows changes to be identified and processed uniquely when required. The
model is extensible: users can create new change types using the same format. Categorizing
changes based on the action which caused them was inspired by related work [13,21,23,25] within the
state of the art where changes within linked data have been categorized. All the change types are
subclasses of the LODE Event (lode:Event), which is used to represent non-specific events within the
linked data domain. The reuse of LODE was inspired by previous work [13] in modelling and detecting
changes in the linked data domain. Each change has information related to why the change has occurred
(cdo:hasReason) and when the change occurred (time:Instant), which allows the linked data
to be synchronized with the source data. The changes are grouped into a log (cdo:ChangeLog) which
enables related changes to be managed together. The source data is represented by the current
(cdo:hasCurrentSource) and the previous version (cdo:hasPreviousSource) of the data.
For relational databases the source data is represented using the connection information. The details of
the source data allow changes to be detected periodically to capture recent changes. The log has one or
more owners (dul:Agent). The DUL ontology [4] provides general-purpose modelling concepts and is used by LODE
to model actors. Furthermore, each log has a notification policy (rei-policy:Policy) which
represents when users are notified of changes, allowing them to take appropriate action. The definition
of a notification policy has been inspired by sparqlPuSH [12] which provides notifications of data
changes within linked data stores. Once the notification policy has been triggered, changes are no longer
detected for that specific process.
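    To illustrate how the model can be instantiated, the sketch below uses RDFLib in Python to build a
minimal change log containing one insert change. The CDO namespace URI and the example resource
identifiers are assumptions made for illustration only; the published terms are defined in the CDO
specification [19].

from datetime import datetime, timezone

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

# Assumed namespace URIs -- see the CDO specification [19] for the published ones.
CDO = Namespace("https://w3id.org/cdo#")
LODE = Namespace("http://linkedevents.org/ontology/")
TIME = Namespace("http://www.w3.org/2006/time#")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("cdo", CDO)
g.bind("lode", LODE)
g.bind("time", TIME)

log = EX["change-log/0"]
change = EX["insert-change/0"]
instant = EX["detection-time/0"]

# A change log owned by a user, containing one insert change.
g.add((log, RDF.type, CDO.ChangeLog))
g.add((log, CDO.hasOwner, EX["user/1"]))
g.add((log, CDO.hasChange, change))

# The change itself: its type, the reason (the inserted data) and when it was detected.
g.add((change, RDF.type, CDO.InsertSourceData))
g.add((change, CDO.hasReason, Literal("ID: 10")))
g.add((change, LODE.atTime, instant))
g.add((instant, RDF.type, TIME.Instant))
g.add((instant, TIME.inXSDDateTimeStamp,
       Literal(datetime.now(timezone.utc).isoformat(), datatype=XSD.dateTimeStamp)))

print(g.serialize(format="turtle"))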

2.2.         Ontology Evaluation
    The design of the ontology has been evaluated using three different methods which include the usage
of Protégé, Competency Questions and the Ontology Pitfall Scanner.
    Protégé. Protégé [26] is an open-source tool which provides the capability to create and edit
ontologies using an intuitive graphical interface. Plugins are also available which allow extra
functionality such as alternative visualization. Furthermore, reasoners are available which can be used
to detect inconsistencies within the ontology design. CDO was created and reasoned over using Protégé.
No inconsistencies were found in Protégé while designing the ontology2.
2 CDO displayed in Protégé at https://github.com/alex-randles/MEPDaW-2022/blob/main/Protege_Screenshot.png
    Ontology Competency Questions. Competency Questions [2] are used to represent the requirements
of the ontology. These questions state information which the ontology should contain and are often
defined in natural language. These questions ensure that design requirements have been satisfied.
Furthermore, these questions can be answered by using SPARQL queries which query the ontology or its
instances. Competency questions3 have been designed for CDO and have been answered using SPARQL
queries which query sample instances. Each competency question was answered correctly. Table 1 shows
examples of the competency questions used to validate the design requirements of CDO.
3 Ontology competency questions and answers at https://github.com/alex-randles/MEPDaW-2022/blob/main/CDO%20Competency%20Questions.pdf

               Table 1: Examples of Ontology Competency Questions designed for CDO

               Question                   SPARQL Query                                Sample Answer
          What type of changes    SELECT ?changeType                             cdo:InsertSourceData
               occurred?          WHERE {                                        cdo:DeleteSourceData
                                  ?changeLog a cdo:ChangeLog;
                                       cdo:hasChange ?change .
                                  ?change a ?changeType.
                                  }
           Who owns the data?     SELECT ?owner                                          ex-user:1
                                  WHERE {
                                  ?changeLog a cdo:ChangeLog;
                                       cdo:hasOwner ?owner.
                                  }
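    As a sketch of how such a question can be answered programmatically, the snippet below runs the first
query from Table 1 over a file of sample instances using RDFLib. The file name and the CDO namespace
URI are assumptions for illustration.

from rdflib import Graph

# Sample instance data conforming to CDO (file name is an assumption).
g = Graph()
g.parse("sample_instances.ttl", format="turtle")

# Competency question from Table 1: "What type of changes occurred?"
QUERY = """
PREFIX cdo: <https://w3id.org/cdo#>
SELECT ?changeType
WHERE {
  ?changeLog a cdo:ChangeLog ;
      cdo:hasChange ?change .
  ?change a ?changeType .
}
"""

for row in g.query(QUERY):
    print(row.changeType)   # e.g. cdo:InsertSourceData, cdo:DeleteSourceData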

    Ontology Pitfall Scanner. OOPS! (Ontology Pitfall Scanner) [14] is a web-based tool which is used
to detect issues within the design of an ontology. The tool provides a method to validate and verify the
design of an ontology. Furthermore, the tool is independent of any ontology development environment.
Moreover, the tool provides recommendations for how issues can be repaired. The issues are
represented in different severity which include minor, important, and critical. Critical issues must be
repaired to ensure an adequate quality level. CDO has been input into the tool and a report has been
generated4. The report details that there are no quality issues present within the ontology design. The
report validates that the ontology is sufficient for usage by potential users.

3. Change Detection Component
   The following section discusses the motivation and design of the component in Section 3.1. Then in
Section 3.2, an example of the component applied to an industry relevant use case is described.

3.1.     Design
    The Change Detection Component [20] is a component within the MQ framework [18,20] which is
a framework designed to produce high-quality and fresh linked data. The component is designed to
capture changes which have occurred in the source data of Linked Data datasets which have been
created by mapping artefacts. Furthermore, the component analyses these changes to detect if these
changes could impact mapping quality or dataset quality. It is hoped the component will allow factors
which could impact quality to be captured and removed early. Furthermore, the process could improve
mapping maintenance and reuse. Figure 2 shows a screenshot of the Change Detection Component
interface displaying information relating to sample CSV source data.
    The component was designed using Python libraries for creating a web application (Flask [6]) and
querying the RDF data created (RDFLib [9]). Furthermore, an R2RML [5] engine was used to uplift
the information related to changes which have occurred and the notification policy.
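    A minimal sketch of how these libraries could fit together is given below: a Flask route loads the
graph produced by the uplift with RDFLib and returns the detected changes as JSON. The route name, file
path and CDO namespace URI are illustrative assumptions rather than the component's actual code.

from flask import Flask, jsonify
from rdflib import Graph

app = Flask(__name__)

CHANGE_QUERY = """
PREFIX cdo: <https://w3id.org/cdo#>
SELECT ?change ?changeType ?reason
WHERE {
  ?log a cdo:ChangeLog ;
      cdo:hasChange ?change .
  ?change a ?changeType ;
      cdo:hasReason ?reason .
}
"""

@app.route("/changes")
def list_changes():
    # Load the graph produced by the R2RML uplift (file path is an assumption).
    g = Graph()
    g.parse("change_graph.ttl", format="turtle")
    rows = g.query(CHANGE_QUERY)
    return jsonify([
        {"change": str(r.change), "type": str(r.changeType), "reason": str(r.reason)}
        for r in rows
    ])

if __name__ == "__main__":
    app.run(debug=True)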
    The current implementation can detect changes within data stored in relational databases or
represented in CSV or XML format. The process starts by the users entering the source data details
which include the URL of the file versions for CSV and XML or the relational database server details.
Thereafter, changes are detected within the data and stored within a relational database. Furthermore,
the users enter the notification thresholds for each change type and an email address where the
notification can be sent when these thresholds have been reached. These details are also stored within
the relational database. Storing the details within a relational database allows the information to be
uplifted into Linked data format using R2RML mappings [5]. The graph generated contains the changes
detected, notification policy and contact information. The graph generated is queried to populate the
user interface of the component.
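    To illustrate the detection step for CSV sources, the sketch below compares two versions of a file
and reports rows that were inserted or deleted. Treating whole rows as the unit of comparison and the
example file names are simplifying assumptions; the component's actual detection logic may differ.

import csv

def detect_csv_changes(previous_path, current_path):
    """Compare two versions of a CSV file and return inserted and deleted rows."""
    def load(path):
        with open(path, newline="") as f:
            return {tuple(row) for row in csv.reader(f)}

    previous, current = load(previous_path), load(current_path)
    inserted = current - previous   # rows only present in the current version
    deleted = previous - current    # rows only present in the previous version
    return inserted, deleted

# Example usage with two local copies of the student.csv test file (hypothetical names).
inserted, deleted = detect_csv_changes("student_previous.csv", "student_current.csv")
for row in inserted:
    print("cdo:InsertSourceData detected:", row)
for row in deleted:
    print("cdo:DeleteSourceData detected:", row)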




                                Figure 2: Screenshot of Change Detection Component

   A sample graph5 has been generated using the RML test case files6. Table 2 shows an extract of the
graph, which includes 1 insert change and sample notification policy details.

                           Table 2: Extract of sample graph generated by the component

             change-log:0 a cdo:ChangeLog;
                          cdo:hasOwner user:1;
                          cdo:hasChange insert-change:0;
                          cdo:hasCurrentVersion  ;
                          cdo:hasPreviousVersion <https://raw.githubusercontent.com/kg-
             construct/rml-test-cases/master/test-cases/RMLTC0002a-CSV/student.csv> .

             insert-change:0 a cdo:InsertSourceData;
                          cdo:hasReason "ID: 10";
                          lode:atTime detection-time:0 .

             detection-time:0 a time:Instant;
                time:inXSDDateTimeStamp "2022-07-20 22:15:45.843376"^^xsd:dateTimeStamp .

             notification-policy:0 a rei-policy:Policy;
                          rei-policy:desc "Notification policy for user 1" ;
                          rei-policy:grants policy-obligation:0 .

             policy-obligation:0 a rei-deontic:Obligation;
                          rei-deontic:action cdo:sendNotification;
                          rei-deontic:obligedTo cdo:softwareAgent;
                          rei-deontic:startingConstraint notification-constraint:0;
                          rei-policy:actor user:1, cdo:softwareAgent .


    The change log (change-log:0) contains 1 insert change (insert-change:0) which has
been detected within the CSV source data (cdo:hasCurrentVersion,
cdo:hasPreviousVersion). The inserted data (cdo:hasReason) has been detected at a
specific point in time (lode:atTime). The detection time is represented as a date time stamp
(time:inXSDDateTimeStamp). The notification policy (notification-policy:0) is
described (rei-policy:desc) and grants an obligation (rei-policy:grants). The obligation
(policy-obligation:0) requires a notification be sent (rei-deontic:action) to the user
(user:1).

5 Sample graph at https://github.com/alex-randles/MEPDaW-2022/blob/main/sample_graph.ttl
6 RML test case files at https://rml.io/test-cases/
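    As an illustration of how the generated graph can drive the notification policy, the sketch below
counts the detected changes per type in the sample graph and checks the counts against illustrative
user-defined thresholds. The CDO namespace URI, the local file name and the threshold values are
assumptions for illustration.

from rdflib import Graph

# Local copy of the sample graph generated by the component (see footnote 5).
g = Graph()
g.parse("sample_graph.ttl", format="turtle")

# Count detected changes per change type.
counts = g.query("""
PREFIX cdo: <https://w3id.org/cdo#>
SELECT ?changeType (COUNT(?change) AS ?total)
WHERE {
  ?log a cdo:ChangeLog ;
      cdo:hasChange ?change .
  ?change a ?changeType .
}
GROUP BY ?changeType
""")

# Illustrative notification thresholds entered by the user (assumed values).
thresholds = {"InsertSourceData": 1, "DeleteSourceData": 5}

for row in counts:
    change_type = str(row.changeType).split("#")[-1]
    if int(row.total) >= thresholds.get(change_type, float("inf")):
        print(f"Notification policy triggered for {change_type}")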

3.2.        Use Case
   Currently, the MQ framework is being applied in a research collaboration project with Ericsson
Software Technology. The use case involves the transformation and analysis of cloud native monitoring
data into linked data format. The information is collected from the Prometheus [27] monitoring service. The
data is stored within a relational database and transformed into linked data using R2RML [5] mappings.
The component has been connected to the database and mappings used within the project have been
uploaded. Figure 3 shows components involved within the use case interaction.




                             Figure 3: Components used within the use case interaction

    The components7 involved detect changes within the source data which is used to generate the graphs
containing monitoring information using R2RML [5]. These components ensure that the information
contained within the graphs produced by the system is high quality and fresh, which in the context of
cloud native monitoring is highly important. Furthermore, data is exchanged between each component
to link semantically related information, such as mapping data sources and source data changes.
7 Component description at https://github.com/alex-randles/MEPDaW-2022/blob/main/Component%20Descriptions.pdf

4. Related Work
   The state of the art in ontologies which represent changes within the Linked Data domain has been
reviewed. Furthermore, systems which use these ontologies are mentioned. These ontologies are
highlighted within a survey which focuses on dataset dynamics [24]. A system could not be found
within the state of the art which relates to the impact of source data changes on the mapping and
resulting Linked data dataset.
   DSNotify EventSet Vocabulary [13] is designed to represent events that resulted in resource
modifications in Linked data datasets. The vocabulary applies a resource-centric perspective and
preserves the timely order of changes. The changes are modelled as Resource Change Events and
grouped into a container of events called an EventSet. These EventSets are represented as voiD [1]
datasets.
   Dataset Dynamics Vocabulary (DaDy) [28] is designed to represent information about the frequency
and type of changes which are occurring within a Linked data dataset. The vocabulary is designed to
be used in conjunction with voiD [1]. Furthermore, the vocabulary can be used to represent the updated
source URI which allows for a change notification mechanism definition.
   Talis ChangeSet [29] is designed to represent ChangeSets, which define the delta between two
versions of a resource description. A resource description is described as a set of triples which include
a description of a resource. The delta is represented by two sets of triples which describe the addition or
removal of resource descriptions. A ChangeSet can be used to modify the resource description based
on which triples have been added or removed.
   OWL 2 Change Ontology [11] is designed to represent changes which have occurred within
ontologies. The design comprises a fine-grained taxonomy that considers varying levels of
operations. The approach consists of a generic change ontology that can be specialized for different
ontology languages. The generic change ontology has been implemented as an extension of the
Ontology Metadata Model (OMV) [7]. A metadata model is used to link the generic change ontology
with elements of the specialized model. Furthermore, the ontology allows changes to be represented in
chronological order.
   The Change and Annotations Ontology (CHAO) is designed to represent ontology changes which
occur in the evolution of an ontology in Protégé [26]. The initial ontology design represents basic
changes. However, an extension can be defined to represent complex changes. Furthermore, multiple
changes can be grouped into a higher-order of change. Moreover, metadata information about changes
can be defined such as the author or timestamp.

5. Conclusion
   Detecting and analyzing changes within the source data underlying Linked Data datasets which have
been generated by uplift mapping artefacts will ensure these changes can be assessed and propagated
into the Linked Data dataset. The lack of a standard approach to detect and propagate changes within
the linked data ecosystem is a limitation within the state of the art [24], which we provide an approach to
address in this paper. Source data changes could have a significant impact on how mapping artefacts
are defined and maintained while ensuring the quality of data for consumers. A process which
automatically detects source data changes and generates machine-readable information for analysis by
software agents or engineers will, we would argue, improve the situation.
   Future work includes completing the implementation of the system and evaluating the component
with respective end users. The evaluation will involve users interacting with the system over a certain
period and reporting usability issues. Furthermore, standardized metrics such as the Post Study System
Usability Questionnaire (PSSUQ) [10] will be used to evaluate user interaction quantitatively.
Moreover, the results of the evaluation will be analyzed, and improvements will be made to the
component based on the analysis. Finally, the component could be expanded to handle other data
formats such as JSON.

6. Acknowledgment
This research was conducted with the financial support of the SFI AI Centre for Research Training
under Grant Agreement No. 18/CRT/6223 at the ADAPT SFI Research Centre (Grant #
13/RC/2106_P2) at Trinity College Dublin. A research collaboration with Ericsson Software
Technology is helping to evaluate the proposed solution. With special thanks to John Keeney & Liam
Fallon for their collaboration and support.

7. References
[1]    Keith Alexander, Richard Cyganiak, Michael Hausenblas, and Jun Zhao. 2010. Describing
       Linked Datasets - On the Design and Usage of voiD, the 'Vocabulary of Interlinked Datasets'.
       Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th Int. World Wide Web Conf.
       (WWW 09) 538, (February 2010).
[2]    Camila Bezerra, Fred Freitas, and Filipe Santana. 2013. Evaluating ontologies with Competency
       Questions. In Proceedings - 2013 IEEE/WIC/ACM International Joint Conference on Web
       Intelligence and Intelligent Agent Technology - Workshops, WI-IATW 2013.
       DOI:https://doi.org/10.1109/WI-IAT.2013.199
[3]    Mokrane Bouzeghoub. 2004. A framework for analysis of data freshness. In Proceedings of the
       2004 international workshop on Information quality in information systems, 59–67.
[4]    Victor Charpenay, Sebastian Käbisch, and Harald Kosch. 2016. Introducing Thing Descriptions
       and Interactions: An Ontology for the Web of Things. In SR+ SWIT@ ISWC, 55–66.
[5]    Souripriya Das, Seema Sundara, and Richard Cyganiak. 2012. R2RML: RDB to RDF Mapping
       Language. W3C Recomm. (2012). DOI:https://doi.org/10.1017/CBO9781107415324.004
[6]    Miguel Grinberg. 2018. Flask web development: developing web applications with Python.
       O'Reilly Media, Inc.
[7]    Jens Hartmann, Y Sure, P Haase, R Palma, and M Suarez-Figueroa. 2005. OMV–ontology
       metadata vocabulary. In ISWC.
[8]    Lalana Kagal and others. 2002. Rei: A policy language for the me-centric project. (2002).
[9]    D Krech. 2006. RDFLib: A Python library for working with RDF. Online:
       https://github.com/RDFLib/rdflib (2006).
[10]   James R. Lewis. 2002. Psychometric Evaluation of the PSSUQ Using Data from Five Years of
       Usability Studies. Int. J. Hum. Comput. Interact. 14, 3–4 (September 2002), 463–488.
       DOI:https://doi.org/10.1080/10447318.2002.9669130
[11]   Raul Palma, Peter Haase, Oscar Corcho, and Asuncion Gomez-Perez. 2009. Change
       Representation For OWL 2 Ontologies.
[12]   Alexandre Passant and Pablo N Mendes. sparqlPuSH: Proactive notification of data updates in
       RDF stores using PubSubHubbub. Retrieved from http://www.ldodds.com/blog/2010/04/rdf-
       dataset-notifications/
[13]   Niko Popitsch and Bernhard Haslhofer. 2011. DSNotify - A solution for event detection and link
       maintenance in dynamic datasets. J. Web Semant. 9, 3 (September 2011), 266–283.
       DOI:https://doi.org/10.1016/j.websem.2011.05.002
[14]   María Poveda-Villalón, Mari Carmen Suárez-Figueroa, and Asunción Gómez-Pérez. 2012.
       Validating ontologies with OOPS! In Lecture Notes in Computer Science (including subseries
       Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 267–281.
       DOI:https://doi.org/10.1007/978-3-642-33876-2_24
[15]   Alex Randles, Ademar Crotti Junior, and Declan O’Sullivan. 2020. Towards a vocabulary for
       mapping quality assessment. Proc. 15th Int. Work. Ontol. Matching 19th Int. Semant. Web Conf.
       (ISWC), (2020).
[16]   Alex Randles, Ademar Crotti Junior, and Declan O’Sullivan. 2021. A Vocabulary for Describing
       Mapping Quality Assessment, Refinement and Validation. In 2021 IEEE 15th International
       Conference on Semantic Computing (ICSC), 425–430.
       DOI:https://doi.org/10.1109/ICSC50631.2021.00076
[17]   Alex Randles and Declan O’Sullivan. 2021. Assessing quality of R2RML mappings for OSi’s
       Linked Open Data portal. 4th Int. Work. Geospatial Linked Data ESWC 2021 (2021).
[18]   Alex Randles and Declan O’Sullivan. 2022. Evaluating Quality Improvement techniques within
       the Linked Data Generation Process. In 18th International Conference on Semantics Systems
       (SEMANTiCS).
[19]   Alex Randles and Declan O’Sullivan. 2022. Change Detection Ontology (CDO) Specification.
       Retrieved from https://alex-randles.github.io/Change-Detection-Ontology/
[20]   Alex Randles, Declan O’Sullivan, John Keeney, and Liam Fallon. 2022. Applying a Mapping
       Quality Framework in Cloud Native Monitoring. In 18th International Conference on Semantics
       Systems (SEMANTiCS).
[21]   Julio Cesar Dos Reis, Cédric Pruski, Marcos Da Silveira, and Chantal Reynaud-Delaître.
       2015. DyKOSMap: A framework for mapping adaptation between biomedical knowledge
       organization systems. J. Biomed. Inform. 55, (2015), 153–173.
[22]   Ryan Shaw, Raphaël Troncy, and Lynda Hardman. 2009. Lode: Linking open descriptions of
       events. In Asian semantic web conference, 153–167.
[23]   Anuj Singh, Rob Brennan, and Declan O’Sullivan. DELTA-LD: A Change Detection Approach
       for Linked Datasets.
[24]   Jürgen Umbrich, B. Villazón-Terrazas, and M. Hausenblas. 2010. Dataset Dynamics
       Compendium: A Comparative Study. (2010).
[25]   Jürgen Umbrich, Michael Hausenblas, Aidan Hogan, Axel Polleres, and Stefan Decker.
       Towards Dataset Dynamics: Change Frequency of Linked Open Data Sources. Retrieved from
       http://code.google.com/p/pubsubhubbub/
[26]   Protégé. Retrieved January 1, 2022 from https://protege.stanford.edu/
[27]   Prometheus. Retrieved from https://prometheus.io/
[28]   2010. Dataset Dynamics (dady) Vocabulary. Retrieved June 14, 2022 from
       http://purl.org/NET/dady
[29]   2010. Talis Changeset Vocabulary. Retrieved from http://vocab.org/changeset/schema