<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ADOG - Annotating Data with Ontologies and Graphs</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Institute, Insight Centre for Data Analytics</institution>
          ,
          <addr-line>NUI Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>ADOG is a system focused on leveraging the structure of a well-connected ontology graph extracted from different Knowledge Graphs to annotate structured or semi-structured data. The Semantic Web Challenge on Tabular Data to Knowledge Graph Matching provided us with the means to test the system within the more restricted scenario of annotating data with a single ontology. This competition provided important insights into the challenges we face not only in a single-ontology case but also in future multi-source scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Ontologies</kwd>
        <kwd>DBPedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
        <p>
          ADOG combines a series of existing technologies and algorithms in novel ways to
automatically annotate structured and semi-structured files. It takes advantage
of the native graph structure of ontologies to build a well-connected network
of ontologies from different sources. This integration facilitates the discovery of
connections between entities with distinct origins and types but related topics.
More details and a preliminary evaluation of its effectiveness are available in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>The Semantic Web Challenge on Tabular Data to Knowledge Graph
Matching provided us with a platform to test the base use-case of a single Knowledge
Graph (KG) with a single underlying ontology. The challenge distinguished
between three separate tasks: Column-Type Annotation (CTA), Cell-Entity
Annotation (CEA), and Columns-Property Annotation (CPA).</p>
        <p>We participated in all four Rounds of the competition and, except in Round
1, submitted results to all tasks. The CEA and CPA tasks were evaluated with
F1-measure and Precision throughout. The CTA task used the same measures in Round 1;
after Round 1 it adopted a weighted scoring scheme whose main metric is the Average
Hierarchical Score (AH-Score) and whose secondary metric is the Average
Perfect Score (AP-Score).</p>
        <p>The two scores used for ranking are calculated as follows:</p>
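        <p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>\text{AH-Score} = \frac{1 \cdot |PA| + 0.5 \cdot |OKA| - 1 \cdot |WA|}{|\text{Target Columns}|}</tex-math>
          </disp-formula>
          where |PA| is the number of Perfect Annotations, |OKA| is the number of
          Correct Annotations, and |WA| is the number of Wrong Annotations.
        </p>
        <p>
          <disp-formula id="eq2">
            <label>(2)</label>
            <tex-math>\text{AP-Score} = \frac{|PA|}{|\text{Annotated Classes}|}</tex-math>
          </disp-formula>
        </p>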
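        <p>As a minimal sketch, the two CTA scores can be computed from annotation counts as below. The function and parameter names are illustrative, and the sketch assumes that wrong annotations are subtracted in the AH-Score, following the challenge definition.</p>

```python
def ah_score(perfect: int, okay: int, wrong: int, target_columns: int) -> float:
    """AH-Score = (1*|PA| + 0.5*|OKA| - 1*|WA|) / |Target Columns|."""
    if target_columns == 0:
        raise ValueError("target_columns must be positive")
    return (1.0 * perfect + 0.5 * okay - 1.0 * wrong) / target_columns


def ap_score(perfect: int, annotated_classes: int) -> float:
    """AP-Score = |PA| / |Annotated Classes| (0.0 when nothing was annotated)."""
    if annotated_classes == 0:
        return 0.0
    return perfect / annotated_classes


# Illustrative counts only: 10 target columns, 12 submitted class annotations.
example_ah = ah_score(perfect=6, okay=4, wrong=2, target_columns=10)
example_ap = ap_score(perfect=6, annotated_classes=12)
```

        <p>With these illustrative counts the AH-Score is 0.6 and the AP-Score is 0.5. Because wrong annotations carry a negative weight, the AH-Score can drop below zero.</p>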
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>ADOG takes advantage of the graph properties of the ontologies and the KG
by enriching the links between nodes, thereby providing a new level of
relatedness connections between concepts in the KG. At this stage, the system
leverages the depth of each concept in the ontology, i.e., its distance to the root
node, and the shortest paths between nodes to distinguish between stronger and
weaker candidate annotations.</p>
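        <p>The two graph signals mentioned here, depth and shortest path, can be illustrated with a small sketch. The class hierarchy below is a hypothetical toy example rather than the real DBPedia ontology, and the function names are assumptions made for illustration.</p>

```python
from collections import deque

# Hypothetical toy subclass hierarchy (child -> parent); not the real ontology.
PARENT = {
    "Person": "Thing",
    "MovieDirector": "Person",
    "Work": "Thing",
    "Film": "Work",
}


def depth(cls):
    """Number of subclass edges between a class and the root."""
    steps = 0
    while cls in PARENT:
        cls = PARENT[cls]
        steps += 1
    return steps


def neighbours(cls, extra_edges):
    """Undirected neighbours via subclass links plus extra relatedness edges."""
    found = {child for child, parent in PARENT.items() if parent == cls}
    if cls in PARENT:
        found.add(PARENT[cls])
    for edge in extra_edges:
        if cls in edge:
            found |= edge - {cls}
    return found


def shortest_path_length(start, goal, extra_edges=frozenset()):
    """Breadth-first search; returns None when the classes are not connected."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours(node, extra_edges):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# A discovered relatedness edge shortens the path between two classes.
related = {frozenset({"MovieDirector", "Film"})}
```

        <p>In this toy hierarchy the path from MovieDirector to Film has four edges through the root, and a single edge once the relatedness link in <monospace>related</monospace> is added.</p>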
        <p>Figure 1 shows the three steps needed to build the schema layer that includes
the ontology graph, native links and discovered relatedness edges.</p>
        <p>The first step parses the ontology files and the entities of the KG. The system is
designed to integrate multiple ontologies via their rdfs:subClassOf and object
properties. Additionally, the ontologies are subjected to an ontology matching
step to explore additional relations between ontology classes.</p>
        <p>The ontology graph and all its links are loaded into ArangoDB
(https://www.arangodb.com), a multi-model NoSQL database. We chose this database for its
multi-model capabilities, which allow us to combine the graph and document models: each ontology
class is stored as a node in a graph, and each node is a document with key/value
properties. This database contains all information relevant to the schema. Each
node stores the relevant information of a class (URI, label, definition) together with
its distance to the root node, i.e., its depth. The database also includes a document
collection with relevant properties of the graph, such as the diameter (maximum
distance between two nodes in the graph), the maximum Inverse Document Frequency
(IDF), and the maximum depth.</p>
        <p>The relevant entity properties of the KG are indexed with ElasticSearch
(https://www.elastic.co/products/elasticsearch). The only mandatory property to be
indexed is an entity label that can be matched against the data to be annotated.</p>
        <p>[Figure 1: image not recovered. Surviving labels: Properties, Diameter, IDF, Depth, Post-processing.]</p>
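        <p>To make the stored schema information concrete, the following sketch builds one document per class and a single document of graph-level properties. The classes, field names and tree-only distance computation are assumptions for illustration; the sketch does not use the database itself and ignores discovered relatedness edges.</p>

```python
from itertools import combinations

# Hypothetical classes: (uri, label, definition, parent uri); the root has no parent.
RAW_CLASSES = [
    ("ex:Thing", "thing", "Root of the hierarchy.", None),
    ("ex:Person", "person", "A human being.", "ex:Thing"),
    ("ex:Work", "work", "A creative work.", "ex:Thing"),
    ("ex:MovieDirector", "movie director", "A person who directs films.", "ex:Person"),
    ("ex:Film", "film", "A motion picture.", "ex:Work"),
]
PARENT_OF = {uri: parent for uri, _, _, parent in RAW_CLASSES}


def ancestors(uri):
    """Chain from a class up to the root, starting with the class itself."""
    chain = [uri]
    while PARENT_OF[chain[-1]] is not None:
        chain.append(PARENT_OF[chain[-1]])
    return chain


def tree_distance(a, b):
    """Edges between two classes through their closest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(node for node in up_a if node in up_b)
    return up_a.index(common) + up_b.index(common)


# One document per class node, carrying the fields named in the text plus depth.
node_documents = [
    {"uri": uri, "label": label, "definition": definition,
     "depth": len(ancestors(uri)) - 1}
    for uri, label, definition, _ in RAW_CLASSES
]

# Graph-level properties; the maximum IDF is left out because it depends on
# indexed entity data that this toy example does not include.
graph_properties = {
    "diameter": max(tree_distance(a, b) for a, b in combinations(PARENT_OF, 2)),
    "max_depth": max(doc["depth"] for doc in node_documents),
}
```

        <p>Storing the diameter and maximum depth once makes it possible to normalise path lengths and depths to a common range at scoring time.</p>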
        <p>After the build stage is complete, the matching process starts by
matching the data against the ElasticSearch index. When several matches are returned,
additional measures are employed to score the
relevance of each match to the query, considering the context of the data to annotate.
The three main steps are computing the string similarity, computing the frequency of
properties, and weighting the final score.</p>
        <p>Similarity. This measure finds the string similarity between the query words and the
matched terms. Both strings are normalised, punctuation is removed, and words
inside brackets are ignored. The similarity measure uses the Levenshtein Distance
(LD) to calculate the similarity between two strings s1 and s2 as follows:
          <disp-formula id="eq3">
            <label>(3)</label>
            <tex-math>sim = 1 - \frac{LD(s_1, s_2)}{\max(\mathrm{length}(s_1), \mathrm{length}(s_2))}</tex-math>
          </disp-formula>
        </p>
        <p>Frequency of Properties. If any extra properties, besides the labels, were
indexed from the source KG, this step calculates and normalises their frequencies
for each match. For example, in DBPedia, these properties can be the categories,
types or even other entities linked to the matched entity via an object property.</p>
        <p>Final Score. The final score of each candidate is weighted considering
the previous steps, plus the normalised ElasticSearch score for each search
performed. These weights are variable and can be adjusted to fit any model, giving
more or less weight to similarity, search scores, or property frequencies.</p>
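        <p>The three scoring steps can be sketched as follows. The normalisation rules, the choice of dividing frequencies by the largest count, and the default weights are assumptions made for illustration; only the similarity formula itself is taken from the text.</p>

```python
import re
from collections import Counter


def normalise(text):
    """Lower-case, drop bracketed words and punctuation, collapse whitespace."""
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def levenshtein(a, b):
    """Edit distance computed with the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(s1, s2):
    """sim = 1 - LD(s1, s2) / max(length(s1), length(s2)) on normalised strings."""
    s1, s2 = normalise(s1), normalise(s2)
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / longest


def normalised_frequencies(values):
    """Relative frequency of each property value, scaled by the largest count (assumed rule)."""
    counts = Counter(values)
    if not counts:
        return {}
    top = max(counts.values())
    return {value: count / top for value, count in counts.items()}


def final_score(sim, search_score, prop_freq, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three signals in [0, 1]; the default weights are illustrative."""
    w_sim, w_search, w_freq = weights
    return w_sim * sim + w_search * search_score + w_freq * prop_freq
```

        <p>For example, <monospace>similarity("Alien (film)", "alien!")</monospace> evaluates to 1.0 because the bracketed word and the punctuation are discarded before the distance is computed.</p>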
      </sec>
      <sec id="sec-2-3">
        <title>Adaptations made for the evaluation</title>
        <p>The main aim of ADOG is to integrate ontologies and KGs from different sources;
however, it is still possible to use the system with a single ontology and KG. In
the present challenge, the system uses DBPedia as the
Knowledge Graph and the DBPedia ontology as the schema. We adopted different
approaches for Round 1 and for the remaining Rounds, detailed in the following sections.</p>
        <p>Round 1. In Round 1, we focused on the CTA task, and therefore the build
stage was central to the methodology. Since the challenge includes only one
ontology, we matched it against itself to find possible missing relations. These relations
need not denote equivalence; they may merely indicate a degree of relatedness
between two concepts. For example, in the DBPedia ontology (dbo) the class
http://dbpedia.org/ontology/MovieDirector is not directly connected with
the class http://dbpedia.org/ontology/Film. Instead, the class dbo:Film has
the property http://dbpedia.org/ontology/director, which connects dbo:Film
to dbo:Person. Partially matching dbo:MovieDirector with dbo:Film created
a direct mapping between these two classes, which makes it easier to identify
related matches for both classes.</p>
        <p>Since we focused mostly on matching the columns to ontology classes, the data
layer was kept shallow, indexing only the labels of each resource in each
language available in the DBPedia data dumps. As scoring properties, we used
ontology type frequency and pair-wise shortest path computation between
candidate ontology classes.</p>
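        <p>One way these two scoring properties could be combined to pick a column type is sketched below. The candidate types, the pre-computed distances, the equal weighting, and the normalisation by the graph diameter are all assumptions for illustration and are not necessarily the combination used in the system.</p>

```python
from collections import Counter

# Hypothetical candidate types returned for the four cells of one column.
CELL_TYPES = [["Film"], ["Film", "MovieDirector"], ["Film"], ["Person"]]

# Hypothetical shortest path lengths between classes in the schema graph.
DIST = {
    frozenset({"Film", "MovieDirector"}): 1,
    frozenset({"Film", "Person"}): 2,
    frozenset({"MovieDirector", "Person"}): 1,
}


def distance(a, b):
    """Look up the pre-computed shortest path length between two classes."""
    return 0 if a == b else DIST[frozenset({a, b})]


def rank_column_types(cell_types, diameter):
    """Rank candidate classes by type frequency and closeness to the other candidates."""
    counts = Counter(t for types in cell_types for t in types)
    scores = {}
    for candidate in counts:
        frequency = counts[candidate] / len(cell_types)
        others = [t for t in counts if t != candidate]
        if others:
            mean_path = sum(distance(candidate, o) for o in others) / len(others)
            closeness = 1.0 - mean_path / diameter
        else:
            closeness = 1.0
        # Equal weights are an illustrative choice.
        scores[candidate] = 0.5 * frequency + 0.5 * closeness
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_column_types(CELL_TYPES, diameter=4)
```

        <p>With these toy inputs, Film ranks first because it is both the most frequent type and close to the other candidates in the graph.</p>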
        <p>Rounds 2, 3 and 4. In the remaining Rounds, we worked on improving the results
of the CEA task, and therefore the focus was on the Data Layer. The Schema
Layer was not changed, while the Data Layer was updated to include more
information from DBPedia resources to facilitate the choice of the right match
between a query word and a matched label. In addition to the previous properties,
resource categories were indexed, and the IDF of the categories and types was
added to weight the frequencies with tf-idf.</p>
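        <p>The tf-idf weighting of categories and types can be illustrated with the sketch below. The candidate entities, the document frequencies, and the exact tf and idf definitions are assumptions for illustration: here the term frequency counts how often a category or type occurs among the candidates of a column, and the idf is the logarithm of the index size over the document frequency.</p>

```python
import math
from collections import Counter

# Hypothetical candidates for the cells of one column, with their categories and types.
CANDIDATES = {
    "Alien_(film)": ["Film", "1979_films"],
    "Alien_(band)": ["Band"],
    "Blade_Runner": ["Film", "1982_films"],
}

# Hypothetical document frequencies over an index of INDEX_SIZE entities.
INDEX_SIZE = 1000
DOC_FREQ = {"Film": 100, "Band": 50, "1979_films": 5, "1982_films": 5}


def idf(term):
    """Inverse document frequency: rarer categories and types weigh more."""
    return math.log(INDEX_SIZE / DOC_FREQ[term])


def tfidf_scores(candidates):
    """Score each candidate by the summed tf-idf of its categories and types."""
    term_freq = Counter(t for props in candidates.values() for t in props)
    weights = {term: term_freq[term] * idf(term) for term in term_freq}
    return {entity: sum(weights[t] for t in props)
            for entity, props in candidates.items()}


scores = tfidf_scores(CANDIDATES)
```

        <p>In this toy column, Alien_(film) scores higher than Alien_(band) because its type is shared with other candidates in the column and its category is rare in the index.</p>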
      </sec>
      <sec id="sec-2-4">
        <title>Link to the system and parameters file</title>
        <p>The code used for completing the challenge is available at
https://github.com/danielapoliveira/iswc-annotation-challenge. Instructions to
run it are also contained in the repository.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In Round 1 we focused on the CTA task, but also submitted results for the
CEA task. We did not submit to the CPA task in this Round. Most of the
performance enhancements were focused on improving the column type annotations;
therefore, the system performed better in CTA than in CEA.</p>
      <p>Table 1 shows that for CTA the F1-Score obtained was 0.908, with 0.915
precision, while in CEA we obtained an F1-Score of 0.657, with 0.673 precision.</p>
      <p>[Table 1: values not recovered. Surviving headers: Round; CTA, CEA, CPA; AH-Score, AP-Score, Precision, Recall, Precision, Recall.]</p>
      <p>In Rounds 2, 3, and 4, we mostly focused on the CEA task but submitted to all
tasks. The CTA task had different scoring and, therefore, its results are not comparable to
the results of Round 1.</p>
      <p>In these Rounds, both the CTA and CPA results were derived from the CEA results,
since the methods we used allowed us to directly extract all the necessary
information from them. For CTA no changes were necessary, while
for CPA a few changes were added to extract the correct relation between the
elements matched by the CEA algorithm. The CEA results improved
with the new ground truth and refined methods.</p>
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <sec id="sec-4-1">
        <title>Comments on the results</title>
        <p>ADOG is still in the early stages of research and development, and we took
advantage of this challenge as a concrete testbed for research into the single-ontology
use-case. Despite being focused on the multiple-source scenario, the system still
achieved a reasonable performance without many modifications to its core
function. However, throughout the competition we faced a few challenges.
In its current state, the system is very sensitive to scoring and weight changes:
even small changes can have a big impact, and changes that benefit one type of
data can hinder other types.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Discussions on the way to improve the proposed system</title>
        <p>Future research will focus mainly on the multi-source system.
However, a more robust scoring system is necessary before adding an extra layer of
complexity. Adding extra KGs and schemas could lead to a performance
improvement, since the graph capabilities of the approach could be further explored. In
the future, we also intend to focus more on the property annotation task, since
that is also one of the overall goals of our system.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Comments on the challenge procedure</title>
        <p>We believe that a system should be put in place so that the results submitted
and shown on the leaderboards can be double-checked for accuracy. Also, to
avoid overfitting to the ground truth, we suggest that the systems be tested
against another test set generated by the same methods. Finally, we believe that
a more standard and robust method of generating the ground truth is necessary,
since inconsistencies, different encodings, and several instances of
incorrect ground truth data can frustrate participants and make
the competition less appealing.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Comments on the challenge measures</title>
        <p>We consider that the measures used for CTA in Round 2 are not appropriate to
accurately evaluate the performance of an algorithm. A participant that obtains
all perfect results without modifying their system should not be forced to add
every parent of the right match just to fit the challenge. Instead, we would suggest
a different weight measure, where the class assigned by the algorithm is weighted
based on its distance from the perfect match. For example, if the exact match
was dbo:MovieDirector and a result is submitted with dbo:Person, this match
should get a score of 0.5 instead of 1. If the exact match is found, then the score
for the match is 1. In this way, a single match would not have multiple answers,
and the total scores are bound from 0 to 1.</p>
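        <p>A minimal sketch of such a distance-based measure is given below. The decay function 1 / (1 + hops) is an assumption chosen because it reproduces the example above (1 for the exact class, 0.5 for its direct parent); the hierarchy fragment is a hypothetical toy example.</p>

```python
# Hypothetical fragment of a class hierarchy (child -> parent).
PARENT = {"MovieDirector": "Person", "Person": "Agent", "Agent": "Thing"}


def hops_to_ancestor(cls, ancestor):
    """Subclass edges from cls up to ancestor, or None if it is not an ancestor."""
    hops = 0
    while cls != ancestor:
        if cls not in PARENT:
            return None
        cls = PARENT[cls]
        hops += 1
    return hops


def match_score(predicted, exact):
    """1.0 for the exact class, decaying with the distance to it, 0.0 otherwise."""
    hops = hops_to_ancestor(exact, predicted)
    if hops is None:
        return 0.0
    return 1.0 / (1 + hops)


def total_score(pairs):
    """Mean score over (predicted, exact) pairs; always between 0 and 1."""
    if not pairs:
        return 0.0
    return sum(match_score(p, e) for p, e in pairs) / len(pairs)
```

        <p>Each target column receives exactly one answer under this scheme, so a system that returns only exact matches reaches the maximum total of 1 without listing any parent classes.</p>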
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Overall, the Semantic Web Challenge on Tabular Data to Knowledge Graph
Matching provided an engaging platform for developing and testing our system.
The system expanded its functionalities due to the demands of this challenge,
and participating provided important insights into the hurdles we face when
dealing with data annotation based on KGs.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work has been partly funded by Science Foundation Ireland (SFI) under
Grant Number SFI/12/RC/2289_P2, Insight Centre for Data Analytics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahay</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Leveraging Ontologies for Knowledge Graph Schemas</article-title>
          .
          <source>In: Knowledge Graph Building Workshop</source>
          . p.
          <volume>12</volume>
          (
          <year>2019</year>
          ), https://openreview.net/pdf?id=B1xnsmvaUE
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>