<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CANARD complex matching system: results of the 2018 OAEI evaluation campaign</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elodie Thieblin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ollivier Haemmerle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cassia Trojahn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIT &amp; Universite de Toulouse 2 Jean Jaures</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>4</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>This paper presents the results obtained by the CANARD system in the OAEI 2018 campaign. CANARD can produce complex alignments. This is the first participation of CANARD in the campaign. Even though the system was able to generate alignments for only one complex dataset (Taxon), the results are promising. The CANARD (Complex Alignment Need and A-box based Relation Discovery) system discovers complex correspondences between populated ontologies based on Competency Questions for Alignment (CQAs). CQAs represent the knowledge needs of a user and define the scope of the alignment [3]. They are competency questions that need to be satisfied over two or more ontologies. Our approach takes as input a set of CQAs translated into SPARQL queries over the source ontology. The answer to each query is a set of instances retrieved from a knowledge base described by the source ontology. These instances are matched with those of a knowledge base described by the target ontology. The correspondence is generated by matching the graph pattern of the source query to the lexically similar surroundings of the target instances.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Presentation of the system</title>
      <sec id="sec-2-1">
        <title>State, purpose, general statement</title>
      </sec>
      <sec id="sec-2-2">
        <title>Specific techniques used</title>
        <p>The CQAs that are taken as input by CANARD are limited to class expressions
(interpreted as a set of instances). The approach is developed in 11 steps, as
depicted in Figure 1:</p>
        <p>1. Extract the source DL formula e<sub>s</sub> from the SPARQL CQA.</p>
        <p>2. Retrieve the labels L<sub>s</sub> of the URIs appearing in e<sub>s</sub>.</p>
        <p>3. Retrieve the source instances inst<sub>s</sub>, i.e. the answers to the CQA over the source knowledge base.</p>
        <p>4. Find target instances inst<sub>t</sub> that are equivalent or similar (same label) to the source
instances inst<sub>s</sub>.</p>
        <p>5. Retrieve the description of the target instances: their set of triples and the object/subject
type.</p>
        <p>6. For each triple, retrieve the labels L<sub>t</sub> of its entities.</p>
        <p>7. Compare L<sub>s</sub> and L<sub>t</sub> using a string comparison metric (e.g., Levenshtein
distance with a threshold).</p>
        <p>8. Keep the triples whose summed label similarity is above a threshold τ.
Keep the object (or subject) type if its similarity is higher than that of
the object (or subject).</p>
        <p>9. Express each triple as a DL formula e<sub>t</sub>.</p>
        <p>10. Aggregate the formulae e<sub>t</sub> into an explicit or implicit form: if two DL
formulae have a common atom in their right member (target member), the atoms
that differ are put together.</p>
        <p>11. Put e<sub>s</sub> and e<sub>t</sub> together in a correspondence (e<sub>s</sub> ≡ e<sub>t</sub>) and express this
correspondence in EDOAL. The average string similarity between the aggregated
formula and the CQA labels gives the confidence value of the correspondence.</p>
        <p>[Figure 1 (not reproduced): diagram of the 11 steps, from the source CQA to the EDOAL correspondence.]</p>
        <p>The instance matching phase (step 4) relies on existing owl:sameAs,
skos:closeMatch and skos:exactMatch links, and on exact label matching. The similarity
between the label sets L<sub>s</sub> and L<sub>t</sub> in step 7 is the sum of the
string similarities over the Cartesian product of L<sub>s</sub> and L<sub>t</sub> (equation 1).</p>
        <p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math><![CDATA[sim(L_s, L_t) = \sum_{l_s \in L_s} \sum_{l_t \in L_t} strSim(l_s, l_t)]]></tex-math>
          </disp-formula>
        </p>
        <p>strSim is the string similarity between two labels l<sub>s</sub> and l<sub>t</sub> (equation 2), and τ is the
threshold for the similarity measure; its value was set empirically in our experiments.
The confidence value given to the final correspondence (step 11) is the
similarity of the triple it comes from, or the average similarity if it comes from more than
one triple. A confidence value that is initially computed to be above 1 is reduced to 1.</p>
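        <p>As an illustration of equation 1 and of the capping of the confidence value, the following Python sketch may help. It assumes that strSim is a normalised Levenshtein similarity that is set to 0 when it does not exceed the threshold. This form, the lower-casing of labels, the helper names (levenshtein, str_sim, sim, confidence) and the default threshold of 0.5 are assumptions made for the example, not the settings used by CANARD.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def str_sim(ls: str, lt: str, tau: float = 0.5) -> float:
    """Assumed form of strSim: normalised similarity, 0 if not above tau."""
    longest = max(len(ls), len(lt))
    if longest == 0:
        return 1.0
    score = 1.0 - levenshtein(ls.lower(), lt.lower()) / longest
    return score if score > tau else 0.0


def sim(source_labels, target_labels, tau: float = 0.5) -> float:
    """Equation 1: sum of str_sim over every (source, target) label pair."""
    return sum(str_sim(ls, lt, tau)
               for ls in source_labels
               for lt in target_labels)


def confidence(triple_similarities) -> float:
    """Average similarity of the contributing triples, capped at 1."""
    mean = sum(triple_similarities) / len(triple_similarities)
    return min(1.0, mean)
```
]]></preformat>
        <p>Because the similarity is a sum over all label pairs, it can exceed 1 when several labels match, which is why the confidence value is capped.</p>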
        <p>Classes. For each owl:Class populated with at least one instance, a SPARQL
query is created to retrieve all the instances of this class. If &lt;o1#class1&gt; is a
populated class of the source ontology, the following query is created:</p>
        <preformat>SELECT DISTINCT ?x WHERE { ?x a &lt;o1#class1&gt;. }</preformat>
        <p>Property-value pairs. Inspired by the approaches of [
          <xref ref-type="bibr" rid="ref1 ref2 ref4">1,2,4</xref>
          ], we create SPARQL
queries of the following forms:</p>
        <preformat>SELECT DISTINCT ?x WHERE { ?x &lt;o1#property1&gt; &lt;o1#Value1&gt;. }
SELECT DISTINCT ?x WHERE { &lt;o1#Value1&gt; &lt;o1#property1&gt; ?x. }
SELECT DISTINCT ?x WHERE { ?x &lt;o1#property1&gt; "Value". }</preformat>
        <p>These property-value pairs are computed as follows: for each property (object or
data property), the numbers of distinct object values and distinct subject values are retrieved.
If the ratio of these two numbers is over a threshold (arbitrarily set to 30)
and the smallest number is smaller than a threshold (arbitrarily set to 20), a
query is created for each of the fewer than 20 values. For example, if the property
&lt;o1#property1&gt; has 300 different subject values and 3 different object values
("Value1", "Value2", "Value3"), the ratio |subject|/|object| = 300/3 = 100 &gt; 30 and
|object| = 3 &lt; 20. The following 3 queries are created as CQAs:</p>
        <preformat>SELECT DISTINCT ?x WHERE { ?x &lt;o1#property1&gt; "Value1". }
SELECT DISTINCT ?x WHERE { ?x &lt;o1#property1&gt; "Value2". }
SELECT DISTINCT ?x WHERE { ?x &lt;o1#property1&gt; "Value3". }</preformat>
        <p>The threshold on the smallest number ensures that the property-value pairs
represent a category. The threshold on the ratio ensures that properties represent
categories and not properties with few instantiations.</p>
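        <p>The generation of these queries could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CANARD implementation: the function names (class_cqa, property_value_cqas) and the way the distinct values are passed in are hypothetical, and the ratio is assumed to be the larger count divided by the smaller one. Only the two thresholds (30 and 20) and the query shapes come from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
RATIO_THRESHOLD = 30  # threshold on the ratio of the two counts (from the text)
SIZE_THRESHOLD = 20   # threshold on the smallest count (from the text)


def class_cqa(class_uri: str) -> str:
    """Query retrieving all the instances of a populated class."""
    return f"SELECT DISTINCT ?x WHERE {{ ?x a <{class_uri}>. }}"


def _term(value: str, is_literal: bool) -> str:
    """Render a value as a quoted literal or as a URI (hypothetical helper)."""
    return f'"{value}"' if is_literal else f"<{value}>"


def property_value_cqas(property_uri, subjects, objects,
                        objects_are_literals=False):
    """Create one query per value on the small side of a category-like property."""
    subjects, objects = set(subjects), set(objects)
    if not subjects or not objects:
        return []
    small, large = sorted((subjects, objects), key=len)
    # Assumption: the ratio is the larger count over the smaller count.
    if len(large) / len(small) <= RATIO_THRESHOLD or len(small) >= SIZE_THRESHOLD:
        return []
    if small is objects:
        # Few object values: ask for the subjects that carry each value.
        return [
            f"SELECT DISTINCT ?x WHERE {{ ?x <{property_uri}> "
            f"{_term(value, objects_are_literals)}. }}"
            for value in sorted(objects)
        ]
    # Few subject values: ask for the objects attached to each subject.
    return [
        f"SELECT DISTINCT ?x WHERE {{ <{value}> <{property_uri}> ?x. }}"
        for value in sorted(subjects)
    ]
```
]]></preformat>
        <p>With 300 distinct subjects and the 3 literal objects "Value1", "Value2" and "Value3" for &lt;o1#property1&gt;, this sketch returns the three queries of the example; with only 10 distinct subjects the ratio test fails and no query is created.</p>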
        <p>Implementation adaptations. In the initial version of the system, Fuseki
server endpoints are given as input. For the SEALS evaluation, we embedded a
Fuseki server inside the matcher. The ontologies are downloaded from the SEALS
repository, then uploaded to the embedded Fuseki server before the matching
process can start. This downloading-uploading phase may take time, in particular
when dealing with large files.</p>
        <p>The CANARD system in the SEALS package is available at
http://doi.org/10.6084/m9.figshare.7159760.v1. The generated alignments in EDOAL
format are available at
http://oaei.ontologymatching.org/2018/results/complex/taxon/CANARD.html
(with a link for each task pair). Note that, as described
below, CANARD was able to generate results only for the Taxon track.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The CANARD system could only output correspondences for the Taxon dataset
of the Complex track. Indeed, the other datasets of this track do not contain
instances, let alone common instances.</p>
      <p>Table 1 shows the run-time of CANARD on all pairs of ontologies in the
Taxon track, as well as the characteristics of the output alignments. As the
alignment process is directional, we do not obtain symmetrical results for a pair of
ontologies. CANARD is able to generate different kinds of correspondences: (1:1),
(1:n) and (m:n). The best precision, 0.57, was obtained for the pair
agronomicTaxon-agrovoc. CANARD did not output any correspondence
for 4 oriented pairs (in grey in Table 1). These empty results may be due to
a failure of the instance matching phase of our approach. We observed that
with TaxRef as the source knowledge base, no correspondence could be
generated. The exception is the pair taxref-agrovoc, for which 8 correspondences were
found, but these only involve skos:exactMatch or skos:closeMatch properties in their
constructions. The incorrect correspondences of this pair have a low confidence
(between 0.05 and 0.30).</p>
      <p>Regarding the query rewriting task in Taxon, CANARD's alignment was
the one used to rewrite the most queries (best qwr). As CANARD does not deal with
binary CQAs, none of the 3 binary queries × 12 pairs of ontologies = 36 binary
query cases could be dealt with. Out of the 2 unary queries × 12 pairs = 24
unary query cases, CANARD could deal with 6 unary cases needing a complex
correspondence and 2 needing simple correspondences, for a total of 8/24 ≈ 33%
of unary query cases.</p>
      <p>Overall, for the query cases needing complex correspondences, (0+6)/(28+16) ≈
14% were covered by CANARD. Over all the query cases, the CANARD system
could provide an answer to 8/(36+24) ≈ 13% of the cases.</p>
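      <p>The coverage figures above can be re-derived with a few lines of Python. All counts are taken from the text; the variable names are illustrative, and reading the 28 and 16 of the text as the binary and unary cases needing complex correspondences is an assumption of this sketch.</p>
      <preformat preformat-type="code"><![CDATA[
```python
PAIRS = 12  # oriented pairs of ontologies in the Taxon track

binary_cases = 3 * PAIRS  # 3 binary queries per pair
unary_cases = 2 * PAIRS   # 2 unary queries per pair

binary_covered = 0         # CANARD does not handle binary CQAs
unary_covered_complex = 6  # unary cases needing a complex correspondence
unary_covered_simple = 2   # unary cases needing simple correspondences

# Assumed reading of "28+16": binary and unary cases needing complex correspondences.
complex_needed_binary = 28
complex_needed_unary = 16


def percent(covered: int, total: int) -> int:
    """Coverage as a percentage rounded to the nearest integer."""
    return round(100 * covered / total)


unary_coverage = percent(unary_covered_complex + unary_covered_simple, unary_cases)
complex_coverage = percent(binary_covered + unary_covered_complex,
                           complex_needed_binary + complex_needed_unary)
overall_coverage = percent(
    binary_covered + unary_covered_complex + unary_covered_simple,
    binary_cases + unary_cases,
)

print(unary_coverage, complex_coverage, overall_coverage)
```
]]></preformat>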
    </sec>
    <sec id="sec-4">
      <title>General comments</title>
      <p>The CANARD approach relies on common instances between the ontologies to
be aligned. Hence, when such instances are not available, as for the Conference,
GeoLink and Hydrography datasets, the approach is not able to generate
complex correspondences. Furthermore, CANARD is need-oriented and requires a
set of competency questions to guide the matching process. Here, these "questions"
have been automatically generated based on a set of patterns.</p>
      <p>The current version of the system is limited to finding complex
correspondences involving classes; properties are not yet taken into account. We plan
to extend the system to handle binary relations in the next version. Another
point that we would like to improve is the semantics of the confidence of the
correspondences.</p>
      <p>With respect to the technical environment, as mentioned before, the initial
version of the system receives as input the endpoints of the populated
ontologies. With SEALS, the large ontologies are stored in repositories. Our system
hence downloads them and stores them in an embedded Fuseki server. This
configuration is not ideal, as we have to deal with large knowledge bases.
Furthermore, we struggled with the SEALS dependencies in order to correctly package
our system into the SEALS format.</p>
      <p>As we focus on user needs in order to avoid dealing with the whole alignment
space, it would be interesting to have more need-oriented tasks with respect to
alignment coverage.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper presented the adapted version of the CANARD system and its
preliminary results in the OAEI 2018 campaign. This year, we participated
only in the Taxon track, in which ontologies are populated with common
instances. CANARD was the only system to output complex correspondences on
the Taxon track.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Cassia Trojahn has been partially supported by the French CIMI Labex project
IBLiD (Integration of Big and Linked Data for On-Line Analytics).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Parundekar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Linking and building ontologies of linked data</article-title>
          .
          <source>In: ISWC</source>
          . pp.
          <fpage>598</fpage>
          –
          <lpage>614</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Parundekar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Discovering concept coverings in ontologies of linked data sources</article-title>
          .
          <source>In: ISWC</source>
          . pp.
          <fpage>427</fpage>
          –
          <lpage>443</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Thieblin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haemmerle</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trojahn</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Complex matching based on competency questions for alignment: a first sketch</article-title>
          .
          <source>In: Ontology Matching Workshop</source>
          . p.
          <volume>5</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Walshe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brennan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Bayes-ReCCE: A Bayesian model for detecting restriction class correspondences in linked open data knowledge bases</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>12</volume>
          (
          <issue>2</issue>
          ),
          <fpage>25</fpage>
          –
          <lpage>52</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>