<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complex matching based on competency questions for alignment: a first sketch</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elodie Thieblin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ollivier Haemmerle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cassia Trojahn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIT &amp; Universite de Toulouse 2 Jean Jaures</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>A complex alignment between a source ontology o1 and a target ontology o2 is a set of correspondences containing at least one complex correspondence. Complex correspondences (e.g., o1:GenusRank ≡ ∃o2:hasRank.{o2:genus}) involve logical constructors (e.g., property restriction) or transformation functions of literal values (e.g., string concatenation). Complex matching approaches have emerged in the literature in recent years [10, 8, 13, 6]. While some rely on statistical methods [8, 13], others rely on linguistic matching conditions [10] or knowledge rules [6]. Many of them are based on correspondence patterns [10, 8, 13]. Following a different approach, this paper proposes a complex matching approach which relies on the notion of Competency Question for Alignment (CQA). CQAs express the knowledge that an alignment should cover. As in ontology authoring, they take the form of natural language questions or SPARQL queries. Our approach takes as input a set of CQAs translated into SPARQL queries over the source ontology. The answer to each query is a set of instances retrieved from a knowledge base described by the source ontology. These instances are matched with those of a knowledge base described by the target ontology. The correspondence is generated by matching the graph pattern of the source query to the lexically similar surroundings of the target instances. For example, given the source query SELECT ?x WHERE { ?x a o1:GenusRank. } and an output correspondence o1:GenusRank ≡ ∃o2:hasRank.{o2:genus}, one could translate the source query into SELECT ?x WHERE { ?x o2:hasRank o2:genus. }. Our approach was evaluated on a set of four knowledge bases about plant taxonomy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Competency questions for alignment</title>
      <p>
        In ontology matching system design, a question that arises is "Are there any
specifications for the matching process? If so, what are the needs/requirements
that an alignment should meet?". Few guidelines in the literature
characterise an alignment and/or the matching process. One of the few
examples is the NeOn methodology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which characterises both alignment and
matching process through a set of questions: i) is matching performed under
time constraints? ii) has matching to be performed automatically? iii) must
the alignment be correct? complete? and iv) what type of operation (merging,
query, etc.) is to be performed? Through these questions, qualitative and
applicative characteristics of an alignment and the matching process are defined.
However, they do not help to specify the knowledge the alignment should cover,
i.e. its scope. Here, we extend the notion of "needs" for the alignment as defined
in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by proposing the notion of Competency Question for Alignment (CQA).
      </p>
      <p>
        In order to formalise the knowledge needs of an ontology, competency
questions (CQs) have been introduced as ontology requirements in the form of
questions the ontology must be able to answer [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Here, a CQA expresses the
knowledge that an alignment should cover in the best case (if the scopes of both
ontologies can answer the CQA). The first difference between CQAs and CQs in ontology
authoring is that the scope of a CQA is limited by the intersection of its
source and target ontologies' scopes. The second difference is that this maximal
and ideal alignment scope is not known a priori (as determining it is the purpose of the
alignment). Measuring the completeness or the competency of an alignment is,
however, out of the scope of this work.
      </p>
      <p>
        Taking into account the characteristics of CQs in the literature, we adapt
them for CQAs. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors define a set of CQ characteristics (question
type, element visibility, question polarity, predicate arity, modifier, domain
independent element), as well as a set of competency question patterns. Inspired
by the predicate arity in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we introduce the notion of question arity, which
represents the arity of the expected answers to a CQA:
        <list list-type="bullet">
          <list-item><p>A unary question expects a set of instances or values, e.g., "What are the genus taxa?" (Triticum), (Anas).</p></list-item>
          <list-item><p>A binary question expects a set of pairs of instances or values, e.g., "What is the rank of a taxon?" (Plantae, Kingdom), (Triticum, Genus).</p></list-item>
          <list-item><p>An n-ary question expects a set of tuples of size 3 or more, e.g., "In which classification is the rank of a taxon defined?" (Triticum, Genus, Linnaeus 1753), (Plantae, Kingdom, Haeckel 1866).</p></list-item>
        </list>
      </p>
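      <p>As an illustration, the arity of a CQA could be read off the SELECT clause of its SPARQL translation. The following minimal Python sketch works under that assumption; the function name question_arity, the regular expressions and the o1:hasRank property used in the example are hypothetical and are not part of the approach described in this paper.</p>
      <preformat><![CDATA[
```python
import re


def question_arity(sparql: str) -> str:
    """Classify a CQA by the number of distinct variables its SELECT projects.

    Assumption: the CQA is a plain 'SELECT ... WHERE' query with explicit
    variables (no 'SELECT *', no sub-queries).
    """
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.*?)\s+WHERE",
        sparql,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        raise ValueError("expected a SELECT ... WHERE query")
    variables = set(re.findall(r"\?\w+", match.group(1)))
    if not variables:
        raise ValueError("no explicit variable in the SELECT clause")
    if len(variables) == 1:
        return "unary"
    if len(variables) == 2:
        return "binary"
    return "n-ary"


if __name__ == "__main__":
    print(question_arity("SELECT ?x WHERE { ?x a o1:GenusRank. }"))
    # o1:hasRank is a hypothetical property used only for this example.
    print(question_arity("SELECT ?taxon ?rank WHERE { ?taxon o1:hasRank ?rank. }"))
```
]]></preformat>
      <p>Counting projected variables is only a proxy for the arity of the expected answers; it says nothing about the other CQ characteristics listed above (polarity, modifier, question type).</p>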
      <p>
        CQAs can be used both for alignment evaluation, by
verifying that an alignment covers a user-defined scope, as in the OA4QA task
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and for guiding alignment creation. Our approach falls in the latter case.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed approach</title>
      <p>The approach takes as input a set of CQAs translated into SPARQL queries over
the source ontology. The answer to each input query is a set of instances, which
are matched with those of a knowledge base described by the target ontology.
The matching is performed by finding the lexically similar surroundings of the
target instances. Here, CQAs are limited to unary questions (class expressions,
set of instances expected), of selection type, with positive polarity and no modifier.
The approach is developed in 11 steps, as depicted in Figure 1:</p>
      <list list-type="order">
        <list-item><p>Extract the source DL formula e<sub>s</sub> from the SPARQL CQA (e.g., o1:Genus).</p></list-item>
        <list-item><p>Extract lexical information from the CQA: L<sub>s</sub>, the set of labels of the atoms of the DL formula (e.g., "Genus", "genre").</p></list-item>
        <list-item><p>Extract the source instances inst<sub>s</sub> (e.g., o1:triticum).</p></list-item>
        <list-item><p>Find the target instances inst<sub>t</sub> that are equivalent or similar (same label) to the source instances inst<sub>s</sub> (e.g., o1:triticum and o2:wheat).</p></list-item>
        <list-item><p>Retrieve the description of the target instances: set of triples and object/subject type (e.g., ⟨(o2:wheat, o2:genus) : o2:hasRank, o2:genus : o2:Rank⟩, ⟨(o2:emmer wheat, o2:wheat) : o2:hasHigherTaxon, o2:emmer wheat : o2:Taxon⟩).</p></list-item>
        <list-item><p>For each triple, retrieve L<sub>t</sub>, the labels of its entities (e.g., o2:hasRank → "taxonomic rank", o2:genus → "genus", o2:Rank → "rank").</p></list-item>
        <list-item><p>Compare L<sub>s</sub> and L<sub>t</sub> using a string comparison metric (e.g., Levenshtein distance with a threshold).</p></list-item>
        <list-item><p>Keep the triples whose summed label similarity is above a threshold. Keep the object (/subject) type if its similarity is better than that of the object (/subject) (e.g., sim(o2:genus, L<sub>s</sub>) &gt; sim(o2:Rank, L<sub>s</sub>), so we only keep o2:genus in the triple).</p></list-item>
        <list-item><p>Express the triple as a DL formula (e.g., ∃o2:hasRank.{o2:genus}).</p></list-item>
        <list-item><p>Aggregate the formulae into an explicit or implicit form: if two DL formulae have a common atom in their right member (target member), the atoms which differ are put together (e.g., ∃o2:hasRank.{o2:genus} and ∃o2:hasRank.{o2:kingdom} would give two formulae: ∃o2:hasRank.{o2:genus, o2:kingdom} and ∃o2:hasRank.⊤).</p></list-item>
        <list-item><p>Put e<sub>s</sub> and e<sub>t</sub> together in a correspondence (e.g., o1:GenusRank ≡ ∃o2:hasRank.{o2:genus}) and express this correspondence in EDOAL.</p></list-item>
      </list>
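      <p>Steps 7, 8 and 10 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation evaluated in this paper: the normalised Levenshtein similarity, the two threshold values (0.8), the function names and the labels given to o2:hasHigherTaxon, o2:wheat and o2:Taxon are hypothetical, and the aggregation only handles formulae that share their property.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical labels."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def entity_score(labels: List[str], source_labels: List[str], threshold: float) -> float:
    """Step 7: best label similarity, counted as 0 when below the threshold."""
    best = max((similarity(t, s) for t in labels for s in source_labels), default=0.0)
    return best if best >= threshold else 0.0


def filter_triple(
    triple: Tuple[str, str, str],
    target_labels: Dict[str, List[str]],
    source_labels: List[str],
    label_threshold: float = 0.8,   # assumed value
    triple_threshold: float = 0.8,  # assumed value
) -> Optional[Tuple[str, str]]:
    """Step 8: keep a (property, neighbour, neighbour type) triple if similar enough."""
    predicate, neighbour, neighbour_type = triple

    def score(entity: str) -> float:
        return entity_score(target_labels.get(entity, []), source_labels, label_threshold)

    # Prefer the type over the neighbour only when its labels match better.
    kept = neighbour_type if score(neighbour_type) > score(neighbour) else neighbour
    if score(predicate) + score(kept) < triple_threshold:
        return None
    return predicate, kept


def aggregate(pairs: List[Tuple[str, str]]) -> List[str]:
    """Step 10, restricted to formulae that share the same property."""
    by_predicate: Dict[str, List[str]] = {}
    for predicate, value in pairs:
        by_predicate.setdefault(predicate, []).append(value)
    formulae = []
    for predicate, values in by_predicate.items():
        formulae.append("∃" + predicate + ".{" + ", ".join(values) + "}")  # explicit form
        if len(values) > 1:
            formulae.append("∃" + predicate + ".⊤")  # implicit form
    return formulae


if __name__ == "__main__":
    source_labels = ["Genus", "genre"]
    target_labels = {
        "o2:hasRank": ["taxonomic rank"],
        "o2:genus": ["genus"],
        "o2:Rank": ["rank"],
    }
    print(filter_triple(("o2:hasRank", "o2:genus", "o2:Rank"), target_labels, source_labels))
    print(aggregate([("o2:hasRank", "o2:genus"), ("o2:hasRank", "o2:kingdom")]))
```
]]></preformat>
      <p>With the labels of the running example, only o2:genus matches L<sub>s</sub>, so the triple is kept with o2:genus rather than its type o2:Rank, in line with the example of step 8.</p>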
      <p>[Figure 1: Overview of the approach. Source side: (1) DL formula e<sub>s</sub>, (2) URI labels L<sub>s</sub>, (3) answers inst<sub>s</sub>. Target side: (4) sameAs links to inst<sub>t</sub>, (5) surroundings composed of triples with object/subject type, (6) labels L<sub>t</sub> for each triple, (7) similarity, (8) best triples above a threshold, (9) DL formula, (10) aggregation into e<sub>t</sub>. Output: (11) EDOAL correspondence.]</p>
      <p>
We evaluated our approach on a set of four knowledge bases about plant
taxonomy: AgronomicTaxon [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Agrovoc [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], TaxRef-LD [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. All
except AgronomicTaxon contain thousands of taxa (about 32,000 for Agrovoc,
500,000 for TaxRef-LD, 307,000 for DBpedia). Their instances are linked with
skos:exactMatch, skos:closeMatch, owl:sameAs and rdfs:seeAlso. Two CQAs were
used in the evaluation: i) What are the genus taxa? ii) What are the taxa? Each
CQA was manually translated into a SPARQL query for each ontology. All the
source-target combinations of ontologies were tested, resulting in 12 alignment
pairs for each CQA. For each pair, the output correspondences were manually
evaluated. A correspondence was considered correct if its members are
semantically equivalent. The evaluation metrics are i) precision: number of correct
output correspondences / number of output correspondences and ii) top-k
accuracy, as used in the evaluation of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: number of CQAs per pair for which at least
one correct correspondence was output. As we do not compare our alignments to a
reference alignment (because one would not cover all possible complex
correspondences), we cannot compute recall. Table 1 presents, for each pair of ontologies
and for each CQA, the number of correct correspondences out of the total
number of correspondences generated by the approach. The overall precision is 32.8%
(44/134) and the top-k accuracy is 83.3% (20/24). When the ontologies have a
similar structure, we obtain a better precision (Agrovoc – TaxRef-LD).
      </p>
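      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Correct correspondences / output correspondences per source (row) and target (column) ontology.</p>
        </caption>
        <table>
          <thead>
            <tr><th>CQA</th><th>Source / Target</th><th>AgronomicTaxon</th><th>Agrovoc</th><th>TaxRef-LD</th><th>DBpedia</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="4">Genus</td><td>AgronomicTaxon</td><td>-</td><td>1 / 1</td><td>3 / 3</td><td>2 / 15</td></tr>
            <tr><td>Agrovoc</td><td>1 / 3</td><td>-</td><td>3 / 5</td><td>2 / 8</td></tr>
            <tr><td>TaxRef-LD</td><td>1 / 6</td><td>1 / 2</td><td>-</td><td>3 / 10</td></tr>
            <tr><td>DBpedia</td><td>1 / 1</td><td>1 / 2</td><td>4 / 6</td><td>-</td></tr>
          </tbody>
        </table>
      </table-wrap>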
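      <p>The two metrics can be illustrated on the counts of Table 1 for the genus CQA. The Python sketch below is only a worked example: the assignment of each count to a target ontology assumes that the empty cell of each row is the pair of an ontology with itself, and the names GENUS_COUNTS, precision and pairs_with_correct are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import Dict, Tuple

# (source, target) -> (correct correspondences, output correspondences)
Counts = Dict[Tuple[str, str], Tuple[int, int]]

# Genus CQA counts as read from Table 1 (diagonal pairs are not evaluated).
GENUS_COUNTS: Counts = {
    ("AgronomicTaxon", "Agrovoc"): (1, 1),
    ("AgronomicTaxon", "TaxRef-LD"): (3, 3),
    ("AgronomicTaxon", "DBpedia"): (2, 15),
    ("Agrovoc", "AgronomicTaxon"): (1, 3),
    ("Agrovoc", "TaxRef-LD"): (3, 5),
    ("Agrovoc", "DBpedia"): (2, 8),
    ("TaxRef-LD", "AgronomicTaxon"): (1, 6),
    ("TaxRef-LD", "Agrovoc"): (1, 2),
    ("TaxRef-LD", "DBpedia"): (3, 10),
    ("DBpedia", "AgronomicTaxon"): (1, 1),
    ("DBpedia", "Agrovoc"): (1, 2),
    ("DBpedia", "TaxRef-LD"): (4, 6),
}


def precision(counts: Counts) -> float:
    """Correct output correspondences divided by all output correspondences."""
    correct = sum(c for c, _ in counts.values())
    output = sum(t for _, t in counts.values())
    return correct / output if output else 0.0


def pairs_with_correct(counts: Counts) -> int:
    """Number of source-target pairs with at least one correct correspondence."""
    return sum(1 for c, _ in counts.values() if c > 0)


if __name__ == "__main__":
    print(f"precision (genus CQA): {precision(GENUS_COUNTS):.3f}")
    print(f"pairs with a correct correspondence: "
          f"{pairs_with_correct(GENUS_COUNTS)} / {len(GENUS_COUNTS)}")
```
]]></preformat>
      <p>These figures cover the genus CQA only; the overall values reported above (44/134 and 20/24) also include the taxa CQA.</p>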
      <p>Some of the correspondences found were totally wrong, such as "a taxon in Agrovoc
(a concept having a taxonomic rank) is something which has been represented
by a statue in Wikidata" (for the sake of comprehension, we express the
correspondences in natural language). Other correspondences were not precise
enough, such as "a taxon in Agrovoc is something having a taxon below it in a
taxonomy in AgronomicTaxon", which would be correct with a subsumption
relation. For some CQAs, more than one correspondence was evaluated as correct.
The first reason is that some axioms of the ontology are equivalent (inverse
properties, etc.). The second one is that the knowledge bases sometimes import other
ontologies and instances. For example, TaxRef-LD imports data from Agrovoc,
VTO and NCBI. Hence, they share common elements. Finally, as Table 1 shows,
the Taxa CQA with DBpedia as source ontology does not output any correct
correspondence because a taxon in DBpedia is an instance of the dbo:Species
class. The source SPARQL query only contains this URI. Therefore, the query
labels on which the lexical similarity is based are those of dbo:Species, which do
not contain anything related to Taxon. Most of the correspondences found for this
query represent the taxa having species as taxonomic rank.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and perspectives</title>
      <p>This paper introduced the notion of competency questions for alignment (CQAs)
and proposed a complex matching approach guided by CQAs. As the approach
relies on the labels from the SPARQL query, the similarity of the ontologies'
lexical layers impacts the output correspondences. As perspectives, we plan to
perform the instance matching phase using key detection techniques, to use more
linguistic evidence in the matching process, to consider binary CQAs, and to work
on the semantics of the confidence of complex correspondences.</p>
      <p>Acknowledgements We would like to thank Catherine Roussey and Nathalie
Hernandez for their contribution on the Agronomic dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>An</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>I.Y.</given-names>
          </string-name>
          :
          <article-title>Learning to discover complex mappings from web forms to ontologies</article-title>
          .
          <source>In: ACM Conference on Information and knowledge management</source>
          . pp.
          <volume>1253</volume>
          –
          <issue>1262</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Dbpedia: A nucleus for a web of open data. The semantic web (</article-title>
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Caracciolo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stellato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morshed</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johannsen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajbhandari</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaques</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keizer</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The agrovoc linked dataset</article-title>
          .
          <source>Semantic Web</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <volume>341</volume>
          –
          <fpage>348</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Duc</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Methodological guidelines for matching ontologies</article-title>
          . In: Ontology engineering in a networked world, pp.
          <volume>257</volume>
          –
          <fpage>278</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Gruninger,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.S.</surname>
          </string-name>
          :
          <article-title>Methodology for the design and evaluation of ontologies. International joint conference on artificial intelligence</article-title>
          .
          <source>In: Workshop on Basic Ontological Issues in Knowledge Sharing</source>
          . vol.
          <volume>15</volume>
          , p.
          <volume>34</volume>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowd</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Kafle, S.,
          <string-name>
            <surname>Dou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Ontology matching with knowledge rules. In: Transactions on Large-Scale Data-</article-title>
          and
          <string-name>
            <surname>Knowledge-Centered Systems</surname>
            <given-names>XXVIII</given-names>
          </string-name>
          , pp.
          <volume>75</volume>
          –
          <fpage>95</fpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gargominy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tercerie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faron-Zucker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A Model to Represent Nomenclatural and Taxonomic Information as Linked Data</article-title>
          .
          <article-title>Application to the French Taxonomic Register, TAXREF</article-title>
          . In: S4BioDiv (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Parundekar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ambite</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Discovering concept coverings in ontologies of linked data sources</article-title>
          .
          <source>In: ISWC</source>
          . pp.
          <volume>427</volume>
          –
          <fpage>443</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parvizi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mellish</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          , van Deemter,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Stevens</surname>
          </string-name>
          , R.:
          <article-title>Towards Competency Question-Driven Ontology Authoring</article-title>
          .
          <source>In: The Semantic Web: Trends and Challenges</source>
          , vol.
          <volume>8465</volume>
          , pp.
          <volume>752</volume>
          –
          <issue>767</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ritze</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Volker, J.,
          <string-name>
            <surname>Meilicke</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svab Zamazal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Linguistic analysis for complex ontology matching</article-title>
          .
          <source>In: 5th workshop on ontology matching</source>
          . pp.
          <volume>1</volume>
          –
          <issue>12</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Roussey</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanet</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cellier</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amarger</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Agronomic taxon</article-title>
          .
          <source>In: Proceedings of the 2nd International Workshop on Open Data</source>
          . p.
          <fpage>5</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Solimando</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinkel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Evaluating ontology alignment systems in query answering tasks</article-title>
          .
          <source>In: ISWC Posters &amp; Demos</source>
          . pp.
          <volume>301</volume>
          –
          <issue>304</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Walshe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brennan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>O</given-names>
            <surname>'Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          :
          <article-title>Bayes-recce: A bayesian model for detecting restriction class correspondences in linked open data knowledge bases</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>12</volume>
          (
          <issue>2</issue>
          ),
          <volume>25</volume>
          –
          <fpage>52</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>