<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Legato: Results for OAEI 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manel Achichi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zohra Bellahsene</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantin Todorov</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>Legato is an automatic data linking system that handles datasets containing blocks of resources that are highly similar in their descriptions yet distinct, as well as resources with highly heterogeneous descriptions. This paper presents the results of Legato on the Instance Matching track of the Ontology Alignment Evaluation Initiative (OAEI) 2017, run via the SEALS platform. Legato participated in the two sub-tracks of the Instance Matching track. We briefly describe the Legato framework, present the different techniques the system uses to accomplish the data linking task, and discuss its alignment results in comparison with the other tools participating in the 2017 edition of the evaluation campaign.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>General Features and Purpose</title>
      <p>Legato is a data linking tool developed in the framework of the DOREMUS
project.<sup>1</sup> It is designed to match entities from highly heterogeneous graphs
while effectively disambiguating highly similar (yet distinct) resources. Legato is based
on indexing techniques, with a preliminary data cleaning phase that prunes
properties that make the comparison task difficult, and a post-processing phase
that discards erroneous links and lowers the rate of false positives. An important
feature of our system is that it requires very little manual configuration: neither
similarity measures and thresholds nor properties to align are required as input.
The values of the various thresholds inherent to the algorithm are set empirically
so as to ensure maximum performance on a large variety of heterogeneous data.
With this, we aim at placing Legato among the few fully automatic instance
matchers in the state of the art. The system is openly available at
https://github.com/DOREMUS-ANR/legato.</p>
      <sec id="sec-2-1">
        <title>Footnote 1: http://www.doremus.org/</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Specific Techniques Used</title>
      <p>This section briefly describes the overall workflow of Legato, shown in
Figure 1. Its configuration takes a single parameter: the type of resources to
compare and link. The system then automatically processes and compares the
resources, repairs the candidate links, and provides a set of identity links
(owl:sameAs statements). More precisely, Legato implements the following
successive steps.</p>
      <p>Data cleaning. The first step, before representing the resources in a
comparable form, consists in filtering out the problematic properties from the two
input datasets. Legato considers a property problematic if it hinders the comparison
of resources. Consider the example given in Table 1, taken from the DOREMUS
track data of IM@OAEI2017 (the Instance Matching track of the Ontology
Alignment Evaluation Initiative).</p>
      <p>The descriptions mw1 and mw1' are about two equivalent musical works
retrieved from the Philharmonie de Paris (PP) and the Bibliothèque Nationale de France
(BNF), respectively. These descriptions are highly similar, with the notable
exception of their respective ecrm:P3_has_note property values. Taking this
property into account would yield a very low similarity score, and yet the
property is likely to be discovered as a key (because its values are unique) and
therefore to be used in the configuration file of a linking system.</p>
      <p>Properties identified as problematic may be those that have values in a
free-text format, i.e., comments (as in the example above), as well as
resource-specific values that the publisher cannot describe freely. For example, for the
same musical work, two institutions would generally assign different identifiers
in their respective catalogs. We propose to identify problematic properties
automatically by discovering mono-property keys that are valid on both datasets,
i.e., properties for which each object has at most one subject in each of the two datasets.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>ecrm:P3_has_note: an example of a problematic property in DOREMUS data</p>
        </caption>
        <table>
          <thead>
            <tr><th>Resource</th><th>Property</th><th>Value</th></tr>
          </thead>
          <tbody>
            <tr><td>mw1</td><td>a</td><td>efrbroo:F22_Self-Contained_Expression</td></tr>
            <tr><td>mw1</td><td>mus:U70_has_title</td><td>"Sonates"</td></tr>
            <tr><td>mw1</td><td>mus:U12_has_genre</td><td>sonate</td></tr>
            <tr><td>mw1</td><td>ecrm:P3_has_note</td><td>"Cette sonate est constituée de cinq formants : Antiphonie, Trope, Constellation, Strophe et Séquence. Seuls les 2e et 3e formants sont publiés. Le Formant 2 (Trope) est composé de quatre sections : Commentaire, Glose, Texte, Parenthèse, qui peuvent être jouées dans différents ordres. Cette oeuvre nécessite un piano à 3 pédales. - Durée d'exécution : 20 minutes environ"</td></tr>
            <tr><td>mw1'</td><td>a</td><td>efrbroo:F22_Self-Contained_Expression</td></tr>
            <tr><td>mw1'</td><td>mus:U70_has_title</td><td>"Sonates"</td></tr>
            <tr><td>mw1'</td><td>mus:U12_has_genre</td><td>sonate</td></tr>
            <tr><td>mw1'</td><td>ecrm:P3_has_note</td><td>"Date de révision : 1963, comprend : Antiphonie; Trope; Constellation (ou Constellation-Miroir); Strophe; Séquence"</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Instance profiling. Legato creates instance profiles by exploiting the
information in the Concise Bounded Descriptions (CBDs) of the resources.<sup>6</sup> We
extend the CBD notion by also considering the descriptions of the neighboring nodes
of a resource in its graph. At this step, Legato extracts a subgraph for each
resource r that includes all the triples from the CBD of r, the CBDs of its direct
predecessors (linked to r by incoming links), and the CBDs of its direct
successors (linked to r by outgoing links). For instance profiling, Legato only
considers datatype properties. In this way, each resource is represented in its
profile (subgraph) by a set of literals considered relevant for its description. This
strategy avoids having to manually set the graph traversal distance up to which
information should be collected.</p>
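      <p>The data cleaning and instance profiling steps can be illustrated with a minimal sketch. It is not Legato's implementation: the tuple-based triple model, the function names, the hypothetical resources mw2 and mw2', and the simplification of a CBD to the literal-valued triples of a node (blank-node closure is ignored) are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Toy model (an assumption of this sketch): a dataset is a list of
# (subject, property, object, object_is_literal) tuples.


def mono_property_keys(triples):
    """Properties whose every object is attached to at most one subject."""
    subjects = defaultdict(set)
    for s, p, o, _ in triples:
        subjects[(p, o)].add(s)
    properties = {p for _, p, _, _ in triples}
    broken = {p for (p, _), subs in subjects.items() if len(subs) > 1}
    return properties - broken


def problematic_properties(source, target):
    """Properties that behave as mono-property keys in both datasets."""
    return mono_property_keys(source) & mono_property_keys(target)


def build_profile(triples, resource, ignored=frozenset()):
    """Literals of a resource, its direct predecessors and direct successors."""
    nodes = {resource}
    for s, _, o, is_literal in triples:
        if is_literal:
            continue
        if s == resource:
            nodes.add(o)  # direct successor (outgoing link)
        if o == resource:
            nodes.add(s)  # direct predecessor (incoming link)
    return sorted(
        o for s, p, o, is_literal in triples
        if is_literal and s in nodes and p not in ignored
    )


# mw2 and mw2' are hypothetical resources added so that titles repeat.
source = [
    ("mw1", "mus:U70_has_title", "Sonates", True),
    ("mw1", "ecrm:P3_has_note", "Cette sonate est constituee de cinq formants", True),
    ("mw1", "mus:U12_has_genre", "genre/sonate", False),
    ("mw2", "mus:U70_has_title", "Sonates", True),
    ("mw2", "ecrm:P3_has_note", "Autre note", True),
    ("mw2", "mus:U12_has_genre", "genre/sonate", False),
    ("genre/sonate", "rdfs:label", "sonate", True),
]
target = [
    ("mw1'", "mus:U70_has_title", "Sonates", True),
    ("mw1'", "ecrm:P3_has_note", "Date de revision : 1963", True),
    ("mw1'", "mus:U12_has_genre", "genre/sonate", False),
    ("mw2'", "mus:U70_has_title", "Sonates", True),
    ("mw2'", "ecrm:P3_has_note", "Une autre note", True),
    ("mw2'", "mus:U12_has_genre", "genre/sonate", False),
]

problematic = problematic_properties(source, target)
profile = build_profile(source, "mw1", ignored=problematic)
print(problematic)
print(profile)
```
]]></preformat>
      <p>In this toy data only the note property has unique values in both datasets, so it is the one excluded from the profile of mw1, which then contains the title and the label of the neighboring genre node.</p>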
      <p>Instance pre-matching. Once all resources in both datasets are profiled,
Legato employs an indexing technique to project each profile onto a vector space
where terms are weighted by their TF-IDF (Term Frequency-Inverse Document
Frequency) values. Two standard NLP (Natural Language Processing) filters
are applied: tokenization and stop-word removal. Finally, Legato pre-selects the
identity links by computing the well-known cosine similarity between the
vectors. In order to increase recall and to automate the threshold
setting independently of the data, at this stage Legato generates links with a
very low threshold (empirically fixed at 0.2).</p>
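      <p>The pre-matching step can be sketched as follows. This is an illustrative sketch rather than Legato's code: the smoothed IDF formula, the tiny stop-word list, the function names and the example profiles are assumptions, since the paper does not specify them. Only the cosine similarity and the 0.2 threshold come from the text above.</p>
      <preformat><![CDATA[
```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used by Legato).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "de", "la", "le", "et"}


def tokenize(text):
    """Lowercase, split on non-word characters and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Map each profile id to a {term: tf-idf weight} dictionary."""
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(documents)
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        # Smoothed IDF; the exact weighting scheme is an assumption.
        vectors[doc_id] = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pre_match(source, target, threshold=0.2):
    """Candidate links (s, t, score) whose cosine similarity reaches the threshold.

    Source and target ids are assumed to be disjoint.
    """
    vectors = tfidf_vectors({**source, **target})
    links = []
    for s in source:
        for t in target:
            score = cosine(vectors[s], vectors[t])
            if score >= threshold:
                links.append((s, t, score))
    return links


# Hypothetical profiles, flattened to one string per resource.
source_profiles = {
    "s1": "Sonates pour piano Boulez",
    "s2": "Symphonie fantastique Berlioz",
}
target_profiles = {
    "t1": "Boulez sonates piano",
    "t2": "Requiem Mozart",
}

links = pre_match(source_profiles, target_profiles)
for s, t, score in links:
    print(s, t, round(score, 2))
```
]]></preformat>
      <p>Comparing every source profile with every target profile is quadratic; the sketch does so only for clarity and says nothing about how Legato indexes its vectors.</p>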
      <p>Link repairing. To ensure coherence, the alignments selected at the
pre-matching step are passed to the repair module. Note that decreasing the
similarity threshold may increase the number of false positive matches. As indicated
above, a source resource may be erroneously aligned to many target resources</p>
      <sec id="sec-3-1">
        <title>Footnote 6: https://www.w3.org/Submission/CBD/</title>
        <p>
          (and vice versa). This is due to the fact that a single dataset can contain highly similar
descriptions of different resources. Therefore, Legato includes a
post-processing phase that disambiguates between such resources and
repairs the erroneous links generated between them in the previous phase. We
employ a clustering algorithm [<xref ref-type="bibr" rid="ref1">1</xref>] within each dataset in order to group together
similar resources. Then, for each pair of similar clusters (identified by a
cluster matching algorithm) across the two datasets, the resources are compared on
a best-key basis. We apply the RANKey algorithm to identify and rank
the key properties [<xref ref-type="bibr" rid="ref2">2</xref>]. For each link l = (r<sub>s</sub>, r<sub>t</sub>) produced in the earlier step, the
repair module begins by searching for a link from r<sub>s</sub> to a target resource r'<sub>t</sub> ≠ r<sub>t</sub>,
based on the key strategy. If one is found, the target resource r<sub>t</sub> in l is replaced
by r'<sub>t</sub>. If multiple matches are found in that scenario, the one with the
highest similarity score is kept. The repair module aims at improving precision.
        </p>
        <p>Link to the System and Parameters File. We provide an open source
implementation of Legato in a GitHub project at
https://github.com/DOREMUS-ANR/legato. It is available as an Eclipse project. Legato provides
a user interface that lets the user select the source, target and
(if available) reference alignment files for aligning and evaluating the produced links.
If no alignment file exists, Legato produces a set of identity links without
evaluating them.
        </p>
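        <p>The key-based replacement performed by the repair module, as described in the link repairing step, can be sketched as follows. This is a simplified illustration under stated assumptions: the clusters, the value of the selected key for each resource, and the similarity scores are taken as given inputs (the clustering, cluster matching and RANKey steps are not reproduced), and all names and example values are hypothetical.</p>
        <preformat><![CDATA[
```python
def repair_links(links, source_keys, target_keys, clusters, similarity):
    """Replace the target of a link when another target agrees on the key.

    links       -- list of (rs, rt) pairs produced by pre-matching
    source_keys -- dict: source resource -> value of the selected key property
    target_keys -- dict: target resource -> value of the selected key property
    clusters    -- dict: target resource -> set of targets in the same cluster
    similarity  -- dict: (rs, rt) -> similarity score from pre-matching
    """
    repaired = []
    for rs, rt in links:
        key_value = source_keys.get(rs)
        # Candidate replacements: other targets of the cluster sharing the key value.
        candidates = [
            c for c in clusters.get(rt, {rt})
            if c != rt and key_value is not None and target_keys.get(c) == key_value
        ]
        if candidates:
            # Several matches: keep the one with the highest similarity score.
            rt = max(candidates, key=lambda c: similarity.get((rs, c), 0.0))
        repaired.append((rs, rt))
    return repaired


# Hypothetical example: t1 and t1b are highly similar targets in one cluster.
links = [("s1", "t1"), ("s2", "t2")]
source_keys = {"s1": "1963", "s2": "1801"}
target_keys = {"t1": "1957", "t1b": "1963", "t2": "1801"}
clusters = {"t1": {"t1", "t1b"}, "t2": {"t2"}}
similarity = {("s1", "t1"): 0.9, ("s1", "t1b"): 0.8}

repaired = repair_links(links, source_keys, target_keys, clusters, similarity)
print(repaired)
```
]]></preformat>
        <p>In the example, the link from s1 is redirected to t1b because t1b agrees with s1 on the key value even though its similarity score is lower, while the link from s2 is left unchanged.</p>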
        <p>Link to the Set of Provided Alignments. The alignments produced by Legato
on the instance matching track of OAEI 2017 can be downloaded at
https://github.com/manoach/Legato-at-OAEI-2017.</p>
        <sec id="sec-3-1-1">
          <title>Results</title>
          <p>In this section, we present the results obtained by Legato on the data
from the instance matching track of the OAEI 2017 campaign.<sup>7</sup> This year, the
instance matching track contains two tasks and four datasets. Legato participated
in all of these tasks.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Synthetic Task</title>
      <p>
        This task contains synthetic data about creative works. The data have been
generated with the Semantic Publishing Instance Matching Benchmark
(SPIMBENCH) [<xref ref-type="bibr" rid="ref3">3</xref>] by transforming the source instances based on their values,
structure and semantics. The task comprises two matching sub-tasks on two
datasets of different sizes: SPIMBENCH sandbox, which contains 380 resources,
and SPIMBENCH mainbox, which contains 1800.
      </p>
      <p>Tables 2 and 3 show Legato's results compared to those of the other systems
that participated in this task, namely AML, I-Match and LogMap. As it</p>
      <sec id="sec-4-1">
        <title>Footnote 7: http://oaei.ontologymatching.org/2017/</title>
        <p>[Tables 2 and 3: Precision, Recall and F-measure per system (AML, I-Match, Legato, LogMap) on the SPIMBENCH datasets]</p>
        <p>can be seen, Legato achieves the highest score in terms of precision for both
SPIMBENCH sandbox and SPIMBENCH mainbox (98% and 97%, respectively).
Legato performs well overall on this task, achieving a recall of
73% and 70%, and F-measures of 84% and 81% for SPIMBENCH sandbox and
SPIMBENCH mainbox, respectively.</p>
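        <p>As a consistency check on the figures quoted above, the F-measure can be recomputed from the reported precision and recall. The sketch assumes that F-measure denotes the balanced F1 score, i.e., the harmonic mean of precision and recall; the function name is illustrative.</p>
        <preformat><![CDATA[
```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for Legato on the synthetic task.
reported = {
    "SPIMBENCH sandbox": (0.98, 0.73),
    "SPIMBENCH mainbox": (0.97, 0.70),
}

for dataset, (precision, recall) in reported.items():
    print(f"{dataset}: F-measure = {f_measure(precision, recall):.2f}")
```
]]></preformat>
        <p>Under that assumption the recomputed values round to 0.84 and 0.81, which agrees with the reported F-measures of 84% and 81%.</p>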
      </sec>
    </sec>
    <sec id="sec-5">
      <title>DOREMUS Task</title>
      <p>
        The data from the DOREMUS track contain descriptions of real-world
classical music works and events, coming from the catalogs of two major French
cultural institutions (the Philharmonie de Paris and the National Library). These
data have been converted to RDF from their original MARC format using
marc2rdf,<sup>8</sup> a tool specifically designed for that purpose by the DOREMUS team.
These data follow a common ontology [<xref ref-type="bibr" rid="ref4">4</xref>] given by the DOREMUS
model, which extends well-established models for the description of intellectual
works, historically used by libraries.<sup>9</sup>
      </p>
      <p>[Tables 4 and 5: Precision, Recall and F-measure per system (AML, I-Match, Legato, LogMap, NjuLink) on the DOREMUS HT and FPT data]</p>
      <sec id="sec-5-1">
        <title>Footnote 8: https://github.com/DOREMUS-ANR/marc2rdf</title>
      </sec>
      <sec id="sec-5-2">
        <title>Footnote 9: http://data.doremus.org/ontology/</title>
        <p>On both subtasks, two systems stand out in terms of performance: Legato
and NjuLink, which achieve comparable results and considerably outperform the
other participating systems. More precisely, on the Heterogeneities task (HT data),
Legato ranks second after NjuLink with a precision of 93%, a recall of 92% and
an F-measure of 93%. As for the False Positives Trap task (FPT data), it can be
seen in Table 5 that Legato achieves the best results in terms of precision (100%),
recall (98%) and F-measure (99%). It is worth noting that the DOREMUS track
appeared to be problematic for the majority of the systems, with average
F-measure scores of around 0.6 over all participants on both tasks.</p>
        <sec id="sec-5-2-1">
          <title>Discussion</title>
          <p>As seen in the previous section, our system proves to be very effective on
the two sub-tracks of the instance matching track of OAEI 2017, producing
high F-measure scores (above 80% on all tasks). Legato achieved the best
precision in 3 of the 4 instance matching tasks.
Thanks to its repair module, Legato ensures a very high precision, no lower
than 93% on all instance matching tasks. In terms of recall, Legato scored
well on the DOREMUS track, but ranked lowest on the synthetic
data track. We explain that result by the fact that Legato does not yet tackle
the value-based variations that are characteristic of the synthetic data: the lack of
lemmatization in the indexing process of our system amounts to looking only for
exact matches between string values.</p>
          <p>Proposed Improvements of the System. Legato implements an approach that handles
structurally heterogeneous descriptions. However, the current
version of our system is limited in that it does not deal with value-based
heterogeneity and considers exact matches only. Addressing this will therefore be
the main focus of future improvements. Furthermore, we plan to discover matches
between resources coming from multiple data sources simultaneously.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>Conclusion</title>
          <p>In this paper, we presented Legato, an automatic and generic data linking
tool. Legato participated in the OAEI campaign for the first time and was
evaluated on data from the two sub-tracks of the Instance Matching track. The
results showed that Legato is capable of effectively linking both synthetic and
real-world data of a highly heterogeneous nature, achieving results comparable to
those of the best systems and outperforming most of them in terms of precision while
keeping a decent recall level. In addition, Legato achieved the best score on
the FPT DOREMUS data, which contain highly similar resources, thanks to its
post-processing link repairing step. Finally, Legato is among the few participating
systems that are freely available and ready to use by researchers and practitioners.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Maimon</surname>
          </string-name>
          , "
          <article-title>Clustering methods</article-title>
          ," in
          <source>The Data Mining and Knowledge Discovery Handbook</source>
          , pp.
          <fpage>321</fpage>
          –
          <lpage>352</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Achichi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben Ellefi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Symeonidou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          , "
          <article-title>Automatic key selection for data linking</article-title>
          ," in
          <source>Knowledge Engineering and Knowledge Management: 20th International Conference, EKAW 2016, Bologna, Italy, November 19-23, 2016, Proceedings 20</source>
          , pp.
          <fpage>3</fpage>
          –
          <lpage>18</lpage>
          , Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>T.</given-names>
            <surname>Saveta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Daskalaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Flouris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herschel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          , "
          <article-title>Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data</article-title>
          ," in
          <source>Proceedings of the 24th International Conference on World Wide Web</source>
          , pp.
          <fpage>105</fpage>
          –
          <lpage>106</lpage>
          , ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Achichi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bailly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cecconi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Destandau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          , "
          <article-title>Doremus: Doing reusable musical data</article-title>
          ," in
          <source>ISWC: International Semantic Web Conference</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>