Legato: Results for OAEI 2017

             Manel Achichi, Zohra Bellahsene, Konstantin Todorov

                         {firstname.lastname}@lirmm.fr
                     LIRMM / University of Montpellier, France


       Abstract. Legato is an automatic data linking system designed for
       datasets that contain blocks of resources that are highly similar in
       their descriptions yet distinct, as well as resources with highly
       heterogeneous descriptions. This paper presents the results of Legato
       on the Instance Matching track of the Ontology Alignment Evaluation
       Initiative 2017, run via the SEALS platform. Legato participated in
       both sub-tracks of the instance matching track. We briefly describe
       the Legato framework, present the techniques the system uses to
       accomplish the data linking task, and discuss its alignment results in
       comparison with the other tools participating in the 2017 edition of
       the evaluation campaign.


1     Presentation of the System

    We begin by providing an overview of the main characteristics of Legato, as
well as describing briefly the specific techniques applied in the different parts of
its workflow.


1.1    General Features and Purpose

    Legato is a data linking tool developed in the framework of the DOREMUS
project¹. It is designed to match entities from highly heterogeneous graphs
while effectively disambiguating highly similar (yet distinct) resources.
Legato is based on indexing techniques, with a preliminary data cleaning phase
that prunes the properties that make the comparison task difficult, and a
post-processing phase that discards erroneous links and thereby lowers the
rate of false positives. An important feature of our system is that it
requires very little manual configuration: neither similarity measures and
thresholds nor properties to align are required as input. The values of the
various thresholds inherent to the algorithm are set empirically so as to
ensure the best possible performance on a large variety of heterogeneous data.
With this, we aim at placing Legato among the few fully automatic instance
matchers in the state of the art. The system is openly available at the
following link: https://github.com/DOREMUS-ANR/legato.
¹ http://www.doremus.org/

1.2   Specific Techniques Used
   This section briefly describes the overall workflow of Legato, shown in
Figure 1. Its configuration takes a single parameter: the type of the
resources to compare and link. The system then automatically processes and
compares the resources, repairs the candidate links, and outputs a set of
identity links (owl:sameAs statements). More precisely, Legato implements the
following successive steps.


                    Fig. 1: The processing pipeline of Legato


Data cleaning. The first step, before representing the resources in a
comparable form, consists in filtering the problematic properties out of the
two input datasets. Legato considers a property problematic if it hinders the
comparison of resources. Consider the example given in Table 1, taken from the
DOREMUS track data of IM@OAEI2017 (the Instance Matching track of the Ontology
Alignment Evaluation Initiative).
     The descriptions mw1 and mw1’ are about two equivalent musical works
retrieved from the Philharmonie de Paris (PP) and the Bibliothèque Nationale
de France (BNF), respectively. These descriptions are highly similar, with the
notable exception of the respective ecrm:P3 has note property values. Taking
this property into account would yield a very low similarity score; yet,
because its values are unique, it is likely to be discovered as a key and
therefore used in the configuration file of a linking system.
     Properties identified as problematic are typically those whose values are
free text, i.e., comments (as in the example above), as well as those with
resource-specific values that the publisher cannot describe freely. For
example, for the same musical work, two institutions would generally assign
different identifiers in their respective catalogs. To identify problematic
properties automatically, we propose to discover the mono-property keys that
are valid on both datasets, i.e., the properties for which each object value
has at most one subject in each of the two datasets.
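
As a rough illustration of the test described above, the following Python sketch flags a property as problematic when it behaves as a mono-property key in both datasets. The triple representation, the function names and the toy data are assumptions made for this example; they are not taken from the Legato code base.

```python
from collections import defaultdict


def is_mono_property_key(triples, prop):
    """Return True if every value of `prop` is attached to exactly one subject."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            subjects_by_value[o].add(s)
    # Assumption: a property that never occurs is not treated as a key.
    return bool(subjects_by_value) and all(
        len(subjects) == 1 for subjects in subjects_by_value.values()
    )


def problematic_properties(source, target):
    """Properties that are mono-property keys in both datasets."""
    shared = {p for _, p, _ in source} & {p for _, p, _ in target}
    return {
        p
        for p in shared
        if is_mono_property_key(source, p) and is_mono_property_key(target, p)
    }


# Hypothetical toy data: titles repeat, free-text notes are unique.
source = [
    ("pp:mw1", "has_title", "Sonates"),
    ("pp:mw2", "has_title", "Sonates"),
    ("pp:mw1", "has_note", "Cette sonate est constituee de cinq formants"),
    ("pp:mw2", "has_note", "Duree d'execution : 20 minutes"),
]
target = [
    ("bnf:mw1", "has_title", "Sonates"),
    ("bnf:mw2", "has_title", "Sonates"),
    ("bnf:mw1", "has_note", "Date de revision : 1963"),
    ("bnf:mw2", "has_note", "Comprend : Antiphonie ; Trope"),
]

print(sorted(problematic_properties(source, target)))
```

In this toy setting only the note property is filtered out, since its values are unique in both datasets, whereas the title property is kept for the comparison.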

mw1²   a efrbroo:F22 SelfContained Expression
       mus:U70 has title “Sonates”
       mus:U12 has genre sonate³
       ecrm:P3 has note “Cette sonate est constituée de cinq "formants":
       Antiphonie, Trope, Constellation, Strophe et Séquence. Seuls les 2e et
       3e formants sont publiés. Le Formant 2 (Trope) est composé de quatre
       sections : Commentaire, Glose, Texte, Parenthèse, qui peuvent être
       jouées dans différents ordres. Cette oeuvre nécessite un piano à 3
       pédales. - Durée d’exécution : 20 minutes environ”
mw1’⁴  a efrbroo:F22 SelfContained Expression
       mus:U70 has title “Sonates”
       mus:U12 has genre sonate⁵
       ecrm:P3 has note “Date de révision : 1963, comprend : Antiphonie; Trope;
       Constellation (ou Constellation-Miroir); Strophe; Séquence”
Table 1: ecrm:P3 has note, an example of a problematic property in DOREMUS
data


Instance profiling. Legato creates instance profiles by exploiting the
information in the CBDs (Concise Bounded Descriptions) of the resources.⁶ We
extend the CBD notion by also considering the descriptions of the neighboring
nodes of a resource in its graph. At this step, Legato extracts, for each
resource r, a subgraph that includes all the triples from the CBD of r, the
CBDs of its direct predecessors (linked to r by its incoming links), and the
CBDs of its direct successors (linked to r by its outgoing links). For
instance profiling, Legato only considers datatype properties. Each resource
is thus represented by the set of literals in its profile (subgraph) that are
considered relevant for its description. This strategy avoids having to set
manually the graph traversal distance up to which information should be
collected.
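
The profiling step can be sketched as follows in Python. This is a simplified illustration under stated assumptions: triples are plain tuples, the hypothetical `Literal` class marks datatype-property values, and the CBD of a node is reduced to the literals directly attached to it (the blank-node closure of a real CBD is ignored). None of the names below come from the Legato implementation.

```python
class Literal(str):
    """Marks an object value as a literal (datatype property value)."""


def own_literals(triples, node):
    """Literals directly attached to `node` (a simplified stand-in for its CBD)."""
    return [o for s, _, o in triples if s == node and isinstance(o, Literal)]


def build_profile(triples, resource):
    """Collect literals of the resource, its direct predecessors and successors."""
    successors = sorted(
        {o for s, _, o in triples if s == resource and not isinstance(o, Literal)}
    )
    predecessors = sorted(
        {s for s, _, o in triples if o == resource and not isinstance(o, Literal)}
    )
    profile = own_literals(triples, resource)
    for neighbor in predecessors + successors:
        if neighbor != resource:
            profile.extend(own_literals(triples, neighbor))
    return profile


# Hypothetical toy graph around one musical work.
triples = [
    ("ex:work1", "ex:title", Literal("Sonates")),
    ("ex:work1", "ex:composer", "ex:person1"),
    ("ex:person1", "ex:name", Literal("A. Composer")),
    ("ex:event1", "ex:performs", "ex:work1"),
    ("ex:event1", "ex:label", Literal("Concert 1963")),
    ("ex:person1", "ex:born_in", "ex:city1"),
    ("ex:city1", "ex:label", Literal("Far away city")),  # two hops away
]

print(build_profile(triples, "ex:work1"))
```

The literal attached to `ex:city1` is left out of the profile because that node is two hops away from the resource; only direct predecessors and successors contribute.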

Instance pre-matching. Once all resources in both datasets are profiled,
Legato employs an indexing technique to project each profile onto a vector
space where terms are weighted by their TF-IDF (Term Frequency-Inverse
Document Frequency) values. Two standard NLP (Natural Language Processing)
filters are applied: tokenization and stop-word removal. Finally, Legato
pre-selects the identity links by computing the similarity between vectors
with the well-known cosine similarity. In order to increase recall and to set
the threshold automatically and independently of the data, at this stage
Legato generates links with a very low threshold (empirically fixed at 0.2).
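
A minimal Python sketch of this pre-matching step is given below. It is only an illustration: the stop-word list, the exact TF-IDF weighting (raw term count times log(N/df) + 1) and all function names are assumptions, since the text does not specify these details; only the tokenization and stop-word filters, the cosine similarity and the 0.2 threshold come from the description above.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "de", "la", "le", "et"}  # illustrative


def tokenize(text):
    """Lowercase, split on word characters and drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(profiles):
    """Map each profile id to a sparse {term: weight} TF-IDF vector."""
    tokens = {rid: tokenize(text) for rid, text in profiles.items()}
    n_docs = len(tokens)
    df = Counter(term for toks in tokens.values() for term in set(toks))
    return {
        rid: {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in Counter(toks).items()}
        for rid, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pre_match(source, target, threshold=0.2):
    """Candidate links (source id, target id, score) above the threshold.

    Assumes source and target identifiers are distinct.
    """
    vectors = tfidf_vectors({**source, **target})
    links = []
    for s in source:
        for t in target:
            score = cosine(vectors[s], vectors[t])
            if score >= threshold:
                links.append((s, t, score))
    return links


# Hypothetical profiles, each flattened to one string of literals.
source = {"s1": "Sonates piano Antiphonie Trope", "s2": "Symphonie orchestre"}
target = {"t1": "Sonates Antiphonie Trope Constellation", "t2": "Quatuor cordes"}

for s, t, score in pre_match(source, target):
    print(s, t, round(score, 2))
```

Because the threshold is deliberately low, this step favors recall and may emit several candidate targets per source resource.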

⁶ https://www.w3.org/Submission/CBD/

Link repairing. To ensure coherence, the alignments selected at the
pre-matching step are passed to the repair module. Note that lowering the
similarity threshold may increase the number of false positive matches: a
source resource may be erroneously aligned to many target resources (and vice
versa). This is due to the fact that a single dataset can contain highly
similar descriptions of different resources. Therefore, Legato includes a
post-processing phase that disambiguates such resources and repairs the
erroneous links generated between them in the previous phase. We employ a
clustering algorithm [1] within each dataset to group similar resources
together. Then, for each pair of similar clusters (identified by a cluster
matching algorithm) across the two datasets, the resources are compared on a
best-key basis. We apply the RANKey algorithm to identify and rank the key
properties [2]. For each link l = (r_s, r_t) produced in the earlier step, the
repair module begins by searching, based on the key strategy, for a link from
r_s to a target resource r'_t ≠ r_t. If one is found, the target resource r_t
in l is replaced by r'_t. If multiple matches are found in that scenario, the
one with the highest similarity score is kept. The repair module aims at
improving precision.
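
The replacement rule at the end of the repair step can be sketched in Python as follows. This is a deliberately reduced illustration: the clustering and cluster matching stages are skipped, the best-ranked key is assumed to be already known (RANKey itself is not implemented), and every name and value below is hypothetical.

```python
def repair_links(links, source_keys, target_keys, similarity):
    """Replace the target of a link when another target agrees on the key.

    links:        list of (r_s, r_t, score) produced by pre-matching
    source_keys:  {source id: value of the best-ranked key}
    target_keys:  {target id: value of the best-ranked key}
    similarity:   function (r_s, r_t) -> similarity score
    """
    repaired = []
    for rs, rt, score in links:
        key_value = source_keys.get(rs)
        # Alternative targets r'_t != r_t sharing the key value of r_s.
        candidates = [
            t
            for t, value in target_keys.items()
            if t != rt and key_value is not None and value == key_value
        ]
        if candidates:
            # Several key-based matches: keep the most similar one.
            best = max(candidates, key=lambda t: similarity(rs, t))
            repaired.append((rs, best, similarity(rs, best)))
        else:
            repaired.append((rs, rt, score))
    return repaired


# Hypothetical data: s1 was wrongly linked to t2 by the pre-matching step.
scores = {("s1", "t1"): 0.8, ("s1", "t2"): 0.35, ("s1", "t3"): 0.5}
links = [("s1", "t2", 0.35)]
source_keys = {"s1": "op. 27"}
target_keys = {"t1": "op. 27", "t2": "op. 31", "t3": "op. 27"}

print(repair_links(links, source_keys, target_keys, lambda s, t: scores[(s, t)]))
```

In a fuller version, the candidate targets would presumably be restricted to the cluster matched with the cluster of the source resource, which keeps the key comparison local to groups of similar resources.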

Link to the System and Parameters File. We provide an open source
implementation of Legato in a GitHub project at
https://github.com/DOREMUS-ANR/legato. It is available as an Eclipse project.
Legato provides a user interface that lets the user select the source, target
and alignment (if available) files for aligning and evaluating the produced
links. If no alignment file exists, Legato produces a set of identity links
without evaluating them.

Link to the Set of Provided Alignments. The alignments produced by Legato
on the instance matching track of OAEI2017 can be downloaded at
https://github.com/manoach/Legato-at-OAEI-2017.


2     Results

    In this section, we present the results obtained by Legato on the data
from the instance matching track of the OAEI2017 campaign.⁷ This year, the
instance matching track contains two tasks and four datasets. Legato
participated in all of these tasks.


2.1    Synthetic Task

   This task contains synthetic data about creative works. They have been
generated through the Semantic Publishing Instance Matching Benchmark
(SPIMBENCH) [3] by transforming the source instances based on their values,
structure and semantics. The task contains two matching sub-tasks on two
datasets of different sizes: SPIMBENCH sandbox, which contains 380 resources,
and SPIMBENCH mainbox, which contains 1800.
   Tables 2 and 3 show Legato’s results compared to those of the other systems
that participated in this task, namely AML, I-Match and LogMap. As can be
seen, Legato achieves the highest precision on both SPIMBENCH sandbox and
SPIMBENCH mainbox (98% and 97%, respectively). Overall, Legato performs well
on this task, achieving a recall of 73% and 70% and an F-measure of 84% and
81% on SPIMBENCH sandbox and SPIMBENCH mainbox, respectively.

⁷ http://oaei.ontologymatching.org/2017/

System   Precision  Recall  F-measure
AML        0.849    1.000     0.918
I-Match    0.854    0.997     0.920
Legato     0.980    0.730     0.840
LogMap     0.938    0.763     0.841

Table 2: Results for SPIMBENCH sandbox.

System   Precision  Recall  F-measure
AML        0.855    1.000     0.922
I-Match    0.856    0.997     0.921
Legato     0.970    0.700     0.810
LogMap     0.893    0.709     0.790

Table 3: Results for SPIMBENCH mainbox.
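
For readers who want to relate the three columns of Tables 2 and 3, the short Python sketch below recomputes the F-measure from precision and recall. It assumes that the reported F-measure is the usual harmonic mean of the two (the F1 score), which the text does not state explicitly; small differences are expected because the published precision and recall values are themselves rounded.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1), 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall on SPIMBENCH sandbox, copied from Table 2.
sandbox = {
    "AML": (0.849, 1.000),
    "I-Match": (0.854, 0.997),
    "Legato": (0.980, 0.730),
}

for system, (p, r) in sandbox.items():
    print(system, round(f_measure(p, r), 3))
```

The harmonic mean penalizes an imbalance between the two measures, which is why a precision of 98% combined with a recall of 73% yields an F-measure of roughly 84% rather than their arithmetic mean.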

2.2    DOREMUS Task
    The data from the DOREMUS track contain descriptions of real-world
classical music works and events, coming from the catalogs of two major French
cultural institutions (the Philharmonie de Paris and the Bibliothèque
Nationale de France). These data have been converted to RDF from their
original MARC format using marc2rdf, a tool designed specifically for that
purpose by the DOREMUS team.⁸ These data follow a common ontology [4] given by
the DOREMUS model, which extends well-established models for the description
of intellectual works, historically used by libraries.⁹


System   Precision  Recall  F-measure
AML        0.851    0.479     0.613
I-Match    0.680    0.071     0.129
Legato     0.930    0.920     0.930
LogMap     0.406    0.882     0.556
NjuLink    0.966    0.945     0.955

Table 4: Results for HT of the DOREMUS task

System   Precision  Recall  F-measure
AML        0.914    0.427     0.582
I-Match    1.000    0.053     0.101
Legato     1.000    0.980     0.990
LogMap     0.119    0.880     0.210
NjuLink    0.959    0.933     0.946

Table 5: Results for FPT of the DOREMUS task


   Tables 4 and 5 show Legato’s results and those of the four other systems
that participated in this task, namely AML, I-Match, LogMap and NjuLink. On
both subtasks, two systems stand out in terms of performance, Legato and
NjuLink, achieving comparable results and considerably outperforming the other
participating systems. More precisely, on the Heterogeneities task (HT data),
Legato ranks second after NjuLink with a precision of 93%, a recall of 92% and
an F-measure of 93%. As for the False Positives Trap task (FPT data), it can
be seen in Table 5 that Legato achieves the best results in terms of precision
(100%), recall (98%) and F-measure (99%). It is worth noting that the DOREMUS
track appeared to be problematic for the majority of the systems, with average
F-measure scores of around 0.6 over all participants on both tasks.

⁸ https://github.com/DOREMUS-ANR/marc2rdf
⁹ http://data.doremus.org/ontology/


3   Discussion

    As seen in the previous section, our system proves to be very effective on
the two sub-tracks of the instance matching track of OAEI 2017, producing high
F-measure scores (above 80% on all tasks). Legato achieved the best precision
in 3 of the 4 instance matching tasks. Thanks to its repair module, Legato
ensures a very high precision, which is no lower than 93% on any instance
matching task. In terms of recall, Legato scored well on the DOREMUS track,
but ranked last on the synthetic data track. We explain this result by the
fact that Legato does not yet handle the value-based variations that are
characteristic of the synthetic data: without lemmatization in the indexing
process, our system effectively looks only for exact matches between string
values.

Proposed Improvements of the System. Legato implements an approach that
handles structurally heterogeneous descriptions. However, the main limitation
of the current version of our system is that it does not deal with value-based
heterogeneity, but rather considers exact matches only. Addressing this will
therefore be the main focus of future improvements. Furthermore, we plan to
discover matches between resources coming from multiple data sources
simultaneously.


4   Conclusion

    In this paper, we presented Legato, an automatic and generic data linking
tool. Legato participated in the OAEI campaign for the first time and was
evaluated on data from the two sub-tracks of the Instance Matching track. The
results show that Legato is capable of effectively linking both synthetic and
real-world data of a highly heterogeneous nature, achieving results comparable
to those of the best systems and outperforming most of them in terms of
precision while keeping a decent recall level. In addition, Legato achieved
the best score on the FPT DOREMUS data, which contain highly similar
resources, thanks to its post-processing link repairing step. Finally, Legato
is among the few participating systems that are freely available and ready to
use by researchers and practitioners.

Acknowledgements
   This work has been partially supported by the French National Research
Agency (ANR) within the DOREMUS Project, under grant number
ANR-14-CE24-0020.


References
1. L. Rokach and O. Maimon, “Clustering methods,” in The Data Mining and
   Knowledge Discovery Handbook, pp. 321–352, 2005.
2. M. Achichi, M. Ben Ellefi, D. Symeonidou, and K. Todorov, “Automatic key
   selection for data linking,” in Knowledge Engineering and Knowledge
   Management: 20th International Conference, EKAW 2016, Bologna, Italy,
   November 19-23, 2016, Proceedings, pp. 3–18, Springer, 2016.
3. T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, M. Herschel, and A.-C.
   Ngonga Ngomo, “Pushing the limits of instance matching systems: A
   semantics-aware benchmark for linked data,” in Proceedings of the 24th
   International Conference on World Wide Web, pp. 105–106, ACM, 2015.
4. M. Achichi, R. Bailly, C. Cecconi, M. Destandau, K. Todorov, and R. Troncy,
   “Doremus: Doing reusable musical data,” in ISWC: International Semantic Web
   Conference, 2015.