RE-miner for data linking results for OAEI 2020*

    Armita Khajeh Nassiri1[0000-0002-5734-0351], Nathalie Pernelle1,2[0000-0003-1487-393X],
    Fatiha Saïs1[0000-0002-6995-2785], and Gianluca Quercini1[0000-0001-9195-1618]

    1 LRI, CNRS 8623, Paris Saclay University, Orsay F-91405, France
    2 LIPN, CNRS (UMR 7030), University Sorbonne Paris Nord, France
    firstname.lastname@lri.fr


         Abstract. This paper presents the RE-miner results for data linking in the ontol-
         ogy alignment contest OAEI 2020, Spimbench track. RE-miner discovers all min-
         imal and diverse referring expressions of all instances of a given source knowl-
         edge graph. In a second step, it exploits these referring expressions to find the
         possible links to a target knowledge graph. This is the first participation of
         RE-miner in the OAEI campaign, and it achieves the best F-measure on the
         Spimbench dataset.


1     Presentation of the system

As the Web of Data continues to grow, more and more knowledge graphs (KGs) that
cover a wide range of topics are emerging in the Linked Open Data (LOD) Cloud.
As knowledge graphs are usually built independently from one another, the
same Internationalized Resource Identifier (IRI) is not necessarily reused for a given
individual. Thus, it is essential to have systems capable of data linking, i.e., of producing
a set of mappings between the individuals of two knowledge graphs that represent the same
real-world object. RE-miner for data linking is one such system that, given a subset of
class and property mappings between the source and target knowledge graphs, identifies
possible sameAs links between the instances of the two KGs.


1.1     State, purpose, general statement

RE-miner for data linking consists of two main steps. The algorithm is presented in
detail in [4]; here, we omit the details and present the major steps taken in this
campaign. The first step discovers referring expressions for all instances of the
source knowledge graph. A referring expression (RE) is a description that identifies an
instance unambiguously in a class of a knowledge graph. Instantiating the keys of a
class already yields numerous REs, but many more referring expressions can potentially
be found. To reduce the search space, RE-miner focuses on non-key properties.
Both keys and maximal non-keys are obtained using SAKey [5]. Second, all the REs
discovered on a class of the source knowledge graph are used to link to instances of a
target KG. The idea behind using REs for linking is that if an instance x in the target
knowledge graph satisfies a description that uniquely identifies the instance u in the
source knowledge graph, it is probable that the two instances are the same. Using
different referring expressions, an instance u might be linked to different target KG
instances. A voting strategy is employed to choose the most confident link whenever
possible.

* Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
  License Attribution 4.0 International (CC BY 4.0).

1.2   Specific techniques used
This system focuses on the instance matching problem between the instances of a given
class of the source dataset, on which the REs have been discovered, and a target dataset
having a non-empty set of mapped properties to the source. In other words, this ap-
proach assumes the schemas to have previously been aligned.

Create the source dataset. We first create the dataset on the source KG for the given
class C, for which we aim to find the alignments. The dataset is created by keeping
all instances that are of type C, as well as the instances of all sub-classes of C if the
graph's schema is not saturated. For instance, in the Spimbench track, the instances of
the Creative Works class are to be linked. The dataset contains all instances belonging
to this class and its 3 sub-classes, namely NewsItem, BlogPost, and Programme.
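
The filtering step above can be sketched as follows. This is a minimal illustration, assuming the source KG is held in memory as (subject, predicate, object) tuples; the function names, the prefixed predicate strings, and the toy IRIs are hypothetical and are not taken from the RE-miner implementation.

```python
RDF_TYPE = "rdf:type"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"


def subclasses_of(triples, cls):
    """Return cls plus every class reachable from it via rdfs:subClassOf."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == RDFS_SUBCLASS_OF and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def build_source_dataset(triples, cls):
    """Keep only the triples whose subject is typed with cls or a sub-class."""
    classes = subclasses_of(triples, cls)
    instances = {s for s, p, o in triples if p == RDF_TYPE and o in classes}
    return [t for t in triples if t[0] in instances]


# Toy graph with made-up IRIs, loosely following the Spimbench class names.
toy_kg = [
    ("NewsItem", RDFS_SUBCLASS_OF, "CreativeWork"),
    ("BlogPost", RDFS_SUBCLASS_OF, "CreativeWork"),
    ("cw1", RDF_TYPE, "NewsItem"),
    ("cw1", "title", "Budget vote"),
    ("cw2", RDF_TYPE, "BlogPost"),
    ("p1", RDF_TYPE, "Person"),
    ("p1", "name", "Alice"),
]
source_dataset = build_source_dataset(toy_kg, "CreativeWork")
```

Walking the sub-class hierarchy explicitly is only needed when the schema is not saturated; with a saturated graph, every instance already carries the type C directly.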

Referring Expressions. We discover all minimal and diverse referring expressions of
depth 1 on the source knowledge graph [4]. These REs do not contain the existential
quantifier and are conjunctions of atoms (e.g., album(x) ∧ createdBy(x, Beatles) ∧
releasedOn(x, “1966-05-2”) holds as a referring expression when x is instantiated
with Yellow Submarine). We enrich this set with the referring expressions obtained by
instantiating each set of key properties for class C discovered by SAKey. Being a
referring expression, each of these descriptions holds for only one instance in the
class C of the source KG.
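
The defining property of a depth-1 RE (a conjunction of atoms that holds for exactly one instance of the class) can be checked with a few lines of code. The sketch below assumes each instance is stored as a dictionary from property to a set of values; the function names and the toy instances are hypothetical, and the code only tests a given conjunction. It does not reproduce the discovery algorithm of [4].

```python
def satisfies(description, atoms):
    """True if the instance description contains every (property, value) atom."""
    return all(value in description.get(prop, set()) for prop, value in atoms)


def is_referring_expression(instances, atoms):
    """True if exactly one instance of the class satisfies all atoms."""
    matches = [iri for iri, desc in instances.items() if satisfies(desc, atoms)]
    return len(matches) == 1


# Toy class of albums; the second instance and its values are made up.
albums = {
    "YellowSubmarine": {"createdBy": {"Beatles"}, "releasedOn": {"1966-05-2"}},
    "OtherAlbum": {"createdBy": {"Beatles"}, "releasedOn": {"some-other-date"}},
}

by_artist = [("createdBy", "Beatles")]
by_artist_and_date = [("createdBy", "Beatles"), ("releasedOn", "1966-05-2")]

artist_only_is_re = is_referring_expression(albums, by_artist)
artist_and_date_is_re = is_referring_expression(albums, by_artist_and_date)
```

In this toy class the artist alone matches both albums, so it is not an RE, while adding the release date isolates a single instance. Note that here the date alone would already suffice, so the two-atom conjunction would not be minimal in this toy data.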

Linking and Voting Strategy. These REs are then used to find possible links in the
target dataset. To find the candidate links, mapped properties and strict equality are
used between the atoms of an RE and the triples of the target knowledge graph. We
consider two properties mapped if they are strictly equal in source and target.
Consider an instance u of type C in the source dataset, and suppose that k different
referring expressions {RE1(u), ..., REk(u)} have been associated with it. Each of
these REs can be linked to zero, one, or more instances of the target, using the
bottom-up approach explained in [4].
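
As an illustration of candidate generation, the sketch below scans the target instances and keeps those that satisfy every atom of an RE under strict equality. It is a naive scan under stated assumptions, not the bottom-up approach of [4]: the function name, the dictionary-based representation, and the choice to return no candidate when a property is unmapped are all hypothetical.

```python
def candidate_links(re_atoms, target_instances, property_mapping):
    """Return the target IRIs whose descriptions satisfy every atom of the RE.

    re_atoms: list of (source_property, value) pairs.
    target_instances: dict mapping a target IRI to {property: set of values}.
    property_mapping: dict mapping a source property to its target property.
    Assumption: an RE that uses an unmapped property yields no candidate.
    """
    translated = []
    for prop, value in re_atoms:
        if prop not in property_mapping:
            return set()
        translated.append((property_mapping[prop], value))
    return {
        iri
        for iri, desc in target_instances.items()
        if all(value in desc.get(prop, set()) for prop, value in translated)
    }


# Toy target graph and an identity property mapping (both made up).
target = {
    "t1": {"title": {"A"}, "date": {"2020"}},
    "t2": {"title": {"A"}, "date": {"2019"}},
}
mapping = {"title": "title", "date": "date"}

loose = candidate_links([("title", "A")], target, mapping)
tight = candidate_links([("title", "A"), ("date", "2020")], target, mapping)
```

Here `loose` contains both target instances and `tight` only `t1`, which shows how one RE can suggest zero, one, or several links.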
The confidence of each RE is inversely proportional to the number of links it suggests.
However, if the unique name assumption (UNA) is fulfilled, only one sameAs link can
be found between u and an instance x belonging to the target KG. Thus, we propose a
voting strategy that assigns a weight to each distinct link. The weight is the sum of the
confidence degrees of the REs proposing that link. The weights are then normalized
so that they lie between 0 and 1. Finally, the instance x in the target knowledge
graph that is linked to u with the highest weight is selected. For the Spimbench
dataset, we have set a very strict criterion: we match two instances if and only if
the link with the highest weight has a weight equal to one. This way, we only link
two instances when we are highly confident in the link.
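
The voting strategy can be sketched as below. The text does not spell out the normalization, so this sketch assumes that each weight is divided by the sum of all weights for the source instance and that an RE with no candidate does not vote; the function name `vote` and its parameters are hypothetical.

```python
from collections import defaultdict


def vote(links_per_re, threshold=1.0, eps=1e-9):
    """Select a target instance for one source instance, or None.

    links_per_re: one set of candidate target IRIs per RE of the source instance.
    Returns (selected target or None, normalized weights).
    """
    weights = defaultdict(float)
    for candidates in links_per_re:
        if not candidates:
            continue  # assumption: an RE without candidates does not vote
        confidence = 1.0 / len(candidates)  # inversely proportional to #links
        for target in candidates:
            weights[target] += confidence
    total = sum(weights.values())
    if total == 0:
        return None, {}
    # Assumed normalization: divide by the total weight so values lie in [0, 1].
    normalized = {t: w / total for t, w in weights.items()}
    best = max(normalized, key=normalized.get)
    if normalized[best] >= threshold - eps:
        return best, normalized
    return None, normalized


unanimous, _ = vote([{"x"}, {"x"}, {"x"}])
disputed, disputed_weights = vote([{"x"}, {"x", "y"}])
relaxed, _ = vote([{"x"}, {"x", "y"}], threshold=0.5)
```

Under these assumptions, a normalized weight of one is reached only when every voting RE proposes the same single target, which matches the strict Spimbench criterion. In the disputed case, x gets weight 0.75 and y gets 0.25, so no link is produced unless the threshold is relaxed.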

MELT. The Matching EvaLuation Toolkit (MELT) is a framework optimized for OAEI
campaigns, facilitating submissions to the SEALS and HOBBIT evaluation platforms
[2]. The Spimbench track, on which we evaluate our performance, is available on the
HOBBIT (Holistic Benchmarking of Big Linked Data) platform¹. We used MELT to
wrap our system as a HOBBIT package and, as our implementation is in Python, we used
MELT's External Matching. MELT has eased the submission process; however, we
suspect that it causes some run-time overhead.


2     Results
2.1    Spimbench track
Spimbench is an instance matching track and the only track on which we have run
evaluations in this first year of participation. It consists of two datasets of different
sizes: the SANDBOX dataset with about 380 instances and 10000 triples, and the MAINBOX
dataset with about 1800 instances and 50000 triples. We compare our results
with AML [1], Lily [7], FTRL-IM [6], and LogMap [3] in Table 1. All these systems
participated in past editions of the competition.


Table 1. Comparison of performance in the Spimbench track. Time is reported in ms.

      Dataset    System      Precision   Recall   F-measure   Time (ms)
      SANDBOX    AML         0.8348      0.8963   0.8645       6446
      SANDBOX    Lily        0.9835      1.0      0.9917       2050
      SANDBOX    FTRL-IM     0.8542      1.0      0.9214       1525
      SANDBOX    LogMap      0.9382      0.7625   0.8413       7483
      SANDBOX    RE-miner    1.0         0.9966   0.9983       7284
      MAINBOX    AML         0.8385      0.8835   0.8604      38772
      MAINBOX    Lily        0.9908      1.0      0.9953       3899
      MAINBOX    FTRL-IM     0.8558      0.9980   0.9214       2247
      MAINBOX    LogMap      0.8801      0.7094   0.7856      26782
      MAINBOX    RE-miner    0.9986      0.9966   0.9976      33966


The same strategy explained in Section 1.1 is used on both datasets for RE-miner. In
total, 6920 REs are created for the Sandbox dataset, whereas for the Mainbox dataset
there are a total of 39892 REs, among which 14085 are from key instantiation. We
outperform the other systems in terms of Precision and F-measure on both datasets,
showing a slightly better performance than Lily. However, we fall short in terms of
run-time. This is mainly because our system must first compute the keys and non-keys
of a given class using a Java-based application, and then find the REs. Further
optimization could decrease the run-time.

¹ http://project-hobbit.eu/
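
As a quick consistency check on the reported scores, the F-measure can be recomputed as the harmonic mean of precision and recall. The snippet below does this for the two RE-miner rows of Table 1; the function and variable names are illustrative, and the balanced F1 definition is assumed.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) of RE-miner as reported in Table 1.
re_miner = {"SANDBOX": (1.0, 0.9966), "MAINBOX": (0.9986, 0.9966)}
scores = {name: round(f_measure(p, r), 4) for name, (p, r) in re_miner.items()}
```

Rounded to four decimals, the recomputed values are 0.9983 for SANDBOX and 0.9976 for MAINBOX, in agreement with the F-measure column.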

3    General Comments
RE-miner for data linking has shown satisfactory results in the Spimbench instance
matching track. Although the source and target KGs shared almost the same ontology,
there were still some properties that could not be mapped to each other using strict
equality. However, this did not hamper the performance of our system, because
RE-miner usually discovers not just one but many REs for each instance. This allows
the system to choose the target instance that most of the REs agree on. Moreover, for
this dataset, we have been strict, only outputting links in which we are highly
confident. As future work, we aim to modify the system so that it can participate in
more tracks in the coming years, and to focus on improving its run-time.

4    Conclusion
In this paper, we briefly presented the main components of our instance matching sys-
tem RE-miner for data linking. The evaluation of results on the Spimbench track was
presented, and we showed a better Precision and F-measure than other systems taking
part in the campaign this year. However, in terms of run-time, more improvement and
optimization are to be done.

References
1. Faria, D., Pesquita, C., Tervo, T., Couto, F.M., Cruz, I.F.: AML and AMLC results for OAEI
   2019. In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.)
   Proceedings of the 14th International Workshop on Ontology Matching co-located with the
   18th International Semantic Web Conference (ISWC 2019), Auckland, New Zealand,
   October 26, 2019. CEUR Workshop Proceedings, vol. 2536, pp. 101–106. CEUR-WS.org (2019),
   http://ceur-ws.org/Vol-2536/oaei19_paper3.pdf
2. Hertling, S., Portisch, J., Paulheim, H.: MELT - matching evaluation toolkit. In: Semantic
   Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS
   2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings. pp. 231–245 (2019)
3. Jiménez-Ruiz, E.: LogMap family participation in the OAEI 2019. In: Shvaiko, P., Euzenat,
   J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.) Proceedings of the 14th
   International Workshop on Ontology Matching co-located with the 18th International Semantic
   Web Conference (ISWC 2019), Auckland, New Zealand, October 26, 2019. CEUR Workshop
   Proceedings, vol. 2536, pp. 160–163. CEUR-WS.org (2019),
   http://ceur-ws.org/Vol-2536/oaei19_paper11.pdf
4. Khajeh Nassiri, A., Pernelle, N., Saïs, F., Quercini, G.: Generating referring expressions from
   RDF knowledge graphs for data linking. In: Pan, J.Z., Tamma, V., d’Amato, C., Janowicz, K.,
   Fu, B., Polleres, A., Seneviratne, O., Kagal, L. (eds.) The Semantic Web – ISWC 2020. pp.
   311–329. Springer International Publishing, Cham (2020)
5. Symeonidou, D., Armant, V., Pernelle, N., Saïs, F.: SAKey: Scalable almost key discovery in
   RDF data. In: International Semantic Web Conference. pp. 33–49. Springer (2014)
6. Wang, X., Jiang, Y., Luo, Y., Fan, H., Jiang, H., Zhu, H., Liu, Q.: FTRLIM results for OAEI
   2019. In: Shvaiko, P., Euzenat, J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.)
   Proceedings of the 14th International Workshop on Ontology Matching co-located with the
   18th International Semantic Web Conference (ISWC 2019), Auckland, New Zealand,
   October 26, 2019. CEUR Workshop Proceedings, vol. 2536, pp. 146–152. CEUR-WS.org (2019),
   http://ceur-ws.org/Vol-2536/oaei19_paper9.pdf
7. Wu, J., Pan, Z., Zhang, C., Wang, P.: Lily results for OAEI 2019. In: Shvaiko, P., Euzenat,
   J., Jiménez-Ruiz, E., Hassanzadeh, O., Trojahn, C. (eds.) Proceedings of the 14th
   International Workshop on Ontology Matching co-located with the 18th International Semantic
   Web Conference (ISWC 2019), Auckland, New Zealand, October 26, 2019. CEUR Workshop
   Proceedings, vol. 2536, pp. 153–159. CEUR-WS.org (2019),
   http://ceur-ws.org/Vol-2536/oaei19_paper10.pdf