<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Symposium on Advanced Database Systems, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Landmark Explanation: a Tool for Entity Matching (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Baraldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Del Buono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Paganelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Guerra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIEF - University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>We introduce Landmark Explanation, a framework that extends the capabilities of a post-hoc perturbation-based explainer to the EM scenario. Landmark Explanation leverages the specific schema typically adopted by EM datasets, representing pairs of entity descriptions, to generate word-based explanations that effectively describe the matching model. Machine Learning (ML) and Deep Learning (DL) models have been successfully applied to the Entity Matching (EM) problem, as the state-of-the-art approaches demonstrate (e.g., DeepER [1], DeepMatcher [2], DITTO [3], AutoML [4] and others [5, 6, 7]). Nevertheless, they are black-box models: the difficulty to evaluate [8] and to interpret their behaviors [9] hampers their adoption in business scenarios. Although many explanation systems have already been proposed in the literature (e.g., LIME [10], Shapley [11], Anchor [12], and Skater), their application to EM tasks is not straightforward and only a few approaches have partially addressed it [13, 14, 15, 16]. EM is conceived as a binary classification problem, where the classes show whether the pairs of entities described in the dataset records are or are not matching. The structure of the datasets is then “unusual" for ML and DL techniques, which are used to manage records describing single pieces of evidence, and generic techniques for explaining ML and DL models cannot be straightforwardly applied. In this paper, we present Landmark Explanation, a post-hoc perturbation-based local explainer for EM approaches. Post-hoc perturbation-based explainers build a surrogate linear model that approximates the model locally to the instance to explain. The surrogate linear model is trained with synthetic data. The dataset is generated by creating a number of alterations of the record to explain (in the so-called perturbation phase) and predicting their class by applying the original model to them (in the so-called reconstruction phase). The explanation is directly obtained from the surrogate model.</p>
      </abstract>
      <kwd-group>
        <kwd>Entity Matching</kwd>
        <kwd>Post-hoc Explanation</kwd>
        <kwd>Perturbation of EM datasets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Table 1: a pair of non-matching entity descriptions: a Sony white Cybershot T-series digital camera jacket case with stylus (code lcjthcw) and a Sony LCS-CSL Cyber-shot camera case (top loading, leather, black).]</p>
      <p>The explanation is directly obtained from
the surrogate model. The importance of a feature in the decision is computed by multiplying its
value in the record by the linear coefficient of the surrogate model. In textual databases, such as
the ones considered in this paper, the features of the model are typically the words used in the
entity descriptions.</p>
      <p>Example 1. Table 1 shows an example of non-matching descriptions. Both entities refer to
camera cases produced by the same brand, but since their product codes are different they are not
considered as the same entity. An explanation for this record consists of a value associated with
each word in the description. Words are extracted from the descriptions via a tokenization process
(we evaluated the application of stemming techniques and the deletion of stop words). For this
reason the terms “token" and “word" are used as synonyms in this paper.</p>
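The tokenization step just described can be sketched as follows; the regular expression, the naive suffix-stripping stand-in for a stemmer, and the stop-word toggle are illustrative assumptions, not the paper's actual implementation:

```python
import re

def tokenize(description, stem=False, stopwords=frozenset()):
    """Split an entity description into lowercase word tokens.

    Sketch only: the paper evaluated stemming and stop-word removal
    as optional steps; here they are simple toggles."""
    tokens = [t for t in re.findall(r"[a-z0-9\-]+", description.lower())
              if t not in stopwords]
    if stem:  # naive plural stripping, a stand-in for a real stemmer
        tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t
                  for t in tokens]
    return tokens

# Tokens such as these become the features of the surrogate model.
print(tokenize("Sony White Cybershot T Series Digital Camera Jacket Case LCJTHCW"))
```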
      <p>Landmark Explanation leverages the specificity of the EM dataset by introducing two main
innovations. The first is the generation of two explanations per dataset entry, one for each entity
described in the record. The second is a mechanism for computing meaningful explanations,
especially for records belonging to non-matching classes. The descriptions of a non-matching
pair are composed of different words, and selecting the ones that contributed most to the
decision is a complex task even for humans. To address the problem, we inject additional words
extracted from one entity into the second entity before the perturbation. The result is that the
number of different words in non-matching entities decreases, while the similarity increases,
thus enabling the approach to select the most relevant elements for the decision.</p>
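The word-injection idea above can be sketched in a few lines; `inject_landmark_tokens` is a hypothetical helper, not the tool's actual API:

```python
def inject_landmark_tokens(varying_tokens, landmark_tokens):
    """Append to the varying entity the landmark's tokens it lacks.

    Hypothetical sketch of the injection step: after injection the two
    descriptions share more tokens, so the surrogate model can single
    out the distinctive ones instead of many uniformly weak ones."""
    missing = [t for t in landmark_tokens if t not in set(varying_tokens)]
    return varying_tokens + missing

varying = ["sony", "white", "camera", "case", "lcjthcw"]
landmark = ["sony", "black", "camera", "case", "lcs-csl"]
# The injected tokens ("black", "lcs-csl") are tracked so that their
# scores can be reported separately in the augmented explanation.
print(inject_landmark_tokens(varying, landmark))
```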
      <p>
        We implemented Landmark Explanation as an add-on component of the LIME system. The
results of the experiments show that the explanations generated for EM datasets outperform
the ones of the competing approaches in accuracy and “interest" for the users. This paper
summarizes the Landmark Explanation presentations in [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ].
      </p>
    </sec>
    <sec id="sec-1b">
      <title>2. The Landmark Explanation approach</title>
      <sec id="sec-1-1">
        <title>2.1. Landmark Explanation principles</title>
        <p>Landmark Explanation adapts a local post-hoc explanation technique to the EM scenario. Indeed,
the direct application of a perturbation mechanism based on token removal is not effective for
EM datasets. The reason is that removing random tokens is likely to affect both the entities
represented by the two descriptions. The generated synthetic records may then contain null
or incoherent perturbations, where the same tokens referring to the different entities are
removed. These inconsistent perturbations lead to biased explanations. Moreover, post-hoc
explanation systems adopt techniques for generating perturbations based on token removal. The
resulting explanations for non-matching entity descriptions (generally the largest part of the
records in EM datasets) are not useful, as we will describe later on. Landmark Explanation
addresses these issues by introducing the following two main innovations.</p>
        <p>[Figure 1: The Landmark Explanation workflow. Landmark generation and augmentation feed the LIME components (perturbation generation, reconstruction &amp; prediction, explanation via a surrogate model), producing one explanation for each of the L and R entities in a record.]</p>
        <p>Double explanation. The first innovation consists of the generation of two explanations for
each dataset entry. When we compute an explanation, we perturb one description (the varying
entity) and keep its paired description unchanged (the landmark entity). The explanation assigns
an impact to each token of the perturbed description. We repeat the computation by exchanging
the varying and landmark entities. Each result explains the model decision from the perspective
of one of the two entities described in the record.</p>
        <p>Injection of features. The second is a mechanism for contrasting the asymmetric nature
of the EM problem: an explanation of a matching pair is always composed of “interesting"
tokens, since they express the reasons why the entities have been considered as matching. The
same does not happen for non-matching entities, which have many reasons to be different. We
address this issue by injecting additional tokens extracted from the landmark entity into the
varying entity before the perturbation. Therefore, the resulting dataset contains entities close to
the landmark, and the surrogate model trained on these entities will be able to highlight the
distinctive tokens that mainly contribute to the decision. Without the injection, descriptions of
non-matching entities would have a large number of tokens that would uniformly contribute to
the decision with the same low impact.</p>
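The double-explanation scheme can be sketched as a role swap around any perturbation-based explainer; the `explain_fn` callback and the toy scorer below are illustrative assumptions, not the tool's interface:

```python
def double_explanation(record, explain_fn):
    """Produce one explanation per entity by swapping the roles of
    varying and landmark entity (a sketch; `explain_fn` stands for any
    perturbation-based explainer, e.g. a LIME wrapper)."""
    left, right = record
    return {
        "left":  explain_fn(varying=left,  landmark=right),
        "right": explain_fn(varying=right, landmark=left),
    }

# Toy explainer: score each varying token by whether it also occurs
# in the landmark (shared tokens push towards "match").
toy = lambda varying, landmark: {t: (1.0 if t in landmark else -1.0)
                                 for t in varying}
out = double_explanation((["sony", "case", "lcjthcw"],
                          ["sony", "case", "lcs-csl"]), toy)
print(out["left"])  # impacts from the left entity's perspective
```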
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Landmark Explanation explanations</title>
        <p>Let r be a record in an EM dataset representing a pair of entity descriptions (e_L, e_R), each one
composed of a collection of tokens {t^E_1, ..., t^E_nE}, where E ∈ {L, R} and nE is the number of
tokens belonging to the description of the entity e_E. The application of an EM binary
classification model to r returns {0, 1} when r is composed of non-matching or matching entity
descriptions, respectively. An explanation is composed of a score for each description token,
ξ_E = {s^E_1, ..., s^E_nE}, where E ∈ {L, R} and s^E_i ∈ R is the score of token t^E_i. ξ_L is the explanation
generated by selecting e_R as the landmark and, vice-versa, ξ_R by selecting e_L as the landmark.
Positive scores push the decision towards the class of matching entities, negative scores towards
non-matching. The higher the absolute value of the score, the higher the importance of the
token associated with the score. An explanation with augmented features assumes the form
ξ_E = {s^E_1, ..., s^E_nE, s^F_1, ..., s^F_nF}, where, for the explanation ξ_L, the scores s^R_i are those of
the features injected from the entity description e_R (and vice-versa for the explanation ξ_R).</p>
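A toy instance of this notation, with invented scores, shows how a reader of an explanation ranks tokens by absolute score:

```python
# An explanation xi assigns a real score to each token of one entity.
# Positive scores push towards "matching", negative towards
# "non-matching"; larger |score| means higher importance.
# (Toy numbers, for illustration only.)
xi_L = {"sony": 0.31, "case": 0.42, "lcjthcw": -0.55}

def most_important(explanation, k=2):
    """Rank tokens by the absolute value of their score."""
    return sorted(explanation,
                  key=lambda t: abs(explanation[t]), reverse=True)[:k]

print(most_important(xi_L))  # → ['lcjthcw', 'case']
```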
      </sec>
      <sec id="sec-1-3">
        <title>2.3. Landmark Explanation workflow</title>
        <p>Perturbation generation. A representation of the neighborhood of the varying entity is
generated by perturbing its tokens in multiple ways. We use LIME, which generates a series of
textual phrases containing many combinations of the tokens of the varying description.</p>
        <p>Reconstruction and prediction. We reconstruct the schema of the synthetic textual records
obtained in the last step. We concatenate each of these new records with the original landmark
entity. The produced pairs of entities are finally provided as input to the original EM model in
order to obtain the relative prediction scores.</p>
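The perturbation and reconstruction steps can be sketched as follows, assuming LIME-style random token masking (the function names are hypothetical):

```python
import random

def perturb(varying_tokens, n_samples=8, seed=0):
    """LIME-style perturbation sketch: draw random token subsets of
    the varying description (binary masks over its tokens)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in varying_tokens]
        samples.append([t for t, keep in zip(varying_tokens, mask) if keep])
    return samples

def reconstruct(perturbed, landmark_tokens):
    """Reconstruction sketch: pair each perturbed description with the
    untouched landmark so the original EM model can score the pair."""
    return [(p, landmark_tokens) for p in perturbed]

pairs = reconstruct(perturb(["sony", "white", "case"]),
                    ["sony", "black", "case"])
print(len(pairs))  # 8 synthetic record pairs, ready for the EM model
```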
        <p>
          Explanation via surrogate model. Finally, a surrogate linear model (one for each workflow,
one for the left and right entities, respectively) is trained on the perturbed dataset to learn an
approximation of the behavior of the original model in those localities. The surrogate model
takes as input the bag-of-words representation of the perturbed tokens and is trained to learn the
relation between the input and the prediction score produced by the model under explanation.
The coefficients learned during training represent the impact of each token in the prediction,
and are used to generate the explanations of the original EM model for each EM record. In our
implementation we adopt LIME to perform this task, but our approach is transparent to the
explanation tool selected.
        </p>
      </sec>
      <sec id="sec-1-4">
        <title>2.4. Explaining ER Models</title>
        <p>
          Studies applying interpretation techniques in the entity matching area [
          <xref ref-type="bibr" rid="ref14 ref16">16, 14</xref>
          ] and tools, like Mojito [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and ExplainER [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], have been proposed. ExplainER provides a unified interface for
applying well-known interpretation techniques (e.g., LIME, Shapley, Anchor, and Skater) in the
EM scenario. Mojito adapts LIME for the explanation of single EM predictions and represents the
work closest to our approach. It extends LIME in two ways: 1) it exploits the subdivision of EM
data into attributes; 2) it introduces a new form of data perturbation, called LIME-COPY<sup>2</sup>, which
allows generating match elements starting from non-match elements. Differently from Landmark
Explanation, Mojito treats each attribute atomically, distributing its impact equally to its constituent
tokens. Furthermore, Landmark Explanation analyzes the diversified impact that the same
token can generate depending on the entity considered as a landmark for the explanation.
        </p>
        <p><sup>2</sup>In Section 3 we refer to this technique as Mojito Copy since it is part of the Mojito tool.</p>
      </sec>
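The surrogate-model step of Section 2.3 can be sketched as an ordinary least-squares fit over binary bag-of-words rows; this stdlib-only gradient-descent fit is a stand-in for the weighted linear model an explainer like LIME actually fits:

```python
def fit_surrogate(X, y, epochs=500, lr=0.1):
    """Least-squares fit of a linear surrogate y ≈ X·w by batch
    gradient descent (sketch only).

    X: binary bag-of-words rows over the varying entity's tokens,
    y: match scores of the EM model on the reconstructed pairs."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += 2 * err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w  # one coefficient (token impact) per token

# Toy data: the presence of token 0 raises the match score,
# the presence of token 1 barely moves it.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [0.9, 0.1, 0.5, 0.5]
w = fit_surrogate(X, y)
print(w)  # close to the least-squares solution (≈0.73, ≈-0.07)
```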
    </sec>
    <sec id="sec-2">
      <title>3. Experimental evaluation</title>
      <p>We evaluated the explanations generated by Landmark Explanation according to two main
perspectives: the fidelity in representing the EM model (in Section 3.1) and the “quality" of the
explanations. For this last evaluation, we introduce a measure for assessing the interest of the
explanations (in Section 3.2) and we propose an example of explanation for non-matching entity
descriptions (in Section 3.3). This shows the importance of the token injection mechanism.</p>
      <p>Dataset and Model. We perform an experimental evaluation against the datasets provided by
the Magellan library<sup>3</sup>, which is considered a standard benchmark for the evaluation of
EM tasks. The datasets are divided into structured (iTunes-Amazon S-IA, DBLP-ACM S-DA,
DBLP-GoogleScholar S-DG, Walmart-Amazon S-WA), textual (Abt-Buy T-AB) and dirty
(iTunes-Amazon D-IA, DBLP-ACM D-DA, DBLP-GoogleScholar D-DG, Walmart-Amazon D-WA). The
records in all datasets represent pairs of entities described with the same attributes. A label is
provided to express whether the record represents a matching / non-matching pair of entities. A
simple logistic regression model is experimented as the matcher, where the features are the
similarities of the paired attributes in the descriptions. We compute the similarity by applying the
Jaccard measure on the trigrams of the attribute values. The experiments are performed by sampling
100 records per label (all records in datasets with smaller cardinality) and computing their
explanations. We generate base explanations, using the tokens from one entity description, and
augmented explanations, using the tokens of one entity description together with the ones
injected from the second entity description.</p>
      <p><sup>3</sup>https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md</p>
      <sec id="sec-2-1">
        <title>3.1. Fidelity of the explanations</title>
        <p>To evaluate the fidelity of the explanations, i.e., whether the weights assigned by Landmark
Explanation to the tokens generate a surrogate model that is consistent with the EM model, we
randomly remove 25% of the tokens from the record to explain, defining a new item. We then
compare the probability score obtained by passing the new item to the EM model with the one of
the original record, from which we have subtracted the sum of the coefficients associated with the
removed tokens. If the explanation model correctly represents the EM model, these two values
should be close. The experiment is repeated 100 times per class, and the performance is measured
by means of two metrics: the mean absolute error (MAE) between the explanation and the EM
model, and the accuracy, which measures the percentage of times that the probability score of the
new item changes consistently with the sum of the impacts of the removed tokens. Table 2
shows the results of the experiment. The column LIME shows the results obtained with LIME
with the same setting. Non-matching settings also include a comparison with the Mojito Copy
technique.</p>
        <p>Discussion. The experiments show that the surrogate model built by Landmark Explanation
with the base perturbation provides an accurate representation of the EM model for records
representing matching pairs of entities. At the same time, the model built with the augmented
perturbation is an accurate representation of the EM model for records representing non-matching
pairs of entities. In particular, Table 2a shows that Landmark Explanation, applied to records
labeled as matching entities, performs better than LIME when the perturbation is generated with
the base technique (it obtains better accuracy in all datasets and lower MAE in 8/9 datasets). The
augmented generation technique performs slightly worse: in 8/9 datasets it obtains better
accuracy and in 5/9 lower MAE. Note that this can also be motivated by the increased number
of tokens in the augmented explanations. Nevertheless, the scores, when worse, are very close
to LIME. Table 2b shows the accuracy and the MAE obtained analyzing records referring to
non-matching labels. In this scenario, the augmented entity perturbation obtains the best scores,
with an accuracy better than LIME in 3/9 datasets and a lower MAE in 7/9 datasets. Finally,
the copying technique introduced by Mojito to manage records associated with non-matching
labels does not show high performance. The reason is that Mojito generates a perturbation
by duplicating entire attributes. The result of this operation is that the tokens of the replaced
attribute have the same weights, and this decreases the performance.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Quality of the explanations</title>
        <p>Since there are many reasons for two entities to be dissimilar, the explanations of non-matching
entity descriptions are typically “slightly polarized", having negative values distributed in a
range close to zero and no value dominating the others. For the user, this means not being
able to grasp a strong motivation for the non-matching decision. To evaluate if we are able to
generate “interesting explanations", we introduced a heuristic according to which an explanation
for non-matching entities is interesting if it contains tokens that, if injected into the second
entity, would make the record classified as matching. These are the elements that make the
explanation interesting for the users. To evaluate if the explanations generated by Landmark
Explanation satisfy this property, we perform the same experiment described in Section 3.1,
but selecting the tokens to remove: negative tokens are removed when the label represents a
non-matching record (all tokens that contribute to the decision); positive tokens are removed in
case of matching records. In Table 3 we measure the interest, which is the percentage of records
where the removal of the tokens was able to generate a change in the label.</p>
        <p>Discussion. Landmark Explanation generates interesting explanations, and the perturbation
generated with the augmented technique effectively increases “the interest" of non-matching
record explanations. In particular, Table 3a shows that Landmark Explanation is good but
slightly worse than LIME in terms of interest when the records are labeled with the matching
class. This happens even if the surrogate model is accurate (the MAE score is the lowest for all
experiments with the single-entity configuration). The problem is that in most of the cases,
even removing all tokens, the explanation created by Landmark Explanation belongs to the
same class as before the token removal. Note that if we set a decision threshold to 0.4, our
approach has the best results in all datasets. Table 3b shows that the augmented explanations of
non-matching entities generated by Landmark Explanation outperform LIME and Mojito Copy.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Showing the explanations</title>
        <p>[Figure 2: Token-impact explanations (impacts on a -0.5 to 0.5 axis) for the record in Table 1, with the landmark and varying entities exchanged. (a) The base technique: original tokens only (e.g., l_name/l_description tokens such as case, series, cybershot, lcjthcw; r_name/r_description tokens such as sony, top, loading, leather, black, lcs-csl). (b) The augmented technique: original and augmented tokens (e.g., case, camera, white, stylus, jacket, lcjthcw, sony, black, lcs-csl).]</p>
        <p>Figure 2a shows the explanations computed with the base technique for the entity descriptions
in Table 1. We recall that positive impacts push towards the match decision, negative towards
a non-match decision. Landmark Explanation generates two explanations per record and we can
see that no token assumes a particular importance. The resulting explanation is therefore not
interesting (and useful) for the user. Figure 2b shows the explanation obtained by the injection
of the tokens from the landmark. The first explanation (where the right entity is the landmark)
clearly shows that the token case pushes towards the match decision (both the entities refer to
camera cases) and the code lcjthcw towards the non-match decision (it is different from the
code in the second description). The augmented tokens show that the code lcs-csl pushes
towards a match decision. This means that if that code had been part of the description for the
left entity, it would have pushed the model towards a match decision. Similar considerations
can be made by observing the second explanation, obtained by setting the left entity as the
landmark.</p>
      </sec>
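The per-record fidelity check of Section 3.1 can be sketched as follows; the toy EM model whose score is exactly a sum of token weights is an illustrative assumption under which the error vanishes:

```python
def fidelity_error(em_score_fn, record, coeffs, removed):
    """Fidelity check sketch (Section 3.1): the EM score of the record
    with `removed` tokens dropped should be close to the original
    score minus the sum of the removed tokens' surrogate coefficients.
    Returns the per-record absolute error; MAE averages this."""
    reduced = [t for t in record if t not in set(removed)]
    predicted = em_score_fn(record) - sum(coeffs[t] for t in removed)
    actual = em_score_fn(reduced)
    return abs(actual - predicted)

# Toy EM model whose score is exactly the sum of per-token weights,
# so the surrogate coefficients are perfectly faithful.
weights = {"sony": 0.2, "case": 0.3, "lcjthcw": -0.4}
score = lambda tokens: sum(weights[t] for t in tokens)
err = fidelity_error(score, ["sony", "case", "lcjthcw"],
                     weights, ["lcjthcw"])
print(err)  # ≈ 0 for a perfectly faithful surrogate
```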
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>This paper introduces Landmark Explanation, a tool that makes a post-hoc perturbation-based
explainer able to deal with ML and DL models applied to EM datasets. The approach has been
evaluated coupled with the LIME explainer on a simple EM model based on logistic regression.
The results show that the explanations generated by Landmark Explanation outperform
the ones generated by the competing approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebraheem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Distributed representations of tuples for entity resolution</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>1454</fpage>
          -
          <lpage>1467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mudgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arcaute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2018</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Deep entity matching with pre-trained language models</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>50</fpage>
          -
          <lpage>60</lpage>
          . URL: https://doi.org/10.14778/3421424.3421431. doi:10.14778/3421424.3421431.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Buono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pevarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vincini</surname>
          </string-name>
          ,
          <article-title>Automated machine learning for entity matching tasks</article-title>
          , in: EDBT, OpenProceedings.org,
          <year>2021</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gagliardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <article-title>BigDedup: A Big Data Integration Toolkit for Duplicate Detection in Industrial Scenarios</article-title>
          , in: TE, volume
          <volume>7</volume>
          of Advances in Transdisciplinary Engineering, IOS Press,
          <year>2018</year>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1023</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cappuzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <article-title>Creating embeddings of heterogeneous relational datasets for data integration tasks</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2020</year>
          , pp.
          <fpage>1335</fpage>
          -
          <lpage>1349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Brunner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stockinger</surname>
          </string-name>
          ,
          <article-title>Entity matching with transformer architectures - A step forward in data integration</article-title>
          , in: EDBT, OpenProceedings.org,
          <year>2020</year>
          , pp.
          <fpage>463</fpage>
          -
          <lpage>473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Buono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <article-title>Evaluating the integration of datasets</article-title>
          ,
          <source>in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing</source>
          , SAC '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>347</fpage>
          -
          <lpage>356</lpage>
          . URL: https://doi.org/10.1145/3477314.3507688. doi:10.1145/3477314.3507688.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Techniques for interpretable machine learning</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>68</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghorbani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Data shapley: Equitable valuation of data for machine learning</article-title>
          ,
          <source>in: ICML</source>
          , volume
          <volume>97</volume>
          of
          <source>Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2242</fpage>
          -
          <lpage>2251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Anchors: High-precision model-agnostic explanations</article-title>
          ,
          <source>in: AAAI, AAAI Press</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1527</fpage>
          -
          <lpage>1535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ebaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Aref</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <article-title>Explainer: Entity resolution explanations</article-title>
          ,
          <source>in: 2019 IEEE 35th Int. Conf. on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>2000</fpage>
          -
          <lpage>2003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Explaining entity resolution predictions: Where are we and what needs to be done?</article-title>
          ,
          <source>in: Proceedings of the Workshop on Human-In-the-Loop Data Analytics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Cicco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Interpreting deep learning models for entity resolution: an experience report using LIME</article-title>
          ,
          <source>in: aiDM@SIGMOD</source>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>8:1</fpage>
          -
          <lpage>8:4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meliou</surname>
          </string-name>
          ,
          <article-title>Explaining data integration</article-title>
          ,
          <source>Data Engineering</source>
          (
          <year>2018</year>
          )
          <fpage>47</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Del Buono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <article-title>Landmark explanation: An explainer for entity matching models</article-title>
          ,
          <source>in: CIKM, ACM</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4680</fpage>
          -
          <lpage>4684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Del Buono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paganelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerra</surname>
          </string-name>
          ,
          <article-title>Using landmarks for explaining entity matching models</article-title>
          ,
          <source>in: EDBT, OpenProceedings.org</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>451</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>