
Automatic Entity Labeling through Explanation Techniques

(Discussion Paper)

Silvana Castano

Alfio Ferrara

Donatella Firmani

Jerin George Mathew

Stefano Montanelli

Università di Roma Sapienza

Entity resolution (ER) aims at matching records that refer to the same real-world entity, e.g., the same product sold by different websites. Recent solutions to this problem have reached unprecedented accuracy. Nonetheless, due to intrinsic limitations of automatic testing methods, researchers and practitioners know that significant manual effort is still required in production environments for verification and cleaning of ER results. In order to facilitate such activity, we are developing the E2L methodology (Entity to Labels) for automatic computation of human-readable labels of identified entities. Given a selection of entities for which the user wants to compute labels, E2L first extracts relevant features by training a classifier on the ER results, then it leverages the notion of black-box model explanation to select the most important terms for the classifier, and finally it uses those terms to compute labels. In this paper we report our first experiences with E2L. Preliminary results on a real-world application scenario show that E2L labels can provide an accurate description of entities and a natural way for humans to assess the trustworthiness of ER results at a glance.

1. Introduction

Entity Resolution (ER) is the task of finding records in a collection that refer to the same real-world entity. Recent works have investigated the application of machine learning (ML) and deep learning (DL) techniques, demonstrating impressive prediction accuracy [1]. Nonetheless, in production environments, humans are still required to manually inspect the entities identified by the ER process, in order to assess their trustworthiness. This can be a tedious activity, especially when large datasets are considered, with entities consisting of hundreds of records. For this reason, tools that support the manual inspection of ER results and speed up the search for possibly mismatched records are in strong demand. Within this space, we focus on the problem of computing human-readable textual labels of identified entities, such as those in Table 1, which represent a natural way to support human comprehension of what is inside each clustered entity.

[Table 1: Examples of entity labels, e.g., Canon EOS 1100D and Sony A7.]

Existing solutions for analogous tasks (see Section 4) typically require some form of human intervention, such as providing external knowledge (e.g., vocabularies) or a selection of sample labels for training. Fully automated solutions are instead based on token frequencies (e.g., TF-IDF), which may perform poorly on datasets with a skewed entity size distribution. Our main intuition is to exploit (i) recent methods to process natural language such as [2, 3] to discover meaningful patterns in the association between records and entities with no human effort, and (ii) recent explanation techniques such as [4] to reveal such patterns and make them human-readable, by selecting the salient information.

In this paper, we formalize these intuitions by presenting the E2L (Entity to Labels) approach and report our experiences with a real-world application scenario [5] where ER results need to be manually curated. Our current implementation, featuring two representative text classification methods [3, 2] and one popular explanation method [4], can achieve promising results and highlight errors in the ER results. A repository with all our data and scripts is publicly available for download at https://github.com/jermathew/E2L.

2. The E2L methodology

Let $D = \{r_1, \dots, r_n\}$ be a collection of record descriptions referring to a set of entities $\hat{E} = \{e_1, \dots, e_m\}$ with $n > m$. Each record $r \in D$ is related to an entity $e \in \hat{E}$, also referred to as a cluster of records. Given two records $r_1 \in D$, $r_2 \in D$, we refer to them as matching records if they are associated with the same entity.

Our methodology, which we call E2L (Entity to Labels), comprises the sequence of modules in Figure 1, as described below.

1. Classification model. Given a set of entities $E \subseteq \hat{E}$ that ought to be manually checked by the final user, E2L trains a classifier to learn a function $M : D \to E$ over the record collection $D$, such that $M(r) = e$ denotes that the record $r \in D$ is associated with the entity $e$. In order to build the training set, we use standard text processing (e.g., stop-word removal) and tokenization techniques to represent a record $r \in D$ as a sequence of tokens $T(r) = [t_1, \dots, t_h]$. The resulting tokens can be either single terms (e.g., Canon) or noun chunks. A noun chunk provides a single-token representation of a composite noun (e.g., digital camera, USA warranty). Note that this step requires no human effort, as the associations between records and entities required for training are taken directly from the input ER results.

2. Candidate labels. Each element $t \in \bigcup_{r \in D \text{ s.t. } M(r) = e} T(r)$ represents a candidate label for the entity $e$. Given an entity $e$, this module computes a real-valued relevance score for each candidate label by leveraging a black-box explanation technique over the model $M$. Intuitively, consider a token $t \in T(r)$ and let $\hat{r}$ correspond to the record $r$ without $t$. The candidate labels module assigns a higher relevance score to tokens $t$ that more consistently yield $M(\hat{r}) \neq M(r)$, for all $r$ s.t. $M(r) = e$.

Specifically, we use LIME [4] as our black-box explanation technique. In order to compute relevance scores for a given record $r$ and a model $M$, LIME creates a new set $N_r$ of records by randomly removing tokens from $r$. In $N_r$, records are represented as binary vectors where each dimension corresponds to a different token. Then, given a class $e \in E$, each $r' \in N_r$ is labeled according to whether $M(r') = e$ or not. Finally, LIME fits a linear model on $N_r$. The weights of the linear model represent how much each token contributed to $M(r)$. Given an entity $e$, the output of this module is a sorted list of candidate labels and associated relevance scores $L_e = [(l_1, s_1), (l_2, s_2), \dots]$, with $s_i \geq s_{i+1}$.
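The intuition behind the relevance score (a token matters if removing it changes the classifier output) can be illustrated with a minimal leave-one-out sketch. This is a simplified stand-in and not LIME itself: LIME samples many random removals and fits a linear model, whereas the code below removes one token at a time. The keyword-based `predict_proba` function and the `ENTITY_KEYWORDS` dictionary are hypothetical placeholders for the trained classifier.

```python
# Hypothetical stand-in for the trained classifier M: it favours the entity
# whose keyword set overlaps most with the record tokens.
ENTITY_KEYWORDS = {
    "canon eos 1100d": {"canon", "eos", "1100d"},
    "sony a7": {"sony", "a7"},
}

def predict_proba(tokens):
    """Return a probability per entity (toy model, not the E2L classifier)."""
    # Add-one smoothing keeps the distribution defined for empty records.
    counts = {e: 1 + len(kw & set(tokens)) for e, kw in ENTITY_KEYWORDS.items()}
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

def leave_one_out_relevance(tokens, entity):
    """Score each token by the probability drop observed when it is removed."""
    base = predict_proba(tokens)[entity]
    scores = {}
    for i, tok in enumerate(tokens):
        ablated = tokens[:i] + tokens[i + 1:]
        scores[tok] = base - predict_proba(ablated)[entity]
    return scores

record = ["canon", "eos", "1100d", "digital camera", "usa warranty"]
scores = leave_one_out_relevance(record, "canon eos 1100d")
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In this toy setting, the brand and model tokens receive a positive score, while generic noun chunks such as "digital camera" receive a score of zero because removing them does not change the prediction.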

We now describe the candidate labels module in more detail. Let $D_e \subseteq D$ be the set of records that $M$ associates with a given entity $e$, that is $D_e = \{r \in D \mid M(r) = e\}$. For each record $r \in D_e$ we submit its tokens $T(r)$ to the black-box explanation function in order to get their relevance scores $S_r = \{\langle t, s \rangle : t \in T(r), s \in \mathbb{R}\}$. Tokens with positive relevance are then sorted by non-increasing relevance value and selected until their cumulative relevance is greater than or equal to a user-specified fraction $\alpha \in [0, 1]$ of the total. As a result, a selection $S'_r \subseteq S_r$ of tokens is obtained for the record $r$. The set of candidate labels consists of the union of the selected tokens $S'_r$ for each $r \in D_e$, and, for each label $l$, the label relevance $L_e[l]$ is the sum of the relevance scores $S'_r[l]$ over all the records $r \in D_e$. We repeat these steps for each entity $e \in E$.
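The selection and aggregation steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: the per-record relevance scores are given as plain dictionaries (in E2L they would come from the explanation function), and the function names `select_tokens` and `candidate_labels` as well as the example scores are hypothetical.

```python
from collections import defaultdict

def select_tokens(relevance, alpha):
    """Keep the top positive-relevance tokens until they cover a fraction
    alpha of the total positive relevance of the record."""
    positive = sorted(
        ((t, s) for t, s in relevance.items() if s > 0),
        key=lambda ts: ts[1],
        reverse=True,
    )
    total = sum(s for _, s in positive)
    selected, cumulative = {}, 0.0
    for token, score in positive:
        if cumulative >= alpha * total:
            break
        selected[token] = score
        cumulative += score
    return selected

def candidate_labels(per_record_relevance, alpha):
    """Sum the selected token scores over all records of one entity and
    return the candidate labels sorted by non-increasing relevance."""
    label_relevance = defaultdict(float)
    for relevance in per_record_relevance:
        for token, score in select_tokens(relevance, alpha).items():
            label_relevance[token] += score
    return sorted(label_relevance.items(), key=lambda ls: ls[1], reverse=True)

# Hypothetical relevance scores for two records of the same entity.
records = [
    {"canon": 0.5, "eos": 0.25, "camera": 0.125, "black": -0.125},
    {"1100d": 0.5, "canon": 0.25, "kit": 0.125},
]
labels = candidate_labels(records, alpha=0.75)
print(labels)
```

With these example scores, "camera" and "kit" fall below the cumulative threshold and "black" is discarded because its relevance is negative, so only "canon", "1100d", and "eos" remain as candidate labels.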

Running the aforementioned steps can be infeasible if (i) the set $D_e$ of records of an entity $e$ contains a massive number of records or (ii) records in $D_e$ consist of thousands of tokens. In both cases, the black-box explanation function could take a significant amount of time to process $D_e$. In order to address both points, we include in E2L a record sampling step and a token sampling step, described below, to be optionally executed before the black-box model explanation computation.

(i) During the record sampling step, we aim at picking a subset $D'_e \subseteq D_e$ of the records of entity $e$ such that the tokens in $D'_e$ cover most of the relevant tokens in $D_e$. In order to do so, we run the k-means clustering algorithm with parameter $k$ on a vector representation¹ of the input records and then, for each cluster, we select the record closest to its centroid based on the ℓ2-norm. As a result, we obtain $k$ vectors from which we retrieve the corresponding records, which collectively make up $D'_e$. The value $k$, corresponding to the sample size, is set via linear interpolation such that, as the number of records $|D_e|$ grows, the fraction of sampled records decreases.

¹ The vector representation can be chosen arbitrarily, e.g., a TF-IDF vector representing a record or the mean word embedding of its constituent tokens.
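A minimal sketch of the record sampling step follows, assuming records are already encoded as numeric vectors. The interpolation endpoints in `sample_size` (`n_min`, `n_max`, `frac_max`, `frac_min`) are hypothetical values chosen for illustration, since the text does not fix them, and the plain Lloyd's iteration below is only a stand-in for whichever k-means implementation is actually used.

```python
def sample_size(n_records, n_min=10, n_max=1000, frac_max=1.0, frac_min=0.1):
    """Interpolate the sampled fraction linearly between two hypothetical
    endpoints, so that larger clusters are sampled more sparsely."""
    if n_records <= n_min:
        frac = frac_max
    elif n_records >= n_max:
        frac = frac_min
    else:
        w = (n_records - n_min) / (n_max - n_min)
        frac = frac_max + w * (frac_min - frac_max)
    return max(1, round(frac * n_records))

def sq_dist(a, b):
    """Squared Euclidean distance (same argmin as the l2-norm)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm, initialised with the first k vectors."""
    centroids = [tuple(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            groups[nearest].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

def sample_records(vectors, k):
    """Return the indices of the records closest to each centroid."""
    centroids = kmeans(vectors, k)
    picked = {
        min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], c))
        for c in centroids
    }
    return sorted(picked)

vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
picked = sample_records(vectors, k=2)
print(picked, sample_size(505))
```

In practice `k` would be set to `sample_size(len(vectors))`; it is fixed to 2 here only to keep the toy example readable.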

(ii) During the token sampling step, given a record $r \in D_e$ we aim at picking a selection $T'(r) \subseteq T(r)$ of its most representative tokens. To that end, we sort its tokens $T(r)$ by their Term Frequency (TF) in decreasing order, prioritizing noun chunks over single-term tokens. Afterwards, we select the top $p$ tokens as those to be included in the sample for the record $r$. Analogously to the record sampling step, the value $p$ is set via linear interpolation so that, as the number of tokens in $T(r)$ grows, the fraction of selected tokens decreases.

3. Label composition. Candidate labels and associated relevance scores are finally processed to return to the user a label for each entity. Given a user parameter $K$, we return as label the composition (e.g., concatenation) of the top $K$ labels in the candidate list $L_e$.
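The token sampling step and the label composition module can be sketched as below. Two assumptions are made for illustration: noun chunks are recognized simply as tokens containing a space, and "prioritizing noun chunks" is read as ranking all noun chunks ahead of single terms before sorting by frequency. The function names and example inputs are hypothetical.

```python
from collections import Counter

def sample_tokens(tokens, p):
    """Keep the p most frequent tokens, ranking noun chunks (multi-word
    tokens, by assumption) ahead of single terms."""
    tf = Counter(tokens)
    # Sort key: noun chunks first, then higher TF, then alphabetical order
    # to make ties deterministic.
    ranked = sorted(tf, key=lambda t: (" " not in t, -tf[t], t))
    return ranked[:p]

def compose_label(candidates, top_k):
    """Concatenate the top_k candidate labels by relevance."""
    best = sorted(candidates, key=lambda ls: ls[1], reverse=True)[:top_k]
    return " ".join(label for label, _ in best)

tokens = ["canon", "digital camera", "canon", "eos",
          "digital camera", "usa warranty", "black"]
sampled = sample_tokens(tokens, p=3)

candidates = [("canon", 0.75), ("1100d", 0.5), ("eos", 0.25), ("kit", 0.1)]
label = compose_label(candidates, top_k=3)
print(sampled, label)
```

Note that concatenation follows the relevance order, so the composed label contains the right terms but not necessarily in the order of the expected camera name.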

3. Experiences with E2L

The E2L approach is evaluated on the camera dataset of the Alaska Benchmark, an end-to-end benchmark tailored for a variety of tasks related to Data Integration, including ER [5], which has recently been used for the 2020 SIGMOD Programming Contest² and for the two editions of the DI2KG challenge³. The dataset comprises (i) a set of camera descriptions collected from different web sources, and (ii) a manually curated ground truth consisting of camera names (i.e., brand name and model name) for each description, such that multiple descriptions can refer to the same camera. In the evaluation, we take into account the 20 entities with the highest number of records, ranging from 184 to 53 records per entity. The resulting dataset consists of 2171 records. We use the page_title attribute from each description to compose a dataset (hereinafter called the Alaska dataset) as a list of <page_title>,<model_name> pairs, where <model_name> represents the correct label expected for each group of descriptions referring to the same camera. The longest page_title field in the dataset contains 42 words, while the shortest one contains 3 words.

Our experiments were performed on a server environment using an Intel Xeon E5-2966 v4 CPU, 512 GB of RAM, and 4 NVIDIA Tesla P100-SXM2 GPUs. The operating system is Ubuntu 17.10.

Classification model. We exploited two models, an LSTM-based neural network [6] and a pretrained DistilBERT model [2], and we generated two versions of E2L, namely E2L-Glove and E2L-Bert, respectively. The LSTM-based network consists of a pre-trained embedding layer based on GloVe [3] followed by a bidirectional LSTM (Bi-LSTM) layer whose memory dimension is 100. The output of the last time step in the Bi-LSTM is then fed to a fully connected layer of size 64 using ReLU as the activation function. Finally, the resulting output is passed to a fully connected layer of size 20 where softmax is used as the activation function. As for the second model, we leveraged the Transformers library⁴ to set up a pretrained DistilBERT model for a multiclass classification task. This model comprises two parts: the body, consisting of a pretrained DistilBERT model, and a classification head on top of the body, whose last layer is a fully connected layer of size 20 with softmax as the activation function.

² http://www.inf.uniroma3.it/db/sigmod2020contest

³ http://di2kg.inf.uniroma3.it

⁴ https://github.com/huggingface/transformers

Baselines. As baselines for comparison against E2L, we exploit two different approaches for entity labeling, named TFIDF and BART. The choice of TFIDF is motivated by the fact that it is almost a standard solution for terminology retrieval and it provides good results on the entity labeling task. The choice of BART is motivated by the idea of comparing E2L against a solution for document summarization, on the premise that summarizing entity descriptions is an effective way to perform entity labeling. Both approaches start by joining the page_title fields referring to the same camera name in the Alaska dataset. This way, we obtain a set of 20 pseudo-descriptions, one for each camera. These pseudo-descriptions are then tokenized by exploiting the same procedure used in E2L.

• For the TFIDF baseline, we compute TF-IDF on the set $Q$ of pseudo-descriptions. Then, for each pseudo-description $q \in Q$, tokens are sorted by their TF-IDF weights in descending order.

• As for the BART baseline, we feed each $q \in Q$ to a pretrained BART model, namely BART.large.cnn [7]. As a result, we obtain a summary of $q$, that is, a concise and shorter version of $q$. Then, we tokenize and process the summary as in the E2L approach. Tokens are sorted according to their position.

3.1. Experimental comparison

Let $L^a_e = [(l_1, s_1), (l_2, s_2), \dots]$ be a list of candidate labels for the entity $e$ produced by the approach $a$, either one of the E2L versions or one of the baselines, sorted by their relevance score from the most relevant to the least relevant. For each entity $e$, we know the gold label $g_e$ (i.e., the correct camera name) and we aim to evaluate the capability of E2L to build $g_e$ by combining the candidate labels in $L^a_e$. Moreover, we aim to assess how many of the labels we need to employ to obtain exactly the gold label $g_e$. The effectiveness of an entity labeling solution can be measured by observing how many candidate labels are required to obtain the gold label. The lower the number of needed candidates (taken by relevance score in descending order), the higher the effectiveness.

Accordingly, the quality of each approach is measured as follows. First, we create the set $G_e$ of the tokens in the gold label $g_e$, by extracting single terms (i.e., separated by spaces). Then, we do the same for the most relevant candidate label $l_1$, by defining $W_1$ as the set of tokens of $l_1$. Given $W_1$, we define $C_1 = W_1$ and we evaluate precision ($P_1$) and recall ($R_1$) of $a$ at candidate 1 as:

$$P_1 = \frac{|G_e \cap C_1|}{|C_1|}; \qquad R_1 = \frac{|G_e \cap C_1|}{|G_e|}$$

This process is repeated for each of the top $K$ candidate labels produced by $a$. At each step $k > 1$, we define $C_k$ as:

$$C_k = C_{k-1} \cup W_k.$$

The F1-measure ($F_k$) at $k$ is the harmonic mean of precision $P_k$ and recall $R_k$. By exploiting these measures of precision and recall at $k$, we can easily check when the gold label has been completely obtained (i.e., the value $k$ where we have $R_k = 1$) and how many wrong tokens we have collected during the process (i.e., $P_k$). Thus, we measure the overall quality of E2L and the baselines through the notion of Precision at full coverage ($P^*$), which is defined as follows:

$$P^* = P_k : R_k = 1$$

[Table 2: Results for TFIDF, BART, E2L-Glove, and E2L-Bert.]
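The evaluation measures can be sketched in a few lines. This sketch assumes that precision at full coverage is read at the first step at which recall reaches 1, and that tokens are compared case-insensitively; both are assumptions made for illustration, as are the function names and the example candidate list.

```python
def precision_recall_curve(candidates, gold_label):
    """Compute (precision, recall, F1) after each candidate label is added
    to the cumulative token set."""
    gold = set(gold_label.lower().split())
    collected, curve = set(), []
    for label in candidates:
        collected |= set(label.lower().split())
        hits = len(gold & collected)
        precision = hits / len(collected) if collected else 0.0
        recall = hits / len(gold)
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        curve.append((precision, recall, f1))
    return curve

def precision_at_full_coverage(candidates, gold_label):
    """Return the precision at the first step where recall reaches 1,
    or None if the gold label is never fully covered."""
    for precision, recall, _ in precision_recall_curve(candidates, gold_label):
        if recall == 1.0:
            return precision
    return None

# Hypothetical ranked candidates for the gold label "Canon EOS 1100D".
ranked = ["canon", "eos", "camera", "1100d"]
p_star = precision_at_full_coverage(ranked, "Canon EOS 1100D")
print(p_star)
```

In this example, the gold label is fully covered only after four candidates, one of which ("camera") is a noisy token, so precision at full coverage is 3/4.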

In Table 2, we report the values of precision at full coverage ($P^*$) for all the approaches, together with the number and fraction of entities that are correctly retrieved when recall is equal to 1 (i.e., $P^* = 1$), which means that the gold label has been not only completely retrieved but also retrieved without introducing any noisy token, that is, with no errors. The experimental results show that the use of black-box explanation techniques in E2L makes it possible to extract relevant terminology for composing the correct label of entities as a final stage in an ER process. Indeed, although statistical techniques seem to be effective for retrieving relevant terminology, they also appear to be more prone to introducing noisy terms in the candidate labels. On the other hand, data summaries, especially for text, tend to produce longer descriptions that are not concise enough to be taken as a good entity label. By contrast, the terms found by E2L appear to be a good compromise, in that they are more specifically related to the entities at hand, but also short enough to be useful for the task of labeling entities.

ER errors. Limitations of statistical techniques such as TFIDF are even clearer when there are errors in the input ER results. Consider for instance entity merge errors, where different real-world entities are mis-clustered as one entity. Table 3 reports preliminary results on a selection of clusters from our Alaska dataset. Specifically, we considered clusters of different sizes, merged them, and computed labels with TFIDF and E2L-Bert. In the table, we show the labels obtained with the number $K$ of top candidate labels equal to the size of the gold label for each of the considered merged clusters. In the presence of merge errors, statistical techniques like TFIDF fail to identify relevant terminology for all the sub-clusters in the merged cluster, while E2L-Bert can return the labels corresponding to the merged entities, thus supporting manual inspection of results and error detection.

4. Related work

Work related to the proposed E2L approach concerns entity labeling as well as machine learning interpretation.

Entity labeling. A number of solutions have been proposed in the literature for entity labeling, intended as the problem of finding a representative label for a set of records that refer to the same real-world object. A common solution relies on an external knowledge base that works as a reference vocabulary for selecting the most appropriate label to assign to a given entity [8]. Entity labeling can be considered a semantic data mining task where labels emerge from record descriptions and are selected according to the results of text processing techniques, usually based on conventional information retrieval metrics (e.g., [9]). Machine learning techniques are also employed for entity labeling [10]. Automatic solutions to entity labeling can be integrated within human-in-the-loop workflows where domain experts are involved to validate the results of automated solutions (e.g., [11]).

Machine learning interpretation. In recent years there has been a surge of interest in the novel field of interpretability (see [12]). Explanation techniques can be divided into black-box and white-box. The former come with a model-agnostic interface, while the latter rely on the internal mechanisms of the model. In E2L, we adopt LIME [4], which is a widely employed black-box method. Other methods in the same category include SHAP [13] and Anchor [14].

5. Future Work

In this paper, we presented the E2L approach to entity labeling, based on the use of techniques for classification and model explanation. Our current implementation features two representative text classification methods [3, 2] and one popular explanation method [4]. As future work, we plan to include a wider choice of text classification methods and more explanation methods, such as SHAP [13] and Anchor [14].

Furthermore, a limitation of the current approach is that it depends on the ability of a supervised classifier to capture the entity properties. In principle, if the classifier is underperforming, the extracted labels can be less satisfactory. A simple solution could consist in training an ensemble of different models (as opposed to a single classifier) and selecting labels by using a voting system. A more sophisticated solution could be to model the ER process as a binary classifier indicating whether two records match and then apply the explanation engine directly, analogously to [15]. This can be non-trivial and it is left as future work. Indeed, (i) computing pair-wise explanations exhaustively can be infeasible for large datasets and (ii) different record pairs in the same entity can be matched for different reasons (e.g., some camera pairs could share only the model name, while others could share not only the model name but also other technical specifications), and thus important tokens may vary significantly among pairs.

[1] Li, Suhara, Doan, W.-C. Tan, Deep entity matching with pre-trained language models, arXiv:2004.00584 (2020).

[2] Sanh, Debut, Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv:1910.01108 (2019).

[3] Pennington, Socher, C. D. Manning, GloVe: Global vectors for word representation, in: EMNLP, 2014, pp. 1532-1543.

[4] M. T. Ribeiro, Singh, Guestrin, "Why Should I Trust You?" Explaining the Predictions of Any Classifier, in: KDD, 2016, pp. 1135-1144.

[5] Crescenzi, A. De Angelis, D. Firmani, M. Mazzei, P. Merialdo, F. Piai, D. Srivastava, Alaska: A flexible benchmark for data integration tasks, arXiv:2101.11259 (2021).

[6] Hochreiter, Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735-1780.

[7] Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019).

[8] Dou, Wang, H. Liu, Semantic Data Mining: A Survey of Ontology-based Approaches, in: ICSC, 2015, pp. 244-251.

[9] Sun, Xiao, Wang, Wang, On Conceptual Labeling of a Bag of Words, in: Int. Joint Conference on Artificial Intelligence, 2015.

[10] N. C. de Araújo, V. P. Machado, A. H. M. Soares, R. de M.S. Veras, Automatic Cluster Labeling Based on Phylogram Analysis, in: IJCNN, 2018, pp. 1-8.

[11] D. R. Karger, Oh, Shah, Efficient Crowdsourcing for Multi-class Labeling, in: SIGMETRICS, 2013, pp. 81-92.

[12] Z. C. Lipton, The Mythos of Model Interpretability, ACM Queue 16 (2018) 31-57.

[13] S. M. Lundberg, S.-I. Lee, A Unified Approach to Interpreting Model Predictions, in: Advances in Neural Information Processing Systems, 2017, pp. 4765-4774.

[14] M. T. Ribeiro, Singh, Guestrin, Anchors: High-precision Model-agnostic Explanations, in: Proc. of the 32nd AAAI Conf. on Artificial Intelligence, 2018.

[15] V. D. Cicco, Firmani, Koudas, Merialdo, Srivastava, Interpreting deep learning models for entity resolution: an experience report using LIME, in: aiDM@SIGMOD, 2019, pp. 8:1-8:4.