                                How Transformers Are Revolutionizing Entity
                                Matching
                                Matteo Paganelli1 , Donato Tiano2 , Francesco Del Buono2 , Andrea Baraldi2 ,
                                Riccardo Benassi2 , Giacomo Guiduzzi2 and Francesco Guerra2,∗
                                1
                                    Hasso Plattner Institute, Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam, Germany
                                2
                                    University of Modena and Reggio Emilia, Via P. Vivarelli 10, Modena, Italy


                                               Abstract
                                               State-of-the-art Entity Matching (EM) approaches rely on transformer architectures to capture hidden
                                               matching patterns in the data. Although their adoption has resulted in a breakthrough in EM performance,
                                               users have limited insight into the motivations behind their decisions. In this paper, we perform an
                                               extensive experimental evaluation to understand the internal mechanisms that allow the transformer
                                               architectures to obtain such outstanding results. The main findings resulting from this evaluation are: (1)
                                               off-the-shelf transformer-based EM models outperform previous (deep-learning-based) EM approaches;
               (2) different pre-training tasks result in different levels of effectiveness, which is only partially
               explained by differences in the learned record representations; and (3) the fine-tuning process based
               on a binary classifier limits the generalization of the models to out-of-distribution data and prevents
               them from learning entity-level representations.

                                               Keywords
                                               Entity Matching, Data integration, Transformers, Interpretability




                                1. Introduction
                                Data integration aims to combine heterogeneous data sources into a single, unified, duplicate-
                                free data representation. This improves information organization and accessibility, facilitating
                                more efficient decision-making processes and improving overall data quality. One of the main
                                steps of the data integration pipeline is Entity Matching (EM) which aims to recognize records
                                that refer to the same real-world entity.
   Nowadays, this task is mainly approached through supervised methods, where deep learning
models are trained on pairs of records labeled as match (or 1), if the two records refer to the
same entity, or as non-match (or 0) otherwise [1]. Architecturally, these models consist
of two fundamental components: 1) an encoder, whose goal is to generate meaningful record
pair representations, and 2) a binary classifier that classifies the encoder output as match or

                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
                                ∗
                                    Corresponding author.
matteo.paganelli@hpi.de (M. Paganelli); donato.tiano@unimore.it (D. Tiano); francesco.delbuono@unimore.it
(F. D. Buono); andrea.baraldi96@unimore.it (A. Baraldi); riccardo.benassi@unimore.it (R. Benassi);
giacomo.guiduzzi@unimore.it (G. Guiduzzi); francesco.guerra@unimore.it (F. Guerra)
ORCID: 0000-0001-8119-895X (M. Paganelli); 0000-0003-0605-4184 (D. Tiano); 0000-0003-0024-2563 (F. D. Buono);
0000-0002-1015-5490 (A. Baraldi); 0009-0007-4819-259X (R. Benassi); 0000-0003-0819-405X (G. Guiduzzi);
0000-0001-6864-568X (F. Guerra)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




non-match. While the classifier typically coincides with a linear layer or a multi-layer perceptron
(MLP), most of the complexity of the model resides in the encoder. Current state-of-the-art
approaches, like Ditto [2] and R-SupCon [3], implement this component via the transformer
architecture [4] or derived models (such as BERT [5], SBERT [6], and RoBERTa [7]), which are
able to capture hidden matching patterns in the data after a fine-tuning process [2, 8, 9].
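   To make this two-component architecture concrete, the following is a minimal sketch in
PyTorch with the Hugging Face transformers library; the serialization of the record pair, the
checkpoint, and the example records are illustrative assumptions, not the exact implementation
of any of the cited systems.

```python
# Minimal sketch of a transformer-based EM model: a pre-trained encoder
# (component 1) produces a representation of the serialized record pair,
# and a linear head (component 2) classifies it as match / non-match.
import torch
from transformers import AutoModel, AutoTokenizer

class EMModel(torch.nn.Module):
    def __init__(self, lm: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(lm)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, **encoded):
        out = self.encoder(**encoded)
        cls = out.last_hidden_state[:, 0]   # [CLS] summarizes the record pair
        return self.classifier(cls)         # logits: non-match (0) / match (1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EMModel()
# The two records are serialized and fed to the model as a sentence pair.
left = "title: iphone 12 brand: apple price: 799"
right = "title: apple iphone 12 64gb price: 799.00"
encoded = tokenizer(left, right, return_tensors="pt", truncation=True)
prediction = model(**encoded).argmax(-1).item()   # 1 = match, 0 = non-match
```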
   The adoption of transformer architectures has resulted in a breakthrough in the effectiveness
of the EM approaches. However, they are black-box architectures, and it is not easy to
understand the internal mechanisms that allow them to obtain such outstanding results.
Providing an answer to this question is crucial to increase their trustworthiness and promote
their application in real-world scenarios [10].
   This paper is an extended abstract of [11, 12], where we addressed this problem. More
specifically, we analyzed how transformer-based architectures perform the EM task from
three perspectives1 : (1) how off-the-shelf transformer-based EM models perform compared to
state-of-the-art EM approaches (Section 3); (2) the impact of the pre-training technique on the
ability of the transformer to learn the EM task (Section 4); and (3) their performance in
recognizing entities (Section 5) and their ability to generalize to out-of-distribution data
(Section 6).
   The three main findings that we obtained by answering the previous questions are:

       1. Off-the-shelf transformer-based EM models outperform previous deep-learning-based EM
          models (like DeepMatcher[13]) and perform well even on dirty data, where values are
          misplaced across attributes;
       2. Different pre-training tasks result in different levels of effectiveness, which is only
          partially explained by a different learning of record representations. Only R-SupCon can
          differentiate the knowledge encoded in the embeddings between matching and non-
          matching records;
       3. Models that are fine-tuned for EM via a binary classifier do not fully recognize cliques of
          entity descriptions and have limited generalization capacity to out-of-distribution data.


2. The Experimental Analysis
This section describes the experimental setup adopted to answer the three research questions
mentioned above.
Datasets. We performed the experiments on the datasets provided by the Magellan library2 ,
which is the reference benchmark for the evaluation of EM tasks. The datasets consist of pairs of
entity descriptions sharing a common structure. Table 1 summarizes some statistical measures
describing the datasets: the total number of record pairs (fourth column), the percentage of
pairs associated with a match label (fifth column), and the number of attributes (last column).
Each dataset is already split into train, validation, and test sets with a ratio of 3:1:1.
Models. The evaluation considers four EM models based on the transformer architecture ranging
from simple baselines to more advanced and fully-fledged state-of-the-art approaches.
1
    Further analyses are available in the original papers; they are not reported here for reasons of space.
2
    https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md
Table 1
The datasets used for the evaluation.
             Acronym    Type          Dataset                 Size    % Match    # Attributes
             S-FZ       Structured    Fodors-Zagats            946      11.63         6
             S-DG       Structured    DBLP-GoogleScholar    28,707      18.63         4
             S-DA       Structured    DBLP-ACM              12,363      17.96         4
             S-AG       Structured    Amazon-Google         11,460      10.18         3
             S-WA       Structured    Walmart-Amazon        10,242       9.39         5
             S-BR       Structured    BeerAdvo-RateBeer        450      15.11         4
             S-IA       Structured    iTunes-Amazon            539      24.49         8
             T-AB       Textual       Abt-Buy                9,575      10.74         3
             D-IA       Dirty         iTunes-Amazon            539      24.49         8
             D-DA       Dirty         DBLP-ACM              12,363      17.96         4
             D-DG       Dirty         DBLP-GoogleScholar    28,707      18.63         4
             D-WA       Dirty         Walmart-Amazon        10,242       9.39         5


        • BERT[5]. This is a simple baseline where the BERT language model is used to encode
          pairs of records into meaningful pair representations and a subsequent binary classifier is
          asked to predict match or non-match based on these representations;
        • SBERT[6]. SBERT is a modification of BERT that uses a siamese architecture to generate
          meaningful sentence embeddings whose distance approximates the sentence similarity.
          The objective of this training is very close to the one adopted in EM and therefore provides
          an alternative form of training for EM models. Similar to the BERT baseline, we use
          SBERT to produce a pair representation which is provided as input to a binary classifier;
        • Ditto [2]. Ditto is a RoBERTa-based model customized for EM through the injection of
          domain knowledge and the application of data augmentation to the input data;
        • R-SupCon [3]. R-SupCon is a RoBERTa-based model for product matching that applies a
          pre-training procedure based on supervised contrastive learning [14]. The idea is to force
          the model to create embedding representations that are close for descriptions referring to
          the same real-world entity and far apart for descriptions of different entities (a minimal
          sketch of this loss follows the list).
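   The following is a minimal sketch of the supervised contrastive loss [14] on which R-SupCon
builds; the batch construction, the temperature value, and the embedding dimension are
illustrative assumptions, and the actual R-SupCon training pipeline differs in several details.

```python
# Minimal sketch of the supervised contrastive loss [14]: embeddings of
# records with the same entity label are pulled together, all others are
# pushed apart.
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    z = F.normalize(embeddings, dim=1)             # unit-norm embeddings
    sim = z @ z.T / temperature                    # scaled cosine similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool)
    sim.masked_fill_(self_mask, float("-inf"))     # anchors never match themselves
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    # Average negative log-probability over the positives of each anchor.
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / n_pos
    return loss.mean()

# Toy batch: four record embeddings describing two entities (labels 0 and 1).
emb = torch.randn(4, 768, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1])
print(supcon_loss(emb, labels))
```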
   While Ditto and R-SupCon represent state-of-the-art EM methods, BERT and SBERT provide
some baselines to evaluate the performance of off-the-shelf transformer-based architectures on
the EM task without relying on further optimizations. For these models, we considered both a
pre-trained (PT) and a fine-tuned (FT) version. The architecture of the pre-trained model extends
the original language model with two fully connected layers of 100 and 2 neurons respectively
(the 2 output neurons represent the match and non-match classes). These additional layers have
been trained on the EM task to predict whether pairs of input records are matching, while the
original pre-trained model remains unaltered. The fine-tuned architecture instead consists of a
single classification layer inserted on top of the embedding corresponding to the [CLS] token,
which summarizes the contents of the entire pair of records3 . The whole architecture is here
trained on the EM task, thus modifying the weights of the original language model.
3
    This is the standard practice for fine-tuning language models on downstream tasks [2, 8].
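   As a minimal sketch of these two training regimes (the base checkpoint, variable names, and
learning rates are assumptions; the layer sizes follow the description above):

```python
# Sketch of the two training regimes used for the BERT/SBERT baselines.
import torch
from transformers import AutoModel

# PT variant: the original language model stays frozen; only a small MLP
# head (100 and 2 neurons) is trained on the EM task.
pt_encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in pt_encoder.parameters():
    p.requires_grad = False                        # pre-trained weights unaltered
hidden = pt_encoder.config.hidden_size             # 768 for bert-base
pt_head = torch.nn.Sequential(
    torch.nn.Linear(hidden, 100), torch.nn.ReLU(), torch.nn.Linear(100, 2))
pt_optimizer = torch.optim.AdamW(pt_head.parameters(), lr=1e-3)

# FT variant: a single classification layer on top of the [CLS] embedding;
# the whole architecture, encoder included, is trained on the EM task.
ft_encoder = AutoModel.from_pretrained("bert-base-uncased")
ft_head = torch.nn.Linear(hidden, 2)
ft_optimizer = torch.optim.AdamW(
    list(ft_encoder.parameters()) + list(ft_head.parameters()), lr=2e-5)
```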
Table 2
The effectiveness of the tested models in the EM task.
                        DM+    BERT (pt)  BERT (ft)  SBERT (pt)  SBERT (ft)   Ditto   R-SupCon
               S-FZ   100.00     97.67      97.67       97.67      100.00     97.78      92.68
               S-DG    94.70     92.40      94.78       92.47       94.24     94.97      80.54
               S-DA    98.45     97.41      98.65       96.84       98.30     96.86      99.21
               S-AG    70.70     63.26      68.52       64.88       60.48     75.31      79.23
               S-WA    73.60     59.89      78.85       60.23       78.05     85.40      80.12
               S-BR    78.80     82.76      84.85       82.76       84.85     90.32      96.55
               S-IA    91.20     85.19      93.10       77.19       93.10     92.31      85.71
               T-AB    62.80     59.50      83.51       57.79       84.18     87.04      93.43
               D-IA    79.40     84.21      94.74       75.00       93.10     83.64      68.18
               D-DA    98.10     96.10      98.42       95.98       98.42     96.65      99.44
               D-DG    93.80     92.27      94.77       91.22       95.05     94.86      80.13
               D-WA    53.80     50.76      77.33       55.26       76.68     87.05      77.06
               AVG     82.95     80.12      88.77       78.94       88.04     90.18      86.02
               STD     15.40     17.03       9.90       16.19       11.71      6.74      10.04



3. Entity Matching effectiveness
This experiment evaluates the effectiveness of off-the-shelf transformer-based EM models (like
the proposed BERT and SBERT baselines) compared to EM state-of-the-art methods. In addition
to Ditto and R-SupCon, we also consider DeepMatcher (DM+)[13], which is a reference deep-
learning-based EM approach that does not rely on a transformer architecture. The results are
shown in Table 2, which reports the F1 score for each model.
Discussion. Even if DM+ obtains good results with most datasets, off-the-shelf transformer-based
EM models outperform it. This is particularly evident for the fine-tuned versions compared to
the original pre-trained versions. Regarding the BERT-based EM baseline, fine-tuning improves
the performance by around 8%, and by more than 10% with large dirty datasets (i.e., with more
than 10k records).
   BERT and SBERT achieve similar accuracy levels in almost all datasets. Moreover, they both
show better performance in dirty datasets than in structured datasets. This result is consistent
with [8, 15], which show that transformer architectures are particularly robust to dirty data
(e.g., where values are misplaced across attributes).
   Ditto achieves the best effectiveness: it obtains an average F1 score of 90.18%, which is 2-4
points higher than the other tested models. This derives from the injection of domain knowledge
and the application of a more advanced technique for encoding attribute values.
   Finally, we observe that the average performance of R-SupCon is not as good as expected.
It outperforms Ditto by about 4% on some datasets (e.g., T-AB, D-DA, S-DA, S-BR, and S-AG).
However, it performs poorly on the structured and dirty versions of the DBLP-GoogleScholar
and iTunes-Amazon datasets (on average 12.5% lower). One of the reasons is that the approach
was run with its default hyper-parameters, without dataset-specific tuning.
     (a) Average Jaccard similarity between record pairs:

               Match   Non-match
      S-FZ     0.569     0.248
      S-DG     0.538     0.171
      S-DA     0.724     0.149
      S-AG     0.423     0.220
      S-WA     0.407     0.292
      S-BR     0.553     0.264
      S-IA     0.455     0.319
      T-AB     0.192     0.133
      D-IA     0.467     0.333
      D-DA     0.691     0.164
      D-DG     0.578     0.201
      D-WA     0.438     0.324
      AVG      0.503     0.235
      STD      0.141     0.072

     (b) Embedding similarity in BERT, SBERT, Ditto and R-SupCon: box plots of the cosine
     similarity between record embeddings for BERT (pt/ft), SBERT (pt/ft), Ditto, and R-SupCon,
     separately for matching and non-matching pairs. [plot omitted]

Figure 1: Impact of different pre-training procedures in entity representation similarity.


4. The impact of the pre-training technique
This section investigates the importance of the technique adopted for pre-training transformer-
based models in learning how to solve EM. The BERT model is pre-trained on two tasks:
masked word prediction and next sentence prediction. The effectiveness
of these techniques has been largely demonstrated in many NLP problems [7]. However, we
have limited knowledge of whether these pre-training techniques are the most effective for
learning EM. Therefore, we wonder whether a different pre-training technique could improve
the accuracy on EM tasks. We selected SBERT and R-SupCon because, as mentioned in Section
2, they introduce alternative forms of pre-training, based respectively on sentence similarity
and on label knowledge, that aim to produce similar representations for records referring to
the same real-world entity.
   More specifically, we analyze how entity representations change after different pre-training
procedures. For each pair, we first compute the embeddings of both records4 and then the
similarity between the two embeddings. Figure 1(a) shows the average Jaccard similarity
between records, divided into matching and non-matching pairs. These values provide a reference
for evaluating the similarity of the embeddings. Matching pairs have a greater similarity than
non-matching pairs; therefore, we expect a model that can discriminate between matches and
non-matches to encode this “distance” at the level of embeddings. The cosine similarity values
for the embeddings computed by the tested models are shown in Figure 1(b).
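   As a minimal sketch of how the two similarity measures can be computed (mean pooling
follows footnote 4; the checkpoint, the word-level tokenization for Jaccard, and the example
records are illustrative assumptions):

```python
# Sketch of the two similarity measures: Jaccard similarity over record
# tokens, and cosine similarity over mean-pooled token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def record_embedding(record: str) -> torch.Tensor:
    enc = tokenizer(record, return_tensors="pt", truncation=True)
    with torch.no_grad():
        tokens = encoder(**enc).last_hidden_state[0]   # (tokens, hidden)
    return tokens.mean(dim=0)                          # average word embeddings

left = "dblp: data integration survey vldb 2020"
right = "scholar: a survey on data integration vldb 2020"
print("jaccard:", jaccard(left, right))
cos = torch.cosine_similarity(record_embedding(left), record_embedding(right), dim=0)
print("cosine:", cos.item())
```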
Discussion. The pre-trained versions of BERT and SBERT show a compact distribution of the
cosine similarity of the embeddings. The fine-tuning step increases the variability of these
results, but the median similarity remains approximately the same. We observe that BERT and
SBERT generate very high cosine similarities (≥ 0.9) for both matching and non-matching records.
4
We average the embeddings of the words in the record.
Table 3
BERT and R-SupCon accuracy in discovering entities.
                                              Incomplete cliques (%)               F1
                                  # cliques    BERT       R-SupCon        BERT    R-SupCon
                  Computers           83       13.25%       10.84%     92.82        88.78
                  Cameras             44        6.82%        4.55%     90.85        90.25
                  Shoes               44       20.45%       29.55%     89.04        74.25
                  Watches             70       17.14%       11.43%     94.47        80.95
                  S-DG (Valid)        50       18.00%       34.00%     94.78        80.54
                  D-DG (Valid)        50       24.00%       36.00%     94.77        80.13
                          AVG        56.83     16.61%       21.06%     92.79        82.48


Therefore, the embedding similarity alone cannot tell whether the records refer to the same
entity or not. This can probably be explained by the well-known anisotropy phenomenon: token
embeddings occupy a narrow cone, resulting in a high similarity between any sentence pair [16].
Ditto shows a similar behavior: the median of the similarity does not significantly change in
descriptions referring to matching and non-matching entities. This is expected since Ditto relies
on the standard BERT architecture that does not train the model to learn this kind of knowledge.
Conversely, R-SupCon is the only approach that learns a different behavior for matching and
non-matching entity descriptions. The similarity of the generated embeddings is consistent
with the Jaccard similarity shown in Figure 1(a). This is the result of the contrastive learning
technique, which requires that records referring to the same entity have closer embeddings
than records of different entities.


5. Recognizing the entities
This experiment aims to evaluate the ability of transformer-based EM models to perform Entity
Resolution, i.e., to identify groups of records that refer to the same real-world entity. Real-world
entities are typically identified by computing the transitive closure of the matching decisions on
pairs of records. This generates cliques, where the records included in each clique represent an
entity [17]. The EM task is usually modeled in the literature as a binary classification problem.
Therefore, an EM model classifies each pair independently and has no explicit notion of the
groups of records that refer to the same real-world entity. Nevertheless, evaluating whether
these models preserve the cliques gives us insights into their understanding of the entity concept.
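   As a minimal sketch of this grouping step (record identifiers and the predicted matches are
toy values; a real pipeline would use the model's pairwise decisions):

```python
# Sketch of how entities are derived from pairwise matching decisions: the
# transitive closure of the predicted matches yields connected components
# (the "cliques"), computed here with a simple union-find.
from collections import defaultdict

def entities(record_ids, predicted_matches):
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in predicted_matches:          # union records predicted as match
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for r in record_ids:
        groups[find(r)].add(r)
    return list(groups.values())

# Three records of the same entity plus one singleton.
print(entities(["r1", "r2", "r3", "r4"], [("r1", "r2"), ("r2", "r3")]))
# -> [{'r1', 'r2', 'r3'}, {'r4'}]
```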
   In this experiment, we examine how many cliques are recognized by the model with respect to
the ground truth. Of all the datasets used in the previous experiments, only the S-DG and D-DG
datasets generate cliques of size greater than 2. Therefore, we included the datasets describing
computers, cameras, shoes, and watches from the WDC benchmark5. We train an EM model on
the training set from the benchmark, apply the model to the validation set, and calculate the
cliques comprising descriptions of matching entities. In this experiment, we compare R-SupCon
5
    https://webdatacommons.org/largescaleproductcorpus/v2/index.html
Table 4
Robustness of BERT, SBERT, Ditto, and R-SupCon to out-of-distribution records.

                 Domain      Source   Target    BERT   SBERT    Ditto   R-SupCon
                 Same        S-WA     T-AB      0.50    0.48     0.53      0.33
                 Same        T-AB     S-WA      0.46    0.51     0.56      0.39
                 Same        S-DG     S-DA      0.95    0.96     0.92      0.96
                 Same        S-DA     S-DG      0.75    0.70     0.87      0.92
                 Same        D-DG     S-DA      0.96    0.95     0.94      0.96
                 Same        D-DG     D-DA      0.96    0.95     0.95      0.96
                                      AVG       0.76    0.76     0.80      0.75
                 Different   S-IA     S-DA      0.39    0.53     0.32      0.91
                 Different   S-IA     S-DG      0.37    0.46     0.31      0.78
                 Different   S-DA     S-IA      0.60    0.64     0.84      0.71
                 Different   S-DG     S-IA      0.83    0.79     0.45      0.75
                 Different   D-IA     D-DA      0.48    0.64     0.16      0.81
                 Different   D-DA     D-IA      0.55    0.67     0.72      0.61
                                      AVG       0.54    0.62     0.47      0.76
                 Total                AVG       0.65    0.69     0.63      0.76


with the BERT-based baseline. R-SupCon generates discriminative embeddings that encode the
similarity of the records; BERT behaves similarly to the remaining models, as highlighted in
the previous experiments.
   Table 3 shows the results of the experiment. The first column reports the number of cliques
in the ground truth. The other columns show the percentage of cliques not correctly recognized
by the models and the accuracy obtained in terms of F1 score.
Discussion. Table 3 shows that an average of 16% of cliques are not recognized by the BERT
model, even if the model reaches a high level of accuracy (more than 92% on average). The
results of the experiment align with those reported in Section 4: since the model does not
correctly recognize entities, it generates very similar embeddings for any pair of descriptions
without distinguishing them based on the entity they belong to.
   A similar result is achieved by R-SupCon, where its lower level of accuracy impacts the
number of cliques found. However, R-SupCon finds more cliques on datasets where the two
models have similar effectiveness (e.g., Cameras).


6. Generalization to out-of-distribution records
In this experiment, we evaluate the robustness of EM models against out-of-distribution data,
i.e., their behavior with data that differs from the training set. The experiment is inspired by
[18], which explores domain adaptation techniques for deep EM models. Following a similar
experimental evaluation, we evaluate the EM models against two scenarios. In the first scenario,
we experiment with test sets from the same domain as the training data. For instance, we train
the EM models with S-WA and we evaluate them against T-AB, since both datasets describe
products. The second scenario, on the other hand, evaluates the performance of models where
the training sets and the test sets are from different domains. Table 4 shows the experiment
results.
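   As a minimal sketch of this protocol (the loading and fine-tuning helpers in the commented
usage are hypothetical placeholders, not functions from the cited systems):

```python
# Sketch of the out-of-distribution protocol: fine-tune on a source dataset,
# then measure F1 on the test split of a different target dataset.
from sklearn.metrics import f1_score

def evaluate_ood(model, target_pairs, target_labels):
    """F1 of a model trained on a source dataset, applied to target data."""
    predictions = [model.predict(pair) for pair in target_pairs]
    return f1_score(target_labels, predictions)

# Usage (names are illustrative):
#   model = fine_tune(load("S-WA", "train"))     # source: Walmart-Amazon
#   pairs, labels = load("T-AB", "test")         # target: Abt-Buy
#   print(evaluate_ood(model, pairs, labels))
```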
Discussion. In the first scenario, we observe that the EM models exhibit high performance,
reaching average F1 scores in the range 0.75-0.80. For the datasets S-DA and D-DA, the
scores are very close to the ones achieved when training and test sets come from the same
dataset (see Table 2). The poorest results concern the experiments involving T-AB. This dataset
is structurally different from S-WA, even if it belongs to the same domain, because it includes
large textual attributes. In the second scenario, where training and test datasets come from
different domains, the performance decreases for all models apart from R-SupCon. This could
be the result of the contrastive learning technique implemented in the model, which makes the
approach able to generalize better than the other learning techniques.


7. Conclusion
Summarizing the results obtained from the experiments, we observe that:
   1. Off-the-shelf transformer-based EM models outperform previous deep-learning-based EM
      models (like DeepMatcher [13]) and perform well even on dirty data, where values are
      misplaced across attributes;
   2. Different pre-training tasks result in different levels of effectiveness, which is only
      partially explained by a different learning of record representations. We compared EM
      models pre-trained with different methods: the usual word-masking technique, the
      sentence-similarity-based task offered by SBERT, and the supervised contrastive learning
      adopted by R-SupCon. This showed that only R-SupCon can differentiate the knowledge
      encoded in the embeddings between matching and non-matching records (Section 4).
   3. Models that are fine-tuned for EM via a binary classifier do not fully recognize cliques of
      entity descriptions (Section 5) and have limited generalization capacity to out-of-distribution
      data (Section 6).
  We conclude that, even if transformer-based architectures represent a breakthrough in
performing EM (Section 3), the reasons for their effectiveness can only be partially
explained. Thus, we believe that there is still room to instill human rationales regarding the
resolution of matching tasks within these architectures. In addition, exploring more advanced
forms of fine-tuning and pre-training represents a concrete direction to make the behavior of
such models more self-explanatory and promote their application in real-world scenarios.


References
 [1] N. Barlaug, J. A. Gulla, Neural networks for entity matching: A survey, ACM Trans. Knowl.
     Discov. Data 15 (2021) 52:1–52:37.
 [2] Y. Li, J. Li, Y. Suhara, A. Doan, W. Tan, Deep entity matching with pre-trained language
     models, Proc. VLDB Endow. 14 (2020) 50–60.
 [3] R. Peeters, C. Bizer, Supervised contrastive learning for product matching, in: WWW
     (Companion Volume), ACM, 2022, pp. 248–251.
 [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
 [5] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional trans-
     formers for language understanding, in: NAACL-HLT (1), Association for Computational
     Linguistics, 2019, pp. 4171–4186.
 [6] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
     in: EMNLP/IJCNLP (1), Association for Computational Linguistics, 2019, pp. 3980–3990.
 [7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
     (2019).
 [8] U. Brunner, K. Stockinger, Entity matching with transformer architectures - A step forward
     in data integration, in: EDBT, OpenProceedings.org, 2020, pp. 463–473.
 [9] M. Paganelli, F. D. Buono, M. Pevarello, F. Guerra, M. Vincini, Automated machine learning
     for entity matching tasks, in: EDBT, OpenProceedings.org, 2021, pp. 325–330.
[10] A. Baraldi, F. D. Buono, F. Guerra, M. Paganelli, M. Vincini, An intrinsically interpretable
     entity matching system, in: EDBT, OpenProceedings.org, 2023.
[11] M. Paganelli, D. Tiano, F. Guerra, A multi-facet analysis of bert-based entity matching
     models, The VLDB Journal (2023) 1–26. doi:10.1007/s00778-023-00824-x.
[12] M. Paganelli, F. D. Buono, A. Baraldi, F. Guerra, Analyzing how BERT performs entity
     matching, Proc. VLDB Endow. 15 (2022) 1726–1738.
[13] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute,
     V. Raghavendra, Deep learning for entity matching: A design space exploration, in:
     SIGMOD Conference, ACM, 2018, pp. 19–34.
[14] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, D. Krishnan,
     Supervised contrastive learning, in: NeurIPS, 2020.
[15] Y. Lin, Y. C. Tan, R. Frank, Open sesame: Getting inside bert’s linguistic knowledge, CoRR
     abs/1906.01698 (2019).
[16] T. Jiang, S. Huang, Z. Zhang, D. Wang, F. Zhuang, F. Wei, H. Huang, L. Zhang, Q. Zhang,
     Promptbert: Improving BERT sentence embeddings with prompts, CoRR abs/2201.04337
     (2022).
[17] D. Firmani, B. Saha, D. Srivastava, Online entity resolution using an oracle, Proc. VLDB
     Endow. 9 (2016) 384–395.
[18] J. Tu, J. Fan, N. Tang, P. Wang, C. Chai, G. Li, R. Fan, X. Du, Domain adaptation for deep
     entity resolution, in: SIGMOD Conference, ACM, 2022, pp. 443–457.