Intermediate Training of BERT for Product Matching

Ralph Peeters, Christian Bizer, Goran Glavaš
Data and Web Science Group, University of Mannheim, Mannheim, Germany
ralph@informatik.uni-mannheim.de, chris@informatik.uni-mannheim.de, goran@informatik.uni-mannheim.de

ABSTRACT
Transformer-based models like BERT have pushed the state-of-the-art for a wide range of tasks in natural language processing. General-purpose pre-training on large corpora allows Transformers to yield good performance even with small amounts of training data for task-specific fine-tuning. In this work, we apply BERT to the task of product matching in e-commerce and show that BERT is much more training data efficient than other state-of-the-art methods. Moreover, we show that we can further boost its effectiveness through an intermediate training step that exploits large collections of product offers. Our intermediate training leads to strong performance (>90% F1) on new, unseen products without any product-specific fine-tuning. Further fine-tuning yields additional gains, resulting in improvements of up to 12% F1 for small training sets. Adding the masked language modeling objective to the intermediate training step in order to further adapt the language model to the application domain leads to an additional increase of up to 3% F1.

CCS CONCEPTS
• Information systems → Entity resolution; Electronic commerce; • Computing methodologies → Neural networks.

KEYWORDS
e-commerce, product matching, deep learning

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Product matching is the task of deciding whether offers originating from different web-shops refer to the same real-world product. This is a central task for e-commerce applications such as online market places and price comparison portals, as well as for the construction of product knowledge graphs [36] such as the one currently built by Amazon [10]. Different merchants present their products in different ways, leading to heterogeneity among offers of the same product, which makes product matching a challenging task.

In natural language processing (NLP), deep Transformer networks [33], pre-trained on large corpora via language modeling objectives [7, 8, 22, inter alia], have significantly pushed the state-of-the-art in a variety of downstream tasks [15, 34], including a number of sentence-pair classification tasks, e.g. paraphrase identification [9]. Recent studies [4, 21] also demonstrate the effectiveness of Transformer models like BERT [8] for the task of entity matching.

In this work, we show that fine-tuning BERT for product matching is much more training data efficient than the state-of-the-art framework Deepmatcher [24]. Fine-tuning BERT results in 15-20% higher F1 scores in settings with small- and medium-sized training sets. Even for large training sets, fine-tuning BERT still yields a 2% improvement over Deepmatcher.

Inspired by findings that intermediate training on large training sets for related tasks [28, 30] improves downstream performance, we next introduce an intermediate training step before the final fine-tuning of the model for specific products. In this step, we train BERT on product data from thousands of e-shops and show that intermediate training leads to high performance (>90% F1) and good generalization to new products, even without any product-specific fine-tuning. Poor generalization to new products is the main weakness of Deepmatcher [24], as shown in our previous work [26]. Our intermediate training is particularly beneficial for fine-tuning setups with limited training data: it leads to improvements of up to 12% F1 on new products with small training datasets, compared to direct fine-tuning (i.e. without any intermediate training). Finally, we show that adding domain-specific (self-supervised) language modeling to the intermediate training leads to further gains of up to 3% F1 in downstream product-matching tasks.

All code and data of our experiments are available on GitHub¹, which makes all results reproducible.

¹ https://github.com/Weyoun2211/productbert-intermediate
2 BERT FOR PRODUCT MATCHING
Deep Transformer-based models like BERT [8] use stacked encoder layers based on a self-attention mechanism [33], which allows every (sub-)word to attend to every other (sub-)word in a sequence, enabling mutual semantic contextualization of words. The deep architecture, i.e. the stacking of attention layers, allows for modeling the syntactic and semantic compositionality of language that stems from word interactions [14]. Unlike static word embeddings [3, 23, 27], where each word has one fixed vector regardless of the context, pre-trained Transformers produce context-specific vector representations of words, allowing, inter alia, to capture different word senses (e.g. bank would have very different representations in contexts in which it denotes a financial institution from those in contexts where it denotes a river bank). BERT is pre-trained on a large corpus of text (a concatenation of Wikipedia and BookCorpus) using two pre-training objectives: (1) the masked language modeling (MLM) objective aims to reconstruct (i.e. predict) words that have been masked out in the input text from their context; (2) the next sentence prediction (NSP) objective predicts whether two sentences are adjacent to each other in the text or not, contributing to the downstream performance of text-pair classification tasks. The input to the BERT model has the following format: [CLS] Sequence 1 [SEP] Sequence 2 [SEP]. The two sequences, comprising (sub-)word tokens, are separated using [SEP] tokens; the sequence start token [CLS] serves to capture the representation of the whole text pair.

After the pre-training step, it is possible to either use the output representations of each word in downstream tasks (feature-based approach) or to fine-tune the BERT model itself for these tasks (fine-tuning-based approach), with the latter generally leading to better performance. In this work, we adopt the standard fine-tuning for sentence-pair classification: we feed the transformed representation of the sequence start token [CLS], x_CLS, into a simple logistic regression classifier, y = σ(x_CLS W_cl + b_cl), with W_cl and b_cl as well as BERT's parameters being optimized during fine-tuning.
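As a concrete illustration of this setup, the following is a minimal sketch (not the authors' training code) of sentence-pair fine-tuning with the HuggingFace Transformers library used in Section 2.2. The offer strings and the match label are purely illustrative, and the library's stock two-way softmax head over the [CLS] representation stands in for the single-logit sigmoid classifier described above.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Two (hypothetical) offers for the same product; label 1 = match, 0 = non-match.
    offer_a = "Lenovo ThinkPad T480 14 inch notebook i5-8250U 8GB RAM"
    offer_b = "ThinkPad T480 business laptop, Intel Core i5, 8 GB"

    # The tokenizer produces the [CLS] Sequence 1 [SEP] Sequence 2 [SEP] input described above.
    inputs = tokenizer(offer_a, offer_b, truncation=True, max_length=512, return_tensors="pt")
    labels = torch.tensor([1])

    outputs = model(**inputs, labels=labels)   # cross-entropy loss over the two classes
    outputs.loss.backward()                    # one gradient step; in practice run inside a training loop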
2.1 Datasets
In our experiments, we use the training, validation and gold standard (test) datasets from the computers category of the WDC Product Corpus for Large-Scale Product Matching [26]. These datasets are derived from schema.org annotations from thousands of web-shops extracted from the Common Crawl. Relying on schema.org annotations of product identifiers like GTINs or MPNs allows us to directly create binary (matching or non-matching) labels for our classification task, without the need for laborious manual annotation. All labels of the test set used for final evaluation have been manually checked. Previous experiments with these datasets have shown that using schema.org ids as distant supervision results in clean enough labels for training high-performance product matchers [26].

The computers test set encompasses positive pairs for 150 unique products. The negative pairs for these products contain offers for 595 additional products. The corresponding training sets contain both positive and negative pairs for the same products. For more details on the construction of the product corpus as well as the training and test sets, we refer the reader to [29] and to the project website². To test the efficiency of the classifiers w.r.t. training set size, we experiment with training sets of varying size: small, medium, large, xlarge. Table 1 shows statistics of the training sets and the test set.

² http://webdatacommons.org/largescaleproductcorpus/v2/

Table 1: Test and training set statistics

                 # products w/ pos (overall)   # Pos. Pairs   # Neg. Pairs   # Comb. Pairs
  Test set
  computers      150 (745)                     300            800            1,100
  Training sets
  xlarge         745                           9,690          58,771         68,461
  large          745                           6,146          27,213         33,359
  medium         745                           1,762          6,332          8,094
  small          745                           722            2,112          2,834

2.2 Fine-tuning Setup
We cast product matching as a binary classification task, i.e. given two offers, we predict whether they represent the same real-world product. The input for BERT (Sequence 1 and Sequence 2) is then the concatenation of the product data of each offer. To this end, we first concatenate all attributes of each product offer into one string. We use the attributes brand, title, description and specification table content and concatenate them in this order.

Experimental setup. We conduct all our experiments with PyTorch [25] using BERT's implementation³ from the HuggingFace Transformers library [35]. All hyperparameters are set to their defaults if not stated otherwise. We minimize the binary cross-entropy loss using Adam [17] as the optimization algorithm. BERT allows for input sequences of a maximal length of 512 tokens: we first constrain each attribute's length to 5 (brand), 50 (title), 100 (description) and 200 (specification table content) words respectively, dropping any words outside that range, and further truncate long product offers by removing tokens from their end until we satisfy BERT's constraint. We fine-tune all layers for 50 epochs with a linearly decaying learning rate with warm-up over the first epoch. We use the validation set for model selection and early stopping: if the F1 score on the validation set does not improve over 10 consecutive epochs, we stop the training. We use a fixed batch size of 32 and sweep learning rates in the range [5e-6, 1e-5, 3e-5, 5e-5, 8e-5, 1e-4]. We train three model instances for each hyperparameter configuration and report the average performance.

³ We use the pre-trained BERT instance bert-base-uncased.
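The serialization and truncation steps described above might look as follows. This is a sketch under the assumption of a simple dictionary representation of an offer; the attribute keys are chosen for illustration rather than taken from the corpus schema.

    # Per-attribute word limits from Section 2.2: brand 5, title 50, description 100, spec table 200.
    MAX_WORDS = {"brand": 5, "title": 50, "description": 100, "specTableContent": 200}

    def serialize_offer(offer: dict) -> str:
        """Concatenate the attributes in the fixed order, capping each at its word limit."""
        parts = []
        for attr in ("brand", "title", "description", "specTableContent"):
            words = (offer.get(attr) or "").split()[: MAX_WORDS[attr]]
            parts.append(" ".join(words))
        return " ".join(p for p in parts if p)

    # Hypothetical offer record:
    offer = {"brand": "Lenovo", "title": "ThinkPad T480 14 inch i5-8250U 8GB", "description": "Business notebook ..."}
    sequence = serialize_offer(offer)
    # Any remaining overflow of the serialized pair is handled by the tokenizer's
    # truncation to BERT's 512-token limit (tokens are removed from the end).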
Baselines. We compare BERT-based product matching with several baselines. First, we evaluate a simple word co-occurrence based approach, where we feed binary bag-of-words features of the two product offers to traditional classification algorithms. We also test the Magellan framework [18] for entity resolution, which generates string- and numeric-similarity based features. Magellan constructs these features depending on the data types of the input attributes. We combine both the Magellan and the word co-occurrence feature creation methods with XGBoost, Random Forest, Decision Tree, linear SVM, and Logistic Regression as classification methods and apply randomized search over the respective hyperparameter spaces. Finally, we compare against Deepmatcher [24], a state-of-the-art neural entity resolution framework using pre-trained word embeddings as input. Deepmatcher computes attribute-wise similarities between two records and then combines these as features for the matching decision. For Deepmatcher, we use fastText embeddings trained on the English Wikipedia⁴ as input and allow for the fine-tuning of word embeddings, which, albeit not part of the original implementation, has been shown to improve performance [26]. We train all Deepmatcher instances for 50 epochs with default parameters and only search for the optimal learning rate. For Deepmatcher and BERT we use the method-specific tokenizers for pre-processing; for the other baselines we lower-case all attributes before further processing.

⁴ https://fasttext.cc/docs/en/pretrained-vectors.html
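For illustration, a bare-bones version of the word co-occurrence baseline could be built with scikit-learn as sketched below. How exactly the two offers' binary bag-of-words vectors are combined into one feature vector is not spelled out above, so the concatenation used here is an assumption; the toy offer strings are invented and logistic regression stands in for any of the classifiers listed.

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Lower-cased, serialized offers of each pair plus their match labels (toy data).
    left   = ["lenovo thinkpad t480 i5 8gb notebook", "canon eos 80d dslr body"]
    right  = ["thinkpad t480 business laptop intel i5", "nikon d7500 dslr body"]
    labels = [1, 0]

    # Binary bag-of-words features for both sides, concatenated into one feature vector per pair.
    vectorizer = CountVectorizer(binary=True).fit(left + right)
    X = hstack([vectorizer.transform(left), vectorizer.transform(right)])

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(X))   # predictions on the training pairs, just to show the interface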
2.3 Fine-tuning Results
Table 2 compares the results of fine-tuning BERT to the baselines. BERT outperforms all three baselines in all settings. The gains from BERT-based product matching become larger the smaller the training dataset is: for the smallest training set, BERT outperforms Deepmatcher by 20 F1 points. Even for the largest training set, we obtain a 2.3% F1 gain over Deepmatcher. Our results are in line with the findings of Li et al. [21], though not fully comparable, as the authors use DistilBERT [31] and apply additional data augmentation techniques. Overall, we can conclude that fine-tuning BERT is a promising technique for product matching, especially in settings with limited training data.

Table 2: BERT compared to baselines

            Word Cooc.             Magellan               Deepmatcher            BERT                   Li et al. [21]
            P      R      F1       P      R      F1       P      R      F1       P      R      F1       F1
  xlarge    86.59  79.67  82.99    71.44  56.89  63.33    89.63  94.78  92.12    95.99  93.00  94.47    95.45
  large     79.52  77.67  78.58    67.67  63.67  65.60    85.70  91.22  88.38    91.64  95.00  93.29    91.70
  medium    65.83  78.33  71.54    48.99  81.56  61.20    66.39  82.78  73.67    84.89  94.22  89.31    88.62
  small     53.98  74.67  62.66    50.86  71.22  59.17    54.86  69.56  61.20    75.62  89.33  81.89    80.76

3 INTERMEDIATE TRAINING ON DOMAIN-SPECIFIC DATA
BERT has been pre-trained on a general-purpose natural language corpus, whose language as well as topics are rather different from product descriptions. We thus test the intuitive assumption that intermediate in-domain training – after BERT's original pre-training and before fine-tuning for specific products – can improve matching performance. For the intermediate training we use training data covering a wide range of products from thousands of e-shops.

3.1 Building Intermediate Training Sets
We leverage the WDC Product Corpus for Large-Scale Product Matching [29] and its product-cluster structure to build wide-coverage training sets consisting of millions of offer pairs. The corpus consists of clusters containing offers for the same product. The clusters have been derived using schema.org annotated ids as weak supervision (see Section 2.1). In order to have an unbiased evaluation, the clusters contained in the test set and in the fine-tuning training sets are removed from the corpus prior to building the intermediate training sets.

We compare the effects of intermediate training on two structurally different training sets. The first intermediate training set contains only offer pairs for the category computers: this allows us to introduce more computer information into BERT and have the Transformer network detect relevant linguistic phenomena for recognizing matches between computer offers. The second training set contains pairs from four categories – computers, cameras, watches and shoes – with fewer training pairs per product: this offers a wider selection of products (i.e., more versatile information about what constitutes a product match for the model), but less in-depth information for each product/category.

We build the training sets as follows: for positive instances, we select only clusters containing more than one offer, from which we can build at least one positive pair. We restrict ourselves to clusters of size ≤80 after observing that very large clusters contain more noise and may lead to a degradation of performance. For each offer in each cluster we build up to 15 (computers) or 5 (4 categories) positive pairs with the other offers from that cluster. Half of those are hard positives, created by a) applying cosine similarity between bag-of-words vectors of the concatenation of the title and the first 5 words of the description and b) sorting offer pairs by cosine similarity and selecting the pairs with the lowest scores. The remaining 50% are selected by randomly pairing offers from the same cluster. We create negative pairs in a similar fashion: for each offer used for positive pairs, we create the same number of negative pairs using offers from other clusters of the same category. Hard negatives (50%) are pairs of offers from different clusters with the highest cosine similarity; the other half are randomly sampled pairs of offers from different clusters. Table 3 displays the statistics of the resulting intermediate training sets.

Table 3: Intermediate training set statistics

                    # products w/ pos (overall)   # pos. pairs   # neg. pairs   # comb. pairs
  computers only    60,030 (286,356)              409,445        2,446,765      2,856,210
  4 categories      201,380 (838,317)             858,308        2,665,056      3,523,364
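The hard-pair heuristic can be pictured with the following sketch: offers are represented as binary bag-of-words vectors over the title plus the first five description words, and the least similar same-cluster pairs are kept as hard positives, while hard negatives are mined analogously from different clusters by keeping the most similar pairs. The data structures and the number of returned pairs are illustrative, not taken from the released code.

    from itertools import combinations
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def hard_positive_pairs(cluster_offers, n_pairs):
        """Return the n_pairs least similar offer pairs within one product cluster."""
        texts = [o["title"] + " " + " ".join(o.get("description", "").split()[:5])
                 for o in cluster_offers]
        vectors = CountVectorizer(binary=True).fit_transform(texts)
        sims = cosine_similarity(vectors)
        # Score every pair inside the cluster and sort by ascending cosine similarity.
        scored = sorted((sims[i, j], i, j) for i, j in combinations(range(len(texts)), 2))
        return [(i, j) for _, i, j in scored[:n_pairs]]

    # Hard negatives: same representation, but pairs are drawn across clusters of the same
    # category and the pairs with the HIGHEST cosine similarity are kept instead.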
3.2 Intermediate Training Procedure
For the first set of experiments, the intermediate training is performed with a single objective, the binary product matching task. The architecture is exactly the same as for the fine-tuning experiments. One model is trained for each of the training sets from Table 3. After intermediate training, we evaluate the model with and without final product-specific fine-tuning. We run the intermediate training for 40 epochs with a linearly decaying learning rate (starting from 5e-5) with 10,000 warm-up steps and a batch size of 256. Due to the long training times, we train the first 90% of epochs on sequences of length 128 and only the last 10% on the full sequences of 512 tokens to speed up training, similar to the original BERT training procedure [8].

In the second set of experiments, we add the MLM objective to the product matching objective and jointly optimize both in the intermediate training step. We follow the original masking procedure: we randomly select 15% of tokens for replacement; in 80% of the cases, we replace the token with the [MASK] token, in 10% of the cases with a random vocabulary token, and in the remaining 10% we keep the original token (i.e., we give up the replacement). As in the original work, we train the Transformer network by minimizing the cross-entropy loss over predictions of masked tokens. After the intermediate training, we again evaluate two model variants: with and without the final product-specific matching fine-tuning.
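One way such a joint objective could be wired together with HuggingFace Transformers is sketched below. This is an assumption about the implementation rather than the authors' code: BERT's pre-training architecture is reused, its next-sentence-prediction head is repurposed as the binary match/non-match classifier, and DataCollatorForLanguageModeling applies the 15% masking with the 80/10/10 replacement recipe. The example pair and the label convention (0 = match) are illustrative.

    import torch
    from transformers import BertTokenizer, BertForPreTraining, DataCollatorForLanguageModeling

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")   # MLM head + binary pair head
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    # One (hypothetical) offer pair, encoded as [CLS] offer_a [SEP] offer_b [SEP].
    encoded = tokenizer("lenovo thinkpad t480 i5 8gb", "thinkpad t480 business laptop",
                        truncation=True, max_length=128)
    batch = collator([encoded])   # masks 15% of the tokens (80/10/10) and builds the MLM labels

    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    token_type_ids=batch["token_type_ids"],
                    labels=batch["labels"],                    # MLM targets (-100 on unmasked positions)
                    next_sentence_label=torch.tensor([0]))     # repurposed here as 0 = match, 1 = non-match
    outputs.loss.backward()   # outputs.loss is the sum of the MLM loss and the pair-classification loss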
3.3 Intermediate Training Results
Table 4 shows the results of the intermediate training procedure. We compare the intermediate training on the computers training set against the intermediate training on the training set comprising 4 product categories. We observe that even without final fine-tuning (row 'none' in Table 4), we achieve a very good matching performance of 92% F1. This suggests that through the intermediate training we inject category-specific knowledge into BERT's parameters, as it is evidently able to make good matching predictions for products for which it had not seen any training examples. Once the intermediate model is subjected to further fine-tuning on offer pairs from the training sets, we observe further improvements in all settings, with gains being most prominent for the smallest training set. Intermediate training followed by fine-tuning on small training sets reaches a performance of ∼94% F1, which, without intermediate pre-training (see Table 2), we previously obtained only on the largest training set. Training on category-specific data (computers) generally yields marginally better performance than training on the mix of 4 categories.

Table 4: Intermediate training with PM objective

            computers category                       4 categories
            P      R      F1      Δ only fine-tune   P      R      F1      Δ only fine-tune
  xlarge    95.58  93.67  94.61    0.14              95.45  95.44  95.45    0.98
  large     92.68  95.56  94.09    0.80              91.34  96.00  93.61    0.32
  medium    94.01  95.78  94.88    5.57              91.59  95.67  93.59    4.28
  small     94.38  93.11  93.73   11.84              90.39  90.89  90.64    8.75
  none      94.41  90.00  92.15   -2.32 (xl)         88.24  95.00  91.49   -2.98 (xl)
  (Δ only fine-tune: F1 difference to fine-tuning without intermediate training, Table 2; for the row 'none', relative to the xlarge fine-tuning setting.)

Table 5 shows the results of adding the MLM objective to the product matching objective in the intermediate training step, using the computers intermediate training set. Compared to the corresponding settings in which the intermediate training did not include MLM (see the left half of Table 4), the performance (with fine-tuning) increases by up to 3% F1, yielding a new top overall matching performance (>97% F1 for the largest training set and 96% F1 for all other training sizes). This confirms the findings from other application domains [2, 20] pointing to the benefits of domain-specific MLM pre-training. The original pre-training data likely contains only few instances of product-specific vocabulary, as it covers a wide range of topics. Applying intermediate MLM training on domain-specific data allows for adaptation of the vocabulary embeddings to the domain, resulting in better downstream performance.

Table 5: Intermediate training with PM and MLM objective

            intermediate training - PM + MLM
            P      R      F1      Δ only fine-tune   Δ interm. only PM
  xlarge    98.20  96.56  97.37    2.90               2.76
  large     94.99  96.67  95.82    2.53               1.73
  medium    96.05  97.11  96.58    7.27               1.70
  small     95.64  97.44  96.53   14.64               2.80
  none      94.31  94.00  94.16   -0.31 (xl)          2.01
  (Δ interm. only PM: F1 difference to intermediate training with the PM objective only on the computers category, Table 4.)

In summary, subjecting BERT to an intermediate training step with large amounts of product data leads to a model that generalizes well to new, unseen products from the same category and can be easily fine-tuned with small amounts of product-specific training data to further increase the performance for these products. Depending on the structure of the intermediate training set, more training data for a single category can lead to a small increase in performance compared to a more heterogeneous training set encompassing a larger set of products from several categories. Adding the MLM objective to the intermediate training results in further improvements in matching performance, suggesting that domain-specific language modeling indeed successfully adapts BERT's parameters to the product domain.

4 RELATED WORK
Product matching, a task with a rich history and a large body of work in both research and industry, can be seen as a special case of entity resolution, which concerns itself with the disambiguation of entity representations to their respective real-world entities [5, 6]. Early approaches applied rule- and statistics-based methods [12]. Since the early 2000s, machine learning based methods have taken the focus due to their strong performance [19]. In recent years, due to the successes of deep learning in fields like computer vision and natural language processing, researchers working on entity matching have started to shift their attention towards these methods as well [1, 11, 13, 16, 24, 32, 37]. Recently, Transformer-based architectures [8, 33] were shown to produce state-of-the-art results [4, 21].

5 CONCLUSION
Transformer-based language models like BERT have had a tremendous impact in the field of NLP, improving the state-of-the-art performance in a wide variety of tasks. In this work, we demonstrate the utility of BERT for product matching in e-commerce, showing that it is much more training data efficient than Deepmatcher. Performing intermediate training of BERT with large amounts of product data from thousands of e-shops leads to a model with high generalization performance (>90% F1) for new (i.e. unseen) products. We show that, if subjected to intermediate training, BERT reaches peak performance with less product-specific training data than without intermediate training. We achieve the best performance if intermediate training combines two jointly-trained objectives: (1) binary product matching and (2) masked language modeling. Category-specific intermediate training yields only slightly better performance than intermediate training on cross-category data. While intermediate product-matching training alone brings substantial gains, adding the masked language modeling objective to the intermediate training gives an additional performance edge of up to 3% F1 in all setups. This is in line with observations from other domains, such as scientific text [2, 20], that domain-specific language modelling improves the performance of BERT for in-domain downstream tasks.

REFERENCES
[1] Luciano Barbosa. 2019. Learning Representations of Web Entities for Entity Resolution. International Journal of Web Information Systems 15, 3 (2019), 346–358.
[2] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 3606–3611.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - a Step Forward in Data Integration. In Proceedings of the International Conference on Extending Database Technology, 2020. 463–473.
[5] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag, Berlin Heidelberg.
[6] Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2019. End-to-End Entity Resolution for Big Data: A Survey. arXiv:1905.06397 [cs] (2019).
[7] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs] (2020).
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
[9] William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing. 9–16.
[10] Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, et al. 2020. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types. arXiv:2006.13473 [cs] (2020).
[11] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed Representations of Tuples for Entity Resolution. Proceedings of the VLDB Endowment 11, 11 (2018), 1454–1467.
[12] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183–1210.
[13] Cheng Fu, Xianpei Han, Le Sun, Bo Chen, Wei Zhang, et al. 2019. End-to-End Multi-Perspective Matching for Entity Resolution. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 4961–4967.
[14] John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4129–4138.
[15] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, et al. 2020. XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalization. In Proceedings of the International Conference on Machine Learning. 7449–7459.
[16] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-Resource Deep Entity Resolution with Transfer and Active Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5851–5861.
[17] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] (2014).
[18] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, et al. 2016. Magellan: Toward Building Entity Matching Management Systems. Proceedings of the VLDB Endowment 9, 12 (2016), 1197–1208.
[19] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment 3, 1-2 (2010), 484–493.
[20] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, et al. 2020. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 36, 4 (2020), 1234–1240.
[21] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. arXiv:2004.00584 [cs] (2020).
[22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, et al. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 (2019).
[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the Conference on Neural Information Processing Systems. 3111–3119.
[24] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, et al. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data. 19–34.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. 8024–8035.
[26] Ralph Peeters, Anna Primpeli, Benedikt Wichtlhuber, and Christian Bizer. 2020. Using schema.org Annotations for Training and Maintaining Product Matchers. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics.
[27] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1532–1543.
[28] Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-Data Tasks. arXiv:1811.01088 (2018).
[29] Anna Primpeli, Ralph Peeters, and Christian Bizer. 2019. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Workshop on e-Commerce and NLP (ECNLP 2019), Companion Proceedings of WWW. 381–386.
[30] Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, et al. 2020. Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 5231–5247.
[31] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv:1910.01108 [cs] (2020).
[32] Kashif Shah, Selcuk Kopru, and Jean David Ruvini. 2018. Neural Network Based Extreme Classification and Similarity Models for Product Matching. In Proceedings of the 2018 Conference of the Association for Computational Linguistics, Volume 3 (Industry Papers). 8–15.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
[34] Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, et al. 2019. Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4465–4476.
[35] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, et al. 2019. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. arXiv:1910.03771 (2019).
[36] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. 2020. Product Knowledge Graph Embedding for E-Commerce. In Proceedings of the 13th International Conference on Web Search and Data Mining. 672–680.
[37] Dongxiang Zhang, Yuyang Nie, Sai Wu, Yanyan Shen, and Kian-Lee Tan. 2020. Multi-Context Attention for Entity Matching. In Proceedings of The Web Conference 2020. 2634–2640.