Intermediate Training of BERT for Product Matching

Ralph Peeters, Christian Bizer, Goran Glavaš
Data and Web Science Group, University of Mannheim, Mannheim, Germany
ralph@informatik.uni-mannheim.de, chris@informatik.uni-mannheim.de, goran@informatik.uni-mannheim.de

ABSTRACT
Transformer-based models like BERT have pushed the state-of-the-art for a wide range of tasks in natural language processing. General-purpose pre-training on large corpora allows Transformers to yield good performance even with small amounts of training data for task-specific fine-tuning. In this work, we apply BERT to the task of product matching in e-commerce and show that BERT is much more training data efficient than other state-of-the-art methods. Moreover, we show that we can further boost its effectiveness through an intermediate training step that exploits large collections of product offers. Our intermediate training leads to strong performance (>90% F1) on new, unseen products without any product-specific fine-tuning. Further fine-tuning yields additional gains, resulting in improvements of up to 12% F1 for small training sets. Adding the masked language modeling objective to the intermediate training step in order to further adapt the language model to the application domain leads to an additional increase of up to 3% F1.

CCS CONCEPTS
• Information systems → Entity resolution; Electronic commerce; • Computing methodologies → Neural networks.

KEYWORDS
e-commerce, product matching, deep learning

DI2KG 2020, August 31, Tokyo, Japan. Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Product matching is the task of deciding whether offers originating from different web-shops refer to the same real-world product. This is a central task for e-commerce applications such as online market places and price comparison portals, as well as for the construction of product knowledge graphs [36] such as the one currently built by Amazon [10]. Different merchants present their products in different ways, leading to heterogeneity among offers of the same product, which makes product matching a challenging task.

In natural language processing (NLP), deep Transformer networks [33], pre-trained on large corpora via language modeling objectives [7, 8, 22, inter alia], have significantly pushed the state-of-the-art in a variety of downstream tasks [15, 34], including a number of sentence-pair classification tasks, e.g. paraphrase identification [9]. Recent studies [4, 21] also demonstrate the effectiveness of Transformer models like BERT [8] for the task of entity matching.

In this work, we show that fine-tuning BERT for product matching is much more training data efficient than the state-of-the-art framework Deepmatcher [24]. Fine-tuning BERT results in 15-20% higher F1 scores in settings with small- and medium-sized training sets. Even for large training sets, fine-tuning BERT still yields a 2% improvement over Deepmatcher.

Inspired by findings that intermediate training on large training sets for related tasks [28, 30] improves downstream performance, we next introduce an intermediate training step before the final fine-tuning of the model for specific products. In this step, we train BERT on product data from thousands of e-shops and show that intermediate training leads to high performance (>90% F1) and good generalization to new products, even without any product-specific fine-tuning. Poor generalization to new products is the main weakness of Deepmatcher [24], as shown in our previous work [26]. Our intermediate training is particularly beneficial for fine-tuning setups with limited training data: it leads to improvements of up to 12% F1 on new products with small training datasets, compared to direct fine-tuning (i.e. without any intermediate training). Finally, we show that adding domain-specific (self-supervised) language modeling to the intermediate training leads to further gains of up to 3% F1 in downstream product-matching tasks.

All code and data of our experiments are available on GitHub¹, which makes all results reproducible.

¹ https://github.com/Weyoun2211/productbert-intermediate
2 BERT FOR PRODUCT MATCHING
Deep Transformer-based models like BERT [8] use stacked encoder layers based on a self-attention mechanism [33], which allows every (sub-)word to attend to every other (sub-)word in a sequence, enabling mutual semantic contextualization of words. The deep architecture, i.e. the stacking of attention layers, allows for modeling the syntactic and semantic compositionality of language that stems from word interactions [14]. Unlike static word embeddings [3, 23, 27], where each word has one fixed vector regardless of the context, pre-trained Transformers produce context-specific vector representations of words, allowing, inter alia, to capture different word senses (e.g. bank would have very different representations in contexts in which it denotes a financial institution from those in contexts where it denotes a river bank). BERT is pre-trained on a large corpus of text (a concatenation of Wikipedia and BookCorpus) using two pre-training objectives: (1) the masked language modeling (MLM) objective aims to reconstruct (i.e. predict) words that have been masked out in the input text from their context; (2) the next sentence prediction (NSP) objective predicts whether two sentences are adjacent to each other in the text or not, contributing to the downstream performance of text-pair classification tasks. The input to the BERT model has the following format: [CLS] Sequence 1 [SEP] Sequence 2 [SEP]. The two sequences, comprising (sub-)word tokens, are separated using [SEP] tokens; the sequence start token [CLS] serves to capture the representation of the whole text pair.

After the pre-training step, it is possible to either use the output representations of each word in downstream tasks (feature-based approach) or to fine-tune the BERT model itself for these tasks (fine-tuning-based approach), with the latter generally leading to better performance. In this work, we adopt the standard fine-tuning for sentence-pair classification: we feed the transformed representation of the sequence start token [CLS], x_CLS, into a simple logistic regression classifier, y = σ(x_CLS W_cl + b_cl), with W_cl and b_cl as well as BERT's parameters being optimized during fine-tuning.
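As a concrete illustration of this setup, the following is a minimal sketch (not the authors' training code) of sentence-pair fine-tuning with the HuggingFace Transformers library used in Section 2.2. The offer strings and the match label are purely illustrative, and the library's stock two-way softmax head over the [CLS] representation stands in for the single-logit sigmoid classifier described above.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # Two (hypothetical) offers for the same product; label 1 = match, 0 = non-match.
    offer_a = "Lenovo ThinkPad T480 14 inch notebook i5-8250U 8GB RAM"
    offer_b = "ThinkPad T480 business laptop, Intel Core i5, 8 GB"

    # The tokenizer produces the [CLS] Sequence 1 [SEP] Sequence 2 [SEP] input described above.
    inputs = tokenizer(offer_a, offer_b, truncation=True, max_length=512, return_tensors="pt")
    labels = torch.tensor([1])

    outputs = model(**inputs, labels=labels)   # cross-entropy loss over the two classes
    outputs.loss.backward()                    # one gradient step; in practice run inside a training loop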
2.1 Datasets
In our experiments, we use the training, validation and gold standard (test) datasets from the computers category of the WDC Product Corpus for Large-Scale Product Matching [26]. These datasets are derived from schema.org annotations from thousands of web-shops extracted from the Common Crawl. Relying on schema.org annotations of product identifiers like GTINs or MPNs allows us to directly create binary (matching or non-matching) labels for our classification task, without the need for laborious manual annotation. All labels of the test set used for final evaluation have been manually checked. Previous experiments with these datasets have shown that using schema.org ids as distant supervision results in clean enough labels for training high-performance product matchers [26].

The computers test set encompasses positive pairs for 150 unique products. The negative pairs for these products contain offers for 595 additional products. The corresponding training sets contain both positive and negative pairs for the same products. For more details on the construction of the product corpus as well as the training and test sets, we refer the reader to [29] and to the project website². To test the efficiency of the classifiers w.r.t. training set size, we experiment with training sets of varying size: small, medium, large, xlarge. Table 1 shows statistics of the training sets and the test set.

² http://webdatacommons.org/largescaleproductcorpus/v2/

Table 1: Test and training set statistics

                 # products w/ pos (overall)   # Pos. Pairs   # Neg. Pairs   # Comb. Pairs
  Test set
  computers      150 (745)                     300            800            1,100
  Training sets
  xlarge         745                           9,690          58,771         68,461
  large          745                           6,146          27,213         33,359
  medium         745                           1,762          6,332          8,094
  small          745                           722            2,112          2,834

2.2 Fine-tuning Setup
We cast product matching as a binary classification task, i.e. given two offers, we predict whether they represent the same real-world product. The input for BERT (Sequence 1 and Sequence 2) is then the concatenation of the product data of each offer. To this end, we first concatenate all attributes of each product offer into one string. We use the attributes brand, title, description and specification table content and concatenate them in this order.

Experimental setup. We conduct all our experiments with PyTorch [25] using BERT's implementation³ from the HuggingFace Transformers library [35]. All hyperparameters are set to their defaults if not stated otherwise. We minimize the binary cross-entropy loss using Adam [17] as the optimization algorithm. BERT allows for input sequences of a maximal length of 512 tokens: we first constrain each attribute's length to 5 (brand), 50 (title), 100 (description) and 200 (specification table content) words respectively, dropping any words outside that range, and further truncate long product offers by removing tokens from their end until we satisfy BERT's constraint. We fine-tune all layers for 50 epochs with a linearly decaying learning rate with warm-up over the first epoch. We use the validation set for model selection and early stopping: if the F1 score on the validation set does not improve over 10 consecutive epochs, we stop the training. We use a fixed batch size of 32 and sweep learning rates in the range [5e-6, 1e-5, 3e-5, 5e-5, 8e-5, 1e-4]. We train three model instances for each hyperparameter configuration and report the average performance.

³ We use the pre-trained BERT instance bert-base-uncased.
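The serialization and truncation steps described above might look as follows. This is a sketch under the assumption of a simple dictionary representation of an offer; the attribute keys are chosen for illustration rather than taken from the corpus schema.

    # Per-attribute word limits from Section 2.2: brand 5, title 50, description 100, spec table 200.
    MAX_WORDS = {"brand": 5, "title": 50, "description": 100, "specTableContent": 200}

    def serialize_offer(offer: dict) -> str:
        """Concatenate the attributes in the fixed order, capping each at its word limit."""
        parts = []
        for attr in ("brand", "title", "description", "specTableContent"):
            words = (offer.get(attr) or "").split()[: MAX_WORDS[attr]]
            parts.append(" ".join(words))
        return " ".join(p for p in parts if p)

    # Hypothetical offer record:
    offer = {"brand": "Lenovo", "title": "ThinkPad T480 14 inch i5-8250U 8GB", "description": "Business notebook ..."}
    sequence = serialize_offer(offer)
    # Any remaining overflow of the serialized pair is handled by the tokenizer's
    # truncation to BERT's 512-token limit (tokens are removed from the end).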
Baselines. We compare BERT-based product matching with several baselines. First, we evaluate a simple word co-occurrence based approach, where we feed binary bag-of-words features of the two product offers to traditional classification algorithms. We also test the Magellan framework [18] for entity resolution, which generates string- and numeric-similarity based features. Magellan constructs these features depending on the data types of the input attributes. We combine both the Magellan and the word co-occurrence feature creation methods with XGBoost, Random Forest, Decision Tree, linear SVM, and Logistic Regression as classification methods and apply randomized search over the respective hyperparameter spaces. Finally, we compare against Deepmatcher [24], a state-of-the-art neural entity resolution framework using pre-trained word embeddings as input. Deepmatcher computes attribute-wise similarities between two records and then combines these as features for the matching decision. For Deepmatcher, we use fastText embeddings trained on the English Wikipedia⁴ as input and allow for the fine-tuning of word embeddings, which, albeit not part of the original implementation, has been shown to improve performance [26]. We train all Deepmatcher instances for 50 epochs with default parameters and only search for the optimal learning rate. For Deepmatcher and BERT we use the method-specific tokenizers for pre-processing; for the other baselines we lower-case all attributes before further processing.

⁴ https://fasttext.cc/docs/en/pretrained-vectors.html
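For illustration, a bare-bones version of the word co-occurrence baseline could be built with scikit-learn as sketched below. How exactly the two offers' binary bag-of-words vectors are combined into one feature vector is not spelled out above, so the concatenation used here is an assumption; the toy offer strings are invented and logistic regression stands in for any of the classifiers listed.

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Lower-cased, serialized offers of each pair plus their match labels (toy data).
    left   = ["lenovo thinkpad t480 i5 8gb notebook", "canon eos 80d dslr body"]
    right  = ["thinkpad t480 business laptop intel i5", "nikon d7500 dslr body"]
    labels = [1, 0]

    # Binary bag-of-words features for both sides, concatenated into one feature vector per pair.
    vectorizer = CountVectorizer(binary=True).fit(left + right)
    X = hstack([vectorizer.transform(left), vectorizer.transform(right)])

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.predict(X))   # predictions on the training pairs, just to show the interface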
2.3 Fine-tuning Results
Table 2 compares the results of fine-tuning BERT to the baselines. BERT outperforms all three baselines in all settings. The gains from BERT-based product matching become larger the smaller the training dataset is: for the smallest training set, BERT outperforms Deepmatcher by 20 F1 points. Even for the largest training set, we obtain a 2.3% F1 gain over Deepmatcher. Our results are in line with the findings of Li et al. [21], though not fully comparable, as the authors use DistilBERT [31] and apply additional data augmentation techniques. Overall, we can conclude that fine-tuning BERT is a promising technique for product matching, especially in settings with limited training data.

Table 2: BERT compared to baselines

            Word Cooc.             Magellan               Deepmatcher            BERT                   Li et al. [21]
            P      R      F1       P      R      F1       P      R      F1       P      R      F1       F1
  xlarge    86.59  79.67  82.99    71.44  56.89  63.33    89.63  94.78  92.12    95.99  93.00  94.47    95.45
  large     79.52  77.67  78.58    67.67  63.67  65.60    85.70  91.22  88.38    91.64  95.00  93.29    91.70
  medium    65.83  78.33  71.54    48.99  81.56  61.20    66.39  82.78  73.67    84.89  94.22  89.31    88.62
  small     53.98  74.67  62.66    50.86  71.22  59.17    54.86  69.56  61.20    75.62  89.33  81.89    80.76

3 INTERMEDIATE TRAINING ON DOMAIN-SPECIFIC DATA
BERT has been pre-trained on a general-purpose natural language corpus, whose language as well as topics are rather different from product descriptions. We thus test the intuitive assumption that intermediate in-domain training – after BERT's original pre-training and before fine-tuning for specific products – can improve matching performance. For the intermediate training we use training data covering a wide range of products from thousands of e-shops.

3.1 Building Intermediate Training Sets
We leverage the WDC Product Corpus for Large-Scale Product Matching [29] and its product-cluster structure to build wide-coverage training sets consisting of millions of offer pairs. The corpus consists of clusters containing offers for the same product. The clusters have been derived using schema.org annotated ids as weak supervision (see Section 2.1). In order to have an unbiased evaluation, the clusters contained in the test set and in the fine-tuning training sets are removed from the corpus prior to building the intermediate training sets.

We compare the effects of intermediate training on two structurally different training sets. The first intermediate training set contains only offer pairs for the category computers: this allows us to introduce more computer information into BERT and have the Transformer network detect relevant linguistic phenomena for recognizing matches between computer offers. The second training set contains pairs from four categories – computers, cameras, watches and shoes – with fewer training pairs per product: this offers a wider selection of products (i.e., more versatile information about what constitutes a product match for the model), but less in-depth information for each product/category.

We build the training sets as follows: for positive instances, we select only clusters containing more than one offer, from which we can build at least one positive pair. We restrict ourselves to clusters of size ≤80 after observing that very large clusters contain more noise and may lead to a degradation of performance. For each offer in each cluster we build up to 15 (computers) or 5 (4 categories) positive pairs with the other offers from that cluster. Half of those are hard positives, created by a) applying cosine similarity between bag-of-words vectors of the concatenation of the title and the first 5 words of the description and b) sorting offer pairs by cosine similarity and selecting the pairs with the lowest scores. The remaining 50% are selected by randomly pairing offers from the same cluster. We create negative pairs in a similar fashion: for each offer used for positive pairs, we create the same number of negative pairs using offers from other clusters of the same category. Hard negatives (50%) are pairs of offers from different clusters with the highest cosine similarity; the other half are randomly sampled pairs of offers from different clusters. Table 3 displays the statistics of the resulting intermediate training sets.

Table 3: Intermediate training set statistics

                    # products w/ pos (overall)   # pos. pairs   # neg. pairs   # comb. pairs
  computers only    60,030 (286,356)              409,445        2,446,765      2,856,210
  4 categories      201,380 (838,317)             858,308        2,665,056      3,523,364
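The hard-pair heuristic can be pictured with the following sketch: offers are represented as binary bag-of-words vectors over the title plus the first five description words, and the least similar same-cluster pairs are kept as hard positives, while hard negatives are mined analogously from different clusters by keeping the most similar pairs. The data structures and the number of returned pairs are illustrative, not taken from the released code.

    from itertools import combinations
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def hard_positive_pairs(cluster_offers, n_pairs):
        """Return the n_pairs least similar offer pairs within one product cluster."""
        texts = [o["title"] + " " + " ".join(o.get("description", "").split()[:5])
                 for o in cluster_offers]
        vectors = CountVectorizer(binary=True).fit_transform(texts)
        sims = cosine_similarity(vectors)
        # Score every pair inside the cluster and sort by ascending cosine similarity.
        scored = sorted((sims[i, j], i, j) for i, j in combinations(range(len(texts)), 2))
        return [(i, j) for _, i, j in scored[:n_pairs]]

    # Hard negatives: same representation, but pairs are drawn across clusters of the same
    # category and the pairs with the HIGHEST cosine similarity are kept instead.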
3.2 Intermediate Training Procedure
For the first set of experiments, the intermediate training is performed with a single objective, the binary product matching task. The architecture is exactly the same as for the fine-tuning experiments. One model is trained for each of the training sets from Table 3. After intermediate training, we evaluate the model with and without final product-specific fine-tuning. We run the intermediate training for 40 epochs with a linearly decaying learning rate (starting from 5e-5) with 10,000 warm-up steps and a batch size of 256. Due to the long training times, we train the first 90% of epochs on sequences of length 128 and only the last 10% on the full sequences of 512 tokens to speed up training, similar to the original BERT training procedure [8].

In the second set of experiments, we add the MLM objective to the product matching objective and jointly optimize both in the intermediate training step. We follow the original masking procedure: we randomly select 15% of tokens for replacement; in 80% of the cases, we replace the token with the [MASK] token, in 10% of the cases with a random vocabulary token, and in the remaining 10% we keep the original token (i.e., we give up the replacement). As in the original work, we train the Transformer network by minimizing the cross-entropy loss over predictions of masked tokens. After the intermediate training, we again evaluate two model variants: with and without the final product-specific matching fine-tuning.
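One way such a joint objective could be wired together with HuggingFace Transformers is sketched below. This is an assumption about the implementation rather than the authors' code: BERT's pre-training architecture is reused, its next-sentence-prediction head is repurposed as the binary match/non-match classifier, and DataCollatorForLanguageModeling applies the 15% masking with the 80/10/10 replacement recipe. The example pair and the label convention (0 = match) are illustrative.

    import torch
    from transformers import BertTokenizer, BertForPreTraining, DataCollatorForLanguageModeling

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForPreTraining.from_pretrained("bert-base-uncased")   # MLM head + binary pair head
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    # One (hypothetical) offer pair, encoded as [CLS] offer_a [SEP] offer_b [SEP].
    encoded = tokenizer("lenovo thinkpad t480 i5 8gb", "thinkpad t480 business laptop",
                        truncation=True, max_length=128)
    batch = collator([encoded])   # masks 15% of the tokens (80/10/10) and builds the MLM labels

    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    token_type_ids=batch["token_type_ids"],
                    labels=batch["labels"],                    # MLM targets (-100 on unmasked positions)
                    next_sentence_label=torch.tensor([0]))     # repurposed here as 0 = match, 1 = non-match
    outputs.loss.backward()   # outputs.loss is the sum of the MLM loss and the pair-classification loss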
3.3 Intermediate Training Results
Table 4 shows the results of the intermediate training procedure. We compare the intermediate training on the computers training set against the intermediate training on the training set comprising 4 product categories. We observe that even without final fine-tuning (row 'none' in Table 4), we achieve a very good matching performance of 92% F1. This suggests that through the intermediate training we inject category-specific knowledge into BERT's parameters, as it is evidently able to make good matching predictions for products for which it had not seen any training examples. Once the intermediate model is subjected to further fine-tuning on offer pairs from the training sets, we observe further improvements in all settings, with gains being most prominent for the smallest training set. Intermediate training followed by fine-tuning on small training sets reaches a performance of ∼94% F1, which, without intermediate pre-training (see Table 2), we previously obtained only on the largest training set. Training on category-specific data (computers) generally yields marginally better performance than training on the mix of 4 categories.

Table 4: Intermediate training with PM objective

            computers category                       4 categories
            P      R      F1      Δ only fine-tune   P      R      F1      Δ only fine-tune
  xlarge    95.58  93.67  94.61    0.14              95.45  95.44  95.45    0.98
  large     92.68  95.56  94.09    0.80              91.34  96.00  93.61    0.32
  medium    94.01  95.78  94.88    5.57              91.59  95.67  93.59    4.28
  small     94.38  93.11  93.73   11.84              90.39  90.89  90.64    8.75
  none      94.41  90.00  92.15   -2.32 (xl)         88.24  95.00  91.49   -2.98 (xl)
  (Δ only fine-tune: F1 difference to fine-tuning without intermediate training, Table 2; for the row 'none', relative to the xlarge fine-tuning setting.)

Table 5 shows the results of adding the MLM objective to the product matching objective in the intermediate training step, using the computers intermediate training set. Compared to the corresponding settings in which the intermediate training did not include MLM (see the left half of Table 4), the performance (with fine-tuning) increases by up to 3% F1, yielding a new top overall matching performance (>97% F1 for the largest training set and 96% F1 for all other training sizes). This confirms the findings from other application domains [2, 20] pointing to the benefits of domain-specific MLM pre-training. The original pre-training data likely contains only few instances of product-specific vocabulary, as it covers a wide range of topics. Applying intermediate MLM training on domain-specific data allows for adaptation of the vocabulary embeddings to the domain, resulting in better downstream performance.

Table 5: Intermediate training with PM and MLM objective

            intermediate training - PM + MLM
            P      R      F1      Δ only fine-tune   Δ interm. only PM
  xlarge    98.20  96.56  97.37    2.90               2.76
  large     94.99  96.67  95.82    2.53               1.73
  medium    96.05  97.11  96.58    7.27               1.70
  small     95.64  97.44  96.53   14.64               2.80
  none      94.31  94.00  94.16   -0.31 (xl)          2.01
  (Δ interm. only PM: F1 difference to intermediate training with the PM objective only on the computers category, Table 4.)

In summary, subjecting BERT to an intermediate training step with large amounts of product data leads to a model that generalizes well to new, unseen products from the same category and can be easily fine-tuned with small amounts of product-specific training data to further increase the performance for these products. Depending on the structure of the intermediate training set, more training data for a single category can lead to a small increase in performance compared to a more heterogeneous training set encompassing a larger set of products from several categories. Adding the MLM objective to the intermediate training results in further improvements in matching performance, suggesting that domain-specific language modeling indeed successfully adapts BERT's parameters to the product domain.

4 RELATED WORK
Product matching, a task with a rich history and a large body of work in both research and industry, can be seen as a special case of entity resolution, which concerns itself with the disambiguation of entity representations to their respective real-world entities [5, 6]. Early approaches applied rule- and statistics-based methods [12]. Since the early 2000s, machine learning based methods have taken the focus due to their strong performance [19]. In recent years, due to the successes of deep learning in fields like computer vision and natural language processing, researchers working on entity matching have started to shift their attention towards these methods as well [1, 11, 13, 16, 24, 32, 37]. Recently, Transformer-based architectures [8, 33] were shown to produce state-of-the-art results [4, 21].

5 CONCLUSION
Transformer-based language models like BERT have had a tremendous impact in the field of NLP, improving the state-of-the-art performance in a wide variety of tasks. In this work, we demonstrate the utility of BERT for product matching in e-commerce, showing that it is much more training data efficient than Deepmatcher. Performing intermediate training of BERT with large amounts of product data from thousands of e-shops leads to a model with high generalization performance (>90% F1) for new (i.e. unseen) products. We show that, if subjected to intermediate training, BERT reaches peak performance with less product-specific training data than without intermediate training. We achieve the best performance if intermediate training combines two jointly-trained objectives: (1) binary product matching and (2) masked language modeling. Category-specific intermediate training yields only slightly better performance than intermediate training on cross-category data. While intermediate product-matching training alone brings substantial gains, adding the masked language modeling objective to the intermediate training gives an additional performance edge of up to 3% F1 in all setups. This is in line with observations from other domains, such as scientific text [2, 20], that domain-specific language modelling improves the performance of BERT for in-domain downstream tasks.

REFERENCES
[1] Luciano Barbosa. 2019. Learning Representations of Web Entities for Entity Resolution. International Journal of Web Information Systems 15, 3 (2019), 346–358.
[2] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 3606–3611.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - a Step Forward in Data Integration. In Proceedings of the International Conference on Extending Database Technology, 2020. 463–473.
[5] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer-Verlag, Berlin Heidelberg.
[6] Vassilis Christophides, Vasilis Efthymiou, Themis Palpanas, George Papadakis, and Kostas Stefanidis. 2019. End-to-End Entity Resolution for Big Data: A Survey. arXiv:1905.06397 [cs] (2019).
[7] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs] (2020).
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
[9] William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing. 9–16.
[10] Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, et al. 2020. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types. arXiv:2006.13473 [cs] (2020).
[11] Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed Representations of Tuples for Entity Resolution. Proceedings of the VLDB Endowment 11, 11 (2018), 1454–1467.
[12] Ivan P. Fellegi and Alan B. Sunter. 1969. A Theory for Record Linkage. J. Amer. Statist. Assoc. 64, 328 (1969), 1183–1210.
[13] Cheng Fu, Xianpei Han, Le Sun, Bo Chen, Wei Zhang, et al. 2019. End-to-End Multi-Perspective Matching for Entity Resolution. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 4961–4967.
[14] John Hewitt and Christopher D. Manning. 2019. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4129–4138.
[15] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, et al. 2020. XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalization. In Proceedings of the International Conference on Machine Learning. 7449–7459.
[16] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-Resource Deep Entity Resolution with Transfer and Active Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5851–5861.
[17] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] (2014).
[18] Pradap Konda, Sanjib Das, Paul Suganthan G. C., AnHai Doan, Adel Ardalan, et al. 2016. Magellan: Toward Building Entity Matching Management Systems. Proceedings of the VLDB Endowment 9, 12 (2016), 1197–1208.
[19] Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment 3, 1-2 (2010), 484–493.
[20] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, et al. 2020. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 36, 4 (2020), 1234–1240.
[21] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. arXiv:2004.00584 [cs] (2020).
[22] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, et al. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 (2019).
[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the Conference on Neural Information Processing Systems. 3111–3119.
[24] Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, et al. 2018. Deep Learning for Entity Matching: A Design Space Exploration. In Proceedings of the 2018 International Conference on Management of Data. 19–34.
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. 8024–8035.
[26] Ralph Peeters, Anna Primpeli, Benedikt Wichtlhuber, and Christian Bizer. 2020. Using schema.org Annotations for Training and Maintaining Product Matchers. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics.
[27] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1532–1543.
[28] Jason Phang, Thibault Févry, and Samuel R. Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-Data Tasks. arXiv:1811.01088 (2018).
[29] Anna Primpeli, Ralph Peeters, and Christian Bizer. 2019. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Workshop on e-Commerce and NLP (ECNLP 2019), Companion Proceedings of WWW. 381–386.
[30] Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, et al. 2020. Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 5231–5247.
[31] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv:1910.01108 [cs] (2020).
[32] Kashif Shah, Selcuk Kopru, and Jean David Ruvini. 2018. Neural Network Based Extreme Classification and Similarity Models for Product Matching. In Proceedings of the 2018 Conference of the Association for Computational Linguistics, Volume 3 (Industry Papers). 8–15.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
[34] Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, et al. 2019. Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4465–4476.
[35] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, et al. 2019. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. arXiv:1910.03771 (2019).
[36] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. 2020. Product Knowledge Graph Embedding for E-Commerce. In Proceedings of the 13th International Conference on Web Search and Data Mining. 672–680.
[37] Dongxiang Zhang, Yuyang Nie, Sai Wu, Yanyan Shen, and Kian-Lee Tan. 2020. Multi-Context Attention for Entity Matching. In Proceedings of The Web Conference 2020. 2634–2640.