<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MWPD2020: Semantic Web Challenge on Mining the Web of HTML-embedded Product Data</article-title>
      </title-group>
      <contrib-group>
<contrib contrib-type="author">
          <string-name>Ziqi Zhang</string-name>
          <email>ziqi.zhang@sheffield.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bizer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Peeters</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Primpeli</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universität Mannheim</institution>
          ,
          <addr-line>Schloss, 68131 Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Sheffield</institution>
          ,
          <addr-line>Broomhall, Sheffield S10 2TG</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper gives an overview of the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data (MWPD2020) which has been conducted as part of the International Semantic Web Conference (ISWC2020). The challenge consists of two tasks: product matching and product classification. In the first task, participants need to identify offers for the same product originating from different websites. The goal of the second task is to categorize offers from different websites into the GS1 GPC product hierarchy. Six teams from the USA, China, Japan, and Germany participated in the challenge. The winning system in Task 1, PMap, achieved an F1 score of 86.05 using an ensemble of transformer-based language models. Task 2 was won by team Rhinobird achieving a weighted average F1 score of 88.62 using a BERT-based ensemble which considers the dependencies among different category levels.</p>
      </abstract>
      <kwd-group>
        <kwd>entity matching</kwd>
        <kwd>hierarchical classification</kwd>
        <kwd>e-commerce</kwd>
        <kwd>schema.org</kwd>
        <kwd>microdata</kwd>
        <kwd>benchmarking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recent years have seen significant growth of semantic annotations on the Web, using markup languages such as Microdata together with the schema.org vocabulary. A particular domain that is witnessing the boom of semantic annotations is e-commerce, where online shops are increasingly embedding schema.org annotations into HTML pages describing products in order to enable search engines to easily identify product offers and potentially drive traffic to the respective websites. Statistics from the Web Data Commons (WDC) project<sup>3</sup> show that, as of November 2018, 37% of web pages, or 30% of websites, contain semantic annotations, amounting to over 30 billion facts. Among these, nearly 20% are</p>
    </sec>
    <sec id="sec-2">
      <title>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</title>
      <sec id="sec-2-1">
        <title>3 http://webdatacommons.org/structureddata/</title>
        <p>
          related to products. Such structured product data on the Web have created opportunities for new services, such as product search and integration platforms, recommender systems [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], as well as emerging research fields such as product knowledge graphs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Many websites have started to semantically annotate product identifiers within their pages. This enables the identification of offers for the same product on different websites. The resulting clusters of product descriptions can be used as weak supervision for training product matchers, which in turn can be applied to identify products on websites that do not provide product identifiers [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. However, there are also challenges associated with the annotations. For example, less than 10% of the offers are annotated with a product category [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], while the used categorization systems are website-specific and highly inconsistent across different websites [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>The potential as well as the challenges resulting from the widespread availability of semantically annotated product data on the Web motivated the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data (MWPD2020), as well as the specific tasks of the challenge: product matching and product classification. In the first task, participants need to identify offers for the same product originating from different websites. The goal of the second task is to categorize offers from different websites into the GS1 GPC product hierarchy. For both tasks, we have assembled training, validation, and test sets consisting of semantically annotated product data from a wide variety of different websites.</p>
        <p>The event attracted a total of six participating teams, including research institutions as well as commercial entities, from the USA, China, Japan, and Germany. The winning team for the product matching task represents the National Institute of Informatics, Japan; while the winning team for the product classification task represents the Tongji University of China and Tencent, China.</p>
        <p>The remainder of this paper is structured as follows. Sections 2 and 3 explain the two tasks, including their objectives, datasets, and evaluation metrics; Section 4 presents the results of the challenge and gives an overview of the participating systems; and Section 5 concludes the paper with some lessons learned from the challenge and a comparison of MWPD2020 to related benchmark events. More information about MWPD2020 can be found on the challenge's website, which also provides all datasets for public download<sup>4</sup>.</p>
        <sec id="sec-2-1-1">
          <title>Task 1 - Product Matching</title>
          <p>E-commerce websites frequently annotate product identifiers, product titles, product descriptions, brands, and product prices within their pages using schema.org terms. In addition, offers are often accompanied by specification tables, i.e. HTML tables that contain product details in the form of key/value pairs. Given the syntactic, structural and semantic heterogeneity among the offers, it is challenging to identify which offers refer to the same product, a problem known as</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4 https://ir-ischool-uos.github.io/mwpd/</title>
        <p>product matching. In this task, product matching is handled as a binary classification problem: given two product offers, the participating systems need to decide whether the offers describe the same product (matching) or not (non-matching).</p>
        <p><bold>Datasets.</bold> The participants of Task 1 were given a large corpus of product offers which are grouped into clusters referring to the same product. This corpus could be used by the participants to assemble training sets of different width and depth. In order to ease starting to work on the task, we also provide a readily assembled example training and validation set. The test set that was used to evaluate the participating systems was kept secret during the submission period of the challenge and was released afterwards.</p>
        <p>
          <bold>Product Data Corpus.</bold> The WDC Product Data Corpus [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] was released in 2018<sup>5</sup> by the Web Data Commons project and is the largest publicly available product data corpus. It consists of 26 million product offers originating from 70 thousand different e-shops. Exploiting the weak supervision found on the Web in the form of product identifiers, such as GTINs or MPNs, the product offers are grouped into 16 million clusters. The clusters can be used to derive training sets containing matching and non-matching pairs of offers. The grouping of offers into clusters is subject to some degree of noise, which is approximately 7%. The following attributes are used for describing the product offers in the corpus and can be used for training: title; description; brand; price; specTableContent, which contains the content of the specification tables found on the website of the product offer; keyValuePairs, which are the heuristically extracted key/value pairs from the specification tables; and category, which is one of the 25 categories the offer was assigned to. Additionally, two identifier attributes are assigned to every product offer: the id, which is the unique identifier of the offer, and the cluster id, which is the identifier of the cluster to which the offer belongs.
        </p>
        <p>
          <bold>Example Training Set.</bold> Generating interesting matching and non-matching pairs of offers which can be used for training powerful matching models is a non-trivial and resource-intensive task. Therefore, we offer an example training set derived from the product corpus, with the goal of additionally supporting the participants of this task. Being a direct subset of the product corpus, the example training set is subject to some inherent noise. The example training set contains 68K pairs of matching and non-matching offers from 772 distinct products (clusters of offers). We offer the example training set in JSON format. Every JSON object in the training set describes a pair of offers (left offer - right offer) using the offer attributes together with their corresponding matching label. Figure 1 shows an example of a non-matching product offer pair in the example training set.
        </p>
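        <p>The derivation of labeled pairs from the identifier-based clusters can be sketched as follows; this is a minimal illustration rather than the exact WDC pair-generation procedure, and the offer records and field names are invented for the example.</p>
        <preformat>
```python
import itertools
import random

# Hypothetical offers: each has a unique id and the cluster_id of the
# product it describes (offers in the same cluster refer to the same product).
offers = [
    {"id": 1, "cluster_id": "c1", "title": "Acme XPS 13 Laptop"},
    {"id": 2, "cluster_id": "c1", "title": "XPS13 by Acme, 8GB RAM"},
    {"id": 3, "cluster_id": "c2", "title": "Acme XPS 15 Laptop"},
    {"id": 4, "cluster_id": "c2", "title": "Acme XPS 15, 16GB"},
    {"id": 5, "cluster_id": "c3", "title": "Logi M330 Mouse"},
]

def make_pairs(offers, neg_per_pos=1, seed=42):
    """Build matching pairs within clusters and sample non-matching
    pairs across clusters, mirroring the pair format of the task."""
    rng = random.Random(seed)
    positives = [
        {"left": a, "right": b, "label": 1}
        for a, b in itertools.combinations(offers, 2)
        if a["cluster_id"] == b["cluster_id"]
    ]
    candidates = [
        (a, b)
        for a, b in itertools.combinations(offers, 2)
        if a["cluster_id"] != b["cluster_id"]
    ]
    negatives = [
        {"left": a, "right": b, "label": 0}
        for a, b in rng.sample(candidates,
                               min(len(candidates),
                                   neg_per_pos * len(positives)))
    ]
    return positives + negatives

pairs = make_pairs(offers)
```
        </preformat>
        <p>With one negative sampled per positive, the five offers above yield two matching and two non-matching pairs.</p>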
      </sec>
      <sec id="sec-2-3">
        <title>5 http://webdatacommons.org/largescaleproductcorpus/v2/</title>
        <p><bold>Validation Set.</bold> We provide a validation set consisting of 1,100 offer pairs from the Computers and Accessories category as the ground truth for this task. The validation set has the same structure as the example training set. The ratio of matching to non-matching pairs is 3:8. The offers of the validation set are derived from 745 distinct products (clusters). Table 1 presents, for the training and validation sets, the average attribute density, the average length in characters of the attribute values, as well as the standard deviation of the value length.</p>
        <p>With the aim of constructing the validation set using a good mixture of hard and easy matching and non-matching pairs of offers distributed over different products, we applied the following heuristic: First, 150 clusters of the category Computers and Accessories are randomly selected. Considering the clustering scheme, we construct all possible matching and non-matching pairs. For each offer pair we randomly pick the Jaccard similarity over the titles, the descriptions, or the average of both as a similarity metric and calculate its similarity score. To select matching pairs, we pick within each cluster the pair with the lowest similarity score and one randomly chosen pair and add them to the validation set. To select negative pairs, we take for each offer two to three pairs with high similarity and three pairs at random. All matching and non-matching pairs of the validation set are manually verified. Therefore, unlike the provided example training set, the validation set does not contain any noisy offer pairs.</p>
        <p><bold>Test Set.</bold> The hidden test set, which is used for evaluating and ranking the matching systems of the participants of this task, consists of 1,500 offer pairs from the category Computers and Accessories. The offer pairs in the test set are carefully selected in order to cover different types of matching challenges. 1,100 pairs are randomly selected corner cases, meaning that they are similar non-matching pairs and dissimilar matching pairs. For the remaining 400 pairs, we define a categorization scheme consisting of four specific types of matching challenges. The distribution of offer pairs per type of challenge remained unknown to the participants in the first round and was revealed to them at the beginning of the second round to allow them to tune their systems to the specific challenges. The distribution of offer pairs per type of matching challenge is shown in Table 2.</p>
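        <p>The Jaccard similarity over titles used in the selection heuristic above can be sketched as follows; lower-cased whitespace tokenization is an assumption for illustration, as the exact tokenization is not specified.</p>
        <preformat>
```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace tokens:
    intersection size divided by union size of the token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta.intersection(tb)) / len(ta.union(tb))

# Hypothetical titles: 3 shared tokens out of 6 distinct tokens.
score = jaccard("Acme XPS 13 Laptop", "Acme XPS 13 Notebook 8GB")
```
        </preformat>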
        <p>Table 2 distinguishes the following types of matching challenges (the counts per type are not recoverable here): (SN-DM) similar non-matches, dissimilar matches; (NP-HS) new products with high similarity to known products; (NP-LS) new products with low similarity to known products; (KP-TY) known products with introduced typos; (KP-DR) known products with dropped tokens.</p>
        <p>With the term new product we refer to products which are contained in the WDC Product Data Corpus but not in the provided example training set. Known products are products from clusters that have training data in the provided training set. The similarity for choosing similar non-matching and dissimilar matching pairs is measured using the Jaccard similarity metric on the titles of the offers.</p>
        <p>
          <bold>Evaluation Metrics and Baseline.</bold> For the evaluation of the product matching task we use standard precision, recall and F1 calculated on the positive class. As a baseline, we use the Deepmatcher [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] framework, more specifically the RNN module using default parameters, apart from the positive/negative ratio, which we set to the actual distribution found in the training set. This model is trained on the training set provided for the challenge for 15 epochs, using the attributes title, description, brand and specTableContent. We preprocess the attributes by removing some symbols and schema.org-related terms and finally lowercasing them. Since the model relies on pre-trained word- or character-based embeddings, we use the fastText embeddings pre-trained on the English Wikipedia<sup>6</sup>.
        </p>
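        <p>Precision, recall and F1 on the positive (matching) class follow the standard definitions; a minimal sketch (labels 1 = matching, 0 = non-matching):</p>
        <preformat>
```python
def prf1_positive(y_true, y_pred):
    """Precision, recall and F1 computed on the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = prf1_positive([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```
        </preformat>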
        <sec id="sec-2-3-1">
          <title>Task 2 - Product Classification</title>
          <p>The same products are often sold on different websites, which generally organise their products into certain categorisation systems. However, such product categorisations differ significantly across websites, even if they sell similar product ranges. This makes it difficult for product information integration services to collect and organise product offers on the Web. The product classification task deals with assigning pre-defined product category labels from a universal catalogue to product instances (e.g., an iPhone X is a `SmartPhone' and also `Electronics').</p>
          <p>
            <bold>Datasets: classification labels.</bold> In this task, the GS1 Global Product Classification standard (GPC)<sup>7</sup> is used to classify product instances. The GPC standard classifies products into a hierarchy of multiple levels based on their essential properties as well as their relationships to other products. It offers a universal standard for organising the product catalogues of any business. In this task, the top three levels of GPC are used to classify each product. Level 1 contains 40 classes such as `Automotive' and `Clothing'. Level 2 further divides level 1 classes into more than 100 classes such as `Automotive Accessories and Maintenance' and `Swimwear'. Level 3 then further divides level 2 classes into over 700 classes such as `Automotive Antifreeze' and `Beachwear/Cover Ups'.
          </p>
          <p>
            <bold>Gold standard.</bold> The creation of the gold standard is based on the earlier product classification dataset released by the Web Data Commons project [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. The original dataset contained 8,361 product instances randomly sampled from 702 vendors' websites. These were manually classified into the above-mentioned three levels of classification by human annotators.
          </p>
          <p>In MWPD2020, we further extended the original GS by adding over 7,000 product instances. However, due to the complexity of the GPC hierarchy, annotating a random sample by checking against every class in the three classification levels is a very time-consuming process. Therefore, we followed a `controlled process' detailed below.
1. Create a Solr<sup>8</sup> index of product instances by parsing the Product Data Corpus. The index contains five fields corresponding to attributes of each product: an id field to uniquely identify a product index; a name field recording</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>6 https://fasttext.cc/docs/en/pretrained-vectors.html</title>
      </sec>
      <sec id="sec-2-5">
        <title>7 https://www.gs1.org/standards/gpc</title>
      </sec>
      <sec id="sec-2-6">
        <title>8 https://lucene.apache.org/solr/</title>
        <p>the name of the product; a description field recording the long description of the product; a category text field recording the product category information as provided by the source web page; and a provenance field recording the source URL from which the structured data are extracted. All these attributes are extracted from the RDF quads where available in the dataset.
2. Given an existing product instance (i.e., the reference product) in the original</p>
        <p>GS, search for its name in the description field of the above index;
3. Select up to 50 results (i.e., the target products) with a different name from the reference product;
4. Rank the results by the Levenshtein distance between the reference product's name and the names of the target products;
5. Select the top 10 ranked results;
6. A human annotator manually evaluates the ranked results, and selects n products that s/he deems to belong to the same level-3 class as the reference product;
7. The selected n products are assigned the same level-3 class, as well as the corresponding level-2 and level-1 classes, by traversing the GPC hierarchy.</p>
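        <p>Steps 4 and 5 above rank candidates by edit distance to the reference product name; a minimal sketch using the classic dynamic-programming Levenshtein distance (not necessarily the implementation used by the annotation tooling):</p>
        <preformat>
```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def top_candidates(reference_name, target_names, k=10):
    """Rank target product names by distance to the reference name
    and keep the top-k closest (steps 4 and 5)."""
    ranked = sorted(target_names,
                    key=lambda n: levenshtein(reference_name, n))
    return ranked[:k]
```
        </preformat>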
        <p>In step 6 above, the human annotators are presented with the following information when assessing each target product: the reference product's name, description, level-1, 2, 3 classes, provenance, and website-specific category information or breadcrumb if available; and the target product's name, description and provenance. They are instructed to exercise their own discretion to decide an optimal n, balancing the diversity of the already selected target products in terms of their name, vendor as identified by their provenance, and level-3 classes.</p>
        <p>The main reasons for allowing this flexibility are two-fold. First, in the original GS, there existed certain `difficult' and often minority classes (e.g., 93070100 Seeds/Spores) for which steps 2 and 3 hardly returned many positive matches. Second, there also existed certain `dominant' classes that represented a very large fraction of the original GS (e.g., 67000000 Clothing was over 40%) and for which it was also `easier' to find matches by steps 2 and 3. This implies that our controlled process runs a risk of further accentuating the already unbalanced nature of the original GS. Thus, by exercising their discretion based on the above principle, our goal was to control the balance in the distribution of classes in the final dataset. In practice, n ranged between 0 (no suitable target) and 6 (typically for `difficult' classes).</p>
        <p>The annotation was conducted by two computer science researchers, and Inter-Annotator Agreement was studied on 100 product instances which they both annotated. A Cohen's Kappa of 97% was obtained. The final dataset contains 16,119 instances and is stored in JSON. In addition to the five product attributes described before, each instance is assigned three labels, corresponding to the three levels of classification. Further, the description is truncated to a maximum of 5,000 characters. A screenshot of an example instance is shown in Figure 2. The dataset is split into training (10,012), validation (3,000), and test (3,107) sets, with statistics shown in Figures 3 and 4. As in Task 1, only the training and validation sets were revealed to the participants before the submission of their final system output, which was created on the test set. Although the dataset is very unbalanced, with several large classes dominating at all three levels, it is worth noting that it follows a distribution consistent with the original GS.</p>
        <p><bold>Additional resources.</bold> During the process of creating the gold standard, additional resources<sup>9</sup> were created with the aim of supporting participants' system development. These include: (1) a `product data textual corpus' that contains the descriptions of all product instances from the Solr index above, to which a light-weight cleaning process was applied to keep only descriptions of at least 5 tokens (separated by whitespace characters) and 20 characters; this corpus has over 1.9 billion tokens; (2) word embedding models (both continuous bag-of-words (CBOW) and skip-gram) trained on the above textual corpus, by applying the Gensim<sup>10</sup> (version 3.4.0) implementation of Word2Vec.</p>
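        <p>The light-weight cleaning filter for the textual corpus can be sketched as follows, using the thresholds stated above (at least 5 whitespace-separated tokens and 20 characters):</p>
        <preformat>
```python
def keep_description(text: str,
                     min_tokens: int = 5,
                     min_chars: int = 20) -> bool:
    """Keep a description only if it has enough whitespace-separated
    tokens and enough characters, as in the corpus cleaning step."""
    tokens = text.split()
    return len(tokens) >= min_tokens and len(text) >= min_chars
```
        </preformat>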
        <p>
          <bold>Evaluation Metrics and Baseline.</bold> For each classification level, the standard precision, recall and F1 are used, and a weighted-average macro-F1 (WAF1) is calculated over all classes. The average of the WAF1 scores of the three levels is then calculated and used to rank the participating systems. As a baseline, we use a configuration based on the one used in the Rakuten Data Challenge<sup>11</sup> [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Specifically: it uses the same fastText algorithm and parameters as in the Rakuten Data Challenge; it uses only product names as input text; and all text is lowercased and lemmatised using NLTK version 3.4.5.
        </p>
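        <p>The ranking score, i.e. the mean over the three levels of the weighted-average F1, can be sketched generically as follows (per-class F1 weighted by gold-label support); this illustrates the stated metric and is not the official evaluation script:</p>
        <preformat>
```python
from collections import Counter

def weighted_average_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by the class's
    support (frequency) in the gold labels."""
    support = Counter(y_true)
    total = len(y_true)
    waf1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        pred = sum(1 for p in y_pred if p == cls)
        prec = tp / pred if pred else 0.0
        rec = tp / count
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        waf1 += (count / total) * f1
    return waf1

def ranking_score(levels_true, levels_pred):
    """Average the WAF1 over the three classification levels."""
    scores = [weighted_average_f1(t, p)
              for t, p in zip(levels_true, levels_pred)]
    return sum(scores) / len(scores)
```
        </preformat>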
        <sec id="sec-2-6-1">
          <title>Results</title>
          <p>The competition was organised in two rounds. In order to improve their systems, the teams were shown the leaderboard after Round 1 and were informed about the F1 scores that their systems achieved on the specific types of matching challenges (see Table 2). A total of six teams representing different industry organisations and research and academic institutions participated in Round 1. Six teams participated in Task 1, while five participated in Task 2. Two teams continued in Round 2 for Task 1, while only one team chose to continue in Round 2 for Task 2. Five teams submitted a paper describing their system. A brief overview of the results for both tasks is given below. Section 4.3 provides additional details about each team and their system.</p>
          <p><bold>Results Task 1 - Product Matching</bold></p>
        </sec>
      </sec>
      <sec id="sec-2-7">
        <title>9 Download from: https://bit.ly/36d0NYd</title>
        <p>10 https://radimrehurek.com/gensim/ 11 Download from: https://github.com/ir-ischool-uos/mwpd/tree/master/prodcls</p>
        <p>
          Table 3 shows the results for Task 1, with team PMap achieving the best overall F1 score and Rhinobird and ISCAS-ICIP very close behind. An overview of each team's methods along selected dimensions which resulted in the final submission can be seen in Table 4. All of the teams who submitted system papers employed fine-tuning of transformer-based pre-trained language models for the task of product matching in one form or another, often combined with some form of ensembling. Team ISCAS-ICIP was the only team employing an ensembling approach across different deep matching models. The other teams who employed ensembling limited themselves to fine-tuned transformer-based models, e.g. BERT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and RoBERTa [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-8">
        <title>Table 3. Results for Task 1 (precision, recall, F1) for PMap, Rhinobird (Rounds 1 and 2), ISCAS-ICIP (Rounds 1 and 2), ASVinSpace, Megagon, Team ISI, and the Deepmatcher baseline; the scores are not recoverable from the extraction</title>
      </sec>
      <sec id="sec-2-17">
        <title>Overview of the Task 1 Systems</title>
        <p>Most teams applied some form of standard pre-processing to the data before training. Team ISCAS-ICIP also employs a feature extraction approach based on fixed vocabularies and regular expressions to extract multiple features from the textual descriptions. For this they use the provided WDC Large-Scale Corpus for Product Matching. They further use it to expand the provided training set with more product pairs from the relevant product category. Teams Rhinobird and ASVinSpace also tried augmenting the training data with pre-built training sets of other categories from the product corpus. Most teams tried different combinations of features and training sets during experimentation. Team Rhinobird also tried optimizing for focal loss in addition to cross-entropy, as well as employing a self-ensembling technique over multiple training epochs of the same model. Rhinobird and ISCAS-ICIP further implemented a post-processing step, where they used heuristic rules to correct some of the predicted labels, e.g. always predicting non-match if the brands of the two offers do not match.</p>
        <p>All teams were provided with information about their performance on the specific matching challenges in the test set (see Section 2) after Round 1. These scores could be used to improve specific aspects of the systems for Round 2. The results of all teams on the specific types of matching challenges are found in Table 5.</p>
        <p>Table 4. Overview of the systems submitted for Task 1, along the dimensions pre-processing, used attributes, matching model, matching decision, post-processing, and external resources. PMap: removes symbols and non-alphabet characters; uses the title attribute; fine-tunes BERT-large, RoBERTa-large and RoBERTa-base; decides via an ensemble of transformers. Rhinobird: removes stopwords and lower-cases; uses title and description; fine-tunes BERT-base; decides via an ensemble of self-ensembling transformers with different loss functions; applies heuristic rules to correct predictions; uses training data for four categories from the WDC corpus. ISCAS-ICIP: removes stopwords and alphanumeric characters and lower-cases; uses title, price, and extracted brand and model attributes; combines MPM, HierMatcher and Ditto; decides via an ensemble of different model types; applies heuristic rules to correct predictions; uses the WDC corpus and additional vocabularies for attribute extraction and training data. ASVinSpace: uses title, description and specTableContent; fine-tunes DistilRoBERTa-base; decides via a single model; uses training data for four categories from the WDC corpus.</p>
        <p>Team ISCAS-ICIP was able to improve on their performance for all challenges in Round 2, significantly improving on new products, which were not contained in the provided training set. For Round 2 they exchanged one of the matching models in their ensemble with a transformer-based model. This may suggest that this kind of model is better suited for handling new products than the previously used one (see below). Their overall result improved by 3% F1 going from Round 1 to Round 2. Team Rhinobird managed to improve significantly for products containing typos or dropped words while losing some performance across the other classes. They changed one of the three BERT models in their ensemble to one trained with a different loss function for Round 2. Their overall result improves by only 0.5% F1 from Round 1 to Round 2, trading 2% precision for 4% recall. Overall, no team consistently beats the others across all challenges, which is not surprising, as they all apply similar approaches and the overall results of the top teams vary only within 1% F1.</p>
        <p><bold>Results Task 2 - Product Classification</bold></p>
      </sec>
      <sec id="sec-2-27">
        <title>Table 5. F1 per type of matching challenge (SN-DM, NP-HS, NP-LS, KP-TY, KP-DR) for PMap, Rhinobird (Rounds 1 and 2), ISCAS-ICIP (Rounds 1 and 2), ASVinSpace, Megagon, and Team ISI; the scores are not recoverable from the extraction</title>
        <p>The approaches submitted for Task 2 range from fine-tuning of existing pre-trained language models (e.g., DICE) to very complex structures that combine 17 different models through ensembling (e.g., Rhinobird).</p>
        <p>
          In terms of the text input used for supervised learning, all teams used product name and description, except ASVinSpace, which also used the URL. However, it is unclear what the effect of the URL is, due to the lack of an ablation test. Interestingly, no teams used the category text provided as-is by the source vendor websites, even though such content has proven to be useful for such tasks [
          <xref ref-type="bibr" rid="ref16 ref22">16, 22</xref>
          ].
        </p>
        <p>In terms of using external resources (excluding the use of pre-trained language models) to support the learning, Team ASVinSpace used a novel approach that extends the training set by harvesting data from Wikidata. None of the teams used the pre-trained embedding models or the product description corpus. However, Table 6 demonstrates that the pre-trained embedding models can be effective for further enhancing the learning.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Table 6. Results for Task 2 for Rhinobird (Rounds 1 and 2), ISI, ASVinSpace, Megagon, and DICE; the scores are not recoverable from the extraction</title>
    </sec>
    <sec id="sec-7">
      <title>Participating Teams</title>
      <p>In the following, we summarize the approaches that were used by the different teams. The complete details about the methods are given in the system papers written by the teams themselves, which are contained in the MWPD2020 proceedings.</p>
      <p>
        Team Rhinobird represents the Tongji University of China and Tencent,
China, and participated in both Task 1 and 2 in both rounds. For task 1, they
rely on the BERT model while experimenting with di erent loss functions and
ensembling steps. More speci cally, after removing stopwords and lower-casing
the data, they ne-tune multiple BERT models while experimenting with di
erent training sets, features as well as focal loss [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as a variation of the standard
cross-entropy loss function. In addition to the provided training set, they
experiment with a larger training set containing product pairs for four product
categories. They try using only the title attribute as well as the concatenation
of title and description as input features. Furthermore, Team Rhinobird
experiments with a method of self-ensembling across multiple training epochs of the
same model, namely stochastic weight averaging (SWA) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Finally, subsets
of all the previously mentioned models are ensembled by averaging their
prediction probabilities and subsequently selecting the best performing ensemble.
Team Rhinobird also applies some simple post-processing rules to correct the
predictions of the models; more specifically, all test pairs that do not belong to the
same product category are set to be non-matches. Their submission for both
rounds consists of an ensemble of three fine-tuned BERT models trained
with different choices for the previously mentioned parameters.
      </p>
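      <p>The averaging and rule-based post-processing described above can be sketched as follows; the probabilities and category labels are invented for illustration and are not challenge data.</p>
      <preformat>
```python
# Sketch of Rhinobird-style ensembling: average the match probabilities of
# several fine-tuned models, threshold the average, then force pairs from
# different product categories to non-match. All values are illustrative.

def ensemble_predict(prob_lists, categories_a, categories_b, threshold=0.5):
    """prob_lists: one list of match probabilities per model, aligned by pair."""
    n_models = len(prob_lists)
    n_pairs = len(prob_lists[0])
    labels = []
    for i in range(n_pairs):
        avg = sum(probs[i] for probs in prob_lists) / n_models
        match = avg >= threshold
        # Post-processing rule: pairs from different product categories
        # can never be matches.
        if categories_a[i] != categories_b[i]:
            match = False
        labels.append(1 if match else 0)
    return labels

# Three hypothetical models scoring three offer pairs:
probs = [[0.9, 0.4, 0.8], [0.7, 0.6, 0.9], [0.8, 0.2, 0.7]]
cats_a = ["Computers", "Shoes", "Cameras"]
cats_b = ["Computers", "Shoes", "Shoes"]
print(ensemble_predict(probs, cats_a, cats_b))  # [1, 0, 0]
```
      </preformat>
      <p>The third pair is predicted as a match by the ensemble but corrected to a non-match because the two offers belong to different categories.</p>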
      <p>For Task 2, Rhinobird used a BERT-based ensemble model that explicitly
considers the dependencies among different category levels. These hierarchical
dependency features are encoded using a dynamic masked matrix derived
from the hierarchical category structure. The masked matrix acts as a filter that
dynamically discards the child categories irrelevant to the current parent category.
The final ensemble model combines 17 different BERT models to make the final
decisions. They used both product names and descriptions as input.</p>
      <p>Team PMap represents the National Institute of Advanced Industrial
Science and Technology, Japan, and participated in Task 1 in Round 1 only. They
rely on pre-trained transformer-based language models; more specifically,
they fine-tune BERT-base, BERT-large, DistilBERT-base, RoBERTa-base and
RoBERTa-large, subsequently ensembling the results of some of these models
to arrive at the final matching decision. Before fine-tuning, they apply simple
preprocessing, e.g. removing symbols and non-alphabet characters using a
regular expression. Team PMap uses the datasets provided during the challenge
without further additions. They furthermore use only the title attribute as
input feature. After fine-tuning each model, based on the results, they select
the BERT-large, RoBERTa-large and RoBERTa-base models for ensembling to
reach the final matching decision.</p>
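      <p>A minimal sketch of such preprocessing and input construction, assuming a simple keep-letters-and-digits regular expression and the usual [CLS]/[SEP] sequence-pair template; both are assumptions for illustration, not details taken from the PMap system paper.</p>
      <preformat>
```python
import re

# Sketch of regex-based title cleaning followed by serialization of the two
# cleaned titles into a single BERT-style sequence pair. The exact regex and
# the [CLS]/[SEP] template are illustrative assumptions.

def clean(title):
    # keep letters, digits and spaces; collapse runs of whitespace
    kept = re.sub(r"[^A-Za-z0-9 ]+", " ", title)
    return re.sub(r"\s+", " ", kept).strip().lower()

def serialize_pair(title_a, title_b):
    return "[CLS] " + clean(title_a) + " [SEP] " + clean(title_b) + " [SEP]"

print(serialize_pair("Canon EOS-80D (Body)", "canon eos 80d dslr body!!"))
```
      </preformat>
      <p>In practice, a tokenizer would build this sequence pair itself; the sketch only makes the shape of the model input explicit.</p>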
      <p>Team ASVinSpace represents Leipzig University, Germany, and the
German Aerospace Center (DLR). They participated in both Task 1 and Task 2 in
Round 1. For Task 1, they employed pre-trained transformer-based language
models, namely BERT, RoBERTa and their distilled versions [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Different
feature combinations are tried, with the input to the model consisting of the
concatenation of the used feature strings. The standard model is augmented with a
single dense layer and an output layer on top of the pooled output of the [CLS] token
and subsequently fine-tuned. ASVinSpace try solving the task in two ways: once
minimizing cross-entropy loss on an output layer of size two, and once
framed as a regression problem with a single output, minimizing the
mean squared error. In addition to the training set provided for the challenge,
the team further experiments with additional training data from other product
categories from the same data corpus. To handle the class imbalance inherent
to the data, the team randomly drops negative examples during each training
epoch to normalize the class distribution. The final submitted result is achieved
by a DistilRoBERTa model using the title, description and specTableContent
attributes, fine-tuned on data from four different product categories.
      </p>
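      <p>The per-epoch negative-dropping strategy for handling class imbalance can be sketched as follows, with synthetic match/non-match pairs standing in for the challenge data:</p>
      <preformat>
```python
import random

# Sketch of the class-balancing idea: at each training epoch, randomly drop
# negative pairs so the classes are roughly balanced. Labels: 1 = match,
# 0 = non-match. The data below is synthetic.

def balance_epoch(examples, rng):
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    # keep only as many negatives as positives, sampled fresh each epoch
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    epoch_data = positives + sampled
    rng.shuffle(epoch_data)
    return epoch_data

rng = random.Random(0)
data = ([("pos%d" % i, 1) for i in range(3)]
        + [("neg%d" % i, 0) for i in range(3, 12)])
epoch = balance_epoch(data, rng)
print(len(epoch))  # 6 examples: 3 positives plus 3 freshly sampled negatives
```
      </preformat>
      <p>Because the negatives are re-sampled each epoch, the model still sees most of the negative examples over the course of training while each individual epoch remains balanced.</p>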
      <p>
        For Task 2, ASVinSpace used a CNN model adapted from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It used
a transformer-based language model [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] as input to the CNN layers instead
of static word embedding models. The CNN model has three output layers,
each corresponding to one of the classification levels, thus allowing the model
to capture inter-dependencies between the different classification tasks. In addition,
ASVinSpace also proposed to use external resources. For example, the names of
examples from the training set are used to retrieve relevant entities from
Wikidata. Then, the corresponding descriptions and the GPC standard are used
to disambiguate the retrieved entities in order to select only the ones that are
highly similar (using a cosine similarity metric based on TF-IDF weighted feature
vectors) to the classification examples. These 'expanded' entities are manually
annotated to create additional training data. As text input, they
used the concatenation of product names, descriptions, and URLs.
      </p>
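      <p>The TF-IDF-based disambiguation step can be illustrated with a small sketch; the toy documents, and the simple bag-of-words TF-IDF weighting itself, are illustrative assumptions rather than the team's exact implementation.</p>
      <preformat>
```python
import math
from collections import Counter

# Sketch of a TF-IDF cosine-similarity filter: keep only those retrieved
# entity descriptions that are close to a classification example. The toy
# documents below are invented for illustration.

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency of each term
    df = Counter(term for doc in tokenized for term in set(doc))
    vecs = []
    for doc in tokenized:
        tf = Counter(doc)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vecs = tfidf_vectors([
    "apple iphone 11 smartphone",   # classification example
    "iphone 11 apple phone",        # candidate entity description (related)
    "wooden dining table",          # candidate entity description (unrelated)
])
print(cosine(vecs[0], vecs[1]))     # clearly positive similarity
print(cosine(vecs[0], vecs[2]))     # 0.0, would be filtered out
```
      </preformat>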
      <p>
        Team ISCAS-ICIP represents the Chinese Academy of Sciences and
participated in Task 1 in Rounds 1 and 2. Their approach is based on three steps:
pre-processing, entity matching and post-processing. During pre-processing, the
team removes stopwords and non-alphanumeric characters and lower-cases the examples.
Furthermore, they apply a feature extraction approach based on vocabularies
built using the provided data corpus, as well as regex patterns, to
extract values for the attributes brand and model. For the entity matching stage,
the team applies four different models overall, whose results are integrated into
the final prediction using a voting mechanism. The models are MPM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
Seq2SeqMatcher [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], HierMatcher [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Ditto [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Finally, the post-processing
module uses rules to correct predictions under certain circumstances, e.g. always
assigning the label "match" if two products have an exact match on the title
attribute, or always assigning "non-match" if the brands of two products differ. The
team also augments the provided training and validation sets, doubling their
size using a sampling approach similar to the one used for building the
provided training sets [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. For their first-round submission, ISCAS-ICIP integrated
the results of the MPM, Seq2SeqMatcher and HierMatcher models, while for the
second-round submission they omit Seq2SeqMatcher and replace it with Ditto,
which is based on pre-trained transformer-based language models, leading to
improved performance (around 3% F1) on the evaluation test set.
      </p>
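      <p>The voting and rule-based post-processing described above can be sketched as follows; the model predictions, titles and brand values are invented for illustration:</p>
      <preformat>
```python
# Sketch of ISCAS-ICIP's integration step: majority voting over the
# predictions of several matchers, followed by rule-based post-processing.
# All inputs below are illustrative, not challenge data.

def vote_and_postprocess(model_preds, titles_a, titles_b, brands_a, brands_b):
    """model_preds: one list of 0/1 predictions per model, aligned by pair."""
    n_models = len(model_preds)
    final = []
    for i in range(len(titles_a)):
        votes = sum(preds[i] for preds in model_preds)
        label = 1 if votes * 2 > n_models else 0  # strict majority
        # Rule 1: an exact title match is always a match.
        if titles_a[i] == titles_b[i]:
            label = 1
        # Rule 2: differing (known) brands are never a match.
        elif brands_a[i] and brands_b[i] and brands_a[i] != brands_b[i]:
            label = 0
        final.append(label)
    return final

preds = [[1, 1, 0], [0, 1, 1], [0, 1, 1]]   # three hypothetical matchers
titles_a = ["acme x1 laptop", "acme x2", "b-tech z9"]
titles_b = ["acme x1 laptop", "acme x2 notebook", "b-tech z9 pro"]
brands_a = ["acme", "acme", "b-tech"]
brands_b = ["zeta", "acme", "c-corp"]
print(vote_and_postprocess(preds, titles_a, titles_b, brands_a, brands_b))
```
      </preformat>
      <p>In the first pair the exact title match overrides both the vote and the brand rule; in the third, differing brands override a positive majority vote.</p>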
      <p>Team DICE represents Paderborn University, Germany, and
participated in Task 2 in Round 1 only. The team used a simple adaptation of the
BERT language model, adding a fully-connected (dense) layer on top and
using a sigmoid activation function as a replacement for the original softmax
for classification. They used both product names and descriptions as input.</p>
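      <p>A minimal sketch of such a classification head, with hand-picked stand-in weights instead of a fine-tuned BERT encoder and plain Python instead of a deep learning framework, purely for illustration:</p>
      <preformat>
```python
import math

# Sketch of a DICE-style head: a fully-connected layer over a pooled encoder
# output, with a sigmoid squashing each class score independently (instead
# of a softmax over all classes). Weights and the pooled vector are
# illustrative stand-ins for a real BERT [CLS] embedding.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify(pooled, weights, biases):
    # one score per category; the highest sigmoid activation wins
    scores = [sigmoid(sum(w * x for w, x in zip(row, pooled)) + b)
              for row, b in zip(weights, biases)]
    return max(range(len(scores)), key=lambda i: scores[i])

pooled = [0.2, -0.4, 0.7]
weights = [[0.1, 0.3, -0.2], [0.5, -0.1, 0.4], [-0.3, 0.2, 0.1]]
biases = [0.0, 0.1, -0.2]
print(classify(pooled, weights, biases))  # category index 1
```
      </preformat>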
      <p>Team Megagon represents megagon.ai and Team ISI represents the
University of Southern California, USA. They participated in both tasks in Round
1, but did not submit a system paper.</p>
      <sec id="sec-7-1">
        <title>Conclusion</title>
        <p>
          The systems that were successful in the challenge all employed pre-trained
transformer-based language models, which underlines the potential of these models for
Web data integration tasks [
          <xref ref-type="bibr" rid="ref12 ref19 ref4">4, 19, 12</xref>
          ]. Especially the good results of systems
using RoBERTa show the benefits of transferring knowledge that has been learned
from less structured web content from diverse sources (RoBERTa was pre-trained
using different subsets of the CommonCrawl along with several text corpora) to integration tasks
involving structured web content, such as the matching and categorization tasks
addressed in the challenge.
        </p>
        <p>
          Several other benchmark competitions on product matching and product
classification have been conducted in recent years: The SIGIR 2018 eCom
Rakuten Data Challenge [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] focused on product classification, where individual
products are classified into a hierarchy of over 3,000 categories in a
company-specific catalogue (i.e., the Rakuten product catalogue). Compared to the Rakuten
challenge, which only involved product descriptions from a single source, our
classification task involves more heterogeneous product descriptions from many
websites. The 2019 and 2020 workshops on Challenges and Experiences from
Data Integration to Knowledge Graphs (DI2KG2019) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (http://di2kg.inf.uniroma3.it/2019/) and DI2KG2020
(http://di2kg.inf.uniroma3.it/2020/)
focus on knowledge graph creation from product specifications which were
extracted from the Web. The workshops feature three shared tasks: entity
resolution, schema matching and attribute matching. Products are described in the
DI2KG dataset using distinct attributes such as screen size, display type, or
refresh rate. Compared to the DI2KG entity resolution task, our matching task
involves dealing with less structured textual product data.
        </p>
        <p>
          Based on the findings from this event, we identify several remaining gaps in
current research: First, despite the dominance of transformer-based language
models, there remains a significant degree of variety in terms of how such models
can be adapted and/or combined for data integration tasks. There is also a lack
of systematic study of how these architectures compare under a uniform
experimental setting. Second, there is a lack of exploration into what kinds of external
resources can be used to support such tasks and how they can be used to do
so. For example, our product data textual corpus could be used to fine-tune a
language model following an approach such as [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which showed further gains in
terms of domain-specific tasks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Finally, in terms of mining product
information on the Web from a more general point of view, recent research [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] focused
on harvesting and cleaning structured product data on the Web. However, there
is a lack of studies on how such data could be used to enable self-supervised
learning in downstream tasks [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We encourage future research to investigate
these directions.
        </p>
        <p>Acknowledgements: This event is partially sponsored by Peak Indicators, UK
(https://www.peakindicators.com/).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , H.:
          <article-title>Joint interaction with context operation for collaborative filtering</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>88</volume>
          ,
          <issue>729</issue>
          –
          <fpage>738</fpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1016/j.patcog.2018.12.003
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SciBERT: A pretrained language model for scientific text</article-title>
          . arXiv preprint arXiv:1903.10676
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Primpeli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peeters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Using the semantic web as a source of training data</article-title>
          .
          <source>Datenbank-Spektrum</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ),
          <volume>127</volume>
          –
          <fpage>135</fpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1007/s13222-019-00313-y
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lees</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>TURL: Table Understanding through Representation Learning</article-title>
          . arXiv:2006.14806 [cs] (
          <year>Jun 2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <volume>4171</volume>
          –
          <issue>4186</issue>
          (Jun
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.L.</given-names>
          </string-name>
          :
          <article-title>Challenges and innovations in building a product knowledge graph</article-title>
          .
          <source>In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          . p.
          <fpage>2869</fpage>
          .
          <article-title>Association for Computing Machinery (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Firmani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crescenzi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Angelis</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazzei</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merialdo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Proceedings of the 1st international workshop on challenges and experiences from data integration to knowledge graphs</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2512</volume>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Hierarchical Matching Network for Heterogeneous Entity Resolution</article-title>
          .
          <source>In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence</source>
          . pp.
          <volume>3665</volume>
          –
          <issue>3671</issue>
          (Jul
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , W., et al.:
          <article-title>End-to-End Multi-Perspective Matching for Entity Resolution</article-title>
          .
          <source>In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence</source>
          . pp.
          <volume>4961</volume>
          –
          <issue>4967</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Izmailov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Podoprikhin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garipov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vetrov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Wilson,
          <string-name>
            <surname>A.G.</surname>
          </string-name>
          :
          <article-title>Averaging Weights Leads to Wider Optima and Better Generalization</article-title>
          . arXiv:1803.05407 [cs, stat] (
          <year>Feb 2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <volume>1746</volume>
          –
          <fpage>1751</fpage>
          .
          <article-title>Association for Computational Linguistics (ACL) (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suhara</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          :
          <article-title>Deep Entity Matching with PreTrained Language Models</article-title>
          . arXiv:2004.00584 [cs] (
          <year>Apr 2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Focal Loss for Dense Object Detection</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <volume>2980</volume>
          –
          <issue>2988</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the sigir 2018 ecom rakuten data challenge</article-title>
          .
          <source>In: eCOM at The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. CEUR-WS.org</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          . arXiv:1907.11692
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Primpeli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meilicke</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Exploiting microdata annotations to consistently categorize product offers at web scale</article-title>
          .
          <source>In: International Conference on Electronic Commerce and Web Technologies</source>
          . pp.
          <volume>83</volume>
          –
          <fpage>99</fpage>
          . Springer International Publishing (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mudgal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rekatsinas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , et al.:
          <article-title>Deep Learning for Entity Matching: A Design Space Exploration</article-title>
          .
          <source>In: Proceedings of the 2018 International Conference on Management of Data</source>
          . pp.
          <volume>19</volume>
          –
          <issue>34</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Nie</surname>
            , H., Han,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution</article-title>
          .
          <source>In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          . pp.
          <volume>629</volume>
          –
          <issue>638</issue>
          (Nov
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Peeters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glavas</surname>
          </string-name>
          , G.:
          <article-title>Intermediate Training of BERT for Product Matching</article-title>
          . In: DI2KG Workshop @ VLDB (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Primpeli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peeters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The WDC Training Dataset and Gold Standard for Large-Scale Product Matching</article-title>
          . In: Workshop on e-Commerce and
          <article-title>NLP (ECNLP2019</article-title>
          ),
          <source>Companion Proceedings of WWW</source>
          . pp.
          <volume>381</volume>
          –
          <issue>386</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          . In:
          <source>NeurIPS EMC2 Workshop</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paramita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Product classification using microdata annotations</article-title>
          . In:
          <string-name>
            <surname>Ghidini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maleshkova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svátek</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lefrançois</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gandon</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (eds.)
          <source>The Semantic Web – ISWC 2019</source>
          . pp.
          <fpage>716</fpage>
          –
          <lpage>732</lpage>
          . Springer International Publishing (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>