MWPD2020: Semantic Web Challenge on Mining the Web of HTML-embedded Product Data

Ziqi Zhang1[0000-0002-8587-8618], Christian Bizer2[0000-0003-2367-0237], Ralph Peeters2[0000-0003-3174-2616], and Anna Primpeli2[0000-0002-1783-2482]

1 University of Sheffield, Broomhall, Sheffield S10 2TG, UK
ziqi.zhang@sheffield.ac.uk
2 Universität Mannheim, Schloss, 68131 Mannheim, Germany
{chris,ralph,anna}@informatik.uni-mannheim.de

Abstract. This paper gives an overview of the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data (MWPD2020), which was conducted as part of the International Semantic Web Conference (ISWC2020). The challenge consists of two tasks: product matching and product classification. In the first task, participants need to identify offers for the same product originating from different websites. The goal of the second task is to categorize offers from different websites into the GS1 GPC product hierarchy. Six teams from the USA, China, Japan, and Germany participated in the challenge. The winning system in Task 1, PMap, achieved an F1 score of 86.05 using an ensemble of transformer-based language models. Task 2 was won by team Rhinobird, achieving a weighted average F1 score of 88.62 using a BERT-based ensemble which considers the dependencies among different category levels.

Keywords: entity matching · hierarchical classification · e-commerce · schema.org · microdata · benchmarking

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Recent years have seen significant growth of semantic annotations on the Web, using markup languages such as Microdata together with the schema.org vocabulary. A particular domain that is witnessing a boom of semantic annotations is e-commerce, where online shops are increasingly embedding schema.org annotations into HTML pages describing products in order to enable search engines to easily identify product offers and potentially drive traffic to the respective websites. Statistics from the Web Data Commons (WDC) project (http://webdatacommons.org/structureddata/) show that, as of November 2018, 37% of web pages, or 30% of websites, contain semantic annotations, amounting to over 30 billion facts. Among these, nearly 20% are related to products. Such structured product data on the Web have created opportunities for new services, such as product search and integration platforms and recommender systems [1], as well as emerging research fields such as product knowledge graphs [6].

Many websites have started to semantically annotate product identifiers within their pages. This enables the identification of offers for the same product on different websites. The resulting clusters of product descriptions can be used as weak supervision for training product matchers, which in turn can be applied to identify products on websites that do not provide product identifiers [3]. However, there are also challenges associated with the annotations. For example, less than 10% of the offers are annotated with a product category [16], while the categorization systems used are website-specific and highly inconsistent across different websites [22].
The potentials as well as the challenges resulting from the widespread availability of semantically annotated product data on the Web motivated the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data (MWPD2020), as well as the specific tasks of the challenge: product matching and product classification. In the first task, participants need to identify offers for the same product originating from different websites. The goal of the second task is to categorize offers from different websites into the GS1 GPC product hierarchy. For both tasks, we have assembled training, validation, and test sets consisting of semantically annotated product data from a wide variety of different websites.

The event attracted a total of six participating teams, including research institutions as well as commercial entities from the USA, China, Japan, and Germany. The winning team for the product matching task represents the National Institute of Informatics, Japan, while the winning team for the product classification task represents the Tongji University of China and Tencent, China.

The remainder of this paper is structured as follows. Sections 2 and 3 explain the two tasks, including their objectives, datasets, and evaluation metrics; Section 4 presents the results of the challenge and gives an overview of the participating systems; and Section 5 concludes the paper with some lessons learned from the challenge and a comparison of MWPD2020 to related benchmark events. More information about MWPD2020 can be found on the challenge's website, which also provides all datasets for public download (https://ir-ischool-uos.github.io/mwpd/).

2 Task 1 - Product Matching

E-commerce websites frequently annotate product identifiers, product titles, product descriptions, brands, and product prices within their pages using schema.org terms. In addition, offers are often accompanied by specification tables, i.e. HTML tables that contain product details in the form of key/value pairs. Given the syntactic, structural and semantic heterogeneity among the offers, it is challenging to identify which offers refer to the same product, a problem known as product matching. In this task, product matching is handled as a binary classification problem: given two product offers, the participating systems need to decide whether the offers describe the same product (matching) or not (non-matching).

2.1 Datasets

The participants of Task 1 were given a large corpus of product offers which are grouped into clusters referring to the same product. This corpus could be used by the participants to assemble training sets of different width and depth. In order to ease starting to work on the task, we also provide a readily assembled example training set and validation set. The test set that was used to evaluate the participating systems was kept secret during the submission period of the challenge and was released afterwards.

Product Data Corpus. The WDC Product Data Corpus [20] was released in 2018 by the Web Data Commons project (http://webdatacommons.org/largescaleproductcorpus/v2/) and is the largest publicly available product data corpus. It consists of 26 million product offers originating from 70 thousand different e-shops. Exploiting the weak supervision found on the web in the form of product identifiers, such as GTINs or MPNs, the product offers are grouped into 16 million clusters. The clusters can be used to derive training sets containing matching and non-matching pairs of offers. The grouping of offers into clusters is subject to some degree of noise, which is approximately 7%.
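For illustration, the cluster structure makes deriving such pairs straightforward. The following is a minimal Python sketch, not part of the official challenge tooling; it assumes the offers have been loaded as dictionaries whose field names (id, cluster_id) follow the corpus attributes described in the next paragraph, and it uses simpler sampling than the provided example training set.

import itertools
import random

def derive_pairs(offers, n_neg_per_offer=3, seed=42):
    """Derive labeled offer pairs from the cluster structure of the corpus.

    `offers` is assumed to be a list of dicts carrying at least the corpus
    attributes `id` and `cluster_id`. Pairs of offers sharing a cluster_id are
    labeled as matches (1); pairs from different clusters as non-matches (0).
    """
    rng = random.Random(seed)
    pairs = []

    # Group offers by cluster id.
    clusters = {}
    for offer in offers:
        clusters.setdefault(offer["cluster_id"], []).append(offer)

    # Matching pairs: all combinations of offers within the same cluster.
    for members in clusters.values():
        for left, right in itertools.combinations(members, 2):
            pairs.append((left, right, 1))

    # Non-matching pairs: sample offers from other clusters at random.
    for left in offers:
        for _ in range(n_neg_per_offer):
            right = rng.choice(offers)
            if right["cluster_id"] != left["cluster_id"]:
                pairs.append((left, right, 0))

    return pairs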
The following attributes are used for describing the product offers in the corpus and can be used for training: title, description, brand, price, specTableContent, which contains the content of the specification tables found on the website of the product offer, keyValuePairs, which are the heuristically extracted key/value pairs from the specification tables, and category, which is one of the 25 categories the offer was assigned to. Additionally, two identifier attributes are assigned to every product offer: the id, which is the unique identifier of the offer, and the cluster id, which is the identifier of the cluster to which the offer belongs.

Example Training Set. Generating interesting matching and non-matching pairs of offers which can be used for training powerful matching models is a non-trivial and resource-intensive task. Therefore, we offer an example training set derived from the product corpus, with the goal of additionally supporting the participants of this task. Being a direct subset of the product corpus, the example training set is subject to some inherent noise. The example training set contains 68K pairs of matching and non-matching offers from 772 distinct products (clusters of offers). We offer the example training set in JSON format. Every JSON object in the training set describes a pair of offers (left offer - right offer) using the offer attributes together with their corresponding matching label. Figure 1 shows an example of a non-matching product offer pair in the example training set.

Fig. 1. Example of a product offer pair in the training set.

Validation Set. We provide a validation set consisting of 1,100 offer pairs from the Computers and Accessories category as ground truth for this task. The validation set has the same structure as the example training set. The ratio of matching to non-matching pairs is 3:8. The offers of the validation set are derived from 745 distinct products (clusters). Table 1 presents, for the training and validation sets, the average attribute density, the average length in characters of the attribute values, as well as the standard deviation of the value length.

Table 1. Profiling statistics about the offers in the training and validation sets.

Attribute         Density  Avg Length  Std Length
title             1.000    90.14       30.89
description       0.743    366.6       1167.76
brand             0.533    12.43       4.8
price             0.206    15.18       3.85
specTableContent  0.327    557.21      676.48
keyValuePairs     0.298    497.96      429.02

With the aim of constructing the validation set from a good mixture of hard and easy matching and non-matching pairs of offers distributed over different products, we applied the following heuristic: First, 150 clusters of the category Computers and Accessories are randomly selected. Based on the cluster assignments, we construct all possible matching and non-matching pairs. For each offer pair we randomly pick one of three similarity metrics (the Jaccard similarity over the titles, over the descriptions, or the average of both) and calculate its similarity score. To select matching pairs, we pick within each cluster the pair with the lowest similarity score as well as one randomly chosen pair and add them to the validation set. To select negative pairs, we take for each offer two to three pairs with high similarity and three pairs at random.
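The kind of similarity score used in this heuristic can be sketched as follows. This is an illustration only, not the exact selection code used by the organisers; offer field names such as title are assumed to follow the JSON schema above.

import itertools

def jaccard(text_a: str, text_b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def hardest_matching_pair(cluster_offers):
    """Return the intra-cluster pair of offers whose titles are least similar.

    `cluster_offers` is a list of offer dicts with a 'title' key; the least
    similar pair within a cluster is a hard positive (matching) example.
    """
    return min(
        itertools.combinations(cluster_offers, 2),
        key=lambda pair: jaccard(pair[0]["title"], pair[1]["title"]),
    )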
All matching and non-matching pairs of the validation set are manually verified. Therefore, unlike the provided example training set, the validation set does not contain any noisy offer pairs.

Test Set. The hidden test set, which is used for evaluating and ranking the matching systems of the participants of this task, consists of 1,500 offer pairs from the category Computers and Accessories. The offer pairs in the test set are carefully selected in order to cover different types of matching challenges. 1,100 pairs are randomly selected corner cases, meaning that they are similar non-matching pairs and dissimilar matching pairs. For the remaining 400 pairs, we define a categorization scheme consisting of four specific types of matching challenges. The distribution of offer pairs per type of challenge remains unknown to the participants in the first round and is revealed to them at the beginning of the second round to allow them to tune their systems to the specific challenges. The distribution of offer pairs per type of matching challenge is shown in Table 2.

Table 2. Distribution of offer pairs per type of matching challenge.

Matching challenge                                        #Match  #Non-Match
(SN-DM) Similar non-matches, dissimilar matches           275     825
(NP-HS) New products - high similarity to known products  25      75
(NP-LS) New products - low similarity to known products   25      75
(KP-TY) Known products - introduced typos                 100     0
(KP-DR) Known products - dropped tokens                   100     0

With the term new product we refer to products which are contained in the WDC Product Data Corpus but not in the provided example training set. Known products are products from clusters that have training data in the provided training set. The similarity for choosing similar non-matching and dissimilar matching pairs is measured using the Jaccard similarity metric on the titles of the offers.

2.2 Evaluation Metrics and Baseline

For the evaluation of the product matching task we use standard precision, recall and F1 calculated on the positive class. As a baseline, we use the Deepmatcher [17] framework, more specifically the RNN module using default parameters apart from the positive/negative ratio, which we set to the actual distribution found in the training set. This model is trained on the training set provided for the challenge for 15 epochs, using the attributes title, description, brand and specTableContent. We preprocess the attributes by removing some symbols and schema.org-related terms and finally lowercasing them. Since the model relies on pre-trained word- or character-based embeddings, we use the fastText embeddings pre-trained on the English Wikipedia (https://fasttext.cc/docs/en/pretrained-vectors.html).
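A minimal sketch of this evaluation using scikit-learn is given below; it is not the organisers' original evaluation script, and the variable names are illustrative.

from sklearn.metrics import precision_recall_fscore_support

def evaluate_matching(y_true, y_pred):
    """Precision, recall and F1 computed on the positive (matching) class only.

    `y_true` and `y_pred` are lists of 0/1 labels, one per offer pair in the
    hidden test set (1 = matching, 0 = non-matching).
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"P": 100 * precision, "R": 100 * recall, "F1": 100 * f1}

# Example: two of three matching pairs found, plus one false positive.
print(evaluate_matching([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))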
3 Task 2 - Product Classification

The same products are often sold on different websites, which generally organise their products into certain categorisation systems. However, such product categorisations differ significantly across websites, even if the websites sell similar product ranges. This makes it difficult for product information integration services to collect and organise product offers on the Web. The product classification task deals with assigning pre-defined product category labels from a universal catalogue to product instances (e.g., iPhone X is a 'SmartPhone', and also 'Electronics').

3.1 Datasets

Classification labels. In this task, the GS1 Global Product Classification standard (GPC, https://www.gs1.org/standards/gpc) is used to classify product instances. The GPC standard classifies products into a hierarchy of multiple levels based on their essential properties as well as their relationships to other products. It offers a universal standard for organising the product catalogues of any business. In this task, the top three levels of GPC are used to classify each product. Level 1 contains 40 classes such as 'Automotive' and 'Clothing'. Level 2 further divides the level 1 classes into more than 100 classes such as 'Automotive Accessories and Maintenance' and 'Swimwear'. Level 3 then further divides the level 2 classes into over 700 classes such as 'Automotive Antifreeze' and 'Beachwear/Cover Ups'.

Gold standard. The creation of the gold standard is based on the earlier product classification dataset released by the Web Data Commons project [16]. The original dataset contained 8,361 product instances randomly sampled from 702 vendors' websites. These were manually classified into the above-mentioned three classification levels by human annotators.

In MWPD2020, we further extended the original GS by adding over 7,000 product instances. However, due to the complexity of the GPC hierarchy, annotating a random sample by checking against every class in the three classification levels is a very time-consuming process. Therefore, we followed a 'controlled process' detailed below.

1. Create a Solr (https://lucene.apache.org/solr/) index of product instances by parsing the Product Data Corpus. The index contains five fields corresponding to attributes of each product: an id field to uniquely identify a product instance; a name field recording the name of the product; a description field recording the long description of the product; a category text field recording the product category information as provided by the source web page; and a provenance field recording the source URL from which the structured data are extracted. All these attributes are extracted from the RDF quads where available in the dataset.
2. Given an existing product instance (i.e., the reference product) in the original GS, search for its name in the description field of the above index;
3. Select up to 50 results (i.e., the target products) with a different name from the reference product;
4. Rank the results by the Levenshtein distance between the reference product's name and the names of the target products;
5. Select the top 10 ranked results;
6. A human annotator manually evaluates the ranked results and selects n products that s/he deems to belong to the same level-3 class as the reference product;
7. The selected n products are assigned the same level-3 class, as well as the corresponding level-2 and level-1 classes, by traversing the GPC hierarchy.

In step 6 above, the human annotators are presented with the following information when assessing each target product: the reference product's name, description, level-1, 2, 3 classes, provenance, and website-specific category information or breadcrumb if available; and the target product's name, description and provenance. They are instructed to exercise their own discretion in deciding an optimal n, by balancing the diversity of the already selected target products in terms of their name, vendor as identified by their provenance, and level-3 classes. The main reasons for allowing this flexibility are two-fold. First, in the original GS there existed certain 'difficult' and often minority classes (e.g., 93070100 Seeds/Spores) for which steps 2 and 3 hardly returned many positive matches. Second, there also existed certain 'dominant' classes that represented a very large fraction of the original GS (e.g., 67000000 Clothing was over 40%) and for which it was also 'easier' to find matches via steps 2 and 3. This implies that our controlled process runs the risk of further accentuating the already unbalanced nature of the original GS. Thus, by exercising their discretion based on the above principle, our goal was to control the balance of the class distribution in the resulting dataset. In practice, n ranged between 0 (no suitable target) and 6 (typically for 'difficult' classes).
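The candidate ranking in step 4 can be illustrated with the sketch below. It is an illustration only: it re-implements Levenshtein distance in pure Python rather than relying on the exact tooling used by the organisers, and the helper names and the dict field 'name' are assumptions mirroring the Solr fields described above.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Take the cheapest of the three edit operations.
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def rank_candidates(reference_name, target_products, top_k=10):
    """Rank target products (dicts with a 'name' field) by edit distance to
    the reference product's name and keep the top_k closest ones (step 5)."""
    ranked = sorted(target_products,
                    key=lambda p: levenshtein(reference_name, p["name"]))
    return ranked[:top_k]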
The annotation was conducted by two computer science researchers, and inter-annotator agreement was studied on 100 product instances which both of them annotated. A Cohen's Kappa of 97% was obtained. The resulting dataset contains 16,119 instances and is stored in JSON. In addition to the five product attributes described before, each instance is assigned three labels, corresponding to the three levels of classification. Further, the description is truncated to a maximum of 5,000 characters. A screenshot of an example instance is shown in Figure 2. The dataset is split into training (10,012), validation (3,000), and test (3,107) sets, with statistics shown in Figures 3 and 4. As in Task 1, only the training and validation sets were revealed to the participants before the submission of their final system output created on the test set. Although the dataset is very unbalanced, with several large classes dominating it at all three levels, it is worth noting that its class distribution is consistent with that of the original GS.

Fig. 2. Example of an instance in the Task 2 Product Classification dataset.

Fig. 3. Distribution of the character lengths of product names, category texts, and descriptions in the training, validation and test sets.

Fig. 4. Distribution of the percentages of level-1, 2, and 3 classes in the training, validation and test sets.

Additional resources. During the process of creating the gold standard, additional resources were created with the aim of supporting the participants' system development (download from: https://bit.ly/36d0NYd). These include:

– A 'product data textual corpus' that contains the descriptions of all product instances from the Solr index above. A light-weight cleaning process was applied to only keep descriptions of at least 5 tokens (separated by whitespace characters) and 20 characters. This corpus has over 1.9 billion tokens.
– Word embedding models (both continuous Bag-of-Words (CBow) and Skip-gram) trained on the above textual corpus using the Gensim (https://radimrehurek.com/gensim/, version 3.4.0) implementation of Word2Vec.

3.2 Evaluation Metrics and Baseline

For each classification level, the standard Precision, Recall and F1 are used, and a Weighted-Average macro-F1 (WAF1) is calculated over all classes. The average of the WAF1 scores of the three levels is then calculated and used to rank the participating systems (a minimal sketch of this ranking metric is given at the end of this section).

As baseline, we use a configuration based on that used in the Rakuten Data Challenge [14] (download from: https://github.com/ir-ischool-uos/mwpd/tree/master/prodcls). Specifically:

– it uses the same FastText algorithm and parameters as in the Rakuten Data Challenge;
– it uses only product names as input text;
– all text is lowercased and lemmatised using NLTK version 3.4.5.
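The ranking metric described above can be sketched as follows with scikit-learn; the level names and data layout are illustrative assumptions, and f1_score with average="weighted" corresponds to the support-weighted average of the per-class F1 scores used here.

from sklearn.metrics import f1_score

def waf1_ranking_score(gold, predicted):
    """Average of the weighted macro-F1 (WAF1) over the three GPC levels.

    `gold` and `predicted` are assumed to be dicts mapping a level name
    ('lvl1', 'lvl2', 'lvl3') to the list of class labels of all test instances.
    """
    per_level = {
        level: f1_score(gold[level], predicted[level], average="weighted")
        for level in ("lvl1", "lvl2", "lvl3")
    }
    return per_level, sum(per_level.values()) / len(per_level)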
4 Results

The competition was organised in two rounds. In order to improve their systems, the teams were shown the leaderboard after Round 1 and were informed about the F1 scores that their systems achieved on the specific types of matching challenges (see Table 2). A total of six teams representing different industry organisations as well as research and academic institutions participated in Round 1. Six teams participated in Task 1, while five participated in Task 2. Two teams continued in Round 2 for Task 1, while only one team chose to continue in Round 2 for Task 2. Five teams submitted a paper describing their system. A brief overview of the results for both tasks is given below; Section 4.3 afterwards provides additional details about each team and their system.

4.1 Results Task 1 - Product Matching

Table 3 shows the results of the systems that participated in Task 1. Team PMap managed to achieve the highest F1 score among all teams, with teams Rhinobird and ISCAS-ICIP very close behind. An overview of each team's method along selected dimensions, as reflected in the final submission, is given in Table 4. All of the teams who submitted system papers employed fine-tuning of transformer-based pre-trained language models for the task of product matching in one form or another, often combined with some form of ensembling. Team ISCAS-ICIP was the only team employing an ensembling approach across different deep matching models; the other teams who employed ensembling limited themselves to fine-tuned transformer-based models such as BERT [5] and RoBERTa [15].

Table 3. Results of Task 1 - Product Matching.

Team                    P      R      F1
PMap                    82.04  90.48  86.05
Rhinobird (Round 2)     80.63  92.00  85.94
Rhinobird (Round 1)     82.86  88.38  85.53
ISCAS-ICIP (Round 2)    85.77  84.95  85.36
ASVinSpace              86.20  82.10  84.10
ISCAS-ICIP (Round 1)    83.89  81.33  82.59
Megagon                 82.69  65.52  73.11
Team ISI                78.44  57.52  66.37
Baseline (Deepmatcher)  70.89  74.67  72.73

Most teams applied some form of standard pre-processing to the data before training. Team ISCAS-ICIP also employs a feature extraction approach based on fixed vocabularies and regular expressions to extract multiple features from the textual descriptions. For this they use the provided WDC Large-Scale Corpus for Product Matching; they further use it to expand the provided training set with more product pairs from the relevant product category. Teams Rhinobird and ASVinSpace also tried augmenting the training data with pre-built training sets of other categories from the product corpus. Most teams tried different combinations of features and training sets during experimentation. Team Rhinobird additionally tried optimizing for focal loss in addition to cross-entropy, as well as employing a self-ensembling technique over multiple training epochs of the same model. Rhinobird and ISCAS-ICIP further implemented a post-processing step in which they used heuristic rules to correct some of the predicted labels, e.g. always predicting non-match if the brands of the two offers do not match.
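As an illustration of this kind of post-processing (not taken from any team's actual code; the attribute and field names follow the corpus schema from Section 2.1 and are otherwise assumptions), such rules are applied on top of the model predictions:

def apply_postprocessing(pair, predicted_label):
    """Override a model prediction with simple heuristic rules.

    `pair` is assumed to be a dict with 'left' and 'right' offer dicts carrying
    the corpus attributes (e.g. 'title', 'brand'); `predicted_label` is 1 for
    match and 0 for non-match.
    """
    left, right = pair["left"], pair["right"]

    # Identical, non-empty titles are treated as a certain match.
    if left.get("title") and left["title"] == right["title"]:
        return 1

    # Conflicting brand values are treated as a certain non-match.
    if left.get("brand") and right.get("brand") and left["brand"] != right["brand"]:
        return 0

    return predicted_label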
All teams were provided with information about their performance on the specific matching challenges in the test set (see Section 2) after Round 1. These scores could be used to improve specific aspects of the systems for Round 2. The results of all teams on the specific types of matching challenges are shown in Table 5.

Table 4. Overview of the systems submitted for Task 1.

PMap
  pre-processing:     remove symbols and non-alphabet chars
  used attributes:    title
  matching model:     BERT-large, RoBERTa-large, RoBERTa-base
  matching decision:  ensemble of transformers
  post-processing:    -
  external resources: -

Rhinobird
  pre-processing:     remove stopwords and lower-case
  used attributes:    title, description
  matching model:     BERT-base
  matching decision:  self-ensembling, ensemble of transformers with different loss functions
  post-processing:    heuristic rule to correct predictions
  external resources: training data spanning four categories from WDC corpus

ISCAS-ICIP
  pre-processing:     remove stopwords, alphanumeric chars and lower-case
  used attributes:    title, price, brand (extracted), model (extracted)
  matching model:     MPM, HierMatcher, Ditto
  matching decision:  ensemble of different model types
  post-processing:    heuristic rules to correct predictions
  external resources: WDC corpus used for additional training data and to build vocabularies for attribute extraction

ASVinSpace
  pre-processing:     -
  used attributes:    title, description, specTableContent
  matching model:     DistilRoBERTa-base
  matching decision:  single model
  post-processing:    -
  external resources: training data spanning four categories from WDC corpus

Team ISCAS-ICIP was able to improve their performance on all challenges in Round 2, improving significantly on new products, which were not contained in the provided training set. For Round 2 they exchanged one of the matching models in their ensemble for a transformer-based model. This may suggest that this kind of model is better suited for handling new products than the previously used one (see below). Their overall result improved by 3% F1 from Round 1 to Round 2. Team Rhinobird managed to improve significantly for products containing typos or dropped words, while losing some performance on the other classes. For Round 2, they changed one of the three BERT models in their ensemble to one trained with a different loss function. Their overall result improves by only 0.5% F1 from Round 1 to Round 2, trading 2% precision for 4% recall. Overall, there is no team that consistently beats the others across all challenges, which is not surprising, as they all apply similar approaches and the overall results of the top teams vary only within 1% F1.

Table 5. Results on the specific types of matching challenges (see Table 2).

                      F1 on specific type of matching challenge
Team                  SN-DM  NP-HS  NP-LS  KP-TY  KP-DR
PMap                  87.03  73.85  72.46  87.01  91.30
Rhinobird (Round 2)   87.56  68.49  71.43  85.06  93.62
Rhinobird (Round 1)   88.04  72.46  76.92  80.95  89.50
ISCAS-ICIP (Round 2)  87.97  90.91  67.69  74.21  91.30
ASVinSpace            87.41  82.14  76.19  75.78  84.39
ISCAS-ICIP (Round 1)  87.20  81.48  60.00  72.61  85.71
Megagon               80.00  71.19  72.73  47.33  71.79
Team ISI              82.52  24.00  23.26  27.59  63.01

4.2 Results Task 2 - Product Classification

Table 6 shows the results of the participating systems on Task 2. Team Rhinobird obtained the highest overall WAF1 of 88.62, significantly higher than the baseline. Four out of five teams managed to outperform the baseline. Comparing the performance over the three classification levels, Level-3 is understandably the most difficult, due to a significantly larger label set and, consequently, fewer training examples per class.

Among the teams that submitted a system description paper, in terms of supervised learning, all participants used deep neural network (DNN) structures and the more recent transformer-based architectures or pretrained language models. This aligns with the overall trend in machine learning based methods for NLP, where we see a shift towards pretrained language models, often based on transformer architectures.
The DNN models range from simple adaptations of existing pre-trained language models (e.g., DICE) to very complex structures that combine 17 different models in an ensemble (e.g., Rhinobird).

In terms of the text input used for supervised learning, all teams used product name and description, except ASVinSpace, which also used the URL. However, it is unclear what the effect of the URL is, due to the lack of an ablation test. Interestingly, no team used the category text provided as-is by the source vendor websites, even though such content has proven to be useful for such tasks [16, 22].

In terms of using external resources (excluding the use of pre-trained language models) to support the learning, team ASVinSpace used a novel approach that extends the training set by harvesting data from Wikidata. None of the teams used the pre-trained embedding models or the product description corpus. However, Table 6 demonstrates that the pre-trained embedding models can be effective for further enhancing the learning.

Table 6. Participants' results on Task 2, product classification. WAF1: Weighted Average (macro) F1. Baseline + CBow and Baseline + Skipgram indicate the results of the baseline when the product embedding models are used.

                     Weighted Macro-F1 per level   Average over all levels
Teams                Lvl.1  Lvl.2  Lvl.3           Precision  Recall  WAF1
Rhinobird (Round 1)  91.83  90.11  84.68           89.01      89.04   88.62
Rhinobird (Round 2)  90.81  90.12  84.37           88.97      88.72   88.43
ISI                  88.54  87.30  83.77           87.16      86.85   86.54
ASVinSpace           89.44  88.05  80.86           86.96      86.30   86.10
Megagon              88.23  85.47  81.23           85.39      85.38   84.98
DICE                 85.79  81.46  78.27           85.30      81.49   81.84
Baseline             86.56  85.31  80.89           85.53      84.17   84.26
Baseline + CBow      88.06  86.97  82.17           86.50      86.07   85.73
Baseline + Skipgram  86.87  85.99  80.86           85.45      84.91   84.58

4.3 Participating Teams

In the following, we summarize the approaches that were used by the different teams. The complete details about the methods are given in the system papers written by the teams themselves, which are contained in the MWPD2020 proceedings.

Team Rhinobird represents the Tongji University of China and Tencent, China, and participated in both Task 1 and Task 2 in both rounds. For Task 1, they rely on the BERT model while experimenting with different loss functions and ensembling steps. More specifically, after removing stopwords and lower-casing the data, they fine-tune multiple BERT models while experimenting with different training sets and features, as well as with focal loss [13] as a variation of the standard cross-entropy loss function. In addition to the provided training set, they experiment with a larger training set containing product pairs from four product categories. They try using only the title attribute as well as the concatenation of title and description as input features. Furthermore, Team Rhinobird experiments with a method of self-ensembling across multiple training epochs of the same model, namely stochastic weight averaging (SWA) [10]. Finally, subsets of all the previously mentioned models are ensembled by averaging their prediction probabilities and subsequently selecting the best performing ensemble. Team Rhinobird also applies some simple post-processing rules to correct the predictions of the models; more specifically, all test pairs whose offers do not belong to the same product category are set to be non-matches. Their submission for both rounds consists of an ensemble of three fine-tuned BERT models trained with different choices for the previously mentioned parameters.
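Focal loss, as used by Rhinobird as an alternative to cross-entropy, scales down the loss of easy examples so that training focuses on hard ones. Below is a minimal binary variant in PyTorch; it illustrates the loss from [13] rather than Rhinobird's actual implementation, and the values gamma=2.0 and alpha=0.25 are the commonly used defaults, not necessarily the team's settings.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss [13] for binary matching: the cross-entropy of each example
    is scaled by (1 - p_t)^gamma, so well-classified (easy) pairs contribute
    less. `logits` are raw model scores, `targets` are 0/1 labels (floats)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Example usage with dummy scores for four offer pairs.
logits = torch.tensor([2.3, -1.2, 0.4, -3.0])
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(binary_focal_loss(logits, labels))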
For Task 2, Rhinobird used a BERT-based ensemble model that explicitly considers the dependencies among different category levels. Such hierarchical dependency features are encoded using a dynamic masked matrix derived from the hierarchical category structure. The masked matrix acts as a filter that dynamically discards the child categories that are irrelevant to the current parent category. The final ensemble model combines 17 different BERT models to make the final decisions. They used both product names and descriptions.

Team PMap represents the National Institute of Advanced Industrial Science and Technology, Japan, and participated in Task 1 in Round 1 only. They rely on pretrained transformer-based language models; more specifically, they fine-tune BERT-base, BERT-large, DistilBERT-base, RoBERTa-base and RoBERTa-large, subsequently ensembling the results of some of these models to arrive at the final matching decision. Before fine-tuning they apply simple preprocessing, e.g. removing symbols and non-alphabet characters using a regular expression. Team PMap uses the datasets provided during the challenge without further additions, and they use only the title attribute as input feature. After fine-tuning each model, they select, based on the individual results, the BERT-large, RoBERTa-large and RoBERTa-base models for the ensemble that reaches the final matching decision.

Team ASVinSpace represents Leipzig University, Germany, and the German Aerospace Center (DLR). They participated in both Task 1 and Task 2 in Round 1. For Task 1, they employed pre-trained transformer-based language models, namely BERT, RoBERTa and their distilled versions [21]. Different feature combinations are tried, with the input to the model consisting of the concatenation of the used feature strings. The standard model is augmented with a single dense layer and an output layer on top of the pooled output of the [CLS] token, and is subsequently fine-tuned. ASVinSpace try solving the task in two ways: once by minimizing the cross-entropy loss on an output layer of size two, and once framed as a regression problem with a single output, minimizing the mean squared error. In addition to the training set provided for the challenge, the team further experiments with additional training data from other product categories of the same data corpus. To handle the class imbalance inherent in the data, the team randomly drops negative examples during each training epoch to normalize the class distribution. The final submitted result is achieved by a DistilRoBERTa model using the title, description and specTableContent attributes, fine-tuned on data from four different product categories.
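The pair-classification setup shared by PMap, Rhinobird and ASVinSpace (serializing selected attributes of the two offers as a sentence pair and fine-tuning a pre-trained transformer with a classification head) can be sketched as follows using the Hugging Face transformers library. The model name, attribute choice and maximum length below are placeholders, and the snippet shows only inference; in practice the head is first fine-tuned on the pair-labelled training set.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder model; the teams fine-tuned e.g. BERT, RoBERTa or distilled variants.
MODEL_NAME = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def serialize(offer, attributes=("title",)):
    """Concatenate the selected offer attributes into a single string."""
    return " ".join(str(offer.get(a) or "") for a in attributes)

def predict_match(left_offer, right_offer):
    """Encode the two offers as a sentence pair and return P(match)."""
    inputs = tokenizer(serialize(left_offer), serialize(right_offer),
                       truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # class 1 = match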
For Task 2, ASVinSpace used a CNN model adapted from [11]. It uses a transformer-based language model [21] as input to the CNN layers instead of static word embedding models. The CNN model has three output layers, each corresponding to one of the classification levels, thus allowing the model to capture inter-dependencies between the different classification tasks. In addition, ASVinSpace also proposed the use of external resources. For example, the names of examples from the training set are used to retrieve relevant entities from Wikidata. Then, the corresponding descriptions and the GPC standard are used to disambiguate the retrieved entities in order to select only the ones that are highly similar (using a cosine similarity metric based on TF-IDF weighted feature vectors) to the classification examples. These 'expanded' entities are manually annotated to create additional training data. In terms of the text input, they used the concatenation of product names, descriptions, and URLs.

Team ISCAS-ICIP represents the Chinese Academy of Sciences and participated in Task 1 in Rounds 1 and 2. Their approach is based on three steps: pre-processing, entity matching and post-processing. During pre-processing the team removes stopwords and alphanumeric characters and lower-cases the examples. Furthermore, they apply a feature extraction approach based on vocabularies built from the provided data corpus, as well as on regex patterns, to extract values for the attributes brand and model. For the entity matching stage, the team applies four different models overall, whose results are integrated into the final prediction using a voting mechanism. The models are MPM [9], Seq2SeqMatcher [18], HierMatcher [8] and Ditto [12]. Finally, the post-processing module uses rules to correct predictions under certain circumstances, e.g. always assigning the label "match" if two products have an exact match on the title attribute, or always assigning "non-match" if the brands of two products differ. The team also augments the provided training and validation sets, doubling their size using a sampling approach similar to the one used for building the provided training sets [20]. For their first round submission, ISCAS-ICIP integrated the results of the MPM, Seq2SeqMatcher and HierMatcher models, while for the second round submission they omitted Seq2SeqMatcher and replaced it with Ditto, which is based on pre-trained transformer-based language models, leading to improved performance (∼3% F1) on the evaluation test set.

Team DICE represents Paderborn University, Germany, and participated in Task 2 in Round 1 only. The team used a simple adaptation of the BERT language model, adding a fully-connected (i.e. dense) layer on top and using a sigmoid activation function as a replacement for the original softmax for classification. They used both product names and descriptions as input.

Team Megagon represents megagon.ai and Team ISI represents the University of Southern California, USA. They participated in both tasks in Round 1, but did not submit a system paper.
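Both flavours of model combination mentioned in this section (the averaging of prediction probabilities used by PMap and Rhinobird, and the voting over heterogeneous matchers used by ISCAS-ICIP) reduce to a few lines. The sketch below is illustrative only, with made-up input values and a default threshold of 0.5 that the teams may not have used.

import numpy as np

def average_probabilities(prob_lists, threshold=0.5):
    """Soft ensemble: average the match probabilities predicted by several
    models for the same offer pairs and threshold the result."""
    avg = np.mean(np.asarray(prob_lists), axis=0)
    return (avg >= threshold).astype(int)

def majority_vote(label_lists):
    """Hard ensemble: predict the label chosen by the majority of the models."""
    votes = np.asarray(label_lists)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Three models scoring the same four offer pairs.
probs = [[0.9, 0.2, 0.6, 0.4], [0.8, 0.1, 0.4, 0.6], [0.7, 0.3, 0.7, 0.2]]
print(average_probabilities(probs))                              # soft voting
print(majority_vote([[1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0]]))  # hard voting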
5 Conclusion

The systems that were successful in the challenge all employed pre-trained transformer-based language models, which underlines the potential of these models for Web data integration tasks [4, 19, 12]. Especially the good results of the systems using RoBERTa show the benefits of transferring knowledge that has been learned from less structured web content from diverse sources (RoBERTa was pre-trained using different subsets of the CommonCrawl along with several text corpora) to integration tasks involving structured web content, such as the matching and categorization tasks addressed in the challenge.

Several other benchmark competitions on product matching and product classification have been conducted in the last years: The SIGIR 2018 eCom Rakuten Data Challenge [14] focused on product classification, where individual products are classified into a hierarchy of over 3,000 categories in a company-specific catalogue (i.e., the Rakuten product catalogue). Compared to the Rakuten challenge, which only involved product descriptions from a single source, our classification task involves more heterogeneous product descriptions from many websites. The 2019 and 2020 workshops on Challenges and Experiences from Data Integration to Knowledge Graphs, DI2KG2019 [7] (http://di2kg.inf.uniroma3.it/2019/) and DI2KG2020 (http://di2kg.inf.uniroma3.it/2020/), focus on knowledge graph creation from product specifications which were extracted from the Web. The workshops feature three shared tasks: entity resolution, schema matching, and attribute matching. Products are described in the DI2KG dataset using distinct attributes such as screen size, display type, or refresh rate. Compared to the DI2KG entity resolution task, our matching task involves dealing with less structured textual product data.

Based on the findings from this event, we identify several remaining gaps in current research: First, despite the dominance of transformer-based language models, there remains a significant degree of variety in how such models can be adapted and/or combined for data integration tasks. There is also a lack of systematic study of how these architectures compare under a uniform experimental setting. Second, there is a lack of exploration into what kinds of external resources can be used to support such tasks and how they can be used to do so. For example, our product data textual corpus could be used to fine-tune a language model following an approach such as [2], which has shown further gains on domain-specific tasks [19]. Finally, in terms of mining product information on the Web from a more general point of view, recent research [1] has focused on harvesting and cleaning structured product data on the Web; however, there is a lack of studies on how such data could be used to enable self-supervised learning in downstream tasks [4]. We encourage future research to investigate these directions.

Acknowledgements

This event is partially sponsored by Peak Indicators, UK (https://www.peakindicators.com/).

References

1. Bai, P., Ge, Y., Liu, F., Lu, H.: Joint interaction with context operation for collaborative filtering. Pattern Recognition 88, 729-738 (2019). https://doi.org/10.1016/j.patcog.2018.12.003
2. Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)
3. Bizer, C., Primpeli, A., Peeters, R.: Using the semantic web as a source of training data. Datenbank-Spektrum 19(2), 127-135 (2019). https://doi.org/10.1007/s13222-019-00313-y
4. Deng, X., Sun, H., Lees, A., Wu, Y., Yu, C.: TURL: Table Understanding through Representation Learning. arXiv:2006.14806 [cs] (Jun 2020)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4171-4186 (Jun 2019)
6. Dong, X.L.: Challenges and innovations in building a product knowledge graph. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. p. 2869. Association for Computing Machinery (2018)
7. Firmani, D., Crescenzi, V., Angelis, A.D., Dong, X.L., Mazzei, M., Merialdo, P., Srivastava, D.: Proceedings of the 1st international workshop on challenges and experiences from data integration to knowledge graphs. CEUR Workshop Proceedings, vol. 2512. CEUR-WS.org (2019)
8. Fu, C., Han, X., He, J., Sun, L.: Hierarchical Matching Network for Heterogeneous Entity Resolution. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. pp. 3665-3671 (Jul 2020)
9. Fu, C., Han, X., Sun, L., Chen, B., Zhang, W., et al.: End-to-End Multi-Perspective Matching for Entity Resolution. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. pp. 4961-4967 (2019)
10. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging Weights Leads to Wider Optima and Better Generalization. arXiv:1803.05407 [cs, stat] (Feb 2019)
11. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1746-1751. Association for Computational Linguistics (ACL) (2014)
12. Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep Entity Matching with Pre-Trained Language Models. arXiv:2004.00584 [cs] (Apr 2020)
13. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal Loss for Dense Object Detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980-2988 (2017)
14. Lin, Y.C., Das, P., Datta, A.: Overview of the SIGIR 2018 eCom Rakuten Data Challenge. In: eCOM at The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. CEUR-WS.org (2018)
15. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., et al.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 (2019)
16. Meusel, R., Primpeli, A., Meilicke, C., Paulheim, H., Bizer, C.: Exploiting microdata annotations to consistently categorize product offers at web scale. In: International Conference on Electronic Commerce and Web Technologies. pp. 83-99. Springer International Publishing (2015)
17. Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., et al.: Deep Learning for Entity Matching: A Design Space Exploration. In: Proceedings of the 2018 International Conference on Management of Data. pp. 19-34 (2018)
18. Nie, H., Han, X., He, B., Sun, L., Chen, B., et al.: Deep Sequence-to-Sequence Entity Matching for Heterogeneous Entity Resolution. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. pp. 629-638 (Nov 2019)
19. Peeters, R., Bizer, C., Glavaš, G.: Intermediate Training of BERT for Product Matching. In: DI2KG Workshop @ VLDB (2020)
20. Primpeli, A., Peeters, R., Bizer, C.: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In: Workshop on e-Commerce and NLP (ECNLP2019), Companion Proceedings of WWW. pp. 381-386 (2019)
21. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS EMC2 Workshop (2019)
22. Zhang, Z., Paramita, M.: Product classification using microdata annotations. In: Ghidini, C., Hartig, O., Maleshkova, M., Svátek, V., Cruz, I., Hogan, A., Song, J., Lefrançois, M., Gandon, F. (eds.) The Semantic Web – ISWC 2019. pp. 716-732. Springer International Publishing (2019)