<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Language Model CNN-driven similarity matching and classification for HTML-embedded Product Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>nos Borst</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Krn</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>sjumruskit</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Aerospace Center (DLR), Institute of Data Science</institution>
          ,
          <addr-line>Mälzerstraße 3, 07745 Jena, Germany https://</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University, Faculty of Mathematics and Computer Science, Institute of Computer Science</institution>
          ,
          <addr-line>Augustusplatz 10, 04109 Leipzig, Germany https://</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Semantic Web Challenge "Mining the Web of HTML-embedded Product Data" aims to benchmark current technologies on two data integration tasks, (1) product matching and (2) product classification. Recent years have seen significant use of semantic annotations in the e-commerce domain, but these annotations are often inconsistent, incomplete or conflicting. We introduce a transformer-based approach for textual product matching and extend it with a CNN for product classification. We compare the influence of different input feature combinations on prediction performance and introduce a technique to augment the classification task with additional information. Using text-only approaches, we are able to outperform the baseline results.</p>
      </abstract>
      <kwd-group>
        <kwd>product matching</kwd>
        <kwd>product category classification</kwd>
        <kwd>language models</kwd>
        <kwd>natural language processing</kwd>
        <kwd>text mining</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Semantic Web Challenge on Mining the Web of HTML-embedded Product
Data defines two tasks, (1) product matching and (2) product classification,
as the main drivers for product information integration services and for research on
product knowledge graph acquisition. The need for data-driven, automatic product
data integration arises because semantic markup of product information
on the web is often sparse or inconsistent. Since there is no standard for product
classification and product vendors use their own category systems, third-party
product information integration services cannot rely on equal preconditions. As
the main information page of the challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] states correctly: "Addressing these
challenges requires an orchestra of semantic technologies tailored to the
product domain, such as product classification, product offer matching, and product
taxonomy matching. Such tasks are also crucial elements for the construction of
product knowledge graphs, which are used by large, cross-sectoral e-commerce
vendors." The challenge therefore intends to assess the quality of systems
addressing the two tasks. The challenge organizers developed data sets and
resources that make the various approaches comparable.
      </p>
      <p>The definition of the shared task frames product matching as a binary
classification problem: given two product descriptions, a system should decide whether
they describe the same product or not. As mentioned before, product
categorisations differ between websites. The second task is therefore defined as the
classification of arbitrary product data sets into a single unified classification system.
Our group addresses both tasks using language-model-driven neural classifiers.</p>
      <p>In this paper we introduce a language-model-based approach for product
similarity matching and a language-model-based multi-output text classification
network for product classification. The remainder of this work is structured as
follows: In Section 2 we position the tasks and the methods we used with respect to related
work. Section 3 explains in detail the methods, architectures and data sets we
used, before we present the results in Section 4. We then conclude by discussing
the results and pointing out possible improvements.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The tasks we contribute to in this work are related to the fields of product
classification, product matching and data linking. While the use of semantic
annotations in the e-commerce domain has increased, it is still not sufficient in
terms of consistency and completeness.</p>
      <p>
        The challenge of the product matching task is to predict, given a
pair of structured product metadata records, whether they describe the same product
or not. Previous works on product classification, categorization and matching
[
        <xref ref-type="bibr" rid="ref12 ref19">19,12</xref>
        ] perform well with text retrieval techniques and simple neural
architectures and classification models like FastText [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or Siamese Networks [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
      </p>
      <p>
        In the product classification domain, two similar data sets exist: the Rakuten
Data Challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which only deals with data gathered from a single source, and
the more closely related Web Data Commons [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] project, which is used as a
basis for the data in this challenge.
      </p>
      <p>
        The methods proposed in this paper are closely related to the fields of
natural language representation, text classification and text similarity. In recent
years, pre-training large language models has shown a high impact on downstream
tasks. Transformer models such as BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or RoBERTa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] can be pre-trained
on large amounts of data in an unsupervised fashion. The pre-trained models
provide a numeric, context-sensitive representation of any text, which is then
fine-tuned to a specific task using task-specific data. While earlier approaches
based on word embeddings, like [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] or [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], often choose to keep text
representations fixed during task-specific training, fine-tuning seems to be the core strength
of the language model approach.
      </p>
      <p>
        Text classification is a fundamental task in Natural Language Processing.
Before language model fine-tuning became standard procedure, word embeddings
combined with task-specific neural architectures provided state-of-the-art results
in multi-label and single-label classification [
        <xref ref-type="bibr" rid="ref10 ref13 ref28">10,13,28</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a CNN-based
architecture for text classification is presented, which exhibits robust results on a broad
range of data sets. The CNN layers extract features, which are then used to
classify the text.
      </p>
      <p>
        We hypothesize that textual similarity between product texts, like titles or
descriptions, may be enough to decide whether products match. This task is
structurally similar to semantic textual similarity, which has often been the topic of shared
tasks [
        <xref ref-type="bibr" rid="ref2 ref29 ref4">2,29,4</xref>
        ] and has a variety of data sets [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. It suggests that current
transformer language models like BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that compete for state-of-the-art scores in sentence
pair classification are good starting points for this task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Classification Models and Data Flow</title>
      <p>
        Task 1: Product Matching
Sequence Pair Classification using Transformer Models: Our approach
for the product matching task is based on the well-known BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] architecture
as a good candidate to solve the product matching task using text features only.
We use the Huggingface [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] implementations of the standard BERT model as
well as RoBERTa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and their Distil* [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] variants with pre-trained English
language models, which we fine-tuned on the data sets. The model is structurally
simple: it consists of a pre-trained transformer model that feeds its pooled
output<sup>3</sup> into a dropout layer followed by a dense layer with either a single output
for regression or two outputs for classification ("same product" or not).
Datasets and Feature Usage: We chose to focus solely on the text features
title, description and specTable of the product pair data<sup>4</sup>, as they
contained the most text content and were structurally more consistent than
keyValuePairs, brand names or prices. We later show how those three features
compare against each other and in combination. Depending on the choice of text
features used, we simply concatenated them into a single sequence and annotated
which sequence belonged to which product. No further text preprocessing steps
were required, as transformer models generally employ robust tokenizers, such as
WordPiece [
        <xref ref-type="bibr" rid="ref21 ref27">21,27</xref>
        ] or Byte-Pair-Encoding [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], which can handle arbitrary text
inputs. This resolves issues with unknown words.
<sup>3</sup> The pooled output representation of BERT is based on the last hidden state of the
[CLS] token, the first token in each sequence, which is intended to learn information
about the entire text sequence. For pooling, this output is fed through a dense layer
with 768 units and tanh activation.
<sup>4</sup> An example of the data format can be found at: https://ir-ischool-uos.github.io/mwpd/index.html#task1
      </p>
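      <p>The feature concatenation described above can be sketched as follows. This is a minimal illustration, not the exact preprocessing used for the submission: the helper names build_sequence and build_pair and the dictionary layout of an offer are assumptions, and only the feature names title, description and specTable are taken from the text. The two resulting strings would then be handed to a sequence-pair tokenizer as the first and second segment.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Optional, Sequence, Tuple

# Feature names as mentioned in the text; the dict layout is an assumption.
TEXT_FEATURES = ("title", "description", "specTable")


def build_sequence(offer: Dict[str, Optional[str]],
                   features: Sequence[str] = TEXT_FEATURES) -> str:
    """Concatenate the selected, non-empty text features of one offer."""
    parts = []
    for name in features:
        value = (offer.get(name) or "").strip()
        if value:  # description and specTable are sometimes empty
            parts.append(value)
    return " ".join(parts)


def build_pair(left: Dict[str, Optional[str]],
               right: Dict[str, Optional[str]],
               features: Sequence[str] = TEXT_FEATURES) -> Tuple[str, str]:
    """Return one text sequence per product, kept separate for pair input."""
    return build_sequence(left, features), build_sequence(right, features)


if __name__ == "__main__":
    # Hypothetical offers, for illustration only.
    offer_a = {"title": "Acme SSD 1TB", "description": None,
               "specTable": "Capacity 1TB"}
    offer_b = {"title": "Acme 1TB SSD", "description": "Fast drive"}
    print(build_pair(offer_a, offer_b))
```
]]></preformat>
      <p>Skipping empty fields keeps the sequences short, which matters because inputs longer than the model's maximum sequence length are truncated.</p>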
      <p>
        In addition to the provided computer training and validation set, we also
included the more exhaustive webdatacommons (WDC) product matching data
set [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to have a wider variety of topics, more training and validation data (see
Tab. 1) as well as a chance for better generalization.
      </p>
      <p>
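        Tab. 1: Number of negative and positive examples per train set (attribute), with rows for
computer (title), computer (desc), computer (specTable) and WDC all; the cell values are missing in this copy.
Classification Model: We employed a CNN architecture based on [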
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for the
product classification task. Since we understand the task as a single-label, multi-output
setting, we adjust the network accordingly. As input to the network we
use a transformer-based language model instead of static word vectors like GloVe
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The core of the network consists of the CNN feature extraction layers, which
we implement analogously to the original paper [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but instead of one output
layer, we use three, one for each hierarchy level of the data. A Dropout [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] layer
is applied to the feature vector. For every output we calculate the loss using
categorical cross-entropy; the per-output losses are then summed.
External Data: We support the training process by using the WDC [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] data
set<sup>5</sup> and data extracted from Wikipedia. Since WDC uses the same category set
as the task data, we can easily restrict it to the task's classes, which provides us
with 8,004 additional examples.
      </p>
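      <p>The summed loss over the three output layers can be illustrated with a small, framework-free sketch. The function names and the dictionary keys for the hierarchy levels are assumptions made for this example; in practice the same computation would be done by the loss functions of a deep learning framework on batched tensors.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one output layer."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Categorical cross-entropy for a single-label target index."""
    return -math.log(softmax(logits)[target])


def multi_output_loss(logits_per_level: Dict[str, Sequence[float]],
                      targets: Dict[str, int]) -> float:
    """Sum the per-level losses, one output layer per hierarchy level."""
    return sum(cross_entropy(logits_per_level[level], targets[level])
               for level in targets)


if __name__ == "__main__":
    # Hypothetical logits for three levels with 2, 4 and 8 classes.
    logits = {"lvl1": [0.0] * 2, "lvl2": [0.0] * 4, "lvl3": [0.0] * 8}
    gold = {"lvl1": 0, "lvl2": 1, "lvl3": 5}
    print(multi_output_loss(logits, gold))
```
]]></preformat>
      <p>With uniform logits each level contributes the logarithm of its number of classes, so levels with more classes dominate the untrained loss.</p>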
      <p>
        Additionally, we also use generic descriptions derived from Wikidata [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] via
its API. Names of classification examples from the training set are used to
retrieve a set of candidate entities from Wikidata. We augment the descriptions
from the training set with descriptions from the Global Product Classification (GPC)
standard [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], using the labels as references. These extended descriptions are used
to disambiguate and filter relevant entities from the candidate sets using a
tf-idf weight matrix. The entities are assigned to the label with the most similar
context according to the training examples. From these entity sets, we construct
training examples by joining the text content of the alternative labels,
descriptions, common categories and summaries from the Wikipedia page provided by
the Wikidata API. We only use retrieved entity descriptions for GPC level 3 and,
since the GPC hierarchy is a tree, automatically assign the parent nodes. This
process provides 1,394 additional training examples.
<sup>5</sup> English gold standard from http://webdatacommons.org/structureddata/2014-12/products/gs.html
      </p>
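      <p>Because the hierarchy is a tree, the labels of the two upper levels follow from the level 3 label alone. The following sketch shows one way to derive them. The child-to-parent mapping and all category names in it are hypothetical placeholders, not real GPC codes, and the function name is an assumption.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List

# Hypothetical excerpt of a three-level category tree (child -> parent).
# The labels are placeholders, not real GPC entries.
PARENT_OF: Dict[str, str] = {
    "level3/laptops": "level2/computers",
    "level3/tablets": "level2/computers",
    "level2/computers": "level1/electronics",
}


def labels_for_all_levels(level3_label: str,
                          parent_of: Dict[str, str]) -> List[str]:
    """Return [level1, level2, level3] by walking from the leaf to the root."""
    path = [level3_label]
    while path[-1] in parent_of:
        if len(path) > len(parent_of):  # guard against accidental cycles
            raise ValueError("parent mapping is not a tree")
        path.append(parent_of[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(labels_for_all_levels("level3/laptops", PARENT_OF))
```
]]></preformat>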
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        The organizers set baseline results for product matching with 90.8% F1 on the
validation set using deepmatcher [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and 85.734% weighted average F1 for the
product classification task using FastText [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In what follows, we present our
experiments on the validation sets and the final model configurations we used
to submit to the official leaderboard<sup>6</sup>.
      </p>
      <p>Task 1: Product Matching
Training and Hyperparameters: The computer training set shows a large
bias towards the negative class ("is not the same product", see Tab. 1), which
we account for by class-weighted random sampling of the training data.
We randomly discard about 80% of the negative product pairs in each epoch to
match the number of positive samples and thus avoid skewing the model towards
negative predictions.</p>
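      <p>One simple way to realize the per-epoch balancing is random undersampling of the negative class, sketched below. This is an assumption about how the sampling could be implemented, not the exact code used for the experiments. The tuple layout of a pair and the function name balanced_epoch are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import List, Sequence, Tuple

# (text_a, text_b, label); label 1 means "same product". Layout is assumed.
Pair = Tuple[str, str, int]


def balanced_epoch(pairs: Sequence[Pair], rng: random.Random) -> List[Pair]:
    """Draw a fresh subset of negatives that matches the number of positives."""
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    keep = min(len(positives), len(negatives))
    epoch = positives + rng.sample(negatives, keep)
    rng.shuffle(epoch)
    return epoch


if __name__ == "__main__":
    # Toy data with a 1:5 ratio, so about 80% of the negatives are dropped.
    data = [("a", "b", 1)] * 10 + [("c", "d", 0)] * 50
    rng = random.Random(0)
    for epoch_no in range(3):
        epoch = balanced_epoch(data, rng)
        print(epoch_no, len(epoch))
```
]]></preformat>
      <p>Because a new subset is drawn in every epoch, most negative pairs are still seen at some point during longer training runs.</p>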
      <p>We kept the default dropout of 0.1. The maximum sequence length of the text
input is a model-dependent parameter, being either 128 or 512 tokens. Depending
on the amount of text input, this leads to a truncation of the input. In line
with the model-dependent sequence length, we choose batch sizes of 8, 16 or
32 and train for either 3 or 15 epochs. We also compare a two-label ("matching"
product or not) prediction setup using cross-entropy loss against a single-output
network using mean squared error loss (regression).</p>
      <p>Results: We start with a simple BERT-base model and improve on it with
more recent language models, combinations of product text features,
hyperparameter settings, and additional data. As shown in Tab. 2, starting from
initially about 60%, we are able to increase the F1 by more than 30 percentage points on the
computer validation set.</p>
      <p>As shown in Tab. 2, the largest improvements stem from using the Distil*
transformer model variants. Compared to BERT-base, they improve performance by
up to 25 percentage points while consuming less memory. This makes either
longer sequences or larger batch sizes possible. The distilled version of RoBERTa
further improves the F1 scores, although by a smaller margin. Using the WDC
product data corpus as additional training data only marginally improves results,
indicating that the original data set is sufficient to fine-tune on the computer topic
and that more generalization through other topics is not necessary.</p>
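      <p>In Tab. 3 we compare which text input feature combinations perform best
while keeping other hyperparameters unchanged. As transformer models are not
designed to artificially align input sequences consisting of differing features on
some boundary and then pad them, we simply concatenate the text features into
a single sequence for each product. The features description and specTable
are sometimes empty, as shown in Tab. 1, and the various feature fields contain
texts of varying lengths, which results in differences in the available context when
generating vector representations. However, the advantage of combining those
text sequences is that more context is available for comparisons and that we
can use alternative texts for possibly missing fields, e.g. descriptive titles and
description texts.</p>
      <p><sup>6</sup> https://ir-ischool-uos.github.io/mwpd/index.html#results</p>
      <table-wrap>
        <label>Tab. 2</label>
        <table>
          <thead>
            <tr><th>model</th><th>epochs</th><th>train</th><th>eval</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>bert-base</td><td>3</td><td>comp</td><td>comp</td><td>65.22</td></tr>
            <tr><td>bert-base</td><td>3</td><td>all</td><td>comp</td><td>64.24</td></tr>
            <tr><td>distilroberta</td><td>3</td><td>all</td><td>comp</td><td>91.73</td></tr>
            <tr><td>distilbert</td><td>3</td><td>all</td><td>comp</td><td>87.62</td></tr>
            <tr><td>distilroberta</td><td>3</td><td>comp</td><td>comp</td><td>91.41</td></tr>
            <tr><td>distilroberta</td><td>15</td><td>comp</td><td>comp</td><td>95.05</td></tr>
            <tr><td>distilroberta (reg)</td><td>15</td><td>comp</td><td>comp</td><td>95.57</td></tr>
            <tr><td>distilroberta</td><td>15</td><td>all</td><td>comp</td><td>95.80</td></tr>
            <tr><td>distilroberta (reg)</td><td>15</td><td>all</td><td>comp</td><td>95.00</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <label>Tab. 3</label>
        <table>
          <thead>
            <tr><th>Features</th><th>P</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>title+description+specTable</td><td>88.96</td><td>91.41</td></tr>
            <tr><td>title</td><td>88.82</td><td>91.96</td></tr>
            <tr><td>description</td><td>71.65</td><td>69.15</td></tr>
            <tr><td>specTable</td><td>77.19</td><td>80.73</td></tr>
            <tr><td>description+title+specTable</td><td>88.71</td><td>90.16</td></tr>
          </tbody>
        </table>
      </table-wrap>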
      <p>We achieve the best result of 95.8% F1 on the computer validation set
with the DistilRoBERTa-base model, using a sequence length of 512 and a batch size
of 16, fine-tuned for 15 epochs on the complete WDC categories training set
(all gs.json<sup>7</sup>). We combine the product text features title + description
+ specTable into a single input.</p>
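      <table-wrap>
        <label>Tab. 4</label>
        <table>
          <thead>
            <tr><th>Class</th><th>P</th><th>R</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>new products with high similarity with known products (25 pos / 75 neg)</td><td>74.19</td><td>92.00</td><td>82.14</td></tr>
            <tr><td>new products with low similarity with known products (25 pos / 75 neg)</td><td>63.16</td><td>96.00</td><td>76.19</td></tr>
            <tr><td>known products with introduced typos (100 pos)</td><td>100.00</td><td>61.00</td><td>75.78</td></tr>
            <tr><td>known products with dropped tokens (100 pos)</td><td>100.00</td><td>73.00</td><td>84.39</td></tr>
            <tr><td>very hard cases for known products (25 pos / 75 neg)</td><td>91.67</td><td>88.00</td><td>89.80</td></tr>
            <tr><td>Overall result on hidden test set</td><td>86.20</td><td>82.10</td><td>84.10</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>7</sup> http://webdatacommons.org/largescaleproductcorpus/v2/index.html#toc6</p>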
      <p>Manual inspection of false positives and false negatives among the classified product
pairs of the computer validation set shows various edge cases, such as languages other
than English or similar product attributes for different products, that are hard
to distinguish or match, even for humans. Tab. 4 gives a detailed analysis on the
"hidden" test set and shows that our model performs best, in terms of F1 score, on the set of edge
cases ("very hard cases"), which are cases of highly similar
negative pairs or highly dissimilar positive pairs. The sets of known products are
both solved with a precision of 100 percent. This results in the highest precision
of all systems in the competition.</p>
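      <p>The F1 values in Tab. 4 are the harmonic mean of precision and recall, which explains why the two known-product sets reach only moderate F1 despite a precision of 100 percent. The short sketch below recomputes some of the reported values from P and R; the function name is chosen for this example only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs in percent, taken from Tab. 4.
    rows = {
        "known products with introduced typos": (100.00, 61.00),
        "known products with dropped tokens": (100.00, 73.00),
        "very hard cases for known products": (91.67, 88.00),
        "overall result on hidden test set": (86.20, 82.10),
    }
    for name, (p, r) in rows.items():
        print(f"{name}: {f1_score(p, r):.2f}")
```
]]></preformat>
      <p>For the set with introduced typos, a recall of 61 percent caps the F1 at about 75.8 even though no false positives occur.</p>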
      <p>
        Task 2: Product Classification
Training and Hyperparameters: As the language model we employ
DistilRoBERTa-base from the Huggingface library [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The model's weights can be
fine-tuned during the supervised training. We use four CNN layers with kernel sizes
of 3, 4, 5 and 6 with 100 filters each and a dropout rate of 0.5. The model is
trained using the Adam [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] optimizer with a learning rate of 1e-5 and a per-label
categorical cross-entropy loss. We pre-train our model on the WDC and/or
Wikidata set for 20 epochs before switching to the task data. During training
the model creates a checkpoint after each epoch, and we report the results of the best
epoch. From the task data we concatenated the text content of the following
features: name, description and url.
      </p>
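      <p>To make the CNN configuration more concrete, the following sketch computes the size of the feature vector that is passed to the output layers and shows the convolution and pooling step for a single filter on a toy input. It assumes unpadded convolutions and max-over-time pooling as in the architecture of [10]; whether our implementation matches this in every detail is not stated above, and all function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Sequence

KERNEL_SIZES = (3, 4, 5, 6)   # kernel sizes given in the text
FILTERS_PER_KERNEL = 100      # filters per kernel size given in the text


def conv_positions(seq_len: int, kernel_size: int) -> int:
    """Number of output positions of an unpadded 1D convolution."""
    return max(seq_len - kernel_size + 1, 0)


def conv1d_single_filter(values: Sequence[float],
                         weights: Sequence[float]) -> List[float]:
    """Slide one filter over a 1D input (no padding, stride 1)."""
    k = len(weights)
    return [sum(w * v for w, v in zip(weights, values[i:i + k]))
            for i in range(len(values) - k + 1)]


def max_over_time(feature_map: Sequence[float]) -> float:
    """Keep only the strongest activation of one filter."""
    return max(feature_map)


def pooled_feature_size(kernel_sizes: Sequence[int] = KERNEL_SIZES,
                        filters: int = FILTERS_PER_KERNEL) -> int:
    """Length of the concatenated feature vector after pooling."""
    return len(kernel_sizes) * filters


if __name__ == "__main__":
    print(pooled_feature_size())
    print([conv_positions(512, k) for k in KERNEL_SIZES])
    print(max_over_time(conv1d_single_filter([1, 2, 3, 4], [1, 1, 1])))
```
]]></preformat>
      <p>Under these assumptions the pooled feature vector has a fixed length that does not depend on the length of the input text.</p>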
      <p>Results: Since we chose a data-driven approach, we show the improvement each
additional step brings to the base model:</p>
      <list list-type="bullet">
        <list-item><p>"Base": The base model denotes the proposed combination of
DistilRoBERTa-base and the multi-output CNN architecture with a fixed language model.</p></list-item>
        <list-item><p>"FT": The language model weights are modified during training.</p></list-item>
        <list-item><p>"WDC": The model is pre-trained on the WDC data set.</p></list-item>
        <list-item><p>"Wiki": The model is pre-trained on the Wiki data set.</p></list-item>
      </list>
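      <sec id="sec-4-1">
        <table-wrap>
          <label>Tab. 5</label>
          <table>
            <thead>
              <tr><th></th><th>Average-P</th><th>Average-R</th><th>Average-F1</th></tr>
            </thead>
            <tbody>
              <tr><td>Base</td><td>73.02</td><td>76.13</td><td>72.76</td></tr>
              <tr><td>Base+FT</td><td>88.91</td><td>88.51</td><td>88.36</td></tr>
              <tr><td>Base+FT+WDC</td><td>93.64</td><td>92.79</td><td>92.93</td></tr>
              <tr><td>Base+FT+Wiki</td><td>89.04</td><td>88.30</td><td>88.37</td></tr>
              <tr><td>Base+FT+WDC+Wiki</td><td><bold>93.83</bold></td><td><bold>93.48</bold></td><td><bold>93.39</bold></td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">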
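        <p>Tab. 5 shows the ablation study of every extension we added to the training.
Unsurprisingly, the largest improvement stems from fine-tuning the model. When
fine-tuning, the model gains more than an order of magnitude in trainable parameters, going
roughly from 1.5M to 83.6M parameters. The second big improvement stems
from pre-training on the WDC data set. In a preliminary experiment we noticed
that combining the task data and WDC resulted in worse results on the
validation set. While pre-training on the Wiki data alone does not have a significant
impact on the final results, the combination of WDC and Wiki leads to the final
model we use to predict on the test set. Tab. 6 breaks down the results per
level. Level 3 was the most difficult to predict, mainly because of the larger
number of categories to classify. Here we see that the Wiki data seems to have a
slight positive impact on the level 3 categorisation but worsens results on levels 1 and 2,
which may explain the slightly better overall results when combining WDC and
Wiki data. Tab. 7 shows the official results on the hidden test set.</p>
        <table-wrap>
          <label>Tab. 6</label>
          <table>
            <thead>
              <tr><th></th><th colspan="3">Lvl1</th><th colspan="3">Lvl2</th><th colspan="3">Lvl3</th></tr>
              <tr><th></th><th>P</th><th>R</th><th>F1</th><th>P</th><th>R</th><th>F1</th><th>P</th><th>R</th><th>F1</th></tr>
            </thead>
            <tbody>
              <tr><td>Base</td><td>80.17</td><td>81.10</td><td>79.21</td><td>76.70</td><td>78.97</td><td>76.17</td><td>62.19</td><td>68.33</td><td>62.89</td></tr>
              <tr><td>Base+FT</td><td>91.49</td><td>91.30</td><td>91.24</td><td>90.76</td><td>90.50</td><td>90.43</td><td>84.48</td><td>83.73</td><td>83.40</td></tr>
              <tr><td>Base+FT+WDC</td><td>95.53</td><td>95.00</td><td>95.13</td><td><bold>95.03</bold></td><td>94.33</td><td><bold>94.48</bold></td><td>90.36</td><td>89.03</td><td>89.17</td></tr>
              <tr><td>Base+FT+Wiki</td><td>91.17</td><td>90.80</td><td>90.90</td><td>90.26</td><td>89.87</td><td>89.90</td><td>85.69</td><td>84.23</td><td>84.30</td></tr>
              <tr><td>Base+FT+WDC+Wiki</td><td><bold>95.56</bold></td><td><bold>95.37</bold></td><td><bold>95.33</bold></td><td>94.63</td><td><bold>94.53</bold></td><td>94.40</td><td><bold>91.31</bold></td><td><bold>90.53</bold></td><td><bold>90.43</bold></td></tr>
            </tbody>
          </table>
        </table-wrap>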
        <table-wrap>
          <label>Tab. 7</label>
          <table>
            <thead>
              <tr><th></th><th>P</th><th>R</th><th>F1</th></tr>
            </thead>
            <tbody>
              <tr><td>level 1</td><td>89.75</td><td>89.44</td><td>89.38</td></tr>
              <tr><td>level 2</td><td>88.66</td><td>88.22</td><td>88.05</td></tr>
              <tr><td>level 3</td><td>82.45</td><td>81.24</td><td>80.86</td></tr>
              <tr><td>Average</td><td>86.96</td><td>86.30</td><td>86.10</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We propose a language-model-driven approach for identifying whether two texts
describe the same product and which category they belong to. With this
text-only approach we left out additional available metadata, which, if successfully
included, may allow for even better results. This simple approach is nevertheless
enough to outperform the baseline results, and while the models used might be
complex, they can be set up easily and are a decent starting point for further research.
For example, integrating the remaining product metadata, such as
prices, brands or other features, into the predictions and conducting more in-depth error analyses to
better generalize our models to unknown inputs would be promising experiments.
The most important lesson from the task was the observation
that, even though we used pre-trained transformer models, more training data
still significantly boosts performance and introduces valuable information to the
classification process in both cases.</p>
        <p>Acknowledgments: This research was supported and funded in part by the
Development Bank of Saxony (SAB) under project number 100335729.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Semantic</given-names>
            <surname>Web Challenge</surname>
          </string-name>
          <article-title>ISWC2020 – Mining the Web of HTML-embedded Product Data</article-title>
          , https://ir-ischool-uos.github.io/mwpd/index.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
          </string-name>
          , W.:
          <article-title>*SEM 2013 shared task: Semantic textual similarity</article-title>
          .
          <source>In: Second Joint Conference on Lexical and Computational Semantics (*SEM)</source>
          , Volume
          <volume>1</volume>
          :
          <source>Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity</source>
          . pp.
          <fpage>32</fpage>
          –
          <lpage>43</lpage>
          . Association for Computational Linguistics, Atlanta, Georgia, USA (Jun
          <year>2013</year>
          ), https://www.aclweb.org/anthology/S13-1004
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amoualian</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goswami</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montalvo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Sigir 2020 e-commerce workshop data challenge overview</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diab</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Gazpio</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , L.:
          <article-title>SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation</article-title>
          .
          <source>In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>14</lpage>
          . Association for Computational Linguistics, Vancouver, Canada (Aug
          <year>2017</year>
          ). https://doi.org/10.18653/v1/S17-2001, https://www.aclweb.org/anthology/S17-2001
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . arXiv:1810.04805 [cs] (Oct
          <year>2018</year>
          ), http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brockett</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatically constructing a corpus of sentential paraphrases</article-title>
          .
          <source>In: Proceedings of the Third International Workshop on Paraphrasing (IWP2005)</source>
          (
          <year>2005</year>
          ), https://www.aclweb.org/anthology/I05-5002
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ganitkevitch</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Van Durme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>PPDB: The paraphrase database</article-title>
          .
          <source>In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>758</fpage>
          –
          <lpage>764</lpage>
          . Association for Computational Linguistics, Atlanta, Georgia (Jun
          <year>2013</year>
          ), https://www.aclweb.org/anthology/N13-1092
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. GS1:
          <article-title>Global Product Classification (GPC) - Standards</article-title>
          (Dec
          <year>2019</year>
          ), https://www.gs1.org/standards/gpc
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          . pp.
          <fpage>427</fpage>
          –
          <lpage>431</lpage>
          . Association for Computational Linguistics (Apr
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          .
          <source>arXiv:1408.5882 [cs]</source>
          (Sep
          <year>2014</year>
          ), http://arxiv.org/abs/1408.5882
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>arXiv:1412.6980 [cs]</source>
          (Jan
          <year>2017</year>
          ), http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zuo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Deep cross-platform product matching in e-commerce</article-title>
          .
          <source>Information Retrieval Journal</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>136</fpage>
          –
          <lpage>158</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep Learning for Extreme Multi-label Text Classification</article-title>
          .
          <source>In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>115</fpage>
          –
          <lpage>124</lpage>
          . SIGIR '17, ACM, New York, NY, USA (
          <year>2017</year>
          ). https://doi.org/10.1145/3077136.3080834, http://doi.acm.org/10.1145/3077136.3080834, event-place: Shinjuku, Tokyo, Japan
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          .
          <source>arXiv:1907.11692 [cs]</source>
          (Jul
          <year>2019</year>
          ), http://arxiv.org/abs/1907.11692
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR</source>
          (Jan
          <year>2013</year>
          ), http://arxiv.org/abs/1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mudgal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rekatsinas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deep</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arcaute</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavendra</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Deep learning for entity matching: A design space exploration</article-title>
          .
          <source>In: Proceedings of the 2018 International Conference on Management of Data</source>
          . pp.
          <fpage>19</fpage>
          –
          <lpage>34</lpage>
          . SIGMOD '18, Association for Computing Machinery, New York, NY, USA (
          <year>2018</year>
          ). https://doi.org/10.1145/3183713.3196926
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Primpeli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peeters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The WDC training dataset and gold standard for large-scale product matching</article-title>
          .
          <source>In: Companion Proceedings of The 2019 World Wide Web Conference</source>
          . pp.
          <fpage>381</fpage>
          –
          <lpage>386</lpage>
          . WWW '19, Association for Computing Machinery, New York, NY, USA (
          <year>2019</year>
          ). https://doi.org/10.1145/3308560.3316609
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrovski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A machine learning approach for product matching and categorization</article-title>
          .
          <source>Semantic Web</source>
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>707</fpage>
          –
          <lpage>728</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          .
          <source>In: NeurIPS EMC2 Workshop</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakajima</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Japanese and Korean voice search</article-title>
          .
          <source>In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . pp.
          <fpage>5149</fpage>
          –
          <lpage>5152</lpage>
          . IEEE
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Neural machine translation of rare words with subword units</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <fpage>1715</fpage>
          –
          <lpage>1725</lpage>
          . Association for Computational Linguistics, Berlin, Germany (Aug
          <year>2016</year>
          ). https://doi.org/10.18653/v1/P16-1162, https://www.aclweb.org/anthology/P16-1162
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kopru</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruvini</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Neural network based extreme classification and similarity models for product matching</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)</source>
          . pp.
          <fpage>8</fpage>
          –
          <lpage>15</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          –
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          –
          <lpage>85</lpage>
          (Sep
          <year>2014</year>
          ). https://doi.org/10.1145/2629489
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delangue</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cistac</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rault</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Louf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funtowicz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brew</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>
          .
          <source>arXiv abs/1910.03771</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norouzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macherey</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krikun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macherey</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1609.08144v2</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jing</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Label-Specific Document Representation for Multi-Label Text Classification</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          . pp.
          <fpage>466</fpage>
          –
          <lpage>475</lpage>
          . Association for Computational Linguistics, Hong Kong, China (Nov
          <year>2019</year>
          ). https://doi.org/10.18653/v1/D19-1044, https://www.aclweb.org/anthology/D19-1044
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT)</article-title>
          .
          <source>In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>11</lpage>
          . Association for Computational Linguistics, Denver, Colorado (Jun
          <year>2015</year>
          ). https://doi.org/10.18653/v1/S15-2001, https://www.aclweb.org/anthology/S15-2001
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>