=Paper=
{{Paper
|id=Vol-2720/paper5
|storemode=property
|title=Language Model CNN-driven Similarity Matching and Classification for HTML-embedded Product Data
|pdfUrl=https://ceur-ws.org/Vol-2720/paper5.pdf
|volume=Vol-2720
|authors=Janos Borst,Erik Körner,Kobkaew Opasjumruskit,Andreas Niekler
|dblpUrl=https://dblp.org/rec/conf/semweb/BorstKON20
}}
==Language Model CNN-driven Similarity Matching and Classification for HTML-embedded Product Data==
Language Model CNN-driven Similarity Matching and Classification for HTML-embedded Product Data
(Participating system in the Semantic Web Challenge "Mining the Web of HTML-embedded Product Data")

Janos Borst¹ [0000-0002-9166-4069], Erik Körner¹ [0000-0002-5639-6177], Kobkaew Opasjumruskit² [0000-0002-9206-6896], and Andreas Niekler¹ [0000-0002-3036-3318]

¹ Leipzig University, Faculty of Mathematics and Computer Science, Institute of Computer Science, Augustusplatz 10, 04109 Leipzig, Germany
https://www.informatik.uni-leipzig.de/
² German Aerospace Center (DLR), Institute of Data Science, Mälzerstraße 3, 07745 Jena, Germany
https://www.dlr.de/content/en/institutes/institute-of-data-science.html

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract. The Semantic Web Challenge "Mining the Web of HTML-embedded Product Data" aims to benchmark current technologies on the data integration tasks of (1) product matching and (2) product classification, as recent years have seen significant use of semantic annotations in the e-commerce domain, but often with inconsistencies, incomplete coverage or conflicting information. We introduce a transformer-based approach for textual product matching and extend it with a CNN for product classification. We compare the influence of different input feature combinations on prediction performance and introduce a technique to augment the classification task with additional information. We are able to outperform the baseline results using text-only approaches.
Keywords: product matching · product category classification · language models · natural language processing · text mining · deep learning
1 Introduction

The Semantic Web Challenge on Mining the Web of HTML-embedded Product Data defines two tasks, (1) product matching and (2) product classification, as main drivers for product information integration services and research on product knowledge graph acquisition. The need for data-driven, automatic product data integration emerged because semantic markup of product information on the web is often sparse or inconsistent. Since there is no standard for product classification, and product vendors use their own category systems, third-party product information integration services cannot rely on equal preconditions. As the main information page of the challenge [1] correctly states: “Addressing these
challenges requires an orchestra of semantic technologies tailored to the product domain, such as product classification, product offer matching, and product taxonomy matching. Such tasks are also crucial elements for the construction of product knowledge graphs, which are used by large, cross-sectoral e-commerce vendors.” Because of this, the challenge intends to assess the quality of systems addressing the two tasks. The challenge organizers developed data sets and resources which enable the comparability of various approaches.
The definition of the shared task states product matching as a binary classification problem: given two product descriptions, a system should decide whether they describe the same product or not. As mentioned before, product categorisations differ across websites. The second task is therefore defined as the classification of arbitrary product data sets into a unified single classification system. Our group addresses both tasks using language-model-driven neural classifiers.
In this paper we introduce a language-model-based approach for product similarity matching and a language-model-based multi-output text classification network for product classification. The content of this work is structured as follows: in Section 2 we position the task and the methods we used relative to related work. Section 3 explains in detail the methods, architectures and data sets we used, before we present the results in Section 4. We then conclude by discussing the results and pointing out possible improvements.
2 Related Work
The tasks we contribute to in this work are related to the fields of product classification, product matching and data linking. While the use of semantic annotations in the e-commerce domain has increased, it is still not sufficient in terms of consistency and completeness.
The similarity challenge of the product matching task is to predict, given a pair of structured product metadata, whether they describe the same product or not. Previous works on product classification, categorization and matching [19,12] perform well with text retrieval techniques and simple neural architectures and classification models like FastText [9] or Siamese Networks [23].
In the product classification domain, two similar data sets exist, i.e. the Rakuten Data Challenge [3], which only deals with data gathered from a single source, and the more closely related Web Data Commons [18] project, which is used as the basis for the data in this challenge.
The methods proposed in this paper are highly related to the fields of natural language representation, text classification and text similarity. In recent years, pre-training large language models has shown a high impact on downstream tasks. Transformer models such as BERT [5] or RoBERTa [14] can be pre-trained on large amounts of data in an unsupervised fashion. The pre-trained models provide a numeric and context-sensitive representation of any text, which is then finetuned to a specific task using task-specific data. While earlier approaches based on word embeddings, like [17] or [15], often choose to keep text representations fixed during task-specific training, finetuning seems to be the core strength of the language model approach.
Text classification is a fundamental task in Natural Language Processing. Before language model finetuning became standard procedure, word embeddings combined with task-specific neural architectures provided state-of-the-art results in multi- and single-label classification [10,13,28]. In [10] a CNN-based architecture for text classification is presented, which exhibits robust results on a broad range of data sets. The CNN layers extract features, which are then used to classify the text.
We hypothesize that textual similarity between product texts, like titles or descriptions, may be enough to decide whether products match. This bears structural similarity to semantic textual similarity, which has often been the topic of shared tasks [2,29,4] and for which a variety of data sets exist [6,7]. This suggests that current transformer language models like BERT [5], which compete for state-of-the-art scores in sentence pair classification, are a good starting point for this task.
3 Classification Models and Data Flow
3.1 Task 1: Product Matching
Sequence Pair Classification using Transformer Models: Our approach for the product matching task is based on the well-known BERT [5] architecture as a good candidate to solve the product matching task using text features only. We use the Huggingface [26] implementations of the standard BERT model as well as RoBERTa [14] and their Distil* [20] variants with pre-trained English language models, which we fine-tuned on the data sets. The model is structurally simple: it consists of a pre-trained transformer model which feeds its pooled output³ into a dropout layer, followed by a dense layer with either a single output for regression or two outputs for classification (“same product” or not).
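A minimal sketch of this matching model using PyTorch and the Huggingface transformers library; the model name, pooling details and head size are assumptions for illustration, and since DistilRoBERTa does not ship a BERT-style pooler, the pooling described in footnote 3 is re-implemented here:

    import torch
    from torch import nn
    from transformers import AutoModel

    class ProductPairMatcher(nn.Module):
        """Pre-trained transformer + dropout + dense output head.
        num_outputs=2: classification ("same product" or not);
        num_outputs=1: regression variant trained with MSE."""

        def __init__(self, model_name="distilroberta-base", num_outputs=2, dropout=0.1):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            hidden = self.encoder.config.hidden_size  # 768 for the base models
            # Pooling as described in footnote 3: dense layer + tanh over the
            # last hidden state of the first ([CLS] / <s>) token.
            self.pooler = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
            self.dropout = nn.Dropout(dropout)
            self.head = nn.Linear(hidden, num_outputs)

        def forward(self, input_ids, attention_mask):
            states = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            pooled = self.pooler(states[:, 0])  # representation of the first token
            return self.head(self.dropout(pooled))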
Datasets and Feature Usage: We chose to focus solely on the text features title, description and specTable of the product pair data⁴, as they contained the most text content and were structurally more consistent compared to keyValuePairs, brand names or prices. We later show how those three features compare against each other and in combination. Depending on the choice of text features used, we simply concatenated them into a single sequence and annotated which sequence belonged to which product, as sketched below. No further text preprocessing steps were required, as transformer models generally employ robust tokenizers, such as WordPiece [21,27] or Byte-Pair-Encoding [22], which can handle arbitrary text inputs. This resolves issues with unknown words.
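A sketch of this input preparation step, assuming products are given as dictionaries following the task's JSON field names; the helper names are illustrative, and the tokenizer's own separator handling marks which tokens belong to which product:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

    def product_text(product, fields=("title", "description", "specTable")):
        # Concatenate the selected text features of one product into a single
        # string; empty or missing fields are simply skipped.
        return " ".join(str(product[f]) for f in fields if product.get(f))

    def encode_pair(left, right, max_length=512):
        # Passing two texts lets the tokenizer insert the model's separator
        # tokens, so each sequence is annotated with the product it belongs to.
        # Longer inputs are truncated to the model-dependent maximum length.
        return tokenizer(product_text(left), product_text(right),
                         truncation=True, max_length=max_length,
                         padding="max_length", return_tensors="pt")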
³ The pooled output representation of BERT is based on the last hidden state of the [CLS] token, the first token in each sequence, which is intended to learn information about the entire text sequence. For pooling, this output is fed through a dense layer with 768 units and tanh activation.
⁴ An example of the data format can be found at: https://ir-ischool-uos.github.io/mwpd/index.html#task1
In addition to the provided computer training and validation set, we also included the more exhaustive Web Data Commons (WDC) product matching data set [18] to obtain a wider variety of topics, more training and validation data (see Tab. 1), as well as a chance for better generalization.
train set (attribute)    negative   positive
computer (title)           58,771      9,690
computer (desc)            30,102      5,019
computer (specTable)        9,416      1,650
WDC all                   184,462     30,198

Table 1: Training data set statistics: number of negative/positive product matching pairs for the computer and WDC all data sets; for computer also filtered by text feature occurrence.
3.2 Task 2: Product Classification
Classification Model: We employed a CNN architecture based on [10] for the product classification task. Since we understand the task as a single-label, multi-output setting, we adjust the network accordingly. As input to the network we use a transformer-based language model instead of static word vectors like GloVe [17]. The core of the network consists of the CNN feature extraction layers, which we implement analogously to the original paper [10], but instead of one output layer we use three, one for each hierarchy level of the data. A Dropout [24] layer is applied to the feature vector. For every output we calculate the loss using categorical cross-entropy, which is then summed over all outputs.
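A condensed sketch of this multi-output architecture; the kernel sizes, filter count and dropout follow the values given in Sect. 4.2, while the per-level class counts are placeholders:

    import torch
    from torch import nn
    from transformers import AutoModel

    class MultiOutputCNN(nn.Module):
        """Kim-style CNN over transformer token representations with one
        output layer per hierarchy level."""

        def __init__(self, classes_per_level, model_name="distilroberta-base",
                     kernel_sizes=(3, 4, 5, 6), filters=100, dropout=0.5):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            dim = self.encoder.config.hidden_size
            self.convs = nn.ModuleList(nn.Conv1d(dim, filters, k) for k in kernel_sizes)
            self.dropout = nn.Dropout(dropout)
            feat_dim = filters * len(kernel_sizes)
            self.heads = nn.ModuleList(nn.Linear(feat_dim, c) for c in classes_per_level)

        def forward(self, input_ids, attention_mask):
            h = self.encoder(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
            h = h.transpose(1, 2)                            # (batch, dim, seq_len)
            feats = torch.cat([conv(h).relu().max(dim=2).values
                               for conv in self.convs], dim=1)
            feats = self.dropout(feats)
            return [head(feats) for head in self.heads]      # one logit vector per level

    def multi_level_loss(logits_per_level, targets_per_level):
        # Categorical cross-entropy per output, summed over the hierarchy levels.
        return sum(nn.functional.cross_entropy(logits, targets)
                   for logits, targets in zip(logits_per_level, targets_per_level))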
External Data: We support the training process by using the WDC [18] data set⁵ and data extracted from Wikipedia. Since WDC uses the same category set as the task data, we can easily restrict it to the task's classes, which provides us with 8,004 additional examples.

⁵ English gold standard from http://webdatacommons.org/structureddata/2014-12/products/gs.html
Additionally, we also use generic descriptions derived from Wikidata [25] via its API. Names of classification examples from the training set are used to retrieve a set of candidate entities from Wikidata. We augment the descriptions from the training set with descriptions from the Global Product Classification (GPC) standard [8], using the labels as references. These extended descriptions are used to disambiguate and filter relevant entities from the candidate sets using a tf-idf weight matrix: each entity is assigned to the label with the most similar context according to the training examples (see the sketch below). From these entity sets we construct training examples by joining the text content of the alternative labels, descriptions, common categories and summaries from the Wikipedia page provided by the Wikidata API. We use only retrieved entity descriptions for GPC level 3 and, since the GPC hierarchy is a tree, automatically assign the parent nodes. This process provides 1,394 additional training examples.
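A simplified sketch of the tf-idf based disambiguation step, here using scikit-learn; retrieval of the candidate entities from the Wikidata API is omitted and all variable names are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def match_candidates_to_labels(candidate_texts, label_contexts):
        """candidate_texts: one text per retrieved Wikidata candidate entity.
        label_contexts: one reference text per label, built from the training
        examples extended with the GPC descriptions.
        Returns, per candidate, the index of the most similar label and the
        similarity score, which can be used to filter irrelevant candidates."""
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform(list(label_contexts) + list(candidate_texts))
        label_vectors = matrix[:len(label_contexts)]
        candidate_vectors = matrix[len(label_contexts):]
        similarities = cosine_similarity(candidate_vectors, label_vectors)
        return similarities.argmax(axis=1), similarities.max(axis=1)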
4 Results
The organizers set baseline results of 90.8% F1 on the validation set for product matching using deepmatcher [16], and 85.734% weighted average F1 for the product classification task using FastText [9]. In what follows, we present our experiments on the validation sets and the final model configurations we used to submit to the official leaderboard.⁶
4.1 Task 1: Product Matching
Training and Hyperparameters: The computer training set shows a large bias towards the negative class (“is not the same product”, see Tab. 1), which we account for by using class-weighted random sampling of the training data. We randomly discard about 80% of the negative product pairs in each epoch to match the number of positive samples, thereby avoiding a model skewed towards negative predictions only. One way to realize this sampling is sketched below.
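A sketch of this class-weighted subsampling using PyTorch's WeightedRandomSampler; the exact sampling procedure used in our experiments may differ in detail:

    import torch
    from torch.utils.data import DataLoader, WeightedRandomSampler

    def balanced_loader(dataset, labels, batch_size=16):
        # labels: 0 = negative pair, 1 = positive pair.
        labels = torch.as_tensor(labels)
        class_counts = torch.bincount(labels)           # e.g. [58771, 9690]
        weights = 1.0 / class_counts[labels].float()    # rarer class gets a higher weight
        # Drawing roughly twice the number of positive pairs without replacement
        # discards most of the surplus negative pairs in every epoch.
        sampler = WeightedRandomSampler(weights,
                                        num_samples=2 * int(class_counts.min()),
                                        replacement=False)
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)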
We kept the default dropout of 0.1. The maximum sequence length of the text input is a model-dependent parameter, being either 128 or 512 tokens; depending on the amount of text input, this leads to truncation of the input. Correlated with the model-dependent sequence length, we choose batch sizes of 8, 16 or 32, training for either 3 or 15 epochs. We also compare a two-label prediction setup (“matching” product or not) using cross-entropy loss against a single-output network using mean squared error loss (regression).
Results: We start with a simple BERT-base model approach and improve it with more recent language models, combining various product text features, hyperparameter settings, and additional data. As shown in Tab. 2, starting from initially about 60%, we are able to increase the F1 by more than 30% on the computer validation set.
The largest improvements stem from using the Distil* transformer model variants. Compared to BERT-base, they improve performance by up to 25 percentage points while consuming less memory, which makes either longer sequences or larger batch sizes possible. The distilled versions of RoBERTa further improve the F1 scores, although by a smaller margin. Using the WDC product data corpus as additional training data only marginally improves results, indicating that the original data set is sufficient to finetune on the computer topic and that more generalization through other topics is not necessary.
In Tab. 3 we compare which text input feature combinations perform best while keeping other hyperparameters unchanged. As transformer models are not designed to artificially align input sequences consisting of differing features on some boundary and then pad them, we simply concatenate the text features into a single sequence for each product.

⁶ https://ir-ischool-uos.github.io/mwpd/index.html#results
model                epochs  train  eval   F1
bert-base                 3  comp   comp   65.22
bert-base                 3  all    comp   64.24
distilroberta             3  all    comp   91.73
distilbert                3  all    comp   87.62
distilroberta             3  comp   comp   91.41
distilroberta            15  comp   comp   95.05
distilroberta (reg)      15  comp   comp   95.57
distilroberta            15  all    comp   95.80
distilroberta (reg)      15  all    comp   95.00

Table 2: Overview of results for various hyperparameter configurations. All models used are uncased variants. Text input is a combination of title+description+specTable. comp denotes the computer-only training set and all contains all categories. reg denotes regression instead of two-class output.

Features                        P      F1
title+description+specTable   88.96  91.41
title                         88.82  91.96
description                   71.65  69.15
specTable                     77.19  80.73
description+title+specTable   88.71  90.16

Table 3: Precision and F1 scores for the label “matching product” on the computer train and validation sets, using the DistilRoBERTa model with 3 epochs of finetuning and a sequence length of 512.
The features description and specTable are sometimes empty, as shown in Tab. 1, and the various feature fields contain texts of varying lengths, which results in differences in the available context when generating vector representations. However, the advantage of combining those text sequences is that more context is available for comparisons and that we can use alternative texts for possibly missing fields, e.g. descriptive titles and description texts.
We achieve the best result of 95.8% F1 on the computer validation set with the DistilRoBERTa-base model, using a sequence length of 512 and a batch size of 16, finetuned for 15 epochs on the complete WDC categories training set (all gs.json⁷). We combine the product text features title + description + specTable into a single input.
Class P R F1
new products with high similarity with known products (25 pos / 75 neg) 74.19 92.00 82.14
new products with low similarity with known products (25 pos / 75 neg) 63.16 96.00 76.19
known products with introduced typos (100 pos) 100.00 61.00 75.78
known products with dropped tokens (100 pos) 100.00 73.00 84.39
very hard cases for known products (25 pos / 75 neg) 91.67 88.00 89.80
Overall result on hidden test set 86.20 82.10 84.10
Table 4: Official analysis of submitted test predictions.
⁷ http://webdatacommons.org/largescaleproductcorpus/v2/index.html#toc6
Manual inspection of false positives and false negatives in classified product pairs of the computer validation set shows various edge cases, such as languages other than English or similar product attributes for different products, that are hard to distinguish or match even for humans. Tab. 4 gives a detailed analysis on the “hidden” test set and shows that our model performs best, in terms of F1 score, on the set of edge cases (“very hard cases”), which are cases of highly similar negative pairs or highly dissimilar positive pairs. The two sets of known products are both solved with a precision of 100 percent. This results in the highest precision of all systems in the competition.
4.2 Task 2: Product Classification
Training and Hyperparameters: As the language model we employ DistilRoBERTa-base from the Huggingface library [26]. The model's weights can be finetuned during the supervised training. We use four CNN layers with kernel sizes of 3, 4, 5 and 6 with 100 filters each, and a dropout rate of 0.5. The model is trained using the Adam [11] optimizer with a learning rate of 1e-5 and a per-label categorical cross-entropy loss. We pre-train our model on the WDC and/or Wikidata set for 20 epochs before switching to the task data; a sketch of this schedule is given below. During training the model creates a checkpoint each epoch and we report the results of the best epoch. From the task data we concatenate the text content of the following features: name, description and url.
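A sketch of this two-stage training schedule; the model and the summed per-level loss refer to the multi-output CNN sketched in Sect. 3.2, while the data loaders, device handling and checkpointing are assumed and simplified:

    import torch

    def run_epochs(model, loader, epochs, lr=1e-5, device="cuda"):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        model.to(device).train()
        for _ in range(epochs):
            for input_ids, attention_mask, level_targets in loader:
                optimizer.zero_grad()
                logits = model(input_ids.to(device), attention_mask.to(device))
                loss = multi_level_loss(logits, [t.to(device) for t in level_targets])
                loss.backward()
                optimizer.step()
            # In practice a checkpoint is written after every epoch and the best
            # one (according to the validation set) is kept.

    # Pre-training on the external data, then switching to the task data:
    run_epochs(model, external_loader, epochs=20)  # WDC and/or Wiki data
    run_epochs(model, task_loader, epochs=20)      # task data; epoch count illustrative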
Results: Since we chose a data-driven approach, we show the improvement each additional step brings over the base model:
– “Base”: The base model denotes the proposed combination of DistilRoBERTa-base and the multi-output CNN architecture with a fixed language model.
– “FT”: The language model weights are modified during training.
– “WDC”: The model is pre-trained on the WDC data set.
– “Wiki”: The model is pre-trained on the Wiki data set.
                  Average-P  Average-R  Average-F1
Base                  73.02      76.13       72.76
Base+FT               88.91      88.51       88.36
Base+FT+WDC           93.64      92.79       92.93
Base+FT+Wiki          89.04      88.30       88.37
Base+FT+WDC+Wiki      93.83      93.48       93.39

Table 5: Results on the validation set of the product classification task, averaged over the per-level weighted averages.
Tab. 5 shows the ablation study of every extension we added to the training. Unsurprisingly, the largest improvement stems from finetuning the language model: when finetuning, the model gains an order of magnitude in trainable parameters, going roughly from 1.5M to 83.6M parameters. The second big improvement stems from pre-training on the WDC data set. In a preliminary experiment we noticed that combining the task data and WDC resulted in worse results on the validation set. While pre-training on the Wiki data alone does not have a significant impact on the final results, the combination of WDC and Wiki leads to the final model we use to predict on the test set.
                        Lvl1                   Lvl2                   Lvl3
                    P      R      F1       P      R      F1       P      R      F1
Base              80.17  81.10  79.21    76.70  78.97  76.17    62.19  68.33  62.89
Base+FT           91.49  91.30  91.24    90.76  90.50  90.43    84.48  83.73  83.40
Base+FT+WDC       95.53  95.00  95.13    95.03  94.33  94.48    90.36  89.03  89.17
Base+FT+Wiki      91.17  90.80  90.90    90.26  89.87  89.90    85.69  84.23  84.30
Base+FT+WDC+Wiki  95.56  95.37  95.33    94.63  94.53  94.40    91.31  90.53  90.43

Table 6: Weighted averages for every classification level on the validation set of the product classification task.
Tab. 6 breaks down the results per level. Level 3 was the most difficult to predict, mainly due to the larger number of categories to classify. Here we see that the Wiki data seems to have a slightly positive impact on the level 3 categorisation, but worsens results on levels 1 and 2, which may explain the slightly better overall results when combining the WDC and Wiki data. Tab. 7 shows the official results on the hidden test set.
           P      R      F1
level 1  89.75  89.44  89.38
level 2  88.66  88.22  88.05
level 3  82.45  81.24  80.86
Average  86.96  86.30  86.10

Table 7: Results on the hidden test set (weighted averages).
5 Discussion and Improvements
We suggest a language-model-driven approach for identifying whether two texts describe the same product and which category they belong to. With this text-only approach we left out additional available metadata, which, if successfully included, may allow for even better results. This simple approach is nevertheless enough to outperform the baseline results, and while the models used might be complex, they can be easily set up and are a decent starting point for further research. For example, integrating the remaining product metadata, like prices, brand or other features, into the predictions, as well as more in-depth error analyses to better generalize our models to unknown inputs, would be promising experiments. The most important outcome and learning from the task was the observation that, even though we used pre-trained transformer models, more training data still significantly boosts performance and introduces valuable information to the classification process in both cases.
Acknowledgments: This research was supported and funded in part by the Development Bank of Saxony (SAB) under project number 100335729.
References
1. Semantic Web Challenge ISWC2020 – Mining the Web of HTML-embedded Prod-
uct Data, https://ir-ischool-uos.github.io/mwpd/index.html
2. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared
task: Semantic textual similarity. In: Second Joint Conference on Lexical and
Computational Semantics (*SEM), Volume 1: Proceedings of the Main Con-
ference and the Shared Task: Semantic Textual Similarity. pp. 32–43. Associ-
ation for Computational Linguistics, Atlanta, Georgia, USA (Jun 2013), https://www.aclweb.org/anthology/S13-1004
3. Amoualian, H., Goswami, P., Ach, L., Das, P., Montalvo, P.: SIGIR 2020 E-Commerce Workshop Data Challenge Overview
4. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task
1: Semantic textual similarity multilingual and crosslingual focused evaluation.
In: Proceedings of the 11th International Workshop on Semantic Evaluation
(SemEval-2017). pp. 1–14. Association for Computational Linguistics, Vancouver,
Canada (Aug 2017). https://doi.org/10.18653/v1/S17-2001, https://www.aclweb.org/anthology/S17-2001
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidi-
rectional Transformers for Language Understanding. arXiv:1810.04805 [cs] (Oct
2018), http://arxiv.org/abs/1810.04805, arXiv: 1810.04805
6. Dolan, W.B., Brockett, C.: Automatically constructing a corpus of sentential para-
phrases. In: Proceedings of the Third International Workshop on Paraphrasing
(IWP2005) (2005), https://www.aclweb.org/anthology/I05-5002
7. Ganitkevitch, J., Van Durme, B., Callison-Burch, C.: PPDB: The paraphrase
database. In: Proceedings of the 2013 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies.
pp. 758–764. Association for Computational Linguistics, Atlanta, Georgia (Jun
2013), https://www.aclweb.org/anthology/N13-1092
8. GS1: Global Product Classification (GPC) - Standards (Dec 2019), https://www.gs1.org/standards/gpc
9. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text
classification. In: Proceedings of the 15th Conference of the European Chapter
of the Association for Computational Linguistics: Volume 2, Short Papers. pp.
427–431. Association for Computational Linguistics (April 2017)
10. Kim, Y.: Convolutional Neural Networks for Sentence Classification.
arXiv:1408.5882 [cs] (Sep 2014), http://arxiv.org/abs/1408.5882, arXiv: 1408.5882
11. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization.
arXiv:1412.6980 [cs] (Jan 2017), http://arxiv.org/abs/1412.6980, arXiv: 1412.6980
12. Li, J., Dou, Z., Zhu, Y., Zuo, X., Wen, J.R.: Deep cross-platform product matching
in e-commerce. Information Retrieval Journal 23(2), 136–158 (2020)
13. Liu, J., Chang, W.C., Wu, Y., Yang, Y.: Deep Learning for Ex-
treme Multi-label Text Classification. In: Proceedings of the 40th Inter-
national ACM SIGIR Conference on Research and Development in Infor-
mation Retrieval. pp. 115–124. SIGIR ’17, ACM, New York, NY, USA
(2017). https://doi.org/10.1145/3077136.3080834, http://doi.acm.org/10.1145/3077136.3080834, event-place: Shinjuku, Tokyo, Japan
14. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: RoBERTa: A Robustly Optimized BERT Pretrain-
ing Approach. arXiv:1907.11692 [cs] (Jul 2019), http://arxiv.org/abs/1907.11692,
arXiv: 1907.11692
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Rep-
resentations in Vector Space. CoRR (Jan 2013), http://arxiv.org/abs/1301.3781,
arXiv: 1301.3781
16. Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep,
R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: A de-
sign space exploration. In: Proceedings of the 2018 International Conference on
Management of Data. pp. 19–34. SIGMOD ’18, Association for Computing Ma-
chinery, New York, NY, USA (2018). https://doi.org/10.1145/3183713.3196926,
https://doi.org/10.1145/3183713.3196926
17. Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word repre-
sentation. In: Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP). pp. 1532–1543 (2014)
18. Primpeli, A., Peeters, R., Bizer, C.: The wdc training dataset and gold standard for
large-scale product matching. In: Companion Proceedings of The 2019 World Wide
Web Conference. pp. 381–386. WWW ’19, Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3308560.3316609
19. Ristoski, P., Petrovski, P., Mika, P., Paulheim, H.: A machine learning approach
for product matching and categorization. Semantic Web 9(5), 707–728 (2018)
20. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In: NeurIPS EMC² Workshop (2019)
21. Schuster, M., Nakajima, K.: Japanese and korean voice search. In: 2012 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP).
pp. 5149–5152. IEEE (2012)
22. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare
words with subword units. In: Proceedings of the 54th Annual Meet-
ing of the Association for Computational Linguistics (Volume 1: Long Pa-
pers). pp. 1715–1725. Association for Computational Linguistics, Berlin, Ger-
many (Aug 2016). https://doi.org/10.18653/v1/P16-1162, https://www.aclweb.org/anthology/P16-1162
23. Shah, K., Kopru, S., Ruvini, J.D.: Neural network based extreme classification and
similarity models for product matching. In: Proceedings of the 2018 Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 3 (Industry Papers). pp. 8–15 (2018)
24. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A simple way to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research 15(1), 1929–1958 (2014)
25. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Com-
mun. ACM 57(10), 78–85 (Sep 2014). https://doi.org/10.1145/2629489
26. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P.,
Rault, T., Louf, R., Funtowicz, M., Brew, J.: Huggingface’s transformers: State-
of-the-art natural language processing. ArXiv abs/1910.03771 (2019)
27. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144v2 (2016)
28. Xiao, L., Huang, X., Chen, B., Jing, L.: Label-Specific Document Represen-
tation for Multi-Label Text Classification. In: Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-
IJCNLP). pp. 466–475. Association for Computational Linguistics, Hong Kong,
China (Nov 2019). https://doi.org/10.18653/v1/D19-1044, https://www.aclweb.org/anthology/D19-1044
29. Xu, W., Callison-Burch, C., Dolan, B.: SemEval-2015 task 1: Paraphrase and se-
mantic similarity in twitter (PIT). In: Proceedings of the 9th International Work-
shop on Semantic Evaluation (SemEval 2015). pp. 1–11. Association for Computa-
tional Linguistics, Denver, Colorado (Jun 2015). https://doi.org/10.18653/v1/S15-2001, https://www.aclweb.org/anthology/S15-2001