=Paper= {{Paper |id=Vol-3075/paper10 |storemode=property |title=Towards Multi-Modal Entity Resolution for Product Matching |pdfUrl=https://ceur-ws.org/Vol-3075/paper10.pdf |volume=Vol-3075 |authors=Moritz Wilke,Erhard Rahm |dblpUrl=https://dblp.org/rec/conf/gvd/WilkeR21 }} ==Towards Multi-Modal Entity Resolution for Product Matching== https://ceur-ws.org/Vol-3075/paper10.pdf
          Towards Multi-Modal Entity Resolution for Product
                             Matching

                               Moritz Wilke                                                      Erhard Rahm
                             Leipzig University                                                 Leipzig University
                wilke@informatik.uni-leipzig.de                                    rahm@informatik.uni-leipzig.de


ABSTRACT
Entity Resolution has been applied successfully to match product offers from different web shops. Unfortunately, in certain domains the (textual or numerical) attributes of a product are not sufficient for a reliable match decision. To overcome this problem we extend an attribute-based matching system to incorporate image data, which are available in almost every web shop. To evaluate the system we enhance the WDC product matching dataset with images crawled from the web. First evaluations show that the use of images is beneficial to increase recall and overall match quality.

Keywords
Record Linkage, Product Matching, Deep Learning

           Product 1                          Product 2
  Title    nike air max 2016 806771 001      nike sportswear air max 90 ultra
           para hombre negro 040 footstop    essential black mens shoes dark
                                             grey white 819474 013
  Desc.    -                                 featuring no sew overlays the air
                                             max 90 ultra delivers a supportive
                                             and lightweight feel its visible air
                                             sole unit helps absorb impact [...]
  Brand    -                                 -
  Price    -                                 -

Figure 1: Example of two (non-matching) products (product photos not reproduced here)

1.    INTRODUCTION
   Entity resolution (ER), also known as record linkage, is the procedure of identifying which items from one or more data sources refer to the same real-world entities. It is an important step in data integration, where the goal is to unify data from different origins to increase the quality and size of available data for further analysis. It is mostly based on structured or semi-structured entity descriptions consisting of several attributes (such as 'Name', 'Date of Birth', 'Zip code'). For web data the attributes are often missing and noisy and may contain longer textual descriptions.
   An application of ER is the matching of product offers across the web. It can be used to compare prices and inventories of web shops or to present the best offer for a desired product to a user. This task can often be tackled with attribute-based ER, e.g. utilizing attribute values extracted from the product web pages. A smartphone, for example, can be distinguished by attributes such as 'brand', 'model', 'storage size', 'display size', 'weight', etc. But it is still hard to match items where (textual) information is sparse or varies much, e.g. in the fashion domain. In contrast, the selling of fashion items is largely driven by their visual features and hence every shop has images available.
   Recent developments in computer vision (which are largely driven by deep learning technologies) allow the assumption that a reliable distinction of product images is achievable, given enough data. The two similar but different shoes presented in Figure 1 highlight some problems and opportunities of visual product matching. The images readily show that the two shoes are different, while the attribute information makes it difficult to come to a clear decision. This is because the missing attribute values lead to larger differences in the description. Moreover, the description of the second shoe is written in an advertising style to convince the customer, which decreases its informational value. So in this case the title is the only usable attribute, which does not contain a lot of information, however. This leads to our main research question: How can we utilize the additional information provided by product images to improve matching quality?
   An obstacle for research in ER in general and product matching in particular has been the lack of large and public datasets that contain ground-truth information about matching entities. There are several datasets that contain product images in combination with descriptions, and there are datasets that contain matching pairs of either (product) images or descriptions. However, there is currently no public dataset that contains product images and descriptions as well as the true set of matching items.
   The main contributions of this work are the creation of a suitable multi-modal ER benchmark dataset (Section 3), an extension of the existing DeepMatcher [11] ER framework to also use image data for matching (Section 4), and a first evaluation of the system on a subset of the dataset (Section 5). As this work is still in progress, we also discuss current limitations and plans for future work (Section 6).

32nd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), September 01-03, 2021, Munich, Germany.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2.    RELATED WORK
   This work builds on previous work on ER approaches, existing ER datasets, and neural networks for text and image analysis.

2.1    Entity Resolution
   There is a long history of research in the field of entity resolution and a comprehensive introduction can be found in [4]. A lot of work has been put into adequate similarity metrics for attribute values and rule- or tree-based combinations of those similarities for match classification. Furthermore, blocking techniques have been devised to pre-filter match candidates in order to reduce the quadratic complexity of comparing all items with each other.
   In recent years ER has been scaled to large datasets using the map-reduce paradigm, and there has been a lot of research on utilizing machine learning for either parts of an ER pipeline (e.g. blocking, similarity computation, match decisions) or the configuration of the whole process. Due to the increasing amount of data available from the internet, ER approaches that work on heterogeneous, noisy and unstructured or semi-structured data have gained importance [12].
   An overview of the (increasing) usage of neural networks and deep learning for entity resolution can be found in [2]. A recent ER system that uses deep learning is DeepMatcher [11]. It can use word embeddings to encode the information from different attributes and provides different sequence models to align and compare those encodings.
   Applying ER on e-commerce data has been the focus of Köpcke et al. [8]. An initial approach to use image data for product matching can be found in [14]. The authors use attribute matching and enrich it with image embeddings generated by a convolutional neural network. The main difference to our approach is that the neural network is not directly trained on the matching task but solely functions as a feature extractor. The cosine similarity between two image embeddings is passed to a match classifier along with the similarities of textual features. An evaluation is performed on three datasets from the electronics domain (laptops, televisions, phones) which consist of 200-300 products. F-scores of up to 73.35% (laptops), 83.27% (phones) and 84.96% (televisions) are achieved, with the images improving the best text matching results by a small amount (around 1%). The authors conclude that images cannot be used as a strong matching signal because some product variations that are important to distinguish (such as a phone with 16GB of storage vs. one with 64GB) use the exact same image. Such problems should have less impact in other product domains such as clothing. Current research in computer vision and image retrieval lets us expect robust image matching, given enough clean data, at least for certain categories of products.

2.2    Datasets
   The WDC dataset (WDC Product Data Corpus and Gold Standard for Large-scale Product Matching 2.0) has been published by Primpeli et al. [13]. They use the (partial) existence of product identifiers such as MPN and SKU to create weak clusters of matching products. To refine these noisy clusters and to obtain training data and ground truth for evaluation, they apply a number of heuristics, machine learning algorithms and manual processing steps. The result is a dataset of 16 million products (supposedly described in English). The applied clustering by identifiers can be considered a silver standard. The creators add hand-crafted true matches and semi-automatically created differently sized training sets in four product categories (watches, shoes, cameras, electronics). Although these datasets are suitable for ER evaluations, they do not contain product images. The aforementioned experiments by Ristoski et al. [14] have been conducted on an earlier version of the WDC dataset, but unfortunately the corresponding images are not available anymore.
   DeepFashion2 [5] is a dataset for image retrieval in the fashion domain. It contains different images for the same product, some from shops, some from users. However, the products have no textual properties (such as a description, brand or price), so it is not multi-modal.
   The dataset for the SIGIR eCom 2020 multi-modal product classification and retrieval challenge [1] contains product descriptions and images but does not contain ground truth for product matching.
   To the best of our knowledge, there is currently no ER dataset with both product images and descriptions and information about matching products.

2.3    Multi-modal deep learning
   Over the last years, convolutional neural networks (CNNs) have achieved many breakthroughs in the field of computer vision and have contributed much to the wide adoption of deep learning [9]. An adaption of deep learning for fashion images called Match R-CNN is presented in [5]. It supports the identification and classification of fashion items as well as image retrieval. For attribute-based ER, deep learning approaches internally work mostly with the concept of distributed representations (embeddings). The combination of text and image data in deep learning systems is called multi-modal deep learning. Typical applications include the creation of text descriptions for images or image retrieval from text queries. An overview of tasks, datasets and problems in this field can be found in [10].

3.    BENCHMARK DATASET CREATION
   To overcome the mentioned lack of a multi-modal ER benchmark dataset, we extend the WDC dataset with image data. The underlying Common Crawl¹ snapshot dates back to November 2017 and does not contain additional data apart from the HTML documents, hence it is not initially clear to what extent the URLs are still valid and whether the images are still available. To retrieve images, the following procedure is used: First, the documents are parsed for HTML tags that contain image URLs and a (semantic) annotation indicating that the image belongs to a product (this is needed to avoid collecting unrelated images from the website).
   In the second step, we query the URLs to retrieve up to five images per product. Finally, a procedure to query the Internet Archive² for missing images is used. Due to its slow speed, this method is only applied selectively to achieve higher coverage of images for the experiments.

1 http://commoncrawl.org/
2 https://archive.org/
                                        products      positive pairs    negative     % title    % description       % image
     WDC shoes train (xlarge)                2450                4141       38288       100%                 53%          79%
     WDC shoes gold standard                 1111                 300         800       100%                 69%          81%
     Evaluation subset train                  950                1350        6286       100%                 52%         100%
     Evaluation subset gold standard          813                 206         586       100%                 76%         100%

Table 1: Comparison of the original WDC shoe training and gold standard data and the cleaned evaluation
subset for which all images have been verified manually.


   The result is a database of images for 10M (63%) of the products from the WDC corpus. The gathered image data is far from clean. Problems include that some images do not depict the correct product but a (brand or shop) logo, a placeholder or something entirely different, or that images show the product isolated or in the context of a scene and in combination with other objects. Also, some images are pixel-wise duplicates of each other.
   Table 1 shows the crawling results for the shoe category of the original dataset. The initial coverage of raw images is 79% on the training set and 81% on the gold standard. Since we want to run our initial evaluation on data that has full image coverage and does not contain wrong or noisy image data, we manually verify the images and create evaluation subsets with the desired properties, which are consequently smaller. The inspection also reveals that pairs in the existing gold standards have been determined without consideration of the product images. Hence, there are hand-labelled match pairs with a similar textual description where the images show (minor) differences. A similar problem occurs in the training data: to create training pairs with optimal informational value, the authors of the original dataset adapt a method from [7]. It consists of the precomputation of similarity scores between known matches and non-matches, followed by choosing positive training pairs with low and negative training pairs with high similarity. Since we do not repeat this process, the images have no influence on the choice of the training pairs. The effect of this bias has not been examined yet.

4.    MATCHING SYSTEM
   Our current system for multi-modal product matching builds on DeepMatcher [11]. We add the capability of processing image data while keeping the overall structure and components. Figure 2 shows the match processing for a pair of records with the same attributes and an image to determine a binary match/non-match decision. Schema alignment and the creation of suitable candidate pairs are not part of the system.
   We now describe the four main components of DeepMatcher and how they relate to the processing of images. The first component is attribute-level embedding, which converts each word (or n-gram) of an attribute value into a word vector by applying a pre-trained word model. The output of this component is a list of such embeddings with the same length as the number of input words. Because those lists are of different length for each record, they have to be aligned before the attributes of two products can be compared. This procedure is performed by the attribute summarizer, which can be any kind of sequence-to-vector module, e.g. a recurrent neural network. The objective of this module is to compress the information, filter out redundant and meaningless words, and represent the attribute as a fixed-length vector.
   An image processing module is added to DeepMatcher with the aim of creating an image representation that can be processed exactly like the existing feature vectors. This is achieved in the following way: first, an optional pre-processing step detects the main shapes in the image and crops it accordingly. This is done to reduce white-space and non-informational regions of the image. There are other possible ways to achieve this cropping, for example by using another neural network for object detection and/or masking the relevant regions in the image. But these methods add further complexity and they need training data (bounding boxes or masks) for each product category. The second step is to feed the image to a pre-trained image classification neural network (our first experiments use ResNet50 [6]) and to append a fully connected layer to downsample its representation of the image to the dimension of the other feature vectors. To reduce the number of learnable parameters and hence training time, the first six layers of the backbone neural network are fixed and do not change during the training of the matching system. Since we currently only use a single image per product, summarization is not needed for the images.
   After these steps, the feature vector resulting from an image or a text attribute has the same dimensions and can be treated equally. The attribute comparators take feature vectors of the same attribute from both products as input and create a similarity representation. In DeepMatcher, this can be the absolute distance of both vectors, their concatenation or any other method that returns a vector. All attribute similarities are finally fed into the classification component, which returns the final match decision. The standard classification method is a two-layered, fully connected neural network.

5.   EVALUATION
   To evaluate the system, a subset of the existing WDC shoe training set and gold standard is created by filtering all pairs that can be enriched with image data for both products. The size and attribute coverage of the evaluation datasets are provided in Table 1. This may not reflect real-world data, but it is easier to establish a working system on clean and complete data and later improve its robustness against real-world anomalies. Similarly, the shoe category was chosen as initial category because we assume that products in a domain such as fashion, which is determined by visual factors, are more likely to be matched using their images than products from a category like electronics, where the visual aspect plays a minor role and the products are easier to describe with technical specifications.
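The optional cropping step described in Section 4 can be illustrated with a minimal sketch. The paper does not specify how the "main shapes" are detected; this version assumes the simplest case, a grayscale numpy array with a near-white background, and crops to the bounding box of the darker content.

```python
import numpy as np

def crop_whitespace(img, threshold=240, pad=2):
    """Crop a grayscale image (2-D uint8 array) to the bounding box of
    its non-white content, plus a small padding margin. Pixels brighter
    than `threshold` are treated as background; both parameter values
    are illustrative choices, not taken from the paper."""
    mask = img < threshold                    # True where "content" is
    if not mask.any():                        # blank image: nothing to crop
        return img
    rows = np.flatnonzero(mask.any(axis=1))   # rows containing content
    cols = np.flatnonzero(mask.any(axis=0))   # columns containing content
    r0, r1 = max(rows[0] - pad, 0), min(rows[-1] + pad + 1, img.shape[0])
    c0, c1 = max(cols[0] - pad, 0), min(cols[-1] + pad + 1, img.shape[1])
    return img[r0:r1, c0:c1]
```

As noted above, the same effect could be obtained with an object detector, at the price of needing bounding-box training data per product category.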
Figure 2: Workflow for multi-modal product matching. (The diagram shows, for a pair of records such as "nike air max 2016…" and "nike sportswear…": attribute-level embedding producing a list of vectors per attribute value, the attribute summarizer condensing each list into a single vector describing the attribute, and the attribute comparator producing a vector describing the similarity; product images enter through a separate image embedding component; all similarity vectors feed the classifier, which outputs the result.)
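The data flow in Figure 2 can be traced end to end with a toy numpy sketch. This is purely illustrative: mean pooling stands in for the RNN summarizer, the comparator is the absolute-distance variant mentioned in Section 4, and the two-layered classifier uses random, untrained weights; the dimension and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension (the real system uses e.g. 300-d fastText)

def summarize(word_vectors):
    """Attribute summarizer: variable-length list of word embeddings ->
    one fixed-length vector. Mean pooling stands in for the RNN."""
    return np.mean(word_vectors, axis=0)

def compare(v1, v2):
    """Attribute comparator: element-wise absolute distance, one of the
    comparison methods available in DeepMatcher."""
    return np.abs(v1 - v2)

# Two-layered, fully connected classifier (random toy weights).
W1, b1 = rng.normal(size=(DIM, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def classify(similarity_vector):
    hidden = np.maximum(similarity_vector @ W1 + b1, 0)  # ReLU layer
    logit = hidden @ W2 + b2
    return 1 / (1 + np.exp(-logit))                      # match probability

# Titles of different lengths -> same-dimension vectors -> similarity -> score.
title_a = rng.normal(size=(5, DIM))  # 5 "word" embeddings
title_b = rng.normal(size=(9, DIM))  # 9 "word" embeddings
sim = compare(summarize(title_a), summarize(title_b))
prob = classify(sim)
```

With several attributes (and the image embedding), one such similarity vector per feature would be produced and the classifier would consume their combination.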


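Table 2 below reports precision (P), recall (R) and the f-score (F1), which is their harmonic mean; a one-line helper (not part of the paper's code) makes the relation explicit:

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall, as reported in Table 2."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. the image-only row: P = 61.2 and R = 90.8 give an F1 of roughly 73.1
```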
               Features                    F1      P       R
               image                      73.1    61.2    90.8
               title                      85.2    79.3    91.9
               title+image               85.6    79.2    93.0
               title+description         81.5    73.2    91.9
               title+description+image   83.6    75.4    93.8

Table 2: Match quality on the (clean) shoe dataset

   Selecting the pairs with images decreases the size of the training set from 42.4k pairs to 7.6k pairs and the size of the gold standard (used as test set) from 1,111 to 792 pairs. A validation set is taken out of the training data.
   We choose the DeepMatcher configuration that performed best in the baseline experiments for the WDC dataset conducted in [13], which uses a pre-trained fastText model [3] to create token-level embeddings from the input data and the RNN method for attribute summarization. Different feature combinations of title, description and image values are tested. The model is trained for 20 epochs on a GeForce RTX 2080 Ti GPU. When using only the product images for matching we observe that the system needs longer to converge, hence we use 40 epochs of training in that case. Each feature combination is trained and evaluated three times and the presented scores are the means of these runs.
   The experiments in [13] show that DeepMatcher outperforms classical methods such as support vector machines, logistic regression and random forests, which are also evaluated there. Hence we do not compare directly against these methods.
   The matching quality is measured as precision, recall and f-score, as presented in Table 2. Using only the images achieves a recall of over 90% and an f-score of 73%, which shows that the images provide valuable information for matching. Using the textual attributes alone is more effective, and using the title attribute alone performs better than using both title and description. This is influenced by the fact that the description is noisy and missing in many cases, so it is of limited usefulness. Using the images in addition to the textual attributes improves recall by up to about 2% and also improves the f-score. The best overall f-score of 85.6% is achieved by combining title and image similarity, although the use of images allowed for only a small improvement here (0.4%). For the combination of title and description, the additional use of images enabled a bigger f-score improvement of 2.1%.
   While these improvements appear modest, one has to bear in mind that the underlying product dataset has been created for attribute-based ER and already achieves high match quality from attribute values alone. Furthermore, as mentioned in Section 3, the images were not present at the time of labelling, hence the ground truth is still not fully consistent with regard to the images. This is also illustrated in Figure 1, where the images clearly show different shoes while the mismatch is harder to infer from the title attributes.

6.        FUTURE WORK
   Concerning the image crawling, our goal is to increase the coverage of images and to manually review and relabel the existing datasets in order to create and publish a dataset that can be used for further experiments and evaluation. The resulting dataset is likely of high value not only for research on data cleaning, product matching and retrieval but also for other applications such as product classification and multi-modal tasks, e.g. the generation of product descriptions from images.
   The matching system can be improved by different pre-processing steps and matching architectures as well as the use of multiple images per product. Our current system (like most ER systems) compares entities attribute-wise and aggregates the attribute similarities for a match decision. This follows from the reasoning in structured ER that every attribute describes a single feature of the record that can only be compared with the corresponding field of another object (e.g. if a company called "blue" offers a yellow product, it is not similar to a blue product of a different company). But this assumption does not hold for the different features in our case: a product description text and a product image can be interpreted as different encodings of the same information. The equal treatment of text and image features as embeddings in our system allows comparisons across the modalities, which might be a good method to overcome the sparsity of images and descriptions and lead to better matches.

7.        CONCLUSION
   We show that the use of images is a promising strategy to improve product matching results in the fashion domain. Although the improvements in f-score are only minor, the image data can be used to obtain a good recall of matching pairs. Further experiments and data preparation have to be conducted to validate this hypothesis and to ensure the quality of the matching and its stability under real-world conditions.
A new multi-modal product matching dataset that is created for this experiment can be of use for researchers in many applications.

8.   ACKNOWLEDGMENTS
  This project was funded by the Sächsische Aufbaubank (FKZ 100378106) and the German BMBF within the project ScaDS.AI Dresden/Leipzig (BMBF 01IS18026B). Computations for this work were done (in part) using resources of the Leipzig University Computing Centre.

9.   REFERENCES
 [1] H. Amoualian, P. Goswami, L. Ach, P. Das, and
     P. Montalvo. SIGIR 2020 E-Commerce Workshop
     Data Challenge. 2020.
 [2] N. Barlaug and J. A. Gulla. Neural Networks for
     Entity Matching. 2020.
 [3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov.
     Enriching word vectors with subword information.
     CoRR, 2016.
 [4] P. Christen. Data matching: Concepts and techniques
     for record linkage, entity resolution, and duplicate
     detection. Springer Berlin Heidelberg, 2012.
 [5] Y. Ge, R. Zhang, L. Wu, X. Wang, X. Tang, and
     P. Luo. DeepFashion2: A Versatile Benchmark for
     Detection, Pose Estimation, Segmentation and
     Re-Identification of Clothing Images. Proceedings of
     the IEEE Computer Society Conference on Computer
     Vision and Pattern Recognition, 2019.
 [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual
     learning for image recognition. In Proceedings of the
     IEEE conference on computer vision and pattern
     recognition, 2016.
 [7] H. Köpcke and E. Rahm. Training selection for tuning
     entity matching. Proceedings of the Sixth International
     Workshop on Quality in Databases and Management
     of Uncertain Data (QDB/MUD), 2008.
 [8] H. Köpcke, A. Thor, and E. Rahm. Evaluation of
     entity resolution approaches on real-world match
     problems. Proceedings of the VLDB Endowment, 2010.
 [9] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning.
     Nature, 521, 05 2015.
[10] A. Mogadala, M. Kalimuthu, and D. Klakow. Trends
     in integration of vision and language research: A
     survey of tasks, datasets, and methods. arXiv, 2019.
[11] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park,
     G. Krishnan, R. Deep, E. Arcaute, and
     V. Raghavendra. Deep learning for entity matching: A
     design space exploration. In Proceedings of the ACM
     SIGMOD International Conference on Management of
     Data, 2018.
[12] G. Papadakis, E. Ioannou, and T. Palpanas. Entity
     Resolution: Past, Present and Yet-to-Come From
     Structured to Heterogeneous, to Crowd-sourced, to
     Deep Learned. In Proceedings of the 23rd International
     Conference on Extending Database Technology
     (EDBT), 2020.
[13] A. Primpeli, R. Peeters, and C. Bizer. The WDC
     training dataset and gold standard for large-scale
     product matching. In The Web Conference 2019 -
     Companion of the World Wide Web Conference,
     WWW 2019, 2019.
[14] P. Ristoski, P. Petrovski, P. Mika, and H. Paulheim. A
     machine learning approach for product matching and
     categorization. Semantic Web, 9, 2018.