=Paper=
{{Paper
|id=Vol-3075/paper10
|storemode=property
|title=Towards Multi-Modal Entity Resolution for Product Matching
|pdfUrl=https://ceur-ws.org/Vol-3075/paper10.pdf
|volume=Vol-3075
|authors=Moritz Wilke,Erhard Rahm
|dblpUrl=https://dblp.org/rec/conf/gvd/WilkeR21
}}
==Towards Multi-Modal Entity Resolution for Product Matching==
Towards Multi-Modal Entity Resolution for Product Matching

Moritz Wilke (Leipzig University, wilke@informatik.uni-leipzig.de)
Erhard Rahm (Leipzig University, rahm@informatik.uni-leipzig.de)

ABSTRACT

Entity resolution has been applied successfully to match product offers from different web shops. Unfortunately, in certain domains the (textual or numerical) attributes of a product are not sufficient for a reliable match decision. To overcome this problem we extend an attribute-based matching system to incorporate image data, which are available in almost every web shop. To evaluate the system we enhance the WDC product matching dataset with images crawled from the web. First evaluations show that the use of images is beneficial to increase recall and overall match quality.

Keywords: Record Linkage, Product Matching, Deep Learning

[Figure 1: Example of two (non-matching) products. Two shoe offers with their images and the attributes title, description, brand and price: "nike air max 2016 806771 001 para hombre negro" (description missing) vs. "nike sportswear air max 90 ultra essential black mens shoes dark footstop grey white 819474 013" with the description "featuring no sew overlays the air max 90 ultra delivers a supportive and lightweight feel its visible air sole unit helps absorb impact […]".]

1. INTRODUCTION

Entity resolution (ER), also known as record linkage, is the procedure of identifying which items from one or more data sources refer to the same real-world entities. It is an important step in data integration, where the goal is to unify data from different origins to increase the quality and size of the data available for further analysis. ER is mostly based on structured or semi-structured entity descriptions consisting of several attributes (such as 'Name', 'Date of Birth', 'Zip Code'). For web data, attributes are often missing or noisy and may contain longer textual descriptions.

One application of ER is the matching of product offers across the web. It can be used to compare prices and inventories of web shops or to present the best offer for a desired product to a user. This task can often be tackled with attribute-based ER, e.g. utilizing attribute values extracted from the product web pages. A smartphone, for example, can be distinguished by attributes such as 'brand', 'model', 'storage size', 'display size', 'weight', etc. But it is still hard to match items whose (textual) information is sparse or varies greatly, e.g. in the fashion domain. In contrast, the selling of fashion items is largely driven by their visual features, and hence every shop has images available.

Recent developments in computer vision, largely driven by deep learning, suggest that a reliable distinction of product images is achievable given enough data. The two similar but different shoes presented in Figure 1 highlight some problems and opportunities of visual product matching. The images readily show that the two shoes are different, while the attribute information makes it difficult to come to a clear decision. This is because the missing attribute values lead to larger differences in the descriptions. Moreover, the description of the second shoe is written in an advertising style intended to convince the customer, which decreases its informational value. In this case the title is the only usable attribute, yet it does not contain much information. This leads to our main research question: how can we utilize the additional information provided by product images to improve matching quality?

An obstacle for research in ER in general, and in product matching in particular, has been the lack of large public datasets that contain ground-truth information about matching entities. There are several datasets that contain product images in combination with descriptions, and there are datasets that contain matching pairs of either (product) images or descriptions.
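To make the attribute-based baseline concrete, comparing two sparse fashion offers can be sketched with a simple token-overlap similarity. This is a minimal illustration only (the records and the Jaccard measure are chosen for the example), not the learned matching system described later in this paper:

```python
# Minimal sketch of attribute-based matching on two hypothetical
# shoe offers: token-level Jaccard similarity over the title attribute.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

offer_a = {"title": "nike air max 2016 806771 001 para hombre negro"}
offer_b = {"title": "nike sportswear air max 90 ultra essential black mens shoes"}

score = jaccard(offer_a["title"], offer_b["title"])
print(f"title similarity: {score:.2f}")
```

For the two non-matching shoes of Figure 1, the shared tokens "nike", "air" and "max" yield a middling score of about 0.19, which neither clearly confirms nor rules out a match; this is the kind of ambiguity that motivates bringing in the images as an additional signal.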
However, there is currently no public dataset that contains product images and descriptions as well as the true set of matching items.

The main contributions of this work are the creation of a suitable multi-modal ER benchmark dataset (Section 3), an extension of the existing DeepMatcher [11] ER framework to also use image data for matching (Section 4), and a first evaluation of the system on a subset of the dataset (Section 5). As this work is still in progress, we also discuss current limitations and plans for future work (Section 6).

32nd GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), September 01-03, 2021, Munich, Germany. Copyright (c) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. RELATED WORK

This work builds on previous work on ER approaches, existing ER datasets, and neural networks for text and image analysis.

2.1 Entity Resolution

There is a long history of research in the field of entity resolution, and a comprehensive introduction can be found in [4]. A lot of work has been put into adequate similarity metrics for attribute values and into rule- or tree-based combinations of those similarities for match classification. Furthermore, blocking techniques have been devised to pre-filter match candidates in order to reduce the quadratic complexity of comparing all items with each other.

In recent years ER has been scaled to large datasets using the map-reduce paradigm, and there has been a lot of research on utilizing machine learning either for parts of an ER pipeline (e.g. blocking, similarity computation, match decisions) or for the configuration of the whole process. Due to the increasing amount of data available from the internet, ER approaches that work on heterogeneous, noisy, and unstructured or semi-structured data have gained importance [12]. An overview of the (increasing) usage of neural networks and deep learning for entity resolution can be found in [2]. A recent ER system that uses deep learning is DeepMatcher [11]. It can use word embeddings to encode the information from different attributes and provides different sequence models to align and compare those encodings.

Applying ER to e-commerce data has been the focus of Köpcke et al. [8]. An initial approach that uses image data for product matching can be found in [14]. The authors use attribute matching and enrich it with image embeddings generated by a convolutional neural network. The main difference to our approach is that the neural network is not directly trained on the matching task but solely functions as a feature extractor. The cosine similarity between two image embeddings is passed to a match classifier along with the similarities of the textual features. An evaluation is performed on three datasets from the electronics domain (laptops, televisions, phones), which consist of 200-300 products. F-scores of up to 73.35% (laptops), 83.27% (phones) and 84.96% (televisions) are achieved; using the images improves the best text matching results by a small amount (around 1%). The authors conclude that images cannot be used as a strong matching signal because some product variations that are important to distinguish (such as a phone with 16 GB of storage vs. one with 64 GB) use the exact same image. Such problems should have less impact in other product domains such as clothing. Current research in computer vision and image retrieval leads us to expect robust image matching, given enough clean data, at least for certain categories of products.

2.2 Datasets

The WDC dataset (WDC Product Data Corpus and Gold Standard for Large-scale Product Matching 2.0) has been published by Primpeli et al. [13]. They use the (partial) existence of product identifiers such as MPN and SKU to create weak clusters of matching products. To refine these noisy clusters and to obtain training data and ground-truth for evaluation, they apply a number of heuristics, machine learning algorithms and manual processing steps. The result is a dataset of 16 million products (supposedly described in English). The applied clustering by identifiers can be considered a silver standard. The creators add hand-crafted true matches and semi-automatically created, differently sized training sets in four product categories (watches, shoes, cameras, electronics). Although these datasets are suitable for ER evaluations, they do not contain product images. The aforementioned experiments by Ristoski et al. [14] were conducted on an earlier version of the WDC dataset, but unfortunately the corresponding images are no longer available.

DeepFashion2 [5] is a dataset for image retrieval in the fashion domain. It contains different images of the same product, some from shops and some from users. However, the products have no textual properties (such as a description, brand or price), so it is not multi-modal.

The dataset for the SIGIR eCom 2020 multi-modal product classification and retrieval challenge [1] contains product descriptions and images but does not contain ground truth for product matching.

To the best of our knowledge, there is currently no ER dataset with both product images and descriptions and information about matching products.

2.3 Multi-modal deep learning

Over the last years, convolutional neural networks (CNNs) have achieved many breakthroughs in the field of computer vision and have contributed much to the wide adoption of deep learning [9]. An adaptation of deep learning to fashion images called Match-R-CNN is presented in [5]. It supports the identification and classification of fashion items as well as image retrieval. For attribute-based ER, deep learning approaches internally work mostly with the concept of distributed representations (embeddings). The combination of text and image data in deep learning systems is called multi-modal deep learning. Typical applications include the creation of text descriptions for images or image retrieval from text queries. An overview of tasks, datasets and problems in this field can be found in [10].

3. BENCHMARK DATASET CREATION

To overcome the mentioned lack of a multi-modal ER benchmark dataset, we extend the WDC dataset with image data. The underlying Common Crawl (http://commoncrawl.org/) snapshot dates back to November 2017 and does not contain additional data apart from the HTML documents, hence it is not initially clear to what extent the URLs are still valid and whether the images are still available. To retrieve images, the following procedure is used: first, the documents are parsed for HTML tags that contain image URLs and a (semantic) annotation indicating that the image belongs to a product (this is needed to avoid collecting unrelated images from the website). In the second step, we query the URLs to retrieve up to five images per product. Finally, a procedure that queries the Internet Archive (https://archive.org/) for missing images is used; due to its slow speed, this method is only applied selectively to achieve higher image coverage for the experiments.

The result is a database of images for 10M (63%) of the products from the WDC corpus. The gathered image data is far from clean.

Table 1: Comparison of the original WDC shoe training and gold standard data and the cleaned evaluation subset for which all images have been verified manually.

| Dataset                         | products | positive pairs | negative pairs | % title | % description | % image |
|---------------------------------|----------|----------------|----------------|---------|---------------|---------|
| WDC shoes train (xlarge)        | 2450     | 4141           | 38288          | 100%    | 53%           | 79%     |
| WDC shoes gold standard         | 1111     | 300            | 800            | 100%    | 69%           | 81%     |
| Evaluation subset train         | 950      | 1350           | 6286           | 100%    | 52%           | 100%    |
| Evaluation subset gold standard | 813      | 206            | 586            | 100%    | 76%           | 100%    |
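The annotation-based filtering in the first crawling step can be sketched as follows. Since the WDC corpus is derived from pages with schema.org markup, product images are assumed here to be marked with itemprop="image"; the attribute name, the filter, and the sample page are illustrative assumptions, not the exact heuristics used for the dataset:

```python
# Hedged sketch of the first crawling step: scan a product page for
# <img> tags that carry a semantic annotation (here assumed to be the
# schema.org attribute itemprop="image"), skipping unrelated images
# such as shop logos.
from html.parser import HTMLParser

class ProductImageParser(HTMLParser):
    """Collects src URLs of <img> tags annotated as product images."""

    def __init__(self) -> None:
        super().__init__()
        self.image_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and a.get("itemprop") == "image" and "src" in a:
            self.image_urls.append(a["src"])

html_doc = """
<div itemscope itemtype="http://schema.org/Product">
  <img itemprop="image" src="https://example.com/shoe_front.jpg">
  <img src="https://example.com/shop_logo.png">
</div>
"""

parser = ProductImageParser()
parser.feed(html_doc)
print(parser.image_urls)  # only the annotated product image is kept
```

In the actual pipeline, up to five of the collected URLs per product would then be fetched, with the Internet Archive queried selectively as a fallback for missing images.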
Problems with the gathered images include images that do not depict the correct product but a (brand or shop) logo, a placeholder, or something entirely different, and images that show the product either isolated or in the context of a scene in combination with other objects. Also, some images are pixel-wise duplicates of each other.

Table 1 shows the crawling results for the shoe category of the original dataset. The initial coverage of raw images is 79% on the training set and 81% on the gold standard. Since we want to run our initial evaluation on data that has full image coverage and does not contain wrong or noisy image data, we manually verify the images and create evaluation subsets with the desired properties, which are consequently smaller. The inspection also reveals that pairs in the existing gold standards have been determined without consideration of the product images. Hence, there are hand-labelled match pairs with a similar textual description whose images show (minor) differences. A similar problem occurs in the training data: to create training pairs with optimal informational value, the authors of the original dataset adapt a method from [7]. It consists of precomputing similarity scores between known matches and non-matches, followed by choosing positive training pairs with low and negative training pairs with high similarity. Since we do not repeat this process, the images have no influence on the choice of the training pairs. The effect of this bias has not been examined yet.

4. MATCHING SYSTEM

Our current system for multi-modal product matching builds on DeepMatcher [11]. We add the capability of processing image data while keeping the overall structure and components. Figure 2 shows the match processing for a pair of records with the same attributes and an image to determine a binary match/non-match decision. Schema alignment and the creation of suitable candidate pairs are not part of the system.

[Figure 2: Workflow for multi-modal product matching. Attribute values pass through attribute-level embedding, attribute summarizer and attribute comparator components, images through an image embedding component; a classifier combines the similarity representations into the match result.]

We now describe the four main components of DeepMatcher and how they relate to the processing of images. The first component is attribute-level embedding, which converts each word (or n-gram) of an attribute value into a word vector by applying a pre-trained word model. The output of this component is a list of such embeddings with the same length as the number of input words. Because those lists have a different length for each record, they have to be aligned before the attributes of two products can be compared. This is done by the attribute summarizer, which can be any kind of sequence-to-vector module, e.g. a recurrent neural network. The objective of this module is to compress the information, filter out redundant and meaningless words, and represent the attribute as a fixed-length vector.

An image processing module is added to DeepMatcher with the aim of creating an image representation that can be processed exactly like the existing feature vectors. This is achieved in the following way: first, an optional pre-processing step detects the main shapes in the image and crops it accordingly. This is done to reduce white space and non-informational regions of the image. There are other possible ways to achieve this cropping, for example by using another neural network for object detection and/or masking the relevant regions of the image, but these methods add further complexity and need training data (bounding boxes or masks) for each product category. The second step is to feed the image to a pre-trained image classification network (our first experiments use ResNet50 [6]) and to append a fully connected layer that downsamples its representation of the image to the dimension of the other feature vectors. To reduce the number of learnable parameters, and hence training time, the first six layers of the backbone network are frozen and do not change during the training of the matching system. Since we currently use only a single image per product, no summarization is needed for the images.

After these steps, the feature vector resulting from an image or a text attribute has the same dimensions and can be treated equally. The attribute comparators take the feature vectors of the same attribute from both products as input and create a similarity representation. In DeepMatcher, this can be the absolute distance of both vectors, their concatenation, or any other method that returns a vector. All attribute similarities are finally fed into the classification component, which returns the final match decision. The standard classification method is a two-layer, fully connected neural network.

5. EVALUATION

To evaluate the system, a subset of the existing WDC shoe training set and gold standard is created by filtering all pairs that can be enriched with image data for both products. The size and attribute coverage of the evaluation datasets are provided in Table 1. This may not reflect real-world data, but it is easier to establish a working system on clean and complete data and later improve its robustness against real-world anomalies. Similarly, the shoe category was chosen as the initial category because we assume that products in a domain such as fashion, which is determined by visual factors, are more likely to be matched via their images than products from a category like electronics, where the visual aspect plays a minor role and the products are easier to describe with technical specifications. Selecting the pairs with images decreases the size of the training set from 42.4k to 7.6k pairs and the size of the gold standard (used as test set) from 1,111 to 792 pairs. A validation set is taken out of the training data.

We choose the DeepMatcher configuration that performed best in the baseline experiments for the WDC dataset conducted in [13], which uses a pre-trained fastText model [3] to create token-level embeddings from the input data and the RNN method for attribute summarization. Different feature combinations of title, description and image values are tested. The model is trained for 20 epochs on a GeForce RTX 2080 Ti GPU. When using only the product images for matching, we observe that the system needs longer to converge, hence we use 40 epochs of training in that case. Each feature combination is trained and evaluated three times, and the presented scores are the mean results of these runs. The experiments in [13] show that DeepMatcher outperforms methods such as support vector machines, logistic regression and random forests, which are also used in [13]; hence we do not compare directly against these methods.

The matching quality is measured as precision, recall and F-score, as presented in Table 2.

Table 2: Match quality on the (clean) shoe dataset.

| Features                | F1   | P    | R    |
|-------------------------|------|------|------|
| image                   | 73.1 | 61.2 | 90.8 |
| title                   | 85.2 | 79.3 | 91.9 |
| title+image             | 85.6 | 79.2 | 93.0 |
| title+description       | 81.5 | 73.2 | 91.9 |
| title+description+image | 83.6 | 75.4 | 93.8 |

Using only the images achieves a recall of over 90% and an F-score of 73%, which shows that the images provide valuable information for matching. Using the textual attributes alone is more effective, and using only the title performs better than using both title and description. This is influenced by the fact that the description is noisy and missing in many cases, which limits its usefulness. Using the images in addition to the textual attributes improves recall by up to about 2% and also improves the F-score. The best overall F-score of 85.6% is achieved by combining title and image similarity, although the use of images allows only a small improvement here (0.4%). For the combination of title and description, the additional use of images enables a bigger F-score improvement of 2.1%.

While these improvements appear modest, one has to bear in mind that the underlying product dataset was created for attribute-based ER and can already achieve high match quality using attribute values alone. Furthermore, as mentioned in Section 3, the images were not present at the time of labelling, hence the ground truth is still not fully consistent with regard to the images. This is also illustrated in Figure 1, where the images clearly show different shoes while the mismatch is harder to infer from the title attributes.

6. FUTURE WORK

Concerning the image crawling, our goal is to increase the coverage of images and to manually review and relabel the existing datasets, in order to create and publish a dataset that can be used for further experiments and evaluation. The resulting dataset is likely of high value not only for research on data cleaning, product matching and retrieval, but also for other applications such as product classification and multi-modal tasks, e.g. the generation of product descriptions from images.

The matching system can be improved by different pre-processing steps and matching architectures as well as by the use of multiple images per product. Our current system (like most ER systems) compares entities attribute-wise and aggregates the attribute similarities for a match decision. This follows from the reasoning in structured ER that every attribute describes a single feature of the record that can only be compared with the corresponding field of another object (e.g. if a company called "blue" offers a yellow product, it is not similar to a blue product of a different company). But this assumption does not hold for the different features in our case: a product description text and a product image can be interpreted as different encodings of the same information. The equal treatment of text and image features as embeddings in our system allows comparisons across modalities, which might be a good way to overcome sparsity of images and descriptions and lead to better matches.

7. CONCLUSION

We show that the use of images is a promising strategy to improve product matching results in the fashion domain. Although the improvements in F-score are only minor, the image data can be used to obtain a good recall of matching pairs. Further experiments and data preparation have to be conducted to validate this hypothesis and to ensure the quality of the matching and its stability under real-world conditions. The new multi-modal product matching dataset created for this experiment can be of use to researchers in many applications.

8. ACKNOWLEDGMENTS

This project was funded by the Sächsische Aufbaubank (FKZ 100378106) and the German BMBF within the project ScaDS.AI Dresden/Leipzig (BMBF 01IS18026B). Computations for this work were done (in part) using resources of the Leipzig University Computing Centre.

9. REFERENCES

[1] H. Amoualian, P. Goswami, L. Ach, P. Das, and P. Montalvo. SIGIR 2020 E-Commerce Workshop Data Challenge. 2020.
[2] N. Barlaug and J. A. Gulla. Neural networks for entity matching. 2020.
[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. CoRR, 2016.
[4] P. Christen. Data matching: Concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Berlin Heidelberg, 2012.
[5] Y. Ge, R. Zhang, L. Wu, X. Wang, X. Tang, and P. Luo. DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[7] H. Köpcke and E. Rahm. Training selection for tuning entity matching. In Proceedings of the Sixth International Workshop on Quality in Databases and Management of Uncertain Data (QDB/MUD), 2008.
[8] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 2010.
[9] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521, 2015.
[10] A. Mogadala, M. Kalimuthu, and D. Klakow. Trends in integration of vision and language research: A survey of tasks, datasets, and methods. arXiv, 2019.
[11] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2018.
[12] G. Papadakis, E. Ioannou, and T. Palpanas. Entity resolution: Past, present and yet-to-come. From structured to heterogeneous, to crowd-sourced, to deep learned. In Proceedings of the 23rd International Conference on Extending Database Technology (EDBT), 2020.
[13] A. Primpeli, R. Peeters, and C. Bizer. The WDC training dataset and gold standard for large-scale product matching. In Companion Proceedings of The Web Conference 2019 (WWW 2019), 2019.
[14] P. Ristoski, P. Petrovski, P. Mika, and H. Paulheim. A machine learning approach for product matching and categorization. Semantic Web, 9, 2018.