Unsupervised Construction of a Product Knowledge Graph

Omar Alonso, Vasileios Kandylas, Rukmini Iyer
Microsoft
omalonso,vakandyl,rukmini@microsoft.com

ABSTRACT

We describe techniques for generating a commercial knowledge graph with the goal of discovering relationships between brands, products, and categories. Using a diverse set of input data sources, we implement an unsupervised, efficient, and scalable data pipeline that produces a brand-product graph. We also outline a number of challenges encountered when processing this type of data in the context of commercial products.

ACM Reference format:
Omar Alonso, Vasileios Kandylas, Rukmini Iyer. 2019. Unsupervised Construction of a Product Knowledge Graph. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 4 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION

We are interested in constructing a commercial knowledge graph data asset that revolves around brands, products, and categories. Our graph allows users to query for a brand, say Microsoft, and retrieve associated products (e.g., Surface 3, Windows 10, Xbox, etc.), or to query for a product, say jeans, or a category, say clothing, and retrieve the associated brands (e.g., Calvin Klein, Hudson, Armani, etc.).

There are a number of challenges in the construction of a product graph at scale. Compared to previous work on building knowledge graphs that uses Wikipedia as a source, there are no major sources that contain clean information about brands and products. Commerce is a very dynamic domain: new brands and products appear, and old ones are retired. Products are hard to define and to detect: articles of clothing tend to have descriptive rather than distinctive names (e.g., men's black cotton t-shirt, large), cheap and mass-produced products often lack an explicit brand (e.g., 1 inch iron nails), and service-oriented products are not well defined (e.g., term life insurance). Popular brands have distinct names, but homonymous brands from different fields can be problematic (e.g., Delta, Apple). The distinction between brand and product is sometimes blurred (e.g., is Amazon Video a product or a brand?) and, while product catalogs contain textual descriptions that can be used for extracting structured data, retailers do not always provide clean data.

Several machine-readable knowledge bases (KBs) have been built in the last two decades, incorporating the latest advances in information extraction, harvesting of Web resources at scale, and machine learning. One of the few published reports on the end-to-end implementation and maintenance of KBs in industrial settings is the work by Deshpande et al. [1], which describes Kosmix (later acquired by Walmart). Very recently, Amazon has been working on Product Graph, an authoritative knowledge graph for all products in the world, which includes a knowledge extraction framework for semi-structured product websites based on mining DOM trees [2] and OpenTag, an active learning approach for extracting attributes and values from product titles and descriptions [5]. Finally, existing popular KBs have knowledge gaps, and detection of long-tail entities is one of the main challenges with respect to coverage [4].

2 PRODUCT GRAPH

We define terminology as follows. A brand is a term or phrase that distinguishes an organization or product (e.g., Adidas, Calvin Klein, Microsoft). A domain is the most common URL associated with the brand (e.g., Microsoft's domain is www.microsoft.com). A product is an item that is manufactured for sale (e.g., Microsoft Surface 3, Gucci Guilty, Google Pixel) or a service provided (e.g., insurance, pet cleaning, food delivery). An alias is a name that an item is otherwise called or known as; in the case of Apple, a set of aliases is Apple Inc., Apple Computer Inc., and AAPL. Categories group items under a given label or name (e.g., BMW → vehicles).

Our proposed unsupervised approach is as follows. We start by generating lists of brands from multiple sources using aggregation functions. We tag brands with domains and categories using a combination of query log mining and modeling. Finally, we derive products using multiple modeling approaches on advertiser-provided data (bidded keywords and product offers from their product catalogs). Since advertisers also provide us brand information, we can associate products directly back to brands in the graph. We follow a bottom-up strategy with a focus on simplicity and data cleaning that allows fast iteration and debugging. The proposed graph construction methodology is designed so that new sources of data can be added easily, without affecting the existing process significantly and without requiring extensive redesign of the graph.
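To make the terminology concrete, the following is a minimal sketch of the kind of record the pipeline described in this section produces. This is illustrative Python, not the production system; the field names are assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Brand:
    # Canonical brand name plus known aliases (e.g., "Gap", "The Gap Inc").
    name: str
    aliases: List[str] = field(default_factory=list)
    # One or more domains associated with the brand (Section 2.2).
    domains: List[str] = field(default_factory=list)
    # Aggregated presence score across input sources (Section 2.1).
    score: float = 0.0
    # Category name -> confidence, derived from query logs (Section 2.3).
    categories: Dict[str, float] = field(default_factory=dict)
    # Product names associated with the brand (Section 2.4).
    products: List[str] = field(default_factory=list)

graph: Dict[str, Brand] = {}
graph["apple"] = Brand(name="Apple", aliases=["Apple Inc."], domains=["apple.com"], score=1.0)
```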
2.1 Brands

For brand generation, we use as input n catalog sources, s_1, ..., s_n, that contain information about brands, and we produce a table where each source assigns a vote of 1 if the term is present in that source and a vote of 0 otherwise. A final score is computed from these votes, as presented in the example in Table 1. Because source sizes vary and different sources may use different name variations and spellings, similar brands are grouped using alias data, which updates the final names and votes (e.g., The Gap Inc and Gap, Apple and Apple Inc). Some sources may be more reputable than others in terms of data quality, so we perform the final selection based on source authoritativeness and score > t, where t is a threshold on the number of votes.
This simple mechanism allows us to incorporate and maintain input sources in a flexible way and, at the same time, to identify areas where brand coverage may be low (e.g., end-consumer electronics vs. pharmaceutical).

Term   Alias         s_1  s_2  ...  s_n  score
Apple  Apple Inc.    1    1    ...  1    1
Bose   Bose audio    1    1    ...  0    0.7
Gap    The Gap Inc.  0    1    ...  1    0.8

Table 1: Brand list generation example. Each source s votes on the presence of a term.
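As an illustration of this voting step, here is a small Python sketch (not the production Cosmos job): per-source presence votes are aggregated into a score and thresholded, with the alias grouping reduced to a simple lookup. The toy data and names are assumptions.

```python
from collections import defaultdict

# Each source is a set of brand names it contains (toy data, for illustration only).
sources = {
    "s1": {"apple", "bose"},
    "s2": {"apple", "bose", "the gap inc"},
    "s3": {"apple", "gap"},
}

# Alias map used to group name variations under a canonical brand.
aliases = {"the gap inc": "gap", "apple inc": "apple"}

def canonical(name: str) -> str:
    return aliases.get(name.lower(), name.lower())

votes = defaultdict(int)
for source_brands in sources.values():
    seen = {canonical(b) for b in source_brands}   # at most one vote per source
    for brand in seen:
        votes[brand] += 1

n_sources = len(sources)
scores = {brand: v / n_sources for brand, v in votes.items()}

# Keep brands whose score exceeds the threshold t (tuned per source quality).
t = 0.6
brand_list = {b: s for b, s in scores.items() if s > t}
print(brand_list)   # apple: 1.0, bose and gap: ~0.67
```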
2.2 Brand Domains

A brand domain is the URL associated with a brand. We would like to extract this information automatically for all possible brands in our graph, as domains are important for categorization and also for linking product information. Sometimes brands have more than one URL, as in the case of Microsoft (e.g., microsoft.com, bing.com, visualstudio.com). For head brands, domain information can be obtained from Wikipedia's infoboxes, but for tail brands such information is not available. To make this scale, we implement an algorithm that uses Bing search query logs to derive domain information by extracting queries and their associated top-10 search result pages (titles and links), restricted to queries that are normal (not bots) and contain no adult content. By computing the most frequent URLs at positions 1 through 10 over 3 months of historical data, we extract the most frequent URL at position 1 as the domain owner for a brand query.

The final brand list contains aliases, domains (one or more per brand), and score, as presented in Table 2. This data is then used as input for the categorization step that we describe in the next subsection.

Term   Alias         Domain               score
Apple  Apple Inc.    apple.com            1
Bose   Bose audio    bose.com             0.7
Gap    The Gap Inc.  gap.com, gapinc.com  0.8

Table 2: Brand list generation example with domains.
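A minimal sketch of the position-1 aggregation, assuming filtered query-log rows of the form (brand query, result URL, result position); the row format and variable names are assumptions, not the production implementation.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# (brand query, result URL, result position) tuples from 3 months of logs,
# after removing bot traffic and adult content (toy examples).
log_rows = [
    ("microsoft", "https://www.microsoft.com/en-us", 1),
    ("microsoft", "https://en.wikipedia.org/wiki/Microsoft", 2),
    ("microsoft", "https://www.microsoft.com/en-us", 1),
    ("bose", "https://www.bose.com/home", 1),
]

position1 = defaultdict(Counter)
for query, url, position in log_rows:
    if position == 1:
        position1[query][urlparse(url).netloc] += 1

# The most frequent position-1 domain is taken as the domain owner of the brand query.
brand_domains = {q: counts.most_common(1)[0][0] for q, counts in position1.items()}
print(brand_domains)  # {'microsoft': 'www.microsoft.com', 'bose': 'www.bose.com'}
```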
2.3 Categorization

We now describe a method where brand categories are computed via the queries that are related to the brands through the brand URLs. Brands with no discovered URLs are not categorized. The categorization proceeds in the following sequential steps: brands → brand URLs → search queries to URLs → query categories → aggregation of categories → categories.

The computation relies on a factorization of probabilities over a tri-partite {B, U, Q} graph to generate the probability of a category given a brand:

    P(C|B) = Σ_U [ Σ_Q P(C|Q) · P(Q|U) ] · P(U|B)

All URLs per brand are treated as equal, so P(U|B) is uniform.

We use Bing search query logs to get query-URL click information and perform the following filtering: navigational queries, queries issued fewer than 2 times in 6 months, and URLs with fewer than 5 clicks in 6 months are removed. These thresholds can be optimized, but the goal is to remove rare queries and rare URLs. The clicked URLs are matched to the brand URLs according to whether they fall under them or not: for the brand URL www.microsoft.com/en-us, the URL www.microsoft.com/en-us/surface falls under it, but www.microsoft.com/surface does not. The probability of a query given a URL is based on the clicks on the URL for that query:

    P(Q|U) = clicks(Q, U) / clicks(U)

Each query is categorized using an internal categorizer that returns categories from an internal commercial ad taxonomy. The categorizer can return more than one category per query, and category scores lie between 0 and 1. It operates on a hierarchical taxonomy of categories (e.g., Apparel > Footwear > Athletic Shoes). The categorizer uses the web result snippet in addition to the query, which improves its accuracy. The categorizer scores are used as the probabilities of category given query: P(C|Q) = score_C(Q).

At this stage, each brand is associated with multiple categories through multiple queries and brand URLs. The same category appearing many times with different scores is aggregated to produce a final set of unique categories with scores for the brand. The aggregation reduces the impact of mistakes in the query categorization. The final score per category for a brand is essentially the weighted average of the query category scores.
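The factorization can be computed with a few grouped aggregations. The sketch below is illustrative Python with made-up table layouts (the actual pipeline runs as distributed jobs); it walks the brand → URL → query → category chain with uniform P(U|B).

```python
from collections import defaultdict

# Toy inputs; in the real pipeline these come from the brand list, the click logs,
# and the query categorizer.
brand_urls = {"microsoft": ["www.microsoft.com/en-us"]}
clicks = {("www.microsoft.com/en-us", "microsoft surface"): 80,
          ("www.microsoft.com/en-us", "microsoft word"): 20}
query_categories = {"microsoft surface": {"Computing": 0.9},
                    "microsoft word": {"Software": 0.8, "Computing": 0.3}}

def p_c_given_b(brand: str) -> dict:
    urls = brand_urls.get(brand, [])
    if not urls:
        return {}                      # brands without URLs are not categorized
    p_u = 1.0 / len(urls)              # P(U|B) is uniform over the brand URLs
    scores = defaultdict(float)
    for url in urls:
        url_clicks = sum(c for (u, _), c in clicks.items() if u == url)
        if url_clicks == 0:
            continue
        for (u, query), c in clicks.items():
            if u != url:
                continue
            p_q = c / url_clicks       # P(Q|U) = clicks(Q,U) / clicks(U)
            for category, p_c in query_categories.get(query, {}).items():
                scores[category] += p_c * p_q * p_u   # P(C|Q)·P(Q|U)·P(U|B)
    return dict(scores)

print(p_c_given_b("microsoft"))
```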
The accuracy of the final score depends on the number of queries and clicks available for the brand, so we reduce the score based on the uncertainty:

    Score(C|B) = max( 0, P(C|B) − α_1 / √clicks(B) − α_2 / √queries(B) )

where α_1 and α_2 are manually tuned parameters. The score is the probability reduced by the second and third terms, which are corrections based on the uncertainty of the data. Given n data points, the mean estimate has a standard error proportional to 1/√n, so the fewer clicks or queries are available, the less reliable the score is and the more it is discounted. As the number of clicks and queries for the brand increases, the corrections vanish.
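A minimal sketch of this discount, assuming clicks(B) and queries(B) have already been aggregated per brand; the α values shown are placeholders, since the tuned parameters are not disclosed.

```python
import math

def discounted_score(p_c_given_b: float, brand_clicks: int, brand_queries: int,
                     alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """Reduce P(C|B) by uncertainty terms that shrink as evidence grows."""
    if brand_clicks <= 0 or brand_queries <= 0:
        return 0.0
    correction = alpha1 / math.sqrt(brand_clicks) + alpha2 / math.sqrt(brand_queries)
    return max(0.0, p_c_given_b - correction)

# A brand with little evidence is discounted heavily; a popular brand barely at all.
print(discounted_score(0.6, brand_clicks=25, brand_queries=16))        # 0.6 - 0.2 - 0.25 = 0.15
print(discounted_score(0.6, brand_clicks=10_000, brand_queries=2_500)) # ~0.57
```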
For each category and brand, we maintain a small set of categorized queries as provenance. These explain why a brand belongs to a category. For example, Apple is categorized, among other things, as Apparel because of the query "apple watch", which belongs to the category Apparel → Jewelry → Watches & Watch Accessories.

A useful side effect of the previous computations is the popularity of the brand, which is based on the clicks to the brand URLs. The overall brand popularity is

    P(B) = clicks(B) / Σ_b clicks(b)
The brand popularity per category is computed as

    P(B|C) = P(C|B) · P(B) / P(C) ∝ P(C|B) · P(B)
2.4 Products

Identifying products is a much more difficult problem because products can be referenced in a coarse or granular manner depending on the source of the data. In search queries we see simplified representations of products, for example "Aveeno lotion", but in product catalogs we observe very detailed descriptions such as "Aveeno Active Naturals Daily Moisturizing Lotion 2 x 18 oz".

Products also include services provided by brands (for example insurance, car cleaning, printing) and not only retail items. Another complication is that products contain generic terms and specifications (e.g., models, parts, sizes).

We have two very different approaches to deriving products, suited to the sources of data we derive them from. A sample of products for Microsoft and Apple derived with the two methods is presented in Table 3.
2.4.1 Products from Retailer Catalogs. Retailers upload product catalogs and the Bing search engine serves product ads for search queries. The product ads corpus typically contains brand, product title or name, GTIN (Global Trade Item Number), and MPN (Manufacturer Product Number) information, among other data. To generate products for brands, we first perform a simple grouping of products within each brand. Since many ads are missing GTIN and MPN, and MPNs tend to be noisier than GTINs, we first group together all products with the same name and GTIN, and then group all products with the same name regardless of GTIN and MPN. As a next step, we use a CDSSM [3] trained on the product ads corpus to embed the product names in a lower-dimensional vector space. The lower-dimensional space facilitates the next round of product clustering, which uses k-means with heuristics to split clusters of large size or high variance. This process produces a hierarchical organization of products, where each higher level is more general than the previous one. Each cluster represents a group of semantically related product titles and contains many similar product names.
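A simplified sketch of this clustering stage, with embed() standing in for the CDSSM encoder (which cannot be reproduced here); the k, size, and split heuristics shown are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(titles):
    """Placeholder for the CDSSM encoder: maps product titles to dense vectors."""
    rng = np.random.default_rng(len(titles))
    return rng.normal(size=(len(titles), 64))      # toy vectors, illustration only

def cluster_products(titles, k=8, max_size=500):
    """One round of k-means over CDSSM-style embeddings of product titles."""
    vectors = embed(titles)
    k = min(k, len(titles))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    clusters = [[t for t, l in zip(titles, labels) if l == c] for c in range(k)]
    # Clusters that are too large (or too spread out) would be re-clustered;
    # repeating the procedure over cluster representatives yields the hierarchy.
    oversized = [c for c in clusters if len(c) > max_size]
    return clusters, oversized
```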
We generate a set of M representative labels for each cluster using a simple generative algorithm. We compute the average length l of the product names within a cluster, generate the top N n-grams of length l, embed the n-grams using the same CDSSM encoding used in the original clustering, and choose the best M n-grams based on the distance of their CDSSM vectors from the cluster centroid. We repeat this process for longer and shorter n-grams, iterating until no additional labels are inserted into, or replace entries in, the top M labels.
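The label-selection step can be sketched as follows, again with an embed() function passed in for the CDSSM encoder (e.g., the placeholder from the previous sketch) and n-gram generation reduced to simple token windows; parameter names and defaults are assumptions.

```python
import numpy as np
from collections import Counter
from itertools import chain

def ngrams(title, n):
    tokens = title.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cluster_labels(titles, embed, M=3, N=50):
    """Pick M n-grams whose embeddings are closest to the cluster centroid."""
    centroid = embed(titles).mean(axis=0)
    avg_len = max(1, int(round(np.mean([len(t.split()) for t in titles]))))
    distances = {}
    for n in (avg_len, avg_len - 1, avg_len + 1):     # iterate over nearby lengths
        if n < 1:
            continue
        counts = Counter(chain.from_iterable(ngrams(t, n) for t in titles))
        candidates = [g for g, _ in counts.most_common(N)]
        if not candidates:
            continue
        for g, v in zip(candidates, embed(candidates)):
            distances[g] = np.linalg.norm(v - centroid)   # distance to centroid
    return sorted(distances, key=distances.get)[:M]
```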
2.4.2 Products from Bidded Keywords. For this approach we use a data set of keywords on which advertisers bid to display their ads. Besides the bidded keywords, the data set also contains the URL of the ad and the frequency with which the bidded keyword was triggered. First, the bidded keywords are matched to the brands by ensuring that the ad URL matches the known brand domain and that the bidded keywords contain the brand name. From the set of selected keywords, we generate n-grams (n <= 4) after removing stop words and locations. For every brand, each n-gram associated with the brand is scored twice: once using the overall n-gram frequencies from the whole data set and once using the brand-specific n-gram frequencies. The score itself is a traditional language model score, computed under a Markov assumption:

    P(w_1 w_2 w_3 w_4 ...) = P(w_1) P(w_2|w_1) P(w_3|w_2, w_1) P(w_4|w_3, w_2) ...

where the individual probabilities are smoothed using linear interpolation with lower-order n-grams. The n-grams are then ranked using KL divergence:

    score(phrase) = P(phrase|brand) · log( P(phrase|brand) / P(phrase) )

The phrases above a certain threshold are used as products associated with the brand. This method tends to produce more general products, such as "LED TV" or "men's t-shirt", so it complements well the previous method, which tends to produce more specific products, such as "40W flood light bulb, medium base, dimmable".
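A small sketch of this ranking step, assuming n-gram counts have already been collected per brand and over the whole data set; smoothing is reduced to add-one for brevity rather than the interpolated language model described above, and the counts are toy values.

```python
import math
from collections import Counter

def kl_rank(brand_counts: Counter, global_counts: Counter, top: int = 10):
    """Rank phrases by P(phrase|brand) * log(P(phrase|brand) / P(phrase))."""
    brand_total = sum(brand_counts.values())
    global_total = sum(global_counts.values())
    vocab = len(global_counts)
    scores = {}
    for phrase, c in brand_counts.items():
        p_brand = (c + 1) / (brand_total + vocab)                     # add-one smoothing
        p_global = (global_counts[phrase] + 1) / (global_total + vocab)
        scores[phrase] = p_brand * math.log(p_brand / p_global)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

brand_counts = Counter({"surface pro": 40, "office 365": 30, "laptop": 5})
global_counts = Counter({"surface pro": 50, "office 365": 45, "laptop": 10_000})
print(kl_rank(brand_counts, global_counts, top=2))   # brand-distinctive phrases rank first
```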
 Brand     Bidded keyword products Retailer catalog-based products
 microsoft microsoft word              Microsoft Office Home & Student 2016 For Mac
 microsoft microsoft office            Microsoft Office 365 Personal Subscription 1 Year 1 User Pc/Mac
 microsoft microsoft office 365        Microsoft Office Professional Plus 2016 365 Pro Word Excel Key Windows
 microsoft microsoft outlook           Microsoft Office Home and Business 2016 Retail License
 microsoft microsoft surface           Microsoft 13.5" Surface Book 2-in-1 Notebook 256GB SSD 8GB RAM Core i5 2.4 GHz
 microsoft microsoft teams             12.3 Surface Pro 6 - 128GB / Intel Core m3 / 4GB RAM (Silver)
 microsoft microsoft project           Surface Pro 6 - 512GB / Intel Core i7 / 16GB RAM (Platinum)
 microsoft microsoft excel             Surface Pro 4 12.3" Bundle: Core i5, 4GB RAM, 128GB SSD, Surface Pen, Type Cover
 microsoft microsoft office 2016       Microsoft Project Professional 2016 - Digital Download
 microsoft microsoft powerpoint        Microsoft Project 2019 Professional w/ 1 Server CAL Open License
 apple     apple watch                 Apple Watch Series 3 44MM Rose Gold
 apple     apple airpods               Apple Watch Series 4 44 mm Gold Stainless Steel Case with Gold Milanese Loop
 apple     apple watch series 3        24K Gold Plated 42MM Apple Watch Series 3 Diamond Polished Modern Gold Link Band
 apple     apple watch series 4        Apple Airpods MMEF2J/A Wireless Earphone For Iphone/Apple Watch/Ipad/Mac
 apple     apple iphone                Apple Airpods Genuine Left-Only Airpod (Without Charging Case)
 apple     apple watch 4               Apple Airpods - White MMEF2AM/A Genuine Airpod Retail Box
 apple     apple ipad                  Apple iPhone 8 Plus - 64GB - Space Gray (Unlocked) A1864 (CDMA + GSM)
 apple     apple iphone xs max         Apple iPhone XR 256GB, Yellow - Sprint
 apple     apple tv                    Apple iPad Wi-Fi 128GB Silver
 apple     apple earpods               Apple iPad 6th Gen. 32GB, Wi-Fi + Cellular (Unlocked), 9.7in - Silver #15718
Table 3: Example products from Apple and Microsoft returned by the two methods. The bidded keyword-based products are
more high-level, whereas the products from retailer catalogs have finer granularity.



3 RESULTS AND EVALUATION

The complete back-end data pipeline and algorithms have been implemented in Cosmos, Microsoft's computation service, and run on a distributed cluster continuously. Our data pipeline identifies around 130K brands, 1M products, and 22 top-level categories for brands. The average number of top-level categories per brand is 7.6. We can rebuild the graph from scratch within a few hours, which allows us to deploy new changes very efficiently. The graph is timestamped, so it is possible to observe brand and product evolution over time.

Due to confidentiality, however, we do not disclose user engagement or other behavioral data. Instead, we present an offline relevance evaluation for different components of the graph. We conducted an offline evaluation of 3,000 brands selected at random, using an internal tool for collecting labels from in-house editors, and report precision numbers in Table 4 grouped by brand score. The task consisted of showing a brand to a judge, who had to answer whether it is indeed a brand, not a brand, or whether they cannot tell.

Evaluation for categories was also conducted using the same internal tool, where we provide editors with a brand and the list of categories we have associated with it. The task is to mark the categories that are not correct. We also ask editors to check a box if they think a category is missing from those provided; optionally, they can write in the missing category. Marking the incorrect categories allows us to estimate precision, while the checkbox is used to estimate recall. We evaluated a sample of 2,000 brands, stratified according to score, gathering 3 responses per brand and treating the majority answer as correct. Figure 1 shows the ROC curve of our brand categorizer for selected brand score thresholds.
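As an illustration of how the editor judgments translate into the reported numbers, a hedged sketch of the majority-vote precision computation; the label values and toy data are assumptions, not the internal tool's format.

```python
from collections import Counter

# Three editor judgments per brand: "brand", "not_brand", or "cannot_tell".
judgments = {
    "apple":  ["brand", "brand", "brand"],
    "acme":   ["brand", "not_brand", "brand"],
    "foobar": ["not_brand", "not_brand", "cannot_tell"],
}

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

decided = {b: majority(ls) for b, ls in judgments.items()}
judged = [v for v in decided.values() if v != "cannot_tell"]
precision = judged.count("brand") / len(judged)
print(f"precision = {precision:.1%}")   # fraction of proposed brands confirmed as brands
```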
Brand score  Precision
1            92.2%
0.83         87.8%
0.66         87.2%
0.5          85.6%

Table 4: Brand evaluation.

[Figure 1: Root categories ROC. ROC curves (Recall vs. FPR) of the brand categorizer according to P(C|B), plotted for all brands and for brands with score ≥ 0.6, ≥ 0.8, and ≥ 1.]
4 CONCLUSION AND FUTURE WORK

In this paper we presented an efficient and scalable brand-product graph generation data pipeline that is currently used in an industrial setting. We described a number of unsupervised techniques for deriving brands, brand categories, and products from different input data sources. Preliminary evaluation results show that our implementation produces good quality output. We described the methods in detail, so it should be possible to reproduce the results using similar input sources.

Our philosophy is to build the graph using a rapid and iterative development process where, at each stage, we can test, debug, and explain changes as we ingest new data sources and the size of the graph grows. While the graph is far from complete, the data is already in use by internal applications and as a training set for classifiers and NERs. Thus, we agree with the Kosmix case study [1] that an imperfect KB is still useful for real-world applications, which in turn help test and identify problems with the newly derived data.

Future work includes product evaluation, product generation from query logs, improvements to brand conflation, which should improve the quality of the brand data, and functionality for detecting similar brands and products.

5 ACKNOWLEDGMENTS

We thank Karim Filali and Jaya Palli for help and assistance.

REFERENCES
[1] Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, and AnHai Doan. 2013. Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches. In Proc. of SIGMOD. 1209–1220.
[2] Colin Lockard, Xin Luna Dong, Prashant Shiralkar, and Arash Einolghozati. 2018. CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web. PVLDB 11, 10 (2018), 1084–1096.
[3] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proc. of WWW. 373–374.
[4] Gerhard Weikum, Johannes Hoffart, and Fabian M. Suchanek. 2016. Ten Years of Knowledge Harvesting: Lessons and Challenges. IEEE Data Eng. Bull. 39, 3 (2016), 41–50.
[5] Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, and Feifei Li. 2018. OpenTag: Open Attribute Value Extraction from Product Profiles. In Proc. of KDD. 1049–1058.