Unsupervised Construction of a Product Knowledge Graph

Omar Alonso, Vasileios Kandylas, Rukmini Iyer
Microsoft
omalonso,vakandyl,rukmini@microsoft.com

ABSTRACT

We describe techniques for generating a commercial knowledge graph with the goal of discovering relationships between brands, products, and categories. Using a diverse set of input data sources, we implement an unsupervised, efficient, and scalable data pipeline that produces a brand-product graph. We also outline a number of challenges encountered when processing this type of data in the context of commercial products.

ACM Reference format:
Omar Alonso, Vasileios Kandylas, Rukmini Iyer. 2019. Unsupervised Construction of a Product Knowledge Graph. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 4 pages.

Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

1 INTRODUCTION

We are interested in constructing a commercial knowledge graph data asset that revolves around brands, products, and categories. Our graph allows users to query for a brand, say Microsoft, and retrieve associated products (e.g., Surface 3, Windows 10, Xbox, etc.), or to query for a product, say jeans, or a category, say clothing, and retrieve the associated brands (e.g., Calvin Klein, Hudson, Armani, etc.).

There are a number of challenges in the construction of a product graph at scale. Compared to previous work on building knowledge graphs that uses Wikipedia as a source, there are no major sources that contain clean information about brands and products. Commerce is a very dynamic domain: new brands and products appear, and old ones are retired. Products are hard to define and to detect: articles of clothing tend to have descriptive rather than distinctive names (e.g., men's black cotton t-shirt, large), cheap and mass-produced products often lack an explicit brand (e.g., 1 inch iron nails), and service-oriented products are not well defined (e.g., term life insurance). Popular brands have distinct names, but homonymous brands from different fields can be problematic (e.g., Delta, Apple). The distinction between brand and product is sometimes blurred (e.g., is Amazon Video a product or a brand?) and, while product catalogs contain textual descriptions that can be used for extracting structured data, retailers do not always provide clean data.

Several machine-readable knowledge bases (KBs) have been built in the last two decades, incorporating the latest advances in information extraction, harvesting of Web resources at scale, and machine learning. One of the few published reports on the end-to-end implementation and maintenance of KBs in industrial settings is the work by Deshpande et al. [1], which describes Kosmix (later acquired by Walmart). Very recently, Amazon has been working on Product Graph, an authoritative knowledge graph for all products in the world, which includes a knowledge extraction framework for semi-structured product websites based on mining DOM trees [2] and OpenTag, an active learning approach for extracting attributes and values from product titles and descriptions [5]. Finally, existing popular KBs have knowledge gaps, and detection of long-tail entities is one of the main challenges with respect to coverage [4].

2 PRODUCT GRAPH

We define terminology as follows. A brand is a term or phrase that distinguishes an organization or product (e.g., Adidas, Calvin Klein, Microsoft). A domain is the most common URL associated with the brand (e.g., Microsoft's domain is www.microsoft.com). A product is an item that is manufactured for sale (e.g., Microsoft Surface 3, Gucci Guilty, Google Pixel) or a service provided (e.g., insurance, pet cleaning, food delivery). An alias is a name that an item is otherwise called or known as; in the case of Apple, a set of aliases is Apple Inc., Apple Computer Inc., and AAPL. Categories group items under a given label or name (e.g., BMW → vehicles).

Our proposed unsupervised approach is as follows. We start by generating lists of brands from multiple sources using aggregation functions. We tag brands with domains and categories using a combination of query log mining and modeling. Finally, we derive products using multiple modeling approaches on advertiser-provided data (bidded keywords and product offers from their product catalogs). Since advertisers also provide us brand information, we can associate products directly back to brands in the graph. We follow a bottom-up strategy with a focus on simplicity and data cleaning that allows fast iteration and debugging. The proposed graph construction methodology is designed so that new sources of data can be added easily, without affecting the existing process significantly and without requiring extensive redesign of the graph.
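To make the terminology concrete, the following is a minimal sketch of the kind of record the pipeline described in this section produces. This is illustrative Python, not the production system; the field names are assumptions rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Brand:
    # Canonical brand name plus known aliases (e.g., "Gap", "The Gap Inc").
    name: str
    aliases: List[str] = field(default_factory=list)
    # One or more domains associated with the brand (Section 2.2).
    domains: List[str] = field(default_factory=list)
    # Aggregated presence score across input sources (Section 2.1).
    score: float = 0.0
    # Category name -> confidence, derived from query logs (Section 2.3).
    categories: Dict[str, float] = field(default_factory=dict)
    # Product names associated with the brand (Section 2.4).
    products: List[str] = field(default_factory=list)

graph: Dict[str, Brand] = {}
graph["apple"] = Brand(name="Apple", aliases=["Apple Inc."], domains=["apple.com"], score=1.0)
```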
2.1 Brands

For brand generation, we use as input n catalog sources, s_1, ..., s_n, that contain information about brands, and we produce a table where each source assigns a vote of 1 if the term is present in that source and a vote of 0 otherwise. A final score is computed from these votes, as presented in the example in Table 1. Because source sizes vary and different sources may use different name variations and spellings, similar brands are grouped using alias data, which updates the final names and votes (e.g., The Gap Inc and Gap, Apple and Apple Inc). Some sources may be more reputable than others in terms of data quality, so we perform the final selection based on source authoritativeness and score > t, where t is a threshold on the number of votes.
This simple mechanism allows us to incorporate and maintain input sources in a flexible way and, at the same time, to identify areas where brand coverage may be low (e.g., end-consumer electronics vs. pharmaceutical).

Term   Alias         s_1  s_2  ...  s_n  score
Apple  Apple Inc.    1    1    ...  1    1
Bose   Bose audio    1    1    ...  0    0.7
Gap    The Gap Inc.  0    1    ...  1    0.8

Table 1: Brand list generation example. Each source s votes on the presence of a term.
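As an illustration of this voting step, here is a small Python sketch (not the production Cosmos job): per-source presence votes are aggregated into a score and thresholded, with the alias grouping reduced to a simple lookup. The toy data and names are assumptions.

```python
from collections import defaultdict

# Each source is a set of brand names it contains (toy data, for illustration only).
sources = {
    "s1": {"apple", "bose"},
    "s2": {"apple", "bose", "the gap inc"},
    "s3": {"apple", "gap"},
}

# Alias map used to group name variations under a canonical brand.
aliases = {"the gap inc": "gap", "apple inc": "apple"}

def canonical(name: str) -> str:
    return aliases.get(name.lower(), name.lower())

votes = defaultdict(int)
for source_brands in sources.values():
    seen = {canonical(b) for b in source_brands}   # at most one vote per source
    for brand in seen:
        votes[brand] += 1

n_sources = len(sources)
scores = {brand: v / n_sources for brand, v in votes.items()}

# Keep brands whose score exceeds the threshold t (tuned per source quality).
t = 0.6
brand_list = {b: s for b, s in scores.items() if s > t}
print(brand_list)   # apple: 1.0, bose and gap: ~0.67
```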
2.2 Brand Domains

A brand domain is the URL associated with a brand. We would like to extract this information automatically for all possible brands in our graph, as domains are important for categorization and also for linking product information. Sometimes brands have more than one URL, as in the case of Microsoft (e.g., microsoft.com, bing.com, visualstudio.com). For head brands, domain information can be obtained from Wikipedia's infoboxes, but for tail brands such information is not available. To make this scale, we implement an algorithm that uses Bing search query logs to derive domain information by extracting queries and their associated top-10 search result pages (titles and links), restricted to queries that are normal (not bots) and contain no adult content. By computing the most frequent URLs at positions 1 through 10 over 3 months of historical data, we extract the most frequent URL at position 1 as the domain owner for a brand query.

The final brand list contains aliases, domains (one or more per brand), and score, as presented in Table 2. This data is then used as input for the categorization step that we describe in the next subsection.

Term   Alias         Domain               score
Apple  Apple Inc.    apple.com            1
Bose   Bose audio    bose.com             0.7
Gap    The Gap Inc.  gap.com, gapinc.com  0.8

Table 2: Brand list generation example with domains.
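A minimal sketch of the position-1 aggregation, assuming filtered query-log rows of the form (brand query, result URL, result position); the row format and variable names are assumptions, not the production implementation.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# (brand query, result URL, result position) tuples from 3 months of logs,
# after removing bot traffic and adult content (toy examples).
log_rows = [
    ("microsoft", "https://www.microsoft.com/en-us", 1),
    ("microsoft", "https://en.wikipedia.org/wiki/Microsoft", 2),
    ("microsoft", "https://www.microsoft.com/en-us", 1),
    ("bose", "https://www.bose.com/home", 1),
]

position1 = defaultdict(Counter)
for query, url, position in log_rows:
    if position == 1:
        position1[query][urlparse(url).netloc] += 1

# The most frequent position-1 domain is taken as the domain owner of the brand query.
brand_domains = {q: counts.most_common(1)[0][0] for q, counts in position1.items()}
print(brand_domains)  # {'microsoft': 'www.microsoft.com', 'bose': 'www.bose.com'}
```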
2.3 Categorization

We now describe a method where brand categories are computed via the queries that are related to the brands through the brand URLs. Brands with no discovered URLs are not categorized. The categorization proceeds in the following sequential steps: brands → brand URLs → search queries to URLs → query categories → aggregation of categories → categories.

The computation relies on a factorization of probabilities over a tri-partite {B, U, Q} graph to generate the probability of a category given a brand:

    P(C|B) = Σ_U [ Σ_Q P(C|Q) · P(Q|U) ] · P(U|B)

All URLs per brand are treated as equal, so P(U|B) is uniform.

We use Bing search query logs to get query-URL click information and perform the following filtering: navigational queries, queries issued fewer than 2 times in 6 months, and URLs with fewer than 5 clicks in 6 months are removed. These thresholds can be optimized, but the goal is to remove rare queries and rare URLs. The clicked URLs are matched to the brand URLs according to whether they fall under them or not: for the brand URL www.microsoft.com/en-us, the URL www.microsoft.com/en-us/surface falls under it, but www.microsoft.com/surface does not. The probability of a query given a URL is based on the clicks on the URL for that query:

    P(Q|U) = clicks(Q, U) / clicks(U)

Each query is categorized using an internal categorizer that returns categories from an internal commercial ad taxonomy. The categorizer can return more than one category per query, and category scores lie between 0 and 1. It operates on a hierarchical taxonomy of categories (e.g., Apparel > Footwear > Athletic Shoes). The categorizer uses the web result snippet in addition to the query, which improves its accuracy. The categorizer scores are used as the probabilities of category given query: P(C|Q) = score_C(Q).

At this stage, each brand is associated with multiple categories through multiple queries and brand URLs. The same category appearing many times with different scores is aggregated to produce a final set of unique categories with scores for the brand. The aggregation reduces the impact of mistakes in the query categorization. The final score per category for a brand is essentially the weighted average of the query category scores.
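The factorization can be computed with a few grouped aggregations. The sketch below is illustrative Python with made-up table layouts (the actual pipeline runs as distributed jobs); it walks the brand → URL → query → category chain with uniform P(U|B).

```python
from collections import defaultdict

# Toy inputs; in the real pipeline these come from the brand list, the click logs,
# and the query categorizer.
brand_urls = {"microsoft": ["www.microsoft.com/en-us"]}
clicks = {("www.microsoft.com/en-us", "microsoft surface"): 80,
          ("www.microsoft.com/en-us", "microsoft word"): 20}
query_categories = {"microsoft surface": {"Computing": 0.9},
                    "microsoft word": {"Software": 0.8, "Computing": 0.3}}

def p_c_given_b(brand: str) -> dict:
    urls = brand_urls.get(brand, [])
    if not urls:
        return {}                      # brands without URLs are not categorized
    p_u = 1.0 / len(urls)              # P(U|B) is uniform over the brand URLs
    scores = defaultdict(float)
    for url in urls:
        url_clicks = sum(c for (u, _), c in clicks.items() if u == url)
        if url_clicks == 0:
            continue
        for (u, query), c in clicks.items():
            if u != url:
                continue
            p_q = c / url_clicks       # P(Q|U) = clicks(Q,U) / clicks(U)
            for category, p_c in query_categories.get(query, {}).items():
                scores[category] += p_c * p_q * p_u   # P(C|Q)·P(Q|U)·P(U|B)
    return dict(scores)

print(p_c_given_b("microsoft"))
```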
The accuracy of the final score depends on the number of queries and clicks available for the brand, so we reduce the score based on the uncertainty:

    Score(C|B) = max( 0, P(C|B) − α_1 / √clicks(B) − α_2 / √queries(B) )

where α_1 and α_2 are manually tuned parameters. The score is the probability reduced by the second and third terms, which are corrections based on the uncertainty of the data. Given n data points, the mean estimate has a standard error proportional to 1/√n, so the fewer clicks or queries are available, the less reliable the score is and the more it is discounted. As the number of clicks and queries for the brand increases, the corrections vanish.
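A minimal sketch of this discount, assuming clicks(B) and queries(B) have already been aggregated per brand; the α values shown are placeholders, since the tuned parameters are not disclosed.

```python
import math

def discounted_score(p_c_given_b: float, brand_clicks: int, brand_queries: int,
                     alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """Reduce P(C|B) by uncertainty terms that shrink as evidence grows."""
    if brand_clicks <= 0 or brand_queries <= 0:
        return 0.0
    correction = alpha1 / math.sqrt(brand_clicks) + alpha2 / math.sqrt(brand_queries)
    return max(0.0, p_c_given_b - correction)

# A brand with little evidence is discounted heavily; a popular brand barely at all.
print(discounted_score(0.6, brand_clicks=25, brand_queries=16))        # 0.6 - 0.2 - 0.25 = 0.15
print(discounted_score(0.6, brand_clicks=10_000, brand_queries=2_500)) # ~0.57
```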
For each category and brand, we maintain a small set of categorized queries as provenance. These explain why a brand belongs to a category. For example, Apple is categorized, among other things, as Apparel because of the query "apple watch", which belongs to the category Apparel → Jewelry → Watches & Watch Accessories.

A useful side effect of the previous computations is the popularity of the brand, which is based on the clicks to the brand URLs. The overall brand popularity is

    P(B) = clicks(B) / Σ_b clicks(b)
The brand popularity per category is computed as

    P(B|C) = P(C|B) · P(B) / P(C) ∝ P(C|B) · P(B)
2.4 Products

Identifying products is a much more difficult problem because products can be referenced in a coarse or granular manner depending on the source of the data. In search queries we see simplified representations of products, for example "Aveeno lotion", but in product catalogs we observe very detailed descriptions such as "Aveeno Active Naturals Daily Moisturizing Lotion 2 x 18 oz".

Products also include services provided by brands (for example insurance, car cleaning, printing) and not only retail items. Another complication is that products contain generic terms and specifications (e.g., models, parts, sizes).

We have two very different approaches to deriving products, suited to the sources of data we derive them from. A sample of products for Microsoft and Apple derived with the two methods is presented in Table 3.
2.4.1 Products from Retailer Catalogs. Retailers upload product catalogs and the Bing search engine serves product ads for search queries. The product ads corpus typically contains brand, product title or name, GTIN (Global Trade Item Number), and MPN (Manufacturer Product Number) information, among other data. To generate products for brands, we first perform a simple grouping of products within each brand. Since many ads are missing GTIN and MPN, and MPNs tend to be noisier than GTINs, we first group together all products with the same name and GTIN, and then group all products with the same name regardless of GTIN and MPN. As a next step, we use a CDSSM [3] trained on the product ads corpus to embed the product names in a lower-dimensional vector space. The lower-dimensional space facilitates the next round of product clustering, which uses k-means with heuristics to split clusters of large size or high variance. This process produces a hierarchical organization of products, where each higher level is more general than the previous one. Each cluster represents a group of semantically related product titles and contains many similar product names.
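A simplified sketch of this clustering stage, with embed() standing in for the CDSSM encoder (which cannot be reproduced here); the k, size, and split heuristics shown are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def embed(titles):
    """Placeholder for the CDSSM encoder: maps product titles to dense vectors."""
    rng = np.random.default_rng(len(titles))
    return rng.normal(size=(len(titles), 64))      # toy vectors, illustration only

def cluster_products(titles, k=8, max_size=500):
    """One round of k-means over CDSSM-style embeddings of product titles."""
    vectors = embed(titles)
    k = min(k, len(titles))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    clusters = [[t for t, l in zip(titles, labels) if l == c] for c in range(k)]
    # Clusters that are too large (or too spread out) would be re-clustered;
    # repeating the procedure over cluster representatives yields the hierarchy.
    oversized = [c for c in clusters if len(c) > max_size]
    return clusters, oversized
```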
We generate a set of M representative labels for each cluster using a simple generative algorithm. We compute the average length l of the product names within a cluster, generate the top N n-grams of length l, embed the n-grams using the same CDSSM encoding used in the original clustering, and choose the best M n-grams based on the distance of their CDSSM vectors from the cluster centroid. We repeat this process for longer and shorter n-grams, iterating until no additional labels are inserted into, or replace entries in, the top M labels.
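The label-selection step can be sketched as follows, again with an embed() function passed in for the CDSSM encoder (e.g., the placeholder from the previous sketch) and n-gram generation reduced to simple token windows; parameter names and defaults are assumptions.

```python
import numpy as np
from collections import Counter
from itertools import chain

def ngrams(title, n):
    tokens = title.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cluster_labels(titles, embed, M=3, N=50):
    """Pick M n-grams whose embeddings are closest to the cluster centroid."""
    centroid = embed(titles).mean(axis=0)
    avg_len = max(1, int(round(np.mean([len(t.split()) for t in titles]))))
    distances = {}
    for n in (avg_len, avg_len - 1, avg_len + 1):     # iterate over nearby lengths
        if n < 1:
            continue
        counts = Counter(chain.from_iterable(ngrams(t, n) for t in titles))
        candidates = [g for g, _ in counts.most_common(N)]
        if not candidates:
            continue
        for g, v in zip(candidates, embed(candidates)):
            distances[g] = np.linalg.norm(v - centroid)   # distance to centroid
    return sorted(distances, key=distances.get)[:M]
```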
2.4.2 Products from Bidded Keywords. For this approach we use a data set of keywords on which advertisers bid to display their ads. Besides the bidded keywords, the data set also contains the URL of the ad and the frequency with which the bidded keyword was triggered. First, the bidded keywords are matched to the brands by ensuring that the ad URL matches the known brand domain and that the bidded keywords contain the brand name. From the set of selected keywords, we generate n-grams (n <= 4) after removing stop words and locations. For every brand, each n-gram associated with the brand is scored twice: once using the overall n-gram frequencies from the whole data set and once using the brand-specific n-gram frequencies. The score itself is a traditional language model score, computed under a Markov assumption:

    P(w_1 w_2 w_3 w_4 ...) = P(w_1) P(w_2|w_1) P(w_3|w_2, w_1) P(w_4|w_3, w_2) ...

where the individual probabilities are smoothed using linear interpolation with lower-order n-grams. The n-grams are then ranked using KL divergence:

    score(phrase) = P(phrase|brand) · log( P(phrase|brand) / P(phrase) )

The phrases above a certain threshold are used as products associated with the brand. This method tends to produce more general products, such as "LED TV" or "men's t-shirt", so it complements well the previous method, which tends to produce more specific products, such as "40W flood light bulb, medium base, dimmable".
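A small sketch of this ranking step, assuming n-gram counts have already been collected per brand and over the whole data set; smoothing is reduced to add-one for brevity rather than the interpolated language model described above, and the counts are toy values.

```python
import math
from collections import Counter

def kl_rank(brand_counts: Counter, global_counts: Counter, top: int = 10):
    """Rank phrases by P(phrase|brand) * log(P(phrase|brand) / P(phrase))."""
    brand_total = sum(brand_counts.values())
    global_total = sum(global_counts.values())
    vocab = len(global_counts)
    scores = {}
    for phrase, c in brand_counts.items():
        p_brand = (c + 1) / (brand_total + vocab)                     # add-one smoothing
        p_global = (global_counts[phrase] + 1) / (global_total + vocab)
        scores[phrase] = p_brand * math.log(p_brand / p_global)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

brand_counts = Counter({"surface pro": 40, "office 365": 30, "laptop": 5})
global_counts = Counter({"surface pro": 50, "office 365": 45, "laptop": 10_000})
print(kl_rank(brand_counts, global_counts, top=2))   # brand-distinctive phrases rank first
```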
 Brand     Bidded keyword products Retailer catalog-based products
 microsoft microsoft word              Microsoft Office Home & Student 2016 For Mac
 microsoft microsoft office            Microsoft Office 365 Personal Subscription 1 Year 1 User Pc/Mac
 microsoft microsoft office 365        Microsoft Office Professional Plus 2016 365 Pro Word Excel Key Windows
 microsoft microsoft outlook           Microsoft Office Home and Business 2016 Retail License
 microsoft microsoft surface           Microsoft 13.5" Surface Book 2-in-1 Notebook 256GB SSD 8GB RAM Core i5 2.4 GHz
 microsoft microsoft teams             12.3 Surface Pro 6 - 128GB / Intel Core m3 / 4GB RAM (Silver)
 microsoft microsoft project           Surface Pro 6 - 512GB / Intel Core i7 / 16GB RAM (Platinum)
 microsoft microsoft excel             Surface Pro 4 12.3" Bundle: Core i5, 4GB RAM, 128GB SSD, Surface Pen, Type Cover
 microsoft microsoft office 2016       Microsoft Project Professional 2016 - Digital Download
 microsoft microsoft powerpoint        Microsoft Project 2019 Professional w/ 1 Server CAL Open License
 apple     apple watch                 Apple Watch Series 3 44MM Rose Gold
 apple     apple airpods               Apple Watch Series 4 44 mm Gold Stainless Steel Case with Gold Milanese Loop
 apple     apple watch series 3        24K Gold Plated 42MM Apple Watch Series 3 Diamond Polished Modern Gold Link Band
 apple     apple watch series 4        Apple Airpods MMEF2J/A Wireless Earphone For Iphone/Apple Watch/Ipad/Mac
 apple     apple iphone                Apple Airpods Genuine Left-Only Airpod (Without Charging Case)
 apple     apple watch 4               Apple Airpods - White MMEF2AM/A Genuine Airpod Retail Box
 apple     apple ipad                  Apple iPhone 8 Plus - 64GB - Space Gray (Unlocked) A1864 (CDMA + GSM)
 apple     apple iphone xs max         Apple iPhone XR 256GB, Yellow - Sprint
 apple     apple tv                    Apple iPad Wi-Fi 128GB Silver
 apple     apple earpods               Apple iPad 6th Gen. 32GB, Wi-Fi + Cellular (Unlocked), 9.7in - Silver #15718
Table 3: Example products from Apple and Microsoft returned by the two methods. The bidded keyword-based products are
more high-level, whereas the products from retailer catalogs have finer granularity.



3 RESULTS AND EVALUATION

The complete back-end data pipeline and algorithms have been implemented in Cosmos, Microsoft's computation service, and run on a distributed cluster continuously. Our data pipeline identifies around 130K brands, 1M products, and 22 top-level categories for brands. The average number of top-level categories per brand is 7.6. We can rebuild the graph from scratch within a few hours, which allows us to deploy new changes very efficiently. The graph is timestamped, so it is possible to observe brand and product evolution over time.

Due to confidentiality, however, we do not disclose user engagement or other behavioral data. Instead, we present an offline relevance evaluation for different components of the graph. We conducted an offline evaluation of 3,000 brands selected at random, using an internal tool for collecting labels from in-house editors, and report precision numbers in Table 4 grouped by brand score. The task consisted of showing a brand to a judge, who had to answer whether it is indeed a brand, not a brand, or whether they cannot tell.

Evaluation for categories was also conducted using the same internal tool, where we provide editors with a brand and the list of categories we have associated with it. The task is to mark the categories that are not correct. We also ask editors to check a box if they think a category is missing from those provided; optionally, they can write in the missing category. Marking the incorrect categories allows us to estimate precision, while the checkbox is used to estimate recall. We evaluated a sample of 2,000 brands, stratified according to score, gathering 3 responses per brand and treating the majority answer as correct. Figure 1 shows the ROC curve of our brand categorizer for selected brand score thresholds.
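As an illustration of how the editor judgments translate into the reported numbers, a hedged sketch of the majority-vote precision computation; the label values and toy data are assumptions, not the internal tool's format.

```python
from collections import Counter

# Three editor judgments per brand: "brand", "not_brand", or "cannot_tell".
judgments = {
    "apple":  ["brand", "brand", "brand"],
    "acme":   ["brand", "not_brand", "brand"],
    "foobar": ["not_brand", "not_brand", "cannot_tell"],
}

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

decided = {b: majority(ls) for b, ls in judgments.items()}
judged = [v for v in decided.values() if v != "cannot_tell"]
precision = judged.count("brand") / len(judged)
print(f"precision = {precision:.1%}")   # fraction of proposed brands confirmed as brands
```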
Brand score  Precision
1            92.2%
0.83         87.8%
0.66         87.2%
0.5          85.6%

Table 4: Brand evaluation.

[Figure 1: Root categories ROC. ROC curves (Recall vs. FPR) of the brand categorizer according to P(C|B), plotted for all brands and for brands with score ≥ 0.6, ≥ 0.8, and ≥ 1.]
4 CONCLUSION AND FUTURE WORK

In this paper we presented an efficient and scalable brand-product graph generation data pipeline that is currently used in an industrial setting. We described a number of unsupervised techniques for deriving brands, brand categories, and products from different input data sources. Preliminary evaluation results show that our implementation produces good quality output. We described the methods in detail, so it should be possible to reproduce the results using similar input sources.

Our philosophy is to build the graph using a rapid and iterative development process where, at each stage, we can test, debug, and explain changes as we ingest new data sources and the size of the graph grows. While the graph is far from complete, the data is already in use by internal applications and as a training set for classifiers and NERs. Thus, we agree with the Kosmix case study [1] that an imperfect KB is still useful for real-world applications, which in turn help test and identify problems with the newly derived data.

Future work includes product evaluation, product generation from query logs, improvements to brand conflation, which should improve the quality of the brand data, and functionality for detecting similar brands and products.

5 ACKNOWLEDGMENTS

We thank Karim Filali and Jaya Palli for help and assistance.

REFERENCES
[1] Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, and AnHai Doan. 2013. Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches. In Proc. of SIGMOD. 1209–1220.
[2] Colin Lockard, Xin Luna Dong, Prashant Shiralkar, and Arash Einolghozati. 2018. CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web. PVLDB 11, 10 (2018), 1084–1096.
[3] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proc. of WWW. 373–374.
[4] Gerhard Weikum, Johannes Hoffart, and Fabian M. Suchanek. 2016. Ten Years of Knowledge Harvesting: Lessons and Challenges. IEEE Data Eng. Bull. 39, 3 (2016), 41–50.
[5] Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, and Feifei Li. 2018. OpenTag: Open Attribute Value Extraction from Product Profiles. In Proc. of KDD. 1049–1058.