Learning to Identify Complementary Products from DBpedia

Victor Anthony Arrascue Ayala, Trong-Nghia Cheng, Anas Alzoghbi, Georg Lausen
Department of Computer Science, University of Freiburg
Georges-Köhler-Allee 051, 79110 Freiburg, Germany
arrascue|chengt|alzoghba|lausen@informatik.uni-freiburg.de

ABSTRACT
Identifying the complementary relationship between products, like a cartridge to a printer, is a very useful technique to provide recommendations. Complementary products are typically purchased together or within a short time frame, and online retailers therefore benefit from them. Existing approaches rely heavily on transactions and consequently suffer from: (1) the cold-start problem for new products; (2) the inability to produce good results for infrequently bought products; (3) the inability to explain why two products are complementary. We propose a framework that aims at alleviating these problems by exploiting a knowledge graph (DBpedia) in addition to products' available information such as titles, descriptions and categories, rather than transactions. Our problem is modeled as a classification task on a set of product pairs. Our starting point is the set of semantic paths in the knowledge graph linking product attributes, from which we model product features. Then, given a labeled set of product pairs, we learn a model; finally, we use this model to predict complementary products for an unseen set of products. Our experiments on a real-world dataset from Amazon show the high performance of our framework in predicting whether one product is complementary to another.

Keywords
Recommender Systems, Complementary Products, DBpedia, Knowledge-graph, Supervised Learning, Cross-domain.

Copyrights held by author/owner(s).
WWW2017 Workshop: Linked Data on the Web (LDOW2017)
1. INTRODUCTION
Online retailers like Amazon or eBay sell millions of products from several categories and continuously expand their catalogs. Being aware that browsing a large catalog can be overwhelming for a user and can therefore have a negative impact on revenue [3, 24], it is not a surprise that many of these retailers are equipped with Recommender Systems (RS) [28]. While recommending products based on the user's taste is certainly an important task any RS has to carry out, special attention has to be paid at the moment in which a purchase is about to be made. In this circumstance the RS can expose the user interested in a product to other products whose use is related to their desired product. Those products are referred to as complementary products [2]. For instance, a user who placed a guitar in the shopping cart might also be interested in purchasing a guitar tuner, a case, or a book about learning how to play it. It is important to notice that complementary products might belong to a different category than that of the associated product, e.g. Instrument Accessories, Bags & Cases and Books in the example above. This means that there are also relationships at the level of categories. Some of them are evident, like Instruments and Instrument Accessories, while some of them are not immediately obvious, like Instruments and Books.

The majority of the approaches to find complementary products make use of transactional data, e.g. by extracting association rules or by applying other data mining techniques [7, 22, 27, 31, 32]. This means that new catalog products have no chance to appear as complementary to any other product. In addition, these approaches might produce noisy recommendations for products which are infrequently purchased and are not able to explain the reason for the recommendation [18].

Therefore, a model is required from which the relationship between a complementary product and its associated product can be learned. We believe that a graph representation is the most suitable for this. Products, their attributes and the categories to which they belong, as well as the interconnections between them, can be represented as nodes and paths in the graph. Having such a model would not only allow us to find links between a complementary product and its associated product without any need for transactions, but also to provide explanations of why this relationship exists.

However, learning to discriminate between different kinds of relationships from those interconnections within a complex semantic graph model is still an open problem. See for instance the feature vector illustrated in Figure 1, which represents a pair of products (p_i, p_j). Each of the features should reflect a property which characterizes both products as a whole, or a certain interaction between them. A class label, complementary or non-complementary, is associated with each sample. While fitting a classification model which learns how to discriminate the relationship from a set of such samples might seem the obvious way of solving the problem, a question remains: what are those features that reflect some property of pairs of products? What do they reflect? Is this property related to the fact that one product might be complementary to the other? This work aims at addressing these questions.

Figure 1: Feature vector

Contributions. The contributions of this paper can be summarized as follows.

1. We propose CompLoD, a novel and extensible framework that predicts the complementary relationship between two products by using only textual information such as titles, descriptions and categories.

2. The approach does not require transactions, unlike other approaches, and is therefore immune to the cold-start problem for new items.

3. Our scheme bridges the gap of how to encode information from a knowledge graph into a relevant feature set to fit a classification model and predict the complementary relationship.

4. Rather than building a knowledge graph from scratch, we believe that Semantic Web standards have enough to offer in this regard. Therefore, we use DBpedia (http://wiki.dbpedia.org/), a well-known Linked Open Data (LoD) dataset, as the core of the knowledge graph. This dataset is the structured counterpart of Wikipedia and is modeled as an RDF graph. We thereby demonstrate the usefulness of Semantic Web data.

We evaluate the accuracy of our approach by conducting experiments using Amazon data [14, 15] as our ground truth and find that it produces competitive results in comparison to those shown in [14], taking into account that no transaction data is used. Our paper is structured as follows: section 2 provides the notation and definitions used within this paper. In section 3 the workflow of CompLoD is explained in detail. Section 4 shows the experiments that have been carried out to validate our system. In section 5 we present the related work. Finally, we present our conclusions in section 6.
2. NOTATION
This section provides some fundamental definitions which allow us to explain the system in detail. Our approach requires modeling four elements within an RDF graph: products, attributes, the categories to which products belong, and the relationships between all of these elements. Attributes are characteristics of a product which are extracted from a product's specification. Categories are instead given and are classified in a taxonomy. For instance, bronze strings is an attribute of a guitar. Moreover, a guitar might belong to categories such as Musical Instruments, Guitars and Acoustic Guitars. Categories are hierarchically arranged: acoustic guitars are a certain kind of guitar and these are a certain kind of musical instrument.

Since RDF is used as the underlying data model, we need to introduce some notation. Let I, B and L be the set of all IRIs (Internationalized Resource Identifiers), the set of all blank nodes and the set of all literals, respectively. A triple (r_s, p, r_o) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called an RDF triple. An RDF graph G is a set of RDF triples. Moreover, let V be a set of variables. A tuple of the set (I ∪ L ∪ V) × (I ∪ V) × (I ∪ L ∪ V), e.g. (?x, p, r_o), is called a triple pattern, in which variable names are prefixed with the question mark ("?") symbol. A triple pattern can be evaluated according to the SPARQL semantics to find all those mappings between variables and resources from the graph which would make of that tuple a valid triple. For a more comprehensive description of RDF and SPARQL we refer the reader to [13, 25].

A path in graph G between a source resource r_s and a target resource r_o is denoted by ρ(r_s, r_o)_(p_1, p_2, ..., p_t), where (p_1, p_2, ..., p_t) is the sequence of predicates that connects r_s to r_o (for simplicity, we assume this sequence identifies a unique path). A path can be recursively defined as follows:

• ρ(r_s, r_o)_(p) is a path of length 1 if there exists a triple (r_s, p, r_o) ∈ G which connects r_s to r_o.

• ρ(r_s, r_o)_(p_1, p_2, ..., p_t) is a path of length t if there exists a resource r_l such that ρ(r_s, r_l)_(p_1, p_2, ..., p_{t-1}) is a path of length t − 1 and ρ(r_l, r_o)_(p_t) is a path of length 1.

In our RDF graph each of the above-mentioned elements, products, attributes and categories, corresponds to an RDF resource. Let P = {r_{p_1}, r_{p_2}, ..., r_{p_n}} ⊆ I be a subset of all IRIs representing a set of products, A = {r_{a_1}, r_{a_2}, ..., r_{a_m}} ⊆ I a set of products' attributes, and C = {r_{c_1}, r_{c_2}, ..., r_{c_l}} ⊆ I a set of products' categories. From now on, products, attributes and categories will denote the RDF resources which identify these elements. Moreover, we will use r_p, r_a and r_c to denote an arbitrary product, attribute and category, respectively. We assume that each product is connected in G to its attributes and to the categories to which it belongs. Moreover, there exist arbitrary connections in G among attributes or categories, i.e. there are no restrictions on how resources are connected.

Problem statement. Given two products (r_{p_i}, r_{p_j}) in G, we would like to predict whether r_{p_j} is complementary to r_{p_i} using G. Let comp(r_{p_i}, r_{p_j}) be a function that returns 1 when this relationship holds, and 0 otherwise. It is directional, i.e. comp(r_{p_i}, r_{p_j}) = 1 does not imply that comp(r_{p_j}, r_{p_i}) also returns 1: a guitar tuner is complementary to a guitar, but the opposite does not necessarily hold.
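The recursive path definition above translates directly into a bounded graph traversal. The following minimal sketch (ours, not part of CompLoD) enumerates the predicate sequences that form paths between two resources in a toy triple set; all resource and predicate names are illustrative.

```python
from collections import defaultdict

def paths(triples, r_s, r_o, max_len=2):
    """Enumerate predicate sequences (p1, ..., pt), t <= max_len, for which
    rho(r_s, r_o)_(p1,...,pt) is a path per the recursive definition above."""
    out = defaultdict(list)                      # subject -> [(predicate, object)]
    for s, p, o in triples:
        out[s].append((p, o))

    results = []
    def walk(node, preds):
        for p, o in out[node]:
            if o == r_o:
                results.append(preds + (p,))     # a path of length len(preds) + 1
            if len(preds) + 1 < max_len:
                walk(o, preds + (p,))            # extend by one hop (recursive case)
    walk(r_s, ())
    return results

# Toy graph: tv --connectivity--> Bluetooth --type--> Wireless
G = [("tv", "connectivity", "Bluetooth"), ("Bluetooth", "type", "Wireless")]
print(paths(G, "tv", "Wireless"))                # [('connectivity', 'type')]
```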
3. OUR APPROACH
Figure 2 illustrates a high-level view of our framework CompLoD. The system needs to process only textual information of products, such as titles, descriptions and categories. Example 1 illustrates the input expected for a product.

Figure 2: Workflow of CompLoD

Title: Logitech M510 Wireless Mouse, Large Mouse, Computer Wireless Mouse.
Category: Electronics; Computers & Accessories; Computer Accessories & Peripherals; Keyboards, Mice & Accessories; Mice
Description:
* Contoured shape with soft rubber grips for all-day comfort
* Back/forward buttons and side-to-side scrolling plus zoom let you do more, faster. Requires Logitech SetPoint software
* 2-year battery life eliminates the need to replace batteries. Battery life may vary based on computing conditions
* Comes with a tiny Logitech Unifying receiver that stays in your computer - plug it in, forget it
* 3-year limited hardware warranty
* Compatibility: Works with - Windows XP, Windows Vista, Windows 7, Mac OS X 10.4 or later, USB Port
* Mouse Dimensions (height x width x depth): 2.56 in (65 mm) x 4.72 in (120 mm) x 1.61 in (41 mm)

Example 1: Title, categories and description of a product (mouse)

As previously mentioned, attributes are extracted from the description of a product. A category, in contrast, is a structured piece of information organized as a taxonomy. The categories go from generic (Electronics) to more specific (Mice). However, CompLoD does not make use of the taxonomy and sees all categories as independent.

Together, all of this information is processed in a pipeline-like fashion. First, the product's meta-data is transformed into a structured graph in which product attributes and categories are matched against DBpedia resources (A). The resulting knowledge graph is a subgraph of DBpedia, extended with the products represented as synthetic nodes. Being aware that not all attributes have the same role in characterizing a product and that not all interconnections provide the same quality of information, we attach scores to the nodes and paths in the graph using consolidated metrics (B). The results are used in the next stage to build the feature set for pairs of products. Finally, we train a classification model (C); once it is fitted, the framework is able to answer the question whether one product is complementary to another. More details about the individual tasks are provided in the following sections.

3.1 Building a knowledge graph
Our framework requires a knowledge graph as the starting point and we opted for DBpedia. In addition to the deterministic rules which are used to build the graph, DBpedia also has regularities with predictive power [20]. To make use of this, we need to integrate the products' meta-data into it by extracting structured information from the text. We therefore use AlchemyAPI, a NER tool, which identifies those tokens that correspond to published LoD IRIs. Whereas the tool supports the mapping to resources from several LoD datasets (at the date of publication: DBpedia, Freebase, US Census, GeoNames, UMBEL, OpenCyc, YAGO, MusicBrainz, CIA Factbook and CrunchBase), we only store the information relative to DBpedia and leave the exploitation of multiple datasets to future work. For instance, from the previous text some of the identified resources are (we use the prefixes dbpedia: for http://dbpedia.org/resource/ and yago: for http://yago-knowledge.org/resource/):

Categories
Electronics: dbpedia:Electronics
Computer & Accessories: dbpedia:Computer, ...
Attributes
Logitech: dbpedia:Logitech, yago:Logitech, ...
Operating system: dbpedia:Operating_system, ...
Microsoft Windows: dbpedia:Microsoft_Windows
Mac OS X: dbpedia:Mac_OS_X, ...
Microsoft: dbpedia:Microsoft, yago:Microsoft, ...
IMac: dbpedia:IMac, yago:IMac, ...
Windows Vista: dbpedia:Windows_Vista, ...

Example 2: Resources identified from the meta-data of the product shown in Example 1

It can be observed that some of the identified resources correspond to categories, such as dbpedia:Electronics, while others are attributes, such as dbpedia:Logitech (brand). To keep the approach fully automated, we treat categories as text as well. The quality of this task highly depends on the efficacy of the NER tool, each of which has its own limitations. While most of the important resources are identified, some of them, like dbpedia:Computer_mouse (category), which are important to characterize the product, are not necessarily recognized. For more details about the expected performance of these tools we refer the reader to [8].
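To make the NER step concrete, here is a minimal stand-in sketch. It is not the pipeline the authors ran: the paper uses AlchemyAPI, whereas this example calls the public DBpedia Spotlight endpoint, a comparable entity-linking service; the endpoint URL, parameters and response layout follow Spotlight's documented API, and the confidence threshold is an assumed value.

```python
import requests

SPOTLIGHT = "https://api.dbpedia-spotlight.org/en/annotate"  # stand-in for AlchemyAPI

def link_resources(text, confidence=0.5):
    """Return the bag of DBpedia IRIs recognized in a product's text."""
    resp = requests.get(SPOTLIGHT,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    # Spotlight returns {"Resources": [{"@URI": "...", ...}, ...]} on a match.
    return [r["@URI"] for r in resp.json().get("Resources", [])]

title = "Logitech M510 Wireless Mouse, Large Mouse, Computer Wireless Mouse."
print(link_resources(title))   # e.g. ['http://dbpedia.org/resource/Logitech', ...]
```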
Given the products and the identified resources, the RDF graph is built as follows (the construction is sketched in code at the end of this subsection):

1. First we build a basic RDF graph G_p. We create one resource r_p for each product and connect it to its attribute IRIs. Let A_{r_p} be the bag of attributes identified in the input text of the product represented by r_p (a bag rather than a set, because we must preserve how often each attribute occurs in order to compute the metrics in the next stage). We create a triple (r_p, sp, r_a) linking the product r_p with each r_a ∈ A_{r_p} using a synthetic predicate sp. If an attribute r_a appears more than once in A_{r_p}, we create as many triples as the number of times r_a appears in A_{r_p}, where each triple has a different synthetic predicate.

2. Let C_{r_p} be the set of categories to which the product belongs. We also create triples (r_p, sp, r_c) that link products with their categories.

3. We enrich G_p by extracting a subgraph from DBpedia. This subgraph contains all DBpedia resources identified by the NER tool and all the resources reachable within 2 hops from them, ignoring the direction of the edges. We limit the length to 2 to reduce the computational cost of computing metrics on top of the graph (more details in the next section). We merge G_p with the extracted subgraph to form the extended graph EG_p. Note that attributes and categories in EG_p are interconnected.

EG_p is then the input of the next task, in which metrics are computed to measure the importance of an attribute or a category for a product, as well as the importance of the semantic paths interlinking them.
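The construction of G_p and its 2-hop extension can be sketched with rdflib and DBpedia's public SPARQL endpoint. This is our reconstruction: the urn:complod:* predicate IRIs are placeholders for the synthetic predicates, whose naming scheme the paper does not specify.

```python
from rdflib import Graph, URIRef
from SPARQLWrapper import SPARQLWrapper

def build_gp(products):
    """Steps 1 and 2: one node per product, linked to its attribute bag and
    categories. Repeated attributes get distinct synthetic predicates so
    occurrence counts survive."""
    g = Graph()
    for rp, (attributes, categories) in products.items():
        seen = {}
        for ra in attributes:                          # a bag: duplicates matter
            seen[ra] = seen.get(ra, 0) + 1
            sp = URIRef(f"urn:complod:sp{seen[ra]}")   # placeholder synthetic predicate
            g.add((URIRef(rp), sp, URIRef(ra)))
        for rc in categories:
            g.add((URIRef(rp), URIRef("urn:complod:category"), URIRef(rc)))
    return g

def one_hop(resource):
    """Step 3, one expansion round: fetch triples in which `resource` appears
    as subject or object; applying it twice yields the 2-hop neighbourhood."""
    ep = SPARQLWrapper("https://dbpedia.org/sparql")
    ep.setQuery(f"""CONSTRUCT {{ ?s ?p ?o }}
                    WHERE {{ ?s ?p ?o .
                             FILTER (?s = <{resource}> || ?o = <{resource}>) }}""")
    return ep.queryAndConvert()     # an rdflib.Graph for CONSTRUCT queries
```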
3.2 Computation of metrics
In our knowledge graph every product is connected to its attributes and categories. However, those attributes might have a distinct relevance in describing a product or even in describing a category (seen as a set of products). Therefore, we require metrics that reflect the following intuitions:

• How well can an attribute characterize a product? E.g. is equalizer more representative than rechargeable to describe a speaker? (Section 3.2.1)

• How well can an attribute characterize a category? E.g. is 4K more representative than wireless for the category Television & Video? (Section 3.2.2)

• How relevant is the information carried by a path connecting two resources? Let us assume, for instance, that two product resources, a television (tv) and a speaker (s), are connected through several paths of different lengths in EG_p. Is the path tv --connectivity--> Bluetooth <--connectivity-- s that goes through Bluetooth more important than another one that goes through Stereo? Which of these paths is the most informative? Does the length also play a role? Suppose that the connection is now tv --connectivity--> Bluetooth --type--> Wireless <--connectivity-- s. Does the informativeness decay with the length of the path? (Section 3.2.3)

In order to answer those questions we employ Term Frequency-Inverse Document Frequency (TF-IDF) [19], a weighting scheme widely used in Information Retrieval and text mining to determine how important a term is to a document within a corpus. The design of our metrics is achieved through different instantiations of TF-IDF, i.e. we map different components of the RDF graph to terms and documents. This requires considering a product r_p as a bag of attributes {r_{a_1}, r_{a_2}, ..., r_{a_k}}. A category r_c can also be thought of as a bag of attributes, namely those used to describe products of that category. We explain these metrics in detail in the following sections.

3.2.1 (Product) Attribute Frequency - Inverse Product Frequency (PAF-IPF)
The representation of a product as a bag of attributes allows us to define a weighting scheme that reflects how well an attribute r_a is able to represent or describe a product r_p. We instantiate TF-IDF using a product as a document and its attributes as the terms within the document:

    PAF-IPF(r_a, r_p) = PAF(r_a, r_p) × IPF(r_a),    IPF(r_a) = log(|P| / PF(r_a))

Intuitively, PAF(r_a, r_p) counts the number of times a product r_p is described by an attribute r_a, i.e. the number of triples that link the product to the attribute. In a similar way, PF(r_a) counts the number of products that have r_a as an attribute. |P| is the cardinality of the set P, i.e. the number of products available in G. The goal of IPF(r_a) is to reduce the score by a factor proportional to the number of products the attribute describes: if an attribute is common to all products, it cannot strongly characterize any of them.
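PAF-IPF reduces to counting over the attribute bags. A minimal sketch over a toy two-product corpus (all names illustrative; CAF-ICF in the next subsection follows the same skeleton with categories as documents) — note how rechargeable, which describes every product, receives a score of 0:

```python
import math
from collections import Counter

def paf_ipf(products):
    """Compute PAF-IPF for every (attribute, product) pair.
    `products` maps a product to the bag (Counter) of its attributes."""
    n = len(products)                                # |P|
    pf = Counter()                                   # PF: #products having r_a
    for bag in products.values():
        pf.update(set(bag))
    return {(ra, rp): paf * math.log(n / pf[ra])     # PAF x IPF
            for rp, bag in products.items()
            for ra, paf in bag.items()}

products = {"speaker": Counter({"equalizer": 2, "rechargeable": 1}),
            "phone":   Counter({"rechargeable": 1, "touchscreen": 1})}
print(paf_ipf(products))
# {('equalizer', 'speaker'): 1.386..., ('rechargeable', 'speaker'): 0.0, ...}
```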
3.2.2 (Category) Attribute Frequency - Inverse Category Frequency (CAF-ICF)
In the same way we define:

    CAF-ICF(r_a, r_c) = CAF(r_a, r_c) × ICF(r_a),    ICF(r_a) = log(|C| / CF(r_a))

which reflects to which extent an attribute characterizes a category. CAF(r_a, r_c) counts the number of times that an attribute resource r_a is used to describe a product r_p that belongs to a category r_c. CF(r_a) counts the number of categories in which the attribute r_a is used to describe at least one product belonging to that category. |C| is the cardinality of the set C. The role of ICF(r_a) is the same as that of IPF(r_a).

3.2.3 Path Informativeness
In addition to the two metrics introduced above, we require another one which reflects the degree of informativeness a path carries with it. We use the definition of path informativeness presented in [26], which we tailored to our needs. This concept is based on a similar instantiation of TF-IDF: RDF resources r are used as documents. However, two kinds of documents are considered: those in which r appears as the subject of a triple and those in which it has the role of an object. This allows us to define the Predicate Frequency - Inverse Triple Frequency (PF-ITF):

    I-PF-ITF(p, r) = I-PF(p, r) × ITF(p)
    O-PF-ITF(p, r) = O-PF(p, r) × ITF(p)
    ITF(p) = log(|T| / |T(p)|)

I-PF-ITF(p, r) is the input PF-ITF of a term p within the document resource r, and O-PF-ITF(p, r) is the output PF-ITF of a term p within the document resource r. I-PF(p, r) counts the number of resources that can be mapped to the variable ?x of the triple pattern (?x, p, r). Analogously, O-PF(p, r) counts the possible mappings for ?y in (r, p, ?y). |T| is the overall number of triples, whereas |T(p)| is the number of triples in which p appears. The role of ITF(p) is the same as for the previous instantiations.

This metric allows us to define the informativeness of a path. For a path ρ(r_i, r_j)_(p) of length 1, the informativeness I is defined as:

    I_{ρ(r_i, r_j)_(p)} = (O-PF-ITF(p, r_i) + I-PF-ITF(p, r_j)) / 2

For paths of length t, the informativeness is the sum of the informativeness of its length-1 components divided by the length of the path:

    I_{ρ(r_i, r_j)_(p_1, ..., p_t)} = (I_{ρ(r_i, r_k)_(p_1)} + ... + I_{ρ(r_s, r_j)_(p_t)}) / t

Given the previous definitions, it is possible to define I_max, the maximum informativeness carried by a path connecting two resources r_i and r_j:

    I_max(r_i, r_j) = max{I(r_i, r_j)_{ρ_1}, ..., I(r_i, r_j)_{ρ_k}}

where each ρ is a distinct path, i.e. any two paths differ in at least one predicate. Similarly, it is possible to define I_avg, the average informativeness of all paths between r_i and r_j:

    I_avg(r_i, r_j) = (I(r_i, r_j)_{ρ_1} + ... + I(r_i, r_j)_{ρ_k}) / k
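Under these definitions, the informativeness of a path is an average over its length-1 hops. A minimal sketch (ours), assuming a path is given as a list of triples (r_i, p, r_j); I_max and I_avg then reduce to the max and mean of this value over the enumerated paths between two resources:

```python
import math
from collections import Counter

def pf_itf_tables(triples):
    """Precompute the counts behind I-PF-ITF and O-PF-ITF."""
    t_total = len(triples)                         # |T|
    t_pred = Counter(p for _, p, _ in triples)     # |T(p)|
    i_pf, o_pf = Counter(), Counter()
    for s, p, o in triples:
        o_pf[(p, s)] += 1                          # r as subject (output)
        i_pf[(p, o)] += 1                          # r as object (input)
    itf = {p: math.log(t_total / c) for p, c in t_pred.items()}
    return i_pf, o_pf, itf

def informativeness(path, i_pf, o_pf, itf):
    """I of a path [(r1, p1, r2), (r2, p2, r3), ...]: the mean over its hops
    of (O-PF-ITF(p, subject) + I-PF-ITF(p, object)) / 2."""
    hops = [(o_pf[(p, s)] * itf[p] + i_pf[(p, o)] * itf[p]) / 2
            for s, p, o in path]
    return sum(hops) / len(hops)
```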
To summarize: the metrics allow us to determine the importance of attributes in describing a product or a category, and to measure the amount of informativeness carried by the semantic paths connecting two resources. They are designed considering different granularities for the notion of document. In the case of PAF-IPF, a document is a product and the terms are the attributes that describe it. In the second case, CAF-ICF, a document is a category and the terms are the attributes used to describe products of that category. In the case of informativeness, documents are subject or object resources, whereas predicates are the terms. CompLoD computes the relevance metrics in EG_p by iterating over the set of identified attributes and then proceeds by searching for paths linking them.

All these metrics have been validated by other works and have been used to solve a variety of problems. However, when one focuses again on the problem illustrated in Figure 1, i.e. the problem of designing a feature set that applies to pairs of products, one notices that these metrics cannot be used directly as features. For instance, PAF-IPF and CAF-ICF apply to a single product. While nothing prevents one from using them as features for pairs, it is not possible to have each possible attribute as a feature: this would lead to an extremely high-dimensional feature vector. And whereas the path informativeness between two products' attributes could be thought of as a feature for pairs of products, one cannot have each possible path as a feature either, for the same reason (high dimensionality).

We therefore propose a strategy to combine these metrics, following a very natural intuition which turned out to produce useful results. This allows one to design a low-dimensional feature vector which can be used to model pairs of products.

3.3 Feature engineering
To the best of our knowledge there is no work which addresses the issue of which features for pairs of products (r_{p_i}, r_{p_j}) can be used to predict the complementary relationship. We aim at filling this gap. Our rationale is based on the observation that for products belonging to different domains there might exist correspondences and dependencies between their attributes. This has been validated in marketing, behavioral, and data mining research studies [4]. Our hypothesis is that (1) these correspondences and dependencies also tell us something about the complementary relationship, and (2) this correspondence can be measured by the informativeness carried in the interconnections between attributes, weighted by those attributes' relative importance. We therefore explain in this section how to leverage the interconnections of the knowledge graph and combine the metric scores in a meaningful way.

Let A_{r_{p_i}} and A_{r_{p_j}} be the bags of attributes of products r_{p_i} and r_{p_j}. To simplify the explanation of the formulas we will from now on assume that r_{a_i} and r_{a_j} are attributes of r_{p_i} and r_{p_j}, respectively. Moreover, we assume that r_{c_i} and r_{c_j} are the most specific categories of r_{p_i} and r_{p_j}. For the sake of readability we introduce two weighting functions ω_P and ω_C (Figure 3), which use PAF-IPF and CAF-ICF, respectively:

    ω_P(r_{a_i}, r_{a_j}) = sqrt( (PAF-IPF(r_{a_i}, r_{p_i})² + PAF-IPF(r_{a_j}, r_{p_j})²) / 2 )
    ω_C(r_{a_i}, r_{a_j}) = sqrt( (CAF-ICF(r_{a_i}, r_{c_i})² + CAF-ICF(r_{a_j}, r_{c_j})²) / 2 )

Figure 3: Weighting functions ω_P and ω_C, where r_{c_i} and r_{c_j} are the categories of products r_{p_i} and r_{p_j}, respectively

In this way, we can weigh a path between two attributes r_{a_i} and r_{a_j} according to their characterization relevance. Next, we present the design choices we made; ω and the first feature are sketched in code after the MIMCA definition below.

Maximal Informativeness over the most relevant product attributes (MIMPA): this feature combines PAF-IPF and the concept of informativeness as follows. Given a product r_{p_i}, we first search for the attribute r_{a_i} that has the maximum PAF-IPF(r_{a_i}, r_{p_i}) among all its attributes. We do the same for r_{p_j}. Let r_{a_i}^{M_p} and r_{a_j}^{M_p} be these attributes for r_{p_i} and r_{p_j}, respectively. These are the most relevant product attributes. The feature is then:

    MIMPA(r_{p_i}, r_{p_j}) = I_max(r_{a_i}^{M_p}, r_{a_j}^{M_p}) · ω_P(r_{a_i}^{M_p}, r_{a_j}^{M_p})

Recall that when computing I_max(r_{a_i}^{M_p}, r_{a_j}^{M_p}) all possible paths which connect r_{a_i}^{M_p} to r_{a_j}^{M_p} are considered and, from these, the maximum informativeness value is returned.

Maximal Informativeness over the most relevant category attributes (MIMCA): we focus again on the attribute bags of both products. However, rather than searching for those which characterize the products best, we look at the most specific category of each product and leverage CAF-ICF to find the attribute which characterizes this category best. Let r_{a_i}^{M_c} and r_{a_j}^{M_c} be these attributes for r_{p_i} and r_{p_j}, respectively. Then:

    MIMCA(r_{p_i}, r_{p_j}) = I_max(r_{a_i}^{M_c}, r_{a_j}^{M_c}) · ω_C(r_{a_i}^{M_c}, r_{a_j}^{M_c})
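As a sketch of how the weighting functions and the first feature compose (our reconstruction; `paf_ipf` maps (attribute, product) pairs to scores as in Section 3.2.1 and `i_max` is assumed to implement I_max from Section 3.2.3):

```python
import math

def omega(score_i, score_j):
    """Quadratic mean used by both weighting functions of Figure 3:
    omega_P over PAF-IPF scores, omega_C over CAF-ICF scores."""
    return math.sqrt((score_i ** 2 + score_j ** 2) / 2)

def mimpa(rp_i, rp_j, paf_ipf, i_max):
    """MIMPA: maximal path informativeness between the two products' most
    relevant attributes, weighted by those attributes' PAF-IPF scores."""
    def best(rp):   # the attribute with maximum PAF-IPF for this product
        return max((a for (a, p) in paf_ipf if p == rp),
                   key=lambda a: paf_ipf[(a, rp)])
    ai, aj = best(rp_i), best(rp_j)
    return i_max(ai, aj) * omega(paf_ipf[(ai, rp_i)], paf_ipf[(aj, rp_j)])
```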
Average Informativeness over the most relevant product attributes (AIMPA): this feature works in a similar way as MIMPA. Once the most relevant attributes of the products, r_{a_i}^{M_p} and r_{a_j}^{M_p}, are identified, all paths connecting these resources are considered: the average informativeness I_avg is used instead of only the score of the most informative path:

    AIMPA(r_{p_i}, r_{p_j}) = I_avg(r_{a_i}^{M_p}, r_{a_j}^{M_p}) · ω_P(r_{a_i}^{M_p}, r_{a_j}^{M_p})

Average Informativeness over the most relevant category attributes (AIMCA): in the same way as MIMCA, we consider the attributes which characterize the categories the most, but use I_avg instead of I_max, i.e. the average informativeness of all paths connecting them:

    AIMCA(r_{p_i}, r_{p_j}) = I_avg(r_{a_i}^{M_c}, r_{a_j}^{M_c}) · ω_C(r_{a_i}^{M_c}, r_{a_j}^{M_c})

Average of maximal informativeness over all attributes (AMIO): rather than considering the most relevant attribute of each product, we focus here on the informativeness flowing from one product (from all of its attributes) to the other. To do so we consider all pairs of attributes r_{a_i} and r_{a_j}, belonging to r_{p_i} and r_{p_j} respectively, and compute the maximum informativeness carried by the links of each attribute pair. As for the previous features, we also combine the relevance of the individual attributes with the informativeness of the link. Let P_{(r_{p_i}, r_{p_j})} = {(r_{a_{i1}}, r_{a_{j1}}), (r_{a_{i1}}, r_{a_{j2}}), ..., (r_{a_{ik}}, r_{a_{jl}})} be the set of attribute tuples, where the first element is an attribute of r_{p_i} and the second of r_{p_j}. Then:

    AMIO_P(r_{p_i}, r_{p_j}) = (1 / |P_{(r_{p_i}, r_{p_j})}|) · Σ_{(r_{a_i}, r_{a_j}) ∈ P_{(r_{p_i}, r_{p_j})}} I_max(r_{a_i}, r_{a_j}) · ω_P(r_{a_i}, r_{a_j})

    AMIO_C(r_{p_i}, r_{p_j}) = (1 / |P_{(r_{p_i}, r_{p_j})}|) · Σ_{(r_{a_i}, r_{a_j}) ∈ P_{(r_{p_i}, r_{p_j})}} I_max(r_{a_i}, r_{a_j}) · ω_C(r_{a_i}, r_{a_j})

    AMIO(r_{p_i}, r_{p_j}) = (AMIO_P(r_{p_i}, r_{p_j}) + AMIO_C(r_{p_i}, r_{p_j})) / 2

AMIO thus combines the characterization power of the attributes both in the role of product attributes and in the role of category attributes.

Average of average informativeness over all attributes (AAIO): this feature is computed in the same way as AMIO, except that I_avg is used instead of I_max. In this way we consider the complete flow of informativeness between each pair of attributes.

Maximal relevance of common product attributes (MRCPA): considering only the common attributes of both products, A_{r_{p_i}} ∩ A_{r_{p_j}}, we compute the PAF-IPF-based weight of the elements of the intersection and return the maximum value:

    MRCPA(r_{p_i}, r_{p_j}) = max{ ω_P(r_{a_i}, r_{a_j}) : r_{a_i} = r_{a_j} ∈ (A_{r_{p_i}} ∩ A_{r_{p_j}}) }

Maximal relevance of common category attributes (MRCCA): this is the symmetric counterpart of MRCPA, in which one considers the contribution of common attributes in characterizing the product categories:

    MRCCA(r_{p_i}, r_{p_j}) = max{ ω_C(r_{a_i}, r_{a_j}) : r_{a_i} = r_{a_j} ∈ (A_{r_{p_i}} ∩ A_{r_{p_j}}) }

Main product relevance (MPR): this metric measures the relevance of the main product (the first element of the input pair) on the basis of its attributes. Let |(r_a, ?p, ?o)| be the number of triples having r_a as subject and |(?s, ?p, r_a)| the number of triples having r_a as object. Then:

    MPR(r_{p_i}) = Σ_{r_a ∈ A_{r_{p_i}}} (|(r_a, ?p, ?o)| + |(?s, ?p, r_a)|)

Related product relevance (RPR): the same as the previous metric, but applied to the related product (the second element of the input pair).

Interconnection between products (IP): this feature reflects the degree of connectivity between the attributes of two products. Let φ be the number of attribute pairs (r_{a_i}, r_{a_j}) ∈ P_{(r_{p_i}, r_{p_j})} for which there exists at least one path connecting them. Then:

    IP(r_{p_i}, r_{p_j}) = φ / |P_{(r_{p_i}, r_{p_j})}|

Product similarity (PS): to measure the degree of similarity, we apply the Jaccard similarity to the attribute sets of both products:

    PS(r_{p_i}, r_{p_j}) = |A_{r_{p_i}} ∩ A_{r_{p_j}}| / |A_{r_{p_i}} ∪ A_{r_{p_j}}|

Other metrics: we designed and tried several other features whose contribution was small or minimal.

In this section we introduced the features which played an important role in predicting whether one product is complementary to another. All 12 features are the result of a feature selection task performed to boost the prediction accuracy of the classifiers. They are shown below together with their relative importance, computed as the normalized total reduction of the splitting criterion contributed by each feature (the Gini importance of random forests):

    Feature:     MIMPA  MIMCA  AIMPA  AIMCA  AMIO  AAIO  MRCPA  MRCCA  MPR   RPR   IP    PS
    Importance:   6.87  10.32   8.27  11.48  5.73  9.01   6.89  10.25  9.75  9.17  5.04  7.17

Feature set for pairs of product resources (r_{p_i}, r_{p_j}) and their relative importance; each pair additionally carries a class label (1 = complementary, 0 = non-complementary).
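The two simplest features translate almost literally into code; `has_path(a, b)` is assumed to test whether at least one path connects two attributes in EG_p:

```python
from itertools import product as cartesian

def ip(attrs_i, attrs_j, has_path):
    """IP: fraction of attribute pairs connected by at least one path."""
    all_pairs = list(cartesian(attrs_i, attrs_j))
    return sum(bool(has_path(a, b)) for a, b in all_pairs) / len(all_pairs)

def ps(attrs_i, attrs_j):
    """PS: Jaccard similarity of the two products' attribute sets."""
    si, sj = set(attrs_i), set(attrs_j)
    return len(si & sj) / len(si | sj)
```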
3.4 Classification model
With enough input samples and a well-defined feature set, the last missing step consists of fitting a classifier. A wide range of classification techniques exists for supervised learning. We compared the performance of different kinds of classifiers, ensemble and not, and random forest was the fittest model. Moreover, we validated our model and carried out hyper-parameter optimization. More details can be found in the next section.

4. EXPERIMENTS
We conducted experiments to assess the performance of CompLoD. Some of these experiments helped us to choose the most suitable classifier and to optimize it. But the central question, whether the designed feature set is really significant, could also be answered. Each of the single tasks of CompLoD was run on a single machine with an Intel(R) Xeon(R) CPU E3-1231@3.80GHz processor with 4 cores and 32 GB of RAM.

Dataset. We used a dataset from Amazon.com, part of the Stanford Network Analysis Project (SNAP) and made available for research. It contains products' meta-data for each category. Our experiments focus on the category Electronics, which includes a wide range of subcategories, such as Internal Hard Drives, eBook Readers or Camera Batteries. The overall number of subcategories considered is 817, whereas the number of products is ca. 0.5 million. In addition to the meta-data for each product, the dataset also contains four lists of related products: also bought, also viewed, bought together, and buy after viewing. As in [14], we assume that also bought and bought together products are complementary. Products from the remaining lists are substitutable products (we do not address the task of predicting the substitutable relationship in this work); we use them as the negative class (non-complementary). We manually verified that this assumption is not always correct: for some products, some related products appeared in both groups. Assuming that a product cannot be both substitutable and complementary to another one, we removed those ambiguous related products. As a result of this process, 11% of complementary products and 8% of non-complementary products were removed from the dataset. A sketch of this labeling step follows.
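The labeling step can be sketched as follows; the `related` field names follow the published SNAP Amazon meta-data layout, and should be treated as an assumption:

```python
def label_pairs(meta):
    """Derive labeled product pairs: also_bought/bought_together -> positive,
    also_viewed/buy_after_viewing -> negative; ambiguous pairs are dropped."""
    pos, neg = set(), set()
    for asin, record in meta.items():
        rel = record.get("related", {})
        for key in ("also_bought", "bought_together"):
            pos.update((asin, other) for other in rel.get(key, []))
        for key in ("also_viewed", "buy_after_viewing"):
            neg.update((asin, other) for other in rel.get(key, []))
    ambiguous = pos & neg        # both complementary and substitutable
    return pos - ambiguous, neg - ambiguous
```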
Building the graph. One of the initial stages of the framework consists of identifying in the text those tokens which correspond to resources published in the LoD cloud. Approx. 3,500 resources were recognized by the NER tool, which is a relatively low number. However, for 96% of the products at least one resource could be identified; the unmatched 4% were removed from the dataset. By querying DBpedia's remote SPARQL endpoint we extract the triples in which those resources appear as subject or object. We repeat this process iteratively until we obtain all resources reachable within 2 hops from the identified resources. The resulting graph has approx. 1.4 million resources.

Computing metrics. Once the knowledge graph is built, we compute the metrics explained in section 3.2. The PAF-IPF and CAF-ICF relevance scores and the informativeness of the paths are stored in the system using inverted indexes for efficient retrieval.

Classification. With the metrics computed, we build the features as explained in section 3.3 for those pairs of products which appear in the ground truth. We store the feature vectors and attach the class label, complementary or non-complementary, to each of them. The overall number of input samples is ca. 3.7 million, of which 1.4 million represent complementary pairs and 2.3 million non-complementary ones. These input samples were then split into training and test sets, keeping the distribution at 50% positive and 50% negative samples. We started with a relatively small training set (10% of the samples) and gradually increased it until we reached 90%; in this way we were able to analyze the learning rate of the classifiers. We tried several classifiers, such as Decision Trees, Random Forest, AdaBoost, Naive Bayes (Gaussian), Linear Discriminant Analysis and Quadratic Discriminant Analysis. We used the following rates computed from the confusion matrix (TP = true positive, FP = false positive, TN = true negative, FN = false negative):

    Precision(+) = TP / (TP + FP);    Precision(-) = TN / (TN + FN)
    Recall(+) = TP / (TP + FN);       Recall(-) = TN / (TN + FP)
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    F1-Measure(+) = 2 · Precision(+) · Recall(+) / (Precision(+) + Recall(+))

The label (+) focuses on the quality of predicting the positive class, whereas (-) focuses on the negative one; accuracy considers both classes jointly. The performance of the different classifiers was similar. To avoid overfitting and make sure that the model generalizes to an independent dataset, we validated our model with K-Fold and Monte Carlo cross-validation (CV). While K-Fold CV makes sure that each fold is used at least once as a test set, Monte Carlo CV randomly shuffles the whole dataset, thus generating independent partitions for each run. The final results are shown in Figure 4(A), which illustrates that the Random Forest classifier improves its accuracy with more input samples from which to learn, obtaining up to ca. 78.5% accuracy. Figure 4(B) shows that the recall for the negative class overtakes that of the positive class; note that a TN refers to a non-complementary ground-truth pair which is predicted to be non-complementary. As the very last task we performed hyper-parameter optimization using two techniques, namely Random Search (run with 30 iterations) and Grid Search (run with the top-3 parameter sets). A sketch of this stage follows.
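A condensed scikit-learn sketch of this stage (our reconstruction; the file names and the hyper-parameter grid are assumptions, as the paper does not report the searched ranges):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score, RandomizedSearchCV

# X: one row per product pair with the 12 features of Section 3.3;
# y: 1 = complementary, 0 = non-complementary (hypothetical files).
X, y = np.load("features.npy"), np.load("labels.npy")

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Monte Carlo CV: an independent random train/test partition per run.
mc = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
print("accuracy:", cross_val_score(clf, X, y, cv=mc, scoring="accuracy").mean())

# Random Search over hyper-parameters (the paper runs 30 iterations).
search = RandomizedSearchCV(clf,
                            {"max_depth": [None, 10, 20, 40],
                             "min_samples_leaf": [1, 5, 10],
                             "n_estimators": [50, 100, 200]},
                            n_iter=30, cv=5, scoring="f1", random_state=0)
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.feature_importances_)   # Gini importances (Section 3.3)
```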
Figure 4: Performance results of CompLoD in terms of (A) accuracy, (B) recall, and (C) F1-measure, as a function of the training set size (10%-90%)

Results and reflections. One of the unique advantages of our framework is that new products have a chance to be classified as complementary to other products even in the absence of a purchase history. The results are certainly not as good as in the case where transaction data or post-transaction feedback is available, but this work should open a new door for further contributions. For instance, the work of McAuley et al. [14], which uses post-transaction feedback to learn a model, obtained an accuracy of 88% on the same dataset and category using reviews and additional meta-data which we did not consider in our work, e.g. prices. (The results are not directly comparable, not only because of the different learning source but also because [14] only considers products with 20 reviews or more, which means that only 20% of the pairs of products contained in the original ground truth remain after filtering.) The performance of CompLoD is then surprisingly good, taking into account that only minimal information about the product is used, which makes it well suited to deal with the cold-start problem.

CompLoD also enables a different perspective. Whenever the relationship between two products is identified as complementary and an explanation is required, it is possible to go back from the values of the single features to the graph components and the computed graph metrics, and to provide the user with a reasonable explanation. For instance, one could use the most relevant attributes or category attributes of both products, e.g. those that maximize PAF-IPF or CAF-ICF, respectively, and the path connecting those attributes which maximizes the informativeness. Figure 5 illustrates this. All these components are a valuable source of explanation. However, assessing which of these graph components are the most useful would require an online study with real users.

Figure 5: Two products and their most relevant attributes

Computational cost and scalability. PAF-IPF and CAF-ICF have a cost of O(|P| + |A|) and O(|C| + |A|), respectively. The cost of path informativeness is higher: O(|I| + |Pred|), where Pred is the set of all vocabulary predicates. While extracting information from a large graph and computing metrics over it is certainly expensive, most of the tasks done in the single stages of the framework can be pre-computed off-line.
For instance, one could try to keep as many attributes in the graph as possible, or even try to anticipate attributes of soon-to-be-released products. The metrics can then be computed and stored at once. In this way, when a new product arrives in the system, all of its attributes will already be present in the graph, together with the informativeness scores of the semantic paths connecting them to other attributes.

5. RELATED WORK
Approaches to find complementary products have a wide range of applications, from product placement to product bundling to website structure optimization. In particular, on-line retailers make use of complementary products to provide recommendations [30, 35], which help users to discover related products while improving the volume of sales. Most of these techniques consist of mining transactional purchase data, event logs, or users' implicit feedback. However, for new products for which such information is not available or is scarce, most of these approaches are not applicable or fail altogether.

As in [14], we aim at learning the semantics of this relationship, thus making it possible to disregard transactions. Therefore, we build a knowledge graph in which products, attributes, and categories are represented as nodes interconnected by semantic paths. Knowledge acquisition is in general a very costly task. Therefore, we extract this information from DBpedia, whose usage as a knowledge base has been validated in other fields, such as that of Cross-domain Recommender Systems [6, 9, 23, 29].

Being able to learn from the semantics encoded in the graph and the patterns therein has some commonalities with Link Discovery or Link Prediction. These approaches see a knowledge graph as a statistical model: the existence of triples is represented as binary random variables which are correlated with each other. Different assumptions about the correlation lead to different models, e.g. those based on latent features [21], on graph features, or those in which variables have local interactions [20]. Our work is closer to observable graph feature models, in the sense that some of the problems addressed there, e.g. designing a feature set from a graph to fit a learning model, are similar. For instance, [1] focuses on predicting future friendship links in Social Networks using topological features. Aiming at the same goal, [5] presents and uses a combination of a large number of features. These techniques are not limited to Social Networks; they are also used in fields like Biomedicine [10] or Genomics [34].

However, there is a central difference between these approaches and ours. First, we deal here with a knowledge graph in which the structured information is integrated into a complex ontology. This means two nodes can be interconnected in several ways which are not necessarily known in advance. This differs from the kinds of graphs typically treated in the link prediction problem, e.g. social, concept, or interaction networks, whose edges are rather intended to represent a small number of relationships and are therefore less flexible in their representation. Secondly, many of the proposed features are based on local similarity indices, such as Common Neighbors, the Adamic-Adar Index, the Preferential Attachment Index, the Jaccard Index, etc., and might therefore be too localized to capture the patterns which determine the complementary relationship.

A related problem can also be found in the field of online search behavior [16]. In this area some efforts have been made to predict whether two queries posed by a user within a single session or across multiple sessions aim at performing the same task. To do so, some approaches try to model hierarchies of tasks and subtasks. For example, booking a hotel and registering for a conference venue are subtasks of planning a conference attendance. These methods are typically based on classification, as in our case, or on clustering [11, 12, 17, 33]. Since the purchase of one product and of a complementary one can be thought of as the problem of matching tasks, we consider this line of research related work.

6. CONCLUSION AND FUTURE WORK
We presented CompLoD, a framework that leverages different models and techniques to identify complementary products while only using a product's available meta-data, such as titles, descriptions and categories. By extracting structured information from the text, we manage to build a knowledge graph in which products' attributes are interconnected by semantic paths. Being able to identify complementary products from this graph rather than from transactions, as most current techniques do, solves the cold-start problem for new items. Moreover, the fact that this graph is based on DBpedia demonstrates the usefulness and value of Semantic Web standards. The prediction task in the last stage is implemented with a Random Forest classifier which predicts, for a pair of products, whether one product is complementary to the other. Such a model requires a feature set which represents properties of pairs of products; to the best of our knowledge, this is the first time that the problem of extracting this information from a semantic graph to predict the complementary relationship is addressed. Our experiments show that the classifier is able to learn the relationship from our designed feature set, thus validating it.

Although the predictive power of our system is surprisingly good, given that it is achieved only by using products' meta-data, it is important to mention that the quality of CompLoD depends on the quality of the models and tools used in the single stages. This dependency consequently leaves room for improvement. Therefore, we would like to try different NER tools and LoD datasets and investigate their qualitative impact. We could also exploit a more domain-specific LoD dataset. For this reason, the framework is extensible, i.e. all components can be extended or replaced. In the future we would like to conduct further experiments considering other categories, to strengthen the significance of our approach. Predicting whether a product is complementary to another one might still produce too many recommendation candidates for a single user. Therefore, we would like to extend our model to further refine the list of complementary products taking into account the user's taste; for this task, a ranking strategy might be more suitable than a classification one. We would also like to include features that reflect the hierarchical structure of categories and, in general, to continue searching for new features that improve the accuracy of the system.

7. REFERENCES
[1] C. Ahmed, A. ElKorany, and R. Bahgat. A supervised learning approach to link prediction in Twitter. Social Network Analysis and Mining, 6(1):24, 2016.
[2] A. Aribarg and N. Z. Foutz. Category-based screening in choice of complementary products. Journal of Marketing Research, 46(4):518-530, 2009.
[3] A. Azaria, A. Hassidim, S. Kraus, A. Eshkol, O. Weintraub, and I. Netanely. Movie recommender system for profit maximization. In 7th ACM Conf. on Recommender Systems, RecSys '13, Hong Kong, China, pages 121-128, 2013.
[4] I. Cantador, I. Fernández-Tobías, S. Berkovsky, and P. Cremonesi. Cross-domain recommender systems. In Recommender Systems Handbook, pages 919-959. 2015.
[5] W. Cukierski, B. Hamner, and B. Yang. Graph-based features for supervised link prediction. In Neural Networks (IJCNN), Int. Joint Conf. on, pages 1237-1244, July 2011.
[6] I. Fernández-Tobías, I. Cantador, M. Kaminskas, and F. Ricci. A generic semantic-based framework for cross-domain recommendation. In Proc. of the 2nd Int. Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec '11, pages 25-32, 2011.
[7] F. Gedikli and D. Jannach. Neighborhood-restricted mining and weighted application of association rules for recommenders. In Proc. of the 11th Int. Conf. on Web Information Systems Engineering, WISE '10, pages 157-165, 2010.
[8] T. Heuss, B. Humm, C. Henninger, and T. Rippl. A comparison of NER tools w.r.t. a domain-specific vocabulary. In Proc. of the 10th Int. Conf. on Semantic Systems, SEM '14, pages 100-107, 2014.
[9] M. Kaminskas, I. Fernández-Tobías, I. Cantador, and F. Ricci. Ontology-based identification of music for places. Pages 436-447, 2013.
[10] J. Katukuri, Y. Xie, V. V. Raghavan, and A. Gupta. Supervised link discovery on large-scale biomedical concept networks. In Proc. of the 2011 IEEE Int. Conf. on Bioinformatics and Biomedicine, BIBM '11, pages 562-568, 2011.
[11] L. Li, H. Deng, A. Dong, Y. Chang, and H. Zha. Identifying and labeling search tasks via query-based Hawkes processes. In Proc. of the 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, KDD '14, pages 731-740, 2014.
[12] Z. Liao, Y. Song, Y. Huang, L.-w. He, and Q. He. Task trail: An effective segmentation of user search behavior. IEEE Transactions on Knowledge and Data Engineering, 26(12):3090-3102, Dec 2014.
[13] F. Manola, E. Miller, and B. McBride. RDF Primer. W3C Recommendation, 2004.
[14] J. J. McAuley, R. Pandey, and J. Leskovec. Inferring networks of substitutable and complementary products. In Proc. of the 21st ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, pages 785-794, 2015.
[15] J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In Proc. of the 38th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Santiago, Chile, pages 43-52, 2015.
[16] R. Mehrotra, P. Bhattacharya, and E. Yilmaz. Uncovering task based behavioral heterogeneities in online search behavior. In Proc. of the 39th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR '16, pages 1049-1052, 2016.
[17] R. Mehrotra and E. Yilmaz. Towards hierarchies of search tasks & subtasks. In Proc. of the 24th Int. Conf. on World Wide Web, WWW '15 Companion, pages 73-74, 2015.
[18] R. Meymandpour and J. G. Davis. Recommendations using linked data. In Proc. of the 5th Ph.D. Workshop on Information and Knowledge Management, PIKM 2012, Maui, HI, USA, pages 75-82, 2012.
[19] I. C. Mogotsi. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, Cambridge, England, 2008, 482 pp, ISBN: 978-0-521-86571-5. Inf. Retr., 13(2):192-195, 2010.
[20] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proc. of the IEEE, 104(1):11-33, Jan 2016.
[21] M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: Scalable machine learning for linked data. In Proc. of the 21st Int. Conf. on World Wide Web, WWW '12, pages 271-280, 2012.
[22] D. Nikovski and V. Kulev. Induction of compact decision trees for personalized recommendation. In Proc. of the 2006 ACM Symposium on Applied Computing, SAC '06, pages 575-581, 2006.
[23] V. C. Ostuni, T. Di Noia, E. Di Sciascio, S. Oramas, and X. Serra. A semantic hybrid approach for sound recommendation. In Proc. of the 24th Int. Conf. on World Wide Web, WWW '15 Companion, pages 85-86, 2015.
[24] B. Pathak, R. Garfinkel, R. D. Gopal, R. Venkatesan, and F. Yin. Empirical analysis of the impact of recommender systems on sales. Journal of Management Information Systems, 27(2):159-188, 2010.
[25] J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009.
[26] G. Pirrò. REWOrD: Semantic relatedness in the web of data. In Proc. of the 26th AAAI Conf. on Artificial Intelligence, Toronto, Ontario, Canada, 2012.
[27] T. Raeder and N. V. Chawla. Market basket analysis with networks. Social Network Analysis and Mining, 1(2):97-113, 2011.
[28] F. Ricci, L. Rokach, and B. Shapira, editors. Recommender Systems Handbook. 2015.
[29] P. Ristoski, M. Schuhmacher, and H. Paulheim. Using graph metrics for linked open data enabled recommender systems. In E-Commerce and Web Technologies - 16th Int. Conf., EC-Web 2015, Valencia, Spain, Revised Selected Papers, pages 30-41, 2015.
[30] A. Sharma, J. M. Hofman, and D. J. Watts. Estimating the causal impact of recommendation systems from observational data. In Proc. of the 16th ACM Conf. on Economics and Computation, EC '15, pages 453-470, 2015.
[31] G. Suchacka and G. Chodak. Using association rules to assess purchase probability in online stores. Information Systems and e-Business Management, pages 1-30, 2016.
[32] I. M. A. O. Swesi, A. A. Bakar, and A. S. A. Kadir. Mining positive and negative association rules from interesting frequent and infrequent itemsets. In Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th Int. Conf. on, pages 650-655, May 2012.
[33] H. Wang, Y. Song, M.-W. Chang, X. He, R. W. White, and W. Chu. Learning to extract cross-session search tasks. In Proc. of the 22nd Int. Conf. on World Wide Web, WWW '13, pages 1353-1364, 2013.
[34] Z. You, S. Zhang, and L. Li. Integration of genomic and proteomic data to predict synthetic genetic interactions using semi-supervised learning. In Proc. of the 5th Int. Conf. on Emerging Intelligent Computing Technology and Applications, ICIC '09, pages 635-644, 2009.
[35] J. Zheng, X. Wu, J. Niu, and A. Bolivar. Substitutes or complements: Another step forward in recommendations. In Proc. of the 10th ACM Conf. on Electronic Commerce, EC '09, pages 139-146, 2009.