<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning to Identify Complementary Products from DBpedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anas Alzoghbi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science University of Freiburg Georges-Köhler-Allee 051</institution>
          ,
          <addr-line>79110 Freiburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Identifying the complementary relationship between products, like a cartridge to a printer, is a very useful technique to provide recommendations. These are typically purchased together or within a short time frame, and thus online retailers benefit from it. Existing approaches rely heavily on transactions and therefore they suffer from: (1) the cold start problem for new products; (2) the inability to produce good results for infrequently bought products; (3) the inability to explain why two products are complementary. We propose a framework that aims at alleviating these problems by exploiting a knowledge graph (DBpedia) in addition to products' available information such as titles, descriptions and categories, rather than transactions. Our problem is modeled as a classification task on a set of product pairs. Our starting point is the semantic paths in the knowledge graph linking product attributes, from which we model product features. Then, having a labeled set of product pairs, we learn a model; and finally, we use this model to predict complementary products for an unseen set of products. Our experiments on a real-world dataset from Amazon show the high performance of our framework in predicting whether one product is complementary to another.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Online retailers like Amazon or eBay sell millions of
products from several categories and continuously expand their
catalogs. Being aware that browsing a large catalog can
be overwhelming for a user and can therefore have a
negative impact on revenue [
        <xref ref-type="bibr" rid="ref24 ref3">3, 24</xref>
        ], it is not a surprise that
many of these retailers are equipped with Recommender
Systems (RS) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. While recommending products based on the
user's taste is certainly an important task any RS has to
carry out, special attention has to be paid at the moment
in which a purchase is about to be made. In this
circumstance the RS can expose the user interested in a product
to other products whose use is related to their desired
product. Those products are referred to as complementary
products [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For instance, a user who placed a guitar in the
shopping cart might also be interested in purchasing a
guitar tuner, a case, or a book about learning how to play it. It
is important to notice that complementary products might
belong to a different category than that of the associated
product, e.g. Instrument Accessories, Bags &amp; Cases and
Books in the example above. This means that there are
relationships at the level of categories too. Some of
them are evident, like Instruments and Instrument
Accessories, while some of them are not immediately obvious, like
Instruments and Books.
      </p>
      <p>
        The majority of the approaches to find complementary
products make use of transactional data, e.g. by
extracting association rules or by applying other data mining
techniques [
        <xref ref-type="bibr" rid="ref22 ref27 ref31 ref32 ref7">7, 22, 27, 31, 32</xref>
        ]. This means that new catalog
products have no chance to appear as complementary to any
other product. In addition to this, these approaches might
also produce noisy recommendations for products which are
infrequently purchased and are not able to explain the
reason for the recommendation [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>Therefore, a model is required from which this relationship
between a complementary product and its associated
product can be learned. We believe that a graph representation is
the most suitable for this. Products, their attributes and the
categories to which they belong, as well as the
interconnections between them can be represented as nodes and paths
in the graph. Having such a model would not only allow
us to find links between a complementary product and its
associated product without any need for transactions, but
also to provide explanations of why this relationship exists.</p>
      <p>However, learning to discriminate between different kinds
of relationships from those interconnections within a
complex semantic graph model is still an open problem. See
for instance the feature vector illustrated in Figure 1, which
represents a pair of products (pi, pj). Each of the features
should indeed reflect a property which characterizes both
products as a whole, or a certain interaction between them.
A class label, complementary or non-complementary, is
associated with each sample. While trying to fit a
classification model which learns how to discriminate the
relationship from a set of samples like those might seem the obvious
way of solving the problem, the problem remains: what are
those features that reflect some property of pairs of
products? What do they reflect? Is this property related to the
fact that one product might be complementary to the other
one? This work aims at addressing these questions.
Contributions. The contributions of this paper can be
summarized as follows.</p>
      <p>1. We propose CompLoD, a novel and extensible
framework that predicts the complementary relationship
between two products by using only textual information
such as titles, descriptions and categories.
2. The approach does not require transactions as other
approaches do and therefore it is immune to the
cold-start problem for new items.
3. Our scheme bridges the gap of how to encode
information from a knowledge graph into a relevant feature
set to fit a classification model and predict the
complementary relationship.
4. Rather than building a knowledge graph from scratch,
we believe that Semantic Web standards have enough
to offer in this regard. Therefore, we use DBpedia1, a
well-known Linked Open Data (LoD) dataset, as the
core of the knowledge graph. This dataset is the
structured counterpart of Wikipedia and it is modeled as an
RDF graph. We thereby demonstrate the usefulness of
Semantic Web data.</p>
      <p>
        We evaluate the accuracy of our approach by conducting
experiments using Amazon data [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ] as our ground truth
and find that it produces competitive results in comparison
to those shown in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] when taking into account that no
transaction data is used. Our paper is structured as
follows: we provide the notation and definitions
used within this paper in section 2. In section 3 the
workflow of CompLoD is explained in detail. Section 4 shows
the experiments that have been carried out to validate our
system. In section 5 we present the related work. Finally,
we present our conclusions in section 6.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. NOTATION</title>
      <p>This section provides some fundamental definitions which
allow us to explain the system in detail. Our approach
requires modeling four elements within an RDF graph:
products, attributes and the categories to which products
belong, in addition to the relationships between all of these
elements. Attributes are characteristics of a product which
are extracted from a product's specification. Categories are
instead given and are classified in a taxonomy. For instance,
bronze strings is an attribute of a guitar. Moreover, a
guitar might belong to categories such as Musical instruments,
Guitars and Acoustic Guitars. Categories are hierarchically
arranged: acoustic guitars are a certain kind of guitar and
these are a certain kind of musical instrument.
1 http://wiki.DBpedia.org/</p>
      <p>
        Since RDF is used as an underlying data model, we need
to introduce some notation. Let I, B and L be the set of all
IRIs (Internationalized Resource Identifiers), the set of all
blank nodes and the set of all literals, respectively. Then a
triple (rs, p, ro) ∈ (I ∪ B) × I × (I ∪ B ∪ L) is called
an RDF triple. An RDF graph G is a set of RDF triples.
Moreover, let V be a set of variables. A tuple of the set
(I ∪ L ∪ V) × (I ∪ V) × (I ∪ L ∪ V), e.g. (?x, p, ro), is
called a triple pattern, in which variable names are prefixed
with the question mark ("?") symbol. A triple pattern can
be evaluated according to the SPARQL semantics to find all
those mappings between variables and resources from the
graph which would make of that tuple a valid triple. For a
more comprehensive description of RDF and SPARQL we
refer the reader to [
        <xref ref-type="bibr" rid="ref13 ref25">13, 25</xref>
        ].
      </p>
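The evaluation of a triple pattern described above can be sketched in a few lines. The following is a minimal illustration over a toy graph; all identifiers are made up for this example, and a real system would delegate this to a SPARQL engine:

```python
# Minimal sketch of triple-pattern matching over an RDF graph,
# modeled as a set of (subject, predicate, object) tuples.
# Identifiers and data are illustrative, not from the paper's dataset.

def is_var(term):
    """A term is a variable if it is prefixed with '?'."""
    return isinstance(term, str) and term.startswith("?")

def evaluate(pattern, graph):
    """Return all variable->resource mappings that turn
    `pattern` into a triple contained in `graph`."""
    mappings = []
    for triple in graph:
        binding = {}
        for pat_term, term in zip(pattern, triple):
            if is_var(pat_term):
                # A repeated variable must bind consistently.
                if binding.get(pat_term, term) != term:
                    binding = None
                    break
                binding[pat_term] = term
            elif pat_term != term:
                binding = None
                break
        if binding is not None:
            mappings.append(binding)
    return mappings

# Toy graph: a guitar connected to an attribute and a category.
G = {
    ("ex:guitar", "ex:attribute", "dbpedia:Bronze"),
    ("ex:guitar", "ex:category", "dbpedia:Acoustic_guitars"),
    ("ex:tuner", "ex:category", "dbpedia:Instrument_accessories"),
}

# Evaluate the triple pattern (?x, ex:category, ?y).
for m in evaluate(("?x", "ex:category", "?y"), G):
    print(m["?x"], "->", m["?y"])
```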
      <p>A path in graph G between a source resource rs and a
target resource ro is denoted by (rs, ro)(p1, p2, ..., pt), where
(p1, p2, ..., pt) is the sequence of predicates that connects rs
to ro2. A path can be recursively defined as follows:
(rs, ro)(p) is a path of length 1 if there exists a triple
(rs, p, ro) ∈ G which connects rs to ro.
(rs, ro)(p1, p2, ..., pt) is a path of length t if there exists
a resource rl such that (rs, rl)(p1, p2, ..., pt−1) is a path
of length t−1 and (rl, ro)(pt) is a path of length 1.
In our RDF graph each of the above-mentioned elements,
products, attributes and categories, corresponds to an RDF
resource. Let P = {rp1, rp2, ..., rpn} ⊆ I be a subset of all
IRIs representing a set of products, A = {ra1, ra2, ..., ram} ⊆ I
a set of products' attributes and C = {rc1, rc2, ..., rcl} ⊆ I
a set of products' categories. From now on products,
attributes and categories will denote the RDF resources which
identify these elements. Moreover, we will use rp, ra and
rc to denote an arbitrary product, attribute and category,
respectively. We assume that each product is connected to
its attributes and to the categories to which it belongs in
G. Moreover, there exist arbitrary connections in G among
attributes and categories, i.e. there are no restrictions on how
resources are connected.</p>
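The recursive path definition above amounts to a bounded search for predicate sequences. A minimal sketch over toy triples (identifiers are made up; triple direction is followed here for brevity, whereas later stages of the framework ignore edge direction):

```python
# Sketch: enumerate all predicate sequences (p1, ..., pt) up to a
# maximum length that connect `source` to `target`, following the
# recursive path definition in the text. Toy data, illustrative IRIs.

def find_paths(graph, source, target, max_len):
    """Return predicate sequences of length <= max_len connecting
    source to target via intermediate resources."""
    paths = []
    frontier = [(source, ())]  # (current resource, predicates so far)
    while frontier:
        node, preds = frontier.pop()
        if len(preds) == max_len:
            continue
        for s, p, o in graph:
            if s == node:
                if o == target:
                    paths.append(preds + (p,))
                frontier.append((o, preds + (p,)))
    return paths

G = [
    ("ex:tv", "ex:connectivity", "dbpedia:Bluetooth"),
    ("ex:speaker", "ex:connectivity", "dbpedia:Bluetooth"),
    ("dbpedia:Bluetooth", "ex:type", "dbpedia:Wireless"),
]

# Directed paths of length <= 2 from the TV to the Wireless resource.
print(find_paths(G, "ex:tv", "dbpedia:Wireless", max_len=2))
```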
      <p>Problem statement. Given two products (rpi, rpj) in G,
we would like to predict whether rpj is complementary to
rpi using G. Let comp(rpi, rpj) be a function that returns 1
when this relationship holds, 0 otherwise. It is directional,
i.e. comp(rpi, rpj) = 1 doesn't imply that comp(rpj, rpi)
also returns 1 (see footnote 3).</p>
    </sec>
    <sec id="sec-3">
      <title>3. OUR APPROACH</title>
      <p>2 For simplicity, we assume (p1, p2, ..., pt) identifies a unique
sequence of predicates which connects rs to ro.
3 For instance, a guitar tuner is complementary to a guitar,
but the opposite does not necessarily hold true.
As previously mentioned, attributes are extracted from the
description of a product. Moreover, a category is a
structured piece of information organized as a taxonomy. The
categories go from generic (Electronics) to more specific
(Mice). However, CompLoD doesn't make use of the
taxonomy and sees all categories as independent.</p>
      <p>Together all of this information is then processed in a
pipeline-like fashion. First the product's meta-data is
transformed into a structured graph in which product attributes
and categories are matched against DBpedia resources (A).
The resulting knowledge graph is a subgraph of DBpedia,
extended with the products represented as synthetic nodes.
Being aware that not all attributes have the same role in
characterizing a product and that not all the
interconnections are able to provide the same quality of information,
we decided to attach scores to the nodes and paths in the
graph using consolidated metrics (B). The results are used in
the next stage to build the feature set for pairs of products.
Finally, we train a classification model (C) and when this is
fitted, the framework is able to answer the question whether
one product is complementary to another one. More details
about the individual tasks are provided in the following
sections.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Building a knowledge graph</title>
      <p>
        Our framework requires a knowledge graph as the starting
point and we opted for DBpedia. In addition to the
deterministic rules which are used to build the graph, DBpedia
also has regularities with predictive power [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. To make use
of this, we need to integrate the products' meta-data into
it by extracting structured information from the text. We
therefore use AlchemyAPI, a NER tool. This identi es those
tokens which correspond to published LoD IRIs. Whereas
the tool supports the mapping to resources from several LoD
datasets4, we only store the information relative to DBpedia
and leave the possibility of exploiting multiple datasets to
future work. For instance, from the previous text some of
the identified resources are5:
      </p>
      <p>Categories</p>
      <p>Electronics: dbpedia:Electronics</p>
      <p>Computer &amp; Accessories: dbpedia:Computer, ...</p>
      <p>Attributes</p>
      <p>Logitech: dbpedia:Logitech, yago:Logitech, ...</p>
      <p>Operating system: dbpedia:Operating system, ...</p>
      <p>Microsoft Windows: dbpedia:Microsoft Windows
Mac OS X: dbpedia:Mac OS X, ...</p>
      <p>Microsoft: dbpedia:Microsoft, yago:Microsoft, ...</p>
      <p>IMac: dbpedia:IMac, yago:IMac, ...</p>
      <p>Windows Vista: dbpedia:Windows Vista, ...</p>
      <p>
        Example 2: Resources identified from the
meta-data of the product shown in example 1.
It can be observed that some of the identified resources
correspond to categories, such as dbpedia:Electronics, while
others are attributes, such as dbpedia:Logitech (brand). To keep
the approach fully automated, we treat categories as text as
well. The quality of this task highly depends on the
efficacy of the NER tool, each of which has its own limitations.
While most of the important resources are identified, some
of them, like dbpedia:Computer mouse (category), which are
important to characterize the product, are not necessarily
recognized. For more details about the expected
performance of these tools we refer the reader to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Given the products and the identified resources, the RDF
graph is built as follows:
1. First we build a basic RDF graph Gp. We create one
resource rp for each product and connect it to its
attribute IRIs. Let Arp be the bag6 of attributes
identified in the input text of the product represented by
rp. We create a triple (rp, sp, ra) linking the product
rp with each ra ∈ Arp using a synthetic predicate sp.
If an attribute ra appears more than once in Arp, we
create as many triples as the number of times ra
appears in Arp, where each triple has a different synthetic
predicate.
2. Let Crp be the set of categories to which the product
belongs. We also create triples (rp, sp, rc) that link
products with their categories.
3. We enrich Gp by extracting a subgraph from DBpedia.
4 Currently supported at the date of publication are DBpedia,
Freebase, US Census, GeoNames, UMBEL, OpenCyc,
YAGO, MusicBrainz, CIA Factbook and CrunchBase.
5 We use prefixes to shorten the namespaces:
dbpedia: http://dbpedia.org/resource/
yago: http://yago-knowledge.org/resource/
6 The tokens can appear multiple times in the text. We have
to preserve the information of how often the attribute occurs
in order to compute metrics in the next stage.</p>
      <p>This subgraph contains all DBpedia resources
identified by the NER tool and all the resources reachable
within 2 hops from them. We ignore the direction of
the edges to do so. The reason why we limit the length
to 2 is to reduce the computational cost of computing
metrics on top of the graph (more details in the next
section). We merge Gp with the extracted subgraph
to form the Extended graph EGp. Note that attributes
and categories in EGp are interconnected.</p>
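The construction of Gp and the 2-hop extraction can be sketched as follows, with a plain list of triples standing in for the DBpedia dump; all identifiers and helper names are illustrative, not from the actual implementation:

```python
# Sketch of the graph-building step: synthetic product triples (Gp)
# plus the DBpedia triples within 2 undirected hops of the
# identified resources. All identifiers are illustrative.
from collections import deque

def build_gp(products):
    """products: dict product -> list of identified attribute/category
    IRIs. Repeated resources get distinct synthetic predicates."""
    gp = set()
    for rp, resources in products.items():
        for i, r in enumerate(resources):
            gp.add((rp, f"sp{i}", r))
    return gp

def two_hop_subgraph(dbpedia, seeds):
    """Keep every DBpedia triple whose endpoints lie within 2 hops
    of a seed, ignoring edge direction as the text prescribes."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == 2:
            continue
        for s, p, o in dbpedia:
            for nxt in ((o,) if s == node else (s,) if o == node else ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
    return {(s, p, o) for s, p, o in dbpedia if s in dist and o in dist}

products = {"ex:mouse": ["dbpedia:Logitech", "dbpedia:Electronics"]}
dbpedia = [
    ("dbpedia:Logitech", "dbo:industry", "dbpedia:Computer_hardware"),
    ("dbpedia:Computer_hardware", "dct:subject", "dbpedia:Electronics"),
    ("dbpedia:Electronics", "dct:subject", "dbpedia:Technology"),
    ("dbpedia:Technology", "dct:subject", "dbpedia:Society"),
]
# The extended graph EGp merges Gp with the extracted subgraph.
egp = build_gp(products) | two_hop_subgraph(dbpedia, set(products["ex:mouse"]))
```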
      <p>EGp is then the input of the next task, in which metrics
are computed to measure the importance of an attribute or
a category for a product, as well as the importance of the
semantic paths interlinking them.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Computation of metrics</title>
      <p>In our knowledge graph every product is connected to its
attributes and categories. However, those attributes might
have a distinct relevance in describing a product or even in
describing a category (seen as a set of products). Therefore,
we require some metrics that reflect the following intuition:
How well can an attribute characterize a product? E.g.,
is equalizer more representative than rechargeable to
describe a speaker? (Section 3.2.1)</p>
      <p>How well can an attribute characterize a category? E.g.,
is 4K more representative than wireless for the
category Television &amp; Video? (Section 3.2.2)</p>
      <p>How relevant is the information carried by a path
connecting two resources? Let's assume, for instance, that
two product resources, a television (tv) and a speaker
(s), are connected through several paths of different
lengths in EGp.</p>
      <p>Is the path tv -connectivity- Bluetooth -connectivity- s that
goes through Bluetooth more important than another
one that goes through Stereo? Which of these paths
is the most informative? Does the length also play a
role? Suppose that the connection is now
tv -connectivity- Bluetooth -type- Wireless -connectivity- s. Does
the informativeness decay with the length of the path?
(Section 3.2.3)</p>
      <p>
        In order to provide answers to those questions and ful ll the
requirements we employ Term frequency-inverse document
frequency (TF-IDF)[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], a weighting scheme widely used in
Information Retrieval and text mining to determine how
important a term is to a document within a corpus. The design
of such metrics is achieved through different instantiations
of TF-IDF, i.e. we map different components of the RDF
graph to terms and documents. This requires the
consideration of a product rp as a bag of attributes {ra1, ra2, ..., rak}.
A category rc can also be thought of as a bag of attributes,
namely those used to describe products of that category.
We will explain these metrics in more detail in the following
sections.
3.2.1 (Product) Attribute Frequency - Inverse
Product Frequency (PAF-IPF)
      </p>
      <p>The representation of a product as a bag of attributes
allows us to define a weighting scheme that reflects how well
an attribute ra is able to represent or describe a product rp.
We instantiate TF-IDF using a product as a document and
its attributes as the terms within the document.</p>
      <p>PAF-IPF(ra, rp) = PAF(ra, rp) · IPF(ra)</p>
      <p>IPF(ra) = log( |P| / PF(ra) )</p>
      <p>Intuitively, PAF(ra, rp) counts the number of times a product
rp is described by an attribute ra, i.e. it counts the number
of triples that link the product to the attribute. In a similar
way PF(ra) counts the number of products that have ra as
an attribute. |P| is the cardinality of the set P, i.e. the
number of products available in G. The goal of IPF(ra) is
to reduce the score by a factor which is proportional to the
number of times the attribute is used to describe products.
If an attribute is common to all products then it cannot
strongly characterize the product.
3.2.2 (Category) Attribute Frequency - Inverse
Category Frequency (CAF-ICF)
In the same way we define the following:</p>
      <p>CAF-ICF(ra, rc) = CAF(ra, rc) · ICF(ra)</p>
      <p>ICF(ra) = log( |C| / CF(ra) )</p>
      <p>which reflects to which extent an attribute characterizes a
category. CAF(ra, rc) counts the number of times that an
attribute resource ra is used to describe a product rp that
belongs to a category rc. CF(ra) counts the number of
categories in which the attribute ra is used to describe at least
one product which belongs to that category. |C| is the
cardinality of the set C. The role of ICF(ra) is the same as
that of IPF(ra).</p>
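Both weighting schemes reduce to simple counts over the bags of attributes. A sketch with toy data and hypothetical helper names, not the paper's implementation:

```python
# Sketch of PAF-IPF and CAF-ICF: a product (or a category) is a
# document and its attributes are the terms. Toy data only; real
# counts would come from triples in the EGp graph.
import math

products = {                      # product -> bag of attributes
    "speaker": ["equalizer", "rechargeable", "rechargeable"],
    "headset": ["rechargeable", "microphone"],
}
category_of = {"speaker": "Audio", "headset": "Wearables"}

def paf_ipf(ra, rp):
    paf = products[rp].count(ra)                       # PAF(ra, rp)
    pf = sum(1 for a in products.values() if ra in a)  # PF(ra)
    return paf * math.log(len(products) / pf)          # PAF * IPF

def caf_icf(ra, rc):
    cats = set(category_of.values())
    # CAF: occurrences of ra across products of category rc
    caf = sum(a.count(ra) for rp, a in products.items()
              if category_of[rp] == rc)
    # CF: number of categories with at least one product using ra
    cf = sum(1 for c in cats
             if any(ra in a and category_of[rp] == c
                    for rp, a in products.items()))
    return caf * math.log(len(cats) / cf)              # CAF * ICF

# 'rechargeable' describes every product and appears in every
# category, so both scores collapse to log(1) = 0: it cannot
# strongly characterize the speaker or the Audio category.
print(paf_ipf("rechargeable", "speaker"))  # 0.0
```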
      <p>3.2.3 Path Informativeness</p>
      <p>
        In addition to the two metrics introduced above, we
require another one which re ects the degree of
informativeness a path carries with it. We use the de nition of path
informativeness presented in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], which we tailored for our
needs. This concept is based on a similar instantiation of
TF-IDF. RDF resources r are used as documents. However,
two kinds of documents are considered: those in which r
appears as the subject of a triple and those in which it has
the role of an object. This allows us to define the following
metric, Predicate Frequency - Inverse Triple Frequency (PF-ITF):
      </p>
      <p>I-PF-ITF(p, r) = I-PF(p, r) · ITF(p)</p>
      <p>O-PF-ITF(p, r) = O-PF(p, r) · ITF(p)</p>
      <p>ITF(p) = log( |T| / |T(p)| )</p>
      <p>I-PF-ITF(p, r) is the Input PF-ITF of a term p within the
document resource r and O-PF-ITF(p, r) is the Output
PF-ITF of a term p within the document resource r. I-PF(p, r)
counts the number of resources that can be mapped to the
variable ?x of the triple pattern (?x, p, r). This is also the
case for O-PF(p, r), in which the possible mappings for ?y
in (r, p, ?y) are counted. |T| is the overall number of triples
whereas |T(p)| is the number of triples in which p appears.
The role of ITF(p) is the same as for the previous
instantiations.</p>
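These counts translate directly into code. A sketch over a toy triple set, with illustrative predicate names:

```python
# Sketch of PF-ITF: predicates are the terms; each resource yields
# two documents, one for its subject role (O-PF) and one for its
# object role (I-PF). Toy triples with made-up names.
import math

T = [
    ("tv", "connectivity", "Bluetooth"),
    ("speaker", "connectivity", "Bluetooth"),
    ("Bluetooth", "type", "Wireless"),
]

def i_pf(p, r):
    # number of mappings for ?x in the triple pattern (?x, p, r)
    return sum(1 for s, q, o in T if q == p and o == r)

def o_pf(p, r):
    # number of mappings for ?y in the triple pattern (r, p, ?y)
    return sum(1 for s, q, o in T if q == p and s == r)

def itf(p):
    return math.log(len(T) / sum(1 for _, q, _ in T if q == p))

def i_pf_itf(p, r):
    return i_pf(p, r) * itf(p)

def o_pf_itf(p, r):
    return o_pf(p, r) * itf(p)

# Two triples point at Bluetooth via 'connectivity'.
print(i_pf("connectivity", "Bluetooth"))
```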
      <p>This metric allows us to define the informativeness of a path.
For a path (ri, rj)(p) of length 1 the informativeness I is
defined as:</p>
      <p>I(ri, rj)(p) = ( O-PF-ITF(p, ri) + I-PF-ITF(p, rj) ) / 2</p>
      <p>For paths of length t the informativeness is defined as the
sum of the informativeness of its components divided by the
length of the path. For a path (ri, rj)(p1, ..., pt−1, pt):</p>
      <p>I(ri, rj)(p1, ..., pt−1, pt) = ( I(ri, rk)(p1) + ... + I(rq, rs)(pt−1) + I(rs, rj)(pt) ) / t</p>
      <p>Given the previous definitions, it is possible to define Imax,
i.e. the maximum informativeness carried by a path
connecting two resources ri and rj.</p>
      <p>Imax(ri, rj) = max{ I(ri, rj)π1, ..., I(ri, rj)πk },
where πi is an arbitrary path. Each pair of paths πi and πj
is different, i.e. there exists at least one predicate in which the
two paths differ. Similarly, it is possible to define Iavg,
the average informativeness of all paths between ri and rj.</p>
      <p>Iavg(ri, rj) = ( I(ri, rj)π1 + ... + I(ri, rj)πk ) / k
To summarize: the metrics allow us to determine the
importance of attributes in describing a product or a
category, or to measure the amount of informativeness carried by
the semantic paths connecting two resources. These are
designed considering different granularities for the definition of
document. In the case of PAF-IPF a document is a product
and the terms are the attributes that describe the product.
In the second case, CAF-ICF, a document is a category and
the terms are the attributes used to describe products of
that category. In the case of informativeness, documents are
subject or object resources whereas predicates are the terms.
CompLoD computes the relevance metrics in EGp by
iterating over the set of identified attributes and then it proceeds
by searching for paths linking them.</p>
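Given per-edge PF-ITF scores, the path informativeness and the Imax/Iavg aggregates reduce to a few lines. A sketch in which the per-hop scores are made-up numbers rather than values computed from a real graph:

```python
# Sketch of path informativeness: a path is represented as a list of
# (o_pf_itf, i_pf_itf) score pairs, one per predicate hop. The
# per-hop scores below are made-up numbers for illustration.

def informativeness(path):
    """Mean of per-hop informativeness; each hop contributes
    (O-PF-ITF at its source + I-PF-ITF at its target) / 2."""
    hops = [(o + i) / 2 for o, i in path]
    return sum(hops) / len(hops)

def i_max(paths):
    """Maximum informativeness over all paths between two resources."""
    return max(informativeness(p) for p in paths)

def i_avg(paths):
    """Average informativeness over all paths between two resources."""
    scores = [informativeness(p) for p in paths]
    return sum(scores) / len(scores)

# Two hypothetical paths between a TV and a speaker: a direct hop
# through Bluetooth, and a longer detour whose score is diluted.
via_bluetooth = [(0.8, 0.6)]             # length 1
via_wireless = [(0.8, 0.4), (0.3, 0.5)]  # length 2
print(round(i_max([via_bluetooth, via_wireless]), 3))
```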
      <p>All these metrics have been validated by other works and
have been used to solve a variety of problems. However,
when one focuses again on the problem illustrated in
figure 1, i.e. the problem of designing a feature set that
applies to pairs of products, one can notice that these metrics
cannot be used directly as features. For instance, PAF-IPF
and CAF-ICF are metrics which apply to a single product.
While nothing prevents one from using them as features for
pairs, it is not possible to have each possible attribute as a
feature. This would lead to an extremely high-dimensional
feature vector. And whereas the path informativeness
between two products' attributes could be thought of as a
feature for pairs of products, one cannot have each possible
path as a feature either, for the same reason explained above
(high-dimensionality).</p>
      <p>We propose therefore a strategy to combine these metrics,
following a very natural intuition which turned out to
produce useful results. This allows one to design a
low-dimensional feature vector which can be used to model pairs
of products.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Feature engineering</title>
      <p>
        To the best of our knowledge there is no work which
addresses the issue of which features for pairs of products
(rpi, rpj) can be used to predict the complementary
relationship. We aim at filling this gap. Our rationale is based on
the observation that for products belonging to different
domains there might exist correspondences and dependencies
between their attributes. This has been validated in
marketing, behavioral, and data mining research studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our
hypothesis is that (1) the correspondences and dependencies
also tell us something about the complementary
relationship, and (2) this correspondence can be measured by the
informativeness carried in the interconnections between
attributes, weighted by those attributes' relative importance.
We therefore explain in this section how to leverage the
interconnections of the knowledge graph and combine the
metric scores in a meaningful way.
      </p>
      <p>Let Arpi and Arpj be the bags of attributes of products
rpi and rpj. To simplify the explanation of the formulas to
calculate the features we will from now on assume that rai
and raj are attributes of rpi and rpj, respectively.
Moreover, we also assume that rci and rcj are the most specific
categories of rpi and rpj. For the sake of readability we
introduce two weighting functions ωP and ωC, shown in figure 3,
which use PAF-IPF and CAF-ICF, respectively. In this
way, we can weigh a path between two attributes rai and
raj according to their characterization relevance. Next, we
will present the design choices we made.</p>
      <p>Maximal Informativeness over the most relevant
product attributes (MIMPA): this feature combines PAF-IPF
and the concept of informativeness as follows. Given a
product rpi, we first search for the attribute rai that has the
maximum PAF-IPF(rai, rpi) among its attributes. We do
the same for rpj. Let raiMP and rajMP be these attributes
for rpi and rpj, respectively. These are the most relevant
product attributes. The feature is then designed as follows:
MIMPA(rpi, rpj) = Imax(raiMP, rajMP)
Recall that when computing Imax(raiMP, rajMP) all possible
paths which connect raiMP to rajMP are considered and, from
these, it returns the maximum informativeness value.</p>
      <p>Maximal Informativeness over the most relevant
category attributes (MIMCA): we focus again on the
attribute bags of both products. However, rather than
searching for those which characterize the products best, we
look at the most specific category of each product and
leverage CAF-ICF to find the attribute which characterizes
this category best. Let raiMC and rajMC be these attributes
for rpi and rpj, respectively. Then:
MIMCA(rpi, rpj) = Imax(raiMC, rajMC) · ωC(raiMC, rajMC)
where rci and rcj are the categories of products rpi and rpj,
respectively.</p>
      <p>Average Informativeness over the most relevant
product attributes (AIMPA): this feature works in a similar
way as MIMPA. Once the most relevant attributes of the
products, raiMP and rajMP, are identified, all paths connecting
these resources are considered. Therefore, the average
informativeness Iavg is used and not only the score of the
most informative path.</p>
      <p>AIMPA(rpi, rpj) = Iavg(raiMP, rajMP)
Average Informativeness over the most relevant
category attributes (AIMCA): in the same way as MIMCA,
we consider here the attributes which characterize the
category the most, but we use Iavg instead of Imax, i.e. we
consider the average informativeness of all paths connecting
them.</p>
      <p>AIMCA(rpi, rpj) = Iavg(raiMC, rajMC)
Average of maximal informativeness over all attributes
(AMIO): rather than considering the most relevant
attribute for each product, we focus here on the
informativeness flowing from one product (from all of its attributes) to
the other one. To do so we consider all pairs of attributes rai
and raj, which belong to rpi and rpj respectively, and
compute the maximum informativeness carried by the links of
each of these pairs. As for previous features, we also combine
the relevance of individual attributes with the
informativeness of the link.</p>
      <p>Let P(rpi, rpj) = {(rai1, raj1), (rai1, raj2), ..., (raik, rajl)} be
the set of attribute tuples, where the first element is an
attribute of rpi and the second of rpj. Then:</p>
      <p>AMIO_P(rpi, rpj) = (1 / |P(rpi, rpj)|) · Σ Imax(rai, raj) · ωP(rai, raj),
with the sum ranging over all (rai, raj) ∈ P(rpi, rpj).
The weighting functions shown in figure 3 are:
ωP(rai, raj) = ( PAF-IPF(rai, rpi) + PAF-IPF(raj, rpj) ) / 2
ωC(rai, raj) = ( CAF-ICF(rai, rci) + CAF-ICF(raj, rcj) ) / 2</p>
Average of average informativeness over all attributes
(AAIO): this feature is computed in the same way as AMIO,
except that Iavg is used instead of Imax. In this way we
consider the complete ow of informativeness between each pair
of attributes.</p>
      <p>Maximal relevance of common product attributes
(MRCPA): considering only the common attributes
of both products, Arpi ∩ Arpj, we compute the PAF-IPF of
the elements of the intersection and return the maximum
value.</p>
      <p>MRCPA(rpi, rpj) = max{ ωP(rai, raj) : rai = raj ∈ (Arpi ∩ Arpj) }</p>
      <p>Maximal relevance of common category attributes
(MRCCA): this is the symmetrical counterpart of
MRCPA in which one considers the contribution of common
attributes in characterizing the product categories.</p>
      <p>MRCCA(rpi, rpj) = max{ ωC(rai, raj) : rai = raj ∈ (Arpi ∩ Arpj) }</p>
      <p>Main product relevance (MPR): this metric measures
the relevance of the main product (first element of the pair
in the input sample) on the basis of its attributes. Let
|(ra, ?p, ?o)| be the number of triples having ra as subject
and |(?s, ?p, ra)| be the number of triples having ra as object.
Then:</p>
      <p>MPR(rpi) = Σ over ra ∈ Arpi of ( |(ra, ?p, ?o)| + |(?s, ?p, ra)| )</p>
      <p>Related product relevance (RPR): this metric is the
same as the previous one, but applied to the related product
(second element of the pair in the input sample).</p>
      <p>Interconnection between products (IP): this feature
aims at reflecting the degree of connectivity between the
attributes of two products. Let φ be the number of attribute
pairs (rai, raj) ∈ P(rpi, rpj) for which there exists at least one
path connecting them. Then:</p>
      <p>IP(rpi, rpj) = φ / |P(rpi, rpj)|</p>
      <p>Product similarity (PS): to measure the degree of
similarity, we apply the Jaccard similarity to the sets of attributes
of both products:</p>
      <p>PS(rpi, rpj) = |Arpi ∩ Arpj| / |Arpi ∪ Arpj|</p>
Other metrics: We designed and tried several other
features whose contribution was small or minimal.</p>
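To make two of the pairwise features concrete, here is a sketch of IP and PS for a toy pair of products; the attribute sets are made up, and `connected` is a stand-in for a real path search over the knowledge graph:

```python
# Sketch of two pairwise features: IP (fraction of attribute pairs
# connected by at least one path) and PS (Jaccard similarity of the
# attribute sets). Toy attribute sets; `connected` is a hypothetical
# oracle replacing an actual path search over EGp.
from itertools import product as cross

A_i = {"Logitech", "Wireless", "USB"}
A_j = {"Windows", "USB"}

connected = {("Wireless", "Windows"), ("USB", "USB")}  # made up

def ip_feature(a_i, a_j, connected):
    pairs = list(cross(a_i, a_j))
    linked = sum(1 for pair in pairs if pair in connected)
    return linked / len(pairs)

def ps_feature(a_i, a_j):
    return len(a_i & a_j) / len(a_i | a_j)

print(ip_feature(A_i, A_j, connected))  # 2 of the 6 pairs are linked
print(ps_feature(A_i, A_j))             # |{USB}| over a union of 4
```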
      <p>AMIO combines the characterization power of the attributes
in both roles, as product attributes and as category attributes.
AMIO_P averages, over all attribute pairs, the informativeness
weighted by the product-attribute relevance:</p>
      <p>AMIO_P(rpi, rpj) = (1 / |P(rpi,rpj)|) · Σ_{(rai,raj) ∈ P(rpi,rpj)} Imax(rai, raj) · ωP(rai, raj)</p>
      <p>AMIO_C is defined analogously with ωC, and the two are averaged:</p>
      <p>AMIO(rpi, rpj) = (AMIO_P(rpi, rpj) + AMIO_C(rpi, rpj)) / 2</p>
      <p>In this section we introduced the features which played an
important role in predicting whether one product is
complementary to another one. All 12 features are the result
of a feature selection task performed to boost the prediction
accuracy of the classifiers. They are shown on top of the
next page together with their relative importance7.</p>
    </sec>
    <sec id="sec-7">
      <title>Classification model</title>
      <p>With enough input samples and a well-defined feature set,
the last missing step consists of fitting a classifier. A wide
range of classification techniques exists for supervised
learning. We compared the performance of different kinds of
classifiers, both ensemble and non-ensemble, and random forest
proved to be the best-performing model. Moreover, we validated
our model and carried out hyper-parameter optimization. More
details can be found in the next section.</p>
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTS</title>
      <p>We conducted experiments to assess the performance of
CompLoD. Some of these experiments helped us to choose
the most suitable classifier and to optimize it. But the
central question, whether the designed feature set is really
significant, could also be answered. Each of the single tasks
of CompLoD was run on a single machine with an
Intel(R) Xeon(R) CPU E3-1231@3.80GHz processor with 4 cores and
32 GB of RAM.</p>
      <p>
        Dataset. We used a dataset from Amazon.com8. It
contains products' meta-data for each category. Our
experiments focus on the category Electronics, which includes a
wide range of subcategories, such as Internal Hard Drives,
eBook Readers, or Camera Batteries. The overall
number of subcategories considered is 817, and the number of
products is ca. 0.5 million. In addition to the meta-data for
each product, the dataset also contains four lists of related
products: also bought, also viewed, bought together, and buy after
viewing. As in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] we assume that also bought and bought
together products are complementary. Products from the
remaining lists are substitutable products9. We use them
as the negative class (non-complementary). We manually
verified that this assumption is not always correct: for some
products, some related products appeared in both groups.
      </p>
      <p>Assuming that a product cannot be both substitutable and
complementary to another one, we removed those ambiguous
related products. As a result of this process, 11% of the
complementary products and 8% of the non-complementary products
were removed from the dataset.</p>
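The de-duplication step above amounts to removing the intersection of the two label sets, as in this sketch (the function name and pair representation are our own illustration):

```python
def remove_ambiguous(complementary, substitutable):
    """Drop product pairs labeled both complementary and substitutable,
    keeping the two ground-truth classes disjoint."""
    ambiguous = complementary & substitutable
    return complementary - ambiguous, substitutable - ambiguous
```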
      <p>Building the graph. One of the initial stages of the
framework consists of identifying in the text those tokens which
correspond to resources published in the LoD cloud.
Approx. 3500 resources were recognized by the NER tool,
which is a relatively low number. However, for 96% of the
products at least one resource could be identified. The
unmatched 4% were removed from the dataset. By querying
DBpedia's remote SPARQL endpoint we extract the triples
in which those resources appear as subject or object of a
triple. We repeat this process iteratively until we obtain
all resources reachable within 2 hops of the identified
resources. The resulting graph has approx. 1.4 million
resources.
7The importance is computed as the normalized total
reduction of the criterion represented by that feature (Gini
importance of random forests).
8The dataset is part of the "Stanford Network
Analysis Project (SNAP)" and is made available for research.
9We are not addressing in this work the task of predicting
the substitutable relationship.</p>
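The iterative 2-hop expansion can be sketched as a breadth-first traversal over triples. This toy version assumes the triples have already been fetched into memory; the actual framework issues repeated SPARQL queries to the remote endpoint instead:

```python
def expand_neighborhood(seeds, triples, hops=2):
    """Collect all resources reachable within `hops` hops of the seed
    resources, following triples in both directions (subject <-> object)."""
    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        nxt = set()
        for s, _p, o in triples:
            if s in frontier:
                nxt.add(o)
            if o in frontier:
                nxt.add(s)
        # only freshly discovered resources seed the next hop
        frontier = nxt - reached
        reached |= frontier
    return reached
```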
      <p>Computing metrics. Once the knowledge graph is built,
we compute the metrics explained in section 3.2. The
PAF-IPF and CAF-ICF relevance scores and the informativeness
of the paths are efficiently stored in the system using
inverted indexes for easy retrieval.</p>
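Such an index can be as simple as a two-level mapping from attribute to metric name to precomputed score, so that feature computation later looks values up in constant time. A minimal sketch (class and metric names are our own, not the paper's implementation):

```python
from collections import defaultdict

class ScoreIndex:
    """Minimal inverted index: attribute -> metric name -> precomputed
    relevance score, enabling O(1) lookups during feature computation."""

    def __init__(self):
        self._idx = defaultdict(dict)

    def put(self, attribute, metric, value):
        self._idx[attribute][metric] = value

    def get(self, attribute, metric, default=0.0):
        return self._idx[attribute].get(metric, default)
```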
      <p>Classification. With the metrics computed, we compute
the features as explained in section 3.3 for those pairs of
products which appear in the ground truth. We store the
feature vectors and attach the class label to each,
complementary or non-complementary. The overall number of
input samples is ca. 3.7 million, of which 1.4 million
represent complementary pairs and 2.3 million the
non-complementary counterpart. These input samples were then
split into training and test sets. We kept the distribution
at 50% positive and 50% negative samples. We started
with a relatively small number of samples for the training
set (10%) and gradually increased it until we reached
90%. In this way we were able to analyze the learning rate
of the classifiers. We compared the performance of several of
them, namely Decision Trees, Random Forest, AdaBoost,
Naive Bayes (Gaussian), Linear Discriminant Analysis and
Quadratic Discriminant Analysis. We used the following
rates computed from the confusion matrix (TP = true
positive, FP = false positive, TN = true negative, FN = false
negative):</p>
      <p>Precision(+) = TP / (TP + FP),  Recall(+) = TP / (TP + FN)</p>
      <p>Precision(-) = TN / (TN + FN),  Recall(-) = TN / (TN + FP)</p>
      <p>Accuracy = (TP + TN) / (TP + TN + FP + FN)</p>
      <p>F1-Measure(+) = 2 · Precision(+) · Recall(+) / (Precision(+) + Recall(+))</p>
      <p>The label (+) in a metric indicates the quality of
predicting the positive class, whereas (-) refers to the
negative one. Accuracy measures the fraction of correctly
classified pairs over both classes. The performance of
the different classifiers was similar. To avoid overfitting
and make sure that the model generalizes to an
independent dataset, we validated our model with K-Fold and Monte
Carlo cross-validation (CV). While K-Fold CV makes sure
that each of the folds is used at least once as a test set,
Monte Carlo CV randomly shuffles the whole dataset, thus
generating independent partitions for each run. The final
results are shown in Figure 4(A), which illustrates that the
Random Forest classifier improves its accuracy with more
input samples from which it can learn, reaching
up to ca. 78.5% accuracy. Panel (B) of the same figure shows that
the recall for the negative class overtakes that of the
positive class. It is important to mention that a TN refers to a
non-complementary ground-truth pair which is predicted to
be non-complementary. As the very last task we performed
hyper-parameter optimization by exploiting two techniques,
namely Random Search (run with 30 iterations) and Grid
Search (run with the top-3 parameter set).</p>
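The evaluation rates above follow directly from the four confusion-matrix entries, as in this short sketch (function and key names are our own):

```python
def classification_rates(tp, fp, tn, fn):
    """Per-class precision/recall, F1 for the positive class, and
    accuracy, computed from a binary confusion matrix."""
    precision_pos = tp / (tp + fp)
    recall_pos = tp / (tp + fn)
    precision_neg = tn / (tn + fn)
    recall_neg = tn / (tn + fp)
    f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"P+": precision_pos, "R+": recall_pos,
            "P-": precision_neg, "R-": recall_neg,
            "F1+": f1_pos, "acc": accuracy}
```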
      <p>Results and reflections. One of the unique advantages
of our framework is that new products have a chance to be
classified as complementary to other products even in the
absence of any previous purchase history of the
product.</p>
      <p>[Figure 4: (A) positive precision, negative precision and accuracy; (B) positive and negative recall and F1-measure, as a function of training-set size.]</p>
      <p>
        [Table: Feature set for pairs of product resources (rpi, rpj) and their feature importance.]
        The results are certainly not as good as in the case
for which transaction data or post-transaction feedback is
available, but this work should open a new door for
further contributions. For instance, the work of McAuley et
al. in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which uses post-transaction feedback to learn a
model, obtained an accuracy of 88% on the same dataset
and category using reviews and additional meta-data which
we did not consider in our work, e.g. prices10. The
performance of CompLoD is then surprisingly good, taking into
account that only minimal information about the product is
used, therefore making it well-suited to deal with the
cold-start problem.
      </p>
      <p>CompLoD also enables a different perspective. Whenever
the relationship between two products is identified as
complementary and an explanation is required, it is possible to
go back from the values of the single features to the graph
components and the values of the computed graph metrics,
and to provide the user with a reasonable explanation. For
instance, one could use the most relevant attributes or
category attributes of both products, e.g. those that maximize
PAF-IPF or CAF-ICF, respectively, and the path connecting
those attributes which maximizes the informativeness.
Figure 5 illustrates this. All these components are a valuable
source of explanation. However, assessing which of these
graph components are the most useful would require an
online study with real users.</p>
      <p>Computational cost and scalability. PAF-IPF and
CAF-ICF have a cost of O(|P|+|A|) and O(|C|+|A|), respectively.</p>
      <p>The cost of path informativeness is higher: O(|I|+|Pred|),
where Pred is the set of all vocabulary predicates. While
it is obvious that extracting information from a large graph
and computing metrics over it is certainly expensive, most
of the tasks done in the single stages of the framework can
be pre-computed off-line. For instance, one could try to keep
as many attributes in the graph as possible, or even try to
anticipate attributes of soon-to-be-released products. The
10The results differ not only in terms of the
learning source but also in the fact that they only considered in
their experiments products with 20 reviews or more, which
means that only 20% of the pairs of products contained in
the original ground truth remain after filtering those
products out.
metrics can then be computed and stored at once. In this
way, when a new product arrives in the system, all of its
attributes will already be present in the graph, together with
the informativeness scores of the semantic paths connecting
them to other attributes.</p>
    </sec>
    <sec id="sec-9">
      <title>RELATED WORK</title>
      <p>
        Approaches to find complementary products have a wide
range of applications, from product placement to product
bundling to website structure optimization. In
particular, on-line retailers make use of complementary products
to provide recommendations [
        <xref ref-type="bibr" rid="ref30 ref35">30, 35</xref>
        ], which help users
to discover related products while improving the volume of
sales. Most of these techniques typically consist of mining
transactional purchase data, event logs, or users' implicit
feedback. However, for new products for which such
information is not available or is scarce, most of these approaches
are not applicable or fail altogether.
      </p>
      <p>
        As in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we aim at learning the semantics of this
relationship, thus making it possible to disregard transactions.
Therefore, we build a knowledge graph in which products,
attributes, and categories are represented as nodes
interconnected by semantic paths. Knowledge acquisition is in
general a very costly task. Therefore, we extract this
information from DBpedia, whose usage as a knowledge base has
been validated in other fields, such as that of Cross-domain
Recommender Systems [
        <xref ref-type="bibr" rid="ref23 ref29 ref6 ref9">6, 9, 23, 29</xref>
        ].
      </p>
      <p>
        Being able to learn from the semantics encoded in the
graph and the patterns therein has some commonalities with
Link Discovery or Link Prediction. These approaches see
a knowledge graph as a statistical model. The existence
of triples is represented in these models as binary random
variables which are correlated with each other. Different
assumptions about the correlation lead to different models, e.g.
based on latent features [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], graph features, or those in
which variables have local interactions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Our work is
closer to observable graph feature models, in the sense that
some of the problems addressed there, e.g. designing a
feature set from a graph to fit a learning model, are similar. For
instance, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focuses on predicting future friendship links in
Social Networks using topological features. Aiming at the
same goal, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] presents and uses a combination of a large
number of features. These techniques are not necessarily
limited to Social Networks, but are also used in fields like
Biomedicine [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or Genomics [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
      </p>
      </p>
      <p>However, there is a central difference between these
approaches and ours. First, we deal here with a knowledge
graph in which the structured information is integrated into
a complex ontology. This means two nodes can be
interconnected in several ways which are not necessarily known
in advance. This differs from the kinds of graphs typically
treated in the link prediction problem, e.g. social, concept,
or interaction networks, whose edges are rather intended to
represent a small number of relationships and are therefore less
flexible in their representation. Secondly, many of the
features proposed there are based on local similarity indices, such as
Common Neighbors, Adamic-Adar Index, Preferential
Attachment Index, Jaccard Index, etc., and might therefore be
too localized to capture the patterns which determine the
complementary relationship.</p>
      <p>
        A related problem can also be found in the field of online
search behavior [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In this area some efforts have been
made to predict whether two queries posed by a user within
a single session or across multiple sessions aim at performing the same
task. To do so, some approaches try to model hierarchies
of tasks and subtasks. For example, booking a hotel and
registering for a conference venue are subtasks of planning a
conference attendance. These methods are typically based
on classification, as in our case, or clustering [
        <xref ref-type="bibr" rid="ref11 ref12 ref17 ref33">11, 12, 17,
33</xref>
        ]. Since purchasing one product and a complementary
one can be thought of as the problem of matching tasks, we
consider this line of research related work.
      </p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>We presented CompLoD, a framework that leverages
different models and techniques to identify complementary
products while only using a product's available meta-data, such
as titles, descriptions and categories. By extracting
structured information from the text, we manage to build a
knowledge graph in which products' attributes are interconnected by
semantic paths. Being able to identify complementary
products from this graph, rather than using transactions as most
current techniques do, solves the cold-start problem
for new items. Moreover, the fact that this graph is based on
DBpedia demonstrates the usefulness and value of Semantic
Web standards. The prediction task in the last stage is
implemented with a Random Forest classifier which predicts,
for a pair of products, whether one product is
complementary to the other. However, such a model requires a feature
set which represents properties of pairs of products. To the
best of our knowledge, this is the first time that the
problem of extracting this information from a semantic graph to
predict the complementary relationship has been addressed. Our
experiments show that the classifier is able to learn the
relationship from our designed feature set, thus validating it.</p>
      <p>Although the predictive power of our system is
surprisingly good, given that it is achieved only by using
products' meta-data, it is important to mention that the quality
of CompLoD depends on the quality of the models and tools
used in the single stages. This dependency consequently
leaves room for improvement. Therefore, we would like to
try different NER tools and LoD datasets and investigate their
qualitative impact. We could also exploit a more
domain-specific LoD dataset. For this reason, the framework is
extensible, i.e. all components can be extended or replaced.</p>
      <p>In the future we would like to conduct further experiments
considering other categories, to strengthen the significance
of our approach. Predicting whether a product is
complementary to another one might still produce too many
recommendation candidates for a single user. Therefore, we would
like to extend our model to further refine the list
of complementary products, taking into account the user's
taste. For this task, a ranking strategy might be more
suitable than a classification one. We would also like to include
features that reflect the hierarchical structure of categories
and, in general, to continue searching for new features that
improve the accuracy of the system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          , A. ElKorany, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Bahgat</surname>
          </string-name>
          .
          <article-title>A supervised learning approach to link prediction in twitter</article-title>
          .
          <source>Social Network Analysis and Mining</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>24</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aribarg</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. Z.</given-names>
            <surname>Foutz</surname>
          </string-name>
          .
          <article-title>Category-based screening in choice of complementary products</article-title>
          .
          <source>Journal of Marketing Research</source>
          ,
          <volume>46</volume>
          (
          <issue>4</issue>
          ):
          <volume>518</volume>
–
          <fpage>530</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassidim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eshkol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weintraub</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Netanely.</surname>
          </string-name>
          <article-title>Movie recommender system for profit maximization</article-title>
          .
          <source>In 7th ACM Conf. on Recommender Systems</source>
          , RecSys '13, Hong Kong, China, pages
          <volume>121</volume>
–
          <fpage>128</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Cantador</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Fernández-Tobías, S. Berkovsky, and</article-title>
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <article-title>Cross-domain recommender systems</article-title>
          .
          <source>In Recommender Systems Handbook</source>
          , pages
          <volume>919</volume>
–
          <fpage>959</fpage>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cukierski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Graph-based features for supervised link prediction</article-title>
          .
          <source>In Neural Networks (IJCNN)</source>
          ,
          <source>Int. Joint Conf. on</source>
          , pages
          <volume>1237</volume>
–
          <fpage>1244</fpage>
          ,
          <year>July 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Fernández-Tobías</surname>
          </string-name>
          , I. Cantador,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaminskas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>A generic semantic-based framework for cross-domain recommendation</article-title>
          .
          <source>In Proc. of the 2Nd Int. Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec '11</source>
          , pages
          <fpage>25</fpage>
–
          <fpage>32</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gedikli</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          .
          <article-title>Neighborhood-restricted mining and weighted application of association rules for recommenders</article-title>
          .
          <source>In Proc. of the 11th Int. Conf. on Web Information Systems Engineering</source>
          , WISE'
          <volume>10</volume>
          , pages
          <fpage>157</fpage>
–
          <fpage>165</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heuss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Humm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henninger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Rippl</surname>
          </string-name>
          .
          <article-title>A comparison of NER tools w.r.t. a domain-specific vocabulary</article-title>
          .
          <source>In Proc. of the 10th Int. Conf. on Semantic Systems, SEM '14</source>
          , pages
          <fpage>100</fpage>
–
          <fpage>107</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaminskas</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Fernández-Tobías, I. Cantador, and</article-title>
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>Ontology-Based Identification of Music for Places</article-title>
          , pages
          <volume>436</volume>
–
          <fpage>447</fpage>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Katukuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>Supervised link discovery on large-scale biomedical concept networks</article-title>
          .
          <source>In Proc. of the 2011 IEEE Int. Conf. on Bioinformatics and Biomedicine</source>
          ,
          <source>BIBM '11</source>
          , pages
          <fpage>562</fpage>
–
          <fpage>568</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zha</surname>
          </string-name>
          .
          <article-title>Identifying and labeling search tasks via query-based Hawkes processes</article-title>
          .
          <source>In Proc. of the 20th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, KDD '14</source>
          , pages
          <fpage>731</fpage>
–
          <fpage>740</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          , L. w. He, and
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <article-title>Task trail: An effective segmentation of user search behavior</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>26</volume>
          (
          <issue>12</issue>
          ):
          <volume>3090</volume>
–
          <fpage>3102</fpage>
          ,
          <string-name>
            <surname>Dec</surname>
          </string-name>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Manola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>McBride</surname>
          </string-name>
          .
          <article-title>RDF Primer</article-title>
          . W3C Recommendation,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>J. J. McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pandey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Inferring networks of substitutable and complementary products</article-title>
          .
          <source>In Proc. of the 21th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining</source>
          , Sydney, NSW, Australia, pages
          <fpage>785</fpage>
–
          <fpage>794</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>J. J. McAuley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Targett</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          <string-name>
            <surname>Shi</surname>
          </string-name>
          , and A. van den Hengel.
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>In Proc. of the 38th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval</source>
          , Santiago, Chile,
          <year>2015</year>
          , pages
          <fpage>43</fpage>
–
          <fpage>52</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          .
          <article-title>Uncovering task based behavioral heterogeneities in online search behavior</article-title>
          .
          <source>In Proc. of the 39th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR '16</source>
          , pages
          <fpage>1049</fpage>
          -
          <lpage>1052</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          .
          <article-title>Towards hierarchies of search tasks &amp; subtasks</article-title>
          .
          <source>In Proc. of the 24th Int. Conf. on World Wide Web, WWW '15 Companion</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meymandpour</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Recommendations using linked data</article-title>
          .
          <source>In Proc. of the 5th Ph.D. Workshop on Information and Knowledge Management</source>
          , PIKM 2012, Maui, HI, USA, pages
          <fpage>75</fpage>
          -
          <lpage>82</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I. C.</given-names>
            <surname>Mogotsi</surname>
          </string-name>
          . Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval - Cambridge University Press, Cambridge, England,
          <year>2008</year>
          , 482 pp,
          <source>ISBN: 978-0-521-86571-5</source>
          . Inf. Retr.,
          <volume>13</volume>
          (
          <issue>2</issue>
          ):
          <fpage>192</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tresp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          .
          <article-title>A review of relational machine learning for knowledge graphs</article-title>
          .
          <source>Proc. of the IEEE</source>
          ,
          <volume>104</volume>
          (
          <issue>1</issue>
          ):
          <fpage>11</fpage>
          -
          <lpage>33</lpage>
          , Jan
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tresp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.-P.</given-names>
            <surname>Kriegel</surname>
          </string-name>
          .
          <article-title>Factorizing YAGO: Scalable machine learning for linked data</article-title>
          .
          <source>In Proc. of the 21st Int. Conf. on World Wide Web, WWW '12</source>
          , pages
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nikovski</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kulev</surname>
          </string-name>
          .
          <article-title>Induction of compact decision trees for personalized recommendation</article-title>
          .
          <source>In Proc. of the 2006 ACM Symposium on Applied Computing, SAC '06</source>
          , pages
          <fpage>575</fpage>
          -
          <lpage>581</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>V. C.</given-names>
            <surname>Ostuni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Di</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Di</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <article-title>A semantic hybrid approach for sound recommendation</article-title>
          .
          <source>In Proc. of the 24th Int. Conf. on World Wide Web, WWW '15 Companion</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>86</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garfinkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Gopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Venkatesan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Yin</surname>
          </string-name>
          .
          <article-title>Empirical analysis of the impact of recommender systems on sales</article-title>
          .
          <source>Journal of Management Information Systems</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>188</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          .
          <article-title>Semantics and complexity of SPARQL</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .,
          <volume>34</volume>
          (
          <issue>3</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pirro</surname>
          </string-name>
          .
          <article-title>REWOrD: Semantic relatedness in the web of data</article-title>
          .
          <source>In Proc. of the 26th AAAI Conf. on Artificial Intelligence</source>
          , Toronto, Ontario, Canada,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Raeder</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          .
          <article-title>Market basket analysis with networks</article-title>
          .
          <source>Social Network Analysis and Mining</source>
          ,
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>113</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Shapira</surname>
          </string-name>
          , editors.
          <source>Recommender Systems Handbook</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ristoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuhmacher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Using graph metrics for linked open data enabled recommender systems</article-title>
          .
          <source>In E-Commerce and Web Technologies - 16th Int. Conf. on Electronic Commerce and Web Technologies</source>
          , EC-Web 2015, Valencia, Spain, Revised Selected Papers, pages
          <fpage>30</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hofman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Watts</surname>
          </string-name>
          .
          <article-title>Estimating the causal impact of recommendation systems from observational data</article-title>
          .
          <source>In Proc. of the 16th ACM Conf. on Economics and Computation</source>
          , EC '15, pages
          <fpage>453</fpage>
          -
          <lpage>470</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Suchacka</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Chodak</surname>
          </string-name>
          .
          <article-title>Using association rules to assess purchase probability in online stores</article-title>
          .
          <source>Information Systems and e-Business Management</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>I. M. A. O.</given-names>
            <surname>Swesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Bakar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. S. A.</given-names>
            <surname>Kadir</surname>
          </string-name>
          .
          <article-title>Mining positive and negative association rules from interesting frequent and infrequent itemsets</article-title>
          .
          <source>In Fuzzy Systems and Knowledge Discovery (FSKD)</source>
          ,
          <source>2012 9th Int. Conf. on</source>
          , pages
          <fpage>650</fpage>
          -
          <lpage>655</lpage>
          , May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          .
          <article-title>Learning to extract cross-session search tasks</article-title>
          .
          <source>In Proc. of the 22nd Int. Conf. on World Wide Web, WWW '13</source>
          , pages
          <fpage>1353</fpage>
          -
          <lpage>1364</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Z.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Integration of genomic and proteomic data to predict synthetic genetic interactions using semi-supervised learning</article-title>
          .
          <source>In Proc. of the 5th Int. Conf. on Emerging Intelligent Computing Technology and Applications</source>
          , ICIC '09, pages
          <fpage>635</fpage>
          -
          <lpage>644</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Niu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolivar</surname>
          </string-name>
          .
          <article-title>Substitutes or complements: Another step forward in recommendations</article-title>
          .
          <source>In Proc. of the 10th ACM Conf. on Electronic Commerce, EC '09</source>
          , pages
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>