An Evaluation of Recommendation Algorithms for Online Recipe Portals

Christoph Trattner
University of Bergen, Norway
christoph.trattner@uib.no

David Elsweiler
University of Regensburg, Germany
david.elsweiler@ur.de

September 20, 2019

ABSTRACT

Better models of food preferences are required to realise the oft-touted potential of food recommenders to aid with the obesity crisis. Many of the food recommender evaluations in the literature have been performed with small convenience samples, which limits confidence in the generalisability of the results. In this work we test a range of collaborative filtering (CF) and content-based (CB) recommenders on a large dataset crawled from the web consisting of naturalistic user interaction data over a 15 year period. The results reveal strengths and limitations of different approaches. While CF approaches consistently outperform CB approaches when testing on the complete dataset, our experiments show that to improve on the CB approaches, CF methods require a large number of users (> 637 when sampling randomly). Moreover, the results show different facets of recipe content to offer utility. In particular, one of the strongest content-related features was a measure of health derived from guidelines from the UK Food Standards Agency. This finding underlines the challenges we face as a community to develop recommender algorithms which improve the healthfulness of the food people choose to eat.

1 INTRODUCTION

for example, building nutritional content into the recommendation process [15, 19, 34] or by recommending meal plans, which tailor recommendations to users' nutritional needs over time [6].

Providing healthful food recommendations using any of the suggested strategies necessitates, however, that we can accurately model and predict the food individual users would actually like to eat. We have as yet limited understanding as to which recommender algorithms work best [33], and the studies that have been performed typically focus on one approach in isolation (e.g. recipe ingredients [11] or properties of the associated image [14]). Moreover, past work has tended to employ datasets derived from small scale user studies [11, 19], limiting our confidence in the generalisability of the results. In this work, we test a number of competitive collaborative filtering (CF) and content-based (CB) recommenders on a large scale naturalistic dataset similar to those that have been studied for cultural [24, 40] or epidemiological [37] reasons using data science methods. We formulate the problem as is typically done in recommendation experiments, using past feedback from a given user to predict future interactions by that same user [26]. The aim being not only to compare and contrast different models, but also to

2 RELATED WORK

In this section two bodies of related work are reviewed. The first focuses on the evaluation of food recommender algorithms. The second summarises studies of user interaction with online recipe portals, which provide insight into human food preference and the variables influencing this.

Efforts to design automated systems to recommend meals can be traced to the mid-1980s, where case-based planning was employed [18, 21]. More recent efforts have focused on rating prediction, using either aspects of recipe content or ratings data using collaborative filtering approaches. Freyne et al. [11] showed the recommendations could be improved by decomposing recipes into individual ingredients and building user profiles comprising ingredients users liked based on ratings for the recipes containing these ingredients. Harvey et al. extended the approach and improved performance by creating positive and negative profiles for users and reducing the dimensionality of the matrices [19].

Other CB approaches have employed visual signals. Yang and colleagues demonstrated that algorithms designed to extrapolate important visual aspects of food images outperform baseline methods [42, 43]. Elsweiler et al. [8] also show that automatically extracted low-level image features, such as brightness, colourfulness and sharpness, can be useful for predicting user food preference.

A second approach has been to exploit ratings data using collaborative filtering (CF) techniques. Freyne and Berkovsky tested a nearest neighbour approach, which offered poorer performance than the content approach described above [11]. Ge et al. [15] tested a matrix factorization solution that fuses ratings information and user supplied tags to achieve significantly better prediction accuracy than content-based and standard matrix factorization baselines.

Several studies report that the best results are achieved when CF and CB approaches are combined in hybrid models [11, 14, 19].

A common motivator for food recommendation work has been to promote healthy nutrition. One approach is to rely on rules derived from domain experts to meet daily energy requirements [13] or focus on the nutritional requirements of specific groups such as the elderly [10] or body-builders [38]. Others have tailored recommendations based on the user's calorific or other nutritional needs [15, 16, 34], existing nutritional habits [31] or combine recommendations to meet requirements [6]. Again, approaches have been published for specific target groups, e.g. diabetics [25].

2.2 Studies of Food Behaviour using Online Recipe Portals

While not focusing on recommendation, a large body of recent work sheds light on food preferences by studying interactions with online food portals. Analysing the nutritional content of these portals using metrics derived from the World Health Organisation (WHO) and the United Kingdom Food Standards Agency (FSA) has found recipes to be mainly unhealthy, although healthy recipes can be found [35]. Overall, people tend to interact with the least healthy recipes most often [34]. There is, nevertheless, heterogeneity in the user-base with respect to the nutritional properties of recipes interacted with, and a growing body of evidence reports correlations between recipes accessed via search engines, recipe portals and social media and the incidence of diet-related illness [1, 3, 29, 37]. Moreover, clear weekly and seasonal trends can be observed in the way users interact with recipes, both in terms of the ingredients contained and the nutritional value of the recipes (fat, proteins, carbohydrates, and calories) [23, 40]. Other work has reported different interaction patterns for users with different gender [28, 39] and who live in different geographical areas within a country [40, 44].

The number of variables shown to relate to eating habits highlights just how challenging a problem food recommendation is.

The brief review of literature above has highlighted the increasing popularity of food recsys research and that a key motivator is the desire to build systems to promote healthy nutrition. Key takeaways from the review are as follows:

• While several evaluations have CF and CB baselines, no extensive comparison of CF and CB approaches in the food recsys domain has been published.
• Moreover, no detailed investigation of different aspects of content that may be useful is available, and much of the recipe content (recipe description, cooking steps, cooking time etc.) has not been evaluated.
• Finally, the evaluations performed to date have typically been performed on small artificially generated test collections.

3 MATERIALS

To address the identified gaps in the literature, in this work, we make use of a web crawl of the online platform Allrecipes.com to evaluate diverse CF and CB approaches in the recipe recommendation context.

The platform was crawled between the 20th and 24th of July, 2015. We retrieved 60,983 recipes published by 25,037 users between the years 2000 and 2015 through the sitemap that is available in the robots.txt file of the website. In this paper we only make use of the 58,263 recipes where nutrition information was available. The basic statistics of this dataset can be found in Table 1.

In addition to the core recipe components – such as recipe title, ingredient list, number of servings and instructions – we also collected for each recipe the according image, comments provided by users, rating information and nutrition facts¹, such as total energy (kCal), protein (g), carbohydrate (g), sugar (g), salt (g), fat (g) and saturated fat (g) content (measured per 100g of recipe).

Table 1: Basic statistics of the dataset obtained from Allrecipes.com.

Total published recipes                       60,983
Recipes containing nutrition information      58,263
Recipes rated                                 46,713
Ratings                                    1,032,226
Users providing ratings                      125,762

¹ Allrecipes.com estimates the nutritional facts for an uploaded recipe by matching the contained ingredients with those in the ESHA research database [9]. The ESHA system is used by popular companies such as McDonald's and Kellogg's.

HealthRecSys '19, September 20, 2019, Copenhagen, Denmark

Allrecipes.com is just one of many online recipe portals. Other popular sites include Food.com, Epicurious.com, Yummly.com and Cooks.com. We chose Allrecipes.com because, at the time of writing, it claims to be the world's largest food-focused social network: the site has a community of over 40 million users from 24 countries who annually visit 3 billion recipes [2]. This claim has been corroborated by services such as eBizMBA, which ranks Allrecipes.com as the most popular recipe website [5]. This means that we not only analyze a large scale dataset, but also the most popular recipe platform on the Web.

4 EXPERIMENTAL SETUP

We ran a series of experiments evaluating the performance of 6 prominent recommender algorithms on the rating data using the LibRec² framework. The algorithms tested are: Random item ranking (our baseline), Most Popular item ranking (MostPopular), user- and item-based collaborative filtering (denoted as UserKNN and ItemKNN) [30], Bayesian Personalized Ranking (BPR) [26], Weighted matrix factorization (WRMF) [22] and Latent Dirichlet Allocation (LDA) [17].

For the content-based approaches we induced in total 20 different features, which we used to compute similarities between recipes. Below we briefly summarise these features and their corresponding sets:

• Title: For the title feature set, we derived 5 similarity features, based on Levenshtein distance, Longest Common Subsequence (LCS), Jaro-Winkler distance, and bi-gram distance. To obtain a similarity value between two recipes based on these features we calculate $1 - \mathrm{dist}(r_i, r_j)$. Furthermore, we employ LDA topic modelling on the recipe titles using Mallet with Gibbs sampling. The number of topics was set to 100. Hence for each recipe we induce a vector of dimension one hundred capturing the topic distribution. To calculate similarities between recipes we employ the cosine similarity metric.
• Image: For the image feature set we employed, on the one hand, image attractiveness measures such as image brightness, sharpness, contrast, colorfulness and entropy and, on the other hand, deep convolutional neural network (CNN) features from a pre-trained VGG-16 model [32]. For each image we derive one embedding vector of dimension 4096 and calculate cosine similarity between recipes on these vectors. To measure the similarity between two recipes based on the image attractiveness metrics [36] we employ the inverse Manhattan distance, i.e. $1 - |\mathrm{metric}(r_i) - \mathrm{metric}(r_j)|$.
• Ingredients: To calculate similarities between recipes on ingredient level, we induced four different features. On the one hand, the text itself was used and brought to a TF-IDF representation to calculate cosine similarity between recipes. On the other hand, we also chose to employ LDA again to derive a topic distribution and to calculate cosine similarity between recipes on those vectors. Finally, we employed the normalized ingredient strings to calculate similarities between recipes using cosine similarity and Jaccard. In the case of cosine we normalized the quantities of each ingredient to 100g of a recipe and used the normalized quantity values as frequency indicator.
• Directions: From the directions block we computed two similarity features, based again on an LDA topic vector representation of the text as well as on a TF-IDF vector representation. Similarities were again computed employing the cosine similarity measure on these vectors.
• Ratings: Here we rely on the number of ratings of a recipe as well as the average rating. To compute similarities between recipes on these indicators we rely again on the inverse Manhattan distance, i.e. $1 - |\mathrm{metric}(r_i) - \mathrm{metric}(r_j)|$.
• Health: In order to measure the healthiness of a recipe we rely on the following macro nutrients: 'fat', 'saturated fat', 'sugar' and 'salt' (measured per 100g of recipe). This allows us to measure the healthiness of a recipe according to international standards as introduced in 2007 by the Food Standards Agency (FSA) [12]. There are also other standards that can be applied, such as the ones provided by the World Health Organization (WHO) [41] or the HEI metric as proposed by the CDC [20]. We employ the standards provided by the FSA, as this is currently the most robust method to estimate the healthiness of online recipes. The metric was also used in related work [34]. The scale ranges from 4 for very healthy recipes to 12 for very unhealthy recipes. Throughout the paper we refer to this metric as 'FSA score'.

For each of the features described above, we derive a scoring function that is computed as follows:

$$\mathrm{score}(u, i)_{feature} = \frac{\sum_{p \in P_u} \mathrm{sim}(i, p)}{|P_u|}, \qquad (1)$$

where $P_u$ is the set of items of a user $u$, $i$ an arbitrary item, and $\mathrm{sim}(i, p)$ is any of the above mentioned similarity metrics between item $i$ and $p$.

For each feature set we calculate scores based on the linear combination of the similarities³.

As in previous work [26], we operationalise the experiments as a personalized ranking problem (item recommendation). The aim here is to provide a user with a ranked list of items where the ranking has to be inferred from the implicit behavior of the user (e.g. recipes rated in the past). Implicit feedback systems, such as those studied in [26], are challenging as only positive observations are available. The non-observed user-item pairs – e.g. a user has not cooked a recipe yet – are a mixture of real negative feedback (the user is not interested in cooking the recipe) and missing values (the user might want to cook the recipe in the future). We use 5-fold cross validation as protocol for all the experiments and report the recommendation performance results employing AUC as a performance metric [27].

To reduce data sparsity issues, a well-known issue in collaborative filtering-based methods [27], in the first experiments we apply a p-core filter approach [4] using only user profiles with at least 20 rating interactions⁴ and recipes that have been rated at least 20 times by the users, resulting in a final dense dataset comprising 1273 users, 1031 items and 50,681 interactions. To study the effects of different levels of users on performance we report a second set

³ Parameters were tuned to the optimum using grid search.
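The similarity features and the scoring function of Equation (1) can be illustrated with a short sketch. This is not the authors' implementation: the function names are hypothetical, scaling the edit distance by the longer title length is an assumption made here so that 1 − dist stays in [0, 1], and the inverse Manhattan similarity assumes the compared metric has already been scaled to [0, 1].

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (standard dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def title_similarity(a, b):
    """1 - dist(r_i, r_j); dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (e.g. LDA or TF-IDF)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def inverse_manhattan(a, b):
    """1 - |metric(r_i) - metric(r_j)| for a single scalar metric."""
    return 1.0 - abs(a - b)


def score(profile, item, sim):
    """Equation (1): mean similarity between `item` and the user's items."""
    if not profile:
        return 0.0
    return sum(sim(item, p) for p in profile) / len(profile)


if __name__ == "__main__":
    liked_titles = ["apple pie", "apple tart", "pear crumble"]
    for candidate in ["apple cake", "beef stew"]:
        print(candidate, round(score(liked_titles, candidate, title_similarity), 4))
```

Passing the similarity as a parameter mirrors the text, where the same scoring function is applied to every feature; a feature set score would then be a weighted sum of such per-feature scores.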

² http://www.librec.net/

⁴ We convert all ratings to positive feedback, i.e. any rating is counted as positive feedback and any non-interaction as negative feedback. This makes sense as 95% of all ratings in the Allrecipes.com dataset are 5-star ratings, see also [36].
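The conversion of ratings to positive feedback and the p-core filter with a threshold of 20 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the triple format `(user, recipe, stars)` are hypothetical, and applying the two thresholds repeatedly until nothing more is removed is one common reading of a p-core filter, not necessarily the exact procedure used in the experiments.

```python
from collections import Counter


def binarise(ratings):
    """Treat every rating as positive feedback: keep only (user, recipe)."""
    return {(user, recipe) for user, recipe, _stars in ratings}


def p_core_filter(pairs, min_user=20, min_item=20):
    """Drop sparse users and items until every survivor meets both thresholds.

    Removing an item can push a user below the threshold (and vice versa),
    so the filter is repeated until the set of pairs stops changing.
    """
    pairs = set(pairs)
    while True:
        per_user = Counter(user for user, _ in pairs)
        per_item = Counter(item for _, item in pairs)
        kept = {
            (user, item)
            for user, item in pairs
            if per_user[user] >= min_user and per_item[item] >= min_item
        }
        if kept == pairs:
            return kept
        pairs = kept


if __name__ == "__main__":
    toy_ratings = [
        ("u1", "a", 5), ("u1", "b", 4),
        ("u2", "a", 5), ("u2", "b", 1),
        ("u3", "a", 5), ("u3", "c", 5),
    ]
    # Small thresholds so the toy data survives; the paper uses 20 and 20.
    dense = p_core_filter(binarise(toy_ratings), min_user=2, min_item=2)
    print(sorted(dense))
```

In the toy data, recipe "c" is dropped first for having a single rating, which in turn leaves user "u3" with one interaction, so that user is removed in the next pass.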

[Figure 1: panels (A) and (B). Legend: Algorithm (BPR, CB:All).]

of bootstrapped experiments using smaller dense samples of heavy users (using the same criteria as above), and varying collection sizes using standard random sampling, referred to as 'sparse samples' in the text. These experiments were repeated 100 times each and the average performance reported.

5 RESULTS

The results of the experiments on the full dataset are shown in Table 2. The CF methods clearly outperform the content-based approaches. The best performing CF method (BPR) achieved an AUC score of .7094 and the remaining CF methods demonstrated AUC scores of > .686. This compares to .5883 achieved by the linear combination of content features (= CB:All).

Table 2: AUC scores per feature.

Feature                              AUC
Title:Levenshtein-Distance
Title:Bigram-Distance
Title:LCS-Distance
Title:LDA-Text-Cosine
Title:Jaro-Winkler-Distance
Title:All
Image:Cosine-Embeddings              .5322
Image:Colorfulness-Distance          .5072 (↓)
Image:Contrast-Distance              .5175
Image:Sharpness-Distance             .5109
Image:Entropy-Distance               .5080 (↓)
Image:Brightness-Distance            .4991 (↓)
Image:All                            .5425
Ingredients:Cosine-Text              .5547
Ingredients:Cosine-LDA-Text          .5653 (↑)
Ingredients:Jaccard                  .5502
Ingredients:Cosine                   .5575
Ingredients:All                      .5718
Directions:Cosine-LDA-Text           .5606 (↑)
Directions:Cosine-Text               .5210
Directions:All                       .5731
Ratings:Number-Distance              .4789 (↓)
Ratings:Average-Distance             .4832 (↓)
Ratings:All                          .5249
Health:FSA                           .5775 (↑)
CB:All                               .5883
Random                               .4989

Examining the performance of different aspects of content (title, image, ingredients, directions and health) shows that there is a signal in each of these aspects. This is a sign of consistency in the properties of the recipes that individual users tend to rate. The fact that the combined model "All" does not achieve a large improvement over these signals individually is perhaps an indication that a linear combination is not the best means to combine these signals. One of the strongest content-based features is the FSA score (AUC=.5775). Again, this hints at consistency in user preference, this time in terms of the healthiness of the recipes which individual users interact with.

To complement these initial results and better understand the relationship between CF and CB methods and the amount of data required to achieve strong recommendation performance with these approaches, we performed the bootstrapping study as described above. The results are presented in Figure 1.

In a first test, see Figure 1 (A), we sampled only from active users, that is, we derived test sets of various sizes where users had rated at least 20 items and the items involved had also received at least 20 ratings. Taking this dense sample showed that even a small number of users can attain stable performance. With only 1% of all users (N=13) the CF technique (BPR) is able to outperform the content approach. Nevertheless, when users are selected at random from the dataset and no p-core filter is applied, see Figure 1 (B) – which we argue is a much more realistic setup [4] – many more users are required on average to achieve an equivalent performance. Whereas the CB approaches achieve a consistent performance (AUC > .54) regardless of the number of users studied, half of the dataset (50%, N=637) is required before the CF methods outperform the CB approach.
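The repeated sampling study and the AUC metric can be sketched as below. This is an illustration only: `evaluate` is a hypothetical callback standing in for training and testing a recommender on the sampled users, drawing users without replacement and rounding the sample size up are assumptions, and the pairwise AUC shown is the usual definition rather than code taken from LibRec.

```python
import math
import random
from statistics import mean


def auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def sample_size(n_users, fraction):
    """Number of users in a sample; rounding up is an assumption."""
    return max(1, math.ceil(fraction * n_users))


def bootstrap_auc(users, fraction, evaluate, repeats=100, seed=0):
    """Average `evaluate(sample)` over `repeats` random user samples.

    `evaluate` is a hypothetical callback returning the AUC obtained when
    the recommender is trained and tested on the sampled users only.
    """
    rng = random.Random(seed)
    pool = sorted(users)
    size = sample_size(len(pool), fraction)
    return mean(evaluate(rng.sample(pool, size)) for _ in range(repeats))


if __name__ == "__main__":
    # Dummy evaluation: a fixed toy AUC, independent of the sampled users.
    toy_eval = lambda sample: auc([0.9, 0.8], [0.1, 0.85])
    print(bootstrap_auc(range(1273), 0.01, toy_eval, repeats=100))
```

With the 1273 users of the dense dataset, `sample_size` gives 13 users for a fraction of 0.01 and 637 for 0.5, which agrees with the sample sizes quoted in the text.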

6 SUMMARY & CONCLUSION