<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Evaluation of Recommendation Algorithms for Online Recipe Portals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christoph Trattner</string-name>
          <email>christoph.trattner@uib.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Elsweiler</string-name>
          <email>david.elsweiler@ur.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bergen</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Regensburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>20</volume>
      <issue>2019</issue>
      <fpage>24</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>Better models of food preferences are required to realise the oft-touted potential of food recommenders to aid with the obesity crisis. Many of the food recommender evaluations in the literature have been performed with small convenience samples, which limits our confidence in the generalisability of the results. In this work we test a range of collaborative filtering (CF) and content-based (CB) recommenders on a large dataset crawled from the web consisting of naturalistic user interaction data over a 15 year period. The results reveal strengths and limitations of different approaches. While CF approaches consistently outperform CB approaches when testing on the complete dataset, our experiments show that to improve on the CB approaches, CF methods require a large number of users (&gt; 637 when sampling randomly). Moreover, the results show different facets of recipe content to offer utility. In particular, one of the strongest content-related features was a measure of health derived from guidelines from the UK Food Standards Agency. This finding underlines the challenges we face as a community to develop recommender algorithms which improve the healthfulness of the food people choose to eat.</p>
        <p>for example, building nutritional content into the recommendation process [15, 19, 34] or by recommending meal plans, which tailor recommendations to users' nutritional needs over time [6]. Providing healthful food recommendations using any of the suggested strategies necessitates, however, that we can accurately model and predict the food individual users would actually like to eat. We have, as yet, limited understanding as to which recommender algorithms work best [33], and the studies that have been performed typically focus on one approach in isolation (e.g. recipe ingredients [11] or properties of the associated image [14]). Moreover, past work has tended to employ datasets derived from small scale user studies [11, 19], limiting our confidence in the generalisability of the results. In this work, we test a number of competitive collaborative filtering (CF) and content-based (CB) recommenders on a large scale naturalistic dataset similar to those that have been studied for cultural [24, 40] or epidemiological [37] reasons using data science methods. We formulate the problem as is typically done in recommendation experiments, using past feedback from a given user to predict future interactions by that same user [26]. The aim being not only to compare and contrast different models, but also to</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>In this section two bodies of related work are reviewed. The first focuses on the evaluation of food recommender algorithms. The second summarises studies of user interaction with online recipe portals, which provide insight into human food preference and the variables influencing it.</p>
      <p>Efforts to design automated systems to recommend meals can be traced to the mid-1980s, when case-based planning was employed [18, 21]. More recent efforts have focused on rating prediction, using either aspects of recipe content or ratings data via collaborative filtering approaches. Freyne et al. [11] showed that recommendations could be improved by decomposing recipes into individual ingredients and building user profiles comprising the ingredients a user likes, based on ratings for the recipes containing these ingredients. Harvey et al. extended the approach and improved performance by creating positive and negative profiles for users and reducing the dimensionality of the matrices [19].</p>
      <p>Other CB approaches have employed visual signals. Yang and colleagues demonstrated that algorithms designed to extrapolate important visual aspects of food images outperform baseline methods [42, 43]. Elsweiler et al. [8] also show that automatically extracted low-level image features, such as brightness, colourfulness and sharpness, can be useful for predicting user food preference.</p>
      <p>A second approach has been to exploit ratings data using collaborative filtering (CF) techniques. Freyne and Berkovsky tested a nearest neighbour approach, which offered poorer performance than the content approach described above [11]. Ge et al. [15] tested a matrix factorization solution that fuses ratings information and user supplied tags to achieve significantly better prediction accuracy than content-based and standard matrix factorization baselines.</p>
      <p>Several studies report that the best results are achieved when CF and CB approaches are combined in hybrid models [11, 14, 19].</p>
      <p>A common motivator for food recommendation work has been to promote healthy nutrition. One approach is to rely on rules derived from domain experts to meet daily energy requirements [13] or to focus on the nutritional requirements of specific groups such as those in elderly care [10] or body-builders [38]. Others have tailored recommendations based on the user’s calorific or other nutritional needs [15, 16, 34] or existing nutritional habits [31], or combine recommendations to meet requirements [6]. Again, approaches have been published for specific target groups, e.g. diabetics [25].</p>
      <sec>
        <title>Studies of Food Behaviour using Online Recipe Portals</title>
        <p>While not focusing on recommendation, a large body of recent work sheds light on food preferences by studying interactions with online food portals. Analysing the nutritional content of these portals using metrics derived from the World Health Organisation (WHO) and the United Kingdom Food Standards Agency (FSA) has found recipes to be mainly unhealthy, although healthy recipes can be found [35]. Overall, people tend to interact with the least healthy recipes most often [34]. There is, nevertheless, heterogeneity in the user-base with respect to the nutritional properties of the recipes interacted with, and a growing body of evidence reports correlations between recipes accessed via search engines, recipe portals and social media and the incidence of diet-related illness [1, 3, 29, 37]. Moreover, clear weekly and seasonal trends can be observed in the way users interact with recipes, both in terms of the contained ingredients and the nutritional value of the recipes (fat, proteins, carbohydrates and calories) [23, 40]. Other work has reported different interaction patterns for users of different gender [28, 39] and for users who live in different geographical areas within a country [40, 44]. The number of variables shown to relate to eating habits highlights just how challenging a problem food recommendation is.</p>
        <p>The brief review of literature above has highlighted the increasing popularity of food recsys research and that a key motivator is the desire to build systems to promote healthy nutrition. Key takeaways from the review are as follows:</p>
        <list list-type="bullet">
          <list-item><p>While several evaluations have CF and CB baselines, no extensive comparison of CF and CB approaches in the food recsys domain has been published.</p></list-item>
          <list-item><p>Moreover, no detailed investigation of the different aspects of content that may be useful is available, and much of the recipe content (recipe description, cooking steps, cooking time etc.) has not been evaluated.</p></list-item>
          <list-item><p>Finally, the evaluations performed to date have typically been performed on small, artificially generated test collections.</p></list-item>
        </list>
      </sec>
      <sec>
        <title>MATERIALS</title>
        <p>To address the identified gaps in the literature, in this work we make use of a web crawl of the online platform Allrecipes.com to evaluate diverse CF and CB approaches in the recipe recommendation context.</p>
        <p>The platform was crawled between the 20th and 24th of July, 2015. We retrieved 60,983 recipes published by 25,037 users between the years 2000 and 2015 through the sitemap that is available in the robots.txt file of the website. In this paper we only make use of the 58,263 recipes where nutrition information was available. The basic statistics of this dataset can be found in Table 1.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption><p>Basic statistics of the dataset obtained from Allrecipes.com.</p></caption>
          <table>
            <tbody>
              <tr><td>Total published recipes</td><td>60,983</td></tr>
              <tr><td>Recipes containing nutrition information</td><td>58,263</td></tr>
              <tr><td>Recipes rated</td><td>46,713</td></tr>
              <tr><td>Ratings</td><td>1,032,226</td></tr>
              <tr><td>Users providing ratings</td><td>125,762</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In addition to the core recipe components, such as recipe title, ingredient list, number of servings and instructions, we also collected for each recipe the corresponding image, comments provided by users, rating information and nutrition facts<sup>1</sup>, such as total energy (kCal), protein (g), carbohydrate (g), sugar (g), salt (g), fat (g) and saturated fat (g) content (measured per 100g of recipe).</p>
        <p><sup>1</sup> Allrecipes.com estimates the nutritional facts for an uploaded recipe by matching the contained ingredients with those in the ESHA research database [9]. The ESHA system is used by popular companies such as McDonald’s and Kellogg’s.</p>
        <p>An Evaluation of Recommendation Algorithms for Online Recipe Portals. HealthRecSys ’19, September 20, 2019, Copenhagen, Denmark</p>
      </sec>
      <p>Allrecipes.com is just one of many online recipe portals. Other popular sites include Food.com, Epicurious.com, Yummly.com and Cooks.com. We chose Allrecipes.com because, at the time of writing, it claims to be the world’s largest food-focused social network: the site has a community of over 40 million users from 24 countries who annually visit 3 billion recipes [2]. This claim has been corroborated by services such as eBizMBA, which ranks Allrecipes.com as the most popular recipe website [5]. This means that we analyze not only a large scale dataset, but also the most popular recipe platform on the Web.</p>
      <sec>
        <title>EXPERIMENTAL SETUP</title>
        <p>We ran a series of experiments evaluating the performance of 6 prominent recommender algorithms on the rating data using the LibRec<sup>2</sup> framework. The algorithms tested are: Random item ranking (our baseline), Most Popular item ranking (MostPopular), user- and item-based collaborative filtering (denoted as UserKNN and ItemKNN) [30], Bayesian Personalized Ranking (BPR) [26], Weighted matrix factorization (WRMF) [22] and Latent Dirichlet Allocation (LDA) [17].</p>
        <p>For the content-based approaches we induced in total 20 different features, which we used to compute similarities between recipes. Below we briefly summarise these features and their corresponding sets:</p>
        <list list-type="bullet">
          <list-item><p>Title: For the title feature set, we derived 5 similarity features. Four are based on Levenshtein distance, Longest Common Subsequence (LCS), Jaro-Winkler distance and bi-gram distance. To obtain a similarity value between two recipes based on these features we calculate 1 − dist(<italic>r<sub>i</sub></italic>, <italic>r<sub>j</sub></italic>). Furthermore, we employ LDA topic modelling on the recipe titles using Mallet with Gibbs sampling. The number of topics was set to 100. Hence for each recipe we induce a vector of dimension one hundred capturing the topic distribution. To calculate similarities between recipes we employ the cosine similarity metric.</p></list-item>
          <list-item><p>Image: For the image feature set we employed, on the one hand, image attractiveness measures such as image brightness, sharpness, contrast, colorfulness and entropy and, on the other hand, deep convolutional neural network (CNN) features from a pre-trained VGG-16 model [32]. For each image we derive one embedding vector of dimension 4096 and calculate cosine similarity between recipes on these vectors. To measure the similarity between two recipes based on the image attractiveness metrics [36] we employ the inverse Manhattan distance, i.e. 1 − |metric(<italic>r<sub>i</sub></italic>) − metric(<italic>r<sub>j</sub></italic>)|.</p></list-item>
          <list-item><p>Ingredients: To calculate similarities between recipes on ingredient level, we induced four different features. On the one hand, the text itself was used and brought to a TF-IDF representation to calculate cosine similarity between recipes. On the other hand, we also chose to employ LDA again to derive a topic distribution and to calculate cosine similarity between recipes on those vectors. Finally, we employed the normalized ingredient strings to calculate similarities between recipes using cosine similarity and Jaccard. In the case of cosine we normalized the quantities of each ingredient to 100g of a recipe and used the normalized quantity values as frequency indicator.</p></list-item>
          <list-item><p>Directions: From the directions block we computed two similarity features, based again on an LDA topic vector representation of the text as well as on a TF-IDF vector representation. Similarities were again computed employing the cosine similarity measure on these vectors.</p></list-item>
          <list-item><p>Ratings: Here we rely on the number of ratings of a recipe as well as the average rating. To compute similarities between recipes on these indicators we rely again on the inverse Manhattan distance, i.e. 1 − |metric(<italic>r<sub>i</sub></italic>) − metric(<italic>r<sub>j</sub></italic>)|.</p></list-item>
          <list-item><p>Health: In order to measure the healthiness of a recipe we rely on the following macronutrients: ‘fat’, ‘saturated fat’, ‘sugar’ and ‘salt’ (measured per 100g of recipe). This allows us to measure the healthiness of a recipe according to international standards as introduced in 2007 by the Food Standards Agency (FSA) [12]. There are also other standards that can be applied, such as the ones provided by the World Health Organization (WHO) [41] or the HEI metric as proposed by the CDC [20]. We employ the standards provided by the FSA, as this is currently the most robust method to estimate the healthiness of online recipes. The metric was also used in related work [34]. The scale ranges from 4 for very healthy recipes to 12 for very unhealthy recipes. Throughout the paper we refer to this metric as ‘FSA score’.</p></list-item>
        </list>
        <p>For each of the features described above, we derive a scoring function that is computed as follows:</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[\mathrm{score}(u, i)_{\mathit{feature}} = \frac{\sum_{p \in P_u} \mathrm{sim}(i, p)}{|P_u|}]]></tex-math>
        </disp-formula>
        <p>where <italic>P<sub>u</sub></italic> is the set of items of a user <italic>u</italic>, <italic>i</italic> an arbitrary item, and sim(<italic>i</italic>, <italic>p</italic>) is any of the above mentioned similarity metrics between items <italic>i</italic> and <italic>p</italic>.</p>
        <p>For each feature set we calculate scores based on the linear combination of the similarities.<sup>3</sup></p>
        <p>As in previous work [26], we operationalise the experiments as a personalized ranking problem (item recommendation). The aim here is to provide a user with a ranked list of items where the ranking has to be inferred from the implicit behavior of the user (e.g. recipes rated in the past). Implicit feedback systems, such as those studied in [26], are challenging as only positive observations are available. The non-observed user-item pairs, e.g. a user has not cooked a recipe yet, are a mixture of real negative feedback (the user is not interested in cooking the recipe) and missing values (the user might want to cook the recipe in the future). We use 5-fold cross validation as the protocol for all the experiments and report the recommendation performance results employing AUC as a performance metric [27].</p>
        <p>To reduce data sparsity issues, a well-known issue in collaborative filtering-based methods [27], in the first experiments we apply a p-core filter approach [4] using only user profiles with at least 20 rating interactions<sup>4</sup> and recipes that have been rated at least 20 times by the users, resulting in a final dense dataset comprising 1273 users, 1031 items and 50,681 interactions. To study the effects of different levels of users on performance we report a second set of bootstrapped experiments.</p>
        <p><sup>3</sup> Parameters were tuned to the optimum using grid search.</p>
      </sec>
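      <p>The similarity measures and the scoring function in Eq. (1) can be illustrated with a minimal sketch. It is not the implementation used in the experiments: all function names and the toy data are hypothetical, the inverse Manhattan form assumes metric values already scaled to [0, 1], and the handling of empty profiles, empty sets and zero vectors is an assumption.</p>
      <preformat><![CDATA[
```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity between two collections of ingredient strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(set_a & set_b) / len(set_a | set_b)


def inverse_manhattan(x, y):
    """1 - |x - y| for scalar metrics assumed to be scaled to [0, 1]."""
    return 1.0 - abs(x - y)


def score(profile, candidate, sim):
    """Eq. (1): mean similarity between the candidate and the profile items."""
    if not profile:
        return 0.0  # assumption: users without a profile get a neutral score
    return sum(sim(candidate, p) for p in profile) / len(profile)


# Hypothetical toy data: recipe id -> set of normalised ingredient strings.
ingredients = {
    "r1": {"flour", "egg", "milk"},
    "r2": {"flour", "egg", "sugar"},
    "r3": {"tomato", "basil"},
    "r4": {"flour", "milk", "butter"},
}
profile = ["r1", "r2"]  # P_u: recipes the user has rated


def ingredient_sim(i, p):
    """Similarity between two recipes on the Jaccard ingredient feature."""
    return jaccard_similarity(ingredients[i], ingredients[p])


# Rank every recipe outside the profile by its Eq. (1) score.
ranked = sorted(
    (r for r in ingredients if r not in profile),
    key=lambda r: score(profile, r, ingredient_sim),
    reverse=True,
)
print(ranked)
```
]]></preformat>
      <p>In this toy example the recipe sharing ingredients with the profile (r4) is ranked above the one sharing none (r3). Any of the other similarity functions can be passed as <italic>sim</italic> in the same way.</p>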
      <p><sup>2</sup> http://www.librec.net/</p>
      <p><sup>4</sup> We convert all ratings to positive feedback, i.e. any rating is counted as positive feedback and any non-interaction as negative feedback. This makes sense as 95% of all ratings in the Allrecipes.com dataset are 5-star ratings, see also [36].</p>
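      <p>The preprocessing described in the setup, where every rating is counted as positive feedback and a p-core filter with a threshold of 20 is applied to users and recipes, might look roughly like the sketch below. The function names and toy ratings are hypothetical. The text does not state whether the filter was applied once or repeated until stable; the sketch repeats it, which is an assumption.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def to_implicit(ratings):
    """Turn (user, recipe, stars) triples into positive (user, recipe) pairs."""
    return {(user, recipe) for user, recipe, _stars in ratings}


def p_core(interactions, min_user=20, min_item=20):
    """Repeatedly drop users and items below the thresholds until stable."""
    pairs = set(interactions)
    while True:
        user_counts = Counter(u for u, _ in pairs)
        item_counts = Counter(i for _, i in pairs)
        kept = {
            (u, i)
            for u, i in pairs
            if user_counts[u] >= min_user and item_counts[i] >= min_item
        }
        if kept == pairs:
            return kept
        pairs = kept


# Hypothetical toy ratings; thresholds lowered to 2 so the effect is visible.
ratings = [
    ("u1", "a", 5), ("u1", "b", 5),
    ("u2", "a", 4), ("u2", "b", 5),
    ("u3", "a", 5), ("u3", "c", 3),
]
dense = p_core(to_implicit(ratings), min_user=2, min_item=2)
print(sorted(dense))
```
]]></preformat>
      <p>Here recipe c is removed first because it has a single rating, which in turn leaves user u3 with a single interaction, so u3 is removed in the next pass. This cascading effect is why an iterated filter can yield a smaller dataset than a single pass.</p>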
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption><p>Panels (A) and (B). Legend (Algorithm): BPR, CB:All.</p></caption>
      </fig>
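      <p>AUC is the performance metric used throughout the experiments. As a rough illustration of what it measures for a single user under implicit feedback, the sketch below computes the share of (held-out, unobserved) recipe pairs that a set of predicted scores orders correctly. This is a generic formulation, not the LibRec implementation; the function names, the toy scores and the choice to count ties as one half are assumptions.</p>
      <preformat><![CDATA[
```python
def user_auc(scores, held_out, seen):
    """Share of (held-out, unobserved) item pairs ranked in the right order.

    scores   : dict item -> predicted score for one user
    held_out : items the user interacted with in the test fold
    seen     : items the user interacted with in the training folds
    """
    positives = [i for i in scores if i in held_out]
    negatives = [i for i in scores if i not in held_out and i not in seen]
    if not positives or not negatives:
        return None  # AUC is undefined for this user
    correct = 0.0
    for p in positives:
        for n in negatives:
            if scores[p] > scores[n]:
                correct += 1.0
            elif scores[p] == scores[n]:
                correct += 0.5  # assumption: ties count as half
    return correct / (len(positives) * len(negatives))


def mean_auc(per_user):
    """Average the per-user AUC values that are defined."""
    values = [v for v in per_user if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical scores for one user over five recipes.
scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1, "e": 0.4}
auc = user_auc(scores, held_out={"a", "b"}, seen={"c"})
print(auc)
```
]]></preformat>
      <p>A value of .5 corresponds to a random ordering, which is consistent with the Random baseline scoring close to .5 in the results.</p>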
      <sec id="sec-2-1">
        <p>The bootstrapped experiments use smaller dense samples of heavy users (using the same criteria as above) and collections of varying size drawn using standard random sampling, referred to as ‘sparse samples’ in the text. These experiments were repeated 100 times each and the average performance reported.</p>
      </sec>
      <sec id="sec-2-2">
        <title>RESULTS</title>
        <p>The results of the experiments on the full dataset are shown in Table 2. The CF methods clearly outperform the content-based approaches. The best performing CF method (BPR) achieved an AUC score of .7094 and the remaining CF methods demonstrated AUC scores of &gt; .686. This compares to .5883 achieved by the linear combination of content features (= CB:All).</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <caption><p>AUC of the content-based features and the random baseline.</p></caption>
          <table>
            <thead>
              <tr><th>Feature</th><th>AUC</th></tr>
            </thead>
            <tbody>
              <tr><td>Title:Levenshtein-Distance</td><td></td></tr>
              <tr><td>Title:Bigram-Distance</td><td></td></tr>
              <tr><td>Title:LCS-Distance</td><td></td></tr>
              <tr><td>Title:LDA-Text-Cosine</td><td></td></tr>
              <tr><td>Title:Jaro-Winkler-Distance</td><td></td></tr>
              <tr><td>Title:All</td><td></td></tr>
              <tr><td>Image:Cosine-Embeddings</td><td>.5322</td></tr>
              <tr><td>Image:Colorfulness-Distance</td><td>.5072 (↓)</td></tr>
              <tr><td>Image:Contrast-Distance</td><td>.5175</td></tr>
              <tr><td>Image:Sharpness-Distance</td><td>.5109</td></tr>
              <tr><td>Image:Entropy-Distance</td><td>.5080 (↓)</td></tr>
              <tr><td>Image:Brightness-Distance</td><td>.4991 (↓)</td></tr>
              <tr><td>Image:All</td><td>.5425</td></tr>
              <tr><td>Ingredients:Cosine-Text</td><td>.5547</td></tr>
              <tr><td>Ingredients:Cosine-LDA-Text</td><td>.5653 (↑)</td></tr>
              <tr><td>Ingredients:Jaccard</td><td>.5502</td></tr>
              <tr><td>Ingredients:Cosine</td><td>.5575</td></tr>
              <tr><td>Ingredients:All</td><td>.5718</td></tr>
              <tr><td>Directions:Cosine-LDA-Text</td><td>.5606 (↑)</td></tr>
              <tr><td>Directions:Cosine-Text</td><td>.5210</td></tr>
              <tr><td>Directions:All</td><td>.5731</td></tr>
              <tr><td>Ratings:Number-Distance</td><td>.4789 (↓)</td></tr>
              <tr><td>Ratings:Average-Distance</td><td>.4832 (↓)</td></tr>
              <tr><td>Ratings:All</td><td>.5249</td></tr>
              <tr><td>Health:FSA</td><td>.5775 (↑)</td></tr>
              <tr><td>CB:All</td><td>.5883</td></tr>
              <tr><td>Random</td><td>.4989</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Examining the performance of different aspects of content (title, image, ingredients, directions and health) shows that there is a signal in each of these aspects. This is a sign of consistency in the properties of the recipes that individual users tend to rate. The fact that the combined model “All” does not achieve a large improvement over these signals individually is perhaps an indication that a linear combination is not the best means to combine them. One of the strongest content-based features is the FSA score (AUC=.5775). Again, this hints at consistency in user preference, this time in terms of the healthiness of the recipes that individual users interact with.</p>
        <p>To complement these initial results and better understand the relationship between CF and CB methods and the amount of data required to achieve strong recommendation performance with these approaches, we performed the bootstrapping study as described above. The results are presented in Figure 1.</p>
        <p>In a first test, see Figure 1 (A), we sampled only from active users, that is, we derived test sets of various sizes where users had rated at least 20 items and the items involved had also received at least 20 ratings. Taking this dense sample showed that even a small number of users can attain stable performance. With only 1% of all users (N=13) the CF technique (BPR) is able to outperform the content approach. Nevertheless, when users are selected at random from the dataset and no p-core filter is applied, see Figure 1 (B), which we argue is a much more realistic setup [4], many more users are required on average to achieve an equivalent performance. Whereas the CB approaches achieve a consistent performance (AUC &gt; .54) regardless of the number of users studied, half of the dataset (50%, N=637) is required before the CF methods outperform the CB approach.</p>
      </sec>
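      <p>The bootstrapping procedure, in which a share of the users is sampled, performance is measured, and the average over 100 repetitions is reported, can be outlined as below. This is an illustrative sketch only: <italic>evaluate</italic> is a hypothetical callback standing in for a full train-and-test run, rounding the sample size up and sampling without replacement within a run are assumptions, and the toy evaluation function does not reflect any measured result.</p>
      <preformat><![CDATA[
```python
import math
import random


def bootstrap_performance(users, fraction, evaluate, repeats=100, seed=0):
    """Average evaluate(sample) over repeated random user samples.

    evaluate is a hypothetical callback that trains and tests a recommender
    on the sampled users and returns a single number such as AUC.
    """
    rng = random.Random(seed)
    # Assumption: the sample size is rounded up to a whole number of users.
    size = max(1, math.ceil(len(users) * fraction))
    results = []
    for _ in range(repeats):
        sample = rng.sample(users, size)  # without replacement within a run
        results.append(evaluate(sample))
    return sum(results) / len(results)


def toy_evaluate(sample):
    """Toy stand-in for a real run: the value only depends on sample size."""
    return 0.5 + 0.0001 * len(sample)


users = [f"u{k}" for k in range(1273)]  # 1273 users, as in the dense dataset
for fraction in (0.01, 0.5, 1.0):
    print(fraction, bootstrap_performance(users, fraction, toy_evaluate))
```
]]></preformat>
      <p>With 1273 users, fractions of 1% and 50% give sample sizes of 13 and 637 under this rounding, matching the values of N quoted in the results.</p>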
    </sec>
    <sec id="sec-3">
      <title>SUMMARY &amp; CONCLUSION</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>