HeASe: An AI-powered Framework to Promote Healthy and Sustainable Eating Alessandro Petruzzelli, Cataldo Musto* , Michele Ciro Di Carlo, Giovanni Tempesta and Giovanni Semeraro University of Bari Aldo Moro, via Orabona 4, Bari, 70125, Italy Abstract This paper introduces Healthy And Sustainable eating (HeASe), a comprehensive framework designed to promote healthy and sustainable eating by leveraging large language models and food retrieval techniques. As global concerns about nutrition and environmental sustainability escalate, the need for effective solutions that allow people to better nourish and improve their knowledge and self-awareness about food becomes imperative. To this end, given an input recipe, our framework first identifies a set of substitute meals by exploiting a retrieval strategy based on macro-nutrients, then relies on large language models to re-rank candidate recipes based on their healthiness and sustainability. As shown in our experiments, the methodology has the ability to expose individuals to better dietary choices, potentially contributing to overall well-being and reducing the ecological footprint of food consumption. Keywords Food Recommendation, Large Language Models, Health-aware Recommender Systems, Sustainability 1. Introduction Today, the food industry is efficient and offers a variety of fresh and processed options. However, every step of the agricultural and food chain raises environmental concerns. Land use, water consumption, and air emissions all have an impact on the environment. While technological advancements create new markets and opportunities, they must also address these environmental challenges. To mitigate the environmental footprint of the food chain, a fundamental shift in consumer behavior is essential. Indeed, we must transition towards a dietary paradigm that prioritizes both individual health and environmental sustainability [1]. This necessitates a move away from conventional consumption patterns and towards a more mindful approach to food choices. All these principles are in lines with several Sustainable Development Goals (SDGs), in particular SDG3 (Good Health and Well-being) and SDG12 (Responsible Consumption and Production). In recent years, food recommendation systems (RSs) [2] have emerged as a promising avenue to guide consumers toward healthier and more sustainable dietary choices. These systems can be categorized into two primary types: health-aware and sustainable-aware RSs [3]. Health-aware food RSs [4] aim to assist users in defining daily diets that align with their nutritional needs and health goals. These systems typically achieve this by balancing user preferences with various health-related factors. Previous methods have tried to incorporate healthiness by replacing ingredients with healthier alternatives [5, 6] or incorporating nutritional facts as function constraints [7, 8]. In [9], a post-filtering method has been proposed to score recipes based on health criteria.While these approaches have shown promise in promoting healthier eating habits, they often face limitations. Notably, methods that directly substitute ingredients or impose hard constraints on healthiness can significantly alter the recipe’s original STAI’24: International Workshop on Sustainable Transition with AI (Collocated with the 33rd International Joint Conference on Artificial Intelligence 2024), August 05, 2024, Jeju, Republic of Korea. * Corresponding author. $ alessandro.petruzzelli@uniba.it (A. Petruzzelli); cataldo.musto@uniba.it (C. Musto); m.dicarlo6@studenti.uniba.it (M. C. Di Carlo); g.tempesta16@studenti.uniba.it (G. Tempesta); giovanni.semeraro@uniba.it (G. Semeraro)  0009-0008-2880-6715 (A. Petruzzelli); 0000-0001-6089-928X (C. Musto); 0009-0001-0461-8276 (M. C. Di Carlo); 0009-0000-6211-7173 (G. Tempesta); 0000-0001-6883-1853 (G. Semeraro) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings characteristics, potentially compromising user satisfaction. Additionally, post-filtering approaches may discard potentially healthy recipes that fall below an arbitrary threshold, limiting user choice. On the other hand, sustainability-aware food RSs solely consider the environmental impact related to food consumption. For instance, in [3], the authors introduce a system that exploits the information about water footprint. In particular, it promotes recipes with ingredients whose production needs a lower quantity of water. While being of interest and certainly novel, this approach fails to capture the complete picture of a recipe’s impact ignoring other sustainability aspects such as carbon emissions [10], that play a key role in assessing the sustainability of a recipe. To sum up, the analysis of the state of the art showed that there is a scarcity of systems that jointly tackle the problem of providing food suggestions that are healthy and sustainable at the same time. Accordingly, we propose a novel framework that aims to fill in this gap by exploiting large language models (LLMs) and a recipe similarity formula based on macro-nutrients. In particular, given an input (not sustainable) recipe, we first use macro-nutrients to identify suitable alternative, then we rank them based on our sustainability score and we finally exploit large language models (i.e., GPT 3.5 Turbo [11]) to select an alternative recipe that is both healthy and sustainable. Up to our knowledge, the use of LLMs to identify sustainable food alternative is a completely novel research direction. Figure 1: A toy example of HeASe framework In our vision, this approach acknowledges that health-conscious consumers often consider not only the nutritional value of food but also its environmental impact. So, by incorporating a sustainability score for each ingredient, the framework can identify recipes that encompass both individual well-being and environmental responsibility. A toy example showing the behavior of the framework is presented in Figure 1, while the contribution of the paper can be summarized as follows: • Sustainability Score: we introduce a strategy to estimate the sustainability of a recipe based on the information about water and carbon footprint of its ingredients. • Dataset: we release a new dataset that extends HUMMUS [12] with sustainability and healthiness scores for ingredients. In particular, we provided all the recipes in the dataset with information about environmental aspects. This will encourage and foster research in the area of sustainability- aware food RSs. • HeASe Framework: we propose a framework that provides users with more sustainable and healthier recipes by exploiting: (a) recipe similarity based on macro-nutrients; (b) sustainability and healthiness scores; (c) selection mechanism based on LLMs. • Evaluation: we showed that our sustainability scores allowed to identify similar but more sustainable recipes. Moreover, we also showed the LLMs can be particularly effective in selecting the most suitable alternative given a pool of candidate recipes. Both these directions have been scarcely investigated in the state of the art. 2. Assessing Healthiness and Sustainability 2.1. Calculating Healthiness of Recipes Determining the "healthiness" of a recipe is a complex issue, heavily influenced by its nutrient composi- tion and individual dietary needs. The concept of healthy food has experienced significant evolution, with past approaches focusing on factors like calories information [4], cholesterol levels [13], or multi- nutrients like protein, sodium, and saturated fats [14]. Today, we have a more comprehensive framework based on guidelines from international health organizations like the World Health Organization (WHO) [15]. The WHO recommends daily intake ranges for 15 macro-nutrients. Based on these intakes, in the HUMMUS dataset [12] the authors created a single score reflecting a recipe’s overall healthiness. In particular, the method relies on the "traffic light" system proposed by [16]: each macro-nutrient range is assigned a color based on its perceived healthfulness (green for healthy, yellow for moderate, red for unhealthy) , and each color is mapped to a range of scores. The individual scores of the macro-nutrients are then added up and normalized to create a final WHO score ranging from 0 (very healthy) to 14 (very unhealthy) for each recipe. Given a recipe 𝑟, from now on the healthiness of the recipes calculated as we just described is indicated as 𝑊 𝐻𝑂(𝑟). For more details on the formula, we suggest to refer to [12]. 2.2. Calculating Sustainability of Recipes While the task of calculating the healthiness of a recipe has some previous attempts, the assessment of the sustainability is relatively newer and scarcely investigated. Indeed, sustainability is a complex and constantly developing field, with no single universally accepted method. This makes it challenging to objectively compare the environmental impact of different recipes. Only of the first attempts in this direction is represented by the SU-EATABLE Life (SEL) dataset [17], that provides carbon footprint (WC) and water footprint (WF) data for various food ingredients. In this work, we tackle the task of assessing the sustainability of the recipes available in the HUMMUS dataset by properly processing the information encoded in SEL dataset. In particular, the process is organized as follows: 1. Pre-process the SU-EATABLE Life (SEL) dataset. We remove noise by eliminating items lacking both footprints, removing unnecessary characters from names, and filtering out stopwords and adjectives. 2. Match ingredients with recipes: We match ingredients in the SEL dataset with those in each recipe from the HUMMUS dataset. 3. Handle missing ingredients: To ensure comprehensive matching, we perform additional steps: • Check if the SEL ingredient name is contained within the recipe ingredient name. • Check if the recipe ingredient name is contained within the SEL ingredient name. • If the above steps find matchings, we utilize transformers1 to calculate the similarity between missing ingredients and matched ones in SEL, with a threshold of 0.98. We manually reviewed similarities further refined the matches. 4. Manual intervention for high-occurrence missing ingredients: We manually addressed 87 missing ingredients with over 1000 occurrences, identifying 19 potential associations. Based on the previous strategy, given an ingredient 𝐾 we can obtain its corresponding water and carbon footprints, labeled as 𝑊 𝑃𝑓 (𝐾) and 𝐶𝑃𝑓 (𝐾). Next, to evaluate the overall environmental impact of an ingredient we designed a new metric named Ingredient Sustainability Score (ISS), calculated as follows: 𝐼𝑆𝑆(𝐾) = 𝛼 × 𝑊 𝐹𝑓 (𝐾) + 𝛽 × 𝐶𝐹𝑓 (𝐾) (1) where: • 𝐾 represents the specific ingredient. • 𝑊 𝐹𝑓 (𝐾) denotes the water footprint of ingredient 𝐾. • 𝐶𝐹𝑓 (𝐾) represents the carbon footprint of 𝐾. • 𝛼 and 𝛽 are weighting factors, with 𝛼 = 0.2 and 𝛽 = 0.82 1 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 2 This weighting scheme prioritizes the carbon footprint over the water footprint, reflecting the generally greater environmental impact of greenhouse gas emissions compared to water use. Of course, different weighting schemes may be adopted as well. Next, based on the ISS scores for ingredients, we define a scoring function for recipes. To this end, we first rank the ingredients 𝑖1 . . . 𝑖𝑛 based on their ISS. Then, we define the Recipe Sustainable Score (RSS) for a recipe 𝑅 as: |𝑁 |−1 ∑︁ 𝑅𝑆𝑆(𝑅) = 𝐼𝑆𝑆(𝑖𝑘 )𝑒−𝑖 (2) 𝑘=0 Where 𝑖𝑘 represents the 𝑘-th ingredient of the recipe, based on the previous ranking. The intuition behind this formula is to give a greater importance to the ingredients with higher carbon and water footprint (i.e., those that have a greater environmental impact). Differently from a simple average, that gives identical importance to the ingredients, this strategy gives more importance to ingredients that are not sustainable. Indeed, this discounting mechanism ensures that the overall recipe score reflects the dominance of the main ingredient while incorporating the influence of additional ingredients. Finally, the ultimate sustainability score (SuS) of a recipe was computed as: 𝑅𝑆𝑆(𝑅) − 𝑀 𝑖𝑛𝑅𝑆𝑆 SuS(R) = 1 − (3) 𝑀 𝑎𝑥𝑅𝑠𝑠 − 𝑀 𝑖𝑛𝑅𝑠𝑠 Where MinRSS and MaxRSS are the minimum and maximum RSS scores obtained over the dataset of recipes, respectively, and are used as a normalization factor. It is important to note that the Sustainability Score is calculated based on the water and carbon footprint of all the ingredients of the recipe. These have negative environmental impacts, so a higher overall score indicates a more sustainable recipe. A qualitative evaluation of the effectiveness of our formula is provided next. 2.3. Description of the Dataset As mentioned in the previous steps, one of the contributions of the paper is a new dataset providing information about sustainability of recipes. Our dataset is based on Health-aware User-centered recoMMendation and argUment-enabling data Set (HUMMUS) dataset. This dataset is built on top of the existing FoodKG [18] knowledge graph. The authors have added more data to the graph by collecting additional information for each recipe. They have also included valuable features such as nutritional scores from WHO, FDA, and Nutriscore. This dataset has over 507, 000 recipes, and each recipe contains details about ingredients, macro-nutrients (calories, total fat, etc.), and other relevant information organized into tags. The tags provide information about key recipe aspects like main ingredients (meat, pork, fruit) and dish category (main course, dessert, breakfast). The dataset contains a set of 902 unique tag values. To ensure the dataset’s quality, we performed some prep-rocessing steps. We removed duplicate recipes, those missing any tags, and those lacking any listed ingredients. This process helped to refine the dataset and improve its overall usability, reducing the number of recipes to 214, 800. Next, we applied the pipeline described in section 2.2 to calculate the SuS score for each recipe. However, during this process, we noticed that not all ingredients could be matched, even after manual checking. To maintain the overall quality of the dataset, we decided to remove recipes where more than 30% of ingredients could not be matched in the SEL dataset. This additional filtering reduced the number of recipes to 100,870. Finally, we categorized recipes with three sustainability labels based on their sustainability scores: • High (𝑠𝑐𝑜𝑟𝑒 ≥ 0.9): Representing highly sustainable recipes (16,433 recipes). • Medium (0.5 < 𝑠𝑐𝑜𝑟𝑒 < 0.9): Representing moderately sustainable recipes (79,157 recipes). • Low (𝑠𝑐𝑜𝑟𝑒 ≤ 0.5): Indicating recipes with low sustainability (5,280 recipes). Some examples of the recipes that were classified in each category will be provided next. Moreover, the dataset together with the labels we calculated was used in our experiment to assess the effectiveness of the strategy and was released as a contribution of the work. 3. Description of the Framework This section introduces the HeASe framework. As previously stated (see Figure 1), the goal of the framework is to automatically suggest a similar-but-healthier and more sustainable alternative of an input recipe given a by user. For better understanding the framework, we break down the process into four steps, each corresponding to a component in Figure 2. Figure 2: A schematic diagram of HeASe. The framework takes a recipe name as input and outputs a more sustainable and healthiness alternative. The framework consists of four modules: (1) Encoding (2) Retrieval (3) Ranking, and (4) Selection. 3.1. Step 1: Encoding Module The workflow starts with the Encoding Module. In a nutshell, this module takes as input the input recipe and returns a vector encoding the characteristics of the recipe in terms of macro-nutrients. This is a mandatory step, since we want to identify recipes that are healthier and more sustainable, but also similar to the input. Accordingly, it is necessary to understand nutritional values and characteristics of a recipe. To this end, we exploited a pre-trained transformer fine-tuned on the recipe domain3 to encode the input recipe based on the name of the recipe. Next, we calculate the similarity between the input recipe and the names of the other recipes available in the dataset. If a match with a similarity score exceeding 0.99 is found, we obtain a precise match. It means that a recipe with (almost) the same name exists in the dataset. Otherwise, the 𝑘 most similar recipes are returned. In this way, the framework is able to manage both exact and non-exact matching. In case of exact match, the output of the module is a vector encoding the values of the macro-nutrients of the matched recipe, together with the descriptive tags available in the dataset. Conversely, in case of non-exact matching, the macro-nutrients of the input recipe are obtained as the centroid vector of the macro-nutrients of the 𝑘 similar recipes previously identified by the transformer. 3 https://huggingface.co/davanstrien/autotrain-recipes-2451975973 3.2. Step 2: Retrieval Module As mentioned in the previous step, the Encoding module generates a representation of the input recipe based on its macro-nutrients. Such a representation is then used to search for similar recipes. To address this task, we calculated the similarity in terms of macro-nutrients between the input recipe (as returned by the Encoding module) and all the recipes in the dataset, based on the cosine similarity. This allowed us to retrieve recipes that closely matched the input recipe in terms of their nutritional composition. Moreover, we also used the tags that are available for each recipe as a further element to improve the quality of the retrieved recipes. In particular, we only return recipes that are similar and share at least one tag (i.e., pasta, breakfast, japanese, etc.) with the input recipe provided by the user. In this way, we avoid that very different recipes could be included in the output of the Retrieval module. 3.3. Step 3: Ranking Module Once similar recipes are obtained, it is necessary to rank them in order to identify an alternative that is more sustainable and healthier. This role is played by the Ranking module, whose goal is to take as input the recipes previously returned by the Retrieval module and identify the better alternatives for the user. To rank the recipes, we defined a new function called HeaSe Score (HS), defined as follows: HS(R) = 𝛼 · Sustainability(𝑅) + 𝛽 · WHO(𝑅) (4) • Where 𝑅 represents a recipe. • SuS(𝑅) is a function that returns the sustainability score of R, as described in Section 2.2 • WHO(𝑅) is a function that returns the WHO score of a given recipe. • 𝛼 and 𝛽 hyperparameters that allow you to weight the importance of each factor. At the end of this step, a list of ranked alternative recipes is obtained. It is worth emphasizing that the workflow can also stop after this step, by returning to the user the top-1 recipe retrieved by the systems based on the HeaSe score. However, we also implemented a Selection module based on LLMs to assess whether the knowledge encoded in large language models can be exploited to better handle this task. 3.4. Step 4: Selection Module Finally, in the Selection module, the output previously obtained from the Ranking module is processed by using LLMs, specifically GPT-3.5 turbo, in order to select the most suitable alternative of the recipe provided as input by the user. To carry out this step we specifically designed a strategy inspired by Retrieval-Augmented Generation (RAG) [19] which takes as input the list of candidate recipes and asks the LLM to select the most suitable one. This is done through a zero-shot prompt that is used to query the LLM, leaving it the task to identify the most suitable candidate recipe based on the knowledge encoded in the language model. An example of such a prompt is provided below. As shown in the example, we populate the prompt with the recipes previously identified and we let GPT pick the more sustainable alternative recipe. To mitigate potential biases like positional bias [20], the retrieved recipes are shuffled and inserted into the prompt without any additional information. U s i n g your knowledge , p l e a s e r a n k ( i f n e c e s s a r y ) t h e f o l l o w i n g r e c i p e s from most t o l e a s t recommended b a s e d on a b a l a n c e o f s u s t a i n a b i l i t y and healthiness : 1 . Recipe : Healthy Salad 2 . R e c i p e : Quinoa Bowl 3 . Recipe : Veggie S t i r −Fry Which one s h o u l d I c h o o s e ? R e t u r n j u s t t h e name . It is crucial to note that the lack of information about the input recipe is intentional and derives from the experiment’s ultimate objective. We aim to assess the LLM’s ability to accurately identify the recipe with higher values of sustainability and healthiness without relying on specific recipe details. Of course, one of the goals of the experiment will be to assess the effectiveness of LLMs in the task of automatically identifying healthy and sustainable recipes. 4. Experimental Evaluation This section explores the effectiveness of the proposed metrics and framework through experiments addressing the following Research Questions (RQs): RQ1 - Scoring Effectiveness: Can SuS and HeASe scores actually rank recipes based on sustainability and healthiness? RQ2 - Retrieval Effectiveness: Is the framework able to successfully identify suitable food alternatives? RQ3 - LLM-based Selection Effectiveness: Can LLMs be leveraged to automatically select sustainable alternatives? 4.1. Experimental Setting Dataset and Evaluation Protocol All the experiments rely on the dataset previously described in Section 2.3, that is also available online on our repository4 . Based on this dataset, we evaluated the performance of the framework by providing an input recipe and by checking whether the alternative identified by the framework is healthier and/or more sustainable. To guarantee the soundness of the protocol, we evaluated the performance of HeaSe system across diverse scenarios: 1. Low Sustainability: based on 100 randomly selected recipes labeled as "Low" in sustainability. 2. Medium Sustainability: based on 100 randomly selected recipes labeled as "Medium" in sus- tainability. 3. High Health: based on 100 randomly selected recipes with a WHO score above average. 4. Unknown Recipes: based on 30 Recipes not present in the recipe dataset. These scenarios allow us to assess the framework’s efficacy in different contexts. For instance, for the "Low Sustainability" scenario we expect significant improvements in the output recipe’s sustainability and healthiness compared to the input. However, we also evaluate the framework’s performance in more challenging settings (i.e., high health, based on recipes that are already healthy, or unknown, in order to also assess the effectiveness of non-exact matching in the retrieval phase). Implementation Details and Model Parameters The model uses a pre-trained transformer encoder with a hidden dimensionality of 768. This allows the model to efficiently find similarities between the input text and recipe titles, even when the input doesn’t perfectly match the recipe title. As for the Retrieval module, the number of alternative recipes based on macro-nutrient similarity which is returned is set to 100. The recipe representation is based on its macro-nutrients, which include: Calories [cal], Total Fat [g], Saturated Fat [g], Cholesterol [mg], Sodium [mg], Dietary Fiber [g], Sugars [g], and Protein [g]. As regards the scoring function in the Ranker module, the best configuration for the model was achieved by setting the alpha and beta values in the formula 4 to 0.7 and 0.3, respectively. Evaluation Metric We evaluate the performance of the HeASe system by calculating the mean percentage increment of each metric for each scenario. Given an input recipe (𝑅) and a list of 𝑁 possible alternatives (𝐴) returned by the system, we compute the following: 1 ∑︀𝑁 𝑁 𝑖=0 𝑊 𝐻𝑂(𝐴𝑖 ) − 𝑊 𝐻𝑂(𝑅) WHO_incr = × 100 (5) 𝑊 𝐻𝑂(𝑅) 4 https://github.com/swapUniba/HeASe 1 ∑︀𝑁 𝑁 𝑖=0 𝑆𝑢𝑆(𝐴𝑖 ) − 𝑆𝑢𝑆(𝑅) SuS_incr = × 100 (6) 𝑆𝑢𝑆(𝑅) 1 ∑︀𝑁 𝑁 𝑖=0 𝐻𝑒𝐴𝑆𝑒(𝐴𝑖 ) − 𝐻𝑒𝐴𝑆𝑒(𝑅) HeASe_incr = × 100 (7) 𝐻𝑒𝐴𝑆𝑒(𝑅) Intuitively, these metrics calculate the increase (if any) in terms of healthiness and sustainability of the recipe retrieved by the framework compared to the input one. Sensitivity Analysis. Finally, to investigate the performance of the system on varying of different parameters, we also carried out a sensitivity analysis based on the following key factors: • Tags matching: This option controls how strictly the recipe tags need to match between the input recipe and the retrieved items. By setting it to true, the framework only outputs recipes that share all the same tags with the input recipe. • Retrieved items: This parameter determines the number of alternative recipes retrieved as recom- mendations. 4.2. Discussion of the Results RQ1 - Scoring Function Effectiveness: To answer RQ1, we present the top-5 and worst-5 recipes based on SuS and HeASe scores. • Top-5 Recipes (Tables 1 and 3): as shown in the tables, this includes recipes like "Homemade Oat- meal," "Quinoa-Toasted," and "Seasoned Rice", which excel in both sustainability and healthiness, achieving high SuS and HeASe scores. These options likely prioritize plant-based ingredients and simple preparation methods, reducing environmental impact and promoting nutritional value. Generally speaking, we can state that the list of the more sustainable and healthy recipes confirms the effectiveness of the scoring function we designed. • Worst-5 recipes (Tables 2 and 4): Conversely, recipes like "Rich Lamb Curry," "Five Meat Chili," and "Middle Eastern Stew" score poorly in both categories. These dishes likely contain significant amounts of meat, which can contribute to a higher environmental footprint and potentially lower overall health benefits. Also, in this case, we can state that the poorly sustainable recipes are correctly identified through our scoring function. The disparity between metrics: Interestingly, the top and bottom scorers for SuS do not entirely overlap with those for HeASe. "Boiled Radishes" and "Granita" for example, rank highly in SuS but not in HeASe. This suggests that some sustainable practices might not always translate directly to health benefits, and vice versa, highlighting the need for a balanced metric like HeASe. To sum up, we can answer RQ1 by stating that the qualitative analysis we provided generally confirmed the effectiveness of the scoring function we introduced in this paper. Table 1 Top-5 Recipes ordered for HeASe Score Recipe Title SuS WHO HeASe Homemade Oatmeal 0.983 0.461 0.827 Quinoa-Toasted 0.975 0.444 0.816 Seasoned Rice 0.979 0.423 0.812 Fat Free Whole Wheat Tortillas 0.975 0.418 0.808 Plain Rice 0.977 0.383 0.801 Table 2 Worst-5 Recipes ordered for HeASe Score Recipe Title SuS WHO HeASe Rich Lamb Curry 0.039 0.153 0.074 Five Meat Chili 0.028 0.198 0.079 Middle Eastern Stew 0.031 0.206 0.084 Roast Leg of Lamb 0.049 0.213 0.098 Curried Lamb on Rice 0.049 0.224 0.101 Table 3 Top-5 Recipes ordered for SuS metric Recipe Title SuS WHO HeASe Boiled Radishes 0.997 0.293 0.786 Horseradish Applesauce 0.997 0.314 0.792 Granita 0.996 0.236 0.768 Rehydrated Onions 0.995 0.268 0.777 Pot Onion Chops 0.995 0.260 0.775 Table 4 Worst-5 Recipes ordered for SuS metric Recipe Title SuS WHO HeASe Five Meat Chili 0.029 0.198 0.079 Middle Eastern Stew 0.032 0.206 0.084 Rich Lamb Curry 0.040 0.153 0.074 Curried Lamb on Rice 0.049 0.224 0.101 Roast Leg of Lamb 0.049 0.213 0.098 RQ2 - Retrieval Effectiveness To answer RQ2, we conducted several tests to evaluate the effec- tiveness of the framework, that is to say, to assess whether the alternative recipes retrieved through our pipeline are healthier and more sustainable w.r.t. the input recipe. In particular, for each of the 100 recipes in each scenario (see Section 4.1) we retrieved the 100 most similar recipes based on macro- nutrients, we ranked them based on our HeaSe score, and we calculated the average increase in terms of healthiness and sustainability for all the recipes. The results are reported in Table 5. As shown in Table 5, the results confirmed the effectiveness of the approach, since the proposed alternative recipes are healthier and more sustainable, on average, in all the experimental scenarios we considered. It is worth emphasizing that the results are consistent across all the different scenarios, even if the gaps of course reflect the complexity of the task. Indeed, when poorly sustainable recipes are used as input of the framework, a huge average increase emerges from all the alternatives. Even though this was expected, it is important to see that the increase we obtained is really huge, on average. It is also important to note that an average increase in terms of sustainability is obtained when recipes that are already healthy are used as input. Next, the results of the sensitivity analysis are shown in Figures 3 and 4. Due to space constraints, we only reported the plot for two scenarios, i.e., the "Low Sustainability" scenario and the "High Health" scenario. The other scenarios follow a similar trend. Plots clearly show that the framework achieves better performance as the number 𝑁 of alternative recipes increases, and it confirmed our choice of choice of retrieving and ranking 100 similar recipes. In particular, as shown in Figure 4a, this is a necessary choice for the "high health" scenario, since by considering the top-1 and top-10 recipes retrieved we have an average decrease in sustainability. Conversely, by increasing the number of recipes, the overall healthiness and sustainability are higher. While this suggests that alternative strategies for retrieval and ranking need to be investigated in the future, proper tuning of the parameters still guarantees good performance. Finally, Figures 3b and 4b show the results on varying of the tag matching strategy. The results reveal slight differences, with configurations that don’t require matching all tags generally producing better results. This means that when the retrieved recipes need to match all the tags of the input recipe, non-relevant recipes may be generally returned. To sum up, all the results of the sensitivity analysis showed that the platform generally performs well, but a proper choice of parameter may lead to more effective results. Table 5 Performance of the HEaSe framework in the retrieval task Scenario WHO_incr SuS_incr HeASe_incr Low Sustainability +12.70% +139.03% +112.89% Medium Sustainability +69.27% +22.70% +21.38% High Health +5.51% +20.19% +17.67% Unknown Recipes +16.43% +17.87% +17.51% To conclude the analysis, in Table 6 we report some qualitative examples showing the real behavior of the HEaSe framework. In particular, for each experimental scenario, we present the output generated by the platform based on different input recipes. As shown in the table, in all the reported settings the alternative recipe is healthier more sustainable, and sufficiently similar to the input one. This definitely confirmed the effectiveness of the design choices. More tests can be carried out by running our online demo5 . Table 6 Input-Output examples per scenario Scenario Input Output HeASe Rockin Cheddar Ranch Turkey Burgers! Ginger, Lemon and Garlic Swordfish Steak. +24.70% Medium Sustainability Strippin’ Chicken! (Bacon Strip Chicken) Super Simple Chicken Salad +24.12% Beef Stir-Fry Tofu Hot wings +104.80% Low Sustainability Turkey-beef Kebabs Slow-Cooker Swiss Steak +92.17% Chili Dog Casserole No-fuss Burgers +119.23% High Health Wedding Cakes Spice Cookies +10.84% Figure 3: Mean percentage increments on the three metrics on Low Sustainability Scenario on different configuration Figure 4: Mean percentage increments on the three metrics on High Health Scenario on different configuration 5 https://github.com/GiovTemp/SustainaMeal_Case_Study Table 7 Experiments on the Selection based on LLMs WHO_incr SuS_incr HeASe_incr gpt_rerank +3.26% +71.33% +56.07% True +2.77% +68.41% +54.27% False RQ3 - LLM-based Selection Effectiveness: Finally, to answer RQ3, we evaluated the ability of GPT- 3.5 Turbo to automatically pick the more sustainable alternative in a pool of candidate recipes retrieved by the system. The process follows the step described in the Selection module of the framework. Due to limitations in prompt length, we experimented with a smaller set of alternatives (i.e., 10 candidate recipes). The analysis with a longer prompt is left as future work. In Table 7, we compare the healthiness and sustainability of the recipe with the highest score calculated by the Ranker to the recipe identified by GPT among the top-10 returned by the Ranker as well. As shown in the table, the results show that the LLM showed an unexpected and surprising ability to exploit its own knowledge about responsible food consumption to automatically select the best recipe in a pool of 10 candidates. Indeed, when compared with the top-1 recipes previously picked, the average sustainability and healthiness of the recipes is generally higher. These findings suggest that LLMs can effectively leverage the strengths of both retrieval and generation techniques to identify recipes that are both sustainable and healthy. This is an important finding of this work, showing the effectiveness of LLMs in a novel and scarcely investigated research direction. 5. Discussion and Future Works The framework described in this paper aligns with SDG3 and SDG12. In particular, we foresee the following impact: - SDG 3 - Good Health and Well-being: Promoting Healthier Diets. The framework focuses on encouraging individuals to adopt healthier eating habits. By leveraging our system users can explore and choose recipes that contribute to a balanced and nutritious diet. This directly contributes to the goal of ensuring good health and well-being by promoting better nutrition and reducing the risk of diet-related diseases. - SDG12 - Responsible Consumption and Production: Ingredient Substitution: The framework contributes to responsible consumption by helping users identify more sustainable substitute ingredients in recipes. This aligns with SDG 12’s focus on ensuring sustainable consumption by promoting eco- friendly and ethically sourced ingredients. In summary, the HeaSe framework contributes to SDG 3 by promoting healthier diets and better well-being and to SDG 12 by encouraging responsible consumption and production practices. By combining technology-driven solutions with user engagement and education, the project seeks to address the interconnected challenges of health and sustainability in the context of food choices. In future work, we will evaluate different strategies for the selection of alternative recipes, and we evaluate the effectiveness with real users. Acknowledgements We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI under the NRRP MUR program funded by the NextGenerationEU and project PHaSE (CUP H53D23003530006) - Promoting Healthy and Sustainable Eating through Interactive and Explainable AI Methods, funded by MUR under the PRIN program. Additionally, we acknowledge the CINECA award under the ISCRA initiative (class C project: IscrC_LLM_REC), for the availability of high-performance computing resources and support References [1] C. Hartmann, G. Lazzarini, A. Funk, M. Siegrist, Measuring consumers’ knowledge of the environ- mental impact of foods, Appetite 167 (2021) 105622. [2] C. Trattner, D. Elsweiler, Food recommender systems: important contributions, challenges and future research directions, arXiv preprint arXiv:1711.02760 (2017). [3] I. Gallo, N. Landro, R. La Grassa, A. Turconi, Food recommendations for reducing water foot- print, Sustainability 14 (2022). URL: https://www.mdpi.com/2071-1050/14/7/3833. doi:10.3390/ su14073833. [4] M. Ge, F. Ricci, D. Massimo, Health-aware food recommender system, in: Proceedings of the 9th ACM Conference on Recommender Systems, RecSys ’15, Association for Computing Machinery, New York, NY, USA, 2015, p. 333–334. URL: https://doi.org/10.1145/2792838.2796554. doi:10.1145/2792838.2796554. [5] C.-Y. Teng, Y.-R. Lin, L. A. Adamic, Recipe recommendation using ingredient networks, in: Proceedings of the 4th annual ACM web science conference, 2012, pp. 298–307. [6] D. Elsweiler, C. Trattner, M. Harvey, Exploiting food choice biases for healthier recipe recommen- dation, in: Proceedings of the 40th international acm sigir conference on research and development in information retrieval, 2017, pp. 575–584. [7] D. Elsweiler, M. Harvey, B. Ludwig, A. Said, Bringing the "healthy" into food recommenders, in: International Workshop on Decision Making and Recommender Systems, 2015. URL: https: //api.semanticscholar.org/CorpusID:1838398. [8] Y.-K. Ng, M. Jin, Personalized recipe recommendations for toddlers based on nutrient intake and food preferences, in: Proceedings of the 9th international conference on management of digital ecosystems, 2017, pp. 243–250. [9] C. Trattner, D. Elsweiler, Investigating the healthiness of internet-sourced recipes: implications for meal planning and recommender systems, in: Proceedings of the 26th international conference on world wide web, 2017, pp. 489–498. [10] D. Pandey, M. Agrawal, J. S. Pandey, Carbon footprint: current methods of estimation, Environ- mental monitoring and assessment 178 (2011) 135–160. [11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901. [12] F. Bölz, D. Nurbakova, S. Calabretto, A. Gerl, L. Brunie, H. Kosch, Hummus: A linked, healthiness- aware, user-centered and argument-enabling recipe data set for recommendation, in: Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, Association for Computing Machinery, New York, NY, USA, 2023, p. 1–11. URL: https://doi.org/10.1145/3604915.3609491. doi:10.1145/3604915.3609491. [13] A. Starke, C. Trattner, H. Bakken, M. Johannessen, V. Solberg, The cholesterol factor: Balancing accuracy and health in recipe recommendation through a nutrient-specific metric, in: Proceedings of the 1st Workshop on Multi-Objective Recommender Systems (MORS 2021), 2021. [14] R. Yera Toledo, A. A. Alzahrani, L. Martínez, A food recommender system considering nutritional information and user preferences, IEEE Access 7 (2019) 96695–96711. doi:10.1109/ACCESS. 2019.2929413. [15] W. H. Organization, Healthy diet, https://www.who.int/news-room/fact-sheets/detail/healthy-diet, 2020. [16] G. Sacks, M. Rayner, B. Swinburn, Impact of front-of-pack ‘traffic-light’nutrition labelling on consumer food purchases in the uk, Health promotion international 24 (2009) 344–352. [17] T. Petersson, L. Secondi, A. Magnani, M. Antonelli, K. Dembska, R. Valentini, A. Varotto, S. Castaldi, A multilevel carbon and water footprint dataset of food commodities, Scientific data 8 (2021) 127. [18] S. Haussmann, O. Seneviratne, Y. Chen, Y. Ne’eman, J. Codella, C.-H. Chen, D. L. McGuinness, M. J. Zaki, Foodkg: A semantics-driven knowledge graph for food recommendation, in: C. Ghidini, O. Hartig, M. Maleshkova, V. Svátek, I. Cruz, A. Hogan, J. Song, M. Lefrançois, F. Gandon (Eds.), The Semantic Web – ISWC 2019, Springer International Publishing, Cham, 2019, pp. 146–162. [19] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474. [20] P. Wang, L. Li, L. Chen, Z. Cai, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, Z. Sui, Large language models are not fair evaluators, 2023. arXiv:2305.17926.