
Benchmarking the Semantics of Taste: Towards the Automatic Extraction of Gustatory Language

Teresa Paccosi (0, 1, 2)

Sara Tonelli (1)

0 DHLab / KNAW Humanities Cluster, Oudezijds Achterburgwal 185, 1012 DK Amsterdam, The Netherlands
1 Fondazione Bruno Kessler, Via Sommarive 18, Trento
2 Università degli studi di Trento, Via Calepina 14, Rovereto

In this paper, we present a benchmark containing texts manually annotated with gustatory semantic information. We employ a FrameNet-like approach previously tested on olfactory language, which we adapt to capture gustatory events. We then explore the data in the benchmark to show the insights this type of approach can bring, focusing on the emotional valence of taste words across text genres. Finally, we present a supervised system trained on the taste benchmark for the extraction of gustatory information from historical and contemporary texts.

Keywords: sensory semantics, gustatory language, information extraction, digital humanities

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
tpaccosi@fbk.eu; teresa.paccosi@unitn.it (T. Paccosi); satonelli@fbk.eu (S. Tonelli)
ORCID: 0009-0009-2348-7556 (T. Paccosi); 0000-0001-8010-6689 (S. Tonelli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Despite the central role of nutrition in our lives, taste has often been classified as an inferior sense in the Western philosophical tradition. This downplayed role is reflected in the vocabulary used to describe the gustatory experience, which, together with smell, is characterized by a scarcity of domain-specific terms [1]. The difficulty in capturing the semantics of taste could help explain why there are few works in the fields of Natural Language Processing (NLP) and Digital Humanities (DH) that deal with this sense and, in particular, the language used to describe its experience. While there has been renewed interest in the automatic extraction of nutrients and ingredients from texts for health and medicinal purposes [2], less attention has been devoted to the development of tools and models focused on capturing the semantics of sensory experiences, especially in a diachronic fashion.

In this paper, we present an English benchmark for the study of gustatory language and a supervised system for the automatic extraction of taste-related events in English, which we trained using this benchmark. The benchmark was built to be a counterpart to the olfactory one presented in [3], with the idea of making the study of the language of these two senses comparable. The system is designed as a means to study the language used to describe the experience of tasting from both synchronic and diachronic perspectives. The selected formal representation for the semantics of taste is based on Frame Semantics [4], and the system is trained to identify the lexical units and the possible semantic roles contributing to the construction of a gustatory event. We present the results of the experiments and an exploration of the benchmark data, aiming to demonstrate the potential of frame-based analysis for sensory studies.

2. Related Work

In recent years, there has been a growing interest within the NLP community in developing resources designed to capture the sensory content of language [5]. In particular, in the framework of the three-year European Project “Odeuropa”¹, aimed at preserving intangible cultural heritage, several works have focused on analyzing smell descriptions [6] and extracting olfactory information from texts. For instance, [3] created a manually annotated benchmark with smell events, which has been subsequently used to train a system for olfactory information extraction [7, 8]. The benchmark focuses on the language used to describe olfactory experiences and covers a period of four centuries (1600-1900), making it useful for historical research. An extension in this direction is SENSE-LM, a system for extracting sensory information from texts, which shows that combining language models with lexical resource-based approaches yields better results in extracting sensory references from texts compared to systems that do not integrate these two components [9]. The authors were the first to combine sensorimotor representations with the textual features of language models for the task of sensory information extraction in text documents. Although they propose the system for all five senses, they only tested it on olfactory and auditory language, using respectively the benchmark of [3] and an artificial dataset they generated with GPT-4 [10].

Most existing work on food representation in the field of NLP focuses on health-related applications. A notable work with a linguistic focus is [2], where the authors concentrate on identifying noun-compound headnouns for developing conversational agents in the e-commerce domain. They propose a supervised approach based on a neural sequence-to-sequence model to identify the most informative token in Italian food compound-nouns, obtaining promising results despite the complexity of the task. Taste has also been addressed from a diachronic point of view in [11], in which the author reconstructs the evolution of food language, focusing on the history of some dishes and ingredients across continents using computational linguistic tools. Several studies have developed named-entity recognition (NER) models to automatically extract food entities for medicinal purposes and food science applications [12, 13], creating domain-specific corpora by sourcing data from culinary websites and online recipe books [14, 15].

¹ https://odeuropa.eu/

3. Benchmark for Taste

The training data we use for the models in this paper is a benchmark created according to the annotation guidelines presented in [16]. The formalization adopted to annotate the benchmark is inspired by Frame Semantics [4] and its implementation through the FrameNet annotation project [17]. In FrameNet, events and situations are construed as frames, structures that represent the knowledge necessary to understand the meaning of words. Frames include two main components: lexical units, domain-specific words or expressions that trigger the frame, and frame elements, domain-specific semantic roles usually attached as dependents to the lexical unit. In our case, taste events are captured through a so-called Gustatory frame, which is triggered in a document by Taste_Words (i.e., domain-specific lexical units). Each lexical unit is annotated in the benchmark together with the frame elements associated with it, which the taste extraction system should then identify automatically. For instance, in the sentence “[Slimy milk]_Taste_Source has an [unpleasant]_Quality taste”, the system has to identify the Taste_Word (‘taste’), and then the possible frame elements (in this case, Taste_Source and Quality). A list of the possible frame elements and their definitions is provided in Table 1.

Table 1: Frame elements and their definitions.

Frame Element     Definition
Taste_Source      The food items that are ingested
Quality           Any property used to describe the taste (usually adjectives)
Taste_Carrier     Anything that can contain the taste source
Taster            The person/animal who ingests the food
Evoked_Taste      The taste that is evoked but is not present (e.g., it tastes like onions)
Location          The place in which the food is tasted
Taste_Modifier    An ingredient that can modify the perception of the taste of a taste source
Circumstances     The condition or circumstance in which the taste event occurs
Effect            Any effect provoked by the tasting experience

The documents annotated in the benchmark cover 5 different domains or genres, almost evenly distributed, with 3 to 4 documents per century in every domain, for a total of 72 documents. The genres are: Literature, Science & Philosophy, Household & Recipes, Travel & Ethnography, and Medicine & Botany. To select the documents, we automatically searched for texts presenting a greater density of lexical units (taste words)² across several English corpora and taste-related websites. The corpora from which we extracted the documents we annotated are:

(1) Early English Books Online (EEBO)³, a collection of documents published between 1475 and 1700 covering different domains such as literature, philosophy, politics, religion, geography, history, and mathematics;
(2) Project Gutenberg⁴, a digitized archive of cultural works, containing different repositories, mainly in the literary domain;
(3) medievalcookery.com⁵, a list of texts freely available online relating to medieval food and ancient cooking recipes;
(4) foodsofengland.co.uk⁶, an online library which holds the complete texts of several cook books from 1390 to 1974;
(5) Wikisource⁷, an online digital library of free-content textual sources managed by the Wikimedia Foundation;
(6) British Library⁸, a collection of 65,227 digitised volumes from the 16th to the 19th century;
(7) London Pulse Medical Reports⁹, a collection of 5800 Medical Officer of Health reports from the Greater London area from 1848 to 1972.

In Table 2 we report the statistics of the annotated benchmark (note that in [16] we presented only a preliminary version of the benchmark containing around 1,400 Taste_Words). The most frequent frame element is Taste_Source, followed by Quality and Taste_Modifier, which represent the core frame elements, while the rest of the frame elements are much sparser. Even if the distribution of the frame elements is not balanced, the system is trained to extract the taste words and all 9 frame elements. Two expert linguists, trained on the guidelines of [16], annotated three documents from 1670, 1720, and 1920 to assess Inter Annotator Agreement (IAA). Krippendorff’s alpha [18] at span level was 0.70, indicating moderate agreement.

Table 2: Statistics of the annotated benchmark (columns: Frame Elements (FEs), 1500, 1900, Overall; rows: Taste_Words, Taste_Source, Quality, Taste_Modifier, Taster, Evoked_Taste, Location, Taste_Carrier, Circumstances, Effect).

² The list of lexical units is provided in Appendix A
³ https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/
⁴ https://www.gutenberg.org/
⁵ https://www.medievalcookery.com/etexts.html?England
⁶ http://www.foodsofengland.co.uk/references.htm
⁷ https://en.wikisource.org/wiki/Main_Page
⁸ https://data.bl.uk/digbks/
⁹ https://wellcomelibrary.org/moh/about-the-reports/about-the-medical-officer-of-health-reports/

4. Exploration of olfactory and gustatory benchmarks

It has been observed that words used to describe olfactory and gustatory experiences tend to appear more frequently in emotionally charged contexts and carry a stronger evaluative content compared to words related to other senses [19]. By ‘evaluative content’, we refer in this paper to the concept of ‘emotional valence’, which is defined as “the pleasantness of a word in terms of positive and negative meaning” ([1], p. 201). We therefore conducted an exploration of the gustatory benchmark to investigate the positive and negative connotations of gustatory events across different text genres. We perform the same analysis for olfactory events, using the olfactory benchmark of [3], in order to compare the outcome for the two senses.

To perform this analysis, we first divide Taste_Words and Smell_Words into positive and negative. For this purpose, we use the categories proposed in the Historical Thesaurus of English: Savouriness and Unsavouriness for Taste, and Fragrant/Fragrance and Stench for Smell¹⁰. This thesaurus contains almost every recorded word in English from medieval times to the present day, ordered into detailed hierarchies of meaning. In the Thesaurus, every category of the hierarchy is divided per part of speech (PoS). For our analysis, we manually selected all the nouns, adjectives and adverbs used in the period covered by our documents, namely from the 16th century to the 20th century. We then assigned the words labeled as Taste_Words and Smell_Words in the documents to one of the two categories (positive or negative) and calculated the normalized frequency of each category across different text genres. As reported in Section 3, the genres represented in the gustatory benchmark are: Literature, Science & Philosophy, Household & Recipes, Travel & Ethnography, Medicine & Botany. In the olfactory benchmark presented in [3], there are instead 10 different genres: Household & Recipes, Law & Regulations, Literature, Medicine & Botany, Perfumes & Fashion, Public health, Religion, Science & Philosophy, Theatre, Travel & Ethnography.

We display the output of this analysis in Fig. 1 (for taste words) and Fig. 2 (for smell words), which show the emotional valence that prevails in each genre for the two senses. We observe that two genres exhibit opposite tendencies: medicine/botany shows a more negative orientation in the smell benchmark and a more positive one in the taste benchmark, whereas travel/ethnography is more positive concerning smell and more negative for taste (see Fig. 1 and Fig. 2, where light blue refers to negative valence and dark blue to positive valence). We then analyzed the most frequent smell and taste sources in the two selected genres to explain why they exhibit such a difference in emotional valence. We notice that smell sources in medicine/botany tend to belong to hospital and disease-related domains, with words such as ‘urine’ and ‘fetid bronchitis’, while taste sources more often belong to the realm of common food, with words such as ‘almonds’ and ‘apples’. As for travel/ethnography, the most frequently described taste sources include exotic and rare foods such as ‘coconut’ and ‘plantain’, likely unpleasant to the palates of foreign travelers. Smell sources instead tend to refer to plants, like ‘flowers’ or ‘roots’, hence usually pleasant or neutral to the noses of the writers. This analysis of the distribution of categories and sources in the genres underlines the importance of a frame-based analysis for understanding and comparing sensory descriptions, in particular their emotional valence.

¹⁰ In the categories at https://ht.ac.uk/category/: The world>physical sensation>Taste/Flavour>Savouriness&Unsavouriness; The world>physical sensation>Smell/Odour>Fragrant/Fragrance&Stench
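The valence analysis of Section 4 can be illustrated with a minimal sketch. The lexicon entries, the function name `valence_by_genre`, and the toy annotations below are hypothetical placeholders, not data from the benchmark; the normalization (share of positive and negative words among the valence-bearing words of each genre) is also an assumption, since the exact normalization used for Fig. 1 and Fig. 2 is not specified here.

```python
from collections import Counter, defaultdict

# Hypothetical valence lexicon. In the paper the categories come from the
# Historical Thesaurus of English; these few entries are placeholders.
VALENCE = {
    "savoury": "positive",
    "delicacy": "positive",
    "relish": "positive",
    "unsavoury": "negative",
    "rancidity": "negative",
    "distaste": "negative",
}


def valence_by_genre(annotations, lexicon=VALENCE):
    """Return {genre: {category: share}} for words found in the lexicon.

    `annotations` is an iterable of (genre, word) pairs, one per annotated
    Taste_Word or Smell_Word. Words missing from the lexicon are ignored.
    Dividing by the per-genre total is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for genre, word in annotations:
        category = lexicon.get(word.lower())
        if category is not None:
            counts[genre][category] += 1

    shares = {}
    for genre, counter in counts.items():
        total = sum(counter.values())
        shares[genre] = {
            cat: counter[cat] / total for cat in ("positive", "negative")
        }
    return shares


# Toy input, invented for illustration only.
toy = [
    ("Medicine & Botany", "relish"),
    ("Medicine & Botany", "delicacy"),
    ("Medicine & Botany", "rancidity"),
    ("Travel & Ethnography", "distaste"),
    ("Travel & Ethnography", "unsavoury"),
    ("Travel & Ethnography", "taste"),  # not in the lexicon, skipped
]
result = valence_by_genre(toy)
```

Normalizing within each genre keeps genres of very different sizes comparable, which matters when the olfactory and gustatory benchmarks do not contain the same number of documents per genre.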

5. System for Gustatory Information Extraction

The benchmark introduced in the previous sections is used to train a classifier whose goal is to detect gustatory information in English texts. The system is based on multi-task learning (Section 5.1), and is then compared with a “single task” classifier, which we consider our baseline (Section 5.2).

5.1. Multitask configuration

To build our system for gustatory information extraction, we adopted a multitask learning approach [20, 21], a configuration successfully tested for olfactory information extraction in [7, 8]. This approach treats the classification of lexical units and each frame element as different tasks. Additionally, we explored a “single task” classification approach, where both lexical units and frame elements are classified within a multiclass token classification task. The results of these experiments served as a baseline for evaluating the effectiveness of the multitask approach. In both configurations, we employed a transformer-based model fine-tuned for a token classification task [22]. This methodology has proved effective across various NLP tasks, including olfactory information extraction [8] and the extraction of food-related ingredients [13]. We experiment with the two configurations using monolingual (English) and multilingual versions of BERT and RoBERTa, and an English historical model, MacBERTh. The models we use are listed below:

- English BERT: bert-base-cased¹¹ [23]
- Multilingual BERT (mBERT): bert-base-multilingual-cased¹² [23]
- English historical model: MacBERTh¹³ [24]
- English RoBERTa: roberta-base¹⁴ [25]
- Multilingual RoBERTa (RoBERTa xlm): xlm-roberta-large¹⁵ [26]

We fine-tuned each model using the same data, maintaining identical training, validation, and test splits, and evaluated them using 5-fold cross-validation. Each fold contained 80% of the lexical units and their related frame elements for training, 10% for validation (dev), and 10% for testing. These splits were consistent across all configurations and not entirely random. This configuration ensured a balanced distribution of frame elements and comparability in every run. For labeling the data, we adopted the IOB (Inside-Outside-Beginning) labeling format, as used in [7, 8]. This method facilitates a comprehensive analysis of sentences and lexical expressions by labeling each token with either Inside, Outside, or Beginning labels as appropriate.

To fine-tune the models, we used MaChAmp [27], a specialized toolkit designed for multi-task fine-tuning scenarios. In this approach, each label classification is treated as a distinct task. This setup ensures that simpler tasks, such as recognizing lexical units, contribute as auxiliary tasks to more complex label classifications like “Circumstances” or “Effect”, which include entire sentences rather than individual words. MaChAmp enables the choice of different parameters, such as loss weight, epochs and batch size, and we tested different configurations¹⁶. The results in Table 3 for the multitask approach share the configuration which yielded the best results. The configuration is the same for all the models and is reported in Appendix B.

5.2. “Single Task” configuration

Similar to the system for smell information extraction presented in [8], we designed our baseline approach as a single-task multiclass classification, where the model assigns one of 21 possible labels to each token. These labels include 20 representing either “begin” or “inside” of each lexical unit and frame element, and 1 label representing “outside”. As we did for the multitask approach, each model is fine-tuned with a token classification head on top¹⁷. During the training of each model, a hyperparameter search was conducted on the first fold of our data. The search space included learning rates [1e-5, 2e-5, 3e-5, 4e-5, 5e-5], batch sizes [8, 16, 32], and training epochs up to 20, with warmup applied for 10% of the training steps. After determining the optimal hyperparameters for each model, it is fine-tuned five times, each time with a different data fold, and the average scores are computed.

We present the results of the single task approach of each model in italics in Table 3. We observe high performance variations across different frame elements, with the best results obtained for “Quality” and “Taste_Modifier”. This is probably due to the fact that their syntactic realization tends to be consistent across documents, with “Quality” mainly expressed by adjectives and “Taste_Modifier” by prepositional phrases introduced by “with”. On the contrary, classification results for “Taste_Source” are quite low despite it being the most frequent FE in the training set, probably because it can be expressed by many different role fillers and syntactic constructions. Upon reviewing the test and prediction results, we find that most mistakes concerning Taste_Source are due to a wrong span extent: for instance, the system predicts “the taste of [lollipop]” while the gold standard is “the taste [of lollipop]”. This issue is also likely reflected in the inter-annotator agreement (IAA) of the benchmark. In the future, we will consider alternative ways to evaluate text spans besides exact match, for instance by computing the cosine similarity between gold instances and system predictions.

Overall, MacBERTh is the best model for Taste_Word detection, but the different FEs are mostly detected with higher accuracy using RoBERTa xlm. For this reason, we plan to adopt this model for our future research on gustatory language.

¹¹ https://huggingface.co/google-bert/bert-base-cased
¹² https://huggingface.co/google-bert/bert-base-multilingual-cased
¹³ https://huggingface.co/emanjavacas/MacBERTh
¹⁴ https://huggingface.co/FacebookAI/roberta-base
¹⁵ https://huggingface.co/FacebookAI/xlm-roberta-base
¹⁶ Loss weight with different combinations over the labels [1, 0.75], epochs [10, 20, 30], and batch size [16, 32]
¹⁷ https://huggingface.co/docs/transformers/tasks/token_classification

6. Conclusions and Future Direction

In this paper, we presented a benchmark for gustatory events containing manually annotated taste-related information, built as a counterpart to the one proposed in [3]. The benchmark is constructed with the same approach, adopting a frame-based methodological framework to analyze sensory language. We emphasized the importance of frame-based analysis to capture sensory events by exploring the characterization of positive and negative valence in the benchmarks through the analysis of taste and smell words and sources. The analysis based on frames seems to bring relevant insights into capturing sensory valence from different perspectives, likely supporting the suitability of this approach to deal with humanistic inquiries. We then presented a supervised system to automatically extract taste-related frames, trained on this benchmark. This preliminary exploration and the results obtained with our experiments seem promising for future exploration with automatically extracted data.

Indeed, the limited data of the benchmark are not enough to draw relevant conclusions, and for this reason we plan to use our system to extract more data and conduct large-scale analyses of the evolution of sensory information over time. The limited number of documents is likely a contributing factor to the significant discrepancies in accuracy among the different frame elements, necessitating more instances to enable a good generalization. Future steps should involve increasing the number of documents and providing less sparse annotations, aiming for better temporal balance. The focus should be on annotating frame elements with lower scores and fewer instances in the benchmark, such as Taste_Carrier and Location. Additionally, alternative metrics and techniques should be employed to capture and explain performance variations across different models. As a further comparison, we also plan to assess the performance of general-purpose frame semantic parsers like LOME [28] on our benchmark.

7. Acknowledgments

Funded by the European Union under grant agreement 101088548 - TRIFECTA. Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. The authors would also like to thank Marieke Van Erp, the head of the project, for her support.
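To make the span-extent issue discussed in Section 5.2 concrete, the following sketch converts IOB tag sequences into spans and compares exact-match scoring with a more lenient overlap-based criterion. It is an illustration under stated assumptions: the tag sequences for the lollipop example are invented, the function names are hypothetical, and the overlap criterion is a simple stand-in chosen for this sketch, not the cosine-similarity evaluation mentioned as future work.

```python
def iob_to_spans(tags):
    """Convert IOB tags into a set of (label, start, end) spans (end exclusive)."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            spans.add((label, start, i))
            start, label = None, None
        # Open a span on B-, or leniently on an I- that has no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def exact(a, b):
    """Spans match only if label and both boundaries are identical."""
    return a == b


def overlap(a, b):
    """Spans match if they share a label and at least one token."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def f1(gold, pred, match):
    """Span-level F1 under a given matching criterion."""
    precision = sum(any(match(p, g) for g in gold) for p in pred) / len(pred) if pred else 0.0
    recall = sum(any(match(g, p) for p in pred) for g in gold) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Tokens: the / taste / of / lollipop (tag sequences assumed for illustration).
gold_tags = ["O", "B-Taste_Word", "B-Taste_Source", "I-Taste_Source"]
pred_tags = ["O", "B-Taste_Word", "O", "B-Taste_Source"]

gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
exact_f1 = f1(gold, pred, exact)
overlap_f1 = f1(gold, pred, overlap)
```

Under exact match the Taste_Source prediction counts as an error even though it only misses the preposition, while the overlap criterion credits it; reporting both scores side by side is one way to separate boundary errors from genuinely missed frame elements.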

search 23 (2021) e28229.
[14] G. Popovski, B. K. Seljak, T. Eftimov, Foodbase corpus: a new resource of annotated food entities, Database 2019 (2019) baz121.
[15] A. Wróblewska, A. Kaliska, M. Pawłowski, D. Wiśniewski, W. Sosnowski, A. Ławrynowicz, Tasteset–recipe dataset and food entities recognition benchmark, arXiv preprint arXiv:2204.07775 (2022).
[16] T. Paccosi, S. Tonelli, A new annotation scheme for the semantics of taste, in: Proceedings of the 20th Joint ACL-ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024, 2024, pp. 39–46.
[17] J. Ruppenhofer, M. Ellsworth, M. Schwarzer-Petruck, C. R. Johnson, J. Scheffczyk, FrameNet II: Extended theory and practice, Technical Report, International Computer Science Institute, 2016.
[18] K. Krippendorff, Computing Krippendorff’s alpha-reliability, 2011.
[19] B. Winter, Taste and smell words form an affectively loaded and emotionally flexible part of the English lexicon, Language, Cognition and Neuroscience 31 (2016) 975–988.
[20] R. Caruana, Multitask learning: A knowledge-based source of inductive bias, in: Proceedings of the Tenth International Conference on Machine Learning, Citeseer, 1993, pp. 41–48.
[21] R. Caruana, Multitask learning, Machine learning 28 (1997) 41–75.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[23] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[24] E. Manjavacas Arévalo, L. Fonteyn, MacBERTh: Development and evaluation of a historically pretrained language model for English (1450-1950), in: Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH), Association for Computational Linguistics, 2021, pp. 23–36. URL: https://aclanthology.org/2021.nlp4dh-1.4.pdf.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[26] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[27] R. Van Der Goot, A. Üstün, A. Ramponi, I. Sharaf, B. Plank, Massive choice, ample tasks (machamp): A toolkit for multi-task learning in nlp, arXiv preprint arXiv:2005.14672 (2020).
[28] P. Xia, G. Qin, S. Vashishtha, Y. Chen, T. Chen, C. May, C. Harman, K. Rawlins, A. S. White, B. Van Durme, LOME: Large ontology multilingual extraction, in: D. Gkatzia, D. Seddah (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 149–159. URL: https://aclanthology.org/2021.eacl-demos.19. doi:10.18653/v1/2021.eacl-demos.19.

Part of Speech Lexical Units

Nouns Adjectives Verbs

Adverbs

Acidity, aftertaste, aroma, bitterness, dainty, delicacy, disgust, distaste, flavor, flavour, flavorful, flavourful, flavoring, flavouring, flavorsome, flavoursome, flavorous, flavourous, gustation, insipidity, mistaste, over-eating, palatableness, piquancy, pungency, rancidity, relish, rellish (obsolete), saltness, sapidity, sapor, savor, savoriness, savour, sharpness, smack, smatch, sourness, sowreness (archaic form of sourness), sweetness, tang, tarage, tartness, tast (obsolete), taste, tastelessness, tasting, unsavoriness, unsavouriness

Drink (up), drinking (up), drank (up), drunk (up), eat (up), ate (up), eateth (archaic), eaten (up), eating (up), distaste, distasting, distasted, mistaste, mistasted, mistasting, partake, partaking, partook, partaken, relish, relisheth (archaic), relishing, relished, season, seasoning, seasoned, smack, smacking, smacked, smatch (obsolete), sweeten, sweetening, sweetened, taste, tasting, tasted

Value 0.9, 0.99 0.2 20 32 0.0001 0.38 0.3

Appendices

A. Lexical Units and Frame Elements

In Table 4, we display the list of lexical units or taste words presented in [16].

B. Hyperparameter Values

The hyperparameter settings for all our models are presented in Table 5. The settings are MaChAmp’s default hyperparameter values, with the addition of loss weights set to 1 and 20 epochs of training.
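The configurations explored for the multitask setup (loss weights [1, 0.75], epochs [10, 20, 30], batch sizes [16, 32]) can be enumerated with a short sketch. The dictionary keys and the function `enumerate_configs` are hypothetical names introduced for illustration and do not correspond to MaChAmp parameter names; treating the loss weight as a single global value is also a simplifying assumption, since the weights were combined in different ways over the labels.

```python
from itertools import product

# Search space as reported for the multitask configuration. Key names are
# illustrative only and are not MaChAmp configuration keys.
SEARCH_SPACE = {
    "loss_weight": [1, 0.75],
    "epochs": [10, 20, 30],
    "batch_size": [16, 32],
}


def enumerate_configs(space):
    """Return every combination of values in `space` as a list of dicts."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in product(*(space[key] for key in keys))
    ]


configs = enumerate_configs(SEARCH_SPACE)

# Configurations compatible with the reported setting
# (loss weights at 1 and 20 epochs of training).
reported = [
    cfg for cfg in configs
    if cfg["loss_weight"] == 1 and cfg["epochs"] == 20
]
```

Listing the grid explicitly makes it easy to check that every model was run on the same set of candidate configurations before one shared setting was chosen.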