What’s in a Food Name: Knowledge Induction from Gazetteers of Food Main Ingredient Bernardo Magnini1 , Vevake Balaraman1,2 , Simone Magnolini1,3 , Marco Guerini1 1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento — Italy 2 University of Trento, Italy. 3 AdeptMind Scholar {magnini, balaraman, magnolini, guerini}@fbk.eu Abstract stituents (Shwartz and Dagan, 2018). For in- English. We investigate head-noun identi- stance, an apple cake is a cake made of apples. fication in complex noun-compounds (e.g. While in the literature there has been a large in- table is the head-noun in three legs ta- terest in interpreting noun-compounds by classi- ble with white marble top). The task is fying them with a fixed set of ontological relations of high relevancy in several application (Nakov and Hearst, 2013), in this paper we fo- scenarios, including utterance interpreta- cus on automatic recognition of the head-noun in tion for dialogue systems, particularly in noun-compounds. We assume that in each noun- the context of e-commerce applications, compound there is a noun which can be consid- where dozens of thousand of product de- ered as the more informative, as it brings the most scriptions for several domains and differ- relevant information that allows the correct inter- ent languages have to be analyzed. We pretation of the whole noun-compound. For in- define guidelines for data annotation and stance, in the apple cake example, we consider propose a supervised neural model that is cake as the head-noun, because it brings more in- able to achieve 0.79 F1 on Italian food formation than apple about the kind of food the noun-compounds, which we consider an compound describes (i.e. a dessert), its ingredi- excellent result given both the minimal su- ents (i.e. likely, flour, milk and eggs), and the typ- pervision required and the high linguistic ical amount a person may eat (i.e. likely, a slice). complexity of the domain. While in simple noun-compounds the head-noun usually corresponds to the syntactic head of the Italiano. Affrontiamo il problema di iden- compound, this is not the case for complex com- tificare head-noun in nomi composti com- pounds, where the head-noun can occur in differ- plessi (ad esempio "tavolo" is the head- ent positions of the compound, making its identi- noun in "tavolo con tre gambe e piano in fication challenging. As an example, in the Italian marmo bianco"). Il compito é di alta rile- food description filetto di vitellone senza grasso vanza in numerosi contesti applicativi, in- visibile, there are three nouns (i.e. filetto, vitellone clusa l’interpretazione di enunciati nei sis- and grasso) which are candidates to be the head- temi di dialogo, in particolare nelle ap- noun of the compound. plicazioni di e-commerce, dove decine di migliaia di descrizioni di prodotti per vari There are a number of tasks and application domini e lingue differenti devono essere domains where identifying noun-compound head- analizzate. Proponiamo un modello neu- nouns is relevant. A rather general context is on- rale supervisionato che riesce a raggiun- tology population (Buitelaar et al., 2005), where gere lo 0.79 di F-measure, che consideri- entity names automatically recognized in text are amo un risultato eccellente data la minima confronted against entity names already present in quantitá di supervisione richiesta e la alta an ontology, and have to be appropriately matched complessitá linguistica del dominio. in the ontology taxonomy. Our specific appli- cation interest is conversational agents for the e- commerce domain. Particularly, understanding 1 Introduction names of products (e.g. food, furniture, clothes, Noun-compounds are nominal descriptions that digital equipment) as expressed by users in differ- hold implicit semantic relations between their con- ent languages, requires the capacity to distinguish the main element in a product name (e.g. a table in salata arlecchino; food including proper names or I am looking for a three legs table with white mar- brands, like saint-honoré, tagliatelle Matilde, for- ble top), in order to match them against vendor cat- maggio bel paese; food names focusing on cook- alogues and to provide a meaningful dialogue with ing modalities, like pane fatto in casa, or peperoni the user. The task is made much more challeng- fritti; and focusing on alimentary properties, like ing by the general lack of annotated data, so that ragù di carne dietetico, or sangria analcolica. fully supervised approaches are simply not feasi- We assume that the head-noun of a food de- ble. Along this perspective, the long term goal of scription is the more informative noun in the noun- our work is to develop unsupervised techniques compound, i.e. the noun that better allows to an- that can identify head-nouns in complex noun- swer questions about properties of the food being compounds by learning properties on the base of described by the noun-compound. We consider the noun-compounds included in, possibly large, the following four property related questions, in gazetteers, regardless of the domain and language order of relevance: in which they are described. In this paper we propose a supervised ap- 1. What food category (e.g. meat, vegetable, proach based on a neural sequence-to-sequence cake, soup, pasta, fish, liquid, salad, etc.) is model (Lample et al., 2016) augmented with described by the noun-compound? noun-compound structural features (Guerini et al., 2. What course (e.g. main, appetizer, side 2018). This model identifies the more informative dish, dessert, etc.) is described by the noun- token(s) in the noun-compound, that are finally compound? tagged as the head-noun. We run experiments on Italian food names, and show that, although the 3. Which is the main ingredient (in term of domain is very complex, results are promising. quantity) described by the noun-compound? The paper is structured as follow: we first define 4. Which could be the overall quantity (ex- noun-compound head-noun identification, with pressed in grams) of food described by the specific reference to complex noun-compound noun-compound? (Section 2). Then we introduce the neural model we have implemented (Section 3), and finally the Although our approach does not require any do- experimental setting and the results we have ob- main knowledge, for the purpose of human anno- tained (Section 4). tation and evaluation it is useful to assume a sim- ple ontology for food, where we define the prop- 2 Food Compound-Nouns erties used for judging head-nouns and the set of possible values for each property. Table 1 reports In this Section we focus on Italian compound- the food ontology at the base of our work. nouns referring to food, the domain on which we run our experiments. Similar considerations and Property Values same methodology can be applied to compound- meat, vegetable, cake, soup, nouns in different domains and languages. Food category pasta, fish, liquid, salad... There is a very high variety of food compound- main, first, second, appetizer, nouns, describing various aspects of food, includ- Course side , dessert... ing: simple food names, like mortadella di fe- gato, pesce, gin and tonic, aglio fresco; recipes Main ingredient mentioning their ingredients, like scaloppine al limone, spaghetti al nero, passato di pollo, decotto quantity di carciofo; recipes focusing on preparation style, like mandorle delle tre dame, cavolfiore alla napo- Table 1: Food Ontology. letana; food names focusing on visual or shape properties, like filetto di vitellone senza grasso A good head-noun should be as much informa- visibile, palline di formaggio fritte; food descrip- tive as possible about the noun-compound proper- tions containing a course name, like antipasto ties, or, in other terms, it should allow to infer as di capesante, dessert di mascarpone; food us- much as possible answers to questions 1-4. An- ing fantasy names, like frappé capriccioso, or in- swers to such questions are in most of the cases graduated and probabilistic, as a noun-compound FDH type Position Single token Multi token Total contains just a fraction of the knowledge needed to 1st token 72.48 9.17 81.65 answer them. For instance, given question 1) for Not 1st token 17.89 0.46 18.35 the food noun-compound insalata noci e formag- Total 90.37 9.63 gio should be posed in the following way: know- Table 2: Coverage on the data set of head-noun ing that formaggio is part of a food description, characteristics (in %): either single token or multi- which is the probability that the overall descrip- token and whether appearing at the beginning of tion correctly refers to a food of category salad? the food description or not. When the probability is very low, we assume a "no guess" value for the answer. The core procedure for human annotations con- the two accounts for roughly 70% of the cases. siders each content word in a food description, fills From the point of view of predicting the head- in the values of the four attributes, and then se- noun of a food name, easier cases are given by sin- lect the noun with the best guesses. Below some gle token in first position, while harder cases are examples (in black the selected head of the food given by multi-token head inside the food name. description): 3 Model • insalata noci e formaggio: because insalata is a better predictor of the food category than The architecture we use to recognize head-nouns formaggio or noci. is based on a bidirectional LSTM (Long Short Term Memory) network (Graves and Schmidhu- • involtini di peperoni: because peperoni is a ber, 2005), similar to the one presented in (Lam- better predictor of food category (i.e. veg- ple et al., 2016). We briefly describe the LSTM etable) and of the main ingredient than invol- model used in the approach and proceed with the tini. implementation details. • budino al cioccolato fondente: because 3.1 LSTM budino is a good predictor of food category (i.e. dessert) and a better predictor than cioc- Recurrent Neural Network (RNN) is a class of ar- colato of the main ingredient (i.e. milk) of tificial neural network that resemble a chain of the noun-compound. repeating modules to efficiently model sequential data (Mikolov et al., 2010). They take sequential 2.1 Task and Data Set data (x1 , x2 , ....xn ) as input and provide a repre- Given a food noun-compound, the task we address sentation (h1 , h2 , ....hn ) which captures the infor- is to predict its head-noun, labelling one or more mation at every time step in the input. Formally, consecutive tokens in the food description. We as- sume that a head is always present, even in case it ht = f (U xt + W ht−1 ) is poorly informative. Two annotators were selected to annotate a data where xt is the input at time t, U is the embed- set of 436 food names, collected from recipe ding matrix, f is a non-linear operation (such as books, with their head-noun. The inter annotator sigmoid, tanh or ReLU) and W is the parameter agreement, computed at the token level, is Cohen’s of RNN learned during training. kappa: 0.91, which is considered very high. The hidden state ht of the network at time t cap- In table 2 we give an overview of the data set of tures only the left context of the sequence for the food-description head (FDH) we created focusing input at time t. The right context for the input at on two main orthogonal characteristics: whether time t can be captured by performing the same op- the head-noun is comprised of a single token or eration in the negative time direction. The input → − of a multi-token, and whether the head-noun cor- can be represented by both its left context ht and ←− → − ← − responds to the beginning of the food description right context ht as ht = [ ht ; ht ]. Similarly, the or not. As can be seen, the vast majority of head- representation of the completed sentence is given − → ← − nouns is either made of a single token (almost 90% by hT = [hT ; h0 ]. Such processing of the input in of cases), or starts at the beginning of the entity both forward and backward time-step is known as name (almost 80% of cases). The combination of bidirectional RNN. Though a vanilla RNN is good at modelling sequential data, it struggles to cap- the noun-compound; (vii) if the token can be an ture the long-term dependencies in the sequence. noun-compound; (viii) the ratio of the time the to- Long Short Term Memory (LSTM) (Hochreiter ken is the first token in a noun-compound; (ix) the and Schmidhuber, 1997) is a special kind of RNN ratio of the time the token is the last token in a that is designed specifically to capture the long- noun-compound. These handcrafted features for term dependencies in sequential data. They com- each word are extracted from a large corpus of Ital- pute the the hidden state ht as follows, ian food names reported in (Guerini et al., 2018). The concatenation of word embedding, final it = σ(Wi · [ht−1 , xt ] + bi states of bidirectional character embeddings net- ft = σ(Wf · [ht−1 , xt ] + bf work, and hand crafted features is used as the word representation. C̃t = tanh(WC · [ht−1 , xt ] + bC ) Ct = ft ∗ C(t−1) + it ∗ C̃t Input encoder. LSTM nodes are used to encode the input sequence of word embeddings. We em- ot = σ(Wo · [ht−1 , xt ] + bo ploy a bidirectional LSTM (Bi-LSTM) to cap- ht = ot ∗ tanh(Ct ) ture the context in both forward and backward timesteps. The hidden representation of a word where xt is the embedding for input at time t; it , at time t is given as, ft , ot are the input, forget and output gates, respec- tively. → − ← − ht = [ h t ; h t ] 3.2 Implementation Classification. The output layer receives the The task of head-noun identification aims to pre- hidden representation from the Bi-LSTM and out- dict a sequence of tags y = {y1 , y2 , .., yn } given puts a probability distribution over the possible an input sequence X = {x1 , x2 , ..xn }. The tag sequences. Then, a conditional random field system is modeled as a sequence labelling task (CRF) layer (Lafferty et al., 2001) is used to and consists of three main steps: i) word embed- model the dependency in labelling tags. The ding: each word in the sequence is embedded to hidden representations from the Bi-LSTM are a higher dimension; ii) Input encoder: encoding passed through a linear layer to obtain the score the sequence of embeddings; iii) Classification: Pi for each word in the input sequence X = labelling the sequence. {x1 , x2 , .., xn }. The score for each possible output tag sequence ŷ ∈ Ŷ is then obtained as follows, Word embeddings. Each word in the input se- quence is represented by a vector of d-dimensions n X n X that captures the syntactic and semantic informa- Score(ŷ) = Ayi ,yi+1 + Pi,yi tion of the word. The representation is carried by i=0 i=1 a word embedding matrix E ∈ Rd×|v| where |v| is the input vocabulary size. In addition to this, the model combines a character embedding that is where A is the transition matrix representing the learned during training using a Bi-LSTM network transition scores from tag i to tag j. The probabil- to deal with out of vocabulary terms and possible ity of the tag sequence is then computed using a misspellings (Ling et al., 2015). softmax operation as follows, To represent the core structure of a complex exp(Score(ŷ)) noun-compound, we also use the following hand- p(ŷ|X) = P crafted features of a head-noun candidate token ỹ∈Ŷ exp(Score(ỹ) (Guerini et al., 2018): (i) the actual position of the The training is done by maximizing the log prob- token within the compound name; (ii) the length ability of the correct output tag sequence. of the candidate token; (iii) the frequency of the token in the gazetteer; (iv) the average length of 4 Experiments and Results the noun-compounds in the gazetteer containing the token; (v) the average position of the token in 4.1 Setup the noun-compound it appears in; (vi) the bigram The dimension of character embedding is set to 30 probability with reference to the previous token in and embeddings are learned using 50 hidden units in each direction. For the word embeddings, as Accuracy Precision Recall F1 Baselines learning this level of representation with a small 1st token 83.74 70.29 70.24 70.27 dataset is highly inefficient, we decided to use Spacy 78.47 62.70 62.67 62.67 pre-trained embeddings trained using skip-gram Bi-LSTM a) word_emb 84.06 74.10 65.18 69.28 (Mikolov et al., 2013) on the Italian corpus of b) a + hc_feat 85.17 75.76 66.50 70.76 Wikipedia. The input encoder consists of 120 hid- c) a + char_emb 85.21 76.24 66.28 70.79 den units in each direction with a dropout (E. Hin- d) b + CRF 88.07 78.57 77.67 78.09 d) d + char_emb 88.59 80.58 78.62 79.58 ton et al., 2012) of 0.5 applied between the Bi- LSTM layer and the output layer. Table 3: Experimental results on FDH dataset. 4.2 Baselines To compare the performance of the proposed ap- a performance of 70.27 of 62.67 respectively. In proach, we provide two baselines: i) 1st token, particular, the performance of syntactic depen- where the 1st token of a noun-compound is chosen dency parser from Spacy reiterates the difference as its head-noun; ii) Spacy1 , where the root token between the semantic and syntactic head. The re- of the dependency tree for the noun-compound is sults are shown by incremental features, for the chosen as its head-noun. proposed approach. The models reported with- out CRF, are trained using a softmax function as 1st token. This baseline implicitly accounts for output layer to predict the tag. We can notice a number of linguistic behaviours of head-nouns from the results that using only the pre-trained em- in Italian language: (a) avoids stop words as head- beddings, the network suffers from a poor recall nouns, as they do not occur at the first position of and fails to achieve even the baseline performance. a noun-compound; (b) avoids adjectives as head- However, using either character embedding or the nouns, as they usually occur after the noun they hand-crafted features, improves the performance modify; (c) captures the syntactic head of the of the model on par with the baseline. Since the noun-compound, which, in Italian is likely to be single token head-noun in FDH dataset is very the first noun in a Noun Phrase; as already seen in high (as shown in table 2), learning the multi to- Table 2. Summing up, the first-token baseline cap- ken head-nouns and the dependency of tags is a tures relevant linguistic behaviours, and is a strong challenge. However, introducing the CRF layer to competitor of our neural model, as in more than jointly predict the sequence of tags in combina- 80% of the entries in our dataset the first token be- tion with the hand crafted features, enables us to longs to head-noun of the noun-compound. predict multi-token heads and improve the perfor- Spacy. This is a widely known open-source li- mance of the model to 78.09. Finally, the char- brary for natural language processing and include acter embeddings learned during training helps to a syntactic dependency parser. Given an input se- improves the recall further, reaching a F1 score of quence, based on the result returned by the depen- 79.58. dency parser, the root of the sequence is chosen to 5 Conclusion and Future Work be the head-noun. We used the statistical model it_core_news_sm2 released by Spacy for Italian We have addressed head-noun identification in language. complex noun-compounds, a task of high rele- vancy in utterance interpretation for dialogue sys- 4.3 Evaluation metric tems. We proposed a neural model, and experi- The performance of the models are evaluated us- ments on Italian food noun-compounds show that ing F1 score as in CoNLL-2003 NER evaluation the model is able to outperform strong baselines (Sang and Meulder, 2003), which is a standard for even with a small amount of data. For the future evaluating sequence tagging tasks. we plan to extend our investigation to other do- main and languages. 4.4 Results The results for the FDH dataset are shown in Ta- ble 3. The baselines 1st token and Spacy achieve References 1 https://spacy.io/ Paul Buitelaar, Philipp Cimiano, and Bernardo 2 https://spacy.io/models/it Magnini. 2005. Ontology Learning from Text: Methods, Evaluation and Applications, volume 123 Vered Shwartz and Ido Dagan. 2018. Paraphrase to of Frontiers in Artificial Intelligence and Applica- explicate: Revealing implicit noun-compound rela- tions Series. IOS Press, Amsterdam, 7. tions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Vol- Geoffrey E. Hinton, Nitish Srivastava, Alex ume 1: Long Papers), pages 1200–1211. Associa- Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhut- tion for Computational Linguistics. dinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv, 07. A. Graves and J. Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm net- works. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol- ume 4, pages 2047–2052 vol. 4, July. Marco Guerini, Simone Magnolini, Vevake Balara- man, and Bernardo Magnini. 2018. Toward zero- shot entity recognition in task-oriented conversa- tional agents. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 317–326, Melbourne, Australia, July. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780. John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling se- quence data. In Proceedings of the Eighteenth Inter- national Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Mor- gan Kaufmann Publishers Inc. Guillaume Lample, Miguel Ballesteros, Sandeep Sub- ramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recog- nition. CoRR, abs/1603.01360. Wang Ling, Chris Dyer, Alan W. Black, Isabel Tran- coso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding function in form: Compositional character models for open vocabu- lary word representation. In EMNLP. Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recur- rent neural network based language model. In Pro- ceedings of the 11th Annual Conference of the In- ternational Speech Communication Association (IN- TERSPEECH 2010), volume 2010, pages 1045– 1048. International Speech Communication Associ- ation. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word represen- tations in vector space. CoRR, abs/1301.3781. Preslav Nakov and Marti A. Hearst. 2013. Semantic interpretation of noun compounds using verbal and other paraphrases. TSLP, 10(3):13:1–13:51. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. CoRR, cs.CL/0306050.