<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv</article-id>
      <title-group>
        <article-title>On Cross-Language Entity Label Projection and Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Gajo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Corso della Repubblica, 136, 47121, Forlì</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>33</volume>
      <issue>1</issue>
      <fpage>0009</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>Most work on named entity recognition (NER) focuses solely on English. Through the use of training data augmentation via machine translation (MT), multilingual NER can become a powerful tool for information extraction in multilingual contexts. In this paper, we augment NER data from culinary recipe ingredient lists by means of MT and word alignment (WA), following two approaches: (i) translating each entity separately, while taking into account the full context of the list, and (ii) translating the whole list of ingredients and then aligning entities using three types of WA models: Giza++, Fast Align, and BERT, fine-tuned using a novel entity-shuffling approach. We depart from English data and produce Italian versions via MT, span-annotated with the entities projected from English. Then, we use the data produced by the two approaches to train mono- and multilingual NER BERT models. We test the performance of the WA and NER models on an annotated dataset of ingredient lists, partially out-of-domain compared to the training data. The results show that shuffling entities leads to better BERT aligner models. The higher-quality NER data created by these models enables NER models to achieve better results, with multilingual models reaching performances equal to or greater than their monolingual counterparts.</p>
      </abstract>
      <kwd-group>
        <kwd>information extraction</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>cross-lingual label projection</kwd>
        <kwd>data augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Named entity recognition (NER) is a sequence labeling task with a long history of works mainly focusing on the recognition of entities such as people, locations, and organizations. Multilingual NER has also attracted research efforts, with recent SemEval campaigns including tasks on multilingual complex NER (MultiCoNER) [1, 2]. Despite its popularity and various mono- and multilingual NER datasets being available, specific domains such as the culinary one likely require new annotated data. In addition, NER is often the first step in information extraction for knowledge graph construction and, to the best of our knowledge, all literature on this topic in the domain of cuisine solely focuses on English data [3, 4, 5, 6, 7]. Therefore, we argue that, given cuisine's multicultural nature, more research in this direction is warranted.</p>
      <p>Entity label projection [8] aims to address this scarcity by automating the data generation process for NER. This task consists in taking the labels associated with spans from a source text and automatically applying them to its translation in another language, i.e., the target text. Through this task, we attempt to find an efficient automatic way of developing models for entity projection across languages to produce high-quality multilingual data for recipe Named Entities (r-NE) [4]. Departing from an English-language dataset containing ingredients from culinary recipes, annotated at the span level with entity category labels, we first rely on an MT engine to translate each source entity individually into Italian, while taking the full context of the list into account. This results in a first entity-wise (EW) translated EN–IT–ES dataset where entities are linked across languages.1</p>
      <p>Using these synthetic alignments, we train BERT models to align source and target entities, shuffling the latter to prevent the model from learning to simply predict the original entity order. We then test the models on two novel entity alignment datasets, partially out-of-domain compared to the training data, e.g., as regards the food products, units of measure, and cooking processes used. As baselines to evaluate the BERT alignment models, we use Giza++ [9] and Fast Align [10], two statistical word alignment (WA) models. In order to produce higher quality r-NE data, we translate the ingredient lists across their whole length, predicting target entity spans with the best BERT models from the previous step, along with the baseline models. We thus obtain various sentence-wise (SW) translated datasets in Italian, trading some alignment accuracy for better translations.</p>
      <p>Both types of training data, EW and SW, are then used to fine-tune mono- and multilingual BERT NER models on the task of recognizing entities in food recipes. The models are trained on various combinations of mono- and multilingual data and are tested on the entity annotations from the two aforementioned novel testing datasets.</p>
      <p>Our contribution is three-fold: (i) We show the efficacy of fine-tuning alignment models by shuffling entities in contexts where most of the information depends on the presence of lexical items rather than the dependencies linking them. (ii) We showcase the performance delta between mono- and multilingual NER models when fine-tuning on the synthetic data produced by our alignment models. These models can be used to label large datasets in multiple languages at a finer granularity level compared to currently available monolingual resources. (iii) We release code and data to produce data in multiple languages.2</p>
      <p>The rest of the paper is structured as follows. Section 2 presents relevant past research on the subjects of cross-lingual entity alignment and recognition. Section 3 introduces the datasets and corpora used in the experiments, along with their annotation process. Section 4 presents architecture, training, and evaluation details for the models comprising our pipeline. Section 5 discusses the conducted experiments and their results. Finally, Section 6 summarizes the paper and draws conclusions. Appendix A shows further results, including Spanish. Appendix B presents statistics and gives insight on the additional training data used. Appendix C lists information on the computational requirements.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <p>Word alignment was first approached for statistical MT, with models such as IBM models 1–5 [11], used in well-known implementations such as Giza++ and Fast Align. With the advent of Transformers [12] and the BERT model [13], this task has been approached by employing both question answering [14] and token classification [15] models, trained on freely available resources, such as XL-WA [16].</p>
      <p>A number of past works have studied label projection following a range of approaches. Jain et al. [8] project PER, ORG, LOC and MISC labels (person, organization, location, and miscellaneous) by translating sentences and then finding potential matches using glossaries. Fei et al. [17] align words using Fast Align and use POS tagging to enhance data for semantic role labeling. García-Ferrero et al. [18] use the AWESoME word alignment model [19] to align machine-translated data from NER datasets in seven languages. Li et al. [15] fine-tune a NER model on English PER, ORG, LOC, MISC data from CoNLL2003 [20] to infer on the source portion of parallel Opus corpora [21] with the aim of creating silver NER data. Subsequently, they train an XLM-RoBERTa alignment model by using Wikipedia articles and project the labels onto the target portion of the parallel corpus, which they use to train a target-language NER model.</p>
      <p>NER can also be approached with large language models (LLMs) [22, 23, 24] by prompting them to extract entities from a given text. For example, PromptNER [25] uses chain of thought [26] along with a list of entity definitions to prompt a variety of LLMs, obtaining results on par with SOTA supervised NER systems. Similarly, [27] use in-context learning [28] to evaluate GPT-3 [22] for NER on the CoNLL2003 [20] and OntoNotes5.0 [29] datasets by using retrieval-augmented generation [30] and comparing the results to BERT and models based on graph neural networks [31].</p>
      <p>With regard to data specific to the culinary domain, many English-language resources exist in various forms. RecipeDB [32] is an ontology comprising 118K web recipes which can be used to relate foods and cooking processes to taste profiles and health data. FoodOn [33] is a “farm-to-fork” ontology which provides a structure of relationships between food products across the whole industrial supply chain. Bridging the gap between ontologies and NER datasets, FoodKG [34] is a knowledge graph which can be used to find ingredient substitutions based on dietary health requirements. It is built by leveraging FoodOn and Recipe1M+ [35], a dataset originally intended for learning joint text/image embeddings on over 1M culinary recipes. Expanding on Recipe1M+, Bień et al. [36] construct RecipeNLG, comprising more than 2M recipes. It is the biggest food NER dataset to date, but its granularity stops at the sole food product names. More fine-grained silver labels are obtained by Komariah et al. [37], who propose a new methodology to extract entities from AllRecipes.3 Doing so, they construct FINER, a dataset comprising 64K recipes with labels predicted by what the authors refer to as a “semi-supervised multi-model prediction technique.” The dataset also contains recipe tags such as vegetarian and vegan, which can be useful for training recipe classifiers. Leveraging RecipeDB [32], a large-scale structured corpus of recipes, [38] generate a synthetic dataset of augmented ingredient phrases and compare the NER performance of various rule-based and neural models.</p>
      <p>Despite the wide availability of English-language resources in the culinary domain, other languages are largely understudied. To the best of our knowledge, the only study to approach this domain in a multilingual setting was conducted by Radu et al. [39], who obtain NER tags automatically in English, German, and French by using a regex-based tagger. Our work aims to partially address this gap in past research by focusing on Italian.</p>
    </sec>
    <sec id="sec-1c">
      <title>3. Data</title>
      <p>The entity alignment data used for training is generated through MT starting from TASTEset [40], a dataset comprising ingredient lists from 700 food recipes, annotated at the span level. We use TASTEset because it is human-curated and its annotations are fine-grained. We translate each entity one by one with DeepL,4 concurrently feeding the whole ingredient list and the single</p>
      <sec id="sec-1-1">
        <title>2Resources available at https://github.com/paolo-gajo/food</title>
      </sec>
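      <p>Conceptually, the projection pipeline described in the introduction reduces to translating a source text and mapping each labeled source span onto a span of the translation. The following minimal Python sketch illustrates the idea only; it is not the released implementation, and the translate and align callables are hypothetical stand-ins for the MT engine and the word aligner:</p>

```python
# Minimal sketch of cross-language entity label projection.
# `translate` and `align` are hypothetical stand-ins for the MT
# engine (e.g., DeepL) and a word-alignment model (e.g., Giza++).

def project_labels(src_text, src_entities, translate, align):
    """Project (start, end, label) spans from src_text onto its translation."""
    tgt_text = translate(src_text)
    projected = []
    for start, end, label in src_entities:
        # `align` returns the (start, end) character span in tgt_text, or None.
        tgt_span = align(src_text, tgt_text, (start, end))
        if tgt_span is not None:
            projected.append((tgt_span[0], tgt_span[1], label))
    return tgt_text, projected

# Toy example with dictionary-based dummy MT/WA functions:
toy_map = {"4 cups flour": "4 tazze farina", "flour": "farina"}
translate = lambda s: toy_map[s]

def align(src, tgt, span):
    piece = toy_map.get(src[span[0]:span[1]])
    if piece is None or piece not in tgt:
        return None
    i = tgt.index(piece)
    return (i, i + len(piece))

tgt, spans = project_labels("4 cups flour", [(7, 12, "FOOD")], translate, align)
# tgt == "4 tazze farina"; spans == [(8, 14, "FOOD")]
```

      <p>Any real aligner replaces the toy dictionary lookup, but the contract stays the same: a source span in, a target span (or nothing) out.</p>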
      <sec id="sec-1-2">
        <title>3https://www.allrecipes.com 4https://www.deepl.com/en/docs-api</title>
        <p>entity as two separate inputs. This provides DeepL with context, improving translation quality, while the start and end span indexes in the target text are retained by simply concatenating each translated entity. To the best of our knowledge, DeepL is currently the only MT engine capable of contextually translating a substring taken from a sentence, which is why we use it in this study. Doing this, we obtain an Entity-wise Machine-translated TASTEset (EMT). Since entities are automatically paired to the source label, the class distribution across English and Italian is identical (Table 1).</p>
        <p>Table 1: Dataset class distributions (food, quantity, unit, process, physical quality, color, taste, purpose, part, and total) for EMT, SMT, and GZ. EMT and SMT refer to the entity- and sentence-wise machine-translated TASTEset; GZ refers to our testing dataset.</p>
        <p>We also generate shuffled variations of EMT, where the entities within a single ingredient have a probability P ∈ {0.1, 0.2, . . . , 1.0} of being shuffled, for a total of ten variations. Figure 1 shows an example where entities have been shuffled in the first and third target ingredients. The rationale behind this approach is that, when training on EMT, if the dataset were to be left as-is, the model would simply learn to associate a source entity to the target entity in the corresponding position, since in EMT entities are simply translated and replaced in place.</p>
        <p>Figure 1: Aligning source and shuffled target entities. The source list A (“4 ∙cups∙ flour ; 1/2 teaspoon salt ; 1 teaspoon baking soda”) is paired with the target list B (“farina 4 tazze ; 1/2 cucchiaino sale ; cucchiaino 1 bicarbonato di sodio”), in which the first and third ingredients are shuffled; the aligner predicts the character span of the marked entity in the target, e.g., Maligner(A, B) = t1,3 = {'start': 9, 'end': 14}.</p>
        <p>Overall, we have 22 different variations of EMT, i.e., the original and the 10 shuffled versions for each of the two types of tokenization (mBERT's WordPiece [13] vs mDeBERTa's SentencePiece [41]). The datasets have to be tokenized during dataset generation because token indexes depend on the tokenizer being used when converting from character-level span annotations.</p>
        <p>We produce a second kind of synthetic dataset by first translating the ingredient lists as a whole, and then aligning source and target entities by using the BERT, Giza++, and Fast Align models presented in Section 4. We refer to this type of dataset as the Sentence-wise Machine-translated TASTEset (SMT). As Table 1 shows, the SMT dataset produced by the BERT model trained on both XL-WA and the shuffled version of EMT contains slightly fewer entities than the source material. This is due to the fact that at times the models produce impossible predictions, e.g., predicting the end of an entity to be before its start.5 This problem does not exist with Giza++ and Fast Align, since their alignments are word-based. As additional training data for the BERT models, we use the EN–IT portion of XL-WA. Table 9 in Appendix B reports the size of each of the partitions we used.</p>
        <p>For testing, we annotated an English–Italian dataset of recipes obtained from GialloZafferano (where the English recipes are translated from the Italian ones).6 For the annotation process, we recruited a professional translator who is a native speaker of Italian, with an MA in Specialized Translation in both English and Spanish. Figure 2 shows the instructions given for the first multi-class entity annotation task, which considers the same entities as TASTEset, and for the second cross-language entity-linking annotation task, carried out by the same annotator at a later time. The annotation was carried out in Label Studio.7</p>
        <p>The GialloZafferano (GZ) dataset comprises 597 recipes. The alignments were annotated manually on a subset of 300 recipes, with the possibility of more than one source entity being aligned with one target entity, and vice versa. This is because some recipes contain more than one ingredient option in English but not in Italian (and vice versa), e.g., Cocomero (anguria) 1 fetta vs Watermelon 1 slice. The GZ dataset contains a total of 46,903 NER annotations and 9,842 alignments. We manually scrutinized GZ and found that the paired recipes do not always coincide completely. Some ingredients may be missing in either language or be an equivalent rather than the same food product. In order to avoid training the alignment models on excessively different recipes, we chose to avoid annotating alignments whenever the number of source ingredients missing from the target recipe surpassed a heuristic threshold of 1/3.</p>
        <p>Note that in GZ quantities and units of measure are localized and are thus listed in both imperial and SI units. As shown in Table 1, this is reflected by the lower number of instances annotated as quantity and unit in the Italian portion of GZ, compared to its English portion.</p>
      </sec>
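      <p>The entity-shuffling augmentation described above can be sketched as follows. This is a simplified illustration rather than the released implementation: each ingredient is represented as a list of entity strings, and its entities are shuffled with probability P while the ingredient order itself is untouched:</p>

```python
import random

def shuffle_entities(ingredients, p, seed=0):
    """Shuffle the entities inside each ingredient with probability p,
    leaving the order of the ingredients themselves untouched."""
    rng = random.Random(seed)
    augmented = []
    for entities in ingredients:
        entities = list(entities)        # copy, so the input is not mutated
        if rng.random() < p:
            rng.shuffle(entities)
        augmented.append(entities)
    return augmented

recipe = [["4", "cups", "flour"], ["1/2", "teaspoon", "salt"]]
kept = shuffle_entities(recipe, p=0.0)   # p = 0.0: nothing is shuffled
mixed = shuffle_entities(recipe, p=1.0)  # p = 1.0: every ingredient is shuffled
```

      <p>Ten such variations (one per value of P) would then be materialized before tokenization, so that the projected token spans match each shuffled target text.</p>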
      <sec id="sec-1-3">
        <title>5The effect on model performance upon training is negligible given that these predictions constitute less than 1% of the total.</title>
        <p>6https://www.giallozafferano.it</p>
      </sec>
      <sec id="sec-1-4">
        <title>7https://labelstud.io</title>
        <p>Instructions for the multi-class entity annotation task.
Annotate the ingredients below by assigning to spans of text
one of the following categories: quantity, unit, food,
color, part, physical quality, process,
purpose, taste. Use quantity for numerical values or
expressions such as ‘to taste’, identifying the quantity of an
ingredient. Unit stands for “unit of measure”, such as grams
(g) or ounces (oz). Use color for any color that is not part of
a food’s own name (e.g. ‘red’ can be tagged in ‘red wine’,
but not ‘black’ in ‘blackberries’). Part refers to parts of an
ingredient, such as ‘wings’ in ‘chicken wings’. Use physical
quality for attributes which already characterize an
ingredient at the start of the preparation. Process refers to
actions that the reader is supposed to carry out. The label
purpose answers the question, “What is this entity used
for?” Finally, use taste for words referring to a taste, such
as ‘unsweetened’ or ‘dry’, with relation to a wine.
N.B.: The annotations cannot overlap. You can either
choose to annotate multiple spans with the same label, or
avoid annotating one or more spans of text.</p>
        <p>Instructions for the cross-lingual entity linking task.
Link each source language entity to its corresponding entity
in the target language with an arrow. Entities should only
be linked if they share the same use. For example, the “2”
in “2 tablespoons chopped onions” does not have the same
function as in “2 cebollas largas picadas”, since one refers to
tablespoons and the other to the number of onions. However,
“onions” and “chopped” could still be linked, as they are
equivalent in the two sequences. Individual source entities
can be linked to multiple target entities and vice versa.
N.B.: Entities can still be linked if they differ slightly in
form or content but still clearly perform the same function
in the same ingredient context. For example, “340” and “450”
could still be linked if they both refer to the quantity of
grams of the same source and target food products.</p>
      </sec>
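      <p>As noted in Section 3, token indexes depend on the tokenizer, so the character-level span annotations must be converted per tokenizer. The sketch below uses a toy whitespace tokenizer to stay self-contained; the actual pipeline would rely on a subword tokenizer's equivalent mapping (e.g., the char_to_token method referenced in footnote 10):</p>

```python
def char_span_to_token_span(text, start, end):
    """Map a character span onto (first_token, last_token) indices,
    using a toy whitespace tokenizer as a stand-in for a subword one."""
    token_spans = []
    pos = 0
    for tok in text.split():
        tok_start = text.index(tok, pos)
        token_spans.append((tok_start, tok_start + len(tok)))
        pos = tok_start + len(tok)
    # A token is covered if it overlaps the character span at all.
    covered = [i for i, (s, e) in enumerate(token_spans) if s < end and e > start]
    return (covered[0], covered[-1]) if covered else None

text = "1/2 cucchiaino sale"
# "cucchiaino" occupies characters 4-14, i.e., token index 1.
span = char_span_to_token_span(text, 4, 14)
# span == (1, 1)
```

      <p>With a subword tokenizer, one character span typically covers several tokens, which is exactly why the shuffled EMT variants must be regenerated per tokenizer.</p>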
    </sec>
    <sec id="sec-2">
      <title>4. Models</title>
      <p>Entity Alignment. As baselines, we use two statistical models: Giza++ [9] and Fast Align [10]. Giza++ combines a HMM [42] alignment model and IBM models 1–5 [11]. Fast Align is much more lightweight, only leveraging IBM model 2. We use two multilingual BERT models as well: mBERT [13], as the baseline multilingual Transformer model, and mDeBERTa [43], because of its larger size (276M vs 179M parameters) and better performance. When using the BERT models, we follow Nagata et al. [14] and treat entity alignment as a question-answering task, enclosing the source word to be aligned within rarely used characters, e.g., ‘∙’, and feeding the model both the source sequence and the target sequence at once. Figure 1 exemplifies this, where the aligner model is trained to predict an entity within a shuffled ingredient's boundaries.</p>
      <p>We train the models for up to 3 epochs on each dataset with a batch size of 16. The optimizer's learning rate is set at 3 × 10−4, while its epsilon is 10−8. For each training run, we select the best model based on the Exact metric [44]:

Exact = (1 / |P|) · Σ_{(a, b) ∈ P} δ(a, b)    (1)

where P is a list of prediction–gold pairs and δ(·, ·) is the Kronecker delta:

δ(a, b) = 1, if a = b; 0, if a ≠ b    (2)

with the predicted and gold strings a and b having been lowercased and stripped of excess punctuation and spaces. We calculate mean Exact and its standard deviation out of five random runs for each model.</p>
      <p>In order to improve the models' ability to align entities, we optionally train them on an intermediary word-alignment task using the EN–IT training and dev sets of XL-WA. In addition, we train mBERT and mDeBERTa solely on said XL-WA partitions in order to test them directly on GZ. This serves as a baseline which allows us to gauge the positive effects of fine-tuning on EMT.</p>
      <p>Entity Recognition. For the NER task, treated as token classification, we once again use mBERT.8 To test the efficacy of the multilingual approach, we also use the following monolingual models when training and testing on a single language: bert-base-uncased (henceforth “BERTen”) for English [13] and bert-base-italian-uncased (“BERTit”) [45] for Italian. We forgo mDeBERTa for this task, as the focus is showing a comparison between models of equivalent size and performance. Prior to training, the data is preprocessed and labeled using the BIO annotation scheme [46]. We ignore subword tokens when calculating the cross-entropy loss, following established methodology.9</p>
      <p>We train the models on the EN–IT, EN–ES, and EN–IT–ES language subsets of EMT and of the four versions of SMT, produced by mBERT, mDeBERTa, Giza++, and Fast Align. For the BERT models, we use the same hyperparameters used for the alignment task, but with a lower learning rate of 2 × 10−4. The models are evaluated using the macro F1-measure. Details on the employed computational resources can be found in Appendix C.</p>
    </sec>
    <sec id="sec-2b">
      <title>5. Results and Discussion</title>
      <p>Table 2: Mean Exact scores (± standard deviation over five runs) of the alignment task for models fine-tuned on EMT with shuffling probability P, and for the XL-WA-only baseline (P = –).
P | mDeBERTa | mDeBERTax
0.0 | 42.17 ± 1.19 | 46.98 ± 3.77
0.1 | 57.00 ± 0.94 | 58.45 ± 1.37
0.2 | 55.03 ± 2.40 | 57.02 ± 2.43
0.3 | 57.09 ± 3.61 | 60.25 ± 2.35
0.4 | 57.26 ± 1.09 | 59.21 ± 2.59
0.5 | 55.97 ± 3.11 | 58.43 ± 2.53
0.6 | 58.37 ± 2.46 | 61.07 ± 2.94
0.7 | 57.07 ± 1.58 | 60.68 ± 3.01
0.8 | 57.31 ± 1.20 | 62.08 ± 3.74
0.9 | 56.95 ± 2.69 | 61.05 ± 1.27
1.0 | 57.59 ± 1.81 | 60.87 ± 1.13
– | | 31.71</p>
      <p>Entity Alignment. Table 2 reports the Exact scores for the entity alignment experiment. The entity shuffling approach appears to be very effective for creating data which makes the models better at generalizing. The performance of every single model is greatly enhanced when shuffling ingredients just 10% of the time, with increased shuffling frequency not leading to any significant further improvement. The increase in performance seems to be greater for models which have undergone intermediate training on XL-WA, with mDeBERTax gaining almost 12 points in the Exact metric when fine-tuned on shuffled data. Unsurprisingly, the larger mDeBERTa performs much better than the smaller mBERT across the board. Although the highest mean performance is obtained at P = 0.8, an overlap can be observed between all the confidence intervals for P ≥ 0.1. However, this is not true when going from P = 0 to P = 0.1. Consequently, increased shuffling past 10% does not seem to provide a concrete performance gain, which is why we decided to produce SMT by using the BERT models trained on the least-shuffled version of EMT.</p>
      <p>In and of itself, the intermediary training step on XL-WA provides a slight performance boost, as seen when comparing mBERT vs mBERTx and mDeBERTa vs mDeBERTax. Still, this increase is much smaller compared to the one gained through shuffling. While fine-tuning the models on a general word-alignment task can be beneficial, the target domain is likely too different from the training data for this to produce a large performance boost. This is especially true as regards the structure of the sentences, since the test data is comprised of short lists of entities separated by semicolons, while the training data is a domain-balanced sample of sentences from Wikipedia. An additional performance boost is provided by multilingual fine-tuning, while cross-lingual settings (e.g., fine-tuning on ES and testing on IT) lead to worse outcomes. Table 6 (Appendix A) shows the results.</p>
      <p>Table 3: Exact metric results of the alignment task by class on GZ for the best models (trained on IT ⊕ ES). Best in bold.</p>
      <p>Table 3 reports the performance of the best overall models on each class. As the results show, the much lighter Giza++ model surpasses mBERTx, only trailing behind mDeBERTax. The poor scores achieved by the two BERT models are largely attributable to their poor scores on the unit and part classes. We hypothesize that this poor class-specific performance has to do with units of measure often being very short strings. Training mDeBERTa only on the unit instances does not improve its performance, with the model scoring a lower 18.08 Exact metric. Inspecting its individual predictions in this single-class scenario, we noticed that the model does learn to always predict two consecutive tokens, but the enclosed token does not match the original text when converted into characters. This is due to two separate issues: (i) the model selects the wrong span, e.g., selecting an ingredient such as “carote” (carrots) rather than the unit “g”, or (ii) the model's prediction is empty when converted to characters. Since mBERT and mDeBERTa both have poor performance on this class while using two different tokenization algorithms (WordPiece vs SentencePiece), the problem may lie in the models' tokenizers' token-to-character conversion method.10 We plan to shed light on this in the future. As regards the part class, the poor performance could be explained by the small</p>
      <sec id="sec-2-1-1">
        <title>8We do not use the larger mDeBERTa model due to the computational cost deriving from the number of language combinations.</title>
        <p>9https://huggingface.co/docs/transformers/en/tasks/token_classification
10https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.BatchEncoding.char_to_token</p>
      </sec>
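      <p>The question-answering formulation of entity alignment described in Section 4 can be illustrated with the input construction below. This is a sketch of the marking scheme only (the ‘∙’ delimiters follow the approach of Nagata et al. as adopted here); the model call itself is omitted, and the function name is ours:</p>

```python
MARK = "∙"  # rarely used character enclosing the source entity

def build_alignment_input(src, tgt, start, end):
    """Enclose the source span in marker characters and pair it with the
    target sequence, QA-style: the model is then asked to predict the
    start/end of the corresponding span in `tgt`."""
    marked_src = src[:start] + MARK + src[start:end] + MARK + src[end:]
    return marked_src, tgt

q, c = build_alignment_input(
    "4 cups flour ; 1/2 teaspoon salt",
    "farina 4 tazze ; 1/2 cucchiaino sale",
    7, 12,  # the span of "flour"
)
# q == "4 cups ∙flour∙ ; 1/2 teaspoon salt"
```

      <p>Feeding both sequences at once gives the model the full list as context, which is what makes the shuffled-target training signal meaningful.</p>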
      <sec id="sec-2-2">
        <p>Looking at the baselines at the bottom of Table 4, fine-tuning mBERT on English data yields worse performance when testing on GZ than fine-tuning on EMT's Italian data. Our data augmentation strategy is thus providing an evident performance boost, with entity alignment producing bigger improvements than machine-translating each entity individually.</p>
        <p>In all settings, mBERT performs on par with the monolingual models. This shows that a single multilingual model can suffice when extracting entities from multilingual corpora, saving time and compute.</p>
      </sec>
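      <p>The Exact model-selection metric of Section 4 amounts to the mean Kronecker delta over normalized prediction–gold string pairs. A minimal sketch follows, with our own normalization details standing in for “lowercased and stripped of excess punctuation and spaces”:</p>

```python
import string

def normalize(s):
    """Lowercase and strip surrounding punctuation and spaces."""
    return s.lower().strip().strip(string.punctuation).strip()

def exact(preds, golds):
    """Percentage of prediction/gold pairs that match after normalization,
    i.e., the mean of the Kronecker delta over the pairs."""
    assert len(preds) == len(golds)
    hits = sum(1 for p, g in zip(preds, golds) if normalize(p) == normalize(g))
    return 100.0 * hits / len(preds)

score = exact(["Farina ", "sale."], ["farina", "pepe"])
# One of the two predictions matches after normalization -> 50.0
```

      <p>Mean and standard deviation over the five random runs are then computed on these per-run scores.</p>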
    </sec>
    <sec id="sec-3">
      <title>6. Conclusions</title>
      <p>We explored a simple novel technique to automatically generate high-quality multilingual NER data by combining machine translation and cross-language entity linking. For our experiments, we relied on the English-language TASTEset dataset, which includes recipes whose lists of ingredients are span-annotated for entity recognition. Moreover, we manually curated a novel English–Italian cross-language dataset, featuring the same kind of annotation, with the addition of cross-language alignments.</p>
      <sec id="sec-3-1">
        <title>Table 4: Model performance for the entity recognition task, in terms of F1 measure. All results are macro avg. out of 5 random runs.</title>
        <p>number of training instances (55). However, the models obtain high scores on the purpose class, which also has just 94 instances (mBERTx gets a 94.87 Exact score). Unfortunately, repeating the approach we used for the unit class is not feasible, as fine-tuning the model on just 55 instances does not produce any reliable results (3.96), meaning this will have to be left for future work.</p>
        <p>The rest of the results from Table 3 are generally in line with the average results from Table 2. The scores achieved by the baselines for each class do not have any evident outliers, save for Fast Align scoring a 0 on taste. More generally, Fast Align, being the simplest and most lightweight model, performs on average well below the other, more complex models.</p>
        <p>Entity Recognition. Table 4 reports the results for the NER task. The aligner column indicates which alignment model, out of the best ones listed in Table 3, produced the SMT training data used to fine-tune the NER model. When no alignment model is specified, the training data being used is EMT. Note that in this case we are not using EMT's shuffled versions, as there is no relation between any two recipes when fine-tuning on the NER task.</p>
        <p>When training and testing on Italian data, the best results are obtained for both mBERT and BERTit when fine-tuning on SMT data produced by mDeBERTa. When fine-tuning them on EMT, the performance is noticeably lower, with a 5-point difference for mBERT and an 8-point difference for BERTit. The data produced by mBERT also allows both models to outperform the EMT baseline, although by smaller amounts. Conversely, the data produced by Fast Align and Giza++ worsens the data quality in 75% of the cases. When fine-tuning mBERT on bilingual ES–IT data, the performance on the test set remains essentially unvaried (see Table 8 in Appendix A).</p>
        <p>We machine translated the entities in TASTEset's recipes individually and shuffled them within ingredient boundaries. Leveraging this augmented data, we then fine-tuned BERT entity-alignment models. Using statistical word-alignment models as baselines, we tested these BERT models on our English–Italian parallel corpus. The results showed that models fine-tuned using our novel approach consistently outperform those trained on unshuffled data, along with the two statistical baselines.</p>
        <p>We then created additional synthetic data by first translating TASTEset's recipes in their entirety, and then aligning the entities in the machine-translated target text using the best models obtained from the first part of the study. These data allowed us to obtain better NER models, compared to the ones we would have obtained by using the original recipes translated entity by entity. We tested monolingual English and Italian BERT models against mBERT, and showed that the latter is capable of obtaining the same performance as its monolingual counterparts when tested on monolingual NER data.</p>
        <p>In future work, we plan to extend the annotation of our datasets, both in terms of number of instances and annotators. We will also prioritize solving the token-to-character conversion issues encountered in this study. Furthermore, we plan to leverage this data augmentation technique in order to improve multilingual text-to-graph models, since all of the literature in this regard focuses on English-only data [3, 4, 5, 6, 7].</p>
        <p>References</p>
        <p>[1] S. Malmasi, A. Fang, B. Fetahu, S. Kar, O. Rokhlenko, SemEval-2022 task 11: Multilingual complex named entity recognition (MultiCoNER), in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 1412–1437. URL: https://aclanthology.org/2022.semeval-1.196.
[2] B. Fetahu, S. Kar, Z. Chen, O. Rokhlenko, S. Malmasi, SemEval-2023 task 2: Fine-grained multilingual named entity recognition (MultiCoNER 2), in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2247–2265. URL: https://aclanthology.org/2023.semeval-1.310.
[3] C. Kiddon, G. T. Ponnuraj, L. Zettlemoyer, Y. Choi, Mise en Place: Unsupervised Interpretation of Instructional Recipes, in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 982–992. URL: https://aclanthology.org/D15-1114. doi:10.18653/v1/D15-1114.
… Recipes, 2024. URL: http://arxiv.org/abs/2401.12088.
[8] A. Jain, B. Paranjape, Z. C. Lipton, Entity projection via machine translation for cross-lingual NER, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1083–1092. URL: https://aclanthology.org/D19-1100. doi:10.18653/v1/D19-1100.
[9] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (2003) 19–51.
[10] C. Dyer, V. Chahuneau, N. A. Smith, A Simple, Fast, and Effective Reparameterization of IBM Model 2, in: L. Vanderwende, H. Daumé III, K. Kirchhoff (Eds.), Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 644–648. URL: https://aclanthology.org/N13-1073.
[11] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, R. L. Mercer, The mathematics of statistical machine translation: Parameter estimation, Computational Linguistics 19 (1993) 263–311. URL: https://aclanthology.org/J93-2003.
[4] Y. Yamakata, S. Mori, J. Carroll, English Recipe Flow</p>
        <p>Graph Corpus, in: N. Calzolari, F. Béchet, P. Blache, [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isa- L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
hara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, Attention is All you Need, in: Advances in Neural
J. Odijk, S. Piperidis (Eds.), Proceedings of the Information Processing Systems, volume 30,
CurTwelfth Language Resources and Evaluation Con- ran Associates, Inc., 2017. URL: https://proceedings.
ference, European Language Resources Associa- neurips.cc/paper_files/paper/2017/hash/
tion, Marseille, France, 2020, pp. 5187–5194. URL: 3f5ee243547dee91fbd053c1c4a845aa-Abstract.
https://aclanthology.org/2020.lrec-1.638. html.
[5] D. P. Papadopoulos, E. Mora, N. Chepurko, K. W. [13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Huang, F. Ofli, A. Torralba, Learning Program Rep- Pre-training of deep bidirectional transformers for
resentations for Food Images and Cooking Recipes, language understanding, in: J. Burstein, C.
Doin: 2022 IEEE/CVF Conference on Computer Vision ran, T. Solorio (Eds.), Proceedings of the 2019
Conand Pattern Recognition (CVPR), IEEE, New Or- ference of the North American Chapter of the
Asleans, LA, USA, 2022, pp. 16538–16548. URL: https: sociation for Computational Linguistics: Human
//ieeexplore.ieee.org/document/9878478/. doi:10. Language Technologies, Volume 1 (Long and Short
1109/CVPR52688.2022.01606. Papers), Association for Computational Linguistics,
[6] D. J. Bhatt, S. A. Abdollahpouri Hosseini, F. Fan- Minneapolis, Minnesota, 2019, pp. 4171–4186. URL:
cellu, A. Fazly, End-to-end Parsing of Procedu- https://aclanthology.org/N19-1423. doi:10.18653/
ral Text into Flow Graphs, in: N. Calzolari, M.- v1/N19-1423.</p>
        <p>Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), [14] M. Nagata, K. Chousa, M. Nishino, A
suProceedings of the 2024 Joint International Con- pervised word alignment method based on
ference on Computational Linguistics, Language cross-language span prediction using
multilinResources and Evaluation (LREC-COLING 2024), gual BERT, in: B. Webber, T. Cohn, Y. He,
ELRA and ICCL, Torino, Italia, 2024, pp. 5833–5842. Y. Liu (Eds.), Proceedings of the 2020
ConferURL: https://aclanthology.org/2024.lrec-main.517. ence on Empirical Methods in Natural Language
[7] A. Diallo, A. Bikakis, L. Dickens, A. Hunter, Processing (EMNLP), Association for
ComputaR. Miller, Unsupervised Learning of Graph from tional Linguistics, Online, 2020, pp. 555–565.
URL: https://aclanthology.org/2020.emnlp-main.41. Proceedings of the Eighth International
Condoi:10.18653/v1/2020.emnlp-main.41. ference on Language Resources and Evaluation
[15] B. Li, Y. He, W. Xu, Cross-Lingual Named Entity (LREC’12), European Language Resources
AssoRecognition Using Parallel Corpus: A New Ap- ciation (ELRA), Istanbul, Turkey, 2012, pp. 2214–
proach Using XLM-RoBERTa Alignment, 2021. URL: 2218. URL: http://www.lrec-conf.org/proceedings/
http://arxiv.org/abs/2101.11112. lrec2012/pdf/463_Paper.pdf .
[16] F. Martelli, A. S. Bejgu, C. Campagnano, J. Čibej, [22] T. B. Brown, B. Mann, N. Ryder, M. Subbiah,
R. Costa, A. Gantar, J. Kallas, S. Koeva, K. Koppel, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
S. Krek, M. Langemets, V. Lipp, S. Nimb, S. Olsen, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,
B. S. Pedersen, V. Quochi, A. Salgado, L. Simon, G. Krueger, T. Henighan, R. Child, A. Ramesh,
C. Tiberius, R.-J. Ureña-Ruiz, R. Navigli, XL-WA: a D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
Gold Evaluation Benchmark for Word Alignment in E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
14 Language Pairs, in: F. Boschetti, N. N. Gianluca C. Berner, S. McCandlish, A. Radford, I. Sutskever,
E. Lebani, Bernardo Magnini (Eds.), Proceedings D. Amodei, Language Models are Few-Shot
Learnof the Ninth Italian Conference on Computational ers, 2020. URL: http://arxiv.org/abs/2005.14165,
Linguistics (CLiC-it 2023), volume 3596, CEUR-WS, arXiv:2005.14165 [cs].</p>
        <p>Venice, Italy, 2023. [23] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
[17] H. Fei, M. Zhang, D. Ji, Cross-lingual semantic G. Mishra, A. Roberts, P. Barham, H. W. Chung,
role labeling with high-quality translated train- C. Sutton, S. Gehrmann, P. Schuh, K. Shi,
ing corpus, in: D. Jurafsky, J. Chai, N. Schluter, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes,
J. Tetreault (Eds.), Proceedings of the 58th An- Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du,
nual Meeting of the Association for Computa- B. Hutchinson, R. Pope, J. Bradbury, J. Austin,
tional Linguistics, Association for Computational M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya,
Linguistics, Online, 2020, pp. 7014–7026. URL: S. Ghemawat, S. Dev, H. Michalewski, X. Garcia,
https://aclanthology.org/2020.acl-main.627. doi:10. V. Misra, K. Robinson, L. Fedus, D. Zhou, D.
Ip18653/v1/2020.acl-main.627. polito, D. Luan, H. Lim, B. Zoph, A. Spiridonov,
[18] I. García-Ferrero, R. Agerri, G. Rigau, Model R. Sepassi, D. Dohan, S. Agrawal, M. Omernick,
and data transfer for cross-lingual sequence la- A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz,
belling in zero-resource settings, in: Y. Goldberg, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou,
Z. Kozareva, Y. Zhang (Eds.), Findings of the Associ- X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta,
ation for Computational Linguistics: EMNLP 2022, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov,
Association for Computational Linguistics, Abu N. Fiedel, PaLM: Scaling Language Modeling with
Dhabi, United Arab Emirates, 2022, pp. 6403–6416. Pathways, 2022. URL: http://arxiv.org/abs/2204.
URL: https://aclanthology.org/2022.findings-emnlp. 02311, arXiv:2204.02311 [cs].
478. doi:10.18653/v1/2022.findings-emnlp. [24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
478. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E.
Ham[19] Z.-Y. Dou, G. Neubig, Word alignment by fine- bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave,
tuning embeddings on parallel corpora, in: P. Merlo, G. Lample, LLaMA: Open and Eficient Foundation
J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of Language Models, 2023. URL: http://arxiv.org/abs/
the 16th Conference of the European Chapter 2302.13971, arXiv:2302.13971 [cs].
of the Association for Computational Linguis- [25] D. Ashok, Z. C. Lipton, PromptNER: Prompting
tics: Main Volume, Association for Computa- For Named Entity Recognition, 2023. URL: http:
tional Linguistics, Online, 2021, pp. 2112–2128. //arxiv.org/abs/2305.15444, arXiv:2305.15444 [cs].
URL: https://aclanthology.org/2021.eacl-main.181. [26] J. Wei, X. Wang, D. Schuurmans, M. Bosma,
doi:10.18653/v1/2021.eacl-main.181. B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou,
Chain[20] E. F. Tjong Kim Sang, F. De Meulder, Introduc- of-Thought Prompting Elicits Reasoning in Large
tion to the CoNLL-2003 Shared Task: Language- Language Models, in: Advances in Neural
InIndependent Named Entity Recognition, in: Pro- formation Processing Systems, arXiv, 2022. URL:
ceedings of the Seventh Conference on Natural Lan- http://arxiv.org/abs/2201.11903, arXiv:2201.11903
guage Learning at HLT-NAACL 2003, 2003, pp. 142– [cs].</p>
        <p>147. URL: https://aclanthology.org/W03-0419. [27] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang,
[21] J. Tiedemann, Parallel data, tools and inter- J. Li, G. Wang, GPT-NER: Named Entity
Recognifaces in OPUS, in: N. Calzolari, K. Choukri, tion via Large Language Models, 2023. URL: http:
T. Declerck, M. U. Doğan, B. Maegaard, J. Mar- //arxiv.org/abs/2304.10428, arXiv:2304.10428 [cs].
iani, A. Moreno, J. Odijk, S. Piperidis (Eds.), [28] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia,
cessing: System Demonstrations, Association for which is evident from the fact that on the website all
Computational Linguistics, Brussels, Belgium, 2018, Spanish recipes have an English counterpart, but not vice
pp. 66–71. URL: https://aclanthology.org/D18-2012. versa. We believe approximately 5-10% of the dataset’s
doi:10.18653/v1/D18-2012. instances to be possible MT. A good indication of this is
[42] P. Blunsom, Hidden markov models, Lecture notes, the fact that the English “to taste” is sometimes translated</p>
        <p>August 15 (2004) 48. as “para probar”, likely an MT mistake, while other times
[43] P. He, J. Gao, W. Chen, Debertav3: Improv- the correct “al gusto” is used. Although using
machineing deberta using electra-style pre-training with translated data is not ideal, this was our best choice for a
gradient-disentangled embedding sharing, 2021. Spanish-language parallel recipe corpus, due to the lack
arXiv:2111.09543. of availability of similar online resources. The use of MT
[44] P. Rajpurkar, R. Jia, P. Liang, Know what you data has implications with respect to the evaluation of
don’t know: Unanswerable questions for SQuAD, the models, as their performance would likely be lower in
in: I. Gurevych, Y. Miyao (Eds.), Proceedings of a real-world scenario involving recipes written directly
the 56th Annual Meeting of the Association for in Spanish. Nonetheless, given the limited amount of
Computational Linguistics (Volume 2: Short Pa- data we hypothesize as being machine-translated, we
pers), Association for Computational Linguistics, believe the impact would not be large enough to discredit
Melbourne, Australia, 2018, pp. 784–789. URL: our results, which focus on the improvement over the
https://aclanthology.org/P18-2124. doi:10.18653/ cross-lingual EN–ES baseline, rather than the absolute
v1/P18-2124. performance of the best model.
[45] S. Schweter, J. Baiter, Dbmdz BERT Models, https: MCR contains 276 recipes, 104 of which are bilingual
//github.com/dbmdz/berts, 2019. Accessed: 2024-04- and annotated with alignments. Due to this imbalance
22. between the number of English and Spanish recipes, the
[46] L. A. Ramshaw, M. P. Marcus, Text Chunk- number of entities is around 3x for the former, as shown
ing using Transformation-Based Learning, in Table 5. In total, MCR contains annotations for 15,257
1995. URL: http://arxiv.org/abs/cmp-lg/9505040. entities and 3,565 alignments. Along with the ingredient
doi:10.48550/arXiv.cmp-lg/9505040, lists, MCR also contains cooking instructions for all its
arXiv:cmp-lg/9505040. recipes, along with nutritional facts for 139 of them.
[47] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho,</p>
        <p>H. Kang, J. Pérez, Spanish pre-trained bert model A.2. BERT Model
and evaluation data, in: PML4DC at ICLR 2020,
2020. As a monolingual Spanish BERT model
base[48] H. Schwenk, V. Chaudhary, S. Sun, H. Gong, line to compare against mBERT, we use
F. Guzmán, WikiMatrix: Mining 135M parallel bert-base-spanish-wwm-cased (“BERTes”) [47].
sentences in 1620 language pairs from Wikipedia,
in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Pro- A.3. Results
ceedings of the 16th Conference of the European
Chapter of the Association for Computational Lin- Entity Alignment Table 6 reports the results for the
guistics: Main Volume, Association for Compu- alignment task, complete with the settings including
tational Linguistics, Online, 2021, pp. 1351–1361. Spanish-language data.</p>
        <p>URL: https://aclanthology.org/2021.eacl-main.115. Fine-tuning on the same language as the test set yields
doi:10.18653/v1/2021.eacl-main.115. better results than cross-lingual scenarios. Furthermore,
the best performance on MCR is obtained when
finetuning mDeBERTax on both Italian and Spanish.</p>
        <p>This is not the case for mBERTx and mDeBERTa,
A. Incorporating Spanish whose performance is hindered by the addition of Italian
training data. MCR is much narrower in terms of
culiIn order to test more thoroughly the soundness of our nary variety, focusing solely on Colombian recipes. On
approach, we carry out an equivalent study with Spanish. the other hand, GZ contains not just traditional Italian
recipes, but an international range of dishes. This is
probA.1. Data ably the reason why bilingual training is helpful on GZ,
We annotated an English–Spanish dataset of recipes ob- but is not beneficial with relation to MCR: adding data
tained from My Colombian Recipes,11 which we refer from a separate locale helps the models when
approachto as MCR. MCR is translated from English to Spanish, ing the more varied GZ, helping them generalize more
efectively over its data. Conversely, they are thrown of</p>
        <p>Class
food
quantity
unit
process
physical q.
color
taste
purpose
part
total</p>
        <p>TS / EMT
mBERTx</p>
        <p>mBERT
GZ</p>
        <p>MCR</p>
        <p>GZ
mBERTx</p>
        <p>MCR</p>
        <p>GZ
mDeBERTa</p>
        <p>MCR</p>
        <p>mDeBERTax
GZ MCR
by the addition of out-of-domain data when tested on MCR. Giza++ essentially matches mDeBERTa’s
perforMCR’s narrower domain. mance on MCR, which once again points to entities in</p>
        <p>Comparing the EMT fine-tuning results with the base- MCR being easier to identify compared to GZ. However,
lines at the bottom of Table 6, we can see that further the similar performance is largely due to mDeBERTa
perifne-tuning on EMT does provide a boost, compared to forming poorly on the unit and part classes, due to the
training only on XL-WA. Nonetheless, the diference in reasons outlined in Section 4.
performance is much greater when testing on GZ,
compared to MCR. When looking at mBERTx, fine-tuned on Entity Recognition Table 8 reports the results for the
both Italian and Spanish, the model improves by more NER task for all language settings. For each language,
than 23 Exact points on GZ, while the gap in performance we use the aligner models which obtained the highest
is just under 16 points on MCR. This efect is even more results on the entity alignment task. Note that, since the
dramatic for mDeBERTax, with a diference of more than aligner performance does not significantly improve with
25 points on GZ, but only 2.48 points on MCR. increased shufling (see Section 5), we only train aligner</p>
        <p>Compounded with the fact that, in general, the metrics models up to  = 0.2 for the Spanish setting due to
are much higher when testing on MCR compared to GZ, computational constraints.
this points to MCR being a much less challenging test In the Spanish monolingual setting, both BERTes and
set, compared to GZ. As previously mentioned, part of mBERT obtain F1 scores between 0.92 and 0.95 when
the dataset is likely machine translated, and since an MT fine-tuned on SMT, with the models fine-tuned on EMT
engine is more likely to follow rigidly defined patterns trailing behind by 11 to 12 points. As all the models
compared to a human translator, this might play a role perform similarly and the standard deviation is also close
into the alignment task being easier on these data. to zero, it once again appears that the entities contained</p>
        <p>Table 7 reports the performance of the best overall in the MCR dataset are not too challenging for both the
models on each of the individual classes, on both GZ and mono- and multilingual models to identify.
dataset [16] built from WikiMatrix [48], 12 featuring 14
EN–XX language combinations. Its training set is
composed of silver labels generated by a statistical model,
while the development and test sets are manually
annotated. Since XL-WA has a balanced domain distribution
and can be considered representative of general language,
B. XL-WA it can be a good resource on which to train a baseline
word-alignment model. Table 9 reports statistics for the
As additional data for intermediate word-alignment train- EN–IT and EN–ES partitions used in this study.
ing, we use XL-WA [16], a multilingual word-alignment</p>
        <sec id="sec-3-1-1">
          <p>In the bilingual fine-tuning scenario, the training data is a concatenation of the SMT datasets produced by the
models obtaining the highest performance on the two test
sets. Since this is a bilingual fine-tuning scenario, we only
use mBERT, as the monolingual models would not be able
to be fine-tuned appropriately on this multilingual data.
In this setup, the usefulness of the BERT-based aligners
becomes more evident. Indeed, while performance on
MCR is largely similar to the other setups, with all models
outperforming the baseline by a large amount, the same
cannot be said for mBERT’s performance on GZ.
Fine-tuning mBERT on the combination of the Italian and
Spanish data aligned by Fast Align and Giza++ makes
the NER model considerably worse at identifying entities
in GZ, with a performance decrease of 20 F1 points with
the data created by Fast Align and of 21 F1 points with
that created by Giza++. The opposite is true when
fine-tuning the mBERT NER model on the SMT data created
by mDeBERTa, with the model achieving an F1 of 0.94,
beating the baseline by 5 points. Compared to the model
fine-tuned on data created by Giza++, this represents a
26 F1 point increase in performance.</p>
          <p>As regards the baseline model fine-tuned on
TASTEset’s English data and tested on MCR’s Spanish
entities, we can see that, unexpectedly, the model obtains
a 0.88 F1 score, outperforming the mBERT (0.83 F1) and
BERTes (0.84 F1) models fine-tuned on the monolingual
Spanish EMT data. Despite this, fine-tuning on SMT
data produced through our alignment approach allows
the NER models to beat this 0.88 F1 baseline, reaching
scores as high as 0.95 F1, as previously mentioned.</p>
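          <p>The F1 comparisons above are at the span level: a prediction only counts when both its boundaries and its label match a gold entity exactly. A self-contained sketch of such a metric (our illustration, not the paper's evaluation code):</p>

```python
def spans(tags):
    """(start, end, label) spans from a BIO tag sequence."""
    out, start, label = [], None, None
    for i, t in enumerate(tags):
        if t == "O" or t.startswith("B-"):
            if start is not None:
                out.append((start, i, label))
                start = None
        if t.startswith("B-"):
            start, label = i, t[2:]
    if start is not None:
        out.append((start, len(tags), label))
    return set(out)


def exact_f1(gold, pred):
    """Exact-match span F1: a prediction counts only if both its
    boundaries and its label match a gold entity exactly."""
    g, p = spans(gold), spans(pred)
    tp = len(g.intersection(p))
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)


gold = ["B-quantity", "B-unit", "B-food", "I-food", "O"]
pred = ["B-quantity", "B-unit", "B-food", "O", "O"]
# the truncated food span does not count: 2 of 3 spans match
print(round(exact_f1(gold, pred), 2))  # 0.67
```

          <p>Libraries such as seqeval implement the same span-level scoring for BIO-tagged sequences.</p>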
          <p>In all three scenarios, mBERT achieves performances
comparable to those of the monolingual models. This
shows that, when inferring on multilingual corpora to
extract entities, a single multilingual model can be used,
saving time and computational resources both during
training and inference.</p>
          <p>(Table 8. NER results for each training language (it, es, it–es, en), test set (it, es, and en on GZ and MCR), aligner (Fast Align, Giza++, mBERT, mDeBERTa, or none), and NER model (mBERT, BERTit, BERTes, BERTen).)</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>(Table 9. Sentence counts of the XL-WA partitions used: EN–IT, 1,002 train / 103 dev; EN–ES, 1,002 train / 105 dev.)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>C. Computational Resources</title>
      <sec id="sec-4-1">
        <p>All models are trained on a single NVIDIA RTX 5000 Ada Generation, with 32 GB of VRAM. The total training
time is around 7-15 minutes for each alignment model,
depending on the training data combination, plus 30-60
minutes for training each on XL-WA. Training each NER
model takes around 6-7 minutes. All the training,
including multiple models for standard deviation calculation,
was carried out in under 48 hours.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>