On Cross-Language Entity Label Projection and Recognition

Paolo Gajo¹,*, Alberto Barrón-Cedeño¹

¹ Università di Bologna, Corso della Repubblica, 136, 47121 Forlì, Italy


Abstract
Most work on named entity recognition (NER) focuses solely on English. Through training data augmentation via machine translation (MT), multilingual NER can become a powerful tool for information extraction in multilingual contexts. In this paper, we augment NER data from culinary recipe ingredient lists by means of MT and word alignment (WA), following two approaches: (i) translating each entity separately, while taking into account the full context of the list, and (ii) translating the whole list of ingredients and then aligning entities using three types of WA models: Giza++, Fast Align, and BERT, the latter fine-tuned using a novel entity-shuffling approach. We depart from English data and produce Italian versions via MT, span-annotated with the entities projected from English. We then use the data produced by the two approaches to train mono- and multilingual NER BERT models. We test the performance of the WA and NER models on an annotated dataset of ingredient lists, partially out-of-domain with respect to the training data. The results show that shuffling entities leads to better BERT aligner models. The higher-quality NER data created by these models enables NER models to achieve better results, with multilingual models reaching performance equal to or greater than that of their monolingual counterparts.

Keywords
information extraction, named entity recognition, cross-lingual label projection, data augmentation

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
paolo.gajo2@unibo.it (P. Gajo); a.barron@unibo.it (A. Barrón-Cedeño)
ORCID: 0009-0009-9372-3323 (P. Gajo); 0000-0003-4719-3420 (A. Barrón-Cedeño)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



1. Introduction

Named entity recognition (NER) is a sequence labeling task with a long history of works, mainly focusing on the recognition of entities such as people, locations, and organizations. Multilingual NER has also attracted research efforts, with recent SemEval campaigns including tasks on multilingual complex NER (MultiCoNER) [1, 2]. Despite the task's popularity and the various mono- and multilingual NER datasets available, specific domains such as the culinary one likely require new annotated data. In addition, NER is often the first step in information extraction for knowledge graph construction and, to the best of our knowledge, all literature on this topic in the domain of cuisine focuses solely on English data [3, 4, 5, 6, 7]. Therefore we argue that, given cuisine's multicultural nature, more research in this direction is warranted.

Entity label projection [8] aims to address this scarcity by automating the data generation process for NER. The task consists in taking the labels associated with spans in a source text and automatically applying them to its translation in another language, i.e., the target text. Through this task, we attempt to find an efficient automatic way of developing models for entity projection across languages, in order to produce high-quality multilingual data for recipe Named Entities (r-NE) [4]. Departing from an English-language dataset containing ingredients from culinary recipes, annotated at the span level with entity category labels, we first rely on an MT engine to translate each source entity s_i individually into Italian, while taking the full context of the list into account. This results in a first entity-wise (EW) translated EN–IT–ES dataset where entities are linked across languages.¹

Using these synthetic alignments, we train BERT models to align source and target entities, shuffling the latter to prevent the model from learning to simply predict the original entity order. We then test the models on two novel entity alignment datasets, partially out-of-domain with respect to the training data, e.g., as regards the food products, units of measure, and cooking processes involved. As baselines for evaluating the BERT alignment models, we use Giza++ [9] and Fast Align [10], two statistical word alignment (WA) models. In order to produce higher-quality r-NE data, we also translate the ingredient lists across their whole length, predicting target entity spans with the best BERT models from the previous step, along with the baseline models. We thus obtain various sentence-wise (SW) translated datasets in Italian, trading some alignment accuracy for better translations.

Both types of training data, EW and SW, are then used to fine-tune mono- and multilingual BERT NER models on the task of recognizing entities in food recipes. The models are trained on various combinations of mono- and multilingual data and are tested on the entity annotations from the two aforementioned novel testing datasets.

¹ Experiments on Spanish (ES) are included in Appendix A.




Our contribution is three-fold: (i) We show the efficacy of fine-tuning alignment models by shuffling entities in contexts where most of the information depends on the presence of lexical items rather than on the dependencies linking them. (ii) We showcase the performance delta between mono- and multilingual NER models when fine-tuning on the synthetic data produced by our alignment models. These models can be used to label large datasets in multiple languages at a finer level of granularity than currently available monolingual resources. (iii) We release the code and data needed to produce such data in multiple languages.²

The rest of the paper is structured as follows. Section 2 presents relevant past research on cross-lingual entity alignment and recognition. Section 3 introduces the datasets and corpora used in the experiments, along with their annotation process. Section 4 presents architecture, training, and evaluation details for the models comprising our pipeline. Section 5 discusses the conducted experiments and their results. Finally, Section 6 summarizes the paper and draws conclusions. Appendix A shows further results, including Spanish. Appendix B presents statistics and insights on the additional training data used. Appendix C lists information on the computational requirements.

² Resources available at https://github.com/paolo-gajo/food
                                                                date, but its granularity stops at the sole food product
2. Related Work

Word alignment was first approached for statistical MT, with models such as IBM Models 1-5 [11], used in well-known implementations such as Giza++ and Fast Align. With the advent of Transformers [12] and the BERT model [13], the task has been approached with both question answering [14] and token classification [15] models, trained on freely available resources such as XL-WA [16].

A number of past works have studied label projection, following a range of approaches. Jain et al. [8] project PER, ORG, LOC, and MISC labels (person, organization, location, and miscellaneous) by translating sentences and then finding potential matches using glossaries. Fei et al. [17] align words using Fast Align and use POS tagging to enhance data for semantic role labeling. García-Ferrero et al. [18] use the AWESoME word alignment model [19] to align machine-translated data from NER datasets in seven languages. Li et al. [15] fine-tune a NER model on English PER, ORG, LOC, and MISC data from CoNLL2003 [20] and use it for inference on the source portion of parallel Opus corpora [21], with the aim of creating silver NER data. Subsequently, they train an XLM-RoBERTa alignment model using Wikipedia articles and project the labels onto the target portion of the parallel corpus, which they use to train a target-language NER model.

NER can also be approached with large language models (LLMs) [22, 23, 24] by prompting them to extract entities from a given text. For example, PromptNER [25] uses chain-of-thought prompting [26] along with a list of entity definitions to prompt a variety of LLMs, obtaining results on par with SOTA supervised NER systems. Similarly, Wang et al. [27] use in-context learning [28] to evaluate GPT-3 [22] for NER on the CoNLL2003 [20] and OntoNotes 5.0 [29] datasets, using retrieval-augmented generation [30] and comparing the results to BERT and to models based on graph neural networks [31].

With regard to data specific to the culinary domain, many English-language resources exist in various forms. RecipeDB [32] is an ontology comprising 118k web recipes which can be used to relate foods and cooking processes to taste profiles and health data. FoodOn [33] is a "farm-to-fork" ontology which provides a structure of relationships between food products across the whole industrial supply chain. Bridging the gap between ontologies and NER datasets, FoodKG [34] is a knowledge graph which can be used to find ingredient substitutions based on dietary health requirements. It is built by leveraging FoodOn and Recipe1M+ [35], a dataset originally intended for learning joint text/image embeddings on over 1M culinary recipes. Expanding on Recipe1M+, Bień et al. [36] construct RecipeNLG, comprising more than 2M recipes. It is the biggest food NER dataset to date, but its granularity stops at food product names alone. More fine-grained silver labels are obtained by Komariah et al. [37], who propose a new methodology to extract entities from AllRecipes.³ In doing so, they construct FINER, a dataset comprising 64k recipes with labels predicted by what the authors refer to as a "semi-supervised multi-model prediction technique." The dataset also contains recipe tags such as vegetarian and vegan, which can be useful for training recipe classifiers. Leveraging RecipeDB [32], a large-scale structured corpus of recipes, the authors of [38] generate a synthetic dataset of augmented ingredient phrases and compare the NER performance of various rule-based and neural models.

Despite the wide availability of English-language resources in the culinary domain, other languages are largely understudied. To the best of our knowledge, the only study to approach this domain in a multilingual setting was conducted by Radu et al. [39], who obtain NER tags automatically in English, German, and French using a regex-based tagger. Our work aims to partially address this gap by focusing on Italian.

³ https://www.allrecipes.com

3. Data

The entity alignment data used for training is generated through MT, starting from TASTEset [40], a dataset comprising the ingredient lists of 700 food recipes, annotated at the span level. We use TASTEset because it is human-curated and its annotations are fine-grained. We translate each entity one by one with DeepL,⁴ concurrently feeding the whole ingredient list and the single entity as two separate inputs.
This provides DeepL with context, improving translation quality, and retains the start and end span indexes in the target text, which are obtained by simply concatenating the translated entities. To the best of our knowledge, DeepL is currently the only MT engine capable of contextually translating a substring taken from a sentence, which is why we use it in this study. In this way, we obtain an Entity-wise Machine-translated TASTEset (EMT). Since entities are automatically paired to the source label, the distribution across English and Italian is identical (Table 1).

⁴ https://www.deepl.com/en/docs-api
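The span bookkeeping described above can be sketched as follows. This is a minimal illustration rather than the released implementation: translate_entity() is a hypothetical stand-in for the DeepL call, which we assume receives the entity and the full ingredient list as its two inputs.

```python
from typing import Callable

def translate_entity_wise(
    entities: list[dict],  # e.g. [{"text": "4", "label": "QUANTITY"}, {"text": "cups", "label": "UNIT"}, ...]
    translate_entity: Callable[[str, str], str],  # hypothetical: (entity text, full-list context) -> translation
    separator: str = " ",
) -> tuple[str, list[dict]]:
    """Translate each entity with the whole ingredient list as context and
    rebuild the target string by concatenation, so that target span indexes
    are known exactly and no word alignment step is needed."""
    context = separator.join(e["text"] for e in entities)
    target_text = ""
    target_entities = []
    for i, entity in enumerate(entities):
        if i > 0:
            target_text += separator
        translation = translate_entity(entity["text"], context)
        start = len(target_text)
        target_text += translation
        target_entities.append({"start": start, "end": len(target_text), "label": entity["label"]})
    return target_text, target_entities
```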
Class     EMT (en/it)   SMT (it)   GZ (en)   GZ (it)
food         4,020        4,017      5,958     6,473
qty.         3,780        3,777     10,186     6,564
unit         3,172        3,159      8,148     4,450
process      1,091        1,090        217       265
phys. q.       793          791      1,245     1,547
color          231          231        482       479
taste          126          125         98        72
purpose         94           94         69       126
part            55           55        220       263
total       13,362       13,259     26,631    20,272

Table 1: Dataset class distributions. EMT and SMT refer to the entity- and sentence-wise machine-translated TASTEset. GZ refers to our testing dataset.

We also generate shuffled variations of EMT, in which the entities within a single ingredient have a probability p ∈ {0.1, 0.2, . . . , 1.0} of being shuffled, for a total of ten variations. Figure 1 shows an example where entities have been shuffled in the first and third target ingredients. The rationale behind this approach is that, if the dataset were left as-is, a model trained on EMT would simply learn to associate a source entity with the target entity in the corresponding position, since entities are simply translated and replaced in EMT.
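A sketch of the shuffling step, under our reading of the setup (one random draw with probability p per ingredient; function and variable names are ours):

```python
import random

def shuffle_targets(recipe: list[list[dict]], p: float, rng: random.Random) -> list[list[dict]]:
    """With probability p, permute the translated entities inside each
    ingredient, breaking the positional source-target correspondence so
    the aligner must rely on lexical cues rather than entity order."""
    shuffled_recipe = []
    for ingredient in recipe:
        entities = list(ingredient)  # copy; the original EMT stays intact
        if rng.random() < p:
            rng.shuffle(entities)
        shuffled_recipe.append(entities)
    return shuffled_recipe

# Ten variations, one per p in {0.1, 0.2, ..., 1.0}:
# variations = [shuffle_targets(recipe, p / 10, random.Random(0)) for p in range(1, 11)]
```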
Figure 1: Aligning source s_i and shuffled target t_j entities. The source sequence A ("4 ∙ cups ∙ flour ; 1/2 teaspoon salt ; 1 teaspoon baking soda"), with the queried entity enclosed in markers, is paired with the partially shuffled target sequence B ("farina 4 tazze ; 1/2 cucchiaino sale ; cucchiaino 1 bicarbonato di sodio"); the aligner predicts the target span, e.g., M_aligner(A, B) = t_{1,3} = {'start': 9, 'end': 14}.

Overall, we have 22 different variations of EMT, i.e., the original and the 10 shuffled versions for each of the two types of tokenization (mBERT's WordPiece [13] vs. mDeBERTa's SentencePiece [41]). The datasets have to be tokenized during dataset generation because token indexes depend on the tokenizer in use when converting from character-level span annotations.
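As an illustration of why each tokenizer yields its own dataset copy, character-level spans can be mapped to token indexes with a Hugging Face fast tokenizer as follows (a minimal sketch; the example span is ours):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

text = "4 tazze farina"
char_start, char_end = 8, 14  # character span of "farina"

encoding = tokenizer(text)
# char_to_token() maps a character position to its token index; the same
# character span yields different token spans under WordPiece (mBERT)
# and SentencePiece (mDeBERTa), hence one tokenized dataset per model.
token_start = encoding.char_to_token(char_start)
token_end = encoding.char_to_token(char_end - 1)
print(token_start, token_end)
```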
We produce a second kind of synthetic dataset by first translating the ingredient lists as a whole, and then aligning source and target entities using the BERT, Giza++, and Fast Align models presented in Section 4. We refer to this type of dataset as the Sentence-wise Machine-translated TASTEset (SMT). As Table 1 shows, the SMT dataset produced by the BERT model trained on both XL-WA and the shuffled version of EMT contains slightly fewer entities than the source material. This is because the models at times produce impossible predictions, e.g., placing the end of an entity before its start.⁵ This problem does not exist with Giza++ and Fast Align, since their alignments are word-based. As additional training data for the BERT models, we use the EN–IT portion of XL-WA. Table 9 in Appendix B reports the size of each of the partitions we used.

For testing, we annotated an English–Italian dataset of recipes obtained from GialloZafferano (where the English recipes are translated from the Italian ones).⁶ For the annotation process, we recruited a professional translator who is a native speaker of Italian, with an MA in Specialized Translation in both English and Spanish. Figure 2 shows the instructions given for the first, multi-class entity annotation task, which considers the same entities as TASTEset, and for the second, cross-language entity-linking annotation task, carried out by the same annotator at a later time. The annotation was carried out in Label Studio.⁷

The GialloZafferano (GZ) dataset comprises 597 recipes. The alignments were annotated manually on a subset of 300 recipes, with the possibility of more than one source entity being aligned with one target entity, and vice versa. This is because some recipes contain more than one ingredient option in English but not in Italian (and vice versa), e.g., Cocomero (anguria) 1 fetta vs. Watermelon 1 slice. The GZ dataset contains a total of 46,903 NER annotations and 9,842 alignments.

We manually scrutinized GZ and found that the paired recipes do not always coincide completely. Some ingredients may be missing in either language, or may be an equivalent rather than the same food product. In order to avoid training the alignment models on excessively different recipes, we chose not to annotate alignments whenever the number of source ingredients missing from the target recipe surpassed a heuristic threshold of 1/3.

Note that in GZ quantities and units of measure are localized and are thus listed in both imperial and SI units. As shown in Table 1, this is reflected in the lower number of instances annotated as quantity and unit in the Italian portion of GZ, compared to its English portion.

⁵ The effect on model performance upon training is negligible, given that these predictions constitute less than 1% of the total.
⁶ https://www.giallozafferano.it
⁷ https://labelstud.io
Instructions for the multi-class entity annotation task:
"Annotate the ingredients below by assigning to spans of text one of the following categories: quantity, unit, food, color, part, physical quality, process, purpose, taste. Use quantity for numerical values or expressions such as 'to taste', identifying the quantity of an ingredient. Unit stands for 'unit of measure', such as grams (g) or ounces (oz). Use color for any color that is not part of a food's own name (e.g., 'red' can be tagged in 'red wine', but not 'black' in 'blackberries'). Part refers to parts of an ingredient, such as 'wings' in 'chicken wings'. Use physical quality for attributes which already characterize an ingredient at the start of the preparation. Process refers to actions that the reader is supposed to carry out. The label purpose answers the question, 'What is this entity used for?' Finally, use taste for words referring to a taste, such as 'unsweetened' or 'dry' with relation to a wine. N.B.: The annotations cannot overlap. You can either choose to annotate multiple spans with the same label, or avoid annotating one or more spans of text."

Instructions for the cross-lingual entity linking task:
"Link each source language entity to its corresponding entity in the target language with an arrow. Entities should only be linked if they share the same use. For example, the '2' in '2 tablespoons chopped onions' does not have the same function as in '2 cebollas largas picadas', since one refers to tablespoons and the other to the number of onions. However, 'onions' and 'chopped' could still be linked, as they are equivalent in the two sequences. Individual source entities can be linked to multiple target entities and vice versa. N.B.: Entities can still be linked if they differ slightly in form or content but still clearly perform the same function in the same ingredient context. For example, '340' and '450' could still be linked if they both refer to the quantity of grams of the same source and target food products."

Figure 2: Annotation task instructions.

4. Models

Entity Alignment. As baselines, we use two statistical models: Giza++ [9] and Fast Align [10]. Giza++ combines an HMM [42] alignment model with IBM Models 1-5 [11]. Fast Align is much more lightweight, leveraging only IBM Model 2. We also use two multilingual BERT models: mBERT [13], as the baseline multilingual Transformer model, and mDeBERTa [43], because of its larger size (276M vs. 179M parameters) and performance. When using the BERT models, we follow Nagata et al. [14] and treat entity alignment as a question-answering task, enclosing the source word to be aligned within rarely used characters, e.g., '∙', and feeding the model both the source sequence A and the target sequence B at once. Figure 1 exemplifies this, where the model M_aligner is trained to predict an entity within a shuffled ingredient's boundaries.
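The input formatting can be sketched as follows (a simplified illustration of this setup; names and the toy example are ours):

```python
def build_aligner_input(source: str, start: int, end: int, target: str,
                        marker: str = "\u2219") -> tuple[str, str]:
    """Enclose the queried source entity in rarely used marker characters
    ('∙') and pair the marked source with the target sequence: the former
    plays the role of the question, the latter of the context from which
    a QA model extracts the answer span."""
    marked = f"{source[:start]}{marker} {source[start:end]} {marker}{source[end:]}"
    return marked, target

question, context = build_aligner_input(
    "4 cups flour", 2, 6,  # "cups" is the queried entity
    "farina 4 tazze",
)
# The model is trained to predict the character span of the aligned
# entity in `context`, here {'start': 9, 'end': 14} for "tazze".
```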
We train the models for up to 3 epochs on each dataset with a batch size of 16. The optimizer's learning rate is set to 3 × 10⁻⁴, while ε is 10⁻⁸. In each training run, we select the best model based on the Exact metric E [44]:

    E = ( Σᵢ exact(pᵢ, gᵢ) ) / ‖preds‖,        (1)

where preds is the list of the n predictions, i ranges over them, and exact(pᵢ, gᵢ) is the Kronecker delta:

    exact(pᵢ, gᵢ) = 1 if pᵢ = gᵢ, and 0 if pᵢ ≠ gᵢ,        (2)

with the predicted and gold strings pᵢ and gᵢ having been lowercased and stripped of excess punctuation and spaces. We calculate the mean Exact score and its standard deviation over five random runs for each model.
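Equations (1)-(2), together with the normalization step, amount to the following sketch (our own illustration; the normalization rules are paraphrased from the text):

```python
import string

def normalize(s: str) -> str:
    """Lowercase and strip excess punctuation and whitespace."""
    return " ".join(s.lower().split()).strip(string.punctuation + " ")

def exact_metric(preds: list[str], golds: list[str]) -> float:
    """Mean Kronecker delta between normalized prediction/gold pairs (Eq. 1)."""
    matches = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return 100 * matches / len(preds)

print(exact_metric(["Tazze,", "sale "], ["tazze", "farina"]))  # 50.0
```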
IBM M2. We use two multilingual BERT models as well:                     In order to improve the models’ ability to align enti-
mBERT [13] as the baseline multilingual Transformer                   ties, we optionally train them on an intermediary word-
model and mDeBERTa [43] because of its larger size                    alignment task using the EN–IT training and dev sets
(276𝑀 vs 179𝑀 param.) and performance. When using                     of XL-WA. In addition, we train mBERT and mDeBERTa
the BERT models, we follow Nagata et al. [14] and treat               solely using said XL-WA partitions in order to test them
entity alignment as a question-answering task, enclosing              directly on GZ. This serves as a baseline which will allow
the source word to be aligned within rarely used charac-              us to gauge the positive effects of fine-tuning on EMT.
ters, e.g., ‘∙’, feeding the model both the source sequence
𝐴 and the target sequence 𝐵 at once. Figure 1 exempli-
fies this, where the model 𝑀𝑎𝑙𝑖𝑔𝑛𝑒𝑟 is trained to predict
Entity Recognition. For the NER task, treated as token classification, we once again use mBERT.⁸ To test the efficacy of the multilingual approach, we also use the following monolingual models when training and testing on a single language: bert-base-uncased (henceforth "BERTen") for English [13] and bert-base-italian-uncased ("BERTit") [45] for Italian. We forgo mDeBERTa for this task, as the focus is on comparing models of equivalent size and performance. Prior to training, the data is preprocessed and labeled using the BIO annotation scheme [46]. We ignore subword tokens when calculating the cross-entropy loss, following established methodology.⁹

⁸ We do not use the larger mDeBERTa model due to the computational cost deriving from the number of language combinations.
⁹ https://huggingface.co/docs/transformers/en/tasks/token_classification
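The masking follows the standard Hugging Face token-classification recipe documented at the link above; a condensed sketch (variable names are ours):

```python
def align_labels_with_tokens(word_labels: list[int], word_ids: list[int | None]) -> list[int]:
    """Give each BIO label to the first subword of its word; special tokens
    and continuation subwords get -100, the index PyTorch's cross-entropy
    loss ignores by default."""
    labels, previous_word = [], None
    for word_id in word_ids:
        if word_id is None:              # [CLS], [SEP], padding
            labels.append(-100)
        elif word_id != previous_word:   # first subword of a new word
            labels.append(word_labels[word_id])
        else:                            # continuation subword
            labels.append(-100)
        previous_word = word_id
    return labels

# word_ids() comes from a fast tokenizer called on pre-split words, e.g.
# for ["1", "cucchiaino", "sale"] one might get [None, 0, 1, 1, 2, None].
print(align_labels_with_tokens([1, 2, 3], [None, 0, 1, 1, 2, None]))
# -> [-100, 1, 2, -100, 3, -100]
```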
We train the models on the EN–IT, EN–ES, and EN–IT–ES language subsets of EMT and of the four versions of SMT produced by mBERT, mDeBERTa, Giza++, and Fast Align. For the BERT models, we use the same hyperparameters as for the alignment task, but with a lower learning rate of 2 × 10⁻⁴. The models are evaluated using the macro F1-measure. Details on the employed computational resources can be found in Appendix C.

5. Results and Discussion

Entity Alignment. Table 2 reports the Exact scores for the entity alignment experiment.

Data    P     mBERT         mBERTx
EMT     0.0   35.93±0.79    38.87±0.48
        0.1   43.13±2.51    44.49±1.21
        0.2   42.54±1.37    44.02±3.32
        0.3   42.49±3.64    46.61±1.62
        0.4   42.31±2.58    47.04±4.01
        0.5   41.87±1.93    47.22±1.64
        0.6   44.84±2.19    46.89±3.36
        0.7   42.87±3.61    47.36±2.06
        0.8   44.08±1.98    48.34±2.73
        0.9   42.87±3.27    47.28±1.49
        1.0   41.65±2.25    45.98±1.97
XL-WA   –                   21.04

Data    P     mDeBERTa      mDeBERTax
EMT     0.0   42.17±1.19    46.98±3.77
        0.1   57.00±0.94    58.45±1.37
        0.2   55.03±2.40    57.02±2.43
        0.3   57.09±3.61    60.25±2.35
        0.4   57.26±1.09    59.21±2.59
        0.5   55.97±3.11    58.43±2.53
        0.6   58.37±2.46    61.07±2.94
        0.7   57.07±1.58    60.68±3.01
        0.8   57.31±1.20    62.08±3.74
        0.9   56.95±2.69    61.05±1.27
        1.0   57.59±1.81    60.87±1.13
XL-WA   –                   31.71

Table 2: Exact metric results of the alignment task, averaged over 5 random runs (except for the XL-WA baseline). Best in bold.

The entity shuffling approach appears to be very effective at creating data which makes the models better at generalizing. The performance of every single model is greatly enhanced when shuffling ingredients just 10% of the time, with increased shuffling frequency not leading to any significant further improvement. The increase in performance seems to be greater for models which have undergone intermediate training on XL-WA, with mDeBERTax gaining almost 12 points in the Exact metric when fine-tuned on shuffled data. Unsurprisingly, the larger mDeBERTa performs much better than the smaller mBERT across the board. Although the highest mean performance is obtained at P = 0.8, an overlap can be observed between all the confidence intervals for P ≥ 0.1. However, this is not true when going from P = 0 to P = 0.1. Consequently, increased shuffling past 10% does not seem to provide a concrete performance gain, which is why we decided to produce SMT using the BERT models trained on the least-shuffled version of EMT.

In and of itself, the intermediary training step on XL-WA provides a slight performance boost, as seen when comparing mBERT with mBERTx and mDeBERTa with mDeBERTax. Still, this increase is much smaller than the one gained through shuffling. While fine-tuning the models on a general word-alignment task can be beneficial, the target domain is likely too different from the training data for this to produce a large performance boost. This is especially true as regards sentence structure, since the test data is comprised of short lists of entities separated by semicolons, while the training data is a domain-balanced sample of sentences from Wikipedia. An additional performance boost is provided by multilingual fine-tuning, while cross-lingual settings (e.g., fine-tuning on ES and testing on IT) lead to worse outcomes. Table 6 (Appendix A) shows the results.

Class       Fast Align   Giza++   mBERTx   mDeBERTax
Qty.          18.41       35.21    30.09     54.95
Unit          30.94       15.24    24.81     29.75
Food          61.95       77.01    81.66     83.49
Process       15.27       51.91    62.60     83.21
Color         33.70       84.81    67.04     85.93
Phys. q.      39.00       71.76    61.41     87.66
Taste          0.00       27.03    35.14     75.68
Purpose       25.64       61.54    94.87     89.74
Part          52.48       63.37    13.86     14.85
Macro avg.    30.82       54.21    52.38     67.25

Table 3: Exact metric results of the alignment task by class on GZ for the best models (trained on IT⊕ES). Best in bold.

Table 3 reports the performance of the best overall models on each class. As the results show, the much lighter Giza++ model surpasses mBERTx, trailing only behind mDeBERTax. The poor scores achieved by the two BERT models are largely attributable to the unit and part classes. We hypothesize that this class-specific weakness has to do with units of measure often being very short strings. Training mDeBERTa only on the unit instances does not improve its performance, with the model scoring a lower 18.08 Exact metric. Inspecting its individual predictions in this single-class scenario, we noticed that the model does learn to always predict two consecutive tokens, but the enclosed token does not match the original text when converted into characters. This is due to two separate issues: (i) the model selects the wrong span, e.g., selecting an ingredient such as "carote" (carrots) rather than the unit "g", or (ii) the model's prediction is empty when converted to characters. Since mBERT and mDeBERTa both perform poorly on this class while using two different tokenization algorithms (WordPiece vs. SentencePiece), the problem may lie in the tokenizers' token-to-character conversion method.¹⁰ We plan to shed light on this in the future.

¹⁰ https://huggingface.co/docs/transformers/en/main_classes/tokenizer#transformers.BatchEncoding.char_to_token
As regards the part class, the poor performance could be explained by the small number of training instances (55). However, the models obtain high scores on the purpose class, which also has just 94 instances (mBERTx achieves a 94.87 Exact score). Unfortunately, repeating the approach we used for the unit class is not feasible here, as fine-tuning the model on just 55 instances does not produce any reliable results (E_part = 3.96), meaning this will have to be left for future work.

The rest of the results from Table 3 are generally in line with the average results from Table 2. The scores achieved by the baselines for each class do not show any evident outliers, save for Fast Align scoring 0 on taste. More generally, Fast Align, being the simplest and most lightweight model, performs on average well below the other, more complex models.

Entity Recognition. Table 4 reports the results for the NER task.

Train   Test   Aligner      NER      F1
it      it     –            mBERT    0.89±0.01
               mBERTx       mBERT    0.91±0.02
               mDeBERTax    mBERT    0.94±0.01
               Fast Align   mBERT    0.84±0.01
               Giza++       mBERT    0.87±0.03
it      it     –            BERTit   0.86±0.01
               mBERTx       BERTit   0.90±0.04
               mDeBERTax    BERTit   0.94±0.00
               Fast Align   BERTit   0.85±0.04
               Giza++       BERTit   0.91±0.03
en      it     –            mBERT    0.79±0.05
en      en     –            mBERT    0.90±0.01
en      en     –            BERTen   0.91±0.01

Table 4: Model performance for the entity recognition task, in terms of F1 measure. All results are macro averages over 5 random runs.

The aligner column indicates which alignment model, among the best ones listed in Table 3, produced the SMT training data used to fine-tune the NER model. When no alignment model is specified, the training data used is EMT. Note that in this case we do not use EMT's shuffled versions, as there is no relation between any two recipes when fine-tuning on the NER task.

When training and testing on Italian data, the best results for both mBERT and BERTit are obtained when fine-tuning on the SMT data produced by mDeBERTa. When fine-tuning on EMT, the performance is noticeably lower, with a 5-point difference for mBERT and an 8-point difference for BERTit. The data produced by mBERT also allows both models to outperform the EMT baseline, although by smaller margins. Conversely, the data produced by Fast Align and Giza++ worsens data quality in 75% of the cases. When fine-tuning mBERT on bilingual ES–IT data, the performance on the test set remains essentially unvaried (see Table 8 in Appendix A).

Looking at the baselines at the bottom of Table 4, we can see that fine-tuning mBERT on English data yields worse performance when testing on GZ than fine-tuning on EMT's Italian data. Our data augmentation strategy thus provides an evident performance boost, with entity alignment producing bigger improvements than machine-translating each entity individually.

In all settings, mBERT performs on par with the monolingual models. This shows that a single multilingual model can suffice when extracting entities from multilingual corpora, saving time and compute.

6. Conclusions

We explored a simple novel technique to automatically generate high-quality multilingual NER data by combining machine translation and cross-language entity linking. For our experiments, we relied on the English-language TASTEset dataset, which includes recipes whose lists of ingredients are span-annotated for entity recognition. Moreover, we manually curated a novel English–Italian cross-language dataset featuring the same kind of annotation, with the addition of cross-language alignments.

We machine translated the entities in TASTEset's recipes individually and shuffled them within ingredient boundaries. Leveraging this augmented data, we then fine-tuned BERT entity-alignment models. Using statistical word-alignment models as baselines, we tested these BERT models on our English–Italian parallel corpus. The results showed that models fine-tuned using our novel approach consistently outperform those trained on unshuffled data, along with the two statistical baselines.

We then created additional synthetic data by first translating TASTEset's recipes in their entirety, and then aligning the entities in the machine-translated target text using the best models obtained from the first part of the study. These data allowed us to obtain better NER models, compared to the ones we would have obtained using the original recipes translated entity by entity. We tested monolingual English and Italian BERT models against mBERT, and showed that the latter is capable of obtaining the same performance as its monolingual counterparts when tested on monolingual NER data.

In future work, we plan to extend the annotation of our datasets, both in terms of number of instances and of annotators. We will also prioritize solving the token-to-character conversion issues encountered in this study. Furthermore, we plan to leverage this data augmentation technique to improve multilingual text-to-graph models, since all of the literature in this regard focuses on English-only data [3, 4, 5, 6, 7].

References

[1] S. Malmasi, A. Fang, B. Fetahu, S. Kar, O. Rokhlenko, SemEval-2022 task 11: Multilingual complex named
entity recognition (MultiCoNER), in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Association for Computational Linguistics, Seattle, United States, 2022, pp. 1412–1437. URL: https://aclanthology.org/2022.semeval-1.196.

[2] B. Fetahu, S. Kar, Z. Chen, O. Rokhlenko, S. Malmasi, SemEval-2023 task 2: Fine-grained multilingual named entity recognition (MultiCoNER 2), in: A. K. Ojha, A. S. Doğruöz, G. Da San Martino, H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.), Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2247–2265. URL: https://aclanthology.org/2023.semeval-1.310.

[3] C. Kiddon, G. T. Ponnuraj, L. Zettlemoyer, Y. Choi, Mise en Place: Unsupervised Interpretation of Instructional Recipes, in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 982–992. URL: https://aclanthology.org/D15-1114. doi:10.18653/v1/D15-1114.

[4] Y. Yamakata, S. Mori, J. Carroll, English Recipe Flow Graph Corpus, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 5187–5194. URL: https://aclanthology.org/2020.lrec-1.638.

[5] D. P. Papadopoulos, E. Mora, N. Chepurko, K. W. Huang, F. Ofli, A. Torralba, Learning Program Representations for Food Images and Cooking Recipes, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New Orleans, LA, USA, 2022, pp. 16538–16548. URL: https://ieeexplore.ieee.org/document/9878478/. doi:10.1109/CVPR52688.2022.01606.

[6] D. J. Bhatt, S. A. Abdollahpouri Hosseini, F. Fancellu, A. Fazly, End-to-end Parsing of Procedural Text into Flow Graphs, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 5833–5842. URL: https://aclanthology.org/2024.lrec-main.517.

[7] A. Diallo, A. Bikakis, L. Dickens, A. Hunter, R. Miller, Unsupervised Learning of Graph from Recipes, 2024. URL: http://arxiv.org/abs/2401.12088.

[8] A. Jain, B. Paranjape, Z. C. Lipton, Entity projection via machine translation for cross-lingual NER, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1083–1092. URL: https://aclanthology.org/D19-1100. doi:10.18653/v1/D19-1100.

[9] F. J. Och, H. Ney, A systematic comparison of various statistical alignment models, Computational Linguistics 29 (2003) 19–51.

[10] C. Dyer, V. Chahuneau, N. A. Smith, A Simple, Fast, and Effective Reparameterization of IBM Model 2, in: L. Vanderwende, H. Daumé III, K. Kirchhoff (Eds.), Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, 2013, pp. 644–648. URL: https://aclanthology.org/N13-1073.

[11] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, R. L. Mercer, The mathematics of statistical machine translation: Parameter estimation, Computational Linguistics 19 (1993) 263–311. URL: https://aclanthology.org/J93-2003.

[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.

[14] M. Nagata, K. Chousa, M. Nishino, A supervised word alignment method based on cross-language span prediction using multilingual BERT, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 555–565. URL: https://aclanthology.org/2020.emnlp-main.41. doi:10.18653/v1/2020.emnlp-main.41.

[15] B. Li, Y. He, W. Xu, Cross-Lingual Named Entity Recognition Using Parallel Corpus: A New Approach Using XLM-RoBERTa Alignment, 2021. URL: http://arxiv.org/abs/2101.11112.

[16] F. Martelli, A. S. Bejgu, C. Campagnano, J. Čibej, R. Costa, A. Gantar, J. Kallas, S. Koeva, K. Koppel, S. Krek, M. Langemets, V. Lipp, S. Nimb, S. Olsen, B. S. Pedersen, V. Quochi, A. Salgado, L. Simon, C. Tiberius, R.-J. Ureña-Ruiz, R. Navigli, XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs, in: F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli (Eds.), Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023), volume 3596, CEUR-WS, Venice, Italy, 2023.

[17] H. Fei, M. Zhang, D. Ji, Cross-lingual semantic role labeling with high-quality translated training corpus, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7014–7026. URL: https://aclanthology.org/2020.acl-main.627. doi:10.18653/v1/2020.acl-main.627.

[18] I. García-Ferrero, R. Agerri, G. Rigau, Model and data transfer for cross-lingual sequence labelling in zero-resource settings, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 6403–6416. URL: https://aclanthology.org/2022.findings-emnlp.478. doi:10.18653/v1/2022.findings-emnlp.478.

[19] Z.-Y. Dou, G. Neubig, Word alignment by fine-tuning embeddings on parallel corpora, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 2112–2128. URL: https://aclanthology.org/2021.eacl-main.181. doi:10.18653/v1/2021.eacl-main.181.

[20] E. F. Tjong Kim Sang, F. De Meulder, Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. URL: https://aclanthology.org/W03-0419.

[21] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214–2218. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.

[22] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, 2020. URL: http://arxiv.org/abs/2005.14165, arXiv:2005.14165 [cs].

[23] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, PaLM: Scaling Language Modeling with Pathways, 2022. URL: http://arxiv.org/abs/2204.02311, arXiv:2204.02311 [cs].

[24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and Efficient Foundation Language Models, 2023. URL: http://arxiv.org/abs/2302.13971, arXiv:2302.13971 [cs].

[25] D. Ashok, Z. C. Lipton, PromptNER: Prompting For Named Entity Recognition, 2023. URL: http://arxiv.org/abs/2305.15444, arXiv:2305.15444 [cs].

[26] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, in: Advances in Neural Information Processing Systems, arXiv, 2022. URL: http://arxiv.org/abs/2201.11903, arXiv:2201.11903 [cs].

[27] S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, G. Wang, GPT-NER: Named Entity Recognition via Large Language Models, 2023. URL: http://arxiv.org/abs/2304.10428, arXiv:2304.10428 [cs].

[28] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, Z. Sui, A Survey on In-context Learning, 2024. URL: http://arxiv.org/abs/2301.00234, arXiv:2301.00234 [cs].

[29] S. Pradhan, A. Moschitti, N. Xue, H. T. Ng, A. Björkelund, O. Uryupina, Y. Zhang, Z. Zhong, Towards Robust Linguistic Analysis using OntoNotes, in: J. Hockenmaier, S. Riedel (Eds.), Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 143–152. URL: https://aclanthology.org/W13-3516.

[30] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, in: H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 9459–9474. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf.

[31] S. Wang, Y. Meng, R. Ouyang, J. Li, T. Zhang,

[35] J. Marin, A. Biswas, F. Ofli, N. Hynes, A. Salvador, Y. Aytar, I. Weber, A. Torralba, Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images, 2019. URL: http://arxiv.org/abs/1810.06553. doi:10.48550/arXiv.1810.06553, arXiv:1810.06553 [cs].

[36] M. Bień, M. Gilski, M. Maciejewska, W. Taisner, D. Wisniewski, A. Lawrynowicz, RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation, in: B. Davis, Y. Graham, J. Kelleher, Y. Sripada (Eds.), Proceedings of the 13th International Conference on Natural Language Generation, Association for Computational Linguistics, Dublin, Ireland, 2020, pp. 22–28. URL: https://aclanthology.org/2020.inlg-1.4. doi:10.18653/v1/2020.inlg-1.4.

[37] K. S. Komariah, A. T. Purnomo, A. Satriawan, M. O. Hasanuddin, C. Setianingsih, B.-K. Sin, SMPT: A Semi-Supervised Multi-Model Prediction Technique for Food Ingredient Named Entity Recognition (FINER) Dataset Construction, Informatics 10 (2023) 10. URL: https://www.mdpi.com/2227-9709/10/1/10. doi:10.3390/
     L. Lyu, G. Wang, GNN-SL: Sequence Labeling                     informatics10010010, number: 1 Publisher:
     Based on Nearest Examples via GNN, in: A. Rogers,              Multidisciplinary Digital Publishing Institute.
     J. Boyd-Graber, N. Okazaki (Eds.), Findings of the        [38] A. Agarwal, J. Kapuriya, S. Agrawal, A. V. Konam,
     Association for Computational Linguistics: ACL                 M. Goel, R. Gupta, S. Rastogi, N. Niharika, G. Bagler,
     2023, Association for Computational Linguistics,               Deep Learning Based Named Entity Recognition
     Toronto, Canada, 2023, pp. 12679–12692. URL: https:            Models for Recipes, in: N. Calzolari, M.-Y. Kan,
     //aclanthology.org/2023.findings-acl.803. doi:10.              V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceed-
     18653/v1/2023.findings-acl.803.                                ings of the 2024 Joint International Conference on
[32] D. Batra, N. Diwan, U. Upadhyay, J. S. Kalra,                  Computational Linguistics, Language Resources
     T. Sharma, A. K. Sharma, D. Khanna, J. S.                      and Evaluation (LREC-COLING 2024), ELRA and
     Marwah, S. Kalathil, N. Singh, R. Tuwani,                      ICCL, Torino, Italia, 2024, pp. 4542–4554. URL:
     G. Bagler, RecipeDB: a resource for exploring                  https://aclanthology.org/2024.lrec-main.406.
     recipes, Database 2020 (2020) baaa077. URL: https:        [39] C. Radu, C.-E. Staicu, L.-M. Mitrică, M. Dînşore-
     //doi.org/10.1093/database/baaa077. doi:10.1093/               anu, R. Potolea, C. Lemnaru, Extracting Settings
     database/baaa077.                                              from Multilingual Recipes with Various Sequence
[33] D. M. Dooley, E. J. Griffiths, G. S. Gosal, P. L.              Tagging Models: an Experimental Study, in: 2022
     Buttigieg, R. Hoehndorf, M. C. Lange, L. M. Schriml,           IEEE 18th International Conference on Intelligent
     F. S. L. Brinkman, W. W. L. Hsiao, FoodOn: a                   Computer Communication and Processing (ICCP),
     harmonized food ontology to increase global food               2022, pp. 65–72. URL: https://ieeexplore.ieee.org/
     traceability, quality control and data integration,            document/10053968/?arnumber=10053968. doi:10.
     npj Science of Food 2 (2018) 23. URL: https:                   1109/ICCP56966.2022.10053968, iSSN: 2766-
     //www.nature.com/articles/s41538-018-0032-6.                   8495.
     doi:10.1038/s41538-018-0032-6.                            [40] A. Wróblewska, A. Kaliska, M. Pawłowski,
[34] S. Haussmann, O. Seneviratne, Y. Chen, Y. Ne’eman,             D. Wiśniewski, W. Sosnowski, A. Ławrynowicz,
     J. Codella, C.-H. Chen, D. L. McGuinness, M. J. Zaki,          TASTEset – Recipe Dataset and Food Entities Recog-
     FoodKG: A Semantics-Driven Knowledge Graph                     nition Benchmark, 2022. URL: http://arxiv.org/abs/
     for Food Recommendation, in: C. Ghidini, O. Har-               2204.07775.
     tig, M. Maleshkova, V. Svátek, I. Cruz, A. Hogan,         [41] T. Kudo, J. Richardson, SentencePiece: A simple and
     J. Song, M. Lefrançois, F. Gandon (Eds.), The Se-              language independent subword tokenizer and deto-
     mantic Web – ISWC 2019, Springer International                 kenizer for Neural Text Processing, in: E. Blanco,
     Publishing, Cham, 2019, pp. 146–162. doi:10.1007/              W. Lu (Eds.), Proceedings of the 2018 Conference
     978-3-030-30796-7_10.                                          on Empirical Methods in Natural Language Pro-
     cessing: System Demonstrations, Association for        which is evident from the fact that on the website all
     Computational Linguistics, Brussels, Belgium, 2018,    Spanish recipes have an English counterpart, but not vice
     pp. 66–71. URL: https://aclanthology.org/D18-2012.     versa. We believe approximately 5-10% of the dataset’s
     doi:10.18653/v1/D18-2012.                              instances to be possible MT. A good indication of this is
[42] P. Blunsom, Hidden markov models, Lecture notes,       the fact that the English “to taste” is sometimes translated
     August 15 (2004) 48.                                   as “para probar”, likely an MT mistake, while other times
[43] P. He, J. Gao, W. Chen, Debertav3: Improv-             the correct “al gusto” is used. Although using machine-
     ing deberta using electra-style pre-training with      translated data is not ideal, this was our best choice for a
     gradient-disentangled embedding sharing, 2021.         Spanish-language parallel recipe corpus, due to the lack
     arXiv:2111.09543.                                      of availability of similar online resources. The use of MT
[44] P. Rajpurkar, R. Jia, P. Liang, Know what you          data has implications with respect to the evaluation of
     don’t know: Unanswerable questions for SQuAD,          the models, as their performance would likely be lower in
     in: I. Gurevych, Y. Miyao (Eds.), Proceedings of       a real-world scenario involving recipes written directly
     the 56th Annual Meeting of the Association for         in Spanish. Nonetheless, given the limited amount of
     Computational Linguistics (Volume 2: Short Pa-         data we hypothesize as being machine-translated, we
     pers), Association for Computational Linguistics,      believe the impact would not be large enough to discredit
     Melbourne, Australia, 2018, pp. 784–789. URL:          our results, which focus on the improvement over the
     https://aclanthology.org/P18-2124. doi:10.18653/       cross-lingual EN–ES baseline, rather than the absolute
     v1/P18-2124.                                           performance of the best model.
[45] S. Schweter, J. Baiter, Dbmdz BERT Models, https:         MCR contains 276 recipes, 104 of which are bilingual
     //github.com/dbmdz/berts, 2019. Accessed: 2024-04-     and annotated with alignments. Due to this imbalance
     22.                                                    between the number of English and Spanish recipes, the
[46] L. A. Ramshaw, M. P. Marcus, Text Chunk-               number of entities is around 3x for the former, as shown
     ing using Transformation-Based Learning,               in Table 5. In total, MCR contains annotations for 15,257
     1995. URL: http://arxiv.org/abs/cmp-lg/9505040.        entities and 3,565 alignments. Along with the ingredient
     doi:10.48550/arXiv.cmp-lg/9505040,                     lists, MCR also contains cooking instructions for all its
     arXiv:cmp-lg/9505040.                                  recipes, along with nutritional facts for 139 of them.
[47] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho,
     H. Kang, J. Pérez, Spanish pre-trained bert model      A.2. BERT Model
     and evaluation data, in: PML4DC at ICLR 2020,
     2020.                                                  As a monolingual Spanish BERT model base-
[48] H. Schwenk, V. Chaudhary, S. Sun, H. Gong,             line to compare against mBERT,       we use
     F. Guzmán, WikiMatrix: Mining 135M parallel            bert-base-spanish-wwm-cased (“BERTes ”) [47].
     sentences in 1620 language pairs from Wikipedia,
     in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Pro-   A.3. Results
     ceedings of the 16th Conference of the European
     Chapter of the Association for Computational Lin-   Entity Alignment Table 6 reports the results for the
     guistics: Main Volume, Association for Compu-       alignment task, complete with the settings including
     tational Linguistics, Online, 2021, pp. 1351–1361.  Spanish-language data.
     URL: https://aclanthology.org/2021.eacl-main.115.      Fine-tuning on the same language as the test set yields
     doi:10.18653/v1/2021.eacl-main.115.                 better results than cross-lingual scenarios. Furthermore,
                                                         the best performance on MCR is obtained when fine-
                                                         tuning mDeBERTax on both Italian and Spanish.
                                                            This is not the case for mBERTx and mDeBERTa,
A. Incorporating Spanish

To test the soundness of our approach more thoroughly, we carry out an equivalent study with Spanish.
A.1. Data

We annotated an English–Spanish dataset of recipes obtained from My Colombian Recipes,11 which we refer to as MCR. MCR is translated from English to Spanish, which is evident from the fact that on the website all Spanish recipes have an English counterpart, but not vice versa. We believe approximately 5–10% of the dataset's instances may be machine translated. A good indication of this is that the English "to taste" is sometimes rendered as "para probar", a likely MT mistake, while at other times the correct "al gusto" is used. Although using machine-translated data is not ideal, this was our best option for a Spanish-language parallel recipe corpus, given the lack of similar online resources. The use of MT data has implications for the evaluation of the models, as their performance would likely be lower in a real-world scenario involving recipes written directly in Spanish. Nonetheless, given the limited amount of data we hypothesize to be machine translated, we believe the impact is not large enough to discredit our results, which focus on the improvement over the cross-lingual EN–ES baseline rather than on the absolute performance of the best model.

11 https://www.mycolombianrecipes.com
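As an illustration of the kind of check behind this estimate, the following minimal sketch flags Spanish lines that contain known calques such as "para probar". The phrase list and the function are purely illustrative assumptions, not the procedure used to build MCR.

    # Minimal sketch (illustrative, not the MCR construction procedure):
    # flag Spanish ingredient lines containing literal renderings of
    # English recipe phrases, such as the "para probar" calque above.
    SUSPECT_CALQUES = {
        "para probar": "al gusto",  # literal rendering of "to taste"
    }

    def flag_possible_mt(spanish_lines):
        """Return (line, calque, preferred) triples for manual inspection."""
        hits = []
        for line in spanish_lines:
            lowered = line.lower()
            for calque, preferred in SUSPECT_CALQUES.items():
                if calque in lowered:
                    hits.append((line, calque, preferred))
        return hits

    print(flag_possible_mt(["1/2 cucharadita de sal, para probar"]))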
   MCR contains 276 recipes, 104 of which are bilingual and annotated with alignments. Due to this imbalance between the number of English and Spanish recipes, the number of entities is around three times higher for the former, as shown in Table 5. In total, MCR contains annotations for 15,257 entities and 3,565 alignments. Along with the ingredient lists, MCR also contains cooking instructions for all its recipes, along with nutritional facts for 139 of them.

A.2. BERT Model

As a monolingual Spanish BERT model baseline to compare against mBERT, we use bert-base-spanish-wwm-cased ("BERTes") [47].
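A minimal sketch of loading the two baselines for token classification follows. The dccuchile/ Hugging Face namespace and the IOB2 label count (two tags for each of the nine classes in Table 5, plus O) are assumptions for illustration; only the checkpoint name itself is given above.

    # Minimal sketch: load the Spanish baseline (BERTes) and mBERT for
    # token classification. The dccuchile/ hub prefix and the IOB2 label
    # count are assumptions, not taken from the text.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    NUM_LABELS = 2 * 9 + 1  # B-/I- tags for the 9 classes in Table 5, plus O

    for checkpoint in ("dccuchile/bert-base-spanish-wwm-cased",  # BERTes
                       "bert-base-multilingual-cased"):          # mBERT
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForTokenClassification.from_pretrained(
            checkpoint, num_labels=NUM_LABELS)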
                         TS / EMT                      SMT                                GZ                    MCR
                                             mBERTx         mDeBERTax
           Class         en / it / es     it       es       it      es            en             it          en          es
           food             4,020        3,999    4,012    4,017   4,018          5,958         6,473       3,600      1,143
           quantity         3,780        3,764    3,778    3,777   3,780         10,186         6,564       2,945        962
           unit             3,172        3,151    3,169    3,159   3,171          8,148         4,450       2,325        760
           process          1,091        1,066    1,089    1,090   1,091            217           265       1,236        379
           physical q.        793          785      791      791     793          1,245         1,547         897        285
           color              231          226      231      231     231            482           479         309         97
           taste              126          121      123      125     123             98             72          8           2
           purpose              94           94       94       94      94            69           126          89         34
           part                 55           53       55       55      55           220           263         142         44
           total          13,362        13,259 13,342     13,339 13,356         26,631         20,272      11,551      3,706
Table 5
Dataset class distributions. EMT and SMT refer to the entity- and sentence-wise machine-translated TASTEset. GZ and MCR
refer to our testing datasets.

                              mBERT                   mBERTx                 mDeBERTa                        mDeBERTax
   Data       P          GZ        MCR           GZ        MCR              GZ     MCR                      GZ     MCR
   EMT
              0     35.93±0.79                38.87±0.48                42.17±1.19                       46.98±3.77
     it      0.1    43.13±2.51                44.49±1.21                57.00±0.94                       58.45±1.37
             0.2    42.54±1.37                44.02±3.32                55.03±2.40                       57.02±2.43
              0               49.03±0.59            50.38±1.10            51.93±0.65            53.20±0.83
    es       0.1              63.69±0.96            67.60±0.74            70.43±2.48            71.07±3.62
             0.2              66.07±1.30            70.20±1.66            69.25±1.94            72.62±1.93
              0    33.82±5.30 46.59±0.98 41.54±1.87 47.72±1.56 46.98±3.77 40.33±1.97 45.17±2.58 52.70±1.09
   it–es     0.1   43.36±2.72 64.57±2.35 46.14±3.85 67.16±2.19 58.45±1.37 53.68±2.45 57.64±0.83 72.95±1.75
             0.2   44.37±1.57 67.62±0.33 47.14±1.43 69.10±1.10 57.02±2.43 54.87±1.37 58.84±2.68 72.71±1.83
   XL-WA
     it  –                                       21.04                                                      31.71
     es  –                                                    54.14                                         58.56
   it–es –                                       23.60        53.89                                      33.56 70.47
Table 6
Alignment task results (Exact metric). All results are averaged over 5 random runs, except for the XL-WA baselines. Best in bold.



A.3. Results

Entity Alignment  Table 6 reports the results for the alignment task, complete with the settings including Spanish-language data.
   Fine-tuning on the same language as the test set yields better results than cross-lingual scenarios. Furthermore, the best performance on MCR is obtained when fine-tuning mDeBERTax on both Italian and Spanish.
   This is not the case for mBERTx and mDeBERTa, whose performance is hindered by the addition of Italian training data. MCR is much narrower in terms of culinary variety, focusing solely on Colombian recipes. GZ, on the other hand, contains not just traditional Italian recipes, but an international range of dishes. This is probably the reason why bilingual training is helpful on GZ but not beneficial on MCR: adding data from a separate locale helps the models generalize more effectively over the more varied GZ, whereas they are thrown off by the addition of out-of-domain data when tested on MCR's narrower domain.
   Comparing the EMT fine-tuning results with the baselines at the bottom of Table 6, we can see that further fine-tuning on EMT does provide a boost compared to training only on XL-WA. Nonetheless, the difference in performance is much greater when testing on GZ than on MCR. When fine-tuned on both Italian and Spanish, mBERTx improves by more than 23 Exact points on GZ, while the gap is just under 16 points on MCR. This effect is even more dramatic for mDeBERTax, with a difference of more than 25 points on GZ, but of only 2.48 points on MCR.
   Compounded with the fact that, in general, the metrics are much higher when testing on MCR than on GZ, this points to MCR being a much less challenging test set. As previously mentioned, part of the dataset is likely machine translated, and since an MT engine is more likely to follow rigidly defined patterns than a human translator, this might play a role in the alignment task being easier on these data.
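For reference, the sketch below shows one way such an exact match can be computed: a predicted span counts as correct only if its boundaries and class both match a gold annotation. This is an illustrative reading of the Exact metric, and the helper names are assumptions rather than the implementation used here.

    # Minimal sketch of exact span matching: a projected entity is correct
    # only if (start, end, class) all match a gold annotation exactly.
    def exact_f1(gold, pred):
        """gold, pred: sets of (start, end, label) span triples."""
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    gold = {(0, 5, "food"), (6, 9, "unit")}
    pred = {(0, 5, "food"), (6, 10, "unit")}  # second span is off by one
    print(exact_f1(gold, pred))  # 0.5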
   Table 7 reports the performance of the best overall models on each of the individual classes, on both GZ and MCR. Giza++ essentially matches mDeBERTa's performance on MCR, which once again points to entities in MCR being easier to identify than those in GZ. However, the similar performance is largely due to mDeBERTa performing poorly on the unit and part classes, for the reasons outlined in Section 4.
  Model             Test set Qty. Unit Food Process Color Phys. q. Taste Purpose Part Macro avg.
  Fast Align          GZ     18.41 30.94 61.95 15.27 33.70 39.00    0.00  25.64  52.48  30.82
  Fast Align         MCR 54.27 71.82 62.73 42.77     66.67 45.68    0.00  58.82  40.00  49.20
  Giza++              GZ     35.21 15.24 77.01 51.91 84.81 71.76   27.03  61.54  63.37  54.21
  Giza++             MCR 90.29 89.31 76.93 76.30     79.76 75.72   50.00  82.35  68.57  76.58
  mBERTx en–it–es     GZ     30.09 24.81 81.66 62.60 67.04 61.41   35.14  94.87  13.86  52.38
  mBERTx en–it–es    MCR 95.30 3.93 89.32 81.72      87.50 77.02 100.00 100.00    9.52  71.59
  mDeBERTax en–it–es GZ      54.95 29.75 83.49 83.21 85.93 87.66 75.68    89.74  14.85  67.25
  mDeBERTa           MCR 97.05 11.25 90.48 93.91     94.32 93.95 100.00   97.06  14.29  76.92
Table 7
Results of the alignment task by class for the best models, using the Exact metric. Best on GZ in bold, best on MCR underlined.


Entity Recognition  Table 8 reports the results for the NER task for all language settings. For each language, we use the aligner models which obtained the highest results on the entity alignment task. Note that, since aligner performance does not significantly improve with increased shuffling (see Section 5), we only train aligner models up to P = 0.2 for the Spanish setting, due to computational constraints.
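A minimal sketch of what shuffling with probability P could look like is given below. The data format and the function are illustrative assumptions; the actual entity-shuffling procedure is the one described in Section 5.

    import random

    # Minimal sketch (an assumption about the mechanics, not the code used
    # here): with probability p, shuffle the order of the annotated
    # entities in an ingredient list before fine-tuning the BERT aligner.
    def maybe_shuffle_entities(entities, p, rng=None):
        """entities: list of (text, label) pairs for one ingredient list."""
        rng = rng or random
        if rng.random() < p:
            entities = entities[:]
            rng.shuffle(entities)
        return entities

    sample = [("2", "quantity"), ("cups", "unit"), ("flour", "food")]
    print(maybe_shuffle_entities(sample, p=0.2))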
   In the Spanish monolingual setting, both BERTes and mBERT obtain F1 scores between 0.92 and 0.95 when fine-tuned on SMT, with the models fine-tuned on EMT trailing behind by 11 to 12 points. As all the models perform similarly and the standard deviation is close to zero, it once again appears that the entities contained in the MCR dataset are not too challenging for either the mono- or the multilingual models to identify.
   In the bilingual fine-tuning scenario, the training data is a concatenation of the SMT datasets produced by the models obtaining the highest performance on the two test sets. Here we only use mBERT, as the monolingual models cannot be fine-tuned appropriately on multilingual data. In this setup, the usefulness of the BERT-based aligners becomes more evident. While performance on MCR is largely similar to the other setups, with all models outperforming the baseline by a large margin, the same cannot be said for mBERT's performance on GZ. Fine-tuning mBERT on the combination of the Italian and Spanish data aligned by Fast Align and Giza++ makes the NER model considerably worse at identifying entities in GZ, with a performance decrease of 20 F1 points with the data created by Fast Align and of 21 F1 points with that created by Giza++. The opposite is true when fine-tuning the mBERT NER model on the SMT data created by mDeBERTa: the model achieves an F1 of 0.94, beating the baseline by 5 points. Compared to the model fine-tuned on data created by Giza++, this represents a 26 F1 point increase in performance.
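A minimal sketch of this concatenation step follows, assuming the SMT datasets are stored as JSON files; the file names are hypothetical, and only the Hugging Face datasets API usage is taken as given.

    # Minimal sketch: concatenate the Italian and Spanish SMT datasets and
    # shuffle them for bilingual mBERT fine-tuning. File names are
    # hypothetical placeholders.
    from datasets import concatenate_datasets, load_dataset

    it_smt = load_dataset("json", data_files="smt_it.json", split="train")
    es_smt = load_dataset("json", data_files="smt_es.json", split="train")
    bilingual_train = concatenate_datasets([it_smt, es_smt]).shuffle(seed=42)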
   As regards the baseline model fine-tuned on TASTEset's English data and tested on MCR's Spanish entities, we can see that, unexpectedly, the model obtains a 0.88 F1 score, outperforming the mBERT (0.83 F1) and BERTes (0.84 F1) models fine-tuned on the monolingual Spanish EMT data. Despite this, fine-tuning on SMT data produced through our alignment approach allows the NER models to beat this 0.88 F1 baseline, reaching scores as high as 0.95 F1, as previously mentioned.

     Train      Test       Aligner      NER        F1
                             –                   0.89±0.01
                           mBERT                 0.91±0.02
                          mDeBERTa    mBERT      0.94±0.01
                          Fast Align             0.84±0.01
                           Giza++                0.87±0.03
      it        it           –                   0.86±0.01
                           mBERT                 0.9±0.04
                          mDeBERTa    BERTit     0.94±0.0
                          Fast Align             0.85±0.04
                           Giza++                0.91±0.03
                             –                   0.83±0.01
                           mBERT                 0.95±0.0
                          mDeBERTa    mBERT      0.92±0.01
                          Fast Align             0.94±0.0
                           Giza++                0.95±0.0
      es        es           –                   0.84±0.0
                           mBERT                 0.95±0.0
                          mDeBERTa    BERTes     0.93±0.01
                          Fast Align             0.95±0.0
                           Giza++                0.95±0.0
                             –                   0.89±0.01
                          Fast Align             0.69±0.01
                it         Giza++     mBERT      0.68±0.03
                          mDeBERTa               0.94±0.01
     it–es                   –                   0.83±0.0
                          Fast Align             0.95±0.0
                es         Giza++     mBERT      0.95±0.0
                          mDeBERTa               0.94±0.01
                it                               0.79±0.05
                es                               0.88±0.01
             en (GZ)                  mBERT      0.9±0.01
      en     en (MCR)        –                   0.93±0.0
             en (GZ)                             0.91±0.01
             en (MCR)                 BERTen     0.93±0.0
Table 8
Entity recognition task F1 scores, macro-averaged over 5 random runs.
   In all three scenarios, mBERT achieves performances comparable to those of the monolingual models. This shows that, when inferring on multilingual corpora to extract entities, a single multilingual model can be used, saving time and computational resources both during training and inference.

B. XL-WA

As additional data for intermediate word-alignment training, we use XL-WA [16], a multilingual word-alignment dataset built from WikiMatrix [48],12 featuring 14 EN–XX language combinations. Its training set is composed of silver labels generated by a statistical model, while the development and test sets are manually annotated. Since XL-WA has a balanced domain distribution and can be considered representative of general language, it is a good resource on which to train a baseline word-alignment model. Table 9 reports statistics for the EN–IT and EN–ES partitions used in this study.

12 https://ai.meta.com/blog/wikimatrix/
                     Sentences       Alignments
       Language     Train Dev       Train   Dev
         en–it      1,002   103     20,525 1,961
         en–es      1,002   105     16,720 1,980
Table 9
Statistics for XL-WA’s EN–IT and EN–ES subsets.



C. Computational Resources
All models are trained on a single NVIDIA RTX 5000 Ada Generation GPU with 32 GB of VRAM. The total training time is around 7–15 minutes for each alignment model, depending on the training data combination, plus 30–60 minutes for training each one on XL-WA. Training each NER model takes around 6–7 minutes. All the training, including the multiple runs used for standard deviation calculation, was carried out in under 48 hours.