Semantic Planning for Multilingual Fiction
                                Generation
                                Nayla Escribano
                                HiTZ Basque Center for Language Technologies - Ixa NLP Group, University of the Basque Country UPV/EHU


                                                                      Abstract
                                                                      Recent approaches to fiction generation make use of generative Large Language Models to create fluent
                                                                      narratives, but they still struggle in other tasks, such as keeping coherence in rather long stories or
                                                                      respecting logical relations between events. Furthermore, most work focuses in English fiction, due
                                                                      to its prevalence in linguistic resources. In this PhD thesis we hypothesise that planning on semantic
                                                                      information may not only help models in such tasks, as stated by previous work, but also allow for fiction
                                                                      generation in other languages as an intermediate representation of stories. To that end, we collected a
                                                                      preliminary version of a dataset of high-quality human-written stories with extensive metadata. This
                                                                      new dataset will be used to test the influence of semantic planning on multilingual fiction generation
                                                                      and improve relevant story attributes such as coherence, logicality or likability in different languages.

                                                                      Keywords
                                                                      Fiction Generation, Multilingualism, Generative Models, Knowledge-based Methods


                                1. Introduction
                                Recent advances pushed by the development of generative Large Language Models (LLMs) make
                                us wonder what are their true capabilities at generating fictional texts in different languages.
                                Thus far, conditioning methods such as control mechanisms or plot-based planning have proven
                                better coherence along generated stories than just prompting state-to-the-art generative LLMs
                                [1, 2, 3, 4, 5]. Furthermore, this kind of conditioning usually aimes at imitating the writing
                                process performed by professional writers or helping them at this task [3, 6, 4, 7], leading us
                                to a more natural way of creating new stories. On the other hand, efforts to improve fiction
                                generation rely on English-written data and, to the best of our knowledge, there are no proposals
                                for multilingual fiction generation.
                                   In this context, we define semantic planning as the creation of plans built upon semantic
                                information (such as events, semantic roles, named entities, and temporal or causal relations)
                                that enables guided surface realisation. Thus, for the fiction generation task, semantic planning
                                should create skeletons based on semantic information to later build narrative texts. Inspired by
                                recent work on cross-lingual transfer [8], we hypothesise that fiction generation in languages
                                other than English can benefit from using semantic planning as multilingual bridges, given the
                                lack of semantic and fiction resources in most languages.

                                Doctoral Symposium on Natural Language Processing from the Proyecto ILENIA, 28 September 2023, Jaén, Spain.
                                Envelope-Open nayla.escribano@ehu.eus (N. Escribano)
                                Orcid 0000-0003-3761-7947 (N. Escribano)
                                                                    © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                 CEUR
                                 Workshop
                                 Proceedings
                                               http://ceur-ws.org
                                               ISSN 1613-0073
                                                                    CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
   The present thesis project addresses the research question How does semantic planning affect
multilingual fiction generation? To perform our experiments we created PromptStories1 , a large
collection of up to 200 k high-quality human-written English prompt-story pairs with extensive
metadata. Table 1 shows an example of a prompt-story pair from PromptStories. This dataset
will allow us to test the capabilities of generative LLMs at creating narrative texts, study the
influence of semantic planning as an intermediate representation of stories and analyse its effect
on multilingual fiction generation.

 Prompt      A group of men burst into your house dressed in what looks like Viking armour. In
             gruff voices, they inform you they are here to serve your dog who they believe is the
             reincarnation of Fenrir. Your dog is a four pound Chihuahua called Mr Wiggles.
     Date    16/3/2019
    Score    10318
    Story    There was a polite knock at the front door.
             Drying my hands, I left the kitchen and slung the towel over my shoulder and opened the
             large inner front door and pushing the frenzied, barking Mr. Wiggles. On the other side
             of the screen door stood twenty or so people in strange armor.
             A tall man in chainmail, furs, and a rounded skullcap stepped forward. He spoke, but I
             did not understand a word he said.
             Mr. Wiggles jumped into view, resuming his wild, frenzied barking.
             They all immediately fell to one knee, crying out a single word in unison. “Fenrir!”
             I looked to them. Then to Mr. Wiggles, who was still barking. I looked back to them. “I
             uh. I don’t want any?” I closed the door. [...]
     Date    17/3/2019
    Score    1704
Table 1
Example of a prompt-story pair from PromptStories. “Date” refers to the posting date and “Score” to
the result of upvotes minus downvotes.


2. Related Work
2.1. Story Datasets
Story collections are useful resources for training and evaluating fiction generation models.
ROCStories [9] gathers 40 k English five-sentence stories full of causal and temporal common-
sense relations from everyday life events to test story understanding, while [10] propose the
CaTeRS scheme to annotate causal and temporal relations in 320 of these stories. Recently,
the validation split of ROCStories has been translated by professionals to 10 other languages
to create the multilingual XStoryCloze dataset, but it contains less than 2 k examples due to
their experimental objectives [11]. [12, 4] prefer using plots from Wikipedia to learn how to
generate coherent stories. However, stories collected in these datasets are not written in a
natural narrative fashion, due to their specific purposes.

    1
      This dataset is still in a preliminary phase. For the moment, a sample of the texts is available at https:
//github.com/ixa-ehu/PromptStories.
   On the other hand, the STORIUM dataset [6] gathers 6 k stories that have been extensively
annotated by collaboratively writing in a gamified framework, whereas StoryWars [7] offers 40 k
stories extracted from another online collaborative storytelling platform to investigate different
understanding and generation tasks. WritingPrompts [1] collects 300 k prompt-story pairs
written in English by reddit-users2 to train and evaluate story generation methods, but lacks
relevant metadata related to the scraped texts that could inform us about the posting date, the
quality perceived by users, and so on. Although [13] do consider such metadata to design a new
story evaluation method, it was not possible to retrieve their dataset from the corresponding
sites3 . For this reason, we propose PromptStories as an extension of WritingPrompts that
includes such information, gathering up to 200 k new prompt-story pairs.
   Moreover, XStoryCloze is the only one of the previous datasets to present multilingual parallel
stories, but this data is too scarce for our purposes. Although there exist multiple collections of
fictional texts in several languages, these are usually too diverse in style, length and language
coverage. PromptStories, on the contrary, contains a large amount of human-written short
stories that we plan to translate from English to Spanish and Basque primarily, thus creating a
large parallel story dataset.

2.2. Fiction Generation
Plot planning has been used since early attempts to fiction generation, such as the novel writer
from [14], TALE-SPIN [15] or UNIVERSE [16]. These first approaches relied on hand-crafted
rules to build plots from closed worlds of possible events. In order to overcome the limitations of
manually created worlds, later work focused on statistical ways to extract possible events from
existing stories [17, 18]. Nonetheless, the recent development of neural networks has motivated
new methods to design plots for automatic fiction generation. These approaches usually make
use of control mechanisms [19, 20, 1, 3, 4, 21, 22] and/or more fine-grained knowledge-based
plot planning [12, 2, 6].
   [1] test the capabilities of hierarchical control to maintain the relevance of generated stories
to their corresponding prompts on the WritingPrompts dataset. In a later experiment, [2]
show that applying Semantic Role Labelling (SRL) and coreference-based entity anonymization
to decompose stories into action sequences and entity mentions improves the diversity and
coherence of generated events and entities. In this work we propose to use the last approach,
where event-based plot planning not only helps keeping coherence, but also allows for generating
stories in different languages by translating the original stories and projecting their annotations.

2.3. Story Evaluation
As an open-ended generation task, the evaluation of human or automatic stories remains
unsolved. Referenced metrics do not capture the complex characteristics of creative generation,
where several outputs may correspond to a single input and many specific attributes could be
considered (fluency, coherence, logicality, creativity, likability, etc.) [1, 6, 23, 24, 13]. To address
this problem, some recent works try to evaluate stories using new unreferenced metrics, by
    2
    https://www.reddit.com/r/WritingPrompts/
    3
    We did neither receive an answer to our request.
selecting the appropriate story among machine-generated negative samples [23, 24] or assessing
general quality after rating automatically created comments on concrete aspects about the story
[13]. These techniques are backed by alleged correlation with human judgements, which are
still the most reliable but expensive evaluation method. In this project we plan to test the last
existing evaluation approaches and compare them with in-house human evaluations to later
use them in our experiments.


3. Research Proposal
The Main Research Question (MRQ) of this thesis project is the following one: How does
semantic planning affect multilingual fiction generation? To articulate the project work, we
can divide this MRQ in smaller Research Questions (RQs). RQs present the same experimental
structure (prepare the dataset, train models and evaluate them) and focus on different evaluation
objectives.

3.1. RQ1: Are current models able to create good stories?
Testing the capabilities of current generative LLMs seems to be a proper initial step, given that
they are widely used for story generation either in storytelling systems with control mechanisms
or to create the surface realisation in those with knowledge-based plot planning. Indeed, RQ1
tackles the fiction generation part in MRQ. To answer this research question, we reduce fiction
generation to short story generation (ranging ∼100 to ∼3000 words in length) as it constitutes
a more manageable framework than working with larger fiction. On the other hand, we let
humans answer the difficult question of what is a good story? by collecting a dataset of stories
rated by their own readers. Thus, this dataset will allow us to evaluate the performance of
state-of-the-art generative LLMs by comparing their zero-shot and fine-tuned results on a
manual evaluation to check whether fine-tuning on stories rated as good by readers improves
the story generation capabilities of these models.

3.2. RQ2: Does semantic planning improve specific story attributes?
Several fiction generation systems use knowledge-based methods to improve different attributes
like intra-story coherence or logicality between events, among others. RQ2 mostly involves
the semantic planning part of MRQ. For this reason, we will test different event representation
schemes on a small sample of our dataset to find the most appropriate for story semantic
planning, and we will use this scheme to annotate our prompt-story pairs. Then, we will
prepare a human evaluation to compare stories generated by prompting our fine-tuned model
from RQ1 and those from a model fine-tuned on our semantically-annotated dataset. Participants
will be asked to evaluate specific story attributes, such as language fluency, intra-story coherence,
logicality between events, creativity and relevance based on the prompt or likability from the
user perspective.
3.3. RQ3: May this semantic planning help generating stories in languages
     other than English?
Finally, RQ3 involves the multilingual part of MRQ. Given the lack of parallel semantically-
annotated corpora of short stories and the weak performance of semantic labelling in other
languages (specially for under-resourced ones), we plan to translate our original English prompt-
story pairs and project their annotations from RQ2 to the target languages. Because of evaluation
availability, we wish to apply this translation and projection to Spanish first and, if feasible, to
Basque. Similarly to the previous research questions, we will fine-tune multilingual generative
LLMs both on raw and semantically annotated stories, and we will compare the human evalua-
tions of these new models with the zero-shot setting. Furthermore, we will also analyse the
accumulated error from this annotate-translate-and-project method.


4. Experimental Setup
4.1. Data Collection
We collected the data in PromptStories from the WritingPrompts subreddit as in [1] and filtered
out undesired texts such as removed or moderator posts. Our preliminary dataset contains
200 k prompt-story pairs from 2019 to the beginning of 2023 along with relevant metadata like
creation date, score received by users and so on, which allow us to select the best stories. The
dataset has been split in train, dev and test sets (80%, 10% and 10% of prompt-story pairs each),
but we may prepare smaller subdatasets to experiment on stories with a minimum score, the
best 𝑛 stories per a unique prompt, and so on.

4.2. Annotation, Translation and Projection
Before starting experiments on multilingual fiction generation, we need to create a common
frame for all the languages that we wish to study. To that end, we will annotate our English
dataset with semantic information in order to build event plans that represent stories as in [2].
Then, we will automatically translate our prompt-story pairs to the desired languages and use
word-aligning projections to project those events to the translated texts. This annotate-translate-
and-project techique has proven to be a solid method for cross-lingual sequence-labelling tasks
in zero-resource settings and facilitates the annotation of under-resourced languages [25, 26, 8].
   We wish to test different state-of-the-art models for each of these three subtasks by comparing
their performance on a human evaluation of a small sample of prompt-story pairs. These are
the considered models4 :

    • Event extraction: AllenNLP SRL [27] and AllenNLP SRL_BERT [28].
    • Translation: M2M100 [29], NLLB200 [30] and DeepL5 .
    • Projection: SimAlign [25] and AWSoME [26].

    4
      For very large models such as M2M100 and NLLB200, we will compare the most appropriate versions according
to our processing capabilities.
    5
      https://www.deepl.com/
4.3. Fiction Generation
Once we have created our semantically annotated parallel story dataset, we can proceed to study
our specific research questions. We will first test the performance of different state-of-the-art
generative LLMs on the RQ1 zero-shot setting and choose the most appropriate one for story
generation. The selected model will be used for analysing RQs 1-3 by fine-tuning it on raw and
annotated prompt-story pairs for each of the studied languages. Following recent progress in
developing generative LLMs, for this task we will consider LLaMA [31] or similar models and
specific ones designed for storytelling, such as MPT-StoryWriter [32].


5. Conclusion
We present this thesis project on studying how semantic planning affects multilingual fiction
generation. To that end, we propose an analysis frame based on previous work that also explores
new tasks such as using planning on semantic information to act as a bridge at creating stories
in languages other than English. We explain our future experiments to investigate the main
research question and present a new dataset under development for these experiments.


Acknowledgments
Nayla Escribano is funded by the Basque Government PhD grant “Programa Predoctoral de
Formación de Personal Investigador No Doctor del Departamento de Educación del Gobierno
Vasco”.


References
 [1] A. Fan, M. Lewis, Y. Dauphin, Hierarchical neural story generation, in: Proceedings of
     the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1:
     Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp.
     889–898. URL: https://aclanthology.org/P18-1082. doi:10.18653/v1/P18- 1082 .
 [2] A. Fan, M. Lewis, Y. Dauphin, Strategies for structuring story generation, in: Proceedings of
     the 57th Annual Meeting of the Association for Computational Linguistics, Association for
     Computational Linguistics, Florence, Italy, 2019, pp. 2650–2660. URL: https://aclanthology.
     org/P19-1254. doi:10.18653/v1/P19- 1254 .
 [3] D. Ippolito, D. Grangier, C. Callison-Burch, D. Eck, Unsupervised hierarchical story
     infilling, in: Proceedings of the First Workshop on Narrative Understanding, Association
     for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 37–43. URL: https:
     //aclanthology.org/W19-2405. doi:10.18653/v1/W19- 2405 .
 [4] H. Rashkin, A. Celikyilmaz, Y. Choi, J. Gao, PlotMachines: Outline-conditioned generation
     with dynamic plot state tracking, in: Proceedings of the 2020 Conference on Empirical
     Methods in Natural Language Processing (EMNLP), Association for Computational Lin-
     guistics, Online, 2020, pp. 4274–4295. URL: https://aclanthology.org/2020.emnlp-main.349.
     doi:10.18653/v1/2020.emnlp- main.349 .
 [5] L. J. Martin, Neurosymbolic Automated Story Generation (Thesis Dissertation), Georgia
     Institute of Technology, 2021. URL: http://hdl.handle.net/1853/64643.
 [6] N. Akoury, S. Wang, J. Whiting, S. Hood, N. Peng, M. Iyyer, STORIUM: A Dataset and
     Evaluation Platform for Machine-in-the-Loop Story Generation, in: Proceedings of the 2020
     Conference on Empirical Methods in Natural Language Processing (EMNLP), Association
     for Computational Linguistics, Online, 2020, pp. 6470–6484. URL: https://aclanthology.org/
     2020.emnlp-main.525. doi:10.18653/v1/2020.emnlp- main.525 .
 [7] Y. Du, L. Chilton, Storywars: A dataset and instruction tuning baselines for collaborative
     story understanding and generation, 2023. arXiv:2305.08152 .
 [8] I. García-Ferrero, R. Agerri, G. Rigau, Model and data transfer for cross-lingual sequence
     labelling in zero-resource settings, in: Findings of the Association for Computational
     Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United
     Arab Emirates, 2022, pp. 6403–6416. URL: https://aclanthology.org/2022.findings-emnlp.
     478.
 [9] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, J. Allen,
     A corpus and cloze evaluation for deeper understanding of commonsense stories, in: Pro-
     ceedings of the 2016 Conference of the North American Chapter of the Association for Com-
     putational Linguistics: Human Language Technologies, Association for Computational Lin-
     guistics, San Diego, California, 2016, pp. 839–849. URL: https://aclanthology.org/N16-1098.
     doi:10.18653/v1/N16- 1098 .
[10] N. Mostafazadeh, A. Grealish, N. Chambers, J. Allen, L. Vanderwende, CaTeRS: Causal and
     temporal relation scheme for semantic annotation of event structures, in: Proceedings of
     the Fourth Workshop on Events, Association for Computational Linguistics, San Diego,
     California, 2016, pp. 51–61. URL: https://aclanthology.org/W16-1007. doi:10.18653/v1/
     W16- 1007 .
[11] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale,
     J. Du, R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary, B. O’Horo, J. Wang, L. Zettlemoyer,
     Z. Kozareva, M. Diab, V. Stoyanov, X. Li, Few-shot learning with multilingual generative
     language models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural
     Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab
     Emirates, 2022, pp. 9019–9052. URL: https://aclanthology.org/2022.emnlp-main.616.
[12] L. Martin, P. Ammanabrolu, X. Wang, W. Hancock, S. Singh, B. Harrison, M. Riedl, Event
     representations for automated story generation with deep neural nets, Proceedings of the
     AAAI Conference on Artificial Intelligence 32 (2018). URL: https://ojs.aaai.org/index.php/
     AAAI/article/view/11430. doi:10.1609/aaai.v32i1.11430 .
[13] H. Chen, D. Vo, H. Takamura, Y. Miyao, H. Nakayama, StoryER: Automatic story eval-
     uation via ranking, rating and reasoning, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.),
     Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,
     Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp.
     1739–1753. URL: https://aclanthology.org/2022.emnlp-main.114. doi:10.18653/v1/2022.
     emnlp- main.114 .
[14] S. Klein, J. F. Aeschlimann, D. F. Balsiger, S. L. Converse, C. Court, M. Foster, R. Lao, J. D.
     Oakley, J. Smith, Automatic novel writing: A status report, 1973. URL: http://digital.library.
     wisc.edu/1793/57816.
[15] J. R. Meehan, Tale-spin, an interactive program that writes stories, in: Proceedings of the
     5th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’77, Morgan
     Kaufmann Publishers Inc., San Francisco, CA, USA, 1977, p. 91–98.
[16] M. Lebowitz, Story-telling as planning and learning, Poetics 14 (1985) 483–502. doi:10.
     1016/0304- 422X(85)90015- 4 .
[17] N. McIntyre, M. Lapata, Learning to tell tales: A data-driven approach to story generation,
     in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and
     the 4th International Joint Conference on Natural Language Processing of the AFNLP,
     Association for Computational Linguistics, Suntec, Singapore, 2009, pp. 217–225. URL:
     https://aclanthology.org/P09-1025.
[18] M. O. Riedl, Story planning: Creativity through exploration, retrieval, and analogical
     transformation, Minds & Machines 20 (2010) 589––614. doi:https://doi.org/10.1007/
     s11023- 010- 9210- 2 .
[19] N. Peng, M. Ghazvininejad, J. May, K. Knight, Towards controllable story generation, in:
     Proceedings of the First Workshop on Storytelling, Association for Computational Linguis-
     tics, New Orleans, Louisiana, 2018, pp. 43–49. URL: https://aclanthology.org/W18-1505.
     doi:10.18653/v1/W18- 1505 .
[20] J. Xu, X. Ren, Y. Zhang, Q. Zeng, X. Cai, X. Sun, A skeleton-based model for pro-
     moting coherence among sentences in narrative story generation, in: Proceedings
     of the 2018 Conference on Empirical Methods in Natural Language Processing, Asso-
     ciation for Computational Linguistics, Brussels, Belgium, 2018, pp. 4306–4315. URL:
     https://aclanthology.org/D18-1462. doi:10.18653/v1/D18- 1462 .
[21] K. Yang, Y. Tian, N. Peng, D. Klein, Re3: Generating longer stories with recursive reprompt-
     ing and revision, in: Proceedings of the 2022 Conference on Empirical Methods in Natural
     Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab
     Emirates, 2022, pp. 4393–4479. URL: https://aclanthology.org/2022.emnlp-main.296.
[22] K. Yang, D. Klein, N. Peng, Y. Tian, Doc: Improving long story coherence with detailed
     outline control, 2023. arXiv:2212.10077 .
[23] J. Guan, M. Huang, UNION: An Unreferenced Metric for Evaluating Open-ended Story
     Generation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Confer-
     ence on Empirical Methods in Natural Language Processing (EMNLP), Association for
     Computational Linguistics, Online, 2020, pp. 9157–9166. URL: https://aclanthology.org/
     2020.emnlp-main.736. doi:10.18653/v1/2020.emnlp- main.736 .
[24] S. Ghazarian, Z. Liu, A. S M, R. Weischedel, A. Galstyan, N. Peng, Plot-guided adversarial
     example construction for evaluating open-domain story generation, in: K. Toutanova,
     A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell,
     T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North Ameri-
     can Chapter of the Association for Computational Linguistics: Human Language Tech-
     nologies, Association for Computational Linguistics, Online, 2021, pp. 4334–4344. URL:
     https://aclanthology.org/2021.naacl-main.343. doi:10.18653/v1/2021.naacl- main.343 .
[25] M. Jalili Sabet, P. Dufter, F. Yvon, H. Schütze, SimAlign: High quality word alignments
     without parallel training data using static and contextualized embeddings, in: Findings
     of the Association for Computational Linguistics: EMNLP 2020, Association for Com-
     putational Linguistics, Online, 2020, pp. 1627–1643. URL: https://aclanthology.org/2020.
     findings-emnlp.147. doi:10.18653/v1/2020.findings- emnlp.147 .
[26] Z.-Y. Dou, G. Neubig, Word alignment by fine-tuning embeddings on parallel corpora,
     in: Proceedings of the 16th Conference of the European Chapter of the Association for
     Computational Linguistics: Main Volume, Association for Computational Linguistics,
     Online, 2021, pp. 2112–2128. URL: https://aclanthology.org/2021.eacl-main.181. doi:10.
     18653/v1/2021.eacl- main.181 .
[27] L. He, K. Lee, M. Lewis, L. Zettlemoyer, Deep semantic role labeling: What works
     and what’s next, in: Proceedings of the 55th Annual Meeting of the Association for
     Computational Linguistics (Volume 1: Long Papers), Association for Computational Lin-
     guistics, Vancouver, Canada, 2017, pp. 473–483. URL: https://aclanthology.org/P17-1044.
     doi:10.18653/v1/P17- 1044 .
[28] P. Shi, J. Lin, Simple bert models for relation extraction and semantic role labeling, 2019.
     arXiv:1904.05255 .
[29] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wen-
     zek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, A. Joulin,
     Beyond english-centric multilingual machine translation, 2020. arXiv:2010.11125 .
[30] N. Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan,
     E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood,
     B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan,
     D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao,
     V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk,
     J. Wang, No language left behind: Scaling human-centered machine translation, 2022.
     arXiv:2207.04672 .
[31] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière,
     N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open
     and efficient foundation language models, 2023. arXiv:2302.13971 .
[32] M. N. Team, Introducing mpt-7b: A new standard for open-source, commercially usable
     llms, 2023. URL: www.mosaicml.com/blog/mpt-7b.