
That branch of the Lake of Como...: Developing a New Resource for the Analysis of I Promessi Sposi and its Historical Translations

Rachele Sprugnoli

Marco Sartor

Università di Parma, Viale D'Azeglio, 85, 43125 Parma, Italy

This paper presents a directional parallel corpus of the Ventisettana, that is, the version of I Promessi Sposi published by Manzoni in 1827, aligned at sentence level with the anonymous English translation published in London in 1834 by Richard Bentley. After describing the procedure followed to create the resource and analyzing the results of the manual alignment, we use the corpus as a gold standard to evaluate the Bertalign automatic aligner. This new linguistic resource can benefit the research community, in particular in the fields of the history of literature and translation studies, and can be useful for developing new automatic tools that handle the peculiarities of historical literary texts.

Keywords: parallel corpus, sentence alignment, translation, digital humanities

1. Introduction

Critics have established how the Promessi sposi immediately enjoyed a wide resonance in Europe and in the US, although the success of the work outside Italy has not always been accompanied by an effective understanding of the author's thought.¹ For this reason, the development of new linguistic resources based on the first historical translations of the novel assumes particular importance. These resources will benefit the research of historians of the Italian language, but also the development of new automatic tools suitable for processing historical literary texts. Last but not least, they can be used for educational purposes, both in secondary school, for the study of Manzoni's texts and their circulation beyond national borders, and at university level, in the field of translation studies.

In particular, in this contribution we present a parallel corpus of the so-called Ventisettana, that is the version of the novel published by Manzoni in 1827, aligned with the anonymous English translation published in London in 1834 by Richard Bentley. An analysis of these two texts and of other English translations of the 19th century is provided by Intonti and Mallardi [5]: the volume is accompanied by examples of alignments to show specific cases both at the level of sentences, such as cuts and additions, and at the level of words, such as the rendition of figurative expressions and proverbs. The creation of a more extensive resource, such as the one presented here, aims to test the feasibility of a procedure to be applied in the future also to other historical translations, so as to offer the possibility of extending the range of linguistic analysis. Furthermore, our parallel corpus is a gold standard for evaluating fully automatic algorithms in a complex setting due to the peculiarities of historical texts and historical translations. Indeed, the complexity is due both to the characteristics of Manzoni's novel (rich, among other things, in irony, dialectal expressions, dialogues and monologues) and to the fact that during the 19th century translations did not aim to guarantee the greatest possible fidelity towards the source text, but rather to bend it in the light of the historical-cultural context in which they were implemented [6]. This approach to translation causes the original text to be revised and changed through additions and omissions of even entire chapters, making it a challenge to automate the alignment process.

CLiC-it 2023: 9th Italian Conference on Computational Linguistics, Nov 30 – Dec 02, 2023, Venice, Italy
* Corresponding author.
† This paper is the result of the collaboration between the two authors. For the specific concerns of the Italian academic attribution system: Rachele Sprugnoli is responsible for Sections 2, 3.2, and 4; Marco Sartor is responsible for Section 3.1. Sections 1 and 5 were collaboratively written by Rachele Sprugnoli and Marco Sartor.
Email: rachele.sprugnoli@unipr.it (R. Sprugnoli); marco.sartor@unipr.it (M. Sartor)
ORCID: 0000-0001-6861-5595 (R. Sprugnoli); 0000-0002-1176-2735 (M. Sartor)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

¹ A large number of blunders and mistakes made in the translations has been reported in [1] and [2]. For Manzoni's popularity outside Italy, see [3] and the references listed in [4].

2. Related Work

A parallel corpus is made of a set of texts in a given source language aligned with their translations in one or more target languages. The alignment, that is the identification of corresponding text units in parallel texts, can be performed at paragraph, sentence or word level. When the translation direction is known (i.e. when the source and target languages are clearly stated) and when the translation is direct (i.e. not mediated by an intermediary language), the parallel corpus is defined as directional [7].

The development of large parallel corpora, both bilingual and multilingual, took off in the 1990s, but their growth in terms of number of texts and languages covered is more recent, thanks to initiatives such as the OPUS project [8] and those promoted by the European Commission [9]. The great attention given to this type of corpora is due to the fact that parallel corpora are useful to gain insights into interlinguistic phenomena; at the same time they are a rich source of materials for language teaching, translation studies, lexicography, and a fundamental resource for terminology extraction and machine translation systems.

Since manual alignment is a particularly time-consuming process, various automatic techniques have been proposed over the years [10]. Specifically, with regard to sentence-level alignment, early approaches are based on sentence length in terms of number of words or characters. The idea behind this method is that long sentences in the source text are translated with long sentences, while short sentences are translated using short sentences [11, 12]. Lexical matching methods using bilingual dictionaries (such as in the hunalign system [13]) or specific tokens (such as dates, proper nouns, punctuation) as anchors for the alignment [14] are also worth mentioning. On the other hand, MT-based approaches require the source text to be automatically translated into the target language and use a similarity score (e.g. the BLEU metric) to align the machine translation output with the target text sentences; an example of this kind of method is given by Bleualign [15]. The most recent systems, however, are those based on multilingual sentence embeddings, such as Vecalign [16], or sentence-transformers, such as Bertalign [17]. Such approaches have been tested on literary texts, obtaining good performance [18].²

In this paper we present a manually created bilingual (IT-EN) directional parallel corpus of historical literary texts together with the evaluation of automatic sentence alignment methods. Dealing with literary texts written in non-contemporary language is particularly interesting and not so widespread; suffice it to say that the CLARIN infrastructure gives access to 87 parallel corpora:³ out of these, only 5 include texts in Italian, but none contain works by Manzoni or historical literary translations.

² Results obtained on literary and non-literary texts using various methods, including the Vecalign and Bertalign systems, are reported in https://github.com/bfsujason/aligner-eval.
³ https://www.clarin.eu/resource-families/parallel-corpora.

3.1. Creation

The digital text of the Ventisettana was provided by the Italian project (PRIN 2017) ManzoniOnline2: new documents, translations and tradition [19],⁴ whereas the text of the 1834 English translation was downloaded from the Gutenberg project website as a UTF-8 text file.⁵ Both texts have been divided into chapters; for each of them the sentence-level alignment was completed semi-automatically, with manual correction of the output of the aligner. In the initial phase of our work we tested various tools which include graphical user interfaces for editing the automatic alignment, such as TAligner 3.0 [20], LF Aligner⁶ and InterText. More specifically, as stated in [21], extensive trials were conducted with LF Aligner, before the final choice fell on InterText because of the intuitiveness of its interface and the possibility of exporting in various formats [22].

Each chapter was loaded onto InterText in a separate file with one sentence per line. Sentence splitting was done manually: we tried various sentence splitting models but always obtained low performance due to the peculiarity of the novel's punctuation and to an unconventional use of capital letters. Among the Universal Dependencies (UD) 2.10 models available in UDPipe [23], the best result was obtained with VIT, with an accuracy of 39%. A better, but far from perfect, accuracy (64%) was registered with Stanza [24]. Overall, it can be remarked that automatic sentence splitting fails especially (but not exclusively) with punctuation marks that are no longer in use or with traditional punctuation marks employed in unusual contexts compared to today's custom. In particular, the use of hyphens – short and long – with different functions is very frequent in the 1834 English translation. Normally, the latter separate one sentence from the other, mostly marking the end of a direct speech,⁷ while the former convey a character's inner thoughts, include an aside, render a hesitation in direct speech or mark a pause of medium intensity without giving rise to a new sentence.⁸ Automatic splitting also displays glitches when dealing with inverted commas marking the start of a direct speech, the three suspension dots, and exclamation or question marks followed by a lower-case letter (which do not start a new sentence but denote a single flow of text). At the end of the manual sentence splitting procedure, we obtained 8,718 sentences for the Ventisettana and 7,484 sentences for the English translation.

⁴ https://www.alessandromanzoni.org/
⁵ https://www.gutenberg.org/ebooks/35155.
⁶ https://sourceforge.net/projects/aligner/.
⁷ For instance: "But, fair sirs, you are too just, too reasonable—-" "But," interrupted the other comrade... (from chapter 1).
⁸ Here are some examples, all taken from chapter 1: "for if you do, ehem!–you understand–the consequences would be the same as if you performed the marriage ceremony"; "the poor curate neither meddles nor makes–they settle their affairs amongst themselves, and then–then, they come to us, as if to redeem a pledge; and we– we are the servants of the public"; "but he will require reasons–and what can I say to him"; "... and he arose, continuing–"No! I'll take nothing, nothing?".
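One of the failure modes described for automatic splitting, an exclamation or question mark followed by a lower-case letter, can be illustrated with a small sketch. This is not the procedure used for the corpus (where splitting was manual) and it does not reproduce UDPipe or Stanza; the sample string and both function names are invented for illustration only.

```python
import re

# Invented sample: the exclamation mark is followed by a lower-case letter,
# so "Alas! what a world." is meant to be read as a single sentence.
SAMPLE = "Alas! what a world. He went away."


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def guarded_split(text):
    """Split only when the next character is an upper-case letter or an
    opening double quote, so '!' or '?' before a lower-case letter is kept
    inside the sentence."""
    return re.split(r'(?<=[.!?])\s+(?=["A-Z])', text)


if __name__ == "__main__":
    print(naive_split(SAMPLE))
    print(guarded_split(SAMPLE))
```

The guarded rule is only a heuristic: it would still mishandle dashes that close a direct speech and abbreviations followed by a capital letter, which is consistent with the choice of manual splitting for this corpus.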

In the following phase, we manually corrected the automatic alignment produced by the hunalign system integrated in InterText. On average, 3 hours of work were required to validate each chapter. Texts were then exported in three files: each chapter was saved as two independent XML files (one for the Italian text and one for the English translation) and their alignment was exported as a separate XML file containing pointers to the individual sentences of the two texts.

3.2. Analysis
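As a starting point for analysing such an export, the sketch below tallies alignments by type from an alignment file. The XML layout is an assumption, not taken from the corpus: each alignment is taken to be a `link` element whose `xtargets` attribute holds space-separated sentence ids for the Italian side, a semicolon, and then the ids for the English side. The sample data and the function name are hypothetical, and the real files may be organised differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical alignment file content (assumed layout, invented ids).
SAMPLE = """<linkGrp>
  <link xtargets="1:1;1:1"/>
  <link xtargets="1:2 1:3;1:2"/>
  <link xtargets="1:4;"/>
  <link xtargets="1:5;1:3 1:4"/>
</linkGrp>"""


def alignment_types(xml_text):
    """Count alignments per type, written as 'source:target' sentence counts."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for link in root.iter("link"):
        # Assumed convention: ids before ';' are Italian, ids after are English.
        source_ids, target_ids = link.get("xtargets").split(";")
        n_source = len(source_ids.split())  # an empty side yields 0
        n_target = len(target_ids.split())
        counts[f"{n_source}:{n_target}"] += 1
    return counts


if __name__ == "__main__":
    for kind, n in sorted(alignment_types(SAMPLE).items()):
        print(kind, n)
```

With this convention, an empty English side is counted as a 1:0 alignment, and two Italian ids against one English id as 2:1.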

The alignments produced can be categorized into the following types:

• 1:1, i.e. one sentence is translated by one sentence. It should be noted that such a correspondence does not necessarily indicate total fidelity, or a linear (or even literal) translation of the sub-sentence units. While respecting the boundaries of the sentence, in fact, there could be phenomena of expansion or synthesis. For example, in chapter VIII, a long sentence, containing a simile that emphasizes the animal nature of the Bravi (hired assassins) while describing how their leader gathered them in a courtyard, is strongly synthesized by removing the rhetorical figure altogether.

  – Ventisettana: Come il cane che scorta un gregge di porci corre or qua or là a quei che si sbandano, ne addenta uno per un'orecchia e lo tira in ischiera, ne spinge un altro col muso, abbaia ad un altro che esce di fila in quel momento, così il pellegrino acciuffa uno di coloro che già toccava la soglia e lo strappa indietro, caccia indietro col bordone uno e un altro che v'eran già presso, grida agli altri che scorrazzano senza saper dove, tanto che li raccozzò tutti nel mezzo del cortiletto.⁹
  – 1834 English translation: He succeeded, however, in assembling them in the middle of the court-yard.

• 1:0 and 0:1, i.e. a sentence in the Ventisettana or in the translation lacks a parallel in the other text, following an omission (type 1:0) or an addition by the translator (type 0:1). Omissions are part of a wider trend in the historical translations of Manzoni's novel to significantly cut sentences that were considered not essential for understanding the text. This is aimed at giving the translation a drier and more pragmatic tone than the original, in line with the prevailing fashions in the literary context of reception; such an approach is consistent with the so-called domestication strategy of translations [25].

• 1:N and N:1, i.e. the translator has split or merged the original sentences. When one Italian sentence is split into two or more sentences the alignment is 1:N. When, on the contrary, two or more Italian sentences are merged into a single sentence in the translation the alignment is N:1.

Table 1 provides examples, taken from chapter VIII, of the aforementioned types, while Figure 1 shows how the same alignments are displayed in the InterText interface. In addition, Table 2 presents the number of alignments per type. The vast majority of alignments are 1:1 (66%), but there are also several omissions in the translation (1:0, 14%), followed by cases of 2:1 merging (9%) and 1:2 splitting (8%). Under the "Other" category we collect the types having a number of occurrences less than 1% (i.e. 0:1, 4:1, 1:4, 5:1, 6:1, 3:2, 1:5). It is important to notice that our resource includes a few cases of cross-order alignments in which the translator has changed the order of the sentences in the translation so that, to create the alignment, it is necessary to move sentences out of their original position (which is possible with InterText). Cross-order alignments fall into the types described above: for example, Figure 2 shows a cross-order alignment, taken from chapter XXXVI, which generates a 1:1 match between the source and the target sentences.

⁹ English literal translation: Like the dog that escorts a herd of pigs, he runs here and there among those who are straying, he bites one by the ear and puts him in line, he pushes another with his muzzle, he barks at another who leaves the line at that moment, so the pilgrim grabs one of those who were already on the doorstep and snatches him back, he drives one and another who was nearby back with his stick, he shouts to the others who are running around without knowing where, so much so that he gathered them all in the middle of the little courtyard.

4. Testing Automatic Alignment Methods

The parallel corpus described in the previous section has been used as a gold standard for testing the performance of Bertalign, an automatic aligner that uses LaBSE (language-agnostic BERT sentence embeddings, [26]) for building cross-lingual embeddings of source and target sentences.¹⁰ As reported by Liu and Zhu [17], Bertalign is designed with the aim of dealing with non-1-to-1 sentence pairs that are quite common in literary texts. The comparative evaluation carried out on literary texts considering the English-Chinese translation pair showed that Bertalign is able to outperform other (length-based, dictionary-based, MT-based and embedding-based) aligners.

We configured Bertalign with the following options:

• maximum alignment types (max_align): 6
• k nearest target neighbors of each source sentence (top_k): 3
• search window (win): 5
• similarity score for 1:0 and 0:1 alignments (skip): 0
• modified cosine similarity as proposed in [17] (margin): True
• length difference between source and target sentences (len_penalty): False
• sentence splitting (is_split): True

With respect to the default configuration, we increased the maximum alignment length (i.e. the max_align option) from 5 to 6 because our corpus has many complex alignments, that is various types of 1:N and N:1 alignments. We also set a larger value for the similarity score (i.e. the skip option) because our corpus contains many omissions and insertions. Given that we have several cases of expansion or synthesis even in 1:1 alignments, the len_penalty parameter is set to False: in this way the length difference between source and target sentences is not taken into consideration when calculating the similarity between sentence pairs. Conversely, the is_split option is set to True because our corpus was already split into sentences.

Table 3 reports the results of our evaluation using both Bertalign (with the default configuration, Bertalign_d, and with our custom options, Bertalign_c) and the Galechurch length-based algorithm. The superiority of the embedding-based approach over the length-based one is evident: the former outperforms the latter by 5 F1 points. The custom configuration further improves Bertalign's performance in terms of both precision and recall. However, the results are slightly lower than those recorded on the English-Chinese pair: indeed, for the MAC corpus of literary texts a precision of 0.906, a recall of 0.912 and an F1 of 0.909 are reported.¹¹

Figure 3 displays F1 performance across the chapters of the novel. The variation between individual chapters is not great, with an average F1 of 0.879. However, a drop can be noted in the range between chapters 31 and 35, which describe the plague in Milan with numerous historical digressions, often not translated. In particular, chapters 31 and 32 of the original text are merged into a single chapter in the translation, in which there is a high number of omissions covering 33% of all the alignments. In addition, that group of chapters includes cross-order alignments that are not correctly handled by Bertalign. By contrast, the best F1 (0.896) is found for chapter 25, in which 1:1 alignments, the simplest type, are 73% of the total.

¹⁰ https://github.com/bfsujason/bertalign
¹¹ https://github.com/bfsujason/aligner-eval.

5. Conclusion and Future Work

This paper described the creation of a parallel corpus aligned at sentence level made of the whole text of the Ventisettana, that is the version of the novel published by Manzoni in 1827, and the 1834 anonymous English translation. This resource is made available on GitHub in XML format¹² and will also be uploaded to the ILC4CLARIN repository. The whole aligned corpus has been used as a gold standard for evaluating Bertalign, an embedding-based automatic sentence aligner. Results obtained with a custom setting of the parameters are compared to the ones achieved with the default options and with a length-based algorithm (Galechurch), showing very good performance, with an F1 slightly below 0.9.

The activity presented here served as a laboratory for future experiments which will concern the other editions of the novel and the main translations into neo-Romance languages. In particular, a sentence-level alignment of chapter VIII is underway, taking into account the largest possible number of available English translations and also considering, thanks to an agreement with the translator, the very recent American translation of the novel [27]. The choice of maintaining the sentence unity in the Italian text will facilitate the comparison between different translations and, consequently, investigations on the choices made by the translators in a diachronic perspective.

The alignment at the word level of some chapters of the Ventisettana with the English edition of 1834, already adopted for the sentence-level alignment, is also in progress. In this case, the alignment is done using Ugarit [28].¹³ Unlike what has been done in other projects [29], in our project the aim of the alignment does not concern the creation of a translation memory for machine translation purposes, but the analysis of the choices made by the translator: for this reason, the alignment is performed considering punctuation and also between linguistic elements whose literal correspondence is rather fuzzy. This choice makes it possible to highlight oversights, errors and singular innovations of the translator. The output of our manual alignment will be used to evaluate automatic approaches, such as fast_align¹⁴ and AWESOME¹⁵.

¹² https://github.com/RacheleSprugnoli/Sentence_Alignment_Manzoni.
¹³ https://ugarit.ialigner.com.
¹⁴ https://github.com/clab/fast_align.
¹⁵ https://github.com/neulab/awesome-align.

Acknowledgments

This publication was produced by a researcher with a research contract co-funded by the European Union - PON Ricerca e Innovazione 2014-2020, pursuant to Art. 24, paragraph 3, letter a, of Law No. 240 of 30 December 2010, as amended, and to Ministerial Decree No. 1062 of 10 August 2021. This research was also funded by the Università degli Studi di Parma through the action Bando di Ateneo 2022 per la ricerca, co-funded by MUR (Italian Ministry of University and Research) - D.M. 737/2021 - PNR - PNRR - NextGenerationEU.

References

[1] P. Bellezza, Il Manzoni all'estero, in: Curiosità manzoniane, Vallardi, Milano, 1923, pp. 57–73.
[2] P. Bellezza, Attraverso le traduzioni dei "Promessi sposi", in: Curiosità manzoniane, Vallardi, Milano, 1923, pp. 75–95.
[3] G. Getto, Manzoni europeo, Ugo Mursia, Milano, 1971.
[4] P. Frare, Manzoni europeo?, Nuovi quaderni del Crier. I "Promessi sposi" nell'Europa romantica 9 (2012) 199–220.
[5] V. Intonti, R. Mallardi, Cultures in contact: Translation and reception of I Promessi Sposi in 19th century England, 2011.
[6] T. R. Steiner, English translation theory 1650-1800, 2, Rodopi, 1975.
[7] M.-A. Lefer, Parallel corpora, in: A practical handbook of corpus linguistics, Springer, 2021, pp. 257–282.
[8] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214–2218. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
[9] R. Steinberger, M. Ebrahim, A. Poulis, M. Carrasco-Benitez, P. Schlüter, M. Przybyszewski, S. Gilbro, An overview of the European Union's highly multilingual parallel corpora, Language resources and evaluation 48 (2014) 679–707.
[10] Y. Xu, A. Max, F. Yvon, Sentence alignment for literary texts: The state-of-the-art and beyond, in: Linguistic Issues in Language Technology, Volume 12, 2015-Literature Lifts up Computational Linguistics, 2015.
[11] P. F. Brown, J. C. Lai, R. L. Mercer, Aligning sentences in parallel corpora, in: 29th Annual Meeting of the Association for Computational Linguistics, 1991, pp. 169–176.
[12] W. A. Gale, K. W. Church, et al., A program for aligning sentences in bilingual corpora, Computational linguistics 19 (1994) 75–102.
[13] D. Varga, P. Halácsy, A. Kornai, V. Nagy, L. Németh, V. Trón, Parallel corpora for medium density languages, volume 292, Amsterdam; Philadelphia; J. Benjamins Pub. Co, 2007, p. 247.
[14] M. Kay, M. Roscheisen, Text-translation alignment, Computational linguistics 19 (1993) 121–142.
[15] R. Sennrich, M. Volk, MT-based sentence alignment for OCR-generated parallel texts, in: Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers, 2010.
[16] B. Thompson, P. Koehn, Vecalign: Improved sentence alignment in linear time and space, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 1342–1348. URL: https://www.aclweb.org/anthology/D19-1136. doi:10.18653/v1/D19-1136.
[17] L. Liu, M. Zhu, Bertalign: Improved word embedding-based sentence alignment for Chinese–English parallel corpora of literary texts, Digital Scholarship in the Humanities 38 (2023) 621–634.
[18] E. Signoroni, Evaluating the state-of-the-art sentence alignment system on literary texts, in: Recent Advances in Slavonic Natural Language Processing (RASLAN 2021), 2021, pp. 115–124.
[19] G. Raboni, «manzonionline»: considerazioni in corso d'opera, Griseldaonline 20 (2021) 149–155.
[20] Z. S. Villar, O. A. Pinedo, TAligner 3.0: A tool to create parallel and multilingual corpora, in: Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations, John Benjamins, 2021, pp. 125–146.
[21] R. Sprugnoli, A. Redaelli, M. Sartor, Risorse linguistiche per lo studio dei "Promessi sposi", in: La memoria digitale. Forme del testo e organizzazione della conoscenza. Atti del XII convegno annuale AIUCD (Siena, 5-7 giugno 2023), Università degli Studi di Siena, Siena, 2023, pp. 301–303.
[22] P. Vondřička, Aligning parallel texts with InterText, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), Reykjavik, Iceland, 2014, pp. 1875–1879. URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/285_Paper.pdf.
[23] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2018, pp. 197–207.
[24] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[25] J. Munday, S. R. Pinto, J. Blakesley, Introducing translation studies: Theories and applications, Routledge, 2022.
[26] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 878–891. URL: https://aclanthology.org/2022.acl-long.62. doi:10.18653/v1/2022.acl-long.62.
[27] A. Manzoni, The Betrothed: A Novel, Modern Library, 2022. Translated by Michael F. Moore.
[28] T. Yousef, C. Palladino, F. Shamsian, M. Foradi, Translation alignment with Ugarit, Information 13 (2022) 65.
[29] T. Yousef, C. Palladino, F. Shamsian, A. d. Ferreira, M. F. dos Reis, An automatic model and gold standard for translation alignment of Ancient Greek, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 5894–5905.