=Paper=
{{Paper
|id=Vol-3834/paper128
|storemode=property
|title=Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works
|pdfUrl=https://ceur-ws.org/Vol-3834/paper128.pdf
|volume=Vol-3834
|authors=Maria Levchenko
|dblpUrl=https://dblp.org/rec/conf/chr/Levchenko24
}}
==Automatic Translation Alignment Pipeline for Multilingual Digital Editions of Literary Works==
Maria Levchenko
Dipartimento di Filologia Classica e Italianistica, University of Bologna, Italy
Abstract
This paper investigates the application of translation alignment algorithms in the creation of a
Multilingual Digital Edition (MDE) of Alessandro Manzoni’s Italian novel I promessi sposi (“The Betrothed”),
with translations in eight languages (English, Spanish, French, German, Dutch, Polish, Russian and
Chinese) from the 19th and 20th centuries. We identify key requirements for the MDE to improve both the
reader experience and support for translation studies. Our research highlights the limitations of current
state-of-the-art algorithms when applied to the translation of literary texts and outlines an automated
pipeline for MDE creation. This pipeline transforms raw texts into web-based, side-by-side
representations of original and translated texts with different rendering options. In addition, we propose new
metrics for evaluating the alignment of literary translations and suggest visualization techniques for
future analysis.
Keywords
multilingual digital edition, Alessandro Manzoni, translation alignment, literary translation, embeddings
1. Introduction
From the very beginning of digital edition creation, there has been a tendency, supported by
the power of web technologies, to represent not only the original text but also its translation(s),
following the tradition of bilingual printed editions. In this paper, we propose to define
multilingual digital editions (MDE) as editions in which translations are not supplementary but
essential, intended to enrich both computational analysis and reader experience.
Beyond providing access to annotated files, the MDE should meet additional criteria to be effective.
Primarily, the platform must display the original text alongside its translations, with a visual
correlation between aligned pairs that facilitates straightforward comparison and analysis.
Accurate alignment is a baseline requirement: corresponding parts of the texts must be properly
matched. Furthermore, the platform should support the visual highlighting of omitted or inserted
parts in the translations [6], which will enable users to discern differences and interpret the
nuances of each translation.
These requirements are generally feasible for short, structured texts like poetry or historical
documents (examples of MDE publishing strategies are described in Appendix A). The
challenge is to develop a flexible, automated system that accurately aligns complex literary
texts across multiple languages for computational analysis and user-friendly exploration. The
technology should be able to handle the complexities of literary texts, including the splitting,
merging, and reordering of sentences, and align text fragments of manageable length, ensuring
that they are easy for users to read and understand at a glance to obtain insight into the
linguistic and cultural nuances of each version. The automated alignment process should save
researchers both time and resources.
CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
Email: maria.levchenko@studio.unibo.it (M. Levchenko)
URL: https://mary-lev.github.io/ (M. Levchenko)
ORCID: 0000-0002-0877-7063 (M. Levchenko)
© 2024 Copyright for this paper by its author. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
For the MDE of Alessandro Manzoni’s novel “I promessi sposi” (The Betrothed), we propose an
automatic translation alignment pipeline that adapts state-of-the-art alignment techniques to
the objectives of the multilingual digital edition of literary works for educational and research
purposes.
2. The Betrothed by Alessandro Manzoni and Its Translations
A comparative analysis of translations of the same literary work over time can provide valuable
insights into the evolution of interpretation and understanding. “I promessi sposi” is particularly
compelling in this context. Not only does it reflect the author’s exploration of the Italian
language during a period of significant linguistic evolution, but it has also been translated into
many European languages over the past two centuries. This makes it an ideal case study for
investigating the influence of temporal factors, linguistic shifts, and the reception of the original
novel in different cultural contexts.
Two main original editions (1827 and 1840) were translated into European languages and
published in parallel in the 19th century. For the development of an automated translation
alignment pipeline, we selected and prepared the texts of a wide range of translations of the
classic edition of the novel, also known as Quarantana (1840), including English translations
from 1845, 1876, 1983, and 2022; Russian translations from 1854 and 1999; and Dutch (1849),
German (1884), French (1874), Spanish (1858), Polish (1882) and Chinese (1998) translations (see
Appendix B for a list of translations).
3. Related Work
The core of the MDE creation process is translation alignment, which involves mapping
corresponding units (typically words or sentences) between a source and target text. State-of-the-art
alignment algorithms have evolved significantly in recent years and now perform well in
many applications, including machine translation, bilingual dictionary creation, and parallel
corpus development.
Modern methods have moved from statistical approaches [5, 20, 18, 14, 15, 10] and lexical
associations (Hunalign in [22]), first to the use of machine translation (MT) systems and then
to alignment systems based on multilingual sentence embeddings, which significantly
improve accuracy (LASER in [2] and LaBSE in [8]). Thompson and Koehn’s Vecalign [21] uses
LASER embeddings and a recursive dynamic programming approach to achieve state-of-the-art
results while reducing complexity from quadratic to linear [21]. These methods use multilingual
models to generate embeddings for each sentence, which are then compared using cosine
similarity to find the best matches between the original and translated sentences. Liu and Zhu [17]
introduced Bertalign, which uses LaBSE vectors and demonstrated superior performance on
the Bible One dataset and an English-Chinese literary corpus.
[Figure 1 plot: “Alignment Types Across Different Translations”. y-axis: Total Alignments (0 to 8000); x-axis: Translations; legend: One-to-One, One-to-Many, Many-to-One, Many-to-Many.]
Figure 1: Alignment Types in The Betrothed
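The embedding comparison described above can be illustrated with a minimal sketch. It assumes sentence embeddings are already available as plain lists of floats (the toy three-dimensional vectors below stand in for LASER or LaBSE output), and it simply picks, for each source sentence, the target sentence with the highest cosine similarity. This greedy matching is a simplification for illustration and is not the dynamic programming search used by Vecalign or Bertalign; the names `cosine_similarity` and `best_matches` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def best_matches(src_vecs, tgt_vecs):
    """For each source vector, return (target index, similarity) of the closest target."""
    matches = []
    for u in src_vecs:
        scores = [cosine_similarity(u, v) for v in tgt_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches


# Toy embeddings (assumption: real ones would come from a multilingual model).
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]
matches = best_matches(source, target)
```

Here the first source vector is matched to the second target vector and vice versa, which mimics a translation that swaps the order of two sentences.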
4. Challenges of the Sentence-Level Alignment
While the alignment of text and translation at the line level is sufficient for poetic, historical,
and even verse dramatic texts (see [1]), where we cannot expect significant variation in the
splitting, merging, or reordering of lines, this approach is inadequate for prose because of
the extent of restructuring that inevitably occurs in literary prose translation. In such cases, the
standard approach is sentence-level alignment. However, it can be challenging, particularly for
literary translations, because the syntactic structure of the original text is rarely reproduced
regularly in another language. Literary translators are not limited to translating a single sentence
into another single sentence (a one-to-one alignment) but are free to move sentence boundaries
and reconfigure sentence structures to better convey the meaning and style of the original text.
In working to achieve the highest similarity score for the aligned pairs, alignment algorithms are
then forced to combine several sentences into one, using one-to-many, many-to-one, and
many-to-many alignment types.
The ideal alignment type for the MDE is one-to-one, which maintains the granularity and
consistency of the alignment. In our analysis of the sentence-level alignment of I
promessi sposi (see Figure 1 for the different translations), one-to-one alignments are the
most common, but a significant proportion are of more complex types. This does not inherently
complicate the alignment process, as advanced tools such as Bertalign and Vecalign can handle
this complexity, but the results may be less meaningful. The aligned pairs become longer,
including several sentences from both the source text and the
translation (examples can be seen in Appendix B). The edge case of this expansion
would be the pairing of a whole paragraph or even a chapter of the original text with
its counterpart in the translation.
1088
Figure 2: The visualization of the alignment of the long sentence.
For this reason, traditional metrics may not be sufficient for evaluating the alignment results.
The performance of alignment algorithms is typically evaluated using established metrics such
as precision, recall, F1 score, and Alignment Error Rate (AER) [23]. The first limitation of this
approach is that it is based on a “gold dataset,” which does not provide insight into the
performance of the algorithm on other types of text [9]. The second limitation is that the
scores may be high while the results are unsuitable for the MDE because the aligned pairs are too
large to be analyzed or identified at a glance by a human observer. We therefore suggest that,
in addition to the distribution of alignment types (one-to-one,
one-to-many, many-to-one, many-to-many) as a measure of the acceptability of the results, the
number and length of alignment pairs derived from the original sentences should also be
considered. A number of aligned pairs close to the number of original sentences would indicate
an effective alignment process. Conversely, a significant reduction in the number of aligned
pairs would indicate limitations of the sentence-level alignment approach, as it implies that the
alignment algorithm is forced to combine more sentences to obtain an appropriate similarity
score.
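As a rough illustration of the two proposed indicators, the sketch below counts alignment types and computes the ratio of aligned pairs to source sentences. It assumes the aligner returns a list of pairs of index lists, one list for the source side and one for the target side; that input format and the names `alignment_type` and `alignment_report` are assumptions made for this example.

```python
from collections import Counter


def alignment_type(src_ids, tgt_ids):
    """Classify one aligned pair by how many sentences it covers on each side."""
    if not src_ids or not tgt_ids:
        return "unaligned"  # an omission or an insertion
    src_kind = "one" if len(src_ids) == 1 else "many"
    tgt_kind = "one" if len(tgt_ids) == 1 else "many"
    return f"{src_kind}-to-{tgt_kind}"


def alignment_report(alignments, n_source_sentences):
    """Distribution of alignment types and the pairs-to-source-sentences ratio."""
    types = Counter(alignment_type(s, t) for s, t in alignments)
    return {
        "types": dict(types),
        "pair_ratio": len(alignments) / n_source_sentences,
    }


# Toy result: five source sentences collapsed into four aligned pairs.
alignments = [([0], [0]), ([1, 2], [1]), ([3], [2, 3]), ([4], [4])]
report = alignment_report(alignments, n_source_sentences=5)
```

A `pair_ratio` close to 1 would correspond to the case described above, where the number of aligned pairs stays close to the number of original sentences; lower values signal that sentences were merged.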
The length of alignment pairs indicates whether they are suitable for human readers. In the
context of creating a digital edition for educational or research purposes with multiple languages,
it is not advisable to present long aligned texts, given the limited attention span and working
memory of the readers (for further insight, see studies of working memory and comprehension
with multiple text reading [11, 12]).
To illustrate, if a sentence in the source language (Italian) is 130 tokens long and its
corresponding sentence in the target language (English) is 140 tokens long, readers may encounter
difficulties in comparing and understanding such lengthy segments. Even the use of color
differentiation to highlight aligned pairs does not overcome this challenge (see Figure 2).
In summary, in the context of MDE of literary works, sentence-level alignment still faces
significant challenges: 1) sentence boundaries are not stable across languages, which
leads to a variety of alignment types and prevents a consistent alignment across the MDE;
2) strict sentence-level alignment does not fully reflect the variability of the translated texts,
such as inserted or omitted parts; and 3) it strains readers’ attention spans and working memory
and fails to achieve an alignment granularity that is comfortable for the overall reading
experience. The alignment process needs to be modified to address these challenges to automated
processing and readability.
5. Sentence Segmentation as an Alternative Solution
Alternative segmentation methods can provide a more accurate and meaningful segmentation of
literary texts. To move from sentence-level alignment to phrase- or segment-level alignment,
we consider two promising approaches:
• Punctuation splitting: Using punctuation marks (such as commas, periods, and semicolons)
to create initial segments. This method provides natural breaks in the text, preserving
the contextual meaning. However, when we aligned the resulting segments with Bertalign,
we achieved a more granular alignment but increased
the number of reordering problems that did not occur with sentence-level alignment.
• Zero/Few-Shot Prompting with LLMs: The sentences of the original text are segmented
using zero-shot prompting with the OpenAI GPT-4o model [19]. The approved segments
are then used as patterns for few-shot prompting to segment the sentences of the
translations. This approach provides a robust foundation for universal alignment.
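The punctuation-splitting option can be sketched in a few lines. The example below is only one possible reading of it: it splits after a comma, semicolon, or period that is followed by whitespace and keeps the punctuation mark with the preceding segment. The function name `split_on_punctuation` and the exact set of marks are assumptions, not the rule set used in the project.

```python
import re

# Split on whitespace that follows a comma, semicolon, or period,
# so each punctuation mark stays attached to the segment it closes.
_BOUNDARY = re.compile(r"(?<=[,;.])\s+")


def split_on_punctuation(sentence):
    """Return the punctuation-delimited segments of a sentence, in order."""
    return [part for part in _BOUNDARY.split(sentence.strip()) if part]


sentence = (
    "Presto, io spero, potrete ritornar sicuri a casa vostra; "
    "a ogni modo, Dio vi provvederà."
)
segments = split_on_punctuation(sentence)
```

Joining the segments with single spaces reproduces the input sentence, so no text is lost. The very short segments this produces (for example a single word before a comma) hint at why such fine splitting can make reordering harder for the aligner.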
The similarity score can be visualized to evaluate segment-level alignment and compare its
results with traditional sentence-level alignment. In addition, the visual representation of the
similarity score of the aligned segments or sentences allows us to find semantic outliers.
After extracting high-dimensional embeddings for each aligned line from the original and
translated text using the multilingual model (LaBSE), we applied t-Distributed Stochastic
Neighbour Embedding (t-SNE) to reduce them to two dimensions. By visually examining the cosine
similarity, we can detect anomalies and curious translated fragments (see Figure 3), even if the
alignment algorithm establishes the correlation between sentences.
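Before any visual inspection, a simple numeric pass can already surface candidate outliers. The sketch below is an illustration only: it assumes each aligned pair is given as two embedding vectors (toy two-dimensional lists here, where real input would come from a model such as LaBSE) and flags pairs whose cosine similarity falls below a threshold. The threshold of 0.5 and the name `flag_outliers` are arbitrary assumptions, and this check complements the t-SNE view without reproducing it.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def flag_outliers(pairs, threshold=0.5):
    """Return (index, similarity) for aligned pairs scoring below the threshold."""
    flagged = []
    for index, (src_vec, tgt_vec) in enumerate(pairs):
        similarity = cosine_similarity(src_vec, tgt_vec)
        if similarity < threshold:
            flagged.append((index, similarity))
    return flagged


# Toy aligned pairs: the last one is deliberately unrelated.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([0.0, 1.0], [0.1, 1.0]),
    ([1.0, 1.0], [1.0, 0.9]),
    ([1.0, 0.0], [0.0, 1.0]),
]
flagged = flag_outliers(pairs)
```

Pairs flagged this way are candidates for manual review; a low score may indicate a misalignment, but it may equally reflect a free or abridged translation.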
Several quantitative metrics can be used to assess the quality of alignment:
• Number of pairs: The number of resulting aligned pairs relative to the number of input
segments.
• Consistency: Whether segments are consistently aligned across languages, preserving
the meaning and context of the original text.
• Number of clusters: The number of clusters indicates how the sentences are grouped.
An ideal alignment would result in clusters where each cluster contains two points
corresponding to the embedding of the aligned pair, one from the original text and one from
the translation, placed close to each other.
• Average similarity: Calculating the average similarity within clusters gives an insight
into the coherence of the alignments, with higher values indicating more semantically
consistent groupings.
• Length: The length of the aligned pair of lines indicates whether the detected lines are
within the reader’s attention span and suitable for the user experience.
[Figure 3 plot: “Similarity visualisation of sentence embeddings in Chapter 23”. Two-dimensional projection with axes x and y; points are coloured by language (German, Italian).]
Figure 3: Similarity visualisation for sentence-level alignment in the German translation of chapter
23. This translation omits Don Abbondio’s inner monologue, which is not captured in the sentence-
level alignment, but is evident in the visualisation, where several Italian sentences appear without
corresponding German pairs.
Comparing sentence-level and segment-level alignment, we can assume that even sentence-level
alignment provides valuable insights into the differences between the translations and the
original text (see Appendix D for examples in the Spanish translation of I promessi sposi), while
segment-level alignment allows us to go deeper and capture more nuanced variations between the
original and the translated text. For example, we can identify the omission of the end of chapter 8
in the German translation (see Table 1) and two omissions in the Russian translation of
chapter 1 (see Tables 2 and 3), which can be interpreted through the lens of cultural differences and/or
censorship and could not be captured by sentence-level alignment.
{| class="wikitable"
|+ Table 1: Italian / German segment-level alignment
! Original !! German 1880
|-
| Presto, io spero, potrete ritornar sicuri a casa vostra;
| Ich hoffe, ihr werdet bald ohne Gefahr in euer Haus zurückkehren können;
|-
| a ogni modo, Dio vi provvederà, per il vostro meglio;
| in jedem Falle wird Gott Alles zu euerm Besten lenken.
|-
| e io certo mi studierò di non mancare alla grazia che mi fa, scegliendomi per suo ministro, nel servizio di voi suoi poveri cari tribolati.
|
|-
| Voi,» continuò volgendosi alle due donne, «potrete fermarvi a ***.
| Und ihr», fuhr er, zu den beiden Frauen gewandt, fort, «ihr könnt euch in *** so lange aufhalten.
|}
{| class="wikitable"
|+ Table 2: Italian / Russian segment-level alignment
! Original !! Russian 1854
|-
| Ai tempi in cui accaddero i fatti che prendiamo a raccontare, quel borgo, già considerabile, era anche un castello,
| Во время тех событий, которые мы намерены описать, Лекко было уже значительным местечком и маленькой крепостцей;
|-
| e aveva perciò l’onore d’alloggiare un comandante, e il vantaggio di possedere una stabile guarnigione di soldati spagnoli,
| вследствие чего в нем жили комендант и постоянный гарнизон испанских солдат,
|-
| che insegnavan la modestia alle fanciulle e alle donne del paese, accarezzavan di tempo in tempo le spalle a qualche marito, a qualche padre; e, sul finir dell’estate,
|
|-
| non mancavan mai di spandersi nelle vigne, per diradar l’uve, e alleggerire a’ contadini le fatiche della vendemmia.
| которые занимались собиранием винограда.
|}
{| class="wikitable"
|+ Table 3: Italian / Russian segment-level alignment
! Original !! Russian 1854
|-
| Con tutto ciò,
| Несмотря на все,
|-
| anzi in gran parte a cagion di ciò,
| и может быть, потому именно,
|-
| quelle gride, ripubblicate e rinforzate di governo in governo, non servivano ad altro che ad attestare ampollosamente l’impotenza de’ loro autori,
|
|-
| o, se producevan qualche effetto immediato...
| если декреты имели минутную действительность...
|}
By providing a more granular and accurate alignment, the segment-level approach also al-
lows the length of aligned pairs to be reduced (increasing their number) and makes the MDE
more suitable for reader reception compared to sentence-level alignment (see Figure 4).
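The length comparison can be made concrete with a small sketch. It measures length in whitespace-separated words and uses the first two Italian segments of Table 1 as the example; the helper names `average_length` and `long_pairs` and the idea of a fixed `max_words` limit are assumptions introduced for illustration, not part of the published pipeline.

```python
def average_length(texts):
    """Mean number of whitespace-separated words per text."""
    return sum(len(text.split()) for text in texts) / len(texts)


def long_pairs(pairs, max_words):
    """Indices of aligned pairs where either side exceeds max_words."""
    return [
        index
        for index, (source, target) in enumerate(pairs)
        if len(source.split()) > max_words or len(target.split()) > max_words
    ]


# The same stretch of text as one sentence-level unit and as two segments.
sentence_level = [
    "Presto, io spero, potrete ritornar sicuri a casa vostra; "
    "a ogni modo, Dio vi provvederà, per il vostro meglio;"
]
segment_level = [
    "Presto, io spero, potrete ritornar sicuri a casa vostra;",
    "a ogni modo, Dio vi provvederà, per il vostro meglio;",
]

sentence_avg = average_length(sentence_level)  # 19 words in one unit
segment_avg = average_length(segment_level)    # 9 and 10 words, mean 9.5
```

Splitting the unit in two halves the average length while doubling the number of units, which is the trade-off described above.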
[Figure 4 plot: “Average Lengths for Original and Translation Aligned Pairs in Spanish_1858”. y-axis: Average Length (in Words); x-axis: chapters (Intro, Cap1 to Cap38); series: original segment length, translated segment length, original sentence length, translated sentence length.]
Figure 4: Reducing the Lengths of Aligned Pairs for the Spanish 1858
6. Multilingual Digital Edition Pipeline
The automated pipeline for the MDE is proposed as a means of enabling creators to prepare
annotated TEI files that are accessible, adaptable, correct, easily parsed by computational tools,
and rendered for readers (see Figure 5).
Figure 5: The Translation Alignment Pipeline
We start with the raw texts of the translations in TXT format, obtained after OCR and error
checking. For Manzoni’s text, we used TEI files with identifiers assigned to each token. This
preparation allows us to take into account the irregular segmentation to be expected due to
inconsistencies across the languages.
Step 1. The choice of the segmentation method. Based on the above analysis and the
specifics of the texts to be published, the MDE developers can select the segmentation
methodology in accordance with the projected audience and the project’s objectives, enabling
alignment at the sentence, phrase, or word level, or a combination, giving readers the flexibility to
choose their preferred option.
Steps 2-3. Segmentation of the original text and the translations. Depending on the
decision made in the first step, the text can be split into sentences, segments, or even tokens.
Step 4. Applying alignment algorithms. By default, we applied Bertalign with the LaBSE
model, trained on 109 languages [7], to the segments obtained in the previous steps. Other
multilingual sentence-transformer models, such as BGE M3-Embedding, can also be used [4].
Step 5. Choosing the encoding approach. The encoding approach determines the flexibility
of the alignment description for future rendering and for establishing a link between the
original and translated texts. For structured texts, where each segment in the original closely
corresponds to an equivalent segment in the translations, it may be appropriate to mark each
segment with the same identifier. Given the complexity of multilingual alignment, we have
taken a different approach. The TEI encoding of the original text includes identifiers for each
token, providing granular reference points. The TEI-encoded translation text is divided into
segments, each referencing the start and end identifiers from the original text, allowing for
flexible and accurate alignment.
Step 6. Encoding. By iterating over the alignment results, we assign the referencing start
and end identifiers from the original text to each aligned segment from the translation and
generate a new TEI file for the translation, ensuring that all segments are accurately linked to
the corresponding elements.
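The encoding loop of Step 6 might look roughly like the following sketch. It is hypothetical in several respects: the token identifiers (`w1`, `w9`, ...) are invented, the choice of a `seg` element with a `corresp` attribute holding the start and end pointers is an assumption about how the references could be stored, and the TEI namespace and header are omitted for brevity. The German text is taken from Table 1.

```python
import xml.etree.ElementTree as ET


def encode_translation(aligned_segments):
    """Build a minimal body element with one seg per aligned translation segment.

    aligned_segments: iterable of (start_token_id, end_token_id, text), where the
    identifiers point at tokens in the TEI file of the original (assumed format).
    """
    body = ET.Element("body")
    paragraph = ET.SubElement(body, "p")
    for number, (start_id, end_id, text) in enumerate(aligned_segments, start=1):
        seg = ET.SubElement(paragraph, "seg")
        seg.set("n", str(number))
        # Pointers to the first and last token of the matching original span.
        seg.set("corresp", f"#{start_id} #{end_id}")
        seg.text = text
    return body


# Hypothetical alignment output for two segments.
aligned = [
    ("w1", "w9", "Ich hoffe, ihr werdet bald ohne Gefahr in euer Haus zurückkehren können;"),
    ("w10", "w19", "in jedem Falle wird Gott Alles zu euerm Besten lenken."),
]
body = encode_translation(aligned)
xml_string = ET.tostring(body, encoding="unicode")
```

Storing only the start and end token of the original span keeps the translation file independent of how the original is segmented, which matches the motivation given in Step 5.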
Step 7. Rendering aligned texts on the web. The original and translated texts are rendered
from the TEI files as two columns on a web page, with separate XSLT templates for the
original and translated text. This interactive interface allows users to click on the original text
and see the corresponding translation fragment highlighted, enhancing the user experience by
providing an intuitive way to explore and compare the translations side by side.
Step 8. Visualization and evaluation. While highly unstable text versions [13] or
line-level aligned translations [16] can be effectively visualized with a Sankey diagram or bipartite
graph, the alignment results for the modern translations can be visualized with the approach
described above, based on the embedding vectors with t-SNE and clustering with DBSCAN.
As for the presentation in the user interface, ideally, all multilingual translations should be
comparable and aligned with each other, allowing the user to see and interpret the differences.
7. Future Development and Challenges
• Current alignment algorithms face challenges in accurately aligning segments with
reordered content. Future work will focus on improving the alignment performance in
such cases, ensuring more precise matches even when the original and translated texts
differ significantly in structure.
• Previous studies on user behavior in digital editions have analyzed log files to understand
interaction patterns [3]. To gain deeper insights, we are using advanced tools such as
ReactFlow to study more comprehensively how users interact with different elements of
MDEs. For example, when readers view two lines in different languages side by side, the
optimal reading span may differ from traditional reading practices. By analysing user
interactions, we aim to determine the most effective segment length for the MDEs.
8. Conclusion
The proposed pipeline aims to improve the development of Multilingual Digital Editions (MDE)
by ensuring that an MDE is both methodologically robust and user-centered. By prioritizing user
experience and usability, the pipeline adapts existing computational methods and algorithms
to the specific needs of educational and research applications.
We have also proposed new metrics for MDEs that focus on the consistency, meaningfulness
and granularity of the alignment. These metrics assess the suitability of an alignment for
educational and research purposes. By ensuring that the alignment is accessible to human readers
while supporting translation studies, the pipeline balances conciseness and reader engagement.
References
[1] M. Alharbi, T. Cheesman, and R. S. Laramee. “AlignVis: Semi-automatic Alignment and Visualization of Parallel Translations”. In: 2020 24th International Conference Information Visualisation (IV). 2020, pp. 98–108. doi: 10.1109/iv51561.2020.00026.
[2] M. Artetxe and H. Schwenk. “Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Ed. by A. Korhonen, D. Traum, and L. Màrquez. Florence, Italy: Association for Computational Linguistics, 2019, pp. 3197–3203. doi: 10.18653/v1/P19-1309.
[3] A. Baillot and A. Busch. “Editing for Man and Machine. Digital Scholarly Editions and their Users”. In: Variants (2021), pp. 175–187. doi: 10.4000/variants.1220.
[4] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 2024. doi: 10.48550/arXiv.2402.03216. eprint: 2402.03216 (cs.CL).
[5] K. W. Church. “Char_align: A Program for Aligning Parallel Texts at the Character Level”. In: 31st Annual Meeting of the Association for Computational Linguistics. Columbus, Ohio, USA: Association for Computational Linguistics, 1993, pp. 1–8. doi: 10.3115/981574.981575.
[6] G. Crane, A. Babeu, L. Cerrato, A. Parrish, C. Penagos, F. Shamsian, J. Tauber, and J. Wegner. “Beyond translation: engaging with foreign languages in a digital library”. In: International Journal on Digital Libraries 24 (2023). doi: 10.1007/s00799-023-00349-2.
[7] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. “Language-agnostic BERT Sentence Embedding”. In: CoRR abs/2007.01852 (2020). doi: 10.48550/arXiv.2007.01852. eprint: 2007.01852.
[8] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang. “Language-agnostic BERT Sentence Embedding”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by S. Muresan, P. Nakov, and A. Villavicencio. Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 878–891. doi: 10.18653/v1/2022.acl-long.62.
[9] A. Fraser and D. Marcu. “Squibs and Discussions: Measuring Word Alignment Quality for Statistical Machine Translation”. In: Computational Linguistics 33.3 (2007), pp. 293–303. doi: 10.1162/coli.2007.33.3.293.
[10] W. A. Gale and K. W. Church. “A Program for Aligning Sentences in Bilingual Corpora”. In: Computational Linguistics 19.1 (1993). Ed. by J. Hirschberg, pp. 75–102. url: https://aclanthology.org/J93-1004.
[11] C. Hahnel, F. Goldhammer, U. Kroehne, N. Mahlow, C. Artelt, and C. Schoor. “Automated and controlled processes in comprehending multiple documents”. In: Studies in Higher Education 46 (2021), pp. 2074–2086. url: https://api.semanticscholar.org/CorpusID:237538010.
[12] L. Hildenbrand and J. Wiley. “Working memory capacity as a predictor of multiple text comprehension”. In: Discourse Processes 60.4-5 (2023), pp. 378–396. doi: 10.1080/0163853x.2023.2197690.
[13] S. Jänicke and D. J. Wrisley. “Interactive Visual Alignment of Medieval Text Versions”. In: 2017 IEEE Conference on Visual Analytics Science and Technology (VAST) (2017), pp. 127–138. url: https://api.semanticscholar.org/CorpusID:39979643.
[14] O. Kraif. “Exploitation des cognats dans les systèmes d’alignement bi-textuel : architecture et évaluation”. In: Revue TAL : traitement automatique des langues 42.3 (2001), pp. 833–867.
[15] F. Lamraoui and P. Langlais. “Yet Another Fast, Robust and Open Source Sentence Aligner. Time to Reconsider Sentence Alignment?” In: XIV Machine Translation Summit. Nice, France, 2013, pp. 77–84.
[16] R. S. Laramee, S. J. Walton, and X. Liu. “Interactive Visualisation of Shakespeare’s Othello”. MA thesis. Swansea University, 2018. url: https://api.semanticscholar.org/CorpusID:70244581.
[17] L. Liu and M. Zhu. “Bertalign: Improved word embedding-based sentence alignment for Chinese–English parallel corpora of literary texts”. In: Digital Scholarship in the Humanities 38.2 (2023), pp. 621–634. doi: 10.1093/llc/fqac089.
[18] T. Mcenery and M. Oakes. “Sentence and word alignment on the CRATER project: methods and assessment”. In: Proceedings of the Association for Computational Linguistics Workshop SIG-DAT Workshop (1995), pp. 104–116.
[19] OpenAI. ChatGPT (May 13 version). https://chat.openai.com. 2024.
[20] M. Simard, G. F. Foster, and P. Isabelle. “Using cognates to align sentences in bilingual corpora”. In: Proceedings of the Fourth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages. Montréal, Canada, 1992, pp. 67–81. url: https://aclanthology.org/1992.tmi-1.7.
[21] B. Thompson and P. Koehn. “Vecalign: Improved Sentence Alignment in Linear Time and Space”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Ed. by K. Inui, J. Jiang, V. Ng, and X. Wan. Hong Kong, China: Association for Computational Linguistics, 2019, pp. 1342–1348. doi: 10.18653/v1/D19-1136.
[22] D. Varga, P. Halácsy, A. Kornai, V. Nagy, L. Németh, and V. Trón. “Parallel corpora for medium density languages”. In: Recent Advances in Natural Language Processing IV: Selected papers from RANLP 2005. 2007, pp. 247–258. url: https://api.semanticscholar.org/CorpusID:13133927.
[23] T. Yousef, G. Heyer, and S. Jänicke. “EVALIGN: Visual Evaluation of Translation Alignment Models”. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Ed. by D. Croce and L. Soldaini. Dubrovnik, Croatia: Association for Computational Linguistics, 2023, pp. 277–297. doi: 10.18653/v1/2023.eacl-demo.31.
Acknowledgments
This research was supported by the Dipartimento di Filologia Classica e Italianistica, University
of Bologna, as part of the project “Manzoni online2: manoscritti e documenti inediti, tradizione
e traduzioni” (CUP J34I19003370001, project code 2017CFZFAY_003). For more information on
the Leggo Manzoni project, visit https://projects.dharc.unibo.it/leggomanzoni.
Appendixes
A. Strategies of Text/Translation Representation in MDE
{| class="wikitable"
|+ Table 4: Strategies of Text/Translation Representation in Multilingual Digital Editions
! Project Name !! Alignment Type !! Comparison !! Notes
|-
! colspan="4" | 1. Separate pages for the text and the translation
|-
| Decameron web || - || - ||
|-
| La entretenida by Miguel de Cervantes || - || - ||
|-
! colspan="4" | 2. Same page for the text and the translation with JS switcher
|-
| Furnace and Fugue || - || - ||
|-
! colspan="4" | 3. Side-by-side viewer for the text and translation
|-
| Cantar de mio Cid || - || + || rowspan="8" | Editions include different modes of the text on a single page with a side-by-side view, including the translation. Typically, there is no alignment, but the text length is usually just one page, making comparison straightforward. These viewers enable comparison of two or more versions of the text.
|-
| Vincent van Gogh. The Letters || - || +
|-
| Ancrene Wisse Preface || - || +
|-
| The Codex Sinaiticus Project || - || +
|-
| Dafydd ap Gwilym.net || - || +
|-
| Princeton Dante Project || - || +
|-
| Ein Sermon von Ablass und Gnade || - || +
|-
| Secrets of Craft and Nature || - || +
|-
! colspan="4" | 4. Interlinear text/translation alignment
|-
| Codex Suprasliensis || lines || + || The original text in Old Church Slavonic is directly followed by its corresponding parallel Greek text.
|-
! colspan="4" | 5. Dynamic alignment display
|-
| Electronic Beowulf || lines || + || When the special view type and option are selected, and the user hovers the mouse over a line, the translation appears in a special area.
|-
| Kassák Lajos: The Horse Dies the Birds Fly Away || lines || + || The page displays side-by-side views of the original text and its translations into two other languages, highlighting the corresponding translated line when the mouse hovers over the original line.
|-
| The Community of the Realm in Scotland || sentences || + || The side-by-side viewer of the Latin text with its English translation aligns the sentences, allowing users to click on the sentence number in the original text, which automatically scrolls the other side of the page to the corresponding sentence.
|-
| Tabula Salomonis || lines || + || The TEI Publisher tool allows the user to highlight corresponding parts and automatically scroll when hovering over the lines.
|}
B. Translations
• (English 1845). The Betrothed Lovers: A Milanese Story of the Seventeenth Century. With
the Column of Infamy. By Alessandro Manzoni. In Three Volumes. Henry Francis C.
Logan. London: Longman, Brown, Green, and Longmans, Paternoster-Row.
• (English 1876). The Betrothed, by Alessandro Manzoni. London, G. Bell and Sons, 1876.
• (English 1983). Alessandro Manzoni, The Betrothed, Bruce Penman (tr.), Penguin Random
House UK. London, 1983.
• (English 2022). The Betrothed. A novel, translated and with an Introduction by Michael
Moore, Preface by Pulitzer Prize-Winning Author Jhumpa Lahiri, Modern Library, 2022.
• (Russian 1854). Обрученные : Медиолан. быль XVIII [!XVII] столетия, найден. и
передел. Александром Манзони / Пер. с итал. В.С. Межевича. Ч. 1-4. Москва,
1854. 4 т.; 20. (Библиотека романов, повестей, путешествий и записок, изд. Н.Н.
Улитиным; Вып. 7, т. 1-2, 6-7).
• (Russian 1999). Обрученные [Повесть из истории Милана XVII в.] / А. Мандзони;
[Пер. с итал. под ред. Н. Георгиевской, А. Эфроса]. Москва: Терра-Книжный клуб,
1999.
• (Dutch 1849). De verloofden: eene Milanesche geschiedenis uit de zeventiende eeuw. Vol.
1. Translated by Petrus Van Limburg Brouwer. Groningen, Van Boekeren, 1849.
• (German 1884). Die Verlobten: eine Mailändische Geschichte aus dem 17. Jahrhundert,
Volume 1. 3rd ed. Regensburg, G.J. Manz, 1884.
• (French 1874). Les fiancés: histoire milanaise du XVIIe siècle / Alexandre Manzoni;
traduite de l’italien par Rey Dussueil. Paris: Charpentier, 1874.
• (Spanish 1858). Los desposados: historia milanesa del siglo XVII traducida del italiano,
Volume 1. México: Imp. de Andrade y Escalante, 1858.
• (Polish 1882). Narzeczeni. Powieść medyolańska z XVII stulecia ze starego rękopisu
spisana i przerobiona, tłum. Maria z Siermiradzkich Obrąpalska. Warszawa, 1882.
• (Chinese 1998). Yuehun Fufu / (Yidali) Mengzuoni (Manzoni, A.) zhu; Zhang Shihua yi.
- Nanjing: Yilin chubanshe, 1998.10.
C. Examples of Many-to-Many Alignment Type
Table 5: Italian / French: 3-1 alignment type
Italian:
1 Sì; ma com’è dozzinale! com’è sguaiato! com’è scorretto!
2 Idiotismi lombardi a iosa, frasi della lingua adoperate a sproposito, grammatica arbitraria, periodi sgangherati.
3 E poi, qualche eleganza spagnola seminata qua e là; e poi, ch’è peggio, ne’ luoghi più terribili o più pietosi della storia, a ogni occasione d’eccitar maraviglia, o di far pensare, a tutti que’ passi insomma che richiedono bensì un po’ di rettorica, ma rettorica discreta, fine, di buon gusto, costui non manca mai di metterci di quella sua così fatta del proemio.
French:
1 Oui; mais comme il est commun! comme il est inégal! comme il est incorrect! idiotismes lombards à foison, phrases de la langue employées à rebours, constructions arbitraires, périodes boiteuses; et puis quelques petites élégances espagnoles semées ça et là; et puis, ce qui est bien pis, dans les endroits les plus terribles ou les plus touchants de son histoire, à chaque occasion d’exciter la surprise ou de faire penser, à tous les passages enfin qui demandent, il est vrai, quelques fleurs de rhétorique, mais d’une rhétorique sobre, fine, de bon goût, ce digne homme ne manque jamais d’y mettre quelque chose dans le genre de son début.
Table 6: Italian / German: 2-2 alignment type with the overlapping sentence boundaries
Italian:
1 Né alcuno dirà questa sij imperfettione del Racconto, e defformità di questo mio rozzo Parto, a meno questo tale Critico non sij persona affatto diggiuna della Filosofia: che quanto agl’huomini in essa versati, ben vederanno nulla mancare alla sostanza di detta Narratione.
2 Imperciocché, essendo cosa evidente, e da verun negata non essere i nomi se non puri purissimi accidenti...
German:
1 Und es wird gewiß niemand sagen, dies sei ein Geschichtsfälscher und eine Entstellung dieser meiner einfältigen Erzählung, es sei denn der Tadel ein Mann, der aller Weltweisheit vollständig bar wäre.
2 Denn man wird bald sehen, daß in Beziehung der darin vorkommenden Personen am Wesentlichsten der besagten Erzählung nichts fehle; zumal es eine augenfällige, von niemand gelenkte Sache ist, daß Namen bloß reine Nebensachen seien…
Table 7: Italian / Dutch: 2-2 alignment type with the overlapping sentence boundaries
Italian:
1 Però alla mia debolezza non è lecito solleuarsi a tal’argomenti, e sublimità pericolose, con aggirarsi tra Labirinti de’ Politici maneggj, et il rimbombo de’ bellici Oricalchi: solo che hauendo hauuto notitia di fatti memorabili, se ben capitorno a gente meccaniche, e di piccol affare, mi accingo di lasciarne memoria a Posteri, con far di tutto schietta e genuinamente il Racconto, ouuero sia Relatione.
2 Nella quale si vedrà in angusto Teatro luttuose Traggedie d’horrori, e Scene di malvaggità grandiosa, con intermezi d’Imprese virtuose e buontà angeliche, opposte alle operationi diaboliche.
Dutch:
1 Doch mijn’ geringeren krachten is het niet gegeven zich tot zoo hooge vlugt, tot zulk eene gevaarvolle verhevenheid te verheffen, en zich te wagen in den doolhof der staatkundige spitsvondigheden of te midden van het geschal der schorre krijgsklaroenen.
2 Naardemaal ’er dus eenige merkwaardige gebeurtenissen ter mijner kennis gekomen zijn, welke, wel is waar, slechts menschen van gering bedrijve en lage geboorte betreffen, maar des alniettemin eene rijke vertooning opleveren van droevige en vreesselijke ongevallen, voorbeelden van drieste boosheid, doormengd met vrome ondernemingen en verheerljkt door het zielesterkend schouwspel van hemelsche deugd, in onophoudelijken strijd met de gruwelijke aanslagen der helle, zoo heb ik besloten mij aan te gorden om daarvan der nakomelingschap een getrouw en nauwkeurig Verhaal ofte Relaas achterlaten.
Table 8: Italian / Spanish: 2-3 alignment type with the overlapping sentence boundaries
Italian:
1 Ma che? quando siamo stati al punto di raccapezzar tutte le dette obiezioni e risposte, per disporle con qualche ordine, misericordia! venivano a fare un libro.
2 Veduta la qual cosa, abbiam messo da parte il pensiero, per due ragioni che il lettore troverà certamente buone: la prima, che un libro impiegato a giustificarne un altro, anzi lo stile d’un altro, potrebbe parer cosa ridicola: la seconda, che di libri basta uno per volta, quando non è d’avanzo.
Spanish:
1 Pero, ¡oh cielos! llegado el momento de recapitular las objeciones y sus respuestas y el de ordenarlas, hallamos, que habíamos hecho un libro: visto lo cual, abandonamos nuestro intento por dos razones, que sin duda alguna el lector considerará oportunas. —
2 La primera, porque temimos que el hacer un libro para justificar otro, ó solo su estilo, parecería cosa ridícula.
3 La segunda, porque creemos que es suficiente, cuando no excesivo, el publicar un solo libro á la vez.
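The alignment-type labels used in Tables 5 to 8 (3-1, 2-2, 2-3) count the source and target sentences grouped into one aligned unit. The following is a minimal sketch of how such labels could be derived. It assumes a hypothetical representation of an aligned unit as a pair of sentence-number lists; the names AlignedUnit, alignment_type and type_counts are illustrative and are not taken from the pipeline described in this paper.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical representation (not the paper's data format): one aligned unit
# is a pair (source sentence numbers, target sentence numbers).
AlignedUnit = Tuple[List[int], List[int]]


def alignment_type(unit: AlignedUnit) -> str:
    """Label a unit as 'n-m': n source sentences aligned to m target sentences."""
    source_ids, target_ids = unit
    return f"{len(source_ids)}-{len(target_ids)}"


def type_counts(units: List[AlignedUnit]) -> Counter:
    """Count how often each alignment type occurs in a list of aligned units."""
    return Counter(alignment_type(unit) for unit in units)


# Sentence numbers as printed in Tables 5, 6 and 8.
table_5 = ([1, 2, 3], [1])     # Italian 1-3 vs. French 1
table_6 = ([1, 2], [1, 2])     # Italian 1-2 vs. German 1-2
table_8 = ([1, 2], [1, 2, 3])  # Italian 1-2 vs. Spanish 1-3

labels = [alignment_type(unit) for unit in (table_5, table_6, table_8)]
print(labels)  # ['3-1', '2-2', '2-3']
```

Counting the labels over a whole chapter with type_counts would give a simple profile of how often a translator merges or splits sentences.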
D. Examples of Omission Captured through Sentence-Level
Alignment
Table 9: Italian / Spanish segment-level alignment for Chapter 7
Italian | Spanish 1858
Gertrude domandò sommessamente e tremando, che cosa dovesse fare. | Gertrudis con mucha timidez pidió la explicación de aquellas palabras y lo que debía hacer en consecuencia.
Il principe (non ci regge il cuore di dargli in questo momento il titolo di padre) non rispose direttamente, ma cominciò a parlare a lungo del fallo di Gertrude: e quelle parole frizzavano sull’animo della poveretta, come lo scorrere d’una mano ruvida sur una ferita. | (empty)
Continuò dicendo che, quand’anche... caso mai... che avesse avuto prima qualche intenzione di collocarla nel secolo, lei stessa ci aveva messo ora un ostacolo insuperabile; giacché a un cavalier d’onore, com’era lui, non sarebbe mai bastato l’animo di regalare a un galantuomo una signorina che aveva dato un tal saggio di sé. | Él continuó diciendo que... “á pesar de lo ocurrido... en el caso en que... hubiera sido con la intención de establecerse en el mundo, ella había contraído un lazo indisoluble y había creado un obstáculo invencible. Hombre de honor como era, jamás se habría atrevido á presentarla á ningún caballero después de tales antecedentes”.
«Ebbene, non si parli più del passato: tutto è cancellato. | En hora buena; no hablemos más de lo pasado: todo está olvidado ya.
Avete preso il solo partito onorevole, conveniente, che vi rimanesse; ma perché l’avete preso di buona voglia, e con buona maniera, tocca a me a farvelo riuscir gradito in tutto e per tutto: tocca a me a farne tornare tutto il vantaggio e tutto il merito sopra di voi. | (empty)
Ne prendo io la cura.» | (empty)
E. Examples of Omission Captured through Segment-Level
Alignment
Table 10: Italian / Spanish segment-level alignment for Chapter 7
Italian | Spanish 1858
«Brava! bene!» esclamarono, a una voce, la madre e il figlio, | — Muy bien, muy bien, exclamaron á la par madre é hijo.
e l’uno dopo l’altra abbracciaron Gertrude; la quale ricevette queste accoglienze con lacrime, | (empty)
che furono interpretate per lacrime di consolazione. | (empty)
Allora il principe si diffuse a spiegar ciò che farebbe per render lieta e splendida la sorte della figlia. | (empty)
Parlò delle distinzioni di cui goderebbe nel monastero e nel paese; | Entonces el príncipe habló de las distinciones que Gertrudis habría de tener en el convento y en el país.
che, là sarebbe come una principessa, come la rappresentante della famiglia; che, appena l’età l’avrebbe permesso, sarebbe innalzata alla prima dignità; e, intanto, non sarebbe soggetta che di nome. | (empty)
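In Tables 9 and 10, an omission shows up as a source row with no corresponding target text. The following is a minimal sketch of how such rows could be listed automatically. It assumes a hypothetical layout in which each aligned row is a (source text, target text) pair and an empty string marks a missing translation; the function names are illustrative and do not come from the paper's pipeline.

```python
from typing import List, Tuple

# Hypothetical layout (not the paper's data format): each aligned row is a
# (source text, target text) pair; an empty target marks a missing translation.
Row = Tuple[str, str]


def find_omissions(rows: List[Row]) -> List[int]:
    """Return indices of rows that have source text but no target text."""
    return [i for i, (source, target) in enumerate(rows)
            if source.strip() and not target.strip()]


def omitted_share(rows: List[Row]) -> float:
    """Share of source characters that fall in rows without a translation."""
    total = sum(len(source) for source, _ in rows)
    omitted = sum(len(rows[i][0]) for i in find_omissions(rows))
    return omitted / total if total else 0.0


# First three rows of Table 10 (Italian source, Spanish 1858 translation).
rows = [
    ("«Brava! bene!» esclamarono, a una voce, la madre e il figlio,",
     "— Muy bien, muy bien, exclamaron á la par madre é hijo."),
    ("e l’uno dopo l’altra abbracciaron Gertrude; la quale ricevette "
     "queste accoglienze con lacrime,", ""),
    ("che furono interpretate per lacrime di consolazione.", ""),
]

print(find_omissions(rows))  # [1, 2]
```

Measuring the omitted share in characters rather than rows is one possible design choice: it keeps a long omitted sentence from counting the same as a short omitted clause.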