Stylometry in Computer-Assisted Translation: Experiments on the Babylonian Talmud

Emiliano Giovannetti¹, Davide Albanesi¹, Andrea Bellandi¹, David Dattilo², Felice Dell’Orletta¹
¹ Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa
name.surname@ilc.cnr.it
² Progetto Traduzione Talmud Babilonese S.c.a r.l., Lungotevere Sanzio 9, 00153 Roma
david.dattilo@talmud.it

Abstract

The purpose of this research is to experiment with the application of stylometric techniques in the area of Computer-Assisted Translation, in order to reduce the revision effort in a collaborative, large-scale translation project. The results show a correlation between the extent of editing and compliance with some specific linguistic features, suggesting that supporting translators in writing translations that follow a desired style may reduce the number of subsequent interventions required of revisors, editors and curators and, consequently, save time.

1 Introduction

The Progetto Traduzione Talmud Babilonese¹ (PTTB) is a research and education project carrying out the digitized Italian translation of the Babylonian Talmud (BT), a fundamental book of the Jewish tradition covering every aspect of human knowledge: law, science, philosophy, religion and even aspects of everyday life. The translation of the Talmud has been assigned to more than 50 scholars comprising expert translators, trainee translators, instructors, editors and curators. The translated text is accompanied by explanations and comments on specific words and subjects, and by illustrative sheets on the various scientific, historical and linguistic topics addressed in the Talmudic discussions. However, the Project objectives include more than the translation of the Talmud: the whole work has been set up to be completely digital. Everything, from the initial assignment of chapters to translators to support for defining the final printing layout, revolves around Traduco, a collaborative web-based Computer-Assisted Translation (CAT) tool developed within the Project.

¹ www.talmud.it (last access: 25/07/2017)

Today, many CAT tools, both commercial and freely distributed, are already available, but they have been designed for the translation of technical manuals or domain-specific texts (legislative, medical) with the main purpose of speeding up the translation process.

The BT is a very complex text in many ways: its content, the different ancient languages it is composed of (mainly Babylonian Aramaic and Mishnaic Hebrew), and the history of its composition over the centuries. For these reasons, the approach we adopted for the development of Traduco had to take into account the needs of translators working on a text with very particular interpretative issues. Traduco allows a user to distinguish the literal part of the translation (in bold, see Fig.1) from explicative additions, included by translators to make the most difficult passages clearer to readers. Indeed, a full understanding of this kind of text requires the translation to be enriched with comments, notes, and glossary entries. Furthermore, due to the complexity of the inner structure of the BT, Traduco allows users to autonomously split their translations into “strings” (each typically representing a sentence, see Fig.1), gathered into “logical units”. Finally, Traduco provides a collaborative and training environment allowing a translator to instantly consult translations done by others when portions of text (and sometimes even a single word) are difficult to interpret and translate. For a comprehensive description of how Traduco works, refer to (Giovannetti et al., 2017). The size and complexity of the text, and the need to produce a printed version of the BT translation, required a team of users composed of translators, revisors, editors, curators and supervisors.

Figure 1: The life cycle of a translated string.

The whole translation workflow can be described by following the “life-cycle” of each string (Fig.1). It all starts as soon as the coordinator of the translation assigns a chapter to a specific translator: the first phase of the work, the translation, begins. The translation is carried out by scholars having two distinct profiles: expert translators, working autonomously, and trainee translators, who are constantly supported by instructors monitoring their work online and providing face-to-face lectures. Once the translation of a specific chapter is concluded, the revision phase starts. Revisors are chosen among the most expert scholars involved in the Project, and their main task is to verify that translators have correctly understood the meaning of each string. They also have to check that the domain terms (if present) have been appropriately annotated and explained in the relevant glossary entry. After the content has been revised, the editing starts. In this phase, a formal and linguistic check of the translation is carried out, in which the editors ensure that the translated strings are syntactically and orthographically correct. At the same time, each string can be enriched with pictures and tables, if needed to help in the understanding of the text. The last phase is the curatorship, during which one more general check of the translation is done before proceeding with the final exporting and printing of the volume. As we showed in a previous work (Bellandi et al., 2016), the introduction of Natural Language Processing techniques in CAT tools can bring concrete advantages to the translation work and pave the way to innovative research in the area of NLP for Digital Humanities.

One way to ease the translation of a text such as the BT is to assist translators in writing, in the first place, good translations requiring as few corrections as possible by revisors, editors and curators. In other words, we want to find a way of alerting a user who is about to submit a new translation by highlighting specific characteristics of the sentence that may later require a revision and, thus, slow down the overall translation process.

To do that, we chose to experiment with the application of stylometric measures to Italian translations. The assumption we would like to verify is that translations that are more compliant with the style of revisors will actually require fewer revisions. If this is demonstrated, we may develop a strategy to alert translators to potentially “unfit” translations and suggest a way to improve them, in order to minimize the subsequent changes made during revision, editing, and curatorship.

2 Background

Over the last ten years, Natural Language Processing (NLP) techniques combined with machine learning algorithms have been used to investigate the “form” of a text rather than its content. The range of tasks sharing this approach to the analysis of texts is wide, ranging e.g.
from native language identification (see, among others, (Koppel et al., 2005) and (Wong and Dras, 2009)), author recognition and verification (see e.g. (van Halteren, 2004)), authorship attribution (see (Juola, 2008) for a survey), genre identification (Mehler et al., 2011) to readability assessment (see (Dell’Orletta et al., 2014) for an updated survey) or tracking the evolution of written language competence (Richter et al., 2015). Besides obvious differences at the level of the considered task, they share a common approach: they succeed in determining the language variety, the author, the text genre or the level of readability of a text by exploiting the distribution of features automatically extracted from texts. To put it in van Halteren’s words (van Halteren, 2004), they carry out “linguistic profiling” of texts, i.e. “the occurrences of a large number of linguistic features in a text, either individual items or combinations of items, are counted” in order to determine “how much [...] they differ from the mean observed in a profile reference corpus”.

To the best of our knowledge, however, no research has been documented in the literature about the application of stylometric or readability techniques to Computer-Assisted Translation. For this reason, a comparison with existing approaches and results was not possible.

On the other hand, the use of stylometry and readability in translation studies is described in several works, especially in the analysis of literary texts (Heydel and Rybicki, 2012), (Kolahi and Shirvani, 2012), (Acar and İŞİSAĞ, 2017), (Huang, 2015), and some of them provide useful indications on how the personal writing style (be it, in our case, that of a translator or a revisor) can influence the final translation (Baker, 2000) and (Rybicki, 2012).

3 Methodology

To construct the dataset we exploited the versioning features of Traduco. Every version of most textual resources (currently: strings, notes, and glossary entries) is stored in the database. It is thus possible to compare earlier versions of translations (i.e. those inserted by translators) with the latest ones (i.e. those that have been completely revised) in order to analyse the differences between them. For the experiment, we built two datasets using textual segments of different granularity: blocks for the DSbl dataset and logical units for the DSlu dataset.

In more detail, each dataset has been built as a set of textual segment pairs extracted from the translations of the tractates Berakhot and Ta’anit, respectively composed, in their revised versions, of 216138 and 81696 tokens. Given a pair (s1, s2), the first component s1 represents the last translation of a block or logical unit inserted by the translator² and the second component s2 its very last version (i.e. the one following the revision, editing and curatorship phases). Concerning the size, DSbl was composed of 554 blocks and DSlu of 4303 logical units. Each logical unit is composed, on average, of 5.62 strings, while each string is composed, on average, of 12.5 tokens.

² Sometimes translators insert a draft version of a translation, to be completed later: for this reason we chose to take the last translation available.

Once the datasets were ready, we had to attribute to each pair a “revision measure” to quantify the difference between s1 and s2 in terms of both words and characters. For this purpose we chose to adopt the Levenshtein distance. Since Traduco is equipped with a spell checker, we assumed that the presence of typos should not impact the revision measure significantly.

As the next step, we investigated which linguistic features extracted from the s1 components of the pairs correlate with the revision measures. For this purpose, the considered texts were automatically POS tagged by the Part-Of-Speech tagger described in (Cimino and Dell’Orletta, 2016) and dependency parsed by the DeSR parser (Attardi et al., 2009) using a multilayer perceptron as the learning algorithm. For the specific concerns of this study, we focused on a wide set of features ranging across different levels of linguistic description, which are typically used in studies focusing on the “form” of a text, e.g. on issues of genre, style, authorship or readability. This represents a peculiarity of our approach: we resort to general features qualifying the lexical and grammatical characteristics of a text, rather than ad hoc features specifically selected for a given text type or task. The set of selected features is organised into four main categories defined on the basis of the different levels of linguistic analysis automatically carried out (tokenization, lemmatization, morphosyntactic tagging and dependency parsing): raw text features, lexical features, morpho-syntactic features and syntactic features.

To conclude our experiment, we applied Spearman’s rank correlation coefficient to assess the presence of a statistical dependence between our revision measures and the calculated linguistic features.

                                               DSlu            DSbl
Features                                   char   token    char   token
Number of tokens                           0.65   0.68     0.84   0.85
Arity of verbs                             0.62   0.64     0.83   0.83
Number of main verbs                       0.62   0.64     0.83   0.83
Number of prepositional ’chains’           0.57   0.60     0.81   0.82
Number of sentences                        0.49   0.53     0.80   0.80
Number of verb roots                       0.49   0.53     0.79   0.79
Number of subord clauses                   0.37   0.38     0.68   0.68
% of verbs with 5 syntactic dependent      -      -        0.37   0.36
% of first person singular of verbs        -      -        0.31   0.32
% of subjunctive auxiliary-verbs           -      -        0.31   0.30
% of locative modifier                     -      -        0.31   0.31
% of second person plural                  -      -        0.31   0.31
% of verb in infinitive mood               -      -        0.30   0.32
% of demonstrative determiner              -      -        0.30   -
% of ”balanced” punctuation                -      0.33     -      -
Average of length of dependency links      0.35   0.37     -      -
Longest dependency links                   0.34   0.34     -      -
Average of main verbs for sentence         0.33   0.32     -      -
Average length of subord clauses           0.31   0.31     -      -

Table 1: Spearman’s rank correlation coefficients (in bold with p < 0.001, otherwise with p < 0.05) calculated on both datasets and the two revision measures (distance per character and per token); values below 0.3 have been discarded.

4 Evaluation

The results (filtered by keeping just the features providing coefficients greater than or equal to 0.3) are summarized in Table 1. Apart from the expected correlations between the size of the texts (represented by raw text features such as “Number of tokens” and “Number of sentences”) and the revision measures, we found some significant correlations involving morphosyntactic and syntactic features. Most of the morphosyntactic features involve verbs: the presence of main verbs, the mood, the tense, etc.

Some of the syntactic features showing a correlation, such as the length of dependency links, the length of subordinate clauses and the number of prepositional chains, are particularly interesting. These linguistic features are typically used as indicators of linguistic complexity: indeed, portions of translated text made up of long and articulated syntactic structures appear to be more subject to revisions. As expected, the correlation of some of these syntactic features, such as the number of prepositional chains, appears to increase with the size of the analysed text (as in the blocks with respect to the logical units in the datasets), since the presence of deeper syntactic structures increases and the text, at least in principle, gets more linguistically complex.

5 Conclusions

The experiment described in this paper shows that the application of NLP to CAT contexts can open new research perspectives and, more importantly, may be of concrete help in real-world translation scenarios. The proposed methodology can be applied, in principle, to any translation project in which a revision phase is part of the translation workflow and where a history of the edits is maintained. The same analysis could be performed on different languages, depending solely on the availability of suitable NLP tools. Some of the NLP techniques adopted for the stylometric analysis of Italian may also be adapted to the processing of Mishnaic Hebrew and Aramaic (the main source languages). The automatic linguistic analysis of Mishnaic Hebrew, for example, is currently being experimented with (Pecchioli, 2017). However, an analysis of the style (or complexity) of the source text, though interesting from a historical text analysis perspective, would be pointless in the specific context of revision support in computer-assisted translation.

The correlation we found between the revision measures and some linguistic features (some of which are actually used as indicators of linguistic complexity) is the first step towards the design of a technique aimed at helping users write translations that are less prone to revisions. In this way, the whole translation workflow would benefit from a reduced time in the revision, editing and curatorship phases. Once the approach is defined, the related software will be implemented as a new component of Traduco. Moreover, the possibility of suggesting a way of writing “better” translations (at least with respect to the revisors’ style) will be exploited in the education of trainee translators.

6 Acknowledgment

This work was partially supported by the project TALMUD and carried out in the context of the scientific partnership between S.c.a r.l. “Progetto Traduzione del Talmud Babilonese” and ILC-CNR and on the basis of the regulations stated in the “Protocollo d’Intesa” (memorandum of understanding) between the Italian Presidency of the Council of Ministers, the Italian Ministry of Education, Universities and Research, the Union of Italian Jewish Communities, the Italian Rabbinical College, and the Italian National Research Council (21 January 2011).

References

Alpaslan Acar and Korkut Uluç İŞİSAĞ. 2017. Readability and comprehensibility in translation using reading ease and grade indices. International Journal of Comparative Literature and Translation Studies, 5(2):47–53.

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi, and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of the 2nd Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2009).

Mona Baker. 2000. Towards a methodology for investigating the style of a literary translator. Target. International Journal of Translation Studies, 12(2):241–266.

Andrea Bellandi, Giulia Benotto, Gianfranco Di Segni, and Emiliano Giovannetti. 2016. Investigating the application and evaluation of distributional semantics in the translation of humanistic texts: a case study. In Proceedings of the 2nd Workshop on Natural Language Processing for Translation Memories, pages 6–11.

Andrea Cimino and Felice Dell’Orletta. 2016. Building the state-of-the-art in POS tagging of Italian tweets. In Proceedings of Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2014. Assessing document and sentence readability in less resourced languages and across textual genres. In John Benjamins Publishing Company, editor, Recent Advances in Automatic Readability Assessment and Text Simplification. Special issue of International Journal of Applied Linguistics, 165:2, pages 163–193.

Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi, and Giulia Benotto. 2017. Traduco: A collaborative web-based CAT environment for the interpretation and translation of texts. Digital Scholarship in the Humanities, 32(suppl 1):i47–i62.

Magda Heydel and Jan Rybicki. 2012. The stylometry of collaborative translation. Woolf’s Night and Day in Polish. In Digital Humanities 2012 Conference Abstracts, pages 212–217.

Libo Huang. 2015. Readability as an indicator of self-translating style: A case study of Eileen Chang. In Style in Translation: A Corpus-Based Perspective, pages 95–111. Springer.

Patrick Juola. 2008. Authorship attribution. In Now Publishers Inc.

Sholeh Kolahi and Elaheh Shirvani. 2012. A comparative study of the readability of English textbooks of translation and their Persian translations. International Journal of Linguistics, 4(4):344.

Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Automatically determining an anonymous author’s native language. In Intelligence and Security Informatics, vol. 3495, LNCS, Springer-Verlag, pages 209–217.

Alexander Mehler, Serge Sharoff, and Marina Santini (Eds.). 2011. Genres on the web. Computational models and empirical studies. In Springer Series: Text, Speech and Language Technology.

Alessandra Pecchioli. 2017. Elaborazione del linguaggio naturale (NLP) in ebraico: il caso dell’analisi linguistica automatica applicata all’ebraico mishnaico del Talmud. Oral communication, XXXI Convegno AISG 2017 - Nuovi studi sull’Ebraismo, 4-6 settembre 2017, Ravenna, Italy.

Stefan Richter, Andrea Cimino, Felice Dell’Orletta, and Giulia Venturi. 2015. Tracking the evolution of written language competence: an NLP-based approach. In Cristina Bosco, Sara Tonelli, and Massimo Zanzotto, editors, Proceedings of the Second Italian Conference on Computational Linguistics - CLiC-it 2015, pages 236–240.

Jan Rybicki. 2012. The great mystery of the (almost) invisible translator. Quantitative Methods in Corpus-Based Translation Studies: A practical guide to descriptive translation research, 231:231–248.

Hans van Halteren. 2004. Linguistic profiling for author recognition and verification. In John Benjamins Publishing Company, editor, Proceedings of the Association for Computational Linguistics (ACL04), pages 200–207.

Sze-Meng Wong and Mark Dras. 2009. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association Workshop.
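
The two computational steps of the Methodology section (a Levenshtein-based revision measure per pair, followed by Spearman's rank correlation against a linguistic feature) can be illustrated with the minimal sketch below. It is not the implementation used in the experiment. It rests on assumptions that the paper does not state: tokens are obtained by whitespace splitting, distances are left unnormalised, ties are handled with average ranks, and no p-values are computed. All function names are hypothetical.

```python
from typing import Hashable, List, Sequence, Tuple


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance between two sequences (insert/delete/substitute, cost 1 each).

    Works on strings (character level) and on token lists (token level).
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def revision_measures(s1: str, s2: str) -> Tuple[int, int]:
    """Return (character-level, token-level) distance between s1 and s2.

    Assumption: tokens are whitespace-separated; the paper's tokenizer may differ.
    """
    return levenshtein(s1, s2), levenshtein(s1.split(), s2.split())


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either series is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(rx, ry))
    var_x = sum((x - mean_x) ** 2 for x in rx)
    var_y = sum((y - mean_y) ** 2 for y in ry)
    return cov / (var_x * var_y) ** 0.5
```

In such a setup, a feature computed on each s1 (for instance its number of tokens) would be passed as `xs` and the corresponding distances from `revision_measures(s1, s2)` as `ys`; features whose coefficient falls below 0.3 would then be discarded, as done for Table 1.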