Stylometry in Computer-Assisted Translation: Experiments on the
Babylonian Talmud

Emiliano Giovannetti (1), Davide Albanesi (1), Andrea Bellandi (1),
David Dattilo (2), Felice Dell’Orletta (1)

(1) Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa
    name.surname@ilc.cnr.it
(2) Progetto Traduzione Talmud Babilonese S.c.a r.l., Lungotevere Sanzio 9, 00153 Roma
    david.dattilo@talmud.it

Abstract

The purpose of this research is to experiment with the application of
stylometric techniques to Computer-Assisted Translation, in order to reduce
the revision effort in a collaborative, large-scale translation project. The
results show a correlation between the extent of editing and the compliance
with some specific linguistic features, suggesting that supporting translators
in writing translations that follow a desired style may reduce the number of
subsequent interventions (and, consequently, save time) by revisors, editors
and curators.

1   Introduction

The Progetto Traduzione Talmud Babilonese [1] (PTTB) is a research and
education project carrying out the digitized Italian translation of the
Babylonian Talmud (BT), a fundamental book of the Jewish tradition, covering
every aspect of human knowledge: law, science, philosophy, religion and even
aspects of everyday life. The translation of the Talmud has been assigned to
more than 50 scholars comprising expert translators, trainee translators,
instructors, editors and curators.

   [1] www.talmud.it (last access: 25/07/2017)

The translated text is accompanied by explanations and comments on specific
words and subjects, and also by illustrative sheets for the various
scientific, historical and linguistic topics addressed in the Talmudic
discussions. However, the Project objectives include more than the translation
of the Talmud: the whole work has been set up to be completely digital.
Everything, from the very first activity of assigning specific chapters to
translators to the support for defining the final print layout, revolves
around Traduco, a collaborative web-based Computer-Assisted Translation (CAT)
tool developed within the Project.

Today, many CAT tools, both commercial and freely distributed, are already
available, but they have been designed for the translation of technical
manuals or domain-specific texts (legislative, medical) with the main purpose
of speeding up the translation process.

The BT is a very complex text in many ways: its content, the different ancient
languages it is composed of (though mainly Babylonian Aramaic and Mishnaic
Hebrew), and the history of its composition over the centuries. For these
reasons, the approach we adopted for the development of Traduco had to take
into account the needs of translators working on a text with very particular
interpretative issues. Traduco allows a user to distinguish the literal part
of the translation (in bold, see Fig. 1) from explicative additions, included
by translators to make the most difficult passages clearer to readers. Indeed,
a full understanding of this kind of text requires a translation to be
enriched with comments, notes, and glossary entries. Furthermore, due to the
complexity of the inner structure of the BT, Traduco allows users to
autonomously split their translations into “strings” (each typically
representing a sentence, see Fig. 1), gathered into “logical units”. Finally,
Traduco provides a collaborative and training environment allowing a
translator to instantly consult translations done by others, when portions of
text (and sometimes even a single word) are difficult to interpret and
translate. For a comprehensive description of how Traduco works refer to
(Giovannetti et al., 2017). The size and complexity of the text, and the need
to produce a printed version of the BT translation, required a team of users
composed of translators, revisors, editors, curators and supervisors.

The whole translation workflow can be described by following the “life-cycle”
of each string (Fig. 1). It all starts as soon as the coordinator of the
translation assigns a chapter to a specific translator: the first phase of the
work, the translation, begins. The translation is carried out by scholars
having two distinct profiles: expert translators, working autonomously, and
trainee translators, the latter being constantly supported by instructors who
monitor their work online and provide face-to-face lectures. Once the
translation of a specific chapter is concluded, the revision phase starts.
Revisors are chosen among the most expert scholars involved in the Project and
their main task is to verify that translators have correctly understood the
meaning of each string. They also have to check that the domain terms (if
present) have been appropriately annotated and explained in the related
glossary entry. After the content has been revised, the editing starts. In
this phase, a formal and linguistic check of the translation is carried out,
in which the editors ensure that the translated strings are syntactically and
orthographically correct. At the same time, each string can be enriched with
pictures and tables, if needed to help in the understanding of the text. The
last phase is the curatorship, during which one more general check of the
translation is done before proceeding with the final exporting and printing of
the volume. As we showed in a previous work (Bellandi et al., 2016), the
introduction of Natural Language Processing techniques in CAT tools can bring
concrete advantages to the translation work and pave the way to innovative
research in the area of NLP for Digital Humanities.

Figure 1: The life cycle of a translated string.

One way to ease the translation of a text such as the BT is to assist
translators in writing, in the first place, good translations requiring as few
corrections as possible by revisors, editors and curators. In other words, we
want to find a way of alerting a user who is about to submit a new translation
by highlighting specific characteristics of the sentence that may later
require a revision and, thus, slow down the overall translation process.

To do that, we chose to experiment with the application of stylometric
measures to Italian translations. The assumption we would like to verify is
that translations that are more compliant with the style of revisors will
actually require fewer revisions. If this is demonstrated, we may develop a
strategy to alert translators to potentially “unfit” translations and suggest
a way to improve them, in order to minimize the subsequent edits made during
revision, editing, and curatorship.

2   Background

Over the last ten years, Natural Language Processing (NLP) techniques combined
with machine learning algorithms have been used to investigate the “form” of a
text rather than its content. The range of tasks sharing this approach to the
analysis of texts is wide, ranging from native language identification (see,
among others, (Koppel et al., 2005) and (Wong and Dras, 2009)), author
recognition and verification (see e.g. (van Halteren, 2004)), authorship
attribution (see (Juola, 2008) for a survey) and genre identification (Mehler
et al., 2011) to readability assessment (see (Dell’Orletta et al., 2014) for
an updated survey) or tracking the evolution of written language competence
(Richter et al., 2015). Besides obvious differences at the level of the
considered task, they share a common approach: they succeed in determining the
language variety, the author, the text genre or the level of readability of a
text by exploiting the distribution of features automatically extracted from
texts. In van Halteren’s words (van Halteren, 2004), they carry out
“linguistic profiling” of texts, i.e. “the occurrences of a large number of
linguistic features in a text, either individual items or combinations of
items, are counted” in order to determine “how much [...] they differ from the
mean observed in a profile reference corpus”.

To the best of our knowledge, however, no research has been documented in the
literature about the application of stylometric or readability techniques to
Computer-Assisted Translation. For this reason, a comparison with existing
approaches and results was not possible.

On the other hand, the use of stylometry and readability in translation
studies is described in several works, especially in the analysis of literary
texts (Heydel and Rybicki, 2012), (Kolahi and Shirvani, 2012), (Acar and
İŞİSAĞ, 2017), (Huang, 2015), and some of them provide useful indications on
how the personal writing style (be it, in our case, that of a translator or a
revisor) can influence the final translation (Baker, 2000) and (Rybicki,
2012).

3   Methodology

To construct the dataset we exploited the versioning features of Traduco.
Every version of most textual resources (currently: strings, notes, and
glossary entries) is stored in the database. It is thus possible to compare
earlier versions of translations (i.e. those inserted by translators) with the
latest ones (i.e. those that have been completely revised) in order to analyse
the differences between them. For the experiment, we built two datasets using
textual segments of different granularity: blocks for the DSbl dataset and
logical units for DSlu.

In more detail, each dataset has been built as a set of textual segment pairs
extracted from the translations of the tractates Berakhot and Ta’anit, which
are composed, in their revised versions, of 216138 and 81696 tokens
respectively. Given a pair (s1, s2), the first component s1 represents the
last translation of a block or logical unit inserted by the translator [2] and
the second component s2 its very last version (i.e. the one following the
revision, editing and curatorship phases). Concerning the size, DSbl was
composed of 554 blocks and DSlu of 4303 logical units. Each logical unit is
composed, on average, of 5.62 strings, while each string is composed, on
average, of 12.5 tokens.

   [2] Sometimes translators insert a draft version of a translation, to be
       completed later: for this reason we chose to take the last translation
       available.

Once the datasets were ready, we had to attribute to each pair a “revision
measure” to quantify the difference between s1 and s2 in terms of both words
and characters. For this purpose we chose to adopt the Levenshtein distance.
Since Traduco is equipped with a spell checker, we assumed that the presence
of typos should not significantly affect the revision measure.

As the next step, we investigated which linguistic features, extracted from
the texts in the s1 component of the pairs, correlate with the revision
measures. For this purpose, the considered texts were automatically POS tagged
by the Part-Of-Speech tagger described in (Cimino and Dell’Orletta, 2016) and
dependency parsed by the DeSR parser (Attardi et al., 2009) using a multilayer
perceptron as learning algorithm. For the specific concerns of this study, we
focused on a wide set of features ranging across different levels of
linguistic description, which are typically used in studies focusing on the
“form” of a text, e.g. on issues of genre, style, authorship or readability.
This represents a peculiarity of our approach: we resort to general features
qualifying the lexical and grammatical characteristics of a text, rather than
ad hoc features specifically selected for a given text type or task. The set
of selected features is organised into four main categories defined on the
basis of the different levels of linguistic analysis automatically carried out
(tokenization, lemmatization, morphosyntactic tagging and dependency parsing):
raw text features, lexical features, morphosyntactic features and syntactic
features.
                                                           DSlu            DSbl
features                                               char    token   char    token
Number of tokens                                       0.65    0.68    0.84    0.85
Arity of verbs                                         0.62    0.64    0.83    0.83
Number of main verbs                                   0.62    0.64    0.83    0.83
Number of prepositional “chains”                       0.57    0.60    0.81    0.82
Number of sentences                                    0.49    0.53    0.80    0.80
Number of verb roots                                   0.49    0.53    0.79    0.79
Number of subordinate clauses                          0.37    0.38    0.68    0.68
% of verbs with 5 syntactic dependents                  -       -      0.37    0.36
% of first person singular of verbs                     -       -      0.31    0.32
% of subjunctive auxiliary verbs                        -       -      0.31    0.30
% of locative modifiers                                 -       -      0.31    0.31
% of second person plural                               -       -      0.31    0.31
% of verbs in infinitive mood                           -       -      0.30    0.32
% of demonstrative determiners                          -       -      0.30     -
% of “balanced” punctuation                             -      0.33     -       -
Average length of dependency links                     0.35    0.37     -       -
Longest dependency link                                0.34    0.34     -       -
Average number of main verbs per sentence              0.33    0.32     -       -
Average length of subordinate clauses                  0.31    0.31     -       -

Table 1: Spearman’s rank correlation coefficients (in bold with p < 0.001, otherwise with p < 0.05)
calculated on both datasets and the two revision measures (distance per character and per token); values
below 0.3 have been discarded.
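
As an illustration of how coefficients like those in Table 1 can be obtained, the sketch below computes Spearman's rank correlation between one feature and one revision measure, then keeps only the features whose coefficient reaches the 0.3 threshold. It is a minimal pure-Python sketch, not the authors' implementation: the function names and the toy numbers are hypothetical, and the significance levels (p-values) reported in the table are not computed here.

```python
from math import sqrt
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, i.e. the Pearson correlation of the ranks.

    Assumes x and y have the same length and neither is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def select_features(features: Dict[str, Sequence[float]],
                    revision: Sequence[float],
                    threshold: float = 0.3) -> Dict[str, float]:
    """Keep the features whose coefficient is at least the threshold."""
    scores = {name: spearman(vals, revision) for name, vals in features.items()}
    return {name: rho for name, rho in scores.items() if rho >= threshold}


if __name__ == "__main__":
    # Toy values for four segments, invented for illustration only.
    toy_features = {"tokens": [5, 9, 14, 20], "noise": [3, 1, 4, 2]}
    toy_revision = [2, 4, 9, 15]
    print(select_features(toy_features, toy_revision))
```

Ranking with averaged ties matters for count-based features such as the number of sentences, where many segments share the same value.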

To conclude our experiment we applied Spearman’s rank correlation coefficient
to assess the presence of a statistical dependence between our revision
measures and the calculated linguistic features.

4   Evaluation

The results (filtered by keeping just the features providing coefficients
greater than or equal to 0.3) are summarized in Table 1. Apart from the
expected correlations between the size of the texts (represented by raw text
features such as “Number of tokens” and “Number of sentences”) and the
revision measures, we found some significant correlations involving
morphosyntactic and syntactic features. Most of the morphosyntactic features
involve verbs: the presence of main verbs, the mood, the tense, etc.

Some of the syntactic features showing a correlation, such as the length of
dependency links, the length of subordinate clauses and the number of
prepositional chains, are particularly interesting. These linguistic features
are typically used as indicators of linguistic complexity: indeed, portions of
translated text consisting of long and articulated syntactic structures appear
to be more subject to revision. As expected, the correlation of some of these
syntactic features, such as the number of prepositional chains, appears to
increase with the size of the analysed text (as with blocks compared to
logical units in our datasets), since the presence of deeper syntactic
structures increases and the text, at least in principle, gets more
linguistically complex.

5   Conclusions

The experiment described in this paper suggests that the application of NLP to
CAT contexts can open new research perspectives and, more importantly, may be
of concrete help in real translation scenarios. The proposed methodology can
be applied, in principle, to any translation project in which a revision phase
is part of the translation workflow and where a history of the edits is
maintained. The same analysis could be performed on different languages,
depending solely on the availability of suitable NLP tools. Some of the NLP
techniques adopted for the stylometric analysis of Italian may also be adapted
to the processing of Mishnaic Hebrew and Aramaic (the main source languages).
The automatic linguistic analysis of Mishnaic Hebrew, for example, is
currently being experimented with (Pecchioli, 2017). However, an analysis of
the style (or complexity) of the source text, though interesting from the
perspective of historical text analysis, would be pointless in the specific
context of revision support in computer-assisted translation.

The correlation we found between the revision measures and some linguistic
features (some of which are actually used as indicators of linguistic
complexity) is the first step towards the design of a technique aimed at
helping users write translations that are less prone to revision. In this way,
the whole translation workflow would benefit from a reduced time in the
revision, editing and curatorship phases. Once the approach is defined, the
related software will be implemented as a new component of Traduco. Moreover,
the possibility of suggesting a way of writing “better” translations (at least
with respect to the revisors’ style) will be exploited in the education of
trainee translators.

6   Acknowledgment

This work was partially supported by the project TALMUD and carried out in the
context of the scientific partnership between S.c.a r.l. “Progetto Traduzione
del Talmud Babilonese” and ILC-CNR and on the basis of the regulations stated
in the “Protocollo d’Intesa” (memorandum of understanding) between the Italian
Presidency of the Council of Ministers, the Italian Ministry of Education,
Universities and Research, the Union of Italian Jewish Communities, the
Italian Rabbinical College, and the Italian National Research Council (21
January 2011).

References

Alpaslan Acar and Korkut Uluç İŞİSAĞ. 2017. Readability and
  comprehensibility in translation using reading ease and grade indices.
  International Journal of Comparative Literature and Translation Studies,
  5(2):47–53.
Giuseppe Attardi, Felice Dell’Orletta, Maria Simi, and Joseph Turian. 2009.
  Accurate dependency parsing with a stacked multilayer perceptron. In
  Proceedings of the 2nd Evaluation Campaign of Natural Language Processing
  and Speech Tools for Italian, (EVALITA 2009).
Mona Baker. 2000. Towards a methodology for investigating the style of a
  literary translator. Target. International Journal of Translation Studies,
  12(2):241–266.
Andrea Bellandi, Giulia Benotto, Gianfranco Di Segni, and Emiliano
  Giovannetti. 2016. Investigating the application and evaluation of
  distributional semantics in the translation of humanistic texts: a case
  study. In Proceedings of the 2nd Workshop on Natural Language Processing
  for Translation Memories, pages 6–11.
Andrea Cimino and Felice Dell’Orletta. 2016. Building the state-of-the-art
  in POS tagging of Italian tweets. In Proceedings of Fifth Evaluation
  Campaign of Natural Language Processing and Speech Tools for Italian. Final
  Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.
Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2014.
  Assessing document and sentence readability in less resourced languages and
  across textual genres. In John Benjamins Publishing Company, editor, Recent
  Advances in Automatic Readability Assessment and Text Simplification.
  Special issue of International Journal of Applied Linguistics, 165:2, pages
  163–193.
Emiliano Giovannetti, Davide Albanesi, Andrea Bellandi, and Giulia Benotto.
  2017. Traduco: A collaborative web-based CAT environment for the
  interpretation and translation of texts. Digital Scholarship in the
  Humanities, 32(suppl 1):i47–i62.
Magda Heydel and Jan Rybicki. 2012. The stylometry of collaborative
  translation. Woolf’s Night and Day in Polish. In Digital Humanities 2012
  Conference Abstracts, pages 212–217.
Libo Huang. 2015. Readability as an indicator of self-translating style: A
  case study of Eileen Chang. In Style in Translation: A Corpus-Based
  Perspective, pages 95–111. Springer.
Patrick Juola. 2008. Authorship attribution. In Now Publishers Inc.
Sholeh Kolahi and Elaheh Shirvani. 2012. A comparative study of the
  readability of English textbooks of translation and their Persian
  translations. International Journal of Linguistics, 4(4):344.
Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Automatically
  determining an anonymous author’s native language. In Intelligence and
  Security Informatics, vol. 3495, LNCS, Springer–Verlag, pages 209–217.
Alexander Mehler, Serge Sharoff, and Marina Santini (Eds.). 2011. Genres on
  the Web. Computational models and empirical studies. In Springer Series:
  Text, Speech and Language Technology.
Alessandra Pecchioli. 2017. Elaborazione del linguaggio naturale (NLP) in
  ebraico: il caso dell’analisi linguistica automatica applicata all’ebraico
  mishnaico del Talmud. Oral communication, XXXI Convegno AISG 2017 - Nuovi
  studi sull’Ebraismo, 4-6 September 2017, Ravenna, Italy.
Stefan Richter, Andrea Cimino, Felice Dell’Orletta, and Giulia Venturi. 2015.
  Tracking the evolution of written language competence: an NLP-based
  approach. In Cristina Bosco, Sara Tonelli, and Massimo Zanzotto, editors,
  Proceedings of the Second Italian Conference on Computational Linguistics -
  CLiC-it 2015, pages 236–240.
Jan Rybicki. 2012. The great mystery of the (almost) invisible translator.
  Quantitative Methods in Corpus-Based Translation Studies: A practical guide
  to descriptive translation research, 231:231–248.
Hans van Halteren. 2004. Linguistic profiling for author recognition and
  verification. In John Benjamins Publishing Company, editor, Proceedings of
  the Association for Computational Linguistics (ACL04), pages 200–207.
Sze-Meng Wong and Mark Dras. 2009. Contrastive analysis and native language
  identification. In Proceedings of the Australasian Language Technology
  Association Workshop.