Constructing an Annotated Resource for Part-Of-Speech
                  Tagging of Mishnaic Hebrew
         Emiliano Giovannetti1 , Davide Albanesi1 , Andrea Bellandi1 ,
                      Simone Marchi1 , Alessandra Pecchioli2
       1
         Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa
                               name.surname@ilc.cnr.it
2
  Progetto Traduzione Talmud Babilonese S.c.a r.l., Lungotevere Sanzio 9, 00153 Roma
                                  alepec3@gmail.com

                  Abstract                    (in Italian, Progetto Traduzione Talmud Ba-
    English. This paper introduces the        bilonese - PTTB) which aims at the transla-
    research in Part-Of-Speech tagging of     tion of the Babylonian Talmud (BT) into Ital-
    mishnaic Hebrew carried out within        ian.
    the Babylonian Talmud Translation            The translation is being carried out with the
    Project. Since no tagged resource was     aid of tools for text and language processing
    available to train a stochastic POS       integrated into an application, called Traduco
    tagger, a portion of the Mishna of        (Bellandi et al., 2016), developed by the In-
    the Babylonian Talmud has been mor-       stitute of Computational Linguistics “Antonio
    phologically annotated using an ad        Zampolli” of the CNR in collaboration with
    hoc developed tool connected with the     the PTTB team. Traduco is a collaborative
    DB containing the talmudic text be-       computer-assisted translation (CAT) tool con-
    ing translated. The final aim of this     ceived to ease the translation, revision and
    research is to add a linguistic support   editing of the BT.
    to the Translation Memory System             The research described here fits exactly in
    of Traduco, the Computer-Assisted         this context: we want to provide the system
    Translation tool developed and used       with additional informative elements as a fur-
    within the Project.                       ther aid in the translation of the Talmud. In
                                              particular, we intend to linguistically analyze
    Italiano.        In questo articolo è
                                              the Talmudic text starting from the automatic
    introdotta la ricerca nel Part-Of-
                                              attribution of the Part-Of-Speech to words by
    Speech tagging dell’Ebraico mishnaico
                                              adopting a stochastic POS tagging approach.
    condotta nell’ambito del Progetto
                                                 The first difficulty that has emerged regards
    Traduzione Talmud Babilonese. Data
                                              the text and the languages it contains. In this
    l’indisponibilità di risorse annotate
                                              regard we can say, simplifying, that the Baby-
    necessarie per l’addestramento di un
                                              lonian Talmud is essentially composed of two
    POS tagger stocastico, una porzione di
                                              languages which, in turn, correspond to two
    Mishnà del Talmud Babilonese è stata
                                              distinct texts: the Mishna and the Gemara.
    annotata morfologicamente utilizzando
                                              The first is the oldest one written in mishnaic
    uno strumento sviluppato ad hoc
                                              Hebrew, one of the most homogeneous and
    collegato al DB dove risiede il testo
                                              coherent languages appearing in the Talmud
    talmudico in traduzione. L’obiettivo
                                              that, for this reason, has been chosen to start
    finale di questa ricerca è lo sviluppo
                                              from in the POS tagging experiment.
    di un supporto linguistico al sistema
                                                 The main purpose of linguistic analysis in
    di Memoria di Traduzione di Traduco,
                                              the context of our translation project is to
    lo strumento di traduzione assistita
                                              improve the suggestions provided by the sys-
    utilizzato nell’ambito del Progetto.
                                              tem through the so-called Translation Memory
                                              (TM).
1   Introduction
                                                 Moreover, on a linguistically annotated text
The present work has been conducted within    it is possible to carry out linguistic-based
the Babylonian Talmud Translation Project     searches, useful both for the scholar (in this
case a talmudist), and, during the translation    tury and that does not correspond to the id-
work, for the revisor and the curator, who        ioms recurring in the BT. Among them we cite
have the possibility, for example, to make bulk   HebTokenizer5 for tokenization, MILA (Bar-
editing of polysemous words by discarding out     haim et al., 2008), HebMorph6 , MorphTag-
words with undesired POS.                         ger 7 and NLPH8 for morphological analy-
   The rest of the paper is organized as fol-     sis and lemmatization, yap9 , hebdepparser10 ,
lows: Section 2 summarizes the state of the       UD_Hebrew11 for syntactic analysis. We con-
art in NLP of Hebrew. The construction of the     ducted some preliminary tests by starting with
linguistically annotated corpus is described in   MILA’s (ambiguous) morphological analyzer
Section 3. The training process and evaluation    applied to the three main languages of the Tal-
of the POS taggers used in the experiments is     mud:
detailed in Section 4. Lastly, Section 5 out-      1. Aramaic: Hebrew and Aramaic are differ-
lines the next steps of the research.                 ent languages. There are even some cases
2       State of the art                              in which the very same root has differ-
                                                      ent semantics in the two languages. In
The aforementioned linguistic richness and the        addition, MILA did not recognize many
intrinsic complexity of the Babylonian Talmud         aramaic roots, tagging the relative words,
make automatic linguistic analysis of the BT          derived from them, as proper nouns.
particularly hard (Bellandi et al., 2015).
   However, some linguistic resources of an-       2. Biblical Hebrew: MILA recognized most
cient Hebrew and Aramaic have been (and               of the words, since Modern Hebrew pre-
are being) developed, among which we cite: i)         served almost the entire biblical lexicon.
the Hebrew Text Database (Van Peursen and             However, syntax of Modern Hebrew is
Sikkel, 2014) (ETCBC) accessible by SHE-              quite different from that of Biblical He-
BANQ1 an online environment for the study             brew, leading MILA to output wrong
of Biblical Hebrew (with emphasis on syntax),         analyses.
developed by the Eep Talstra Centre for Bible      3. Mishnaic Hebrew: this is the language
and Computer of the Vrije Universiteit in Am-         where MILA performed better. Mod-
sterdam; ii) the Historical Dictionary2 project       ern Hebrew inherits some of the morpho-
of the Academy of the Hebrew Language of              syntactic features of mishnaic Hebrew,
Israel; iii) the Comprehensive Aramaic Lexi-          however, the two idioms differ substan-
con (CAL)3 developed by the Hebrew Union              tially on the lexicon, since in modern He-
College of Cincinnati; iv) the Digital Mishna4        brew many archaic words have been lost
project, concerning the creation of a digital         (Skolnik and Berenbaum, 2007).
scholarly edition of the Mishna conducted by
                                                  In the light of the above, we decided to create a
the Maryland Institute of Technology in the
                                                  novel linguistically annotated resource to start
Humanities.
                                                  developing our own tools for the processing of
   Apart from the aforementioned resources, to
                                                  ancient Jewish languages. In the next section,
date there are no available NLP tools suitable
                                                  we will describe how the resource was built.
for the processing of ancient north-western
Semitic languages, such as the different Ara-     3 Building the resource
maic idioms and the historical variants of He-
brew attested in the BT. The only existing        The linguistic annotation of Semitic languages
projects and tools for the processing of Jew-     poses several problems. Although we here dis-
ish languages (Kamir et al., 2002) (Cohen and     cuss the analysis of Hebrew, many of the criti-
Smith, 2007) have been developed for mod-         cal points that must be taken into account are
                                                    5
ern Hebrew, a language that has been artifi-          www.cs.bgu.ac.il/∼yoavg/software/hebtokenizer
                                                    6
cially revitalized from the end of the XIX cen-       code972.com/hebmorph
                                                    7
                                                      www.cs.technion.ac.il/∼barhaim/MorphTagger
    1                                               8
      shebanq.ancient-data.org                        github.com/NLPH/NLPH
    2                                               9
      maagarim.hebrew-academy.org.il                  github.com/habeanf/yap
    3                                              10
      cal.huc.edu                                     tinyurl.com/hebdepparser
    4                                              11
      www.digitalmishnah.org                          github.com/UniversalDependencies/UD_Hebrew
common to other languages belonging to the           both, in particular: i) the definition of the to-
same family. As already mentioned in the pre-        ken from a syntagmatic perspective (i.e. what
vious section, the first problem concerns the        the token represents in context) and ii) the lex-
access to existing linguistic resources and ana-     ical information that the token gives by itself
lytical tools which, in the case of Hebrew, are      (without context). To give a couple of exam-
available exclusively for the modern language.       ples:
   One of the major challenges posed by the
morphological analysis of Semitic languages           • Verb/noun: ‫ → ִא ְשּתּ אֶ ת הַ מַ ִדיר‬is ‫“ הַ מַ ִדיר‬the
is the orthographic disambiguation of words.            one who makes a vow” or “the vowing”?
Since writing is almost exclusively consonan-           (the one who consecrates his wife): should
tal, every word can have multiple readings.             it be assigned to verb or noun category?
The problem of orthographic ambiguity, cru-
cial in all studies on large corpora (typically in    • Adjective/verb: ‫עַ ד וְּ ִלגְּ מ ר ְלּהַ ְתּ ִחיל יְּ כ ִלין ִאם‬
                                                        ‫שּׁוּרה יַגִ יעוּ שֶ ל ֹא‬
                                                                           ָ ַ‫ ל‬- ‫ → י ְַתּ ִחילוּ‬is ‫ יְּ כ ִלין‬adjec-
Hebrew and modern Arabic), does not prove
to be so difficult when the text under exami-           tive or verb (given that most of the mish-
nation is vocalized.                                    naic language dictionaries provide both
                                                        options)?
   The edition of the Talmud used in the
project is actually vocalized and the text, con-
                                                     We could discuss about which category would
sequently, is orthographically unambiguous.
                                                     be the best for each and why, but, for now,
An additional critical aspect is represented by
                                                     we decided to keep both by introducing two
the definition of the tagset. Most of the com-
                                                     parallel annotations, by “category” (without
putational studies on language analysis have
                                                     context) and by “function” (in context). The
been conducted on Indo-european languages
                                                     tagset we used for this work are the follow-
(especially on English).
                                                     ing: agg., avv., cong., interiez., nome pr., num.
   As a result, it may be difficult to reuse         card., num. ord., pref. art., pref. cong., pref.
tagsets created for these languages.          Not    prep., pref. pron. rel., prep., pron. dim., pron.
surprisingly, there are still many discussions       indef., pron. interr., pron. pers., pron. suff.,
about how it is better to catalog some POS           punt., sost., vb.
and each language has its own part under dis-           One could also envisage the refining of the
cussion. Each tagset must ultimately be cre-         tagset by adding: interrogative, modal, nega-
ated in the light of a specific purpose. For         tion, and quantifier (Adler, 2007) (Netzer and
example, the tagging of the (Modern) Hebrew          Elhadad, 1998) (Netzer et al., 2007).
Treebank developed at the Technion (Sima’an
                                                        As anticipated, in order to build the mor-
et al., 2001) was syntax-oriented, while the
                                                     phologically annotated resource, all of the
work on participles of Hebrew described in
                                                     Mishna sentences were extracted from the Tal-
(Adler et al., 2008) was more lexicon-oriented.
                                                     mud and annotated using an ad hoc developed
We considered the idea of adopting the tagset
                                                     Web application (Fig. 1).
used in the already cited Universal Depen-
                                                        All the annotations have been made with
dency Corpus for Hebrew. However, its 16
                                                     the aim of training a stochastic POS tagger in
tags appeared to be too “coarse grained” for
                                                     charge of the automatic analysis of the entire
our purposes.12 In particular, the UD tagset
                                                     Mishna: to obtain a good accuracy it was thus
lacks of all the prefix tags that we needed.
                                                     necessary to manually annotate as many sen-
For this reason we decided to define our own
                                                     tences as possible. To date, 10442 tokens have
tagset.
                                                     been annotated.
   Once the tagset has been defined, it remains
                                                        The software created for the annotation
to decide which is the most suitable grammati-
                                                     shows, in a tabular form, the information of
cal category to associate with each token. You
                                                     the analysis carried out on a sentence by sen-
can collect essentially two types of informa-
                                                     tence basis.
tion, the problem is how and if you can keep
                                                        The system, once a sentence is selected for
 12
    github.com/UniversalDependencies/UD_Hebrew-      annotation, checks whether the tokens com-
HTB/blob/master/stats.xml                            posing it have already been analyzed and, in
Figure 1: The interface for the linguistic annotation of the corpus to be used to train the POS
tagger


case, calculates a possible subdivision into sub-               Tagging Accuracy
tokens (i.e. the stems, prefixes and suffixes              Stanford Hunpos Treetagger
constituting each word) by exploiting previous             87,90% 86,34%     86,74%
annotations. If the system finds that a word is
associated with multiple different annotations,     Table 1: Accuracy of the three POS taggers.
it proposes the most frequent one.
   Regarding the linguistic annotation, the         Stanford POS tagger provided the best results
grammar of Pérez Fernández (Fernández and           over HunPos and Treetagger, with an accuracy
Elwolde, 1999) was adopted and, for lemmati-        of 87,9%.
zation, the dictionary of M. Jastrow (Jastrow,
1971).
                                                    5 Next steps
   The software allows to gather as much infor-
mation as possible for each word by providing       In this work, the tagging experiments have
a double annotation: by “category” to rep-          been limited to the attribution of the Part-
resent the POS from a grammatical point of          Of-Speech: the next, natural step, will be the
view, and by “function” to describe the func-       addition of the lemma. Furthermore, we will
tion the word assumes in its context. For the       try to modify the parameters affecting the be-
POS tagging experiments, described below, we        haviour of the three adopted POS taggers (left
used the annotation made by “function”.             at their default values for the experiments)
                                                    and see how they influence the results.
4   Training and testing of POS
                                                       Once the Mishna will be lemmatized, Tra-
    taggers
                                                    duco, the software used to translate the Tal-
Once the mishnaic corpus has been linguisti-        mud in Italian, will be able to exploit this ad-
cally annotated three of the most used algo-        ditional information mainly to provide trans-
rithms for POS tagging have been used and           lators with translation suggestions based on
evaluated: HunPos (Halácsy et al., 2007),           lemmas, but also to allow users to query the
the Stanford Log-linear Part-Of-Speech Tag-         mishnaic text by POS and lemma.
ger (Toutanova et al., 2003), and TreeTagger           As a further step we will also take into
(Schmid, 1994). The three algorithms imple-         account the linguistic annotation of portions
ment supervised stochastic models and, conse-       of the Babylonian Talmud written in other
quently, they need to be trained with a man-        languages, starting from the Babylonian Ara-
ually annotated corpus.                             maic, the language of the Gemara, which con-
   To evaluate the accuracy of the algorithms       stitutes the earlier portion of the Talmud.
we adopted the strategy of k-fold cross valida-
tion (Brink et al., 2016), with k set to 10, and    Acknowledgments
thus dividing the corpus in 10 partitions.
   Table 1 summarizes the results of the ex-        This work was conducted in the context of the
periment by showing the tagging accuracy of         TALMUD project and the scientific coopera-
the three tested algorithms. With a number of       tion between S.c.a r.l. PTTB and ILC-CNR.
tokens slightly higher than ten thousands the
References                                           Dror Kamir, Naama Soreq, and Yoni Neeman.
                                                       2002. A Comprehensive NLP System for Mod-
Meni Adler, Yael Netzer, Yoav Goldberg, David          ern Standard Arabic and Modern Hebrew. In
 Gabay, and Michael Elhadad. 2008. Tag-                Proceedings of the ACL-02 Workshop on Com-
 ging a hebrew corpus: the case of partici-            putational Approaches to Semitic Languages,
 ples. In Nicoletta Calzolari (Conference Chair),      SEMITIC ’02, pages 1–9, Stroudsburg, PA,
 Khalid Choukri, Bente Maegaard, Joseph                USA. Association for Computational Linguis-
 Mariani, Jan Odijk, Stelios Piperidis, and            tics.
 Daniel Tapias, editors, Proceedings of the
 Sixth International Conference on Language          Yael Dahan Netzer and Michael Elhadad. 1998.
 Resources and Evaluation (LREC’08), Mar-              Generating Determiners and Quantifiers in He-
 rakech, Morocco, may. European Language Re-           brew. In Proceedings of the Workshop on Com-
 sources Association (ELRA). http://www.lrec-          putational Approaches to Semitic Languages,
 conf.org/proceedings/lrec2008/.                       Semitic ’98, pages 89–96, Stroudsburg, PA,
                                                       USA. Association for Computational Linguis-
Menahem Meni Adler. 2007. Hebrew Morphologi-           tics.
 cal Disambiguation: An Unsupervised Stochastic
 Word-based Approach. PhD Thesis, Ben-Gurion         Yael Netzer, Meni Adler, David Gabay, and
 University of the Negev.                              Michael Elhadad. 2007. Can You Tag the
                                                       Modal? You Should. In Proceedings of the
Roy Bar-haim, Khalil Sima’an, and Yoad Winter.         2007 Workshop on Computational Approaches
  2008. Part-of-speech Tagging of Modern Hebrew        to Semitic Languages: Common Issues and Re-
  Text. Nat. Lang. Eng., 14(2):223–251, April.         sources, pages 57–64, Prague, Czech Republic.
Andrea Bellandi, Alessia Bellusci, and Emiliano        Association for Computational Linguistics.
  Giovannetti. 2015. Computer Assisted Trans-        Helmut Schmid. 1994. Part-of-speech tagging with
  lation of Ancient Texts: the Babylonian Tal-         neural networks. In Proceedings of the 15th Con-
  mud Case Study. In Natural Language Pro-             ference on Computational Linguistics - Volume
  cessing and Cognitive Science, Proceedings 2014,     1, COLING ’94, pages 172–176, Stroudsburg,
  Berlin/Munich. De Gruyter Saur.                      PA, USA. Association for Computational Lin-
Andrea Bellandi, Davide Albanesi, Giulia Benotto,      guistics.
  and Emiliano Giovannetti. 2016. Il Sistema         Khalil Sima’an, Alon Itai, Yoad Winter, Alon Alt-
  Traduco nel Progetto Traduzione del Talmud Ba-       man, and Noa Nativ. 2001. Building a tree-
  bilonese. IJCoL Vol. 2, n. 2, December 2016.         bank of modern hebrew text. TAL. Traitement
  Special Issue on ”NLP and Digital Humanities”.       automatique des langues, 42(2):347–380.
  Accademia University Press.
                                                     Fred Skolnik and Michael Berenbaum, editors.
Henrik Brink, Joseph Richards, and Mark                2007. Encyclopaedia Judaica vol. 8. Ency-
  Fetherolf. 2016. Real-World Machine Learn-           clopaedia Judaica. Macmillan Reference USA, 2
  ing. Manning Publications Co., Greenwich, CT,        edition. Brovender Chaim and Blau Joshua and
  USA, 1st edition.                                    Kutscher Eduard Y. and Breuer Yochanan and
Shay B. Cohen and Noah A. Smith. 2007. Joint           Eytan Eli sub v. “Hebrew Language”.
  Morphological and Syntactic Disambiguation.
                                                     Kristina Toutanova, Dan Klein, Christopher D.
  In Proceedings of the 2007 Joint Conference on
                                                       Manning, and Yoram Singer. 2003. Feature-
  Empirical Methods in Natural Language Process-       rich Part-of-speech Tagging with a Cyclic De-
  ing and Computational Natural Language Learn-        pendency Network. In Proceedings of the 2003
  ing (EMNLP-CoNLL).                                   Conference of the North American Chapter of
Miguel Pérez Fernández and John F. Elwolde.            the Association for Computational Linguistics
  1999.    An Introductory Grammar of Rab-             on Human Language Technology - Volume 1,
  binic Hebrew. Interactive Factory, Leiden, The       NAACL ’03, pages 173–180, Stroudsburg, PA,
  Netherlands.                                         USA. Association for Computational Linguis-
                                                       tics.
Péter Halácsy, András Kornai, and Csaba Oravecz.
  2007. HunPos: An Open Source Trigram Tag-          Wido Van Peursen and Constantijn Sikkel. 2014.
  ger. In Proceedings of the 45th Annual Meet-        Hebrew Text Database ETCBC4. type: dataset.
  ing of the ACL on Interactive Poster and
  Demonstration Sessions, ACL ’07, pages 209–
  212, Stroudsburg, PA, USA. Association for
  Computational Linguistics.
Marcus Jastrow. 1971. A dictionary of the Tar-
 gumim, the Talmud Babli and Yerushalmi, and
 the Midrashic literature. Judaica Press.