Constructing an Annotated Resource for Part-Of-Speech Tagging of Mishnaic Hebrew Emiliano Giovannetti1 , Davide Albanesi1 , Andrea Bellandi1 , Simone Marchi1 , Alessandra Pecchioli2 1 Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa name.surname@ilc.cnr.it 2 Progetto Traduzione Talmud Babilonese S.c.a r.l., Lungotevere Sanzio 9, 00153 Roma alepec3@gmail.com Abstract (in Italian, Progetto Traduzione Talmud Ba- English. This paper introduces the bilonese - PTTB) which aims at the transla- research in Part-Of-Speech tagging of tion of the Babylonian Talmud (BT) into Ital- mishnaic Hebrew carried out within ian. the Babylonian Talmud Translation The translation is being carried out with the Project. Since no tagged resource was aid of tools for text and language processing available to train a stochastic POS integrated into an application, called Traduco tagger, a portion of the Mishna of (Bellandi et al., 2016), developed by the In- the Babylonian Talmud has been mor- stitute of Computational Linguistics “Antonio phologically annotated using an ad Zampolli” of the CNR in collaboration with hoc developed tool connected with the the PTTB team. Traduco is a collaborative DB containing the talmudic text be- computer-assisted translation (CAT) tool con- ing translated. The final aim of this ceived to ease the translation, revision and research is to add a linguistic support editing of the BT. to the Translation Memory System The research described here fits exactly in of Traduco, the Computer-Assisted this context: we want to provide the system Translation tool developed and used with additional informative elements as a fur- within the Project. ther aid in the translation of the Talmud. In particular, we intend to linguistically analyze Italiano. In questo articolo è the Talmudic text starting from the automatic introdotta la ricerca nel Part-Of- attribution of the Part-Of-Speech to words by Speech tagging dell’Ebraico mishnaico adopting a stochastic POS tagging approach. condotta nell’ambito del Progetto The first difficulty that has emerged regards Traduzione Talmud Babilonese. Data the text and the languages it contains. In this l’indisponibilità di risorse annotate regard we can say, simplifying, that the Baby- necessarie per l’addestramento di un lonian Talmud is essentially composed of two POS tagger stocastico, una porzione di languages which, in turn, correspond to two Mishnà del Talmud Babilonese è stata distinct texts: the Mishna and the Gemara. annotata morfologicamente utilizzando The first is the oldest one written in mishnaic uno strumento sviluppato ad hoc Hebrew, one of the most homogeneous and collegato al DB dove risiede il testo coherent languages appearing in the Talmud talmudico in traduzione. L’obiettivo that, for this reason, has been chosen to start finale di questa ricerca è lo sviluppo from in the POS tagging experiment. di un supporto linguistico al sistema The main purpose of linguistic analysis in di Memoria di Traduzione di Traduco, the context of our translation project is to lo strumento di traduzione assistita improve the suggestions provided by the sys- utilizzato nell’ambito del Progetto. tem through the so-called Translation Memory (TM). 1 Introduction Moreover, on a linguistically annotated text The present work has been conducted within it is possible to carry out linguistic-based the Babylonian Talmud Translation Project searches, useful both for the scholar (in this case a talmudist), and, during the translation tury and that does not correspond to the id- work, for the revisor and the curator, who ioms recurring in the BT. Among them we cite have the possibility, for example, to make bulk HebTokenizer5 for tokenization, MILA (Bar- editing of polysemous words by discarding out haim et al., 2008), HebMorph6 , MorphTag- words with undesired POS. ger 7 and NLPH8 for morphological analy- The rest of the paper is organized as fol- sis and lemmatization, yap9 , hebdepparser10 , lows: Section 2 summarizes the state of the UD_Hebrew11 for syntactic analysis. We con- art in NLP of Hebrew. The construction of the ducted some preliminary tests by starting with linguistically annotated corpus is described in MILA’s (ambiguous) morphological analyzer Section 3. The training process and evaluation applied to the three main languages of the Tal- of the POS taggers used in the experiments is mud: detailed in Section 4. Lastly, Section 5 out- 1. Aramaic: Hebrew and Aramaic are differ- lines the next steps of the research. ent languages. There are even some cases 2 State of the art in which the very same root has differ- ent semantics in the two languages. In The aforementioned linguistic richness and the addition, MILA did not recognize many intrinsic complexity of the Babylonian Talmud aramaic roots, tagging the relative words, make automatic linguistic analysis of the BT derived from them, as proper nouns. particularly hard (Bellandi et al., 2015). However, some linguistic resources of an- 2. Biblical Hebrew: MILA recognized most cient Hebrew and Aramaic have been (and of the words, since Modern Hebrew pre- are being) developed, among which we cite: i) served almost the entire biblical lexicon. the Hebrew Text Database (Van Peursen and However, syntax of Modern Hebrew is Sikkel, 2014) (ETCBC) accessible by SHE- quite different from that of Biblical He- BANQ1 an online environment for the study brew, leading MILA to output wrong of Biblical Hebrew (with emphasis on syntax), analyses. developed by the Eep Talstra Centre for Bible 3. Mishnaic Hebrew: this is the language and Computer of the Vrije Universiteit in Am- where MILA performed better. Mod- sterdam; ii) the Historical Dictionary2 project ern Hebrew inherits some of the morpho- of the Academy of the Hebrew Language of syntactic features of mishnaic Hebrew, Israel; iii) the Comprehensive Aramaic Lexi- however, the two idioms differ substan- con (CAL)3 developed by the Hebrew Union tially on the lexicon, since in modern He- College of Cincinnati; iv) the Digital Mishna4 brew many archaic words have been lost project, concerning the creation of a digital (Skolnik and Berenbaum, 2007). scholarly edition of the Mishna conducted by In the light of the above, we decided to create a the Maryland Institute of Technology in the novel linguistically annotated resource to start Humanities. developing our own tools for the processing of Apart from the aforementioned resources, to ancient Jewish languages. In the next section, date there are no available NLP tools suitable we will describe how the resource was built. for the processing of ancient north-western Semitic languages, such as the different Ara- 3 Building the resource maic idioms and the historical variants of He- brew attested in the BT. The only existing The linguistic annotation of Semitic languages projects and tools for the processing of Jew- poses several problems. Although we here dis- ish languages (Kamir et al., 2002) (Cohen and cuss the analysis of Hebrew, many of the criti- Smith, 2007) have been developed for mod- cal points that must be taken into account are 5 ern Hebrew, a language that has been artifi- www.cs.bgu.ac.il/∼yoavg/software/hebtokenizer 6 cially revitalized from the end of the XIX cen- code972.com/hebmorph 7 www.cs.technion.ac.il/∼barhaim/MorphTagger 1 8 shebanq.ancient-data.org github.com/NLPH/NLPH 2 9 maagarim.hebrew-academy.org.il github.com/habeanf/yap 3 10 cal.huc.edu tinyurl.com/hebdepparser 4 11 www.digitalmishnah.org github.com/UniversalDependencies/UD_Hebrew common to other languages belonging to the both, in particular: i) the definition of the to- same family. As already mentioned in the pre- ken from a syntagmatic perspective (i.e. what vious section, the first problem concerns the the token represents in context) and ii) the lex- access to existing linguistic resources and ana- ical information that the token gives by itself lytical tools which, in the case of Hebrew, are (without context). To give a couple of exam- available exclusively for the modern language. ples: One of the major challenges posed by the morphological analysis of Semitic languages • Verb/noun: ‫ → ִא ְשּתּ אֶ ת הַ מַ ִדיר‬is ‫“ הַ מַ ִדיר‬the is the orthographic disambiguation of words. one who makes a vow” or “the vowing”? Since writing is almost exclusively consonan- (the one who consecrates his wife): should tal, every word can have multiple readings. it be assigned to verb or noun category? The problem of orthographic ambiguity, cru- cial in all studies on large corpora (typically in • Adjective/verb: ‫עַ ד וְּ ִלגְּ מ ר ְלּהַ ְתּ ִחיל יְּ כ ִלין ִאם‬ ‫שּׁוּרה יַגִ יעוּ שֶ ל ֹא‬ ָ ַ‫ ל‬- ‫ → י ְַתּ ִחילוּ‬is ‫ יְּ כ ִלין‬adjec- Hebrew and modern Arabic), does not prove to be so difficult when the text under exami- tive or verb (given that most of the mish- nation is vocalized. naic language dictionaries provide both options)? The edition of the Talmud used in the project is actually vocalized and the text, con- We could discuss about which category would sequently, is orthographically unambiguous. be the best for each and why, but, for now, An additional critical aspect is represented by we decided to keep both by introducing two the definition of the tagset. Most of the com- parallel annotations, by “category” (without putational studies on language analysis have context) and by “function” (in context). The been conducted on Indo-european languages tagset we used for this work are the follow- (especially on English). ing: agg., avv., cong., interiez., nome pr., num. As a result, it may be difficult to reuse card., num. ord., pref. art., pref. cong., pref. tagsets created for these languages. Not prep., pref. pron. rel., prep., pron. dim., pron. surprisingly, there are still many discussions indef., pron. interr., pron. pers., pron. suff., about how it is better to catalog some POS punt., sost., vb. and each language has its own part under dis- One could also envisage the refining of the cussion. Each tagset must ultimately be cre- tagset by adding: interrogative, modal, nega- ated in the light of a specific purpose. For tion, and quantifier (Adler, 2007) (Netzer and example, the tagging of the (Modern) Hebrew Elhadad, 1998) (Netzer et al., 2007). Treebank developed at the Technion (Sima’an As anticipated, in order to build the mor- et al., 2001) was syntax-oriented, while the phologically annotated resource, all of the work on participles of Hebrew described in Mishna sentences were extracted from the Tal- (Adler et al., 2008) was more lexicon-oriented. mud and annotated using an ad hoc developed We considered the idea of adopting the tagset Web application (Fig. 1). used in the already cited Universal Depen- All the annotations have been made with dency Corpus for Hebrew. However, its 16 the aim of training a stochastic POS tagger in tags appeared to be too “coarse grained” for charge of the automatic analysis of the entire our purposes.12 In particular, the UD tagset Mishna: to obtain a good accuracy it was thus lacks of all the prefix tags that we needed. necessary to manually annotate as many sen- For this reason we decided to define our own tences as possible. To date, 10442 tokens have tagset. been annotated. Once the tagset has been defined, it remains The software created for the annotation to decide which is the most suitable grammati- shows, in a tabular form, the information of cal category to associate with each token. You the analysis carried out on a sentence by sen- can collect essentially two types of informa- tence basis. tion, the problem is how and if you can keep The system, once a sentence is selected for 12 github.com/UniversalDependencies/UD_Hebrew- annotation, checks whether the tokens com- HTB/blob/master/stats.xml posing it have already been analyzed and, in Figure 1: The interface for the linguistic annotation of the corpus to be used to train the POS tagger case, calculates a possible subdivision into sub- Tagging Accuracy tokens (i.e. the stems, prefixes and suffixes Stanford Hunpos Treetagger constituting each word) by exploiting previous 87,90% 86,34% 86,74% annotations. If the system finds that a word is associated with multiple different annotations, Table 1: Accuracy of the three POS taggers. it proposes the most frequent one. Regarding the linguistic annotation, the Stanford POS tagger provided the best results grammar of Pérez Fernández (Fernández and over HunPos and Treetagger, with an accuracy Elwolde, 1999) was adopted and, for lemmati- of 87,9%. zation, the dictionary of M. Jastrow (Jastrow, 1971). 5 Next steps The software allows to gather as much infor- mation as possible for each word by providing In this work, the tagging experiments have a double annotation: by “category” to rep- been limited to the attribution of the Part- resent the POS from a grammatical point of Of-Speech: the next, natural step, will be the view, and by “function” to describe the func- addition of the lemma. Furthermore, we will tion the word assumes in its context. For the try to modify the parameters affecting the be- POS tagging experiments, described below, we haviour of the three adopted POS taggers (left used the annotation made by “function”. at their default values for the experiments) and see how they influence the results. 4 Training and testing of POS Once the Mishna will be lemmatized, Tra- taggers duco, the software used to translate the Tal- Once the mishnaic corpus has been linguisti- mud in Italian, will be able to exploit this ad- cally annotated three of the most used algo- ditional information mainly to provide trans- rithms for POS tagging have been used and lators with translation suggestions based on evaluated: HunPos (Halácsy et al., 2007), lemmas, but also to allow users to query the the Stanford Log-linear Part-Of-Speech Tag- mishnaic text by POS and lemma. ger (Toutanova et al., 2003), and TreeTagger As a further step we will also take into (Schmid, 1994). The three algorithms imple- account the linguistic annotation of portions ment supervised stochastic models and, conse- of the Babylonian Talmud written in other quently, they need to be trained with a man- languages, starting from the Babylonian Ara- ually annotated corpus. maic, the language of the Gemara, which con- To evaluate the accuracy of the algorithms stitutes the earlier portion of the Talmud. we adopted the strategy of k-fold cross valida- tion (Brink et al., 2016), with k set to 10, and Acknowledgments thus dividing the corpus in 10 partitions. Table 1 summarizes the results of the ex- This work was conducted in the context of the periment by showing the tagging accuracy of TALMUD project and the scientific coopera- the three tested algorithms. With a number of tion between S.c.a r.l. PTTB and ILC-CNR. tokens slightly higher than ten thousands the References Dror Kamir, Naama Soreq, and Yoni Neeman. 2002. A Comprehensive NLP System for Mod- Meni Adler, Yael Netzer, Yoav Goldberg, David ern Standard Arabic and Modern Hebrew. In Gabay, and Michael Elhadad. 2008. Tag- Proceedings of the ACL-02 Workshop on Com- ging a hebrew corpus: the case of partici- putational Approaches to Semitic Languages, ples. In Nicoletta Calzolari (Conference Chair), SEMITIC ’02, pages 1–9, Stroudsburg, PA, Khalid Choukri, Bente Maegaard, Joseph USA. Association for Computational Linguis- Mariani, Jan Odijk, Stelios Piperidis, and tics. Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Yael Dahan Netzer and Michael Elhadad. 1998. Resources and Evaluation (LREC’08), Mar- Generating Determiners and Quantifiers in He- rakech, Morocco, may. European Language Re- brew. In Proceedings of the Workshop on Com- sources Association (ELRA). http://www.lrec- putational Approaches to Semitic Languages, conf.org/proceedings/lrec2008/. Semitic ’98, pages 89–96, Stroudsburg, PA, USA. Association for Computational Linguis- Menahem Meni Adler. 2007. Hebrew Morphologi- tics. cal Disambiguation: An Unsupervised Stochastic Word-based Approach. PhD Thesis, Ben-Gurion Yael Netzer, Meni Adler, David Gabay, and University of the Negev. Michael Elhadad. 2007. Can You Tag the Modal? You Should. In Proceedings of the Roy Bar-haim, Khalil Sima’an, and Yoad Winter. 2007 Workshop on Computational Approaches 2008. Part-of-speech Tagging of Modern Hebrew to Semitic Languages: Common Issues and Re- Text. Nat. Lang. Eng., 14(2):223–251, April. sources, pages 57–64, Prague, Czech Republic. Andrea Bellandi, Alessia Bellusci, and Emiliano Association for Computational Linguistics. Giovannetti. 2015. Computer Assisted Trans- Helmut Schmid. 1994. Part-of-speech tagging with lation of Ancient Texts: the Babylonian Tal- neural networks. In Proceedings of the 15th Con- mud Case Study. In Natural Language Pro- ference on Computational Linguistics - Volume cessing and Cognitive Science, Proceedings 2014, 1, COLING ’94, pages 172–176, Stroudsburg, Berlin/Munich. De Gruyter Saur. PA, USA. Association for Computational Lin- Andrea Bellandi, Davide Albanesi, Giulia Benotto, guistics. and Emiliano Giovannetti. 2016. Il Sistema Khalil Sima’an, Alon Itai, Yoad Winter, Alon Alt- Traduco nel Progetto Traduzione del Talmud Ba- man, and Noa Nativ. 2001. Building a tree- bilonese. IJCoL Vol. 2, n. 2, December 2016. bank of modern hebrew text. TAL. Traitement Special Issue on ”NLP and Digital Humanities”. automatique des langues, 42(2):347–380. Accademia University Press. Fred Skolnik and Michael Berenbaum, editors. Henrik Brink, Joseph Richards, and Mark 2007. Encyclopaedia Judaica vol. 8. Ency- Fetherolf. 2016. Real-World Machine Learn- clopaedia Judaica. Macmillan Reference USA, 2 ing. Manning Publications Co., Greenwich, CT, edition. Brovender Chaim and Blau Joshua and USA, 1st edition. Kutscher Eduard Y. and Breuer Yochanan and Shay B. Cohen and Noah A. Smith. 2007. Joint Eytan Eli sub v. “Hebrew Language”. Morphological and Syntactic Disambiguation. Kristina Toutanova, Dan Klein, Christopher D. In Proceedings of the 2007 Joint Conference on Manning, and Yoram Singer. 2003. Feature- Empirical Methods in Natural Language Process- rich Part-of-speech Tagging with a Cyclic De- ing and Computational Natural Language Learn- pendency Network. In Proceedings of the 2003 ing (EMNLP-CoNLL). Conference of the North American Chapter of Miguel Pérez Fernández and John F. Elwolde. the Association for Computational Linguistics 1999. An Introductory Grammar of Rab- on Human Language Technology - Volume 1, binic Hebrew. Interactive Factory, Leiden, The NAACL ’03, pages 173–180, Stroudsburg, PA, Netherlands. USA. Association for Computational Linguis- tics. Péter Halácsy, András Kornai, and Csaba Oravecz. 2007. HunPos: An Open Source Trigram Tag- Wido Van Peursen and Constantijn Sikkel. 2014. ger. In Proceedings of the 45th Annual Meet- Hebrew Text Database ETCBC4. type: dataset. ing of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 209– 212, Stroudsburg, PA, USA. Association for Computational Linguistics. Marcus Jastrow. 1971. A dictionary of the Tar- gumim, the Talmud Babli and Yerushalmi, and the Midrashic literature. Judaica Press.