=Paper=
{{Paper
|id=Vol-2253/paper46
|storemode=property
|title=Constructing an Annotated Resource for Part-Of-Speech Tagging of Mishnaic Hebrew
|pdfUrl=https://ceur-ws.org/Vol-2253/paper46.pdf
|volume=Vol-2253
|authors=Emiliano Giovannetti,Davide Albanesi,Andrea Bellandi,Simone Marchi,Alessandra Pecchioli
|dblpUrl=https://dblp.org/rec/conf/clic-it/GiovannettiABMP18
}}
==Constructing an Annotated Resource for Part-Of-Speech Tagging of Mishnaic Hebrew==
Emiliano Giovannetti¹, Davide Albanesi¹, Andrea Bellandi¹, Simone Marchi¹, Alessandra Pecchioli²

¹ Istituto di Linguistica Computazionale, Via G. Moruzzi 1, 56124, Pisa
name.surname@ilc.cnr.it

² Progetto Traduzione Talmud Babilonese S.c.a r.l., Lungotevere Sanzio 9, 00153 Roma
alepec3@gmail.com
Abstract

English. This paper introduces the research in Part-Of-Speech tagging of Mishnaic Hebrew carried out within the Babylonian Talmud Translation Project. Since no tagged resource was available to train a stochastic POS tagger, a portion of the Mishna of the Babylonian Talmud has been morphologically annotated using an ad hoc developed tool connected with the DB containing the Talmudic text being translated. The final aim of this research is to add linguistic support to the Translation Memory System of Traduco, the Computer-Assisted Translation tool developed and used within the Project.

Italiano. In questo articolo è introdotta la ricerca nel Part-Of-Speech tagging dell’Ebraico mishnaico condotta nell’ambito del Progetto Traduzione Talmud Babilonese. Data l’indisponibilità di risorse annotate necessarie per l’addestramento di un POS tagger stocastico, una porzione di Mishnà del Talmud Babilonese è stata annotata morfologicamente utilizzando uno strumento sviluppato ad hoc collegato al DB dove risiede il testo talmudico in traduzione. L’obiettivo finale di questa ricerca è lo sviluppo di un supporto linguistico al sistema di Memoria di Traduzione di Traduco, lo strumento di traduzione assistita utilizzato nell’ambito del Progetto.

1 Introduction

The present work has been conducted within the Babylonian Talmud Translation Project (in Italian, Progetto Traduzione Talmud Babilonese - PTTB), which aims at the translation of the Babylonian Talmud (BT) into Italian.

The translation is being carried out with the aid of tools for text and language processing integrated into an application called Traduco (Bellandi et al., 2016), developed by the Institute of Computational Linguistics “Antonio Zampolli” of the CNR in collaboration with the PTTB team. Traduco is a collaborative computer-assisted translation (CAT) tool conceived to ease the translation, revision and editing of the BT.

The research described here fits exactly into this context: we want to provide the system with additional informative elements as a further aid in the translation of the Talmud. In particular, we intend to linguistically analyze the Talmudic text, starting from the automatic attribution of the Part-Of-Speech to words by adopting a stochastic POS tagging approach.

The first difficulty that emerged regards the text and the languages it contains. In this regard we can say, simplifying, that the Babylonian Talmud is essentially composed of two languages which, in turn, correspond to two distinct texts: the Mishna and the Gemara. The first is the older of the two, written in Mishnaic Hebrew, one of the most homogeneous and coherent languages appearing in the Talmud, which, for this reason, has been chosen as the starting point of the POS tagging experiment.

The main purpose of the linguistic analysis in the context of our translation project is to improve the suggestions provided by the system through the so-called Translation Memory (TM). Moreover, on a linguistically annotated text it is possible to carry out linguistic-based searches, useful both for the scholar
(in this case a talmudist) and, during the translation work, for the revisor and the curator, who have the possibility, for example, to make bulk edits of polysemous words by discarding words with undesired POS.

The rest of the paper is organized as follows: Section 2 summarizes the state of the art in NLP of Hebrew. The construction of the linguistically annotated corpus is described in Section 3. The training process and evaluation of the POS taggers used in the experiments are detailed in Section 4. Lastly, Section 5 outlines the next steps of the research.

2 State of the art

The aforementioned linguistic richness and the intrinsic complexity of the Babylonian Talmud make automatic linguistic analysis of the BT particularly hard (Bellandi et al., 2015). However, some linguistic resources of ancient Hebrew and Aramaic have been (and are being) developed, among which we cite: i) the Hebrew Text Database (Van Peursen and Sikkel, 2014) (ETCBC), accessible through SHEBANQ [1], an online environment for the study of Biblical Hebrew (with emphasis on syntax), developed by the Eep Talstra Centre for Bible and Computer of the Vrije Universiteit in Amsterdam; ii) the Historical Dictionary [2] project of the Academy of the Hebrew Language of Israel; iii) the Comprehensive Aramaic Lexicon (CAL) [3], developed by the Hebrew Union College of Cincinnati; iv) the Digital Mishnah [4] project, concerning the creation of a digital scholarly edition of the Mishna, conducted by the Maryland Institute for Technology in the Humanities.

Apart from the aforementioned resources, to date there are no available NLP tools suitable for the processing of ancient north-western Semitic languages, such as the different Aramaic idioms and the historical variants of Hebrew attested in the BT. The only existing projects and tools for the processing of Jewish languages (Kamir et al., 2002) (Cohen and Smith, 2007) have been developed for Modern Hebrew, a language that was artificially revitalized from the end of the XIX century and that does not correspond to the idioms recurring in the BT. Among them we cite HebTokenizer [5] for tokenization, MILA (Bar-haim et al., 2008), HebMorph [6], MorphTagger [7] and NLPH [8] for morphological analysis and lemmatization, and yap [9], hebdepparser [10] and UD_Hebrew [11] for syntactic analysis. We conducted some preliminary tests by starting with MILA's (ambiguous) morphological analyzer applied to the three main languages of the Talmud:

1. Aramaic: Hebrew and Aramaic are different languages. There are even some cases in which the very same root has different semantics in the two languages. In addition, MILA did not recognize many Aramaic roots, tagging the words derived from them as proper nouns.

2. Biblical Hebrew: MILA recognized most of the words, since Modern Hebrew preserved almost the entire biblical lexicon. However, the syntax of Modern Hebrew is quite different from that of Biblical Hebrew, leading MILA to output wrong analyses.

3. Mishnaic Hebrew: this is the language where MILA performed best. Modern Hebrew inherits some of the morpho-syntactic features of Mishnaic Hebrew; however, the two idioms differ substantially in the lexicon, since in Modern Hebrew many archaic words have been lost (Skolnik and Berenbaum, 2007).

In the light of the above, we decided to create a novel linguistically annotated resource to start developing our own tools for the processing of ancient Jewish languages. In the next section, we will describe how the resource was built.

Footnotes:
[1] shebanq.ancient-data.org
[2] maagarim.hebrew-academy.org.il
[3] cal.huc.edu
[4] www.digitalmishnah.org
[5] www.cs.bgu.ac.il/~yoavg/software/hebtokenizer
[6] code972.com/hebmorph
[7] www.cs.technion.ac.il/~barhaim/MorphTagger
[8] github.com/NLPH/NLPH
[9] github.com/habeanf/yap
[10] tinyurl.com/hebdepparser
[11] github.com/UniversalDependencies/UD_Hebrew

3 Building the resource

The linguistic annotation of Semitic languages poses several problems. Although we here discuss the analysis of Hebrew, many of the critical points that must be taken into account are
common to other languages belonging to the same family. As already mentioned in the previous section, the first problem concerns the access to existing linguistic resources and analytical tools which, in the case of Hebrew, are available exclusively for the modern language.

One of the major challenges posed by the morphological analysis of Semitic languages is the orthographic disambiguation of words. Since writing is almost exclusively consonantal, every word can have multiple readings. The problem of orthographic ambiguity, crucial in all studies on large corpora (typically in Hebrew and Modern Arabic), does not prove to be so difficult when the text under examination is vocalized. The edition of the Talmud used in the project is in fact vocalized and the text, consequently, is orthographically unambiguous.

An additional critical aspect is represented by the definition of the tagset. Most of the computational studies on language analysis have been conducted on Indo-European languages (especially on English). As a result, it may be difficult to reuse tagsets created for these languages. Not surprisingly, there are still many discussions about how best to categorize certain POS, and each language has its own disputed cases. Each tagset must ultimately be created in the light of a specific purpose. For example, the tagging of the (Modern) Hebrew Treebank developed at the Technion (Sima’an et al., 2001) was syntax-oriented, while the work on participles of Hebrew described in (Adler et al., 2008) was more lexicon-oriented. We considered the idea of adopting the tagset used in the already cited Universal Dependencies corpus for Hebrew. However, its 16 tags appeared to be too “coarse grained” for our purposes (see github.com/UniversalDependencies/UD_Hebrew-HTB/blob/master/stats.xml). In particular, the UD tagset lacks all the prefix tags that we needed. For this reason we decided to define our own tagset.

Once the tagset has been defined, it remains to decide which is the most suitable grammatical category to associate with each token. Essentially two types of information can be collected, and the problem is how, and whether, both can be kept, in particular: i) the definition of the token from a syntagmatic perspective (i.e. what the token represents in context) and ii) the lexical information that the token gives by itself (without context). To give a couple of examples:

• Verb/noun: in ‫הַ מַ ִדיר אֶ ת ִא ְשּתּ‬ (the one who consecrates his wife), is ‫הַ מַ ִדיר‬ “the one who makes a vow” or “the vowing”? Should it be assigned to the verb or to the noun category?

• Adjective/verb: in ‫ִאם יְּ כ ִלין ְלּהַ ְתּ ִחיל וְּ ִלגְּ מ ר עַ ד שֶ ל ֹא יַגִ יעוּ לַשּׁוּרָ ה, י ְַתּ ִחילוּ‬, is ‫יְּ כ ִלין‬ an adjective or a verb (given that most of the Mishnaic language dictionaries provide both options)?

We could discuss which category would be best for each and why but, for now, we decided to keep both by introducing two parallel annotations, by “category” (without context) and by “function” (in context). The tags we used for this work are the following: agg., avv., cong., interiez., nome pr., num. card., num. ord., pref. art., pref. cong., pref. prep., pref. pron. rel., prep., pron. dim., pron. indef., pron. interr., pron. pers., pron. suff., punt., sost., vb.

One could also envisage refining the tagset by adding: interrogative, modal, negation, and quantifier (Adler, 2007) (Netzer and Elhadad, 1998) (Netzer et al., 2007).
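For illustration, the dual annotation could be represented along the following lines. This is a minimal sketch: the field names, the Python representation and the segmentation into sub-tokens are our own assumptions for this paper summary, not the project's actual DB schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubToken:
    surface: str                 # the prefix, stem or suffix as written in the text
    category: str                # context-free tag, e.g. "sost." or "vb."
    function: str                # tag for the role the item plays in this sentence
    lemma: Optional[str] = None  # planned for a later annotation phase

@dataclass
class Token:
    surface: str                 # the full orthographic word
    sub_tokens: List[SubToken] = field(default_factory=list)

# Invented example: a noun carrying a prepositional prefix, where the
# "category" and "function" layers happen to coincide.
word = Token(
    surface="לשורה",
    sub_tokens=[
        SubToken(surface="ל", category="pref. prep.", function="pref. prep."),
        SubToken(surface="שורה", category="sost.", function="sost."),
    ],
)
```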
As anticipated, in order to build the mor-
et al., 2001) was syntax-oriented, while the
phologically annotated resource, all of the
work on participles of Hebrew described in
Mishna sentences were extracted from the Tal-
(Adler et al., 2008) was more lexicon-oriented.
mud and annotated using an ad hoc developed
We considered the idea of adopting the tagset
Web application (Fig. 1).
used in the already cited Universal Depen-
All the annotations have been made with
dency Corpus for Hebrew. However, its 16
the aim of training a stochastic POS tagger in
tags appeared to be too “coarse grained” for
charge of the automatic analysis of the entire
our purposes.12 In particular, the UD tagset
Mishna: to obtain a good accuracy it was thus
lacks of all the prefix tags that we needed.
necessary to manually annotate as many sen-
For this reason we decided to define our own
tences as possible. To date, 10442 tokens have
tagset.
been annotated.
Once the tagset has been defined, it remains
The software created for the annotation
to decide which is the most suitable grammati-
shows, in a tabular form, the information of
cal category to associate with each token. You
the analysis carried out on a sentence by sen-
can collect essentially two types of informa-
tence basis.
tion, the problem is how and if you can keep
The system, once a sentence is selected for
12
github.com/UniversalDependencies/UD_Hebrew- annotation, checks whether the tokens com-
HTB/blob/master/stats.xml posing it have already been analyzed and, in
Figure 1: The interface for the linguistic annotation of the corpus to be used to train the POS
tagger
Once a sentence is selected for annotation, the system checks whether the tokens composing it have already been analyzed and, if so, calculates a possible subdivision into sub-tokens (i.e. the stems, prefixes and suffixes constituting each word) by exploiting previous annotations. If the system finds that a word is associated with multiple different annotations, it proposes the most frequent one.
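This proposal step can be sketched as follows. The sketch is a simplified, in-memory illustration: the real tool works against the project DB, and the function and variable names below are our own.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical store of previous analyses: for each surface form, the
# segmentations (tuples of (sub-token, tag) pairs) confirmed so far.
Analysis = Tuple[Tuple[str, str], ...]
previous_analyses: Dict[str, List[Analysis]] = {}

def record_analysis(word: str, analysis: Analysis) -> None:
    """Store a confirmed annotation so it can be reused for later occurrences."""
    previous_analyses.setdefault(word, []).append(analysis)

def propose_analysis(word: str) -> Optional[Analysis]:
    """Return the most frequent previous analysis of `word`, if any.

    This mirrors, in simplified form, the behaviour described above: for an
    already-seen word the tool pre-fills its most frequent segmentation into
    prefixes, stem and suffixes (with their tags); the annotator can then
    confirm or correct the proposal.
    """
    analyses = previous_analyses.get(word)
    if not analyses:
        return None  # unseen word: the annotator starts from scratch
    most_frequent, _count = Counter(analyses).most_common(1)[0]
    return most_frequent
```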
Regarding the linguistic annotation, the grammar of Pérez Fernández (Fernández and Elwolde, 1999) was adopted and, for lemmatization, the dictionary of M. Jastrow (Jastrow, 1971).

The software allows the annotator to gather as much information as possible for each word by providing a double annotation: by “category”, to represent the POS from a grammatical point of view, and by “function”, to describe the function the word assumes in its context. For the POS tagging experiments, described below, we used the annotation made by “function”.

4 Training and testing of POS taggers

Once the mishnaic corpus had been linguistically annotated, three of the most widely used algorithms for POS tagging were trained and evaluated: HunPos (Halácsy et al., 2007), the Stanford Log-linear Part-Of-Speech Tagger (Toutanova et al., 2003), and TreeTagger (Schmid, 1994). The three algorithms implement supervised stochastic models and, consequently, they need to be trained on a manually annotated corpus.

To evaluate the accuracy of the algorithms we adopted the strategy of k-fold cross-validation (Brink et al., 2016), with k set to 10, thus dividing the corpus into 10 partitions.
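As an illustration of the evaluation procedure, the following sketch shows 10-fold cross-validation of token-level accuracy. It is a simplified stand-in: `train_tagger` and `tag_sentence` are placeholder hooks for the actual training and tagging runs of HunPos, the Stanford tagger and TreeTagger, which are not reproduced here.

```python
from typing import Callable, List, Tuple

Sentence = List[Tuple[str, str]]  # a sentence as a list of (token, POS tag) pairs

def cross_validate(sentences: List[Sentence],
                   train_tagger: Callable[[List[Sentence]], object],
                   tag_sentence: Callable[[object, List[str]], List[str]],
                   k: int = 10) -> float:
    """k-fold cross-validation of token-level tagging accuracy.

    Each of the k partitions is used once as the test set while the other
    k-1 partitions are used for training; accuracy is pooled over all folds.
    """
    fold_size = len(sentences) // k  # sentences beyond k * fold_size are never tested
    correct = total = 0
    for i in range(k):
        test = sentences[i * fold_size:(i + 1) * fold_size]
        train = sentences[:i * fold_size] + sentences[(i + 1) * fold_size:]
        model = train_tagger(train)
        for sent in test:
            tokens = [tok for tok, _ in sent]
            gold = [tag for _, tag in sent]
            predicted = tag_sentence(model, tokens)
            correct += sum(p == g for p, g in zip(predicted, gold))
            total += len(gold)
    return correct / total
```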
Table 1 summarizes the results of the experiment by showing the tagging accuracy of the three tested algorithms. With a number of tokens slightly higher than ten thousand, the Stanford POS tagger provided the best results over HunPos and TreeTagger, with an accuracy of 87.9%.

Tagger       Tagging accuracy
Stanford     87.90%
HunPos       86.34%
TreeTagger   86.74%

Table 1: Accuracy of the three POS taggers.

5 Next steps

In this work, the tagging experiments have been limited to the attribution of the Part-Of-Speech: the next, natural step will be the addition of the lemma. Furthermore, we will try to modify the parameters affecting the behaviour of the three adopted POS taggers (left at their default values for the experiments) and see how they influence the results.

Once the Mishna has been lemmatized, Traduco, the software used to translate the Talmud into Italian, will be able to exploit this additional information mainly to provide translators with translation suggestions based on lemmas, but also to allow users to query the Mishnaic text by POS and lemma.

As a further step we will also take into account the linguistic annotation of the portions of the Babylonian Talmud written in other languages, starting from Babylonian Aramaic, the language of the Gemara, which constitutes the later portion of the Talmud.

Acknowledgments

This work was conducted in the context of the TALMUD project and the scientific cooperation between S.c.a r.l. PTTB and ILC-CNR.
References

Meni Adler, Yael Netzer, Yoav Goldberg, David Gabay, and Michael Elhadad. 2008. Tagging a Hebrew corpus: the case of participles. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.

Menahem Meni Adler. 2007. Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach. PhD Thesis, Ben-Gurion University of the Negev.

Roy Bar-haim, Khalil Sima'an, and Yoad Winter. 2008. Part-of-speech Tagging of Modern Hebrew Text. Nat. Lang. Eng., 14(2):223–251, April.

Andrea Bellandi, Alessia Bellusci, and Emiliano Giovannetti. 2015. Computer Assisted Translation of Ancient Texts: the Babylonian Talmud Case Study. In Natural Language Processing and Cognitive Science, Proceedings 2014, Berlin/Munich. De Gruyter Saur.

Andrea Bellandi, Davide Albanesi, Giulia Benotto, and Emiliano Giovannetti. 2016. Il Sistema Traduco nel Progetto Traduzione del Talmud Babilonese. IJCoL, Vol. 2, n. 2, December 2016. Special Issue on "NLP and Digital Humanities". Accademia University Press.

Henrik Brink, Joseph Richards, and Mark Fetherolf. 2016. Real-World Machine Learning. Manning Publications Co., Greenwich, CT, USA, 1st edition.

Shay B. Cohen and Noah A. Smith. 2007. Joint Morphological and Syntactic Disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Miguel Pérez Fernández and John F. Elwolde. 1999. An Introductory Grammar of Rabbinic Hebrew. Interactive Factory, Leiden, The Netherlands.

Péter Halácsy, András Kornai, and Csaba Oravecz. 2007. HunPos: An Open Source Trigram Tagger. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 209–212, Stroudsburg, PA, USA. Association for Computational Linguistics.

Marcus Jastrow. 1971. A Dictionary of the Targumim, the Talmud Babli and Yerushalmi, and the Midrashic Literature. Judaica Press.

Dror Kamir, Naama Soreq, and Yoni Neeman. 2002. A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew. In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, SEMITIC '02, pages 1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yael Dahan Netzer and Michael Elhadad. 1998. Generating Determiners and Quantifiers in Hebrew. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, Semitic '98, pages 89–96, Stroudsburg, PA, USA. Association for Computational Linguistics.

Yael Netzer, Meni Adler, David Gabay, and Michael Elhadad. 2007. Can You Tag the Modal? You Should. In Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 57–64, Prague, Czech Republic. Association for Computational Linguistics.

Helmut Schmid. 1994. Part-of-speech Tagging with Neural Networks. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, COLING '94, pages 172–176, Stroudsburg, PA, USA. Association for Computational Linguistics.

Khalil Sima'an, Alon Itai, Yoad Winter, Alon Altman, and Noa Nativ. 2001. Building a Treebank of Modern Hebrew Text. TAL. Traitement automatique des langues, 42(2):347–380.

Fred Skolnik and Michael Berenbaum, editors. 2007. Encyclopaedia Judaica, vol. 8. Macmillan Reference USA, 2nd edition. Brovender Chaim, Blau Joshua, Kutscher Eduard Y., Breuer Yochanan, and Eytan Eli, s.v. "Hebrew Language".

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Wido Van Peursen and Constantijn Sikkel. 2014. Hebrew Text Database ETCBC4. Dataset.