
Nominal Class Assignment in Swahili

A Computational Account

Giada Palmieri

Konstantinos Kogkalidis

Aalto University; University of Bologna

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Equal contribution. Authorship order was determined through a first-to-five game of rock paper scissors.
giada.palmieri5@unibo.it (G. Palmieri); kokos.kogkalidis@aalto.fi (K. Kogkalidis)
https://giadapalmieri.github.io/ (G. Palmieri); https://konstantinoskokos.github.io/ (K. Kogkalidis)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

We discuss the open question of the relation between semantics and nominal class assignment in Swahili. We approach the problem from a computational perspective, aiming first to quantify the extent of this relation, and then to explicate its nature, taking extra care to suppress morphosyntactic confounds. Our results are the first of their kind, providing a quantitative evaluation of the semantic cohesion of each nominal class, as well as a nuanced taxonomic description of its semantic content.

Keywords: Swahili, nominal classification, lexical semantics, computational semantics, topic modeling, unsupervised learning

1. Introduction

Swahili has a grand total of 18 nominal classes (i.e., ‘genders’). There is no consensus on the extent to which the assignment of a noun to a given class is determined by its semantic content. We explore this question from a computational angle. Our experiments suggest semantic cohesion among nominal classes, and provide a summary of the taxonomic concepts associated to each class.

2. Background

2.1. Nominal Classes in Swahili

Like other Bantu languages, Swahili has a rich nominal system, where nouns belong to different classes [1, 2], sometimes also referred to as 'genders' [3]. The nominal class is signalled by an affix on the noun itself, and co-referenced with other elements of the sentence through grammatical agreement [4].

In Swahili, verbs require markers that agree with the nominal class of the subject. An example of subject concord is reported below in (1): the noun mtoto 'child' bears the prefix of noun class 1 m- on the noun, and agrees with the verb through the subject marker a-. The same process can be observed in (2) for the noun mti 'tree' (class 3), or in (3) for kitabu 'book' (class 7).¹

(1) M-toto a-me-anguk-a.
    [1]-child sm[1]-prf-fall-fv
    ‘The child has fallen.’

(2) M-ti u-me-anguk-a.
    [3]-tree sm[3]-prf-fall-fv
    ‘The tree has fallen.’

(3) Ki-tabu ki-me-anguk-a.
    [7]-book sm[7]-prf-fall-fv
    ‘The book has fallen.’

¹ Abbreviations used in the examples: [n] = nominal class; sm = subject marker; prf = perfect; fv = final vowel.

Table 1 provides an overview of Swahili nominal classes, with their respective nominal affixes and subject concord markers. The division of the nominal classes is based on reconstructions from Proto-Bantu [5, 6, inter alia], and it aims at maintaining a correspondence across Bantu languages. Swahili is considered to have a total of 18 nominal classes, but some are missing in standard Swahili (e.g., classes 12, 13 and 18), while others are not uniquely identified by their nominal affix and/or subject concord markers. Odd numbers are traditionally associated with singular classes, and even numbers with plural classes. The first ten classes are in singular/plural pairing relations (e.g., class 2 is the plural form of class 1), while some singular noun classes may lack a plural form or borrow their plural forms from other classes.

There is a long-standing debate on whether Bantu nominal classification is arbitrary [7], or whether it is based on some underlying semantic principles, with specific meanings associated to specific classes [8, 9]. For Swahili, contemporary studies often adopt a stance that lies between these two extremes: nominal classification seems somewhat predictable based on semantic content, though it may often seem arbitrary [2, 10, 1, 11]. This view is also commonly found in textbooks: semantic cues are provided as an aid for the acquisition of Swahili, but accompanied by the admonition that many nouns do not necessarily admit generalizations [12, 13].

Two prominent attempts to examine the semantic categories associated with Swahili nominal classes are provided by Contini-Morava [14] and Moxley [15]. Both studies are cast in a cognitive linguistic framework, and propose networks of meanings and semantic features based on criteria such as resemblance or metaphoric and metonymic extensions. As an example, consider the semantic network for class 3 suggested by Contini-Morava [14] in Figure 1: part of the branching includes the features plants > objects made of plants > powerful things. Similarly, Moxley [15] suggests a structure of class 3/4 where the notion of ‘plants, trees’ extends to ‘parts of plants’ or to objects with ‘long, thin, extended shape’. These studies offer valuable insights into the principles underlying nominal classifications, suggesting the potential for more articulate generalizations than are immediately apparent. However, note that they rely on features that were conceived ad hoc to account for the categorization of Swahili nouns. Despite this, the nominal classification of several nouns remains unaccounted for [2]. It is unclear whether this is due to features that were overlooked in these studies, or an indication that the classification of some nouns is inherently arbitrary.

2.2. Computational Approaches to Swahili Nominal Classes

Despite the long-standing theoretical debate, computational attempts at semantically characterizing Swahili nominal classes are few and far between. In the context of word sense disambiguation, Ng’ang’a [16] utilizes a collection of manually selected morphosyntactic features in combination with a self-organizing map in order to semantically cluster Swahili nouns. The study finds that including noun prefix features (i.e., nominal class indicators) moderately improves clustering performance, indicating a degree of coherence between semantics and morphology. This improvement is particularly notable for classes 1/2, 7/8, and 11. Olstad [17] trains a naive Bayes classifier over a private, manually annotated dataset that specifically and explicitly marks the features proposed by Contini-Morava [14]. The approach is framed as an empirical test of Contini-Morava’s hypothesis, which the trained model is claimed to experimentally confirm; nonetheless, this assessment is compromised by lukewarm results and a flawed evaluation.² More recently, Byamugisha [18] builds a noun class disambiguation system for Runyankore, another Bantu language. The system relies on both a morphological and a semantic component, the latter employing k-NN clustering of word vectors to resolve ambiguities that extend beyond nominal morphology. The work is results-oriented, adopting a task-driven NLP posturing; its only tangible contribution is the system itself.

² The key metrics reported are dataset-wide accuracy and per-class area-under-the-curve. Both are over-optimistic: the first tends to favor class-imbalanced datasets, whereas the latter ignores precision and obfuscates the predictive conflict of the competing classifiers.

3. Methodology

Unlike prior works, we are neither interested in preemptively adopting or verifying some existing theory, nor in maximizing discriminative performance metrics in some artificial downstream task. What we are interested in is computationally investigating whether semantic content alone is indeed a predictor of nominal class membership. At first glance, word vectors seem to make for a natural starting point. However, language-native word vectors are bound to carry implicit morphological cues, trivializing the mapping to nominal classes (at worst), or obfuscating its semantic aspect (at best). Word vectors (both distributional and predictive) are built on the basis of co-occurrence contexts and/or statistics. The effect of grammatical agreement is that nouns will inadvertently co-occur with verbs that carry subject markers indicative of the noun’s class. Case in point, the examples in (1), (2) and (3) contain morphologically distinct entries of the same verbal stem, which disclose the subject’s nominal class. The same problem is compounded when using modern segmentation techniques which implicitly account for morphology by incorporating information at the sub-word (i.e., syllable- or character-) level (cf. BPE [19], SentencePiece [20], inter alia). To bypass the problem, we conduct our analyses on English translations of Swahili nouns. Mediating meaning through a foreign language carries the risk of inducing translation shifts and introducing inaccuracies. That said, we deem it a necessary compromise; the bottleneck completely erases any traces of morphology, which would otherwise confound our results (and their interpretation).

3.1. Data

We first compile a list of nominal lexical entries by consulting the TUKI Swahili-English dictionary. We gather these by scraping the dictionary’s online version³, filtering for pages under the category of Swahili nouns. The scrape yields 5 974 lexical entries. Each lexical entry corresponds to a Swahili nominal homograph. Each homograph is assigned one or more meanings, grouped under one or more subject concord classes. Meanings are provided in English, in the form of (lists of) synonyms, brief descriptions, or mixtures of the two. These are sometimes interlaced with linguistic metadata such as usage examples, apothegms, explanatory comments, etc.

The dictionary is consistent in its typographic notation, which allows us to standardize its presentation with a tiny rule-based parser. The parser removes metadata and splits homographs to nominals with unique meanings, gracefully pointing out the occasional inconsistency or error. Guided by the parser, we identify and manually fix common typographic errors. Following our corrections, we are left with a set of 6 341 unique records, i.e., triplets of an entry identifier, a meaning and a subject concord class (Figure 2). The distribution of subject concord classes is heavily skewed (Figure 3). We keep records assigned to one of the 9 most populous classes, which together account for about 98% of the data, and discard the rest. In what follows, we use these subject concord markers as an approximation of the underlying nominal classes.⁴ The records we are left with correspond to the nominal classes 1/2, 3/4, 5/6, 7/8, 9/10, 11|14, 4|9 and (11|14)/10; the latter three are necessarily conflated or ambiguous due to their shared morphology.⁵

³ Available at https://swahili-dictionary.com.
⁴ The use of subject concord markers over noun affixes is mandated by the annotation format of the TUKI dictionary.
⁵ We use the pipe operator (· | ·) to denote disjunction.

3.2. Predicting Nominal Classes with a Language Model

Our data allows for a first quantitative inquiry into the semantic uniformity and separation of nominal classes. For our first take, we employ a supervised learning approach. We task a small language model with predicting a record’s subject concord class through the phrasal representation of its English definition. The use of a pretrained language model allows the seamless representation of translations that are not strict word-to-word correspondences, promising also the ability to capture subtle semantic distinctions in the process.

We use MiniLMv2 [21], a distilled encoder-only model that has been fine-tuned for sentential similarity using a contrastive learning objective. We apply a 75/25 train/eval split and further fine-tune the model to the task (we follow standard practices, attaching a neural classifier to the model’s topmost layer, applied exclusively on the start-of-sequence token). Model selection is based on evaluation loss; we select three models from as many training repetitions over the same split (one model per repetition).

We report means and 95% confidence intervals for the macro-averaged, micro-averaged and per-class F1 scores.

[Table: Macro- and micro-averaged and per-class F1 scores.]
[Figure: Confusion matrix over subject concord predictions.]

Our mixed results paint a nuanced picture. Performance above random affirms that nominal classes are to an extent semantically coherent, even if not perfectly so. Performance below perfect, however, offers nothing tangible. The model’s shortcomings might be indicative of a semantic dispersion or arbitrariness within nominal classes, but could also be attributed to the model itself, the training process, or the dataset. In either case, we have strong evidence of an (at least partial) overlap between (at least some) semantic and morphological clusters. Other than this confirmation, the supervised approach does not have much else to offer at this stage; over-parameterized blackbox models are notoriously hard to extract linguistic insights from. To actually ascribe semantic descriptions to nominal classes, we need a better behaved alternative.

3.3. Finding the Taxonomies of Nominal Classes with WordNet

[...] filter out hypernyms with fewer than 10 global occurrences, and compute the frequency-weighted⁷ pointwise mutual information between classes and hypernyms:

$$\mathrm{wPMI}(c, h) := p_{c \times h}(c, h)\,\mathrm{PMI}(c, h) \tag{1}$$

where:

$$\mathrm{PMI}(c, h) := \log_2\left(\frac{p_{c \times h}(c, h)}{p_c(c)\,p_h(h)}\right) \tag{2}$$
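The weighting in equations (1) and (2) can be illustrated with a minimal sketch. It assumes that the input is a flat list of (subject concord class, hypernym) observations and that probabilities are estimated as relative frequencies after the rare-hypernym filter. The function name `wpmi_scores`, the input layout and the toy data are hypothetical and not taken from the released scripts.

```python
import math
from collections import Counter


def wpmi_scores(pairs, min_hypernym_count=10):
    """Return {(class, hypernym): wPMI} for observed (class, hypernym) pairs.

    Assumption: probabilities are relative frequencies over the pairs that
    survive the rare-hypernym filter.
    """
    pairs = list(pairs)
    # Drop hypernyms seen fewer than `min_hypernym_count` times overall.
    hypernym_totals = Counter(h for _, h in pairs)
    kept = [(c, h) for c, h in pairs if hypernym_totals[h] >= min_hypernym_count]
    if not kept:
        return {}
    total = len(kept)
    joint = Counter(kept)
    class_counts = Counter(c for c, _ in kept)
    hypernym_counts = Counter(h for _, h in kept)
    scores = {}
    for (c, h), n in joint.items():
        p_joint = n / total
        p_class = class_counts[c] / total
        p_hyper = hypernym_counts[h] / total
        pmi = math.log2(p_joint / (p_class * p_hyper))  # equation (2)
        scores[(c, h)] = p_joint * pmi                  # equation (1)
    return scores


# Toy observations (made up for illustration only).
toy_pairs = (
    [("a-/wa-", "person")] * 3 + [("a-/wa-", "artifact")]
    + [("ki-/vi-", "artifact")] * 3 + [("ki-/vi-", "person")]
)
toy_scores = wpmi_scores(toy_pairs, min_hypernym_count=1)
for pair, score in sorted(toy_scores.items()):
    print(pair, round(score, 4))
```

In this toy setting the frequent pairings receive positive scores and the rare ones negative scores. With the default threshold of 10, every toy hypernym would be filtered out and the function would return an empty dictionary.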

Pairs with a positive wPMI score indicate relevance (i.e., mutual dependence) between their coordinates: the higher the score, the better a hypernym describes a subject concord class. The aggregation of positive scores allows us to quantify and compare the semantic cohesion of subject concord classes given their descriptions; we present these in Table 4. We also present the top 20 extracted descriptors along with their scores in Appendix A. The sum total of positive mutual information between extracted descriptors and subject concord classes under this weighting scheme is approximately 0.26 shannons, suggesting a moderate bidirectional dependency between the two.

⁶ A ‘native’ WordNet would be a better fit for the task, but no mature Swahili version exists as of the time of writing.
⁷ The scaling helps alleviate the ‘rare event’ bias of vanilla PMI.

4. Analysis

For several classes, our experimental results are congruent with the hypotheses of Contini-Morava [14] and Moxley [15], inter alia. Concretely:

• Subject concord class a-/wa- is associated with humans, causal agents and animacy; the class is the most semantically coherent and categorically defined; the classifier can accurately predict it, and its taxonomic descriptors are well-pronounced.
• Subject concord class u- predominantly refers to abstract concepts; the class is the second easiest to predict, and has the most homogeneous description.
• Subject concord class u-/i- is mostly associated with plants; it is the third easiest class to predict, but predictions are already getting somewhat unreliable.
• Subject concord class i-/zi- is semantically disparate; its descriptors are heterogeneous and carry relatively low scores. This disparity is consistent with the class’ characterization as a ‘residual catchall category’ [8, 14] where loanwords are often assigned [23]. The only standout descriptor relates the class to human-made objects, but the same descriptor dominates also classes li-/ya- and ki-/vi-.⁸ Indeed, the model struggles to tell these three classes apart.

⁸ Describing li-/ya- and ki-/vi- as human-made objects is in partial alignment with the literature. The two are respectively associated with ‘augmentative’ and ‘diminutive’ meanings [15] and, by extension, with big or small objects [14].

In addition to experimentally affirming existing hypotheses, our approach also yields novel insights and artifacts. With respect to ya- and i-, the macro-level summary of these two understudied classes reveals an as-of-yet undocumented pattern: both classes lack a singular-plural paradigm, and contain concepts broadly categorized as abstractions, albeit of different kinds. This observation may support the correlation between uncountability and abstract meanings noticed in other languages [24, 25]; doing so would however require a thorough examination of these nouns’ properties.

From a high-level perspective, we have chosen to isolate the first few highest-ranked semantic components of each class. This ensures backwards compatibility with the literature, but is also a very radical simplification. In reality, our descriptions are fine-grained enough to allow semantically distinguishing between any two classes, even when their primary descriptors overlap. Case in point, i-/zi-, ki-/vi- and li-/ya- have all been reduced to ‘human-made objects’; yet the three are actually very different, having only 2 (out of a total of 41) descriptors in common. Moreover, a descriptor is not just a (weighted) concept in isolation, but inherits also the expansive structure of the underlying WordNet it came from. In that sense, our approach does not only describe nominal classes with WordNet synsets, but dually also decorates the WordNet graph with nominal class weights.

5. Conclusions

We explored the relation between semantics and nominal class assignment in Swahili. We approached the question from two complementary computational angles. Verifying first the presence of a relation using supervised learning, we then sought to explicate its nature using unsupervised topic modeling. Starting from a blank slate and without any prior interpretative bias, our methodology rediscovered go-to theories of Swahili nominal classification, while also offering room for further insights and explorations. Our work is among the first to tackle Bantu nominal assignment computationally, and the first to focus exclusively on semantics. Our methodology is typologically unbiased and computationally accessible, allowing for an easy extension to other languages, under the sole requirement of a dictionary. We make our scripts and generated artifacts publicly available at https://github.com/konstantinosKokos/swa-nc.

We leave several directions open to future work. We have experimented with a single dataset, a single model and a single lexical database; varying any of these coordinates and aggregating the results should help debias our findings. We have only looked for semantic generalizations across hyperonymic taxonomies; looking at other kinds of lexical relations might yield different semantic observations. Our chosen metric of relevance is by construction limited to first-order pairwise interactions, failing to account for exceptional cases or conditional associations. Finally, we had to resort to computational acrobatics through English in order to access necessary tools and resources. This is yet another reminder of the disparities in the pace of ‘progress’ of language technology, and a call for the computational inclusion of typologically diverse languages.

6. Acknowledgments

We are grateful to Joost Zwarts and to three anonymous reviewers for their helpful feedback.

References

[1] B. Wald, Swahili and the Bantu languages, in: B. Comrie (Ed.), The major languages of South Asia, the Middle East and Africa, Routledge, London, 2018, pp. 903–924.
[2] F. Katamba, Bantu nominal morphology, in: D. Nurse, G. Philippson (Eds.), The Bantu languages, volume 103, Routledge, London, 2003, p. 120.
[3] P. Spinner, J. A. Thomas, L2 learners’ sensitivity to semantic and morphophonological information on Swahili nouns, International Review of Applied Linguistics in Language Teaching 52 (2014) 283–311.
[4] R. M. Dixon, Noun classes, Lingua 21 (1968) 104–125.
[5] A. E. Meeussen, Bantu grammatical reconstructions, Africana linguistica 3 (1967) 79–121.
[6] M. Guthrie, Comparative Bantu, volume 2, Gregg, 1971.
[7] I. Richardson, Linguistic evolution and Bantu noun class system, in: G. Manessy, A. Martinet (Eds.), La Classification Nominale Dans Les Langues Négro-Africaines, Centre national de la recherche scientifique, 1967, pp. 373–390.
[8] S. Zawawi, Loan words and their effect on the classification of Swahili nominals, Brill Archive, 1979.
[9] J. P. Denny, C. A. Creider, The semantics of noun classes in Proto-Bantu, in: C. G. Craig (Ed.), Noun classes and categorization, John Benjamins Publishing Company, 1986.
[10] M. Krifka, Swahili, in: J. Jacobs, A. von Stechow, W. Sternefeld, T. Vennemann (Eds.), Syntax. An International Handbook of Contemporary Research, De Gruyter, Berlin, 2005, pp. 1397–1418.
[11] L. Marten, Noun Classes and Plurality in Bantu Languages, in: P. C. Hofherr, J. Doetjes (Eds.), The Oxford Handbook of Grammatical Number, Oxford University Press, 2021.
[12] P. M. Wilson, Simplified Swahili, Longman Nairobi; London, 1985.
[13] J. F. Safari, Swahili Made Easy: A Beginner’s Complete Course, Mkuki na Nyota; Dar es Salaam, 2012.
[14] E. Contini-Morava, Noun classification in Swahili, Virginia: Publications of the Institute for Advanced Technology in the Humanities, University of Virginia (1994). URL: http://www2.iath.virginia.edu/swahili/swahili.html.
[15] J. L. Moxley, Semantic structure of Swahili noun classes, in: I. Maddieson, T. J. Hinnebusch (Eds.), Language history and linguistic description in Africa, Africa World Press Inc, 1998, pp. 229–238.
[16] W. Ng’ang’a, Word sense disambiguation of Swahili: Extending Swahili language technology with machine learning, Ph.D. thesis, University of Helsinki, 2005.
[17] J. Olstad, Noun class assignment in Swahili via Bayesian probability, Cambridge Scholars Publishing, 2012, pp. 180–194.
[18] J. Byamugisha, Noun class disambiguation in Runyankore and related languages, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 4350–4359.
[19] P. Gage, A new algorithm for data compression, The C Users Journal 12 (1994) 23–38.
[20] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, in: E. Blanco, W. Lu (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: https://aclanthology.org/D18-2012. doi:10.18653/v1/D18-2012.
[21] W. Wang, H. Bao, S. Huang, L. Dong, F. Wei, Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 2140–2151.
[22] G. A. Miller, Wordnet: a lexical database for English, Communications of the ACM 38 (1995) 39–41.
[23] T. C. Schadeberg, Loanwords in Swahili, in: M. Haspelmath, U. Tadmor (Eds.), Loanwords in the world’s languages: A comparative handbook, De Gruyter Mouton Berlin, 2009, pp. 76–102.
[24] G. Katz, R. Zamparelli, Quantifying count/mass elasticity, in: Proceedings of the 29th West Coast Conference on Formal Linguistics, 2012.
[25] H. Husić, On abstract nouns and countability, Ph.D. thesis, Ruhr-Universität Bochum, 2020.

[Table: Taxonomic description of nominal classes. Scores are multiplied by $100\,p_c(c)^{-1}$ to enhance legibility and facilitate direct numerical comparison across classes. Bold face scores indicate higher mutual information. Grayed out descriptors are hyponyms of at least one other descriptor with a higher score.]
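The rescaling mentioned in the caption above, together with the aggregation of positive scores into a per-class cohesion value, might be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function `describe_classes`, its input dictionaries and the toy numbers are hypothetical, and treating cohesion as the plain sum of rescaled positive scores is an assumption.

```python
def describe_classes(scores, class_prob, top_k=3):
    """Rank positive descriptors per class and aggregate them.

    scores:     {(class, hypernym): wPMI score}
    class_prob: {class: p_c(c)}, the marginal probability of each class
    Assumption: cohesion is the sum of the rescaled positive scores.
    """
    per_class = {}
    for (c, h), score in scores.items():
        if score > 0:  # only positive scores indicate relevance
            rescaled = 100.0 * score / class_prob[c]
            per_class.setdefault(c, []).append((h, rescaled))
    summary = {}
    for c, descriptors in per_class.items():
        descriptors.sort(key=lambda item: item[1], reverse=True)
        summary[c] = {
            "cohesion": sum(value for _, value in descriptors),
            "top": descriptors[:top_k],
        }
    return summary


# Toy scores and class probabilities (made up for illustration only).
toy_wpmi = {
    ("A", "x"): 0.2, ("A", "y"): -0.1, ("A", "z"): 0.05,
    ("B", "x"): 0.1,
}
toy_class_prob = {"A": 0.5, "B": 0.25}
toy_summary = describe_classes(toy_wpmi, toy_class_prob)
for label, info in sorted(toy_summary.items()):
    print(label, info)
```

Dividing by the class probability puts a small class and a large class on a comparable scale, which is the stated purpose of the multiplication in the caption.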