<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Nominal Class Assignment in Swahili: A Computational Account</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giada Palmieri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantinos Kogkalidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalto University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We discuss the open question of the relation between semantics and nominal class assignment in Swahili. We approach the problem from a computational perspective, aiming first to quantify the extent of this relation, and then to explicate its nature, taking extra care to suppress morphosyntactic confounds. Our results are the first of their kind, providing a quantitative evaluation of the semantic cohesion of each nominal class, as well as a nuanced taxonomic description of its semantic content.</p>
        <sec>
          <title>2.1. Nominal Classes in Swahili</title>
          <p>Like other Bantu languages, Swahili has a rich nominal system, where nouns belong to different classes [1, 2], sometimes also referred to as 'genders' [3]. The nominal class is signalled by an affix on the noun itself, and co-referenced with other elements of the sentence through grammatical agreement [4].</p>
          <p>In Swahili, verbs require markers that agree with the nominal class of the subject. An example of subject concord is reported below in (1): the noun mtoto 'child' bears the prefix of noun class 1 m- on the noun, and agrees with the verb through the subject marker a-. The same process can be observed in (2) for the noun mti 'tree' (class 3), or in (3) for kitabu 'book' (class 7).<sup>1</sup></p>
          <p>Table 1 provides an overview of Swahili nominal classes, with their respective nominal affixes and subject concord markers. The division of the nominal classes is based on reconstructions from Proto-Bantu [5, 6, inter alia], and it aims at maintaining a correspondence across Bantu languages. Swahili is considered to have a total of 18 nominal classes, but some are missing in standard Swahili (e.g., classes 12, 13 and 18), while others are not uniquely identified by their nominal affix and/or subject concord markers. Odd numbers are traditionally associated with singular classes, and even numbers with plural classes. The first ten classes are in singular/plural pairing relations (e.g., class 2 is the plural form of class 1), while some singular noun classes may lack a plural form or borrow their plural forms from other classes.</p>
          <p>There is a long-standing debate on whether Bantu nominal classification is arbitrary [7], or whether it is based on some underlying semantic principles, with specific meanings associated to specific classes [8, 9]. For Swahili, contemporary studies often adopt a stance that lies between these two extremes: nominal classification seems somewhat predictable based on semantic content, though it may often seem arbitrary [2, 10, 1, 11].</p>
        </sec>
      </abstract>
      <kwd-group>
        <kwd>Swahili</kwd>
        <kwd>nominal classification</kwd>
        <kwd>lexical semantics</kwd>
        <kwd>computational semantics</kwd>
        <kwd>topic modeling</kwd>
        <kwd>unsupervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Swahili has a grand total of 18 nominal classes (i.e., ‘genders’). There is no consensus on the extent to which the assignment of a noun to a given class is determined by its semantic content. We explore this question from a computational angle. Our experiments suggest semantic cohesion among nominal classes, and provide a summary of the taxonomic concepts associated with each class.</p>
      <sec id="sec-1-1">
        <p>(1) M-toto a-me-anguk-a.
[1]-child sm[1]-prf-fall-fv
‘The child has fallen.’</p>
        <p>(2) M-ti u-me-anguk-a.
[3]-tree sm[3]-prf-fall-fv
‘The tree has fallen.’</p>
        <p>(3) Ki-tabu ki-me-anguk-a.
[7]-book sm[7]-prf-fall-fv
‘The book has fallen.’</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy</p>
      <p>* Equal contribution. Authorship order was determined through a first-to-five game of rock paper scissors.</p>
      <p>giada.palmieri5@unibo.it (G. Palmieri); kokos.kogkalidis@aalto.fi (K. Kogkalidis)</p>
      <p>https://giadapalmieri.github.io/ (G. Palmieri); https://konstantinoskokos.github.io/ (K. Kogkalidis)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>1</sup> Abbreviations used in the examples: [n] = nominal class; sm = subject marker; prf = perfect; fv = final vowel.</p>
      <p>This view is also commonly found in textbooks: semantic cues are provided as an aid for the acquisition of Swahili, but accompanied by the admonition that many nouns do not necessarily admit generalizations [12, 13].</p>
      <p>Two prominent attempts to examine the semantic categories associated with Swahili nominal classes are provided by Contini-Morava [14] and Moxley [15]. Both studies are cast in a cognitive linguistic framework, and propose networks of meanings and semantic features based on criteria such as resemblance or metaphoric and metonymic extensions. As an example, consider the semantic network for class 3 suggested by Contini-Morava [14] in Figure 1: part of the branching includes the features plants &gt; objects made of plants &gt; powerful things. Similarly, Moxley [15] suggests a structure of class 3/4 where the notion of ‘plants, trees’ extends to ‘parts of plants’ or to objects with ‘long, thin, extended shape’. These studies offer valuable insights into the principles underlying nominal classifications, suggesting the potential for more articulate generalizations than are immediately apparent. However, note that they rely on features that were conceived ad hoc to account for the categorization of Swahili nouns. Despite this, the nominal classification of several nouns remains unaccounted for [2]. It is unclear whether this is due to features that were overlooked in these studies, or an indication that the classification of some nouns is inherently arbitrary.</p>
      <sec>
        <title>2.2. Computational Approaches to Swahili Nominal Classes</title>
        <p>Despite the long-standing theoretical debate, computational attempts at semantically characterizing Swahili nominal classes are few and far between. In the context of word sense disambiguation, Ng’ang’a [16] utilizes a collection of manually selected morphosyntactic features in combination with a self-organizing map in order to semantically cluster Swahili nouns. The study finds that including noun prefix features (i.e., nominal class indicators) moderately improves clustering performance, indicating a degree of coherence between semantics and morphology. This improvement is particularly notable for classes 1/2, 7/8, and 11. Olstad [17] trains a naive Bayes classifier over a private, manually annotated dataset that specifically and explicitly marks the features proposed by Contini-Morava [14]. The approach is framed as an empirical test of Contini-Morava’s hypothesis, which the trained model is claimed to experimentally confirm; nonetheless, this assessment is compromised by lukewarm results and a flawed evaluation.<sup>2</sup> More recently, Byamugisha [18] builds a noun class disambiguation system for Runyankore, another Bantu language. The system relies on both a morphological and a semantic component, the latter employing k-NN clustering of word vectors to resolve ambiguities that extend beyond nominal morphology. The work is results-oriented, adopting a task-driven NLP posturing – its only tangible contribution is the system itself.</p>
        <p><sup>2</sup> The key metrics reported are dataset-wide accuracy and per-class area-under-the-curve. Both are over-optimistic: the first tends to favor class-imbalanced datasets, whereas the latter ignores precision and obfuscates the predictive conflict of the competing classifiers.</p>
      </sec>
      <sec id="sec-2-1">
        <title>3. Methodology</title>
        <p>Unlike prior works, we are neither interested in preemptively adopting or verifying some existing theory, nor in maximizing discriminative performance metrics in some artificial downstream task. What we are interested in is computationally investigating whether semantic content alone is indeed a predictor of nominal class membership. At first glance, word vectors seem to make for a natural starting point. However, language-native word vectors are bound to carry implicit morphological cues, trivializing the mapping to nominal classes (at worst), or obfuscating its semantic aspect (at best). Word vectors (both distributional and predictive) are built on the basis of co-occurrence contexts and/or statistics. The effect of grammatical agreement is that nouns will inadvertently co-occur with verbs that carry subject markers indicative of the noun’s class. Case in point, the examples in (1), (2) and (3) contain morphologically distinct entries of the same verbal stem, which disclose the subject’s nominal class. The same problem is compounded when using modern segmentation techniques which implicitly account for morphology by incorporating information at the sub-word (i.e., syllable- or character-) level (cf. BPE [19], SentencePiece [20], inter alia). To bypass the problem, we conduct our analyses on English translations of Swahili nouns. Mediating meaning through a foreign language carries the risk of inducing translation shifts and introducing inaccuracies. That said, we deem it a necessary compromise; the bottleneck completely erases any traces of morphology, which would otherwise confound our results (and their interpretation).</p>
        <sec id="sec-2-1-1">
          <title>3.1. Data</title>
          <p>We first compile a list of nominal lexical entries by consulting the TUKI Swahili-English dictionary. We gather these by scraping the dictionary’s online version<sup>3</sup>, filtering for pages under the category of Swahili nouns. The scrape yields 5 974 lexical entries. Each lexical entry corresponds to a Swahili nominal homograph. Each homograph is assigned one or more meanings, grouped under one or more subject concord classes. Meanings are provided in English, in the form of (lists of) synonyms, brief descriptions, or mixtures of the two. These are sometimes interlaced with linguistic metadata such as usage examples, apothegms, explanatory comments, etc.</p>
          <p>The dictionary is consistent in its typographic notation, which allows us to standardize its presentation with a tiny rule-based parser. The parser removes metadata and splits homographs to nominals with unique meanings, gracefully pointing out the occasional inconsistency or error. Guided by the parser, we identify and manually fix common typographic errors. Following our corrections, we are left with a set of 6 341 unique records, i.e., triplets of an entry identifier, a meaning and a subject concord class (Figure 2). The distribution of subject concord classes is heavily skewed (Figure 3). We keep records assigned to one of the 9 most populous classes, which together account for about 98% of the data, and discard the rest. In what follows, we use these subject concord markers as an approximation of the underlying nominal classes.<sup>4</sup> The records we are left with correspond to the nominal classes 1/2, 3/4, 5/6, 7/8, 9/10, 11|14, 4|9 and (11|14)/10; the latter three are necessarily conflated or ambiguous due to their shared morphology.<sup>5</sup></p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2. Predicting Nominal Classes with a Language Model</title>
          <p>Our data allows for a first quantitative inquiry into the semantic uniformity and separation of nominal classes. For our first take, we employ a supervised learning approach. We task a small language model with predicting a record’s subject concord class through the phrasal representation of its English definition. The use of a pretrained language model allows the seamless representation of translations that are not strict word-to-word correspondences, promising also the ability to capture subtle semantic distinctions in the process.</p>
          <p>We use MiniLMv2 [21], a distilled encoder-only model that has been fine-tuned for sentential similarity using a contrastive learning objective. We apply a 75/25 train/eval split and further fine-tune the model to the task (we follow standard practices, attaching a neural classifier to the model’s topmost layer, applied exclusively on the start-of-sequence token). Model selection is based on evaluation loss; we select three models from as many training repetitions over the same split (one model per repetition).</p>
          <p>We report means and 95% confidence intervals for the macro- and micro-averaged and per-class F1 scores.</p>
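          <p>As a reading aid for the evaluation metrics named above, the following is a minimal sketch of per-class, macro-averaged and micro-averaged F1 in plain Python. It is not the authors' evaluation code: the function name f1_scores and the toy gold/predicted labels are illustrative assumptions, and the confidence intervals over training repetitions are not covered.</p>
          <preformat>
```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (per_class, macro, micro) F1 for two equal-length label lists."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted, but the gold label differs
            fn[g] += 1  # the gold label g was missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro: unweighted mean over classes, so rare classes count as much as frequent ones.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool all decisions first; dominated by the most populous classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro


# Toy labels (hypothetical), written as subject concord classes.
gold = ["a-/wa-", "a-/wa-", "a-/wa-", "u-", "ki-/vi-"]
pred = ["a-/wa-", "a-/wa-", "u-", "u-", "i-/zi-"]
per_class, macro, micro = f1_scores(gold, pred)
print(per_class, macro, micro)
```
          </preformat>
          <p>On a skewed label distribution such as the one described for the data, the gap between the micro and macro averages gives a rough sense of how much of the performance is carried by the largest classes.</p>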
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p><sup>3</sup> Available at https://swahili-dictionary.com.</p>
      </sec>
      <sec id="sec-2-3">
        <p><sup>4</sup> The use of subject concord markers over noun affixes is mandated by the annotation format of the TUKI dictionary.</p>
        <p><sup>5</sup> We use the pipe operator (·|·) to denote disjunction.</p>
        <p>Macro- and micro-averaged and per-class F1 scores.</p>
        <p>Confusion matrix over subject concord predictions.</p>
        <p>Our mixed results paint a nuanced picture. Performance above random affirms that nominal classes are to an extent semantically coherent – even if not perfectly so. Performance below perfect, however, offers nothing tangible. The model’s shortcomings might be indicative of a semantic dispersion or arbitrariness within nominal classes, but could also be attributed to the model itself, the training process, or the dataset. In either case, we have strong evidence of an (at least partial) overlap between (at least some) semantic and morphological clusters. Other than this confirmation, the supervised approach does not have much else to offer at this stage; over-parameterized black-box models are notoriously hard to extract linguistic insights from. To actually ascribe semantic descriptions to nominal classes, we need a better behaved alternative.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.3. Finding the Taxonomies of Nominal Classes with WordNet</title>
        <p>… filter out hypernyms with fewer than 10 global occurrences, and compute the frequency-weighted<sup>7</sup> pointwise mutual information between classes and hypernyms:</p>
        <disp-formula><label>(1)</label><tex-math>\mathrm{wPMI}(c, h) := p_{\mathrm{c} \times \mathrm{h}}(c, h)\,\mathrm{PMI}(c, h)</tex-math></disp-formula>
        <p>where:</p>
        <disp-formula><label>(2)</label><tex-math>\mathrm{PMI}(c, h) := \log_2\left(\frac{p_{\mathrm{c} \times \mathrm{h}}(c, h)}{p_{\mathrm{c}}(c)\,p_{\mathrm{h}}(h)}\right)</tex-math></disp-formula>
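        <p>To make the weighting concrete, the following is a minimal sketch of frequency-weighted PMI over (class, hypernym) pairs in plain Python. It assumes that the joint and marginal probabilities are relative frequencies taken from the observed pairs; the function name wpmi and the toy pairs are hypothetical, and the hypernym extraction and the occurrence filter are not reproduced.</p>
        <preformat>
```python
import math
from collections import Counter


def wpmi(pairs):
    """Score each observed (cls, hypernym) pair by p(c, h) * log2(p(c, h) / (p(c) * p(h)))."""
    joint = Counter(pairs)
    n = sum(joint.values())
    cls_count, hyp_count = Counter(), Counter()
    for (c, h), k in joint.items():
        cls_count[c] += k  # marginal count of the class
        hyp_count[h] += k  # marginal count of the hypernym
    scores = {}
    for (c, h), k in joint.items():
        p_ch = k / n
        pmi = math.log2(p_ch / ((cls_count[c] / n) * (hyp_count[h] / n)))
        scores[(c, h)] = p_ch * pmi  # scaling by p(c, h) damps rare pairs
    return scores


# Toy observations (hypothetical): one pair per class/hypernym co-occurrence.
pairs = (
    [("a-/wa-", "person")] * 3
    + [("a-/wa-", "artifact")]
    + [("ki-/vi-", "artifact")] * 3
    + [("ki-/vi-", "person")]
)
scores = wpmi(pairs)
positive_total = sum(s for s in scores.values() if s > 0)
print(scores, positive_total)
```
        </preformat>
        <p>Under these assumptions, summing the scores over all observed pairs gives the mutual information between the two variables in bits (shannons), while summing only the positive scores gives an aggregate of the pairs that attract each other.</p>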
      </sec>
      <sec id="sec-2-5">
        <p>Pairs with a positive wPMI score indicate relevance (i.e., mutual dependence) between their coordinates – the higher the score, the better a hypernym describes a subject concord class. The aggregation of positive scores allows us to quantify and compare the semantic cohesion of subject concord classes given their descriptions – we present these in Table 4. We also present the top 20 extracted descriptors along with their scores in Appendix A. The sum total of positive mutual information between extracted descriptors and subject concord classes under this weighting scheme is approximately 0.26 shannons, suggesting a moderate bidirectional dependency between the two.</p>
        <p><sup>6</sup> A ‘native’ WordNet would be a better fit for the task, but no mature Swahili version exists as of the time of writing.</p>
        <p><sup>7</sup> The scaling helps alleviate the ‘rare event’ bias of vanilla PMI.</p>
      </sec>
      <sec id="sec-2-6">
        <title>4. Analysis</title>
        <p>For several classes, our experimental results are congruent with the hypotheses of Contini-Morava [14] and Moxley [15], inter alia. Concretely:</p>
        <p>• Subject concord class a-/wa- is associated with humans, causal agents and animacy; the class is the most semantically coherent and categorically defined; the classifier can accurately predict it, and its taxonomic descriptors are well-pronounced.</p>
        <p>• Subject concord class u- predominantly refers to abstract concepts; the class is the second easiest to predict, and has the most homogeneous description.</p>
        <p>• Subject concord class u-/i- is mostly associated with plants; it is the third easiest class to predict, but predictions are already getting somewhat unreliable.</p>
        <p>• Subject concord class i-/zi- is semantically disparate; its descriptors are heterogeneous and carry relatively low scores. This disparity is consistent with the class’ characterization as a ‘residual catchall category’ [8, 14] where loanwords are often assigned [23]. The only standout descriptor relates the class to human-made objects, but the same descriptor dominates also classes li-/ya- and ki-/vi-.<sup>8</sup> Indeed, the model struggles to tell these three classes apart.</p>
        <p>In addition to experimentally affirming existing hypotheses, our approach also yields novel insights and artifacts. With respect to ya- and i-, the macro-level summary of these two understudied classes reveals an as-of-yet undocumented pattern: both classes lack a singular-plural paradigm, and contain concepts broadly categorized as abstractions, albeit of different kinds. This observation may support the correlation between uncountability and abstract meanings noticed in other languages [24, 25]; doing so would however require a thorough examination of these nouns’ properties.</p>
        <p>From a high-level perspective, we have chosen to isolate the first few highest-ranked semantic components of each class. This ensures backwards compatibility with the literature, but is also a very radical simplification. In reality, our descriptions are fine-grained enough to allow semantically distinguishing between any two classes, even when their primary descriptors overlap. Case in point, i-/zi-, ki-/vi- and li-/ya- have all been reduced to ‘human-made objects’; yet the three are actually very different, having only 2 (out of a total of 41) descriptors in common. Moreover, a descriptor is not just a (weighted) concept in isolation, but inherits also the expansive structure of the underlying WordNet it came from. In that sense, our approach does not only describe nominal classes with WordNet synsets, but dually also decorates the WordNet graph with nominal class weights.</p>
        <p><sup>8</sup> Describing li-/ya- and ki-/vi- as human-made objects is in partial alignment with the literature. The two are respectively associated with ‘augmentative’ and ‘diminutive’ meanings [15] and, by extension, with big or small objects [14].</p>
      </sec>
      <sec>
        <title>5. Conclusions</title>
        <p>We explored the relation between semantics and nominal class assignment in Swahili. We approached the question from two complementary computational angles. Verifying first the presence of a relation using supervised learning, we then sought to explicate its nature using unsupervised topic modeling. Starting from a blank slate and without any prior interpretative bias, our methodology rediscovered go-to theories of Swahili nominal classification, while also offering room for further insights and explorations. Our work is among the first to tackle Bantu nominal assignment computationally, and the first to focus exclusively on semantics. Our methodology is typologically unbiased and computationally accessible, allowing for an easy extension to other languages, under the sole requirement of a dictionary. We make our scripts and generated artifacts publicly available at https://github.com/konstantinosKokos/swa-nc.</p>
        <p>We leave several directions open to future work. We have experimented with a single dataset, a single model and a single lexical database; varying either of these coordinates and aggregating the results should help debias our findings. We have only looked for semantic generalizations across hyperonymic taxonomies – looking at other kinds of lexical relations might yield different semantic observations. Our chosen metric of relevance is by construction limited to first-order pairwise interactions, failing to account for exceptional cases or conditional associations. Finally, we had to resort to computational acrobatics through English in order to access necessary tools and resources. This is yet another reminder of the disparities in the pace of ‘progress’ of language technology, and a call for the computational inclusion of typologically diverse languages.</p>
      </sec>
      <sec>
        <title>6. Acknowledgments</title>
        <p>We are grateful to Joost Zwarts and to three anonymous reviewers for their helpful feedback.</p>
      </sec>
      <sec>
        <title>References</title>
        <p>[1] B. Wald, Swahili and the Bantu languages, in: B. Comrie (Ed.), The major languages of South Asia, the Middle East and Africa, Routledge, London, 2018, pp. 903–924.</p>
        <p>[2] F. Katamba, Bantu nominal morphology, in: D. Nurse, G. Philippson (Eds.), The Bantu languages, volume 103, Routledge, London, 2003, p. 120.</p>
        <p>[3] P. Spinner, J. A. Thomas, L2 learners’ sensitivity to semantic and morphophonological information on Swahili nouns, International Review of Applied Linguistics in Language Teaching 52 (2014) 283–311.</p>
        <p>[4] R. M. Dixon, Noun classes, Lingua 21 (1968) 104–125.</p>
        <p>[5] A. E. Meeussen, Bantu grammatical reconstructions, Africana linguistica 3 (1967) 79–121.</p>
        <p>[6] M. Guthrie, Comparative Bantu, volume 2, Gregg, 1971.</p>
        <p>[7] I. Richardson, Linguistic evolution and Bantu noun class system, in: G. Manessy, A. Martinet (Eds.), La Classification Nominale Dans Les Langues Négro-Africaines, Centre national de la recherche scientifique, 1967, pp. 373–390.</p>
        <p>[8] S. Zawawi, Loan words and their effect on the classification of Swahili nominals, Brill Archive, 1979.</p>
        <p>[9] J. P. Denny, C. A. Creider, The semantics of noun classes in Proto-Bantu, in: C. G. Craig (Ed.), Noun classes and categorization, John Benjamins Publishing Company, 1986.</p>
        <p>[10] M. Krifka, Swahili, in: J. Jacobs, A. von Stechow, W. Sternefeld, T. Vennemann (Eds.), Syntax. An International Handbook of Contemporary Research, De Gruyter, Berlin, 2005, pp. 1397–1418.</p>
        <p>[11] L. Marten, Noun Classes and Plurality in Bantu Languages, in: P. C. Hofherr, J. Doetjes (Eds.), The Oxford Handbook of Grammatical Number, Oxford University Press, 2021.</p>
        <p>[12] P. M. Wilson, Simplified Swahili, Longman Nairobi; London, 1985.</p>
        <p>[13] J. F. Safari, Swahili Made Easy: A Beginner’s Complete Course, Mkuki na Nyota; Dar es Salaam, 2012.</p>
        <p>[14] E. Contini-Morava, Noun classification in Swahili, Virginia: Publications of the Institute for Advanced Technology in the Humanities, University of Virginia (1994). URL: http://www2.iath.virginia.edu/swahili/swahili.html.</p>
        <p>[15] J. L. Moxley, Semantic structure of Swahili noun classes, in: I. Maddieson, T. J. Hinnebusch (Eds.), Language history and linguistic description in Africa, Africa World Press Inc, 1998, pp. 229–238.</p>
        <p>[16] W. Ng’ang’a, Word sense disambiguation of Swahili: Extending Swahili language technology with machine learning, Ph.D. thesis, University of Helsinki, 2005.</p>
        <p>[17] J. Olstad, Noun class assignment in Swahili via Bayesian probability, Cambridge Scholars Publishing, 2012, pp. 180–194.</p>
        <p>[18] J. Byamugisha, Noun class disambiguation in Runyankore and related languages, in: Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 4350–4359.</p>
        <p>[19] P. Gage, A new algorithm for data compression, The C Users Journal 12 (1994) 23–38.</p>
        <p>[20] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, in: E. Blanco, W. Lu (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 66–71. URL: https://aclanthology.org/D18-2012. doi:10.18653/v1/D18-2012.</p>
        <p>[21] W. Wang, H. Bao, S. Huang, L. Dong, F. Wei, Minilmv2: Multi-head self-attention relation distillation for compressing pretrained transformers, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 2140–2151.</p>
        <p>[22] G. A. Miller, Wordnet: a lexical database for English, Communications of the ACM 38 (1995) 39–41.</p>
        <p>[23] T. C. Schadeberg, Loanwords in Swahili, in: M. Haspelmath, U. Tadmor (Eds.), Loanwords in the world’s languages: A comparative handbook, De Gruyter Mouton Berlin, 2009, pp. 76–102.</p>
        <p>[24] G. Katz, R. Zamparelli, Quantifying count/mass elasticity, in: Proceedings of the 29th West Coast Conference on Formal Linguistics, 2012.</p>
        <p>[25] H. Husić, On abstract nouns and countability, Ph.D. thesis, Ruhr-Universität Bochum, 2020.</p>
      </sec>
      <sec id="sec-2-7">
        <p>Taxonomic description of nominal classes. Scores are multiplied by <inline-formula><tex-math>100\,p_{\mathrm{c}}(c)^{-1}</tex-math></inline-formula> to enhance legibility and facilitate direct numerical comparison across classes. Bold face scores indicate higher mutual information. Grayed out descriptors are hyponyms of at least one other descriptor with a higher score.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>