=Paper= {{Paper |id=Vol-2155/courtin |storemode=property |title=Establishing a Language by Annotating a Corpus: The Case of Naija, a Post-creole Spoken in Nigeria |pdfUrl=https://ceur-ws.org/Vol-2155/courtin.pdf |volume=Vol-2155 |authors=Marine Courtin,Bernard Caron,Kim Gerdes,Sylvain Kahane }} ==Establishing a Language by Annotating a Corpus: The Case of Naija, a Post-creole Spoken in Nigeria== https://ceur-ws.org/Vol-2155/courtin.pdf
                          Establishing a Language by Annotating a Corpus:
                          the Case of Naija, a Post-creole Spoken in Nigeria

                     Marine Courtin¹, Bernard Caron², Kim Gerdes³, Sylvain Kahane¹
                                 ¹ Modyco, Université Paris Nanterre & CNRS
                                 ² Llacan, CNRS / IFRA Ibadan, CNRS
                                 ³ LPP, Université Sorbonne Nouvelle & CNRS
                                 marine.courtin@sorbonne-nouvelle.fr, bernard.caron@cnrs.fr,
                                           kim@gerdes.fr, skahane@parisnanterre.fr

                                                               Abstract
In this paper, we show that building a treebank can be used as a way to establish a language. Annotated corpora can be used as tools
when arguing that some linguistic data belongs to a separate language (rather than a dialect or variety of another established language).
We provide here a case study on a treebank of Naija, a post-creole spoken in Nigeria, which presents us with significant differences
from treebanks of English in terms of existing constructions and frequency of several syntactic units.

Keywords: Naija, Nigerian Pidgin, Treebank, Quantitative Linguistics, Typology

              1.    The Situation of Naija
Spoken by educated Nigerians, the Nigerian post-creole has been shown by Deuber (2005) to be developing
in Lagos as a discrete language, separate from Nigerian English. This language, which we propose to call
Naija, is now spoken as a second language by over 100 million speakers all over Nigeria, a country of
180 million people, where about 450 native languages are spoken, with three dominating languages (Igbo,
Yoruba, and Hausa). This new language has taken on considerable economic and cultural importance in
Nigeria. Nevertheless, its speakers often consider it an inferior version of English (they call it
“Broken”) with a negative influence on Nigerian education. Most speakers are not conscious that, as a
separate language with its own grammar and lexicon, it has an outstanding potential in favor of national
cohesion, since it is perceived as ethnically neutral, and for regional integration, due to its
intercomprehension with Ghanaian and Cameroonian pidgins.
Considering the particular situation of this language, building a syntactic treebank takes on a
particular significance. Of course, as for any language, a treebank can be useful for many applications,
such as the training of a syntactic parser. But here the treebank helps us to establish the existence of
Naija as a language separate from (Nigerian) English, by showing constructions that are specific to Naija
(qualitative analysis) and constructions that are over-represented in Naija (quantitative analysis).

               2.    Tools and Workflow

The study is based on a 750,000 word corpus collected all around Nigeria. The transcription is a
scientific and political challenge in itself because most words stem from English, but some of them have
grammaticalized and are pronounced differently. We follow what is done in the (mostly informal) writing
of Naija: we keep the English spelling for lexical words, with exceptions for very frequent words such as
broda ‘brother’, and use a more phonetic spelling for grammatical terms (dem ‘them’, im ‘him’, sey,
complementizer, lit. ‘say’).

Naija also borrowed lexical items from other local languages, in particular ideophones such as
kpatakpata ‘completely’.

(1) sotay   di  rain sef  kuku     fall some house dem down kpatakpata
    so_that the rain EMPH commonly fall some house PL  down completely
    ‘So that, often, the rain completely destroys houses.’

We use the Arborator (Gerdes 2013) as the online annotation tool for POS and dependency annotation. The
Arborator’s exercise mode allows pre-annotated sentences to be presented as exercises to newly recruited
annotators. The Arborator integrates the Mate parser (Bohnet 2010), which can be retrained at any time
and thus allows for quick and easy bootstrapping of the annotation process.

In order to allow for typological comparison and distance measures on Naija, we use a surface-syntactic
dependency annotation scheme that is compliant with standard dependency annotation (e.g. prepositions as
governors) and thus easy to learn and to apply, but which allows for a lossless transformation into
Universal Dependencies (UD) by means of a graph rewriting process (Guillaume et al. 2012). Each treebank
for the 75 languages of the UD database must conform to the universal tagset for POS and dependency
relation names. Language idiosyncrasies have to be encoded as additional features next to the POS or as
subtypes of dependency relation names, e.g. in English the noun modifier (nmod) receives a subtype to
describe the Saxon genitive: “John[’s] <-nmod:poss- book”.

Currently the treebank has 12,000 tokens and is available on the UD webpage. We intend to manually annotate
100,000 tokens and then to automatically parse the whole corpus.

              3.   Qualitative Analysis
A good number of morphosyntactic specificities of Naija have called for an ongoing review of the
annotation scheme that was initially adopted for the language.

Some of these specificities are linked to the influence of adstrate vernacular languages belonging mainly
to the Niger-Congo family. This is the case of emphatic adverbial particles (e.g. sha, o), which are
tagged with the ADV POS label but whose function is characterized by the mod:emph dependency link. The
influence of adstrate vernacular languages is also observed in the use of Serial Verb Constructions, that
is “monoclausal construction[s] consisting of multiple independent verbs with no element linking them and
with no predicate-argument relation between the verbs” (Haspelmath 2016). Such constructions appear in
languages of Nigeria, such as Yoruba (Stahlke 1970) (see (2)), and it has already been shown that they
are present in creole languages.

(2) mo  mu   iwwe wa   ilwe        (Yoruba, Aubry 2010)
    1SG take book come home
    ‘I brought a book to my home’

We used the subtyped relation compound:svc for these constructions, which do not exist in English (see (3)).

Other specificities are linked to the emergence of hitherto undescribed structures which the corpus has
enabled us to identify. One of them is a focus structure where the focus particle na (which identifies
the clefted constituent) is doubled by the morpheme naim (which introduces the cleft clause). This
morpheme originates in the grammaticalization of the collocation na + im, lit. ‘it is’ + ‘him/it/her’.
The discovery of this new structure is the result of a collaborative analysis done by the team of
annotators during the production of the corpus.

The same ongoing grammaticalization process is observed in the formation of TAM auxiliaries, where full
lexical verbs (e.g. go ‘go’; come ‘come’; dey ‘exist’) coexist with their grammaticalized equivalents
(go, future; come, realis; dey, imperfective). Likewise, the verb make, which already appears in Serial
Verb Constructions to express the equivalent of the comitative case, is used as an auxiliary for converb
forms (e.g. dem want make e go church ‘they want him to go to church’). This flourishing
multifunctionality, typical of creole languages, creates challenges for the recognition of government.

             4.    Quantitative Analysis
In creoles, it is usually assumed that there is a division of labor between the lexifier language, which
provides the majority of the lexicon (in our case English), and substrate languages in areal contact with
the creole (in the case of Naija these might be Yoruba, Igbo and Hausa, for example). We attempt to show
quantitative evidence of structural similarities and differences between Naija and English.

One of our hypotheses concerning these differences is that information packaging (or communicative
structure) plays a larger role in Naija than in English. To explore this hypothesis we need an annotated
corpus at our disposal, as we need to measure the frequency of some structures (for example dislocations
and cleft sentences), rather than their strict presence or absence in the language. For this purpose, we
use all available treebanks of English in UD v2.1: UD_English-ParTUT (Bosco and Sanguinetti, 2014),
UD_English-LinES (Ahrenberg 2007), UD_English-EWT (Silveira et al., 2014), and the v2.2 version of
UD_Naija-NSC. We also parsed the Santa Barbara Corpus of Spoken American English (Du Bois et al.
2000-2005) to get a reference of what spoken English might look like in terms of the distribution of
syntactic relations.

The table below presents some of the interesting differences between (1) written English, (2) spoken
English and (3) spoken Naija:

        det      case     obl      dislocated   ccomp    aux      cc
(1)     9.4 %    10.6 %   5.8 %    0.0 %        1.1 %    4.2 %    3.7 %
(2)     6.7 %    6.6 %    4.2 %    ?            2.0 %    4.5 %    4.3 %
(3)     5.7 %    4.2 %    3.7 %    1.7 %        2.1 %    9.3 %    1.4 %

To test the significance of the observed differences in frequency counts, we applied a Fisher's Exact
Test for Count Data with simulated p-value (based on 2000 replicates), giving us an overall p-value of
0.0004998.

Some differences, such as the lower frequency of determiners, are easily explained. A Naija sentence such
as no dey stay for middle of road would not require definite determiners in front of middle or road,
while its English counterpart, don’t stay in the middle of the road, would.

Another variation concerns the frequency of auxiliaries, which are more than twice as frequent in Naija
as in English, regardless of the written/spoken distinction. We then looked at the ratio of verbs to
auxiliaries to see which language had more complex verbal constructions and found that Naija had the
highest score (which means fewer auxiliaries per verb on average).

        Verb / Auxiliaries ratio
(1)     1.9
(2)     1.8
(3)     2.0

Taking into account the fact that Naija also has the highest frequency of auxiliaries (9.3% against 4.2%
for written English and 4.6% for spoken English), we observe that Naija must compensate by having a high
frequency of verbs, which can be accounted for by the compound:svc, ccomp, acl:relcl and root relations.
If we look more precisely at the distribution of these auxiliaries, it appears that it is the auxiliaries
which are not shared with English
(dey, come, go, don, fit, for and neva) which are more frequent, while there is only one occurrence of
the shared auxiliary will.
The lower frequencies for both oblique and case relations are correlated: Naija seems to use fewer
oblique complements in favor of more direct objects. Locative complements can be expressed through Serial
Verb Constructions with the place as direct object of the second verb, as in (3).

(3) government worker dem go  dey  enter  go work
    government worker PL  FUT PROG get_on go work
    ‘government workers will be getting on to go to work’

This role would be filled by an oblique complement introduced by an adposition in English, as in the
example below:

(4)

Other differences do not show such clear-cut contrasts between English and Naija, but are still
interesting as they indicate areas which might need to be investigated further. We measure that 1.7 % of
all dependency relations¹ in the Naija treebank are labeled dislocated. With a mean sentence length of
around 10 tokens, this amounts to about 0.017 × 10 ≈ 0.17 dislocations per sentence, that is, on average
a dislocation in 1 sentence out of 6, which is very significant, even more so when compared to the
0.0004% frequency found in written English. Unfortunately our parser performs poorly on this relation
(due to the lack of training data) and no reliable frequency count of this relation type can be extracted
from the spoken English corpus. We therefore look at spoken French (which has the reputation of being
particularly prone to dislocations) to get a better sense of the significance of our findings, and find
that 1.0 % of dependency links are dislocated (in UD_French_Spoken, Lacheret et al., 2014). This
indicates that dislocation is a major feature of spoken Naija. However, the variation in frequency of
this dislocated link is not significantly greater between written English and spoken Naija than it is
between written and spoken French, which seems to suggest that this might very well be a product of the
genre rather than a characteristic of the language.²
This over-representation seems to apply to cleft sentences as well. The subtype :cleft, which we used in
the annotation of both UD_Naija and UD_French_Spoken, is found on 1.1 % of all relations in Naija, while
it is considerably less frequent in spoken French (0.2%).

Another interesting finding is that Naija also shows three times fewer coordinating conjunctions than
English does (1.4% for Naija against 3.7% and 4.3% for written and spoken English). This is interesting
as we would expect a higher frequency of coordinations in spoken texts, to accommodate lists and
reformulations, which are more common there. In Naija it is not uncommon to have several coordinations
without any coordinating conjunction, as in (5) [conjuncts are underlined].

(5) Lagos don follow see dis kind rain o wey uproot tree take am block road spoil dose big billboard
    dem […] comot di roof of plenty house dem.
    ‘Lagos has experienced the kind of rain where trees were uprooted and blocked the road, destroyed
    those big billboards […] and removed the roofing of lots of houses.’

This suggests that Naija might favor other strategies, such as juxtaposition, rather than coordinated
constituents linked with coordinating conjunctions.

We might also be interested in the differences in distribution of part-of-speech tags³ between English
and Naija.

          Fig 1. Relative frequency of POS tags in English

¹ punct links excepted.
² One reviewer also noted that some of the English corpora such as EWT were automatically converted from
  constituent treebanks using rule-based systems which often fail to identify dislocated constructions.
³ We filtered out tokens with PUNCT, X and SYM tags.
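
The relative frequencies compared in this section can be computed directly from treebanks in CoNLL-U format, the format in which UD treebanks are distributed. The sketch below is a minimal illustration and not the authors' actual script: the function name `relative_frequencies` is hypothetical, and the toy sentence carries an annotation invented for illustration (it is not taken from UD_Naija-NSC). The filtering mirrors the footnotes of this section: punct links are excluded from relation counts, and tokens tagged PUNCT, X or SYM are excluded from POS counts.

```python
from collections import Counter

UPOS, DEPREL = 3, 7  # 0-based positions of the UPOS and DEPREL columns


def relative_frequencies(conllu_text, column, excluded=()):
    """Return {label: share of counted tokens} for one CoNLL-U column.

    Comment lines, multiword-token ranges (IDs like "1-2") and empty
    nodes (IDs like "1.1") are skipped; labels listed in `excluded`
    are dropped before the shares are computed.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 10 or not fields[0].isdigit():
            continue
        label = fields[column]
        if label not in excluded:
            counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Hypothetical toy annotation, invented for illustration only.
TOY_ROWS = [
    ("1", "dat", "dat", "DET", "_", "_", "2", "det", "_", "_"),
    ("2", "man", "man", "NOUN", "_", "_", "4", "dislocated", "_", "_"),
    ("3", "im", "im", "PRON", "_", "_", "4", "nsubj", "_", "_"),
    ("4", "pull", "pull", "VERB", "_", "_", "0", "root", "_", "_"),
    ("5", "over", "over", "ADV", "_", "_", "4", "advmod", "_", "_"),
    ("6", ".", ".", "PUNCT", "_", "_", "4", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in TOY_ROWS)

pos_share = relative_frequencies(SAMPLE, UPOS, excluded={"PUNCT", "X", "SYM"})
rel_share = relative_frequencies(SAMPLE, DEPREL, excluded={"punct"})
for label, share in sorted(rel_share.items()):
    print(f"{label}\t{share:.1%}")
```

Running the same function over each treebank file and comparing the resulting shares label by label would give a table of the kind shown in this section, under the assumption that all treebanks use the same label inventory.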


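
The significance test reported earlier in this section (a Fisher's exact test with a p-value simulated from 2000 replicates) can be approximated by a Monte Carlo procedure on a contingency table of raw counts. The sketch below is an illustration under stated assumptions, not the authors' procedure: the function name `simulated_fisher_p`, the label-shuffling scheme and the toy counts are all hypothetical, since the paper reports percentages rather than raw counts. It draws random tables with the same row and column totals, ranks them by the table probability under fixed margins (proportional to 1 / ∏ n_ij!), and applies the (k + 1) / (B + 1) correction, which is also what R's `fisher.test(simulate.p.value = TRUE)` uses.

```python
import math
import random


def log_table_weight(table):
    """Log of 1 / prod(n_ij!), proportional to the probability of the
    table when row and column totals are held fixed."""
    return -sum(math.lgamma(n + 1) for row in table for n in row)


def simulated_fisher_p(table, replicates=2000, seed=0):
    """Monte Carlo p-value for independence in an r x c table of counts."""
    rng = random.Random(seed)
    n_cols = len(table[0])
    # One (row label, column label) pair per observation.
    row_labels = [i for i, row in enumerate(table) for n in row for _ in range(n)]
    col_labels = [j for row in table for j, n in enumerate(row) for _ in range(n)]
    observed = log_table_weight(table)
    as_extreme = 0
    for _ in range(replicates):
        rng.shuffle(col_labels)  # reshuffling keeps both margins fixed
        simulated = [[0] * n_cols for _ in table]
        for i, j in zip(row_labels, col_labels):
            simulated[i][j] += 1
        # Small tolerance so that ties count as "at least as extreme".
        if log_table_weight(simulated) <= observed + 1e-7:
            as_extreme += 1
    return (as_extreme + 1) / (replicates + 1)


# Hypothetical counts: rows are two corpora, columns are
# (tokens bearing some relation, all other tokens).
toy_counts = [[90, 10], [10, 90]]
p_value = simulated_fisher_p(toy_counts, replicates=2000)
print(p_value)  # no shuffled table is as extreme here, so p is the floor 1 / 2001
```

With 2000 replicates the smallest value this estimator can return is 1 / 2001 ≈ 0.0004998. If the reported p-value was obtained with this correction, it would mean that none of the simulated tables was as extreme as the observed one.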
          Fig 2. Relative frequency of POS tags in Naija

Naija has significantly more verbs while the English corpus is a lot richer in nouns. Part of the
over-representation of verbs in Naija can be attributed to Serial Verb Constructions, with verbs in the
second position representing 1.48 % of all tokens, but this account does not suffice to explain such a
gap. Investigating this disparity, we also measured other relations involving verbal dependents such as
ccomp. We find twice as many clausal complements in Naija, with 1.64 % and 0.82 % ccomp links in Naija
and English respectively. This indicates that looking at complex sentences in more detail might provide
us with additional examples of differences between the two languages.

We also expect that genre differences⁴ between the treebanks play an important part in this
distribution. Future work using a Nigerian English corpus of both spoken and written texts should allow
us to better determine the extent of differences due to genre and to the variety of English being
considered.

Interestingly enough, even though Naija allows the dropping of pronouns, they are still very frequent in
our corpus. One possible explanation is that pronouns are highly susceptible to repetition and
reformulation in spoken language. But it might also have to do with the frequent topicalization of
subjects through dislocation in Naija, as in (6), or with rhetorical devices which involve repeating the
pronoun to emphasize parallelism, as in (7).

(6) dat  man im pull  over
    that man he pulls over
    ‘that man pulls over’
(7) dem  go   bring am dem  go   seize am again.
    they will bring it they will seize it again
    ‘they will bring it and seize it again’

⁴ There is a small portion of spoken English in UD_English-LinES, but apart from this the corpus we used
  is all written texts, with variations in terms of genres (news, wiki, nonfiction, blog, emails, legal
  texts...). The Naija treebank is all spoken texts (conversations and interviews).

                   5.   Conclusion
Annotators who were speakers of Naija reported that throughout the annotation process, their vision of
Naija had changed. They noticed more readily that some syntactic phenomena were specific to Naija and
that there were complex rules which governed the Naija grammar. We believe this to be an interesting
pedagogical experiment where student annotators rediscover their language through the annotation of a
corpus, and are confronted with regularities and patterns that sometimes went unnoticed in their
day-to-day life (particularly so since speaking Naija is mostly disparaged).
We think that claims of Naija being a separate language can better be supported using a treebank. Indeed,
while lexical differences are certainly noticeable between Naija and English, we believe that the
identity of the language lies in its syntactic structure, which is not as easily accessible from raw text
or even from a tagged corpus. Having a treebank of Naija enables us to quantify the frequency of some
syntactic structures, which in turn helps us to evaluate the complexity and idiosyncrasies of the Naija
grammar, and to measure the distance the language has taken from English. Comparisons between the two
languages could also yield interesting insights concerning the ongoing creolization process of Naija.

Acknowledgments

We thank our reviewers for valuable remarks and corrections. This work is supported by the French
National Research Agency (ANR) with the project NaijaSynCor.

References

Ahrenberg, L. (2007). "LinES: An English-Swedish Parallel Treebank". Proceedings of the 16th Nordic
  Conference of Computational Linguistics (NODALIDA, 2007).
Aubry, N. (2010). Changements syntaxiques dans le Yorùbá de la presse (1930-2010) : traitement
  automatique d'un corpus diachronique et analyse des résultats. PhD thesis, Inalco.
Bohnet, B. (2010). "Very high accuracy and fast dependency parsing is not a contradiction." Proceedings
  of the 23rd international conference on computational linguistics. Association for Computational
  Linguistics.
Bosco, C. and Sanguinetti, M. (2014). "Towards a Universal Stanford Dependencies parallel treebank". In
  Proceedings of the 13th Workshop on Treebanks and Linguistic Theories (TLT-13), Tübingen (Germany).
Deuber, D. (2005). Nigerian Pidgin in Lagos: Language contact, variation and change in an African urban
  setting. Battlebridge Publications.
Du Bois, John W., Wallace L. Chafe, Charles Meyer, Sandra A. Thompson, Robert Englebretson, and Nii
  Martey. (2000-2005). Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia:
  Linguistic Data Consortium.
Gerdes, K. (2013). "Collaborative dependency annotation." Proceedings of the second international
  conference on dependency linguistics (DepLing 2013).
Guillaume, B., Bonfante, G., Masson, P., Morey, M. and Perrier, G. (2012). "Grew: un outil de réécriture
  de graphes pour le TAL (Grew: a Graph Rewriting Tool for NLP) [in French]." Proceedings of
  JEP-TALN-RECITAL.
Haspelmath, M. (2016). The serial verb construction: Comparative concept and cross-linguistic
  generalizations. Language and Linguistics, 17(3), 291-319.
Jansen, B., Koopman, H., Muysken, P. (1978). Serial verbs in the creole languages. Amsterdam Creole
  Studies 2. 125–159.
Lacheret, A., Kahane, S., Beliao, J., Dister, A., Gerdes, K., Goldman, J. P., Tchobanov, A. (2014).
  Rhapsodie: a prosodic-syntactic treebank for spoken French. In Language Resources and Evaluation
  Conference.
Silveira, N., Dozat, T., de Marneffe, M., Bowman, S., Connor, M., Bauer, J., and Manning, C. (2014).
  "A Gold Standard Dependency Corpus for English." LREC.