A two-stage contagious Naive Bayes classifier for
    detecting sociolinguistic features in text

                  Iena Petronella Derks¹ and Alta de Waal¹,²
               ¹ Department of Statistics, University of Pretoria
               ² Center for Artificial Intelligence Research (CAIR)


1    Introduction

Online platforms allow users to disguise their identity, making virtual
interactions anonymous or misleading to their recipients. This anonymity also
facilitates cybercrime, allowing users to take advantage of others and
commit heinous acts. An important concern with social media usage, in particular,
is the safety of under-age users who have access to the Internet.
Children are more vulnerable to threatening situations, such as harassment [3],
cyberbullying [7], and inappropriate conversations [8]. Natural language process-
ing (NLP) techniques can be used to process and understand social media data
[1]. In the area of sociolinguistics, there is evidence that links natural word use to
personality and social fluctuations [5]. In NLP, the term burstiness is used to de-
scribe the tendency of word recurrence. The burstiness phenomenon is frequently
exhibited in real text, in which an informative word is more likely to occur if it
has already appeared in the text [2]. State-of-the-art NLP models, such as the
multinomial Naive Bayes model, are often used to model text documents [4].
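
Burstiness as described above can be checked empirically. The following is a
minimal sketch only: the helper name recurrence_rates and the toy corpus are
invented for illustration and do not come from the data sets used in this paper.
It compares how often a word appears in a document at all with how often it
appears again once it has already appeared.

```python
def recurrence_rates(docs, word):
    """Return (P(word appears), P(word appears at least twice | it appears)).

    docs is a list of token lists. Both values are simple relative
    frequencies over documents.
    """
    counts = [doc.count(word) for doc in docs]
    seen = [c for c in counts if c >= 1]
    p_appears = len(seen) / len(counts)
    p_again = sum(c >= 2 for c in seen) / len(seen) if seen else 0.0
    return p_appears, p_again


# Toy corpus (hypothetical, for illustration only).
docs = [
    "the film was great great great".split(),
    "the plot was thin".split(),
    "the cast was fine".split(),
    "a great film with a great cast".split(),
]

p_appears, p_again = recurrence_rates(docs, "great")
print(p_appears, p_again)
```

If tokens were drawn independently, the second value would typically be smaller
than the first for an infrequent word; a much larger second value is a sign of
bursty behaviour.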


2    Methodology

One application area of NLP is sociolinguistics, which can be defined as the
study of the relationship between social factors and language. Sociolinguistics
aims to isolate features that determine linguistic variation under different
social conditions [6]. This paper investigates classification models that can
capture the burstiness, or contagious effects, of text, since the texts we are
interested in are manifestations of different social groups of people. For
example, a teenager and an adult impersonating a teenager will use different
sentence structures. More formally defined, the
purpose of this work is to:

 1. Learn the linguistic patterns among different social groups of people, and
    classify unknown authors according to these patterns; and
 2. Represent these patterns as Bayesian networks to gain an understanding of
    the dependency structure of words used among different social groups.

To identify these linguistic patterns, a comparison is made between two
classification techniques, namely the Naive Bayes (NB) classifier and a contagious
counterpart thereof. The NB classifier assumes that words occurring in a
document are independent of each other. The contagious classifier, on the other
hand, captures the burstiness phenomenon. To go one step further, we investigate
the dependencies between words using a Bayesian network. This allows us to
understand why certain word patterns result in a particular classification.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)
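
The contrast between the independence assumption of NB and a contagious model
can be illustrated with a small sketch. The form of the contagious classifier is
not specified here, so the code assumes, purely for illustration, a Dirichlet
compound multinomial (Polya urn) model of the kind discussed in [4]. The function
names, parameter values and toy documents are hypothetical.

```python
from math import lgamma, log


def multinomial_loglik(counts, theta):
    """Log-probability of a word sequence with these counts when every
    token is drawn independently with word probabilities theta."""
    return sum(c * log(theta[w]) for w, c in counts.items())


def dcm_loglik(counts, alpha):
    """Log-probability of the same sequence under a Dirichlet compound
    multinomial (Polya urn) with pseudo-counts alpha."""
    n = sum(counts.values())
    a0 = sum(alpha.values())
    ll = lgamma(a0) - lgamma(a0 + n)
    for w, c in counts.items():
        ll += lgamma(alpha[w] + c) - lgamma(alpha[w])
    return ll


vocab = ["a", "b", "c", "d"]
theta = {w: 0.25 for w in vocab}  # assumed uniform word probabilities
alpha = {w: 0.25 for w in vocab}  # assumed small pseudo-counts (bursty regime)

bursty = {"a": 4}                          # one word repeated four times
spread = {"a": 1, "b": 1, "c": 1, "d": 1}  # four different words

print(multinomial_loglik(bursty, theta), multinomial_loglik(spread, theta))
print(dcm_loglik(bursty, alpha), dcm_loglik(spread, alpha))
```

With these uniform parameters the independent model scores both toy documents
the same, whereas the Polya urn model assigns a higher probability to the
document that repeats a single word. In a classifier, one such likelihood would
be fitted per class and combined with the class prior.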


2.1   Data Application
This paper presents a comparison between the baseline NB classifier and the
proposed contagious counterpart thereof. Two data sets will be used to evaluate
the performance of each method, namely the IMDB data set and the PAN 2012
data set. The IMDB data set consists of movie reviews labelled for binary
sentiment classification. The PAN 2012 data set, which contains 66 927
conversations, was originally used to identify potential predators in online
conversations. The problem is addressed with a two-stage solution: stage 1 is
based on text classification techniques, and stage 2 makes use of Bayesian
networks to understand the structural dependencies among words in a document.
Stage 1 is evaluated on typical classification performance, whereas the visual
structure learning of the Bayesian network in stage 2 supports exploratory data
analysis of the conditional dependencies between words.
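
The kind of word dependency that stage 2 explores can be hinted at with a small
sketch. The structure learning algorithm is not specified here, so the code
below is only a stand-in: it ranks word pairs by the empirical mutual information
of their presence indicators, which is the edge score used when learning a
tree-structured Bayesian network with the Chow-Liu approach. The function names
and the toy documents are hypothetical.

```python
from itertools import combinations
from math import log


def mutual_information(docs, w1, w2):
    """Empirical mutual information (in nats) between the presence of w1
    and the presence of w2, where docs is a list of sets of words."""
    n = len(docs)
    mi = 0.0
    for a in (False, True):
        for b in (False, True):
            joint = sum((w1 in d) == a and (w2 in d) == b for d in docs) / n
            pa = sum((w1 in d) == a for d in docs) / n
            pb = sum((w2 in d) == b for d in docs) / n
            if joint > 0:
                mi += joint * log(joint / (pa * pb))
    return mi


def rank_word_pairs(docs, vocab):
    """Return (score, word, word) tuples, strongest dependency first."""
    pairs = combinations(sorted(vocab), 2)
    return sorted(((mutual_information(docs, u, v), u, v) for u, v in pairs),
                  reverse=True)


# Toy documents (hypothetical): "school" and "homework" always co-occur.
docs = [
    {"school", "homework"},
    {"school", "homework", "movie"},
    {"movie"},
    {"weather"},
]
ranked = rank_word_pairs(docs, {"school", "homework", "movie"})
for score, u, v in ranked:
    print(f"{u} - {v}: {score:.3f}")
```

Pairs with a high score are candidates for an edge in the network; a full
structure learning procedure would additionally decide edge directions and
guard against spurious dependencies in small samples.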


References
1. Chowdhury, G.G.: Natural language processing. Annual review of information sci-
   ence and technology 37(1), 51–89 (2003)
2. Doyle, G., Elkan, C.: Accounting for burstiness in topic models. In: Proceedings of
   the 26th Annual International Conference on Machine Learning. pp. 281–288. ACM
   (2009)
3. Kennedy, G., McCollough, A., Dixon, E., Bastidas, A., Ryan, J., Loo, C., Sahay,
   S.: Technology solutions to combat online harassment. In: Proceedings of the first
   workshop on abusive language online. pp. 73–77 (2017)
4. Madsen, R.E., Kauchak, D., Elkan, C.: Modeling word burstiness using the
   Dirichlet distribution. In: Proceedings of the 22nd international conference on
   Machine learning. pp. 545–552. ACM (2005)
5. Pennebaker, J.W., Mehl, M.R., Niederhoffer, K.G.: Psychological aspects of natural
   language use: Our words, our selves. Annual review of psychology 54(1), 547–577
   (2003)
6. Spolsky, B., Widdowson, H., et al.: Sociolinguistics, vol. 1. Oxford University Press
   (1998)
7. Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., De Pauw, G.,
   Daelemans, W., Hoste, V.: Automatic detection and prevention of cyberbullying.
   In: International Conference on Human and Social Analytics (HUSO 2015). pp.
   13–18. IARIA (2015)
8. Yenala, H., Jhanwar, A., Chinnakotla, M.K., Goyal, J.: Deep learning for detecting
   inappropriate content in text. International Journal of Data Science and Analytics
   6(4), 273–286 (2018)