Predicting author characteristics of Arabic
          tweets through Author Profiling ?

              Isabella Karabasz, Paolo Cellini, and Gonzalo Galiana

                      Universitat Politècnica de València, Spain
                    {iska1, paocel, gongafor}@masters.upv.es


        Abstract. In this paper we present our team participation in the Au-
        thor Profiling Task for the APDA@FIRE-2019 competition, using the
        bag of words technique and including a search for additional indicative
        vocabulary, we trained a random forest model to categorize the age, gen-
        der, and dialect of authors of Arabic tweets.

        Keywords: Author Profiling · Arabic · Discriminatory Vocabulary ·
        Bag of Words · Tweets


1     Introduction

Author Profiling is a method that is used to study the use of language through
written text with the objective of collecting information about the author and
identifying characteristic traits.
    Nowadays, there is a tremendous amount of technology that is used to cre-
ate and share text; moreover, there is a widespread availability to numerous
platforms where every day millions and millions of texts are created and shared
in digital format. Thanks to the liberty that allows us all to write whatever
they want, many of users publish their daily thoughts or actions for every one to
read and interpret, thereby building for themselves a separate digital identity. As
such, it is becoming increasingly relevant to develop a system to hold individuals
accountable for their actions online, just as in the physical world.
    The acquisition of user information can be used for many different objec-
tives. This task is motivated by the aim to improve cyber-security by detecting
potentially threatening messages and identifying their author. Understanding
how individuals of a certain age, gender and origin think and use language is a
key part in fulfilling the following task. Nonetheless, to get the necessary data,
we need to design and use algorithms that can learn recognize characteristics of
each class of author, and this will be the objective of this project.
    To achieve this objective, the social media platform Twitter will be used as
a source for the recollection of texts from people of the Arabic world, with the
?
    Copyright © 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15
    December 2019, Kolkata, India.
2       Karabasz, Cellini, Galiana

purpose to analyze some users. This will be done through the design, imple-
mentation and use of Machine Learning algorithms, that utilize the method of
Author Profiling to try to predict characteristics of each author based on his/her
writing features.

2     Task Description
The APDA (Author Profiling And Deception Detection In Arabic) task is sepa-
rated into two subtasks, of which our team chose to take part in Author Profiling
[2]. Given a set of Arabic tweets, the goal of the task is to design and implement
and algorithm in the R programming language to predict the gender, age, and
dialect of the authors of these texts.
     The given training corpus consists of a total of 2250 authors and 100 tweets
per author (in XM L format), whereby each author is assigned three labels:
gender, age and dialect (in a separate text file). The gender category is a binomial
class with labels ‘Male’ and ‘Female’. The age category is divided into three
classes: ‘Under’, ‘Between’, and ‘Above’, where the class ‘Under’ represent ages
under 25 years, the class ‘Between’ represent ages between 25 and 34 years
and the class ‘Above’ represents ages above 35 years, inclusively. Finally, the
dialect category is divided into fifteen classes, such as as ‘Sudan’, ‘Morocco’, and
‘Algeria’, just to name a few. By studying how language is shared by people, we
can learn to distinguish authors of different categories.

3     Proposed Approach
We propose an approach that builds on the Bag of Words technique as a baseline
model for each category. From there, variations to the baseline model were made
by adding additional features to encounter particular characteristics for each
category class.

3.1   Bag of Words
The Bag of Words is a technique used Natural Language Analysis where texts
are vectorized into a matrix representation for more effective computation. In
the matrix, there is one row per author and one column per word. The words
are based on a vocabulary created by calculating the N most frequent words of
the entire corpus. This matrix is then used as input for a selection of machine
learning models such as Random Forest, Support Vector Machines and Decision
Trees, all of which are offered in the caret package in R.

3.2   Discriminant Vocabulary
One of the key modifications done to the baseline model was the addition of a
discriminatory vocabulary for each category. That is, a list of words that are
perhaps less frequent and not included in the original model but offer conclusive
information about the gender, age, or dialect of the author.
                                                     Author Profiling   3

   Once the Bag Of Words and Discriminant Vocabulary techniques have been
presented, below is the suggested workflow of our approach.


                        Fig. 1. Proposed Workflow.
4      Karabasz, Cellini, Galiana

    The use of discriminatory words provides the bag of words with some vocab-
ulary that may have been excluded, but with high information gain. Their low
frequency but exclusive use serves to make definite predictions on the age, gender
or dialect of the author. Below are some examples included in each discriminant
vocabulary.
    In the case of predicting dialect, we have selected some words that are unique
to several regions of the Arab world [1]. For example, the list of words below
all mean yes but are used in distinct regions; Egypt, the United Arab Emirates
and Iraq respectively.


                 Fig. 2. Arabic word for yes in different dialects.


   In the case of the prediction of the gender and age, as we are working on
the Arabic language, we try to find the most common words based on the most
common topics in the Arabic world.
   For example, for the gender, we use words or phrases that only men or women
would say, for example ”my husband” would likely only be said by a female, and
”my wife” in the case of a male. Further, Arabic pronouns differ for males and
females, so some common pronouns were included as well. As for age, we used
words that have been found to be common blog topics for each age group [4].


                              Gender       Age
                            my husband homework
                              my wife      bored
                               us (f.)   apartment
                              us (m.)     marriage
Table 1. Some example words included in age and gender discriminant vocabularies.
                                                          Author Profiling      5

4     Experiments and Results

4.1   Machine Learning Models
The base model uses the SVM linear kernel provided in the caret package in
R, and yields a cross-validation accuracy of 84% for the dialect category. We
have also experimented using SVM Polynomial, decision trees and random forest
machine learning models. Despite taking considerable time to train, the random
forest method was truly worth the wait as it increased the accuracy by around
6% from the dialect category. We have also experimented using SVM Polynomial,
C5.0 decision trees and random forest machine learning models. Despite taking
considerable time to train, the random forest method was truly worth the wait
as it increased the accuracy by around 6% from the original SVM Linear model
for the prediction of dialect. Similar improvements occurred for the other two
categories.


4.2   Stopwords
Besides the inclusion of a discriminatory vocabulary, the inclusion of stopwords
was one of the first modifications done to the model. These are words that
occur in high frequency that offer little to no additional information about the
content of the text. The stopwords include months, days of the week, spelled out
numbers, prepositions of time and place, among many others to create a list of
750 words that are omitted when creating the bag of words vocabulary.


4.3   Absolute or Relative Frequency
The baseline bag of words uses the absolute frequency of each word in the feature
matrix. Alternative variations include replacing the absolute frequencies with
relative frequencies. That is the number of times a word occurs per author is
divided by the total number of words written by that author.


                       Classifier Frequency Accuracy
                        Dialect    Absolute    88.57%
                        Dialect    Relative    89.02%
                   Table 2. Absolute and Relative Frequencies


   When applied to the dialect classifier, the use of relative frequency increased
the cross validation accuracy by 1%
6      Karabasz, Cellini, Galiana

4.4   Discriminatory Emoticons

The use of emoticons adds character, style, and of course emotion to a tweet.
In fact, some of the most frequent words that ended up in the vocabulary were
emoticons. Here they are:


          Fig. 3. Some of the most common words are in fact emoticons.


   We decided to add some emoticons to the gender discriminatory vocabulary.
While some emoticons shown in Figure 3 above are likely to be used equally
by men and women, the one on the right is far more likely to be used by a
male than a female, due to the nature of the image. It is for this reason that
we added various emoticons to the gender discriminatory vocabulary. Below are
some examples of the emoticons selected.


           Fig. 4. Examples of selected gender discriminatory emoticons.


4.5   Length of Tweets and Number of Mentions
Besides the content of the tweet there are other quantitative characteristics that
can be analyzed as well. One of these is the length of tweets. For each tweet,
we calculated the word count per tweet and aggregated the results per author
by calculating their average word count. In the case of dialect and gender, there
were no notable differences between tweet lengths. However, for age, we can
observe a trend where younger authors write shorter tweets than older authors,
as shown in figure 5 below.
    In addition to calculating tweet length, we also calculated the number of
mentions per tweet. There were no striking differences between the classes of
any of the categories, besides the ”above” age class having very slightly more
mentions. Both columns, length of tweets and number of mentions were appended
to the matrix of features.
                                                               Author Profiling       7


                         Fig. 5. Tweet length trends by age.


5    Analysis

Overall, we were able to produce good results for the prediction of dialect and
gender. However, these high accuracies will be overshadowed in the final com-
bined results due to the low accuracy of the age models.


                           Classifier Accuracy (CV)
                            Dialect        90.22%
                            Gender         76.89%
                              Age          54.49%
                 Table 3. Cross Validation Results of Final Models


    It is logical that the best classifier is the dialect classifier. The models are al-
most entirely based on vocabulary; all the features in the matrix used to train the
models are words taken directly from the given corpus. Naturally, the variations
of dialect were most easily detected using this method.
    Some other vocabulary based experiments that could have been applied to
the other categories is a Parts of Speech analysis. It is known that age plays a
factor in the way the author conjugates their verbs, young people look to the
future and older people are more retrospective. On the other hand, in the case
of classifying gender, males use more determiners and women more pronouns.
Applying knowledge from these past studies could have had a positive impact
on the accuracies of our models [3].
8      Karabasz, Cellini, Galiana

5.1   Future Work

In future projects, it would be imperative to include further experiments in
the search for an optimal model. Among these, we suggest n-grams and term
frequency-inverse document frequency (TF-IDF).


n-grams n-grams is an additional modification that can be done to the bag
of words model. It implies splitting the corpus not strictly into words but into
groups of words or characters of varying quantities. We attempted the imple-
mentation word 2-grams. That means the corpus was divided into pairs of words
and the Bag of Words was created based on the frequency of these word pairs.
The computation of this model was highly time consuming, making it too costly
to pursue.


TF-IDF ‘Term Frequency-Inverse Document Frequency’ is another technique
worth exploring in future developments of the task. It measures the uniqueness
of terms within documents compared to others. That is, if a word is highly
recurrent with the tweets of one author, and has a low frequency is almost all
other documents, then it holds relevant value for the classification of the author
in whose tweets this term appears. This would be particularly useful in the
development and application of a discriminant vocabulary for each classifier.
                                                            Author Profiling       9

6   Conclusion

During the development of this task, the most challenging part of the process
was the fact that the tweets were in Arabic. While creating the discriminant
vocabularies, we were guided by our biased intuition when selecting some topics
for gender and age. We cannot know for sure if they are the most representative
words and topics due to our unfamiliarity with Arabic culture. Depending solely
on the corpus given to research common topic, we could run the risk of over-
fitting the models. This is why it would be necessary to dedicate more time to
the study of the Arabic world to get a better understanding of the nuances of
the cultures in order to improve the results of the task.
     The being said, despite the difficulties in programming with an Arabic corpus,
the fact that we could not understand the text has given us incentive to approach
the task with a more analytic perspective. Just as machines do not understand
the content of a corpus until it is processed using NLP techniques and automated
learning, we performed this task without using the corpus as a safety net for
immediate validation. We were forced to trust the process and the code; our
process depended purely on the results of our models and on the understanding
of the bag of words technique.


References
1. A. Randa. How different are arabic dialects? 2019.
2. F. Rangel, P. Rosso, A. Charfi, W. Zaghouani, B. Ghanem, and J. Snchez-Junquera.
   Overview of the track on author profiling and deception detection in arabic. In:
   Mehta P., Rosso P., Majumder P., Mitra M. (Eds.) Working Notes of the Forum
   for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings.
   CEUR-WS.org, Kolkata, India, December 12-15.
3. P. Rosso. Author profiling and fake reviews identification. PRHLT Research Center,
   Universitat Politècnica de València, 2019.
4. J. Schler, M. Koppel, S. Argamon, and J. Pennebaker. Effects of age and gender on
   blogging. American Association for Artificial Intelligence, 2005.