Author Profiling in Arabic Tweets: An Approach Based
       on Multi-Classification with Word and Character
                            Features

     Yutong Sun1, Hui Ning1, Kaisheng Chen3, Leilei Kong2,*, Yunpeng Yang2, Jiexi
                              Wang2 and Haoliang Qi2
                     1 Harbin Engineering University, Harbin, China
                     2 Heilongjiang Institute of Technology, Harbin, China
                     3 East China Normal University, Shanghai, China
                                 Kongleilei1979@gmail.com


        Abstract. This paper addresses the author profiling task of FIRE 2019
        (Forum for Information Retrieval Evaluation), which requires automatically
        identifying the age, gender, and language variety of the authors of Arabic
        tweets. We treat author profiling as a multi-class classification problem.
        We extract TF-IDF-weighted word and character features and train a logistic
        regression classifier to predict the labels. In the final evaluation, our
        method performs well on age prediction, with an accuracy of 0.6250, and
        obtains accuracies of 0.5111 and 0.9694 for gender and language variety,
        respectively. In our experiments, we tested different feature combinations
        and tuned the feature parameters. Combining word and character features
        improves prediction accuracy for age and language variety.

        Keywords: Author Profiling, Logistic Regression, Word and Character N-grams.


1       Introduction

With the continuous development of social media, research on author profiling has
made significant progress [1, 2, 3]. The author profiling task is to identify aspects
of a user's profile such as age, gender, and language variety, among others. We
formalize author profiling as a multi-class classification problem and use word
features, character features, or their combination to train the classifier. To select
effective features, we use a TF-IDF-based method to filter them. The proposed model
is based on a Logistic Regression classifier that uses word unigram features,
character features from bigrams to 4-grams, and their combinations to predict the
labels. The final evaluation results show that the model reaches an accuracy of
0.6250 for age prediction, 0.9694 for language variety, and 0.5111 for gender.


* Corresponding author
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.


2       Methods

2.1    Preprocessing

First, we read and parsed all .xml documents and combined each author's tweets into
a single text. Second, because the proposed method extracts features from the text
vocabulary, we filtered out non-text content in the documents, such as @ mentions,
emoticons, and URLs. Third, we normalized the text by removing unnecessary spaces,
tabs, and punctuation.
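
The filtering and normalization steps above can be sketched in Python as follows. This is an illustration only, not necessarily the implementation used in the system: the regular expressions, the emoji code point ranges, and the function names clean_tweet and build_author_document are assumptions, and XML parsing is omitted because the file layout is not described here.

```python
import re
import unicodedata

# Assumed patterns; the exact filters used in the system are not specified.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage (pictographs, dingbats, variation selector).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def clean_tweet(text):
    """Remove URLs, @ mentions, emoji, and punctuation; collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # Replace every character in a Unicode punctuation category (P*) by a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # split() without arguments also removes tabs and repeated spaces.
    return " ".join(text.split())


def build_author_document(tweets):
    """Join the cleaned tweets of one author into a single text."""
    cleaned = (clean_tweet(t) for t in tweets)
    return " ".join(t for t in cleaned if t)
```

For example, build_author_document(["hello, world", "@a http://x.y"]) returns the single text "hello world", since the second tweet contains only a mention and a URL.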

2.2    Experimental Methods

Following the success of the author profiling system in [4], we built our system on
a classification model. Gender prediction is a binary classification problem, while
age and language variety prediction are multi-class classification problems.
   We compared the Logistic Regression (LR) classifier with the Linear SVC
classifier and found LR to be simpler and more stable in performance, so we chose
LR as the final classifier. For feature selection, we used character features from
bigrams to 4-grams, word unigram features, and combinations of these, and applied
the TF-IDF method to extract more representative features. Given a term, we
calculate its TF, IDF, and DF values, combine TF and IDF into the feature weight,
and remove features whose DF value is lower than a predefined minimum or higher
than a predefined maximum.
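
The feature weighting and DF filtering described above can be illustrated with the following minimal Python sketch. It is written under stated assumptions rather than taken from the system: the IDF is the plain log(N / DF) form, min_df is read as an absolute document count and max_df as a fraction of documents, and the function names and the "w:" / "c:" prefixes are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """All character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def extract_terms(text):
    """Word unigrams plus character 2- to 4-grams, prefixed to avoid clashes."""
    words = ["w:" + w for w in text.split()]
    chars = ["c:" + g for g in char_ngrams(text)]
    return words + chars


def tfidf_vectors(docs, min_df=1, max_df=1.0):
    """Return one sparse {term: tf * idf} dict per document.

    Assumed conventions: min_df is a document count, max_df a fraction.
    """
    term_lists = [extract_terms(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))  # count each term once per document
    # Keep only terms whose document frequency lies inside the allowed band.
    vocab = {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df}
    vectors = []
    for terms in term_lists:
        tf = Counter(t for t in terms if t in vocab)
        vectors.append({t: n * math.log(n_docs / df[t]) for t, n in tf.items()})
    return vectors
```

With this plain IDF, a term that occurs in every document gets weight zero, which is one reason an upper DF bound is useful: such terms carry no discriminative information.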
   The process of our method is shown in Fig. 1.


              Fig. 1. The process of the proposed approach for Author profiling

First, we preprocessed the training data set. Second, we extracted the classification
features based on TF and IDF values; in the experiments, we combined character
features from bigrams to 4-grams with word unigram features. Third, we trained the LR
classifier (with all parameters at their default settings) on the filtered features.
Finally, we used the trained classifier to predict labels for the test data set.
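
To make the training and prediction steps concrete, the following Python sketch implements a small multinomial logistic regression (softmax regression) trained by stochastic gradient descent on sparse feature dictionaries. It is a stand-in for illustration, not the classifier used in the system, which relied on an existing LR implementation with default parameters. There is no regularization, and the function names, learning rate, epoch count, and toy features are all assumptions.

```python
import math


def class_scores(model, x):
    """Linear score per class for a sparse feature dict x."""
    classes, weights, bias = model
    return [bias[c] + sum(weights[c].get(f, 0.0) * v for f, v in x.items())
            for c in classes]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(vectors, labels, epochs=200, learning_rate=0.5):
    """Fit per-class weights by SGD on the cross-entropy loss."""
    classes = sorted(set(labels))
    weights = {c: {} for c in classes}
    bias = {c: 0.0 for c in classes}
    model = (classes, weights, bias)  # dicts are updated in place below
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            probs = softmax(class_scores(model, x))
            for c, p in zip(classes, probs):
                grad = p - (1.0 if c == y else 0.0)  # d(loss)/d(score of c)
                bias[c] -= learning_rate * grad
                for f, v in x.items():
                    weights[c][f] = weights[c].get(f, 0.0) - learning_rate * grad * v
    return model


def predict(model, x):
    """Return the class with the highest score."""
    classes = model[0]
    scores = class_scores(model, x)
    return classes[scores.index(max(scores))]


# Toy example with hypothetical features, one training document per age class.
train_vectors = [{"w:a": 1.0}, {"w:b": 1.0}, {"w:c": 1.0}]
train_labels = ["Under", "Between", "Above"]
model = train(train_vectors, train_labels)
```

The same code covers the binary gender task and the multi-class age and variety tasks, since the softmax formulation reduces to ordinary logistic regression when there are two classes.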


3           Experiments

3.1      Data Set Description

The corpus2 for this task consists of Arabic tweets labeled with age, gender, and
language variety. The data set is divided into five groups, each containing three
language varieties, all of which are varieties of Arabic. For gender classification,
there are two labels: male and female. The age label has three classes: Under (< 25),
Between (25-35), and Above (>= 35).
      By analyzing the corpus, we found that each label has the same number of
instances per class, so the data set is balanced.

3.2      Evaluation Measures

The performance of an author profiling approach is evaluated by joint accuracy.
Accuracy is defined as the ratio of the number of correct predictions Pc to the
total number of predictions Pt.
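
As a small worked example of these measures, the Python sketch below computes the per-label accuracy Pc / Pt and a joint accuracy. The joint definition is an assumption here: an author counts as correct only when gender, age, and variety are all predicted correctly. The function names and the labels variety_a and variety_b are hypothetical placeholders.

```python
def accuracy(predicted, gold):
    """Ratio of correct predictions (Pc) to total predictions (Pt)."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def joint_accuracy(predicted, gold):
    """Assumed definition: the full (gender, age, variety) triple must match."""
    return accuracy([tuple(p) for p in predicted], [tuple(g) for g in gold])


# Four hypothetical authors: (gender, age, variety).
gold = [("male", "Under", "variety_a"), ("female", "Between", "variety_b"),
        ("male", "Above", "variety_a"), ("female", "Under", "variety_b")]
pred = [("male", "Under", "variety_a"), ("female", "Above", "variety_b"),
        ("female", "Above", "variety_a"), ("female", "Under", "variety_b")]

gender_acc = accuracy([p[0] for p in pred], [g[0] for g in gold])   # 0.75
age_acc = accuracy([p[1] for p in pred], [g[1] for g in gold])      # 0.75
variety_acc = accuracy([p[2] for p in pred], [g[2] for g in gold])  # 1.0
joint_acc = joint_accuracy(pred, gold)                              # 0.5
```

Under this reading, the joint accuracy can never exceed the lowest of the three per-label accuracies, so one weak label bounds the joint score.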

3.3      Experimental Results

We split the training data into 80% for training and 20% for testing to observe the
effect of different feature combinations. The experimental results are shown in
Table 1.

                   Table 1. Experimental results with different feature combinations
    Features                                    Gender         Age             Variety
    word-unigram                                0.8123         0.5648          0.9236
    character-bigram                            0.7821         0.5417          0.8823
    word+char-bigram                            0.8046         0.5872          0.9405
    word+char-bigram-trigram                    0.8058         0.6235          0.9423
    word+char-bigram-trigram-4gram              0.8052         0.6148          0.9542


Table 1 shows that language variety prediction has the highest accuracy, up to about
95%. For age, combining word and character features works better than using either
alone. For gender, the results differ little across feature sets; the word unigram
feature is slightly better than the others.
    Table 2 reports the final evaluation results of the top three teams and our
team. In the yutong.2 run, we used the word unigram feature for gender and language
variety and the word + char-bigram-trigram combination for age. The age
classification accuracy is shown in Table 3.


2 https://www.autoritas.net/APDA/corpus/


                                Table 2. The final evaluation results
Team                                     Gender         Age               Variety       Joint
DBMS-KU.2(Top 1)                         0.7944         0.5861            0.9722        0.4556
Nayel.1(Top 2)                           0.8153         0.5708            0.9750        0.4486
Nayel.3(Top 3)                           0.8014         0.5792            0.9708        0.4486
Yutong.2(Our team)                       0.5111         0.6250            0.9694        0.3125


                          Table 3. The age accuracy of the top five groups
                                         Age Group Ranking
    Team         Yutong.2       Yutong.3       Yutong.1       DBMS_KU.2      DBMS_KU.3
    Accuracy     0.6250         0.6000         0.5875         0.5861         0.5819


Table 4 compares the effects of the parameters min_df and max_df (the lower and upper
document frequency thresholds) in the TFIDF model.

                          Table 4. Results of different parameter values
    min_df        max_df                 Gender             Age                     Variety
    4             0.7                    0.7770             0.6041                  0.9310
    4             0.8                    0.7772             0.6043                  0.9312
    4             0.9                    0.7838             0.6154                  0.9322
    5             0.7                    0.7921             0.6126                  0.9410
    5             0.8                    0.7944             0.6224                  0.9412
    5             0.9                    0.8058             0.6235                  0.9423


4           Conclusions

This paper presents a method based on multi-class classification with word and
character features for author profiling in Arabic tweets. We chose word features,
character features, and their combinations as features and trained an LR classifier.
The final evaluation results show that our best-performing run uses word unigrams
for gender and language variety and the word + char-bigram-trigram combination for
age. We obtained accuracies of 0.6250, 0.5111, and 0.9694 for age, gender, and
language variety, respectively. In future work, we will consider extracting features
from non-text content to further improve performance.


Acknowledgments

This research was supported by the Social Science Fund of Heilongjiang Province of
China (No.18TQB103).


References
1. Marquardt James, et al.: Age and Gender Identification in Social Media. In: CEUR
   Workshop Proceedings, vol.1180, pp. 1129-1136 (2014).
2. Michał Meina, Karolina Brodzińska, Bartosz Celmer, Maja Czoków, Martyna Patera,
   Jakub Pezacki, Mateusz Wilk: Ensemble-based Classification for Author Profiling Using
   Various Features - Notebook for PAN at CLEF 2013. In: CLEF 2013 Evaluation Labs and
   Workshop - Working Notes Papers. Valencia, Spain (2013).
3. A. Pastor López-Monroy, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor-
   Pineda, Esaú Villatoro-Tello: Using Intra-Profile Information for Author Profiling -
   Notebook for PAN at CLEF 2014. In: CLEF 2014 Evaluation Labs and Workshop -
   Working Notes Papers. Valencia, Spain (2014).
4. Sharmila Devi V, Kannimuthu S, Ravikumar G, Anand Kumar M:
   KCE_DAlab@MAPonSMS-FIRE2018: Effective Word and Character-based Features for
   Multilingual Author Profiling. In: Working Notes for MAPonSMS at FIRE'18 - Workshop
   Proceedings of the 10th International Forum for Information Retrieval Evaluation,
   pp. 213-222. Gujarat, India (2018).
5. Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem, B., Sánchez-Junquera,
   J.: Overview of the track on author profiling and deception detection in Arabic. In:
   Mehta P., Rosso P., Majumder P., Mitra M. (Eds.) Working Notes of the Forum
   for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings,
   CEUR-WS.org, Kolkata, India, December 12-15 (2019).