Author Profiling for Arabic Tweets based on n-grams

Ayoub Abbassi1, Seifeddine Mechti2, Lamia Hadrich Belguith1, and Rim Faiz3

1 ANLP Group MIRACL Laboratory, FSEGS, University of Sfax
2 LARODEC Laboratory, ISG of Tunis, 2000 Le Bardo, Tunisia
3 LARODEC Laboratory, ISG of Tunis IHEC, 2016 Carthage, Tunisia

ayoub.abess@gmail.com, mechtiseif@gmail.com, l.belguith@fsegs.rnu.tn, Rim.faiz@ihec.rnu.tn

Abstract. This paper presents an approach for profiling unknown users from the texts they produce on social media. In particular, we address the identification of two profile dimensions, gender and language variety, of Arabic Twitter users based on their tweets. Our approach applies a meta-classification technique to features extracted from the tweet body. We explored two main sets of features: character n-grams and word n-grams. The proposed approach allowed us to reach promising results for both language variety and gender identification.

Keywords: Author profiling, Meta-classifier, N-gram features.

1 Introduction

The rapid growth of the Internet and of computer technology over the last two decades has confronted us with a rapidly increasing amount of online data. According to Internet Live Stats (http://www.internetlivestats.com/), Internet traffic amounts to about 36,411 GB per second. This data, most of it text, is shared, published and transmitted freely and sometimes anonymously. In fact, a significant portion of Internet users misrepresent themselves online, so there is a need to deal with data that comes from unknown sources.

Two main sectors are interested in knowing the potential source of data. The first is the commercial sector, where information about customers such as age, gender, nationality and native language is of high value for marketing intelligence. The second is the security sector, which bears the burden of protecting the Internet from offences such as plagiarism and identity theft.
Therefore, the research community encourages researchers to discover and develop effective methods and techniques in related fields such as plagiarism detection and author profiling. This work was carried out as part of our participation in the Author Profiling task of the PAN17 shared task (http://pan.webis.de/clef17/pan17-web/). In particular, we focus on identifying the gender and language variety of Arabic users from their tweets.

2 Dataset Description

We used the training dataset provided by PAN at CLEF 2017 to train our proposed system. We participated in the Arabic subtask of the author profiling task. The training dataset is composed of tweets annotated with the authors' gender and the specific variety of their native language. Detailed statistics of the dataset are given in Table 1.

Table 1: Distribution of data for the Arabic author profiling task in the PAN17 training corpus

Task               Class       Number of files
Language variety   Egypt       600
                   Gulf        600
                   Levantine   600
                   Maghrebi    600
Gender             Male        1200
                   Female      1200

As Table 1 shows, the training dataset is balanced across classes. However, the analysis reveals that some documents are written in Modern Standard Arabic rather than in one of the Arabic varieties [1], which can affect the performance of our system.

3 System Architecture

Our proposed system is divided into three steps: pre-processing, feature extraction and classification. First, in the pre-processing step, we prepare the input data for use in the next step. Then, in the feature extraction step, we extract the set of features that seem useful for the task. Finally, we generate the classification model, which is used to predict the class of a new document.

3.1 Pre-processing

The input dataset is composed of tweets, which are noisy by nature and contain a lot of useless data such as links, tags and emoticons. Thus they cannot be exploited directly.
The idea is to remove this noise. However, instead of handling each kind of noise separately, we simply extracted the Arabic text. The example below shows a tweet before and after the pre-processing step.

Example:
Input tweet: "#thanx @alaakarmus 😘 😜 كان في تحدي ع سؤال وانا ربحت حصلت شكالطه https://t.co/UySVCM1qwm https://t.co/wKBUpGXmZo"
Tweet after extracting the Arabic text: "كان في تحدي ع سؤال وانا ربحت حصلت شكالطه"

3.2 Feature extraction

We extracted two n-gram feature types, namely character n-grams and word n-grams. Accordingly, we generated two sets of features for each input document. Each individual feature is weighted by its term frequency (TF) and its inverse document frequency (IDF), and the documents are then represented as a TF-IDF matrix.

Given a text extracted from tweets, the set of n-grams is obtained by sliding a window of n units (characters or words) across the text body. For example, with the word as the unit, the word n-grams are all sequences of n consecutive words in the text. For the previous tweet "كان في تحدي ع سؤال وانا ربحت حصلت شكالطه", the word n-gram model is illustrated in Table 2.

Table 2: Example of word n-gram model

N-gram model          Example
Word-based 1-grams    كان, في, تحدي, ع, سؤال, . . .
Word-based 2-grams    كان في, في تحدي, تحدي ع, . . .
Word-based 3-grams    كان في تحدي, في تحدي ع, . . .

3.3 Classification

Once the documents have been transformed into their new representation, they are used as input to train the classifier. Training the classifier is the key step of this work: we apply a meta-classification technique known as 'stacking' [2] to generate the final model, which is used to predict the class of an unlabelled document.
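The pre-processing of Section 3.1 and the n-gram features of Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the implementation used in our experiments: the function names (extract_arabic, word_ngrams, char_ngrams, tfidf) are hypothetical, Arabic text is assumed to be any run of characters in the Unicode block U+0600 to U+06FF, and the weighting is assumed to be the plain tf * log(N / df) form without smoothing.

```python
import math
import re
from collections import Counter

# Assumption: a run of characters from the basic Arabic Unicode block
# (U+0600-U+06FF) counts as Arabic text. The block also contains Arabic
# punctuation and digits, which this sketch keeps.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+")


def extract_arabic(tweet):
    """Keep only the Arabic tokens of a tweet, joined by single spaces."""
    return " ".join(ARABIC_RUN.findall(tweet))


def word_ngrams(text, n):
    """Slide a window of n words across the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Slide a window of n characters across the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(docs_ngrams):
    """Weight each n-gram of each document by tf * log(N / df).

    docs_ngrams holds one list of n-grams per document; the result is one
    {ngram: weight} dictionary per document (a sparse TF-IDF matrix).
    """
    n_docs = len(docs_ngrams)
    doc_freq = Counter()
    for grams in docs_ngrams:
        doc_freq.update(set(grams))  # count each n-gram once per document
    rows = []
    for grams in docs_ngrams:
        term_freq = Counter(grams)
        rows.append({g: c * math.log(n_docs / doc_freq[g])
                     for g, c in term_freq.items()})
    return rows


if __name__ == "__main__":
    # Shortened version of the example tweet from Section 3.1.
    tweet = "#thanx @alaakarmus كان في تحدي ع سؤال https://t.co/UySVCM1qwm"
    text = extract_arabic(tweet)
    print(word_ngrams(text, 2))
    print(char_ngrams(text, 3)[:5])
```

With this weighting, an n-gram that occurs in every document receives a weight of zero, so only n-grams that discriminate between documents contribute to the representation.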
Stacking consists of combining several base classifiers of different types. In our case, we use three widely used machine-learning algorithms (support vector machines, decision trees and Naive Bayes) [3]. The predictions of the base classifiers serve as input to a meta-level classifier, which produces the final prediction. The principle of this technique is illustrated in Figure 1.

Figure 1: Stacking principle

4 Results

We carried out several series of experiments in order to evaluate the performance of the classifiers mentioned above, individually and combined, using different sets of features. Table 3 and Table 4 show the results of our experiments.

Table 3: Language variety results (accuracy, %) on the PAN17 training dataset

Classification technique      Word n-grams   Character n-grams   Combined
Individual   Decision trees   27.1           28.0                29.2
             Naive Bayes      26.0           26.5                27.3
             SVM              29.0           28.2                31.0
Combined     Stacking         34.0           31.3                33.0

Table 4: Gender results (accuracy, %) on the PAN17 training dataset

Classification technique      Word n-grams   Character n-grams   Combined
Individual   Decision trees   55.0           56.0                57.0
             Naive Bayes      56.2           54.2                56.0
             SVM              58.1           57.0                59.3
Combined     Stacking         61.1           59.0                63.2

For the gender dimension, the best accuracy is 59.3 for an individual classifier, obtained using SVM, and 63.2 using stacking as the classification technique. Both results are obtained by combining all features. They confirm our earlier finding [4] that SVM outperforms other learning algorithms on author profiling problems. For language variety, however, word n-grams outperformed both character n-grams and the combination of the two, with an accuracy of 34.0. This result is obtained by combining the classifiers through stacking.

5 Conclusion

In this paper we described our approach to profiling Twitter users, based on a meta-classifier trained on n-gram features. In particular, we focused on the identification of the gender and language variety of Arabic users.
We found that combining the n-gram features in a meta-classification process allowed us to achieve higher accuracy on both tasks. The best results are obtained using word n-grams for language variety detection and using all features combined for gender detection.

References

1. A. Farghaly and K. Shaalan, "Arabic natural language processing: Challenges and solutions", ACM Transactions on Asian Language Information Processing (TALIP), vol. 8, no. 4, 2009.
2. S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, "Supervised machine learning: A review of classification techniques", pp. 3-24, 2007.
3. K. Vandana and M. Namrata, "Text classification and classifiers: A survey", Artificial Intelligence & Applications, vol. 3, no. 2, 2012.
4. S. Mechti, A. Abbassi, L. H. Belguith, and R. Faiz, "An empirical method using features combination for Arabic native language identification", in Proceedings of the 13th International Conference of Computer Systems and Applications (AICCSA), 2016.