Multi-lingual Author Profiling Using Stylistic Features
Abdul Sittar*, Iqra Ameer*
*COMSATS University Islamabad, Lahore Campus, Pakistan
abdulsittar72@gmail.com Iqraameer133@gmail.om
Abstract. Author profiling is the identification of an author’s traits by examining the text written by that author. Author profiling has many useful applications in security, criminal investigation, marketing, identification of false accounts on shared communication websites, etc. We submitted our system to FIRE'18-MAPonSMS (Multi-lingual Author Profiling on SMS), a shared task to classify attributes of an author, such as gender and age group, from multi-lingual text, specifically English + Roman Urdu. Roman Urdu is a common language, particularly in SMS messaging, Facebook posts and comments, chats, blogs, games, etc. Our system is based on 29 different stylistic features. On the training dataset, we achieved the best Accuracy = 73.714% for gender using all 14 language-independent features together and Accuracy = 58.571% for age group using all 29 features together. On the test data, we obtained Accuracy = 0.55 for gender and 0.37 for age.
Keywords: Author profiling, Multi-lingual text, Machine Learning, Roman Urdu, Stylistic.
1 Introduction
Author profiling (AP) is the task of determining a writer’s traits, such as age, gender, profession, personality type, and mother language, by analyzing the written document. Due to the volume of information on social networks, it has become essential to classify users’ characteristics. This task has diverse applications in forensics, security, and marketing [1]. For example, in forensic and counter-terrorism applications, it can reduce the search space for a suspicious writer. From a marketing point of view, these facts can be essential to predict and target specific customers and to form new strategies according to consumers’ interests and preferences.
Recent trends in the domain include multi-lingual AP [2], where a multi-lingual document is characterized by the “occurrence of more than two languages in a text document” [3]. Multi-lingual AP settings match the requirements of real-life security applications, in which the texts under examination may be written by their authors in a mixture of different languages.
Following this emerging field, the FIRE’2018 shared task on age and gender identification in SMS-based author profiles (MAPonSMS) provided training and testing corpora composed of multi-lingual (English + Roman Urdu) SMS-based documents.
Since the provided training data was multi-lingual, one of our primary objectives was to determine how our proposed technique performs on multi-lingual text. Author profiling is a supervised document classification task. We carried out different experiments, i.e., (i) using all 29 stylistic features, (ii) using all 14 language-independent stylistic features, and (iii) using individual language-independent stylistic features.
In recent years, deep-learning methods [4], such as word, character, and document embedding approaches [5], have been used for this specific problem; still, linear models perform well, as they seem to be more robust in picking up stylistic information in the author’s writing. We therefore applied frequently used linear machine-learning (ML) approaches.
The rest of the paper is organized as follows. Section 2 discusses related work in this domain. Section 3 details our approach for the FIRE’2018 shared task. Section 4 describes the experimental setup, and Section 5 presents the results and their analysis. Section 6 concludes the paper.
2 Related Work
The International PAN Competitions have made remarkable progress in author profiling, especially for the gender and age identification tasks [1, 6, 7, 8, 9].
In PAN-2017, 22 teams participated, and most of them used traditional machine-learning algorithms [9], such as Logistic Regression [3, 13] or SVM [10, 11, 12]. Some of them applied deep-learning techniques, especially word and character embeddings [4, 13, 16], which are considered competitive techniques, but their results are still not up to the mark for the author profiling task.
Although researchers are attracted to multi-lingual settings, only one study in the literature considers the same-genre author profiling task for multi-lingual text. Fatima et al. [2] worked on a multi-lingual corpus of Roman Urdu and English Facebook posts and comments for same-genre author profiling. Content-based and stylistic features were explored in this study, and 10-fold cross-validation was used for evaluation. They achieved Accuracy = 0.875 on the multi-lingual corpus with the word uni-gram, char 3-gram, and char 8-gram content-based approaches. Using the word bigram content-based approach, they obtained an accuracy of 0.750 for the age group classification task.
3 Proposed Approach
3.1 Stylistic Feature Set
Three broad categories of methods have been used for automatic identification of an author’s traits: (1) content-based methods, which aim to detect characteristics of a writer from the content of the text; (2) stylistic methods, which try to predict a writer’s demographic traits by analyzing the writer's writing style; and (3) topic-based methods, which classify characteristics of an author using the topics discussed in the text.
For the FIRE’18-MAPonSMS author profiling competition, our system (see footnote 1) is based on different stylistic features. This year's FIRE’18-MAPonSMS training data consists of multi-lingual SMS messages, i.e., Roman Urdu and English. This systematic investigation therefore aims to identify language-independent stylistic features that are likely to perform well on multi-lingual text.
The language-independent features are as follows: Avg. Word Length, Avg. Sentence Length, %age of Words with Six and More Letters, %age of Words with Two and Three Letters, %age of Question Sentences, %age of Semicolons, %age of Punctuations, %age of Comma, %age of Short Sentences, %age of Long Sentences, %age of Capitals, %age of Colons, %age of Digits, and %age of Full Stop. The following features, however, are not language independent: Avg. Syllables Per Word, %age of Pronouns, %age of Prepositions, %age of Coordinating Conjunctions, %age of Articles, %age of Words with One Syllable, %age of Words with Three Plus Syllables, %age of Adjectives, %age of Determiners, %age of Interjections, %age of Modals, %age of Nouns, %age of Personal Pronouns, %age of Verbs, and %age of Adverbs.
The features listed above aim to capture different stylistic facts from a multi-lingual text, which can be useful for uncovering the age group and gender of an unknown author.
4 Experimental Setup
The FIRE’18-MAPonSMS shared task focuses on the identification of two author attributes, (1) age and (2) gender, from multi-lingual text. The organizers provided a training dataset composed of multi-lingual (Roman Urdu and English) SMS text messages. There are two classes for gender (male, female) and three classes for the age group (15-19, 20-24, and 25-xx).
4.1 FIRE’18-MAPonSMS Training and Test Dataset for Author Profiling
In the training dataset, for gender classification, 210 text documents are labeled male and 140 female. For age group classification, 108 text documents are in the 15-19 age group, 176 are in the 20-24 group, and 66 are in the 25-xx group. In addition, 150 files were provided for the testing phase.
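As an illustration of the language-independent features listed in Section 3.1, the following is a minimal Python sketch and not the implementation used for the submission. The function name `stylistic_features`, the tokenization rules (sentences end at '.', '!' or '?'; words are runs of letters), and the choice of denominators (words, sentences, or non-space characters) are all assumptions made for this example, since the paper does not specify them. The short-sentence and long-sentence features are omitted because their length thresholds are not stated.

```python
import re
import string


def stylistic_features(text: str) -> dict:
    """Compute 12 of the 14 language-independent features (illustrative only)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a maximal run of letters (digits are excluded).
    words = re.findall(r"[^\W\d_]+", text)
    # Assumption: character-level percentages are taken over non-space characters.
    chars = [c for c in text if not c.isspace()]

    def pct(part: int, whole: int) -> float:
        # Percentage that is safe for empty input.
        return 100.0 * part / whole if whole else 0.0

    def char_pct(predicate) -> float:
        return pct(sum(1 for c in chars if predicate(c)), len(chars))

    n_words, n_sents = len(words), len(sentences)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / n_sents if n_sents else 0.0,
        "pct_words_six_plus_letters": pct(sum(len(w) >= 6 for w in words), n_words),
        "pct_words_two_three_letters": pct(sum(len(w) in (2, 3) for w in words), n_words),
        "pct_question_sentences": pct(sum(s.endswith("?") for s in sentences), n_sents),
        "pct_semicolons": char_pct(lambda c: c == ";"),
        "pct_punctuation": char_pct(lambda c: c in string.punctuation),
        "pct_commas": char_pct(lambda c: c == ","),
        "pct_capitals": char_pct(str.isupper),
        "pct_colons": char_pct(lambda c: c == ":"),
        "pct_digits": char_pct(str.isdigit),
        "pct_full_stops": char_pct(lambda c: c == "."),
    }
```

For the mixed Roman Urdu and English message "Kal milte hain? Ok, see you at 5.", this sketch yields an average sentence length of 3.5 words and 50% question sentences. None of the features requires a language-specific resource such as a part-of-speech tagger or a syllable counter, which is what makes them applicable to mixed-language text.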
1 The implementation (source code) of our approach is provided in a repository at https://github.com/abdulsittar/Multilingual-Author-Profiling
4.2 Evaluation Methodology
The author profiling classification problem is handled as a supervised ML problem. Age group detection is a multi-class problem whose objective is to classify the age into three groups: (1) 15-19, (2) 20-24, and (3) 25-xx. Gender categorization is a binary problem whose objective is to distinguish between two groups: (1) female and (2) male.
We used 10-fold cross-validation in our experiments to evaluate the performance of our model. We conducted our experiments with four different ML algorithms: Naive Bayes, J48, Random Forest, and Logistic Regression. The WEKA implementations of these algorithms were used. The scores produced by the stylistic features are supplied as input to the above-stated ML algorithms.
4.3 Evaluation Measure
As suggested by the organizers of the FIRE’18-MAPonSMS shared task, the performance of the submitted system for age and gender was measured using accuracy. Accuracy is defined as the proportion of correctly classified predictions $p_c$ out of all predictions $p_a$ made:
\[ \mathrm{Accuracy} = \frac{p_c}{p_a} \]
5 Results and Analysis
The following terminology is used in all the tables in this section. “Classifier” indicates the ML algorithm applied to generate the numeric scores (NB → Naive Bayes, RF → Random Forest, LR → Logistic Regression). Best results are highlighted in bold typeface. We performed three sets of experiments: (1) performance on all 29 features, (2) performance on all 14 language-independent features (see Section 3.1), and (3) performance on individual language-independent features (using every single feature).
Results on Training Dataset
Table 1 shows the scores using all 29 stylistic features for both tasks, i.e., age and gender.
Using all features together, we obtained the best results with Random Forest: Accuracy = 73.714 for gender and 53.142 for age group classification. Table 2 presents the results for all 14 language-independent stylistic features collectively. We achieved the best results for gender (Accuracy = 72.000) using the Random Forest classifier and for age (Accuracy = 58.571) using Logistic Regression.

Table 1. Results using all 29 stylistic features
Classifier  Age (Accuracy, %)  Gender (Accuracy, %)
NB          51.428             58.285
RF          56.571             72.000
LR          58.571             66.000
J48         49.142             70.571

Table 2. Results using all 14 language-independent features
Classifier  Age (Accuracy, %)  Gender (Accuracy, %)
NB          47.428             63.428
RF          53.142             73.714
LR          52.857             70.857
J48         46.285             69.428

Table 3. Results using language-independent features individually with Random Forest
Feature                                   Age (Accuracy, %)  Gender (Accuracy, %)
Avg. Word Length                          37.714             51.714
Avg. Sentence Length                      38.857             44.000
%age of Words with Six and More Letters   42.285             52.000
%age of Words with Two and Three Letters  37.142             51.142
%age of Question Sentences                41.714             56.285
%age of Semicolons                        50.185             64.285
%age of Punctuations                      35.428             64.857
%age of Comma                             47.142             52.857
%age of Short Sentences                   50.085             60.000
%age of Long Sentences                    50.285             60.000
%age of Capitals                          41.714             48.000
%age of Colons                            40.857             56.000
%age of Digits                            43.142             56.571
%age of Full Stop                         43.142             58.285

Table 3 displays the results using the 14 language-independent stylistic features individually (one feature at a time). RF performed better on both age and gender identification; therefore, we report the single-feature results for RF only. The best results are obtained when the single stylistic feature “%age of Punctuations” is used for gender (Accuracy = 64.857) and “%age of Long Sentences” for age (Accuracy = 50.285).
This indicates that, in multi-lingual SMS messages, one gender uses punctuation more than the other, while one of the age groups prefers longer sentences than the others.
Overall, regarding algorithms, the best scores for age group identification are obtained with two classification algorithms, RF and LR. For gender estimation, the best scores are achieved by the RF algorithm. This suggests that the RF algorithm is suitable when a collection of attributes is given as input in a classification problem.
5.1 Results on Test Dataset
We obtained Accuracy = 0.55 for gender and 0.37 for age on the test data, both of which are below the baseline (gender = 0.60, age = 0.51). The joint accuracy of our model for age and gender is 0.23.
6 Conclusion
Correct profiling of an unknown author is gaining importance for security, the investigation of criminal activities, and market research. In this paper, we presented our approach for the FIRE'18-MAPonSMS author profiling shared task on age and gender identification in multi-lingual text. We have shown how stylistic features and machine learning techniques enable an automatic system to determine different characteristics of an unknown author efficiently.
We considered stylistic features to uncover the traits of an author on a multi-lingual corpus. We implemented 29 stylistic features and performed three different sets of experiments, i.e., using all 29 features, using the 14 language-independent features together, and using single language-independent features. We observed that the best results are achieved for gender identification (Accuracy = 73.714) when using all 29 features together with Random Forest, and for the age group (Accuracy = 58.571) when using all 14 language-independent features with the Logistic Regression classifier.
References
1.
Rangel, F., Rosso, P., Potthast, M., Stein, B., and Daelemans, W. (2015). Overview of the 3rd author profiling task at PAN 2015. In CLEF http://fire.irsi.res.in/fire/2018/home, last accessed 2018/07/27.
2. Mehwish Fatima, Komal Hasan, Saba Anwar, Rao Muhammad Adeel Nawab (2017), "Multilingual author profiling on Facebook", Information Processing & Management, Elsevier, pp: 886 - 904, Vol: 53, Issue: 4, ISSN: 0306-4573.
3. Meylaerts, R., 2010. Multilingualism and translation. Handbook of translation studies 1, 227–230.
4. Sebastian Sierra, Manuel Montes-y-Gómez, Thamar Solorio, and Fabio A. González. 2017. Convolutional Neural Networks for Author Profiling. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
5. Ilia Markov, Helena Gómez-Adorno, Juan-Pablo Posadas-Durán, Grigori Sidorov, and Alexander Gelbukh. 2017. Author Profiling with Doc2vec Neural Network Based Document Embeddings. In Proceedings of the 15th Mexican International Conference on Artificial Intelligence, MICAI 2016, Vol. 10062. Part II, LNAI, Springer, Cancún, Mexico, 117–131.
6. Rangel, F., Rosso, P., 2013. On the Identification of Emotions and Authors’ Gender in Facebook Comments on the Basis of Their Writing Style. In: Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and Perspectives from AI (ESSEM 2013). Vol. 1096. CEUR Workshop Proceedings, Turin, Italy, pp. 34–46.
7. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G., 2013. Overview of the Author Profiling Task at PAN 2013. In: CLEF 2013 Evaluation Labs and Workshop Working Notes Papers. Valencia, Spain.
8. Rangel, F., Rosso, P., Potthast, M., Stein, B., 2017. Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings.
CLEF and CEUR-WS.org.
9. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B., 2016. Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations. In: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum. CEUR-WS.org, Évora, Portugal, pp. 750–784.
10. A. Pastor Lopez-Monroy, Manuel Montes-y-Gómez, Hugo Jair-Escalante, Luis Villaseñor Pineda, and Thamar Solorio. 2017. Social-Media Users can be Profiled by their Similarity with other Users. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
11. Ilia Markov, Helena Gómez-Adorno, and Grigori Sidorov. 2017. Language- and Subtask-Dependent Feature Selection and Classifier Parameter Tuning for Author Profiling. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
12. Eric S. Tellez, Sabino Miranda-Jiménez, Mario Graff, and Daniela Moctezuma. 2017. Gender and Language-Variety Identification with microTC. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
13. Andrey Ignatov, Liliya Akhtyamova, and John Cardiff. 2017. Twitter Author Profiling Using Word Embeddings and Logistic Regression. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
14. Matej Martinc, Iza Škrjanec, Katja Zupan, and Senja Pollak. 2017. PAN 2017: Author Profiling - Gender and Language Variety Prediction. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
15. Marc Franco-Salvador, Nataliia Plotnikova, Neha Pawar, and Yassine Benajiba. 2017. Subword-based Deep Averaging Networks for Author Profiling in Social Media.
In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.
16. Sebastian Sierra, Manuel Montes-y-Gómez, Thamar Solorio, and Fabio A. González. 2017. Convolutional Neural Networks for Author Profiling. In Working Notes Papers of the CLEF 2017 Evaluation Labs (CEUR Workshop Proceedings), Vol. 1866. CLEF and CEUR-WS.org, Dublin, Ireland.