BERT-Based Arabic Social Media Author Profiling

Chiyu Zhang and Muhammad Abdul-Mageed
Natural Language Processing Lab
The University of British Columbia
chiyuzh@mail.ubc.ca, muhammad.mageeed@ubc.ca

Abstract. We report our models for detecting age, language variety, and gender from social media data in the context of the Arabic author profiling and deception detection shared task (APDA) [32]. We build simple models based on pre-trained bidirectional encoders from transformers (BERT). We first fine-tune the pre-trained BERT model on each of the three tasks using the data released by the shared task. We then augment the shared task data with in-house data for gender and dialect, showing the utility of augmenting training data. Our best models on the shared task test data are acquired with a majority vote over various BERT models trained under different data conditions. We acquire 54.72% accuracy for age, 93.75% for dialect, 81.67% for gender, and 40.97% joint accuracy across the three tasks.¹

Keywords: author profiling, BERT, Arabic, social media

1 Introduction

The proliferation of social media has made it possible to collect user data in unprecedented ways. These data can come in the form of usage and behavior (e.g., who likes what on Facebook), network (e.g., who follows a given user on Instagram), and content (e.g., what people post to Twitter). The availability of such data has made it possible to make discoveries about individuals and communities, mobilizing social and psychological research and employing natural language processing methods. In this work, we focus on predicting social media user age, dialect, and gender based on posted language. More specifically, we use the total of 100 tweets from each manually labeled user to predict each of these attributes. Our dataset comes from the Arabic author profiling and deception detection shared task (APDA) [32].
We focus on building simple models using pre-trained bidirectional encoders from transformers (BERT) [12] under various data conditions. Our results show (1) the utility of augmenting training data, and (2) the benefit of using majority votes from our simple classifiers. In the rest of the paper, we introduce the dataset, followed by our experimental conditions and results. We then provide a literature review and conclude.

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

2 Data

For the purpose of our experiments, we use data released by the APDA shared task organizers. The dataset is divided into train and test by the organizers. The TRAIN set is distributed with labels for the three tasks of age, dialect, and gender. Following the standard shared task setup, the test set is distributed without labels, and participants were expected to submit their predictions on test. The shared task predictions are expected by the organizers at the level of users. The distribution has 100 tweets for each user, and each tweet is distributed with a corresponding user id. In total, the distributed training data has 2,250 users, contributing a total of 225,000 tweets. The official task test set contains 72,000 tweets posted by 720 users. For our experiments, we split the training data released by the organizers into a 90% TRAIN set (202,500 tweets from 2,025 users) and a 10% DEV set (22,500 tweets from 225 users). The age task labels come from the tagset {under-25, between-25 and 34, above-35}. For dialect, the data are labeled with 15 classes, from the set {Algeria, Egypt, Iraq, Kuwait, Lebanon-Syria, Libya, Morocco, Oman, Palestine-Jordan, Qatar, Saudi Arabia, Sudan, Tunisia, UAE, Yemen}. The gender task involves binary labels from the set {male, female}.
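Because the shared task is evaluated per user, the 90/10 split above must be performed over users rather than over tweets, so that all 100 tweets of a given user land in the same split. The following is a minimal sketch of such a split; the record layout and random seed are our own assumptions, not the organizers':

```python
import random

def split_by_user(tweets, dev_ratio=0.1, seed=42):
    """Split (user_id, text) records into TRAIN/DEV at the user level,
    so that every tweet of a given user falls into exactly one split."""
    users = sorted({t["user_id"] for t in tweets})
    random.Random(seed).shuffle(users)
    dev_users = set(users[:int(len(users) * dev_ratio)])
    train = [t for t in tweets if t["user_id"] not in dev_users]
    dev = [t for t in tweets if t["user_id"] in dev_users]
    return train, dev

# Toy example: 10 users with 2 tweets each -> 1 DEV user (2 tweets).
tweets = [{"user_id": u, "text": f"tweet {i}"} for u in range(10) for i in range(2)]
train, dev = split_by_user(tweets)
assert len(dev) == 2 and len(train) == 18
```

With the real data (2,250 users), the same procedure yields the 2,025/225 user split reported above.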
3 Experiments

As explained earlier, the shared task is set up at the user level, where the age, dialect, and gender of each user are the required predictions. In our experiments, we first model the task at the tweet level and then port these predictions to the user level. For our core modeling, we fine-tune BERT on the shared task data. We also introduce an additional in-house dataset labeled with dialect and gender tags, as we explain below. As a baseline, we use a small gated recurrent units (GRU) model. We now introduce our tweet-level models.

3.1 Tweet-Level Models

Baseline GRU. Our baseline is a GRU network for each of the three tasks, with the same network architecture across all three. Each network contains one unidirectional GRU layer with 500 units and an output linear layer, and is trained end-to-end. Our input embedding layer is initialized from a standard normal distribution with µ = 0 and σ = 1, i.e., W ∼ N(0, 1). We use a maximum sequence length of 50 tokens and choose an arbitrary vocabulary size of 100,000 types, where we use the 100,000 most frequent words in TRAIN. To avoid over-fitting, we use dropout [43] with a rate of 0.5 on the hidden layer. For training, we use the Adam [22] optimizer with a fixed learning rate of 1e-3 and a batch size of 32. We train the network for 15 epochs, saving the model at the end of each epoch and choosing the model with the highest accuracy on DEV as our best model. We present our best results on DEV in Table 1; we report all our results using accuracy. Our best model obtains 42.48% for age, 37.50% for dialect, and 57.81% for gender. All models obtain their best results within 2 epochs.

BERT. For each task, we fine-tune the BERT-Base Multilingual Cased model released by the authors [12].²
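The GRU baseline described above can be sketched as follows in PyTorch. The embedding dimension is not reported in the paper and is an assumption here, as are all names:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Sketch of the baseline: embedding -> one unidirectional GRU layer
    (500 units) -> dropout (0.5) -> linear output layer."""
    def __init__(self, vocab_size=100_000, emb_dim=300,
                 hidden=500, n_classes=15, dropout=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        nn.init.normal_(self.emb.weight, mean=0.0, std=1.0)  # W ~ N(0, 1)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, seq_len <= 50) token ids
        _, h = self.gru(self.emb(x))  # h: (1, batch, hidden), final state
        return self.out(self.drop(h.squeeze(0)))

model = GRUClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
logits = model(torch.randint(0, 100_000, (32, 50)))  # one batch of 32
assert logits.shape == (32, 15)  # 15 classes for the dialect task
```

For the age and gender tasks, only `n_classes` changes (3 and 2, respectively).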
The model was pre-trained on the Wikipedias of 104 languages (including Arabic), has 12 layers with 768 hidden units each and 12 attention heads, and contains 110M parameters in total. The vocabulary of the model consists of 119,547 shared WordPieces. We fine-tune the model with a maximum sequence length of 50 tokens and a batch size of 32. We set the learning rate to 2e-5 and train for 15 epochs. We use the same network architecture and parameters across the 3 tasks. As Table 1 shows, compared with the GRU baseline, BERT is 3.16% better for age, 4.85% better for dialect, and 2.45% better for gender.

Data Augmentation. To further improve the performance of our models, we introduce in-house labeled data that we use to fine-tune BERT. For the gender classification task, we manually label an in-house dataset of 1,100 users with gender tags, including 550 female users and 550 male users. We obtain 162,829 tweets by crawling the 1,100 users' timelines. We combine this new gender dataset with the gender TRAIN data (from the shared task) to obtain an extended dataset, to which we refer as EXTENDED Gender. For the dialect identification task, we randomly sample 20,000 tweets for each class from an in-house dataset gold-labeled with the same 15 classes as the shared task. In this way, we obtain 298,929 tweets (Sudan only has 18,929 tweets). We combine this new dialect data with the shared task dialect TRAIN data to form EXTENDED Dialect. For both the dialect and gender tasks, we fine-tune BERT on EXTENDED Dialect and EXTENDED Gender independently and report performance on DEV. We refer to this iteration of experiments as BERT EXT. As Table 1 shows, BERT EXT is 2.18% better than BERT for dialect and 0.75% better than BERT for gender.³

Table 1. Tweet-level results on DEV

          Age    Dialect  Gender
GRU       42.48  37.50    57.81
BERT      45.64  42.35    60.26
BERT EXT  -      44.53    61.01

3.2 User-Level Models

Our aforementioned models predict user attributes at the tweet level, rather than directly detecting the labels of a user.
Hence, we follow the work of Zhang & Abdul-Mageed [47] to identify user-level labels. For each of the three tasks, we use the tweet-level predicted labels (and their associated softmax values) as a proxy for user-level labels. For each predicted label, we use the softmax value as a threshold for including only the most confidently predicted tweets. Since in some cases softmax values can be low, we try all values between 0.00 and 0.99 and take a softmax-based majority class as the user-level predicted label, tuning the threshold on our DEV set. Using this method, we acquire the following results at the user level: BERT models obtain an accuracy of 55.56% for age, 96.00% for dialect, and 80.00% for gender. BERT EXT models achieve 95.56% accuracy for dialect and 84.00% accuracy for gender.

² https://github.com/google-research/bert/blob/master/multilingual.md
³ We note that it was not possible for us to use external age-labeled data, and hence we do not report on the age task in this data augmentation setting.

3.3 APDA@FIRE2019 Submission

First submission. For the shared task submission, we use the predictions of BERT EXT as our first submission for gender and dialect, but only BERT for age (since we have no BERT EXT models for age, as explained earlier). In each case, we acquire results at the tweet level first, then port the labels to the user level as explained in the previous section. For our second and third submitted models, we also follow this method of going from tweet to user level.

Second submission. We combine our DEV data with our EXTENDED Dialect and EXTENDED Gender data, for dialect and gender respectively, and train our second submissions for the two tasks. For the age second submission, we concatenate the DEV data to TRAIN and fine-tune the BERT model. We refer to the settings for our second submission models collectively as BERT EXT+DEV.

Third submission.
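The tweet-to-user porting used in all our submissions can be sketched as follows. Function names, the fallback when no tweet clears the threshold, and the tie-breaking behavior are our own assumptions:

```python
from collections import Counter

def user_label(tweet_preds, threshold):
    """Aggregate one user's tweet-level (label, softmax) pairs into a
    single user-level label: keep only tweets whose confidence meets
    `threshold`, then take the majority class among the kept tweets."""
    kept = [label for label, conf in tweet_preds if conf >= threshold]
    if not kept:  # threshold too strict: fall back to all predictions
        kept = [label for label, _ in tweet_preds]
    return Counter(kept).most_common(1)[0][0]

def best_threshold(users, gold):
    """Tune the confidence threshold on DEV, trying 0.00 .. 0.99."""
    candidates = [i / 100 for i in range(100)]
    def acc(t):
        return sum(user_label(p, t) == g for p, g in zip(users, gold)) / len(gold)
    return max(candidates, key=acc)

# Toy user with three predicted tweets: high-confidence "egypt" wins.
preds = [("egypt", 0.9), ("egypt", 0.8), ("iraq", 0.3)]
assert user_label(preds, 0.5) == "egypt"
```

The threshold found on DEV is then applied unchanged when porting test-set predictions to the user level.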
Finally, for our third submission, we use a majority vote of (1) our first submission, (2) our second submission, and (3) the predictions from our user-level BERT model. These majority-class models (i.e., our third submission) achieve our best results on the official test data. We acquire 54.72% accuracy for age, 81.67% accuracy for gender, 93.75% accuracy for dialect, and 40.97% joint accuracy.

Table 2. Results of our submissions on official test data (user level)

Exp.          Condition     Age    Dialect  Gender  Joint
Submission 1  BERT EXT      54.72  93.33    77.08   38.75
Submission 2  BERT EXT+DEV  54.72  92.64    81.67   40.97
Submission 3  MAJ CLASS     54.72  93.75    81.67   40.97

4 Related Work

Arabic. Arabic is a term that refers to a collection of languages, varieties, and dialects. The standard variety, Modern Standard Arabic (MSA), is the one usually used in formal communication and educational settings. Arabic also has a wide range of under-studied varieties and dialects that classically used to be categorized in a coarse-grained fashion (e.g., Levantine, North African) [17, 46, 1, 13, 2]. More recent treatments focus on fine-grained categorizations such as country and city levels [25, 39, 3, 40, 31, 47]. Differences between varieties of Arabic occur at various linguistic levels, including phonological, morphological, lexical, and syntactic [19, 6, 27, 1].

Social Media Author Profiling. Author profiling is the term usually used to refer to detecting a host of attributes of (often social media) users. These include attributes such as age, gender, educational level, economic class, stance or ideology [10, 30], personality [42, 23, 7, 18], moral traits [21, 28], and other sociological and psychological constructs. Author profiling based on text [15, 5] is rooted in computational stylometry [16, 44] and has traces in the early work of Holmes [20].
The task of author profiling has also been approached from a network perspective, where cues based on friending, following, mentioning, and commenting have been leveraged for identifying author attributes [24]. In addition, the PAN author profiling shared task [34, 33, 37, 36, 35] was established to advance related work. More information about PAN can be found in [29].

Age and Gender. A number of studies have been conducted on English-based age and gender detection, including [38, 14, 8, 45, 11]. Many of these works use feature engineering such as text n-grams and topic models [42, 41]. In these works, age is either cast as a multi-class classification task with, e.g., labels from the set {10-19, 20-29, 30-39}, or as a regression task [26]. Other works model age with classification and regression combined [9]. With rare exceptions [35, 4], we do not know of work on Arabic targeting age and gender.

5 Conclusion

In this work, we described our submitted models to the Arabic author profiling and deception detection shared task (APDA) [32]. We focused on detecting age, dialect, and gender using BERT models under various data conditions, showing the utility of additional, in-house data on the task. We also showed that a majority vote of our models trained under different conditions outperforms single models on the official evaluation. In the future, we will investigate automatically extending training data for these tasks as well as better representation learning methods.

6 Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), and Compute Canada (www.computecanada.ca).

References

1. Abdul-Mageed, M.: Subjectivity and sentiment analysis of Arabic as a morphologically-rich language. Ph.D. thesis, Indiana University (2015)
2. Abdul-Mageed, M.: Modeling Arabic subjectivity and sentiment in lexical space.
Information Processing & Management (2017)
3. Abdul-Mageed, M., Alhuzali, H., Elaraby, M.: You tweet what you speak: A city-level dataset of Arabic dialects. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) (2018)
4. Alrifai, K., Rebdawi, G., Ghneim, N.: Arabic tweeps gender and dialect prediction. In: CLEF (Working Notes) (2017)
5. Argamon, S.E.: Register in computational language research. Register Studies 1(1), 100–135 (2019)
6. Bassiouney, R.: Arabic sociolinguistics. Edinburgh University Press (2009)
7. Bleidorn, W., Hopwood, C.J.: Using machine learning to advance personality assessment and theory. Personality and Social Psychology Review p. 1088868318772990 (2018)
8. Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on Twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1301–1309. Association for Computational Linguistics (2011)
9. Chen, J., Cheng, L., Yang, X., Liang, J., Quan, B., Li, S.: Joint learning with both classification and regression models for age prediction. In: Journal of Physics: Conference Series. vol. 1168, p. 032016. IOP Publishing (2019)
10. Colleoni, E., Rozza, A., Arvidsson, A.: Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data. Journal of Communication 64(2), 317–332 (2014)
11. Daneshvar, S., Inkpen, D.: Gender identification in Twitter using n-grams and LSA. In: Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018) (2018)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Elaraby, M., Abdul-Mageed, M.: Deep models for Arabic dialect identification on benchmarked data.
In: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). pp. 263–274 (2018)
14. Flekova, L., Preoţiuc-Pietro, D., Ungar, L.: Exploring stylistic variation with age and income on Twitter. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). vol. 2, pp. 313–319 (2016)
15. Gamon, M.: Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In: Proceedings of the 20th International Conference on Computational Linguistics. p. 611. Association for Computational Linguistics (2004)
16. Goswami, S., Sarkar, S., Rustagi, M.: Stylometric analysis of bloggers' age and gender. In: Third International AAAI Conference on Weblogs and Social Media (2009)
17. Habash, N.Y.: Introduction to Arabic natural language processing. Synthesis Lectures on Human Language Technologies 3(1), 1–187 (2010)
18. Hinds, J., Joinson, A.: Human and computer personality prediction from digital footprints. Current Directions in Psychological Science p. 0963721419827849 (2019)
19. Holes, C.: Modern Arabic: Structures, functions, and varieties. Georgetown University Press (2004)
20. Holmes, D.I.: The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing 13(3), 111–117 (1998)
21. Johnson, K., Goldwasser, D.: Classification of moral foundations in microblog political discourse. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 720–730 (2018)
22. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
23. Matz, S.C., Kosinski, M., Nave, G., Stillwell, D.J.: Psychological targeting as an effective approach to digital mass persuasion. Proceedings of the National Academy of Sciences 114(48), 12714–12719 (2017)
24.
Mitrou, L., Kandias, M., Stavrou, V., Gritzalis, D.: Social media profiling: A panopticon or omniopticon tool? In: Proc. of the 6th Conference of the Surveillance Studies Network. Barcelona, Spain (2014)
25. Mubarak, H., Darwish, K.: Using Twitter to collect a multi-dialectal corpus of Arabic. In: Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP). pp. 1–7 (2014)
26. Nguyen, D., Smith, N.A., Rosé, C.P.: Author age prediction from text using linear regression. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. pp. 115–123. Association for Computational Linguistics (2011)
27. Palva, H.: Dialects: Classification. Encyclopedia of Arabic Language and Linguistics 1, 604–613 (2006)
28. Pang, D., Eichstaedt, J.C., Buffone, A., Slaff, B., Ruch, W., Ungar, L.H.: The language of character strengths: Predicting morally valued traits on social media. Journal of Personality (2019)
29. Potthast, M., Rosso, P., Stamatatos, E., Stein, B.: A decade of shared tasks in digital text forensics at PAN. In: European Conference on Information Retrieval. pp. 291–300. Springer (2019)
30. Preoţiuc-Pietro, D., Liu, Y., Hopkins, D., Ungar, L.: Beyond binary labels: Political ideology prediction of Twitter users. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 729–740 (2017)
31. Qwaider, C., Saad, M., Chatzikyriakidis, S., Dobnik, S.: Shami: A corpus of Levantine Arabic dialects. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) (2018)
32. Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem, B., Sánchez-Junquera, J.: Overview of the track on author profiling and deception detection in Arabic. In: Mehta, P., Rosso, P., Majumder, P., Mitra, M. (eds.) Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings.
CEUR-WS.org, Kolkata, India, December 12-15 (2019)
33. Rangel, F., Rosso, P., Chugur, I., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daelemans, W.: Overview of the 2nd author profiling task at PAN 2014. In: CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK, 2014. pp. 1–30 (2014)
34. Rangel, F., Rosso, P., Koppel, M., Stamatatos, E., Inches, G.: Overview of the author profiling task at PAN 2013. In: CLEF Conference on Multilingual and Multimodal Information Access Evaluation. pp. 352–365. CELCT (2013)
35. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. Working Notes Papers of the CLEF (2017)
36. Rangel, F., Rosso, P., Verhoeven, B., Daelemans, W., Potthast, M., Stein, B.: Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings/Balog, Krisztian [edit.]; et al. pp. 750–784 (2016)
37. Rangel Pardo, F.M., Celli, F., Rosso, P., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN 2015. In: CLEF 2015 Evaluation Labs and Workshop Working Notes Papers. pp. 1–8 (2015)
38. Rao, D., Yarowsky, D., Shreevats, A., Gupta, M.: Classifying latent user attributes in Twitter. In: Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents. pp. 37–44. ACM (2010)
39. Sadat, F., Kazemi, F., Farzindar, A.: Automatic identification of Arabic language varieties and dialects in social media. Proceedings of SocialNLP p. 22 (2014)
40. Salameh, M., Bouamor, H.: Fine-grained Arabic dialect identification. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1332–1344 (2018)
41.
Sap, M., Park, G., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., Schwartz, H.A.: Developing age and gender predictive lexica over social media. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1146–1151 (2014)
42. Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al.: Personality, gender, and age in the language of social media: The open-vocabulary approach. PloS ONE 8(9), e73791 (2013)
43. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
44. Verhoeven, B.: Two authors walk into a bar: Studies in author profiling. Ph.D. thesis, University of Antwerp (2018)
45. Verhoeven, B., Daelemans, W., Plank, B.: TwiSty: A multilingual Twitter stylometry corpus for gender and personality profiling. In: Proceedings of the 10th Annual Conference on Language Resources and Evaluation (LREC 2016)/Calzolari, Nicoletta [edit.]; et al. pp. 1–6 (2016)
46. Versteegh, K.: The Arabic language. Edinburgh University Press (2014)
47. Zhang, C., Abdul-Mageed, M.: No army, no navy: BERT semi-supervised learning of Arabic dialects. In: Proceedings of the Fourth Arabic Natural Language Processing Workshop. pp. 279–284 (2019)