=Paper=
{{Paper
|id=Vol-2517/T2-4
|storemode=property
|title=Arabic Author Profiling and Deception Detection using Traditional Learning Methodologies with Word Embedding
|pdfUrl=https://ceur-ws.org/Vol-2517/T2-4.pdf
|volume=Vol-2517
|authors=Haritha Ananthakrishnan,Akshaya Ranganathan,Thenmozhi D,Chandrabose Aravindan
|dblpUrl=https://dblp.org/rec/conf/fire/Ananthakrishnan19
}}
==Arabic Author Profiling and Deception Detection using Traditional Learning Methodologies with Word Embedding==
Arabic Author Profiling and Deception Detection using Traditional Learning Methodologies with Word Embedding Haritha Ananthakrishnan, Akshaya Ranganathan, Thenmozhi D, and Chandrabose Aravindan Department of CSE, SSN College of Engineering, Chennai {haritha16038,akshaya16009}@cse.ssn.edu.in {theni_d, aravindanc}@ssn.edu.in Abstract. With the ubiquity of social media, although one’s thoughts and opinions can be expressed through virtual platforms effortlessly, there have been numerous cases of posts that threaten the security of a certain community, caste, or religion or spread false propaganda against a certain group of people. Developments in the fields of Natural Language Processing and Machine Learning have paved the way to the concept of author profiling, which helps identify an author’s age, demographics, and gender details. The Author Profiling in Arabic Tweets task of FIRE 2019 aims to monitor Arabic Twitter posts and profile their authors concern- ing their age, gender, and language variety using learning concepts. The task of Deception Detection in Arabic texts focuses on monitoring Twit- ter and News headlines and detect deceptive texts: Posts that are drafted to seem authentic but suggest other ulterior motives. We have adopted the concept of SGD Optimized Support Vector Machine classification with AraVec word embedding for both the tasks and have achieved a joint F-1 score of 0.3403 for Author Profiling and an average score of 0.7598 for the Deception detection task. Keywords: Author profiling · Deception Detection · Natural Language Processing · Machine Learning · Arabic Tweets · News · Support Vector Machines · Stochaistic gradient descent 1 1 Introduction and Related works The APDA task of FIRE 2019 funded by ARAP Qatar aims to enhance cyber- security using Machine Learning. Social media such as Twitter, Facebook, In- stagram, etc. have gained popularity over the past decade where users can post 1 Copyright c 2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 Decem- ber 2019, Kolkata, India. images, and text online, no questions asked. Many news magazines and papers have shifted their base to virtual media as well. Although these practices have uncovered many societal goods, they have engendered many drawbacks such as the rapid propagation of offensive posts and unreliable news that threaten the security of certain communities or individuals. Author profiling is the process of predicting the chatacterestics an author (gender, age, and location), based on their stylistic writing features. This not only forms a security layer, but also pro- vides interesting linguistic information for targeted advertising, and marketing. The second task, Deception detection aimed at classifying news headlines and snippets as deceptive or non-deceptive, based on their language and word usage. This was done in [8] through the analysis of linguistic cues or leakages such as frequencies and patterns of word usage. The Text Attribution Tool (TAT) was developed for author profiling in a variety of languages [2]. Arabic poses a chal- lenge as extensive data pre-processing is required like tokenization, character set normalization, informal spelling normalization, etc. Buckwalter scheme has been used for character set normalization. A variety of algorithms have been tried in the past and Bagging and SVM based SMO algorithms proved to have consid- erably better performance. Generally, traditional machine learning algorithms like SVM had better results [7]. For deception detection, Credibility Analysis of Arabic Content on Twitter (CAT)[8] which used user’s timeline features like the number of retweets, user’s activity, etc. 2 Dataset Analysis and Data cleaning 2.1 Dataset Analysis The APDA task of FIRE 2019 was divided into two individual tasks, Author Profiling of Twitter posts and Deception Detection of news headlines and tweets. The corpora of the Author Profiling task consisted of a total of 2,250 users released over five days as groups of 450 users per day, which had even distribution across all classes. Each user’s posts were collectively classified into three of the following classes: Gender, Age and language variety. – Gender - Male or Female – Age - Under 25, Between 25 and 34, Above 34 – Language variety - Algeria, Egypt, Iraq, Kuwait, Lebanon-Syria, Lybia, Morocco, Oman, Palestine-Jordan, Qatar, Saudi Arabia, Sudan, Tunisia, UAE, Yemen. The training dataset of the Deception detection in Arabic texts task was divided into Twitter headlines and News headlines, each of which had two classifications based on whether or not the sentence was deceptive – Truth or Lie. 2.2 Data Cleaning Data cleaning was a major task for the Arabic data set, as the language was written from right to left, and had different rules of sentence termination, all Table 1. Distribution of corpora for both tasks Type Training users Test users Twitter 532 241 News 1443 370 Author Profiling 2250 720 of which were indecipherable to those who do not know Arabic. The following processes were implemented in python to flatten out data discrepancies. – Links : Twitter posts contain lots of hyperlinks containing “Http://”, “www.”, “.com”, etc. All of these were adding noise to the textual data and hence were removed using regular expressions. – Hashtags: English hashtags were removed in their entirety, whereas Arabic hashtags were maintained with only the removal of the hashtag to prevent loss of useful linguistic information. – Non - Arabic Words: As it was impractical to run language models for both Arabic and English due to the sparse density of English posts, all English words were removed from the text. – Twitter handles: Social Media handles starting with “@” were eliminated from the text – Special Characters: To further level the contents of the text, all special characters, erroneous blank spaces, numbers, and empty strings were re- moved – Emojis: A major challenge for data cleanup for both tasks, was the removal of emojis, which had characters outside of the basic multilingual plane. These characters were extracted using Unicode conversion and used regular expres- sions to remove them. – Stop Words: To remove Arabic stopwords, the NLTK platform’s Arabic stopword list was taken, against which every sentence was filtered. 3 Methodology and Implementation 3.1 Word Embedding Word embedding [4] is one of the most recent developments in the field of Nat- ural Language Processing that facilitates the identification of the context of a word, where words are represented as vectors in a continuous space, capturing syntactic and semantic relationships between them. AraVec [6] is a powerful, pre- trained word embedding tool developed solely for Arabic NLP research, which is built upon Twitter, World Wide Web, and Wikipedia Arabic pages. The pre- processing step of AraVec included the removal of tashkeel, an Arabic symbol added after a word to distinguish it from others, which did not contribute to the overall meaning of the sentence. AraVec is built over the gensim2 Word2Vec 2 https://radimrehurek.com/gensim/about.html model[3] which is a two-layer neural network used to identify the right context of words using CBOW and Skip-Gram techniques. AraVec was used to pre-process our text, as the generated word vectors could assign appropriate weights to the semantics of words. 3.2 Machine Learning model We have implemented an SGD optimized SVM classifier with Aravec word em- bedding for our model. In Stochastic Gradient Descent optimization[1], the gra- dient of the loss is estimated one sample at a time, which is randomly shuffled for performing the iteration. In our model, SGD Classifier of sklearn performs Stochastic Gradient Descent Optimization on a linear SVM Classification Model whose training accuracy was 86% for Author profiling and an average of 92% for Deception detection of Twitter And News. 4 Results Analysis The performance of our model for deception detection was remarkably better than that for author profiling. The F1 scores were 0.34 for author profiling and 0.76 for deception detection [9]. SGD optimized SVM worked better for deception detection. This can be attributed to the comparatively low sizes of the news datasets, leading to better F1 scores. For author profiling, the model was able to predict well when predicting the gender and location. F1 scores were 0.76 and 0.83 respectively. The model performed poorly when it came to age with the F1 score being 0.55. Therefore, the joint performance was lower in com- parison to the best performer who scored 0.4556. The results can be explained by pondering into the structural differences in Arabic. Research on Language and Gender Differences in Jordanian Spoken Arabic [11] shows that there are significant differences in the Arabic speech of men and women. Similarly, [12] aims at exploring the regional variations in Arabic. However, such differences in written Arabic amongst different age groups are unfathomable. Through word embedding, vectors are generated by grouping words that belong to similar con- texts. As the differences of Arabic amongst age groups were not remarkable, the vectoriser could have given similar scores which ultimately led to the poor per- formance. The same logic can be used to explain the comparable performances in deception detection and prediction of gender and location. 5 Conclusion and Future works Author profiling and deception detection have numerous applications given the amount of data exchanges over the internet each day. Although numerous NLP models are available for the English language, these need to be extended to languages such as Arabic, which is spoken by a vast majority of the world, which is what APDA@FIRE aims to achieve. This paper aims to predict the personality of authors based on their style of word usage and to ensure the credibility of news by classifying the snippets as true or false. We have used word embedding and a traditional learning model to implement the same. Future scope includes comparing the accuracy of different types of traditional models such as Decision trees, Random Forrest, and Bayesian classifiers, and studying the choice of other word embedding tools that can capture the minute differences in written Arabic amongst age groups. References 1. Robbins H, Monro S. A stochastic approximation method. The annals of mathe- matical statistics. 1951 Sep 1:400-7. 2. Estival, D., Gaustad, T., Hutchinson, B., Pham, S.B. and Radford, W., 2008. Author profiling for English and Arabic emails. 3. R. Rehurek and P. Sojka, “Software framework for topic modeling with large cor- pora,” in In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010. 4. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). 5. Conroy NJ, Rubin VL, Chen Y. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology. 2015;52(1):1-4. 6. Abu Bakr Soliman, Kareem Eisa, and Samhaa R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017. 7. Rangel F, Rosso P, Montes-y-Gómez M, Potthast M, Stein B. Overview of the 6th author profiling task at pan 2018: multimodal gender identification in Twitter. Working Notes Papers of the CLEF. 2018. 8. Rangel F, Charfi A, Rosso P, Zaghouani W. Detecting Deceptive Tweets in Arabic for Cyber-Security. 9. Overview of the Track on Author Profiling and Deception Detection in Arabic. Fran- cisco Rangel, Paolo Rosso, Anis Charfi, Wajdi Zaghouani, Bilal Ghanem, Javier Sánchez-Junquera. In: Mehtha P., Rosso P., Majumder P., Mitra M. (Eds.) Work- ing Notes of the Forum for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings. CEUR-WS.org, Kolkata, India, December 12-15. 10. http://arap.qatar.cmu.edu/ 11. Al-Harahsheh, A.M.A., 2014. Language and gender differences in Jordanian spoken Arabic: a sociolinguistics perspective. Theory and Practice in Language Studies, 4(5), p.872. 12. Ibrahim, Z., 2009. Beyond lexical variation in modern standard Arabic: Egypt. Lebanon and.