
KCE DALab-APDA@FIRE2019: Author Profiling and Deception Detection in Arabic using Weighted Embedding

Sharmila Devi V

sharmiladevi1002@gmail.com 1

Kannimuthu S

Ravikumar G

Anand Kumar M

anandkumar@nitk.edu.in 2

0 Department of Computer Science and Engineering, CIET, Coimbatore
1 Department of Information Technology, Karpagam College of Engineering, Coimbatore
2 Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India

This paper describes the work submitted to the Author Profiling and Deception Detection in Arabic Tweets shared task organized at the Forum for Information Retrieval Evaluation (FIRE) 2019. The first task, author profiling, involves identifying the categories of authors based on their Arabic tweets. The second task aims to detect deception in Arabic in two genres, Twitter and news. Deception detection is the automatic identification of false messages in text content on social networks or in news. For each task, we submitted three different systems. For submission 1, we used Term Frequency and Inverse Document Frequency (TFIDF) features with Support Vector Machine (SVM) classification, and for submission 2, we used the fastText classifier. For submission 3, we proposed a low-dimensional weighted document embedding (TFIDF + word embedding) with SVM classification. We attained second place in deception detection and third place in author profiling. The performance difference between the top team's results and our submitted runs is only 3.34% for author profiling and 1.16% for deception detection.

Author profiling, Deception detection, Arabic tweets, Machine Learning, TFIDF, Word embeddings, fastText Classifier, Weighted document embeddings

In our busy day-to-day life, social media, a computer-based technology, plays a major role in sharing information, ideas, and thoughts from one person to another. Most people send their personal messages, documents, videos, and photos through social media networks such as Twitter, Facebook, and WhatsApp. Author profiling is the method of analysing the demographic features of an author, such as age, gender, and language variety. Some of the applications of author profiling are forensics, security, and marketing. For example, in the marketing field, it is useful to find which customer profiles like or dislike a product. This analysis helps companies achieve better market segmentation. From a forensic viewpoint, it is important to find out the profile of the person who wrote a suspicious text. Deception detection is the method of analysing whether a given message is a lie or the truth.

The rest of the paper is organized as follows. In section 2, we discuss the literature on author profiling and deception detection in various languages. Section 3 describes the dataset and its statistics. In section 4, we explain the methodology, and section 5 discusses the results obtained. In section 6, we conclude the paper with limitations and future work.

2 Related Works

The peculiarities of the Arabic dialectal varieties used in social media are discussed, and an annotation framework is proposed, in [1]. The Arabic Author Profiling for Cyber-Security project [2] focuses on whether the author of a suspicious message is a potential threat. A framework for improving deception detection accuracy for the veracity of online digital news is proposed in [3]. Bayesian classification and the K-means clustering algorithm are applied to Twitter profile characteristics to detect deception and analyze user behavior in [4]. Various feature extraction methods are proposed for deception detection from Arabic Twitter posts in [5]; the SVM with trigrams achieved an accuracy of 91.55%, higher than the other classifiers. Arabic word correction to handle the vulnerability is explained in [6]; the authors achieved an accuracy of 96.5% in detecting abusive Arabic tweets.

An author profiling system for Urdu is proposed in [8], using word- and character-based term frequency and TFIDF features with a support vector machine classifier. Weighted embeddings based on a novel median-based loss function are explained in [9], with experimental results on Wikipedia and Twitter data. Variations of the doc2vec embedding, evaluated on a new task using TripAdvisor reviews and on the CQADupStack benchmark, are proposed in [10]. Word Mover's Embedding, which enables unsupervised document embedding from pre-trained word embeddings, is proposed in [11]. Identification of the age and gender of blog authors is proposed in [12], where experiments showed that information retrieval features yielded the best predictions.

3 Dataset Description

The dataset for Arabic author profiling is given as five different categories, each consisting of three nativities. The details of the nativities are given in the overview of the shared task [14]. The dataset covers three age groups (under 25, between 25 and 34, and above 35) and two genders (male and female) in all the categories. The primary difference between the deception and profiling datasets is in the representation. In author profiling, each XML file, which consists of 100 tweets, needs to be labeled with gender, age group, and language variety. In deception detection, by contrast, each tweet must be identified as truth or lie. Two different domains, News and Tweets, are given for deception detection. We submitted 6 runs for deception detection.

All five training datasets for author profiling and deception detection are completely balanced; the number of documents in each class is given in Tables 1 and 2.

In total, we submitted three methods: TFIDF features with an SVM classifier, word bi-grams with the fastText classifier, and TFIDF-weighted document embeddings. We submitted 21 runs for Arabic author profiling and deception detection. For deception detection, we followed the same approaches as for the Arabic author profiling task. The three methods are explained below.

Submission-1: The first run is based on the conventional method, in which we used word and character n-gram features with an SVM classifier [8]. Word uni-grams and character bi-grams, tri-grams, and four-grams are considered as features. Out of all features, we kept a maximum of 5000 features for words and 5000 for characters. These feature values are weighted with TFIDF. The final feature matrix is given to a linear SVM for classification. The SVM parameters are an L2-norm penalty with C = 1, and multi-class classification uses one-versus-rest. We followed the same method for Arabic author profiling and deception detection.
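
The feature-extraction half of Submission-1 can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the function names, the toy documents, and the smoothed IDF formula are not taken from the paper, and the linear SVM itself (L2 penalty, C = 1, one-versus-rest) is not shown; an off-the-shelf implementation would be trained on the resulting rows.

```python
import math
from collections import Counter


def word_unigrams(text):
    """Split a document into whitespace-delimited word tokens."""
    return text.split()


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_vocabulary(docs, analyzer, max_features=5000):
    """Keep the max_features most frequent terms and compute their IDF."""
    term_counts, doc_freq = Counter(), Counter()
    for doc in docs:
        terms = analyzer(doc)
        term_counts.update(terms)
        doc_freq.update(set(terms))
    kept = [t for t, _ in term_counts.most_common(max_features)]
    n_docs = len(docs)
    # Smoothed IDF (assumption: the paper does not state the exact variant).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in kept}
    return kept, idf


def tfidf_vector(doc, analyzer, vocab, idf):
    """TF-IDF weights for one document, L2-normalised."""
    counts = Counter(analyzer(doc))
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_features(docs, max_features=5000):
    """Concatenate word-unigram and character 2-4-gram TF-IDF vectors."""
    w_vocab, w_idf = fit_vocabulary(docs, word_unigrams, max_features)
    c_vocab, c_idf = fit_vocabulary(docs, char_ngrams, max_features)
    return [tfidf_vector(d, word_unigrams, w_vocab, w_idf)
            + tfidf_vector(d, char_ngrams, c_vocab, c_idf)
            for d in docs]


# Hypothetical toy documents; the cap is lowered to 50 so it takes effect.
docs = ["good morning friends", "good night", "breaking news tonight"]
features = build_features(docs, max_features=50)
```

Each row of `features` holds the word block followed by the character block, so with the paper's caps a document would be represented by at most 10000 TFIDF values.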

Submission-2: In the second run, we used the well-known fastText embedding and classifier [7] for profiling the Arabic authors and identifying deception. The fastText classifier is suited to sentence classification tasks, so for deception detection we used it as such. In the author profiling task, however, the input is an XML file. Fortunately, all the training and testing XML files contain the same number (100) of tweets. We therefore split the input into individual tweets and trained the model as a sentence classification task. After tagging the tweets during testing, we counted the labels within each XML file and selected the most frequent label as the label for that file. The main drawback of this approach is the difficulty of inferring cross-validation results. The fastText parameters are fixed as follows: word bi-grams, learning rate lr = 0.25, and 40 epochs. We used softmax as the loss function.

Submission-3: We developed a weighted word embedding model for the third submission. Here, we used the Arabic word vectors pre-trained on Arabic tweets and web pages [13]. The complete architecture of the model is shown in Figure 1. In the case of author profiling, word unigram features are first vectorized using a conventional TFIDF vectorizer. The maximum number of features is limited to 5000, so each XML document in the training data is represented over a vocabulary of 5000 unique words. The existing skip-gram-based Arabic pre-trained vectors [13] of size 300 are used to create the embedding matrix for these unique words. Words that are not present in the pre-trained vectors are treated as unknown words, and their embeddings are generated randomly from the word vectors. Finally, we take the dot product of the TFIDF matrix and the embedding matrix, which transforms each document into a low-dimensional document vector: a 5000-dimensional TFIDF row multiplied by the 5000 × 300 embedding matrix yields a 300-dimensional vector. This set of vectors constitutes the TFIDF-weighted document embeddings, which are then used to train an SVM.
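
The per-file voting step of Submission-2 can be illustrated with a short Python sketch. The function names, file names, and labels below are hypothetical, each file is shortened to five tweets for readability, and the tie-breaking rule (first label seen wins) is an assumption, since the text does not say how ties were handled.

```python
from collections import Counter


def label_file(tweet_labels):
    """Return the most frequent tweet-level label for one XML file.

    Ties go to the label that was seen first (an assumption; the paper
    does not describe tie handling).
    """
    return Counter(tweet_labels).most_common(1)[0][0]


def label_all_files(predictions):
    """Map each file id to the majority label of its tweets."""
    return {file_id: label_file(labels)
            for file_id, labels in predictions.items()}


# Hypothetical tweet-level predictions for two files.
predictions = {
    "author_a.xml": ["female", "male", "female", "female", "male"],
    "author_b.xml": ["male", "male", "female", "male", "male"],
}
file_labels = label_all_files(predictions)
```

The same voting would be run once per target (gender, age group, and language variety), with a separately trained tweet-level classifier producing each set of labels.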
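
The matrix product behind the TFIDF-weighted document embedding of Submission-3 can be sketched in plain Python. This is a toy illustration, not the submitted system: the vectors have 3 dimensions instead of 300, the vocabulary has 3 words instead of 5000, all names are hypothetical, the IDF is a smoothed variant chosen for the example, and unknown words are assigned the vector of a randomly chosen known word, which is one possible reading of "generated randomly from the word vectors".

```python
import math
import random
from collections import Counter


def build_embedding_matrix(vocab, pretrained, seed=0):
    """One embedding row per vocabulary word.

    Unknown words receive the vector of a randomly chosen known word
    (an assumed interpretation of the paper's description).
    """
    rng = random.Random(seed)
    known = sorted(pretrained)
    return [pretrained[w] if w in pretrained else pretrained[rng.choice(known)]
            for w in vocab]


def tfidf_rows(docs, vocab):
    """TF-IDF weights (smoothed IDF, no normalisation) for each document."""
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d.split()))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[w] * idf[w] for w in vocab])
    return rows


def weighted_document_embeddings(tfidf, embedding):
    """Matrix product: (n_docs x |V|) times (|V| x dim) -> (n_docs x dim)."""
    dim = len(embedding[0])
    return [[sum(row[i] * embedding[i][k] for i in range(len(row)))
             for k in range(dim)]
            for row in tfidf]


# Toy stand-ins for the pre-trained vectors and the training documents.
pretrained = {"good": [1.0, 0.0, 0.0], "news": [0.0, 1.0, 0.0]}
docs = ["good news", "good good rumour"]
vocab = ["good", "news", "rumour"]  # "rumour" is an unknown word

embedding = build_embedding_matrix(vocab, pretrained)
doc_vectors = weighted_document_embeddings(tfidf_rows(docs, vocab), embedding)
```

Each entry of `doc_vectors` is a TFIDF-weighted sum of word vectors, so its length equals the embedding dimension regardless of vocabulary size; these rows are what would be passed to the SVM.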

For author profiling in Arabic tweets, the accuracy is 0.7667 for gender, 0.5722 for age, and 0.9694 for variety. Performance was also evaluated jointly, where the accuracy is 0.4222. The top accuracy for deception detection is 0.7331 for news and 0.8541 for Twitter, and the average accuracy obtained is 0.7887.

In this paper, we describe our work on identifying age, gender, and language variety in author profiling and on deception detection in Arabic (APDA). Using the given training dataset, we developed three systems: TFIDF features with SVM, the fastText classifier, and weighted word embeddings with SVM. Compared with the traditional model, the weighted embeddings, from which we expected the most, attained lower accuracy. The main reason is that certain words in the given dataset are not present in the pre-trained model. Even though we used a model pre-trained on Arabic tweets, around 30% of the words in the training data are unknown. This can be resolved with recent character-specific word embeddings. Despite this 30% information loss, the proposed low-dimensional document embedding attained decent accuracy on author profiling. In the future, this can be enhanced with character-specific embeddings and by retraining the pre-trained models.


1. Zaghouani, Wajdi, and Anis Charfi. "Guidelines and Annotation Framework for Arabic Author Profiling." arXiv preprint arXiv:1808.07678 (2018).

2. Rosso, Paolo, Francisco Rangel, Bilal Ghanem, and Anis Charfi. "ARAP: Arabic Author Profiling Project for Cyber-Security." Procesamiento del Lenguaje Natural 61 (2018): 135-138.

3. Eembi@Jamil, Normala Che, Iskandar Ishak, and Fatimah Sidi. "Deception detection approach for data veracity in online digital news: Headlines vs contents." AIP Conference Proceedings. Vol. 1891. No. 1. AIP Publishing, 2017.

4. Alowibdi, Buy, Philip, Ghani, Mokbel. "Deception detection in Twitter." Social Network Analysis and Mining 5.1 (2015): 32.

5. Al-Saif, Hissah, and Hmood Al-Dossari. "Detecting and Classifying Crimes from Arabic Twitter Posts using Text Mining Techniques." International Journal of Advanced Computer Science and Applications 9.10 (2018): 377-387.

6. Abozinadah, Ehab A., and J. H. Jones. "Improved micro-blog classification for detecting abusive Arabic Twitter accounts." International Journal of Data Mining and Knowledge Management Process (IJDKP) 6.6 (2016): 17-28.

7. Joulin, Armand, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. "Bag of tricks for efficient text classification." arXiv preprint arXiv:1607.01759 (2016).

8. Sharmila Devi, V., Kannimuthu, S., Ravikumar, G., Anand Kumar, M. "KCe Dalab@maponsms-Fire2018: Effective word and character-based features for multilingual author profiling." CEUR Workshop Proceedings 2266 (2018): 213-222.

9. De Boom, Cedric, Steven Van Canneyt, Thomas Demeester, and Bart Dhoedt. "Representation learning for very short texts using weighted word embedding aggregation." arXiv preprint arXiv:1607.00570 (2016).

10. Schmidt, Craig W. "Improving a tf-idf weighted document vector embedding." arXiv preprint arXiv:1902.09875 (2019).

11. Wu, Lingfei, Ian EH Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar, and Michael J. Witbrock. "Word Mover's Embedding: From Word2Vec to Document Embedding." arXiv preprint arXiv:1811.01713 (2018).

12. Weren, Edson, Anderson Kauer, Lucas Mizusaki, Viviane P. Moreira, J. Palazzo M. de Oliveira, and Leandro Wives. "Examining Multiple Features for Author Profiling." (2014).

13. Abu Bakr Soliman, Kareem Eisa, and Samhaa El-Beltagy. "AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP." In Proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

14. Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem, B., Sánchez-Junquera, J.: Overview of the track on author profiling and deception detection in Arabic. In: Mehta, P., Rosso, Majumder, Mitra (Eds.) Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings, CEUR-WS.org, Kolkata, India, December 12-15 (2019).