=Paper=
{{Paper
|id=Vol-2517/T2-8
|storemode=property
|title=Deception Detection in Arabic Tweets and News
|pdfUrl=https://ceur-ws.org/Vol-2517/T2-8.pdf
|volume=Vol-2517
|authors=F Javier Fernández-Bravo Peñuela
|dblpUrl=https://dblp.org/rec/conf/fire/Penuela19
}}
==Deception Detection in Arabic Tweets and News==
F. Javier Fernández-Bravo Peñuela
Polytechnic University of Valencia, Spain

Abstract. The project Arabic Author Profiling for Cyber-Security (ARAP, http://arap.qatar.cmu.edu) aims at preventing cyber-threats using machine learning. To this end, social media are monitored in order to detect threatening messages early and, in such cases, to profile the authors behind them. Profiling potential terrorists from messages shared in social media may allow detecting communities whose aim is to undermine the security of others. One of this framework's main challenges is recognizing false positives: potentially threatening messages that are actually deceptive, ironic, or humorous. This paper focuses on the goal of detecting deceptive messages, which are intentionally written trying to sound authentic. This task is performed on two different genres of Arabic texts: Twitter messages and news headlines.

Keywords: text classification, deception detection, Arabic text mining, natural language processing

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

===1 Introduction===

The present work describes the process of designing and implementing a classifier whose goal is to decide whether a message is true or deceptive. It is based on the application of natural language processing techniques for decomposing each text and arranging it in the form of a vector of features, along with the construction of a machine learning model trained upon two different datasets of Arabic written texts: Twitter messages and news headlines. This task was carried out in the context of the APDA challenge [2], held in conjunction with FIRE 2019, the Forum for Information Retrieval Evaluation.

This document first outlines the challenges and difficulties found in the problems of text mining and deception detection, focusing on their application to the analysis of Arabic written texts. Next, it details how each of these problems was overcome in the design of the proposed classifier for detecting deceptive messages: which techniques and features were applied at the corresponding stages of development, which were considered but later discarded for different reasons, and which were retained and assimilated into the final model. Last, some conclusions are drawn regarding this attempt at providing a solution to the problem of deception detection, highlighting its assorted eventualities and the results reached.

===2 Challenges===

Two main challenges are found when analyzing written texts from Arabic media, both press and social. First, the differences between Arabic and languages based on the Latin alphabet, which are also wide from the grammatical point of view, must be handled so that Arabic texts can be processed in the context of an automated task. Next, the result of this processing must be applied to the problem of deception detection, which is addressed below in Section 3.

===3 Deception detection in Arabic written texts===

The focus of the task is deception detection in Arabic on two different genres: Twitter messages and news headlines. Both datasets originate from the corpora created in [1], which contain 1444 news headlines (679 true statements and 765 deceptive statements) and 532 Twitter messages (259 true publications and 273 deceptive publications). They cover a large variety of topics, and both classes are balanced enough that they can be assumed representative of the populations over which the experiment generalizes.

The documents contained in each dataset must be transformed into a representation which allows their processing, such as a vector of features. A machine learning model has to be constructed and trained using the portion of the dataset available to the APDA challenge contestants [1]. This model is then used for making classification predictions for the texts in the evaluation portion of the dataset. Finally, the quality of the test results is measured by computing the F-score, which compares the classification predictions against the actual classes of the evaluation data subset.

The tasks carried out to reach this goal are outlined in the following subsections.

====3.1 Preprocessing====

In order to process the raw text collection from the dataset and build a classifier capable of differentiating whether tweets and news are true or deceptive, the first step was to apply some normalization tasks to the text entries, as follows:

* All letters were turned to lowercase.
* Numbers were removed.
* White spaces and other separator characters were collapsed into single spaces, and leading and trailing whitespace was trimmed.
* Every word contained in the English and Arabic stop-word lists (common words with empty semantic meaning) was removed from the text.
* Punctuation symbols were removed.
* Words from the English and Arabic stop-word lists were removed again.

The reason for performing this elimination twice was the finding that removing stop-words both before and after removing punctuation contributed to a better cleansing of the processed text, thus increasing the final accuracy attained by the classifier.
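As a minimal sketch, the normalization pipeline above could be implemented as follows in Python. The paper does not specify which stop-word lists or libraries were used; NLTK's English and Arabic stop-word lists are assumed here purely for illustration.

<pre>
import re
import string

from nltk.corpus import stopwords  # requires nltk.download("stopwords")

# Assumption: NLTK's English and Arabic lists stand in for the
# unspecified stop-word lists used in the paper.
STOP_WORDS = set(stopwords.words("english")) | set(stopwords.words("arabic"))


def remove_stopwords(text: str) -> str:
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def normalize(text: str) -> str:
    text = text.lower()                       # lowercase all letters
    text = re.sub(r"\d+", "", text)           # remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    text = remove_stopwords(text)             # first stop-word removal
    # Remove punctuation (ASCII only; Arabic-specific marks such as the
    # Arabic comma or question mark would need to be added to the table).
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = remove_stopwords(text)             # second stop-word removal
    return text
</pre>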
====3.2 Feature extraction====

The next step was to retrieve a data frame containing the 1000 most frequent words occurring in the text collection. This data frame, which corresponds to the text collection's vocabulary, was used to generate the bag-of-words representation of the collection: a two-dimensional matrix relating words from the vocabulary to their number of occurrences in each of the dataset's documents. This bag of words is in effect a vectorized representation of each text's features which, along with the class each document belongs to, can be used to train a classifier and build a machine learning model with the ability to decide the most probable category for new, unseen documents.

At this point, the bag of words was enriched in different ways for the news and Twitter datasets (constructing a separate classifier for each one). A new feature containing the number of words in the document was added to both data frames, and three new features were added to the Twitter data frame detailing the numbers of hashtags, user mentions, and emojis. These three are characteristic traits of Twitter texts, and their frequency is believed to be related to whether a message is true or deceptive, so they make a good discriminatory factor for differentiating and classifying messages.
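A sketch of this stage, assuming scikit-learn's CountVectorizer for the 1000-word bag of words. The hashtag, mention, and emoji counters are rough approximations, since the paper does not specify how these were counted; they would naturally be computed on the raw text, before the punctuation-removal step above, and the news dataset would use only the word-count feature.

<pre>
import re

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Bag of words restricted to the 1000 most frequent words in the collection.
vectorizer = CountVectorizer(max_features=1000)

# Rough approximation of the main emoji Unicode blocks.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extra_features(texts):
    """Word count plus Twitter-specific counts (hashtags, mentions, emojis)."""
    rows = [(len(t.split()),          # number of words
             t.count("#"),            # hashtags (approximated by '#' marks)
             t.count("@"),            # user mentions (approximated by '@' marks)
             len(EMOJI.findall(t)))   # emojis
            for t in texts]
    return csr_matrix(np.array(rows, dtype=float))


def build_matrix(texts, fit=False):
    """Vectorize texts; fit the vocabulary only on training data."""
    bow = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    return hstack([bow, extra_features(texts)])
</pre>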
====3.3 Classifier model training====

Once the vectorized form of the text collection was complete, the data frame containing the bag-of-words representation of each dataset was used to train a classifier, taking support vector machines as the machine learning technique of choice due to their good performance on classification tasks involving many features, as happens to be the case in text classification. Should the classes be unbalanced, weighting and penalization mechanisms can be included in the classifier so that the predictions made are more accurate and representative of the general population.

K-fold cross-validation was used to select the classifier providing the highest accuracy and the best ability to generalize. The collection was split into four folds, three of which were used for training on each iteration, while the remaining one was left for evaluation. Since the true classes of the test dataset are unknown, the F-score obtained through K-fold cross-validation was used to check which combination of parameters, features, and techniques produced the highest classification accuracy. These evaluation results are displayed in Section 4.

====3.4 Accuracy evaluation====

Once the classifier model was built, the bag-of-words representations of the test datasets (news and tweets) were constructed, applying to their documents the same normalization techniques as to the training data and reusing the vocabulary from the training stage. By doing this, both vectorized representations are equivalent to the training ones regarding order and indexing, and the classifier built from the training data can be used to generate predictions for new data.

Finally, predictions were generated for the vectorized representation of the test data, and the identifier and predicted class of each document were stored into text files, ready to be submitted for evaluation. Since different models were trained for classifying news and tweets, the corresponding classifier was used for generating predictions on each test data file.
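The sketch below ties both stages together, reusing the build_matrix helper from the previous sketch. scikit-learn's SVC, the linear kernel, the macro-averaged F-score, and the tab-separated output format are assumptions for illustration, as are the train_texts, train_labels, test_texts, and test_ids variables holding the loaded corpora.

<pre>
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Fit the vocabulary on the training texts only.
X_train = build_matrix(train_texts, fit=True)

# Support vector machine; class_weight="balanced" illustrates the optional
# weighting mechanism mentioned above for unbalanced classes.
clf = SVC(kernel="linear", class_weight="balanced")

# 4-fold cross-validation scored with the F-score, used to compare
# combinations of parameters, features, and techniques.
scores = cross_val_score(clf, X_train, train_labels, cv=4, scoring="f1_macro")
print("mean cross-validated F-score:", scores.mean())

# Train on the full training set, then vectorize the test texts with the
# training vocabulary so that column order and indexing match.
clf.fit(X_train, train_labels)
predictions = clf.predict(build_matrix(test_texts, fit=False))

# Store the identifier and predicted class of each document.
with open("predictions.txt", "w", encoding="utf-8") as out:
    for doc_id, label in zip(test_ids, predictions):
        out.write(f"{doc_id}\t{label}\n")
</pre>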
===4 Results===

First of all, a prototype implementation was built, including just the components strictly needed for basic tokenization, vectorization, and classification (without other manipulations of the text or the extraction of additional features), so that a baseline score could be obtained and later improved. The initial F-score reached on both datasets was quite low for a classification task with just two possible categories. These results, along with those attained in further stages, are displayed in Table 1.

Table 1. Classification results on news and tweets
{|
! Model !! News !! Twitter
|-
| Baseline || 0.58 || 0.61
|-
| Intermediate model || 0.70 || 0.64
|-
| Final model with ad-hoc Twitter features || 0.70 || 0.77
|}

Once the definitive implementation was complete, having tuned the support vector machine's hyperparameters, extracted additional useful features, and included the supplementary mechanisms described in Section 3.2, classification improved by 16 points on the Twitter dataset and 12 points on the news dataset. Although an even higher improvement may be feasible, these results show that the application of some of the techniques introduced actually improves the classifier's accuracy. In the case of the Twitter dataset, the intermediate model was evaluated before implementing the extraction of the additional relevant features (just keeping the n most frequent words). The inclusion of relevant features based on the characteristics of the texts themselves, combined with common natural language processing techniques, provides a significant improvement in the results attained.

===5 Conclusions===

The present work has described the process of designing and implementing a classifier whose goal is to decide whether a message is true or deceptive (a false statement which is intentionally written pretending to seem true). The extraction of features related to the nature of the text itself has proven useful for increasing the model's accuracy and might indeed be decisive for the results attained. In particular, when analyzing Twitter texts, characteristic traits of this medium were extracted, such as hashtags, user mentions, and emojis. This led to the construction of a different classifier for each medium, trained with its corresponding vectors of features.

In a world of social media and fake news, this effort is oriented towards providing a solution to the problem of determining the truthfulness of a statement just from the way it is written and expressed. Further efforts aim to take into account linguistic twists which may imply false positives, such as irony or humor.

===References===

1. Rangel, F., Charfi, A., Rosso, P., Zaghouani, W.: Detecting deceptive tweets in Arabic for cyber-security.
2. Rangel, F., Rosso, P., Charfi, A., Zaghouani, W., Ghanem, B., Sanchez-Junquera, J.: Overview of the track on author profiling and deception detection. In: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019) (2019)