=Paper=
{{Paper
|id=Vol-3315/paper13
|storemode=property
|title=Fake News Detection for Hindi Language
|pdfUrl=https://ceur-ws.org/Vol-3315/paper13.pdf
|volume=Vol-3315
|authors=Kausthub Thekke Madathil,Neeraj Mirji,Charan,Anand Kumar
}}
==Fake News Detection for Hindi Language==
Kausthub Thekke Madathil, Neeraj Mirji, Charan R and Anand Kumar M

Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India 575025

===Abstract===
The understanding of the term "fake news" varies from one individual to another. In its basic sense, "fake news" refers to inappropriate, made-up news that in most cases rests on baseless sources and facts. Such news generally misleads the reader and is typically published for one's own benefit or to defame others. In recent years a large population has become active on various social media platforms, which have therefore become the major medium through which fake news is circulated. A lot of fake news is circulated in local languages as well. Yet most existing work is based on the English language, and very little work on fake news identification has been done for resource-scarce languages such as the Indic languages. This paper therefore aims to define fake news and to suggest an effective method for detecting fake news in Hindi using standard machine learning algorithms such as Multi-Layer Perceptron and Naive Bayes and deep learning techniques such as transformers, mainly mBERT.

Keywords: Fake news detection, Indic Languages, Hindi language, Classification Algorithms, Transformers, mBERT

===1. Introduction===
The internet has become a part of our lives. There is no doubt that many young people prefer to get their news from the internet rather than from newspapers, radio and similar media. The internet provides many opportunities, and a growing number of people rely on it for the majority of their knowledge. But most of the news read on the internet is either less accurate or completely fake. Fake news has quickly become a social problem. The propagation of false information and rumours to change people's behavior and create chaos in society is increasing day by day.
Nowadays the term "fake news" has become the most common label for incorrect and misleading stories, which are generally published to generate money from page views. As a result, combating fake news has become more vital, as well as more difficult, in the age of social media. The task is difficult because even humans have trouble distinguishing between fake and genuine news, so developing an automated system for detecting fake news becomes critical. In this paper we develop a method that predicts whether a particular news article in Hindi is fake or not using deep learning techniques.

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP), June 7-8, Koper, Slovenia. Emails: kausthubtm.191it125@nitk.edu.in (K. T. Madathil); neerajmirji.191it232@nitk.edu.in (N. Mirji); charan.191it212@nitk.edu.in (C. R); m1_anandkumar@nitk.edu.in (A. K. M). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===2. Related Work===
Several automated detection approaches for false news and deceptive posts are currently in use. Counterfeit news detection has several facets, ranging from the use of chatbots to promote disinformation to the exploitation of clickbait to spread rumours. Clickbait is widespread on social media networks, including Facebook, where it encourages people to share and like messages and thereby propagates false information. A lot of effort has been put into detecting falsified data, and authors have introduced various detection techniques. The following detection methods have been proposed by the authors of [3].
The first is linguistic-foundation deception modeling, the second is grouping, the third is predictive analytics, the fourth comprises methods based on content cues, and the last comprises non-textual cue-based methods. The authors also compiled a list of false news strategies for various sorts of fake news, as shown in Figure 1, which is taken from [3]. The accuracy of these models, according to the authors, is only 63 percent to 70 percent.

N. J. Conroy et al. [4] deal with methods for detecting fake news using automated detection. The approaches addressed include linguistic cue approaches with machine learning, rhetorical structure and discourse analysis, network analysis approaches, and SVM classifiers. These models are only text-based and offer no improvement over previous approaches.

Most of the literature focuses on fake news spread through social media. S. Helmstetter and H. Paulheim [16] study weakly supervised learning for fake news detection on Twitter. They treat the classification of every tweet as a binary classification task, with the source of the tweet as the sole criterion for classification. The authors used the Twitter API and DMOZ to manually collect data sets. The results show that 15 percent of the tweets are false, 45 percent are genuine, and the rest are undecided. Samir Bajaj of Stanford University released a work that approaches fake news detection from an NLP viewpoint and applies deep learning, using genuine news from the Signal Media News dataset.

According to a recent article, Facebook and WhatsApp are also focusing on detecting fake news. Facebook stated that it is attempting to stop the spread of fake news in two important areas. Because most false news is motivated by money, the first is to disrupt economic incentives. The second is to create innovative features to combat the spread of incorrect information.
The following are Facebook's precautionary measures:
* Ranking advancements: News Feed rankings help to limit the prevalence of inaccurate news items.
* Easier reporting: identifying what is and is not accurate.

Most of the existing work is based on the English language, and very little literature on fake news identification is available for resource-scarce languages such as Hindi. D. K. Sharma et al. [17] propose a dataset for the Hindi language. The dataset, consisting of Hindi news, was collected using a web scraping utility called Parsehub. They conducted multiple experiments with standard machine learning algorithms and were able to achieve improvements ranging from 1% to 5%. According to the authors, these results reflect the effectiveness of their proposed dataset.

Our dataset is a combination of two datasets: one from BBC Boomlive and one from the Cornell University Hostility Detection Dataset. The BBC Boomlive dataset is formed by processing the raw data collected from Boomlive and BBC news, which includes removing null values, unwanted columns and stop words. The Cornell dataset consists of news labeled under the categories "hostile", "offensive" and "fake", from which we took only the news labeled "fake" for our dataset.

Figure 1: False news strategies [3]

===3. Problem Statement===
In a world where the media has more influence over its consumers than ever before, the news plays a major role in informing them about current affairs around the globe. But consumers are often given wrong information and are misled, either knowingly or unknowingly. To prevent this, it is necessary to filter out fake news before it reaches the audience and leads to misunderstandings or wrong assumptions. Machine learning can be used as an aid to achieve this and make news consumption cleaner and more correct.
Also, since fake news is not limited to the English language, we target the Hindi language and explore different methods to overcome the problems of a limited dataset and limited text processing techniques. This paper focuses on defining fake news and on developing and comparing machine learning models that can accurately predict whether a given news article in Hindi is fake or not. Our aim is also to propose a method to expand the limited dataset, process the Hindi texts, train and compare standard machine learning models, and discuss the results.

===4. Dataset Description===
The dataset was collected from two sources: BBC Boomlive and the Cornell University Hostility Detection Dataset. The BBC Boomlive dataset consisted of 1250 fake news items and 720 true news items taken from popular newspapers and news channels. The Cornell University Hostility Detection Dataset consisted of news labelled under the categories "hostile", "offensive" and "fake", from which we took only the news labelled "fake" for our dataset. The number of fake news items collected from this dataset was 1010. In total the combined dataset consists of 3020 news items, of which 760 are true news and 2260 are fake news. Each data row has a total of 4 attributes: the news title, a short description of the news, a long description of the news, and the label. For this paper we have considered only the short description; the remaining columns and attributes are dropped.

Figure 2: Word cloud for the dataset

{| class="wikitable"
|+ Table 1: Dataset summary
! Dataset !! Fake news data !! True news data
|-
| BBC-Boomlive || 1250 || 720
|-
| Cornell University Hostility Detection || 1010 || 0
|-
| Total || 2260 || 720
|}

===5. Methodology===
This paper focuses on various machine learning models and their performance in identifying fake news, specifically for the Hindi language.
Since the problem under consideration is text classification, we consider standard classification algorithms such as the Naive Bayes classifier and the Multi-Layer Perceptron, as these are standard algorithms for text-based processing. We also discuss transformers, which are effective for text classification because of their self-attention mechanism and better understanding of word features. The mBERT model with fine-tuning is discussed to draw comparisons and results. Since the collected dataset was really small, we have employed certain data augmentation techniques to increase its size for better prediction. The proposed approach consists of four crucial steps: data augmentation, text preprocessing, tokenization and model architecture.

====5.1. Data Augmentation====
This section deals with the two data augmentation techniques that we have employed to increase the size of our training dataset. We do not augment the testing dataset because that might lead to false answers. The two techniques used are:
* Back translation: the Hindi texts are translated to English and then translated back to Hindi, which may change the original text, for example through synonyms introduced during translation. For this we have used the Python library google-trans.
* iNLTK: a similar-sentence generation function borrowed from the iNLTK package produces sentences that are similar to the original sentences while keeping the meaning intact.

After data augmentation the size of our training set increased from around 2400 texts to around 10000 texts. The testing set remained at around 600 texts since we did not apply augmentation to it.

====5.2. Text Preprocessing====
The dataset is cleaned with the clean-text library for Python, and texts with null values are dropped. All columns other than the "short description" column are removed since they are not necessary for our model.
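The split-then-augment order described in Section 5.1 can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the two augmenter functions are hypothetical stand-ins for back translation and iNLTK-style paraphrasing, and the sample rows are invented.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle (text, label) rows and split them into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def augment_training_set(train_rows, augmenters):
    """Return the original training rows plus every variant each augmenter yields.

    Each augmenter maps a text to a list of paraphrases; the label is copied.
    """
    augmented = list(train_rows)
    for text, label in train_rows:
        for augment in augmenters:
            for variant in augment(text):
                if variant and variant != text:
                    augmented.append((variant, label))
    return augmented

# Hypothetical stand-ins: real code would call a translation service
# (Hindi -> English -> Hindi) and a paraphrase generator here.
def fake_back_translate(text):
    return [text + " (bt)"]

def fake_similar_sentences(text):
    return [text + " (s1)", text + " (s2)"]

rows = [(f"news {i}", i % 2) for i in range(10)]  # invented sample data
train, test = train_test_split(rows)              # split first ...
train_aug = augment_training_set(                 # ... then augment train only
    train, [fake_back_translate, fake_similar_sentences]
)
print(len(train), len(train_aug), len(test))
```

Splitting before augmenting keeps paraphrases of a test text out of the training data, which is one way to read the concern about "false answers" raised above.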
The short description is retained mainly because it contains the gist of the entire text and is sufficient for our study, whereas the long description is too big and may introduce errors and reduce performance. Also, given the lack of a large dataset, it is not favorable to train our model on the long description. Expanding the dataset is a priority in our future work, which will also include the long description. Special characters and non-alphabetic characters are replaced by white spaces and the text is converted to lowercase. Next, the stop words and the five most frequently occurring words are removed. Finally, the labels are added to the texts as 0 or 1, corresponding to "Fake" or "Real".

====5.3. Tokenization====
Since sentences cannot be fed directly into the models, we must convert them into tokens. For the Hindi texts we have used indic tokenize for the Naive Bayes and MLP models and BertTokenizer for the mBERT model.

====5.4. Model Specific Preprocessing====
For the MLP and Naive Bayes models the tokens were further lemmatized using the WordNet lemmatizer from the NLTK package. The CountVectorizer from the scikit-learn package then extracts features from these lemmatized tokens, returning a matrix of token counts. To keep the computational complexity acceptable, the vectorizer is configured to use just the top 4000 features/tokens. In the case of mBERT, the utility function padsequences() is used to make all input vectors the same size. Attention masks are then created: if a token ID is 0, it is padding and the mask is set to 0; if a token ID is greater than 0, it is a real token and the mask is set to 1. These attention masks help our model train better.

====5.5. Models and their architecture====

=====5.5.1. Multinomial Naive Bayes=====
Multinomial Naive Bayes belongs to a family of probabilistic algorithms based on Bayes' theorem and the "naive" assumption of conditional independence between all pairs of features. Bayes' theorem is used to estimate the probability P(x | y), where x is the class of probable outcomes and y is the supplied case to be identified, represented by some specific characteristics. Naive Bayes is commonly used in natural language processing (NLP) tasks. We have followed a pipeline-like architecture, in which the output of one stage is provided as the input to the next. For this we use Pipeline, a Python machine learning tool that helps in handling such pipelines. It takes two parameters as input: the step list, which is a list of tuples chained in sequence with the last object being an estimator, and the verbose flag.

=====5.5.2. Multi-Layer Perceptron=====
A multi-layer perceptron is a feed-forward artificial neural network (ANN). It uses forward propagation to make predictions and back propagation, based on the loss function and computed gradients, to train the parameters. The number of hidden layers depends on whether it is a shallow network or a deep network. We have constructed a basic feed-forward ANN with an input size of 4000 (matching the size of the matrices supplied by the preprocessing function), three hidden linear layers with ReLU activation functions, and an output layer of size 2. Finally, we use the Adam optimizer with a rather low learning rate of 0.001 and, because this is a binary classification task, cross-entropy loss as the error function.

Figure 3: Multi-Layered Perceptron architecture

=====5.5.3. mBERT=====
mBERT stands for Multilingual Bidirectional Encoder Representations from Transformers. Its architecture is Transformer-based, and it is pre-trained on a large collection of unlabelled text.
It is a "deeply bidirectional" model: during the training phase it learns from both the left and the right side of a token's context. One of the most interesting features of mBERT is that it can be fine-tuned just by adding a couple of output layers. The pre-trained BertForSequenceClassification model is loaded, its parameters are locked, and a tiny multi-layer perceptron is added as a binary classifier, which is then trained. Finally, we employ a stochastic gradient descent (SGD) optimizer with a mean squared error (MSE) loss function and a learning rate of 0.01.

Figure 4: mBERT architecture

===6. Results===
This section discusses the results obtained while classifying the news as fake or real. The data is processed, and the finalized dataset is used to train the individual models: Multinomial Naive Bayes, Multi-Layered Perceptron and mBERT. The confusion matrix is used to evaluate the models. In many cases a high accuracy indicates a good model, but in this binary classification setting both kinds of error matter: a news item that was predicted as false but contained accurate facts might have negative implications, and a news item that was predicted as true but was really untrue (a false positive) might cause trust concerns. Therefore, metrics other than the confusion matrix need to be taken into account. We have considered three other metrics: precision, recall and F1-score. There is no separate test set; we split the data with 80% of the data points belonging to the train set and the remaining 20% belonging to the test set. The results before and after data augmentation are shown in Tables 2 and 3 respectively. The average and maximum token lengths are 43.33 and 786 respectively. Hence, we have considered a maximum token length of 100 for our model.
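The padding and attention-mask rule from Section 5.4, combined with the maximum token length of 100 chosen above, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: padding and truncating at the end of the sequence, the reserved padding ID of 0, and the sample token IDs are all assumptions.

```python
MAX_LEN = 100  # maximum token length chosen in the text
PAD_ID = 0     # assumption: token ID 0 is reserved for padding

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Cut the sequence to max_len, then right-pad it with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

def attention_mask(padded_ids):
    """Mask is 1 for real tokens (ID > 0) and 0 for padding (ID == 0)."""
    return [1 if token_id > 0 else 0 for token_id in padded_ids]

short_ids = [101, 2023, 2003, 102]   # invented IDs for a 4-token text
long_ids = list(range(1, 787))       # a 786-token text, the reported maximum

short_padded = pad_or_truncate(short_ids)
long_padded = pad_or_truncate(long_ids)
short_mask = attention_mask(short_padded)
long_mask = attention_mask(long_padded)

print(len(short_padded), sum(short_mask))
print(len(long_padded), sum(long_mask))
```

With an average length of 43.33 tokens, a limit of 100 leaves most texts untruncated while keeping the input far shorter than the 786-token maximum.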
{| class="wikitable"
|+ Table 2: Results on test data before data augmentation
! Model !! Precision !! Recall !! F1 score !! Accuracy
|-
| MLP || 0.94 || 0.77 || 0.81 || 0.89
|-
| Naive Bayes || 0.93 || 0.92 || 0.94 || 0.92
|-
| mBERT || 0.95 || 0.92 || 0.93 || 0.93
|}

{| class="wikitable"
|+ Table 3: Results on test data after data augmentation
! Model !! Precision !! Recall !! F1 score !! Accuracy
|-
| MLP || 0.97 || 0.79 || 0.82 || 0.92
|-
| Naive Bayes || 0.94 || 0.96 || 0.95 || 0.96
|-
| mBERT || 0.95 || 0.94 || 0.97 || 0.97
|}

We have observed better results with data augmentation. MLP has the best precision of 0.97, Naive Bayes has the best recall of 0.96, and mBERT has the best F1 score of 0.97 and the best accuracy of 0.97. Overall, the best performing model is mBERT, mainly because it uses transformers, which are effective for text classification because of their self-attention mechanism and better understanding of word features.

===7. Conclusion and Future Work===
Manually categorizing news demands in-depth knowledge of the domain as well as expertise in identifying abnormalities in the news. In this study we examined the challenge of classifying false news items in the Hindi language using machine learning models. The data used in our analysis was gathered from BBC-Boomlive and from the Cornell University Hostility Detection Dataset. We used data augmentation to increase the dataset size and compared the results before and after data augmentation. Clearly, we got better results after data augmentation. We used multiple performance metrics to compare the results for each model. Compared to the other models, the mBERT model scored higher on most performance metrics. The exceptions are the F1 score before data augmentation, where Naive Bayes led with a score of 0.94, and precision and recall after data augmentation, where MLP had the better precision of 0.97 and Naive Bayes the better recall of 0.96.
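For reference, the four metrics compared above are conventionally derived from confusion-matrix counts as in the sketch below. The counts are invented for illustration, and treating one class as "positive" is an assumption; the paper does not state which class is positive or how scores were averaged across classes.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for whichever class is treated as positive.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Invented counts, not taken from the paper's experiments.
metrics = binary_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a model with high precision but low recall, such as the MLP in the tables, ends up with a noticeably lower F1 score.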
The better performance of the mBERT model is mostly due to the fact that it employs transformers, which are useful for text categorization because of their self-attention mechanism and better comprehension of word features. This study could be extended further; a few ideas are to:
* Collect more data from various sources and news outlets to expand our dataset in order to improve the accuracy of the model.
* Express the result as the probability of a news text being fake or true. For example, it is more informative for the user to know that a news text was detected as fake with 65 percent probability, which means there is a 35 percent probability that it is true.
* As a long-term goal, implement and study fake news detection for various other Indian languages.

===References===
[1] Mehta D, Dwivedi A, Patra A, Anand Kumar M (2021) A transformer-based architecture for fake news classification. Soc Netw Anal Min 11:39. https://doi.org/10.1007/s13278-021-00738-y

[2] Hariharan RamakrishnaIyer LekshmiAmmal, Anand Kumar Madasamy, "Ensemble Transformer Model for Fake News Classification", NITK_NLP at CheckThat! 2021.

[3] Parikh, Shivam B. and Pradeep K. Atrey. "Media-Rich Fake News Detection: A Survey." 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (2018): 436-441.

[4] N. J. Conroy, V. L. Rubin, and Y. Chen, "Automatic deception detection: Methods for finding fake news," Proceedings of the Association for Information Science and Technology, vol. 52, no. 1, pp. 1–4, 2015.

[5] S. Feng, R. Banerjee, and Y. Choi, "Syntactic stylometry for deception detection," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, Association for Computational Linguistics, 2012, pp. 171–175.

[6] Gilda, Shlok. "Evaluating machine learning algorithms for fake news detection." 2017 IEEE 15th Student Conference on Research and Development (SCOReD) (2017): 110-115.
[7] Calvillo DP, Ross BJ, Garcia RJB, Smelter TJ, Rutchick AM. Political Ideology Predicts Perceptions of the Threat of COVID-19 (and Susceptibility to Fake News About It). Social Psychological and Personality Science. 2020;11(8):1119-1128. doi:10.1177/1948550620940539

[8] G. K. Shahi, D. Nandini, FakeCovid – a multilingual cross-domain fact check news dataset for covid-19, in: Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.

[9] Q. Liao, H. Chai, H. Han, X. Zhang, X. Wang, W. Xia, Y. Ding, An Integrated Multi-Task Model for Fake News Detection, IEEE Transactions on Knowledge and Data Engineering 4347 (2021) 1–12.

[10] M. Umer, Z. Imtiaz, S. Ullah, A. Mehmood, G. S. Choi, B. W. On, Fake news stance detection using deep learning architecture (CNN-LSTM), IEEE Access 8 (2020) 156695–156706.

[11] F. Monti, F. Frasca, D. Eynard, D. Mannion, M. M. Bronstein, Fake news detection on social media using geometric deep learning, 2019.

[12] Ajao, Oluwaseun et al. "Fake News Identification on Twitter with Hybrid CNN and RNN Models." Proceedings of the 9th International Conference on Social Media and Society (2018): n. pag.

[13] M. G. Sherry Girgis and E. Amer, "Deep learning algorithms for detecting fake news in online text," in Proceedings of the ICCES, pp. 93–97, Cairo, Egypt, July 2018.

[14] J. Y. Khan, M. T. I. Khondaker, A. Iqbal, and S. Afroz, "A benchmark study on machine learning methods for fake news detection," pp. 1–14, 2019.

[15] A. M. P. Braşoveanu and R. Andonie, "Integrating machine learning techniques in semantic fake news detection," Neural Processing Letters, vol. 52, no. 2, 2020.

[16] Stefan Helmstetter, H. Paulheim, "Weakly Supervised Learning for Fake News Detection on Twitter", 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM).

[17] D. K. Sharma and S. Garg, "Machine Learning Methods to identify Hindi Fake News within social-media," 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), 2021, pp. 1-6, doi: 10.1109/ICCCNT51525.2021.9580073.

[18] https://journals.sagepub.com/doi/full/10.1177/0002764219878224

[19] https://pytorch.org/hub/huggingface_pytorch-transformers/

[20] https://scikit-learn.org/stable/modules/naive_bayes.html