=Paper=
{{Paper
|id=Vol-3159/T7-11
|storemode=property
|title=Fake News Detection in Urdu Language using BERT
|pdfUrl=https://ceur-ws.org/Vol-3159/T7-11.pdf
|volume=Vol-3159
|authors=Snehaan Bhawal,Pradeep Kumar Roy
|dblpUrl=https://dblp.org/rec/conf/fire/BhawalR21
}}
==Fake News Detection in Urdu Language using BERT==
Snehaan Bhawal¹, Pradeep Kumar Roy²
¹ Kalinga Institute of Industrial Technology, Odisha, India
² Indian Institute of Information Technology Surat, Gujarat, India
Abstract
With the increasing popularity of social media, there has been a rise in the amount of Fake News in
circulation, misleading public opinion. A Fake News detection system is therefore necessary to
avoid such consequences. Most existing Fake News detection systems work with resource-rich
languages like English and Spanish, but very few can handle low-resource languages like
Urdu. The current study focuses on detecting Fake News in the Urdu language using Machine and Deep
Learning techniques. The ‘UrduFake’ data provided for the shared task at FIRE-2021 is used in this
research. The experimental outcomes of the various models showed that the Transfer Learning models
performed better than the Machine Learning models, achieving weighted average F1-scores of 0.87
and 0.61 on the validation and test datasets, respectively.
Keywords
Fake News Detection, Urdu, Deep Learning
1. Introduction
There has been a steady rise in internet traffic throughout the world. Connectivity between
people has increased with the popularity of social media [1]. Such platforms have now
become the principal source of information for the general public. Due to the unrestricted
nature of such media, there is little to no oversight of the articles being posted. Although this
promotes freedom of speech, it can also be misused to spread Fake News [2]. Most such
platforms do not verify articles and promote them according to popularity, leading to the
faster spread of unverified content.
Rubin et al. [3] categorized such deceptive news into three broad groups: i) Serious Fabrication,
ii) Hoaxes and iii) Satire. There have been many cases where such Fake News was
intentionally spread via social media platforms to mislead the general public [4][5]. It can
be used to target people by discrediting them or to create political unrest, thereby
undermining society’s stability. Such articles are usually based on polarizing topics [6] and
garner massive popularity on social media, which in turn promotes them to a wider
audience. Thus there is an urgent need to detect and stop such volatile articles at an early
stage of circulation, by assessing the credibility of each article and determining whether it is
trustworthy, in order to prevent further spread.
Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Email: snehaan@gmail.com (S. Bhawal); pkroynitp@gmail.com (P. K. Roy)
ORCID: 0000-0002-1072-5326 (S. Bhawal); 0000-0001-5513-2834 (P. K. Roy)
However, most of the research on the detection of fake news has been done in resource-
rich languages like English and Spanish [7]. Despite Urdu having more than 100 million
speakers, it has seen very little development of such detection systems due to the absence
of properly labelled data and the scarcity of resources for NLP tasks. The event organizers [8, 9]
provided a benchmark data set for Fake News detection in Urdu [10]. The current study utilizes
this data to implement and compare different Machine and Deep Learning models for Fake
News Detection in the Urdu language.
The rest of the article is organized as follows: Section 2 discusses related work, while
the task description and data set distribution are explained in Section 3. Section 4 describes the
preprocessing steps taken, followed by the explanation of the proposed methodology in Section
5. The experimental results are discussed in Section 6. Section 7 concludes this research with
limitations and future scope.
2. Literature Review
Automating fake news detection has been a challenging task for a long time, particularly for low-
resource languages. Researchers have created their own data sets [11, 12] due to the lack
of sufficiently benchmarked datasets. Zhou and Zafarani [13] introduced four techniques for
fake news detection based on (i) knowledge, (ii) content, (iii) propagation and (iv) source of
origin.
Rubin et al. [14] developed a model using a content-based approach by picking up on the
satirical cues present in news articles and implementing an SVM-based algorithm with
five features: (i) Absurdity, (ii) Humor, (iii) Grammar, (iv) Negative affect, and (v) Punctuation.
They tested their feature combinations on 360 different news articles and were able to detect satirical
or potentially misleading news with an F1 score of 0.87. Another study [15] follows the
propagation-based approach, exploring the social context during news propagation on social
media by looking into the relationships between publishers, articles, and users.
Regarding Fake News detection in Urdu, very little research has been conducted. To the best
of our knowledge, the data set [10] provided by the organizers serves as the only properly labelled
data available for the required task. For works related to Fake News Detection in the Urdu
language, we can refer to the previous iteration of the FIRE shared task. The study reported
in [16] topped the leaderboard, using an ensemble of a RoBERTa model and a CNN model with
word and character embeddings, respectively.
3. Task and Data description
Nowadays, social networking platforms are one of the primary sources of information used to
spread Fake News. Most existing detection systems are built with non-Urdu language datasets;
hence, news written in Urdu may not be detected by them. The current study implements and
compares different Machine and Deep Learning models for Fake News Detection in the Urdu
language for the UrduFake-2021 task (https://www.urdufake2021.cicling.org/home). Table 1
shows the category-wise distribution of the articles present in the training data set across five
domains, namely Business, Health, Showbiz, Sports and Technology. Table 2 provides the
distribution of Real and Fake classes in the Train, Validation and Test data.

Table 1
Category-wise Article Distribution in the Training Data

Category      Real   Fake
Business       150     80
Health         150    130
Showbiz        150    130
Sports         150     80
Technology     150    130
Total          750    550

Table 2
Label Distribution in the Given Data

Data Set      Real   Fake   Total
Train          600    438    1038
Validation     150    112     262
Test           200    100     300
4. Data Preprocessing
The dataset (https://www.urdufake2021.cicling.org/dataset) provided to us by the organizers is
already processed as discussed by the authors of [10]. Additionally, we removed all numerals,
URLs, email ids and website links. Punctuation marks were replaced with spaces, and extra
spaces were removed from each article.
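For illustration, a minimal sketch of this additional cleaning step is given below; the regular expressions and the Urdu punctuation set are illustrative assumptions, not the exact patterns used in the study.

```python
# A sketch of the additional cleaning applied on top of the organizers'
# preprocessing. The regex patterns and the extra Urdu punctuation marks
# are illustrative assumptions, not the exact patterns used in the study.
import re
import string

URDU_PUNCTUATION = "۔،؛؟"  # assumed set of common Urdu punctuation marks

def clean_article(text: str) -> str:
    text = re.sub(r"\S+@\S+", " ", text)                            # remove email ids
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)            # remove URLs / website links
    text = re.sub(r"[0-9\u0660-\u0669\u06f0-\u06f9]+", " ", text)   # remove Latin and Arabic/Urdu numerals
    table = str.maketrans({p: " " for p in string.punctuation + URDU_PUNCTUATION})
    text = text.translate(table)                                     # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()                         # collapse extra spaces
```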
5. Methodology
In this study, three different approaches were used:
i. Conventional Machine Learning models
ii. Neural Network models
iii. Transfer Learning models
5.1. Conventional Machine Learning based models
Under the Conventional ML-based models, we explored the use of 1-5 gram word TF-IDF
features. The features were first extracted and then provided to different Machine Learning
classifiers, namely Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), XGBoost
(XGB) and Support Vector Machine (SVM). The detailed results of these classifier models are
shown in Table 3 of the Results section.
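As an illustration, a minimal scikit-learn sketch of this TF-IDF pipeline is shown below with Logistic Regression; the other classifiers are swapped in the same way, and the hyperparameters and variable names (train_texts, train_labels, val_texts) are assumptions rather than the study's exact settings.

```python
# A sketch of the 1-5 gram word TF-IDF + classifier setup, shown with
# Logistic Regression; NB, RF, XGB and SVM are plugged in the same way.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 5))),  # 1-5 gram word TF-IDF
    ("clf", LogisticRegression(max_iter=1000)),                       # assumed hyperparameter
])

# pipeline.fit(train_texts, train_labels)        # cleaned Urdu articles and Real/Fake labels
# val_predictions = pipeline.predict(val_texts)  # predictions on the validation articles
```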
Figure 1: Framework used to predict Fake News
5.2. Neural Network based models
For the Neural Network based models, we reused the previously extracted 1-5 gram TF-IDF
features as the input to a simple Deep Neural Network (DNN). The DNN consists of three fully
connected layers of 512, 256 and 128 neurons, followed by a single output neuron. A single
output neuron was chosen because of the binary nature of the classification problem. The ReLU
activation function is used in the hidden layers and the sigmoid activation function at the
output layer. Adam and binary cross-entropy were chosen as the optimizer and loss function
for all Neural Network based models.
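A minimal Keras sketch of this DNN is given below; the TF-IDF input dimension (num_features) is an assumption, since it depends on the extracted vocabulary.

```python
# A sketch of the DNN: three fully connected ReLU layers (512, 256, 128) over
# the TF-IDF vector, with a single sigmoid output neuron for the binary decision.
from tensorflow.keras import layers, models, optimizers

num_features = 50000  # assumed size of the 1-5 gram TF-IDF feature vector

dnn = models.Sequential([
    layers.Input(shape=(num_features,)),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # single neuron for Real vs Fake
])
dnn.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
            loss="binary_crossentropy", metrics=["accuracy"])
```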
This was followed by a Convolutional Neural Network (CNN) based approach. The CNN model
consisted of one Conv1D layer followed by a Global Max Pooling layer and a dropout layer. This
was then connected to two fully connected hidden layers of 128 and 64 neurons, respectively.
As the input, we used an embedding layer of 100 dimensions with the input length set to 512,
resulting in an input of dimension (512, 100). The convolutional layer had 64 filters with a
kernel size of 3.
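The CNN described above can be sketched in Keras roughly as follows; the tokenizer vocabulary size and the dropout rate are assumptions not stated in the text.

```python
# A sketch of the CNN: a 100-dimensional embedding over length-512 sequences,
# one Conv1D layer (64 filters, kernel size 3), Global Max Pooling, dropout,
# and dense layers of 128 and 64 neurons.
from tensorflow.keras import layers, models, optimizers

vocab_size = 30000  # assumed tokenizer vocabulary size
max_len = 512       # padded sequence length used in the study

cnn = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(input_dim=vocab_size, output_dim=100),
    layers.Conv1D(filters=64, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                    # assumed dropout rate
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
            loss="binary_crossentropy", metrics=["accuracy"])
```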
As the final Neural Network based model, a Bidirectional Long Short-Term Memory (Bi-LSTM)
model was chosen. It consists of 256 memory units followed by a Global Max Pooling and a
Batch Normalization layer. An embedding layer of 50 dimensions was used as the input layer,
with the padding length fixed at 512, followed by dense layers of 20 and 10 neurons in the
first and second layers, respectively. The output layer is the same as in the other models: a
single neuron with a sigmoid activation function.
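A corresponding Keras sketch of the Bi-LSTM model is given below, compiled with the same optimizer and loss as the other Neural Network models; the vocabulary size is again an assumption.

```python
# A sketch of the Bi-LSTM: a 50-dimensional embedding over length-512 sequences,
# a bidirectional LSTM with 256 units, Global Max Pooling, Batch Normalization,
# and dense layers of 20 and 10 neurons.
from tensorflow.keras import layers, models, optimizers

vocab_size = 30000  # assumed tokenizer vocabulary size
max_len = 512

bilstm = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(input_dim=vocab_size, output_dim=50),
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.GlobalMaxPooling1D(),
    layers.BatchNormalization(),
    layers.Dense(20, activation="relu"),
    layers.Dense(10, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
bilstm.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
               loss="binary_crossentropy", metrics=["accuracy"])
```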
After successive hyperparameter tuning, we found that the best results for the Neural Network
models were achieved by setting the maximum sequence length to 512; further increases led to
a decrease in F1 score and an increase in training time. The learning rate was set to 0.00001,
and the optimizer was Adam. The code for the current study can be found in the GitHub
repository (https://github.com/Sbhawal/NEWUrduFake-FIRE-2021-CODES.git).
5.3. Transfer Learning based models
We implemented BERT (Bidirectional Encoder Representations from Transformers) models to
exploit their transfer learning capabilities. For these models, no further preprocessing was
done. A limitation of such BERT-based models is that they cannot accommodate all the tokens
in an article, as their maximum sequence length is 512. This issue was accepted, since we had
already seen that increasing the sequence length in the Neural Network models led to
diminishing returns.
Two different variants of BERT models were studied.
i. BERT (multilingual)
ii. MuRIL
The BERT [17] multilingual model was trained on 102 languages with masked language mod-
elling. Here, the pooled output from the pre-trained model was fed to a dropout layer and
finally to the output neuron.
The last model we used is MuRIL [18] (Multilingual Representations for Indian Languages).
This is a BERT model trained on a large corpus of 17 Indian languages, including Urdu, collected
from Wikipedia and the Dakshina dataset [19]. The model is also trained on translated and
transliterated data in addition to the monolingual corpus, which gives it an advantage in
processing code-mixed text.
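A minimal PyTorch/Transformers sketch of this fine-tuning setup (pooled output, then a dropout layer, then a single sigmoid output neuron) is shown below; the public checkpoints bert-base-multilingual-cased and google/muril-base-cased are assumed, and the dropout rate and example input are illustrative.

```python
# A sketch of the fine-tuning setup described above: the encoder's pooled output
# is passed through a dropout layer and a single sigmoid output neuron.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertFakeNewsClassifier(nn.Module):
    def __init__(self, checkpoint="bert-base-multilingual-cased"):
        # use "google/muril-base-cased" for the MuRIL variant
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.dropout = nn.Dropout(0.3)  # assumed dropout rate
        self.out = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).pooler_output
        return torch.sigmoid(self.out(self.dropout(pooled)))

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertFakeNewsClassifier()
batch = tokenizer(["<Urdu news article text>"], truncation=True, padding="max_length",
                  max_length=512, return_tensors="pt")
probability_real = model(batch["input_ids"], batch["attention_mask"])
```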
6. Results
This section presents the experimental results of all the models mentioned in Section 5. The
results were obtained on the validation data, with the models trained on the training samples
shown in Table 2, and are reported as precision, recall, and weighted F1-score. A model is
considered best if it reports the highest weighted averages of precision, recall, and F1-score
among all models.
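For reference, the per-class and weighted-average scores reported in Tables 3-5 can be produced with scikit-learn's classification_report; the labels below are a toy example, not the study's data.

```python
# How the per-class and weighted-average scores in Tables 3-5 can be computed.
from sklearn.metrics import classification_report

y_true = ["Real", "Real", "Fake", "Fake", "Real"]  # gold validation labels (toy example)
y_pred = ["Real", "Fake", "Fake", "Fake", "Real"]  # model predictions (toy example)
print(classification_report(y_true, y_pred, digits=2))  # includes the "weighted avg" row
```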
Table 3
Results of Conventional Machine Learning Models
Model   Class          Precision   Recall   F1-score
LR      Real             0.72       0.93     0.81
        Fake             0.84       0.52     0.64
        Weighted Avg     0.77       0.75     0.74
RF      Real             0.64       0.84     0.73
        Fake             0.64       0.38     0.47
        Weighted Avg     0.64       0.64     0.62
NB      Real             0.69       0.92     0.72
        Fake             0.81       0.46     0.58
        Weighted Avg     0.74       0.72     0.70
XGB     Real             0.70       0.89     0.78
        Fake             0.76       0.49     0.60
        Weighted Avg     0.73       0.72     0.70
SVM     Real             0.58       1.00     0.74
        Fake             1.00       0.04     0.09
        Weighted Avg     0.76       0.59     0.46
Table 4
Results of Neural Network based models
Model      Class          Precision   Recall   F1-score
DNN        Real             0.72       0.97     0.82
           Fake             0.92       0.49     0.64
           Weighted Avg     0.80       0.76     0.75
DNN+Emb    Real             0.73       0.91     0.81
           Fake             0.82       0.55     0.66
           Weighted Avg     0.77       0.76     0.75
CNN        Real             0.74       0.85     0.79
           Fake             0.75       0.61     0.67
           Weighted Avg     0.74       0.74     0.74
Bi-LSTM    Real             0.72       0.83     0.77
           Fake             0.71       0.57     0.63
           Weighted Avg     0.72       0.72     0.71
Observing the outcomes of the traditional ML-based models shown in Table 3, we find that the
LR classifier performed the best, with weighted precision, recall and F1-score of 0.77, 0.75 and
0.74, respectively.
The outcome of the LR model is close to that of the simple Deep Neural Network model, which
achieved weighted precision, recall and F1-score of 0.80, 0.76 and 0.75, respectively, as shown
in Table 4. In general, however, the performance of the traditional ML models shown in Table 3
is lower than that of the Neural Network models. These comparative outcomes confirm that
neural network based models are the better choice for developing an automated Urdu fake
news detection system.
Finally, we experimented with the Transfer Learning based models, BERT and MuRIL. The
outcomes of these models are shown in Table 5. The MuRIL model performed the best, with
weighted precision, recall and F1-score values of 0.87, 0.87 and 0.87, respectively, beating the
multilingual BERT model, which achieved a weighted F1-score of 0.86 on the validation data.

Table 5
Results of Transfer Learning based models

Model   Class          Precision   Recall   F1-score
BERT    Real             0.87       0.89     0.88
        Fake             0.84       0.82     0.83
        Weighted Avg     0.86       0.86     0.86
MuRIL   Real             0.86       0.92     0.89
        Fake             0.88       0.80     0.84
        Weighted Avg     0.87       0.87     0.87
7. Conclusion
Fake news on social media platforms is a major issue today. This research proposed a Transfer
Learning based framework for Urdu fake news detection. Several traditional ML models and
NN based models were evaluated to achieve the best prediction performance. We found that
MuRIL, a transfer learning model, outperformed the traditional Machine Learning and other
NN based models in the Fake News Detection task. The transfer learning based MuRIL model
achieved an accuracy of 0.743 and a macro F1-score of 0.610 on the test dataset. The developed
model was trained on an Urdu dataset; hence, fake news posted in other languages may not be
detected by it. Due to the use of BERT based models, the sequence length was limited to 512
tokens, which could be improved by using an ensemble of DNN and BERT models; this will be
explored in future work.
References
[1] P. K. Roy, S. Chahar, Fake profile detection on social networking websites: A comprehensive
review, IEEE Transactions on Artificial Intelligence 1 (2020) 271–285. doi:10.1109/TAI.2021.3064901.
[2] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: A data
mining perspective, ACM SIGKDD explorations newsletter 19 (2017) 22–36.
[3] V. L. Rubin, Y. Chen, N. K. Conroy, Deception detection for news: three types of fakes,
Proceedings of the Association for Information Science and Technology 52 (2015) 1–4.
[4] H. Allcott, M. Gentzkow, Social media and fake news in the 2016 election, Journal of
economic perspectives 31 (2017) 211–36.
[5] C. Shao, G. L. Ciampaglia, O. Varol, K.-C. Yang, A. Flammini, F. Menczer, The spread of
low-credibility content by social bots, Nature communications 9 (2018) 1–9.
[6] B. Ghanem, P. Rosso, F. Rangel, An emotional analysis of false information in social media
and news articles, ACM Transactions on Internet Technology (TOIT) 20 (2020) 1–18.
[7] M. Amjad, G. Sidorov, A. Zhila, Data augmentation using machine translation for fake
news detection in the Urdu language, in: Proceedings of the 12th Language Resources
and Evaluation Conference, European Language Resources Association, Marseille, France,
2020, pp. 2537–2542. URL: https://aclanthology.org/2020.lrec-1.309.
[8] M. Amjad, G. Sidorov, A. Zhila, A. F. Gelbukh, P. Rosso, Overview of the shared task on
fake news detection in Urdu at FIRE 2020, in: FIRE (Working Notes), 2020, pp. 434–446.
[9] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, UrduFake@FIRE2020: Shared track
on fake news identification in Urdu, in: Forum for Information Retrieval Evaluation, 2020,
pp. 37–40.
[10] M. Amjad, G. Sidorov, A. Zhila, H. Gomez-Adorno, I. Voronkov, A. Gelbukh, Bend the
truth: A benchmark dataset for fake news detection in Urdu and its evaluation, Journal of
Intelligent & Fuzzy Systems 39 (2020) 2457–2469. doi:10.3233/JIFS-179905.
[11] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news,
arXiv preprint arXiv:1708.07104 (2017).
[12] W. Y. Wang, “Liar, liar pants on fire”: A new benchmark dataset for fake news detection,
arXiv preprint arXiv:1705.00648 (2017).
[13] X. Zhou, R. Zafarani, A survey of fake news: Fundamental theories, detection methods,
and opportunities, ACM Computing Surveys (CSUR) 53 (2020) 1–40.
[14] V. Rubin, N. Conroy, Y. Chen, S. Cornwell, Fake news or truth? Using satirical cues
to detect potentially misleading news, in: Proceedings of the Second Workshop on
Computational Approaches to Deception Detection, Association for Computational Lin-
guistics, San Diego, California, 2016, pp. 7–17. URL: https://aclanthology.org/W16-0802.
doi:10.18653/v1/W16-0802.
[15] K. Shu, S. Wang, H. Liu, Beyond news contents: The role of social context for fake news
detection, in: Proceedings of the twelfth ACM international conference on web search
and data mining, 2019, pp. 312–320.
[16] N. Lin, S. Fu, S. Jiang, Fake news detection in the Urdu language using CharCNN-RoBERTa,
in: FIRE (Working Notes), 2020.
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[18] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal,
R. T. Nagipogu, S. Dave, S. Gupta, S. C. B. Gali, V. Subramanian, P. Talukdar, MuRIL:
Multilingual representations for Indian languages, 2021. arXiv:2103.10730.
[19] B. Roark, L. Wolf-Sonkin, C. Kirov, S. J. Mielke, C. Johny, I. Demirşahin, K. Hall, Processing
South Asian languages written in the Latin script: the Dakshina dataset, in: Proceedings
of The 12th Language Resources and Evaluation Conference (LREC), 2020, pp. 2413–2423.
URL: https://www.aclweb.org/anthology/2020.lrec-1.294.