Ensembling of Various Transformer-Based Models for the Fake News Detection Task in the Urdu Language

Sakshi Kalra (a), Preetika Verma (a), Yashvardhan Sharma (a) and Gajendra Singh Chauhan (b)

(a) Department of Computer Science and Information Systems, Birla Institute of Technology and Science Pilani, Pilani Campus, Rajasthan, India
(b) Department of Humanities and Social Sciences, Birla Institute of Technology and Science Pilani, Pilani Campus, Rajasthan, India

Forum for Information Retrieval Evaluation, December 13-17, 2021, India

Abstract
The spread of misinformation has become a severe issue affecting society, and inaccurate information has enormous potential to cause real-world harm. Algorithms that detect fake news automatically would therefore be very useful in preventing the unnecessary panic and damage caused by rumors. The fake news problem exists in all languages, and it is crucial to solve it for languages other than English, which have scarce datasets. This paper tackles the problem of automatic fake news detection in Urdu, a low-resource language, using the Urdu dataset provided by FIRE 2021. We fine-tuned monolingual and multilingual transformers and, after searching for hyperparameters, tried ensembling our models. We submitted our model to the UrduFake task, where it achieved an accuracy of 0.596 and an F1-macro score of 0.449.

Keywords
Fake News Detection, Natural Language Processing, Label Classification, Various Transformers, Ensemble Techniques

1. Introduction

In 2016, "fake news" was used so widely that the Oxford Dictionary added the term to its official list, with the description: "false stories that appear to be news spread on the internet or using other media, usually created to influence political views or as a joke." Fake news dissemination has been a concerning issue since the invention of the printing press in the 15th century, and the spread of fake news and misinformation has brought disastrous consequences many times. A recent example is the Facebook post of 19th September 2021 which claimed that the Canadian prime minister and his wife had faked their COVID-19 vaccinations on live television [1]. The post created panic and was widely shared; it was later found to be false. It is therefore crucial to develop algorithms that curb the spread of fake news before it creates panic and confusion among people.

There have been many attempts at automatic fake news detection in English, Chinese, French, and other high-resource languages. Many Natural Language Processing techniques are used to detect fake news in English, and various transformer-based models such as BERT [2], XLNet [3], DistilBERT [4], etc., have been designed to obtain word embeddings and capture word dependencies in English text. Very little work has been done on Urdu, even though it is spoken as a first language by nearly 70 million people and as a second language by 100 million people. Urdu is a low-resource language, and there is a scarcity of publicly available datasets for NLP tasks in this language. It is the most popular language in Pakistan, has around 100 million speakers across the world [5], and is widely spoken in the Indian subcontinent, yet it still lacks many language processing tools such as parsers and corpora. Many researchers have tried to target the Urdu language, and a shared task [6], [7] on fake news detection in Urdu has been started to tackle the problem.
Fact-checking websites like PolitiFact [8], which verify the accuracy of statements, have emerged for English. Researchers have divided fake news into seven categories: false news, polarised content, satire, misreporting, commentary, persuasive information, and citizen journalism [9]. It has also been found that fake news articles are less factual, less grammatically correct, and contain more emotionally charged claims, and analyzing the linguistic features of a text has proved helpful for classification.

In this paper, we target the Urdu fake news detection task by participating in UrduFake@FIRE2021 [10]. We experimented with various transformer-based models individually and also tried ensembling them; we obtained better results without the ensembling approach.

2. Related Work

Most of the proposed approaches target the English language [11], and there have been some efforts for Spanish [12] and German [13] as well. Data augmentation with machine translation has been used to tackle the problem of small datasets [14]: newly annotated Urdu data was generated by translating an existing English dataset with Google Translate. The classifier trained on the original Urdu dataset showed better results than classifiers trained on the augmented and translated data, partly because the machine translation quality between English and Urdu was poor.

Recent studies have extracted different features from Urdu text and fed them into supervised classification models like logistic regression, k-nearest neighbors, random forests, and support vector machines. These features try to model the news articles mathematically. Linguistic features include the total number of words, frequency of function words and phrases, part-of-speech tags, unique word count, syntactic dependencies, clauses, punctuation, etc. Domain-specific linguistic features align precisely with the news domain and include quoted words, external links, etc. Transformer-based models have also been utilized to detect misleading news articles [15] and false COVID-19-related news [11].

3. Dataset

The dataset used for this task is "Bend the Truth", a binary annotated corpus containing articles from six domains: technology, education, business, sports, politics, and entertainment [5]. It is the only annotated corpus available for detecting false news in Urdu. The real news articles are collected from mainstream news websites like BBC News, CNN Urdu, Daily Pakistan, UrduPoint, etc., scraped with the newspaper library. The data is collected and annotated manually: an article is labeled real if it is published on a legitimate website, its source is mentioned, or the same news is found on other reliable websites. Texts of different lengths are collected during the data collection process. For the fake news collection, professional journalists are hired to write news articles for all the domains and are asked to avoid unintentionally introducing any patterns into the fake articles. After collection, all the articles are reread to remove typing errors and word misuse. The data is then cleaned: Latin alphabet characters are removed, Eastern Arabic-Indic numerals are converted to Western Arabic numerals, and paragraphs are split into sentences on Urdu end-of-sentence markers. The training dataset contains 750 real and 550 fake articles, and the evaluation is done on an unseen dataset of 300 articles.
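To make the cleaning steps concrete, the following is a minimal sketch, not the dataset authors' original preprocessing code, of how Latin alphabet characters can be removed, Eastern Arabic-Indic numerals converted to Western Arabic digits, and paragraphs split into sentences on the Urdu full stop '۔'. The function name, regular expressions, and sample text are illustrative assumptions.

```python
import re

# Map Eastern Arabic-Indic numerals (U+06F0-U+06F9 and U+0660-U+0669) to Western digits.
EASTERN_TO_WESTERN = {ord(c): str(i) for i, c in enumerate("۰۱۲۳۴۵۶۷۸۹")}
EASTERN_TO_WESTERN.update({ord(c): str(i) for i, c in enumerate("٠١٢٣٤٥٦٧٨٩")})

def clean_urdu_article(text):
    """Clean one article and return its sentences (illustrative sketch)."""
    # Drop Latin alphabet characters that sometimes appear in scraped Urdu text.
    text = re.sub(r"[A-Za-z]+", " ", text)
    # Convert Eastern Arabic-Indic numerals to Western Arabic digits.
    text = text.translate(EASTERN_TO_WESTERN)
    # Collapse repeated whitespace left over from the removals.
    text = re.sub(r"\s+", " ", text).strip()
    # Split paragraphs into sentences on the Urdu end-of-sentence marker '۔'.
    return [s.strip() for s in text.split("۔") if s.strip()]

if __name__ == "__main__":
    sample = "یہ ایک مثال ہے۔ sample اس میں ۲۰۲۱ کا ذکر ہے۔"
    print(clean_urdu_article(sample))
```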
4. Proposed Techniques and Algorithms

Transformer-based implementations involve pre-training followed by fine-tuning. In the first stage, the model is trained on large monolingual datasets or on datasets covering multiple languages (multilingual). All the initialized parameters are then fine-tuned using the labeled data from the given dataset. Our code is available in the GitHub repository at https://github.com/Kalra-Sakshi/URdu-FND.git. Only the encoder part of the transformer architecture is used to obtain the word embeddings, and one additional output layer is added to compute the probabilities of the real and fake classes. The word embedding models used are listed below:

• RoBERTa: This model is trained on Urdu news data from Pakistani newspapers. It builds on BERT and is trained with larger mini-batches and learning rates.
• ALBERT: Trained on Urdu datasets, but it gave inferior results.
• XLM-RoBERTa: Trained on 100 different languages, with the same implementation as RoBERTa. It is pre-trained on more than 2 TB of CommonCrawl data. The idea is to map any language into a language-agnostic vector space where all languages point to the same area for the same input.
• Multilingual BERT: Pre-trained on 104 languages. The texts are lowercased and tokenized using WordPiece with a shared vocabulary of 110,000 tokens. Languages with a larger Wikipedia are under-sampled, and those with fewer resources are oversampled.

4.1. Hyperparameter Description

The Ray Tune library is used for hyperparameter tuning. The original head of each model is removed and replaced with a classification head so that the output corresponds to the two classes. The training dataset is used for fine-tuning the pre-trained models available on Hugging Face. ALBERT is not ensembled due to its poor results. Monolingual RoBERTa gives the best results, while both multilingual models are poor at detecting the fake class. For the final run, we submitted roberta-urdu-small, which is pre-trained on an Urdu news corpus, since it alone gives better results than any ensembled combination. The training data is normalized using the normalization module from the urduhack library to eliminate characters from other languages such as Arabic, and the same tokenizer is used for fine-tuning the models. Eight trials are conducted to find the optimum hyperparameters: a search space is defined, and hyperparameter combinations are sampled randomly. The Adam optimizer with weight decay is used for model optimization. Table 1 lists the eight trials conducted to find the optimum hyperparameters.

Table 1
Eight Trials to Find the Optimum Hyperparameters

Trial no.   Learning rate   No. of epochs   Batch size
1           7.16125e-06     2               2
2           5.24269e-06     4               2
3           8.15284e-06     3               4
4           1.00742e-05     2               4
5           1.35181e-05     2               2
6           2.04561e-05     4               2
7           1.42903e-05     4               4
8           7.91852e-06     3               8

4.2. Ensembling of the Models

We tried to ensemble the three transformer-based architectures [11]; ALBERT was not considered as it gives inferior results. The ensembled model extracts the softmax probabilities from each model and computes their average. In our problem, the results of the three transformers individually are not close to each other. The ensembled model performs better than multilingual BERT and XLM-RoBERTa, but monolingual RoBERTa alone gives a better result on the validation dataset than any other model or ensemble combination. Figure 1 shows the transformer-based ensemble model architecture.
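To illustrate the fine-tuning and hyperparameter search described above, here is a minimal sketch, written under assumptions rather than being the exact training script from our repository, of attaching a two-class classification head to a pre-trained checkpoint and letting Ray Tune randomly sample the learning rate, number of epochs, and batch size over eight trials, similar to the space in Table 1. The checkpoint identifier, the tiny in-memory dataset, and the metric function are illustrative placeholders.

```python
import numpy as np
from datasets import Dataset
from ray import tune
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "urduhack/roberta-urdu-small"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def model_init():
    # Drop the pre-trained head and attach a fresh two-class classification head.
    return AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Tiny illustrative dataset; the actual task uses the "Bend the Truth" corpus.
raw = {"text": ["پہلی خبر کا متن", "دوسری خبر کا متن", "تیسری خبر کا متن", "چوتھی خبر کا متن"],
       "label": [0, 1, 0, 1]}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = Dataset.from_dict(raw).map(tokenize, batched=True)
train_dataset = eval_dataset = dataset

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

training_args = TrainingArguments(
    output_dir="urdu-fnd",
    evaluation_strategy="epoch",
    weight_decay=0.01,  # AdamW: Adam with weight decay
)

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Random search over a space similar to Table 1, with eight trials.
search_space = {
    "learning_rate": tune.loguniform(5e-6, 5e-5),
    "num_train_epochs": tune.choice([2, 3, 4]),
    "per_device_train_batch_size": tune.choice([2, 4, 8]),
}

best_run = trainer.hyperparameter_search(
    hp_space=lambda _: search_space,
    backend="ray",
    n_trials=8,
    direction="maximize",
)
print(best_run.hyperparameters)
```

The search here uses random sampling, as in our setup; other Ray Tune search algorithms could be substituted without changing the rest of the script.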
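The soft-voting ensemble described in Section 4.2 simply averages the per-class softmax probabilities of the individual fine-tuned models. The sketch below shows this averaging step; the local checkpoint paths and the label ordering are assumptions for illustration, not the repository's actual paths.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative local paths to the three fine-tuned checkpoints (assumed names).
CHECKPOINTS = ["./roberta-urdu-small-ft", "./xlm-roberta-ft", "./mbert-ft"]

tokenizers = [AutoTokenizer.from_pretrained(p) for p in CHECKPOINTS]
models = [AutoModelForSequenceClassification.from_pretrained(p).eval() for p in CHECKPOINTS]

@torch.no_grad()
def ensemble_predict(text):
    """Soft voting: average the softmax probabilities of all models."""
    probs = []
    for tok, model in zip(tokenizers, models):
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
        probs.append(torch.softmax(model(**inputs).logits, dim=-1))
    avg_probs = torch.stack(probs).mean(dim=0)  # shape: (1, 2)
    return int(avg_probs.argmax(dim=-1))        # 0 = real, 1 = fake (assumed label order)

print(ensemble_predict("یہاں ایک اردو خبر کا متن آتا ہے۔"))
```

Because the three models' individual outputs differ substantially, this averaging did not outperform monolingual RoBERTa alone in our experiments (see Section 5).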
5. Results and Evaluations

Different combinations are ensembled using soft voting after the hyperparameter tuning to get the best results. The results on the validation dataset are listed in Table 2. RoBERTa gives the best results with the hyperparameters of the fourth trial, and the multilingual models with those of the seventh trial. ALBERT provides poor results and is not used for ensembling. We tried soft voting for all combinations.

Figure 1: Transformer-based Ensemble Model Architecture

Table 2
Results of the individual and ensembled models

Model                          Training Data   Testing Data
RoBERTa-urdu-small             0.9083          0.9215
XLM-RoBERTa                    0.8167          0.8260
bert-base-multilingual-cased   0.8282          0.8442
alberta-urdu-large             0.6183          0.6621
Ensemble of all three          0.8625          0.8758

5.1. Error Analysis

The multilingual models are inferior at identifying fake texts and classified many of them as real, possibly because of the slight imbalance in the training data. The monolingual model alone performed slightly better than any ensembled combination. For the surprise dataset, our submitted run is unable to detect most of the fake news articles. Table 3 lists the error analysis report.

Table 3
Error Analysis Report

Fake Precision   Fake Recall   Fake F1   Real Precision   Real Recall   Real F1   F1 Macro   Accuracy
0.266            0.120         0.165     0.654            0.835         0.734     0.449      0.596

6. Conclusions and Future Work

Traditional machine learning based approaches produced better results than the transformer-based approach. The training set is small, and even though the results are good on the validation set, the model cannot perform well on the surprise dataset. It is also unable to distinguish fake samples and gives very low precision, recall, and F1 score for the fake class. Future work involves extracting features from the intermediate transformer layers. It would also be interesting to try transfer learning across languages, for example, training on an English dataset and testing on Urdu. We can test whether the problem of insufficient data can be solved by using datasets in other languages and training multilingual models. The idea is to map any incoming language into a language-agnostic vector space where all languages point to the same area for the same input.

References

[1] PolitiFact, fact-checking website, URL: https://www.politifact.com/factchecks/2021/sep/24/facebook-posts/trudeaus-got-their-covid-19-shots/.
[2] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[3] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, Advances in Neural Information Processing Systems 32 (2019).
[4] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[5] M. Amjad, G. Sidorov, A. Zhila, H. Gómez-Adorno, I. Voronkov, A. Gelbukh, "Bend the truth": Benchmark dataset for fake news detection in Urdu language and its evaluation, Journal of Intelligent & Fuzzy Systems 39 (2020) 2457–2469.
[6] M. Amjad, G. Sidorov, A. Zhila, A. F. Gelbukh, P. Rosso, Overview of the shared task on fake news detection in Urdu at FIRE 2020, in: FIRE (Working Notes), 2020, pp. 434–446.
[7] M. Amjad, G. Sidorov, A. Zhila, A. Gelbukh, P. Rosso, UrduFake@FIRE2020: Shared track on fake news identification in Urdu, in: Forum for Information Retrieval Evaluation, 2020, pp. 37–40.
[8] PolitiFact, fact-checking website, URL: https://www.politifact.com/.
[9] Economic Times, Seven types of fake news identified to help detect misinformation, URL: https://economictimes.indiatimes.com/news/politics-and-nation/seven-types-of-fake-news-identified-to-help-detect-misinformation/no-message-in-fake-news/slideshow/72106573.cms.
[10] FIRE 2021, UrduFake2021, URL: https://www.urdufake2021.cicling.org/home.
[11] S. Gundapu, R. Mamidi, Transformer based automatic COVID-19 fake news detection system, arXiv preprint arXiv:2101.00180 (2021).
[12] J.-P. Posadas-Durán, H. Gómez-Adorno, G. Sidorov, J. J. M. Escobar, Detection of fake news in a new corpus for the Spanish language, Journal of Intelligent & Fuzzy Systems 36 (2019) 4869–4876.
[13] I. Vogel, P. Jiang, Fake news detection with the new German dataset "GermanFakeNC", in: International Conference on Theory and Practice of Digital Libraries, Springer, 2019, pp. 288–295.
[14] M. Amjad, G. Sidorov, A. Zhila, Data augmentation using machine translation for fake news detection in the Urdu language, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 2537–2542.
[15] H. Jwa, D. Oh, K. Park, J. M. Kang, H. Lim, exBAKE: Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT), Applied Sciences 9 (2019) 4062.

A. Online Resources

The implementations of the different pre-trained BERT models are available at:
• Hugging Face (https://huggingface.co/)