Impact of Data Augmentation on Hate Speech Detection in Roman Urdu
Fariha Maqbool1,*, Blerina Spahiu1 and Andrea Maurino1
1 Dipartimento di informatica, sistemistica e comunicazione, University of Milano-Bicocca,
Viale Sarca 336, 20126 Milan, Italy


Abstract
The prevalence of hate speech leads to an increase in hate crimes and online violence, and causes
serious harm to social safety, physical security, and cyberspace. Several studies have addressed
hate speech detection in European languages, whereas little attention has been paid to low-resource
South Asian languages, leaving millions of social media users vulnerable. Because the available
datasets are scarce and small, strategies are needed to increase the number of data samples.
In this paper, we improve the performance of an already fine-tuned m-BERT model by applying data
augmentation techniques to a dataset of hate speech tweets in Roman Urdu. We use the F1-score and
accuracy metrics to compare the results. We also experiment to determine the optimal percentage of
augmented data to include and the percentage of words to augment in each instance. The new
RUHSOLD++ dataset containing the augmented data has been made publicly available. The improvement
in hate speech detection shows that model performance can be improved by applying data
augmentation techniques to a dataset with a limited number of instances.


                                1. Introduction
The exponential growth of social media platforms like Facebook1, X (formerly Twitter)2, and
YouTube3 has provided a global stage for individuals from diverse cultures and social
backgrounds to communicate and share their opinions on a myriad of topics. While these platforms
uphold the principle of freedom of speech, some users exploit this freedom to abuse other users
on the basis of gender, religion, or race. This surge in harmful content has
underscored the need for increased research in natural language processing (NLP) to effectively
detect instances of hate speech. The consequences faced by victims of targeted hate speech
are not limited to physical harm; they also experience a profound sense of dread and rejection
within their communities. Recognizing the urgency to create online spaces free of racism and
hate speech, researchers emphasize the importance of early detection mechanisms to mitigate
the pervasive harm caused by such content [1]. This challenge extends beyond the English
language, as millions of users worldwide employ diverse languages as vehicles for spreading
hate.

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
f.maqbool@campus.unimib.it (F. Maqbool); blerina.spahiu@unimib.it (B. Spahiu); andrea.maurino@unimib.it (A. Maurino)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://www.facebook.com/
2 https://x.com/
3 https://www.youtube.com/
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

   Despite extensive research in English, there exists a noticeable dearth of datasets and
studies on languages like Urdu. Urdu, spoken by over 170 million people globally, faces unique
challenges in written communication because its alphabet cannot be easily mapped onto an
English keyboard: the Urdu script comprises 40 characters, whereas English keyboards accommodate
only 26 letters. Consequently, most Urdu speakers, particularly on social media platforms, write
in Roman Urdu, a transliteration of Urdu using English letters. The use of Roman Urdu has
expanded sharply with the rising adoption of social media [2, 3]. Users regularly turn to these
platforms to share their opinions about products, services, politics, and other topics. However,
despite its widespread use, Roman Urdu (RU) suffers from a lack of linguistic resources, annotated
datasets, and dedicated language models [4]. To address these limitations and enhance model
performance, researchers have explored data augmentation techniques [3]. In a notable
application, simple data augmentation techniques were employed on a low-resourced language dataset
with a limited number of samples. The results demonstrated a significant improvement in model
performance, underscoring the potential of augmentation in mitigating the challenges posed by
scarce linguistic resources. Additionally, experiments aimed to identify the optimal percentage
of augmented data to be integrated with the original dataset, boosting model performance
while minimizing training time. This multifaceted approach contributes to ongoing efforts to
combat hate speech across various languages, fostering inclusivity and positive discourse in
online spaces.
   In this paper we make the following contributions: (i) we enrich the RUHSOLD dataset [3] and
create a new RUHSOLD++ dataset; (ii) we provide a custom Python function that dynamically alters
the spelling of selected words within a sentence; and (iii) we provide an empirical analysis of
the different data augmentation methods.
   The paper is structured as follows: Section 2 discusses approaches to detect hate speech in
the Roman Urdu language. Section 3 presents the methodology used to augment our initial dataset.
Section 4 reports the analysis and findings from applying the different data augmentation
methods, and Section 5 concludes the paper and outlines future work.


2. Related Work
The issue of abusive speech has been a longstanding focus within the research community.
In earlier investigations, attempts to identify abusive users primarily relied on lexical and
syntactic features extracted from their posts [5]. However, recent advancements in automated
hate speech detection have witnessed substantial growth. The availability of large datasets
has prompted a shift in academic research towards more data-intensive and sophisticated
models, notably leveraging deep learning techniques [6] and graph embedding methods [7].
Notably, transformer-based language models like BERT[8] have gained immense popularity in
various downstream tasks, proving to be particularly effective in surpassing traditional deep
learning models such as CNN-GRU[9], LSTM[10], etc., for the detection of abusive language [11],
[12]. This evolution highlights the dynamic nature of research in addressing the complexities
of abusive speech detection. Detecting hate and abusive speech in low-resourced languages
presents a formidable challenge, as exemplified in the context of Roman Urdu. The scarcity of
linguistic resources for Roman Urdu spurred the creation of the RUHSOLD dataset by H. Rizwan
et al. [3]. Comprising 10,012 tweets, this dataset stands out for its dual approach, offering
both coarse-grained and fine-grained labeling of hate speech instances. In their research,
Rizwan and colleagues not only curated this valuable dataset but also proposed a deep learning-
based architecture specifically tailored for hate speech detection in Roman Urdu. Addressing
the broader landscape of multilingual abusive speech, M. Das et al. [13] conducted an in-
depth investigation into the performance of multilingual models across eight distinct Indic
languages. In a noteworthy application, they employed m-BERT[8] and MuRIL[14] models on
the RUHSOLD dataset [3] to gauge their efficacy in detecting abusive speech. Through a series of
meticulously designed experiments, encompassing various settings, Das and team explored the
nuances of multilingual hate speech detection. Their findings underscored the effectiveness of
model transfers, revealing that transferring knowledge from one language to another enhances
the overall performance of the models. This body of research not only contributes to the
evolving field of hate speech detection but also illuminates the specific challenges associated
with low-resourced languages like Roman Urdu. By providing a robust dataset and proposing
dedicated architectures, these studies lay essential groundwork for future endeavors aimed at
combating hate speech across diverse linguistic landscapes. The insights gained from these
investigations, especially regarding the transferability of models, offer valuable guidance for
the development of more inclusive and effective hate speech detection systems in multilingual
contexts. M. M. Khan et al. [15] studied hate speech detection in Roman Urdu, manually
examining over 90,000 tweets to curate a corpus of 5,000 Roman Urdu tweets. Beyond creating the
dataset, they applied five supervised learning approaches, including a deep learning technique,
and compared their effectiveness in hate speech detection. Their results showed that, across two
levels of categorization, logistic regression outperformed all other techniques, opening up a
viable path for robust hate speech detection in Roman Urdu. Recognizing the challenges posed
by the low resources of Roman Urdu, Azam et al. [16] undertook a proactive exploration of
data augmentation strategies. Leveraging both easy data augmentation and transformer-based
augmentation approaches, they aimed to enhance hate speech detection capabilities in Roman
Urdu. The researchers conducted experiments using existing datasets in Roman Urdu and
baseline models to meticulously assess the impact of augmentation techniques. Their findings
unequivocally demonstrated that the performance of hate speech detection models could indeed
be significantly improved by the strategic application of augmentation techniques to the dataset.
This research not only contributes to the optimization of hate speech detection in low-resourced
languages like Roman Urdu but also highlights the potential of augmentation strategies as
valuable tools in mitigating the impact of resource constraints, providing valuable insights for
the ongoing evolution of hate speech detection methodologies.


3. Methodology
3.1. Dataset
We employed the RUHSOLD dataset, a collection of tweets in Roman Urdu created by H. Rizwan
et al. [3]. The authors established a gold standard for two distinct sub-tasks. We focus on the
first sub-task, which uses binary labels to categorize content as either Hate-Offensive (labeled
as 0) or Normal (labeled as 1), the latter representing inoffensive language. The dataset
comprises a total of 10,000 tweets, partitioned into training, testing, and validation sets in a
ratio of 70%, 20%, and 10%, respectively.
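
The arithmetic of the 70/20/10 partition can be illustrated with a short Python sketch. RUHSOLD is distributed with its own partition, so this is not the procedure of the dataset authors; the function name `split_dataset`, the random shuffling, and the fixed seed are assumptions made only for illustration.

```python
import random

def split_dataset(rows, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle rows and cut them into (train, test, validation) parts.

    Hypothetical helper: only the 70/20/10 ratio comes from the text;
    the shuffling strategy is an assumption.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_test = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, validation

train, test, validation = split_dataset(range(10000))
print(len(train), len(test), len(validation))  # 7000 2000 1000
```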

3.2. Data Augmentation
To increase the size of our dataset and improve overall performance, we applied noise-based
data augmentation techniques to the training data. We experimented with different percentages
of the dataset to augment. We also varied the percentage of words in each tweet that underwent
augmentation. The goal was not only to expand the dataset but also to control the strength of
the augmentation, balancing the quantity of added data against its quality. Through these
experiments, we aimed to identify the configurations that most improve the performance of our
model.

3.2.1. Random Swapping
Random swapping is a noise-based data augmentation technique that randomly swaps words or
tokens within a tweet. For example: "hum kisi se km nhi" becomes "km kisi se hum nhi".
   This operation adds variance to the dataset without changing the overall sentiment or
context of the text. Exposing the model to various word configurations is intended to help it
generalize and perform well on a variety of linguistic patterns. The percentage of words
augmented in random swapping directly controls the degree of variability injected into the
dataset. Therefore, it is essential to find the optimal percentage of words to swap during
augmentation.
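
The operation can be sketched in a few lines of Python. This is a minimal standard-library illustration, not the NLPAug implementation used in the experiments; the function name `random_swap`, its parameters, and the rule for converting the swap rate into a number of swaps are assumptions.

```python
import random

def random_swap(text, swap_rate=0.3, seed=None):
    """Swap randomly chosen pairs of words in a whitespace-tokenized text.

    Hypothetical helper. Each swap moves two words, so the number of
    swaps is about half of swap_rate * number_of_words (at least one).
    """
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    n_swaps = max(1, round(len(words) * swap_rate / 2))
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(random_swap("hum kisi se km nhi", swap_rate=0.3, seed=0))
```

The output always contains the same words as the input; only their order changes, which is why the label of the tweet can be reused for the augmented copy.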

3.2.2. Random Deletion
Another noise-based data augmentation technique is Random Deletion. This strategy randomly
removes words or tokens from a given sentence, introducing variability into the data. We
designate a specific percentage of the words in the sentence for removal, aiming to balance the
noise introduced for robustness against the coherence of the text. Training on sentences with
missing words is intended to make the model more adaptable to diverse and incomplete textual
inputs.
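
A minimal Python sketch of random deletion follows. It is a standard-library illustration rather than the NLPAug routine used in the experiments; the name `random_delete`, its parameters, and the choice to always keep at least one word are assumptions.

```python
import random

def random_delete(text, delete_rate=0.2, seed=None):
    """Remove a fraction of randomly chosen words, keeping word order.

    Hypothetical helper: at least one word is deleted and at least one
    word is always kept, so the output is never empty.
    """
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text  # too short to delete from
    n_delete = min(len(words) - 1, max(1, round(len(words) * delete_rate)))
    drop = set(rng.sample(range(len(words)), n_delete))
    return " ".join(w for i, w in enumerate(words) if i not in drop)

print(random_delete("hum kisi se km nhi", delete_rate=0.4, seed=1))
```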
3.2.3. Spelling Augmentation
Spelling Augmentation alters the spelling of words within a sentence. It replaces one or more
characters in a word with randomly chosen alternatives, thereby deliberately introducing a
controlled amount of noise into the data. The purpose is to diversify the linguistic patterns
in the dataset, improving the model's ability to handle variations in spelling and its
resilience against inconsistencies in user-generated content. For example: "chal ja tujhy maaf
kia" becomes "chal aa tujha maaf kia".
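
Character substitution of this kind can be sketched as follows. The custom function used in the paper is not reproduced in the text, so this is only an illustrative stand-in: the name `spelling_augment`, the parameters `word_rate` and `chars_per_word`, and the restriction to lowercase ASCII replacement letters are all assumptions.

```python
import random
import string

def spelling_augment(text, word_rate=0.3, chars_per_word=1, seed=None):
    """Replace characters in a fraction of the words with random letters.

    Hypothetical helper: word_rate selects how many words are altered,
    chars_per_word how many characters are replaced in each of them.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    n_words = max(1, round(len(words) * word_rate))
    for idx in rng.sample(range(len(words)), n_words):
        chars = list(words[idx])
        n_chars = min(chars_per_word, len(chars))
        for pos in rng.sample(range(len(chars)), n_chars):
            # Exclude the current letter so the word really changes.
            choices = [c for c in string.ascii_lowercase if c != chars[pos].lower()]
            chars[pos] = rng.choice(choices)
        words[idx] = "".join(chars)
    return " ".join(words)

print(spelling_augment("chal ja tujhy maaf kia", word_rate=0.5, seed=3))
```

Word count and word lengths are preserved, so raising `word_rate` increases the noise without shortening the tweet.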

3.3. The new RUHSOLD++ Dataset
After applying the data augmentation techniques, we created a new RUHSOLD++ dataset4,
which is publicly accessible to promote future work. The dataset consists of three variants,
produced with swap, delete, and spelling augmentation, respectively. The augmented data is
distributed uniformly across the training and validation sets, while the test data is unaltered.

3.4. Model
The m-BERT model has attracted significant attention in abusive speech research. It is
pre-trained on Wikipedia5 content in 104 languages using a masked language modeling (MLM) [17]
objective. The model consists of 12 transformer encoder layers, which use a self-attention
mechanism to process contextual information. m-BERT accepts at most 512 tokens per input. We
use the fine-tuned variant introduced by M. Das et al. [7], who extended the original m-BERT
with a fully connected layer applied to the output corresponding to the CLS (classification)
token of the input. The CLS token output serves as the model's representation of the whole
input sentence, and the added layer maps this representation to the class prediction. This
modification lets the model capture and interpret contextual nuances within the given token
limit, contributing to its efficacy in classifying abusive speech patterns.
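
The classification head described above can be summarized in a single expression. This is a sketch under stated assumptions: the symbols h_[CLS], W, and b are introduced here for illustration, d = 768 is the hidden size of the base m-BERT configuration, and the softmax over the two classes is assumed rather than stated in the text.

```latex
\mathbf{h}_{\mathrm{[CLS]}} = \mathrm{mBERT}(x_1, \dots, x_n)_{0} \in \mathbb{R}^{d},
\qquad
\hat{\mathbf{y}} = \mathrm{softmax}\!\left(W\,\mathbf{h}_{\mathrm{[CLS]}} + \mathbf{b}\right),
\qquad
W \in \mathbb{R}^{2 \times d},\ \mathbf{b} \in \mathbb{R}^{2},\ n \le 512
```

Here x_1, ..., x_n are the input tokens (with the CLS token at position 0), and the two entries of the prediction correspond to the Hate-Offensive and Normal classes.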


4. Experimentation and Results
In this section, we describe the experiments conducted on the RUHSOLD [3] dataset using the
fine-tuned m-BERT model, a refinement proposed by M. Das et al. [7]. In our implementation,
we use the PyTorch library6 in Python, training each model for 10 epochs with the
Adam optimizer and a batch size of 16.


4 https://github.com/fariha231/impact-of-augmentation-ruhsoldplusplus
5 https://www.wikipedia.org/
6 https://pytorch.org/
   The experiments also explored data augmentation, aiming to balance computational efficiency
and accuracy. We first determined the most suitable percentage of the dataset to augment. To
this end, we selected a portion of the original training dataset and applied Random Swap
augmentation to it. Using the NLPAug7 library in Python, we set the word swap rate to 30%,
meaning that 30% of the words in a tweet are swapped with each other. The augmented data was
then merged with the original dataset and shuffled to mitigate overfitting. Table 1 presents
the results of Random Swap at varying percentages of augmented data.
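
The procedure of augmenting a fraction of the training set, merging it with the original data, and shuffling can be sketched in Python as follows. The helper `mix_augmented`, its parameters, and the stand-in augmenter in the usage lines are hypothetical; in the experiments the augmenter is the NLPAug random swap.

```python
import random

def mix_augmented(train_rows, augment_fn, data_fraction=0.5, seed=0):
    """Augment a random fraction of (text, label) rows and merge them back.

    Hypothetical helper: the augmented copy keeps the label of its
    source row, and the combined list is shuffled before training.
    """
    rng = random.Random(seed)
    n_aug = round(len(train_rows) * data_fraction)
    chosen = rng.sample(train_rows, n_aug)  # rows to augment, without repeats
    augmented = [(augment_fn(text), label) for text, label in chosen]
    combined = list(train_rows) + augmented
    rng.shuffle(combined)
    return combined

rows = [("hum kisi se km nhi", 1), ("chal ja tujhy maaf kia", 1),
        ("text three", 0), ("text four", 0)]
# Stand-in augmenter for illustration: reverse the word order.
reverse_words = lambda t: " ".join(reversed(t.split()))
mixed = mix_augmented(rows, reverse_words, data_fraction=0.5)
print(len(mixed))  # 6
```

With `data_fraction=0.5` the training set grows by half, which corresponds to the 50% row of Table 1.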

Table 1
Random Swap with Varying Overall Augmented Data


             % of original data augmented    Precision   Recall   mF1-score   Accuracy
                               0               0.873     0.876      0.863        0.875
                              20               0.910     0.907      0.902        0.903
                              30               0.913     0.917      0.909        0.909
                              50               0.914     0.925      0.913       0.913
                              60               0.927     0.891      0.904        0.905
                              80               0.922     0.893      0.902        0.903
                              100              0.917     0.883      0.898        0.898


   We employed the Macro F1 score (mF1-score) as a performance metric, along with other
evaluation measures. The Macro F1 score allows us to assess the performance on each class
individually while giving equal weight to all classes. Examining the results, we observe that
both accuracy and mF1-score improve when the model is trained on augmented data. The highest
accuracy is attained when augmentation is applied to 50% of the data. Beyond this threshold,
further increasing the amount of augmented data leads to a decline in accuracy and mF1-score.
This suggests that the model tends to overfit the training data when given an excessive amount
of augmented data. While accuracy on the validation data may still look promising,
the model's performance on unseen data, specifically the test data, begins to decrease.
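
For reference, the Macro F1 score can be computed from true and predicted labels as in the sketch below. It uses only the standard library and toy labels; the function name `macro_f1` and the example data are illustrative and are not the evaluation code or results of the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("no labels given")
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)  # every class weighs the same

# Toy example: 0 = Hate-Offensive, 1 = Normal.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
print(round(macro_f1(y_true, y_pred), 3))  # 0.733
```

In the toy example the per-class F1 scores are 2/3 for class 0 and 0.8 for class 1, so the smaller class influences the result as much as the larger one.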
   With the optimal augmented data percentage identified, we next varied the percentage of
words swapped in each instance. The results in Table 2 indicate that overall accuracy and
mF1-score change only minimally with the word swap rate, whereas precision and recall do vary
with it. This observation underscores the importance of tuning not only the quantity of
augmented data but also the specific parameters of the augmentation.

7 https://pypi.org/project/nlpaug/0.0.5/

Table 2
Random Swap with Varying Word Swap Rate


       Words swapped per instance (%)       Precision   Recall    mF1-score     Accuracy
                       20                     0.909      0.916       0.906        0.907
                       30                     0.914      0.925       0.913        0.913
                       40                     0.929      0.907       0.913        0.913
                       50                     0.926      0.910       0.913        0.913


   Subsequently, we applied Delete Data Augmentation to 50% of the original training
dataset. Using the NLPAug library in Python, we generated new data by removing randomly
selected words in each row. The augmented data was merged with the original data, enlarging
the training set by 50%. We then varied the percentage of words deleted in each instance.
The results of this experiment are presented in Table 3. They show that the word deletion rate
has only a small effect on accuracy and mF1-score.

Table 3
Random Delete with Varying Word Deletion Rate


        Words deleted per instance (%)     Precision    Recall   mF1-score     Accuracy
                       20                     0.92       0.88        0.89         0.90
                       30                     0.92       0.88       0.895        0.895
                       40                    0.916      0.894       0.899         0.90


   To implement spelling augmentation, we wrote a custom function in Python that dynamically
alters the spelling of selected words within each sentence. The function allows the percentage
of words augmented in each row to be adjusted. The results of this experiment are presented in
Table 4. They show an improvement in accuracy as the percentage of augmented words in each row
is raised. However, we refrained from increasing the percentage further to avoid distorting
the meaning of the sentence. Beyond a certain threshold, excessive alterations could compromise
the contextual integrity of the sentence and undermine the model's overall performance.

Table 4
Spelling Augmentation with Varying Word Augmentation Rate


      Words augmented per instance (%)       Precision    Recall   mF1-score     Accuracy
                       30                       0.921      0.885      0.898        0.898
                       50                       0.928      0.898      0.908        0.908


   After conducting this series of experiments, we found that the highest accuracy was
achieved with swap data augmentation. The best model was trained with an additional 50% of
data, in which 30% of the words in each row were swapped. This configuration offered the best
balance between data enrichment and model performance.
   To visualize the model's performance, Figure 1 presents the confusion matrix for this best
model, showing how instances of each class were classified when swap data augmentation was
applied. These results confirm both the efficacy of swap data augmentation and the importance
of choosing its configuration carefully to reach the model's peak performance.


Figure 1: Confusion Matrix for best model with swap data augmentation
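
A confusion matrix such as the one in Figure 1 can be derived from true and predicted labels as sketched below. The labels in the example are toy data, not the values shown in the figure, and the function name `confusion_matrix` is illustrative.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Toy example: 0 = Hate-Offensive, 1 = Normal.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred)
print(matrix)  # [[2, 1], [1, 4]]

# Accuracy is the share of instances on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(y_true)
print(accuracy)  # 0.75
```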


5. Conclusion and Future Work
Recognizing the pressing need to combat hate speech under limited resources, this paper
applies data augmentation techniques to a social media dataset in a low-resource language.
We show that applying data augmentation to the dataset increases its size and improves overall
model performance. We experimented to determine the ideal percentage of augmented data to
combine with the original dataset, aiming both to keep model training efficient and to avoid
overfitting. Our experiments applied three data augmentation techniques: random swap, random
deletion, and spelling augmentation. Swap data augmentation achieved the highest accuracy,
91.3%, with a 30% word augmentation rate and augmented data amounting to 50% of the original
training data. We also published the new RUHSOLD++ dataset containing the augmented data. In
future work, we plan to explore additional augmentation techniques and to compare the
resulting model performance comprehensively. This will add more tools to the ongoing fight
against hate speech on social media, especially for under-resourced languages such as Roman
Urdu.


References
 [1] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM Comput.
     Surv. 51 (2018) 85:1–85:30. URL: https://doi.org/10.1145/3232676. doi:10.1145/3232676.
 [2] M. M. Khan, K. Shahzad, M. K. Malik, Hate speech detection in roman urdu, ACM Trans.
     Asian Low Resour. Lang. Inf. Process. 20 (2021) 9:1–9:19. URL: https://doi.org/10.1145/
     3414524. doi:10.1145/3414524.
 [3] H. Rizwan, M. H. Shakeel, A. Karim, Hate-speech and offensive language detection in
     roman urdu, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference
     on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November
     16-20, 2020, Association for Computational Linguistics, 2020, pp. 2512–2522. URL: https://
     doi.org/10.18653/v1/2020.emnlp-main.197. doi:10.18653/V1/2020.EMNLP-MAIN.197.
 [4] A. Dewani, M. Memon, S. Bhatti, Development of computational linguistic resources for
     automated detection of textual cyberbullying threats in roman urdu language, 3C TIC:
     Cuadernos de desarrollo aplicados a las TIC 10 (2021) 101–121. doi:10.17993/3ctic.
     2021.102.101-121.
 [5] Y. Chen, Y. Zhou, S. Zhu, H. Xu, Detecting offensive language in social media to
     protect adolescent online safety, in: 2012 International Conference on Privacy,
     Security, Risk and Trust, PASSAT 2012, and 2012 International Conference on Social
     Computing, SocialCom 2012, Amsterdam, Netherlands, September 3-5, 2012, IEEE
     Computer Society, 2012, pp. 71–80. URL: https://doi.org/10.1109/SocialCom-PASSAT.2012.55.
     doi:10.1109/SOCIALCOM-PASSAT.2012.55.
 [6] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in
     tweets, in: R. Barrett, R. Cummings, E. Agichtein, E. Gabrilovich (Eds.), Proceedings
     of the 26th International Conference on World Wide Web Companion, Perth, Australia,
     April 3-7, 2017, ACM, 2017, pp. 759–760. URL: https://doi.org/10.1145/3041021.3054223.
     doi:10.1145/3041021.3054223.
 [7] M. Das, P. Saha, R. Dutt, P. Goyal, A. Mukherjee, B. Mathew, You too brutus! trapping
     hateful users in social media: Challenges, solutions & insights, in: O. Conlan, E. Herder
     (Eds.), HT ’21: 32nd ACM Conference on Hypertext and Social Media, Virtual Event,
     Ireland, 30 August 2021 - 2 September 2021, ACM, 2021, pp. 79–89. URL: https://doi.org/10.
     1145/3465336.3475106. doi:10.1145/3465336.3475106.
 [8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
     transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.),
     Proceedings of the 2019 NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume
     1, Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.
     18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
 [9] Z. Zhang, D. Robinson, J. A. Tepper, Detecting hate speech on twitter using a
     convolution-gru based deep neural network, in: A. Gangemi, R. Navigli, M. Vidal,
     P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam (Eds.), The Semantic Web - 15th
     International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018,
     Proceedings, volume 10843 of Lecture Notes in Computer Science, Springer, 2018,
     pp. 745–760. URL: https://doi.org/10.1007/978-3-319-93417-4_48.
     doi:10.1007/978-3-319-93417-4_48.
[10] G. V. Houdt, C. Mosquera, G. Nápoles, A review on the long short-term memory model,
     Artif. Intell. Rev. 53 (2020) 5929–5955. URL: https://doi.org/10.1007/s10462-020-09838-1.
     doi:10.1007/S10462-020-09838-1.
[11] H. S. Alatawi, A. Alhothali, K. Moria, Detection of hate speech using BERT and hate
     speech word embedding with deep model, CoRR abs/2111.01515 (2021). URL: https:
     //arxiv.org/abs/2111.01515. arXiv:2111.01515.
[12] M. Bilal, A. Khan, S. Jan, S. Musa, S. Ali, Roman urdu hate speech detection using
     transformer-based model for cyber security applications, Sensors 23 (2023) 3909. URL:
     https://doi.org/10.3390/s23083909. doi:10.3390/S23083909.
[13] M. Das, S. Banerjee, A. Mukherjee, Data bootstrapping approaches to improve low resource
     abusive language detection for indic languages, in: A. Bellogín, L. Boratto, F. Cena (Eds.),
     HT ’22: 33rd ACM Conference on Hypertext and Social Media, Barcelona, Spain, 28 June
     2022- 1 July 2022, ACM, 2022, pp. 32–42. URL: https://doi.org/10.1145/3511095.3531277.
     doi:10.1145/3511095.3531277.
[14] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal,
     R. T. Nagipogu, S. Dave, S. Gupta, S. C. B. Gali, V. Subramanian, P. P. Talukdar, Muril:
     Multilingual representations for indian languages, CoRR abs/2103.10730 (2021). URL:
     https://arxiv.org/abs/2103.10730. arXiv:2103.10730.
[15] M. M. Khan, K. Shahzad, M. K. Malik, Hate speech detection in roman urdu, ACM Trans.
     Asian Low Resour. Lang. Inf. Process. 20 (2021) 9:1–9:19. URL: https://doi.org/10.1145/
     3414524. doi:10.1145/3414524.
[16] U. Azam, H. Rizwan, A. Karim, Exploring data augmentation strategies for hate speech
     detection in roman urdu, in: Proceedings of the Thirteenth Language Resources and
     Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, European Language
     Resources Association, 2022, pp. 4523–4531. URL: https://aclanthology.org/2022.lrec-1.481.
[17] J. Salazar, D. Liang, T. Q. Nguyen, K. Kirchhoff, Masked language model scoring, in:
     D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual
     Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-
     10, 2020, Association for Computational Linguistics, 2020, pp. 2699–2712. URL: https:
     //doi.org/10.18653/v1/2020.acl-main.240. doi:10.18653/V1/2020.ACL-MAIN.240.