<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Chrestotes@HASOC 2020: Bert Fine-tuning for the Identification of Hate Speech and Offensive Language in Indo-European Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tochukwu Ezike</string-name>
          <email>eziket18@gmail.com</email>
          <aff>Xend Tech, Nigeria</aff>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manikandan Sivanesan</string-name>
          <email>msivanes@redhat.com</email>
          <aff>RedHat, Canada</aff>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This article describes the approach of our team, Chrestotes, to HASOC 2020: Hate Speech and Offensive Content Identification in Indo-European Languages. We demonstrate an end-to-end solution for the fine-grained detection of hate speech in tweets. Our solution focuses on the English task, which is split into two subtasks. Our model achieved macro-average F1 scores of 0.4969 and 0.2652 on subtasks A and B respectively, placing us in the middle of the leaderboard for subtask A and in first place for subtask B.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech</kwd>
        <kwd>Offensive language</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Bert</kwd>
        <kwd>Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The wide adoption of the internet has brought a massive increase in the use of online services
such as social media for communication. This mass use has led to a growing volume of hate speech on
social media platforms. Most social media companies have invested considerable research in the
automatic detection and flagging of hateful and derogatory comments [1]. Timely and accurate
detection of hate speech with well-defined algorithms would go a long way toward cleaning up
platforms used for public discourse, and toward preventing the mental and physical effects that
derogatory comments have on their victims.</p>
      <p>HASOC 2020 aims at building efficient and scalable artificial intelligence models that detect
hate speech and offensive language in multilingual settings without human assistance. The challenge is
divided into 2 subtasks across 3 languages: English, German, and Hindi. The datasets for all 3
languages were obtained entirely from Twitter [2].</p>
      <p>We participated in subtasks A and B for the English language. We approached the task using the
transformer-based model BERT [3], which has proven very effective in Natural Language
Processing tasks across multiple languages. We fine-tuned a pretrained BERT model from the
HuggingFace library [4] on the provided English dataset. Our approach placed us in the
middle of the leaderboard in subtask A and at the top of the leaderboard in subtask B.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology and Data</title>
      <sec id="sec-2-1">
        <title>2.1. Data Description</title>
        <p>The English-language challenge is divided into subtasks A and B. Subtask A is a binary
classification task in which the tweets in the English dataset are categorized as
Hate-Offensive (HOF) or Non-Hate-Offensive (NOT). Subtask B, in contrast, is a
fine-grained multiclass classification problem with 4 categories: Hate speech
(HATE), Offensive (OFFN), Profane (PRFN), and Neutral (NONE). Further details about each task can be
found on the competition homepage [2].</p>
        <p>The English training corpus contains 3708 tweets in total. The two categories in subtask A are
almost balanced, with approximately 1850 tweets each. The categories in subtask B, however, are highly
imbalanced. The class distribution for both subtasks is shown in Figure 1 and Figure 2.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Model Description</title>
        <p>We used a pre-trained transformer-based BERT model from the HuggingFace library. A concise
architecture diagram of this model is shown in Figure 3.</p>
        <p>In our experiments, the pre-trained BERT model achieved better scores than other language
models such as ULMFiT [11]. BERT-based models are currently state of the art in various NLP tasks,
including text classification. This is because transformers have higher expressive power and are
better universal approximators of sequence functions [9] than traditional methods. The pre-trained
BERT model was trained on the large English Wikipedia corpus in an unsupervised manner, which
allowed it to learn contextual embedding representations of the words in the corpus. These
contextual representations are the key reason BERT-based models perform considerably better than
context-free representations such as word2vec or GloVe [3]. We used the representations obtained
through pre-training to initialize our weights, which were then fine-tuned on subtasks A and B.
This weight initialization method, called transfer learning, has been shown to train faster and to
perform better on the downstream task than random or universal initialization of the model’s
weights [10].</p>
        <p>Fig. 3. Visual architecture of BERT showing transformer layers for masked sentence prediction</p>
        <p>The output sequences from the BERT model were pooled into a single representation using both
a mean pool and a max pool, so that strong representations were retained from the output
sequences. The 2 pooled outputs were then concatenated and passed into an appropriately sized
linear classifier. Concatenating the pooled outputs improved the overall F1 score of the final
model.</p>
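        <p>The pooling step described above can be illustrated with a minimal, framework-free sketch in plain Python. It is not the authors' implementation: the function name mean_max_pool, the use of a mask to skip padding tokens, and the toy vectors are all assumptions made for illustration.</p>
        <preformat>
```python
from typing import List

Vector = List[float]


def mean_max_pool(token_vectors: List[Vector], mask: List[int]) -> Vector:
    """Pool per-token vectors into one vector: mean part followed by max part."""
    # Keep only real tokens (mask == 1); padding positions are ignored (assumption).
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep == 1]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    hidden_size = len(kept[0])
    mean_part = [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]
    max_part = [max(vec[d] for vec in kept) for d in range(hidden_size)]
    # Concatenation doubles the feature size seen by the linear classifier.
    return mean_part + max_part


# Toy sequence of three 2-dimensional token vectors; the last row is padding.
tokens = [[1.0, -2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_max_pool(tokens, mask=[1, 1, 0])
print(pooled)  # [2.0, 1.0, 3.0, 4.0]
```
        </preformat>
        <p>Because the two pooled vectors are concatenated, a classifier placed on top of this representation would take an input of twice the hidden size.</p>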
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Ensembling</title>
        <p>Because of the heavy class imbalance in the dataset, we used stratified K-fold with 5 folds as
our cross-validation strategy. Stratification ensures that each fold preserves approximately the
same class proportions as the full dataset. The dataset is randomly divided into 5 equal subsets;
4 of these subsets are used for training and the remaining subset serves as the
hold-out/validation set. This process is repeated 5 times, so that each fold is used once for
validation. The strategy helps the model generalize better instead of overfitting to a single
train-validation split. A diagram of this ensembling approach is shown in Figure 4.</p>
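        <p>As a minimal sketch of the splitting idea, assuming plain Python rather than whichever library routine the authors used, the following deals the indices of each class round-robin across 5 folds so that class proportions are roughly preserved. The function names and the toy label counts are hypothetical.</p>
        <preformat>
```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], n_folds: int = 5, seed: int = 42) -> List[List[int]]:
    """Assign sample indices to folds so each fold keeps similar class proportions."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_valid_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, valid_indices), holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield sorted(train), sorted(folds[held_out])


# Invented, imbalanced toy labels (not the real HASOC class counts).
labels = ["NONE"] * 50 + ["PRFN"] * 30 + ["OFFN"] * 10 + ["HATE"] * 10
folds = stratified_folds(labels)
splits = list(train_valid_splits(folds))
print([len(fold) for fold in folds])  # [20, 20, 20, 20, 20]
```
        </preformat>
        <p>Each of the 5 resulting splits trains on 4 folds and validates on the remaining one, so every sample is used for validation exactly once.</p>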
        <p>Finally, each of the 5 models trained on the folds makes predictions on the given test set,
and a simple mean across these 5 sets of predictions is taken as our final prediction. The same
validation strategy was used for both subtasks A and B.</p>
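        <p>The averaging step can be sketched as follows. The sketch assumes that each fold model outputs class probabilities and that the final label is the class with the highest mean probability; the paper does not state whether probabilities or raw scores were averaged, and the numbers below are invented for illustration.</p>
        <preformat>
```python
from typing import List


def average_predictions(fold_probs: List[List[List[float]]]) -> List[List[float]]:
    """Average class probabilities over models; fold_probs[m][i][c] is model m, sample i, class c."""
    n_models = len(fold_probs)
    n_samples = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    return [
        [sum(fold_probs[m][i][c] for m in range(n_models)) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Five hypothetical fold models, two test samples, two classes.
fold_probs = [
    [[0.1, 0.9], [0.7, 0.3]],
    [[0.2, 0.8], [0.6, 0.4]],
    [[0.6, 0.4], [0.8, 0.2]],
    [[0.3, 0.7], [0.4, 0.6]],
    [[0.4, 0.6], [0.9, 0.1]],
]
mean_probs = average_predictions(fold_probs)
predicted = [argmax(row) for row in mean_probs]
print(predicted)  # [1, 0]
```
        </preformat>
        <p>Note that individual models disagree on both samples here; averaging resolves the disagreement in favour of the class with the larger mean probability.</p>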
        <p>Fig. 4. Cross-validation strategy using stratified K-fold across 5 folds.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments and Results</title>
      <sec id="sec-3-1">
        <title>3.1. Training</title>
        <p>We used the “bert-base-uncased” model from the HuggingFace library together with its
corresponding tokenizer, to stay consistent with the tokenization used during pre-training. All
fine-tuned models in the 2 subtasks were trained using the Adam optimizer with weight decay [5].
The maximum sequence length in each batch was set to 72. Categorical cross-entropy was used as the
loss function in subtask A, while label smoothing cross-entropy was used for subtask B. The loss
functions were chosen based on local cross-validation scores on the 2 tasks. Maximum learning
rates of 1e-5 and 1e-4 were used for subtasks A and B respectively. The learning rates were warmed
up from 0 to their maximum values and then decayed from this maximum using a linear schedule.</p>
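        <p>Two pieces of this setup can be written down compactly. The sketch below, in plain Python, shows a learning rate that warms up linearly from 0 and then decays linearly, and a label smoothing cross-entropy for a single example. The warm-up length, the total number of steps, the decay end point of 0, and the smoothing factor epsilon = 0.1 are assumptions; the paper does not report them.</p>
        <preformat>
```python
import math
from typing import List


def linear_warmup_decay(step: int, total_steps: int, warmup_steps: int, max_lr: float) -> float:
    """Learning rate that rises linearly from 0 to max_lr, then decays linearly to 0."""
    if warmup_steps > step:
        return max_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)


def label_smoothing_cross_entropy(logits: List[float], target: int, epsilon: float = 0.1) -> float:
    """Cross-entropy of one example against a smoothed target distribution."""
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    n_classes = len(logits)
    # Smoothed target: (1 - epsilon) on the true class plus epsilon spread uniformly.
    smooth = [
        epsilon / n_classes + ((1.0 - epsilon) if c == target else 0.0)
        for c in range(n_classes)
    ]
    return -sum(q * lp for q, lp in zip(smooth, log_probs))


# Hypothetical run of 100 steps with 10 warm-up steps and the subtask B maximum of 1e-4.
schedule = [
    linear_warmup_decay(s, total_steps=100, warmup_steps=10, max_lr=1e-4)
    for s in (0, 5, 10, 55, 100)
]
# Four logits, one per subtask B class; the values are invented.
loss = label_smoothing_cross_entropy([2.0, 0.5, -1.0, 0.0], target=0)
print(schedule, loss)
```
        </preformat>
        <p>With epsilon set to 0 the second function reduces to ordinary cross-entropy, which makes it easy to compare the two losses on the same logits.</p>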
        <p>Because of the high class imbalance in subtask B, we used an upsampling strategy in which
samples from the minority classes were duplicated until each class matched the size of the
majority class. This was done with the BalanceClassSampler from the Catalyst library [8].</p>
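        <p>The effect of upsampling can be sketched in plain Python as below. This is not the Catalyst BalanceClassSampler API; upsample_indices is a hypothetical helper that only illustrates the idea of duplicating minority-class samples until all classes are the same size, and the toy labels are invented.</p>
        <preformat>
```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def upsample_indices(labels: Sequence[str], seed: int = 42) -> List[int]:
    """Return sample indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep every original sample, then draw duplicates until the class reaches the target size.
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        balanced.extend(indices + extra)
    rng.shuffle(balanced)
    return balanced


# Invented toy labels: one majority class and two minority classes.
labels = ["NONE"] * 6 + ["PRFN"] * 3 + ["HATE"] * 1
balanced = upsample_indices(labels)
counts = Counter(labels[i] for i in balanced)
print(counts)
```
        </preformat>
        <p>In a cross-validation setting, duplication of this kind would normally be applied to the training portion of each split only, so that copies of a tweet do not end up in both the training and the validation fold.</p>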
        <p>Lastly, a seed of 42 was set for reproducibility. All experiments were carried out with PyTorch [6] on
Google Colab, using nbdev [7] to support interactive development.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Result</title>
        <sec id="sec-3-2-6">
          <table-wrap>
            <label>Table 1</label>
            <table>
              <thead>
                <tr><th>Experiment</th><th>Model</th><th>Subtask</th></tr>
              </thead>
              <tbody>
                <tr><td>exp1</td><td>ULMFiT</td><td>A</td></tr>
                <tr><td>exp1</td><td>ULMFiT</td><td>B</td></tr>
                <tr><td>exp2.1</td><td>BERT + mean pool</td><td>A</td></tr>
                <tr><td>exp2.1</td><td>BERT + mean pool</td><td>B</td></tr>
                <tr><td>exp2.2</td><td>BERT + concat pool</td><td>A</td></tr>
                <tr><td>exp2.2</td><td>BERT + concat pool</td><td>B</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>According to the competition rules, the models for the 2 subtasks were evaluated on the validation
set using the macro-F1 score. The competition organizers used the same metric on the test set to
rank models on the leaderboard. Experiments were conducted using BERT and ULMFiT. In Table 1, exp1
shows the results obtained with ULMFiT on subtasks A and B. exp2.1 shows the results of the BERT
model using only a mean pool, while exp2.2 shows the results of the BERT model using a
concatenation of a mean and a max pool along the output sequences.</p>
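          <p>For reference, the macro-F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. A minimal plain-Python sketch, with a hypothetical function name and invented toy labels, is:</p>
          <preformat>
```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["HOF", "HOF", "NOT", "NOT"]
y_pred = ["HOF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7333
```
          </preformat>
          <p>In this toy case the HOF class has F1 = 2/3 and the NOT class has F1 = 4/5, giving a macro average of about 0.7333. On an imbalanced task such as subtask B, poor performance on a rare class lowers this score as much as poor performance on a frequent one.</p>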
          <p>Table 2. Performance of final models on the leaderboard for subtasks A and B</p>
        </sec>
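        <sec id="sec-3-2-7">
          <table-wrap>
            <table>
              <thead>
                <tr><th>Subtask</th><th>Model Name</th><th>LB macro F1-score</th><th>Highest LB macro F1-score</th></tr>
              </thead>
              <tbody>
                <tr><td>A</td><td>BERT + concat pool</td><td>0.4969</td><td /></tr>
                <tr><td>B</td><td>BERT + concat pool</td><td>0.2652</td><td /></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>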
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we presented our solution to the HASOC 2020 competition, which placed in the
middle and at the top of the leaderboard for the English subtasks A and B respectively. Our solution
involves fine-tuning a BERT model on the training dataset. We present a robust end-to-end pipeline for
reproducing our solution. All our code has been open-sourced in this repository:
https://github.com/tezike/Hasoc.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name><given-names>E.</given-names> <surname>Whittaker</surname></string-name>,
          <string-name><given-names>R.M.</given-names> <surname>Kowalski</surname></string-name>.
          <year>2015</year>.
          <article-title>Cyberbullying via social media</article-title>.
          <source>Journal of School Violence</source>,
          <volume>14</volume>(<issue>1</issue>):<fpage>11</fpage>-<lpage>29</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Mandl</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Modha</surname></string-name>,
          <string-name><given-names>G. K.</given-names> <surname>Shahi</surname></string-name>,
          <string-name><given-names>A. K.</given-names> <surname>Jaiswal</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Nandini</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Patel</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Majumder</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schäfer</surname></string-name>,
          <article-title>Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages</article-title>,
          in: <source>Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation</source>,
          CEUR, <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>:
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>.
          In: <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>,
          Volume <volume>1</volume> (Long and Short Papers). pp. <fpage>4171</fpage>-<lpage>4186</lpage>.
          Association for Computational Linguistics, Minneapolis, Minnesota (Jun <year>2019</year>).
          https://doi.org/10.18653/v1/N19-1423
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Wolf</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Debut</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Sanh</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Chaumond</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Delangue</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Moi</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Cistac</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Rault</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Louf</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Funtowicz</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Brew</surname></string-name>.
          <year>2019</year>.
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>.
          arXiv preprint arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><given-names>I.</given-names> <surname>Loshchilov</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Hutter</surname></string-name>.
          <year>2017</year>.
          <article-title>Decoupled weight decay regularization</article-title>.
          arXiv preprint arXiv:1711.05101.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name><given-names>A.</given-names> <surname>Paszke</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gross</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Massa</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lerer</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Bradbury</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Chanan</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Desmaison</surname></string-name>
          (<year>2019</year>).
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>.
          In <source>Advances in Neural Information Processing Systems</source>
          (pp. <fpage>8026</fpage>-<lpage>8037</lpage>).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Howard</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gugger</surname></string-name>
          (2019, December 6).
          <article-title>nbdev: use Jupyter Notebooks for everything</article-title>.
          https://www.fast.ai/2019/12/02/nbdev/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name><given-names>S.</given-names> <surname>Kolesnikov</surname></string-name>
          (<year>2018</year>).
          <article-title>Catalyst: Accelerated deep learning R&amp;D</article-title>.
          https://github.com/catalyst-team/catalyst
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><given-names>C.</given-names> <surname>Yun</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bhojanapalli</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rawat</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Reddi</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kumar</surname></string-name>
          (<year>2020</year>).
          <article-title>Are Transformers universal approximators of sequence-to-sequence functions?</article-title>
          ArXiv, abs/1912.10077.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name><surname>Tan</surname>, <given-names>Chuanqi</given-names></string-name>, et al.
          <article-title>A Survey on Deep Transfer Learning</article-title>.
          ArXiv:1808.01974 [Cs, Stat], Aug. <year>2018</year>.
          http://arxiv.org/abs/1808.01974
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Howard</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Ruder</surname></string-name>.
          <article-title>Universal Language Model Fine-Tuning for Text Classification</article-title>.
          ArXiv:1801.06146 [Cs, Stat], May <year>2018</year>.
          http://arxiv.org/abs/1801.06146
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>