CTC: COVID-19 Tweet Classification using CT-BERT
Shivangi Bithel
Indian Institute of Technology, Delhi, Hauz Khas, New Delhi, Delhi 110016


Abstract
CTC is my submission to the Information Retrieval from Microblogs during Disasters (IRMiDis)
Track at the Forum for Information Retrieval Evaluation (FIRE) 2022. Coronavirus disease (COVID-19) is
an infectious disease caused by the SARS-CoV-2 virus. Most people infected with the virus experience a
mild to moderate respiratory illness and recover without requiring special treatment. However, some
become seriously ill and require medical attention. Vaccines against the coronavirus and prompt reporting
of symptoms saved many lives during the pandemic. The analysis of COVID-19-related tweets can provide
valuable insights into people’s stance toward the new vaccines. It can also help the authorities
plan their strategies based on people’s opinions about the vaccines and ensure the effectiveness of
vaccination campaigns. Tweets describing symptoms can also aid in identifying high-alert zones and
determining quarantine regulations. The IRMiDis track focuses on these COVID-19-related tweets that
flooded Twitter. I developed an effective classifier for each of Tasks 1 and 2. The evaluation scores of my
submitted runs are reported in terms of accuracy and macro-F1 score. I achieved an accuracy of 0.770 and a
macro-F1 score of 0.773 in Task 1, and an accuracy of 0.820 and a macro-F1 score of 0.746 in Task 2. My
submission ranked first among all submissions in both tasks.

Keywords
Sentiment Analysis, COVID-19 Tweets, COVID-Twitter-BERT, Tweet classification


1. Introduction
During the COVID-19 pandemic, the world faced one of its most difficult struggles. The disease
was new to everyone, and its exact symptoms were hard to determine: every time a
new variant was identified, it was accompanied by new symptoms. Because its symptoms resemble
those of the common cold and influenza, this fatal virus was often
misdiagnosed as a cold or the flu. Through social media, many people told their friends and
family about their own symptoms or the symptoms of people close to them. People also
tweeted publicly about celebrities and their symptoms. By promptly
identifying individuals with COVID-19 symptoms, it is possible to offer them appropriate
treatment and prevent the disease’s spread.
   As the pandemic spread through human contact, it became more difficult to contain. Numerous
preventative measures, such as wearing masks and observing a 14-day quarantine,
helped to control the spread. There was a rush to develop a vaccine capable of producing
the necessary antibodies. The coronavirus vaccine was the only method left that could help
in combating and eradicating the illness by immunising individuals against the virus.
When the vaccine finally arrived, people began using social media platforms such as Twitter to
debate it as it was being distributed throughout the world. People held both
favourable and unfavourable opinions on the ongoing issues of vaccine development, accessibility,
effectiveness, and side effects. Governments and health organisations such as the WHO
would benefit from knowing what people think of the new COVID-19 vaccines. They could
use the insights gleaned from these micro-blogs to develop future initiatives and urge everyone
to be fully vaccinated.
   Classifying tweets manually is laborious and error-prone. Therefore, there was an urgent
need to build machine learning algorithms that can assist in categorising tweets concerning
COVID-19 vaccines and in detecting tweets that report individuals with COVID-19 symptoms.
In this paper I present an effective 3-class classifier for COVID-19-related vaccine
tweets and a 4-class classifier for tweets reporting COVID-19 symptoms.

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
Email: csy207657@cse.iitd.ac.in (S. Bithel)
URL: https://shivangibithel.github.io/ (S. Bithel)
ORCID: 0000-0002-6152-4866 (S. Bithel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073


2. Task Definition
This paper presents my approach to two tasks organized as part of the IRMiDis (Information
Retrieval from Microblogs during Disasters) Track at FIRE (Forum for Information Retrieval
Evaluation) 2022. Task 1 is "Building an effective classifier for 3-class classification on tweets
regarding people’s stance towards COVID-19 vaccines". Task 2 is "Building an effective classifier
for 4-class classification on tweets that can detect tweets that report someone experiencing
COVID-19 symptoms".
   The tweets for Task 1 are classified into 3 classes described below with examples:

    • AntiVax - the tweet indicates hesitancy (of the user who posted the tweet) towards the
      use of vaccines.
    • ProVax - the tweet supports / promotes the use of vaccines.
    • Neutral - the tweet does not have any discernible sentiment expressed towards vaccines
      or is not related to vaccines.

  An example for each class of tweets has been given below:

    • AntiVax Tweet: "Let all politicians and their families be the first to take it. And then lets
      see how they are doing in 6 months or less. These vaccines take years to make and hopefully
      get it right! No way this has taken that long to make so i won’t be getting one and never will
      https://t.co/PcFL4NXNZM"
    • ProVax Tweet: "Good News: Pfizer COVID-19 vaccine 90 percent effective in phase 3
      https://t.co/cXb4WUZ0VV"
    • Neutral Tweet: "Great thread by @nataliexdean about today’s Moderna vaccine news
      https://t.co/IVqszNrFxm"

  The tweets for Task 2 are classified into 4 classes described below with examples:

    • Primary Reporting - The user (who posted the tweet) is reporting symptoms of
      himself/herself.
    • Secondary Reporting - The user is reporting symptoms of some friend / relative /
      neighbour / someone they met.
    • Third-party Reporting - The user is reporting symptoms of some celebrity / third-party
      person.
    • Non-Reporting - The user is not reporting anyone experiencing COVID-19 symptoms,
      but talking about symptom-words in some other context. This class includes tweets that
      only give general information about COVID-19 symptoms, without specifically reporting
      about a person experiencing such symptoms.

  An example for each class of tweets has been given below:

    • Primary Reporting Tweet: "Wondering if I should get tested for covid.. I have had this
      cough for 2 weeks now, not getting better or worse, also runny nose and headaches.. Just in
      case..."
    • Secondary Reporting Tweet: "@cdngarbageman Omg David. Me too!! My sister in law
      just recovered from Covid. It took her two weeks it was like a very mild flu. My brother has
      a mild cough but tested negative. Its very very serious."
    • Third-party Reporting Tweet: "#Recent #TamilNaduCoronaupdate 18 months old child
      dead due to corona. Was admitted at Viluppuram government medical College and hospital
      on 26/06/2019 with symptoms of cough fever breathlessness and was found to be #Corona
      positive. https://t.co/1NnCkAG9ya"
    • Non-Reporting Tweet: "@trumpwarrior45 Dry cough, shortness of breath, and fever are
      what to look for. If you have a mucus cough, stuffy/runny nose, that is just a cold. Still a
      coronavirus, but not COVID-19. Just be mindful of symptoms."


3. Related Work
Users publish information on micro-blogs such as Twitter for a variety of reasons, including
to express their opinions on the coronavirus, inform their connections about their health, and report
symptoms and warnings concerning themselves or others they know. People discuss COVID-19
vaccines and vaccination campaigns in large numbers before getting their dose. The extraction of
information from these textual tweets is a common application of social computing. Traditional
machine learning techniques such as the Naive Bayes classifier, linear classifiers, and Support Vector
Machines, and deep neural techniques such as Long Short-Term Memory networks (LSTMs) and
bidirectional RNNs, are effective for text classification. More recent language models for natural
language processing include BERT (Bidirectional Encoder Representations from Transformers) [1]
and its domain-specific version CT-BERT (COVID-Twitter-BERT) [2]. VaccineBERT [3] is
a BERT-based model that classifies COVID-19-related
vaccine tweets.


4. Dataset
The training dataset provided for Task 1 contains 4392 tweets. Of these, 2792 tweets on
people’s stance towards COVID-19 vaccines, crawled between November and December 2020,
were extracted from [4], and the remaining 1600 tweets were crawled between March and December 2020 and
annotated by crowdworkers for the three labels. The dataset contains tweet texts along with the
tweet IDs and the classes. The test dataset contains 500 tweets with tweet IDs and tweet texts
only.
  The dataset shared for Task 2 contains English tweets from February 2020 to June 2021, crawled
using keywords related to COVID-19 symptoms (e.g., ‘fever’, ‘cough’). The training dataset
contains 1574 tweet texts along with the tweet IDs, classified into four classes by human workers.
The test dataset contains 400 tweets with tweet IDs and tweet texts only.


5. Methodology
5.1. Pre-processing
Following [5] and [3], I pre-processed the tweets in order to improve the quality of the word
embeddings produced by CTC. Tweets generally contain distinctive tokens such as hashtags,
@user mentions, URLs, and emojis which, without pre-processing, often reduce the performance
of the model. Thus, I used the following data cleaning pipeline to pre-process the
tweets in the dataset:

    • converted words to lower case
    • removed stopwords such as "a", "an", "the", etc.
    • converted emojis to words using Python’s 'emoji' library (https://pypi.org/project/emoji/)
    • expanded contractions to full text using Python’s 'contractions' library
      (https://pypi.org/project/contractions/)
    • removed non-alphanumeric characters such as brackets, colons, semi-colons, @, etc.
    • removed URLs from the text using a regular expression
5.2. Model
I experimented with the following transformer-based models:

    • BERT: It stands for Bidirectional Encoder Representations from Transformers. BERT
      makes use of the Transformer, an attention-based architecture, to learn contextual relations
      between words (or sub-words) in a text. The textual representations generated by BERT are
      therefore very powerful and generalize well to many NLP tasks.
    • CT-BERT: COVID-Twitter-BERT is a transformer-based model, pretrained on a large
      corpus of Twitter messages on the topic of COVID-19 collected during the period from
      January 12 to April 16, 2020. CT-BERT is optimised to be used on COVID-19 content, in
      particular social media posts from Twitter. This model showed a marginal improvement
      of 10–30% compared to its base model, BERT-large, on five different specialised datasets.
    • VaccineBERT: VaccineBERT is the best performing vaccine tweet classification model
      from FIRE 2021, IRMiDis Track Task 2. It uses CT-BERT, fine-tuned over the shared
      training dataset for the classification output.

A fine-tuned CT-BERT model, similar to VaccineBERT, performed best on the validation set.

5.3. Experimental Setup
I first shuffled the training data, then split it into training and validation sets in the ratio 90:10
such that the percentage of instances of each class was preserved in both sets. Both training and
validation instances were pre-processed, as explained in Section 5.1. The resulting training data
was used for fine-tuning CT-BERT [2], in the same way as VaccineBERT [3], while the validation
data was used for evaluation. To prevent overfitting, I used early stopping on the
validation loss with a patience of 3.
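
As an illustration of this setup, the sketch below shows one way to build a class-preserving 90:10 split and to apply early stopping with a patience of 3. It is a standard-library sketch under stated assumptions, not the training code used for the submission: the function names, the random seed, and the example loss values are hypothetical, and the fine-tuning loop itself is omitted.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.1, seed=13):
    """Return (train_idx, val_idx) so that each class keeps its share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out roughly val_fraction of every class, at least one instance.
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx


def epoch_of_early_stop(val_losses, patience=3):
    """Return the 1-based epoch at which training stops for the given validation losses."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


labels = ["AntiVax"] * 30 + ["ProVax"] * 50 + ["Neutral"] * 20
train_idx, val_idx = stratified_split(labels)
stop_epoch = epoch_of_early_stop([0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50])
```

In the example, the validation loss last improves at epoch 3, so with a patience of 3 training would stop after epoch 6, before the later improvement at epoch 7 is ever seen. This is the usual trade-off of a small patience value.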

5.4. Prediction
For prediction over the available test data, I used the fine-tuned CT-BERT model as a text
classification model: it generates an embedding for each tweet and then predicts a
probability score for each of the three classes in Task 1 and each of the four classes in Task 2.
The class with the maximum probability was reported as the predicted class for that tweet.
The final prediction file, containing the tweet ID and the predicted class, was submitted as the run
for each of Tasks 1 and 2.
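
The step from per-class scores to a predicted label can be sketched as below. This is a minimal standard-library illustration, assuming the classifier head emits one raw score (logit) per class; the label order in TASK1_CLASSES, the function names, and the example scores are hypothetical and not taken from the submitted system.

```python
import math

# Assumed label order for Task 1; the actual index-to-label mapping
# depends on how the labels were encoded during fine-tuning.
TASK1_CLASSES = ["AntiVax", "Neutral", "ProVax"]


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names):
    """Return the most probable class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


def build_run(tweet_ids, all_logits, class_names):
    """Pair each tweet ID with its predicted class, as rows of a run file."""
    return [(tid, predict_class(logits, class_names)[0])
            for tid, logits in zip(tweet_ids, all_logits)]


run = build_run(["t1", "t2"], [[2.0, 0.1, -1.0], [0.0, 0.5, 3.0]], TASK1_CLASSES)
```

The same functions apply to Task 2 by passing a list of the four reporting classes instead of TASK1_CLASSES.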


6. Results and Discussion
IRMiDis Track results are evaluated using overall accuracy and the macro-F1 score as metrics.
The results of my submitted automated runs for Tasks 1 and 2 are shown in Table 1. CTC ranked
first among all submissions for both tasks.

    Task    Team_ID      Accuracy    Macro-F1 score    Rank
    1       Data@IITD    0.770       0.773             1
    2       Data@IITD    0.820       0.746             1
Table 1
Results for Tasks 1 and 2
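
For reference, the two metrics can be computed as in the following sketch. It uses the standard definitions of accuracy and macro-averaged F1 and is not the organisers' evaluation script; the function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted class equals the gold class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Because macro-F1 weights every class equally regardless of its size, it can differ noticeably from accuracy when performance varies across classes, which is one possible reading of the gap between the two scores for Task 2.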


7. Conclusion and Future Work
In this work, I propose a simple but effective approach to the COVID-19 tweet classification tasks
based on COVID-Twitter-BERT, a transformer-based model pre-trained on a large corpus of
COVID-19-related tweets. The experimental results showed that my solution achieved an
accuracy of 0.770 and a macro-F1 score of 0.773 in Task 1, and an accuracy of 0.820 and a
macro-F1 score of 0.746 in Task 2. CTC ranked first in the Information Retrieval from
Microblogs during Disasters (IRMiDis) track at FIRE 2022. As future work, ensemble learning
could be explored to improve the model’s accuracy and robustness.


References
[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
    transformers for language understanding, 2018. URL: http://arxiv.org/abs/1810.04805.
[2] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing
    model to analyse COVID-19 content on Twitter, arXiv preprint arXiv:2005.07503 (2020).
[3] S. Bithel, S. Verma, VaccineBERT: BERT for COVID-19 vaccine tweet classification, 2021, pp.
    1199–1203. URL: http://ceur-ws.org/Vol-3159/T8-1.pdf.
[4] L.-A. Cotfas, C. Delcea, I. Roxin, C. Ioanăş, D. S. Gherai, F. Tajariol, The longest month:
    Analyzing COVID-19 vaccination opinions dynamics from tweets in the month following
    the first vaccine announcement, IEEE Access 9 (2021) 33203–33223.
    doi:10.1109/ACCESS.2021.3059821.
[5] S. Bithel, S. S. Malagi, Unsupervised identification of relevant prior cases, 2021.
    arXiv:2107.08973.