<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CTC: COVID-19 Tweet Classification using CT-BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shivangi Bithel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology Delhi</institution>, <addr-line>Hauz Khas, New Delhi, Delhi 110016</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>CTC is my submitted work for the Information Retrieval from Microblogs during Disasters (IRMiDis) Track at the Forum for Information Retrieval Evaluation (FIRE) 2022. Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. Most people infected with the virus experience a mild to moderate respiratory illness and recover without requiring special treatment. However, some become seriously ill and require medical attention. Vaccines against coronavirus and prompt reporting of symptoms saved many lives during the pandemic. The analysis of COVID-19-related tweets can provide valuable insights into people's stance toward the new vaccine. It can also help the authorities plan their strategies based on people's opinions about the vaccine and ensure the effectiveness of vaccination campaigns. Tweets describing symptoms can also aid in identifying high-alert zones and determining quarantine regulations. The IRMiDis track focuses on these COVID-19-related tweets that flooded Twitter. I developed an effective classifier for both Tasks 1 and 2. The evaluation score of my submitted run is reported in terms of accuracy and macro-F1 score. I achieved an accuracy of 0.770 and a macro-F1 score of 0.773 in Task 1, and an accuracy of 0.820 and a macro-F1 score of 0.746 in Task 2. My submission achieved the first rank among all submissions in both tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>COVID-19 Tweets</kwd>
        <kwd>COVID-Twitter-BERT</kwd>
        <kwd>Tweet classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>During the COVID-19 pandemic, the globe waged its most difficult struggle. The disease
was unknown to everyone, and it was impossible to determine its exact symptoms. Every time a
new variant was identified, it was accompanied by new symptoms. Due to the resemblance
of its symptoms to those of the common cold and influenza, this fatal virus was often
misdiagnosed as a cold or the flu. Through social media, many people told their friends and
family about their own symptoms or the symptoms of people they knew. Not
only that, but people also tweeted about celebrities and their symptoms. By promptly
identifying individuals with COVID-19 symptoms, it is possible to offer them appropriate
treatment and prevent the disease’s spread.</p>
      <p>As the pandemic spread through human contact, it became more difficult to contain.
Numerous preventative measures, such as wearing masks and observing a 14-day quarantine,
assisted in controlling the spread. There was a rush to develop a vaccine capable of producing
the necessary antibodies. The coronavirus vaccine was the only remaining method that could help
combat and eradicate the infectious illness by immunising individuals against the virus.
When the vaccine finally arrived, people began using social media platforms such as Twitter to
debate the vaccination as it was being disseminated throughout the world. People had both
favourable and negative opinions on the ongoing issues of vaccine advancement, accessibility,
effectiveness, and side effects. The government and numerous health organisations, such as the WHO,
would benefit from knowing what people think of the new COVID-19 vaccinations. They could
use the insights gleaned from these micro-blogs to develop future initiatives and urge everyone
to be fully vaccinated.</p>
      <p>Classifying tweets manually is laborious and error-prone. Therefore, there was an urgent
need to build machine learning models that can assist in categorising tweets concerning
COVID-19 vaccinations and in detecting individuals with COVID-19 symptoms.
In this paper, I present an effective 3-class classifier for COVID-19-related vaccine
tweets and a 4-class classifier for tweets reporting COVID-19 symptoms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Definition</title>
      <p>In this paper, I present an effective approach to Task 1, "Building an effective classifier for 3-class classification on tweets regarding people’s
stance towards COVID-19 vaccines", and Task 2, "Building an effective classifier for 4-class
classification on tweets that can detect tweets that report someone experiencing COVID-19
symptoms", organized as part of the IRMiDis (Information Retrieval from Microblogs during
Disasters) Track at FIRE (Forum for Information Retrieval Evaluation) 2022.</p>
      <p>The tweets for Task 1 are classified into 3 classes described below with examples:
• AntiVax - the tweet indicates hesitancy (of the user who posted the tweet) towards the
use of vaccines.
• ProVax - the tweet supports / promotes the use of vaccines.
• Neutral - the tweet does not have any discernible sentiment expressed towards vaccines
or is not related to vaccines.</p>
      <p>An example for each class of tweets has been given below:
• AntiVax Tweet: "Let all politicians and their families be the first to take it. And then lets
see how they are doing in 6 months or less. These vaccines take years to make and hopefully
get it right! No way this has taken that long to make so i won’t be getting one and never will
https://t.co/PcFL4NXNZM"
• ProVax Tweet: "Good News: Pfizer COVID-19 vaccine 90 percent efective in phase 3
https://t.co/cXb4WUZ0VV"
• Neutral Tweet: "Great thread by @nataliexdean about today’s Moderna vaccine news
https://t.co/IVqszNrFxm"</p>
      <p>The tweets for Task 2 are classified into 4 classes described below with examples:
• Primary Reporting - The user (who posted the tweet) is reporting symptoms of
himself/herself.
• Secondary Reporting - The user is reporting symptoms of some friend / relative /
neighbour / someone they met.
• Third-party Reporting - The user is reporting symptoms of some celebrity / third-party
person.
• Non-Reporting - The user is not reporting anyone experiencing COVID-19 symptoms,
but talking about symptom-words in some other context. This class includes tweets that
only give general information about COVID-19 symptoms, without specifically reporting
about a person experiencing such symptoms.
An example for each class of tweets has been given below:
• Primary Reporting Tweet: "Wondering if I should get tested for covid.. I have had this
cough for 2 weeks now, not getting better or worse, also runny nose and headaches.. Just in
case..."
• Secondary Reporting Tweet: "@cdngarbageman Omg David. Me too!! My sister in law
just recovered from Covid. It took her two weeks it was like a very mild flu. My brother has
a mild cough but tested negative. Its very very serious."
• Third-party Reporting Tweet: "#Recent #TamilNaduCoronaupdate 18 months old child
dead due to corona. Was admitted at Viluppuram government medical College and hospital
on 26/06/2019 with symptoms of cough fever breathlessness and was found to be #Corona
positive. https://t.co/1NnCkAG9ya"
• Non-Reporting Tweet: "@trumpwarrior45 Dry cough, shortness of breath, and fever are
what to look for. If you have a mucus cough, stufy/runny nose, that is just a cold. Still a
coronavirus, but not COVID-19. Just be mindful of symptoms."</p>
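      <p>As a concrete illustration of the two label sets, the classes might be encoded as integer ids before fine-tuning a classifier. The mapping below is an assumption for illustration; the track itself only defines the class names.</p>
      <preformat>
```python
# Hypothetical integer encodings for the two IRMiDis 2022 label sets.
TASK1_LABELS = {"AntiVax": 0, "ProVax": 1, "Neutral": 2}

TASK2_LABELS = {
    "Primary Reporting": 0,
    "Secondary Reporting": 1,
    "Third-party Reporting": 2,
    "Non-Reporting": 3,
}

def encode(label: str, mapping: dict) -> int:
    """Map a class name to its integer id, failing loudly on unknown labels."""
    if label not in mapping:
        raise ValueError(f"Unknown class label: {label!r}")
    return mapping[label]
```
      </preformat>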
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        Users publish information on micro-blogs such as Twitter for a variety of reasons, including
to express their opinions on the coronavirus, inform their connections about their health, and report
symptoms and cautions concerning themselves or others they know. People discuss the COVID-19
vaccines and vaccination campaigns in large numbers before getting their dose. The extraction of
information from these textual tweets is a common application of social computing. Traditional
machine learning techniques such as the Naive Bayes classifier, linear classifiers, and Support Vector
Machines, and deep neural techniques such as Long Short-Term Memory (LSTM) networks and bidirectional
RNNs, are very effective for text classification. Among the most recent language models for natural
language processing are BERT (Bidirectional Encoder Representations from Transformers) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and its domain-specific version CT-BERT (COVID-Twitter-BERT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. VaccineBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is
a BERT-based model that performs tweet classification over COVID-19-related
vaccine tweets.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <p>
        The training dataset provided for Task 1 contains 4392 tweets. Of these, 2792 tweets on people's stance towards the COVID-19 vaccine were extracted
from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], crawled between November and December 2020, and the remaining 1600 tweets were crawled between
March and December 2020 and annotated by crowdworkers for the three labels. The training data contains tweet-texts along with the
tweet IDs and the classes. The test dataset contains 500 tweets, with tweet IDs and tweet-texts
only.
      </p>
      <p>The dataset shared for Task 2 contains English tweets from February 2020 to June 2021, crawled
using keywords related to COVID-19 symptoms (e.g., ‘fever’, ‘cough’). The training dataset
contains 1574 tweet-texts along with the tweet IDs, classified into four classes by human workers.
The test dataset contains 400 tweets with tweet IDs and tweet-texts only.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <sec id="sec-5-1">
        <title>5.1. Pre-processing</title>
        <p>
          Following [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], I pre-processed the tweets in order to improve the quality of the word
embeddings produced by CTC. Tweets generally contain unique lexicons such as HASHTAGS,
@USER, HTTP-URLs, and EMOJIS which, without pre-processing, often reduce the performance
of the model. Thus, I used the following data-cleaning pipeline to pre-process the
tweets in the dataset:
• converted words to their lower case
• carefully removed stopwords such as "a", "an", "the", etc.
• converted emoticons to words using Python’s ’emoji’ library (https://pypi.org/project/
emoji/).
• expanded contractions using Python’s ’contractions’ library (https://pypi.org/
project/contractions/).
• removed non-alphanumeric characters such as brackets, colons, semi-colons, @, etc.
• removed URLs from the text using regular expressions.
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>5.2. Model</title>
        <p>
I experimented with the following transformer-based models:
• BERT: It stands for Bidirectional Encoder Representations from Transformers. BERT
makes use of Transformer, an attention mechanism to learn contextual relations between
words (or sub-words) in a text. Thus the textual representations generated by BERT are
very powerful and generalize well to solve many NLP tasks.
• CT-BERT: COVID-Twitter-BERT is a transformer-based model, pretrained on a large
corpus of Twitter messages on the topic of COVID-19 collected during the period from
January 12 to April 16, 2020. CT-BERT is optimised to be used on COVID-19 content, in
particular social media posts from Twitter. This model showed a 10–30% marginal
improvement compared to its base model, BERT-large, on five different specialised datasets.
• VaccineBERT: VaccineBERT is the best performing vaccine tweet classification model
from FIRE 2021, IRMiDis Track Task 2. It uses CT-BERT, fine-tuned over the shared
training dataset for the classification output.
        </p>
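        <p>The cleaning pipeline of Section 5.1 can be sketched with standard-library regular expressions alone. This is a minimal illustration, not the exact code behind CTC, and it omits the emoji-to-word and contraction-expansion steps, which rely on the third-party ’emoji’ and ’contractions’ packages; the stopword list here is a small stand-in.</p>
        <preformat>
```python
import re

# Small illustrative stopword list; the paper removes common English
# stopwords such as "a", "an", "the".
STOPWORDS = {"a", "an", "the", "is", "of", "to", "and"}

def clean_tweet(text: str) -> str:
    """Apply the Section 5.1 cleaning steps (minus emoji/contraction handling)."""
    text = text.lower()                        # lower-casing
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip non-alphanumerics (@, #, :, ...)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

# For example, clean_tweet("The vaccine is 90% effective! https://t.co/x")
# returns "vaccine 90 effective".
```
        </preformat>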
        <p>The fine-tuned CT-BERT model, similar to VaccineBERT, performed best on the validation set.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Experimental Setup</title>
        <p>
          I first shuffled the training data, then split it into training and validation sets in a 90:10 ratio,
such that the percentage of instances of each class was preserved in both sets. Both training and
validation instances were pre-processed as explained in Section 5.1. The resulting training data
was used for fine-tuning CT-BERT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], similarly to VaccineBERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], while the validation
data was used for evaluation. To prevent overfitting, I used early stopping, monitoring the
validation loss with a patience of 3.
        </p>
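        <p>The stratified 90:10 split described above can be written in a few lines of standard-library Python. This is a sketch of the idea only; the actual run may well have used a library utility (e.g., scikit-learn’s train_test_split with its stratify option), which is an assumption rather than something stated above.</p>
        <preformat>
```python
import random
from collections import defaultdict

def stratified_split(examples, labels, val_fraction=0.10, seed=42):
    """Shuffle and split so each class keeps its proportion in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)                           # shuffle within each class
        n_val = max(1, round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])                # ~10% of each class
        train_idx.extend(indices[n_val:])              # remaining ~90%

    train = [examples[i] for i in train_idx]
    val = [examples[i] for i in val_idx]
    return train, val
```
        </preformat>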
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Prediction</title>
        <p>For prediction over the available test data, I used the fine-tuned CT-BERT model as a text
classification model to generate the embeddings for each tweet, and then predicted the
probability scores of each tweet against all three classes in Task 1 and all four classes in Task 2.
The class with the maximum probability was reported as the predicted class for that tweet.
The final prediction file, containing the tweet ID and the predicted class, was submitted as the run
for Tasks 1 and 2.</p>
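        <p>The prediction step reduces to a softmax over the per-class scores followed by an argmax. The sketch below illustrates this with plain Python; the class order in TASK1_CLASSES is an assumption for illustration, and in CTC the raw scores come from the fine-tuned CT-BERT model.</p>
        <preformat>
```python
import math

TASK1_CLASSES = ["AntiVax", "ProVax", "Neutral"]  # class order is an assumption

def softmax(logits):
    """Convert raw scores to probabilities (numerically stabilised)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, class_names):
    """Return the class with the maximum predicted probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best]

# predict([0.1, 2.3, -1.0], TASK1_CLASSES) returns "ProVax"
```
        </preformat>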
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>IRMiDis Track results are evaluated using overall accuracy and the macro-F1 score as metrics.
The results of my submitted automated runs for Tasks 1 and 2 are shown in Table 1. CTC achieved
the first rank among all submissions for both tasks.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption><p>Evaluation results of the submitted runs for Task 1 and Task 2.</p></caption>
        <table>
          <thead>
            <tr><th>Task</th><th>Team_ID</th><th>Accuracy</th><th>macro-F1 score</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Data@IITD</td><td>0.770</td><td>0.773</td></tr>
            <tr><td>2</td><td>Data@IITD</td><td>0.820</td><td>0.746</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In this work, I propose a simple but effective approach to the COVID-19 Tweet Classification task
based on COVID-Twitter-BERT, a transformer-based model pre-trained on a large corpus of
COVID-19-related tweets. The experimental results show that my solution achieved an
accuracy of 0.770 and a macro-F1 score of 0.773 in Task 1, and an accuracy of 0.820 and a
macro-F1 score of 0.746 in Task 2. CTC ranked first in the Information Retrieval from
Microblogs during Disasters (IRMiDis) track at FIRE 2022. For future work, one could experiment
with ensemble learning to improve the model's accuracy and robustness.</p>
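      <p>As a pointer for the ensembling direction mentioned above, one simple scheme is a majority vote over the predictions of several fine-tuned models. The sketch below illustrates the idea; it is not part of the submitted CTC system.</p>
      <preformat>
```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model predicted labels for each tweet by majority vote.

    predictions_per_model: one list of predicted labels per model,
    all of the same length. Ties go to the label seen first.
    """
    ensembled = []
    for per_tweet in zip(*predictions_per_model):
        ensembled.append(Counter(per_tweet).most_common(1)[0][0])
    return ensembled
```
      </preformat>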
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: http://arxiv.org/abs/1810.04805.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, arXiv preprint arXiv:2005.07503 (2020).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Bithel, S. Verma, VaccineBERT: BERT for COVID-19 vaccine tweet classification, 2021, pp. 1199-1203. URL: http://ceur-ws.org/Vol-3159/T8-1.pdf.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L.-A. Cotfas, C. Delcea, I. Roxin, C. Ioanăş, D. S. Gherai, F. Tajariol, The longest month: Analyzing COVID-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement, IEEE Access 9 (2021) 33203-33223. doi:10.1109/ACCESS.2021.3059821.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Bithel, S. S. Malagi, Unsupervised identification of relevant prior cases, 2021. arXiv:2107.08973.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>