<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>M. Chakraborty)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>COVID Vaccine Stance Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sk. Aftab Aman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Meghna Chakraborty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Engineering and Management</institution>
          ,
          <addr-line>New Town Kolkata,West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper describes our submission to the IRMiDis FIRE 2021 task [2]. The goal of the task was to classify tweets related to COVID-19 vaccines into three different sentiment classes. Our approach uses machine learning techniques to solve this 3-class sentiment classification problem. The submitted runs are evaluated in terms of accuracy and macro-F1 score. Our classifier achieved an accuracy of 0.448 and a macro-F1 score of 0.442.</p>
      </abstract>
      <kwd-group>
        <kwd>sentiment analysis</kwd>
        <kwd>micro blogs</kwd>
        <kwd>machine learning</kwd>
        <kwd>3-class classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks</title>
      <sec id="sec-2-1">
        <p>Each tweet is assigned to one of three classes:</p>
        <p>AntiVax - The tweet is against the use of vaccines.</p>
        <p>ProVax - The tweet supports or promotes the use of vaccines.</p>
        <p>Neutral - The tweet does not express any discernible sentiment towards vaccines, or is not related to vaccines.</p>
        <p>Below are samples of tweets showing various sentiments.</p>
        <p>Tweet 1 : Coronavirus: Some Canadians hesitant to take a COVID-19 vaccine – Global News</p>
        <p>Tweet 2 : More good news!!! I could get used to this Covid-19 vaccine candidate is 90 percent
effective, says manufacturer https://t.co/wtpyAh71pU</p>
        <p>Tweet 3 : Moderna on track to report late-stage COVID-19 vaccine data next month.</p>
        <p>Tweet 1 is an AntiVax tweet, Tweet 2 is a ProVax tweet and Tweet 3 is a Neutral tweet. Tweet
1 reports that some Canadians are hesitant to take the vaccine, while Tweet 2 welcomes the
news that a vaccine candidate is 90 percent effective. Tweet 3 only states facts about a
vaccine and does not express any distinguishable sentiment. Each tweet therefore matches its assigned class.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The data used for this task was gathered from Twitter. The tweets were posted in 2020
and concern COVID-19 vaccines. The data was made available in two phases:
• The training tweets were taken from the dataset provided by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which contains stances
on COVID-19 vaccines collected in November and December 2020. We used 2792
tweets from this dataset for training and validation.
• The test set comprises 1600 tweets that were released to participants without labels. Their gold labels were assigned by three
crowdworkers, with a majority agreement on every tweet.
      </p>
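      <p>The majority-agreement rule mentioned above can be sketched as follows. This is a minimal illustration, not the organisers' annotation pipeline; the function name majority_label and the example votes are hypothetical.</p>
      <preformat>
```python
from collections import Counter

def majority_label(annotations):
    """Return the label chosen by a strict majority of annotators, or None."""
    label, votes = Counter(annotations).most_common(1)[0]
    # Strict majority: more than half of the annotators must agree.
    return label if votes * 2 > len(annotations) else None

print(majority_label(["ProVax", "ProVax", "Neutral"]))   # ProVax
print(majority_label(["ProVax", "AntiVax", "Neutral"]))  # None
```
      </preformat>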
      <p>The dataset was slightly skewed: there were more Neutral and ProVax tweets than
AntiVax tweets, which could bias the classification model.</p>
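      <p>One common way to compensate for such a skew is to weight each class inversely to its frequency. The sketch below illustrates that heuristic only; it was not part of the submitted runs, and the toy label counts are hypothetical rather than the real class distribution.</p>
      <preformat>
```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical toy labels, not the real class distribution.
labels = ["Neutral"] * 4 + ["ProVax"] * 4 + ["AntiVax"] * 2
weights = balanced_class_weights(labels)
print(weights)  # the rarer AntiVax class receives the largest weight
```
      </preformat>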
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing</title>
        <sec id="sec-4-1-1">
          <p>Preprocessing is the first and most important step for any text-based problem.</p>
          <p>This process is the same for both RUN 1 and RUN 2. First we removed all URLs, i.e. words
starting with https. The hash symbols were removed because they are common and appear in many
tweets with hashtags. All words starting with '@' were pruned from every tweet. We then
removed all retweets to eliminate duplicates and thus reduce bias. Next we split
CamelCase words into independent words. CamelCase words are closed compounds in which the first letter of
the second word is capitalised (for example PayPal or iPhone). Hashtags
often contain such words, as seen in the example below.</p>
          <p>16km ENE of Nagarkot, Nepal: DYFI? - ITime2015-04-27 21:27:41 UTC2015-04-28
03:12...#EarthQuake</p>
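          <p>A minimal sketch of these first cleaning steps is given below, using only regular expressions. The function name strip_twitter_markup, the example tweet and its shortened URL are hypothetical, and the exact patterns we used may differ. Note that this simple rule would also split a word such as iPhone.</p>
          <preformat>
```python
import re

def strip_twitter_markup(text):
    """Remove URLs and @mentions, drop '#', and split CamelCase words."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # user mentions
    text = text.replace("#", " ")               # keep the hashtag word, drop the symbol
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return " ".join(text.split())

print(strip_twitter_markup("@user Strong #EarthQuake near Nagarkot https://t.co/xyz"))
# Strong Earth Quake near Nagarkot
```
          </preformat>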
          <p>Following this we converted the sentences to lowercase and removed all emoticons, symbols,
flags, pictographs, and transport and map symbols. We deal only with textual data, and the
Unicode characters of these symbols would otherwise be treated as arbitrary numbers and punctuation that do
not help to detect sentiment. Next we expanded contractions, converting words like
“haven’t” and “shouldn’t” into “have not” and “should not”.</p>
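          <p>The normalisation step can be sketched as below. The contraction map is a small illustrative subset, and the Unicode ranges are an assumption about which blocks cover emoticons, pictographs, transport and map symbols, and flags; neither is the exact list used in our runs.</p>
          <preformat>
```python
import re

# Small illustrative map; a real one would cover many more contractions.
CONTRACTIONS = {"haven't": "have not", "shouldn't": "should not", "don't": "do not"}

# Unicode blocks for emoticons, symbols and pictographs, transport and map
# symbols, and regional-indicator (flag) characters.
EMOJI_PATTERN = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E6-\U0001F1FF]+"
)

def normalise(text):
    """Lowercase, strip emoji-like symbols, and expand known contractions."""
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    text = EMOJI_PATTERN.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return " ".join(text.split())

print(normalise("We Haven\u2019t seen side effects \U0001F600"))
# we have not seen side effects
```
          </preformat>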
          <p>After this we removed all punctuation and all stop words, i.e. words that occur
with high frequency such as "a" and "the". We kept "no" and "not", as they carry information about the
sentiment of a tweet. We then chose lemmatization over stemming, because stemming often produces words
that are not part of the vocabulary, whereas a lemma always belongs to the language. The lemmatized
tweets were then ready to be converted into vectors and fed to our classifier.</p>
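          <p>The stop-word filter with its two exceptions can be sketched as follows. The stop-word list here is a tiny hypothetical sample, and lemmatization is left out because it needs an external lexical resource.</p>
          <preformat>
```python
import string

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "this", "no", "not"}
KEEP = {"no", "not"}  # negations are kept even though they are frequent

def remove_punct_and_stop_words(text):
    """Delete ASCII punctuation, then drop stop words but keep negations."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t in KEEP or t not in STOP_WORDS]

print(remove_punct_and_stop_words("this vaccine is not safe, it is a risk!"))
# ['vaccine', 'not', 'safe', 'risk']
```
          </preformat>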
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Selection</title>
        <p>After cleaning our data we had to transform it into a form that a machine
learning model can process. We used a tf-idf vectoriser to transform each tweet into a vector, considering
unigrams as well as bigrams. We found through experimentation that this configuration
gave us the best result.</p>
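        <p>To make the representation concrete, the sketch below computes tf-idf weights over unigrams and bigrams in plain Python. It is an illustration rather than the vectoriser we ran: the smoothed idf formula and the L2 normalisation are assumptions, and the two toy documents are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all unigrams up to n_max-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Map each tokenised document to a {term: weight} dict, L2-normalised."""
    grams = [ngrams(d) for d in docs]
    df = Counter(term for g in grams for term in set(g))
    n_docs = len(docs)
    vectors = []
    for g in grams:
        tf = Counter(g)
        # Smoothed idf (an assumed variant): ln((1 + N) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

docs = [["vaccine", "not", "safe"], ["vaccine", "good", "news"]]
vecs = tfidf(docs)
print(sorted(vecs[0]))  # ['not', 'not safe', 'safe', 'vaccine', 'vaccine not']
```
        </preformat>
        <p>The term "vaccine" occurs in both toy documents and therefore receives a lower weight than "not", which occurs in only one.</p>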
        <p>We then fed these vectors into three different models that can classify each tweet into
one of the three classes: Naive Bayes, SVM and CNN. We worked with 2753 training
tweets.</p>
        <p>RUN 1: We experimented with several learning algorithms and compared the models
by their validation accuracy, holding out 30% of the training data for validation. We obtained a validation
accuracy of 0.733 with SVM and 0.724 with Naive Bayes. The confusion matrix for SVM during validation
is given in Fig 1.</p>
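        <p>The hold-out protocol and the two validation diagnostics can be sketched in plain Python as below. The helper names, the fixed seed and the toy inputs are hypothetical; this is not the code that produced the reported scores.</p>
        <preformat>
```python
import random
from collections import Counter

def train_validation_split(items, validation_fraction=0.3, seed=0):
    """Shuffle a copy of items and hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

train, val = train_validation_split(range(10))
print(len(train), len(val))  # 7 3
```
        </preformat>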
        <p>We picked the Support Vector Machine for RUN 1 because its
validation accuracy was higher. We used the RBF kernel because we did not expect the classes to be
linearly separable. The model was then trained on the preprocessed training data. The
accuracy and macro-F1 score on the test data are shown in Table 1.</p>
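        <p>For reference, the RBF kernel measures the similarity of two vectors as a decaying function of their squared Euclidean distance. The sketch below shows the standard formula; the value of gamma is a free parameter and the values used here are only illustrative.</p>
        <preformat>
```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2) for equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))             # 1.0 for identical vectors
print(rbf_kernel([1.0, 0.0], [0.0, 1.0], gamma=0.5))  # exp(-1)
```
        </preformat>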
        <p>RUN 2: For this run we experimented with a Convolutional Neural Network (CNN). The
preprocessing was the same as in RUN 1. The longest preprocessed sentence had
33 tokens, so we set the maximum length of each tweet to 40 and padded every tweet to that length. As before,
30% of the training data was held out for validation during training.</p>
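        <p>Padding to a fixed length can be sketched as follows. Padding on the right with the id 0 is an assumption made for this illustration; the side on which the tweets were padded is not specified above.</p>
        <preformat>
```python
def pad_sequence(token_ids, max_len=40, pad_id=0):
    """Right-pad with pad_id, or truncate, to exactly max_len entries."""
    trimmed = list(token_ids[:max_len])
    return trimmed + [pad_id] * (max_len - len(trimmed))

padded = pad_sequence([5, 12, 7])
print(len(padded), padded[:5])  # 40 [5, 12, 7, 0, 0]
```
        </preformat>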
        <p>We used a sequential model and added an embedding layer, followed by a 1D
convolutional layer. We then used a GlobalMaxPooling1D layer to downsample the input representation,
and a Dropout layer to reduce overfitting. At the very end we used a Dense layer
with “sigmoid” as its activation function. We trained this model for 100 epochs. A
visual representation of the model is shown in Fig 4. Fig 2 and Fig 3 give an idea of the fit and
accuracy of the model.</p>
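        <p>The core of this architecture, a 1D convolution followed by global max pooling, can be illustrated on a single scalar sequence in plain Python. This is a conceptual sketch only: the real layers operate on embedding vectors with many filters, and the ReLU activation and the kernel values here are assumptions.</p>
        <preformat>
```python
def conv1d(sequence, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) of a scalar sequence, with ReLU."""
    k = len(kernel)
    outputs = []
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        outputs.append(max(0.0, activation))
    return outputs

def global_max_pool(feature_map):
    """Keep only the strongest response of a filter over all positions."""
    return max(feature_map)

feature_map = conv1d([1.0, 2.0, 0.0, 3.0], kernel=[1.0, -1.0])
print(feature_map, global_max_pool(feature_map))  # [0.0, 2.0, 0.0] 2.0
```
        </preformat>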
        <p>Our neural network model overfit, as the graphs in Fig 2 and Fig 3 show. When
the training loss is very low and the validation loss is high (as in Fig 2), the model
is overfitting. The same inference can be drawn from Fig 3: the categorical accuracy
on the training data is much higher than on the validation data, which also indicates
overfitting. We nevertheless applied this trained model to the test data and obtained the result shown in
Table 2.</p>
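        <p>The diagnosis described above can be expressed as two simple checks on the loss curves. The numbers below are hypothetical and are not the values behind Fig 2; they only show how a growing gap between training and validation loss reveals overfitting.</p>
        <preformat>
```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def generalisation_gap(train_losses, val_losses):
    """Validation loss minus training loss at the final epoch."""
    return val_losses[-1] - train_losses[-1]

# Hypothetical loss curves, not the values behind Fig 2.
train_losses = [1.00, 0.60, 0.30, 0.10, 0.05]
val_losses = [1.05, 0.80, 0.75, 0.90, 1.10]
print(best_epoch(val_losses))                            # 3
print(generalisation_gap(train_losses, val_losses) > 0)  # True
```
        </preformat>
        <p>Stopping training near the epoch returned by best_epoch is one standard way to limit this effect.</p>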
        <p>Because the CNN model in RUN 2 suffers from overfitting and its accuracy is also lower, as seen in
Table 2, we discarded this run and considered RUN 1 our primary run.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The gold standard for the classification was generated manually. As described in the
IRMiDis track, three crowdworkers were given the tweets. Every tweet has a majority
agreement, i.e. at least 2 of the 3 annotators assigned it to the same class. This suggests that
some of the tweets are subjective and thus likely to be misclassified automatically. The
submitted runs are evaluated on overall accuracy and macro-F1 score, with the macro-F1
score as the main ranking criterion.</p>
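      <p>Macro-F1 is the unweighted mean of the per-class F1 scores, so a rarely predicted class counts as much as a frequent one. The sketch below is our own illustration of the metric, not the organisers' evaluation script, and the labels are hypothetical.</p>
      <preformat>
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["AntiVax", "ProVax", "Neutral", "ProVax"]
y_pred = ["AntiVax", "ProVax", "ProVax", "ProVax"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.6
```
      </preformat>
      <p>In this toy example the accuracy is 0.75, but the missed Neutral tweet contributes an F1 of 0 and pulls the macro-F1 down to 0.6.</p>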
      <p>The results of our submitted automatic runs are shown in Table 1 and Table 2. We were allowed
to submit more than one run. Our primary automatic run was ranked 8th among all runs, and we
were ranked 5th as a team. We achieved an accuracy of 0.448 and a macro-F1 score of 0.442 with
RUN 1. With RUN 2 we achieved an accuracy of 0.414 and a macro-F1 score of 0.401.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work for IRMiDis FIRE 2021 we used natural language preprocessing
techniques and machine learning models to solve a three-class classification problem. We
tried several learning models and selected the one that gave the best result. As a future extension
of this work we plan to take the relative order of words and their POS tags into account
to improve the performance of the model. The
overfitting problem can be addressed by tuning the hyperparameters. We can remove tweets
that are 80% or more similar to another tweet to reduce bias. We can also apply methods for
handling class imbalance, which we ignored in this study.</p>
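      <p>The proposed near-duplicate filter could look like the sketch below. The choice of similarity measure is an assumption: here we use the character-level ratio from Python's difflib, and the example tweets are shortened, preprocessed versions of the samples shown earlier.</p>
      <preformat>
```python
from difflib import SequenceMatcher

def drop_near_duplicates(tweets, threshold=0.8):
    """Keep a tweet only if it is below the threshold against every kept tweet."""
    kept = []
    for tweet in tweets:
        if not any(SequenceMatcher(None, tweet, k).ratio() >= threshold for k in kept):
            kept.append(tweet)
    return kept

tweets = [
    "vaccine candidate is 90 percent effective",
    "vaccine candidate is 90 percent effective says manufacturer",
    "moderna on track to report vaccine data",
]
print(len(drop_near_duplicates(tweets)))  # 2
```
      </preformat>
      <p>This pairwise comparison is quadratic in the number of tweets, which is acceptable for a few thousand tweets but would need a faster scheme for larger collections.</p>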
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.-A.</given-names>
            <surname>Cotfas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delcea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Gherai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ioanăş</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Roxin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tajariol</surname>
          </string-name>
          ,
          <article-title>The longest month: Analyzing covid-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement</article-title>
          , IEEE Access (
          <year>2021</year>
          )
          <fpage>33203</fpage>
          -
          <lpage>33223</lpage>
          . doi:10.1109/ACCESS.2021.3059821.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>