=Paper=
{{Paper
|id=Vol-3159/T8-1
|storemode=property
|title=VaccineBERT: BERT for COVID-19 Vaccine Tweet Classification
|pdfUrl=https://ceur-ws.org/Vol-3159/T8-1.pdf
|volume=Vol-3159
|authors=Shivangi Bithel,Samidha Verma
|dblpUrl=https://dblp.org/rec/conf/fire/BithelV21
}}
==VaccineBERT: BERT for COVID-19 Vaccine Tweet Classification==
Shivangi Bithel 1, Samidha Verma 2

1 Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016
2 Indian Institute of Technology Delhi, Hauz Khas, New Delhi, Delhi 110016

Abstract

VaccineBERT is our submission to FIRE 2021 IRMiDis Track Task 2. We propose using a domain-specific BERT model to classify tweets as ProVax, AntiVax, or Neutral. The vaccination process is ongoing worldwide to fight the novel coronavirus disease (COVID-19), and sentiment analysis of tweets can provide helpful insights into people's stance towards the new vaccine. Governments can plan their strategies around people's points of view about the vaccine to make vaccination drives successful. The evaluation score of our submitted run is reported in terms of accuracy and macro-F1 score. We achieved an accuracy of 0.576 and a macro-F1 score of 0.582, ranking first among all submissions.

Keywords: Sentiment Analysis, COVID-19 Vaccine Tweets, COVID-Twitter-BERT

1. Introduction

Today the world is fighting its most challenging battle in the form of the COVID-19 pandemic. Over the years, vaccines have proven to be a very safe and effective way to fight and eradicate infectious diseases by providing people with immunity against viruses. The race to discover new and effective vaccines thus made it possible to deliver a COVID-19 vaccine to the world so quickly. People are using social media sites like Twitter to discuss the vaccine as it is distributed around the globe. Discussions of vaccination progress, accessibility, efficacy, and side effects are ongoing, and people hold both positive and negative views about the vaccine. It is helpful for governments and health organizations like the WHO to know what people think about the new COVID-19 vaccines. They can use the insights from these micro-blogs to plan their future strategies and encourage everyone to get fully vaccinated.

It is complex but also imperative to stop the spread of misinformation about the COVID-19 vaccine. Governments are trying to stop both the pandemic and the growing infodemic around the vaccine. Twitter likewise tries to ban tweets that contain incorrect or misleading information about the virus, its preventive measures, and treatments. Manual classification of tweets is tedious and error-prone. Hence, there is a pressing need for machine learning models that can help us classify tweets about the COVID-19 vaccines.

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
csy207657@cse.iitd.ac.in (S. Bithel); csy207575@cse.iitd.ac.in (S. Verma)
https://shivangibithel.github.io/ (S. Bithel)
ORCID: 0000-0002-6152-4866 (S. Bithel)

2. Task

In this paper we present an effective approach to Task 2, "Building an effective classifier for 3-class classification on tweets regarding people's stance towards COVID-19 vaccines", organized as part of the IRMiDis (Information Retrieval from Microblogs during Disasters) Track at FIRE (Forum for Information Retrieval Evaluation) 2021.

The tweets are classified into the 3 classes described below:

• AntiVax - The tweet is against the use of vaccines.
• ProVax - The tweet supports/promotes the use of vaccines.
• Neutral - The tweet does not have any discernible sentiment expressed towards vaccines or is not related to vaccines.

An example for each class of tweets is given below:

• AntiVax tweet: "They can have their vaccine, I want the right to say no- not in my body. We will only have that right under Donald J Trump https://t.co/MrfDSMm6JB"
• ProVax tweet: "Best news of the year so far, well at least for the last 34 weeks! One of the many vaccines against COVID19 being developed and looking extremely positive. We can start to see the light for 2021!!"
• Neutral tweet: "Will you REFUSE the Pfizer vaccine even if it means losing your job?"

3. Related Work

Users post content on microblogs like Twitter for various purposes, including their sentiments about the coronavirus, COVID-19 vaccines, and vaccination drives. Information extraction from these textual tweets is a very popular part of social computing. Traditional machine learning methods such as the Naive Bayes classifier, linear classifiers, and Support Vector Machines, as well as deep neural methods such as Long Short-Term Memory networks (LSTMs) and bidirectional RNNs, are very successful for text classification. More recent language models for natural language processing include BERT (Bidirectional Encoder Representations from Transformers) [1] and its domain-specific version CT-BERT (COVID-Twitter-BERT) [2].

3.1. BERT

BERT is a very powerful transformer-based architecture that generalizes well to many natural language processing tasks. Using BERT, deep bidirectional representations can be pre-trained from unlabeled text, which retains more information about the context and flow of the text. The model is pre-trained using the Masked Language Modelling (MLM) and Next Sentence Prediction tasks. The BERT model can be fine-tuned for various tasks by adding an additional output layer, giving state-of-the-art performance.

4. Dataset

The training dataset provided during the track contains 2,792 tweets extracted from [3] on people's stance towards the COVID-19 vaccine, crawled between November and December 2020. It contains the tweets along with their tweet IDs and class labels. The test dataset contains 1,600 tweets crawled using vaccine-related terms between March and December 2020; it contains the tweets along with their tweet IDs.

Our approach used the dataset by Müller et al. [4] and crawled Twitter for additional information. We augmented the dataset of Müller et al. [4] with attributes such as screen name, retweet count, followers count, friends count, status count, verified status, and the name of the user associated with each given tweet and tweet ID, using the Python library Tweepy [5], in order to observe various trends in the data; a sketch of this metadata collection is given below.
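The following is a minimal sketch of this metadata collection, not our exact crawling script. It assumes Tweepy 3.x and valid Twitter API v1.1 credentials; the credential values and the fetch_metadata helper are placeholders introduced for illustration.

```python
import tweepy

# Placeholder credentials; a real run needs registered Twitter API keys.
CONSUMER_KEY, CONSUMER_SECRET = "consumer-key", "consumer-secret"
ACCESS_TOKEN, ACCESS_SECRET = "access-token", "access-secret"

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_metadata(tweet_ids):
    """Look up tweets in batches of 100 (the v1.1 statuses/lookup limit)
    and collect the user-level attributes used in the trend analysis."""
    rows = []
    for i in range(0, len(tweet_ids), 100):
        for status in api.statuses_lookup(tweet_ids[i:i + 100]):
            rows.append({
                "tweet_id": status.id,
                "screen_name": status.user.screen_name,
                "name": status.user.name,
                "retweet_count": status.retweet_count,
                "followers_count": status.user.followers_count,
                "friends_count": status.user.friends_count,
                "statuses_count": status.user.statuses_count,
                "verified": status.user.verified,
            })
    return rows
```

Batching the lookups keeps the number of API calls low, and wait_on_rate_limit makes the client sleep through Twitter's rate-limit windows instead of failing.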
4.1. Trends in the dataset

Based on the given and collected information, the following trends were observed in the dataset:

• The training dataset includes 36.1% Neutral, 35.5% ProVax, and 28.4% AntiVax tweets.
• People with more than 10,000 followers tend to post 58.1% Neutral, 32.7% ProVax, and only 9.2% AntiVax tweets.
• People with verified Twitter accounts tend to post 49.4% Neutral, 41.0% ProVax, and only 9.5% AntiVax tweets.
• People with more than 1,000 friends tend to post 37.2% Neutral, 36% ProVax, and 26.8% AntiVax tweets.
• People with more than 10,000 statuses tend to post 42.6% Neutral, 31% ProVax, and 26.4% AntiVax tweets.
• The most common words in the dataset include "vaccine", "vaccines", "covid19", "news", "Pfizer", and "coronavirus", each with more than 500 mentions.
• The most common accounts tagged in the tweets are "@realdonaldtrump" and "@pfizer", each with more than 50 mentions.

The test data was annotated by three human annotators; a label was assigned on unanimous agreement or majority agreement (2 out of 3) among the given labels.

5. Pre-processing

Following prior experience with NLP tasks [6], we pre-processed the tweets in order to improve the quality of the word embeddings produced by BERT. Tweets generally contain unique lexical items such as HASHTAGS, @USER mentions, HTTP URLs, and EMOJIS, which, without pre-processing, often reduce the performance of the model. Thus, we used the following data cleaning pipeline as part of pre-processing the tweets in the dataset (a sketch of the pipeline follows the list):

• Remove stop words: A stop word is a commonly used word such as "the", "a", "an", or "in" that does not provide any valuable information. We remove stop words in order to give more focus to the important information.
• Convert words to lower case: Tweets are written casually; by lower-casing every word we keep only a single version of each word, which improves the text analysis.
• Convert emoticons to words: Emojis are used extensively on Twitter to express feelings and emotions. Removing them completely discards a lot of sentiment information; thus, we converted the emojis to text and retained their meaning using the 'emoji' library (https://pypi.org/project/emoji/).
• Expand contractions: In order to standardize our text, each contraction is converted to its expanded, original form. We used the 'contractions' library (https://pypi.org/project/contractions/) to expand words like "don't" to "do not".
• Remove non-alphanumeric characters: We removed all non-alphanumeric characters such as brackets, colons, semicolons, and the @ symbol.
• Remove URLs: URLs do not help in sentiment analysis; thus, we removed them from the text with a regular expression.
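A minimal sketch of this cleaning pipeline is given below. It assumes the 'emoji', 'contractions', and NLTK packages are installed (with the NLTK English stop-word list downloaded); the clean_tweet helper and the exact regular expressions are illustrative rather than our exact script.

```python
import re

import contractions                 # https://pypi.org/project/contractions/
import emoji                        # https://pypi.org/project/emoji/
from nltk.corpus import stopwords   # assumes nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    """Apply the cleaning steps listed above to a single tweet."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)       # remove URLs
    text = emoji.demojize(text, delimiters=(" ", " "))  # emojis -> words
    text = contractions.fix(text)                       # "don't" -> "do not"
    text = text.lower()                                 # lower-case
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # drop non-alphanumerics
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

# Example:
# clean_tweet("don't miss this #vaccine news https://t.co/x 😀")
# -> "miss vaccine news grinning face"
```

Note that URL removal and emoji conversion happen before the non-alphanumeric filter, since the filter would otherwise destroy the URL pattern and the demojized tokens.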
6. Methodology

6.1. Model

COVID-Twitter-BERT (CT-BERT): CT-BERT [2] is a domain-specific transformer-based model, pre-trained on a large corpus of tweets posted between January 12 and April 16, 2020, on the topic of COVID-19. It is initialized with the BERT-Large weights and further pre-trained on 160M tweets about the coronavirus. The tweets were pseudonymized by replacing all Twitter usernames with a common text token, and all emoticons in the tweets were likewise replaced with English words. We specifically used this model because BERT-Large is trained on Wikipedia data, and a model pre-trained in the same domain as our task, i.e., COVID-19-related tweets, would intuitively give better results upon fine-tuning with the given training data.

6.2. Experimental Setup

We first shuffled the training data, then split it into training and validation sets in the ratio 80:20 such that the percentage of instances of each class was preserved in both sets. Both training and validation instances were pre-processed as explained in Section 5. The resulting training data was used for fine-tuning CT-BERT [2], while the validation data was used for evaluation. In order to prevent overfitting, we used early stopping that monitored the validation loss with a patience value of 3.

6.3. Prediction

For the prediction over the available test data, we used the fine-tuned CT-BERT model as a text classification model: it generates an embedding for each tweet and predicts a probability score for that tweet against each of the three classes. The class with the maximum probability was reported as the predicted class for that tweet. The final prediction file, containing the tweet ID and the predicted class, was submitted as our run for Task 2. An end-to-end sketch of the fine-tuning and prediction steps is given below.
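The following is a minimal end-to-end sketch of the fine-tuning and prediction steps, assuming the Hugging Face Transformers implementation and the publicly released CT-BERT checkpoint. The variables texts, labels, and test_texts (pre-processed tweets and integer labels, assumed to be loaded), the label order, and all hyperparameters other than the 80:20 stratified split and the patience of 3 are assumptions for illustration, not our exact configuration.

```python
import numpy as np
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

# Public CT-BERT checkpoint on the Hugging Face Hub.
MODEL = "digitalepidemiologylab/covid-twitter-bert-v2"
# Assumed label order; it must match the integer encoding of `labels`.
LABELS = ["AntiVax", "ProVax", "Neutral"]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# `texts`/`labels` are the pre-processed training tweets and integer labels,
# `test_texts` the pre-processed test tweets (all assumed to be loaded).
# Stratified 80:20 split preserves the class proportions in both sets.
train_x, val_x, train_y, val_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

def encode(xs, ys):
    enc = tokenizer(xs, truncation=True, padding=True, max_length=128)
    return Dataset.from_dict({**enc, "labels": ys})

args = TrainingArguments(
    output_dir="vaccinebert",
    num_train_epochs=10,              # an assumed upper bound on epochs
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encode(train_x, train_y),
    eval_dataset=encode(val_x, val_y),
    # Stop if the validation loss does not improve for 3 evaluations.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()

# Prediction: take the argmax over the class scores for each test tweet
# (dummy zero labels are supplied because `encode` expects a labels column).
logits = trainer.predict(encode(test_texts, [0] * len(test_texts))).predictions
predicted = [LABELS[i] for i in np.argmax(logits, axis=1)]
```

Because metric_for_best_model is the validation loss and load_best_model_at_end is set, the checkpoint from the best epoch is restored before prediction, matching the early-stopping setup described above.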
7. Evaluation

Task 2 of the IRMiDis Track is evaluated using overall accuracy and the macro-F1 score over the three classes as metrics. The result of our submitted automated run for Task 2 is shown in Table 1. VaccineBERT obtained the 1st rank among all submissions, with an overall accuracy of 0.576 and a macro-F1 score of 0.582.

Table 1: Result of Task 2

| Sr No. | Team_ID | Accuracy | macro-F1 score | Rank |
|--------|---------|----------|----------------|------|
| 1      | IR_IITD | 0.576    | 0.582          | 1    |

8. Conclusion and Future Work

This paper uses COVID-Twitter-BERT, a transformer-based model pre-trained on a large corpus of COVID-19-related tweets, to classify tweets as ProVax, AntiVax, or Neutral. We observed that the transformer-based model outperformed traditional natural language processing classifiers, namely Naive Bayes, Logistic Regression, and Support Vector Machines, as the word embeddings computed by the former are more expressive and yield better results on the task. We further propose to look into data augmentation strategies for improving the performance of our model, since transformer-based models are data-hungry. Another addition could be to train the model adversarially to improve its robustness.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL: http://arxiv.org/abs/1810.04805.
[2] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, arXiv preprint arXiv:2005.07503 (2020).
[3] L.-A. Cotfas, C. Delcea, I. Roxin, C. Ioanăş, D. S. Gherai, F. Tajariol, The longest month: Analyzing COVID-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement, IEEE Access 9 (2021) 33203–33223. doi:10.1109/ACCESS.2021.3059821.
[4] M. M. Müller, M. Salathé, Crowdbreaks: Tracking health trends using public social media data and crowdsourcing, Frontiers in Public Health 7 (2019) 81. URL: https://www.frontiersin.org/article/10.3389/fpubh.2019.00081. doi:10.3389/fpubh.2019.00081.
[5] J. Roesslein, Tweepy: Twitter for Python!, URL: https://github.com/tweepy/tweepy (2020).
[6] S. Bithel, S. S. Malagi, Unsupervised identification of relevant prior cases, 2021. arXiv:2107.08973.