Vaccine Vision: A deep learning approach towards identifying societal concerns regarding vaccines

Kaustav Das¹, Shruti Biswas¹
¹ Amity University, Kolkata.

Abstract
In the wake of the COVID-19 pandemic, discerning public stances on vaccination - Pro-Vax, Anti-Vax, or Neutral - emerged as a pivotal undertaking. Leveraging machine learning on COVID-19 related tweets, this study focuses on categorizing individuals based on their vaccination perspectives. Looking beyond that surface, however, the analysis uncovers a mosaic of concerns within Anti-Vax sentiments that goes beyond a simple dichotomy. These concerns span a diverse spectrum, from conspiracies and political suspicions to multifaceted uncertainties. In response, this work employs a multi-label classification approach to identify and classify the varied concerns explicitly articulated within Anti-Vax tweets. By scrutinizing a corpus of COVID-19 related tweets, this study sheds light on the intricate landscape of vaccine hesitancy and on the multifaceted reasons underlying Anti-Vax sentiments.

Keywords
Twitter, microblogs, COVID-19, vaccine concerns, tweet, multi-label classification

1. Introduction
Vaccine hesitancy is defined as “delay in acceptance or refusal of vaccination despite the availability of vaccination services”[1]. It is seen as the primary cause of decreasing vaccination rates and the resurgence of vaccine-preventable illnesses in many countries. To combat the COVID-19 pandemic, researchers and pharmaceutical companies developed a number of vaccines using the S-protein of SARS-CoV-2. Despite these efforts, a number of concerns arose, which led to a significant slowdown in vaccination drives. In the fight against COVID-19, vaccine hesitancy has been identified as a significant hindrance[2].
Much of this hesitancy can be traced back to social media platforms. While the platforms themselves did not directly influence the general public, misinformation voiced through them by unregulated users caused mass paranoia and, eventually, distrust of public health services. Another contributing factor was that the vaccines being administered reportedly came with side effects ranging from mild to quite severe. Even before the emergence of SARS-CoV-2, the WHO had already highlighted vaccine hesitancy as one of the ten leading threats to global health[3]. Vaccine hesitancy is therefore a significant and complex phenomenon that cannot be ignored. To address it, a thorough analysis of the concerns of the general public needs to be done. By virtue of social media, these concerns have been voiced by people in the form of tweets and other social media posts. Therefore, in this paper, with the help of deep learning techniques and natural language processing, we build a multi-label classifier to identify some of the concerns expressed in these tweets. Most of the tweets focus on COVID-19 vaccines such as Moderna, Pfizer, and AstraZeneca, and most of them voice more than one concern (label), hence the need for a multi-label classifier.

FIRE’23: Forum for Information Retrieval Evaluation, December 15-18, 2023, India
kaustav.das1@s.amity.edu (K. Das); saptarshi.ghosh@gmail.com (S. Biswas)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Dataset
With our requirements and concerns (labels) defined, we now turn to the dataset. The dataset used for this work is "CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines"[4]. It has been provided as part of the FIRE conference [5]. Here are some examples of tweets per label:

CAVES Dataset
Original Tweet | Labels
@SenSanders We have immune systems. Vaccines are filled with animal and human DNA, disinfectants, heavy metals, chemicals. For over 30 years the US has paid out over 4 billion In compensations for deaths and injuries because the US gave Pharma liability freedom in 1986. #VaccinesKill | ingredients, pharma, side-effect
@barbieashdown @leslieh707 Oh my! One does not have to be an expert at reading body language to know he is covering something up. Depopulation is his game. No CV19 vax for me. No Moderna vax for anyone. | conspiracy
@PetenyiSandor @katka_cseh Your choice, go and get a Chinese vaccine. I will wait for vaccines which are approved by the EU. We will have enough by the summer. Let’s see how much the Chinese actually will deliver, how many people willing to accept it and if Orban can organize mass vaccination. I doubt it. | country
Canadians who received Vaccine vaccine excluded from seeing Bruce Springsteen, as Broadway opens up via people are against vaccine but Dr. Bonnie Henry stands by all vaccines all are safe seems they have no idea if they are or not folks | side-effect, ineffective
@TraceyLouise @slimschutte @AefreBetty @jon_severs @StevenCross81 Ah yes, that good old fashioned vaccine that was trialled over the correct amount of time, and not forced upon every living being on the planet... Even without this new thing, if you’re healthy, you’ll still be fine. | unnecessary, rushed, mandatory
Big Pharma spent close to 7 Billion euros on lobbying in France...which as we know is utterly corrupted in terms of Covid response: blocked hydroxychloroquine - like all other Western countries - and embraced remdesivir and vaccines. | pharma
@realDonaldTrump Guess who’s gonna get rich before he leaves the white house off of 71 million people he knows will take a shot in the arm if he says so? That’s right. Donald Trump now wants science on his side. How much stock in Pfizer did you buy huh? | political
AUTHORITARIAN NAZI-wannabes of Democrat party are COMPLETELY ANTI VOTER ID to PREVENT FRAUD. Those very same LIBERAL TYRANTS want YOU to be DIGITALLY marked under the pretense of getting a "vaccine." NO THANKS! VOTER ID or NO VOTE! Keep both your vaccine and SATAN’S MARK! | political, religious, conspiracy
Now about this Pfizer "vaccine". Who is the government tryna give this to again. By my calculations, the people who would want it are the folks who never got Covid19. Be careful of folks giving you a vaccine that will "protect" you from a virus for which you already had it. https://t.co/zdJqKENupA. | none
@NeilClark66 Mad fascist dictator Johnson now wants to force you to take an experimental vaccine with no long-term safety profile. All this for a virus with 0.3% IFR. Still think this is about a virus? 1922 committee must step in NOW and REMOVE this communist lunatic! | mandatory, rushed, unnecessary
Table 1: Original tweet representation for each label.

2.1. Data Pre-processing
For the pre-processing step, standard text-cleaning steps were followed, in which elements such as URLs, usernames, and emojis were removed or replaced. Here are the steps:
• Tweet id columns were dropped.
• The label columns were binarized (one-hot encoded) using the sklearn MultiLabelBinarizer so that they can be used readily in the subsequent machine learning steps. (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html)
• Text standardization:
  – Removing HTML elements
  – Replacing non-standard punctuation with the standard version
  – Replacing '\r', '\n' and '\t' with white spaces
  – Removing all control characters
  – Removing duplicate white spaces
• Expanding contractions in the text using the contractions library (https://pypi.org/project/contractions/)
• Replacing usernames and URLs with the fillers ’username’ and ’url’
• Removing unicode and accented characters from the text
• Replacing emojis with their meaning using the emoji.demojize() method of the emoji library (https://pypi.org/project/emoji/)
These steps also follow the standard cleaning steps used for CT-BERT[6], in order to obtain the best-quality BERT embeddings (see footnote 1).

Cleaned Data
Original Tweet | Cleaned Tweet
@PaolaQP1231 Well, I mean congratulations Covid19 for being the first ever âœthingâ to eradicate influenza. In other news, Covid vaccines will spur the rise of influenza In 2021-2022 season. Influenza will be returning for a shot at the title belt. Nov 2, 2021 on pay per view. Order today. | username well, i mean congratulations covid19 for being the first ever "thing" to eradicate influenza. in other news, covid vaccines will spur the rise of influenza in 2021-2022 season. influenza will be returning for a shot at the title belt. nov 2, 2021 on pay per view. order today.
Table 2: Original tweet vs. cleaned tweet (with NLP)

3. Methodology
Before training the model, the class imbalance present in our dataset must be handled. To do so, a custom loss function called Distribution-Balanced Loss (DB-Loss) [7] was used.

COVID-Twitter-BERT (CT-BERT)[6]: CT-BERT is a domain-specific transformer-based model, pre-trained on a sizable corpus of tweets about COVID-19 that were posted between January 12 and April 16, 2020.
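The text-cleaning steps listed in Section 2.1 can be sketched roughly as follows. This is a minimal approximation using only the Python standard library, not the code used in this work: the function name `clean_tweet`, the regular expressions, and the punctuation map are illustrative assumptions. Contraction expansion and emoji replacement are omitted because they rely on the third-party contractions and emoji packages, and the final lowercasing is inferred from the cleaned tweet in Table 2 rather than from the listed steps.

```python
import html
import re
import unicodedata

# Illustrative patterns (assumptions, not the patterns used in the paper).
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_USER_RE = re.compile(r"@\w+")

# Map a few non-standard punctuation marks to their standard versions.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
})


def clean_tweet(text: str) -> str:
    """Approximate the Section 2.1 cleaning steps (hypothetical helper)."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = _TAG_RE.sub(" ", text)              # remove HTML elements
    text = text.translate(_PUNCT_MAP)          # standardize punctuation
    text = re.sub(r"[\r\n\t]", " ", text)      # line breaks and tabs -> spaces
    # Remove the remaining control characters (Unicode category "Cc").
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    text = _URL_RE.sub("url", text)            # URLs -> filler 'url'
    text = _USER_RE.sub("username", text)      # @handles -> filler 'username'
    # Strip accents, then drop any characters that are still non-ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\s+", " ", text).strip()   # collapse duplicate white space
    return text.lower()                        # lowercasing inferred from Table 2


if __name__ == "__main__":
    print(clean_tweet("@SomeUser Well,\tthe \u201cvaccine\u201d https://t.co/zdJqKENupA"))
```

The order of operations matters in this sketch: tabs and line breaks are turned into spaces before control characters are dropped, since they would otherwise be deleted outright and adjacent words would be fused.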
CT-BERT is initialized with BERT-Large[8] weights and then further pre-trained using 160 million tweets regarding the coronavirus. The major advantage of using CT-BERT is that all of the vaccination tweets in our dataset concern COVID-19, for which CT-BERT produces domain-specific, high-quality embeddings. CT-BERT was our primary model and performed best on the classification task. Other models, such as Bi-LSTM[9] and BERT variants like RoBERTa, were also considered but did not perform as well.

Footnote 1: CT-BERT is the final model used for classification; it is described in the Methodology section.

4. Evaluation
The primary evaluation metric for model performance was macro-F1. The model was evaluated on the held-out validation set comprising 1984 tweets, with tweet instances from each label class.

Model | macro-F1 score | micro-F1 score
Bi-LSTM (Multi-SMOTE + word2vec) | 0.79 | 0.82
Baseline CT-BERT (without DB-Loss) | 0.55 | 0.72
CT-BERT (with DB-Loss, without cleaning) | 0.78 | 0.81
CT-BERT (with DB-Loss) | 0.88 | 0.89
Table 3: Performance of classifiers on the validation set.

From these results, it can be concluded that CT-BERT with DB-Loss and proper cleaning produces the best result, achieving the highest macro-F1 score.

5. Results
To assess the effectiveness of our model, we tested it on a previously unseen distribution of vaccine-related tweets. This dataset (see footnote 2) contains tweets associated with COVID-19 vaccines as well as some non-COVID vaccines. Here is the performance of CT-BERT with DB-Loss on it:

Run File | Methodology | Macro-F1 score | Jaccard score
run_submission.csv | CT-BERT with DB-Loss | 0.71 | 0.70
Table 4: CT-BERT model performance

This is a fairly respectable performance, considering that the dataset consisted of many instances previously unseen by our model. The dataset and code for model training and setup of our experiment can be found in this repository.

Footnote 2: This dataset has been provided by FIRE [5] as part of model evaluation for each team.

6. Conclusion
This work further supports the effectiveness of pre-trained models like CT-BERT when combined with techniques like Distribution-Balanced Loss. From a societal angle, the work also shows the extent to which the seemingly exhausting task of manually analyzing tweets can be automated with current AI systems, helping the concerned authorities act on and mitigate the concerns of the general public. An understanding of societal concerns on a global scale can thus be supported by deep learning and NLP, suggesting that AI can be a fast and impactful aid for problems of this kind.

6.1. Future Work
The next step would be to build an AI system with the model developed here running in the background. The model’s performance could also be improved by performing the classification in two levels: first separating out the ’none’-labeled tweets, and then classifying the remaining tweets at the second level.

References
[1] N. E. MacDonald, et al., Vaccine hesitancy: Definition, scope and determinants, Vaccine 33 (2015) 4161–4164.
[2] D. S. Courtney, A.-M. Bliuc, Antecedents of vaccine hesitancy in WEIRD and East Asian contexts, Frontiers in Psychology 12 (2021) 747721.
[3] E. Robertson, K. S. Reeve, C. L. Niedzwiedz, J. Moore, M. Blake, M. Green, S. V. Katikireddi, M. J. Benzeval, Predictors of COVID-19 vaccine hesitancy in the UK household longitudinal study, Brain, Behavior, and Immunity 94 (2021) 41–50. URL: https://www.sciencedirect.com/science/article/pii/S0889159121001100. doi:10.1016/j.bbi.2021.03.008.
[4] S. Poddar, A. M. Samad, R. Mukherjee, N. Ganguly, S. Ghosh, CAVES: A dataset to facilitate explainable classification and summarization of concerns towards COVID vaccines, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 3154–3164.
[5] S. Poddar, M. Basu, K. Ghosh, S. Ghosh, Overview of the FIRE 2023 track: Artificial Intelligence on Social Media (AISoMe), in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, 2023.
[6] M. Müller, M. Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, Frontiers in Artificial Intelligence 6 (2023). URL: https://www.frontiersin.org/articles/10.3389/frai.2023.1023281. doi:10.3389/frai.2023.1023281.
[7] T. Wu, Q. Huang, Z. Liu, Y. Wang, D. Lin, Distribution-balanced loss for multi-label classification in long-tailed datasets, 2021. arXiv:2007.09654.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[9] R. C. Staudemeyer, E. R. Morris, Understanding LSTM – a tutorial into long short-term memory recurrent neural networks, 2019. arXiv:1909.09586.