Analysing Crowd-Sourced Vaccine Data Using Machine Learning: Uncovering Concerns and Insights

Lakshmi S. Gopal1, Aswathy A.2, Krishnendu K.3 and Hemalatha Thirugnanam4

1 Center for Wireless Networks & Applications (WNA), Amrita Vishwa Vidyapeetham, Amritapuri, India
2 Center for Wireless Networks & Applications (WNA), Amrita Vishwa Vidyapeetham, Amritapuri, India
3 Center for Wireless Networks & Applications (WNA), Amrita Vishwa Vidyapeetham, Amritapuri, India
4 Center for Wireless Networks & Applications (WNA), Amrita Vishwa Vidyapeetham, Amritapuri, India

Abstract
The rapid development of the Covid-19 vaccines and concerns about their safety contributed to vaccine hesitancy globally. Social media platforms circulate information about such global concerns and are a good source for investigating public opinion. This study proposes a machine learning based analysis of public opinion on Covid-19 vaccines using Twitter data, in which each tweet is assigned one or more labels, each describing a specific concern. We experimented with supervised learning algorithms wrapped in multi label classifier algorithms. We achieved an average micro F1 score of 62%, which suggests there is room for improvement.

Keywords
Covid Vaccines, Machine Learning, Social Media

1. Introduction

Vaccination is a highly effective public health strategy, saving lives and reducing disease burden. However, vaccine hesitancy persists, especially in the digital age with easy access to both credible and misleading information. Crowdsourced data platforms now allow individuals to share their vaccine-related experiences and concerns, offering valuable insights into this issue.

Machine learning has become a vital tool in public health and epidemiology. It can analyse large datasets, uncover patterns, and provide insights that traditional methods struggle to achieve.
Machine learning algorithms can sift through vast amounts of unstructured text data from social media and other platforms to reveal patterns and concerns related to vaccines.

This paper aims to contribute to the growing body of knowledge on vaccine hesitancy and public health by presenting an analysis of crowdsourced vaccine data using machine learning techniques. The study is motivated by the view that a deeper understanding of the concerns individuals express about vaccines through crowdsourcing can inform targeted public health interventions and communication strategies. By leveraging machine learning algorithms, we aim to shed light on the dynamics of vaccine hesitancy, ultimately contributing to more effective vaccination campaigns and improved public health outcomes. Additionally, understanding these perceptions within communities, states, and the nation across different time frames can provide precise data for designing specialised strategies to enhance immunisation education programs and public health campaigns.

In the subsequent sections of this paper, we describe the methods employed, present our findings, and discuss the implications of our analysis.

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
Email: lakshmisgopal@am.amrita.edu (L. S. Gopal); aswathynaik@am.amrita.edu (A. A.); krishnenduk@am.amrita.edu (K. K.); hemalathat@am.amrita.edu (H. Thirugnanam)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related Work

The use of machine learning for analysing vaccine-related concerns is highly significant in the current era of pandemics, and numerous studies have explored this field.
This paper specifically concentrates on developing a predictive model for assessing public sentiment-related concerns, primarily sourced from social media platforms. One study highlighted the use of social media bots to intentionally sow discord and confusion regarding vaccination, potentially dissuading people from getting vaccinated [1]. Another study emphasises the need to confront and counteract rumours and conspiracy theories in public health campaigns, particularly in the context of mitigating vaccine hesitancy and ensuring the success of vaccination initiatives [2]. Its authors emphasise that vaccine hesitancy is driven by factors such as technological change and political disempowerment, and that addressing these issues requires long-term efforts from multiple stakeholders. Long-term vaccine confidence is gauged by public trust in vaccine delivery institutions.

A study conducted from April to August 2019 aimed to develop and validate deep learning models to understand public perceptions of the HPV vaccine using data from social media [3]. The study used data collected from January 2014 to October 2018 and analysed social media discussions in terms of the health belief model and the theory of planned behaviour. The results showed trends in constructs such as perceived barriers, positive attitudes towards the HPV vaccine, and negative attitudes. Interstate variations in public perceptions were also identified. The study provides a good understanding of public perceptions on social media and of evolving trends, which may influence local anti-vaccine sentiment.

A study [4] examining vaccine sentiment on social media revealed that vaccine hesitancy contributes to suboptimal vaccination coverage in the United States. The study analysed semantic networks of vaccine information from Twitter users in the US, identifying positive, negative, and neutral sentiment.
Positive sentiment focused on parents and health risks, while negative sentiment focused on children and organisational bodies. The study suggests that analysing vaccine sentiment on social media can help explain the complex drivers of vaccine hesitancy and improve public health communication, ultimately improving vaccine confidence and coverage in the US.

Moreover, a study has demonstrated the efficient collection and preprocessing of Twitter data, covering information related to vaccines as well as disaster-related data [5]. Another work highlights the importance of leveraging machine learning and artificial intelligence across diverse emergency situations. These technologies enhance emergency preparedness and response, and they enable data-driven decision-making, resource allocation, and predictive modelling to mitigate the impact of an emergency on affected populations and infrastructure [6].

3. Task

We aim to develop a multi label classifier for public opinion tweets about the Covid-19 vaccines, a task proposed as part of the AISoMe (Artificial Intelligence on Social Media) track [7][8] at FIRE (Forum for Information Retrieval Evaluation) 2023. The classifier labels a tweet according to the specific concern(s) about vaccines expressed by the respective Twitter user. A tweet can have more than one label (concern); e.g., a tweet expressing 3 different concerns about vaccines will have 3 labels. As labels for the classification task, we consider the following concerns about vaccines:

• Unnecessary: The tweet implies that immunisations are not necessary or that better alternative treatments exist.
• Mandatory: The tweet advocates against making vaccinations mandatory.
• Pharma: The tweet implies that big pharmaceutical firms are only out to make a profit.
• Conspiracy: The tweet raises the possibility of a larger conspiracy than merely big pharma’s desire for profit.
• Political: The tweet raises fears that governments and politicians are using vaccines as a tool to advance their own agendas.
• Country: The tweet criticises a vaccine because of the nation where it was created or produced.
• Rushed: The tweet raises questions about whether the vaccines have undergone adequate testing or whether the available data is reliable.
• Ingredients: The tweet highlights concerns about the vaccine contents or the technology utilised.
• Side-effect: The tweet expresses worry about vaccine side effects, including deaths that may result.
• Ineffective: The tweet expresses worry that the vaccines are useless because they are ineffective in some cases.
• Religious: The tweet opposes vaccinations for religious reasons.
• None: No explicit justification is provided in the tweet.

Figure 1: One hot encoded data set - The ‘tweet’ column is taken from the given dataset. The rest of the columns are created programmatically and the values ‘1’ and ‘0’ represent the presence and absence of the label respectively.

4. Methodology

The proposed methodology performs multi label classification on the given dataset. The literature [9][10][11] describes various machine learning based methods for multi label classification. We experimented with a problem transformation method, namely classifier chains, which transforms a multi label classification problem into multiple binary classification problems.

4.1. Exploratory Data Analysis (EDA)

To understand and interpret the given data in depth, we begin with an EDA. The given data initially had 3 columns: ‘ID’, ‘tweet’ and ‘labels’. One hot encoded labels are appropriate for a multi label classification problem, so the ‘labels’ column was replaced by one binary column per label. The given data contained no null or NaN values.
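The one hot encoding step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the ‘labels’ column stores the concerns of a tweet as a space-separated string (the paper does not state the exact format), and the names `LABELS`, `one_hot` and `encode_rows` are hypothetical.

```python
# The 12 concern labels from section 3, in a fixed column order (order is an assumption).
LABELS = [
    "unnecessary", "mandatory", "pharma", "conspiracy", "political", "country",
    "rushed", "ingredients", "side-effect", "ineffective", "religious", "none",
]


def one_hot(label_string, labels=LABELS):
    """Turn e.g. 'rushed pharma' into a list of 0/1 flags, one per label."""
    present = set(label_string.split())
    unknown = present - set(labels)
    if unknown:
        raise ValueError(f"unexpected label(s): {sorted(unknown)}")
    return [1 if name in present else 0 for name in labels]


def encode_rows(rows):
    """Replace the 'labels' field of each row with one binary column per label."""
    encoded = []
    for row in rows:
        flags = one_hot(row["labels"])
        encoded.append({"ID": row["ID"], "tweet": row["tweet"], **dict(zip(LABELS, flags))})
    return encoded


# Invented example row, for illustration only.
sample = [{"ID": "1", "tweet": "example tweet text", "labels": "side-effect ineffective"}]
encoded_sample = encode_rows(sample)
```

Each encoded row keeps ‘ID’ and ‘tweet’ and gains 12 indicator columns, where ‘1’ marks a concern that is present and ‘0’ one that is absent.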
The one hot encoded dataset contains 9921 rows and 14 columns. Figure 1 shows a sample of the one hot encoded dataset. Each tweet is labelled with the concerns about Covid vaccines that it expresses (see section 3), using 12 labels. Figure 2 (left) shows the number of tweets categorised under each label. From the figure we can see that the label ‘side-effect’ is the most frequent and ‘religious’ is the least frequent.

A tweet could be categorised by 1, 2 or 3 labels. Figure 2 (right) shows the proportion of tweets categorised under a single label, two labels or three labels. From the figure we can see that the majority of the tweets were categorised by a single label.

To understand the vocabulary used in the tweets, word clouds were generated for the most frequent and least frequent labels in the dataset, which are ‘side-effect’ and ‘religious’ respectively. Figure 3 shows the generated word clouds. From the word clouds we can observe that the terms ‘vaccine’, ‘covid’ and ‘pfizer’ occur frequently. Keywords related to ‘side-effect’, such as ‘death’, ‘adverse reaction’ and ‘blood clot’, occur frequently in the ‘side-effect’ word cloud. Keywords related to the ‘religious’ label, such as ‘religion’, ‘faith’ and ‘psalm’, were found in the ‘religious’ word cloud but were less frequent.

Figure 2: Analysis of data - The bar graph (left) shows the number of tweets categorised under a label. The x-axis represents the label name and y-axis represents the number of tweets under each label. The pie chart (right) shows the percentage of tweets categorised with 1, 2 or 3 labels.

Figure 3: Analysis of data - word clouds generated for the labels ‘side-effect’ (left) and ‘religious’ (right).

4.2. Data Preprocessing

Basic data cleaning methods were applied to the tweet data. Before removing the unwanted text, we tokenize each tweet into words and convert every character to lowercase.
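The cleaning steps of this subsection (lowercasing, tokenising, and removing URLs, stopwords other than ‘not’ and ‘no’, special characters and smileys) can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: a small hard-coded stopword set stands in for the NLTK stopword list so that the sketch runs without downloads, URLs are stripped before whitespace tokenisation for simplicity, and the names `STOPWORDS`, `KEEP` and `clean_tweet` are hypothetical.

```python
import re

# Stand-in for a full stopword list such as NLTK's (assumption: kept tiny for the sketch).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "not", "no"}
# Negations are retained even though stopword lists usually contain them.
KEEP = {"not", "no"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # special characters, emoji, punctuation


def clean_tweet(text):
    """Return the cleaned tokens of one tweet."""
    text = text.lower()                 # lowercase every character
    text = URL_RE.sub(" ", text)        # drop image/video/link URLs
    text = NON_ALNUM_RE.sub(" ", text)  # drop special characters and smileys
    tokens = text.split()               # simple whitespace tokenisation
    return [tok for tok in tokens if tok not in STOPWORDS or tok in KEEP]


# Invented example tweet, for illustration only.
example_tokens = clean_tweet("The vaccine is NOT safe!! 😷 https://t.co/abc123")
```

Keeping ‘not’ and ‘no’ matters here because removing them would turn "not safe" into "safe" and invert the concern being expressed.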
The cleaned tweet is then fed into the model, which reduces unnecessary processing. We use Python libraries such as NLTK and regular expressions to eliminate the following:

• URLs often found in a tweet (image and video URLs)
• Stopwords (the, a, is, etc.), except for ‘not’ and ‘no’, which are kept to maintain the context
• Special characters and smileys

Table 1
Performance evaluation of the 3 models (first run) using scoring metrics

Model                                               Accuracy  Precision  Recall  F1 micro
Classifier chain + Logistic Regression              0.42      0.67       0.45    0.54
Classifier chain + Support Vector Machines          0.49      0.67       0.55    0.61
Multi output classifier + Support Vector Machines   0.46      0.80       0.52    0.64

4.3. Model Creation for Multi Label Classification

In multi label classification, each data point can be associated with multiple labels. Among the various multi label classification approaches, we experimented with classifier chains and the multi output classifier.

A classifier chain starts with a set of binary classifiers, one for each label in the multi label classification problem. When making predictions for a new instance, the chain first predicts the first label using its binary classifier. It then uses this prediction, along with the instance’s features, to predict the second label. This process continues until all labels have been predicted. Any algorithm capable of binary classification can be wrapped in the classifier chain. We also experimented with the multi output classifier, a wrapper that takes a single-output classifier and extends it to work with multiple output labels.

We first split the dataset into training and test sets in a 70:30 ratio using stratification. Both the training and the held-out (test) data were preprocessed as described in section 4.2. The resulting data was used as the input for both the classifier chain and the multi output classifier models.
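The two wrappers described above can be sketched with scikit-learn, whose `ClassifierChain` and `MultiOutputClassifier` classes match the terminology used here. Several parts of this sketch are assumptions: the paper does not state the feature representation (TF-IDF is used below), the SVM variant (`LinearSVC` is used below) or any hyperparameters, and the toy tweets and three-label target are invented for illustration. The train-test split is omitted because the toy data is too small to stratify.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, already-cleaned tweets (invented; not taken from the dataset).
tweets = [
    "vaccine caused blood clot",
    "adverse reaction after second dose",
    "pharma only wants profit",
    "big pharma profit before safety",
    "vaccine rushed without testing",
    "trials rushed data not reliable",
    "rushed vaccine caused adverse reaction",
    "pharma rushed trials for profit",
]
LABELS = ["side-effect", "pharma", "rushed"]
# One row per tweet, one 0/1 column per label (one hot encoded targets).
Y = np.array([
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1], [1, 0, 1], [0, 1, 1],
])


def build_models():
    """Three pipelines mirroring the three experiments in this section."""
    return {
        # Chain: label k is predicted from the features plus labels 1..k-1.
        "chain_lr": make_pipeline(
            TfidfVectorizer(), ClassifierChain(LogisticRegression(max_iter=1000))),
        "chain_svm": make_pipeline(TfidfVectorizer(), ClassifierChain(LinearSVC())),
        # Multi output: one independent binary classifier per label.
        "multi_svm": make_pipeline(TfidfVectorizer(), MultiOutputClassifier(LinearSVC())),
    }


new_tweets = ["vaccine rushed and caused blood clot", "profit for pharma"]
predictions = {}
for name, model in build_models().items():
    model.fit(tweets, Y)
    predictions[name] = np.asarray(model.predict(new_tweets)).astype(int)
```

The practical difference is that the chain can exploit correlations between labels, at the cost of depending on the label order, while the multi output wrapper treats every label independently.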
We built two models with classifier chains, one wrapping a Logistic Regression model and the other a Support Vector Machines model. Both are widely used for classification problems. We built one model with the multi output classifier, wrapping a Support Vector Machines model. All 3 experiments showed moderate results in the initial phase, which led us to fine tune the models.

5. Results and Evaluation

The models are evaluated using accuracy, precision, recall and F1 score. Table 1 shows the evaluation results of the 3 models in the initial run. Table 3 shows the performance evaluation of the run files submitted to the AISoMe track. To evaluate the models better, we plotted the learning curves of the classifier chain models, which led us to fine tune them further. Figure 4 shows the learning curves of the Logistic Regression model wrapped in a classifier chain, before and after fine tuning. Figure 5 shows the learning curves of the Support Vector Machines model wrapped in a classifier chain, before and after fine tuning. Fine tuning improved the performance, but the curves also suggest the need for more training data. We achieved a slight improvement in the performance of the classifier chain models after fine tuning.

Figure 4: Learning curves of Logistic Regression wrapped in Classifier Chain. (a) shows the learning curve of the model before fine tuning. (b) shows the learning curve of the model after fine tuning.

Figure 5: Learning curves of Support Vector Machines wrapped in Classifier Chain. (a) shows the learning curve of the model before fine tuning. (b) shows the learning curve of the model after fine tuning.

Table 2
Performance evaluation of the classifier chain models using scoring metrics post fine tuning

Model                                        Accuracy  Precision  Recall  F1 micro
Classifier chain + Logistic Regression       0.48      0.68       0.54    0.60
Classifier chain + Support Vector Machines   0.51      0.67       0.57    0.62
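The metrics reported in Tables 1 to 3 can be illustrated on a small worked example. The sketch below assumes that ‘Accuracy’ means exact-match (subset) accuracy and that ‘Jacc’ is the Jaccard score averaged over tweets; the paper does not state either definition, and the label matrices are invented.

```python
def _counts(y_true, y_pred, col):
    """True positives, false positives and false negatives for one label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 0 and p[col] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 0)
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts of all labels, then compute a single F1."""
    per_label = [_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    return _f1(*(sum(c) for c in zip(*per_label)))


def macro_f1(y_true, y_pred):
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def subset_accuracy(y_true, y_pred):
    """Fraction of tweets whose whole label set is predicted exactly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def sample_jaccard(y_true, y_pred):
    """Mean over tweets of |true AND pred| / |true OR pred|."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        scores.append(inter / union if union else 1.0)  # convention: empty vs empty = 1
    return sum(scores) / len(scores)


# Four invented tweets, three labels.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]

scores = {
    "micro_f1": micro_f1(y_true, y_pred),          # 8/11, about 0.73
    "macro_f1": macro_f1(y_true, y_pred),          # 13/18, about 0.72
    "accuracy": subset_accuracy(y_true, y_pred),   # 0.5
    "jaccard": sample_jaccard(y_true, y_pred),     # 17/24, about 0.71
}
```

Under these assumptions, a partially correct prediction counts as a complete miss for subset accuracy but still earns credit in micro F1 and Jaccard, which is one possible reason the accuracy values in the tables sit well below the F1 values.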
Table 2 shows the performance results of the 2 fine tuned models.

6. Conclusion

This work uses Covid vaccine social media data that reflects public concerns about vaccination. We experimented with multiple models on the given data and chose the 3 best performing models to present in this report. We used a classifier chain model that wraps Support Vector Machines and Logistic Regression, and a multi output classifier that wraps Support Vector Machines. We achieved the highest score with the Multi Output Classifier, with a micro F1 score of 64%. The performance can be improved by improving the dataset, by other preprocessing methods, or by data augmentation strategies.

Table 3
Performance evaluation of submission models

Run File   Methodology                              Macro-F1  Jacc
Model 1    SVM wrapped in Multi Output Classifier   0.38      0.45
Model 2    LR wrapped in Classifier Chain           0.38      0.41
Model 3    SVM wrapped in Classifier Chain          0.3       0.43

7. Online Resources

The input data, test data and implemented Python code are available at https://github.com/lakshmiSGopal/AISOME-FIRE-2023.

References

[1] D. A. Broniatowski, A. M. Jamison, S. Qi, L. AlKulaib, T. Chen, A. Benton, S. C. Quinn, M. Dredze, Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate, American Journal of Public Health 108 (2018) 1378–1384.
[2] E. Pertwee, C. Simas, H. J. Larson, An epidemic of uncertainty: rumors, conspiracy theories and vaccine hesitancy, Nature Medicine 28 (2022) 456–459.
[3] J. Du, C. Luo, R. Shegog, J. Bian, R. Cunningham, J. Boom, G. Poland, Y. Chen, C. Tao, Use of deep learning to analyze social media discussions about the human papillomavirus vaccine, JAMA Network Open 3 (2020) e2022025. doi:10.1001/jamanetworkopen.2020.22025.
[4] G. J. Kang, S. R. Ewing-Nelson, L. Mackey, J. T. Schlitt, A. Marathe, K. M. Abbas, S. Swarup, Semantic network analysis of vaccine sentiment in online social media, Vaccine 35 (2017) 3621–3638.
[5] A. Aswathy, R. Prabha, L. S. Gopal, D. Pullarkatt, M. V. Ramesh, An efficient Twitter data collection and analytics framework for effective disaster management, in: 2022 IEEE Delhi Section Conference (DELCON), IEEE, 2022, pp. 1–6.
[6] J. Phengsuwan, T. Shah, N. B. Thekkummal, Z. Wen, R. Sun, D. Pullarkatt, H. Thirugnanam, M. V. Ramesh, G. Morgan, P. James, et al., Use of social media data in disaster management: a survey, Future Internet 13 (2021) 46.
[7] S. Poddar, M. Basu, K. Ghosh, S. Ghosh, Overview of the FIRE 2023 track: Artificial Intelligence on Social Media (AISoMe), in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, 2023.
[8] S. Poddar, A. M. Samad, R. Mukherjee, N. Ganguly, S. Ghosh, CAVES: A dataset to facilitate explainable classification and summarization of concerns towards Covid vaccines, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 3154–3164.
[9] C. Prathibhamol, G. Amala, M. Kapadia, Anomaly detection based multi label classification using association rule mining (ADMLCAR), in: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2016, pp. 2703–2707.
[10] C. Prathibhamol, K. Jyothy, B. Noora, Multi label classification based on logistic regression (MLC-LR), in: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2016, pp. 2708–2712.
[11] R. Ramanathan, K. Soman, P. Rohini, G. Dharshana, Investigation and development of methods to solve multi-class classification problems, in: 2009 International Conference on Advances in Recent Technologies in Communication and Computing, IEEE, 2009, pp. 805–807.