Covid-19 Vaccination Stance Detection Using Natural Language Processing and Machine-Learning Algorithms

Harsh Tita1, Rashi Sharma1
1 Amity University, Kolkata, India

Abstract
The coronavirus outbreak has resulted in unprecedented measures, forcing authorities to impose lockdowns in the areas most affected by the pandemic. Social media have supported people during this difficult time. On November 9, 2020, when the first vaccine with an efficacy rate over 90% was announced, social media reacted and people around the world began to express their feelings about this vaccination. This paper aims to analyze the dynamics of opinion on COVID-19 vaccination, a process in which civil society is highly engaged. We compared classical machine learning algorithms to select the best performing classifier. A total of 4,392 tweets were collected and analyzed. The proposed approach can help governments create and evaluate appropriate communication tools to provide clear and relevant information to the general public, increasing public confidence in vaccination campaigns.

Keywords
Twitter, COVID-19, stance classification, vaccine

1. Introduction
The outbreak caused by the novel coronavirus SARS-CoV-2 has brought a series of changes to many aspects of people's economic and social life. Since its emergence, the pandemic has continued to spread across the world, reaching 220 countries and territories by December 9, 2020 [1]. Governments have tried to address the outbreak through a series of measures, not all of them in accordance with general public opinion. Throughout this time, the rapid growth in the number of cases globally has produced panic, fear and anxiety among people [2].
Due to the situation generated by lockdowns in some parts of the world and social distancing in others, the use of social media has intensified globally [2], as it connects people in geographically distant places and allows them to exchange ideas and information about the events of this period. Moreover, people seem to rely on the information posted on social media. As a result, social media platforms have become mediating channels between each individual and the rest of the world and have gained more and more attention, being among the fastest growing information systems for social applications [3], [4]. On these channels, individuals share their views, opinions and emotions about the various events that occur due to the coronavirus pandemic [3]. Among the popular social media platforms, Twitter receives special attention, because users can easily disseminate their opinions on a particular topic through public messages called tweets [5]. In addition to the information voluntarily provided by the user, tweets may also contain information about the user's location and may include links, emoticons, and hashtags that allow the user to better express emotions, making Twitter a source of valuable information [5], [6]. Additionally, Twitter is used by government officials and politicians to inform the public about their activities and major events [7].

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
EMAIL: harshtita01@gmail.com (Harsh Tita), sharma.rashi2408@gmail.com
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Vaccination is one of the many issues that have raised questions on social media, most of which relate to the safety of the overall process.
Therefore, many studies have analysed the impact of various social media campaigns on resistance to vaccination [8], [9] or public sentiment related to the vaccination process [5], [10]. Furthermore, compared to other vaccination situations reviewed in the scientific literature, COVID-19 vaccination raises new questions related to the relatively short time span of vaccine development. The process of developing a vaccine usually takes about 10 years [11]. The fastest vaccine development before COVID-19, that of the mumps vaccine, took 4 years [12], and nearly 40 years after the discovery of HIV, no effective vaccine against it has yet been developed. Due to the state of emergency, however, the development timeline for the COVID-19 vaccine has been shortened [11].

In this context, this paper analyses public opinion regarding the COVID-19 vaccination process, considering messages posted on Twitter. A clean dataset of 4,392 tweets was extracted. The performance of multiple machine learning algorithms (both traditional and deep learning algorithms) was compared using annotated datasets. The best performing algorithms were selected and used to analyse the dataset. We collected and annotated the COVID-19 vaccination dataset, determined the best classifier for stance detection on COVID-19 vaccination, and associated the number of tweets with each stance (ProVax, Neutral and AntiVax). The selected approaches can be easily integrated into systems that allow interested organizations to adequately monitor public opinion regarding the vaccination process for the novel coronavirus.

2. Methodology
The steps taken to analyze public opinion on COVID-19 vaccination from social media messages are stated below. The first step is to collect the COVID-19 vaccination stance dataset, which contains tweets in English. A randomly selected subset of this dataset was manually annotated as Neutral, ProVax or AntiVax to be used in the training phase of the stance classification algorithm.
Due to their unstructured nature and informal writing style, tweets from the collected dataset were preprocessed in the next step to improve the performance of the stance classification algorithm. In the current work, the performance of several classical machine learning algorithms was evaluated based on the following widely used metrics: accuracy, precision, recall and F-score.

In the definitions below, TP, TN, FP, and FN refer to true positives, true negatives, false positives, and false negatives. Thus, TP is the number of actual positive tweets classified as positive, FP is the number of actual negative tweets incorrectly classified as positive, TN is the number of negative tweets correctly classified as negative, and FN is the number of actual positive tweets incorrectly classified as negative.

Accuracy is the ratio of correctly predicted observations to all observations and is defined as shown in (1).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision, which represents the ratio of correctly predicted positive observations to the total predicted positive observations, is computed as shown in (2).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall, representing the ratio of correctly predicted positive observations to all the observations in the actual class, is computed as shown in (3).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

Starting from Precision and Recall, the F-Score can be computed as their harmonic mean, as shown in (4).

$$\text{F-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Finally, the best performing algorithm has been used to analyze the evolution of the public stance towards vaccination in the considered period.

2.1. Data Pre-Processing:
The main components of the stance detection process are pre-processing, feature extraction and machine learning classification. The pre-processing step cleans the text, and the feature extraction step transforms the raw text data into feature vectors.

We performed several pre-processing steps on the dataset, chiefly stop-word removal. First, the text was converted to lowercase for better generalization. Subsequently, punctuation was removed, reducing unnecessary noise in the dataset. After that, stop words and URLs were removed, as they carry no significant information for the task. Finally, lemmatization (reducing derived words to their root form, known as the lemma) was performed for better results.

Stop words are words in a language that do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence.

CountVectorizer: Machines cannot work with characters and words directly, so text data must be represented as numbers. CountVectorizer is a method that converts text to numerical data by transforming it into a sparse matrix of word counts. This makes it easy to use text data directly in machine learning and deep learning tasks such as text classification. We used CountVectorizer with the following hyperparameters:
• max_features: 5000, which means that the 5000 most frequent words in the data are selected
• stop_words: an array of redundant words was passed.

We used the built-in functions mentioned below to train our model:
• train_test_split(): This function splits the dataset into a training set and a test set according to a specified split fraction; we started with a fraction of 0.2.
This means we used 80% of our dataset for training our model and 20% for testing and evaluating it.
• TfidfVectorizer(): TF-IDF, which stands for term frequency-inverse document frequency, is used to prepare text data for machine learning. Using this function, we convert all words into TF-IDF scores, given by the formula below.
TF-IDF = TF (term frequency) × IDF (inverse document frequency)
Term frequency: the number of times the term occurs in a given document.
IDF: a weight that decreases as the number of documents containing the term grows, commonly defined as the logarithm of the total number of documents divided by the number of documents in which the term is found.
• make_pipeline(): This function is used for defining our data pipeline. It applies a list of transforms, followed by a final estimator. In our case, the final estimator was Bernoulli Naïve Bayes.

2.2. Learning Algorithms:
A machine learning approach has been used in order to accurately determine the stance towards vaccination in the collected tweets. Starting from the annotated dataset, the performance of several popular classification algorithms has been investigated: Multinomial Naïve Bayes, Support Vector Machine (SVM), Bernoulli Naïve Bayes, K-Nearest Neighbor (KNN), Logistic Regression, and Gradient Boosting.

1) Multinomial Naïve Bayes
The Multinomial Naïve Bayes algorithm is a probabilistic learning method that is widely used in Natural Language Processing (NLP). The algorithm is based on Bayes' theorem and predicts the tag of a text such as an email or a newspaper article. It calculates the probability of each tag for a given sample and outputs the tag with the highest probability.

2) Support Vector Machine (SVM)
Support Vector Machines (SVM) [99] are a family of supervised learning algorithms used for classification, regression and other tasks such as outlier detection. While other classification algorithms can suffer from overfitting, one of the advantages of SVMs is that they are less prone to it [100].
Another advantage is that, besides binary classification, multiclass classification can be performed by combining several binary classification functions. For this, each class is considered individually, and for each class a classifier is sought that separates it from the other classes [101].

3) Bernoulli Naïve Bayes
This classifier is used for discrete data and is based on the Bernoulli distribution. Its main feature is that it accepts features only as binary values such as true or false, yes or no, success or failure, 0 or 1. It is therefore the natural choice of Naïve Bayes classifier when the feature values are binary.

4) K-Nearest Neighbor
The K-Nearest Neighbor (KNN) algorithm is a supervised machine learning algorithm. It works by assuming that similar things exist close to each other. Hence, the KNN algorithm uses feature similarity between a new data point and the points in the training set (available cases) to predict the value of the new data point. In essence, it assigns a value to the new data point based on how closely it resembles the points in the training set. The KNN algorithm is applicable to both classification and regression problems but is mainly used for classification.

5) Logistic Regression
Logistic Regression is a classification algorithm in the supervised category of machine learning algorithms (a type of machine learning in which machines are trained using "labelled" data, and the output is predicted on the basis of that training). It has its roots in the field of statistics. The main role of Logistic Regression in machine learning is predicting the output of a categorical dependent variable from a set of independent variables.
In simple words, a categorical dependent variable in the basic setting is a variable that is dichotomous or binary in nature, with its data coded as either 1 (success/yes) or 0 (failure/no).

6) Gradient Boosting Algorithm
Gradient boosting is a popular boosting algorithm in which each predictor corrects its predecessor's error. The weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels. Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

2.3. Approach Used:
    # X_train, X_test, y_train, y_test come from train_test_split() on the vectorized tweets.
    # The class names below are assumed scikit-learn equivalents of the names in the original listing.
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    models = [KNeighborsClassifier(), SVC(), LogisticRegression(),
              GradientBoostingClassifier(), BernoulliNB(), MultinomialNB()]
    for model in models:
        model.fit(X_train, y_train)        # train on the training split
        y_pred = model.predict(X_test)     # predict the held-out split
        print(accuracy_score(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))

3. Results
This section presents the performance of each machine learning classifier in terms of recall, precision and F1-score.

Class     Precision  Recall  F-Score
AntiVax   0.43       0.77    0.55
Neutral   0.59       0.58    0.58
ProVax    0.68       0.34    0.46
Table 1
Table 1 includes the results achieved using the K-Nearest Neighbor classifier.

          AntiVax  Neutral  ProVax
AntiVax   173      30       21
Neutral   95       177      35
ProVax    135      94       119
The confusion matrix for the KNN classifier is shown above.

Class     Precision  Recall  F-Score
AntiVax   0.70       0.55    0.61
Neutral   0.58       0.73    0.65
ProVax    0.61       0.55    0.58
Table 2
Table 2 includes the results achieved using the Gradient Boosting Classifier.

          AntiVax  Neutral  ProVax
AntiVax   123      45       56
Neutral   15       224      68
ProVax    38       117      193
The confusion matrix for the Gradient Boosting Classifier is shown above.
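The per-class values in Table 2 can be recomputed from the Gradient Boosting confusion matrix by applying the precision, recall and F-score definitions to each class in turn (one class treated as positive, the other two as negative). The following is a minimal Python sketch, assuming that rows are actual classes and columns are predicted classes; this orientation is not stated in the text, but the matrix agrees with the reported values when read this way. The function and variable names are illustrative and not part of the original work.

```python
from typing import Dict, List

LABELS = ["AntiVax", "Neutral", "ProVax"]

# Gradient Boosting confusion matrix as reported in the text.
# Assumed orientation: rows = actual class, columns = predicted class.
GB_MATRIX = [
    [123, 45, 56],
    [15, 224, 68],
    [38, 117, 193],
]


def per_class_metrics(matrix: List[List[int]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Return one-vs-rest precision, recall and F-score for every class."""
    report = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]                          # class k predicted as k
        fn = sum(matrix[k]) - tp                   # class k predicted as another class
        fp = sum(row[k] for row in matrix) - tp    # another class predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision,
                         "recall": recall,
                         "f_score": f_score}
    return report


def accuracy(matrix: List[List[int]]) -> float:
    """Share of all samples that lie on the diagonal of the matrix."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    for label, scores in per_class_metrics(GB_MATRIX, LABELS).items():
        print(label, {name: round(value, 3) for name, value in scores.items()})
    print("accuracy", round(accuracy(GB_MATRIX), 3))
```

Under the stated orientation, the sketch reproduces the precision and recall values of Table 2 to two decimals; the same functions can be applied to the other confusion matrices in this section.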
Class     Precision  Recall  F-Score
AntiVax   0.70       0.57    0.63
Neutral   0.62       0.72    0.66
ProVax    0.64       0.62    0.63
Table 3
Table 3 includes the results achieved using Logistic Regression.

          AntiVax  Neutral  ProVax
AntiVax   127      44       53
Neutral   16       221      70
ProVax    39       93       216
The confusion matrix for Logistic Regression is shown above.

Class     Precision  Recall  F-Score
AntiVax   0.73       0.54    0.62
Neutral   0.62       0.73    0.67
ProVax    0.63       0.64    0.63
Table 4
Table 4 includes the results achieved using the Support Vector Machine.

          AntiVax  Neutral  ProVax
AntiVax   122      40       62
Neutral   14       225      68
ProVax    31       96       221
The confusion matrix for the Support Vector Machine classifier is shown above.

Class     Precision  Recall  F-Score
AntiVax   0.70       0.49    0.58
Neutral   0.66       0.68    0.67
ProVax    0.59       0.68    0.63
Table 5
Table 5 includes the results achieved using Multinomial Naïve Bayes.

          AntiVax  Neutral  ProVax
AntiVax   110      33       81
Neutral   13       209      85
ProVax    35       76       237
The confusion matrix for Multinomial Naïve Bayes is shown above.

Class     Precision  Recall  F-Score
AntiVax   0.59       0.58    0.58
Neutral   0.58       0.69    0.63
ProVax    0.66       0.54    0.59
Table 6
Table 6 includes the results achieved using Bernoulli Naïve Bayes.

          AntiVax  Neutral  ProVax
AntiVax   130      53       41
Neutral   36       213      58
ProVax    56       103      189
The confusion matrix for Bernoulli Naïve Bayes is shown above.

After applying the various machine learning algorithms to the training dataset, we obtained the accuracies listed in Table 7.
Algorithm                     Accuracy
K-Nearest Neighbor            53.4%
Gradient Boosting Classifier  61.4%
Logistic Regression           64.2%
Support Vector Machine        62.9%
Multinomial Naïve Bayes       63.3%
Bernoulli Naïve Bayes         60.5%
Table 7

Algorithm                     Accuracy  F1-Score
Bernoulli Naïve Bayes         49%       0.473
Support Vector Machine        48.7%     0.471
Logistic Regression           47.7%     0.469
Multinomial Naïve Bayes       46.6%     0.458
K-Nearest Neighbor            44.4%     0.432
Gradient Boosting Classifier  42.5%     0.392
Table 8

Below is the pictorial representation of the accuracies obtained by each machine learning classifier:
[Figure: bar chart of accuracy (%, axis from 0 to 70) for K-Nearest Neighbor, Logistic Regression, Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Gradient Boosting and Support Vector Machine]

4. Conclusion
In the current study, the initial announcement of a coronavirus vaccine and the first real vaccination process initiated outside of limited clinical trials were analyzed using machine learning-based stance detection. Several classical machine learning and deep learning algorithms were compared, and the best performing classifier was selected based on the performance metrics. The proposed approach used Bernoulli Naïve Bayes, with an accuracy of 49%, to classify tweets into three main classes regarding COVID-19 vaccination: ProVax, AntiVax, and Neutral. The purpose of this paper was to monitor changes in the stance towards COVID-19 vaccination through tweets. With many countries around the world planning to initiate vaccination processes for COVID-19, early detection of changes in opinion can help government decision makers take targeted steps to curb infections. Possible future research directions include the development of better performing stance classification algorithms, as well as extending the analyzed period, especially given that the vaccination process is expected to take a relatively long time.

5. References
[1] Worldometer. (Dec. 9, 2020). Coronavirus Update (Live): 63,777,845 Cases and 1,477,777 Deaths From COVID-19 Virus Pandemic. Accessed: Dec. 9, 2020. [Online]. Available: https://www.worldometers.info/coronavirus/
[2] K. Chakraborty, S. Bhatia, S. Bhattacharyya, J. Platos, R. Bag, and A. E. Hassanien, "Sentiment analysis of COVID-19 tweets by deep learning classifiers - A study to show how popularity is affecting accuracy in social media," Appl. Soft Comput., vol. 97, Dec. 2020, Art. no. 106754, doi: 10.1016/j.asoc.2020.106754.
[3] A. H. Alamoodi, B. B. Zaidan, A. A. Zaidan, O. S. Albahri, K. I. Mohammed, R. Q. Malik, E. M. Almahdi, M. A. Chyad, Z. Tareq, A. S. Albahri, H. Hameed, and M. Alaa, "Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review," Expert Syst. Appl., vol. 167, Apr. 2021, Art. no. 114155, doi: 10.1016/j.eswa.2020.114155.
[4] G. Appel, L. Grewal, R. Hadi, and A. T. Stephen, "The future of social media in marketing," J. Acad. Marketing Sci., vol. 48, no. 1, pp. 79–95, Jan. 2020, doi: 10.1007/s11747-019-00695-1.
[5] E. D'Andrea, P. Ducange, A. Bechini, A. Renda, and F. Marcelloni, "Monitoring the public opinion about the vaccination topic from tweets analysis," Expert Syst. Appl., vol. 116, pp. 209–226, Feb. 2019, doi: 10.1016/j.eswa.2018.09.009.
[6] A. Giachanou and F. Crestani, "Like it or not: A survey of Twitter sentiment analysis methods," ACM Comput. Surv., vol. 49, no. 2, Nov. 2016, Art. no. 28, doi: 10.1145/2938640.
[7] J. Golbeck, J. M. Grimes, and A. Rogers, "Twitter use by the U.S. Congress," J. Amer. Soc. Inf. Sci. Technol., vol. 61, no. 8, pp. 1612–1621, May 2010, doi: 10.1002/asi.21344.
[8] E. A. Pedersen, L. H. Loft, S. U. Jacobsen, B. Søborg, and J. Bigaard, "Strategic health communication on social media: Insights from a Danish social media campaign to address HPV vaccination hesitancy," Vaccine, vol. 38, no. 31, pp. 4909–4915, Jun. 2020, doi: 10.1016/j.vaccine.2020.05.061.
[9] K. Dedominicis, A. M. Buttenheim, A. C. Howa, P. L. Delamater, D. Salmon, S. B. Omer, and N. P. Klein, "Shouting at each other into the void: A linguistic network analysis of vaccine hesitance and support in online discourse regarding California law SB277," Social Sci. Med., vol. 266, Dec. 2020, Art. no. 113216, doi: 10.1016/j.socscimed.2020.113216.
[10] S. Martin, E. Kilich, S. Dada, P. E. Kummervold, C. Denny, P. Paterson, and H. J. Larson, "'Vaccines for pregnant women...?! Absurd' - Mapping maternal vaccination discourse and stance on social media over six months," Vaccine, vol. 38, no. 42, pp. 6627–6637, Sep. 2020, doi: 10.1016/j.vaccine.2020.07.072.
[11] T. T. Le, Z. Andreadakis, A. Kumar, R. G. Román, S. Tollefsen, M. Saville, and S. Mayhew, "The COVID-19 vaccine development landscape," Nature Rev. Drug Discovery, vol. 19, no. 5, pp. 305–306, Apr. 2020, doi: 10.1038/d41573-020-00073-5.
[12] J. F. Modlin, W. A. Orenstein, and A. D. Brandling-Bennett, "Current status of mumps in the United States," J. Infectious Diseases, vol. 132, no. 1, pp. 106–109, Jul. 1975.