TweetClass: COVID-19 Vaccine Tweet Classification with scikit-learn

Baivab Chakraborty¹, Subhajit Srimani² and Souvit Biswas³

¹ Amity University, Kolkata, West Bengal, Kolkata 700135
² Netaji Subhash Engineering College, Kolkata, West Bengal, Kolkata 700152
³ Amity University, Kolkata, West Bengal, Kolkata 700135

Abstract
We propose TweetClass as a solution for the FIRE 2023 AISoMe Track. TweetClass uses the scikit-learn library, which provides several classifiers that can be used to categorize tweets as Unnecessary, Mandatory, Pharma, Conspiracy, Political, Country, Rushed, Ingredients, Side-effect, Ineffective and Religious. As vaccination continued globally to combat COVID-19, people’s tweets became a valuable source of insight into their opinions on the entire vaccination episode. Used correctly, this enormous body of data can help governments create effective vaccination strategies for future pandemics. Across our two submitted runs, the model achieves a best Macro-F1 score of 0.39 and a best Metric Jaccard score of 0.46, with the runs ranking 34th and 38th in the track.

Keywords
Natural Language Processing, COVID-19 Vaccine Tweets, Multinomial Naive Bayes, Multi-Output Classifier

1. Introduction

The COVID-19 pandemic was one of the toughest challenges the world has faced. Over time, vaccines have proven to be a safe and effective way to combat and eradicate infectious diseases. As a result, a race began to discover effective vaccines that could prevent the havoc of COVID-19, and this eventually led to the worldwide availability of these vaccines. However, discussions about vaccination progress, accessibility, efficacy, and side effects continued throughout, and people held both positive and negative opinions about the vaccines.
Some took to social media sites such as Twitter to share their concerns about the vaccines. Understanding these concerns is valuable to governments and health organizations such as the WHO, which can use such insights to plan future strategies and encourage everyone to get fully vaccinated. It is also crucial to stop the spread of misinformation about these vaccines and to acknowledge the efforts governments have made to keep the pandemic from spreading further. Twitter itself attempted to block tweets that contained incorrect or misleading information about the virus, its preventative measures, and treatments. Manual classification of tweets is time-consuming and prone to error. There is therefore a clear need for machine-learning models that can assist in classifying tweets about the COVID-19 vaccines.

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
baivab369@gmail.com (B. Chakraborty); subhrasrimani2002@gmail.com (S. Srimani); souvitbiswas26@gmail.com (S. Biswas)
https://github.com/subhajitsrimani/TweetClass (S. Srimani)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Task

The Artificial Intelligence on Social Media (AISoMe) 2023 Track [4], organized as part of the 15th meeting of the Forum for Information Retrieval Evaluation (FIRE) 2023, tasked participants with building an effective multi-label classifier that labels a social media post (specifically, a tweet) according to the specific concerns about vaccines expressed by the post’s author. In this paper, we describe our approach to the problem.
The tweets express distinct concerns about vaccines, for reasons such as the politics involved or the potential side-effects of vaccines. Specifically, the tweets are categorized into the following 12 classes, each described with an example:

• Unnecessary: Suggests that vaccines are unnecessary. Example: Besides the fact that those who already had covid are immune, so the vaccine will not be needed. The ’vaccine’ ain’t the only solution to immunity, in fact it doesn’t even provide immunity, just the symptoms are decreased.
• Mandatory: Suggests that vaccination should not be mandatory. Example: A vaccine passport will not be issued for those who don’t want vaccination, how are you gonna deal with these people, your objective seems like to push us to one corner and penalise, I suppose this is what you call justice.
• Pharma: Suggests that COVID-19 is a hoax that lets big pharmaceutical companies earn more money. Example: Fauci and Bill Gates are the real owners. Fauci owns a stake in Moderna through NIH, and Gates invested 100 million dollars in 2016 for the development of mRNA vaccines.
• Conspiracy: Suggests that there is some deeper conspiracy than just pharmaceutical companies earning money. Example: First they came for your minds, then your freedoms with Lockdown, your rights to earn a living, then your body with vaccination. Now they are coming for your kids. Finally they will own everything you posses including you.
• Political: Suggests that there is an attempt to propagate some political agenda. Example: We don’t need any twisted, insane politicians directing us whether or not to get vaccinated. They do not have the right to instruct us on how to live our lives. The best course of action is to push back while ignoring the absurd story.
• Country: Suggests that the vaccine is rejected because of its country of origin or manufacture. Example: I regret to inform you that I will never take a vaccine produced in Russia.
• Rushed: Suggests that the vaccines lack proper testing or that inaccurate data is published. Example: Since these tests typically take years, the long-term safety of any vaccination cannot possibly be tested. What will make it work? Why do you think they can heal diseases when they can’t even treat the common cold despite years of research?
• Ingredients: Suggests that the ingredients of the vaccines are a cause for concern. Example: Is created using the kidney cells of a young girl who was aborted in the 70s.
• Side-effect: Suggests that vaccination has side-effects, including death. Example: It starts. Please look for secure substitutes for this vaccine. Following patient illnesses, the UK publishes an allergy warning for the Pfizer COVID-19 vaccination.
• Ineffective: Suggests that the vaccines may be ineffective altogether. Example: What exactly is the point? Are you giving your old and frail patients a second shot because Pfizer claims the vaccine is worthless if administered outside of the recommended time frame?
• Religious: Suggests that there are religious restrictions against the vaccines. Example: A vaccine would go against my religion. The strongest defense available is the 91st Psalm.
• None: The tweet does not match any of the above concerns, or raises a concern that is not listed above. Example: Total garbage. I could have taken the vaccine but will pass this one

3. Related Work

Information extraction from social media posts containing textual data has become an integral part of social computing. Social media reflects the diverse opinions and expressions of people on almost every aspect of the world, and concerns about the COVID-19 vaccines are no exception. Hence, textual tweets are a useful source of data for classifying people’s stances on vaccines with machine learning methods such as the Multinomial Naive Bayes Classifier [1] and the Multi-Output Classifier [2].

3.1. Multinomial Naive Bayes

The Multinomial Naive Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ theorem. It assumes that features are conditionally independent given the class and follow a multinomial distribution. It is well suited to tasks such as text categorization and sentiment analysis, where the frequency of words in documents is essential for classification.

3.2. Multi-Output Classifier

A Multi-Output Classifier is a machine learning model designed to handle multiple target variables simultaneously, making it suitable for multi-label or multi-task learning problems. Instead of predicting a single output, it produces multiple outputs, each corresponding to a different target variable. Multi-output classifiers are used in various domains, including natural language processing, where multiple aspects or labels need to be predicted for a single input. They extend traditional classification and regression techniques to tackle complex, multi-dimensional prediction tasks.

4. Dataset

A training dataset of 9921 tweets, exhibiting people’s varied apprehensions about the vaccines, is provided. Since it is used for training the model, each tweet in this file comes with both its ID and its classes. The test dataset contains only 486 tweets along with their IDs. Our approach builds on the CAVES dataset [3], applying several pre-processing techniques and appropriate classifiers to develop an efficient model.

4.1. Trends in the Dataset

In the training dataset, side-effect is the most frequent class of concern, with a 38.4% share, followed by ineffective (16.9%), rushed (14.9%), pharma (12.8%), mandatory (7.9%), unnecessary (7.3%), none (6.3%), political (6.3%), conspiracy (4.9%), ingredients (4.4%), country (2.0%) and religious (0.7%). The shares sum to more than 100% because a tweet can belong to more than one class.

5. Pre-processing

A tweet tends to comprise plain text, special characters and emojis, which must be handled because they otherwise degrade the performance of the model. We therefore pre-processed the tweets to improve the quality of the data for further processing. The pre-processing steps adopted in our model are as follows:

• Feature and Label Extraction: Feature extraction selects and transforms the raw data from the tweets into a set of features that can be used as input for the machine learning algorithms. Similarly, the labels that the model will predict are also extracted.
• Vectorization: The TF-IDF Vectorizer transforms the text data into a numerical format that the machine learning algorithms used by the model can process. It also removes English stop words, which add little meaning to a sentence on their own.
• Binarization: The Multi-Label Binarizer supports the multi-label classification task required by our dataset, in which each tweet can belong to more than one class. It converts the categories into binary labels (0 or 1) for each class.

6. Methodology

6.1. Model

We built a model that performs multi-label text classification using the scikit-learn library. The TF-IDF Vectorizer converts the collection of raw documents into a matrix of TF-IDF features. The Multinomial Naive Bayes Classifier is particularly useful in this case because the dataset consists of text with discrete features such as word frequency counts; in practice it also works with fractional TF-IDF weights. The Multi-Output Classifier fits one classifier per target (multi-target classification), and the Multi-Label Binarizer converts the labels into a binary matrix representation, which makes this multi-target classification possible.
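The components described above can be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the submitted implementation: the toy tweets, their label sets and all variable names are invented for illustration (they are not CAVES data), and every estimator is left at its scikit-learn defaults apart from English stop-word removal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only (not taken from the CAVES dataset).
tweets = [
    "my arm was swollen and I had a fever after the vaccine shot",
    "the vaccine was rushed and never properly tested in trials",
    "people still catch covid after two doses so the vaccine is useless",
    "severe allergic reaction reported after a rushed vaccine rollout",
    "booster failed to stop infection and gave me terrible headaches",
    "trials were skipped and long term safety data is missing",
]
labels = [
    ["side-effect"],
    ["rushed"],
    ["ineffective"],
    ["side-effect", "rushed"],
    ["ineffective", "side-effect"],
    ["rushed"],
]

# Binarization: each label set becomes a row of 0/1 flags, one column per class.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# Vectorization: TF-IDF features with English stop words removed.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

# One Multinomial Naive Bayes classifier is fitted per label column.
model = MultiOutputClassifier(MultinomialNB())
model.fit(X, Y)

# Predict on an unseen tweet and map the 0/1 row back to label names.
new_tweets = ["I had a fever and headaches after the shot"]
pred = model.predict(vectorizer.transform(new_tweets))
pred_labels = binarizer.inverse_transform(pred)
print(list(binarizer.classes_), pred.tolist(), pred_labels)
```

Here `inverse_transform` returns one tuple of label names per tweet; a row in which every per-label classifier predicts 0 maps to an empty tuple, so a real system would need to decide how to report such tweets.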
Overall, these scikit-learn components allow our model to predict labels for the test dataset efficiently.

6.2. Experimental Setup

The training data is shuffled and then split into training and validation sets in a 9:1 ratio, so that the proportion of instances of each class is approximately preserved in both sets. The pre-processing techniques described in Section 5 are applied to the data; the training split is used to fit our model, while the validation split is used for evaluation.

6.3. Prediction

Our model is designed to analyze and predict labels for the available test dataset, which is stored in the “test_data” data frame. In this data frame, the “tweet” column stores the extracted text data, and the corresponding predicted labels are stored in a new column named “pred_labels”. The data frame with the predicted labels is then saved to a new CSV file named “prediction_file.csv”, and the “tweet” column is dropped from the data frame. We also use scikit-learn’s accuracy_score function to calculate the accuracy of each predicted label. Finally, the model opens the CSV file with the predictions, reads its contents into a data frame named “result”, and displays the output in the console.

7. Evaluation

The AISoMe FIRE 2023 Track results are evaluated using the Macro-F1 score as the primary metric and Metric Jaccard as the secondary metric, which is used in case of a tie on Macro-F1. The results of our two submitted automated runs on the test dataset are shown in Table 1. Our two run files ranked 34th and 38th.

Table 1
Results of the AISoMe FIRE 2023 Track

Sr No.   Team_Name   Macro-F1 Score   Metric Jaccard   Rank
1        APS AI&ML   0.39             0.41             34
2        APS AI&ML   0.35             0.46             38

8. Conclusion and Future Work

This paper presents our TweetClass model. Once constructed and trained, the model performs multi-label text classification to predict the labels of the available test dataset.
The model uses the versatile features of scikit-learn to perform natural language processing. We pre-processed the data through feature and label extraction, vectorization and binarization, and split it into training and validation sets. We then trained the model on the training data and finally used it to predict the labels of the test dataset. We look forward to further optimizing the model by exploring future developments in machine learning and artificial intelligence. Our future work includes using more domain-specific models to improve the versatility of TweetClass, with the aim of obtaining better-optimized outputs and higher accuracy scores.

References

[1] J. P. D. Delizo, M. B. Abisado, M. I. P. De Los Trinos, Philippine Twitter sentiments during COVID-19 pandemic using multinomial naïve-Bayes, International Journal 9 (2020).
[2] J. Read, L. Martino, P. M. Olmos, D. Luengo, Scalable multi-output label prediction: From classifier chains to classifier trellises, Pattern Recognition 48 (2015) 2096–2109.
[3] S. Poddar, A. M. Samad, R. Mukherjee, N. Ganguly, S. Ghosh, CAVES: A dataset to facilitate explainable classification and summarization of concerns towards COVID vaccines, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 3154–3164.
[4] S. Poddar, M. Basu, K. Ghosh, S. Ghosh, Overview of the FIRE 2023 track: Artificial intelligence on social media (AISoMe), in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, 2023.