
1613-0073

Multi-label Classification of Covid-19 Vaccine Tweet

Palvika Bansal

palvika.bansal@thomsonreuters.com

Sumit Das

sumit.das@thomsonreuters.com

Vikas Rai

vikas.rai@thomsonreuters.com

Shalini Kumari

shalini.kumari@thomsonreuters.com

Thomson Reuters Lab, Bangalore, India

Keywords: COVID-19 Vaccine Tweets, Sentiment Analysis, Multi-label Classification, BERT, Prefix-Tuning

This research paper presents a novel approach to multi-label classification of tweets expressing concerns about COVID-19 vaccines. It introduces a fine-tuned BERT-based model, customized for this task, which performs well at categorizing the specific concerns expressed in tweets. Through extensive data preprocessing, the model accommodates a wide range of concerns. Our findings have implications for public health communication, as they enable monitoring of public sentiment and vaccine-related concerns. This research contributes to natural language processing and demonstrates the practical application of advanced machine learning techniques to real-world challenges. It underscores the potential of AI-driven solutions in public health communication.

CEUR ceur-ws.org

1. Introduction

Vaccination plays a crucial role in mitigating the risk and transmission of a wide range of diseases. Over the past few years, vaccination has emerged as a critical tool in combating the COVID-19 pandemic. Moreover, large-scale vaccination efforts are essential to reduce the prevalence of various diseases. Nonetheless, skepticism towards vaccines persists among many individuals for a variety of reasons, including political factors and concerns about potential vaccine side effects.

It is imperative to acknowledge and address these diverse concerns surrounding vaccines. Social media platforms have proven to be invaluable sources of data for gauging public sentiment and opinions regarding vaccination. Such platforms allow us to rapidly gather insights from conversations and discussions about vaccines [1]. To facilitate this understanding, our work uses training data from a prior project, "CAVES: A dataset designed to facilitate the transparent classification and summarization of concerns related to COVID vaccines" [2].

We systematically experimented with a wide spectrum of deep learning and machine learning techniques to categorize tweets that revolve around vaccine-related concerns. We started with foundational models, including TF-IDF and LSTM, and advanced towards more contextual, BERT-based [3] models. One noteworthy experiment involved prefix tuning [4], a parameter-efficient adaptation technique applied to state-of-the-art transformer models. This combination enabled us to extract nuanced insights from the tweets and improved the accuracy and depth of our classification efforts. To further capture the contextual meaning of each tweet, we experimented with various data processing approaches, such as identifying named entities in the tweets, expanding the tweets, analyzing tweet sentiment, and analyzing keywords. We also experimented with the state-of-the-art GPT-4 [5] model to identify the concerns expressed in a tweet by providing it with few-shot examples.

Furthermore, our investigation was not confined to this broad spectrum of techniques. We also fine-tuned models to accommodate the idiosyncrasies of tweet data. This allowed us to exploit the concise and informal language style of Twitter, ensuring that our models were finely attuned to the subtleties of vaccine-related discourse.

2. Task

Our primary aim is to develop a highly efficient multi-label classification model that can accurately assign labels to a social media post, specifically a tweet. These labels correspond to the specific concerns and sentiments expressed by the post's author regarding vaccines. This task involves not only identifying the presence of various concerns but also understanding the nuances and context in which they are discussed, enabling a comprehensive analysis of public sentiment and discourse surrounding vaccines on social media platforms.

In this study, the classification task is centered on a set of predefined concerns about vaccines. These concerns serve as the labels for categorizing tweets and provide a structured framework for analyzing public discourse on vaccine-related topics. The labels are defined as follows:
• Unnecessary: The tweet indicates vaccines are unnecessary, or that alternate cures are better.
• Mandatory (against mandatory vaccination): The tweet suggests that vaccines should not be made mandatory.
• Pharma (against Big Pharma): The tweet indicates that the Big Pharmaceutical companies are just trying to earn money, or the tweet is against such companies in general because of their history.
• Conspiracy (deeper conspiracy): The tweet suggests some deeper conspiracy, and not just that the Big Pharma want to make money (e.g., vaccines are being used to track people, COVID is a hoax).
• Political (political side of vaccines): The tweet expresses concerns that the governments/politicians are pushing their own agenda through the vaccines.
• Country (country of origin): The tweet is against some vaccine because of the country where it was developed/manufactured.
• Rushed (untested/rushed process): The tweet expresses concerns that the vaccines have not been tested properly or that the published data is not accurate.
• Ingredients (vaccine ingredients/technology): The tweet expresses concerns about the ingredients present in the vaccines (e.g., fetal cells, chemicals) or the technology used (e.g., mRNA vaccines can change your DNA).
• Side-effect (side effects/deaths): The tweet expresses concerns about the side effects of the vaccines, including deaths caused.
• Ineffective (vaccine is ineffective): The tweet expresses concerns that the vaccines are not effective enough and are useless.
• Religious (religious reasons): The tweet is against vaccines because of religious reasons.
• None: No specific reason stated in the tweet, or some reason other than the given ones.

3. Related Work

Users frequently turn to micro-blogging platforms such as Twitter, motivated by a diverse range of objectives. These include expressing their viewpoints on the Coronavirus pandemic, disseminating personal health updates to their online connections, flagging symptoms, and sharing alerts regarding their well-being or that of acquaintances. Robust discussions take place concerning COVID-19 vaccines and vaccination campaigns, often preceding individuals’ receipt of their vaccine doses. The extraction of valuable insights from these textual tweets represents a common application within the field of social computing.

In text classification, traditional machine learning techniques such as the Naive Bayes classifier, linear classifiers, and the Support Vector Machine (SVM), as well as deep learning methods including Long Short-Term Memory (LSTM) networks and bidirectional Recurrent Neural Networks (RNNs), have demonstrated their effectiveness.

Recent advancements in natural language processing have given rise to notable language models, with BERT (Bidirectional Encoder Representations from Transformers) [3] and its domain-specific counterpart CT-BERT (COVID-Twitter-BERT) [6] at the forefront. Additionally, VaccineBERT [7], a BERT-based model specialized in classifying COVID-19 vaccine-related tweets, has garnered attention.

4. Dataset

The dataset consists of 9,921 tweet records and contains no missing values, so the full collection is available for analysis.

4.1. Data Exploration

In this classification task, individual tweets may be linked with multiple labels. Consequently, it is important to examine the distribution of these labels within the dataset in order to categorize and interpret the complex and diverse nature of the tweets. For the complete results of this analysis, refer to Table 1.

In addition, it is crucial to examine the distribution of the number of labels assigned to each tweet. Across the entire dataset, 7,936 tweets were assigned exactly one label, 1,716 tweets carried two labels, and 269 tweets carried three distinct labels. Single-label tweets therefore dominate, while a sizeable minority requires genuinely multi-label predictions. This examination of the label distribution clarifies the characteristics of the dataset and the nature of the classification challenge. We also examined the distribution of tweet lengths; for the full length distribution, refer to Appendix Figure 1.1.
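The label-count figures above can be reproduced with a few lines of code. The following is a minimal sketch rather than the analysis code used in this work: the function name `label_count_distribution` and the toy label lists are assumptions made for illustration.

```python
from collections import Counter

def label_count_distribution(label_sets):
    """Map 'number of labels on a tweet' to 'number of tweets with that many labels'."""
    counts = Counter(len(set(labels)) for labels in label_sets)
    return dict(sorted(counts.items()))

# Toy data (hypothetical); the real input would be the label column of the dataset.
toy_labels = [
    ["side-effect"],
    ["rushed", "side-effect"],
    ["none"],
    ["pharma", "conspiracy", "political"],
]
print(label_count_distribution(toy_labels))

# Sanity check on the counts reported in the text: they sum to the dataset size.
reported = {1: 7936, 2: 1716, 3: 269}
print(sum(reported.values()))
```

The second check confirms that the three reported groups account for all 9,921 tweets.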

4.2. Trends in the dataset

Label-Entity Mapping in Tweet Text: We mapped the training data labels to the most prevalent entity types found within the tweet text. This analysis was carried out both for individual labels and for tweets with multiple labels. For a detailed breakdown of this analysis and its results, refer to Appendix Tables 1.1 and 1.2.

Extraction and Parsing of URL-Embedded HTML Content in Tweet Text: We performed a two-fold analysis, extracting the URLs from the tweet text and then parsing the HTML content behind these URLs. The purpose was to examine the HTML content, particularly the headlines, associated with each URL and compare it with the tweet text. The majority of these URLs referred to either other tweet threads or news media reports, and approximately 20% of the web pages no longer existed. In most cases, the tweet text was concise and often a partial excerpt from the parsed URL contents. There were also instances where the context of the tweet text contradicted the information in the HTML content. Consequently, we concluded that incorporating this HTML content into the tweet text would not add value and could confuse the model.

Analysis of @Mentioned Users in Tweet Text: We also analyzed the mentions of user profiles (@user) within the tweet text. The intention was to explore whether the profiles of mentioned users could offer supplementary information related to the type of tweet. However, our efforts were hindered by restrictions imposed by the Twitter API, which prevented access to user profiles.

Exploring Entity Types Within Tweet Text: We used Hugging Face's bertweet-tb2_wnut17-ner API to detect entities within the tweet texts. This API is tailored to social media data: it applies Named Entity Recognition (NER) fine-tuned for Twitter contexts to categorize entities amid the informality, hashtags, and mentions characteristic of tweets. However, given the constraints of time, this exploration did not yield significant outcomes and warrants further investigation in the future.

5. Pre-processing

To enhance the quality of the word embeddings used in the modeling process, we pre-processed the tweets. Tweets generally contain distinctive lexical elements such as hashtags, @username mentions, URLs, RT markers, and special characters. If left unprocessed, these elements tend to hinder the model's performance. Consequently, we applied the following data cleansing steps as part of our tweet pre-processing strategy:
• Removing stop words: Stop words, which are commonly used words such as "the", "and", and "in", are removed from the text. We also removed some words specific to tweet data, such as "rt", which marks retweets. This step reduced noise and improved efficiency by focusing on the most meaningful words and phrases in the text.
• Removing URLs: Initially we explored using external URL content to enhance tweet meaning, but it did not add much value to the core meaning of the tweet and distorted the results. We therefore removed these web links using a regular expression.
• Removing username mentions: Removing username mentions is important to preserve privacy and reduce bias, as mentions often refer to specific individuals or accounts. This step helps keep the analysis impartial.
• Converting words to lowercase: Lowercasing standardizes the text so that words with different capitalization patterns are treated as identical. This step prevents discrepancies in the analysis and simplifies text processing.
• Removing non-alphanumeric characters: We removed special symbols and punctuation marks that often do not contribute significantly to the analysis. This step helped focus on the core linguistic content.
• Tweet text expansion: For labels with less data, such as country, political, conspiracy, religious, and none, we used GPT-3.5 to augment the tweet content in order to provide richer context and enhance the relevance of the tweet to its label. The aim was to assess whether text expansion can improve the model's performance, particularly for these challenging labels. For additional details on this analysis and its outcomes, consult Appendix Table 1.5.
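The rule-based cleaning steps listed above (all except the GPT-3.5 expansion) can be sketched as a single function. This is an illustrative sketch under stated assumptions, not the pipeline used in this work: the stop-word list is a small placeholder, the regular expressions are our own, and the order of operations (URLs and mentions first, so that punctuation stripping does not break them apart) is a design choice of the sketch.

```python
import re

# Small placeholder stop-word list (assumption); "rt" marks retweets.
STOP_WORDS = {"the", "and", "in", "a", "an", "of", "to", "is", "rt"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                # @username mentions
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # applied after lowercasing

def clean_tweet(text: str) -> str:
    """Apply URL removal, mention removal, lowercasing, symbol removal, stop-word removal."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("RT @user: The vaccine is RUSHED!! https://t.co/abc #NoThanks"))
# prints: vaccine rushed nothanks
```

Note that stripping non-alphanumeric characters keeps the word inside a hashtag while dropping the "#" symbol itself.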

6. Methodology

6.1. Models

Fine Tuning DeBERTa Large: In one of our experiments we fine-tuned the "large" variant of DeBERTa (Decoding-enhanced BERT with Disentangled Attention) [8]. DeBERTa is a Transformer-based neural language model that improves on the BERT [3] and RoBERTa [9] models with two techniques, a disentangled attention mechanism and an enhanced mask decoder, and it was trained with half of the data used for RoBERTa. In the disentangled attention mechanism, each word is represented by two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. The enhanced mask decoder replaces the output softmax layer to predict the masked tokens during model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve the model's generalization on downstream tasks. We used a maximum sequence length of 128 with right padding, a learning rate of 2e-5, and a batch size of 10, and fine-tuned the model for 15 epochs. To prevent overfitting, we used early stopping on the validation loss with a patience of 5.
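The early-stopping rule mentioned above (monitor the validation loss, patience of 5, at most 15 epochs) can be illustrated independently of any training framework. The sketch below is a plain-Python illustration, not the training code used in this work; the class `EarlyStopping`, the helper `epochs_run`, and the `min_delta` parameter are hypothetical names introduced for the example.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def epochs_run(val_losses, patience=5, max_epochs=15):
    """Return how many epochs would run for a given sequence of validation losses."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)

# The loss improves until epoch 2 and then stalls, so training stops after epoch 7.
print(epochs_run([0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.30]))
```

In this toy run the later improvement at epoch 8 is never reached, which is the usual trade-off of a finite patience.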

Prefix Tuning of RoBERTa Large: In this experiment, we employed the "large" variant of RoBERTa (A Robustly Optimized BERT Pretraining Approach) [9], which is among the state-of-the-art transformers in natural language processing. RoBERTa builds on the BERT [3] model architecture using a more effective training procedure and a much larger dataset. This variant is pre-trained on 160GB of text from sources including BookCorpus, OpenWebText, and English Wikipedia, making it adept at grasping linguistic nuances and contextual representations of text. We chose prefix tuning [4] for RoBERTa large because it allows us to adapt the pretrained model to our multi-label classification task without overwriting the patterns the model had previously learned. Prefix tuning prepends trainable, task-specific prefix vectors (virtual tokens) to the input while the pretrained weights stay frozen, so the prefix guides the model to tailor its representations to the task while retaining the knowledge encoded in the model. Based on the distribution of tweet lengths, we used 128 virtual tokens for the prefix and 100 tokens to encode the tweet. We used a learning rate of 1e-2 and a batch size of 8, and trained for 15 epochs. We used the BCEWithLogitsLoss loss function to suit the multi-label classification problem. To prevent overfitting, we used early stopping on the validation loss with a patience of 5. For this experiment, we selected a probability threshold of 0.5 and assigned to each tweet the classes whose probability exceeded it.
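To make the loss and the thresholding step concrete, the sketch below computes binary cross-entropy on logits, treating each label as an independent yes/no decision, and then applies the 0.5 probability threshold. It is a plain-Python illustration of the formula that a loss such as BCEWithLogitsLoss implements (with mean reduction), not the training code used in this work; the function names are assumptions made for the example.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over per-label logits, in a numerically stable form.

    For one logit x and target y in {0, 1}:
        loss = max(x, 0) - x * y + log(1 + exp(-|x|))
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict(logits, threshold=0.5):
    """Return a 0/1 decision per label: 1 where the sigmoid probability exceeds the threshold."""
    return [int(sigmoid(x) > threshold) for x in logits]

# One tweet, three labels: confident yes, leaning no, undecided.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 0.0]
print(round(bce_with_logits(logits, targets), 4))
print(predict(logits))
```

Because each label has its own sigmoid, a tweet can receive zero, one, or several labels, which is what distinguishes this setup from a softmax over classes.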

6.2. Experimental Setup

Our experimental framework was designed to ensure robust model development and evaluation. We started by randomly shuffling the dataset and then splitting it into an 80% training set and a 20% validation set. We pre-processed the training and validation sets using the steps described in Section 5. Because tweets can carry multiple labels, we applied a Multilabel Binarizer to encode the labels. Additionally, to prevent overfitting, we employed early stopping with configurable parameters. For each experiment, we systematically varied the model hyperparameters. Details of these parameters and experiment configurations can be found in Section 6.1.
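The split and the label encoding described above can be sketched in a few lines. This is a standard-library stand-in for the kind of encoding a multilabel binarizer produces, not the code used in this work: the lowercase label spellings, the function names, and the random seed are assumptions made for illustration.

```python
import random

# The 12 classes from Section 2; the exact string spellings are an assumption.
LABELS = ["unnecessary", "mandatory", "pharma", "conspiracy", "political", "country",
          "rushed", "ingredients", "side-effect", "ineffective", "religious", "none"]

def binarize(label_sets, classes=LABELS):
    """Encode each tweet's label list as a 0/1 vector with one position per class."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows

def shuffle_split(items, train_frac=0.8, seed=42):
    """Shuffle a copy of the items and split it into training and validation parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

train_ids, val_ids = shuffle_split(range(9921))
print(len(train_ids), len(val_ids))
print(binarize([["rushed", "side-effect"]])[0])
```

With 9,921 records, an 80/20 split of this kind yields 7,936 training and 1,985 validation tweets.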

6.3. Predictions

For the predictions on the final test data, we fine-tuned different language model architectures with the objective of multi-label text classification, as detailed in Section 6.1. We predicted the probability score of each test tweet for every class. We also experimented with different probability thresholds for the different models and selected the thresholds based on the Macro-F1 metric. Classes with a probability score greater than the selected threshold were assigned as the predicted classes for that tweet. We also applied a post-processing step: when the model predicted other class labels along with the "none" label, we removed "none" and kept the other predicted labels as they were. With the selected thresholds, the model may make no prediction for a few tweets; we accepted this to keep the predictions precise. We submitted 3 prediction files from different models, each containing the Tweet ID and the predicted classes.
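The thresholding and the "none" post-processing rule described above can be expressed compactly. The sketch below is illustrative: the function name `assign_labels` and the toy probabilities are assumptions, and the strict "greater than" comparison follows the wording of the text.

```python
def assign_labels(probs, classes, threshold):
    """Keep classes whose probability exceeds the threshold.

    If "none" is predicted together with other labels, drop "none" and keep the rest.
    An empty list means the model made no prediction for this tweet.
    """
    chosen = [name for name, p in zip(classes, probs) if p > threshold]
    if len(chosen) > 1 and "none" in chosen:
        chosen = [name for name in chosen if name != "none"]
    return chosen

classes = ["rushed", "side-effect", "none"]
print(assign_labels([0.7, 0.1, 0.6], classes, 0.5))  # "none" dropped next to "rushed"
print(assign_labels([0.1, 0.2, 0.9], classes, 0.5))  # "none" kept when it stands alone
print(assign_labels([0.1, 0.2, 0.3], classes, 0.5))  # no class clears the threshold
```

The third call shows the case mentioned in the text where no label is assigned at all.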

6.4. Additional Modeling Experiments

In addition to the submitted models, we conducted a series of experiments with diverse feature sets and model architectures. These experiments did not yield superior results and were consequently not included in the final submission. This section describes these alternative approaches to give context for the choice of the submitted models.

BERTweet Large: We selected the BERTweet Large model because it specializes in processing Twitter data. BERTweet [10] is the first public large-scale language model pre-trained for English Tweets, and it is trained following the RoBERTa pre-training procedure. The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens, 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic, which makes the model adept at capturing the linguistic nuances and context of tweets. We fine-tuned BERTweet Large on our training dataset for the multi-label classification task, using a learning rate of 2e-5 and a batch size of 10 for 10 epochs. We used an early stopping threshold of 0.001 to prevent overfitting. For this experiment, we selected a probability threshold of 0.2 and assigned to each tweet the classes whose probability exceeded it.

TF-IDF Vectorizer with Deep Neural Network: After pre-processing the text, we used a TF-IDF vectorizer to create numerical representations of the text features. We then trained a deep neural network on these features, using dense layers together with dropout layers to handle overfitting.

LSTM with GloVe Twitter Embedding: In another experiment we built an LSTM model. We used the GloVe Twitter embeddings (2B tweets, 27B tokens, 1.2M vocab, uncased, 100d; https://nlp.stanford.edu/projects/glove/) to create features. The multi-label classifier consisted of a dropout layer to handle overfitting, an LSTM layer, and a dense layer. We used sigmoid as the activation function at the output layer and binary cross-entropy as the loss. This experiment achieved a Macro Average F1 score of 0.296 on the validation set using a threshold of 0.2.

Experiment with GPT-4: We experimented with GPT-4 [5] to generate labels for the tweets in the validation set by giving it few-shot examples of all the class labels along with a system prompt and a user prompt, the details of which are given in Appendix B. We used a temperature of 0, to make the output more deterministic, and a top_p of 1.0. Analyzing the results, we found that GPT-4 predicted at least 2 labels for a tweet most of the time, even though the majority of tweets in our data have a single label. This significantly lowered the precision of the results.

Expanded Tweet Experiment: As mentioned in the pre-processing section, we expanded the tweets of certain classes to improve performance on those classes. We prefix-tuned the RoBERTa Large model on a training set that used expanded tweets for these classes and the normal tweets for the other classes. In the validation set, we did not expand the tweets when evaluating performance. We did not see any performance improvement over prefix tuning of the RoBERTa Large model on normal tweets.

Label Enhancement and Similarity Matching: In this experiment we enhanced the label descriptions using the GPT-3.5 model and computed the embeddings of the enhanced labels using the BERTweet model. At inference time we calculated the cosine similarity between the embeddings of the enhanced labels and those of the tweets. With a similarity threshold of 0.8, this approach did not perform well.
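The matching step of this experiment reduces to a cosine-similarity comparison between a tweet embedding and each label embedding. The sketch below illustrates that step on toy two-dimensional vectors; it is not the code used in this work. The function names are assumptions, real embeddings would come from an encoder such as BERTweet, and whether the comparison at 0.8 is inclusive is also an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_labels(tweet_vec, label_vecs, threshold=0.8):
    """Return the labels whose embedding is at least `threshold` similar to the tweet."""
    return [name for name, vec in label_vecs.items()
            if cosine_similarity(tweet_vec, vec) >= threshold]

# Toy embeddings (hypothetical), chosen so that the two labels are orthogonal.
label_vecs = {"rushed": [1.0, 0.0], "political": [0.0, 1.0]}
print(match_labels([0.9, 0.1], label_vecs))
```

A fixed global threshold is sensitive to how tightly the encoder clusters all sentence embeddings, which may be one reason this approach underperformed.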

7. Evaluation

This task was evaluated using the Macro-F1 score over the 12 classes. The results of our submitted automated runs on the test set are shown in Table 2.

Sr No. | Team_name | Model Details | macro-F1 score | Jacc score | Rank
1 | Cognitive Coders | DeBERTa Large Fine-tuning | | |
2 | Cognitive Coders | RoBERTa Large Prefix tuning | | |
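For reference, the two reported metrics can be computed from binary label matrices as sketched below. This is an illustration, not the official evaluation script of the task: we assume that the "Jacc score" is the Jaccard index averaged over tweets, and the conventions for empty cases (a class with no true or predicted positives scores 0; two empty label sets score 1) are assumptions of the sketch.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

def mean_jaccard(y_true, y_pred):
    """Average over tweets of (size of intersection) / (size of union) of the label sets."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        total += inter / union if union else 1.0
    return total / len(y_true)

# Toy example with 3 tweets and 3 classes (hypothetical data).
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
print(round(macro_f1(y_true, y_pred, 3), 4))   # per-class F1: 2/3, 1, 0
print(round(mean_jaccard(y_true, y_pred), 4))  # per-tweet Jaccard: 1, 1/2, 1/2
```

Macro averaging weights every class equally, so rare labels such as religious or country influence the score as much as frequent ones.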

8. Conclusion and Future Work

For the final evaluation of this study, we submitted two language models: a fine-tuned DeBERTa Large and a prefix-tuned RoBERTa Large. Our objective was to explore the performance of these models on a complex dataset in which none of the labels exhibited a direct correlation with entity, sentiment, length, or word characteristics. Our findings revealed that transformer-based models outperformed traditional classifiers in handling the intricacies of our dataset. This observation underscores the potential of transformer-based architectures in addressing multifaceted classification tasks.

Furthermore, we explored different data augmentation strategies, such as using large language models (LLMs) to expand the tweet text and provide additional context, with the expectation that this could enhance model performance, particularly for labels with limited data points such as religious, country, and ingredients. Increasing the dataset size for these labels may improve classification accuracy, as transformer-based models are known to benefit from larger datasets. We also employed Hugging Face's bertweet-tb2_wnut17-ner API, which is specialized for social media data, to detect entities in the tweet texts. It allowed us to categorize entities in the context of Twitter's informal language, hashtags, and mentions. This integration could enable comprehensive analyses of label assignments, sentiment, and tweet length, shedding light on the entity-label relationships within our dataset. However, due to time constraints, our exploration yielded limited outcomes, and further investigation is needed.

In summary, our study highlights the promising performance of transformer-based models in tackling complex multi-label classification tasks. Additionally, we recommend future research efforts that focus on data augmentation and dataset expansion to further enhance model effectiveness, particularly in scenarios with limited labeled data.

A. Data Exploration and Observations

In this section, we explore our data along several dimensions. Specifically, we examine tweet length, word frequency patterns within each label category, the entities relevant to each label, the correlation between label assignments and tweet length, and original versus expanded tweets. The subsequent subsections describe these analyses and observations.

Tweet: Michael Yeadon, a former employee of Pfizer, said that the government rollout of the COVID-19 vaccine is an attempt at "mass depopulation" with booster recipients expected to die ...
Expanded tweet: It is important to note that the claim made by Michael Yeadon, a former employee of Pfizer, that the government rollout of the COVID-19 vaccine is an attempt at "mass depopulation" with booster recipients expected to die...
Label: conspiracy

Tweet: @MrStache9 Well i believe there won't be an election until Trudick get enough covid vaccine into enough people to claim he did something right...
Expanded tweet: The statement seems to suggest that the Canadian government, led by Prime Minister Justin Trudeau, will likely wait until a significant portion of the population has been vaccinated against COVID-19...
Label: political

B. GPT-4 Prompts

System Prompt: You are a helpful assistant that will help in providing the most relevant labels to a social media post from a list of labels that express significant concern towards the vaccine.

User Prompt: Assign most relevant labels to a social media post (particularly, a tweet) according to the specific concern(s) towards vaccines as expressed by the author of the post.

Note that a tweet can have more than one label (concern), e.g., a tweet expressing more than 1 different concerns towards vaccines will have more labels.

We consider the following concerns towards vaccines as the labels for this classification task: {labels with description}

tweet text: {text}

Response: list of labels separated by space

Sample of Few-shot examples:

    { "role": "system", "name": "example_user", "content": '''@kentlivenews Let's hope Boris Johnson isn't one of those new trainees to stick people with the vaccine. Not a good picture to use.''' },
    { "role": "system", "name": "example_assistant", "content": 'Political' },
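The prompt components above can be assembled into a chat message list as sketched below. This is an illustration under stated assumptions, not the code used in this work: the helper names `build_messages` and `parse_labels` are hypothetical, the `{labels}` and `{text}` placeholder names stand in for the placeholders shown in the user prompt, and no API call is made. The role and name fields of the few-shot messages follow the sample shown above.

```python
def build_messages(system_prompt, user_template, few_shot, labels_with_description, tweet_text):
    """Assemble the system prompt, few-shot examples, and the final user prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_tweet, example_labels in few_shot:
        messages.append({"role": "system", "name": "example_user", "content": example_tweet})
        messages.append({"role": "system", "name": "example_assistant", "content": example_labels})
    user_prompt = user_template.format(labels=labels_with_description, text=tweet_text)
    messages.append({"role": "user", "content": user_prompt})
    return messages

def parse_labels(response_text):
    """The prompt asks for labels separated by spaces, so split on whitespace."""
    return response_text.split()

# Abbreviated, hypothetical inputs for illustration only.
messages = build_messages(
    system_prompt="You are a helpful assistant ...",
    user_template="Labels: {labels}\ntweet text: {text}\nResponse: list of labels separated by space",
    few_shot=[("@kentlivenews Let's hope Boris Johnson isn't one of those new trainees ...", "Political")],
    labels_with_description="Political: ...; Rushed: ...",
    tweet_text="Example tweet text",
)
print(len(messages), messages[-1]["role"])
print(parse_labels("Political Rushed"))
```

Keeping the examples as separate messages, rather than pasting them into the user prompt, makes it easy to vary the number of shots per class.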

C. Online Resources

• GitHub

[1] Poddar, Basu, Ghosh, Ghosh, Overview of the FIRE 2023 track: Artificial Intelligence on Social Media (AISoMe), in: Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, 2023.

[2] Poddar, A. M. Samad, Mukherjee, Ganguly, Ghosh, CAVES: A dataset to facilitate explainable classification and summarization of concerns towards COVID vaccines, 2022. arXiv:2204.13746.

[3] Devlin, M.-W. Chang, Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.

[4] X. L. Li, Liang, Prefix-tuning: Optimizing continuous prompts for generation, 2021. arXiv:2101.00190.

[5] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.

[6] Müller, Salathé, P. E. Kummervold, COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter, 2020. arXiv:2005.07503.

[7] Bithel, Verma, VaccineBERT: BERT for COVID-19 vaccine tweet classification, in: Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021, 2021, pp. 1199-1203.

[8] He, Liu, Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, 2021. arXiv:2006.03654.

[9] Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.

[10] D. Q. Nguyen, Vu, A. T. Nguyen, BERTweet: A pre-trained language model for English Tweets, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 9-14.