=Paper=
{{Paper
|id=Vol-3395/T5-8
|storemode=property
|title=A Machine Learning Approach for COVID-19 Tweet Classification
|pdfUrl=https://ceur-ws.org/Vol-3395/T5-8.pdf
|volume=Vol-3395
|authors=Subinay Adhikary
|dblpUrl=https://dblp.org/rec/conf/fire/Adhikary22
}}
==A Machine Learning Approach for COVID-19 Tweet Classification==
A Machine Learning Approach for COVID-19 Tweet Classification Subinay Adhikary1 1 Indian Institute of Science Education and Research Kolkata,Campus Road, Mohanpur, West Bengal 741246 Abstract Vaccine-related information is awash on social media platforms like Twitter and Facebook. One party supports vaccination, while the other opposes vaccination and promotes misconceptions and misleading information about the risks of vaccination. The analysis of social media posts can give significant infor- mation into public opinion on vaccines, which can help government authorities in decision-making.This paper describes the dataset used in the shared task, and compares the performance of different classifica- tion that are provax, antivax and last neutral for identifying effective tweets related to Covid vaccines.We experimented with a classification-based approach. Our experiment shows that SVM classification performs well in order to effiective post.We’re going to do this because vaccination is an important step for Covid19 so people can easily fix the news about the vaccine and grab their own slot and symptom detection is also playing a important part to arrest the spread of disease. Keywords Covid-19, Twitter, Vaccination, Classification 1. Introduction In the face of a COVID-19 outbreak that shows no signs of slowing down, vaccination seems to be the only possible solution. Around the globe, people began to share their thoughts about vaccination. In spite of this, many people are sceptical about vaccinations for a variety of reasons. Social media sites like Twitter and Facebook are inundated with vaccine-related information[1]. One group of individuals is in favour of vaccination, while another opposes vaccination and spreads myths and false information about the dangers of vaccination. Using social media posts to study can provide useful insights.[2][3] It has been identified that Covid-19 spreads very fast. Therefore, it is reacquired to identify the symptoms of Covid-19 as soon as possible. By identifying Covid symptoms based on their classification, we can identify them more effectively. Here, we have used SVM, which is a very simple and efficient classifier algorithm that is widely used for pattern recognition and has a very good classification performance compared to other classifiers.The proposed model is validated with the tasks of IRMiDis FIRE-2022. Forum for Information Retrieval Evaluation 2022 (FIRE 2022) $ subinayadhikary612@gmail.com (S. Adhikary) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 2. Tasks • Task 1: COVID-19 vaccine stance classification from tweets The Covid-19 vaccination is of utmost importance to prevent the disease.Here the task is to build a effective classifier to understand the user’s stance in social media.The three classes are as follows: 1. AntiVax - the tweet indicates hesitancy (of the user who posted the tweet) towards the use of vaccines. 2. ProVax - the tweet supports / promotes the use of vaccines. 3. Neutral - the tweet does not have any discernible sentiment expressed towards vaccines or is not related to vaccines • Task 2: Detection of COVID-19 symptom-reporting in tweets It is important for authorities to identify people who are experiencing COVID-19 symp- toms as quickly as possible in order to prevent the spread of the disease. The purpose of this task is to detect tweets that relate to COVID-19 symptoms (e.g., ’fever’, ’cough’). Such tweets are referred to as symptom-reporting tweets.Build an effective classifier for 4-class classification on tweets that can detect tweets that report someone experiencing COVID-19 symptoms. The 4 classes are described below: 1. Primary Reporting - The user (who posted the tweet) is reporting symptoms of himself/herself. 2. Secondary Reporting - The user is reporting symptoms of some friend / relative / neighbour / someone they met. 3. Third-party Reporting - The user is reporting symptoms of some celebrity / third-party person. 4. Non-Reporting - The user is not reporting anyone experiencing COVID-19 symp- toms, but talking about symptom-words in some other context. This class includes tweets that only give general information about COVID-19 symptoms, without specifically reporting about a person experiencing such symptoms. 3. Dataset • For the Task1 we got 4392 tweets that were labelled with 3 classes. • For the Task2, there was 1574 tweets that were labelled with 4 classes. . 4. Our Approach We discuss the overall design and implementation of our approach in this section. We borrow the advantages and capabilities of machine learning algorithms to implement the two above- mentioned tasks which are properly discussed in the consequent subsections. 4.1. Task1 COVID-19 vaccine stance classification from tweets : The purpose of this task is to identify the public sentiments toward vaccines based on Twitter data. Model selection, vectorization, and preprocessing are the three steps of the overall process. 4.1.1. Preprocessing This phase involves cleaning up of the provided tweets labeled as AntiVax, Provax or Neutral in the training dataset.All the words starting with hashtags or containing multiple spaces, special characters,urls,punctuations or stopwords are firstly trimmed from every tweet. 4.1.2. Vectorization After streaming and lemmatization are applied,the pre processed level tweets are then transform to numeric feature vector using term frequency inverse document frequency. After transforming the unstructured tweet data into numeric structured data. 4.1.3. Model Selection This numeric structured data was fed into two classifiers: Naive Bayes and SVM(Support Vector Machine). In comparison with these two classifiers, SVM performed better. In this training set, 1676 provax tweets and 1081 antivax tweets are used as training data, and 1635 neutral tweets are used for training and get traing accuracy 0.65. precision recall f1-score support 0 0.67 0.72 0.69 432 1 0.62 0.61 0.61 407 2 0.66 0.59 0.62 259 accuracy 0.65 1098 macro avg 0.65 0.64 0.64 1098 weighted avg 0.65 0.65 0.65 1098 Table 1: Task1 classification result 4.2. Task2 Detection of COVID-19 symptom-reporting in tweets IThe purpose of this task is to identify the symptoms toward Covid-19 based on Twitter data. Model selection, vectorization, and preprocessing are the three steps of the overall process. 4.2.1. Preprocessing This phase involves cleaning up of the provided tweets labeled as AntiVax, Provax or Neutral in the training dataset.All the words starting with hashtags or containing multiple spaces, special characters,urls,punctuations or stopwords are firstly trimmed from every tweet. 4.2.2. Vectorization After streaming and lemmatization are applied,the pre processed level tweets are then transform to numeric feature vector using term frequency inverse document frequency. After transforming the unstructured tweet data into numeric structured data. 4.2.3. Model Selection This numeric structured data was fed into two classifiers: Naive Bayes and SVM(Support Vector Machine). In comparison with these two classifiers, SVM performed better.In this training set containing 437 primary reporting tweets and 127 Secondary Reporting tweets and 196 Third-Party reporting tweets and 814 tweets are Non-Reporting and get training accuracy 0.69. precision recall f1-score support 0 0.70 0.91 0.79 194 1 0.74 0.27 0.39 64 2 0.67 0.67 0.67 107 3 0.70 0.24 0.36 29 accuracy 0.69 394 macro avg 0.70 0.52 0.55 394 weighted avg 0.70 0.69 0.66 394 Table 2: Task2 classification result 5. Evaluation Task 1 - IRMiDis Track results are evaluated using overall accuracy and the macro-F1 score on the three classes as metrics. The result of our submitted automated run for Task 1 is shown in Table 1. Team ID File name Accuracy macro F1-Score Subinay-IISERK Vax_labels.csv 0.275 0.281 Task 2 - IRMiDis Track results are evaluated using overall accuracy and the macro-F1 score on the four classes as metrics. The result of our submitted automated run for Task 2 is shown in Table 2. Team ID File name Accuracy macro F1-Score Subinay-IISERK Symptoms_labels.csv 0.295 0.324 6. Conclusion Natural Language Processing was used for this work submitted to the IRMiDis Track processing of the tweets. Machine Learning models were used to identify tweets in the training data . We used the NeatText package of NLP for cleaning our tweets. Then techniques like Tokenization and Lemmatization were used to pre-process the tweets. Later, we used a few Machine Learning models to train our dataset among which the SVM(Support Vector Machine) model gave the highest F-score. References [1] E. D’Andrea, P. Ducange, A. Bechini, A. Renda, F. Marcelloni, Monitoring the public opinion about the vaccination topic from tweets analysis, Expert Systems with Applications 116 (2019) 209–226. [2] L.-A. Cotfas, C. Delcea, I. Roxin, C. Ioanăş, D. S. Gherai, F. Tajariol, The longest month: analyzing covid-19 vaccination opinions dynamics from tweets in the month following the first vaccine announcement, Ieee Access 9 (2021) 33203–33223. [3] M. M. Müller, M. Salathé, Crowdbreaks: tracking health trends using public social media data and crowdsourcing, Frontiers in public health 7 (2019) 81.