Identifying Situational Information during Mass Emergency

Sumit Anand, Mehuly Chakraborthy and Diptaraj Sen
University of Engineering and Management, Kolkata, India

Abstract

With the advent of Natural Language Processing (NLP), text analysis has come into wide use, and this paper applies it to analysing the sentiment of tweets. Two tasks were covered: first, differentiating tweets into claims (fact-checkable) and non-fact-checkable posts, and second, building an effective classifier that labels a tweet as anti-COVID-vaccine, pro-COVID-vaccine or neutral. Notably, we reached strong results without resorting to heavyweight pipelines: 93% accuracy on the first task using a Random Forest classifier and 45.4% on the second using a BERT-based model. Our scores were the best among all the teams working on the same tasks. The details of our participation in the IRMiDis 2021 data challenge are discussed below, and we hope the paper stands on its own merit.

Keywords

Random Forest, BERT, Micro-blogging, Natural Language Processing, classifier

Forum for Information Retrieval Evaluation, December 13-17, 2021, India
sumit.anand@uem.edu.in (S. Anand); mehuly25@gmail.com (M. Chakraborthy); diptaraj.work@gmail.com (D. Sen)

1. Introduction

In periods of dire need, when mankind suffers loss of life, humanity struggles to stay afloat in every way possible. Computer science plays a significant role here by making information more accessible to people in efficient and sophisticated ways. Social media posts are among the most important signals that help analysts decide what needs to be done, and how, during large-scale emergencies. Our tasks led us to discover what people think about COVID-19 vaccines, helping us understand what must still be addressed to raise awareness in society, and also to build a model that separates claims from non-fact-checkable content. Twitter data served this purpose well, and we say with confidence that micro-blogging will find further use across the many fields that computer science touches.

Both of our tasks are analyses built on micro-blog data. To sketch a brief overview: for the first task, we were required to differentiate fact-checkable from non-fact-checkable tweets, the data being extracted from Twitter. The second task consisted of two datasets, train and test, also extracted from micro-blogging sites, for which a classifier was to be built to separate opinions about the COVID-19 vaccine into three stances: for, against and neutral. With the help of two efficient algorithms, Random Forest [1] and BERT [2], we obtained accuracies of 0.93 and 0.45 on tasks 1 and 2 respectively. This paper describes in detail the procedure that gave us the highest scores among the teams working on the IRMiDis 2021 data challenge.
2. Tasks

The tasks were simple to state, yet immensely engaging.

The first task involved a dataset of 11,000 tweets related to the Nepal earthquake of April 2015. Along with the dataset, a sample of claims (fact-checkable tweets) and non-fact-checkable tweets was provided in text format. The goal was to identify the claims, i.e., the fact-checkable tweets.

Examples of claims:

1. @mashable some pictures from Norvic Hospital *A Class Hospital of nepal* Patients have been put on parking lot.
2. @ Refugees: UNHCR rushes plastic sheeting and solar-powered lamps to Nepal earthquake survivors [url]

Examples of non-fact-checkable tweets:

1. Students of Himalayan Komang Hostel are praying for all beings who lost their life after earthquake!!! Please do...[url]
2. We humans need to come up with a strong solution to create earthquake proof zone's.

The second task refers to the present scenario of the COVID-19 pandemic. Two data files were provided, a train dataset and a test dataset. The train dataset contains stances of tweets towards COVID-19 vaccines, crawled between November and December 2020, whereas the test dataset contains tweets posted between March and December 2020 containing various vaccine-related keywords. The goal was to gauge how many people remain sceptical about the COVID-19 vaccine. Hence a classifier was to be built for the following 3-class classification:

1. AntiVax - the tweet is against the use of vaccines.
2. ProVax - the tweet supports / promotes the use of vaccines.
3. Neutral - the tweet does not express any discernible sentiment towards vaccines, or is not related to vaccines.

3. Dataset

We have used the datasets provided by IRMiDis FIRE 2021 [3], which span a wide range of data, from the 2015 Nepal earthquake to tweets revealing how many people hold negative views of the COVID-19 vaccine.

For the first task, the dataset contains around 11,000 microblogs [4] (tweets) posted during the Nepal earthquake in April 2015. Along with the dataset, samples of fact-checkable and non-fact-checkable tweets are provided as text files in the following format:

Tweetid <||> Tweettext

Example:

592568567247212544<||>RT @NewEarthquake: 4.7 earthquake, 25km S of Kodari, Nepal. Apr 26 13:21 at epicenter (21m ago, depth 10km).

For the second task, IRMiDis FIRE 2021 contributed two datasets, one for training and one for testing. The training dataset consists of stances of tweets towards COVID-19 vaccines, crawled between November and December 2020; 2,792 crawled tweet texts were supplied along with their tweet IDs and class labels. The test dataset contains tweets posted between March and December 2020 with various vaccine-related keywords. These tweets were annotated by three crowd workers; for 1,600 of them there was at least majority agreement, i.e., at least 2 of the 3 annotators gave the same label. The test dataset is formed of these 1,600 tweets, each tagged with its tweet ID and tweet text. Working from these datasets, we completed the IRMiDis 2021 data challenge; a minimal sketch of how such files can be parsed is shown below.
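The sketch below shows one way to read the Tweetid <||> Tweettext format into (ID, text) pairs. This is a minimal illustration, not the paper's own loading code; the function name and the example file name are hypothetical.

```python
# Minimal sketch for parsing the 'Tweetid <||> Tweettext' dataset format.
# The helper name and file path are illustrative, not from the paper.
from typing import List, Tuple

def load_tweets(path: str) -> List[Tuple[str, str]]:
    """Parse lines of the form 'Tweetid <||> Tweettext' into (id, text) pairs."""
    tweets = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split only on the first delimiter, in case '<||>' recurs in the text
            tweet_id, _, text = line.partition("<||>")
            tweets.append((tweet_id.strip(), text.strip()))
    return tweets

# Example usage (hypothetical file name):
# pairs = load_tweets("nepal_tweets.txt")
# print(pairs[0])  # ('592568567247212544', 'RT @NewEarthquake: 4.7 earthquake, ...')
```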
4. Methodology

This section describes in detail the process we followed to reach the desired results. We applied Random Forest and a BERT-based model as the backbone of our tasks, and succeeded in producing predictions with high accuracy.

4.1. Task 1: Classifying tweets into Facts and Non-facts

The first task, as mentioned before, asks us to differentiate tweets into facts and non-facts. We arrived at a model that gives an accuracy of 93%. The task can be divided into three non-overlapping phases: Preprocessing, Feature Selection and Model Selection.

4.1.1. Preprocessing

The first phase clears away data that hold no relevance to the task, so the dataset was pre-processed before the later phases. We removed links present in the data, along with stop words and tweet IDs. User IDs and punctuation of any kind were also removed, leaving a dataset consisting only of letters and digits. Finally, for ease of processing, we converted the texts to lower case. This was our pre-processed dataset.

4.1.2. Feature Selection

In the next step, Feature Selection, we extract features from the cleaned dataset to understand it more clearly. After pre-processing, we added an extra binary column: if a tweet contains 5 or more digits in total, the column is set to 1; if it contains 4 or fewer, it is set to 0. For example,

1. 'Nepal is the only Hinu country in the world, we need to protect and provide relief in this crisis, hats off to.'
2. ':Earthquake helpline at the Indian embassy in Kathmandu: +977 98511 07021, +977 98511 35141'

In the first example the number of digits is 0, so the assigned value is 0, and the tweet is a non-fact. In the second example there are far more than five digits, so the assigned value is 1, and the tweet is a fact. We chose this criterion because, while manually inspecting the tweets, we noticed that most facts involve digits (phone numbers, magnitudes, counts). We then experimented with different thresholds on the digit count (2, 3, 4 and 5) and found that a tweet containing 5 or more digits has a much greater tendency to be a fact. We therefore fixed the threshold at 5 and proceeded to the final phase.

4.1.3. Model Selection

This is the final stage of the task: selecting a suitable model to train on our data. After experimenting with several models and algorithms, we found that a Random Forest classifier gives the best results. Training on the feature-extracted data with Random Forest yields an accuracy of 0.93; we used five-fold cross-validation for more reliable estimates. This concluded our first task of differentiating tweets into facts and non-facts [5]. A sketch of the full pipeline follows.
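The sketch below assembles the Task 1 steps described above. The digit-count flag and the five-fold cross-validated Random Forest follow the paper; combining the flag with TF-IDF text features is our assumption, since the paper does not specify the exact feature vector fed to the classifier.

```python
# Minimal sketch of the Task 1 pipeline (preprocessing + digit flag + Random
# Forest with 5-fold CV). TF-IDF features are an assumption, not from the paper.
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

def preprocess(text: str) -> str:
    text = re.sub(r"http\S+", " ", text)         # remove links
    text = re.sub(r"@\w+", " ", text)            # remove user IDs
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # remove punctuation, keep letters/digits
    return text.lower().strip()                  # lower-case the remaining text

def digit_flag(text: str) -> int:
    # 1 if the tweet contains 5 or more digits in total, else 0 (threshold from the paper)
    return int(sum(ch.isdigit() for ch in text) >= 5)

def evaluate(texts, labels):
    """texts: raw tweet strings; labels: 1 = fact-checkable, 0 = not."""
    cleaned = [preprocess(t) for t in texts]
    # stop-word removal (from the paper's preprocessing) is delegated to the vectorizer
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(cleaned)
    flags = csr_matrix(np.array([[digit_flag(t)] for t in texts]))
    X = hstack([tfidf, flags]).tocsr()
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, labels, cv=5)  # five-fold cross-validation
    return scores.mean()
```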
4.2. Task 2: 3-class classification of tweets by their stance towards COVID-19 vaccines

As explained previously, Task 2 asks us to build an effective classifier for 3-class classification of tweets by their stance towards COVID-19 vaccines. The three classes are AntiVax (against the COVID-19 vaccine), ProVax (for the vaccine) and Neutral. We must therefore train a model on the training data that then achieves high accuracy on the test data. As in Task 1, the process divides into three non-overlapping phases: Preprocessing, Feature Selection and Model Selection.

4.2.1. Preprocessing

To get rid of redundant data, we performed several cleaning steps that let us focus on the task at hand. We started by removing all usernames and hashtags, then removed URLs, links and special characters of every kind, including emojis and emoticons. The remaining text was converted to lower case. With the data pre-processed, we moved on to the next step.

4.2.2. Feature Selection

In this step we extracted features from the pre-processed data so that training becomes straightforward. The classes were label-encoded as AntiVax = -1, ProVax = 1 and Neutral = 0. In addition, we tokenized the tweets.

4.2.3. Model Selection

After much deliberation, we applied a pretrained BERT model to our data, which proved immensely effective. Our model is 'distilbert-base-uncased', a transformer model that is smaller and faster than BERT, pretrained on the same corpus in a self-supervised fashion using the BERT base model as a teacher. This means it was pretrained on raw text only, with no human labelling (which is why it can use large amounts of publicly available data), with an automatic process generating inputs and labels from the text via the BERT base model. With this model, the accuracy obtained on the test data is around 45.4%, which compares favourably with all the other algorithms we tried. This concluded the second task [6]. A sketch of the fine-tuning setup follows.
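The sketch below shows a standard way to fine-tune distilbert-base-uncased for this 3-class stance task with the Hugging Face transformers library. The hyperparameters (epochs, batch size, learning rate) are illustrative assumptions, as the paper does not report them; note also that the paper's -1/0/1 label coding is remapped to 0..2 for the classification head.

```python
# Minimal sketch of fine-tuning distilbert-base-uncased for 3-class stance
# classification. Hyperparameters are illustrative, not from the paper.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# The paper encodes AntiVax=-1, Neutral=0, ProVax=1; the model head needs 0..2.
LABEL2ID = {"AntiVax": 0, "Neutral": 1, "ProVax": 2}

class TweetDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        # Tokenize the cleaned tweets (truncated/padded to a fixed length)
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def finetune(train_texts, train_labels):
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=3)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=TweetDataset(train_texts, train_labels, tok))
    trainer.train()
    return trainer
```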
5. Evaluation

This section presents the output of the algorithms we applied to the two tasks: the accuracy, Precision@100, Recall@1000, MAP@100, overall MAP and macro-F1 score obtained after training with the chosen models. For both tasks the results were towards the higher end, indicating that the tasks were a success.

5.1. Task 1: Classifying tweets into Facts and Non-facts

Task 1 was completed by applying a Random Forest classifier to the pre-processed data. The accuracy was 93%, a very good result for this dataset. Our retrieval metrics were likewise strong: Precision@100 of 0.9100, Recall@1000 of 0.2165, MAP@100 of 0.0669 and an overall MAP of 0.1543. Facts and non-facts were differentiated efficiently, and the results were more than satisfactory.

Team ID        Precision@100   Recall@1000   MAP@100   MAP Overall
ByteCrackers   0.9100          0.2165        0.0669    0.1543

5.2. Task 2: 3-class classification of tweets by their stance towards COVID-19 vaccines

The second task required an effective model to classify tweets into three classes: ProVax, AntiVax and Neutral. We applied the pretrained 'distilbert-base-uncased' model, which yielded an accuracy of 45.4% and a macro-F1 score of 0.440, a good outcome for this data. The model works dependably, and the second task was completed successfully.

Team ID        Accuracy   macro-F1 Score
ByteCrackers   0.454      0.440

6. Conclusion

This paper brings together two well-performing algorithms: the Random Forest classifier and DistilBERT [7]. As reported above, we achieved high accuracies on both tasks. The IRMiDis 2021 data challenge has been much more than a learning experience: as coders, we gained a working view of the surrounding society and its thoughts on issues troubling the nation, and programmers can build further on this data to address these societal issues [8]. Our approach can still be refined. We have considered combining clustering with BERT for Task 2, and extracting more discriminative features beyond digit counts for Task 1. With a much firmer grip on machine learning, we hope to implement these ideas and build well-structured models that reach even higher accuracies.

References

[1] Biau G., Scornet E. A random forest guided tour. TEST 25, 197-227 (2016).
[2] Miller D. Leveraging BERT for Extractive Text Summarization on Lectures. arXiv:1906.04165 [cs.CL].
[3] Basu M., Ghosh S., Ghosh K. 2018. Overview of the FIRE 2018 track: Information Retrieval from Microblogs during Disasters (IRMiDis). In Proceedings of the 10th annual meeting of the Forum for Information Retrieval Evaluation (FIRE '18). Association for Computing Machinery, New York, NY, USA, 1-5. DOI: https://doi.org/10.1145/3293339.3293340
[4] Dutt R., Basu M., Ghosh K., Ghosh S. Utilizing microblogs for assisting post-disaster relief operations via matching resource needs and availabilities. Information Processing and Management, Volume 56, Issue 5, 2019, Pages 1680-1697, ISSN 0306-4573. DOI: https://doi.org/10.1016/j.ipm.2019.05.010
[5] Chatterjee S., Deng S., Liu J., Shan R., Jiao W. (2018). Classifying facts and opinions in Twitter messages: a deep learning-based approach. Journal of Business Analytics, 1, 29-39. DOI: 10.1080/2573234X.2018.1506687
[6] Shekhar H., Gangisetty S. (2015). Disaster Analysis Through Tweets. DOI: 10.1109/ICACCI.2015.7275861
[7] Sanh V., Debut L., Chaumond J., Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL].
[8] Elaziz M., Hosny K., Salah A., Darwish M., Lu S., et al. (2020). New machine learning method for image-based diagnosis of COVID-19. PLOS ONE 15(6): e0235187. DOI: https://doi.org/10.1371/journal.pone.0235187