Short Text Classification Using TF-IDF Features and Fast Text Learner
                                            Zeshan Khan, Umar Naseer, Muhammad Atif Tahir
                                         {zeshan.khan,umar.naseer,atif.tahir}@nu.edu.pk
                     FAST School of Computing, National University of Computer and Emerging Sciences, Pakistan


Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’21, December 13-15 2021, Online

ABSTRACT
The spread of COVID-19 is a challenge for the health sector. This pandemic created health and financial issues for the whole world. Medical experts are working on the diagnostics of COVID-19 and on the reasons behind the disease and its spread. Some conspiracies related to the COVID-19 disease and its spread are circulating. Such conspiracies can be seen on social media, including Twitter. In this research, the COVID-19 conspiracies have been analyzed from public tweets. The conspiracy tweets have been filtered from tweets about the COVID-19 disease, its symptoms, and other discussions related to the disease. The analysis of the COVID-19 related tweets resulted in three classes, covering COVID-19 tweets without any conspiracy and tweets with conspiracies. A model is presented for the classification of tweets into three conspiracy classes with a Matthews Correlation Coefficient (MCC) of 0.294.

1    INTRODUCTION
Social media has grown from closed group chats into a source of information sharing. This information sharing has created several trust issues regarding the information shared on social media platforms. Currently, Twitter is one of the most used public post-sharing platforms. A huge number of tweets are shared daily. Several of these tweets may contain misinformation.

In the year 2019, the disease COVID-19 badly damaged human lives and the economy. Several solutions have been proposed for the treatment and spread control of the disease. The guidelines of the health organizations are affected by the false information shared by various individuals. Some of this false information relates the spread of COVID-19 to technological inventions, including 5G. A lot of people draw a relationship between the COVID-19 disease and 5G technology towers. COVID-19 and 5G technology emerged in the same period, but that does not show that one is the cause of the other.

2    RELATED WORK
Over the last decade, NLP has been effective for various analyses of textual data. One of the domains of textual analysis is text classification. Text classification becomes more challenging when the provided text consists of very short documents. It is very difficult to build context from a short textual document, although short documents have the benefit of being easy to process.

There is significant work available in the domain of text classification, and especially on short text classification. Some researchers used various textual feature extraction techniques and then applied classifiers to the textual features. Classifiers such as the SVM [14] have been used for the classification with TF-IDF as the feature vector [9, 12]. The SVM-based approaches classify text quickly but with lower detection accuracy. Another group of studies uses neural network-based approaches. The researchers took pre-trained neural networks such as BERT [1] and fine-tuned them on the classification dataset [8].

Another type of research for this task is based on graph neural networks (GNN). GNNs are neural networks that capture the dependencies in a graph through message passing between its nodes. Several GNN variants are used in the domain, including the graph convolutional network (GCN), graph attention network (GAT), graph recurrent network (GRN), etc. The GNNs achieve good detection accuracy at a high time and computational cost [3, 7, 13].

3    APPROACH
The research is based on three different methodologies for detecting the tweet class. Neural network approaches currently perform well, with the limitation that they require a large amount of data.

The first approach explored in this research is the use of cosine similarity between text vectors for the detection of conspiracies in the tweet texts [2, 5]. The idea of this approach is to split the tweet texts into sentences and apply the learner for classification. A similar learner is trained on the whole tweet as a single unit. The fasttext learner was evaluated on both the split and the combined tweets using the MCC. The architecture is visually presented in Figure 1.

Figure 1: Architecture for FAST Text.

The second approach used in the research is the classification of the TF-IDF vector [6]. This methodology converted tweets into sentences to make the instances smaller. The TF-IDF features are extracted from the sentences. The TF-IDF extraction returned a feature vector of length 1045 in which most values are zero. Principal component analysis (PCA) was applied as a feature reduction technique to reduce the number of features, for computational and accuracy advantages [15]. The reduced feature vectors were classified using majority voting of some diverse learners including Decision Tree
Classifier, Linear Discriminant Analysis, and Logistic Regression. The architecture is summarised in Figure 2.

Figure 2: Methodology of Majority Voting of Classifiers using TF-IDF features.

The third algorithm for the detection of conspiracies in the tweets is based on a fully connected neural network with the TF-IDF features of the tweets [6]. Important words are selected in two phases that remove words from the tweet text. In the first phase, the word categories that are more important for deciding between conspiracy classes are selected, including nouns, verbs, etc. The second phase of the selection of the important terms is based on principal component analysis (PCA). The top 500 PCA-based features are provided to the neural network for a ternary decision between conspiracy classes. The detailed architecture of the algorithm can be seen in Figure 3.

Figure 3: Architecture of Neural Network based Classification

4    DATASET
The research is conducted using the dataset of MediaEval 2021 under the task FakeNews: Corona Virus and Conspiracies Multimedia Analysis [10, 11]. The dataset is a set of tweets by various Twitter accounts. Each tweet includes the tweet text used for the detection task. Some other information is available with each tweet for other objectives. The conspiracy classification task needs only the tweet text and the conspiracy class for the training data. The training data provided consisted of 1511 tweets of various lengths, from a few words to several sentences. The class distribution of the tweets for classes A, B, and C was 754, 262, and 495 respectively. The test set comprised 266 tweets to be assigned to one of the three classes. The dataset shows a class imbalance between the three provided classes. Another finding in the dataset concerns class decidability: classes B and C are much closer to each other than to class A.

5    RESULTS AND ANALYSIS
Three approaches were designed to solve the challenge of conspiracy detection in the tweets. The first approach, based on fasttext classification, was evaluated on the training data with 30% held out as validation data. It was evaluated by training with various wordNgrams, learning rates, dimensions and epochs. The best hyperparameters for fasttext were word n-grams of length 1, a learning rate of 0.7, and 800 dimensions. The model was trained for 50 epochs because of limited resources. This approach resulted in an accuracy of 0.89 on the validation set, the 30% extracted from the training dataset. The same model, when applied to the test dataset, resulted in an MCC of 0.294.

The second approach was based on TF-IDF vector classification using a majority voting classifier. This methodology used the dimensionality reduction technique of PCA with the selection of the top 500 features out of 7147 features. The methodology resulted in an accuracy of 0.56 with an MCC of 0.20 when executed on the training dataset with 30% as the validation dataset. The same approach, when applied to the test dataset, resulted in an MCC of 0.03.

The third approach executed in the research is based on a fully connected neural network on reduced TF-IDF features. In this approach, the words of the tweet were selected based on the importance of their category, i.e. part of speech (POS), in decision making. We applied various learners to several categories of the words in a tweet, e.g. verbs, nouns, adverbs, adjectives, etc. These learners gave guidance on the decidability of the various POS categories. The selected set of words is then used for the computation of the TF-IDF, and PCA is used to reduce the feature vector length from 6898 to 500. The approach was applied to the validation data and the test data and resulted in MCC scores of 0.17 and 0.07 respectively.

6    CONCLUSION AND FUTURE WORK
Short text classification is a challenging topic in the domain of natural language processing. Several challenges arise because little sentence context is available when a text has few sentences. Various learners were applied to short text classification, and fasttext [2] was the best learner for tweet classification. The results of fasttext suggest that neural-network (NN) based approaches or LSTM [4] can give better results for the classification of short text.

The results of the approaches above show that neural network (NN) based approaches can result in better detection accuracy. Deep learning (DL) based approaches will be explored further to improve detection accuracy. The data is limited and its nature is very close to that of various NLP datasets, so transfer learning approaches can also be beneficial. For example, BERT can be used with its pre-trained weights for a better understanding of words and their relationships, and then fine-tuned on the tweet data to decide between conspiracy classes.

REFERENCES
 [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT:
     Pre-training of deep bidirectional transformers for language understanding.
     arXiv preprint arXiv:1810.04805 (2018).
 [2] Stanislav Glebik. 2021. FAST TEXT. https://github.com/facebookresearch/fastText/.
     [Online; accessed 25-November-2021].
 [3] Abdullah Hamid, Nasrullah Shiekh, Naina Said, Kashif Ahmad, Asma Gul, Laiq
     Hassan, and Ala Al-Fuqaha. 2020. Fake news detection in social media using
     graph neural networks and NLP Techniques: A COVID-19 use-case. arXiv
     preprint arXiv:2012.07517 (2020).
 [4] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
     computation 9, 8 (1997), 1735–1780.
 [5] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou,
     and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models.
     arXiv preprint arXiv:1612.03651 (2016).
 [6] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2020. Mining of
     massive data sets. Cambridge university press.
 [7] Hu Linmei, Tianchi Yang, Chuan Shi, Houye Ji, and Xiaoli Li. 2019. Heteroge-
     neous graph attention networks for semi-supervised short text classification. In
     Proceedings of the 2019 Conference on Empirical Methods in Natural Language Pro-
     cessing and the 9th International Joint Conference on Natural Language Processing
     (EMNLP-IJCNLP). 4821–4830.
 [8] A Malakhov, A Patruno, and S Bocconi. 2020. Fake news classification with
     BERT. In Multimedia Evaluation Benchmark Workshop 2020, MediaEval 2020.
 [9] Manfred Moosleitner, Benjamin Murauer, and Günther Specht. 2020. Detecting
     Conspiracy Tweets Using Support Vector Machines. (2020).
[10] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe,
     Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. Fakenews: Corona
     virus and 5g conspiracy task at mediaeval 2020. In MediaEval 2020 Workshop.
[11] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner,
     and Johannes Langguth. 2021. WICO Text: A Labeled Dataset of Conspiracy
     Theory and 5G-Corona Misinformation Tweets. In Proc. of the 2021 Workshop on
     Open Challenges in Online Social Networks. 21–25.
[12] Daniel Thilo Schroeder, Konstantin Pogorelov, and Johannes Langguth. 2020.
     Evaluating Standard Classifiers for Detecting COVID-19 Related Misinformation.
     (2020).
[13] Nguyen Manh Duc Tuan and Pham Quang Nhat Minh. 2020. FakeNews Detection
     Using Pre-trained Language Models and Graph Convolutional Networks. (2020).
[14] Lipo Wang. 2005. Support vector machines: theory and applications. Vol. 177.
     Springer Science & Business Media.
[15] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis.
     Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.
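
As an illustration of the first approach (Sections 3 and 5), the following is a minimal sketch of how tweets could be prepared for fasttext and trained with the hyperparameters reported in Section 5. The `__label__` prefix is the input convention of the fastText library. The helper names, the random (non-stratified) 70/30 split, and the seed are assumptions for illustration; the paper does not state how the validation split was drawn.

```python
import random


def to_fasttext_line(label, text):
    """Format one example as '__label__<class> <text>' on a single line."""
    # Collapse whitespace so that a multi-line tweet stays on one line.
    return f"__label__{label} {' '.join(text.split())}"


def split_train_valid(rows, valid_fraction=0.3, seed=0):
    """Shuffle rows and hold out a fraction for validation (assumed random split)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_valid = int(round(len(rows) * valid_fraction))
    return rows[n_valid:], rows[:n_valid]


def write_fasttext_file(rows, path):
    """Write (label, text) pairs to a file in fastText's supervised format."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in rows:
            handle.write(to_fasttext_line(label, text) + "\n")


def train_fasttext(train_path):
    """Train with the values reported in Section 5 (needs the fasttext package)."""
    import fasttext  # third-party dependency, imported only when training

    return fasttext.train_supervised(
        input=train_path, lr=0.7, dim=800, epoch=50, wordNgrams=1
    )
```

A call such as `train, valid = split_train_valid(rows)` followed by `write_fasttext_file(train, "train.txt")` and `train_fasttext("train.txt")` would tie the pieces together; the file name is a placeholder.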
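
As an illustration of the second approach (Section 3), the sketch below computes TF-IDF vectors in plain Python and combines the outputs of several classifiers by hard majority voting. It is a simplified stand-in under stated assumptions: whitespace tokenisation, raw term frequency, the smoothed IDF form ln((1 + N) / (1 + df)) + 1, and no vector normalisation. The PCA step and the actual learners (Decision Tree, Linear Discriminant Analysis, Logistic Regression) are omitted, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one dense TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # Tokens absent from a document get 0, which is why vectors are sparse.
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


def majority_vote(predictions):
    """Hard voting over a list of per-classifier label lists.

    Ties are resolved in favour of the label seen first, i.e. the
    earliest classifier in the list.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

With short texts, each vector has non-zero entries only for the few words in that sentence, which matches the observation in Section 3 that most feature values are zero.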
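
As an illustration of the evaluation metric used in Section 5, the following sketch computes the multi-class generalisation of the Matthews Correlation Coefficient from true and predicted labels. We assume this generalisation (the one also implemented by scikit-learn's `matthews_corrcoef`) corresponds to the task's scorer, which the paper does not specify. The function name and the sample labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multi-class MCC; returns 0.0 when the denominator is zero."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly predicted
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    denom_true = s * s - sum(v * v for v in t_k.values())
    denom_pred = s * s - sum(v * v for v in p_k.values())
    denominator = sqrt(denom_true * denom_pred)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical labels for the three classes A, B and C.
    truth = ["A", "A", "B", "C", "C", "A"]
    guess = ["A", "B", "B", "C", "A", "A"]
    print(round(multiclass_mcc(truth, guess), 3))
```

Unlike accuracy, this score is 0 for a classifier that always predicts the majority class, which is relevant given the class imbalance described in Section 4.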