Short Text Classification Using TF-IDF Features and Fast Text Learner

Zeshan Khan, Umar Naseer, Muhammad Atif Tahir
{zeshan.khan,umar.naseer,atif.tahir}@nu.edu.pk
FAST School of Computing, National University of Computer and Emerging Sciences, Pakistan

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'21, December 13-15 2021, Online

ABSTRACT

The spread of COVID-19 is a challenge for the health sector. The pandemic has created health and financial problems for the whole world. Medical experts are working on the diagnosis of the disease and on the reasons behind its spread. At the same time, conspiracies about COVID-19 and its spread are circulating on social media, including Twitter. In this research, COVID-19 conspiracies are analyzed using public tweets. Tweets containing conspiracies are separated from tweets about the disease, its symptoms, and other related discussions. The analysis of the COVID-19 related tweets resulted in three conspiracy classes, covering COVID-19 tweets without any conspiracy and tweets containing conspiracies. A model is presented for classifying tweets into the three conspiracy classes with a Matthews Correlation Coefficient (MCC) of 0.294.

1 INTRODUCTION

Social media has grown from closed group chats into a source of information sharing. This sharing has raised several trust issues about the information posted on social media platforms. Currently, Twitter is one of the most used public post-sharing platforms. A huge number of tweets are shared daily, and some of them may contain misinformation.

In 2019, the disease COVID-19 emerged and badly damaged human lives and the economy. Several solutions have been proposed for the treatment of the disease and the control of its spread. The guidelines of the health organizations are undermined by false information shared by various individuals. Some of this false information relates the spread of COVID-19 to technological inventions, including 5G. A lot of people link the COVID-19 disease to 5G technology towers. COVID-19 and 5G technology emerged in the same period, but that does not show that one is the cause of the other.

2 RELATED WORK

NLP has been effective over the last decade for various analyses of textual data. One area of textual analysis is text classification. Text classification becomes more challenging when the provided text consists of very short documents. It is very difficult to build context from a short document; the benefit of a short document is the ease of processing.

There is significant work on text classification, and especially on short text classification. Some researchers applied various textual feature extraction techniques and then applied classifiers to the extracted features, for example an SVM [14] with TF-IDF as the feature vector [9, 12]. The SVM-based approaches are fast at detecting or classifying text but have lower detection accuracy. Another group of studies uses neural network-based approaches. These researchers took pre-trained neural networks such as BERT [1] and fine-tuned them on the classification dataset [8]. A further type of research for this task is based on graph neural networks (GNN). GNNs are neural networks that capture the dependencies in a graph by message passing between the nodes of the graph. Several variants of GNN are used in this domain, including the graph convolutional network (GCN), graph attention network (GAT), and graph recurrent network (GRN). GNNs achieve good detection accuracy at a high time and computational cost [3, 7, 13].

3 APPROACH

The research is based on three different methodologies for a diverse detection of the tweet class. Neural network approaches perform well in the current era, with the limitation that they require a large amount of data.

The first approach explored in this research is the use of cosine similarity between text vectors for the detection of conspiracies in the tweet texts [2, 5]. The idea is to split the tweet texts into sentences and apply the learner for classification. A similar learner is trained on the whole tweet as a single unit. The fasttext learner was evaluated on both the split and the combined tweets using the MCC. The architecture is presented in Figure 1.

Figure 1: Architecture for FAST Text.

The second approach used in the research is the classification of the TF-IDF vector [6]. This methodology converts tweets into sentences to make the instances smaller. The TF-IDF features are extracted from the sentences. The TF-IDF extraction returned a feature vector of length 1045, with most values being zero. Principal component analysis was applied as a feature reduction technique to reduce the number of features, for computational and accuracy advantages [15]. The reduced feature vectors were classified using majority voting over several diverse learners, including a Decision Tree Classifier, Linear Discriminant Analysis, and Logistic Regression. The architecture is summarised in Figure 2.

Figure 2: Methodology of Majority Voting of Classifiers using TF-IDF features.

The third algorithm for the detection of conspiracies in the tweets is based on a fully connected neural network with the TF-IDF features of the tweets [6]. The important words are selected in two phases, each removing words from the tweet text. In the first phase, the categories of words that are more important for the decision between conspiracy classes are selected, which include nouns, verbs, etc. The second phase of the selection of important terms is based on principal component analysis (PCA). The top 500 PCA-based features are provided to the neural network for the ternary decision between conspiracy classes. The detailed architecture of the algorithm can be seen in Figure 3.

Figure 3: Architecture of Neural Network based Classification

4 DATASET

The research is conducted using the dataset of the MediaEval 2021 task FakeNews: Corona Virus and Conspiracies Multimedia Analysis [10, 11]. The dataset is a set of tweets from various Twitter accounts. Each tweet includes the tweet text used for the detection task, along with some other information that is available for various objectives. The conspiracy classification task needs only the tweet text and the conspiracy class for the training data. The training data consisted of 1511 tweets of various lengths, from a few words to several sentences. The class distribution of the tweets for classes A, B, and C was 754, 262, and 495 respectively. The test set comprised 266 tweets to be assigned to one of the three classes. The dataset shows a class imbalance between the three provided classes. Another finding in the dataset concerns class decidability: classes B and C are much closer to each other than to class A.

5 RESULTS AND ANALYSIS

Three approaches were designed to solve the challenge of conspiracy detection in tweets. The first approach, based on fasttext classification, was evaluated on the training data with 30% held out as validation data. It was evaluated by training with various wordNgrams, learning rates, dimensions, and epochs. The best hyperparameters for fasttext were word 1-grams with a learning rate of 0.7 and 800 dimensions. The model was trained for 50 epochs due to the limited availability of resources. This approach resulted in an accuracy of 0.89 on the validation set, the 30% extracted from the training dataset. The same model, when applied to the test dataset, resulted in an MCC of 0.294.

The second approach was based on TF-IDF vector classification using a majority voting classifier. This methodology used PCA for dimensionality reduction, selecting the top 500 features out of 7147 features. It resulted in an accuracy of 0.56 with an MCC of 0.20 when executed on the training dataset with 30% as the validation dataset. The same approach, when applied to the test dataset, resulted in an MCC of 0.03.

The third approach is based on a fully connected neural network on reduced TF-IDF features. In this approach, the words of the tweet were selected based on the importance of their category, i.e. part of speech (POS), in decision making. We applied various learners to several categories of the words in a tweet, e.g. verbs, nouns, adverbs, adjectives, etc. These learners gave guidance on the decidability of the various POS. The selected set of words was then used to compute the TF-IDF, and PCA was used to reduce the feature vector length from 6898 to 500. The approach was applied to the validation data and the test data and resulted in MCC scores of 0.17 and 0.07 respectively.

6 CONCLUSION AND FUTURE WORK

Short text classification is a challenging topic in natural language processing. Several challenges arise because little sentence context is available when the text has few sentences. Various learners were applied to short text classification, and fasttext [2] turned out to be the best learner for tweet classification. The results of fasttext suggest that neural-network (NN) based approaches or LSTM [4] can give better results for the classification of short text.

The results of the approaches above show that NN-based approaches can result in better detection accuracy. Deep learning (DL) based approaches will be explored further to improve detection accuracy. The data is limited and its nature is very close to various NLP datasets, so transfer learning approaches can also be beneficial; e.g. BERT can be used with pre-trained weights for a better understanding of words and relationships, and then fine-tuned on the tweet data to decide between conspiracy classes.

REFERENCES

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[2] Stanislav Glebik. 2021. FAST TEXT. https://github.com/facebookresearch/fastText/. [Online; accessed 25-November-2021].
[3] Abdullah Hamid, Nasrullah Shiekh, Naina Said, Kashif Ahmad, Asma Gul, Laiq Hassan, and Ala Al-Fuqaha. 2020. Fake news detection in social media using graph neural networks and NLP Techniques: A COVID-19 use-case. arXiv preprint arXiv:2012.07517 (2020).
[4] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[5] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
[6] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2020. Mining of massive data sets. Cambridge University Press.
[7] Hu Linmei, Tianchi Yang, Chuan Shi, Houye Ji, and Xiaoli Li. 2019. Heterogeneous graph attention networks for semi-supervised short text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 4821–4830.
[8] A Malakhov, A Patruno, and S Bocconi. 2020. Fake news classification with BERT. In Multimedia Evaluation Benchmark Workshop 2020, MediaEval 2020.
[9] Manfred Moosleitner, Benjamin Murauer, and Günther Specht. 2020. Detecting Conspiracy Tweets Using Support Vector Machines. (2020).
[10] Konstantin Pogorelov, Daniel Thilo Schroeder, Luk Burchard, Johannes Moe, Stefan Brenner, Petra Filkukova, and Johannes Langguth. 2020. FakeNews: Corona virus and 5G conspiracy task at MediaEval 2020. In MediaEval 2020 Workshop.
[11] Konstantin Pogorelov, Daniel Thilo Schroeder, Petra Filkuková, Stefan Brenner, and Johannes Langguth. 2021. WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets. In Proc. of the 2021 Workshop on Open Challenges in Online Social Networks. 21–25.
[12] Daniel Thilo Schroeder, Konstantin Pogorelov, and Johannes Langguth. 2020. Evaluating Standard Classifiers for Detecting COVID-19 Related Misinformation. (2020).
[13] Nguyen Manh Duc Tuan and Pham Quang Nhat Minh. 2020. FakeNews Detection Using Pre-trained Language Models and Graph Convolutional Networks. (2020).
[14] Lipo Wang. 2005. Support vector machines: theory and applications. Vol. 177. Springer Science & Business Media.
[15] Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and intelligent laboratory systems 2, 1-3 (1987), 37–52.
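
As a supplementary illustration of the evaluation metric used in this paper, the sketch below computes a multiclass Matthews Correlation Coefficient from true and predicted labels. It assumes the standard multiclass generalization of MCC; the paper does not state which variant the task organizers used, and the function name `multiclass_mcc` is hypothetical.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from label counts (assumed variant)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    labels = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in labels)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in labels)
    denom = math.sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the score is undefined,
    # e.g. when every prediction is the same class.
    return cov_tp / denom if denom else 0.0
```

With the class distribution reported in Section 4 (754, 262, and 495 tweets), a classifier that always predicts class A would reach an accuracy of about 0.50 on the training data, yet this function would return 0.0 for it. This may help explain why MCC is a more informative score than accuracy under class imbalance.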
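
As a supplementary illustration of the second approach in Section 3, the following minimal sketch splits a tweet into sentences, builds one TF-IDF vector per sentence, and combines classifier outputs by majority voting. It is not the authors' implementation: the sentence splitter, the whitespace tokenizer, the unsmoothed IDF, and all function names are assumptions, and the PCA step and the individual learners are omitted.

```python
import math
import re
from collections import Counter


def split_sentences(tweet):
    """Split a tweet on '.', '!' or '?' (a simplifying assumption)."""
    parts = re.split(r"[.!?]+", tweet)
    return [p.strip() for p in parts if p.strip()]


def tfidf_vectors(sentences):
    """Return (vocabulary, vectors) with one TF-IDF vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency is normalized by sentence length.
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]
```

Because each sentence contains only a few of the vocabulary terms, most entries of every vector are zero, which matches the sparsity described in Section 3 and motivates a reduction step such as PCA before classification.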