Factuality Classification Using BERT Embeddings and Support Vector Machines

Biswarup Ray, Avishek Garain
Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, West Bengal, India
EMAIL: raybiswarup9@gmail.com (B. Ray); avishekgarain@gmail.com (A. Garain)
ORCID: 0000-0001-6225-3343 (A. Garain)

Abstract

Factuality can be defined as the category that determines the status of events according to the certainty with which they are presented. The first edition of the FACT task focused mainly on determining the factuality of verb-based events. The present edition aims at identifying noun-based events and determining the factuality of all events, whether verbal or nominal. We participated in Subtask-1 of the FACT 2020 task, which is to automatically propose a factuality tag for each event in the text. In this paper we present a method that extracts various features, namely BERT embeddings, Word2Vec embeddings and TF-IDF (Term Frequency-Inverse Document Frequency) scores of commonly recurring words, along with other manually extracted features, and passes them through an SVM (Support Vector Machine) classifier. Our system achieved a Macro F1-score of 36.6% and an accuracy of 59.9%, which is satisfactory relative to the performance of the other participating systems.

Keywords: BERT, embeddings, Factuality, SVM, Word2Vec

1. Introduction

With the ever-growing reach of the Internet and the increasing popularity of social media platforms, there has been a noticeable exponential growth of user-generated content and rumors on these platforms. Every second, a new story, write-up or post is created or shared. It might be a real fact, or it might be fake content created and shared as hate-speech propaganda or for other selfish motives. Classifying the factuality of such content is highly important, both to reduce negativity on these platforms and to prevent the personal harm that can be an immediate result of defamation caused by fake news. Recently, even a tech giant like Facebook reportedly lost 7.2 billion USD in one stroke by losing the trust of major companies such as Verizon Communications Inc., Hershey Co. and Coca-Cola Co. Verizon Communications Inc. and Hershey Co. stopped social media advertising after critics said that Facebook had failed to sufficiently police hate speech and disinformation on the platform, and Coca-Cola Co. said it would pause all paid advertising on all social media platforms for at least 30 days. Such is the importance of factuality analysis these days. However, identifying the factual status of events early is a complex task when sufficient evidence, such as responses and fact-checking sites, is unavailable. Building the fact-verification pipeline is quite challenging, despite the recent progress in natural language processing, databases and information retrieval [1]. Many prior studies began by manually inspecting tweet messages in the training datasets to come up with an initial curated list of word features. It was found that these words could be categorized into meaningful groups, which have been useful in identifying an author's certainty in journalism, detecting disagreement in online dialogue and determining the veracity of rumors [2, 3, 4].

The verification of facts in this task, that is, the commitment of the source to the truth value of the event, is not grounded in real-world knowledge but is assessed solely on the basis of the way these facts are presented, as judged by the annotator. From this perspective, the task can be conceived as a core methodology for other tasks ranging from fake-news detection to fact-checking, paving the way for future tasks that compare what is narrated in the text (fact tagging) with what is happening in the world (fact-checking and fake-news detection). Here, we have designed a system that extracts various word-based features using pre-trained BERT vectors, combines them with other manually extracted features, and classifies the events into their corresponding classes using an SVM (Support Vector Machine) classifier.

The rest of the paper is organized as follows. Previous work related to this task is briefly described in Section 2. Section 3 describes the data on which the task was performed. The methodology followed is described in Section 4. This is followed by the results and concluding remarks in Sections 5 and 6 respectively.

2. Related Works

In 2019, the FACT shared task (Factuality Annotation and Classification Task) was hosted by the First Iberian Languages Evaluation Forum (IberLEF) [5]. The corpus contained Spanish texts with more than 5,000 events classified as F (Fact), CF (Counterfact) or U (Undefined). The corpus was divided into two subcorpora: the training corpus (80%) and the testing corpus (20%). Many teams participated and proposed different system designs for the task.

Premjith et al. [6] proposed a system based on word embeddings using a Random Forest classifier. Since the data was unbalanced in nature, the implementation assigned a higher weight to the CF label to improve the prediction of the less frequent categories. The model obtained an accuracy of 72.1% and a Macro F1-score of 0.561 when tested. Giudice [7] proposed a system with a character-level convolutional recurrent neural network. It took advantage of tokenization to classify individual words within the text. Each word was represented as a fixed-size list of vectors, and an event flag was added to indicate whether the word represented an event or not. In the final step, a dense layer was applied to obtain a classification for each word among the three classes. The system obtained an accuracy of 63.5% and a Macro F1-score of 0.554 when tested. For this task, Mao [8] chose the BERT-Base multilingual cased model. The corpus was divided into two parts, Uruguayan texts and Spanish texts, for training. Training and prediction were carried out separately for the two subcorpora, and both outputs were then combined to create the final result. The model obtained an accuracy of 62.2% and a Macro F1-score of 0.489 when tested. Another team, macro128 (Pastorini), used the SentencePiece tokenizer in its preprocessing phase. A pre-trained BERT language model was used, with one final layer classifying each token among the three possible categories. Since not every word was initially classified, each token was randomly assigned to a category. The model was initially trained without the classification layer, until convergence, for a maximum of 100 epochs. The entire model was then trained to convergence for a maximum of 100 epochs, using the F1 measure for early stopping. The system obtained an accuracy of 57.9% and a Macro F1-score of 0.362 when tested.

3. Data

The corpus used for this task contains Spanish texts with approximately 6,300 events classified into their respective factuality categories. The texts belong to the journalistic register, and most of them come from the political sections of Spanish and Uruguayan newspapers. The three categories established in the dataset are:

• Fact (F): current and past situations in the world that are presented as real.
• Counterfact (CF): current and past situations that the writer presents as not having happened.
• Undefined (U): possibilities, future situations, predictions, hypotheses and other options.

Words have been tagged with these factuality tags. Each tagged word is related to some event, and it is the word's placement in a sentence that marks the sentence as describing an event. We have divided the dataset in the ratio 80:20 for training and validation purposes respectively. The distribution of data instances is given in Table 1.

Table 1: Distribution of the labels in the dataset

Label   Train   Validation
F        2777      694
CF        223       55
U        1155      289
All      4155     1038
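The split itself is easy to reproduce; below is a minimal sketch with scikit-learn. Stratifying on the label is our assumption (the paper states only the 80:20 ratio, though the proportions in Table 1 are consistent with a stratified split), and the toy `texts` and `labels` stand in for the real corpus.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the corpus: parallel lists of event texts
# and their F/CF/U tags (the real data is described above).
texts = [f"texto {i}" for i in range(15)]
labels = ["F"] * 9 + ["CF"] * 3 + ["U"] * 3

# 80:20 train/validation split, stratified on the factuality label so the
# per-label proportions of Table 1 are preserved (stratification is an
# assumption; the paper states only the ratio).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
```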
4. Methodology

For the FACT task, our method is based on an SVM classifier with BERT embeddings, Word2Vec embeddings and the TF-IDF (Term Frequency-Inverse Document Frequency) scores of commonly recurring words as input features. The system architecture of the proposed method is shown in Fig. 1.

Figure 1: The architecture of the proposed method for the factuality classification task.

4.1. Feature Extraction

To obtain the BERT [9] embeddings, the token representation of a given sentence is first obtained from the pre-trained BERT model with the help of a WordPiece (cased) model. The pre-trained BERT models are trained on a large corpus (Wikipedia + BookCorpus). In FACT, we used the BERT-Base multilingual cased model, for two reasons. First, multilingual models are much better than English-only models for the Spanish documents in FACT, since a strictly English model splits tokens unavailable in its vocabulary into sub-tokens, which affects the accuracy of the classification task. Second, although BERT-Large generally outperforms BERT-Base on English NLP tasks, BERT-Large versions of the multilingual models had not been published yet. The tokens are indexed and, along with their segment ids, fed to the BERT model as torch tensors; these are termed the input embeddings. The hidden-state output has four dimensions, in the following order:

• The layer number (13 layers): the first element is the input embeddings, and the rest are the outputs of each of BERT's 12 layers.
• Batch number (number of sentences).
• Word/token number (number of tokens in the sentence).
• Feature number (768 features).

The batch dimension is not needed, so it is removed. The output is then permuted into the desired shape of [tokens, layers, features]. To get a single vector for an entire sentence, a simple approach is used: averaging the second-to-last hidden layer over all tokens, producing a single vector of length 768. The values of these vectors are contextually dependent. An example of the input embeddings fed to the BERT model for a particular sentence is shown in Fig. 2.

Figure 2: An example of a set of input tokens with positional, segment and transformer embeddings.
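The paper does not name the exact library used for this step, so the following minimal sketch is written against the Hugging Face transformers API under that assumption. It reproduces the procedure just described: feed token and segment ids to BERT-Base multilingual cased, drop the batch dimension, and average the second-to-last hidden layer over tokens to get one 768-dimensional sentence vector.

```python
import torch
from transformers import BertModel, BertTokenizer

# BERT-Base multilingual cased, exposing all 13 hidden states
# (input embeddings + the outputs of the 12 transformer layers).
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def sentence_embedding(sentence: str) -> torch.Tensor:
    # WordPiece-tokenize and build the token-id / segment-id tensors.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: 13 tensors of shape [batch, tokens, 768].
    # Take the second-to-last layer and drop the batch dimension.
    token_vecs = outputs.hidden_states[-2].squeeze(0)  # [tokens, 768]
    # Average over tokens -> one contextual 768-dim sentence vector.
    return token_vecs.mean(dim=0)

emb = sentence_embedding("El parlamento aprobó la ley.")  # shape: [768]
```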
The vector representation of each word is produced using the Gensim module. The vector of each word encodes the contexts in which the word appears (Word2Vec). Each of the texts is also converted into a numerical vector using the word vectors produced: we train a Doc2Vec model by feeding in our text data, and applying this model to the texts yields their representation vectors. The manually extracted features added to the model for classification, apart from the BERT embeddings, are the following (see the sketch after this list):

1. The vector representations obtained using Word2Vec.
2. The TF-IDF scores of the words that frequently occur in the text. TF counts the number of times a word recurs in the dataset, while IDF measures the relative importance of a word depending on how many texts it can be found in; these scores are added as features to filter and reduce the size of the final feature set.
3. Normalized counts of words with positive, negative and neutral sentiment in Spanish, obtained by dividing by the word count of the corresponding sentence [10].
4. Normalized frequency of auxiliary verbs such as "es", "estaba", "son" and "fueron", obtained by dividing by the word count of the sentence.
5. The subjectivity score of the text, calculated using predefined libraries [11].
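A minimal sketch of this feature pipeline with Gensim and scikit-learn follows. The Doc2Vec hyperparameters, the 500-word TF-IDF cut-off and the tiny lexicons are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the corpus texts described in Section 3.
texts = ["el parlamento aprobó la ley",
         "el acuerdo no se firmó",
         "el gasto podría aumentar"]

# Doc2Vec representation vectors (hyperparameters are assumptions).
tagged = [TaggedDocument(words=t.split(), tags=[i])
          for i, t in enumerate(texts)]
d2v = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=20)
doc_vecs = np.vstack([d2v.infer_vector(t.split()) for t in texts])

# TF-IDF over the most frequent words (the 500-word cut-off is assumed).
tfidf = TfidfVectorizer(max_features=500)
tfidf_feats = tfidf.fit_transform(texts).toarray()

# Tiny placeholder lexicons; the real ones would be the Spanish sentiment
# lists of [10] and a fuller set of auxiliary verbs.
POS, NEG = {"bueno", "excelente"}, {"malo", "terrible"}
AUX = {"es", "estaba", "son", "fueron"}

def manual_features(text):
    # Length-normalized lexicon counts, as in items 3 and 4 above.
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return [sum(t in POS for t in tokens) / n,   # positive-word ratio
            sum(t in NEG for t in tokens) / n,   # negative-word ratio
            sum(t in AUX for t in tokens) / n]   # auxiliary-verb ratio

manual = np.array([manual_features(t) for t in texts])

# Final feature matrix: the BERT sentence vectors of the previous sketch
# would be concatenated here alongside Doc2Vec, TF-IDF and manual features.
X = np.hstack([doc_vecs, tfidf_feats, manual])
```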
4.2. Classification

For the classification task, the features extracted through the processes described above are selected using the feature-importance ranking of each feature. To train the SVM (Support Vector Machine) model with a linear kernel [12], the categorical labels in the data were label-encoded into numerical values. The numerical labels, along with the selected features, were used to train the SVM model. As in [13], class weights were assigned to the SVM model, since the dataset is unbalanced and contains very few texts with the Counterfact (CF) label.
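A minimal sketch of this training step with scikit-learn, continuing from the feature matrix X of the previous sketch; the "balanced" weighting scheme below is an assumption, since the paper states only that class weights were used.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical F/CF/U tags aligned with the rows of the feature matrix X
# from the previous sketch.
labels = ["F", "U", "U"]

# Encode the categorical labels as integers.
le = LabelEncoder()
y = le.fit_transform(labels)

# Linear-kernel SVM [12]. class_weight="balanced" reweights each class
# inversely to its frequency, boosting the rare CF label; the exact
# weights used in the paper are not stated, so "balanced" is an assumption.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X, y)

pred = le.inverse_transform(clf.predict(X))  # back to F/CF/U tags
```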
5. Results

The FACT corpus contains Spanish texts with approximately 6,300 events classified into the respective categories F (Fact), CF (Counterfact) and U (Undefined). In FACT, performance is measured against the evaluation corpus using the Macro-F1(%), Macro-Precision(%), Macro-Recall(%), Accuracy(%) and Global accuracy metrics; Macro-F1 is the most important measure for this task. The results for the unseen test set are presented in Table 2.

Table 2: Result metrics of our system for Subtask-1

Model           Macro-F1(%)   Macro-Precision(%)   Macro-Recall(%)   Accuracy(%)
Our model           36.6             35.7                39.4             59.9
FACT_baseline       24.6             25.4                25.1             52.4

As shown in Table 2, our proposed method outperformed the FACT baseline on all of the metrics. Using class weights is one of the main reasons for the better performance, as it reduces misclassification of the CF label, which is present in much smaller quantity than the other labels. Using a variety of features and applying feature selection for the classification task also contributed to the better performance. However, the limited overall performance of the proposed method may be due to the SVM model itself, since the word vectors may be highly non-linearly separable. In addition, even with class weights, the small number of CF instances in the training data remains a source of misclassification.

6. Conclusion

We have presented the system we used to participate in FACT: Factuality Analysis and Classification Task at IberLEF 2020. Compared with previous approaches, ours differs in both architecture and feature-extraction methodology. It is a generalized and versatile framework and showed satisfactory performance among the participating systems in the FACT 2020 evaluations. In future work, we will use an ensemble of different classification models to increase performance, and we will explore practical applications such as fake-news detection.

References

[1] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, 2014, pp. 18–22.
[2] A. Misra, M. Walker, Topic independent identification of agreement and disagreement in social media dialogue, arXiv preprint arXiv:1709.00661 (2017).
[3] U. D. Reichel, P. Lendvai, Veracity computing from lexical cues and perceived certainty trends, arXiv preprint arXiv:1611.02590 (2016).
[4] S. Soni, T. Mitra, E. Gilbert, J. Eisenstein, Modeling factuality judgments in social media text, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 415–420.
[5] A. Rosá, I. Castellón, L. Chiruzzo, H. Curell, M. Etcheverry, A. Fernández, G. Vázquez, D. Wonsever, Overview of FACT at IberLEF 2019: Factuality analysis and classification task, 2019.
[6] B. Premjith, K. P. Soman, P. Poornachandran, Amrita_CEN@FACT: Factuality identification in Spanish text, in: IberLEF@SEPLN, 2019.
[7] V. Giudice, Aspie96 at FACT (IberLEF 2019): Factuality classification in Spanish texts with character-level convolutional RNN and tokenization, in: IberLEF@SEPLN, 2019.
[8] J. Mao, W. Liu, Factuality classification using the pre-trained language representation model BERT, in: IberLEF@SEPLN, 2019.
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. arXiv:1810.04805.
[10] A. Garain, S. K. Mahata, Sentiment analysis at SEPLN (TASS)-2019: Sentiment analysis at tweet level using deep learning (2019).
[11] A. Garain, Humor analysis based on human annotation (HAHA)-2019: Humor analysis at tweet level using deep learning (2019).
[12] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297. URL: https://doi.org/10.1007/BF00994018. doi:10.1007/BF00994018.
[13] A. Garain, A. Basu, The Titans at SemEval-2019 Task 5: Detection of hate speech against immigrants and women in Twitter, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 494–497.
[14] A. Rosá, L. Alonso, I. Castellón, L. Chiruzzo, H. Curell, A. Fernández, S. Góngora, M. Malcuori, G. Vázquez, D. Wonsever, Overview of FACT at IberLEF 2020: Events detection and classification (2020).
[15] J. Mao, W. Liu, Factuality classification using the pre-trained language representation model BERT, in: IberLEF@SEPLN, 2019, pp. 126–131.
[16] A. Garain, A. Basu, The Titans at SemEval-2019 Task 6: Offensive language identification, categorization and target identification, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 759–762.
[17] A. Garain, S. K. Mahata, S. Dutta, Normalization of numeronyms using NLP techniques, in: 2020 IEEE Calcutta Conference (CALCON), 2020, pp. 7–9.