Factuality Classification Using the Pre-trained Language Representation Model BERT

Jihang Mao1, Wanli Liu2

1 Montgomery Blair High School, 51 University Blvd E, Silver Spring, MD 20901, USA
2 TAJ Technologies, Inc., 7910 Woodmont Ave #1214, Bethesda, MD 20814, USA
jim-blair@hotmail.com

Abstract. In this paper we report our participation in the 2019 FACT (Factuality Analysis and Classification Task) challenge, in which a corpus containing texts with verbal events is provided and systems must automatically propose a factual tag for each event. In this task facts are not verified against the real world; they are assessed only with respect to how they are presented by the source. It is therefore important to find indications in the linguistic context surrounding the events. Our approach utilizes BERT, a multi-layer bidirectional transformer encoder that learns deep bidirectional representations of text, and the pre-trained model is fine-tuned on the FACT training data. The representations of an event and its sentence are fed into an output layer for classification. Our approach achieves encouraging results in the evaluation, which demonstrates that it is competitive and applicable to multilingual text categorization tasks.

Keywords: BERT; Factuality Detection; Text Categorization; Multilingual Model; Evaluation.

1 Introduction

With the exponential growth of user-generated content, rumors on social media platforms have attracted wide attention. In a Pew Research Center poll, 64% of US adults said that “made-up news” has caused a “great deal of confusion” about the facts of current events [1]. However, identifying the factual status of events early is a hard task without sufficient evidence such as responses and fact-checking sites. Automating the fact-checking pipeline remains challenging, despite recent progress in natural language processing, databases and information retrieval [2]. Many prior studies began by manually inspecting tweet messages in the training dataset to come up with an initial human-curated list of word features. It was found that these words could be categorized into meaningful groups. Such “cue words” have been reported to be useful in identifying an author’s certainty in journalism, determining the veracity of rumors and detecting disagreement in online dialogue [3-5].

It is crucial to determine whether event references are presented as having taken place or as potential or unaccomplished events. Despite its centrality for Natural Language Understanding, this task has been under-researched, with [6, 7] as references for English and [8] for Spanish. Besides the inherent difficulty of the task, the main bottleneck has usually been the lack of annotated resources. Following Saurí [9], factuality is understood as the category that determines the factual status of events, i.e., whether or not events are presented as certain. Adopting the Saurí model with some changes, Wonsever et al. [10] created an annotated corpus with factuality information and an automatic annotation tool based on supervised learning. Alonso et al.
[11] created a tool for the annotation of factuality in Spanish texts through automatic processing, carried out along three different axes: multilevel, multidimensional and multitextual.

FACT (Factuality Analysis and Classification Task) is a task to classify events in Spanish texts (from Spanish and Uruguayan newspapers) according to their factuality status. The goal of FACT is to determine the status of verbal events with respect to factuality in Spanish texts. In this task, participating teams are given a text with its events already identified, and are required to automatically assign a factuality category to each of the events. Current and past situations in the world that are presented as real are categorized as Fact, while situations that the writer presents as not having happened in the real world are categorized as Counterfact. Situations presented as uncertain are categorized into a class that includes a number of other values such as different kinds of Future, Potential or Undefined [10]. The corresponding tags are F (Fact), CF (CounterFact) and U (Undefined).

A brief description of our method for the FACT task is presented in Section 2. In Section 3 we show the results of our method on the official FACT test datasets. In Section 4 we present a discussion of the results and the conclusions of our participation in this challenge.

2 Methods

For the FACT task, our method builds on BERT, which has obtained state-of-the-art performance on most NLP tasks [12]. More specifically, given a sentence, our method first obtains its token representation from the pre-trained BERT model using a case-preserving WordPiece model, including the maximal document context provided by the data. Next we formulate this as a sentence-pair classification task by feeding the representations of the event and its sentence into an output layer, a multiclass classifier over the factual tags. Finally, we combine the outputs of the models for Spanish and Uruguayan texts to generate the result.

BERT utilizes a multi-layer bidirectional transformer encoder which learns deep bidirectional representations that can later be fine-tuned for a variety of tasks such as text classification. Before BERT, deep learning models such as the convolutional neural network (CNN) and the Bi-directional Long Short-Term Memory (Bi-LSTM) had greatly improved the performance of text classification over the preceding few years [13]. OpenAI GPT [14] demonstrated the effectiveness of generative pre-training of language models.

The pre-trained BERT models are trained on a large corpus (Wikipedia + BookCorpus). Several pre-trained models have been released. In FACT, we chose the BERT-Base Multilingual Cased model for the following reasons. First, a multilingual model is better suited to the Spanish documents in FACT, because the English-only model splits tokens that are not in its vocabulary into sub-tokens, which affects the accuracy of the classification task. Second, although BERT-Large generally outperforms BERT-Base on English NLP tasks, BERT-Large versions of the multilingual models have not been released. Third, the multilingual cased model fixes normalization issues in many languages, so it is recommended for languages with non-Latin alphabets (and is often better for most languages with Latin alphabets as well).
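The following minimal sketch illustrates this sentence-pair formulation with the BERT-Base Multilingual Cased model using the HuggingFace transformers library; the library choice, example sentence and label order are illustrative assumptions and not details of our actual implementation.

```python
# Minimal sketch of the sentence-pair setup; not the code used in our submission.
# The example sentence and the label order are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["F", "CF", "U"]  # Fact, CounterFact, Undefined

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

sentence = "El presidente anunció que firmará el acuerdo la próxima semana."
event = "firmará"  # the verbal event to be classified

# Encode the pair as [CLS] sentence [SEP] event [SEP]
inputs = tokenizer(sentence, event, truncation=True, max_length=256,
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # classification head is untrained here
print(LABELS[int(logits.argmax(dim=-1))])
```

Fine-tuning then updates both the encoder and this classification head on the FACT training data.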
In FACT, we use the final hidden state corresponding to the special [CLS] token as the aggregate sequence representation and feed it into an output layer for classification (Figure 1).

Fig. 1. Architecture of our model for sentence-pair classification. Following [12], we denote the input embedding as E, the final hidden vector of the special [CLS] token as C, and the final hidden vector of the i-th input token as Ti. The sentence tokens and the event token form the two input segments, separated by [SEP], and C is fed into the output layer to produce the class label.

In addition, in order to address the issue of local multilinguality, i.e., the differences between the texts from Spanish and Uruguayan newspapers, we build separate models for Spanish and Uruguayan texts. We train the two models and predict the factual tags on the corresponding training and test texts. We then combine the outputs of the two models to generate the final results.

3 Results

The FACT corpus contains Spanish texts with approximately 5,000 verbal events classified as F (Fact), CF (Counterfact), and U (Undefined). It has been divided into two subsets: the training corpus with 4,000 events and the test corpus with 1,000 events. In FACT, performance is measured on the evaluation corpus using the following metrics: Precision, Recall and F1 score for each category, Macro-F1, and global accuracy. Macro-F1 is the main measure for this task. Here we present the results on the test set. In our best submission, the model was fine-tuned using the hyperparameter values suggested in [12]: learning rate (Adam) = 2e-5, number of epochs = 3, max sequence length = 256, and batch size = 16. When fine-tuning the model for Spanish texts, we divided the training set into two subsets: 1,671 events from 20 articles for training and 336 events from 6 articles for development. When fine-tuning the model for Uruguayan texts, 1,679 events from 22 articles were used for training and 657 events from 8 articles for development.

As shown in Table 1, our best submission significantly outperformed the baseline “fact” in both Macro-F and accuracy, and the Macro-F score of our submission is only 0.072 below the highest score. We placed third among all participants, which demonstrates a good performance of our system in automatically classifying events in Spanish texts according to their factuality status.

Table 1: Official final results for FACT

System         Macro-F   Accuracy
Our proposal   0.489     0.622
Baseline       0.340     0.524
Best team      0.561     0.721
Runner-up      0.554     0.635

However, although the accuracy of our system is reasonable compared to the other systems (0.099 and 0.013 behind the top two systems, respectively), it is far from the accuracy we achieved on the development set (0.622 vs. 0.825). Table 2 shows the accuracy of the models for Spanish texts, Uruguayan texts and mixed texts on the corresponding development sets. The performance gap might be caused by differences between the training and test sets or by over-fitting of the models. We will conduct a further error analysis after the Gold Standard classifications of the test set are released.

Table 2: Accuracy of models fine-tuned on different texts and evaluated on the corresponding development sets

Models            Accuracy
Spanish texts     0.840
Uruguayan texts   0.835
Mixed texts       0.825
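For concreteness, the sketch below shows how the fine-tuning configuration reported above (Adam learning rate 2e-5, 3 epochs, maximum sequence length 256, batch size 16) could be reproduced with the HuggingFace Trainer; the toy examples, the label mapping and the dataset wrapper are illustrative assumptions rather than our actual training code.

```python
# Hedged sketch of the reported fine-tuning configuration; not our training code.
# Toy examples stand in for the FACT training split; field names are assumptions.
import torch
from torch.utils.data import Dataset
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

LABEL2ID = {"F": 0, "CF": 1, "U": 2}

class FactDataset(Dataset):
    """Wraps (sentence, event, tag) triples as BERT sentence-pair inputs."""
    def __init__(self, examples, tokenizer):
        self.examples, self.tokenizer = examples, tokenizer
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        sentence, event, tag = self.examples[idx]
        enc = self.tokenizer(sentence, event, truncation=True,
                             padding="max_length", max_length=256)
        enc = {k: torch.tensor(v) for k, v in enc.items()}
        enc["labels"] = torch.tensor(LABEL2ID[tag])
        return enc

# Toy data standing in for one of the FACT training subsets.
train_examples = [("El acuerdo fue firmado ayer.", "firmado", "F"),
                  ("Negó que hubiera firmado el acuerdo.", "firmado", "CF")]

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABEL2ID))

args = TrainingArguments(output_dir="fact-bert", learning_rate=2e-5,
                         num_train_epochs=3, per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=FactDataset(train_examples, tokenizer))
trainer.train()
```

In our setup, one model of this kind is fine-tuned per newspaper source (Spanish and Uruguayan), and the predictions of the two models are combined on the corresponding portions of the test set.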
4 Discussion & Conclusion

We described the approach with which we participated in FACT, the Factuality Analysis and Classification Task at IberLEF 2019. Compared to previous methods, our approach differs in several significant respects, from the system architecture to the actual implementation. It is a general and robust framework and showed competitive performance among all participating systems in the FACT evaluation. In future work, we will use a new set of random seeds in each run to mitigate over-fitting, and we plan to explore the use of our system in practical applications such as fact-checking and fake-news detection.

Acknowledgements

The authors would like to thank Dr. Yutao Zhang for providing Jihang Mao with the internship opportunity at George Mason University and for valuable suggestions and comments on the manuscript. The authors would also like to thank the FACT task organizers for providing the data for the task.

References

1. Pew Research Center: Many Americans Believe Fake News Is Sowing Confusion. https://www.journalism.org/2016/12/15/many-americans-believe-fake-news-is-sowing-confusion (retrieved on June 21, 2019)
2. Vlachos, A., Riedel, S.: Fact checking: task definition and dataset construction. Association for Computational Linguistics, page 18 (2014)
3. Soni, S., Mitra, T., Gilbert, E., Eisenstein, J.: Modeling factuality judgments in social media text. In: ACL (2), pages 415–420 (2014)
4. Reichel, U., Lendvai, P.: Veracity computing from lexical cues and perceived certainty trends. In: Proceedings of the 2nd Workshop on Noisy User-generated Text, pages 4–13 (2016)
5. Misra, A., Walker, M.A.: Topic independent identification of agreement and disagreement in social media dialogue. In: Conference of the Special Interest Group on Discourse and Dialogue, page 920 (2013)
6. Saurí, R., Pustejovsky, J.: FactBank: a corpus annotated with event factuality. Language Resources and Evaluation, 43(3), 227 (2009)
7. Gorrell, G., Aker, A., Bontcheva, K., Derczynski, L., Kochkina, E., Liakata, M., Zubiaga, A.: SemEval-2019 Task 7: RumourEval, Determining Rumour Veracity and Support for Rumours. In: Proceedings of the 13th International Workshop on Semantic Evaluation, pages 845–854 (2019)
8. Wonsever, D., Malcuori, M., Rosá Furman, A.: Factividad de los eventos referidos en textos. Reportes Técnicos 09-12, Pedeciba (2009)
9. Saurí, R.: A Factuality Profiler for Eventualities in Text. Ph.D. thesis, Brandeis University (2008)
10. Wonsever, D., Rosá, A., Malcuori, M.: Factuality Annotation and Learning in Spanish Texts. In: LREC (2016)
11. Alonso, L., Castellón, I., Curell, H., Fernández-Montraveta, A., Oliver, S., Vázquez, G.: Proyecto TAGFACT: Del texto al conocimiento. Factualidad y grados de certeza en español. Procesamiento del Lenguaje Natural, 61, pp. 151–154. ISSN: 1135-5948 (2018)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pages 4171–4186, Minneapolis, Minnesota, USA (2019)
13. Zhang, T., Huang, M., Zhao, L.: Learning structured representation for text classification via reinforcement learning.
In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
14. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding with unsupervised learning. Technical report, OpenAI (2018)