NFinBERT: A Number-Aware Language Model for Financial Disclosures

Hao-Lun Lin and Jr-Shian Wu
National Chengchi University, Taipei, Taiwan
{106703027,106703026}@nccu.edu.tw

Yu-Shiang Huang
National Taiwan University, Taipei, Taiwan
b05702095@ntu.edu.tw

Ming-Feng Tsai
National Chengchi University, Taipei, Taiwan
mftsai@nccu.edu.tw

Chuan-Ju Wang
Academia Sinica, Taipei, Taiwan
cjwang@citi.sinica.edu.tw

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

As numerals comprise rich semantic information in financial texts, they play crucial roles in financial data analysis and financial decision making. We propose NFinBERT, a number-aware contextualized language model trained on financial disclosures. Although BERT and other contextualized language models work well for many NLP tasks, they are not specialized in finance and thus do not properly manage numerical information in financial texts. Therefore, we propose pre-training the language model on a large collection of "pre-processed" financial disclosures in which the numbers in reports are explicitly replaced with category tokens that encode knowledge of their financial and accounting functions. Experimental results on two fine-tuning classification tasks show that language models pre-trained on specialized financial texts generally outperform BERT. Furthermore, the proposed number-aware NFinBERT significantly surpasses other models when the task becomes more difficult or number-sensitive.

1 Introduction

BERT (Devlin et al., 2018), a state-of-the-art language model, consists of a set of Transformer encoders (Vaswani et al., 2017) stacked on top of each other. In contrast to traditional language models, which predict the next token given the previous tokens, BERT uses masked language models (MLMs) and next sentence prediction (NSP) to pre-train the language model, defining language modeling in an unconventional manner. Due to the superior performance of BERT on various natural language processing tasks, numerous related studies and models, including RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020), have been proposed to advance the state of the art. Other works adapt this powerful technique to different target domains, such as finance and the biomedical sciences, by pre-training language models on corpora from those domains (DeSola et al., 2019; Lee et al., 2020).

For applications in finance and accounting, in addition to pre-training domain-specific language models, recent work has focused on fine-tuning the pre-trained model for downstream tasks, including sentiment analysis (Sousa et al., 2019) and numeral category prediction (Wang et al., 2019). However, most such studies directly use the original design of BERT and thus do not properly manage numerical information in financial texts. In contrast to other domains, numbers in financial text such as financial disclosures, market commentary, and financial news are especially important for understanding the minutiae of such textual information. Moreover, financial documents usually contain relatively large amounts of numbers; for example, whereas only 0.98% of the tokens in the blog corpus (Schler, 2006) are numbers, the 10-K financial reports used here have a much higher proportion of number tokens: 4.79% of all tokens. Thus, properly addressing such numeral information when pre-training the language models is critical to raising the quality of the pre-trained model. For instance, the sentence "Q4 revenue raised by 4,000,000, which is 12.8% of the total amount in the year" is nonsensical if the numbers in it are not properly interpreted.
To this end, we propose NFinBERT,[1] a number-aware contextualized language model pre-trained on a large collection of "pre-processed" financial disclosures, for which we explicitly replace the numbers in reports with category tokens that encode knowledge of their financial and accounting functions. We conduct two downstream tasks to evaluate the proposed model: one is binary classification for risk sentence detection, and the other is 12-class classification for sentence-level numeral category prediction. The results indicate that language models pre-trained on specialized financial texts generally outperform BERT. Furthermore, the proposed number-aware NFinBERT significantly surpasses other models on more difficult or number-sensitive tasks.

[1] The pre-trained model will be publicly available upon publication.

2 Pre-Training Models on Financial Reports

2.1 Data and Preprocessing

To pre-train the domain-specific language models on financial reports, we used the 10-K reports from 1996 to 2013 collected by Loughran and McDonald (2011).[2] Moreover, following previous studies (Kogan et al., 2009; Tsai et al., 2016; Buehlmaier and Whited, 2018), we used only Section 7, "Management's Discussion and Analysis of Financial Conditions and Results of Operations" (MD&A), in the experiments, as it contains the most important forward-looking statements about the companies. The resultant corpus contains 183,115 MD&A sections from different companies over 18 years, with 45,126,776 sentences and 838,842,639 tokens in total.

[2] https://sraf.nd.edu/textual-analysis/resources/

To train NFinBERT, the number-aware language model, we identified the 11 common categories of numbers in financial reports listed in Table 1, with help from several domain experts in finance and accounting. Note that these tokens are usually not pure integers or floats, and may contain commas or parentheses due to the number formats used in accounting, which makes the preprocessing more complicated than that for normal numbers. For instance, one million is sometimes presented as "1,000,000" in financial reports, and "(1,000)" represents negative one thousand. For such complex preprocessing, we used both regular expressions and named entity recognition (NER)[3] to recognize tokens containing numbers and slot them into one of the 11 classes. For example, $1,000,000 in the reports was masked as the token [MONEY], and 95% was masked as [PERCENT]. Therefore, in addition to BERT's original special tokens such as [CLS] and [SEP], we add 11 masks to train NFinBERT. The distribution of categories is listed in the last column of Table 1.

[3] SpaCy was used for NER.

Category     Explanation                     Example          Amount
[MONEY]      monetary numbers                $600,000         11,616,433
[DATE]       dates                           2020-02-02       21,480,578
[PHONE]      phone numbers                   800-555-5555     3,525
[BOND]       bond ratings                    Aaa3             34,519
[ORDINAL]    ordinal information             Note 4           2,950,815
[QUANTITY]   quantities                      100,000 shares   1,476,558
[ADDRESS]    addresses                       Rd. 3            13,577
[RATIO]      numbers related to ratios       1-to-5           18,246
[PERCENT]    percentages                     95%              4,342,499
[TIME]       time units smaller than a day   1 hour           53,049
[OTHER]      other numbers                   G-8              8,491

Table 1: Categories of finance numbers
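To make the masking step above concrete, the following sketch combines regular expressions with spaCy NER to replace numeric spans with the category tokens of Table 1. The paper does not publish its exact rules, so the specific patterns, the spaCy model name, and the label-to-category mapping below are illustrative assumptions rather than the authors' implementation.

import re
import spacy

# Assumed spaCy model; footnote 3 only states that spaCy was used for NER.
nlp = spacy.load("en_core_web_sm")

# Assumed mapping from spaCy entity labels to the category tokens of Table 1.
SPACY_TO_CATEGORY = {
    "MONEY": "[MONEY]",
    "DATE": "[DATE]",
    "PERCENT": "[PERCENT]",
    "QUANTITY": "[QUANTITY]",
    "TIME": "[TIME]",
    "ORDINAL": "[ORDINAL]",
    "CARDINAL": "[OTHER]",
}

# Assumed regular expressions for accounting-style numbers that NER may miss.
REGEX_RULES = [
    (re.compile(r"\$\(?\d{1,3}(?:,\d{3})*(?:\.\d+)?\)?"), "[MONEY]"),   # $1,000,000, $(1,000)
    (re.compile(r"\d+(?:\.\d+)?%"), "[PERCENT]"),                       # 95%
    (re.compile(r"\d+-to-\d+"), "[RATIO]"),                             # 1-to-5
    (re.compile(r"\d{3}-\d{3}-\d{4}"), "[PHONE]"),                      # 800-555-5555
    (re.compile(r"\(?\d{1,3}(?:,\d{3})+\)?"), "[OTHER]"),               # bare 1,000,000 or (1,000)
]

def mask_numbers(sentence: str) -> str:
    """Replace numeric spans with category tokens such as [MONEY] or [PERCENT]."""
    doc = nlp(sentence)
    # Replace NER spans from right to left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        category = SPACY_TO_CATEGORY.get(ent.label_)
        if category:
            sentence = sentence[:ent.start_char] + category + sentence[ent.end_char:]
    # Catch remaining accounting-style numbers with the regular expressions.
    for pattern, category in REGEX_RULES:
        sentence = pattern.sub(category, sentence)
    return sentence

print(mask_numbers("Q4 revenue raised by $4,000,000, which is 12.8% of the total amount."))
# e.g. "Q4 revenue raised by [MONEY], which is [PERCENT] of the total amount."
# (the exact spans depend on the NER model)

Applying this pass to every sentence of the MD&A corpus yields the "pre-processed" corpus on which NFinBERT is pre-trained.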
2.2 FinBERT and NFinBERT

BERT (Devlin et al., 2018) is a language model containing a set of Transformer encoders (Vaswani et al., 2017) stacked on top of each other; such a design defines language modeling in an unconventional manner. Following previous studies (Howard and Ruder, 2018; Araci, 2019), we pre-train language models on the target domain, finance, and experiment with two approaches: 1) pre-training the model on a large collection of financial reports, i.e., the original corpus containing 92,402,863 sentences, and 2) pre-training the model on a corpus in which all of the numbers have been replaced by the tokens listed in Table 1. These two approaches yield FinBERT and NFinBERT, respectively.

As in DeSola et al. (2019), we pre-train the two language models using 10K warm-up steps, setting both the max sentence length and the batch size to 128, the maximum number of predictions per sequence to 20, and the learning rate to 5 × 10^-4. The performance on MLM and NSP is summarized in Table 2 and is generally consistent with the results in DeSola et al. (2019).

Models      Steps   MLM acc.   NSP acc.   Loss
FinBERT     150K    73.66%     96.62%     1.2604
FinBERT     300K    75.98%     97.37%     1.1257
NFinBERT    150K    76.48%     97.62%     1.1080
NFinBERT    300K    77.55%     98.25%     1.0416

Table 2: Results for pre-trained FinBERT and NFinBERT
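As a sketch of this setup, the snippet below registers the 11 category tokens with a BERT tokenizer and configures the reported hyperparameters using the Hugging Face transformers library. The paper does not state which training framework or initialization was used, so the library choice, the bert-base-uncased starting point, and the omitted MLM/NSP data pipeline are assumptions; only the hyperparameter values come from Section 2.2 and Table 2.

from transformers import BertForPreTraining, BertTokenizerFast, TrainingArguments

# The 11 category tokens from Table 1.
CATEGORY_TOKENS = [
    "[MONEY]", "[DATE]", "[PHONE]", "[BOND]", "[ORDINAL]", "[QUANTITY]",
    "[ADDRESS]", "[RATIO]", "[PERCENT]", "[TIME]", "[OTHER]",
]

# Assumption: continue pre-training from BERT-Base, Uncased (the checkpoint used
# for fine-tuning in Section 3); the paper does not state the initialization
# used for FinBERT/NFinBERT pre-training.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")  # MLM + NSP heads

# Register the category tokens alongside [CLS], [SEP], [MASK], etc., and grow
# the embedding matrix so the new tokens receive trainable vectors.
tokenizer.add_special_tokens({"additional_special_tokens": CATEGORY_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# Hyperparameters reported in Section 2.2; the data pipeline that builds the
# MLM/NSP examples (max sequence length 128, at most 20 masked predictions per
# sequence) is omitted here.
training_args = TrainingArguments(
    output_dir="nfinbert-pretraining",
    per_device_train_batch_size=128,
    learning_rate=5e-4,
    warmup_steps=10_000,
    max_steps=300_000,  # Table 2 reports checkpoints at 150K and 300K steps
)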
3 Experiments

In this section, we describe experiments on two fine-tuned classification tasks to evaluate the effectiveness of the pre-trained language models. The first task (denoted as Task 1 hereafter) considers binary classification for identifying risk sentences in financial reports, and the second (Task 2) is multi-class classification regarding the types of numbers mentioned in sentences extracted from the reports.

3.1 Datasets

3.1.1 Task 1: Binary Classification for Risk Prediction

We conducted the experiments on 10K-Sentence, a sentence-level risk classification dataset (Lin et al., 2020) consisting of 2,432 sentences extracted from the 10-K reports from 1996 to 2013; each sentence in 10K-Sentence is categorized as either risky or non-risky by annotators specializing in finance or linguistics,[4] resulting in 1,536 risky sentences and 896 non-risky sentences for binary classification.

[4] Dataset details can be found in Lin et al. (2020).

3.1.2 Task 2: Multi-class Classification for Number Category Prediction

For the second task, we constructed a new dataset containing 25,261,147 sentences in total, extracted from the 10-K reports from 1996 to 2013, each of which is labeled with one of the 11 categories listed in Table 1 plus a "[Nothing]" type. Specifically, the dataset is composed of all sentences in the 10-K reports from 1996 to 2013 containing exactly one number or no number; the former were labeled with one of the 11 categories and the latter with the [Nothing] type. For the following experiments, due to computational resource limitations, we performed 12-class classification for number category prediction on 1,403,397 randomly selected sentences, constituting 5.6% of the original dataset[5] with the same category distribution as the original dataset.

[5] As the sentences are from the reports of 1996 to 2013 (i.e., 18 years in total), we simulate a one-year dataset by randomly sampling 1/18 ≈ 5.6% of the sentences from the original dataset.

3.2 Experimental Settings

In both tasks, we split the datasets into training, validation, and test sets at an 8:1:1 ratio. Moreover, to mitigate the label imbalance problem in Task 2, we down-sampled the training data to the median of the number of instances per category,[6] resulting in 177,473 sentences in total.[7] The resulting category distribution for model training is shown in the "Proportion (training)" column of Figure 1. Note that only the training set was down-sampled; the validation and test sets retained the original category distribution. We used 15 epochs to fine-tune all BERT-based models, setting the max sentence length to 128 and the batch size to 32, and used the validation set to search for learning rates in {10^-5, 5 × 10^-5, 10^-4}. The best learning rates for BERT, FinBERT, and NFinBERT were 10^-5, 10^-4, and 10^-4, respectively. Note that the results on both the validation and test sets are averaged over five repetitions.

[6] We reduced the training instances only in categories for which the number of instances was higher than the median; the remaining categories were kept unchanged.

[7] As we found that using 2% of the down-sampled sentences achieves satisfactory performance, we used only 3,195 sentences for Task 2 model training due to computational resource limitations.
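A minimal sketch of this training-set preparation (the 8:1:1 split and the median-based down-sampling) follows. The paper does not specify its tooling, so the use of pandas and scikit-learn, the column names ("sentence", "label"), and the random seed are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical columns: "sentence" holds the text, "label" the category token.
def split_and_downsample(df: pd.DataFrame, seed: int = 42):
    # 8:1:1 split into training, validation, and test sets.
    train, rest = train_test_split(df, test_size=0.2, random_state=seed, stratify=df["label"])
    valid, test = train_test_split(rest, test_size=0.5, random_state=seed, stratify=rest["label"])

    # Down-sample only categories whose training count exceeds the median;
    # the validation and test sets keep the original category distribution.
    median = int(train["label"].value_counts().median())
    train = (
        train.groupby("label", group_keys=False)
             .apply(lambda g: g.sample(n=min(len(g), median), random_state=seed))
    )
    return train, valid, test

# Learning rates searched on the validation set during fine-tuning (Section 3.2).
LEARNING_RATE_GRID = [1e-5, 5e-5, 1e-4]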
3.3 Results

For both tasks, we compared the three BERT-based models with three baselines: TF-IDF bag-of-words (BOW) with logistic regression, a convolutional neural network (CNN) (Kim, 2014), and fastText (Joulin et al., 2016). Their performance is summarized in Table 3. As shown in the table, for Task 1, all three BERT-based models yield comparable performance, significantly better than the three baseline models.

Models                           Task 1 Accuracy   Task 2 Accuracy   Task 2 Macro F1
BOW                              86.83%            72.51%            36.91%
FastText (Joulin et al., 2016)   86.42%            62.44%            36.91%
CNN (Kim, 2014)                  87.36%            72.51%            32.37%
BERT (Devlin et al., 2018)       88.61%            91.51%            50.49%
FinBERT                          88.75%            91.34%            56.10%
NFinBERT                         88.61%            91.19%            57.67%

Table 3: Performance of two fine-tuning tasks

On Task 2, which is more difficult than Task 1, both FinBERT and NFinBERT surpass BERT[8] in terms of macro F1 by a significant amount.[9] Figure 1 details the per-category performance of all three BERT-based models in terms of F1 score. From the figure, we observe that BERT is outperformed by FinBERT and NFinBERT for the categories with the fewest training instances ([OTHER], [PHONE], and [RATIO]); this is why BERT achieves better accuracy but a lower macro F1 score in Table 3. Moreover, for these three categories, NFinBERT yields more accurate predictions than FinBERT, suggesting that the number-aware pre-trained language model is beneficial for Task 2.

[8] The BERT-Base, Uncased pre-trained model was used in the experiments.

[9] The improvements compared to BERT are statistically significant at p < 0.01 with a paired t-test.

Category     Proportion (training)   BERT    FinBERT   NFinBERT
[ADDRESS]    0.0044                  0.36    0.35      0.29
[BOND]       0.0088                  0.34    0.29      0.24
[DATE]       0.15                    0.93    0.92      0.93
[MONEY]      0.17                    0.92    0.92      0.93
[NOTHING]    0.16                    0.94    0.94      0.94
[ORDINAL]    0.16                    0.79    0.78      0.76
[OTHER]      0.0034                  0       0.064     0.13
[PERCENT]    0.16                    0.9     0.86      0.87
[PHONE]      0.00063                 0       0.46      0.58
[QUANTITY]   0.16                    0.43    0.57      0.53
[RATIO]      0.0019                  0       0.25      0.33
[TIME]       0.019                   0.4     0.32      0.39

Figure 1: F1 score of each category (the original heatmap values shown in tabular form, together with each category's proportion in the Task 2 training set)
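As a side note on why accuracy and macro F1 diverge in Table 3: under Task 2's imbalanced category distribution, a model that never predicts the rare classes can keep a high accuracy while macro F1, which averages per-category F1 scores with equal weight, drops sharply. The toy labels below are illustrative only and do not use the paper's data.

from sklearn.metrics import accuracy_score, f1_score

# A toy set of 100 labels: 98 from a frequent category and 2 from a rare one.
y_true = ["[MONEY]"] * 98 + ["[RATIO]"] * 2
y_pred = ["[MONEY]"] * 100  # the rare category is never predicted

print(accuracy_score(y_true, y_pred))             # 0.98: accuracy stays high
print(f1_score(y_true, y_pred, average="macro"))  # ~0.49: the rare class's F1 of 0 halves the average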
4 Conclusion

We introduce NFinBERT, a number-aware language model trained on financial disclosures, in which we identify 11 categories of numeral tokens, drawing on knowledge of the financial and accounting functions of reports, and replace the numbers with additional mask tokens to pre-train the model. The experimental results show that it is crucial to pre-train BERT on a finance-specific corpus for finance-related downstream tasks; moreover, the proposed NFinBERT outperforms the other compared models on 12-class classification for sentence-level numeral category prediction.

References

Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.

Matthias M. M. Buehlmaier and Toni M. Whited. 2018. Are financial constraints priced? Evidence from textual analysis. The Review of Financial Studies, 31(7):2693–2728.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Vinicio DeSola, Kevin Hanna, and Pri Nonis. 2019. FinBERT: Pre-trained model on SEC filings for financial natural language tasks. Working paper.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.

Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272–280.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Sheng-Chieh Lin, Wen-Yuh Su, Po-Chuan Chien, Ming-Feng Tsai, and Chuan-Ju Wang. 2020. Self-attentive sentimental sentence embedding for sentiment analysis. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1678–1682.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.

Jonathan Schler. 2006. Effects of age and gender on blogging. In Proceedings of the AAAI Symposium on Computational Approaches for Analyzing Weblogs, pages 199–205.

Matheus Gomes Sousa, Kenzo Sakiyama, Lucas de Souza Rodrigues, Pedro Henrique Moraes, Eraldo Rezende Fernandes, and Edson Takashi Matsubara. 2019. BERT for stock market sentiment analysis. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence, pages 1597–1601.

Ming-Feng Tsai, Chuan-Ju Wang, and Po-Chuan Chien. 2016. Discovering finance keywords via continuous-space language models. ACM Transactions on Management Information Systems, 7(3):1–17.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, pages 5998–6008.

Wei Wang, Maofu Liu, Yukun Zhang, Junyi Xiang, and Ruibin Mao. 2019. Financial numeral classification model based on BERT. In Proceedings of the NII Conference on Testbeds and Community for Information Access Research, pages 193–204. Springer.