NFinBERT: A Number-Aware Language Model for Financial Disclosures
Hao-Lun Lin and Jr-Shian Wu
National Chengchi University, Taipei, Taiwan
{106703027,106703026}@nccu.edu.tw

Yu-Shiang Huang
National Taiwan University, Taipei, Taiwan
b05702095@ntu.edu.tw

Ming-Feng Tsai
National Chengchi University, Taipei, Taiwan
mftsai@nccu.edu.tw

Chuan-Ju Wang
Academia Sinica, Taipei, Taiwan
cjwang@citi.sinica.edu.tw

Abstract

As numerals comprise rich semantic information in financial texts, they play crucial roles in financial data analysis and financial decision making. We propose NFinBERT, a number-aware contextualized language model trained on financial disclosures. Although BERT and other contextualized language models work well for many NLP tasks, they are not specialized in finance and thus do not properly handle numerical information in financial texts. Therefore, we propose pre-training the language model on a large collection of “pre-processed” financial disclosures in which the numbers in reports are explicitly replaced with category tokens defined using knowledge and understanding of the financial and accounting functions of reports. Experimental results on two fine-tuning classification tasks show that language models pre-trained on specialized financial texts generally outperform BERT. Furthermore, the proposed number-aware NFinBERT significantly surpasses other models when the task becomes more difficult or number-sensitive.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

BERT (Devlin et al., 2018), a state-of-the-art language model, consists of a set of Transformer encoders (Vaswani et al., 2017) stacked on top of each other. In contrast to traditional language models, which predict the next token given the previous tokens, BERT uses masked language modeling (MLM) and next sentence prediction (NSP) to pre-train the language model, defining language modeling in an unconventional manner. Owing to the superior performance of BERT on various natural language processing tasks, numerous related studies and models, including RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020), have been proposed to advance the state of the art. Other works adapt this powerful technique to different target domains, such as finance and biomedical science, by pre-training language models on corpora from those domains (DeSola et al., 2019; Lee et al., 2020).

For applications in finance and accounting, in addition to pre-training domain-specific language models, recent work has focused on fine-tuning pre-trained models for downstream tasks, including sentiment analysis (Sousa et al., 2019) and numeral category prediction (Wang et al., 2019). However, most such studies directly use the original design of BERT and thus do not properly handle numerical information in financial texts. In contrast to other domains, numbers in financial text such as financial disclosures, market commentary, and financial news are especially important for understanding the minutiae of such textual information. Moreover, financial documents usually contain relatively large amounts of numbers: whereas only 0.98% of the tokens in the blog corpus (Schler, 2006) are numbers, the 10-K financial reports used here have a much higher proportion of number tokens, 4.79% of all tokens. Thus, properly addressing such numeral information when pre-training the language model is critical to raising the quality of the pre-trained model. For instance, the sentence “Q4 revenue raised by 4,000,000, which is 12.8% of the total amount in the year” is nonsensical if the numbers in it are not properly interpreted.
To this end, we propose NFinBERT,¹ a number-aware contextualized language model pre-trained on a large collection of “pre-processed” financial disclosures, for which we explicitly replace the numbers in reports with category tokens defined using knowledge and understanding of the financial and accounting functions of reports. We conduct two downstream tasks to evaluate the proposed model: one is binary classification for risk sentence detection and the other is 12-class classification for sentence-level numeral category prediction. The results indicate that language models pre-trained on specialized financial texts generally outperform BERT. Furthermore, the proposed number-aware NFinBERT significantly surpasses other models on more difficult or number-sensitive tasks.

¹ The pre-trained model will be publicly available upon publication.

2 Pre-Training Models on Financial Reports

2.1 Data and Preprocessing

To pre-train the domain-specific language models on financial reports, we used the 10-K reports from 1996 to 2013 collected by Loughran and McDonald (2011).² Moreover, following previous studies (Kogan et al., 2009; Tsai et al., 2016; Buehlmaier and Whited, 2018), we used only Section 7, “Management’s Discussion and Analysis of Financial Conditions and Results of Operations” (MD&A), in the experiments, as it contains the most important forward-looking statements about the companies. The resultant corpus contains 183,115 MD&A sections from different companies over 18 years, with 45,126,776 sentences and 838,842,639 tokens in total.

² https://sraf.nd.edu/textual-analysis/resources/

To train NFinBERT, the number-aware language model, we identify the 11 common categories of numbers in financial reports listed in Table 1, with help from several domain experts in finance and accounting. Note that these tokens are usually not pure integers or floats, and may contain commas or parentheses due to the number formats used in accounting, which makes the preprocessing more complicated than that for normal numbers. For instance, one million is sometimes presented as “1,000,000” in financial reports, and “(1,000)” represents negative one thousand. For such complex preprocessing, we used both regular expressions and named entity recognition (NER)³ to recognize tokens containing numbers and slot them into one of the 11 classes. For example, $1,000,000 in the reports was masked as the token [MONEY], and 95% was masked as [PERCENT]. Therefore, in addition to [CLS] and [SEP], BERT’s original special tokens, we add 11 masks to train NFinBERT. The distribution of categories is listed in the last column of Table 1.

³ SpaCy was used for NER.
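The paper does not release its masking rules, but the following minimal Python sketch illustrates the kind of regex-plus-NER pipeline described above. The spaCy model (en_core_web_sm), the label-to-mask mapping, the NUMBER_RE pattern, and the mask_numbers helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of number masking with regular expressions plus NER.
# Assumptions: spaCy's "en_core_web_sm" model and a simplified mapping from
# spaCy entity labels to the masks in Table 1; the experts' full rule set
# (e.g., for [BOND], [PHONE], [ADDRESS]) is not reproduced here.
import re
import spacy

nlp = spacy.load("en_core_web_sm")

LABEL_TO_MASK = {  # hypothetical mapping from NER labels to category masks
    "MONEY": "[MONEY]", "DATE": "[DATE]", "PERCENT": "[PERCENT]",
    "ORDINAL": "[ORDINAL]", "QUANTITY": "[QUANTITY]", "TIME": "[TIME]",
    "CARDINAL": "[OTHER]",
}

# Accounting-style numbers: optional "(...)" for negatives, thousands commas,
# an optional leading "$" and trailing "%".
NUMBER_RE = re.compile(r"\(?\$?\d{1,3}(?:,\d{3})*(?:\.\d+)?\)?%?")

def mask_numbers(sentence: str) -> str:
    """Replace numeric spans with category masks: NER first, regex as fallback."""
    doc = nlp(sentence)
    spans = []  # (start, end, mask)
    for ent in doc.ents:
        if ent.label_ in LABEL_TO_MASK and any(c.isdigit() for c in ent.text):
            spans.append((ent.start_char, ent.end_char, LABEL_TO_MASK[ent.label_]))
    covered = [(s, e) for s, e, _ in spans]
    for m in NUMBER_RE.finditer(sentence):
        if not any(s <= m.start() < e for s, e in covered):
            spans.append((m.start(), m.end(), "[OTHER]"))
    # Replace right to left so earlier offsets stay valid.
    for start, end, mask in sorted(spans, key=lambda x: x[0], reverse=True):
        sentence = sentence[:start] + mask + sentence[end:]
    return sentence

# Exact output depends on the NER model; the numeric spans become masks.
print(mask_numbers("Q4 revenue raised by 4,000,000, which is 12.8% of the total."))
```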
2.2 FinBERT and NFinBERT

BERT (Devlin et al., 2018) is a language model containing a set of Transformer encoders (Vaswani et al., 2017) stacked on top of each other; such a design defines language modeling in an unconventional manner. Following previous studies (Howard and Ruder, 2018; Araci, 2019), we pre-train language models on the target domain, finance, and experiment with two approaches: 1) pre-training the model on a large collection of financial reports, i.e., the original corpus containing 92,402,863 sentences (FinBERT), and 2) pre-training the model on a corpus in which all of the numbers have been replaced by the tokens listed in Table 1 (NFinBERT).

As in (DeSola et al., 2019), we pre-train the two language models using 10K warm-up steps, setting both the max sentence length and the batch size to 128, the maximum number of predictions per sequence to 20, and the learning rate to 5 × 10⁻⁴. The performance on MLM and NSP is summarized in Table 2 and is generally consistent with the results in (DeSola et al., 2019).
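The paper reports only the hyperparameters above and does not name a training framework. As one possible realization, the hedged sketch below uses Hugging Face transformers to pre-train a BERT-Base model with MLM and NSP after extending the vocabulary with the 11 category masks; "mdna_masked.txt" is a hypothetical corpus file with one masked sentence per line and blank lines between reports, and 300K steps matches the larger setting in Table 2.

```python
# Hedged pre-training sketch (not the authors' code): BERT-Base with MLM + NSP
# on the number-masked corpus, using the reported hyperparameters (10K warm-up
# steps, max length 128, batch size 128, learning rate 5e-4). The reported cap
# of 20 masked predictions per sequence is not enforced by this collator,
# which instead masks 15% of tokens.
from transformers import (BertConfig, BertForPreTraining, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          TextDatasetForNextSentencePrediction,
                          Trainer, TrainingArguments)

CATEGORY_MASKS = ["[MONEY]", "[DATE]", "[PHONE]", "[BOND]", "[ORDINAL]",
                  "[QUANTITY]", "[ADDRESS]", "[RATIO]", "[PERCENT]",
                  "[TIME]", "[OTHER]"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": CATEGORY_MASKS})

model = BertForPreTraining(BertConfig())       # BERT-Base, trained from scratch
model.resize_token_embeddings(len(tokenizer))  # room for the 11 new masks

dataset = TextDatasetForNextSentencePrediction(  # hypothetical corpus file
    tokenizer=tokenizer, file_path="mdna_masked.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="nfinbert", max_steps=300_000,
                         warmup_steps=10_000, per_device_train_batch_size=128,
                         learning_rate=5e-4)

Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()
```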
                 Category              Explanation                          Example                   Amount
                 [MONEY]               monetary numbers                     $600,000               11,616,433
                 [DATE]                dates                                2020-02-02             21,480,578
                 [PHONE]               phone numbers                        800-555-5555                3,525
                 [BOND]                bond ratings                         Aaa3                       34,519
                 [ORDINAL]             ordinal information                  Note 4                  2,950,815
                 [QUANTITY]            quantities                           100,000 shares          1,476,558
                 [ADDRESS]             addresses                            Rd. 3                      13,577
                 [RATIO]               numbers related to ratios            1-to-5                     18,246
                 [PERCENT]             percentages                          95%                     4,342,499
                 [TIME]                time unit smaller than a day         1 hour                     53,049
                 [OTHER]               other numbers                        G-8                         8,491
                                            Table 1: Categories of finance numbers

                                    Models        Steps     MLM acc.       NSP acc.        Loss
                                  FinBERT         150K        73.66%        96.62%        1.2604
                                  FinBERT         300K        75.98%        97.37%        1.1257
                               NFinBERT           150K        76.48%        97.62%        1.1080
                               NFinBERT           300K        77.55%        98.25%        1.0416
                                 Table 2: Results for pre-trained FinBERT and NFinBERT


3 Experiments

In this section, we describe experiments on two fine-tuning classification tasks to evaluate the effectiveness of the pre-trained language models. The first task (denoted as Task 1 hereafter) considers binary classification for identifying risk sentences in financial reports, and the second (Task 2) is multi-class classification regarding the types of numbers mentioned in sentences extracted from the reports.

3.1 Datasets

3.1.1 Task 1: Binary Classification for Risk Prediction

We conducted the experiments on 10K-Sentence, a sentence-level risk classification dataset (Lin et al., 2020) consisting of 2,432 sentences extracted from the 10-K reports from 1996 to 2013; each sentence in 10K-Sentence is categorized as either risky or non-risky by annotators specializing in finance or linguistics,⁴ resulting in 1,536 risky sentences and 896 non-risky sentences for binary classification.

⁴ Dataset details can be found in (Lin et al., 2020).

3.1.2 Task 2: Multi-class Classification for Number Category Prediction

For the second task, we constructed a new dataset containing 25,261,147 sentences in total extracted from the 10-K reports from 1996 to 2013, each of which is labeled with one of the 11 categories listed in Table 1 plus a “[Nothing]” type. Specifically, the dataset is composed of all sentences in the 10-K reports from 1996 to 2013 containing exactly one number or no number; the former were labeled with one of the 11 categories and the latter with the [Nothing] type. Due to computational resource limitations, in the following experiments we performed 12-class classification for number category prediction on 1,403,397 randomly selected sentences, constituting 5.6% of the original dataset⁵ and preserving its category distribution.

⁵ As the sentences are from the reports of 1996 to 2013 (i.e., 18 years in total), we here simulate a one-year dataset by randomly sampling 1/18 ≈ 5.6% of the sentences from the original dataset.
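A small sketch of the labeling rule just described, assuming the category masks detected for a sentence come from the Section 2.1 preprocessing; the label_sentence helper and the example inputs are hypothetical.

```python
# Task 2 labeling rule as described above: exactly one detected number keeps
# that number's category, no number yields "[Nothing]", and sentences with
# two or more numbers are left out of the dataset.
from typing import List, Optional

def label_sentence(detected_masks: List[str]) -> Optional[str]:
    """Return the 12-class label for a sentence, or None if it is excluded."""
    if not detected_masks:
        return "[Nothing]"
    if len(detected_masks) == 1:
        return detected_masks[0]
    return None  # more than one number: not part of the Task 2 dataset

# Hypothetical detector outputs:
print(label_sentence([]))                        # [Nothing]
print(label_sentence(["[MONEY]"]))               # [MONEY]
print(label_sentence(["[MONEY]", "[PERCENT]"]))  # None (excluded)
```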
3.2 Experimental Settings

In both tasks, we split the datasets into training, validation, and test sets at an 8:1:1 ratio. Moreover, to mitigate the label imbalance problem in Task 2, we down-sampled the training data to the median of the number of instances per category,⁶ resulting in 177,473 sentences in total.⁷ The resulting category distribution for model training is illustrated in the first row of Figure 1. Note that only the training set was down-sampled; the validation and test sets retained the original category distribution.

⁶ Note that we reduced the training instances only in categories for which the number of instances was higher than the median; we kept the remaining categories unchanged.

⁷ Note that in our experiments, as we found that using 2% of the down-sampled sentences achieves satisfactory performance, we here used only 3,195 sentences for Task 2 model training due to computational resource limitations.
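A minimal pandas sketch of this down-sampling step, assuming the Task 2 training split sits in a DataFrame with a "label" column ("task2_train.csv" and the column name are hypothetical); categories larger than the median count are sampled down to the median and the rest are left untouched, mirroring footnote 6.

```python
# Down-sample each over-represented category to the median category size.
import pandas as pd

train = pd.read_csv("task2_train.csv")  # hypothetical file with a "label" column

counts = train["label"].value_counts()
median = int(counts.median())

def cap_at_median(group: pd.DataFrame) -> pd.DataFrame:
    # Only categories with more instances than the median are reduced.
    return group.sample(n=median, random_state=0) if len(group) > median else group

train_balanced = (train.groupby("label", group_keys=False)
                       .apply(cap_at_median)
                       .reset_index(drop=True))
print(len(train_balanced), "training sentences after down-sampling")
```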
We used 15 epochs to fine-tune all BERT-based models, setting the max sentence length to 128 and the batch size to 32, and used the validation set to search for the learning rate in {10⁻⁵, 5 × 10⁻⁵, 10⁻⁴}. The best learning rates for BERT, FinBERT, and NFinBERT were 10⁻⁵, 10⁻⁴, and 10⁻⁴, respectively. Note that the results on both the validation and test sets are averaged over five repetitions.
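The fine-tuning recipe above can be sketched with Hugging Face transformers as follows; the "nfinbert" checkpoint path, the toy sentences, and the SentenceDataset wrapper are assumptions, and the loop simply retrains once per candidate learning rate and keeps the one with the best validation macro F1.

```python
# Hedged fine-tuning sketch for Task 2 (12 classes; use num_labels=2 for Task 1)
# with the reported settings: 15 epochs, max length 128, batch size 32, and a
# learning-rate grid of {1e-5, 5e-5, 1e-4} selected on the validation set.
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical data: lists of sentences and integer category labels.
train_texts, train_labels = ["The note bears interest at [PERCENT] ."], [7]
val_texts, val_labels = ["See Note [ORDINAL] ."], [5]

tokenizer = AutoTokenizer.from_pretrained("nfinbert")  # assumed checkpoint path

class SentenceDataset(torch.utils.data.Dataset):
    """Wraps tokenized sentences and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, max_length=128, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

best_f1, best_lr = -1.0, None
for lr in (1e-5, 5e-5, 1e-4):  # learning-rate grid reported above
    model = AutoModelForSequenceClassification.from_pretrained("nfinbert",
                                                               num_labels=12)
    args = TrainingArguments(output_dir=f"task2-lr{lr}", num_train_epochs=15,
                             per_device_train_batch_size=32, learning_rate=lr)
    trainer = Trainer(model=model, args=args,
                      train_dataset=SentenceDataset(train_texts, train_labels),
                      eval_dataset=SentenceDataset(val_texts, val_labels),
                      compute_metrics=compute_metrics)
    trainer.train()
    f1 = trainer.evaluate()["eval_macro_f1"]
    if f1 > best_f1:
        best_f1, best_lr = f1, lr

print(f"best learning rate: {best_lr} (validation macro F1 {best_f1:.3f})")
```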
                         [ADDRESS]  [BOND]  [DATE]  [MONEY]  [NOTHING]  [ORDINAL]  [OTHER]  [PERCENT]  [PHONE]  [QUANTITY]  [RATIO]  [TIME]
Category proportion
(training)                  0.0044  0.0088    0.15     0.17       0.16       0.16   0.0034       0.16  0.00063        0.16   0.0019   0.019
BERT                          0.36    0.34    0.93     0.92       0.94       0.79   0            0.9   0              0.43   0        0.4
FinBERT                       0.35    0.29    0.92     0.92       0.94       0.78   0.064        0.86  0.46           0.57   0.25     0.32
NFinBERT                      0.29    0.24    0.93     0.93       0.94       0.76   0.13         0.87  0.58           0.53   0.33     0.39

Figure 1: F1 score of each category

Models                            Task 1 Accuracy   Task 2 Accuracy   Task 2 Macro F1
BOW                                        86.83%            72.51%            36.91%
FastText (Joulin et al., 2016)             86.42%            62.44%            36.91%
CNN (Kim, 2014)                            87.36%            72.51%            32.37%
BERT (Devlin et al., 2018)                 88.61%            91.51%            50.49%
FinBERT                                    88.75%            91.34%            56.10%
NFinBERT                                   88.61%            91.19%            57.67%

Table 3: Performance of two fine-tuning tasks


3.3 Results

For both tasks, we compare the three BERT-based models with three baselines: TF-IDF bag-of-words (BOW) with logistic regression, a convolutional neural network (CNN) (Kim, 2014), and fastText (Joulin et al., 2016); their performance is summarized in Table 3. As shown in the table, for Task 1, all three BERT-based models yield comparable performance, significantly better than the three baseline models.

On Task 2, which is more difficult than Task 1, both FinBERT and NFinBERT surpass BERT⁸ in terms of macro F1 by a significant amount.⁹ Figure 1 details the per-category performance of all three BERT-based models in terms of F1 score. From the figure, we observe that BERT is outperformed by FinBERT and NFinBERT on the categories with the fewest training instances ([OTHER], [PHONE], and [RATIO]); this is why BERT achieves better accuracy but a lower macro F1 score in Table 3. Moreover, for these three categories, NFinBERT yields more accurate predictions than FinBERT, suggesting that the number-aware pre-trained language model is beneficial for Task 2.

⁸ The BERT-Base, Uncased pre-trained model was used in the experiments.

⁹ The improvements compared to BERT are statistically significant at p < 0.01 with a paired t-test.
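Footnote 9's test can be reproduced in spirit as below, assuming the paired samples are per-repetition Task 2 macro F1 scores for BERT and NFinBERT (the paper does not state the exact pairing unit); the numbers are placeholders, not reported results.

```python
# Paired t-test over per-repetition scores (illustrative values only).
from scipy.stats import ttest_rel

bert_f1 = [0.50, 0.51, 0.49, 0.50, 0.52]      # placeholder macro F1, 5 runs
nfinbert_f1 = [0.57, 0.58, 0.57, 0.59, 0.58]  # placeholder macro F1, 5 runs

t_stat, p_value = ttest_rel(nfinbert_f1, bert_f1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # "significant" here means p < 0.01
```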
4 Conclusion

We introduce NFinBERT, a number-aware language model trained on financial disclosures, for which we identify 11 categories of numeral tokens using knowledge and understanding of the financial and accounting functions of reports and replace the numbers with additional masks to pre-train the model. The experimental results show that it is crucial to pre-train BERT on a finance-specific corpus for finance-related downstream tasks; moreover, the proposed NFinBERT outperforms the other compared models on 12-class classification for sentence-level numeral category prediction.

References

Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.

Matthias M. M. Buehlmaier and Toni M. Whited. 2018. Are financial constraints priced? Evidence from textual analysis. The Review of Financial Studies, 31(7):2693–2728.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Vinicio DeSola, Kevin Hanna, and Pri Nonis. 2019. FinBERT: Pre-trained model on SEC filings for financial natural language tasks. Working paper.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.

Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272–280.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Sheng-Chieh Lin, Wen-Yuh Su, Po-Chuan Chien, Ming-Feng Tsai, and Chuan-Ju Wang. 2020. Self-attentive sentimental sentence embedding for sentiment analysis. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1678–1682.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.

Jonathan Schler. 2006. Effects of age and gender on blogging. In Proceedings of the AAAI Symposium on Computational Approaches for Analyzing Weblogs, pages 199–205.

Matheus Gomes Sousa, Kenzo Sakiyama, Lucas de Souza Rodrigues, Pedro Henrique Moraes, Eraldo Rezende Fernandes, and Edson Takashi Matsubara. 2019. BERT for stock market sentiment analysis. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence, pages 1597–1601.

Ming-Feng Tsai, Chuan-Ju Wang, and Po-Chuan Chien. 2016. Discovering finance keywords via continuous-space language models. ACM Transactions on Management Information Systems, 7(3):1–17.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, pages 5998–6008.

Wei Wang, Maofu Liu, Yukun Zhang, Junyi Xiang, and Ruibin Mao. 2019. Financial numeral classification model based on BERT. In Proceedings of the NII Conference on Testbeds and Community for Information Access Research, pages 193–204. Springer.