NFinBERT: A Number-Aware Language Model for Financial Disclosures

Hao-Lun Lin and Jr-Shian Wu
National Chengchi University, Taipei, Taiwan
{106703027,106703026}@nccu.edu.tw

Yu-Shiang Huang
National Taiwan University, Taipei, Taiwan
b05702095@ntu.edu.tw

Ming-Feng Tsai
National Chengchi University, Taipei, Taiwan
mftsai@nccu.edu.tw

Chuan-Ju Wang
Academia Sinica, Taipei, Taiwan
cjwang@citi.sinica.edu.tw

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

As numerals comprise rich semantic information in financial texts, they play crucial roles in financial data analysis and financial decision making. We propose NFinBERT, a number-aware contextualized language model trained on financial disclosures. Although BERT and other contextualized language models work well for many NLP tasks, they are not specialized in finance and thus do not properly manage numerical information in financial texts. Therefore, we propose pre-training the language model on a large collection of "pre-processed" financial disclosures in which the numbers in reports are explicitly replaced with category tokens that encode knowledge of their financial and accounting functions. Experimental results on two fine-tuning classification tasks show that language models pre-trained on specialized financial texts generally outperform BERT. Furthermore, the proposed number-aware NFinBERT significantly surpasses other models when the task becomes more difficult or number-sensitive.

1 Introduction

BERT (Devlin et al., 2018), a state-of-the-art language model, consists of a set of Transformer encoders (Vaswani et al., 2017) stacked on top of each other. In contrast to traditional language models, which predict the next token given the previous tokens, BERT uses masked language models (MLMs) and next sentence prediction (NSP) to pre-train the language model, defining language modeling in an unconventional manner. Due to the superior performance of BERT on various natural language processing tasks, numerous related studies and models, including RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020), have been proposed to advance the state of the art. Other works adapt this powerful technique to different target domains, such as finance and the biomedical sciences, by pre-training language models on corpora from those domains (DeSola et al., 2019; Lee et al., 2020).

For applications in finance and accounting, in addition to pre-training domain-specific language models, recent work has focused on fine-tuning the pre-trained model for downstream tasks, including sentiment analysis (Sousa et al., 2019) and numeral category prediction (Wang et al., 2019). However, most such studies directly use the original design of BERT and thus do not properly manage numerical information in financial texts. In contrast to other domains, numbers in financial text such as financial disclosures, market commentary, and financial news are especially important for understanding the minutiae of such textual information. Moreover, financial documents usually contain relatively large amounts of numbers; for example, whereas only 0.98% of the tokens in the blog corpus (Schler, 2006) are numbers, the 10-K financial reports used here have a much higher proportion of number tokens: 4.79% of all tokens. Thus, properly addressing such numeral information when pre-training the language models is critical to raising the quality of the pre-trained model. For instance, the sentence "Q4 revenue raised by 4,000,000, which is 12.8% of the total amount in the year" is nonsensical if the numbers in it are not properly interpreted.
To this end, we propose NFinBERT,[1] a number-aware contextualized language model pre-trained on a large collection of "pre-processed" financial disclosures, for which we explicitly replace the numbers in reports with category tokens that encode knowledge of their financial and accounting functions. We conduct two downstream tasks to evaluate the proposed model: one is binary classification for risk sentence detection, and the other is 12-class classification for sentence-level numeral category prediction. The results indicate that language models pre-trained on specialized financial texts generally outperform BERT. Furthermore, the proposed number-aware NFinBERT significantly surpasses other models on more difficult or number-sensitive tasks.

[1] The pre-trained model will be publicly available upon publication.

2 Pre-Training Models on Financial Reports

2.1 Data and Preprocessing

To pre-train the domain-specific language models on financial reports, we used the 10-K reports from 1996 to 2013 collected by Loughran and McDonald (2011).[2] Moreover, following previous studies (Kogan et al., 2009; Tsai et al., 2016; Buehlmaier and Whited, 2018), we used only Section 7, "Management's Discussion and Analysis of Financial Conditions and Results of Operations" (MD&A), in the experiments, as it contains the most important forward-looking statements about the companies. The resultant corpus contains 183,115 MD&A sections from different companies over 18 years, with 45,126,776 sentences and 838,842,639 tokens in total.

[2] https://sraf.nd.edu/textual-analysis/resources/

To train NFinBERT, the number-aware language model, we identified the 11 common categories of numbers in financial reports listed in Table 1, with help from several domain experts in finance and accounting. Note that these tokens are usually not pure integers or floats, and may contain commas or parentheses due to the number formats used in accounting, which makes the preprocessing more complicated than that for normal numbers. For instance, one million is sometimes presented as "1,000,000" in financial reports, and "(1,000)" represents negative one thousand. For such complex preprocessing, we used both regular expressions and named entity recognition (NER)[3] to recognize tokens containing numbers and slot them into one of the 11 classes. For example, $1,000,000 in the reports was masked as the token [MONEY], and 95% was masked as [PERCENT]. Therefore, in addition to BERT's original special tokens such as [CLS] and [SEP], we add 11 masks to train NFinBERT. The distribution of categories is listed in the last column of Table 1.

[3] SpaCy was used for NER.

Category     Explanation                     Example          Amount
[MONEY]      monetary numbers                $600,000         11,616,433
[DATE]       dates                           2020-02-02       21,480,578
[PHONE]      phone numbers                   800-555-5555     3,525
[BOND]       bond ratings                    Aaa3             34,519
[ORDINAL]    ordinal information             Note 4           2,950,815
[QUANTITY]   quantities                      100,000 shares   1,476,558
[ADDRESS]    addresses                       Rd. 3            13,577
[RATIO]      numbers related to ratios       1-to-5           18,246
[PERCENT]    percentages                     95%              4,342,499
[TIME]       time units smaller than a day   1 hour           53,049
[OTHER]      other numbers                   G-8              8,491

Table 1: Categories of finance numbers
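To make the masking step above concrete, the following sketch combines regular expressions with spaCy NER to replace numeric spans with the category tokens of Table 1. The paper does not publish its exact rules, so the specific patterns, the spaCy model name, and the label-to-category mapping below are illustrative assumptions rather than the authors' implementation.

import re
import spacy

# Assumed spaCy model; footnote 3 only states that spaCy was used for NER.
nlp = spacy.load("en_core_web_sm")

# Assumed mapping from spaCy entity labels to the category tokens of Table 1.
SPACY_TO_CATEGORY = {
    "MONEY": "[MONEY]",
    "DATE": "[DATE]",
    "PERCENT": "[PERCENT]",
    "QUANTITY": "[QUANTITY]",
    "TIME": "[TIME]",
    "ORDINAL": "[ORDINAL]",
    "CARDINAL": "[OTHER]",
}

# Assumed regular expressions for accounting-style numbers that NER may miss.
REGEX_RULES = [
    (re.compile(r"\$\(?\d{1,3}(?:,\d{3})*(?:\.\d+)?\)?"), "[MONEY]"),   # $1,000,000, $(1,000)
    (re.compile(r"\d+(?:\.\d+)?%"), "[PERCENT]"),                       # 95%
    (re.compile(r"\d+-to-\d+"), "[RATIO]"),                             # 1-to-5
    (re.compile(r"\d{3}-\d{3}-\d{4}"), "[PHONE]"),                      # 800-555-5555
    (re.compile(r"\(?\d{1,3}(?:,\d{3})+\)?"), "[OTHER]"),               # bare 1,000,000 or (1,000)
]

def mask_numbers(sentence: str) -> str:
    """Replace numeric spans with category tokens such as [MONEY] or [PERCENT]."""
    doc = nlp(sentence)
    # Replace NER spans from right to left so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        category = SPACY_TO_CATEGORY.get(ent.label_)
        if category:
            sentence = sentence[:ent.start_char] + category + sentence[ent.end_char:]
    # Catch remaining accounting-style numbers with the regular expressions.
    for pattern, category in REGEX_RULES:
        sentence = pattern.sub(category, sentence)
    return sentence

print(mask_numbers("Q4 revenue raised by $4,000,000, which is 12.8% of the total amount."))
# e.g. "Q4 revenue raised by [MONEY], which is [PERCENT] of the total amount."
# (the exact spans depend on the NER model)

Applying this pass to every sentence of the MD&A corpus yields the "pre-processed" corpus on which NFinBERT is pre-trained.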
2.2 FinBERT and NFinBERT

BERT (Devlin et al., 2018) is a language model containing a set of Transformer encoders (Vaswani et al., 2017) stacked on top of each other; such a design defines language modeling in an unconventional manner. Following previous studies (Howard and Ruder, 2018; Araci, 2019), we pre-train language models on the target domain, finance, and experiment with two approaches: 1) pre-training the model on a large collection of financial reports, i.e., the original corpus containing 92,402,863 sentences, and 2) pre-training the model on a corpus in which all of the numbers have been replaced by the tokens listed in Table 1. These two approaches yield FinBERT and NFinBERT, respectively.

As in DeSola et al. (2019), we pre-train the two language models using 10K warm-up steps, setting both the max sentence length and the batch size to 128, the maximum number of predictions per sequence to 20, and the learning rate to 5 × 10^-4. The performance on MLM and NSP is summarized in Table 2 and is generally consistent with the results in DeSola et al. (2019).

Models      Steps   MLM acc.   NSP acc.   Loss
FinBERT     150K    73.66%     96.62%     1.2604
FinBERT     300K    75.98%     97.37%     1.1257
NFinBERT    150K    76.48%     97.62%     1.1080
NFinBERT    300K    77.55%     98.25%     1.0416

Table 2: Results for pre-trained FinBERT and NFinBERT
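As a sketch of this setup, the snippet below registers the 11 category tokens with a BERT tokenizer and configures the reported hyperparameters using the Hugging Face transformers library. The paper does not state which training framework or initialization was used, so the library choice, the bert-base-uncased starting point, and the omitted MLM/NSP data pipeline are assumptions; only the hyperparameter values come from Section 2.2 and Table 2.

from transformers import BertForPreTraining, BertTokenizerFast, TrainingArguments

# The 11 category tokens from Table 1.
CATEGORY_TOKENS = [
    "[MONEY]", "[DATE]", "[PHONE]", "[BOND]", "[ORDINAL]", "[QUANTITY]",
    "[ADDRESS]", "[RATIO]", "[PERCENT]", "[TIME]", "[OTHER]",
]

# Assumption: continue pre-training from BERT-Base, Uncased (the checkpoint used
# for fine-tuning in Section 3); the paper does not state the initialization
# used for FinBERT/NFinBERT pre-training.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")  # MLM + NSP heads

# Register the category tokens alongside [CLS], [SEP], [MASK], etc., and grow
# the embedding matrix so the new tokens receive trainable vectors.
tokenizer.add_special_tokens({"additional_special_tokens": CATEGORY_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# Hyperparameters reported in Section 2.2; the data pipeline that builds the
# MLM/NSP examples (max sequence length 128, at most 20 masked predictions per
# sequence) is omitted here.
training_args = TrainingArguments(
    output_dir="nfinbert-pretraining",
    per_device_train_batch_size=128,
    learning_rate=5e-4,
    warmup_steps=10_000,
    max_steps=300_000,  # Table 2 reports checkpoints at 150K and 300K steps
)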
3 Experiments

In this section, we describe experiments on two fine-tuned classification tasks to evaluate the effectiveness of the pre-trained language models. The first task (denoted as Task 1 hereafter) considers binary classification for identifying risk sentences in financial reports, and the second (Task 2) is multi-class classification regarding the types of numbers mentioned in sentences extracted from the reports.

3.1 Datasets

3.1.1 Task 1: Binary Classification for Risk Prediction

We conducted the experiments on 10K-Sentence, a sentence-level risk classification dataset (Lin et al., 2020) consisting of 2,432 sentences extracted from the 10-K reports from 1996 to 2013; each sentence in 10K-Sentence is categorized as either risky or non-risky by annotators specializing in finance or linguistics,[4] resulting in 1,536 risky sentences and 896 non-risky sentences for binary classification.

[4] Dataset details can be found in Lin et al. (2020).

3.1.2 Task 2: Multi-class Classification for Number Category Prediction

For the second task, we constructed a new dataset containing 25,261,147 sentences in total, extracted from the 10-K reports from 1996 to 2013, each of which is labeled with one of the 11 categories listed in Table 1 plus a "[Nothing]" type. Specifically, the dataset is composed of all sentences in the 10-K reports from 1996 to 2013 containing exactly one number or no number; the former were labeled with one of the 11 categories and the latter with the [Nothing] type. For the following experiments, due to computational resource limitations, we performed 12-class classification for number category prediction on 1,403,397 randomly selected sentences, constituting 5.6% of the original dataset[5] with the same category distribution as the original dataset.

[5] As the sentences are from the reports of 1996 to 2013 (i.e., 18 years in total), we simulate a one-year dataset by randomly sampling 1/18 ≈ 5.6% of the sentences from the original dataset.

3.2 Experimental Settings

In both tasks, we split the datasets into training, validation, and test sets at an 8:1:1 ratio. Moreover, to mitigate the label imbalance problem in Task 2, we down-sampled the training data to the median of the number of instances per category,[6] resulting in 177,473 sentences in total.[7] The resulting category distribution for model training is shown in the "Proportion (training)" column of Figure 1. Note that only the training set was down-sampled; the validation and test sets retained the original category distribution. We used 15 epochs to fine-tune all BERT-based models, setting the max sentence length to 128 and the batch size to 32, and used the validation set to search for learning rates in {10^-5, 5 × 10^-5, 10^-4}. The best learning rates for BERT, FinBERT, and NFinBERT were 10^-5, 10^-4, and 10^-4, respectively. Note that the results on both the validation and test sets are averaged over five repetitions.

[6] We reduced the training instances only in categories for which the number of instances was higher than the median; the remaining categories were kept unchanged.

[7] As we found that using 2% of the down-sampled sentences achieves satisfactory performance, we used only 3,195 sentences for Task 2 model training due to computational resource limitations.
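A minimal sketch of this training-set preparation (the 8:1:1 split and the median-based down-sampling) follows. The paper does not specify its tooling, so the use of pandas and scikit-learn, the column names ("sentence", "label"), and the random seed are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical columns: "sentence" holds the text, "label" the category token.
def split_and_downsample(df: pd.DataFrame, seed: int = 42):
    # 8:1:1 split into training, validation, and test sets.
    train, rest = train_test_split(df, test_size=0.2, random_state=seed, stratify=df["label"])
    valid, test = train_test_split(rest, test_size=0.5, random_state=seed, stratify=rest["label"])

    # Down-sample only categories whose training count exceeds the median;
    # the validation and test sets keep the original category distribution.
    median = int(train["label"].value_counts().median())
    train = (
        train.groupby("label", group_keys=False)
             .apply(lambda g: g.sample(n=min(len(g), median), random_state=seed))
    )
    return train, valid, test

# Learning rates searched on the validation set during fine-tuning (Section 3.2).
LEARNING_RATE_GRID = [1e-5, 5e-5, 1e-4]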
3.3 Results

For both tasks, we compared the three BERT-based models with three baselines: TF-IDF bag-of-words (BOW) with logistic regression, a convolutional neural network (CNN) (Kim, 2014), and fastText (Joulin et al., 2016). Their performance is summarized in Table 3. As shown in the table, for Task 1, all three BERT-based models yield comparable performance, significantly better than the three baseline models.

Models                           Task 1 Accuracy   Task 2 Accuracy   Task 2 Macro F1
BOW                              86.83%            72.51%            36.91%
FastText (Joulin et al., 2016)   86.42%            62.44%            36.91%
CNN (Kim, 2014)                  87.36%            72.51%            32.37%
BERT (Devlin et al., 2018)       88.61%            91.51%            50.49%
FinBERT                          88.75%            91.34%            56.10%
NFinBERT                         88.61%            91.19%            57.67%

Table 3: Performance of two fine-tuning tasks

On Task 2, which is more difficult than Task 1, both FinBERT and NFinBERT surpass BERT[8] in terms of macro F1 by a significant amount.[9] Figure 1 details the per-category performance of all three BERT-based models in terms of F1 score. From the figure, we observe that BERT is outperformed by FinBERT and NFinBERT for the categories with the fewest training instances ([OTHER], [PHONE], and [RATIO]); this is why BERT achieves better accuracy but a lower macro F1 score in Table 3. Moreover, for these three categories, NFinBERT yields more accurate predictions than FinBERT, suggesting that the number-aware pre-trained language model is beneficial for Task 2.

[8] The BERT-Base, Uncased pre-trained model was used in the experiments.

[9] The improvements compared to BERT are statistically significant at p < 0.01 with a paired t-test.

Category     Proportion (training)   BERT    FinBERT   NFinBERT
[ADDRESS]    0.0044                  0.36    0.35      0.29
[BOND]       0.0088                  0.34    0.29      0.24
[DATE]       0.15                    0.93    0.92      0.93
[MONEY]      0.17                    0.92    0.92      0.93
[NOTHING]    0.16                    0.94    0.94      0.94
[ORDINAL]    0.16                    0.79    0.78      0.76
[OTHER]      0.0034                  0       0.064     0.13
[PERCENT]    0.16                    0.9     0.86      0.87
[PHONE]      0.00063                 0       0.46      0.58
[QUANTITY]   0.16                    0.43    0.57      0.53
[RATIO]      0.0019                  0       0.25      0.33
[TIME]       0.019                   0.4     0.32      0.39

Figure 1: F1 score of each category (the original heatmap values shown in tabular form, together with each category's proportion in the Task 2 training set)
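As a side note on why accuracy and macro F1 diverge in Table 3: under Task 2's imbalanced category distribution, a model that never predicts the rare classes can keep a high accuracy while macro F1, which averages per-category F1 scores with equal weight, drops sharply. The toy labels below are illustrative only and do not use the paper's data.

from sklearn.metrics import accuracy_score, f1_score

# A toy set of 100 labels: 98 from a frequent category and 2 from a rare one.
y_true = ["[MONEY]"] * 98 + ["[RATIO]"] * 2
y_pred = ["[MONEY]"] * 100  # the rare category is never predicted

print(accuracy_score(y_true, y_pred))             # 0.98: accuracy stays high
print(f1_score(y_true, y_pred, average="macro"))  # ~0.49: the rare class's F1 of 0 halves the average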
4 Conclusion

We introduce NFinBERT, a number-aware language model trained on financial disclosures, in which we identify 11 categories of numeral tokens, drawing on knowledge of the financial and accounting functions of reports, and replace the numbers with additional mask tokens to pre-train the model. The experimental results show that it is crucial to pre-train BERT on a finance-specific corpus for finance-related downstream tasks; moreover, the proposed NFinBERT outperforms the other compared models on 12-class classification for sentence-level numeral category prediction.

References

Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.

Matthias M. M. Buehlmaier and Toni M. Whited. 2018. Are financial constraints priced? Evidence from textual analysis. The Review of Financial Studies, 31(7):2693–2728.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Vinicio DeSola, Kevin Hanna, and Pri Nonis. 2019. FinBERT: Pre-trained model on SEC filings for financial natural language tasks. Working paper.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.

Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. 2009. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272–280.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Sheng-Chieh Lin, Wen-Yuh Su, Po-Chuan Chien, Ming-Feng Tsai, and Chuan-Ju Wang. 2020. Self-attentive sentimental sentence embedding for sentiment analysis. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1678–1682.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance, 66(1):35–65.

Jonathan Schler. 2006. Effects of age and gender on blogging. In Proceedings of the AAAI Symposium on Computational Approaches for Analyzing Weblogs, pages 199–205.

Matheus Gomes Sousa, Kenzo Sakiyama, Lucas de Souza Rodrigues, Pedro Henrique Moraes, Eraldo Rezende Fernandes, and Edson Takashi Matsubara. 2019. BERT for stock market sentiment analysis. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence, pages 1597–1601.

Ming-Feng Tsai, Chuan-Ju Wang, and Po-Chuan Chien. 2016. Discovering finance keywords via continuous-space language models. ACM Transactions on Management Information Systems, 7(3):1–17.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems, pages 5998–6008.

Wei Wang, Maofu Liu, Yukun Zhang, Junyi Xiang, and Ruibin Mao. 2019. Financial numeral classification model based on BERT. In Proceedings of the NII Conference on Testbeds and Community for Information Access Research, pages 193–204. Springer.