Democratizing Financial Knowledge Graph Construction by Mining Massive Brokerage Research Reports

Zehua Cheng¹, Lianlong Wu¹, Thomas Lukasiewicz¹, Emanuel Sallinger¹,² and Georg Gottlob¹,²
¹ Department of Computer Science, University of Oxford, UK
² Institute of Logic and Computation, TU Wien, Austria

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK

Abstract
This work presents a novel automatic financial knowledge graph (KG) construction framework that mines massive brokerage research reports without explicit financial expertise guidance or intensive manual rules. We propose a semantic-entity interaction module that constructs interaction features between entities and their semantic context in the research reports, and we build a KG from scratch according to a predefined schema based on the obtained interaction features. We train the semantic-entity interaction module using a pre-extracted entity set in a distant-supervision fashion. We further introduce entity augmentation over this entity set, using the inference samples of the semantic-entity interaction module to maintain the entity set.

Keywords: Knowledge Graph, Language Model, Financial Research Report, Entity Resolution

1. Introduction

Knowledge graphs (KGs) have emerged as one of the most popular knowledge representation technologies for massive information processing tasks. Financial intelligence analysis is one of the most important tasks in intelligence analysis, as it faces large volumes of documents and tabular data. KGs have already helped financial analysts process large amounts of data and cooperate with state-of-the-art trading systems [1, 2] to achieve high returns in the market. However, such tools are usually monopolized by large companies and are very costly to maintain. To democratize these technologies, we need a framework that can automatically build a financial KG from scratch.

In the financial area, research reports contain a wealth of high-quality data collected by professional agencies, which makes them an ideal resource for constructing a reliable knowledge graph. Financial research reports are professional documents with in-depth research on macroeconomics, finance, industries, industry chains, and companies, produced by various financial research institutions and brokerages. Such reports often cover a wide range of areas and comprehensive data. It is therefore reasonable to build a reliable KG based on financial research reports.

However, there are still challenges in constructing KGs in the financial area from research reports, among which the hardest are the following:

• Entity-relationships are highly coupled to context. Entities are not explicitly represented in research reports but have a complex interaction with their text passages.
• The overall structure of research reports is highly complicated. The structures of different research reports can contradict each other. As research reports accommodate a wide range of data and much professional knowledge, different report structures and professional understandings may express the same content in slightly different ways.

Such features make it difficult to automatically construct a knowledge graph from research reports from scratch. A solution should involve in-depth interaction between the pipeline components to address these challenges.
The high coupling between entities and their context makes rule-based approaches hard to apply, and we find it even more challenging to exploit these features because of the inconsistent wording of unstructured documents. We therefore believe that such highly coupled features must be treated as a whole: decoupling entity and contextual information and feeding entity features and contextual features to separate models is not ideal.

We use a language model to extract contextual semantic features and bridge the feature connections with a conditional random field (CRF) [3]. Language models like BERT [4] and GPT [5] have proven their performance on many challenging natural language processing tasks [6]. Applications in question answering [7] have shown that language models can deal with complicated semantic language features. BERT is therefore an ideal choice for this semantic feature extraction. On top of the language model, a task-specific downstream module can further improve the semantic features obtained by the language model. In named entity recognition (NER), there are successful applications combining BERT with conditional random fields (CRFs) [8, 9], and [10] formulate NER as a machine reading comprehension (MRC) task by introducing an MRC module at the end of the BERT model.

Updating the entity set on the fly can further improve the reliability of the constructed knowledge graph, as the entity set can easily be affected by noise in the raw data. Under such circumstances, we do not want to put all our eggs in one basket: filtering the raw data is the first and most crucial step for building a reliable knowledge graph. The most significant cost of constructing a knowledge graph is data cleaning [11]. By introducing statistical supervision over the raw data, such as domain-specific dictionaries and the regularization of word frequencies, human intervention in data cleaning can be significantly reduced [12]. We therefore create an automated data cleaning pipeline that preprocesses the raw data with various filtering methods. Scholars have also found that using semantic information can reduce the human effort in data cleaning [13, 14, 15]. We thus simultaneously use the inference entities of the language model to extend the entity set.

In this work, we develop an automatic knowledge graph construction pipeline tailored to the financial domain and based on research reports. We achieve an F1 score of 73.5% on a predefined schema over research reports. Our framework is highly scalable, since the overall structure is entirely automatic. We design an entity augmentation procedure to extend the entity set and construct a distant supervision signal over the training process. We also conduct ablation studies to examine the effects of the different components of the pipeline.

2. Related Works

2.1. Knowledge Graph Construction

Traditional KG construction is based on a manually specified ontology and intensive human effort to learn the extraction for each relation in the ontology. More specifically, supervised methods learn from sample input-output pairs, for example hidden Markov models (HMMs) [16] and maximum-entropy-based models such as the MENE system [17] and the ME Tagger [18]. Models based on support vector machines (SVMs) [19] and CRFs [3] are also common supervised methods. In addition, semi-supervised methods require less training data; for example, a binary AdaBoost classifier [20] was proposed for NER. NELL [21] introduced a semi-supervised bootstrapping approach with a predefined ontology of categories and relations that involves human-in-the-loop cooperation, making full use of both human labour and existing data. Snorkel [22] provides a weakly supervised learning model that builds a generative model on top of the overlapping or even conflicting results of handwritten rules. Most recently, unsupervised methods, e.g., KNOWITALL [23], have emerged for knowledge base construction.

2.2. Named Entity Recognition with Language Models

By using different types of heads, BERT [4] can be tailored to a wide range of natural language processing tasks. BERT also has successful applications in named entity recognition [24]. [8] proposed to combine CRFs with BERT for the challenging task of NER in medical documents, and the same model structure has been applied to NER in Portuguese documents [9]. [25] further introduced an additional BiLSTM into the BERT-CRF structure and achieved a better performance in NER on Chinese electronic health records. Some researchers [26] challenge the BiLSTM in [25], considering it redundant, since BERT and the BiLSTM serve the same function.

3. Automatic Knowledge Graph Construction Pipeline

This section introduces each component of our automated financial KG construction pipeline. We first present the overall structure and then the semantic-entity interaction module.

3.1. Overall Structure

The overall structure of our proposed framework is presented in Figure 1; its main ingredients are described as follows.

Figure 1: Overall Structure

Preprocessing. We follow standard data cleaning in NLP by removing brackets, parentheses, quotes, and other punctuation. Before the pipeline, we filter noisy text spans at the sentence level. We then apply a coreference resolution system (COREF) [27] to link mentions of the same entity in the filtered text. We filter out domain-irrelevant entity structures from the output of COREF with a domain-specific predicate dictionary and then tokenize the filtered samples. This dictionary is constructed from the sense-disambiguated predicates in the corpus with the highest frequencies relevant to the financial domain. We extract entities from the filtered data to obtain entity sets based on the elements covered in the schema. The details of the schema are presented in Figure 2 and discussed in Section 4.
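To make the cleaning step concrete, the following is a minimal sketch of the sentence-level filtering under our own assumptions: the regex patterns, the tiny predicate dictionary, and the helper names (clean, keep_sentence) are illustrative, and the coreference step (COREF [27]) is elided because it depends on the chosen toolkit. This is not the authors' exact pipeline.

```python
# Sketch: remove bracketed spans and quotes, then keep only sentences that
# mention at least one in-domain predicate from the dictionary.
import re

PREDICATE_DICT = {"gross margin", "revenue", "eps", "roe"}  # illustrative

def clean(text: str) -> str:
    text = re.sub(r"[\(\[\{（【].*?[\)\]\}）】]", "", text)  # bracketed spans
    text = re.sub(r"[\"'“”‘’]", "", text)                   # quote marks
    return re.sub(r"\s{2,}", " ", text).strip()             # tidy spacing

def keep_sentence(sent: str) -> bool:
    # Keep sentences containing at least one domain-relevant predicate.
    s = sent.lower()
    return any(p in s for p in PREDICATE_DICT)

raw = 'The company (NYSE: ABC) reported "record" gross margin of 42%.'
cleaned = clean(raw)
sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if keep_sentence(s)]
print(sentences)  # ['The company reported record gross margin of 42%.']
```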
Entity Augmentation. We perform entity augmentation with the inference results of the semantic-entity interaction module, since the extracted entities are collected based on a manually designed schema that reflects analysts' interests. For scalability, we merge the inference results of the semantic-entity interaction module into the entity set.

Distant Supervision. We maximise the utility of the extracted entities by constructing a distant supervision signal [28] for the semantic-entity interaction module. Finally, we score the predicate-argument pairs to reflect our confidence in their precision and conciseness.
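As an illustration of how a pre-extracted entity set can distantly supervise the module, the sketch below derives token-level BIO tags by longest-match lookup against the entity set. The dictionary format, the BIO scheme, and the space-free span joining (the underlying CCKS corpus [30] is Chinese) are our assumptions, not the paper's specification.

```python
# Sketch: distant supervision as BIO tagging against a known entity set.
from typing import Dict, List

def distant_labels(tokens: List[str], entity_set: Dict[str, str]) -> List[str]:
    """Assign B-/I-<TYPE> tags by longest match against the entity set."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        best_len, best_type = 0, None
        # Try progressively longer spans starting at token i (max 8 tokens).
        for j in range(i + 1, min(i + 8, len(tokens)) + 1):
            span = "".join(tokens[i:j])  # space-free joining for Chinese text
            if span in entity_set and j - i > best_len:
                best_len, best_type = j - i, entity_set[span]
        if best_type is not None:
            tags[i] = f"B-{best_type}"
            for k in range(i + 1, i + best_len):
                tags[k] = f"I-{best_type}"
            i += best_len
        else:
            i += 1
    return tags

# Toy entity set keyed by surface form, mapping to a schema type:
entity_set = {"贵州茅台": "Organization", "毛利率": "Indicator"}
print(distant_labels(["贵州", "茅台", "的", "毛利率", "上升"], entity_set))
# -> ['B-Organization', 'I-Organization', 'O', 'B-Indicator', 'O']
```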
3.2. Semantic-Entity Interaction Module

The overall structure of the semantic-entity interaction module is presented in Figure 1. The module is composed of a BERT language model with a CRF [3]. The input sequence is encoded by BERT into an intermediate representation with hidden dimension $H$. A soft attention is then applied to the intermediate representation to better learn the interaction, and its output is fed to the CRF layer. We follow the notation in [29] and use the scoring function

$$ s(\mathbf{X}, \mathbf{y}) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}, \qquad (1) $$

where $A$ denotes the parameters of the CRF layer, $A_{i,j}$ represents the score of transitioning from tag $i$ to tag $j$, and $P_{i, y_i}$ is the output score of the classification head of the BERT model for tag $y_i$ at position $i$. We train the semantic-entity interaction module by maximising the log-probability of the correct tag sequence.

As presented in Figure 1, we perform entity augmentation during the inference phase of the semantic-entity interaction module to extend the entity sets. In practice, we use the pre-trained model with the parameters of the transformer layers and the embedding layer fixed, and only allow the classification head and the CRF to be updated by backpropagation.
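A minimal PyTorch sketch of the architecture just described follows: BERT encoding, a soft attention over token representations, emission scores, and a CRF. The single-head scaled dot-product attention is one plausible reading of "soft attention", the CRF comes from the third-party pytorch-crf package, and the Chinese BERT checkpoint is an assumption; the paper does not publish its implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # third-party: pip install pytorch-crf

class SemanticEntityInteraction(nn.Module):
    """BERT encoder -> soft attention -> emission scores -> CRF."""

    def __init__(self, num_tags: int, model_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size  # H in the paper
        # Freeze the embedding and transformer layers; only the attention,
        # the classification head, and the CRF receive updates (Section 3.2).
        for p in self.bert.parameters():
            p.requires_grad = False
        # One plausible form of the soft attention over token representations.
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(hidden, num_tags)  # emission scores P
        self.crf = CRF(num_tags, batch_first=True)     # transition scores A

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.attn(h, h, h, key_padding_mask=~attention_mask.bool())
        emissions = self.classifier(h)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-probability of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```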
4. Data Resource

The original research reports and annotations were collected by [30]; the dataset includes 1,200 research reports and 5,131 annotated entities for evaluation. The details of the dataset are shown in Table 1.

Table 1: Knowledge Graph Dataset Statistics

  Knowledge Graph   Entities   Relational Triples   Property Triples
  Seeding KG        5,131      6,091                354
  Evaluation KG     12,668     20,707               974

The task is to construct a knowledge graph according to the schema presented in Figure 2. Each element in the schema is explained as follows:

Figure 2: Schema of the Knowledge Graph

• Research Report indicates the resource origin, represented by the title of the research report.
• Indicator indicates the financial indicators in research reports, such as ROE, EPS, and gross margin.
• People indicates actual natural persons.
• Organization indicates institutional types of entities, such as companies, businesses, and governments.
• Product refers to items produced by companies that can be bought and sold, including software products. Products usually involve an ownership transition during the transaction.
• Service refers to an actual service, which usually does not involve an ownership transition during the transaction.
• Risk indicates the risk warnings in the research report.
• Article indicates publications cited in the research report.
• Industry indicates the industry to which a company belongs.
• Brand indicates a brand that a company owns. Some companies may have overlapping brand names, so it is necessary to disambiguate the referenced brand and the company name based on the context.

5. Experiment Setup

We implemented our framework and trained it on a cluster of eight NVIDIA V100 GPUs with a batch size of 32 per GPU. We use the BERT-base model as the pre-trained weights of the language model, with a learning rate of 1e-3 and the Adam optimiser, training for 10 epochs. We use HanLP [31] to extract the entities from the filtered data.
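Under the setup above, the training loop could look roughly as follows. The tag count and train_loader are assumed placeholders, and SemanticEntityInteraction refers to the sketch given in Section 3.2; this is an illustration of the stated configuration, not the authors' code.

```python
import torch

model = SemanticEntityInteraction(num_tags=21)  # tag count is illustrative
# Optimise only the parameters left trainable (attention, classifier, CRF).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # lr 1e-3, as in Section 5

model.train()
for epoch in range(10):                # 10 epochs, as in Section 5
    for batch in train_loader:         # assumed DataLoader, batch size 32/GPU
        loss = model(batch["input_ids"],
                     batch["attention_mask"],
                     tags=batch["tags"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```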
6. Evaluation

We follow the Cold Start evaluation task of TAC KBP [32]. The scoring metrics are based on the official evaluation toolkit (https://github.com/wikilinks/neleval). The evaluation starts from a predefined schema (see the details of the schema in Figure 2) and a small number of seed knowledge graphs, and builds knowledge graphs from unstructured text data: entities, relationships, and attribute values matching the schema are automatically extracted from the text of the research reports, enabling the automated construction of financial knowledge graphs. We use the F1 score to evaluate the model's overall performance.

The experimental results of the language model with different components are presented in Table 2. To fully present the value of the semantic-entity interaction module, we run an ablation study comparing different downstream task-specific modules in the overall structure under the same preprocessing setup. In particular, we compare BERT with CRF against BERT with MRC [10]; similarly to BERT with CRF, [10] also involves an interaction between the language model and an additional downstream task-specific module.

Table 2: Experimental results for the different modules in precision, recall, and F1 score (%). SA refers to the soft attention module.

  Method             F1      Precision   Recall
  BERT w/CRF         72.50   83.20       64.23
  BERT w/MRC         68.57   79.55       60.25
  BERT w/SA w/CRF    73.50   86.69       63.79
  BERT w/SA w/MRC    69.29   81.55       60.23

We can infer from Table 2 that our proposed combination of the language model and a CRF with soft attention achieves the highest performance. The MRC module is not designed for this setting, whereas the CRF is more suitable for such tasks. Introducing soft attention further improves the performance of the overall structure by 1%; soft attention also improves BERT with MRC by 0.68%.
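For intuition only, the sketch below computes a simplified micro-averaged precision/recall/F1 over extracted triples; the official scoring uses the neleval toolkit, and the triple format and example values here are illustrative, not from our experiments.

```python
# Simplified stand-in for triple-level scoring (not the neleval toolkit).
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def micro_prf(pred: Set[Triple], gold: Set[Triple]):
    """Micro-averaged precision, recall, and F1 over extracted triples."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

pred = {("CompanyA", "owns_brand", "BrandX"),
        ("CompanyA", "industry", "Retail")}
gold = {("CompanyA", "owns_brand", "BrandX"),
        ("CompanyA", "chairman", "PersonY")}
print(micro_prf(pred, gold))  # (0.5, 0.5, 0.5)
```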
7. Conclusion

In this work, we proposed a novel knowledge graph construction framework based on brokerage research reports. Our proposed method achieves an F1 score of 73.5%. We expect our method to be extensible and reliable, and we expect that the overall performance of our model can be further improved by using a more complicated language model like RoBERTa [33] or GPT-3 [5].

References

[1] X. Fu, X. Ren, O. J. Mengshoel, X. Wu, Stochastic optimization for market return prediction using financial knowledge graph, in: 2018 IEEE International Conference on Big Knowledge (ICBK), 2018, pp. 25–32.
[2] S. Deng, N. Zhang, W. Zhang, J. Chen, J. Z. Pan, H. Chen, Knowledge-driven stock trend prediction and explanation via temporal convolutional network, in: Companion Proceedings of The 2019 World Wide Web Conference, 2019, pp. 678–685.
[3] J. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the 18th International Conference on Machine Learning, ICML '01, 2001, pp. 282–289.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[6] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842–866.
[7] C. Alberti, K. Lee, M. Collins, A BERT baseline for the Natural Questions, arXiv preprint arXiv:1901.08634 (2019).
[8] J. Mao, W. Liu, Hadoken: A BERT-CRF model for medical document anonymization, in: IberLEF@SEPLN, 2019, pp. 720–726.
[9] F. Souza, R. Nogueira, R. Lotufo, Portuguese named entity recognition using BERT-CRF, arXiv preprint arXiv:1909.10649 (2019).
[10] X. Li, J. Feng, Y. Meng, Q. Han, F. Wu, J. Li, A unified MRC framework for named entity recognition, arXiv preprint arXiv:1910.11476 (2019).
[11] M. Muller, I. Lange, D. Wang, D. Piorkowski, J. Tsay, Q. V. Liao, C. Dugan, T. Erickson, How data science workers work with data: Discovery, capture, curation, design, creation, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–15.
[12] M. Mahdavi, F. Neutatz, L. Visengeriyeva, Z. Abedjan, Towards automated data cleaning workflows, Machine Learning 15 (2019) 16.
[13] E. Rahm, H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Eng. Bull. 23 (2000) 3–13.
[14] W. L. Low, M. L. Lee, T. W. Ling, A knowledge-based approach for duplicate elimination in data cleaning, Information Systems 26 (2001) 585–606.
[15] Z. Kedad, E. Métais, Ontology-based data cleaning, in: International Conference on Application of Natural Language to Information Systems, Springer, 2002, pp. 137–149.
[16] D. M. Bikel, R. Schwartz, R. M. Weischedel, An algorithm that learns what's in a name, Machine Learning 34 (1999) 211–231.
[17] A. Borthwick, A maximum entropy approach to named entity recognition, PhD thesis (1999).
[18] J. R. Curran, S. Clark, Language independent NER using a maximum entropy tagger, 2003, pp. 164–167.
[19] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[20] X. Carreras, L. Màrquez, L. Padró, Named entity extraction using AdaBoost, 2002, pp. 1–4.
[21] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al., Never-ending learning, Communications of the ACM 61 (2018) 103–115.
[22] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11 (2017) 269–282. arXiv:1711.10160.
[23] O. Etzioni, M. Cafarella, D. Downey, A. M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, Unsupervised named-entity extraction from the Web: An experimental study, Artificial Intelligence 165 (2005) 91–134.
[24] J. Vamvas, BERT for NER, https://vamvas.ch/bert-for-ner (2019).
[25] Z. Dai, X. Wang, P. Ni, Y. Li, G. Li, X. Bai, Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records, in: 2019 12th International Congress on Image and Signal Processing, Biomedical Engineering and Informatics (CISP-BMEI), 2019, pp. 1–5.
[26] Z. Liu, NER implementation with BERT and CRF model, 2020.
[27] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, To appear 7 (2017) 411–420.
[28] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011.
[29] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, arXiv preprint arXiv:1603.01360 (2016).
[30] Biendata, CCKS 2020: Evaluation of automated construction of financial knowledge graph based on ontology, 2020.
[31] H. He, J. D. Choi, The stem cell hypothesis: Dilemma behind multi-task learning with transformer encoders, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5555–5577.
[32] H. Ji, J. Nothman, H. T. Dang, S. I. Hub, Overview of TAC-KBP2016 tri-lingual EDL and its impact on end-to-end cold-start KBP, in: Proceedings of TAC, 2016.
[33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).