Democratizing Financial Knowledge Graph Construction by Mining Massive Brokerage Research Reports

Zehua Cheng¹, Lianlong Wu¹, Thomas Lukasiewicz¹, Emanuel Sallinger¹,² and Georg Gottlob¹,²
¹ Department of Computer Science, University of Oxford, UK
² Institute of Logic and Computation, TU Wien, Austria

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK

Abstract
This work presents a novel automatic financial knowledge graph (KG) construction framework that mines massive brokerage research reports without explicit financial expertise guidance or intensive manual rules. We propose a semantic-entity interaction module that constructs interaction features between entities and their semantic context in the research reports, and we build a KG from scratch according to a predefined schema based on the obtained interaction features. We train the semantic-entity interaction module using a pre-extracted entity set in a distant-supervision fashion. We further introduce entity augmentation over this entity set, using the inference samples of the semantic-entity interaction module to maintain the entity set.

Keywords: Knowledge Graph, Language Model, Financial Research Report, Entity Resolution

1. Introduction

Knowledge graphs (KGs) have emerged as one of the most popular knowledge representation technologies for massive information processing tasks. Financial intelligence analysis is one of the most important tasks in intelligence analysis, as it faces large volumes of documents and tabular data. KGs have already helped financial analysts process large amounts of data and cooperate with state-of-the-art trading systems [1, 2] to achieve high returns in the market. However, such tools are usually monopolized by large companies and are very costly to maintain. To democratize these technologies, we need a framework that can automatically build a financial KG from scratch.

In the financial area, research reports contain a wealth of high-quality data collected by professional agencies, which makes them an ideal resource for constructing a reliable knowledge graph. Financial research reports are professional documents with in-depth research on macroeconomics, finance, industries, industry chains, and companies, produced by various financial research institutions and brokerages. Such reports often cover a wide range of areas and comprehensive data. It is therefore reasonable to build a reliable KG based on financial research reports.

However, there are still challenges in constructing KGs in the financial area from research reports, among which the hardest are the following:

• Entity-relationships are highly coupled to context. Entities are not explicitly represented in research reports but have a complex interaction with their text passages.
• The overall structure of research reports is highly complicated. The structures of different research reports can contradict each other. As research reports accommodate a wide range of data and much professional knowledge, different report structures and professional understandings may express the same content in slightly different ways.

Such features make it difficult to automatically construct a knowledge graph from research reports from scratch. A solution should involve in-depth interaction between the pipeline components to address these challenges.
The high coupling between entities and their context makes rule-based approaches hard to apply, and we find it even more challenging to exploit these features because of the inconsistent wording of unstructured documents. We therefore believe that such highly coupled features must be treated as a whole: decoupling entity and contextual information and feeding entity features and contextual features to separate models is not ideal.

We use a language model to extract contextual semantic features and bridge the feature connections with a conditional random field (CRF) [3]. Language models like BERT [4] and GPT [5] have proven their performance on many challenging natural language processing tasks [6]. Applications in question answering [7] have shown that language models can deal with complicated semantic language features. BERT is therefore an ideal choice for this semantic feature extraction. On top of the language model, a task-specific downstream module can further improve the semantic features obtained by the language model. In named entity recognition (NER), there are successful applications combining BERT with conditional random fields (CRFs) [8, 9], and [10] formulate NER as a machine reading comprehension (MRC) task by introducing an MRC module at the end of the BERT model.

Updating the entity set on the fly can further improve the reliability of the constructed knowledge graph, as the entity set can easily be affected by noise in the raw data. Under such circumstances, we do not want to put all our eggs in one basket: filtering the raw data is the first and most crucial step for building a reliable knowledge graph. The most significant cost of constructing a knowledge graph is data cleaning [11]. By introducing statistical supervision over the raw data, such as domain-specific dictionaries and the regularization of word frequencies, human intervention in data cleaning can be significantly reduced [12]. We therefore create an automated data cleaning pipeline that preprocesses the raw data with various filtering methods. Scholars have also found that using semantic information can reduce the human effort in data cleaning [13, 14, 15]. We thus simultaneously use the inference entities of the language model to extend the entity set.

In this work, we develop an automatic knowledge graph construction pipeline tailored to the financial domain and based on research reports. We achieve an F1 score of 73.5% on a predefined schema over research reports. Our framework is highly scalable, since the overall structure is entirely automatic. We design an entity augmentation procedure to extend the entity set and construct a distant supervision signal over the training process. We also conduct ablation studies to examine the effects of the different components of the pipeline.

2. Related Works

2.1. Knowledge Graph Construction

Traditional KG construction is based on a manually specified ontology and intensive human effort to learn the extraction for each relation in the ontology. More specifically, supervised methods learn from sample input-output pairs, for example hidden Markov models (HMMs) [16] and maximum-entropy-based models such as the MENE system [17] and the ME Tagger [18]. Models based on support vector machines (SVMs) [19] and CRFs [3] are also common supervised methods. In addition, semi-supervised methods require less training data; for example, a binary AdaBoost classifier [20] was proposed for NER. NELL [21] introduced a semi-supervised bootstrapping approach with a predefined ontology of categories and relations that involves human-in-the-loop cooperation, making full use of both human labour and existing data. Snorkel [22] provides a weakly supervised learning model that builds a generative model on top of the overlapping or even conflicting results of handwritten rules. Most recently, unsupervised methods, e.g., KNOWITALL [23], have emerged for knowledge base construction.

2.2. Named Entity Recognition with Language Models

By using different types of heads, BERT [4] can be tailored to a wide range of natural language processing tasks. BERT also has successful applications in named entity recognition [24]. [8] proposed to combine CRFs with BERT for the challenging task of NER in medical documents, and the same model structure has been applied to NER in Portuguese documents [9]. [25] further introduced an additional BiLSTM into the BERT-CRF structure and achieved a better performance in NER on Chinese electronic health records. Some researchers [26] challenge the BiLSTM in [25], considering it redundant, since BERT and the BiLSTM serve the same function.

3. Automatic Knowledge Graph Construction Pipeline

This section introduces each component of our automated financial KG construction pipeline. We first present the overall structure and then the semantic-entity interaction module.

3.1. Overall Structure

The overall structure of our proposed framework is presented in Figure 1; its main ingredients are described as follows.

Figure 1: Overall Structure

Preprocessing. We follow standard data cleaning in NLP by removing brackets, parentheses, quotes, and other punctuation. Before the pipeline, we filter noisy text spans at the sentence level. We then apply a coreference resolution system (COREF) [27] to link mentions of the same entity in the filtered text. We filter out domain-irrelevant entity structures from the output of COREF with a domain-specific predicate dictionary and then tokenize the filtered samples. This dictionary is constructed from the sense-disambiguated predicates in the corpus with the highest frequencies relevant to the financial domain. We extract entities from the filtered data to obtain entity sets based on the elements covered in the schema. The details of the schema are presented in Figure 2 and discussed in Section 4.
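To make the cleaning step concrete, the following is a minimal sketch of the sentence-level filtering under our own assumptions: the regex patterns, the tiny predicate dictionary, and the helper names (clean, keep_sentence) are illustrative, and the coreference step (COREF [27]) is elided because it depends on the chosen toolkit. This is not the authors' exact pipeline.

```python
# Sketch: remove bracketed spans and quotes, then keep only sentences that
# mention at least one in-domain predicate from the dictionary.
import re

PREDICATE_DICT = {"gross margin", "revenue", "eps", "roe"}  # illustrative

def clean(text: str) -> str:
    text = re.sub(r"[\(\[\{（【].*?[\)\]\}）】]", "", text)  # bracketed spans
    text = re.sub(r"[\"'“”‘’]", "", text)                   # quote marks
    return re.sub(r"\s{2,}", " ", text).strip()             # tidy spacing

def keep_sentence(sent: str) -> bool:
    # Keep sentences containing at least one domain-relevant predicate.
    s = sent.lower()
    return any(p in s for p in PREDICATE_DICT)

raw = 'The company (NYSE: ABC) reported "record" gross margin of 42%.'
cleaned = clean(raw)
sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if keep_sentence(s)]
print(sentences)  # ['The company reported record gross margin of 42%.']
```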
Entity Augmentation. We perform entity augmentation with the inference results of the semantic-entity interaction module, since the extracted entities are collected based on a manually designed schema that reflects analysts' interests. For scalability, we merge the inference results of the semantic-entity interaction module into the entity set.

Distant Supervision. We maximise the utility of the extracted entities by constructing a distant supervision signal [28] for the semantic-entity interaction module. Finally, we score the predicate-argument pairs to reflect our confidence in their precision and conciseness.
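As an illustration of how a pre-extracted entity set can distantly supervise the module, the sketch below derives token-level BIO tags by longest-match lookup against the entity set. The dictionary format, the BIO scheme, and the space-free span joining (the underlying CCKS corpus [30] is Chinese) are our assumptions, not the paper's specification.

```python
# Sketch: distant supervision as BIO tagging against a known entity set.
from typing import Dict, List

def distant_labels(tokens: List[str], entity_set: Dict[str, str]) -> List[str]:
    """Assign B-/I-<TYPE> tags by longest match against the entity set."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        best_len, best_type = 0, None
        # Try progressively longer spans starting at token i (max 8 tokens).
        for j in range(i + 1, min(i + 8, len(tokens)) + 1):
            span = "".join(tokens[i:j])  # space-free joining for Chinese text
            if span in entity_set and j - i > best_len:
                best_len, best_type = j - i, entity_set[span]
        if best_type is not None:
            tags[i] = f"B-{best_type}"
            for k in range(i + 1, i + best_len):
                tags[k] = f"I-{best_type}"
            i += best_len
        else:
            i += 1
    return tags

# Toy entity set keyed by surface form, mapping to a schema type:
entity_set = {"贵州茅台": "Organization", "毛利率": "Indicator"}
print(distant_labels(["贵州", "茅台", "的", "毛利率", "上升"], entity_set))
# -> ['B-Organization', 'I-Organization', 'O', 'B-Indicator', 'O']
```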
3.2. Semantic-Entity Interaction Module

The overall structure of the semantic-entity interaction module is presented in Figure 1. The module is composed of a BERT language model with a CRF [3]. The input sequence is encoded by BERT into an intermediate representation with hidden dimension $H$. A soft attention is then applied to the intermediate representation to better learn the interaction, and its output is fed to the CRF layer. We follow the notation in [29] and use the scoring function

$$ s(\mathbf{X}, \mathbf{y}) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}, \qquad (1) $$

where $A$ denotes the parameters of the CRF layer, $A_{i,j}$ represents the score of transitioning from tag $i$ to tag $j$, and $P_{i, y_i}$ is the output score of the classification head of the BERT model for tag $y_i$ at position $i$. We train the semantic-entity interaction module by maximising the log-probability of the correct tag sequence.

As presented in Figure 1, we perform entity augmentation during the inference phase of the semantic-entity interaction module to extend the entity sets. In practice, we use the pre-trained model with the parameters of the transformer layers and the embedding layer fixed, and only allow the classification head and the CRF to be updated by backpropagation.
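A minimal PyTorch sketch of the architecture just described follows: BERT encoding, a soft attention over token representations, emission scores, and a CRF. The single-head scaled dot-product attention is one plausible reading of "soft attention", the CRF comes from the third-party pytorch-crf package, and the Chinese BERT checkpoint is an assumption; the paper does not publish its implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # third-party: pip install pytorch-crf

class SemanticEntityInteraction(nn.Module):
    """BERT encoder -> soft attention -> emission scores -> CRF."""

    def __init__(self, num_tags: int, model_name: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size  # H in the paper
        # Freeze the embedding and transformer layers; only the attention,
        # the classification head, and the CRF receive updates (Section 3.2).
        for p in self.bert.parameters():
            p.requires_grad = False
        # One plausible form of the soft attention over token representations.
        self.attn = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(hidden, num_tags)  # emission scores P
        self.crf = CRF(num_tags, batch_first=True)     # transition scores A

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.attn(h, h, h, key_padding_mask=~attention_mask.bool())
        emissions = self.classifier(h)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-probability of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```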
4. Data Resource

The original research reports and annotations were collected by [30]; the dataset includes 1,200 research reports and 5,131 annotated entities for evaluation. The details of the dataset are shown in Table 1.

Table 1: Knowledge Graph Dataset Statistics

  Knowledge Graph   Entities   Relational Triples   Property Triples
  Seeding KG        5,131      6,091                354
  Evaluation KG     12,668     20,707               974

The task is to construct a knowledge graph according to the schema presented in Figure 2. Each element in the schema is explained as follows:

Figure 2: Schema of the Knowledge Graph

• Research Report indicates the resource origin, represented by the title of the research report.
• Indicator indicates the financial indicators in research reports, such as ROE, EPS, and gross margin.
• People indicates actual natural persons.
• Organization indicates institutional types of entities, such as companies, businesses, and governments.
• Product refers to items produced by companies that can be bought and sold, including software products. Products usually involve an ownership transition during the transaction.
• Service refers to an actual service, which usually does not involve an ownership transition during the transaction.
• Risk indicates the risk warnings in the research report.
• Article indicates publications cited in the research report.
• Industry indicates the industry to which a company belongs.
• Brand indicates a brand that a company owns. Some companies may have overlapping brand names, so it is necessary to disambiguate the referenced brand and the company name based on the context.

5. Experiment Setup

We implemented our framework and trained it on a cluster of eight NVIDIA V100 GPUs with a batch size of 32 per GPU. We use the BERT-base model as the pre-trained weights of the language model, with a learning rate of 1e-3 and the Adam optimiser, training for 10 epochs. We use HanLP [31] to extract the entities from the filtered data.
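Under the setup above, the training loop could look roughly as follows. The tag count and train_loader are assumed placeholders, and SemanticEntityInteraction refers to the sketch given in Section 3.2; this is an illustration of the stated configuration, not the authors' code.

```python
import torch

model = SemanticEntityInteraction(num_tags=21)  # tag count is illustrative
# Optimise only the parameters left trainable (attention, classifier, CRF).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)  # lr 1e-3, as in Section 5

model.train()
for epoch in range(10):                # 10 epochs, as in Section 5
    for batch in train_loader:         # assumed DataLoader, batch size 32/GPU
        loss = model(batch["input_ids"],
                     batch["attention_mask"],
                     tags=batch["tags"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```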
6. Evaluation

We follow the Cold Start evaluation task of TAC KBP [32]. The scoring metrics are based on the official evaluation toolkit (https://github.com/wikilinks/neleval). The evaluation starts from a predefined schema (see the details of the schema in Figure 2) and a small number of seed knowledge graphs, and builds knowledge graphs from unstructured text data: entities, relationships, and attribute values matching the schema are automatically extracted from the text of the research reports, enabling the automated construction of financial knowledge graphs. We use the F1 score to evaluate the model's overall performance.

The experimental results of the language model with different components are presented in Table 2. To fully present the value of the semantic-entity interaction module, we run an ablation study comparing different downstream task-specific modules in the overall structure under the same preprocessing setup. In particular, we compare BERT with CRF against BERT with MRC [10]; similarly to BERT with CRF, [10] also involves an interaction between the language model and an additional downstream task-specific module.

Table 2: Experimental results for the different modules in precision, recall, and F1 score (%). SA refers to the soft attention module.

  Method             F1      Precision   Recall
  BERT w/CRF         72.50   83.20       64.23
  BERT w/MRC         68.57   79.55       60.25
  BERT w/SA w/CRF    73.50   86.69       63.79
  BERT w/SA w/MRC    69.29   81.55       60.23

We can infer from Table 2 that our proposed combination of the language model and a CRF with soft attention achieves the highest performance. The MRC module is not designed for this setting, whereas the CRF is more suitable for such tasks. Introducing soft attention further improves the performance of the overall structure by 1%; soft attention also improves BERT with MRC by 0.68%.
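For intuition only, the sketch below computes a simplified micro-averaged precision/recall/F1 over extracted triples; the official scoring uses the neleval toolkit, and the triple format and example values here are illustrative, not from our experiments.

```python
# Simplified stand-in for triple-level scoring (not the neleval toolkit).
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def micro_prf(pred: Set[Triple], gold: Set[Triple]):
    """Micro-averaged precision, recall, and F1 over extracted triples."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

pred = {("CompanyA", "owns_brand", "BrandX"),
        ("CompanyA", "industry", "Retail")}
gold = {("CompanyA", "owns_brand", "BrandX"),
        ("CompanyA", "chairman", "PersonY")}
print(micro_prf(pred, gold))  # (0.5, 0.5, 0.5)
```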
7. Conclusion

In this work, we proposed a novel knowledge graph construction framework based on brokerage research reports. Our proposed method achieves an F1 score of 73.5%. We expect our method to be extensible and reliable, and we expect that the overall performance of our model can be further improved by using a more complicated language model like RoBERTa [33] or GPT-3 [5].

References

[1] X. Fu, X. Ren, O. J. Mengshoel, X. Wu, Stochastic optimization for market return prediction using financial knowledge graph, in: 2018 IEEE International Conference on Big Knowledge (ICBK), 2018, pp. 25–32.
[2] S. Deng, N. Zhang, W. Zhang, J. Chen, J. Z. Pan, H. Chen, Knowledge-driven stock trend prediction and explanation via temporal convolutional network, in: Companion Proceedings of The 2019 World Wide Web Conference, 2019, pp. 678–685.
[3] J. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the 18th International Conference on Machine Learning, ICML '01, 2001, pp. 282–289.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[6] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842–866.
[7] C. Alberti, K. Lee, M. Collins, A BERT baseline for the Natural Questions, arXiv preprint arXiv:1901.08634 (2019).
[8] J. Mao, W. Liu, Hadoken: A BERT-CRF model for medical document anonymization, in: IberLEF@SEPLN, 2019, pp. 720–726.
[9] F. Souza, R. Nogueira, R. Lotufo, Portuguese named entity recognition using BERT-CRF, arXiv preprint arXiv:1909.10649 (2019).
[10] X. Li, J. Feng, Y. Meng, Q. Han, F. Wu, J. Li, A unified MRC framework for named entity recognition, arXiv preprint arXiv:1910.11476 (2019).
[11] M. Muller, I. Lange, D. Wang, D. Piorkowski, J. Tsay, Q. V. Liao, C. Dugan, T. Erickson, How data science workers work with data: Discovery, capture, curation, design, creation, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–15.
[12] M. Mahdavi, F. Neutatz, L. Visengeriyeva, Z. Abedjan, Towards automated data cleaning workflows, Machine Learning 15 (2019) 16.
[13] E. Rahm, H. H. Do, Data cleaning: Problems and current approaches, IEEE Data Eng. Bull. 23 (2000) 3–13.
[14] W. L. Low, M. L. Lee, T. W. Ling, A knowledge-based approach for duplicate elimination in data cleaning, Information Systems 26 (2001) 585–606.
[15] Z. Kedad, E. Métais, Ontology-based data cleaning, in: International Conference on Application of Natural Language to Information Systems, Springer, 2002, pp. 137–149.
[16] D. M. Bikel, R. Schwartz, R. M. Weischedel, An algorithm that learns what's in a name, Machine Learning 34 (1999) 211–231.
[17] A. Borthwick, A maximum entropy approach to named entity recognition, PhD thesis (1999).
[18] J. R. Curran, S. Clark, Language independent NER using a maximum entropy tagger, 2003, pp. 164–167.
[19] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.
[20] X. Carreras, L. Màrquez, L. Padró, Named entity extraction using AdaBoost, 2002, pp. 1–4.
[21] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al., Never-ending learning, Communications of the ACM 61 (2018) 103–115.
[22] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, Proceedings of the VLDB Endowment 11 (2017) 269–282. arXiv:1711.10160.
[23] O. Etzioni, M. Cafarella, D. Downey, A. M. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, Unsupervised named-entity extraction from the Web: An experimental study, Artificial Intelligence 165 (2005) 91–134.
[24] J. Vamvas, BERT for NER, https://vamvas.ch/bert-for-ner (2019).
[25] Z. Dai, X. Wang, P. Ni, Y. Li, G. Li, X. Bai, Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records, in: 2019 12th International Congress on Image and Signal Processing, Biomedical Engineering and Informatics (CISP-BMEI), 2019, pp. 1–5.
[26] Z. Liu, NER implementation with BERT and CRF model, 2020.
[27] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, To appear 7 (2017) 411–420.
[28] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011.
[29] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, arXiv preprint arXiv:1603.01360 (2016).
[30] Biendata, CCKS 2020: Evaluation of automated construction of financial knowledge graph based on ontology, 2020.
[31] H. He, J. D. Choi, The stem cell hypothesis: Dilemma behind multi-task learning with transformer encoders, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5555–5577.
[32] H. Ji, J. Nothman, H. T. Dang, S. I. Hub, Overview of TAC-KBP2016 tri-lingual EDL and its impact on end-to-end cold-start KBP, in: Proceedings of TAC, 2016.
[33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).