Complementing Language Embeddings with Knowledge Bases for Specific Domains

Paolo Tenti, Gabriella Pasi and Rafael Peñaloza
IKR3 Lab, University of Milano-Bicocca, Milan, Italy

DAO-XAI 2021
p.tenti1@campus.unimib.it (P. Tenti); gabriella.pasi@unimib.it (G. Pasi); rafael.penaloza@unimib.it (R. Peñaloza)
https://ikr3.disco.unimib.it/people/paolo-tenti/ (P. Tenti); https://ikr3.disco.unimib.it/people/gabriella-pasi/ (G. Pasi); https://rpenalozan.github.io/ (R. Peñaloza)
ORCID: 0000-0002-9421-8566 (P. Tenti); 0000-0002-6080-8170 (G. Pasi); 0000-0002-2693-5790 (R. Peñaloza)

Abstract
Language embeddings are a promising approach for handling natural language expressions. Current embeddings are trained over a large general-language corpus and need to be retrained to deal with specific sub-domains. Moreover, these embeddings often disregard even basic domain knowledge, making them especially fragile when handling technical, domain-specific knowledge and requiring costly re-training. To alleviate this issue, we propose a combined approach in which the embedding is seen as a model of a logical knowledge base. Through a continuous learning approach, the embedding improves its satisfaction of the knowledge base and, in turn, produces better training examples by labelling previously unseen text. In this position paper we describe the general framework for this continuous learning, along with its main features.

Keywords
Language embedding, Knowledge Bases, Natural Language Understanding, Neuro-Symbolic Learning

1. Introduction

Natural Language Understanding (NLU) is the automated act of understanding language expressions, and it is pivotal to many text-related applications (e.g., text classification, information retrieval, question answering). These applications require features that faithfully represent the meaning of text, to be used by the relevant algorithms. Language embeddings (LE) [1, 2, 3] are dense representations of textual expressions that capture their distributional semantics by pre-training a language model over large corpora of general language (e.g., Wikipedia). Their pre-trained nature allows LE to be conveniently used in many down-stream tasks as representations of language expressions (known as transfer learning).

However, pre-trained LE are challenged by domain-specific language in several applications. There are three main reasons for this. First, LE do not capture domain-specific language, and require re-training over domain-specific corpora of unstructured text. However, re-training is computationally expensive, and the available datasets are not always large enough to effectively re-train LE. Second, LE capture the sense of words from their context, as distributional semantics, whereas many domain-specific tasks require a more precise understanding of text. To cope with these problems one can take advantage of structured information, such as key phrases or ontological categories. Third, even general-domain applications often require representing larger fragments of text (e.g., sentences, paragraphs). In such cases NLU techniques need to capture a deep, semantically rooted understanding of text structures that are more complex than simple bags of tokens.
Although several methods have been proposed in the literature, they are often task-specific and, in practice, increase the design complexity of down-stream models.

To mitigate these challenges, we propose to complement LE with Knowledge Bases (KB), that is, formal representations of knowledge. Specifically, we aim to improve language understanding by (i) acting on the LE's ability to represent domain-specific language expressions, and (ii) linking KB symbols (i.e., entities and relations) to language expressions. Intuitively, we consider language expressions as instances of KB assertions (interpretations), hence defining KB embeddings (KBE), i.e., dense representations of KB symbols, as a function of LE. We jointly train both from supervised data to maximise the logical satisfaction of the KB. We argue that this approach helps to address the LE challenges highlighted above.

First, we propose to use the KB to extract a domain-specific supervised dataset to pre-train our model. By doing so we decouple the problem of re-training LE over domain-specific data from task-specific datasets that, as noted above, may not be large enough. Second, we propose to learn representations of text fragments longer than a token by means of a knowledge-aware task mediated by the KB, namely learning a representation of KB symbols (i.e., KBE). We argue that such representations are more useful for down-stream tasks than those obtained with the next-sentence prediction task, which mainly capture syntactic properties rather than semantics. Third, we propose to use the combined knowledge and language embeddings (i.e., KBE and LE) to associate KB symbols to language expressions, and to use them as features in down-stream models. We argue that representing language expressions with symbols, in addition to LE, is a step towards characterising their meaning more precisely than distributional semantics alone.

In addition, note that by using KB symbols as features we improve the interpretability of down-stream models. Moreover, as we explain in more detail below, interpretability enables a continuous learning framework to mutually improve LE and KBE.

2. Related work

Complementing KBs and LE is not new. Some works [4, 5, 6] focus on the challenge of representing domain-specific language with LE, and use a knowledge-aware task (mediated by a KB) to re-train LE, while [7] specifically focuses on continuous learning. The problem of complementing Knowledge Graph (KG) embeddings with textual information, such as names and descriptions of KG entities and relations, to improve the KG completion task is studied in [8, 9, 10]. However, none of those works address the problem of complementing domain-specific LE with the extraction of symbolic features from language expressions for NLU.

Petroni et al. [11] study the ability of word embeddings to capture relational knowledge, similarly to what a KB does, focusing on general language. They highlight that word embeddings can capture lightweight KB capabilities. From this perspective, [12] proposes encoding relational knowledge in a separate word embedding learned from co-occurrence statistics, complementary to a given standard word embedding. The analysis presented by the authors shows that relational word vectors do indeed capture information that is complementary to what is encoded in standard word embeddings. We argue that formal representations of knowledge are not matched by distributional semantics out of the box.
Information Extraction (IE) aims to extract structured information from unstructured text. Most work focuses on unsupervised methods, to face the challenges of compiling supervised datasets and of obtaining a KB upfront [13, 14]. Open Information Extraction (OIE) [15, 16, 17] extracts relational facts from unstructured text as surface patterns (i.e., spans of pure unstructured text), without linking them to an existing KB. We argue that using KBs to formally describe domain knowledge is beneficial to enforce control over the IE process. In fact, KBs can assist in the compilation of supervision, simplify the evaluation of IE results by focusing on a few well-known symbols rather than on the far more numerous surface patterns, and improve the explainability of down-stream tasks.

Several tasks focus on extracting KB resources from unstructured text; e.g., Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE). Traditional approaches use extraction pipelines that treat NER, RE and NEL as separate tasks, suffering from error propagation and ignoring synergies between sub-tasks. In addition, these methods depend heavily on complex features. Thus, recent works focus on building joint, neural models [18]. These models are either task-purposed (i.e., they only focus on entities [19] or relations [20, 18]) or domain-specific [21, 22]. We are interested, instead, in the more general problem of modelling synergies between language expressions and KBs, to extract complete relational facts from language expressions for any domain, similarly to KG completion [23].

Providing dense representations of KG resources (subjects, objects and relations) has been widely considered [24, 25]. In such models, KG embeddings (KGE) are learnt from relational facts, to optimise a predetermined embedding function. However, domain-specific background knowledge is usually formalised through hierarchies, taxonomies and logical rules, which are typical of KBs rather than KGs; the latter instead store large collections of relational facts. Gutiérrez-Basulto and Schockaert [26] showed that KGE models hardly capture even the most basic logical properties of KBs, and proposed to represent KB resources through convex regions in a semantic space, which is seen as an interpretation of the KB. In addition, they describe how to keep the embedding model open to external resources, e.g., language expressions represented by embeddings. Kulmanov et al. [27] apply a similar KB embedding model to a domain-specific KB completion task. Although [26, 27] are related to our study, we focus on NLU. Similar approaches combining logic and real-world objects represented by embeddings were studied in [28, 29]. These differ in the methods used to enforce logical consistency (i.e., fuzzy logic or probability), in contrast to the geometric properties exploited in [26, 27]. In addition, such approaches have not been fully explored as a means to interpret language expressions for NLU [30, 31].
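To make the geometric intuition behind [26, 27] concrete, the following minimal sketch (our own illustration in Python, not taken from those papers; the box shapes, the toy coordinates, and the rule CapitalOf(x, y) -> LocatedIn(x, y) are all hypothetical) uses axis-aligned boxes as a particularly simple family of convex regions, and shows how satisfying a rule reduces to a region-inclusion check.

```python
import numpy as np

class Box:
    """An axis-aligned box: a simple kind of convex region in R^n."""
    def __init__(self, low, high):
        self.low, self.high = np.asarray(low, float), np.asarray(high, float)

    def contains(self, point):
        point = np.asarray(point, float)
        return bool(np.all(self.low <= point) and np.all(point <= self.high))

    def is_subregion_of(self, other):
        return bool(np.all(other.low <= self.low) and np.all(self.high <= other.high))

# A binary relation is interpreted as a region over concatenated (subject, object) vectors.
# Toy 2-dimensional entity embeddings, hence 4-dimensional relation regions.
capital_of = Box(low=[0.2, 0.2, 0.6, 0.6], high=[0.4, 0.4, 0.8, 0.8])
located_in = Box(low=[0.1, 0.1, 0.5, 0.5], high=[0.5, 0.5, 0.9, 0.9])

# The rule CapitalOf(x, y) -> LocatedIn(x, y) is satisfied by this interpretation
# exactly because the CapitalOf region is contained in the LocatedIn region.
print(capital_of.is_subregion_of(located_in))   # True

# A concatenated pair of entity vectors (say, Paris ++ France) inside CapitalOf
# is then automatically inside LocatedIn as well.
paris_france = [0.3, 0.3, 0.7, 0.7]
print(capital_of.contains(paris_france), located_in.contains(paris_france))  # True True
```

This direct reading of rules as region inclusions is precisely what standard KGE scoring functions fail to guarantee, according to [26].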
3. Model Description

Our main goal is to infer KB symbols (i.e., entities and relations) from surface patterns, that is, text spans of arbitrary length taken from unstructured text. As a simple example, consider the sentence "The city of lights has been the capital of France for many centuries", where "city of lights" and "France" are surface patterns that should be understood as Paris and France respectively, and the sentence as the assertion CapitalOf(Paris, France).

We consider surface patterns as possible interpretations of KB symbols. Recall that a KB is a partial representation of the world, which usually introduces restrictions on the possible meanings of the symbols it uses. Hence, KB semantics is typically defined by means of interpretations. In essence, an interpretation describes all the instances of interest and their relationships with respect to all the properties expressed in the KB. Slightly more formally, an interpretation consists of an interpretation domain, which describes the objects in the world, and an interpretation function, which describes the meaning of each symbol within this world. The interpretation is a model of the KB if it satisfies all the constraints imposed by the KB [26, 32]. We consider (the set of representations of) surface patterns as an interpretation domain, and aim to learn from data a suitable interpretation function guaranteeing that the resulting interpretation is in fact a model. To achieve this, we propose to (see Figure 1):

• encode surface patterns in a language semantic space, using a pre-trained LE model;
• encode KB symbols as regular regions in a KB semantic space, as in [26]; specifically, relations are interpreted as convex regions;
• build an interpretation function bridging both semantic spaces.

Figure 1: An intuitive representation of the model for a positive case: the parameters of a function translating the language embedding semantic space, and of the regions representing KB symbols, are optimized to satisfy the KB from language expressions labelled with KB symbols.

We use a supervised dataset to jointly train the parameters of the model, using a loss based on the violation of the KB constraints. The supervised dataset labels unstructured text fragments with markers of surface patterns and their corresponding symbols in the KB. The pre-trained model can then be used to obtain domain-specific language embeddings, and to infer KB symbols over natural language expressions by means of the regular regions. In addition, the regular regions can be used as KBE in down-stream tasks (e.g., KB completion).
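To make this training step concrete, here is a minimal PyTorch sketch. It is only an illustration under assumptions that the paper leaves open: the pre-trained LE is a frozen sentence-transformers encoder (an assumed dependency), the interpretation function is a single linear layer, the regular regions are balls given by a centre and a radius, and the loss only penalises labelled surface patterns falling outside the region of their KB symbol. A full implementation would also add loss terms for violated axioms and for binary relations, whose regions live over concatenated subject-object vectors.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer  # assumed dependency

class KBInterpretation(nn.Module):
    """Translates language embeddings into a KB space where each symbol is a ball region."""
    def __init__(self, symbols, le_dim=384, kb_dim=32):
        super().__init__()
        self.translate = nn.Linear(le_dim, kb_dim)  # learned interpretation function
        self.centre = nn.ParameterDict({s: nn.Parameter(torch.randn(kb_dim)) for s in symbols})
        self.radius = nn.ParameterDict({s: nn.Parameter(torch.ones(1)) for s in symbols})

    def violation(self, le_vector, symbol):
        """How far the translated surface pattern falls outside the symbol's region."""
        point = self.translate(le_vector)
        distance = torch.norm(point - self.centre[symbol])
        return torch.relu(distance - torch.abs(self.radius[symbol]))

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # pre-trained LE, kept frozen
model = KBInterpretation(symbols=["Paris", "France"])
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)

# A toy supervised example: surface patterns labelled with their KB symbols.
labelled = [("the city of lights", "Paris"), ("France", "France")]
batch = [(torch.tensor(encoder.encode(text)), symbol) for text, symbol in labelled]

for _ in range(200):   # minimise the violation of the KB constraints
    loss = sum(model.violation(vector, symbol) for vector, symbol in batch)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```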
Importantly, to deal with the problem of polyonymity, notice that entities are also treated as (unary) relations, and are thus represented as regular regions. In fact, entities in the KB are singleton symbols (e.g., Paris), but they are representative of potentially many different surface patterns (e.g., "the city of lights", "Paris", "the capital of France").

We also emphasise the restriction to regular regions for interpreting relations. There are three main reasons for this choice. First, regular regions allow for better interpretability of the representations; second, they support generalizability, avoiding the over-fitting that overly complex regions would entail; and third, regular regions can be succinctly described through a few parameters. Indeed, learning regular regions directly over the original language embedding space would be desirable, because the interpretation function would reduce to the identity function. This would be possible for entities, as LE guarantee that similar entities lie close in the semantic space. However, it might not be possible for relations of arity greater than 1. For one, relations with the same domain and range would have overlapping regions. In addition, relations with a wide domain or range would have very large regions. In both cases we would lose representativity. Thus, we need either to increase the dimensions of the embeddings (n − 1 dimensions, where n is the number of KB symbols, are needed [26], leading to a sparse representation space) or to use a non-linear, relation-specific transformation to encode the inputs.

4. Continuous Learning

Training the proposed model from supervised datasets and relying on pre-existing KBs are certainly two strong assumptions, as both resources might be expensive to obtain. Still, there are domain-specific applications where highly-qualified, labour-intensive, error-prone human interventions are routinely employed. Two examples are the manual screening of scientific publications to be included in literature reviews, and the manual labelling of unstructured text. Such human activities could be shifted to higher-level interventions, such as maintaining KBs and supervised datasets; keeping a degree of control over the inference process through interpretable models is desirable in such scenarios, when compared to completely unsupervised approaches.

We propose a continuous learning framework, which iteratively refines the supervised dataset and the knowledge base. This framework, depicted in Figure 2, is organised into the following steps:

• background knowledge is formalised through a KB containing relational data (assertions) and Datalog rules (axioms);
• assertions from the KB are used to extract a distantly supervised dataset from a domain-specific corpus of unstructured text fragments;
• the supervised dataset is used to re-train the model, and the model is used to comprehend new text fragments by inferring KB symbols;
• the inferred KB symbols can be analysed by humans, and used to maintain the supervised dataset and the KB.

Figure 2: A framework for continuous learning.

We use distant supervision to select sentences from unstructured text corpora that match named entities from KB assertions. A known challenge of this approach is to discriminate whether the matching sentences have a meaning that is coherent with the assertion under scrutiny. Observe that compiling a good dataset for supervised learning is more a matter of precision than of recall: capturing all possible good sentences is less important than capturing a few high-quality sentences that represent the assertions. In our view, this distant supervision problem can be successfully addressed by treating it as a search problem: we view any given assertion as a query made over a corpus of (natural language) sentences. The results in [33] suggest that re-ranking models (e.g., BM25+CE [33], ColBERT [34]) work well in combination with pre-trained LE, showing good generalization capabilities over unseen datasets and domains.
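As a sketch of this assertion-as-query selection step, the snippet below uses BM25 for first-stage retrieval and a cross-encoder for re-ranking. The rank_bm25 and sentence-transformers packages, the toy corpus, and the acceptance threshold are our own assumptions for illustration; they are not prescribed by the framework.

```python
from rank_bm25 import BM25Okapi                  # assumed dependency
from sentence_transformers import CrossEncoder   # assumed dependency

corpus = [
    "The city of lights has been the capital of France for many centuries.",
    "Paris Hilton attended a gala in Milan.",
    "France increased its budget for public transport.",
]
bm25 = BM25Okapi([sentence.lower().split() for sentence in corpus])
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def sentences_for(assertion, top_k=10, threshold=0.0):
    """Select corpus sentences that plausibly express a KB assertion (subject, relation, object)."""
    subject, relation, obj = assertion
    query = f"{subject} {relation} {obj}"              # the assertion viewed as a query
    scores = bm25.get_scores(query.lower().split())    # first-stage lexical retrieval
    candidates = sorted(zip(scores, corpus), reverse=True)[:top_k]
    rerank_scores = reranker.predict([(query, sentence) for _, sentence in candidates])
    # Precision over recall: keep only the sentences that the cross-encoder scores highly.
    return [sentence for (_, sentence), score in zip(candidates, rerank_scores) if score > threshold]

print(sentences_for(("Paris", "capital of", "France")))
```

Sentences accepted here would be labelled with the assertion's symbols and added to the distantly supervised dataset; the rest would be discarded, in line with the precision-first criterion above.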
5. Conclusions and Future Work

We propose a framework to learn, from supervised data, a model aimed at aligning language embeddings and KB representations; such a framework can be useful in two ways. First, we obtain domain-relevant, knowledge-aware language embeddings by continuously re-training them on a KB-mediated task; second, we obtain KB embeddings that provide a model of the KB over language expressions, which can be used to infer KB symbols (i.e., relations and entities) from language expressions. LE and KB symbols can then be used as features in domain-specific down-stream applications. This model can improve the effectiveness of natural language understanding methods in domain-specific applications and, by using KB symbols as features, improve interpretability. We also propose distant supervision to compile a dataset for training, together with text-ranking techniques to improve precision.

One potential application field is the area of literature reviews, where all publications related to a specific topic need to be analysed. In this case, our method can automatically find and recommend scientific publications that match the topic of interest among the huge number of existing publications. Importantly, current literature reviews require extensive interventions from highly-qualified human experts to discern whether a publication is indeed related to the topic studied, and also to evaluate its importance and relevance. As future work, we plan to implement the model and test its performance on potential down-stream tasks.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by generative pre-training, 2018. URL: https://openai.com/blog/language-unsupervised/.
[3] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2227–2237.
[4] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, Q. Liu, ERNIE: Enhanced language representation with informative entities, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441–1451.
[5] B. He, D. Zhou, J. Xiao, X. Jiang, Q. Liu, N. J. Yuan, T. Xu, Integrating graph contextualized knowledge into pre-trained language models, arXiv preprint arXiv:1912.00147 (2019).
[6] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, P. Wang, K-BERT: Enabling language representation with knowledge graph, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 2901–2908.
[7] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, H. Wang, ERNIE 2.0: A continual pre-training framework for language understanding, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 8968–8975.
[8] H. Xiao, M. Huang, X. Zhu, SSP: Semantic space projection for knowledge graph embedding with text descriptions, in: AAAI, 2016, pp. 3104–3110.
[9] D. Nozza, E. Fersini, E. Messina, CAGE: Constrained deep attributed graph embedding, Information Sciences 518 (2020) 56–70.
[10] H. Zhong, J. Zhang, Z. Wang, H. Wan, Z. Chen, Aligning knowledge and text embeddings by entity descriptions, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 267–272.
[11] F. Petroni, T. Rocktäschel, P. S. H. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, S. Riedel, Language models as knowledge bases?, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 2463–2473.
[12] J. Camacho-Collados, L. Espinosa Anke, S. Schockaert, Relational word embeddings, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3286–3296.
[13] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, Y. Choi, COMET: Commonsense transformers for automatic knowledge graph construction, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4762–4779.
[14] S. Yu, T. He, J. R. Glass, Constructing a knowledge graph from unstructured documents without external alignment, arXiv: Computation and Language (2020).
[15] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 3866–3878.
[16] Mausam, Open information extraction systems and downstream applications, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), 2016, pp. 4074–4077.
[17] P. Hohenecker, F. Mtumbuka, V. Kocijan, T. Lukasiewicz, Systematic comparison of neural architectures and training approaches for open information extraction, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8554–8565.
[18] Y. Yuan, X. Zhou, S. Pan, Q. Zhu, Z. Song, L. Guo, A relation-specific attention network for joint entity and relation extraction, in: International Joint Conference on Artificial Intelligence - Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI 2020), 2020, pp. 4054–4060.
[19] I. O. Mulang, K. Singh, C. Prabhu, A. Nadgeri, J. Hoffart, J. Lehmann, Evaluating the impact of knowledge graph context on entity disambiguation models, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2157–2160.
[20] T. Nayak, H. T. Ng, Effective modeling of encoder-decoder architecture for joint entity and relation extraction, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 8528–8535.
[21] F. Li, M. Zhang, G. Fu, D. Ji, A neural joint model for entity and relation extraction from biomedical text, BMC Bioinformatics 18 (2017) 198.
[22] N. Kang, B. Singh, C. Bui, Z. Afzal, E. M. van Mulligen, J. A. Kors, Knowledge-based extraction of adverse drug events from biomedical text, BMC Bioinformatics 15 (2014) 64.
[23] B. D. Trisedya, G. Weikum, J. Qi, R. Zhang, Neural relation extraction for knowledge base enrichment, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 229–240.
[24] Y. Dai, S. Wang, N. N. Xiong, W. Guo, A survey on knowledge graph embedding: Approaches, applications and benchmarks, Electronics 9 (2020) 750.
[25] S. M. Kazemi, D. Poole, SimplE embedding for link prediction in knowledge graphs, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), 2018, pp. 4289–4300.
[26] V. Gutiérrez-Basulto, S. Schockaert, From knowledge graph embedding to ontology embedding? An analysis of the compatibility between vector space representations and rules, in: KR, 2018, pp. 379–388.
[27] M. Kulmanov, W. Liu-Wei, Y. Yan, R. Hoehndorf, EL embeddings: Geometric construction of models for the description logic EL++, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 6103–6109.
[28] L. Serafini, A. S. d’Avila Garcez, Logic tensor networks: Deep learning and logical reasoning from data and knowledge, in: NeSy@HLAI, 2016.
[29] M. Richardson, P. Domingos, Markov logic networks, Machine Learning 62 (2006) 107–136.
[30] F. Bianchi, M. Palmonari, P. Hitzler, L. Serafini, Complementing logical reasoning with sub-symbolic commonsense, in: RuleML+RR - 3rd International Joint Conference on Rules and Reasoning, volume 11784, 2019, pp. 161–170.
[31] I. Donadello, L. Serafini, A. S. d’Avila Garcez, Logic tensor networks for semantic image interpretation, in: Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 1596–1602.
[32] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, R. Rosati, Tractable reasoning and efficient query answering in description logics: The DL-Lite family, Journal of Automated Reasoning 39 (2007) 385–429.
[33] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models, arXiv preprint arXiv:2104.08663 (2021).
[34] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.