Towards A Question Answering System over Temporal Knowledge Graph Embeddings
Kristian Otte†, Kristian Simoni Vestermark†, Huan Li and Daniele Dell'Aglio
Department of Computer Science, Aalborg University, Aalborg, Denmark

Abstract
Question Answering (QA) over knowledge graphs is a vital topic within information retrieval. Questions with temporal intent are a special case of questions for QA systems that have received only limited attention so far. In this paper, we study the use of temporal knowledge graph embeddings (TKGEs) for temporal QA. Firstly, we propose a microservice-based architecture for building temporal QA systems on pre-trained TKGE models. Secondly, we present a Bayesian model averaging (BMA) ensemble method, where the results of several link prediction tasks on separate TKGE models are combined to find better answers. Within a system built using the microservice-based architecture, experiments on two benchmark datasets show that BMA provides better results than the individual models.

1. Introduction
Knowledge graphs (KGs), such as Wikidata [1], DBpedia [2], and YAGO [3], have attracted increasing attention from researchers and practitioners. Large and detailed KGs set the foundations for question answering (QA), a core task in applications such as home assistants, chat-bots, and recommender systems [4]. Many question answering over knowledge graph (QA-KG) studies [5, 6] have treated the KG as a database, where natural language questions are translated into queries, e.g., in SPARQL, which are evaluated over the KG. As an alternative approach, recent research [7, 8, 9, 10] has proposed knowledge graph embedding (KGE) for QA systems. KGE models allow systems to find answers that are not explicitly stated in the KG. In this study, we focus on the task of QA on temporal knowledge graphs (TKGs). Temporal questions are a common type of human questions that involve a time context.
Such a context can be explicit, e.g., “Who won the Oscar for best supporting actor in 2022?”, or implicit, e.g., “Who was pope during the fifth crusade?”. Answers to these questions may also contain temporal information, such as questions starting with when. The state of the art includes dedicated systems that focus on answering temporal questions [11, 6]. Some of them, like EXAQT [11], adopt TKG or time-aware embeddings that are learned in the context of the QA task. We observe that such systems are hardly extendable; for example, it is challenging to replace the embedding model with pre-existing models learned using temporal knowledge graph embedding (TKGE) methods [12, 13, 14, 15, 16, 17, 18]. To overcome such a limitation, we present Verðandi, a QA system built on top of pre-trained TKGEs. In Verðandi, each pre-trained model is exposed through a microservice and invoked by other services to solve temporal link prediction and temporal question answering tasks. Verðandi can support research on TKGE by providing an environment in which to compare and contrast existing methods, as well as design and test new techniques. Based on Verðandi, we start investigating whether different TKGE models capture different information from the original TKGs. To do that, we design an ensemble-based module that aggregates the outcomes from each model.

Workshop at ISWC 2022 on Deep Learning for Knowledge Graphs. † These authors contributed equally. kotte17@student.aau.dk (K. Otte); kveste16@student.aau.dk (K. S. Vestermark); lihuan@cs.aau.dk (H. Li); dade@cs.aau.dk (D. Dell'Aglio). ORCID: 0000-0003-0084-1662 (H. Li); 0000-0003-4904-2511 (D. Dell'Aglio). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
In our experiments using two temporal QA benchmark datasets, we find that the ensemble module can produce better results than the individual TKGE models, suggesting that our hypothesis may hold. To summarise, the main contributions of this paper are as follows:
• An open-source1 microservice-based framework that enables an extensible QA-KG system.
• An ensemble method that combines the link prediction results from multiple TKGEs, where the combination yields better results than using an individual TKGE model.
Sections 2 and 3 present the background and the related work, respectively. Section 4 introduces the architecture of Verðandi, and Section 4.3 describes how Verðandi uses ensemble learning to achieve better results than the individual models. Section 5 presents our experimental study on ensemble learning for QA-KG. Section 6 discusses the results and the limitations, and identifies future actions of our research. Section 7 concludes the paper.

2. Background and Notation
A knowledge graph is a directed graph with labeled vertices that represent entities, and edges that denote the relations between entities. Examples of KGs include Wikidata [1], Freebase [19], DBpedia [2], YAGO [3], and ICEWS [20]. The entities and relations form entity-relation-entity triples called facts, written as (ℎ, 𝑟, 𝑡), where ℎ is the head entity, 𝑟 is the relation, and 𝑡 is the tail entity. An example fact is (Obama, isPresidentOf, USA). Temporal knowledge graphs extend KGs by adding temporal annotations to form entity-relation-entity-time quadruples, called temporal facts. We indicate a temporal fact as (ℎ, 𝑟, 𝑡, 𝜏), where 𝜏 is the time. The time can be either an interval [1] or a discrete timestamp [20]. A temporal fact with an interval, e.g., (Obama, presidentOf, USA, [2009,2017]), can be converted into two temporal facts with discrete timestamps indicating the beginning and the ending of the event, e.g., (Obama, becomePresidentOf, USA, 2009) and (Obama, endPresidentOf, USA, 2017).
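The interval-to-timestamp conversion just described can be sketched in a few lines of Python (a minimal illustration; the helper name and the become*/end* renaming convention follow the example above):

```python
def split_interval_fact(head, relation, tail, start, end):
    """Convert an interval-annotated temporal fact into two temporal
    facts with discrete timestamps marking the event's beginning and end."""
    begin_rel = "become" + relation[0].upper() + relation[1:]
    end_rel = "end" + relation[0].upper() + relation[1:]
    return [(head, begin_rel, tail, start), (head, end_rel, tail, end)]

split_interval_fact("Obama", "presidentOf", "USA", 2009, 2017)
# → [('Obama', 'becomePresidentOf', 'USA', 2009),
#    ('Obama', 'endPresidentOf', 'USA', 2017)]
```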
This study focuses on temporal facts with discrete timestamps. A temporal question is a question where a time context is part of the question or the answer. An example of where time is part of the question is (Q1) “Who became president of the USA in 2009?”, and an example of where time is the answer is (Q2) “When did Barack Obama become president of the USA?”. Depending on the type of question, an answer to a temporal question is an entity, a relation, or time information. This answer is the result of using a TKGE model with the known information from the temporal question. Examples of answers are Obama for Q1 and 2009 for Q2 mentioned above.

1 https://gitlab.com/tkge, MIT licence

Link prediction is the task of predicting a valid fact in a KG. It is usually defined as the problem of finding a tail given a head and a relation (denoted (ℎ, 𝑟, ?)), or a head given a relation and a tail (denoted (?, 𝑟, 𝑡)). Link prediction is one of the most popular tasks for KGE. Using link prediction, one can solve QA tasks, as link prediction can be used to determine the likelihood of an answer being true. Thus, when a question is issued, link prediction can be used to generate likely responses.

3. Related Work
KGEs have been extensively investigated in the last decade [4]. Recently, research has begun on integrating temporal dimensions into KGE, mainly by extending non-temporal KGE techniques [12, 13, 14, 15, 16, 17, 18]. ChronoR [18] is inspired by rotational KGE methods, such as RotatE [21]. It uses the linear transformations of rotation and scaling, parameterized by time and relation, to obtain the embedding of a tail entity from a head entity. The Diachronic Entity Embedding methods [15] include DE-SimplE, DE-DistMult, and DE-TransE, as time-aware versions of SimplE [22], DistMult [23], and TransE [24]. All these methods embed time with entities, inspired by diachronic word embeddings [25].
TimePlex [16] is based on ComplEx [26], and embeds entities, relations, and time as vectors in the complex space. QA is a popular KGE task [8, 9, 10], where the embeddings of an entity and a relation are extracted from a question and are used in a link prediction task to find the answers. UPSQA [8] finds the top-𝑘 candidate entities in the question using a bi-LSTM. Then, given the question and a candidate entity, it finds the relation using another bi-LSTM. BuboQA [9] first solves an entity detection task using a bi-LSTM, as UPSQA does. Next, it uses a fuzzy matching mechanism to identify a candidate entity from the KG. The relation is then found using a bi-GRU over all the relations that are used with the candidate entity. BERTQA [10] uses BERT to detect entities and classify relations. Entities in the KG are then linked using fuzzy matching, and candidate facts are formed by combining each relation found with the entity that has the highest probability. QA on TKGs has recently emerged with systems such as TEQUILA [6] and EXAQT [11]. TEQUILA is an enabler method for temporal QA, which can run on top of any static QA-KG system. It detects whether a question has temporal intent: if yes, it decomposes the question into a non-temporal sub-question and a temporal constraint. While the underlying QA-KG system handles the non-temporal sub-question, the temporal constraint is solved using constraint reasoning on temporal intervals. EXAQT answers complex temporal questions with multiple entities, relations, and associated temporal conditions using TKGs in two steps. First, question-relevant compact sub-graphs are computed within the KG and are enhanced with temporal facts using Group Steiner Trees and BERT models. Second, it creates relational graph convolutional networks enhanced with time-aware entity embeddings and attention over temporal relations.
When designing Verðandi, we were inspired by TEQUILA in its use of TKGs and in the idea of an enabler system where the data component can be changed. Similarly to EXAQT, we rely on TKGs for designing the QA mechanism. However, differently from it, we rely on pre-trained state-of-the-art TKGE models instead of computing ad-hoc embeddings based on the QA task.

4. Verðandi
When we designed Verðandi, we decided to rely on a microservice-based architecture, as it brings modularity and helps us conceptualise the interfaces in a TKGE-agnostic fashion. Furthermore, microservices bring an inherent level of scalability, setting the foundations for future studies on large-scale QA systems.

4.1. The Verðandi Modules
We defined three different types of modules based on their functionalities and individual responsibilities: (i) the User Interface Module, (ii) the Natural Language Module, and (iii) the TKGE Module.
The purpose of the User Interface Module is to allow a user to interact with the system in a user-friendly manner. The module itself does not have an API, but it invokes the API of the Natural Language Module.
The Natural Language Module translates natural language questions into TKG link prediction queries that can be handled by TKGE models. Extending the link prediction queries for KGE, these queries consist of finding the missing head, tail, or time annotation of a temporal fact, e.g., (ℎ, 𝑟, ?, 𝜏). After the translation, the module submits the link prediction query to a TKGE module, receiving a list of answers as a result. The Natural Language Module then converts such responses into natural language responses and returns them to the invoking module. Thus, the API of the module receives a string-type natural language question as input and returns a string-type natural language response as output.
The TKGE Module is responsible for evaluating link prediction queries.
The API exposes an endpoint that takes a link prediction query as input and returns a list of temporal facts annotated with confidence scores. We intentionally designed a simple API to achieve flexibility and let developers implement any query evaluation mechanism. For example, although our goal is to answer queries using TKGE models, one may implement a TKGE module that evaluates the query as a triple pattern SPARQL query.
Two possible systems built on these modules are shown in Figure 1. Figure 1a shows a system where a single TKGE model is exposed through a TKGE module. Figure 1b shows a system employing a TKGE module enabling model ensemble: each module at the bottom embeds a different TKGE model and receives the same TKGE query, producing a ranked list of results as an answer. The ensemble module receives the answers and combines them to produce a unified ranked list of results, which is then communicated to the Natural Language Module.

Figure 1: Two possible system architectures implemented using Verðandi. (a) Configuration with a single TKGE module. (b) Configuration with the ensemble module orchestrating 𝑛 TKGE modules.

4.2. Verðandi Implementation
Microservice technologies We considered two technologies for developing the API, namely REST (REpresentational State Transfer) and Remote Procedure Call (RPC). RPC is an inter-process communication method that allows a program to call a procedure of another program as if it were part of its own code. This allows programmers to easily integrate the use of these procedures into their code without thinking about how the communication between these works. gRPC2 is a modern, open source, and efficient RPC framework. It uses a service definition language called Protocol Buffer, which allows developers to define services, endpoints, and messages in a language-agnostic manner.
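In code, the contract of a TKGE module can be sketched as a small Python interface (a sketch under our assumptions: the type aliases and method name are ours, and in Verðandi the service is actually defined in Protocol Buffer and served over gRPC):

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

# A link prediction query (h, r, t, tau): exactly one element is None ("missing").
Query = Tuple[Optional[str], str, Optional[str], Optional[str]]
# An answer: a completed temporal fact plus a confidence score.
Answer = Tuple[Tuple[str, str, str, str], float]

class TKGEModule(ABC):
    """One endpoint: evaluate a link prediction query and return scored
    temporal facts. How the query is answered (a TKGE model, a SPARQL
    engine, ...) is left to the implementation."""

    @abstractmethod
    def predict(self, query: Query, top_n: int = 10) -> List[Answer]:
        ...
```

A module wrapping a concrete TKGE model would implement predict by scoring every candidate completion and returning the top_n highest-scoring temporal facts.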
Automatic tools are then provided, which enable the generation of language-specific code that implements the client and server parts of services defined in Protocol Buffer in various programming languages, such as Go, C++, Java, and Python. Since we do not need the low-level access provided by REST APIs, and because our APIs do not need to be public facing, we chose to use RPC, and specifically gRPC.
Current implementation The current implementation of the User Interface Module consists of a command line interface, which allows submitting questions expressed in natural language. Additionally, the request can contain the number of answers to be retrieved. As QA based on KGEs usually finds answers ordered according to some confidence score, the correct answer may not be first. As such, it can be relevant to retrieve multiple answers, optionally annotated with the relative confidence score. In future iterations, we plan to implement a more user-friendly interface, such as a web-based one.
In our current implementation of the Natural Language Module, we adopted a template-based approach to convert questions into TKGE queries. For example, one of the patterns that the module recognises is: “Who did relation tail on time?”. When the user inputs a query like “Who became president of USA on November 2019?”, the module generates the query (?, becamePresident, USA, 11.2019). In our future work, we aim to implement more sophisticated state-of-the-art NLP algorithms to ease the conversion of questions into queries.
Currently, Verðandi includes four TKGE Modules. Three are implementations of TKGE techniques from [15], namely DE-SimplE, DE-DistMult, and DE-TransE. We picked these methods as the authors provide high-quality open source code, which is ideal for showcasing how to create a module for existing libraries. In our future plans, we aim to develop new TKGE modules. We are in the process of adding TimePlex [16] as it, similarly to [15], provides open source code.
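The template-based conversion in the Natural Language Module can be sketched as follows. This is a minimal, hypothetical single-template version: the regular expression, the month-to-"MM.YYYY" encoding, and the relation camel-casing are our assumptions, not Verðandi's exact implementation:

```python
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "April": 4,
          "May": 5, "June": 6, "July": 7, "August": 8,
          "September": 9, "October": 10, "November": 11, "December": 12}

# One illustrative pattern: "Who <relation phrase> of <tail> on <month> <year>?"
TEMPLATE = re.compile(
    r"Who (?P<rel>.+?) of (?P<tail>.+?) on (?P<month>\w+) (?P<year>\d{4})\?")

def question_to_query(question):
    """Match a question against the template and emit a
    (head, relation, tail, time) query with the head left unknown."""
    match = TEMPLATE.match(question)
    if match is None:
        return None  # no template applies
    words = match.group("rel").split()
    relation = words[0] + "".join(w.capitalize() for w in words[1:])
    time = f"{MONTHS[match.group('month')]:02d}.{match.group('year')}"
    return (None, relation, match.group("tail"), time)

question_to_query("Who became president of USA on November 2019?")
# → (None, 'becamePresident', 'USA', '11.2019')
```

A full implementation would hold one such pattern per recognised question template and try them in turn.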
The fourth TKGE module we have implemented so far is the Ensemble Module. This module allows the system to orchestrate two or more TKGE modules, combining the results from each of these into a single response. The idea behind the Ensemble Module is further described in the next section.

2 Cf. https://grpc.io/ (last accessed: July 2022)

4.3. Ensemble of the TKGE Models
We have used Verðandi to set the basis for studying existing TKGE methods. In particular, we asked ourselves whether different TKGE methods capture the same information from the TKG, and if not, whether a combination of them may lead to better QA answers. We investigate these ideas using ensemble learning, which in other contexts has been shown to achieve better predictive performance by combining multiple models [27]. The goal is to retrieve the top-𝑁 scoring results from a temporal link prediction query evaluated on different models, and to combine the individual answers into a unified one. We use Bayesian model averaging (BMA) [28] for the ensemble, which runs the models individually and combines the scores each model predicted. This method uses pre-trained models, which means that multiple existing models that capture different features can be used to compensate for each other's weaknesses. We employ BMA in two versions: an unweighted one, which assigns the same weight to each model, and a weighted one, where we learn a weight for each model.
The unweighted BMA takes the top-𝑁 results from each of the models and gives them a score equal to their ranking within the top-𝑁. The score 𝑠 is defined as 𝑠 = 𝑁 − 𝑟, where 𝑁 is the number of returned results and 𝑟 is the rank returned by the model. For example, if a model gives a result the rank of 0 (the model predicts this result is the most likely) among 10 results (𝑁 = 10, 𝑟 = 0, 𝑠 = 10 − 0 = 10), then the score of that result will be 10. If another model then gives it a rank of 1, the combined score for this result, based on these two models, will be 19. Figure 2a shows the process of combining results from multiple TKGE servers using the unweighted method. The weighted BMA, shown in Figure 2b, works similarly to the unweighted BMA, but it multiplies the scores by weights before combining them. In this way, different models have different impacts on the final score.

Figure 2: Unweighted and weighted BMA, combining the ranked results of three TKGE servers for the question “Who did Iran express intent to meet or negotiate on 02 Feb, 2014?”. (a) Process of combining results from TKGE modules using the unweighted BMA. (b) Weighted BMA, where the results are multiplied by a weight before being added and reordered.

5. Evaluation
Data One of the KGs most used in the literature for testing link prediction in TKGE models is the Integrated Crisis Early Warning System (ICEWS).
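The BMA combination described in Section 4.3 can be sketched as follows (a minimal illustration; the function and variable names are ours, the example rankings are the top-5 lists from Figure 2, and the weighted call reuses the ICEWS14 weights reported later in this section):

```python
from collections import defaultdict

def bma_combine(rankings, n, weights=None):
    """Combine top-N ranked answer lists from several TKGE modules.
    An answer at (0-based) rank r in a module's list gets the score
    s = n - r, multiplied by that module's weight (1.0 if unweighted);
    scores are summed per answer and the answers re-ranked."""
    weights = weights or [1.0] * len(rankings)
    totals = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, answer in enumerate(ranking):
            totals[answer] += weight * (n - rank)
    return sorted(totals.items(), key=lambda item: -item[1])

# Top-5 rankings from the three TKGE servers in Figure 2 (N = 5):
m1 = ["China", "North Korea", "South Korea", "USA", "Denmark"]
m2 = ["North Korea", "South Korea", "China", "USA", "Denmark"]
m3 = ["China", "Denmark", "USA", "South Korea", "North Korea"]

unweighted = bma_combine([m1, m2, m3], n=5)
# China scores 5 + 3 + 5 = 13, matching Figure 2a's top entry.
weighted = bma_combine([m1, m2, m3], n=5, weights=[0.48, 0.4, 0.12])
```

The weighted call shows how learned weights rescale each module's contribution before the scores are summed; with uniform weights it reduces to the unweighted scheme.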
Researchers, in particular, created two datasets named ICEWS14 and ICEWS05-15 [20], which feature facts from the ICEWS dataset from the year 2014 and the years 2005-2015, respectively. These datasets are event-based, meaning that every fact is annotated with a discrete timestamp. For testing Verðandi, however, we need questions and answers. Therefore, we generate questions from the ICEWS datasets by fitting ICEWS temporal facts into question templates. For example, from the temporal fact (SouthKorea, criticize, NorthKorea, 2014-05-13), we generate questions like “Who did South Korea criticize on 13 May 2014?”. For each fact in the datasets, four questions are generated, one for each temporal fact element that can be missing, i.e., head, relation, tail, and time. Table 1 shows a summary of the datasets. In our experiments, however, we use only the questions with a missing subject or object, because the methods we implemented at the moment do not support queries where the relation or the time is missing.
Methods As explained in Section 4.2, in our experiments we consider three Diachronic Entity Embedding variants: DE-SimplE, DE-DistMult, and DE-TransE. We also use unweighted and weighted BMA as described in Section 4.3. For weighted BMA, there is an extra learning step where the weights are learned. We search the weights using Bayesian optimization [29].
Metrics All systems return a ranked list of the answers from link prediction. The answers consist of the facts and a score for each fact. With the list of answers, we calculate: (i) MRR, i.e., the mean of the reciprocal of the rank of the correct answer, (ii) Hits@1, i.e., the percentage of facts where the answer with the highest score is the correct answer, and (iii) Hits@10, i.e., the percentage of facts where the correct answer is within the 10 highest scored answers. For all the metrics, the higher the score, the better. When running experiments, only the facts in the test set are loaded. This means our results are unfiltered and thus may appear worse than the filtered results reported, e.g., in [15].

Table 1: Statistics of the datasets ICEWS14 and ICEWS05-15.
Dataset      #Ent.   #Rel.  #Time  Train    Valid   Test    Total    Questions
ICEWS14      7,128   230    365    72.8k    8.9k    8.9k    90.7k    362.9k
ICEWS05-15   10,488  251    4,017  368.9k   46.3k   46.1k   479.3k   1.9M

Table 2: Results of Diachronic Embedding models and ensembles on a link prediction task using the TKGE server.
                        ICEWS14                   ICEWS05-15
Model                   MRR    Hits@1  Hits@10    MRR    Hits@1  Hits@10
DE-SimplE               0.505  38.2    73.2       0.496  36.7    74.4
DE-DistMult             0.484  36.8    70.7       0.471  34.8    71.4
DE-TransE               0.312  10.1    68.8       0.304  9.5     68.3
Ensemble (unweighted)   0.515  39.3    74.5       0.493  36.6    73.8
Ensemble (weighted)     0.518  39.6    74.7       0.499  37.1    74.4

Baselines vs Ensemble We compare the baselines to the ensemble method using an unweighted scorer (see Section 4.3), and report the results in Table 2. The results show that combining all the models performs better than the best individual model on the ICEWS14 dataset, even though DE-DistMult and DE-TransE underperform compared to DE-SimplE. The results are slightly worse on the ICEWS05-15 dataset when using an unweighted ensemble compared to the best individual model. It should also be noted that all Diachronic Entity Embedding models perform better on ICEWS14 in both Hits@1 and MRR compared to ICEWS05-15.
Weighted Ensemble As the DE-TransE model is significantly worse at Hits@1 and MRR compared to DE-SimplE and DE-DistMult, we hypothesize that DE-TransE introduces some noise to the ensemble. To test this hypothesis, we run a weighted ensemble to see if eliminating some potential noise yields better results. We search for the optimal weights of the ensemble using Bayesian optimization with Gaussian Processes [29]. To get a better approximation, we first run some evaluations where one of the weights is changed at a time, to see what impact the individual models have.
This allows for narrower bounds during the Bayesian optimization and should yield a better result. The Bayesian optimization ran for 25 iterations, and the weights found in the case of ICEWS14 were 0.48 for DE-SimplE, 0.4 for DE-DistMult, and 0.12 for DE-TransE. In the case of ICEWS05-15, the weights were 0.66 for DE-SimplE, 0.33 for DE-DistMult, and 0.01 for DE-TransE. The results for the weighted ensemble are shown as Ensemble (weighted) in Table 2. The results disclose that, on the ICEWS14 dataset, using a weighted ensemble further improves the accuracy, with a significant improvement over DE-SimplE, and a slight improvement over DE-SimplE on the ICEWS05-15 dataset. The results also show that even though DE-TransE mostly introduces some noise to the ensemble, using it with a lower weight is still better than not using the model at all. We hypothesize that this is because the models are able to capture different aspects of the TKG, so even if DE-TransE performs poorly on its own, it might capture aspects that the other models do not, and thus a combination of them provides even better results.

6. Discussion
We will now discuss the results both from an architectural viewpoint and an experimental one, as well as provide opportunities for future work.
Framework We chose to implement Verðandi using a microservice-based architecture, as we wanted a loosely coupled and modular framework, where components can be substituted for other similar components. This architecture allowed us to easily extend the system with an ensemble module encompassing multiple TKGE modules, placing it between the Natural Language Module and the existing TKGE module(s). Furthermore, microservices are inherently scalable when data is independent, as is the case with the individual questions passed through the system.
This means that a load balancer can be placed between the caller and the called module, which can then send individual QA requests to the microservice with the least load.
Optimization of Ensembles The results in Table 2 show that the models combined using BMA give better results than the individual models. On ICEWS14, even an unweighted ensemble obtained better results than DE-SimplE, which has the best accuracy of the three models. The weights found using Bayesian optimization further increased the accuracy of the ensemble. As the weights were approximated using Bayesian optimization, they are most likely not the optimal weights. Running an exhaustive grid search could further improve the result at the cost of a more expensive parameter search. It is possible that different TKGE models perform better than others in answering specific questions. We plan, therefore, to investigate the relation between the TKGE models and the question types, with the goal of dynamically varying the weights in the ensemble based on the input question. Furthermore, it is also very likely that using several different TKGE models, such as TimePlex and ChronoR, together with the Diachronic Entity Embedding models, would provide even better results than only using the Diachronic models, as they would likely capture temporal information from different angles.
Hybrid QA Process When using a KGE model for a QA-KG system, it is possible to answer questions by performing link prediction on known and unknown relations. One could imagine a QA-KG system that combines a SPARQL query engine with a link prediction task to achieve better results, following the intuition that known facts should be weighted more. We leave building such a system for future investigations.

7. Conclusion
We proposed Verðandi, a microservice-based framework for QA-KG.
We built the framework to be modular, as we were able to use it with many different TKGE models, and extensible, as we built an ensemble module on top of it. In future work, we plan to study the scalability of Verðandi, to test to which extent the microservice architecture can manage high question workloads. We also proposed an ensemble method for combining multiple TKGE models for better results. We chose BMA, as it uses pre-trained models and allows us to combine different TKGE models. Our experiments suggest that using ensemble methods can provide better results than considering individual models. We consider this result a first promising step towards investigating whether different models might capture different aspects of the KG. We will continue to investigate this direction, also by exploiting alignment techniques to align the entities of the TKGs [30]. Finally, we plan to study if we can improve the performance of Verðandi by assigning different weights to different groups of questions.

References
[1] J. Leblay, M. W. Chekol, Deriving Validity Time in Knowledge Graph, in: WWW (Companion Volume), 2018, pp. 1771–1776.
[2] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. v. Kleef, S. Auer, C. Bizer, DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia, Semantic Web 6 (2015) 167–195.
[3] F. M. Suchanek, G. Kasneci, G. Weikum, YAGO: A Core of Semantic Knowledge, in: WWW, 2007, pp. 697–706.
[4] Q. Wang, Z. Mao, B. Wang, L. Guo, Knowledge Graph Embedding: A Survey of Approaches and Applications, IEEE Trans. Knowl. Data Eng. 29 (2017) 2724–2743.
[5] C. Unger, L. Bühmann, J. Lehmann, A.-C. N. Ngomo, D. Gerber, P. Cimiano, Template-based Question Answering over RDF Data, in: WWW, 2012, pp. 639–648.
[6] Z. Jia, A. Abujabal, R. S. Roy, J. Strötgen, G.
Weikum, TEQUILA: Temporal Question Answering over Knowledge Bases, in: CIKM, 2018, pp. 1807–1810.
[7] X. Huang, J. Zhang, D. Li, P. Li, Knowledge Graph Embedding Based Question Answering, in: WSDM, 2019, pp. 105–113.
[8] M. Petrochuk, L. Zettlemoyer, SimpleQuestions Nearly Solved: A New Upperbound and Baseline Approach, in: EMNLP, 2018, pp. 554–558.
[9] S. Mohammed, P. Shi, J. Lin, Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks, in: NAACL-HLT (2), 2018, pp. 291–296.
[10] D. Lukovnikov, A. Fischer, J. Lehmann, Pretrained Transformers for Simple Question Answering over Knowledge Graphs, in: ISWC, volume 11778, 2019, pp. 470–486.
[11] Z. Jia, S. Pramanik, R. S. Roy, G. Weikum, Complex Temporal Question Answering on Knowledge Graphs, in: CIKM, 2021, pp. 792–802.
[12] T. Jiang, T. Liu, T. Ge, L. Sha, S. Li, B. Chang, Z. Sui, Encoding Temporal Information for Time-Aware Link Prediction, in: EMNLP, 2016, pp. 2350–2354.
[13] A. García-Durán, S. Dumancic, M. Niepert, Learning Sequence Encoders for Temporal Knowledge Graph Completion, in: EMNLP, 2018, pp. 4816–4821.
[14] S. S. Dasgupta, S. N. Ray, P. P. Talukdar, HyTE: Hyperplane-based Temporally aware Knowledge Graph Embedding, in: EMNLP, 2018, pp. 2001–2011.
[15] R. Goel, S. M. Kazemi, M. Brubaker, P. Poupart, Diachronic Embedding for Temporal Knowledge Graph Completion, in: AAAI, volume 34, 2020, pp. 3988–3995.
[16] P. Jain, S. Rathi, Mausam, S. Chakrabarti, Temporal Knowledge Base Completion: New Algorithms and Evaluation Protocols, in: EMNLP, 2020, pp. 3733–3747.
[17] T. Lacroix, G. Obozinski, N. Usunier, Tensor Decompositions for Temporal Knowledge Base Completion, in: ICLR, 2020.
[18] A. Sadeghian, M. Armandpour, A. Colas, D. Z. Wang, ChronoR: Rotation Based Temporal Knowledge Graph Embedding, in: AAAI, volume 35, 2021, pp. 6471–6479.
[19] Google, Freebase Data Dumps, 2018. URL: https://developers.google.com/freebase, (Last Accessed: July 2022).
[20] E.
Boschee, J. Lautenschlager, S. O'Brien, S. Shellman, J. Starz, M. Ward, ICEWS Coded Event Data, 2015. doi:10.7910/DVN/28075, (Last Accessed: July 2022).
[21] Z. Sun, Z.-H. Deng, J.-Y. Nie, J. Tang, RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space, in: ICLR (Poster), 2019.
[22] S. M. Kazemi, D. Poole, SimplE Embedding for Link Prediction in Knowledge Graphs, in: NeurIPS, 2018, pp. 4289–4300.
[23] B. Yang, W.-t. Yih, X. He, J. Gao, L. Deng, Embedding Entities and Relations for Learning and Inference in Knowledge Bases, in: ICLR (Poster), 2015.
[24] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating Embeddings for Modeling Multi-relational Data, in: NeurIPS, 2013, pp. 2787–2795.
[25] A. Kutuzov, L. Øvrelid, T. Szymanski, E. Velldal, Diachronic word embeddings and semantic shifts: a survey, in: COLING, Association for Computational Linguistics, 2018, pp. 1384–1397.
[26] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex Embeddings for Simple Link Prediction, in: ICML, 2016, pp. 2071–2080.
[27] D. W. Opitz, R. Maclin, Popular Ensemble Methods: An Empirical Study, J. Artif. Intell. Res. 11 (1999) 169–198.
[28] J. A. Hoeting, D. Madigan, A. E. Raftery, C. T. Volinsky, Bayesian Model Averaging: A Tutorial, Statistical Science 14 (1999) 382–417.
[29] J. Mockus, Bayesian Approach to Global Optimization, Springer, 1989.
[30] M. Baumgartner, D. Dell'Aglio, H. Paulheim, A. Bernstein, Towards the Web of Embeddings: Integrating multiple knowledge graph embedding spaces with FedCoder, J. Web Semant. 75 (2023) 100741.