<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Relation Linking to Knowledge Bases via CLOCQ</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Philipp Christmann</string-name>
          <email>pchristm@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rishiraj Saha Roy</string-name>
          <email>rsaharo@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Weikum</string-name>
          <email>weikum@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Question Answering, Knowledge Bases, Entity Linking, Relation Linking</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Max Planck Institute for Informatics and Saarland University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
<p>Curated knowledge bases (KBs) contain billions of facts with millions of entities and thousands of predicates. Question answering (QA) systems are supposed to access this knowledge to answer users' factoid questions. Entity linking and relation linking are integral ingredients of many such QA systems, and aim to link mentions in the question to concepts in the KB. The quality of these linking modules is of high importance: a single error in linking can result in a failure for the whole QA system. The SMART 2022 Task poses challenges for entity and relation linking to evaluate the performance of different approaches. In this work, we adapt and extend our prior work CLOCQ. CLOCQ computes top-k linkings for each mention to make up for potential errors, with k set automatically based on an ambiguity score. As an extension, we design a module that prunes linkings for irrelevant mentions, which helps to improve precision. We found that there is a trade-off between recall and precision: higher k boosts recall (up to 0.87 for entity linking), while lower k leads to high precision. The best choice for the linking modules may highly depend on the specific QA system, and whether it can make use of higher recall in the presence of noise.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Question answering (QA) systems provide natural interfaces for accessing human knowledge.
Such human knowledge can be stored in large-scale knowledge bases (KBs) like Wikidata [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
DBpedia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], YAGO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Freebase [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and industrial counterparts (at Amazon, Apple, Google,
Microsoft, etc.). KBs contain facts consisting of entities, relations, types, and literals. The standard
way of storing KB facts is as triples consisting of a subject, a predicate, and an object.
Motivation and problem. QA systems operating over KBs mostly follow one of the following
two themes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: (i) approaches with an explicit query create a logical form, for example, a
SPARQL query, and fill the query slots with entities and relations linked with
mentions in the
question [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], or (ii) approaches without an explicit query first link entities and relations
to retrieve a search space consisting of KB facts, which is then searched for identifying the
answer [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. For either of the two approaches, being able to identify mentions of entities and
relations, and linking these mentions to KB items, is a key obstacle in the QA pipeline (linking
may also be referred to as disambiguating). Even single errors in these entity linking or relation
linking modules can lead to a complete failure of the QA pipeline, which is why their quality
is integral to the performance of the entire QA system. Note that without loss of generality,
mentions of types or general concepts may be linked as well, and can be used in the remainder
of the QA pipeline.
      </p>
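<p>The triple representation described above can be sketched as follows; the facts are made-up toy data for the running example, not actual Wikidata content:</p>

```python
# KB facts stored as (subject, predicate, object) triples (toy data).
KB = [
    ("House of the Dragon", "cast member", "Paddy Considine"),
    ("House of the Dragon", "original broadcaster", "HBO"),
    ("Paddy Considine", "character role", "Viserys I Targaryen"),
]

def facts_about(item):
    """Return all triples that mention `item` as subject or object."""
    return [t for t in KB if item in (t[0], t[2])]

print(facts_about("Paddy Considine"))
```

<p>Retrieving all facts that mention a linked item in this way is the basic building block of the search-space retrieval sketched in the second theme above.</p>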
      <p>Consider the following running example on the TV series House of the Dragon following the
narratives of George R.R. Martin:</p>
      <p>“Who plays Viserys in GRRM’s latest HBO series?”</p>
      <p>Linking the mentions in the question to the KB (Wikidata for this example) is a non-trivial
task that requires an understanding of the question as a whole. The entity mention “HBO”
may refer to the HBO company, the HBO network, or to the Hollywood Bowl Orchestra. Understanding
that the question is on a TV series helps to identify HBO network as the correct entity. Similarly,
“plays” semantically or lexically matches with many relations in the KB, like plays for team,
instrument, number of plays, character role, or time played. The intended relation character role
only becomes clear from the question context.</p>
      <p>“Viserys” is even harder to link, since there are different characters named Viserys in the
Game of Thrones universe: Viserys III Targaryen, the more well-known character from Game
of Thrones, and Viserys I Targaryen from the more recent House of the Dragon series. Thus,
the mention “Viserys” is quite ambiguous, even if the general context of the question is clear. A
deep understanding of the question is required to correctly link “Viserys” to Viserys I Targaryen.
Note that in case any of these disambiguations is incorrect, there is little hope of returning the
correct answer Paddy Considine to the user.</p>
      <p>
        To further investigate QA modules and pinpoint failure cases, the SMART Task 3.0<sup>1</sup>
(co-located with ISWC 2022) poses tasks for entity linking (Task 3) and relation linking (Task 2).
There is also a task on answer type prediction (Task 1), which is not targeted in this work.
Approach and contribution. In this work we adapt our recently proposed CLOCQ
framework [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to these tasks. CLOCQ is an unsupervised framework that provides many
functionalities related to QA, and is made available as open-source code and as a public API<sup>2</sup>. These
functionalities include basic KB methods like retrieving the aliases, frequency, or 1-hop
neighborhood of a KB item, or computing the KB connectivity or the shortest path between two KB
items. The core algorithm presented in the paper aims to retrieve a search space for a user
question, facilitating QA methods without an explicit query. As an intermediate step and result,
mentions in the question are linked to KB items when retrieving the search space. For linking
to KB items, CLOCQ implements two key ideas. First, all mentions should be linked jointly,
considering the coherence of the disambiguated KB items. This follows the intuition that the
question needs to be considered as a whole. CLOCQ links not only entities and relations, but
also types and general concepts, providing disambiguations for each mention in the question.
Second, the linking modules should make up for potential errors. When disambiguating highly
ambiguous mentions, like “Viserys” in the running example, the linking modules should take
this ambiguity into account and provide the QA system with several possible linkings. CLOCQ
provides a mechanism to detect the ambiguity of a mention, based on an entropy measure over
the frequency distribution of the candidate KB items (detailed in Sec. 2.4).
      </p>
      <p>1: https://smart-task.github.io/2022/ 2: https://clocq.mpi-inf.mpg.de</p>
      <p>[Fig. 1: Overview of the CLOCQ linking process for the running example, showing the ranked
candidate lists for the mentions “plays” (play, plays for team, instrument, …, character role),
“Viserys” (Viserys III Targaryen, Viserys I Targaryen, Viserys II Targaryen, …), and “HBO”
(HBO company, HBO network, Hollywood Bowl Orchestra, HBO Max, …), together with their
coherence (coh) and relatedness (rel) scores and the resulting top-k linkings.]</p>
      <p>One obstacle with adapting CLOCQ to entity or relation linking tasks is that it, by design,
disambiguates all mentions in the question. It does not differentiate between entities, relations,
types or other concepts. This helps when retrieving a search space, but can hurt the precision of
linking results. For example, CLOCQ might link “latest” and “series” to the KB, even if these
mentions are irrelevant. We therefore propose a simple pruning module that identifies which
mentions should be linked, and prunes linkings for other mentions. The module is implemented
with a fine-tuned sequence generation model that is trained using distant supervision.</p>
      <p>By evaluating CLOCQ on the entity and relation linking tasks of the SMART 3.0 challenge, we
essentially investigate its applicability to QA approaches generating an explicit query. We show
that top-k disambiguations can help boost recall, at the cost of decreasing precision
and F1 score. Further, we find that the mention-pruning module helps to improve the precision
and F1 score substantially on the entity linking task.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The CLOCQ linking process</title>
      <p>We first introduce the complete workflow of the CLOCQ algorithm. For further discussion
and details (e.g. on the fact-centric KB index underlying the CLOCQ framework), please refer
to the original paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Fig. 1 shows an overview of the linking process.</p>
      <sec id="sec-2-2">
        <sec id="sec-2-2-1">
          <title>2.1. Retrieving disambiguation candidates</title>
          <p>Consider our running example “Who plays Viserys in GRRM’s latest HBO series?”. Our goal is to
link mentions in the question (“plays”, “Viserys”, “GRRM”, “HBO”, “series”) to items in the KB.
Mentions in the question can be single question words or phrases. Named entity phrases can
for example be detected using named entity recognition (NER).</p>
          <p>We first collect candidates from the KB using a standard lexical matching score (like
TF-IDF or BM25) for each mention m<sub>1</sub>, …, m<sub>n</sub>; n would be 5 in our example, and stopwords are
dropped. Here m<sub>i</sub> is analogous to a search query, while each item c in the KB resembles
a document in a corpus. This “document” is created by concatenating the item label with
textual aliases and descriptions available in most KBs [
            <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
            ]. This results in n ranked lists
{E<sub>1</sub> = ⟨c<sub>11</sub>, c<sub>12</sub>, …⟩; E<sub>2</sub> = ⟨c<sub>21</sub>, c<sub>22</sub>, …⟩; …; E<sub>n</sub> = ⟨c<sub>n1</sub>, c<sub>n2</sub>, …⟩} of KB items c<sub>ij</sub>, one list E<sub>i</sub> for each m<sub>i</sub>,
scored by degree of match between the mentions and KB items.</p>
          <p>A ranked lexical match list for “plays” could look like:
E<sub>1</sub> = ⟨1: play, 2: plays for team, 3: instrument, 4: number of plays, 5: time played,
6: playwright, 7: guitarist, 8: Plays collection, …, 15: <bold>character role</bold>, …⟩
with the ideal disambiguation being shown in bold. The list for “HBO” could be:
E<sub>4</sub> = ⟨1: HBO company, 2: <bold>HBO network</bold>, 3: Hollywood Bowl Orchestra, …⟩
Note that the correct KB item for m<sub>i</sub> can sometimes be very deep in an individual list E<sub>i</sub>. For example,
character role is at rank 15 in E<sub>1</sub>.</p>
          <p>
            Next, each list E<sub>i</sub> is traversed up to a depth l to fetch the top-l items per mention. The goal
is to find combinations ⟨c<sub>1</sub>, …, c<sub>n</sub>⟩ of KB items, one from each list, that best match the question. For instance, an ideal
combination for us would be:
{character role, Viserys I Targaryen, George R.R. Martin, HBO network, TV series}
These combinations come from the Cartesian product of items in the n lists, and would have l<sup>n</sup>
possibilities if each combination is explicitly enumerated and scored. This is cost-prohibitive:
since we are only interested in some top-k combinations, as opposed to a full or even extended
partial ordering, a more efficient way of doing this would be to apply top-k algorithms [
            <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
            ].
These prevent complete scans and return the top-k best combinations efficiently.
          </p>
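<p>The retrieval step can be sketched as follows; the token-overlap scoring is a simple stand-in for TF-IDF/BM25, and the KB items with their aliases are toy examples. Note the blowup the paragraph above describes: for n=5 mentions and depth l=20, explicit enumeration would score 20<sup>5</sup> = 3.2 million combinations.</p>

```python
# Sketch of candidate retrieval (Sec. 2.1): each KB item's label and aliases
# form a pseudo-document; mentions act as search queries. Token overlap is a
# toy stand-in for TF-IDF/BM25; the item inventory is illustrative.
KB_ITEMS = {
    "character role": ["character role", "plays", "played by"],
    "plays for team": ["plays for team", "member of team"],
    "HBO network": ["HBO", "Home Box Office network"],
    "Hollywood Bowl Orchestra": ["HBO", "Hollywood Bowl Orchestra"],
}

def candidate_list(mention, l=20):
    """Rank KB items for a mention by lexical overlap; keep the top-l."""
    query_tokens = set(mention.lower().split())
    scores = {}
    for item, aliases in KB_ITEMS.items():
        doc_tokens = set(" ".join(aliases).lower().split())
        overlap = len(query_tokens & doc_tokens)
        if overlap:
            scores[item] = overlap
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:l]

print(candidate_list("HBO"))  # both HBO-like items match lexically
```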
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2. Scoring candidates</title>
          <p>Thus, we propose an approach using top-k algorithms to overcome this challenge. To go beyond
shallow lexical matching, our proposal is to construct multiple lists per question token, each
reflecting a different relevance signal. Specifically, we obtain one list E<sub>i</sub><sup>s</sup> for each mention m<sub>i</sub> and
score s. Then, we apply top-k algorithms on these lists to obtain the disambiguation of each
question token individually.</p>
          <p>Note that considering the question as a whole is a key criterion for our scoring mechanism.
Therefore, we integrate two global relevance signals. Specifically, a candidate KB item
combination that fits well with the intent in the question is expected to have high semantic coherence
and high graph connectivity within its constituents. These can be viewed as proximity in latent
and symbolic spaces. Further, candidates should match well on question and mention levels.
These motivate our four relevance signals for each item c<sub>ij</sub> in list E<sub>i</sub>, described below.</p>
          <p>
            Coherence. We consider global signals for semantic coherence and graph connectivity, which
are inherently defined for KB item pairs, on a global level, instead of single items. Therefore,
we need a technique to convert these signals into item-level scores. The idea is to use a max
operator over all candidate KB item pairs involving the candidate at hand. More precisely, the
coherence score of an item is defined as the maximum item-item similarity (averaged over
pairs of lists) this item can contribute to a combination. The pairwise similarity is obtained
by the cosine value between the embedding vectors of the two KB items, min-max normalized
from [−1, +1] to [0, 1]:
coh(c<sub>ij</sub>) = 1/(n−1) ⋅ ∑<sub>i′≠i</sub> max<sub>j′</sub> cos(c⃗<sub>ij</sub>, c⃗<sub>i′j′</sub>) (1)
          </p>
          <p>
            Connectivity. This is the second global signal considered, and captures a different form of
proximity. Every KB can be viewed as an equivalent knowledge graph (KG), where entities,
predicates and other KB items are nodes, and edges run between components of the same
fact [
            <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
            ]. We define KB items that are part of the same fact to be in the 1-hop neighborhood
of each other, those that are connected via members of another fact as in the 2-hop, and so on [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
We assign items in one hop of each other to have a distance of 1, those in two hops to have a
distance of 2, and ∞ otherwise. Almost all KB items are within three or four hops of each other,
and thus distances beyond two hops cease to be a discriminating factor. We define pairwise connectivity
scores as the inverse of this KB distance, so we obtain 1, 0.5, and 0, respectively, for 1-, 2-, and
&gt;2-hop neighbors. The global connectivity score is then converted to an item-level score analogously
to the coherence, using max aggregation over pairs. Formally, we define the connectivity of c<sub>ij</sub> as:
conn(c<sub>ij</sub>) = 1/(n−1) ⋅ ∑<sub>i′≠i</sub> max<sub>j′</sub> conn(c<sub>ij</sub>, c<sub>i′j′</sub>) (2)
Note that conn(c<sub>ij</sub>) ∈ [0, 1] for all c<sub>ij</sub>.
          </p>
          <p>
            Term match. This score is intended to take into account the degree of lexical term match (via
TF-IDF, BM25, or similar) for which c<sub>ij</sub> was admitted into E<sub>i</sub>. However, such TF-IDF-like weights
are often unbounded and may have a disproportionate influence when aggregated with the
other signals, which are all in the closed interval [0, 1]. Thus, we simply take the reciprocal rank
of c<sub>ij</sub> in E<sub>i</sub> as a representative match score to have it in the same [0, 1] interval:
match(c<sub>ij</sub>) = 1/rank(c<sub>ij</sub>, E<sub>i</sub>) (3)
          </p>
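<p>The item-level signals above can be sketched as follows; the embeddings, hop distances, and item names are toy stand-ins, not actual CLOCQ values:</p>

```python
import math

# Sketch of the item-level scores (Sec. 2.2): coherence via max-aggregated
# normalized cosine similarity to candidates in the other lists, pairwise
# connectivity from hop distances, and term match as reciprocal rank.
EMB = {
    "character role": [0.9, 0.1],
    "play": [0.2, 0.8],
    "Viserys I Targaryen": [0.8, 0.2],
}
HOPS = {frozenset({"character role", "Viserys I Targaryen"}): 1}  # toy KG

def cos01(a, b):
    """Cosine similarity, min-max normalized from [-1, +1] to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return (dot / norm + 1) / 2

def conn_pair(a, b):
    """Pairwise connectivity: 1 for 1-hop, 0.5 for 2-hop, 0 otherwise."""
    distance = HOPS.get(frozenset({a, b}), math.inf)
    return {1: 1.0, 2: 0.5}.get(distance, 0.0)

def coherence(item, other_lists):
    """Eq.-(1)-style score: average over the other lists of the best
    pairwise similarity the item can contribute."""
    return sum(max(cos01(EMB[item], EMB[c]) for c in lst)
               for lst in other_lists) / len(other_lists)

def match_score(rank):
    """Eq.-(3)-style reciprocal rank, always within [0, 1]."""
    return 1.0 / rank

# "character role" coheres better with the entity candidate than "play" does.
others = [["Viserys I Targaryen"]]
assert coherence("character role", others) > coherence("play", others)
```

<p>These per-item scores are what the linear aggregation in Sec. 2.3 combines with the tuned weights.</p>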
          <p>
            Question relatedness. We estimate semantic relatedness of the KB item c<sub>ij</sub> to the whole
input question q by averaging pairwise cosine similarities between the embeddings of the item
and each question term. The same min-max normalization as for coherence is applied. To avoid
confounding this estimate with the question term for which c<sub>ij</sub> was retrieved, we exclude this
term from the average. We define semantic relatedness as:
rel(c<sub>ij</sub>) = avg<sub>i′≠i</sub> cos(c⃗<sub>ij</sub>, q⃗<sub>i′</sub>) (4)
          </p>
        </sec>
        <sec id="sec-2-2-3">
          <title>2.3. Finding top-k across sorted lists</title>
          <p>We then sort each of these 4 ⋅ n lists in descending score order. Note that for each m<sub>i</sub> and each
score s, all lists E<sub>i</sub><sup>s</sup> hold the same items (those in the original E<sub>i</sub>). Top-k algorithms operating
over such multiple score-ordered lists, where each list holds the same set of items, require
a monotonic aggregation function over the item scores in each list [
            <xref ref-type="bibr" rid="ref10 ref11 ref14 ref15">10, 11, 14, 15</xref>
            ]. Here,
we use a linear combination of the four relevance scores as this aggregate: score(c<sub>ij</sub>) =
h<sub>coh</sub> ⋅ coh(c<sub>ij</sub>) + h<sub>conn</sub> ⋅ conn(c<sub>ij</sub>) + h<sub>match</sub> ⋅ match(c<sub>ij</sub>) + h<sub>rel</sub> ⋅ rel(c<sub>ij</sub>), where the hyperparameters are
tuned on a dev set, and h<sub>coh</sub> + h<sub>conn</sub> + h<sub>match</sub> + h<sub>rel</sub> = 1. Since each score lies in [0, 1], we also
have score(⋅) ∈ [0, 1]. We use the threshold algorithm (TA) with early pruning [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] on these
score-ordered lists. TA is run over each set of 4 sorted lists ⟨E<sub>i</sub><sup>1</sup>, E<sub>i</sub><sup>2</sup>, E<sub>i</sub><sup>3</sup>, E<sub>i</sub><sup>4</sup>⟩, corresponding to one
mention m<sub>i</sub>, to obtain the top-k best KB items per m<sub>i</sub>. These KB items are then the top-k
linkings for a specific mention as predicted by our system.</p>
        </sec>
        <sec id="sec-2-2-4">
          <title>2.4. Automatically setting k</title>
          <p>Choosing an appropriate k is non-trivial, and often mention-specific. Intuitively,
one would like to increase k for ambiguous mentions in the question. For example, “plays”
can refer to many KB items. By increasing k one can account for potential disambiguation
errors. On the other hand, “GRRM” is not as ambiguous, which is why setting k=1 should suffice.
The ambiguity of a mention is closely connected to the notions of uncertainty or randomness: the
more uncertainty there is in predicting what a mention refers to, the more ambiguous it is.
This makes entropy a suitable measure of ambiguity. More specifically, for each mention, l KB
items are retrieved initially. These items form the sample space of size l for the probability
distribution. The numbers of KB facts with these items form a frequency distribution that
can be normalized to obtain the required probability distribution. We compute the entropy of
this probability distribution as the ambiguity score of a mention, and denote it as AS(m<sub>i</sub>). By
definition, 0 ≤ AS(m<sub>i</sub>) ≤ log<sub>2</sub> l. Practical choices of k and l do not exceed 5 and 50 respectively,
and hence k and log<sub>2</sub> l are in the same ballpark (log<sub>2</sub> 50 ≈ 5.6). This motivates us to make the
simple choice of directly setting k as AS(m<sub>i</sub>). Specifically, we use k = ⌊AS(m<sub>i</sub>)⌋ + 1 to avoid the
situation of k=0. Fig. 1 shows a possible “auto-k” (automatic choice of k) setting for our running
example, and the corresponding top-k linkings.</p>
          <p>“plays” is highly ambiguous, and thus k is set to a relatively high value. “Viserys” and “HBO”
can also refer to different concepts. The word “GRRM” is relatively unambiguous.</p>
        </sec>
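<p>A minimal sketch of this auto-k mechanism, assuming made-up fact frequencies for the retrieved candidates:</p>

```python
import math

# Sketch of auto-k (Sec. 2.4): fact frequencies of the initially retrieved
# candidates are normalized into a probability distribution; its entropy
# serves as the ambiguity score, and k = floor(entropy) + 1.
# The frequency values below are invented for illustration.
def auto_k(fact_frequencies):
    total = sum(fact_frequencies)
    probs = [f / total for f in fact_frequencies if f > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return math.floor(entropy) + 1

# An unambiguous mention: one candidate dominates -> low entropy, k = 1.
print(auto_k([980, 10, 10]))
# A highly ambiguous mention: near-uniform frequencies -> higher k.
print(auto_k([40, 35, 30, 30, 25]))
```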
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Adapting CLOCQ to linking tasks</title>
      <p>The native CLOCQ method was primarily designed for retrieving a search space of relevant KB
facts for a given user question. Therefore, the linking of entities and relations is more of a means
to an end here, and is not optimized for the specific tasks. We identified two key obstacles when
using the plain CLOCQ method for entity and relation linking.</p>
      <p>The first obstacle is that CLOCQ links all mentions in the question. Not only entities and
relations are linked, but also types (like “series”) and other mentions (e.g. “latest” in the running
example). While linking such mentions is beneficial for coherence among linkings, and can
improve initiating a search space, it often adds undesired noise to the outputs when evaluating
entity or relation linking capabilities.</p>
      <p>For example, CLOCQ would link “series” to the entity TV series and “latest” to the relation
latest start date for the running example, which would both decrease precision.</p>
      <p>The second obstacle is that CLOCQ does not differentiate between entity and relation
mentions. Any mention is disambiguated to the KB items that score best w.r.t. the specified scoring
mechanism. For example, the relation mention “plays” could also be linked to the entities play or
playwright; similarly, “director” could be linked to the relation director or to the type director.
Again, this does not hurt when initiating the search space, but definitely restricts the relation
linking capabilities of CLOCQ.</p>
      <p>In the following, we will discuss how we optimized CLOCQ for the entity and relation linking
tasks of the SMART 2022 challenge. The same intuitions apply to other entity or relation linking
problems as well.</p>
      <sec id="sec-3-1">
        <title>3.1. Post-hoc pruning module for entity linking</title>
        <p>As discussed, linking all mentions jointly is beneficial for linking results, since it considers
information on the whole question in the linking stage. This follows our intuition of
understanding the question in its entirety. Also, we did not want to touch the main CLOCQ algorithm
itself. Instead, our idea is to prune the linkings returned by CLOCQ. We propose a simple
approach: the decision, whether an entity should be included in the linking results or not,
should be made depending on the mention the entity was disambiguated for. If the mention
should be disambiguated, we add the linking, if not it is dropped from the results. For example,
the mentions “plays”, “latest” and “series” should not be disambiguated when solving an entity
linking task.</p>
        <p>
          Training. We aim to learn which mentions should be linked (and which not) using distant
supervision on the training data provided as part of the SMART task. Given a training instance,
we first obtain all ⟨mention, KB entity⟩ pairs (i.e. the linkings) using the native CLOCQ
method. From the training instance, we know the gold entities that should be linked. We
then consider all mentions, that are linked with a gold entity by CLOCQ, as mentions that
should be disambiguated. For the running example we would obtain the mentions “Viserys”,
“GRRM”, and “HBO”, assuming the gold entity set ⟨Viserys I Targaryen, George R.R. Martin, HBO
network⟩. With this information, we can create a training instance for learning the relevant
entity mentions. The input is the question, and the output is the concatenation of the mentions
linked to gold entities, “Viserys|GRRM|HBO”, separated by the special token “|”. We then simply
fine-tune a pre-trained sequence generation model on this data. For this purpose we used
BART [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which was found to be effective when text is copied and manipulated from the input
to autoregressively generate the output.
        </p>
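<p>The construction of a distant-supervision target sequence can be sketched as follows; the CLOCQ linkings and the gold set are hypothetical values for the running example:</p>

```python
# Sketch of the distant-supervision setup (Sec. 3.1): mentions whose CLOCQ
# linking hits a gold entity become the target sequence, joined by "|".
def build_target(clocq_linkings, gold_entities):
    """clocq_linkings: list of (mention, kb_entity) pairs."""
    mentions = [m for m, entity in clocq_linkings if entity in gold_entities]
    return "|".join(mentions)

linkings = [
    ("Viserys", "Viserys I Targaryen"),
    ("GRRM", "George R.R. Martin"),
    ("latest", "latest start date"),
    ("HBO", "HBO network"),
    ("series", "TV series"),
]
gold = {"Viserys I Targaryen", "George R.R. Martin", "HBO network"}
print(build_target(linkings, gold))  # "Viserys|GRRM|HBO"
```

<p>The question serves as the model input and the generated string as the output sequence during fine-tuning.</p>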
        <p>Inference. At inference time, the pruning module is applied in a post-hoc manner. We first run
CLOCQ and our trained pruning module on the input question. We then keep a ⟨mention, KB
entity⟩ pair only if the disambiguated mention matches with any mention generated by our
pruning module. Here, matching is relaxed to substring matching. For example, if the pruning
module generates “in GRRM” or “GRR”, linking pairs for “GRRM” are still kept. In addition, for
the entity linking task, we remove all relations from the linkings (relation identifiers start with
a “P” in Wikidata).</p>
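<p>A minimal sketch of this pruning step; the linking format and the KB identifiers are illustrative, not CLOCQ's actual API:</p>

```python
# Sketch of post-hoc pruning at inference time (Sec. 3.1): a linking is kept
# if its mention has a (relaxed) substring match with any generated mention,
# and Wikidata relations (IDs starting with "P") are dropped for entity
# linking. IDs below are illustrative.
def prune(linkings, generated_mentions):
    """linkings: list of (mention, kb_id, label) triples (hypothetical format)."""
    kept = []
    for mention, kb_id, label in linkings:
        if kb_id.startswith("P"):  # drop relations for entity linking
            continue
        if any(mention in g or g in mention for g in generated_mentions):
            kept.append((mention, kb_id, label))
    return kept

linkings = [
    ("Viserys", "Q55294120", "Viserys I Targaryen"),
    ("plays", "P453", "character role"),
    ("GRRM", "Q181677", "George R.R. Martin"),
]
# Relaxed matching: the generated "in GRRM" still covers the mention "GRRM".
print(prune(linkings, ["Viserys", "in GRRM"]))
```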
        <p>Note that the post-hoc pruning module is also capable of learning benchmark-specific
properties. The SMART 2022 entity linking task can often (but not always) require linking types or
concepts. For example, “airline” in “DC-3 is operated by which airline?” should be linked, but
not “continents” in “How many continents are in Antarctica?”. Such benchmark characteristics
are learned implicitly by our pruning module, which can help improve the performance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Increasing k for relation linking</title>
        <p>As mentioned earlier, relation mentions may also be linked to entities that are coherent with
the other linkings. We found that this can often be the case for CLOCQ linkings, and that the
appropriate relation can be deeper in the ranked linkings of a mention than the automatically set
cut-off length k. However, relations can easily be differentiated from entities via the identifier
(relation identifiers start with a “P”, entity identifiers with a “Q” in Wikidata). We therefore
simply set l=50 and k=40 to increase the probability of obtaining relations, and prune all entities
from the linkings. Finally, we explore the effect of keeping either the top-ranked relation per
mention, or all relations per mention as the final result.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental setup</title>
        <p>SMART 2022 tasks. Statistics on the entity linking and relation linking tasks of the SMART
Task 2022 can be found in Tables 1 and 2. For both tasks, the question and the corresponding
gold entities or relations are given for the train set. For the test set, only the question is given.
The datasets are made publicly available<sup>3</sup>.</p>
        <p>Metrics. We use the standard metrics of the SMART 2022 Task for both tasks: i) precision,
that measures what fraction of the predicted linkings are correct, ii) recall, that measures what
fraction of the gold linkings are found, and iii) F1 score, the harmonic mean of precision and
recall. The results on the test set were provided by the task organizers, after we submitted our
system results (since the gold standard is not publicly available).</p>
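<p>For concreteness, a small example of these metrics over toy sets of predicted and gold linkings:</p>

```python
# Precision, recall, and F1 over predicted vs. gold linkings (toy sets).
def prf1(predicted, gold):
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {"Viserys I Targaryen", "George R.R. Martin", "TV series"}
gold = {"Viserys I Targaryen", "George R.R. Martin", "HBO network"}
print(prf1(pred, gold))  # 2/3 precision, 2/3 recall -> F1 = 2/3
```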
        <p>
          Initialization of CLOCQ. In our experiments, we use Wikidata [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] as the knowledge base. We
access CLOCQ via the public API<sup>4</sup>. The API currently uses a cleaned Wikidata dump<sup>5</sup> from 31
January 2022, which has 94 million entities and 3,000 predicates.
        </p>
        <p>
All parameters are kept at their default values (l=20, h<sub>coh</sub>=0.1, h<sub>conn</sub>=0.3, h<sub>match</sub>=0.2, h<sub>rel</sub>=0.4) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
unless stated otherwise. For the entity linking task, we randomly sample 10,000 training
instances and use them as our development set (dev set) for choosing the best pruning module.
Since CLOCQ is an unsupervised method, the train set is only used for training the pruning
module on the entity linking task, and for tuning the parameters l (=50) and k (=40) on the
relation linking task.
        </p>
        <p>Initialization of pruning module. For implementing the pruning module, we use the
pre-trained BART model available on the Hugging Face library<sup>6</sup>. We make the code for the pruning
module publicly available<sup>7</sup>. We choose k=1 for CLOCQ during distant supervision (Sec. 3.1).
The model is fine-tuned for 5 epochs, with 500 warm-up steps, a batch size of 10, and a weight
decay of 0.01. We employ cross-entropy as the loss function. After each epoch, we run the
model on the withheld dev set, and finally choose the model with the lowest loss there.</p>
        <p>CLOCQ variants. On the entity linking task, we compare the linking results of the native
CLOCQ method with k=1 or k=AUTO to the linking results after applying the post-hoc
pruning module (again, k=1 or k=AUTO). On the relation linking task, we consider either the
top-ranked relation per mention, or all relations per mention, as returned by CLOCQ.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Entity linking</title>
        <p>The results on the entity linking task are shown in Table 3. When considering only the top-1
entity per mention, CLOCQ obtains a recall of 0.766. Setting k=AUTO improves the recall by
≃ 0.1, indicating that potential errors can be overcome. Further, activating the pruning module
can drastically improve the precision of CLOCQ, and thus also the F1 score. When adding the
pruning module for k=1, precision jumps from 0.281 to 0.714. Also, the results indicate that
mostly noise is pruned, since the recall remains fairly stable. Again, recall can be substantially
improved (≃ 0.1) by setting k=AUTO, at the cost of a lower precision and F1 score.</p>
        <p>3: https://github.com/smart-task/smart-2022-datasets 4: https://clocq.mpi-inf.mpg.de
5: https://github.com/PhilippChr/wikidata-core-for-QA 6: https://huggingface.co/facebook/bart-base
7: https://github.com/PhilippChr/CLOCQ-pruning-module</p>
        <p>
          The results indicate that the pruning module can successfully reduce noise in the entity linking
results. Further, we found that there is a trade-off between precision and recall, which makes it
impossible to determine a best variant for all scenarios. The best choice may highly depend on
the specific QA system. Some QA systems require precise linkings for each mention [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], while
others can cope with some noise [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], and leverage the boosted recall.
        </p>
        <p>
          For example, a QA system optimized for efficiency may only issue exactly one explicit query
to the KB [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], and may therefore rather go with k=1 and an activated pruning module. On
the other hand, if executing multiple queries is affordable for the QA system [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], using top-k
linkings might help.
        </p>
        <p>
          For example, the queries with incorrectly linked entities in top positions might not return
any result, while queries with lower ranked entities are able to identify the correct answer.
Re-ranking results after query execution might also be an option. When following a graph-based
approach without explicit queries, setting k=AUTO was found to be beneficial [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>An anecdotal example from the dev set for which an automatically increased k helps is: “What
was Toby Wright’s profession?”. There are different persons named “Toby Wright” in the KB, and
the context does not help to resolve the ambiguity. With k = 1, only the incorrect Toby Wright
(football player) is returned. When setting k = AUTO, CLOCQ identifies the ambiguity of the
mention and sets k = 2 for this mention. The correct entity Toby Wright (record producer) is then
fetched at the second rank of the results.</p>
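        <p>One simple way to operationalize such an ambiguity-dependent k is to keep every candidate whose score is close to the top score. This is an illustrative sketch only: the scores and the relative threshold are invented, and CLOCQ's actual ambiguity score differs.</p>

```python
# Illustrative sketch of choosing k per mention from the score distribution
# of its candidates. Scores are assumed sorted in descending order; the
# concrete values and the 0.8 ratio are made up, not CLOCQ's actual score.

def choose_k(scores, ratio=0.8, k_max=5):
    """Keep every candidate scoring within `ratio` of the top score."""
    if not scores:
        return 0
    top = scores[0]
    k = sum(1 for s in scores if s >= ratio * top)
    return min(k, k_max)

# Clear mention: large gap after rank 1 -> k = 1.
print(choose_k([0.9, 0.3, 0.1]))    # -> 1
# Ambiguous mention (e.g., "Toby Wright"): close scores -> k = 2.
print(choose_k([0.52, 0.47, 0.1]))  # -> 2
```

        <p>With such a rule, unambiguous mentions keep a single linking, while ambiguous ones automatically retain more candidates.</p>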
        <p>Another interesting question from the dev set is “which footballer was born in middlesbrough?”.
The mention “footballer” indicates that the question is on the topic of football, and therefore
CLOCQ provides Middlesbrough F.C. (football club) as the top-ranked linking for
“middlesbrough”. However, in this question “middlesbrough” refers to the corresponding town. Again, in
the auto-k mode, CLOCQ chooses a higher k (k = 3), and includes the correct entity in the top-3
linkings ⟨Middlesbrough F.C. (football club), Middlesbrough (borough), Middlesbrough (town)⟩.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Relation linking</title>
        <p>The results on the relation linking task are shown in Table 4.</p>
        <p>As for the entity linking task, considering more linkings per mention can help to boost
recall, by 0.05 in this case. Again, precision drops substantially, leading to a decreased F1 score.
Considering only the top-ranked relation per mention achieves the better F1 score. Interestingly,
the average number of relations per question in the system results is quite close to the average
number of gold relations (we assume that this property is similar on the train and test sets).</p>
        <p>
          Overall, the results indicate that relation linking may require methods optimized specifically
for this purpose. Still, being a general linking method, CLOCQ can provide the correct relation
for a substantial part of the questions, often bridging the lexical gap between the relation
mention and the surface form of the relation in the KB. Note that there are very few existing
systems that can perform both entity and relation linking: this is one of the novelties in CLOCQ.
Another such system is [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>For example, for the question “Which child of John Adams died on February 23, 1848?” from
the training set, CLOCQ correctly links “child” to child, and “died on” to date of death. However,
for questions like “What is the point in time that Nicolaus Cusanus was made cardinal by the
Holy Roman Church?”, CLOCQ failed to link the correct relations start time and position held.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related work</title>
      <sec id="sec-5-1">
        <title>5.1. Entity linking</title>
        <p>
          There has been extensive research on entity linking and we discuss some prominent works
here. TagMe [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], one of the early yet effective systems, makes use of Wikipedia anchors to
detect entity mentions, looks up possible mappings, and scores these with regard to a collective
agreement implemented by a voting scheme. In AIDA [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], a mention-entity graph is established.
Then, the entity mentions are linked jointly by approximating the densest subgraph.
        </p>
        <p>
          Coming to more recent neural systems, REL [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] is a framework for end-to-end entity linking,
building on state-of-the-art neural components. ELQ [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] jointly performs mention detection
and linking, leveraging a BERT-based bi-encoder. These methods are optimized for computing
the top-1 entity per mention, and mostly give only the top-ranked entity in the disambiguation.
Top-1 entity linking is prone to errors that can affect the whole QA pipeline [
          <xref ref-type="bibr" rid="ref25">25, 26</xref>
          ].
S-MART [27] introduces structured multiple additive regression trees, and applies this statistical
model to a set of (mention, entity)-pairs and corresponding features. Unlike most other works,
S-MART returns the top-k disambiguations per mention. However, since it is a proprietary
entity linking system, its code is not available.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Relation linking</title>
        <p>Relation linking is particularly useful for QA systems constructing an explicit query. Early
approaches used paraphrase-based dictionaries [28] or patterns [29] to link relation mentions.
Subsequent approaches often leveraged semantic parses [30] for relation linking, which has also
been shown to be effective in combination with neural models [31]. There is also a line of work
that approaches relation linking as a classification task [32, 26, 33]. While these methods often
achieve high accuracy, a common bottleneck is that they can recognize only a fraction of the
KB relations provided in the benchmark. Therefore, they are mostly applied in the
context of information extraction (IE), rather than QA. Finally, for previous iterations of the
SMART Task in 2020 and 2021, a range of relation linking methodologies has been proposed
and evaluated [34, 35].</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Joint entity and relation linking</title>
        <p>
          Entity and relation linking are complementary problems, where the results of one task can help
solve the other. Thus, linking is often an intrinsic part of the QA pipeline itself, in
which entity and relation linking are implicitly solved in a joint manner [
          <xref ref-type="bibr" rid="ref13">13, 28, 36</xref>
          ]. EARL [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]
is a dedicated linking system that aims to leverage this intuition of joint disambiguation for
entity and relation linking tasks. CLOCQ generalizes this idea further, by initially linking any
mention to the KB. In this work, we evaluate the applicability of CLOCQ to both tasks, entity
and relation linking.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        We apply CLOCQ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] on the entity and relation linking challenges of the SMART 2022 Task.
Since the original unsupervised algorithm links all mentions in the question, leading to a
substantial amount of noise, we propose a post-hoc pruning module. This supervised module
works on top of the linking results from CLOCQ, and prunes linkings for irrelevant mentions. The
pipeline is thus a hybrid of supervised and unsupervised modules, leveraging the strengths of
both worlds. The results on the SMART entity linking task indicate that the module successfully
reduces noise in the linkings, and helps to achieve the overall best F1 score of the CLOCQ
variants. Future work could target entity linking and relation linking in conversational settings,
where linking mentions can require understanding the whole conversation [
        <xref ref-type="bibr" rid="ref18">18, 37</xref>
        ].
      </p>
      <p>[26] W.-t. Yih, M.-W. Chang, X. He, J. Gao, Semantic parsing via staged query graph generation: Question answering with knowledge base, in: ACL-IJCNLP, 2015.
[27] Y. Yang, M.-W. Chang, S-MART: Novel tree-based structured learning algorithms applied to tweet entity linking, in: ACL-IJCNLP, 2015.
[28] M. Yahya, K. Berberich, S. Elbassuoni, M. Ramanath, V. Tresp, G. Weikum, Natural language questions for the web of data, in: EMNLP, 2012.
[29] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, P. Cimiano, Template-based question answering over RDF data, in: WWW, 2012.
[30] W.-t. Yih, X. He, C. Meek, Semantic parsing for single-relation question answering, in: ACL, 2014.
[31] T. Naseem, S. Ravishankar, N. Mihindukulasooriya, I. Abdelaziz, Y.-S. Lee, P. Kapanipathi, S. Roukos, A. Gliozzo, A. Gray, A semantics-aware transformer model of relation linking for knowledge base question answering, in: ACL-IJCNLP, 2021.
[32] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: COLING, 2014.
[33] J. Feng, M. Huang, L. Zhao, Y. Yang, X. Zhu, Reinforcement learning for relation classification from noisy data, in: AAAI, 2018.
[34] N. Mihindukulasooriya, M. Dubey, A. Gliozzo, J. Lehmann, A.-C. N. Ngomo, R. Usbeck, Semantic answer type prediction task (SMART) at ISWC 2020 Semantic Web Challenge, arXiv (2020).
[35] N. Mihindukulasooriya, M. Dubey, A. Gliozzo, J. Lehmann, A.-C. N. Ngomo, R. Usbeck, G. Rossiello, U. Kumar, Semantic answer type and relation prediction task (SMART 2021), arXiv (2021).
[36] A. Abujabal, M. Yahya, M. Riedewald, G. Weikum, Automated template generation for question answering over knowledge graphs, in: WWW, 2017.
[37] H. Joko, F. Hasibi, K. Balog, A. P. de Vries, Conversational entity linking: Problem definition and datasets, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          ,
          <source>in: CACM</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>DBpedia: A nucleus for a Web of open data</article-title>
          ,
          <source>in: The Semantic Web</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , G. Kasneci, G. Weikum,
          <article-title>YAGO: A core of semantic knowledge</article-title>
          ,
          <source>in: WWW</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , Freebase:
          <article-title>A collaboratively created graph database for structuring human knowledge</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. Saha</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <article-title>Question Answering for the Curated Web: Tasks and Methods in QA over Knowledge Bases</article-title>
          and Text Collections, Springer,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Dialog-to-action: conversational question answering over a large-scale knowledge base</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bast</surname>
          </string-name>
          , E. Haussmann,
          <article-title>More accurate question answering on freebase</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Saha</given-names>
            <surname>Roy</surname>
          </string-name>
          , G. Weikum,
          <article-title>Beyond NED: Fast and effective search space reduction for complex question answering over knowledge bases</article-title>
          ,
          <source>in: WSDM</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mazaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , W. Cohen,
          <article-title>Open domain question answering using early fusion of knowledge bases and text</article-title>
          , in: EMNLP,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          ,
          <article-title>Pruned query evaluation using pre-computed impacts</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lotem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naor</surname>
          </string-name>
          ,
          <article-title>Optimal aggregation algorithms for middleware</article-title>
          ,
          <source>Journal of computer and system sciences 66</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pramanik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Saha</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abujabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , G. Weikum,
          <article-title>Answering complex questions by joining multi-document evidence with quasi knowledge graphs</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Saha</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abujabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Look before you hop: Conversational question answering over knowledge graphs using judicious context expansion</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>IO-Top-k: Index-access optimized top-k query processing</article-title>
          , in: VLDB Conference,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Lewit</surname>
          </string-name>
          ,
          <article-title>Optimization of inverted vector searches</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , in: ACL,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abujabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Saha</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yahya</surname>
          </string-name>
          , G. Weikum,
          <article-title>Never-ending learning for open-domain question answering over knowledge bases</article-title>
          ,
          <source>in: WWW</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Saha</given-names>
            <surname>Roy</surname>
          </string-name>
          , G. Weikum,
          <article-title>Conversational question answering on heterogeneous sources</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abujabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Roy</surname>
          </string-name>
          , G. Weikum,
          <article-title>Efficiency-aware answering of compositional questions using answer type prediction</article-title>
          ,
          <source>in: IJCNLP</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>EARL: joint entity and relation linking for question answering over knowledge graphs</article-title>
          ,
          <source>in: ISWC</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          , U. Scaiella,
          <article-title>TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yosef</surname>
          </string-name>
          , I. Bordino,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fürstenau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinkal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Taneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thater</surname>
          </string-name>
          , G. Weikum,
          <article-title>Robust disambiguation of named entities in text</article-title>
          , in: EMNLP,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>van Hulst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dercksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <article-title>REL: An entity linker standing on the shoulders of giants</article-title>
          , in: SIGIR,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B. Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mehdad</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Efficient one-pass end-to-end entity linking for questions</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Multi-task learning for conversational question answering over a large-scale knowledge base</article-title>
          ,
          <source>in: EMNLP-IJCNLP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>