Joint Entity and Relation Linking using EARL

Debayan Banerjee (1,2), Mohnish Dubey (1,2), Debanjan Chaudhuri (1,2), and Jens Lehmann (1,2)

1 Smart Data Analytics Group (SDA), University of Bonn, Germany
{dubey, chaudhur, jens.lehmann}@cs.uni-bonn.de, debayan@uni-bonn.de
2 Fraunhofer IAIS, Bonn, Germany
jens.lehmann@iais.fraunhofer.de

Abstract. In order to answer natural language questions over knowledge graphs, most processing pipelines involve entity and relation linking. Traditionally, entity linking and relation linking have been performed either as dependent sequential tasks or as independent parallel tasks. In this demo paper, we present EARL, which performs entity linking and relation linking as a single joint task. The system determines the best semantic connection between all keywords of the question by referring to the knowledge graph. This is achieved by exploiting the connection density between entity candidates and relation candidates. EARL uses Bloom filters for faster retrieval of connection density and an extended label vocabulary for higher recall, improving the overall accuracy.

Keywords: Entity Linking, Relation Linking, Question Answering

1 Introduction

Accessing information from knowledge graphs (KGs) efficiently has been an important research goal in the last decade. In particular, question answering techniques have been proposed to allow obtaining information from knowledge graphs based on natural language input. A semantic question answering system [4,5] requires (i) entity identification and linking, (ii) relation identification and linking, (iii) query intent identification, and (iv) formal query generation. In this submission, we are concerned with steps (i) and (ii). In most entity linking systems, disambiguation is performed by looking at other entities present in the text. However, in the case of short questions there is often no other entity to be found.
To deal with such cases, we look not just at other entities but also at relations in the natural language question. EARL performs the joint linking task by looking for connections between the candidates in the knowledge graph. Thus, entities help in relation disambiguation and relations help in entity disambiguation. To achieve high recall, EARL uses an extended label vocabulary. Additionally, the use of Bloom filters [1] instead of KB queries results in low response times. This is an accompanying demo paper for EARL [3], which was accepted at the main ISWC research track and is a component in the QA pipeline that performs joint identification and linking of entities and relations. In addition to the main conference paper, we provide a) an API for using EARL, b) a description of the usage of Bloom filters to speed up triple queries, and c) a description of the extended vocabulary of DBpedia labels for entities and relations.

Fig. 1. EARL Architecture

2 Architecture for EARL

2.1 System Overview

Here we give an overview of the EARL [3] pipeline to show the components which use the Bloom filters and the extended label vocabulary.

Step 1: Shallow Parsing: EARL extracts all keyword phrases from a question using SENNA^3. An E/R prediction module identifies whether each phrase is an entity or relation candidate.

Step 2: Candidate List Generation: EARL uses a URI-label index for retrieving candidate URIs for keyword phrases. The extended label vocabulary index consists of DBpedia labels extended by Wikidata, the Oxford dictionary^4 and FastText^5.

Step 3: Connectivity Detection: We take the candidate lists from the previous step and check how well each candidate is KB-connected to the candidates in the other lists. This information is distilled into three features, namely Hop-Count H, Connection-Count C, and the initial rank of the list Ri. For checking connectivity among candidates in the knowledge base, we use pre-built in-memory Bloom filters instead of making network calls to DBpedia.
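The candidate retrieval in Step 2 can be illustrated with a minimal in-memory lookup standing in for the Elasticsearch index; the function names and the exact-match behaviour below are simplifying assumptions, and only the example labels and URIs come from this paper:

```python
# Minimal stand-in for EARL's URI-label index (the real system uses an
# Elasticsearch index over the extended label vocabulary).
from collections import defaultdict

# label -> set of candidate URIs; labels come from DBpedia, Wikidata
# same-as links, Oxford Dictionary synonyms and FastText inflections.
uri_label_index = defaultdict(set)

def add_label(label: str, uri: str) -> None:
    uri_label_index[label.lower()].add(uri)

def candidates(keyword_phrase: str) -> set:
    """Return candidate URIs for a keyword phrase (exact match only;
    the real index performs fuzzy full-text retrieval)."""
    return uri_label_index.get(keyword_phrase.lower(), set())

# Illustrative entries: several extended labels mapping to one URI.
add_label("Barack Obama", "dbr:Barack_Obama")
add_label("President Obama", "dbr:Barack_Obama")
add_label("writer", "dbo:writer")
add_label("author", "dbo:writer")

print(candidates("president obama"))  # {'dbr:Barack_Obama'}
print(candidates("author"))           # {'dbo:writer'}
```

The point of the extended vocabulary is visible even in this toy version: surface forms that never appear as the canonical DBpedia label ("President Obama", "author") still resolve to the right URI.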
Using these in-memory filters leads to faster response times.

Step 4: Re-ranking Candidate Lists: From the top-k candidate list for each entity/relation, we predict the most probable candidate. EARL relies on XGBoost [2] classifiers to re-rank the lists based on the three features (H, C, Ri).

2.2 Implementation Details

2.2.1 Creating an Extended Label Vocabulary: In the text search phase, we need to retrieve a candidate list for each keyword identified in the natural language question by the shallow parser. To retrieve the top candidates for a keyword, we create an Elasticsearch index of URI-label pairs. Since EARL requires an exhaustive list of labels for a DBpedia URI, we expanded the labels beyond the DBpedia-provided label. We added Wikidata labels for entities using DBpedia "same-as" links. For example, the label for dbr:Barack_Obama is only "Barack Obama" in DBpedia, but we expand this set by using Wikidata same-as links, so our index has the labels "Barack Obama, Barack Hussein Obama, President Obama, Barack, ...". For relations, we required semantically equivalent labels, for which we took synonyms from the Oxford Dictionary API. For example, dbo:writer has the label "writer" in DBpedia; using the Oxford Dictionary we expand this with labels such as "author, penman, creator, ...". EARL further uses FastText to cover the inflected forms of these labels, such as "write, writer, writing, wrote, written, ...". Our extended vocabulary contains 256K labels for DBpedia relations.

3 https://github.com/torch/senna
4 https://developer.oxforddictionaries.com/
5 https://github.com/facebookresearch/fastText

2.2.2 Using Bloom Filters for Fast Querying of the KB: While performing joint connectivity detection, EARL checks the connectivity and distance between two nodes of the knowledge graph. EARL checks the connection between each candidate of a keyword phrase and all the candidates of all the other keyword phrases.
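The pairwise check just described can be sketched as follows. The candidate lists and the hop-distance oracle `hops(a, b)` are hypothetical, and accumulating H as a sum of hop distances is our reading of the feature, not a definition taken from this paper (the precise definitions are in [3]):

```python
from itertools import combinations

def connectivity_features(candidate_lists, hops):
    """Compute Connection-Count C and (one reading of) Hop-Count H.

    candidate_lists: one list of candidate URIs per keyword phrase.
    hops: function (uri_a, uri_b) -> hop distance, or None if the two
          URIs are not connected within the indexed hop range.
    """
    C = {}  # how many candidates in *other* lists a candidate connects to
    H = {}  # accumulated hop distances over those connections
    for list_a, list_b in combinations(candidate_lists, 2):
        for a in list_a:
            for b in list_b:
                d = hops(a, b)
                if d is not None:
                    C[a] = C.get(a, 0) + 1
                    C[b] = C.get(b, 0) + 1
                    H[a] = H.get(a, 0) + d
                    H[b] = H.get(b, 0) + d
    return C, H

# Toy example with a hand-written hop oracle:
toy_hops = {("dbr:Barack_Obama", "dbo:spouse"): 1}
lists = [["dbr:Barack_Obama", "dbr:Barack_(band)"], ["dbo:spouse"]]
C, H = connectivity_features(lists, lambda a, b: toy_hops.get((a, b)))
print(C)  # the entity and relation that are actually KB-connected score 1
```

Note that candidates within the same list are never compared: connectivity only ever links a candidate to candidates of other keyword phrases, which is what lets entities disambiguate relations and vice versa.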
Two DBpedia nodes may have multiple intermediate nodes on the path connecting them, and looking at all of them takes a large amount of time since the number of such node-pair queries is high. We do not directly query the knowledge graph, as the execution time for such a query on a large knowledge graph is prohibitively high. Instead, EARL uses Bloom filters [1], which are probabilistic data structures that answer set-membership questions ("is this element a member or not?") in constant O(1) time. The user can decide the acceptable error rate when creating a Bloom filter by choosing an appropriate filter size and corresponding number of hash functions. EARL's Bloom filter parameters are chosen for a one-in-a-million error probability. There are separate Bloom filters for the different hop lengths. We took each connected node pair as a string (e.g. http://dbpedia.org/resource/car:http://dbpedia.org/resource/bike) and added it to the Bloom filter corresponding to its hop length. Hence we ended up with four Bloom filters for the different path lengths (up to 4 hops), each around 1 GB in size and resident in memory. This method allows us to condense a KG which is several hundred GBs in size into only a few GBs, with a low probability of error. When we need to find the connection density, we take each pair of candidate URIs from the lists returned by Elasticsearch and ask the Bloom filters whether they contain these URI pairs. Depending on which Bloom filter answers positively, we know how many hops apart the two URIs are in the knowledge graph.
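The pair-encoding scheme can be sketched with a minimal hand-rolled Bloom filter. This is an illustration only: the class below, its toy sizes, and the `hop_distance` helper are ours, whereas EARL's real filters are sized for a one-in-a-million false-positive rate over a full knowledge graph:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k seeded hashes over a fixed-size bit array."""
    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by seeding one hash function k ways.
        for seed in range(self.k):
            h = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# One filter per hop length (EARL keeps four, up to 4 hops); toy sizes here.
filters = {hop: BloomFilter(size_bits=10_000, num_hashes=7)
           for hop in range(1, 5)}

# Index a pair connected at hop length 2 as a single "uriA:uriB" string:
pair = "http://dbpedia.org/resource/car:http://dbpedia.org/resource/bike"
filters[2].add(pair)

def hop_distance(pair_key: str):
    """Smallest hop length whose filter claims membership, else None."""
    for hop in sorted(filters):
        if pair_key in filters[hop]:
            return hop
    return None

print(hop_distance(pair))  # 2
```

Because a Bloom filter never produces false negatives, a connected pair is always found; the tunable risk is a false positive, i.e. occasionally reporting a connection (or a shorter hop length) that does not exist.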
In our experiment, we checked the connectivity between two nodes of the knowledge graph with a fixed path length. We observe that the SPARQL response time for such a query (3300 microseconds for 1-hop) is many times higher than the Bloom filter's response time (17 microseconds). The results in Table 1 show that the response time of SPARQL increases with the hop length, whereas the Bloom filter's response time is constant. However, it must be added that the one-time construction of the Bloom filters as a pre-processing step requires several hours to complete.

Table 1. Bloom filter's inference time compared to SPARQL

Connection check with path length    | SPARQL infer time | Bloom filter infer time
1-hop: e1 - p1                       | 3300 µsec         | 17 µsec
2-hop: e1 - ?x - e2                  | 3300 µsec         | 17 µsec
3-hop: e1 - ?x - ?y - p2             | 8500 µsec         | 17 µsec
4-hop: e1 - ?x - ?y - ?z - e3        | 11059 µsec        | 17 µsec

3 API and Data Set

The EARL API is publicly available at http://sda.cs.uni-bonn.de/projects/earl/. The API takes a natural language question as input and returns a ranked list of URIs for the corresponding keyword phrases. A demo of EARL is also accessible via EARL's homepage mentioned above. The ranked lists output by EARL also contain confidence scores. EARL's connection density module has a time complexity of O(N^2 L^2), where N is the number of keyword phrases and L is the size of the candidate list retrieved for each keyword phrase. The number of Bloom queries made for a question is given by L^2 (N^2 - N) / 2. The EARL API has an average response time of 0.42 seconds per question when queried on the same machine hosting the service. The source code for EARL is available at https://github.com/AskNowQA/EARL. Along with the code, we also release the extended label vocabulary. A fully annotated version of the LC-QuAD [6] dataset, in which the relations and entities in questions are marked with their corresponding URIs, is available at https://figshare.com/projects/EARL/28218.
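The query count follows directly from the pairwise scheme: each unordered pair of keyword-phrase lists (N(N-1)/2 of them) contributes L x L candidate pairs. A quick check of the formula, with illustrative values of N and L:

```python
def bloom_query_count(n_phrases: int, list_size: int) -> int:
    """Bloom queries per question: L^2 * (N^2 - N) / 2, i.e. one query
    per candidate pair across each unordered pair of keyword-phrase lists."""
    return list_size ** 2 * (n_phrases ** 2 - n_phrases) // 2

# e.g. a question with 3 keyword phrases and 10 candidates per list:
print(bloom_query_count(3, 10))  # 300
```

Since each query is an O(1) in-memory membership test, even a few hundred queries per question stay well within the reported 0.42-second average response time.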
4 Conclusion

In this paper, we present an extension of the EARL paper by providing a) a public service and API for EARL, b) additional results for Bloom filters and implementation details, and c) the extended label sets (for DBpedia entities and relations) used in EARL.

References

1. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422-426, 1970.
2. T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
3. M. Dubey, D. Banerjee, D. Chaudhuri, and J. Lehmann. EARL: Joint entity and relation linking for question answering over knowledge graphs. In International Semantic Web Conference, 2018.
4. M. Dubey, S. Dasgupta, A. Sharma, K. Höffner, and J. Lehmann. AskNow: A framework for natural language query formalization in SPARQL. In International Semantic Web Conference, pages 300-316. Springer, 2016.
5. K. Höffner, S. Walter, E. Marx, R. Usbeck, J. Lehmann, and A.-C. Ngonga Ngomo. Survey on challenges of question answering in the semantic web. Semantic Web, (Preprint):1-26, 2016.
6. P. Trivedi, G. Maheshwari, M. Dubey, and J. Lehmann. LC-QuAD: A corpus for complex question answering over knowledge graphs. In International Semantic Web Conference, pages 210-218. Springer, 2017.