HAWK@QALD5 – Trying to answer hybrid questions with various simple ranking techniques

Ricardo Usbeck and Axel-Cyrille Ngonga Ngomo
University of Leipzig, Germany
{usbeck,ngonga}@informatik.uni-leipzig.de

Abstract. The growing amount of data available in the Document Web as well as in the Linked Data Web has led to an information gap: the information needed to answer complex questions often requires full-text data as well as Linked Data. Thus, HAWK combines unstructured and structured data sources. In this article, we introduce HAWK, a novel entity search approach for hybrid question answering based on combining Linked Data and textual data. We compare three ranking mechanisms and evaluate their performance on the QALD-5 challenge. Finally, we identify the weak points of the current version of HAWK and give directions for future development.

1 Introduction

The fifth challenge on Question Answering over Linked Data (QALD-5)¹ has introduced and extended a novel benchmark dataset for hybrid question answering. In this paper, we present our framework HAWK, the second best performing system w.r.t. hybrid question answering. The need for this challenge becomes obvious when comparing the growing amounts of data in the Document Web and in the Linked Data Web, which introduces an information gap: a considerable number of complex questions can only be answered by hybrid question answering approaches, which find and combine information stored in both structured and textual data sources [4].

In this paper, we outline the single steps performed by HAWK to answer hybrid questions in Section 2. Section 3 discusses problems and opportunities within the current implementation. Finally, we conclude in Section 4 and explain future research directions. The source code of HAWK, benchmark data, a link to the demo, evaluation results as well as further information can be retrieved from the project website http://aksw.org/Projects/hawk.

¹ http://www.sc.cit-ec.uni-bielefeld.de/qald/

2 Method

In the following, we briefly describe the 8-step, modular pipeline of HAWK based on the following running example: Which anti-apartheid activist was born in Mvezo? For more information, please have a look at the full method description [5].

1. Segmentation and Part-of-Speech (POS) Tagging. To be generic with respect to the language of the input question, HAWK uses a modular system that is able to tokenize even languages without clear word separation, such as Chinese. For English input questions, our system relies on the clearNLP framework [1], which provides, among others, a whitespace tokenizer, a POS tagger and transition-based dependency parsing. The framework annotates each token with its POS tag, which is later used to identify possible semantic annotations. POS tagging the running example yields: Which(WDT) anti-apartheid(JJ) activist(NN) was(VBD) born(VBN) in(IN) Mvezo(NNP)?

2. Entity Annotation. Next, our approach identifies named entities and tries to link them to the underlying knowledge base. The QALD-5 challenge relies on DBpedia [2] as source of structured information in the form of Linked Data. For recognizing and linking named entities, HAWK's default annotator is FOX [3], a federated knowledge extraction framework based on ensemble learning. FOX has been shown to outperform other entity annotation systems on the QALD-4 benchmark data [5]. An optimal annotator would annotate Mvezo in our running example with http://dbpedia.org/resource/Mvezo.
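To make the intermediate result of the first two steps concrete, the following minimal Java sketch models the running example as a list of tokens carrying a POS tag and an optional entity link. The class and field names are purely illustrative assumptions and do not reflect HAWK's internal data model.

```java
import java.util.List;
import java.util.Optional;

public final class AnnotatedQuestion {

    /** One token of the input question with its POS tag and an optional entity link. */
    record Token(String surface, String posTag, Optional<String> entityUri) {
        static Token of(String surface, String posTag) {
            return new Token(surface, posTag, Optional.empty());
        }
        static Token entity(String surface, String posTag, String uri) {
            return new Token(surface, posTag, Optional.of(uri));
        }
    }

    public static void main(String[] args) {
        // Running example after segmentation, POS tagging and entity annotation.
        List<Token> question = List.of(
                Token.of("Which", "WDT"),
                Token.of("anti-apartheid", "JJ"),
                Token.of("activist", "NN"),
                Token.of("was", "VBD"),
                Token.of("born", "VBN"),
                Token.of("in", "IN"),
                Token.entity("Mvezo", "NNP", "http://dbpedia.org/resource/Mvezo"));

        question.forEach(t -> System.out.printf("%-15s %-4s %s%n",
                t.surface(), t.posTag(), t.entityUri().orElse("-")));
    }
}
```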
3. Dependency Parsing and Noun Phrase Detection. Subsequently, in order to capture linguistic and semantic relations, HAWK parses the query using dependency parsing and semantic role labeling [1]. The dependency parser generates a predicate-argument tree based on the preprocessed question. Afterwards, HAWK identifies noun phrases, i.e., semantically meaningful word groups not yet recognized by the entity annotation system, using the result of the POS-tagging step. Input tokens are combined following manually crafted linguistic heuristics based on POS-tag sequences derived from the QALD-5 benchmark questions. Two domain experts implemented the deduced POS-tag sequences and safeguarded the quality of this algorithm w.r.t. the F-measure of the QA pipeline. The full algorithm can be found in our source code repository at https://github.com/aksw/hawk. Here, anti-apartheid activist would be detected as a noun phrase.

4. Linguistic Pruning. HAWK has now captured entities from the given knowledge base, noun phrases as well as the semantic structure of the sentence. Still, this structure contains tokens that are meaningless for retrieving the target information or that even introduce noise into the process. Thus, HAWK prunes nodes from the predicate-argument tree based on their POS tags, e.g., deleting all DET nodes, interrogative phrases such as Give me or List, and auxiliary tokens such as did. The linguistically pruned dependency tree with combined noun phrases for our running example contains only born as root node, with two children, namely anti-apartheid activist and http://dbpedia.org/resource/Mvezo.

5. Semantic Annotation. Now, the tree structure contains only semantically meaningful (combined) tokens and entities, i.e., individuals from the underlying knowledge base. To map the remaining tokens to properties and classes from the target knowledge base and its underlying ontology, our framework uses information about possible verbalizations of ontology concepts and leverages a fuzzy string search. These verbalizations are based on both rdfs:label² information from the ontology itself and (if available) verbalization information contained in lexica, in our case the existing DBpedia English lexicon³. After this step, a node is either annotated with a class, property or individual from the target knowledge base, or it triggers a full-text lookup in the targeted Document Web parts. With respect to the running example, born would be annotated with the properties dbo:birthPlace and dbo:birthDate.
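The semantic annotation step can be pictured as a fuzzy lookup of the remaining tokens in a verbalization dictionary. The sketch below uses a hand-filled toy dictionary and a plain Levenshtein distance; in HAWK the verbalizations come from rdfs:label values and the DBpedia English lexicon, so the concrete dictionary, matching strategy and threshold shown here are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class VerbalizationLookup {

    // Toy verbalization dictionary; HAWK derives this from rdfs:label values
    // and the DBpedia English lexicon, not from a hard-coded map.
    private static final Map<String, List<String>> VERBALIZATIONS = Map.of(
            "birth place", List.of("http://dbpedia.org/ontology/birthPlace"),
            "born",        List.of("http://dbpedia.org/ontology/birthPlace",
                                   "http://dbpedia.org/ontology/birthDate"),
            "activist",    List.of("http://dbpedia.org/ontology/Activist"));

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    /** Returns all URIs whose verbalization lies within the given edit distance of the token. */
    static List<String> annotate(String token, int maxDistance) {
        List<String> candidates = new ArrayList<>();
        VERBALIZATIONS.forEach((verbalization, uris) -> {
            if (levenshtein(token.toLowerCase(), verbalization) <= maxDistance) {
                candidates.addAll(uris);
            }
        });
        return candidates;
    }

    public static void main(String[] args) {
        // "born" maps to both dbo:birthPlace and dbo:birthDate, exactly the
        // ambiguity the running example has to resolve in later steps.
        System.out.println(annotate("born", 1));
    }
}
```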
6. Generating Hybrid SPARQL Queries. Given a (partly) annotated predicate-argument tree, HAWK generates hybrid SPARQL queries. It uses an Apache Jena FUSEKI⁴ server, which implements the full-text search predicate text:query on a-priori defined literals. Those literals are essentially mappings of textual information to a certain individual URI from the target knowledge base, i.e., an implicit enrichment of structured knowledge from unstructured data. Specifically, HAWK indexes a priori the following information per individual to capture document-based information: dbo:abstract, rdfs:label, dbo:redirect and dc:subject. This information is then retrieved via the text:query predicate using exact or fuzzy matches on each non-stopword token of an indexed field.

The generation of SPARQL triple patterns is based on a pre-order walk, reflecting the empirical observations that i) related information is situated close together in the tree and ii) information becomes more restrictive from left to right. The traversal visits each node and generates several possible triple patterns based on the number of annotations and the POS tag itself. With this approach, HAWK is independent of SPARQL templates and able to work on natural language input of any length and complexity. Each pattern contains at least one variable from a pre-defined set of variables, i.e., ?proj for the resource projection variable, ?const for resources covering constraints related to the projection variable, as well as a variety of variables for predicates to inspect the surrounding of elements in the knowledge base graph. During this process, each iteration of the traversal appends the generated patterns to each of the already existing SPARQL queries. This combinatorial effort covers every possible SPARQL graph pattern given the predicate-argument tree. Amongst others, HAWK generates the following three hybrid SPARQL queries for the running example:

1. SELECT ?proj {?proj text:query 'anti-apartheid activist'. ?proj dbo:birthPlace dbr:Mvezo.}
2. SELECT ?proj {?proj text:query 'anti-apartheid activist'. ?proj dbo:birthDate dbr:Mvezo.}
3. SELECT ?proj {?proj text:query 'anti-apartheid activist'. ?const dbo:birthPlace ?proj.}

² We assume dbo stands for http://dbpedia.org/ontology/, dbr for http://dbpedia.org/resource/, rdfs for http://www.w3.org/2000/01/rdf-schema#, dc for http://purl.org/dc/elements/1.1/ and text for http://jena.apache.org/text#.
³ https://github.com/cunger/lemon.dbpedia
⁴ http://jena.apache.org/documentation/serving_data/

7. Semantic Pruning of SPARQL Queries. Covering each possible SPARQL graph pattern with the above algorithm results in a large number of generated SPARQL queries. To handle this large set of queries effectively and to reduce the computational effort, HAWK implements various pruning methods. For example, it assumes that unconnected query graphs, missing projection variables and cyclic SPARQL triple patterns lead to empty or semantically meaningless results, and therefore discards such queries. In the running example, the semantic pruning discards query number two from above because it violates the range restriction of the dbo:birthDate predicate. Although semantic pruning drastically reduces the number of queries, it often does not result in only one query. Thus, a ranking of the remaining queries is applied before the best SPARQL query is sent to the target triple store.
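The structural part of this pruning can be pictured as a few cheap graph checks over a query's triple patterns. The following minimal Java sketch is an illustration rather than HAWK's actual implementation: it keeps a candidate query only if it mentions the projection variable and its patterns form a single connected graph; the cycle and range-restriction checks mentioned above are omitted for brevity.

```java
import java.util.*;

public final class QueryPruning {

    /** A simplified SPARQL triple pattern: subject, predicate, object (variables start with '?'). */
    record TriplePattern(String s, String p, String o) {}

    /** Keep a query only if it mentions ?proj and its patterns form one connected graph. */
    static boolean isPlausible(List<TriplePattern> patterns) {
        if (patterns.isEmpty()) return false;

        // Check that the projection variable occurs somewhere.
        boolean hasProjection = patterns.stream()
                .anyMatch(t -> t.s().equals("?proj") || t.o().equals("?proj"));
        if (!hasProjection) return false;

        // Build an undirected graph over subjects and objects, then check connectivity via BFS.
        Map<String, Set<String>> adjacency = new HashMap<>();
        for (TriplePattern t : patterns) {
            adjacency.computeIfAbsent(t.s(), k -> new HashSet<>()).add(t.o());
            adjacency.computeIfAbsent(t.o(), k -> new HashSet<>()).add(t.s());
        }
        Set<String> visited = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(patterns.get(0).s()));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (visited.add(node)) queue.addAll(adjacency.getOrDefault(node, Set.of()));
        }
        return visited.containsAll(adjacency.keySet());
    }

    public static void main(String[] args) {
        // Connected query with a projection variable: kept.
        System.out.println(isPlausible(List.of(
                new TriplePattern("?proj", "dbo:birthPlace", "dbr:Mvezo"))));
        // Triple patterns that do not share any node: discarded.
        System.out.println(isPlausible(List.of(
                new TriplePattern("?proj", "dbo:birthPlace", "dbr:Mvezo"),
                new TriplePattern("?x", "dbo:spouse", "?y"))));
    }
}
```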
8. Ranking and Cardinality. For the QALD-5 challenge, we extended HAWK's basic ranking implementations, optimal ranking and feature-based ranking, by one additional ranking method:

– Optimal Ranking. To ensure that we are able to generate hybrid SPARQL queries capable of answering the benchmark questions, the optimal ranker always returns those hybrid SPARQL queries which lead to the maximum F-measure. Obviously, the optimal ranking can only be used if the answers are known, i.e., if HAWK operates on the training data. This ranking function allows us to determine which parts of the hybrid question answering pipeline do not perform well.

– Feature-based Ranking. The second ranking method is based on supervised learning using the gold standard answer set from the QALD-4 benchmark. In the training phase, all generated queries are run against the underlying SPARQL endpoint. Comparing the results to the gold standard answer set, HAWK stores all queries resulting in the highest F-measures. Afterwards, the stored queries are used to calculate an average feature vector comprising simple features, mimicking a centroid-based cosine ranking. HAWK's ranking calculation comprises the following components:
  • NR OF TERMS calculates the number of nodes used to form the full-text query part as described above.
  • NR OF CONSTRAINTS counts the number of triple patterns per SPARQL query.
  • NR OF TYPES sums the number of patterns of the form ?var rdf:type cls.
  • PREDICATES generates a vector containing an entry for each predicate used in the SPARQL query.
  During the test phase, HAWK ranks each generated SPARQL query by the cosine similarity between its feature vector and the average feature vector of the training queries.

– Overlap-based Ranking. The novel overlap-based ranking accounts for the intuition that the same result set can be generated by several different hybrid SPARQL queries. Thus, this ranker, although computationally highly expensive, executes every hybrid SPARQL query and stores the resulting answer sets in hashed buckets. Finally, the ranker counts how many queries produced each answer set and returns the answer set with the highest count.

Moreover, HAWK determines the target cardinality x of each query, i.e., the LIMIT x respectively the number of answers expected for the query, using the indicated cardinality of the first seen POS tag, e.g., the POS tag NNS demands the plural while NN demands the singular and thus leads to a different x.

An optimal ranking reveals that the winning SPARQL query for our running example is SELECT ?proj {?proj text:query 'anti-apartheid activist'. ?proj dbo:birthPlace dbr:Mvezo.}.
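To make the feature-based ranker more tangible, the sketch below computes the four features listed above for a toy query representation and scores it against an averaged training vector via cosine similarity. The CandidateQuery record, the sparse feature encoding and the centroid values are illustrative assumptions, not HAWK's internal data structures.

```java
import java.util.*;

public final class FeatureBasedRanker {

    /** Very small stand-in for a generated hybrid SPARQL query. */
    record CandidateQuery(int fullTextTerms, List<String> predicates,
                          int triplePatterns, int typePatterns) {}

    /** Builds the sparse feature vector: NR_OF_TERMS, NR_OF_CONSTRAINTS, NR_OF_TYPES, PREDICATES. */
    static Map<String, Double> features(CandidateQuery q) {
        Map<String, Double> v = new HashMap<>();
        v.put("NR_OF_TERMS", (double) q.fullTextTerms());
        v.put("NR_OF_CONSTRAINTS", (double) q.triplePatterns());
        v.put("NR_OF_TYPES", (double) q.typePatterns());
        for (String p : q.predicates()) v.merge("PRED:" + p, 1.0, Double::sum);
        return v;
    }

    /** Cosine similarity over the union of feature keys. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String k : keys) {
            double x = a.getOrDefault(k, 0.0), y = b.getOrDefault(k, 0.0);
            dot += x * y; na += x * x; nb += y * y;
        }
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // A hypothetical centroid averaged from the best-scoring training queries.
        Map<String, Double> centroid = Map.of(
                "NR_OF_TERMS", 2.0, "NR_OF_CONSTRAINTS", 2.0, "NR_OF_TYPES", 0.2,
                "PRED:dbo:birthPlace", 0.6);

        CandidateQuery q = new CandidateQuery(2, List.of("dbo:birthPlace"), 2, 0);
        System.out.printf("score = %.3f%n", cosine(features(q), centroid));
    }
}
```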
3 Evaluation and Discussion

The QALD-5 benchmark provides a training and a test dataset for question answering, each containing a subset of hybrid benchmark questions. HAWK is currently only capable of answering questions demanding a single URI or a set of URIs from the target knowledge base. Moreover, questions depending on Yago ontology⁵ types cannot be answered. Thus, the QALD-5 dataset contains 26 training questions, respectively 8 test questions, suitable for the current implementation of HAWK. Using the online available evaluation tool⁶, Table 1 shows the results for the training and test dataset as well as for all three ranking approaches. Please note that for the feature-based ranker the training data was taken from QALD-4.

Table 1: Results of the QALD-5 challenge for different ranking algorithms. Numbers in brackets show for how many questions HAWK generated at least one result set.

  Dataset             Optimal Ranking       Feature-based Ranking   Overlap-based Ranking
  QALD-5 - training   0.30 (15 out of 26)   0.06 (22 out of 26)     0.08 (22 out of 26)
  QALD-5 - test       0.10 (1 out of 10)    0.10 (3 out of 10)      0.10 (3 out of 10)

As can be seen in Table 1, the implemented ranking functions do not reach the performance of the optimal ranking. Moreover, none of them is able to discard whole queries, i.e., they are currently not aware of the possibility that the correct answer might not have been retrieved at all, whereas the optimal ranker discards questions with incorrect answer sets.

Table 2 shows the achieved measures per question and algorithm in detail. Delving deeper into the results, we color-coded the single mistakes of the HAWK system. Yellow indicates that the ranking algorithms are not able to retrieve the correct SPARQL query out of the set of generated SPARQL queries. Orange cells point out that one ranking algorithm performs worse than the other. Red indicates the inability to retrieve correct answer sets, i.e., HAWK is able to generate a set of SPARQL queries but none of them retrieves a correct answer set. Finally, blue rows describe questions for which HAWK is unable to generate even one SPARQL query; the semantics of those questions cannot be captured by the system yet due to missing surface forms for individuals, classes and properties or missing indexed full-text information. Notably, on the test dataset all three ranking algorithms are only able to generate one correct answer, while the feature-based and the overlap-based ranker perform differently on the training dataset. To experience the run times for single queries, please visit our demo at http://hawk.aksw.org.

⁵ http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/
⁶ http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=evaltool&q=5

4 Conclusion

In this paper, we briefly introduced HAWK, the first hybrid QA system for the Web of Data, and analysed its performance on the QALD-5 challenge. We showed that by using a generic approach to generate SPARQL queries from predicate-argument structures, HAWK is able to achieve an F-measure of up to 0.3 on the QALD-5 training benchmark. These results demand novel approaches for capturing the semantics of natural language questions with hybrid SPARQL queries. Our work on HAWK, however, revealed several other open research questions, of which the most important lies in finding the correct ranking approach to map a predicate-argument tree to a possible interpretation. So far, our experiments reveal that merely finding the right features for this endeavor remains a challenging problem. Finally, several components of the HAWK pipeline are computationally very complex. Finding more time-efficient algorithms for these steps will be addressed in future work.

Acknowledgments

This work has been supported by the FP7 project GeoKnow (GA No. 318159) and the BMBF project SAKE.

References

1. J. D. Choi and M. Palmer. Getting the most out of transition-based dependency parsing. In ACL, pages 687–692, 2011.
2. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.
3. R. Speck and A.-C. Ngonga Ngomo. Ensemble learning for named entity recognition. In ISWC, 2014.
4. R. Usbeck. Combining Linked Data and statistical information retrieval. In 11th ESWC, PhD Symposium, 2014.
5. R. Usbeck, A.-C. Ngonga Ngomo, L. Bühmann, and C. Unger. HAWK – Hybrid Question Answering over Linked Data. In ESWC, 2015.

Table 2: Detailed results of HAWK at the QALD-5 challenge. Cells give Recall / Precision / F1 per ranking approach; "-" marks questions for which HAWK could not generate a SPARQL query.

  ID  Question | Feature-based R/P/F1 | Overlap-based R/P/F1 | Optimal R/P/F1
  301 Who was vice-president under the president who authorized atomic weapons against Japan during World War II? | 0 / 0 / 0 | 0 / 0 / 0 | 1 / 1 / 1
  303 Which anti-apartheid activist was born in Mvezo? | 0 / 0 / 0 | 1 / 0.5 / 0.67 | 1 / 0.5 / 0.67
  305 Which recipients of the Victoria Cross died in the Battle of Arnhem? | 0 / 0 / 0 | 0.5 / 0.33 / 0.4 | 0.5 / 0.5 / 0.5
  306 Where did the first man in space die? | 0 / 0 / 0 | 0 / 0 / 0 | 1 / 1 / 1
  308 Which members of the Wu-Tang Clan took their stage name from a movie? | 0.5 / 0.08 / 0.14 | 1 / 0.17 / 0.29 | 0.5 / 0.5 / 0.5
  309 Which writers had influenced the philosopher that refused a Nobel Prize? | 0 / 0 / 0 | 0 / 0 / 0 | 1 / 1 / 1
  311 Who composed the music for the film that depicts the early life of Jane Austin? | - | - | -
  314 Which horses did The Long Fellow ride? | 0 / 0 / 0 | 0 / 0 / 0 | 0.86 / 1 / 0.92
  315 Of the people that died of radiation in Los Alamos, whose death was an accident? | 0 / 0 / 0 | 0 / 0 / 0 | 0.5 / 1 / 0.67
  316 Which building owned by the Bank of America was featured in the TV series MegaStructures? | - | - | -
  317 Which buildings in art deco style did Shreve, Lamb and Harmon design? | 1 / 0.33 / 0.5 | 0 / 0 / 0 | 1 / 0.33 / 0.5
  318 Which birds are protected under the National Parks and Wildlife Act? | 0.67 / 0.67 / 0.67 | 0.67 / 0.67 / 0.67 | 0.67 / 0.67 / 0.67
  319 Which country did the first known photographer of snowflakes come from? | 0 / 0 / 0 | 0 / 0 / 0 | 1 / 1 / 1
  320 List all the battles commanded by the lover of Cleopatra. | 0 / 0 / 0 | 0 / 0 / 0 | 0.23 / 0.42 / 0.29
  322 Which actress starring in the TV series Friends owns the production company Coquette Productions? | 0 / 0 / 0 | 0 / 0 / 0 | 1 / 1 / 1
  323 Gaborone is the capital of which country member of the African Union? | 1 / 1 / 1 | 1 / 1 / 1 | 1 / 1 / 1
  326 For which movie did the daughter of Francis Ford Coppola receive an Oscar? | - | - | -
  327 Which city does the first person to climb all 14 eight-thousanders come from? | - | - | -
  328 At which college did the only American actor that received the César Award study? | - | - | -
  332 What is the native city of Hollywood's highest-paid actress? | - | - | -
  333 In which city does the former main presenter of the Xposé girls live? | - | - | -
  334 Who plays Phileas Fogg in the adaptation of Around the World in 80 Days directed by Buzz Kulik? | 0 / 0 / 0 | 0 / 0 / 0 | 1 / 1 / 1
  335 Who is the front man of the band that wrote Coffee & TV? | - | - | -
  336 Which Chinese-speaking country is a former Portuguese colony? | - | - | -
  337 What is the largest city in the county in which Faulkner spent most of his life? | - | - | -
  340 A landmark of which city is the home of the Mona Lisa? | - | - | -
  51  Where was the "Father of Singapore" born? | - | - | -
  52  Which Secretary of State was significantly involved in the United States' dominance of the Caribbean? | - | - | -
  53  Who is the architect of the tallest building in Japan? | - | - | -
  55  In which city were Charlie Chaplin's half brothers born? | - | - | -
  56  Which German mathematicians were members of the von Braun rocket group? | - | - | -
  57  Which writers converted to Islam? | - | - | -
  59  Which movie by the Coen brothers stars John Turturro in the role of a New York City playwright? | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00
  60  Which of the volcanoes that erupted in 1550 is still active? | - | - | -