=Paper=
{{Paper
|id=Vol-1690/paper3
|storemode=property
|title=QA4LOV: A Natural Language Interface to Linked Open Vocabulary
|pdfUrl=https://ceur-ws.org/Vol-1690/paper3.pdf
|volume=Vol-1690
|authors=Ghislain Auguste Atemezing,Pierre-Yves Vandenbussche
|dblpUrl=https://dblp.org/rec/conf/semweb/AtemezingV16
}}
==QA4LOV: A Natural Language Interface to Linked Open Vocabulary==
QA4LOV: A Natural Language Interface to Linked Open Vocabulary Ghislain Auguste Atemezing1 , Pierre-Yves Vandenbussche2 1 MONDECA, 35 Boulevard de Strasbourg, Paris, France. 2 Fujitsu, Galway, Ireland. ghislain.atemezing@mondeca.com pierre-yves.vandenbussche@ie.fujitsu.com Abstract. There is an increasing presence of structured data due to the adoption of Linked data principles on the web. At the same time, web users have different skills and want to be able to interact with Linked datasets in various manner, such as asking questions in natural language. This paper proposed a first implementation of Query Answering system (QA) applied to the Linked Open Vocabularies (LOV) catalogue, mainly focused on metadata information retrieval. The goal is to provide to end users yet another access to metadata information available in LOV using natural language questions. Keywords: Question Answering, Vocabulary Catalogue, data usage, user ex- perience 1 Introduction The recent years have seen the adoption of semantic web in many domains, generating a mass of structured data available in RDF. Linked data has con- tributed to interlink different datasets across domains, where publishers use best practices for producing interoperable datasets by reusing ontologies and making alignments. However, users need to have a minimal knowledge of SPARQL language and RDF skills to query RDF datasets. It is one barrier that is overcome by Ques- tion Answering (QA) systems, which directly take as input questions in natural language. The need for more advanced tools and QA systems that operate over large repositories of Linked Data has also been the motivation for the QALD Question Answering over Linked Data (QALD) series of workshops [1]. Vocabulary catalogues are special datasets where classes and properties are used to model and generate datasets available in the Linked Data space. Most vocabulary catalogues provide terms search and APIs to access their datasets. LOV provides five types of search methods: metadata search, ontology search, APIs access, dump file in RDF and SPARQL endpoint access [3]. This paper presents a prototype for a vocabulary backed question answer- ing system that can transform natural language questions into SPARQL queries, 2 Ghislain Auguste Atemezing, Pierre-Yves Vandenbussche thus giving the end users access to the information stored in vocabulary reposito- ries. The paper is structured as follows: Section 2 describes the system, followed by the set of questions in Section 3. An evaluation is presented in Section 4 and a short conclusion and future work in Section 5. 2 System Description The system receives as input a question formulated in English and outputs the query that will retrieve the answer to the question from the LOV catalogue. The architecture of the system is illustrated in Fig. 1 and a screenshot of the live demo in Fig. 2. The system is available at http://lov.okfn.org/dataset/lov/qa. Fig. 1: System architecture Fig. 2: A screenshot showing the system answering to the contributors of adms The implementation uses the Quepy tool from Machinalis 3 . The POS tagset used by Quepy is the Penn Tagset [2]. First, regular expressions are defined to match the natural language questions and transform them into an abstract semantic representation. Then, specific templates are defined for the system to handle users’ questions. To handle regular expressions, Quepy uses the refo library4 which work with regular expressions as objects. 3 https://github.com/machinalis/quepy/ 4 https://github.com/machinalis/refo QA4LOV: A Natural Language Interface to Linked Open Vocabulary 3 A vocabulary is defined by a fixed relation voaf:Vocabulary and a POS associated to a vann:preferredNamespacePrefix. LOV uses a unique prefix to identify namespaces, which is a string from length 2 to length 17, although the recommendation for publishers is to use a prefix with less than 10 characters. Our approach is based on attaching a regex to vocabulary represented by its prefix. The syntactic processor is based on regular expressions using POS terms. As a vocabulary is identified by its prefix, we use the following syntactic patterns: NN, NNS, FW, DT, JJ and VBN. Each question from Q1 to Q14 is associated to a unique template. After, when a prefix is recognized, the semantic interpreter uses fixed relations with the English tag, which are properties in RDF triple pattern. The code below shows a regex for the contributors of any vocabulary: regex1 = Pos("WP") + Lemma("be") + Pos("DT") + Lemma("contributor") + Pos("IN") + Vocabulary() 3 Types of Questions LOV is a highly curated observatory of the semantic vocabularies ecosystem, with the aim of promoting the reuse of well-documented vocabularies in the Linked Data space. Each version of a vocabulary in LOV contains relevant meta- data information which can be discovered by agents. Additionally, users could be interested in other interesting facts, such as the number of versions, the number of datasets using the vocabulary, the number of external vocabularies reusing the vocabulary and the category to which belong the vocabulary. A first set of 14 templates can be handled by the prototype, covering different type of metadata information available in a vocabulary. Table 1 shows the list of questions, where in the column “Template”, [be] can be either the present or the past form of the verb, and [vocab] is the preferred prefix of the vocabulary. The regex uses lemmas to combine different types of the questions accordingly. 4 System Evaluation The system allows users to interact with the LOV catalogue through the an- swers generated. Depending on the types of the results (e.g., agents, versions, categories), the system allows users to further explore the dataset with more interactions. All the questions which generated SPARQL query gives satisfac- tory results. The most challenging issue is to determine the most suitable POS that cover all the vocabulary prefixes. For example, out of 5285 vocabularies in LOV, 13 of them contain an hyphen and the system can not generate query (e.g., 5 This number corresponds to the total number of vocabularies inserted in LOV as of January, 8th 2016. 4 Ghislain Auguste Atemezing, Pierre-Yves Vandenbussche Table 1: Questions in natural language for retrieving metadata information in a vocabulary catalog. ID Template Sample Question Q1 What [be] [vocab]? What is prov? Q2 Where [be] [vocab] from? Where is foaf from? Q3 How old [be] [vocab]? How old is prov? Q4 When [be] [vocab] release? When was voaf release? Q5 Who [be] the contributors of [vocab]? Who are the contributors of adms? Q6 When [be] [vocab] last update? When was schema last update? Q7 What [be] the versions of [vocab]? What are the versions of adms? Q8 What [be] the languages of [vocab]? What are the languages of dcat? Q9 Where to find [vocab] documentation? Where to find foaf documentation? Q10 How many vocabularies reuse [vocab]? How many vocabularies reuse adms? Q11 How many datasets use [vocab]? How many datasets use adms? Q12 What [be] the namespace of [vocab]? What is the namespace of dcterms? Q13 What [be] the title of [vocab]? What is the title of foaf ? Q14 What [be] the category of [vocab]? What is the category of dcterms? elseweb-modelling). Moreover, 21 prefixes contain a number (e.g., g50k) and 3 special cases (homeActivity, LiMo, and juso.kr)and are not currently covered by the system. Currently, the system handles 92,99% of the prefixes in LOV. 5 Conclusion and Future Work In this paper, we have presented a prototype system for answering a set of ques- tions in natural language backed by a vocabulary catalogue. Accessing LOV dataset by this system will hugely help lay users without SPARQL skills to in- teract more with the catalogue, and improve also the quality of the metadata by ontology publishers. The implementation uses the LOV dataset in RDF and the Quepy tool. The first results show that the system handles 92,99% of the metadata of vocabularies in the LOV catalogue. We plan to extend the types of queries to more complex ones. Moreover, we can use various semantic relation- ships in LOV to do query expansion by using for instance sub-properties. References 1. V. Lopez, C. Unger, P. Cimiano, and E. Motta. Evaluating question answering over linked data. Web Semantics Science Services And Agents On The World Wide Web, 21:3–13, 2013. 2. M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993. 3. P.-Y. Vandenbussche, G. A. Atemezing, M. Poveda-Villalónc, and B. Vatant. LOV: a gateway to reusable semantic vocabularies on the Web. Semantic Web Journal, 2015. http://www.semantic-web-journal.net/system/files/swj974.pdf.