
An Italian Question Answering System Based on Grammars Automatically Generated from Ontology Lexica

Gennaro Nolano
Mohammad Fazleh Elahi (0), melahi@techfak.uni-bielefeld.de
Maria Pia di Buono (2), mpdibuono@unior.it
Basil Ell (0, 1), bell@techfak.uni-bielefeld.de
Philipp Cimiano (0), cimiano@techfak.uni-bielefeld.de

0. Cognitive Interaction Technology Center, Bielefeld University, Germany
1. Department of Informatics, University of Oslo
2. UniOr NLP Research Group, University of Naples "L'Orientale", Italy

This paper presents an Italian question answering system over linked data. We use a model-based approach to question answering based on an ontology lexicon in lemon format. The system exploits an automatically generated lexicalized grammar that is used to interpret questions and transform them into SPARQL queries. We apply the approach to the Italian language and implement a question answering system that can answer more than 1.6 million questions over the DBpedia knowledge graph.

1 Introduction

As the amount of linked data published on the Web keeps increasing, there is a growing demand for multilingual tools and user interfaces that simplify access to and browsing of data by end users, so that information can be explored intuitively. This need has motivated the development of tools such as Question Answering (QA) systems, whose main aim is to enable users to explore complex datasets and an ever-growing amount of data through natural language.

While machine learning systems have recently become the default approach for many NLP tasks, the use of such approaches (Chakraborty et al., 2019) for QA over RDF data suffers from a lack of controllability, which makes the governance and incremental improvement of the system challenging, not to mention the initial effort of collecting and providing training data for a specific language.

An alternative is the so-called model-based approach to QA, in which a model is first used to specify how concepts and relations are realized in natural language, and then this specification is employed to interpret questions from users. One such system is the one proposed by Benz et al. (2020), which makes use of a lexicon in lemon format (McCrae et al., 2011) to specify how the vocabulary elements of an ontology or knowledge graph (e.g., entities and relations) are realized in natural language.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Previous work on this approach shows how, by leveraging lemon lexica, question answering grammars can be automatically generated and, in turn, used to interpret questions and parse them into SPARQL queries. A QA web application developed in previous work (Elahi et al., 2021) has further shown that such QA systems can scale to large numbers of questions and that the performance of the system is practically real-time from an end-user perspective.

In this work we describe the extension to the Italian language of the model-based approach described in Benz et al. (2020) and the QA system described in Elahi et al. (2021). By doing so, we develop a QA system that can answer more than 1.6 million Italian questions over the DBpedia knowledge graph¹.

2 Related Work

Besides the goal of creating QA systems that are robust and have high performance, an important goal is also to develop systems that can be ported to languages other than English. The interest in other languages is, for example, explicitly stated in the Multiple Language Question Answering Track at CLEF 2003 (Magnini et al., 2004), which includes Italian among others.

One of the earlier attempts in this regard has been the DIOGENE model (Magnini et al., 2002; Tanev et al., 2004), which exploits linguistic templates and keyword recognition to answer questions over document collections. Other efforts have been made in the QALL-ME project (Cabrio et al., 2007; Cabrio et al., 2008; Ferrández et al., 2011), where a system was created for the tourism domain through an instance-based method, that is, by clustering together similar question-answer pairs.

¹ https://www.dbpedia.org/

More recently, the QuASIt model (Pipitone et al., 2016) makes use of Construction Grammar and an abstraction of cognitive processes to account for the inherent fluidity of language, while exploiting linguistic and domain knowledge (in the form of an ontology) to answer essay and multiple choice questions. Similarly, Leoni et al. (2020) built a system to answer questions regarding a specific domain using IBM Watson services and online articles as a source of information.

Systems of this kind, built to answer questions using textual information, have become increasingly common in recent years, especially since the availability of large QA datasets such as the Stanford Question Answering Dataset (SQuAD)², which make it possible to train complex deep learning models with millions of parameters (Rajpurkar et al., 2016; Rajpurkar et al., 2018). While the performance shown by these models is impressive, they suffer from major drawbacks. First, they need an extremely large dataset to be trained on, which makes porting such a system to another language extremely demanding³. Furthermore, they show a lack of controllability, in the sense that it is unclear which new examples need to be added to make a new question answerable. This makes such systems opaque and difficult to maintain.

The MULIB system (Siciliani et al., 2019) tackles the problem of answering questions in Italian over structured data. The system is based on a modified version of the automaton developed for CANaLI (Mazzeo and Zaniolo, 2016), but it employs a Word2Vec model (Mikolov et al., 2013) to allow for more flexibility in language use. In contrast to these trained approaches, we present a model that provides (i) a deeper interconnection of semantic and syntactic information through the integration of a lemon lexicon with the DBpedia ontology, and (ii) a focus on Linked Open Data as a source of knowledge.

² https://rajpurkar.github.io/SQuAD-explorer/
³ The Italian translation for SQuAD, for example, has been described in Croce et al. (2018).

3 Methodology

The architecture consists of two components: (i) the grammar generator and (ii) the QA component. The approach to grammar generation for different syntactic frames according to LexInfo (Cimiano et al., 2011) was described for the English language in previous work (Benz et al., 2020). In this paper we show that, through a simple language adaptation, we are able to adjust the system so that it also accepts questions in Italian.

In a nutshell, the grammar generation approach relies on a mapping between syntactic constructions and classes and properties from a given ontology and/or knowledge graph. This generation process makes use of several frames, each describing the linguistic realizations of specific properties that might appear in questions. The frames employed in this work are: NounPPFrame, TransitiveFrame, IntransitivePPFrame, AdjectiveAttributive and AdjectiveGradable.

For example, the (lexicalized) construction for the NounPPFrame ‘the capital of X’ can be regarded as expressing the DBpedia property dbo:capital, with Country as domain and City as range. This would lead to the generation of the following questions:
• What is the capital of X (Country)?
• Which city is the capital of X (Country)?

Similar grammar generation rules exist for transitive constructions (TransitiveFrame), for constructions involving an intransitive verb with a prepositional complement (IntransitivePPFrame), and for adjective constructions in attributive (AdjectiveAttributive) and predicate form (AdjectiveGradable).
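The step from a lexicalized construction to question rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `NounPPEntry` class, its field names and the shape of the SPARQL template are assumptions made for illustration; only the property `dbo:capital`, the classes and the two generated questions come from the example above.

```python
from dataclasses import dataclass


@dataclass
class NounPPEntry:
    """Hypothetical, simplified stand-in for a lemon NounPPFrame entry."""
    noun: str          # relational noun, e.g. "capital"
    preposition: str   # preposition heading the complement, e.g. "of"
    prop: str          # ontology property, e.g. "dbo:capital"
    domain_class: str  # class of the placeholder X, e.g. "Country"
    range_label: str   # label of the answer class, e.g. "city"


def generate_rules(entry: NounPPEntry) -> list[tuple[str, str]]:
    """Return (question template, SPARQL template) pairs for one entry."""
    # Assumed query shape: X is bound to the entity mentioned in the question.
    sparql = f"SELECT ?answer WHERE {{ ?x {entry.prop} ?answer }}"
    phrase = f"the {entry.noun} {entry.preposition} X ({entry.domain_class})"
    questions = [
        f"What is {phrase}?",
        f"Which {entry.range_label} is {phrase}?",
    ]
    return [(question, sparql) for question in questions]


rules = generate_rules(
    NounPPEntry("capital", "of", "dbo:capital", "Country", "city")
)
for question, query in rules:
    print(question, "->", query)
```

Each generated question carries the same query template, so adding one lexical entry extends the set of answerable questions in a predictable way.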

In the context of this work, we adapted the generation of rules to the Italian language, without extending or modifying the existing types of constructions⁴.

In adapting the grammar generation to Italian, we had to accommodate the following language-specific properties:
• Sentence order: in sentences starting with an interrogative pronoun, the subject has to be placed at the end of the sentence, e.g., Dove si trova Vienna? (Where is Vienna?);
• The presence of auxiliary verbs, either avere (have) or essere (be), in compound tenses;
• Interrogative pronoun rules, e.g., chi (who) is invariable and refers only to people;
• The use of interrogative adjectives, e.g., quale (which);
• The use of different prepositions, either simple or articulated, on the basis of range/domain semantics (e.g., toponyms might require different prepositions);
• The presence of a determiner/articulated preposition on the basis of range/domain semantics (e.g., toponyms are preceded by a determiner when the noun refers to a country).

Consider the lemon lexical entry in Figure 1⁵ for the relational noun ‘capitale della’. The entry states that the canonical written form of the entry is “capitale”. It states that the entry has a NounPPFrame as syntactic behaviour, that is, it corresponds to a copulative construction X è la capitale della Y with two arguments, where copulativeArg corresponds to the copula subject X and the prepositional adjunct corresponds to the prepositional object Y.

⁴ The code for our grammar generation for Italian is available at https://github.com/fazleh2010/question-grammar-generator
⁵ In this paper we abbreviate URIs with the namespace prefixes dbo, dbp, lemon, and lexinfo, which can be expanded into http://dbpedia.org/ontology/, http://dbpedia.org/property/, https://lemon-model.net/lemon#, and http://www.lexinfo.net/ontology/2.0/lexinfo#, respectively.
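To make the point about articulated prepositions concrete, the following Python sketch fuses a simple Italian preposition with a definite article (e.g., di + la = della). It is only an illustration of the linguistic rule, not code from the grammar generator: the function names are hypothetical, and the article is passed in explicitly rather than derived from the noun's gender, number and initial sound.

```python
# Stems taken by the simple prepositions when they fuse with an article.
STEMS = {"di": "de", "a": "a", "da": "da", "in": "ne", "su": "su"}

# Endings contributed by each definite article.
SUFFIXES = {
    "il": "l", "lo": "llo", "la": "lla", "l'": "ll'",
    "i": "i", "gli": "gli", "le": "lle",
}


def articulate(preposition: str, article: str) -> str:
    """Fuse a simple preposition and a definite article."""
    return STEMS[preposition] + SUFFIXES[article]


def prepositional_phrase(preposition: str, article: str, noun: str) -> str:
    """Build the phrase, with no space after an elided form such as dell'."""
    fused = articulate(preposition, article)
    separator = "" if fused.endswith("'") else " "
    return f"{fused}{separator}{noun}"


print(prepositional_phrase("di", "la", "Germania"))  # country name with article
print(prepositional_phrase("in", "il", "1563"))      # year with article
```

A full implementation would also need to decide when a toponym takes an article at all, which is the range/domain-dependent choice described in the list above.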

We give examples for the different syntactic frames below to illustrate the behaviour of the Italian grammar generation.

NounPPFrame Assuming that the corresponding lemon lexicon models the NounPP construction capitale della (capital of) as referring to the property dbo:capital with domain Country and range City, we can automatically generate questions such as:
1. Qual è la capitale della (What is the capital of) (X | Country NP)?
2. Quale città è la capitale della (Which city is the capital of) (X | Country NP)?
where X is a placeholder that can be filled with a particular country, e.g., Germania (Germany), or with a noun phrase, e.g., paese dove si parla tedesco (the country where German is spoken).

TransitiveFrame Assuming that the lemon lexicon captures the meaning of the construction X ‘scrive’ (write) Y as referring to the property dbp:author, with Song as domain and Person as range, the following questions would be covered by an automatically generated grammar:
1. Chi ha scritto (Who wrote) (X | Song NP)?
2. Quale cantante ha scritto (Which singer wrote) (X | Song NP)?
3. Quale (Which) (X | Song NP) è stata scritta da (was written by) (Y | Person NP)?

IntransitivePPFrame Assuming that the lemon lexicon captures the meaning of the construction ‘X pubblicare nel Y’ (‘X published in Y’) as a representation of the property dbp:published, with Song as its domain and Date as its range, the following questions would be generated:
1. Quando è stata pubblicata (X | Song NP)? (When was (X | Song NP) published?)
2. Quale (X | Song NP) è stata pubblicata nel (Y | date)? (Which (X | Song NP) was published in (Y | date)?)
3. In quale data è stata pubblicata (X | Song NP)? (On which date was (X | Song NP) published?)

Table 1
Frame | Syntactic Pattern | Question Sample
NounPP | WDT/WP V* DT [noun] IN DT [domain] | Qual è la capitale della Germania?
NounPP | WDT dbo:range V* DT [noun] IN [domain]? | Quale città è la capitale della Germania?
NounPP | WDT/WP V* DT [noun] IN [domain] | Chi era la moglie di Abraham Lincoln?
NounPP | [range] V* DT [noun] IN (DT) [domain] | Rita Wilson è la moglie di Tom Hanks?
AdjectiveAttributive | WDT V* DT dbo:range [adjective] | Chi era un vescovo cristiano spagnolo?
AdjectiveAttributive | [domain] VB (DT) [adjective] | Barack Obama è un democratico?
AdjectiveGradable | WRB V* [adjective] DT [domain] | Quanto è lungo il Barguzin?
AdjectiveGradable | WDT V* DT [domain] JJS IN (DT) [range] | Qual è la montagna più alta della Germania?
Transitive | WP V* [domain] | Chi ha scritto Ziggy Stardust?
Transitive | WDT dbo:range V* [domain] | Quale cantante ha scritto Ziggy Stardust?
Transitive | WP V* DT [domain] | Chi ha fondato C&A?
Transitive | WDT dbo:range V* DT [domain] | Quale persona ha fondato C&A?
Transitive | [domain] V* [range] | Socrate ha influenzato Aristotele?
IntransitivePP | WRB VB [domain] | Quando è iniziata l’operazione Overlord?
IntransitivePP | IN WDT dbo:domain VB [range] | In quale data è iniziata l’operazione Overlord?
IntransitivePP | WDT dbo:domain VB [range] |
IntransitivePP | WDT dbo:domain VB IN [range] | Quale libro è stato pubblicato nel 1563?
IntransitivePP | [domain] V* IN [range] | Il libro dei martiri di Foxe è stato pubblicato nel 1563?

AdjectiveAttributive and AdjectiveGradable

Assuming that the lemon lexicon captures the meaning of the (gradable) adjective lungo (long) as referring to the ontological property dbp:length, the grammar generation approach would generate the following types of questions:
1. Quanto è lungo il (How long is the) (X | River NP)?
2. Qual è il fiume più lungo (del mondo, del Kentucky)? (What is the longest river in (the world, Kentucky)?)
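One plausible way to interpret a superlative question such as Qual è il fiume più lungo del mondo? is a query that orders instances of the class by the gradable property and keeps the first result. The Python sketch below builds such a query. The query shape is an assumption made for illustration and is not taken from the paper, as are the function name and the choice of full URIs for the River class and the length property.

```python
def superlative_query(class_uri: str, property_uri: str,
                      descending: bool = True) -> str:
    """Build a SPARQL query returning the instance with the extreme value."""
    order = "DESC" if descending else "ASC"
    return (
        f"SELECT ?x WHERE {{ ?x a <{class_uri}> ; <{property_uri}> ?value }} "
        f"ORDER BY {order}(?value) LIMIT 1"
    )


# "più lungo" (longest) maps to descending order over the length property.
query = superlative_query(
    "http://dbpedia.org/ontology/River",
    "http://dbpedia.org/property/length",
)
print(query)
```

Restricting the superlative to a region (del Kentucky) would require an additional triple pattern linking the river to that region, which is left out here because the relevant property is not named in the text.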

The rules implemented for the generation of Italian questions are shown in further detail in Table 1. In particular, we use the tagset⁶ from the Penn Treebank Project (Marcus et al., 1993), with V* denoting all possible forms of a given verb, words in brackets denoting nouns/verbs/adjectives that realize a specific property, and dbo:range/dbo:domain denoting the possible labels that may represent classes (e.g., dbo:Country might be represented by either paese or stato).

⁶ https://www.sketchengine.eu/english-treetagger-pipeline-2/

4 Results

We applied our system to the DBpedia dataset and manually created a lemon lexicon comprising 249 lexical entries⁷. Table 2 shows the number of grammar rules and questions generated for each syntactic type. Altogether, the approach generates 620 grammar rules and about 1.6 million questions. The web-based demonstration is available online⁸.
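The large gap between the number of rules and the number of questions presumably arises because each rule contains placeholders that can be filled with many labels. The Python sketch below illustrates this multiplication for a single rule; the function name, the template syntax and the short label lists are hypothetical and chosen only for illustration.

```python
from itertools import product


def enumerate_questions(template: str, class_labels: list[str],
                        entity_labels: list[str]) -> list[str]:
    """Instantiate one rule with every class label and entity label."""
    return [
        template.format(cls=cls, entity=entity)
        for cls, entity in product(class_labels, entity_labels)
    ]


template = "Quale {cls} è la capitale della {entity}?"
questions = enumerate_questions(
    template,
    class_labels=["città"],
    entity_labels=["Germania", "Francia", "Spagna"],
)
for question in questions:
    print(question)
```

The number of questions per rule is the product of the sizes of the label lists, so a few hundred rules over a knowledge graph with many entities can yield a very large question set.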

We used the training set of multilingual QALD-7⁹ to evaluate our approach. QALD-7 contains a total of 214 questions over linked data, covering more relations than the ones we had considered so far. To overcome this issue, a total of 109 entries were added to our system (22 NounPPFrame, 41 TransitiveFrame, 41 IntransitivePPFrame, 1 AdjectiveAttributive and 4 AdjectiveGradable).

⁷ https://scdemo.techfak.uni-bielefeld.de/quegg-resources/
⁸ https://webtentacle1.techfak.uni-bielefeld.de/quegg/

Table 3: Precision, Recall, F-Measure
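For reference, precision, recall and F-measure for a single question can be computed from the gold and returned answer sets as in the Python sketch below. These are the standard definitions; how the scores are aggregated over the QALD-7 questions is not stated here, and the answer sets in the usage example are made up for illustration.

```python
def precision_recall_f1(gold: set[str],
                        predicted: set[str]) -> tuple[float, float, float]:
    """Standard set-based precision, recall and F1 for one question."""
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up answer sets: one of two returned answers is correct.
scores = precision_recall_f1({"a", "b"}, {"a", "c"})
print(scores)
```

A system that answers few questions but answers them correctly scores high precision and low recall under these definitions, which is the pattern discussed for the evaluation.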

The results of the evaluation (Table 3) show fairly satisfactory precision but low recall. The main reason for this is the presence of different types of questions in QALD. Indeed, besides single-triple questions, QALD also contains complex questions referring to more than one triple, e.g., A quale movimento artistico apparteneva il pittore de I tre ballerini? (What was the artistic movement of the author of The Three Dancers?), which are not yet covered by our model. Nevertheless, when taking into account all the questions in QALD-7, our system recognizes 46.98% (101 questions) of the total set of questions.

5 Conclusion and Future Work

We presented an approach to developing Italian QA systems over linked data that relies on the automatic generation of grammars from corresponding lemon lexica describing how elements of the dataset are realized in natural language. The approach is controllable, since the introduction of a lexical entry increases the question coverage in a fully predictable way. Our proof-of-concept implementation over DBpedia covers 1.6 million questions generated from 249 lemon entries.

In future work, we intend to further automate grammar generation by using LexExMachina (Ell et al., 2021), which induces lexicon entries bridging the gap between ontology and natural language from a corpus in an unsupervised manner.

Acknowledgments

This work has been funded by the European Commission under grant 825182 (Prêt-à-LLOD) as well as the Nexus Linguarum COST Action. M.P. di Buono has been partially supported by Programma Operativo Nazionale Ricerca e Innovazione 2014-2020 - Fondo Sociale Europeo, Azione I.2 “Attrazione e Mobilità Internazionale dei Ricercatori” Avviso D.D. n 407 del 27/02/2018. B. Ell has been partially supported by the SIRIUS centre: Norwegian Research Council project No 237898.

⁹ https://github.com/ag-sc/QALD

References

Viktoria Benz, Philipp Cimiano, Mohammad Fazleh Elahi, and Basil Ell. 2020. Generating Grammars from lemon lexica for Questions Answering over Linked Data: a Preliminary Analysis. In NLIWOD workshop at ISWC, volume 2722, pages 40-55.

Elena Cabrio, Bonaventura Coppola, Roberto Gretter, Milen Kouylekov, Bernardo Magnini, and Matteo Negri. 2007. Question answering based annotation for a corpus of spoken requests. In Proceedings of the workshop on the Semantic Representation of Spoken Language, volume 31.

Elena Cabrio, Milen Kouylekov, Bernardo Magnini, Matteo Negri, Laura Hasler, Constantin Orasan, David Tomás, Jose Luis Vicedo, Guenter Neumann, and Corinna Weber. 2008. The QALL-ME benchmark: a multilingual resource of annotated spoken requests for question answering. In LREC'08.

Nilesh Chakraborty, Denis Lukovnikov, Gaurav Maheshwari, Priyansh Trivedi, Jens Lehmann, and Asja Fischer. 2019. Introduction to Neural Network based Approaches for Question Answering over Knowledge Graphs. CoRR, abs/1907.09361.

Philipp Cimiano, Paul Buitelaar, John P. McCrae, and Michael Sintek. 2011. LexInfo: A declarative model for the lexicon-ontology interface. JWS, 9(1):29-51.

Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answering in italian. In AI*IA 2018, pages 389-402.

Mohammad Fazleh Elahi, Basil Ell, Frank Grimm, and Philipp Cimiano. 2021. Question Answering on RDF Data based on Grammars Automatically Generated from Lemon Models. In SEMANTiCS Conference, Posters and Demonstrations.

Basil Ell, Mohammad Fazleh Elahi, and Philipp Cimiano. 2021. Bridging the Gap Between Ontology and Lexicon via Class-Specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus. In LDK 2021, pages 33:1-33:21.

Óscar Ferrández, Christian Spurk, Milen Kouylekov, Iustin Dornescu, Sergio Ferrández, Matteo Negri, Rubén Izquierdo, David Tomás, Constantin Orasan, Guenter Neumann, Bernardo Magnini, and Jose Luis Vicedo. 2011. The QALL-ME Framework: A specifiable-domain multilingual Question Answering architecture. Journal of Web Semantics, 9(2):137-145.

Chiara Leoni, Ilaria Torre, and Gianni Vercelli. 2020. ConversIAmo: Improving Italian Question Answering Exploiting IBM Watson Services. In Text, Speech, and Dialogue, pages 504-512.

Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Mining Knowledge from Repeated Co-Occurrences: DIOGENE at TREC 2002.

Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Víctor Peinado, Felisa Verdejo, and Maarten de Rijke. 2004. The Multiple Language Question Answering Track at CLEF 2003. In Comparative Evaluation of Multilingual Information Access Systems, pages 471-486.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313-330.

Giuseppe M. Mazzeo and Carlo Zaniolo. 2016. Answering controlled natural language questions on RDF knowledge bases. In EDBT, pages 608-611.

John P. McCrae, Dennis Spohr, and Philipp Cimiano. 2011. Linking Lexical Resources and Ontologies on the Semantic Web with Lemon. In ESWC Conference, pages 245-259.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26.

Arianna Pipitone, Giuseppe Tirone, and Roberto Pirrone. 2016. QuASIt: A Cognitive Inspired Approach to Question Answering for the Italian Language. volume 10037, pages 464-476.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR, abs/1606.05250.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. CoRR, abs/1806.03822.

Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro, and Matteo Mennitti. 2019. An italian question answering system for structured data based on controlled natural languages. In CLiC-it.

Hristo Tanev, Matteo Negri, Bernardo Magnini, and Milen Kouylekov. 2004. The DIOGENE question answering system at CLEF-2004. volume 3491, pages 435-445.