An Italian Question Answering System Based on Grammars Automatically Generated from Ontology Lexica

Gennaro Nolano1, Mohammad Fazleh Elahi2, Maria Pia di Buono1, Basil Ell2,3 and Philipp Cimiano2
1. UniOr NLP Research Group, University of Naples "L'Orientale", Italy
2. Cognitive Interaction Technology Center, Bielefeld University, Germany
3. Department of Informatics, University of Oslo
{gnolano,mpdibuono}@unior.it, {melahi,bell,cimiano}@techfak.uni-bielefeld.de

Abstract

This paper presents an Italian question answering system over linked data. We use a model-based approach to question answering based on an ontology lexicon in lemon format. The system exploits an automatically generated lexicalized grammar that can then be used to interpret questions and transform them into SPARQL queries. We apply the approach to the Italian language and implement a question answering system that can answer more than 1.6 million questions over the DBpedia knowledge graph.

1 Introduction

As the amount of linked data published on the Web keeps increasing, there is a growing demand for multilingual tools and user interfaces that simplify access to and browsing of data by end users, so that information can be explored in an intuitive way. This need has motivated the development of tools such as Question Answering (QA) systems, whose main aim is to enable users to explore complex datasets and an ever-growing amount of data in an intuitive way, through natural language.

While machine learning systems have recently become the default approach for many NLP tasks, the use of such approaches (Chakraborty et al., 2019) for QA over RDF data suffers from a lack of controllability, making the governance and incremental improvement of the system challenging, not to mention the initial effort of collecting and providing training data for a specific language.

An alternative is the so-called model-based approach to QA, in which a model is first used to specify how concepts and relations are realized in natural language, and this specification is then employed to interpret questions from users. One such system is the one proposed by Benz et al. (2020), which makes use of a lexicon in lemon format (McCrae et al., 2011) to specify how the vocabulary elements of an ontology or knowledge graph (e.g., entities and relations) are realized in natural language.

Previous work on this approach shows how, leveraging lemon lexica, question answering grammars can be automatically generated and, in turn, used to interpret questions and parse them into SPARQL queries. A QA web application developed in previous work (Elahi et al., 2021) has further shown that such QA systems can scale to large numbers of questions and that the performance of the system is practically real-time from an end-user perspective.

In this work we describe the extension to the Italian language of the model-based approach described in Benz et al. (2020) and of the QA system described in Elahi et al. (2021). In doing so, we develop a QA system that can answer more than 1.6 million Italian questions over the DBpedia knowledge graph (https://www.dbpedia.org/).

2 Related Work

Besides the goal of creating QA systems that are robust and perform well, an important goal is to develop systems that can be ported to languages other than English. The interest in other languages is, for example, explicitly stated in the Multiple Language Question Answering Track at CLEF 2003 (Magnini et al., 2004), which includes Italian among other languages.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

One of the earlier attempts in this regard was the DIOGENE model (Magnini et al., 2002; Tanev et al., 2004), which exploits linguistic templates and keyword recognition to answer questions over document collections as a source of knowledge. Other efforts were made in the QALL-ME project (Cabrio et al., 2007; Cabrio et al., 2008; Ferrández et al., 2011), where a system was created for the tourism domain through an instance-based method, that is, by clustering together similar question-answer pairs.

More recently, the QuASIt model (Pipitone et al., 2016) makes use of Construction Grammar and an abstraction of cognitive processes to account for the inherent fluidity of language, while exploiting linguistic and domain knowledge (in the form of an ontology) to answer essay and multiple-choice questions. Similarly, Leoni et al. (2020) built a system to answer questions regarding a specific domain using IBM Watson services and online articles as a source of information.

These kinds of systems, built to answer questions using textual information, have grown considerably in recent years, especially since the availability of large QA datasets such as the Stanford Question Answering Dataset (SQuAD, https://rajpurkar.github.io/SQuAD-explorer/), which allows complex deep learning models with millions of parameters to be trained (Rajpurkar et al., 2016; Rajpurkar et al., 2018). While the performance shown by these models is impressive, they suffer from major drawbacks: first of all, they need an extremely large dataset to be trained on, making the porting of such a system to another language extremely demanding (the Italian translation of SQuAD, for example, is described in Croce et al. (2018)); furthermore, they show a lack of controllability, in the sense that it is unclear which new examples have to be added to make a new question answerable. This makes such systems opaque and difficult to maintain.

The MULIB system (Siciliani et al., 2019) tackles the problem of answering questions in Italian over structured data. The system is based on a modified version of the automaton developed for CANaLI (Mazzeo and Zaniolo, 2016), but employs a Word2Vec model (Mikolov et al., 2013) to allow for more flexibility in language use. In contrast to these trained approaches, we present a model that provides (i) a deeper interconnection of semantic and syntactic information through the integration of a lemon lexicon with the DBpedia ontology, and (ii) a focus on Linked Open Data.

3 Methodology

The architecture consists of two components: (i) the grammar generator and (ii) the QA component. The approach to grammar generation for the different syntactic frames defined by LexInfo (Cimiano et al., 2011) was described for the English language in previous work (Benz et al., 2020). In this paper we show that, through a simple language adaptation, we are able to adjust the system so that it also accepts questions in Italian.

In a nutshell, the grammar generation approach relies on a mapping between syntactic constructions and classes and properties from a given ontology and/or knowledge graph. This generation process makes use of several frames, each describing the linguistic realizations of specific properties that might appear in questions. The frames employed in this work are: NounPPFrame, TransitiveFrame, IntransitivePPFrame, AdjectiveAttributive and AdjectiveGradable.

For example, the (lexicalized) construction for the NounPPFrame 'the capital of X' can be regarded as expressing the DBpedia property dbo:capital, with Country as domain and City as range. This would lead to the generation of the following questions:

• What is the capital of X (Country)?
• Which city is the capital of X (Country)?

Similar grammar generation rules exist for transitive constructions (TransitiveFrame), for constructions involving an intransitive verb with a prepositional complement (IntransitivePPFrame), and for adjective constructions in attributive (AdjectiveAttributive) and predicative form (AdjectiveGradable).

In the context of this work, we adapted the generation of rules to the Italian language without extending or modifying the existing types of constructions (the code for our grammar generation for Italian is available at https://github.com/fazleh2010/question-grammar-generator). In adapting the grammar generation to Italian, we had to accommodate the following language-specific properties:

• Sentence order: in sentences starting with an interrogative pronoun, the subject is placed at the end of the sentence, e.g., Dove si trova Vienna? (Where is Vienna?);
• The presence of auxiliary verbs, either avere (have) or essere (be), in compound tenses;
• Interrogative pronoun rules, e.g., chi (who) is invariable and refers only to people;
• The use of interrogative adjectives, e.g., quale (which);
• The use of different prepositions, either simple or articulated, on the basis of range/domain semantics (e.g., toponyms might require different prepositions);
• The presence of a determiner/articulated preposition on the basis of range/domain semantics (e.g., toponyms are preceded by a determiner when the noun refers to a country).
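The preposition- and determiner-related properties among those listed can be illustrated with a small sketch. This is a simplified illustration only, not the actual implementation: the feature inventory is reduced (contracted forms such as dello, dell' and degli are omitted) and the function name is hypothetical.

```python
# Sketch: selecting the articulated form of the preposition "di" ("of")
# from the gender and number of the following noun phrase, one of the
# Italian-specific adaptations described above. Simplified: forms such
# as "dello", "dell'" and "degli" are omitted.
DI_FORMS = {
    ("masc", "sg"): "del",
    ("masc", "pl"): "dei",
    ("fem", "sg"): "della",
    ("fem", "pl"): "delle",
}

def articulated_di(gender: str, number: str, takes_determiner: bool) -> str:
    """Return the surface form of 'di' to use before a noun phrase."""
    if not takes_determiner:
        return "di"  # e.g., before city names: "la capitale di Vienna"
    return DI_FORMS[(gender, number)]

# Country names take a determiner, so the articulated form is used:
print("la capitale", articulated_di("fem", "sg", True), "Germania")
# → la capitale della Germania (the capital of Germany)
```

The same kind of table-driven choice applies to the other articulated prepositions mentioned above.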
[Figure 1: Lemon entry for the relational noun 'capitale della']

Consider the lemon lexical entry in Figure 1 for the relational noun 'capitale della'. (In this paper we abbreviate URIs with the namespace prefixes dbo, dbp, lemon, and lexinfo, which expand to http://dbpedia.org/ontology/, http://dbpedia.org/property/, https://lemon-model.net/lemon#, and http://www.lexinfo.net/ontology/2.0/lexinfo#, respectively.) The entry states that the canonical written form of the entry is "capitale". It further states that the entry has NounPPFrame as its syntactic behaviour, that is, it corresponds to a copulative construction X è la capitale della Y with two arguments, where copulativeArg corresponds to the copula subject X and the prepositional adjunct corresponds to the prepositional object Y.

We give examples for the different syntactic frames below to illustrate the behaviour of the Italian grammar generation.

NounPPFrame. Assuming that in the corresponding lemon lexicon we model the NounPP construction capitale della (capital of) as referring to the property dbo:capital, with domain Country and range City, we can automatically generate questions such as:

1. Qual è la capitale della (What is the capital of) (X | Country NP)?
2. Quale città è la capitale della (Which city is the capital of) (X | Country NP)?

where X is a placeholder that can be filled with a particular country, e.g., Germania (Germany), or with a noun phrase, e.g., paese dove si parla tedesco (the country where German is spoken).

TransitiveFrame. Assuming that the lemon lexicon captures the meaning of the construction X 'scrive' (writes) Y as referring to the property dbp:author, with Song as domain and Person as range, the following questions would be covered by an automatically generated grammar:

1. Chi ha scritto (Who wrote) (X | Song NP)?
2. Quale cantante ha scritto (Which singer wrote) (X | Song NP)?
3. Quale (Which) (X | Song NP) è stata scritta da (was written by) (Y | Person NP)?

IntransitivePPFrame. Assuming that the lemon lexicon captures the meaning of the construction 'X pubblicare nel Y' ('X published in Y') as a representation of the property dbp:published, with Song as its domain and Date as its range, the following questions would be generated:

1. Quando è stata pubblicata (X | Song NP)? (When was (X | Song NP) published?)
2. Quale (X | Song NP) è stata pubblicata nel (Y | Date)? (Which (X | Song NP) was published in (Y | Date)?)
3. In quale data è stata pubblicata (On which date was published) (X | Song NP)?

LexInfo Frame        | Syntactic Pattern                      | Question Sample
NounPP               | WDT/WP V* DT [noun] IN DT [domain]     | Qual è la capitale della Germania?
                     | WDT dbo:range V* DT [noun] IN [domain] | Quale città è la capitale della Germania?
                     | WDT/WP V* DT [noun] IN [domain]        | Chi era la moglie di Abraham Lincoln?
                     | [range] V* DT [noun] IN (DT) [domain]  | Rita Wilson è la moglie di Tom Hanks?
AdjectiveAttributive | WDT V* DT dbo:range [adjective]        | Chi era un vescovo cristiano spagnolo?
                     | [domain] VB (DT) [adjective]           | Barack Obama è un democratico?
AdjectiveGradable    | WRB V* [adjective] DT [domain]         | Quanto è lungo il Barguzin?
                     | WDT V* DT [domain] JJS IN (DT) [range] | Qual è la montagna più alta della Germania?
Transitive           | WP V* [domain]                         | Chi ha scritto Ziggy Stardust?
                     | WDT dbo:range V* [domain]              | Quale cantante ha scritto Ziggy Stardust?
                     | WP V* DT [domain]                      | Chi ha fondato C&A?
                     | WDT dbo:range V* DT [domain]           | Quale persona ha fondato C&A?
                     | [domain] V* [range]                    | Socrate ha influenzato Aristotele?
IntransitivePP       | WRB VB [domain]                        | Quando è iniziata l'operazione Overlord?
                     | IN WDT dbo:domain VB [range]           | In quale data è iniziata l'operazione Overlord?
                     | WDT dbo:domain VB IN [range]           | Quale libro è stato pubblicato nel 1563?
                     | [domain] V* IN [range]                 | Il libro dei martiri di Foxe è stato pubblicato nel 1563?

Table 1: Italian Patterns and Questions
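To make the generation step concrete, the NounPPFrame expansion described above can be sketched as follows. The entry structure, the Italian class labels, and the SPARQL template shape are simplified assumptions for illustration, not the system's actual rule format.

```python
# Sketch of NounPPFrame-style question generation: one lexicon entry
# (noun construction, ontology property, domain/range labels) expands
# into Italian question templates paired with a SPARQL template.
# Entry fields and template wording are illustrative only.
from dataclasses import dataclass

@dataclass
class NounPPEntry:
    noun_pp: str       # e.g. "capitale della"
    prop: str          # ontology property, e.g. "dbo:capital"
    domain_label: str  # Italian label for the domain class
    range_label: str   # Italian label for the range class

def generate_rules(entry: NounPPEntry) -> list:
    """Return (question template, SPARQL template) pairs for the entry."""
    # ?x is later bound to the entity filling the X placeholder.
    sparql = f"SELECT ?y WHERE {{ ?x {entry.prop} ?y }}"
    return [
        (f"Qual è la {entry.noun_pp} (X | {entry.domain_label} NP)?", sparql),
        (f"Quale {entry.range_label} è la {entry.noun_pp} (X | {entry.domain_label} NP)?", sparql),
    ]

entry = NounPPEntry("capitale della", "dbo:capital", "paese", "città")
for question, query in generate_rules(entry):
    print(question, "->", query)
```

Each rule pairs a surface template with a query template, which is what makes the approach controllable: adding one entry adds a predictable set of question-query pairs.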
AdjectiveAttributive and AdjectiveGradable. Assuming that the lemon lexicon captures the meaning of the (gradable) adjective lungo (long) as referring to the ontological property dbp:length, the grammar generation approach would generate the following types of questions:

1. Quanto è lungo il (How long is the) (X | River NP)?
2. Qual è il fiume più lungo (del mondo, del Kentucky)? (What is the longest river in (the world, Kentucky)?)

The rules implemented for the generation of Italian questions are shown in further detail in Table 1. In particular, we use the tagset from the Penn Treebank Project (Marcus et al., 1993; https://www.sketchengine.eu/english-treetagger-pipeline-2/), with V* denoting all possible forms of a given verb, words in brackets denoting the nouns/verbs/adjectives that realize a specific property, and dbo:range/dbo:domain denoting the possible labels that may represent classes (e.g., dbo:Country might be represented by either paese or stato).

4 Results

We apply our system to the DBpedia dataset and manually created a lemon lexicon comprising 249 lexical entries (available at https://scdemo.techfak.uni-bielefeld.de/quegg-resources/). Table 2 shows the number of grammar rules and questions generated for each syntactic type. Altogether, the approach generates 620 grammar rules and about 1.6 million questions. A web-based demonstration is available online at https://webtentacle1.techfak.uni-bielefeld.de/quegg/.

Frame type               | #Entries | #Grammar rules | #Questions
NounPPFrame              | 113      | 226            | 1,010,234
TransitiveFrame          | 41       | 124            | 595,854
IntransitivePPFrame      | 58       | 116            | 52,040
AdjectiveAttributiveFrame| 29       | 130            | 10,025
AdjectiveGradable        | 8        | 24             | 3,123
Total                    | 249      | 620            | 1,671,276

Table 2: Frequencies of entries with a certain frame type. The entries are created manually; the rules and questions are generated automatically.

We used the training set of the multilingual QALD-7 benchmark (https://github.com/ag-sc/QALD) to evaluate our approach.
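We report precision, recall, and F-measure. The F-measure is the standard harmonic mean of precision and recall, which can be sanity-checked against the reported figures in a few lines:

```python
# F-measure as the harmonic mean of precision and recall,
# the standard definition used in the QALD evaluations.
def f_measure(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# With the rounded precision (0.485) and recall (0.224) reported in
# Table 3 this yields ~0.306, matching the reported F-measure of
# 0.307 up to rounding of the inputs.
print(round(f_measure(0.485, 0.224), 3))
```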
di Buono has been partially a total of 214 questions over linked data, cover- supported by Programma Operativo Nazionale ing for more relations than the ones we consid- Ricerca e Innovazione 2014-2020 - Fondo Sociale ered so far. In order to overcome this issue, a to- Europeo, Azione I.2 “Attrazione e Mobilità Inter- tal of 109 entries were added to our system (22 nazionale dei Ricercatori” Avviso D.D. n 407 del NounPPFrame, 41 TransitiveFrame, 41 Intransi- 27/02/2018. B. Ell has been partially supported by tiveFrame, 1 AdjectiveAttributiveFrame and 4 Ad- the SIRIUS centre: Norwegian Research Council jectiveGradable). project No 237898. Precision 0.485 Recall 0.224 References F-Measure 0.307 Viktoria Benz, Philipp Cimiano, Mohammad Fazleh Elahi, and Basil Ell. 2020. Generating Grammars Table 3: Evaluation results against QALD-7 from lemon lexica for Questions Answering over Linked Data: a Preliminary Analysis. In NLIWOD The results of the evaluation process (Table 3) workshop at ISWC, volume 2722, pages 40–55. show a quite satisfying precision, but a low recall. Elena Cabrio, Bonaventura Coppola, Roberto Gretter, The main reason behind such results is related Milen Kouylekov, Bernardo Magnini, and Matteo to the presence of different types of questions in Negri. 2007. Question answering based annota- tion for a corpus of spoken requests. In Proceedings QALD. Indeed, besides single-triple questions, of the workshop on the Semantic Representation of QALD presents also complex questions referring Spoken Language, volume 31. to more than one triple, e.g., A quale movimento artistico apparteneva il pittore de I tre ballerini? Elena Cabrio, Milen Kouylekov, Bernardo Magnini, Matteo Negri, Laura Hasler, Constantin Orasan, (What was the artistic movement of the author David Tomás, Jose Luis Vicedo, Guenter Neumann, of The Three Dancers?), which are not covered and Corinna Weber. 2008. The QALL-ME bench- yet by our model. 
Nevertheless, when taking into mark: a multilingual resource of annotated spoken account all the questions in QALD-7, our system requests for question answering. In LREC’08. recognizes 46.98% (101 questions) of the total set Nilesh Chakraborty, Denis Lukovnikov, Gaurav Ma- of questions. heshwari, Priyansh Trivedi, Jens Lehmann, and Asja Fischer. 2019. Introduction to Neural Network based Approaches for Question Answering over 5 Conclusion and Future Work Knowledge Graphs. CoRR, abs/1907.09361. We presented an approach to developing Italian Philipp Cimiano, Paul Buitelaar, John P. McCrae, and QA systems over linked data that relies on the au- Michael Sintek. 2011. LexInfo: A declarative model for the lexicon-ontology interface. JWS, tomatic generation of grammars from correspond- 9(1):29–51. ing lemon lexica describing how elements of the dataset are realized in natural language. The ap- Danilo Croce, Alexandra Zelenanska, and Roberto Basili. 2018. Neural learning for question answer- proach is controllable, since the introduction of ing in italian. In AI*IA 2018, pages 389–402. a lexical entry increases the question coverage in a fully predictable way. Our proof-of-concept Mohammad Fazleh Elahi, Basil Ell, Frank Grimm, and implementation over DBpedia covers 1.6 million Philipp Cimiano. 2021. Question Answering on RDF Data based on Grammars Automatically Gen- questions generated from 249 lemon entries. erated from Lemon Models. In SEMANTiCS Con- In future work, we intend to further automatize ference, Posters and Demonstrations. grammar generation by using LexExMachina (Ell et al., 2021), which induces lexicon entries bridg- Basil Ell, Mohammad Fazleh Elahi, and Philipp Cimi- ano. 2021. Bridging the Gap Between Ontology and ing the gap between ontology and natural language Lexicon via Class-Specific Association Rules Mined from a corpus in an unsupervised manner. from a Loosely-Parallel Text-Data Corpus. In LDK 2021, pages 33:1–33:21. 
Acknowledgments This work has been funded by the European Commission under grant 825182 Chiara Leoni, Ilaria Torre, and Gianni Vercelli. 2020. (Prêt-à-LLOD) as well as Nexus Linguarum Cost ConversIAmo: Improving Italian Question Answer- ing Exploiting IBM Watson Services. In Text, 9 https://github.com/ag-sc/QALD Speech, and Dialogue, pages 504–512. Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Mining Knowledge from Re- peated Co-Occurrences: DIOGENE at TREC 2002. Bernardo Magnini, Simone Romagnoli, Alessandro Vallin, Jesús Herrera, Anselmo Peñas, Vı́ctor Peinado, Felisa Verdejo, and Maarten de Rijke. 2004. The Multiple Language Question Answering Track at CLEF 2003. In Comparative Evaluation of Multilingual Information Access Systems, pages 471–486. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a Large Anno- tated Corpus of English: The Penn Treebank. Com- put. Linguist., 19(2):313—-330. Giuseppe M. Mazzeo and Carlo Zaniolo. 2016. An- swering controlled natural language questions on RDF knowledge bases. In EDBT, pages 608–611. John P. McCrae, Dennis Spohr, and Philipp Cimiano. 2011. Linking Lexical Resources and Ontologies on the Semantic Web with Lemon. In ESWC Confer- ence, pages 245–259. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013. Distributed representa- tions of words and phrases and their compositional- ity. In Advances in Neural Information Processing Systems, volume 26. Arianna Pipitone, Giuseppe Tirone, and Roberto Pir- rone. 2016. QuASIt: A Cognitive Inspired Ap- proach to Question Answering for the Italian Lan- guage. volume 10037, pages 464–476. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100, 000+ Ques- tions for Machine Comprehension of Text. CoRR, abs/1606.05250. Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Ques- tions for SQuAD. CoRR, abs/1806.03822. 
Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro, and Matteo Mennitti. 2019. An Italian question answering system for structured data based on controlled natural languages. In CLiC-it.

Hristo Tanev, Matteo Negri, Bernardo Magnini, and Milen Kouylekov. 2004. The DIOGENE question answering system at CLEF-2004. Volume 3491, pages 435–445.

Óscar Ferrández, Christian Spurk, Milen Kouylekov, Iustin Dornescu, Sergio Ferrández, Matteo Negri, Rubén Izquierdo, David Tomás, Constantin Orasan, Guenter Neumann, Bernardo Magnini, and Jose Luis Vicedo. 2011. The QALL-ME Framework: A specifiable-domain multilingual Question Answering architecture. Journal of Web Semantics, 9(2):137–145.