When Lexicon-Grammar Meets Open Information Extraction: a Computational Experiment for Italian Sentences

Raffaele Guarasci, Emanuele Damiano, Aniello Minutolo, Massimo Esposito
National Research Council of Italy
Institute for High Performance Computing and Networking (ICAR), Naples, Italy
{name.surname}@icar.cnr.it

Abstract

In this work we present an experiment on building an Open Information Extraction (OIE) system for the Italian language. We propose a system that relies entirely on linguistic structures and on a small set of verbal behavior patterns, defined by combining theoretical linguistic knowledge with corpus-based statistical information¹. Starting from elementary one-verb sentences, the system identifies the elementary tuples and then all their permutations, preserving overall well-formedness (grammaticality) and trying to preserve semantic coherence (acceptability). Although the work focuses only on Italian, it can be extended to other languages, since it is based only on linguistic resources and on a representative corpus for the language under consideration².

¹ An online demo showing some features of the system is freely available at https://nlpit.na.icar.cnr.it/
² Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

One of the most interesting approaches to handling the rapid growth of textual data that has emerged in the last decade is Open Information Extraction (OIE). Starting from natural language sentences, it extracts one or more domain-independent propositions, scaling to the diversity and size of the corpus considered (Banko et al., 2007). Each extracted proposition is represented by a verb and its arguments, e.g. “Maria goes to the party” is a proposition with a relation (the verb goes) that links together two arguments (Maria, the party). Arguments (nouns or noun groups) can have different roles (subject, direct object…) and they can be mandatory or optional. In this sentence, both arguments, Maria (subject) and the party (direct object), are mandatory, so removing either of them makes the sentence grammatically unacceptable.

Because OIE outputs can be used in a wide range of Natural Language Processing (NLP) tasks (Christensen et al., 2013; Fader et al., 2014; Stanovsky et al., 2015; 2016; Khot et al., 2017; Rahat et al., 2017), numerous OIE approaches for English have been developed. However, since OIE is a language-dependent task, OIE systems cannot simply be moved from one language to another, i.e. a system created for English is not compatible with Italian. Moreover, many of the proposed OIE approaches rest on unstable grounds. Some of them use heuristics to manage large quantities of textual data, while others lack a theoretical basis and describe natural language in a reductive way.

Unlike the vast majority of existing OIE approaches, we propose a linguistically based, unsupervised system designed to extract n-ary propositions (not only “relation-argument” triples) from natural language sentences in Italian, ensuring domain independence and scalability.

Our system aims to identify the elementary tuple(s) in the input sentence and then all its (their) permutations, obtained by progressively adding the arguments that compose the sentence. After that, according to the behavior patterns of the verb, it generates every possible syntactically valid n-ary proposition, guaranteeing grammaticality.

To reach this result we have combined two types of resources. To gather information about verb behavior in sentences, we grounded our work on the linguistic basis provided by Lexicon-Grammar (LG) (Gross, 1994). In order to obtain a fine-grained characterization of arguments, we
combine this theoretical knowledge with distributional corpus-based information extracted from itWaC (Baroni et al., 2009). From LG tables we extract patterns of verb behavior, and from itWaC we enrich these patterns with statistical information. Using complex linguistic structures and dependency parse trees (DPT), we can detect verbal behavior patterns occurring in one-verb sentences and generate from them all the possible well-formed propositions, by adding complements and adverbials. The use of formal patterns derived from a theoretical framework makes it possible to better distinguish between necessary verbal arguments and optional, removable adjuncts, and to verify syntactic restrictions on the possible structures of a verb.

Argument optionality and syntactic constraints are critical features for guaranteeing the grammaticality of the generated propositions, while also trying to approximate a first level of semantic acceptability.

2 Related Work

In recent years, several approaches to OIE have been developed (Banko et al., 2007; Zhu et al., 2009; Wu et al., 2010; Fader et al., 2011; Schmitz et al., 2012; Del Corro et al., 2013). All of them use a set of patterns to obtain propositions, guaranteeing scalability and portability across different domains.

They differ in many aspects, such as performance (precision, recall, speed); linguistic structures used (Part-of-Speech tags, chunks, DPT); patterns used to extract information (hand-crafted on the basis of heuristics, or learned from a training corpus); and type of generated output (binary extractions, n-ary extractions, nested extractions).

However, most of these approaches have so far focused on English, with only some recent attempts for other languages, such as Spanish (Zhila et al., 2013), Chinese (Wang et al., 2014), Vietnamese (Truong et al., 2017), German (Falke et al., 2016; Bassa et al., 2018) and Romance languages (Gamallo et al., 2012; Gamallo et al., 2015). As far as we know, only one approach has been attempted for Italian (Damiano et al., 2018). It is a preliminary experiment based on a limited set of patterns and heuristics, and it was tested on a small hand-crafted dataset.

3 Lexicon-Grammar

As the theoretical basis for our system we decided to use LG, since it provides a systematic formalization of a very large quantity of data for the Italian language (Elia et al., 1981; D’Agostino, 1992). Other resources describing a subset of Italian verbs have been developed, such as LexIt (Lenci et al., 2012), MultiWordNet (Pianta et al., 2002), SensoComune (Oltramari et al., 2013) and T-PAS (Jezek et al., 2014). However, none of them provides a formal classification of verbs into classes or clusters. Conversely, LG groups verbs into classes according to their behavior, specifying for each verb its essential arguments and the possible syntactic structures that yield well-formed sentences (Leclère, 2002).

3.1 How data are structured in LG

LG classes are represented in the form of tables. Each row of the table corresponds to a verb of the class, and each column corresponds to a property that may or may not be valid for the different members of the class. At the intersection of a row and a column, the symbol + or - indicates whether the property corresponding to the column is valid for the verb corresponding to the row, as shown in Table 1³, which reports some Italian verbs and their properties as encoded in an LG table. Properties can be of different types. They can refer to the syntactic structure and the prepositions admitted by that specific verb, to semantic restrictions (e.g. human/non-human argument) or to possible transformations (e.g. passive form). For the purpose of this work, only syntactic properties are considered. This choice reflects the syntactic nature of OIE, which focuses on the shapes and structures of verbs.

Verb                  N0VN1   N0V   N0VprepN1   N0VN1prepN2
Mangiare (to eat)     -       +     +           -
Muovere (to move)     +       +     -           -
Girare (to turn)      +       +     +           +

Table 1 Example of an LG table

³ The formal notation used in LG is summarized as follows: N indicates a nominal group and is followed by a progressive subscript indicating its nature (N0 is the subject, N1 is the first complement, N2 is the second complement, etc.), V represents the verb, prep indicates prepositions.

The first column contains the defining property, which corresponds to the basic syntactic structure of the elementary sentence. The property expressed in the second column is a syntactic property called deletion (Harris, 1982), labeled N0V, which allows the element N1 to be removed from the basic syntactic structure specified by the defining property. Deleting the element N1 on the right of the verb is valid for the verb “mangiare” (“Max mangia”, Max eats), while it produces ungrammatical, unacceptable sentences for the verb “muovere” (“*Max muove”, *Max moves). Prep represents the set of all possible adjuncts placed before each argument Ni.

3.2 From tables to patterns

Despite the richness of this fine-grained information, LG tables suffer from some limitations that have made them unusable in real NLP applications: they are verbose, and their properties are neither uniform nor standardized. Therefore, several changes were necessary in order to use these resources in the OIE system:

Grouping. We divided verbs into classes: direct (D), without a preposition; indirect (I), with a preposition; and locative (L). This distinction is preferred to the classical distinction between transitive and intransitive verbs, since locative verbs can accept both transitive and intransitive constructions. Verbs assuming a copulative function (support verbs) form a further class (S). For the purpose of this work, we do not consider complement-clause verbs, because the variability of their possible structures prevents the definition of unique patterns.

Enrichment. The Prep element is too coarse. We need to specify which prepositions the selected verb admits. To overcome this limit, we add a syntactic profile to each verb, containing the prepositions most frequently associated with it. We extract this information from the itWaC corpus.

Formal representation. To reduce the redundant information of the original tables, we formalize a grammar that compactly represents verb behavior, indicating selection preferences on the possible arguments of a verb. Square brackets [] indicate that the enclosed arguments can be deleted, round brackets () enclose several possible arguments separated by a vertical bar, and the XOR symbol ⊕ indicates that patterns are mutually exclusive alternatives.

As shown in Table 2, the notation N0V[N1] indicates that the verb “mangiare” (to eat) can accept both the structure N0VN1 and the structure N0V, and the notation N0V(in|a)N1 denotes that the verb can accept the patterns N0VinN1 and N0VaN1, either alternatively or simultaneously. On the other hand, a notation like N0VN1⊕N0VinN1 denotes that the verb can accept only one of the patterns N0VN1 and N0VinN1 at a time, even if both are grammatically valid. This is because their selection preferences are representative of different verb usages and are therefore alternative and exclusive from a semantic perspective. Note that in Table 2 the possible prepositions are reduced to make the patterns more readable.

Verbs                 Patterns
mangiare (to eat)     N0V[N1]
muovere (to move)     N0VN1 ⊕ N0V(in|da|verso)N1
girare (to turn)      N0V(a|intorno)N1 ⊕ N0VN1[(a|da|verso)N2]

Table 2 Patterns derived from LG tables

4 Proposed Approach

Our approach to OIE is arranged as a multi-step pipeline consisting of 4 steps:

Sentence Processing: every input sentence is checked to verify that it is suitable for the approach.
Arguments Identification: the arguments of the verb are identified (i.e. subjects, direct complements, indirect complements…).
Pattern Recognition: verbal structures that match the patterns are identified, and elementary tuples made by combining the arguments are generated.
Proposition Generation: n-ary propositions are generated from the elementary tuples and the remaining arguments (i.e. adverbs, complements and modifiers).

As an example, for the sentence “Da domani Anna andrà da Roma a Milano” (From tomorrow Anna will go from Rome to Milan), the tuples and the corresponding propositions that are generated are reported in Table 3.

The verb “andare” (to go) belongs to the locative group (loc), and its complete pattern is N0V[daN1](a|in|verso|su|sopra)N2. The first column of the table reports the patterns identified for the verb; the second column lists the tuples and propositions generated from each pattern.

Pattern     Generations
N0VaN1      1. (“Anna”, “andrà”, “Milano”)
               Anna andare a Milano (Anna to go to Milan)
            2. (“Domani”, “Anna”, “andrà”, “Milano”)
               Da domani Anna andare a Milano (From tomorrow Anna to go to Milan)
N0daVaN1    3. (“Anna”, “andrà”, “Roma”, “Milano”)
               Anna andare da Roma a Milano (Anna to go from Rome to Milan)
            4. (“Domani”, “Anna”, “andrà”, “Roma”, “Milano”)
               Da domani Anna andare da Roma a Milano (From tomorrow Anna to go from Rome to Milan)

Table 3 Tuples and propositions generated from an input sentence

5 Experiment and validation

We carried out the evaluation using quantitative metrics well known in the NLP literature: precision and recall. Precision is the average, over all sentences, of the percentage of extractions produced by the proposed approach that are correct, whereas recall is the average, over all sentences, of the percentage of manually annotated extractions in the dataset that are correctly identified by the proposed approach. Performance was evaluated on a dataset of sentences containing verbs belonging to different classes, and the validation was carried out with respect to grammaticality and acceptability (i.e. the syntactic well-formedness of a sentence and its meaningfulness in context) using the gold standard proposed in (Guarasci et al., in press). Note that grammaticality and acceptability judgements have been a much debated topic in theoretical and computational linguistics (Phillips, 2009; Phillips, 2011; Gibson et al., 2010) and are still considered a controversial subject today (Lau et al., 2017; Sprouse et al., 2018). Even though OIE is a syntactic task, which focuses on the structure of the sentence rather than on its meaning (Lau et al., 2017), we aim to generate sentences that are not only well-formed but also respect some syntactic constraints and selection preferences, trying to approximate a first level of semantic acceptability.

              Sentences   Grammaticality    Acceptability
                          P       R         P       R
Total verbs   195         0.91    0.78      0.79    0.84
Locative      62          0.93    0.73      0.77    0.83
Direct        30          0.90    0.93      0.79    0.93
Indirect      65          0.88    0.81      0.78    0.83
Support       38          0.98    0.66      0.86    0.78

Table 4 Results for different verb classes

Table 4 shows precision (P) and recall (R) scores with respect to the two criteria, with verbs divided by class.

Precision and recall achieve high values with respect to both grammaticality and acceptability. More precisely, with respect to the different structures, precision is noticeably higher for sentences containing support verbs, for both grammaticality and acceptability. This behavior is reversed for recall, which is higher for sentences containing direct, indirect or locative verbs.

5.1 Comparison with other OIE systems

Overall, the number of generations per sentence and the performance achieved are comparable with state-of-the-art OIE systems for other languages, namely ClausIE (English) and GerIE (German). Moreover, we compare our results with the only other experiment conducted on Italian, presented by the authors and named ItalIE (Damiano et al., 2018).

              Sentences   Grammaticality    Acceptability
                          P       R         P       R
Total verbs   195         0.84    0.40      0.73    0.43
Locative      62          0.91    0.46      0.74    0.51
Direct        30          0.82    0.56      0.74    0.57
Indirect      65          0.72    0.27      0.68    0.57
Support       38          0.91    0.36      0.86    0.45

Table 5 Performance of ItalIE

As shown in Tables 4 and 5, our approach reaches the best overall performance in terms of precision and recall for both grammaticality and acceptability. ItalIE produced a noticeably lower number of generations (511 vs. 918 for our approach), with a moderate decrease in precision but a significant reduction in recall. This behavior can be explained by the fact that ItalIE is based on a fixed set of clause patterns, which does not account for the extreme variability of verb behaviors or for the selection preferences on their possible arguments. Furthermore, its DPT-based algorithm for identifying constituents through dependency relations has shown some weaknesses. It fails to detect and properly handle named entities, multi-word expressions, adjectives, numerals, dates and some patterns related to support verbs.

5.2 Error Analysis

The number of false positives and false negatives generated in the experiments is shown in Table 6 with respect to grammaticality (G) and acceptability (A).

     False positives                 False negatives
     DP    NE    SC    MC    Tot     DP    VU    Tot
G    78    3     0     0     81      145   86    231
A    78    3     76    38    195     114   21    135

Table 6 Summary of the errors generating false positives and negatives with respect to grammaticality and acceptability.

The types of errors are divided as follows:

DP: errors caused by incorrect dependency parsing, due to wrong and/or missing dependencies between elements occurring in the input sentence. They represent the vast majority of the errors affecting the overall performance of the proposed approach. With respect to grammaticality and acceptability, false positives are generated by DP errors in 96% and 40% of cases, whereas false negatives are due to DP errors in 63% and 84% of cases, respectively.

NE: errors in the identification of named entities. NE errors occurred in a negligible number of cases, only 3, generating false positives with respect to both grammaticality and acceptability.

VU: behavior patterns not associated with the verb usage selected for the input sentence. This is the second source of errors causing false negatives with respect to grammaticality and acceptability (in 37% and 16% of cases, respectively).

MC: missing morpho-syntactic agreement among different parts of speech, or missing contractions or combinations of prepositions and articles. It causes 19% of the false positives for acceptability.

SC: violated semantic constraints. It affects only acceptability, causing 39% of the false positives. Note that this error concerns only the semantic perspective, while the others are related to grammatical aspects.

6 Conclusions and Future Work

In this work we have presented an experiment on performing OIE for the Italian language, extracting n-ary propositions from natural language sentences while guaranteeing the well-formedness of the generations. The system relies on a linguistic resource (LG) and on a representative corpus for Italian (itWaC). While these resources are specific to Italian, counterparts exist for other languages, so the system can be extended to them. In particular, LG tables also exist in digital format for French (Tolone, 2012), English (Garcia-Vega, 2010; Machonis, 2010), Portuguese (Baptista, 2001) and Romanian (Ciocanea, 2011). Likewise, the itWaC corpus used in this work is part of the WaCky Wide Web corpora collection (Baroni et al., 2009), which includes corpora of English (ukWaC), German (deWaC) and French (frWaC). Concerning the performance of the system, although the results are encouraging, further developments are planned. From a methodological point of view, we plan to integrate novel methods based on deep learning to increase the performance of the system, trying to reduce DP errors and to better handle named entities, frozen and semi-frozen bigrams and multi-word expressions. From an application perspective, this work will be tested in an Italian Question Answering system, with the goal of improving its ability to read complex texts and extract the correct answers to users' questions. Other possible applications include text summarization and other NLP tasks.

References

Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceeding of IJCAI, vol. 7, pp. 2670–2676.

Jorge Baptista. 2012. Viper: A lexicon-grammar of european portuguese verbs. In 31e Colloque International sur le Lexique et la Grammaire.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources & Evaluation, 43(3):209–226, September.

Akim Bassa, Mark Kröll, and Roman Kern. 2018. GerIE-An Open Information Extraction System for the German Language. Journal of Universal Computer Science, 24(1):2–24.

Janara Christensen, Stephen Soderland, and Oren Etzioni. 2013. Towards Coherent Multi-Document Summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1163–1173.

Cristiana Ciocanea. 2011. Lexique-grammaire des constructions converses en a da/ a primi en roumain. (Lexicon-grammar of converse constructions in a da/ a primi in Romanian). PhD Thesis, University of Paris-Est, France.

Emilio D’Agostino. 1992. Analisi del discorso: metodi descrittivi dell’italiano d’uso. Loffredo.

Emanuele Damiano, Aniello Minutolo, and Massimo Esposito. 2018. Open Information Extraction for Italian Sentences. In Proceedings of 2018 32nd International Conference on Advanced Information Networking and Applications Workshops, pp. 668–673.

Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web, pp. 355–366.

Annibale Elia, Maurizio Martinelli, and Emilio d’Agostino. 1981. Lessico e strutture sintattiche: introduzione alla sintassi del verbo italiano. Liguori, Napoli.

Oren Etzioni, Anthony Fader, Janara Christensen and Stephen Soderland. 2011. Open Information Extraction: The Second Generation. In Proceeding of IJCAI, vol. 11, pp. 3–10.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying Relations for Open Information Extraction. In EMNLP ’11, pp. 1535–1545, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1156–1165.

Pablo Gamallo, Marcos Garcia, and Santiago Fernández-Lanza. 2012. Dependency-based Open Information Extraction. In ROBUS-UNSUP ’12, pp. 10–18, Stroudsburg, PA, USA. Association for Computational Linguistics.

Pablo Gamallo, Marcos Garcia. 2015. Multilingual open information extraction. In Portuguese Conference on Artificial Intelligence, pp. 711–722.

Edward Gibson and Evelina Fedorenko. 2013. The need for quantitative methods in syntax and semantics research. Language and Cognitive Processes, 28(1–2):88–124.

Maurice Gross. 1994. Constructing lexicon-grammars. Centre national de la recherche scientifique, Universités de Paris 7 et 8.

Zellig Sabbettai Harris. 1982. A grammar of English on mathematical principles. John Wiley & Sons Incorporated.

Elisabetta Jezek, Bernardo Magnini, Anna Feltracco, Alessia Bianchini, and Octavian Popescu. T-PAS: A resource of corpus-derived Typed Predicate Argument Structures for linguistic analysis and semantic processing. In Proceedings of LREC, pp. 890–895.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, (2) pp. 311–316.

Jey Han Lau, Alexander Clark, and Shalom Lappin. 2017. Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge. Cognitive Science, 41(5), pp. 1202–1241.

Christian Leclère. 2005. The Lexicon-Grammar of French Verbs. In Linguistic Informatics State of the Art and the Future: The first international conference on Linguistic Informatics, (1) pp. 29–45.

Christian Leclère. 2002. Organization of the lexicon-grammar of French verbs. Lingvisticæ Investigationes, 25(1):29–48, January.

Alessandro Lenci, Gabriella Lapesa, and Giulia Bonansinga. LexIt: A Computational Resource on Italian Argument Structure. In LREC, pp. 3712–3718.

Alessandro Oltramari, Guido Vetere, Maurizio Lenzerini, Aldo Gangemi, and Nicola Guarino. 2010. Senso Comune. In LREC, pp. 3873–3877.

Colin Phillips. 2009. Should we impeach armchair linguists. Japanese/Korean Linguistics, 17:49–64.

Colin Phillips. 2013. Some arguments and nonarguments for reductionist accounts of syntactic phenomena. Language and Cognitive Processes, 28(1–2):156–187.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. Developing an aligned multilingual database. In Proceedings of Global WordNet Conference.

Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open Language Learning for Information Extraction. In EMNLP-CoNLL ’12, pp. 523–534, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jon Sprouse, Beracah Yankama, Sagar Indurkhya, Sandiway Fong, and Robert C Berwick. 2018. Colorless green ideas do sleep furiously: gradient acceptability and the nature of the grammar. The Linguistic Review, 35(3):575–599.

Gabriel Stanovsky and Ido Dagan. 2015. Open IE as an Intermediate Structure for Semantic Tasks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2300–2305.

Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2300–2305.

Elsa Tolone. 2012. Analyse syntaxique à l’aide des tables du Lexique-Grammaire du français. Lingvisticæ Investigationes, 35(1):147–151.

Diem Truong, Duc-Then Vo, Uyen Trang Nguyen. 2017. Vietnamese Open Information Extraction. In Proceedings of the Eighth International Symposium on Information and Communication Technology, pp. 135–142.

Mingyin Wang, Lei Li, and Fang Huang. 2014. Semi-supervised chinese open entity relation extraction. In 2014 IEEE 3rd International Conference on Cloud Computing and Intelligence Systems, pp. 415–420.

Fei Wu and Daniel S Weld. 2010. Open Information Extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 118–127.

Alisa Zhila and Alexander Gelbukh. 2013. Comparison of open information extraction for English and Spanish. Computational Linguistics and Intelligent Technologies, 12(19):714–722.

Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. 2009. StatSnowball: a statistical approach to extracting entity relationships. In Proceedings of the 18th international conference on World wide web, pp. 101–110. ACM.
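
As a supplementary illustration of the pattern notation of Section 3.2 (square brackets for deletable arguments, round brackets with vertical bars for alternative prepositions, ⊕ for mutually exclusive patterns), the notation can be expanded mechanically into concrete structures. The Python sketch below only illustrates that idea and is not the implementation of the system described in this paper: the function names expand and expand_xor are hypothetical, patterns are assumed to be well-formed, and ⊕ is assumed to occur only at the top level, as in Table 2.

```python
XOR = "⊕"  # separator between mutually exclusive patterns (assumed top-level only)


def _parse_seq(s: str, i: int, stop: str) -> tuple[list[str], int]:
    """Expand s[i:] until a character in `stop` (or the end of s) is reached.

    Returns every concrete structure the sub-pattern can yield, plus the index
    at which parsing stopped. Assumes balanced brackets (hypothetical helper).
    """
    variants = [""]
    while i < len(s) and s[i] not in stop:
        ch = s[i]
        if ch == "[":
            # Optional group: either absent ("") or any expansion of its content.
            inner, i = _parse_seq(s, i + 1, "]")
            i += 1  # skip the closing ']'
            options = [""] + inner
        elif ch == "(":
            # Choice group: exactly one of the '|'-separated alternatives.
            options = []
            i += 1
            while True:
                alternative, i = _parse_seq(s, i, "|)")
                options.extend(alternative)
                closing = s[i]
                i += 1  # skip '|' or ')'
                if closing == ")":
                    break
        else:
            # Any other character is a literal part of the structure.
            options = [ch]
            i += 1
        variants = [v + o for v in variants for o in options]
    return variants, i


def expand(pattern: str) -> list[str]:
    """Expand a single pattern (without ⊕) into all its concrete structures."""
    variants, _ = _parse_seq(pattern.replace(" ", ""), 0, "")
    return variants


def expand_xor(pattern: str) -> list[list[str]]:
    """Expand each ⊕ branch separately, keeping exclusive usages apart."""
    return [expand(branch) for branch in pattern.split(XOR)]


if __name__ == "__main__":
    for verb, pattern in [
        ("mangiare", "N0V[N1]"),
        ("muovere", "N0VN1 ⊕ N0V(in|da|verso)N1"),
        ("girare", "N0V(a|intorno)N1 ⊕ N0VN1[(a|da|verso)N2]"),
    ]:
        print(verb, expand_xor(pattern))
```

Under these assumptions, expanding the pattern of “andare” given in Section 4, N0V[daN1](a|in|verso|su|sopra)N2, yields ten structures (two choices for the optional daN1 times five prepositions), among them N0VaN2 and N0VdaN1aN2. Each ⊕ branch is returned as a separate list because, according to Section 3.2, the branches correspond to different verb usages and should not be mixed.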